Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bruvo's distance missing estimation: use ordered alleles + deprecation #141

Merged
merged 19 commits into from Aug 20, 2017
Merged

Bruvo's distance missing estimation: use ordered alleles + deprecation #141

merged 19 commits into from Aug 20, 2017

Conversation

zkamvar
Copy link
Member

@zkamvar zkamvar commented Aug 20, 2017

Motivation

Problem

Previously brought up in #139, Bruvo's distance from poppr version 1.1 to the current version (06f36f5) has featured the ability to use the methods of estimating distances of polyploids between partial heterozygotes by averaging over all possible combinations of alleles under the genome addition or genome loss models.

The assumption, however was that this was considering all unordered combinations of alleles, which resulted higher weighting of homozygous genotypes in this calculation.

For example, comparing the genotypes 51/52 and 52/52/53/55 results in a distance of 0.34375 under the combined addition/loss model with the ordered combinations of alleles and 0.35495 with unordered combinations.

Who it affects

This affects those with ALL of the following data characteristics:

  • Polyploid data with
  • more than 2 missing/ambiguous alleles
  • using the genome addition or genome loss model

Solution

Since the calculation for Bruvo's distance is calculated in a manner that is agnostic of order, I was able to fix this bug by multiplying the resulting distance and the counter by the number of unique permutations of the alleles used to fill the shorter genotype.

Calculation of multinomial coefficient (no. unique permutations):

https://github.com/zkamvar/poppr/blob/e724d753dec80355bdf3af0ea78e4e4c37682c3d/src/poppr_distance.c#L795-L842

Implementation in genome addition model:

https://github.com/zkamvar/poppr/blob/e724d753dec80355bdf3af0ea78e4e4c37682c3d/src/poppr_distance.c#L941-L944

This addition is:

  • Tested
  • Documented
  • Added to NEWS

User-facing changes

Repeat length warning -> error

Because Bruvo's distance requires knowledge of the repeat lengths, I have soft-deprecated using the default repeat length setting by issuing the following warning to users:

> pcd <- bruvo.dist(partial_clone)
Warning in bruvo.dist(partial_clone) :

	--------------------------------------------------------------
	                       !!! ALERT !!!

	This warning will become an ERROR in future versions of poppr.
	Please define your repeat lengths to avoid this error.
	--------------------------------------------------------------

Repeat length vector for loci is not equal to the number of loci represented.
Estimating repeat lengths from data:
 c(3, 1, 2, 1, 2, 4, 2, 3, 1, 2)


Option to switch between Bruvo model implementation

I have set an option for the user to switch between the unordered and ordered versions of the genome addition and genome loss models. The user can use options(old.bruvo.model = TRUE) to revert to the previous model, but it is set to FALSE by default, using the correct implementation.

I will most likely rename it, but so far, it seems
to work. It will calculate n!/a!b!c! where n is the
total number of items in a vector and a, b, and c
are the counts of items per class.

An example:
MISSISSIPPI: 11! / (1!2!4!4!)
- change name from expand_binomial to
  multinomial_coeff (cuz that's what it is)
- add shortcuts to multinomial_coeff for easy
  situations (e.g. 1 or 2 missing alleles)
- remove factorial calculation from
  multinomial_coeff and place within bruvo_dist
- add multinomial_coeff documentation
- move allocation of distance array memory to
  existing loop
Note: tests with recursion will not pass due to this
implementation.
By using indices and not fragment sizes, we allow
for duplicated fragments to be properly counted.
These are based on the values from polysat
This global option allows me to give the users a
switch without having to switch it on for every
function that uses Bruvo's distance.
@zkamvar zkamvar self-assigned this Aug 20, 2017
@zkamvar zkamvar merged commit f3472ff into grunwaldlab:master Aug 20, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant