New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bruvo's distance missing estimation: use ordered alleles + deprecation #141
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
I will most likely rename it, but so far, it seems to work. It will calculate n!/a!b!c! where n is the total number of items in a vector and a, b, and c are the counts of items per class. An example: MISSISSIPPI: 11! / (1!2!4!4!)
- change name from expand_binomial to multinomial_coeff (cuz that's what it is) - add shortcuts to multinomial_coeff for easy situations (e.g. 1 or 2 missing alleles) - remove factorial calculation from multinomial_coeff and place within bruvo_dist - add multinomial_coeff documentation - move allocation of distance array memory to existing loop
Note: tests with recursion will not pass due to this implementation.
By using indices and not fragment sizes, we allow for duplicated fragments to be properly counted.
These are based on the values from polysat
This global option allows me to give the users a switch without having to switch it on for every function that uses Bruvo's distance.
…o bruvo-binomial
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Motivation
Problem
Previously brought up in #139, Bruvo's distance from poppr version 1.1 to the current version (06f36f5) has featured the ability to use the methods of estimating distances of polyploids between partial heterozygotes by averaging over all possible combinations of alleles under the genome addition or genome loss models.
The assumption, however was that this was considering all unordered combinations of alleles, which resulted higher weighting of homozygous genotypes in this calculation.
For example, comparing the genotypes 51/52 and 52/52/53/55 results in a distance of 0.34375 under the combined addition/loss model with the ordered combinations of alleles and 0.35495 with unordered combinations.
Who it affects
This affects those with ALL of the following data characteristics:
Solution
Since the calculation for Bruvo's distance is calculated in a manner that is agnostic of order, I was able to fix this bug by multiplying the resulting distance and the counter by the number of unique permutations of the alleles used to fill the shorter genotype.
Calculation of multinomial coefficient (no. unique permutations):
https://github.com/zkamvar/poppr/blob/e724d753dec80355bdf3af0ea78e4e4c37682c3d/src/poppr_distance.c#L795-L842
Implementation in genome addition model:
https://github.com/zkamvar/poppr/blob/e724d753dec80355bdf3af0ea78e4e4c37682c3d/src/poppr_distance.c#L941-L944
This addition is:
User-facing changes
Repeat length warning -> error
Because Bruvo's distance requires knowledge of the repeat lengths, I have soft-deprecated using the default repeat length setting by issuing the following warning to users:
Option to switch between Bruvo model implementation
I have set an option for the user to switch between the unordered and ordered versions of the genome addition and genome loss models. The user can use
options(old.bruvo.model = TRUE)
to revert to the previous model, but it is set toFALSE
by default, using the correct implementation.