allow for infinite allele violations in simulation/validation #17
Comments
Great that you're thinking about this. The situations of infinite sites versus repeated alleles seem sufficiently incomparable that handling them as if they were comparable seems like it might create problems. In any case I'd want to know how frequently it happened. So how frequently does it happen? Could we just say that the assumption was violated in x% of the simulations?
Indeed, repeated alleles in the simulation happen quite often. It depends very much on the simulation parameters, but I have seen several parameter regimes where 90-99% of all simulations were terminated because of repeated alleles.
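To report the "x% of the simulations" figure directly, something like the following could be run over a batch of simulations. This is only a sketch: `has_repeated_alleles` and `violation_rate` are hypothetical helper names, and it assumes each simulation is available as a list of the genotype sequences found at the nodes of its collapsed tree.

```python
from typing import Iterable, Sequence


def has_repeated_alleles(node_sequences: Sequence[str]) -> bool:
    """True if any genotype occurs at more than one node of the collapsed tree."""
    return len(set(node_sequences)) < len(node_sequences)


def violation_rate(simulations: Iterable[Sequence[str]]) -> float:
    """Fraction of simulations whose collapsed tree contains a repeated allele."""
    flags = [has_repeated_alleles(nodes) for nodes in simulations]
    if not flags:
        raise ValueError("no simulations given")
    return sum(flags) / len(flags)


# Hypothetical batch: the second and third trees each contain a repeated genotype.
batch = [
    ["AAAA", "AAAT", "AATT"],
    ["AAAA", "AAAT", "AAAT"],
    ["AAAA", "AAAA", "TTTT"],
    ["AAAA", "CAAA", "CCAA"],
]
rate = violation_rate(batch)
print(f"assumption violated in {rate:.0%} of simulations")
```

Counting violations instead of retrying would also show how the rate changes across parameter regimes.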
The conclusion from today's talk was that we are going to deal with this by making a list of all possible trees with this convergently evolved leaf, then running the validation metrics on these and taking the worst tree as the result. The problem is that the number of trees increases exponentially with the number of repeated alleles. This could be resolved by iterating through each unresolved repeat, finding the worst way to resolve it (by a given validation metric), and then moving on to the next repeat.
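A minimal sketch of the greedy, one-repeat-at-a-time resolution, not the project's actual code. The names `greedy_worst_case`, `repeats` and `score` are hypothetical. It assumes each repeated allele maps to a list of candidate placements, and that `score` takes a partial resolution and returns a number where higher means worse.

```python
from typing import Callable, Dict, Hashable, Sequence

Resolution = Dict[Hashable, Hashable]


def greedy_worst_case(
    repeats: Dict[Hashable, Sequence[Hashable]],
    score: Callable[[Resolution], float],
) -> Resolution:
    """Resolve repeated alleles one at a time, keeping the worst-scoring choice.

    `repeats` maps each repeated allele to its candidate placements.
    `score` evaluates a (possibly partial) resolution; higher means worse.
    """
    resolution: Resolution = {}
    for allele, candidates in repeats.items():
        # Try every candidate for this allele with earlier choices held fixed.
        worst = max(candidates, key=lambda c: score({**resolution, allele: c}))
        resolution[allele] = worst
    return resolution


# Toy example: the "metric" is just the sum of the chosen candidate indices.
calls = []


def toy_score(resolution: Resolution) -> float:
    calls.append(dict(resolution))
    return float(sum(resolution.values()))


toy_repeats = {"allele_1": [0, 1, 2], "allele_2": [0, 1]}
worst_resolution = greedy_worst_case(toy_repeats, toy_score)
print(worst_resolution, len(calls))
```

The metric is evaluated once per candidate per repeat, so the cost grows with the sum of the candidate counts, not their product. The trade-off is that the greedy choice is not guaranteed to find the globally worst tree when the resolutions of different repeats interact.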
We can now simulate repeated genotypes, and MRCA and COAR validation metrics work (although RF is wonky).
We need validations that are robust to simulation results that violate the infinite alleles assumption. Currently the simulation code is set up to retry if the collapsed tree has repeated alleles, but it would be better if our validation could handle this somehow. Maybe for MRCA we score that allele using the copy/lineage that results in the largest Hamming distance compared to the ancestor in the inferred tree, or maybe take the mean of all of them?
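The two options for MRCA could be sketched as below. This is an illustration under assumptions, not the existing validation code. `repeated_allele_score` is a hypothetical helper. It assumes each copy/lineage of the repeated allele in the true tree contributes one ancestral sequence, and that all sequences have the same length as the inferred ancestor.

```python
from statistics import mean
from typing import Sequence


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def repeated_allele_score(
    true_ancestors: Sequence[str],
    inferred_ancestor: str,
    how: str = "max",
) -> float:
    """Score a repeated allele against the inferred tree's ancestor.

    `true_ancestors` holds one ancestral sequence per copy/lineage of the
    repeated allele in the true tree (an assumption of this sketch).
    how="max" takes the worst copy; how="mean" averages over all copies.
    """
    distances = [hamming(t, inferred_ancestor) for t in true_ancestors]
    if how == "max":
        return float(max(distances))
    if how == "mean":
        return float(mean(distances))
    raise ValueError(f"unknown aggregation: {how!r}")


# Three copies of the same allele with different true ancestors.
copies = ["AAAA", "AATT", "ATTT"]
worst = repeated_allele_score(copies, "AAAA", how="max")
average = repeated_allele_score(copies, "AAAA", how="mean")
print(worst, average)
```

Taking the maximum gives a conservative, worst-case score. The mean is less pessimistic but can hide a single badly placed copy.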