allow for infinite allele violations in simulation/validation #17

Closed
wsdewitt opened this issue Feb 6, 2017 · 4 comments

@wsdewitt
Collaborator

wsdewitt commented Feb 6, 2017

We need validations that are robust to simulation results that violate the infinite alleles assumption. Currently the simulation code is set up to retry if the collapsed tree has repeated alleles, but it would be better if our validation could handle this somehow. For MRCA, maybe we score that allele using the copy/lineage that gives the largest Hamming distance to the ancestor in the inferred tree, or maybe take the mean over all of them?
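
To make the two options concrete, here is a minimal sketch. It assumes that for a repeated allele we already have the true ancestor sequence for each copy/lineage and the single ancestor sequence from the inferred tree; the function names and the `how` argument are hypothetical, not existing code in this repo.

```python
from statistics import mean


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def score_repeated_allele(true_ancestors, inferred_ancestor, how="max"):
    """Score a repeated allele against the inferred ancestor.

    true_ancestors: one true ancestor sequence per copy/lineage (assumed input).
    how: "max" uses the copy with the largest distance, "mean" averages all copies.
    """
    distances = [hamming(t, inferred_ancestor) for t in true_ancestors]
    if how == "max":
        return max(distances)
    if how == "mean":
        return mean(distances)
    raise ValueError("how must be 'max' or 'mean'")
```

With "max" the score is pessimistic (the inference is judged against the least favorable copy), while "mean" treats all copies as equally plausible.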

@matsen
Contributor

matsen commented Feb 6, 2017

Great that you're thinking about this.

The situations of infinite sites versus repeated alleles seem sufficiently incomparable that handling them as if they were comparable seems like it might create problems. In any case I'd want to know how frequently it happened.

So how frequently does it happen? Could we just say that the assumption was violated in x% of the simulations?
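
A rough way to get that x% could look like the sketch below. It assumes each simulation can be summarized as the list of genotype sequences in its collapsed tree; that representation and the function names are assumptions for illustration, not the simulator's actual output format.

```python
def has_repeated_alleles(genotypes):
    """True if any genotype sequence occurs more than once in a collapsed tree."""
    return len(set(genotypes)) < len(genotypes)


def violation_rate(simulations):
    """Fraction of simulations whose collapsed tree contains a repeated allele."""
    if not simulations:
        raise ValueError("need at least one simulation")
    violated = sum(has_repeated_alleles(g) for g in simulations)
    return violated / len(simulations)
```

The rate would need to be reported per parameter setting, since it presumably depends on the simulation parameters.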

@krdav

krdav commented Feb 26, 2017

Indeed, repeated alleles in the simulation happen quite often. It depends very much on the simulation parameters, but I have seen several parameter regimes where 90-99% of all simulations were terminated because of repeated alleles.

@krdav

krdav commented Mar 21, 2017

The conclusion from today's talk was that we are going to deal with this by making a list of all possible trees with this convergently evolved leaf, then computing the validation metrics on each of them and taking the worst tree as the result.

The problem is then that the number of trees increases exponentially with the number of repeated alleles. This could be resolved by iterating through each unresolved repeat, finding the worst way (by a given validation metric) to resolve it, and then moving on to the next repeat.
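
A sketch of the two strategies, under the assumption that each repeat has a list of candidate resolutions and that `metric` takes one choice per repeat and returns a number where larger means worse. Both the `options` representation and the `metric` callable are hypothetical stand-ins for the real tree and validation code.

```python
from itertools import product


def worst_exhaustive(options, metric):
    """Try every combination of resolutions; cost grows as the product of option counts."""
    return max(product(*options), key=metric)


def worst_greedy(options, metric):
    """Resolve one repeat at a time, keeping the worst choice found so far.

    Cost grows as the sum of option counts instead of the product.
    """
    # Start from an arbitrary resolution: the first option of every repeat.
    chosen = [opts[0] for opts in options]
    for i, opts in enumerate(options):
        # Pick the option for repeat i that maximizes the metric,
        # holding the current choices for all other repeats fixed.
        chosen[i] = max(
            opts,
            key=lambda o: metric(tuple(chosen[:i] + [o] + chosen[i + 1:])),
        )
    return tuple(chosen)
```

Note that the greedy pass is only guaranteed to match the exhaustive search when the repeats contribute to the metric independently; if they interact, it may return a tree that is bad but not the worst.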

@wsdewitt
Collaborator Author

We can now simulate repeated genotypes, and MRCA and COAR validation metrics work (although RF is wonky).

3 participants