Improve polytomy resolution #109
Comments
My first thought was to put the geographic information in a second data partition. IQTree allows this: http://www.iqtree.org/doc/Complex-Models#partition-models . There aren't currently any completely custom models, but you can supply a custom amino acid transition matrix, so as a proof of concept we could use 20 states. If it gives reasonable results, I very much suspect that the IQTree folks would accept a PR to implement more general custom models, given the importance of your application. @afmagee points out, correctly, that the discrete-trait migration model we would be using here differs inherently from the one BEAST uses. Namely, BEAST uses a clock proportional to calendar time on the time tree, whereas here migration would happen in units of mutation time. My response is that you're re-doing all the branch lengths in TreeTime anyway, and this would just be a way to get a branch in cases where we know things should cluster together. We had a short meeting with @trvrb about this. He thought it was not crazy, but I'd like to hear your thoughts.
As @matsen pointed out to me, one possible way to resolve polytomies is to simply run neighbor joining. I came up with a back of the envelope calculation that might allow one to combine neighbor joining distances between two different data sources, such as DNA and geography, and then tried it out on some simulated datasets. The attached PDF has some more details on the experiments. I think the key takeaways are:
One could also consider using bootstrapping to generate a set of possible resolutions to the polytomy and considering those as well.
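To make the neighbor-joining idea concrete, here is a minimal sketch. It is not the calculation from the attached PDF: the weighted sum in `combine_distances` and the `weight` parameter are assumptions made purely for illustration, and the 0/1 geographic distance is a toy choice. The NJ routine is the textbook Q-criterion version and returns only a nested-tuple topology, without branch lengths.

```python
from itertools import combinations

def combine_distances(d_dna, d_geo, weight):
    """Hypothetical combination rule: genetic distance plus weighted geographic distance."""
    n = len(d_dna)
    return [[d_dna[i][j] + weight * d_geo[i][j] for j in range(n)] for i in range(n)]

def neighbor_joining(labels, dist):
    """Textbook neighbor joining; returns an unrooted topology as nested tuples."""
    nodes = {i: lab for i, lab in enumerate(labels)}
    d = {i: {j: dist[i][j] for j in nodes if j != i} for i in nodes}
    next_id = len(labels)
    while len(nodes) > 2:
        n = len(nodes)
        row_sum = {i: sum(d[i].values()) for i in nodes}
        # Pick the pair minimising the Q criterion.
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - row_sum[p[0]] - row_sum[p[1]],
        )
        # Distances from the new internal node to every remaining node.
        new_row = {k: 0.5 * (d[i][k] + d[j][k] - d[i][j]) for k in nodes if k not in (i, j)}
        subtree = (nodes[i], nodes[j])
        for x in (i, j):
            del nodes[x]
            del d[x]
        for k in d:
            del d[k][i]
            del d[k][j]
            d[k][next_id] = new_row[k]
        d[next_id] = dict(new_row)
        nodes[next_id] = subtree
        next_id += 1
    return tuple(nodes.values())

# Toy polytomy: four identical sequences, two sampled in location X and two in Y.
labels = ["A", "B", "C", "D"]
locations = ["X", "X", "Y", "Y"]
d_dna = [[0.0] * 4 for _ in range(4)]
d_geo = [[0.0 if a == b else 1.0 for b in locations] for a in locations]

tree = neighbor_joining(labels, combine_distances(d_dna, d_geo, weight=1.0))
print(tree)
```

With identical sequences the genetic distances carry no signal, so the geographic term alone decides the grouping. How to choose `weight` sensibly is exactly the open question in this thread.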
The main puzzle I always faced here is how to combine temporal and geographical constraints. With genetic constraints, we assume a strict hierarchy (genetics first, temporal constraints affect topology only in absence of mutations). But geography and time don't have a clear hierarchy as far as I can tell. So somehow I think we need an objective function that listens to time and geo constraints...
@matsen and I were thinking about geography as data in the same way that the genetic sequences are data. In a full BEAST analysis, with nucleotide data and a geographic character, and not using a structured coalescent model, the geographic character is treated as another site with its own substitution model and clock rate. The clock rate helps determine its weight in the likelihood against the nucleotide sites. If geography is incorporated in the initial tree inference, before running TreeTime, then it might reduce the number and size of the polytomies that exist. Otherwise, it would have to be incorporated later, probably at the polytomy resolution stage. If geography evolves sufficiently faster than the nucleotides, then given both a single mutation and a single change in geography, the mutation should be weighted more highly by the likelihood when resolving a relationship. In this case, using geography as data only at the polytomy-breaking stage, to favor some relationships over others, might be close to equivalent to having used it to infer the tree as well as during a TreeTime analysis. I think in general the rate of geographic change (the mugration rate) should be higher, so this might be viable as an approximation.
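A quick numerical illustration of the weighting argument, under assumptions that are mine and not from any analysis in this thread: both characters follow a symmetric k-state model (Jukes-Cantor generalised to k states), and the rates, number of locations, and branch length are made-up values. The log ratio of "no change" to "one specific change" along a branch measures how strongly the likelihood resists placing a change there.

```python
import math

def mk_probs(k, rate, t):
    """Transition probabilities of a symmetric k-state model.

    `rate` is the expected number of changes per unit time, so k=4
    reproduces the Jukes-Cantor formulas.
    """
    decay = math.exp(-k * rate * t / (k - 1))
    p_same = 1.0 / k + (k - 1) / k * decay
    p_diff = (1.0 - decay) / k  # probability of ending in one specific other state
    return p_same, p_diff

def change_penalty(k, rate, t):
    """Log-likelihood cost of one specific change relative to no change."""
    p_same, p_diff = mk_probs(k, rate, t)
    return math.log(p_same / p_diff)

branch_length = 0.001  # hypothetical, in units where the nucleotide rate is 1
nuc_penalty = change_penalty(k=4, rate=1.0, t=branch_length)
geo_penalty = change_penalty(k=5, rate=20.0, t=branch_length)  # 5 locations, 20x faster (assumed)

print(f"nucleotide change penalty: {nuc_penalty:.2f}")
print(f"geographic change penalty: {geo_penalty:.2f}")
```

The faster character pays a smaller penalty per change, so a mutation outweighs a move when the two conflict. That is the sense in which geography would act mainly as a tie-breaker inside polytomies.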
We have
Thanks, @afmagee ! I might also suggest two things:
Thanks! I am just trying to think under what conditions we might actually want to group nodes with >0 mutations together with some that have none. But the case of a subset of identical nodes is definitely a super useful one to explore.
The problem
TreeTime currently has a very rudimentary way of resolving polytomies (multi-furcations in the tree).
When this was initially put in place it was never meant to resolve large polytomies but mostly meant to split 3- or 4-fold nodes.
Hence the entire process is very ad-hoc (and slow).
In essence, for all pairs of nodes in a polytomy we compute the LH gain when "pulling" this pair out of the polytomy:
https://github.com/neherlab/treetime/blob/master/treetime/treetime.py#L556
This currently uses only temporal ordering and is horribly inefficient (n^2 in the polytomy size, which didn't bother us when n=5).
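For readers who have not looked at the linked code, the shape of the greedy pairwise procedure is roughly the following. This is a schematic sketch, not TreeTime's implementation: `gain` and `merge` are hypothetical callbacks, and the toy gain based on date differences only stands in for the real likelihood gain.

```python
from itertools import combinations

def resolve_polytomy(children, gain, merge):
    """Greedily pull the best-scoring pair out of a polytomy until nothing helps.

    Every pass scores all pairs, which is the n^2 cost mentioned above.
    """
    children = list(children)
    while len(children) > 2:
        i, j = max(
            combinations(range(len(children)), 2),
            key=lambda p: gain(children[p[0]], children[p[1]]),
        )
        a, b = children[i], children[j]
        if gain(a, b) <= 0:
            break  # no pair improves the score; leave the rest unresolved
        rest = [c for k, c in enumerate(children) if k not in (i, j)]
        children = rest + [merge(a, b)]
    return children

# Toy stand-ins (assumptions): a node is (name, date); close dates give a positive gain.
def toy_gain(a, b):
    return 1.0 - abs(a[1] - b[1])

def toy_merge(a, b):
    return ((a[0], b[0]), min(a[1], b[1]))

tips = [("A", 2020.0), ("B", 2020.1), ("C", 2021.0), ("D", 2021.05)]
resolved = resolve_polytomy(tips, toy_gain, toy_merge)
print(resolved)
```

Any additional source of information (geography, known linkages) would enter through `gain`, which is why the choice of objective function matters more than the loop itself.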
Desired improvements
We'd like to be able to use other types of information (geography, known linkages, etc.) to inform polytomy resolution.
Ideally that would also scale as n^1.5 (fast NJ).
Possible courses of action
_poly
function). => extract the polytomy and reinsert a (partially) resolved tree