Improve polytomy resolution #109

rneher · 2020-03-25T18:45:30Z

The problem

TreeTime has currently a very rudimentary way of resolving polytomies (multi-furcations in the tree).
When this was initially put in place it was never meant to resolve large polytomies but mostly meant to split 3- or 4-fold nodes.
Hence the entire process is very ad-hoc (and slow).

In essence, for all pairs of nodes in a polytomy we compute the LH gain when "pulling" this pair out of the polytomy:

https://github.com/neherlab/treetime/blob/master/treetime/treetime.py#L556

This currently only uses temporal ordering and is horribly inefficient (n^2 with polytomy size -- didn't bother other when n=5.)

Desired improvements

We'd like to be able to use other types of information (geography, known linkages, etc) to inform polytomie resolution.
Ideally that would also scale as n^1.5 (fast NJ).

Possible courses of actions

refactor this part and isolate the resolution (current _poly function). => extract polytomy and reinfect a (partially) resolved tree
find a solution to flexibly use different kind of resolutions. Challenge is to provide the necessary information and distance measure in a robust and flexible way
One could try to use an established tree-builder that allows flexible distance functions.

The text was updated successfully, but these errors were encountered:

matsen · 2020-03-27T10:44:27Z

My first thought was to throw geographic information in a second data partition. IQTree allows this: http://www.iqtree.org/doc/Complex-Models#partition-models . There aren't currently any completely custom models, but you can supply a custom amino acid transition matrix, and so as a proof of concept we could use 20 states.

If it gives reasonable results, I very much suspect that the IQTree folks would accept a PR to implement more general custom models, given the importance of your application.

@afmagee points out, correctly, that this has an inherent difference to the types of discrete-trait migration models we are using here, versus those that BEAST does. Namely, in BEAST they are using a clock proportional to calendar time on the time tree, whereas here migration would happen in units of mutation-time.

My response to that issue is that you're re-doing all the branch lengths in TreeTime anyway, and this would just be a way to get a branch for cases where we know things should cluster together.

We had a short meeting with @trvrb about this. He thought it was not-crazy, but I'd like to hear your thoughts.
I acknowledge that this doesn't solve your larger problem of polytomy resolution, but we're thinking about that too.

afmagee · 2020-04-02T18:14:10Z

As @matsen pointed out to me, one possible way to resolve polytomies is to simply run neighbor joining. I came up with a back of the envelope calculation that might allow one to combine neighbor joining distances between two different data sources, such as DNA and geography, and then tried it out on some simulated datasets.

The attached PDF has some more details on the experiments. I think the key takeaways are:

Plain NJ might be able to resolve polytomies.
Combining nucleotide and geography data at the level of unrooted trees could work (that is, the suggestion to add geography at the IQTree level could work despite my previous objections).

One could also consider using bootstrapping to generate a set of possible resolutions to the polytomy and considering those as well.

unrooted_trees_with_nucleotides_and_mugration.pdf

rneher · 2020-04-05T22:01:34Z

The main puzzle I always faced here is how to combine temporal and geographical constraints. With genetic constraints, we assume a strict hierarchy (genetics first, temporal constraints affect topology only in absence of mutations). But geography and time don't have a clear hierarchy as far as I can tell. So somehow I think we need an objective function that listens to time and geo constraints...

afmagee · 2020-04-07T20:34:14Z

@matsen and I were thinking about geography as data in the same way that the genetic sequences are data. In a full BEAST analysis, with nucleotide data and a geographic character, and not using a structured coalescent model, the geographic character is treated another site with its own substitution model and clock rate. The clock rate helps determine its weight in the likelihood against the nucleotide sites.

If geography incorporated in the initial tree inference, before running TreeTime, then it might reduce the number/size of polytomies that exist. Otherwise, it would have to be incorporated later, probably at the polytomy resolution stage.

If geography evolves enough faster than the nucleotides, then given both a single mutation and a single change in geography, the mutation should be weighted more highly by the likelihood to resolve a relationship. In this case, using geography as data only the polytomy-breaking stage, to favor some relationships over others, might be close to equivalent to having used it to infer the tree as well as during a TreeTime analysis. I think in general the rate of geographic change (the mugration rate) should be higher, so this might be viable as an approximation.

matsen · 2020-04-17T16:50:08Z

We have

time
number of mutations that happened on branches (these are, practically-speaking, terminal mutations)
sampling location

afmagee · 2020-04-22T00:49:30Z

I think that there may be a relatively fast way to pull out a subtree from a polytomy, and in doing so increase the likelihood. This is a partial solution to the overall problem, but it may be a useful starting point.

I'm imagining that we have a polytomy somewhere in the tree with n children. If we can identify that m of these children are identical to the parent node (which is an O(n) comparison), then what we essentially want to do is minimize the sum of the branch lengths involving these nodes. The likelihood for a single branch with identical mother and daughter sequences is maximized when the branch is length 0, and should decrease monotonically. There are some number of identical sequences with currently non-zero branch lengths between them, so if we shorten those branches we should have a better likelihood.

I think that an agglomerative approach would work to make a new subtree for the m identical sequences. One would need the time t_i of each node. The algorithm proceeds as follows, where node p is the parent of the polytomy, time increases backwards from the present, and l_i is the length of the branch subtending node i:

Initialize a set of active nodes with the m nodes.

while ( size(active) > 1 ) { // this would be if( size(active) > 2 ) if m == n
    choose the pair of nodes i,j with the most similar ages
    define d = t_p - max(t_i,t_j)
    remove i,j as children of p
    add a new node k that is a child of p with branch length l_p = d
    give node k children i,j with branch lengths l_i - d and l_k - d
    remove i,j from the active set
    add k to the active set
}

Each iteration of the algorithm reduces the sum of branch lengths with no mutations (by an amount d), and in doing so should increase the likelihood. When done, it is possible that one would need to account for the possible placements of the n-m non-identical nodes into the subtree. I don't know that there is a guarantee that they would not best be placed somewhere in the new subtree.

Here's a cartoon of the process with m=4 and n=6, where the branches with tick-marks indicate that the child node is not identical to the parent (there are mutations) and the bare branches indicate that the child node is identical (there are no notations).

matsen · 2020-04-22T14:00:55Z

Thanks, @afmagee ! I might also suggest two things:

this could merely supply a ranking of pairs of nodes to try, and how far down that list would be up to TreeTime
such a list could also incorporate geography in a very simple-minded way, just biasing us towards parsimonious geographic transitions in the absence of any signal to the contrary.

rneher · 2020-05-10T22:06:29Z

Thanks! I am just trying to think under what conditions we might actually want to group the nodes with >0 mutations with some that have none. But the case of subset of identical nodes is definitely a super useful one to explore.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve polytomy resolution #109

Improve polytomy resolution #109

rneher commented Mar 25, 2020

matsen commented Mar 27, 2020

afmagee commented Apr 2, 2020

rneher commented Apr 5, 2020

afmagee commented Apr 7, 2020

matsen commented Apr 17, 2020

afmagee commented Apr 22, 2020

matsen commented Apr 22, 2020

rneher commented May 10, 2020

Improve polytomy resolution #109

Improve polytomy resolution #109

Comments

rneher commented Mar 25, 2020

The problem

Desired improvements

Possible courses of actions

matsen commented Mar 27, 2020

afmagee commented Apr 2, 2020

rneher commented Apr 5, 2020

afmagee commented Apr 7, 2020

matsen commented Apr 17, 2020

afmagee commented Apr 22, 2020

matsen commented Apr 22, 2020

rneher commented May 10, 2020