[for later, just curious] Question about Branch Length Optimization #43
Comments
On 10.02.21 14:09, Sarah Lutteropp wrote:
> Currently, branch length optimization is where most of the NetRAX
> runtime is spent.
> I was wondering:
> If we change a single branch length, do we need to care about
> loglikelihood changes in the rest of the tree, or is it enough to
> re-evaluate loglikelihood at the parent node of the branch?
In a tree the answer is: no, as the CLVs to the left and right of this
branch will not be affected by the changed branch length. So we can just
update the likelihood right there because this is where the virtual root
of the tree is located anyway.
However, once we move to a different branch we need to make sure that
the CLVs are updated consistently to reflect the branch length that was
updated.
> And if the answer to the question above is "we can ignore everything
> above an edge's parent",
for a tree it is: we can ignore everything to the right and the left of
this branch, but once we move to another branch we need to be careful.
> can we generalize this to phylogenetic
> networks?
That's the part of the question you need to answer, I guess for most
branches it's the same as for trees.
> … Because currently, after changing a branch length we recompute
> the entire network loglikelihood...
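For the tree case described above, the evaluation at a single branch can be sketched in C++ as follows. This is only a minimal single-site illustration, assuming a Jukes-Cantor model and two already-computed CLVs on either side of the branch; `Clv`, `jcProb` and `edgeLogLikelihood` are hypothetical names and not NetRAX or pll-modules code. Only the transition probabilities depend on the branch length, so both CLVs can be reused while the length is varied.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// One conditional likelihood vector for a single site with 4 states (A, C, G, T).
using Clv = std::array<double, 4>;

// Jukes-Cantor transition probability for branch length t (assumed model).
double jcProb(bool sameState, double t) {
    const double e = std::exp(-4.0 * t / 3.0);
    return sameState ? 0.25 + 0.75 * e : 0.25 - 0.25 * e;
}

// Log-likelihood evaluated at the branch connecting the two CLVs.
// The CLVs are read-only here: changing t never modifies them.
double edgeLogLikelihood(const Clv& left, const Clv& right, double t) {
    assert(t >= 0.0);
    double site = 0.0;
    for (int i = 0; i < 4; ++i) {
        double inner = 0.0;
        for (int j = 0; j < 4; ++j) {
            inner += jcProb(i == j, t) * right[j];
        }
        site += 0.25 * left[i] * inner;  // 0.25 = uniform state frequency
    }
    return std::log(site);
}
```

Calling `edgeLogLikelihood(left, right, t)` for several values of `t` never touches `left` or `right`, which is why the update can stay local as long as the virtual root stays at this branch.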
--
Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
www.exelixis-lab.org
|
It's tricky for networks, because with the likelihood definitions we now use, we no longer have the concept of a network CLV. However, my feeling is that as soon as we re-introduce per-displayed-tree CLV vectors, we will be able to do some tricks here. To be further explored later... |
If one wants to be superduper correct, one still needs to optimize the likelihood of the entire network; the likelihood of the subnetwork (rooted at the source node of the changed branch) likely isn't enough (unless formally proven otherwise). But... who cares about being superduper correct? What if we just do it and then look at how this performs? We can check later on whether this was only an approximation or the actual optimal choice. |
My gut feeling is that the following approach should work and give either the optimal (would need formal proof, not intuitively clear) or a good-enough result:
|
I tried using subnetwork loglikelihood as the branch-length optimization criterion. It was simple to implement, but did not perform well. Turns out the total network loglikelihood only gets worse that way. |
Oh. Now I get where I had a thinking bug in there. For trees, too, one doesn't use subtree loglikelihood as the brlen-optimization criterion. Instead, one re-roots the tree at the branch in question. Mhm... maybe a similar trick can be done for networks now that we have per-node displayed tree CLVs? (not really re-rooting the network, but virtually re-rooting its displayed trees) |
What I need: Given a displayed tree at the network root, and given a branch (with its pmatrix index) that we want to optimize:
If we have this, the rest is straight-forward:
---> and voila, much much faster network loglikelihood evaluation when optimizing branch lengths 😎 |
The speedup potential of this idea is huge: For most branches, the number of CLV pairs to update will be much lower than the total number of displayed trees in the network, making this even more efficient... (because of the efficient non-redundant CLV storage, meaning multiple displayed trees share the same CLVs as long as they don't diverge) |
I need to be super careful though, precisely because different displayed trees can share the same CLV vectors. I currently believe there is no need to store the two old CLVs of interest in a temporary variable and then redo the same work for the next displayed tree that has the same two CLVs of interest, but in case things go wrong, I need to double-check this assumption. |
Actually, it can be that two displayed trees share only one CLV vector, but not the other one. I need to draw some picture to check if this affects something/ needs special care regarding brlen opt speedup. Likely it doesn't, but better be on the safe side. |
Sarah, also keep in mind that once you have changed a branch length and
then move to a different virtual root (different branch), the CLVs along
the path from the old to the new virtual root need to be updated to
reflect the new branch length value. This is probably one of the
trickiest parts in ML implementations, as you must be 100% sure which
CLV entries are still valid and which are not.
|
Thanks for the reminder! For networks, this part should be simple. The network root never changes. During brlen optimization, we only evaluate the displayed trees at some other node tempRoot in the network (but take the displayed tree topologies and probs as if they were still rooted at the network root). After optimizing a branch and using tempRoot for it, we call the invalidateHigherCLVs(tempRoot) function, which invalidates all CLVs on the path from tempRoot to the network root. |
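The invalidation step described above could look roughly like the following sketch. It assumes a toy network in which every node stores its parent indices (two for reticulation nodes) and a validity flag per CLV; `ToyNetwork` and `invalidateUpwards` are hypothetical names, and the actual `invalidateHigherCLVs` in NetRAX may differ, for example in whether `tempRoot` itself is invalidated (it is here, as an assumption).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy network: parents[v] holds 0 (root), 1 (tree node) or 2 (reticulation) entries.
struct ToyNetwork {
    std::vector<std::vector<std::size_t>> parents;
    std::vector<bool> clvValid;
};

// Marks the CLVs of tempRoot and of every node reachable by repeatedly
// moving to a parent as invalid. Nodes below or beside tempRoot stay valid.
void invalidateUpwards(ToyNetwork& net, std::size_t tempRoot) {
    assert(tempRoot < net.parents.size());
    std::vector<bool> seen(net.parents.size(), false);
    std::vector<std::size_t> stack{tempRoot};
    seen[tempRoot] = true;
    while (!stack.empty()) {
        const std::size_t v = stack.back();
        stack.pop_back();
        net.clvValid[v] = false;
        for (std::size_t p : net.parents[v]) {
            if (!seen[p]) {  // a reticulation can make two paths meet again
                seen[p] = true;
                stack.push_back(p);
            }
        }
    }
}
```

The `seen` vector matters only because of reticulations: without it, nodes above a reticulation would be visited once per path.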
okay, that sounds consistent, getting this right was one of the most
nasty parts in raxml
|
I've checked the pll-modules source code for logl computation during brlen optimization.
What I need to implement:
|
I need to double-check how one computes the changed displayed tree loglikelihood after a branch length has been changed during the brlen opt. Am I using the right CLVs? Is there some other problem we didn't think of? |
Maybe I need pll_compute_root_loglikelihood instead of pll_compute_edge_loglikelihood? |
No that can't be, pll_compute_edge_loglikelihood sounds about right... maybe drawing a picture helps |
The affected CLVs are exactly the ones on the path from the temporary virtual root to the network root. Thus, one not only needs to invalidate them after the branch has been optimized, but one also needs to recompute them before optimizing that branch starts (as this is what is behind the "change of virtual root" in pll-modules) |
Thus, I only need to implement a function that recomputes the affected displayed tree CLVs (i.e., the CLVs on the path from tempRoot to networkRoot) with regard to the temporary root as a preprocessing step before optimizing a branch. And then that thing should work :-) Now I need to be super careful though about whether CLVs shared by multiple displayed trees are somehow affected or not. I still expect that to be no problem. But if afterwards there still is a bug, this assumption needs to be checked (and if the CLV-share causes a problem, then some temporary CLV copying must be introduced). |
Now I get what you meant by there... it was too early in the morning when I first read it 😅 |
Turns out updating the CLVs is where the coding work is for the networks case, too. 😅 (the rest needed for faster brlenopt-network-logl-computation was straight-forward) |
That is what I was afraid of :-) great progress though
|
One thing I need to take special care of:
... But does this cause a problem? This needs a careful case distinction. One thing is clear: If a branch is in a dead part for a given displayed tree, then the loglikelihood of that displayed tree stays the same no matter what length that branch has. And if a branch is in an active part for a given displayed tree, then we can safely virtually re-root the displayed tree to the branch source node. ----> Conclusion: Skip recomputation of displayed tree loglikelihood for displayed trees where the branch is in a dead part (and simply use the old logl of that displayed tree), go on like previously planned if the branch is in an active part. |
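The conclusion above (reuse the old logl where the branch is dead, recompute only where it is active) can be sketched like this. The sketch assumes that the network likelihood is the probability-weighted sum of the displayed tree likelihoods and works on whole-alignment values rather than per-site ones; both are simplifications not spelled out in this thread. `ToyDisplayedTree`, `refreshTreeLogls` and `networkLogl` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

struct ToyDisplayedTree {
    double prob;               // displayed tree probability (assumed > 0)
    double logl;               // cached displayed tree loglikelihood
    std::vector<bool> active;  // active[b]: is branch b in an active part of this tree?
};

// Recomputes the logl only for trees in which the branch is active and
// returns how many trees were recomputed. Dead-branch trees keep their old logl.
std::size_t refreshTreeLogls(std::vector<ToyDisplayedTree>& trees, std::size_t branch,
                             const std::function<double(std::size_t)>& recompute) {
    std::size_t recomputed = 0;
    for (std::size_t i = 0; i < trees.size(); ++i) {
        assert(branch < trees[i].active.size());
        if (!trees[i].active[branch]) continue;
        trees[i].logl = recompute(i);
        ++recomputed;
    }
    return recomputed;
}

// log( sum_i prob_i * exp(logl_i) ), computed with the log-sum-exp trick.
double networkLogl(const std::vector<ToyDisplayedTree>& trees) {
    assert(!trees.empty());
    double maxTerm = -std::numeric_limits<double>::infinity();
    for (const ToyDisplayedTree& t : trees) {
        maxTerm = std::max(maxTerm, std::log(t.prob) + t.logl);
    }
    double sum = 0.0;
    for (const ToyDisplayedTree& t : trees) {
        sum += std::exp(std::log(t.prob) + t.logl - maxTerm);
    }
    return maxTerm + std::log(sum);
}
```

In this sketch, `recompute` stands in for the virtual re-rooting and edge loglikelihood evaluation of one displayed tree.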
I thus need to implement functions like these:
* bool isActiveBranch(AnnotatedNetwork& ann_network, const DisplayedTreeData& displayedTree, unsigned int pmatrix_index)
* Node* getNodeParent(AnnotatedNetwork& ann_network, Node* node, Node* virtualRoot)
* std::vector<Node*> getNodeChildren(AnnotatedNetwork& ann_network, Node* node, Node* virtualRoot)
* void updateClvsVirtualRerootTrees(AnnotatedNetwork& ann_network, Node* old_virtual_root, Node* new_virtual_root)
I can simplify the parent/children stuff by precomputing all node parents once when changing the virtual root. |
this all sounds reasonable
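The idea of precomputing all node parents once per virtual root could be sketched as a breadth-first search. The sketch assumes a single displayed tree given as an undirected adjacency list with integer node ids; `parentsFromVirtualRoot` is a hypothetical helper and not one of the NetRAX functions listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Returns parent[v] for every node v when the tree is viewed as rooted at
// virtualRoot. The virtual root itself (and any unreachable node) gets -1.
std::vector<int> parentsFromVirtualRoot(const std::vector<std::vector<int>>& adj,
                                        int virtualRoot) {
    assert(virtualRoot >= 0 && static_cast<std::size_t>(virtualRoot) < adj.size());
    std::vector<int> parent(adj.size(), -1);
    std::vector<bool> seen(adj.size(), false);
    std::queue<int> todo;
    todo.push(virtualRoot);
    seen[virtualRoot] = true;
    while (!todo.empty()) {
        const int v = todo.front();
        todo.pop();
        for (int w : adj[v]) {
            if (!seen[w]) {
                seen[w] = true;
                parent[w] = v;  // w was first reached from v, so v is its parent
                todo.push(w);
            }
        }
    }
    return parent;
}
```

With such a table, a parent lookup is a single array access and the children of a node are its neighbours other than its parent.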
|
To be done with |
Likely the last redesign step I need, then it makes all sense:
These two things are rather simple and quick to implement. The real work lies in here:
|
I need to write 3 entirely new node processing functions for this, as I cannot use the old ones. In the functions I had so far (for normal incremental network loglh computation), I processed the network nodes in a bottom-up order (reversed topological sort, to be particularly specific). Going up the network like this, I inferred for each DisplayedTreeData/CLV which reticulations need to be set in which way based on pair-wise combining its children DisplayedTreeData's/CLVs. Now for the virtual re-root case, at the moment I see no other way than iterating over the displayed trees, following a more top-down approach. This is because the edge directions now depend on the displayed tree. The tricky part is how to combine this with reusing/sharing CLV vectors among multiple displayed trees! Still drawing and thinking, figuring out how to do this best. It may be a bit tricky, but I am 100% sure it is doable and I'll figure it out by some more thinking. |
I figured it out despite having a mean headache: "Which reticulation choices make this node the parent of the current clv?" -> this gives additional reticulation restrictions to store at the DisplayedTreeData/CLV. Together with the overall reticulation choices for the displayed trees in the network being known beforehand during brlenopt, this can easily be put together. It allows me to adapt and then reuse the processNode functions. |
Also, the different edge directions change something in the order of processing the CLVs. |
Okay, now I got it: We don't need to redo update-CLVs by going over each displayed tree. I have figured out when the edge directions change: We need to do separate update-CLVs calls on each of the paths from old virtual root to new virtual root in the network (following the network edges in reverse direction, i.e. always going to parents). |
And the overlapping part (which only needs to be processed once) of all these paths, the part that is equal for all these paths, is exactly the path from the old virtual root until the first encountered reticulation node on a path to the new virtual root. |
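The path structure described above can be made concrete with a small sketch. It assumes a toy network given as parent lists, with the target node being an ancestor of the start node; `collectUpwardPaths` and `sharedPrefixLength` are hypothetical helpers, not NetRAX functions. The sketch enumerates every path obtained by always going to a parent and then measures the part that all paths share.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Path = std::vector<int>;

// Collects all paths from node up to target, always moving to a parent.
// parents[v] has two entries for a reticulation node, so paths fork there.
void collectUpwardPaths(const std::vector<std::vector<int>>& parents, int node,
                        int target, Path& current, std::vector<Path>& out) {
    assert(node >= 0 && static_cast<std::size_t>(node) < parents.size());
    current.push_back(node);
    if (node == target) {
        out.push_back(current);
    } else {
        for (int p : parents[node]) {
            collectUpwardPaths(parents, p, target, current, out);
        }
    }
    current.pop_back();
}

// Number of leading nodes that are identical in all paths.
std::size_t sharedPrefixLength(const std::vector<Path>& paths) {
    if (paths.empty()) return 0;
    std::size_t len = paths[0].size();
    for (const Path& p : paths) {
        std::size_t i = 0;
        while (i < len && i < p.size() && p[i] == paths[0][i]) ++i;
        len = i;
    }
    return len;
}
```

For a toy network with parent lists `{{}, {0}, {0}, {1, 2}, {3}, {4}}`, starting at node 5 and targeting node 0, the shared part ends at node 3, the first reticulation encountered. Only that part would need to be processed once.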
Sounds great and consistent :-)
|
A situation that gets tricky regarding which CLVs to use: Here, we have the paths (from old_virtual_root to new_virtual_root) 7->4->8->5 and 7->6->8->5. The children reported below are with regard to new_displayed_root.
There are two possible ways to resolve the situation:
A) Store which nodes were the children used for a given DisplayedTreeData. This is stored in the DisplayedTreeData object. It then makes sense to store multiple DisplayedTreeData's in a node, with the same reticulation choices but different sets of children. One also has to check for a compatible children setting when looking at a DisplayedTreeData from a child at the current node (it is only compatible if the current node does not show up in the list of children). To do the final tree logl evaluation right, keep a flag stating whether a tree was newly added or not, and only evaluate over the newly added trees...
B) Before processing the next path, kick out the old CLVs at all path nodes except for the new_virtual_root node. Detect in advance at which nodes we need to restore the old displayed trees to how they were when we had old_virtual_root, and save and restore them accordingly.
I have decided to proceed with solution B. |
Solution B does not work the way I want it to work. (Simple example: Call it twice with the same pmatrix_index. The first time, recovering the CLVs works perfectly; the second time, because the first call overwrote some things, we are no longer able to correctly recover the old CLVs.) Thus, simpler but less efficient: Solution C: Use solution B, but after optimizing a branch, switch back to the normal network root. This should work fine and with simpler code. ---> Switching to solution C. |
These seem to be all special cases one needs to think of. First prototype version is working 🎉. It still computes more CLVs than needed though, as I implemented
This means I implemented virtual_root_1 -> network_root -> virtual_root_2 -> network_root -> virtual_root_3 -> network_root. Regarding minimizing total number of CLVs to recompute, it would be even better to go with
This would implement virtual_root_1 -> virtual_root_2 -> virtual_root_3 -> network_root. |
I need to double-check why I need this returning-to-network-root after optimizing a branch. I wrote that it is because the first call overwrote some things in the CLVs. Is this still the case? If so, then I need to mix solution A somehow into solution B. I suspect the problem occurs when going from one path to the other. The problem is obviously in this line from updateCLVsVirtualRerootTrees: What this line essentially does is: When processing nodes in a path, only append the new DisplayedTreeData's to a node if we are at new_virtual_root. Otherwise, clear the old data stored there. This of course doesn't work well with the next branch optimization wanting to restore some CLVs to how they were with the branch before it, because here we are interested in all CLVs computed by the paths before. But so far, the function deletes the intermediate CLV results on the previous paths. The question is whether this issue can be mitigated by only keeping track of which children were used, or whether we also need to take into account the children of those children etc... |
Nice! 😎 Results turned out like I planned. Now the update-CLVs calls are already much less of an issue. The main runtime now lies in pll_compute_edge_loglikelihood (76.14%), as can be seen in this callgraph. This makes further tuning of updateCLVsVirtualRerootTrees low priority.
|
Currently, branch length optimization is where most of the NetRAX runtime is spent.
I was wondering:
If we change a single branch length, do we need to care about loglikelihood changes in the rest of the tree, or is it enough to re-evaluate loglikelihood at the parent node of the branch?
And if the answer to the question above is "we can ignore everything above an edge's parent", can we generalize this to phylogenetic networks? Because currently, after changing a branch length we recompute the entire network loglikelihood...