
[for later, just curious] Question about Branch Length Optimization #43

Open
lutteropp opened this issue Feb 10, 2021 · 48 comments

@lutteropp
Owner

Currently, branch length optimization is where most of the NetRAX runtime is spent.

I was wondering:
If we change a single branch length, do we need to care about loglikelihood changes in the rest of the tree, or is it enough to re-evaluate loglikelihood at the parent node of the branch?

And if the answer to the question above is "we can ignore everything above an edge's parent", can we generalize this to phylogenetic networks? Because currently, after changing a branch length we recompute the entire network loglikelihood...

lutteropp changed the title from "Question about Branch Length Optimization" to "[for later, just curious] Question about Branch Length Optimization" on Feb 10, 2021
@stamatak
Collaborator

stamatak commented Feb 10, 2021 via email

@lutteropp
Owner Author

It's tricky for networks, because with the likelihood definitions we now use, we no longer have the concept of a network CLV. However, my feeling is that as soon as we re-introduce per-displayed-tree CLV vectors, we will be able to do some tricks here. To be explored further later...

@lutteropp
Owner Author

If one wants to be superduper correct, one still needs to optimize the likelihood of the entire network; the likelihood of the subnetwork (rooted at the source node of the changed branch) likely isn't enough. (Unless formally proven otherwise.)

But... who cares about being superduper correct? What if we just do it and then look at how it performs? We can check later on whether this was only an approximation or actually the optimal choice.

@lutteropp
Owner Author

lutteropp commented Feb 24, 2021

My gut feeling is that the following approach should work, giving either the optimal result (would need a formal proof; it is not intuitively clear) or a good-enough one:

  • For each node in the network, store and maintain per-displayed-tree CLVs.
  • When optimizing a branch (u,v), use the subnetwork likelihood of the subnetwork rooted at u (the source node of the branch) as target function for optimization.

@lutteropp
Owner Author

lutteropp commented Feb 28, 2021

I tried using subnetwork loglikelihood as branch-length optimization criterion. It was simple to implement, but did not perform well. Turns out total network loglikelihood only gets worse that way.

@lutteropp
Owner Author

lutteropp commented Mar 8, 2021

Oh. Now I see where I had a thinking bug in there. For trees, too, one doesn't use the subtree loglikelihood as the brlen-optimization criterion. Instead, one re-roots the tree at the branch in question. Mhm... maybe a similar trick works for networks, now that we have per-node displayed tree CLVs? (not really re-rooting the network, but virtually re-rooting its displayed trees)

@lutteropp
Owner Author

What I need: Given a displayed tree at the network root, and given a branch (with its pmatrix index) that we want to optimize:

  • Find the two CLV vectors that belong to the displayed tree, at the two endpoints of the branch
  • Figure out how to compute the displayed tree loglikelihood this way (by checking Alex{is,ey}'s PhD theses and the Bioinformatics lecture slides)... something something pulley principle

If we have this, the rest is straightforward:

  • go over all displayed trees of the network
  • for each displayed tree, find the two CLV vectors of interest
  • compute displayed tree loglikelihood with changed branch length
  • reuse old displayed tree probs (as they only depend on reticulation probs)
  • use the normal network loglikelihood computation routine, but no need to do pll-update-partials for more than the two CLVs of interest
  • (don't forget to fully update all CLVs after a brlen has been optimized)

---> and voila, much much faster network loglikelihood evaluation when optimizing branch lengths 😎
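A rough sketch of what this loop could look like, with hypothetical simplified types (the real DisplayedTreeData also carries CLVs, scalers, and reticulation choices) and the per-tree edge loglikelihood hidden behind a callback; it assumes the network loglikelihood is the log of the probability-weighted sum of displayed-tree likelihoods:

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <functional>
    #include <vector>

    struct DisplayedTreeData {  // hypothetical, simplified stand-in
        double tree_logl;       // loglikelihood of this displayed tree
        double tree_prob;       // displayed tree probability (depends only on reticulation probs)
    };

    // edge_logl(i) stands for re-evaluating tree i across the changed branch only,
    // e.g. via something like pll_compute_edge_loglikelihood with the two CLVs of interest.
    double networkLoglAfterBrlenChange(std::vector<DisplayedTreeData>& trees,
                                       const std::function<double(std::size_t)>& edge_logl) {
        for (std::size_t i = 0; i < trees.size(); ++i) {
            trees[i].tree_logl = edge_logl(i);  // only the changed branch is re-evaluated
            // tree_prob is reused unchanged: it only depends on the reticulation probs
        }
        // log-sum-exp over the probability-weighted displayed tree likelihoods
        double max_logl = trees[0].tree_logl;
        for (const DisplayedTreeData& t : trees) max_logl = std::max(max_logl, t.tree_logl);
        double sum = 0.0;
        for (const DisplayedTreeData& t : trees) sum += t.tree_prob * std::exp(t.tree_logl - max_logl);
        return max_logl + std::log(sum);
    }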

@lutteropp
Owner Author

lutteropp commented Mar 8, 2021

The speedup potential of this idea is huge: For most branches, the number of CLV pairs to update will be much lower than the total number of displayed trees in the network, making this even more efficient... (because of the efficient non-redundant CLV storage, meaning multiple displayed trees share the same CLVs as long as they don't diverge)
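To illustrate what is meant by non-redundant CLV storage, here is a hypothetical, heavily simplified data layout (not the actual NetRAX structs): a CLV stored at a node only fixes the reticulation choices that matter below that node, everything else stays "don't care", so displayed trees that agree on those choices share the entry.

    #include <vector>

    enum class ReticulationState { DONT_CARE, TAKE_FIRST_PARENT, TAKE_SECOND_PARENT };

    struct NodeClvEntry {
        std::vector<ReticulationState> reticulationChoices;  // one per reticulation; DONT_CARE if irrelevant below this node
        std::vector<double> clv;                             // the CLV buffer itself
    };

    struct PerNodeDisplayedTreeStorage {
        std::vector<NodeClvEntry> entries;  // typically far fewer entries than displayed trees in the network
    };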

@lutteropp
Owner Author

lutteropp commented Mar 8, 2021

I need to be super careful though, exactly because different displayed trees can share the same CLV vectors. I currently believe that there is no need to store the old two CLVs of interest in a temporary variable and then redo the same stuff for the next displayed tree that has the same two CLVs of interest, but in case things go wrong, I need to double-check this assumption.

@lutteropp
Owner Author

lutteropp commented Mar 8, 2021

Actually, it can happen that two displayed trees share only one CLV vector, but not the other one. I need to draw a picture to check whether this affects anything / needs special care regarding the brlen opt speedup. Likely it doesn't, but better to be on the safe side.

@stamatak
Collaborator

stamatak commented Mar 9, 2021 via email

@lutteropp
Owner Author

lutteropp commented Mar 9, 2021

Thanks for the reminder! For networks, this part should be simple. The network root never changes. During brlen optimization, we only evaluate the displayed trees at some other node tempRoot in the network (but take the displayed tree topologies and probs as if they were still rooted at the network root).

After optimizing a branch and using tempRoot for it, we call the invalidateHigherCLVs(tempRoot) function, which invalidates all CLVs on the path from tempRoot to the network root.
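A minimal sketch of what such an invalidation could look like (hypothetical simplified Node type; the real invalidateHigherCLVs in NetRAX may work differently): walk upwards from tempRoot and mark every CLV on any path up to the network root as invalid.

    #include <vector>

    struct Node {                    // hypothetical, simplified
        std::vector<Node*> parents;  // one parent for tree nodes, two for reticulation nodes
        bool clv_valid = true;
    };

    void invalidateHigherCLVs(Node* node) {
        if (node == nullptr || !node->clv_valid) {
            return;  // nothing to do, or already invalidated via another path
        }
        node->clv_valid = false;
        for (Node* parent : node->parents) {
            invalidateHigherCLVs(parent);
        }
    }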

@stamatak
Collaborator

stamatak commented Mar 9, 2021 via email

@lutteropp
Owner Author

lutteropp commented Mar 9, 2021

I've checked the pll-modules source code for logl computation during brlen optimization.
If I got it correctly, then:

  • During optimization of a branch, we don't need to recompute any CLV vectors. They stay the same.
  • The displayed tree loglikelihood when rerooting at the branch (u,v) is pll_compute_edge_loglikelihood, with parent_clv being the displayed tree CLV stored at source node u and child_clv being the displayed tree CLV stored at target node v.
  • The pmatrix (when changing the branch length) is taken into account in the pll_compute_edge_loglikelihood call already.
  • (plus of course, don't forget invalidating CLV vectors once this optimization is done for an edge)

What I need to implement:

  • pll_compute_edge_loglikelihood with passing the clvs and scalers to use as function parameters.
  • Function that finds the two CLVs belonging to nodes u and v in the displayed tree (-> this is easy: at the network root node, we stored the reticulationChoices for the displayed tree. We only need to check with the reticulationChoicesCompatible function which of the CLVs stored at u and v are compatible with the displayed tree reticulation choices. This works because as we go further down a displayed tree, the restrictions get weaker, never stronger)
  • Customized local loglikelihood function for brlen optimization: We go over all displayed trees as they are stored at the root, recompute their loglikelihoods (with virtual tree root at edge source node u), and pass that value to the displayed tree's tree_logl value at the root. Rest is like normal network loglikelihood computation.
  • Don't forget to update the pmatrix before logl evaluation during brlen opt, and don't forget to invalidate all CLVs on the path from u to the network root (start and end point included) after optimizing a branch length is finished.
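A minimal sketch of the compatibility check mentioned in the list above (the real reticulationChoicesCompatible in NetRAX may differ in details): two choice vectors are compatible if they never fix the same reticulation to different parents, with "don't care" matching anything.

    #include <cstddef>
    #include <vector>

    enum class ReticulationState { DONT_CARE, TAKE_FIRST_PARENT, TAKE_SECOND_PARENT };

    // Assumes both vectors have one entry per reticulation in the network.
    bool reticulationChoicesCompatible(const std::vector<ReticulationState>& a,
                                       const std::vector<ReticulationState>& b) {
        for (std::size_t i = 0; i < a.size(); ++i) {
            if (a[i] == ReticulationState::DONT_CARE || b[i] == ReticulationState::DONT_CARE) {
                continue;  // an unconstrained reticulation is compatible with anything
            }
            if (a[i] != b[i]) {
                return false;  // the same reticulation is restricted differently
            }
        }
        return true;
    }

Because the restrictions only get weaker further down the displayed tree, a fully specified choice vector at the root should pick out the matching entries at u and at v.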

@lutteropp
Owner Author

lutteropp commented Mar 9, 2021

I need to double-check how one computes the changed displayed tree loglikelihood after a branch length has been changed during the brlen opt. Am I using the right CLVs? Is there some other problem we didn't think of?
Because I have implemented the above and currently, it is giving a network loglikelihood that is way off...

@lutteropp
Owner Author

Maybe I need pll_compute_root_loglikelihood instead of pll_compute_edge_loglikelihood?

@lutteropp
Owner Author

No that can't be, pll_compute_edge_loglikelihood sounds about right... maybe drawing a picture helps

@lutteropp
Owner Author

Screenshot from 2021-03-09 14-40-59
I have this zero-reticulation rooted network. I want to optimize the branch with pmatrix index 27, which is highlighted in green in the picture. The edge has source node 26, and target node 27.

What I tried was recomputing the loglikelihood like this:

  • source_clv = CLV stored at node 26
  • source_scaler = Scaler stored at node 26
  • target_clv = CLV stored at node 27
  • target_scaler = Scaler stored at node 27
tree_logl = pll_compute_edge_loglikelihood(partition, source->clv_index, source_clv, source_scaler,
                                           target->clv_index, target_clv, target_scaler,
                                           pmatrix_index, ann_network.fake_treeinfo->param_indices[p], nullptr);

... Where is the mistake? >_>

@lutteropp
Owner Author

lutteropp commented Mar 9, 2021

20210309_145927
I asked Alexey how a CLV knows about the rest of the tree (the parts not below it), and Alexey figured out the problem: when the virtual root of a tree gets changed in pll-modules, the CLVs need to be updated, because rerooting changes which nodes count as the children of a node in pll_update_partials...

To fix this issue for the networks, we need to also do those CLV reroot-updates.

@lutteropp
Owner Author

lutteropp commented Mar 9, 2021

The affected CLVs are exactly the ones on the path from the temporary virtual root to the network root. Thus, one not only needs to invalidate them after the branch has been optimized, but also needs to recompute them before optimization of that branch starts (as this is what is behind the "change of virtual root" in pll-modules).

@lutteropp
Owner Author

lutteropp commented Mar 9, 2021

Thus, I only need to implement a function that recomputes the affected displayed tree CLVs (i.e., the CLVs on the path from tempRoot to networkRoot) with regard to the temporary root as a preprocessing step before optimizing a branch. And then that thing should work :-)

Now I need to be super careful though about whether CLVs shared by multiple displayed trees are somehow affected or not. I still expect that to be no problem. But if afterwards there still is a bug, this assumption needs to be checked (and if the CLV-share causes a problem, then some temporary CLV copying must be introduced).

@lutteropp
Owner Author

lutteropp commented Mar 9, 2021

Sarah, also keep in mind that once you have changed a branch length and then move to a different virtual root (different branch), the CLVs along the path from the old to the new virtual root need to be updated to reflect the new branch length value; this is probably one of the most tricky parts in ML implementations, as you must 100% make sure which CLV entries are still valid and which are not.

Now I get what you meant there... it was too early in the morning when I first read it 😅

@lutteropp
Owner Author

lutteropp commented Mar 9, 2021

okay, that sounds consistent, getting this right was one of the most nasty parts in raxml

Turns out updating the CLVs is where the coding work lies for the networks case, too. 😅 (the rest needed for faster brlenopt network-logl computation was straightforward)

@stamatak
Collaborator

stamatak commented Mar 10, 2021 via email

@lutteropp
Owner Author

lutteropp commented Mar 10, 2021

Old algorithm vs. new algorithm for network-logl-after-brlen-change in a picture side-by-side, to convince myself that this is really worth it:
Unbenannte Notiz - 10 03 2021 12 42 - Seite 1

@lutteropp
Owner Author

lutteropp commented Mar 10, 2021

One thing I need to take special care of:

  • Because of dead node issues, the root node of a displayed tree is not always the network root. (In phylogenetic tree loglikelihood computation, as long as the root node has only one child, the root is reset to that child node.)
  • When virtually rerooting a displayed tree, we need to make sure to re-detect where the actual displayed tree root is now.

... But does this cause a problem? This needs a careful case distinction. One thing is clear: If a branch is in a dead part for a given displayed tree, then the loglikelihood of that displayed tree stays the same no matter what length that branch has. And if a branch is in an active part for a given displayed tree, then we can safely virtually re-root the displayed tree to the branch source node.

----> Conclusion: Skip recomputation of displayed tree loglikelihood for displayed trees where the branch is in a dead part (and simply use the old logl of that displayed tree), go on like previously planned if the branch is in an active part.

@lutteropp
Owner Author

I thus need to implement functions like these:

  • bool isActiveBranch(AnnotatedNetwork& ann_network, const DisplayedTreeData& displayedTree, unsigned int pmatrix_index)
  • Node* getNodeParent(AnnotatedNetwork& ann_network, Node* node, Node* virtualRoot)
  • std::vector<Node*> getNodeChildren(AnnotatedNetwork& ann_network, Node* node, Node* virtualRoot)
  • void updateClvsVirtualRerootTrees(AnnotatedNetwork& ann_network, Node* old_virtual_root, Node* new_virtual_root)

I can simplify the parent/children stuff by precomputing all node parents once when changing the virtual root.
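A tiny sketch of how the parent/children queries could then look with precomputed parent pointers (hypothetical index-based helpers; the real NetRAX functions take AnnotatedNetwork and Node* as listed above):

    #include <vector>

    // parent[i] = index of the parent of node i with respect to the current virtual root,
    // or -1 for the virtual root itself (precomputed once per virtual root change).
    int getNodeParent(const std::vector<int>& parent, int node_index) {
        return parent[node_index];
    }

    std::vector<int> getNodeChildren(const std::vector<int>& parent, int node_index) {
        std::vector<int> children;
        for (int i = 0; i < static_cast<int>(parent.size()); ++i) {
            if (parent[i] == node_index) {
                children.push_back(i);
            }
        }
        return children;
    }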

@stamatak
Collaborator

stamatak commented Mar 10, 2021 via email

@lutteropp
Owner Author

I figured out some more details I need for implementing this correctly. Getting this virtual rerooting right is tricky, but doable. :)

Unbenannte Notiz - 11 03 2021 14 07 - Seite 1

@lutteropp
Owner Author

To be done with std::vector<Node*> getParentPointers(AnnotatedNetwork& ann_network, const std::vector<ReticulationState>& reticulationChoices, Node* virtual_root) (currently implementing it). Rest stays straightforward as described before :)
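A sketch of how such a getParentPointers could be computed, under the simplifying assumption that the reticulation choices already determine which edges are active: the displayed tree is re-rooted at virtual_root by a plain BFS over its active adjacencies (hypothetical index-based representation, not the actual NetRAX signature):

    #include <queue>
    #include <vector>

    // active_neighbors[i] = nodes adjacent to node i via edges that are active
    // under the given reticulation choices (i.e., edges of the displayed tree).
    std::vector<int> getParentPointers(const std::vector<std::vector<int>>& active_neighbors,
                                       int virtual_root) {
        std::vector<int> parent(active_neighbors.size(), -1);
        std::vector<bool> visited(active_neighbors.size(), false);
        std::queue<int> queue;
        queue.push(virtual_root);
        visited[virtual_root] = true;
        while (!queue.empty()) {
            int node = queue.front();
            queue.pop();
            for (int neighbor : active_neighbors[node]) {
                if (!visited[neighbor]) {
                    visited[neighbor] = true;
                    parent[neighbor] = node;  // edge direction induced by the virtual root
                    queue.push(neighbor);
                }
            }
        }
        return parent;  // parent[virtual_root] stays -1
    }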

@lutteropp
Owner Author

lutteropp commented Mar 11, 2021

Likely the last redesign step I need, then it makes all sense:

  • When going over the displayed trees for the new virtual root, do NOT overwrite the tree_logl field of the DisplayedTreeData at the network root. Instead, simply store the tree_logl at the DisplayedTreeData for the new virtual root.
  • Generalize the high-level evaluateTrees network loglh function to take the clv_index at the current virtual root, using tree_logl values in the DisplayedTreeData structs from there.

These two things are rather simple and quick to implement. The real work lies in here:

  • Write equivalents for the processNodeImproved, processNodeImprovedSingleChild, and processNodeImprovedTwoChildren functions, or generalize them in some smart way. These functions are responsible for updating the CLVs.
    The difficulty here is that while we had a fixed network root, a network node always had the same children and the same parent. But now, with the virtual rerooting, the edge directions depend on which displayed tree we are currently looking at. This makes things more complicated. I need to draw a picture for this, in order to come up with something clever that still works with sharing some CLVs among displayed trees.

@lutteropp
Owner Author

lutteropp commented Mar 11, 2021

I need to write 3 entirely new node processing functions for this, as I cannot use the old ones.

In the functions I had so far (for normal incremental network loglh computation), I processed the network nodes in bottom-up order (a reversed topological sort, to be specific). Going up the network like this, I inferred for each DisplayedTreeData/CLV which reticulations need to be set in which way, by pairwise combining its children's DisplayedTreeData/CLVs.

Now for the virtual re-root case, at the moment I see no other way than iterating over the displayed trees, following a more top-down approach. This is because the edge directions now depend on the displayed tree. The tricky part is how to combine this with reusing/sharing CLV vectors among multiple displayed trees!

Still drawing and thinking, figuring out how to do this best. It may be a bit tricky, but I am 100% sure it is doable and I'll figure it out by some more thinking.

@lutteropp
Owner Author

lutteropp commented Mar 11, 2021

Also important to keep in mind when optimizing branch (u,v): When rerooting a displayed tree at the source node u, the children (for CLV-update computation) of u do not include the target node v.
Unbenannte Notiz - 11 03 2021 17 19 - Seite 1

@lutteropp
Owner Author

lutteropp commented Mar 12, 2021

I figured it out despite having a mean headache: "Which reticulation choices make this node the parent of the current clv?" -> this gives additional reticulation restrictions to store at the DisplayedTreeData/CLV. Together with the overall reticulation choices for the displayed trees in the network being known beforehand during brlenopt, this can easily be put together. It allows me to adapt and then reuse the processNode functions.

@lutteropp
Owner Author

Also, the different edge directions change the order in which the CLVs need to be processed.
--> For each different set of edge directions, we need to do the CLV updating separately. But still, we can avoid redundancy with some clever tricks... needs more thinking

@lutteropp
Owner Author

lutteropp commented Mar 12, 2021

Okay, now I got it: We don't need to redo update-CLVs by going over each displayed tree. I have figured out when the edge directions change: we need to do separate update-CLVs calls on each of the paths from the old virtual root to the new virtual root in the network (following the network edges in reverse direction, i.e., always going to parents).
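A sketch of collecting those paths (hypothetical simplified Node type), following the description above: starting from the old virtual root, always step to a parent; reticulation nodes are where the paths branch, and each complete path ends at the new virtual root.

    #include <vector>

    struct Node {                    // hypothetical, simplified
        std::vector<Node*> parents;  // one parent for tree nodes, two for reticulation nodes
    };

    // Call with collectPaths(old_virtual_root, new_virtual_root, path, all_paths),
    // where path is an empty working vector.
    void collectPaths(Node* current, Node* new_virtual_root,
                      std::vector<Node*>& path, std::vector<std::vector<Node*>>& all_paths) {
        path.push_back(current);
        if (current == new_virtual_root) {
            all_paths.push_back(path);  // found one old_virtual_root -> new_virtual_root path
        } else {
            for (Node* parent : current->parents) {
                collectPaths(parent, new_virtual_root, path, all_paths);
            }
        }
        path.pop_back();
    }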

@lutteropp
Owner Author

lutteropp commented Mar 12, 2021

And the overlapping part (which only needs to be processed once) of all these paths, the part that is equal for all these paths, is exactly the path from the old virtual root until the first encountered reticulation node on a path to the new virtual root.

@stamatak
Collaborator

stamatak commented Mar 12, 2021 via email

@lutteropp
Owner Author

lutteropp commented Mar 13, 2021

And the overlapping part (which only needs to be processed once) of all these paths, the part that is equal for all these paths, is exactly the path from the old virtual root until the first encountered reticulation node on a path to the new virtual root.

While this might sound reasonable at first, it is actually wrong. See the example in this image:
Unbenannte Notiz - 13 03 2021 13 05 - Seite 1

Instead, what we can say for this example is that among all the old_virtual_root --> new_virtual_root paths, the part that always stays the same is the path from the last encountered reticulation node to the new_virtual_root. But this one doesn't incorporate much savings potential...

----> Conclusion: Don't try some additional CLV sharing tricks here. Instead, independently compute updated CLVs for each of the paths from old_virtual_root to new_virtual_root, setting the reticulation restrictions of each CLV encountered in this process to how we set the reticulation choices in order to get this path.

@lutteropp
Owner Author

lutteropp commented Mar 13, 2021

Tricky stuff, this virtual rerooting! Turns out the reticulation node handling I wrote in the processNode functions is now also obsolete. Instead, I have to interpret the network here as if we don't have dedicated reticulation nodes, and always do an isBranchActive(from, to) check. Or better, get the reticulation restrictions we need in order for a given branch to be active.
Unbenannte Notiz - 13 03 2021 14 50 - Seite 1

@lutteropp
Owner Author

Important observation:
Unbenannte Notiz - 13 03 2021 19 19 - Seite 1
If new_virtual_root == old_virtual_root, the path should not be empty! It still contains 1 node that needs its CLVs updated.

@lutteropp
Owner Author

lutteropp commented Mar 13, 2021

A situation that gets tricky regarding which CLVs to use:

Here, we have the paths (from old_virtual_root to new_virtual_root) 7->4->8->5 and 7->6->8->5. The children reported below are with regard to new_displayed_root.

  • In path 7->4->8->5, the node 4 has children 3 and 7. We process the paths and fill the node DisplayedTrees/CLVs accordingly.
  • When updating the CLVs for node 6 at path 7->6->8->5, node 7 has child 4. The CLVs for 4 we actually need here are the ones where 4 has child 3 (and inactive child 8). However, by processing the previous path, we have overwritten the CLVs stored at 4 to be the ones where 7 is a child of 4.

Unbenannte Notiz - 13 03 2021 22 26 - Seite 1

There are two possible ways for resolving the situation:

A) Store which nodes were the children used for a given DisplayedTreeData. This is to be stored in the DisplayedTreeData object. Then, it makes sense to store multiple DisplayedTreeData's in a node, with the same reticulation choices but different set of children. And then one has to make sure to also check for compatible children setting when looking at a DisplayedTreeData from a child at a current node (it is only compatible if the current node does not show up in the list of children). To do final tree logl evaluation right, keep a flag stating whether a tree was newly added or not. And only evaluate over the newly added trees...

B) Before processing the next path, kick out the old CLVs at all path nodes except for the new_virtual_root node. Detect in advance at which nodes we need to restore the old displayed trees from how they were when we had old_virtual_root, save and restore them accordingly.

I have decided to proceed with solution B.

@lutteropp
Owner Author

Solution B does not work the way I want it to. (Simple example: call it twice with the same pmatrix_index. The first time, recovering the CLVs worked perfectly; the second time, we are no longer able to correctly recover the old CLVs, because the first call overwrote some things.)

Thus simpler, but less efficient:

Solution C: Use solution B. But after optimizing a branch, switch back to the normal network root. This should work fine and with simpler code.

---> Switching to solution C.

@lutteropp
Owner Author

Another interesting observation, reposting here because it looks worth mentioning in the paper supplement as well:

Unbenannte Notiz - 14 03 2021 03 02 - Seite 1
The problem in the above graphic: Node 28 has only one displayed tree on its side. Node 15 has two displayed trees on its side. They are separated from each other by the branch between them, which is being optimized.

---> Instead of going over the source trees only for the final loglh evaluation, we need to go over all pairs of trees, one at the source node and one at the target node.

@lutteropp
Owner Author

lutteropp commented Mar 14, 2021

Going over pairs of trees is not enough, as can be seen in this example:
Unbenannte Notiz - 14 03 2021 04 22 - Seite 1

Instead, one needs to do this: If a source tree has no compatible tree in the target trees, use this source tree alone. If a target tree has no compatible tree in the source trees, use this target tree alone. We can reuse them from oldDisplayedTrees as the branch length has no effect in this case.
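A sketch of that pairing logic with hypothetical simplified types: combine every compatible (source tree, target tree) pair across the optimized branch, and fall back to the old, unchanged loglikelihood for any tree that has no partner on the other side (per the comment above, the branch length has no effect on it).

    #include <cstddef>
    #include <vector>

    enum class ReticulationState { DONT_CARE, TAKE_FIRST_PARENT, TAKE_SECOND_PARENT };

    struct TreeAtNode {                                       // hypothetical, simplified
        std::vector<ReticulationState> reticulationChoices;
        double old_tree_logl;                                 // reused when the tree has no partner
    };

    bool compatible(const std::vector<ReticulationState>& a, const std::vector<ReticulationState>& b) {
        for (std::size_t i = 0; i < a.size(); ++i) {
            if (a[i] != ReticulationState::DONT_CARE && b[i] != ReticulationState::DONT_CARE && a[i] != b[i]) {
                return false;
            }
        }
        return true;
    }

    // pairLogl(s, t) stands for evaluating pll_compute_edge_loglikelihood across the optimized
    // branch, with the CLV of s at the source node and the CLV of t at the target node.
    std::vector<double> collectTreeLogls(const std::vector<TreeAtNode>& sourceTrees,
                                         const std::vector<TreeAtNode>& targetTrees,
                                         double (*pairLogl)(const TreeAtNode&, const TreeAtNode&)) {
        std::vector<double> logls;
        std::vector<bool> targetMatched(targetTrees.size(), false);
        for (const TreeAtNode& s : sourceTrees) {
            bool matched = false;
            for (std::size_t j = 0; j < targetTrees.size(); ++j) {
                if (compatible(s.reticulationChoices, targetTrees[j].reticulationChoices)) {
                    logls.push_back(pairLogl(s, targetTrees[j]));
                    targetMatched[j] = true;
                    matched = true;
                }
            }
            if (!matched) {
                logls.push_back(s.old_tree_logl);  // source tree without a partner: reuse old logl
            }
        }
        for (std::size_t j = 0; j < targetTrees.size(); ++j) {
            if (!targetMatched[j]) {
                logls.push_back(targetTrees[j].old_tree_logl);  // target tree without a partner
            }
        }
        return logls;
    }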

@lutteropp
Owner Author

lutteropp commented Mar 14, 2021

These seem to be all special cases one needs to think of.

First prototype version is working 🎉. It still computes more CLVs than needed though, as I implemented

C) Before processing the next path, kick out the old CLVs at all path nodes except for the new_virtual_root node. Detect in advance at which nodes we need to restore the old displayed trees from how they were when we had old_virtual_root, save and restore them accordingly. After optimizing a branch, switch back to the normal network root, recomputing the CLVs on the path from edge source to network root.

This means I implemented virtual_root_1 -> network_root -> virtual_root_2 -> network_root -> virtual_root_3 -> network_root.


Regarding minimizing total number of CLVs to recompute, it would be even better to go with

A) Store which nodes were the children used for a given DisplayedTreeData. This is to be stored in the DisplayedTreeData object. Then, it makes sense to store multiple DisplayedTreeData's in a node, with the same reticulation choices but different set of children. And then one has to make sure to also check for compatible children setting when looking at a DisplayedTreeData from a child at a current node (it is only compatible if the current node does not show up in the list of children). To do final tree logl evaluation right, keep a flag stating whether a tree was newly added or not. And only evaluate over the newly added trees...

This would implement virtual_root_1 -> virtual_root_2 -> virtual_root_3 -> network_root.

@lutteropp
Owner Author

lutteropp commented Mar 14, 2021

I need to double-check why I need this returning-to-network-root after optimizing a branch. I wrote that it is because the first call overwrote some things in the CLVs. Is this still the case? If so, then I need to mix solution A somehow into solution B. I suspect the problem arises when going from one path to the other.

The problem is obviously in this line from updateCLVsVirtualRerootTrees:
bool appendMode = ((p > 0) && (paths[p].path[i] == new_virtual_root));

What this line essentially does is: When processing nodes in a path, only append the new DisplayedTreeData's to a node if we are at new_virtual_root. Otherwise, clear the old data stored there. This of course doesn't work well with the next branch optimization wanting to restore some CLVs from how they were with the branch before it -> because here, we are interested in all CLVs computed by the paths before. But so far, the function is deleting the intermediate CLV results on the previous paths.

The question is whether this issue can be mitigated by only keeping track of which children were used. Or if we also need to take into account the children of those children etc...

@lutteropp
Owner Author

lutteropp commented Mar 14, 2021

Nice! 😎 The results turned out as I planned. The update-CLVs calls are now much less of an issue; the main runtime lies in pll_compute_edge_loglikelihood (76.14%), as can be seen in this callgraph.

awesome_callgraph

This makes further tuning of updateCLVsVirtualRerootTrees a low priority.
Instead, high priority now lies in:

  • Implement OpenMP parallelization of pll_compute_edge_loglikelihood and pll_compute_root_loglikelihood (this will help for our simulated data, but less for the big empirical dataset that has super tiny partitions)
  • Use Newton-Raphson instead of Brent for optimizing branches (this will give us some niiiiiice additional speedup 😎). This requires some Maths now, as I need to understand how to compute the derivative of network loglikelihood. Given the derivatives of the displayed tree loglikelihoods, this should be easy. I then need to understand how pll_modules computes likelihood derivatives for trees (something something sumtable), and how this can be generalized to networks.
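Following up on the derivative part of the last bullet: assuming the network likelihood is the probability-weighted sum over displayed tree likelihoods, L_net = Σ_t p_t L_t, the derivatives needed for Newton-Raphson follow from the per-tree loglikelihood derivatives by the chain rule (a sketch, to be checked against the likelihood definition actually in use):

    \frac{\partial \log L_{\text{net}}}{\partial b}
        = \frac{\sum_t p_t \, L_t \, \frac{\partial \log L_t}{\partial b}}{\sum_t p_t \, L_t},
    \qquad
    \frac{\partial^2 \log L_{\text{net}}}{\partial b^2}
        = \frac{\sum_t p_t \, \frac{\partial^2 L_t}{\partial b^2}}{\sum_t p_t \, L_t}
          - \left(\frac{\partial \log L_{\text{net}}}{\partial b}\right)^{2},
    \quad\text{with}\quad
    \frac{\partial^2 L_t}{\partial b^2}
        = L_t \left( \frac{\partial^2 \log L_t}{\partial b^2}
                     + \left(\frac{\partial \log L_t}{\partial b}\right)^{2} \right).

If the likelihood definition combines displayed trees per site rather than per whole tree, the same formulas apply per site and are then summed over sites.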
