Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Counting the number of trees in which each node (or clade) takes part #30

Open
willdumm opened this issue Oct 17, 2022 · 0 comments
Open

Comments

@willdumm
Copy link
Contributor

willdumm commented Oct 17, 2022

For #29 we need a way to count the number of trees which each of a DAG's nodes takes part in.

This is implemented in the Python package here:
https://github.com/matsengrp/historydag/blob/88db496bb6420adf85fce78a861e28ab74031694/historydag/dag.py#L1516

This map from nodes to tree counts can be constructed for all nodes in two traversals of the DAG, first postorder and then reverse postorder. (For use in #29 we may need to sample multiple trees at once for use in sample-optimize-merge, since sampling will be fast, but constructing this map will be relatively slow)

  • First, compute the number of subtrees below each node in the DAG (e.g. with a call to SubtreeWeight<TreeCount>.ComputeWeightBelow(). Let's call $b(n)$ the number of subtrees below any node $n$
  • We can compute the number of "above subtrees" $a(n)$ for a node $n$ as follows... Here $P(n)$ is the set of parent nodes of $n$. Also, given a node $n$ and one of its parents $p$, then $s(n, p) = \sum\limits_{s\in S(n, p)} b(s)$, where $S(n, p)$ is the set of child nodes of $p$ which descend from the same clade as $n$ (including n):
    $$a(n) = \sum_{p\in P(n)} \left [ \frac{a (p ) b(p )}{s(n, p)}\right ]$$
    Then for any node, $c(n) = a(n)b(n)$ is the number of trees the node $n$ is part of in the DAG.
  • Instead of a map from nodes to the number of trees they take part in, we actually need a map from clade unions to the number of trees they take part in. This can be computed by iterating through all nodes and summing values of the map $c$ on all nodes which have the same clade union. By clade union we mean union of child clades, as a subset of leaf labels (rather than set of sets of leaf labels as in LeafSet). This is the map called $N$ in Sampling from the DAG's topological fringe #29

Credit to @williamhowardsnyder for the algorithm and Python implementation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant