Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sampling from the DAG's topological fringe #29

Open
willdumm opened this issue Oct 17, 2022 · 0 comments
Open

Sampling from the DAG's topological fringe #29

willdumm opened this issue Oct 17, 2022 · 0 comments
Assignees

Comments

@willdumm
Copy link
Contributor

willdumm commented Oct 17, 2022

We would like to try sampling trees from the history DAG which are topologically strange, and use these preferentially in the sample-optimize-merge loop. Topological strangeness will be quantified as the sum of Robinson Foulds distance from a tree to any other tree in the DAG. We can compute this using the SubtreeWeight framework as follows:

  • We need first a map $N$ from any clade in the hDAG (that is, any set of taxa that can be realized as the union of a node's child clades) to the number of trees in the hDAG which contain a node with that clade. More on this in Counting the number of trees in which each node (or clade) takes part #30 .

  • I show here starting at line 115 that for any tree in the hDAG, the sum of the Robinson Foulds distances from that tree $t$ to every other tree in the DAG (a sum which I'll call $D_t$) is given by
    $$D_t = K + \sum_{e\in t} (|T| - 2N(c_e))$$
    where...

    • $K$ is an easily computable constant which we won't worry about (it's the sum of number of clades in each tree in the DAG),
    • the sum is over edges $e$ in the tree $t$,
    • $c_e$ is the clade of the child node of the edge $e$ in the tree $t$,
    • and $|T|$ is the number of trees in the hDAG.

    Note that the sum, and therefore the value $D_t - K$ which we'll be maximizing in trees, will probably be negative. Therefore, all we need for computing the (K-shifted) sum RF distance on all trees in the DAG via SubtreeWeight is an edge function which returns this value $|T| - 2N(c_e)$ for any edge $e$.

  • Once we have that implemented via SubtreeWeight, trees with higher shifted-sum-RF distances $D_t - K$ are the ones which are more exotic topologically, in the context of the other trees in the DAG. We now want to weight these higher in the distribution we're sampling trees from. This can be done by using cached weights from SubtreeWeight<ShiftedSumRFDistances>.ComputeWeightBelow(), and sampling edges below each clade in proportion to the relative cached weights of each edge's child node, in the same fashion as has been done for uniform sampling with SubtreeWeight<TreeCount>. The way this works will be related to the solution to Sampling from the DAG with probabilities (and constraints) #28 and how we resolve the discussion on PR Uniform sampling #31

@willdumm willdumm changed the title Sampling from the DAG's topological 'edge' Sampling from the DAG's topological fringe Oct 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants