
Simulation protocol #3

Open
celinescornavacca opened this issue Nov 11, 2020 · 28 comments

@celinescornavacca
Collaborator

We are going to simulate our networks under the birth-hybridization process of Zhang et al 2018.
In this process we have a speciation rate l (lambda), a hybridisation rate v (nu), and a time t_0 for which we run the process.

From their paper: "The simulator starts from the time of origin (t_0) with one species. A species splits into two (speciation) with rate l, and two species merge into one (hybridization) with rate v. At the moment of k branches, the total rate of change is r_tot = k*l + C(k,2)*v. We generate a waiting time ~ exp(r_tot) and a random variable u ~ U(0,1). If u < k*l / r_tot, we randomly select a branch to split; otherwise, we randomly select two branches to join, and generate an inheritance
probability c ~ U(0,1). The simulator stops at time t_0."
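
A minimal sketch of this process (my own illustration in the simulator's Python, not the attached code; it interprets "stops at time t_0" as running the process for a total duration of t_0 and only records the events, without building the network object):

import math
import random

def simulate_birth_hybridization(lam, nu, t0):
    # Birth-hybridization process sketch: start with one species at the origin
    # and let the time run from t0 down to 0 (the present).
    branches = [0]          # ids of the currently extant branches
    next_id = 1
    t = t0
    events = []
    while True:
        k = len(branches)
        r_tot = k * lam + math.comb(k, 2) * nu   # total rate of change with k branches
        t -= random.expovariate(r_tot)           # waiting time ~ exp(r_tot)
        if t <= 0:
            break                                # reached the present: stop
        if random.random() < k * lam / r_tot:
            # speciation: a randomly chosen branch splits into two
            parent = random.choice(branches)
            branches.remove(parent)
            branches += [next_id, next_id + 1]
            events.append(("speciation", t, parent, next_id, next_id + 1))
            next_id += 2
        else:
            # hybridization: two randomly chosen branches merge into one,
            # with inheritance probability c ~ U(0,1)
            p1, p2 = random.sample(branches, 2)
            c = random.random()
            branches.remove(p1)
            branches.remove(p2)
            branches.append(next_id)
            events.append(("hybridization", t, p1, p2, next_id, c))
            next_id += 1
    return events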

Zhang et al. - 2018 - Bayesian Inference of Species Networks from Multil.pdf

No transfer events for now, because branches with no length cannot be recovered.

Given such a generated network, we extract a random tree using the inheritance probabilities at the reticulated nodes: for each reticulated node x, we draw a number y ~ U(0,1) and choose the first parent p_x if y < the inheritance probability of the edge (p_x, x). If we do this, we have a tree, but not a binary one. Then we clean the tree from the dead nodes and the indegree-1 outdegree-1 nodes, as Sarah explained in her document, and we obtain a binary tree. On this binary tree, we simulate N sites with Seq-Gen-1.3.4 (currently under this model: -mHKY -t3.0 -f0.3,0.2,0.2,0.3; this can be changed).
We do this K times, obtaining an alignment of length K*N in which each group of N sites has its own tree.
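
A sketch of the tree-extraction step (hypothetical network/tree interface, shown only to make the procedure concrete):

import random

def sample_displayed_tree(network):
    # For each reticulated node x, keep the edge from the first parent p_x
    # with probability equal to the inheritance probability of (p_x, x),
    # otherwise keep the edge from the second parent.
    kept_edges = set(network.tree_edges)              # edges not entering a reticulation
    for x in network.reticulation_nodes:
        p1, p2 = network.parents(x)
        if random.random() < network.inheritance_prob[(p1, x)]:
            kept_edges.add((p1, x))
        else:
            kept_edges.add((p2, x))
    tree = network.subgraph(kept_edges)
    # The result is generally not binary: remove dead nodes (no surviving
    # descendants) and suppress indegree-1 outdegree-1 nodes afterwards.
    tree.remove_dead_nodes()
    tree.suppress_unary_nodes()
    return tree

The N sites per sampled tree would then come from a Seq-Gen call along the lines of seq-gen -mHKY -t3.0 -f0.3,0.2,0.2,0.3 -l N < sampled_tree.nwk (the -l flag sets the number of sites; the exact invocation depends on the pipeline).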

@celinescornavacca
Collaborator Author

celinescornavacca commented Nov 11, 2020

In their paper, they use these parameters in the simulations:

\lambda - \nu ~ Exp(rate 0.1), i.e. mean 10
\tau_0 ~ Exp(rate 10), i.e. mean 0.1
\nu / \lambda ~ Beta distribution with alpha = 1 and beta = 1 (same as a uniform distribution between 0 and 1)
\gamma ~ uniform distribution between 0 and 1
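
As a concrete illustration (a sketch only; the attached RandomNetwork.txt does the actual drawing), one way to obtain (lambda, nu, t_0) from these priors is:

import random

def draw_simulation_parameters():
    # lambda - nu ~ Exp(rate 0.1), nu / lambda ~ Beta(1, 1), tau_0 ~ Exp(rate 10).
    # From diff = lambda - nu and ratio = nu / lambda we get
    # lambda = diff / (1 - ratio) and nu = lambda * ratio.
    diff = random.expovariate(0.1)     # lambda - nu, mean 10
    t0 = random.expovariate(10)        # time of origin tau_0, mean 0.1
    ratio = random.betavariate(1, 1)   # nu / lambda, uniform on (0, 1)
    lam = diff / (1 - ratio)
    nu = lam * ratio
    return lam, nu, t0

The inheritance probabilities gamma are not drawn here; they are drawn per hybridization event during the simulation itself.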

As I said in a Slack message, this process gives Yule-type networks, which behave like Yule-type trees; see the attached example for trees:

[attached image: example of a simulated Yule-type tree]

Once they reach a certain number of nodes, things get crazy and they start speciating and hybridising a lot.
It is not the focus of our paper to correctly simulate networks (we are not in a Bayesian framework, we do not need priors), and this is the best we can do for now (for the next paper we will do better; I am working on it).
So we can simply throw away the networks we do not like, as done in lines 143-147 of this simulator, which only simulates networks and not trees and sequences.

RandomNetwork.txt
(in txt because py is not supported)

Sarah, you will find the parameters they use in the paper on lines 8-17; you can copy and paste them into your simulator, add a line similar to line 143, and you should be good.

@celinescornavacca
Collaborator Author

I played with the simulator for a while and I think I found a way to set the parameters that gives, on average, networks with enough leaves and not too many reticulations.

I also suggest using the ratio no_of_hybrids/no_of_leaves to make our choices (line 148); a sketch of such a rejection loop is given below.
Note that the choice of parameters has to be done inside the "while simulateNetwork:" loop (lines 46-48), not before, as was done in my previous file.

RandomNetwork.txt
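
A sketch of the rejection loop just described (the thresholds and the network interface are illustrative assumptions, not the values fixed in RandomNetwork.txt; draw_simulation_parameters is the hypothetical helper sketched above, and simulate_birth_hybridization is assumed here to return a network object):

MIN_LEAVES = 30              # illustrative threshold
MAX_HYBRID_LEAF_RATIO = 0.1  # illustrative threshold for no_of_hybrids / no_of_leaves

def simulate_acceptable_network():
    while True:
        # the parameters are re-drawn inside the loop, as noted above
        lam, nu, t0 = draw_simulation_parameters()
        network = simulate_birth_hybridization(lam, nu, t0)
        n_leaves = network.num_leaves()
        n_hybrids = network.num_reticulations()
        if n_leaves >= MIN_LEAVES and n_hybrids / n_leaves <= MAX_HYBRID_LEAF_RATIO:
            return network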

@lutteropp
Owner

lutteropp commented Nov 19, 2020

Great, it works! 👍 It is very slow, but it works! (The slightly adapted version is in https://github.com/lutteropp/NetRAX/blob/master/simulator/src/network/logic/celine_simulator.py)

Due to runtime issues, it is not realistic to exactly pre-specify the desired number of taxa and reticulations with this simulator. Thus, we are categorizing the simulated datasets (accepting every network with at least 30 taxa and up to 10% reticulations) into low/medium/high buckets instead. For generating a set of acceptable datasets, we use https://github.com/lutteropp/NetRAX/blob/master/simulator/src/network/logic/simulate_celine_endless.py
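
Stated as a filter (my reading of the 10% criterion, not copied from simulate_celine_endless.py):

def is_acceptable(n_taxa, n_reticulations):
    # accept networks with at least 30 taxa and at most 10% reticulations
    # (interpreted here as n_reticulations <= 0.1 * n_taxa)
    return n_taxa >= 30 and n_reticulations <= 0.1 * n_taxa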

Our current simulation setup is:

  • categorize simulated networks into low/medium/high buckets (in terms of #taxa/#reticulations)
  • use 5 random starting points in NetRAX
  • number of trees to simulate: each displayed tree
  • MSA-partition-size: 500, 1000 sites per tree
  • use both likelihood functions (average + best)

As a first step (to test the pipeline), we will also do the simulations on tiny networks (up to 10 taxa, up to 2 reticulations).

Slightly later, we will also run NetRAX with raxml-ng best tree (to check performance implications).

Much later (for second paper):

  • 3 simulators (celine, sarah, empirical datasets)
  • identifiability issue: different networks can have the same displayed trees --> use graph isomorphism on rooted DAGs to check network topology

The entire simulation pipeline can currently be found here:
https://github.com/lutteropp/NetRAX/tree/master/simulator/src/network/logic

@stamatak
Collaborator

stamatak commented Nov 19, 2020 via email

@lutteropp
Owner

lutteropp commented Nov 20, 2020

There is one problem with our simulation setup from above - it ignores the displayed tree probabilities.

I implemented the following sampling types:

from enum import Enum

class SamplingType(Enum):
    STANDARD = 1  # randomly choose which tree to sample, then sample an equal number of sites for each sampled tree - the only mode that uses the n_trees or m parameter for sampling
    PERFECT_SAMPLING = 2  # sample each displayed tree, with as many sites as expected from the tree probability
    PERFECT_UNIFORM_SAMPLING = 3  # sample each displayed tree, with the same number of sites per tree (ignoring reticulation probabilities)
    SINGLE_SITE_SAMPLING = 4  # sample each site individually, with the reticulation probabilities in mind

What I wrote above would be PERFECT_UNIFORM_SAMPLING. I am running the small dataset simulations with this for now, but I expect NetRAX to not infer the correct reticulation probabilities in this setting. Instead, I believe PERFECT_SAMPLING is better. What are your thoughts on this?
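
For reference, this is roughly how I think of PERFECT_SAMPLING (a sketch with a hypothetical interface, not the simulator code): every displayed tree corresponds to one choice per reticulation, its probability is the product of the chosen inheritance probabilities, and it gets a proportional share of the total number of sites.

from itertools import product

def allocate_sites_perfect_sampling(reticulation_probs, sites_per_tree):
    # reticulation_probs: inheritance probability of the first parent edge
    # of each reticulation. The total MSA length is kept the same as in the
    # uniform case: sites_per_tree * (number of displayed trees).
    n_trees = 2 ** len(reticulation_probs)
    total_sites = sites_per_tree * n_trees
    allocation = []
    for choices in product([0, 1], repeat=len(reticulation_probs)):
        tree_prob = 1.0
        for p, choice in zip(reticulation_probs, choices):
            tree_prob *= p if choice == 0 else (1.0 - p)
        allocation.append((choices, round(tree_prob * total_sites)))
    return allocation

(Because of rounding, the per-tree counts may not add up exactly to the total; the real implementation has to decide how to handle that.)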

@stamatak
Collaborator

stamatak commented Nov 21, 2020 via email

@celinescornavacca
Collaborator Author

Same here!

@lutteropp
Owner

lutteropp commented Nov 25, 2020

Okay, switched to PERFECT_SAMPLING as we all agree on that. Done in addb906.

(EDIT: Moved starting networks question to new issue)

@stamatak
Collaborator

stamatak commented Nov 25, 2020 via email

@lutteropp
Owner

lutteropp commented Nov 25, 2020

@stamatak Already in there ;-) I copy-pasted the struct I use for the results, as this is the shortest code part where one can see what is currently evaluated:

class Result:
    def __init__(self, dataset):
        self.dataset = dataset
        # scores on the true (simulated) network
        self.bic_true = 0
        self.logl_true = 0
        # scores of the network inferred by NetRAX from random starting networks
        self.bic_inferred = 0
        self.logl_inferred = 0
        # scores of the network inferred with the RAxML best tree as the single starting network
        self.bic_inferred_with_raxml = 0
        self.logl_inferred_with_raxml = 0
        # scores of the RAxML best tree itself
        self.bic_raxml = 0
        self.logl_raxml = 0
        # number of reticulations in the inferred networks
        self.n_reticulations_inferred = 0
        self.n_reticulations_inferred_with_raxml = 0
        # topological distances between true and inferred networks
        self.topological_distances = {}
        self.topological_distances_with_raxml = {}

(inferred_with_raxml means that RAxML best tree was used as single starting network)

@celinescornavacca
Collaborator Author

Looking good!

@lutteropp
Owner

This is what we currently measure: #14

@lutteropp
Owner

lutteropp commented Nov 30, 2020

I just realized that we may need both PERFECT_SAMPLING and PERFECT_UNIFORM_SAMPLING in our experiments:

  • PERFECT_SAMPLING is better for comparing loglikelihood and BIC score.
  • PERFECT_UNIFORM_SAMPLING is better for scoring well when comparing network topology, because those Dendroscope topological distances do not take reticulation probabilities into account...

@stamatak
Collaborator

stamatak commented Nov 30, 2020 via email

@lutteropp
Owner

lutteropp commented Dec 6, 2020

There are three different settings on how to handle partitions:

  • PLLMOD_COMMON_BRLEN_UNLINKED: Each partition has its own independent branch lengths and reticulation probs.
  • PLLMOD_COMMON_BRLEN_LINKED: All partitions share the same branch lengths and reticulation probs.
  • PLLMOD_COMMON_BRLEN_SCALED: All partitions share the same base branch lengths and reticulation probs. In addition, each partition has its own scaling factor by which the shared branch lengths are multiplied.

... It appears to me that we need to include all three of these possibilities in our experiments as soon as we have more than one partition in the MSA.
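
Conceptually (a sketch in the simulator's Python, not the actual pll-modules/NetRAX code; partition.own_brlen and partition.brlen_scaler are hypothetical attributes):

def effective_branch_length(shared_brlen, partition, mode):
    # unlinked: each partition keeps fully independent branch lengths
    if mode == "unlinked":
        return partition.own_brlen
    # linked: every partition uses the shared branch lengths as-is
    if mode == "linked":
        return shared_brlen
    # scaled: shared branch lengths multiplied by a per-partition scaling factor
    if mode == "scaled":
        return shared_brlen * partition.brlen_scaler
    raise ValueError(mode)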

@stamatak
Collaborator

stamatak commented Dec 6, 2020 via email

@lutteropp
Owner

@stamatak umm... all 3 versions are fully implemented in NetRAX already

@stamatak
Collaborator

stamatak commented Dec 6, 2020 via email

@lutteropp
Owner

Should I switch the experiments to run with UNLINKED then? So far I have used LINKED.

@stamatak
Collaborator

stamatak commented Dec 7, 2020 via email

@celinescornavacca
Collaborator Author

I would go with scaled, but for these experiments linked should be enough, since the data are linked (TODO for the second paper: simulate using a scaled framework?)

PERFECT_UNIFORM_SAMPLING is better for scoring well when comparing network topology. Because those Dendroscope topological distances do not take reticulation probabilities into account...

I am not sure I would do that, because in any case we need a way to check the gammas when reconstructing the network in the PERFECT_SAMPLING setting.

@lutteropp
Owner

Cool, then let's stick to scaled (and if we under-estimate reticulations, switch to linked later) and only PERFECT_SAMPLING first :)

It is easier (in terms of total runtime, number of plots to look at, and interpretation) to not try every possible combination of parameters one could think of in the initial experiments...

@stamatak
Collaborator

stamatak commented Dec 7, 2020 via email

@celinescornavacca
Collaborator Author

We can test it after the scaled one, @stamatak, to avoid the burden of testing them at the same time?

@lutteropp
Owner

lutteropp commented Dec 7, 2020

Yes, let's make a separate experiment that only looks at the effect of scaled vs. unlinked brlens :)
EDIT: I created an issue for it: #28

@lutteropp
Owner

lutteropp commented Dec 7, 2020

(I can do the runs with tons of different parameters on the cluster, but my problem is interpreting, plotting, and summarizing all these results for so many different parameter combinations at once while still seeing the connections there... it means making tons of plots)

@lutteropp
Owner

(also, having to entirely re-run some broken experiment is more painful when it was on a lot of parameter combinations)

@stamatak
Collaborator

stamatak commented Dec 7, 2020 via email
