# Infer gene trees

It can be useful to compare the true genealogies (one or more) at a locus with the gene tree inferred from the sequence data simulated on those genealogies. The two may not match if incomplete lineage sorting (ILS) or introgression is high and causes the genealogies within a locus to vary substantially, or if the locus contains too little information to reconstruct the gene tree.

In [10]:
import ipcoal
import toytree

### Simulate data under a tree model

In [11]:
tre = toytree.rtree.unittree(5, treeheight=1e6, seed=111)

In [12]:
model = ipcoal.Model(tree=tre, Ne=1e6, recomb=1e-9, seed=111)

In [13]:
model.sim_loci(nloci=15, nsites=500)

### Infer gene trees for each locus

In [14]:
model.infer_gene_trees(inference_method="raxml")

### Save the results table to disk


In [15]:
model.df.to_csv("./sim-table.csv")

### View the result table


In [16]:
model.df.head(20)

|   | locus | start | end | nbps | nsnps | genealogy | inferred_tree |
|---|-------|-------|-----|------|-------|-----------|---------------|
| 0 | 0 | 0 | 20 | 20 | 2 | (r1:3.81212e+06,(r0:1.29... | (r2:0.018236661006054526... |
| 1 | 0 | 20 | 207 | 187 | 6 | ((r0:1.08352e+06,r1:1.08... | (r2:0.018236661006054526... |
| 2 | 0 | 207 | 500 | 293 | 15 | (r0:1.53895e+06,(r1:1.29... | (r2:0.018236661006054526... |
| 3 | 1 | 0 | 1 | 1 | 1 | ((r2:1.13896e+06,r0:1.13... | ((r1:0.00734369332777404... |
| 4 | 1 | 1 | 49 | 48 | 3 | ((r2:1.13896e+06,r0:1.13... | ((r1:0.00734369332777404... |
| 5 | 1 | 49 | 232 | 183 | 16 | ((r2:1.13896e+06,r0:1.13... | ((r1:0.00734369332777404... |
| 6 | 1 | 232 | 311 | 79 | 7 | ((r2:1.13896e+06,r0:1.13... | ((r1:0.00734369332777404... |
| 7 | 1 | 311 | 500 | 189 | 8 | ((r2:1.13896e+06,r0:1.13... | ((r1:0.00734369332777404... |
| 8 | 2 | 0 | 102 | 102 | 2 | (r4:1.14352e+06,((r3:690... | (r1:0.036231736161195168... |
| 9 | 2 | 102 | 308 | 206 | 21 | (r1:4.36661e+06,(r4:1.14... | (r1:0.036231736161195168... |
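
Each row of the table is one genealogy interval within a locus, so a per-locus summary shows how many distinct genealogies a single inferred tree has to represent. The sketch below is only an illustration: it builds a small stand-in DataFrame (named `df` here, a hypothetical substitute for `model.df`) from the numeric columns of the ten rows shown above and aggregates them with pandas. The tree columns are left out for brevity.

```python
import pandas as pd

# Hypothetical stand-in for model.df: numeric columns copied from the
# ten rows displayed above (genealogy and inferred_tree columns omitted).
df = pd.DataFrame({
    "locus": [0, 0, 0, 1, 1, 1, 1, 1, 2, 2],
    "start": [0, 20, 207, 0, 1, 49, 232, 311, 0, 102],
    "end":   [20, 207, 500, 1, 49, 232, 311, 500, 102, 308],
    "nbps":  [20, 187, 293, 1, 48, 183, 79, 189, 102, 206],
    "nsnps": [2, 6, 15, 1, 3, 16, 7, 8, 2, 21],
})

# One row per locus: number of genealogy intervals, total length,
# total SNP count, and the length of the longest single interval.
summary = df.groupby("locus").agg(
    ngenealogies=("start", "size"),
    total_bps=("nbps", "sum"),
    total_snps=("nsnps", "sum"),
    longest_interval=("nbps", "max"),
)
print(summary)
```

In this excerpt locus 0 is split into three intervals, the longest covering 293 of its 500 sites. Locus 2 is cut off by the displayed rows, so its totals here are incomplete.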


### Compare trees

The true genealogies vary across the locus, both in coalescent times and in topology. Neighboring genealogies are highly correlated, however, much more so than genealogies from different unlinked loci.

In [23]:
# the true genealogies at locus 0
mtre = toytree.mtree(model.df.genealogy[model.df.locus == 0])
mtre.draw_tree_grid(
    start=0, 
    nrows=1, ncols=3, 
    shared_axis=True,
    edge_type='c',
);

The inferred tree for this locus is shown below (three times) for comparison with the genealogies above. With empirical data we typically cannot identify the exact breakpoints between one genealogical history and the next, so we infer a single gene tree for the entire locus (500 bp in this case). Interestingly, the inferred gene tree topology most closely matches the middle genealogy, even though the last genealogy contains the most SNPs and spans the largest proportion of the locus.

In [22]:
# the gene tree inferred by raxml from the sequence data
mtre = toytree.mtree(model.df.inferred_tree[model.df.locus == 0])
mtre.draw_tree_grid(
    start=0, 
    nrows=1, ncols=3, 
    shared_axis=True,
    edge_type='c',
    tip_labels_align=True,
);
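
The visual comparison can be complemented by a numeric one. The sketch below computes a Robinson-Foulds distance between two Newick strings in plain Python, treating both trees as unrooted because trees inferred by raxml are unrooted while the simulated genealogies are rooted. It is a minimal illustration, not part of ipcoal or toytree: the helper names (`clades`, `splits`, `rf_distance`) and the three example trees are hypothetical, and the parser assumes unique tip names and no internal node labels.

```python
import re

def clades(newick):
    """Return (clades, tips): one tip-name set per internal node, plus all tips.

    Assumes unique tip names and no internal node labels (hypothetical helper).
    """
    # Drop branch lengths such as ':1.08352e+06' and the trailing ';'.
    text = re.sub(r":[0-9.eE+-]+", "", newick.strip().rstrip(";"))
    stack, found, name = [set()], set(), ""
    for ch in text:
        if ch == "(":
            stack.append(set())          # open a new internal node
        elif ch in ",)":
            if name:
                stack[-1].add(name)      # finish the tip name being read
            name = ""
            if ch == ")":
                done = stack.pop()       # close the node, record its tips
                found.add(frozenset(done))
                stack[-1] |= done
        else:
            name += ch
    return found, frozenset(stack[0])

def splits(newick):
    """Non-trivial bipartitions of the tree, independent of where it is rooted."""
    found, tips = clades(newick)
    ref = min(tips)                      # always report the side without `ref`
    out = set()
    for clade in found:
        side = tips - clade if ref in clade else clade
        if 1 < len(side) < len(tips) - 1:
            out.add(side)
    return out

def rf_distance(tree_a, tree_b):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))

# Hypothetical example trees (not taken from the simulation above).
true_a = "((r0:1,r1:1):2,((r2:1,r3:1):1,r4:2):1);"
true_b = "((r0:1,r2:1):2,((r1:1,r3:1):1,r4:2):1);"
inferred = "(r2:0.1,(r3:0.1,(r4:0.2,(r0:0.1,r1:0.1):0.3):0.1):0.05);"

print(rf_distance(true_a, inferred))
print(rf_distance(true_b, inferred))
```

A distance of zero means the two trees share every bipartition even if they are drawn with different roots. Applied to the full Newick strings in the `genealogy` and `inferred_tree` columns, a function like this could rank the genealogies within a locus by how closely each matches the inferred tree.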