## ipcoal: simulation and analysis of genealogies and gene trees

The `ipcoal` Python package provides a framework for simulating and analyzing genealogies and inferred gene trees under complex demographic scenarios. You can define demographic models representing population histories, species trees, or networks using a newick file that can be visualized in `toytree`. `ipcoal` parses the model parameters to set up a simulation in `msprime`, which generates a distribution of genealogies on which SNPs, loci, or chromosomes can be simulated. The simulated sequence data can be saved to disk in a variety of formats, or gene tree analyses can be automated and parallelized to infer empirical gene trees from the simulated data. The resulting true genealogies, summary statistics, and inferred trees are returned by `ipcoal` as a pandas DataFrame for further statistical analysis.

### Required software
All required software can be installed with the following conda command. 

In [1]:
# conda install ipcoal -c eaton-lab -c conda-forge

In [2]:
import toytree
import ipcoal

### The main functions of *ipcoal*
You start by initializing a `Model` class object with a species tree or network and optional model parameters (e.g., Ne, migration, mutation rate, recombination rate). You can then simulate either loci or SNPs on the genealogies produced under this model. `ipcoal` makes it easy either to write the sequence data to files in a variety of formats or to perform phylogenetic inference on the sequence data directly. You can then compare the true simulated genealogies to the inferred trees.

In [3]:
# init a model Class object for simulations
model = ipcoal.Model(tree=toytree.rtree.unittree(5, treeheight=1e6))

# simulate N unlinked SNPs (will run until N snps are produced)
model.sim_snps(100)

# simulate N loci of len L 
model.sim_loci(10, 10)

# access the genealogies in a table
model.df

# save to a CSV table
model.df.to_csv("./tree_table.csv")

# access the sequence data in an array
model.seqs

# write loci as separate phylip files to a directory
model.write_loci_to_phylip(outdir="./tests/")

# write concatenated loci or snps to a phylip file
model.write_seqs_to_phylip("test.phy")

# infer a tree for every locus
# model.infer_gene_trees(method='raxml')

# infer a species tree from the concatenated data
# model.infer_concatenation_tree(method='raxml')

# compare inferred gene trees to true genealogies (call after running infer_gene_trees)
# model.calculate_rf_dists()

wrote 10 loci (5 x 10bp) to home/deren/physeqs/notebooks/tests/[...].phy
wrote concatenated sequence file (5 x 100bp) to /home/deren/physeqs/notebooks/test.phy
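
Comparing true genealogies with inferred trees, as `calculate_rf_dists` is named for, is usually done with the Robinson-Foulds (RF) distance. The block below is a minimal pure-Python sketch of that metric, not ipcoal's own implementation. It assumes trees are written as nested tuples of tip names rather than newick strings, and the two topologies are hypothetical.

```python
def collect_clades(tree, clades):
    """Return the set of tip names below `tree`, recording each internal clade."""
    if isinstance(tree, str):  # a tip
        return frozenset([tree])
    clade = frozenset().union(*(collect_clades(child, clades) for child in tree))
    clades.add(clade)
    return clade


def splits(tree):
    """Return the non-trivial bipartitions of a tree, ignoring the root position."""
    clades = set()
    tips = collect_clades(tree, clades)
    ref = min(tips)  # arbitrary reference tip used to orient every split
    result = set()
    for clade in clades:
        # always store the side of the split that lacks the reference tip
        side = tips - clade if ref in clade else clade
        if 2 <= len(side) <= len(tips) - 2:
            result.add(side)
    return result


def rf_distance(tree_a, tree_b):
    """Number of bipartitions found in one tree but not the other."""
    return len(splits(tree_a) ^ splits(tree_b))


# hypothetical 5-tip topologies standing in for a true and an inferred tree
true_tree = ((("r0", "r1"), "r2"), ("r3", "r4"))
inferred_tree = ((("r0", "r2"), "r1"), ("r3", "r4"))

print(rf_distance(true_tree, true_tree))      # 0
print(rf_distance(true_tree, inferred_tree))  # 2
```

A distance of 0 means the two trees share every split. The maximum for two unrooted binary trees with n tips is 2(n - 3).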


### Define a species/population tree
Node heights should be in units of generations. 

In [4]:
# generate a random 6-tip tree with root height of 1M generations
tree = toytree.rtree.coaltree(6, seed=222).mod.node_scale_root_height(1e6)

# draw tree showing idx labels
tree.draw(tree_style='c');

### Define an ipcoal simulation model
Here you define the demographic model by setting a global Ne value (which overrides any Ne values stored on the tree) along with the mutation and recombination rates. You can also define admixture scenarios with a simple syntax: a list of tuples, each of the form (source, dest, edge_prop, rate). The edge_prop value is a float giving the proportion of the length of the edge shared by the two taxa, measured from the recent end toward the past, at which the migration pulse takes place. For example, (7, 4, 0.5, 0.1) means that 10% of population 7 migrates into population 4 (backwards in time) at the midpoint of their shared edge.

In [5]:
model = ipcoal.Model(
    tree,
    Ne=1e6,
    mut=1e-8,
    recomb=1e-9,
    seed=123,
    admixture_edges=[(6, 4, 0.5, 0.1)],
)
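
To make the meaning of edge_prop concrete, here is a minimal sketch of how a proportion along a shared edge translates into a pulse time. The function name and the interval bounds are hypothetical and are not part of the ipcoal API. In practice the interval is determined by the node heights of the tree.

```python
def pulse_time(edge_start, edge_end, edge_prop):
    """Return the time (in generations ago) of an admixture pulse.

    edge_start, edge_end: hypothetical bounds of the interval, from recent
        to past, during which both the source and dest edges exist.
    edge_prop: proportion of that interval, measured from the recent end.
    """
    if not 0.0 <= edge_prop <= 1.0:
        raise ValueError("edge_prop must be between 0 and 1")
    return edge_start + edge_prop * (edge_end - edge_start)


# hypothetical shared interval from 200,000 to 600,000 generations ago
print(pulse_time(200_000, 600_000, 0.5))  # 400000.0, the midpoint
```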

### Simulate genealogies and sequences for N independent loci of length L
Because our simulation includes recombination, each locus may represent multiple genealogical histories. You can see this in the dataframe below, where locus 0 is represented by 5 genealogies.

In [6]:
# run the simulation
model.sim_loci(nloci=10, nsites=500)

In [7]:
# view the genealogies and their summary stats
model.df.head(10)

Unnamed: 0,loc,start,end,nbps,nsnps,genealogy
0,0,0,104,104,7,"(r4:2676110.71983299450949,(r3:1119930.5197808..."
1,0,104,234,130,11,"((r0:2208173.71173396287486,r4:2208173.7117339..."
2,0,234,257,23,3,"((r4:2208173.71173396287486,(r0:1349517.661202..."
3,0,257,438,181,20,"((r3:2208173.71173396287486,(r0:1865495.190384..."
4,0,438,500,62,6,"((r3:2208173.71173396287486,r0:2208173.7117339..."
5,1,0,146,146,13,"((r2:1310300.31713711423799,r3:1310300.3171371..."
6,1,146,163,17,3,"((r1:840626.75758881168440,(r4:802206.53257372..."
7,1,163,175,12,2,"((r2:1310300.31713711423799,r3:1310300.3171371..."
8,1,175,212,37,2,"(r0:2790036.30116437561810,((r1:840626.7575888..."
9,1,212,242,30,3,"((r1:840626.75758881168440,(r4:802206.53257372..."
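
The rows above can be summarized per locus. The sketch below re-enters the ten rows shown (leaving out the truncated genealogy column) and uses only the standard library, so it runs without ipcoal. With the real table, `model.df.groupby("loc").size()` would give the same per-locus counts.

```python
import csv
import io
from collections import Counter

# the ten rows shown above, without the (truncated) genealogy column
table = """loc,start,end,nbps,nsnps
0,0,104,104,7
0,104,234,130,11
0,234,257,23,3
0,257,438,181,20
0,438,500,62,6
1,0,146,146,13
1,146,163,17,3
1,163,175,12,2
1,175,212,37,2
1,212,242,30,3
"""

rows = [
    {key: int(value) for key, value in row.items()}
    for row in csv.DictReader(io.StringIO(table))
]

# each row is one genealogy spanning the interval from start to end
assert all(r["nbps"] == r["end"] - r["start"] for r in rows)

# genealogies and SNPs per locus, among the rows shown only
# (locus 1 continues past row 9, so its totals here are partial)
n_genealogies = Counter(r["loc"] for r in rows)
n_snps = Counter()
for r in rows:
    n_snps[r["loc"]] += r["nsnps"]

print(dict(n_genealogies))  # {0: 5, 1: 5}
print(dict(n_snps))         # {0: 47, 1: 23}
```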


In [8]:
# view the sequence array for the first locus (first 20 bp)
model.seqs[0, :, :20]

array([[1, 1, 1, 1, 0, 2, 2, 1, 1, 0, 1, 3, 0, 0, 2, 3, 0, 0, 3, 3],
       [1, 1, 1, 1, 0, 2, 2, 1, 1, 0, 1, 3, 0, 0, 2, 3, 0, 0, 3, 3],
       [1, 1, 1, 1, 0, 2, 2, 1, 1, 0, 1, 3, 0, 0, 2, 3, 0, 0, 3, 3],
       [1, 1, 1, 1, 0, 2, 2, 1, 1, 0, 1, 3, 0, 0, 2, 3, 0, 0, 3, 3],
       [1, 1, 1, 1, 1, 2, 2, 1, 1, 0, 1, 3, 0, 0, 2, 3, 0, 0, 3, 3],
       [1, 1, 1, 1, 0, 2, 2, 1, 1, 0, 1, 3, 0, 0, 2, 3, 0, 0, 3, 3]],
      dtype=uint8)
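
The array stores bases as small integers. The sketch below decodes the six rows shown above and finds the variable sites. The mapping of 0, 1, 2, 3 to A, C, G, T is an assumption made for illustration and is not stated in this notebook.

```python
# ASSUMPTION: integer codes 0-3 map to bases in the order A, C, G, T
BASES = "ACGT"

# the six rows shown above (first 20 sites of locus 0)
rows = [
    [1, 1, 1, 1, 0, 2, 2, 1, 1, 0, 1, 3, 0, 0, 2, 3, 0, 0, 3, 3],
    [1, 1, 1, 1, 0, 2, 2, 1, 1, 0, 1, 3, 0, 0, 2, 3, 0, 0, 3, 3],
    [1, 1, 1, 1, 0, 2, 2, 1, 1, 0, 1, 3, 0, 0, 2, 3, 0, 0, 3, 3],
    [1, 1, 1, 1, 0, 2, 2, 1, 1, 0, 1, 3, 0, 0, 2, 3, 0, 0, 3, 3],
    [1, 1, 1, 1, 1, 2, 2, 1, 1, 0, 1, 3, 0, 0, 2, 3, 0, 0, 3, 3],
    [1, 1, 1, 1, 0, 2, 2, 1, 1, 0, 1, 3, 0, 0, 2, 3, 0, 0, 3, 3],
]


def decode(row):
    """Convert a row of integer codes to a string of bases."""
    return "".join(BASES[code] for code in row)


seqs = [decode(row) for row in rows]

# variable sites are columns in which the samples do not all agree
variable = [i for i, column in enumerate(zip(*rows)) if len(set(column)) > 1]

for seq in seqs:
    print(seq)
print(variable)  # [4]
```

Only site 4 varies in this window: the fifth sample differs from the other five.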

In [9]:
# write all loci as separate phylip files to a directory
model.write_loci_to_phylip()

wrote 10 loci (6 x 500bp) to home/deren/physeqs/notebooks/ipcoal-sims/[...].phy


In [10]:
# write all loci concatenated to a single sequence file
model.write_seqs_to_phylip()

wrote concatenated sequence file (6 x 5000bp) to /home/deren/physeqs/notebooks/test.phy


### Visualize genealogical variation using toytree

In [11]:
# load a multitree object from all genealogies in the table
mtre = toytree.mtree(model.df.genealogy.tolist())

# draw trees from the first locus
#  with 'shared_axis' to show diff in heights
#  with 'fixed_order' to show diff in topology (relative to first tree)
mtre.draw_tree_grid(
    start=0,
    ncols=4, 
    nrows=1,
    shared_axis=True,
    fixed_order=True,
    edge_type='c',
);

# draw trees from the second locus
mtre.draw_tree_grid(
    start=5,
    ncols=4, 
    nrows=1,
    shared_axis=True,
    fixed_order=True,
    edge_type='c',
);

### Simulate N unlinked SNPs

In [12]:
# simulate N unlinked SNPs
model.sim_snps(100)

In [13]:
# the genealogies for each SNP are stored in .df
model.df.head()

Unnamed: 0,loc,start,end,nbps,nsnps,genealogy
0,0,0,1,1,1,"(4:3966045.25836907932535,(3:1647872.083592993..."
1,1,0,1,1,1,"(3:2804552.73206556029618,((2:1385073.89052010..."
2,2,0,1,1,1,"((4:1125746.36276281694882,5:1125746.362762816..."
3,3,0,1,1,1,"(5:5342401.82635926455259,(3:1518004.052378125..."
4,4,0,1,1,1,"((5:1382277.29472179431468,(2:1026643.67880559..."


In [14]:
# the snp array is stored in .seqs
model.seqs[:, :20]

array([[0, 0, 3, 1, 1, 2, 1, 1, 0, 1, 2, 3, 2, 2, 1, 2, 3, 0, 0, 1],
       [0, 1, 3, 1, 3, 1, 3, 1, 1, 1, 2, 3, 1, 2, 3, 2, 3, 0, 2, 1],
       [0, 1, 3, 1, 3, 2, 1, 1, 1, 1, 3, 0, 1, 0, 1, 2, 3, 0, 2, 1],
       [2, 0, 3, 1, 3, 2, 1, 1, 1, 1, 2, 3, 1, 1, 1, 2, 3, 2, 0, 1],
       [0, 1, 1, 0, 3, 1, 1, 3, 1, 1, 2, 3, 2, 0, 3, 0, 0, 2, 2, 1],
       [0, 1, 3, 1, 1, 2, 1, 3, 1, 2, 2, 3, 2, 0, 1, 0, 3, 2, 0, 0]],
      dtype=uint8)

In [15]:
# write the snps array as a phylip file
model.write_seqs_to_phylip()

wrote concatenated sequence file (6 x 100bp) to /home/deren/physeqs/notebooks/test.phy


### Infer gene trees 
Writing the sequence data to disk is optional, and some analyses do not require it, because *ipcoal* has built-in tools for inferring gene trees from the sequence data while it is held in memory. This allows a simple, reproducible workflow that depends only on the random seed used for your analysis, with no need to upload your simulated files to Dryad at the end of your project.

When you call one of the *inference* methods, it fills a new column in your dataframe called **inferred_trees**.

In [16]:
model.infer_gene_trees(inference_method="raxml", inference_args={"N": 100, "x": 123})

In [19]:
model.df.head(10)

Unnamed: 0,loc,start,end,nbps,nsnps,genealogy
0,0,0,1,1,1,"(4:3966045.25836907932535,(3:1647872.083592993..."
1,1,0,1,1,1,"(3:2804552.73206556029618,((2:1385073.89052010..."
2,2,0,1,1,1,"((4:1125746.36276281694882,5:1125746.362762816..."
3,3,0,1,1,1,"(5:5342401.82635926455259,(3:1518004.052378125..."
4,4,0,1,1,1,"((5:1382277.29472179431468,(2:1026643.67880559..."
5,5,0,1,1,1,"((6:1646353.03948009829037,(1:1205429.41899846..."
6,6,0,1,1,1,"(2:6652509.93418486695737,(1:2448769.050822679..."
7,7,0,1,1,1,"((5:2130976.21255934098735,6:2130976.212559340..."
8,8,0,1,1,1,"((5:881032.66432007553522,6:881032.66432007553..."
9,9,0,1,1,1,"((2:1131963.00230873096734,3:1131963.002308730..."


In [None]:
# save the dataframe with the inferred trees 
model.df.to_csv("./tree_table.csv")

### Write data as a site count matrix (*sensu* SVDquartets)

In [18]:
# for idx, mat in enumerate(snps.reshape((5,16,16))):
#     toyplot.matrix(mat, label="Matrix " + str(idx), colorshow=True);
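
As a rough illustration of what a 16 x 16 site count matrix holds, the sketch below counts site patterns for one quartet of samples, using the first four rows of the SNP array shown earlier. It is not ipcoal's implementation. The layout, with rows indexed by the states of samples (a, b) and columns by the states of samples (c, d), is an assumption.

```python
# first four rows of the SNP array shown earlier (20 sites)
snps = [
    [0, 0, 3, 1, 1, 2, 1, 1, 0, 1, 2, 3, 2, 2, 1, 2, 3, 0, 0, 1],
    [0, 1, 3, 1, 3, 1, 3, 1, 1, 1, 2, 3, 1, 2, 3, 2, 3, 0, 2, 1],
    [0, 1, 3, 1, 3, 2, 1, 1, 1, 1, 3, 0, 1, 0, 1, 2, 3, 0, 2, 1],
    [2, 0, 3, 1, 3, 2, 1, 1, 1, 1, 2, 3, 1, 1, 1, 2, 3, 2, 0, 1],
]


def site_count_matrix(seqs, quartet):
    """Count site patterns for the split (a, b | c, d) in a 16 x 16 matrix.

    ASSUMED layout: row = 4 * state(a) + state(b),
                    col = 4 * state(c) + state(d).
    """
    a, b, c, d = (seqs[i] for i in quartet)
    matrix = [[0] * 16 for _ in range(16)]
    for sa, sb, sc, sd in zip(a, b, c, d):
        matrix[4 * sa + sb][4 * sc + sd] += 1
    return matrix


matrix = site_count_matrix(snps, (0, 1, 2, 3))

# every site falls into exactly one cell
print(sum(map(sum, matrix)))  # 20
# sites where all four samples carry state 1
print(matrix[5][5])           # 4
```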