## Validation of ipcoal sequence simulator

In this notebook we run `ipcoal` simulations that use either `seqgen` or our `seqmodel` as the underlying algorithm for mutating sequences as they evolve on trees. We show that, under all substitution models currently supported, the `ipcoal` implementation matches the output of `seqgen`.

In [1]:
import ipcoal
import toytree
import numpy as np

### A species tree
We will evolve sequences on genealogies that are sampled from a species tree. An example species tree generated with `toytree` is shown below.


In [2]:
# generate a random species tree topology
tree = toytree.rtree.unittree(ntips=8, treeheight=1e6, seed=123)

# draw the species tree
canvas, axes = tree.draw(ts='p');

# add a title
canvas.text(
    x=canvas.width / 2., 
    y=20,
    text="Species tree", 
    style={"font-size": "14px"},
);

### Define a demographic model based on the species tree
Using `ipcoal` we can sample a single genealogy under a demographic model defined by the divergence times in the species tree above, and with an effective population size parameter applied to all edges of the tree. 

In [3]:
# define an ipcoal model, simulate trees and show result table
mod = ipcoal.Model(tree=tree, Ne=1e5, seed=1234)
mod.sim_trees(1)
mod.df.head()

Unnamed: 0,locus,start,end,nbps,nsnps,genealogy
0,0,0,1,1,0,"((r1:859988,(r3:666450,(..."


### Draw one simulated genealogy

Here I use the `fixed_order=...` argument to toytree so that it will plot the tips in the same order as in the species tree above. This makes it easier to see the differences between the two trees. 

In [4]:
# load the resulting genealogy as a toytree
genealogy = toytree.tree(mod.df.genealogy[0], fixed_order=tree.get_tip_labels())

# draw the tree
canvas, axes = genealogy.draw(ts='c', tip_labels=True);

# add a title
canvas.text(
    x=canvas.width / 2., 
    y=20,
    text="Gene tree", 
    style={"font-size": "14px"},
);

### Simulate sequence data
In addition to the `sim_trees()` function call used above, which only samples genealogies evolving under the defined model, `ipcoal` can also simulate SNPs or loci evolving on genealogies through the function calls `.sim_snps()` or `.sim_loci()`. In this case a Markov model of molecular substitution is applied to mutate sites along the edges of the tree. The default is to evolve sites under the Jukes-Cantor model, but you can provide additional parameters to implement more complex models, similar to the `seqgen` program.

In [7]:
# init the model
mod = ipcoal.Model(
    tree, 
    Ne=1e5,
    mut=1e-8,
    recomb=0,
    substitution_model={
        "state_frequencies": (0.25, 0.25, 0.25, 0.25),
        "kappa": 1.0,
    },
    seed=123,
)
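
To make the idea of a Markov substitution model on a tree edge concrete, here is a minimal sketch for the Jukes-Cantor case. It is not the `ipcoal` or `seqgen` code: the function names `jc_transition_probs` and `evolve_along_edge` are hypothetical, and the branch length is assumed to be expressed in expected substitutions per site (generations multiplied by `mut`).

```python
import numpy as np

def jc_transition_probs(branch_length):
    """4x4 matrix of P(child state | parent state) under Jukes-Cantor."""
    decay = np.exp(-4.0 * branch_length / 3.0)
    probs = np.full((4, 4), 0.25 * (1.0 - decay))
    np.fill_diagonal(probs, 0.25 + 0.75 * decay)
    return probs

def evolve_along_edge(parent_states, branch_length, rng):
    """Sample a child sequence from integer-coded parent states (0..3)."""
    probs = jc_transition_probs(branch_length)
    # cumulative rows let us draw every site with one vectorized comparison
    cumulative = probs[parent_states].cumsum(axis=1)
    draws = rng.random(parent_states.size)
    # count how many thresholds each draw exceeds; cap at 3 against rounding
    return np.minimum((draws[:, None] > cumulative).sum(axis=1), 3)

rng = np.random.default_rng(123)
root = rng.integers(0, 4, size=100_000)
child = evolve_along_edge(root, branch_length=0.01, rng=rng)

# expected proportion of differing sites for this branch length
expected = 0.75 * (1.0 - np.exp(-4.0 * 0.01 / 3.0))
print("observed:", np.mean(root != child), "expected:", expected)
```

Applying this step to each edge from the root to the tips yields sequences at the tips; more complex models replace the closed-form probabilities with ones derived from the rate matrix.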

You can view a summary of the substitution model after initializing the `ipcoal` model object to see the effect of substitution model parameters on the instantaneous rate matrix. 

In [8]:
mod.get_substitution_model_summary()

state_frequencies:
    A     C     G     T
 0.25  0.25  0.25  0.25

kappa: 1.0
ts/tv: 0.5

instantaneous transition rate matrix:
        A       C       G       T
A -1.0000  0.3333  0.3333  0.3333
C  0.3333 -1.0000  0.3333  0.3333
G  0.3333  0.3333 -1.0000  0.3333
T  0.3333  0.3333  0.3333 -1.0000
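
The summary above can be reproduced by hand. The sketch below uses the textbook HKY85 parameterization (off-diagonal rate proportional to the target state's frequency, multiplied by `kappa` for transitions) and rescales the matrix to one expected substitution per unit time. The helper names `hky_rate_matrix` and `ts_tv_ratio` are hypothetical and this is not necessarily how `ipcoal` builds the matrix internally, but it gives the same numbers as the summary.

```python
import numpy as np

def hky_rate_matrix(state_frequencies, kappa):
    """Normalized HKY85 rate matrix with states ordered A, C, G, T."""
    pi = np.asarray(state_frequencies, dtype=float)
    # transitions are A<->G and C<->T; all other changes are transversions
    transitions = {(0, 2), (2, 0), (1, 3), (3, 1)}
    q = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            if i != j:
                q[i, j] = pi[j] * (kappa if (i, j) in transitions else 1.0)
    # each diagonal entry makes its row sum to zero
    np.fill_diagonal(q, -q.sum(axis=1))
    # rescale so the mean rate at equilibrium is one substitution per unit time
    mean_rate = -np.sum(pi * np.diag(q))
    return q / mean_rate

def ts_tv_ratio(state_frequencies, q):
    """Ratio of the equilibrium transition flux to the transversion flux."""
    pi = np.asarray(state_frequencies, dtype=float)
    flux = pi[:, None] * q  # flux[i, j] = pi_i * q_ij
    ts = flux[0, 2] + flux[2, 0] + flux[1, 3] + flux[3, 1]
    tv = flux.sum() - np.trace(flux) - ts
    return ts / tv

freqs = (0.25, 0.25, 0.25, 0.25)
jc = hky_rate_matrix(freqs, kappa=1.0)
print(np.round(jc, 4))
print("ts/tv:", round(ts_tv_ratio(freqs, jc), 2))
```

With equal frequencies and `kappa=1.0` the model reduces to Jukes-Cantor, which is why every off-diagonal rate is 1/3.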


### Evolve sequences in ipcoal using the `SeqModel` class

First we generate data using the pure Python implementation in `ipcoal`, which we call `SeqModel`. Then we compare the results with data generated under the same parameter settings in `seqgen`. The evolutionary process is highly stochastic, so to check that the two classes return similar results we simulate data on a single genealogy (`nloci=1`) with many sites (`nsites=1e6`).

In [8]:
# simulate one locus
mod.sim_loci(nloci=1, nsites=1e6)

# calculate genetic distances
seqmodel_dists = mod.get_pairwise_distances()
seqmodel_dists

Unnamed: 0,r0,r1,r2,r3,r4,r5,r6,r7
r0,0.0,0.022301,0.02216,0.022169,0.022323,0.022057,0.02231,0.022092
r1,0.022301,0.0,0.017545,0.019885,0.013918,0.019766,0.020014,0.019807
r2,0.02216,0.017545,0.0,0.019803,0.017598,0.019655,0.019936,0.019715
r3,0.022169,0.019885,0.019803,0.0,0.01994,0.015873,0.012043,0.01593
r4,0.022323,0.013918,0.017598,0.01994,0.0,0.019818,0.020076,0.019868
r5,0.022057,0.019766,0.019655,0.015873,0.019818,0.0,0.016016,0.01462
r6,0.02231,0.020014,0.019936,0.012043,0.020076,0.016016,0.0,0.016069
r7,0.022092,0.019807,0.019715,0.01593,0.019868,0.01462,0.016069,0.0


### Evolve sequences in ipcoal using the `SeqGen` class
Here we implement the same model but use a subprocess call to pass the genealogy and substitution model arguments to the `seqgen` binary, which performs the sequence simulation.


In [7]:
# re-init the model (we want to start from the same seed)
mod = ipcoal.Model(
    tree, 
    Ne=1e5,
    mut=1e-8,
    recomb=0,
    substitution_model={
        "state_frequencies": (0.25, 0.25, 0.25, 0.25),
        "kappa": 1.0,
    },
    seed=123,
)

# simulate one locus this time using seqgen
mod.sim_loci(nloci=1, nsites=1e6, seqgen=True)

# calculate genetic distances
seqgen_dists = mod.get_pairwise_distances()
seqgen_dists

Unnamed: 0,r0,r1,r2,r3,r4,r5,r6,r7
r0,0.0,0.026128,0.015953,0.026125,0.025972,0.02613,0.026092,0.025935
r1,0.026128,0.0,0.026053,0.022988,0.01039,0.023034,0.022971,0.022825
r2,0.015953,0.026053,0.0,0.026024,0.025885,0.02604,0.025972,0.025838
r3,0.026125,0.022988,0.026024,0.0,0.022857,0.01607,0.011739,0.015832
r4,0.025972,0.01039,0.025885,0.022857,0.0,0.022872,0.022824,0.022677
r5,0.02613,0.023034,0.02604,0.01607,0.022872,0.0,0.01604,0.012187
r6,0.026092,0.022971,0.025972,0.011739,0.022824,0.01604,0.0,0.01581
r7,0.025935,0.022825,0.025838,0.015832,0.022677,0.012187,0.01581,0.0


### Are the results close enough within random expectations?

In [12]:
# are the values equal to within a loose relative tolerance (10%)?
np.allclose(seqmodel_dists, seqgen_dists, rtol=1e-1)

False
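
To see why the check fails, it helps to spell out what `np.allclose` tests: every element must satisfy `|a - b| <= atol + rtol * |b|`. The sketch below applies that rule by hand to a single pair of values, the r0-r1 distances copied from the two tables above.

```python
import numpy as np

# r0-r1 distance from each table above
seqmodel_r0_r1 = 0.022301
seqgen_r0_r1 = 0.026128

# np.allclose(a, b) requires |a - b| <= atol + rtol * |b| for every element
rtol, atol = 1e-1, 1e-8
abs_diff = abs(seqmodel_r0_r1 - seqgen_r0_r1)
allowed = atol + rtol * abs(seqgen_r0_r1)

print(f"absolute difference: {abs_diff:.6f}")
print(f"allowed difference:  {allowed:.6f}")
print("passes:", np.allclose(seqmodel_r0_r1, seqgen_r0_r1, rtol=rtol, atol=atol))
```

This one entry differs by roughly 15% of its value, which is enough on its own to make the whole-matrix comparison return `False`.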

In [13]:
import toyplot
toyplot.matrix(
    seqgen_dists - seqmodel_dists,
    width=400, height=400,
    margin=10,
);

### Do the results become more similar as more data is simulated?

In [9]:
from concurrent.futures import ProcessPoolExecutor

In [None]:
def func(tree, seed, rep, nsites):
    # re-init the model (we want to start from the same seed)
    mod1 = ipcoal.Model(
        tree, 
        Ne=1e5,
        mut=1e-8,
        recomb=0,
        substitution_model={
            "state_frequencies": (0.25, 0.25, 0.25, 0.25),
            "kappa": 1.0,
        },
        seed=seed,
    )

    # simulate one locus
    mod1.sim_loci(nloci=1, nsites=nsites)

    # calculate genetic distances
    seqmod_dists = mod1.get_pairwise_distances()

    # re-init the model (we want to start from the same seed)
    mod2 = ipcoal.Model(
        tree, 
        Ne=1e5,
        mut=1e-8,    
        recomb=0,
        substitution_model={
            "state_frequencies": (0.25, 0.25, 0.25, 0.25),
            "kappa": 1.0,
        },
        seed=seed,
    )
    # simulate one locus
    mod2.sim_loci(nloci=1, nsites=nsites, seqgen=True)

    # calculate genetic distances
    seqgen_dists = mod2.get_pairwise_distances()

    # get distance between matrices
    dist1 = (seqgen_dists - seqmod_dists).abs().sum().sum()
    dist2 = (seqgen_dists - seqmod_dists).values.flatten().mean()
    return nsites, rep, dist1, dist2
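
The two summaries returned by `func` measure different things: `dist1` is the total absolute difference, while `dist2` is the mean signed difference, in which positive and negative entries can cancel. The toy matrices below are made-up values chosen only to illustrate this; they are not simulation results.

```python
import pandas as pd

labels = ["r0", "r1", "r2"]

# made-up symmetric distance matrices (illustration only)
seqmod_toy = pd.DataFrame(
    [[0.0, 0.020, 0.030], [0.020, 0.0, 0.010], [0.030, 0.010, 0.0]],
    index=labels, columns=labels,
)
seqgen_toy = pd.DataFrame(
    [[0.0, 0.022, 0.028], [0.022, 0.0, 0.010], [0.028, 0.010, 0.0]],
    index=labels, columns=labels,
)

diff = seqgen_toy - seqmod_toy

# same two summaries as in func()
total_abs = diff.abs().sum().sum()          # always >= 0, grows with matrix size
mean_signed = diff.values.flatten().mean()  # near 0 when errors are unbiased

print("sum of absolute differences:", total_abs)
print("mean signed difference:", mean_signed)
```

A mean signed difference near zero therefore indicates an absence of systematic bias between the two simulators, not that individual entries agree.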

In [10]:
with ProcessPoolExecutor(max_workers=8) as executor:
    # simulate many replicates of diff size to fill dataframe
    idx = 0
    results = {}
    for nsites in [1e4, 5e5, 1e6, 2e6]:
        for rep in range(8):

            # random seed for this rep
            seed = np.random.randint(0, 1e8)
            future = executor.submit(func, *(tree, seed, rep, nsites))
            results[idx] = future
            idx += 1
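
Because the loop above draws seeds from NumPy's unseeded global generator, each run of the notebook uses a different set of replicates. One possible way to make the replicate set reproducible is to draw all seeds up front from a seeded generator, as sketched here. The master seed `2020` and the `jobs` list are hypothetical names introduced for illustration.

```python
import numpy as np

# hypothetical master seed controlling the whole experiment
rng = np.random.default_rng(2020)

nsites_values = [10_000, 500_000, 1_000_000, 2_000_000]
nreps = 8

# one independent seed per (nsites, rep) combination
seeds = rng.integers(0, 10**8, size=(len(nsites_values), nreps))

# flat list of (nsites, rep, seed) tuples, ready to hand to executor.submit
jobs = [
    (nsites, rep, int(seeds[i, rep]))
    for i, nsites in enumerate(nsites_values)
    for rep in range(nreps)
]
print(len(jobs), "jobs; first job:", jobs[0])
```

Each tuple can then be unpacked into the arguments of `func` together with the tree.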

In [11]:
import pandas as pd

# store result in dataframe
data = pd.DataFrame({
    "nsites": np.zeros(8 * 4),
    "rep": np.zeros(8 * 4),
    "dist": np.zeros(8 * 4, dtype=float),
})

In [12]:
for idx in results:
    nsites, rep, dist1, dist2 = results[idx].result()
    # use .loc to avoid chained assignment, which may not write to the dataframe
    data.loc[idx, "nsites"] = nsites
    data.loc[idx, "rep"] = rep
    data.loc[idx, "dist"] = dist2

In [14]:
data

Unnamed: 0,nsites,rep,dist
0,10000.0,0.0,0.000566
1,10000.0,1.0,-0.002659
2,10000.0,2.0,-0.000169
3,10000.0,3.0,-0.000616
4,10000.0,4.0,-0.000847
5,10000.0,5.0,-0.001416
6,10000.0,6.0,-0.001053
7,10000.0,7.0,-0.002369
8,500000.0,0.0,8.3e-05
9,500000.0,1.0,0.000198


In [15]:
data.groupby("nsites").apply(np.mean)

nsites (group),nsites,rep,dist
10000.0,10000.0,3.5,-0.00107
500000.0,500000.0,3.5,4.8e-05
1000000.0,1000000.0,3.5,-1.6e-05
2000000.0,2000000.0,3.5,-9e-06
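
The shrinking differences are in line with ordinary sampling error. As a rough sketch, assume sites are independent given the genealogy, so that an observed distance `p` estimated from `nsites` sites has a binomial standard error of `sqrt(p * (1 - p) / nsites)`. The value `p = 0.02` is taken as representative of the distances reported earlier. Entries of a distance matrix are correlated, so this is an order-of-magnitude guide and not a formal test.

```python
import numpy as np

p = 0.02  # representative pairwise distance from the tables above

standard_errors = {}
for nsites in [1e4, 5e5, 1e6, 2e6]:
    # binomial standard error of one estimated distance
    se_single = np.sqrt(p * (1 - p) / nsites)
    # difference of two independent estimates of the same distance
    se_diff = np.sqrt(2) * se_single
    standard_errors[int(nsites)] = (se_single, se_diff)
    print(f"nsites={int(nsites):>8}  SE(distance)={se_single:.6f}  SE(difference)={se_diff:.6f}")
```

At `nsites=1e4` the expected error is on the order of 1e-3, the same scale as the per-replicate values in the table, and it falls with the square root of the number of sites.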


In [23]:
import toyplot

c, a, m = toyplot.scatterplot(
    data.nsites,
    data.dist, 
    height=300, 
    width=350, 
    size=10, 
    opacity=0.6,
    xscale='log',
    #yscale='log',
    #marker="-",
    #mstyle={"stroke-width": 2}
);
a.hlines(0)


<toyplot.mark.AxisLines at 0x7f6af77da240>

In [124]:
import toyplot

toyplot.scatterplot(
    data.nsites,
    data.dist, 
    height=300, width=350, 
    size=10, 
    opacity=0.7,
    xscale='log',
    yscale='log',
);

In [91]:
data.groupby("nsites").dist.apply(np.median)#.mean()

nsites
1000.0       0.315000
10000.0      0.132200
100000.0     0.041170
1000000.0    0.060128
2000000.0    0.145511
3000000.0    0.083587
Name: dist, dtype: float64

In [88]:
data

Unnamed: 0,nsites,rep,dist
0,1000.0,0.0,0.36
1,1000.0,1.0,0.312
2,1000.0,2.0,0.318
3,1000.0,3.0,0.45
4,1000.0,4.0,0.29
5,1000.0,5.0,0.356
6,1000.0,6.0,0.258
7,1000.0,7.0,0.274
8,1000.0,8.0,0.48
9,1000.0,9.0,0.256


### Apply a more complex model

In [15]:
# init the model
mod = ipcoal.Model(
    tree, 
    Ne=1e5,
    mut=1e-8,
    substitution_model={
        "state_frequencies": [0.15, 0.35, 0.35, 0.15],
        "kappa": 3.0,
    },
)
mod.get_substitution_model_summary()

state_frequencies:
    A     C     G     T
 0.15  0.35  0.35  0.15

kappa: 3.0
ts/tv: 1.26

instantaneous transition rate matrix:
        A       C       G       T
A -1.3717  0.3097  0.9292  0.1327
C  0.1327 -0.8407  0.3097  0.3982
G  0.3982  0.3097 -0.8407  0.1327
T  0.1327  0.9292  0.3097 -1.3717


In [None]:
import pandas as pd

# store result in dataframe: 6 values of nsites x 10 replicates = 60 rows
data = pd.DataFrame({
    "nsites": np.zeros(10 * 6),
    "rep": np.zeros(10 * 6),
    "dist": np.zeros(10 * 6, dtype=float),
})

# simulate many replicates of diff size to fill dataframe
idx = 0
for nsites in [1e3, 1e4, 1e5, 1e6, 2e6, 3e6]:
    for rep in range(10):
        
        # random seed for this rep
        seed = np.random.randint(0, 1e8)
        
        # re-init the model (we want to start from the same seed)
        mod1 = ipcoal.Model(
            tree, 
            Ne=1e5,
            mut=1e-8,
            substitution_model={
                "state_frequencies": [0.15, 0.35, 0.35, 0.15],
                "kappa": 3.0,
            },
            seed=seed,
        )

        # simulate one locus
        mod1.sim_loci(nloci=1, nsites=nsites)

        # calculate genetic distances
        seqmod_dists = mod1.get_pairwise_distances()

        # re-init the model (we want to start from the same seed)
        mod2 = ipcoal.Model(
            tree, 
            Ne=1e5,
            mut=1e-8,    
            substitution_model={
                "state_frequencies": [0.15, 0.35, 0.35, 0.15],
                "kappa": 3.0,
            },
            seed=seed,
        )
        # simulate one locus
        mod2.sim_loci(nloci=1, nsites=nsites, seqgen=True)

        # calculate genetic distances
        seqgen_dists = mod2.get_pairwise_distances()
        
        # get distance between matrices
        dist = (seqgen_dists - seqmod_dists).abs().sum().sum()
        
        # store results (with .loc to avoid chained assignment) and advance counter
        data.loc[idx, "nsites"] = nsites
        data.loc[idx, "rep"] = rep
        data.loc[idx, "dist"] = dist
        idx += 1

In [None]:
data

In [None]:
import toyplot

toyplot.scatterplot(
    data.dist, 
    data.nsites,
    height=400, width=400, 
    size=10, 
    opacity=0.7,
)

In [17]:
r0 = mod.sim_loci(nloci=100, nsites=100000)
r1 = mod.sim_loci(nloci=100, nsites=100000, seqgen=True)

In [18]:
# are the values equal to within a tight relative tolerance?
np.allclose(
    r0.get_pairwise_distances(), 
    r1.get_pairwise_distances(),
    rtol=1e-8
)

True

In [19]:
# are the JC-corrected values equal to within a tight relative tolerance?
np.allclose(
    r0.get_pairwise_distances(model="JC"), 
    r1.get_pairwise_distances(model="JC"),
    rtol=1e-8
)

True
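
For reference, the standard Jukes-Cantor (JC69) correction converts an observed proportion of differing sites `p` into an estimated number of substitutions per site, `d = -3/4 * ln(1 - 4p/3)`. The sketch below assumes that `model="JC"` refers to this standard correction, which has not been checked against the `ipcoal` source. The helper name `jc69_distance` is hypothetical.

```python
import numpy as np

def jc69_distance(p_distance):
    """Jukes-Cantor corrected distance from a proportion of differing sites."""
    p = np.asarray(p_distance, dtype=float)
    # valid for p < 0.75; the correction accounts for multiple hits per site
    return -0.75 * np.log(1.0 - (4.0 / 3.0) * p)

# uncorrected r0-r1 distances taken from the two tables earlier in the notebook
observed = np.array([0.0, 0.022301, 0.026128])
corrected = jc69_distance(observed)
print(corrected)
```

At the small divergences simulated here the correction changes the distances only slightly, so the corrected and uncorrected comparisons are expected to behave alike.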

### Show that results also match up when using `sim_snps()`

In [30]:
# init the model
mod = ipcoal.Model(
    tree, 
    Ne=1e5,
    mut=1e-8,
    substitution_model={
        "state_frequencies": [0.15, 0.35, 0.35, 0.15],
        "kappa": 3.0,
    },
)

# simulate snps under both classes
r0 = mod.sim_snps(500)
r1 = mod.sim_snps(500, seqgen=True)

# ask if genetic distances are super close
np.allclose(
    r0.get_pairwise_distances(model="JC"), 
    r1.get_pairwise_distances(model="JC"),
    rtol=1e-8
)

True