In [1]:
import toytree
import ipcoal

From Wakeley: https://wakeleylab.oeb.harvard.edu/files/wakeleylab/files/wakeley94.pdf

<i> 
The mean of this distribution is ab, and the variance is ab$^{2}$, making the coefficient of variation, in rate, among sites a$^{-1/2}$. The parameter a can be used to describe the extent of rate variation when gamma distributions are compared. For distributions with the same mean, as a gets smaller the variance and coefficient of variation increase. When a is small, most sites have rates near 0, but a few have very high rates. Alternatively, as a gets larger, the variance and coefficient of variation decrease until the entire distribution is concentrated at a single rate.
</i>
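
The relationships in the quote can be checked numerically. The sketch below is only an illustration with NumPy, not part of ipcoal: the helper name `gamma_rate_summary`, the number of sites, and the 0.01 cutoff for "near 0" are all assumptions made for this example. It holds the mean at 1 by setting b = 1/a and varies only the shape a.

```python
import numpy as np

def gamma_rate_summary(a, b, nsites=200_000, seed=123):
    """Compare theoretical and sampled moments of gamma(shape=a, scale=b) site rates."""
    rng = np.random.default_rng(seed)
    rates = rng.gamma(shape=a, scale=b, size=nsites)
    return {
        "mean_theory": a * b,            # mean = ab
        "var_theory": a * b ** 2,        # variance = ab^2
        "cv_theory": a ** -0.5,          # sd / mean = a^(-1/2)
        "mean_sample": rates.mean(),
        "cv_sample": rates.std() / rates.mean(),
        # hypothetical cutoff used here to illustrate "rates near 0"
        "frac_near_zero": (rates < 0.01).mean(),
    }

# Hold the mean at 1 (b = 1 / a) and vary only the shape parameter.
for a in (0.1, 0.5, 2.0, 50.0):
    s = gamma_rate_summary(a, 1.0 / a)
    print(
        f"a={a:<5} mean={s['mean_sample']:.3f} cv={s['cv_sample']:.3f} "
        f"(theory {s['cv_theory']:.3f}) frac<0.01={s['frac_near_zero']:.3f}"
    )
```

As a shrinks, the sampled coefficient of variation and the fraction of near-zero rates should both grow, matching the behavior described above.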

From the Seq-Gen docs:


<i>
The second model of rate heterogeneity assigns different rates to different sites according to a gamma distribution (Yang, 1993). <b>The distribution is scaled such that the mean rate for all the sites is 1 but the user must supply a parameter which describes its shape.</b> A low value for this parameter ($<$1.0) simulates a large degree of site-specific rate heterogeneity and as this value increases the simulated data becomes more rate-homogeneous. This can be performed as a continuous model, i.e. every site has a different rate sampled from the gamma distribution of the given shape, or as a discrete model, i.e. each site falls into one of N rate categories approximating the gamma distribution. For a review of site-specific rate heterogeneity and its implications for phylogenetic analyses, see Yang (1996). 
</i>
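
To make the continuous versus discrete distinction concrete, here is a minimal sketch using only NumPy. It approximates the discrete rates by Monte Carlo: draw many rates from a mean-1 gamma, split the sorted draws into N equal-probability bins, and use each bin's mean as that category's rate. This is an assumption-laden illustration; Seq-Gen and ipcoal may compute category rates differently (for example analytically), and the function name `discrete_gamma_rates` is hypothetical.

```python
import numpy as np

def discrete_gamma_rates(alpha, ncat, ndraws=1_000_000, seed=0):
    """Approximate mean rates of `ncat` equal-probability gamma categories."""
    rng = np.random.default_rng(seed)
    ndraws = (ndraws // ncat) * ncat  # make the draws split evenly into bins
    # shape=alpha with scale=1/alpha gives a distribution with mean 1
    draws = np.sort(rng.gamma(shape=alpha, scale=1.0 / alpha, size=ndraws))
    cat_rates = draws.reshape(ncat, -1).mean(axis=1)
    # rescale so the mean rate across categories is exactly 1
    return cat_rates / cat_rates.mean()

alpha, nsites = 0.5, 20
rng = np.random.default_rng(123)

# Continuous model: every site gets its own rate.
continuous = rng.gamma(shape=alpha, scale=1.0 / alpha, size=nsites)

# Discrete model: every site gets one of N category rates, chosen uniformly.
cat_rates = discrete_gamma_rates(alpha, ncat=4)
discrete = rng.choice(cat_rates, size=nsites)

print("category rates:", cat_rates.round(4))
print("continuous site rates:", continuous.round(3))
print("discrete site rates:  ", discrete.round(3))
```

The discrete model trades some resolution in the tails of the distribution for a small, fixed set of rates.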

In [2]:
# Random ultrametric tree with 5 tips and a tree height of 1e6
tre = toytree.rtree.unittree(5, 1e6, seed=123)

### JC

In [3]:
%%timeit
mod = ipcoal.Model(tre, Ne=1e5, nsamples=2, seed=1234, seed_mutations=123)
mod.sim_loci(10, 20)

85.5 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### GTR

In [4]:
%%timeit
mod = ipcoal.Model(
    tre, Ne=1e5, nsamples=2, seed=1234, seed_mutations=123,
    substitution_model={"state_frequencies": (0.3, 0.2, 0.2, 0.3), "kappa": 0.5}                  
)
mod.sim_loci(10, 20)

86 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### GTR GAMMA

In [5]:
%%timeit
mod = ipcoal.Model(
    tre, Ne=1e5, nsamples=2, seed=1234, seed_mutations=123,
    substitution_model={
        "state_frequencies": (0.3, 0.2, 0.2, 0.3), 
        "kappa": 0.5, 
        "gamma": 0.5,
    } 
)
mod.sim_loci(10, 20)

694 ms ± 7.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### GTR GAMMA discrete categories

In [6]:
%%timeit
mod = ipcoal.Model(
    tre, Ne=1e5, nsamples=2, seed=1234, seed_mutations=123,
    substitution_model={
        "state_frequencies": (0.3, 0.2, 0.2, 0.3), 
        "kappa": 0.5, 
        "gamma": 0.5,
        "gamma_categories": 4,
    } 
)
mod.sim_loci(10, 20)
mod.df

192 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Example

In [7]:
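# Same settings as the discrete-gamma benchmark above, run once so the results can be inspected.
mod = ipcoal.Model(
    tre, Ne=1e5, nsamples=2, seed=1234, seed_mutations=123,
    substitution_model={
        "state_frequencies": (0.3, 0.2, 0.2, 0.3),  # order: A, C, G, T
        "kappa": 0.5,
        "gamma": 0.5,             # gamma shape parameter (alpha)
        "gamma_categories": 4,    # number of discrete rate categories
    }
)
mod.sim_loci(10, 20)  # 10 loci of 20 sites each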

In [8]:
mod.draw_seqview(0);

In [9]:
mod.get_substitution_model_summary()

state_frequencies:
   A    C    G    T
 0.3  0.2  0.2  0.3

kappa: 0.5
ts/tv: 0.24

instantaneous transition rate matrix:
        A       C       G       T
A -0.9677  0.3226  0.1613  0.4839
C  0.4839 -1.0484  0.3226  0.2419
G  0.2419  0.3226 -1.0484  0.4839
T  0.4839  0.1613  0.3226 -0.9677

gamma rate var. (4 discrete categories) alpha: 0.5
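
The printed matrix is consistent with an HKY-style parameterization. In that form, each off-diagonal rate is the target base frequency, multiplied by kappa for transitions (A<->G, C<->T), and the matrix is rescaled so the expected rate at equilibrium is 1. The sketch below rebuilds it from the summary values with NumPy. It is an independent check, not ipcoal's internal code, and the function name `hky_rate_matrix` is hypothetical.

```python
import numpy as np

def hky_rate_matrix(freqs, kappa):
    """Build a normalized HKY-style rate matrix with states ordered A, C, G, T."""
    freqs = np.asarray(freqs, dtype=float)
    # Transitions are A<->G and C<->T; every other change is a transversion.
    is_transition = np.zeros((4, 4), dtype=bool)
    is_transition[0, 2] = is_transition[2, 0] = True
    is_transition[1, 3] = is_transition[3, 1] = True
    # Off-diagonal rate i -> j is pi_j, multiplied by kappa for transitions.
    q = np.where(is_transition, kappa, 1.0) * freqs[np.newaxis, :]
    np.fill_diagonal(q, 0.0)
    # Rescale so the expected substitution rate at equilibrium equals 1.
    q /= (freqs * q.sum(axis=1)).sum()
    # Ratio of the equilibrium flux through transitions vs. transversions.
    flux = freqs[:, np.newaxis] * q
    tstv = flux[is_transition].sum() / flux[~is_transition].sum()
    # Diagonal entries make each row sum to zero.
    np.fill_diagonal(q, -q.sum(axis=1))
    return q, tstv

Q, tstv = hky_rate_matrix((0.3, 0.2, 0.2, 0.3), kappa=0.5)
print(Q.round(4))
print("ts/tv:", round(tstv, 2))
```

With the frequencies and kappa shown in the summary, this reproduces the matrix and the ts/tv value of 0.24 to the printed precision.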


### Comparison to SEQGEN (TODO)