In [1]:
import momi
import pandas
import os

Use the **`help()`** function to view documentation.

In [2]:
help(momi)

Help on package momi:

NAME
    momi

FILE
    /Users/jkamm/anaconda/lib/python2.7/site-packages/momi/__init__.py

DESCRIPTION
    momi (MOran Models for Inference) is a python package for computing the site frequency spectrum,
    a summary statistic commonly used in population genetics, and using it to infer demographic history.
    
    Please refer to examples/tutorial.py for usage & introduction.

PACKAGE CONTENTS
    compress_sfs
    compute_sfs
    convolution
    demography
    likelihood
    likelihood_surface
    math_functions
    moran_model
    parse_ms
    simulate_inference
    size_history
    tensor
    util




# Creating demographies

momi uses a syntax based on the program ms by Richard Hudson.
A demography is specified as a sequence of events.
Time is measured going backwards from the present (t=0) to the past (t>0).

There are 4 kinds of events:
* **('-en', t, i, N)**
    * At time t, population i has its scaled population size set to N, and growth rate set to 0.
* **('-eg', t, i, g)**
    * At time t, population i has its exponential growth rate set to g. As in ms, a positive g means the population shrinks going backwards in time, so for s>t, $N(s) = N(t) e^{-(s-t) g}$.
* **('-ej', t, i, j)**
    * At time t, all lineages in population i move into j.
* **('-ep', t, i, j, p_ij)**
    * At time t, each lineage in i moves into j with probability p_ij.

Note that **-en**, **-eg**, and **-ej** are flags from ms, while **-ep** replaces the ms flag **-es**.
By default, all parameters are scaled as in ms, but this can be adjusted.

See **`help(momi.Demography)`** or **`help(momi.Demography.__init__)`** for more details.

### An example demography

Now let's consider a concrete example. More examples can be found at [example_demographies.ipynb](files/example_demographies.ipynb).

Unlike ms, populations can be labeled by arbitrary strings. In this example, we'll label the sampled populations as **'chb'** and **'yri'**. The demography will also involve admixture with a third population, **'nea'**.

Using the default parameter scaling (which is the same as **ms**),
we assume that all population sizes have been rescaled by a "reference" size N_ref (e.g. 10,000),
and time is scaled so there are 4*N_ref generations per unit time.
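
As a minimal sketch of this scaling, scaled quantities convert to physical units as shown below. The helper names are illustrative and not part of momi, and N_ref = 10,000 is only the example value mentioned above.

```python
N_REF = 10000  # assumed reference size; purely illustrative


def time_to_generations(t_scaled, n_ref=N_REF):
    """Scaled time -> generations (4 * N_ref generations per unit time)."""
    return 4.0 * n_ref * t_scaled


def size_to_individuals(n_scaled, n_ref=N_REF):
    """Scaled population size -> number of diploid individuals."""
    return n_scaled * n_ref


print(time_to_generations(0.5))   # 20000.0
print(size_to_individuals(10.0))  # 100000.0
```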

In [3]:
# define the list of events
events = [('-en', 0., 'chb', 10.),         # at present (t=0), 'chb' has diploid population size 10 * N_ref
          ('-eg', 0, 'chb' , 6.),          # at present (t=0), 'chb' growing at rate 6
          ('-ep', .25, 'chb', 'nea', .03), # at t=.25, 'chb' has a bit of admixture from 'nea'
          ('-ej', .5, 'chb', 'yri'),       # at t=.5, 'chb' joins onto 'yri' 
          ('-ej', 1.5, 'yri', 'nea'),      # at t=1.5, 'yri' joins onto 'nea'
          ]

# construct the Demography object, sampling 14 alleles from 'yri' and 10 alleles from 'chb'
demo = momi.Demography(events, sampled_pops=('yri','chb'), sampled_n=(14,10))
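
To make the event semantics concrete, here is a toy sketch that tracks only where lineages sit as the `-ep` and `-ej` events above are applied backwards in time. It deliberately ignores coalescence (so the lineage count never drops) and size changes, and the helper functions are hypothetical, not momi functions.

```python
import random


def apply_pulse(lineages, src, dst, p, rng):
    """'-ep': each lineage in src moves to dst independently with probability p."""
    out = dict(lineages)
    n_src = out.get(src, 0)
    moved = sum(1 for _ in range(n_src) if rng.random() < p)
    out[src] = n_src - moved
    out[dst] = out.get(dst, 0) + moved
    return out


def apply_join(lineages, src, dst):
    """'-ej': all lineages in src move to dst, and src disappears."""
    out = dict(lineages)
    out[dst] = out.get(dst, 0) + out.pop(src, 0)
    return out


rng = random.Random(0)
lineages = {'yri': 14, 'chb': 10}                          # sampled alleles at t=0
lineages = apply_pulse(lineages, 'chb', 'nea', 0.03, rng)  # t=.25
lineages = apply_join(lineages, 'chb', 'yri')              # t=.5
lineages = apply_join(lineages, 'yri', 'nea')              # t=1.5
print(lineages)  # without coalescence, all 24 lineages end up in 'nea'
```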

# Coalescent statistics

Let's examine some statistics of the above demography, such as the TMRCA (time to most recent common ancestor) and total branch length of the genealogy.

In [4]:
eTmrca = momi.expected_tmrca(demo)
print "Expected TMRCA of all samples:", "\t", eTmrca

eTmrca_chb = momi.expected_deme_tmrca(demo, 'chb')
print "Expected TMRCA of chb samples:", "\t", eTmrca_chb

eL = momi.expected_total_branch_len(demo)
print "Expected total branch length:", "\t", eL

# See help(momi.expected_tmrca), etc. for more details.
# Advanced users can use momi.expected_sfs_tensor_prod()
# to compute these and many other summary statistics.

Expected TMRCA of all samples: 	1.41920517653
Expected TMRCA of chb samples: 	1.25374295456
Expected total branch length: 	7.93963743415


# Expected Sample Frequency Spectrum (SFS)

The expected SFS for configuration $(i_0,i_1,...)$ is the expected number of SNPs with $i_0$ derived alleles in population 0, $i_1$ derived alleles in population 1, etc.

In the below example we use **`momi.expected_sfs()`** to compute the expected SFS for several configurations.

In [5]:
# a list of configs (index0 == yri, index1 == chb)
config_list = [(0,1), (1,0), (3,1), (0,10), (12,0), (2,2)]

# the SFS entries corresponding to each config in config_list
eSFS = momi.expected_sfs(demo, config_list, mut_rate=1.0)
print eSFS

# See help(momi.expected_sfs) for more options (e.g. folded SFS, sampling error, normalization)

[ 2.6309565   0.95647102  0.0048574   0.06108049  0.04371188  0.0046881 ]
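
If every polymorphic configuration is wanted rather than a hand-picked few, the list can be enumerated directly. The sketch below assumes the same tuple format as `config_list`, and the variable names are illustrative.

```python
import itertools

sampled_n = (14, 10)  # (yri, chb), as in the demography above

all_configs = [
    config
    for config in itertools.product(*(range(n + 1) for n in sampled_n))
    if 0 < sum(config) < sum(sampled_n)   # drop the two monomorphic configs
]
print(len(all_configs))  # 163 == 15 * 11 - 2
```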


# Observed SFS

The observed SFS gives the number of observed SNPs for each configuration.

momi represents the observed SFS as a **dictionary**, mapping configs (tuples) to counts (ints).

An existing dataset for 10,000 loci has already been stored in [tutorial_data.txt](tutorial_data.txt). We read it in and examine it here.

In [6]:
data_file = "tutorial_data.txt"
sfs_list = momi.read_sfs_list(data_file)

# sfs_list is a list of the SFS at each of 10,000 loci
print "Number of loci:"
print len(sfs_list)

Number of loci:
10000


In [7]:
print "SFS at first locus:"
print sfs_list[0]

# an SFS is represented as a dictionary, {config0: count0, config1: count1, ...}

SFS at first locus:
{(0, 1): 26, (3, 2): 1, (7, 0): 3, (2, 0): 2, (3, 0): 6, (6, 0): 1, (0, 2): 3, (11, 8): 3, (14, 1): 3, (0, 5): 1, (0, 10): 2, (5, 0): 1, (0, 4): 1, (0, 9): 1, (0, 3): 1, (0, 8): 1, (1, 0): 10, (11, 6): 2, (14, 0): 2}


In [8]:
print "Combined SFS at all loci:"
combined_sfs = momi.sum_sfs_list(sfs_list)
print combined_sfs

Combined SFS at all loci:
{(7, 3): 89, (6, 9): 223, (12, 1): 92, (14, 4): 753, (13, 4): 87, (0, 7): 4971, (1, 6): 106, (0, 10): 26774, (3, 7): 143, (2, 5): 83, (8, 5): 88, (5, 8): 138, (4, 0): 22333, (10, 8): 132, (9, 0): 8237, (6, 7): 95, (5, 5): 90, (11, 5): 82, (10, 7): 107, (7, 6): 107, (6, 10): 1875, (12, 6): 105, (14, 1): 687, (13, 7): 121, (0, 4): 12375, (1, 1): 99, (4, 10): 1888, (3, 2): 69, (2, 6): 123, (8, 2): 78, (4, 5): 82, (9, 3): 78, (6, 0): 13671, (11, 0): 6145, (7, 5): 101, (14, 2): 681, (13, 10): 1768, (0, 1): 202357, (3, 1): 108, (9, 9): 193, (7, 8): 154, (14, 8): 3938, (13, 0): 4766, (12, 8): 136, (2, 1): 81, (8, 9): 239, (9, 4): 97, (5, 1): 95, (10, 3): 81, (7, 2): 102, (12, 2): 64, (11, 10): 1702, (14, 5): 979, (13, 3): 96, (1, 5): 85, (3, 6): 120, (2, 2): 85, (1, 10): 1741, (8, 6): 91, (4, 1): 112, (10, 9): 206, (9, 7): 97, (6, 4): 84, (5, 4): 76, (11, 4): 87, (10, 4): 105, (7, 1): 86, (12, 7): 105, (11, 9): 251, (14, 6): 1486, (13, 6): 94, (0, 5): 7916, (1, 0): 9

In [9]:
print "Number of mutations with configuration (1,0):"
print combined_sfs[(1,0)]

Number of mutations with configuration (1,0):
97090
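
Because each SFS is a plain dictionary, combining loci amounts to adding counts config by config. The following sketch shows that idea on toy data. It is an assumption about what the combined SFS represents, not momi's actual implementation of `momi.sum_sfs_list()`.

```python
from collections import Counter


def sum_sfs(sfs_list):
    """Add up per-locus SFS dictionaries into one combined dictionary."""
    total = Counter()
    for sfs in sfs_list:
        total.update(sfs)   # adds counts config by config
    return dict(total)


toy_sfs_list = [
    {(0, 1): 2, (1, 0): 1},
    {(0, 1): 3, (2, 2): 1},
]
combined = sum_sfs(toy_sfs_list)
print(combined[(0, 1)])          # 5
print(sum(combined.values()))    # 7 SNPs in total
```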


# Composite likelihood

We construct a composite likelihood by using a Poisson random field (PRF) approximation.
This assumes that the numbers of observed SNPs for the different configurations are independent Poisson random variables.

Below we compute the composite likelihood of our dataset and demography, given a mutation rate of 10.0 per locus.

In [10]:
# the mutation rate per locus
mut_rate_per_locus = 10.

# the mutation rate for the whole dataset
n_loci = len(sfs_list)
combined_mut_rate = n_loci * mut_rate_per_locus

composite_log_lik = momi.unlinked_log_likelihood(combined_sfs, demo, mut_rate=combined_mut_rate)
print "Composite log likelihood:", composite_log_lik

Composite log likelihood: -67579.2634253
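
The Poisson random field idea can be sketched in a few lines: each configuration contributes an independent Poisson log-probability, with mean equal to the mutation rate times the expected SFS entry. The function below is a hypothetical illustration on toy numbers. momi's own normalization and handling of unobserved configurations may differ.

```python
import math


def poisson_composite_log_lik(observed, expected):
    """Sum of independent Poisson log-probabilities, one per configuration.

    observed: dict mapping config -> observed SNP count
    expected: dict mapping config -> Poisson mean (must be > 0),
              covering every config, including those with zero observed count
    """
    log_lik = 0.0
    for config, mean in expected.items():
        count = observed.get(config, 0)
        log_lik += count * math.log(mean) - mean - math.lgamma(count + 1)
    return log_lik


toy_observed = {(0, 1): 2}
toy_expected = {(0, 1): 2.0, (1, 0): 1.0}
print(poisson_composite_log_lik(toy_observed, toy_expected))
```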


# Automatic differentiation

`momi` uses the package `autograd` to automatically compute derivatives.

Gradients can be extremely useful in parameter inference.
However, computing gradients is **not** strictly necessary for most functionality.
Users who don't plan to use auto-differentiation can skip this section, or return to it later.

Let's start by defining a function that maps some parameters to a demography.

In [11]:
import autograd.numpy as np ## thinly wrapped version of numpy for auto-differentiation

def demo_func(N_chb_bottom, N_chb_top, pulse_t, pulse_p, ej_chb, ej_yri):
    ej_chb = pulse_t + ej_chb
    ej_yri = ej_chb + ej_yri
    
    # use autograd.numpy for math functions (e.g. logarithm)
    # This will allow us to take derivatives later
    G_chb = -np.log(N_chb_top / N_chb_bottom) / ej_chb
    
    events = [('-en', 0., 'chb', N_chb_bottom),
              ('-eg', 0, 'chb' , G_chb),
              ('-ep', pulse_t, 'chb', 'nea', pulse_p),
              ('-ej', ej_chb, 'chb', 'yri'),
              ('-ej', ej_yri, 'yri', 'nea'),
              ]

    return momi.Demography(events, ('yri','chb'), (14,10))
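
The growth rate in `demo_func` is chosen so that 'chb' shrinks from `N_chb_bottom` at the present to `N_chb_top` at the time it joins 'yri'. The standalone sketch below checks that relationship, assuming the ms convention in which a positive growth rate means the population gets smaller backwards in time. The helper names are illustrative.

```python
import math


def growth_rate(n_bottom, n_top, duration):
    """Rate g such that n_bottom * exp(-g * duration) == n_top."""
    return -math.log(n_top / n_bottom) / duration


def size_at(n_bottom, g, t):
    """Population size t time units before the present, under ms-style growth."""
    return n_bottom * math.exp(-g * t)


g = growth_rate(10.0, 0.1, 0.5)
print(g > 0)                  # True: the population grows forward in time
print(size_at(10.0, g, 0.5))  # approximately 0.1
```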

Next, let's define a function that returns the expected
TMRCA from the parameters, and then use `autograd` to compute its gradient.

In [12]:
# function mapping vector of parameters to the TMRCA of the corresponding demography
def tmrca_func(params):
    # equivalent to demo_func(params[0], params[1], params[2], ...)
    demo = demo_func(*params)
    # return expected TMRCA
    return momi.expected_tmrca(demo)

# use autograd.grad() to obtain the gradient function
from autograd import grad
tmrca_grad = grad(tmrca_func)

x = [10., .1, .25, .03, .25, 1.]
print "Parameters:"
print x
print "Expected TMRCA:"
print tmrca_func(x)
print "Gradient:"
print tmrca_grad(x)

Parameters:
[10.0, 0.1, 0.25, 0.03, 0.25, 1.0]
Expected TMRCA:
1.31277716218
Gradient:
[0.0045248167517458011, 0.55838496083536993, 0.28449869706692488, 3.9671067152725357, 0.83175299796269708, 0.13916390098743361]
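
A common sanity check for any gradient is to compare it against central finite differences. The sketch below does not use `autograd` or momi. It defines a hypothetical helper and applies it to a toy function whose gradient is known.

```python
def finite_difference_grad(func, x, eps=1e-6):
    """Approximate the gradient of func at x with central differences."""
    grad = []
    for i in range(len(x)):
        x_hi = list(x)
        x_lo = list(x)
        x_hi[i] += eps
        x_lo[i] -= eps
        grad.append((func(x_hi) - func(x_lo)) / (2 * eps))
    return grad


def toy_func(params):
    # exact gradient is [2 * params[0], 3.0]
    return params[0] ** 2 + 3.0 * params[1]


print(finite_difference_grad(toy_func, [2.0, 1.0]))  # close to [4.0, 3.0]
```

With momi installed, the same helper could in principle be applied to `tmrca_func` and compared against `tmrca_grad(x)`. Finite differences are slower and less accurate than automatic differentiation, so this is only useful as a check.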


### More details about using `autograd` for automatic differentiation

`autograd` uses the magic of *operator overloading* to compute derivatives automatically.

Here are a few rules to keep in mind, to make sure `autograd` works correctly:

* Arithmetic operations `+,-,*,/,**` all work with autograd

* For more complicated mathematical operations, use `autograd.numpy` and `autograd.scipy`, thinly wrapped versions of `numpy` and `scipy` that support auto-differentiation.
    * For most users, `numpy` contains all the mathematical operations that are needed: `exp()`, `log()`, trigonometric functions, matrix operations, Fourier transforms, etc.
    * If needed, it is also possible to use autograd to define derivatives of your own mathematical operations.
* Other do's and don'ts (copied from the autograd tutorial):
    * Do use
        * Most of `numpy`'s functions
        * Most `numpy.ndarray` methods
        * Some `scipy` functions
        * Indexing and slicing of arrays `x = A[3, :, 2:4]`
        * Explicit array creation from lists `A = np.array([x, y])`
    * Don't use
        * Assignment to arrays `A[0,0] = x`
        * Implicit casting of lists to arrays `A = numpy.sum([x, y])`, use `A = numpy.sum(np.array([x, y]))` instead.
        * `A.dot(B)` notation (use `np.dot(A, B)` instead)
        * In-place operations (such as `a += b`, use `a = a + b` instead)

Documentation for autograd can be found at https://github.com/HIPS/autograd/

# Inference

Now let's try to infer the following demography from data.

In [13]:
true_params = [10., .1, .25, .03, .25, 1.]
true_demo = demo_func(*true_params)

We can generate a new dataset with `ms` (or a similar program, e.g. `scrm`, `macs`, `msprime`, etc.).

In [14]:
## to generate new dataset, change this to
## ms_path = "/path/to/ms"
ms_path = None

if ms_path is not None:
    print "Generating new dataset with ms"
    
    n_loci, mut_rate_per_locus, recom_rate_per_locus = 10000, 10., 10.
    
    ms_output = momi.simulate_ms(ms_path, true_demo, 
                                 n_loci, mut_rate_per_locus, 
                                 additional_ms_params="-r %f 10000" % recom_rate_per_locus)
    sfs_list = momi.sfs_list_from_ms(ms_output)

    combined_sfs = momi.sum_sfs_list(sfs_list)
    combined_mut_rate = n_loci * mut_rate_per_locus
    
    ## uncomment this line to save the new dataset
    #momi.write_sfs_list(sfs_list, data_file)
else:
    print "No ms_path provided, using SFS previously stored in %s." % data_file

No ms_path provided, using SFS previously stored in tutorial_data.txt.


In [15]:
# define (lower,upper) bounds on the parameter space
bounds = [(.01, 100.),
          (.01, 100.),
          (.01,5.),
          (.001,.25),
          (.01,5.),
          (.01,5.)]

# pick a random start value for the parameter search
import random
lower_bounds, upper_bounds = [l for l,u in bounds], [u for l,u in bounds]
start_params = [random.triangular(lower, upper, mode)
                for lower, upper, mode in zip(lower_bounds, upper_bounds, [1, 1, 1, .1, 1, 1])]

Now search for the maximum composite likelihood estimate (MCLE) with `momi.unlinked_mle_search()`.

By default, `momi.unlinked_mle_search()` assumes `demo_func` is differentiable with `autograd`, and uses the gradient in a hill-climbing algorithm.
* If you don't want to use `autograd`, you can disable the gradient (i.e. jacobian) with:
    * `momi.unlinked_mle_search(..., jac=False, ...)`
* Conversely, `momi.unlinked_mle_search()` provides options for using the Hessian (second-order derivative), in addition to the gradient.
    * See `help(momi.unlinked_mle_search)`.

Beware that `momi.unlinked_mle_search()` can be vulnerable to local maxima.
It isn't a problem in this example, but in other cases, you may have to try multiple starting positions to reach the global maximum.
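
One simple way to guard against local maxima is to restart the search from several random points and keep the best result. The sketch below shows the pattern on a toy one-dimensional objective with two local minima. `multi_start` and the toy search are hypothetical helpers, and in practice the local search would be a wrapper around the momi search that returns the fitted parameters and the objective value.

```python
import random


def multi_start(local_search, bounds, n_starts, seed=0):
    """Run local_search from several random starts; keep the lowest objective."""
    rng = random.Random(seed)
    best_x, best_value = None, float('inf')
    for _ in range(n_starts):
        start = [rng.uniform(lo, hi) for lo, hi in bounds]
        x, value = local_search(start, bounds)
        if value < best_value:
            best_x, best_value = x, value
    return best_x, best_value


def toy_objective(x):
    # two local minima, near x = -1 (the global one) and x = +1
    return (x[0] ** 2 - 1.0) ** 2 + 0.3 * x[0]


def toy_local_search(start, bounds, step=0.01, n_iter=2000):
    """Plain gradient descent on toy_objective, clipped to the bounds."""
    lo, hi = bounds[0]
    x = start[0]
    for _ in range(n_iter):
        grad = 4.0 * x * (x ** 2 - 1.0) + 0.3
        x = min(max(x - step * grad, lo), hi)
    return [x], toy_objective([x])


best_x, best_value = multi_start(toy_local_search, [(-2.0, 2.0)], n_starts=10)
print(best_x)
print(best_value)
```

A single start to the right of the origin would get stuck near x = +1. With several starts, at least one is likely to land in the basin of the global minimum.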

In [16]:
mcle_search_result = momi.unlinked_mle_search(combined_sfs, demo_func, combined_mut_rate, start_params, 
                                              bounds = bounds, maxiter = 500, output_progress = 25)

iter 0
objective ( [ 32.07686572  35.93110607   2.76977726   0.14485968   0.61600818   2.56761102] ) == 2.34621e+06
iter 25
objective ( [  7.96202564e-01   1.45976961e+01   6.88916031e-01   5.88225780e-02   1.00000000e-02   1.00000000e-02] ) == 96702.8
iter 50
objective ( [ 1.42369308  0.44158165  0.29680041  0.05825242  0.31992278  0.01      ] ) == 50679.1
iter 75
objective ( [ 6.38266067  0.10773648  0.17317568  0.04638007  0.3565005   0.52274802] ) == 2658.81
iter 100
objective ( [ 9.95820913  0.1010553   0.24771354  0.02994495  0.25438522  0.98704888] ) == 769.057
iter 125
objective ( [ 9.93544969  0.10158646  0.24789708  0.02965485  0.25484613  0.99444729] ) == 768.814


In [17]:
print "Search results:"
# print info such as whether search succeeded, number function/gradient evaluations, etc
print mcle_search_result
# note the printed function & gradient values are for -1*log_likelihood

Search results:
  status: 2
 success: True
    nfev: 129
     fun: 768.81437521055341
       x: array([ 9.93544964,  0.10158656,  0.24789711,  0.02965485,  0.25484617,
        0.99444729])
 message: 'Converged (|x_n-x_(n-1)| ~= 0)'
     jac: array([  5.22489881e-04,   1.26229927e-05,   1.23677773e-03,
         4.07467537e-01,   4.54915639e-03,  -6.33379112e-04])
     nit: 26


In [18]:
est_params = mcle_search_result.x
print "Estimated params:"
print est_params
print "Ratio of Estimated/Truth:"
print est_params / true_params

Estimated params:
[ 9.93544964  0.10158656  0.24789711  0.02965485  0.25484617  0.99444729]
Ratio of Estimated/Truth:
[ 0.99354496  1.01586557  0.99158846  0.98849488  1.01938467  0.99444729]


In [19]:
print "Log-likelihood at estimated parameters:", -mcle_search_result.fun
print "Log-likelihood at true parameters:", momi.unlinked_log_likelihood(combined_sfs, true_demo, mut_rate=combined_mut_rate)

Log-likelihood at estimated parameters: -768.814375211
Log-likelihood at true parameters: -774.885651385


# Confidence Intervals

As the number of independent loci goes to infinity,
the MCLE is asymptotically Gaussian, with mean at the truth,
and covariance given by the inverse 'Godambe information'.

This can be used to construct approximate confidence intervals,
which have the correct coverage properties in the limit (assuming certain regularity conditions, e.g. *identifiability* and *consistency*).

In [20]:
print "Computing approximate covariance of MCLE..."

## the approximate covariance matrix of the MCLE
mcle_cov = momi.unlinked_mle_approx_cov(est_params, sfs_list, demo_func, mut_rate_per_locus)
print mcle_cov

Computing approximate covariance of MCLE...
[[  2.53727567e-02  -2.24368132e-04   2.71444867e-04   1.14725781e-04
   -4.58609258e-04  -1.50682514e-03]
 [ -2.24368132e-04   4.27654324e-06  -2.80645723e-06  -2.61300014e-06
    4.59916259e-06   4.68172599e-05]
 [  2.71444867e-04  -2.80645723e-06   1.71937574e-05   3.40832616e-06
   -1.82960528e-05  -1.06727642e-05]
 [  1.14725781e-04  -2.61300014e-06   3.40832616e-06   3.14499286e-06
   -4.52473913e-06  -5.33580081e-05]
 [ -4.58609258e-04   4.59916259e-06  -1.82960528e-05  -4.52473913e-06
    2.32418172e-05   2.61780696e-05]
 [ -1.50682514e-03   4.68172599e-05  -1.06727642e-05  -5.33580081e-05
    2.61780696e-05   1.47242029e-03]]


In [21]:
# marginal confidence intervals
print "Approximate 95% confidence intervals for parameters:"

import scipy.stats
conf_lower, conf_upper = scipy.stats.norm.interval(.95, loc = est_params, scale = np.sqrt(np.diag(mcle_cov)))
print pandas.DataFrame({"Truth" : true_params, "Lower" : conf_lower, "Upper" : conf_upper}, columns = ["Lower","Upper","Truth"])

Approximate 95% confidence intervals for parameters:
      Lower      Upper  Truth
0  9.623250  10.247649  10.00
1  0.097533   0.105640   0.10
2  0.239770   0.256024   0.25
3  0.026179   0.033131   0.03
4  0.245397   0.264295   0.25
5  0.919239   1.069655   1.00
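
Each marginal interval is just the estimate plus or minus a normal quantile times the standard error, where the standard error is the square root of the corresponding diagonal entry of the covariance matrix. The sketch below reproduces the first row by hand, using the estimate and variance printed above. The helper name is illustrative.

```python
import math

Z_975 = 1.959963984540054  # 97.5% quantile of the standard normal


def normal_interval(estimate, variance, z=Z_975):
    """Two-sided 95% interval: estimate +/- z * sqrt(variance)."""
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width


# first parameter: estimate and variance copied from the outputs above
lower, upper = normal_interval(9.93544964, 2.53727567e-02)
print(lower)
print(upper)
```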


In [22]:
# higher-dimensional confidence regions, using the Wald test
print "Smallest alpha, so that level-alpha confidence region contains Truth:"
print "(alpha = 0 is a single point, alpha = 1 is whole parameter space)"

# Wald test: resids^T * cov^{-1} * resids is asymptotically chi-squared with n_params degrees of freedom

inv_cov = np.linalg.inv(mcle_cov)
# make sure the numerical inverse is still symmetric
assert np.allclose(inv_cov, inv_cov.T)
inv_cov = (inv_cov + inv_cov.T) / 2.0

resids = est_params - true_params
wald_stat = np.dot(resids, np.dot(inv_cov, resids))
print "alpha = ", scipy.stats.chi2.cdf(wald_stat, df=len(resids))

Smallest alpha, so that level-alpha confidence region contains Truth:
(alpha = 0 is a single point, alpha = 1 is whole parameter space)
alpha =  0.312401628989
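
For intuition, the Wald statistic and its chi-squared level can be computed by hand in a simplified setting. The sketch below assumes a diagonal covariance matrix and an even number of parameters, for which the chi-squared CDF has a closed form. The covariance above is not diagonal, so this is only a toy illustration with made-up numbers, and the helper names are hypothetical.

```python
import math


def chi2_cdf_even_df(x, df):
    """Chi-squared CDF for even df, using the closed-form finite sum."""
    assert df % 2 == 0 and df > 0
    half = x / 2.0
    partial_sum = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return 1.0 - math.exp(-half) * partial_sum


def wald_statistic_diagonal(resids, variances):
    """Wald statistic when the covariance matrix is diagonal (toy case)."""
    return sum(r ** 2 / v for r, v in zip(resids, variances))


stat = wald_statistic_diagonal([0.1, -0.2], [0.01, 0.04])  # each term is about 1
print(chi2_cdf_even_df(stat, df=2))  # 1 - exp(-1), about 0.632
```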
