In [1]:
import momi
import pandas
import os

Use the **`help()`** function to view documentation.

In [2]:
help(momi)

Help on package momi:

NAME
    momi

FILE
    //anaconda/lib/python2.7/site-packages/momi/__init__.py

DESCRIPTION
    momi (MOran Models for Inference) is a python package for computing the site frequency spectrum,
    a summary statistic commonly used in population genetics, and using it to infer demographic history.
    
    Please refer to examples/tutorial.ipynb for usage & introduction.

PACKAGE CONTENTS
    compress_sfs
    compute_sfs
    convolution
    data_structure
    demography
    likelihood
    likelihood_surface
    math_functions
    moran_model
    parse_ms
    simulate_inference
    size_history
    tensor
    util




# Creating demographies

momi uses a syntax based on the program ms by Richard Hudson.
A demography is specified as a sequence of events.
Time is measured going backwards from the present (t=0) to the past (t>0).

There are 4 kinds of events:
* **('-en', t, i, N)**
    * At time t, population i has its scaled population size set to N, and growth rate set to 0.
* **('-eg', t, i, g)**
    * At time t, population i has exponential growth rate g (so for s>t, $N(s) = N(t) e^{-(s-t) g}$; since time runs backwards, a positive g means the population was smaller in the past)
* **('-ej', t, i, j)**
    * At time t, all lineages in population i move into j.
* **('-ep', t, i, j, p_ij)**
    * At time t, each lineage in i moves into j with probability p_ij.

Note **-en,-eg,-ej** are flags from ms, while **-ep** replaces the flag **-es** in ms.
By default, all parameters are scaled as in ms, but this can be adjusted.

See **`help(momi.Demography)`** or **`help(momi.Demography.__init__)`** for more details.
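
To make the `-eg` convention concrete, here is a minimal sketch in plain Python. It is not part of `momi`: `size_after_growth` is a hypothetical helper, and it assumes the ms convention in which time runs backwards, so a positive growth rate means the population was smaller in the past. This matches how `demo_func` computes `G_chb` later in this tutorial.

```python
import math

def size_after_growth(n_t, g, t, s):
    """Scaled size at time s >= t, given size n_t at time t and growth rate g.

    Assumes the ms convention: time runs backwards, so a positive g
    means the population shrinks as we look further into the past.
    """
    if s < t:
        raise ValueError("s must not be more recent than t")
    return n_t * math.exp(-g * (s - t))

# Hypothetical numbers: size 10 at t=0, growth rate 6, evaluated at s=0.5
size_at_half = size_after_growth(10.0, 6.0, 0.0, 0.5)

# Growth rate that takes size 10 (at t=0) down to size 0.1 (at s=0.5)
rate = -math.log(0.1 / 10.0) / 0.5
```

Plugging `rate` back into `size_after_growth(10.0, rate, 0.0, 0.5)` recovers the older size 0.1, which is a quick way to check the sign of a growth rate before building a demography.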

### An example demography

Now let's consider a concrete example. More examples can be found at [example_demographies.ipynb](files/example_demographies.ipynb).

Unlike ms, populations can be labeled by arbitrary strings. In this example, we'll label the sampled populations as **'chb'** and **'yri'**. The demography will also involve admixture with a third population, **'nea'**.

Using the default parameter scaling (which is the same as **ms**),
we assume that all population sizes have been rescaled by a "reference" size N_ref (e.g. 10,000),
and time is scaled so there are 4*N_ref generations per unit time.

In [3]:
# define the list of events
events = [('-en', 0., 'chb', 10.),         # at present (t=0), 'chb' has diploid population size 10 * N_ref
          ('-eg', 0, 'chb' , 6.),          # at present (t=0), 'chb' growing at rate 6
          ('-ep', .25, 'chb', 'nea', .03), # at t=.25, 'chb' has a bit of admixture from 'nea'
          ('-ej', .5, 'chb', 'yri'),       # at t=.5, 'chb' joins onto 'yri' 
          ('-ej', 1.5, 'yri', 'nea'),      # at t=1.5, 'yri' joins onto 'nea'
          ]

# construct the Demography object, sampling 14 alleles from 'yri' and 10 alleles from 'chb'
demo = momi.Demography(events, sampled_pops=('yri','chb'), sampled_n=(14,10))
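
As a quick sanity check on the default scaling, the sketch below converts scaled quantities back to generations and diploid individuals. The helper names are hypothetical (not `momi` functions), and `N_REF = 10000` is only the illustrative reference size mentioned above.

```python
N_REF = 10000  # assumed reference size (the "e.g. 10,000" in the text)

def scaled_time_to_generations(t, n_ref=N_REF):
    """Time is scaled so that one unit equals 4 * N_ref generations."""
    return 4.0 * n_ref * t

def scaled_size_to_diploids(n, n_ref=N_REF):
    """Population sizes are expressed as multiples of N_ref."""
    return n * n_ref

join_generations = scaled_time_to_generations(0.5)  # 'chb' joins 'yri' at t=.5
chb_size = scaled_size_to_diploids(10.0)            # present-day size of 'chb'
```

Under these assumptions the join at t=.5 happens 20,000 generations ago, and the present-day size of 'chb' is 100,000 diploids.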

# Coalescent statistics

Let's examine some statistics of the above demography, such as the TMRCA (time to most recent common ancestor) and total branch length of the genealogy.

In [4]:
eTmrca = momi.expected_tmrca(demo)
print "Expected TMRCA of all samples:", "\t", eTmrca

eTmrca_chb = momi.expected_deme_tmrca(demo, 'chb')
print "Expected TMRCA of chb samples:", "\t", eTmrca_chb

eL = momi.expected_total_branch_len(demo)
print "Expected total branch length:", "\t", eL

# See help(momi.expected_tmrca), etc. for more details.
# Advanced users can use momi.expected_sfs_tensor_prod()
# to compute these and many other summary statistics.

Expected TMRCA of all samples: 	1.41920517653
Expected TMRCA of chb samples: 	1.25374295456
Expected total branch length: 	7.93963743415


# Expected Sample Frequency Spectrum (SFS)

The expected SFS for configuration $((a_0,d_0),(a_1,d_1),...)$ is the expected number of SNPs with $a_0$ ancestral and $d_0$ derived alleles in population 0, $a_1$ ancestral and $d_1$ derived alleles in population 1, etc.

The following all create a config with 13 ancestral and 1 derived allele in "yri", and 10 ancestral and 0 derived alleles in "chb".

In [5]:
# these all represent the same config
[[13,1],[10,0]]
momi.config(a=(13,10), d=(1,0))
momi.config(d=(1,0), n=(14,10))
momi.config(a=(13,10), n=(14,10))

((13, 1), (10, 0))
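
The three keyword forms work because any two of the ancestral counts `a`, the derived counts `d`, and the sample sizes `n` determine the third (`n = a + d` in each population). The following sketch illustrates that bookkeeping in plain Python; `make_config` is a hypothetical stand-in, not the implementation of `momi.config`.

```python
def make_config(a=None, d=None, n=None):
    """Return ((a_0, d_0), (a_1, d_1), ...) from any two of a, d, n."""
    if sum(arg is not None for arg in (a, d, n)) < 2:
        raise ValueError("need at least two of a, d, n")
    if a is None:
        a = tuple(n_i - d_i for n_i, d_i in zip(n, d))
    if d is None:
        d = tuple(n_i - a_i for n_i, a_i in zip(n, a))
    return tuple(zip(a, d))

# All three calls describe the same configuration
same = {make_config(a=(13, 10), d=(1, 0)),
        make_config(d=(1, 0), n=(14, 10)),
        make_config(a=(13, 10), n=(14, 10))}
```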

In the example below we use **`momi.expected_sfs()`** to compute the expected SFS for several configurations.

In [6]:
# a list of configs (index0 == yri, index1 == chb)
configs = [( (14,0), (9,1), ), # 1 derived allele in chb
           ( (13,1), (10,0), ), # 1 derived allele in yri 
           ( (11,3), (9,1),) , # 3 derived in yri, 1 derived in chb
           ( (14,0), (0,10), ), # 0 derived in yri, all derived in chb
           ( (2,12), (10,0), ), # 12 derived in yri, 0 derived in chb 
           ( (2,12), (2,8), ), # 12 derived in yri, 8 derived in chb
          ]

configs = momi.Configs(("yri","chb"), configs)

# the SFS entries corresponding to each config in configs
eSFS = momi.expected_sfs(demo, configs, mut_rate=1.0)
print eSFS

# See help(momi.expected_sfs) for more options (e.g. folded SFS, sampling error, normalization)

[ 2.6309565   0.95647102  0.0048574   0.06108049  0.04371188  0.00323204]


# Segregating sites

A dataset of segregating sites has been stored in [tutorial_data.txt](tutorial_data.txt). The file is organized as follows:
* Each locus starts with a line "**//**".
* Subsequent lines correspond to segregating sites.
    * The first column is the **position** of the site $x \in [0,1]$.
    * Subsequent columns give the **ancestral,derived** allele counts (respectively), in each population.

In [7]:
# print first few lines in the shell
!head tutorial_data.txt

Position	:	yri	chb

//

0.0	:	14,0	7,3
0.0007	:	13,1	10,0
0.001	:	0,14	10,0
0.0012	:	11,3	10,0
0.0012	:	14,0	9,1
0.0013	:	0,14	10,0
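
To show how the file layout maps onto data, here is a minimal pure-Python parser for the format described above. It is only a sketch: `parse_seg_sites_text` is a hypothetical helper, not how `momi.read_seg_sites` is implemented, and it assumes a single header line, whitespace-separated columns, and that every site line follows a "//" line.

```python
def parse_seg_sites_text(lines):
    """Parse the text format into (pops, loci).

    loci has one entry per "//" block; each entry is a list of
    (position, config) pairs, with config = ((a_0, d_0), (a_1, d_1), ...).
    """
    pops, loci = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line == "//":
            loci.append([])  # start a new locus
            continue
        left, right = line.split(":", 1)
        fields = right.split()  # split() accepts tabs as well as spaces
        if pops is None:
            pops = tuple(fields)  # header line: Position : yri chb
            continue
        config = tuple(tuple(int(x) for x in f.split(",")) for f in fields)
        loci[-1].append((float(left.strip()), config))
    return pops, loci

# First few lines of the file, written with spaces instead of tabs
sample = ["Position : yri chb", "", "//", "",
          "0.0 : 14,0 7,3",
          "0.0007 : 13,1 10,0",
          "0.001 : 0,14 10,0"]
pops, loci = parse_seg_sites_text(sample)
```

Here `loci[0][2]` is the third site of the first locus, with position 0.001 and config `((0, 14), (10, 0))`.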


In [8]:
# read the file with momi
data_filename = "tutorial_data.txt"
with open(data_filename,'r') as f:
    seg_sites = momi.read_seg_sites(f)

`SegSites` has methods `SegSites.position()`, `SegSites.config()`, and `SegSites.allele_count()` to access the positions and allele counts at different sites.

The underlying arrays can also be accessed through `SegSites.position_arrays` and `SegSites.config_arrays`.

In [9]:
# equivalent to seg_sites.position_arrays[0][2]
print seg_sites.position(locus=0,site=2)

# equivalent to seg_sites.config_arrays[0][2,:,:]
print seg_sites.config(locus=0, site=2)

# equivalent to seg_sites.config_arrays[0][2,1,0]
print seg_sites.allele_count(locus=0, site=2, pop=1, allele=0)

0.001
[[ 0 14]
 [10  0]]
10


You can directly construct `seg_sites` with `SegSites()`.

In [10]:
seg_sites2 = momi.SegSites(seg_sites.sampled_pops, 
                           seg_sites.config_arrays, 
                           seg_sites.position_arrays)

assert seg_sites2 == seg_sites

# Observed SFS 

`SegSites.sfs` is an `Sfs` object. Use `Sfs.freq` to get the observed frequency of a config. `Sfs.loci` and `Sfs.total` are the underlying dict-like objects, which can be accessed directly.

In [11]:
sfs = seg_sites.sfs

# equivalent to sfs.total[((13,1),(10,0))]
print sfs.freq(((13,1),
                (10,0)))

# equivalent to sfs.loci[0][((13,1),(10,0))]
print sfs.freq(((13,1),
                (10,0)), locus=0)

print "Unique configs:", len(sfs.total)

9251
412
Unique configs: 163
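
Conceptually, the observed SFS is just a count of how often each config appears, per locus and in total. The sketch below mimics that with `collections.Counter` on hypothetical toy data; it is analogous in spirit to `Sfs.loci` and `Sfs.total`, but it is not `momi`'s actual data structure.

```python
from collections import Counter

# Hypothetical toy data: two loci, each a list of configs
# ((a_yri, d_yri), (a_chb, d_chb)), one per segregating site
loci_configs = [
    [((13, 1), (10, 0)), ((14, 0), (9, 1)), ((13, 1), (10, 0))],
    [((13, 1), (10, 0)), ((11, 3), (10, 0))],
]

per_locus = [Counter(configs) for configs in loci_configs]  # one counter per locus
total = sum(per_locus, Counter())                           # counts summed over loci

singleton_yri = ((13, 1), (10, 0))
```

In this toy dataset `total[singleton_yri]` is 3, `per_locus[0][singleton_yri]` is 2, and `len(total)` gives the number of unique configs (3).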


You can directly construct `sfs` with `Sfs()`.

In [12]:
sfs2 = momi.Sfs(sfs.sampled_pops, sfs.loci)
sfs3 = momi.Sfs(seg_sites.sampled_pops, seg_sites.config_arrays)

assert sfs2 == sfs
assert sfs3 == sfs

# Composite likelihood

The composite likelihood treats each site as independent.  
`momi` provides 2 composite likelihoods:
* **multinomial**: Fixed number of sites, each drawn from a multinomial distribution
* **Poisson random field**: Random number of sites, drawn from a Poisson distribution

Here we illustrate the **multinomial** likelihood (the default). See **help(momi.composite_log_likelihood)** for more options.

In [13]:
print "Composite log likelihood:", momi.composite_log_likelihood(sfs, demo)

Composite log likelihood: -6855.05740059
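
As a rough illustration of how the two composite likelihoods differ, the sketch below writes out their textbook forms for a table of observed counts and expected counts. The function names and toy numbers are hypothetical, and `momi`'s exact normalization and options may differ; see the help text for the authoritative definition.

```python
import math

def multinomial_log_lik(observed, expected):
    """sum_c n_c * log(p_c), with p_c = expected[c] / sum(expected).

    Drops the multinomial coefficient, which does not depend on the demography.
    """
    total_expected = sum(expected.values())
    return sum(n * math.log(expected[c] / total_expected)
               for c, n in observed.items())

def poisson_log_lik(observed, expected):
    """sum_c [n_c * log(lam_c) - lam_c - log(n_c!)] over all configs."""
    log_lik = -sum(expected.values())  # the -lam_c term, for every config
    for c, n in observed.items():
        log_lik += n * math.log(expected[c]) - math.lgamma(n + 1)
    return log_lik

# Hypothetical toy numbers: three possible configs, labeled by strings
expected = {"c1": 6.0, "c2": 3.0, "c3": 1.0}  # must cover every possible config
observed = {"c1": 5, "c2": 4, "c3": 1}

multinomial_value = multinomial_log_lik(observed, expected)
poisson_value = poisson_log_lik(observed, expected)
```

The multinomial form only depends on the relative proportions of the expected SFS, whereas the Poisson form also depends on its overall scale (and hence on the mutation rate).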


# Automatic differentiation

`momi` uses the package `autograd` to automatically compute derivatives.

Gradients can be extremely useful in parameter inference.
However, much of `momi`'s functionality does **not** require computing gradients.
Users who don't plan to use auto-differentiation can skip this section, or return to it later.

Let's start by defining a function that maps some parameters to a demography.

In [14]:
import autograd.numpy as np ## thinly wrapped version of numpy for auto-differentiation

def demo_func(N_chb_bottom, N_chb_top, pulse_t, pulse_p, ej_chb, ej_yri):
    ej_chb = pulse_t + ej_chb
    ej_yri = ej_chb + ej_yri
    
    # use autograd.numpy for math functions (e.g. logarithm)
    # This will allow us to take derivatives later
    G_chb = -np.log(N_chb_top / N_chb_bottom) / ej_chb
    
    events = [('-en', 0., 'chb', N_chb_bottom),
              ('-eg', 0, 'chb' , G_chb),
              ('-ep', pulse_t, 'chb', 'nea', pulse_p),
              ('-ej', ej_chb, 'chb', 'yri'),
              ('-ej', ej_yri, 'yri', 'nea'),
              ]

    return momi.Demography(events, ('yri','chb'), (14,10))

Next, let's define a function that returns the expected
TMRCA from the parameters, and then use `autograd` to compute its gradient.

In [15]:
# function mapping vector of parameters to the TMRCA of the corresponding demography
def tmrca_func(params):
    # equivalent to demo_func(params[0], params[1], params[2], ...)
    demo = demo_func(*params)
    # return expected TMRCA
    return momi.expected_tmrca(demo)

# use autograd.grad() to obtain the gradient function
from autograd import grad
tmrca_grad = grad(tmrca_func)

x = [10., .1, .25, .03, .25, 1.]
print "Parameters:"
print x
print "Expected TMRCA:"
print tmrca_func(x)
print "Gradient:"
print tmrca_grad(x)

Parameters:
[10.0, 0.1, 0.25, 0.03, 0.25, 1.0]
Expected TMRCA:
1.31277716218
Gradient:
[0.0045248167517457968, 0.5583849608353697, 0.2844986970669251, 3.9671067152725357, 0.83175299796269686, 0.13916390098743414]
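
A common way to sanity-check an automatically computed gradient is to compare it against finite differences. The sketch below is plain Python and independent of `autograd`; `finite_difference_grad` and `toy_func` are hypothetical names used only for illustration.

```python
def finite_difference_grad(func, x, eps=1e-6):
    """Central-difference approximation to the gradient of func at x."""
    grad_approx = []
    for i in range(len(x)):
        x_hi, x_lo = list(x), list(x)
        x_hi[i] += eps
        x_lo[i] -= eps
        grad_approx.append((func(x_hi) - func(x_lo)) / (2.0 * eps))
    return grad_approx

# Toy function with a known gradient: f(x) = x0**2 + 3*x1, so grad = [2*x0, 3]
def toy_func(x):
    return x[0] ** 2 + 3.0 * x[1]

approx = finite_difference_grad(toy_func, [2.0, 5.0])
```

Assuming `tmrca_func` accepts a plain list, as in the cell above, `finite_difference_grad(tmrca_func, x)` should roughly agree with `tmrca_grad(x)`, although it needs two function evaluations per parameter and is less accurate.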


### More details about using `autograd` for automatic differentiation

`autograd` uses the magic of *operator overloading* to compute derivatives automatically.

Here are a few rules to keep in mind, to make sure `autograd` works correctly:

* Arithmetic operations `+,-,*,/,**` all work with autograd

* For more complicated mathematical operations, use `autograd.numpy` and `autograd.scipy`, thinly wrapped versions of `numpy` and `scipy` that support auto-differentiation.
    * For most users, `numpy` contains all the mathematical operations that are needed: `exp()`, `log()`, trigonometric functions, matrix operations, Fourier transforms, etc.
    * If needed, it is also possible to use autograd to define derivatives of your own mathematical operations.
* Other do's and don'ts (copied from the autograd tutorial):
    * Do use
        * Most of `numpy`'s functions
        * Most `numpy.ndarray` methods
        * Some `scipy` functions
        * Indexing and slicing of arrays `x = A[3, :, 2:4]`
        * Explicit array creation from lists `A = np.array([x, y])`
    * Don't use
        * Assignment to arrays `A[0,0] = x`
        * Implicit casting of lists to arrays `A = np.sum([x, y])` (use `A = np.sum(np.array([x, y]))` instead)
        * `A.dot(B)` notation (use `np.dot(A, B)` instead)
        * In-place operations (such as `a += b`, use `a = a + b` instead)

Documentation for autograd can be found at https://github.com/HIPS/autograd/

# Inference

Now let's try to infer the following demography from data.

In [16]:
true_params = [10., .1, .25, .03, .25, 1.]
true_demo = demo_func(*true_params)

We can generate a new dataset with `ms` (or a similar program such as `scrm`, `macs`, or `msprime`).

In [17]:
## to generate new dataset, change this to
## ms_path = "/path/to/ms"
ms_path = None
#ms_path = os.environ["MS_PATH"]

if ms_path is not None:
    print "Generating new dataset with ms"
    
    n_loci = 20
    kb_per_locus = 500
    mut_rate_per_kb, recom_rate_per_kb = 1.,1.
    
    ms_output = momi.simulate_ms(ms_path, true_demo, 
                                 n_loci, kb_per_locus * mut_rate_per_kb, 
                                 additional_ms_params="-r %f %d" % (kb_per_locus*recom_rate_per_kb,
                                                                    int(kb_per_locus * 1000)))
    seg_sites = momi.seg_sites_from_ms(ms_output, sampled_pops=true_demo.sampled_pops)

    sfs_list = seg_sites.sfs_list
    combined_sfs = momi.sum_sfs_list(sfs_list)
    
    ## uncomment these lines to save the new dataset
    #with open(data_filename,'w') as f:
    #    momi.write_seg_sites(f, seg_sites)
else:
    print "No ms_path provided, using SFS previously stored in %s." % data_filename

No ms_path provided, using SFS previously stored in tutorial_data.txt.


In [18]:
# define (lower,upper) bounds on the parameter space
bounds = [(.01, 100.),
          (.01, 100.),
          (.01,5.),
          (.001,.25),
          (.01,5.),
          (.01,5.)]

# pick a random start value for the parameter search
import random
lower_bounds, upper_bounds = [l for l,u in bounds], [u for l,u in bounds]
start_params = [random.triangular(lower, upper, mode)
                for lower, upper, mode in zip(lower_bounds, upper_bounds, [1, 1, 1, .1, 1, 1])]

Now search for the MCLE (maximum composite likelihood estimate) with **`momi.composite_mle_search()`**.

By default, `momi.composite_mle_search()` assumes `demo_func` is differentiable with `autograd`, and uses the gradient in a hill-climbing algorithm.
* If you don't want to use `autograd`, you can disable the gradient (i.e. jacobian) with:
    * `momi.composite_mle_search(..., jac=False, ...)`
* Conversely, `momi.composite_mle_search()` provides options for using the Hessian (second-order derivative), in addition to the gradient.
    * See **`help(momi.composite_mle_search)`**.

Beware that **`momi.composite_mle_search()`** can be vulnerable to local maxima.
It isn't a problem in this example, but in other cases, you may have to try multiple starting positions to reach the global maximum.
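
One simple defense against local maxima is to restart the search from several random starting points and keep the best result. The sketch below shows the pattern on a toy one-dimensional objective; `multi_start`, `toy_objective`, and `toy_search` are hypothetical, and the toy optimizer is plain gradient descent rather than anything `momi` uses.

```python
import random

def multi_start(search, sample_start, n_starts, seed=0):
    """Run `search` from several random starts and keep the best result.

    search(start) must return (objective_value, params); lower is better.
    sample_start(rng) must return one random starting point.
    """
    rng = random.Random(seed)
    results = [search(sample_start(rng)) for _ in range(n_starts)]
    return min(results, key=lambda result: result[0])

# Toy objective with two basins: f(x) = (x**2 - 1)**2 + 0.3*x.
# The basin near x = -1 is the global minimum; the one near x = 1 is local.
def toy_objective(x):
    return (x ** 2 - 1.0) ** 2 + 0.3 * x

def toy_search(start, step=0.01, n_steps=2000):
    x = start
    for _ in range(n_steps):
        x -= step * (4.0 * x * (x ** 2 - 1.0) + 0.3)  # derivative of toy_objective
    return toy_objective(x), x

best_value, best_x = multi_start(toy_search,
                                 lambda rng: rng.uniform(-2.0, 2.0),
                                 n_starts=10)
```

With `momi`, `search` could wrap a call to `momi.composite_mle_search()` and return `(result.fun, result.x)`, since the reported objective is the negative log-likelihood and lower values are better.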

In [19]:
mcle_search_result = momi.composite_mle_search(seg_sites, demo_func, start_params, 
                                               bounds = bounds, maxiter = 500, output_progress = 25)

iter 0
objective ( [  5.95357791  41.79357977   2.93004337   0.10209898   1.33329717   0.79459345] ) == 57732.8
iter 25
objective ( [  4.08057833e-01   1.61250101e+01   5.50932365e-01   8.93920120e-02   1.00000000e-02   1.00000000e-02] ) == 12480.6
iter 50
objective ( [ 2.05904147  0.39295325  0.55806372  0.08467571  0.01        0.01      ] ) == 3825.91
iter 75
objective ( [ 11.56175137   0.07887715   0.26813808   0.05783174   0.23800528   0.64479867] ) == 622.579
iter 100
objective ( [ 10.41182537   0.0986267    0.25796856   0.03486405   0.25547071   0.92478748] ) == 609.101
iter 125
objective ( [ 10.43281027   0.09826173   0.25819395   0.03518239   0.25505389   0.91965524] ) == 609.096


In [20]:
print "Search results:"
# print info such as whether search succeeded, number function/gradient evaluations, etc
print mcle_search_result
# note the printed function & gradient values are for -1*log_likelihood

Search results:
  status: 1
 success: True
    nfev: 142
     fun: 609.09631109738257
       x: array([ 10.43285226,   0.09826032,   0.25819429,   0.03518325,
         0.25505262,   0.91964191])
 message: 'Converged (|f_n-f_(n-1)| ~= 0)'
     jac: array([ -8.11746624e-05,  -2.81783307e-02,   1.08419090e-02,
        -1.37625964e-02,   8.21489506e-03,   2.91106765e-04])
     nit: 30


In [21]:
est_params = mcle_search_result.x
print "Estimated params:"
print est_params
print "Ratio of Estimated/Truth:"
print est_params / true_params

Estimated params:
[ 10.43285226   0.09826032   0.25819429   0.03518325   0.25505262
   0.91964191]
Ratio of Estimated/Truth:
[ 1.04328523  0.98260317  1.03277714  1.17277489  1.02021048  0.91964191]


In [22]:
print "Log-likelihood at estimated parameters:", -mcle_search_result.fun
print "Log-likelihood at true parameters:", momi.composite_log_likelihood(seg_sites, true_demo)

Log-likelihood at estimated parameters: -609.096311097
Log-likelihood at true parameters: -620.223208703


# Confidence Intervals

As the number of independent loci goes to infinity,
the MCLE is asymptotically Gaussian, with mean at the truth,
and covariance given by the inverse 'Godambe information'.

This can be used to construct approximate confidence intervals,
which have the correct coverage properties in the limit (assuming certain regularity conditions, e.g. *identifiability* and *consistency*).

`momi` currently has 2 methods for computing confidence intervals:
* **iid**: Treats the loci as iid. 
    * Appropriate when there are many loci, roughly identically distributed.
* **series**: Treats the segregating sites as a time series. 
    * Works for a small number of loci, and they don't have to be identically distributed.
    * Converges as number of SNPs per locus $\to \infty$
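
For reference, the standard textbook definition of the Godambe information for a composite log-likelihood $\ell_c(\theta)$ is given below. The symbols $H$, $J$, $G$ and $\theta_0$ (the true parameter) are notation introduced here, not `momi` names, and how `momi` estimates $H$ and $J$ under the **iid** and **series** methods is not shown.

$$
H(\theta) = -\mathbb{E}\left[\nabla^2 \ell_c(\theta)\right], \qquad
J(\theta) = \mathrm{Var}\left[\nabla \ell_c(\theta)\right], \qquad
G(\theta) = H(\theta)\, J(\theta)^{-1}\, H(\theta)
$$

$$
\hat{\theta}_{\mathrm{MCLE}} \;\approx\; \mathcal{N}\left(\theta_0,\; G(\theta_0)^{-1}\right)
= \mathcal{N}\left(\theta_0,\; H(\theta_0)^{-1} J(\theta_0) H(\theta_0)^{-1}\right)
$$

For a true (non-composite) likelihood, $H = J$ and $G$ reduces to the usual Fisher information.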

In [23]:
print "Computing approximate covariance of MCLE..."

## the approximate covariance matrix of the MCLE
mcle_cov = momi.godambe_scaled_inv("series", est_params, seg_sites, demo_func)
print mcle_cov

Computing approximate covariance of MCLE...
[[  2.83614147e-01  -2.22484182e-03   1.25407637e-03   9.62798115e-04
   -2.99532213e-03  -1.35515544e-02]
 [ -2.22484182e-03   4.74041313e-05   2.23852025e-06  -2.87650807e-05
    2.20218631e-05   5.43422018e-04]
 [  1.25407637e-03   2.23852025e-06   1.30070846e-04   8.36411082e-06
   -1.06968511e-04   4.28447892e-04]
 [  9.62798115e-04  -2.87650807e-05   8.36411082e-06   3.88437232e-05
   -1.97469695e-05  -6.14189100e-04]
 [ -2.99532213e-03   2.20218631e-05  -1.06968511e-04  -1.97469695e-05
    1.69207898e-04  -1.55599082e-04]
 [ -1.35515544e-02   5.43422018e-04   4.28447892e-04  -6.14189100e-04
   -1.55599082e-04   1.61661246e-02]]


In [24]:
# marginal confidence intervals (Wald intervals)
print "Approximate 95% confidence intervals for parameters:"

import scipy.stats
conf_lower, conf_upper = scipy.stats.norm.interval(.95, loc = est_params, scale = np.sqrt(np.diag(mcle_cov)))
print pandas.DataFrame({"Truth" : true_params, "Lower" : conf_lower, "Upper" : conf_upper}, columns = ["Lower","Upper","Truth"])

Approximate 95% confidence intervals for parameters:
      Lower      Upper  Truth
0  9.389065  11.476640  10.00
1  0.084766   0.111755   0.10
2  0.235841   0.280547   0.25
3  0.022968   0.047399   0.03
4  0.229557   0.280548   0.25
5  0.670440   1.168844   1.00
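
Each row of this table is the estimate plus or minus about 1.96 standard errors, where the standard error is the square root of the corresponding diagonal entry of `mcle_cov`. The sketch below reproduces the first row by hand; `wald_interval` is a hypothetical helper, and the two numbers are copied from the outputs printed above.

```python
import math

Z_975 = 1.959964  # 97.5% quantile of the standard normal distribution

def wald_interval(estimate, variance, z=Z_975):
    """Two-sided interval: estimate +/- z * sqrt(variance)."""
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width

# First parameter: estimate from the search, variance from mcle_cov[0, 0]
lower, upper = wald_interval(10.43285226, 2.83614147e-01)
```

This gives roughly (9.389065, 11.476640), matching row 0 of the table.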


In [25]:
# Wald test: residual^T * cov^{-1} * residual should be chi-squared with n_params degrees of freedom
print "Approximate p-value of the residuals"

inv_cov = np.linalg.inv(mcle_cov)
# make sure the numerical inverse is still symmetric
assert np.allclose(inv_cov, inv_cov.T)
inv_cov = (inv_cov + inv_cov.T) / 2.0

resids = est_params - true_params
wald_stat = np.dot(resids, np.dot(inv_cov, resids))
print "p = ", 1.-scipy.stats.chi2.cdf(wald_stat, df=len(resids))

Approximate p-value of the residuals
p =  0.513751587806
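
The p-value above is the upper tail probability of a chi-squared distribution. When the number of degrees of freedom is even (here there are 6 parameters), that tail has a simple closed form, sketched below without `scipy`. `chi2_sf_even_df` is a hypothetical helper for illustration only; in practice `scipy.stats.chi2` is the more general tool.

```python
import math

def chi2_sf_even_df(x, df):
    """P(X > x) for X ~ chi-squared with an even number of degrees of freedom.

    Uses the closed form exp(-x/2) * sum_{i < df/2} (x/2)**i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0  # i = 0 term of the sum
    for i in range(1, df // 2):
        term *= half / i    # (x/2)**i / i!, built up incrementally
        total += term
    return math.exp(-half) * total
```

For example, `chi2_sf_even_df(wald_stat, 6)` would play the role of `1.-scipy.stats.chi2.cdf(wald_stat, df=6)` in the cell above.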


In [26]:
## log likelihood ratio test
## has a nonstandard null distribution for composite likelihood, which we simulate from
n_null_sims = 10000
llr_p = momi.log_lik_ratio_p("series", n_null_sims, est_params, true_params, [True] * len(true_params), 
                             seg_sites, demo_func)
print "p-value of log-likelihood ratio"
print llr_p

p-value of log-likelihood ratio
0.6222


In [27]:
## log likelihood ratio test, treating the loci as iid
n_null_sims = 10000
llr_p = momi.log_lik_ratio_p("iid", n_null_sims, est_params, true_params, [True] * len(true_params), 
                             seg_sites, demo_func)
print "p-value of log-likelihood ratio"
print llr_p

p-value of log-likelihood ratio
0.8503
