Skip to content

A short manuscript describing some modelling errors made with msprime

Notifications You must be signed in to change notification settings

jeromekelleher/msprime-model-errors

Repository files navigation

msprime-model-errors

This repo collects together information about two errors in msprime simulations. The manuscript describes the problems and background in detail.

In short, there are two distinct problems here:

  1. The three population Out-of-Africa model given as an example in the msprime documentation was not an accurate description of the true model. In the most ancient time period, migration was allowed to continue between ancestral African and European populations. Fortunately, the difference is a subtle one, and the differences in expected diversity measures between the models is small. However, this code has been extensively copied --- see below.

  2. The simulation pipeline for the analysis for the influential Martin et al paper contained an error, leading to the simulations being of a substantially different model from what was expected.

See the manuscript for more information and analyis.

How did this happen?

The error present in the msprime documentation was found as part of the quality control process for stdpopsim, as described in the preprint.

See these issues for more details:

What do I need to do?

If you have copied incorrect code, you have two basic options to fix it:

Use stdpopsim

Defining demographic models is hard and error-prone. In an attempt to reduce the duplicated effort involved in reimplementing published models multiple times, the PopSim Consortium developed stdpopsim, a standard library of population genetic models. The correct version of the three population Out-of-Africa model is defined and can be run as simply as

   $ stdpopsim HomSap -d OutOfAfrica_3G09 -c chr22 10 -o ooa.trees

There is also a Python API which can plug directly into your existing pipeline and significantly simplify your code.

Fix your model code

The problem with the OOA model is that migration is allowed between two ancestral populations until the indefinite past, when we should only have a single ancestral population. The solution is to add MigrationRateChange events to ensure that this erroneous migration isn't happening.

Here is the correct model with a single randomly mating ancestral population:

import math
import msprime
def out_of_africa():
    # First we set out the maximum likelihood values of the various parameters
    # given in Table 1.
    N_A = 7300
    N_AF = 12300
    N_B = 2100
    N_EU0 = 1000
    N_AS0 = 510
    r_EU = 0.004   # 0.4% EU growth
    r_AS = 0.0055  # 0.55% AS growth
    # Migration rates during the various epochs.
    m_AF_B = 25e-5
    m_AF_EU = 3e-5
    m_AF_AS = 1.9e-5
    m_EU_AS = 9.6e-5
    # Times in Table 1 are provided in years, calculated on the assumption
    # of 25 years per generation: we need to convert back into generations.
    generation_time = 25
    T_AF = 220e3 / generation_time
    T_B = 140e3 / generation_time
    T_EU_AS = 21.2e3 / generation_time
    # We need to work out the starting (diploid) population sizes based on
    # the growth rates provided for these two populations
    N_EU = N_EU0 / math.exp(-r_EU * T_EU_AS)
    N_AS = N_AS0 / math.exp(-r_AS * T_EU_AS)
    # Population IDs correspond to their indexes in the population
    # configuration array. Therefore, we have 0=YRI, 1=CEU and 2=CHB
    # initially.
    population_configurations = [
        msprime.PopulationConfiguration(
            sample_size=0, initial_size=N_AF),
        msprime.PopulationConfiguration(
            sample_size=1, initial_size=N_EU, growth_rate=r_EU),
        msprime.PopulationConfiguration(
            sample_size=1, initial_size=N_AS, growth_rate=r_AS)
    ]
    migration_matrix = [
        [      0, m_AF_EU, m_AF_AS],
        [m_AF_EU,       0, m_EU_AS],
        [m_AF_AS, m_EU_AS,       0],
    ]
    demographic_events = [
        # CEU and CHB merge into B with rate changes at T_EU_AS
        msprime.MassMigration(
            time=T_EU_AS, source=2, destination=1, proportion=1.0),
        msprime.MigrationRateChange(time=T_EU_AS, rate=0),
        msprime.MigrationRateChange(
            time=T_EU_AS, rate=m_AF_B, matrix_index=(0, 1)),
        msprime.MigrationRateChange(
            time=T_EU_AS, rate=m_AF_B, matrix_index=(1, 0)),
        msprime.PopulationParametersChange(
            time=T_EU_AS, initial_size=N_B, growth_rate=0, population_id=1),
        # Population B merges into YRI at T_B
        msprime.MassMigration(
            time=T_B, source=1, destination=0, proportion=1.0),
        msprime.MigrationRateChange(time=T_B, rate=0),   # NB THIS EVENT WAS MISSING!!!!
        # Size changes to N_A at T_AF
        msprime.PopulationParametersChange(
            time=T_AF, initial_size=N_A, population_id=0)
    ]
    return {
        'population_configurations':population_configurations,
        'migration_matrix':migration_matrix,
        'demographic_events':demographic_events}

Copies of Out-of-Africa example

These GitHub repos have a copy of the faulty Out-of-Africa model that was in the msprime documentation. Each link is to a file containing either a direct copy of the code, or code that is obviously derived from it.

The list is probably not exhaustive.

Copies of OOA code that have been fixed

These repos previously had the faulty model and have now been fixed.

Copies of Martin et al pipeline

The analysis for the Martin et al paper is define in armartin/ancestry_pipeline. We found the following repos that have code derived from it:

Publications affected

By searching for the GitHub repository URLs above, we were able to identify a number of papers that may be affected by the erroneous model.

About

A short manuscript describing some modelling errors made with msprime

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published