# 2. Simulating ancestry with msprime

 - [2.1 Basic syntax](#2.1BasicSyntax)
 - [2.2 Specifying contig information](#2.2SpecifyingContigInformation)
 - [2.3 Demographic history](#2.3DemographicHistory)
 - [2.4 Debugging demography](#2.4DebuggingDemography)
 - [2.5 Other ancestry models](#2.5OtherAncestryModels)
 - [2.6 Mixing different ancestry models](#MixingDifferentAncestryModels)

This is the first of two sessions about `msprime`, a backwards-time tree sequence simulator. In this first worksheet, we will learn how to design simulations using customisable models of genomic history. In the next session, we will add mutations to these simulated histories to obtain sequence data.

#### Some relevant papers:
 -  [Efficient coalescent simulation and genealogical analysis for large sample sizes](https://doi.org/10.1371/journal.pcbi.1004842)
 - [Efficient ancestry and mutation simulation with msprime 1.0](https://doi.org/10.1093/genetics/iyab229)
 - [tskit.dev documentation](https://tskit.dev/)
 
Simulations are important in population genetics for many reasons:

**Exploration:**
Simulations allow us to explore the influence of various historical scenarios on observed patterns of genetic variation and inheritance.

**Benchmarking and evaluating methodologies:**
To assess the accuracy of inferential methods, we need test datasets for which the true values of important parameters are known.

**Model training:**
Some methods for ancestry inference are trained on simulated data (e.g. Approximate Bayesian Computation).
This is especially important in studies of complex demographies, where there are many potential parameters and models, making it impractical to specify likelihood functions.

### A brief history of msprime

The first release of `msprime` was an emulation of the popular `ms` coalescent simulator with added support for tree sequences.
However, it has since become an expansive and flexible backwards-in-time simulator for various different models of genetic ancestry and mutation, and even for simplified models of selection.

#### Forwards and backwards simulation

The main characteristic of `msprime` is that it simulates *tree sequences* in *backwards-time*.

<img src="pics/msprime-1.png" width="200" height="200">
<img src="pics/msprime-2.png" width="200" height="200">
<img src="pics/msprime-3.png" width="200" height="200">
<img src="pics/msprime-4.png" width="200" height="200">
<img src="pics/msprime-5.png" width="200" height="200">

The alternative is to use a *forwards-time* simulator like `SLiM`:

<img src="pics/slim-1.png" width="200" height="200">
<img src="pics/slim-2.png" width="200" height="200">
<img src="pics/slim-3.png" width="200" height="200">
<img src="pics/slim-4.png" width="200" height="200">
<img src="pics/slim-5.png" width="200" height="200">
<img src="pics/slim-6.png" width="200" height="200">
<img src="pics/slim-7.png" width="200" height="200">
<img src="pics/slim-8.png" width="200" height="200">
<img src="pics/slim-9.png" width="200" height="200">
<img src="pics/slim-10.png" width="200" height="200">

In general, forwards-time simulation with `SLiM` is *detailed* and *more realistic*,
while backwards-time simulation with `msprime` is *fast* and *efficient*.
With some exceptions, `msprime` is designed for *neutral* simulations, while `SLiM` is better suited to complex simulations, including those involving selection.

However, `pyslim` allows you to combine these via a process called 'recapitation'.
See [this](https://tskit.dev/pyslim/docs/latest/tutorial.html) for details.

<img src="pics/recapitation-1.png" width="400" height="400">
<img src="pics/recapitation-2.png" width="400" height="400">

In [1]:
import msprime
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import SVG

## 2.1 Basic syntax

To simulate a tree sequence in `msprime`, we use the `sim_ancestry()` function:

In [None]:
ts = msprime.sim_ancestry(samples=2, random_seed=1)
SVG(ts.draw_svg())

In a moment, we'll discuss the parameters of this simple simulation and how to change them,
but for now, let's just note what we see:
a tree sequence holding a single, mutation-less tree spanning a single base at position 0, describing relationships between four sample nodes.

Printing the `ts` object shows, alongside all the information we've seen before, a provenance record indicating that the object was generated by `msprime` using the `sim_ancestry()` function.
These records help us keep track of the steps we've performed on our data -- especially handy when we work with more complicated tree sequences, as we will later on.

In [None]:
ts

## 2.2 Specifying contig information
Although we have specified 2 samples, our tree sequence contains 4 sample nodes.
This is because the `samples` argument specifies the number of *individuals* in the sample,
and by default, `sim_ancestry()` assumes diploid organisms.
To change this, use the `ploidy` argument:

In [None]:
ts = msprime.sim_ancestry(samples=2, random_seed=1, ploidy=3)
SVG(ts.draw_svg())

You can see which of these nodes (haplotypes) belong to the same individual by iterating through the samples:

In [None]:
sample_ids = ts.samples()
for s in ts.nodes():
    if s.id in sample_ids:
        print(s)

It's easiest to start thinking about genome lengths in units of nucleotides. By default, we are simulating a sequence length that spans just one of these units.
We can specify a larger region using the `sequence_length` argument:

In [None]:
ts = msprime.sim_ancestry(samples=2, random_seed=1, sequence_length=10)
SVG(ts.draw_svg())

Also, note that our 'tree sequence' consists of just a single tree because we have not yet specified a `recombination_rate`, and the default is 0. 
This is the probability of a recombination event per genomic unit (base), per generation.

In [None]:
ts = msprime.sim_ancestry(
                samples=2,
                random_seed=82,
                sequence_length=10,
                recombination_rate=0.1)
SVG(ts.draw_svg())

We may also wish to specify recombination rates that vary across the genome. 
We do this by creating a `RateMap` object,
which lists recombination rates between defined positions in the sequence.

In [None]:
rate_map = msprime.RateMap(position=[0, 10, 20], rate=[0.01, 0.1])
ts = msprime.sim_ancestry(3, recombination_rate=rate_map, random_seed=2)
SVG(ts.draw_svg())
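
To get a feel for what a piecewise-constant rate map implies, it can help to work out the total and mean recombination rate by hand. The sketch below uses plain Python rather than the `RateMap` API, with the same positions and rates as above; the variable names are only illustrative.

```python
# Interval boundaries and the rate that applies within each interval.
positions = [0, 10, 20]
rates = [0.01, 0.1]

# Length of each interval: [0, 10) and [10, 20).
spans = [right - left for left, right in zip(positions[:-1], positions[1:])]

# Summing rate * length gives the total recombination "mass" of the map,
# roughly the expected number of crossovers per lineage per generation.
total_mass = sum(rate * span for rate, span in zip(rates, spans))
mean_rate = total_mass / (positions[-1] - positions[0])

print("total mass:", total_mass)
print("mean rate per base:", mean_rate)
```

Most of the mass here comes from the second interval, so we would expect most breakpoints to fall between positions 10 and 20.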

Note that it's also possible to simulate gene conversion in `msprime`.
If this is of interest, have a look at the API documentation for the `gene_conversion_rate` and `gene_conversion_tract_length` arguments,
and [this](https://tskit.dev/msprime/docs/stable/ancestry.html?highlight=gene%20conversion#gene-conversion) short tutorial.


By default, recombination events are assigned to integer locations along the sequence. However, there may be situations where you want to model the genome using continuous coordinates. In this case, pass the argument `discrete_genome=False`:

In [None]:
ts = msprime.sim_ancestry(
    samples=2, random_seed=28, sequence_length=100,
    recombination_rate=0.01, discrete_genome=False)
SVG(ts.draw_svg())

*Exercise:* Our study organisms have a ploidy of 3 and a chromosome of length 100.
Between the bases at positions 50 and 60, there is a recombination 'hotspot' with a higher recombination rate of 0.1.
Modify the following code to simulate chromosomes for 2 such organisms.

In [None]:
rate_map = msprime.RateMap(position=[0, 100], rate=[0.01])
ts = msprime.sim_ancestry(samples=3,
                          recombination_rate=rate_map,
                          random_seed=193)
SVG(ts.draw_svg())

## 2.3 Demographic history

So far, we have been simulating samples from a single population of a constant size, which isn’t particularly exciting! One of the strengths of `msprime` is that it can be used to specify quite complicated models of demography and population history with a simple Python API.

### 2.3.1 Population size and time scaling

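So far, we've assumed that our samples and all their ancestors belong to a single population of constant size. When no size is given, `sim_ancestry()` uses a population size of 1, so node times are measured in generations for a population with $N_e=1$. For diploids, one classical coalescent time unit corresponds to $2N_e$ generations. (Note that `msprime` also adjusts this scaling for different ploidies.)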

In [None]:
ts = msprime.sim_ancestry(samples=2, random_seed=234)
SVG(ts.draw_svg(y_axis=True))

In [None]:
tree = ts.first()
print("Total branch length:", tree.total_branch_length)
print("Time at root:", ts.tables.nodes.time[tree.root])

When we specify the population size explicitly using the `population_size` argument, all times and branch lengths are rescaled accordingly.

In [None]:
ts = msprime.sim_ancestry(
    samples=2,
    random_seed=234,
    population_size=2)
SVG(ts.draw_svg(y_axis=True))

In [None]:
tree = ts.first()
print("Total branch length:", tree.total_branch_length)
print("Time at root:", ts.tables.nodes.time[tree.root])

Note that this population size should be the number of *individuals*, not the number of genomes. `msprime` uses the supplied `ploidy` to make all the other necessary adjustments under the hood.

In [None]:
ts = msprime.sim_ancestry(
    samples=1,
    ploidy=4,
    random_seed=234,
    population_size=2)
SVG(ts.draw_svg())

In [None]:
tree = ts.first()
print("Total branch length:", tree.total_branch_length)
print("Time at root:", ts.tables.nodes.time[tree.root])
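
A single simulated tree is noisy, so it can be useful to compare the root times above against their expectation. The sketch below uses the textbook coalescent result rather than anything from the `msprime` API, and assumes that two lineages take on average `ploidy * population_size` generations to coalesce. The helper name `expected_root_time` is made up for this illustration.

```python
def expected_root_time(num_sample_nodes, population_size, ploidy=2):
    """Expected time to the most recent common ancestor, in generations.

    Assumes a pair of lineages coalesces after ploidy * population_size
    generations on average (standard coalescent, constant size).
    """
    pairwise = ploidy * population_size
    return 2 * pairwise * (1 - 1 / num_sample_nodes)


# (individuals sampled, ploidy, population size) for the three runs above.
for n_individuals, ploidy, size in [(2, 2, 1), (2, 2, 2), (1, 4, 2)]:
    n_nodes = n_individuals * ploidy
    expected = expected_root_time(n_nodes, size, ploidy)
    print(f"ploidy={ploidy}, N={size}: expected root time {expected}")
```

Under these assumptions, doubling either the population size or the ploidy doubles the expected root time, which is the rescaling we see in the simulations.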

### 2.3.2 Population structure

To simulate using more complicated models of demographic history, we will need to create a `msprime.Demography` object.
`msprime` supports simulation from multiple discrete populations, each of which is initialized via the `add_population()` method.
For each population, you can specify an initial (effective) population size, an exponential growth rate, a name and a description. Sample sizes are given separately, via the `samples` argument of `sim_ancestry()`.

We could have specified our previous simulation with the following code:

In [None]:
dem = msprime.Demography()
dem.add_population(
    name="my_pop",
    initial_size=1
)
dem

In [None]:
ts = msprime.sim_ancestry(
    samples=2,
    random_seed=234,
    demography=dem
)
SVG(ts.draw_svg())

We can also use `Population` objects to specify an exponential growth rate of the population per generation (i.e. forwards in time).

<img src="pics/worksheet2-dem1.png" width="400" height="400">

In [None]:
dem = msprime.Demography()
dem.add_population(
    name="my_pop",
    initial_size=1e5,
    growth_rate=0.1
)
dem

In [None]:
ts = msprime.sim_ancestry(
    samples=2,
    random_seed=11,
    demography=dem
)
SVG(ts.draw_svg())

Note that the recent branches are much longer than the older ones.
This is what we expect to see in a growing population.
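
To see why, consider how quickly the population shrinks as we look backwards in time. The following plain-Python sketch assumes the usual exponential model, in which the size `t` generations ago is `initial_size * exp(-growth_rate * t)`; the helper `size_at` is hypothetical and not part of `msprime`.

```python
import math


def size_at(time_ago, initial_size, growth_rate):
    """Population size `time_ago` generations in the past."""
    return initial_size * math.exp(-growth_rate * time_ago)


# Same parameters as the growing population above.
for t in [0, 10, 50, 100]:
    print(f"{t:>3} generations ago: size {size_at(t, 1e5, 0.1):.2f}")
```

With these parameters, the population drops from 100,000 individuals to fewer than 10 within 100 generations, so lineages wait a long time in the large recent population and then coalesce rapidly once the population is small.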

We can add any number of populations into our `Demography` objects.

In [None]:
dem = msprime.Demography()
dem.add_population(
    name="R",
    description="Red population",
    initial_size=500
)
dem.add_population(
    name="B",
    description="Blue population",
    initial_size=500,
    growth_rate=0.0001
)
dem

<img src="pics/worksheet2-dem2.png" width="400" height="400">

However, this simulation will run forever unless we also specify some migration between the groups!

In [25]:
# # without an end_time, THIS WILL RUN FOREVER!

# ts = msprime.sim_ancestry(
#   samples={"R" : 3, "B" : 3}, 
#   demography=dem,
#   random_seed=12,
#   sequence_length=1000, 
#   )

To understand why, recall that `msprime` is a backwards-time simulator.
Starting from the present day, the simulation will move further back in time with each step of the simulation, simulating until all samples have coalesced to a single common ancestor at each genomic location. 
However, with no migration between our two populations, samples from the red population will never coalesce with samples from the blue population.
To fix this, let’s add some migration events.


### 2.3.3 Migration

With `msprime`, you can specify continual rates of migration between populations, as well as admixture events, divergences and one-off mass migrations.


**Migration rates** are passed to our `msprime.Demography` object individually via the `set_migration_rate()` method.
This allows us to specify the expected number of migrants moving from population `dest` to population `source` per generation, divided by the size of population `source`. When this rate is small (close to 0), it is approximately equal to the fraction of population `source` that consists of new migrants from population `dest` in each generation. 

Note that our `dest` population is the one that the migrants originate from, not the one they migrate to!
Again, this 'backwards' terminology is used because we are simulating our demographic history *backwards-in-time*, from the present to the past. 

In [None]:
dem = msprime.Demography()
dem.add_population(
    name="R",
    description="Red population",
    initial_size=500
)
dem.add_population(
    name="B",
    description="Blue population",
    initial_size=500,
    growth_rate=0.0001
)

# Set migration rates.
dem.set_migration_rate(source=0, dest=1, rate=0.05)
dem.set_migration_rate(source=1, dest=0, rate=0.02)

dem

For instance, the `Demography` object above specifies that in each generation, approximately 5% of the red population (R) consists of migrants from the blue population (B), and approximately 2% of the blue population consists of migrants from the red population.

In [None]:
# Simulate.
ts = msprime.sim_ancestry(
  samples={"R" : 5, "B" : 5},
  demography=dem,
  sequence_length=1e7,
  random_seed=145)
ts

One consequence of explicitly specifying `Population` objects is that each of the nodes/haplotypes in our simulated tree sequence will now belong to one of our specified populations:

In [None]:
ts.tables.nodes

The population table holds more detail about these populations,
including metadata that we can manually pass into the `Demography` object as a dictionary.

In [None]:
ts.tables.populations

We'll use this information to draw the simulated dataset with all of its sample and ancestral haplotypes coloured by population label:

In [None]:
colour_map = {0:"red", 1:"blue"}
node_colours = {u.id: colour_map[u.population] for u in ts.nodes()}
for tree in ts.trees():
    print("Tree on interval:", tree.interval)
    # The code below will only work in a Jupyter notebook with SVG output enabled.
    display(SVG(tree.draw(node_colours=node_colours, width=500, height=400)))

More coalescences are happening in the blue population than in the red population. This makes sense given that the blue population sends more migrants to the red population than vice versa, so, backwards in time, lineages move from red to blue more often than from blue to red.

### 2.3.4 Changing migration rates

We can change any of the migration rates at any time in the simulation. To do this, we'll use the `add_migration_rate_change()` method on our demography object.
This will specify the populations whose migration rates are to be changed, the time of the change and the new migration rate.

For instance, say we wanted to specify that in each generation prior to `time=100`, 1% of the red population (0) consisted of migrants from the blue population (1).

In [None]:
dem.add_migration_rate_change(time=100, rate=0.01, source=0, dest=1)
dem

The output above shows that we have successfully added our first demographic event to our `Demography` object, a migration rate change. We are now ready to simulate:

In [None]:
ts = msprime.sim_ancestry(
  samples={"R" : 4, "B" : 4},
  demography=dem,
  sequence_length=1000,
  random_seed=63461
)
ts

In [None]:
colour_map = {0:"red", 1:"blue"}
node_colours = {u.id: colour_map[u.population] for u in ts.nodes()}
for tree in ts.trees():
    print("Tree on interval:", tree.interval)
    # The code below will only work in a Jupyter notebook with SVG output enabled.
    display(SVG(tree.draw(node_colours=node_colours, width=500, height=400)))

Prior to the 100th generation, the red population now sends more migrants to the blue population than it receives.
Since more blue nodes have ancestors in the red population than vice versa,
we now see more coalescences in the red population.

### 2.3.5 Admixture

It is also easy to specify admixture and divergence events with `msprime`. Suppose we wanted to specify our demography so that our admixed population arose 200 generations ago, with 60% of the new population being migrants from the red ancestral population, and 40% being migrants from the blue ancestral population.

<img src="pics/worksheet2-dem3.png" width="400" height="400">

We can do this by using the `add_admixture()` method on our demography object. We must supply a list of ancestral populations participating in the admixture, and a list of the same size specifying the proportions of migrants from each of these populations.

In [None]:
dem = msprime.Demography()
dem.add_population(
    name="AncestralPop0", description="Plotted in red.", initial_size=1000, growth_rate=0)
dem.add_population(
    name="AncestralPop1", description="Plotted in blue.", initial_size=1000, growth_rate=0)
dem.add_population(
    name="AdmixedPop", description="Plotted in purple.", initial_size=700)
dem.set_migration_rate(
    source=0, dest=1, rate=0.001)
dem.set_migration_rate(
    source=1, dest=0, rate=0.001)

# Specify admixture event.
dem.add_admixture(time=200, derived="AdmixedPop", ancestral=["AncestralPop0", "AncestralPop1"], proportions=[0.6, 0.4])
dem

In this simulated sample, all of the recent ancestral haplotypes belong to the admixed population, and all of those prior to the admixture event belong to one of the ancestral populations.

In [None]:
ts = msprime.sim_ancestry(
  samples={"AncestralPop0" : 0, "AncestralPop1" : 0, "AdmixedPop" : 6},
  demography=dem,
  sequence_length=1000,
  random_seed=63
)

print("Populations of nodes from time < 200:")
print([u.population for u in ts.nodes() if u.time < 200])
print("Populations of nodes from time >= 200:")
print([u.population for u in ts.nodes() if u.time >= 200])
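
Viewed backwards in time, the admixture event can be thought of as each lineage in the admixed population independently choosing an ancestral population, with probabilities given by `proportions`. The toy sketch below illustrates that idea in plain Python. It is not how `msprime` is implemented, and the helper `assign_ancestral` is hypothetical.

```python
import random


def assign_ancestral(num_lineages, ancestral, proportions, seed=None):
    """Pick an ancestral population for each lineage, weighted by proportions."""
    rng = random.Random(seed)
    return rng.choices(ancestral, weights=proportions, k=num_lineages)


draws = assign_ancestral(
    10_000, ["AncestralPop0", "AncestralPop1"], [0.6, 0.4], seed=1
)
fraction = draws.count("AncestralPop0") / len(draws)
print("Fraction of lineages assigned to AncestralPop0:", fraction)
```

With many lineages, the fraction sent to each ancestral population approaches the specified proportion; with only a handful of lineages, as in the simulation above, the split can deviate substantially by chance.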

In [None]:
colour_map = {0:"red", 1:"blue", 2:"purple"}
node_colours = {u.id: colour_map[u.population] for u in ts.nodes()}
for tree in ts.trees():
    print("Tree on interval:", tree.interval)
    # The code below will only work in a Jupyter notebook with SVG output enabled.
    display(SVG(tree.draw(node_colours=node_colours, width=500, height=400)))

Admixtures and population splits are special types of demographic events that affect the state of some of the defined populations, in addition to moving lineages between populations. The output below shows that by adding the admixture event, we are triggering a change in the state of `AdmixedPop` at `time = 200`; the population is active at the start of the simulation, but becomes inactive for all steps of the simulation beyond time 200.

In [None]:
dem

This means that, for example, adding any demographic event that affects `AdmixedPop` further back in time than the admixture (that is, at a time greater than 200) will produce an error:

In [None]:
# # THIS WON'T WORK
# dem.add_migration_rate_change(time=300, rate=0.01, source="AncestralPop0", dest="AdmixedPop");
# ts = msprime.sim_ancestry(
#   samples={"AncestralPop0" : 0, "AncestralPop1" : 0, "AdmixedPop" : 6},
#   demography=dem,
#   sequence_length=1000,
#   random_seed=63,
#   recombination_rate=1e-7)

### 2.3.6 Population splits

We can also simulate population divergences with `msprime`. Suppose we want to model a situation where all lineages from multiple populations are migrants from a single ancestral population at a single point in time.

<img src="pics/worksheet2-dem4.png" width="400" height="400">

We’ll specify this with the `add_population_split()` method. We need to know the time of the event, and the IDs of the derived and ancestral populations participating in the divergence event.

Notice that in this case, we do not need to provide proportions as we did in the case of admixture. This makes sense when you consider that `msprime` simulates backwards-in-time: all lineages in all of the derived populations originate from the ancestral population in a split event. Any differences in ‘quantities’ of migrants must be modelled by sizes of the derived populations at the time of the split.

In [None]:
dem = msprime.Demography()
dem.add_population(name="R", description="Plotted in red.", initial_size=500)
dem.add_population(name="B", description="Plotted in blue.",initial_size=500)
dem.add_population(name="AncestralPopulation", description="Plotted in green.", initial_size=500)

# Add the population split.
dem.add_population_split(time=100, derived=["R","B"], ancestral="AncestralPopulation")
dem

Population splits will also modify the state of each of the derived populations, changing them from active to inactive at the time of the split.

In [None]:
ts = msprime.sim_ancestry(
  samples={"R" : 3, "B" : 3, "AncestralPopulation" : 0},
  demography=dem,
  sequence_length=1000,
  random_seed=63
)

print("Populations of nodes from time < 100:")
print([u.population for u in ts.nodes() if u.time < 100])
print("Populations of nodes from time >= 100:")
print([u.population for u in ts.nodes() if u.time >= 100])

<img src="pics/worksheet2-ex.png" width="400" height="400">

*Exercise:* The code below specifies a simulation of two populations in which a red population and a blue population are created by a population split at time 200.

Can you add a third contemporary population with the same effective population size, the green population, so that the red and green populations are created by a more recent population split at time 150? (You may need to create some other ancestral populations too.)
The diagram above may help you design your simulation.

In [None]:
# Setup.
dem = msprime.Demography()
dem.add_population(name="R", description="Plotted in red.", initial_size=500)
dem.add_population(name="B", description="Plotted in blue.",initial_size=500)
dem.add_population(name="AncestralPopulation", description="Plotted in yellow.", initial_size=500)

# Add the population split.
dem.add_population_split(time=200, derived=["R","B"], ancestral="AncestralPopulation")
dem

In [58]:
# Simulate!
ts_ex = msprime.sim_ancestry(
  samples={"R" : 1, "B" : 1, "AncestralPopulation" : 0},
  demography=dem,
  sequence_length=1000,
  random_seed=63
)

In [None]:
# Plot.
colour_map = {0:"red", 1:"blue", 2:"yellow"}
node_colours = {u.id: colour_map[u.population] for u in ts_ex.nodes()}
for tree in ts_ex.trees():
    print("Tree on interval:", tree.interval)
    # The code below will only work in a Jupyter notebook with SVG output enabled.
    display(SVG(tree.draw(node_colours=node_colours, width=500, height=400)))

### 2.3.7 Changing population sizes or growth rates

We also may wish to specify changes to rates of population growth, or sudden changes in population size at a particular time (e.g. bottlenecks). Both of these can be specified by applying the `add_population_parameters_change()` method to our `Demography` object.

<img src="pics/worksheet2-dem5.png" width="400" height="400">

In [None]:
dem = msprime.Demography()
dem.add_population(name="R", description="Plotted in red.", initial_size=500)
dem.add_population(name="B", description="Plotted in blue.",initial_size=500)
dem.set_migration_rate(source=0, dest=1, rate=0.05)
dem.set_migration_rate(source=1, dest=0, rate=0.02)

# Bottleneck in Population 0 between 50 - 150 generations ago.
dem.add_population_parameters_change(time=50, initial_size=250, population=0)
dem.add_population_parameters_change(time=150, initial_size=500, population=0)

# Exponential growth in Population 1 up until 100 generations ago.
dem.add_population_parameters_change(time=100, growth_rate=0.01, population=1)

# Sort events, since we've added some out of time order.
dem.sort_events()

In [None]:
# Simulate.
ts = msprime.sim_ancestry(
    samples={"R" : 3, "B" : 3},
    demography=dem,
    sequence_length=1000,
    random_seed=63461
)

Note that because `msprime` simulates backwards-in-time, parameter changes must be interpreted backwards-in-time as well. For instance, the population growth event in the example above specifies continual growth in the early history of population 1 up until 100 generations in the past.

## 2.4 Debugging demography

To help you spot any mistakes in your specified demography, `msprime` provides a debugger that prints out your population history in a more human-readable form.

In [None]:
my_history = msprime.DemographyDebugger(demography=dem)
my_history

See [the documentation](https://tskit.dev/msprime/docs/stable/demography.html#sec-demography-numerical) for more examples.

## 2.5 Other ancestry models

When might `msprime`'s default (Hudson coalescent) model of genealogy not be ideal for you? Consider the following assumptions it makes:

 - Selective neutrality
 - Random mating and survival outcomes
 - Sample is small relative to population size
 
However, it is possible to simulate from a selection of other models in `msprime` as well! 
We'll look at just one of the possible alternative models,
but you can see the full list [here](https://tskit.dev/msprime/docs/stable/ancestry.html#sec-ancestry-quickref).

### 2.5.1 Discrete-time Wright-Fisher

DTWF simulations are less efficient than coalescent simulations, but should perform better

 - on analyses of very recent history (especially for haplotype-based information like identity-by-descent and local ancestry)
 - on small populations, or when sample size is a large fraction of the population size
 - when modelling very long range correlations across chromosomes is important
 
See [this paper](https://doi.org/10.1371/journal.pgen.1008619) for a more comprehensive comparison between DTWF and coalescent simulations.

Consider the following (coalescent) simulation of two individuals in a population of two.

In [None]:
ts = msprime.sim_ancestry(
                samples=2,
                population_size=2,
                random_seed=82,
                sequence_length=20,
                recombination_rate=0.3)
SVG(ts.draw_svg())

Notice that we now have more than 4 distinct chromosomes within a single generation, even though a population of two diploid individuals holds only 4. Indeed, some of these chromosomes inherit from multiple ancestors within a single generation.

In [None]:
ts.tables.nodes

This is no longer a problem when we use a discrete-time Wright-Fisher simulation:

In [None]:
# DTWF sim
ts = msprime.sim_ancestry(
                samples=2,
                population_size=2,
                model=msprime.DiscreteTimeWrightFisher(),
                random_seed=82,
                sequence_length=10,
                recombination_rate=0.2)
SVG(ts.draw_svg())

In [None]:
ts.tables.nodes

### 2.5.2 Combining different models of ancestry

You can also combine multiple ancestry models inside a single `msprime` simulation. Here's a simple example in which the most recent 50 generations are simulated under a DTWF model, and more distant history is simulated under a coalescent model:

In [None]:
models=[msprime.DiscreteTimeWrightFisher(duration=50),
        msprime.StandardCoalescent()]

In [None]:
ts = msprime.sim_ancestry(
    10,
    population_size=100,
    sequence_length=1e6,
    recombination_rate=1e-8,
    model=models,
    random_seed=6789
)

In [None]:
ts.tables.nodes