In [None]:
# Please execute this cell (shift+<Return>) before starting the workbook
# this should print out "Your notebook is ready to go"
import sys
import tskit
import msprime

if "pyodide" in sys.modules: # if running in-browser (e.g. using JupyterLite)
    import tqdm
    import micropip
    await micropip.install(['jupyterquiz', "demesdraw"])

import genealogical_analysis_workshop

workbook = genealogical_analysis_workshop.setup_workbook2()  # this may take a minute or two
display(workbook.setup)

# Simulating ancestry with `msprime`

We will use <a href="https://tskit.dev/msprime/docs/stable/intro.html">msprime</a>, a genetic simulator, to generate genealogical trees with mutations occurring along them. We will simulate genomic data sets using two demographic models: (1) a simple model with a single, panmictic population and (2) a stepping-stone model with eight populations. `Msprime` can simulate data under more complicated demographic models, but this will not form part of this workbook (see the end of this notebook for links and tutorials).
 
### Why do we do simulations in population genetics?

**A null model:**
It is often useful to have a null model against which to compare data. Neutral simulations are commonly used for this purpose.

**Exploration:**
Simulations allow us to explore the influence of various historical scenarios on observed patterns of genetic variation and inheritance.

**Benchmarking and evaluating methodologies:**
To assess the accuracy of inferential methods, we need test datasets for which the true values of important parameters are known.

**Model training:**
Some methods for ancestry inference are trained on simulated data (e.g., Approximate Bayesian Computation).
This is especially important in studies of complex demographies, where there are many potential parameters and models, making it impractical to specify likelihood functions.

### A brief history of msprime and tskit

The first release of `msprime` was an emulation of the popular `ms` coalescent simulator, and introduced the tree sequence format. Later, the tree sequence component was split off into `tskit`, a separate library for general use. `Msprime` has subsequently evolved into an expansive and flexible *backwards-time* simulator for various different models of genetic ancestry and mutation, and even for simplified models of selection.
There is also a popular *forwards-time* simulator named `SLiM`, which is not covered in this workshop but which also outputs tree sequences. The availability of `msprime` and `SLiM` has given rise to powerful hybrid approaches, such as [recapitation](https://tskit.dev/pyslim/docs/latest/tutorial.html#sec-tutorial-recapitation), that use the tree sequence format to combine backwards-time and forwards-time simulations.

### Backwards simulation

The main characteristic of `msprime` is that it simulates *tree sequences* in *backwards-time*. Although this is usually much more efficient than simulating forwards in time, it is more restricted in the sort of scenarios that can be modelled.

<img src="pics/msprime-1.png" style="display:inline-block" width="190" height="190">
<img src="pics/msprime-2.png" style="display:inline-block" width="190" height="190">
<img src="pics/msprime-3.png" style="display:inline-block" width="190" height="190">
<img src="pics/msprime-4.png" style="display:inline-block" width="190" height="190">
<img src="pics/msprime-5.png" style="display:inline-block" width="190" height="190">

In [None]:
# Import extra libraries for plotting, analysis, etc.

import random
import numpy as np
import matplotlib.pyplot as plt

import demesdraw

## Simulating a tree sequence

To perform simulations using `msprime`, we first simulate a tree sequence without mutations (i.e. the genetic genealogy) using `sim_ancestry()`. If [necessary](https://tskit.dev/tutorials/no_mutations.html), we can then add neutral mutations to the tree sequence using `sim_mutations()`.

In [None]:
ts = msprime.sim_ancestry(
    samples=2, # Two diploid individuals
    random_seed=1
)
ts.draw_svg()

In [None]:
ts # Note there are no mutations yet

## Specifying information about sample genomes
Although we have specified 2 samples, our tree sequence contains 4 sample nodes (i.e. sample genomes).
This is because the `samples` argument specifies the number of *individuals* in the sample,
and by default, `sim_ancestry()` assumes diploid organisms.
To change this, use the `ploidy` argument:

In [None]:
ts = msprime.sim_ancestry(
    samples=2,
    ploidy=3, # Two triploid individuals
    random_seed=1
)
ts.draw_svg()

It's easiest to start thinking about genome lengths in units of nucleotides. By default, we are simulating a sequence length that spans just one of these units.
We can specify a larger region using the `sequence_length` argument:

In [None]:
ts = msprime.sim_ancestry(
    samples=2,
    sequence_length=10_000,
    random_seed=1
)
ts.draw_svg()

Also, note that our 'tree sequence' consists of just a single tree. This is because we have not yet specified a `recombination_rate`, which is 0 by default.
The recombination rate is the probability of a recombination event per genomic unit (base), per generation.

In [None]:
ts = msprime.sim_ancestry(
    samples=2,
    sequence_length=10_000,
    recombination_rate=1e-5, # Allow for recombination: this is quite a high rate
    random_seed=100
)
ts.draw_svg()

## Basic population information

Finally, we need to say something about the dynamics of the wider population from which our samples have been drawn. The default is to assume a single randomly mating population of fixed size (later we will see how to change this). In a simple model like this, most users will therefore want to specify a `population_size`. Population geneticists sometimes refer to this as $N_e$, or the "effective population size" in a panmictic population.

<div class="alert alert-block alert-info"><b>Note:</b>
The standard <code>msprime</code> model is a theoretical one that allows the population size to be any floating point number greater than 0. In fact, if not specified, the population size in msprime defaults to 1, which sounds biologically impossible, but simply produces a result identical to that of a larger population with the time units scaled differently.
</div>

In [None]:
ts_small = msprime.sim_ancestry(
    samples=2,
    sequence_length=1_000,
    recombination_rate=1e-8, # Small recombination rate
    population_size=20_000, # Rough "effective population size" suitable for humans
    random_seed=107
)
ts_small  # Display summary to screen

In [None]:
# Draw the trees
ts_small.draw_svg()

<dl><dt>Exercise 1</dt><dd>To illustrate the speed of <code>msprime</code>, simulate a large tree sequence of 20,000 diploid individuals, each with a 1 Mbp long genome, using a recombination rate of 1e-8 from a population of size 20,000. Run the simulation with a random seed of 2022, save it to the variable <code>ts</code>, and output the summary table on the screen.
<div class="alert alert-block alert-info"><b>Tip:</b>
    Make sure you DON'T display the SVG trees! Each tree is huge, and there are a lot of them.</div></dd></dl>

In [None]:
# Exercise 1: Set `ts` to a new large tree sequence, generated using msprime.sim_ancestry() with
# specific parameters (random_seed=2022, etc.), then output the tree sequence summary table to screen.


In [None]:
workbook.Q1()

## Simulating mutations along the tree sequence

Next, to generate genetic variation, we add neutral mutations by applying `sim_mutations()` to the existing `TreeSequence` object. At minimum, you must supply a per-base, per-generation mutation rate.

In [None]:
mts_small = msprime.sim_mutations(
    ts_small, # Use the small tree sequence so that we can plot it easily
    rate=2e-7, # Set an unusually high mutation rate per generation per base pair
    random_seed=103
)
mts_small.draw_svg()

In [None]:
mts_small # Note there are 10 sites with variation but 11 mutations

<dl class="exercise"><dt>Exercise 2</dt>
<dd>Print out the mutations table for the <code>mts_small</code> tree sequence</dd></dl>

In [None]:
# Exercise 2: Print out the mutations table. Notice that one site has experienced multiple mutations.


In [None]:
workbook.Q2()

Adding mutations to a tree sequence is usually very fast, and the resulting tree sequence is a highly efficient way of storing genomic data. For instance, adding 10,000 variable sites to the large tree sequence you simulated in Exercise 1 should take less than a second, and corresponds to storing a samples-by-sites matrix of 40,000 by 10,000 values (i.e. 400 million haploid genotypes). In tree sequence format, this only takes up about 8 MB of space. The equivalent VCF would be thousands or tens of thousands of times larger.

<dl class="exercise"><dt>Exercise 3</dt>
<dd>Add mutations to the large tree sequence generated in Exercise 1 using a mutation rate of 1e-8. Run the simulation using a random seed of 2022, and print a summary table of the resulting tree sequence to the screen</dd></dl>

In [None]:
# Exercise 3: Add lots of mutations to the huge genealogy you simulated earlier.


In [None]:
workbook.Q3()

The advent of tree sequences together with efficient simulation frameworks like `msprime` makes it possible for the first time to simulate and store huge (population-scale) genealogies, and synthesise vast amounts of genome sequence data.

## More complicated simulations

`Msprime` allows for more realistic simulations. We have no time to cover all the additional options implemented in `msprime`, but they are explained in the online [API documentation](https://tskit.dev/msprime/docs/stable/intro.html). Below are some interesting and useful options:

 - **Recombination rate variation across the genome**
 We can introduce this by creating a `RateMap` object, which lists recombination rates between defined positions in the sequence. See this [page](https://tskit.dev/msprime/docs/stable/rate_maps.html) for more about `RateMap` objects.
```
recomb_rate_map = msprime.RateMap(position=[0, 10, 20], rate=[0.01, 0.1])
ts = msprime.sim_ancestry(3, recombination_rate=recomb_rate_map, random_seed=2)
ts.draw_svg()
```
 - **Mutation rate variation across the genome** This can also be introduced using a `RateMap` object, which lists mutation rates between defined positions in the sequence. See this [page](https://tskit.dev/msprime/docs/stable/rate_maps.html) for more detail.
```
# The rate map should span the same sequence length as the tree sequence
ts = msprime.sim_ancestry(3, sequence_length=100, random_seed=2)
mutation_rate_map = msprime.RateMap(position=[0, 40, 60, 100], rate=[0.01, 0.1, 0.01])
mts = msprime.sim_mutations(ts, rate=mutation_rate_map, random_seed=104)
mts.draw_svg()
```
```
 - **Mutation models** There are [pre-defined models](https://tskit.dev/msprime/docs/stable/mutations.html) (e.g., Jukes-Cantor). You can also use your own custom models.
 - **Stacking mutations** Mutations can be simulated on the same tree sequence under different models and/or parameters and/or over different time periods.
 - **Gene conversion** See the API documentation for the `gene_conversion_rate` and `gene_conversion_tract_length` arguments, and [this](https://tskit.dev/msprime/docs/stable/ancestry.html#gene-conversion) short illustration of use.
 - **Alternative simulation models** In `msprime` the default simulation model is the standard ("Hudson") coalescent, which is appropriate for relatively small samples from large populations. It is also possible to use [other models](https://tskit.dev/msprime/docs/stable/ancestry.html#specifying-ancestry-models) such as the generation-by-generation "Discrete Time Wright-Fisher" (DTWF) model, which is slower but more suitable when sampling most of the population.
 - **Continuous coordinates** By default, the recombination and mutation events will be assigned to integer locations along the sequence. However, there may be situations where you want to model the genome using continuous coordinates. In this case, use the `discrete_genome=False` argument:
```
ts = msprime.sim_ancestry(
    samples=2,
    random_seed=28,
    sequence_length=100,
    recombination_rate=0.01,
    discrete_genome=False
)
ts.draw_svg()
```

## Simulations under custom demographic models

To run simulations under more complicated models of demographic history, we need to create a `msprime.Demography` object.

`Msprime` can simulate multiple discrete populations, sometimes called "demes". It provides a number of [built-in theoretical models](https://tskit.dev/msprime/docs/stable/demography.html#quick-reference), and a [library of community-validated demographic models](https://elifesciences.org/articles/54967) for a number of common species, including humans, is available in the [stdpopsim](https://popsim-consortium.github.io/stdpopsim-docs/stable/tutorial.html#running-stdpopsim-with-the-python-interface-api) package. Alternatively, you can create your own bespoke model or modify an existing one, either within `msprime` (see its [extensive documentation](https://tskit.dev/msprime/docs/stable/demography.html)) or, more portably, using the [demes specification](https://popsim-consortium.github.io/demes-spec-docs/main/introduction.html) for demographic models.

Since constructing bespoke demographic models can be quite involved, here we will simulate some data using the built-in stepping-stone model, specifying eight populations. First, we create the `msprime.Demography` object:

In [None]:
deme_size = 500  # Population size of each deme
num_demes = 8
num_deme_samples = 25

demography = msprime.Demography.stepping_stone_model(
    [deme_size] * num_demes,
    migration_rate=0.05
)

To visualise this, the easiest thing is to convert it to the `demes` format, which we can then draw using the rather nifty `demesdraw` package. This plots the population size (on the x-axis) against time (on the y-axis).

In [None]:
demesdraw.tubes(
    msprime.Demography.to_demes(demography),  # Convert to standard "demes" format
    positions={f"pop_{i}": i * deme_size * 3 for i in range(num_demes)},
    seed=3,
)
plt.show()

Since all 8 populations are of constant size, they appear as fixed-width "tubes" through time. You can see that each has a small amount of migration (arrows) from each population to the two adjacent ones. Here's how to actually simulate this demographic setup.
<div class="alert alert-block alert-info"><b>Note:</b>
    Because of limited migration, we expect the closest relatives of an individual to be found in the same population. As for the other populations, if you look carefully at the coloured arrows, you will see that the first and last population can migrate between each other, so the populations can actually be thought of as lying on a circle. All other things being equal, from the point of view of an individual in <code>pop_0</code>, on average <code>pop_1</code> and <code>pop_7</code> should contain the next closest relatives, then <code>pop_2</code> and <code>pop_6</code>, then <code>pop_3</code> and <code>pop_5</code>; finally (being the greatest number of hops away) <code>pop_4</code> should contain, on average, the most distant relatives.</div>

In [None]:
mu = 1e-8 # Human-like mutation rate

ts = msprime.sim_ancestry(
    {population_id: num_deme_samples for population_id in range(num_demes)},
    sequence_length=5e6, # 5Mb
    demography=demography,
    recombination_rate=1e-8, # Human-like recombination rate
    random_seed=123,
)

mts = msprime.sim_mutations(
    ts,
    rate=mu,
    random_seed=321
)
mts  # Display it to screen: it should have 8 populations

A venerable statistic that is often used to measure genetic differentiation between subpopulations is the fixation index, or $F_{st}$. If there is no differentiation, its value should be 0. Although $F_{st}$ has a maximum of 1, values above 0.15 are usually taken as indicating very significant population differentiation:

In [None]:
pop_0_sample_ids = mts.samples(population=0)
pop_4_sample_ids = mts.samples(population=4)

print(
    "Fst from variable sites:",
    mts.Fst([pop_0_sample_ids, pop_4_sample_ids]),
    "\nFst from genealogical branch lengths:",
    mts.Fst([pop_0_sample_ids, pop_4_sample_ids], mode="branch")
)

It looks like there is a relatively small amount of differentiation even between the most distant populations: this reflects the relatively high migration rate we have used.

<dl class="exercise"><dt>Exercise 4</dt><dd>Using a <code>for</code> loop, print out the standard site-based $F_{st}$ values between samples in population 0 and samples from each of the other 7 populations in turn. Does $F_{st}$ reflect the expected relationship between populations?</dd></dl>

In [None]:
# Exercise 4: Loop over the populations, printing out Fst between each and pop_0.


In [None]:
workbook.Q4()

## Running the same simulation many times
Variation among simulated genetic genealogies means that branch-wise statistics vary among the genealogies as well. To see this, we need a number of simulation replicates. See the [Randomness and replication](https://tskit.dev/msprime/docs/stable/replication.html) section in the `msprime` manual for much more detail. Here, we will simply use the `num_replicates` argument in `sim_ancestry()`, which offers a convenient way to run many simulations under the same model. In this case, to avoid storing many tree sequences in memory, `msprime` returns an *iterator* over multiple tree sequences, and simulations are carried out "lazily", i.e. they are performed on the fly each time a new tree sequence is obtained.

Below, we look at the average $F_{st}$ between the first two populations for multiple replicates of our stepping-stone model:

In [None]:
import tqdm

number_of_replicates = 100

ancestry_reps = msprime.sim_ancestry(
    {i: num_deme_samples for i in range(num_demes)},
    sequence_length=2e6,
    demography=demography,
    recombination_rate=1e-8,
    random_seed=1234,
    # num_replicates > 1 means that an iterator over tree sequences is returned
    num_replicates=number_of_replicates
) 

# optional: wrap the iterator using the tqdm package, to show a progressbar
ancestry_reps = tqdm.tqdm(ancestry_reps, "Simulations", number_of_replicates)

Fst_vals = []
# performs the simulations, by accessing the replicates in a loop
for ts in ancestry_reps:
    Fst = ts.Fst([ts.samples(0), ts.samples(1)], mode="branch")
    Fst_vals.append(float(Fst))  # For convenience, convert the numpy array returned by ts.Fst to a standard number

<dl class="exercise"><dt>Exercise 5</dt>
<dd>Use <code>plt.hist()</code> to plot the histogram of Fst values between populations 0 and 1, and <code>np.mean()</code> to print out the mean. How easy is it to use Fst to decide if populations 0 and 1 are genetically differentiated?</dd></dl>

In [None]:
# Exercise 5: plot a histogram of Fst values, and also print out the mean.


In [None]:
workbook.Q5()

## More complicated demographic models

`Msprime` can simulate data under more complicated demographic models, which are beyond the scope of this workshop. Some demographic events and features that can be introduced into custom models include:

 - Varying population size.
 - Population structure (multiple demes with different migration rates).
 - Migration (constant or varying migration rates).
 - Admixture.
 - Population divergence.
 - Simulating genetic evolution through an existing pedigree.

## Some relevant papers and resources
 - [Efficient coalescent simulation and genealogical analysis for large sample sizes](https://doi.org/10.1371/journal.pcbi.1004842)
 - [Efficient ancestry and mutation simulation with msprime 1.0](https://doi.org/10.1093/genetics/iyab229)
 - [tskit.dev documentation](https://tskit.dev/)

## Acknowledgement
This workbook is heavily based on [Georgia Tsambos' Jupyter notebooks](https://github.com/gtsambos/2022-ts-workshops).