In [None]:
# Please execute this cell (shift+<Return>) before starting the workbook
# this should print out "Your notebook is ready to go"
import sys
import tskit
import msprime

if "pyodide" in sys.modules: # if running in-browser (e.g. using JupyterLite)
    import tqdm
    import micropip
    await micropip.install(['jupyterquiz', "demesdraw", "stdpopsim"])

import genealogical_analysis_workshop

workbook = genealogical_analysis_workshop.setup_workbook2()  # this may take a minute or two
display(workbook.setup)

# Simulating genome sequences

In this workbook you will learn how to simulate semi-realistic genome sequences. There are many reasons why simulation is a central part of many genomic analyses:
 
### Why do genomic simulations?

**A null model:**
It is often useful to have a null model against which to compare data. Neutral simulations are commonly used for this purpose.

**Exploration:**
Simulations allow us to explore the influence of various historical scenarios on observed patterns of genetic variation and inheritance.

**Benchmarking and evaluating methodologies:**
To assess the accuracy of inferential methods, we need test datasets for which the true values of important parameters are known.

**Model training:**
Some methods for ancestry inference are trained on simulated data (e.g., Approximate Bayesian Computation).
This is especially important in species with a complex population history, where there are many potential parameters and models, making it impractical to specify likelihood functions.

## Tools

One of the most popular ways to simulate genomic data sets is to use the <a href="https://tskit.dev/msprime/docs/stable/intro.html">msprime</a> simulator. This coalescent simulator is efficient enough to generate thousands or millions of whole genomes. It does this by first generating a genealogy in the form of a tree sequence; then (to create genetic sequence data) `msprime` can overlay neutral mutations onto the genealogy.

Note that although the genealogies generated by `msprime` are usually *neutral* (i.e. not shaped by natural selection), it is possible to incorporate complex demographies, such as population growth, shrinking, migration, and subdivision: we will see this at the end of the notebook.

### A brief history of msprime and tskit

The first release of `msprime` was an emulation of the popular `ms` coalescent simulator, and introduced the tree sequence format. Later, the tree sequence component was split off into `tskit`, a separate library for general use. `Msprime` has subsequently evolved into an expansive and flexible *backwards-time* simulator for various different models of genetic ancestry and mutation, and even for simplified models of selection.
There is also a popular *forwards-time* simulator named `SLiM` which is not covered in this workshop, but which also outputs tree sequences. The availability of `msprime` and `SLiM` has given rise to powerful hybrid approaches, such as [recapitation](https://tskit.dev/pyslim/docs/latest/tutorial.html#sec-tutorial-recapitation), that use the tree sequence format to combine backwards-time and forwards-time simulations.

### Backwards simulation

The main characteristic of `msprime` is that it simulates *tree sequences* in *backward-time*. Although this is usually much more efficient than simulating forwards in time, it is more restricted in the sort of scenarios that can be modelled.

<img src="pics/msprime-1.png" style="display:inline-block" width="190" height="190">
<img src="pics/msprime-2.png" style="display:inline-block" width="190" height="190">
<img src="pics/msprime-3.png" style="display:inline-block" width="190" height="190">
<img src="pics/msprime-4.png" style="display:inline-block" width="190" height="190">
<img src="pics/msprime-5.png" style="display:inline-block" width="190" height="190">

In [None]:
import msprime

## Basic simulations

To perform simulations using `msprime`, we first simulate a tree sequence without mutations (i.e. the genetic genealogy) using `sim_ancestry()`. If [necessary](https://tskit.dev/tutorials/no_mutations.html), we can then add neutral mutations to the tree sequence using `sim_mutations()`.

In [None]:
ts = msprime.sim_ancestry(
    samples=2, # Two diploid individuals
    random_seed=1
)
ts.draw_svg()

In [None]:
ts # Note there are no mutations yet

### Samples
Although we have specified 2 samples, our tree sequence contains 4 sample nodes (i.e. sample genomes).
This is because the `samples` argument specifies the number of *individuals* in the sample,
and by default, `sim_ancestry()` assumes diploid organisms.
To change this, use the `ploidy` argument:

In [None]:
ts = msprime.sim_ancestry(
    samples=2,
    ploidy=3, # Two triploid individuals
    random_seed=1
)
ts.draw_svg()

Note that it is also possible to sample at different points in time, although we won't go into this complication here.

### Sequence length

It's easiest to start thinking about genome lengths in units of nucleotides. By default, we are simulating a sequence length that spans just one of these units.
We can specify a larger region using the `sequence_length` argument. Notice how the genome position changes in the plot:

In [None]:
ts = msprime.sim_ancestry(
    samples=2,
    sequence_length=10_000,
    random_seed=1
)
ts.draw_svg()

### Recombination rate

The tree sequence we have simulated consists of just a single tree. That's because we have not yet specified a recombination rate, so a default rate of zero is assumed. We can set a more realistic rate using the `recombination_rate` parameter (conventionally interpreted as the probability of a recombination event per base per generation).

In [None]:
ts = msprime.sim_ancestry(
    samples=2,
    sequence_length=10_000,
    recombination_rate=1e-5, # Allow for recombination: this is quite a high rate
    random_seed=100
)
ts.draw_svg()

### Population information

As well as specifying samples, we need to say something about the dynamics of the wider population from which our samples have been drawn. The default is to assume a single randomly mating population of fixed size (later we will see how to change this). In a simple model like this, most users will therefore want to specify a `population_size`. Population geneticists sometimes refer to this as $N_e$, or the "effective population size" in a panmictic population.

<div class="alert alert-block alert-info"><b>Note:</b>
The standard <code>msprime</code> model is a theoretical one that allows the population size to be any floating point number greater than 0. In fact, if not specified, the population size in msprime defaults to 1, which sounds biologically impossible, but simply produces a result identical to that of a larger population with the time units scaled differently.
</div>

In [None]:
ts_small = msprime.sim_ancestry(
    samples=2,
    sequence_length=1_000,
    recombination_rate=1e-8, # Small recombination rate
    population_size=20_000, # Rough "effective population size" suitable for humans
    random_seed=107
)
ts_small  # Display summary to screen

In [None]:
# Draw the trees
ts_small.draw_svg()

<dl><dt>Exercise 1</dt><dd>To illustrate the speed of <code>msprime</code>, simulate a large tree sequence of 20,000 diploid individuals, each with a 1 Mbp long genome, using a recombination rate of 1e-8 from a population of size 20,000. Run the simulation with a random seed of 2022, save it to the variable <code>ts</code>, and output the summary table on the screen.
<div class="alert alert-block alert-info"><b>Tip:</b>
    Make sure you DON'T display the SVG trees! Each tree is huge, and there are a lot of them.</div></dd></dl>

In [None]:
# Exercise 1: Set `ts` to a new large tree sequence, generated using msprime.sim_ancestry() with
# specific parameters (random_seed=2022, etc.), then output the tree sequence summary table to screen.


In [None]:
workbook.Q1()

## Simulating mutations along the tree sequence

Next, to generate genetic variation, we add neutral mutations by applying `sim_mutations()` to the existing `TreeSequence` object. At minimum, you must supply a per-base, per-generation mutation rate.

In [None]:
mts_small = msprime.sim_mutations(
    ts_small, # Use the small tree sequence so that we can plot it easily
    rate=2e-7, # Set an unusually high mutation rate per generation per base pair
    random_seed=103
)
mts_small.draw_svg()

In [None]:
mts_small # Note there are 10 sites with variation but 11 mutations

<dl class="exercise"><dt>Exercise 2</dt>
<dd>Print out the mutations table for the <code>mts_small</code> tree sequence</dd></dl>

In [None]:
# Exercise 2: Print out the mutations table. Notice that one site has experienced multiple mutations.


In [None]:
workbook.Q2()

Adding mutations to a tree sequence is usually very fast, and the resulting tree sequence is a highly efficient way of storing simulated genomic data. For instance, adding 10,000 variable sites to the large tree sequence you simulated in Exercise 1 should take less than a second, and corresponds to storing a samples-by-sites matrix of 40,000 by 10,000 values (i.e. 400 million haploid genotypes). In tree sequence format, this only takes up about 8 MB of space. The equivalent VCF would be thousands or tens of thousands of times larger.
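
The size of that genotype matrix follows from simple arithmetic, sketched below. The figure of one byte per genotype is an assumption made purely for illustration; text formats such as VCF generally need more than this per genotype.

```python
num_individuals = 20_000  # diploid individuals from Exercise 1
ploidy = 2
num_sites = 10_000

num_sample_genomes = num_individuals * ploidy   # haploid genomes
num_genotypes = num_sample_genomes * num_sites  # entries in the matrix
dense_megabytes = num_genotypes / 1e6           # assuming 1 byte per genotype

print(f"{num_genotypes:,} genotypes, about {dense_megabytes:.0f} MB as a dense 1-byte matrix")
```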

<dl class="exercise"><dt>Exercise 3</dt>
<dd>Add mutations to the large tree sequence generated in Exercise 1 using a mutation rate of 1e-8. Run the simulation using a random seed of 2022, and print a summary table of the resulting tree sequence to the screen</dd></dl>

In [None]:
# Exercise 3: Add lots of mutations to the huge genealogy you simulated earlier.


In [None]:
workbook.Q3()

The take-home message is that the advent of tree sequences, together with efficient simulation frameworks like `msprime`, makes it possible for the first time to simulate and store huge (population-scale) genealogies, and synthesise vast amounts of genome sequence data.

## More complex models

Observed genetic diversity is influenced by many factors. Primary among those are 1. _genetic processes_ such as recombination and mutation, 2. _demographic processes_ such as changes in population sizes and migration, and 3. _selective processes_ whereby natural selection influences particular regions of the genome.

The first two of these are well catered for in `msprime`. The third tends to be tackled by forward-time simulators, although some simple forms of selection (e.g. [selective sweeps](https://tskit.dev/msprime/docs/stable/ancestry.html#sec-ancestry-models-selective-sweeps)) can be approximated by `msprime` simulations.

### Genetic processes

`Msprime` has many additional options to change the genetic processes used in its simulations. These are detailed in the online [API documentation](https://tskit.dev/msprime/docs/stable/intro.html). Below is a brief summary of the more important ones:

 - **Recombination rate variation across the genome**
 We can introduce this by creating a `RateMap` object, which lists recombination rates between defined positions in the sequence. See this [page](https://tskit.dev/msprime/docs/stable/rate_maps.html) for more about `RateMap` objects.
 - **Mutation rate variation across the genome** This can also be introduced using a `RateMap` object, which lists mutation rates between defined positions in the sequence. See this [page](https://tskit.dev/msprime/docs/stable/rate_maps.html) for more detail.
 - **Mutation models** There are [pre-defined models](https://tskit.dev/msprime/docs/stable/mutations.html) (e.g., Jukes-Cantor). You can also use your own custom models.
 - **Stacking mutations** Mutations can be simulated on the same tree sequence under different models and/or parameters and/or over different time periods.
 - **Gene conversion** See the API documentation for the `gene_conversion_rate` and `gene_conversion_tract_length` arguments, and [this](https://tskit.dev/msprime/docs/stable/ancestry.html#gene-conversion) short illustration of use.
 - **Alternative simulation models** In `msprime` the default simulation model is the standard ("Hudson") coalescent, which applies for relatively small samples from large populations. It is also possible to study [other models](https://tskit.dev/msprime/docs/stable/ancestry.html#specifying-ancestry-models) such as the generation-by-generation "Discrete Time Wright Fisher" (DTWF) model, which is slower but more suitable when sampling most of the population.

### Demographic processes

Demography is often a major contributing factor to patterns of genetic diversity, and is important to consider when analysing genomes.

`Msprime` can simulate multiple discrete populations, sometimes called "subpopulations" or "demes". It provides a number of [built-in theoretical models](https://tskit.dev/msprime/docs/stable/demography.html#quick-reference), or you can create your own bespoke model or modify an existing one. But before tackling explicit examples of how to do this in `msprime`, it is worth mentioning a useful alternative approach.

## Standardised population simulations: `stdpopsim`

It can be difficult to know what genetic and demographic parameters to use in a simulation.  However, a [library of community-validated demographic models](https://elifesciences.org/articles/54967) exists for a number of common species, including humans, available via the [stdpopsim](https://popsim-consortium.github.io/stdpopsim-docs/stable/tutorial.html#running-stdpopsim-with-the-python-interface-api) package. Under the hood this uses either `msprime` or `SLiM` to perform the simulations, producing `tskit` tree sequences which can be analysed in the same way we have been doing so far.

If you are fortunate enough that your species of interest, or a close relative, is in the `stdpopsim` [catalog](https://popsim-consortium.github.io/stdpopsim-docs/stable/catalog.html), then this gives you an easy way to simulate the genomes of your target species, under a variety of tested and published models. There are currently over 20 species in the catalog, focused on those of genetic, agricultural, conservation, or medical interest, and more are being added all the time.

As an example, we will simulate some human genomes (but feel free to experiment with other organisms in the catalog):

In [None]:
import stdpopsim
import demesdraw

species_label = "HomSap"  # Or try e.g. "AnoGam" for the mosquito /Anopheles gambiae/
species = stdpopsim.get_species(species_label)

print(
    f"{len(species.demographic_models)} demographic models for {species.name}; chromosomes",
    [chrom.id for chrom in species.genome.chromosomes],  # `chrom` avoids shadowing the built-in chr()
)

<dl class="exercise"><dt>Exercise 4</dt><dd>Using a <code>for</code> loop, iterate over the <code>species.demographic_models</code>, print out the
<code>.id</code> and <code>.long_description</code> attributes of each model.
</dd></dl>

In [None]:
# Exercise 4: loop over the demographic models, printing their id and long_description


In [None]:
workbook.Q4()

We will simulate under the `AmericanAdmixture_4B11` model. This is a model of American admixture (where Americans are a mix of separate African, European, and Asian populations). Before simulating, we can visualise the demography of this model using the rather nifty `demesdraw` [package](https://grahamgower.github.io/demesdraw/latest), plotting the population size (on the x-axis) against time (on the y-axis):


In [None]:
import demesdraw

model = species.get_demographic_model("AmericanAdmixture_4B11")
demesdraw.tubes(model.model.to_demes(), log_time=True)
print("Population names are", {pop.id: pop.name for pop in model.populations})

In [None]:
# Now actually run a simulation of chromosome 20 under this model
contig = species.get_contig("20", left=1e7, right=4e7)  # restrict between 10Mb and 40Mb for speed
samples = {"AFR": 10, "EUR": 10, "ASIA": 10, "ADMIX": 10}
engine = stdpopsim.get_engine("msprime")
ts = engine.simulate(model, contig, samples, seed=321)  # emits a warning, ignore for simplicity
ts  # display the output: should have tens of thousands of variable sites

Normally, you would replicate such a simulation many times (we shall see how to do that at the end of this workbook). However, since we have simulated a substantial chunk of genome, we might hope that some of the inherent genetic randomness is captured by looking across the genome (you could check this by changing the seed value in the simulation above, to see if it gives noticeably different results).

Genetic diversity in most species is surprisingly deep. One way to illustrate this is to look at the age of the mutations in our simulation. To reflect the prevalence of mutations, we need to weight each mutation time by the number of samples underneath that mutation.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Find number of samples under each mutation
weights = np.zeros(ts.num_mutations, dtype=int)
for tree in ts.trees():
    for m in tree.mutations():
        weights[m.id] += tree.num_samples(m.node)

plt.hist(np.log(ts.mutations_time), bins=50, density=True, weights=weights)

# label the plot nicely
ticks = 10**np.arange(6)
plt.xticks(np.log(ticks), ticks)
plt.xlabel(f"Time ({ts.time_units})")
plt.ylabel("Density")
plt.title("Ages of mutations")

# show the out of Africa event
node_is_nonAfrican = (ts.nodes_population != 0)
oldest_nonAfrican = np.max(ts.nodes_time[node_is_nonAfrican])
plt.axvline(np.log(oldest_nonAfrican), c="black")
print(
    f"Out of Africa event plotted at time {oldest_nonAfrican} {ts.time_units}",
    f"({oldest_nonAfrican * species.generation_time} years)"
)

According to our simulation, if you pick two modern-day human genomes at random, the majority of (neutral) mutations that cause them to differ originate more than 61 thousand years ago: i.e. mostly in Africa.

There is a well-understood theoretical reason for this: in the simplest neutral model, the average time-to-common-ancestry (the _tMRCA_ or _coalescence time_) in generations between two genomes is twice the "effective population size" ($2 N_e$). With a human $N_e$ estimated between 1000 and 10,000, we expect the lineage connecting two arbitrary human genomes to stretch back many thousands of generations. This is likely to be roughly true even though humans do not exactly conform to this simple theoretical model. Since neutral mutations occur with roughly equal probability along a lineage, the average age of the mutations that separate two humans is also likely to be many thousands of generations ago.
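
As a rough worked example of the $2 N_e$ expectation, the arithmetic can be sketched as follows. Both input values are assumed round figures chosen for illustration, not estimates taken from the simulation above.

```python
effective_size = 10_000     # assumed illustrative N_e
generation_time_years = 30  # assumed round figure for years per generation

# Expected pairwise coalescence time under the simplest neutral diploid model
expected_tmrca_generations = 2 * effective_size
expected_tmrca_years = expected_tmrca_generations * generation_time_years

print(expected_tmrca_generations, "generations, or about", expected_tmrca_years, "years")
```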

In other words, the age of mutations should reflect the time-to-most-recent-common-ancestry in the genetic genealogy. This is another example of the duality between branch lengths and genetic variation metrics that we met in workbook 1. To illustrate this, we'll calculate pairwise tMRCA along our simulated genomes.

<dl class="exercise"><dt>Exercise 5</dt>
<dd>Using the <code>.diversity</code> statistic with <code>mode="branch"</code>, print out the average pairwise tMRCA value for the entire tree sequence. Then loop over the four populations and print out the same value <em>within</em> each population (NB: the population name is given by the <code>.metadata["name"]</code> attribute of each population). You can convert the branch length diversity to a time in generations by dividing by 2 (because the statistic counts the branch length up to the common ancestor and back down again), and to a time in years by multiplying by <code>species.generation_time</code>. <blockquote>(Hint: you can restrict the calculation to a single set of samples by setting the <code>sample_sets</code> parameter to <code>ts.samples(population=pop.id)</code> in the <code>.diversity</code> call)</blockquote></dd></dl>

In [None]:
# Exercise 5: loop over populations, using ts.diversity(..., mode="branch") / 2 * species.generation_time
#  to print the average tMRCA across the genome, in years, within each population


In [None]:
workbook.Q5()

## Simulations under custom demographic models

If the `stdpopsim` catalog does not cater for your species or demographic history of interest, you may wish to simulate bespoke demographic models. `Msprime` has [extensive documentation](https://tskit.dev/msprime/docs/stable/demography.html) for creating demographic models, and here we will use its built-in functionality to investigate a basic "stepping-stone" model of eight populations. However, for complex cases it is recommended that you investigate the [demes](https://popsim-consortium.github.io/demes-spec-docs/main/introduction.html) project, which provides a [portable way](https://doi.org/10.1093/genetics/iyac131) of specifying demographic models, which can then be imported into programs such as `msprime`.

In [None]:
deme_size = 500  # Population size of each deme
num_demes = 8
num_deme_samples = 25

# Make an msprime `Demography` object
demography = msprime.Demography.stepping_stone_model(
    [deme_size] * num_demes,
    migration_rate=0.05
)

To visualise this, the easiest thing is to convert it to the `demes` format and use the `demesdraw` package we met previously.

In [None]:
demesdraw.tubes(
    demography.to_demes(),  # Convert to standard "demes" format
    positions={f"pop_{i}": i * deme_size * 3 for i in range(num_demes)},
    seed=3,
)
plt.show()

Since all 8 populations are of constant size, they appear as fixed-width "tubes" through time. You can see that each has a small amount of migration (arrows) from each population to the two adjacent ones. Here's how to actually simulate this demographic setup.
<div class="alert alert-block alert-info"><b>Note:</b>
    Because of limited migration, we expect the closest relatives of an individual to be found in the same population. As for the other populations, if you look carefully at the coloured arrows, you will see that the first and last population can migrate between each other, so the populations can actually be thought of as lying on a circle. All other things being equal, from the point of view of an individual in <code>pop_0</code>, on average <code>pop_1</code> and <code>pop_7</code> should contain the next closest relatives, then <code>pop_2</code> and <code>pop_6</code>, then <code>pop_3</code> and <code>pop_5</code>; finally (being the greatest number of hops away) <code>pop_4</code> should contain, on average, the most distant relatives.</div>
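
The ordering described in the note can be reproduced by counting hops around the circle. This is a small sketch in plain Python; the helper name `ring_hops` and the variable `n_demes` are introduced here for illustration and are not part of `msprime`.

```python
def ring_hops(i, j, n):
    """Fewest migration steps between demes i and j on a circle of n demes."""
    direct = abs(i - j)
    return min(direct, n - direct)


n_demes = 8  # same number of demes as the stepping-stone model
hops_from_pop_0 = {f"pop_{j}": ring_hops(0, j, n_demes) for j in range(n_demes)}
print(hops_from_pop_0)
```

Populations with the same hop count (for example `pop_1` and `pop_7`) are expected to be about equally related to `pop_0` on average.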

In [None]:
mu = 1e-8 # Human-like mutation rate

ts = msprime.sim_ancestry(
    {population_id: num_deme_samples for population_id in range(num_demes)},
    sequence_length=5e6, # 5Mb
    demography=demography,
    recombination_rate=1e-8, # Human-like recombination rate
    random_seed=123,
)

mts = msprime.sim_mutations(
    ts,
    rate=mu,
    random_seed=321
)
mts  # Display it to screen: it should have 8 populations

A venerable statistic that is often used to measure genetic differentiation between subpopulations is known as the fixation index, or $F_{st}$. If there is no differentiation, its value should be 0. Although $F_{st}$ maxes out at 1, values above 0.15 are usually taken as indicating very significant population differentiation:

In [None]:
pop_0_sample_ids = mts.samples(population=0)
pop_4_sample_ids = mts.samples(population=4)

print(
    "Fst from variable sites:",
    mts.Fst([pop_0_sample_ids, pop_4_sample_ids]),
    "\nFst from genealogical branch lengths:",
    mts.Fst([pop_0_sample_ids, pop_4_sample_ids], mode="branch")
)

It looks like there is a relatively small amount of differentiation even between the most distant populations: this reflects the relatively high migration rate we have used.

<dl class="exercise"><dt>Exercise 6</dt><dd>Using a <code>for</code> loop, print out the standard site-based $F_{st}$ values between samples in population 0 and samples from each of the other 7 populations in turn. Does $F_{st}$ reflect the expected relationship between populations?</dd></dl>

In [None]:
# Exercise 6: Loop over the populations, printing out Fst between each and pop_0.


In [None]:
workbook.Q6()

## Running the same simulation many times

Variation among simulated genetic genealogies means that there is variation in branch-wise statistics among the genealogies as well. To see this, we need a bunch of simulation replicates. See the [Randomness and replication](https://tskit.dev/msprime/docs/stable/replication.html) section in the `msprime` manual for lots more detail. Here, we will simply use the `num_replicates` argument in `sim_ancestry()`. This offers a convenient way to run many simulations under the same model. In this case, to avoid storing many tree sequences, `msprime` returns an *iterator* over multiple tree sequences, and simulations are carried out "lazily", i.e. they are performed on the fly each time a new tree sequence is obtained.

Below, we look at the average $F_{st}$ between the first two populations for multiple replicates of our stepping-stone model:

In [None]:
import tqdm
import numpy as np
import matplotlib.pyplot as plt

number_of_replicates = 100

ancestry_reps = msprime.sim_ancestry(
    {i: num_deme_samples for i in range(num_demes)},
    sequence_length=2e6,
    demography=demography,
    recombination_rate=1e-8,
    random_seed=1234,
    # num_replicates > 1 means that an iterator over tree sequences is returned
    num_replicates=number_of_replicates
) 

# optional: wrap the iterator using the tqdm package, to show a progressbar
ancestry_reps = tqdm.tqdm(ancestry_reps, desc="Simulations", total=number_of_replicates)

Fst_vals = []
# performs the simulations, by accessing the replicates in a loop
for ts in ancestry_reps:
    Fst = ts.Fst([ts.samples(0), ts.samples(1)], mode="branch")
    Fst_vals.append(float(Fst))  # For convenience, convert the numpy array returned by ts.Fst to a standard number

<dl class="exercise"><dt>Exercise 7</dt>
<dd>Use <code>plt.hist()</code> to plot the histogram of Fst values between populations 0 and 1, and <code>np.mean()</code> to print out the mean. How easy is it to use Fst to decide if populations 0 and 1 are genetically differentiated?</dd></dl>

In [None]:
# Exercise 7: plot a histogram of Fst values, and also print out the mean.


In [None]:
workbook.Q7()

## More complicated demographic models

`Msprime` can simulate data under more complicated demographic models, which are beyond the scope of this workshop. Some demographic events and features that can be introduced into custom models include:

 - Varying population size.
 - Population structure (multiple demes with different migration rates).
 - Migration (constant or varying migration rates).
 - Admixture.
 - Population divergence.
 - Simulating genetic evolution through an existing pedigree.

## Some relevant papers and resources
 -  [Efficient coalescent simulation and genealogical analysis for large sample sizes](https://doi.org/10.1371/journal.pcbi.1004842)
 - [Efficient ancestry and mutation simulation with msprime 1.0](https://doi.org/10.1093/genetics/iyab229)
 - [tskit.dev documentation](https://tskit.dev/)

## Acknowledgement
This workbook is heavily based on [Georgia Tsambos' Jupyter notebooks](https://github.com/gtsambos/2022-ts-workshops).