In [1]:
import msprime, tskit
import numpy as np
from IPython.display import SVG

# 3. Simulating mutations with msprime

In this worksheet, we'll focus on how to add simulated mutations to existing tree sequences, for cases where you need to look at actual genome **sequences**.

 - [3.0 Do you really need mutations?](#3.0DoYouReallyNeedMutations)
 - [3.1 The basic syntax](#3.1TheBasicSyntax)
 - [3.2 Mutation models](#3.2MutationModels)
    - [3.2.2 Mutation rates in different epochs](#3.3MutationRatesInDifferentEpochs)
 - [3.3 Exporting sequence data into other formats](#3.4ExportingSequenceDataIntoOtherFormats)

<a id='3.0DoYouReallyNeedMutations'></a>
       
## 3.0 Do you really need mutations?

In tree sequences, the genetic genealogy exists independently of the mutations that
generate the observed genetic variation.
Our analyses may not actually require mutations: the genealogy on its own may be sufficient.
This applies particularly to cases where you generate tree sequences by simulation,
when you can be certain that your branch lengths are correct.

Think carefully. Do you really need to analyse
information about alleles and mutations? If not, here's why you could consider omitting
them: 

1. Neutral mutations and sites can be added to a genealogy later
2. Simulating sites and mutations increases memory requirements and computational time
3. Storing mutations increases the tree sequence file size, and can slow some downstream analyses

Consider the following simulation of samples from a 2-population [island model](https://tskit.dev/msprime/docs/stable/api.html#msprime.Demography.island_model):

In [2]:
L = 1e7
rho = 1e-8 
n_subpops = 2
subpop_size = 1e4
migration_rate = 1e-4

# Create a mutationless diploid tree sequence of n_subpops demes
ts_orig = msprime.sim_ancestry(
    samples={f"pop_{i}": 10 for i in range(n_subpops)},  # 10 diploid individuals from each subpop
    demography=msprime.Demography.island_model([subpop_size] * n_subpops, migration_rate),
    ploidy=2,
    recombination_rate=rho,
    sequence_length=L,
    random_seed=123,
)

We can use the `sim_mutations()` function to add neutral sites and mutations:

In [None]:
mu = 1e-8

ts_mutated = msprime.sim_mutations(ts_orig, rate=mu, random_seed=456)
print(
    "Adding mutations has increased the tree sequence file size by "
    f"{(ts_mutated.nbytes / ts_orig.nbytes - 1) * 100:.0f}%",
)

### Analysis without variable sites

Some genetic analyses are primarily focused on patterns of descent or ancestry. For instance, have you ever needed to study
* local ancestry, global ancestry and identity-by-descent?
* identification of most recent common ancestors and their descendants (including e.g. genealogical nearest neighbour analysis)?

In cases like these, having the genealogy is usually sufficient to perform the analysis.



Although many genetic analyses are based on patterns of genetic variation, for many purposes the genetic variation can be thought of as a measure of the relative length of branches on the local trees in a tree sequence. So while mutations are necessary to generate realistically variable genetic sequences, some statistical analyses do not necessarily require them to be present in a tree sequence. We'll talk about this further in the next notebook.

### A basic tree sequence to work with

To emphasise that mutations and genealogy are truly separate in `msprime`, we will use the same simulated genetic genealogies (node and edge tables) throughout this notebook.

In [None]:
ts = msprime.sim_ancestry(2, sequence_length=100, random_seed=1987, recombination_rate=0.001)
SVG(ts.draw_svg())

In [5]:
ts_big = msprime.sim_ancestry(
    100,
    population_size = 1e4,
    sequence_length=5e7,
    random_seed=1982,
    recombination_rate=1e-8
)

<a id='3.1TheBasicSyntax'></a>
## 3.1 The basic syntax

To simulate mutations, pass an existing tree sequence object to the `sim_mutations()` function.
At minimum, you must supply a per-base, per-generation mutation rate.

In [None]:
mts = msprime.sim_mutations(ts, rate=0.01, random_seed=2016)
SVG(mts.draw_svg())

By default, the mutations are simulated under a discrete *Jukes-Cantor* model.
Under this model, there is an equal probability of each ancestral state (`A`, `C`, `G`, `T`), and an equal probability of each possible transition between these states (`A<->C`, `A<->G`, etc.).
See the [`JC69` documentation](https://tskit.dev/msprime/docs/stable/api.html#msprime.JC69) for more information.

The output is yet another tree sequence, but this time with mutations on particular edges of the trees.
There is now an additional provenance record to show that this dataset was obtained by applying `sim_mutations()` to the tree sequence initially simulated with `sim_ancestry()`.

In [None]:
mts

Under the hood, `sim_mutations()` has added a *mutations table* to the original data:


In [None]:
mts.tables.mutations

And a sites table:

In [None]:
mts.tables.sites

Notice that some sites have experienced multiple mutations.

To view the sequence information at each successive site, we can use the `variants()` iterator:

In [None]:
for var in mts.variants():
    print(var.site.position, var.alleles, var.genotypes, sep="\t")

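To get the alleles at all sites at once, we can use `genotype_matrix()`. Do this **only** if the tree sequence is small: the matrix has one row per site and one column per sample, so it can become very large.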

In [None]:
mts.genotype_matrix()

### 3.1.1 Mutation rate variation

You can specify mutation hotspots by passing a `RateMap` object as the `rate` argument, instead of a single number:

In [None]:
ratemap = msprime.RateMap(position=[0, 40, 60, 100], rate=[0.01, 0.1, 0.01])
mts = msprime.sim_mutations(ts, rate=ratemap, random_seed=104)
SVG(mts.draw_svg(mutation_labels={}))

#### Note: discrete vs continuous coordinates

By default, the mutations generated by `sim_mutations()` will be at discrete positions.
As with `sim_ancestry()`, we can also choose to place mutations at continuous (floating point) positions if we wish.
See the `discrete_genome` option [here](https://tskit.dev/msprime/docs/stable/api.html#msprime.sim_mutations),
and note that mutations produced with `discrete_genome=False` will conform to the classical *infinite sites* assumption.

<a id='3.2MutationModels'></a>
## 3.2 Mutation models

Under the default Jukes-Cantor mutation model, there is an equal probability of each ancestral state (`A`, `C`, `G`, `T`), and an equal probability of each possible transition between these states (`A<->C`, `A<->G` etc).
These properties are defined in the model's `root_distribution` and `transition_matrix`:

In [None]:
msprime.JC69().alleles

In [None]:
msprime.JC69().root_distribution

In [None]:
msprime.JC69().transition_matrix

We can summarise the mutations in this tree sequence with some extra code:

In [None]:
def count_transitions(ts, alleles):
    """Print a table of mutation counts: rows are parental alleles, columns are derived alleles."""
    counts = np.zeros((len(alleles), len(alleles)), dtype='int')
    for s in ts.sites():
        aa = s.ancestral_state
        for m in s.mutations:
            # The parental allele is the ancestral state, unless this mutation
            # sits below another mutation at the same site
            pa = aa
            da = m.derived_state
            if m.parent != tskit.NULL:
                pa = ts.mutation(m.parent).derived_state
            counts[alleles.index(pa), alleles.index(da)] += 1
    print("\t", "\t".join(alleles))
    for j, a in enumerate(alleles):
        print(f"{a}\t", "\t".join(map(str, counts[j])))

count_transitions(mts, msprime.JC69().alleles)

In this case, we have a reasonably small number of mutations, so it's not necessarily obvious that our mutations satisfy the Jukes-Cantor model. This should be clearer looking at a larger dataset:

In [None]:
mts_big = msprime.sim_mutations(ts_big, rate=1e-8, random_seed=2016)
count_transitions(mts_big, msprime.JC69().alleles)

There are *many* other mutation models you can choose from in `msprime`:

 - [BinaryMutationModel](https://tskit.dev/msprime/docs/stable/api.html#msprime.BinaryMutationModel): Basic binary mutation model with two flip-flopping alleles: “0” and “1”.

 - [JC69](https://tskit.dev/msprime/docs/stable/api.html#msprime.JC69): Jukes & Cantor model (‘69), equal probability of transitions between nucleotides

 - [HKY](https://tskit.dev/msprime/docs/stable/api.html#msprime.HKY): Hasegawa, Kishino & Yano model (‘85), different probabilities for transitions and transversions

 - [F84](https://tskit.dev/msprime/docs/stable/api.html#msprime.F84): Felsenstein model (‘84), different probabilities for transitions and transversions

 - [GTR](https://tskit.dev/msprime/docs/stable/api.html#msprime.GTR): The Generalised Time-Reversible nucleotide mutation model, a general parameterisation of a time-reversible mutation process

 - [BLOSUM62](https://tskit.dev/msprime/docs/stable/api.html#msprime.BLOSUM62): The BLOSUM62 model of time-reversible amino acid mutation

 - [PAM](https://tskit.dev/msprime/docs/stable/api.html#msprime.PAM): The PAM model of time-reversible amino acid mutation

 - [MatrixMutationModel](https://tskit.dev/msprime/docs/stable/api.html#msprime.MatrixMutationModel): Superclass of the specific mutation models with a finite set of states

 - [InfiniteAlleles](https://tskit.dev/msprime/docs/stable/api.html#msprime.InfiniteAlleles): A generic infinite-alleles mutation model

 - [SLiMMutationModel](https://tskit.dev/msprime/docs/stable/api.html#msprime.SLiMMutationModel): An infinite-alleles model of mutation producing SLiM-style mutations

Any of these alternative models can be specified using the `model` argument of `sim_mutations()`.
We'll practice with one of them:

### 3.2.1 An alternative example: Transitions and transversions

In some situations we want nucleotide transitions (`A<->G` and `C<->T`, i.e. exchanges of bases with a similar shape) to be more likely than nucleotide transversions (all other changes).
We can do this with the [Hasegawa, Kishino & Yano (HKY) model](https://tskit.dev/msprime/docs/stable/api.html#msprime.HKY).
In addition to an overall mutation `rate`, you also specify `kappa`, a constant scaling parameter such that the probability of each type of transition is `kappa` times the probability of each type of transversion.

In the output of the following cell, most of our simulated mutations are transitions. (Also, note that the site with multiple mutations now experiences a back mutation.)

In [None]:
mts_big = msprime.sim_mutations(
    ts_big,
    rate=1e-8,
    random_seed=2016,
    model=msprime.HKY(kappa=20))

count_transitions(mts_big, msprime.HKY(kappa=20).alleles)

There is now a very small (non-zero) probability of *silent mutations*, that is, transitions that do not change the type of the allele. This behaviour is an artifact of the model adjustments that `msprime` makes to enable mutation 'stackability'. It usually doesn't matter too much, unless your mutation model has a large number of possible mutation allele types. For more information on this, see [this discussion in the documentation](https://tskit.dev/msprime/docs/stable/mutations.html#sec-mutations-adjusting-for-silent).

In [None]:
msprime.HKY(kappa=20).transition_matrix

<a id='3.3MutationRatesInDifferentEpochs'></a>
### 3.2.2 Mutation rates in different epochs

Use the `start_time` and `end_time` arguments to simulate mutations only within specific timeframes.
This can be useful if you want to simulate a change in mutation rates over time.

In [None]:
mts = msprime.sim_mutations(ts, rate=0.01, random_seed=1714, start_time=2)
SVG(mts.draw_svg(mutation_labels={}))

In [None]:
mts.tables.mutations

In [None]:
mmts = msprime.sim_mutations(mts, rate=0.1, random_seed=851, start_time=1,
                            end_time=2)
SVG(mmts.draw_svg(mutation_labels={}))

In [None]:
mmts.tables.mutations

Note that the original mutation has been re-numbered according to its age.

### 3.2.3 Stacking mutations

As the previous example shows,
we can apply `sim_mutations()` to *any* tree sequence, including one that already has mutations on it.
This allows us to 'stack' mutations, which can be useful if you wish to simulate several different types of mutations from different models, or perhaps over different time periods. 

For instance, here's a tree sequence with some mutations that we made earlier:

In [None]:
SVG(mts.draw_svg(mutation_labels={})) 

We can run `sim_mutations()` once more to overlay more mutations on top of this tree sequence:

In [None]:
mmts = msprime.sim_mutations(mts, rate=0.01, random_seed=1959)
SVG(mmts.draw_svg(mutation_labels={}))

There's also no reason why we can't apply `sim_mutations()` to a tree sequence generated outside of `msprime`.

### 3.2.4 Make-your-own mutation model!

Mutation models consist of a few key ingredients:

 - **alleles** (the possible 'choices' you can see at each site)
 - **root distribution** (what's the distribution of ancestral alleles?)
 - **transition matrix** (what's the probability of mutating from one allele to another?)

You can use these to define your own (finite-state) mutation model:

In [None]:
model = msprime.MatrixMutationModel(
    alleles = ["💩", "🎄", "🔥"],
    root_distribution = [1.0, 0.0, 0.0],
    transition_matrix = [[0.0, 1.0, 0.0],
                         [0.0, 0.8, 0.2],
                         [1.0, 0.0, 0.0]]
)
mts = msprime.sim_mutations(
    ts, rate=0.01, random_seed=1215112, model=model)

for var in mts.variants():
    print(var.site.position, var.alleles, var.genotypes, sep="\t")
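When writing your own model, it helps to check that the ingredients are consistent before running a simulation: the root distribution and every row of the transition matrix should be probability distributions over the alleles. The function `check_mutation_model` below is a hypothetical helper written for this sketch and is not part of `msprime`, which may perform its own validation.

```python
import numpy as np


def check_mutation_model(alleles, root_distribution, transition_matrix):
    """Return a list of problems found (hypothetical helper, not part of msprime)."""
    root = np.asarray(root_distribution, dtype=float)
    P = np.asarray(transition_matrix, dtype=float)
    n = len(alleles)
    problems = []
    if root.shape != (n,) or P.shape != (n, n):
        problems.append("shapes do not match the number of alleles")
        return problems
    if (root < 0).any() or (P < 0).any():
        problems.append("probabilities must be non-negative")
    if not np.isclose(root.sum(), 1.0):
        problems.append("root distribution does not sum to 1")
    if not np.allclose(P.sum(axis=1), 1.0):
        problems.append("some transition matrix rows do not sum to 1")
    return problems


# Same numbers as the three-allele model above, with plain-text allele names
alleles = ["x", "y", "z"]
root = [1.0, 0.0, 0.0]
P = [[0.0, 1.0, 0.0],
     [0.0, 0.8, 0.2],
     [1.0, 0.0, 0.0]]
good = check_mutation_model(alleles, root, P)
bad = check_mutation_model(alleles, root, [[0.5] * 3] * 3)
print(good)
print(bad)
```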

*Exercise*. For most of their evolutionary history, mutations in your study organisms were well described by the HKY model.
However, they were all exposed to an environmental mutagen 50 generations ago, and from then on all new mutations were of type 👽. All of the nucleotide bases have an equal probability of mutating to an allele of type 👽. Once a nucleotide mutates to a 👽, it cannot mutate back to a normal nucleotide. Modify the code below to simulate this scenario.

In [None]:
# The underlying genealogy
ts_ex = msprime.sim_ancestry(
    10, sequence_length=1000, random_seed=1987, recombination_rate=0.001,
    population_size=1e3)

In [52]:
# Modify code here
alien_model = msprime.MatrixMutationModel(
    alleles = ['A', 'C', 'G', 'T'],
    root_distribution = [0.25, 0.25, 0.25, 0.25],
    transition_matrix = [[0, 1/3, 1/3, 1/3],
                         [1/3, 0, 1/3, 1/3],
                         [1/3, 1/3, 0, 1/3],
                         [1/3, 1/3, 1/3, 0]]
)

mts_ex = msprime.sim_mutations(
    ts_ex, rate=1e-5, random_seed=2016, model=msprime.HKY(kappa=20),
    start_time=0, end_time=None
)
mts_ex = msprime.sim_mutations(
    mts_ex, rate=1e-5, random_seed=2299, model=alien_model,
    start_time=0, end_time=None
)

In [None]:
# Check here
print(mts_ex.tables.mutations)

In [None]:
# Check here
for var in mts_ex.variants():
    print(var.site.position, var.alleles, var.genotypes, sep="\t")

<a id='3.4ExportingSequenceDataIntoOtherFormats'></a>
## 3.3 Exporting sequence data into other formats

### 3.3.1 To VCF

We'll use the `write_vcf()` method to export our new simulated sequences to a VCF file for use with external software:

In [None]:
with open("worksheet3-output.vcf", "w") as vcf_file:
    mts.write_vcf(vcf_file)

In [None]:
with open("worksheet3-output.vcf", "r") as f:
    print(f.read())

...and we may even want to specify our own individual names:

In [None]:
indv_names = ["platypus_1", "platypus_2"]
with open("worksheet3-output.vcf", "w") as vcf_file:
    mts.write_vcf(vcf_file, individual_names=indv_names)

In [None]:
with open("worksheet3-output.vcf", "r") as f:
    print(f.read())

### 3.3.2 To other Python objects

Many of the underlying methods in `tskit` produce `numpy` objects as output.
Because of this, tree sequences play well with other Python libraries, including plotting libraries like `matplotlib` and scientific libraries like `scikit-allel`.

For instance, the `genotype_matrix()` method will return an object that works as a HaplotypeArray in `scikit-allel`, giving you easy access to its summary functions.

In [None]:
import allel

gens = mts_big.genotype_matrix()
haps = allel.HaplotypeArray(gens)
allele_counts = haps.count_alleles()
allel.mean_pairwise_difference(allele_counts)

However, many of these calculations can also be done within `tskit` itself.
We'll see this in the next notebook!