In [None]:
import msprime, tskit
import numpy as np
from IPython.display import SVG
msprime.__version__

# Simulating mutations with msprime

 - Brief recap of what we learnt last time

First, we need to make it clear that you might not always need to do this.

In msprime simulations, the genealogy is created independently of the genetic variation. You simulate the underlying genealogy first, as a set of node and edge tables, and *then* if desired, you simulate mutations to go with that genealogical structure. All of the sequence and variation data in your simulation is simply a consequence of the mutations that you've thrown on top of the trees.

(Get a picture of a tree sequence with and without mutations)

However, in many situations, the genealogy is *all* you need. 

Think carefully. Does your application *need* information about alleles and mutations? If not, save yourself (and your computer) some effort.

Some examples of analyses that *do not* require this information include:
 - most analyses of ancestry, including local ancestry, global ancestry and identity-by-descent
 - most recent common ancestors
 - genealogical nearest neighbours
 - ...?

In later worksheets we will see some examples of these analyses.

But in this worksheet, we're going to focus on how to simulate mutations, should you need to look at actual genome **sequences**:

 - 3.1 The basic syntax
 - 3.2 Mutation models
 - 3.3 Silent mutations and adjustments
 - 3.4 Stacked mutations
 - 3.5 Mutation rate variation
 - 3.6 Discrete and continuous coordinates
 - 3.7 Exporting into other formats

### A basic tree sequence to work with

To emphasise that mutations and genealogy are truly separate in msprime, we will use the same simulated genealogy (i.e. the same node and edge tables) in all of the simulations in this worksheet. Everything else we do will just be throwing different sets of mutations on top of these trees, to produce different genomic sequences.

In [None]:
ts = msprime.sim_ancestry(2, sequence_length=100, random_seed=1987, recombination_rate=0.001)
SVG(ts.draw_svg())

In [None]:
ts_big = msprime.sim_ancestry(100, sequence_length=5e7, random_seed=1982, recombination_rate=1e-8)

## 3.1 The basic syntax

(2.10pm)

To simulate mutations, pass an existing tree sequence object to the `sim_mutations()` function.
To get any mutations out, you also need to supply a per-base, per-generation mutation rate.

In [None]:
mts = msprime.sim_mutations(ts, rate=0.01, random_seed=2016)
SVG(mts.draw_svg())

By default, mutations are simulated under a discrete *Jukes-Cantor* model, which we'll discuss further later on.
The output is yet another tree sequence, but this time with mutations on particular edges of the trees.
Under the hood, you'll notice that in addition to the node and edge tables that were there before, two more tables in the tree sequence are now filled in.
One is a *mutations table*:

In [None]:
mts.tables.mutations

The other holds information about the sites at which these mutations arose.
Notice that some sites have experienced multiple mutations.

In [None]:
mts.tables.sites

To view the sequence information, you have a few options. The `variants()` iterator:

In [None]:
for var in mts.variants():
    print(var.site.position, var.alleles, var.genotypes, sep="\t")

To get all alleles at once, you can use `genotype_matrix()` (**only** if the tree sequence is small!)

In [None]:
mts.genotype_matrix()

(Are there other ways? Ask others. Is there a way of getting all variants on a slice of the tree sequence)? VCF option discussed further down.

## Stacking mutations

(2.20pm)

Note: you can apply `sim_mutations()` to *any* tree sequence, including one that already has mutations on it. This allows you to 'stack' mutations, which can be useful if you wish to simulate several different types of mutations from different models, over different time periods. 
There can be complicated statistical consequences of doing this that you need to be aware of and which we'll discuss soon, but for now, just note that it is possible and easy to do this:

In [None]:
mmts = msprime.sim_mutations(mts, rate=0.01, random_seed=1959)
SVG(mmts.draw_svg())

Note that the tree sequence above is just the original one (below) with a few new mutations on it.

In [None]:
SVG(mts.draw_svg()) 

There's also no reason why you can't apply `sim_mutations()` to a tree sequence generated outside of `msprime`. (This may be particularly useful if you want to add neutral mutations to a SLiM-generated dataset).

## 3.2 Mutation models

(2.25pm)

By default, `msprime` invokes a Jukes-Cantor model of nucleotide mutations.
Under this model, there is an equal probability of each ancestral state (`A`, `C`, `G`, `T`), and an equal probability of each possible transition between these states (`A<->C`, `A<->G` etc).
This is defined in the model's `transition_matrix`:

In [None]:
msprime.JC69().transition_matrix
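
To make the description concrete, here is a hand-built matrix in plain Python that follows the verbal definition above: every state moves to each of the other three states with equal probability. The name `jc69_like` is made up for this sketch; compare it against the output of the cell above rather than treating it as msprime's own implementation.

```python
alleles = ["A", "C", "G", "T"]

# Each state mutates to each of the three other states with equal probability,
# and never to itself (zero on the diagonal).
jc69_like = [
    [0.0 if i == j else 1.0 / 3.0 for j in range(len(alleles))]
    for i in range(len(alleles))
]

for allele, row in zip(alleles, jc69_like):
    print(allele, [round(p, 3) for p in row])

# Every row is a probability distribution over derived states.
row_sums = [sum(row) for row in jc69_like]
```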

In our case, we can eyeball all the mutations that have arisen in our simulations:

In [None]:
for var in mts.variants():
    print(var.site.position, var.alleles, var.genotypes, sep="\t")

And summarise them like this, with some extra code:

In [None]:
def count_transitions(ts, alleles):
    """Print a table of mutation counts: parent state (rows) to derived state (columns)."""
    counts = np.zeros((len(alleles), len(alleles)), dtype=int)
    for site in ts.sites():
        for mut in site.mutations:
            # The state being replaced is the derived state of the parent mutation,
            # or the site's ancestral state if there is no mutation above this one.
            if mut.parent == tskit.NULL:
                parent_state = site.ancestral_state
            else:
                parent_state = ts.mutation(mut.parent).derived_state
            counts[alleles.index(parent_state), alleles.index(mut.derived_state)] += 1
    print("\t", "\t".join(alleles))
    for j, a in enumerate(alleles):
        print(f"{a}\t", "\t".join(map(str, counts[j])))

count_transitions(mts, msprime.JC69().alleles)

In this case, we have a reasonably small number of mutations, so it's not necessarily obvious that our mutations satisfy the Jukes-Cantor model. This should be clearer looking at a larger dataset:

In [None]:
mts_big = msprime.sim_mutations(ts_big, rate=0.0001, random_seed=2016)
count_transitions(mts_big, msprime.JC69().alleles)

What are the other options?

### HKY

In some situations you'll want nucleotide transitions (`A<->G` and `C<->T`, i.e. exchanges between bases of a similar shape: purine with purine, or pyrimidine with pyrimidine) to be more likely than nucleotide transversions (all other changes).
You can do this with the Hasegawa, Kishino & Yano (HKY) model.
In addition to the mutation `rate`, you also specify `kappa`, a constant scaling parameter such that the probability of a transition is `kappa` times the probability of a transversion.
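
The transition/transversion distinction can be sketched with a small helper. `is_transition` is a hypothetical function written for this worksheet, not part of msprime or tskit.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(parent_state, derived_state):
    """Return True for a purine<->purine or pyrimidine<->pyrimidine change."""
    if parent_state == derived_state:
        return False  # the state did not change at all
    pair = {parent_state, derived_state}
    return pair <= PURINES or pair <= PYRIMIDINES

changes = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T"), ("A", "A")]
for parent_state, derived_state in changes:
    kind = "transition" if is_transition(parent_state, derived_state) else "other"
    print(parent_state, "->", derived_state, kind)
```

Each base has exactly one transition partner and two transversion partners, which is worth keeping in mind when reading the counts in a table of mutations.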

In [None]:
msprime.HKY(kappa=20).transition_matrix

You'll notice that there is now a (non-zero) probability of *silent mutations*: "transitions" from a state to itself, which sit on the diagonal of the matrix. We'll discuss this soon.

In [None]:
mts = msprime.sim_mutations(ts, rate=0.01, random_seed=2022, model=msprime.HKY(kappa=20))
SVG(mts.draw_svg())

Now, most of the mutations are transitions. (Also, note that the site with multiple mutations now experiences a back mutation).

In [None]:
for var in mts.variants():
    print(var.site.position, var.alleles, var.genotypes, sep="\t")

In [None]:
count_transitions(mts, msprime.HKY(kappa=20).alleles)

Again, the patterns will be a bit more statistically clear on a larger tree sequence.

In [None]:
mts_big = msprime.sim_mutations(ts_big, rate=0.0001, random_seed=2016, model=msprime.HKY(kappa=20))
count_transitions(mts_big, msprime.HKY(kappa=20).alleles)

### GTR

(Maybe omit this...)
For even more control, what if you want to specify each nucleotide substitution individually?
Then you want the GTR (Generalised Time-Reversible) model.
We'd recommend looking up the documentation (put in a link) if you want to use this one, but essentially, in addition to specifying an equilibrium mutation `rate`, you also specify some `relative_rates`, which indicate the relative frequency of each of the possible nucleotide switches: (`A<->C`, `A<->G`, `A<->T`, `C<->G`, `C<->T`, `G<->T`).

In [None]:
mts = msprime.sim_mutations(
    ts, rate=0.01, random_seed=2022, model=msprime.GTR(relative_rates=[1,2,1,2,1,2]))
SVG(mts.draw_svg())

### Binary mutation model
Two possible alleles: 0 and 1, the ancestral allele is always 0.

In [None]:
mts_big = msprime.sim_mutations(
    ts_big, rate=0.0001, random_seed=20278,
    model=msprime.BinaryMutationModel())
count_transitions(mts_big, msprime.BinaryMutationModel().alleles)

### Models where the mutations aren't nucleotides

(2.35pm)

So far, we've been thinking about genomic locations in terms of nucleotides, and mutations as nucleotide substitutions.
However, msprime just sees these genomic locations as numbers along a line, and there's no reason why those numbers can't represent other more general entities.

(mention that you need to reinterpret/rescale recombination rate in this case)

In [None]:
mts = msprime.sim_mutations(ts, rate=0.01, random_seed=2022,
                            model=msprime.PAM())

Now, every allele is one of 20 letters of the Latin alphabet, each being the one-letter code for an amino acid.

In [None]:
for var in mts.variants():
    print(var.site.position, var.alleles, var.genotypes, sep="\t")

In this case, *many* of the mutations are now silent.

In [None]:
# mts_big = msprime.sim_mutations(ts_big, rate=0.0001, random_seed=2016, model=msprime.PAM())
# count_transitions(mts_big, msprime.PAM().alleles)

### Make-your-own mutation model!
Maximum flexibility:
 - alleles (the possible 'choices' you can see at each unit)
 - root distribution (what's the distribution of ancestral alleles?)
 - transition matrix (what's the probability of mutating from one allele to another?)

In [None]:
model = msprime.MatrixMutationModel(
    alleles = ["💩", "🎄", "🔥"],
    root_distribution = [1.0, 0.0, 0.0],
    transition_matrix = [[0.0, 1.0, 0.0],
                         [0.0, 0.8, 0.2],
                         [1.0, 0.0, 0.0]]
)
mts = msprime.sim_mutations(ts, rate=0.01, random_seed=1215112, model=model)

for var in mts.variants():
    print(var.site.position, var.alleles, var.genotypes, sep="\t")
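
When writing a matrix by hand, it can help to check that the three ingredients agree with each other before simulating. The sketch below is plain Python; `check_mutation_model_inputs` is a hypothetical helper, not an msprime function, and the letters `X`, `Y`, `Z` stand in for the alleles used above.

```python
import math

def check_mutation_model_inputs(alleles, root_distribution, transition_matrix):
    """Raise ValueError if the three inputs are not mutually consistent."""
    n = len(alleles)
    if len(root_distribution) != n or len(transition_matrix) != n:
        raise ValueError("need one root probability and one matrix row per allele")
    if not math.isclose(sum(root_distribution), 1.0):
        raise ValueError("root_distribution must sum to 1")
    for allele, row in zip(alleles, transition_matrix):
        if len(row) != n or any(p < 0 for p in row):
            raise ValueError(f"row for {allele!r} has wrong length or negative entries")
        if not math.isclose(sum(row), 1.0):
            raise ValueError(f"row for {allele!r} must sum to 1")

alleles = ["X", "Y", "Z"]
root_distribution = [1.0, 0.0, 0.0]
transition_matrix = [[0.0, 1.0, 0.0],
                     [0.0, 0.8, 0.2],
                     [1.0, 0.0, 0.0]]
check_mutation_model_inputs(alleles, root_distribution, transition_matrix)

# Diagonal entries are the per-state probabilities of a silent mutation.
silent_entries = [transition_matrix[i][i] for i in range(len(alleles))]
print(silent_entries)
```

In this matrix only the second allele can mutate to itself, so any silent mutations will occur on lineages carrying that allele.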

 - mention the SLiM mutation model, for anyone using SLiM
 - This would be a good place for an exercise

### Exercise:
Modify the following code to simulate mutations on top of the `ts` tree sequence, using a mutation model where transitions are 5 times more likely than transversions, with a hotspot between locations 25 and 50.

## 3.3 Silent mutations, state independence and related adjustments

(2.45pm)

Some things I've drawn attention to before:
 - mutations that apparently don't change the allelic state
 - non-0 diagonal entries in some transition matrices
 - the complications of simulating with stacked mutations

These are all to do with *silent mutations*, i.e. mutations which do not change the allelic state. This is the main 'gotcha' to keep in mind with some of these mutation models.

Let's have a closer look at some of the transition matrices of the models we've looked at:

In [None]:
msprime.JC69().transition_matrix

In [None]:
msprime.HKY(kappa=20).transition_matrix

In [None]:
msprime.PAM().transition_matrix[:4,:4]

In [None]:
np.sum(msprime.PAM().transition_matrix[1,:])

In particular, notice that some of the diagonal entries are non-zero. This indicates that there is a positive probability of mutations where the ancestral and derived state are the same. We have observed instances of these above.

Why does `msprime` simulate these `silent mutations`? The short answer is that it is...

The most important thing is that we know how to deal with it, and in particular, how to adjust for it. We might usually think that when we specify a mutation rate, we are specifying the rate of mutations that change the allelic state. If we use that rate with a model that includes silent mutations, we will get fewer variable sites than we expect, which in turn leads to underestimates of statistics that rely on them, such as diversity and the number of segregating sites.

If you are using an msprime mutation model whose transition matrix has non-zero diagonal entries, you need to work out what fraction of the mutations are expected to be silent, and adjust the mutation rate accordingly. The msprime documentation covers this [here](https://tskit.dev/msprime/docs/stable/mutations.html#adjusting-mutation-rates-for-silent-mutations).

Note that in the PAM matrix, an `A->A` 'transition' is *more* likely than any single `A->B`, `A->C`, `A->D` transition etc!
This produces silent mutations.
The reason why this happens: it's a workaround to ensure that `msprime` does not simulate nonsensical mutations.
Let's see how to adjust the mutation rate for the PAM model:

$$ \mu = \mu_{silent} + \mu_{non-silent}$$

and

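$$ \mu_{silent} = \frac{\mu}{n} \sum_{i=1}^{n} T_{i, i} $$
where $T_{i, j}$ is the $(i, j)$th element of the transition matrix for this mutation model and $n$ is the number of alleles. (This unweighted average treats every allelic state as equally common, so it is an approximation.)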

In [None]:
T_pam = msprime.PAM().transition_matrix
silent_prop = np.sum(np.diag(T_pam))/np.sum(T_pam)
silent_prop

So 45% of our mutations will be silent. To make up for this 'missing' 45% and end up with non-silent mutations at the intended rate $\mu$, we need to simulate with a rate of $\mu / (1 - 0.45) = \mu \left( 1 + 45/55 \right)$, i.e. roughly $1.82\mu$.
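
The adjustment can be written as two small functions. This is a sketch in plain Python: `silent_fraction` and `adjusted_rate` are hypothetical helpers, and the unweighted mean of the diagonal assumes that every allelic state is equally common, as in the calculation above.

```python
def silent_fraction(transition_matrix):
    """Unweighted mean of the diagonal entries (rows are assumed to sum to 1)."""
    n = len(transition_matrix)
    return sum(transition_matrix[i][i] for i in range(n)) / n

def adjusted_rate(target_rate, silent):
    """Rate to simulate with so that non-silent mutations occur at target_rate."""
    if not 0 <= silent < 1:
        raise ValueError("silent fraction must be in [0, 1)")
    return target_rate / (1 - silent)

# Toy two-allele model in which half of all mutations are silent.
toy_matrix = [[0.5, 0.5],
              [0.5, 0.5]]
silent = silent_fraction(toy_matrix)
rate = adjusted_rate(0.01, silent)
print(silent, rate)
```

With a real model, the same idea would mean passing something like `adjusted_rate(0.01, silent_prop)` as the `rate` argument of `sim_mutations()`.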

These 'silent' mutations are just a mathematical artifact. Basically, in order to ensure that you throw down mutations that are consistent with each other, msprime makes this tradeoff -- you simulate variant mutations that are at a rate that is *proportional* to those specified by the model. You then need to adjust for this.

In [None]:
mts = msprime.sim_mutations(ts, rate=0.01, random_seed=2022,
                            model=msprime.PAM())
SVG(mts.draw_svg())

 - Consider an exercise here where people adjust their mutation rate.
 - Discuss scenarios where silent mutations are and aren't likely to be an issue

## 3.5 Mutation rate variation

3pm

You can specify mutation hotspots by passing a `RateMap` object as the `rate` argument, instead of a single number:

In [None]:
ratemap = msprime.RateMap(position=[0, 40, 60, 100], rate=[0.01, 0.1, 0.01])
mts = msprime.sim_mutations(ts, rate=ratemap, random_seed=104)
SVG(mts.draw_svg())
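
To get a feel for how strong this hotspot is, we can work out how the expected mutations are shared between the three intervals. This is plain Python arithmetic on the same positions and rates, not an msprime call. On any given branch, the expected number of mutations in an interval is proportional to its length times its rate.

```python
positions = [0, 40, 60, 100]
rates = [0.01, 0.1, 0.01]

# Mutational "mass" of each interval = interval length x rate.
masses = [
    (right - left) * rate
    for left, right, rate in zip(positions[:-1], positions[1:], rates)
]
total = sum(masses)
hotspot_share = masses[1] / total
print([round(m, 3) for m in masses], round(hotspot_share, 3))
```

The hotspot covers only 20% of the sequence but is expected to receive about 71% of the mutations.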

## 3.6 Mutation rates in different epochs

(3.05pm)

Use the `start_time` and `end_time` arguments to simulate mutations only within specific timeframes:

In [None]:
mts = msprime.sim_mutations(ts, rate=0.01, random_seed=1714, start_time=2)
SVG(mts.draw_svg())

In [None]:
mmts = msprime.sim_mutations(mts, rate=0.1, random_seed=851, start_time=1,
                            end_time=2)
SVG(mmts.draw_svg())

In [None]:
mts.tables.mutations

In [None]:
# Why does msprime renumber the mutations?
mmts.tables.mutations

*Exercise*. Over the past 50 generations, your study organisms were exposed to an environmental mutagen that induces additional mutations uniformly across their genomes at a rate of 1e-6 per generation. Modify the code below to model these circumstances.

## 3.6 Discrete vs continuous coordinates

(3.15pm)

As with `sim_ancestry()`, we can choose whether to place mutations at discrete (integer) or continuous (floating point) positions. Use the `discrete_genome` argument:

In [None]:
mts = msprime.sim_mutations(ts, rate=0.01, random_seed=2016,
                           discrete_genome=False)

for var in mts.variants():
    print(var.site.position, var.alleles, var.genotypes, sep="\t")

(Our mutational model now conforms to the classical *infinite sites* assumption).

## 3.7 Exporting sequence data into other formats

(3.20pm)

To run analyses using the sequence data you've just simulated, you have a lot of different approaches to consider.
One thing that I'll mention briefly now, and demonstrate fully later, is that there is a lot you can do using the tree sequence objects on their own. If there is a way to do the operation using `tskit`, we recommend it, as on realistically sized datasets, it will almost always be quicker and more memory-efficient.
However, sometimes you can't avoid exporting the data into another format, like VCF, in order to work with other software.

### To VCF
Basic syntax:

In [None]:
mts.num_individuals

In [None]:
with open("worksheet3-output.vcf", "w") as vcf_file:
    mts.write_vcf(vcf_file)

What the output looks like:

In [None]:
with open("worksheet3-output.vcf", "r") as f:
    print(f.read())

Fancier version, if you want to specify your own individual names:

In [None]:
indv_names = ["platypus_1", "platypus_2"]
with open("worksheet3-output.vcf", "w") as vcf_file:
    mts.write_vcf(vcf_file, individual_names=indv_names)

In [None]:
with open("worksheet3-output.vcf", "r") as f:
    print(f.read())

### To other Python objects

Many of the underlying attributes of tskit objects are `numpy` objects, so play well with other Python libraries, including plotting libraries like `matplotlib` and scientific libraries like `scikit-allel`.

In particular, the `genotype_matrix` method will return an object that works as a HaplotypeArray in scikit-allel, which gives you access to summary functions.

In [None]:
import allel

(Confusingly named `genotype matrix`, but is actually haplotypes)

In [None]:
gens = mts.genotype_matrix()
print(gens)

Can easily be converted into the right objects for analyses with `allel`.

In [None]:
haps = allel.HaplotypeArray(gens)
allele_counts = haps.count_alleles()

In [None]:
%%time
allel.mean_pairwise_difference(allele_counts)

But, this takes a long time on big datasets. Some of these operations can be done much more quickly within tskit itself, which we'll get to next time.