# 5. Other simulation tools

This worksheet covers a few other packages that allow you to run different types of simulations based on the `msprime` simulation engine.

 - 5.1 `stdpopsim`
 - 5.2 `tspop`
 - 5.3 `tstrait`


In [None]:
import stdpopsim
import tspop
import tstrait
import msprime

## 5.1. `stdpopsim`

As we've seen, simulating genomic history accurately can be complicated.
However, in some situations you might just want a quick simulation of a genome that can be used to benchmark the performance of a method, or to give you a rough sanity check of what certain population genetic statistics look like in certain organisms.

For instance, suppose you wanted to get a rough idea of what diversity levels look like in people from different continental ancestries.
Consider how hard this might be if we had to code it up in `msprime` (especially if we had little prior experience in human genetics):
we'd first have to research relevant literature for information about human chromosomes (including mutation rates and recombination maps); then we'd have to find a well-supported model of demographic history for our organism (which could involve some amount of literature search), and then we'd have to code it all up, which is not an error-free process...

`stdpopsim` offers a library of standardised simulations that let you pull all of these bits of information from its catalog, reducing your need to do research and the possibility of error.
Let's briefly look at the simulation syntax:



In [None]:
# Choose a species to simulate.
species = stdpopsim.get_species("HomSap")

# Choose a species-specific contig and demographic model
contig = species.get_contig("chr22")
model = species.get_demographic_model("OutOfAfrica_3G09")

# Choose samples from the populations specific to your demographic model
samples = {"YRI": 5, "CHB": 5, "CEU": 0}

# Choose a simulation engine and use it to generate a tree sequence.
engine = stdpopsim.get_engine("msprime")
ts = engine.simulate(model, contig, samples)
ts

Later on, you might want to browse the `stdpopsim` [Catalog](https://popsim-consortium.github.io/stdpopsim-docs/stable/catalog.html) to see the considerable range of species, genomic contigs and demographic models that you can simulate under.
`stdpopsim` also has other features, including the facility to simulate under a distribution of fitness effects (DFE); see the various papers published by the stdpopsim consortium for more details.

## 5.2. `tspop`

Why is understanding ancestry important?

 - **Demography and history:** Inference about the dates and composition of evolutionary changes and historical events.
 - **Medicine:** GWAS and risk prediction studies, admixture mapping studies.
 - **Genetic pipelines:** Phasing, imputation, genotyping errors, SNP ascertainment.
 
Suppose your genealogical ancestors can be partitioned into distinct *populations*.

 <img src="pics/worksheet4-LA.png" width="500" height="500">
 
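How your genome is divided among these populations is typically reported as global ancestry (genome-wide proportions) and local ancestry (the population of origin of each genomic segment):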
 
 <img src="pics/worksheet4-LA2.png" width="500" height="500">

This is the kind of thing you could simulate in `msprime`, which we covered in Notebook 2.
For instance,
this demographic history

 <img src="pics/5-admixture-diagram.png" width="500" height="500">

could be simulated with the following code:

In [4]:
demography = msprime.Demography()
demography.add_population(name="RED", initial_size=500)
demography.add_population(name="BLUE", initial_size=500)
demography.add_population(name="ADMIX", initial_size=500)
demography.add_population(name="ANC", initial_size=500)
demography.add_admixture(
    time=100, derived="ADMIX", ancestral=["RED", "BLUE"], proportions=[0.5, 0.5]
)
demography.add_population_split(
    time=1000, derived=["RED", "BLUE"], ancestral="ANC"
)

ts = msprime.sim_ancestry(
    samples={"RED": 0, "BLUE": 0, "ADMIX" : 2},
    demography=demography,
    random_seed=1011,
    sequence_length=1e7,
    recombination_rate=3e-8
)

However, even if you set up an admixed simulation in `msprime`, you may not be able to get complete information about local ancestry out of it.

In [None]:
colour_map = {0:"red", 1:"blue", 2: "purple", 3: "gray"}
node_colours = {u.id: colour_map[u.population] for u in ts.nodes()}
tree = ts.first()
tree.draw(node_colours=node_colours)

Note that for sample 0,
we can't see whether it inherited this bit of DNA from the red or the blue population. 
To fix this, we need to use the `add_census` method, a special demographic event in `msprime` that records a node on every lineage present at the census time.
Let's modify the demography and re-simulate with this added 'census event':

In [6]:
demography = msprime.Demography()
demography.add_population(name="RED", initial_size=500)
demography.add_population(name="BLUE", initial_size=500)
demography.add_population(name="ADMIX", initial_size=500)
demography.add_population(name="ANC", initial_size=500)
demography.add_admixture(
    time=100, derived="ADMIX", ancestral=["RED", "BLUE"], proportions=[0.5, 0.5]
)
demography.add_census(time=100.01) # Census is added here!
demography.add_population_split(
    time=1000, derived=["RED", "BLUE"], ancestral="ANC"
)

ts = msprime.sim_ancestry(
    samples={"RED": 0, "BLUE": 0, "ADMIX" : 2},
    demography=demography,
    random_seed=1011,
    sequence_length=1e7,
    recombination_rate=3e-8
)

If you redraw the first tree with the same plotting code as before, you should now see a red or blue node on every branch that crosses the census time. This is the information added by `add_census`, and it's what's needed to calculate local ancestry. We'll do this now using the `tspop` package:

In [7]:
pa = tspop.get_pop_ancestry(ts, census_time=100.01)

The information about tracts of ancestry within particular populations is in the `tspop.PopAncestry.squashed_table` object:

In [None]:
st = pa.squashed_table
print(st)

`tspop` also provides a function for plotting these ancestry tracts:

In [None]:
pa.plot_karyotypes(
    sample_pair=(0,1),
    colors=['red', 'blue'],
    pop_labels=['RedPop', 'BluePop'],
    title="Local ancestry for individual 0",
    length_in_Mb=True,
    outfile=None,
)

We can also get global ancestry proportions using the `calculate_ancestry_fraction()` method. Note that it works on a single sample node (one haplotype) at a time, so to get a diploid individual's global ancestry you'll need to average the values for its two nodes:

In [None]:
hap1 = pa.calculate_ancestry_fraction(population=0, sample=0)
hap2 = pa.calculate_ancestry_fraction(population=0, sample=1)
ind_anc = (hap1 + hap2) / 2  # Average ancestry across both chromosome copies

print("Individual 0 has {:.2f}% ancestry from population 0".format(ind_anc*100))
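
To make the link between local and global ancestry concrete, here is a minimal sketch in plain Python. It does not use `tspop`: the tract list and the helper `ancestry_fraction` are hypothetical stand-ins for the kind of rows a squashed table might hold (a sample node, the left and right ends of a tract, and a population label), and the real table's layout may differ.

```python
# Hypothetical local-ancestry tracts: (sample_node, left, right, population).
# These numbers are made up for illustration; they are not tspop output.
tracts = [
    (0, 0, 4_000_000, 0),
    (0, 4_000_000, 10_000_000, 1),
    (1, 0, 1_000_000, 1),
    (1, 1_000_000, 10_000_000, 0),
]


def ancestry_fraction(tracts, sample, population):
    """Fraction of one sample node's genome assigned to `population`."""
    total = sum(right - left for s, left, right, _ in tracts if s == sample)
    in_pop = sum(
        right - left
        for s, left, right, p in tracts
        if s == sample and p == population
    )
    return in_pop / total


hap1 = ancestry_fraction(tracts, sample=0, population=0)
hap2 = ancestry_fraction(tracts, sample=1, population=0)
# Both haplotypes span the same sequence length, so a plain mean is enough.
individual = (hap1 + hap2) / 2
print(f"hap1={hap1:.2f}, hap2={hap2:.2f}, individual={individual:.2f}")
```

Global ancestry is just local ancestry summed along the genome, which is why the two haplotype values can be averaged directly when both cover the same sequence length.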

For more information about what makes tspop so fast, see the paper about the [link_ancestors algorithm](https://doi.org/10.1093/bioadv/vbad163) that underlies it. 

## 5.3. `tstrait`

What if we want our simulated individuals to also have simulated phenotypic values based on (some amount of) genetic determination?
This is what `tstrait` is for.

We'll start by using `msprime` to simulate a tree sequence with mutations: 


In [None]:
ts = msprime.sim_ancestry(
    samples=10_000,
    recombination_rate=1e-8,
    sequence_length=100_000,
    population_size=10_000,
    random_seed=100,
)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=101)

Then, we'll use `tstrait`'s `sim_phenotype` to add some phenotype data to the individuals in this simulation,
after first defining a statistical `model` for the distribution of genetic effects. 

In [12]:
model = tstrait.trait_model(distribution="normal", mean=0, var=1)


In the call below, we specify via `num_causal` that we want 100 variants to be causal for the phenotype.
The `h2` value sets the phenotype's narrow-sense (additive) heritability:
we've specified that these variants should account for 30% of the total variance in our phenotype (the remaining 70% will be due to random environmental noise).

In [21]:
sim_result = tstrait.sim_phenotype(
    ts=ts, num_causal=100, model=model, h2=0.3, random_seed=128
)
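
One way to see what `h2` means is to add the environmental noise by hand. The sketch below uses only the standard library and made-up genetic values. It assumes the noise is normally distributed with mean 0 and variance Var(G)(1 - h2)/h2, which is our reading of how the `tstrait` documentation describes it; names such as `genetic` and `noise_sd` are illustrative and not part of any library.

```python
import random
import statistics

rng = random.Random(42)
h2 = 0.3

# Made-up genetic values for 20,000 individuals (not tstrait output).
genetic = [rng.gauss(0, 2.0) for _ in range(20_000)]
var_g = statistics.pvariance(genetic)

# Choose the noise variance so that Var(G) / (Var(G) + Var(E)) equals h2.
var_e = var_g * (1 - h2) / h2
noise_sd = var_e ** 0.5
environment = [rng.gauss(0, noise_sd) for _ in genetic]
phenotype = [g + e for g, e in zip(genetic, environment)]

# The realised value differs slightly from the target because of sampling noise.
realised_h2 = var_g / statistics.pvariance(phenotype)
print(f"target h2 = {h2}, realised h2 = {realised_h2:.3f}")
```

With a heritability of 0.3, the noise variance is more than twice the genetic variance, so most of the spread in the phenotype comes from the environment.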

`sim_phenotype()` has randomly chosen 100 variants from our set of simulated mutations to be causal for this trait,
with effect sizes distributed according to the `model` we supplied.
We can see exactly which sites have been chosen,
and their simulated effect sizes,
in the `trait` attribute of our output, which can be printed as a  `pandas` dataframe:

In [None]:
trait_df = sim_result.trait
trait_df.head()

The trait values held by each individual have been calculated from these causal variants, and can be accessed with the `phenotype` attribute:

In [None]:
phenotype_df = sim_result.phenotype
phenotype_df.head()
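
The genetic part of this calculation is additive: as we understand the model, an individual's genetic value is the sum, over causal sites, of the effect size multiplied by the number of copies of the causal allele that the individual carries. Here is a toy version in plain Python, with made-up genotypes and effect sizes rather than real `tstrait` output. It is a sketch of the idea, not of the library's internals, and `genetic_values` is a hypothetical helper.

```python
# Toy data (hypothetical): rows are individuals, columns are causal sites,
# entries count copies of the causal allele (0, 1 or 2 in a diploid).
genotypes = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
]
effect_sizes = [0.5, -1.0, 0.25]


def genetic_values(genotypes, effect_sizes):
    """Sum effect size x allele count across causal sites, per individual."""
    return [
        sum(count * beta for count, beta in zip(row, effect_sizes))
        for row in genotypes
    ]


values = genetic_values(genotypes, effect_sizes)
for individual_id, value in enumerate(values):
    print(individual_id, value)
```

The phenotype is then this genetic value plus an environmental noise term.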

Note that the phenotypic value has been split into a genetic value and an environmental value, which are shown alongside the phenotype. Our environmental values are typically larger than the genetic values, which makes sense given that we simulated this trait with a fairly low heritability of 0.3.

This is just a brief introduction to `tstrait`. For more, see the [documentation](https://tskit.dev/tstrait/docs/stable/index.html) and the associated [paper](https://doi.org/10.1093/bioinformatics/btae334).