In [None]:
# Please execute me (shift+<Return>) – this should print out "Your notebook is ready to go"

import tskit

from IPython.display import HTML, display

import genealogical_analysis_workshop

workshop = genealogical_analysis_workshop.setup()
display(HTML(workshop.css))
print("Your notebook is ready to go")

## Getting started with `tskit`

This workbook introduces the basics of `tskit` through a series of short exercises.

<div class="alert alert-block alert-info">
    This workbook uses the tskit <a href="https://tskit.dev/tskit/docs/stable/python-api.html#sec-python-api">Python API</a> so assumes a basic knowledge of Python, but it's also possible to
<a href="https://tskit.dev/tutorials/tskitr.html">use R</a>, or access the API in other languages, notably
<a href="https://tskit.dev/tskit/docs/stable/c-api.html#sec-c-api">C</a> and <a href="https://github.com/tskit-dev/tskit-rust">Rust</a>.
</div>

Genetic genealogies in the form of tree sequences can be obtained by simulation or inference from real data. In the next practical we shall explore both methods. However, for the time being, we will simply work with two pre-existing tree sequences, one from a simple simulation of selection on a 1Mb section of genome and another inferred from a large human dataset.

In [None]:
import tskit
ts = tskit.load("simulated.trees")
print("Loaded the tree sequence into a variable called `ts`")

In a Jupyter notebook, the value of the last expression in a cell is displayed on screen.
When a tree sequence instance is output in this way, a nicely formatted summary should be printed out. Try it below:

In [None]:
# Display `ts` to the screen (formatted nicely in a Jupyter notebook)
ts

Various useful properties of the tree sequence have been printed out above, such as the number of trees and sample nodes, the sequence length (conventionally interpreted as the number of base pairs), etc. They are available as Python attributes:

In [None]:
print(ts.num_trees, "trees")
print(ts.num_samples, "sample nodes")
print("Sequence length is", ts.sequence_length, "bp")

The summary above also prints information on the basic entities in the tree sequence, such as nodes and edges, sites and mutations, and individuals and populations, which are stored in [tables](https://tskit.dev/tutorials/tables_and_editing.html#correspondence-between-tables-and-trees).

<h6>Exercise 1</h6>
Modify the code below to print out not just the number of nodes, but also the number of edges, sites and mutations, and check that they correspond to the counts reported in the summary above.

In [None]:
# Exercise 1: modify me
print("Tree sequence has", ts.num_nodes, "nodes")

In [None]:
workshop.Q1()

The trees in a tree sequence are constructed from the nodes and edges. There are a few different ways to access a tree, all of which return a <a href="https://tskit.dev/tskit/docs/stable/python-api.html#tskit.Tree">Tree</a> object.

In [None]:
ts.first()  # The first tree in the tree sequence (tree number 0)

In [None]:
ts.at(100000) # The tree at position 100Kb: this is tree number 6 (i.e. it is the 7th tree)

In [None]:
tree = ts.last()
print(f"The last tree (index {tree.index}) covers bases {tree.interval.left} to {tree.interval.right}")

To visualise the tree you can use either SVG or text format:

In [None]:
tree = ts.first()
print(tree.draw_text())  # Text format: numbers represent node IDs

In [None]:
from IPython.display import SVG
svg_tree = tree.draw_svg(size=(400, 200))
display(SVG(svg_tree))  # output SVG format to a Jupyter notebook

<div class="alert alert-block alert-info"><b>Note:</b>
    Often in these trees, the terminal branches are very short, and the height of the nodes is best plotted on a log scale. This can be done using <code>draw_svg(..., time_scale="log_time")</code>.
</div>

<h6>Exercise 2</h6>
You can find details of a particular node (e.g. node 9) using <code>ts.node(9)</code>. Use this to find the age of the root (i.e. oldest) node shown in the tree above. 
<div class="alert alert-block alert-info"><b>Tip:</b>
    You can check that you have the right node ID by using <code>tree.root</code>.</div>

In [None]:
# Exercise 2: find the age (in generations ago) of the oldest node in the first tree


In [None]:
workshop.Q2()

<h6>Exercise 3</h6>
Plot the tree at position 400 Kb in SVG format below, adding the following parameters to the `draw_svg()` function to produce nicer formatting:

* `size=(400, 200)` – to plot the SVG 400 pixels wide and 200 pixels high
* `y_axis=True` – to show a y (time) axis
* `y_ticks={y: str(y) for y in range(0, 5000, 1000)}` – to use nice tick spacing on the y axis

<div class="alert alert-block alert-info">
<b>Tip:</b> to look at the help for a function like <code>draw_svg()</code> you can type <code>help(tskit.Tree.draw_svg)</code> or <code>?tskit.Tree.draw_svg</code> in a Jupyter notebook, or examine the documentation <a href="https://tskit.dev/tskit/docs/stable/python-api.html#tskit.Tree.draw_svg">here</a>.</div>

In [None]:
# Exercise 3: use draw_svg() below to plot the tree at position 400Kb


In the SVG format, the nodes are marked in black and numbered by node ID. Squares are used for nodes whose genomes we have sampled.

Mutations, on the other hand, are by default plotted as red crosses, with the mutation number plotted beside them.

In [None]:
workshop.Q3()

You should find that the tree you just plotted and the tree plotted as an SVG a little earlier look quite different, particularly in the length of the terminal branches. This could indicate different sorts of history for different regions of the genome. One way to investigate this is to look at change along the genome. In `tskit` this is done by iterating through the trees in a tree sequence.

## Processing trees

A common idiom that underlies many tree sequence algorithms, including those
we'll encounter later in this tutorial for calculating
population genetic statistics, involves moving along the genome by
iterating over all of its [Trees](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.Tree).
To do this you use the
[.trees()](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.trees) method. Because adjacent trees in a tree sequence are highly correlated, it is very efficient to move from one tree to the next in this way (much more efficient than fetching each tree separately with `ts.at_index(0)`, then `ts.at_index(1)`, then `ts.at_index(2)`, etc.).

In [None]:
for tree in ts.trees():
    print(f"Tree {tree.index} has a root at time {ts.node(tree.root).time} and covers {tree.interval}")

Here's an example of plotting the time of the root along the genome:

In [None]:
import matplotlib.pyplot as plt

xy_tuples = [
    (tree.interval.left, ts.node(tree.root).time)
    for tree in ts.trees()
]
# `zip` changes (x1,y1), (x2,y2), ... into (x1, x2, ...), (y1, y2, ...);
# where="post" makes each step start at the left edge of its tree
plt.step(*zip(*xy_tuples), where="post")
plt.xlabel("Genome position (bp)")
plt.ylabel(f"Time of root (or MRCA) in {ts.time_units}")
plt.yscale("log")
plt.show()

<h6>Exercise 4</h6>
Copy and paste the code above, then change it to plot the total branch length of each tree instead of the time of the root node.

<div class="alert alert-block alert-info">
    <b>Tip:</b> Use the <code>Tree.total_branch_length</code> property. Also, you don't need to reimport matplotlib.pyplot. The plot should look very similar, because trees with recent root nodes are likely to have a short total branch length.</div>


In [None]:
# Exercise 4: plot the total branch length of the trees along the genome


It seems that there might be something unusual about the trees in the middle of this
chromosome: both the root node time and the total branch length are noticeably smaller around 400Kb into the sequence. We can investigate this using various built-in statistics.

## Built-in statistics

If you are calculating statistics along the genome, the `tskit` library provides methods for you. This can avoid the need to explicitly iterate over trees by hand. Methods include, for example, the average density of variable sites along the genome:

In [None]:
print(
    "Density of sites over all the genome:",
    ts.num_sites / ts.sequence_length,
    "(NB: this counts *all* defined sites, not just variable ones)" 
)
print(
    "Equivalent statistical calculation:",
    ts.segregating_sites(),
    "(This is better as it only counts variable sites)"
)

In [None]:
# Another advantage of the built-in stats framework is that stats can be output in windows along the genome
import numpy as np
window_points, step = np.linspace(0, ts.sequence_length, num=11, retstep=True)
print(ts.segregating_sites(windows=window_points))

# Or plot the windowed data
plt.stairs(ts.segregating_sites(windows=window_points), window_points, baseline=None)
plt.xlabel("Genome position (bp)")
plt.ylabel(f"Density of variable sites in {step/1000}Kb windows")
plt.yscale("log")
plt.show()

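Note how closely this mirrors the total branch length plotted in Exercise 4: the expected number of mutations in a region is proportional to the total branch length of the trees there, so regions with short trees contain few variable sites.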
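
To make the windowing used above concrete, here is a minimal pure-Python sketch of what a windowed site density amounts to: count the sites that fall in each window, then divide by the window's span. The site positions and the helper name `windowed_density` are hypothetical, and the sketch treats every listed site as variable. It is a simplification for illustration, not how `tskit` computes the statistic internally.

```python
from bisect import bisect_left


def windowed_density(positions, breakpoints):
    """Sites per base pair in each half-open window [left, right)."""
    positions = sorted(positions)
    densities = []
    for left, right in zip(breakpoints[:-1], breakpoints[1:]):
        # bisect_left gives the index of the first position >= the given value
        n_sites = bisect_left(positions, right) - bisect_left(positions, left)
        densities.append(n_sites / (right - left))
    return densities


site_positions = [5, 12, 18, 47, 90, 95]  # hypothetical site positions (bp)
breakpoints = [0, 25, 50, 75, 100]  # 5 breakpoints define 4 windows
density = windowed_density(site_positions, breakpoints)
print(density)
```

Note that `n` windows always need `n + 1` breakpoints, which is why the cell above asks `np.linspace` for 11 points to obtain 10 windows.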

## Allele frequency spectrum (AFS)

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(12, 3))

# Whole genome
afs = ts.allele_frequency_spectrum()
ax1.bar(range(ts.num_samples + 1), afs)
ax1.set_title("Unpolarised AFS: whole genome")

# First 200Kb only
afs = ts.keep_intervals([[0, 200_000]]).trim().allele_frequency_spectrum()
ax2.bar(range(ts.num_samples + 1), afs)
ax2.set_title("Unpolarised AFS: 0 to 200Kb")

# Region from 400Kb to 600Kb
afs = ts.keep_intervals([[400_000, 600_000]]).trim().allele_frequency_spectrum()
ax3.bar(range(ts.num_samples + 1), afs)
ax3.set_title("Unpolarised AFS: 400 to 600Kb")

plt.show()



In [None]:
import matplotlib.pyplot as plt
import numpy as np
# One window per tree: use the tree breakpoints as the window boundaries
intervals = list(ts.breakpoints())
plt.stairs(ts.segregating_sites(windows=intervals, mode="branch"), intervals, baseline=None)
plt.xlabel("Genome position (bp)")
plt.ylabel("Segregating sites (branch mode)")
plt.yscale("log")
plt.show()

:::{margin}
The {ref}`visualization tutorial <sec_tskit_viz>` gives more drawing possibilities
:::

This tree shows the classic signature of a recent expansion or selection event, with many
long terminal branches, resulting in an excess of singleton mutations.

:::{margin}
The {ref}`Simplification tutorial<sec_simplification>` details many other uses
for {meth}`~TreeSequence.simplify`.
:::

It can often be helpful to slim down a tree sequence so that it represents the genealogy
of a smaller subset of the original samples. This can be done using the powerful
{meth}`TreeSequence.simplify` method.

The {meth}`TreeSequence.draw_svg` method allows us to draw
more than one tree: either the entire tree sequence, or
(by using the ``x_lim`` parameter) a smaller region of the genome:

In [None]:
reduced_ts = ts.simplify([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])  # simplify to the first 10 samples
print("Genealogy of the first 10 samples for the first 5kb of the genome")
display(SVG(reduced_ts.draw_svg(x_lim=(0, 5000))))

These are much more standard-looking coalescent trees, with far longer branches higher
up in the tree, and therefore many more mutations at higher frequencies.

:::{margin}
You cannot directly edit a tree sequence; to add e.g. metadata you must edit a
copy of the underlying tables. This is described in the
{ref}`Tables and editing tutorial<sec_tables_editing>`.
:::

:::{note}
In this tutorial we refer to objects, such as sample nodes, by their numerical IDs. These
can change after simplification, and it is often more meaningful to 
{ref}`work with metadata<sec_tutorial_metadata>`, such as sample and population names,
which can be permanently attached to objects in the tree sequence. Such metadata is
often incorporated automatically by the tools generating the tree sequence. 
:::

(sec_processing_sites_and_mutations)=

## Processing sites and mutations

:::{margin}
See the tutorial entitled "{ref}`sec_tskit_no_mutations`" for why you may not need
sites or mutations in your analyses.
:::

For many purposes it may be better to focus on the genealogy of your samples, rather than
the {ref}`sites<sec_data_model_definitions_site>` and
{ref}`mutations<sec_data_model_definitions_mutation>` that
{ref}`define <sec_what_is_dna_data>` the genome sequence itself. Nevertheless,
{program}`tskit` also provides efficient ways to return {class}`Site` objects and
{class}`Mutation` objects from a tree sequence.
For instance, under the finite sites model of mutation that we used above, multiple mutations
can occur at some sites, and we can identify them by iterating over the sites using the
{meth}`TreeSequence.sites` method:

In [None]:
import numpy as np
num_muts = np.zeros(ts.num_sites, dtype=int)
for site in ts.sites():
    num_muts[site.id] = len(site.mutations)  # site.mutations is a list of mutations at the site

# Print out some info about mutations per site
for nmuts, count in enumerate(np.bincount(num_muts)):
    info = f"{count} sites"
    if nmuts > 1:
        info += f", with IDs {np.where(num_muts==nmuts)[0]},"
    print(info, f"have {nmuts} mutation" + ("s" if nmuts != 1 else ""))
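
The tallying step in the cell above relies on two numpy idioms, `np.bincount` and `np.where`. The sketch below shows them in isolation on a small hypothetical array of per-site mutation counts (the values are made up for illustration and do not come from the tree sequence).

```python
import numpy as np

# Hypothetical number of mutations at each of 6 sites (site IDs 0 to 5)
num_muts = np.array([1, 1, 2, 1, 3, 2])

# counts[k] is the number of sites that carry exactly k mutations,
# with one entry for each value from 0 up to num_muts.max()
counts = np.bincount(num_muts)
print(counts)

# IDs of the sites that carry exactly 2 mutations
two_mut_sites = np.where(num_muts == 2)[0]
print(two_mut_sites)
```

Because the array index doubles as the site ID, `np.where` returns the site IDs directly.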

(sec_processing_genotypes)=

## Processing genotypes

At each site, the sample nodes will have a particular allelic state (or be flagged as
{ref}`tskit:sec_data_model_missing_data`). The
{meth}`TreeSequence.variants` method gives access to the
full variation data. For efficiency, the {attr}`~Variant.genotypes`
at a site are returned as a [numpy](https://numpy.org) array of integers:

In [None]:
import numpy as np
np.set_printoptions(linewidth=200)  # print genotypes on a single line

print("Genotypes")
for v in ts.variants():
    print(f"Site {v.site.id}: {v.genotypes}")
    if v.site.id >= 4:  # only print up to site ID 4
        print("...")
        break

:::{note}
Tree sequences are optimised to look at all samples at one site, then all samples at an
adjacent site, and so on along the genome. It is much less efficient to look at all the
sites for a single sample, then all the sites for the next sample, etc. In other words,
you should generally iterate over sites, not samples. Nevertheless, all the alleles for
a single sample can be obtained via the
{meth}`TreeSequence.haplotypes` method.
:::


To find the actual allelic states at a site, you can refer to the
{attr}`~Variant.alleles` provided for each {class}`Variant`:
the genotype value is an index into this list. Here's one way to print them out; for
clarity this example also prints out the IDs of both the sample nodes (i.e. the genomes)
and the diploid {ref}`individuals <sec_nodes_or_individuals>` in which each sample
node resides.

In [None]:
samp_ids = ts.samples()
print("  ID of diploid individual: ", " ".join([f"{ts.node(s).individual:3}" for s in samp_ids]))
print("       ID of (sample) node: ", " ".join([f"{s:3}" for s in samp_ids]))
for v in ts.variants():
    site = v.site
    alleles = np.array(v.alleles)
    print(f"Site {site.id} (ancestral state '{site.ancestral_state}')",  alleles[v.genotypes])
    if site.id >= 4:  # only print up to site ID 4
        print("...")
        break
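
The lookup used above, where each genotype value indexes into the list of alleles, can be seen in isolation in the following sketch. The `alleles` and `genotypes` values are hypothetical stand-ins for `v.alleles` and `v.genotypes` at a single site.

```python
# Hypothetical values for one site, standing in for v.alleles and v.genotypes
alleles = ("A", "T", "G")
genotypes = [0, 1, 0, 2, 1]  # one integer per sample node

# Each genotype value is an index into the alleles tuple
states = [alleles[g] for g in genotypes]
print(states)
```

The cell above does the same thing in one step by converting the alleles to a numpy array and indexing it with the whole genotypes array.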

:::{note}
Since we have used the {class}`msprime.JC69` model of mutations, the alleles are all
either 'A', 'T', 'G', or 'C'. However, more complex mutation models can involve mutations
such as indels, leading to allelic states which need not be one of these 4 letters, nor
even be a single letter.
:::


(sec_tskit_getting_started_compute_statistics)=

## Computing statistics

There are a {ref}`large number of statistics<tskit:sec_stats>` and related calculations
built in to {program}`tskit`. Indeed, many basic population genetic statistics are based
on the allele (or site) frequency spectrum (AFS), which can be obtained from a tree sequence
using the {meth}`TreeSequence.allele_frequency_spectrum`
method:

In [None]:
afs = ts.allele_frequency_spectrum()
plt.bar(np.arange(ts.num_samples + 1), afs)
plt.title("Unpolarised allele frequency spectrum")
plt.show()

By default this method returns the "folded" or unpolarised AFS that doesn't
{ref}`take account of the ancestral state<tskit:sec_stats_polarisation>`.
However, since the tree sequence provides the ancestral state, we can plot the polarised
version; additionally we can base our calculations on branch lengths rather than alleles,
which provides an estimate that is not influenced by random mutational "noise".

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 3))

afs1 = ts.allele_frequency_spectrum(polarised=True, mode="branch")
ax1.bar(np.arange(ts.num_samples+1), afs1)
ax1.set_title("Genome-wide branch-length AFS")

restricted_ts = ts.keep_intervals([[400_000, 600_000]])
afs2 = restricted_ts.allele_frequency_spectrum(polarised=True, mode="branch")
ax2.bar(np.arange(restricted_ts.num_samples+1), afs2)
ax2.set_title("Branch-length AFS between 400 and 600Kb")

plt.show()

On the left is the frequency spectrum averaged over the entire genome, and on the right
is the spectrum for a section of the tree sequence between 400 and 600Kb, which we've
created by deleting the regions outside that interval using
{meth}`TreeSequence.keep_intervals`. Unsurprisingly,
as we noted when looking at the trees, there's a far higher proportion of singletons in
the region of the sweep.
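
The difference between a polarised and a folded spectrum can be made concrete with a small pure-Python sketch. It assumes hypothetical biallelic data coded as 0 (ancestral) and 1 (derived), and the helper names `polarised_afs` and `fold` are invented for illustration. It is a simplified site-based tally, and `tskit` may count or weight alleles differently in its own unpolarised calculation.

```python
def polarised_afs(genotype_rows, num_samples):
    """afs[k] = number of sites where the derived allele (coded 1) is in k samples."""
    afs = [0] * (num_samples + 1)
    for row in genotype_rows:
        afs[sum(row)] += 1
    return afs


def fold(afs):
    """Combine frequency k with n - k, leaving the upper half as zeros."""
    n = len(afs) - 1
    folded = [0] * (n + 1)
    for k, count in enumerate(afs):
        folded[min(k, n - k)] += count
    return folded


# Hypothetical data: one row per site, one column per sample
rows = [
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
]
unfolded = polarised_afs(rows, num_samples=4)
folded = fold(unfolded)
print(unfolded)
print(folded)
```

Folding discards the information about which allele is ancestral, so a derived allele carried by 3 of 4 samples is counted together with those carried by 1 of 4.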

(sec_tskit_getting_started_compute_statistics_windowing)=

### Windowing

It is often useful to see how statistics vary in different genomic regions. This is done
by calculating them in {ref}`tskit:sec_stats_windows` along the genome. For this,
let's look at a single statistic, the genetic {meth}`~TreeSequence.diversity` (π). As a
site statistic this measures the average number of genetic differences between two
randomly chosen samples, whereas as a branch length statistic it measures the average
branch length between them. We'll plot how the value of π changes along the genome,
using 10kb windows:

In [None]:
import numpy as np
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 3))
L = int(ts.sequence_length)
# n windows need n + 1 breakpoints, so add 1 to get windows of exactly 10kb
windows = np.linspace(0, L, num=L//10_000 + 1)
ax1.stairs(ts.diversity(windows=windows), windows/1_000, baseline=None)  # Default is mode="site"
ax1.set_ylabel("Diversity")
ax1.set_xlabel("Genome position (kb)")
ax1.set_title("Site-based calculation")
ax1.set_yscale("log")
ax2.stairs(ts.diversity(windows=windows, mode="branch"), windows/1_000, baseline=None)
ax2.set_xlabel("Genome position (kb)")
ax2.set_title("Branch-length-based calculation")
ax2.set_yscale("log")
plt.show()

There's a clear drop in diversity in the region of the selective sweep. And as expected,
the statistic based on branch-lengths gives a less noisy signal.
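
As a rough illustration of the site-based definition given above, the following pure-Python sketch computes the average number of differences between all pairs of samples, then divides by the sequence length to give a per-base-pair value. The haplotype strings, the `sequence_length` value and the helper name `mean_pairwise_differences` are all hypothetical, and the loop over pairs is far less efficient than the tree-based calculation that `tskit` performs.

```python
from itertools import combinations


def mean_pairwise_differences(haplotypes):
    """Average number of differing sites over all pairs of haplotypes."""
    pairs = list(combinations(haplotypes, 2))
    total = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs)
    return total / len(pairs)


# Hypothetical haplotypes: one string per sample, one character per site
haplotypes = ["AACG", "AATG", "TACG", "AACG"]
sequence_length = 100  # hypothetical length (bp) of the region containing the sites

mean_diff = mean_pairwise_differences(haplotypes)
per_bp = mean_diff / sequence_length
print(mean_diff)
print(per_bp)
```

Dividing by the sequence length mirrors the default span normalisation of the `tskit` statistics, which is why the plotted values of π are small numbers per base pair.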


(sec_tskit_getting_started_exporting_data)=

## Saving and exporting data

Tree sequences can be efficiently saved to file using {meth}`TreeSequence.dump`, and
loaded back again using {func}`tskit.load`. By convention, we use the suffix ``.trees``
for such files:

In [None]:
import tskit

ts.dump("data/my_tree_sequence.trees")
new_ts = tskit.load("data/my_tree_sequence.trees")

It's also possible to export tree sequences to different formats. Note, however, that
not only are these usually much larger files, but that analysis is usually much faster
when performed by built-in tskit functions than by exporting and using alternative
software. If you have a large tree sequence, you should *try to avoid exporting
to other formats*.

### Newick and Nexus format

The most common format for interchanging tree data is Newick. 
We can export to a newick format string quite easily. This can be useful for
interoperating with existing tree processing libraries but is very inefficient for
large trees. There is also no support for including sites and mutations in the trees.

In [None]:
small_ts = reduced_ts.keep_intervals([[0, 10000]])
tree = small_ts.first()
print(tree.newick(precision=3))

For an entire set of trees, you can use the Nexus file format, which acts as a container
for a list of Newick format trees, one per line:

In [None]:
small_ts = small_ts.trim()  # Must trim off the blank region at the end of cut-down ts
print(small_ts.as_nexus(precision=3, include_alignments=False))

### VCF

The standard way of interchanging genetic variation data is the Variant Call Format, 
for which tskit has basic support:

In [None]:
import sys
small_ts.write_vcf(sys.stdout)

The `write_vcf` method takes a file object as a parameter; to get it to write out to the
notebook here we ask it to write to stdout.

### Scikit-allel

Because tskit integrates very closely with numpy, we can interoperate very efficiently
with downstream Python libraries for working with genetic
sequence data, such as [scikit-allel](https://scikit-allel.readthedocs.io/en/stable/).
We can interoperate with {program}`scikit-allel` by exporting the genotype matrix as a
numpy array, which {program}`scikit-allel` can then process in various ways.

In [None]:
import allel
# Export the genotype data to allel. Unfortunately there's a slight mismatch in the 
# terminology here where genotypes and haplotypes mean different things in the two
# libraries.
h = allel.HaplotypeArray(small_ts.genotype_matrix())
print(h.n_variants, h.n_haplotypes)
h

Scikit-allel has a wide-ranging and efficient suite of tools for working with genotype
data, so should provide anything that's needed. For example, it gives us
another way to compute the pairwise diversity statistic (that we calculated
{ref}`above<sec_tskit_getting_started_compute_statistics_windowing>`
using the native {meth}`TreeSequence.diversity` method):

In [None]:
ac = h.count_alleles()
allel.mean_pairwise_difference(ac)

(sec_tskit_getting_started_key_points)=

## Key points covered above

Some simple methods and take-home messages from this introduction to the
{program}`tskit` {ref}`sec_python_api`,
in rough order of importance:

* Objects and their attributes
    * In Python, a {class}`TreeSequence` object has a number of basic attributes such as
        {attr}`~TreeSequence.num_trees`, {attr}`~TreeSequence.num_sites`,
        {attr}`~TreeSequence.num_samples`, {attr}`~TreeSequence.sequence_length`, etc.
        Similarly a {class}`Tree` object has e.g. an {attr}`~Tree.interval` attribute, a
        {class}`Site` object has a {attr}`~Site.mutations` attribute, a {class}`Node`
        object has a {attr}`~Node.time` attribute, and so on.
    * {ref}`sec_terminology_nodes` (i.e. genomes) can belong to
        {ref}`individuals<sec_terminology_individuals_and_populations>`. For example,
        sampling a diploid individual results in an {class}`Individual` object which
        possesses two distinct {ref}`sample nodes<sec_terminology_nodes_samples>`.
* Key tree sequence methods
    * {meth}`~TreeSequence.samples()` returns an array of node IDs specifying the
        nodes that are marked as samples
    * {meth}`~TreeSequence.node` returns the node object for a given integer node ID
    * {meth}`~TreeSequence.trees` iterates over all the trees
    * {meth}`~TreeSequence.sites` iterates over all the sites
    * {meth}`~TreeSequence.variants` iterates over all the sites with their genotypes
        and alleles
    * {meth}`~TreeSequence.simplify()` reduces the number of sample nodes in the tree
        sequence to a specified subset
    * {meth}`~TreeSequence.keep_intervals()` (or its complement,
        {meth}`~TreeSequence.delete_intervals()`) removes genetic information from
        specific regions of the genome
    * {meth}`~TreeSequence.draw_svg()` plots tree sequences (and {meth}`Tree.draw_svg()`
        plots trees)
    * {meth}`~TreeSequence.at()` returns a tree at a particular genomic position
        (but using {meth}`~TreeSequence.trees` is usually preferable)
    * Various population genetic statistics can be calculated using methods on a tree
        sequence, for example {meth}`~TreeSequence.allele_frequency_spectrum`,
        {meth}`~TreeSequence.diversity`, and {meth}`~TreeSequence.Fst`; these can
        also be calculated in windows along the genome.