In [None]:
import msprime, tskit
import tsinfer, tsdate
from IPython.display import SVG

# 5. Tree sequence inference with tsinfer and tsdate

 - [5.1 An overview of tsinfer](#5.1Overview)
 - [5.2 Hands-on with tsinfer and tsdate](#5.2HandsOn)
 - [5.3 Inference accuracy](#5.3InferenceAccuracy)

Simulating a tree sequence is relatively simple compared to *inferring* a tree sequence from existing data.
The [tsinfer software](https://tsinfer.readthedocs.io/en/stable/) implements a heuristic algorithm that performs this inference in a scalable manner.

<a id='5.1Overview'></a>
## 5.1 An overview of `tsinfer`

`tsinfer` (pronounced t-s-infer) is comparable in some ways to other ancestral inference software such as [ARGweaver](https://doi.org/10.1371/journal.pgen.1004342), [Relate](https://myersgroup.github.io/relate/), and [Rent+](https://doi.org/10.1093/bioinformatics/btw735). However, it differs considerably in approach and scalability.
Note that none of these other software packages produce tree sequences as output, although it is possible to convert their output to tree sequences.
Also note that `tsinfer` produces trees with a relatively accurate topology but, unlike other ancestral inference tools, it currently makes no attempt to produce precise branch length estimates; for this we need another tool, such as `tsdate`.

An important restriction is that `tsinfer` requires phased sample sequences with known ancestral states for each variant. It also works better with full sequence data than with data from scattered target SNPs (e.g. as obtained from SNP chips).

### Algorithm overview
The `tsinfer` method is split into two main parts: 
1. the reconstruction and time ordering of ancestral haplotypes and 
2. the inference of the copying process. 
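
Part 1 has to order ancestors in time before any tree exists. As a rough illustration only (this is not `tsinfer`'s actual implementation), the sketch below assumes that derived allele frequency is used as a proxy for relative age, so that sites whose derived allele is carried by more samples are treated as older. The toy genotype matrix and the helper name `order_sites_by_frequency` are made up for this example.

```python
def order_sites_by_frequency(genotypes):
    """Return site indices sorted from oldest (most frequent) to youngest.

    genotypes[j][i] is 0 (ancestral) or 1 (derived) for site j, sample i.
    """
    freqs = [sum(row) / len(row) for row in genotypes]
    # Higher derived-allele frequency is taken to mean an older mutation;
    # ties keep their positional order because sorted() is stable.
    return sorted(range(len(genotypes)), key=lambda j: -freqs[j])


# Toy data: 4 sites (rows) by 6 haploid samples (columns).
toy_genotypes = [
    [1, 0, 0, 0, 0, 0],  # singleton
    [1, 1, 1, 1, 0, 0],  # derived allele in 4 of 6 samples
    [1, 1, 0, 0, 0, 0],  # derived allele in 2 of 6 samples
    [0, 1, 1, 1, 1, 1],  # derived allele in 5 of 6 samples
]
site_order = order_sites_by_frequency(toy_genotypes)
print(site_order)
```

Frequency is only a crude stand-in for age, which is one reason the node times in `tsinfer` output should not be read as real dates.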

The paper contains the following schematic overview of the method, with part 1 on the left and part 2 on the right. Note that the blue inferred ancestral haplotypes become shorter further back in time.

<img style="height: 600px" src="pics/tsinfer-schematic.png">

<a id='5.2HandsOn'></a>
## 5.2 Hands-on with `tsinfer` and `tsdate`

Let's try out `tsinfer` using some sequence data generated with `msprime`.
Here's a small simulated sample drawn from an admixture scenario:

In [None]:
# Specify demographic history.
demography = msprime.Demography()
demography.add_population(name="SMALL", initial_size=2000)
demography.add_population(name="BIG", initial_size=5000)
demography.add_population(name="ADMIX", initial_size=2000)
demography.add_population(name="ANC", initial_size=5000)
demography.add_admixture(
    time=100, derived="ADMIX", ancestral=["SMALL", "BIG"], proportions=[0.5, 0.5])
demography.add_population_split(time=1000, derived=["SMALL", "BIG"], ancestral="ANC")
# demography.debug()

In [None]:
# Simulate.
seq_length = 1e6
ts = msprime.sim_ancestry(
    samples={"SMALL": 1, "BIG": 1, "ADMIX": 1},
    demography=demography,
    random_seed=83,
    sequence_length=seq_length,
    recombination_rate=1e-8,
)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=318)
ts

In [None]:
# Write to VCF.
with open("worksheet5-input.vcf", "w") as vcf_file:
    ts.write_vcf(vcf_file)

### Step 1: pre-process data

`tsinfer` requires a `SampleData` object as input.
To create this, we'll need:

 - phased genotype data at sites with known positions
 - information about ancestral and derived alleles at each site
 
First, let's see how we might get this information from a VCF file:

In [None]:
# Print the first few lines of the VCF, header included.
number_of_lines = 10

with open("worksheet5-input.vcf") as vcf_file:
    for _ in range(number_of_lines):
        print(vcf_file.readline())

We'll need the information in the `POS`, `REF` and `ALT` fields, as well as the genotypes.

In [None]:
with tsinfer.SampleData(sequence_length=seq_length) as sample_data:
    with open("worksheet5-input.vcf") as vcf_file:
        for line in vcf_file:
            if line.startswith("#"):
                continue  # skip header lines
            fields = line.rstrip("\n").split("\t")
            pos = int(fields[1])
            ref, alt = fields[3], fields[4]
            # Phased diploid genotypes such as "0|1" become [0, 1].
            gens = [int(g) for g in "".join(fields[9:]).replace("|", "")]
            # REF is listed first, so it is treated as the ancestral allele.
            sample_data.add_site(pos, gens, [ref, alt])

For larger VCFs, you may wish to use the [cyvcf2](https://github.com/brentp/cyvcf2) package.
See [the tsinfer tutorial](https://tsinfer.readthedocs.io/en/latest/tutorial.html#reading-a-vcf) for some example usage.
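
The hand-rolled loop above quietly assumes that every record is biallelic and fully phased. As a minimal sketch using only the standard library, the hypothetical helper `parse_vcf_line` below makes those assumptions explicit and raises an error when they do not hold. The example record is invented for illustration.

```python
def parse_vcf_line(line):
    """Parse one VCF data line into (pos, alleles, genotypes).

    Raises ValueError for multi-allelic or unphased records, which the
    simple character-by-character approach cannot handle.
    """
    fields = line.rstrip("\n").split("\t")
    pos, ref, alt = int(fields[1]), fields[3], fields[4]
    if "," in alt:
        raise ValueError(f"site {pos} has more than one ALT allele")
    genotypes = []
    for call in fields[9:]:
        gt = call.split(":")[0]  # the GT subfield comes first
        if "/" in gt:
            raise ValueError(f"site {pos} has an unphased genotype: {gt}")
        genotypes.extend(int(allele) for allele in gt.split("|"))
    return pos, [ref, alt], genotypes


# A made-up record with three phased diploid individuals.
example = "1\t1234\t.\tA\tT\t.\tPASS\t.\tGT\t0|1\t1|1\t0|0\n"
pos, alleles, genotypes = parse_vcf_line(example)
print(pos, alleles, genotypes)
```

The returned values are in the order that `sample_data.add_site` is given in the cell above (position, genotypes and alleles), with the reference allele listed first.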

You can also use the `from_tree_sequence` method to create a `SampleData` object directly from an existing tree sequence, such as the larger Pongo dataset.

### Step 2: apply `tsinfer`!

All we need is our `SampleData` object and an estimated recombination rate:

In [None]:
tsi = tsinfer.infer(sample_data, recombination_rate=1e-7)
tsi

Let's have a look at some of the inferred trees. How do they compare with the real ones?

In [None]:
location = 50000
SVG(tsi.at(location).draw_svg(size=(500, 500)))

In [None]:
SVG(ts.at(location).draw_svg(size=(600,350)))

Some quick observations:
 - Various inaccuracies in topologies
 - Some *polytomies:* nodes with more than two children 
 
`tsinfer` also works on larger datasets.

To obtain estimates of node times, we will need to use `tsdate`,
a method for efficiently inferring the ages of ancestors in a tree sequence.
See the [tsdate documentation](https://tsdate.readthedocs.io/en/latest/) for details.

### Step 3: simplify the tree sequence

First, we'll apply `simplify()`, which removes the unary nodes that `tsinfer` leaves in the local trees:

In [None]:
tsi = tsi.simplify()
SVG(tsi.at(location).draw_svg())

### Step 4: Apply `tsdate`!

We supply an estimated (haploid) effective population size, and a mutation rate.

In [None]:
tsid = tsdate.date(tsi, Ne=7000, mutation_rate=1e-8)
SVG(tsid.at(location).draw_svg())

In [None]:
SVG(ts.at(location).draw_svg())

With inferred node times and branch lengths in our tree sequence,
we can now apply any of the branch or time-related methods in the previous worksheet to obtain inferred branch statistics, IBD segments and so on.

<a id='5.3InferenceAccuracy'></a>
## 5.3 Inference accuracy

Inferring genome-wide genealogies is a challenging task, and (as we have seen) the output from `tsinfer` should be treated with some caution.

There are not many established ways to compare one tree sequence (or ARG) with another. However, thanks to phylogenetics, there *are* many ways to compare individual trees (i.e. tree distance metrics). The most discriminating that we have found is the Kendall-Colijn metric, which also has the benefit of dealing in a principled way with the *polytomies* found in `tsinfer` trees.

<img style="height: 600px" src="pics/worksheet5-distances.png">

Consider what parts of the inferred tree sequence are likely to be important in your downstream analyses.
For instance, do branch lengths and ancestor times matter for you, or will tree topologies suffice?
Do you need your ancestral segments to be contiguous,
or is it okay if they are split over multiple ancestors in multiple edges?
Questions like these should inform the types of benchmarking that matter to you.

For instance, I thought `tsinfer` did a pretty good job of inferring recent IBD segment lengths:

<img style="height: 300px" src="pics/worksheet5-ibd-length.png">

But `tsdate` seemed to systematically overestimate their ages:

<img style="height: 300px" src="pics/worksheet5-ibd-time.png">

Given the variety of tools that are now available for these purposes (including the many that we have covered today), benchmarking the accuracy of these inferences is a task of high community importance if we are to rely on inferred genome-wide genealogies for future work.