This notebook covers the [intro to the tsinfer tutorial](https://tsinfer.readthedocs.io/en/latest/tutorial.html).

In [1]:
import tsinfer, sys

Say we have the following dataset of 5 haplotypes over a sequence of length 6:

```
sample  haplotype
0       AGCGAT
1       TGACAG
2       AGACAT
3       ACCGCT
4       ACCGCT
```
We'll first create an object called `sample_data` to hold this dataset. It is an instance of `tsinfer.SampleData`:

In [2]:
with tsinfer.SampleData(sequence_length=6) as sample_data:
    sample_data.add_site(0, [0, 1, 0, 0, 0], ["A", "T"])
    sample_data.add_site(1, [0, 0, 0, 1, 1], ["G", "C"])
    sample_data.add_site(2, [0, 1, 1, 0, 0], ["C", "A"])
    sample_data.add_site(3, [0, 1, 1, 0, 0], ["G", "C"])
    sample_data.add_site(4, [0, 0, 0, 1, 1], ["A", "C"])
    sample_data.add_site(5, [0, 1, 0, 0, 0], ["T", "G"])
    

The first argument to `add_site()` specifies the genomic position of the added site.

The second argument is a list of genotypes taking values 0 (ancestral state) and 1 (derived state).

The third argument is the list of alleles at this site (length 2 here). Element 0 is the ancestral allele and element 1 is the derived allele, matching the 0 and 1 codes in the genotype list.

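The position and the genotypes are required. According to the tsinfer documentation, `alleles` is optional and defaults to `["0", "1"]`, but passing it keeps the real nucleotides in the output.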
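To make the link between the haplotype table and the three `add_site()` arguments concrete, here is a small sketch in plain Python (no tsinfer needed) that rebuilds the position, genotype list and allele list for each site from the haplotype strings. The helper name `sites_from_haplotypes` and the `ancestral` string are my own assumptions for illustration; in particular, I'm assuming the ancestral sequence is `AGCGAT`, which is what the allele lists in the cell above imply.

```python
# Haplotypes copied from the table at the top of the notebook.
haplotypes = ["AGCGAT", "TGACAG", "AGACAT", "ACCGCT", "ACCGCT"]

# Assumption: the ancestral allele at each position (first allele in each add_site call).
ancestral = "AGCGAT"


def sites_from_haplotypes(haplotypes, ancestral):
    """Hypothetical helper: return (position, genotypes, alleles) for each site."""
    sites = []
    for position, anc in enumerate(ancestral):
        # The column of nucleotides carried by each sample at this position.
        column = [h[position] for h in haplotypes]
        # Ancestral allele first, then any derived alleles.
        alleles = [anc] + sorted(set(column) - {anc})
        # Genotype = index of the sample's nucleotide in the allele list.
        genotypes = [alleles.index(a) for a in column]
        sites.append((position, genotypes, alleles))
    return sites


sites = sites_from_haplotypes(haplotypes, ancestral)
for position, genotypes, alleles in sites:
    print(position, genotypes, alleles)
```

Each printed row should line up with one of the `add_site()` calls above, so the tuples could be passed straight through as `sample_data.add_site(position, genotypes, alleles)`.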

`SampleData` objects look like the foundational unit of `tsinfer`. Here are their attributes:

In [3]:
sys.stdout.write(", ".join(dir(sample_data)))

ADDING_POPULATIONS, ADDING_SAMPLES, ADDING_SITES, BUILD_MODE, EDIT_MODE, FORMAT_NAME, FORMAT_VERSION, READ_MODE, _SampleData__all_haplotypes, __class__, __delattr__, __dict__, __dir__, __doc__, __enter__, __eq__, __exit__, __format__, __ge__, __getattribute__, __gt__, __hash__, __init__, __init_subclass__, __le__, __lt__, __module__, __ne__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__, __weakref__, _alloc_site_writer, _build_state, _check_build_mode, _check_edit_mode, _check_format, _check_metadata, _check_write_modes, _chunk_size, _compressor, _format_str, _individuals_writer, _last_position, _metadata_codec, _mode, _new_lmdb_store, _num_flush_threads, _open_lmbd_readonly, _open_readonly, _populations_writer, _samples_writer, _sites_writer, add_individual, add_population, add_provenance, add_site, arrays, close, copy, data, data_equal, file_size, finalise, finalised, format_name, format_version, from_tree_sequence, genotypes, haplot

We'll learn more about these attributes later, I guess.

Let's infer a tree sequence for this data! It's as simple as this:

In [5]:
inferred_ts = tsinfer.infer(sample_data)

In [6]:
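# Draw every tree in the inferred tree sequence (only one here).
for tree in inferred_ts.trees():
    print(tree.draw(format="unicode"))
# Print the haplotypes encoded by the tree sequence; they should match the input.
for sample_id, h in enumerate(inferred_ts.haplotypes()):
    print(sample_id, h, sep="\t")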

    7      
┏━━┳┻━━┓   
┃  5   6   
┃ ┏┻┓ ┏┻┓  
0 3 4 1 2  

0	AGCGAT
1	TGACAG
2	AGACAT
3	ACCGCT
4	ACCGCT
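One way to read the tree drawn above is to check that each site's derived allele is carried by exactly the samples below a single node, so one mutation per site is enough to explain the data. The sketch below does that in plain Python. The `parent` dictionary is transcribed by hand from the drawing, and the node it picks for each site is just a placement consistent with the data, not something read from the tsinfer output.

```python
# Parent of each node, transcribed by hand from the tree drawn above.
parent = {0: 7, 5: 7, 6: 7, 3: 5, 4: 5, 1: 6, 2: 6}
samples = [0, 1, 2, 3, 4]

# Genotypes copied from the add_site() calls earlier in the notebook.
site_genotypes = {
    0: [0, 1, 0, 0, 0],
    1: [0, 0, 0, 1, 1],
    2: [0, 1, 1, 0, 0],
    3: [0, 1, 1, 0, 0],
    4: [0, 0, 0, 1, 1],
    5: [0, 1, 0, 0, 0],
}


def samples_below(node):
    """Return the sorted samples that descend from (or are equal to) node."""
    found = []
    for s in samples:
        u = s
        # Walk up towards the root until we hit the node or run out of parents.
        while u != node and u in parent:
            u = parent[u]
        if u == node:
            found.append(s)
    return found


# For each site, find a node whose descendant samples are exactly the carriers.
mutation_node = {}
for site, genotypes in site_genotypes.items():
    carriers = [i for i, g in enumerate(genotypes) if g == 1]
    for node in range(8):
        if samples_below(node) == carriers:
            mutation_node[site] = node
            break

print(mutation_node)
```

Every site finds a matching node: sites 1 and 4 sit above node 5 (samples 3 and 4), sites 2 and 3 above node 6 (samples 1 and 2), and sites 0 and 5 on the branch leading to sample 1.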


## An example with simulation

In [7]:
import msprime
import tqdm # allows us to have progress bars. Useful!

In [8]:
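# Simulate 10,000 sample genomes over a 10 Mb region.
# Note: msprime.simulate() is the legacy (pre-1.0) msprime interface.
ts = msprime.simulate(
    sample_size=10000, Ne=10**4, recombination_rate=1e-8,
    mutation_rate=1e-8, length=10*10**6, random_seed=42)
ts.dump("simulation-source.trees")
print("simulation done:", ts.num_trees, "trees and", ts.num_sites, "sites")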

# Copy every simulated site into a SampleData file on disk, with a progress bar.
progress = tqdm.tqdm(total=ts.num_sites)
with tsinfer.SampleData(
        path="simulation.samples", sequence_length=ts.sequence_length,
        num_flush_threads=2) as sample_data:
    for var in ts.variants():
        # One add_site() call per variant: position, genotypes, alleles.
        sample_data.add_site(var.site.position, var.genotypes, var.alleles)
        progress.update()
    progress.close()

  0%|          | 0/39001 [00:00<?, ?it/s]

simulation done: 36734 trees and 39001 sites


100%|██████████| 39001/39001 [00:09<00:00, 4277.60it/s]
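As a rough sanity check on the site count, the standard neutral coalescent with infinite-sites mutations gives an expected number of segregating sites of θ·(1 + 1/2 + ... + 1/(n−1)), with θ = 4·Ne·μ·L for the whole region. The sketch below plugs in the parameters from the `msprime.simulate()` call above. I'm assuming that this is the model the simulation ran under (I believe it is the default), so treat it as a back-of-the-envelope check rather than anything exact.

```python
import math

# Parameters copied from the msprime.simulate() call above.
sample_size = 10000
Ne = 10**4
mutation_rate = 1e-8
length = 10 * 10**6

# Population-scaled mutation rate for the whole region.
theta = 4 * Ne * mutation_rate * length

# Harmonic sum 1 + 1/2 + ... + 1/(n - 1).
harmonic = math.fsum(1 / i for i in range(1, sample_size))

# Expected number of segregating sites under the assumptions stated above.
expected_sites = theta * harmonic
print(f"theta = {theta:.0f}, expected segregating sites = {expected_sites:.0f}")
```

The expectation lands within about one percent of the 39001 sites reported by the run above, which is reassuring.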
