## Validating the tskit Usher tree conversion

In this notebook we check that the tskit encoding of the full Usher tree for the Viridian data is an accurate reflection of the data. We check that the node metadata is maintained correctly (and can be joined with the sc2ts representation of the Viridian dataset) and that we round-trip the alignment data correctly. 

In [2]:
import tskit
import tszip
import numpy as np
import pandas as pd
import sc2ts

In [12]:
%%time
ts = tszip.load("../data/usher_viridian_v1.0.trees.tsz")

CPU times: user 871 ms, sys: 198 ms, total: 1.07 s
Wall time: 847 ms


In [13]:
ts

Tree Sequence,Unnamed: 1
Trees,1
Sequence Length,29 904
Time Units,uncalibrated
Sample Nodes,4 471 579
Total Size,522.1 MiB
Metadata,dict  time_zero_date: 2024-06-06

Table,Rows,Size,Has Metadata
Edges,5 345 018,163.1 MiB,
Individuals,0,24 Bytes,
Migrations,0,8 Bytes,
Mutations,3 364 841,118.7 MiB,
Nodes,5 345 019,198.8 MiB,✅
Populations,0,8 Bytes,
Provenances,2,1.2 KiB,
Sites,27 508,671.6 KiB,

Provenance Timestamp,Software Name,Version,Command,Full record
"12 August, 2025 at 03:26:34 PM",sc2ts,0.0.4.dev676+g1d501ae30,minimise_metadata,Details  dict  schema_version: 1.0.0  software:  dict  name: sc2ts version: 0.0.4.dev676+g1d501ae30  parameters:  dict  command: minimise_metadata  field_mapping:  dict  strain: sample_id  environment:  dict  os:  dict  system: Linux node: holly release: 5.10.0-24-amd64 version: #1 SMP Debian 5.10.179-5 (2023-08-08) machine: x86_64  python:  dict  implementation: CPython version: 3.12.11  libraries:  dict  kastore:  dict  version: 2.1.1  tskit:  dict  version: 0.6.4  tsinfer:  dict  version: 0.1.dev1455+g7522bcc8d  resources:  dict  elapsed_time: 362.3317494392395 user_time: 363.49 sys_time: 2.75 max_memory: 5253632000
"12 August, 2025 at 11:21:25 AM",tskit,0.6.4,delete_sites,Details  dict  schema_version: 1.0.0  software:  dict  name: tskit version: 0.6.4  parameters:  dict  command: delete_sites TODO: add parameters  environment:  dict  os:  dict  system: Linux node: holly release: 5.10.0-24-amd64 version: #1 SMP Debian 5.10.179-5 (2023-08-08) machine: x86_64  python:  dict  implementation: CPython version: 3.12.11  libraries:  dict  kastore:  dict  version: 2.1.1


We can use some utility functions in sc2ts to get dataframes summarising the nodes and mutations. 

These use some numba-compiled code to compute inheritance statistics, which takes a few seconds to compile the first time it's run.

In [14]:
%%time
df_nodes = sc2ts.node_data(ts)

CPU times: user 1.2 s, sys: 376 ms, total: 1.58 s
Wall time: 1.58 s


In [15]:
%%time
df_muts = sc2ts.mutation_data(ts)

CPU times: user 666 ms, sys: 139 ms, total: 806 ms
Wall time: 805 ms


In [16]:
df_muts

,mutation_id,position,parent,node,inherited_state,derived_state,date,num_descendants,num_inheritors
0,0,267,-1,1235535,T,A,2024-06-06,1,1
1,1,267,-1,5297533,T,G,2024-06-06,1,1
2,2,269,-1,246856,G,A,2024-06-01,12,12
3,3,269,-1,2677518,G,A,2024-06-01,75,75
4,4,269,-1,4780608,G,A,2024-06-01,26,26
...,...,...,...,...,...,...,...,...,...
3364836,3364836,29671,-1,4571321,A,G,2024-06-06,1,1
3364837,3364837,29671,-1,4525231,A,T,2024-06-06,1,1
3364838,3364838,29671,-1,4622571,A,G,2024-06-06,1,1
3364839,3364839,29671,-1,4520611,A,T,2024-06-06,1,1


In [17]:
df_nodes

,sample_id,node_id,is_sample,is_recombinant,num_mutations,max_descendant_samples,date
0,ERR4085584,0,True,False,0,1,2024-06-06
1,ERR4086315,1,True,False,0,1,2024-06-06
2,ERR4091746,2,True,False,0,1,2024-06-06
3,ERR4165155,3,True,False,0,1,2024-06-06
4,ERR4204267,4,True,False,0,1,2024-06-06
...,...,...,...,...,...,...,...
5345014,SRR21886505,5345014,True,False,5,1,2024-06-06
5345015,node_763111,5345015,False,False,1,3,2024-06-04
5345016,SRR21635465,5345016,True,False,0,1,2024-06-06
5345017,node_763112,5345017,False,False,1,2,2024-06-05


In [18]:
df_samples = df_nodes[df_nodes.is_sample].set_index("sample_id")

Now load the full Viridian dataset of MAFFT alignments and metadata stored in [VCF Zarr format](https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giaf049/8154315). The sc2ts.Dataset object provides a wrapper for this and some handy access methods.

In [20]:
%%time
ds = sc2ts.Dataset("../data/viridian_mafft_2024-10-14_v1.vcz.zip")
ds

Dataset at ../data/viridian_mafft_2024-10-14_v1.vcz.zip with 4484157 samples, 29903 variants, and 30 metadata fields. See ds.metadata.field_descriptors() for a description of the fields.

In [21]:
ds.metadata.field_descriptors()

field,dtype,description
Artic_primer_version,object,"If known, the ARTIC primer scheme from ENA met..."
Collection_date,object,collection_date from ENA metadata
Country,object,Country from ENA metadata. If ':' was in the e...
Date_tree,object,"A consensus date, using up 3 sources of data f..."
Date_tree_order,object,This helped define the order in which the samp...
Experiment,object,experiment_accession from ENA metadata
First_created,object,first_created from the ENA metadata
Genbank_N,int16,Number of Ns in the GenBank consensus sequence...
Genbank_accession,object,This is the GenBank accession of the assembly
Genbank_other_runs,object,Any other run accessions associated with the G...


Join this to the samples dataframe so that we can check that the metadata lines up appropriately.

In [22]:
%%time
df_samples = df_samples.join(ds.metadata.as_dataframe(["Date_tree", "Viridian_pangolin_1.29"]))
df_samples

CPU times: user 6.52 s, sys: 432 ms, total: 6.95 s
Wall time: 6.98 s


sample_id,node_id,is_sample,is_recombinant,num_mutations,max_descendant_samples,date,Date_tree,Viridian_pangolin_1.29
ERR4085584,0,True,False,0,1,2024-06-06,2020-04-09,B
ERR4086315,1,True,False,0,1,2024-06-06,2020-04-03,B
ERR4091746,2,True,False,0,1,2024-06-06,2020-04-03,B
ERR4165155,3,True,False,0,1,2024-06-06,2020-02-26,B
ERR4204267,4,True,False,0,1,2024-06-06,2020-03-04,B
...,...,...,...,...,...,...,...,...
SRR20933483,5345012,True,False,1,1,2024-06-06,2022-07-11,BA.5.2.1
SRR21370322,5345013,True,False,1,1,2024-06-06,2022-08-14,BA.5.2.1
SRR21886505,5345014,True,False,5,1,2024-06-06,2022-09-21,BA.5.2.1
SRR21635465,5345016,True,False,0,1,2024-06-06,2022-09-06,BA.5.2.1
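
One way to confirm that a join like this lined up is to look for samples that were left without metadata. The sketch below uses small made-up dataframes (the sample IDs and values are illustrative, not taken from the dataset) and relies only on the default left-join behaviour of pandas `DataFrame.join`.

```python
import pandas as pd

# Toy stand-ins for df_samples and the dataset metadata (values are made up).
samples = pd.DataFrame(
    {"node_id": [0, 1, 2]},
    index=pd.Index(["S1", "S2", "S3"], name="sample_id"),
)
metadata = pd.DataFrame(
    {"Date_tree": ["2020-04-09", "2020-04-03"]},
    index=pd.Index(["S1", "S2"], name="sample_id"),
)

# DataFrame.join is a left join on the index by default, so samples
# without a metadata row end up with NaN in the joined columns.
joined = samples.join(metadata)
unmatched = joined.index[joined["Date_tree"].isna()]
print(list(unmatched))
```

An empty `unmatched` index on the real data would indicate that every sample node found a metadata row.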


## Dates

Note that the node times/dates on this tree are not well calibrated, and you should use the Date_tree field instead for sample dates.
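
As a minimal sketch of working with Date_tree rather than the node dates, the example below parses a few Date_tree values with pandas. The toy dataframe is illustrative, and it assumes the field holds ISO-formatted date strings like those shown in the joined table above.

```python
import pandas as pd

# Toy rows mimicking the joined samples table (sample IDs are made up).
toy = pd.DataFrame(
    {
        "date": ["2024-06-06", "2024-06-06", "2024-06-06"],
        "Date_tree": ["2020-04-09", "2020-02-26", "2022-09-21"],
    },
    index=pd.Index(["S1", "S2", "S3"], name="sample_id"),
)

# Parse Date_tree; errors="coerce" turns anything unparseable into NaT
# instead of raising, so bad values can be counted afterwards.
sample_dates = pd.to_datetime(toy["Date_tree"], errors="coerce")
earliest = sample_dates.min()
num_unparsed = int(sample_dates.isna().sum())
```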



## Check genotype encoding

In [24]:
assert np.all(ts.samples() == df_samples.node_id)

The sc2ts dataset uses a fixed encoding of alleles, such that "A" is always 0, "C" is always 1, and so on, and does not treat the reference allele specially (this is a departure from the usual convention in VCF Zarr, where allele 0 is the reference allele). The simplest way to compare against a tree sequence is therefore to make sure that the alleles are encoded in the same way when iterating over its variants. We can then compare the allelic states directly using numpy operations without translating back to strings, which is *very* slow in comparison.

In [25]:
sc2ts.IUPAC_ALLELES

'ACGT-RYSWKMBDHV.'
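
To illustrate why a fixed encoding makes comparison cheap, here is a small sketch that maps allele characters to their index in the string above and compares two encoded arrays with numpy. The `encode` helper is hypothetical (it is not part of sc2ts), and mapping "N" to -1 is an assumption chosen to mirror the missing-data value used in the comparison below.

```python
import numpy as np

# Same ordering as the sc2ts.IUPAC_ALLELES value shown above.
IUPAC_ALLELES = "ACGT-RYSWKMBDHV."


def encode(seq):
    """Hypothetical helper: map allele characters to fixed integer codes.

    Characters not in IUPAC_ALLELES (such as "N") are mapped to -1.
    """
    lookup = np.full(256, -1, dtype=np.int8)
    for code, allele in enumerate(IUPAC_ALLELES):
        lookup[ord(allele)] = code
    return lookup[np.frombuffer(seq.encode("ascii"), dtype=np.uint8)]


a = encode("ACGT-N")
b = encode("ACTT-N")
# With a shared encoding, differences are a single vectorised comparison.
n_diff = int(np.sum(a != b))
```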

In [26]:
from tqdm.notebook import tqdm

ds_variants = ds.variants(sample_id=df_samples.index.values, position=ts.sites_position.astype(int))
ts_variants = ts.variants(alleles=tuple("ACGT"))

per_site_diffs = []
for ds_var, ts_var in tqdm(zip(ds_variants, ts_variants), total=ts.num_sites):
    assert ts_var.alleles == tuple(ds_var.alleles[:4])
    missing = ds_var.genotypes == -1
    # In sc2ts.IUPAC_ALLELES the gap character "-" is at index 4 and the IUPAC
    # ambiguity codes follow it, so anything > 3 is not a plain nucleotide call.
    # We're not interested in gaps or ambiguous calls here.
    ambiguous = ds_var.genotypes > 3
    confident_calls = ~(missing | ambiguous)
    true_calls = ds_var.genotypes[confident_calls]
    stored = ts_var.genotypes[confident_calls]
    per_site_diffs.append(np.sum(true_calls != stored))

per_site_diffs = np.array(per_site_diffs)


In [29]:
non_zero = np.where(per_site_diffs != 0)
non_zero

(array([ 3988,  6072,  6074,  9383, 14343, 19908, 19923, 19926, 19927,
        19929, 19932, 19933, 19937, 19942, 19944, 19945, 20150, 20493,
        20495, 20496, 20498, 20500, 20502, 20952, 21059, 21085, 21145,
        22050, 22143, 23119, 24547, 24593, 25361, 25475, 25606, 25718,
        26049, 26816, 27231]),)

In [30]:
non_zero_pos = ts.sites_position[non_zero].astype(int)
non_zero_pos

array([ 4321,  6513,  6515, 10029, 15510, 21595, 21610, 21615, 21617,
       21621, 21624, 21627, 21632, 21637, 21639, 21641, 21846, 22195,
       22197, 22198, 22200, 22202, 22204, 22674, 22786, 22813, 22882,
       23854, 23948, 25000, 26530, 26577, 27384, 27507, 27638, 27752,
       28095, 28916, 29362])

In [31]:
non_zero_pos.shape

(39,)

In [32]:
ts.num_sites - non_zero_pos.shape[0]

27469

Restricting to confident nucleotide calls, the tree sequence matches the alignment exactly at 27469 of the 27508 sites.
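
The per-site comparison used to get these counts can be illustrated on toy arrays. The genotype values below are made up; the codes follow the fixed encoding (A=0, C=1, G=2, T=3, gap=4, ambiguity codes above 4, missing=-1).

```python
import numpy as np

# Made-up genotypes for one site across six samples.
alignment = np.array([0, 1, -1, 4, 6, 3])  # calls from the alignment
tree = np.array([0, 2, 1, 1, 1, 3])        # states stored in the tree

missing = alignment == -1
gap_or_ambiguous = alignment > 3
confident = ~(missing | gap_or_ambiguous)

# Only samples with a plain A/C/G/T call contribute to the difference count.
n_diffs = int(np.sum(alignment[confident] != tree[confident]))
```

Here samples 0, 1 and 5 have confident calls, and only sample 1 disagrees, so the site would be recorded with one difference.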

In [33]:
import collections
data = []
for var in ds.variants(position=non_zero_pos):
    # NB: append "N" as the last allele so that missing genotypes (-1) index to it
    alleles = np.append(var.alleles, ["N"])
    counter = collections.Counter(alleles[var.genotypes])
    data.append({"position": int(var.position), **{a:counter[a] for a in alleles}})


In [34]:
df_composition = pd.DataFrame(data, dtype=int)
df_composition

,position,A,C,G,T,-,R,Y,S,W,K,M,B,D,H,V,.,N
0,4321,555,3347068,5,1115563,4,0,2229,1,0,1,8,0,0,0,0,0,18723
1,6513,5019,2,3656397,442,772494,9,0,0,0,14,0,0,0,0,0,0,49780
2,6515,5131,57,4,3656612,772571,0,6,0,0,0,0,0,0,0,0,0,49776
3,10029,0,807092,0,3660160,0,0,111,0,0,0,0,0,0,0,0,0,16794
4,15510,3,3167,1,4130426,17,0,1094,0,1,1,0,0,0,0,0,0,349447
5,21595,42,4269963,181,165995,94,0,791,3,0,0,4,0,0,0,0,0,47084
6,21610,8,7007,20,4426897,331,0,17,0,1,0,0,0,0,0,0,0,49876
7,21615,8,7731,958,4424251,250,0,15,0,4,5,0,0,0,0,0,0,50935
8,21617,4424814,7621,208,25,257,20,0,0,4,0,54,0,0,0,0,0,51154
9,21621,10856,4408687,21,10786,1241,0,174,7,1,0,23,0,0,0,0,0,52361
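
The composition counts above rely on a small indexing trick: because missing genotypes are stored as -1, appending "N" as the last allele means that numpy's negative indexing maps them to "N". A minimal sketch with made-up genotypes and a shortened allele list:

```python
import collections

import numpy as np

alleles = np.array(list("ACGT"))
genotypes = np.array([0, 3, -1, 1, -1, 0])  # made-up values; -1 means missing

# "N" lands in the last slot, so index -1 selects it.
alleles_with_n = np.append(alleles, ["N"])
counts = collections.Counter(alleles_with_n[genotypes].tolist())
```

Alleles that never occur simply report a count of zero, since `Counter` returns 0 for absent keys.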


In [36]:
np.sum(df_composition["-"] > 500_000)

np.int64(6)

Of the 39 sites where we don't exactly reproduce the confident calls in the alignment, 6 have more than 500,000 samples called as a gap character. It seems likely that these are driven by differences in the alignment, or perhaps by other factors.
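
To see which positions these are rather than just how many, the same threshold can be used as a row filter. The sketch below uses only the `position` and `-` columns from the first three rows of the composition table shown above; on the full `df_composition` the same expression would list all the gap-heavy sites.

```python
import pandas as pd

# First three rows of the composition table, gap column only.
toy = pd.DataFrame(
    {"position": [4321, 6513, 6515], "-": [4, 772494, 772571]}
)

# Select the positions where more than 500,000 samples have a gap call.
gap_heavy = toy.loc[toy["-"] > 500_000, "position"].tolist()
print(gap_heavy)
```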