# Example of data processing using sc2ts, tskit and VCZ

This notebook provides the code for the section of the paper in which we lay out the advantages of the software ecosystem that we are using.

In [1]:
import pathlib
import collections

import sc2ts
import tszip
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

In [2]:
datadir = pathlib.Path("../data")

In [3]:
ds = sc2ts.Dataset(datadir / "viridian_mafft_2024-10-14_v1.vcz.zip")
ds

<sc2ts.dataset.Dataset at 0x7fbed9617880>

In [4]:
ts_sc2ts = tszip.load(datadir / "sc2ts_2023-02-21_intersection.trees.tsz")
ts_sc2ts

| Tree Sequence | |
|---|---|
| Trees | 348 |
| Sequence Length | 29 904 |
| Time Units | days |
| Sample Nodes | 1 226 587 |
| Total Size | 183.3 MiB |
| Metadata | dict  time_zero_date: 2023-02-21 |

| Table | Rows | Size | Has Metadata |
|---|---|---|---|
| Edges | 1 432 165 | 43.7 MiB | |
| Individuals | 0 | 24 Bytes | |
| Migrations | 0 | 8 Bytes | |
| Mutations | 1 838 918 | 64.9 MiB | |
| Nodes | 1 431 231 | 61.4 MiB | ✅ |
| Populations | 0 | 8 Bytes | |
| Provenances | 1 130 | 1.7 MiB | |
| Sites | 27 431 | 669.7 KiB | |

Provenance Timestamp,Software Name,Version,Command,Full record
"28 June, 2025 at 02:08:08 PM",tskit,0.6.4,delete_sites,Details  dict  schema_version: 1.0.0  software:  dict  name: tskit version: 0.6.4  parameters:  dict  command: delete_sites TODO: add parameters  environment:  dict  os:  dict  system: Linux node: holly release: 5.10.0-24-amd64 version: #1 SMP Debian 5.10.179-5 (2023-08-08) machine: x86_64  python:  dict  implementation: CPython version: 3.12.11  libraries:  dict  kastore:  dict  version: 2.1.1
"28 June, 2025 at 02:08:01 PM",tskit,0.6.4,simplify,Details  dict  schema_version: 1.0.0  software:  dict  name: tskit version: 0.6.4  parameters:  dict  command: simplify TODO: add simplify parameters  environment:  dict  os:  dict  system: Linux node: holly release: 5.10.0-24-amd64 version: #1 SMP Debian 5.10.179-5 (2023-08-08) machine: x86_64  python:  dict  implementation: CPython version: 3.12.11  libraries:  dict  kastore:  dict  version: 2.1.1
"11 June, 2025 at 09:11:21 PM",sc2ts,0.0.4.dev616+ga47b49d,minimise_metadata,Details  dict  schema_version: 1.0.0  software:  dict  name: sc2ts version: 0.0.4.dev616+ga47b49d  parameters:  dict  command: minimise_metadata  environment:  dict  os:  dict  system: Linux node: holly release: 5.10.0-24-amd64 version: #1 SMP Debian 5.10.179-5 (2023-08-08) machine: x86_64  python:  dict  implementation: CPython version: 3.12.11  libraries:  dict  kastore:  dict  version: 2.1.1  tskit:  dict  version: 0.6.4  tsinfer:  dict  version: 0.1.dev1455+g7522bcc  resources:  dict  elapsed_time: 136.87927651405334 user_time: 138.85999999999999 sys_time: 1.8800000000000001 max_memory: 4302036992
"19 January, 2025 at 04:26:26 AM",sc2ts,0.0.4.dev536+g33a72d6,extend,Details  dict  schema_version: 1.0.0  software:  dict  name: sc2ts version: 0.0.4.dev536+g33a72d6  parameters:  dict  command: extend dataset: /tmp/jk/viridian_2024-04- 29.alpha1.zarr.zip date: 2023-02-21 base_ts: results/v1-beta1/v1- beta1_2023-02-20.ts match_db: /tmp/jk/matches/v1- beta1.matches.db date_field: Date_tree  include_samples:  list  ERR5461562  ERR5469699  ERR5469807  ERR5486121  ERR5521603  ERR5531143  ERR5532096  ERR5532118  ERR5537492  ERR5676810  ERR5690893  ERR5690055  ERR5690921  ERR5695631  ERR5701881  ERR5690052  ERR5690920  SRR17041376  SRR17461792  num_mismatches: 4 hmm_cost_threshold: 7 min_group_size: 10 min_root_mutations: 2 min_different_dates: 3 max_mutations_per_sample: 5 max_recurrent_mutations: 2 deletions_as_missing: True max_daily_samples: None show_progress: True retrospective_window: 7 max_missing_sites: 500 random_seed: 42 num_threads: 80 memory_limit: 250  environment:  dict  os:  dict  system: Linux node: epyc000.hpc.in.bmrc.ox.ac.uk release: 4.18.0-553.16.1.el8_10.x86_64 version: #1 SMP Thu Aug 8 17:47:08 UTC 2024 machine: x86_64  python:  dict  implementation: CPython version: 3.11.3  libraries:  dict  kastore:  dict  version: 2.1.1  tskit:  dict  version: 0.6.0  tsinfer:  dict  version: 0.1.dev1455+g7522bcc  resources:  dict  elapsed_time: 23.96233034133911 user_time: 20.23 sys_time: 3.54 max_memory: 14884753408
"19 January, 2025 at 04:26:00 AM",sc2ts,0.0.4.dev536+g33a72d6,extend,Details  dict  schema_version: 1.0.0  software:  dict  name: sc2ts version: 0.0.4.dev536+g33a72d6  parameters:  dict  command: extend dataset: /tmp/jk/viridian_2024-04- 29.alpha1.zarr.zip date: 2023-02-20 base_ts: results/v1-beta1/v1- beta1_2023-02-19.ts match_db: /tmp/jk/matches/v1- beta1.matches.db date_field: Date_tree  include_samples:  list  ERR5461562  ERR5469699  ERR5469807  ERR5486121  ERR5521603  ERR5531143  ERR5532096  ERR5532118  ERR5537492  ERR5676810  ERR5690893  ERR5690055  ERR5690921  ERR5695631  ERR5701881  ERR5690052  ERR5690920  SRR17041376  SRR17461792  num_mismatches: 4 hmm_cost_threshold: 7 min_group_size: 10 min_root_mutations: 2 min_different_dates: 3 max_mutations_per_sample: 5 max_recurrent_mutations: 2 deletions_as_missing: True max_daily_samples: None show_progress: True retrospective_window: 7 max_missing_sites: 500 random_seed: 42 num_threads: 80 memory_limit: 250  environment:  dict  os:  dict  system: Linux node: epyc000.hpc.in.bmrc.ox.ac.uk release: 4.18.0-553.16.1.el8_10.x86_64 version: #1 SMP Thu Aug 8 17:47:08 UTC 2024 machine: x86_64  python:  dict  implementation: CPython version: 3.11.3  libraries:  dict  kastore:  dict  version: 2.1.1  tskit:  dict  version: 0.6.0  tsinfer:  dict  version: 0.1.dev1455+g7522bcc  resources:  dict  elapsed_time: 241.07852435112 user_time: 1551.5 sys_time: 15.07 max_memory: 34759606272
"19 January, 2025 at 04:21:58 AM",sc2ts,0.0.4.dev536+g33a72d6,extend,Details  dict  schema_version: 1.0.0  software:  dict  name: sc2ts version: 0.0.4.dev536+g33a72d6  parameters:  dict  command: extend dataset: /tmp/jk/viridian_2024-04- 29.alpha1.zarr.zip date: 2023-02-19 base_ts: results/v1-beta1/v1- beta1_2023-02-18.ts match_db: /tmp/jk/matches/v1- beta1.matches.db date_field: Date_tree  include_samples:  list  ERR5461562  ERR5469699  ERR5469807  ERR5486121  ERR5521603  ERR5531143  ERR5532096  ERR5532118  ERR5537492  ERR5676810  ERR5690893  ERR5690055  ERR5690921  ERR5695631  ERR5701881  ERR5690052  ERR5690920  SRR17041376  SRR17461792  num_mismatches: 4 hmm_cost_threshold: 7 min_group_size: 10 min_root_mutations: 2 min_different_dates: 3 max_mutations_per_sample: 5 max_recurrent_mutations: 2 deletions_as_missing: True max_daily_samples: None show_progress: True retrospective_window: 7 max_missing_sites: 500 random_seed: 42 num_threads: 80 memory_limit: 250  environment:  dict  os:  dict  system: Linux node: epyc000.hpc.in.bmrc.ox.ac.uk release: 4.18.0-553.16.1.el8_10.x86_64 version: #1 SMP Thu Aug 8 17:47:08 UTC 2024 machine: x86_64  python:  dict  implementation: CPython version: 3.11.3  libraries:  dict  kastore:  dict  version: 2.1.1  tskit:  dict  version: 0.6.0  tsinfer:  dict  version: 0.1.dev1455+g7522bcc  resources:  dict  elapsed_time: 148.19285607337952 user_time: 373.21 sys_time: 7.95 max_memory: 15546343424
"19 January, 2025 at 04:19:28 AM",sc2ts,0.0.4.dev536+g33a72d6,extend,Details  dict  schema_version: 1.0.0  software:  dict  name: sc2ts version: 0.0.4.dev536+g33a72d6  parameters:  dict  command: extend dataset: /tmp/jk/viridian_2024-04- 29.alpha1.zarr.zip date: 2023-02-18 base_ts: results/v1-beta1/v1- beta1_2023-02-17.ts match_db: /tmp/jk/matches/v1- beta1.matches.db date_field: Date_tree  include_samples:  list  ERR5461562  ERR5469699  ERR5469807  ERR5486121  ERR5521603  ERR5531143  ERR5532096  ERR5532118  ERR5537492  ERR5676810  ERR5690893  ERR5690055  ERR5690921  ERR5695631  ERR5701881  ERR5690052  ERR5690920  SRR17041376  SRR17461792  num_mismatches: 4 hmm_cost_threshold: 7 min_group_size: 10 min_root_mutations: 2 min_different_dates: 3 max_mutations_per_sample: 5 max_recurrent_mutations: 2 deletions_as_missing: True max_daily_samples: None show_progress: True retrospective_window: 7 max_missing_sites: 500 random_seed: 42 num_threads: 80 memory_limit: 250  environment:  dict  os:  dict  system: Linux node: epyc000.hpc.in.bmrc.ox.ac.uk release: 4.18.0-553.16.1.el8_10.x86_64 version: #1 SMP Thu Aug 8 17:47:08 UTC 2024 machine: x86_64  python:  dict  implementation: CPython version: 3.11.3  libraries:  dict  kastore:  dict  version: 2.1.1  tskit:  dict  version: 0.6.0  tsinfer:  dict  version: 0.1.dev1455+g7522bcc  resources:  dict  elapsed_time: 276.11299777030945 user_time: 2794.96 sys_time: 21.41 max_memory: 32186015744
"19 January, 2025 at 04:14:51 AM",sc2ts,0.0.4.dev536+g33a72d6,extend,Details  dict  schema_version: 1.0.0  software:  dict  name: sc2ts version: 0.0.4.dev536+g33a72d6  parameters:  dict  command: extend dataset: /tmp/jk/viridian_2024-04- 29.alpha1.zarr.zip date: 2023-02-17 base_ts: results/v1-beta1/v1- beta1_2023-02-16.ts match_db: /tmp/jk/matches/v1- beta1.matches.db date_field: Date_tree  include_samples:  list  ERR5461562  ERR5469699  ERR5469807  ERR5486121  ERR5521603  ERR5531143  ERR5532096  ERR5532118  ERR5537492  ERR5676810  ERR5690893  ERR5690055  ERR5690921  ERR5695631  ERR5701881  ERR5690052  ERR5690920  SRR17041376  SRR17461792  num_mismatches: 4 hmm_cost_threshold: 7 min_group_size: 10 min_root_mutations: 2 min_different_dates: 3 max_mutations_per_sample: 5 max_recurrent_mutations: 2 deletions_as_missing: True max_daily_samples: None show_progress: True retrospective_window: 7 max_missing_sites: 500 random_seed: 42 num_threads: 80 memory_limit: 250  environment:  dict  os:  dict  system: Linux node: epyc000.hpc.in.bmrc.ox.ac.uk release: 4.18.0-553.16.1.el8_10.x86_64 version: #1 SMP Thu Aug 8 17:47:08 UTC 2024 machine: x86_64  python:  dict  implementation: CPython version: 3.11.3  libraries:  dict  kastore:  dict  version: 2.1.1  tskit:  dict  version: 0.6.0  tsinfer:  dict  version: 0.1.dev1455+g7522bcc  resources:  dict  elapsed_time: 374.88766050338745 user_time: 9851.05 sys_time: 49.94 max_memory: 73610461184
"19 January, 2025 at 04:08:34 AM",sc2ts,0.0.4.dev536+g33a72d6,extend,Details  dict  schema_version: 1.0.0  software:  dict  name: sc2ts version: 0.0.4.dev536+g33a72d6  parameters:  dict  command: extend dataset: /tmp/jk/viridian_2024-04- 29.alpha1.zarr.zip date: 2023-02-16 base_ts: results/v1-beta1/v1- beta1_2023-02-15.ts match_db: /tmp/jk/matches/v1- beta1.matches.db date_field: Date_tree  include_samples:  list  ERR5461562  ERR5469699  ERR5469807  ERR5486121  ERR5521603  ERR5531143  ERR5532096  ERR5532118  ERR5537492  ERR5676810  ERR5690893  ERR5690055  ERR5690921  ERR5695631  ERR5701881  ERR5690052  ERR5690920  SRR17041376  SRR17461792  num_mismatches: 4 hmm_cost_threshold: 7 min_group_size: 10 min_root_mutations: 2 min_different_dates: 3 max_mutations_per_sample: 5 max_recurrent_mutations: 2 deletions_as_missing: True max_daily_samples: None show_progress: True retrospective_window: 7 max_missing_sites: 500 random_seed: 42 num_threads: 80 memory_limit: 250  environment:  dict  os:  dict  system: Linux node: epyc000.hpc.in.bmrc.ox.ac.uk release: 4.18.0-553.16.1.el8_10.x86_64 version: #1 SMP Thu Aug 8 17:47:08 UTC 2024 machine: x86_64  python:  dict  implementation: CPython version: 3.11.3  libraries:  dict  kastore:  dict  version: 2.1.1  tskit:  dict  version: 0.6.0  tsinfer:  dict  version: 0.1.dev1455+g7522bcc  resources:  dict  elapsed_time: 417.37725377082825 user_time: 15233.05 sys_time: 70.1 max_memory: 111090552832
"19 January, 2025 at 04:01:35 AM",sc2ts,0.0.4.dev536+g33a72d6,extend,Details  dict  schema_version: 1.0.0  software:  dict  name: sc2ts version: 0.0.4.dev536+g33a72d6  parameters:  dict  command: extend dataset: /tmp/jk/viridian_2024-04- 29.alpha1.zarr.zip date: 2023-02-15 base_ts: results/v1-beta1/v1- beta1_2023-02-14.ts match_db: /tmp/jk/matches/v1- beta1.matches.db date_field: Date_tree  include_samples:  list  ERR5461562  ERR5469699  ERR5469807  ERR5486121  ERR5521603  ERR5531143  ERR5532096  ERR5532118  ERR5537492  ERR5676810  ERR5690893  ERR5690055  ERR5690921  ERR5695631  ERR5701881  ERR5690052  ERR5690920  SRR17041376  SRR17461792  num_mismatches: 4 hmm_cost_threshold: 7 min_group_size: 10 min_root_mutations: 2 min_different_dates: 3 max_mutations_per_sample: 5 max_recurrent_mutations: 2 deletions_as_missing: True max_daily_samples: None show_progress: True retrospective_window: 7 max_missing_sites: 500 random_seed: 42 num_threads: 80 memory_limit: 250  environment:  dict  os:  dict  system: Linux node: epyc000.hpc.in.bmrc.ox.ac.uk release: 4.18.0-553.16.1.el8_10.x86_64 version: #1 SMP Thu Aug 8 17:47:08 UTC 2024 machine: x86_64  python:  dict  implementation: CPython version: 3.11.3  libraries:  dict  kastore:  dict  version: 2.1.1  tskit:  dict  version: 0.6.0  tsinfer:  dict  version: 0.1.dev1455+g7522bcc  resources:  dict  elapsed_time: 462.40532898902893 user_time: 14538.21 sys_time: 69.35 max_memory: 144653651968


In [9]:
ts_usher = tszip.load(datadir / "usher_2023-02-21_intersection_ds_di.trees.tsz")
ts_usher

| Tree Sequence | |
|---|---|
| Trees | 1 |
| Sequence Length | 29 904 |
| Time Units | days |
| Sample Nodes | 1 226 587 |
| Total Size | 203.1 MiB |
| Metadata | dict  sc2ts:  dict  date: 2023-02-20 |

| Table | Rows | Size | Has Metadata |
|---|---|---|---|
| Edges | 1 620 986 | 49.5 MiB | |
| Individuals | 0 | 24 Bytes | |
| Migrations | 0 | 8 Bytes | |
| Mutations | 1 838 186 | 64.9 MiB | |
| Nodes | 1 620 987 | 75.7 MiB | ✅ |
| Populations | 0 | 8 Bytes | |
| Provenances | 6 | 3.4 KiB | |
| Sites | 27 431 | 669.7 KiB | |

Provenance Timestamp,Software Name,Version,Command,Full record
"02 July, 2025 at 06:06:39 PM",tsdate,0.2.3,variational_gamma,Details  dict  schema_version: 1.0.0  software:  dict  name: tsdate version: 0.2.3  parameters:  dict  mutation_rate: 1.5310839439999822e-06 recombination_rate: None time_units: days progress: True population_size: None eps: 1e-10 max_iterations: 1 max_shape: 1000 rescaling_intervals: 0 rescaling_iterations: 5 match_segregating_sites: False regularise_roots: True singletons_phased: True command: variational_gamma  environment:  dict  os:  dict  system: Linux node: holly release: 5.10.0-24-amd64 version: #1 SMP Debian 5.10.179-5 (2023-08-08) machine: x86_64  python:  dict  implementation: CPython version: 3.12.11  libraries:  dict  tskit:  dict  version: 0.6.4  resources:  dict  elapsed_time: 1302.554074048996 user_time: 1346.3999999999999 sys_time: 4.3999999999999995 max_memory: 2308784128
"02 July, 2025 at 05:41:41 PM",tskit,0.6.4,simplify,Details  dict  schema_version: 1.0.0  software:  dict  name: tskit version: 0.6.4  parameters:  dict  command: simplify TODO: add simplify parameters  environment:  dict  os:  dict  system: Linux node: holly release: 5.10.0-24-amd64 version: #1 SMP Debian 5.10.179-5 (2023-08-08) machine: x86_64  python:  dict  implementation: CPython version: 3.12.11  libraries:  dict  kastore:  dict  version: 2.1.1
"02 July, 2025 at 03:31:16 PM",tskit,0.6.4,delete_sites,Details  dict  schema_version: 1.0.0  software:  dict  name: tskit version: 0.6.4  parameters:  dict  command: delete_sites TODO: add parameters  environment:  dict  os:  dict  system: Linux node: holly release: 5.10.0-24-amd64 version: #1 SMP Debian 5.10.179-5 (2023-08-08) machine: x86_64  python:  dict  implementation: CPython version: 3.12.11  libraries:  dict  kastore:  dict  version: 2.1.1
"02 July, 2025 at 03:31:15 PM",tskit,0.6.4,simplify,Details  dict  schema_version: 1.0.0  software:  dict  name: tskit version: 0.6.4  parameters:  dict  command: simplify TODO: add simplify parameters  environment:  dict  os:  dict  system: Linux node: holly release: 5.10.0-24-amd64 version: #1 SMP Debian 5.10.179-5 (2023-08-08) machine: x86_64  python:  dict  implementation: CPython version: 3.12.11  libraries:  dict  kastore:  dict  version: 2.1.1
"28 June, 2025 at 02:07:36 PM",sc2ts,0.0.4.dev644+gf671a5e,minimise_metadata,Details  dict  schema_version: 1.0.0  software:  dict  name: sc2ts version: 0.0.4.dev644+gf671a5e  parameters:  dict  command: minimise_metadata  field_mapping:  dict  strain: sample_id Date_tree: Date_tree  environment:  dict  os:  dict  system: Linux node: holly release: 5.10.0-24-amd64 version: #1 SMP Debian 5.10.179-5 (2023-08-08) machine: x86_64  python:  dict  implementation: CPython version: 3.12.11  libraries:  dict  kastore:  dict  version: 2.1.1  tskit:  dict  version: 0.6.4  tsinfer:  dict  version: 0.1.dev1455+g7522bcc  resources:  dict  elapsed_time: 436.17398166656494 user_time: 436.1 sys_time: 3.71 max_memory: 5781262336
"28 June, 2025 at 01:54:10 PM",tskit,0.6.4,delete_sites,Details  dict  schema_version: 1.0.0  software:  dict  name: tskit version: 0.6.4  parameters:  dict  command: delete_sites TODO: add parameters  environment:  dict  os:  dict  system: Linux node: holly release: 5.10.0-24-amd64 version: #1 SMP Debian 5.10.179-5 (2023-08-08) machine: x86_64  python:  dict  implementation: CPython version: 3.12.11  libraries:  dict  kastore:  dict  version: 2.1.1


In [10]:
%%time 
dfn_sc2ts = sc2ts.node_data(ts_sc2ts)

CPU times: user 457 ms, sys: 24 ms, total: 481 ms
Wall time: 479 ms


In [11]:
%%time 
dfn_usher = sc2ts.node_data(ts_usher)

CPU times: user 554 ms, sys: 64.2 ms, total: 618 ms
Wall time: 616 ms


Make sure that the two tree sequences we're comparing have the same set of samples and reflect the same set of sequences.

In [12]:
assert np.array_equal(ts_sc2ts.samples(), ts_usher.samples())
assert np.array_equal(dfn_sc2ts[dfn_sc2ts.is_sample].node_id.values, ts_sc2ts.samples())
assert np.array_equal(dfn_usher[dfn_usher.is_sample].node_id.values, ts_usher.samples())
assert np.array_equal(dfn_usher[dfn_usher.is_sample].sample_id.values, dfn_sc2ts[dfn_sc2ts.is_sample].sample_id.values)
assert np.array_equal(ts_sc2ts.sites_position, ts_usher.sites_position)

A key element of processing data efficiently in tskit and VCZ is to use numpy arrays of integers to represent allelic states, instead of the classical approach of using strings, etc. In sc2ts, alleles are given fixed integer representations, such that A=0, C=1, G=2, and T=3. So, to represent the DNA string "AACTG" we would use the numpy array ``[0, 0, 1, 3, 2]`` instead. This has many advantages and makes it much easier to write efficient code. 
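As a minimal sketch of this encoding, a DNA string can be converted to an integer array with a lookup table, after which per-base questions become vectorised NumPy operations. The `ALLELES` string and the `encode`/`decode` helpers are hypothetical names introduced here for illustration; they are not part of sc2ts or tskit.

```python
import numpy as np

# Hypothetical local constant: the index of a base in this string is its integer code.
ALLELES = "ACGT"


def encode(seq):
    """Map a DNA string to an int8 array; unknown characters become -1."""
    lookup = np.full(256, -1, dtype=np.int8)
    for code, base in enumerate(ALLELES):
        lookup[ord(base)] = code
    raw = np.frombuffer(seq.encode("ascii"), dtype=np.uint8)
    return lookup[raw]


def decode(codes):
    """Map an integer array back to a DNA string."""
    return "".join(ALLELES[int(c)] for c in codes)


encoded = encode("AACTG")
round_trip = decode(encoded)
# Counting a base is a single vectorised comparison rather than a string scan.
num_a = int(np.sum(encoded == 0))
print(encoded.tolist(), round_trip, num_a)
```

The round trip through `decode` is the kind of translation step that has to be kept in mind when inspecting integer-encoded data.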

The drawback of this is that it's not as easy to inspect and debug, and we must always be aware of the translation required. 

In this analysis we're interested in how well sc2ts and Usher do at imputing ambiguous bases, and want to count how many times the bases that they impute for samples are compatible with the ambiguity codes.

According to https://www.bioinformatics.org/sms/iupac.html the IUPAC ambiguity codes are as follows:

```
R	G or A	puRine
Y	T or C	pYrimidine
M	A or C	aMino
K	G or T	Keto
S	G or C	Strong interaction (3 H bonds)
W	A or T	Weak interaction (2 H bonds)
H	A or C or T	not-G, H follows G in the alphabet
B	G or T or C	not-A, B follows A
V	G or C or A	not-T (not-U), V follows U
D	G or A or T	not-C, D follows C
```

So, we build up a mapping of each ambiguity code to its compatible bases:

In [13]:
compatible = {
    "R": ["G", "A"],
    "Y": ["T", "C"],
    "M": ["A", "C"],
    "K": ["G", "T"],
    "S": ["G", "C"],
    "W": ["A", "T"],
    "H": ["A", "C", "T"],
    "B": ["G", "T", "C"],
    "V": ["G", "C", "A"],
    "D": ["G", "A", "T"],
}

The mapping from alleles to integers is managed by the sc2ts.IUPAC_ALLELES value, and so we build up the corresponding mapping in integer space:

In [19]:
sc2ts.IUPAC_ALLELES

'ACGT-RYSWKMBDHV.'

In [14]:
compatible_encoded = {}
for ambiguity_code, compatible_bases in compatible.items():
    compatible_bases_encoded = [sc2ts.IUPAC_ALLELES.index(base) for base in compatible_bases]
    compatible_encoded[sc2ts.IUPAC_ALLELES.index(ambiguity_code)] = compatible_bases_encoded
compatible_encoded

{5: [2, 0],
 6: [3, 1],
 10: [0, 1],
 9: [2, 3],
 7: [2, 1],
 8: [0, 3],
 13: [0, 1, 3],
 11: [2, 3, 1],
 14: [2, 1, 0],
 12: [2, 0, 3]}
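
To illustrate how an integer mapping of this kind can be used, the following toy sketch counts how many imputed bases are compatible with the observed ambiguity codes. The two mapping entries (R=5 and Y=6) are copied from the output above; the `observed` and `imputed` arrays are made-up data, and `np.isin` is used here as one possible way to do the membership test.

```python
import numpy as np

# Subset of the encoded mapping shown above: R (5) -> G/A, Y (6) -> T/C.
compatible_subset = {5: [2, 0], 6: [3, 1]}

# Made-up genotypes for five samples at a single site.
observed = np.array([5, 0, 6, 5, 6], dtype=np.int8)  # calls in the alignment
imputed = np.array([2, 0, 1, 3, 2], dtype=np.int8)   # bases from a tree sequence

total_ambiguous = 0
correctly_imputed = 0
for ambiguity_code, compatible_bases in compatible_subset.items():
    # Indexes of the samples carrying this ambiguity code.
    samples = np.where(observed == ambiguity_code)[0]
    total_ambiguous += samples.shape[0]
    # Count imputed bases that fall within the compatible set.
    correctly_imputed += int(np.isin(imputed[samples], compatible_bases).sum())

print(total_ambiguous, correctly_imputed)
```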

A very important aspect of this data encoding is how missing data is handled. Missing calls (Ns in the alignments) are encoded as -1. In the code below we compare against the literal -1 directly rather than hiding it behind a named constant, so that the handling of missing data stays obvious.
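
As a small sketch of working with the -1 missing-data value, the toy example below finds the samples with missing calls and counts how often two imputations disagree at them. All three arrays are made-up data for illustration.

```python
import numpy as np

# Made-up genotypes for six samples at one site; -1 marks a missing call.
observed = np.array([0, -1, 2, -1, 3, -1], dtype=np.int8)
imputed_a = np.array([0, 1, 2, 3, 3, 0], dtype=np.int8)
imputed_b = np.array([0, 1, 2, 2, 3, 0], dtype=np.int8)

# Comparing against the literal -1 keeps the missing-data handling explicit.
missing_samples = np.where(observed == -1)[0]
imputed_differently = int(
    np.sum(imputed_a[missing_samples] != imputed_b[missing_samples])
)

print(missing_samples.tolist(), imputed_differently)
```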

In [16]:
# Select the tree sequence samples and site positions from the alignment dataset
sample_id = dfn_usher[dfn_usher.is_sample].sample_id.values
ds_variants = ds.variants(sample_id=sample_id, position=ts_sc2ts.sites_position.astype(int))
# Decode genotypes with the same allele ordering so integer codes are comparable
usher_variants = ts_usher.variants(alleles=tuple(sc2ts.IUPAC_ALLELES))
sc2ts_variants = ts_sc2ts.variants(alleles=tuple(sc2ts.IUPAC_ALLELES))

# Co-iterate over the three sources, one site at a time
iterator = tqdm(zip(ds_variants, usher_variants, sc2ts_variants), total=ts_usher.num_sites)
data = []
for var_ds, var_usher, var_sc2ts in iterator:
    usher_correctly_imputed = 0
    sc2ts_correctly_imputed = 0
    total_ambiguous = 0
    for ambiguity_code, compatible_bases in compatible_encoded.items():
        # Samples whose alignment call at this site is this ambiguity code
        samples = np.where(var_ds.genotypes == ambiguity_code)[0]
        total_ambiguous += samples.shape[0]
        imputed = collections.Counter(var_usher.genotypes[samples])
        usher_correctly_imputed += sum(imputed[base] for base in compatible_bases)
        imputed = collections.Counter(var_sc2ts.genotypes[samples])
        sc2ts_correctly_imputed += sum(imputed[base] for base in compatible_bases)
    # NOTE: -1 means missing data ("N")
    missing_samples = np.where(var_ds.genotypes == -1)[0]
    imputed_differently = np.sum(
        var_usher.genotypes[missing_samples] != var_sc2ts.genotypes[missing_samples])

    data.append({"position": int(var_ds.position),
                 "total_ambiguous": total_ambiguous,
                 "usher_correctly_imputed": usher_correctly_imputed,
                 "sc2ts_correctly_imputed": sc2ts_correctly_imputed,
                 "total_missing": missing_samples.shape[0],
                 "imputed_differently": imputed_differently
                })



In [17]:
df = pd.DataFrame(data)
df

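| | position | total_ambiguous | usher_correctly_imputed | sc2ts_correctly_imputed | total_missing | imputed_differently |
|---|---|---|---|---|---|---|
| 0 | 267 | 1 | 1 | 1 | 10131 | 0 |
| 1 | 269 | 13 | 13 | 13 | 10099 | 0 |
| 2 | 270 | 9 | 9 | 9 | 10187 | 1 |
| 3 | 271 | 3 | 3 | 3 | 10184 | 0 |
| 4 | 272 | 5 | 5 | 5 | 10187 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 27426 | 29667 | 4 | 4 | 4 | 1122 | 0 |
| 27427 | 29668 | 22 | 22 | 22 | 993 | 0 |
| 27428 | 29669 | 8 | 8 | 8 | 997 | 0 |
| 27429 | 29670 | 28 | 28 | 28 | 1004 | 2 |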


In [18]:
df["usher_incorrectly_imputed"] = df["total_ambiguous"] - df["usher_correctly_imputed"]
df["sc2ts_incorrectly_imputed"] = df["total_ambiguous"] - df["sc2ts_correctly_imputed"]
df.sum()

position                     407703842
total_ambiguous                 413303
usher_correctly_imputed         413299
sc2ts_correctly_imputed         413226
total_missing                 80574518
imputed_differently              56782
usher_incorrectly_imputed            4
sc2ts_incorrectly_imputed           77
dtype: int64

In [24]:
77 / 413299 * 100

0.018630579798160653

In [17]:
(df.imputed_differently.sum() / df.total_missing.sum()) * 100

np.float64(0.07047141132138078)

In [22]:
(ts_sc2ts.num_samples * ts_sc2ts.num_sites) / 1024**3

31.335752454586327

The total aligned dataset for 1,226,587 samples and 27,431 sites shared by both the sc2ts and Usher tree sequences represents 31.34 GiB of nucleotide calls. Of the 80,574,518 missing data calls (Ns) in the alignments, sc2ts and Usher disagreed in their imputed values for 56,782 (0.07%). Additionally, 413,303 calls made use of the IUPAC ambiguity codes. Of these, sc2ts imputed 77 (0.02%) incorrectly (i.e., with a base that is not compatible with the ambiguity code). Remarkably, Usher imputed only 4 calls from this set incorrectly.
Performing this computation by co-iterating over all sites in the source alignments and the sc2ts and Usher tree sequences required only 3:20 seconds on a (computer), with the core logic using NumPy and pandas requiring around 30 lines of Python code.
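
The headline figures in this summary can be rechecked from the totals reported by `df.sum()` above. The sketch below hard-codes those totals (the variable names are introduced here for illustration) and assumes one byte per nucleotide call when computing the dataset size, as the cell computing `num_samples * num_sites` above does.

```python
# Totals copied from the outputs shown earlier in this notebook.
num_samples = 1_226_587
num_sites = 27_431
total_missing = 80_574_518
imputed_differently = 56_782
total_ambiguous = 413_303
sc2ts_incorrect = 77
usher_incorrect = 4

# Dataset size, assuming one byte per nucleotide call.
size_gib = num_samples * num_sites / 1024**3

# Percentage of missing calls where the two imputations disagree.
pct_differently = imputed_differently / total_missing * 100

# Percentage of ambiguous calls imputed with an incompatible base.
pct_sc2ts_incorrect = sc2ts_incorrect / total_ambiguous * 100
pct_usher_incorrect = usher_incorrect / total_ambiguous * 100

print(f"{size_gib:.2f} GiB")
print(f"{pct_differently:.2f}% imputed differently")
print(f"sc2ts: {pct_sc2ts_incorrect:.2f}%, Usher: {pct_usher_incorrect:.3f}%")
```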