# Analysing the sc2ts ARG using tskit, numpy and pandas

This notebook shows examples of analysing the 2.48 million sample sc2ts ARG using the [tskit Python API](https://tskit.dev/tskit/docs/), in conjunction with Pandas and with an emphasis on vectorised approaches using NumPy. The aim is to showcase the ease and efficiency with which pandemic-scale analyses can be done directly within a notebook environment using only Python, and to provide a starting point for further analyses. We focus on the core computational tasks one might like to perform on a pandemic-scale ARG, with the minimal set of external dependencies required to do this.

Please see the tskit [documentation](https://tskit.dev/tskit/docs/stable/) for detailed information on the Python APIs, and the [tskit tutorials](https://tskit.dev/tutorials/intro.html) site for an introduction to tskit and help with various tasks such as visualisation. The [preprint](https://www.biorxiv.org/content/10.1101/2023.06.08.544212) gives background and further details on all aspects of sc2ts and the ARG.

In [1]:
import tszip
import sc2ts
import numpy as np

First, we load the ARG into memory from the compressed representation using [tszip](https://tskit.dev/software/tszip.html), which typically takes less than a second. The object we get back is a [tskit TreeSequence](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence), which we call ``ts`` by convention.

In [2]:
ts = tszip.load("../data/sc2ts_viridian_v1.2.trees.tsz")

When viewed in a notebook environment, the TreeSequence gives some summary information about its contents:

In [3]:
ts

Tree Sequence,
Trees,316
Sequence Length,29 904
Time Units,days
Sample Nodes,2 482 157
Total Size,408.1 MiB
Metadata,dict  time_zero_date: 2023-02-21

Table,Rows,Size,Has Metadata
Edges,2 748 838,83.9 MiB,
Individuals,0,24 Bytes,
Migrations,0,8 Bytes,
Mutations,2 285 344,80.6 MiB,
Nodes,2 747 985,220.1 MiB,✅
Populations,0,8 Bytes,
Provenances,1 135,1.7 MiB,
Sites,29 893,729.8 KiB,

Provenance Timestamp,Software Name,Version,Command
"11 September, 2025 at 09:02:43 PM",sc2ts,0.0.4.dev716+g109ed6a6a,minimise_metadata
"11 September, 2025 at 08:57:47 PM",tsdate,0.2.3,variational_gamma
"11 September, 2025 at 08:27:51 PM",Unknown,Unknown,Unknown
"11 September, 2025 at 12:40:10 PM",sc2ts,0.0.4.dev716+g109ed6a6a,apply_node_parsimony_heuristics
"11 September, 2025 at 11:38:48 AM",sc2ts,0.0.4.dev716+g109ed6a6a,map_parsimony
"11 September, 2025 at 11:33:56 AM",sc2ts,0.0.4.dev716+g109ed6a6a,push_up_unary_recombinant_mutations
"11 September, 2025 at 11:33:50 AM",sc2ts,0.0.4.dev716+g109ed6a6a,append_exact_matches
"20 August, 2025 at 09:18:38 AM",sc2ts,0.0.4.dev712+gb7a68b468,push_up_unary_recombinant_mutations
"19 January, 2025 at 04:26:26 AM",sc2ts,0.0.4.dev536+g33a72d6,extend
"19 January, 2025 at 04:26:00 AM",sc2ts,0.0.4.dev536+g33a72d6,extend


## Information on nodes

The [sc2ts Python API](https://tskit.dev/sc2ts/docs/stable/api.html) provides a few utility functions that take advantage of the specific structure of the sc2ts ARG and its metadata. The ``node_data`` function returns a Pandas DataFrame summarising the nodes in the ARG and, in particular, makes the key ``sample_id``, ``pango`` and ``scorpio`` metadata fields efficiently accessible.

In [4]:
dfn = sc2ts.node_data(ts)
dfn

,pango,sample_id,scorpio,node_id,is_sample,is_recombinant,num_mutations,date
0,B,Vestigial_ignore,.,0,False,False,0,2019-11-23
1,B,Wuhan/Hu-1/2019,.,1,False,False,0,2019-12-26
2,A,SRR11772659,.,2,True,False,1,2020-01-19
3,B,SRR11397727,.,3,True,False,0,2020-01-24
4,B,SRR11397730,.,4,True,False,0,2020-01-24
...,...,...,...,...,...,...,...,...
2747980,BA.2.1,,Omicron (BA.2-like),2747980,False,False,1,2022-01-03
2747981,CH.1.1,,Omicron (BA.2-like),2747981,False,False,1,2022-09-28
2747982,BA.5.1,,Omicron (BA.5-like),2747982,False,False,2,2022-05-31
2747983,BU.1,,Omicron (BA.5-like),2747983,False,False,4,2022-06-13


Pandas provides many powerful functions for summarising a table like this. For example, ``value_counts`` gives a breakdown of the number of nodes by their Scorpio assignment:

In [5]:
dfn["scorpio"].value_counts()

scorpio
Delta (B.1.617.2-like)           583392
Delta (AY.4-like)                554731
Omicron (BA.1-like)              371937
Omicron (BA.2-like)              366040
Alpha (B.1.1.7-like)             317036
Omicron (BA.5-like)              194121
.                                192203
Delta (AY.4.2-like)               92668
Omicron (BA.4-like)               27625
Iota (B.1.526-like)               11026
Epsilon (B.1.429-like)             8526
Omicron (XBB.1.5-like)             6020
Gamma (P.1-like)                   5008
Epsilon (B.1.427-like)             4522
Beta (B.1.351-like)                3027
Mu (B.1.621-like)                  1484
Delta (B.1.617.2-like) +K417N      1424
Omicron (XE-like)                  1177
B.1.1.318-like                     1118
Omicron (XBB.1-like)               1037
Eta (B.1.525-like)                  893
Omicron (XBB-like)                  785
B.1.617.1-like                      495
Omicron (Unassigned)                357
Zeta (P.2-like)                 

We can also use the dataframe to identify specific nodes. Suppose we want to find the first Alpha node. To do this, we simply need to find the oldest node with the corresponding Pango label (B.1.1.7).

**NOTE** The dataframe returned by ``node_data`` has a ``date`` field, which gives the estimated date for each node to day precision. To find the oldest node reliably we need finer precision, which is provided by the ``time`` column in the tskit [node table](https://tskit.dev/tskit/docs/stable/data-model.html#node-table). To access this information in Pandas, we add this field to the dataframe. Because tskit measures time backwards from the present, older nodes have larger ``time`` values, so we sort the rows by time in descending order and then extract the first record for each Pango designation:

In [6]:
dfn["time"] = ts.nodes_time
dfn_pango = dfn.sort_values("time", ascending=False).groupby(["pango"]).first()
dfn_pango

pango,sample_id,scorpio,node_id,is_sample,is_recombinant,num_mutations,date,time
A,,.,9,False,False,2,2019-12-26,1153.000000
A.1,,.,227,False,False,2,2020-01-19,1129.000000
A.2,,.,530,False,False,4,2020-02-01,1116.000000
A.2.2,,.,1190,False,False,1,2020-02-26,1091.109256
A.2.3,,.,1186,False,False,2,2020-02-21,1096.465780
...,...,...,...,...,...,...,...,...
XW,,Omicron (BA.2-like),1159411,False,True,1,2022-03-10,348.300416
XY,,Omicron (Unassigned),1187989,False,True,2,2022-03-16,342.625280
XZ,,Omicron (BA.2-like),1163537,False,False,1,2022-03-05,353.377858
Y.1,,.,55861,False,False,1,2020-08-18,917.000000
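
The sort-then-``first`` idiom above can be checked on a toy table. The sketch below is illustrative only: the rows and values are invented rather than taken from the ARG, and the ``idxmax`` variant is an alternative formulation, not what the notebook itself uses. Note how the two ``B.1`` nodes share a day-precision date, so only ``time`` separates them.

```python
import pandas as pd

# Hypothetical toy node table; all values are invented for illustration.
toy = pd.DataFrame(
    {
        "pango": ["B.1", "B.1", "A", "A", "B.1"],
        "node_id": [10, 11, 12, 13, 14],
        "date": pd.to_datetime(
            ["2020-01-28", "2020-01-28", "2020-01-19", "2020-02-01", "2020-03-01"]
        ),
        "time": [1120.06, 1120.00, 1129.0, 1116.0, 1088.0],
    }
)

# Same idiom as above: oldest (largest time) rows first, then first row per group.
oldest = toy.sort_values("time", ascending=False).groupby("pango").first()

# Alternative: find the row label of the maximum time within each group.
oldest_alt = toy.loc[toy.groupby("pango")["time"].idxmax()].set_index("pango")

print(oldest[["node_id", "time"]])
```

One design point to keep in mind: ``first()`` returns the first non-null value in each column, so the two formulations can differ if a column contains missing values.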


We can then extract the "alpha origin" node easily:

In [7]:
alpha_origin = dfn_pango.loc["B.1.1.7"]
alpha_origin

sample_id                             
scorpio           Alpha (B.1.1.7-like)
node_id                          86456
is_sample                        False
is_recombinant                   False
num_mutations                       36
date               2020-09-23 00:00:00
time                        881.613313
Name: B.1.1.7, dtype: object
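
The ``time`` values can be related back to calendar dates. The sketch below assumes that ``time`` counts days before the ``time_zero_date`` shown in the tree sequence metadata (2023-02-21), and that ``date`` corresponds to rounding ``time`` down to whole days. The rounding rule is an inference from the values displayed in this notebook, not something checked against the sc2ts documentation.

```python
import numpy as np

# time_zero_date as reported in the tree sequence metadata above.
time_zero = np.datetime64("2023-02-21")

# Node times (days ago) copied from the outputs in this notebook:
# Wuhan/Hu-1/2019, the Alpha origin and the XBB origin.
times = np.array([1153.0, 881.613313, 267.175348])

# Assumption: the day-precision date drops the fractional part of the time.
whole_days = np.floor(times).astype(np.int64)
dates = time_zero - whole_days.astype("timedelta64[D]")

print(np.datetime_as_string(dates))
```

Under this assumption the three dates come out as 2019-12-26, 2020-09-23 and 2022-05-30, matching the ``date`` column shown for those nodes.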

## Information on mutations

The sc2ts Python API also provides a way to extract information about mutations, through the ``mutation_data`` function.

In [8]:
dfm = sc2ts.mutation_data(ts).set_index("node")
dfm

node,mutation_id,site_id,position,parent,inherited_state,derived_state,date
644280,0,24,25,-1,T,A,2021-09-20
12915,1,37,38,-1,A,G,2020-04-21
344292,2,42,43,-1,T,A,2021-07-17
427855,3,42,43,-1,T,C,2021-07-25
670767,4,42,43,-1,T,C,2021-08-21
...,...,...,...,...,...,...,...
691821,2285339,29839,29850,-1,A,G,2021-10-18
719257,2285340,29839,29850,-1,A,G,2021-10-21
952268,2285341,29839,29850,-1,A,G,2021-12-13
1129360,2285342,29839,29850,-1,A,G,2022-01-08


This dataframe has information about each mutation, including the genome position (``position``) and the node over which the mutation occurred. Here, we're interested in finding all the mutations immediately ancestral to the Alpha origin node found above, which is why we set the index to the ``node`` field when building the dataframe. Getting the mutations is then quick and easy:

In [9]:
alpha_origin_mutations = dfm.loc[alpha_origin["node_id"]]
alpha_origin_mutations

node,mutation_id,site_id,position,parent,inherited_state,derived_state,date
86456,160960,910,913,-1,C,T,2020-05-29
86456,472353,5385,5388,-1,C,A,2020-05-29
86456,507747,5983,5986,-1,C,T,2020-05-29
86456,575045,6951,6954,-1,T,C,2020-05-29
86456,812756,11282,11288,-1,T,-,2020-05-29
86456,812805,11283,11289,-1,C,-,2020-05-29
86456,812923,11284,11290,-1,T,-,2020-05-29
86456,813004,11285,11291,-1,G,-,2020-05-29
86456,813082,11286,11292,-1,G,-,2020-05-29
86456,813200,11287,11293,-1,T,-,2020-05-29
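
One detail of looking up rows on a non-unique index is worth a small sketch. The toy table below reuses a few ``node``, ``position`` and ``derived_state`` values from the outputs above, and the variable names are hypothetical. It shows that ``.loc`` with a single label returns a DataFrame when the node carries several mutations but a Series when it carries exactly one, whereas passing a list always gives a DataFrame.

```python
import pandas as pd

# Toy mutation table indexed by node; the values are a small excerpt of the
# outputs shown above.
muts = pd.DataFrame(
    {
        "node": [86456, 12915, 86456, 644280],
        "position": [913, 38, 5388, 25],
        "derived_state": ["T", "G", "A", "A"],
    }
).set_index("node")

several = muts.loc[86456]     # node with two mutations: a DataFrame
single = muts.loc[12915]      # node with one mutation: a Series
single_df = muts.loc[[12915]] # list of labels: always a DataFrame

print(type(several).__name__, type(single).__name__, type(single_df).__name__)
```

Using the list form is a reasonable default when the downstream code expects a dataframe regardless of how many mutations a node has.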


Here we can see that there are quite a lot of mutations (36), but many of them have the gap character ("-") as their derived state. These represent the approximation of deletions in sc2ts, in which deletions at each site are modelled independently.

We can extract the actual deletion events from this representation by looking at runs of mutations to the gap character and extracting their start and length. We do this using some pandas group-by logic.

In [10]:
dels = alpha_origin_mutations[alpha_origin_mutations["derived_state"] == "-"].copy()
dels["run"] = (dels["position"].diff(1) != 1).cumsum()
dels

node,mutation_id,site_id,position,parent,inherited_state,derived_state,date,run
86456,812756,11282,11288,-1,T,-,2020-05-29,1
86456,812805,11283,11289,-1,C,-,2020-05-29,1
86456,812923,11284,11290,-1,T,-,2020-05-29,1
86456,813004,11285,11291,-1,G,-,2020-05-29,1
86456,813082,11286,11292,-1,G,-,2020-05-29,1
86456,813200,11287,11293,-1,T,-,2020-05-29,1
86456,813299,11288,11294,-1,T,-,2020-05-29,1
86456,813396,11289,11295,-1,T,-,2020-05-29,1
86456,813465,11290,11296,-1,T,-,2020-05-29,1
86456,1376533,21756,21765,-1,T,-,2020-05-29,2


In [11]:
dels_grp = dels.groupby("run")
summary = dels_grp.first() 
summary["length"] = dels_grp.last()["position"] - summary["position"] + 1
summary[["position", "length"]]

run,position,length
1,11288,9
2,21765,6
3,21991,3
4,28271,1


So, we have 4 deletions: a 9 bp deletion starting at position 11288, a 6 bp deletion at 21765, a 3 bp deletion at 21991 and a single-base deletion at 28271.
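
The run-detection trick can be seen in isolation on a bare list of positions. In the sketch below the positions are reconstructed from the start and length of the four deletions reported above, and the ``start`` and ``length`` column names are chosen for illustration. It assumes the rows are already sorted by position.

```python
import numpy as np
import pandas as pd

# Gap positions reconstructed from the four deletions summarised above.
positions = np.concatenate(
    [
        np.arange(11288, 11297),  # 9 bp
        np.arange(21765, 21771),  # 6 bp
        np.arange(21991, 21994),  # 3 bp
        [28271],                  # 1 bp
    ]
)
dels = pd.DataFrame({"position": positions})

# A new run starts wherever the gap to the previous position is not exactly 1.
# The first row has no predecessor (diff is NaN), so it also starts a run.
starts_run = dels["position"].diff() != 1
dels["run"] = starts_run.cumsum()

# Each run is one deletion: its start is the smallest position, and because
# positions within a run are consecutive its length is the number of rows.
summary = dels.groupby("run")["position"].agg(start="min", length="count")
print(summary)
```

Counting rows per run gives the same lengths as the ``last - first + 1`` calculation used above, since each run is by construction a block of consecutive positions.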

Non-deletion mutations are more straightforward to analyse, but it's important to factor out the deletions first because of their multibase nature. This is easily done:

In [12]:
non_dels = alpha_origin_mutations[alpha_origin_mutations["derived_state"] != "-"]
non_dels

node,mutation_id,site_id,position,parent,inherited_state,derived_state,date
86456,160960,910,913,-1,C,T,2020-05-29
86456,472353,5385,5388,-1,C,A,2020-05-29
86456,507747,5983,5986,-1,C,T,2020-05-29
86456,575045,6951,6954,-1,T,C,2020-05-29
86456,988872,14670,14676,-1,C,T,2020-05-29
86456,1017218,15273,15279,-1,C,T,2020-05-29
86456,1058632,16169,16176,-1,T,C,2020-05-29
86456,1521333,23054,23063,-1,A,T,2020-05-29
86456,1532177,23262,23271,-1,C,A,2020-05-29
86456,1554236,23595,23604,-1,C,A,2020-05-29


## Recombination nodes and local trees

The ``node_data`` dataframe has a field ``is_recombinant``, which makes finding the recombination nodes in the sc2ts ARG easy:

In [13]:
df_recomb = dfn[dfn.is_recombinant]
df_recomb

,pango,sample_id,scorpio,node_id,is_sample,is_recombinant,num_mutations,date,time
1530,B.1.157,,.,1530,False,True,2,2020-03-18,1070.000000
26465,B.1.221,,.,26465,False,True,2,2020-06-29,967.212654
27003,B.1.160,,.,27003,False,True,1,2020-07-12,954.666560
28379,B.1.426,,.,28379,False,True,0,2020-08-28,907.000000
34811,B.1.177,,.,34811,False,True,2,2020-09-18,886.080636
...,...,...,...,...,...,...,...,...,...
1430261,BA.2,,Omicron (BA.2-like),1430261,False,True,11,2023-01-21,31.669198
1430452,BA.2,,Omicron (BA.2-like),1430452,False,True,2,2023-01-03,49.963285
1431988,CH.1.1,,Omicron (BA.2-like),1431988,False,True,3,2023-01-25,27.092445
1432902,BA.2,,Omicron (BA.2-like),1432902,False,True,12,2023-01-27,25.713623


XBB is an important recombinant and is well captured by the sc2ts ARG. Its "origin node" is a recombination node, meaning that it has more than one parent in the ARG. We can see that the ``is_recombinant`` flag is True:

In [14]:
xbb_origin = dfn_pango.loc["XBB"]
xbb_origin

sample_id                            
scorpio            Omicron (XBB-like)
node_id                       1396207
is_sample                       False
is_recombinant                   True
num_mutations                      14
date              2022-05-30 00:00:00
time                       267.175348
Name: XBB, dtype: object

All recombinants in the current sc2ts ARG have exactly two parents and consequently one breakpoint (this is not a limitation of tskit, which allows arbitrarily complex inheritance). To find where the breakpoint is, we must inspect the tskit [edge table](https://tskit.dev/tskit/docs/stable/data-model.html#edge-table). This simply records the IDs of the parent and child nodes and the genomic interval over which this relationship applies. We then find the edges which record the parents of the XBB origin by looking for those edges in which it is the child:

In [15]:
xbb_edges = np.where(ts.edges_child == xbb_origin["node_id"])[0]
xbb_edges

array([235736, 400939])

We get two edge IDs, which we can then use to get the properties of the edges themselves. One quick way to visualise the information is to access the edge table directly:

In [16]:
ts.tables.edges[xbb_edges]

id,left,right,parent,child,metadata
0,22577,29904,1363939,1396207,
1,0,22577,1101942,1396207,


We can also look at the parents of the event directly:

In [17]:
dfn.loc[ts.edges_parent[xbb_edges]]

,pango,sample_id,scorpio,node_id,is_sample,is_recombinant,num_mutations,date,time
1363939,BM.1.1.1,,Omicron (BA.2-like),1363939,False,False,1,2022-04-26,301.500484
1101942,BA.2.10,ERR9201084,Omicron (BA.2-like),1101942,True,False,1,2022-02-26,360.0


To get the breakpoint, we can then take the maximum of the left coordinate:

(Note that we're using two different conventions for accessing the underlying table data here, ``ts.tables.edges`` and the ``ts.edges_*`` array properties. They are essentially equivalent and can be mixed and matched as is most convenient.)

In [18]:
bp = np.max(ts.edges_left[xbb_edges])
bp

np.float64(22577.0)
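
The same lookup can be wrapped up so that it returns the parents in left-to-right order together with the breakpoints between them. The sketch below works on plain NumPy arrays standing in for ``ts.edges_left``, ``ts.edges_parent`` and so on; the two rows for the XBB origin copy the edge table output above, while the other rows and the function name are made up for illustration. It assumes the child's edges tile the genome without gaps.

```python
import numpy as np

# Toy edge columns. The two rows for child 1396207 copy the edge table output
# above; the remaining rows are filler added for illustration.
edges_left = np.array([0.0, 22577.0, 0.0, 0.0])
edges_right = np.array([29904.0, 29904.0, 22577.0, 29904.0])
edges_parent = np.array([9, 1363939, 1101942, 1])
edges_child = np.array([2, 1396207, 1396207, 27])


def parents_and_breakpoints(child):
    """Return the parents of `child` ordered by left coordinate, and the
    coordinates at which the parent changes."""
    idx = np.flatnonzero(edges_child == child)
    idx = idx[np.argsort(edges_left[idx])]
    # Every edge after the first starts at a breakpoint.
    return edges_parent[idx], edges_left[idx][1:]


parents, breakpoints = parents_and_breakpoints(1396207)
print(parents, breakpoints)
```

For a non-recombinant node the function returns a single parent and an empty array of breakpoints, so the same code path handles both cases.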

The *effect* of this recombination event is to change the tree to the left and right of the breakpoint. Tskit provides an efficient view over the "local trees" along the genome that result from recombination events. In this case, we would like to examine the trees to the right and left of the breakpoint for XBB. The simplest (and most efficient) way to do this is to first look at the right-hand tree using the ``ts.at`` method. 

In [19]:
tree = ts.at(bp)
tree

Tree,
Index,193
Interval,22 577-22 578 (1)
Roots,1
Nodes,2 747 984
Sites,1
Mutations,87
Total Branch Length,99 256 253


The tree returned here is an instance of the tskit [Tree](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.Tree) class, which provides many operations. This is built around the efficient numerical representation of trees, and specifically the [quintuply linked tree](https://tskit.dev/tskit/docs/stable/data-model.html#quintuply-linked-trees) data structure, in which the tree topology is encoded using a set of numpy arrays. For example, the parent of each node in the tree is represented using the ``parent_array``:

In [20]:
tree.parent_array

array([     -1,      -1,       9, ..., 2740283, 2736794,      -1],
      shape=(2747986,), dtype=int32)

So, ``tree.parent_array[u]`` means "what is the node ID of the parent of node ``u`` in this tree". The ``-1`` values here indicate that a particular node has no parent, and so is either a root or is not "present" in the current tree. We can then verify that the local tree has the correct parent for the right-hand side of the XBB recombination event:

In [21]:
tree.parent_array[xbb_origin["node_id"]]

np.int32(1363939)

We can use the same approach to access the tree on the left hand side of the breakpoint, but it is more efficient to "mutate" the current tree instance using the ``prev()`` method. This is because ``ts.at()`` requires us to create a new tree instance and to apply all the edges that intersect with the given point. Using the ``prev`` method, on the other hand, takes advantage of the fact that there's very little difference between the trees at either side of the breakpoint and makes those minimal changes.

In [22]:
tree.prev()
tree

Tree,
Index,192
Interval,22 502-22 577 (75)
Roots,1
Nodes,2 747 984
Sites,75
Mutations,3 043
Total Branch Length,99 256 311


Now we can also verify that the left-hand parent of the XBB recombination node is what we extracted from the edge table above.

In [23]:
tree.parent_array[xbb_origin["node_id"]]

np.int32(1101942)
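
To see why stepping to an adjacent tree is cheap, it helps to work through a tiny example by hand. The sketch below is a simplified illustration of the idea on a hypothetical four-node ARG, not a description of how tskit implements ``prev()`` internally. Building the parent array from scratch touches every edge covering the position, whereas stepping across the breakpoint only touches the edges that start or end there.

```python
import numpy as np

# Hypothetical toy ARG with 4 nodes on a genome of length 10. Node 3 is a
# recombinant with parent 1 on [0, 4) and parent 2 on [4, 10).
left = np.array([0, 0, 0, 4])
right = np.array([10, 10, 4, 10])
parent = np.array([0, 0, 1, 2])
child = np.array([1, 2, 3, 3])
bp = 4


def build_parent_array(x, num_nodes=4):
    """Build the parent array at position x from scratch."""
    out = np.full(num_nodes, -1)
    covers = (left <= x) & (x < right)
    out[child[covers]] = parent[covers]
    return out


right_tree = build_parent_array(bp)

# Incremental route: step left across the breakpoint by removing the edges
# that start at bp and restoring the edges that end at bp.
left_tree = right_tree.copy()
starts = left == bp
ends = right == bp
left_tree[child[starts]] = -1
left_tree[child[ends]] = parent[ends]

num_edges_touched = int(starts.sum() + ends.sum())
print(right_tree, left_tree, num_edges_touched)
```

Here only two of the four edges are touched, and the incremental result agrees with rebuilding the tree at a position just left of the breakpoint.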

## Getting the ancestors of a node in a tree

An operation that we're often interested in is to get the ancestors of a node in a particular tree. Using the ``ancestors`` method along with Pandas selection provides an easy way to access data on the ancestors of the Alpha origin node.

In [24]:
tree = ts.first()   
u = alpha_origin["node_id"]
dfn.loc[[u] + list(tree.ancestors(u))]

,pango,sample_id,scorpio,node_id,is_sample,is_recombinant,num_mutations,date,time
86456,B.1.1.7,,Alpha (B.1.1.7-like),86456,False,False,36,2020-09-23,881.613313
4552,B.1.1,,.,4552,False,False,3,2020-02-03,1114.859379
98,B.1.1,,.,98,False,False,3,2020-01-28,1120.000555
59,B.1,,.,59,False,False,1,2020-01-28,1120.059657
12,B.1,SRR11597205,.,12,True,False,2,2020-01-28,1120.059657
27,B,,.,27,False,False,1,2019-12-26,1153.0
1,B,Wuhan/Hu-1/2019,.,1,False,False,0,2019-12-26,1153.0


**Note:** In this example we use the first tree in the sequence to access the ancestors of the Alpha origin node. This is valid here because there is no recombinant among those ancestors. When recombination is present among the ancestors, more care is required.
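
Conceptually, finding the ancestors of a node amounts to following parent pointers until a root is reached. The sketch below spells this out on a toy parent array of the kind described earlier, with ``-1`` (the value of ``tskit.NULL``) marking a node with no parent. The array and the helper function are hypothetical and written in plain Python for clarity.

```python
import numpy as np

NULL = -1  # same value as tskit.NULL

# Hypothetical toy tree: node 0 is the root, 1 and 2 are its children,
# 3 and 4 are children of 1, and 5 is a child of 3.
parent = np.array([-1, 0, 0, 1, 1, 3])


def ancestors(parent, u):
    """Return the ancestors of u, nearest first, by walking up the tree."""
    path = []
    v = parent[u]
    while v != NULL:
        path.append(int(v))
        v = parent[v]
    return path


print(ancestors(parent, 5))
```

Because the walk uses the parent array of one particular tree, the result is specific to that tree, which is why more care is needed when a recombinant lies on the path.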

## Iterating over all trees

A common pattern in analysing ARGs with tskit is to iterate sequentially over all local trees along the genome. This is done using the ``.trees()`` iterator, which is efficient because it again takes advantage of the fact that adjacent trees are very similar. Here we use this to count the number of polytomies (nodes with > 2 children) in each local tree. This is done using Numpy functions on the ``num_children_array``, which is a zero-copy view on a per-node count maintained by the underlying C library.

In [25]:
%%time
num_polytomies = np.zeros(ts.num_trees, dtype=int)
for tree in ts.trees():
    num_polytomies[tree.index] = np.sum(tree.num_children_array > 2)
np.min(num_polytomies), np.max(num_polytomies)

CPU times: user 915 ms, sys: 32 ms, total: 947 ms
Wall time: 946 ms


(np.int64(222242), np.int64(222257))
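
To make the meaning of the per-node child counts concrete, the sketch below derives them from toy parent arrays with ``np.bincount`` and then counts polytomies in the same way as above. The two parent arrays and the helper function are hypothetical; with tskit the counts come directly from ``num_children_array`` and there is no need to recompute them.

```python
import numpy as np


def num_children(parent):
    """Count the children of each node from a parent array (-1 = no parent)."""
    has_parent = parent != -1
    return np.bincount(parent[has_parent], minlength=len(parent))


# Two hypothetical local trees over the same 10 nodes. They differ only in
# the parent of node 9, as adjacent trees in an ARG typically differ little.
trees = [
    np.array([-1, 0, 0, 0, 1, 1, 3, 3, 3, 3]),
    np.array([-1, 0, 0, 0, 1, 1, 3, 3, 3, 1]),
]

# A polytomy is a node with more than two children.
num_polytomies = np.array([np.sum(num_children(p) > 2) for p in trees])
print(num_polytomies)
```

In the first tree nodes 0 and 3 are polytomies; moving node 9 under node 1 in the second tree makes node 1 a polytomy as well.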

## Examining children

The numpy ``argmax`` function gives us a convenient way to find the ID of the node which has the largest number of children:

In [26]:
tree = ts.first()
u = np.argmax(tree.num_children_array)
tree.num_children_array[u]

np.int32(12682)

This has an impressive 12,682 children! We can first look at its path back to the root, and verify that this path is valid across all trees by checking that there are no recombinants in its ancestry:

In [27]:
dfn.loc[[u] + list(tree.ancestors(u))]

,pango,sample_id,scorpio,node_id,is_sample,is_recombinant,num_mutations,date,time
887654,BA.2.5,,Omicron (BA.2-like),887654,False,False,1,2021-11-11,467.306208
863361,BA.2,,Omicron (BA.2-like),863361,False,False,2,2021-11-05,473.128012
843826,BA.2,,Omicron (BA.2-like),843826,False,False,0,2021-11-05,473.128012
822854,BA.2,,Omicron (BA.2-like),822854,False,False,42,2021-11-05,473.128012
1436802,B.1.1.529,,Probable Omicron (Unassigned),1436802,False,False,39,2020-06-26,970.041636
759815,B.1.1,,.,759815,False,False,1,2020-01-28,1120.000353
5001,B.1.1,,.,5001,False,False,1,2020-01-28,1120.000353
98,B.1.1,,.,98,False,False,3,2020-01-28,1120.000555
59,B.1,,.,59,False,False,1,2020-01-28,1120.059657
12,B.1,SRR11597205,.,12,True,False,2,2020-01-28,1120.059657


We can then get the list of children for that node using the ``tree.children()`` method, and extract the relevant rows from the node data table:

In [28]:
df_children = dfn.loc[list(tree.children(u))]
df_children

,pango,sample_id,scorpio,node_id,is_sample,is_recombinant,num_mutations,date,time
886829,BA.2.5,SRR17712991,Omicron (BA.2-like),886829,True,False,0,2021-12-13,435.000000
903506,BA.2.5,,Omicron (BA.2-like),903506,False,False,1,2021-11-11,467.306208
924867,BA.2,SRR17712363,Omicron (BA.2-like),924867,True,False,2,2021-12-30,418.000000
927976,BA.2,ERR9234047,Omicron (BA.2-like),927976,True,False,1,2022-01-02,415.000000
927977,BA.2,ERR9234055,Omicron (BA.2-like),927977,True,False,1,2022-01-02,415.000000
...,...,...,...,...,...,...,...,...,...
2735525,BA.2,,Omicron (BA.2-like),2735525,False,False,1,2021-12-21,427.140858
2739602,BA.2,,Omicron (BA.2-like),2739602,False,False,1,2021-12-16,432.996202
2742229,BA.2,,Omicron (BA.2-like),2742229,False,False,1,2022-02-25,361.765177
2742596,BA.2,,Omicron (BA.2-like),2742596,False,False,1,2022-01-04,413.235262


When we break down the counts of those nodes by their sample status and number of mutations, we see that we have over 9,500 samples which are identical to the focal node. Thus, this seems like a well-inferred node capturing a very deeply sampled phase of the BA.2 outbreak.

In [29]:
df_children[["num_mutations", "is_sample"]].value_counts()

num_mutations  is_sample
0              True         9552
1              True         2038
               False         816
2              True          198
               False          36
3              True           28
4              True            7
3              False           4
5              True            2
6              False           1
Name: count, dtype: int64
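
If only one cell of this breakdown is needed, such as the number of samples identical to the focal node, a boolean mask is a direct alternative. The sketch below uses a small invented table with the same two columns; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy table of children; values are invented for illustration.
children = pd.DataFrame(
    {
        "num_mutations": [0, 0, 0, 1, 1, 2],
        "is_sample": [True, True, True, True, False, True],
    }
)

# Samples carrying no mutations are identical to their parent node.
identical = (children["num_mutations"] == 0) & children["is_sample"]
num_identical = int(identical.sum())

# Breakdown of the sample children by number of mutations.
samples_by_muts = (
    children.loc[children["is_sample"], "num_mutations"].value_counts().sort_index()
)
print(num_identical)
print(samples_by_muts)
```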

## All descendants

Getting the descendants of a node in a particular tree is straightforward, and can be done efficiently in the standard traversal orders. For example, here we get all the nodes descending from the Alpha origin in the first tree in preorder, and extract the corresponding dataframe:


In [30]:
tree = ts.first()
descendants = tree.preorder(alpha_origin["node_id"])
descendants

array([  86456,   45989,   45958, ..., 1551400, 1551420,   86035],
      shape=(317368,), dtype=int32)

In [31]:
alpha_descendants = dfn.loc[descendants]
alpha_descendants

,pango,sample_id,scorpio,node_id,is_sample,is_recombinant,num_mutations,date,time
86456,B.1.1.7,,Alpha (B.1.1.7-like),86456,False,False,36,2020-09-23,881.613313
45989,B.1.1.7,,Alpha (B.1.1.7-like),45989,False,False,1,2020-09-23,881.613313
45958,B.1.1.7,ERR4827009,Alpha (B.1.1.7-like),45958,True,False,1,2020-10-21,853.000000
51925,B.1.1.7,ERR4862251,Alpha (B.1.1.7-like),51925,True,False,1,2020-11-05,838.000000
56457,B.1.1.7,ERR4905821,Alpha (B.1.1.7-like),56457,True,False,1,2020-11-11,832.000000
...,...,...,...,...,...,...,...,...,...
1543606,B.1.1.7,ERR5316249,Alpha (B.1.1.7-like),1543606,True,False,0,2021-02-05,746.000000
1544098,B.1.1.7,ERR5314092,Alpha (B.1.1.7-like),1544098,True,False,0,2021-02-06,745.000000
1551400,B.1.1.7,ERR5338932,Alpha (B.1.1.7-like),1551400,True,False,0,2021-02-13,738.000000
1551420,B.1.1.7,ERR5338978,Alpha (B.1.1.7-like),1551420,True,False,0,2021-02-13,738.000000


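However, this is only an approximation to the full set of descendants of the node, because some of those descendants are recombinants. A recombinant inherits from different parents on either side of its breakpoint, so it (and everything below it) may descend from the Alpha origin in some local trees but not in others.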

In [32]:
alpha_descendants[alpha_descendants["is_recombinant"]]

,pango,sample_id,scorpio,node_id,is_sample,is_recombinant,num_mutations,date,time
263782,B.1.526,,Iota (B.1.526-like),263782,False,True,2,2021-04-05,687.73831
277217,B.1.1.7,,Alpha (B.1.1.7-like),277217,False,True,0,2021-05-12,650.0
251428,B.1.1.7,,Alpha (B.1.1.7-like),251428,False,True,3,2021-04-14,678.320265
330618,Q.3,,Alpha (B.1.1.7-like),330618,False,True,1,2021-07-01,600.596459
229637,B.1.1.7,,Alpha (B.1.1.7-like),229637,False,True,1,2021-03-28,695.635514
191383,B.1.1.7,,Alpha (B.1.1.7-like),191383,False,True,0,2021-02-28,723.309391
214402,B.1.1.7,,Alpha (B.1.1.7-like),214402,False,True,2,2021-03-27,696.0
104754,B.1.177.9,,.,104754,False,True,3,2021-01-07,775.25865
143250,B.1.1.7,,Alpha (B.1.1.7-like),143250,False,True,2,2021-02-04,747.4853
286975,B.1.1.7,,Alpha (B.1.1.7-like),286975,False,True,1,2021-05-19,643.294881


To get the full set of descendants of this node we must currently take the union of the descendants across all trees, as follows.

**NOTE:** Efficient support for this operation is planned in tskit.

In [33]:
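# Start from an empty integer array so that the union keeps an integer dtype
# (np.union1d with an empty list would give float64 node IDs).
all_descendants = np.array([], dtype=np.int32)
for tree in ts.trees():
    # union1d returns the sorted, de-duplicated union of its inputs.
    all_descendants = np.union1d(
        all_descendants, tree.preorder(alpha_origin["node_id"]))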

In [34]:
alpha_descendants = dfn.loc[all_descendants]
alpha_descendants

,pango,sample_id,scorpio,node_id,is_sample,is_recombinant,num_mutations,date,time
45958,B.1.1.7,ERR4827009,Alpha (B.1.1.7-like),45958,True,False,1,2020-10-21,853.000000
45959,B.1.1.7,ERR4827181,Alpha (B.1.1.7-like),45959,True,False,0,2020-10-21,853.000000
45960,B.1.1.7,ERR4833995,Alpha (B.1.1.7-like),45960,True,False,1,2020-10-21,853.000000
45961,B.1.1.7,ERR4827287,Alpha (B.1.1.7-like),45961,True,False,1,2020-10-22,852.000000
45962,B.1.1.7,ERR4848983,Alpha (B.1.1.7-like),45962,True,False,0,2020-10-23,851.000000
...,...,...,...,...,...,...,...,...,...
2747788,B.1.1.7,,Alpha (B.1.1.7-like),2747788,False,False,2,2021-03-02,721.000000
2747798,B.1.1.7,,Alpha (B.1.1.7-like),2747798,False,False,1,2021-03-02,721.211832
2747870,B.1.1.7,,Alpha (B.1.1.7-like),2747870,False,False,4,2021-02-13,738.095214
2747937,B.1.1.7,,Alpha (B.1.1.7-like),2747937,False,False,1,2021-02-02,749.576484
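
The effect of recombination on descendant sets can be reproduced on a toy example. The sketch below uses two hypothetical local trees, given as parent arrays, in which node 4 is a recombinant with parent 1 in the left tree and parent 2 in the right tree. The ``descendants`` helper is written here for illustration and stands in for ``tree.preorder``; it is not part of tskit.

```python
import numpy as np


def descendants(parent, root):
    """Return the sorted IDs of root and all nodes below it in one tree."""
    is_desc = np.zeros(len(parent), dtype=bool)
    is_desc[root] = True
    has_parent = parent != -1
    while True:
        # A node is a descendant if its parent is already known to be one.
        new = is_desc.copy()
        new[has_parent] |= is_desc[parent[has_parent]]
        if np.array_equal(new, is_desc):
            return np.flatnonzero(is_desc)
        is_desc = new


# Hypothetical local trees: node 4 switches parent from 1 to 2.
left_tree = np.array([-1, 0, 0, 1, 1, 2])
right_tree = np.array([-1, 0, 0, 1, 2, 2])

first_tree_only = descendants(left_tree, 2)

# Seed the union with an empty integer array so node IDs stay integers.
all_desc = np.array([], dtype=np.int64)
for p in (left_tree, right_tree):
    all_desc = np.union1d(all_desc, descendants(p, 2))

print(first_tree_only, all_desc)
```

Looking at the first tree alone misses node 4 as a descendant of node 2, whereas the union over both trees recovers it.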
