# Pango events within the ARG

In this notebook we examine how well the ARG reflects the evolutionary events implicit in the Pango naming system.


## Summary

Of the 2058 distinct Pango lineages in the ARG, 1473 (comprising 717798 samples) match perfectly: each has a unique origination event in the ARG, and all samples assigned to the lineage descend from the first node assigned to that lineage. A further 245 lineages (589197 samples) match perfectly when we instead count the descendants of the parent of the first node, which accounts for polytomies in which multiple originating nodes for a given lineage are siblings. For another 306 lineages (755840 samples), fewer than 100 of the lineage's samples fall outside the descendants of the first node's parent. The remaining 35 lineages (420047 samples) are dominated by a few large lineages, such as BA.1.1 (147271 samples) and AY.4.2 (54607 samples), which have multiple non-sibling origins within the ARG.


In [1]:
import sc2ts
import tszip
import pathlib
import numpy as np
import pandas as pd
import concurrent.futures as cf
from tqdm.notebook import tqdm

datadir = pathlib.Path("../data")

## Code

In [2]:
ts = tszip.load(datadir / "sc2ts_viridian_v1.2.trees.tsz")


In [3]:
df_node = sc2ts.node_data(ts).set_index("node_id")
df_node["node_time"] = ts.nodes_time
df_node

node_id,pango,sample_id,scorpio,is_sample,is_recombinant,num_mutations,max_descendant_samples,date,node_time
0,B,Vestigial_ignore,.,False,False,0,0,2019-11-23,1186.228583
1,B,Wuhan/Hu-1/2019,.,False,False,0,2482157,2019-12-26,1153.000000
2,A,SRR11772659,.,True,False,1,255,2020-01-19,1129.000000
3,B,SRR11397727,.,True,False,0,1,2020-01-24,1124.000000
4,B,SRR11397730,.,True,False,0,1,2020-01-24,1124.000000
...,...,...,...,...,...,...,...,...,...
2747980,BA.2.1,,Omicron (BA.2-like),False,False,1,128,2022-01-03,414.988112
2747981,CH.1.1,,Omicron (BA.2-like),False,False,1,112,2022-09-28,146.324039
2747982,BA.5.1,,Omicron (BA.5-like),False,False,2,3,2022-05-31,266.203615
2747983,BU.1,,Omicron (BA.5-like),False,False,4,97,2022-06-13,253.553625


## Concordance with original assignments by Viridian

We extract the original Pango lineage assignments (versions 1.21 and 1.29) from the dataset and join them to our sample nodes to measure the level of concordance between the three different calls for the same sequences.

In [4]:
ds = sc2ts.Dataset(datadir / "viridian_mafft_2024-10-14_v1.vcz.zip")
ds

Dataset at ../data/viridian_mafft_2024-10-14_v1.vcz.zip with 4484157 samples, 29903 variants, and 30 metadata fields. See ds.metadata.field_descriptors() for a description of the fields.

In [21]:
df_sample = df_node[df_node.is_sample].set_index("sample_id")
df_sample

sample_id,pango,scorpio,is_sample,is_recombinant,num_mutations,max_descendant_samples,date,node_time
SRR11772659,A,.,True,False,1,255,2020-01-19,1129.0
SRR11397727,B,.,True,False,0,1,2020-01-24,1124.0
SRR11397730,B,.,True,False,0,1,2020-01-24,1124.0
SRR11597198,A,.,True,False,0,1,2020-01-25,1123.0
SRR11597221,A,.,True,False,0,1,2020-01-25,1123.0
...,...,...,...,...,...,...,...,...
ERR10937847,XBB.1.5,Omicron (XBB.1.5-like),True,False,0,1,2023-02-20,1.0
ERR10937891,XBB.1.5.62,Omicron (XBB.1.5-like),True,False,0,1,2023-02-20,1.0
ERR10937893,FD.1,Omicron (XBB.1.5-like),True,False,0,1,2023-02-20,1.0
ERR10937945,CH.1.1.3,Omicron (BA.2-like),True,False,0,1,2023-02-20,1.0


In [22]:
df_sample = df_sample.join(ds.metadata.as_dataframe(["Viridian_pangolin", "Viridian_pangolin_1.29", "Viridian_scorpio_1.29"]))

In [23]:
df_sample = df_sample.rename(columns={"Viridian_pangolin": "Viridian_pangolin_1.21"})
df_sample

sample_id,pango,scorpio,is_sample,is_recombinant,num_mutations,max_descendant_samples,date,node_time,Viridian_pangolin_1.21,Viridian_pangolin_1.29,Viridian_scorpio_1.29
SRR11772659,A,.,True,False,1,255,2020-01-19,1129.0,A,A,.
SRR11397727,B,.,True,False,0,1,2020-01-24,1124.0,B,B,.
SRR11397730,B,.,True,False,0,1,2020-01-24,1124.0,B,B,.
SRR11597198,A,.,True,False,0,1,2020-01-25,1123.0,A,A,.
SRR11597221,A,.,True,False,0,1,2020-01-25,1123.0,A,A,.
...,...,...,...,...,...,...,...,...,...,...,...
ERR10937847,XBB.1.5,Omicron (XBB.1.5-like),True,False,0,1,2023-02-20,1.0,XBB.1.5,XBB.1.5,Omicron (XBB.1.5-like)
ERR10937891,XBB.1.5.62,Omicron (XBB.1.5-like),True,False,0,1,2023-02-20,1.0,XBB.1.5.62,XBB.1.5.62,Omicron (XBB.1.5-like)
ERR10937893,FD.1,Omicron (XBB.1.5-like),True,False,0,1,2023-02-20,1.0,FD.1,FD.1,Omicron (XBB.1.5-like)
ERR10937945,CH.1.1.3,Omicron (BA.2-like),True,False,0,1,2023-02-20,1.0,CH.1.1.3,CH.1.1.3,Omicron (BA.2-like)
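
The join and rename above can be illustrated on a toy pair of frames. This is only a sketch: the sample IDs and lineage values are hypothetical, and it assumes the metadata frame is indexed by sample ID, as the join in the notebook implies.

```python
import pandas as pd

# Hypothetical sample nodes from the ARG, indexed by sample ID.
arg_samples = pd.DataFrame(
    {"pango": ["A", "B"]},
    index=pd.Index(["S1", "S2"], name="sample_id"),
)

# Hypothetical metadata; S3 is in the dataset but not in the ARG.
metadata = pd.DataFrame(
    {
        "Viridian_pangolin": ["A", "B.1", "C"],
        "Viridian_pangolin_1.29": ["A", "B", "C"],
    },
    index=pd.Index(["S1", "S2", "S3"], name="sample_id"),
)

# DataFrame.join is a left join on the index by default, so only
# samples present in the ARG are kept.
joined = arg_samples.join(metadata).rename(
    columns={"Viridian_pangolin": "Viridian_pangolin_1.21"}
)
print(joined)
```

Because the join is a left join, samples that exist only in the metadata (S3 here) are dropped, and the result has one row per ARG sample.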


In [24]:
np.sum(df_sample["Viridian_pangolin_1.21"] != df_sample["Viridian_pangolin_1.29"])

np.int64(6588)

In [25]:
print(f"{_ / df_sample.shape[0] * 100: .2f}")

 0.27


In [17]:
np.sum(df_sample["pango"] != df_sample["Viridian_pangolin_1.29"])

np.int64(21680)

In [18]:
print(f"{_ / df_sample.shape[0] * 100: .2f}")

 0.87
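
The two percentages above are the number of mismatching calls divided by the number of samples. A minimal sketch of that calculation on hypothetical toy data (the column names `call_a` and `call_b` are placeholders, not columns from the dataset):

```python
import pandas as pd

# Hypothetical lineage calls from two different assignment runs.
toy = pd.DataFrame(
    {
        "call_a": ["B.1", "BA.2", "XBB.1.5", "AY.4"],
        "call_b": ["B.1", "BA.2.1", "XBB.1.5", "AY.4"],
    }
)

# Element-wise comparison gives a boolean Series; its sum is the number
# of discordant samples and its mean is the discordant fraction.
mismatch = toy["call_a"] != toy["call_b"]
n_mismatch = int(mismatch.sum())
pct_mismatch = 100 * mismatch.mean()
print(f"{n_mismatch} mismatches ({pct_mismatch:.2f}%)")
```

Computing the percentage from the boolean Series directly avoids relying on the IPython `_` variable, which only holds the previous cell's output.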


Only 712 samples differ in their Scorpio designations.

In [26]:
df_sample[df_sample["scorpio"] != df_sample["Viridian_scorpio_1.29"]][["scorpio", "Viridian_scorpio_1.29"]]

sample_id,scorpio,Viridian_scorpio_1.29
SRR14389522,Zeta (P.2-like),.
SRR16617944,Zeta (P.2-like),.
ERR8516089,Delta (B.1.617.2-like) +K417N,Delta (B.1.617.2-like)
ERR6165703,Delta (AY.4-like),Delta (B.1.617.2-like)
ERR6165751,Delta (AY.4-like),Delta (B.1.617.2-like)
...,...,...
SRR23601085,Omicron (XBB-like),Omicron (XBB.1.5-like)
SRR23601164,Omicron (XBB-like),Omicron (XBB.1.5-like)
SRR23601292,Omicron (XBB-like),Omicron (XBB.1.5-like)
SRR23601180,Omicron (XBB-like),Omicron (XBB.1.5-like)


## Origination events

How well does the ARG reflect the phylogenetic structure of the pango naming system?

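We sort the nodes first by their number of descendant samples (`max_descendant_samples`) and then by node time, both in descending order. The first node in the dataframe for each Pango lineage should then be the "majority" node for that lineage: the one with the most descendant samples, with ties going to the oldest node.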

In [4]:
dfn_sorted = df_node.sort_values(["max_descendant_samples", "node_time"], ascending=False)
dfn_sorted

node_id,pango,sample_id,scorpio,is_sample,is_recombinant,num_mutations,max_descendant_samples,date,node_time
1,B,Wuhan/Hu-1/2019,.,False,False,0,2482157,2019-12-26,1153.000000
27,B,,.,False,False,1,2477500,2019-12-26,1153.000000
12,B.1,SRR11597205,.,True,False,2,2477495,2020-01-28,1120.059657
59,B.1,,.,False,False,1,2477489,2020-01-28,1120.059657
98,B.1.1,,.,False,False,3,1218787,2020-01-28,1120.000555
...,...,...,...,...,...,...,...,...,...
2689016,XBB.1.5.62,ERR10937891,Omicron (XBB.1.5-like),True,False,0,1,2023-02-20,1.000000
2689017,FD.1,ERR10937893,Omicron (XBB.1.5-like),True,False,0,1,2023-02-20,1.000000
2689018,CH.1.1.3,ERR10937945,Omicron (BA.2-like),True,False,0,1,2023-02-20,1.000000
2689019,CH.1.1.3,ERR10937969,Omicron (BA.2-like),True,False,0,1,2023-02-20,1.000000


In [5]:
dfn_pango = dfn_sorted.reset_index().groupby(["pango"]).first()
dfn_pango

pango,node_id,sample_id,scorpio,is_sample,is_recombinant,num_mutations,max_descendant_samples,date,node_time
A,9,,.,False,False,2,1085,2019-12-26,1153.000000
A.1,227,,.,False,False,2,245,2020-01-19,1129.000000
A.2,530,,.,False,False,4,234,2020-02-01,1116.000000
A.2.2,1190,,.,False,False,1,65,2020-02-26,1091.109256
A.2.3,1186,,.,False,False,2,79,2020-02-21,1096.465780
...,...,...,...,...,...,...,...,...,...
XW,1159411,,Omicron (BA.2-like),False,True,1,32,2022-03-10,348.300416
XY,1187989,,Omicron (Unassigned),False,True,2,23,2022-03-16,342.625280
XZ,1163537,,Omicron (BA.2-like),False,False,1,48,2022-03-05,353.377858
Y.1,55861,,.,False,False,1,36,2020-08-18,917.000000
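
The sort-then-first pattern used to pick the majority node can be checked on a small example. The node IDs, counts and times below are hypothetical toy values, not taken from the ARG.

```python
import pandas as pd

# Hypothetical nodes: lineage B.1 has two nodes tied on descendant count.
toy = pd.DataFrame(
    {
        "node_id": [10, 11, 12, 13, 14],
        "pango": ["A", "A", "B.1", "B.1", "B.1"],
        "max_descendant_samples": [3, 7, 5, 5, 2],
        "node_time": [100.0, 90.0, 50.0, 60.0, 40.0],
    }
)

# Largest descendant count first; ties broken by larger (older) node time.
toy_sorted = toy.sort_values(
    ["max_descendant_samples", "node_time"], ascending=False
)

# groupby preserves the row order within each group, so first() returns
# the top-ranked node per lineage.
toy_first = toy_sorted.groupby("pango").first()
print(toy_first)
```

One caveat worth keeping in mind: `GroupBy.first()` takes the first non-null entry of each column independently, so if a column contains missing values the result can mix values from different rows. The toy data here has no missing values, so each result row corresponds to a single node.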


In [11]:

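def worker(work):
    # work is a (pango, row) pair from dfn_pango.iterrows(); row describes
    # the first ("majority") node for that lineage.
    pango, row = work
    df_pango = df_node[df_node.pango == pango]
    # All sample nodes assigned to this lineage.
    samples = df_pango[df_pango.is_sample].index
    root = row["node_id"]
    tracked_samples = []
    parent_tracked_samples = []
    # In each local tree, count the lineage's samples that descend from the
    # first node and from its parent.
    for tree in ts.trees(tracked_samples=samples):
        tracked_samples.append(tree.num_tracked_samples(root))
        parent = tree.parent(root)
        if parent != -1:
            parent_tracked_samples.append(tree.num_tracked_samples(parent))
        else:
            # No parent in this tree: record 0, so diff_parent later equals
            # total_samples if this holds in every tree.
            parent_tracked_samples.append(0)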
            
    return {
        "pango": pango,
        "root": root,
        "total_samples": len(samples),
        "max_descendants": np.max(tracked_samples),
        "min_descendants": np.min(tracked_samples),
        "parent_max_descendants": np.max(parent_tracked_samples),
        "parent_min_descendants": np.min(parent_tracked_samples),
    }
    
# Note: worker() is structured this way with an eye to using
# concurrent.futures, but threads appeared to be entirely GIL-blocked,
# and process-level parallelism was not worth setting up.
data = []
for work in tqdm(dfn_pango.iterrows(), total=dfn_pango.shape[0]):
    result = worker(work)
    data.append(result)
        
df_pango_events = pd.DataFrame(data)
df_pango_events


,pango,root,total_samples,max_descendants,min_descendants,parent_max_descendants,parent_min_descendants
0,A,9,225,225,225,225,225
1,A.1,227,245,245,245,245,245
2,A.2,530,47,47,47,47,47
3,A.2.2,1190,65,65,65,65,65
4,A.2.3,1186,79,79,79,79,79
...,...,...,...,...,...,...,...
2053,XW,1159411,32,32,32,32,32
2054,XY,1187989,23,23,23,23,23
2055,XZ,1163537,48,48,48,48,48
2056,Y.1,55861,36,36,36,36,36
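
To make the counting in `worker` concrete, here is a minimal pure-Python sketch on a single hypothetical tree. It does not use tskit: the `parent` list and the `count_tracked` helper are illustrative stand-ins for a local tree and for `tree.num_tracked_samples`, and the tree itself is made up.

```python
def count_tracked(parent, tracked, node):
    """Count the tracked samples that are at or below `node`.

    `parent[u]` is the parent of node u, or -1 for the root.
    """
    count = 0
    for u in tracked:
        v = u
        # Walk from the sample towards the root, looking for `node`.
        while v != -1:
            if v == node:
                count += 1
                break
            v = parent[v]
    return count


# Toy tree: node 6 is the root, with children 4 and 5.
# Leaves 0 and 1 sit under node 4; leaves 2 and 3 sit under node 5.
parent = [4, 4, 5, 5, 6, 6, -1]

# Hypothetical lineage whose samples are split across siblings 4 and 5.
lineage_samples = [0, 1, 2]
first_node = 4  # the "majority" node for this lineage

root_count = count_tracked(parent, lineage_samples, first_node)
parent_count = count_tracked(parent, lineage_samples, parent[first_node])

diff = len(lineage_samples) - root_count
diff_parent = len(lineage_samples) - parent_count
print(root_count, parent_count, diff, diff_parent)  # 2 3 1 0
```

In this toy case the first node misses one sample of the lineage, but its parent covers all of them, which is the situation that counting from the parent is meant to capture.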


In [12]:
total = df_pango_events["total_samples"]
diff = (total - df_pango_events["max_descendants"]).abs() 
diff_parent = (total - df_pango_events["parent_max_descendants"]).abs() 
df_pango_events["diff"] = diff
df_pango_events["diff_parent"] = diff_parent
df_pango_events["relative_diff"] = diff / total
df_pango_events["relative_diff_parent"] = diff_parent / total

In [13]:
df_pango_events

,pango,root,total_samples,max_descendants,min_descendants,parent_max_descendants,parent_min_descendants,diff,diff_parent,relative_diff,relative_diff_parent
0,A,9,225,225,225,225,225,0,0,0.0,0.0
1,A.1,227,245,245,245,245,245,0,0,0.0,0.0
2,A.2,530,47,47,47,47,47,0,0,0.0,0.0
3,A.2.2,1190,65,65,65,65,65,0,0,0.0,0.0
4,A.2.3,1186,79,79,79,79,79,0,0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
2053,XW,1159411,32,32,32,32,32,0,0,0.0,0.0
2054,XY,1187989,23,23,23,23,23,0,0,0.0,0.0
2055,XZ,1163537,48,48,48,48,48,0,0,0.0,0.0
2056,Y.1,55861,36,36,36,36,36,0,0,0.0,0.0


In [14]:
perfect = df_pango_events[df_pango_events["diff"] == 0]
perfect

,pango,root,total_samples,max_descendants,min_descendants,parent_max_descendants,parent_min_descendants,diff,diff_parent,relative_diff,relative_diff_parent
0,A,9,225,225,225,225,225,0,0,0.0,0.0
1,A.1,227,245,245,245,245,245,0,0,0.0,0.0
2,A.2,530,47,47,47,47,47,0,0,0.0,0.0
3,A.2.2,1190,65,65,65,65,65,0,0,0.0,0.0
4,A.2.3,1186,79,79,79,79,79,0,0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
2053,XW,1159411,32,32,32,32,32,0,0,0.0,0.0
2054,XY,1187989,23,23,23,23,23,0,0,0.0,0.0
2055,XZ,1163537,48,48,48,48,48,0,0,0.0,0.0
2056,Y.1,55861,36,36,36,36,36,0,0,0.0,0.0


In [15]:
perfect.shape

(1473, 11)

In [16]:
perfect.total_samples.sum()

np.int64(717798)

## Consider the effects of polytomies

In [17]:
perfect_for_parent = df_pango_events[(df_pango_events["diff"] > 0) & (df_pango_events["diff_parent"] == 0)]
perfect_for_parent

,pango,root,total_samples,max_descendants,min_descendants,parent_max_descendants,parent_min_descendants,diff,diff_parent,relative_diff,relative_diff_parent
15,A.5,818,47,46,46,47,47,1,0,0.021277,0.0
19,AA.3,50820,28,27,27,28,28,1,0,0.035714,0.0
33,AM.1,97590,2,1,1,2,2,1,0,0.500000,0.0
43,AY.103,266229,67055,67021,67015,67055,67050,34,0,0.000507,0.0
46,AY.105,264390,222,220,219,222,222,2,0,0.009009,0.0
...,...,...,...,...,...,...,...,...,...,...,...
1989,XBB.1.5.9,2716447,14,12,12,14,14,2,0,0.142857,0.0
2001,XBB.2,1396939,103,93,93,103,103,10,0,0.097087,0.0
2009,XBB.2.7,1398638,5,2,2,5,5,3,0,0.600000,0.0
2025,XBF.2,1429629,4,2,2,4,4,2,0,0.500000,0.0


In [18]:
perfect_for_parent.shape

(245, 11)

In [19]:
perfect_for_parent.total_samples.sum()

np.int64(589197)

## The rest

In [20]:
not_perfect = df_pango_events[df_pango_events["diff_parent"] != 0]
not_perfect

,pango,root,total_samples,max_descendants,min_descendants,parent_max_descendants,parent_min_descendants,diff,diff_parent,relative_diff,relative_diff_parent
6,A.2.5.1,255097,3,2,2,2,2,1,1,0.333333,0.333333
7,A.2.5.2,325703,4,3,3,3,3,1,1,0.250000,0.250000
40,AY.100,280335,18621,17693,17426,17693,17426,928,928,0.049836,0.049836
48,AY.107,2711036,758,601,598,601,598,157,157,0.207124,0.207124
50,AY.109,380418,316,305,304,305,304,11,11,0.034810,0.034810
...,...,...,...,...,...,...,...,...,...,...,...
1968,XBB.1.5.59,1435395,2,1,1,1,1,1,1,0.500000,0.500000
1994,XBB.1.5.96,1429403,16,8,8,8,8,8,8,0.500000,0.500000
2006,XBB.2.4,1434117,4,3,3,3,3,1,1,0.250000,0.250000
2007,XBB.2.5,1420329,33,30,30,30,30,3,3,0.090909,0.090909


In [21]:
not_perfect.total_samples.sum()

np.int64(1175887)

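## Lineages that are pretty close

These are the imperfect lineages where fewer than 100 samples fall outside the descendants of the first node's parent (`diff_parent < 100`).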

In [22]:
close_to_right = not_perfect[(not_perfect["diff_parent"] < 100)]
close_to_right

,pango,root,total_samples,max_descendants,min_descendants,parent_max_descendants,parent_min_descendants,diff,diff_parent,relative_diff,relative_diff_parent
6,A.2.5.1,255097,3,2,2,2,2,1,1,0.333333,0.333333
7,A.2.5.2,325703,4,3,3,3,3,1,1,0.250000,0.250000
50,AY.109,380418,316,305,304,305,304,11,11,0.034810,0.034810
52,AY.110,296845,704,660,660,660,660,44,44,0.062500,0.062500
54,AY.112,407132,137,45,45,45,45,92,92,0.671533,0.671533
...,...,...,...,...,...,...,...,...,...,...,...
1968,XBB.1.5.59,1435395,2,1,1,1,1,1,1,0.500000,0.500000
1994,XBB.1.5.96,1429403,16,8,8,8,8,8,8,0.500000,0.500000
2006,XBB.2.4,1434117,4,3,3,3,3,1,1,0.250000,0.250000
2007,XBB.2.5,1420329,33,30,30,30,30,3,3,0.090909,0.090909


In [23]:
close_to_right.shape

(306, 11)

In [24]:
close_to_right.total_samples.sum()

np.int64(755840)

## Important stuff that's a long way off

In [25]:
important = not_perfect[not_perfect["diff_parent"] >= 100]
important.sort_values("total_samples", ascending=False)

,pango,root,total_samples,max_descendants,min_descendants,parent_max_descendants,parent_min_descendants,diff,diff_parent,relative_diff,relative_diff_parent
978,BA.1.1,2731625,147271,13829,13811,13829,13811,133442,133442,0.906098,0.906098
142,AY.4.2,280324,54607,54499,54344,54500,54345,108,107,0.001978,0.001959
1011,BA.1.17.2,827900,38207,27490,27484,27504,27498,10717,10703,0.280498,0.280132
1217,BA.5.2.1,1189190,31263,30584,30388,30591,30395,679,672,0.021719,0.021495
1004,BA.1.15,820352,21527,19391,19389,19398,19396,2136,2129,0.099224,0.098899
1145,BA.2.9,2730731,19336,13004,13004,13801,13801,6332,5535,0.327472,0.286254
40,AY.100,280335,18621,17693,17426,17693,17426,928,928,0.049836,0.049836
988,BA.1.1.18,818916,14866,5951,5945,5951,5945,8915,8915,0.599691,0.599691
129,AY.39,250694,11529,10748,10571,10748,10571,781,781,0.067742,0.067742
1012,BA.1.18,817061,11158,7919,7904,7919,7904,3239,3239,0.290285,0.290285


In [26]:
important.shape

(35, 11)

In [27]:
important.total_samples.sum()

np.int64(420047)
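
The four groups used in this notebook can also be assigned in a single pass, which makes it easy to check that every lineage lands in exactly one group. The sketch below uses hypothetical toy values and made-up category labels. It relies on `np.select` returning the choice for the first condition that matches, so a lineage with `diff == 0` is labelled perfect even if its first node has no parent (in which case `diff_parent` equals `total_samples` and the row would otherwise also pass a `diff_parent != 0` filter).

```python
import numpy as np
import pandas as pd

# Hypothetical lineages; P5 mimics a first node that has no parent.
toy = pd.DataFrame(
    {
        "pango": ["P1", "P2", "P3", "P4", "P5"],
        "total_samples": [10, 10, 500, 5000, 20],
        "diff": [0, 2, 30, 900, 0],
        "diff_parent": [0, 0, 30, 900, 20],
    }
)

# Conditions are checked in order; the first match wins.
conditions = [
    toy["diff"] == 0,
    toy["diff_parent"] == 0,
    toy["diff_parent"] < 100,
]
labels = ["perfect", "perfect_for_parent", "close"]
toy["category"] = np.select(conditions, labels, default="far_off")

# Each lineage has exactly one label, so the counts sum to the total.
counts = toy["category"].value_counts()
print(counts)
```

With mutually exclusive labels, the per-category lineage counts and sample totals add up to the overall totals by construction.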

In [28]:
df_pango_events.to_csv(datadir / "pango_events_in_arg.csv", index=False)