In [10]:
import tszip
import numpy as np

## Problematic sites and masking

Choosing which sites to mask is tricky. Here we compare the list of problematic sites that was curated early in the pandemic with the sites that are masked out of the sc2ts and UShER files.

First, download the problematic sites VCF.

In [None]:
!wget https://raw.githubusercontent.com/W-L/ProblematicSites_SARS-CoV2/master/problematic_sites_sarsCov2.vcf

In [11]:
problematic_sites = []
with open("problematic_sites_sarsCov2.vcf") as f:
    for line in f:
        if not line.startswith("#"):
            splits = line.split("\t")
            problematic_sites.append(int(splits[1]))

problematic_sites = np.array(problematic_sites)
problematic_sites.shape

(481,)

In [35]:
problematic_sites[-10:]

array([29894, 29895, 29896, 29897, 29898, 29899, 29900, 29901, 29902,
       29903])
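
The loop above keeps only the POS column. A minimal sketch of also grouping positions by the FILTER column (the seventh VCF field) follows, on the assumption that the file uses FILTER to separate categories of site such as "mask" and "caution". The helper name `read_sites_by_filter` and the toy VCF text are hypothetical and are not taken from the real file.

```python
import io
import numpy as np

# Toy VCF text for illustration only; these positions and FILTER values
# are made up and do not come from problematic_sites_sarsCov2.vcf.
toy_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr\t10\t.\tA\t.\t.\tmask\t.\n"
    "chr\t20\t.\tC\t.\t.\tcaution\t.\n"
    "chr\t30\t.\tG\t.\t.\tmask\t.\n"
)


def read_sites_by_filter(f):
    """Group 1-based POS values by the FILTER column (7th VCF field)."""
    by_filter = {}
    for line in f:
        # Skip header lines and blank lines
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        by_filter.setdefault(fields[6], []).append(int(fields[1]))
    return {key: np.array(val) for key, val in by_filter.items()}


toy_sites = read_sites_by_filter(toy_vcf)
```

Passing the open file handle for `problematic_sites_sarsCov2.vcf` instead of `toy_vcf` should give one array per FILTER value, which would allow the comparisons below to be repeated for each category separately.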

## UShER tree

The UShER tree masks out substantially more sites than are in the problematic sites list, and it's not clear why. Overall, 2395 sites are missing, 275 of which are in the problematic sites list.

In [24]:
ts_usher = tszip.load("../data/usher_viridian_v1.0.trees.tsz")
usher_sites = ts_usher.sites_position.astype(int)
usher_sites.shape

(27508,)

In [30]:
# Subtract 1 because positions are 1-based, so position 0 never holds a site
ts_usher.sequence_length - ts_usher.num_sites - 1

2395.0

In [19]:
len(np.setdiff1d(problematic_sites, usher_sites))

275

In [16]:
len(np.intersect1d(problematic_sites, usher_sites))

206
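
To look at the masked sites themselves rather than just their count, one option is to build the set of positions with no site and split it by membership of the problematic sites list. This is a minimal sketch on toy arrays; `masked_positions` and `overlap_summary` are hypothetical helper names, and the sketch assumes positions run from 1 to `sequence_length - 1`, as the subtraction of 1 above implies.

```python
import numpy as np


def masked_positions(sites_position, sequence_length):
    """Return the positions in 1..sequence_length-1 that have no site."""
    all_positions = np.arange(1, int(sequence_length))
    return np.setdiff1d(all_positions, sites_position)


def overlap_summary(reference_sites, masked):
    """Count masked positions inside and outside a reference list."""
    return {
        "masked": len(masked),
        "masked_in_reference": len(np.intersect1d(masked, reference_sites)),
        "masked_not_in_reference": len(np.setdiff1d(masked, reference_sites)),
    }


# Toy data: coordinate space of length 11 (positions 1..10),
# with no sites at positions 3, 7 and 10.
toy_sites = np.array([1, 2, 4, 5, 6, 8, 9])
toy_reference = np.array([3, 5, 10])
toy_masked = masked_positions(toy_sites, 11)
toy_summary = overlap_summary(toy_reference, toy_masked)
```

In this notebook the call would be `masked_positions(usher_sites, ts_usher.sequence_length)`. If the counts above hold, 2395 - 275 = 2120 of the masked positions would fall outside the problematic sites list, and inspecting that array is one way to start working out where the extra masking comes from.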

## Sc2ts final

The final sc2ts tree is missing only 10 sites, 7 of which are in this problematic sites list.

In [26]:
ts_sc2ts = tszip.load("../data/sc2ts_viridian_v1.2.trees.tsz")
sc2ts_sites = ts_sc2ts.sites_position.astype(int)
sc2ts_sites.shape

(29893,)

In [31]:
ts_sc2ts.sequence_length - ts_sc2ts.num_sites - 1

10.0

In [20]:
len(np.setdiff1d(problematic_sites, sc2ts_sites))

7

In [21]:
np.setdiff1d(problematic_sites, sc2ts_sites)

array([  635,  8835, 11074, 11083, 15521, 16887, 21575])

In [17]:
len(np.intersect1d(problematic_sites, sc2ts_sites))

474

## Sc2ts inference

The base ARG has 29803 sites after we mask out the 100 sites with the most mutations. Of those 100 masked sites, 11 are in the problematic sites list.


In [28]:
ts_sc2ts_base = tszip.load("../arg_postprocessing/sc2ts_v1_2023-02-21.trees")
sc2ts_base_sites = ts_sc2ts_base.sites_position.astype(int)
sc2ts_base_sites.shape

(29803,)

In [32]:
ts_sc2ts_base.sequence_length - ts_sc2ts_base.num_sites - 1

100.0

In [33]:
len(np.setdiff1d(problematic_sites, sc2ts_base_sites))

11

In [36]:
np.setdiff1d(problematic_sites, sc2ts_base_sites)

array([  635,  8835, 11074, 11083, 15521, 16887, 21304, 21305, 21575,
       21987, 28253])
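
The two lists of problematic sites that are absent from the final and base ARGs can be compared directly. The sketch below copies both arrays from the `np.setdiff1d` outputs in this notebook; the variable names are illustrative only.

```python
import numpy as np

# Problematic sites absent from the final sc2ts ARG (output copied from above)
missing_final = np.array([635, 8835, 11074, 11083, 15521, 16887, 21575])

# Problematic sites absent from the base ARG (output copied from above)
missing_base = np.array(
    [635, 8835, 11074, 11083, 15521, 16887, 21304, 21305, 21575, 21987, 28253]
)

# Sites masked during inference that are present again in the final ARG
only_in_base = np.setdiff1d(missing_base, missing_final)

# Sites absent from the final ARG but present in the base ARG
only_in_final = np.setdiff1d(missing_final, missing_base)
```

Here `only_in_base` contains 21304, 21305, 21987 and 28253, and `only_in_final` is empty, so every problematic site missing from the final ARG was already masked during inference.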