# Get a set of fixed-date samples to seed VOCs

Getting saltational variants started in the presence of time travellers is a significant challenge. One way to obtain sequences that we are confident are important to these outbreaks, and that have reasonably accurate dates, is to look at the Pango designation data.

We then join these designations to the ENA file report, which maps sample aliases to run accessions, to work out which of the designated samples are in the Viridian dataset.

See https://github.com/jeromekelleher/sc2ts-paper/issues/268
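
The join chain described above can be sketched on toy data. This is only an illustration: every value below is made up, and the three small frames stand in for the Pango designations, the ENA file report, and the Viridian run metadata that are loaded further down.

```python
import pandas as pd

# Toy stand-ins for the three tables; all names and accessions are invented.
pango = pd.DataFrame(
    {
        "taxon": ["England/AAA-1/2021", "England/BBB-2/2021"],
        "lineage": ["B.1.617.2", "BA.1"],
    }
)
ena = pd.DataFrame(
    {
        "run_accession": ["ERR0001", "ERR0002"],
        "sample_alias": ["COG-UK/AAA-1", "COG-UK/CCC-3"],
    }
)
viridian = pd.DataFrame(
    {"Run": ["ERR0001", "ERR0009"], "Date_tree": ["2021-03-01", "2021-04-01"]}
).set_index("Run")

# Both naming schemes carry the sample identifier as the second "/"-separated field.
pango["sample_name"] = pango["taxon"].str.split("/").str[1]
ena["sample_name"] = ena["sample_alias"].str.split("/").str[1]

# Step 1: designation -> run accession, matched on the shared sample name.
pango_ena = pango.set_index("sample_name").join(
    ena.set_index("sample_name"), how="inner"
)

# Step 2: run accession -> Viridian metadata, matched on the run identifier.
toy_joined = viridian.join(pango_ena.set_index("run_accession"), how="inner")
print(toy_joined)
```

Only the sample that appears in all three toy tables survives both inner joins, which is the behaviour the real cells below rely on.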


#### Download files

In [3]:
%%bash
wget --quiet https://raw.githubusercontent.com/cov-lineages/pango-designation/16205e716c6a68ff1c3d0f26f0c77478682368ac/lineages.csv


In [4]:
%%bash
curl -s -X 'GET' \
  'https://www.ebi.ac.uk/ena/portal/api/filereport?result=read_run&accession=PRJEB37886&fields=sample_accession%2Csample_alias&limit=0&format=tsv&download=true' \
  -H 'accept: */*' > filereport_read_run_PRJEB37886_tsv.txt


In [5]:
%%bash
wget --quiet --content-disposition https://figshare.com/ndownloader/files/49694808


In [6]:
%%bash
wget --quiet --content-disposition https://figshare.com/ndownloader/files/49784541


#### Parse files


In [None]:
import pandas as pd

# lineage, sample name
pango = pd.read_csv("lineages.csv", sep=",")
pango["sample_name"] = [s.split("/")[1] for s in pango["taxon"]]
pango

In [2]:
pango = pango.set_index("sample_name")

In [None]:
# run accession, sample name
ena = pd.read_csv("filereport_read_run_PRJEB37886_tsv.txt", sep="\t")
ena["sample_name"] = [s.split("/")[1] for s in ena["sample_alias"]]
ena

In [4]:
ena = ena.set_index("sample_name")

In [None]:
pango_ena = pango.join(ena, how="inner")
pango_ena

In [6]:
del pango, ena

In [None]:
# Run (strain)
viridian = pd.read_csv("run_metadata.v05.tsv.gz", sep="\t").set_index("Run")
viridian

In [None]:
pango_ena = pango_ena.set_index("run_accession")
pango_ena

In [9]:
joined = viridian.join(pango_ena, how="inner")
# Keep viridian, as it's needed later
del pango_ena

In [None]:
joined

These should all be COG-UK samples now. Check the country as a sanity check.

In [None]:
joined.Country.unique()

In [12]:
# Subset down to the columns used here, then drop rows dated 2020-12-31
# and rows whose date is not a full-precision YYYY-MM-DD string
joined = joined[["Date_tree", "lineage", "Viridian_pangolin_1.29"]]
joined = joined[(joined["Date_tree"] != "2020-12-31") & (joined["Date_tree"].str.len() == 10)]
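
As a minimal sketch of what the date filter in the cell above keeps, here is the same boolean mask applied to a few made-up date strings. It assumes, as the cell does, that `Date_tree` holds strings, so a full-precision date is exactly 10 characters long.

```python
import pandas as pd

# Made-up examples of the date strings that can occur in Date_tree.
dates = pd.Series(["2020-12-31", "2021-03", "2021-03-14", "2021"], name="Date_tree")

# Keep only full YYYY-MM-DD strings (10 characters) other than 2020-12-31.
keep = (dates != "2020-12-31") & (dates.str.len() == 10)
filtered_dates = dates[keep]
print(filtered_dates.tolist())
```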

#### Search for seed samples among COG-UK samples

Use this joined dataframe now to extract some early sequences for each lineage of interest.

In [13]:
def extract_lineage(lineage, max_rows=10):
    df = joined[joined.lineage == lineage].sort_values("Date_tree")
    print("Got", df.shape[0], "runs")
    return df.head(max_rows)

In [None]:
extract_lineage("B.1.617")

In [None]:
extract_lineage("B.1.617.1")

In [None]:
extract_lineage("B.1.617.2")

In [None]:
extract_lineage("BA.1")

In [None]:
extract_lineage("BA.2")

In [None]:
extract_lineage("BA.4")
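
The per-lineage lookups above can also be collected into a single table. The helper below is a hypothetical alternative, not part of the original notebook: `earliest_per_lineage` and its parameters are names introduced here for illustration, and it is shown on made-up rows rather than on `joined`.

```python
import pandas as pd


def earliest_per_lineage(df, lineages, max_rows=10,
                         lineage_col="lineage", date_col="Date_tree"):
    """Return up to max_rows of the earliest-dated rows for each requested lineage."""
    subset = df[df[lineage_col].isin(lineages)]
    # ISO-format date strings sort chronologically when sorted as text.
    ordered = subset.sort_values(date_col)
    # groupby(...).head(n) keeps the first n rows of each group, in sorted order.
    return ordered.groupby(lineage_col, sort=False).head(max_rows)


# Made-up rows, indexed by invented run identifiers.
toy = pd.DataFrame(
    {
        "lineage": ["BA.1", "BA.1", "BA.2", "BA.1"],
        "Date_tree": ["2021-12-03", "2021-11-20", "2022-01-05", "2021-11-25"],
    },
    index=["r1", "r2", "r3", "r4"],
)
earliest = earliest_per_lineage(toy, ["BA.1", "BA.2"], max_rows=2)
print(earliest)
```

Calling it with `lineage_col="Viridian_pangolin_1.29"` would select on the Viridian assignment rather than the Pango designation.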

#### Search for seed samples among the new Viridian African samples

In [None]:
africa = pd.read_excel("suppl_tables_S2-10.xlsx", sheet_name="Table S7").set_index("Run")
# Keep only the Run index; no Table S7 columns are needed for the join
africa = africa[[]]
africa

In [None]:
joined = viridian.join(africa, how="inner")
# Filter out probable contaminants, as done in the Viridian paper.
# "These were further filtered for quality, requiring no more than 3
# “heterozygous” base calls (ie none of A,C,G,T,N) and no more than 5,000 Ns."
joined["Viridian_cons_het"] = joined["Viridian_cons_het"].astype(int)
joined = joined[joined["Viridian_cons_het"] < 4]
joined = joined[[
    "Country", "Region",
    "Date_tree", "Collection_date",
    "Viridian_cons_het",
    "Viridian_pangolin_1.29",
]]
joined
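
The cell above applies only the heterozygous-call threshold from the quoted passage. A sketch of applying both quoted thresholds is given here on made-up rows. Note that `n_count` is a hypothetical column name used purely for illustration; the name of the column holding the number of Ns in the Viridian metadata is not confirmed here.

```python
import pandas as pd

MAX_HET = 3    # "no more than 3 heterozygous base calls"
MAX_N = 5000   # "no more than 5,000 Ns"

# Made-up rows; "n_count" is an assumed column name, not taken from the metadata.
toy = pd.DataFrame(
    {"Viridian_cons_het": ["0", "4", "2"], "n_count": [120, 10, 6000]},
    index=["ERR1", "ERR2", "ERR3"],
)
toy["Viridian_cons_het"] = toy["Viridian_cons_het"].astype(int)

# A row passes only if it satisfies both thresholds.
passed = toy[(toy["Viridian_cons_het"] <= MAX_HET) & (toy["n_count"] <= MAX_N)]
print(passed)
```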

In [22]:
del viridian, africa

In [23]:
# Redefine extract_lineage from above so that it selects on the Viridian
# pangolin assignment rather than the Pango designation
def extract_lineage(lineage, max_rows=10):
    df = joined[joined["Viridian_pangolin_1.29"] == lineage].sort_values("Date_tree")
    print("Got", df.shape[0], "runs")
    return df.head(max_rows)

In [None]:
extract_lineage("BA.1")

In [None]:
extract_lineage("BA.2")

In [None]:
extract_lineage("BA.4")