# Time travellers in Viridian

In this notebook we briefly analyse the time-travelling samples in the Viridian dataset: those whose recorded collection date is much earlier than their true collection date.

This also provides a useful example of how to extract and manipulate metadata fields from the VCF Zarr encoding using Pandas.

In [1]:
import sc2ts

In [2]:
ds = sc2ts.Dataset("../data/viridian_mafft_2024-10-14_v1.vcz.zip")
ds

Dataset at ../data/viridian_mafft_2024-10-14_v1.vcz.zip with 4484157 samples, 29903 variants, and 30 metadata fields. See ds.metadata.field_descriptors() for a description of the fields.

In [3]:
ds.metadata.field_descriptors()

field,dtype,description
Artic_primer_version,object,"If known, the ARTIC primer scheme from ENA met..."
Collection_date,object,collection_date from ENA metadata
Country,object,Country from ENA metadata. If ':' was in the e...
Date_tree,object,"A consensus date, using up 3 sources of data f..."
Date_tree_order,object,This helped define the order in which the samp...
Experiment,object,experiment_accession from ENA metadata
First_created,object,first_created from the ENA metadata
Genbank_N,int16,Number of Ns in the GenBank consensus sequence...
Genbank_accession,object,This is the GenBank accession of the assembly
Genbank_other_runs,object,Any other run accessions associated with the G...


In [4]:
df = ds.metadata.as_dataframe(["Date_tree", "Viridian_scorpio_1.29"])
df

sample_id,Date_tree,Viridian_scorpio_1.29
SRR11772659,2020-01-19,.
SRR11597132,2020-01-29,.
SRR12162233,2020-01-30,.
SRR12162234,2020-01-30,.
SRR11597217,2020-02-02,.
...,...,...
ERR13177478,2023-04-20,Omicron (XBB-like)
ERR13177479,2023-04-21,Omicron (XBB-like)
ERR13177480,2023-04-24,Omicron (XBB-like)
ERR13177481,2023-04-26,Omicron (XBB-like)


Filter out incomplete dates for simplicity. Full ISO dates (YYYY-MM-DD) are 10 characters long.

In [7]:
df = df[df["Date_tree"].str.len() == 10]
df

sample_id,Date_tree,Viridian_scorpio_1.29
SRR11772659,2020-01-19,.
SRR11597132,2020-01-29,.
SRR12162233,2020-01-30,.
SRR12162234,2020-01-30,.
SRR11597217,2020-02-02,.
...,...,...
ERR13177478,2023-04-20,Omicron (XBB-like)
ERR13177479,2023-04-21,Omicron (XBB-like)
ERR13177480,2023-04-24,Omicron (XBB-like)
ERR13177481,2023-04-26,Omicron (XBB-like)
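
The length check keeps any 10-character string, whether or not it is a real calendar date. As a minimal sketch of a stricter check, assuming we also want to reject malformed values, the helper below (`is_full_iso_date` is a hypothetical name, not part of sc2ts, and the example values are made up) combines the length test with standard-library date parsing.

```python
from datetime import datetime


def is_full_iso_date(value: str) -> bool:
    """Return True only for a complete, valid YYYY-MM-DD date string."""
    if len(value) != 10:
        return False  # e.g. "2020-03" or "2021" (incomplete dates)
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False  # right length, but not a real calendar date
    return True


# Hypothetical values, chosen to exercise each case.
examples = ["2020-01-19", "2020-03", "2021", "2020-13-45"]
kept = [d for d in examples if is_full_iso_date(d)]
print(kept)
```

With Pandas, a mask of this kind could probably be built with `df["Date_tree"].map(is_full_iso_date)`, at the cost of a slower, per-value Python call.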


In [11]:
dfg = df.groupby("Viridian_scorpio_1.29").min()
dfg

Viridian_scorpio_1.29,Date_tree
.,2020-01-01
A.23.1-like,2020-04-12
A.23.1-like+E484K,2020-12-26
AV.1-like,2020-12-31
Alpha (B.1.1.7-like),2020-01-12
B.1.1.318-like,2020-12-31
B.1.1.7-like+E484K,2020-12-08
B.1.617.1-like,2020-03-17
B.1.617.3-like,2020-12-31
Beta (B.1.351-like),2020-05-27


As a very rough proxy for the existence of time travellers, we look at the minimum date observed within each Scorpio designation. Nearly all of these designations are post-2020, yet most have samples dated in 2020.

In [14]:
dfg_2020 = dfg[dfg["Date_tree"] < "2021"]
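
The per-group minimum and the comparison against `"2021"` both operate on plain strings. This works because complete ISO dates sort lexicographically in the same order as chronologically. The sketch below illustrates the idea in plain Python, without Pandas, on a few hypothetical (designation, date) pairs; it is an illustration of the logic, not the code used in this notebook.

```python
# Hypothetical (designation, date) pairs standing in for rows of the DataFrame.
rows = [
    ("Alpha (B.1.1.7-like)", "2021-02-03"),
    ("Alpha (B.1.1.7-like)", "2020-01-12"),
    ("Omicron (XBB-like)", "2023-04-20"),
    ("Omicron (XBB-like)", "2022-11-05"),
]

# Earliest date per designation, using string comparison only.
earliest = {}
for designation, date in rows:
    if designation not in earliest or date < earliest[designation]:
        earliest[designation] = date

# Designations whose earliest date falls before 2021.
# Any string starting with "2020" compares less than "2021".
before_2021 = {k: v for k, v in earliest.items() if v < "2021"}
print(before_2021)
```

This shortcut relies on every date being a full, zero-padded YYYY-MM-DD string, which is one reason incomplete dates are filtered out first.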

In [16]:
dfg_2020.shape

(32, 1)

In [17]:
dfg.shape

(43, 1)

In [18]:
dfg_2020

Viridian_scorpio_1.29,Date_tree
.,2020-01-01
A.23.1-like,2020-04-12
A.23.1-like+E484K,2020-12-26
AV.1-like,2020-12-31
Alpha (B.1.1.7-like),2020-01-12
B.1.1.318-like,2020-12-31
B.1.1.7-like+E484K,2020-12-08
B.1.617.1-like,2020-03-17
B.1.617.3-like,2020-12-31
Beta (B.1.351-like),2020-05-27
