# Make a dataframe with curated mutation counts

Potential suggestions for updating Bloom repo:
* add 3mer sequence context to `clade_founder_nts.csv`
* update list of sites to ignore with sites from https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473?
* filter based on genome length?

## Read in Python modules

In [1]:
import os
import pandas as pd
import numpy as np

## Identify sites that are conserved in all clade founders

Read in clade founder sequences. Add a column giving a site's sequence context. Then, make a list of sites where the site, its codon, and its 3mer sequence context are conserved across all founders and identical to the Wuhan-Hu-1 reference sequence. For sites in noncoding sequences, we only consider the site and its 3mer sequence context.

In [2]:
# Read in data
fitness_results_dir = '../SARS2-mut-fitness/results_gisaid_2024-04-24'
founder_df = pd.read_csv(os.path.join(fitness_results_dir, 'clade_founder_nts/clade_founder_nts.csv'))
founder_df.sort_values(['clade', 'site'], inplace=True)

# Get founder seqs
founder_seq_dict = {}
for (clade, data) in founder_df.groupby('clade'):
    founder_seq_dict[clade] = ''.join(data['nt'])

# For each row, get the site's 3mer motif in the corresponding founder sequence
def get_motif(site, clade):
    founder_seq = founder_seq_dict[clade]
    return founder_seq[site-2:site+1]
min_and_max_sites = [founder_df['site'].min(), founder_df['site'].max()]
founder_df['motif'] = founder_df.apply(
    lambda row: np.nan if row['site'] in min_and_max_sites \
        else get_motif(row['site'], row['clade']),
    axis=1
)

# Add columns giving the reference codon and motif
founder_df = founder_df.merge(
    (
        founder_df[founder_df['clade'] == '19A']
        .rename(columns={'codon' : 'ref_codon', 'motif' : 'ref_motif'})
    )[['site', 'ref_codon', 'ref_motif']], on='site', how='left'
)

# Identify sites where the codon and motif are conserved across all clade founders
# by subsetting data to entries with identical codons/motifs to reference, then
# identifying sites that still have entries for all clades
data = founder_df[
    (founder_df['codon'] == founder_df['ref_codon']) &
    (founder_df['motif'] == founder_df['ref_motif'])
]
site_counts = data['site'].value_counts()
nclades = len(founder_df['clade'].unique())
conserved_sites = site_counts[site_counts == nclades].index
founder_df['same_context_all_founders'] = founder_df['site'].isin(conserved_sites)

print('Number of sites in genome:', len(founder_df['site'].unique()))
print('Number of conserved sites:', len(conserved_sites))

Number of sites in genome: 29903
Number of conserved sites: 28702
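
The conservation check above can be illustrated on a toy example. This is a minimal sketch, not part of the analysis: the two six-nucleotide "founders" and the names `toy_seqs`, `toy_df`, `toy_motif`, and `toy_conserved` are hypothetical, and only the 3mer motif (not the codon) is compared. It shows that a site whose own nucleotide is unchanged still drops out when a neighbor differs, and that the first and last sites drop out because their motif is `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy founders: identical except for the last nucleotide
toy_seqs = {'19A': 'ACGTAC', '20A': 'ACGTAT'}
toy_df = pd.DataFrame([
    {'clade': clade, 'site': i + 1, 'nt': nt}  # sites are 1-indexed
    for clade, seq in toy_seqs.items()
    for i, nt in enumerate(seq)
])

def toy_motif(site, clade):
    """Return the 3mer centered on a 1-indexed site, or NaN at either end."""
    seq = toy_seqs[clade]
    if site == 1 or site == len(seq):
        return np.nan
    return seq[site - 2:site + 1]

toy_df['motif'] = [
    toy_motif(site, clade)
    for site, clade in zip(toy_df['site'], toy_df['clade'])
]

# Attach the reference (19A) motif to every row
ref = (
    toy_df.loc[toy_df['clade'] == '19A', ['site', 'motif']]
    .rename(columns={'motif': 'ref_motif'})
)
toy_df = toy_df.merge(ref, on='site', how='left')

# Keep rows matching the reference, then keep sites present for every clade
matches = toy_df[toy_df['motif'] == toy_df['ref_motif']]
site_counts = matches['site'].value_counts()
nclades = toy_df['clade'].nunique()
toy_conserved = sorted(site_counts[site_counts == nclades].index)
print(toy_conserved)
```

Here site 5 is an `A` in both toy founders, but its 3' neighbor differs, so it is not counted as conserved.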


## Read in and curate counts data

Read in dataframe on actual and expected counts, and add columns with metadata.

In [3]:
# Read in data
counts_df = pd.read_csv(os.path.join(
    fitness_results_dir,
    'expected_vs_actual_mut_counts/expected_vs_actual_mut_counts.csv'
))

# Add metadata
counts_df[['wt_nt', 'mut_nt']] = counts_df['nt_mutation'].str.extract(r'(\w)\d+(\w)')
counts_df['mut_type'] = counts_df['wt_nt'] + counts_df['mut_nt']

def get_mut_class(synonymous, noncoding, clade_founder_aa, mutant_aa):
    if synonymous:
        return 'synonymous'
    elif noncoding:
        return 'noncoding'
    elif mutant_aa == '*':
        return 'nonsense'
    elif mutant_aa != clade_founder_aa:
        return 'nonsynonymous'
    else:
        raise ValueError(mutant_aa, clade_founder_aa)

counts_df['mut_class'] = counts_df.apply(
    lambda row: get_mut_class(row['synonymous'], row['noncoding'], row['clade_founder_aa'], row['mutant_aa']),
    axis=1
)

# Add column indicating if clade is pre or post omicron
pre_omicron_clades = [
    '20A', '20B', '20C', '20E', '20G', '20H', '20I', '20J', '21C','21I', '21J'
]
counts_df['pre_or_post_omicron'] = counts_df['clade'].apply(
    lambda x: 'pre_omicron' if x in pre_omicron_clades else 'post_omicron'
)

# Add column indicating if a site is before site 21,555
counts_df['nt_site_before_21555'] = counts_df['nt_site'] < 21555

# Add column indicating whether RNA sites from the Lan, 2022, Nature Comm. structure
# are predicted to be paired, using code from Hensel, 2023, biorxiv
filename = '../data/lan_2022/41467_2022_28603_MOESM11_ESM.txt'
with open(filename) as f:
    lines = [line.rstrip().split() for line in f]
paired = np.array([[int(x[0]),int(x[4])] for x in lines[1:]])
paired_dict = dict(zip(paired[:,0], paired[:,1]))
def assign_ss_pred(site):
    if site not in paired_dict:
        return 'nd'
    elif paired_dict[site] == 0:
        return 'unpaired'
    else:
        return 'paired'
counts_df['ss_prediction'] = counts_df['nt_site'].apply(assign_ss_pred)
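
As a quick sanity check on the mutation-string parsing in the cell above, here is a small sketch on toy data. It swaps in a stricter pattern than the one used above (anchored, nucleotides restricted to `ACGT`, with named groups), which is an assumption about the format of `nt_mutation` and not the notebook's own code. The dataframe `toy` and the variable `parts` are hypothetical.

```python
import pandas as pd

# Hypothetical toy mutations in the same "<wt><site><mut>" format
toy = pd.DataFrame({'nt_mutation': ['A266G', 'C21575T']})

# Anchored pattern with named groups; also captures the site number
pattern = r'^(?P<wt_nt>[ACGT])(?P<nt_site>\d+)(?P<mut_nt>[ACGT])$'
parts = toy['nt_mutation'].str.extract(pattern)
parts['nt_site'] = parts['nt_site'].astype(int)

toy = pd.concat([toy, parts], axis=1)
toy['mut_type'] = toy['wt_nt'] + toy['mut_nt']
print(toy)
```

Strings that do not match the stricter pattern would yield `NaN` in the extracted columns, which makes malformed entries easier to spot (the `astype(int)` call would then need to be guarded).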

Create a dataframe with curated counts. We curate the data in the following ways:
* only analyze sites where that site, the site's codon, and the site's 5' and 3' neighboring nucleotides are conserved in all clade founders and identical to the Wuhan-Hu-1 reference sequence
* ignore sites that are annotated as being masked in any clade of the UShER tree (`masked_in_usher == True`), are annotated for exclusion (`exclude == True`), or were identified as highly homoplastic by De Maio et al. (https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473)

Then, subset the dataframe to one row for each possible mutation, including the following columns:
* `actual_count`: gives the mutation's count for `subset == all` from the above dataframe of counts
* additional columns give actual counts for subsets of the data, such as geographical subsets (England vs. USA) or phylogenetic subsets (pre- vs. post-Omicron)
* additional columns also give counts that are truncated at the 95th percentile

In [4]:
# Ignore sites that are masked or excluded in any clade of the UShER tree
sites_to_ignore = list(counts_df[
    (counts_df['masked_in_usher'] == True) |
    (counts_df['exclude'] == True)
]['nt_site'].unique())

# Homoplastic sites from De Maio et al.
sites_to_ignore += [
    187, 1059, 2094, 3037, 3130, 6990, 8022, 10323, 10741, 11074, 13408,
    14786, 19684, 20148, 21137, 24034, 24378, 25563, 26144, 26461, 26681, 28077,
    28826, 28854, 29700, 4050, 13402, 11083, 15324, 21575
]

# Aggregate counts across...
# ... all clades for "all" subset
ignore_cols = [
    'expected_count', 'actual_count', 'count_terminal', 'count_non_terminal', 'mean_log_size',
    'clade', 'pre_or_post_omicron'
]
groupby_cols = [
    col for col in counts_df.columns.values
    if col not in ignore_cols
]
curated_counts_df = counts_df[
    (counts_df['nt_site'].isin(conserved_sites)) &
    ~(counts_df['nt_site'].isin(sites_to_ignore)) &
    (counts_df['subset'] == 'all')
].groupby(groupby_cols, as_index=False).agg('sum', numeric_only=True)
assert sum(curated_counts_df['nt_mutation'].duplicated(keep=False)) == 0

# ... England or USA, and merge counts column with above dataframe
subsets = ['England', 'USA']
for subset in subsets:
    subset_data = counts_df[
        (counts_df['nt_site'].isin(conserved_sites)) &
        ~(counts_df['nt_site'].isin(sites_to_ignore)) &
        (counts_df['subset'] == subset)
    ].groupby(groupby_cols, as_index=False).agg('sum', numeric_only=True)
    assert sum(subset_data['nt_mutation'].duplicated(keep=False)) == 0
    assert len(subset_data) == len(curated_counts_df)
    curated_counts_df = curated_counts_df.merge(
        (
            subset_data
            .rename(columns={'actual_count' : f'actual_count_{subset}'})
        )[['nt_mutation', f'actual_count_{subset}']], on='nt_mutation'
    )

# ... pre- or post-omicron clades, and merge counts column with above dataframe
subsets = ['pre_omicron', 'post_omicron']
for subset in subsets:
    subset_data = counts_df[
        (counts_df['nt_site'].isin(conserved_sites)) &
        ~(counts_df['nt_site'].isin(sites_to_ignore)) &
        (counts_df['subset'] == 'all') &
        (counts_df['pre_or_post_omicron'] == subset)
    ].groupby(groupby_cols, as_index=False).agg('sum', numeric_only=True)
    assert sum(subset_data['nt_mutation'].duplicated(keep=False)) == 0
    assert len(subset_data) == len(curated_counts_df)
    curated_counts_df = curated_counts_df.merge(
        (
            subset_data
            .rename(columns={'actual_count' : f'actual_count_{subset}'})
        )[['nt_mutation', f'actual_count_{subset}']], on='nt_mutation'
    )

# Save curated counts to an output file
assert len(curated_counts_df) == len(curated_counts_df['nt_mutation'].unique())
curated_counts_df.drop(columns=['subset', 'exclude', 'masked_in_usher'], inplace=True)
outfile = '../results/curated_mut_counts.csv'
curated_counts_df.to_csv(outfile, index=False)

curated_counts_df.head()

Unnamed: 0,nt_site,nt_mutation,clade_founder_nt,gene,clade_founder_codon,clade_founder_aa,mutant_codon,mutant_aa,aa_mutation,synonymous,...,ss_prediction,expected_count,actual_count,count_terminal,count_non_terminal,mean_log_size,actual_count_England,actual_count_USA,actual_count_pre_omicron,actual_count_post_omicron
0,266,A266C,A,ORF1a;ORF1ab,ATG;ATG,M;M,CTG;CTG,L;L,M1L;M1L,False,...,paired,22.594416,0,0,0,0.0,0,0,0,0
1,266,A266G,A,ORF1a;ORF1ab,ATG;ATG,M;M,GTG;GTG,V;V,M1V;M1V,False,...,paired,177.06636,2,2,0,0.0,0,1,0,2
2,266,A266T,A,ORF1a;ORF1ab,ATG;ATG,M;M,TTG;TTG,L;L,M1L;M1L,False,...,paired,51.76947,2,1,1,0.34657,0,1,0,2
3,267,T267A,T,ORF1a;ORF1ab,ATG;ATG,M;M,AAG;AAG,K;K,M1K;M1K,False,...,paired,36.81665,2,2,0,0.0,1,1,0,2
4,267,T267C,T,ORF1a;ORF1ab,ATG;ATG,M;M,ACG;ACG,T;T,M1T;M1T,False,...,paired,165.16175,1,1,0,0.0,0,0,0,1


Summary statistics of mutations in the dataset

In [5]:
print('Number of unique muts:')
print('In the full dataset:', len(counts_df['nt_mutation'].unique()))
print('In the curated dataset:', len(curated_counts_df['nt_mutation'].unique()))

Number of unique muts:
In the full dataset: 90621
In the curated dataset: 84138


In [6]:
print('Number of curated mutations per category:')
curated_counts_df['mut_class'].value_counts()

Number of curated mutations per category:


mut_class
nonsynonymous    63518
synonymous       18042
nonsense          2089
noncoding          489
Name: count, dtype: int64

## Use the dataframe with counts at all sites to make a list of gene boundaries

In [7]:
# Get gene boundaries
gene_boundaries_df = counts_df.groupby('gene', as_index=False).agg(
    min_site = ('nt_site', 'min'),
    max_site = ('nt_site', 'max'),
)
data = pd.DataFrame({'gene': ['ORF1b'], 'min_site': [13480], 'max_site': [21552]})
gene_boundaries_df = pd.concat([gene_boundaries_df, data]).sort_values('min_site')
gene_boundaries_df = gene_boundaries_df[
    ~(gene_boundaries_df['gene'].str.contains(';')) &
    (gene_boundaries_df['gene'] != 'noncoding')
].reset_index(drop=True)

# Save list to file
outfile = '../results/gene_boundaries.csv'
gene_boundaries_df.to_csv(outfile, index=False)

gene_boundaries_df

Unnamed: 0,gene,min_site,max_site
0,ORF1b,13480,21552
1,ORF1ab,13481,21552
2,S,21563,25381
3,ORF3a,25393,26217
4,E,26245,26469
5,M,26523,27188
6,ORF6,27202,27384
7,ORF7a,27394,27755
8,ORF7b,27757,27884
9,ORF8,27894,28256
