# Utility notebook to examine variant scores for specific mutations

This notebook is not part of the pipeline, but can be used if you want to inspect information abut variants containing any specific mutation.

Import Python modules:

In [49]:
import pandas as pd

import yaml

with open('config.yaml') as f:
    config = yaml.safe_load(f)

escape_fracs = pd.read_csv(config['escape_fracs'])

escape_scores = pd.read_csv(config['escape_scores'])

Specify the mutation and antibody / serum of interest using RBD numbering:

In [74]:
mutation = 'N343G'
serum = 'S309_421'

Parse site and mutant out of mutation, and also get mutation in sequential 1, 2, ... RBD numbering:

In [71]:
site = int(mutation[1: -1])
mutant_aa = mutation[-1]
wt_aa = mutation[0]

sequential_site = site - 330
sequential_mutation = f"{wt_aa}{sequential_site}{mutant_aa}"

Overall estimated escape fraction for mutation in each library and average (will be empty if no escape estimated):

In [72]:
(escape_fracs
 .query('protein_site == @site')
 .query('mutation == @mutant_aa')
 .query('selection == @serum')
 )

Unnamed: 0,selection,library,condition,site,label_site,wildtype,mutation,protein_chain,protein_site,mut_escape_frac_epistasis_model,mut_escape_frac_single_mut,site_total_escape_frac_epistasis_model,site_total_escape_frac_single_mut,site_avg_escape_frac_epistasis_model,site_avg_escape_frac_single_mut,nlibs


Now here are the escape scores for all variants containing the mutation.
Note that the mutations for the escape scores are in **sequential** (not RBD) numbering, so offset by 330.
We show the substitutons in the variant, its escape score, its binding and expression in the DMS, and whether it passes the filters for DMS binding / expression based both on both the variant DMS value and the mutations in the variant.
If you want to only show variants that pass these filters, uncomment the query lines to query for them:

In [73]:
(escape_scores
 .query('name == @serum')
 .assign(aa_substitutions=lambda x: x['aa_substitutions'].fillna(''))
 .query('aa_substitutions.str.contains(@sequential_mutation)')
 .query('pass_pre_count_filter')
# .query('muts_pass_bind_filter')
# .query('muts_pass_expr_filter')
# .query('variant_pass_bind_filter')
# .query('variant_pass_expr_filter')
 [['library', 'aa_substitutions', 'score', 'variant_expr', 'variant_bind',
   'muts_pass_expr_filter', 'variant_pass_expr_filter', 'muts_pass_bind_filter',
   'variant_pass_bind_filter']]
 .sort_values('variant_expr', ascending=False)
 .sort_values('library')
 )

Unnamed: 0,library,aa_substitutions,score,variant_expr,variant_bind,muts_pass_expr_filter,variant_pass_expr_filter,muts_pass_bind_filter,variant_pass_bind_filter
14366012,lib1,N13F,0.004298,-0.93,-0.43,False,True,True,True
14370096,lib1,N13F N58R K128S G146T,0.02065,-2.85,-1.78,False,False,True,True
14371230,lib1,N13F T46F L131V L187D,0.04681,-2.67,-1.43,False,False,True,True
14351365,lib1,N13F K128V P169S,0.02086,-2.64,-1.46,False,False,True,True
14368750,lib1,C6N N13F,0.3067,-2.36,-0.92,False,False,True,True
14361112,lib1,C6N N13F,0.316,-2.33,-0.82,False,False,True,True
14353637,lib1,N13F E154C,0.000715,-2.08,-0.94,False,False,True,True
14354047,lib1,N13F I72V A145D,0.00073,-1.68,-2.46,False,False,True,False
14354779,lib1,N13F,0.000759,-1.66,-0.29,False,False,True,True
14340871,lib1,N13F,0.000313,-1.53,-0.42,False,False,True,True


Finally, look at the DMS measurements for the mutation at an individual library (as well as average) level:

In [40]:
(pd.read_csv(config['mut_bind_expr'])
 .query('mutation == @mutation')
 )

Unnamed: 0,site_RBD,site_SARS2,wildtype,mutant,mutation,mutation_RBD,bind_lib1,bind_lib2,bind_avg,expr_lib1,expr_lib2,expr_avg
254,13,343,N,D,N343D,N13D,-0.37,-0.37,-0.37,-1.71,-1.63,-1.67
