# Utility notebook to examine variant scores for specific mutations

This notebook is not part of the pipeline, but you can use it to inspect information about variants containing any specific mutation.

Import Python modules, read the configuration, and load the escape fractions and escape scores:

In [1]:
import pandas as pd

import yaml

with open('config.yaml') as f:
    config = yaml.safe_load(f)

escape_fracs = pd.read_csv(config['escape_fracs'])

escape_scores = pd.read_csv(config['escape_scores'])

Specify the mutation and antibody / serum of interest using RBD numbering:

In [6]:
mutation = 'G404F'
serum = 'S2X259_59'

Parse the site, wildtype, and mutant amino acid out of the mutation, and also get the mutation in sequential (1, 2, ...) RBD numbering by subtracting 330 from the site:

In [7]:
site = int(mutation[1:-1])
mutant_aa = mutation[-1]
wt_aa = mutation[0]

sequential_site = site - 330
sequential_mutation = f"{wt_aa}{sequential_site}{mutant_aa}"
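The parsing above assumes the mutation string is well formed. As a minimal sketch of a stricter version, the hypothetical helper below (`parse_rbd_mutation` and `RBD_OFFSET` are names introduced here, not part of the notebook) validates the string with a regular expression before applying the same offset of 330:

```python
import re

# Hypothetical constant; 330 is the offset used in the cell above.
RBD_OFFSET = 330


def parse_rbd_mutation(mutation, offset=RBD_OFFSET):
    """Split a mutation such as 'G404F' into its parts and renumber it.

    Returns (wildtype, site, mutant, mutation in sequential numbering).
    """
    # One letter (or '*'), one or more digits, one letter (or '*').
    match = re.fullmatch(r"([A-Z*])(\d+)([A-Z*])", mutation)
    if match is None:
        raise ValueError(f"cannot parse mutation: {mutation!r}")
    wt_aa = match.group(1)
    site = int(match.group(2))
    mutant_aa = match.group(3)
    sequential_site = site - offset
    if sequential_site < 1:
        raise ValueError(f"site {site} is below the first sequential position")
    return wt_aa, site, mutant_aa, f"{wt_aa}{sequential_site}{mutant_aa}"


print(parse_rbd_mutation("G404F"))  # ('G', 404, 'F', 'G74F')
```

Raising an error on a malformed string makes a typo in `mutation` visible immediately, rather than producing an empty query result later.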

Overall estimated escape fraction for the mutation in each library and on average (the result is empty if no escape fraction was estimated):

In [8]:
(escape_fracs
 .query('protein_site == @site')
 .query('mutation == @mutant_aa')
 .query('selection == @serum')
 )

Unnamed: 0,selection,library,condition,site,label_site,wildtype,mutation,protein_chain,protein_site,mut_escape_frac_epistasis_model,mut_escape_frac_single_mut,site_total_escape_frac_epistasis_model,site_total_escape_frac_single_mut,site_avg_escape_frac_epistasis_model,site_avg_escape_frac_single_mut,nlibs
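An empty result like the one above is easy to misread as a failed query. The sketch below shows one way to make that case explicit. It uses a small made-up table in place of `escape_fracs` (the values are illustrative only), and `lookup_escape_frac` is a hypothetical helper name:

```python
import pandas as pd

# Toy stand-in for escape_fracs with made-up values; the real table has more columns.
toy_escape_fracs = pd.DataFrame({
    "selection": ["S2X259_59", "S2X259_59"],
    "protein_site": [404, 405],
    "mutation": ["A", "F"],
    "mut_escape_frac_single_mut": [0.01, 0.02],
})


def lookup_escape_frac(df, site, mutant_aa, serum):
    """Return the rows for one mutation and selection (possibly none)."""
    mask = (
        (df["protein_site"] == site)
        & (df["mutation"] == mutant_aa)
        & (df["selection"] == serum)
    )
    return df[mask]


hits = lookup_escape_frac(toy_escape_fracs, 404, "F", "S2X259_59")
if hits.empty:
    # No row matches: no escape fraction was estimated for this combination.
    print("No escape fraction estimated for this mutation and selection.")
```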


Now here are the escape scores for all variants containing the mutation.
Note that the mutations for the escape scores are in **sequential** (not RBD) numbering, so each site is offset by 330 from the RBD numbering (for example, G404F becomes G74F).
We show the substitutions in the variant, its escape score, its binding and expression in the DMS, and whether it passes the filters for DMS binding / expression based on both the variant DMS value and the mutations in the variant.
If you want to show only variants that pass these filters, uncomment the corresponding query lines:

In [9]:
(escape_scores
 .query('name == @serum')
 .assign(aa_substitutions=lambda x: x['aa_substitutions'].fillna(''))
 .query('aa_substitutions.str.contains(@sequential_mutation)')
 .query('pass_pre_count_filter')
# .query('muts_pass_bind_filter')
# .query('muts_pass_expr_filter')
# .query('variant_pass_bind_filter')
# .query('variant_pass_expr_filter')
 [['library', 'aa_substitutions', 'score', 'variant_expr', 'variant_bind',
   'muts_pass_expr_filter', 'variant_pass_expr_filter', 'muts_pass_bind_filter',
   'variant_pass_bind_filter']]
 .sort_values(['library', 'variant_expr'], ascending=[True, False])
 )

Unnamed: 0,library,aa_substitutions,score,variant_expr,variant_bind,muts_pass_expr_filter,variant_pass_expr_filter,muts_pass_bind_filter,variant_pass_bind_filter
4700077,lib1,G74F N110T F160G,0.4186,-2.57,-2.88,False,False,True,False
4792167,lib2,G74F,0.804,-1.63,-1.81,False,False,True,True
4809441,lib2,G74F L131I,0.3747,-1.67,-2.08,False,False,True,True
4801592,lib2,F44I G74F,0.6062,-1.99,-2.39,False,False,True,False
4804785,lib2,A14S G74F,0.3626,-2.36,-2.15,False,False,True,True
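As a sketch of how the filter flags combine, the example below rebuilds a subset of the columns from the rows shown above and counts how many variants pass all four filters. It matches the mutation as a whole space-separated token rather than with `str.contains`, which treats its pattern as a regular expression by default; that would matter if a mutation contained a character such as `*` (an assumption about how stop codons might be written, not something shown in this output). The names `has_mutation` and `filter_cols` are introduced here for illustration.

```python
import pandas as pd

# Rows copied from the output above (subset of columns).
variants = pd.DataFrame({
    "library": ["lib1", "lib2", "lib2", "lib2", "lib2"],
    "aa_substitutions": [
        "G74F N110T F160G", "G74F", "G74F L131I", "F44I G74F", "A14S G74F",
    ],
    "score": [0.4186, 0.804, 0.3747, 0.6062, 0.3626],
    "muts_pass_expr_filter": [False, False, False, False, False],
    "variant_pass_expr_filter": [False, False, False, False, False],
    "muts_pass_bind_filter": [True, True, True, True, True],
    "variant_pass_bind_filter": [False, True, True, False, True],
})

filter_cols = [
    "muts_pass_expr_filter",
    "variant_pass_expr_filter",
    "muts_pass_bind_filter",
    "variant_pass_bind_filter",
]


def has_mutation(substitutions, mutation):
    """True if `mutation` is one of the space-separated substitutions."""
    return mutation in substitutions.split()


# Keep variants that carry the mutation as an exact token.
with_mut = variants[variants["aa_substitutions"].apply(lambda s: has_mutation(s, "G74F"))]

# Keep only variants for which every filter flag is True.
passing = with_mut[with_mut[filter_cols].all(axis=1)]

print(len(with_mut), len(passing))  # 5 0
```

For these rows, every variant fails both expression filters, so none passes all four.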


Finally, look at the DMS measurements for the mutation in each individual library, as well as the average across libraries:

In [10]:
(pd.read_csv(config['mut_bind_expr'])
 .query('mutation == @mutation')
 )

Unnamed: 0,site_RBD,site_SARS2,wildtype,mutant,mutation,mutation_RBD,bind_lib1,bind_lib2,bind_avg,expr_lib1,expr_lib2,expr_avg
1537,74,404,G,F,G404F,G74F,-2.14,-1.81,-1.97,-1.96,-1.84,-1.9
