# Utility notebook to examine variant scores for specific mutations

This notebook is not part of the pipeline, but you can use it to inspect information about variants containing a specific mutation.

Import Python modules, then read the configuration and the escape data:

In [1]:
import pandas as pd

import yaml

with open('config.yaml') as f:
    config = yaml.safe_load(f)

escape_fracs = pd.read_csv(config['escape_fracs'])

escape_scores = pd.read_csv(config['escape_scores'])

Specify the mutation and antibody / serum of interest using RBD numbering:

In [2]:
mutation = 'N439K'
serum = 'REGN10987_400'

Parse the site and mutant amino acid out of the mutation, and also get the mutation in sequential (1, 2, ...) RBD numbering:

In [3]:
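wt_aa = mutation[0]         # wildtype amino acid, e.g. 'N'
site = int(mutation[1:-1])  # site in RBD numbering, e.g. 439
mutant_aa = mutation[-1]    # mutant amino acid, e.g. 'K'

# sequential numbering is the RBD site minus 330, so RBD site 331 becomes 1
sequential_site = site - 330
sequential_mutation = f"{wt_aa}{sequential_site}{mutant_aa}"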

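The slicing above assumes the mutation string is well formed. As a minimal sketch (the helper name `parse_mutation` and the constant `RBD_OFFSET` are hypothetical, not part of the pipeline), the same parsing can be wrapped in a function that validates its input with a regular expression and fails loudly on something like `'439K'`. Calling `parse_mutation('N439K')` returns the wildtype, site, mutant, and sequential mutation in one tuple.

```python
import re

# Offset between RBD and sequential numbering, taken from `site - 330` above.
RBD_OFFSET = 330


def parse_mutation(mutation, offset=RBD_OFFSET):
    """Split a mutation such as 'N439K' into its parts (hypothetical helper).

    Returns (wt_aa, site, mutant_aa, sequential_mutation).
    """
    # Expect one letter, one or more digits, one letter, and nothing else.
    match = re.fullmatch(r'([A-Z])(\d+)([A-Z])', mutation)
    if match is None:
        raise ValueError(f"cannot parse mutation {mutation!r}")
    wt_aa, site_str, mutant_aa = match.groups()
    site = int(site_str)
    sequential_site = site - offset
    if sequential_site < 1:
        raise ValueError(f"site {site} is before the start of the numbered region")
    return wt_aa, site, mutant_aa, f"{wt_aa}{sequential_site}{mutant_aa}"


print(parse_mutation('N439K'))  # → ('N', 439, 'K', 'N109K')
```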
Overall estimated escape fraction for the mutation in each library and the average (the result is empty if no escape was estimated):

In [4]:
(escape_fracs
 .query('protein_site == @site')
 .query('mutation == @mutant_aa')
 .query('selection == @serum')
 )

| | selection | library | condition | site | label_site | wildtype | mutation | protein_chain | protein_site | mut_escape_frac_epistasis_model | mut_escape_frac_single_mut | site_total_escape_frac_epistasis_model | site_total_escape_frac_single_mut | site_avg_escape_frac_epistasis_model | site_avg_escape_frac_single_mut | nlibs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 116087 | REGN10987_400 | average | REGN10987_400 | 109 | 439 | N | K | E | 439 | 0.06063 | 0.018398 | 2.204 | 2.449 | 0.2204 | 0.2449 | 1 |
| 238910 | REGN10987_400 | lib1 | REGN10987_400_lib1 | 109 | 439 | N | K | E | 439 | 0.06063 | 0.018398 | 1.533 | 1.09 | 0.1533 | 0.1363 | 1 |

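The three chained `.query` calls above can also be written as a single boolean mask, which makes it easy to check explicitly for the empty case. The following is only a sketch on made-up toy data: the helper name `escape_for_mutation`, the serum labels, and all values in `toy` are assumptions for illustration, and only the column names are taken from the output above.

```python
import pandas as pd

# Toy stand-in for `escape_fracs`; every value here is invented.
toy = pd.DataFrame({
    'selection': ['serumA', 'serumA', 'serumB'],
    'library': ['average', 'lib1', 'average'],
    'protein_site': [439, 439, 440],
    'mutation': ['K', 'K', 'A'],
    'mut_escape_frac_single_mut': [0.02, 0.02, 0.5],
})


def escape_for_mutation(df, site, mutant_aa, serum):
    """Return rows for one mutation and one selection (hypothetical helper)."""
    mask = (
        (df['protein_site'] == site)
        & (df['mutation'] == mutant_aa)
        & (df['selection'] == serum)
    )
    return df[mask]


hit = escape_for_mutation(toy, 439, 'K', 'serumA')
miss = escape_for_mutation(toy, 439, 'K', 'serumB')

print(len(hit))    # → 2
print(miss.empty)  # → True
```

Testing `.empty` on the result distinguishes "no escape estimated" from a genuine match before any further processing.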

Now here are the escape scores for all variants containing the mutation.
Note that the mutations for the escape scores are in **sequential** (not RBD) numbering, so they are offset by 330.
We show the substitutions in the variant, its escape score, its binding and expression in the DMS, and whether it passes the filters for DMS binding / expression based on both the variant DMS value and the mutations in the variant.
The filter query lines below restrict the output to variants that pass these filters; comment them out if you want to show all variants:

In [5]:
(escape_scores
 .query('name == @serum')
 .assign(aa_substitutions=lambda x: x['aa_substitutions'].fillna(''))
 .query('aa_substitutions.str.contains(@sequential_mutation)')
 .query('pass_pre_count_filter')
 .query('muts_pass_bind_filter')
 .query('muts_pass_expr_filter')
 .query('variant_pass_bind_filter')
 .query('variant_pass_expr_filter')
 [['library', 'aa_substitutions', 'score', 'variant_expr', 'variant_bind',
   'muts_pass_expr_filter', 'variant_pass_expr_filter', 'muts_pass_bind_filter',
   'variant_pass_bind_filter']]
 # sort by library, then by expression (highest first) within each library
 .sort_values(['library', 'variant_expr'], ascending=[True, False])
 )

| | library | aa_substitutions | score | variant_expr | variant_bind | muts_pass_expr_filter | variant_pass_expr_filter | muts_pass_bind_filter | variant_pass_bind_filter |
|---|---|---|---|---|---|---|---|---|---|
| 11416809 | lib1 | T3S N109K | 0.04575 | 0.01 | 0.1 | True | True | True | True |
| 11417939 | lib1 | F62V N109K F156T | 0.002581 | -0.02 | -0.52 | True | True | True | True |
| 11416607 | lib1 | N109K | 0.03481 | -0.2 | 0.13 | True | True | True | True |
| 11441528 | lib1 | N109K | 0.001985 | -0.32 | 0.09 | True | True | True | True |
| 11424633 | lib1 | D59R N64T N109K | 0.2587 | -0.61 | -0.04 | True | True | True | True |
| 11442311 | lib1 | A42P D98E N109K | 0.04817 | -0.61 | -0.16 | True | True | True | True |
| 11535761 | lib2 | N58S N109K | 0.1484 | -0.4 | -0.13 | True | True | True | True |

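Because the substitutions in the output above are in sequential numbering, it can help to translate them back to RBD numbering when reading them. A minimal sketch, assuming the same offset of 330 used earlier in this notebook; the helper name `to_rbd_numbering` is hypothetical and not part of the pipeline:

```python
import re

RBD_OFFSET = 330  # same offset the notebook subtracts from the RBD site


def to_rbd_numbering(aa_substitutions, offset=RBD_OFFSET):
    """Shift every substitution in a space-delimited string to RBD numbering."""
    def shift(match):
        wt_aa, site, mutant_aa = match.groups()
        return f"{wt_aa}{int(site) + offset}{mutant_aa}"

    # Each substitution is one letter, digits, one letter, e.g. 'N109K'.
    return re.sub(r'\b([A-Z])(\d+)([A-Z])\b', shift, aa_substitutions)


print(to_rbd_numbering('T3S N109K'))         # → T333S N439K
print(to_rbd_numbering('F62V N109K F156T'))  # → F392V N439K F486T
```

An empty string (a variant with no substitutions) passes through unchanged, so the helper could be applied to the `aa_substitutions` column after the `fillna('')` step.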

Finally, look at the DMS measurements for the mutation in each individual library, as well as the average:

In [6]:
(pd.read_csv(config['mut_bind_expr'])
 .query('mutation == @mutation')
 )

| | site_RBD | site_SARS2 | wildtype | mutant | mutation | mutation_RBD | bind_lib1 | bind_lib2 | bind_avg | expr_lib1 | expr_lib2 | expr_avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2276 | 109 | 439 | N | K | N439K | N109K | 0.11 | -0.02 | 0.04 | -0.33 | -0.36 | -0.35 |
