# Build combined 'codon_variant_table.csv'

Need to take the codon variant tables from the "epistatic shifts" _Science_ paper and from the unpublished Omicron (BA.1 and BA.2) projects, then extract and combine the barcodes that correspond to these variants:
* Wuhan-Hu-1
* BA.1

Also need to copy the Wuhan-Hu-1 "follow-up pools" from the _Science_ paper (if I can find where those are) into library 2.

The problem is that it is not clear which barcodes came from the follow-up pool and which did not.
The follow-up pool was sequenced separately, so that sequencing data might help identify them.

Ideally, I would find the barcodes that correspond to the follow-up pool, query the codon variant table for those barcodes, and duplicate those rows with the library set to library 2.

I'm not going to worry about this right now, and will just focus on what we already have.
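For when this is revisited, the duplication step described above could look roughly like the sketch below. It is not what this notebook does. The helper name `duplicate_into_library`, the toy table, and the set `followup_barcodes` are all hypothetical, and it assumes the follow-up barcodes currently sit in `lib1`.

```python
import pandas as pd


def duplicate_into_library(variants, barcodes, source_lib="lib1", new_lib="lib2"):
    """Copy rows of `variants` whose barcode is in `barcodes`, relabeled as `new_lib`.

    Hypothetical helper: `source_lib` is an assumption about where the
    follow-up pool barcodes are currently recorded.
    """
    mask = (variants["library"] == source_lib) & variants["barcode"].isin(barcodes)
    copies = variants.loc[mask].assign(library=new_lib)
    return pd.concat([variants, copies], ignore_index=True)


# Toy table with the same column names as the codon variant table.
toy = pd.DataFrame({
    "target": ["Wuhan-Hu-1"] * 3,
    "library": ["lib1"] * 3,
    "barcode": ["AAAAAAAAAAAGGAGA", "AAAAAAAAAAATTTAA", "AAAAAAAAAACGCGTA"],
    "aa_substitutions": ["G166M", "", "E154T"],
})

# Hypothetical set of barcodes recovered from the separately sequenced follow-up pool.
followup_barcodes = {"AAAAAAAAAAAGGAGA", "AAAAAAAAAACGCGTA"}

combined = duplicate_into_library(toy, followup_barcodes)
print(combined)
```

The original `lib1` rows are kept as they are, and the copies are appended as `lib2` rows.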

In [1]:
import os

import numpy
import pandas as pd
from IPython.display import display, HTML

import yaml

Read config file.

In [2]:
with open('config.yaml') as f:
    config = yaml.safe_load(f)
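
The cells below read seven keys from `config`. A minimal sketch of a check for those keys follows. The key names are taken from this notebook, but every path in `example_config` is a made-up placeholder, and `missing_config_keys` is a hypothetical helper rather than part of the pipeline.

```python
# Keys that later cells in this notebook look up in `config`.
REQUIRED_KEYS = [
    "variants_dir",
    "prior_DMS_data_dir",
    "input_codon_variant_tables",
    "codon_variant_tables",
    "codon_variant_table",
    "input_mut_bind_expr",
    "mut_bind_expr",
]


def missing_config_keys(config):
    """Return the required keys that are absent from `config` (hypothetical helper)."""
    return [key for key in REQUIRED_KEYS if key not in config]


# Placeholder values only: these paths are illustrative, not the real ones.
example_config = {
    "variants_dir": "results/variants",
    "prior_DMS_data_dir": "results/prior_DMS_data",
    "input_codon_variant_tables": {
        "Wuhan-Hu-1": "data/placeholder_Wuhan_Hu_1_variants.csv",
        "BA.1": "data/placeholder_BA1_variants.csv",
    },
    "codon_variant_tables": {
        "Wuhan-Hu-1": "results/placeholder_Wuhan_Hu_1_variants.csv",
        "BA.1": "results/placeholder_BA1_variants.csv",
    },
    "codon_variant_table": "results/placeholder_codon_variant_table.csv",
    "input_mut_bind_expr": {
        "Wuhan-Hu-1": "data/placeholder_Wuhan_Hu_1_scores.csv",
        "BA.1": "data/placeholder_BA1_scores.csv",
    },
    "mut_bind_expr": "results/placeholder_mut_bind_expr.csv",
}

print(missing_config_keys(example_config))
```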

Define input and output directories.

In [3]:
datadir = 'data'
resultsdir = config['variants_dir']

os.makedirs(resultsdir, exist_ok=True)
os.makedirs(config['prior_DMS_data_dir'], exist_ok=True)

Read in the new input codon variant tables for each target, standardize the target and library names, and write both the per-target and the combined tables.

In [4]:
# hard-coded dictionary to replace values in codon variant table:

names_df = {'Wuhan_Hu_1':'Wuhan-Hu-1',
            'BA1':'BA.1',
            'pool1':'lib1',
            'pool2':'lib2',
           }
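
As a quick illustration of how this dictionary behaves when passed to `DataFrame.replace`: a flat dictionary replaces cells whose whole value equals a key, in any column, and leaves partial matches alone. The toy table below is made up for this sketch, and the dictionary is copied into a local name `names` so the example stands alone.

```python
import pandas as pd

# Same mapping as the hard-coded dictionary, copied here so the example is self-contained.
names = {"Wuhan_Hu_1": "Wuhan-Hu-1", "BA1": "BA.1", "pool1": "lib1", "pool2": "lib2"}

# Made-up rows; the third target only contains "BA1" as a substring.
toy = pd.DataFrame({
    "target": ["Wuhan_Hu_1", "BA1", "Omicron_BA1"],
    "library": ["pool1", "pool2", "pool1"],
    "barcode": ["AAAAAAAAAAAGGAGA", "AAAAAAAAAAATTTAA", "AAAAAAAAAACGCGTA"],
})

renamed = toy.replace(names)
print(renamed)
```

Here `Omicron_BA1` is left unchanged because it is not an exact match for any key.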

In [5]:
# read each target's table, standardize names, and write a per-target copy
tables = []
for target, file in config['input_codon_variant_tables'].items():
    print(f'Reading codon variant table for {target}.')
    df = pd.read_csv(file).replace(names_df)
    df.to_csv(config['codon_variant_tables'][target], index=False)
    tables.append(df)

# concatenate once instead of growing a DataFrame inside the loop
codon_variant_table = pd.concat(tables)

codon_variant_table.to_csv(config['codon_variant_table'], index=False)
display(HTML(codon_variant_table.head().to_html(index=False)))

Reading codon variant table for Wuhan-Hu-1.
Reading codon variant table for BA.1.


target,library,barcode,variant_call_support,codon_substitutions,aa_substitutions,n_codon_substitutions,n_aa_substitutions
Wuhan-Hu-1,lib1,AAAAAAAAAAAGGAGA,4,GGT166ATG,G166M,1,1
Wuhan-Hu-1,lib1,AAAAAAAAAAATTTAA,4,,,0,0
Wuhan-Hu-1,lib1,AAAAAAAAAACGCGTA,3,GAA154ACT,E154T,1,1
Wuhan-Hu-1,lib1,AAAAAAAAAACTCCAA,2,TTT156ATG,F156M,1,1
Wuhan-Hu-1,lib1,AAAAAAAAACCGATTA,2,CAG84GAA,Q84E,1,1
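
One sanity check that might be worth running on the combined table is whether any barcode appears more than once within the same target and library. The sketch below assumes barcodes should be unique within each target and library, which this notebook does not state. `duplicated_barcodes` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd


def duplicated_barcodes(variants):
    """Return every row whose (target, library, barcode) combination occurs more than once."""
    keys = ["target", "library", "barcode"]
    return variants[variants.duplicated(subset=keys, keep=False)]


# Made-up rows: the first two share target, library, and barcode.
toy = pd.DataFrame({
    "target": ["Wuhan-Hu-1", "Wuhan-Hu-1", "Wuhan-Hu-1", "BA.1"],
    "library": ["lib1", "lib1", "lib2", "lib1"],
    "barcode": ["AAAAAAAAAAAGGAGA"] * 4,
})

repeats = duplicated_barcodes(toy)
print(f"{len(repeats)} rows share a (target, library, barcode) combination")
```

The same barcode in a different library or a different target is not flagged, so duplicating follow-up pool barcodes into library 2 would still pass this check.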


## Now do the same for the DMS ACE2 binding and RBD expression scores for each library

In [6]:
# hard-coded dictionary to replace values in final variant scores:

names_df = {'Wuhan_Hu_1':'Wuhan-Hu-1',
            'Omicron_BA1':'BA.1',
           }

In [7]:
# read each target's scores, standardize names, and keep only that target's rows
scores = []
for target, file in config['input_mut_bind_expr'].items():
    print(f'Reading variant scores for {target}.')
    df = (pd.read_csv(file)
          .replace(names_df)
          .query('target==@target')
         )
    scores.append(df)

# concatenate once instead of growing a DataFrame inside the loop
mut_bind_expr = pd.concat(scores)

mut_bind_expr.to_csv(config['mut_bind_expr'], index=False)
display(HTML(mut_bind_expr.head().to_html(index=False)))

Reading variant scores for Wuhan-Hu-1.
Reading variant scores for BA.1.


target,wildtype,position,mutant,mutation,bind,delta_bind,n_bc_bind,n_libs_bind,bind_rep1,bind_rep2,bind_rep3,expr,delta_expr,n_bc_expr,n_libs_expr,expr_rep1,expr_rep2
Wuhan-Hu-1,N,331,A,N331A,8.7936,0.06027,4,2,8.76603,,8.82117,10.29895,0.11422,2,1,10.29895,
Wuhan-Hu-1,N,331,C,N331C,8.61594,-0.15567,5,3,8.7371,8.56255,8.54816,9.67665,-0.50923,4,2,9.4975,9.85579
Wuhan-Hu-1,N,331,D,N331D,8.75409,-0.01751,8,3,8.6599,8.79668,8.8057,10.06985,-0.11602,5,2,10.1461,9.99361
Wuhan-Hu-1,N,331,E,N331E,8.92561,0.154,10,3,8.69116,9.12888,8.9568,10.18436,-0.00151,6,2,10.22575,10.14298
Wuhan-Hu-1,N,331,F,N331F,8.6569,-0.1147,6,3,8.36984,8.80036,8.80051,10.01397,-0.17191,4,2,10.1436,9.88434
