# Processing SARS-CoV-2 mutation effect data

### Resources
- Data from this [paper](https://www.cell.com/cell/fulltext/S0092-8674(20)31003-5?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS0092867420310035%3Fshowall%3Dtrue#author-abstract), downloaded from this [repo](https://github.com/jbloomlab/SARS-CoV-2-RBD_DMS/tree/master)

In [1]:
!wget https://media.githubusercontent.com/media/jbloomlab/SARS-CoV-2-RBD_DMS/refs/heads/master/results/single_mut_effects/single_mut_effects.csv

--2025-05-30 08:50:05--  https://media.githubusercontent.com/media/jbloomlab/SARS-CoV-2-RBD_DMS/refs/heads/master/results/single_mut_effects/single_mut_effects.csv
Resolving media.githubusercontent.com (media.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to media.githubusercontent.com (media.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 269319 (263K) [text/plain]
Saving to: ‘single_mut_effects.csv’


2025-05-30 08:50:07 (351 KB/s) - ‘single_mut_effects.csv’ saved [269319/269319]



Let's load the CSV into a dataframe and take a look at the structure.

In [17]:
import pandas as pd

df_raw = pd.read_csv("single_mut_effects.csv")
df_raw

Unnamed: 0,site_RBD,site_SARS2,wildtype,mutant,mutation,mutation_RBD,bind_lib1,bind_lib2,bind_avg,expr_lib1,expr_lib2,expr_avg
0,1,331,N,A,N331A,N1A,-0.05,-0.02,-0.03,-0.14,-0.08,-0.11
1,1,331,N,C,N331C,N1C,-0.08,-0.10,-0.09,-1.56,-0.97,-1.26
2,1,331,N,D,N331D,N1D,0.00,0.07,0.03,-0.75,-0.12,-0.44
3,1,331,N,E,N331E,N1E,0.02,-0.02,0.00,-0.39,-0.24,-0.31
4,1,331,N,F,N331F,N1F,-0.03,-0.16,-0.10,-0.83,-0.57,-0.70
...,...,...,...,...,...,...,...,...,...,...,...,...
4216,201,531,T,T,T531T,T201T,0.00,0.00,0.00,0.00,0.00,0.00
4217,201,531,T,V,T531V,T201V,0.03,-0.02,0.01,-0.07,-0.05,-0.06
4218,201,531,T,W,T531W,T201W,0.02,-0.06,-0.02,-0.13,-0.04,-0.08
4219,201,531,T,Y,T531Y,T201Y,0.00,-0.03,-0.01,-0.03,-0.08,-0.05


Looks like they give us the mutation site (in both RBD and SARS-CoV-2 spike numbering), the WT and mutant residues, binding and expression scores for each of the two libraries, and the average across the two libraries.
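Before relying on these columns, it can help to check that they relate to each other the way the table suggests. The sketch below does this on a toy dataframe built from three of the rows shown above. The `toy` frame and its reduced column set are for illustration only; on the real data the same checks would run against `df_raw`.

```python
import pandas as pd

# Toy rows copied from the table above (subset of columns only)
toy = pd.DataFrame({
    "site_RBD": [1, 1, 201],
    "site_SARS2": [331, 331, 531],
    "wildtype": ["N", "N", "T"],
    "mutant": ["A", "C", "V"],
    "mutation": ["N331A", "N331C", "T531V"],
    "mutation_RBD": ["N1A", "N1C", "T201V"],
})

# Offset between spike numbering and RBD numbering (330 in the rows shown)
offsets = (toy["site_SARS2"] - toy["site_RBD"]).unique()

# Rebuild the mutation labels from their parts and compare with the stored ones
rebuilt = toy["wildtype"] + toy["site_SARS2"].astype(str) + toy["mutant"]
rebuilt_rbd = toy["wildtype"] + toy["site_RBD"].astype(str) + toy["mutant"]

labels_consistent = bool(
    (rebuilt == toy["mutation"]).all() and (rebuilt_rbd == toy["mutation_RBD"]).all()
)
print(offsets, labels_consistent)
```

If the offset is constant and the labels are consistent, `site_RBD` can be used directly as a 1-indexed position into the RBD sequence.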

What I want for machine learning is each full sequence paired with its binding score. So I'll first generate the full RBD sequence for each mutation, then make a new dataframe containing just the data I want.
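The core operation here is a single-residue substitution at a 1-indexed site. As a minimal sketch, the helper below (the name `apply_substitution` and the short `toy_wt` string are hypothetical, chosen only for illustration) shows the indexing on a toy sequence.

```python
def apply_substitution(wt_seq: str, site: int, mutant: str) -> str:
    """Return wt_seq with the residue at 1-indexed `site` replaced by `mutant`."""
    if not 1 <= site <= len(wt_seq):
        raise ValueError(f"site {site} is outside the sequence (length {len(wt_seq)})")
    # Python strings are 0-indexed, so site N lives at index N - 1
    return wt_seq[: site - 1] + mutant + wt_seq[site:]


toy_wt = "NITNLCPFGE"  # short toy sequence, for illustration only

first = apply_substitution(toy_wt, 1, "A")   # mutate the first residue
last = apply_substitution(toy_wt, 10, "K")   # mutate the last residue
print(first, last)
```

The bounds check guards against an off-by-one silently wrapping around, which plain negative indexing on a list would allow.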

In [None]:
# Get the exact wildtype sequence used from the dataframe
wt_by_position = df_raw[['site_RBD', 'wildtype']].drop_duplicates('site_RBD')
wt_seq_list = wt_by_position['wildtype'].to_list()
wt_sequence = ''.join(wt_seq_list)
wt_sequence

'NITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKST'
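Joining the `wildtype` column this way assumes the rows are ordered by site and that each site has exactly one wild-type residue. A hedged sketch of how that assumption could be checked is below, using a deliberately shuffled toy frame (`toy` is made up for illustration; the column names match the real table).

```python
import pandas as pd

# Toy frame with rows out of order and each site repeated
toy = pd.DataFrame({
    "site_RBD": [2, 1, 1, 3, 2, 3],
    "wildtype": ["I", "N", "N", "T", "I", "T"],
})

wt_by_position = (
    toy[["site_RBD", "wildtype"]]
    .drop_duplicates()          # a site with two different residues would survive twice
    .sort_values("site_RBD")    # make the order explicit rather than relying on file order
)

# Sites should run 1..N with no gaps and no repeats
sites = wt_by_position["site_RBD"].to_list()
assert sites == list(range(1, len(sites) + 1)), "sites are not contiguous from 1"

wt_sequence = "".join(wt_by_position["wildtype"])
print(wt_sequence)
```

Dropping duplicates on both columns, rather than on `site_RBD` alone, means an inconsistent wild-type residue shows up as a failed assertion instead of being silently discarded.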

In [None]:
mutant_sequences = []

# Build one full-length RBD sequence per row by substituting the mutant residue
for site, mutant in zip(df_raw.site_RBD, df_raw.mutant):
    mut_seq_list = wt_seq_list.copy()
    mut_seq_list[site - 1] = mutant  # site_RBD is 1-indexed
    mutant_sequences.append(''.join(mut_seq_list))

mutant_sequences

['AITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKST',
 'CITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKST',
 'DITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKST',
 'EITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKST',
 'FITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVG

In [20]:
df = pd.DataFrame({'sequences': mutant_sequences, 'bind_score': df_raw.bind_avg})
# Drop mutations with no measured binding score, then renumber the rows
df = df.dropna().reset_index(drop=True)
df

Unnamed: 0,sequences,bind_score
0,AITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFST...,-0.03
1,CITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFST...,-0.09
2,DITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFST...,0.03
3,EITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFST...,0.00
4,FITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFST...,-0.10
...,...,...
3998,NITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFST...,0.01
3999,NITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFST...,0.00
4000,NITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFST...,0.01
4001,NITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFST...,-0.02
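One thing worth checking at this point: the raw table includes rows where the mutant equals the wild-type residue (for example `T531T`, with a score of 0.00), and each of those rows produces an unmodified wild-type sequence. The processed dataframe may therefore contain the same sequence many times. Whether to keep those copies is a modelling choice; the sketch below only shows how they could be counted and collapsed, on a toy frame made up for illustration.

```python
import pandas as pd

# Toy stand-in for the processed table: two wild-type copies and one real mutant
toy = pd.DataFrame({
    "sequences": ["NIT", "NIT", "AIT"],
    "bind_score": [0.00, 0.00, -0.03],
})

# Count rows whose sequence appears more than once
n_duplicated = int(toy["sequences"].duplicated(keep=False).sum())

# Keep only the first copy of each sequence
deduped = toy.drop_duplicates(subset="sequences").reset_index(drop=True)
print(n_duplicated, len(deduped))
```

Collapsing duplicates before splitting avoids the same sequence landing in both the training and test sets.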


In [24]:
from sklearn.model_selection import train_test_split

train_seq, test_seq, train_labels, test_labels = train_test_split(df.sequences, df.bind_score, test_size=0.1, shuffle=True)

In [28]:
train = pd.concat([train_seq, train_labels], axis=1)
train.to_csv('train.csv')

test = pd.concat([test_seq, test_labels], axis=1)
test.to_csv('test.csv')
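Two optional tweaks to the split and export, sketched on toy data: passing `random_state` to `train_test_split` makes the shuffle repeatable, and passing `index=False` to `to_csv` keeps the shuffled row index out of the file. The seed value `0` and the `toy` frame are arbitrary choices for illustration, and the CSV is written to an in-memory buffer here rather than to disk.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({
    "sequences": [f"SEQ{i}" for i in range(20)],
    "bind_score": [i / 10 for i in range(20)],
})

# Splitting the DataFrame directly keeps each sequence next to its label
train, test = train_test_split(toy, test_size=0.1, shuffle=True, random_state=0)
train_again, _ = train_test_split(toy, test_size=0.1, shuffle=True, random_state=0)

# index=False leaves the row index out of the CSV
buffer = io.StringIO()
train.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns), len(train), len(test))
```

Without `index=False`, reading the file back with `pd.read_csv` yields an extra `Unnamed: 0` column holding the old row index.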