# What mutation rate should we use?

We have simulated noisy data for libraries with an average of 1, 2, 3, and 4 mutations per variant. If we fit a model to each library, how good are the fits?
Each library has 30,000 variants, and we will use the same parameters in each model.

In [1]:
import pandas as pd
import polyclonal

noisy_data = (
    pd.read_csv('RBD_variants_escape_noisy.csv', na_filter=False)
    .query('concentration in [0.25, 1, 4]')
    .reset_index(drop=True)
    )

noisy_data

Unnamed: 0,library,aa_substitutions,concentration,prob_escape,IC90
0,avg1muts,,0.25,0.087480,0.1128
1,avg1muts,,0.25,0.034240,0.1128
2,avg1muts,,0.25,0.037880,0.1128
3,avg1muts,,0.25,0.035730,0.1128
4,avg1muts,,0.25,0.000000,0.1128
...,...,...,...,...,...
359995,avg2muts,Y473E L518F D427L,4.00,0.002918,1.1600
359996,avg1muts,Y473S G413Q,4.00,0.000000,0.5780
359997,avg1muts,Y473V P479R F392W,4.00,0.160200,1.4550
359998,avg3muts,Y489Q N501Y,4.00,0.000000,0.5881
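
As a quick sanity check on the library names, one could count the substitutions in each variant and average them within each library. The sketch below runs on a small invented DataFrame (`toy`) with a hypothetical helper `count_subs`; only the column names `library` and `aa_substitutions` come from the data above, and the rows are made up for illustration.

```python
import pandas as pd

# Toy stand-in for noisy_data; these rows are invented, not real results.
toy = pd.DataFrame({
    "library": ["avg1muts", "avg1muts", "avg2muts", "avg2muts"],
    "aa_substitutions": ["", "Y473S G413Q", "Y473E L518F D427L", "N501Y"],
})

def count_subs(aa_subs):
    """Count space-delimited substitutions; an empty string means wildtype."""
    return len(aa_subs.split())

# Number of substitutions per variant, then the mean within each library.
toy["n_subs"] = toy["aa_substitutions"].apply(count_subs)
mean_subs = toy.groupby("library")["n_subs"].mean()
print(mean_subs)
```

Applied to the real `noisy_data`, the per-library means would be expected to land near 1, 2, 3, and 4 if the library labels describe the simulation.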


Write the dataset for each library into separate output directories.

In [9]:
import torchdms
import Bio.SeqIO
import pickle

wtseq_dna = Bio.SeqIO.read('RBD_seq.fasta', 'fasta').seq
wtseq_aa = str(wtseq_dna.translate())
assert len(wtseq_aa) == 201

def recode_aa_subs(old_aa_subs, offset):
    """
    Recode amino acid substitutions by shifting each site number by `offset`.
    e.g., A331B -> A1B with an offset of 330 sites.
    """
    subs = old_aa_subs.split()
    recoded_subs = []
    for s in subs:
        wt = s[0]
        mut = s[-1]
        pos = int(s[1:-1]) - offset
        recoded_subs.append(wt + str(pos) + mut)
    recoded_subs = ' '.join(recoded_subs)
    return recoded_subs

def format_data_for_torchdms(df):
    """
    Format the simulated data to be compatible
    with torchdms input.
    1. Replace any NaN with an empty string.
    2. Renumber all aa_substitutions so that site 331 becomes site 1.
    """
    data = df.fillna('')
    data['aa_substitutions'] = data['aa_substitutions'].apply(
        lambda x: recode_aa_subs(x, 330)
    )
    return data

for n in [1,2,3,4]:
    avg_n_data = noisy_data.query(f"library == 'avg{n}muts'")
    formatted_data = format_data_for_torchdms(avg_n_data)
    assert len(formatted_data.index) == 90000
    with open(f"torchdms_results/noisy_3conc_{n}muts/noisy_3conc_{n}muts_data.pkl", "wb") as f:
        pickle.dump([formatted_data, wtseq_aa], f)
    print(f"Dataset written to torchdms_results/noisy_3conc_{n}muts.")

Dataset written to torchdms_results/noisy_3conc_1muts.
Dataset written to torchdms_results/noisy_3conc_2muts.
Dataset written to torchdms_results/noisy_3conc_3muts.
Dataset written to torchdms_results/noisy_3conc_4muts.
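
To make the renumbering concrete, here is a compact, self-contained restatement of the recoding step with a few worked cases. The helper names `recode_sub` and `recode_all` are hypothetical and exist only for this illustration; the offset of 330 and the wildtype length of 201 are taken from the cell above.

```python
def recode_sub(sub, offset):
    """Shift the site number of one substitution such as 'N501Y' by `offset`."""
    wt, site, mut = sub[0], int(sub[1:-1]), sub[-1]
    return f"{wt}{site - offset}{mut}"

def recode_all(aa_subs, offset=330):
    """Recode a space-delimited string of substitutions; '' stays ''."""
    return " ".join(recode_sub(s, offset) for s in aa_subs.split())

# Worked cases: 501 - 330 = 171, 473 - 330 = 143, 413 - 330 = 83.
single = recode_all("N501Y")
double = recode_all("Y473S G413Q")
wildtype = recode_all("")
print(single, "|", double, "|", repr(wildtype))

# Every recoded site should fall inside the 201-residue wildtype sequence.
sites = [int(s[1:-1]) for s in double.split()]
in_range = all(1 <= site <= 201 for site in sites)
```

A range check like `in_range` is one way to catch a wrong offset before the data is pickled.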


In [26]:
%cd /fh/fast/bloom_j/computational_notebooks/tyu2/2021/polyclonal/notebooks

/fh/fast/bloom_j/computational_notebooks/tyu2/2021/polyclonal/notebooks


Prepare each dataset with `tdms prep`, then train the `torchdms` models with `tdms go`.
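
The `tdms prep` log below reports training samples per stratum, which here appears to correspond to the number of substitutions in a variant. The following sketch is not torchdms code and makes no claim about how `tdms prep` is implemented. It only illustrates, with pandas and invented data, the general idea that a minimum-count rule suggests: strata with too few variants are dropped. The names `toy`, `min_count`, and `kept` are hypothetical.

```python
import pandas as pd

# Invented variants: five single mutants, three doubles, one triple.
toy = pd.DataFrame({
    "aa_substitutions": ["A1C"] * 5 + ["A1C D2E"] * 3 + ["A1C D2E F3G"] * 1
})

# Stratum = number of substitutions in the variant (an assumption here).
toy["n_subs"] = toy["aa_substitutions"].str.split().str.len()
counts = toy["n_subs"].value_counts().sort_index()

# Keep only strata with at least `min_count` variants.
min_count = 2
kept = counts[counts >= min_count].index.tolist()
print(counts.to_dict(), kept)
```

In this toy case the triple-mutant stratum has a single variant and is dropped, which is why libraries with more mutations per variant may warrant different thresholds.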

In [38]:
# Per-library thresholds for `tdms prep`, indexed by n - 1.
min_test_per_stratum = [200, 250, 300, 350]
min_count_per_stratum = [800, 1200, 1600, 2000]

for n in [1, 2, 3, 4]:
    %cd torchdms_results/noisy_3conc_{n}muts

    !echo "Prepping dataset."
    !tdms prep --per-stratum-variants-for-test {min_test_per_stratum[n-1]} --skip-stratum-if-count-is-smaller-than {min_count_per_stratum[n-1]} \
    --partition-by library *data.pkl prepped prob_escape

    !echo "Training model."
    !tdms go --config config.json

    %cd ../..

/fh/fast/bloom_j/computational_notebooks/tyu2/2021/polyclonal/notebooks/torchdms_results/noisy_3conc_1muts
Prepping dataset.
LOG: Targets: ('prob_escape',)
LOG: There are 90000 total variants in this dataset
LOG: Partitioning data via 'avg1muts'
LOG: There are 26409 training samples for stratum: 1
LOG: There are 15612 training samples for stratum: 2
LOG: There are 4417 training samples for stratum: 3
LOG: There are 297 training samples for stratum: 4
LOG: There are 4793 validation samples
LOG: There are 4898 test samples
LOG: Successfully partitioned data
LOG: preparing binary map dataset
LOG: Successfully finished prep and dumped SplitDataset object to prepped
Training model.
LOG: Setting random seed to 0.
LOG: Model defined as: Escape(
  (latent_layer_epi0): Linear(in_features=4221, out_features=1, bias=False)
  (latent_layer_epi1): Linear(in_features=4221, out_features=1, bias=False)
  (latent_layer_epi2): Linear(in_features=4221, out_features=1, bias=False)
)
{'num_epitopes': 3, 'b