# What set of concentrations should we use?

We have simulated noisy data measured at six concentrations: [0.125, 0.25, 0.5, 1, 2, 4]. The goal is to decide which concentrations to use in actual experiments. Formally: which set of concentrations gives the best model performance while keeping both the **concentrations** themselves and the **number of concentrations** as low as possible?

If we perform an exhaustive grid search, there are $2^{6} - 1 = 63$ non-empty sets of concentrations to test. I am not going to do that.
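
For reference, the count of candidate sets can be checked by enumerating them. This is only a sketch of the search space, not something the notebook runs; the name `all_sets` is made up for illustration.

```python
from itertools import combinations

conc = [0.125, 0.25, 0.5, 1, 2, 4]

# Every non-empty subset of the six concentrations, ordered by subset size.
all_sets = [
    list(subset)
    for size in range(1, len(conc) + 1)
    for subset in combinations(conc, size)
]

print(len(all_sets))  # 63
```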

Instead, I first looked at the effect of adding a higher concentration to a set starting at concentration = 0.125. This involved training models on 6 sets of concentrations (Experiment 1).
1. 0.125
2. 0.125, 0.25
3. 0.125, 0.25, 0.5
4. 0.125, 0.25, 0.5, 1
5. 0.125, 0.25, 0.5, 1, 2
6. 0.125, 0.25, 0.5, 1, 2, 4  
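
The six nested sets listed above are simply the prefixes of the sorted concentration list, which is how the training loop further down selects them with `conc[0:i]`. A minimal sketch (the name `exp1_sets` is an illustrative choice):

```python
conc = [0.125, 0.25, 0.5, 1, 2, 4]

# Each set adds the next-higher concentration to the previous one.
exp1_sets = [conc[:i] for i in range(1, len(conc) + 1)]

for n, conc_set in enumerate(exp1_sets, start=1):
    print(n, conc_set)
```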

I found that [0.125, 0.25, 0.5, 1, 2] was the (slightly conservative) point beyond which adding a higher concentration no longer improved model performance. I then asked: starting from concentration = 2, at what point does adding back a lower concentration stop improving model performance? This involved training models on 5 sets of concentrations (Experiment 2).
1. 2
2. 1, 2
3. 0.5, 1, 2
4. 0.25, 0.5, 1, 2
5. 0.125, 0.25, 0.5, 1, 2  
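
For Experiment 2 the same prefix trick works on the list in descending order. The sketch below only builds the filter strings that would be passed to `DataFrame.query`; it does not fit any models, and the name `queries` is illustrative.

```python
conc = [2, 1, 0.5, 0.25, 0.125]

# One query string per model: start at 2, then add back the next-lower concentration.
queries = [f"concentration in {conc[:i]}" for i in range(1, len(conc) + 1)]

for q in queries:
    print(q)
# The third string is "concentration in [2, 1, 0.5]", i.e. the set [0.5, 1, 2].
```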

I found that the concentration set [0.5, 1, 2] was good enough. To check whether three was the optimal number of concentrations, I removed the intermediate concentration and trained a model on the concentration set [0.5, 2]. As expected, removing the intermediate concentration lowered the predictive accuracy (especially for the class 1 epitope). I also repeated this with [0.25, 1, 4] and [0.25, 4].

Lastly, I compare the concentration sets that appeared best, [0.5, 1, 2] and [0.25, 1, 4] (the latter tested in a previous notebook, mutation_rate.ipynb), to the model trained on all 6 concentrations.

In [1]:
import pandas as pd
import polyclonal
import pickle
import random
import altair as alt
import numpy

In [2]:
noisy_data = (
    pd.read_csv('RBD_variants_escape_noisy.csv', na_filter=False)
    .query("library == 'avg3muts'")
    .reset_index(drop=True)
    )

noisy_data

Unnamed: 0,library,aa_substitutions,concentration,prob_escape,IC90
0,avg3muts,,0.125,0.04859,0.1128
1,avg3muts,,0.125,0.17970,0.1128
2,avg3muts,,0.125,0.13200,0.1128
3,avg3muts,,0.125,0.07772,0.1128
4,avg3muts,,0.125,0.17960,0.1128
...,...,...,...,...,...
179995,avg3muts,Y449I L518Y C525R L461I,4.000,0.02197,2.3100
179996,avg3muts,Y449V K529R N394R,4.000,0.04925,0.9473
179997,avg3muts,Y451L N481T F490V,4.000,0.02315,0.9301
179998,avg3muts,Y453R V483G L492V N501P I332P,4.000,0.00000,5.0120


# Experiment 1

In [3]:
random.seed(123)

conc = [0.125, 0.25, 0.5, 1, 2, 4]
for i in range(1,7):
    poly_abs = polyclonal.Polyclonal(data_to_fit=noisy_data.query(f"concentration in {conc[0:i]}"),
                                     activity_wt_df=pd.DataFrame.from_records(
                                         [('1', 1.0),
                                          ('2', 3.0),
                                          ('3', 2.0),
                                          ],
                                         columns=['epitope', 'activity'],
                                         ),
                                     site_escape_df=pd.DataFrame.from_records(
                                         [('1', 417, 10.0),
                                          ('2', 484, 10.0),
                                          ('3', 444, 10.0),
                                          ],
                                         columns=['epitope', 'site', 'escape'],
                                         ),
                                     data_mut_escape_overlap='fill_to_data',
                                 )
    
    opt_res = poly_abs.fit(logfreq=500)
    with open(f'scipy_results/conc_exp1_noisy_{i}conc_3muts.pkl', 'wb') as f:
        pickle.dump(poly_abs, f)
    print(f"Model fit on library with {i} concentrations beginning at c=0.125 to scipy_results/conc_exp1_noisy_{i}conc_3muts.pkl")

# First fitting site-level model.
# Starting optimization of 522 parameters at Thu Nov 25 12:21:04 2021.
       step   time_sec       loss   fit_loss reg_escape  regspread
          0   0.027545      11737      11737    0.29701          0
        500     10.994     498.23     494.26     3.9742          0
        602     13.243     498.15      494.2     3.9484          0
# Successfully finished at Thu Nov 25 12:21:17 2021.
# Starting optimization of 5799 parameters at Thu Nov 25 12:21:17 2021.
       step   time_sec       loss   fit_loss reg_escape  regspread
          0    0.02994     659.14      613.8     45.343 6.2482e-30
        500     13.043     308.91     243.81     46.213     18.892
       1000     25.757     304.79     240.71     44.162     19.911
       1500     38.799     302.65     239.55     42.997     20.101
       1546     39.989     302.62     239.55     42.978     20.101
# Successfully finished at Thu Nov 25 12:21:57 2021.
Model fit on library with 1 concentrations begi

## Get correlation between predicted and true beta coefficients for each trained model

In [27]:
all_corrs = pd.DataFrame({'epitope' : [], 
                          'correlation' : [], 
                          'num_concentrations' : []}
                        )

conc = [0.125, 0.25, 0.5, 1, 2, 4]
for i in range(1,7):
    with open(f'scipy_results/conc_exp1_noisy_{i}conc_3muts.pkl', 'rb') as f:
        model = pickle.load(f)

    mut_escape_pred = (
        pd.read_csv('RBD_mut_escape_df.csv')
        .merge((model.mut_escape_df
                .assign(epitope=lambda x: 'class ' + x['epitope'].astype(str))
                .rename(columns={'escape': 'predicted escape'})
                ),
               on=['mutation', 'epitope'],
               validate='one_to_one',
               )
        )

    corr = (mut_escape_pred
            .groupby('epitope')
            .apply(lambda x: x['escape'].corr(x['predicted escape']))
            .rename('correlation')
            .reset_index()
            )
    all_corrs = pd.concat([all_corrs, 
                           corr.assign(num_concentrations = [str(conc[0:i])] * len(corr.index))]
                         )
all_corrs.head()

Unnamed: 0,epitope,correlation,num_concentrations
0,class 1,0.679124,[0.125]
1,class 2,0.694542,[0.125]
2,class 3,0.480764,[0.125]
0,class 1,0.773873,"[0.125, 0.25]"
1,class 2,0.896795,"[0.125, 0.25]"


In [28]:
alt.Chart(all_corrs).mark_bar().encode(
    x= alt.X('num_concentrations:O', axis=alt.Axis(labels=False), sort=alt.EncodingSortField('x', order='descending')),
    y='correlation:Q',
    color=alt.Color('num_concentrations:N', sort=alt.EncodingSortField('color', order='descending')),
    column='epitope:N',
    tooltip = ['num_concentrations', 'correlation']
).properties(width=125, height=200)

The [0.125, 0.25, 0.5, 1, 2] set appears to be the point at which adding another concentration no longer improves model performance.
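
The merge-then-correlate step is repeated for every model in this notebook, so it could be factored into a small helper. The sketch below is one way to do that, assuming both tables have `epitope`, `mutation` and `escape` columns; the function name `epitope_correlations` and the toy inputs are made up for illustration and are not results from this notebook.

```python
import pandas as pd


def epitope_correlations(true_df, pred_df):
    """Pearson correlation of true vs. predicted escape, one row per epitope."""
    merged = true_df.merge(
        pred_df.rename(columns={"escape": "predicted escape"}),
        on=["mutation", "epitope"],
        validate="one_to_one",
    )
    rows = [
        {"epitope": epitope,
         "correlation": grp["escape"].corr(grp["predicted escape"])}
        for epitope, grp in merged.groupby("epitope")
    ]
    return pd.DataFrame(rows)


# Toy inputs (made-up values): class 1 is predicted perfectly, class 2 is reversed.
true_df = pd.DataFrame({
    "epitope": ["class 1"] * 3 + ["class 2"] * 3,
    "mutation": ["A1C", "A1D", "A1E"] * 2,
    "escape": [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
})
pred_df = pd.DataFrame({
    "epitope": ["class 1"] * 3 + ["class 2"] * 3,
    "mutation": ["A1C", "A1D", "A1E"] * 2,
    "escape": [0.1, 1.1, 2.1, 2.0, 1.0, 0.0],
})

print(epitope_correlations(true_df, pred_df))
```

With the real data, `true_df` would be the contents of `RBD_mut_escape_df.csv` and `pred_df` the model's `mut_escape_df` after prefixing its epitope labels with `class `, as the cells in this notebook do.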

# Experiment 2

In [12]:
random.seed(123)

conc = [2, 1, 0.5, 0.25, 0.125]
for i in range(1, len(conc)+1):
    poly_abs = polyclonal.Polyclonal(data_to_fit=noisy_data.query(f"concentration in {conc[0:i]}"),
                                     activity_wt_df=pd.DataFrame.from_records(
                                         [('1', 1.0),
                                          ('2', 3.0),
                                          ('3', 2.0),
                                          ],
                                         columns=['epitope', 'activity'],
                                         ),
                                     site_escape_df=pd.DataFrame.from_records(
                                         [('1', 417, 10.0),
                                          ('2', 484, 10.0),
                                          ('3', 444, 10.0),
                                          ],
                                         columns=['epitope', 'site', 'escape'],
                                         ),
                                     data_mut_escape_overlap='fill_to_data',
                                 )
    
    opt_res = poly_abs.fit(logfreq=500)
    with open(f'scipy_results/conc_exp2_noisy_{i}conc_3muts.pkl', 'wb') as f:
        pickle.dump(poly_abs, f)
    print(f"Model fit on library with {i} concentrations beginning at c=2 to scipy_results/conc_exp2_noisy_{i}conc_3muts.pkl")

# First fitting site-level model.
# Starting optimization of 522 parameters at Thu Nov 25 21:42:13 2021.
       step   time_sec       loss   fit_loss reg_escape  regspread
          0   0.017499     3471.2     3470.9    0.29701          0
        500     8.9243      728.3     723.65     4.6463          0
        989     17.323      727.5     722.28     5.2223          0
# Successfully finished at Thu Nov 25 21:42:30 2021.
# Starting optimization of 5799 parameters at Thu Nov 25 21:42:30 2021.
       step   time_sec       loss   fit_loss reg_escape  regspread
          0   0.019306     811.19     753.78     57.412 1.0546e-29
        500     9.9185     291.36     213.98     50.167     27.208
       1000     19.781     274.78     199.14     47.256      28.38
       1240     24.452     274.29     198.73     47.008     28.551
# Successfully finished at Thu Nov 25 21:42:54 2021.
Model fit on library with 1 concentrations beginning at c=2 to scipy_results/conc_exp2_noisy_1conc_3muts.pkl
# Fir

## Get correlation between predicted and true beta coefficients for each trained model

In [29]:
all_corrs = pd.DataFrame({'epitope' : [], 
                          'correlation' : [], 
                          'num_concentrations' : []}
                        )

conc = [2, 1, 0.5, 0.25, 0.125]
for i in range(1,6):
    with open(f'scipy_results/conc_exp2_noisy_{i}conc_3muts.pkl', 'rb') as f:
        model = pickle.load(f)

    mut_escape_pred = (
        pd.read_csv('RBD_mut_escape_df.csv')
        .merge((model.mut_escape_df
                .assign(epitope=lambda x: 'class ' + x['epitope'].astype(str))
                .rename(columns={'escape': 'predicted escape'})
                ),
               on=['mutation', 'epitope'],
               validate='one_to_one',
               )
        )

    corr = (mut_escape_pred
            .groupby('epitope')
            .apply(lambda x: x['escape'].corr(x['predicted escape']))
            .rename('correlation')
            .reset_index()
            )
    all_corrs = pd.concat([all_corrs, 
                           corr.assign(num_concentrations = [str(conc[0:i])] * len(corr.index))]
                         )
all_corrs.head()

Unnamed: 0,epitope,correlation,num_concentrations
0,class 1,0.770642,[2]
1,class 2,0.949397,[2]
2,class 3,0.915621,[2]
0,class 1,0.848612,"[2, 1]"
1,class 2,0.963036,"[2, 1]"


In [30]:
alt.Chart(all_corrs).mark_bar().encode(
    x= alt.X('num_concentrations:O', axis=alt.Axis(labels=False), sort=alt.EncodingSortField('x', order='descending')),
    y='correlation:Q',
    color=alt.Color('num_concentrations:N', sort=alt.EncodingSortField('color', order='descending')),
    column='epitope:N',
    tooltip = ['num_concentrations', 'correlation']
).properties(width=125, height=200)

It looks like the set [0.5, 1, 2] is sufficient. Now, let's try fitting a model on the set [0.5, 2], without the intermediate concentration.

In [15]:
poly_abs = polyclonal.Polyclonal(data_to_fit=noisy_data.query(f"concentration in [0.5, 2]"),
                                     activity_wt_df=pd.DataFrame.from_records(
                                         [('1', 1.0),
                                          ('2', 3.0),
                                          ('3', 2.0),
                                          ],
                                         columns=['epitope', 'activity'],
                                         ),
                                     site_escape_df=pd.DataFrame.from_records(
                                         [('1', 417, 10.0),
                                          ('2', 484, 10.0),
                                          ('3', 444, 10.0),
                                          ],
                                         columns=['epitope', 'site', 'escape'],
                                         ),
                                     data_mut_escape_overlap='fill_to_data',
                                 )
    
opt_res = poly_abs.fit(logfreq=500)
with open('scipy_results/conc_exp2_noisy_2wointermediate_conc_3muts.pkl', 'wb') as f:
    pickle.dump(poly_abs, f)

# First fitting site-level model.
# Starting optimization of 522 parameters at Thu Nov 25 22:17:03 2021.
       step   time_sec       loss   fit_loss reg_escape  regspread
          0   0.043393      11245      11245    0.29701          0
        432     19.294     1520.4     1515.7      4.684          0
# Successfully finished at Thu Nov 25 22:17:22 2021.
# Starting optimization of 5799 parameters at Thu Nov 25 22:17:22 2021.
       step   time_sec       loss   fit_loss reg_escape  regspread
          0   0.045623     1675.7     1623.1     52.562 1.0799e-29
        500     25.072     616.27     526.96     55.267     34.039
       1000     49.611      606.9     519.53     53.179     34.191
       1500     74.015     605.05     518.13     52.066     34.852
       2000     98.558     604.34     518.33     51.383     34.629
       2053     101.13     604.32     518.28     51.409      34.63
# Successfully finished at Thu Nov 25 22:19:04 2021.


In [16]:
with open('scipy_results/conc_exp2_noisy_2wointermediate_conc_3muts.pkl', 'rb') as f:
    model = pickle.load(f)

mut_escape_pred = (
        pd.read_csv('RBD_mut_escape_df.csv')
        .merge((model.mut_escape_df
                .assign(epitope=lambda x: 'class ' + x['epitope'].astype(str))
                .rename(columns={'escape': 'predicted escape'})
                ),
               on=['mutation', 'epitope'],
               validate='one_to_one',
               )
        )

corr = (mut_escape_pred
        .groupby('epitope')
        .apply(lambda x: x['escape'].corr(x['predicted escape']))
        .rename('correlation')
        .reset_index()
        )
corr

Unnamed: 0,epitope,correlation
0,class 1,0.832552
1,class 2,0.967225
2,class 3,0.93269


It looks like the intermediate concentration was important. Performance when training on [0.5, 2] was worse than when training on [0.5, 1, 2], and comparable to the two-concentration example above ([1, 2]).

From our experiments thus far, [0.5, 1, 2] appears to be the best set. However, we previously observed [here](mutation_rate.ipynb) that [0.25, 1, 4] actually slightly outperforms [0.5, 1, 2]. To verify that three concentrations are necessary, let's also check that removing the intermediate concentration from this set (leaving [0.25, 4]) decreases performance as well.

In [17]:
poly_abs = polyclonal.Polyclonal(data_to_fit=noisy_data.query(f"concentration in [0.25, 4]"),
                                     activity_wt_df=pd.DataFrame.from_records(
                                         [('1', 1.0),
                                          ('2', 3.0),
                                          ('3', 2.0),
                                          ],
                                         columns=['epitope', 'activity'],
                                         ),
                                     site_escape_df=pd.DataFrame.from_records(
                                         [('1', 417, 10.0),
                                          ('2', 484, 10.0),
                                          ('3', 444, 10.0),
                                          ],
                                         columns=['epitope', 'site', 'escape'],
                                         ),
                                     data_mut_escape_overlap='fill_to_data',
                                 )
    
opt_res = poly_abs.fit(logfreq=500)
with open('scipy_results/conc_exp2_noisy_2wointermediate_expanded_conc_3muts.pkl', 'wb') as f:
    pickle.dump(poly_abs, f)

# First fitting site-level model.
# Starting optimization of 522 parameters at Thu Nov 25 22:26:10 2021.
       step   time_sec       loss   fit_loss reg_escape  regspread
          0   0.040115      12531      12531    0.29701          0
        500     21.197     1335.2       1330     5.2073          0
       1000     42.377     1332.5     1326.9     5.6004          0
       1172     49.626     1332.3     1326.6      5.665          0
# Successfully finished at Thu Nov 25 22:27:00 2021.
# Starting optimization of 5799 parameters at Thu Nov 25 22:27:00 2021.
       step   time_sec       loss   fit_loss reg_escape  regspread
          0   0.044358     1518.7     1453.8      64.91 1.6989e-29
        500     26.339     656.49     566.93     59.741     29.822
       1000     51.956      585.9     491.87     55.397     38.628
       1500     78.126     573.86      480.8     54.441     38.618
       1920     99.856     573.31     480.67     53.921      38.72
# Successfully finished at Thu No

In [31]:
with open('scipy_results/conc_exp2_noisy_2wointermediate_expanded_conc_3muts.pkl', 'rb') as f:
    model = pickle.load(f)

mut_escape_pred = (
        pd.read_csv('RBD_mut_escape_df.csv')
        .merge((model.mut_escape_df
                .assign(epitope=lambda x: 'class ' + x['epitope'].astype(str))
                .rename(columns={'escape': 'predicted escape'})
                ),
               on=['mutation', 'epitope'],
               validate='one_to_one',
               )
        )

corr = (mut_escape_pred
        .groupby('epitope')
        .apply(lambda x: x['escape'].corr(x['predicted escape']))
        .rename('correlation')
        .reset_index()
        )
corr

Unnamed: 0,epitope,correlation
0,class 1,0.845842
1,class 2,0.971332
2,class 3,0.941974


Yes, removing the intermediate concentration mostly affected the correlation for epitope 1, similar to what we observed previously.
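
To put the two two-concentration fits side by side, the sketch below takes the correlations printed in the two output tables above and computes the per-epitope difference. It uses only those reported values; the variable names are illustrative.

```python
# Correlations copied from the two output tables above.
corr_05_2 = {"class 1": 0.832552, "class 2": 0.967225, "class 3": 0.93269}
corr_025_4 = {"class 1": 0.845842, "class 2": 0.971332, "class 3": 0.941974}

# Positive values mean [0.25, 4] correlates better than [0.5, 2].
diff = {ep: corr_025_4[ep] - corr_05_2[ep] for ep in corr_05_2}

for ep, d in diff.items():
    print(f"{ep}: {d:+.6f}")
```

In both fits class 1 has the lowest correlation, and [0.25, 4] is slightly higher than [0.5, 2] for every epitope.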

## Summary

In [33]:
good_models = ['conc_exp2_noisy_2wointermediate_conc_3muts.pkl',
               'conc_exp2_noisy_2wointermediate_expanded_conc_3muts.pkl',
               'conc_exp2_noisy_3conc_3muts.pkl',
               'noisy_3conc_3muts.pkl',
               'conc_exp1_noisy_6conc_3muts.pkl']

model_names = ['[0.5, 2]', '[0.25, 4]', '[0.5, 1, 2]', '[0.25, 1, 4]', '[0.125, 0.25, 0.5, 1, 2, 4]']

all_corrs = pd.DataFrame({'epitope' : [], 
                          'correlation' : [], 
                          'num_concentrations' : []}
                        )

for model_file, name in zip(good_models, model_names):
    # Keep the file name and the loaded model in separate variables.
    with open(f'scipy_results/{model_file}', 'rb') as f:
        model = pickle.load(f)

    mut_escape_pred = (
        pd.read_csv('RBD_mut_escape_df.csv')
        .merge((model.mut_escape_df
                .assign(epitope=lambda x: 'class ' + x['epitope'].astype(str))
                .rename(columns={'escape': 'predicted escape'})
                ),
               on=['mutation', 'epitope'],
               validate='one_to_one',
               )
        )

    corr = (mut_escape_pred
            .groupby('epitope')
            .apply(lambda x: x['escape'].corr(x['predicted escape']))
            .rename('correlation')
            .reset_index()
            )
    all_corrs = pd.concat([all_corrs, 
                           corr.assign(num_concentrations = [name] * len(corr.index))]
                         )
all_corrs.head()

Unnamed: 0,epitope,correlation,num_concentrations
0,class 1,0.832552,"[0.5, 2]"
1,class 2,0.967225,"[0.5, 2]"
2,class 3,0.93269,"[0.5, 2]"
0,class 1,0.845842,"[0.25, 4]"
1,class 2,0.971332,"[0.25, 4]"


In [34]:
alt.Chart(all_corrs).mark_bar().encode(
    x= alt.X('num_concentrations:O', axis=alt.Axis(labels=False), sort=alt.EncodingSortField('x', order='descending')),
    y='correlation:Q',
    color=alt.Color('num_concentrations:N', sort=alt.EncodingSortField('color', order='descending')),
    column='epitope:N',
    tooltip = ['num_concentrations', 'correlation']
).properties(width=125, height=200)

[0.25, 1, 4] is a good concentration set to use. However, two concentrations could also work fine.

## What makes this concentration set optimal?

Is there something special about the predicted IC50s of variants under these conditions that makes this set most suitable?