# Choosing a set of sera concentrations for multi-mutant DMS

In [1]:
import pandas as pd
import polyclonal
import pickle
import altair as alt
import numpy as np
import time
import os
from plotnine import *

We have simulated noisy data measured at 6 sera concentrations `[0.125, 0.25, 0.5, 1, 2, 4]`.

In [2]:
noisy_data = (
    pd.read_csv('RBD_variants_escape_noisy.csv', na_filter=None)
    .query("library == 'avg3muts'")
    .reset_index(drop=True)
    )

noisy_data

Unnamed: 0,library,aa_substitutions,concentration,prob_escape,IC90
0,avg3muts,,0.125,0.04859,0.1128
1,avg3muts,,0.125,0.17970,0.1128
2,avg3muts,,0.125,0.13200,0.1128
3,avg3muts,,0.125,0.07772,0.1128
4,avg3muts,,0.125,0.17960,0.1128
...,...,...,...,...,...
179995,avg3muts,Y449I L518Y C525R L461I,4.000,0.02197,2.3100
179996,avg3muts,Y449V K529R N394R,4.000,0.04925,0.9473
179997,avg3muts,Y451L N481T F490V,4.000,0.02315,0.9301
179998,avg3muts,Y453R V483G L492V N501P I332P,4.000,0.00000,5.0120


Each variant contains 3 mutations on average and generally [spans a wide range of escape fractions across the different sera concentrations](fit_RBD.ipynb).

Here, we'll provide guidance on which set of concentrations to use in actual experiments. Using this simulated dataset, we want to identify the set of concentrations that gives the best performance while minimizing the **number of concentrations** (i.e., selection experiments) required.

First, we'll fit multiple `Polyclonal` models, starting with a single concentration of `0.125` and then adding the measurements at the next-higher concentration, one concentration at a time.

In [40]:
conc = [0.125, 0.25, 0.5, 1, 2, 4]
conc_sets = [conc[0:i+1] for i in range(len(conc))]

# if model is already fit, don't fit again
to_fit = []
for s in conc_sets: 
    if os.path.exists(f'scipy_results/noisy_{s}conc_3muts.pkl'):
        print(f"Model with {s} was already fit.")
    else:
        to_fit.append(s)
                
def fit_polyclonal(conc_set):
    poly_abs = polyclonal.Polyclonal(data_to_fit=noisy_data.query(f"concentration in {conc_set}"),
                                     activity_wt_df=pd.DataFrame.from_records(
                                         [('1', 1.0),
                                          ('2', 3.0),
                                          ('3', 2.0),
                                          ],
                                         columns=['epitope', 'activity'],
                                         ),
                                     site_escape_df=pd.DataFrame.from_records(
                                         [('1', 417, 10.0),
                                          ('2', 484, 10.0),
                                          ('3', 444, 10.0),
                                          ],
                                         columns=['epitope', 'site', 'escape'],
                                         ),
                                     data_mut_escape_overlap='fill_to_data',
                                 )
    start = time.time()
    poly_abs.fit()
    return poly_abs, time.time() - start

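os.makedirs('scipy_results', exist_ok=True)  # make sure the cache directory exists
for s in to_fit:
    model, time_elapsed = fit_polyclonal(s)
    # use a context manager so the file handle is closed after writing
    with open(f'scipy_results/noisy_{s}conc_3muts.pkl', 'wb') as f:
        pickle.dump(model, f)
    print(f"Model with {s} fit in {time_elapsed:.1f} seconds.")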

Model with [0.125] was already fit.
Model with [0.125, 0.25] was already fit.
Model with [0.125, 0.25, 0.5] was already fit.
Model with [0.125, 0.25, 0.5, 1] was already fit.
Model with [0.125, 0.25, 0.5, 1, 2] was already fit.
Model with [0.125, 0.25, 0.5, 1, 2, 4] was already fit.


Let's look at the correlation between the predicted and true beta coefficients for each of the fit models.

In [44]:
all_corrs = pd.DataFrame({'epitope' : [], 
                          'correlation' : [], 
                          'conc_set' : []}
                        )

for s in conc_sets:
    with open(f'scipy_results/noisy_{s}conc_3muts.pkl', 'rb') as f:
        model = pickle.load(f)

    mut_escape_pred = (
        pd.read_csv('RBD_mut_escape_df.csv')
        .merge((model.mut_escape_df
                .assign(epitope=lambda x: 'class ' + x['epitope'].astype(str))
                .rename(columns={'escape': 'predicted escape'})
                ),
               on=['mutation', 'epitope'],
               validate='one_to_one',
               )
        )

    corr = (mut_escape_pred
            .groupby('epitope')
            .apply(lambda x: x['escape'].corr(x['predicted escape']))
            .rename('correlation')
            .reset_index()
            )
    all_corrs = pd.concat([all_corrs, 
                           corr.assign(conc_set = [str(s)] * len(corr.index))]
                         )

In [45]:
chart = alt.Chart(all_corrs).mark_bar().encode(
    x= alt.X('conc_set:O', axis=alt.Axis(labels=False), 
             sort=alt.EncodingSortField('x', order='descending')),
    y='correlation:Q',
    color=alt.Color('conc_set:N', 
                    sort=alt.EncodingSortField('color', order='descending')),
    column='epitope:N',
    tooltip = ['conc_set', 'correlation']
).properties(width=125, height=200, title='predicted vs. true beta coefficients')
chart

Additionally, let's look at the correlation between the predicted and true IC90s for each of the fit models. To do this, we'll predict the IC90s of variants in a different simulated library.

In [46]:
exact_data = (
    pd.read_csv('RBD_variants_escape_exact.csv', na_filter=None)
    .query('library == "avg4muts"')
    .query('concentration in [1]')
    .reset_index(drop=True)
    )

We'll make the comparison on a log scale and clip IC90s at 50, since larger values are likely to be far outside the dynamic range of the concentrations used.

In [47]:
ic90_corrs = pd.DataFrame({'correlation' : [], 
                           'conc_set' : []}
                            )

max_ic90 = 50
for s in conc_sets:
    with open(f'scipy_results/noisy_{s}conc_3muts.pkl', 'rb') as f:
        model = pickle.load(f)
    
    ic90s = (exact_data[['aa_substitutions', 'IC90']]
         .assign(IC90=lambda x: x['IC90'].clip(upper=max_ic90))
         .drop_duplicates()
         )
    ic90s = model.filter_variants_by_seen_muts(ic90s)
    ic90s = model.icXX(ic90s, x=0.9, col='predicted_IC90', max_c=max_ic90)

    ic90s = (
        ic90s
        .assign(log_IC90=lambda x: np.log10(x['IC90']),
            predicted_log_IC90=lambda x: np.log10(x['predicted_IC90']),
            )
    )

    corr = ic90s['log_IC90'].corr(ic90s['predicted_log_IC90'])
    print(f"Correlation is {corr:.2f} for model fit with {s}.")
    
    ic90_corrs = pd.concat([ic90_corrs,
                            pd.DataFrame({'correlation' : corr,
                                          'conc_set' : [str(s)]})])

Correlation is 0.97 for model fit with [0.125].
Correlation is 0.99 for model fit with [0.125, 0.25].
Correlation is 0.99 for model fit with [0.125, 0.25, 0.5].
Correlation is 1.00 for model fit with [0.125, 0.25, 0.5, 1].
Correlation is 1.00 for model fit with [0.125, 0.25, 0.5, 1, 2].
Correlation is 1.00 for model fit with [0.125, 0.25, 0.5, 1, 2, 4].


In [48]:
chart = alt.Chart(ic90_corrs).mark_bar().encode(
    x= alt.X('conc_set:O', axis=alt.Axis(labels=False), 
             sort=alt.EncodingSortField('x', order='descending')),
    y='correlation:Q',
    color=alt.Color('conc_set:N', 
                    sort=alt.EncodingSortField('color', order='descending')),
    tooltip = ['conc_set', 'correlation']
).properties(width=125, height=200, title='predicted vs. true IC90')
chart

IC90 prediction was good across all concentration sets (correlation of at least 0.97), even when only a single concentration was used. However, recovery of the true beta coefficients depended strongly on the number of concentrations used.

Using 4 concentrations (i.e., `[0.125, 0.25, 0.5, 1]`) appears to be nearly as good as using all 6. Next, let's test (1) how a single but higher concentration performs and (2) whether a subset of 2-3 concentrations spanning a larger range can be just as good as all 6 concentrations.
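The sets tested in this notebook are hand-picked. As a rough sketch of how candidate subsets could be listed and compared by the range they span, the snippet below enumerates all 2- and 3-concentration subsets and computes a fold range for each. The helper `fold_range` and the ranking rule are assumptions made for illustration, not part of `polyclonal`.

```python
from itertools import combinations

conc = [0.125, 0.25, 0.5, 1, 2, 4]

# All subsets of size 2 or 3 drawn from the six simulated concentrations.
candidates = [
    list(subset)
    for size in (2, 3)
    for subset in combinations(conc, size)
]


def fold_range(subset):
    """Hypothetical helper: ratio of the highest to the lowest concentration."""
    return max(subset) / min(subset)


# Rank smaller sets first, and within a size prefer the widest fold range.
ranked = sorted(candidates, key=lambda s: (len(s), -fold_range(s)))

print(f"{len(candidates)} candidate subsets")
for s in ranked[:3]:
    print(s, f"{fold_range(s):g}-fold")
```

For example, `[0.25, 4]` spans a 16-fold range, whereas `[0.5, 2]` spans only a 4-fold range.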

In [49]:
conc_sets = [[1],
             [4],
             [0.5,2],
             [0.25,4],
             [0.5,1,2],
             [0.25,1,4]]

# if model is already fit, don't fit again
to_fit = []
for s in conc_sets: 
    if os.path.exists(f'scipy_results/noisy_{s}conc_3muts.pkl'):
        print(f"Model with {s} was already fit.")
    else:
        to_fit.append(s)
                
for s in to_fit:
    model, time_elapsed = fit_polyclonal(s)
    # use a context manager so the file handle is closed after writing
    with open(f'scipy_results/noisy_{s}conc_3muts.pkl', 'wb') as f:
        pickle.dump(model, f)
    print(f"Model with {s} fit in {time_elapsed:.1f} seconds.")

Model with [1] was already fit.
Model with [4] was already fit.
Model with [0.5, 2] was already fit.
Model with [0.25, 4] was already fit.
Model with [0.5, 1, 2] was already fit.
Model with [0.25, 1, 4] was already fit.


Again, let's look at the correlation between the predicted and true beta coefficients for each of the fit models.

In [50]:
conc_sets_to_plot = [[1],
                     [4],
                     [0.5,2],
                     [0.25,4],
                     [0.5,1,2],
                     [0.25,1,4],
                     [0.125,0.25,0.5,1,2,4]]

all_corrs = pd.DataFrame({'epitope' : [], 
                          'correlation' : [], 
                          'conc_set' : []}
                        )

for s in conc_sets_to_plot:
    with open(f'scipy_results/noisy_{s}conc_3muts.pkl', 'rb') as f:
        model = pickle.load(f)

    mut_escape_pred = (
        pd.read_csv('RBD_mut_escape_df.csv')
        .merge((model.mut_escape_df
                .assign(epitope=lambda x: 'class ' + x['epitope'].astype(str))
                .rename(columns={'escape': 'predicted escape'})
                ),
               on=['mutation', 'epitope'],
               validate='one_to_one',
               )
        )

    corr = (mut_escape_pred
            .groupby('epitope')
            .apply(lambda x: x['escape'].corr(x['predicted escape']))
            .rename('correlation')
            .reset_index()
            )
    all_corrs = pd.concat([all_corrs, 
                           corr.assign(conc_set = [str(s)] * len(corr.index))]
                         )
all_corrs.head()

Unnamed: 0,epitope,correlation,conc_set
0,class 1,0.806796,[1]
1,class 2,0.945004,[1]
2,class 3,0.908317,[1]
0,class 1,0.720604,[4]
1,class 2,0.939581,[4]


In [51]:
chart = alt.Chart(all_corrs).mark_bar().encode(
    x= alt.X('conc_set:O', axis=alt.Axis(labels=False), 
             sort=alt.EncodingSortField('x', order='descending')),
    y='correlation:Q',
    color=alt.Color('conc_set:N', 
                    sort=alt.EncodingSortField('color', order='descending')),
    column='epitope:N',
    tooltip = ['conc_set', 'correlation']
).properties(width=125, height=200, title='predicted vs. true beta coefficients')
chart

And again, let's look at the correlation between the predicted and true IC90s for each of the fit models.

In [52]:
ic90_corrs = pd.DataFrame({'correlation' : [], 
                           'conc_set' : []}
                            )

max_ic90 = 50
for s in conc_sets_to_plot:
    with open(f'scipy_results/noisy_{s}conc_3muts.pkl', 'rb') as f:
        model = pickle.load(f)
    
    ic90s = (exact_data[['aa_substitutions', 'IC90']]
         .assign(IC90=lambda x: x['IC90'].clip(upper=max_ic90))
         .drop_duplicates()
         )
    ic90s = model.filter_variants_by_seen_muts(ic90s)
    ic90s = model.icXX(ic90s, x=0.9, col='predicted_IC90', max_c=max_ic90)

    ic90s = (
        ic90s
        .assign(log_IC90=lambda x: np.log10(x['IC90']),
            predicted_log_IC90=lambda x: np.log10(x['predicted_IC90']),
            )
    )

    corr = ic90s['log_IC90'].corr(ic90s['predicted_log_IC90'])
    print(f"Correlation is {corr:.2f} for model fit with {s}.")
    
    ic90_corrs = pd.concat([ic90_corrs,
                            pd.DataFrame({'correlation' : corr,
                                          'conc_set' : [str(s)]})])

Correlation is 0.99 for model fit with [1].
Correlation is 0.98 for model fit with [4].
Correlation is 1.00 for model fit with [0.5, 2].
Correlation is 1.00 for model fit with [0.25, 4].
Correlation is 1.00 for model fit with [0.5, 1, 2].
Correlation is 1.00 for model fit with [0.25, 1, 4].
Correlation is 1.00 for model fit with [0.125, 0.25, 0.5, 1, 2, 4].


In [53]:
chart = alt.Chart(ic90_corrs).mark_bar().encode(
    x= alt.X('conc_set:O', axis=alt.Axis(labels=False), 
             sort=alt.EncodingSortField('x', order='descending')),
    y='correlation:Q',
    color=alt.Color('conc_set:N', 
                    sort=alt.EncodingSortField('color', order='descending')),
    tooltip = ['conc_set', 'correlation']
).properties(width=125, height=200, title='predicted vs. true IC90')
chart

Interestingly, the model trained with `[1]` is quite good and outperforms both the model trained with `[4]` and the model trained previously with `[0.125]` alone. Additionally, the models trained with `[0.25, 4]` and `[0.25, 1, 4]` are nearly indistinguishable from the model trained on all 6 concentrations. However, there is a trend that having 3 instead of 2 concentrations improves performance on the class 1 epitope, which is expected to be the hardest to predict since it has the lowest wildtype activity. Thus, having more concentrations could lead to more confident beta coefficients for subdominant epitopes.

In summary:  
   - 2-3 concentrations are more than sufficient, and should be chosen strategically to span a range such that the same variant generally escapes at one concentration but not at another.
   - Even 1 concentration can work, but it must be chosen carefully: a concentration that is too low (most variants escape) or too high (few variants escape) reduces predictive performance.
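
One way to check whether a candidate set of concentrations spans a useful range is to tabulate, per concentration, the fraction of variants that escape. The sketch below does this on a toy data frame with the same column names as `noisy_data`. The values are made up, and the `ESCAPE_CUTOFF` of 0.5 is a hypothetical threshold chosen for illustration, not one used elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for the simulated data (illustrative values only).
toy = pd.DataFrame({
    "aa_substitutions": ["", "K417N", "E484K", "G446V"] * 3,
    "concentration": [0.25] * 4 + [1.0] * 4 + [4.0] * 4,
    "prob_escape": [0.60, 0.90, 0.95, 0.80,
                    0.05, 0.40, 0.70, 0.30,
                    0.00, 0.02, 0.10, 0.01],
})

ESCAPE_CUTOFF = 0.5  # hypothetical threshold for calling a variant "escaped"

# Fraction of variants above the cutoff at each concentration.
frac_escaping = (
    toy.assign(escapes=lambda x: x["prob_escape"] > ESCAPE_CUTOFF)
    .groupby("concentration")["escapes"]
    .mean()
)
print(frac_escaping)

# Variants that escape at the lowest concentration but not at the highest;
# these are the ones whose measurements differ across the chosen range.
wide = toy.pivot(
    index="aa_substitutions", columns="concentration", values="prob_escape"
) > ESCAPE_CUTOFF
n_informative = int((wide[0.25] & ~wide[4.0]).sum())
print(f"{n_informative} of {len(wide)} variants escape at 0.25 but not at 4.0")
```

A concentration at which the fraction is close to 0 or 1 carries little information on its own, which matches the single-concentration caveat above.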