# Library mutation rate

We’ll use simulated data to show how the average mutation rate of variants in a library affects the performance of `Polyclonal` models. 

In [7]:
import os
import pickle 

import pandas as pd
import numpy as np
import altair as alt
import polyclonal

We have simulated four libraries with noisy data measured at three serum concentrations (0.25, 1, and 4). The number of mutations per variant in each library follows a Poisson distribution. The libraries differ in their average number of mutations per variant (1, 2, 3, or 4) and are named accordingly (`avg1muts` through `avg4muts`).
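To get a feel for what these averages imply, the sketch below draws Poisson-distributed mutation counts and reports the fraction of variants that are unmutated, singly mutated, or multiply mutated. It is only an illustration: the library size `n_variants` and the random seed are arbitrary assumptions, and this is not the code used to simulate the libraries in this notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only
n_variants = 100_000  # hypothetical library size, not taken from the notebook

fractions = {}
for mean_muts in [1, 2, 3, 4]:
    # Number of mutations carried by each simulated variant
    n_muts = rng.poisson(lam=mean_muts, size=n_variants)
    fractions[mean_muts] = {
        'unmutated': (n_muts == 0).mean(),
        'single': (n_muts == 1).mean(),
        'multiple': (n_muts >= 2).mean(),
    }
    summary = ", ".join(f"{k}={v:.3f}" for k, v in fractions[mean_muts].items())
    print(f"avg{mean_muts}muts: {summary}")
```

Under a Poisson distribution with mean 1, only about a quarter of variants carry two or more mutations, whereas with mean 4 the large majority do.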

In [2]:
noisy_data = (
    pd.read_csv('RBD_variants_escape_noisy.csv', na_filter=None)
    .query('concentration in [0.25, 1, 4]')
    .reset_index(drop=True)
    )

noisy_data

| | library | aa_substitutions | concentration | prob_escape | IC90 |
|---|---|---|---|---|---|
| 0 | avg1muts | | 0.25 | 0.087480 | 0.1128 |
| 1 | avg1muts | | 0.25 | 0.034240 | 0.1128 |
| 2 | avg1muts | | 0.25 | 0.037880 | 0.1128 |
| 3 | avg1muts | | 0.25 | 0.035730 | 0.1128 |
| 4 | avg1muts | | 0.25 | 0.000000 | 0.1128 |
| ... | ... | ... | ... | ... | ... |
| 359995 | avg2muts | Y473E L518F D427L | 4.00 | 0.002918 | 1.1600 |
| 359996 | avg1muts | Y473S G413Q | 4.00 | 0.000000 | 0.5780 |
| 359997 | avg1muts | Y473V P479R F392W | 4.00 | 0.160200 | 1.4550 |
| 359998 | avg3muts | Y489Q N501Y | 4.00 | 0.000000 | 0.5881 |
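As a quick sanity check on data like this, one could count the substitutions in each variant and average them per library. The sketch below does so on a small toy frame that mimics the `library` and `aa_substitutions` columns. It assumes that unmutated variants are stored as empty strings and that substitutions are space-separated, as the rows shown above suggest. The `n_subs` column name is introduced here for illustration.

```python
import pandas as pd

# Toy rows mimicking the `library` and `aa_substitutions` columns
toy = pd.DataFrame({
    'library': ['avg1muts', 'avg1muts', 'avg2muts', 'avg3muts'],
    'aa_substitutions': ['', 'Y473S G413Q', 'Y473E L518F D427L', 'Y489Q N501Y'],
})

# An empty string splits into an empty list, so unmutated variants count as 0
toy['n_subs'] = toy['aa_substitutions'].str.split().str.len()

# Average number of substitutions per variant in each library
mean_subs = toy.groupby('library')['n_subs'].mean()
print(mean_subs)
```

Applied to the full `noisy_data` frame, the same two lines should give per-library means close to the nominal 1, 2, 3, and 4, although individual variants can carry more or fewer mutations than their library's average.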


We’ll fit a `Polyclonal` model to each library. Here, we initialize the model with prior knowledge: a wildtype activity for each epitope and one key escape site per epitope.

In [12]:
avg_mut_rates = [1,2,3,4]

# Make a directory to house pickled models
os.makedirs('fit_polyclonal_models', exist_ok=True)
       
def fit_polyclonal(n):
    """
    Fit `Polyclonal` model to the library with an average of `n` mutations per variant.
    Returns fit `Polyclonal` object.
    """
    poly_abs = polyclonal.Polyclonal(data_to_fit=noisy_data.query(f"library == 'avg{n}muts'"),
                                     activity_wt_df=pd.DataFrame.from_records(
                                         [('1', 1.0),
                                          ('2', 3.0),
                                          ('3', 2.0),
                                          ],
                                         columns=['epitope', 'activity'],
                                         ),
                                     site_escape_df=pd.DataFrame.from_records(
                                         [('1', 417, 10.0),
                                          ('2', 484, 10.0),
                                          ('3', 444, 10.0),
                                          ],
                                         columns=['epitope', 'site', 'escape'],
                                         ),
                                     data_mut_escape_overlap='fill_to_data',
                                 )
    poly_abs.fit()
    return poly_abs

# Store all fit models in a dictionary for future lookup
fit_models = {}

for n in avg_mut_rates:
    # These are the keys for fit models
    model_string = f'noisy_[0.25, 1, 4]conc_{n}muts'

    pickle_path = f'fit_polyclonal_models/{model_string}.pkl'

    # If the pickled model exists in fit_polyclonal_models directory,
    # load it and update fit_models
    if os.path.exists(pickle_path):
        with open(pickle_path, 'rb') as f:
            model = pickle.load(f)
        fit_models[model_string] = model
        print(f"Model with {n} was already fit.")
    else:
        # Else, fit a model using fit_polyclonal(), save it to the
        # fit_polyclonal_models directory, and update fit_models
        model = fit_polyclonal(n)
        fit_models[model_string] = model
        with open(pickle_path, 'wb') as f:
            pickle.dump(model, f)
        print(f"Model with {n} fit and saved.")

Model with 1 was already fit.
Model with 2 was already fit.
Model with 3 was already fit.
Model with 4 was already fit.


We can look at the correlation between predicted and true beta coefficients (mutation effects at each epitope) for the fit models.

In [14]:
all_corrs = pd.DataFrame({'epitope' : [], 
                          'correlation' : [], 
                          'mutation_rate' : []})

for n in avg_mut_rates:
    model = fit_models[f'noisy_[0.25, 1, 4]conc_{n}muts']

    mut_escape_pred = (
        pd.read_csv('RBD_mut_escape_df.csv')
        .merge((model.mut_escape_df
                .assign(epitope=lambda x: 'class ' + x['epitope'].astype(str))
                .rename(columns={'escape': 'predicted escape'})
                ),
               on=['mutation', 'epitope'],
               validate='one_to_one',
               )
        )

    corr = (mut_escape_pred
            .groupby('epitope')
            .apply(lambda x: x['escape'].corr(x['predicted escape']))
            .rename('correlation')
            .reset_index()
            )
    
    # Label these correlations with the library they came from
    all_corrs = pd.concat([all_corrs,
                           corr.assign(mutation_rate=f'avg{n}muts'),
                           ])

In [15]:
# NBVAL_IGNORE_OUTPUT
alt.Chart(all_corrs).mark_circle(size=125).encode(
    x= alt.X('mutation_rate:O', 
             sort=alt.EncodingSortField('x', order='descending')),
    y='correlation:Q',
    column='epitope:N',
    tooltip = ['mutation_rate', alt.Tooltip('correlation', format='.3f')],
    color=alt.Color('epitope', legend=None),
).properties(width=200, height=200, title='predicted vs. true beta coefficients')

Now we’ll fit a `Polyclonal` model to each library again, this time without the prior-knowledge initialization. We specify only the number of epitopes (`n_epitopes = 3`).

In [16]:
# Make a directory to house pickled models
os.makedirs('fit_polyclonal_models', exist_ok=True)
       
def fit_polyclonal(n):
    """
    Fit `Polyclonal` model to the library with an average of `n` mutations per variant.
    Returns fit `Polyclonal` object.
    """
    poly_abs = polyclonal.Polyclonal(data_to_fit=noisy_data.query(f"library == 'avg{n}muts'"),
                                     n_epitopes = 3,
                                 )
    poly_abs.fit()
    return poly_abs

# Store all fit models in a dictionary for future lookup
fit_models = {}

for n in avg_mut_rates:
    # These are the keys for fit models
    model_string = f'noisy_[0.25, 1, 4]conc_{n}muts_noinit'

    pickle_path = f'fit_polyclonal_models/{model_string}.pkl'

    # If the pickled model exists in fit_polyclonal_models directory,
    # load it and update fit_models
    if os.path.exists(pickle_path):
        with open(pickle_path, 'rb') as f:
            model = pickle.load(f)
        fit_models[model_string] = model
        print(f"Model with {n} was already fit.")
    else:
        # Else, fit a model using fit_polyclonal(), save it to the
        # fit_polyclonal_models directory, and update fit_models
        model = fit_polyclonal(n)
        fit_models[model_string] = model
        with open(pickle_path, 'wb') as f:
            pickle.dump(model, f)
        print(f"Model with {n} fit and saved.")

Model with 1 fit and saved.
Model with 2 fit and saved.
Model with 3 fit and saved.
Model with 4 fit and saved.


Because the epitope numbers are arbitrary in these fits, we can no longer directly correlate predicted and true beta coefficients for each epitope. We can, however, still visualize the escape heatmaps.
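This notebook does not attempt it, but one possible way to handle arbitrary labels is to try every assignment of fitted epitopes to true epitopes and keep the one with the highest total correlation. The sketch below shows the idea on synthetic arrays. The function `best_epitope_matching` and the toy data are hypothetical, and applying it here would first require pivoting the true and fitted escape values into arrays of shape `(n_epitopes, n_mutations)`.

```python
from itertools import permutations

import numpy as np


def best_epitope_matching(true_betas, fit_betas):
    """Match fitted epitopes to true epitopes by maximizing total correlation.

    Both arguments are arrays of shape (n_epitopes, n_mutations).
    Returns (matching, mean_corr), where matching[i] is the index of the
    fitted epitope assigned to true epitope i.
    """
    n = true_betas.shape[0]
    # corr[i, j] = Pearson correlation of true epitope i with fitted epitope j
    corr = np.corrcoef(true_betas, fit_betas)[:n, n:]
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(corr[i, j] for i, j in enumerate(perm)),
    )
    mean_corr = float(np.mean([corr[i, j] for i, j in enumerate(best)]))
    return best, mean_corr


# Toy example: fitted epitopes are the true ones in shuffled order plus noise
rng = np.random.default_rng(0)
true_betas = rng.normal(size=(3, 50))
fit_betas = true_betas[[2, 0, 1]] + rng.normal(scale=0.1, size=(3, 50))

matching, mean_corr = best_epitope_matching(true_betas, fit_betas)
print(matching, round(mean_corr, 3))
```

Exhaustive search is fine for three epitopes, since there are only six permutations. A correlation obtained this way is optimistic by construction, because the matching is chosen to maximize it.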

### Average of 1 mutation per variant

In [20]:
pickle.load(open(f'fit_polyclonal_models/noisy_[0.25, 1, 4]conc_1muts_noinit.pkl', 'rb')).mut_escape_heatmap()

### Average of 2 mutations per variant

In [21]:
pickle.load(open(f'fit_polyclonal_models/noisy_[0.25, 1, 4]conc_2muts_noinit.pkl', 'rb')).mut_escape_heatmap()

### Average of 3 mutations per variant

In [22]:
pickle.load(open(f'fit_polyclonal_models/noisy_[0.25, 1, 4]conc_3muts_noinit.pkl', 'rb')).mut_escape_heatmap()

### Average of 4 mutations per variant

In [23]:
pickle.load(open(f'fit_polyclonal_models/noisy_[0.25, 1, 4]conc_4muts_noinit.pkl', 'rb')).mut_escape_heatmap()

## Summary

An average of at least 2 mutations per variant is needed to infer the beta coefficients for all epitopes targeted by polyclonal antibodies. When there is an average of 1 mutation per variant, the correlation between predicted and true beta coefficients is highest for the most immunodominant epitope (epitope 2) and lowest for the most subdominant epitope (epitope 1). This is expected: a variant with a single mutation in a subdominant epitope is still neutralized by antibodies targeting the other epitopes, so it shows little measurable escape.

(this is outdated)
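To illustrate why a single mutation in a subdominant epitope produces little measurable escape, the sketch below evaluates escape probabilities for a few hypothetical variants. It assumes the functional form described in the `polyclonal` documentation, in which the escape probability is a product over epitopes of $1 / (1 + c \exp(-\phi_e))$ with $\phi_e = -a_{\rm{wt},e} + \sum_m \beta_{m,e}$. That form is not stated in this notebook. The activities are the initialization values used earlier, and the escape effect `beta = 10.0` is an illustrative assumption, not a fitted value.

```python
import numpy as np

# Wildtype activities per epitope (the initialization values used earlier)
activity_wt = {'1': 1.0, '2': 3.0, '3': 2.0}
beta = 10.0  # hypothetical effect of one strong escape mutation at its epitope


def prob_escape(escaped_epitopes, c):
    """Probability that a variant escapes all epitopes at concentration c."""
    p = 1.0
    for epitope, a_wt in activity_wt.items():
        phi = -a_wt + (beta if epitope in escaped_epitopes else 0.0)
        p *= 1.0 / (1.0 + c * np.exp(-phi))  # fraction unbound at this epitope
    return p


c = 1.0
p_unmutated = prob_escape(set(), c)
p_single_sub = prob_escape({'1'}, c)  # one mutation in subdominant epitope 1
p_single_dom = prob_escape({'2'}, c)  # one mutation in dominant epitope 2
p_triple = prob_escape({'1', '2', '3'}, c)  # mutations in all three epitopes

for name, p in [('unmutated', p_unmutated), ('epitope 1 only', p_single_sub),
                ('epitope 2 only', p_single_dom), ('all epitopes', p_triple)]:
    print(f"{name}: {p:.4f}")
```

Under these assumptions, escaping only epitope 1 leaves the variant almost fully neutralized by antibodies to epitopes 2 and 3, so its signal is hard to distinguish from noise. Escaping the dominant epitope 2 gives a larger, though still modest, escape probability.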