# Library mutation rate

We’ll use simulated data to show how the average mutation rate of variants in a DMS library affects the performance of `Polyclonal` models. 

In [1]:
import time
import os

import pandas as pd
import numpy as np
import altair as alt
import polyclonal

We simulated four libraries with noisy measurements, and here we use the data at three serum concentrations (0.25, 1, and 4). The number of mutations per variant in each library follows a Poisson distribution. The libraries differ in their average number of mutations per variant (1, 2, 3, or 4) and are named accordingly (`avg1muts` through `avg4muts`).
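To make the Poisson assumption concrete, the following minimal sketch computes the fraction of variants expected to carry zero, one, or multiple mutations at each average mutation rate. It is not part of the original analysis; the helper names `poisson_pmf` and `frac_multiple_mutants` are hypothetical and introduced here only for illustration.

```python
import math


def poisson_pmf(k, mean):
    """Probability that a variant carries exactly k mutations (Poisson)."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def frac_multiple_mutants(mean):
    """Fraction of variants expected to carry two or more mutations."""
    return 1.0 - poisson_pmf(0, mean) - poisson_pmf(1, mean)


for mean in [1, 2, 3, 4]:
    print(
        f"avg{mean}muts: "
        f"unmutated={poisson_pmf(0, mean):.3f}, "
        f"single={poisson_pmf(1, mean):.3f}, "
        f"multiple={frac_multiple_mutants(mean):.3f}"
    )
```

With an average of 1 mutation per variant, only about a quarter of variants (1 - 2/e ≈ 0.264) carry more than one mutation, so variants that combine mutations in different epitopes are comparatively rare in that library.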

In [2]:
noisy_data = (
    pd.read_csv('RBD_variants_escape_noisy.csv', na_filter=None)
    .query('concentration in [0.25, 1, 4]')
    .reset_index(drop=True)
    )

noisy_data

| | library | aa_substitutions | concentration | prob_escape | IC90 |
|---|---|---|---|---|---|
| 0 | avg1muts | | 0.25 | 0.087480 | 0.1128 |
| 1 | avg1muts | | 0.25 | 0.034240 | 0.1128 |
| 2 | avg1muts | | 0.25 | 0.037880 | 0.1128 |
| 3 | avg1muts | | 0.25 | 0.035730 | 0.1128 |
| 4 | avg1muts | | 0.25 | 0.000000 | 0.1128 |
| ... | ... | ... | ... | ... | ... |
| 359995 | avg2muts | Y473E L518F D427L | 4.00 | 0.002918 | 1.1600 |
| 359996 | avg1muts | Y473S G413Q | 4.00 | 0.000000 | 0.5780 |
| 359997 | avg1muts | Y473V P479R F392W | 4.00 | 0.160200 | 1.4550 |
| 359998 | avg3muts | Y489Q N501Y | 4.00 | 0.000000 | 0.5881 |


We’ll fit a `Polyclonal` model to each library.

In [5]:
avg_mut_rates = [1,2,3,4]

# Store all fit models in a dictionary for future lookup
fit_models = {}

for n in avg_mut_rates:
    # key name for model
    model_string = f'noisy_[0.25, 1, 4]conc_{n}muts'
    
    poly_abs = polyclonal.Polyclonal(data_to_fit=noisy_data.query(f"library == 'avg{n}muts'"),
                                     activity_wt_df=pd.DataFrame.from_records(
                                         [('1', 1.0),
                                          ('2', 3.0),
                                          ('3', 2.0),
                                          ],
                                         columns=['epitope', 'activity'],
                                         ),
                                     site_escape_df=pd.DataFrame.from_records(
                                         [('1', 417, 10.0),
                                          ('2', 484, 10.0),
                                          ('3', 444, 10.0),
                                          ],
                                         columns=['epitope', 'site', 'escape'],
                                         ),
                                     data_mut_escape_overlap='fill_to_data',
                                 )
    print(f"Fitting model on library with variants containing {n} average mutation(s).")
    poly_abs.fit()
    fit_models[model_string] = poly_abs

Fitting model on library with variants containing 1 average mutation(s).
Fitting model on library with variants containing 2 average mutation(s).
Fitting model on library with variants containing 3 average mutation(s).
Fitting model on library with variants containing 4 average mutation(s).


For each fitted model, we compute the correlation between the predicted and true beta coefficients (mutation effects at each epitope).

In [6]:
all_corrs = pd.DataFrame({'epitope' : [], 
                          'correlation' : [], 
                          'mutation_rate' : []})

for n in avg_mut_rates:
    model = fit_models[f'noisy_[0.25, 1, 4]conc_{n}muts']

    mut_escape_pred = (
        pd.read_csv('RBD_mut_escape_df.csv')
        .merge((model.mut_escape_df
                .assign(epitope=lambda x: 'class ' + x['epitope'].astype(str))
                .rename(columns={'escape': 'predicted escape'})
                ),
               on=['mutation', 'epitope'],
               validate='one_to_one',
               )
        )

    corr = (mut_escape_pred
            .groupby('epitope')
            .apply(lambda x: x['escape'].corr(x['predicted escape']))
            .rename('correlation')
            .reset_index()
            )
    
    # label each per-epitope correlation with the library it came from
    all_corrs = pd.concat([all_corrs,
                           corr.assign(mutation_rate=f'avg{n}muts')])

In [7]:
# NBVAL_IGNORE_OUTPUT
alt.Chart(all_corrs).mark_circle(size=125).encode(
    x= alt.X('mutation_rate:O', 
             sort=alt.EncodingSortField('x', order='descending')),
    y='correlation:Q',
    column='epitope:N',
    tooltip = ['mutation_rate', alt.Tooltip('correlation', format='.3f')],
    color=alt.Color('epitope', legend=None),
).properties(width=200, height=200, title='predicted vs. true beta coefficients')

## Summary

In these simulations, an average of at least 2 mutations per variant is needed to infer the beta coefficients for all epitopes targeted by the polyclonal antibodies. With an average of 1 mutation per variant, the correlation between predicted and true beta coefficients is highest for epitope 2, the most immunodominant, and lowest for epitope 1, the most subdominant. This is expected: a variant with a single mutation in a subdominant epitope is still neutralized by antibodies targeting the other epitopes, so it shows little measurable escape.
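The reasoning about subdominant epitopes can be illustrated with a small worked example. This is a sketch under stated assumptions rather than the library's implementation: it assumes the probability of escape is a product over epitopes of `1 / (1 + c * exp(-phi))`, where `phi` is the negative wild-type activity plus the summed escape of the variant's mutations at that epitope. The numbers are illustrative only: the activities reuse the values 1, 3, and 2 passed to the models above, and each mutation is assumed to contribute an escape of 10 at the single epitope it affects. The function `prob_escape` is a hypothetical helper, not part of the `polyclonal` API.

```python
import math

# Illustrative assumptions, not fitted results: wild-type activity per epitope
# and a fixed escape contribution for a mutation in that epitope.
ACTIVITY_WT = {"1": 1.0, "2": 3.0, "3": 2.0}
ESCAPE_PER_MUTATION = 10.0


def prob_escape(mutated_epitopes, concentration=1.0):
    """Assumed escape probability for a variant mutated in the given epitopes."""
    p = 1.0
    for epitope, activity in ACTIVITY_WT.items():
        escape = ESCAPE_PER_MUTATION if epitope in mutated_epitopes else 0.0
        phi = -activity + escape
        # fraction of virions not bound by antibodies targeting this epitope
        p *= 1.0 / (1.0 + concentration * math.exp(-phi))
    return p


cases = [
    ("unmutated", set()),
    ("mutation in epitope 1", {"1"}),
    ("mutation in epitope 2", {"2"}),
    ("mutations in epitopes 1+2", {"1", "2"}),
    ("mutations in all epitopes", {"1", "2", "3"}),
]
for label, mutated in cases:
    print(f"{label:<28} {prob_escape(mutated):.4f}")
```

Under these assumptions, a single mutation in epitope 1 leaves the escape probability below 1%, because antibodies against epitopes 2 and 3 still neutralize the variant. The effect of a mutation in epitope 1 becomes much larger once the other epitopes are also mutated, which is why libraries with more mutations per variant carry more information about subdominant epitopes.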