# Specifying epitopes

We’ll use simulated data to show how incorrectly guessing the number of true epitopes affects the performance of `Polyclonal` models.

In [1]:
import os
import pickle

import pandas as pd
import polyclonal

First, we read in a simulated “noisy” dataset containing 30,000 variants, each measured at three different serum concentrations. The variants in this library were simulated to contain a Poisson-distributed number of mutations, with an average of three mutations per gene.

In [2]:
noisy_data = (
    pd.read_csv('RBD_variants_escape_noisy.csv', na_filter=None)
    .query('library == "avg3muts"')
    .query('concentration in [0.25, 1, 4]')
    .reset_index(drop=True)
    )
noisy_data

Unnamed: 0,library,aa_substitutions,concentration,prob_escape,IC90
0,avg3muts,,0.25,0.00000,0.1128
1,avg3muts,,0.25,0.01090,0.1128
2,avg3muts,,0.25,0.01458,0.1128
3,avg3muts,,0.25,0.09465,0.1128
4,avg3muts,,0.25,0.03299,0.1128
...,...,...,...,...,...
89995,avg3muts,Y449I L518Y C525R L461I,4.00,0.02197,2.3100
89996,avg3muts,Y449V K529R N394R,4.00,0.04925,0.9473
89997,avg3muts,Y451L N481T F490V,4.00,0.02315,0.9301
89998,avg3muts,Y453R V483G L492V N501P I332P,4.00,0.00000,5.0120
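As a quick sanity check on the library description, one could count the mutations per variant and the number of measurements per concentration. The sketch below uses a small hypothetical table (`toy`) rather than the real data, and assumes that unmutated variants carry an empty string in `aa_substitutions`, as in the first rows shown above.

```python
import pandas as pd

# Hypothetical stand-in for `noisy_data`; only the two columns used here.
toy = pd.DataFrame({
    'aa_substitutions': ['', 'Y449I L518Y', 'Y451L N481T F490V', 'N501P'],
    'concentration': [0.25, 0.25, 1.0, 4.0],
})


def count_mutations(subs):
    """Count space-delimited substitutions; an empty string means wildtype."""
    return len(subs.split())


toy['n_muts'] = toy['aa_substitutions'].map(count_mutations)

# Average number of mutations per variant (expected near 3 for the real library).
mean_muts = toy['n_muts'].mean()

# Number of measurements at each concentration.
variants_per_conc = toy.groupby('concentration').size()
```

Applied to `noisy_data`, the same two summaries should show a mean near three mutations and equal counts at each concentration, if the dataset matches its description.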


Additionally, we’ll make a directory for storing our fit models as [pickle](https://docs.python.org/3/library/pickle.html#module-pickle) files, so that we can conveniently load them in the future without having to fit again.

In [3]:
os.makedirs('fit_polyclonal_models', exist_ok=True)
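The load-or-fit pattern used throughout this notebook can be factored into a small helper. This is only a sketch: `load_or_fit` and `fake_fit` are hypothetical names, not part of `polyclonal`, and a plain dictionary stands in for a fit model.

```python
import os
import pickle
import tempfile


def load_or_fit(path, fit_func):
    """Return (object, was_cached): load a pickle if present, else fit and save."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f), True
    obj = fit_func()
    with open(path, 'wb') as f:
        pickle.dump(obj, f)
    return obj, False


calls = []


def fake_fit():
    """Hypothetical stand-in for an expensive model fit."""
    calls.append(1)
    return {'n_epitopes': 3}


with tempfile.TemporaryDirectory() as tmpdir:
    cache_path = os.path.join(tmpdir, 'toy_model.pkl')
    first, first_cached = load_or_fit(cache_path, fake_fit)
    second, second_cached = load_or_fit(cache_path, fake_fit)
```

The second call reads the pickle instead of calling `fake_fit` again, which is the behavior the cells below rely on. Pickle files should only be loaded from trusted sources.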

We’ll start by correctly initializing a `Polyclonal` model with three epitopes and fitting to the data. We know from prior work the three most important epitopes and a key mutation in each, so we use this prior knowledge to “seed” initial guesses that assign large escape values to a key site in each epitope:

- site 417 for class 1 epitope, which is often the least important

- site 484 for class 2 epitope, which is often the dominant one

- site 444 for class 3 epitope, which is often the second most dominant one
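One way to keep the two seed tables consistent is to build both from a single list. The sketch below assumes the same initial activities (1.0, 3.0, 2.0) and seed escape value (10.0) that are used in the next cell; the `seeds` list and `seed_escape` name are illustrative only.

```python
import pandas as pd

# (epitope label, key site, initial wildtype activity guess)
seeds = [
    ('1', 417, 1.0),
    ('2', 484, 3.0),
    ('3', 444, 2.0),
]
seed_escape = 10.0  # large initial escape value assigned to each key site

activity_wt_df = pd.DataFrame.from_records(
    [(epitope, activity) for epitope, _, activity in seeds],
    columns=['epitope', 'activity'],
)
site_escape_df = pd.DataFrame.from_records(
    [(epitope, site, seed_escape) for epitope, site, _ in seeds],
    columns=['epitope', 'site', 'escape'],
)

# Epitope with the largest initial activity guess.
dominant = (
    activity_wt_df
    .sort_values('activity', ascending=False)['epitope']
    .iloc[0]
)
```

Because both data frames come from the same list, every epitope with an initial activity also has exactly one seeded site.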

In [7]:
# Key identifying this fit model
model_string = 'noisy_[0.25, 1, 4]conc_3muts_3epitopes'
model_path = f'fit_polyclonal_models/{model_string}.pkl'

if os.path.exists(model_path):
    # Load the previously fit model from the fit_polyclonal_models directory.
    with open(model_path, 'rb') as f:
        model = pickle.load(f)
    print('Model with 3 epitopes specified was already fit.')
else:
    # Otherwise, fit a model and save it to the fit_polyclonal_models directory.
    model = polyclonal.Polyclonal(
        data_to_fit=noisy_data,
        activity_wt_df=pd.DataFrame.from_records(
            [('1', 1.0),
             ('2', 3.0),
             ('3', 2.0),
             ],
            columns=['epitope', 'activity'],
        ),
        site_escape_df=pd.DataFrame.from_records(
            [('1', 417, 10.0),
             ('2', 484, 10.0),
             ('3', 444, 10.0),
             ],
            columns=['epitope', 'site', 'escape'],
        ),
        data_mut_escape_overlap='fill_to_data',
    )
    opt_res = model.fit(logfreq=500)
    with open(model_path, 'wb') as f:
        pickle.dump(model, f)

Model with 3 epitopes specified was already fit.


In [8]:
import itertools

epitopes = model.mut_escape_site_summary_df['epitope'].unique()
all_sites = True

# Validate the epitopes against those present in the model.
if epitopes is None:
    epitopes = model.mut_escape_site_summary_df['epitope'].unique().tolist()
elif not set(epitopes).issubset(model.mut_escape_site_summary_df['epitope']):
    raise ValueError('invalid entries in `epitopes`')

df = model.mut_escape_site_summary_df.query('epitope in @epitopes')

# Every column other than the identifiers is a site-level escape metric.
escape_metrics = [m for m in df.columns
                  if m not in {'epitope', 'site', 'wildtype'}]

# Optionally include every site in the range, not only those with data.
sites = df['site'].unique().tolist()
if all_sites:
    sites = list(range(min(sites), max(sites) + 1))

# Fill in all site/epitope combinations, then reshape so that each epitope
# is a column and each row is a (site, wildtype, metric) combination.
df = (df
      .merge(pd.DataFrame(itertools.product(sites, epitopes),
                          columns=['site', 'epitope']),
             on=['site', 'epitope'], how='right')
      .sort_values('site')
      .melt(id_vars=['epitope', 'site', 'wildtype'],
            var_name='metric',
            value_name='escape'
            )
      .pivot_table(index=['site', 'wildtype', 'metric'],
                   values='escape',
                   columns='epitope',
                   dropna=False)
      .reset_index()
      )

In [18]:
list(range(min(sites), max(sites) + 1, 5))

[331, 336, 341, 346, 351, 356, 361, 366, 371, 376, 381, 386, 391, 396,
 401, 406, 411, 416, 421, 426, 431, 436, 441, 446, 451, 456, 461, 466,
 471, 476, 481, 486, 491, 496, 501, 506, 511, 516, 521, 526, 531]

In [38]:
import altair as alt

zoom_bar_width = 500

# Interval brush along the x-axis, used to select a range of sites.
zoom_brush = alt.selection_interval(
    encodings=['x'],
    mark=alt.BrushConfig(stroke='black', strokeWidth=2),
)

# Thin gray bar spanning all sites, on which the brush is drawn.
zoom_bar = (
    alt.Chart(df)
    .mark_rect(color='gray')
    .encode(x='site:O')
    # values=list(range(min(sites), max(sites) + 1, 5))
    .add_selection(zoom_brush)
    .configure_axis(labelOverlap='parity')
    .properties(width=zoom_bar_width, height=25, title='site zoom bar')
)

In [39]:
zoom_bar

In [5]:
model.activity_wt_barplot()

In [6]:
# NBVAL_IGNORE_OUTPUT
model.mut_escape_heatmap()

As expected, the mutation escape values, $\beta_{m,e}$, and wildtype activity values, $a_{wt,e}$, inferred by the model closely match the "true" values.
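Agreement between inferred and "true" values can also be quantified rather than judged by eye, for example with a Pearson correlation. The numbers below are made up purely for illustration, and the column names `true` and `inferred` are hypothetical; with real data, the two columns would come from the simulation parameters and the fit model, respectively.

```python
import pandas as pd

# Hypothetical escape values for four mutations at one epitope.
compare = pd.DataFrame({
    'mutation': ['E484K', 'E484Q', 'F490V', 'N501P'],
    'true': [0.0, 2.0, 4.0, 1.0],
    'inferred': [0.1, 1.9, 4.2, 0.8],
})

# Pearson correlation between true and inferred escape values.
r = compare['true'].corr(compare['inferred'])

# Largest absolute deviation between the two sets of values.
max_abs_error = (compare['true'] - compare['inferred']).abs().max()
```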

Now, we'll try initializing a `Polyclonal` model with 2 epitopes and fitting to the data instead. We'll ignore the most subdominant class 1 epitope to simulate a scenario where we did not have any prior knowledge of its mutations having an effect on escape. 

In [52]:
# Key identifying this fit model
model_string = 'noisy_[0.25, 1, 4]conc_3muts_2epitopes'
model_path = f'fit_polyclonal_models/{model_string}.pkl'

if os.path.exists(model_path):
    # Load the previously fit model from the fit_polyclonal_models directory.
    with open(model_path, 'rb') as f:
        model = pickle.load(f)
    print('Model with 2 epitopes specified was already fit.')
else:
    # Otherwise, fit a model and save it to the fit_polyclonal_models directory.
    model = polyclonal.Polyclonal(
        data_to_fit=noisy_data,
        activity_wt_df=pd.DataFrame.from_records(
            [('2', 2.0),
             ('3', 1.0),
             ],
            columns=['epitope', 'activity'],
        ),
        site_escape_df=pd.DataFrame.from_records(
            [('2', 484, 10.0),
             ('3', 444, 10.0),
             ],
            columns=['epitope', 'site', 'escape'],
        ),
        data_mut_escape_overlap='fill_to_data',
    )
    opt_res = model.fit(logfreq=500)
    with open(model_path, 'wb') as f:
        pickle.dump(model, f)

Model with 2 epitopes specified was already fit.


In [53]:
model.activity_wt_barplot()

In [54]:
model.mut_escape_heatmap()

This model identifies the correct escape mutations at epitopes 2 and 3, while the escape mutations in the class 1 epitope are absent. Interestingly, the wildtype activity values, $a_{wt,e}$, for epitopes 2 and 3 are slightly more positive than their "true" values.

Next, we'll try initializing a `Polyclonal` model with 4 epitopes and fitting to the data. We'll first try initializing an additional epitope with a site that is not in any of the other epitopes and has no effect on escape.

In [55]:
# Key identifying this fit model
model_string = 'noisy_[0.25, 1, 4]conc_3muts_4epitopes'
model_path = f'fit_polyclonal_models/{model_string}.pkl'

if os.path.exists(model_path):
    # Load the previously fit model from the fit_polyclonal_models directory.
    with open(model_path, 'rb') as f:
        model = pickle.load(f)
    print('Model with 4 epitopes specified was already fit.')
else:
    # Otherwise, fit a model and save it to the fit_polyclonal_models directory.
    model = polyclonal.Polyclonal(
        data_to_fit=noisy_data,
        activity_wt_df=pd.DataFrame.from_records(
            [('1', 2.0),
             ('2', 4.0),
             ('3', 3.0),
             ('4', 1.0),
             ],
            columns=['epitope', 'activity'],
        ),
        site_escape_df=pd.DataFrame.from_records(
            [('1', 417, 10.0),
             ('2', 484, 10.0),
             ('3', 444, 10.0),
             ('4', 386, 10.0),
             ],
            columns=['epitope', 'site', 'escape'],
        ),
        data_mut_escape_overlap='fill_to_data',
    )
    opt_res = model.fit(logfreq=500)
    with open(model_path, 'wb') as f:
        pickle.dump(model, f)

Model with 4 epitopes specified was already fit.


In [56]:
model.activity_wt_barplot()

In [57]:
model.mut_escape_heatmap()

For the class 1, 2, and 3 epitopes, the mutation escape values, $\beta_{m,e}$, and wildtype activity values, $a_{wt,e}$, inferred by the model closely match the "true" values. The redundant fourth epitope contains no clear escape mutations and has a strongly negative wildtype activity value, $a_{wt,e}$. This suggests either that antibodies always remain tightly bound to the epitope because mutations there have no antigenic effect, or that no antibodies in the polyclonal mix target this epitope.
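To see why a strongly negative $a_{wt,e}$ makes an epitope contribute little to the fit, it helps to evaluate the per-epitope term numerically. The sketch below assumes the functional form given in the `polyclonal` documentation, $U_e = 1 / (1 + c\,e^{-\phi_e})$ with $\phi_e = -a_{wt,e} + \sum_m \beta_{m,e}$, where escape probability is the product of $U_e$ over epitopes. The function name and the activity values of -5 and 3 are illustrative choices, not values taken from the fit models.

```python
import math


def unbound_fraction(activity_wt, concentration, escape_sum=0.0):
    """Per-epitope term U_e = 1 / (1 + c * exp(-phi)), phi = -a_wt + sum(beta)."""
    phi = -activity_wt + escape_sum
    return 1.0 / (1.0 + concentration * math.exp(-phi))


concentrations = [0.25, 1, 4]

# Strongly negative activity: U_e stays near 1 at every concentration,
# so the epitope barely changes the product over epitopes.
redundant = [unbound_fraction(-5.0, c) for c in concentrations]

# Clearly positive activity: U_e is small, so the epitope strongly
# reduces escape unless mutations raise phi.
active = [unbound_fraction(3.0, c) for c in concentrations]
```

Under this assumed form, an epitope with near-zero $\beta_{m,e}$ and very negative $a_{wt,e}$ multiplies every variant's escape probability by a factor close to one, which is what makes it effectively redundant.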

Lastly, we'll try initializing an additional epitope with a site that is in another epitope. Specifically, we'll seed it with site 460, which is in the class 1 epitope and has many escape mutations.

In [58]:
# Key identifying this fit model
model_string = 'noisy_[0.25, 1, 4]conc_3muts_4epitopes_2'
model_path = f'fit_polyclonal_models/{model_string}.pkl'

if os.path.exists(model_path):
    # Load the previously fit model from the fit_polyclonal_models directory.
    with open(model_path, 'rb') as f:
        model = pickle.load(f)
    print('Model with 4 epitopes specified was already fit.')
else:
    # Otherwise, fit a model and save it to the fit_polyclonal_models directory.
    model = polyclonal.Polyclonal(
        data_to_fit=noisy_data,
        activity_wt_df=pd.DataFrame.from_records(
            [('1', 2.0),
             ('2', 4.0),
             ('3', 3.0),
             ('4', 1.0),
             ],
            columns=['epitope', 'activity'],
        ),
        site_escape_df=pd.DataFrame.from_records(
            [('1', 417, 10.0),
             ('2', 484, 10.0),
             ('3', 444, 10.0),
             ('4', 460, 10.0),
             ],
            columns=['epitope', 'site', 'escape'],
        ),
        data_mut_escape_overlap='fill_to_data',
    )
    opt_res = model.fit(logfreq=500)
    with open(model_path, 'wb') as f:
        pickle.dump(model, f)

Model with 4 epitopes specified was already fit.


In [59]:
model.activity_wt_barplot()

In [60]:
model.mut_escape_heatmap()

Again, for the class 1, 2, and 3 epitopes, the mutation escape values, $\beta_{m,e}$, and wildtype activity values, $a_{wt,e}$, inferred by the model closely match the "true" values. In this case, the fourth epitope, which was seeded with a different key site (460) from the class 1 epitope, still contains no clear escape mutations.

## Summary

These simulation experiments provide a general guideline for specifying the number of epitopes. When fitting `Polyclonal` models, one can start with 1 epitope and iteratively fit models with an increasing number of epitopes. At some point, the newly seeded $N$-th epitope will become redundant, as evidenced by a profile of near-zero mutation escape values, $\beta_{m,e}$, and a strongly negative wildtype activity value, $a_{wt,e}$. This indicates that the previous fit model, containing $N - 1$ epitopes, is the one that best captures the data and the polyclonal mix.
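The stopping rule above can be sketched as a simple check on the newest epitope of each successive fit. Everything here is illustrative: `is_redundant`, the two cutoffs, and the `toy_fits` values are assumptions rather than part of `polyclonal` or results from the models above. In practice the inputs would be read from each fit model's activity and escape tables, and the cutoffs chosen by inspecting the plots.

```python
def is_redundant(activity_wt, escape_values,
                 activity_cutoff=-2.0, escape_cutoff=0.5):
    """Flag an epitope with strongly negative activity and near-zero escape.

    The cutoffs are arbitrary illustrative values, not recommended defaults.
    """
    max_escape = max(abs(beta) for beta in escape_values)
    return activity_wt < activity_cutoff and max_escape < escape_cutoff


# Toy summaries: number of epitopes -> (activity, escape values) for the
# most recently added epitope in that fit.
toy_fits = {
    1: (3.0, [0.1, 4.0, 0.2]),
    2: (2.0, [3.5, 0.0, 0.3]),
    3: (1.0, [0.2, 0.1, 2.8]),
    4: (-6.0, [0.05, 0.0, 0.1]),
}

best_n = None
for n in sorted(toy_fits):
    activity, escapes = toy_fits[n]
    if is_redundant(activity, escapes):
        # The N-th epitope adds nothing, so keep the model with N - 1.
        best_n = n - 1
        break
```

With these toy values the fourth epitope is flagged, so the three-epitope model is retained, mirroring the outcome of the simulations above.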