# Bootstrapping model fits
The previous section describes fitting a single model, but we may also want confidence estimates for the fit.
We can get these by bootstrapping the data set.

The recommended workflow is to first fit models to all the data to determine the number of epitopes and other fitting parameters.
Once those parameters are settled, you can bootstrap to get confidence estimates on the predictions.

Here we illustrate bootstrapping on the simulated RBD data using the noisy data with an average of 2 mutations per gene.
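
As a rough illustration of what bootstrapping a data set means, the sketch below draws rows with replacement using pandas `DataFrame.sample`. The helper name `bootstrap_resample` and the toy data are hypothetical, and this is not necessarily how `polyclonal` resamples internally (a real implementation might, for example, resample within each concentration):

```python
import pandas as pd


def bootstrap_resample(df, n_samples, seed=0):
    """Return `n_samples` data frames, each drawn with replacement from `df`.

    Hypothetical helper for illustration only; not part of `polyclonal`.
    """
    return [
        # each sample has as many rows as `df`; rows can repeat or drop out
        df.sample(n=len(df), replace=True, random_state=seed + i).reset_index(
            drop=True
        )
        for i in range(n_samples)
    ]


# made-up toy data with the same kind of columns as the real data set
toy = pd.DataFrame(
    {
        "aa_substitutions": ["", "N331A", "N331D", "N331A N331E"],
        "concentration": [0.25, 0.25, 0.25, 0.25],
        "prob_escape": [0.05, 0.14, 0.40, 0.62],
    }
)

toy_samples = bootstrap_resample(toy, n_samples=3)
```

Each resampled data frame would then be fit separately, and the spread of the fitted values across samples gives the confidence estimate.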

## Get a model fit to all the data
The first step is to fit a `Polyclonal` model to all the data.
We did that in the previous notebook for our RBD example and saved the model using [pickle](https://docs.python.org/3/library/pickle.html), so here we read in that model rather than re-fitting.
We call this the "root" model because it is the starting point (root) for the subsequent bootstrapping.
Note that the data (which we will bootstrap) are attached to this pre-fit model:

In [1]:
# NBVAL_IGNORE_OUTPUT

import pickle

with open("fit_RBD_model.pickle", "rb") as f:
    root_poly = pickle.load(f)

root_poly.data_to_fit.head()

Unnamed: 0,library,aa_substitutions,concentration,prob_escape,IC90
0,avg2muts,,0.25,0.05044,0.1128
1,avg2muts,,0.25,0.1431,0.1128
2,avg2muts,,0.25,0.05452,0.1128
3,avg2muts,,0.25,0.08473,0.1128
4,avg2muts,,0.25,0.04174,0.1128


## Create and fit bootstrapped models
To create the bootstrapped models, we initialize a `PolyclonalCollection`, here using just 10 samples for speed (for better error estimates you may want more, on the order of 20 to 100).
Note that the root model must already have been fit to the data!

In [2]:
import polyclonal

n_bootstrap_samples = 10

bootstrap_poly = polyclonal.PolyclonalCollection(
    root_polyclonal=root_poly,
    n_bootstrap_samples=n_bootstrap_samples,
)

Now fit the models:

In [3]:
# NBVAL_IGNORE_OUTPUT

import time

start = time.time()
print(f"Starting fitting bootstrap models at {time.asctime()}")
n_fit, n_failed = bootstrap_poly.fit_models()
print(f"Fitting took {time.time() - start:.3g} seconds, finished at {time.asctime()}")
assert n_failed == 0 and n_fit == n_bootstrap_samples

## Look at summarized results
We can get the resulting measurements for the epitope activities and mutation effects both per-replicate and summarized across replicates (mean, median, standard deviation).

Epitope activities for each replicate:

In [4]:
# NBVAL_IGNORE_OUTPUT
bootstrap_poly.activity_wt_df_replicates.round(1).head()

Unnamed: 0,epitope,activity,bootstrap_replicate
0,1,1.3,1
1,2,3.2,1
2,3,1.9,1
3,1,1.2,2
4,2,3.2,2


Epitope activities summarized across replicates.
The `std` column gives the standard deviation:

In [5]:
bootstrap_poly.activity_wt_df.round(1)

Unnamed: 0,epitope,mean,median,std
0,1,1.2,1.3,0.1
1,2,3.2,3.2,0.0
2,3,1.9,1.9,0.1
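
To make the relationship between the per-replicate and summarized tables concrete, here is a minimal sketch of how such a summary could be computed with a pandas `groupby`. The values are made up for illustration, and the package may compute its summary differently:

```python
import pandas as pd

# hypothetical per-replicate activities (values are made up)
replicates = pd.DataFrame(
    {
        "epitope": [1, 2, 3, 1, 2, 3],
        "activity": [1.3, 3.2, 1.9, 1.1, 3.2, 2.1],
        "bootstrap_replicate": [1, 1, 1, 2, 2, 2],
    }
)

# one row per epitope, summarizing activity across bootstrap replicates
summary = (
    replicates.groupby("epitope")
    .agg(
        mean=("activity", "mean"),
        median=("activity", "median"),
        std=("activity", "std"),
    )
    .reset_index()
)
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default; the convention used by the package may differ.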


We can plot the epitope activities summarized across replicates.
The dropdown allows you to choose the summary stat (mean, median), and the black lines indicate the standard deviation.
Mouse over for values:

In [6]:
bootstrap_poly.activity_wt_barplot()

Mutation escape values for each replicate:

In [8]:
# NBVAL_IGNORE_OUTPUT
bootstrap_poly.mut_escape_df_replicates.round(1).head()

Unnamed: 0,epitope,site,wildtype,mutant,mutation,escape,bootstrap_replicate
0,1,331,N,A,N331A,0.1,1
1,1,331,N,D,N331D,0.1,1
2,1,331,N,E,N331E,0.3,1
3,1,331,N,F,N331F,0.1,1
4,1,331,N,G,N331G,2.2,1


Mutation escape values summarized across replicates.
Note that the `frac_bootstrap_replicates` column gives the fraction of bootstrap replicates that have a value for this mutation:

In [9]:
bootstrap_poly.mut_escape_df.round(1).head(n=3)

Unnamed: 0,epitope,site,wildtype,mutant,mutation,mean,median,std,n_bootstrap_replicates,frac_bootstrap_replicates
0,1,331,N,A,N331A,0.2,0.1,0.2,10,1.0
1,1,331,N,D,N331D,0.4,0.3,0.4,10,1.0
2,1,331,N,E,N331E,-0.0,-0.1,0.3,10,1.0
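
A value below 1 in `frac_bootstrap_replicates` presumably arises when some resampled data sets contain no variants with that mutation. The sketch below, using made-up values and a hypothetical total of 4 replicates, shows one way such counts and fractions could be derived from a per-replicate table and then used to keep only well-supported mutations. The 0.75 cutoff is an arbitrary choice for illustration:

```python
import pandas as pd

n_total = 4  # hypothetical total number of bootstrap replicates

# made-up per-replicate values; N331D has no value in replicates 3 and 4
per_replicate = pd.DataFrame(
    {
        "mutation": ["N331A"] * 4 + ["N331D"] * 2,
        "escape": [0.1, 0.3, 0.2, 0.2, 0.4, 0.2],
        "bootstrap_replicate": [1, 2, 3, 4, 1, 2],
    }
)

# count the replicates with a value for each mutation
counts = (
    per_replicate.groupby("mutation")
    .agg(
        mean=("escape", "mean"),
        n_bootstrap_replicates=("bootstrap_replicate", "nunique"),
    )
    .reset_index()
)
counts["frac_bootstrap_replicates"] = counts["n_bootstrap_replicates"] / n_total

# keep mutations seen in at least 75% of replicates (arbitrary threshold)
well_supported = counts[counts["frac_bootstrap_replicates"] >= 0.75]
```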


Site summaries of mutation escape values for replicates:

In [10]:
# NBVAL_IGNORE_OUTPUT
bootstrap_poly.mut_escape_site_summary_df_replicates.round(1).head()

Unnamed: 0,epitope,site,wildtype,mean,total positive,max,min,total negative,bootstrap_replicate
0,1,331,N,0.7,11.7,2.2,-0.0,-0.0,1
1,1,332,I,0.8,15.6,2.5,-0.6,-1.1,1
2,1,333,T,0.4,7.3,1.4,-0.5,-0.9,1
3,1,334,N,0.7,12.6,2.3,-0.2,-0.2,1
4,1,335,L,0.1,5.4,1.4,-1.4,-3.7,1


Site summaries of mutation escape values summarized (e.g., averaged) across replicates.
Note that there is now a separate row for each site-summary metric (named in the `metric` column), and each metric is summarized by its mean, median, and standard deviation across replicates:

In [11]:
bootstrap_poly.mut_escape_site_summary_df.round(1).head()

Unnamed: 0,epitope,site,metric,mean,median,std,n_bootstrap_replicates,frac_bootstrap_replicates
0,1,331,max,2.1,2.1,0.3,10,1.0
1,1,331,mean,0.7,0.7,0.1,10,1.0
2,1,331,min,-0.2,-0.2,0.2,10,1.0
3,1,331,total negative,-0.3,-0.3,0.2,10,1.0
4,1,331,total positive,11.1,10.8,1.1,10,1.0
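
The change from one column per metric (in the per-replicate table) to one row per metric (in the summarized table) can be reproduced with a pandas `melt` followed by a `groupby`. This is only a sketch with made-up values and two of the metrics; it is not necessarily how the package builds its table:

```python
import pandas as pd

# hypothetical per-replicate site summaries in wide form (values are made up)
wide = pd.DataFrame(
    {
        "epitope": [1, 1],
        "site": [331, 331],
        "mean": [0.7, 0.5],
        "max": [2.2, 2.0],
        "bootstrap_replicate": [1, 2],
    }
)

# reshape so that each metric becomes its own row
long_df = wide.melt(
    id_vars=["epitope", "site", "bootstrap_replicate"],
    var_name="metric",
    value_name="value",
)

# summarize each metric across bootstrap replicates
site_summary = (
    long_df.groupby(["epitope", "site", "metric"])
    .agg(
        mean=("value", "mean"),
        median=("value", "median"),
        std=("value", "std"),
    )
    .reset_index()
)
```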
