# Bootstrapping model fits
The previous section described fitting a single model.
We may also want confidence estimates for that fit, which we can obtain by bootstrapping the data set.

The overall recommended workflow is to first fit models to all the data to determine the number of epitopes, etc.
Then once the desired fitting parameters are determined, you can bootstrap to get confidence on predictions.
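
As a rough illustration of what bootstrapping the data set means, the sketch below resamples a small list of variants with replacement to create new data sets of the same size. It uses only the Python standard library; the `variants` list and the `bootstrap_sample` helper are hypothetical names made up for this example and are not part of `polyclonal`, which handles the resampling internally when the bootstrapped models are created.

```python
import random


def bootstrap_sample(variants, rng):
    """Return a resample of `variants` of the same size, drawn with replacement."""
    return rng.choices(variants, k=len(variants))


# hypothetical toy data: (amino-acid substitutions, measured escape)
variants = [
    ("K417N", 0.9),
    ("E484K", 0.8),
    ("G446V", 0.4),
    ("", 0.0),
    ("N501Y", 0.1),
]

rng = random.Random(1)  # seeded so the resampling is reproducible
samples = [bootstrap_sample(variants, rng) for _ in range(3)]

for i, sample in enumerate(samples, start=1):
    n_unique = len(set(sample))
    print(f"sample {i}: {len(sample)} variants, {n_unique} unique")
```

Because sampling is with replacement, each resample typically repeats some variants and omits others, and it is this variation between resamples that gives the spread in the fitted parameters.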

## Get model fit to the data
The first step is to fit a `Polyclonal` model to all the data we are using.
We do this much as in the previous notebook for our RBD example, but first shrink the data set to just 7500 variants to add more "error" and so better illustrate the bootstrapping.

We call this model, which is fit to all the data we are using, the "root" model because it is the starting point (root) for the subsequent bootstrapping.
Note that the data (which we will bootstrap) are attached to this pre-fit model:

In [None]:
# NBVAL_IGNORE_OUTPUT

import pandas as pd

import polyclonal

# read the data, and just make "barcode" the numerical rank of the variants
noisy_data = (
    pd.read_csv("RBD_variants_escape_noisy.csv", na_filter=None)
    .query('library == "avg3muts"')
    .query("concentration in [0.25, 1, 4]")
    .sort_values(["concentration", "aa_substitutions"])
    .reset_index(drop=True)
    .assign(barcode=lambda x: x.groupby("concentration").cumcount())
)

# just keep some variants to make fitting "noisier"
n_keep = 7500
barcodes_to_keep = (
    noisy_data["barcode"]
    .drop_duplicates()
    .sample(n_keep, random_state=1).tolist()
)
noisy_data = noisy_data.query("barcode in @barcodes_to_keep")

# make and fit the root Polyclonal object with all the data we are using
root_poly = polyclonal.Polyclonal(
    data_to_fit=noisy_data,
    activity_wt_df=pd.DataFrame.from_records(
        [
            ("1", 1.0),
            ("2", 3.0),
            ("3", 2.0),
        ],
        columns=["epitope", "activity"],
    ),
    site_escape_df=pd.DataFrame.from_records(
        [
            ("1", 417, 10.0),
            ("2", 484, 10.0),
            ("3", 444, 10.0),
        ],
        columns=["epitope", "site", "escape"],
    ),
    data_mut_escape_overlap="fill_to_data",
)

opt_res = root_poly.fit(logfreq=100)

## Create and fit bootstrapped models
To create the bootstrapped models, we initialize a `PolyclonalCollection`, here using just 10 samples for speed (for better error estimates you may want more, on the order of 20 to 100).
Note that it is important that the root model you are using has already been fit to the data!

In [None]:
n_bootstrap_samples = 10

bootstrap_poly = polyclonal.PolyclonalCollection(
    root_polyclonal=root_poly,
    n_bootstrap_samples=n_bootstrap_samples,
)

Now fit the bootstrapped models:

In [None]:
# NBVAL_IGNORE_OUTPUT

import time

start = time.time()
print(f"Starting fitting bootstrap models at {time.asctime()}")
n_fit, n_failed = bootstrap_poly.fit_models()
print(f"Fitting took {time.time() - start:.3g} seconds, finished at {time.asctime()}")
assert n_failed == 0 and n_fit == n_bootstrap_samples

In [None]:
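# temporary cell: load previously fit bootstrapped models from a pickle file
# so that we don't have to re-run the fitting above

import pickle

import pandas as pd

import polyclonal

# to create the pickle file, open it with mode "wb" and instead run:
#     pickle.dump(bootstrap_poly, f)
with open("_temp_bootstrap_poly.pickle", "rb") as f:
    bootstrap_poly = pickle.load(f)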
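
A slightly more robust version of this kind of caching is sketched below: load the pickled object if the file exists, otherwise compute it and write the pickle. The `load_or_compute` helper, the `expensive_fit` stand-in, and the file name are all hypothetical and only illustrate the pattern; in the notebook the computed object would be the fitted `bootstrap_poly`.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Load a pickled object from `path`, or compute and pickle it if absent."""
    if os.path.isfile(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = compute()
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj


calls = []  # records how many times the expensive step actually runs


def expensive_fit():
    """Stand-in for a slow fitting step (hypothetical)."""
    calls.append(1)
    return {"n_fit": 10, "n_failed": 0}


with tempfile.TemporaryDirectory() as tmpdir:
    cache = os.path.join(tmpdir, "bootstrap_cache.pickle")
    first = load_or_compute(cache, expensive_fit)  # computes and writes the cache
    second = load_or_compute(cache, expensive_fit)  # loads from the cache

print(f"expensive_fit ran {len(calls)} time(s)")
```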

## Look at summarized results
We can get the resulting measurements for the epitope activities and mutation effects both per-replicate and summarized across replicates (mean, median, standard deviation).
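
To make the relationship between the per-replicate and summarized values concrete, the sketch below collapses some made-up per-replicate epitope activities into a mean, median, and standard deviation per epitope. The numbers and the `replicate_rows` name are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption of this sketch rather than a statement about what `polyclonal` computes.

```python
import statistics
from collections import defaultdict

# hypothetical per-replicate values: (epitope, bootstrap_replicate, activity)
replicate_rows = [
    ("1", 1, 1.0),
    ("1", 2, 1.4),
    ("1", 3, 1.2),
    ("2", 1, 3.0),
    ("2", 2, 2.6),
    ("2", 3, 3.1),
]

# group the activities by epitope
by_epitope = defaultdict(list)
for epitope, _replicate, activity in replicate_rows:
    by_epitope[epitope].append(activity)

# summarize each epitope across its replicates
summary = {
    epitope: {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample std; an assumption of this sketch
    }
    for epitope, values in by_epitope.items()
}

for epitope, stats in summary.items():
    print(epitope, {name: round(value, 2) for name, value in stats.items()})
```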

### Epitope activities
Epitope activities for each replicate:

In [None]:
# NBVAL_IGNORE_OUTPUT
bootstrap_poly.activity_wt_df_replicates.round(1).head()

Epitope activities summarized across replicates.
The `std` column gives the standard deviation:

In [None]:
bootstrap_poly.activity_wt_df.round(1)

We can plot the epitope activities summarized across replicates.
The dropdown allows you to choose the summary stat (mean, median), and the black lines indicate the standard deviation.
Mouse over for values:

In [None]:
bootstrap_poly.activity_wt_barplot()

### Mutation escape values
Mutation escape values for each replicate:

In [None]:
# NBVAL_IGNORE_OUTPUT
bootstrap_poly.mut_escape_df_replicates.round(1).head()

Mutation escape values summarized across replicates.
Note the `frac_bootstrap_replicates` column has the fraction of bootstrap replicates with a value for this mutation:

In [None]:
bootstrap_poly.mut_escape_df.round(1).head(n=3)

We can plot the mutation escape values across replicates.
The dropdown selects the statistic shown in the heatmap (mean or median), and mouseovers give details on points.
Here we set `min_frac_bootstrap_replicates=0.9` to only report escape values observed in at least 90% of bootstrap replicates (this gets rid of rare mutations):

In [None]:
bootstrap_poly.mut_escape_heatmap(min_frac_bootstrap_replicates=0.9)
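
The filtering on the fraction of bootstrap replicates can be pictured with the sketch below. A mutation carried by only a few variants may be absent from some resampled data sets, so it gets an escape value in only some replicates. The mutation sets here are made up, and the names `replicate_mutations`, `frac`, and `kept` are hypothetical; this illustrates the idea and is not the code `polyclonal` uses.

```python
from collections import Counter

n_replicates = 10

# hypothetical: the set of mutations with an escape value in each replicate
replicate_mutations = [{"E484K", "K417N"} for _ in range(n_replicates)]
replicate_mutations[0].add("G446V")  # rare mutation present in only 1 replicate
replicate_mutations[3].discard("K417N")  # missing from 1 replicate

# fraction of replicates in which each mutation has a value
counts = Counter(m for muts in replicate_mutations for m in muts)
frac = {m: n / n_replicates for m, n in counts.items()}

# keep only mutations seen in at least 90% of replicates
min_frac = 0.9
kept = sorted(m for m, f in frac.items() if f >= min_frac)

print(frac)
print(kept)
```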

### Site summaries of mutation escape
Site summaries of mutation escape values for replicates:

In [None]:
# NBVAL_IGNORE_OUTPUT
bootstrap_poly.mut_escape_site_summary_df_replicates.round(1).head()

Site summaries of mutation escape values summarized (e.g., averaged) across replicates.
Note that the `metric` column now indicates a different row for each site-summary metric type, which is then summarized by its mean, median, and standard deviation:

In [None]:
bootstrap_poly.mut_escape_site_summary_df.round(1).head()
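
The long format with a `metric` column can be sketched as follows: compute each site metric within each replicate, then summarize every (site, metric) pair across replicates. The two metrics used here (the mean and the max of the mutation escape values at a site) are illustrative assumptions and not necessarily the metrics `polyclonal` reports, and the data are made up.

```python
import statistics
from collections import defaultdict

# hypothetical per-replicate mutation escape: (replicate, site, mutation, escape)
rows = [
    (1, 484, "E484K", 2.0),
    (1, 484, "E484Q", 1.0),
    (2, 484, "E484K", 3.0),
    (2, 484, "E484Q", 1.0),
]

# step 1: collect the escape values for each (replicate, site)
per_rep_site = defaultdict(list)
for rep, site, _mutation, escape in rows:
    per_rep_site[(rep, site)].append(escape)

# step 2: one long-format row per (replicate, site, metric)
metric_funcs = {"mean": statistics.mean, "max": max}  # illustrative metrics
long_rows = [
    (rep, site, metric, func(values))
    for (rep, site), values in per_rep_site.items()
    for metric, func in metric_funcs.items()
]

# step 3: summarize each (site, metric) across replicates
grouped = defaultdict(list)
for _rep, site, metric, value in long_rows:
    grouped[(site, metric)].append(value)

site_summary = {
    key: {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample std; an assumption of this sketch
    }
    for key, values in grouped.items()
}

for (site, metric), stats in site_summary.items():
    print(site, metric, {name: round(value, 2) for name, value in stats.items()})
```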