# Bootstrapping model fits
The previous section described fitting a single model.
We may also want confidence estimates for that fit.
We can obtain them by bootstrapping the data set.

The recommended workflow is to first fit models to all the data to determine settings such as the number of epitopes.
Once the desired fitting parameters are determined, you can bootstrap to get confidence estimates for the predictions.

Here we illustrate bootstrapping on the simulated RBD data, using the noisy data set with an average of 2 mutations per gene.
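
As a minimal sketch of what bootstrapping a data set means, the snippet below resamples rows with replacement to the original size, using only the Python standard library. It is a generic illustration, not the `polyclonal` implementation; the function name `bootstrap_sample` and the toy rows are hypothetical.

```python
import random


def bootstrap_sample(rows, seed):
    """Return a resample of `rows`, drawn with replacement, of the same size."""
    rng = random.Random(seed)  # dedicated generator so the draw is reproducible
    return rng.choices(rows, k=len(rows))


# Hypothetical toy rows: (aa_substitutions, prob_escape) pairs
rows = [("", 0.05), ("Y508V", 0.00), ("Y508W", 0.04), ("Y508V A520L", 0.03)]

sample = bootstrap_sample(rows, seed=0)
# `sample` has the same length as `rows`; some rows may repeat and others be absent.
```

Fitting the same model to many such resamples and looking at the spread of the results is what provides the confidence estimate.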

## Get a model fit to all the data
The first step is to fit a `Polyclonal` model to all the data.
We did that in the previous notebook for our RBD example and saved the model using [pickle](https://docs.python.org/3/library/pickle.html), so here we read in that model rather than re-fitting it.
We call this the "root" model because it is the starting point (root) for the subsequent bootstrapping.
Note that the data (which we will bootstrap) are attached to this pre-fit model:

In [1]:
import pickle

with open("fit_RBD_model.pickle", "rb") as f:
    root_poly = pickle.load(f)

root_poly.data_to_fit

| | library | aa_substitutions | concentration | prob_escape | IC90 |
|---|---|---|---|---|---|
| 0 | avg2muts | | 0.25 | 0.05044 | 0.1128 |
| 1 | avg2muts | | 0.25 | 0.14310 | 0.1128 |
| 2 | avg2muts | | 0.25 | 0.05452 | 0.1128 |
| 3 | avg2muts | | 0.25 | 0.08473 | 0.1128 |
| 4 | avg2muts | | 0.25 | 0.04174 | 0.1128 |
| ... | ... | ... | ... | ... | ... |
| 89995 | avg2muts | Y508V | 4.00 | 0.00000 | 0.2531 |
| 89996 | avg2muts | Y508V A520L | 4.00 | 0.03180 | 0.4688 |
| 89997 | avg2muts | Y508V H519N | 4.00 | 0.10630 | 0.5528 |
| 89998 | avg2muts | Y508W | 4.00 | 0.03754 | 0.2285 |
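
For reference, the save-then-reload pattern relied on here is the usual pickle round trip. The sketch below uses a stand-in dictionary and a temporary file, both hypothetical, rather than a real fitted model or the file name used in this notebook.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted model; any picklable Python object works the same way.
fitted = {"name": "toy model", "params": [0.1, 0.2]}

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "toy_model.pickle"  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(fitted, f)  # save once, after the expensive fit
    with open(path, "rb") as f:
        reloaded = pickle.load(f)  # later: reload instead of re-fitting
```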



## Now fit bootstrapped models
To fit the bootstrapped models, we initialize a `PolyclonalCollection`.
Here we use just 5 samples for speed; for good error estimates you may want more, on the order of 20 to 100:

In [2]:
import polyclonal

n_bootstrap_samples = 5

bootstrap_poly = polyclonal.PolyclonalCollection(
    root_polyclonal=root_poly,
    n_bootstrap_samples=n_bootstrap_samples,
)

In [7]:
# NBVAL_IGNORE_OUTPUT

import time

start = time.time()
print(f"Starting fitting bootstrap models at {time.asctime()}")
n_fit, n_failed = bootstrap_poly.fit_models()
print(f"Fitting took {time.time() - start:.2g} seconds, finished at {time.asctime()}")
assert n_failed == 0 and n_fit == n_bootstrap_samples

Starting fitting bootstrap models at Sun Mar 13 16:06:33 2022


ValueError: Mapping matrix does not have a 1-to-1 mapping.

## Tests that the random seed is respected

A test based on the mutation-frequency dictionary is not suitable here.
With so many multiply mutated variants, we may never encounter a mutation that is not sampled by at least one model.
For the seed tests, we instead check that the summary statistics are identical for the same seed and different for different seeds.
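
The logic of these seed checks can be sketched without `polyclonal`. In the toy example below the "model fit" is just the mean of each resample, and `bootstrap_summary` and the data are hypothetical names and values chosen for illustration.

```python
import random
from statistics import mean, median, stdev


def bootstrap_summary(data, n_samples, seed):
    """Summarize a toy statistic (the mean) across seeded bootstrap resamples."""
    rng = random.Random(seed)
    estimates = [
        mean(rng.choices(data, k=len(data)))  # resample with replacement, then "fit"
        for _ in range(n_samples)
    ]
    return {
        "mean": mean(estimates),
        "median": median(estimates),
        "std": stdev(estimates),
    }


# Hypothetical toy data standing in for measured values
data_rng = random.Random(42)
data = [data_rng.random() for _ in range(200)]

summary_a = bootstrap_summary(data, n_samples=5, seed=1)
summary_a_copy = bootstrap_summary(data, n_samples=5, seed=1)
summary_b = bootstrap_summary(data, n_samples=5, seed=2)

assert summary_a == summary_a_copy  # same seed: identical summaries
assert summary_a != summary_b  # different seed: summaries differ
```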

In [None]:
(
    rbd_pc_a_escape_dict,
    rbd_pc_a_activity_wt_dict,
) = rbd_pc_a.summarize_bootstrapped_params()
(
    rbd_pc_b_escape_dict,
    rbd_pc_b_activity_wt_dict,
) = rbd_pc_b.summarize_bootstrapped_params()
(
    rbd_pc_a_copy_escape_dict,
    rbd_pc_a_copy_activity_wt_dict,
) = rbd_pc_a_copy.summarize_bootstrapped_params()
(
    rbd_pc_b_copy_escape_dict,
    rbd_pc_b_copy_activity_wt_dict,
) = rbd_pc_b_copy.summarize_bootstrapped_params()

In [None]:
# Test to see if inferred params are the same with the same seed.
assert rbd_pc_a_escape_dict["mean"].equals(rbd_pc_a_copy_escape_dict["mean"])
assert rbd_pc_a_escape_dict["median"].equals(rbd_pc_a_copy_escape_dict["median"])
assert rbd_pc_a_escape_dict["std"].equals(rbd_pc_a_copy_escape_dict["std"])

In [None]:
assert rbd_pc_a_activity_wt_dict["mean"].equals(rbd_pc_a_copy_activity_wt_dict["mean"])
assert rbd_pc_a_activity_wt_dict["median"].equals(
    rbd_pc_a_copy_activity_wt_dict["median"]
)
assert rbd_pc_a_activity_wt_dict["std"].equals(rbd_pc_a_copy_activity_wt_dict["std"])

In [None]:
# Test to see if inferred params are the same with same seed and different thread count
assert rbd_pc_b_escape_dict["mean"].equals(rbd_pc_b_copy_escape_dict["mean"])
assert rbd_pc_b_escape_dict["median"].equals(rbd_pc_b_copy_escape_dict["median"])
assert rbd_pc_b_escape_dict["std"].equals(rbd_pc_b_copy_escape_dict["std"])

assert rbd_pc_b_activity_wt_dict["mean"].equals(rbd_pc_b_copy_activity_wt_dict["mean"])
assert rbd_pc_b_activity_wt_dict["median"].equals(
    rbd_pc_b_copy_activity_wt_dict["median"]
)
assert rbd_pc_b_activity_wt_dict["std"].equals(rbd_pc_b_copy_activity_wt_dict["std"])

In [None]:
# Make sure inferred params differ between different seeds
assert not rbd_pc_a_escape_dict["mean"].equals(rbd_pc_b_escape_dict["mean"])
assert not rbd_pc_a_escape_dict["median"].equals(rbd_pc_b_escape_dict["median"])
assert not rbd_pc_a_escape_dict["std"].equals(rbd_pc_b_escape_dict["std"])

In [None]:
assert not rbd_pc_a_activity_wt_dict["mean"].equals(rbd_pc_b_activity_wt_dict["mean"])
assert not rbd_pc_a_activity_wt_dict["median"].equals(
    rbd_pc_b_activity_wt_dict["median"]
)
assert not rbd_pc_a_activity_wt_dict["std"].equals(rbd_pc_b_activity_wt_dict["std"])

In [None]:
import pandas as pd

# Predict on the same random subset of variants with each collection
test_df = rbd_data.sample(n=200, random_state=0)
pc_a_preds = pd.concat(rbd_pc_a.make_predictions(test_df))
pc_b_preds = pd.concat(rbd_pc_b.make_predictions(test_df))
pc_a_copy_preds = pd.concat(rbd_pc_a_copy.make_predictions(test_df))
pc_b_copy_preds = pd.concat(rbd_pc_b_copy.make_predictions(test_df))

In [None]:
assert pc_a_preds.equals(pc_a_copy_preds)

In [None]:
assert not pc_a_preds.equals(pc_b_preds)

In [None]:
# Test threads for reproducibility
assert pc_b_preds.equals(pc_b_copy_preds)
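
One common way to make results independent of the thread count is to give each bootstrap replicate its own seed derived from a base seed, so no replicate depends on the order in which others run. The sketch below shows that idea with the standard library; it is an assumption about how such reproducibility can be achieved, not a description of what `polyclonal` does internally, and all names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def fit_replicate(job):
    """Toy 'fit': mean of one bootstrap resample drawn with the replicate's own seed."""
    data, seed = job
    rng = random.Random(seed)
    return mean(rng.choices(data, k=len(data)))


def fit_all(data, base_seed, n_samples, n_threads):
    """Fit all replicates; replicate i always uses seed base_seed + i."""
    jobs = [(data, base_seed + i) for i in range(n_samples)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(fit_replicate, jobs))  # map preserves job order


# Hypothetical toy data standing in for measured values
data_rng = random.Random(0)
data = [data_rng.random() for _ in range(100)]

one_thread = fit_all(data, base_seed=1, n_samples=5, n_threads=1)
four_threads = fit_all(data, base_seed=1, n_samples=5, n_threads=4)

assert one_thread == four_threads  # same seed, different thread count, same results
```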