# Bootstrapping to assess fits
The previous section describes fitting a single model, but we may also want confidence estimates for the fit.
We can get them by bootstrapping the data set.

Here we illustrate bootstrapping on the simulated RBD data using the noisy data with an average of 2 mutations per gene.
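
As a reminder of what bootstrapping the data set means: each bootstrap replicate draws as many variants as the original data set, sampling with replacement, so some variants appear several times and others not at all. The block below is a minimal sketch of that idea in plain Python, not the resampling code used by `polyclonal`; `toy_variants` and `bootstrap_resample` are hypothetical names and the values are invented for illustration.

```python
import random

# Toy stand-in for the variants table; the values are invented for illustration.
toy_variants = [
    {"aa_substitutions": "", "prob_escape": 0.01},
    {"aa_substitutions": "K417N", "prob_escape": 0.20},
    {"aa_substitutions": "E484K", "prob_escape": 0.55},
    {"aa_substitutions": "K417N E484K", "prob_escape": 0.80},
]


def bootstrap_resample(rows, seed):
    """Draw len(rows) rows with replacement using a seeded generator."""
    rng = random.Random(seed)
    return rng.choices(rows, k=len(rows))


# The same seed reproduces the same replicate exactly.
replicate_0 = bootstrap_resample(toy_variants, seed=0)
replicate_0_again = bootstrap_resample(toy_variants, seed=0)
```

Fitting one model per replicate and looking at the spread of the fitted parameters across replicates gives the confidence estimate.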

## First fit a model to all the data
We begin by fitting a model to all the data.
We call this the "root model" because it is used as the starting point for the bootstrapping:

In [3]:
import numpy

import pandas as pd

import polyclonal

data = (
    pd.read_csv("RBD_variants_escape_noisy.csv", na_filter=None)
    .query('library == "avg2muts"')
    .query("concentration in [0.25, 1, 4]")
    .reset_index(drop=True)
)

root_poly = polyclonal.Polyclonal(
    data_to_fit=data,
    activity_wt_df=pd.DataFrame.from_records(
        [
            ("1", 1.0),
            ("2", 3.0),
            ("3", 2.0),
        ],
        columns=["epitope", "activity"],
    ),
    site_escape_df=pd.DataFrame.from_records(
        [
            ("1", 417, 10.0),
            ("2", 484, 10.0),
            ("3", 444, 10.0),
        ],
        columns=["epitope", "site", "escape"],
    ),
    data_mut_escape_overlap="fill_to_data",
)

Fit the model:

In [4]:
# NBVAL_IGNORE_OUTPUT
_ = root_poly.fit(logfreq=100)

# First fitting site-level model.
# Starting optimization of 522 parameters at Fri Mar 11 16:01:05 2022.
       step   time_sec       loss   fit_loss reg_escape  regspread
          0   0.053556     9144.4     9144.2    0.29701          0
        100     5.0608     1337.1     1333.6     3.5314          0
        200     10.215     1313.4       1309     4.3625          0
        300     15.004     1305.1     1300.1     5.0451          0
        400     19.924     1301.6       1296     5.6744          0
        500      24.61       1298     1291.9     6.0788          0
        600     29.105     1297.1     1290.7     6.3842          0
        700     33.772     1296.4     1289.5     6.8421          0
        800     38.364       1296       1289     7.0159          0
        900     43.102     1295.6     1288.5     7.1281          0
       1000     47.747     1295.4     1288.2     7.1467          0
       1100     52.429     1295.2       1288      7.186          0
       1200     57.137  

## Now fit bootstrapped models
To fit the bootstrapped models, we initialize a `PolyclonalCollection`:

In [4]:
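# `PolyclonalCollection` must be imported from the `polyclonal` package, and
# `n_samps` and `n_threads` must be defined, before running this cell.
boot_poly = PolyclonalCollection(
    root_polyclonal=root_poly,  # the root model fit above
    n_bootstrap_samples=n_samps,
    n_threads=n_threads,
    seed=0,
)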

In [5]:
rbd_pc_a.fit_models(fit_site_level_first=False)
rbd_pc_b.fit_models(fit_site_level_first=False)

In [6]:
rbd_pc_a_copy.fit_models(fit_site_level_first=False)
rbd_pc_b_copy.fit_models(fit_site_level_first=False)

## Test that the seed is respected (different seeds give different results)

Checking the mutation-frequency dictionary is not a suitable test here.
With so many multiply mutated variants, we are unlikely to find a mutation that is not sampled by at least one model.
For the seed tests, we instead check that each seed gives different summary statistics.

In [7]:
(
    rbd_pc_a_escape_dict,
    rbd_pc_a_activity_wt_dict,
) = rbd_pc_a.summarize_bootstrapped_params()
(
    rbd_pc_b_escape_dict,
    rbd_pc_b_activity_wt_dict,
) = rbd_pc_b.summarize_bootstrapped_params()
(
    rbd_pc_a_copy_escape_dict,
    rbd_pc_a_copy_activity_wt_dict,
) = rbd_pc_a_copy.summarize_bootstrapped_params()
(
    rbd_pc_b_copy_escape_dict,
    rbd_pc_b_copy_activity_wt_dict,
) = rbd_pc_b_copy.summarize_bootstrapped_params()
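
The assertions that follow index these summaries by `"mean"`, `"median"`, and `"std"`. As a rough illustration of what such a summary contains, the sketch below computes the three statistics for each parameter across bootstrap replicates using only the standard library. It is not how `summarize_bootstrapped_params` is implemented; `replicate_estimates` and `summarize` are hypothetical, the numbers are invented, and the use of the sample standard deviation is an assumption.

```python
import statistics

# Hypothetical escape estimates for two mutations from three bootstrap replicates.
replicate_estimates = {
    "K417N": [1.0, 2.0, 3.0],
    "E484K": [4.0, 4.0, 7.0],
}


def summarize(values):
    """Summarize one parameter across replicates (sample std is an assumption)."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),
    }


summary = {mutation: summarize(vals) for mutation, vals in replicate_estimates.items()}
```

A parameter with a small standard deviation relative to its mean is one the replicates agree on; a large spread flags an estimate that depends heavily on which variants were sampled.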

In [8]:
# Test to see if inferred params are the same with the same seed.
assert rbd_pc_a_escape_dict["mean"].equals(rbd_pc_a_copy_escape_dict["mean"])
assert rbd_pc_a_escape_dict["median"].equals(rbd_pc_a_copy_escape_dict["median"])
assert rbd_pc_a_escape_dict["std"].equals(rbd_pc_a_copy_escape_dict["std"])

In [9]:
assert rbd_pc_a_activity_wt_dict["mean"].equals(rbd_pc_a_copy_activity_wt_dict["mean"])
assert rbd_pc_a_activity_wt_dict["median"].equals(
    rbd_pc_a_copy_activity_wt_dict["median"]
)
assert rbd_pc_a_activity_wt_dict["std"].equals(rbd_pc_a_copy_activity_wt_dict["std"])

In [10]:
# Test to see if inferred params are the same with same seed and different thread count
assert rbd_pc_b_escape_dict["mean"].equals(rbd_pc_b_copy_escape_dict["mean"])
assert rbd_pc_b_escape_dict["median"].equals(rbd_pc_b_copy_escape_dict["median"])
assert rbd_pc_b_escape_dict["std"].equals(rbd_pc_b_copy_escape_dict["std"])

assert rbd_pc_b_activity_wt_dict["mean"].equals(rbd_pc_b_copy_activity_wt_dict["mean"])
assert rbd_pc_b_activity_wt_dict["median"].equals(
    rbd_pc_b_copy_activity_wt_dict["median"]
)
assert rbd_pc_b_activity_wt_dict["std"].equals(rbd_pc_b_copy_activity_wt_dict["std"])
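
One way to get results that do not depend on the thread count is to derive each replicate's random seed from the base seed and the replicate index, never from which thread happens to run it. The sketch below shows that pattern with the standard library; it is an assumption about how such a guarantee can be achieved, not a description of the internals of `PolyclonalCollection`, and `fit_replicate` and `fit_all` are hypothetical stand-ins.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fit_replicate(seed_label):
    """Stand-in for fitting one bootstrapped model: returns a seeded pseudo-estimate."""
    rng = random.Random(seed_label)
    return rng.gauss(0.0, 1.0)


def fit_all(base_seed, n_replicates, n_threads):
    """Fit all replicates in a thread pool with scheduling-independent seeds."""
    # Each seed depends only on the base seed and the replicate index.
    seed_labels = [f"{base_seed}:{i}" for i in range(n_replicates)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() yields results in input order regardless of completion order.
        return list(pool.map(fit_replicate, seed_labels))


one_thread = fit_all(base_seed=0, n_replicates=8, n_threads=1)
four_threads = fit_all(base_seed=0, n_replicates=8, n_threads=4)
```

Here `one_thread` and `four_threads` hold the same values in the same order, which is the property the assertions above check for the real models.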

In [11]:
# Make sure inferred params are different with different seeds
assert not rbd_pc_a_escape_dict["mean"].equals(rbd_pc_b_escape_dict["mean"])
assert not rbd_pc_a_escape_dict["median"].equals(rbd_pc_b_escape_dict["median"])
assert not rbd_pc_a_escape_dict["std"].equals(rbd_pc_b_escape_dict["std"])

In [12]:
assert not rbd_pc_a_activity_wt_dict["mean"].equals(rbd_pc_b_activity_wt_dict["mean"])
assert not rbd_pc_a_activity_wt_dict["median"].equals(
    rbd_pc_b_activity_wt_dict["median"]
)
assert not rbd_pc_a_activity_wt_dict["std"].equals(rbd_pc_b_activity_wt_dict["std"])

In [13]:
test_df = rbd_data.sample(n=200, random_state=0)
pc_a_preds = pd.concat(rbd_pc_a.make_predictions(test_df))
pc_b_preds = pd.concat(rbd_pc_b.make_predictions(test_df))
pc_a_copy_preds = pd.concat(rbd_pc_a_copy.make_predictions(test_df))
pc_b_copy_preds = pd.concat(rbd_pc_b_copy.make_predictions(test_df))

In [14]:
assert pc_a_preds.equals(pc_a_copy_preds)

In [15]:
assert not pc_a_preds.equals(pc_b_preds)

In [16]:
# Test threads for reproducibility
assert pc_b_preds.equals(pc_b_copy_preds)