# Averaging models
Probably the best way to ensure robust inferences and estimate error is not to bootstrap a single model, but rather to actually have multiple experimental replicates, ideally on different libraries.

Here we describe how to average model fits across libraries and/or replicates.

## Split data into replicates
We will use our data for the RBD as in earlier examples, but split it into several libraries / replicates.

Specifically, we will fit two different libraries: `avg3muts` and `avg4muts`, which have different barcodes and also different mutation rates (although of course in real life you might sometimes want to average results from different libraries with the same mutation rates).
We will also simulate having two replicates for each library by drawing bootstrap samples from each library and then dropping the duplicated rows from each sample:

In [1]:
import pandas as pd

import polyclonal.polyclonal
import polyclonal.polyclonal_collection


# read data
all_data = pd.read_csv("RBD_variants_escape_noisy.csv", na_filter=None)

# split by library and replicates
libraries = ["avg3muts", "avg4muts"]  # the two libraries to use
concentrations = [1, 4]  # use just these two concentrations
n_replicates = 2  # number of replicates per library

data_by_replicate = {
    (library, replicate + 1): polyclonal.polyclonal_collection.create_bootstrap_sample(
        all_data.query("library == @library").query("concentration in @concentrations"),
        seed=replicate + 1,
    ).drop_duplicates()
    for library in libraries
    for replicate in range(n_replicates)
}
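
The idea behind these simulated replicates can be sketched with plain pandas. This is only an illustration on a made-up toy table: the helper `pseudo_replicate` and the toy values are hypothetical, and the library's `create_bootstrap_sample` may differ in its details (for example, in how it groups rows before sampling).

```python
import pandas as pd

# Toy variant table: column names mirror the tutorial's data, values are made up.
toy = pd.DataFrame(
    {
        "barcode": ["AAA", "AAC", "AAG", "AAT", "ACA", "ACC"],
        "concentration": [1, 1, 1, 4, 4, 4],
        "prob_escape": [0.10, 0.50, 0.90, 0.05, 0.30, 0.70],
    }
)


def pseudo_replicate(df, seed):
    """Hypothetical helper: resample rows with replacement, keep each row once."""
    sampled = df.sample(frac=1, replace=True, random_state=seed)
    return sampled.drop_duplicates()


# Different seeds give different subsets of the original rows.
rep1 = pseudo_replicate(toy, seed=1)
rep2 = pseudo_replicate(toy, seed=2)
print(len(toy), len(rep1), len(rep2))
```

Because sampling is with replacement, each pseudo-replicate typically retains only a subset of the original variants, so the replicates differ from each other while still coming from the same library.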

## Fit models to each replicate
We now fit a `Polyclonal` model to each replicate using just 2 epitopes, as the data don't seem sufficient to accurately fit all three epitopes.
Then we arrange the models in a data frame:

In [2]:
# first create a data frame with all the models
models_by_replicate = {}
for (library, replicate), data in data_by_replicate.items():
    model = polyclonal.Polyclonal(data_to_fit=data, n_epitopes=2)
    models_by_replicate[(library, replicate)] = model
models_df = (
    pd.Series(models_by_replicate, name="model")
    .rename_axis(["library", "replicate"])
    .reset_index()
)

# now fit the models
n_fit, n_failed, models_df["model"] = polyclonal.polyclonal_collection.fit_models(
    models_df["model"],
    n_threads=2,
)

Note how the models are arranged in a data frame:

In [3]:
# NBVAL_IGNORE_OUTPUT

models_df

| | library | replicate | model |
|---|---|---|---|
| 0 | avg3muts | 1 | <polyclonal.polyclonal.Polyclonal object at 0x... |
| 1 | avg3muts | 2 | <polyclonal.polyclonal.Polyclonal object at 0x... |
| 2 | avg4muts | 1 | <polyclonal.polyclonal.Polyclonal object at 0x... |
| 3 | avg4muts | 2 | <polyclonal.polyclonal.Polyclonal object at 0x... |
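
The pandas idiom that produces this layout can be seen in isolation below. This is a minimal sketch in which placeholder strings stand in for the fitted model objects; the placeholder names are hypothetical.

```python
import pandas as pd

# Placeholder strings stand in for fitted models (hypothetical values).
objects_by_replicate = {
    ("avg3muts", 1): "model_a",
    ("avg3muts", 2): "model_b",
    ("avg4muts", 1): "model_c",
    ("avg4muts", 2): "model_d",
}

# Tuple keys become a two-level index; naming the levels and resetting
# the index turns them into ordinary "library" and "replicate" columns.
objects_df = (
    pd.Series(objects_by_replicate, name="model")
    .rename_axis(["library", "replicate"])
    .reset_index()
)
print(objects_df)
```

Keeping one row per model with its identifying columns makes it straightforward to add further descriptors later (for instance, a serum or date column) without changing how the models are stored.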


## Average the models
Now we create a `PolyclonalAverage` model with the models to average.
Note that by default the "average" used by `PolyclonalAverage` is the **median** rather than the **mean** of each epitope's values across the models, although this is a parameter that can also be set to mean.
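
To see why the choice between median and mean can matter, here is a small illustration with made-up numbers: four hypothetical per-model escape values for one mutation, one of which is an outlier. It uses only the Python standard library and is not how the package computes its averages internally.

```python
import statistics

# Hypothetical escape values for one mutation from four models;
# the last model gives an outlying estimate.
escape_by_model = [1.0, 1.2, 0.9, 5.0]

mean_escape = statistics.mean(escape_by_model)
median_escape = statistics.median(escape_by_model)

# The mean is pulled toward the outlier, the median much less so.
print(f"mean = {mean_escape:.3f}, median = {median_escape:.3f}")
```

With only a handful of replicates, a single poorly fit model can shift the mean substantially, which is one reason a median can be a reasonable default.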

If the epitopes are too different between models or poorly defined (e.g., you are trying to fit more epitopes than can be consistently inferred from the data), then creating the average may fail with an epitope harmonization error:

In [4]:
model_avg = polyclonal.PolyclonalAverage(models_df)

Let's look at the correlation among the escape at each epitope across models:

In [5]:
# NBVAL_IGNORE_OUTPUT

model_avg.mut_escape_corr_heatmap()
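
As a rough picture of what such a correlation summarizes, the sketch below computes pairwise Pearson correlations of per-mutation escape values between replicates with pandas. The mutation names and values are made up, and this is not necessarily how `mut_escape_corr_heatmap` computes its values.

```python
import pandas as pd

# Hypothetical escape values for five mutations at one epitope,
# as estimated by three different models.
escape = pd.DataFrame(
    {
        "rep1": [0.1, 2.0, 0.0, 1.5, 0.3],
        "rep2": [0.2, 1.8, 0.1, 1.6, 0.2],
        "rep3": [1.0, 0.1, 0.9, 0.2, 1.1],
    },
    index=["A1C", "G2D", "K3E", "L4F", "N5Q"],
)

# Pairwise Pearson correlation between replicates (columns).
corr = escape.corr()
print(corr.round(2))
```

In this toy table `rep1` and `rep2` agree closely while `rep3` disagrees with both, which is the kind of pattern that would suggest an epitope is not being inferred consistently.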

Next, the activities of the epitopes.
In general, the epitope with greater activity (in the plot below) should also be better correlated among replicates (heatmap above), as it can be inferred more reliably:

In [6]:
# NBVAL_IGNORE_OUTPUT

model_avg.activity_wt_barplot()

Now let's look at the escape as a line plot:

In [7]:
# NBVAL_IGNORE_OUTPUT

model_avg.mut_escape_lineplot(
    mut_escape_site_summary_df_kwargs={"min_times_seen": 3},
)

And the individual mutation escape values as a heatmap:

In [8]:
# NBVAL_IGNORE_OUTPUT

model_avg.mut_escape_heatmap(init_min_times_seen=3)