# Fit more complex curves
In the previous section we fit a model to data that was simulated from neutralization curves with a Hill coefficient of one and a non-neutralized fraction of zero.

Now we fit a model to data simulated with a Hill coefficient different from one and a nonzero non-neutralized fraction.
We also fit with the Hill coefficient constrained to one and the non-neutralized fraction constrained to zero, and get much worse results.
This underscores the importance of making both of these parameters free.
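
To make these two parameters concrete, here is a minimal sketch of a Hill-type neutralization curve for a single epitope. The functional form, the function name `frac_not_neutralized`, and the example concentrations are assumptions for illustration and are not taken from the `polyclonal` API. The Hill coefficient controls how steeply the curve falls, and the non-neutralized fraction sets a floor below which it cannot drop:

```python
import numpy as np


def frac_not_neutralized(c, activity, hill_coefficient=1.0, non_neutralized_frac=0.0):
    """Hypothetical single-epitope Hill curve (illustration only).

    Assumed form: (1 - t) / (1 + (c * exp(a))**n) + t, where a is the
    activity, n the Hill coefficient, and t the non-neutralized fraction.
    """
    c = np.asarray(c, dtype=float)
    x = (c * np.exp(activity)) ** hill_coefficient
    return (1.0 - non_neutralized_frac) / (1.0 + x) + non_neutralized_frac


# arbitrary example concentrations
concentrations = np.array([0.25, 1.0, 4.0])

# Hill coefficient of one and non-neutralized fraction of zero
simple_curve = frac_not_neutralized(concentrations, activity=2.4)

# steeper curve with a floor at 0.05
complex_curve = frac_not_neutralized(
    concentrations, activity=2.4, hill_coefficient=1.6, non_neutralized_frac=0.05
)

print(simple_curve)
print(complex_curve)
```

Under this assumed form, a model that fixes the Hill coefficient to one and the floor to zero cannot reproduce the second curve at any activity, which is the situation we explore here.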

In [1]:
import copy
import requests
import tempfile

import numpy

import pandas as pd

import polyclonal
import polyclonal.plot

data_to_fit = pd.read_csv("2epitope_escape.csv", na_filter=None)

data_to_fit

Unnamed: 0,barcode,aa_substitutions,concentration,prob_escape,true IC90
0,AAAAAAACGTCAGGAG,,0.25,0.026970,0.1156
1,AAAAAAGTCGATGACA,,0.25,0.001488,0.1156
2,AAAAACGTATCGAGCA,,0.25,0.004059,0.1156
3,AAAAACGTTCTTATAC,,0.25,0.005425,0.1156
4,AAAAATGGAGTATTCT,,0.25,0.000000,0.1156
...,...,...,...,...,...
119995,CACACATCAGAAAAGT,Y508V L518N,4.00,0.000000,0.1156
119996,CGCGACTCCGTGTTCC,Y508V P521S,4.00,0.010540,0.1156
119997,ACCGTTATATACCCGG,Y508W,4.00,0.000000,0.1156
119998,ATTCACCACCAGGTAG,Y508W,4.00,0.000000,0.1156


For spatial regularization (encouraging epitopes to be structurally proximal residues), we read the inter-residue distances in angstroms from [PDB 6XM4](https://www.rcsb.org/structure/6XM4):

In [2]:
# we read the PDB from the webpage into a temporary file and get the distances from that.
# you could also just download the file manually and then read from it.
r = requests.get("https://files.rcsb.org/download/6XM4.pdb")
with tempfile.NamedTemporaryFile() as tmpf:
    _ = tmpf.write(r.content)
    tmpf.flush()
    spatial_distances = polyclonal.pdb_utils.inter_residue_distances(tmpf.name, ["A"])
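
As a rough illustration of what inter-residue distances are, the sketch below computes pairwise Euclidean distances from a few made-up coordinates. The coordinates, the column names, and the helper `pairwise_distances` are all hypothetical; the actual `inter_residue_distances` function may define distances and its output columns differently.

```python
import itertools

import numpy as np
import pandas as pd


def pairwise_distances(coords):
    """Return a tidy data frame of distances between all pairs of sites.

    `coords` maps a site number to an (x, y, z) coordinate in angstroms.
    """
    records = []
    for (site_1, xyz_1), (site_2, xyz_2) in itertools.combinations(
        sorted(coords.items()), 2
    ):
        distance = float(np.linalg.norm(np.asarray(xyz_1) - np.asarray(xyz_2)))
        records.append({"site_1": site_1, "site_2": site_2, "distance": distance})
    return pd.DataFrame(records)


# made-up coordinates for three sites (not taken from any structure)
toy_coords = {484: (0.0, 0.0, 0.0), 490: (3.0, 4.0, 0.0), 508: (0.0, 0.0, 12.0)}
toy_distances = pairwise_distances(toy_coords)
print(toy_distances)
```

A spatial regularization can then penalize epitopes whose escape sites are far apart according to such a table.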

Initialize a `Polyclonal` model with two epitopes:

In [3]:
reg_escape_weight = 0.02

model = polyclonal.Polyclonal(
    data_to_fit=data_to_fit,
    n_epitopes=2,
    spatial_distances=spatial_distances,
)

model_fixed = copy.deepcopy(model)

Now fit the `Polyclonal` model with all parameters free:

In [4]:
# NBVAL_IGNORE_OUTPUT
opt_res = model.fit(logfreq=200, reg_escape_weight=reg_escape_weight)

#
# Fitting site-level fixed Hill coefficient and non-neutralized frac model.
# Starting optimization of 348 parameters at Sun Apr  6 10:33:11 2025.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
           0    0.023073       16121       16112           0           0           0              0               0        9.047                    0                        0
         173      4.6982      1221.9      1144.2       1.978           0      15.691              0          3.0809       56.892                    0                        0
# Successfully finished at Sun Apr  6 10:33:16 2025.
#
# Fitting fixed Hill coefficient and non-neutralized frac model.
# Starting optimization of 3866 parameters at Sun Apr  6 10:33:16 2025.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_h

Next fit a model with the Hill coefficient fixed to its default of one and the non-neutralized fraction fixed to its default of zero:

In [5]:
# NBVAL_IGNORE_OUTPUT
opt_res_fixed = model_fixed.fit(
    logfreq=200,
    reg_escape_weight=reg_escape_weight,
    fix_hill_coefficient=True,
    fix_non_neutralized_frac=True,
)

#
# Fitting site-level model.
# Starting optimization of 348 parameters at Sun Apr  6 10:34:51 2025.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
           0     0.02223       16114       16112           0           0           0              0               0       1.8094                    0                        0
          98      2.3583        1519      1491.5     0.68005           0      9.3701              0         0.20907       17.215                    0                        0
# Successfully finished at Sun Apr  6 10:34:54 2025.
#
# Fitting model.
# Starting optimization of 3866 parameters at Sun Apr  6 10:34:54 2025.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
           0    0.038878        1816      1712.1       

Now look at the actual curve specs used to simulate the data:

In [6]:
pd.read_csv("2epitope_model_curve_specs.csv")

Unnamed: 0,epitope,activity,hill_coefficient,non_neutralized_frac
0,class 2,3.2,1.2,0.01
1,class 3,2.4,1.6,0.05


Now look at the curve specs from fitting the non-fixed model.
Note that the Hill coefficients and non-neutralized fractions are fit to values different from one and zero, respectively.

In [7]:
# NBVAL_IGNORE_OUTPUT

model.curve_specs_df.round(2)

Unnamed: 0,epitope,activity,hill_coefficient,non_neutralized_frac
0,1,2.31,1.67,0.05
1,2,2.91,1.45,0.02
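
Because the epitope labels assigned by the fit are arbitrary, comparing fitted and actual curve specs requires pairing the epitopes first. The sketch below pairs them by activity rank, which is an assumption made for illustration, and uses the values copied from the two tables above (fitted values can differ slightly between runs):

```python
import pandas as pd

# actual curve specs, copied from the table above
true_specs = pd.DataFrame(
    {
        "epitope": ["class 2", "class 3"],
        "activity": [3.2, 2.4],
        "hill_coefficient": [1.2, 1.6],
        "non_neutralized_frac": [0.01, 0.05],
    }
)

# fitted curve specs from the non-fixed model, copied from the table above
fit_specs = pd.DataFrame(
    {
        "epitope": ["1", "2"],
        "activity": [2.31, 2.91],
        "hill_coefficient": [1.67, 1.45],
        "non_neutralized_frac": [0.05, 0.02],
    }
)

param_cols = ["activity", "hill_coefficient", "non_neutralized_frac"]


def by_activity(df):
    """Order epitopes from highest to lowest activity (assumed pairing rule)."""
    return df.sort_values("activity", ascending=False).reset_index(drop=True)


true_sorted = by_activity(true_specs)
fit_sorted = by_activity(fit_specs)

# absolute difference between fitted and actual value for each parameter
abs_error = (fit_sorted[param_cols] - true_sorted[param_cols]).abs().round(2)
abs_error.insert(0, "true_epitope", true_sorted["epitope"])
abs_error.insert(1, "fit_epitope", fit_sorted["epitope"])
print(abs_error)
```

With this pairing, the fitted Hill coefficients and non-neutralized fractions land on the same side of one and zero as the actual values, even though they are not recovered exactly.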


And the curve specs from fitting the fixed model:

In [8]:
model_fixed.curve_specs_df.round(2)

Unnamed: 0,epitope,activity,hill_coefficient,non_neutralized_frac
0,1,5.21,1.0,0.0
1,2,-5.34,1.0,0.0


Plot the curves for the non-fixed model:

In [9]:
# NBVAL_IGNORE_OUTPUT
model.curves_plot()

Versus the fixed model:

In [10]:
# NBVAL_IGNORE_OUTPUT
model_fixed.curves_plot()

Here are plots of the escape values inferred using the non-fixed model.
Note how this successfully captures two epitopes:

In [11]:
# NBVAL_IGNORE_OUTPUT
model.mut_escape_plot()

And here are the escape values using the fixed model.
Note that the fixed model does not capture both epitopes:

In [12]:
# NBVAL_IGNORE_OUTPUT
model_fixed.mut_escape_plot()

Now correlate the model-predicted and actual IC90s for each variant.
The full (non-fixed) model correlates better with the true log10 IC90s:

In [13]:
# NBVAL_IGNORE_OUTPUT

ic90s = (
    data_to_fit[["aa_substitutions", "true IC90"]]
    .drop_duplicates()
    .pipe(model.icXX, x=0.9, col="model IC90")
    .pipe(model_fixed.icXX, x=0.9, col="fixed model IC90")
    .assign(
        log10_true_IC90=lambda x: numpy.log10(x["true IC90"]),
        log10_model_IC90=lambda x: numpy.log10(x["model IC90"]),
        log10_fixed_model_IC90=lambda x: numpy.log10(x["fixed model IC90"]),
    )
)

# print the correlations
ic90_corrs = (
    ic90s[[c for c in ic90s.columns if c.startswith("log")]]
    .corr(numeric_only=True)
    .round(2)
)
display(ic90_corrs)

assert (
    ic90_corrs.at["log10_true_IC90", "log10_model_IC90"]
    > ic90_corrs.at["log10_true_IC90", "log10_fixed_model_IC90"]
)

Unnamed: 0,log10_true_IC90,log10_model_IC90,log10_fixed_model_IC90
log10_true_IC90,1.0,0.97,0.79
log10_model_IC90,0.97,1.0,0.84
log10_fixed_model_IC90,0.79,0.84,1.0
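
An IC90 is the concentration at which 90% of the virus is neutralized, so the fraction not neutralized equals 0.1. As a sketch of how such a value could be computed for a multi-epitope curve, the following uses bisection on an assumed product of Hill curves. This is not a description of how `icXX` is implemented; the functions `frac_escape` and `ic_bisect`, the functional form, and the concentration bounds are assumptions for illustration.

```python
import math


def frac_escape(c, epitopes):
    """Assumed model: product over epitopes of a Hill curve with a floor.

    `epitopes` is a list of (activity, hill_coefficient, non_neutralized_frac).
    """
    p = 1.0
    for activity, n, t in epitopes:
        p *= (1.0 - t) / (1.0 + (c * math.exp(activity)) ** n) + t
    return p


def ic_bisect(epitopes, x=0.9, min_c=1e-8, max_c=1e8, tol=1e-10):
    """Concentration where `frac_escape` equals 1 - x, by bisection on log(c).

    Returns `max_c` if the curve never gets that low within the bounds,
    and `min_c` if it is already that low at the lower bound.
    """
    target = 1.0 - x
    if frac_escape(max_c, epitopes) > target:
        return max_c
    if frac_escape(min_c, epitopes) <= target:
        return min_c
    lo, hi = math.log(min_c), math.log(max_c)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if frac_escape(math.exp(mid), epitopes) > target:
            lo = mid  # still too much escape, so the IC is at higher concentration
        else:
            hi = mid
    return math.exp(0.5 * (lo + hi))


# actual curve specs for the two epitopes, copied from the table shown earlier
true_epitopes = [(3.2, 1.2, 0.01), (2.4, 1.6, 0.05)]
ic90_unmutated = ic_bisect(true_epitopes, x=0.9)
print(round(ic90_unmutated, 3))
```

With the actual curve specs, this sketch gives an IC90 of roughly 0.116, consistent with the `true IC90` of 0.1156 listed for unmutated variants in the data.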


We also examine the correlation between the "true" and inferred mutation-escape values, $\beta_{m,e}$:

In [14]:
# NBVAL_IGNORE_OUTPUT

mut_escape = (
    pd.read_csv("RBD_mut_escape_df.csv")
    .query("epitope != 'class 1'")  # not used in simulation
    .assign(epitope=lambda x: "true " + x["epitope"])
    .pivot_table(index="mutation", columns="epitope", values="escape")
    .reset_index()
)

for m, mname in [(model, "model"), (model_fixed, "fixed model")]:
    # Sort so the epitope with the biggest average escape is first. This makes testing
    # more robust, because which epitope name gets the biggest escape is somewhat random:
    model_df = (
        m.mut_escape_df.assign(
            mean_escape=lambda x: x.groupby("epitope")["escape"].transform("mean")
        )
        .sort_values("mean_escape", ascending=False)
        .pivot_table(index="mutation", columns="epitope", values="escape", sort=False)
        .reset_index()
    )
    model_df.columns = ["mutation", f"{mname} epitope A", f"{mname} epitope B"]
    mut_escape = mut_escape.merge(model_df, on="mutation", validate="one_to_one")

mut_escape_corr = mut_escape.corr(numeric_only=True).drop(
    columns=[c for c in mut_escape.columns if not c.startswith("true")],
    index=[c for c in mut_escape.columns if c.startswith("true")],
    errors="ignore",
)

display(mut_escape_corr.round(2))

assert (
    mut_escape_corr.query("index.str.startswith('model')").max().max()
    > mut_escape_corr.query("index.str.startswith('fixed')").max().max()
)

Unnamed: 0,true class 2,true class 3
model epitope A,-0.06,0.96
model epitope B,0.98,-0.06
fixed model epitope A,0.67,0.52
fixed model epitope B,-0.07,-0.05


As can be seen from the correlations above, the non-fixed model captures both epitopes well, but the fixed model captures only one epitope.
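
One way to summarize such a correlation table is to find, for each inferred epitope, the true epitope it correlates with best. The sketch below does this with the values copied from the table above; the 0.9 cutoff for calling an epitope well captured is an arbitrary assumption for illustration.

```python
import pandas as pd

# correlations copied from the table above
corr = pd.DataFrame(
    {
        "true class 2": [-0.06, 0.98, 0.67, -0.07],
        "true class 3": [0.96, -0.06, 0.52, -0.05],
    },
    index=[
        "model epitope A",
        "model epitope B",
        "fixed model epitope A",
        "fixed model epitope B",
    ],
)

# best-matching true epitope and its correlation for each inferred epitope
summary = pd.DataFrame(
    {"best_match": corr.idxmax(axis=1), "correlation": corr.max(axis=1)}
)

# inferred epitopes that match a true epitope above the (arbitrary) cutoff
well_captured = summary.query("correlation > 0.9")
print(summary)
```

Here both epitopes of the non-fixed model pass the cutoff, each matching a different true epitope, while neither epitope of the fixed model does.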