# Biophysical model

The `multidms` joint model applies to a case where you have DMS datasets
for two or more experimental conditions and are interested in identifying shifts
in mutational effects between conditions. 
Here we describe the biophysical model of conditional shifts that motivates the approach in this package.

We suggest reading the [Otwinowski et al. 2018](https://www.pnas.org/doi/10.1073/pnas.1804015115) paper to understand the approach to modeling global epistasis before reading the rest of the documentation.

## Model overview

`multidms` extends the traditional global epistasis model by informing the parameters with multiple DMS experiments performed under differing experimental conditions. For example, distinct experimental conditions (referred to as $h$ throughout the documentation) may include sets of experiments performed with different homologs as the wildtype sequence (e.g. SARS-CoV-2 Delta vs. Omicron spike). The approach could also be used for experimental conditions that share the same wildtype sequence but apply different selection targets to the same protein. The model assumes that mutations have _mostly_ the same effects across conditions, with some shifts in mutational effect due to biological mechanisms (e.g. epistasis). Ultimately, the model was designed to identify those shifts using lasso regularization applied to the shift parameters described in the additive latent phenotype section below.

[//]: # "We find that the qualitative results are robust to choice for lasso strength, and generally this lasso"
[//]: # "acts as a single dial to increase signal and noise in a linear fasion"


At a high level, the model is a composition of three
functions that describe the expected biophysical interactions underlying a given phenotype:
(1) an additive model, $\phi$, describing a variant's _latent_ phenotype under a given condition,
(2) a global epistasis model, $g$, to disentangle the effects of multiple mutations on the same variant, and
(3) a final output activation function, $t$, accounting for an expected _lower bound_ on the variant's phenotype,
below which the observed functional score may sit due to experimental sensitivity.

$$
f(v,h) = t_{\gamma}(g_{\alpha}(\phi_{\beta, S, C_{r}}(v,h)))
$$

Where 
$\gamma$, $\alpha$, $\beta$, $S$, and $C_{r}$
are _free_ parameters inferred from experimental observations during the fitting process.

**Note** Defining an abstract model in terms of its components offers us (1) modularity for method testing and development, and (2) the flexibility of multiple options for model components that encompass the needs of differing research goals and experimental techniques. While there is only a single option for the $\phi$ latent prediction, we offer a few options for both the global epistasis ($g$) and output activation ($t$) functions. Generally, the package defaults for these components, described below, should be sufficient for most purposes. In that case, feel free to ignore the `multidms.biophysical` module altogether, as this functionality is hidden unless explicitly specified during the instantiation of a `MultiDmsModel` object.

Below, we'll describe the individual components in more detail.

## Additive latent phenotype, $\phi$

The model defines one condition as a _reference_ condition.
For each mutation $m$, the model fits a single mutation effect parameter, $\beta_{m}$.
Additionally, the model fits a set of shift parameters, $S_{m,h}$,
that quantify the shift in a given mutation's
effect relative to the reference. Each mutation is associated with an independent shift parameter
for each non-reference condition.
For example, if there are three experimental conditions in total, $h_{1}$, $h_{2}$, and $h_{3}$,
with $h_{1}$ as the reference, then each mutation, $m$, is assigned a
single $\beta_{m}$ parameter
and two non-reference condition _shift_ parameters, $S_{m, h_{2}}$ and $S_{m, h_{3}}$, for the latent prediction.


Shift parameters can be regularized, encouraging most of them to be
close to zero. This regularization step is a useful way to eliminate
the effects of experimental noise, and is most useful in cases where
you expect most mutations to have the same effects between conditions,
such as for conditions that are close relatives. 
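As a rough illustration of the regularization described above, the sketch below computes a lasso (L1) penalty over a matrix of shift parameters. The function name `shift_lasso_penalty` and the `lasso_weight` argument are hypothetical names chosen for this example, not part of the `multidms` API, and the data-fitting loss that such a penalty would be added to is not shown.

```python
import numpy as np


def shift_lasso_penalty(shifts, lasso_weight):
    """L1 penalty on shift parameters (illustrative sketch only).

    shifts: array of shape (n_mutations, n_non_reference_conditions).
    lasso_weight: hypothetical penalty strength; larger values push
        more shift parameters toward zero during fitting.
    """
    return lasso_weight * np.sum(np.abs(shifts))


# Three mutations, two non-reference conditions (made-up values).
shifts = np.array(
    [
        [0.0, 0.5],
        [-0.25, 0.0],
        [0.0, 0.0],
    ]
)
penalty = shift_lasso_penalty(shifts, lasso_weight=2.0)
print(penalty)  # 2.0 * (0.5 + 0.25) = 1.5
```

Because the penalty grows linearly with the magnitude of each shift, a shift parameter is only worth moving away from zero when doing so improves the fit to the data by more than the penalty it incurs.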

Concretely, the latent phenotype of any variant, $v$, from the experimental condition, $h$,
is computed like so:

$$
\phi(v,h) = C_{r} + \sum_{m \in v} (\beta_{m} + S_{m,h})
$$

where:

* $C_{r}$ is the wild type latent phenotype for the reference condition.
* $\beta_{m}$ is the latent phenotypic effect of mutation $m$. See the note below.
* $S_{m,h}$ is the shift of the effect of mutation $m$ in condition $h$.
  These parameters are fixed to zero for the reference condition. For
  non-reference conditions, they are defined in the same way as the $\beta_{m}$ parameters.
* $v$ is the set of all mutations relative to the reference wild type sequence
  (including all mutations that separate condition $h$ from the reference condition).

**Note** This function is implemented in `multidms.biophysical.additive_model`

**Note** The $\beta_m$ variable is defined such that mutations are always relative to the
reference condition. For example, if the wild type amino acid at site 30 is an
A in the reference condition, and a Y in a non-reference condition,
then a Y30G mutation in the non-reference condition is recorded as an A30G
mutation relative to the reference. This way, each condition informs
the exact same parameters, even at sites that differ in wild type amino acid.
These are encoded in a `BinaryMap` object, where all sites that are non-identical
to the reference are 1's.
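The additive model above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the signature of `multidms.biophysical.additive_model`: the function name `additive_latent`, the argument names, and all parameter values are made up for the example. The binary vector `x` stands in for one row of the binary encoding, with a 1 for every mutation the variant carries relative to the reference.

```python
import numpy as np


def additive_latent(x, beta, shift, c_ref):
    """Latent phenotype: reference wild type latent phenotype plus the
    sum of (effect + shift) over the mutations present in the variant.

    x: binary vector, 1 where the variant carries a mutation relative
        to the reference wild type.
    beta: per-mutation effects.
    shift: per-mutation shifts for the variant's condition
        (all zeros for the reference condition).
    c_ref: wild type latent phenotype of the reference condition.
    """
    return c_ref + x @ (beta + shift)


# Hypothetical parameters for three mutations.
beta = np.array([-1.0, 0.5, -2.0])
shift_by_condition = {
    "reference": np.zeros(3),  # fixed to zero by definition
    "h2": np.array([0.0, 0.25, 0.0]),
}
c_ref = 3.0

# A variant carrying the first two mutations.
x = np.array([1.0, 1.0, 0.0])

phi_ref = additive_latent(x, beta, shift_by_condition["reference"], c_ref)
phi_h2 = additive_latent(x, beta, shift_by_condition["h2"], c_ref)
print(phi_ref, phi_h2)  # 2.5 2.75
```

The same `beta` vector is used for both conditions, so the only difference between the two predictions is the shift of the second mutation in condition `h2`.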

## Sigmoidal global epistasis, $g$

In contrast to the common family of spline functions typically applied to these models,
the global-epistasis function here assumes a sigmoidal relationship between
a protein's latent property and its functional score measured in the experiment
(e.g., log enrichment score). Using free parameters, the sigmoid
can flexibly conform to an optimal shape informed by the data.
Note that this function is independent of the
experimental condition in which a variant is observed.

The sigmoidal function that relates a given _latent phenotype_, $z$, to its functional score is given by:

$$
g(z) = \alpha_{scale} * \frac{1}{1 + e^{-z}} + \alpha_{bias}
$$

where:

* $\alpha_{scale}$ is a free parameter defining the range of the sigmoid.
* $\alpha_{bias}$ is a free parameter defining the lower bound of the sigmoid.
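To make the roles of the two parameters concrete, here is a small NumPy sketch of the sigmoid as written above. The function name `sigmoidal_global_epistasis` and the parameter values are assumptions for illustration and are not taken from the package.

```python
import numpy as np


def sigmoidal_global_epistasis(z, alpha_scale, alpha_bias):
    """g(z) = alpha_scale * sigmoid(z) + alpha_bias."""
    return alpha_scale / (1.0 + np.exp(-z)) + alpha_bias


# Evaluate at a very low, a neutral, and a very high latent phenotype.
z = np.array([-10.0, 0.0, 10.0])
scores = sigmoidal_global_epistasis(z, alpha_scale=6.0, alpha_bias=-5.0)
print(scores)
```

With these values the output approaches `alpha_bias` (-5) for very negative latent phenotypes, approaches `alpha_bias + alpha_scale` (1) for very positive ones, and equals exactly -2 at $z = 0$, the midpoint of the range.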

Below is an interactive plot showing the effect of the sigmoidal global epistasis as a function of an adjustable $\alpha_{scale}$, and $\alpha_{bias}$:

```python
import altair as alt
import numpy
import pandas as pd

# Latent phenotype values at which the sigmoid is evaluated
df = pd.DataFrame({"latent": numpy.linspace(-10, 10, 100)})

# Sliders bound to the two free parameters of the sigmoid
slider_s = alt.binding_range(min=0.1, max=10)
var_s = alt.param(bind=slider_s, value=1, name="alpha_scale")

slider_b = alt.binding_range(min=-10, max=5)
var_b = alt.param(bind=slider_b, value=0, name="alpha_bias")

(
    alt.Chart(df)
    .transform_calculate(
        # g(z) = alpha_scale * 1 / (1 + exp(-z)) + alpha_bias
        phenotype=(1 / (1 + alt.expr.exp(-1 * alt.datum["latent"])))
        * var_s
        + var_b
    )
    .encode(
        x=alt.X("latent", title="latent phenotype", scale=alt.Scale(domain=[-10, 10])),
        y=alt.Y("phenotype:Q", title="predicted phenotype", scale=alt.Scale(domain=[-10, 10]))
    )
    .mark_line()
    .add_params(var_s, var_b)
)
```

## Output activation, $t$

Phenotypes reported from DMS experiments are often the result of a fold-change calculation comparing barcode read counts after a selection stage to the pre-selection barcode read counts. As a consequence, variants at low frequency in the initial library often yield far more extreme phenotype observations than we expect the protocol to be sensitive enough to detect. One option is to filter the data to exclude these low-frequency barcodes. However, in certain cases even a modest filtering threshold cuts out many of the individual mutations you may want to learn about from the dataset. In such cases, it's common for researchers to clip their data at some lower bound, $l$. This type of clipping causes a specific unwanted behavior in the model: put simply, the model, $g(\phi(v,h))$, _learns_ this lower bound and is thus encouraged to constrain its shape to conform to it.

We account for this unwanted behavior by applying a final activation to $z'$, the output of the global epistasis model, $g(\phi(v,h)) = z'$. In essence, this is a generic softplus activation, softplus($z'$) = $\log(1 + e^{z'})$, a smooth approximation of the well-known ReLU activation function. Because softplus is a rigid function with a lower bound at zero, we introduce (1) a user-defined parameter, $\lambda_{sp}$, which scales the rate at which the "hinge" interpolates from the lower bound to the identity function, and (2) a user-defined parameter, $l$, which shifts the lower bound to where the data was clipped.

Given the output of our global epistasis function, $z'$, the final predicted phenotype of our model is computed like so:

$$
t(z') = \lambda_{sp}\log(1 + e^{\frac{z' - l}{\lambda_{sp}}}) + l
$$
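The limiting behavior of this activation is easy to check numerically. The sketch below is an illustrative NumPy version of the formula above, with a hypothetical function name and made-up inputs. It uses `numpy.logaddexp(0, a)`, which equals $\log(1 + e^{a})$ but avoids overflow for large $a$.

```python
import numpy as np


def softplus_activation(z, lower_bound, lambda_sp):
    """t(z') = lambda_sp * log(1 + exp((z' - l) / lambda_sp)) + l."""
    # logaddexp(0, a) computes log(exp(0) + exp(a)) = log(1 + exp(a)) stably.
    return lambda_sp * np.logaddexp(0.0, (z - lower_bound) / lambda_sp) + lower_bound


# Far below the bound, exactly at the bound, and well above the bound.
z = np.array([-20.0, -3.5, 5.0])
out = softplus_activation(z, lower_bound=-3.5, lambda_sp=1.0)
print(out)
```

Far below the bound the output flattens out at $l$, well above the bound it is close to the identity, and exactly at $z' = l$ it sits $\lambda_{sp}\log 2$ above the bound. Smaller values of $\lambda_{sp}$ sharpen the hinge toward a shifted ReLU.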


Below is an interactive plot showing the effect of the modified softplus activation as a function of an adjustable $\lambda_{sp}$ scaling parameter, and lower bound, $l$:

```python
import altair as alt
import numpy
import pandas as pd

# Global epistasis outputs (z') at which the activation is evaluated
df = pd.DataFrame({"latent": numpy.linspace(-10, 10, 100)})

# Sliders bound to the scaling parameter and the lower bound
slider_lsp = alt.binding_range(min=0.1, max=10)
var_lambda_sp = alt.param(bind=slider_lsp, value=1, name="lambda_sp")

slider_lb = alt.binding_range(min=-10, max=0)
var_lower_bound = alt.param(bind=slider_lb, value=-3.5, name="lower_bound")

(
    alt.Chart(df)
    .transform_calculate(
        # t(z') = lambda_sp * log(1 + exp((z' - l) / lambda_sp)) + l
        phenotype=alt.expr.log(1 + alt.expr.exp((alt.datum["latent"] - var_lower_bound) / var_lambda_sp))
        * var_lambda_sp
        + var_lower_bound
    )
    .encode(
        x=alt.X("latent", title="global epistasis prediction (z')", scale=alt.Scale(domain=[-10, 10])),
        y=alt.Y("phenotype:Q", title="predicted phenotype", scale=alt.Scale(domain=[-10, 10]))
    )
    .mark_line()
    .add_params(var_lambda_sp, var_lower_bound)
)
```

## Accounting for the experimental bias using $\gamma_{h}$ 

We expect that there exists some constant bias in the observed functional scores between differing experimental conditions. We believe this to be the case especially when experimental conditions have differing wildtypes. Ideally, in this scenario we could spike the reference wildtype into each non-reference experiment to gauge exactly what that difference may be and adjust accordingly. However, this is not always feasible in the library design, and an analogous procedure for experiments that share a wildtype is unclear. To account for this bias, we introduce a free parameter, $\gamma_{h}$.

TODO

## TODO - Fitting procedure