# Biophysical model
Here we describe the biophysical model of conditional shifts that motivates the approach in this package.

## Conceptual overview of the model
The `multidms` model applies to a case where you have DMS datasets
for two or more conditions and are interested in identifying shifts
in mutational effects between conditions.
To do so, the model defines one condition as a reference condition.
For each mutation, the model fits one parameter that quantifies
the effect of the mutation in the reference condition.
For each non-reference condition, it also fits a shift
parameter that quantifies the shift in the mutation's
effect in the non-reference condition relative to the reference.
Shift parameters can be regularized, encouraging most of them to be
close to zero. This regularization helps keep experimental noise from
being fit as spurious shifts, and is most useful in cases where
you expect most mutations to have the same effects between conditions,
such as for conditions that are close relatives.

The model uses a global-epistasis function to disentangle the effects
of multiple mutations on the same variant. To do so, it assumes
that mutational effects additively influence a latent biophysical
property of the protein (e.g., $\Delta G$ of folding).
The mutational-effect parameters described above operate at this latent level.

The global-epistasis function then assumes a sigmoidal relationship between
a protein's latent property and its functional score measured in the experiment
(e.g., log enrichment score). Ultimately, mutational parameters, as well as ones
controlling the shape of the sigmoid, are all jointly fit to maximize agreement
between predicted and observed functional scores across all variants of all conditions.

## Detailed description of the model
For each variant $v$ from condition $h$, we use a global-epistasis function
$g$ to convert a latent phenotype $\phi$ to a functional score $f$:

$$
f(v,h) = g_{\alpha}(\phi(v,h)) + \gamma_h
$$

where $g$ is a post-latent model,
$\alpha$ is the set of parameters that define that model,
and $\gamma_h$ is a condition-specific term added to the output of $g$ for condition $h$.
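
As a minimal sketch of this mapping, the snippet below uses a scaled and shifted logistic curve for $g$. The specific parameterization (`scale`, `bias`) and the function names are assumptions made for illustration; the text above only states that the relationship is sigmoidal.

```python
import math

def sigmoid_global_epistasis(phi, scale=1.0, bias=0.0):
    """Hypothetical g_alpha: a logistic curve with alpha = (scale, bias)."""
    return scale / (1.0 + math.exp(-phi)) + bias

def functional_score(phi, gamma_h=0.0, scale=1.0, bias=0.0):
    """f(v,h) = g_alpha(phi(v,h)) + gamma_h for a single variant."""
    return sigmoid_global_epistasis(phi, scale, bias) + gamma_h

# Same latent phenotype, two conditions that differ only in gamma_h.
score_ref = functional_score(0.0, gamma_h=0.0, scale=6.0, bias=-6.0)
score_nonref = functional_score(0.0, gamma_h=0.5, scale=6.0, bias=-6.0)
print(score_ref, score_nonref)  # -3.0 -2.5
```

Because $\gamma_h$ is added after $g$, it moves the whole curve for condition $h$ up or down without changing its shape.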

The latent phenotype is computed in the following way:

$$
\phi(v,h) = c_{r} + \sum_{m \in v} (x_{m} + s_{m,h})
$$

where:

* $c_{r}$ is the wild type latent phenotype for the reference condition.
* $x_m$ is the latent phenotypic effect of mutation $m$. See details below.
* $s_{m,h}$ is the shift of the effect of mutation $m$ in condition $h$.
  These parameters are fixed to zero for the reference condition. For
  non-reference conditions, they are defined in the same way as $x_m$ parameters.
* $v$ is the set of all mutations relative to the reference wild type sequence
  (including all mutations that separate condition $h$ from the reference condition).
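
The sum above can be sketched in a few lines of Python. The mutation names, parameter values, and the `latent_phenotype` helper are all made up for illustration; missing shift entries are treated as zero, which is how the reference condition is represented here.

```python
def latent_phenotype(mutations, c_r, x, s_h):
    """phi(v,h) = c_r + sum over m in v of (x_m + s_{m,h})."""
    return c_r + sum(x[m] + s_h.get(m, 0.0) for m in mutations)

# Illustrative parameter values only.
c_r = 2.0
x = {"A30Y": -1.0, "K12E": 0.5}
shifts = {
    "reference": {},                 # all shifts fixed to zero
    "homolog": {"A30Y": 0.75},       # one non-zero shift
}

variant = ["A30Y", "K12E"]
phi_ref = latent_phenotype(variant, c_r, x, shifts["reference"])
phi_hom = latent_phenotype(variant, c_r, x, shifts["homolog"])
print(phi_ref, phi_hom)  # 1.5 2.25
```

The two conditions share `c_r` and `x`; only the shift dictionary differs, so any difference in latent phenotype for the same set of mutations comes from the $s_{m,h}$ terms.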

The $x_m$ variable is defined such that mutations are always relative to the
reference condition. For example, if the wild type amino acid at site 30 is an
A in the reference condition, and a G in a non-reference condition,
then a G30Y mutation in the non-reference condition is recorded as an A30Y
mutation relative to the reference. This way, each condition informs
the exact same parameters, even at sites that differ in wild type amino acid.
Variants are encoded in a `BinaryMap` object, where all sites that are non-identical
to the reference are 1's.
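
The re-expression of mutations relative to the reference can be sketched as follows. The `relative_to_reference` helper and the two-site wild type sequences are hypothetical and are not part of the package's API; the sketch only illustrates the bookkeeping for a site where the two wild types differ (A in the reference, G in the other condition).

```python
import re

def relative_to_reference(variant_muts, ref_wt, cond_wt):
    """Re-express a variant's mutations relative to the reference wild type.

    variant_muts: mutations relative to the condition's own wild type.
    ref_wt, cond_wt: dicts mapping site number -> wild type amino acid.
    """
    sequence = dict(cond_wt)  # start from the condition's wild type
    for mut in variant_muts:
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mut)
        if match is None:
            raise ValueError(f"cannot parse mutation {mut!r}")
        wt, site, new = match.group(1), int(match.group(2)), match.group(3)
        if cond_wt[site] != wt:
            raise ValueError(f"{mut} does not match the condition wild type")
        sequence[site] = new
    # Keep every site whose final amino acid differs from the reference.
    return [
        f"{ref_wt[site]}{site}{aa}"
        for site, aa in sorted(sequence.items())
        if aa != ref_wt[site]
    ]

ref_wt = {12: "K", 30: "A"}
cond_wt = {12: "K", 30: "G"}
print(relative_to_reference(["G30Y"], ref_wt, cond_wt))  # ['A30Y']
print(relative_to_reference([], ref_wt, cond_wt))        # ['A30G']
print(relative_to_reference(["K12E"], ref_wt, cond_wt))  # ['K12E', 'A30G']
```

Note that the unmutated wild type of the non-reference condition already carries A30G relative to the reference, so that mutation appears in every variant of the condition that does not change site 30.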

Ultimately, we fit parameters using a loss function with one term that
scores differences between predicted and observed values and another that
uses L1 regularization to penalize non-zero $s_{m,h}$ values:


$$
L_{\text{total}} = \sum_{h} \left[\sum_{v} L_{\text{fit}}(y_{v,h}, f(v,h)) + \lambda \sum_{m} |s_{m,h}|\right]
$$

where:


* $L_{\text{total}}$ is the total loss function.
* $L_{\text{fit}}$ is a loss function that penalizes differences
    in predicted vs. observed functional scores.
* $y_{v,h}$ is the experimentally measured functional score of
    variant $v$ from condition $h$.
* $\lambda$ is a coefficient that sets the strength of the L1 penalty
    on the shift parameters.
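
To make the structure of the total loss concrete, here is a small sketch that evaluates it on made-up numbers. Squared error is used for $L_{\text{fit}}$ purely as an assumption, since the text does not specify the fit term; the `total_loss` helper and all values are illustrative.

```python
def total_loss(observed, predicted, shifts, lam):
    """Sum over conditions of (fit loss + lam * L1 norm of the shifts)."""
    total = 0.0
    for h in observed:
        # Assumed L_fit: squared error between observed and predicted scores.
        fit = sum((y - f) ** 2 for y, f in zip(observed[h], predicted[h]))
        # L1 penalty on the shift parameters of condition h.
        penalty = lam * sum(abs(s) for s in shifts[h].values())
        total += fit + penalty
    return total

observed = {"reference": [0.0, -1.0], "homolog": [0.5, -2.0]}
predicted = {"reference": [0.5, -1.0], "homolog": [0.5, -1.0]}
shifts = {"reference": {}, "homolog": {"A30Y": 0.75, "K12E": -0.25}}

print(total_loss(observed, predicted, shifts, lam=2.0))  # 3.25
print(total_loss(observed, predicted, shifts, lam=0.0))  # 1.25
```

Raising `lam` increases the cost of every non-zero shift, which is what pushes most $s_{m,h}$ values toward zero during fitting.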

## Model using matrix algebra
We compute a vector of predicted latent phenotypes $P_{h}$ as:

$$P_{h} = c + (X_h \cdot (\beta + S_h))$$

where:


* $\beta$ is a vector of mutation effects, with one entry $\beta_m$ per mutation
    (the $x_m$ values defined above).
* $S_h$ is a vector of all $s_{m,h}$ values for condition $h$.
* $X_h$ is a sparse matrix, where rows are variants,
    columns are mutations (all defined relative to the reference condition),
    and values are weights of 0's and 1's. These weights are used to
    compute the phenotype of each variant given the mutations present.
* $c$ is the wild type latent phenotype for the reference condition ($c_r$ above).

In the matrix algebra, the sum of $\beta_m$ and $s_{m,h}$
gives a vector of mutational effects, with one entry per mutation.
Multiplying the matrix $X_h$ by this vector gives a new
vector with one entry per variant, where values are the
sum of mutational effects, weighted by the variant-specific weights in $X_h$.
Adding the $c$ value to this vector will give a vector of
predicted latent phenotypes for each variant.
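
The matrix form can be checked on a tiny example. This sketch uses plain Python lists and a dense matrix so that it has no dependencies; a real implementation would use a sparse matrix type. The `predicted_latent` helper, the mutation order, and all values are illustrative assumptions.

```python
def predicted_latent(X_h, beta, S_h, c):
    """P_h = c + X_h . (beta + S_h), with X_h given as a list of rows."""
    # One combined effect per mutation (column of X_h).
    effects = [b + s for b, s in zip(beta, S_h)]
    # One latent phenotype per variant (row of X_h).
    return [c + sum(w * e for w, e in zip(row, effects)) for row in X_h]

# Columns are mutations, in the order ["K12E", "A30Y"].
beta = [0.5, -1.0]
S_h = [0.0, 0.75]
c = 2.0

# Rows are variants: wild type, K12E alone, K12E + A30Y.
X_h = [
    [0, 0],
    [1, 0],
    [1, 1],
]

print(predicted_latent(X_h, beta, S_h, c))  # [2.0, 2.5, 2.25]
```

The all-zero row recovers $c$, as expected for a variant with no mutations relative to the reference.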

Next, the global-epistasis function can be used to convert
a vector of predicted latent phenotypes to a vector of
predicted functional scores:

$$F_{h,\text{pred}} = g_{\alpha}(P_h)$$

Finally, this vector can be passed to a loss function and
compared with the vector of observed functional scores.