(c) 2023 Manuel Razo. This work is licensed under a [Creative Commons
Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/).
All code contained herein is licensed under an [MIT
license](https://opensource.org/licenses/MIT).

In [2]:
# Load project package
@load_pkg BayesFitUtils

# Import project package
import BayesFitUtils

# Import package to handle DataFrames
import DataFrames as DF
import CSV

# Import package to load chains
import JLD2

# Import package to handle MCMC chain objects
import MCMCChains

# Import basic statistical functions
import StatsBase
import Distributions
import Random

# Import basic math
import LinearAlgebra

# Load CairoMakie for plotting
using CairoMakie
import ColorSchemes
import Makie
# Activate backend
CairoMakie.activate!()

# Set PBoC Plotting style
BayesFitUtils.viz.pboc_makie!()

# Simulation Based Calibration (SBC)

A key part of the [Principled Bayesian
Workflow](https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html)
comes in the form of the so-called prior predictive checks implemented as
Simulation-Based Calibration (SBC). In Bayesian statistics there is no
guarantee that the posterior distribution will actually contain the "ground
truth" parameter value, which is why a check such as SBC is needed in the first
place. What is guaranteed is that if one samples parameter values from the
prior (as in the prior predictive checks), generates data from them, and runs
the inference on these data, then in the long run the posterior distributions
one obtains should, on average, recover the prior distribution. The procedure
looks as follows:

First, we sample a set of parameters from the prior,
$$
\tilde\theta \sim \pi(\theta).
$$
Then, we generate data using this set of parameters and the generative model,
$$
\tilde{y} \sim \pi(y \mid \tilde\theta).
$$
Finally, we run the inference on the parameters given the synthetic data,
$$
\pi(\theta \mid \tilde y) \propto 
\pi(\tilde y \mid \theta) \pi(\theta).
$$
Repeating this procedure many times and averaging the resulting posteriors
should recover the prior distribution. That is,
$$
\pi(\theta) = \int d\tilde y\; d\tilde\theta\; \pi(\theta
\mid \tilde y) \pi(\tilde y \mid \tilde\theta)\pi(\tilde\theta),
$$
where for this result we first used
$$
\int d\tilde\theta\; \pi(\tilde y \mid \tilde\theta)
\pi(\tilde\theta) = \int d\tilde\theta\; \pi(\tilde y, \tilde\theta) =
\pi(\tilde y),
$$
and then
$$
\int d\tilde y\; \pi(\theta \mid \tilde y) \pi(\tilde y) = 
\int d\tilde y\; \pi(\theta, \tilde y) = \pi(\theta).
$$
In other words, for any model, the average of any exact posterior expectation
with respect to data generated from the Bayesian joint distribution reduces to
the corresponding prior expectation. Repeating the inference many times with
different ground truths should therefore recover the prior distribution.
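
The self-consistency condition above can be checked numerically on a toy problem. The following is a minimal sketch, not the fitness model used in this notebook: it assumes a conjugate Normal prior with a Normal likelihood of known standard deviation, so that the exact posterior is available in closed form. All names (`μ_prior`, `σ_like`, `θ_pooled`, and so on) and all numerical values are hypothetical choices made for illustration only.

```julia
import Random
import Distributions
import Statistics

Random.seed!(1)

# Assumed toy model (NOT the fitness model):
# θ ~ Normal(μ_prior, σ_prior), yᵢ ~ Normal(θ, σ_like) with σ_like known
μ_prior, σ_prior = 0.0, 1.0
σ_like = 0.5
# Number of observations per synthetic dataset
n_data = 5
# Number of SBC repetitions
n_sim = 5_000

# Store one posterior draw per simulated dataset
θ_pooled = Vector{Float64}(undef, n_sim)

for i in 1:n_sim
    # 1. Sample a ground-truth parameter from the prior
    θ_true = rand(Distributions.Normal(μ_prior, σ_prior))
    # 2. Generate synthetic data from the generative model
    y_sim = rand(Distributions.Normal(θ_true, σ_like), n_data)
    # 3. Compute the exact conjugate posterior (variance and mean)
    var_post = 1 / (1 / σ_prior^2 + n_data / σ_like^2)
    μ_post = var_post * (μ_prior / σ_prior^2 + sum(y_sim) / σ_like^2)
    # Draw a single sample from the posterior
    θ_pooled[i] = rand(Distributions.Normal(μ_post, sqrt(var_post)))
end # for

# The pooled posterior draws should look like draws from the prior
println("pooled mean: ", Statistics.mean(θ_pooled))
println("pooled std:  ", Statistics.std(θ_pooled))
```

If the inference is exact, the pooled mean and standard deviation should be close to those of the prior (here 0 and 1). Systematic deviations would point to a problem in the inference machinery rather than in the data.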

## Generating simulated frequency trajectories

To apply the simulation-based calibration procedure on our fitness inferences,
we need to generate synthetic datasets according to our model. In particular,
our inference is of the form
$$
\pi\left(
    \underline{s}^{M}, \underline{\bar{s}}_t, \underline{\underline{F}} \mid 
    \underline{\underline{R}}
\right) \propto 
\pi\left(
    \underline{\underline{R}} \mid 
    \underline{\underline{F}}, \underline{s}^{M}, \underline{\bar{s}}_t
\right) 
\pi\left(
    \underline{\underline{F}}, \underline{s}^{M}, \underline{\bar{s}}_t\right
),
$$
where
- $\underline{s}^{M}$ is a vector with the $M$ mutant barcode fitness values.
- $\underline{\bar{s}}_t$ is a vector with the $T-1$ population mean fitness
  values, one for every pair of adjacent time points.
- $\underline{\underline{F}}$ is a $T \times B$ matrix with the frequency
  profiles for all $B = M + N$ barcodes.
- $\underline{\underline{R}}$ is a $T \times B$ matrix with the raw barcode
  count profiles for all $B = M + N$ barcodes.

Our model assumes that the mutant barcode frequencies evolve over time according
to an exponential growth of the form 
$$
f_{t+1}^{(m)} = f_{t}^{(m)} \mathrm{e}^{(s^{(m)} - \bar{s}_t)\tau},
$$
where
- $f_{t}^{(m)}$ is the frequency of mutant $m$ at time $t$.
- $s^{(m)}$ is the mutant relative fitness (the $m$-th entry of
  $\underline{s}^{M}$).
- $\bar{s}_t$ is the population mean fitness at time $t$ (the $t$-th entry of
  $\underline{\bar{s}}_t$).
- $\tau$ is the time interval between $t$ and $t+1$.
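
As a small illustration of the exponential growth model above, the sketch below iterates the update rule for a single mutant barcode. The initial frequency, fitness, population mean fitness values, and time interval are hypothetical numbers chosen only to make the arithmetic easy to follow, and the names `f_init`, `s_mut`, `s_mean`, and `f_traj` are not taken from the project code.

```julia
# Hypothetical values, chosen for illustration only
f_init = 1e-3                     # initial mutant frequency
s_mut = 0.5                       # mutant relative fitness s⁽ᵐ⁾
s_mean = [0.1, 0.2, 0.3, 0.4]     # population mean fitness, T - 1 entries
τ = 1.0                           # time interval between adjacent time points

# Initialize vector to store the frequency at the T time points
f_traj = Vector{Float64}(undef, length(s_mean) + 1)
f_traj[1] = f_init

# Apply f_{t+1} = f_t * exp((s - s̄_t) * τ) for each pair of adjacent time points
for t in eachindex(s_mean)
    f_traj[t + 1] = f_traj[t] * exp((s_mut - s_mean[t]) * τ)
end # for

# The log frequency ratios recover (s - s̄_t) * τ
log_ratio = log.(f_traj[2:end] ./ f_traj[1:end-1])
```

Under this model, the log frequency ratio between adjacent time points is $(s^{(m)} - \bar{s}_t)\tau$, so `log_ratio` should match `(s_mut .- s_mean) .* τ` up to floating-point error.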

The easiest way to generate a barcode trajectory is to first generate the
deterministic trajectories using this exponential growth function and then add
noise on top of these trajectories.
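
The text at this point does not specify the noise model. As a minimal sketch, the example below assumes independent Poisson noise around a deterministic trajectory, which is one common choice for count data and should be read as an assumption rather than as the model used in this project. The deterministic values in `λ_det` are hypothetical.

```julia
import Random
import Distributions

Random.seed!(42)

# Hypothetical deterministic mean counts for one barcode at 5 time points
λ_det = [20.0, 30.0, 45.0, 67.5, 101.25]

# Assumed noise model: one independent Poisson draw per time point,
# centered on the deterministic value
R_noisy = [rand(Distributions.Poisson(λ)) for λ in λ_det]
```

Each entry of `R_noisy` is a non-negative integer count whose expected value is the corresponding entry of `λ_det`.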

First, let's define the number of barcodes, their corresponding fitness, and
their initial count.

In [10]:
Random.seed!(42)

# Define number of neutral barcodes
n_neutral = 10
# Define number of mutant barcodes
n_mutant = 90
# Define total number of barcodes
n_barcode = n_neutral + n_mutant

# Define prior parameters (mean, standard deviation) for mutant fitness
s_mut_prior = [0.3, 0.3]
# Define prior parameters (log-scale mean, standard deviation) for initial λ
# values
λ_prior = [3, 3]

# Sample mutant fitness values
s̲ᴹ = Random.rand(Distributions.Normal(s_mut_prior...), n_mutant)

# Sample initial λ parameter values for all barcodes (neutral and mutant)
R_init = Random.rand(Distributions.LogNormal(λ_prior...), n_barcode);

Next, let's compute the frequency dynamics over time.

In [None]:
# Define number of time points
n_time = 5

# Initialize matrix to store neutral barcode counts
R̲̲ᴺ = Matrix{Float64}(undef, n_time, n_neutral)
# Initialize matrix to store mutant barcode counts
R̲̲ᴹ = Matrix{Float64}(undef, n_time, n_mutant)

# Add initial time point
R̲̲ᴺ[1, :] = R_init[1:n_neutral]
R̲̲ᴹ[1, :] = R_init[n_neutral+1:end]

# Loop through time points
for t = 2:n_time
    # Integrate dynamics
end # for