(c) 2024 Manuel Razo. This work is licensed under a [Creative Commons
Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/).
All code contained herein is licensed under an [MIT
license](https://opensource.org/licenses/MIT).

In [1]:
# Import project package
import Antibiotic

# Import package to handle DataFrames
import DataFrames as DF
import CSV

# Import library for Bayesian inference
import Turing
import FillArrays

# Import library to list files
import Glob

# Load CairoMakie for plotting
using CairoMakie
import PairPlots
import ColorSchemes

# Import basic math libraries
import LsqFit
import StatsBase
import LinearAlgebra
import Random

# Activate backend
CairoMakie.activate!()

# Set PBoC Plotting style
Antibiotic.viz.theme_makie!()

# Bayesian Inference of $IC_{50}$ values

In this notebook, we will perform Bayesian inference on the $IC_{50}$ values of
the antibiotic resistance landscape. For this, we will use the raw `OD620`
measurements provided by Iwasawa et al. (2022).

Let's begin by loading the data into a DataFrame.

In [None]:
# Load data into a DataFrame
df = CSV.read(
    "$(git_root())/data/Iwasawa_2022/iwasawa_tidy.csv", DF.DataFrame
)

first(df, 5)

To double-check that the structure of the table makes sense, let's plot the time
series for one example to see if the sequence agrees with the expectation.

In [None]:
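# Define data to use: a single example time series, dropping blank wells and
# zero concentrations (which cannot be shown on a log axis)
data = df[
    (df.antibiotic .== "KM") .&
    (df.env .== "Parent_in_KM") .&
    (df.strain_num .== 13) .&
    .!(df.blank) .&
    (df.concentration_ugmL .> 0),
    :]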
# Group data by day
df_group = DF.groupby(data, :day)

# Initialize figure
fig = Figure(size=(500, 300))

# Add axis
ax = Axis(
    fig[1, 1],
    xlabel="antibiotic concentration",
    ylabel="OD₆₂₀",
    xscale=log2
)

# Define colors for plot
colors = get(ColorSchemes.Blues_9, LinRange(0.25, 1, length(df_group)))

# Loop through days
for (i, d) in enumerate(df_group)
    # Sort data by concentration
    DF.sort!(d, :concentration_ugmL)
    # Plot scatter line
    scatterlines!(
        ax, d.concentration_ugmL, d.OD, color=colors[i], label="$(first(d.day))"
    )
end # for

# Add legend to plot
fig[1, 2] = Legend(
    fig, ax, "day", framevisible=false, nbanks=3, labelsize=10
)

fig

The functional form used by the authors to fit the data is
$$
f(x) = \frac{a}
{1+\exp \left[b\left(\log _2 x-\log _2 \mathrm{IC}_{50}\right)\right]} + c
\tag{1}
$$
where $a$, $b$, and $c$ are nuisance parameters of the model, $\mathrm{IC}_{50}$
is the parameter of interest, and $x$ is the antibiotic concentration. We can
define a function to compute this model.

In [None]:
@doc raw"""
    logistic(logx, a, b, c, logic50)

Compute the logistic function used to model the relationship between antibiotic
concentration and bacterial growth. The concentration and the IC₅₀ are both
passed on a log scale.

This function implements the following equation:

f(x) = a / (1 + exp(b * (log(x) - log(IC₅₀)))) + c

# Arguments
- `logx`: log of the antibiotic concentration (input variable)
- `a`: Maximum effect parameter (difference between upper and lower asymptotes)
- `b`: Slope parameter (steepness of the curve)
- `c`: Minimum effect parameter (lower asymptote)
- `logic50`: log of the IC₅₀ parameter, in the same base as `logx`

# Returns
The computed effect (e.g., optical density) for the given log antibiotic
concentration and parameters.

Note: This function is vectorized and can handle array inputs for `logx`. Any
log base works as long as `logx` and `logic50` share it; changing the base only
rescales `b`.
"""
function logistic(logx, a, b, c, logic50)
    return @. a / (1.0 + exp(b * (logx - logic50))) + c
end
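
As a quick sanity check of Eq. (1), the sketch below evaluates the curve far
below, at, and far above the $\mathrm{IC}_{50}$. The parameter values are
hypothetical and chosen only for illustration; the function is repeated so the
snippet runs on its own.

```julia
# Same functional form as Eq. (1), repeated so the snippet is self-contained
logistic(logx, a, b, c, logic50) = @. a / (1.0 + exp(b * (logx - logic50))) + c

# Hypothetical parameter values (not fitted to any data)
a, b, c, ic50 = 0.8, 2.0, 0.05, 4.0

# Concentrations far below, exactly at, and far above the IC₅₀
x = [ic50 / 1024, ic50, ic50 * 1024]

# Evaluate the curve using log₂ for both the concentration and the IC₅₀
y = logistic(log2.(x), a, b, c, log2(ic50))

# Far below the IC₅₀ the curve approaches the upper asymptote a + c
println("low concentration:  ", y[1])
# At x = IC₅₀ the exponent vanishes, so f = a / 2 + c
println("at the IC₅₀:        ", y[2])
# Far above the IC₅₀ the curve approaches the lower asymptote c
println("high concentration: ", y[3])
```

The midpoint property $f(\mathrm{IC}_{50}) = a/2 + c$ is what makes
$\mathrm{IC}_{50}$ interpretable as the concentration at which the response
drops halfway between the two asymptotes.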

## Bayesian model

Given the model presented in Eq. (1), and the data, our objective is to infer
the value of all parameters. By Bayes' theorem, we write
$$
\pi(\mathrm{IC}_{50}, a, b, c \mid \text{data}) = 
\frac{\pi(\text{data} \mid \mathrm{IC}_{50}, a, b, c) 
\pi(\mathrm{IC}_{50}, a, b, c)}
{\pi(\text{data})},
\tag{2}
$$
where $\text{data}$ consists of the pairs of antibiotic concentration and
optical density.


### Likelihood $\pi(\text{data} \mid \mathrm{IC}_{50}, a, b, c)$

Let's begin by defining the likelihood function. For simplicity, we assume each
datum is independent and identically distributed (i.i.d.) and write
$$
\pi(\text{data} \mid \mathrm{IC}_{50}, a, b, c) = 
\prod_{i=1}^n \pi(d_i \mid \mathrm{IC}_{50}, a, b, c),
\tag{3}
$$
where $d_i = (x_i, y_i)$ is the $i$-th pair of antibiotic concentration and
optical density, respectively, and $n$ is the total number of data points.

As a first pass, we assume that our experimental measurements can be expressed
as
$$
y_i = f(x_i, \mathrm{IC}_{50}, a, b, c) + \epsilon_i,
\tag{4}
$$
where $\epsilon_i$ is the experimental error. Furthermore, we assume that the
experimental error is normally distributed, i.e.,
$$
\epsilon_i \sim \mathcal{N}(0, \sigma^2),
\tag{5}
$$
where $\sigma^2$ is an unknown variance parameter that must be included in our
inference. Notice that we assume the same variance parameter for all data
points since $\sigma^2$ is not indexed by $i$.

Given this likelihood function, we must update our inference on the parameters
as
$$
\pi(\mathrm{IC}_{50}, a, b, c, \sigma^2 \mid \text{data}) = 
\frac{\pi(\text{data} \mid \mathrm{IC}_{50}, a, b, c, \sigma^2) 
\pi(\mathrm{IC}_{50}, a, b, c, \sigma^2)}
{\pi(\text{data})},
\tag{6}
$$
to include the new parameter $\sigma^2$.

Our likelihood function is then of the form
$$
y_i \mid \mathrm{IC}_{50}, a, b, c, \sigma^2 \sim 
\mathcal{N}(f(x_i, \mathrm{IC}_{50}, a, b, c), \sigma^2).
\tag{7}
$$
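
To make Eqs. (3) and (7) concrete, here is a minimal sketch of the
corresponding log-likelihood, written with the standard library only. The
function name `gaussian_loglik` and the toy data are hypothetical; this is not
the implementation used for inference later in the notebook.

```julia
# Same functional form as Eq. (1), repeated so the snippet is self-contained
logistic(logx, a, b, c, logic50) = @. a / (1.0 + exp(b * (logx - logic50))) + c

# Log-likelihood for i.i.d. Gaussian errors with shared variance σ²:
# the log of the product in Eq. (3) with each factor given by Eq. (7)
function gaussian_loglik(x, y, a, b, c, ic50, σ²)
    # Mean prediction from Eq. (1)
    μ = logistic(log2.(x), a, b, c, log2(ic50))
    # Number of data points
    n = length(y)
    # Sum of the log-densities of n independent normal distributions
    return -0.5 * n * log(2 * π * σ²) - sum(abs2, y .- μ) / (2 * σ²)
end

# Hypothetical noise-free data generated with IC₅₀ = 2
x = [0.5, 1.0, 2.0, 4.0, 8.0]
y = logistic(log2.(x), 0.8, 2.0, 0.05, log2(2.0))

# Log-likelihood at the generating parameters and at a shifted IC₅₀
ll_true = gaussian_loglik(x, y, 0.8, 2.0, 0.05, 2.0, 0.01)
ll_off = gaussian_loglik(x, y, 0.8, 2.0, 0.05, 4.0, 0.01)

println("generating IC₅₀ is favored: ", ll_true > ll_off)
```

Because the toy data carry no noise, the residuals vanish at the generating
parameters and only the normalization term remains.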


### Prior $\pi(\mathrm{IC}_{50}, a, b, c, \sigma^2)$

For the prior, we assume that all parameters are independent and write
$$
\pi(\mathrm{IC}_{50}, a, b, c, \sigma^2) = 
\pi(\mathrm{IC}_{50}) \pi(a) \pi(b) \pi(c) \pi(\sigma^2).
\tag{8}
$$

Let's detail each prior.

1. $\mathrm{IC}_{50}$: The $IC_{50}$ is a strictly positive parameter. However,
we will fit for $\log_2(\mathrm{IC}_{50})$. Thus, we will use a normal prior
for $\log_2(\mathrm{IC}_{50})$. This means we have
$$
\log_2(\mathrm{IC}_{50}) \sim 
\mathcal{N}(
    \mu_{\log_2(\mathrm{IC}_{50})}, \sigma_{\log_2(\mathrm{IC}_{50})}^2
).
\tag{9}
$$

2. $a$: This nuisance parameter scales the logistic function. Again, the natural
scale for this parameter is a strictly positive real number. Thus, we will
use a lognormal prior for $a$. This means we have
$$
a \sim \text{LogNormal}(\mu_a, \sigma_a^2).
\tag{10}
$$

3. $b$: This parameter controls the steepness of the logistic function. Again,
the natural scale for this parameter is a strictly positive real number. Thus,
we will use a lognormal prior for $b$. This means we have
$$
b \sim \text{LogNormal}(\mu_b, \sigma_b^2).
\tag{11}
$$

4. $c$: This parameter controls the minimum value of the logistic function. 
Since this is a strictly positive real number that does not necessarily scale
with the data, we will use a half-normal prior for $c$. This means we have
$$
c \sim \text{Half-}\mathcal{N}(0, \sigma_c^2).
\tag{12}
$$

5. $\sigma^2$: This parameter controls the variance of the experimental error.
Since this is a strictly positive real number that does not necessarily scale
with the data, we will use a half-normal prior for $\sigma^2$. This means we
have
$$
\sigma^2 \sim \text{Half-}\mathcal{N}(0, \sigma_{\sigma^2}^2).
\tag{13}
$$
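
Before fitting, it can help to see what curves these priors imply. The sketch
below draws parameters from Eqs. (9) to (12) using only `randn`, assuming
location 0 and scale 1 for every prior, and evaluates Eq. (1) on a
hypothetical grid of $\log_2$ concentrations. It is an illustration of a prior
predictive draw, not part of the inference itself.

```julia
using Random

# Same functional form as Eq. (1), repeated so the snippet is self-contained
logistic(logx, a, b, c, logic50) = @. a / (1.0 + exp(b * (logx - logic50))) + c

Random.seed!(42)

# Hypothetical grid of log₂ concentrations, for illustration only
log2x = collect(-5.0:1.0:5.0)

# Number of prior draws
n_draws = 200

# One prior-predictive mean curve per column
curves = Matrix{Float64}(undef, length(log2x), n_draws)
# Asymptotes implied by each draw
lower = Vector{Float64}(undef, n_draws)
upper = Vector{Float64}(undef, n_draws)

for j in 1:n_draws
    log2ic50 = randn()   # Eq. (9): normal prior on log₂(IC₅₀)
    a = exp(randn())     # Eq. (10): lognormal prior
    b = exp(randn())     # Eq. (11): lognormal prior
    c = abs(randn())     # Eq. (12): half-normal prior
    curves[:, j] = logistic(log2x, a, b, c, log2ic50)
    lower[j] = c
    upper[j] = a + c
end

# Every curve must lie between its own asymptotes c and a + c
println(all(lower' .<= curves .<= upper'))
```

Plotting the columns of `curves` against `log2x` gives a quick visual check of
whether the chosen hyperparameters cover the range of optical densities seen
in the data.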

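With all of this in place, we are ready to define a `Turing` model to perform
Bayesian inference on the parameters of the model. Note that the implementation
below departs from the description above in two ways: the likelihood is
evaluated on the logarithm of the optical density, and each concentration (each
row of the input matrices) gets its own variance parameter with a lognormal
prior rather than a single shared variance with a half-normal prior.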

In [None]:
Turing.@model function logistic_model(
    logx, logy, prior_params::NamedTuple=NamedTuple()
)
    # Define default prior parameters
    default_params = (
        logic50=(0, 1),
        a=(0, 1),
        b=(0, 1),
        c=(0, 1),
        σ²=(-2, 1)
    )

    # Merge default parameters with provided parameters
    params = merge(default_params, prior_params)

    # Define priors
    logic50 ~ Turing.Normal(params.logic50...)
    a ~ Turing.LogNormal(params.a...)
    b ~ Turing.LogNormal(params.b...)
    c ~ Turing.truncated(Turing.Normal(params.c...), 0, Inf)
    σ² ~ Turing.MvLogNormal(
        fill(params.σ²[1], size(logx, 1)),
        LinearAlgebra.Diagonal(fill(params.σ²[2], size(logx, 1)))
    )

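    # Loop through each row (one row per concentration)
    for i in 1:size(logx, 1)
        # Define likelihood on the log scale. The covariance is passed as a
        # diagonal matrix so that σ²[i] enters as a variance (a plain vector
        # in this position would be read as standard deviations)
        logy[i, :] ~ Turing.MvNormal(
            log.(logistic(logx[i, :], a, b, c, logic50)),
            LinearAlgebra.Diagonal(fill(σ²[i], size(logx, 2)))
        )
    end # for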
end

Having defined the model, let's perform inference on the data. First, we need to
extract the data into the format required by the model.

In [66]:
# Group data by antibiotic, environment, and day
df_group = DF.groupby(
    df[(.!df.blank).&(df.concentration_ugmL.>0), :],
    [:antibiotic, :env, :day]
)

# Extract data
data = df_group[2]

# Get unique concentrations and sort them
unique_conc = sort(unique(data.concentration_ugmL))

# Count occurrences of each concentration
conc_counts = [count(==(c), data.concentration_ugmL) for c in unique_conc]

# Create concentration matrix
conc_matrix = reduce(
    vcat,
    [fill(c, n)' for (c, n) in zip(unique_conc, conc_counts)]
)

# Create OD matrix
od_matrix = zeros(length(unique_conc), maximum(conc_counts))

# Loop through each concentration
for (i, conc) in enumerate(unique_conc)
    # Get OD values for the current concentration
    od_values = data.OD[data.concentration_ugmL.==conc]
    # Store OD values in the matrix
    od_matrix[i, 1:length(od_values)] = od_values
end
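
The reshaping above implicitly assumes that every concentration has the same
number of replicates: a row with fewer measurements would keep its zero
padding, and `log(0)` evaluates to `-Inf`. The sketch below reproduces the
reshaping on hypothetical toy vectors and adds an explicit guard for that
assumption. The data and the error message are made up for illustration.

```julia
# Hypothetical long-format measurements: two replicates per concentration
conc = [1.0, 2.0, 1.0, 4.0, 2.0, 4.0]
od = [0.9, 0.5, 0.8, 0.1, 0.6, 0.2]

# Unique concentrations, sorted, and the number of replicates of each
unique_conc = sort(unique(conc))
conc_counts = [count(==(c), conc) for c in unique_conc]

# Guard: zero padding of unbalanced rows would become -Inf after taking logs
all(==(first(conc_counts)), conc_counts) ||
    error("unequal number of replicates per concentration")

# One row per concentration, one column per replicate
od_matrix = reduce(vcat, [od[conc .== c]' for c in unique_conc])
conc_matrix = reduce(
    vcat, [fill(c, n)' for (c, n) in zip(unique_conc, conc_counts)]
)

# Both matrices share the same shape and have finite logarithms
println(size(od_matrix) == size(conc_matrix))
println(all(isfinite, log.(od_matrix)))
```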

Now we can perform inference on the data.

In [None]:
Random.seed!(42)
# Perform inference
model = logistic_model(
    log.(conc_matrix),
    log.(od_matrix),
)

# Define number of steps
n_burnin = 10_000
n_samples = 1_000

chain = Turing.sample(
    model, Turing.NUTS(), Turing.MCMCThreads(), n_burnin + n_samples, 4
)

Let's now look at the posterior predictive check. For this, we first generate
samples from the posterior predictive distribution.

In [None]:
Random.seed!(42)

# Initialize matrix to store samples
logy_samples = Array{Float64}(
    undef, size(conc_matrix)..., n_samples
)

# Convert chain to DataFrame
df_chain = DF.DataFrame(chain[n_burnin+1:end, :, :])
# Extract index for variances
σ²_idx = occursin.(["σ²"], names(df_chain))

# Loop through samples
for i in 1:n_samples
    # Generate mean prediction
    logμ = log.(
        logistic(
            log.(conc_matrix),
            df_chain[i, :a],
            df_chain[i, :b],
            df_chain[i, :c],
            df_chain[i, :logic50]
        )
    )
    # Extract vector of variances
    σ² = collect(df_chain[i, σ²_idx])
    # Combine mean prediction and noise
    logy_samples[:, :, i] = logμ .+ randn(size(logμ)) .* sqrt.(σ²)
end # for

# Compute quantiles for 95% and 68% intervals as well as the median
logy_quant = reduce(
    hcat,
    StatsBase.quantile.(
        vec.(eachslice(logy_samples, dims=1)),
        Ref([0.025, 0.975, 0.16, 0.84, 0.5])
    )
)
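
For reference, a central interval with probability mass $p$ spans the
quantiles $(1 - p)/2$ and $(1 + p)/2$, so a 95% interval runs from 0.025 to
0.975 and a 68% interval from 0.16 to 0.84. The sketch below applies the same
slice-and-quantile pattern to a hypothetical array of standard normal draws,
using the `Statistics` standard library, to show the shape of the result.

```julia
using Random
using Statistics

Random.seed!(1)

# Hypothetical samples with dimensions (concentration, replicate, draw)
samples = randn(3, 2, 500)

# Lower and upper bounds of the 95% and 68% intervals, plus the median
probs = [0.025, 0.975, 0.16, 0.84, 0.5]

# Pool replicates and draws for each concentration, then compute quantiles.
# The result has one row per quantile and one column per concentration
quant = reduce(
    hcat,
    [quantile(vec(s), probs) for s in eachslice(samples, dims=1)]
)

println(size(quant))
```

Row `k` of `quant` then holds the `probs[k]` quantile at every concentration,
which is the layout the plotting code expects.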

Now we can plot the posterior predictive check, showing the 95% and 68%
intervals as shaded regions.

In [None]:
# Initialize figure
fig = Figure(size=(350, 300))
# Add axis
ax = Axis(
    fig[1, 1],
    xlabel="antibiotic concentration",
    ylabel="optical density",
    title="Posterior Predictive Check",
    xscale=log10,
    yscale=log10
)

# Plot 95% and 68% intervals
band!(
    ax,
    conc_matrix[:, 1],
    exp.(logy_quant[1, :]),
    exp.(logy_quant[2, :]),
    color=(ColorSchemes.Blues_9[5], 1)
)
band!(
    ax,
    conc_matrix[:, 1],
    exp.(logy_quant[3, :]),
    exp.(logy_quant[4, :]),
    color=(ColorSchemes.Blues_9[6], 1)
)
# Plot median
lines!(
    ax,
    conc_matrix[:, 1],
    exp.(logy_quant[5, :]),
    color=ColorSchemes.Blues_9[7]
)

# Plot data
scatter!(ax, data.concentration_ugmL, data.OD, color=:black)

fig