# Modeling microtubule catastrophe II

<hr>

In [1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade iqplot colorcet bebi103 arviz cmdstanpy watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    import cmdstanpy; cmdstanpy.install_cmdstan()
    data_path = "https://raw.githubusercontent.com/justinbois/learnbayes-livecode/main/"
else:
    data_path = "./"
# ------------------------------

import numpy as np
import pandas as pd

import cmdstanpy
import arviz as az

import bebi103

import iqplot

import bokeh.io
bokeh.io.output_notebook()

<hr>

In our analysis so far, we found that the Gamma likelihood was the most plausible generative model. It does have the shortcoming, though, that it does not *directly* match a story that might arise from chemical kinetics, which we strongly suspect regulates microtubule catastrophe. We expect an *integer* number of Poisson processes to arrive sequentially in order for catastrophe to occur. (Bio)chemical processes are discrete events that can proceed at different rates, whereas the Gamma distribution describes the amount of time for $\alpha$ Poisson processes, all with the same rate, to arrive in sequence, and $\alpha$ need not be an integer. This is not really what we would expect from chemical kinetics.

Of course, to proceed, we need to have the data set loaded in.

In [2]:
t = np.loadtxt(os.path.join(data_path, 'gardner_zanic_mt_catastrophe.csv'))
data = dict(t=t, N=len(t))

## Model 4: Two discrete Poisson processes

We now consider another model, where catastrophe happens upon the arrival of the second of two Poisson processes, which arrive at different rates. Denoting the characteristic times of the two processes by $\tau_1$ and $\tau_2$, we can work out that the probability density function for this story is

\begin{align}
f(t\mid \tau_1, \tau_2) &= \frac{\mathrm{e}^{-t/\tau_2} - \mathrm{e}^{-t/\tau_1}}{\tau_2 - \tau_1}.
\end{align}
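
One way to arrive at this expression, sketched here under the assumption that the two waiting times $t_1 \sim \text{Expon}(1/\tau_1)$ and $t_2 \sim \text{Expon}(1/\tau_2)$ are independent and that $\tau_1 \ne \tau_2$, is to convolve the two Exponential densities, since the catastrophe time is $t = t_1 + t_2$. The integration variable $t'$ is introduced only for this derivation.

\begin{align}
f(t\mid \tau_1, \tau_2) &= \int_0^t \mathrm{d}t'\, \frac{\mathrm{e}^{-t'/\tau_1}}{\tau_1}\,\frac{\mathrm{e}^{-(t-t')/\tau_2}}{\tau_2} \\[1em]
&= \frac{\mathrm{e}^{-t/\tau_2}}{\tau_1\tau_2}\int_0^t \mathrm{d}t'\, \mathrm{e}^{-t'\left(1/\tau_1 - 1/\tau_2\right)} \\[1em]
&= \frac{\mathrm{e}^{-t/\tau_2}}{\tau_1\tau_2}\,\frac{\tau_1\tau_2}{\tau_2 - \tau_1}\left(1 - \mathrm{e}^{-t/\tau_1 + t/\tau_2}\right) \\[1em]
&= \frac{\mathrm{e}^{-t/\tau_2} - \mathrm{e}^{-t/\tau_1}}{\tau_2 - \tau_1}.
\end{align}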

Our goal, then, is to get estimates for the time scales of the two Poisson processes, $\tau_1$ and $\tau_2$, and then perform posterior predictive checks to see if this model could generate the observed data.
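
As a quick sanity check on the density above, the following minimal sketch evaluates it on a grid and verifies numerically that it integrates to one and has mean $\tau_1 + \tau_2$, as expected for a sum of two Exponential waiting times. The function name `twostep_pdf`, the parameter values, and the grid are illustrative assumptions, not part of the analysis.

```python
import numpy as np


def twostep_pdf(t, tau_1, tau_2):
    """PDF of the sum of two exponential waiting times (requires tau_1 != tau_2)."""
    return (np.exp(-t / tau_2) - np.exp(-t / tau_1)) / (tau_2 - tau_1)


# Illustrative time scales (hypothetical values, not estimates from the data)
tau_1, tau_2 = 2.0, 5.0

# Grid extending far into the tail so the truncation error is negligible
t_grid = np.linspace(0.0, 200.0, 200_001)
dt = t_grid[1] - t_grid[0]
pdf = twostep_pdf(t_grid, tau_1, tau_2)

# Trapezoidal rule for the normalization and the mean
total = np.sum(pdf[1:] + pdf[:-1]) / 2 * dt
integrand = t_grid * pdf
mean = np.sum(integrand[1:] + integrand[:-1]) / 2 * dt

print("normalization:", total)
print("mean:", mean, "expected:", tau_1 + tau_2)
```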

## Priors for the time scales

We note that the likelihood is invariant to switching the labels 1 and 2. Therefore, we can arbitrarily choose $\tau_1 < \tau_2$, assigning $\tau_1$ to the faster of the two Poisson processes. We will use the same priors as before, but insist that $\tau_1 < \tau_2$.

\begin{align}
&\log_{10} T_1 \sim \text{Norm}(1/2, 3/4),\\[1em]
&\log_{10} T_2 \sim \text{Norm}(1/2, 3/4),\\[1em]
&\tau_1 = 10^{\min(\log_{10} T_1,\; \log_{10} T_2)},\\[1em]
&\tau_2 = 10^{\max(\log_{10} T_1,\; \log_{10} T_2)}.
\end{align}
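
The min/max construction in this prior amounts to drawing two independent values and sorting them. The sketch below illustrates this with NumPy; the number of draws and the variable names are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Draw log10 T_1 and log10 T_2 independently from Norm(1/2, 3/4)
n_draws = 1000  # illustrative number of prior draws
log_T = rng.normal(0.5, 0.75, size=(n_draws, 2))

# Sorting each pair applies the min and max in the prior definition
log_tau = np.sort(log_T, axis=1)
tau_1 = 10 ** log_tau[:, 0]  # faster process (smaller time scale)
tau_2 = 10 ** log_tau[:, 1]  # slower process (larger time scale)

print("fraction with tau_1 <= tau_2:", np.mean(tau_1 <= tau_2))
```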

## Prior predictive checks

Generating data from this distribution is most easily accomplished by drawing one time out of $\text{Expon}(1/\tau_1)$ and one out of $\text{Expon}(1/\tau_2)$, and adding the results.

In [3]:
n_ppc = 50
rng = np.random.default_rng(seed=3252)

tau_1 = 10 ** (rng.normal(0.5, 0.75, size=n_ppc))
tau_2 = 10 ** (rng.normal(0.5, 0.75, size=n_ppc))

t_twostep_ppc = [
    rng.exponential(t1, size=len(t)) + rng.exponential(t2, size=len(t))
    for t1, t2 in zip(tau_1, tau_2)
]

# Take a look
p = iqplot.ecdf(
    pd.DataFrame(t_twostep_ppc).transpose().melt(var_name="trial"),
    q="value",
    cats="trial",
    style="staircase",
    line_kwargs=dict(line_color="#1f78b4", line_width=0.5),
    show_legend=False,
    title="twostep",
    x_axis_type="log",
    frame_height=150,
    x_range=[1e-3, 1e3],
)

bokeh.io.show(p)

These prior predictive checks look fine; let's proceed!

## Sampling out of the posterior

The full model, including the prior we have just defined, is

\begin{align}
&\log_{10} T_1 \sim \text{Norm}(1/2, 3/4),\\[1em]
&\log_{10} T_2 \sim \text{Norm}(1/2, 3/4),\\[1em]
&\tau_1 = 10^{\min(\log_{10} T_1,\; \log_{10} T_2)},\\[1em]
&\tau_2 = 10^{\max(\log_{10} T_1,\; \log_{10} T_2)},\\[1em]
&f(\mathbf{t}\mid\tau_1, \tau_2) = \prod_i \frac{\mathrm{e}^{-t_i/\tau_2} - \mathrm{e}^{-t_i/\tau_1}}{\tau_2 - \tau_1}.
\end{align}

To code this up in Stan, we need to make a custom function for the log PDF of the likelihood. We also need to make sure that $\tau_1 < \tau_2$, which we can do using the convenient `ordered` data type in Stan. I should also be careful to handle the case where $\tau_1 \approx \tau_2$, in which the expression for the PDF approaches 0/0 and the two-step distribution reduces to a Gamma distribution with $\alpha = 2$ and rate $1/\tau_1$. Following is the Stan code.

In [4]:
twostep_model = """
functions {
  real twostep_lpdf(array[] real ys, vector tau) {
    // log of the PDF for the two-successive Poisson process model
    real log_pdf;

    // Special case where taus are very close
    if (tau[2] - tau[1] < 1e-8) {
      log_pdf = gamma_lpdf(ys | 2, 1.0 / tau[1]);
    }
    else {
      log_pdf = -num_elements(ys) * log(tau[2] - tau[1]);
      for (y in ys) {
        log_pdf += log_diff_exp(-y / tau[2], -y / tau[1]);
      }
    }
    
    return log_pdf;
  }
}


data {
  int N;
  array[N] real t;
}


parameters {
  ordered[2] log_tau;
}


transformed parameters {
  vector[2] tau = 10 ^ log_tau;
}


model {
  log_tau ~ normal(0.5, 0.75);

  t ~ twostep(tau);
}


generated quantities {
  array[N] real t_ppc;

  for (i in 1:N) {
    t_ppc[i] = exponential_rng(1.0 / tau[1]) + exponential_rng(1.0 / tau[2]);
  }
}
"""
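
To see why the special case in `twostep_lpdf` is reasonable, here is a minimal NumPy sketch of the same log PDF. It is not a translation of Stan's internals: the `log_diff_exp` step is written as $a + \log(-\mathrm{expm1}(b - a))$, which is one stable way to compute $\log(\mathrm{e}^a - \mathrm{e}^b)$ for $a > b$. The function name `twostep_logpdf`, the tolerance argument `tol`, and the test values are assumptions made for illustration.

```python
import numpy as np


def twostep_logpdf(t, tau_1, tau_2, tol=1e-8):
    """Summed log PDF of the two-step model for data t, assuming tau_1 <= tau_2."""
    t = np.asarray(t, dtype=float)

    # Special case: time scales nearly equal, use Gamma(alpha=2, rate=1/tau_1),
    # whose log PDF is log(t) - 2 log(tau_1) - t / tau_1
    if tau_2 - tau_1 < tol:
        return np.sum(np.log(t) - 2 * np.log(tau_1) - t / tau_1)

    # General case: log(exp(a) - exp(b)) with a > b, computed stably
    a = -t / tau_2
    b = -t / tau_1
    log_diff = a + np.log(-np.expm1(b - a))

    return np.sum(log_diff) - t.size * np.log(tau_2 - tau_1)


# Hypothetical catastrophe times, for illustration only
t_demo = np.array([1.0, 3.0, 7.5])

lp_gamma = twostep_logpdf(t_demo, 3.0, 3.0)           # takes the Gamma branch
lp_general = twostep_logpdf(t_demo, 3.0, 3.0 + 1e-6)  # takes the general branch

print("Gamma branch:  ", lp_gamma)
print("general branch:", lp_general)
```

The two branches should agree closely when the time scales differ only slightly, which is the continuity that the special case in the Stan function relies on.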

Let's compile and sample out of this model. I need to set `adapt_delta` above the default to avoid possible divergences and to get good effective sample size.

In [5]:
with open("twostep.stan", "w") as f:
    f.write(twostep_model)
        
with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file="twostep.stan")
    samples = sm.sample(data=data, adapt_delta=0.975)

samples = az.from_cmdstanpy(samples, posterior_predictive='t_ppc')

As is good practice, let's do a diagnostic check.

In [6]:
bebi103.stan.check_all_diagnostics(samples)

Effective sample size looks reasonable for all parameters.

Rhat looks reasonable for all parameters.

0 of 4000 (0.0%) iterations ended with a divergence.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.

E-BFMI indicated no pathological behavior.


0

Looks good! Now, let's look at the posterior predictive checks.

In [7]:
t_ppc = (
    samples.posterior_predictive["t_ppc"]
    .stack({"sample": ("chain", "draw")})
    .transpose("sample", "t_ppc_dim_0")
)

bokeh.io.show(
    bebi103.viz.predictive_ecdf(
        t_ppc,
        data=t,
        title="Two-step",
        diff='ecdf',
        x_axis_label='time to catastrophe (min)',
    )
)

The two-step model fails pretty spectacularly; it's better than Exponential, but not as good as Weibull.

Even though they don't mean much since we failed the posterior predictive checks, we can look at the parameter values in a corner plot.

In [8]:
bokeh.io.show(
    bebi103.viz.corner(
        samples, parameters=[("tau[0]", "τ₁ (min)"), ("tau[1]", "τ₂ (min)")]
    )
)

## Next steps

While this is disappointing, we can push further with this idea of multiple discrete Poisson processes, each with a different rate, and we do so in the next notebook.

## Computing environment

In [9]:
%load_ext watermark
%watermark -v -p numpy,pandas,bebi103,cmdstanpy,arviz,bokeh,jupyterlab
print("cmdstan   :", bebi103.stan.cmdstan_version())

Python implementation: CPython
Python version       : 3.11.4
IPython version      : 8.12.0

numpy     : 1.24.3
pandas    : 1.5.3
bebi103   : 0.1.14
cmdstanpy : 1.1.0
arviz     : 0.16.1
bokeh     : 3.2.1
jupyterlab: 4.0.3

cmdstan   : 2.32.2
