In [1]:
import warnings
import multiprocessing

import pandas as pd
import numpy as np
import scipy.optimize
import scipy.stats as st

import numba
import bebi103
import tqdm
import bokeh_catplot
import holoviews as hv
from bokeh.layouts import row

hv.extension('bokeh')
bebi103.hv.set_defaults()

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

## Microtubule Catastrophe

Microtubules are intracellular filaments responsible for shuttling cargo throughout the cell interior. When viewed under the microscope in the presence of GTP, microtubules can be observed slowly growing, then rapidly collapsing, as they explore the environment around them.

This latter feat -- rapid collapse -- is a process known as "microtubule catastrophe". To explore this phenomenon in a quantitative manner, Melissa K. Gardner, Marija Zanic, Christopher Gell, Volker Bormuth and Jonathan Howard performed experiments that uncovered a role for the kinesins, Kip3 and MCAK, in controlling microtubule catastrophe.

The paper is: **Gardner MK, _et al._ "Depolymerizing Kinesins Kip3 and MCAK Shape Cellular Microtubule Architecture by Differential Control of Catastrophe" _Cell_ (2011)**.

Throughout this analysis, we will refer to this paper, as we go from exploratory data analysis to in-depth analysis and mathematical modeling.

The dataset used in this analysis contains the times to microtubule catastrophe (in seconds) observed at various concentrations of tubulin. It can be accessed via the link below.


In [2]:
from IPython.display import display, HTML
display(HTML("""<a href="https://www.cell.com/cell/fulltext/S0092-8674(11)01287-6?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS0092867411012876%3Fshowall%3Dtrue">Gardner et al. Cell (2011), link to paper.</a>"""))

display(HTML("""<a href="http://bebi103.caltech.edu.s3-website-us-east-1.amazonaws.com/2019a/content/homework/hw9/hw9.1.html">Dataset download, link to course website.</a>"""))

## 9.1a: Exploratory Data Analysis

In [3]:
#Begin by importing the relevant dataset
df = pd.read_csv('../data/gardner_mt_catastrophe_only_tubulin.csv', comment='#')

#Take a look at the layout
df.head()

Unnamed: 0,12 uM,7 uM,9 uM,10 uM,14 uM
0,25.0,35.0,25.0,50.0,60.0
1,40.0,45.0,40.0,60.0,75.0
2,40.0,50.0,40.0,60.0,75.0
3,45.429,50.0,45.0,75.0,85.0
4,50.0,55.0,50.0,75.0,115.0


Each value in the dataframe is the amount of time (in seconds) that elapsed between the start of a microtubule's growth and its collapse. The concentrations of tubulin (μM) were systematically varied, and the times to catastrophe observed via TIRF microscopy with Alexa-488 used to label GTP-tubulin.

The DataFrame, in its current form, is not tidy. Each row should be an independent observation. We will begin by renaming the column headers to add additional experimental details.

In [4]:
col_names = {
    '12 uM': '12 μM tubulin',
    '7 uM': '7 μM tubulin',
    '9 uM': '9 μM tubulin',
    '10 uM': '10 μM tubulin',
    '14 uM': '14 μM tubulin'
}

df = df.rename(columns=col_names)

df.head()

Unnamed: 0,12 μM tubulin,7 μM tubulin,9 μM tubulin,10 μM tubulin,14 μM tubulin
0,25.0,35.0,25.0,50.0,60.0
1,40.0,45.0,40.0,60.0,75.0
2,40.0,50.0,40.0,60.0,75.0
3,45.429,50.0,45.0,75.0,85.0
4,50.0,55.0,50.0,75.0,115.0


This DataFrame still is not tidy, so we will 'melt' it and remove 'NaN' values, since many of the columns have different lengths.

In [5]:
df = pd.melt(df, value_name = 'time to catastrophe (s)')
df.tail()

Unnamed: 0,variable,time to catastrophe (s)
3455,14 μM tubulin,
3456,14 μM tubulin,
3457,14 μM tubulin,
3458,14 μM tubulin,
3459,14 μM tubulin,


In [6]:
df_tidy = df.dropna()
df_tidy.tail()

Unnamed: 0,variable,time to catastrophe (s)
2904,14 μM tubulin,1005.0
2905,14 μM tubulin,1135.0
2906,14 μM tubulin,1305.0
2907,14 μM tubulin,1400.0
2908,14 μM tubulin,1420.0


In [7]:
df_tidy = df_tidy.rename(columns={"variable": "concentration"})
df_tidy.tail()

Unnamed: 0,concentration,time to catastrophe (s)
2904,14 μM tubulin,1005.0
2905,14 μM tubulin,1135.0
2906,14 μM tubulin,1305.0
2907,14 μM tubulin,1400.0
2908,14 μM tubulin,1420.0


Now that the DataFrame is in a "tidy format", the NaNs have been removed, and the columns have been appropriately labeled with the relevant tubulin concentrations and times to catastrophe, we can begin our initial exploratory data analysis.
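The reshaping steps above can be illustrated on a tiny, made-up table. This is only a sketch: the numbers and the two column names are invented for the example, and the `var_name` argument is used here to name the concentration column in one step rather than renaming it afterwards.

```python
import numpy as np
import pandas as pd

# Made-up wide table: one column per condition, padded with NaN because
# the conditions have different numbers of observations.
wide = pd.DataFrame(
    {
        "7 uM": [35.0, 45.0, 50.0],
        "12 uM": [25.0, 40.0, np.nan],
    }
)

# Melt to long form (one row per observation), then drop the NaN padding.
tidy = (
    pd.melt(wide, var_name="concentration", value_name="time to catastrophe (s)")
    .dropna()
    .reset_index(drop=True)
)

print(tidy)
```

Each remaining row is a single observation labeled by its condition, which is the layout that the plotting functions below expect.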

First, we will analyze ECDFs for times to catastrophe for all five tubulin concentrations.

In [8]:
p = bokeh_catplot.ecdf(
    data=df_tidy,
    cats='concentration',
    val='time to catastrophe (s)',
    style='dots',
)

bokeh.io.show(p)

From this simple ECDF, it does appear that there are differences in the times to catastrophe, particularly for the '14 μM tubulin' dataset, which appears to be "right-shifted" towards higher times to catastrophe.
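For readers less familiar with ECDFs, the quantity being plotted can be sketched in a few lines of NumPy. This is an illustration on made-up numbers; the helper name `ecdf_points` is hypothetical and is not part of bokeh_catplot.

```python
import numpy as np


def ecdf_points(data):
    """Return sorted values x and ECDF heights y, where y[i] is the
    fraction of observations less than or equal to x[i]."""
    x = np.sort(np.asarray(data, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# Made-up times to catastrophe (s), for illustration only
times = np.array([50.0, 25.0, 40.0, 40.0, 45.0])
x, y = ecdf_points(times)

for xi, yi in zip(x, y):
    print(f"{xi:6.1f} s -> {yi:.1f}")
```

A dataset whose ECDF sits further to the right has a larger fraction of long times to catastrophe, which is what "right-shifted" means here.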

Let's take a closer look by visualizing this same ECDF, but with added 95% confidence intervals, using the 'conf_int' keyword argument of bokeh_catplot.ecdf.

In [9]:
p = bokeh_catplot.ecdf(
    data=df_tidy,
    cats='concentration',
    val='time to catastrophe (s)',
    style='dots',
    conf_int=True,
)

bokeh.io.show(p)

This plot (zoom in) does seem to affirm that there is a difference in the times to catastrophe for higher concentrations of tubulin.

To continue this exploratory data analysis, it would be beneficial to assess our claim that the 14 μM tubulin dataset is, indeed, statistically different from the other datasets. To do this, we will perform nonparametric bootstrapping and visualize these datasets in a different way.

First, we convert each tubulin concentration from the DataFrame into its own NumPy array, which will be simpler to work with.

In [10]:
tubulin12 = df_tidy.loc[df_tidy['concentration'] == '12 μM tubulin', 'time to catastrophe (s)'].values
tubulin7 = df_tidy.loc[df_tidy['concentration'] == '7 μM tubulin', 'time to catastrophe (s)'].values
tubulin9 = df_tidy.loc[df_tidy['concentration'] == '9 μM tubulin', 'time to catastrophe (s)'].values
tubulin10 = df_tidy.loc[df_tidy['concentration'] == '10 μM tubulin', 'time to catastrophe (s)'].values
tubulin14 = df_tidy.loc[df_tidy['concentration'] == '14 μM tubulin', 'time to catastrophe (s)'].values

#Also make a dataframe to hold the 12uM tubulin values
df_tidy_12 = df_tidy.loc[df_tidy['concentration'] == '12 μM tubulin']

In [11]:
#Verify that the variable 'tubulin12' is, indeed, a NumPy array
tubulin12.shape

(692,)

Finally, we will also compute the mean for each of the tubulin concentrations, and store them as 'tubulinXX_mean' variables.

In [12]:
tubulin12_mean = np.mean(tubulin12)
tubulin7_mean = np.mean(tubulin7)
tubulin9_mean = np.mean(tubulin9)
tubulin10_mean = np.mean(tubulin10)
tubulin14_mean = np.mean(tubulin14)

### Bootstrap confidence intervals using nonparametric inference

To assess the confidence intervals for each of the tubulin concentrations, and more rigorously assess whether there is a true, statistical difference between the different experiments, we will perform a nonparametric bootstrap.

This works by drawing observations at random from the experimental dataset, with replacement, until we have a bootstrap sample of the same size as the original, len(dataset). We compute the statistic of interest from each bootstrap sample, and by repeating this many times we obtain a distribution of values for that statistic, from which a confidence interval can be read off. In this case, the statistic is the _mean_ time to catastrophe for each of the five tubulin concentration experiments.
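As a minimal illustration of the resampling idea, here is a vectorized sketch on made-up data. It assumes NumPy's `Generator` interface rather than the numba-compiled functions used in this notebook, and the seed, sample size, and Gamma parameters are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)  # seed chosen arbitrarily

# Made-up data set standing in for one tubulin concentration
data = rng.gamma(shape=3.0, scale=130.0, size=200)

n_reps = 2000
# Each row is one bootstrap sample: len(data) draws from the data, with replacement
bs_samples = rng.choice(data, size=(n_reps, len(data)), replace=True)

# One bootstrap replicate of the mean per row
bs_means = bs_samples.mean(axis=1)

# Percentile 95% confidence interval for the mean
conf_int = np.percentile(bs_means, [2.5, 97.5])

print("sample mean:", data.mean())
print("95% conf. int.:", conf_int)
```

Drawing all samples in one call trades memory for speed; the loop-based version compiled with numba avoids holding every bootstrap sample in memory at once.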

Here, we write two functions: one draws a bootstrap sample of size len(data) with replacement (np.random.choice), and the other returns the means of many such bootstrap samples, which are our bootstrap replicates.

In [13]:
@numba.njit
def draw_bs_sample(data):
    """Draw a bootstrap sample from a 1D data set."""
    return np.random.choice(data, size=len(data))

@numba.njit
def draw_bs_reps_mean(data, size=1):
    """Draw bootstrap replicates of the mean from 1D data set."""
    out = np.empty(size)
    for i in range(size):
        out[i] = np.mean(draw_bs_sample(data))
    return out

Now we can run the 'draw_bs_reps_mean' function for each of our five experimental tubulin concentrations.

In [14]:
bs_reps_12 = draw_bs_reps_mean(tubulin12, size=100000)
bs_reps_7 = draw_bs_reps_mean(tubulin7, size=100000)
bs_reps_9 = draw_bs_reps_mean(tubulin9, size=100000)
bs_reps_10 = draw_bs_reps_mean(tubulin10, size=100000)
bs_reps_14 = draw_bs_reps_mean(tubulin14, size=100000)

Now we have our bootstrapped means saved as five different variables, and we can readily compute the 95% confidence interval for each by calling 'np.percentile'.

In [15]:
bs_tubulin12_conf_int = np.percentile(bs_reps_12, [2.5, 97.5])
bs_tubulin7_conf_int = np.percentile(bs_reps_7, [2.5, 97.5])
bs_tubulin9_conf_int = np.percentile(bs_reps_9, [2.5, 97.5])
bs_tubulin10_conf_int = np.percentile(bs_reps_10, [2.5, 97.5])
bs_tubulin14_conf_int = np.percentile(bs_reps_14, [2.5, 97.5])

Now that we have means and confidence intervals for those means for all five tubulin concentrations, we can visualize these experiments by making plots with error bars, as described by Justin Bois.

In [16]:
#Write function to create plots with error bars. Set frame_height and width, set up bokeh.plotting.figure
def plot_with_error_bars(means, confs, names, **kwargs):
    """Make a horizontal plot of means/conf ints with error bars."""
    frame_height = kwargs.pop("frame_height", 150)
    frame_width = kwargs.pop("frame_width", 450)

    p = bokeh.plotting.figure(
        y_range=names, frame_height=frame_height, frame_width=frame_width, **kwargs
    )

    p.circle(x=means, y=names)
    for conf, name in zip(confs, names):
        p.line(x=conf, y=[name, name], line_width=2)

    return p

In [17]:
#Now visualize the means and confidence intervals for each concentration of tubulin

bokeh.io.show(
    plot_with_error_bars(
        [tubulin12.mean(), tubulin10.mean(), tubulin7.mean(), tubulin9.mean(), tubulin14.mean()],
        [bs_tubulin12_conf_int, bs_tubulin10_conf_int, bs_tubulin7_conf_int, bs_tubulin9_conf_int, bs_tubulin14_conf_int],
        names=["12μM tubulin", "10μM tubulin", "7μM tubulin", "9μM tubulin", "14μM tubulin"],
        x_axis_label="mean time to catastrophe (s)",
    )
)

Based on this exploratory data analysis, we conclude that there does indeed appear to be a difference between the higher tubulin concentrations (e.g. 12 and 14μM tubulin), compared to the lower tubulin concentrations. 

However, it would be beneficial, in this case, to take our analysis even further and to model a generative distribution for microtubule catastrophe.

## 9.1b: Comparing Models for Microtubule Catastrophe

We have previously shown that the PDF of a Gamma distribution is given by:

$P(t) = \frac{1}{\Gamma (\alpha)} \frac{(\beta t)^{\alpha}}{t} e^{- \beta t}$

And the joint PDF can then be written as:

$P(\underline{t}) = \prod^n_{i=1} \frac{1}{\Gamma (\alpha)} \frac{(\beta t_i)^{\alpha}}{t_i} e^{- \beta t_i}$

To explore the Gamma distribution model, we first need to compute the maximum likelihood estimates for the parameters $\alpha$ and $\beta$ for our observed dataset, assuming that the data are Gamma distributed.
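The log-likelihood is the sum of the log PDF over the observations. As a sanity check, the sketch below writes that sum out term by term from the PDF given above and compares it with `scipy.stats.gamma.logpdf`, using the `scale = 1/beta` convention. The function name `log_like_gamma_manual` and the five times are made up for the illustration.

```python
import numpy as np
import scipy.special
import scipy.stats as st


def log_like_gamma_manual(alpha, beta, t):
    """Sum over observations of ln P(t_i) for the Gamma PDF:
    alpha ln(beta) - ln Gamma(alpha) + (alpha - 1) ln(t_i) - beta t_i."""
    t = np.asarray(t, dtype=float)
    return np.sum(
        alpha * np.log(beta)
        - scipy.special.gammaln(alpha)
        + (alpha - 1) * np.log(t)
        - beta * t
    )


t = np.array([25.0, 40.0, 45.0, 50.0, 300.0])  # illustrative times (s)
alpha, beta = 2.0, 1 / 300

manual = log_like_gamma_manual(alpha, beta, t)
scipy_value = np.sum(st.gamma.logpdf(t, alpha, loc=0, scale=1 / beta))

print(manual, scipy_value)
```

Agreement between the two values confirms that SciPy's `scale` argument is the reciprocal of the rate $\beta$ used in the formula.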

**TAs: The code cells that follow are a repeat of our homework 8.2**

We will do this by writing two functions. The first function, below, simply returns the log likelihood of the PDF for Gamma measurements, assuming that the measurements are i.i.d, or 'independent and identically distributed'.

In [18]:
def log_like_iid_gamma_log_params(log_params, n):
    """
    Log likelihood for i.i.d. Gamma measurements with
    input being logarithm of parameters.
    
    Parameters
    ----------
    log_params: array
        Logarithm of the parameters alpha and beta.
    n: array
        Array of times to catastrophe (s).
    
    Returns
    -------
    output: float
        log-likelihood.
    """
    log_alpha, log_beta = log_params
    
    alpha = np.exp(log_alpha)
    beta = np.exp(log_beta)
    
    return np.sum(st.gamma.logpdf(n, alpha, loc=0, scale=1/beta))

Now we can write our function to actually compute the MLE for a dataset (which will be passed in, for the 12μM tubulin concentration, as the variable 'n'). We will use the L-BFGS-B solver, as it is a nice, general option, and we will use initial parameter estimates of alpha = 2 and beta = 1/300. Because the log-likelihood function takes the logarithms of the parameters, the initial guess must be supplied on the log scale as well.

For this model, we can make reasonable initial parameter estimates by looking at the ECDF of our dataset, and just by understanding what each parameter _means_ in this model.

$\alpha$ for this Gamma distribution model, represents the "number of steps" of the process (in our double Poisson model, we model a two-step process).

$\beta$ here represents the rate at which each step occurs, in units of 1/s, so 1/$\beta$ is the characteristic time per step. Based on our ECDF, a time scale of something like 300 seconds seems reasonable, which corresponds to an initial guess of $\beta$ = 1/300.

In [19]:
def mle_iid_gamma(n):
    """Perform maximum likelihood estimates for parameters for i.i.d.
    Gamma measurements. Returns the logarithms of alpha and beta."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda log_params, n: -log_like_iid_gamma_log_params(log_params, n),
            # The objective takes log parameters, so the initial guess
            # (alpha = 2, beta = 1/300) is given on the log scale as well
            x0=np.log(np.array([2, 1/300])),
            args=(n,),
            method='L-BFGS-B',
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)

Now we call the function 'mle_iid_gamma' for our dataset, stored in the variable 'n'. We will save the output as a new variable, called 'mle_gamma'.

In [20]:
mle_gamma = mle_iid_gamma(tubulin12)
mle_gamma

array([ 1.06996502, -4.87166259])

At first glance, these values for $\alpha$ and $\beta$ seem ridiculous, as neither parameter can take on a negative value. However, in our initial 'log_like_iid_gamma_log_params' function, we used the log of each value. We must convert these values back...

In [21]:
alpha_mle, beta_mle = np.exp(mle_gamma)

print("alpha: ", alpha_mle)
print("beta: ", beta_mle)

alpha:  2.915277533625213
beta:  0.007660618228816688


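These values seem much more reasonable: $\alpha$ has an estimated value of about 2.9, while the rate of arrival, $\beta$, has a value of about 0.0077 per second. This corresponds to a characteristic time per step of 1/$\beta$ ≈ 131 seconds and, since the mean of a Gamma distribution is $\alpha/\beta$, a mean time to catastrophe of roughly 381 seconds.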
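To make the interpretation of the two parameters concrete, the short sketch below converts the printed MLE values into time scales. It relies only on standard properties of the Gamma distribution (mean $\alpha/\beta$, standard deviation $\sqrt{\alpha}/\beta$); the variable names are chosen for this example.

```python
import numpy as np

# MLE values copied from the printed output above
alpha_mle = 2.915277533625213
beta_mle = 0.007660618228816688  # units of 1/s

time_per_step = 1 / beta_mle              # characteristic time of one step (s)
mean_time = alpha_mle / beta_mle          # mean time to catastrophe (s)
std_time = np.sqrt(alpha_mle) / beta_mle  # standard deviation (s)

print(f"1/beta:     {time_per_step:.1f} s")
print(f"alpha/beta: {mean_time:.1f} s")
print(f"std. dev.:  {std_time:.1f} s")
```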

However, we should not take these parameter estimates at 'face value'. We must critically assess how well these parameters, from this Gamma generative distribution, actually capture the observed, experimental dataset. 

To assess our model, we can make a number of graphical visualizations, including a Q-Q plot, which "compare(s) the measured distribution to the theoretical one parametrized by the MLE" (J. Bois, lesson 9).
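The idea behind such a Q-Q comparison can be sketched without any plotting. The example below is an assumption about the general approach, not a description of how `bebi103.viz.qqplot` is implemented: it simulates many datasets from the model, sorts each one to obtain its empirical quantiles, and checks where the sorted data fall relative to the simulated quantiles. The data are stand-ins drawn from a Gamma distribution with arbitrary parameters.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed

# Stand-in "measured" data; in this notebook it would be tubulin12
alpha, beta = 2.9, 0.0077
data = rng.gamma(alpha, 1 / beta, size=500)

# Many simulated data sets of the same size, each sorted so that
# column j holds the j-th smallest value (an empirical quantile)
sims = np.sort(rng.gamma(alpha, 1 / beta, size=(1000, len(data))), axis=1)

# 95% envelope of each simulated quantile
q_low, q_high = np.percentile(sims, [2.5, 97.5], axis=0)

# Compare the sorted data against the envelope, quantile by quantile
data_sorted = np.sort(data)
frac_inside = np.mean((data_sorted >= q_low) & (data_sorted <= q_high))

print(f"fraction of data quantiles inside the envelope: {frac_inside:.2f}")
```

On a Q-Q plot, points inside the envelope lie close to the diagonal, while quantiles that the model reproduces poorly show up as departures from it.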

To make a Q-Q plot, we first set up a variable, rg, that initializes our random number generator, and write a function, 'draw_gamma', that pulls a random value from a Gamma distribution.

In [22]:
rg = np.random.default_rng()

def draw_gamma(alpha, beta, size=1):
    return rg.gamma(alpha, 1/beta, size=size)

Now, very simply, we can call upon the incredible functionality of the 'bebi103' module, and use 'bebi103.viz.qqplot' to easily compare our gamma distribution, with parameters from the MLE (alpha_mle, beta_mle) and compare that to our actual, experimental dataset.

In [23]:
p = bebi103.viz.qqplot(
    data=tubulin12,
    gen_fun=draw_gamma,
    args=(alpha_mle, beta_mle),
    x_axis_label="time to catastrophe (s)",
    y_axis_label="time to catastrophe (s)",
    title='QQ Plot, 12μM tubulin, Gamma distribution'
)

bokeh.io.show(p)

We can see, from this analysis, that there are "outlier" values that are not appropriately captured by the Gamma distribution with the input parameters from the MLE. However, approximately 80% of values appear to be captured by the generative distribution.

To gain a better graphical visualization, we will now proceed to make a "predictive ECDF" for the Gamma distribution.

To do this, we must first "draw samples" from the "draw_gamma" function, for the input parameters, and then write a function that returns the value of an ECDF for an arbitrary point, X, to generate the "predictive ECDF".

In [24]:
parametric_bs_gamma = draw_gamma(alpha_mle, beta_mle, size=(5000, len(tubulin12)))

In [25]:
@numba.njit
def ecdf(x, data):
    """Give the value of an ECDF at arbitrary points x."""
    y = np.arange(len(data) + 1) / len(data)
    return y[np.searchsorted(np.sort(data), x, side="right")]

Now we evaluate the ECDF of each simulated dataset on a common grid of time points, which gives the "theory" ECDFs for the Gamma distribution.

In [26]:
n_theor = np.arange(0, parametric_bs_gamma.max() + 1)

ecdfs = np.array([ecdf(n_theor, sample) for sample in parametric_bs_gamma])

In [27]:
#Compute confidence interval for the theory ECDF.
ecdf_low, ecdf_high = np.percentile(ecdfs, [2.5, 97.5], axis=0)

Now that we have written the necessary functions, developed the "theory" curve, and computed the associated confidence intervals, we are ready to plot this "predictive ECDF" using the built-in bebi103.viz.fill_between function.

In [28]:
p = bebi103.viz.fill_between(
    x1=n_theor,
    y1=ecdf_high,
    x2=n_theor,
    y2=ecdf_low,
    patch_kwargs={"fill_alpha": 0.5},
    x_axis_label="time to catastrophe (s)",
    y_axis_label="ECDF",
)

bokeh.io.show(p)

Now plot the "true" experimental dataset on top of this predictive ECDF curve.

In [29]:
p = bokeh_catplot.ecdf(data=df_tidy_12, val='time to catastrophe (s)', palette=['orange'], p=p)

bokeh.io.show(p)

Wow! It seems that a predictive ECDF of the Gamma distribution for our computed MLE parameters captures our experimental data quite well. This is a strong contender for the generative model.

To take a closer look at the differences in values for the Gamma distribution ECDF and the actual, experimental data, we can make a plot that shows their "difference" for each value of X, using the bebi103 module yet again.

In [30]:
samples = np.array([draw_gamma(alpha_mle, beta_mle, size=len(tubulin12)) for _ in range(5000)])

In [31]:
p = bebi103.viz.predictive_ecdf(
    samples=samples, data=tubulin12, diff=True, discrete=True, x_axis_label="time to catastrophe (s)"
)

bokeh.io.show(p)

While the data is well-captured by the predictive ECDF, there are other reasonable, mathematical models that we can pursue in our quest to evaluate generative distributions.

In the next part of this analysis, we will compare this distribution to a "double Exponential" model.

## Exploring a "Double Exponential" Model

We have already shown that, assuming that catastrophe events are i.i.d., the joint PDF of our double exponential model can be written as:

$$
f(t; \beta_1, \beta_2) = \left( \frac{\beta_1 \beta_2}{\beta_2 - \beta_1} \right)^n \prod_{i=1}^n \left(\mathrm{e}^{-\beta_1 t_i} - \mathrm{e}^{-\beta_2 t_i}\right)
$$


And the log-likelihood function is then given by:

$$
\begin{aligned}
\ell(\beta_1, \beta_2; t) &= n \ln \beta_1 + n \ln \beta_2 - n \ln(\beta_2 - \beta_1) + \sum_{i=1}^n \ln \left(\mathrm{e}^{-\beta_1 t_i} - \mathrm{e}^{-\beta_2 t_i}\right) \\[1em]
&= n \left[ \ln \beta_1 + \ln \beta_2 - \ln(\beta_2 - \beta_1)\right] + \sum_{i=1}^n \ln \left(\mathrm{e}^{-\beta_1 t_i} - \mathrm{e}^{-\beta_2 t_i}\right)
\end{aligned}
$$

Now that we have derived our log-likelihood functions for a "double Exponential" model, we can actually begin exploring this as a generative distribution, much like we did with the Gamma distribution.

To explore this model, we first need to compute MLEs for the two parameters, $\beta_1$ and $\beta_2$. Indeed, that is the whole reason that we wrote out the log-likelihood functions initially.

To avoid under/overflow issues, we will handle the boundary case $\beta_1 \approx \beta_2$ by falling back to a Gamma($\alpha$ = 2, $\beta$) distribution, which is the limit of the model as $\beta_2 \to \beta_1$. Moreover, we will use a couple of numerical tricks to handle the log of a difference of exponentials.
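The claim that the model approaches a Gamma distribution with $\alpha = 2$ when the two rates coincide can be checked numerically. The sketch below compares the two PDFs for rates that differ by one part in $10^5$; the helper names, the rate of 0.005 per second, and the time grid are all made up for this illustration.

```python
import numpy as np


def double_exp_pdf(t, beta1, beta2):
    """PDF of the sum of two exponential waiting times (beta1 != beta2)."""
    return (
        beta1 * beta2 / (beta2 - beta1)
        * (np.exp(-beta1 * t) - np.exp(-beta2 * t))
    )


def gamma2_pdf(t, beta):
    """PDF of a Gamma distribution with alpha = 2 and rate beta."""
    return beta**2 * t * np.exp(-beta * t)


t = np.linspace(1.0, 1500.0, 300)  # times in seconds
beta = 0.005                       # illustrative rate (1/s)

nearly_equal = double_exp_pdf(t, beta, beta * (1 + 1e-5))
limit = gamma2_pdf(t, beta)

max_rel_diff = np.max(np.abs(nearly_equal / limit - 1))
print(f"largest relative difference: {max_rel_diff:.1e}")
```

The two curves agree to within a relative difference of the same order as the gap between the rates, which is why switching to the Gamma form near the boundary costs essentially nothing in accuracy.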

When using a numerical optimizer, under/overflow issues can cause the solver to not converge. Therefore, we will define a function that we'll use in our log-likelihood for numerical stability.

To obtain the MLE for our two parameters, we first write a function that applies the "log-sum-exp" trick, which is essential because our log-likelihood must be maximized numerically, but contains the term $\sum_{i=1}^n \ln \left(\mathrm{e}^{-\beta_1 t_i} - \mathrm{e}^{-\beta_2 t_i}\right)$.
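To see why the rearrangement matters, the sketch below evaluates a single term of that sum in two ways. It uses a deliberately exaggerated time so that both exponentials underflow in double precision, and it uses `np.log1p` as a variant of the same idea; the function names and numbers are hypothetical.

```python
import numpy as np


def log_diff_exp_naive(beta1, beta2, t):
    """ln(e^{-beta1 t} - e^{-beta2 t}) computed directly (beta1 < beta2)."""
    return np.log(np.exp(-beta1 * t) - np.exp(-beta2 * t))


def log_diff_exp_stable(beta1, beta2, t):
    """Same quantity with the larger exponential, e^{-beta1 t}, factored out."""
    return -beta1 * t + np.log1p(-np.exp(-(beta2 - beta1) * t))


beta1, beta2 = 0.005, 0.006
t_ok, t_huge = 300.0, 200_000.0  # t_huge is exaggerated to force underflow

# For an ordinary time, both versions agree
naive_ok = log_diff_exp_naive(beta1, beta2, t_ok)
stable_ok = log_diff_exp_stable(beta1, beta2, t_ok)

# For the huge time, both exponentials underflow to zero in the naive version
with np.errstate(divide="ignore"):
    naive_huge = log_diff_exp_naive(beta1, beta2, t_huge)
stable_huge = log_diff_exp_stable(beta1, beta2, t_huge)

print(naive_ok, stable_ok)
print(naive_huge, stable_huge)
```

A single non-finite term makes the whole log-likelihood non-finite, and an optimizer that wanders into such a region of parameter space can fail to converge, so the stable form is the safer choice even if the observed times themselves are moderate.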

**TAs: The next few code cells are repeats from 8.2**

In [32]:
n = tubulin12

def log_sum_exp_trick(betas): 
    """
    Compute log-sum-exp-trick for numerical robustness of 
    log(beta2 - beta 1), i.e. ensure that the difference is always positive. 
    """
    m = np.max(betas)
    x = np.min(betas)
    
    lset = np.log(m) + np.log(1 - (x / m))
    
    return lset

And we also write another function, which simply computes the log sum of exponentials for our log-likelihood function.

In [33]:
def log_sum_of_exp(betas,t): 
    """
    Compute log sum of exponentials for the double exponential model. 
    """
    m = np.max(betas)
    x = np.min(betas)
    
    log_sum_exp = -x*t + np.log(1 - np.exp((x - m)*t))
    
    return log_sum_exp

Now we can go ahead and make a function for our log-likelihood, similar to how we wrote out the Gamma distribution log-likelihood function. In this case, the function computes the number of observations, n, from the array of times that we pass in, and we "collapse" the double exponential model to a Gamma distribution with $\alpha$ = 2 in the event that beta1 is close to beta2.

In [34]:
def log_like_iid_double_poisson_log_params(log_params, t):
    """
    Log likelihood for i.i.d. double Poisson measurements with
    input being logarithm of parameters.
    
    Parameters
    ----------
    log_params: array
        Logarithm of the parameters beta1, and beta 2.
    t: array
        Array of times to catastrophe.
    
    Returns
    -------
    output: float
        log-likelihood.
    """
    log_beta1, log_beta2 = log_params
    
    n = len(t)
    beta1 = np.exp(log_beta1)
    beta2 = np.exp(log_beta2)
    
    # Collapse to gamma distribution if b1 == b2
    if np.isclose(beta1, beta2):
        return np.sum(st.gamma.logpdf(t, a= 2, loc=0, scale=1/beta1))
    
    # Make a list of betas
    betas = [beta1, beta2]
    
    # Constant for the log_likelihood function
    cnt = n*(np.log(beta1) + np.log(beta2) - log_sum_exp_trick(betas))

    # Compute log likelihood
    log_likelihood = cnt + np.sum(log_sum_of_exp(betas,t))    
    
    return log_likelihood

Now, let's modify our MLE solver, which we also used to evaluate the Gamma distribution, by changing the L-BFGS-B solver to the Powell solver, the latter of which has proven more stable for making bootstrap replicates in subsequent analyses.

In [35]:
def mle_iid_double_poisson_log_params(t):
    """Perform MLE for parameters for the double exponential model."""

    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        
        res = scipy.optimize.minimize(
            fun=lambda log_params, t: -log_like_iid_double_poisson_log_params(log_params, t),
            x0=np.array([np.log(1/200), np.log(1/300)]),
            args=(t,),
            method='Powell',
            
            # Log bounds 
            #bounds=((-9, -9), (-2,-2))
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)


Now that we have our function, we compute the MLEs...and feed it 'n', our NumPy dataset from earlier in this analysis pipeline.

In [36]:
mle_double_poisson = mle_iid_double_poisson_log_params(n)

In [37]:
betas = np.exp(mle_double_poisson)

betas

array([0.00530143, 0.00517468])

These parameters seem reasonable for $\beta_1$ and $\beta_2$: each corresponds to a characteristic "time of arrival", 1/$\beta$, of approximately 190 s per step, or a mean time to catastrophe of roughly 380 s for the two steps combined. The two estimates are also nearly equal. However, we should go into more depth to evaluate them. We will again compute bootstrapped confidence intervals and explore our model more deeply to see if these estimates make sense!

To do that, let's start by writing out a function for a theoretical curve for the double exponential model.

In [38]:
def model_cdf(t, beta_1, beta_2):
    """
    Theoretical curve for double exponential model. 
    """
    if np.isclose(beta_1, beta_2):
        return st.gamma.cdf(t, 2, loc=0, scale=1/beta_1)
    
    cdf = (1 - np.exp(-beta_1 * t)) / beta_1 - (1 - np.exp(-beta_2 * t)) / beta_2

    return beta_1 * beta_2 * cdf / (beta_2 - beta_1)

Specify variables to plot the theory curve on top of the true ECDF from the dataset.

In [39]:
# Use a name other than 'ecdf' so the ecdf() function defined earlier is not overwritten
ecdf_vals = np.arange(1, len(n) + 1) / len(n)
sorted_times = np.sort(n)
ecdf_plot = hv.Curve((sorted_times, ecdf_vals), label='ECDF')

theor_curve = model_cdf(sorted_times, betas[0], betas[1])
theor_plot = hv.Curve((sorted_times, theor_curve), label='theoretical CDF')

Now plot our theoretical curve (which uses these parameter values from the Powell MLE) over the data from labeled microtubule catastrophe experiments.

In [40]:
theor_plot*ecdf_plot.opts(
    xlabel = 'time to catastrophe',
    ylabel = 'CDF',
    width = 600
)

We can see that the theoretical CDF matches the ECDF really well. As in our analysis with the Gamma distribution model, we can also compare the experimental dataset to the generative double exponential model by creating QQ plots.
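As an independent check on the closed-form CDF, one can simulate the two-step process directly and compare. The sketch below assumes the MLE values printed above and writes the CDF in an algebraically equivalent form; the function name, seed, number of draws, and time grid are arbitrary choices for the example.

```python
import numpy as np


def double_exp_cdf(t, beta1, beta2):
    """Closed-form CDF of the sum of two exponential waiting times."""
    return 1 - (
        beta2 * np.exp(-beta1 * t) - beta1 * np.exp(-beta2 * t)
    ) / (beta2 - beta1)


rng = np.random.default_rng(seed=42)  # arbitrary seed
beta1, beta2 = 0.00530143, 0.00517468  # MLE values printed above (1/s)

# Simulate the two-step process: total time is the sum of two waiting times
n_draws = 200_000
draws = rng.exponential(1 / beta1, n_draws) + rng.exponential(1 / beta2, n_draws)

# Compare the simulated and analytic CDFs at a few time points (s)
t_grid = np.array([100.0, 300.0, 600.0, 1000.0])
empirical = np.array([(draws <= t).mean() for t in t_grid])
analytic = double_exp_cdf(t_grid, beta1, beta2)
max_abs_diff = np.max(np.abs(empirical - analytic))

print("simulated:", empirical)
print("analytic: ", analytic)
print("mean of draws (s):", draws.mean())
```

The simulated mean should sit close to $1/\beta_1 + 1/\beta_2$, the expected total waiting time for two sequential steps.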

To make QQ plots for our "double exponential" model, we first write a function that draws a number of samples matching the length of our dataset, where each sample is the sum of two exponentially distributed waiting times.

In [41]:
def draw_double_exponential(beta1, beta2, size):
    d = np.empty(size)
    for i in range(size):
        d[i] = rg.exponential(1/beta1) + rg.exponential(1/beta2)
    
    return d

In [42]:
p = bebi103.viz.qqplot(
    data=n,
    gen_fun=draw_double_exponential,
    args=(betas[0], betas[1]),
    x_axis_label='time to catastrophe (s)',
    y_axis_label='time to catastrophe (s)'
)

bokeh.io.show(p)

We see that the QQ plot for the 12μM tubulin dataset is well-captured by the double exponential model. However, as we showed in the previous model (Gamma distribution), it seems that predictive ECDFs offer a "nicer" way of assessing how well a generative distribution actually captures our dataset.

Let's explore predictive ECDFs. To do that, we need to write a series of functions that draw samples for the ECDF, which we will run 5000 replicates of. We can then compute confidence intervals to get a "window" for our predictive ECDF. Finally, we plot our true dataset on top of the generated ECDF.

In [43]:
def draw_double_exponential_ecdfs(beta1, beta2, size):
    return rg.exponential(1/beta1, size) + rg.exponential(1/beta2, size)

Now we use this function to bootstrap 5000 samples...

In [44]:
double_exponential_bs_samples = draw_double_exponential_ecdfs(betas[0], betas[1], size=(5000, len(n)))

In [45]:
#Check to ensure that output bs_samples are 5000 total, and correspond to length of original dataset (692 rows)
double_exponential_bs_samples.shape

(5000, 692)

Now write a function that returns the value of the ECDF of a dataset at arbitrary points, which we will apply to each bootstrap sample.

In [46]:
def ecdf(x, data):
    """Give the value of an ECDF at arbitrary points x."""
    y = np.arange(len(data) + 1) / len(data)
    return y[np.searchsorted(np.sort(data), x, side="right")]

In [47]:
n_theor = np.arange(0, double_exponential_bs_samples.max() + 1)

ecdfs = np.array([ecdf(n_theor, sample) for sample in double_exponential_bs_samples])

Now compute the confidence intervals of the ECDF, and save them as ecdf_low and ecdf_high.

In [48]:
ecdf_low, ecdf_high = np.percentile(ecdfs, [2.5, 97.5], axis=0)

Now we plot the "background" of our generative distribution, which we generated via bootstrapping.

In [49]:
p = bebi103.viz.fill_between(
    x1=n_theor,
    y1=ecdf_high,
    x2=n_theor,
    y2=ecdf_low,
    patch_kwargs={"fill_alpha": 0.5},
    x_axis_label='time to catastrophe (s)',
    y_axis_label='ECDF'
)

bokeh.io.show(p)

Now plot our "real" data on top, and see if this generative distribution fits the data well.

In [50]:
p = bokeh_catplot.ecdf(data=df_tidy_12, val='time to catastrophe (s)', palette=['orange'], p=p)

bokeh.io.show(p)

It looks like it does an OK job of capturing our dataset, but a better comparison is to look at the difference between the predictive ECDF and our dataset. Let's do that now.

In [51]:
samples_double_exponential = np.array([draw_double_exponential(betas[0], betas[1], size=len(n)) for _ in range(5000)])

In [52]:
p = bebi103.viz.predictive_ecdf(
    samples=samples_double_exponential, data=tubulin12, diff=True, discrete=True, x_axis_label="time to catastrophe (s)"
)

bokeh.io.show(p)

It appears, based on the ECDF difference plot, that this model does not capture our dataset as well as the Gamma distribution. With the Gamma distribution, far fewer points from our dataset lay outside of the confidence interval (judging from a visual assessment alone). Therefore, for subsequent analysis, we will choose to model this microtubule catastrophe data as **Gamma distributed**.

## 9.1c: Parameter Estimates for Other Tubulin Concentrations

Now we must compute parameter estimates for the other tubulin concentrations. To do this, we will simply compute the MLE for each individual dataset in the same way that we did above, and then we will store the MLEs in a new DataFrame, called df_mle.

In [53]:
#Create a new "empty" DataFrame with the necessary columns
df_mle = pd.DataFrame(index=['12 μM tubulin', '7 μM tubulin', '9 μM tubulin', '10 μM tubulin', '14 μM tubulin'])

In [54]:
#Compute the MLE for all of the tubulin concentrations -- we previously stored these data as NumPy arrays (see above)
mle_gamma_12 = mle_iid_gamma(tubulin12)
mle_gamma_7 = mle_iid_gamma(tubulin7)
mle_gamma_9 = mle_iid_gamma(tubulin9)
mle_gamma_10 = mle_iid_gamma(tubulin10)
mle_gamma_14 = mle_iid_gamma(tubulin14)

Great! Now we should have our estimates for alpha and beta stored in these variables, 'mle_gamma_##'. Let's now take the np.exp of each of these, since the original function returns the log of variables.

In [55]:
alpha_mle_12, beta_mle_12 = np.exp(mle_gamma_12)
alpha_mle_7, beta_mle_7 = np.exp(mle_gamma_7)
alpha_mle_9, beta_mle_9 = np.exp(mle_gamma_9)
alpha_mle_10, beta_mle_10 = np.exp(mle_gamma_10)
alpha_mle_14, beta_mle_14 = np.exp(mle_gamma_14)

#Let's see the MLE estimates for the 14μM tubulin dataset
print("alpha: ", alpha_mle_14)
print("beta: ", beta_mle_14)

alpha:  3.3615009074314983
beta:  0.007174865482202131


In [56]:
#Update our DataFrame to store these values
df_mle = pd.DataFrame([[alpha_mle_7, alpha_mle_9, alpha_mle_10, alpha_mle_12, alpha_mle_14], [beta_mle_7, beta_mle_9, beta_mle_10, beta_mle_12, beta_mle_14]], columns=['7 μM tubulin', '9 μM tubulin', '10 μM tubulin', '12 μM tubulin', '14 μM tubulin'])

In [57]:
#Rename the indexes of our dataframe to $\alpha$ and $\beta$
df_mle = df_mle.rename({0: 'alpha', 1: 'beta'}, axis='index')
df_mle

Unnamed: 0,7 μM tubulin,9 μM tubulin,10 μM tubulin,12 μM tubulin,14 μM tubulin
alpha,2.443928,2.679863,3.210822,2.915278,3.361501
beta,0.00755,0.008779,0.00903,0.007661,0.007175


Before we discuss these parameter estimates, in accordance with the question for homework 9.1c, let us first **confirm** that the generative Gamma distribution still captures these datasets for each concentration of tubulin. To do that, we will make a QQ plot for each tubulin concentration and visualize them side-by-side.

In [58]:
## Test qq plot with all values of tubulin
p12 = bebi103.viz.qqplot(
    data=tubulin12,
    gen_fun=draw_gamma,
    args=(df_mle['12 μM tubulin']),
    x_axis_label="time to catastrophe (s)",
    y_axis_label="time to catastrophe (s)",
    title='12 μM tubulin'
)

p14 = bebi103.viz.qqplot(
    data=tubulin14,
    gen_fun=draw_gamma,
    args=(df_mle['14 μM tubulin']),
    x_axis_label="time to catastrophe (s)",
    y_axis_label="time to catastrophe (s)",
    title='14 μM tubulin'
)

p7 = bebi103.viz.qqplot(
    data=tubulin7,
    gen_fun=draw_gamma,
    args=(df_mle['7 μM tubulin']),
    x_axis_label="time to catastrophe (s)",
    y_axis_label="time to catastrophe (s)",
    title = '7 μM tubulin'
)

p10 = bebi103.viz.qqplot(
    data=tubulin10,
    gen_fun=draw_gamma,
    args=(df_mle['10 μM tubulin']),
    x_axis_label="time to catastrophe (s)",
    y_axis_label="time to catastrophe (s)",
    title='10 μM tubulin'
)

p9 = bebi103.viz.qqplot(
    data=tubulin9,
    gen_fun=draw_gamma,
    args=(df_mle['9 μM tubulin']),
    x_axis_label="time to catastrophe (s)",
    y_axis_label="time to catastrophe (s)",
    title = '9 μM tubulin'
)

bokeh.io.show(row(p7, p9, p10, p12, p14))

Based on this quick analysis, it does seem that the Gamma distribution captures the experimental data for each concentration of tubulin. Now, we can look at the parameter estimates again, and discuss our opinion on the estimates as the concentration of tubulin increases.

The question asks us, specifically, _'Given that microtubules polymerize faster with higher tubulin concentrations, is there anything you can say about the occurrence of catastrophe by looking at the values of the parameters versus tubulin concentration?'_

Let's first refresh our memory -- what do the parameter estimates look like for the different tubulin concentrations?

In [59]:
df_mle

Unnamed: 0,7 μM tubulin,9 μM tubulin,10 μM tubulin,12 μM tubulin,14 μM tubulin
alpha,2.443928,2.679863,3.210822,2.915278,3.361501
beta,0.00755,0.008779,0.00903,0.007661,0.007175


What we see here is a bit strange. What stands out immediately is that at no concentration of tubulin does the number of steps, $\alpha$, approach 1. Instead, $\alpha$ roughly increases with tubulin concentration, from about 2.4 at 7 μM to about 3.4 at 14 μM (although the 12 μM estimate, 2.9, falls below the 10 μM estimate, 3.2). This is good, as it maintains that microtubule catastrophe is a multistep process over a range of tubulin concentrations.

Secondly, we see that $\beta$, our parameter for the rate of arrival, in units of 1/s, does not really change much over this range of tubulin concentrations: it stays between about 0.0072 and 0.0090 per second, with no clear trend. Overall, the effect of tubulin concentration on this rate parameter is very slight. This indicates that tubulin concentration has only a marginal effect on the rate of the individual steps leading to catastrophe.
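Because the mean of a Gamma distribution is $\alpha/\beta$, the two rows of the table can be combined into an implied mean time to catastrophe for each concentration. The sketch below does this arithmetic with the rounded values from the table; the dictionary layout and variable names are chosen for this example only.

```python
# Rounded MLEs copied from the df_mle table above.
# Keys are tubulin concentrations in μM; values are (alpha, beta in 1/s).
mle = {
    7: (2.443928, 0.00755),
    9: (2.679863, 0.008779),
    10: (3.210822, 0.00903),
    12: (2.915278, 0.007661),
    14: (3.361501, 0.007175),
}

# Mean of a Gamma distribution with shape alpha and rate beta
mean_time = {conc: alpha / beta for conc, (alpha, beta) in mle.items()}

for conc, mean in mean_time.items():
    print(f"{conc:>2} μM tubulin: alpha/beta ≈ {mean:.0f} s")
```

Reading the parameters together in this way complements looking at each row on its own, since neither $\alpha$ nor $\beta$ alone sets the time scale of catastrophe.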

We can therefore conclude that catastrophe frequency is dependent on microtubule age, regardless of tubulin concentration. Indeed, this is also the conclusion indicated in the paper by Gardner MK, et al. It is, therefore, irrelevant that microtubules polymerize faster with higher tubulin concentrations, as their time to catastrophe is dependent on age, and not rate of polymerization.

_________________________________________________________________________________________

**Attributions:** All team members contributed to this problem.