# MLE of microtubule catastrophe data
In this notebook, we continue to explore the data from the [Gardner, Zanic, et al. paper](https://www.sciencedirect.com/science/article/pii/S0092867411012876?via%3Dihub). Because we showed in the control_analysis notebook that there is little difference between labeled and unlabeled tubulin, we use only the data for labeled tubulin here. The purpose of this notebook is to obtain maximum likelihood estimates (MLEs) for the parameters of each distribution outlined in the distributions notebook.

## Imports and Load Dataset

In [14]:
import warnings

import pandas as pd
import numpy as np
import scipy.stats
import scipy.optimize

import bokeh.io
import bokeh.plotting
import bokeh.palettes
import bokeh_catplot

from tqdm.notebook import tqdm, trange

import bebi103
import microtubule_catastrophe as mc

bokeh.io.output_notebook()

In [2]:
# Load the dataset
df = pd.read_csv('../../data/gardner_time_to_catastrophe_dic_tidy.csv', usecols=[1,2])
df.head()

| | time to catastrophe (s) | labeled |
|---|---|---|
| 0 | 470.0 | True |
| 1 | 1415.0 | True |
| 2 | 130.0 | True |
| 3 | 280.0 | True |
| 4 | 550.0 | True |


In [3]:
# Extract values of interest as numpy arrays
labeled = df.loc[df['labeled'] == True, 'time to catastrophe (s)'].values

As a reminder, this is what the data distribution looks like.

In [4]:
# View the data distribution
p = bokeh_catplot.ecdf(
    data=df.loc[df['labeled'] == True],
    cats=None,
    val='time to catastrophe (s)',
    style='staircase',
    title='labeled tubulin'
)
bokeh.io.show(p)

## Gamma Distribution MLE

### MLE parameters

In [6]:
mle_alpha, mle_beta = mc.dist.gamma.mle(labeled)

print('MLE alpha:', mle_alpha)
print('MLE beta: ', mle_beta)

MLE alpha: 2.258888610361833
MLE beta:  0.005125404717753318
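
As a cross-check on the values above, the Gamma MLE can also be computed with SciPy alone. The sketch below is not the implementation behind `mc.dist.gamma.mle`; the helper name `gamma_mle_scipy` is hypothetical, and the data are synthetic draws with known parameters rather than the experimental measurements.

```python
import numpy as np
import scipy.stats


def gamma_mle_scipy(t):
    """Return (alpha, beta) MLE for Gamma-distributed data, with beta a rate."""
    # floc=0 pins the location parameter at zero so only shape and scale are fit
    alpha, _, scale = scipy.stats.gamma.fit(t, floc=0)
    return alpha, 1 / scale


# Synthetic data with known parameters (not the experimental data)
rg = np.random.default_rng(3252)
t_synth = rg.gamma(2.0, 1 / 0.005, size=5000)

alpha_hat, beta_hat = gamma_mle_scipy(t_synth)
print('alpha:', alpha_hat)
print('beta: ', beta_hat)
```

Passing the `labeled` array to the same helper should give estimates close to those printed above, provided both routines parametrize the Gamma distribution by shape and rate.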


### Comparison of MLE values with experimental data

In [7]:
# Draw samples from the Gamma model using the MLE parameter values
rg = np.random.default_rng()
gamma_theor = rg.gamma(mle_alpha, 1/mle_beta, size=len(labeled))
gamma_ecdf = pd.DataFrame(data=gamma_theor, columns=['time to catastrophe (s)'])
gamma_ecdf['data type'] = 'MLE theoretical'

# Add experimental data to DataFrame
experimental = pd.DataFrame(data=labeled, columns=['time to catastrophe (s)'])
experimental['data type'] = 'experimental'
gamma_ecdf = pd.concat((experimental, gamma_ecdf))


# Plot the ECDF of experimental data vs Gamma model with MLE values
p = bokeh_catplot.ecdf(
    data=gamma_ecdf.loc[gamma_ecdf['data type'] == 'experimental'],
    cats='data type',
    val='time to catastrophe (s)',
    style='staircase',
    conf_int=True,
)
p = bokeh_catplot.ecdf(
    data=gamma_ecdf,
    cats='data type',
    val='time to catastrophe (s)',
    p=p,
    style='staircase',
    conf_int=False,
)

p.legend.location='bottom_right'

bokeh.io.show(p)

### Compute parametric confidence interval by drawing out of Gamma distribution

Because the MLE is a random variable, we use a bootstrapping method to compute the confidence interval. We choose the parametric method (i.e. using the model distribution, Gamma, to approximate the generative distribution) because the authors of the paper already showed that the Gamma distribution was a good approximation for modeling the experimental data they obtained.

In [7]:
bs_reps_parametric = bebi103.draw_bs_reps_mle(
    mc.dist.gamma.mle,
    mc.dist.gamma.draw,
    labeled,
    mle_args=(),
    gen_args=(),
    size=5000,
    n_jobs=2,
    progress_bar=True,
)

100%|██████████| 2500/2500 [00:29<00:00, 85.12it/s] 
100%|██████████| 2500/2500 [00:29<00:00, 83.79it/s] 


In [8]:
conf_ints = np.percentile(bs_reps_parametric, [2.5, 97.5], axis=0).transpose()
print('95% confidence interval for alpha:', conf_ints[0])
print('95% confidence interval for beta: ', conf_ints[1])

95% confidence interval for alpha: [1.89247401 2.79250768]
95% confidence interval for beta:  [0.0042021  0.00638696]
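
The parametric bootstrap used above can be sketched without `bebi103`: fit the model, simulate new data sets of the same size from the fitted model, refit each one, and take percentiles of the refitted parameters. This is a minimal illustration under stated assumptions, not the `bebi103.draw_bs_reps_mle` implementation. The helper names are hypothetical and the data are synthetic.

```python
import numpy as np
import scipy.stats


def gamma_mle(t):
    """MLE (alpha, beta) for Gamma-distributed data; beta is a rate."""
    alpha, _, scale = scipy.stats.gamma.fit(t, floc=0)
    return np.array([alpha, 1 / scale])


def parametric_bs_reps(t, n_reps, rg):
    """Draw parametric bootstrap replicates of the Gamma MLE."""
    alpha, beta = gamma_mle(t)
    reps = np.empty((n_reps, 2))
    for i in range(n_reps):
        # Simulate a new data set from the fitted model, then refit it
        t_sim = rg.gamma(alpha, 1 / beta, size=len(t))
        reps[i] = gamma_mle(t_sim)
    return reps


rg = np.random.default_rng(3252)
t_synth = rg.gamma(2.0, 1 / 0.005, size=200)  # synthetic stand-in data

reps = parametric_bs_reps(t_synth, n_reps=200, rg=rg)
ci = np.percentile(reps, [2.5, 97.5], axis=0).transpose()
print('95% confidence interval for alpha:', ci[0])
print('95% confidence interval for beta: ', ci[1])
```

The number of replicates is kept small here so the sketch runs quickly; the analysis above uses 5000.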


### Contour plots of MLE parameters

In [9]:
# Package replicates in data frame for plotting
df_res = pd.DataFrame(data=bs_reps_parametric, columns=["α*", "β*"])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    p = bebi103.viz.corner(
        samples=df_res,
        pars=["α*", "β*"],
        show_contours=True,
        levels = [0.95],
    )

bokeh.io.show(p)

## Alternative Model MLE

Because the alternative distribution models a 2-step process, for our initial guess we assume that each step occurs at the same rate. Thus, our estimates for beta1 and beta2 are the same, and we obtain them by multiplying the average rate (1 / mean catastrophe time) by 2.
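
The initial guess described above follows from the mean of a two-step process. Assuming each step is an exponentially distributed waiting time, the mean time to catastrophe is 1/β1 + 1/β2, which reduces to 2/β when both rates equal β, so β = 2 / (mean catastrophe time). A small sketch on synthetic data (the function name `initial_guess_equal_rates` is hypothetical and may differ from what `mc.dist.alt.mle` does internally):

```python
import numpy as np


def initial_guess_equal_rates(t):
    """Return a (beta1, beta2) guess assuming both steps share one rate."""
    # Mean of a two-step process is 1/beta1 + 1/beta2 = 2/beta when rates match
    beta = 2 / np.mean(t)
    return beta, beta


# Synthetic example: two exponential steps, each with rate 0.005 per second
rg = np.random.default_rng(3252)
t_synth = rg.exponential(1 / 0.005, size=10000) + rg.exponential(1 / 0.005, size=10000)

beta1_guess, beta2_guess = initial_guess_equal_rates(t_synth)
print('initial guess:', beta1_guess, beta2_guess)
```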

### MLE of alternative distribution parameters

In [20]:
mle_beta1, mle_beta2 = mc.dist.alt.mle(labeled)

print('MLE beta1:', mle_beta1)
print('MLE beta2:', mle_beta2)

MLE beta1: 0.0045381223507861505
MLE beta2: 0.0045380806510549815


### Comparison of MLE values with experimental data

In [11]:
# Draw samples from the alternative model using the MLE parameter values
alt_theor = mc.dist.alt.draw((mle_beta1, mle_beta2), len(labeled))
alt_ecdf = pd.DataFrame(data=alt_theor, columns=['time to catastrophe (s)'])
alt_ecdf['data type'] = 'MLE theoretical'

# Add experimental data to DataFrame
alt_ecdf = pd.concat((experimental, alt_ecdf))


# Plot the ECDF of experimental data vs alternative model with MLE values
p = bokeh_catplot.ecdf(
    data=alt_ecdf.loc[alt_ecdf['data type'] == 'experimental'],
    cats='data type',
    val='time to catastrophe (s)',
    style='staircase',
    conf_int=True,
)
p = bokeh_catplot.ecdf(
    data=alt_ecdf,
    cats='data type',
    val='time to catastrophe (s)',
    p=p,
    style='staircase',
    conf_int=False,
)

p.legend.location='bottom_right'

bokeh.io.show(p)

### Compute nonparametric confidence interval by resampling the experimental data

Because the MLE is a random variable, we use a bootstrapping method to compute the confidence interval. We choose the nonparametric method (i.e. sampling out of the empirical distribution to approximate the generative distribution) because we are not yet sure that the alternative model is a good approximation for the generative distribution.

In [17]:
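def draw_bs_sample(params, data, size, rg=None):
    """Draw a bootstrap sample from a 1D data set.

    `params` is unused; it is accepted to match the generator signature.
    """
    if rg is None:
        rg = np.random.default_rng()

    # Resample with replacement using the supplied generator
    return rg.choice(data, size=size)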

In [18]:
# Draw nonparametric bootstrap replicates of the MLE using two parallel jobs
bs_reps_nonparametric = bebi103.draw_bs_reps_mle(
    mc.dist.alt.mle,
    draw_bs_sample,
    labeled,
    mle_args=(),
    gen_args=(labeled,),
    size=5000,
    n_jobs=2,
    progress_bar=True,
)

100%|██████████| 2500/2500 [00:29<00:00, 84.10it/s] 
100%|██████████| 2500/2500 [00:29<00:00, 83.71it/s]


In [19]:
conf_ints = np.percentile(bs_reps_nonparametric, [2.5, 97.5], axis=0).transpose()
print('95% confidence interval for beta1:', conf_ints[0])
print('95% confidence interval for beta2:', conf_ints[1])

95% confidence interval for beta1: [0.00413949 0.00498053]
95% confidence interval for beta2: [0.00413949 0.00498052]


One thing to note is that because beta1 and beta2 enter the PDF of the alternative model symmetrically, swapping their values leaves the likelihood unchanged. Thus, in our bootstrap replicates, some MLEs may be flipped, which could mix the distributions of the two estimates.
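
The symmetry can be checked numerically. The sketch below assumes the alternative model's PDF has the form β1β2/(β2 − β1) · (exp(−β1 t) − exp(−β2 t)) for β1 ≠ β2, which is the distribution of the sum of two exponential waiting times; the distributions notebook is the authority on the exact form used. It also shows one possible way to remove label flipping, sorting each replicate so that β1 ≤ β2. The replicate values are made up for illustration.

```python
import numpy as np


def log_like_two_step(params, t):
    """Log-likelihood for two successive exponential steps (assumed PDF form)."""
    beta1, beta2 = params
    # Assumed PDF, valid for beta1 != beta2
    pdf = beta1 * beta2 / (beta2 - beta1) * (np.exp(-beta1 * t) - np.exp(-beta2 * t))
    return np.sum(np.log(pdf))


# The five catastrophe times displayed by df.head() above
t = np.array([470.0, 1415.0, 130.0, 280.0, 550.0])

ll = log_like_two_step((0.004, 0.006), t)
ll_swapped = log_like_two_step((0.006, 0.004), t)
print('same log-likelihood after swap:', np.isclose(ll, ll_swapped))

# Hypothetical replicates with flipped labels; sorting each row enforces beta1 <= beta2
reps = np.array([[0.0041, 0.0050], [0.0049, 0.0042], [0.0045, 0.0046]])
reps_sorted = np.sort(reps, axis=1)
print(reps_sorted)
```

Sorting is a convention, not a correction of the estimates: it fixes which rate is called beta1 so that the two columns of replicates are comparable.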

### Contour plots of MLE parameters

In [21]:
# Package replicates in data frame for plotting
df_res = pd.DataFrame(data=bs_reps_nonparametric, columns=["β1*", "β2*"])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    p = bebi103.viz.corner(
        samples=df_res,
        pars=["β1*", "β2*"],
        show_contours=True,
        levels = [0.95],
    )

bokeh.io.show(p)

## Computing Environment

In [None]:
%load_ext watermark
%watermark -v -p jupyterlab,pandas,numpy,scipy,bokeh,bokeh_catplot,bebi103 -m