In [1]:
import numpy as np
import pandas as pd

import arviz as az

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

Features requiring DataShader will not work and you will get exceptions.


## Problem 8.2: Microtubule catastrophe, 40 pts

_Note: This problem is best done after the lecture November 22._

In this problem, we use data from [Gardner, Zanic, et al., Depolymerizing kinesins Kip3 and MCAK shape cellular microtubule architecture by differential control of catastrophe, *Cell*, **147**, 1092-1103, 2011](https://doi.org/10.1016/j.cell.2011.10.037). The authors investigated the dynamics of microtubule catastrophe, the switching of a microtubule from a growing to a shrinking state.  In particular, they were interested in the time between the start of growth of a microtubule and the catastrophe event. They monitored microtubules in a single-molecule [TIRF assay](https://en.wikipedia.org/wiki/Total_internal_reflection_fluorescence_microscope) by using tubulin (the monomer that comprises a microtubule) that was labeled with a fluorescent marker. As a control to make sure that fluorescent labels and exposure to laser light did not affect the microtubule dynamics, they performed a similar experiment using differential interference contrast (DIC) microscopy. They measured the time until catastrophe with labeled and unlabeled tubulin. We will carefully analyze the data and make some conclusions about the processes underlying microtubule catastrophe.

In the file `gardner_mt_catastrophe_only_tubulin.csv` (which you can download [here](../data/gardner_mt_catastrophe_only_tubulin.csv)), we have observed catastrophe times of microtubules with different concentrations of tubulin. To start with, we will consider the experiment run with a tubulin concentration of 12 µM. So, our data set consists of a set of measurements of the amount of time to catastrophe. We will consider three models for microtubule catastrophe.

- Model 1: The time to catastrophe is Exponentially distributed.
- Model 2: The time to catastrophe is Gamma distributed.
- Model 3: The time to catastrophe is Weibull distributed.

Note that these descriptions are for the likelihood; we have not specified priors.


**a)**  Describe the three models in words. Give physical descriptions of the meanings of their parameters. Describe how these models are related to each other. Tutorial 3c will be useful. 

<br />



- Model 1: The time to catastrophe is Exponentially distributed.

This suggests that catastrophe is the first arrival of a Poisson process: a rare event that occurs at a constant rate, with no memory of how long the microtubule has been growing. The single parameter, $\beta$, is the characteristic rate of catastrophe, that is, how often catastrophe happens per unit time. The distribution can also be parametrized by $\tau=1/\beta$, the characteristic catastrophe time, which has the same units as the times in our data. The Exponential distribution is a special case of the Gamma distribution with $\alpha = 1$, and a special case of the Weibull distribution with $\alpha = 1$ and $\sigma=1/\beta$.



- Model 2: The time to catastrophe is Gamma distributed.

This suggests that catastrophe requires a specific number of arrivals of a Poisson process; that is, a discrete number of steps, each occurring at the same rate, must be completed before catastrophe occurs. There are two parameters for this distribution, $\alpha$ and $\beta$, where $\alpha$ is the number of arrivals (or "steps") required to trigger catastrophe, and $\beta$ is the rate of the arrivals. Thus, the characteristic (mean) catastrophe time is given by $\alpha/\beta$.


- Model 3: The time to catastrophe is Weibull distributed.

This suggests that the rate of catastrophe depends on how long the microtubule has been growing. There are two parameters for this distribution: $\alpha$, which defines how the catastrophe rate changes over time, and $\sigma$, which is the characteristic catastrophe time. For $\alpha > 1$, the longer a microtubule has been growing, the more likely catastrophe becomes; for $\alpha < 1$, catastrophe becomes less likely as time passes; and for $\alpha = 1$ the rate is constant and we recover the Exponential distribution.
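
The relationships among the three models can be checked numerically. The following is a minimal sketch using only the Python standard library; the density functions are written out by hand, and the parameter values (`beta = 0.0026`, `alpha = 3`, `step_beta = 0.01`, `sigma = 400`) are illustrative assumptions, not estimates from the data. It checks that the Gamma and Weibull densities reduce to the Exponential density when $\alpha = 1$, simulates the multi-step story behind the Gamma distribution, and prints the Weibull catastrophe rate (hazard) at a few times.

```python
import math
import random
import statistics


def exponential_pdf(t, beta):
    """Exponential density with rate beta."""
    return beta * math.exp(-beta * t)


def gamma_pdf(t, alpha, beta):
    """Gamma density with shape alpha (number of steps) and rate beta."""
    return beta**alpha * t ** (alpha - 1) * math.exp(-beta * t) / math.gamma(alpha)


def weibull_pdf(t, alpha, sigma):
    """Weibull density with shape alpha and characteristic time sigma."""
    return (alpha / sigma) * (t / sigma) ** (alpha - 1) * math.exp(-((t / sigma) ** alpha))


def weibull_hazard(t, alpha, sigma):
    """Instantaneous catastrophe rate, given no catastrophe up to time t."""
    return (alpha / sigma) * (t / sigma) ** (alpha - 1)


# Special cases: with alpha = 1, both densities match the Exponential.
beta = 0.0026  # illustrative rate in 1/s (an assumption, not a fitted value)
for t in (50.0, 400.0, 1000.0):
    assert math.isclose(gamma_pdf(t, 1.0, beta), exponential_pdf(t, beta))
    assert math.isclose(weibull_pdf(t, 1.0, 1.0 / beta), exponential_pdf(t, beta))

# Gamma story: total time for alpha sequential steps, each with rate step_beta.
rng = random.Random(8675309)
alpha, step_beta = 3, 0.01  # illustrative values
times = [
    sum(rng.expovariate(step_beta) for _ in range(alpha)) for _ in range(20000)
]
print(f"simulated mean time: {statistics.mean(times):.1f} s")
print(f"alpha / beta:        {alpha / step_beta:.1f} s")

# Weibull story: the hazard rises with time for alpha > 1, falls for alpha < 1.
for shape in (0.5, 1.0, 2.0):
    rates = [weibull_hazard(t, shape, 400.0) for t in (100.0, 400.0, 800.0)]
    print(shape, [f"{r:.5f}" for r in rates])
```

The simulated mean should land close to $\alpha/\beta$, and the printed hazards make the role of the Weibull shape parameter concrete: only $\alpha = 1$ gives a time-independent catastrophe rate.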

**b)** Perform parameter estimates for the respective models and make model comparisons. Comment on what this means with respect to our understanding of how microtubule catastrophe works.

<br />




**c)** Using whichever model you favor based on your work in part (b), obtain parameter estimates for the other tubulin concentrations. Given that microtubules polymerize faster with higher tubulin concentrations, is there anything you can say about the occurrence of catastrophe by looking at the values of the parameters versus tubulin concentration?


In [2]:
df = pd.read_csv('/home/ec2-user/data/gardner_mt_catastrophe_only_tubulin.csv', comment = "#")

In [31]:
df.tail(40)

Unnamed: 0,12 uM,7 uM,9 uM,10 uM,14 uM
652,780.0,,,,
653,780.0,,,,
654,785.0,,,,
655,785.0,,,,
656,805.0,,,,
657,805.0,,,,
658,810.0,,,,
659,820.0,,,,
660,835.0,,,,
661,855.0,,,,


In [7]:
sm = bebi103.stan.StanModel(file='./8.2_prior_pred_12.stan')

Using cached StanModel.


In [8]:
# Store input parameters in a dictionary so Stan can access them
data = dict(N=692)

# Generate samples
samples_gen = sm.sampling(data=data,
                          algorithm='Fixed_param',
                          warmup=0,
                          chains=1,
                          iter=1000)

# Store samples in a dataframe
df_gen = bebi103.stan.to_dataframe(samples_gen, diagnostics=False)
df_gen.head()

Unnamed: 0,chain,chain_idx,warmup,uM_12[1],uM_12[2],uM_12[3],uM_12[4],uM_12[5],uM_12[6],uM_12[7],...,uM_12[685],uM_12[686],uM_12[687],uM_12[688],uM_12[689],uM_12[690],uM_12[691],uM_12[692],beta_,lp__
0,1,1,0,5.667073,133.597385,66.728099,116.784315,0.336574,110.88177,56.224936,...,28.427295,49.9906,221.040653,26.761524,77.111801,190.213399,265.369544,16.723153,0.012607,0.0
1,1,2,0,2.886173,2.524564,7.426056,8.237017,21.136369,10.851168,56.049784,...,14.423289,11.226733,9.503071,7.478171,19.173707,11.407967,10.48361,17.522801,0.069496,0.0
2,1,3,0,10.9473,6.878905,7.676416,8.767637,7.759627,7.394502,5.926164,...,8.258528,17.441934,6.847938,10.855697,9.433541,18.380595,7.949728,1.946534,0.129731,0.0
3,1,4,0,42.309116,63.91904,6.170442,8.725135,12.426638,19.786762,5.656074,...,14.882937,25.356318,17.962547,57.818515,12.251426,4.529969,3.307255,7.186817,0.048016,0.0
4,1,5,0,36.24358,24.240394,34.483137,16.137572,8.475436,2.457494,29.169069,...,34.968246,52.423418,4.464528,21.583087,8.603382,85.638574,5.134447,20.45393,0.036667,0.0


In [9]:
p = bebi103.viz.predictive_ecdf(samples_gen, "uM_12",
                                x_axis_label = "intercatastrophe time (s)")
p.x_range = bokeh.models.Range1d(-10, 200)
bokeh.io.show(p)
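
The Stan program in `8.2_prior_pred_12.stan` is not reproduced here, so the following is only a rough pure-Python sketch of what a prior predictive draw for Model 1 does: draw a rate from the prior, then draw `N` catastrophe times from the Exponential likelihood with that rate. The log-normal prior on `beta` and its parameters (`log10_beta_mean`, `log10_beta_sd`) are hypothetical placeholders and should not be read as the prior actually used in the Stan file.

```python
import random


def prior_predictive_draw(n, rng, log10_beta_mean=-1.5, log10_beta_sd=0.5):
    """Return (beta, times) for one prior predictive data set.

    The log-normal prior on beta is a hypothetical stand-in; the real
    prior is defined in the Stan file, which is not shown.
    """
    # Draw the rate parameter from the (assumed) prior.
    beta = 10 ** rng.gauss(log10_beta_mean, log10_beta_sd)
    # Draw n times to catastrophe from the Exponential likelihood.
    times = [rng.expovariate(beta) for _ in range(n)]
    return beta, times


rng = random.Random(3252)
n_measurements = 692  # same N as passed to the Stan model above
draws = [prior_predictive_draw(n_measurements, rng) for _ in range(100)]

# Summarize each simulated data set by its median time to catastrophe.
medians = [sorted(times)[n_measurements // 2] for _, times in draws]
print(f"range of simulated medians: {min(medians):.1f} s to {max(medians):.1f} s")
```

If the simulated data sets cover the range of times one would consider physically plausible before seeing the data, without putting much weight on absurd values, the prior is doing its job.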

In [33]:
sm = bebi103.stan.StanModel(file='./8.2_mcmc_12.stan')

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_9047521a464fa50c2ead3f4c9b2616a2 NOW.
  tree = Parsing.p_module(s, pxd, full_module_name)


In [34]:
# Drop any NaN padding so only actual 12 µM measurements are passed to Stan
t_12 = df['12 uM'].dropna().values.astype(float)
data = dict(N=len(t_12),
            uM_12=t_12)



In [35]:
samples = sm.sampling(data=data)

In [36]:
df_mcmc = bebi103.stan.to_dataframe(samples, diagnostics=False, inc_warmup=False)

# Take a look
df_mcmc.head()

Unnamed: 0,chain,chain_idx,warmup,tao,beta_,lp__
0,1,1,0,378.50445,0.002642,-4808.784343
1,1,2,0,383.08456,0.00261,-4808.643205
2,1,3,0,379.188433,0.002637,-4808.756746
3,1,4,0,391.042776,0.002557,-4808.632478
4,1,5,0,386.43953,0.002588,-4808.603363
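
In the samples above, `tao` and `beta_` are reciprocals of each other (for example, 1/378.50445 ≈ 0.002642). A quick sanity check on such output uses the fact that, for an Exponential likelihood, the maximum likelihood estimate of the characteristic time is the sample mean of the measured times. With weakly informative priors, the posterior should concentrate near that value. The sketch below illustrates the estimate on synthetic data; `true_beta` is an assumed value used only to generate that synthetic data, and nothing here is read from the experimental file.

```python
import random
import statistics

rng = random.Random(42)
true_beta = 0.0026  # assumed rate (1/s), used only to make synthetic data
synthetic_times = [rng.expovariate(true_beta) for _ in range(692)]

# Maximum likelihood estimates for the Exponential model.
tau_mle = statistics.mean(synthetic_times)  # characteristic time (s)
beta_mle = 1.0 / tau_mle                    # characteristic rate (1/s)

print(f"tau_mle  = {tau_mle:.1f} s")
print(f"beta_mle = {beta_mle:.6f} 1/s")
```

Applying the same two lines to the measured 12 µM times gives a reference value against which the posterior samples of `tao` and `beta_` can be compared.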


In [37]:
plot = bebi103.viz.ecdf(df_mcmc["beta_"], x_axis_label="beta_", plot_height=200, plot_width=250) 
bokeh.io.show(plot)

In [38]:
plot = bebi103.viz.histogram(df_mcmc["beta_"], x_axis_label="beta_",
                             plot_height=200, plot_width=250,
                             bins=30, density=True)
bokeh.io.show(plot)