In [1]:
import numpy as np
import pandas as pd

import arviz as az

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

## Problem 8.2: Microtubule catastrophe, 40 pts

_Note: This problem is best done after the lecture November 22._

In this problem, we use data from [Gardner, Zanic, et al., Depolymerizing kinesins Kip3 and MCAK shape cellular microtubule architecture by differential control of catastrophe, *Cell*, **147**, 1092-1103, 2011](https://doi.org/10.1016/j.cell.2011.10.037). The authors investigated the dynamics of microtubule catastrophe, the switching of a microtubule from a growing to a shrinking state.  In particular, they were interested in the time between the start of growth of a microtubule and the catastrophe event. They monitored microtubules in a single-molecule [TIRF assay](https://en.wikipedia.org/wiki/Total_internal_reflection_fluorescence_microscope) by using tubulin (the monomer that comprises a microtubule) that was labeled with a fluorescent marker. As a control to make sure that fluorescent labels and exposure to laser light did not affect the microtubule dynamics, they performed a similar experiment using differential interference contrast (DIC) microscopy. They measured the time until catastrophe with labeled and unlabeled tubulin. We will carefully analyze the data and make some conclusions about the processes underlying microtubule catastrophe.

In the file `gardner_mt_catastrophe_only_tubulin.csv` (which you can download [here](../data/gardner_mt_catastrophe_only_tubulin.csv)), we have observed catastrophe times of microtubules with different concentrations of tubulin. To start with, we will consider the experiment run with a tubulin concentration of 12 µM. So, our data set consists of a set of measurements of the amount of time to catastrophe. We will consider three models for microtubule catastrophe.

- Model 1: The time to catastrophe is Exponentially distributed.
- Model 2: The time to catastrophe is Gamma distributed.
- Model 3: The time to catastrophe is Weibull distributed.

Note that these descriptions are for the likelihood; we have not specified priors.


**a)  Describe the three models in words. Give physical descriptions of the meanings of their parameters. Describe how these models are related to each other. Tutorial 3c will be useful.** 

<br />



- Model 1: The time to catastrophe is Exponentially distributed.

This suggests that the occurrence of catastrophe is a Poisson process: catastrophe is triggered by the first arrival of a single rate-limiting event, and the probability of that event per unit time does not depend on how long the microtubule has been growing. The parameter $\beta$ is the characteristic rate of catastrophe, that is, how often catastrophe happens in a certain amount of time. It can also be parametrized as $\tau=1/\beta$, the characteristic catastrophe time, which matches the units of the times in our data. The Exponential distribution is a special case of the Gamma distribution with $\alpha = 1$, and a special case of the Weibull distribution with $\alpha = 1$ and $\sigma=1/\beta$.
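The special-case relationships stated above can be checked numerically. The following is a minimal sketch using `scipy.stats`; the rate `beta` and the time grid are made-up values chosen only for illustration, not estimates from the data.

```python
import numpy as np
from scipy import stats

# Hypothetical rate (1/s), chosen only for illustration
beta = 1 / 400.0
t = np.linspace(1.0, 3000.0, 200)

# Exponential with rate beta (scipy uses scale = 1 / rate)
pdf_expon = stats.expon.pdf(t, scale=1 / beta)

# Gamma with alpha = 1 and rate beta
pdf_gamma = stats.gamma.pdf(t, a=1, scale=1 / beta)

# Weibull with alpha = 1 and sigma = 1 / beta
pdf_weibull = stats.weibull_min.pdf(t, c=1, scale=1 / beta)

same_as_gamma = np.allclose(pdf_expon, pdf_gamma)
same_as_weibull = np.allclose(pdf_expon, pdf_weibull)
print(same_as_gamma, same_as_weibull)
```

Both comparisons should agree to numerical precision, since the three densities coincide for these parameter choices.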



- Model 2: The time to catastrophe is Gamma distributed.

This suggests that catastrophe requires a specific number of arrivals of a Poisson process; that is, a discrete number of steps, each occurring at the same rate, must take place before catastrophe occurs. There are two parameters for this distribution, $\alpha$ and $\beta$, where $\alpha$ is the number of arrivals (or "steps") required to trigger catastrophe, and $\beta$ is the rate of the arrivals. Thus, the characteristic catastrophe time (the mean of the distribution) is given by $\alpha/\beta$.
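The multi-step story can be illustrated by simulation: summing $\alpha$ independent Exponential waiting times with rate $\beta$ should reproduce draws from a Gamma distribution with mean $\alpha/\beta$. This is a sketch with made-up values of `alpha` and `beta`, not values inferred from the data.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Hypothetical values, for illustration only
alpha = 3            # number of steps required for catastrophe
beta = 1 / 150.0     # rate of each step (1/s)
n = 100_000

# Time to catastrophe as the sum of alpha Exponential waiting times
t_sum = rng.exponential(scale=1 / beta, size=(n, alpha)).sum(axis=1)

# Direct draws from the corresponding Gamma distribution
t_gamma = rng.gamma(shape=alpha, scale=1 / beta, size=n)

print("mean of summed steps:", t_sum.mean())
print("mean of Gamma draws: ", t_gamma.mean())
print("alpha / beta:        ", alpha / beta)
```

The two sample means should be close to each other and to $\alpha/\beta$, up to sampling noise.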


- Model 3: The time to catastrophe is Weibull distributed.

This suggests that the probability of catastrophe per unit time depends on the amount of time that has elapsed since the last catastrophe. For $\alpha > 1$, the longer it has been since the last catastrophe, the more likely it is that catastrophe will occur; for $\alpha < 1$, catastrophe becomes less likely over time; and $\alpha = 1$ recovers the Exponential distribution, where the probability per unit time is constant. There are two parameters for this distribution: $\alpha$, which defines how the probability changes over time, and $\sigma$, which is the characteristic catastrophe time.
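How $\alpha$ controls the change in catastrophe probability over time can be seen from the Weibull hazard function, the density divided by the survival function. Below is a small sketch using `scipy.stats.weibull_min`; the helper name `weibull_hazard` and the values of `sigma` and `t` are hypothetical choices for illustration.

```python
import numpy as np
from scipy import stats


def weibull_hazard(t, alpha, sigma):
    """Hazard (instantaneous catastrophe rate) of a Weibull distribution."""
    pdf = stats.weibull_min.pdf(t, c=alpha, scale=sigma)
    sf = stats.weibull_min.sf(t, c=alpha, scale=sigma)
    return pdf / sf


# Hypothetical values, for illustration only
t = np.array([100.0, 300.0, 900.0])
sigma = 400.0

hazards = {alpha: weibull_hazard(t, alpha, sigma) for alpha in (0.5, 1.0, 2.0)}
for alpha, h in hazards.items():
    print(f"alpha = {alpha}: hazard at t = {t} is {h}")
```

The hazard should decrease with time for $\alpha < 1$, stay constant at $1/\sigma$ for $\alpha = 1$, and increase for $\alpha > 1$.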

**b) Perform parameter estimates for the respective models and make model comparisons. Comment on what this means with respect to our understanding of how microtubule catastrophe works.**

We first load our data and choose priors for the parameters of the three distributions. We do not have much prior knowledge, so we keep the priors simple and use Normal distributions.

*Exponential($\tau$)*

$\tau$ ~ normal(700, 100)

*Gamma($\alpha$, $\beta$)*

$\alpha$ ~ normal(10, 3)

$\beta$ ~ normal(10, 3)

*Weibull($\alpha$, $\sigma$)*

$\alpha$ ~ normal(3, 0.05)

$\sigma$ ~ normal(10, 3)
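One thing worth checking with Normal priors on strictly positive parameters is how much prior mass falls below zero, since a negative draw is not a valid parameter for these distributions. A quick sketch with `scipy.stats.norm`, using the prior locations and scales listed above (the dictionary labels are just descriptive names):

```python
from scipy import stats

# (location, scale) of each Normal prior listed above
priors = {
    "tau (Exponential)": (700, 100),
    "alpha, beta (Gamma)": (10, 3),
    "alpha (Weibull)": (3, 0.05),
    "sigma (Weibull)": (10, 3),
}

# Prior probability of drawing a negative (invalid) parameter value
p_negative = {
    name: stats.norm.cdf(0, loc=mu, scale=sd) for name, (mu, sd) in priors.items()
}

for name, p in p_negative.items():
    print(f"{name}: P(parameter < 0) = {p:.2e}")
```

If any of these probabilities is non-negligible relative to the number of prior draws, a prior with positive support (or a truncated Normal) may be a safer choice.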


In [2]:
df = pd.read_csv('../data/gardner_mt_catastrophe_only_tubulin.csv', comment = "#")

In [3]:
prior_sm1 = bebi103.stan.StanModel(file='./8.2_prior_pred_12_m1.stan')
print(prior_sm1.model_code)
print("\n---------------------------------------------------------------\n")
prior_sm2 = bebi103.stan.StanModel(file='./8.2_prior_pred_12_m2.stan')
print(prior_sm2.model_code)
print("\n---------------------------------------------------------------\n")
prior_sm3 = bebi103.stan.StanModel(file='./8.2_prior_pred_12_m3.stan')
print(prior_sm3.model_code)

Using cached StanModel.
data{
    // Model 1
    int N;
}

generated quantities{
    // Parameters
    real uM_12[N];
    real beta_;
       
    beta_ = 1.0 / normal_rng(700, 100);
    
    // Data
    for (i in 1:N) {
        uM_12[i] = exponential_rng(beta_);
    }
}


---------------------------------------------------------------

Using cached StanModel.
data{
    // Model 2
    int N;
}

generated quantities{
    // Parameters
    real uM_12[N];
    real beta_;
    real alpha;
       
    alpha = normal_rng(10, 3);
    beta_ = normal_rng(10, 3);
    
    // Data
    for (i in 1:N) {
        uM_12[i] = gamma_rng(alpha, beta_);
    }
}

---------------------------------------------------------------

Using cached StanModel.
data{
    // Model 3
    int N;
}

generated quantities{
    // Parameters
    real uM_12[N];
    real alpha;
    real sigma;

    // Likelihood  
    alpha = normal_rng(3, 0.05);
    sigma = normal_rng(10, 3);
    
    // Data
    for (i in 1:N) {
        uM_12[i] = weibull_rng(alpha, sigma);
    }
}

In [8]:
# Store input parameters in a dictionary so stan can access them
data = dict(N=692)

# Generate samples
samples_gen1 = prior_sm1.sampling(data=data,
                          algorithm='Fixed_param',
                          warmup=0,
                          chains=1,
                          iter=1000)
samples_gen2 = prior_sm2.sampling(data=data,
                          algorithm='Fixed_param',
                          warmup=0,
                          chains=1,
                          iter=1000)
samples_gen3 = prior_sm3.sampling(data=data,
                          algorithm='Fixed_param',
                          warmup=0,
                          chains=1,
                          iter=1000)

# Store samples in a dataframe
df_gen1 = bebi103.stan.to_dataframe(samples_gen1, diagnostics=False)
df_gen2 = bebi103.stan.to_dataframe(samples_gen2, diagnostics=False)
df_gen3 = bebi103.stan.to_dataframe(samples_gen3, diagnostics=False)

# Let's look at one of the dataframes to make sure they look okay
df_gen1.head()

Unnamed: 0,chain,chain_idx,warmup,uM_12[1],uM_12[2],uM_12[3],uM_12[4],uM_12[5],uM_12[6],uM_12[7],...,uM_12[685],uM_12[686],uM_12[687],uM_12[688],uM_12[689],uM_12[690],uM_12[691],uM_12[692],beta_,lp__
0,1,1,0,893.876619,99.069444,336.687588,874.153415,57.288336,211.91839,117.215338,...,437.283462,1367.379314,54.343024,991.521961,346.311374,648.863838,803.961463,507.205881,0.001429,0.0
1,1,2,0,201.94771,175.106996,1304.424657,309.525696,1133.560756,2665.482802,41.559374,...,804.229385,708.809695,394.504621,514.513666,555.601425,143.285132,399.057382,1234.274496,0.001549,0.0
2,1,3,0,3899.065369,1802.485427,204.117236,3406.23516,456.315315,531.675525,400.21218,...,249.870781,691.930836,737.052655,259.124273,0.03346,224.316242,833.286222,1356.629982,0.001103,0.0
3,1,4,0,126.457509,504.627487,338.353799,140.906616,97.058808,223.33063,2172.791198,...,1148.97583,807.374666,159.286808,676.150573,1182.198149,2546.398498,425.386468,41.288067,0.001544,0.0
4,1,5,0,579.184321,219.691194,448.462779,1509.149165,594.547277,57.179407,63.063346,...,153.75745,75.078529,890.616448,241.645073,1655.577321,787.237245,684.594703,457.835522,0.001478,0.0


In [7]:
p = bebi103.viz.predictive_ecdf(samples_gen1, "uM_12",
                                x_axis_label = "intercatastrophe time (s)")
p.x_range = bokeh.models.Range1d(-10, 3000)
bokeh.io.show(p)

The prior predictive intercatastrophe times extend to the order of an hour, which is longer than we expect, so this prior is off.

In [9]:
p = bebi103.viz.predictive_ecdf(samples_gen2, "uM_12",
                                x_axis_label = "intercatastrophe time (s)")
p.x_range = bokeh.models.Range1d(-10, 10)
bokeh.io.show(p)

This ECDF also looks a little steep: nearly all of the prior predictive times fall within a few seconds.

In [10]:
p = bebi103.viz.predictive_ecdf(samples_gen3, "uM_12",
                                x_axis_label = "intercatastrophe time (s)")
p.x_range = bokeh.models.Range1d(-10, 50)
bokeh.io.show(p)

This seems closer to what we might expect. 

Our priors are not fully satisfactory: the prior predictive ECDFs for Models 1 and 2 in particular do not match the time scales we expect. We defer revising them for now and move on to creating our MCMC models and sampling.
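The time scales implied by the three sets of priors can also be compared numerically, without Stan. The sketch below redraws prior predictive data sets with NumPy, using the same priors as the Stan programs above, and reports the median predicted time for each model. Discarding non-positive parameter draws is an assumption made here so that the NumPy samplers do not fail; it is not part of the Stan code.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_draws, N = 1000, 692

# Model 1: tau ~ normal(700, 100), t ~ Exponential(1 / tau)
tau = rng.normal(700, 100, size=n_draws)
tau = tau[tau > 0]
t1 = rng.exponential(scale=tau[:, None], size=(len(tau), N))

# Model 2: alpha, beta ~ normal(10, 3), t ~ Gamma(alpha, beta), beta is a rate
alpha = rng.normal(10, 3, size=n_draws)
beta = rng.normal(10, 3, size=n_draws)
ok = (alpha > 0) & (beta > 0)
t2 = rng.gamma(shape=alpha[ok, None], scale=1 / beta[ok, None], size=(ok.sum(), N))

# Model 3: alpha ~ normal(3, 0.05), sigma ~ normal(10, 3), t ~ Weibull(alpha, sigma)
alpha_w = rng.normal(3, 0.05, size=n_draws)
sigma = rng.normal(10, 3, size=n_draws)
ok_w = sigma > 0
t3 = sigma[ok_w, None] * rng.weibull(alpha_w[ok_w, None], size=(ok_w.sum(), N))

medians = {"model 1": np.median(t1), "model 2": np.median(t2), "model 3": np.median(t3)}
for name, m in medians.items():
    print(f"{name}: median prior predictive time = {m:.1f} s")
```

Comparing these medians against the range of the measured catastrophe times gives a quick indication of which priors need to be rescaled.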

In [67]:
sm1 = bebi103.stan.StanModel(file='./8.2_mcmc_12_m1.stan')
sm2 = bebi103.stan.StanModel(file='./8.2_mcmc_12_m2.stan')
sm3 = bebi103.stan.StanModel(file='./8.2_mcmc_12_m3.stan')

Using cached StanModel.
Using cached StanModel.
Using cached StanModel.


In [51]:
data = dict(N=len(df),
           uM_12=df['12 uM'].values.astype(float))

In [60]:
samples1 = sm1.sampling(data=data)
samples2 = sm2.sampling(data=data)
samples3 = sm3.sampling(data=data)

In [61]:
df_mcmc1 = bebi103.stan.to_dataframe(samples1, diagnostics=False, inc_warmup=False)
df_mcmc2 = bebi103.stan.to_dataframe(samples2, diagnostics=False, inc_warmup=False)
df_mcmc3 = bebi103.stan.to_dataframe(samples3, diagnostics=False, inc_warmup=False)

In [62]:
df_mcmc3.head()

Unnamed: 0,chain,chain_idx,warmup,alpha,sigma,log_like[1],log_like[2],log_like[3],log_like[4],log_like[5],...,uM_12_ppc[684],uM_12_ppc[685],uM_12_ppc[686],uM_12_ppc[687],uM_12_ppc[688],uM_12_ppc[689],uM_12_ppc[690],uM_12_ppc[691],uM_12_ppc[692],lp__
0,1,1,0,0.66552,80.554756,-4.863764,-5.189537,-5.189537,-5.287579,-5.364648,...,198.620163,11.098617,69.380277,0.320216,49.748341,2.31993,386.777988,6.400399,0.284556,-6875.511834
1,1,2,0,0.648505,85.074963,-4.898102,-5.224358,-5.224358,-5.321834,-5.398238,...,132.558162,29.099731,16.968297,154.637017,1.565568,17.459336,32.636257,42.508185,0.544517,-6872.636984
2,1,3,0,0.682327,84.213002,-4.866421,-5.180813,-5.180813,-5.275832,-5.350657,...,179.780255,96.451287,10.17908,47.230644,79.154152,227.621295,1.912042,216.915176,16.025026,-6871.143061
3,1,4,0,0.677586,84.760981,-4.872626,-5.18813,-5.18813,-5.28331,-5.358206,...,207.268492,63.782835,400.182804,674.366415,1.861013,286.322047,23.228846,5.646027,11.912947,-6870.308615
4,1,5,0,0.6575,85.222696,-4.891023,-5.213671,-5.213671,-5.310342,-5.386202,...,272.499014,345.552418,82.56412,162.514377,54.266604,0.683849,5.648467,32.668135,131.112216,-6871.198522


We now plot the posterior predictive checks to do model comparison.

In [93]:
bokeh.io.show(bebi103.viz.predictive_ecdf(samples1, 
                                          percentiles=[99, 70, 50, 30],
                                          name='uM_12_ppc', 
                                          data=df['12 uM'].values,
                                          diff=True,
                                          data_line=False))

In [94]:
bokeh.io.show(bebi103.viz.predictive_ecdf(samples2, 
                                          percentiles=[99, 70, 50, 30],
                                          name='uM_12_ppc', 
                                          data=df['12 uM'].values,
                                          diff=True,
                                          data_line=False))

In [95]:
bokeh.io.show(bebi103.viz.predictive_ecdf(samples3, 
                                          percentiles=[99, 70, 50, 30],
                                          name='uM_12_ppc', 
                                          data=df['12 uM'].values,
                                          diff=True,
                                          data_line=False))
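To make clear what an ECDF-difference predictive plot displays, here is a rough sketch of the underlying computation in plain NumPy. It is not the `bebi103.viz.predictive_ecdf` implementation, only an illustration of the idea; `data_obs` and `ppc` are synthetic stand-ins for the measured times and the `uM_12_ppc` draws, and the helper `ecdf_at` is hypothetical.

```python
import numpy as np


def ecdf_at(x_grid, samples):
    """Evaluate the ECDF of `samples` at each point of `x_grid`."""
    return np.searchsorted(np.sort(samples), x_grid, side="right") / len(samples)


rng = np.random.default_rng(seed=1)

# Synthetic stand-ins (NOT the real data or the real posterior predictive draws)
data_obs = rng.gamma(shape=2.0, scale=200.0, size=300)
ppc = rng.exponential(scale=400.0, size=(500, 300))

x_grid = np.linspace(0, 2000, 101)

# ECDF of every posterior predictive data set on a common grid
ppc_ecdfs = np.array([ecdf_at(x_grid, draw) for draw in ppc])

# Median predictive ECDF and a central 70% envelope
median_ecdf = np.percentile(ppc_ecdfs, 50, axis=0)
lower, upper = np.percentile(ppc_ecdfs, [15, 85], axis=0)

# With a difference plot, everything is shown relative to the median ECDF
data_diff = ecdf_at(x_grid, data_obs) - median_ecdf
envelope_diff = (lower - median_ecdf, upper - median_ecdf)

print("largest |data - median| ECDF difference:", np.abs(data_diff).max())
```

If the model describes the data well, the data difference curve should mostly stay inside the envelope around zero.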

These plots alone do not give us much information for choosing between the models (and the posterior predictive ECDFs do not look good), so we compute the LOO from the pointwise log likelihood using the bebi103 compare function.

In [65]:
bebi103.stan.compare({'exponential': samples1, 'gamma': samples2, 'weibull': samples3},
                     log_likelihood='log_like', ic='loo')

| model | loo | ploo | dloo | weight | se | dse | warning |
|---|---|---|---|---|---|---|---|
| gamma | 9278.44 | 2.14642 | 0.0 | 0.984134 | 45.7341 | 0.0 | 0 |
| exponential | 9608.84 | 0.364711 | 330.406 | 0.0158657 | 31.3845 | 28.7602 | 0 |
| weibull | 10911.6 | 1.38093 | 1633.15 | 1.23517e-09 | 68.3094 | 51.0139 | 0 |


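We can see that the Gamma distribution (Model 2) fits our data best of the three: it has the lowest LOO (9278.44, reported here on the deviance scale, where lower is better) and nearly all of the weight (0.984), whereas the Weibull model (Model 3) has the highest LOO (10911.6) and negligible weight. Given the concerns about our priors and posterior predictive checks noted above, we treat this conclusion as tentative.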
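One way to read the comparison table is to look at each model's LOO difference relative to the standard error of that difference. The sketch below copies the `loo`, `dloo`, and `dse` columns from the table above into a small data frame and computes that ratio; it assumes, as the ordering of the table suggests, that lower LOO is better here.

```python
import pandas as pd

# Values copied from the comparison table above
comparison = pd.DataFrame(
    {
        "loo": [9278.44, 9608.84, 10911.6],
        "dloo": [0.0, 330.406, 1633.15],
        "dse": [0.0, 28.7602, 51.0139],
    },
    index=["gamma", "exponential", "weibull"],
)

# Model with the lowest LOO (assumed to be the preferred one)
best = comparison["loo"].idxmin()

# LOO difference from the best model, in units of its standard error
others = comparison.drop(index=best)
ratio = others["dloo"] / others["dse"]

print("best model:", best)
print(ratio)
```

Differences that are many standard errors away from zero suggest the ranking is not an artifact of sampling noise, although it still depends on the priors and on the quality of the fits.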

**c) Using whichever model you favor based on your work in part (b), obtain parameter estimates for the other tubulin concentrations. Given that microtubules polymerize faster with higher tubulin concentrations, is there anything you can say about the occurrence of catastrophe by looking at the values of the parameters versus tubulin concentration?**
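As a starting point for this part, the data for each tubulin concentration can be pulled out of the data frame column by column. The sketch below uses a tiny made-up data frame; the column name `'7 uM'`, the values, and the dictionary key `t` are hypothetical, and it assumes that shorter columns are padded with NaN, which should be checked against the actual file.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real data frame (values are NOT real measurements)
df_demo = pd.DataFrame(
    {
        "12 uM": [25.0, 40.0, 110.0, 305.0],
        "7 uM": [35.0, 180.0, np.nan, np.nan],
    }
)

# Build one Stan-style data dictionary per concentration, dropping NaN padding
data_by_conc = {}
for col in df_demo.columns:
    times = df_demo[col].dropna().values.astype(float)
    data_by_conc[col] = dict(N=len(times), t=times)

for col, d in data_by_conc.items():
    print(col, "N =", d["N"])
```

Each dictionary could then be passed to the sampler for the favored model, provided the key names match those declared in the Stan `data` block.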
