In [128]:
import hvplot
import holoviews as hv
import hvplot.pandas
import bokeh_catplot
import bokeh 
import bokeh.io
from bokeh.themes import Theme
from bokeh.io import output_file, save, output_notebook
import datashader as ds
from holoviews.operation.datashader import datashade
from bokeh.plotting import figure
import numba
import colorcet as cc
import bebi103
from bebi103 import viz

import micro_cat_29

import numpy as np
import scipy.stats
import scipy.special
import pandas as pd

bokeh.io.output_notebook()
hv.extension('bokeh')

ModuleNotFoundError: No module named 'micro_cat_29'

## Microtubule Catastrophe: Notebook #1

Microtubules are intracellular filaments, responsible for shuttling cargo throughout the cellular interior. When viewed under the microscope in the presence of ATP, microtubules can be observed slowly growing, then rapidly collapsing, as they explore the environment around them.

This latter feat -- rapid collapse -- is a process known as "microtubule catastrophe". To explore this phenomenon in a quantitative manner, Melissa K. Gardner, Marija Zanic, Christopher Gell, Volker Bormuth and Jonathan Howard performed experiments that uncovered a role for the kinesins, Kip3 and MCAK, in controlling microtubule catastrophe.

The paper is: **Gardner MK, _et al._ "Depolymerizing Kinesins Kip3 and MCAK Shape Cellular Microtubule Architecture by Differential Control of Catastrophe" _Cell_ (2011)**.

Throughout this analysis, we will refer to this paper, as we go from exploratory data analysis to in-depth analysis and mathematical modeling.

The dataset used in this analysis contains observations for the times to microtubule catastrophe (in seconds) for tubulin that is either _labeled_ with GFP, or _unlabeled_. The authors used TIRF microscopy to assess microtubule catastrophe for the labeled dataset, and used differential interference microscopy (DIC) to evaluate time to catastrophe of unlabeled tubulin. 

TIRF is much easier to perform than DIC, which would make subsequent experiments simpler, but the authors first had to confirm that the addition of AlexaFluor 488 did not affect catastrophe times.

Throughout these notebooks, you should have Justin Bois' Utilities Package for BE/Bi103 installed on your machine. To do that, you can use pip:

In your Anaconda terminal, simply run: pip install bebi103

_This is notebook 1 of 2. The order of analysis is:_

1) **exploratory.ipynb**

2) models.ipynb

In this notebook (#1), we will perform an initial, exploratory data analysis of microtubule catastrophe. This analysis includes ECDFs that compare labeled vs. unlabeled tubulin. We also perform nonparametric bootstrapping to more critically analyze the distribution of labeled vs. unlabeled tubulin, and assess whether we can proceed, in future notebooks, with considerations of solely the labeled dataset.

In [41]:
from IPython.core.display import display, HTML
display(HTML("""<a href="https://www.cell.com/cell/fulltext/S0092-8674(11)01287-6?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS0092867411012876%3Fshowall%3Dtrue">Gardner et al. Cell (2011), link to paper.</a>"""))

display(HTML("""<a href="http://bebi103.caltech.edu.s3-website-us-east-1.amazonaws.com/2019a/content/homework/hw3/hw3.3.html">Dataset download, link to course website.</a>"""))

In [95]:
#Read in the dataframe from Gardner et al.
df = pd.read_csv('../data/gardner_time_to_catastrophe_dic_tidy.csv', na_values='*')
df.head(), df.tail()

(   Unnamed: 0  time to catastrophe (s)  labeled
 0           0                    470.0     True
 1           1                   1415.0     True
 2           2                    130.0     True
 3           3                    280.0     True
 4           4                    550.0     True,
      Unnamed: 0  time to catastrophe (s)  labeled
 301         301                    180.0    False
 302         302                    145.0    False
 303         303                    745.0    False
 304         304                    390.0    False
 305         305                   1410.0    False)

This DataFrame contains two sets of microtubules: those that are labeled, and those that are not labeled. We want to evaluate whether or not "labeling" affects microtubule catastrophe times.

In [43]:
#Make a new, independent dataframe called df_label, that contains only "labeled" datapoints
df_label = df.loc[df['labeled'] == True, :]

#Make a new, independent dataframe called df_unlabel, that contains only "unlabeled" datapoints
df_unlabel = df.loc[df['labeled'] == False, :]

In [44]:
#Visualize the new DataFrames
df_label.head(), df_unlabel.head()

(   Unnamed: 0  time to catastrophe (s)  labeled
 0           0                    470.0     True
 1           1                   1415.0     True
 2           2                    130.0     True
 3           3                    280.0     True
 4           4                    550.0     True,
      Unnamed: 0  time to catastrophe (s)  labeled
 211         211                    355.0    False
 212         212                    425.0    False
 213         213                    540.0    False
 214         214                    265.0    False
 215         215                   1815.0    False)

Terrific! Now that we have two DataFrames, each corresponding to either "labeled" or "unlabeled" microtubules, we will visualize their distributions via plotting empirical distribution functions (ECDFs).

To do this, we will import the GitHub module "bebi103" that we previously installed, and use the built-in function, bokeh_catplot.ecdf.

In [45]:
#Plot ECDFs of labeled and unlabeled tubulin, with labeled ECDF in "green"
p = bokeh_catplot.ecdf(
    data=df_label,
    val='time to catastrophe (s)',
    palette=['#458b00']
)

p = bokeh_catplot.ecdf(
    data=df_unlabel,
    val='time to catastrophe (s)',
    p=p
)

bokeh.io.show(p)

It is really challenging to say, with certainty, whether the labeling affects the time to catastrophe events without plotting confidence intervals and performing statistical analyses. Based on eye alone, I would say that the ECDFs look sufficiently similar, or within the margin of experimental error, for the labeled vs. unlabeled data.The 90th and 10th percentiles, for instance, look quite similar between the datasets, as does the median value (around 300s for both). Based on this, I think that labeling does not impact time to catastrophe.

To get a better idea of this data, we can plot another ECDF, but this time with confidence intervals added.

In [46]:
#Plot ECDFs of labeled and unlabeled tubulin, with labeled ECDF in "green", plus add confidence intervals
p = bokeh_catplot.ecdf(
    data=df_label,
    val='time to catastrophe (s)',
    palette=['#458b00'],
    conf_int=True
)

p = bokeh_catplot.ecdf(
    data=df_unlabel,
    val='time to catastrophe (s)',
    conf_int=True,
    p=p
)

bokeh.io.show(p)

There is clearly _significant_ overlap between the two distributions, which strongly implies that there is no difference between labeled and unlabeled tubulin. 

However, we should perform more in-depth analysis of the labeled vs. unlabeled tubulin datasets. We will do so via "nonparametric hacker stats", calling upon bootstrapping methods to more rigorously assess whether there is an experimental difference between labeled and unlabeled tubulin.

## Nonparametric "Hacker Stats" to Assess Labeled vs. Unlabeled Tubulin

We have already plotted the ECDFs of the catastrophe times for microtubules with labeled tubulin and for those with unlabeled tubulin, including a confidence interval. In looking at the plot, we assessed that the two datasets could, indeed, be identically distributed.

In this next part of the "exploratory analysis", we will compute confidence intervals for the plug-in estimate for the _mean time_ to catastrophe for each of the two conditions: labeled vs. unlabeled.

In [70]:
labeled = df.loc[df['labeled']==True, 'time to catastrophe (s)'].values
unlabeled = df.loc[df['labeled']==False, 'time to catastrophe (s)'].values

In [71]:
@numba.njit
def draw_bs_sample(data):
    """Draw a bootstrap sample from a 1D data set."""
    return np.random.choice(data, size=len(data))


@numba.njit
def draw_bs_reps_mean(data, size=1):
    """Draw boostrap replicates of the mean from 1D data set."""
    out = np.empty(size)
    for i in range(size):
        out[i] = np.mean(draw_bs_sample(data))
    return out

In [72]:
bs_labeled = draw_bs_reps_mean(labeled, size=100000)
bs_unlabeled = draw_bs_reps_mean(unlabeled, size=100000)

bs_conf_int_labeled = np.percentile(bs_labeled, [2.5, 97.5])
bs_conf_int_unlabeled = np.percentile(bs_unlabeled, [2.5, 97.5])

In [73]:
def plot_with_error_bars(means, confs, names, **kwargs):
    """Make a horizontal plot of means/conf ints with error bars."""
    frame_height = kwargs.pop("frame_height", 150)
    frame_width = kwargs.pop("frame_width", 450)

    p = bokeh.plotting.figure(
        y_range=names, frame_height=frame_height, frame_width=frame_width, **kwargs
    )

    p.circle(x=means, y=names)
    for conf, name in zip(confs, names):
        p.line(x=conf, y=[name, name], line_width=2)

    return p

In [74]:
bokeh.io.show(
    plot_with_error_bars(
        [bs_labeled.mean(), bs_unlabeled.mean()],
        [bs_conf_int_labeled, bs_conf_int_unlabeled],
        names=["labeled", "unlabeled"],
        x_axis_label="mean time to catastrophe (s)",
    )
)

Based on this bootstrapping analysis, it does seem that there is significant overlap between the labeled and unlabeled tubulin distributions.

We will next test the hypothesis the distribution of catastrophe times for microtubules with labeled tubulin is the same as that for unlabeled tubulin.


**Answer**: Let's use a permutation NHST recipe.

1. State the null hypothesis (more- or less already stated): The distributions of the labeled and unlabeled tubulin are the same. 

2. Define a test statistic: Difference of means squared weighted by the the product of the standard deviations. That is:

$$
\text{D} = \frac{ \left( \bar{t_1} - \bar{t_2} \right) ^2}{\sigma_{t_1} \sigma_{t_2}}
$$

We chose this statistic because it takes into account both the first and second moments of the distribution.

3. Rejection region : We'll choose an alpha of $1x10^-3$. 

4. Simulate data acquisition: Okay ! 

But before we apply the bootstrap replicates of our test statistic, let's think about why is this is a good measure of the difference.

We know that the first and second moments of the exponential distribution are just $1/\beta$ and $1\beta ^2$ respectively. We know the PDF is a monotonically decreasing function and therefore be more interested in the rightward tail of the distribution. From this assumption it is okay to assume that for example, a difference of the means would in essence be a good test statistic by itself. Moreover, because we want to **test whether the distributions are different** and not just the means, we want to include that in the test statistic we use. 

We acknowledge the fact that the test statistic *might* be biased, and that we may be a more formal way to characterize the difference in exponential distributions, but we'll go with this for now. 

In [75]:
@numba.njit
def diff_means(x, y): 
    
    diff_means = (np.mean(x) - np.mean(y))**2 / (np.std(x) - np.std(y))
    
    return diff_means

Now let's concatenate and scramble both samples. 

In [76]:
@numba.njit
def draw_perm_sample(x, y):
    """Generate a permutation sample."""
    concat_data = np.concatenate((x, y))
    np.random.shuffle(concat_data)

    return concat_data[:len(x)], concat_data[len(x):]

In [77]:
def draw_perm_reps(x, y, stat_fun, size=1):
    """Generate array of permuation replicates."""
    return np.array([stat_fun(*draw_perm_sample(x, y)) for _ in range(size)])

In [78]:
@numba.njit
def draw_perm_reps_diff_mean(x, y, size=1):
    """Generate array of permuation replicates."""
    out = np.empty(size)
    for i in range(size):
        x_perm, y_perm = draw_perm_sample(x, y)
        out[i] = diff_means(x_perm, y_perm)

    return out

Because we want to resolve values below  to a $1x10^-3$ p-value, we will make $1x10^4$ bootstrap reps. 

In [79]:
perm_reps = draw_perm_reps_diff_mean(labeled, unlabeled,size = np.int(1e4))

In [80]:
diff_mean = diff_means(labeled, unlabeled)

In [81]:
pval = np.sum(perm_reps >= diff_mean) / len(perm_reps)

In [82]:
pval

0.8997

It is possible to observe this test statistic under the null hypothesis, namely, that the labeled and unlabeled catastrophe times are identically distributed. This suggests that the measured data are commensurate with identical distributions.


**d)** In part (b), you used bootstrapping to compute a confidence interval for the plug-in estimate for the mean time to catastrophe. As is often (though definitely not always) the case, we could use a theoretical result to construct a confidence interval. The central limit theorem states that the mean, which is the sum of many processes, should be approximately Normally distributed. We will not derive it here, but the mean and variance of that Normal distribution are approximately

\begin{align}
&\mu = \bar{x},\\[1em]
&\sigma^2 = \frac{1}{n(n-1)}\sum_{i=1}^n (x_i - \bar{x})^2,
\end{align}

where $\bar{x}$ is the arithmetic mean of the data points. To compute a confidence interval of the mean, then, you can compute the interval over which 95% of the probability mass of the above described Normal distribution lies. 

Compute this approximate confidence interval and compare it to the result you got in part (b). *Hint*: You can use the `scipy.stats` package to conveniently get intervals for named distributions.



In [83]:
mean_lab_ci, _, _ = scipy.stats.bayes_mvs(labeled)
mean_unlab_ci, _, _ = scipy.stats.bayes_mvs(unlabeled)

In [84]:
print("""Mean time to catastrophe 95 % confidence interval [{0:.2f}, {1:.2f}]
for the labeled microtubules using scipy. \n""".format(*list(mean_lab_ci)[1]))

print("""Mean time to catastrophe 95 % confidence interval [{0:.2f}, {1:.2f}]
for the unlabeled microtubules using scipy.""".format(*list(mean_unlab_ci)[1]))

Mean time to catastrophe 95 % confidence interval [407.02, 474.40]
for the labeled microtubules using scipy. 

Mean time to catastrophe 95 % confidence interval [360.24, 464.81]
for the unlabeled microtubules using scipy.


In [85]:
print("""Mean time to catastrophe 95 % confidence interval [{0:.2f}, {1:.2f}]
for the labeled microtubules using bootstrap. \n""".format(*mt_lab_mean_ci))

print("""Mean time to catastrophe 95 % confidence interval [{0:.2f}, {1:.2f}]
for the unlabeled microtubules using bootstrap.""".format(*mt_non_lab_mean_ci))

Mean time to catastrophe 95 % confidence interval [401.78, 481.95]
for the labeled microtubules using bootstrap. 

Mean time to catastrophe 95 % confidence interval [352.31, 476.79]
for the unlabeled microtubules using bootstrap.


We can see that the results are fairly similar. In both cases the bootstrap confidence intervals are broader than the ones calculated with scipy.

In the notebooks that follow, we will restrict our analysis to the "labeled" datasets.

<hr>

In [127]:
%load_ext watermark
%watermark -v -p bebi103,pandas,scipy,numpy,bokeh,bokeh_catplot,jupyterlab

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
CPython 3.7.4
IPython 5.8.0

bebi103 0.0.46
pandas 0.24.2
scipy 1.3.1
numpy 1.17.3
bokeh 1.4.0
bokeh_catplot 0.1.6
jupyterlab 1.2.0
