# Confidence intervals for microtubule catastrophe times

The purpose of this notebook is to analyze a control experiment in the [Gardneer, Zanic, et al. paper](https://www.sciencedirect.com/science/article/pii/S0092867411012876?via%3Dihub) in which some tubulin was labeled with a fluorescent marker, and some were not. The purpose of this experiment was to make sure that fluoresceent labels and exposure to laser light did not affect the microtubule catastrophe times. So, here we will perform analyses and nonparametric inferences to verify that there is little to no difference in the two datasets.

## Imports and Loading Dataset

In [1]:
import math

import pandas as pd
import numpy as np
import numba
import scipy.stats

import bokeh.io
import bokeh.plotting
import bokeh.palettes
import bokeh_catplot

import microtubule_catastrophe as mc
import bebi103

bokeh.io.output_notebook()

In [2]:
# Load the dataset
df = pd.read_csv('../../data/gardner_time_to_catastrophe_dic_tidy.csv', usecols=[1,2])
df.head()

Unnamed: 0,time to catastrophe (s),labeled
0,470.0,True
1,1415.0,True
2,130.0,True
3,280.0,True
4,550.0,True


In [3]:
# Extract values of interest as numpy arrays
labeled = df.loc[df['labeled'] == True, 'time to catastrophe (s)'].values
unlabeled = df.loc[df['labeled'] == False, 'time to catastrophe (s)'].values

## Confidence Intervals for ECDFs

In [4]:
palette = bokeh.palettes.brewer['RdBu'][4]

p = bokeh_catplot.ecdf(
    data=df,
    cats='labeled',
    val='time to catastrophe (s)',
    conf_int=True,
    palette=[palette[0], palette[-1]],
)

p.legend.title = 'Labeled Tubulin'

bokeh.io.show(p)

Looking at the plotted ECDFs with confidence intervals above, there is a lot of overlap in the confidence intervals, and the original ECDFs also follow each other very closely. Based on that observation, it could be likely that the two conditions are identically distributed. 

## Confidence Intervals for Plug-in Estimate of Mean Time to Catastrophe via Bootstrapping

In [7]:
@numba.njit
def draw_bs_sample(data):
    """Draw a bootstrap sample from a 1D data set."""
    return np.random.choice(data, size=len(data))


@numba.njit
def draw_bs_reps_mean(data, size=1):
    """Draw boostrap replicates of the mean from 1D data set."""
    out = np.empty(size)
    for i in range(size):
        out[i] = np.mean(draw_bs_sample(data))
    return out

Use bootstrapping method to obtain confidence intervals for the plug-in estimate for the mean time to catastrophe.

In [8]:
# Sample boostrapped mean values
bs_reps_mean_labeled = draw_bs_reps_mean(labeled, size=10000)
bs_reps_mean_unlabeled = draw_bs_reps_mean(unlabeled, size=10000)

In [9]:
# 95% confidence intervals
mean_labeled_conf_int = np.percentile(bs_reps_mean_labeled, [2.5, 97.5])
mean_unlabeled_conf_int = np.percentile(bs_reps_mean_unlabeled, [2.5, 97.5])

print("""
Mean time to catastrophe 95% conf int labeled (s):   [{0:.2f}, {1:.2f}]
Mean time to catastrophe 95% conf int unlabeled (s): [{2:.2f}, {3:.2f}]
""".format(*(tuple(mean_labeled_conf_int) + tuple(mean_unlabeled_conf_int))))


Mean time to catastrophe 95% conf int labeled (s):   [401.63, 480.47]
Mean time to catastrophe 95% conf int unlabeled (s): [354.42, 476.32]



Considering the confidence intervals for the plug-in estimate of the mean, we again observe that there is a pretty large overlap between the intervals, from 401.63 to 476.32. More than 50% of either condition's confidence interval is within this overlap. Though this cannot confirm that the two conditions are identically distributed, it does not refute our observation in part a) that they could be identically distributed.

## NHST for Labeled vs. Unlabeled Tubulin Distributions

Because we are testing the hypothesis that the two distributions are the same, we will perform a permutation hypothesis test. This test simulates the condition of the two distributions being identical by scrambling the data points between two "new" datasets. We choose difference of means between the two datasets as our test statistic to measure similarity or dissimilarity between the two distributions. This is because if the two distributions are identical, their means should be close together (important to note, however, that the converse is not necessarily true). Then, we will compute the difference of means for the original dataset and see how many of our simulated differences are at least as extreme as the original difference to obtain a p-value.

In [11]:
# Helper functions from lesson 7:
@numba.njit
def draw_perm_sample(x, y):
    """Generate a permutation sample."""
    concat_data = np.concatenate((x, y))
    np.random.shuffle(concat_data)

    return concat_data[:len(x)], concat_data[len(x):]


@numba.njit
def draw_perm_reps_diff_mean(x, y, size=10000):
    """Generate array of permuation replicates."""
    out = np.empty(size)
    for i in range(size):
        x_perm, y_perm = draw_perm_sample(x, y)
        out[i] = np.mean(x_perm) - np.mean(y_perm)

    return out

In [12]:
# Compute test statistic for original data set
diff_mean = np.mean(labeled) - np.mean(unlabeled)

# Draw replicates
perm_reps = draw_perm_reps_diff_mean(labeled, unlabeled, size=10000)

# Compute p-value
p_val = np.sum(perm_reps >= diff_mean) / len(perm_reps)

print('p-value =', p_val)

p-value = 0.2238


This is a rather large p-value. Again, this does not really refute our original hypothesis that the two conditions could be identically distributed.

## Confidence Intervals for Plug-in Estimate of Mean Time to Catastrophe via Central Limit Theorem

In [19]:
def clt_conf_int(data):
    """
    Computes an approximation of the confidence interval using the theoretical
    result (Central Limit Theorem) that the mean is approximately Normally
    distributed.
    """
    n = len(data)
    # Approximations of the Normal distribution parameters using given dataset
    mu = np.mean(data)
    sigma = np.sqrt(np.sum((data - mu) ** 2) / (n * (n+1)))
    
    return (scipy.stats.norm.interval(.95, loc=mu, scale=sigma))

In [20]:
print('labeled:', clt_conf_int(labeled))
print('unlabeled:', clt_conf_int(unlabeled))

labeled: (400.9301791367841, 480.4916218110832)
unlabeled: (351.4855037957038, 473.5671277832436)


Compared to the confidence intervals computed in part b) by bootstrapping, this result is quite similar, where the endpoints only vary by a small percentage of the width of the intervals. In fact, if you repeated the bootstrapping method in part b), they are all similar to these computed intervals.

##

In [8]:
def ecdf(x, data):
    """Returns ECDF values of the given dataset, at the given x values."""
    n = len(data)
    vals = np.empty(len(x))
    
    x = np.sort(x)
    data = np.sort(data)
    
    # ECDF is stepwise function with left end closed, right end open
    # Refer to http://bois.caltech.edu/distribution_explorer/background/review_of_key_concepts.html
    
    # Indices into data and x, resp.
    i, j = 0, 0
    # Cumulative height/value of ECDF at first i values of data
    height = 0
    
    while i < n:
        # Check if x[j] falls here on the ECDF
        while j < len(x) and x[j] < data[i]:
            vals[j] = height / n
            j += 1
        
        # Increment the ECDF height based on data
        height += 1
        i += 1
    
    # Add remaining values of x that are all larger than data
    while j < len(x):
        vals[j] = 1
        j += 1
            
    return vals

## Confidence Intervals for Plug-in Estimate of Mean Time to Catastrophe via Dvoretzky-Kiefer-Wolfowitz Inequality (DKW)

The DKW inequality puts an upper bound on the maximum distance between the ECDF $\hat{F}(x)$ and the generative CDF $F(x)$. It states that, for any $\epsilon > 0$,

\begin{align} 
P\left(\mathrm{sup}_x \left|F(x) - \hat{F}(x)\right| > \epsilon\right) \le 2\mathrm{e}^{-2 n \epsilon^2}
\end{align}

where $n$ is the number of points in the data set. We could use this inequality to set up a bound for the confidence interval. To construct the bound on the $100 \times (1-\alpha)$ percent confidence interval, we specify that

\begin{align}
\alpha = 2\mathrm{e}^{-2 n \epsilon^2}
\end{align}
which gives
\begin{align}
\epsilon = \sqrt{\frac{1}{2n}\,\log \frac{2}{\alpha}}.
\end{align}
Then, the lower bound on the confidence interval is 

\begin{align} 
L(x) = \max\left(0, \hat{F}(x) - \epsilon\right),
\end{align}
and the upper bound is
\begin{align}
U(x) = \min\left(1, \hat{F}(x) + \epsilon\right).
\end{align} 

Note that this is not strictly speaking a confidence interval, but rather a set of bounds for where the confidence interval can lie (it's the DKW **in**equality after all).

In [9]:
def dkw_conf_int(data, alpha=0.25):
    n = len(data)
    epsilon = math.sqrt(1 / 2 / n * math.log(2 / alpha))
    
    # Generate x-values so they cover the spread of data
    x = np.sort(data)
    vals = ecdf(x, data)
    
    lower = np.maximum(np.zeros(n), vals - epsilon)
    upper = np.minimum(np.ones(n), vals + epsilon)
    
    return x, lower, upper

In [10]:
# Get 95% confidence intervals
x_labeled, lower_labeled, upper_labeled = dkw_conf_int(labeled)
x_unlabeled, lower_unlabeled, upper_unlabeled = dkw_conf_int(unlabeled)

In [12]:
p = bokeh.plotting.figure(height=300, 
                          width=400)

p.circle(np.sort(labeled), np.arange(1, len(labeled) + 1) / len(labeled), color=palette[0])
p.patch(x=np.concatenate((x_labeled, x_labeled[::-1])), 
        y=np.concatenate((lower_labeled, upper_labeled[::-1])), 
        legend_label='labeled',
        alpha=0.3,
        color=palette[0])

p.circle(np.sort(unlabeled), np.arange(1, len(unlabeled) + 1) / len(unlabeled), color=palette[-1])
p.patch(x=np.concatenate((x_unlabeled, x_unlabeled[::-1])), 
        y=np.concatenate((lower_unlabeled, upper_unlabeled[::-1])),
        legend_label='unlabeled',
        alpha=0.3,
        color=palette[-1])

p.legend.location = 'bottom_right'
p.xaxis[0].axis_label = 'time to catastrophe (s)'
p.yaxis[0].axis_label = 'ECDF'

bokeh.io.show(p)

This is similar to the result we got in part a) when plotting the confidence intervals of ECDFs via bootstrapping. Since the original ECDFs are the same, again we see that they are close to each other, but moreover, the shaded region shows a lot of overlap both here and in the ECDF in part a).

## Computing Environment

In [13]:
%load_ext watermark
%watermark -v -p numpy,pandas,numba,scipy,bokeh,bokeh_catplot -m

CPython 3.7.4
IPython 7.9.0

numpy 1.17.3
pandas 0.25.2
numba 0.46.0
scipy 1.3.1
bokeh 1.4.0
bokeh_catplot 0.1.6

compiler   : Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 19.0.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit
