# Homework 7.1: Confidence intervals for microtubule catastrophe (35 pts)

**[Dataset download](https://s3.amazonaws.com/bebi103.caltech.edu/data/gardner_time_to_catastrophe_dic_tidy.csv)**

<hr />

In [None]:
import pandas as pd
import numpy as np
import hvplot
import holoviews as hv
import hvplot.pandas
import bokeh_catplot
import bokeh 
import bokeh.io
from bokeh.themes import Theme
from bokeh.io import output_file, save, output_notebook
import datashader as ds
from holoviews.operation.datashader import datashade
from bokeh.plotting import figure
import numba
from viz import pboc_style_bokeh
from bebi103 import viz


output_notebook()
hv.extension('bokeh')
#viz.set_plotting_style()

In [None]:
theme = Theme(json=pboc_style_bokeh())
bokeh.io.curdoc().theme = theme
hv.renderer('bokeh').theme = theme

Refresh yourself about the microtubule catastrophe data we explored in homeworks [3.3](http://bebi103.caltech.edu.s3-website-us-east-1.amazonaws.com/2019a/content/homework/hw3/hw3.3.html) and [6.2](http://bebi103.caltech.edu.s3-website-us-east-1.amazonaws.com/2019a/content/homework/hw6/hw6.2.html). We will again work with this data set here.

**a)** Remember that the confidence interval of the plug-in estimate of any statistical functional may be computed using bootstrapping. (This does not mean, however, that bootstrapping has great performance for any statistical functional; some have better behavior than others.) This includes the ECDF itself. Computing and plotting confidence intervals are implemented in the `bokeh_catplot.ecdf()` function. Plot the ECDFs of the catastrophe times for microtubules with labeled tubulin and for those with unlabeled tubulin including a confidence interval. In looking at the plot, do you think the two could be identically distributed?


In [None]:
df_mt = pd.read_csv('~/Downloads/gardner_time_to_catastrophe_dic_tidy.csv')

In [None]:
del(df_mt['Unnamed: 0'])

In [None]:
df_mt.head()

In [None]:
mt_lab = df_mt[df_mt['labeled']== True]
mt_non_lab = df_mt[df_mt['labeled']== False]

In [None]:
import bokeh_catplot

In [None]:
import colorcet as cc

In [None]:
e = bokeh_catplot.ecdf(data = df_mt, 
                  cats = 'labeled', 
                  val = 'time to catastrophe (s)',
                  conf_int = True, 
                  palette = cc.glasbey_dark[-2:])

In [None]:
bokeh.io.show(e)

**Answer**: The two ECDFs look very similar. Because the 95% confidence intervals overlap so much, the plot gives no indication that the catastrophe times for labeled and unlabeled tubulin are distributed differently.
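To make explicit what a bootstrap confidence interval on an ECDF involves, here is a minimal sketch written with plain NumPy. It is not how `bokeh_catplot` implements it; the function name `bootstrap_ecdf_band`, the synthetic exponential data, and the evaluation grid `x` are all assumptions made for illustration.

```python
import numpy as np


def bootstrap_ecdf_band(data, x, n_reps=1000, ptiles=(2.5, 97.5), seed=0):
    """Pointwise bootstrap confidence band for the ECDF of `data` at points `x`.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = np.empty((n_reps, len(x)))

    for i in range(n_reps):
        # Resample the data with replacement
        bs_sample = np.sort(rng.choice(data, size=n))
        # ECDF of the bootstrap sample evaluated at each x
        reps[i] = np.searchsorted(bs_sample, x, side="right") / n

    # Percentiles across bootstrap replicates, separately for each x
    low, high = np.percentile(reps, ptiles, axis=0)
    return low, high


# Synthetic stand-in for catastrophe times (assumption, not the real data)
rng = np.random.default_rng(1)
data = rng.exponential(scale=400.0, size=200)
x = np.linspace(0, data.max(), 50)

low, high = bootstrap_ecdf_band(data, x)
```

The band is computed point by point, so it describes the uncertainty of the ECDF at each `x` separately rather than for the whole curve at once.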

<hr>

**b)** Compute confidence intervals for the plug-in estimate for the mean time to catastrophe for each of the two conditions and comment on the result.



In [None]:
@numba.njit
def draw_bs_sample(data):
    """Draw a bootstrap sample from a 1D data set."""

    bs_sample = np.random.choice(data, size=len(data))
    
    return bs_sample 

In [None]:
@numba.njit
def draw_bs_rep_mean(data, n_reps):
    """
    Draw bootstrap replicates for the mean. 
    """
    means_bs_reps = np.empty(n_reps)
    
    for i in range(n_reps):
        #bs_sample = 
        means_bs_reps[i] = np.mean(draw_bs_sample(data))
        
    return means_bs_reps

Let's extract the values corresponding to the labeled and unlabeled tubulin. 

In [None]:
mt_lab_tcat = mt_lab['time to catastrophe (s)'].values
mt_non_lab_tcat = mt_non_lab['time to catastrophe (s)'].values

With both distributions in `np.arrays` we can now go ahead and compute bootstrap replicates for the mean. 

In [None]:
mt_lab_mean =  draw_bs_rep_mean(mt_lab_tcat, n_reps=int(1e4))
mt_non_lab_mean =  draw_bs_rep_mean(mt_non_lab_tcat, n_reps=int(1e4))

Going forward we can now get the 95% confidence intervals for the mean of both distributions.

In [None]:
mt_lab_mean_ci = np.percentile(mt_lab_mean, [2.5, 97.5])
mt_non_lab_mean_ci = np.percentile(mt_non_lab_mean, [2.5, 97.5])

print("""Mean time to catastrophe 95 % confidence interval [{0:.2f}, {1:.2f}]
for the labeled microtubules. \n""".format(*mt_lab_mean_ci))

print("""Mean time to catastrophe 95 % confidence interval [{0:.2f}, {1:.2f}]
for the unlabeled microtubules.""".format(*mt_non_lab_mean_ci))

The two confidence intervals for the mean time to catastrophe have similar upper bounds, but their lower bounds differ quite a bit.


**c)** Test the hypothesis the distribution of catastrophe times for microtubules with labeled tubulin is the same as that for unlabeled tubulin. Think carefully about a good test statistic and justify your choice.


**Answer**: Let's use a permutation NHST recipe.

1. State the null hypothesis (more or less already stated): The distributions of the labeled and unlabeled tubulin are the same. 

2. Define a test statistic: Difference of means squared, weighted by the product of the standard deviations. That is:

$$
\text{D} = \frac{ \left( \bar{t_1} - \bar{t_2} \right) ^2}{\sigma_{t_1} \sigma_{t_2}}
$$

We chose this statistic because it takes into account both the first and second moments of the distribution.

3. Rejection region : We'll choose an alpha of $1 \times 10^{-3}$. 

4. Simulate data acquisition: Okay ! 

But before we compute the permutation replicates of our test statistic, let's think about why this is a good measure of the difference.

We know that the mean and variance of the exponential distribution are just $1/\beta$ and $1/\beta^2$, respectively. We know the PDF is a monotonically decreasing function, so we are more interested in the right tail of the distribution. Under this assumption, a difference of the means would in essence be a good test statistic by itself. However, because we want to **test whether the distributions are different** and not just the means, we also include the spread of the data in the test statistic we use. 

We acknowledge that this test statistic *might* be biased, and that there may be a more formal way to characterize the difference between exponential distributions, but we'll go with this for now. 

In [None]:
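@numba.njit
def diff_means(x, y): 
    """Squared difference of means, scaled by the product of the standard deviations."""
    # Product (not difference) of the standard deviations, as in the definition of D
    diff_means = (np.mean(x) - np.mean(y))**2 / (np.std(x) * np.std(y))
    
    return diff_means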

Now let's concatenate and scramble both samples. 

In [None]:
@numba.njit
def draw_perm_sample(x, y):
    """Generate a permutation sample."""
    concat_data = np.concatenate((x, y))
    np.random.shuffle(concat_data)

    return concat_data[:len(x)], concat_data[len(x):]

In [None]:
def draw_perm_reps(x, y, stat_fun, size=1):
    """Generate array of permutation replicates."""
    return np.array([stat_fun(*draw_perm_sample(x, y)) for _ in range(size)])

In [None]:
@numba.njit
def draw_perm_reps_diff_mean(x, y, size=1):
    """Generate array of permutation replicates of the scaled squared difference of means."""
    out = np.empty(size)
    for i in range(size):
        x_perm, y_perm = draw_perm_sample(x, y)
        out[i] = diff_means(x_perm, y_perm)

    return out

Because we want to resolve p-values down to $1 \times 10^{-3}$, we will draw $1 \times 10^4$ permutation replicates. 

In [None]:
perm_reps = draw_perm_reps_diff_mean(mt_lab_tcat, mt_non_lab_tcat, size=int(1e4))

In [None]:
diff_mean = diff_means(mt_lab_tcat, mt_non_lab_tcat)

In [None]:
pval = np.sum(perm_reps >= diff_mean) / len(perm_reps)

In [None]:
pval

The p-value is above our threshold, so we fail to reject the null hypothesis: with a test statistic that takes into account both the first and second moments, we find no evidence of a difference between the distributions. 
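As a sanity check of the permutation recipe, it can help to run it on synthetic data where the answer is known. The sketch below uses plain NumPy (no numba) with a seeded generator; the function names, the sample sizes, and the exponential scales are assumptions chosen for illustration, not values taken from the experiment.

```python
import numpy as np


def scaled_sq_diff_means(x, y):
    """Squared difference of means divided by the product of the standard deviations."""
    return (np.mean(x) - np.mean(y)) ** 2 / (np.std(x) * np.std(y))


def permutation_p_value(x, y, stat_fun, n_reps=2000, seed=0):
    """Permutation p-value for `stat_fun` being at least as large as observed."""
    rng = np.random.default_rng(seed)
    observed = stat_fun(x, y)
    pooled = np.concatenate((x, y))
    reps = np.empty(n_reps)

    for i in range(n_reps):
        # Scramble the pooled data and split it back into two groups
        perm = rng.permutation(pooled)
        reps[i] = stat_fun(perm[: len(x)], perm[len(x):])

    return np.sum(reps >= observed) / n_reps


rng = np.random.default_rng(42)
same_a = rng.exponential(scale=400.0, size=150)
same_b = rng.exponential(scale=400.0, size=150)
different = rng.exponential(scale=800.0, size=150)

# Two samples from the same distribution versus two from different ones
p_same = permutation_p_value(same_a, same_b, scaled_sq_diff_means)
p_diff = permutation_p_value(same_a, different, scaled_sq_diff_means)
print(p_same, p_diff)
```

With a twofold difference in the mean, the second p-value should be far smaller than the first, which is the behavior we want from a test statistic.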

<hr>


**d)** In part (b), you used bootstrapping to compute a confidence interval for the plug-in estimate for the mean time to catastrophe. As is often (though definitely not always) the case, we could use a theoretical result to construct a confidence interval. The central limit theorem states that the mean, which is the sum of many processes, should be approximately Normally distributed. We will not derive it here, but the mean and variance of that Normal distribution are approximately

\begin{align}
&\mu = \bar{x},\\[1em]
&\sigma^2 = \frac{1}{n(n-1)}\sum_{i=1}^n (x_i - \bar{x})^2,
\end{align}

where $\bar{x}$ is the arithmetic mean of the data points. To compute a confidence interval of the mean, then, you can compute the interval over which 95% of the probability mass of the above described Normal distribution lies. 

Compute this approximate confidence interval and compare it to the result you got in part (b). *Hint*: You can use the `scipy.stats` package to conveniently get intervals for named distributions.



In [None]:
import scipy.stats

In [None]:
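# bayes_mvs defaults to alpha=0.9 (a 90% interval), so request 95% explicitly
mean_lab_ci, _, _ = scipy.stats.bayes_mvs(mt_lab_tcat, alpha=0.95)
mean_unlab_ci, _, _ = scipy.stats.bayes_mvs(mt_non_lab_tcat, alpha=0.95)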

In [None]:
print("""Mean time to catastrophe 95 % confidence interval [{0:.2f}, {1:.2f}]
for the labeled microtubules using scipy. \n""".format(*list(mean_lab_ci)[1]))

print("""Mean time to catastrophe 95 % confidence interval [{0:.2f}, {1:.2f}]
for the unlabeled microtubules using scipy.""".format(*list(mean_unlab_ci)[1]))

In [None]:
print("""Mean time to catastrophe 95 % confidence interval [{0:.2f}, {1:.2f}]
for the labeled microtubules using bootstrap. \n""".format(*mt_lab_mean_ci))

print("""Mean time to catastrophe 95 % confidence interval [{0:.2f}, {1:.2f}]
for the unlabeled microtubules using bootstrap.""".format(*mt_non_lab_mean_ci))

We can see that the results are fairly similar. In our run the bootstrap confidence intervals were broader than the ones calculated with scipy. When comparing widths, keep in mind that `scipy.stats.bayes_mvs` returns a 90% interval unless `alpha=0.95` is passed explicitly. According to the docs of `scipy.stats.bayes_mvs`, there is quite a bit of theory behind the calculation and implementation of these intervals: the referenced paper shows that the mean is $\text{t}$ distributed, and the standard deviation and variance are $\text{Gamma}$ distributed. 
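The Normal approximation described in the problem statement can also be computed directly with `scipy.stats.norm.interval`. The following is a minimal sketch; the function name `normal_approx_conf_int` and the synthetic data are assumptions for illustration. Note that $\sigma$ here is the standard error of the mean, that is, the sample standard deviation (with $n-1$ in the denominator) divided by $\sqrt{n}$.

```python
import numpy as np
import scipy.stats


def normal_approx_conf_int(data, conf=0.95):
    """Confidence interval for the mean from the Normal (CLT) approximation."""
    data = np.asarray(data, dtype=float)
    n = len(data)

    mu = np.mean(data)
    # sigma^2 = sum((x_i - x_bar)^2) / (n * (n - 1))
    sigma = np.std(data, ddof=1) / np.sqrt(n)

    # Central interval containing `conf` of the probability mass
    return scipy.stats.norm.interval(conf, loc=mu, scale=sigma)


# Synthetic stand-in for catastrophe times (assumption, not the real data)
rng = np.random.default_rng(0)
fake_times = rng.exponential(scale=400.0, size=200)

low, high = normal_approx_conf_int(fake_times)
print("[{0:.2f}, {1:.2f}]".format(low, high))
```

Applied to `mt_lab_tcat` and `mt_non_lab_tcat`, this would give intervals that can be compared one to one with the bootstrap intervals from part (b).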

<hr>

**e)** Write a function with call signature `ecdf(x, data)`, which computes the value of the ECDF built from the one-dimensional array `data` at arbitrary points `x`. That is, `x` can be an array. Writing this function also helps cement in your mind what an ECDF is and will be useful in part (f).



**Answer** Sorry, but the `x` argument seems a little ambiguous. If I understood correctly, we want to extract the values of a given distribution at specified percentiles. Thus, instead of `x`, I will call the argument `percentiles`. By default it will return the 2nd to the 98th percentile in linearly spaced intervals.

In [None]:
def ecdf(data, percentiles = np.linspace(2, 98, 21)):
    
    """
    Generate an ECDF along specified percentiles. 
    
    Parameters
    ------------
    data (array-like): data set to make an ECDF from. 
    percentiles (array-like): percentiles (0 to 100) at which to evaluate.
    
    Returns 
    ---------
    
    sorted_sel_data : values of the sorted data set at the specified percentiles.
    ecdf_vals : the percentiles expressed as fractions (the ECDF values).
    """
    
    percentiles = np.asarray(percentiles)
    
    sorted_data = np.sort(data)
    
    n = len(data)
    
    # Get the indices for the specified percentiles, capped at the last index
    ixs = [min(int(pcnt/100*n), n - 1) for pcnt in percentiles]
    
    # Extract the values for the specified indices
    sorted_sel_data = sorted_data[ixs]
    
    return sorted_sel_data, percentiles/100
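The problem statement can also be read literally: given points `x`, return the fraction of data points less than or equal to each of them. As a complement to the percentile-based function above, here is a minimal sketch of that reading. The name `ecdf_at` is a hypothetical choice to avoid clashing with the `ecdf` function already defined, and the toy data are made up.

```python
import numpy as np


def ecdf_at(x, data):
    """Value of the ECDF built from 1D `data`, evaluated at arbitrary points `x`."""
    x = np.asarray(x)
    sorted_data = np.sort(data)

    # Number of data points <= x, divided by the total number of points
    return np.searchsorted(sorted_data, x, side="right") / len(sorted_data)


toy_data = np.array([3.0, 1.0, 2.0, 2.0])
ecdf_vals = ecdf_at([0.5, 1.0, 2.0, 2.5, 10.0], toy_data)
print(ecdf_vals)  # values: 0, 0.25, 0.75, 0.75, 1
```

Using `side="right"` makes the ECDF right-continuous, so a point equal to a data value counts that value.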

In [None]:
sorted_mt_lab_mean, ecdf_mt_lab_mean = ecdf(mt_lab_mean)

In [None]:
sorted_mt_unlab_mean, ecdf_un_mt_lab_mean = ecdf(mt_non_lab_mean)

In [None]:
labeled = hv.Curve((sorted_mt_lab_mean,
                    ecdf_mt_lab_mean), label = 'labeled').opts(xlabel= 'mean time to catastrophe',
                                                      ylabel = 'ECDF')

unlabeled = hv.Curve((sorted_mt_unlab_mean,
                    ecdf_un_mt_lab_mean), label = 'unlabeled').opts(xlabel= 'mean time to catastrophe',
                                                      ylabel = 'ECDF')

In [None]:
(labeled*unlabeled).opts(padding = 0.1,
                         width = 600,
                         show_grid = True,
                         legend_position = 'left')

Although the two distributions are potentially similar, the bootstrap replicates of the mean take lower values for the unlabeled tubulin, though the two ECDFs converge at the upper end of the distribution. 

<hr>

**f)** In part (a), you used bootstrapping to compute a confidence interval on the ECDF. As is often (though definitely not always) the case, we could use a theoretical result to construct a confidence interval. We could alternatively use the **Dvoretzky-Kiefer-Wolfowitz Inequality** (DKW) to compute confidence intervals for an ECDF. The DKW inequality puts an upper bound on the maximum distance between the ECDF $\hat{F}(x)$ and the generative CDF $F(x)$. It states that, for any $\epsilon > 0$,

\begin{align}
P\left(\mathrm{sup}_x \left|F(x) - \hat{F}(x)\right| > \epsilon\right) \le 2\mathrm{e}^{-2 n \epsilon^2},
\end{align}

where $n$ is the number of points in the data set. We could use this inequality to set up a bound for the confidence interval. To construct the bound on the $100 \times (1-\alpha)$ percent confidence interval, we specify that

\begin{align}
\alpha = 2\mathrm{e}^{-2 n \epsilon^2},
\end{align}

which gives

\begin{align}
\epsilon = \sqrt{\frac{1}{2n}\,\log \frac{2}{\alpha}}.
\end{align}

Then, the lower bound on the confidence interval is

\begin{align}
L(x) = \max\left(0, \hat{F}(x) - \epsilon\right),
\end{align}

and the upper bound is 

\begin{align}
U(x) = \min\left(1, \hat{F}(x) + \epsilon\right).
\end{align}

Note that this is not strictly speaking a confidence interval, but rather a set of bounds for where the confidence interval can lie (it's the DKW **in**equality after all).

Plot the upper and lower bounds for the 95% confidence interval as computed from the DKW inequality for the  microtubule catastrophe data and comment on what you see.
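To get a feel for the size of $\epsilon$ in the expression above, it can be evaluated for a few sample sizes. This is only an illustrative sketch; the function name `dkw_epsilon` and the chosen values of $n$ are assumptions.

```python
import numpy as np


def dkw_epsilon(n, alpha=0.05):
    """Half-width of the DKW band for n data points at level 1 - alpha."""
    return np.sqrt(np.log(2.0 / alpha) / (2.0 * n))


# The half-width depends only on n and alpha, not on x or on the data values
for n in (50, 100, 200, 1000):
    print(n, dkw_epsilon(n))
```

For $n = 100$ and $\alpha = 0.05$ this gives $\epsilon \approx 0.136$, and because $\epsilon \propto 1/\sqrt{n}$, quadrupling the number of data points halves the width of the band.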

**Answer** : All right, let's code a function to get the DKW confidence interval bounds. 

In [None]:
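def dkw_ecdf_conf_int(data, alpha = 0.05):
    """Lower and upper DKW bounds of the ECDF, evaluated at the sorted data points."""
    
    n = len(data)
    
    # ECDF values at the sorted data points: 1/n, 2/n, ..., 1
    ecdf = np.arange(1, n + 1) / n
    
    eps = np.sqrt( 1/(2*n) * np.log((2/alpha)))
    
    # Clip the bounds to [0, 1], as in the definitions of L(x) and U(x)
    dkw_low_bound = np.maximum(0, ecdf - eps)
    
    dkw_upper_bound = np.minimum(1, ecdf + eps)
    
    return dkw_low_bound, dkw_upper_bound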
    

mt_lab_tcat = mt_lab['time to catastrophe (s)'].values
mt_non_lab_tcat = mt_non_lab['time to catastrophe (s)'].values

We can now compute the DKW bounds for both distributions

In [None]:
dkw_low_lab, dkw_upper_lab = dkw_ecdf_conf_int(mt_lab_tcat)
dkw_low_unlab, dkw_upper_unlab = dkw_ecdf_conf_int(mt_non_lab_tcat)

Let's make a sorted version of the arrays to make plots using the `bebi103.viz.fill_between` function. (Sorry for all the different variables!)

In [None]:
# Sort the time to catastrophe values for the labeled microtubules
sorted_mt_lab_tcat = np.sort(mt_lab_tcat)

# Sort the t_cat for the non-labeled microtubules
sorted_mt_non_lab_tcat = np.sort(mt_non_lab_tcat)

Awesome! Now we can go ahead and overlay the bootstrap ECDF confidence intervals from `bokeh_catplot` and the theoretical bounds from the DKW calculation. 

Let's start with the labeled ones. 

In [None]:
l = bokeh_catplot.ecdf(data = mt_lab, 
                  cats = 'labeled', 
                  val = 'time to catastrophe (s)',
                  conf_int = True, 
                  palette = cc.glasbey_dark[-1:],
                  )

In [None]:
l = viz.fill_between(sorted_mt_lab_tcat,
                      dkw_low_lab,
                      sorted_mt_lab_tcat,
                      dkw_upper_lab,
                      p = l,
                      fill_color = cc.glasbey_dark[-1],
                      line_color = cc.glasbey_dark[-1],
                      fill_alpha= 0.2)

In [None]:
# show plot for labeled microtubules
bokeh.io.show(l)

Pretty cool! This is freaking art!! Haha ok back to science... We can see that, for the most part, the DKW bounds enclose the bootstrap confidence intervals, acting as an outer bound on them. 

In [None]:
un = bokeh_catplot.ecdf(data = mt_non_lab, 
                  cats = 'labeled', 
                  val = 'time to catastrophe (s)',
                  conf_int = True, 
                  palette = cc.glasbey_dark[-2:],
                  #p = un 
                       )

In [None]:
un = viz.fill_between(sorted_mt_non_lab_tcat,
                      dkw_low_unlab,
                      sorted_mt_non_lab_tcat,
                      dkw_upper_unlab,
                      p = un,
                      fill_color = cc.glasbey_dark[-2],
                      line_color = cc.glasbey_dark[-2],
                      fill_alpha= 0.3)

In [None]:
# show plot for unlabeled microtubules
bokeh.io.show(un)

Another thing we note is that in the upper tail of the distribution (towards higher values) the DKW bounds seem to be broader, but this might just be a visual effect of there being fewer points from which to sample: the vertical half-width $\epsilon$ is the same at every $x$. These results give some intuition about the construction of frequentist confidence intervals: if the experiment is repeated over and over again, the 95% confidence interval will contain the true value of the parameter (in this case, the CDF itself) in 95% of the experiments. Thus, if there are very few points in a given region of the ECDF, we have more uncertainty about where the CDF lies in that region. 

<hr />

**Attributions**: All of the members of the group contributed to the discussion and solved the individual exercises. The final notebook was edited by Manu. 