## Hypothesis Testing Tutorial

This notebook outlines the process we used for hypothesis testing in the microtubule catastrophe experiment. Specifically, we compare the time to catastrophe measured with fluorescently labeled and unlabeled tubulin.

In [9]:
%load_ext autoreload
%autoreload 2
import os, sys
# Aliasing packages avoids typing the full name on every call
import pandas as pd
import holoviews as hv
import bokeh.io
import scipy
import numpy as np
import scipy.stats as st
import MCAT_pkg as mc 

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
mc?

Type:        module
String form: <module 'MCAT_pkg' from '/Users/joeyta/Downloads/BEbi103/MCAT/MCAT_pkg/MCAT_pkg/__init__.py'>
File:        ~/Downloads/BEbi103/MCAT/MCAT_pkg/MCAT_pkg/__init__.py
Docstring:
The contents of this package allow users to reproduce our microtubule catastrophe analysis. It includes:
Our methods for parsing the raw data
Methods for performing exploratory analysis
Statistical analysis modules (bootstrapping, hypothesis testing, MLE analysis, and model assessment)


In [3]:
# Path to the tidy data file
data = "data/gardner_time_to_catastrophe_dic_tidy.csv"

# Read the file into a DataFrame, skipping comment lines that start with "#"
df = pd.read_csv(data, comment="#")

In [4]:
# parse out the labeled and not_labeled data using our function
labeled, not_labeled = mc.separate_categories(df)

We can use mc.categorical_plot to plot ECDFs of the two samples with confidence intervals, so that we can see how much the two datasets overlap.

In [5]:
# plot the ECDFs for each of them with confidence intervals generated by bootstrapping
p = mc.categorical_plot(df, "time to catastrophe (s)", "labeled", conf_int = True, palette = ["green", "gray"])
bokeh.io.show(p)

For more quantitative metrics, we can perform a bootstrap hypothesis test and compare the means and confidence intervals of the two datasets.

In [10]:
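# Draw 10,000 bootstrap replicates of the test statistic
bs_reps = mc.draw_bs_reps_test_stat(labeled, not_labeled, size=10000)

# Observed two-sample Kolmogorov-Smirnov statistic
stat = st.ks_2samp(labeled, not_labeled)[0]

# p-value: fraction of replicates at least as large as the observed statistic
p_val = np.sum(bs_reps >= stat) / len(bs_reps)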

print("p-value =", p_val)

p-value = 0.8331
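
The internals of `mc.draw_bs_reps_test_stat` are not shown in this notebook. As a minimal sketch of one common way to run this kind of test, the example below assumes the null hypothesis that both samples come from the same distribution. It pools the data, resamples two sets of the original sizes with replacement, and records the Kolmogorov-Smirnov statistic for each pair. The helper names `ks_stat` and `draw_bs_reps_ks_sketch` and the synthetic exponential data are illustrative assumptions, not part of `MCAT_pkg`.

```python
import numpy as np


def ks_stat(x, y):
    """Two-sample KS statistic: largest gap between the two ECDFs."""
    x = np.sort(x)
    y = np.sort(y)
    all_vals = np.concatenate((x, y))
    # Evaluate each ECDF at every observed value
    cdf_x = np.searchsorted(x, all_vals, side="right") / len(x)
    cdf_y = np.searchsorted(y, all_vals, side="right") / len(y)
    return np.max(np.abs(cdf_x - cdf_y))


def draw_bs_reps_ks_sketch(x, y, size=1000, rng=None):
    """Hypothetical helper: KS replicates under the null of identical distributions."""
    rng = np.random.default_rng() if rng is None else rng
    pooled = np.concatenate((x, y))
    reps = np.empty(size)
    for i in range(size):
        # Resample both groups from the pooled data, keeping the original sizes
        bs_x = rng.choice(pooled, size=len(x), replace=True)
        bs_y = rng.choice(pooled, size=len(y), replace=True)
        reps[i] = ks_stat(bs_x, bs_y)
    return reps


# Synthetic stand-in data (assumption): both groups share one distribution
rng = np.random.default_rng(seed=3252)
x = rng.exponential(scale=400.0, size=200)
y = rng.exponential(scale=400.0, size=100)

observed = ks_stat(x, y)
reps = draw_bs_reps_ks_sketch(x, y, size=1000, rng=rng)

# Fraction of replicates at least as large as the observed statistic
p_val = np.sum(reps >= observed) / len(reps)
print("p-value =", p_val)
```

A large p-value here means that a KS statistic at least as large as the observed one arises often when both samples are drawn from the same pooled distribution.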


In [13]:
# Generate bootstrap replicates of the mean for each data set,
# then compute the mean and 95% confidence interval of the replicates
bs_reps_mean_labeled = mc.draw_bs_reps_mean(labeled, size=10000)
bs_reps_mean_not_labeled = mc.draw_bs_reps_mean(not_labeled, size=10000)

mean_labeled = np.mean(bs_reps_mean_labeled)
mean_not_labeled = np.mean(bs_reps_mean_not_labeled)

conf_int_labeled = np.percentile(bs_reps_mean_labeled, [2.5, 97.5])
conf_int_not_labeled = np.percentile(bs_reps_mean_not_labeled, [2.5, 97.5])

print('Mean time to catastrophe with labeled tubulin is {} with 95% conf int : {}'.format(mean_labeled,
                                                                                         conf_int_labeled))
print('Mean time to catastrophe with nonlabeled tubulin is {} with 95% conf int: {}'.format(mean_not_labeled,
                                                                                           conf_int_not_labeled))

Mean time to catastrophe with labeled tubulin is 440.5893388625592 with 95% conf int : [401.94253555 481.61137441]
Mean time to catastrophe with nonlabeled tubulin is 412.61481052631575 with 95% conf int: [354.84210526 477.52631579]
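
For readers curious about what a function like `mc.draw_bs_reps_mean` might be doing, here is a minimal sketch of bootstrap replicates of the mean with a percentile confidence interval. It is an assumption about the general technique, not the package's actual implementation; the name `draw_bs_reps_mean_sketch` and the synthetic data are hypothetical.

```python
import numpy as np


def draw_bs_reps_mean_sketch(data, size=1000, rng=None):
    """Hypothetical helper: bootstrap replicates of the sample mean."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    # Each row of idx indexes one bootstrap sample drawn with replacement
    idx = rng.integers(0, len(data), size=(size, len(data)))
    return data[idx].mean(axis=1)


# Synthetic stand-in for measured times to catastrophe (assumption)
rng = np.random.default_rng(seed=0)
toy = rng.exponential(scale=400.0, size=150)

bs_means = draw_bs_reps_mean_sketch(toy, size=2000, rng=rng)

# Percentile 95% confidence interval for the mean
low, high = np.percentile(bs_means, [2.5, 97.5])
print("mean:", np.mean(toy), "95% conf int:", [low, high])
```

Overlapping confidence intervals for two groups, as in the output above, are consistent with similar means, though overlap alone is not a formal test.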


We use the DKW inequality to further compare the two samples and to visualize the bounds of the confidence intervals on their ECDFs.

In [6]:
p1 = mc.plot_conf_int(labeled, "Confidence Interval for Labeled", "Time to catastrophe (s)")
bokeh.io.show(p1)

In [7]:
p2 = mc.plot_conf_int(not_labeled, "Confidence Interval for Not Labeled", "Time to catastrophe (s)")
bokeh.io.show(p2)
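
The DKW (Dvoretzky-Kiefer-Wolfowitz) inequality states that, with probability at least 1 - alpha, the true CDF lies within epsilon = sqrt(ln(2 / alpha) / (2n)) of the ECDF at every point. The sketch below computes such a band directly with NumPy. It is a hedged illustration of the bound itself, not the code behind `mc.plot_conf_int`; the function name `dkw_bounds` and the synthetic data are assumptions.

```python
import numpy as np


def dkw_bounds(data, alpha=0.05):
    """Hypothetical helper: ECDF with a DKW confidence band at level 1 - alpha."""
    x = np.sort(np.asarray(data))
    n = len(x)
    ecdf = np.arange(1, n + 1) / n
    # Half-width of the band from the DKW inequality
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))
    # A CDF must stay between 0 and 1, so clip the band
    lower = np.clip(ecdf - eps, 0.0, 1.0)
    upper = np.clip(ecdf + eps, 0.0, 1.0)
    return x, lower, upper, eps


# Synthetic stand-in data (assumption)
rng = np.random.default_rng(seed=1)
toy = rng.exponential(scale=400.0, size=100)

x, lower, upper, eps = dkw_bounds(toy, alpha=0.05)
print("band half-width:", eps)
```

Because epsilon shrinks like 1 / sqrt(n), the smaller of the two samples will show the wider band.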

Call help() on any of these functions, for example help(mc.plot_conf_int), for more information on its arguments.