# Tutorial 5: exercise

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).

This document was prepared at [Caltech](http://www.caltech.edu) with financial support from the [Donna and Benjamin M. Rosen Bioengineering Center](http://rosen.caltech.edu).

<img src="../data/caltech_rosen.png">

*This tutorial exercise was generated from a Jupyter notebook. You can download the notebook [here](t5_exercise.ipynb). Use this downloaded Jupyter notebook to fill out your responses.*

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as st
import numba

import bebi103

import altair as alt

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

### Exercise 1

What is the plug-in principle and how is it used in non-parametric statistics?

Suppose we do not know the probability distribution that characterizes the system we are experimenting on, but we do have data. From the data we can construct an empirical distribution and compute summary statistics from it in place of the unknown true distribution. These plug-in estimates can then serve as a basis of comparison between experiments.
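As a minimal sketch of the idea, the snippet below computes plug-in estimates of a mean, median, and variance. The data here are synthetic draws from a known Exponential distribution (an assumption made purely for illustration, not part of the exercise), so each plug-in estimate can be compared with the true value it approximates.

```python
import numpy as np

# Assumption: synthetic "measurements" drawn from a known distribution,
# so the plug-in estimates can be checked against the true values.
rng = np.random.default_rng(seed=3252)
true_mean = 20.0
data = rng.exponential(true_mean, size=1000)

# Plug-in estimates: evaluate each statistic on the empirical distribution.
plug_in_mean = np.mean(data)
plug_in_median = np.median(data)
plug_in_var = np.var(data)  # ddof=0, the variance of the empirical distribution

# For an Exponential distribution with mean tau, the median is tau * ln(2)
# and the variance is tau**2.
print("mean:    ", plug_in_mean, "vs true", true_mean)
print("median:  ", plug_in_median, "vs true", true_mean * np.log(2))
print("variance:", plug_in_var, "vs true", true_mean**2)
```

In a real experiment the "true" column is unavailable, which is exactly why the plug-in estimates are used in its place.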

### Exercise 2

What is a bootstrap sample and what is a bootstrap replicate?

A bootstrap sample is a collection of $N$ points drawn uniformly at random, with replacement, from a data set of length $N$.

A bootstrap replicate, as far as I can infer from your tutorial, is a summary statistic, such as a mean or a median, computed from a bootstrap sample. However, I can't find this usage of the word anywhere else, so I'm not sure. Wikipedia calls this a bootstrap estimate.
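The distinction between the two can be sketched in a few lines. The helper names `bootstrap_sample` and `bootstrap_replicates` are hypothetical, chosen for this illustration, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(seed=1)


def bootstrap_sample(data, rng):
    """Resample len(data) points from data, uniformly and with replacement."""
    return rng.choice(data, size=len(data), replace=True)


def bootstrap_replicates(data, func, rng, size=1):
    """Apply the summary statistic func to `size` independent bootstrap samples."""
    return np.array([func(bootstrap_sample(data, rng)) for _ in range(size)])


# Assumption: synthetic data, used only to exercise the helpers.
data = rng.exponential(20.0, size=50)

bs_sample = bootstrap_sample(data, rng)  # an array with the same length as data
bs_reps_mean = bootstrap_replicates(data, np.mean, rng, size=1000)  # one number per sample

print(bs_sample.shape, bs_reps_mean.shape)
```

Each bootstrap sample is a whole resampled data set, whereas each replicate is a single number computed from one such sample.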

### Exercise 3

Consider the following data set for waiting times in minutes for nuclear localization events of the gene MSN-2 in yeast (these are real measurements done by Yihan Lin in Michael Elowitz's lab).

In [2]:
t = np.array([3, 6, 11, 5, 5, 4, 73, 31, 7, 6, 
             30, 4, 32, 30, 5, 53, 2, 15, 18, 
             14, 3, 49, 7, 4, 4, 2, 9, 11, 8, 
             5, 14, 6, 32, 40, 3, 5, 24])

You suspect that MSN-2 localization may be a Poisson process and therefore that these waiting times are Exponentially distributed with a characteristic wait time of 20 minutes. Do a quick graphical analysis to check out this idea.

Suppose the wait time is truly Exponentially distributed with a characteristic wait time of 20 minutes. Then we can draw samples from a model with that parameter, and the ECDF of those samples should fall within the spread of ECDFs of bootstrap samples drawn from the empirical data.

In [3]:
@numba.jit(nopython=True)
def draw_bs_sample(data):
    """
    Draw a bootstrap sample from a 1D data set.
    """
    return np.random.choice(data, size=len(data))

In [9]:
# Draw 1000 waiting times from an Exponential model with a mean of 20 minutes
model = np.random.exponential(20, 1000)

p = bebi103.viz.ecdf(t, x_axis_label='waiting time (min)', color='gray', alpha=0.1, formal=True)
 
for _ in range(100):
    sample = draw_bs_sample(t)
    p = bebi103.viz.ecdf(sample, p=p, color='gray', alpha=0.1, formal=True)
    
p = bebi103.viz.ecdf(model, p=p, formal=True, line_width=2)

bokeh.io.show(p)

It seems plausible that this model could be the true distribution for the data, given the substantial overlap between the model ECDF and the ECDFs of the bootstrap samples. That said, the data ECDF takes off at a slightly later time than the model, and the data contain many more short waiting times than the model predicts. It is likely we could find a better model.
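One way to make this graphical impression a little more quantitative, offered as a sketch rather than as part of the exercise, is to bootstrap the mean waiting time. For an Exponential distribution the characteristic time equals the mean, so the plug-in estimate of the mean can be compared directly with the hypothesized 20 minutes. The seed, the number of replicates, and the choice of a 95% percentile interval are all assumptions of this sketch.

```python
import numpy as np

# Waiting times (minutes) from the exercise
t = np.array([3, 6, 11, 5, 5, 4, 73, 31, 7, 6,
              30, 4, 32, 30, 5, 53, 2, 15, 18,
              14, 3, 49, 7, 4, 4, 2, 9, 11, 8,
              5, 14, 6, 32, 40, 3, 5, 24])

rng = np.random.default_rng(seed=0)
n_reps = 10000  # assumption: number of bootstrap replicates

# One bootstrap replicate of the mean per resampled data set
bs_reps = np.array(
    [np.mean(rng.choice(t, size=len(t), replace=True)) for _ in range(n_reps)]
)

# 95% percentile confidence interval for the mean waiting time
ci_low, ci_high = np.percentile(bs_reps, [2.5, 97.5])

print("plug-in estimate of the mean (min):", np.mean(t))
print("95% bootstrap confidence interval (min):", ci_low, ci_high)
```

Whether the hypothesized 20 minutes lies inside the printed interval gives a rough check on the parameter value, although it says nothing about whether the Exponential shape itself is appropriate.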