# Tutorial 5: exercise

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).

This document was prepared at [Caltech](http://www.caltech.edu) with financial support from the [Donna and Benjamin M. Rosen Bioengineering Center](http://rosen.caltech.edu).

<img src="caltech_rosen.png">

*This tutorial exercise was generated from a Jupyter notebook. You can download the notebook [here](t5_exercise.ipynb). Use this downloaded Jupyter notebook to fill out your responses.*

### Exercise 1

What is the plug-in principle and how is it used in non-parametric statistics?

The plug-in principle is a method of estimating a statistical functional of the generative distribution by computing that same functional on the empirical distribution of the sample. In non-parametric statistics, it is used to estimate quantities such as the mean, variance, or median directly from the empirical distribution, without assuming a parametric model for the generative distribution.
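
As a small illustration of the answer above, the following sketch computes a few plug-in estimates with NumPy, using the waiting times listed in Exercise 3 of this tutorial. The variable names are chosen here for illustration only. Note that the plug-in estimate of the variance divides by n rather than n - 1, since it is the variance of the empirical distribution itself.

```python
import numpy as np

# Waiting times in minutes, copied from Exercise 3 of this tutorial
t = np.array([3, 6, 11, 5, 5, 4, 73, 31, 7, 6,
              30, 4, 32, 30, 5, 53, 2, 15, 18,
              14, 3, 49, 7, 4, 4, 2, 9, 11, 8,
              5, 14, 6, 32, 40, 3, 5, 24])

# Plug-in estimate of the mean: the mean of the empirical distribution
mean_plugin = np.mean(t)

# Plug-in estimate of the variance: average squared deviation, dividing
# by n (ddof=0), because that is the variance of the empirical distribution
var_plugin = np.var(t, ddof=0)

# Plug-in estimate of the median: the median of the empirical distribution
median_plugin = np.median(t)

print(mean_plugin, var_plugin, median_plugin)
```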

### Exercise 2

What is a bootstrap sample and what is a bootstrap replicate?

A bootstrap sample is a new sample, the same size as the original data set, made by drawing data points at random from the data set with replacement. A bootstrap replicate is the value of a statistic of interest computed from a bootstrap sample rather than from the original data.
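
The two definitions above can be sketched in a few lines of NumPy. This is a minimal sketch, not the only way to do it: the helper names `draw_bs_sample` and `draw_bs_reps`, the seed, and the choice of 2000 replicates are assumptions made for illustration, and the data are the waiting times from Exercise 3.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed is arbitrary)
rng = np.random.default_rng(seed=3252)

# Waiting times in minutes, copied from Exercise 3 of this tutorial
t = np.array([3, 6, 11, 5, 5, 4, 73, 31, 7, 6,
              30, 4, 32, 30, 5, 53, 2, 15, 18,
              14, 3, 49, 7, 4, 4, 2, 9, 11, 8,
              5, 14, 6, 32, 40, 3, 5, 24])


def draw_bs_sample(data, rng):
    """Bootstrap sample: len(data) draws from data, with replacement."""
    return rng.choice(data, size=len(data), replace=True)


def draw_bs_reps(data, stat_fun, rng, size=1):
    """Bootstrap replicates: stat_fun evaluated on `size` bootstrap samples."""
    return np.array([stat_fun(draw_bs_sample(data, rng)) for _ in range(size)])


# One bootstrap sample and the corresponding replicate of the mean
bs_sample = draw_bs_sample(t, rng)
bs_rep = np.mean(bs_sample)

# Many replicates of the mean, summarized by a 95% percentile interval
bs_reps = draw_bs_reps(t, np.mean, rng, size=2000)
conf_int = np.percentile(bs_reps, [2.5, 97.5])

print(bs_rep, conf_int)
```

Each call to `draw_bs_sample` gives one bootstrap sample, and each entry of `bs_reps` is one bootstrap replicate of the mean.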

### Exercise 3

Consider the following data set for waiting times in minutes for nuclear localization events of the gene MSN-2 in yeast (these are real measurements done by Yihan Lin in Michael Elowitz's lab).

In [41]:
t = [3, 6, 11, 5, 5, 4, 73, 31, 7, 6, 
     30, 4, 32, 30, 5, 53, 2, 15, 18, 
     14, 3, 49, 7, 4, 4, 2, 9, 11, 8, 
     5, 14, 6, 32, 40, 3, 5, 24]

You suspect that MSN-2 localization may be a Poisson process and therefore that these waiting times are Exponentially distributed with a characteristic wait time of 20 minutes. Do a quick graphical analysis to check out this idea.

In [31]:
import numpy as np
import pandas as pd
import scipy.special
import scipy.stats as st

import pystan

import bebi103

import altair as alt
import altair_catplot as altcat

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

First, let's draw a sample from an Exponential distribution that is the same size as our data set. We'll set the scale parameter to 1 to start.

In [42]:
t_gen = np.random.exponential(scale=1, size=len(t))

Let's plot the ECDFs of both t and t_gen on the same graph.

In [44]:
p = bebi103.viz.ecdf(t, x_axis_label='Waiting Times', formal=True, line_width=2)
p = bebi103.viz.ecdf(t_gen, p=p, color='gray', formal=True, line_width=2)

bokeh.io.show(p)

Those don't look the same, but the mean of our data set (about 15.7 minutes) is nowhere near 1, so we would be surprised if they did. For an Exponential distribution, the scale parameter is the mean, so let's try again, using the mean of our data set as the parameter:

In [46]:
mean = np.mean(t)
t_gen = np.random.exponential(scale=mean, size=len(t))
p = bebi103.viz.ecdf(t, x_axis_label='Waiting Times', formal=True, line_width=2)
p = bebi103.viz.ecdf(t_gen, p=p, color='gray', formal=True, line_width=2)

bokeh.io.show(p)

Those look similar, but a single simulated sample is not conclusive. Let's overlay the ECDFs of many samples drawn from the Exponential distribution and see whether the ECDF of the measured data looks like them:

In [47]:
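# Start the plot with the simulated sample from the previous cell
p = bebi103.viz.ecdf(t_gen, x_axis_label='Waiting Times', color='gray', alpha=0.1, formal=True)

# Overlay the ECDFs of 100 more samples drawn from the Exponential distribution
for _ in range(100):
    t_gen = np.random.exponential(scale=mean, size=len(t))
    p = bebi103.viz.ecdf(t_gen, p=p, color='gray', alpha=0.1, formal=True)

# Plot the measured waiting times on top
p = bebi103.viz.ecdf(t, p=p, formal=True, line_width=2)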

bokeh.io.show(p)

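The ECDF of the measured waiting times looks like the ECDFs of the simulated samples, so the waiting times look Exponentially distributed to me!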
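
As a numerical complement to the plots, the sketch below computes the ECDF of the measured waiting times directly with NumPy and compares it with the theoretical Exponential CDF, F(t) = 1 - exp(-t/tau), for two values of tau: the sample mean used above and the 20 minutes proposed in the problem statement. The helper names `ecdf_vals`, `expon_cdf`, and `max_ecdf_gap` are hypothetical and are not part of `bebi103`. The largest vertical gap is only an informal summary of what the plots show, not a formal test.

```python
import numpy as np

# Waiting times in minutes, copied from the data cell above
t = np.array([3, 6, 11, 5, 5, 4, 73, 31, 7, 6,
              30, 4, 32, 30, 5, 53, 2, 15, 18,
              14, 3, 49, 7, 4, 4, 2, 9, 11, 8,
              5, 14, 6, 32, 40, 3, 5, 24])


def ecdf_vals(data):
    """Return the sorted data and the ECDF height at each sorted point."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


def expon_cdf(x, tau):
    """CDF of an Exponential distribution with mean (scale) tau."""
    return 1.0 - np.exp(-x / tau)


def max_ecdf_gap(data, tau):
    """Largest vertical gap between the ECDF and the Exponential CDF."""
    x, y = ecdf_vals(data)
    cdf = expon_cdf(x, tau)
    # The ECDF is a step function, so check the gap at the top and at the
    # bottom of each step
    gap_top = np.max(np.abs(y - cdf))
    gap_bottom = np.max(np.abs(y - 1 / len(x) - cdf))
    return max(gap_top, gap_bottom)


# Compare the plug-in scale (sample mean) with the proposed 20 minutes
for tau in (np.mean(t), 20.0):
    print(f"tau = {tau:.1f} min: largest ECDF gap = {max_ecdf_gap(t, tau):.3f}")
```

A smaller gap means the theoretical CDF tracks the ECDF more closely. With only 37 measurements, gaps of this kind vary from sample to sample, which is why the overlaid simulated ECDFs above are the more informative check.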