In [1]:
import random
import numpy as np

In [2]:
from bokeh.io import show
from bokeh.plotting import figure
from bokeh.io import output_notebook
from bokeh.layouts import gridplot

In [3]:
output_notebook()

## Intro to Bootstrapping, or Resampling
### Goals:
* Understand the difference between a population and a sample
* Understand how to resample with Python
* Understand how to quantify the differences between two samples
* Understand the null hypothesis
* Understand a p value
* Understand how to tell the difference between noise (randomness) and signal

Often, in big data, we are asked whether two samples really differ. Was there really more rain in 2018 than in 2019? Was one of those years hotter than the other? Did more people ride bikes in Boston than in Seattle? 

There is a mathematical way to answer these questions. 

### Experiment
Imagine a small town, Willow. There is a single school with two classrooms. Class A has all the students with freckles; Class B has the students with no freckles. 

Answer the following:
Do you expect a difference in height between the students in the two classrooms? Why? 

#### Answer
No, you would not expect any difference in height, because freckles do not determine height.

A scientist believes students with freckles are taller than students without freckles. Let's determine if he is right.



In [4]:
def get_classroom_willow_a():
    return [146.1,152.7, 146.3, 142.0, 151.8,
     151.4, 145.1, 153.5, 151.2, 143.5, 158.2,
     150.6, 143.5, 151.6, 149.9, 154.2, 142.4,
     154.6, 154.1, 152.8, 152.4, 155.9, 152.9,
     149.9, 145.0]


In [5]:
def get_classroom_willow_b():
    return [142.9, 150.9, 154.0, 146.4, 142.4,
     148.7, 151.4, 154.5, 142.9, 142.5, 152.5,
     151.4, 156.7, 153.9, 148.5, 147.6, 161.9,
     147.6, 145.1, 143.3, 149.5, 147.3,
     148.7, 150.4]

The functions above return the heights (in cm) of the students in each classroom. Let's graph the two classrooms.

In [11]:
WILLOW_A = get_classroom_willow_a()
WILLOW_B = get_classroom_willow_b()


In [14]:
def make_bar(labels, nums, title = None, y_range = None, plot_width = 350, plot_height = 350):
    p = figure(title = title, plot_width = plot_width, plot_height = plot_height,
              y_range = y_range)
    p.vbar(x=labels, top=nums, width=0.9)
    p.xgrid.grid_line_color = None
    return p

P_WILLOW_A = make_bar(labels = [x for x in range(len(WILLOW_A))], nums = sorted(WILLOW_A), 
               title = 'Freckles', y_range = (120, 160))
P_WILLOW_B = make_bar(labels = [x for x in range(len(WILLOW_B))], nums = sorted(WILLOW_B), 
               title = "No Freckles", y_range = (120, 160))


grid = gridplot([ P_WILLOW_A, P_WILLOW_B,], ncols = 2)
show(grid)

Let's get the means

In [15]:

print(np.mean(WILLOW_A))
print(np.mean(WILLOW_B))


150.064
149.20833333333334


Class A has a mean of 150.064 cm, and Class B has a mean of 149.208 cm. That is a difference of 0.86 cm. 

Answer the following:

Do you still think there is no difference between the two classrooms in the Willow School? Why? 

#### Answer: 

Yes, maybe there is a difference. See the discussion below.

### Randomness
Maybe the difference in height between the two classrooms is just due to randomness. How can we know? 

Imagine an experiment. We go to a big city and find 1,000 students who are 12 years old and have freckles. We randomly choose 25 (the size of Class A).

In the same big city, we find 1,000 students who are 12 years old and have no freckles. We randomly choose 24 (the size of Class B). 

We calculate the mean of each of these groups. 

Answer the following: Do you expect the same result as the means above (150.064 and 149.208)? Why? 

Do you expect the mean from either group to be much different from the original (around 150 cm)? Why?

#### Answer:

No, you would not expect exactly the same means, because 12-year-olds are not all the same height and each random group contains different students. You would still expect both means to be around 150 cm.
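
The thought experiment above can be sketched as a small simulation. This is only an illustration under stated assumptions: the two city populations are invented, both are drawn from the same normal distribution (mean 150 cm, standard deviation 5 cm), and the names `city_freckles`, `city_no_freckles`, and `draw_group_mean` are hypothetical.

```python
import numpy as np

# Assumption: both invented city populations follow the same distribution
# (mean 150 cm, standard deviation 5 cm). These numbers are made up.
rng = np.random.default_rng(0)
city_freckles = rng.normal(loc=150, scale=5, size=1000)
city_no_freckles = rng.normal(loc=150, scale=5, size=1000)

def draw_group_mean(population, group_size, rng):
    """Pick group_size students at random (no repeats) and return their mean height."""
    group = rng.choice(population, size=group_size, replace=False)
    return group.mean()

mean_a = draw_group_mean(city_freckles, 25, rng)     # same size as Class A
mean_b = draw_group_mean(city_no_freckles, 24, rng)  # same size as Class B
print(mean_a, mean_b)
```

With a different seed the two means should change, yet both should stay close to 150 cm, which is the behavior the answer above describes.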


### Resample with Python 

Let's conduct the experiment in Python. We want to imitate our experiment of going to the big city, finding 1,000 students, and choosing 25 and 24. 

The `resample` function takes a list and repeats it 40 times to make a population of about 1,000. It shuffles that population and returns the first items, as many as there were in the original list.

The `make_grid` function creates the graphs. 

In [20]:
# NOTE: Do not set a seed when you want real results. It is used here only for instructional purposes.
random.seed(13)
def resample(l):
    """There is a better way to do this, but this version helps in explaining."""
    population = l * 40 # we are making around 1,000 students
    random.shuffle(population) # mix them up
    return population[:len(l)] # return a sample the same size as the original list

def make_grid(sample_a, sample_b):
    resample_a = resample(sample_a)
    resample_b = resample(sample_b)
    p1= make_bar(labels = [x for x in range(len(resample_a))], nums = sorted(resample_a), 
               title = "Resample A", y_range = (120, 180))
    p2 = make_bar(labels = [x for x in range(len(resample_b))], nums = sorted(resample_b), 
               title = "Resample B", y_range = (120, 180))
    grid = gridplot([p1, p2], ncols = 2)
    print('the mean of resample 1 is {r1} and the mean of resample2 is {r2}'.format(
        r1 = np.mean(resample_a), r2 = np.mean(resample_b)))
    return grid
show(make_grid(WILLOW_A, WILLOW_B))


the mean of resample 1 is 149.712 and the mean of resample2 is 150.14166666666668


Now the mean of Class A is smaller than the mean of Class B, the reverse of what we saw in the original classrooms. 

## Population vs Samples

In statistics, we talk about a sample and the population. The population is *everything*. For example, the population for rainfall in the Amazon is the rainfall for every single year, from the beginning of time until the end of the world. A sample is the set of years from 2000 to 2010. 

Other examples: The population is every vote from the American election in 2020. A sample is a poll in which only 1,000 people were asked. The population is all the games from the 2018 NBA basketball season, when the Golden State Warriors won. A sample is only the games played in the first 3 months.

Generally we think of our data sets as samples. We can never have all the data. We only have little bits. We have to guess what these little bits mean. Because we don't know the population, we have to estimate whether our samples are really different. The difference between the population and the sample is known as the sampling error. We have sampling error in all polls, because we are trying to estimate the population based on a sample. 
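
Sampling error can be made concrete with a simulated poll. The sketch below is illustrative only: the population of 100,000 votes and the 52% share for candidate A are invented, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumption: an invented population of 100,000 votes,
# 52% of them for candidate A (1 = vote for A, 0 = vote for someone else).
population = np.zeros(100_000, dtype=int)
population[:52_000] = 1
true_share = population.mean()

# The poll: ask 1,000 randomly chosen voters (nobody is asked twice).
poll = rng.choice(population, size=1000, replace=False)
poll_share = poll.mean()

# Sampling error: how far the poll's estimate lands from the true value.
sampling_error = poll_share - true_share
print(f"population: {true_share:.3f}, poll: {poll_share:.3f}, error: {sampling_error:+.3f}")
```

In real life we never see `true_share`; the simulation only makes it visible so the error can be measured.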

We can think of each classroom as a sample. We created a population by expanding the list to about 1,000 and then shuffling it. 

If the students come from the same population, then if we mix them up (resample) we would expect to get similar results. But if they came from different populations, we would expect to get different results. 

For the Willow school, we can think of the students as coming from the same population. That is why the resampled means look so similar.


### Repeat Random Trials
We conducted our experiment once. That is not enough. We want to conduct the experiment many, many times. We want to go to the city, choose 25 and 24 students, get the means, and do it again, then again, and so on. Let's do the experiment a thousand times. 

In [22]:
def hist(l):
    hist, edges = np.histogram(l, density=True)
    p = figure()
    p.quad(top = hist, bottom=0, left=edges[:-1], right=edges[1:], alpha = .4)
    return p
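
The `hist` helper above relies on `np.histogram`. As a small illustration with made-up values (the helper itself uses the default of 10 bins; 3 bins are used here only to keep the output short), this is what that call returns:

```python
import numpy as np

values = [1, 2, 2, 3, 3, 3]  # made-up data for illustration

# Raw counts per bin, plus the bin edges (always one more edge than bins).
counts, edges = np.histogram(values, bins=3)

# With density=True the bar heights are rescaled so the total bar area is 1.
density, _ = np.histogram(values, bins=3, density=True)
total_area = np.sum(density * np.diff(edges))

print(counts)      # number of values that fall in each bin
print(edges)       # left and right boundaries of the bins
print(total_area)  # approximately 1.0
```

This is why `hist` passes `edges[:-1]` and `edges[1:]` to `p.quad`: they are the left and right sides of each bar.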

In [23]:
def repeat_resample(sample_a, sample_b, num_iter = 1000):
    difference_in_means = []#keep track of the difference in heights for each experiment
    for i in range(num_iter):
        resample_a = resample(sample_a)
        resample_b = resample(sample_b)
        difference = np.mean(resample_a) - np.mean(resample_b)
        difference_in_means.append(difference)
    return difference_in_means
results = repeat_resample(WILLOW_A, WILLOW_B)
# results is a list of the differences between the two groups, one for each time we did the experiment
# show the results in a histogram
# a histogram is a quick way to see how the data are distributed
# https://en.wikipedia.org/wiki/Histogram
show(hist(results))


We see from our histogram that most of the differences were between 0 and 2 cm. Only a few times did the difference exceed 4 cm. Let's count how many times the difference was greater than 0:

In [24]:
def show_diff(results):
    greater_than_zero = len([x for x in results if x > 0])
    print('number geater than 0 is {g}'.format(g = greater_than_zero))
show_diff(results)

number geater than 0 is 744


This is not much of a difference. (I will explain a formal way to define this shortly.) About 74% of the time, students with freckles were taller than students without freckles. 

## Quantifying Our Results

Let's quantify this, though. Remember, when we took the original means for the Willow school, we found that Class A has a mean of 150.064, and Class B has a mean of 149.208. That is a difference of 0.86 cm. When we resampled, we found that Class A had a greater mean than Class B 744/1000 times, or .74. 

Using the above math, let's reverse the question. For the Willow school, how often was Class A *not* taller than Class B? 1 - 744/1000 = .256, or about .26. If you went to the city and did this experiment 1,000 times, about 26% of the time you would find that Class A was not taller than Class B. 


## Intro to Two-Sample Hypothesis Testing
Let's talk about two statistical terms, *null hypothesis* and *p value*. 

In statistics, the null hypothesis is our original assumption. It is always the most conservative assumption, usually that there is no difference. For our case, the null hypothesis is "Class A is *not* taller than Class B." 

To test this, we resampled each classroom. Our null hypothesis was that classroom A is not taller than classroom B, and 26% of the resampled experiments agreed with it. We can think of the .26 as a failure rate. Imagine that you tell your friend that classroom A is taller. He says no. He conducts the experiment and shows that you are wrong 26% of the time. This .26 is our estimate of the p value. Strictly speaking, a p value is the probability of seeing a result at least as extreme as ours if the null hypothesis were true; it is not the probability that the null hypothesis itself is true. A small p value means the data are hard to reconcile with the null hypothesis.

You might think that .26 is not a bad failure rate (p value), but in hypothesis testing, it is terrible. In general, we only reject the null hypothesis when the p value is below .05, or even better, below .01. In this case, since the p value is > .05, we do not reject the null hypothesis. We say that there is not enough evidence to support the claim that classroom A is taller than classroom B.
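
The calculation described above can be written as a small helper. This is a sketch, not a standard library routine: `estimate_p_value` is a hypothetical name, and it assumes its input is a list of differences in means like the one returned by `repeat_resample`, with the null hypothesis "the first group is not taller."

```python
def estimate_p_value(differences):
    """Fraction of resampled differences that agree with the null
    hypothesis, i.e. differences that are zero or negative."""
    if not differences:
        raise ValueError("differences must not be empty")
    not_greater = sum(1 for d in differences if d <= 0)
    return not_greater / len(differences)

# Tiny made-up example: 2 of the 4 differences are <= 0.
example = [1.0, -0.5, 2.0, 0.0]
print(estimate_p_value(example))  # 0.5
```

Counting the differences that are not greater than 0 gives the same number as subtracting the "greater than 0" fraction from 1.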


### Rounding the P Value

Note that if your p value is 0, you should say it is < .01. We can never know anything with 100% certainty. Also, a p value should only have two decimal places: .02, not .022885. 
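
These two reporting rules can be captured in a short formatting helper. The function name `format_p_value` is hypothetical, and treating every value below .01 (not just exactly 0) as "< .01" is an assumption made for this sketch.

```python
def format_p_value(p):
    """Format a p value using the two rules described in this section."""
    if p < 0.01:
        return "< .01"             # never report a p value of exactly 0
    return f"{p:.2f}".lstrip("0")  # two decimal places, no leading zero

print(format_p_value(0))         # < .01
print(format_p_value(0.022885))  # .02
print(format_p_value(0.256))     # .26
```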

## Example with two different populations
Let's imagine another small town, Birch. In this town there is also a small school with two classrooms. Classroom A has students who are 12 years old, but classroom B has students who are 17 years old. In this case, we would expect a difference. Let's do the same experiment. 

In [26]:
# here is the data
def get_classroom_birch_a():
    return [141.8, 150.2, 147.6, 146.6, 153.8,
     149.3, 147.6, 158.0, 146.5, 142.7, 142.1,
     146.4, 152.3, 153.3, 154.7, 158.3, 157.6,
     152.3, 155.8, 152.4, 146.4, 153.3, 149.5,
     148.2, 159.3]

def get_classroom_birch_b():
    return [169.6, 163.3,  177.6, 164.5,
     169.5, 168.9, 168.4, 168.7, 163.3, 163.4,
     165.0, 164.0, 169.9, 173.6, 161.3, 168.5,
     160.8, 162.3, 164.6, 166.3, 163.6, 152.6,
     172.8, 164.4]

BIRCH_A = get_classroom_birch_a()
BIRCH_B = get_classroom_birch_b()

In [30]:

P_BIRCH_A = make_bar(labels = [x for x in range(len(BIRCH_A))], nums = sorted(BIRCH_A), 
                title = '12-year Old', y_range = (120, 180))
P_BIRCH_B = make_bar(labels = [x for x in range(len(BIRCH_B))], nums = sorted(BIRCH_B), 
                title = "17-year Old", y_range = (120, 180))

grid_birch = gridplot([ P_BIRCH_A, P_BIRCH_B,], ncols = 2)
print(np.mean(BIRCH_A))
print(np.mean(BIRCH_B))
show(grid_birch)

150.64
166.12083333333337


We want to show that Class B is taller than Class A. Our null hypothesis is that Class B is *not* taller than Class A. Let's resample.

In [32]:
results_birch = repeat_resample(BIRCH_B, BIRCH_A)
show(hist(results_birch))




Look at the histogram, and note that the typical difference is close to the gap between the two original means, about 15.5 cm (166.12 - 150.64). 0 does not appear at all. Let's find out the p value.

In [34]:
show_diff(results_birch)

number geater than 0 is 1000


In this case *all* of the samples were greater than 0. Our p value is 1 - 1000/1000 or 0. Remember, we don't use 0. Instead we say the p value < .01. 

Think about it this way. You say that Class B is taller than Class A. Your friend doubts you and conducts an experiment. He goes to the big city, finds 1,000 12-year-olds, and chooses 25. He finds 1,000 17-year-olds and chooses 24. He finds the mean of each group. He does this experiment 1,000 times. Each time, the mean of the 17-year-olds is greater than the mean of the 12-year-olds, 100% of the time! You say to your friend, "If the two classrooms really were the same, there would be less than a 1% chance of seeing results like these. That is too unlikely. I am rejecting your conclusion (the null hypothesis)." 

## The P-value one last time

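Students and even some academics struggle with the p value. *The p value is the probability of seeing results at least as extreme as ours if the null hypothesis were true.* It is not the probability that the null hypothesis itself is true, although it is often described that way. When the p value is small, the data are hard to explain under the null hypothesis, so we reject it.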

A real life example is the correlation between smoking and lung cancer. We can't prove smoking causes lung cancer. But we can form a null hypothesis that it does not and then conduct a bunch of experiments. This is what happened decades ago. After massive amounts of data, scientists concluded that it was too improbable that smoking did not cause lung cancer. Collectively, the p value was too small. The null hypothesis was rejected. 

## Summary

* Differences between 2 or more groups are often random, because we are taking a sample from the population.

* Resampling is a way to determine whether the differences are real or just random.

* The null hypothesis is the original, conservative assumption. 

* We reject the null hypothesis if the p value is small. That means the differences are unlikely to be due to randomness alone. 

## Fixing our resampling function

Finally, let's fix our resample function to make it more standard. Normally, we resample *with replacement*. That means we take a random item, put it back in, and then choose again. That means we might pick the same element again. That is ok for resampling. 

If you think about it, our original code does nearly the same thing. We take a list and expand it: 

[1, 2, 3] => [1,2, 3, 1, 2 ,3, 1, 2, 3...]

We then reshuffle it

[2,3,3,1,2,1,3.....]

And then take the first 3 elements:
[2,3,3]

You can see we have actually repeated the same elements. That is the desired result. It is the same as creating a huge population and choosing from it. 

The code below is our new resample function. 

In [1]:
def resample(l):
    final = []
    for i in range(len(l)):
        final.append(random.choice(l))
    return final
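
The loop above can also be written with the standard library's `random.choices`, which draws with replacement. The sketch below is one possible alternative, not a required change; the name `resample_with_choices` and the optional `rng` parameter are hypothetical additions for illustration.

```python
import random

def resample_with_choices(values, rng=random):
    """Draw len(values) items from values, with replacement."""
    return rng.choices(values, k=len(values))

# Seeded only so the example is repeatable; do not seed for real results.
rng = random.Random(13)
sample = [1, 2, 3]
print(resample_with_choices(sample, rng))
```

Passing a `random.Random` instance keeps the example repeatable without touching the global random state; calling `resample_with_choices(sample)` with no second argument uses the module-level generator, as the loop version does.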