In [1]:
import random
import numpy as np

In [2]:
from bokeh.io import output_notebook, show
from bokeh.layouts import gridplot
from bokeh.plotting import figure

In [3]:
output_notebook()

## Intro to Bootstrapping, or Resampling
### Goals:
* Understand the difference between a population and a sample
* Understand how to resample with Python
* Understand how to quantify the differences
* Understand the null hypothesis
* Understand a p value
* Understand how to tell the difference between noise (randomness) and signal

Often, in big data, we are asked to tell the difference between two samples. Was there really more rain in 2018 than in 2019? Was the temperature hotter in these two years? Did more people ride bikes in Boston than in Seattle? 

There is a mathematical way to answer these questions in big data.

### Experiment
Imagine a small town. There are two schools. Each school has two classrooms. In the north school, both classrooms have students who are 12 years old. In the south school, classroom a has students who are 12 years old, but classroom b has students who are 17 years old.

Answer the following:
Do you expect a difference in height between the students in the north school (both classrooms have students who are 12 years old)? Why?
Do you expect a difference in height between the students in the two classrooms in the south school (one class has students who are 12 years old, and one has students who are 17 years old)? Why?

In [4]:
def get_classroom_north_a():
    return [146.1,152.7, 146.3, 142.0, 151.8,
     151.4, 145.1, 153.5, 151.2, 143.5, 158.2,
     150.6, 143.5, 151.6, 149.9, 154.2, 142.4,
     154.6, 154.1, 152.8, 152.4, 155.9, 152.9,
     149.9, 145.0]


In [5]:
def get_classroom_north_b():
    return [142.9, 150.9, 154.0, 146.4, 142.4,
     148.7, 151.4, 154.5, 142.9, 142.5, 152.5,
     151.4, 156.7, 153.9, 148.5, 147.6, 161.9,
     147.6, 145.1, 143.3, 149.5, 147.3,
     148.7, 150.4]

In [6]:
def get_classroom_south_a():
    return [141.8, 150.2, 147.6, 146.6, 153.8,
     149.3, 147.6, 158.0, 146.5, 142.7, 142.1,
     146.4, 152.3, 153.3, 154.7, 158.3, 157.6,
     152.3, 155.8, 152.4, 146.4, 153.3, 149.5,
     148.2, 159.3]

In [7]:
def get_classroom_south_b():
    return [169.6, 163.3,  177.6, 164.5,
     169.5, 168.9, 168.4, 168.7, 163.3, 163.4,
     165.0, 164.0, 169.9, 173.6, 161.3, 168.5,
     160.8, 162.3, 164.6, 166.3, 163.6, 152.6,
     172.8, 164.4]

The above functions return the heights of the students in each classroom. Let's graph the 4 classrooms.

In [8]:
NORTH_A = get_classroom_north_a()
NORTH_B = get_classroom_north_b()
SOUTH_A = get_classroom_south_a()
SOUTH_B = get_classroom_south_b()

In [9]:
def make_bar(labels, nums, title = None, y_range = None, plot_width = 350, plot_height = 350):
    # figure() takes width/height; the older plot_width/plot_height
    # keyword arguments were removed in Bokeh 3.0
    p = figure(title = title, width = plot_width, height = plot_height,
              y_range = y_range)
    p.vbar(x=labels, top=nums, width=0.9)
    p.xgrid.grid_line_color = None
    return p

P_NORTH_A = make_bar(labels = [x for x in range(len(NORTH_A))], nums = sorted(NORTH_A), 
               title = 'North A', y_range = (120, 160))
P_NORTH_B = make_bar(labels = [x for x in range(len(NORTH_B))], nums = sorted(NORTH_B), 
               title = "North B", y_range = (120, 160))
P_SOUTH_A = make_bar(labels = [x for x in range(len(SOUTH_A))], nums = sorted(SOUTH_A), 
                title = 'South A', y_range = (120, 180))
P_SOUTH_B = make_bar(labels = [x for x in range(len(SOUTH_B))], nums = sorted(SOUTH_B), 
                title = "South B", y_range = (120, 180))


grid = gridplot([P_NORTH_A, P_NORTH_B, P_SOUTH_A, P_SOUTH_B], ncols = 2)
show(grid)

Let's get the means

In [10]:

print(np.mean(NORTH_A))
print(np.mean(NORTH_B))
print(np.mean(SOUTH_A))
print(np.mean(SOUTH_B))


150.064
149.20833333333334
150.64
166.12083333333337


For the north school, classroom a has a mean of 150.064 cm, and classroom b has a mean of 149.208 cm. That is a difference of about 0.86 cm.
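As a quick check of the subtraction, the sketch below recomputes the gap between classrooms from the four means printed above. The variable names are illustrative and are not part of the notebook's own code.

```python
# Means copied from the printed output of the cell above (in cm).
north_a_mean = 150.064
north_b_mean = 149.20833333333334
south_a_mean = 150.64
south_b_mean = 166.12083333333337

# Difference in mean height within each school.
north_diff = north_a_mean - north_b_mean
south_diff = south_b_mean - south_a_mean

print(round(north_diff, 2))  # 0.86
print(round(south_diff, 2))  # 15.48
```

The north gap is under 1 cm, while the south gap is more than 15 cm.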

Answer the following:

Do you still think there is no difference between the two classrooms in the north school? Why? 

### Randomness
Maybe the difference in height between the two classrooms in the north school is just due to randomness. How can we know?

Imagine an experiment. We go to a big city and find 1,000 students who are 12 years old. We randomly choose 25 for the first class, and 24 for the second (the original number of students for each). We calculate the mean for each classroom.

Answer the following: Do you expect the same result as the means above (150.064 and 149.208)? Why? Do you expect the mean from either group to be much different from the original (around 150 cm)? Why?

Now, imagine that we conduct the same experiment for the south school. We go into a big city and find 1,000 students who are 12 years old. We choose 25 for classroom a. We find 1,000 students who are 17 years old. We choose 24 for classroom b. We calculate the mean for each classroom.

Answer the following:

Do you expect the means for the two classrooms to be different? (Remember, one classroom has old students, and one has young students.) Why? 
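The thought experiment can also be tried out on a made-up city. The sketch below is only an illustration: the heights are simulated, the centers of 150 cm and 166 cm are loosely based on the classroom means above, and the spread of 5 cm is a guess. None of these numbers come from real data.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed only so the sketch is repeatable

# ASSUMPTION: simulated heights in cm, not real measurements.
city_12_year_olds = [rng.gauss(150, 5) for _ in range(1000)]
city_17_year_olds = [rng.gauss(166, 5) for _ in range(1000)]

# North school: pick 49 different 12-year-olds and split them 25 / 24.
picked = rng.sample(city_12_year_olds, 49)
north_a, north_b = picked[:25], picked[25:]

# South school: classroom a is 12-year-olds, classroom b is 17-year-olds.
south_a = rng.sample(city_12_year_olds, 25)
south_b = rng.sample(city_17_year_olds, 24)

north_gap = statistics.mean(north_a) - statistics.mean(north_b)
south_gap = statistics.mean(south_b) - statistics.mean(south_a)
print('north gap: {:.2f} cm'.format(north_gap))
print('south gap: {:.2f} cm'.format(south_gap))
```

Under these assumptions the north gap should stay small, while the south gap should land near the 16 cm built into the simulation.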

### Resample with Python 

Let's conduct the experiment in Python
The resample function takes a list and expands it to about 1,000 items by repeating it 40 times. It then shuffles the expanded list, or mixes it up, and returns as many items as the original list had.

The make_grid function creates the graphs. 

In [11]:
# NOTE: do not set a seed when you want real results. It is used here only for instructional purposes.
random.seed(11)
def resample(l):
    """There is a better way to do this, but this helps in explaining."""
    population = l * 40 # we are making around 1,000 students
    random.shuffle(population) # mix them up
    return population[:len(l)] # return as many values as the original sample

def make_grid(sample_a, sample_b):
    resample_a = resample(sample_a)
    resample_b = resample(sample_b)
    p1= make_bar(labels = [x for x in range(len(resample_a))], nums = sorted(resample_a), 
               title = "Resample A", y_range = (120, 180))
    p2 = make_bar(labels = [x for x in range(len(resample_b))], nums = sorted(resample_b), 
               title = "Resample B", y_range = (120, 180))
    grid = gridplot([p1, p2], ncols = 2)
    print('the mean of resample 1 is {r1} and the mean of resample2 is {r2}'.format(
        r1 = np.mean(resample_a), r2 = np.mean(resample_b)))
    return grid
show(make_grid(NORTH_A, NORTH_B))


the mean of resample 1 is 150.136 and the mean of resample2 is 148.95416666666665


The result is random, but you will most likely see that the two graphs look much alike. Now let's try the same experiment for the south school.


In [12]:
show(make_grid(SOUTH_A, SOUTH_B))

the mean of resample 1 is 149.31599999999997 and the mean of resample2 is 165.54999999999998


The results from the north school look similar. But the results from the south school show a difference in height.

## Population vs Samples

In statistics, we talk about a sample and the population. The population is *everything*. For example, the population for the average rainfall in the Amazon is the rainfall for every single year, from the beginning of time until the end of the world. A sample is the set of years from 2000 to 2010.

Other examples: The population is every vote from the American election in 2020. A sample is a poll in which only 1,000 people were asked. The population is all the games from the NBA basketball season in 2018 when the Golden State Warriors won. A sample is only the games played in the first 3 months.

Generally we think of our data sets as samples. We can never have all the data. We only have little bits. We have to guess what these little bits mean. Because we don't know the population, we have to estimate whether our samples are really different.

We can think of each classroom as a sample. We created a population by expanding the list to 1,000 and then shuffling it. 

If the students come from the same population, then if we mix them up (resample) we would expect to get the same results. But if they came from different populations, we would expect to get different results. 

For the north school, we can think of the students as coming from the same population. That is why the resampled means look the same.

But for the south school, the students came from different populations. That is why we got different results.


### Repeat Random Trials
We conducted our experiment once. That is not enough. We want to conduct the experiment many, many times. We want to go to the city, choose 25 students, get the mean, and do it again, then again, and so on. Let's do the experiment a thousand times.

In [13]:
def hist(l):
    hist, edges = np.histogram(l, density=True)
    p = figure()
    p.quad(top = hist, bottom=0, left=edges[:-1], right=edges[1:], alpha = .4)
    return p

In [14]:
def repeat_resample(sample_a, sample_b, num_iter = 1000):
    difference_in_means = []#keep track of the difference in heights for each experiment
    for i in range(num_iter):
        resample_a = resample(sample_a)
        resample_b = resample(sample_b)
        difference = np.mean(resample_a) - np.mean(resample_b)
        difference_in_means.append(difference)
    return difference_in_means
results = repeat_resample(NORTH_A, NORTH_B)
# results is a list of the differences in means between the two groups, one for each time we did the experiment
# show the results in a histogram
# a histogram is a quick way to see how the data is distributed
# https://en.wikipedia.org/wiki/Histogram
show(hist(results))


We see from our histogram that most of the differences were between 0 and 2 cm. Only a few times did the difference exceed 4. Let's count how often the difference was above zero and how often it was below zero:

In [15]:
def show_diff(results):
    greater_than_zero = len([x for x in results if x > 0])
    less_than_zero = len([x for x in results if x < 0])
    print('number greater than 0 is {g}'.format(g = greater_than_zero))
    print('number less than 0 is {g}'.format(g = less_than_zero))
show_diff(results)

number greater than 0 is 766
number less than 0 is 234


This is not much of a difference. (I will explain a formal way to define this shortly.) About 77% of the time (766 out of 1,000), classroom a had the greater mean. Let's do the same for the south school:

In [16]:
results2 = repeat_resample(SOUTH_B, SOUTH_A)
show(hist(results2))
show_diff(results2)

number greater than 0 is 1000
number less than 0 is 0


Wow! 100% of the time, classroom b had the greater mean.

## Quantifying Our Results

Let's quantify this, though. Remember, when we took the original means for the north school, we found that classroom a has a mean of 150.064, and classroom b has a mean of 149.208. That is a difference of about .86 cm. When we resampled, we found that classroom a had a greater mean than classroom b 766/1000 times, or about .77.

For the south school, classroom b was taller than classroom a by 15.48 cm. When we resampled, we found classroom b was taller than classroom a 1000/1000 times, or a probability of 1.

Last, using the above math, let's reverse the question. What is the probability, for the north school, that classroom a was *not* taller than classroom b? 1 - 766/1000 = .234, or about .23. If you went to the city and did this experiment 1,000 times, about 23% of the time you would actually find that classroom b was taller.

For the south school, the probability that classroom b is not taller than classroom a is 1 - 1000/1000, or 0. If you conducted this experiment 1,000 times, you would never find that classroom a is taller.
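The arithmetic above can be written out in a few lines. This is a minimal sketch that uses the counts printed by show_diff. The variable names are made up for illustration.

```python
num_trials = 1000

# Counts copied from the show_diff output above.
north_a_taller = 766   # north school: resampled mean of a was greater than b
south_b_taller = 1000  # south school: resampled mean of b was greater than a

# Proportion of trials in which the originally taller classroom stayed taller.
p_north_a_taller = north_a_taller / num_trials
p_south_b_taller = south_b_taller / num_trials

# Reverse the question: how often was it NOT taller?
p_north_a_not_taller = 1 - p_north_a_taller
p_south_b_not_taller = 1 - p_south_b_taller

print(round(p_north_a_not_taller, 3))  # 0.234
print(p_south_b_not_taller)            # 0.0
```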

## Intro to Hypothesis Testing
Let's talk about two statistical terms, *null hypothesis* and *p value*.

In statistics, the null hypothesis is our original assumption. It is always the most conservative assumption, usually that there is no difference. For our case, the null hypothesis is "There is no difference between the height of classroom a and classroom b."

To test this, we resampled each classroom. In the original data for the north school, classroom a was taller than classroom b. When we resampled, that result was reversed 234 times out of 1,000, or about 23% of the time. We can think of the .23 as a failure rate. Imagine that you tell your friend that classroom a is taller. He says no. He conducts the experiment and shows that you are wrong 23% of the time. In this lesson, we use this .23 as the p value. Formally, the p value is the probability of seeing a difference at least as large as the one we observed if the null hypothesis were true. It is not the probability that the null hypothesis is false.

You might think that .23 is not a bad failure rate (p value), but in hypothesis testing, it is terrible. In general, we only reject the null hypothesis when the p value is below .05, or even better, below .01. In this case, since the p value is > .05, we do not reject the null hypothesis. We say that there is not enough evidence to support the claim that there is a difference between the two classrooms.

For the south school, the situation is different. You tell your friend that classroom b is taller, and he tries to prove you wrong. He conducts an experiment 1,000 times, and he shows you are right each time. In this case, the p value is 0. Certainly, p < .01, so we *reject* the null hypothesis. If the difference between the two classrooms was really random (just chance), we would expect to see some failures. Our probability of failures is 0%. We can no longer believe that the difference is just random.
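The decision rule for both schools can be sketched as a small function. The name `decide` and the parameter `alpha` are illustrative and are not part of the notebook. The .05 threshold is the conventional cutoff mentioned above, and the failure counts are taken from the show_diff output.

```python
def decide(p_value, alpha=0.05):
    """Return the hypothesis-test decision for a given p value.

    ASSUMPTION: `decide` and `alpha` are illustrative names only.
    """
    if p_value < alpha:
        return 'reject the null hypothesis'
    return 'do not reject the null hypothesis'

north_p_value = 234 / 1000  # north school: failures out of 1,000 trials
south_p_value = 0 / 1000    # south school: no failures

print('north:', decide(north_p_value))
print('south:', decide(south_p_value))
```

A stricter threshold can be passed in, for example `decide(north_p_value, alpha=0.01)`.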

### Rounding P Values

Note that if your p value is 0, you should say it is < .01. We can never know anything with 100% certainty. Also, a p value should only have two decimal places: for example, .02 rather than .022885.
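One possible way to apply this reporting rule in code is sketched below. The helper `format_p_value` is hypothetical and not part of the notebook. It treats anything below .01 as "< .01" and otherwise keeps two decimal places.

```python
def format_p_value(p):
    """Format a p value with two decimals, reporting tiny values as '< .01'.

    ASSUMPTION: illustrative helper, not part of the original notebook.
    """
    if p < 0.01:
        return '< .01'
    text = '{:.2f}'.format(p)
    # Drop the leading zero so 0.23 is shown as .23
    return text[1:] if text.startswith('0') else text

print(format_p_value(0.0))    # < .01
print(format_p_value(0.234))  # .23
```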

## Summary

* Differences between 2 or more groups are often random because we are taking a sample from the population.

* Resampling is a way to determine if the differences are real, or just random.

* The null hypothesis is the original, conservative assumption.

* We reject the null hypothesis if the p value is small. That means differences are not due to randomness.

Finally, let's fix our resample function to make it more standard. Normally, we resample *with replacement*. That means we take a random item, put it back in, and then choose again. That means we might pick the same element again. That is ok for resampling.

If you think about it, our original code does the same thing. We take a list and expand it: 

[1, 2, 3] => [1, 2, 3, 1, 2, 3, 1, 2, 3, ...]

We then reshuffle it:

[2, 3, 3, 1, 2, 1, 3, ...]

And then take the first 3 elements:

[2, 3, 3]

You can see we have actually repeated the same elements. That is the desired result. It is the same as creating a huge population and choosing from it.
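The three steps can be run directly on the small list. This sketch mirrors the original resample function, and the seed is set only so the run is repeatable.

```python
import random

random.seed(0)  # only so the sketch is repeatable; do not seed real analyses

sample = [1, 2, 3]
population = sample * 40              # 120 items: forty copies of each value
random.shuffle(population)            # mix them up in place
resampled = population[:len(sample)]  # keep as many items as the original

print(len(population))
print(resampled)
```

Because every value appears forty times in the expanded list, the same value can show up more than once in the result.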

The code below is our new resample function. 

In [18]:
def resample_(l):
    final = []
    for i in range(len(l)):
        final.append(random.choice(l))
    return final
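The standard library can also do the loop in one call. As a sketch, `random.choices` draws with replacement, so the function below should behave like resample_ above. The name `resample_with_choices` is illustrative.

```python
import random

def resample_with_choices(values):
    """Draw len(values) items from values, with replacement."""
    return random.choices(values, k=len(values))

heights = [146.1, 152.7, 146.3, 142.0, 151.8]  # first five heights from North A
new_sample = resample_with_choices(heights)
print(new_sample)
```

The result has the same length as the input, and some heights may appear more than once while others are left out.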