In [None]:
import numpy as np
import matplotlib.pyplot as plt
from datascience import *
%matplotlib inline

# What is Bootstrapping?

To understand bootstrapping, let's work through an easy example. We roll a six-sided die 50 times and calculate the mean.

In [None]:
possible_rolls = np.arange(1,7)
num_rolls = 50
sample_rolls = np.random.choice(possible_rolls, num_rolls)
avg_roll = np.mean(sample_rolls)
print(f"The average of {num_rolls} rolls is {avg_roll}")

## What is the 95% confidence interval?

We have an average based on our sample of 50 rolls, but we know that this is only an approximation. To pin down the true average by experiment, we would have to average millions of rolls. How good is an estimate based on a sample size of 50?

One way to answer this would be to take many, many samples of 50 and look at the distribution of means. Let's try that.

In [None]:
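means = []
for i in np.arange(10000):
    # Use a new name so the original sample_rolls and avg_roll are not overwritten;
    # the bootstrap further down resamples from the original sample_rolls.
    new_rolls = np.random.choice(possible_rolls, num_rolls)
    means.append(np.mean(new_rolls))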

plt.hist(means, bins=30);

To find the 95% confidence interval, we just figure out the interval that contains 95% of our results. The datascience module has a handy function for this: `percentile`. The 95% confidence interval would be all the values above the bottom 2.5% and below the top 2.5%.

In [None]:
left = percentile(2.5, means)
right = percentile(97.5, means)
print(f"The 95% confidence interval ranges from {left} to {right}")

So we have an estimate of the mean based on 50 rolls and we are 95% confident that the true mean lies within this range.
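
As a rough cross-check, the same range can be worked out by hand for a fair die. The sketch below assumes the die is fair and that the mean of 50 rolls is approximately normally distributed (the central limit theorem), so the interval is the true mean plus or minus about 1.96 standard errors. The variable names are illustrative and not part of the cells above.

```python
import numpy as np

faces = np.arange(1, 7)   # faces of a fair six-sided die (assumption: all equally likely)
n = 50                    # rolls per sample, matching num_rolls above

true_mean = faces.mean()          # expected value of a single roll: 3.5
sigma = faces.std()               # population SD of a single roll: sqrt(35/12)
std_error = sigma / np.sqrt(n)    # SD of the mean of n independent rolls

z = 1.96  # approximate 97.5th percentile of the standard normal distribution
lower = true_mean - z * std_error
upper = true_mean + z * std_error
print(f"Normal approximation: {lower:.2f} to {upper:.2f}")
```

This gives roughly 3.03 to 3.97, which should be close to the simulated interval above, up to random variation in the simulation.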

**But in the real world we cannot replicate our experiment thousands of times!** 

So, if we only have the one sample of 50, how do we estimate a confidence interval for the mean?

Pay close attention, because this is where the magic happens. First, we assume that the sample we have is a reasonable reflection of the true population as a whole. So instead of generating a new set of data by rolling the die again, or by performing the experiment again, **we sample from our sample.** In this case, we draw 50 rolls from our original sample, with replacement after each draw, to create a new sample of 50. We calculate the mean of this new sample. We do this again and again, to get a distribution of means based on our sample.
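
Before repeating this thousands of times, it may help to look at a single resample. The sketch below uses a small stand-in sample of 10 rolls (made-up data, not the notebook's `sample_rolls`) and a seeded NumPy generator so the result is reproducible; the names `observed` and `resample` are illustrative.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(0)

# A stand-in "observed" sample of 10 die rolls, values 1 through 6.
observed = rng.integers(1, 7, size=10)

# One bootstrap resample: same size as the original, drawn with replacement,
# so some observed rolls may appear several times and others not at all.
resample = rng.choice(observed, size=observed.size, replace=True)

print("observed:", observed, "mean:", observed.mean())
print("resample:", resample, "mean:", resample.mean())
```

Every value in the resample comes from the observed sample, but the counts differ, so the resample mean wobbles around the observed mean. That wobble is what the bootstrap uses to estimate uncertainty.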

In [None]:
bootstrap_means = []
for i in np.arange(10000):
    bootstrap_sample = np.random.choice(sample_rolls, size=num_rolls, replace=True)
    avg_roll = np.mean(bootstrap_sample)
    bootstrap_means.append(avg_roll)

plt.hist(bootstrap_means, bins=30);

In [None]:
left = percentile(2.5, bootstrap_means)
right = percentile(97.5, bootstrap_means)
print(f"The 95% confidence interval based on bootstrapping ranges from {left} to {right}")

This confidence interval is **very close** to the one we got by replicating the experiment thousands of times, yet we obtained it without collecting any new data.
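
The bootstrap steps above can be wrapped in a small reusable function. This is a minimal sketch, not part of the datascience module: `bootstrap_ci` and its parameters are hypothetical names, and it uses `np.percentile`, which interpolates between values and so may differ slightly from the datascience `percentile` function.

```python
import numpy as np

def bootstrap_ci(sample, stat=np.mean, num_resamples=10_000, level=95, seed=None):
    """Percentile bootstrap interval for stat(sample) (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = sample.size
    # Resample with replacement, same size as the original, and record the statistic.
    stats = np.array([
        stat(rng.choice(sample, size=n, replace=True))
        for _ in range(num_resamples)
    ])
    tail = (100 - level) / 2  # e.g. 2.5 for a 95% interval
    return np.percentile(stats, tail), np.percentile(stats, 100 - tail)

# Usage with a fresh, seeded sample of 50 rolls (illustrative data).
rolls = np.random.default_rng(1).integers(1, 7, size=50)
left, right = bootstrap_ci(rolls, seed=2)
print(f"Sample mean {rolls.mean():.2f}, 95% bootstrap interval {left:.2f} to {right:.2f}")
```

Passing a different `stat`, such as `np.median`, applies the same resampling idea to another statistic, although the percentile bootstrap may behave less well for some statistics than for the mean.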

## Student Challenge
What assumption underlies bootstrapping? When do you think it is likely to work well, and when not?