# KIN 482E - Programming and Data Science for Kinesiology


## Lecture: Bootstrapping!

## Objectives

* Describe real-world examples of questions that can be answered with statistical inference.
* Define common population parameters (e.g., mean, proportion, standard deviation) that are often estimated using sampled data, and estimate these from a sample.
* Define the following statistical sampling terms (population, sample, population parameter, point estimate, sampling distribution).
* Explain the difference between a population parameter and a sample point estimate.
* Use Python to draw random samples from a finite population.
* Use Python to create a sampling distribution from a finite population.
* Describe how sample size influences the sampling distribution.
* Define bootstrapping.
* Explain why we would want to bootstrap data.
* Use Python to create a bootstrap distribution to approximate a sampling distribution.
* Contrast the bootstrap and sampling distributions.

## Review

**Inference:** Using a sample to make a conclusion about the wider population

What do the following terms mean:

- population (and population parameter)
- sample
- estimation (and estimate)
- sampling distribution

**Your answer here**

- A **population** is the *complete* collection of individuals or cases we are interested in studying; we rarely have access to the full population, which is why we need statistics!
- A **sample** is a subset of the population.
- **Estimation** refers to the process of computing a numerical characteristic of our sample (an estimate), for example the sample mean or the sample standard deviation.
- **Sampling distribution** refers to the distribution of an estimate across all possible samples of a given size from a population (in practice, approximated with an arbitrarily large number of samples).
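
To make the difference between a population parameter and a point estimate concrete, here is a minimal sketch. It assumes a made-up population of 5,000 nightly prices generated with NumPy (this is *not* the Airbnb data; the names `population` and `one_sample` are only for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in population: 5,000 made-up nightly prices (not the Airbnb data)
np.random.seed(1)
population = pd.DataFrame({"price": np.random.gamma(2.0, 75.0, size=5000)})

# Population parameter: computed from every member of the population
population_mean = population["price"].mean()

# Point estimate: the same quantity computed from one random sample of 40
one_sample = population.sample(n=40)
sample_mean = one_sample["price"].mean()

print(f"population mean (parameter):  {population_mean:.2f}")
print(f"sample mean (point estimate): {sample_mean:.2f}")
```

The two numbers will usually be close but not identical; a different random sample would give a different point estimate.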

In [None]:
# load libraries for wrangling and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()

---
### Task 1:
- Read in the `listings.csv` data set and put it into a dataframe called `airbnb`
- Print 7 random rows

In [None]:
# Read in data


---
### Task 2
We will assume that `listings.csv` contains all Airbnb rentals in Vancouver. In other words, we are in the rare situation of having access to the entire population of interest.

- Confirm that the true population mean price per night is $154.51


In [None]:
# your answer here


---
### Task 3:
- Use a list comprehension to produce 20,000 samples of size 40; assign all 20,000 samples to a data frame called `samples`
    - Remember to create a new column `replicate` which keeps track of the sample number
- Calculate the mean for each sample and plot the distribution of means
    - Remember to rename the `price` column and call it `mean_price`
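
One possible shape for this task is sketched below. It is only a sketch under stated assumptions: `airbnb_demo` is a made-up data frame standing in for `airbnb`, and it draws 1,000 samples rather than 20,000 so it runs quickly.

```python
import numpy as np
import pandas as pd

np.random.seed(255)

# Hypothetical stand-in for the `airbnb` data frame (made-up prices)
airbnb_demo = pd.DataFrame({"price": np.random.gamma(2.0, 75.0, size=5000)})

n_samples = 1000  # the task asks for 20,000; fewer here to keep the sketch fast

# One sample of size 40 per replicate, each tagged with its replicate number
samples = pd.concat(
    [airbnb_demo.sample(n=40).assign(replicate=i) for i in range(n_samples)]
)

# One mean per replicate, with `price` renamed to `mean_price`
sample_estimates = (
    samples.groupby("replicate")["price"]
    .mean()
    .reset_index()
    .rename(columns={"price": "mean_price"})
)
print(sample_estimates.head())
```

A histogram of `sample_estimates["mean_price"]` (for example with `sns.histplot`) would then show the sampling distribution.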

In [None]:
# Set seed for reproducibility
np.random.seed(255)

# Use list comprehension to calculate 20,000 samples of size 40


In [None]:
# Calculate 20,000 sample estimates of the mean price


# Plot the sampling distribution 


---
## Questions for you

1) What does this figure represent?

2) What happens if we make our sample smaller / larger? Where is the peak of the histogram centered?

3) Can we actually create this figure in a real data analysis problem?

4) What problem does that cause for us?

## Answers
1. This is a visualization of the sampling distribution. In other words, we took a large number of samples from our population, calculated the mean of each one, and plotted those means.
2. If the sample size gets larger, the spread of the distribution shrinks (and vice versa). The peak is centered at the true population parameter value.
3. No; in a real data analysis problem we typically only have one sample to work with (we can't create many samples).
4. This means we cannot visualize the spread, and so have no way of understanding how reliable our point estimate is using just one sample (unless we make lots of assumptions).
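
The effect of sample size described in answer 2 can be checked numerically. The sketch below assumes a made-up population (not the Airbnb data) and a hypothetical helper, `sampling_distribution`, and compares the spread of the sample means for three sample sizes.

```python
import numpy as np
import pandas as pd

np.random.seed(255)

# Hypothetical stand-in population of nightly prices (made up)
population = pd.DataFrame({"price": np.random.gamma(2.0, 75.0, size=5000)})

def sampling_distribution(n, reps=500):
    """Return `reps` sample means, each from a fresh random sample of size n."""
    return pd.Series([population["price"].sample(n=n).mean() for _ in range(reps)])

# Standard deviation of the sample means for each sample size
spread = {n: sampling_distribution(n).std() for n in [10, 40, 160]}
for n, sd in spread.items():
    print(f"n = {n:>3}: standard deviation of sample means = {sd:.2f}")
```

The spread should shrink as `n` grows, while the centre of each distribution stays near the population mean.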

---
## Bootstrapping

We only have one sample... but if it's big enough, the sample looks like the population! This means we can create what's known as a "bootstrap" (or "bootstrapped") distribution that makes the absolute *most* out of our single sample. This is *very very* good!

### Your tasks:
Set up a figure with 6 (3 x 2) subplots.
1. Plot the population distribution in the last subplot (i.e., axs[2, 1]).
2. Create 1 sample each of sizes `[10, 20, 50, 100, 200]`
3. Plot the distributions of these different samples in the remaining 5 subplots.

In [None]:
# Plot sample distributions for n = 10, 20, 50, 100, 200


# Set up subplots

# Plot population distribution

# # Plot sample distributions for different sample sizes (some starter code provided)
# for i, sample_n in enumerate(sample_sizes):
#     sample = airbnb.sample(sample_n)
#     ax = axs[i % 3, i // 3]...

# # Adjust layout
# plt.tight_layout(rect=[0, 0, 1, 0.96])
# plt.show()

---
# Working through the bootstrapping process


---
## Creating a *bootstrap* sampling distribution

1. Randomly select an observation from the original sample (e.g., 9 Airbnb listings, 15 individuals post-stroke, 20 healthy young adults, etc.), which was drawn from the population (all Vancouver Airbnb listings, all Canadian adults post-stroke, all healthy, neurotypical adults between the ages of 18-35 across the globe, etc.).
2. Record the observation's value.
3. Replace that observation.
4. Repeat steps 1&ndash;3 (sampling *with* replacement) until you have $n$ observations (where $n$ is the number of observations in your single sample from the population), which form a bootstrap sample.
5. Calculate the bootstrap point estimate (e.g., mean, median, proportion, slope, etc.) of the $n$ observations in your bootstrap sample.
6. Repeat steps 1&ndash;5 many times to create a distribution of point estimates (the bootstrap distribution).
7. Calculate the plausible range of values around our observed point estimate.

![](../images/intro-bootstrap.jpeg)
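
The seven steps above can be written out almost literally in code. This is a deliberately slow, step-by-step sketch using a made-up sample of 9 prices and a hypothetical helper, `bootstrap_mean`; later in the lecture we use pandas to do the same thing more compactly.

```python
import numpy as np

np.random.seed(255)

# Hypothetical original sample of 9 nightly prices (made-up values)
original_sample = np.array([95.0, 120.0, 150.0, 80.0, 210.0, 175.0, 99.0, 130.0, 300.0])
n = len(original_sample)

def bootstrap_mean(sample):
    # Steps 1-4: pick a random position, record its value, and "replace" it
    # (the same position can be picked again), until we have n values
    draws = [sample[np.random.randint(0, len(sample))] for _ in range(len(sample))]
    # Step 5: the bootstrap point estimate for this bootstrap sample
    return np.mean(draws)

# Step 6: repeat many times to build the bootstrap distribution
boot_means = np.array([bootstrap_mean(original_sample) for _ in range(2000)])

# Step 7: plausible range from the middle 95% of the bootstrap distribution
lower, upper = np.quantile(boot_means, [0.025, 0.975])
print(f"sample mean: {original_sample.mean():.2f}")
print(f"plausible range: {lower:.2f} to {upper:.2f}")
```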

---
### Task 1:
- Let's pretend our *sample* **is** our population. Then we can take many samples from our original sample (called *bootstrap samples*) to give us an approximation of the sampling distribution (the *bootstrap sampling distribution*). We can use its shape and spread to get the plausible range for our population parameter!
- This is typically what's done in nearly all experiments. You collect your ***single*** sample of rental prices/individuals with diabetes/control subjects/etc. from the population of interest (Airbnb rentals, all North American adults with diabetes, all neurotypical adults, etc.). ***The big difference here is that we have access to the full population, so we can compare our estimates to ground truth!***
- Generate a single sample of 40 from the airbnb data and calculate the mean.

In [None]:
# Generate a single sample 


---
### Task 2: Generating a single bootstrap sample

1. randomly draw an observation from the original sample (which was drawn from the population)

2. record the observation's value

3. return the observation to the original sample

4. repeat the above the **same number of times as there are observations in the original sample**
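
In pandas, these four steps collapse into a single call: `sample(frac=1, replace=True)` draws as many rows as the data frame has, with replacement. The sketch below assumes a made-up `one_sample` of 40 prices rather than the sample drawn from `airbnb`.

```python
import numpy as np
import pandas as pd

np.random.seed(255)

# Hypothetical original sample of 40 nightly prices (made up)
one_sample = pd.DataFrame({"price": np.random.gamma(2.0, 75.0, size=40)})

# frac=1 -> same number of rows as the original sample
# replace=True -> each row goes "back in the bag" after it is drawn
boot1 = one_sample.sample(frac=1, replace=True)

print("rows in bootstrap sample:", len(boot1))
print("distinct original rows used:", boot1.index.nunique())
print(f"original mean:  {one_sample['price'].mean():.2f}")
print(f"bootstrap mean: {boot1['price'].mean():.2f}")
```

Because rows are replaced, some original observations usually appear more than once and others not at all, which is why the bootstrap mean generally differs from the original sample mean.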

In [None]:
# Calculate and plot a single bootstrap sample


### Question
- What would happen if we sampled *without* replacement? What does that mean?

**Your answer here**

---
### Task 3:
- Generate 20,000 bootstrapped samples from the single sample we made previously

In [None]:
# We will first create 20,000 bootstrapped samples


### Think

- Why are there 800,000 rows instead of 20,000?

---
### Task 4: Create the bootstrapped sampling distribution

- We must calculate our sample statistic (i.e., the mean) for each of the 20,000 samples.
- Calculate the mean of every bootstrap sample we have made and store the means in a new variable (column) called `mean_price`.
- Plot this distribution.
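
A minimal sketch of Tasks 3 and 4 together, under stated assumptions: `one_sample` here is a made-up sample of 40 prices, and only 2,000 bootstrap samples are drawn (instead of 20,000) to keep it quick.

```python
import numpy as np
import pandas as pd

np.random.seed(255)

# Hypothetical original sample of 40 nightly prices (made up)
one_sample = pd.DataFrame({"price": np.random.gamma(2.0, 75.0, size=40)})

n_boot = 2000  # the task asks for 20,000

# Stack the bootstrap samples, tagging each with its replicate number
boot_samples = pd.concat(
    [one_sample.sample(frac=1, replace=True).assign(replicate=i) for i in range(n_boot)]
)
print(boot_samples.shape)

# One mean per bootstrap sample, stored in a column called `mean_price`
boot_means = (
    boot_samples.groupby("replicate")["price"]
    .mean()
    .reset_index()
    .rename(columns={"price": "mean_price"})
)
print(boot_means.head())
```

Plotting `boot_means["mean_price"]` as a histogram then gives the bootstrap distribution.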

In [None]:
# From our bootstrapped samples, we calculate the mean of each one


In [None]:
# Plot the bootstrapped sampling distribution


---
## Sampling vs bootstrap distribution
### Task 1:
- Visualize the sampling distribution and compare it to our bootstrapped distribution.

Reminder: the true population mean is \$154.51.


In [None]:
# Create subplots


# Plot sampling distribution


# Plot bootstrap distribution


# Set titles


# Adjust layout



---
## Two essential points that we can take away from the above histograms

- First, the shape and spread of the true sampling
distribution and the bootstrap distribution are similar; the bootstrap
distribution lets us get a sense of the point estimate's variability.
- The second important point is that the means of these two distributions are
slightly different. The sampling distribution is centered at
\$154.51, the population mean value. However, the bootstrap
distribution is centered at the original sample's mean price per night,
\$148.24. Because we are resampling from the
original sample repeatedly, we see that the bootstrap distribution is centered
at the original sample's mean value (unlike the sampling distribution of the
sample mean, which is centered at the population parameter value).
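
Both points can be checked numerically as well as visually. The sketch below assumes a made-up population (so the numbers will not match the \$154.51 and \$148.24 above) and uses 2,000 replicates for each distribution.

```python
import numpy as np
import pandas as pd

np.random.seed(255)

# Hypothetical stand-in population of nightly prices (made up, not the Airbnb data)
population = pd.Series(np.random.gamma(2.0, 75.0, size=5000), name="price")
one_sample = population.sample(n=40)

reps = 2000  # the lecture uses 20,000

# Sampling distribution: means of many fresh samples from the population
sampling_means = pd.Series([population.sample(n=40).mean() for _ in range(reps)])

# Bootstrap distribution: means of many resamples (with replacement) of ONE sample
bootstrap_means = pd.Series(
    [one_sample.sample(frac=1, replace=True).mean() for _ in range(reps)]
)

summary = pd.DataFrame(
    {
        "center": [sampling_means.mean(), bootstrap_means.mean()],
        "spread (sd)": [sampling_means.std(), bootstrap_means.std()],
        "reference mean": [population.mean(), one_sample.mean()],
    },
    index=["sampling", "bootstrap"],
)
print(summary.round(2))
```

In the printed table, each `center` should sit close to its `reference mean` (population mean for the sampling distribution, original sample mean for the bootstrap distribution), while the two spreads should be of similar size.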

---
## Take a minute to appreciate what bootstrapping has done

- With a single sample of 40 observations (from a much, much larger population), we've closely approximated the true sampling distribution.
- We know this because we earlier created the sampling distribution from the entire population of observations, something we almost never have access to in real world data science applications. 
- In essence, bootstrapping is a means of simulating many data collections (or experiments), premised only on the assumption that our single sample is representative of the population. Amazing!
- As an aside, a friend who started working for Google after earning his PhD joked that he aced his technical interviews by answering with "I'd use bootstrapping," for any question he didn't have a real answer for. The point being that bootstrapping is almost never a bad idea.

---
## Using the bootstrap to calculate a plausible range

We can use our bootstrap distribution to calculate the plausible range of values for the population parameter (only requires 1 sample as opposed to thousands of samples from the entire population!):


### Task 2:
We can report **both** our sample point estimate and the plausible range where we expect our true population quantity to fall. The steps involved are the following:

1. Arrange the observations in the bootstrap distribution in ascending order.
2. Find the value such that 2.5\% of observations fall below it (the 2.5th percentile). Use that value as the lower bound of the interval.
3. Find the value such that 97.5\% of observations fall below it (the 97.5th percentile). Use that value as the upper bound of the interval.

To do this in Python, we will again use the `quantile` function of our DataFrame.
Quantiles are expressed in proportions rather than percentages,
so the 2.5th and 97.5th percentiles
would be quantiles 0.025 and 0.975, respectively.
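
As a sketch of how that call might look: the example below assumes a made-up sample of 40 prices and builds a small `boot_means` data frame with NumPy (a shortcut for this illustration, not the list-comprehension approach used earlier), then reads off the two quantiles.

```python
import numpy as np
import pandas as pd

np.random.seed(255)

# Hypothetical original sample of 40 nightly prices (made up)
one_sample_prices = np.random.gamma(2.0, 75.0, size=40)

# 2,000 bootstrap samples of size 40 (one per row), then the mean of each row
resamples = np.random.choice(one_sample_prices, size=(2000, 40), replace=True)
boot_means = pd.DataFrame({"mean_price": resamples.mean(axis=1)})

# Quantiles are given as proportions: 0.025 and 0.975, not 2.5 and 97.5
ci_bounds = boot_means["mean_price"].quantile([0.025, 0.975])
lower, upper = ci_bounds.iloc[0], ci_bounds.iloc[1]

print(f"point estimate: {one_sample_prices.mean():.2f}")
print(f"95% plausible range: {lower:.2f} to {upper:.2f}")
```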

In [None]:
# Your answer here


In [None]:
# Plot bootstrapped distribution with 95% CI


---
This lecture has been adapted from Data Science: A First Introduction (Python Edition), by Trevor Campbell, Tiffany Timbers, Melissa Lee, Lindsey Heagy, and Joel Ostblom of UBC.