# Week 3 Discussion: The Bootstrap

![sf-boots](img/sf-boots.jpg)

In [0]:
set.seed(125)

library(tidyverse)

theme_set(theme_bw())

## Setup

For this section, we will work with salary data from 2014 for SF city employees 
(derived from the 
[Kaggle SF Salaries dataset](https://www.kaggle.com/kaggle/sf-salaries)).

In [0]:
# The data can be downloaded from: 
# https://5harad.com/mse125/discussions/week_3/sf_salaries2014.tsv
salaries <- read_tsv("https://raw.githubusercontent.com/stanford-policylab/mse125/main/week_3/sf_salaries2014.tsv")

# peek at 5 random rows
sample_n(salaries, 5)

In [0]:
nrow(salaries)

According to the data, SF city had 37,427 employees in the year 2014.

For observations that are likely to have extreme outliers, such as salary data,
a common statistic of interest is the median. 
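To see why the median is the safer summary here, consider a minimal sketch using a small made-up vector (`toy_pay` is hypothetical and not taken from the salary data). A single extreme value pulls the mean far upward while the median barely moves.

```r
# Hypothetical salaries (made-up values, for illustration only)
toy_pay <- c(40000, 45000, 50000, 55000, 60000)

# The same values plus one extreme outlier
toy_pay_outlier <- c(toy_pay, 1000000)

mean(toy_pay)            # 50000
median(toy_pay)          # 50000

mean(toy_pay_outlier)    # about 208333.3: dragged upward by the outlier
median(toy_pay_outlier)  # 52500: barely changes
```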

Since we have all the data, we can easily compute the **true** median 
base salary for SF city employees in 2014.

### Warm up

Compute the median, mean, sd, minimum, and maximum salary in SF in 2014. Which job(s) have the lowest pay? Which have the highest pay?

In [0]:
# Your code here!



## A realistic sample of salaries

Now, let's pretend that we don't have this data set, and instead we randomly 
sampled $n = 100$ SF city employees in 2014, and asked for their base
salary. 

In other words, we only see $n$ random rows from the `BasePay` column of
the full data.

In [0]:
n <- 100

# `pull(df, col)` is identical to `df$col`,
# but `pull` is pipeable!
sample_basepay <- salaries %>% 
  sample_n(n) %>% 
  pull(BasePay)  

print(sample_basepay)

Using our "observed" data, we could estimate the median, assuming that 
our sample is representative of the population, by simply taking the median 
of the observed base pay:

In [0]:
median(sample_basepay)

But now comes the difficult part: what is the uncertainty associated with our estimate? And how do we find the sampling distribution of our estimator? 

> Recall from last week's discussion that we can think of the **sampling distribution** of an estimator as the distribution of values that the estimator would take across many parallel universes.

## The idea

One approach: we could (1) take multiple samples of salaries ("parallel universes"), (2) calculate an estimate of the median for each sample, and (3) compute the standard deviation of the estimates across parallel universes.

The problem: parallel universes aren't easily accessible. For example, it may be costly to sample more than once.

Potential solution: if we are willing to assume that our sample is representative of the
population, we could use it to mimic parallel universes. **This is the core idea of the bootstrap.**
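To make the "parallel universes" approach concrete, here is a rough sketch of steps (1) to (3) on a synthetic population. The lognormal `population`, the seed, and the names `n_per_sample` and `n_universes` are all assumptions made for illustration; in practice we usually cannot draw repeated samples like this, which is exactly the gap the bootstrap tries to fill.

```r
set.seed(1)

# Hypothetical population of 10,000 right-skewed "salaries"
# (an assumption standing in for data we normally would not have)
population <- rlnorm(10000, meanlog = 11, sdlog = 0.5)

n_per_sample <- 100   # size of each sample
n_universes <- 1000   # number of "parallel universes"

# (1) draw a fresh sample in each universe and (2) compute its median
universe_medians <- replicate(
  n_universes,
  median(sample(population, n_per_sample))
)

# (3) standard deviation of the estimates across universes
sd(universe_medians)
```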

## Bootstrap samples

We can create an artificial universe by **resampling from our observed data with replacement**, and then computing our estimate over this new "sample".

> 1. Draw $X_1^*,\dots,X_{100}^* \sim \hat{F}_{100}$ 
> 2. Compute $T_{100}^* = \mathrm{median}(X_1^*,\dots,X_{100}^*)$

An analogy would be to write down each of the 100 base salaries we collected, put them in an urn, and then randomly pull numbers out of that urn one at a time, returning each number to the urn before the next draw. 
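The urn analogy can be sketched on a tiny made-up example. The vector `urn` and the seed below are hypothetical and chosen only for illustration; the point is to see what one draw with replacement looks like next to a draw without replacement.

```r
set.seed(2)

# Hypothetical "urn" holding five made-up salaries
urn <- c(41000, 52000, 58000, 67000, 90000)

# Draw as many slips as the urn holds, returning each slip after reading it
boot_draw <- sample(urn, size = length(urn), replace = TRUE)
boot_draw

# Without replacement we only reshuffle the same five values,
# so any statistic matches the original exactly
reshuffled <- sample(urn, size = length(urn), replace = FALSE)
median(reshuffled) == median(urn)  # TRUE
```

With replacement, some values typically appear more than once and others not at all, which is what makes each artificial sample differ from the original.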

Let's try this in `R`!

### Exercise

Generate a single bootstrap sample from our sample of 100 salaries. Calculate the median of the bootstrap sample.

Bonus: Write an `R` function `get_boot_median` to generate a bootstrapped median.

Some things to keep in mind:

- The `sample` command may come in handy. See `?sample`.
- How many observations should we resample from our 100 salaries?
- Why do we need to sample with replacement?

In [0]:
# Your code here!



## Bootstrapped sampling distribution

We can repeat the process above many times to approximate the sampling distribution of the estimator for the median income.

Note that we use the bootstrapped sampling distribution only to calculate the standard error of our estimator, not its expected value! Our best guess of the median itself is still the median of the **original** sample of 100 salaries.
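As a rough illustration of how the pieces fit together, the sketch below repeats the resampling step on a synthetic sample. The lognormal `toy_sample`, the seed, and the choice `B = 2000` are assumptions for this example only; it is one possible way to organize the computation, not the required solution for the salary data.

```r
set.seed(3)

# Hypothetical observed sample of 100 right-skewed values (not the real data)
toy_sample <- rlnorm(100, meanlog = 11, sdlog = 0.5)

B <- 2000  # number of bootstrap replicates (arbitrary choice for this sketch)

# Each replicate: resample with replacement at the original size, take the median
toy_boot_medians <- replicate(
  B,
  median(sample(toy_sample, size = length(toy_sample), replace = TRUE))
)

# Bootstrap standard error: sd of the bootstrapped estimates
se_hat <- sd(toy_boot_medians)

# The point estimate comes from the original sample, not from the bootstrap
theta_hat <- median(toy_sample)

# 95% interval under the normal approximation
normal_ci <- theta_hat + c(-1, 1) * qnorm(0.975) * se_hat
normal_ci
```

A histogram of `toy_boot_medians` (for example, `hist(toy_boot_medians)`) is a quick way to judge whether the normal approximation looks plausible.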

### Exercise

Generate `B` bootstrap samples, and calculate the median of each sample. Using your `B` bootstrapped estimates of the median, calculate the standard error of our estimator of the median income, and construct a confidence interval for the median income using the normal approximation. 

Some things to keep in mind:

- The `replicate` command may come in handy.
- How many bootstrap samples should we generate? In other words, what's a good value of `B`?
- What should be the point estimate of our confidence interval?
- Is the normal approximation reasonable?


In [0]:
# Your code here!



## Pivotal interval

Computing confidence intervals with the normal approximation assumes that the sampling distribution is normal. 

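What if we aren't confident in this assumption? We can use the pivotal method, which gives the $1-\alpha$ confidence interval
$\left(2 \hat{\theta} - \hat{\theta}^*_{1-\alpha/2}, 2 \hat{\theta} - \hat{\theta}^*_{\alpha/2}\right)$,
where $\hat{\theta}$ is the estimate from the original sample and $\hat{\theta}^*_{q}$ is the $q$ quantile of the bootstrapped estimates.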

Note that the pivotal method still assumes that the bootstrapped sampling distribution is a good approximation of the true sampling distribution. The sampling distribution just doesn't have to be normal!
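The pivotal formula can be sketched on synthetic data as follows. The lognormal `toy_sample`, the seed, the 2000 replicates, and the variable names are assumptions for illustration; note how the upper bootstrap quantile enters the lower endpoint and vice versa.

```r
set.seed(4)

# Hypothetical observed sample (not the real salary data)
toy_sample <- rlnorm(100, meanlog = 11, sdlog = 0.5)

# Estimate from the original sample
theta_hat <- median(toy_sample)

# Bootstrapped medians; sample() defaults to size = length(toy_sample)
boot_medians <- replicate(2000, median(sample(toy_sample, replace = TRUE)))

# alpha/2 and 1 - alpha/2 quantiles of the bootstrapped estimates
alpha <- 0.05
q <- quantile(boot_medians, c(alpha / 2, 1 - alpha / 2), names = FALSE)

# Pivotal interval: reflect the bootstrap quantiles around theta_hat
pivotal_ci <- c(lower = 2 * theta_hat - q[2], upper = 2 * theta_hat - q[1])
pivotal_ci
```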

### Exercise

Compute the pivotal interval of the median income. 

In [0]:
# Your code here!

