# (Hesterberg, 2015) What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum

[Link](https://amstat.tandfonline.com/doi/full/10.1080/00031305.2015.1089789)

# Abstract

Bootstrapping has enormous potential, but there are subtle issues and ways to go wrong. The goal of this article is to provide a deeper understanding of bootstrap methods - how they work, when they work or not, and which methods work better - and to highlight pedagogical issues.

# 1. Introduction

The bootstrap is used for estimating standard errors and bias, obtaining confidence intervals, and sometimes for tests.

## 1.1 Verizon Example

## 1.2 One-Sample Bootstrap

Let $\hat\theta$ be a statistic calculated from a sample of $n$ iid observations (i.e., time series and other dependent data are beyond the scope of this article).

In the ordinary *nonparametric bootstrap*, we draw $n$ observations *with replacement* from the original data to create a *bootstrap sample* or *resample*, and calculate the statistic $\hat\theta^{*}$ for this sample (we use $*$ to denote a bootstrap quantity). The bootstrap statistics comprise the *bootstrap distribution*.

Doing this for the two samples in the Verizon example lets us estimate several properties of the corresponding sampling distributions, including:

**standard error**: the *bootstrap standard error* is the sample standard deviation of the bootstrap distribution,

$$s_b=\sqrt{\frac{1}{(r-1)}\sum_{i=1}^{r}(\hat\theta_i^{*}-\bar{\hat\theta}^{*})^2}$$

**confidence intervals**: a quick-and-dirty interval, the *bootstrap percentile interval*, is the range of the middle 95% of the bootstrap distribution

**bias**: the bootstrap bias estimate is $\bar{\hat\theta}^{*}-\hat\theta$

The bootstrap separates the concept of a standard error - the standard deviation of the sampling distribution - from the common formula $s / \sqrt{n}$ for estimating the standard error of a sample mean.
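The three quantities above can be sketched in a few lines of NumPy. This is a minimal illustration, not the article's code: the data are synthetic (an exponential sample standing in for skewed data), and the helper name `one_sample_bootstrap`, the seed values, and $r = 10{,}000$ are assumptions made for the example.

```python
import numpy as np

def one_sample_bootstrap(x, statistic=np.mean, r=10_000, seed=0):
    """Return r bootstrap replicates of `statistic` for a 1-D sample x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Each row of idx indexes one resample of size n drawn with replacement.
    idx = rng.integers(0, n, size=(r, n))
    return np.apply_along_axis(statistic, 1, x[idx])

# Synthetic skewed data (an assumption; this is not the Verizon data).
rng = np.random.default_rng(1)
x = rng.exponential(scale=8.0, size=50)

theta_hat = np.mean(x)
boot = one_sample_bootstrap(x)

se_boot = boot.std(ddof=1)                         # bootstrap standard error s_b
bias_boot = boot.mean() - theta_hat                # bootstrap bias estimate
ci_percentile = np.percentile(boot, [2.5, 97.5])   # middle 95% of the bootstrap distribution

print(f"estimate={theta_hat:.3f}  SE={se_boot:.3f}  bias={bias_boot:.3f}")
print("95% percentile interval:", ci_percentile)
```

For the sample mean, `se_boot` should land close to the formula value $s/\sqrt{n}$; passing a different `statistic` (for example `np.median`) reuses the same machinery where no simple formula exists.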

## 1.3 Two-Sample Bootstrap

For a two-sample bootstrap, we independently draw bootstrap samples with replacement from each sample, and compute a statistic that compares the samples (for example, the difference in means $\bar x_1 - \bar x_2$). The bootstrap distribution is centered at the observed statistic; it is used for confidence intervals and standard errors.
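A minimal sketch of the two-sample procedure follows. The data are synthetic, and the function name `two_sample_bootstrap` and the sample sizes are illustrative assumptions; the key point is that each group is resampled independently at its own size.

```python
import numpy as np

def two_sample_bootstrap(x1, x2, r=10_000, seed=0):
    """Bootstrap replicates of mean(x1) - mean(x2), resampling each sample independently."""
    rng = np.random.default_rng(seed)
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    # Resample each group with replacement, keeping its own sample size.
    m1 = x1[rng.integers(0, len(x1), size=(r, len(x1)))].mean(axis=1)
    m2 = x2[rng.integers(0, len(x2), size=(r, len(x2)))].mean(axis=1)
    return m1 - m2

# Synthetic data with unequal sample sizes (an assumption for illustration).
rng = np.random.default_rng(2)
x1 = rng.exponential(scale=8.0, size=200)
x2 = rng.exponential(scale=16.0, size=23)

observed = x1.mean() - x2.mean()
boot_diff = two_sample_bootstrap(x1, x2)
se_diff = boot_diff.std(ddof=1)
ci_diff = np.percentile(boot_diff, [2.5, 97.5])

print(f"observed difference={observed:.3f}  bootstrap SE={se_diff:.3f}")
print("95% percentile interval:", ci_diff)
```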

## 1.4 Bootstrap $t$-distribution

It is not surprising that $t$ procedures are inaccurate for skewed data with a sample of size 23, or for the difference when one sample is that small.

More surprising is how bad $t$ confidence intervals are for the larger sample size of 1664. To see this, we bootstrap $t$ statistics.

Above, we resampled *univariate* distributions of *estimators* like $\bar x$ or $\bar x_1 - \bar x_2$. Here we look at joint distributions, such as the joint distribution of $\bar X$ and $s$, and at distributions of statistics that depend on both $\hat\theta$ and $\theta$.

To estimate the sampling distribution of $\hat\theta-\theta$ ($\hat\theta$ is the sample statistic and $\theta$ is the true population value), we use the bootstrap distribution of $\hat\theta^{*}-\hat\theta$. The bootstrap bias estimate is $E(\hat\theta^{*}-\hat\theta)$, an estimate of $E(\hat\theta-\theta)$. Similarly, to estimate the sampling distribution of a $t$-statistic

$$t=\frac{\hat\theta-\theta}{\text{SE}}$$

where $\text{SE}$ is a standard error calculated from the original sample, we use the bootstrap distribution of

$$t^{*}=\frac{\hat\theta^{*}-\hat\theta}{\text{SE}^{*}}$$

The amount of skewness apparent in the bootstrap t-distribution matters. The bootstrap distribution is a sampling distribution, not raw data; the CLT has already had its one chance to work. At this point, any deviations indicate errors in procedures that assume normal or t sampling distributions.

A common flaw in statistical practice is to fail to judge how accurate standard CLT-based methods are for specific data; the bootstrap t-distribution provides an effective way to do so.
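As a sketch of how one might bootstrap $t$ statistics for a sample mean, the code below recomputes both the mean and the standard error on every resample. The data, seed, and function name `bootstrap_t_statistics` are assumptions for illustration only.

```python
import numpy as np

def bootstrap_t_statistics(x, r=10_000, seed=0):
    """Return r bootstrap t statistics (theta* - theta_hat) / SE* for the sample mean."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta_hat = x.mean()
    resamples = x[rng.integers(0, n, size=(r, n))]
    theta_star = resamples.mean(axis=1)
    # SE* is recomputed from each resample, so t* reflects the joint
    # behavior of the resampled mean and standard deviation.
    se_star = resamples.std(axis=1, ddof=1) / np.sqrt(n)
    return (theta_star - theta_hat) / se_star

# Synthetic right-skewed sample (an assumption for illustration).
rng = np.random.default_rng(3)
x = rng.exponential(scale=8.0, size=30)
t_star = bootstrap_t_statistics(x)

q_lo, q_hi = np.percentile(t_star, [2.5, 97.5])
print(f"2.5% and 97.5% quantiles of t*: {q_lo:.2f}, {q_hi:.2f}")
```

Comparing these quantiles with the symmetric values from a $t$ table (or plotting a histogram of `t_star`) is one way to judge how well a CLT-based procedure fits the data at hand.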

> To summarize, you can use the bootstrap to estimate the sampling distribution of a *statistic*. For example, when you perform a t-test, you are likely assuming via the CLT that the t-statistic has sufficiently converged to normal. Since the statistic involves $\bar X$, you only see that one point and not the distribution of the statistic. The bootstrap lets you see that distribution.

## 1.5 Pedagogical and Practical Value

Students can obtain confidence intervals by working directly with the statistic of interest, rather than using a t-statistic. In mathematical statistics, students can use the bootstrap to help understand joint distributions of estimators like $\bar X$ and $s$, and to understand the distribution of $t$ statistics, and compute bootstrap t confidence intervals.

Resampling is also important in practice. It often provides the only practical way to do inference - when it is too difficult to derive formulas, or the data are stored in a way that makes calculating the formulas impractical.




# 2. The Idea Behind Bootstrapping

Inferential statistics is based on sampling distributions. In theory, to get these we draw (all or infinitely many) samples from the *population*, and compute the statistic of interest for each sample (such as the mean, median, etc.).

The distribution of the statistics is the *sampling distribution*.

In practice we cannot draw arbitrarily many samples from the population; we have only one sample. The bootstrap idea is to draw samples from *an estimate* of the population, in lieu of the population, and compute the statistic of interest for each sample. The distribution of the statistics is the *bootstrap distribution*.

## 2.1 Plug-In Principle

The bootstrap is based on the *plug-in principle* - if something is unknown, we substitute an estimate for it. For example, the sd of the sample mean is $\sigma /\sqrt{n}$; when $\sigma$ is unknown we substitute an estimate $s$, the sample standard deviation.

With the bootstrap we go one step farther - instead of plugging in an estimate for a single parameter, we plug in an estimate for the whole population $F$.

So what do we substitute for $F$? Options include the nonparametric, parametric, and smoothed bootstrap. The primary focus of this article is the nonparametric bootstrap (the most common), which consists of drawing samples from the empirical distribution $\hat F_n$ (with probability $1/n$ on each observation), that is, drawing samples with replacement from the data.

In the parametric bootstrap, we assume a model (e.g., a gamma with unknown shape and scale), estimate parameters for that model, then draw bootstrap samples from the model with those estimated parameters.

The smoothed bootstrap is a compromise between the parametric and nonparametric approaches. If we believe the population is continuous, we may sample from a continuous $\hat F$, say a kernel density estimate. Smoothing is not common; it is rarely needed, and does not generalize well to multivariate and factor data.
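The three choices of $\hat F$ differ only in how a resample is generated, which the sketch below makes concrete. Several things here are assumptions for illustration: the data are synthetic, the parametric model is an exponential (chosen because it has one parameter and needs only NumPy, whereas the text's example is a gamma), and the kernel bandwidth `h` is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=8.0, size=40)   # synthetic data, an assumption
n, r = len(x), 5_000

# Nonparametric: sample from the empirical distribution (the data, with replacement).
nonparametric = x[rng.integers(0, n, size=(r, n))]

# Parametric: assume an exponential model (an assumption of this sketch),
# estimate its mean from the data, then sample from the fitted model.
parametric = rng.exponential(scale=x.mean(), size=(r, n))

# Smoothed: sample from a Gaussian kernel density estimate, which is equivalent
# to resampling the data and adding N(0, h^2) noise. h is a hypothetical bandwidth.
h = x.std(ddof=1) / np.sqrt(n)
smoothed = x[rng.integers(0, n, size=(r, n))] + rng.normal(0.0, h, size=(r, n))

for name, samples in [("nonparametric", nonparametric),
                      ("parametric", parametric),
                      ("smoothed", smoothed)]:
    means = samples.mean(axis=1)
    print(f"{name:>13}: bootstrap SE of the mean = {means.std(ddof=1):.3f}")
```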

## 2.2 Fundamental Bootstrap Principle

The fundamental bootstrap principle is that this substitution *usually works* - that we can plug in an estimate for $F$, then sample, and the resulting bootstrap distribution provides useful information about the sampling distribution.

## 2.3 Inference, Not Better Estimates

*The bootstrap distribution is centered at the observed statistic, not the population parameter*, for example, at $\bar x$, not $\mu$.

This has two profound implications.

1. We do not use the mean of the bootstrap statistics as a replacement for the original estimate - we cannot use the bootstrap to improve on $\bar x$. Instead, we use the bootstrap to tell how accurate the original estimate is.

In this regard, the bootstrap is like formula methods that use the data twice - once to compute an estimate, and again to compute a standard error for the estimate. The bootstrap just uses a different approach to estimating the standard error.

2. We do not use the CDF or quantiles of the bootstrap distribution of $\hat\theta^{*}$ to estimate the CDF or quantiles of the sampling distribution of an estimator $\hat\theta$. Instead, we bootstrap to estimate things like the standard deviation, the expected value of $\hat\theta - \theta$, and the CDF and quantiles of $\hat\theta-\theta$ or $(\hat\theta-\theta)/\text{SE}$.

## 2.4 Key Idea Versus Implementation Details

If $n$ is small we could evaluate all possible bootstrap samples. This is called an *exhaustive bootstrap* or *theoretical bootstrap*. Since exhaustive methods are usually infeasible, we draw a large number of random samples (say 10,000) instead; we call this the *Monte Carlo sampling implementation*.
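For a very small sample the two implementations can be compared directly. The sketch below enumerates all $n^n$ ordered resamples of a made-up five-point sample (an assumption for illustration) and compares the exact bootstrap standard error of the mean with its Monte Carlo approximation.

```python
import itertools
import numpy as np

x = np.array([3.0, 7.0, 8.0, 12.0, 20.0])   # tiny made-up sample
n = len(x)

# Exhaustive (theoretical) bootstrap: all n**n equally likely ordered resamples.
all_means = np.array([np.mean(s) for s in itertools.product(x, repeat=n)])
se_exhaustive = all_means.std(ddof=0)        # exact SD over the full enumeration

# Monte Carlo implementation: r random resamples.
rng = np.random.default_rng(5)
r = 10_000
mc_means = x[rng.integers(0, n, size=(r, n))].mean(axis=1)
se_monte_carlo = mc_means.std(ddof=1)

sigma_hat = x.std(ddof=0)                    # plug-in SD (divisor n)
print("number of resamples enumerated:", len(all_means))
print("exhaustive SE:", se_exhaustive, " sigma_hat/sqrt(n):", sigma_hat / np.sqrt(n))
print("Monte Carlo SE:", se_monte_carlo)
```

The enumeration grows as $n^n$ (already 3125 resamples for $n = 5$), which is why the Monte Carlo implementation is the practical default.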

## 2.5 How to Sample

We typically sample with the same size as the original data - because by doing so the standard errors reflect the actual data, rather than a hypothetical larger or smaller dataset.



# 3. Variation in Bootstrap Distributions

We claimed above that the bootstrap distribution usually provides useful information about the sampling distribution. We elaborate on that here and address two questions:
1. How accurate is the theoretical (exhaustive) bootstrap?
2. How accurately does the Monte Carlo implementation approximate the theoretical bootstrap?

## 3.1 Sample Mean: Large Sample Size

Remember, the bootstrap:
- does not provide a better estimate of the population parameter
- similarly, quantiles of the bootstrap distributions are *not* useful for estimating quantiles of the sampling distribution
- instead, the bootstrap distributions are useful for estimating the *spread* and *shape* of the sampling distribution.

We also see examples with $r = 1000$ versus $r = 10^4$ resamples. Using more resamples reduces random Monte Carlo variation, but does not fundamentally change the bootstrap distribution - it still has the same approximate center, spread, and shape.

The variation that comes from Monte Carlo sampling is much smaller than the variation due to different original samples, so for many uses, such as quick-and-dirty estimation of standard errors for the test statistic, $r = 1000$ resamples is adequate.

## 3.2 Sample Mean: Small Sample Size

The bootstrap distributions tend to be too narrow on average, by a factor of $\sqrt{(n-1)/n}$ for the sample mean and approximately that for many other statistics.


This goes back to the plug-in principle; the empirical distribution has variance $\hat\sigma^2=\text{var}_{\hat F_n}(X)=\frac{1}{n}\sum (x_i-\bar x)^2$, and the theoretical bootstrap standard error is the standard deviation of a mean of $n$ independent observations from that distribution, $s_b=\hat\sigma / \sqrt{n}$. That is, it is smaller than the usual formula $s/\sqrt{n}$ by a factor of $\sqrt{(n-1)/n}$.

This *narrowness bias* and the variability in spread make some bootstrap confidence intervals under-cover.
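The factor can be checked in one line from the two variance definitions (divisor $n$ for the plug-in variance, divisor $n-1$ for $s^2$):

$$\hat\sigma^2=\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar x)^2=\frac{n-1}{n}\,s^2 \quad\Longrightarrow\quad s_b=\frac{\hat\sigma}{\sqrt{n}}=\sqrt{\frac{n-1}{n}}\cdot\frac{s}{\sqrt{n}}$$

For example, with $n=10$ the factor is $\sqrt{0.9}\approx 0.949$, so the bootstrap standard error is about 5% too small; with $n=100$ it is $\sqrt{0.99}\approx 0.995$.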

## 3.3 Sample Median

In small samples, the ordinary bootstrap tends not to work well for statistics such as the median or other quantiles, which depend heavily on a small number of observations out of a larger sample.

## 3.4 Mean-Variance Relationship

In many applications, the spread or shape of the sampling distribution depends on the parameter of interest. For example, for an exponential distribution, the standard deviation of the sampling distribution of $\bar x$ is proportional to $\mu$.
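To spell out the exponential case: an exponential population with mean $\mu$ has standard deviation $\sigma=\mu$, so

$$\text{sd}(\bar X)=\frac{\sigma}{\sqrt{n}}=\frac{\mu}{\sqrt{n}}$$

Samples that happen to have a larger mean therefore tend to have a larger spread as well, and so tend to produce wider bootstrap distributions.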

## 3.5 Summary of Visual Lessons

The bootstrap distribution reflects the original *sample*. If the sample is narrower than the population, the bootstrap distribution is narrower than the sampling distribution.

Typically for large samples the data represent the population well.

*Bootstrapping does not overcome the weakness of small samples as a basis for inference*. For very small samples, it may be better to make additional assumptions such as a parametric family.

Looking ahead, two things matter for accurate inferences:
1. How close the bootstrap distribution is to the sampling distribution
2. How well the procedures allow for variation in samples (e.g., fudge factor).

## 3.6 How Many Bootstrap Samples

A bootstrap distribution based on $r$ random samples corresponds to drawing $r$ observations with replacement from the theoretical bootstrap distribution.

We can quantify the Monte Carlo variation in two ways - using formulas, or by bootstrapping.

Let $G$ be the CDF of a theoretical bootstrap distribution and $\hat G$ the Monte Carlo approximation, then the variance of $\hat G(x)$ is $G(x)(1-G(x))/r$, which we can estimate using

$$\hat G(x)(1-\hat G(x))/r$$
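As a worked example of the formula, the sketch below evaluates the Monte Carlo standard deviation of $\hat G(x)$ at a point where the estimated CDF is 0.025 (the lower tail used by a 95% interval). The function name and the particular values of $r$ are assumptions for illustration.

```python
import math

def monte_carlo_sd_of_cdf(g_hat, r):
    """Estimated SD of the Monte Carlo CDF estimate at a point where it equals g_hat."""
    return math.sqrt(g_hat * (1.0 - g_hat) / r)

for r in (1_000, 10_000, 100_000):   # hypothetical numbers of resamples
    sd = monte_carlo_sd_of_cdf(0.025, r)
    print(f"r={r:>7}: SD of G_hat at the 2.5% point = {sd:.5f}")
```

With $r = 1000$ the standard deviation is about 0.0049, roughly 20% of the 0.025 being estimated; with $r = 10^4$ it drops to about 0.0016.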





# 4. Confidence Intervals

This section describes a number of confidence intervals and compares their pedagogical value and accuracy.

## 4.1 Statistics 101 - Percentile, and $t$ with Bootstrap SE

Neither the bootstrap percentile interval nor the $t$ interval with bootstrap SE is very accurate. They are only first-order accurate, and are poor in small samples, where they tend to be too narrow. The bootstrap SE is too small by a factor of $\sqrt{(n-1)/n}$, so the $t$ interval with bootstrap SE is too narrow by that factor.

The percentile interval suffers the same narrowness and more.

In practice, the $t$ with bootstrap standard error offers no advantage over a standard $t$ procedure for the sample mean. Its advantages are pedagogical, and that it can be used for statistics that lack easy standard error formulas.

The percentile interval is not a good alternative to standard $t$ intervals for the mean of small samples - while it handles skewed populations better, it is less accurate for small samples because it is too narrow. For exponential populations, it is less accurate than standard $t$ for $n\leq 34$.

## 4.2 Reverse Bootstrap Percentile Interval

## 4.3 Bootstrap $t$ Interval

The bootstrap $t$ confidence interval is based on the $t$ statistic, but estimates quantiles of the actual distribution using the data rather than a table.
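A minimal sketch of a bootstrap $t$ interval for a sample mean is given below. It uses the standard construction in which quantiles of $t^{*}$ replace the table values, so the interval is $(\hat\theta - q_{1-\alpha/2}\,\text{SE},\ \hat\theta - q_{\alpha/2}\,\text{SE})$. The function name, data, and seeds are assumptions for illustration.

```python
import numpy as np

def bootstrap_t_interval(x, level=0.95, r=10_000, seed=0):
    """Bootstrap t confidence interval for the mean of a 1-D sample x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta_hat = x.mean()
    se = x.std(ddof=1) / np.sqrt(n)
    resamples = x[rng.integers(0, n, size=(r, n))]
    se_star = resamples.std(axis=1, ddof=1) / np.sqrt(n)
    t_star = (resamples.mean(axis=1) - theta_hat) / se_star
    alpha = 1.0 - level
    q_lo, q_hi = np.quantile(t_star, [alpha / 2, 1 - alpha / 2])
    # Note the reversal: the upper quantile of t* sets the lower endpoint.
    return theta_hat - q_hi * se, theta_hat - q_lo * se

# Synthetic right-skewed sample (an assumption for illustration).
rng = np.random.default_rng(9)
x = rng.exponential(scale=8.0, size=30)
lo, hi = bootstrap_t_interval(x)
print(f"mean={x.mean():.3f}  95% bootstrap t interval=({lo:.3f}, {hi:.3f})")
```

Unlike a table-based $t$ interval, the endpoints need not be symmetric about the estimate, because the quantiles of $t^{*}$ are estimated from the data.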

## 4.4 Confidence Interval Accuracy

For a 95% interval, a perfectly accurate interval misses the parameter 2.5% of the time on each side.

## 4.5 Skewness and Mean-Variance Relationship

## 4.6 Confidence Interval Details

## 4.7 Bootstrap Hypothesis Testing

There are two broad approaches.

1. Invert a CI - reject $H_0$ if the corresponding interval excludes $\theta_0$.

2. Sample in a way that is consistent with $H_0$, then calculate a $p$-value as a tail probability. However, this is not as accurate as a permutation test.
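The second approach can be sketched for a two-sample comparison of means. One common way to sample consistently with $H_0$ (assumed here, not taken from these notes) is to pool the two samples and draw both groups from the pool with replacement; the permutation test does the same without replacement. The function name, data, and one-sided alternative are assumptions for illustration.

```python
import numpy as np

def two_sample_tests(x1, x2, r=10_000, seed=0):
    """One-sided p-values for H0: same distribution, statistic mean(x1) - mean(x2)."""
    rng = np.random.default_rng(seed)
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    pooled = np.concatenate([x1, x2])
    observed = x1.mean() - x2.mean()

    # Bootstrap test: draw both groups WITH replacement from the pooled data.
    b = pooled[rng.integers(0, n1 + n2, size=(r, n1 + n2))]
    boot_stats = b[:, :n1].mean(axis=1) - b[:, n1:].mean(axis=1)

    # Permutation test: reshuffle the pooled data WITHOUT replacement.
    p = rng.permuted(np.tile(pooled, (r, 1)), axis=1)
    perm_stats = p[:, :n1].mean(axis=1) - p[:, n1:].mean(axis=1)

    # Add 1 to numerator and denominator so the observed statistic counts as one draw.
    p_boot = (np.sum(boot_stats >= observed) + 1) / (r + 1)
    p_perm = (np.sum(perm_stats >= observed) + 1) / (r + 1)
    return p_boot, p_perm

# Synthetic samples with a clear difference in means (an assumption).
rng = np.random.default_rng(8)
x1 = rng.normal(2.0, 1.0, size=30)
x2 = rng.normal(0.0, 1.0, size=30)
p_boot, p_perm = two_sample_tests(x1, x2)
print(f"bootstrap p-value={p_boot:.4f}  permutation p-value={p_perm:.4f}")
```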








# 5. Regression

There are two ways in which bootstrapping in regression is particularly useful pedagogically.

1. Understand the variability of regression predictions by a graphical bootstrap.

A bootstrap percentile CI for $E(Y|x)$ is the range of the middle 95% of the $y$ values for regression lines at any $x$; these intervals are wider for more extreme $x$.

2. Understand the difference between confidence and prediction intervals.

The bootstrap estimates the performance of the model that was actually fit to the data, regardless of whether that is a poor model.
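The graphical bootstrap for a simple linear regression can be sketched as follows: resample $(x, y)$ pairs, refit the line each time, and take pointwise percentiles of the fitted lines. The data-generating model, grid size, and $r = 2000$ are assumptions for illustration; plotting the rows of `fitted` would give the graphical version.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 40
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 3.0, size=n)   # synthetic data, an assumption

r = 2_000
grid = np.linspace(x.min(), x.max(), 25)
fitted = np.empty((r, grid.size))
for b in range(r):
    idx = rng.integers(0, n, size=n)               # resample (x, y) pairs together
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    fitted[b] = intercept + slope * grid

# Pointwise 95% percentile band for E(Y | x) along the grid.
lower, upper = np.percentile(fitted, [2.5, 97.5], axis=0)
width = upper - lower
print("band width at left edge, center, right edge:",
      width[0], width[grid.size // 2], width[-1])
```

The band is for the mean response $E(Y|x)$ only; a prediction interval for a new observation would also need to account for the residual variation around the line.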

## 5.1 Resample Observations or Conditional Distributions

Two common procedures when bootstrapping regression are

1. bootstrap observations
2. bootstrap residuals
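The two procedures can be compared side by side for the slope of a simple linear regression. This is a sketch under assumptions: the data are synthetic with constant error variance, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 3.0, size=n)   # synthetic data, an assumption

slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
residuals = y - fitted

r = 2_000
slopes_obs = np.empty(r)
slopes_res = np.empty(r)
for b in range(r):
    # 1. Bootstrap observations: resample (x, y) pairs, so the x values vary.
    idx = rng.integers(0, n, size=n)
    slopes_obs[b] = np.polyfit(x[idx], y[idx], deg=1)[0]
    # 2. Bootstrap residuals: keep x fixed, add resampled residuals to the fit.
    y_star = fitted + residuals[rng.integers(0, n, size=n)]
    slopes_res[b] = np.polyfit(x, y_star, deg=1)[0]

print(f"fitted slope={slope:.3f}")
print(f"SE (bootstrap observations)={slopes_obs.std(ddof=1):.3f}")
print(f"SE (bootstrap residuals)   ={slopes_res.std(ddof=1):.3f}")
```

Resampling residuals conditions on the observed $x$ values and relies on the fitted model being adequate, whereas resampling observations treats the $(x, y)$ pairs as a random sample.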



