# Bootstrapping regression

## Outline

- Basics of the bootstrap

- Pairs bootstrap

- Residual bootstrap

## Distribution of least squares estimators (random X)

- Recall...

- Suppose $(X_i,Y_i) \overset{IID}{\sim} G, 1 \leq i \leq n$ with $X_i \in \mathbb{R}^{p+1}$.
(We assume that $X_{i,1}=1$ to handle the intercept.)

- Let $X$ be the $n \times (p+1)$ design matrix and $Y$ the $n \times 1$ response vector and
$$
\hat{\beta} =  \hat{\beta}_n = (X^TX)^{-1}(X^TY).
$$

- Define 
$$
\beta(G) = E_G(X_1X_1^T)^{-1} E_G(X_1\cdot Y)
$$
and
$$
\epsilon_i = \epsilon_i(G) = Y_i - X_i^T\beta(G).
$$

- Then, 
$$
n^{1/2} \left(\hat{\beta}_n - \beta(G) \right) \overset{n \to \infty}{\to} N\left(0, E_G(X_1X_1^T)^{-1} Var_G(X_1 \cdot \epsilon_1) E_G(X_1X_1^T)^{-1} \right)
$$


- **Q: How can we estimate covariance?**

- **A: Pairs bootstrap does this implicitly.**

- Good reading: [arxiv.org/1404.1578](http://arxiv.org/abs/1404.1578)
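
The covariance in the limit above has a direct plug-in estimate: replace the expectations by sample averages and $\epsilon_i$ by the least squares residuals. A minimal sketch in base R on simulated data follows. The data-generating choices and the names `bread`, `meat` and `sandwich_cov` are illustrative assumptions, not part of the lecture; the result is the heteroscedasticity-consistent form usually called HC0.

```r
set.seed(1)
n_s = 200
x_s = rexp(n_s)
y_s = 3 + 0.5 * x_s + x_s * rnorm(n_s)   # error variance depends on x
fit_s = lm(y_s ~ x_s)

Xmat = model.matrix(fit_s)               # n x (p+1), first column is the intercept
e_s = resid(fit_s)

bread = solve(crossprod(Xmat))           # (X^T X)^{-1}
meat = crossprod(Xmat * e_s)             # sum_i e_i^2 x_i x_i^T
sandwich_cov = bread %*% meat %*% bread  # plug-in estimate of Cov(beta_hat)

sqrt(diag(sandwich_cov))                 # sandwich standard errors
sqrt(diag(vcov(fit_s)))                  # usual OLS standard errors, for comparison
```

The pairs bootstrap targets the same covariance without forming it explicitly.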

## Bootstrapping the sample mean

- Suppose $\mathbb{R} \ni X_i \overset{IID}{\sim} F, 1 \leq i \leq n$.

- Set 
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i.$$

- We want to know something about $\mu=E_F(X_1)$ based on $\bar{X}_n$.

- What if we know the distribution of $\bar{X}_n$? Or maybe $\bar{X}_n - \mu$?

In [None]:
set.seed(0) # reproducibility
mu = 2
n = 50
noise_scale = 4
noise = function(n) {
    return(noise_scale*(rexp(n) - 1)) # skewed, but centered to have mean 0
}

X = mu + noise(n)
Xbar = mean(X)
Xbar

## Sampling distribution of $\bar{X}_n$

In [None]:
B = 50000
Xbar_sample = numeric(B) # preallocate; growing a vector with c() is quadratic in B

for (i in 1:B) {
    Xbar_sample[i] = mean(mu + noise(n))
}
plot(density(Xbar_sample), col='red')
abline(v=mu)

## Bootstrap distribution

- The bootstrap draws $n$ entries from $X$ with replacement and recomputes
the sample mean of these $n$ entries. The distribution of these recomputed means is the bootstrap distribution.

In [None]:
bootstrap_sample = numeric(B) # preallocate instead of growing with c()

for (i in 1:B) {
    X_b = sample(X, n, replace=TRUE)
    bootstrap_sample[i] = mean(X_b)
}
plot(density(bootstrap_sample), col='blue')
lines(density(Xbar_sample), col='red')
abline(v=Xbar, lty=2)
abline(v=mu)

## Bootstrap distribution

- The bootstrap distribution is centered around $\bar{X}$.

- The true sampling distribution is centered around $\mu$.

In [None]:
plot(density(bootstrap_sample - Xbar), col='blue')
lines(density(Xbar_sample - mu), col='red')
abline(v=0)

## Increasing the sample size

In [None]:
n = 100
X = mu + noise(n)
Xbar = mean(X)
Xbar_sample = numeric(B)
bootstrap_sample = numeric(B)
for (i in 1:B) {
    Xbar_sample[i] = mean(mu + noise(n))
    X_b = sample(X, n, replace=TRUE)
    bootstrap_sample[i] = mean(X_b)
}

plot(density(Xbar_sample - mu), col='red')
lines(density(bootstrap_sample - Xbar), col='blue')
abline(v=0)

## Justification of the bootstrap

- Whatever $F$ is, under reasonable assumptions
$\bar{X}_n$ is close to normally distributed for large enough $n$, centered around $\mu$.

- If we knew $F$ we could just repeatedly sample from $F$ to get the sampling
distribution of $\bar{X}_n$.

- Our best guess at $F$ is $\hat{F}_n$, the empirical distribution function of $\{X_1, \dots, X_n\}$.

- Under $\hat{F}_n$, the mean $\bar{X}^*_n$ of a sample of size $n$ is also 
asymptotically normal, centered around $\bar{X}_n$. 
The bootstrap repeats this draw many times.

- The law of $\bar{X}_n - \mu$ is close to the law of $\bar{X}^*_n- \bar{X}_n$ given the data. The closeness can be measured in terms of how close $\hat{F}_n$ is to $F$. 

- Therefore, things that require the distribution of $\bar{X}_n-\mu$ can be approximated with
the distribution of $\bar{X}^*_n -\bar{X}_n$.
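
One thing that requires the distribution of $\bar{X}_n-\mu$ is a confidence interval for $\mu$. The sketch below substitutes bootstrap quantiles of $\bar{X}^*_n-\bar{X}_n$ for the unknown quantiles of $\bar{X}_n-\mu$, which gives what is often called the basic bootstrap interval. The variable names, the seed and the number of resamples are illustrative assumptions.

```r
set.seed(2)
mu_true = 2                               # known only because we simulate
x_obs = mu_true + 4 * (rexp(50) - 1)      # same skewed, mean-zero noise model as above
xbar_obs = mean(x_obs)

# bootstrap draws of Xbar* - Xbar
boot_diff = replicate(5000, mean(sample(x_obs, replace = TRUE))) - xbar_obs

# If q_lo and q_hi are the 2.5% and 97.5% quantiles of Xbar - mu, then
# P(q_lo <= Xbar - mu <= q_hi) = 0.95 puts mu in [Xbar - q_hi, Xbar - q_lo].
q = quantile(boot_diff, c(0.025, 0.975))
basic_interval = c(lower = xbar_obs - q[[2]], upper = xbar_obs - q[[1]])
basic_interval
```

Note that the upper bootstrap quantile sets the lower endpoint and vice versa, which matters when the distribution is skewed.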

In [None]:
plot(ecdf(X), col='red')
xval = seq(-5,20,length=100)
# true CDF of X = mu + noise_scale * (Exp(1) - 1); pexp() is 0 to the left of the support
lines(xval, pexp((xval - mu) / noise_scale + 1), lwd=2)

# Bootstrapping regression

## Random X

In this model, the quantities that are IID are the pairs $(X_i,Y_i) \sim G$.

## Pairs bootstrap

- Draw $n$ pairs $(X_{i,b}^*, Y_{i,b}^*), 1 \leq i \leq n$ with replacement from all cases.

- Compute
$$
\hat{\beta}^*_b = ((X^*_b)^TX^*_b)^{-1}(X^*_b)^TY^*_b, \qquad 1 \leq b \leq B
$$

- The distribution of $\hat{\beta}^*_b - \hat{\beta}$ (given the data) can be used as an approximation of the distribution of $\hat{\beta}-\beta(G)$.

- To first order, this is similar to using
$$
\tilde{\beta}^*_b = \hat{\beta} + (X^TX)^{-1}(X^*_b)^T(Y^*_b - X^*_b\hat{\beta}).
$$
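
The resampling step can also be written by hand in a few lines of base R. This is a sketch on simulated heteroscedastic data; the names `Xd`, `slope_star` and `B_p` and the choice of 2000 resamples are assumptions made for illustration.

```r
set.seed(3)
n_p = 50
x_p = rexp(n_p)
y_p = 3 + 0.5 * x_p + x_p * rnorm(n_p)     # error SD proportional to x
Xd = cbind(1, x_p)                         # design matrix with intercept column
beta_hat = solve(crossprod(Xd), crossprod(Xd, y_p))

B_p = 2000
slope_star = numeric(B_p)
for (b in 1:B_p) {
    idx = sample(n_p, n_p, replace = TRUE) # resample cases (rows), keeping each (X_i, Y_i) together
    Xb = Xd[idx, , drop = FALSE]
    yb = y_p[idx]
    slope_star[b] = solve(crossprod(Xb), crossprod(Xb, yb))[2]
}

sd(slope_star)                             # bootstrap SE of the slope
quantile(slope_star, c(0.025, 0.975))      # percentile interval for the slope
```

Resampling row indices rather than `x_p` and `y_p` separately is what preserves the joint distribution $G$ of the pairs.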

### [Reference for more R examples](https://socserv.socsci.mcmaster.ca/jfox/Books/Companion/appendix/Appendix-Bootstrapping.pdf)

In [None]:
library(car)
n = 50
X = rexp(n)
Y = 3 + 0.5 * X + X * rnorm(n)
Y.lm = lm(Y ~ X)
confint(Y.lm) # Gauss model is false here: the error SD is proportional to X

In [None]:
pairs.Y.lm = Boot(Y.lm, coef, method='case')
confint(pairs.Y.lm)

# Bootstrapping regression

## Fixed X

In this model, the quantities that are IID are the errors
$$
\epsilon_i = Y_i - X_i^T\beta.
$$

## Residual bootstrap

- Compute
$$
\hat{\beta} = (X^TX)^{-1}X^TY
$$
and
$$
e = Y - X\hat{\beta} = Y - \hat{Y}.
$$

- Draw $n$ errors $\epsilon^*_{i,b}$ with replacement from $(e_{1}, \dots, e_{n})$ and form
$$
Y^*_b = X\hat{\beta} + \epsilon^*_b
$$

- Compute
$$
\hat{\beta}^*_b = (X^TX)^{-1}X^TY^*_b.
$$

- Bootstrap approximation: the law of $X^TY - E[X^TY]$ is close to the law of $X^TY^*_b - X^TY$.
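
The three steps above can be sketched by hand in base R. The simulated data use IID errors so that the fixed-X model actually holds; the names `fit_r`, `XtX_inv_Xt` and `beta_star` are illustrative assumptions.

```r
set.seed(4)
n_r = 50
x_r = rexp(n_r)
y_r = 3 + 0.5 * x_r + rnorm(n_r)           # IID errors, so the fixed-X model holds
fit_r = lm(y_r ~ x_r)

Xr = model.matrix(fit_r)                   # held fixed across bootstrap samples
fitted_r = fitted(fit_r)                   # X beta_hat
e_r = resid(fit_r)
XtX_inv_Xt = solve(crossprod(Xr), t(Xr))   # (X^T X)^{-1} X^T, computed once

B_r = 2000
beta_star = matrix(NA_real_, B_r, 2)
for (b in 1:B_r) {
    y_star = fitted_r + sample(e_r, n_r, replace = TRUE)  # Y* = X beta_hat + eps*
    beta_star[b, ] = XtX_inv_Xt %*% y_star
}

apply(beta_star, 2, sd)                      # bootstrap SEs for (intercept, slope)
summary(fit_r)$coefficients[, "Std. Error"]  # usual OLS SEs, for comparison
```

Because $X$ never changes, $(X^TX)^{-1}X^T$ is computed once and only the response is redrawn.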


In [None]:
resid.Y.lm = Boot(Y.lm, coef, method='resid')
confint(resid.Y.lm)

## How is the coverage?

In [None]:
simulate_correct = function(n=50, b=0.5) {
    X = rexp(n)
    Y = 3 + b * X + noise(n)
    Y.lm = lm(Y ~ X)

    # parametric interval
    
    int_param = confint(Y.lm)[2,]
    
    # pairs bootstrap interval
    
    pairs.Y.lm = Boot(Y.lm, coef, method='case')
    int_pairs = confint(pairs.Y.lm)[2,]
    
    # resid bootstrap interval
    
    resid.Y.lm = Boot(Y.lm, coef, method='resid')
    int_resid = confint(resid.Y.lm)[2,]
    
    return(c((int_param[1] < b) * (int_param[2] > b),
             (int_pairs[1] < b) * (int_pairs[2] > b),
             (int_resid[1] < b) * (int_resid[2] > b)))
}

simulate_correct()

In [None]:
nsim = 100
coverage = c()
for (i in 1:nsim) {
    coverage = rbind(coverage, simulate_correct())
}
colnames(coverage) = c('parametric', 'pairs bootstrap', 'residual bootstrap')
print(apply(coverage, 2, mean))

## Misspecified model

In [None]:
simulate_incorrect = function(n=50, b=0.5) {
    X = rexp(n)
    # the Gauss model is 
    # quite off here -- Var(X^Tepsilon) is not well
    # approximated by X^TX Var(epsilon)...
    Y = 3 + b * X + X * noise(n)
    Y.lm = lm(Y ~ X)

    # parametric interval
    
    int_param = confint(Y.lm)[2,]
    
    # pairs bootstrap interval
    
    pairs.Y.lm = Boot(Y.lm, coef, method='case')
    int_pairs = confint(pairs.Y.lm)[2,]
    
    # resid bootstrap interval
    
    resid.Y.lm = Boot(Y.lm, coef, method='resid')
    int_resid = confint(resid.Y.lm)[2,]
    
    return(c((int_param[1] < b) * (int_param[2] > b),
             (int_pairs[1] < b) * (int_pairs[2] > b),
             (int_resid[1] < b) * (int_resid[2] > b)))
}

simulate_incorrect()

In [None]:
nsim = 100
coverage = c()
for (i in 1:nsim) {
    coverage = rbind(coverage, simulate_incorrect())
}
colnames(coverage) = c('parametric', 'pairs bootstrap', 'residual bootstrap')
print(apply(coverage, 2, mean))

## Logistic regression

- The pairs bootstrap is the most natural to define: assume $(X_i,Y_i) \overset{IID}{\sim} G$, then resample pairs and refit the logistic regression.

### Similar approximation

- Recall
$$
\hat{\beta} \approx \beta + (X^TW(\beta)X)^{-1}X^T(Y - \pi(\beta)).
$$

- Pairs bootstrap is similar to using
$$
\hat{\beta}^*_b \approx \hat{\beta} + (X^TW(\hat{\beta})X)^{-1}(X^*_b)^T(Y^*_b - \pi^*(\hat{\beta}))
$$
where 
$$\pi^*_i(\hat{\beta}) = \frac{\exp((X_{i,b}^*)^T\hat{\beta})}{1 + \exp((X_{i,b}^*)^T\hat{\beta})}.$$
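
To see how close the linearization is, the sketch below runs the pairs bootstrap twice on simulated data: once refitting the logistic regression on each resample, and once taking a single step from $\hat{\beta}$ using the resampled score and the information matrix of the original fit. The one-step draws are centered at $\hat{\beta}$ so that they are on the same scale as the refitted coefficients. The simulated coefficients, sample size and variable names are assumptions for illustration.

```r
set.seed(5)
n_l = 200
x_l = rnorm(n_l)
y_l = rbinom(n_l, 1, plogis(-0.5 + x_l))   # simulated data; coefficients are arbitrary
fit_l = glm(y_l ~ x_l, family = binomial())

Xl = model.matrix(fit_l)
beta_l = coef(fit_l)
pi_l = fitted(fit_l)                       # pi_i(beta_hat)
info_inv = solve(crossprod(Xl * sqrt(pi_l * (1 - pi_l))))  # (X^T W(beta_hat) X)^{-1}

B_l = 500
slope_refit = numeric(B_l)
slope_onestep = numeric(B_l)
for (b in 1:B_l) {
    idx = sample(n_l, n_l, replace = TRUE)
    # full pairs bootstrap: refit on the resampled cases
    slope_refit[b] = coef(glm(y_l[idx] ~ x_l[idx], family = binomial()))[2]
    # one-step version: resampled score, information from the original fit
    score_b = crossprod(Xl[idx, , drop = FALSE], y_l[idx] - pi_l[idx])
    slope_onestep[b] = (beta_l + info_inv %*% score_b)[2]
}

c(refit = sd(slope_refit), onestep = sd(slope_onestep))
cor(slope_refit, slope_onestep)
```

The one-step version avoids running the iterative fit `B_l` times, at the cost of ignoring the resample-to-resample variation in the information matrix.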

In [None]:
flu.table = read.table('http://stats203.stanford.edu/data/flu.table', header=TRUE)
flu.glm = glm(Shot ~ Age + Health.Aware, data=flu.table, family=binomial())
flu.boot = Boot(flu.glm, coef, method='case')
confint(flu.boot)

## Residual bootstrap

Not well-defined for logistic regression: the responses are binary, so there are no IID additive errors to resample.

In [None]:
# resid bootstrap fails -- not really a well-defined model
# (wrapped in try() so the expected error does not stop the rest of the notebook)
try(Boot(flu.glm, coef, method='resid'))


### Parametric bootstrap

- Assumes the parametric model is correct.

- Repeatedly draw 
$$y_i^* \sim \text{Bernoulli}\left(\frac{\exp(X_i^T\hat{\beta})}{1 + \exp(X_i^T\hat{\beta})}\right)$$

- Refit to get $\hat{\beta}^*_b$ and use the law of $\hat{\beta}^*_b - \hat{\beta}$ as a
surrogate for the law of $\hat{\beta}-\beta$.
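
A minimal sketch of these steps in base R, on simulated data rather than the flu data. The simulated coefficients, the number of resamples and the names `fit_pb`, `pi_hat` and `beta_pb` are assumptions for illustration.

```r
set.seed(6)
n_pb = 200
x_pb = rnorm(n_pb)
y_pb = rbinom(n_pb, 1, plogis(-0.5 + x_pb))  # simulated data; coefficients are arbitrary
fit_pb = glm(y_pb ~ x_pb, family = binomial())
pi_hat = fitted(fit_pb)                      # exp(X beta_hat) / (1 + exp(X beta_hat))

B_pb = 500
beta_pb = matrix(NA_real_, B_pb, 2)
for (b in 1:B_pb) {
    y_star = rbinom(n_pb, 1, pi_hat)         # X stays fixed; only Y is redrawn from the fitted model
    beta_pb[b, ] = coef(glm(y_star ~ x_pb, family = binomial()))
}

apply(beta_pb, 2, sd)                         # parametric bootstrap SEs
summary(fit_pb)$coefficients[, "Std. Error"]  # Wald SEs from the original fit, for comparison
```

Unlike the pairs bootstrap, this inherits the model's assumptions: if the logistic form is wrong, the resampled responses are drawn from the wrong distribution.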