# L10: Resampling methods

**Sources and additional reading:**
- Efron & Gong, [A leisurely look at the Bootstrap, the Jackknife, and Cross-validation](https://www.jstor.org/stable/2685844)
- Lupton, chapter 6
- [Statisticelle blog post](https://statisticelle.com/resampling-the-jackknife-and-pseudo-observations/)

## Error estimation

In the course of these lectures, our main focus has been to estimate the values of model parameters and their distributions (or uncertainties) given some set of observational data. We have seen that we can obtain analytic estimates of the variance of our estimator (e.g. mean of a Gaussian random variable, linear least squares), that we can approximate variances (MLE), and that in the case of full Bayesian inference we can always resort to MCMC to estimate the posterior and thus the parameter uncertainties.

But what do we do for example if we do not know the likelihood? Can we still estimate uncertainties for our estimators?

The goal of resampling methods is to use the data itself to understand the distribution of the estimator. This is analogous to the fact that we can use an iid sample of size $n$ to estimate both the mean $\hat{\mu}$ and the standard deviation $\hat{\sigma}$ of a given underlying population. Resampling methods generalize this notion, and the two most popular ones are the *Jackknife* and the *Bootstrap*.

The main idea of these methods is to treat an observed sample from a population as a population itself. One can then generate many samples from this "new" population and compute the value of the estimator in question for each of these samples. This allows for constructing an empirical sampling distribution from which we can derive quantities like the estimator's variance and its bias.

Let us look at these methods in more detail.

## Setup

Let us assume that the data consists of a random sample of size $n$ drawn from an underlying unknown probability distribution $F$ on $\mathbb{R}$, i.e. $$x_1, ..., x_n \sim F.$$ We can use these data to estimate the value of a model parameter $\theta$ using an estimator $\hat{\theta}(x_1,..., x_n)$. We would now like to estimate the variance as well as potential bias of this estimator without having to specify anything else about the estimator or its distribution. 

## The Jackknife

The basic idea of the Jackknife method is to create new samples out of the original sample by removing one measurement point at a time. This procedure leads to $n$ samples of size $n-1$, and for each of these samples we can compute the value of our estimator. Let $\hat{\theta}_{(i)}=\hat{\theta}(x_1,...,x_{i-1}, x_{i+1},..., x_n)$ be the value of the statistic when $x_i$ is removed from the data set, and let the mean of all these estimates be $\hat{\theta}_{(.)}=\frac{1}{n}\sum_{i=1}^{n}\hat{\theta}_{(i)}.$

Then we can estimate the standard deviation of $\hat{\theta}(x_1,..., x_n)$ using our Jackknife sample $(\hat{\theta}_{(1)}, ..., \hat{\theta}_{(n)})$ as $$\hat{\sigma}_J=\left[\frac{n-1}{n}\sum_{i=1}^{n}\left(\hat{\theta}_{(i)}-\hat{\theta}_{(.)}\right)^2\right]^{1/2},$$ i.e. the variance of the estimator, $\hat{\sigma}_J^2$, is estimated as a rescaled version of the variance of the leave-one-out Jackknife estimates. Compared with the usual $\frac{1}{n-1}$ normalization of a sample variance, the factor $\frac{n-1}{n}$ inflates the spread. Essentially, it accounts for the fact that the Jackknife estimates are computed from almost identical samples, so they are similar and correlated and their spread underestimates the true variance of the estimator. This expression is very general, as it can be applied to any statistic that is a function of $n$ iid random variables.
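The leave-one-out procedure and the expression for $\hat{\sigma}_J$ can be sketched in a few lines of Python. This is only an illustration using the standard library: the helper names `jackknife_estimates` and `jackknife_std` and the data values are made up for this example, and the data are assumed to be stored in a plain list.

```python
import math
import statistics


def jackknife_estimates(data, estimator):
    """Return the n leave-one-out values theta_(i) of `estimator`."""
    return [estimator(data[:i] + data[i + 1:]) for i in range(len(data))]


def jackknife_std(data, estimator):
    """Jackknife estimate of the standard deviation of `estimator`."""
    n = len(data)
    loo = jackknife_estimates(data, estimator)
    loo_mean = sum(loo) / n  # theta_(.)
    return math.sqrt((n - 1) / n * sum((t - loo_mean) ** 2 for t in loo))


# Illustrative data, not taken from the lecture.
x = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]

sigma_j = jackknife_std(x, statistics.fmean)

# Sanity check: for the sample mean, the jackknife formula reduces
# algebraically to the textbook standard error s / sqrt(n).
textbook = statistics.stdev(x) / math.sqrt(len(x))
print(sigma_j, textbook)
```

The sample mean is a convenient test case because the two numbers should agree up to rounding. For other statistics, only the `estimator` argument needs to change.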

In addition to the variance, we can also use the Jackknife algorithm to estimate the bias $b$ of a given estimator. In general, the bias of an estimator is defined as $b(\hat{\theta})=\langle\hat{\theta}\rangle-\theta$. The Jackknife estimate of the bias of the estimator $\hat{\theta}=\hat{\theta}(x_1,..., x_n)$ is given by $$\hat{b}_{J}=(n-1)(\hat{\theta}_{(.)}-\hat{\theta}),$$ i.e. it is proportional to the difference between the mean of the Jackknife estimates and the estimator evaluated on the full sample.

The usefulness of this bias estimate is that it allows for removing a bias that is proportional to $1/n$. Let us look at why this is the case. We assume that the full sample estimator is biased such that $$\langle\hat{\theta}\rangle=\theta+\frac{b_1(\theta)}{n}+\frac{b_2(\theta)}{n^2}+... .$$ Since each $\hat{\theta}_{(i)}$ is the same estimator evaluated on a sample of size $n-1$, we have $$\langle \hat{\theta}_{(.)} \rangle = \frac{1}{n}\sum_{i=1}^{n}\langle\hat{\theta}_{(i)}\rangle = \frac{1}{n}\sum_{i=1}^{n}\left(\theta+\frac{b_1(\theta)}{n-1}+\frac{b_2(\theta)}{(n-1)^2}+...\right).$$ Therefore we obtain $$\langle \hat{\theta}_{(.)}-\hat{\theta} \rangle = \frac{b_1(\theta)}{n(n-1)}+\mathcal{O}(n^{-3}),$$ and multiplying by $(n-1)$ gives $$\langle\hat{b}_J\rangle=\frac{b_1(\theta)}{n}+\mathcal{O}(n^{-2}).$$ The Jackknife bias thus provides an estimate of the leading bias term $\frac{b_1(\theta)}{n}$. Subtracting it removes this term, but it does not necessarily reduce the bias to zero, since terms of order $n^{-2}$ and higher may remain.

Using this bias estimate, we can define a bias-corrected Jackknife estimator through $$\hat{\theta}_{\mathrm{Jack}}=\hat{\theta}-\hat{b}_{J}=n\hat{\theta}-(n-1)\hat{\theta}_{(.)}.$$
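A standard test case for the bias correction is the plug-in variance $\frac{1}{n}\sum_i (x_i-\bar{x})^2$, whose bias is exactly $-\sigma^2/n$, i.e. a pure $1/n$ term. In this case the Jackknife correction is exact and reproduces the usual unbiased sample variance. The sketch below checks this numerically. The function names and data values are illustrative assumptions, and only the Python standard library is used.

```python
import math
import statistics


def jackknife_bias(data, estimator):
    """Jackknife bias estimate (n - 1) * (theta_(.) - theta_hat)."""
    n = len(data)
    theta_hat = estimator(data)
    loo = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    theta_dot = sum(loo) / n
    return (n - 1) * (theta_dot - theta_hat)


def jackknife_corrected(data, estimator):
    """Bias-corrected estimator theta_hat - b_J."""
    return estimator(data) - jackknife_bias(data, estimator)


# Illustrative data, not taken from the lecture.
x = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]

theta_hat = statistics.pvariance(x)        # plug-in variance, biased low
b_j = jackknife_bias(x, statistics.pvariance)
theta_jack = jackknife_corrected(x, statistics.pvariance)

# The corrected value should match the unbiased sample variance (1/(n-1) normalization).
agrees = math.isclose(theta_jack, statistics.variance(x))
print(theta_hat, b_j, theta_jack, agrees)
```

For estimators whose bias also contains higher-order terms, the corrected value will in general still be biased, just at a smaller order in $1/n$.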

<font color='pink'>[AN: Asymptotic normality for variance]</font>

## The Bootstrap

The Bootstrap is similar in spirit to the Jackknife, but it uses a different approach to create new samples from the original data. The procedure is based on the so-called *empirical distribution* $\hat{F}$, which is constructed by putting probability mass $\frac{1}{n}$ on each $x_i$, i.e. written as a density, $$\hat{F}(x)=\frac{1}{n}\sum_i\delta(x-x_i).$$ A Bootstrap sample of the original data $x_1^*, ..., x_n^*$ is defined as a draw from the empirical distribution, i.e. $$x_1^*, ..., x_n^* \sim \hat{F}.$$ In other words, each $x_i^*$ is drawn independently and with replacement from the initial sample $x_1, ..., x_n$.
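Drawing from $\hat{F}$ amounts to sampling $n$ values with replacement from the observed data. A minimal sketch using Python's standard `random` module is shown below. The helper name `bootstrap_sample`, the seed, and the data values are assumptions made for illustration.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) points independently and with replacement from `data`."""
    return rng.choices(data, k=len(data))


# Fixed seed only so that the sketch is reproducible.
rng = random.Random(0)

# Illustrative data, not taken from the lecture.
x = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]

x_star = bootstrap_sample(x, rng)
print(x_star)
```

Because the draws are made with replacement, a Bootstrap sample typically contains some of the original points several times and omits others.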

The Bootstrap estimate of variance is based on the idea that if we knew $F$, then the standard deviation of our estimator $\hat{\theta}$ could be computed as $$\sigma(F)=\left[\underset{F}{\mathrm{var}}\;\hat{\theta}(x_1, ..., x_n)\right]^{1/2}.$$ The next best thing we have is $\hat{F}$, and in analogy, the Bootstrap estimate of the standard deviation is defined as $$\sigma_B=\sigma(\hat{F})=\left[\underset{\hat{F}}{\mathrm{var}}\;\hat{\theta}^*(x_1^*, ..., x_n^*)\right]^{1/2},$$ where $\hat{\theta}^*(x_1^*, ..., x_n^*)$ denotes the estimate of $\theta$ obtained for the Bootstrap sample.

In practice, we generate $B$ Bootstrap samples, evaluate the estimator on each of them to obtain $\hat{\theta}^{*b}=\hat{\theta}^*(x_1^{*b}, ..., x_n^{*b})$ for $b=1,...,B$, and then compute $$\hat{\sigma}_B=\left[\frac{1}{B-1}\sum_{b=1}^B (\hat{\theta}^{*b}-\hat{\theta}^{*.})^2\right]^{1/2},$$ where $$\hat{\theta}^{*.}=\frac{1}{B}\sum_{b=1}^B \hat{\theta}^{*b}$$ denotes the mean of the $B$ Bootstrap estimates.
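The Monte Carlo recipe above can be sketched as follows, again using only the Python standard library. The function names, the number of replicates `n_boot`, the seed, and the data are illustrative choices rather than recommendations from the lecture.

```python
import math
import random
import statistics


def bootstrap_estimates(data, estimator, n_boot, rng):
    """Evaluate `estimator` on n_boot resamples drawn with replacement."""
    n = len(data)
    return [estimator(rng.choices(data, k=n)) for _ in range(n_boot)]


def bootstrap_std(data, estimator, n_boot=2000, seed=0):
    """Bootstrap standard deviation with the 1/(B-1) normalization."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    reps = bootstrap_estimates(data, estimator, n_boot, rng)
    return statistics.stdev(reps)


# Illustrative data, not taken from the lecture.
x = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]

# The median has no simple closed-form standard error, which is where
# a resampling estimate is most useful.
sigma_b_median = bootstrap_std(x, statistics.median)

# For the mean, the exact value under the empirical distribution is
# sqrt(pvariance / n), so the Monte Carlo result should be close to it.
sigma_b_mean = bootstrap_std(x, statistics.fmean)
exact_mean = math.sqrt(statistics.pvariance(x) / len(x))
print(sigma_b_median, sigma_b_mean, exact_mean)
```

The agreement for the mean is only approximate, since $\hat{\sigma}_B$ carries Monte Carlo noise that shrinks as $B$ grows.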

As with the Jackknife, we can also use the Bootstrap method to compute the bias of an estimator. To motivate the bias definition for the Bootstrap, we note that the bias of an estimator can be thought of as a function of the underlying distribution $F$, i.e. $$b(F)=\langle\hat{\theta}-\theta\rangle_F=\langle\hat{\theta}\rangle_F-\theta.$$ Analogously, we obtain the Bootstrap estimate of the bias by replacing $F$ with $\hat{F}$ as $$\hat{b}_B=\hat{b}(\hat{F})=\langle\hat{\theta}(\hat{F}^*)-\theta(\hat{F})\rangle_{\hat{F}^*}=\langle\hat{\theta}(\hat{F}^*)\rangle_{\hat{F}^*}-\theta(\hat{F}).$$

As before, we can compute the bias in practice using Monte Carlo methods, i.e. $$\hat{b}_B=\hat{\theta}^{*.}-\hat{\theta}=\frac{1}{B}\sum_{b=1}^B (\hat{\theta}^{*b}-\hat{\theta}),$$ where $\hat{\theta}$ denotes the estimator evaluated on the full sample. This relies on the approximation that the bias is inherent in how we estimate a given quantity and not in what exactly we estimate, so that the bias under $\hat{F}$ can stand in for the bias under $F$.
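As an illustration of the Monte Carlo bias estimate, the sketch below applies it to the plug-in variance. For this statistic the expected value under resampling from $\hat{F}$ is $\frac{n-1}{n}$ times the full-sample value, so $\hat{b}_B$ should scatter around $-\hat{\theta}/n$. The function name `bootstrap_bias`, the number of replicates, the seed, and the data are assumptions made for this example.

```python
import math
import random
import statistics


def bootstrap_bias(data, estimator, n_boot=5000, seed=0):
    """Monte Carlo bootstrap bias: mean of the replicates minus theta_hat."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    n = len(data)
    theta_hat = estimator(data)
    reps = [estimator(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.fmean(reps) - theta_hat


# Illustrative data, not taken from the lecture.
x = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]

theta_hat = statistics.pvariance(x)  # plug-in variance, biased low
b_b = bootstrap_bias(x, statistics.pvariance)

# Reference value for this particular statistic: -theta_hat / n.
reference = -theta_hat / len(x)
print(b_b, reference)
```

Unlike the Jackknife bias, this estimate is itself noisy at finite $B$, so the two printed numbers agree only approximately.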