A large part of a data scientist's job is to build models that learn patterns from existing data and apply those patterns to new data. For many of our models, we can call `predict` and get a "point estimate" - our best guess - of what the new value should be. Our models generally don't come with a robust uncertainty estimate: if I tell my model about the features of my house, its best prediction might be that my house is worth 340k, but it would be nice to know if the expected error on this is $\pm 10$k, $\pm 50$k, or even $\pm 100$k! It is difficult to make decisions if I don't know what my uncertainty looks like.

Statisticians face a similar problem called [_inference_](https://en.wikipedia.org/wiki/Statistical_inference), where they try to make a statement about some property of the entire population (called a population _parameter_), but they only have access to a sample of the population. Statistical inference is the task of estimating those population parameters from the corresponding sample statistics.

As a simple example, take trying to estimate the mean height $\mu$ of a population based on a sample of $N$ people. The best guess that you are able to make (a so-called "unbiased estimate") is the mean height $\bar{x}$ of your sample. We intuitively know that as the number of people in the sample $N$ gets larger, our estimate of the population mean ($\bar{x}$) should get closer to the population mean height ($\mu$). In short, we are more confident in large samples than small ones. For the specific problem of estimating a mean, the central limit theorem from statistics gives a nice way of estimating the uncertainty.
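To make the central limit theorem result concrete, here is a minimal sketch of the usual standard error of the mean, $s / \sqrt{N}$, where $s$ is the sample standard deviation. The heights are simulated from a hypothetical population (mean 170 cm, standard deviation 10 cm), and the helper name `mean_with_standard_error` is made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_with_standard_error(sample):
    """Return the sample mean and its CLT-based standard error, s / sqrt(N)."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    s = sample.std(ddof=1)  # sample standard deviation (divides by N - 1)
    return sample.mean(), s / np.sqrt(n)

# Hypothetical population: heights in cm with mean 170 and std 10 (an assumption).
for n in (10, 100, 1000):
    heights = rng.normal(loc=170.0, scale=10.0, size=n)
    x_bar, se = mean_with_standard_error(heights)
    print(f"N={n:5d}  mean={x_bar:7.2f}  standard error={se:5.2f}")
```

The standard error should shrink roughly like $1/\sqrt{N}$, which matches the intuition that larger samples pin down the mean more tightly.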

Our data scientist who wants to estimate the uncertainty in her model would probably have a similar intuition: the more data she has in her training set, the more confident she will be in the model's predictions. However, she knows that more complicated models have a tendency to overfit to single data points, so a single data point in the wrong place might throw off the results of her model.

For simple (i.e. high bias) models like Linear Regression, we have ways of estimating the uncertainty -- provided the assumptions of the high bias models are met (e.g. in Linear Regression, if you use the standard formula for confidence intervals, it is important that your residuals are normally distributed and [homoscedastic](https://en.wikipedia.org/wiki/Homoscedasticity)). For more complicated or non-parametric models, such as tree-based models, it becomes a lot harder to give "formula-based" uncertainties.

**Bootstrapping** is a _non-parametric_ (i.e. no assumption about the distribution of the underlying data) way of estimating the sampling error, and can be used in both machine learning prediction problems and statistical inference problems. We will go through a high-level overview, show a couple of examples, and give a couple of disclaimers.

## What's the idea behind bootstrapping?

We have already discussed that we are generally more confident in larger samples, or that the degree of uncertainty decreases as the amount of training data increases. For one source of error - sampling error - this is generally true. Our _particular_ model with its particular coefficients (or decision boundaries) was arrived at by looking at one _particular_ sample. If we had sampled different examples, we would have arrived at a (slightly) different model.

If we had _many_ samples that were the same size, we could fit our model to each of the different samples, and see how much of an effect the particular sample had on our model. The error associated with selecting a particular set of points (rather than some other set of points) is the _sampling error_.
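As an illustration of this thought experiment, the sketch below pretends we really can draw many fresh samples. It assumes a hypothetical population where $y = 2x + 1$ plus noise, fits a straight line to each sample with `np.polyfit`, and looks at how much the fitted slope moves from sample to sample. The population and the helper names are assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_sample(n):
    """Draw n points from a hypothetical population where y = 2x + 1 + noise."""
    x = rng.uniform(0.0, 10.0, size=n)
    y = 2.0 * x + 1.0 + rng.normal(scale=3.0, size=n)
    return x, y

def fitted_slope(x, y):
    """Fit a straight line by least squares and return its slope."""
    slope, _intercept = np.polyfit(x, y, deg=1)  # highest-degree coefficient first
    return slope

# 500 independent samples of 100 points each, one fitted model per sample.
slopes = np.array([fitted_slope(*draw_sample(100)) for _ in range(500)])
print(f"mean slope across samples: {slopes.mean():.3f}")
print(f"spread (std) of the slope: {slopes.std(ddof=1):.3f}")
```

The spread of `slopes` is the sampling error of the slope. In practice we only ever see one of these 500 samples, which is the problem bootstrapping tries to work around.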

We generally cannot collect many samples of our data; if we could, we would have just collected more data to start with! More precisely, we have some number of observations (say 100 subjects) and it is up to us to decide how to analyze them. The discussion above might lead you to consider doing an analysis with all 100 subjects, or ten samples of 10 subjects each, or one hundred samples with 1 subject each! Bootstrapping takes a different approach: it says "create many samples with 100 subjects, by randomly selecting 100 subjects from our _sample_ (with replacement)". In any given bootstrapped sample, some subjects will be repeated and some will be left out. The basic idea is:
> If the _sample_ is representative of the _population_, then a _bootstrap sample_ is representative of the _sample_

The thing that bootstrap captures well is the effect of some points having no say (by being left out), and some regions being oversampled. If the sample is not representative of the population, this technique will fail (but so would any other technique that doesn't explicitly correct for _how_ the sample fails to be representative).
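A single bootstrap sample can be sketched as follows. It uses 100 hypothetical subject IDs, draws 100 of them with replacement, and counts how many subjects are left out or repeated. The helper name `bootstrap_indices` is made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

n_subjects = 100
subject_ids = np.arange(n_subjects)  # stand-in for our 100 observed subjects

def bootstrap_indices(n, rng):
    """Pick n row indices uniformly at random, with replacement."""
    return rng.integers(0, n, size=n)

idx = bootstrap_indices(n_subjects, rng)
resample = subject_ids[idx]  # one bootstrap sample, same size as the original

# How often was each original subject selected?
counts = np.bincount(idx, minlength=n_subjects)
left_out = int(np.sum(counts == 0))
repeated = int(np.sum(counts > 1))
print("subjects left out:", left_out)
print("subjects appearing more than once:", repeated)
```

For large samples, the chance that a given subject is left out of a bootstrap sample is about $(1 - 1/N)^N \approx e^{-1} \approx 0.37$, so roughly a third of the subjects have no say in any one resample.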

By being the same size as the original sample, the bootstrapped samples have _similar_ statistical properties. We have to be a little careful, because unlike really going out and making new samples from the population, we have correlation between the observations in a bootstrapped sample: the same subject can appear more than once.
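For the mean-height example from earlier, a formula for the uncertainty is available, so it makes a convenient sanity check. The sketch below, under the assumption of simulated heights from a hypothetical population, recomputes the mean on many bootstrap samples and compares the spread of those means with the formula-based standard error $s / \sqrt{N}$. The function name `bootstrap_statistic` is an illustrative choice, not a library API.

```python
import numpy as np

rng = np.random.default_rng(3)

def bootstrap_statistic(sample, statistic, n_resamples, rng):
    """Evaluate `statistic` on n_resamples bootstrap samples of `sample`."""
    sample = np.asarray(sample)
    n = sample.size
    estimates = np.empty(n_resamples)
    for i in range(n_resamples):
        # Same size as the original sample, drawn with replacement.
        resample = sample[rng.integers(0, n, size=n)]
        estimates[i] = statistic(resample)
    return estimates

# Hypothetical sample: 100 heights in cm (simulated, an assumption).
heights = rng.normal(loc=170.0, scale=10.0, size=100)

boot_means = bootstrap_statistic(heights, np.mean, n_resamples=2000, rng=rng)
boot_se = boot_means.std(ddof=1)
formula_se = heights.std(ddof=1) / np.sqrt(heights.size)

print(f"bootstrap standard error: {boot_se:.3f}")
print(f"formula standard error:   {formula_se:.3f}")
```

The two numbers should be close. The appeal of the bootstrap version is that `np.mean` can be swapped for a statistic with no convenient formula, such as a median or a model's prediction.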


However, our _particular_ model or our particular coefficients were arrived at because we had _this_ particular 