# Bootstrapping data

## Introduction

In the last lesson, we saw that our random forests use multiple decision trees, referred to as estimators, to predict the target variable.  To create variation among the trees, the random forest trains each tree on a subset of the training data.  In this lesson we'll learn about bootstrapping with random forests, and how it can further increase the variation among our trees.  

**Bootstrapping** means that when we sample a dataset, we sample our data with replacement.  The advantage of this is that it increases the differences among the subsamples we use to train our trees.  First, let's make sure we understand how bootstrapping works, and then we can see how it increases the variation among our subsamples, and thus among our trees.

## What it means to bootstrap

Say we have observations one through thirteen.

In [2]:
import numpy as np
one_to_thirteen = np.arange(1, 14)
one_to_thirteen

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13])

If we were to take an ordinary subsample of the data, say five of the observations, then each observation could be selected at most once.

In [3]:
np.random.seed(21)
np.random.choice(one_to_thirteen, 5, replace = False)

array([6, 8, 2, 7, 3])

So above, each of our selected datapoints is unique: once a datapoint is selected, it is removed from the pool of available datapoints.

When we sample with replacement, or bootstrap, this isn't the case.  We select from the list of observations, but then place the datapoint back into the pool of data we can select from.  

Let's see this in action.  We can bootstrap our array with the following code.

In [4]:
np.random.seed(21)
np.random.choice(one_to_thirteen, 8, replace = True)

array([10,  9,  5,  1,  1,  9,  4, 13])

So here we see that the same number can be selected multiple times: even after a number has been selected once, it is still available to be selected again.
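To make the effect of replacement easier to see, the following is a minimal sketch that draws a bootstrap sample the same size as the original array and reports which observations were repeated and which were never drawn. It uses `np.random.default_rng` rather than the `np.random.seed` calls above, and both the seed and the sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
observations = np.arange(1, 14)

# Draw a bootstrap sample: same size as the data, sampled with replacement.
sample = rng.choice(observations, size=observations.size, replace=True)

# Count how many times each drawn observation appears.
values, counts = np.unique(sample, return_counts=True)
repeated = values[counts > 1]

# Observations that never made it into the sample.
never_drawn = np.setdiff1d(observations, sample)

print("sample:     ", np.sort(sample))
print("repeated:   ", repeated)
print("never drawn:", never_drawn)
```

Because some observations are drawn more than once, others are necessarily left out, so each bootstrap sample emphasizes a different part of the data.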

### Applying Bootstrapping to Our Models

Now that we understand the operation behind bootstrapping, let's see what happens if we apply bootstrapping to our training sets.  We'll continue to use our list of numbers, one through thirteen, to keep things simple.

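In [23]:
from sklearn.model_selection import train_test_split

one_to_thirteen = np.arange(1, 14)
# With the default test_size of 0.25, nine of the thirteen values go to x_train
x_train, x_test = train_test_split(one_to_thirteen)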

And now, mimicking the procedure of a random forest, let's begin by taking subsamples of our data, without replacement.  Let's say that we select six of the nine observations in our training set (roughly two-thirds) each time.

In [13]:
np.random.seed(23)
datasets = np.stack([np.random.choice(x_train, 6, replace = False) for num in range(0, 5)])
datasets.sort(axis=1)
datasets

array([[ 3,  4,  6,  7,  9, 10],
       [ 3,  4, 10, 11, 12, 13],
       [ 3,  4,  6,  7, 11, 12],
       [ 4,  7,  9, 10, 11, 12],
       [ 3,  4,  6,  9, 10, 12]])

If we look closely at our datasets, notice that a couple of them are fairly similar.  The first and last rows share five of the six datapoints.  And the second and third rows share four of six datapoints.  This can be problematic because we want variation in our trees, so that different trees discover different patterns in the data.
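The overlap counts mentioned above can be checked programmatically. The sketch below copies the subsamples from the output above into a literal array and counts the shared datapoints for every pair of rows; `shared_count` is a hypothetical helper name introduced only for this example.

```python
import numpy as np
from itertools import combinations

# Subsamples copied from the output above (drawn without replacement).
datasets = np.array([[ 3,  4,  6,  7,  9, 10],
                     [ 3,  4, 10, 11, 12, 13],
                     [ 3,  4,  6,  7, 11, 12],
                     [ 4,  7,  9, 10, 11, 12],
                     [ 3,  4,  6,  9, 10, 12]])

def shared_count(row_a, row_b):
    """Number of distinct observations that appear in both rows."""
    return len(np.intersect1d(row_a, row_b))

# Compare every pair of rows, using zero-based row indices as keys.
overlaps = {(i, j): shared_count(datasets[i], datasets[j])
            for i, j in combinations(range(len(datasets)), 2)}

for pair, shared in overlaps.items():
    print(pair, shared)
```

With zero-based indices, the pair `(0, 4)` corresponds to the first and last rows, and `(1, 2)` to the second and third rows.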

Now let's see what happens when we select subsamples of our training set with bootstrapping.

In [14]:
np.random.seed(23)
datasets = np.stack([np.random.choice(x_train, 6, replace = True) for num in range(0, 5)])
datasets.sort(axis=1)
datasets

array([[ 3,  3,  7, 11, 11, 13],
       [ 6,  9, 10, 10, 11, 13],
       [ 6,  7, 10, 12, 12, 13],
       [ 4,  4,  6,  6,  7, 11],
       [ 3,  9, 11, 12, 13, 13]])

Here, we can see that the rows differ more from one another.  Because an observation can be drawn more than once, each subsample repeats some datapoints and leaves others out.  And this increases the variation among our trees.
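One simple way to see the repetition is to count the distinct observations in each bootstrapped row. The sketch below copies the rows from the output above into a literal array; it is only an illustration of the idea, not a formal measure of variation.

```python
import numpy as np

# Bootstrapped subsamples copied from the output above (drawn with replacement).
bootstrapped = np.array([[ 3,  3,  7, 11, 11, 13],
                         [ 6,  9, 10, 10, 11, 13],
                         [ 6,  7, 10, 12, 12, 13],
                         [ 4,  4,  6,  6,  7, 11],
                         [ 3,  9, 11, 12, 13, 13]])

# Each row has six entries, but repeats mean fewer than six distinct observations.
distinct_per_row = [len(np.unique(row)) for row in bootstrapped]
print(distinct_per_row)  # [4, 5, 5, 4, 5]
```

Every row contains at least one repeated observation, so each subsample effectively gives extra weight to some datapoints and none to others.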

### Bootstrapping in Random Forests

In sklearn, bootstrapping is the default behavior of a random forest.  If we wanted to turn bootstrapping off, we could pass the argument `bootstrap = False`.

In [24]:
# RandomForestRegressor(bootstrap = False)

But we don't want to do that.  

With the default behavior, our random forest will sample with replacement before constructing each tree in the random forest.  This increases the variance among our trees.  And this is a good thing, because we ultimately compensate for that variance by aggregating our trees at the end.
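The sketch below illustrates this trade-off on a small synthetic dataset, which is invented purely for this example. It fits a `RandomForestRegressor` with default bootstrapping, collects the predictions of the individual trees from `estimators_`, and checks that averaging them reproduces the forest's own prediction. The number of trees and the seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, invented purely for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=100)

# bootstrap is left at its default value here.
forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X, y)
print("bootstrap enabled:", forest.bootstrap)

# Predictions of each individual tree for the first three observations.
tree_predictions = np.array([tree.predict(X[:3]) for tree in forest.estimators_])

# The spread across trees shows how much the individual trees disagree.
print("spread across trees:", tree_predictions.std(axis=0))

# Averaging the trees should match the forest's own prediction.
averaged = tree_predictions.mean(axis=0)
print("matches forest:", np.allclose(averaged, forest.predict(X[:3])))
```

The individual trees disagree with one another, but the forest reports only their average, which is how the added variance gets smoothed out.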

### Summary

In this lesson, we saw how bootstrapping can increase the variation among our trees.  Bootstrapping means to sample a dataset with replacement.  As we saw, this increases the variation among the sampled datasets, which leads to more variation among the trees that we train.

### Resources

[Random Forest Top to Bottom](https://www.gormanalysis.com/blog/random-forest-from-top-to-bottom/)