# Subsampling Observations

### Introduction

In the last lesson, we were introduced to random forests.  We saw that decision trees vary a great deal from one fit to the next, not only in their leaves but throughout the tree.  One reason is that a decision tree is built sequentially: a split at a higher level determines how the data can be split at lower levels.  Another is that decision trees make very few assumptions about the data, unlike, say, a linear model, which assumes that the data follows some form of a line.  A decision tree is a very flexible model, and that flexibility gives us high variance.

This variance is a problem if we use just one decision tree to make our predictions.  However, we can reduce the variance by fitting many different trees and then averaging their predictions.  In this lesson, we'll begin working with a random forest in sklearn.  As we move through this and subsequent lessons, it's important to remember the main idea behind a random forest:

* Use a highly flexible, high-variance model to find different patterns in the data, and then reduce that variance through aggregation.

Aggregating the models reduces the effect of random patterns in the training data that are unlikely to appear again.
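To see why aggregation reduces variance, here is a minimal simulation that is not tied to any dataset.  It assumes each model's prediction is a true value plus independent random noise; the names `true_value`, `n_models`, and `n_trials` are made up for this illustration.  Real trees in a forest are correlated with one another, so the reduction in practice is smaller than in this idealized case.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 100.0   # hypothetical quantity every model tries to predict
n_models = 25        # number of models averaged together
n_trials = 10_000    # number of times we repeat the experiment

# Each row is one trial; each column is one model's noisy prediction.
predictions = true_value + rng.normal(loc=0.0, scale=20.0, size=(n_trials, n_models))

# Spread of a single model's predictions versus the spread of the average.
single_model_std = predictions[:, 0].std()
averaged_std = predictions.mean(axis=1).std()

print(f"std of a single model:       {single_model_std:.2f}")
print(f"std of the 25-model average: {averaged_std:.2f}")
```

With independent noise, the spread of the average shrinks by roughly the square root of the number of models.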

### Loading our data

For this lesson, let's work with the diabetes dataset.

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes

dataset = load_diabetes()
X = dataset['data']
y = dataset['target']

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.33, random_state=20)

## Forest from the Trees

The first idea of random forests is one that we already know.  With a random forest, we average the predictions from a number of decision trees.  We can see this directly by fitting a random forest regressor.

In [10]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators = 25, random_state = 1)
rfr.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=25, n_jobs=None,
           oob_score=False, random_state=1, verbose=0, warm_start=False)

After we fit our random forest regressor, it comes with a new attribute called `estimators_`.  Each estimator is one particular tree of the random forest.  Let's take a look at the first two estimators.

In [25]:
estimators = rfr.estimators_
len(estimators)
# 25
estimators[:2]

[DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=1791095845, splitter='best'),
 DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=2135392491, splitter='best')]

So each of these estimators is simply one of the decision trees that we saw earlier on.  We can see each estimator's prediction for the first observation in the test set.

In [12]:
import numpy as np
tree_predictions = np.hstack([estimator.predict(X_test[:1, :]) for estimator in rfr.estimators_])
tree_predictions

array([ 83.,  93.,  47.,  77.,  96.,  69.,  69.,  69.,  93.,  64.,  96.,
        88., 135.,  77.,  96.,  60.,  37.,  31., 116., 125.,  64.,  77.,
        96.,  96.,  47.])

So to find the prediction of the random forest, we can just take the average of these predictions.

In [26]:
np.mean(tree_predictions)

80.04

This is exactly the prediction of our random forest regressor.

In [27]:
rfr.predict(X_test[:1, :])

array([80.04])
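The same check can be run over the whole test set rather than a single row.  The sketch below refits the same forest so that it runs on its own; `per_tree` and `manual_average` are names chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=20)

rfr = RandomForestRegressor(n_estimators=25, random_state=1)
rfr.fit(X_train, y_train)

# Shape (n_trees, n_test_rows): one row of predictions per tree.
per_tree = np.stack([tree.predict(X_test) for tree in rfr.estimators_])

# Average over the tree axis and compare with the forest's own output.
manual_average = per_tree.mean(axis=0)
forest_predictions = rfr.predict(X_test)

print(per_tree.shape)
print(np.allclose(manual_average, forest_predictions))
```

`np.allclose` is used instead of `==` because the two averages may differ by floating-point rounding.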

### Subsampling Data

Now if we look at the tree predictions, we'll see that the trees gave different predictions for the same observation.

In [7]:
np.unique(tree_predictions)  # distinct values among the 25 tree predictions, sorted

array([ 31.,  37.,  47.,  60.,  64.,  69.,  77.,  83.,  88.,  93.,  96.,
       116., 125., 135.])

Now how would we receive different predictions if we trained each estimator on exactly the same data, with each estimator following the same rules?  The short answer is, we couldn't.  If we train every estimator on exactly the same data in exactly the same way, each tree winds up the same, and each therefore fits the random variation in the data in the same way.

Fitting the same high-variance tree multiple times and then taking the average doesn't reduce our error due to variance.  Instead, we would like to fit different trees and then take the average of these differing trees.
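A quick way to check this claim is to fit the same tree several times on identical data.  In the sketch below, fixing `random_state=0` on every tree is an assumption made for the illustration: it keeps any tie-breaking between equally good splits identical as well, whereas the forest itself gives each tree a different seed.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=20)

# Fit ten trees on identical data with identical settings.
identical_trees = [
    DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
    for _ in range(10)
]

# Each tree's prediction for the first test observation.
same_data_predictions = np.array(
    [tree.predict(X_test[:1, :])[0] for tree in identical_trees])

print(same_data_predictions)
print(len(np.unique(same_data_predictions)))  # number of distinct predictions
```

All ten trees return the same value, so averaging them gives back that one tree's prediction and none of its variance is removed.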

To accomplish this, we train each tree on a different sample of the data.  This idea is just what it sounds like: each tree of a random forest trains on its own random sample of the training rows.  In sklearn, with `bootstrap=True` (visible in the fitted regressor above), that sample is drawn with replacement, so some rows appear more than once and others are left out.  This way, even if every tree considered all of the features, we would still get different trees, because each tree sees different training data.

So with a random forest we not only train multiple trees, but also train each tree on a different sample of the data.
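The idea can be sketched by hand with plain decision trees.  This is a simplified illustration, not sklearn's internal implementation: it draws row indices with replacement using NumPy, fits one tree per sample, and averages the results.  The names `bootstrap_trees`, `sample_idx`, and `unique_fractions` are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=20)

rng = np.random.default_rng(1)
n_rows = X_train.shape[0]

bootstrap_trees = []
unique_fractions = []
for _ in range(10):
    # Draw row indices with replacement: some rows repeat, others are left out.
    sample_idx = rng.integers(0, n_rows, size=n_rows)
    unique_fractions.append(len(np.unique(sample_idx)) / n_rows)
    tree = DecisionTreeRegressor(random_state=0)
    tree.fit(X_train[sample_idx], y_train[sample_idx])
    bootstrap_trees.append(tree)

# Shape (n_trees, n_test_rows), then average over the tree axis.
per_tree = np.stack([tree.predict(X_test) for tree in bootstrap_trees])
ensemble_prediction = per_tree.mean(axis=0)

print(np.round(unique_fractions, 2))  # share of distinct training rows each tree saw
print(per_tree[:, 0])                 # the ten trees' predictions for the first test row
print(ensemble_prediction[0])         # their average
```

Each bootstrap sample contains roughly 63% of the distinct training rows, which is enough of a difference for the trees to split differently even though every tree uses the same settings.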

## Summary

In this lesson we took a first look at a random forest in sklearn.  We saw that we can initialize a random forest with a number of estimators, where each estimator is a decision tree fit to our data, and that the forest's prediction is the average of its trees' predictions.  To create variation among the decision trees, each one is fit to a different random sample of the training data.

### Resources

[Random Forest Top to Bottom](https://www.gormanalysis.com/blog/random-forest-from-top-to-bottom/)