# Subsampling Observations

### Introduction

Main idea:

* Use a highly flexible, high-variance model to find different patterns in the data, then reduce that variance through aggregation

Goals:

1. Increase the differences between the trees by training each one on a different subsample of the observations
2. Keep each tree as unbiased as possible
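
To see why both goals matter, here is a minimal sketch based on the standard formula for the variance of an average of identically distributed, equally correlated predictions. The symbols `sigma2` (variance of a single tree's prediction) and `rho` (pairwise correlation between trees) are illustrative assumptions, not quantities measured in this lesson.

```python
def variance_of_average(sigma2, rho, n_trees):
    """Variance of the mean of n_trees predictions.

    Assumes every prediction has variance sigma2 and every pair of
    predictions has the same correlation rho (an idealized setting).
    """
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


# Hypothetical values: unit variance per tree, 25 trees, varying correlation.
for rho in (0.9, 0.5, 0.1):
    print(rho, round(variance_of_average(sigma2=1.0, rho=rho, n_trees=25), 3))
```

With 25 trees, lowering the correlation from 0.9 to 0.1 drops the variance of the average from roughly 0.90 to roughly 0.14. Under these assumptions, making the trees more different from one another is what lets aggregation pay off.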


### Loading our data

For this lesson, let's work with the diabetes dataset.

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes

dataset = load_diabetes()
X = dataset['data']
y = dataset['target']

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.33, random_state=20)

## Exploring Random Forests

In [6]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=25, random_state=1)
rfr.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=25, n_jobs=None,
           oob_score=False, random_state=1, verbose=0, warm_start=False)

* Viewing estimators

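Each estimator in `rfr.estimators_` is simply one of the decision trees that we saw earlier on. We can look at each tree's prediction for the first observation in the test set. Averaging these predictions gives the forest's own prediction, as the next cells confirm.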

In [10]:
import numpy as np
# Collect each tree's prediction for the first test observation
tree_predictions = np.hstack([estimator.predict(X_test[:1, :]) for estimator in rfr.estimators_])
tree_predictions

array([ 83.,  93.,  47.,  77.,  96.,  69.,  69.,  69.,  93.,  64.,  96.,
        88., 135.,  77.,  96.,  60.,  37.,  31., 116., 125.,  64.,  77.,
        96.,  96.,  47.])

In [11]:
np.mean(tree_predictions)

80.04

In [12]:
rfr.predict(X_test[:1, :])

array([80.04])
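
The same check can be run over the whole test set rather than a single row. The sketch below repeats the setup from the cells above so that it runs on its own; the names `per_tree`, `manual_average`, and `matches` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Same data split and forest settings as in the lesson.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=20)

rfr = RandomForestRegressor(n_estimators=25, random_state=1)
rfr.fit(X_train, y_train)

# One row per tree, one column per test observation.
per_tree = np.array([tree.predict(X_test) for tree in rfr.estimators_])

# Average across trees and compare with the forest's prediction.
manual_average = per_tree.mean(axis=0)
matches = np.allclose(manual_average, rfr.predict(X_test))
print(per_tree.shape, matches)
```

If `matches` is `True`, the forest's prediction for every test observation is the plain average of its trees' predictions.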

### Subsampling Data

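If we look at the individual tree predictions, we see that the trees disagree with one another about the same observation. Because the forest was fit with `bootstrap=True`, each tree is trained on a different random sample of the training observations, drawn with replacement, so each tree learns somewhat different splits. Collecting the predictions into a set shows the distinct values.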

In [7]:
set(tree_predictions)

{31.0, 37.0, 47.0, 60.0, 64.0, 69.0, 77.0, 83.0, 88.0, 93.0, 96.0, 116.0, 125.0, 135.0}
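
To make the subsampling step concrete, here is a hand-rolled sketch of the idea: draw a bootstrap sample of the training rows, fit one decision tree on it, and repeat. This illustrates the technique and is not scikit-learn's internal implementation. The seed, the loop, and the names `rng`, `idx`, `unique_fractions`, and `manual_predictions` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Same data split as in the lesson.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=20)

rng = np.random.default_rng(0)  # arbitrary seed, chosen for repeatability
n_train = X_train.shape[0]

unique_fractions = []
manual_predictions = []
for _ in range(25):
    # Bootstrap sample: n_train row indices drawn with replacement.
    idx = rng.integers(0, n_train, size=n_train)
    unique_fractions.append(np.unique(idx).size / n_train)

    # Fit a fully grown tree on the resampled rows only.
    tree = DecisionTreeRegressor(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    manual_predictions.append(tree.predict(X_test[:1, :])[0])

manual_predictions = np.array(manual_predictions)
print(np.mean(unique_fractions))  # share of distinct training rows per sample
print(manual_predictions.mean())  # bagged prediction for the first test row
```

Each bootstrap sample typically contains only about 63% of the distinct training rows (roughly 1 - 1/e), so every tree sees a noticeably different view of the data. That is where the spread in the individual predictions comes from.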

### Resources

[Random Forest Top to Bottom](https://www.gormanalysis.com/blog/random-forest-from-top-to-bottom/)