# Introduction to Random Forests

### Variance from the leaves

We have already seen the problem of overfitting in decision trees.  Overfitting is a problem of variance.  There is randomness in our training data, and our decision tree perfectly matches this randomness, but then does not do so well when this randomness is not repeated.  A decision tree can perfectly match the data in the training set because each leaf node makes its prediction from a small number of samples (maybe just one).  But because each prediction is based on a sample size of just one, these predictions are not necessarily predictive of future data.

We also saw a solution for correcting this overfitting, which is to stop our decision tree from splitting before the leaves are left with just one sample.  This solution does a fine job of reducing overfitting in the leaf nodes, but unfortunately we have not solved the problem of variance altogether.
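As a reminder of what that looks like in code, here is a minimal sketch. It assumes scikit-learn's `min_samples_leaf` parameter is used to stop the splitting (the earlier lesson may have used a different setting), and it runs on synthetic data from `make_regression` rather than the Airbnb data. The value 20 is an arbitrary choice.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration, not the Airbnb data)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.40, random_state=42)

# No limit: the tree keeps splitting until its leaves are pure,
# which for a continuous target usually means one sample per leaf
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
# Every leaf must keep at least 20 training samples (20 is an arbitrary choice)
limited_tree = DecisionTreeRegressor(min_samples_leaf=20, random_state=0).fit(X_tr, y_tr)

for name, tree in [("unlimited", deep_tree), ("min_samples_leaf=20", limited_tree)]:
    print(name, "| leaves:", tree.get_n_leaves(),
          "| train R^2:", round(tree.score(X_tr, y_tr), 3),
          "| test R^2:", round(tree.score(X_te, y_te), 3))
```

The gap between the training and test scores is the thing to look at: a large gap suggests the tree is matching randomness in the training set.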

Beyond the leaf nodes, there are other reasons why decision trees are highly susceptible to variance.  With decision trees, any split made at one layer can affect the splits made at each subsequent layer.  So if we had made a different split, we could have arrived at a totally different tree.  Let's see how variance can play a role.
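To get a feel for how much a single split depends on the training data, here is a minimal sketch that fits a one-split tree on two different random subsets of the same data and prints the split chosen at the root. The data is synthetic (generated with `make_regression`), not the Airbnb dataset, and the helper name `root_split` is made up for this example.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration, not the Airbnb data)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=25.0, random_state=0)

def root_split(seed):
    """Fit a depth-1 tree on a random 60% subset and return its root split."""
    X_tr, _, y_tr, _ = train_test_split(X_demo, y_demo, test_size=0.40, random_state=seed)
    stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_tr, y_tr)
    # tree_.feature[0] and tree_.threshold[0] describe the split at the root node
    return int(stump.tree_.feature[0]), float(stump.tree_.threshold[0])

for seed in (42, 21):
    feature, threshold = root_split(seed)
    print(f"seed={seed}: root splits on feature {feature} at threshold {threshold:.3f}")
```

Any difference at the root carries down the tree, since every later split is chosen only from the rows that the earlier splits sent to that node.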

### Loading our data

Let's work with our Airbnb dataset again.  We have already cleaned our data, so we just need to load it into our notebook, and then we can feed it into a decision tree.

In [6]:
import pandas as pd
df = pd.read_feather('cleaned_df.feather')
X = df.drop(columns=['price'])
y = df.price

#### Training our trees

1. The first tree

For our first decision tree, we simply split our data into training and test sets and fit our tree. Notice that we are limiting our tree to a max depth of 4.  We are doing this to keep the tree relatively simple, so that we can plot the tree and then compare it against another tree later on.

In [4]:
from sklearn.model_selection import train_test_split
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X, y, test_size=0.40, random_state=42)
from sklearn.tree import DecisionTreeRegressor
dtr_1 = DecisionTreeRegressor(max_depth=4)
dtr_1.fit(X_train_1, y_train_1)

DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

2. The second tree

We train the second tree the same way.  The only difference is that to ensure we are training on a different subset of the data, we change the value of our random state.

In [5]:
from sklearn.model_selection import train_test_split
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X, y, test_size=0.40, random_state=21)
from sklearn.tree import DecisionTreeRegressor
dtr_2 = DecisionTreeRegressor(max_depth=4)
dtr_2.fit(X_train_2, y_train_2)

DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

### Comparing our trees

Ok, now let's compare our two trees.

The first tree's middle two layers look like the following.

![](./dtr_1-a.png)

And this is the second tree's middle two layers.

![](./dtr_2-a.png)

So while there is some overlap between the attributes selected, half of the attributes in the second tree are new.  The differences in the trees lead to different predictions for the same datapoints.

> Here, so that neither tree is scored on a datapoint it was trained on, we find the datapoints that were in the test sets of both.

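In [123]:
import numpy as np
# Row labels that landed in the test set of both splits, so neither tree trained on them
in_both_test_sets = np.intersect1d(y_test_1.index, y_test_2.index)
# Draw ten distinct labels at random
ten_in_both_tests = np.random.choice(in_both_test_sets, 10, replace=False)
# Fix the ten labels so that the results below are reproducible
ten_in_both_tests = np.array([  221,  7922, 17116, 11118,  2497,  4199,  8692,  5348, 14548,
        1455])
# These values are index labels, so select with .loc for both X and y
ten_X = X.loc[ten_in_both_tests, :]
ten_y = y.loc[ten_in_both_tests]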

Then we make predictions on the datapoints with the two trees and see how we do.

In [93]:
dtr_1_predictions = dtr_1.predict(ten_X)
dtr_2_predictions = dtr_2.predict(ten_X)

In [118]:

predictions_grid = np.hstack((dtr_1_predictions.reshape(-1, 1),
                              dtr_2_predictions.reshape(-1, 1),
                              ten_y.to_numpy().reshape(-1, 1))
                             )

df_predictions = pd.DataFrame(predictions_grid, columns = ['dtr_1_predictions', 'dtr_2_predictions', 'y'])
df_predictions

| | dtr_1_predictions | dtr_2_predictions | y |
|---|---|---|---|
| 0 | 51.46048 | 69.814031 | 63.0 |
| 1 | 51.46048 | 40.752675 | 20.0 |
| 2 | 51.46048 | 69.814031 | 45.0 |
| 3 | 51.46048 | 69.814031 | 40.0 |
| 4 | 51.46048 | 40.752675 | 22.0 |
| 5 | 106.64289 | 126.464427 | 95.0 |
| 6 | 51.46048 | 40.752675 | 43.0 |
| 7 | 51.46048 | 40.752675 | 25.0 |
| 8 | 51.46048 | 69.814031 | 42.0 |
| 9 | 51.46048 | 69.814031 | 100.0 |


We can see from the dataframe above that our two trees are making fairly different predictions, even though each was trained on most of the data and we are using relatively simple trees.  So this is our problem of variance: each tree is picking up on the random variations in the data, which leads to variations in our predictions.
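To put a number on the disagreement, the sketch below copies the ten rows of predictions shown above into plain Python lists and computes the average absolute gap between the two trees. The variable names are chosen for this example only.

```python
# Predictions copied from the ten rows shown above
dtr_1_preds = [51.46048] * 5 + [106.64289] + [51.46048] * 4
dtr_2_preds = [69.814031, 40.752675, 69.814031, 69.814031, 40.752675,
               126.464427, 40.752675, 40.752675, 69.814031, 69.814031]

# Absolute gap between the two trees for each datapoint
gaps = [abs(a - b) for a, b in zip(dtr_1_preds, dtr_2_preds)]
mean_gap = sum(gaps) / len(gaps)
print(f"average gap between the two trees: {mean_gap:.2f}")
```

On these ten datapoints the two trees differ by roughly 15 on average, in the same units as `y`.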

With error due to variance, we expect each model to deviate from the true underlying relationship, but if we were to average many of these models, we would expect the error due to variance to be reduced.
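A small simulation can make this concrete. It is a sketch under a strong assumption: each model's prediction is the true value plus independent random noise. Trees trained on overlapping data are correlated, so the reduction in practice is smaller. The constants `TRUE_VALUE`, `NOISE_SD`, `N_MODELS` and `N_TRIALS` are arbitrary values picked for illustration.

```python
import random
import statistics

random.seed(0)
TRUE_VALUE = 50.0   # hypothetical true value we are trying to predict
NOISE_SD = 20.0     # hypothetical spread of a single model's error
N_MODELS = 25       # number of models averaged together
N_TRIALS = 2000     # how many times we repeat the experiment

def noisy_prediction():
    """One model's prediction: the true value plus independent noise."""
    return random.gauss(TRUE_VALUE, NOISE_SD)

# Predictions from a single model, repeated many times
single = [noisy_prediction() for _ in range(N_TRIALS)]
# Each entry is the average of N_MODELS independent predictions
averaged = [statistics.mean(noisy_prediction() for _ in range(N_MODELS))
            for _ in range(N_TRIALS)]

var_single = statistics.variance(single)
var_averaged = statistics.variance(averaged)
print("variance of a single model:      ", round(var_single, 1))
print("variance of the averaged models: ", round(var_averaged, 1))
```

Under the independence assumption, the variance of the average shrinks roughly in proportion to the number of models averaged.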

### Introducing Random Forest

That is the principal idea behind a random forest.  We begin with a machine learning algorithm, the decision tree, that suffers from variance, and we reduce the error due to variance by creating many trees and then taking the average of our various models' predictions.
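With only two trees we can already try the averaging step by hand. The sketch below reuses the ten predictions and actual values shown earlier in this lesson, averages the two trees for each datapoint, and compares mean absolute error. Ten datapoints and two trees are far too few to draw conclusions from, and a real random forest averages many trees. The helper `mae` is defined here just for the example.

```python
# Actual values and predictions copied from the ten rows shown earlier
y_true = [63.0, 20.0, 45.0, 40.0, 22.0, 95.0, 43.0, 25.0, 42.0, 100.0]
dtr_1_preds = [51.46048] * 5 + [106.64289] + [51.46048] * 4
dtr_2_preds = [69.814031, 40.752675, 69.814031, 69.814031, 40.752675,
               126.464427, 40.752675, 40.752675, 69.814031, 69.814031]

def mae(preds, actual):
    """Mean absolute error between predictions and actual values."""
    return sum(abs(p - a) for p, a in zip(preds, actual)) / len(actual)

# The "forest" prediction is the average of the individual trees' predictions
avg_preds = [(a + b) / 2 for a, b in zip(dtr_1_preds, dtr_2_preds)]

mae_1 = mae(dtr_1_preds, y_true)
mae_2 = mae(dtr_2_preds, y_true)
mae_avg = mae(avg_preds, y_true)
print(f"tree 1: {mae_1:.2f}  tree 2: {mae_2:.2f}  average of both: {mae_avg:.2f}")
```

On these ten datapoints the averaged predictions have a slightly lower error than either tree alone (about 19.3 versus 19.5 and 20.8).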

### Summary

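In this lesson, we introduced the idea of a random forest.  A single decision tree is sensitive to random variation in its training data, so we train many trees and average their predictions to reduce the error due to variance.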