## Combining Model Predictions With Ensembles

A random forest is a kind of ensemble model. Ensembles combine the predictions of multiple models to create a more accurate final prediction. We'll make a simple ensemble to see how they work.

Let's create two decision trees with slightly different parameters:

* One with min_samples_leaf set to 2
* One with max_depth set to 5

Then, we'll check their accuracies (measured here as AUC) separately. In the next section, we'll combine their predictions and compare the combined AUC with the individual AUC of each tree.

In [40]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

import pandas as pd
import numpy as np

In [5]:
income = pd.read_csv('income.csv')

columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship",
           "race", "sex", "hours_per_week", "native_country"]

In [7]:
for col in columns:
    income[col] = pd.Categorical(income[col]).codes

income['high_income'] = pd.Categorical(income['high_income']).codes

In [9]:
X_train, X_test, y_train, y_test = train_test_split(income[columns], income['high_income'], test_size=0.2, random_state=1)

In [10]:
clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=2)
clf.fit(X_train, y_train)

clf2 = DecisionTreeClassifier(random_state=1, max_depth=5)
clf2.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=5, random_state=1)

In [11]:
auc1 = roc_auc_score(y_test, clf.predict(X_test))
auc2 = roc_auc_score(y_test, clf2.predict(X_test))

print(auc1)
print(auc2)

0.6962823579658807
0.6777293380407088


## Combining Our Predictions

In [13]:
predictions = clf.predict_proba(X_test)[:,1]
predictions2 = clf2.predict_proba(X_test)[:,1]

preds = np.round((predictions+predictions2)/2)

roc_auc_score(y_test, preds)

0.7251618416781492

In [16]:
np.unique(preds)

array([0., 1.])
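
The averaging-and-rounding step above can be sketched on toy numbers. The arrays below are made up for illustration and are not model output. One detail worth knowing: `np.round` rounds exact halves to the nearest even number, so an average of exactly 0.5 (the two trees fully disagree) becomes 0, whereas an explicit `>= 0.5` threshold would give 1.

```python
import numpy as np

# Hypothetical class-1 probabilities from two trees for four rows
probs_a = np.array([1.0, 0.0, 1.0, 0.0])
probs_b = np.array([0.75, 0.25, 0.0, 1.0])

# Element-wise mean of the two trees: [0.875, 0.125, 0.5, 0.5]
averaged = (probs_a + probs_b) / 2

# np.round sends the exact 0.5 ties to 0 (round-half-to-even)
rounded = np.round(averaged)

# An explicit threshold sends the same ties to 1
thresholded = (averaged >= 0.5).astype(int)

print(rounded)
print(thresholded)
```

Which tie-breaking rule is preferable depends on the problem; the point is only that the choice affects rows where the models disagree completely.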

## Why Ensembling Works

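As the previous section shows, the combined predictions of the two trees had a higher test AUC (about 0.725) than either tree on its own (about 0.696 and 0.678). For reference, the table below lists train and test AUC for several tree parameter settings: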

|                     settings                    | train AUC | test AUC |
|:-----------------------------------------------:|:---------:|:--------:|
| default (min_samples_split: 2, max_depth: None) | 0.947     | 0.694    |
| min_samples_split: 13                           | 0.843     | 0.700    |
| min_samples_split: 13, max_depth: 7             | 0.748     | 0.7744   |
| min_samples_split: 100, max_depth: 2            | 0.662     | 0.655    |

To intuitively understand why this makes sense, think about two people with the same level of talent. One learned programming in college, while the other learned it on her own (let's say using Dataquest!).

If we give both of them the same project, they'll approach it in slightly different ways, due to their different experiences. They may both produce code that achieves the same result, but one may run faster in certain areas. The other may have a better interface. Even though both of them have about the same level of talent, their solutions are stronger in different areas because they approached the problem differently.

If we combine the best parts of both of their projects, we'll end up with a stronger combined project.

Ensembling is exactly the same. The models are approaching the same problem in slightly different ways, and building different trees because we used different parameters for each one. Each tree makes different predictions in different areas. Even though both trees have about the same accuracy, when we combine them, the result is stronger because it leverages the strengths of both approaches.

The more "diverse" or dissimilar the models we use to construct an ensemble are, the stronger their combined predictions will be (assuming that all of the models have about the same accuracy). Ensembling a decision tree and a logistic regression model, for example, will result in stronger predictions than ensembling two decision trees with similar parameters. That's because those two models use very different approaches to arrive at their answers.

On the other hand, if the models we ensemble are very similar in how they make predictions, ensembling will result in a negligible boost.

Ensembling models with very different accuracies generally won't improve overall accuracy. Ensembling a model with a .75 AUC and a model with a .85 AUC on a test set will usually result in an AUC somewhere in between the two original values. There's a way around this that we'll discuss later on, which we call weighting.
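
As a rough preview of what weighting could look like, here is a minimal sketch. The probabilities and the weights are hypothetical values chosen for illustration, and `np.average` is just one way to compute a weighted mean; it is not necessarily the approach covered later.

```python
import numpy as np

# Hypothetical class-1 probabilities for three rows
weaker_model = np.array([0.2, 0.6, 0.9])
stronger_model = np.array([0.4, 0.8, 0.1])

# Assumed weights: the stronger model counts three times as much
weights = [0.25, 0.75]

# Unweighted mean treats both models equally
plain_average = np.mean([weaker_model, stronger_model], axis=0)

# Weighted mean pulls the result toward the stronger model
weighted_average = np.average(
    [weaker_model, stronger_model], axis=0, weights=weights
)

print(plain_average)
print(weighted_average)
```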

## Introducing Variation With Bagging

A random forest is an ensemble of decision trees. If we don't make any modifications to the trees, each tree will be exactly the same, so we'll get no boost when we ensemble them. In order to make ensembling effective, we have to introduce variation into each individual decision tree model.

If we introduce variation, each tree will be constructed slightly differently, and will therefore make different predictions. This variation is what puts the "random" in "random forest."

There are two main ways to introduce variation in a random forest -- bagging and random feature subsets. We'll dive into bagging first.

In a random forest, we don't train each tree on the entire data set. **We train it on a random sample of the data, or a "`bag`,"** instead. We perform this sampling with replacement, which means that after we select a row from the data we're sampling, we put the row back in the data so it can be picked again. Some rows from the original data may appear in the "bag" multiple times.
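
To make "sampling with replacement" concrete, here is a small sketch using only the standard library. The ten row labels and the seed are stand-ins chosen for illustration, not the income data.

```python
import random

rows = list(range(10))             # stand-in row labels 0..9
bag_size = round(len(rows) * 0.6)  # 60% of the rows, i.e. 6

rng = random.Random(0)             # arbitrary seed for repeatability

# With replacement: the same row label may be drawn more than once
bag = rng.choices(rows, k=bag_size)

# Without replacement, for contrast: every drawn label is distinct
no_repeats = rng.sample(rows, bag_size)

print(bag)
print(len(set(bag)), "distinct rows out of", len(bag))
print(no_repeats)
```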

In [37]:
# We'll build 10 trees
tree_count = 10

# Each "bag" will have 60% of the number of original rows
bag_proportion = .6

predictions = []
for i in range(tree_count):
    # We select 60% of the rows from train, sampling with replacement
    # We set a random state to ensure we'll be able to replicate our results
    # We set it to i instead of a fixed value so we don't get the same sample in every loop
    # That would make all of our trees the same
    bag = X_train.sample(frac=bag_proportion, replace=True, random_state=i)
    # Fit a decision tree model to the "bag"
    clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=2)
    clf.fit(bag[columns], y_train[bag.index])
    
    # Using the model, make predictions on the test data
    predictions.append(clf.predict_proba(X_test)[:,1])

preds = np.round(sum(predictions)/len(predictions))
                       
roc_auc_score(y_test, preds)


0.7415762848252972
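
scikit-learn also ships a `BaggingClassifier` that wraps this loop. The sketch below is an illustration under assumptions: it uses synthetic data from `make_classification` as a stand-in for the income data, and it scores AUC on the averaged probabilities rather than on rounded predictions, so its numbers are not comparable to the result above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the income data set)
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# 10 trees, each fit on a bootstrap sample of 60% of the training rows
bagger = BaggingClassifier(
    DecisionTreeClassifier(min_samples_leaf=2),
    n_estimators=10,
    max_samples=0.6,
    bootstrap=True,
    random_state=1,
)
bagger.fit(X_tr, y_tr)

# predict_proba averages the class probabilities of the individual trees
bag_probs = bagger.predict_proba(X_te)[:, 1]
bag_auc = roc_auc_score(y_te, bag_probs)
print(bag_auc)
```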

## Random Subsets in scikit-learn

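We can also use random feature subsets in scikit-learn. We set the max_features parameter on DecisionTreeClassifier to "sqrt" (older scikit-learn versions also accepted "auto", which meant the same thing for this classifier) and the splitter parameter to "random". If we have N columns, max_features picks a subset of features of size $\sqrt{N}$ at each split, computes the Gini impurity for each (this is similar to information gain), and splits the node on the best column in the subset. Setting splitter to "random" adds further variation by choosing the best of a set of randomly drawn splits rather than searching for the best possible split.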
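
The subset size and the impurity measure can be sketched by hand. This is an illustration, not scikit-learn's internal code: it assumes the square root is truncated to an integer, uses an arbitrary seed, and `gini_impurity` is a hypothetical helper written for this example.

```python
import math
import random

columns = ["age", "workclass", "education_num", "marital_status", "occupation",
           "relationship", "race", "sex", "hours_per_week", "native_country"]

# sqrt(10) is about 3.16, truncated here to 3 candidate features per split
subset_size = max(1, int(math.sqrt(len(columns))))

rng = random.Random(1)  # arbitrary seed
candidate_features = rng.sample(columns, subset_size)


def gini_impurity(labels):
    """Return 1 - sum(p_k^2) over the class proportions p_k in labels."""
    total = len(labels)
    proportions = [labels.count(value) / total for value in set(labels)]
    return 1.0 - sum(p ** 2 for p in proportions)


print(subset_size, candidate_features)
print(gini_impurity([0, 0, 1, 1]))  # evenly mixed node
print(gini_impurity([1, 1, 1, 1]))  # pure node
```

A split is better the more it lowers the impurity of the resulting child nodes, which is why a pure node scores 0.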

In [39]:
# We'll build 10 trees
tree_count = 10

# Each "bag" will have 60% of the number of original rows
bag_proportion = .6

predictions = []
for i in range(tree_count):
    # We select 60% of the rows from train, sampling with replacement
    # We set a random state to ensure we'll be able to replicate our results
    # We set it to i instead of a fixed value so we don't get the same sample every time
    bag = X_train.sample(frac=bag_proportion, replace=True, random_state=i)
    
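    # Fit a decision tree model to the "bag"
    # max_features='sqrt' considers sqrt(N) features per split; older scikit-learn
    # versions spelled this 'auto', which has since been removed
    clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=2, splitter='random', max_features='sqrt')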
    clf.fit(bag[columns], y_train[bag.index])
    
    # Using the model, make predictions on the test data
    predictions.append(clf.predict_proba(X_test)[:,1])

combined = np.sum(predictions, axis=0) / len(predictions)
rounded = np.round(combined)

print(roc_auc_score(y_test, rounded))

0.7487868062537482


## Practice Putting it All Together

In [41]:
clf = RandomForestClassifier(n_estimators=5, random_state=1, min_samples_leaf=2)

clf.fit(X_train, y_train)
preds = clf.predict(X_test)

print(roc_auc_score(y_test, preds))

0.746084784139288


## Tweaking Parameters to Increase Accuracy

We can tweak some of the parameters for random forests, including:

* min_samples_leaf
* min_samples_split
* max_depth
* max_leaf_nodes

In [42]:
clf = RandomForestClassifier(n_estimators=150, random_state=1, min_samples_leaf=2)

clf.fit(X_train, y_train)
preds = clf.predict(X_test)

print(roc_auc_score(y_test, preds))

0.7552550543495277


In [44]:
clf = RandomForestClassifier(n_estimators=150, random_state=1, min_samples_leaf=2, max_depth=3)

clf.fit(X_train, y_train)
preds = clf.predict(X_test)

print(roc_auc_score(y_test, preds))

0.6605350228576031


## Reducing Overfitting

One of the major advantages of random forests over single decision trees is that they tend to overfit less. Although the individual decision trees in a random forest vary widely, the average of their predictions is less sensitive to the input data than a single tree is. While one tree can construct an incorrect, overfit model, the average of 100 or more trees is more likely to home in on the signal and ignore the noise. The signal is the same across all of the trees, whereas each tree picks up the noise differently, so averaging tends to cancel the noise and keep the signal.
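
The noise-cancelling argument can be illustrated with a small simulation. This is an idealized sketch: it assumes each "tree" sees the true signal plus independent Gaussian noise, which real trees trained on overlapping data do not fully satisfy, so the improvement in practice is smaller.

```python
import numpy as np

rng = np.random.default_rng(0)

# The shared signal every simulated tree is trying to recover
signal = np.linspace(0.0, 1.0, 50)

# 100 simulated trees: same signal, independent noise for each tree
noisy_trees = signal + rng.normal(0.0, 0.5, size=(100, 50))

# Mean squared error of one tree versus the average of all 100
single_tree_mse = np.mean((noisy_trees[0] - signal) ** 2)
ensemble_mse = np.mean((noisy_trees.mean(axis=0) - signal) ** 2)

print(single_tree_mse)
print(ensemble_mse)
```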

In [45]:
clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=5)

clf.fit(X_train, y_train)

predictions = clf.predict(X_train)
print(roc_auc_score(y_train, predictions))

predictions = clf.predict(X_test)
print(roc_auc_score(y_test, predictions))


clf = RandomForestClassifier(n_estimators=150, random_state=1, min_samples_leaf=5)

clf.fit(X_train, y_train)

predictions = clf.predict(X_train)
print(roc_auc_score(y_train, predictions))

predictions = clf.predict(X_test)
print(roc_auc_score(y_test, predictions))

0.8177223176546391
0.7207468039095158
0.7934448046614619
0.7556351223804341
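
One way to read these four numbers is as the gap between train and test AUC for each model; a smaller gap suggests less overfitting. The short calculation below simply reuses the values printed above.

```python
# AUC values copied from the output above
tree_train, tree_test = 0.8177223176546391, 0.7207468039095158
forest_train, forest_test = 0.7934448046614619, 0.7556351223804341

# Gap between train and test AUC for each model
tree_gap = tree_train - tree_test
forest_gap = forest_train - forest_test

print(round(tree_gap, 3))    # single decision tree
print(round(forest_gap, 3))  # random forest
```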


In [49]:
%%time
clf = RandomForestClassifier(n_estimators=150, random_state=1, min_samples_leaf=2, n_jobs=4)

clf.fit(X_train, y_train)
preds = clf.predict(X_test)

print(roc_auc_score(y_test, preds))

0.7552550543495277
Wall time: 2.2 s


In [50]:
%%time
clf = RandomForestClassifier(n_estimators=150, random_state=1, min_samples_leaf=2)

clf.fit(X_train, y_train)
preds = clf.predict(X_test)

print(roc_auc_score(y_test, preds))

0.7552550543495277
Wall time: 6.34 s


## When to Use Random Forests

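As the train and test AUC scores in the previous section show, the random forest overfit less than the single tree (the gap between train and test AUC shrank from about 0.097 to about 0.038), and test AUC went up from about 0.721 to about 0.756.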

While the random forest algorithm is incredibly powerful, it isn't applicable to all tasks. The main strengths of a random forest are:

* Very accurate predictions - Random forests achieve near state-of-the-art performance on many machine learning tasks. Along with neural networks and gradient-boosted trees, they're typically one of the top-performing algorithms.
* Resistance to overfitting - Due to their construction, random forests are fairly resistant to overfitting. We still need to set and tweak parameters like max_depth though.

The main weaknesses of using a random forest are:

* They're difficult to interpret - Because we're averaging the results of many trees, it can be hard to figure out why a random forest is making predictions the way it is.
* They take longer to create - Making two trees takes twice as long as making one, making three takes three times as long, and so on. Fortunately, we can exploit multicore processors to parallelize tree construction. scikit-learn allows us to do this through the n_jobs parameter on RandomForestClassifier. We'll discuss parallelization in greater detail later on.
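
To put a number on the n_jobs benefit, the wall times reported in the timing cells earlier (2.2 s with n_jobs=4, 6.34 s without) can be turned into a speedup figure. These are single runs on one machine, so treat the result as indicative only.

```python
# Wall times copied from the timing cells earlier in this notebook
serial_seconds = 6.34    # n_jobs not set
parallel_seconds = 2.2   # n_jobs=4
jobs = 4

# How many times faster the parallel run was
speedup = serial_seconds / parallel_seconds

# Fraction of the ideal 4x speedup that was achieved
efficiency = speedup / jobs

print(round(speedup, 2))
print(round(efficiency, 2))
```

The speedup falls short of 4x because some of the work, such as dispatching jobs and collecting results, is not parallelized.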

Given these trade-offs, it makes sense to use random forests in situations where accuracy is of the utmost importance and being able to interpret or explain the model's decisions isn't key. In cases where time is of the essence or interpretability is important, a single decision tree may be a better choice.

Next up in this course is a guided project where you'll explore using random forests to predict bike rentals.