# Chapter 7.

## Ensemble Learning and Random Forests

In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

In [2]:
X, y = make_moons(n_samples=100000, noise=0.15)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [3]:
# declare 3 classifiers independently
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

# then create an ensemble classifier with 
# those 3 classifiers and train it
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard'
)

voting_clf.fit(X_train, y_train)

In [4]:
# test the accuracy of each classifier and the ensemble classifier
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # in this run the voting_clf roughly matches the best single model (SVC) rather than beating it
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.8758
RandomForestClassifier 0.9893
SVC 0.9902
VotingClassifier 0.9901


In [5]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), # the type of classifier to fit
    n_estimators=500, # in this case, 500 decision trees
    max_samples=100, # each predictor is trained on 100 training instances
    bootstrap=True, # WITH replacement
    n_jobs=-1, # -1 means: use all available cores
    oob_score=True # score with out-of-bag instances
)

bag_clf.fit(X_train, y_train)
# oob_score_ estimates the accuracy on unseen data, similar to what a held-out set would give
bag_clf.oob_score_ 

0.97465

### Note:
With Random Forests you can measure the relative importance of each feature. Scikit-learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce **impurity** on average, across all trees in the forest.

In [7]:
from sklearn.datasets import load_iris

iris = load_iris() # load iris dataset

rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)

rnd_clf.fit(iris['data'], iris['target'])
for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
    # print the importance (in percentage) of each feature
    print(name, score * 100)

sepal length (cm) 10.342835424946417
sepal width (cm) 2.623617409818954
petal length (cm) 40.60955925820296
petal width (cm) 46.423987907031666


## Exercises

### Question 1
If you have trained five different models on the exact same training data, and they all achieve 95% precision, is there any chance that you can combine these models to get better results? If so, how? If not, why?

**My Answer**: Yes. The ensemble can outperform the individual models because different models tend to make different errors on the same data, so combining their votes can cancel some of those errors out.

**Book's Answer:** "If you have trained five different models and they all achieve 95% precision, you can try combining them into a **voting ensemble**, which will often give you even better results. It works better if the models are very different. It is even better if they are trained on different training instances (bagging and pasting ensembles), but if not, this will still be effective as long as **the models are different**."
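
As a rough illustration of why voting can help, the sketch below computes the accuracy of a majority vote under the idealized assumption that the five models err independently and that each one is correct 95% of the time. The question speaks of precision; accuracy is used here only to keep the arithmetic simple. Models trained on the same data usually make correlated errors, so the real gain is smaller. The function name `majority_vote_accuracy` is made up for this example.

```python
from math import comb

def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of independent models is correct."""
    needed = n_models // 2 + 1  # smallest number of votes that forms a majority
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )

ensemble_acc = majority_vote_accuracy(5, 0.95)
print(f"{ensemble_acc:.4f}")  # 0.9988
```

Under the independence assumption, three or more of the five models must be right, which happens far more often than any single model being right.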

### Question 2
What is the difference between hard and soft voting classifiers?

**My Answer**: A **hard voting classifier** aggregates the predictions of each classifier and predicts the class that gets the most votes, while a **soft voting classifier** predicts the class with the **highest** average probability across the classifiers (all the classifiers need the *predict_proba()* method for this).

**Book's Answer**: "A hard voting classifier just counts the votes of each classifier in the ensemble and picks the class that gets the most votes. A soft voting classifier computes the average estimated class probability for each class and picks the class with the highest probability. This gives high-confidence votes more weight and often performs better, but works only if every classifier is able to estimate class probabilities."
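
The difference can be seen on a single instance. The sketch below uses made-up probabilities from three hypothetical classifiers (the numbers and the helper names `hard_vote` and `soft_vote` are assumptions for illustration, not scikit-learn's implementation): two classifiers weakly prefer class 0, while one strongly prefers class 1.

```python
# Hypothetical [P(class 0), P(class 1)] estimates for one instance.
probas = [
    [0.55, 0.45],  # classifier A: weakly favors class 0
    [0.60, 0.40],  # classifier B: weakly favors class 0
    [0.05, 0.95],  # classifier C: strongly favors class 1
]

def hard_vote(probas):
    """Each classifier votes for its most likely class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return max(set(votes), key=votes.count)

def soft_vote(probas):
    """Average the per-class probabilities, then pick the largest average."""
    n_classes = len(probas[0])
    averages = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=averages.__getitem__)

print(hard_vote(probas))  # 0
print(soft_vote(probas))  # 1
```

Hard voting follows the two weak votes for class 0, whereas soft voting lets the one confident classifier tip the average towards class 1.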

### Question 3
Is it possible to speed up training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, Random Forests, or stacking ensembles?

**My Answer**: Both bagging and pasting ensembles can be trained across multiple servers because their predictors are completely independent of one another. They can also make predictions in parallel. On the other hand, boosting **CAN'T** be trained in parallel because predictors are trained sequentially, each trying to correct its predecessor. Finally, random forests can also be trained in parallel, and a stacking ensemble CAN'T be trained in parallel because the *blender* needs the output of the first layer of predictors.

**Book's Answer**: "It is quite possible to speed up training of a bagging ensemble by distributing it across multiple servers, since each predictor in the ensemble is independent of the others. The same goes for pasting ensembles and random forests, for the same reason. However, each predictor in a boosting ensemble is built based on the previous predictor, so training is necessarily sequential, and you will not gain anything by distributing training across multiple servers. Regarding stacking ensembles, all the predictors in a given layer are independent of each other, so **they can be trained in parallel on multiple servers**. HOWEVER, the predictors in one layer can only be trained after the predictors in the previous layer have all been trained."

### Question 4
What is the benefit of out-of-bag evaluation?

**My Answer**: With out-of-bag evaluation there is no need for a separate validation set, which is especially beneficial when working with a small dataset.

**Book's Answer**: "With out-of-bag evaluation, each predictor in a bagging ensemble is evaluated using instances that it was NOT trained on (they were held out). This makes it possible to have fairly unbiased evaluation of the ensemble without the need for an additional validation set. Thus, you have more instances available for training (**or testing?**), and your ensemble can perform slightly better."
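
To get a feel for how many instances end up out-of-bag, the sketch below draws one bootstrap sample and measures the fraction of instances that were never picked. It assumes the sample size equals the training set size (unlike the `BaggingClassifier` earlier in this chapter, which uses `max_samples=100`), and `oob_fraction` is a helper name invented for this example.

```python
import random

def oob_fraction(n_instances: int, seed: int = 42) -> float:
    """Fraction of instances left out of one bootstrap sample of the same size."""
    rng = random.Random(seed)
    # Sample n_instances indices WITH replacement and keep the distinct ones.
    sampled = {rng.randrange(n_instances) for _ in range(n_instances)}
    return 1 - len(sampled) / n_instances

frac = oob_fraction(10_000)
print(f"out-of-bag fraction: {frac:.3f}")  # expected to be close to exp(-1), about 0.368
```

Roughly a third of the training instances are unseen by each predictor, so they can serve as a free evaluation set for that predictor.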

### Question 5
What makes Extra-Trees more random than regular Random Forests? How can this extra randomness help? Are Extra-Trees slower or faster than regular Random Forests?

**My Answer**: Extra-Trees are faster because they use random thresholds for each feature rather than searching the best possible one. This characteristic comes with more BIAS for the model.

**Book's Answer**: "When you are growing a tree in a Random Forest, only a random subset of the features is considered for splitting at each node. This is true as well for Extra-Trees, but they go one step further: rather than searching for the best possible thresholds, like regular Decision Trees do, they use random thresholds for each feature. This extra randomness acts like a form of regularization: **if a Random Forest overfits the training data, Extra-Trees might perform better**."
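
The threshold choice can be sketched on a single toy feature. This is a simplified illustration, not scikit-learn's actual splitter: the data, the binary Gini helper, and the names `gini` and `split_impurity` are assumptions made for the example. The exhaustive search evaluates every candidate threshold, while the random choice evaluates just one, which is why Extra-Trees are cheaper to train per node.

```python
import random

def gini(labels):
    """Gini impurity of a list of binary labels (0 or 1)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_impurity(xs, ys, threshold):
    """Weighted Gini impurity after splitting on x <= threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    n = len(ys)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [int(x > 6) for x in xs]  # toy labels: the true boundary sits at x = 6

# Regular tree style: scan every candidate threshold and keep the best one.
best_t = min(xs, key=lambda t: split_impurity(xs, ys, t))
# Extra-Trees style: draw a single threshold uniformly between min and max.
random_t = rng.uniform(min(xs), max(xs))

print(split_impurity(xs, ys, best_t) <= split_impurity(xs, ys, random_t))  # True
```

The random split is never purer than the searched one, which is the source of the extra bias; averaging many such trees is what trades that bias for lower variance.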

### Question 6
If your AdaBoost ensemble underfits the training data, which hyperparameters should you tweak and how?

**My Answer**: If an AdaBoost ensemble is underfitting, INCREASING the number of estimators (*n_estimators*) may be a good idea. 

**Book's Answer**: "If your AdaBoost ensemble underfits the training data, you can try increasing the number of estimators or reducing the regularization hyperparameters of the base estimator. You may also try slightly increasing the learning rate."
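
One way to see the effect of the learning rate is through the predictor weight used in the book's description of AdaBoost, alpha = eta * log((1 - r) / r), where r is the predictor's weighted error rate and eta the learning rate. The sketch below only evaluates that formula; `predictor_weight` is a name chosen for this example, and scikit-learn's own implementation may differ in its details.

```python
from math import log

def predictor_weight(error_rate: float, learning_rate: float = 1.0) -> float:
    """alpha = eta * log((1 - r) / r) for a weighted error rate r in (0, 1)."""
    return learning_rate * log((1 - error_rate) / error_rate)

for eta in (0.5, 1.0):
    # The same predictor (20% weighted error) gets twice the say when eta doubles.
    print(eta, round(predictor_weight(0.2, eta), 3))
# 0.5 0.693
# 1.0 1.386
```

A larger learning rate gives each predictor more influence, so the ensemble fits the training data more aggressively with the same number of estimators.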

### Question 7
If your Gradient Boosting ensemble overfits the training set, should you increase or decrease the learning rate?

**My Answer**: If the ensemble is overfitting the training set, the learning_rate needs to be DECREASED. The learning_rate hyperparameter scales the contribution of each tree, so each tree needs to "weigh" less.

**Book's Answer**: "If your Gradient Boosting ensemble overfits the training set, you should try decreasing the learning rate. You could also use early stopping to find the right number of predictors (you probably have too many)."

### Question 8

In [1]:
# load mnist
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1)
X, y = mnist['data'], mnist['target']

In [9]:
# split train, test and validation sets
from sklearn.model_selection import train_test_split

X_batch, X_test, y_batch, y_test = train_test_split(X, y, test_size=0.14, random_state=42)
X_train, X_validate, y_train, y_validate = train_test_split(X_batch, y_batch, test_size=0.16, random_state=42)

In [16]:
# train 3 different classifiers
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import SVC

rnd_clf = RandomForestClassifier(n_estimators=200, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

ext_clf = ExtraTreesClassifier(n_estimators=200, max_leaf_nodes=8, n_jobs=-1)
ext_clf.fit(X_train, y_train)

svm_clf = SVC(kernel='poly', degree=1, coef0=1, C=1, probability=True)
svm_clf.fit(X_train, y_train)

In [17]:
# test the accuracy of each model
from sklearn.metrics import accuracy_score

y_pred_rnd_clf = rnd_clf.predict(X_test)
y_pred_ext_clf = ext_clf.predict(X_test)
y_pred_svm_clf = svm_clf.predict(X_test)

print(f"RandomForestClassifier Accuracy: {accuracy_score(y_test, y_pred_rnd_clf)}")
print(f"ExtraTreesClassifier Accuracy: {accuracy_score(y_test, y_pred_ext_clf)}")
print(f"SVC Accuracy: {accuracy_score(y_test, y_pred_svm_clf)}")

RandomForestClassifier Accuracy: 0.819304152637486
ExtraTreesClassifier Accuracy: 0.7326803387409448
SVC Accuracy: 0.9393939393939394


In [18]:
# create an ensemble with the 3 models
from sklearn.ensemble import VotingClassifier

voting_clf_hard = VotingClassifier(
    estimators=[('rf', rnd_clf), ('et', ext_clf), ('svm', svm_clf)],
    voting='hard'
)

voting_clf_soft = VotingClassifier(
    estimators=[('rf', rnd_clf), ('et', ext_clf), ('svm', svm_clf)],
    voting='soft'
)

voting_clf_hard.fit(X_train, y_train)
voting_clf_soft.fit(X_train, y_train)

y_pred_voting_clf_hard = voting_clf_hard.predict(X_test)
y_pred_voting_clf_soft = voting_clf_soft.predict(X_test)

print(f"Hard VotingClassifier Accuracy: {accuracy_score(y_test, y_pred_voting_clf_hard)}")
print(f"Soft VotingClassifier Accuracy: {accuracy_score(y_test, y_pred_voting_clf_soft)}")

Hard VotingClassifier Accuracy: 0.8359351086623814
Soft VotingClassifier Accuracy: 0.9379655137230895


### Question 9

In [30]:
# training a stacking ensemble
import numpy as np

y_pred_val_rnd_clf = rnd_clf.predict(X_validate) # vector
y_pred_val_ext_clf = ext_clf.predict(X_validate) # vector
y_pred_val_svm_clf = svm_clf.predict(X_validate) # vector

In [39]:
# create the new training set for the blender
# from the predictions of the classifiers
# stack the three prediction vectors as columns: one row per validation instance
X_train_blender = np.column_stack((y_pred_val_rnd_clf, y_pred_val_ext_clf, y_pred_val_svm_clf))

In [40]:
# training with new set
st_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
st_clf.fit(X_train_blender, y_validate)

In [41]:
# get predictions from test set with individual classifiers
y_pred_test_rnd_clf = rnd_clf.predict(X_test) # vector
y_pred_test_ext_clf = ext_clf.predict(X_test) # vector
y_pred_test_svm_clf = svm_clf.predict(X_test) # vector

# create the new test set for the stacking ensemble
# one row per test instance, one column per classifier's prediction
X_test_blender = np.column_stack((y_pred_test_rnd_clf, y_pred_test_ext_clf, y_pred_test_svm_clf))

In [42]:
# making predictions with the stacking ensemble
y_predict_st_clf = st_clf.predict(X_test_blender)
accuracy_score(y_test, y_predict_st_clf)

0.9128660340781553