# HOML Chapter 7 - Ensemble Learning and Random Forests

## Exercise 8

Load the MNIST data (introduced in Chapter 3), and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing).

Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM.

Next, try to combine them into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier.

Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Let's start by loading the MNIST dataset.

In [2]:
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1)
mnist.target = mnist.target.astype(np.uint8)

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
# Split the original dataset into a training/validation set and a test set. Use 10,000 instances for the test set, as suggested by the author.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    mnist.data, mnist.target, test_size=10000, random_state=999)

# Then split the training/validation set into separate training and validation sets. Use 10,000 instances for the validation set, as suggested by the author.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=10000, random_state=999)
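As a quick sanity check on the two-stage split, the sketch below repeats the same two calls on a stand-in array of 70,000 indices (the size of MNIST) rather than the real images. The names `indices`, `train_idx`, `val_idx`, and `test_idx` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for MNIST: 70,000 sample indices instead of the real images.
indices = np.arange(70_000)

# Same two-stage split as above: hold out 10,000 for testing,
# then hold out another 10,000 of the remainder for validation.
train_val_idx, test_idx = train_test_split(
    indices, test_size=10_000, random_state=999)
train_idx, val_idx = train_test_split(
    train_val_idx, test_size=10_000, random_state=999)

# Sizes of the training, validation, and test subsets.
print(len(train_idx), len(val_idx), len(test_idx))

# The three subsets should be disjoint and together cover every sample.
all_idx = np.concatenate([train_idx, val_idx, test_idx])
print(len(np.unique(all_idx)) == len(indices))
```

An integer `test_size` is interpreted as an absolute number of samples, which is why the two calls leave 50,000 instances for training.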

Now that we've split the data, let's train several classifiers on the training set. The author suggests a random forest, extra trees, and an SVM. We'll also add AdaBoost since it's covered in this chapter. We'll keep its default base estimator, a DecisionTreeClassifier with max_depth=1, to add more variety to our classifiers.

In [5]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.svm import SVC

In [6]:
rf = RandomForestClassifier(n_estimators=100, random_state=999)
et = ExtraTreesClassifier(n_estimators=100, random_state=999)
ada = AdaBoostClassifier(n_estimators=100, random_state=999)
# probability=True enables predict_proba, which soft voting needs later on
svm = SVC(random_state=999, probability=True)

In [7]:
# Print the selected and default hyperparameters for each classifier and fit them to the training data. Note that this process will take several minutes to complete. 
classifiers = [rf, et, ada, svm]
for clf in classifiers:
    print(clf)
    clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=999,
                       verbose=0, warm_start=False)
ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_scor

In [8]:
# List of scores on the validation set for each classifier
[clf.score(X_val, y_val) for clf in classifiers]

[0.9675, 0.9717, 0.7154, 0.9781]

The AdaBoost classifier performed far worse than the other classifiers (0.7154 versus at least 0.9675). We kept its default base estimator, a DecisionTreeClassifier with max_depth=1 (a decision stump), which may be why it performed as it did. We'll drop it and use VotingClassifier to build an ensemble from the remaining three classifiers. We'll start with hard voting.
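As an aside on the AdaBoost result, the effect of the base estimator can be explored with a minimal sketch like the one below. It uses scikit-learn's small built-in digits dataset (`load_digits`) as a fast stand-in for MNIST, and the depth of 3 for the alternative base estimator is an arbitrary choice for illustration, so the scores it prints will not match the ones above. The base estimator is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small 8x8 digits dataset, used here only as a quick stand-in for MNIST.
X, y = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=999)

# Default base estimator: a decision tree with max_depth=1 (a stump).
ada_stump = AdaBoostClassifier(n_estimators=50, random_state=999)

# Assumed alternative for comparison: slightly deeper trees.
ada_deep = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=3), n_estimators=50, random_state=999)

# Fit both variants and record their validation accuracy.
scores = {}
for name, clf in [("stumps", ada_stump), ("depth-3 trees", ada_deep)]:
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_va, y_va)
    print(name, scores[name])
```

Comparing the two printed scores gives a rough idea of how much the weak default learner limits AdaBoost on a ten-class problem. It does not replace rerunning the experiment on MNIST itself.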

In [9]:
from sklearn.ensemble import VotingClassifier

In [10]:
# List of classifiers to be used for VotingClassifier
estimators = [('random forest', rf), ('extra trees', et), ('svm', svm)]

In [11]:
# voting='hard' is the default
voting = VotingClassifier(estimators)

In [12]:
# Fit the voting classifier on the training set. As with the fitting step above, this will take several minutes to complete.
voting.fit(X_train, y_train)

VotingClassifier(estimators=[('random forest',
                              RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                                     criterion='gini',
                                                     max_depth=None,
                                                     max_features='auto',
                                                     max_leaf_nodes=None,
                                                     max_samples=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_leaf=0.

In [13]:
# Validation score for hard voting 
voting.score(X_val, y_val)

0.9733

The hard voting ensemble did fairly well, but it didn't outperform the SVM classifier (0.9733 vs. 0.9781). Let's try soft voting and see if it can outperform all of the individual classifiers.
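To see why the two schemes can disagree, here is a minimal pure-Python sketch of both voting rules applied to a single sample. The probability values are made up for illustration and do not come from the MNIST models. `hard_vote` and `soft_vote` are hypothetical helpers, not scikit-learn functions.

```python
from collections import Counter

# Hypothetical class probabilities from three classifiers for one sample
# (three classes); made up for illustration.
probas = [
    [0.51, 0.49, 0.00],  # classifier A narrowly prefers class 0
    [0.52, 0.48, 0.00],  # classifier B narrowly prefers class 0
    [0.05, 0.95, 0.00],  # classifier C strongly prefers class 1
]

def hard_vote(probas):
    """Each classifier votes for its most likely class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas):
    """Average the class probabilities, then pick the most likely class."""
    n_classes = len(probas[0])
    mean = [sum(p[k] for p in probas) / len(probas) for k in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)

print(hard_vote(probas))  # 0: two of the three classifiers vote for class 0
print(soft_vote(probas))  # 1: class 1 has the highest average probability
```

Soft voting gives more weight to confident predictions, which is one reason it can outperform hard voting when the classifiers' probability estimates are reasonably well calibrated.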

In [14]:
# Change to soft voting without refitting; this works because the SVC was trained with probability=True
voting.voting = "soft"

In [15]:
# Validation score for soft voting
voting.score(X_val, y_val)

0.9787

The soft voting ensemble narrowly beat the SVM classifier (0.9787 vs. 0.9781), so let's now run it on the test set and see if it can outperform each of the individual classifiers there, too.

In [16]:
# Test score for soft voting
voting.score(X_test, y_test)

0.9783

In [17]:
# List of test scores for each classifier in soft voting ensemble
[clf.score(X_test, y_test) for clf in voting.estimators_]

[0.967, 0.9716, 0.9776]

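Again, the soft voting ensemble outperforms all of its individual classifiers, though only slightly: it scores 0.9783 versus 0.9776 for the SVM, the best individual classifier. That is a gain of 0.07 percentage points, or 7 more correct predictions out of 10,000.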
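To answer the exercise's question for every classifier, the test-set accuracies reported above can be turned into counts of additional correct predictions and relative error reductions. This is a small arithmetic sketch using only those reported numbers; the variable names are illustrative.

```python
# Test-set accuracies reported above (10,000 test instances).
n_test = 10_000
ensemble = 0.9783
individual = {"random forest": 0.967, "extra trees": 0.9716, "svm": 0.9776}

gains = {}
for name, acc in individual.items():
    # Additional test instances the ensemble classifies correctly.
    extra_correct = round((ensemble - acc) * n_test)
    # Fraction of the individual classifier's errors that the ensemble removes.
    rel_error_cut = (ensemble - acc) / (1 - acc)
    gains[name] = extra_correct
    print(f"{name}: +{extra_correct} correct, "
          f"error rate reduced by {rel_error_cut:.1%}")
```

The gain over the tree-based models is substantial, while the gain over the SVM is small enough that it could plausibly change with a different random split.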