In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Exercise 8

Load the MNIST dataset (introduced in Chapter 3), and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing). Then train various classifiers, such as a random forest classifier, an extra-trees classifier, and an SVM classifier. Next, try to combine them into an ensemble that outperforms each individual classifier on the validation set, using soft or hard voting. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

In [2]:
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', as_frame=False)

In [3]:
# Split the data into a training set, a validation set and a test set

X, y = mnist.data, mnist.target
X_train, X_valid, X_test = X[:50000], X[50000:60000], X[60000:]
y_train, y_valid, y_test = y[:50000], y[50000:60000], y[60000:]

In [32]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier

voting_clf = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(random_state=42)),
        ('xt', ExtraTreesClassifier(random_state=42)),
    ],
    n_jobs=-1
)
voting_clf.fit(X_train, y_train)

In [18]:
for name, clf in voting_clf.named_estimators_.items():
    print(name, "=", clf.score(X_valid, y_valid))

rf = 0.0
xt = 0.0


Both scores are 0.0? From the solutions notebook, I learned that `VotingClassifier` clones each classifier and trains the clones using _class indices_ (the integers 0 to 9) as the labels, not the original class names (the strings `'0'` to `'9'`). The clones therefore predict integers, which never match the string labels in `y_valid`. To evaluate the clones, we need to provide class indices as well. To convert the classes to class indices, we can use a `LabelEncoder` or, since they are just digits, cast them to integers with NumPy:

In [19]:
voting_clf.classes_

array(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype=object)

In [20]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y_valid_encoded = encoder.fit_transform(y_valid)

In [21]:
# Out of curiosity, would OrdinalEncoder work too? It turns out it does,
# but it expects a 2D input and returns floats by default:

from sklearn.preprocessing import OrdinalEncoder

ord_encoder = OrdinalEncoder()
y_valid_encoded = ord_encoder.fit_transform(y_valid.reshape(-1, 1))
y_valid_encoded = y_valid_encoded.ravel()

In [22]:
y_valid_encoded = y_valid.astype(np.int64)
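
As a quick sanity check on why the plain integer cast is interchangeable with a label encoder here, the sketch below compares the two on a small made-up label array (`y_sample` is hypothetical, not taken from MNIST). It uses `np.unique(..., return_inverse=True)` as a stand-in for what a label encoder does: sort the unique classes and replace each label with its position in that sorted list.

```python
import numpy as np

# Hypothetical labels stored as digit strings, similar to mnist.target.
# All ten digits appear at least once.
y_sample = np.array(list("30914287563"), dtype=object)

# Option 1: cast the digit strings directly to integers
y_as_int = y_sample.astype(np.int64)

# Option 2: replace each label with its index in the sorted unique classes
classes, y_as_index = np.unique(y_sample, return_inverse=True)

# The two agree because the sorted classes are exactly '0'..'9'
print(np.array_equal(y_as_int, y_as_index))
```

The two encodings only coincide because every digit from 0 to 9 is present. If a digit were missing from the labels, the class indices would shift and the direct cast would no longer match them.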

In [33]:
for name, clf in voting_clf.named_estimators_.items():
    print(name, "=", clf.score(X_valid, y_valid_encoded))

rf = 0.9718
xt = 0.9757


In [34]:
voting_clf.score(X_valid, y_valid)

0.9739

What about soft voting?

In [35]:
voting_clf.voting = "soft"
voting_clf.score(X_valid, y_valid)

0.9749
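
To make the difference between the two voting modes concrete, here is a minimal NumPy sketch of the idea, not scikit-learn's actual implementation. The probabilities are made up, and the example uses three classifiers so that hard voting has a clear majority.

```python
import numpy as np

# Hypothetical class probabilities for a single instance
# (rows: three classifiers, columns: classes 0, 1, 2)
probas = np.array([
    [0.45, 0.55, 0.00],
    [0.40, 0.60, 0.00],
    [0.90, 0.10, 0.00],
])

# Hard voting: each classifier votes for its most likely class,
# and the class with the most votes wins
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities across classifiers,
# then pick the class with the highest average
soft_prediction = probas.mean(axis=0).argmax()

print("hard:", hard_prediction, "soft:", soft_prediction)
```

In this sketch, two weakly confident classifiers outvote one very confident classifier under hard voting, while soft voting lets the confident one carry more weight. Also note that with only two classifiers, as in this notebook, any disagreement is a 1-1 tie under hard voting, and in this sketch `argmax` resolves such a tie in favor of the lower class index.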

What about the test set? (Note that `voting_clf` is still set to soft voting at this point.)

In [36]:
y_test_encoded = y_test.astype(np.int64)
for name, clf in voting_clf.named_estimators_.items():
    print(name, "=", clf.score(X_test, y_test_encoded))
print("voting_clf score:", voting_clf.score(X_test, y_test))

rf = 0.9687
xt = 0.9713
voting_clf score: 0.9713


### Exercise 9

Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image’s class. Train a classifier on this new training set. Congratulations—you have just trained a blender, and together with the classifiers it forms a stacking ensemble! Now evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble’s predictions. How does it compare to the voting classifier you trained earlier? Now try again using a StackingClassifier instead. Do you get better performance? If so, why?

In [37]:
rf = voting_clf.named_estimators_['rf']
xt = voting_clf.named_estimators_['xt']
y_pred_rf = rf.predict(X_valid)
y_pred_xt = xt.predict(X_valid)

In [39]:
X_train_new = np.column_stack([y_pred_rf, y_pred_xt])

In [40]:
new_clf = RandomForestClassifier(random_state=43)
new_clf.fit(X_train_new, y_valid)


In [41]:
y_pred_rf_test = rf.predict(X_test)
y_pred_xt_test = xt.predict(X_test)
X_test_new = np.column_stack([y_pred_rf_test, y_pred_xt_test])
new_clf.score(X_test_new, y_test)

0.97

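The stacking ensemble performs a little worse than the `VotingClassifier` on the test set (0.97 vs. 0.9713). Keep in mind that the blender only sees two hard predictions per image and was trained on just the 10,000 validation instances.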

In [42]:
from sklearn.ensemble import StackingClassifier

stacking_clf = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(random_state=42)),
        ('xt', ExtraTreesClassifier(random_state=42))
    ],
    final_estimator=RandomForestClassifier(random_state=43),
    cv=5
)
stacking_clf.fit(X_train, y_train)

In [43]:
stacking_clf.score(X_test, y_test)

0.9749

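The `StackingClassifier` performed a little better than our custom ensemble (0.9749 vs. 0.97). The 5 cross-validation folds are probably part of the reason: the blender is trained on out-of-fold predictions for all 50,000 training instances rather than on just the 10,000 validation instances. In addition, `StackingClassifier` feeds the blender class probabilities when the base classifiers support them (its default `stack_method="auto"` tries `predict_proba` first), rather than the hard predictions we used.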
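
To see what the cross-validation folds contribute, the following sketch reproduces the general idea by hand with `cross_val_predict`. It is an approximation of the approach, not the exact internals of `StackingClassifier`, and it uses the small `load_digits` dataset with fewer trees so that it runs in seconds. The names `X_small`, `base_clfs`, `blend_train`, and `blender` are made up for this illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict, train_test_split

# Small stand-in dataset (8x8 digits), not MNIST
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_small, y_small, random_state=42)

base_clfs = [
    RandomForestClassifier(n_estimators=50, random_state=42),
    ExtraTreesClassifier(n_estimators=50, random_state=42),
]

# Out-of-fold class probabilities: every training instance is predicted
# by a model that never saw it during training
blend_train = np.hstack([
    cross_val_predict(clf, X_tr, y_tr, cv=5, method="predict_proba")
    for clf in base_clfs
])

# The blender is trained on the out-of-fold probabilities
blender = RandomForestClassifier(n_estimators=50, random_state=43)
blender.fit(blend_train, y_tr)

# Refit the base classifiers on the full training set before predicting
for clf in base_clfs:
    clf.fit(X_tr, y_tr)

blend_test = np.hstack([clf.predict_proba(X_te) for clf in base_clfs])
manual_score = blender.score(blend_test, y_te)
print("manual stacking score:", manual_score)
```

Compared with the hold-out blender from earlier in this exercise, no data has to be set aside for the blender: the whole training set is used both for the base classifiers and, through the out-of-fold predictions, for the blender.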