## Various classifiers

Exercise: Load the MNIST data and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing).

#### Load dependencies

In [148]:
# This will help us to measure the time it took for the whole
# notebook to execute
import time
start_time = time.time()

import os
import importlib
import sys
sys.path.append('../../utils')
import datasets
importlib.reload(datasets)
import helpers
importlib.reload(helpers)

import numpy as np
from sklearn.datasets import fetch_openml               # Fetches data using openML
from sklearn.model_selection import train_test_split    # Used to split datasets into training and test subsets.
from sklearn.preprocessing import LabelEncoder          # Used to convert categorical labels (such as strings) into numeric labels.
from sklearn.metrics import accuracy_score

# RandomForestClassifier:
# Description: An ensemble method that combines multiple decision trees, where each tree is trained on a random subset of features and samples.
# Advantages: High accuracy, handles overfitting well, and works well on a wide range of tasks.
# Best for: Problems where interpretability isn't critical but performance is, such as classification on tabular data.
from sklearn.ensemble import RandomForestClassifier

# ExtraTreesClassifier:
# Description: Similar to RandomForestClassifier but uses even more randomness (random splits) in building each tree.
# Advantages: Often faster than RandomForestClassifier with lower variance in certain cases.
# Best for: Cases where speed is a priority and some additional randomness might help avoid overfitting.
from sklearn.ensemble import ExtraTreesClassifier

# LinearSVC (Support Vector Classifier):
# Description: A linear model that finds the optimal hyperplane to classify data by maximizing the margin between classes.
# Advantages: Fast, works well with high-dimensional data, and good for binary classification tasks.
# Best for: Text classification or other tasks where data is high-dimensional and a linear boundary is effective.
from sklearn.svm import LinearSVC

# MLPClassifier (Multi-layer Perceptron):
# Description: A feedforward neural network with one or more hidden layers that can learn complex, non-linear decision boundaries.
# Advantages: Can model complex patterns, works well with both structured and unstructured data.
# Best for: Problems that require capturing non-linear relationships, such as image or text classification where non-linear patterns are present.
from sklearn.neural_network import MLPClassifier

# VotingClassifier
# An ensemble learning technique used to combine the predictions of multiple individual classifiers to make a final prediction. 
# It aggregates the predictions from several models (also called "base classifiers") and uses a voting mechanism to decide the output.
# There are two main types of voting:
#    Hard Voting:
#        The final prediction is made based on the majority vote of all the classifiers.
#        For each sample, the classifier predicts a class, and the class that gets the most votes from all the classifiers is chosen as the final prediction.
#    Soft Voting:
#        Instead of voting on the predicted class labels, the classifiers predict class probabilities, and the probabilities are averaged.
#        The class with the highest average probability is chosen as the final prediction.
from sklearn.ensemble import VotingClassifier

# StackingClassifier
# A model ensemble technique that combines predictions from multiple base estimators through a “stacked” approach.
# In stacking, the predictions of individual models (level-0 or base estimators) are used as features to train a final estimator (level-1 or meta-learner), 
# which makes the final predictions.
# This approach leverages the strengths of multiple models to improve accuracy and robustness.
from sklearn.ensemble import StackingClassifier

#### Get dataset

Download the **MNIST** dataset.

In [149]:
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist["data"], mnist["target"]
print(f"{mnist.keys()}")

dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])


### Voting Classifier

**First split**

This step separates 10,000 samples for testing and keeps the remaining 60,000 samples for further partitioning into training and validation sets.

Parameters:

- `mnist.data`: The features of the MNIST dataset (70,000 images).
- `mnist.target`: The labels for the MNIST dataset.
- `test_size=10000`: Specifies that 10,000 images will be set aside for the test set.
- `random_state=42`: A seed to make the split reproducible. The same seed will produce the same split every time.

Output:

- `X_train_val, y_train_val`: Contains 60,000 samples intended for the training and validation split.
- `X_test, y_test`: Contains 10,000 samples that will serve as the test set.

In [150]:
X_train_val, X_test, y_train_val, y_test = train_test_split(mnist.data, mnist.target, test_size=10000, random_state=42)

**Second split**

This step splits the 60,000 samples in `X_train_val` into two sets:

- Training set: 50,000 images.
- Validation set: 10,000 images.

Parameters:

- `X_train_val, y_train_val`: The 60,000 images and labels meant for training and validation.
- `test_size=10000`: Reserves 10,000 images for validation, leaving 50,000 for training.
- `random_state=42`: Ensures that this split is reproducible as well.

Output:

- `X_train, y_train`: Contains 50,000 samples for training.
- `X_val, y_val`: Contains 10,000 samples for validation.

In [151]:
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=10000, random_state=42)
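
The two-stage split can be sanity-checked on a small synthetic array before running it on MNIST. This is a minimal sketch: the 700 integer rows and every `*_demo` name are illustrative stand-ins, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 700 synthetic rows stand in for the 70,000 MNIST images (assumption for the demo)
X_demo = np.arange(700).reshape(-1, 1)
y_demo = X_demo.ravel() % 10

# First split: hold out 100 rows for testing
X_tv_demo, X_test_demo, y_tv_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=100, random_state=42
)
# Second split: hold out 100 of the remaining 600 rows for validation
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_tv_demo, y_tv_demo, test_size=100, random_state=42
)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))  # 500 100 100

# The three subsets should be disjoint and together cover every row
seen = set(X_train_demo.ravel()) | set(X_val_demo.ravel()) | set(X_test_demo.ravel())
print(len(seen) == len(X_demo))  # True
```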

#### Train classifiers

**RandomForestClassifier**

- `n_estimators=100`: The number of decision trees in the forest (100 in this case), which helps the model learn from a variety of patterns in the data.
- `random_state=42`: Sets a random seed for reproducibility, so the same model can be retrained with the same results.

In [152]:
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)

**ExtraTreesClassifier**

- `n_estimators=100`: Like the RandomForest, this is the number of trees in the Extra Trees ensemble.
- `random_state=42`: Ensures consistent results across runs by fixing the randomness.

In [153]:
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)

**LinearSVC (Support Vector Classifier)**

- `max_iter=100`: Sets the maximum number of training iterations to 100, which limits how long the algorithm can run for each training session.
- `tol=20`: The tolerance for stopping criteria (larger than default). The training stops when the change in the model is smaller than this threshold.
- `dual=True`: Solves the dual form of the optimization problem. The value is set explicitly because the default changes from `True` to `"auto"` in Scikit-Learn 1.5.
- `random_state=42`: Fixes the randomness for consistent results.

In [154]:
svm_clf = LinearSVC(max_iter=100, tol=20, dual=True, random_state=42)

**MLPClassifier (Multi-layer Perceptron)**

- `random_state=42`: Again, this is used for reproducibility in the MLP classifier's random initialization of weights.

In [155]:
mlp_clf = MLPClassifier(random_state=42)

**Model training**

Train the models one by one using the same dataset.



In [156]:
# Each classifier has a fit() method
estimators = [random_forest_clf, extra_trees_clf, svm_clf, mlp_clf]
for estimator in estimators:
    print("Training the", estimator)
    estimator.fit(X_train, y_train)

Training the RandomForestClassifier(random_state=42)
Training the ExtraTreesClassifier(random_state=42)
Training the LinearSVC(dual=True, max_iter=100, random_state=42, tol=20)
Training the MLPClassifier(random_state=42)


In [157]:
# Score for each classifier against the validation dataset
[estimator.score(X_val, y_val) for estimator in estimators]

[0.9692, 0.9709, 0.859, 0.965]

#### Voting classifier

Exercise: Next, try to combine them into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier.

In [158]:
# Previous estimators
named_estimators = [
    ("random_forest_clf", random_forest_clf),
    ("extra_trees_clf", extra_trees_clf),
    ("svm_clf", svm_clf),
    ("mlp_clf", mlp_clf),
]

# Setting up the voting classifier
voting_clf = VotingClassifier(named_estimators)

In [159]:
# Train the voting classifier
voting_clf.fit(X_train, y_train)

In [160]:
# Score
voting_clf.score(X_val, y_val)

0.9718

In [161]:
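# VotingClassifier label-encodes the targets internally, so its fitted
# sub-estimators predict integer class indices. For MNIST's string digits
# '0'-'9' those indices equal the digits, so a cast to int64 is sufficient.
y_valid_encoded = y_val.astype(np.int64)

[estimator.score(X_val, y_valid_encoded) for estimator in voting_clf.estimators_]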

[0.9692, 0.9709, 0.859, 0.965]
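
The integer cast works because `VotingClassifier` fits its sub-estimators on label-encoded targets, and for string digits the encoded indices coincide with the digits themselves. The sketch below illustrates this on a handful of made-up labels (`y_demo` is hypothetical). The equality only holds because all ten digit classes are present and sort in numeric order.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up string labels that contain every digit class at least once
y_demo = np.array([str(d) for d in [5, 0, 4, 1, 9, 2, 1, 3, 1, 4, 3, 5, 3, 6, 1, 7, 2, 8, 6, 9]])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_demo)  # index of each label in the sorted class list

print(encoder.classes_)
print(np.array_equal(y_encoded, y_demo.astype(np.int64)))  # True
```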

In [162]:
# Let's remove the SVM to see if performance improves. 
# It is possible to remove an estimator by setting it to "drop" using set_params().
voting_clf.set_params(svm_clf="drop")
voting_clf.estimators

# set_params() does not update the already-fitted estimators, so we can either
# refit the VotingClassifier or remove the trained SVM from both estimators_ and named_estimators_:
svm_clf_trained = voting_clf.named_estimators_.pop("svm_clf")
voting_clf.estimators_.remove(svm_clf_trained)

In [163]:
# Evaluate score again
voting_clf.score(X_val, y_val)

0.974

The SVM classifier was hurting performance.

#### Soft voting classifier

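There is no need to retrain the classifiers; just set `voting` to `"soft"`. This works here because the SVM has already been dropped: `LinearSVC` has no `predict_proba()` method, which soft voting requires.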

In [164]:
voting_clf.voting = "soft"
voting_clf.score(X_val, y_val)

0.9719

In [165]:
voting_clf.voting = "hard"
voting_clf.score(X_test, y_test)

0.9716

In [166]:
# On the test set, the voting classifier lowers the error rate of the best individual model from about 3.0% to about 2.8%, i.e. roughly 5% fewer errors.
[estimator.score(X_test, y_test.astype(np.int64)) for estimator in voting_clf.estimators_]

[0.965, 0.97, 0.9627]

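Hard voting wins in this case: on the validation set it scored 0.974, against 0.9719 for soft voting, which is why the test-set evaluation above uses `voting = "hard"`.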
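
Hard and soft voting can disagree on the same sample. The sketch below uses made-up probabilities for three hypothetical classifiers (not outputs of the models trained above) to show the two aggregation rules side by side.

```python
import numpy as np

# Made-up class probabilities for one sample.
# Rows: three hypothetical classifiers; columns: classes 0, 1, 2.
probas = np.array([
    [0.45, 0.55, 0.00],
    [0.40, 0.60, 0.00],
    [0.95, 0.05, 0.00],
])

# Hard voting: each classifier votes for its most likely class, majority wins
hard_votes = probas.argmax(axis=1)                           # votes: 1, 1, 0
hard_winner = np.bincount(hard_votes, minlength=3).argmax()

# Soft voting: average the probabilities, then pick the most likely class
mean_probas = probas.mean(axis=0)                            # 0.6, 0.4, 0.0
soft_winner = mean_probas.argmax()

print("hard:", hard_winner)  # hard: 1
print("soft:", soft_winner)  # soft: 0
```

Two lukewarm votes for class 1 win under hard voting, while one very confident vote for class 0 tips the average under soft voting.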

### Stacking Ensemble

Exercise: Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image's class. Train a classifier on this new training set.

In [167]:
# Creates an array, X_valid_predictions, to store the predictions from multiple estimators on the validation dataset X_val
X_valid_predictions = np.empty((len(X_val), len(estimators)), dtype=object)

for index, estimator in enumerate(estimators):
    X_valid_predictions[:, index] = estimator.predict(X_val)

X_valid_predictions

array([['5', '5', '5', '5'],
       ['8', '8', '8', '8'],
       ['2', '2', '3', '2'],
       ...,
       ['7', '7', '7', '7'],
       ['6', '6', '6', '6'],
       ['7', '7', '7', '7']], dtype=object)

A random forest classifier is created and trained to act as a "blender" model.

Instead of being trained on the original data, it is trained on the validation predictions (`X_valid_predictions`) from the other estimators, with `y_val` as the true target values.

The goal is to learn patterns in the combined predictions and make final predictions based on them.

In [168]:
rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rnd_forest_blender.fit(X_valid_predictions, y_val)

**What is OOB Score?**

The Out-of-Bag (OOB) score is an internal validation method for random forests that leverages bootstrapping:

- Each tree in a random forest is trained on a random subset (bootstrap sample) of the training data.
- On average, about 37% of the data points are "out-of-bag" (not selected for training) for each tree.
- The OOB score estimates the model's accuracy by predicting each training sample using only the trees that did not see it during training. This approximates the model’s generalization ability without needing a separate validation set.
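
The 37% figure follows from sampling with replacement: the chance that a given row is never drawn in `n` draws is `(1 - 1/n)**n`, which approaches `1/e ≈ 0.368`. The simulation below is a minimal sketch of that argument with arbitrary sizes; it does not use the forest trained above.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_trees = 10_000, 50  # arbitrary sizes chosen for the demo

oob_fractions = []
for _ in range(n_trees):
    # Bootstrap sample: draw n_samples row indices with replacement
    in_bag = rng.integers(0, n_samples, size=n_samples)
    # Rows that were never drawn are out-of-bag for this tree
    oob_fractions.append(1 - np.unique(in_bag).size / n_samples)

mean_oob = float(np.mean(oob_fractions))
print(f"simulated OOB fraction: {mean_oob:.3f}")
print(f"theoretical limit 1/e:  {np.exp(-1):.3f}")
```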

In [169]:
rnd_forest_blender.oob_score_

0.9684

For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble's predictions.

In [170]:
X_test_predictions = np.empty((len(X_test), len(estimators)), dtype=object)

for index, estimator in enumerate(estimators):
    X_test_predictions[:, index] = estimator.predict(X_test)

y_pred = rnd_forest_blender.predict(X_test_predictions)

accuracy_score(y_test, y_pred)

0.9683

This stacking ensemble does not perform as well as the voting classifier we trained earlier.

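Since `StackingClassifier` uses K-Fold cross-validation, we don't need a separate validation set, so let's join the training set and the validation set into a bigger training set. `X_train_val` from the first split already holds exactly these 60,000 images:

In [171]:
# Do not slice mnist["data"][:60_000] here: X_test was drawn at random from all
# 70,000 images, so the first 60,000 rows would include most of the test set.
X_train_full, y_train_full = X_train_val, y_train_val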
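
Because `X_test` came from a shuffled `train_test_split`, a positional slice of the original arrays is not disjoint from it. The sketch below illustrates this on 700 synthetic IDs standing in for the 70,000 images. All names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

ids = np.arange(700)  # one ID per synthetic "image"
ids_train_val, ids_test = train_test_split(ids, test_size=100, random_state=42)

# How many test IDs also appear in each candidate "full training set"?
leaked_by_slice = np.intersect1d(ids[:600], ids_test).size
leaked_by_split = np.intersect1d(ids_train_val, ids_test).size

print("test rows inside ids[:600]:     ", leaked_by_slice)
print("test rows inside ids_train_val: ", leaked_by_split)  # 0
```

With a random test set, roughly six in seven test rows are expected to fall inside the first 600 positions, so any score computed after training on that slice is inflated by leakage.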

Now let's create and train the stacking classifier on the full training set:

> **Warning**: the following cell will take quite a while to run (15-30 minutes depending on your hardware), as it uses K-Fold validation with 5 folds by default. 
> It will train the 4 classifiers 5 times each on 80% of the full training set to make the predictions, plus one last time each on the full training set, and lastly it will train the final model on the predictions.
> That's a total of 25 models to train!
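
The count of 25 can be checked with a few lines of arithmetic. The variable names are illustrative.

```python
n_base_estimators = 4  # random forest, extra trees, SVM, MLP
n_folds = 5            # StackingClassifier's default number of CV folds

cv_fits = n_base_estimators * n_folds  # fits used to produce out-of-fold predictions
full_fits = n_base_estimators          # one refit per base estimator on the full training set
final_fits = 1                         # the final estimator (blender)

total_fits = cv_fits + full_fits + final_fits
print(total_fits)  # 25
```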

In [172]:
stack_clf = StackingClassifier(named_estimators, final_estimator=rnd_forest_blender)
stack_clf.fit(X_train_full, y_train_full)

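The `StackingClassifier` scores much higher than the custom stacking implementation we tried earlier, but the 0.9967 recorded below overstates the gain. It comes from a run in which `X_train_full` was the first 60,000 rows of `mnist["data"]`, and because `X_test` was drawn at random from all 70,000 images, most test images were also in that training slice.

Two things do work in the `StackingClassifier`'s favor:

- Since we could reclaim the validation set, the `StackingClassifier` was trained on a larger dataset.
- It used `predict_proba()` if available, or else `decision_function()` if available, or else `predict()`. This gave the blender much more nuanced inputs to work with.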
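
To see which method each base estimator contributes, one can inspect the fitted `stack_method_` attribute. The sketch below uses scikit-learn's small bundled digits dataset and reduced model sizes so it runs quickly. The estimator names (`"rf"`, `"svm"`) and hyperparameters are choices made for this demo, not the notebook's configuration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Small 8x8 digits dataset bundled with scikit-learn (10 classes)
X_digits, y_digits = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_digits, y_digits, test_size=0.25, random_state=42
)

stack_demo = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=20, random_state=42)),
        ("svm", LinearSVC(dual=True, max_iter=1000, random_state=42)),
    ],
    final_estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    cv=3,
)
stack_demo.fit(X_tr, y_tr)

# Method chosen for each base estimator, in the order they were passed in
print(stack_demo.stack_method_)

# Features the final estimator sees for one sample: per-class scores
# from each base estimator, concatenated side by side
blender_inputs = stack_demo.transform(X_te[:1])
print(blender_inputs.shape)
```

The random forest exposes `predict_proba()`, while `LinearSVC` only offers `decision_function()`, so the blender receives continuous per-class scores from both rather than a single predicted label each.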

In [173]:
stack_clf.score(X_test, y_test)

0.9967

---

## Total Time

This shows the total execution time of the notebook.

In [174]:
# Compute the total execution time
end_time = time.time()
helpers.calculate_execution_time(start_time, end_time)

Total execution time: 7.0 minutes and 48.61 seconds
