# EE 467 Lab 6: Ensemble Learning and Random Forest

Welcome to lab 6 of EE 467! Today we are going to learn and try out **ensemble learning** algorithms, which combine multiple learning algorithms to achieve better performance than any one of them alone. We will apply four common kinds of ensembles to the Kaggle credit card fraud detection problem: **voting, bagging, boosting and stacking**. We will also try **random forest** learning, which is a special kind of bagging ensemble that consists of decision trees. As in the previous lab, all algorithms are evaluated by **accuracy, precision, recall and F1-score**.

## Pre-processing / Feature Extraction

Let's start from the end of lab 5. First of all, we will load the credit card transaction dataset and re-do all the feature scaling and dataset splitting steps in the last lab:

In [1]:
!tar -xf credit-card.tar.xz #<--- To Unzip data

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler

## [ Loading Dataset ]
print("Loading Kaggle credit card transactions dataset...")
df = pd.read_csv("./creditcard.csv")

## [ Feature Scaling ]
print("Scaling transaction time and amount features...")
# Convert transaction time from seconds to hours in a day
df["Time"] = (df["Time"]/(60*60))%24
# Scale time features with StandardScaler (suitable for normally distributed data)
df["Time"] = StandardScaler().fit_transform(df["Time"].values[:, None])
# Scale amount with RobustScaler (robust to outliers, useful for skewed distributions)
df["Amount"] = RobustScaler().fit_transform(df["Amount"].values[:, None])

## [ Feature-label / train-test splits ]
print("Performing feature-label / train-test splits...")

# Get feature and label values from original dataset
feat_all = df.drop(["Class"], axis=1).values
y_all = df["Class"].values

# Split samples into training and test sets
feat_train, feat_test, y_train, y_test = train_test_split(
    feat_all, y_all, test_size=0.4, random_state=0
)

print("Completed.")

Loading Kaggle credit card transactions dataset...
Scaling transaction time and amount features...
Performing feature-label / train-test splits...
Completed.


During this lab we will use two utility functions from `lab_6_util`. The `timeit` function, which you should be fairly familiar with, times the Python operations that run within the corresponding `with` block. The `evaluate_model` function evaluates a trained classification model on the test set and then prints the above-mentioned metrics we are interested in.
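The source of `lab_6_util` is not reproduced in this handout. As a rough idea of how a timing context manager of this kind can be written, here is a minimal sketch. The name `timeit_sketch` is hypothetical, the message format is only modeled on the output printed in this lab, and the real helper may differ.

```python
import time
from contextlib import contextmanager


@contextmanager
def timeit_sketch(action):
    """Hypothetical stand-in for `lab_6_util.timeit`; the real helper may differ."""
    print(f"{action} started...")
    start = time.perf_counter()
    yield  # the body of the `with` block runs here
    elapsed = time.perf_counter() - start
    print(f"{action} completed. Elapsed time: {elapsed:.2f}s")


# Usage: time an arbitrary operation inside the `with` block
with timeit_sketch("Summing one million integers"):
    total = sum(range(1_000_000))
```

`time.perf_counter()` is used here because it is a monotonic, high-resolution clock intended for measuring elapsed time.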

In [3]:
from sklearn.linear_model import LogisticRegression

from lab_6_util import timeit, evaluate_model

# Time the training of logistic regression classifier
with timeit("Training logistic regression classifier"):
    logistic_model = LogisticRegression(max_iter=200).fit(feat_train, y_train)

# Evaluate trained model and print metrics
evaluate_model(logistic_model, "logistic regression classifier", feat_test, y_test)

Training logistic regression classifier started...
Training logistic regression classifier completed. Elapsed time: 1.93s

[ Evaluation result for logistic regression classifier ]
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    113724
           1       0.89      0.59      0.71       199

    accuracy                           1.00    113923
   macro avg       0.95      0.80      0.86    113923
weighted avg       1.00      1.00      1.00    113923

Confusion matrix:
[[113710     14]
 [    81    118]] 
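The metrics in the classification report can be recomputed by hand from the confusion matrix above. The following is a small sketch, assuming the scikit-learn layout in which rows are true labels and columns are predicted labels (consistent with the `support` column: 113710 + 14 = 113724 and 81 + 118 = 199). The variable names are illustrative only.

```python
# Confusion matrix printed above for the logistic regression classifier.
# Assumed layout: rows = true class, columns = predicted class.
tn, fp = 113710, 14   # true class 0 (legitimate transactions)
fn, tp = 81, 118      # true class 1 (fraudulent transactions)

precision = tp / (tp + fp)   # of all predicted frauds, how many are real
recall = tp / (tp + fn)      # of all real frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
# → precision=0.89 recall=0.59 f1=0.71 accuracy=0.9992
```

Accuracy is close to 1.00 even though 81 of the 199 frauds are missed, because fraudulent transactions make up only a tiny fraction of the test set. This is why precision, recall and F1-score of class 1 are the more informative numbers in this lab.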



## Voting

The simplest kind of ensemble is a **voting ensemble**. Like a group of people making decisions through a majority vote, a voting ensemble contains multiple classification models, usually built with different algorithms. During training, each model learns independently of the others. To make a prediction using the ensemble, each model "votes" by providing its own prediction computed from the sample features. The final predicted label of the ensemble is then the class with the most "votes" from the different models. This mechanism is known as **hard voting**.

For classification models that output a probability distribution over all classes, there is an alternative voting mechanism called **soft voting**. In soft voting, we average the probability that each classifier assigns to a particular class, and refer to the result as the ensemble probability of that class. The ensemble prediction is then the class with the highest ensemble probability. Soft voting largely avoids the **tie-breaking problem** of hard voting, in which two or more classes receive the same highest number of votes.
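The difference between the two mechanisms can be seen on a single sample. The sketch below uses made-up fraud probabilities from three hypothetical classifiers; it only illustrates the voting rules and is not how `VotingClassifier` is implemented internally.

```python
from collections import Counter

# Hypothetical class-1 ("fraud") probabilities from three classifiers for one sample
probs_fraud = [0.45, 0.40, 0.95]

# Hard voting: each classifier votes for its own most likely class
votes = [1 if p >= 0.5 else 0 for p in probs_fraud]
hard_prediction = Counter(votes).most_common(1)[0][0]

# Soft voting: average the class probabilities, then pick the most likely class
avg_fraud = sum(probs_fraud) / len(probs_fraud)
soft_prediction = 1 if avg_fraud >= 0.5 else 0

print("votes:", votes, "-> hard prediction:", hard_prediction)
print(f"mean P(fraud) = {avg_fraud:.2f} -> soft prediction: {soft_prediction}")
```

Here two classifiers narrowly vote "not fraud", so hard voting predicts class 0, while the third classifier's high confidence pulls the averaged probability to 0.60 and soft voting predicts class 1.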


In [11]:
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

## [ TODO ]
# 1) Create and evaluate a voting ensemble (`VotingClassifier`) with the following sub-classifiers:
#    - A logistic regression classifier
#      (Hint: increase maximum iteration to 200 to avoid non-convergence warning)
#    - A Gaussian naive Bayes (`GaussianNB`) classifier
#    - A decision tree classifier
# 2) Evaluate the performance of each sub-classifier on the test set
#    (Hint: obtain the sub-classifiers through `voting_ensemble.named_estimators_`)
# 3) Change the voting mechanism to soft voting. Does the performance of the ensemble improve?

logistic_model_voting = LogisticRegression(max_iter=200, random_state=0)
gaussian_nb_model = GaussianNB()
decision_tree_model = DecisionTreeClassifier(random_state=0)


estimators = [
    ('logistic_regression', logistic_model_voting),
    ('gaussian_nb', gaussian_nb_model),
    ('decision_tree', decision_tree_model)
]


with timeit("Training hard voting ensemble"):
    voting_ensemble_hard = VotingClassifier(estimators=estimators, voting='hard')
    voting_ensemble_hard.fit(feat_train, y_train)


evaluate_model(voting_ensemble_hard, "hard voting ensemble", feat_test, y_test)

print("\n--- Evaluating individual sub-classifiers ---")
for name, model in voting_ensemble_hard.named_estimators_.items():
    evaluate_model(model, f"individual {name} classifier", feat_test, y_test)
print("-------------------------------------------")


with timeit("Training soft voting ensemble"):
    voting_ensemble_soft = VotingClassifier(estimators=estimators, voting='soft', flatten_transform=True)
    voting_ensemble_soft.fit(feat_train, y_train)


# Evaluate the hard and soft voting ensembles side by side for comparison
evaluate_model(voting_ensemble_hard, "hard voting ensemble", feat_test, y_test)
evaluate_model(voting_ensemble_soft, "soft voting ensemble", feat_test, y_test)

Training hard voting ensemble started...
Training hard voting ensemble completed. Elapsed time: 27.59s

[ Evaluation result for hard voting ensemble ]
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    113724
           1       0.83      0.76      0.79       199

    accuracy                           1.00    113923
   macro avg       0.91      0.88      0.90    113923
weighted avg       1.00      1.00      1.00    113923

Confusion matrix:
[[113693     31]
 [    48    151]] 


--- Evaluating individual sub-classifiers ---
[ Evaluation result for individual logistic_regression classifier ]
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    113724
           1       0.89      0.59      0.71       199

    accuracy                           1.00    113923
   macro avg       0.95      0.80      0.86    113923
weighted avg       1.00      1.00 

## Bagging / Random Forest

A **bagging ensemble** is a kind of ensemble built upon voting. Unlike a regular voting ensemble, a bagging ensemble contains several classifiers of the **same type** (the same algorithm with the same hyper-parameter settings), each of which is trained on **a random subset of samples (and / or features)**. Bagging reduces over-fitting of the original classification model by introducing randomization into its construction and then combining the randomized copies into an ensemble. It works best with strong and complex machine learning models such as neural networks and deep decision trees.
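The random sample subsets can be pictured as follows. This is a minimal sketch with a hypothetical training-set size, assuming sampling with replacement (bootstrapping), which is the default behavior of scikit-learn's `BaggingClassifier`. The names mirror its `n_estimators` and `max_samples` parameters, but the code does not use scikit-learn.

```python
import random

rng = random.Random(0)   # fixed seed so the sketch is repeatable
n_samples = 1000         # hypothetical training-set size
n_estimators = 10        # number of sub-classifiers
max_samples = 0.2        # each sub-classifier sees 20% of the samples

subset_size = int(max_samples * n_samples)
subsets = [
    # Sampling with replacement: the same index may be drawn more than once
    [rng.randrange(n_samples) for _ in range(subset_size)]
    for _ in range(n_estimators)
]

covered = set().union(*subsets)
print("samples per sub-classifier:", subset_size)
print("distinct samples seen by the whole ensemble:", len(covered))
```

Each sub-classifier would then be trained only on the rows listed in its own subset, so no two sub-classifiers see exactly the same data.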

In the following code cell, we will train a few **logistic regression bagging ensembles** with different settings. We will alter the number of (sub-)classifiers, the proportion of samples and the proportion of features, and study their influence on the performance of the bagging ensemble:

In [23]:
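import os
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import BaggingClassifier

# Number of CPUs for ensemble learning methods
# (defined here because this is the first cell that uses it)
N_ENSEMBLE_CPUS = max(os.cpu_count()//2, 1)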

## [ TODO ]
# 1) Create and evaluate a logistic regression bagging ensemble (`BaggingClassifier`)
#    with 10 sub-classifiers, each using 20% of samples.
# 2) Based on the ensemble of question 1, increase the number of sub-classifiers to 30
#    and evaluate again.
# 3) Based on the ensemble of question 1, increase the proportion of training samples for
#    each classifier to 60% and evaluate again.
# 4) Based on the ensemble of question 1, enable bootstrapping of features, set the
#    proportion of features to 50% and evaluate again.
base_estimator = LogisticRegression(max_iter=1000)

bagging_models = {
    "10 sub-classifiers, 20% samples": BaggingClassifier(
        estimator=base_estimator,
        n_estimators=10,
        max_samples=0.2,
        random_state=0,
        n_jobs=N_ENSEMBLE_CPUS
    ),

    "30 sub-classifiers, 20% samples": BaggingClassifier(
        estimator=base_estimator,
        n_estimators=30,
        max_samples=0.2,
        random_state=0,
        n_jobs=N_ENSEMBLE_CPUS
    ),

    "10 sub-classifiers, 60% samples": BaggingClassifier(
        estimator=base_estimator,
        n_estimators=10,
        max_samples=0.6,
        random_state=0,
        n_jobs=N_ENSEMBLE_CPUS
    ),

    "10 sub-classifiers, 20% samples, 50% feature bootstrapping": BaggingClassifier(
        estimator=base_estimator,
        n_estimators=10,
        max_samples=0.2,
        max_features=0.5,
        bootstrap_features=True,
        random_state=0,
        n_jobs=N_ENSEMBLE_CPUS
    )
}
for setting, model in bagging_models.items():
    # Train each bagging classifier
    with timeit(f"Training bagging ensemble ({setting})"):
        model.fit(feat_train, y_train)
    # Evaluate each bagging classifier
    evaluate_model(model, f"bagging ensemble ({setting})", feat_test, y_test)

Training bagging ensemble (10 sub-classifiers, 20% samples) started...
Training bagging ensemble (10 sub-classifiers, 20% samples) completed. Elapsed time: 17.27s

[ Evaluation result for bagging ensemble (10 sub-classifiers, 20% samples) ]
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    113724
           1       0.89      0.56      0.69       199

    accuracy                           1.00    113923
   macro avg       0.94      0.78      0.84    113923
weighted avg       1.00      1.00      1.00    113923

Confusion matrix:
[[113710     14]
 [    88    111]] 

Training bagging ensemble (30 sub-classifiers, 20% samples) started...
Training bagging ensemble (30 sub-classifiers, 20% samples) completed. Elapsed time: 51.69s

[ Evaluation result for bagging ensemble (30 sub-classifiers, 20% samples) ]
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00  

In practice, bagging ensembles are often built upon decision tree classifiers. When each decision tree within the ensemble learns from **both a subset of samples and a subset of features**, the resulting bagging ensemble is called a **random forest** (in scikit-learn's `RandomForestClassifier`, the feature subset is re-drawn at every split of a tree). The code below trains a single decision tree and two random forests with 40 and 100 decision trees, and compares their performance:

In [26]:
import os
from sklearn.ensemble import RandomForestClassifier

# Number of CPUs for ensemble learning methods
N_ENSEMBLE_CPUS = max(os.cpu_count()//2, 1)

# A regular decision tree classifier
with timeit("Training DT classifier"):
    dt_model = DecisionTreeClassifier()
    dt_model.fit(feat_train, y_train)

## [ TODO ]
# 1) Train a random forest classifier with 40 decision trees
# 2) Train a random forest classifier with 100 decision trees
#    (Hint: set `n_jobs` to `N_ENSEMBLE_CPUS` to train the random forest in parallel and reduce training time)

with timeit("Training Random Forest (40 trees)"):
    rf_40_model = RandomForestClassifier(
        n_estimators=40,
        random_state=0,
        n_jobs=N_ENSEMBLE_CPUS
    )
    rf_40_model.fit(feat_train, y_train)

with timeit("Training Random Forest (100 trees)"):
    rf_100_model = RandomForestClassifier(
        n_estimators=100,
        random_state=0,
        n_jobs=N_ENSEMBLE_CPUS
    )
    rf_100_model.fit(feat_train, y_train)

# Evaluate previous models
evaluate_model(dt_model, "DT classifier", feat_test, y_test)
evaluate_model(rf_40_model, "Random forest classifier (40 DTs)", feat_test, y_test)
evaluate_model(rf_100_model, "Random forest classifier (100 DTs)", feat_test, y_test)

Training DT classifier started...
Training DT classifier completed. Elapsed time: 23.43s

Training Random Forest (40 trees) started...
Training Random Forest (40 trees) completed. Elapsed time: 114.26s

Training Random Forest (100 trees) started...
Training Random Forest (100 trees) completed. Elapsed time: 280.35s

[ Evaluation result for DT classifier ]
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    113724
           1       0.78      0.76      0.77       199

    accuracy                           1.00    113923
   macro avg       0.89      0.88      0.88    113923
weighted avg       1.00      1.00      1.00    113923

Confusion matrix:
[[113680     44]
 [    47    152]] 

[ Evaluation result for Random forest classifier (40 DTs) ]
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    113724
           1       0.94      0.76      0.84  

## Boosting

Boosting is an ensemble learning technique that aims to combine **a set of weak learners** (classifiers that are only **slightly better** than random guessing) into a strong learner. It can reduce both the bias and the variance of the original classification models. A boosting algorithm usually consists of **iteratively learning weak classifiers** with respect to a distribution over the training samples and **adding them to a final strong classifier**. Each weak classifier is typically weighted according to its performance. After a weak learner is added, sample weights are re-adjusted so that **misclassified samples are stressed** and correctly classified samples receive less attention. This causes future weak learners to focus more on the samples that previous weak learners got wrong, thus making the ensemble more robust against variations in sample features.
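One round of this re-weighting can be worked through by hand. The sketch below follows the classic discrete AdaBoost update on a made-up round with 10 samples; it is an illustration under that assumption, and scikit-learn's `AdaBoostClassifier` may differ in its details.

```python
import math

# Toy round: 10 equally weighted samples; the weak learner misclassifies samples 2 and 7
n = 10
weights = [1.0 / n] * n
misclassified = [i in (2, 7) for i in range(n)]

# Weighted error of the weak learner and its vote weight (alpha) in the final ensemble
error = sum(w for w, m in zip(weights, misclassified) if m)
alpha = 0.5 * math.log((1.0 - error) / error)

# Up-weight the mistakes, down-weight the correct samples, then renormalize
weights = [w * math.exp(alpha if m else -alpha) for w, m in zip(weights, misclassified)]
total = sum(weights)
weights = [w / total for w in weights]

print(f"error={error:.2f} alpha={alpha:.3f}")
print("weight of a misclassified sample:", round(weights[2], 4))
print("weight of a correct sample:      ", round(weights[0], 4))
```

After the update, the two misclassified samples together hold half of the total weight (0.25 each, against 0.0625 for each correct sample), so the next weak learner cannot do well without handling them.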

In the following code cell, we will try two kinds of boosting ensembles: **[AdaBoost](https://en.wikipedia.org/wiki/AdaBoost) and [gradient boosting](https://en.wikipedia.org/wiki/Gradient_boosting)**. Both use shallow (as opposed to full, deep) decision trees as base weak classifiers: with scikit-learn's default settings, AdaBoost uses one-level trees (decision stumps), while gradient boosting uses trees of depth 3. AdaBoost adjusts the weights of training samples and weak classifiers based on each weak classifier's weighted error, while gradient boosting fits each new weak learner to the gradient of the target loss with respect to the current ensemble prediction, which amounts to performing gradient descent on that loss.

In [27]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

# AdaBoost: adjusts weights based on misclassification rates (50 default weak learners)
with timeit("Training AdaBoost classifier (50 DTs)"):
    adaboost_model = AdaBoostClassifier()
    adaboost_model.fit(feat_train, y_train)

# Gradient boosting: sequentially corrects errors using gradient descent (40 estimators)
with timeit("Training gradient boosting classifier"):
    gb_model = GradientBoostingClassifier(n_estimators=40)
    gb_model.fit(feat_train, y_train)

# Evaluate boosting models
evaluate_model(adaboost_model, "AdaBoost classifier", feat_test, y_test)
evaluate_model(gb_model, "gradient boosting classifier", feat_test, y_test)

Training AdaBoost classifier (50 DTs) started...
Training AdaBoost classifier (50 DTs) completed. Elapsed time: 89.18s

Training gradient boosting classifier started...
Training gradient boosting classifier completed. Elapsed time: 191.77s

[ Evaluation result for AdaBoost classifier ]
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    113724
           1       0.80      0.70      0.75       199

    accuracy                           1.00    113923
   macro avg       0.90      0.85      0.87    113923
weighted avg       1.00      1.00      1.00    113923

Confusion matrix:
[[113688     36]
 [    59    140]] 

[ Evaluation result for gradient boosting classifier ]
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    113724
           1       0.89      0.61      0.72       199

    accuracy                           1.00    113923
   macro avg

## Stacking

The last ensemble learning scheme is **stacking**, which is also the most sophisticated of the four we introduce today. Like all previous types of ensemble, a stacking ensemble contains a number of classifiers, each possibly using a distinct machine learning algorithm. However, during training we perform $K$-fold **cross-validation** on each classifier, so each classifier actually has $K$ clones, each trained on a different part of the training samples. The cross-validation also provides us with **validation predictions for all samples**, gathered from the $K$ clones. After training all base classifiers, we collect the validation predictions from the clones of the different classifiers, concatenate them as features, and then **train a meta-classifier** that predicts the sample label for the whole ensemble.


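To predict labels for samples in the test set (or any other unseen dataset), we feed the sample features to all $K$ clones of every classifier. For the $K$ clones of the same classifier, we **average their outputs**, which are usually probability distributions over all classes. As with the training data, we then **concatenate the averaged outputs** from the different classifiers as features, and finally pass them to the meta-classifier to obtain predictions. Note that scikit-learn's `StackingClassifier`, used below, handles this step differently: it uses the cross-validated predictions only to train the meta-classifier, and at prediction time it uses base classifiers refit on the full training set instead of averaging $K$ clones.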
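The bookkeeping described above can be sketched on a tiny made-up dataset. Everything in this block is hypothetical: `CentroidClassifier` is a toy base learner invented for the illustration, the data are fabricated, and training the meta-classifier itself is omitted. The point is only to show how validation predictions become meta-features and how clone outputs are averaged for unseen samples.

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    bounds = [round(i * n_samples / k) for i in range(k + 1)]
    for lo, hi in zip(bounds, bounds[1:]):
        val = list(range(lo, hi))
        train = [i for i in range(n_samples) if not lo <= i < hi]
        yield train, val


class CentroidClassifier:
    """Hypothetical toy base learner that looks at a single feature column."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y):
        values = [row[self.column] for row in X]
        neg = [v for v, label in zip(values, y) if label == 0]
        pos = [v for v, label in zip(values, y) if label == 1]
        self.mean0, self.mean1 = sum(neg) / len(neg), sum(pos) / len(pos)
        return self

    def predict_proba(self, row):
        """Score in [0, 1] for class 1: the closer to the class-1 mean, the higher."""
        d0 = abs(row[self.column] - self.mean0)
        d1 = abs(row[self.column] - self.mean1)
        return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)


# Tiny made-up dataset: two features, binary labels
X = [(0.1, 1.0), (0.9, 3.0), (0.2, 1.2), (0.8, 2.9), (0.3, 0.8), (0.7, 3.1)]
y = [0, 1, 0, 1, 0, 1]
K = 3
base_columns = [0, 1]  # one toy base classifier per feature column

clones = {col: [] for col in base_columns}  # K fitted clones per base classifier
meta_features = [[None] * len(base_columns) for _ in X]

for j, col in enumerate(base_columns):
    for train_idx, val_idx in kfold_indices(len(X), K):
        clone = CentroidClassifier(col).fit(
            [X[i] for i in train_idx], [y[i] for i in train_idx]
        )
        clones[col].append(clone)
        for i in val_idx:  # validation (out-of-fold) prediction for sample i
            meta_features[i][j] = clone.predict_proba(X[i])

# Every training sample now has one validation prediction per base classifier;
# a meta-classifier (omitted here) would be trained on (meta_features, y).


def meta_features_for(row):
    """Average the K clone outputs of each base classifier for an unseen sample."""
    return [sum(c.predict_proba(row) for c in clones[col]) / K for col in base_columns]


print(meta_features)
print(meta_features_for((0.85, 2.8)))
```

Because each sample's meta-feature comes from a clone that never saw that sample during training, the meta-classifier learns from predictions that resemble what the base classifiers produce on unseen data.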

In [28]:
from sklearn.svm import LinearSVC
from sklearn.ensemble import StackingClassifier

with timeit("Training stacking ensemble"):
    # Create a stacking ensemble with a logistic regression meta-classifier and three sub-classifiers
    stacking_ensemble = StackingClassifier([
        ("Random forest", RandomForestClassifier(n_estimators=40)),
        ("Logistic", LogisticRegression(max_iter=200)),
        ("SVM", LinearSVC(max_iter=1500))
    ], LogisticRegression(), n_jobs=N_ENSEMBLE_CPUS)
    # Train the stacking ensemble
    stacking_ensemble.fit(feat_train, y_train)

# Evaluate the stacking ensemble
evaluate_model(stacking_ensemble, "stacking ensemble", feat_test, y_test)

Training stacking ensemble started...
Training stacking ensemble completed. Elapsed time: 518.25s

[ Evaluation result for stacking ensemble ]
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    113724
           1       0.95      0.74      0.83       199

    accuracy                           1.00    113923
   macro avg       0.97      0.87      0.92    113923
weighted avg       1.00      1.00      1.00    113923

Confusion matrix:
[[113716      8]
 [    52    147]] 



## References

1. Ensemble Learning: https://en.wikipedia.org/wiki/Ensemble_learning
2. Ensemble Learning in Machine Learning: https://towardsdatascience.com/ensemble-learning-in-machine-learning-getting-started-4ed85eb38e00
3. Random Forest: https://en.wikipedia.org/wiki/Random_forest
4. Boosting: https://en.wikipedia.org/wiki/Boosting_(machine_learning)
5. AdaBoost: https://en.wikipedia.org/wiki/AdaBoost
6. Gradient Boosting: https://en.wikipedia.org/wiki/Gradient_boosting