# Bagging and Random Forest

In this notebook, we begin by implementing a bootstrap-aggregating (bagging) classifier built from many decision stumps. Because an ensemble of bagged trees is closely related to a random forest, we then use the built-in RandomForestClassifier, which tends to be more robust because it also randomizes the features considered at each split.


### Random Forest Overview

Random forests are a type of ensemble learning method used in machine learning for both classification and regression 
tasks. Random forests are built on the principle of combining multiple decision trees to improve the accuracy and 
robustness of predictions. Here are the key steps involved:

1. Bootstrap Sampling: Random forests start by creating multiple bootstrap 
samples from the original dataset. Each sample is obtained by randomly selecting data points with 
replacement. Training one model per bootstrap sample and aggregating their predictions is known as 
bagging (bootstrap aggregating).

2. Decision Tree Construction: For each bootstrap sample, a decision tree is constructed. However, unlike 
traditional decision trees, random forests introduce randomness during the tree-building process. 
At each node, instead of considering all features, a random subset of features is selected as split candidates. 
This is known as feature randomness and is closely related to the random subspace method.

3. Combining Predictions: For classification tasks, each decision tree in the forest gives a classification or a "vote," 
and the final prediction is the class with the majority of the votes. For regression tasks, the final prediction is
the average of the outputs of all trees.
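
The bootstrap and voting steps above can be sketched by hand. This is a minimal illustration, not how scikit-learn implements bagging: the helper names `fit_bagged_stumps` and `predict_majority` are hypothetical, the data is loaded with `sklearn.datasets.load_wine` for self-containment, and a hard majority vote is used as described in step 3 (scikit-learn's `BaggingClassifier` averages predicted probabilities when the base estimator provides them). Feature randomness from step 2 is deliberately left out, so this is plain bagging.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_bagged_stumps(X, y, n_estimators=25, seed=42):
    """Fit one depth-1 tree per bootstrap sample (rows drawn with replacement)."""
    rng = np.random.default_rng(seed)
    stumps = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap row indices
        stump = DecisionTreeClassifier(max_depth=1, random_state=42)
        stumps.append(stump.fit(X[idx], y[idx]))
    return stumps


def predict_majority(stumps, X, n_classes):
    """Hard majority vote over the labels predicted by each stump."""
    # votes has shape (n_estimators, n_samples)
    votes = np.stack([s.predict(X) for s in stumps]).astype(int)
    # counts has shape (n_classes, n_samples): votes per class for each sample
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0)


X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

stumps = fit_bagged_stumps(X_train, y_train)
y_pred = predict_majority(stumps, X_test, n_classes=3)
print("accuracy:", (y_pred == y_test).mean())
```

A single stump can separate at most two of the three wine classes, so any three-class predictions here come from the vote across stumps trained on different bootstrap samples.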

#### Underlying Math

The underlying mathematics of random forests involves several key concepts:

1. Variance Reduction: Random forests reduce the variance of the predictions by averaging the outputs of many 
weakly correlated decision trees. This is based on the principle that the average of independent random 
variables with equal variance has a smaller variance than any individual variable.

2. Correlation Reduction: By randomly selecting features at each split, random forests reduce the correlation between 
the decision trees. This is crucial because the benefits of averaging (bagging) are limited by the correlation 
between the trees. Reducing correlation enhances the variance reduction achieved through bagging.

3. Feature Importance: Random forests measure the importance of each feature by its contribution to the model. 
Two common approaches are impurity-based importance, which sums the reduction in split impurity attributed to a 
feature across the trees (this is what sklearn's feature_importances_ attribute reports), and permutation 
importance, which measures the decrease in model performance when the values of that feature are randomly permuted.
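
The variance argument in items 1 and 2 can be made concrete with a standard identity. As a sketch, assume $B$ trees whose predictions $T_b(x)$ each have variance $\sigma^2$ and share a pairwise correlation $\rho$ (these symbols are introduced here for illustration and are not used elsewhere in this notebook):

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

The second term shrinks as $B$ grows, while the first term does not, which is why lowering $\rho$ through feature randomness matters. With $\rho = 0$ the expression reduces to $\sigma^2 / B$, the independent case described in item 1.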

#### Benefits of Random Forests

1. High Accuracy: By combining multiple decision trees, random forests typically achieve higher accuracy than an 
individual decision tree. The ensemble captures a broader range of data patterns than any single tree.

2. Robustness to Overfitting: Random forests are less prone to overfitting than a single deep tree because of the 
randomness in bootstrapping and feature selection. This usually helps the model generalize to new, unseen data, 
although it does not guarantee it.


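3. Handling Missing Data: Some random forest implementations can handle missing values directly, for example by 
sending them down the branch that receives the majority of the data at a split, while others require imputation 
first. Support depends on the library and version, so it is worth checking before relying on it.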

4. Feature Importance: Random forests provide insights into which features are most influential 
in making predictions, which is valuable for understanding the data and selecting relevant features.

5. Versatility: Random forests are effective for both classification and regression tasks and can 
handle high-dimensional datasets with ease.

#### Limitations of Random Forests

1. Computational Cost: Training multiple decision trees can be computationally expensive, especially with large datasets 
and a high number of trees. This increases memory usage and training times.

2. Interpretability: While individual decision trees are easy to interpret, random forests, being an ensemble 
of many trees, are more complex and harder to interpret. This lack of transparency can be a disadvantage in 
situations where model interpretability is crucial.

3. Prediction Time: Random forest models can take longer to make predictions compared to other algorithms, 
which can be a drawback in real-time applications.

4. Parameter Tuning: The performance of random forests heavily depends on various hyperparameters such as 
the number of trees, the maximum depth of each tree, and the number of features considered at each split. 
Proper tuning of these parameters is essential but can be time-consuming and requires expertise.

5. Noise Sensitivity: Random forests can struggle with datasets that have high levels of noise. 
The algorithm may construct trees that overfit the noise, reducing overall model accuracy.
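
The tuning burden in item 4 can be reduced with a cross-validated grid search. The sketch below is an illustration only: the grid values are arbitrary assumptions chosen to keep the run short, and the data is loaded with `sklearn.datasets.load_wine` so the block runs on its own.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The three hyperparameters named in item 4; the candidate values are illustrative.
param_grid = {
    "n_estimators": [50, 100],      # number of trees
    "max_depth": [None, 5],         # maximum depth of each tree
    "max_features": ["sqrt", 0.5],  # features considered at each split
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

Only the training split is searched, so the held-out test set stays untouched for a final evaluation.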

#### Datasets

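This notebook uses the wine dataset that ships with sklearn. It is loaded through the em_el.datasets.load_wine helper, which the code below treats as a DataFrame with a target column.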
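
For readers without the em_el package, the same data can be loaded from sklearn directly. This is a sketch under the assumption that the em_el helper returns a DataFrame equivalent to sklearn's; the variable name `wine_sk` is hypothetical and chosen to avoid clashing with the notebook's own `wine`.

```python
from sklearn.datasets import load_wine as sk_load_wine

# as_frame=True returns a Bunch whose .frame holds the features plus a 'target' column
wine_sk = sk_load_wine(as_frame=True).frame

print(wine_sk.shape)
print(wine_sk["target"].value_counts().sort_index())
```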

#### Reproducibility

All random states are set to 42 so that results are reproducible.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier

from em_el.datasets import load_wine

In [2]:
wine = load_wine()
X = wine.drop('target', axis=1).to_numpy()
y = wine['target'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [21]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(max_depth=1, random_state=42), random_state=42)
bag_clf.fit(X_train, y_train)
bag_y_pred = bag_clf.predict(X_test)

print("Classification Report - Bagging: \n", classification_report(y_test, bag_y_pred))

Classification Report - Bagging: 
               precision    recall  f1-score   support

           0       0.76      0.93      0.84        14
           1       0.92      0.79      0.85        14
           2       1.00      0.88      0.93         8

    accuracy                           0.86        36
   macro avg       0.89      0.86      0.87        36
weighted avg       0.88      0.86      0.86        36



In [20]:
# How does this compare to a single, deeper decision tree?

tree_clf = DecisionTreeClassifier(max_depth=10, random_state=42)
tree_clf.fit(X_train, y_train)
tree_y_pred = tree_clf.predict(X_test)
# Score the tree's own predictions (tree_y_pred), not the bagging predictions
tree_clf_rep = classification_report(y_test, tree_y_pred)
print("Classification Report - Decision Tree: \n", tree_clf_rep)

Classification Report - Decision Tree: 
(The report captured in the original run was computed from bag_y_pred and therefore repeated the bagging report above. Re-run this cell to see the decision tree's own scores.)



In [22]:
rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_train, y_train)
rf_y_pred = rf_clf.predict(X_test)

rf_clf_rep = classification_report(y_test, rf_y_pred)
print("Random Forest Classification Report: \n", rf_clf_rep)

Random Forest Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        14
           1       1.00      1.00      1.00        14
           2       1.00      1.00      1.00         8

    accuracy                           1.00        36
   macro avg       1.00      1.00      1.00        36
weighted avg       1.00      1.00      1.00        36



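In this case, the random forest classifies all 36 test samples correctly (accuracy 1.00, versus 0.86 for the bagged stumps) without any hyperparameter tuning. Two differences likely contribute. First, sklearn's RandomForestClassifier grows 100 full-depth trees by default, whereas the bagging ensemble above uses the default of 10 estimators, each a depth-1 stump. Second, the forest samples a random subset of features at each split, which reduces the correlation between trees and strengthens the variance reduction. With a test set this small, the gap is indicative rather than conclusive.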
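
To separate the effect of tree depth and ensemble size from the effect of feature randomness, the three setups can be compared under cross-validation. This is a sketch, not part of the original analysis: the middle configuration (100 bagged full-depth trees) is an assumption added for comparison, and the data is loaded with `sklearn.datasets.load_wine` so the block is self-contained.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

models = {
    # Same setup as the bagging cell above: 10 depth-1 stumps
    "bagged stumps (10)": BaggingClassifier(
        DecisionTreeClassifier(max_depth=1, random_state=42),
        n_estimators=10,
        random_state=42,
    ),
    # Bagging only: full-depth trees, all features available at every split
    "bagged full trees (100)": BaggingClassifier(
        DecisionTreeClassifier(random_state=42),
        n_estimators=100,
        random_state=42,
    ),
    # Bagging plus a random feature subset at every split
    "random forest (100)": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Mean accuracy over 5 stratified folds for each model
scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

If the second and third rows score similarly, most of the gain over stumps comes from deeper trees and a larger ensemble; a clear gap between them would point to feature randomness.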

In [14]:
print("Feature Importances")
list(zip(list(wine.drop('target', axis=1).columns), rf_clf.feature_importances_))

Feature Importances


[('alcohol', 0.11239773542143086),
 ('malic_acid', 0.03570276433546083),
 ('ash', 0.021282064154184602),
 ('alcalinity_of_ash', 0.03242487609714125),
 ('magnesium', 0.03684069949458186),
 ('total_phenols', 0.029278585609125395),
 ('flavanoids', 0.20229341635663622),
 ('nonflavanoid_phenols', 0.013515250584037197),
 ('proanthocyanins', 0.023560915987205423),
 ('color_intensity', 0.1712021830864957),
 ('hue', 0.07089132259413944),
 ('od280/od315_of_diluted_wines', 0.1115643167260497),
 ('proline', 0.13904586955351153)]

While importance is spread across all thirteen features, the algorithm ranked flavanoids, color_intensity, and proline highest; these three sum to just over 50% of the total importance (about 0.51).
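
The values above are impurity-based importances. As a cross-check, permutation importance on the held-out split can be computed with `sklearn.inspection.permutation_importance`. The sketch below refits an equivalent forest on data from `sklearn.datasets.load_wine` so that it runs on its own; `n_repeats=10` is an arbitrary choice.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Shuffle each feature in the test set and record the mean drop in accuracy
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

# Rank features by impurity-based importance and show both measures side by side
ranked = sorted(
    zip(data.feature_names, rf.feature_importances_, result.importances_mean),
    key=lambda row: row[1],
    reverse=True,
)
for name, impurity, permutation in ranked:
    print(f"{name:30s} impurity={impurity:.3f}  permutation={permutation:.3f}")
```

The two measures need not agree: impurity-based importance is computed from the training data, while permutation importance reflects how much the test score depends on each feature, and it can be small for correlated features that substitute for one another.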