# Bagging and Random Forest

In this notebook, we begin by implementing a Bootstrap-Aggregating classifier using many stump classifiers. Recognizing that this is technically a random forest, then utilize the built-in RandomForestClassifier, which is slightly more robust due to the randomness it introduces in sampling. 

### Datasets

This notebook uses the wine dataset from sklearn.

### Reproducibility

Ensure all random states are set to 42

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier

from em_el.datasets import load_wine

In [2]:
wine = load_wine()
X = wine.drop('target', axis=1).to_numpy()
y = wine['target'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [21]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(max_depth=1, random_state=42), random_state=42)
bag_clf.fit(X_train, y_train)
bag_y_pred = bag_clf.predict(X_test)

print("Classification Report - Bagging: \n", classification_report(y_test, bag_y_pred))

Classification Report - Bagging: 
               precision    recall  f1-score   support

           0       0.76      0.93      0.84        14
           1       0.92      0.79      0.85        14
           2       1.00      0.88      0.93         8

    accuracy                           0.86        36
   macro avg       0.89      0.86      0.87        36
weighted avg       0.88      0.86      0.86        36



In [20]:
# How does this compare to a normal decision tree?

tree_clf = DecisionTreeClassifier(max_depth = 10, random_state=42)
tree_clf.fit(X_train, y_train)
tree_y_pred = tree_clf.predict(X_test)
tree_clf_rep = classification_report(y_test, bag_y_pred)
print("Classification Report - Decision Tree: \n", tree_clf_rep)

Classification Report - Decision Tree: 
               precision    recall  f1-score   support

           0       0.76      0.93      0.84        14
           1       0.92      0.79      0.85        14
           2       1.00      0.88      0.93         8

    accuracy                           0.86        36
   macro avg       0.89      0.86      0.87        36
weighted avg       0.88      0.86      0.86        36



In [22]:
rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_train, y_train)
rf_y_pred = rf_clf.predict(X_test)

rf_clf_rep = classification_report(y_test, rf_y_pred)
print(f"Random Forest Classification Report: \n", rf_clf_rep)

Random Forest Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        14
           1       1.00      1.00      1.00        14
           2       1.00      1.00      1.00         8

    accuracy                           1.00        36
   macro avg       1.00      1.00      1.00        36
weighted avg       1.00      1.00      1.00        36



The random forest outperforms the decision tree without any hyperparameter tuning

In [14]:
print("Feature Importances")
list(zip(list(wine.drop('target', axis=1).columns), rf_clf.feature_importances_))

Feature Importances


[('alcohol', 0.11239773542143086),
 ('malic_acid', 0.03570276433546083),
 ('ash', 0.021282064154184602),
 ('alcalinity_of_ash', 0.03242487609714125),
 ('magnesium', 0.03684069949458186),
 ('total_phenols', 0.029278585609125395),
 ('flavanoids', 0.20229341635663622),
 ('nonflavanoid_phenols', 0.013515250584037197),
 ('proanthocyanins', 0.023560915987205423),
 ('color_intensity', 0.1712021830864957),
 ('hue', 0.07089132259413944),
 ('od280/od315_of_diluted_wines', 0.1115643167260497),
 ('proline', 0.13904586955351153)]