# 🌟 Chapter 7: Ensemble Learning & Random Forests — Practical Guide

Ensemble learning combines the predictions of several models, which often gives more accurate and more robust results than any single model.

This chapter walks through the main ensemble techniques with Python examples in scikit-learn, all on the Iris dataset.
Let's get started!

## 1. 🔹 Voting Classifiers

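A voting classifier combines the predictions of several different models. With hard voting, each model casts one vote and the majority class wins; with soft voting, the predicted class probabilities are averaged (optionally weighted) and the class with the highest average wins.

We'll use the Iris dataset and combine Logistic Regression, a Decision Tree, and an SVM with soft voting. The SVM is created with `probability=True` so that it can output the class probabilities that soft voting needs.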

In [1]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42)

# Define individual classifiers
log_clf = LogisticRegression(max_iter=1000)
tree_clf = DecisionTreeClassifier(random_state=42)
svm_clf = SVC(probability=True, random_state=42)

# Create a VotingClassifier with soft voting
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('dt', tree_clf), ('svm', svm_clf)],
    voting='soft'
)

# Fit and evaluate
voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(X_test)
print("Voting Classifier Accuracy:", accuracy_score(y_test, y_pred))

Voting Classifier Accuracy: 1.0
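
To make the difference between hard and soft voting concrete, here is a minimal sketch in plain Python. The probabilities are made-up numbers for a single sample, not outputs of the Iris models above, and the helper names `hard_vote` and `soft_vote` are hypothetical.

```python
from collections import Counter

# Hypothetical class probabilities from three classifiers for ONE sample
# (illustrative numbers, not outputs of the models above).
probas = [
    [0.90, 0.10, 0.00],  # classifier 1 is very sure of class 0
    [0.40, 0.60, 0.00],  # classifier 2 leans slightly toward class 1
    [0.45, 0.55, 0.00],  # classifier 3 leans slightly toward class 1
]

def hard_vote(probas):
    """Each classifier votes for its most likely class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas):
    """Average the class probabilities, then pick the highest average."""
    n_classes = len(probas[0])
    avg = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

print("Hard vote:", hard_vote(probas))
print("Soft vote:", soft_vote(probas))
```

Here hard voting picks class 1 (two votes against one), while soft voting picks class 0 because the first classifier's high confidence dominates the average. This is why soft voting can help when the base models produce well-calibrated probabilities.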


## 2. 🔁 Bagging and Pasting

Bagging (Bootstrap Aggregating) trains many copies of the same model, each on a different random subset of the training data sampled with replacement.
Pasting works the same way but samples without replacement.

We'll use Decision Trees as base estimators.

In [2]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Bagging with bootstrap samples
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,
    bootstrap=True,
    random_state=42
)
bag_clf.fit(X_train, y_train)
print("Bagging Accuracy:", accuracy_score(y_test, bag_clf.predict(X_test)))

Bagging Accuracy: 1.0
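
The only difference between bagging and pasting is how each predictor's training subset is drawn. The sketch below shows this with the standard library on a toy set of 10 instance indices; the sizes and variable names are assumptions chosen for illustration.

```python
import random

rng = random.Random(42)
indices = list(range(10))   # a toy training set of 10 instance indices
n_draw = 8                  # analogous to max_samples=0.8

# Bagging: draw WITH replacement, so an index may appear more than once.
bagging_sample = [rng.choice(indices) for _ in range(n_draw)]

# Pasting: draw WITHOUT replacement, so every index is unique.
pasting_sample = rng.sample(indices, n_draw)

print("Bagging sample:", sorted(bagging_sample))
print("Pasting sample:", sorted(pasting_sample))
```

In `BaggingClassifier`, `bootstrap=True` (used above) selects sampling with replacement, and `bootstrap=False` switches to pasting.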


### Out-of-Bag (OOB) Samples for Validation

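With bootstrap sampling, each predictor never sees some of the training instances. These are its out-of-bag (OOB) instances, and because the predictor was not trained on them, they can serve as a built-in validation set without holding out extra data.

Setting `oob_score=True` computes this estimate during training and stores it in `oob_score_`.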

In [8]:
# Bagging with OOB score enabled
bag_clf_oob = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,
    oob_score=True,
    random_state=42
)
bag_clf_oob.fit(X_train, y_train)
print("OOB Score:", bag_clf_oob.oob_score_)

OOB Score: 0.9375
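
How many instances end up out-of-bag? When a bootstrap sample draws n times with replacement from n instances, each instance is left out with probability (1 - 1/n)^n, which approaches 1/e (about 37%) as n grows. The following sketch checks this with a quick simulation; the training-set size of 1000 is an arbitrary assumption.

```python
import math
import random

rng = random.Random(0)
n = 1000  # toy training-set size

# One bootstrap sample: n draws with replacement from n instances.
sampled = {rng.randrange(n) for _ in range(n)}
oob_fraction = 1 - len(sampled) / n

print(f"Simulated OOB fraction:  {oob_fraction:.3f}")
print(f"Theoretical (1 - 1/n)^n: {(1 - 1 / n) ** n:.3f}")
print(f"Limit 1/e:               {math.exp(-1):.3f}")
```

So each predictor in the bagging ensemble above is evaluated on roughly a third of the training set that it never saw.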


## 3. 🌲 Random Patches and Random Subspaces — Random Forests & Extra Trees

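Random Patches samples both the training instances and the features for each predictor; Random Subspaces keeps all instances and samples only the features. Both make the individual predictors more diverse.
A Random Forest applies a related idea to decision trees: each tree is trained on a bootstrap sample and considers only a random subset of features at every split. Extra Trees go one step further and also choose split thresholds at random.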


In [3]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Initialize classifiers
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
et_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)

# Train classifiers
rf_clf.fit(X_train, y_train)
et_clf.fit(X_train, y_train)

# Evaluate
print("Random Forest Accuracy:", accuracy_score(y_test, rf_clf.predict(X_test)))
print("Extra Trees Accuracy:", accuracy_score(y_test, et_clf.predict(X_test)))

Random Forest Accuracy: 1.0
Extra Trees Accuracy: 1.0
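
To show what Random Subspaces and Random Patches mean in terms of the data each predictor sees, here is a small standard-library sketch. It only draws index sets and trains no model; the sizes and the helper name `draw` are assumptions for illustration.

```python
import random

rng = random.Random(42)
n_samples, n_features = 12, 4   # toy sizes; Iris also has 4 features
n_estimators = 3

def draw(n_total, n_keep, replace):
    """Pick n_keep indices out of n_total, with or without replacement."""
    if replace:
        return sorted(rng.randrange(n_total) for _ in range(n_keep))
    return sorted(rng.sample(range(n_total), n_keep))

for i in range(n_estimators):
    # Random Subspaces: keep every instance, sample only the features.
    subspace_features = draw(n_features, 2, replace=False)
    # Random Patches: sample both the instances and the features.
    patch_rows = draw(n_samples, 8, replace=True)
    patch_features = draw(n_features, 2, replace=False)
    print(f"estimator {i}: subspace features={subspace_features} | "
          f"patch rows={patch_rows}, patch features={patch_features}")
```

In scikit-learn, `BaggingClassifier` controls instance sampling with `max_samples` and `bootstrap`, and feature sampling with `max_features` and `bootstrap_features`.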


### 📊 Feature Importance

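Both models expose a `feature_importances_` attribute: one impurity-based score per feature, normalized so that the scores sum to 1. On Iris, the two petal measurements (the last two values) dominate.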


In [4]:
for name, clf in [("Random Forest", rf_clf), ("Extra Trees", et_clf)]:
    print(name, "feature importances:", clf.feature_importances_)

Random Forest feature importances: [0.10968334 0.02954459 0.43763486 0.42313721]
Extra Trees feature importances: [0.08806129 0.07102978 0.44737944 0.39352949]
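
The raw arrays are easier to read when paired with feature names. The sketch below takes the Random Forest values from the output above and ranks them; the names are the ones `load_iris().feature_names` returns, hard-coded here so the snippet runs on its own.

```python
# Random Forest importances copied from the output above.
importances = [0.10968334, 0.02954459, 0.43763486, 0.42313721]
# Iris feature names, in the order of load_iris().feature_names.
feature_names = [
    "sepal length (cm)", "sepal width (cm)",
    "petal length (cm)", "petal width (cm)",
]

# Sort features from most to least important.
ranked = sorted(zip(feature_names, importances), key=lambda t: t[1], reverse=True)
for name, score in ranked:
    print(f"{name:<18} {score:.3f}")

print("Sum of importances:", round(sum(importances), 6))
```

In a live session, the same ranking can be built directly from `iris.feature_names` and `rf_clf.feature_importances_`.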


## 4. 🚀 Boosting

Boosting trains models sequentially, where each new model tries to correct errors made by previous ones.

We'll explore AdaBoost and Gradient Boosting.

In [5]:
from sklearn.ensemble import AdaBoostClassifier

# AdaBoost with decision stumps
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    learning_rate=0.5,
    random_state=42
)
ada_clf.fit(X_train, y_train)
print("AdaBoost Accuracy:", accuracy_score(y_test, ada_clf.predict(X_test)))

AdaBoost Accuracy: 0.9473684210526315
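
The core of AdaBoost is the instance-weight update between rounds. Below is a minimal sketch of one round for the classic binary formulation. It is a simplification: scikit-learn uses a multiclass variant, so its internal formulas differ, and the misclassification pattern here is made up.

```python
import math

# Toy setup: 5 training instances with equal initial weights.
weights = [0.2] * 5
# Hypothetical outcome of the first weak learner: True = misclassified.
misclassified = [False, False, True, False, False]
learning_rate = 0.5  # same value as in the cell above

# Weighted error rate of this learner.
error = sum(w for w, m in zip(weights, misclassified) if m) / sum(weights)

# Learner weight: larger when the learner is more accurate.
alpha = learning_rate * math.log((1 - error) / error)

# Boost the weights of misclassified instances, then renormalize.
weights = [w * math.exp(alpha) if m else w for w, m in zip(weights, misclassified)]
total = sum(weights)
weights = [w / total for w in weights]

print("error:", round(error, 3), "alpha:", round(alpha, 3))
print("new weights:", [round(w, 3) for w in weights])
```

After the update, the misclassified instance carries twice the weight of each other instance, so the next weak learner is pushed to get it right.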


### 📌 Gradient Boosting

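Gradient Boosting also adds models sequentially, but instead of reweighting instances as AdaBoost does, each new tree is fit to the negative gradient of a loss function with respect to the current ensemble's predictions. For squared error, this is simply the residual.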


In [6]:
from sklearn.ensemble import GradientBoostingClassifier

gb_clf = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=1.0,
    max_depth=3,
    random_state=42
)
gb_clf.fit(X_train, y_train)
print("Gradient Boosting Accuracy:", accuracy_score(y_test, gb_clf.predict(X_test)))

Gradient Boosting Accuracy: 1.0
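
The residual-fitting idea is easiest to see on a regression problem with squared error. The sketch below is a hand-rolled illustration on made-up 1-D data, using one-split stumps as weak learners. It is not how `GradientBoostingClassifier` is implemented (that uses a classification loss and full regression trees), and `fit_stump` and `mse` are hypothetical helpers.

```python
# Toy 1-D regression data (made up for illustration).
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]

def fit_stump(X, residuals):
    """Find the threshold split that minimizes squared error on the residuals."""
    best = None
    for threshold in X[:-1]:
        left = [r for x, r in zip(X, residuals) if x <= threshold]
        right = [r for x, r in zip(X, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def mse(y, pred):
    """Mean squared error between targets and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)

learning_rate = 0.5
pred = [sum(y) / len(y)] * len(y)   # start from the mean prediction
errors = [mse(y, pred)]
for _ in range(20):
    # Residuals are the negative gradient of the squared-error loss.
    residuals = [a - b for a, b in zip(y, pred)]
    stump = fit_stump(X, residuals)
    pred = [p + learning_rate * stump(x) for p, x in zip(pred, X)]
    errors.append(mse(y, pred))

print("MSE before boosting:", round(errors[0], 4))
print("MSE after 20 rounds:", round(errors[-1], 4))
```

Each round shrinks the training error a little. The learning rate scales every stump's contribution, which is why a smaller `learning_rate` usually needs more estimators.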


## 5. 🎯 Stacking

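Stacking trains a meta-classifier (here, Logistic Regression) to combine the predictions of several base models, instead of using a fixed rule such as voting.
With `cv=5`, the meta-classifier is trained on cross-validated predictions of the base models, so it learns from predictions made on data the base models did not see during fitting.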

In [7]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

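# Initialize stacking classifier with base models and meta-model.
# StackingClassifier clones and refits the base models internally,
# so the already-fitted rf_clf, et_clf, and gb_clf are left unchanged.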
stack_clf = StackingClassifier(
    estimators=[('rf', rf_clf), ('et', et_clf), ('gb', gb_clf)],
    final_estimator=LogisticRegression(),
    cv=5
)

# Train and evaluate
stack_clf.fit(X_train, y_train)
print("Stacking Accuracy:", accuracy_score(y_test, stack_clf.predict(X_test)))

Stacking Accuracy: 1.0
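
The reason for the `cv` argument is that the meta-model should be trained on predictions for instances the base models did not see. The sketch below illustrates such out-of-fold predictions with a deliberately trivial "base model" that predicts the mean of its training targets. The data, the helper `kfold_indices`, and the plain contiguous folds are assumptions for illustration; scikit-learn's own splitting strategy differs.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, holdout_idx) pairs for contiguous folds."""
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        holdout = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, holdout
        start += size

# Toy targets; the "base model" just predicts the mean of its training targets.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
meta_feature = [None] * len(y)

for train_idx, holdout_idx in kfold_indices(len(y), n_splits=5):
    fitted_mean = sum(y[i] for i in train_idx) / len(train_idx)  # "fit" on 4 folds
    for i in holdout_idx:
        meta_feature[i] = fitted_mean  # "predict" on the held-out fold

print(meta_feature)
```

Every entry of `meta_feature` comes from a model that was fit without the corresponding instance, which is the property the meta-model relies on to avoid learning from overly optimistic base predictions.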


## Summary Table

| Technique                        | Description |
| -------------------------------- | ----------- |
| **Voting**                       | Combines different models by majority vote (hard) or averaged probabilities (soft) |
| **Bagging / Pasting**            | Trains one model type on random subsets drawn with (bagging) or without (pasting) replacement |
| **Random Forest / Extra Trees**  | Adds feature randomness for diverse trees; exposes feature importances |
| **AdaBoost / Gradient Boosting** | Trains models sequentially, each correcting its predecessors' errors |
| **Stacking**                     | A meta-model learns to combine base-model predictions |


## 🔧 Exercises to Practice

1. Tune `n_estimators`, `max_depth`, and `max_features` for Random Forest using `GridSearchCV`.
2. Plot learning curves for AdaBoost and Gradient Boosting to check for overfitting or underfitting.
3. Compare stacking versus a voting classifier on a larger or more complex dataset.
4. Analyze misclassifications from the stacked model to understand weaknesses.

Happy experimenting!