# **Ensemble Learning | Assignment**

# **Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.**

Ensemble Learning is a machine learning technique that combines multiple individual models (often called 'base learners', or 'weak learners' when each one is only slightly better than chance) to achieve better predictive performance than any single one of them. The key idea is the 'wisdom of the crowd' principle: by combining the predictions of several diverse models, the ensemble can reduce errors, improve robustness, and increase accuracy. Each individual model has its own strengths and weaknesses, but when the models make largely uncorrelated errors, aggregating their predictions lets those errors partly cancel out. Depending on the method, this mainly lowers variance (bagging) or bias (boosting), giving a more reliable overall prediction.
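
The 'wisdom of the crowd' effect can be illustrated with a small calculation. The sketch below assumes an idealized setting that real ensembles only approximate: every model is correct with the same probability `p`, and the models err independently. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb

def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes each voter is right with probability p, independently of the
    others (an idealization; real base learners are correlated).
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )

# Odd ensemble sizes avoid ties in the vote.
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the accuracy of the vote grows with the number of models, even though each model alone is right only 60% of the time. Correlated errors weaken this effect, which is why diversity among base learners matters.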

# **Question 2: What is the difference between Bagging and Boosting?**

Bagging and Boosting are both ensemble learning techniques that combine multiple base learners into a stronger model, but they differ significantly in their approach:

**Bagging (Bootstrap Aggregating):**
*   **Goal:** To reduce variance and prevent overfitting.
*   **How it works:**
    *   **Parallel:** Base learners are built independently (in parallel).
    *   **Bootstrapping:** Multiple subsets of the original training data are created by sampling with replacement (bootstrapping).
    *   **Model Training:** A base learner (e.g., decision tree) is trained on each of these subsets.
    *   **Aggregation:** For regression, predictions are averaged; for classification, predictions are combined by majority voting.
*   **Key Characteristics:**
    *   **Reduces Variance:** Averaging or voting smooths out the fluctuations of the individual models, which lowers the variance of the ensemble while leaving its bias roughly unchanged.
    *   **Less prone to overfitting:** Due to the averaging/voting mechanism and diverse training sets.
    *   **Examples:** Random Forest (Bagging applied to decision trees, with the extra step of considering only a random subset of features at each split).

**Boosting:**
*   **Goal:** To reduce bias and convert weak learners into strong learners.
*   **How it works:**
    *   **Sequential:** Base learners are built sequentially, with each new model attempting to correct the errors of the previous ones.
    *   **Weighted Data:** At each iteration, the algorithm focuses more on the instances that the previous models got wrong. AdaBoost does this by assigning them higher weights; gradient boosting does it by fitting the next model to the residual errors (the gradients of the loss).
    *   **Model Training:** A base learner is trained on the re-weighted data (or on the residuals).
    *   **Aggregation:** Predictions are combined as a weighted sum. In AdaBoost, models with lower weighted error get more influence; in gradient boosting, each model's contribution is scaled by a learning rate.
*   **Key Characteristics:**
    *   **Reduces Bias:** By iteratively focusing on difficult examples, it effectively learns from mistakes and improves overall accuracy.
    *   **More prone to overfitting:** If not properly regularized, as it can over-specialize on the training data due to its sequential error correction.
    *   **Examples:** AdaBoost, Gradient Boosting Machines (GBM), XGBoost, LightGBM, CatBoost.
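
To make the re-weighting step concrete, here is a minimal sketch of one round of classic (discrete, binary) AdaBoost. It covers only the weight update, not the fitting of the weak learner, and the helper name `adaboost_round` is hypothetical. Gradient boosting methods such as XGBoost work differently (they fit residuals rather than re-weighting samples).

```python
import math

def adaboost_round(weights, is_correct):
    """One AdaBoost re-weighting step (illustrative sketch).

    weights:    current sample weights, assumed to sum to 1
    is_correct: for each sample, whether the current weak learner was right
    Returns (alpha, new_weights), where alpha is the learner's vote weight.
    """
    # Weighted error = total weight of the misclassified samples.
    err = sum(w for w, ok in zip(weights, is_correct) if not ok)
    # Learner weight; this sketch assumes 0 < err < 1.
    alpha = 0.5 * math.log((1 - err) / err)
    # Shrink weights of correct samples, grow weights of mistakes.
    updated = [
        w * math.exp(-alpha if ok else alpha)
        for w, ok in zip(weights, is_correct)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.2] * 5                       # five samples, uniform start
alpha, new_weights = adaboost_round(
    weights, [True, True, True, True, False]
)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.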

**Summary of Differences:**

| Feature         | Bagging (e.g., Random Forest)                                  | Boosting (e.g., AdaBoost, XGBoost)                               |
| :-------------- | :------------------------------------------------------------- | :----------------------------------------------------------------- |
| **Approach**    | Parallel, independent training                                 | Sequential, dependent training                                     |
| **Goal**        | Reduce variance, prevent overfitting                           | Reduce bias, convert weak learners to strong learners              |
| **Data Usage**  | Each model trained on a bootstrap sample of the original data  | Each model trained on re-weighted data, focusing on previous errors |
| **Error Focus** | Each model treats all training samples equally | Each model focuses on errors made by previous models |
| **Weighting**   | Equal weighting for models (or simple majority vote)           | Models weighted based on their performance, data points weighted   |
| **Complexity**  | Simpler to parallelize                                       | More complex, inherently sequential                               |

# **Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**

**Bootstrap Sampling:**

Bootstrap sampling (also known as bootstrapping) is a resampling technique used to estimate statistics on a population by sampling a dataset with replacement. In simpler terms, it involves creating multiple new datasets (bootstrap samples) from an original dataset by randomly selecting observations with replacement. This means that an observation can be selected multiple times in a single bootstrap sample, and some observations might not be selected at all.

Key characteristics of bootstrap sampling:
*   **Sampling with Replacement:** Each time an observation is selected from the original dataset to form a bootstrap sample, it is returned to the original dataset, making it available for re-selection.
*   **Same Size:** Each bootstrap sample typically has the same size as the original dataset.
*   **Diverse Samples:** Because of sampling with replacement, each bootstrap sample will be slightly different from the original dataset and from each other, containing some duplicates and missing some original observations.
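
These three properties can be observed directly with the standard library. The sketch below is illustrative only: the helper name `bootstrap_sample` and the toy dataset of 1,000 integers are assumptions made for this example.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly, with replacement."""
    return rng.choices(data, k=len(data))

rng = random.Random(0)            # fixed seed so the run is repeatable
data = list(range(1000))          # toy "dataset" of 1,000 observations
sample = bootstrap_sample(data, rng)

# Same size as the original, but some rows repeat and others are missing.
unique_fraction = len(set(sample)) / len(data)
print(len(sample), round(unique_fraction, 3))
```

The sample has the same length as the original data, yet only a fraction of the distinct observations appear in it; the rest of the slots are filled by duplicates.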

**Role in Bagging Methods like Random Forest:**

Bootstrap sampling is a fundamental component of Bagging (Bootstrap Aggregating) methods, and it plays a crucial role in the success of algorithms like Random Forest. Here's how:

1.  **Creating Diverse Training Sets:** In Bagging, instead of training a single model on the entire dataset, multiple base models (e.g., decision trees in Random Forest) are trained on different bootstrap samples of the original training data. This ensures that each base model sees a slightly different version of the data.

2.  **Reducing Variance:** Because each base model is trained on a different sample of the data, the models make somewhat different errors. When their predictions are combined (e.g., by averaging for regression or majority voting for classification), these individual errors tend to partly cancel out. This reduces the variance of the overall ensemble, making it more robust and less prone to overfitting than any single base model, while its bias stays close to that of a single base model.

3.  **Increasing Model Stability:** By training on diverse samples, the ensemble becomes more stable. The overall prediction is less sensitive to small changes in the training data, as the individual models' sensitivities are averaged out.

4.  **Enabling Parallelization:** Since each base model is trained independently on its own bootstrap sample, the training process can be easily parallelized, so Bagging methods scale well across multiple cores.

In essence, bootstrap sampling is the mechanism that introduces diversity among the base learners in Bagging. This diversity is key to the ensemble's ability to reduce variance and improve overall predictive performance, making it a powerful technique for creating robust machine learning models like the Random Forest.

# **Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?**

**Out-of-Bag (OOB) Samples:**

Out-of-Bag (OOB) samples are the data points from the original training dataset that were *not* included in a particular bootstrap sample used to train a base learner in a Bagging ensemble (like a Random Forest). In bootstrap sampling, a little more than one-third (about 36.8%) of the original data is left out of any given bootstrap sample. These left-out data points are the OOB samples of that base learner.

To illustrate:
1.  When a bootstrap sample of the same size as the original dataset is drawn with replacement, it contains on average roughly 63.2% of the distinct original observations; the remaining slots are filled by duplicates.
2.  The remaining ~36.8% of the data points that were *not* selected for that specific bootstrap sample constitute the OOB samples for that particular base model.
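
The 63.2% / 36.8% split follows from a short calculation: a given observation is missed by one draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e as n grows. The snippet below just evaluates that expression; the function name `oob_probability` is made up for this example.

```python
import math

def oob_probability(n: int) -> float:
    """Chance that one specific row is absent from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n

for n in (10, 100, 1000, 10_000):
    print(n, round(oob_probability(n), 4))

# Limit for large n: 1/e, i.e. about 36.8% of rows are out-of-bag.
print("limit", round(math.exp(-1), 4))
```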

**How OOB Score is Used to Evaluate Ensemble Models:**

The OOB score provides a convenient and efficient way to evaluate the performance of Bagging ensemble models (especially Random Forests) without the need for a separate validation set or cross-validation. Here's how it works:

1.  **Prediction by Unseen Data:** For each data point in the original training set, only the base models that did *not* use that data point in their training (i.e., those for which that data point was an OOB sample) are used to make a prediction for that data point.

2.  **Aggregating OOB Predictions:** For a given data point, predictions are collected from all base models for which it was an OOB sample. These predictions are then aggregated (e.g., averaged for regression, or majority voted for classification) to produce a final OOB prediction for that data point.

3.  **Calculating the OOB Score:** Once OOB predictions have been made for all data points in the original training set (or at least for a sufficient portion), the OOB score is calculated by comparing these OOB predictions with the true labels (or values). Common metrics for OOB score include:
    *   **Accuracy:** For classification, the proportion of correctly classified OOB samples.
    *   **R-squared or Mean Squared Error (MSE):** For regression.
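
The three steps above can be sketched from scratch. This is a toy illustration, not how scikit-learn implements it: the base learner is a 1-nearest-neighbour rule on a single feature, the data is synthetic, and the names `predict_1nn` and `oob_accuracy` are assumptions made for this example.

```python
import random
from collections import Counter

def predict_1nn(train_rows, x):
    """Toy base learner: return the label of the training row closest to x."""
    return min(train_rows, key=lambda row: abs(row[0] - x))[1]

def oob_accuracy(data, n_models, seed=0):
    """Out-of-bag accuracy of a bagged 1-NN ensemble on (x, label) rows."""
    rng = random.Random(seed)
    n = len(data)
    # Each "model" is defined by the row indices of its bootstrap sample.
    bags = [[rng.randrange(n) for _ in range(n)] for _ in range(n_models)]
    bag_sets = [set(bag) for bag in bags]
    bag_rows = [[data[j] for j in bag] for bag in bags]

    correct = scored = 0
    for i, (x, y) in enumerate(data):
        # Step 1: only models that never saw row i may vote on it.
        votes = Counter(
            predict_1nn(rows, x)
            for rows, seen in zip(bag_rows, bag_sets)
            if i not in seen
        )
        if not votes:          # row i was in every bag: no OOB prediction
            continue
        # Step 2: aggregate by majority vote (ties go to the first label seen).
        prediction = votes.most_common(1)[0][0]
        scored += 1
        correct += int(prediction == y)
    # Step 3: compare OOB predictions with the true labels.
    return correct / scored if scored else float("nan")

# Synthetic data: one feature in [0, 1), label 1 when the feature exceeds 0.5.
data_rng = random.Random(1)
data = [(x, int(x > 0.5)) for x in (data_rng.random() for _ in range(200))]
score = oob_accuracy(data, n_models=25)
print(round(score, 3))
```

In scikit-learn the same idea is available by passing `oob_score=True` to `RandomForestClassifier` or `BaggingClassifier` and reading the fitted `oob_score_` attribute.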

**Advantages of OOB Score:**

*   **Internal Validation:** It acts as an internal estimate of the model's generalization error, similar to using a validation set, because each data point is evaluated only by models that did not see it during training. The estimate is approximately unbiased, though it can be slightly pessimistic because each OOB prediction uses only a subset of the trees.
*   **Efficiency:** It eliminates the need for splitting the data into separate training and validation sets, allowing the model to utilize all available data for training while still providing a robust performance estimate.
*   **Computational Savings:** Since the OOB evaluation happens during the training process, it avoids the additional computational cost of performing k-fold cross-validation or training on a reduced dataset for a validation split.

In summary, OOB samples are the data points left out of each bootstrap sample, and the OOB score leverages these samples to provide an accurate and efficient estimate of the ensemble model's performance without requiring external validation data.

# **Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.**


Feature importance analysis is a crucial aspect of understanding machine learning models, helping to identify which features contribute most to the model's predictions. While both Decision Trees and Random Forests can provide feature importances, there are significant differences in how they are calculated and interpreted due to their underlying structures.

### **Feature Importance in a Single Decision Tree**

**How it's Calculated:**
In a single Decision Tree, feature importance is typically determined by the reduction in impurity (e.g., Gini impurity for classification, Mean Squared Error for regression) that a feature brings when it's used to split a node. The more a feature reduces impurity across all splits it's involved in, the higher its importance score.
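
As a worked example of impurity reduction, the sketch below computes the Gini decrease for a single split at the root of a tree. The helper names `gini` and `impurity_decrease` and the label lists are invented for illustration; in a full tree each node's decrease is additionally weighted by the fraction of samples reaching that node, which equals 1 at the root.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini(left) + len(right) / n * gini(right)
    )
    return gini(parent) - weighted_children

parent = [0, 0, 0, 0, 1, 1, 1, 1]                             # balanced node
weak = impurity_decrease(parent, [0, 0, 0, 1], [0, 1, 1, 1])   # noisy split
perfect = impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1])
print(weak, perfect)  # 0.125 0.5
```

A feature that produces the perfect split earns four times the credit of the noisy one, and a feature's importance is the sum of such credits over every node where it is used.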

**Characteristics:**
*   **Greedy and Local:** The importance scores are very specific to the particular tree structure. A feature might appear very important if it's chosen for a high-level split early in the tree, even if other features might be equally or more important if the tree had been constructed differently.
*   **Instability:** Small changes in the training data can lead to drastically different tree structures and, consequently, different feature importance rankings.
*   **Bias towards High Cardinality Features:** Features with many unique values or continuous features can sometimes be artificially inflated in importance because they offer more potential split points, which might coincidentally lead to large impurity reductions. This doesn't necessarily mean they are genuinely more predictive.
*   **Tree-Specific:** The importance values only reflect the contribution of features within that *specific* decision tree.

### **Feature Importance in a Random Forest**

**How it's Calculated:**
Random Forests, being an ensemble of many Decision Trees, aggregate the feature importances from all individual trees. The most common method is "Mean Decrease in Impurity" (MDI) or "Gini Importance":
1.  For each feature in each individual decision tree, the impurity reduction from that feature is calculated, just like in a single tree.
2.  These impurity reductions are then averaged across all the trees in the forest. The final importance score for a feature is the average of its importance across all trees.

**Characteristics:**
*   **More Robust and Stable:** By averaging over many trees, the feature importances in a Random Forest are much more stable and less prone to the instability caused by small data variations or specific tree structures. The "randomness" introduced by bootstrapping and feature subsampling helps to decorrelate the trees and make the averaged importances more reliable.
*   **Global View:** Random Forests provide a more global assessment of feature importance. If a feature is consistently important across many different trees (even if its position varies), it will receive a high aggregate score.
*   **Reduced Bias (but still present):** While the averaging helps stabilize the scores, Random Forests can still exhibit bias towards high cardinality or continuous features, because every tree relies on the same impurity-based criterion. The ensemble's diversity may dampen this bias but does not remove it.
*   **Better Generalization:** The aggregated importance scores tend to generalize better to unseen data, as they reflect a more comprehensive understanding of feature contributions across various data subsets and tree configurations.

### **Comparison Summary:**

| Feature             | Single Decision Tree                                      | Random Forest                                                   |
| :------------------ | :-------------------------------------------------------- | :-------------------------------------------------------------- |
| **Calculation Basis** | Impurity reduction in a single tree                       | Average impurity reduction across many trees                      |
| **Stability**       | Low; sensitive to data variations and tree structure      | High; robust due to averaging over an ensemble                  |
| **Bias**            | Prone to bias towards high cardinality/continuous features | Less prone to bias, but can still exist                         |
| **Interpretation**  | Local to the specific tree                                | Global, more reliable, and generalizable                        |
| **Reliability**     | Lower, can be misleading                                  | Higher, better indicator of true predictive power               |

In essence, while a single Decision Tree gives you a snapshot of how features were utilized in one specific model, a Random Forest provides a more generalized, robust, and reliable measure of a feature's overall importance across various model configurations, making it generally preferred for feature importance analysis.

# **Question 6: Write a Python program to: ● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer() ● Train a Random Forest Classifier ● Print the top 5 most important features based on feature importance scores.**


In [2]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# 1. Load the Breast Cancer dataset
bcs = load_breast_cancer()
X = pd.DataFrame(bcs.data, columns=bcs.feature_names)
y = bcs.target

print("Dataset loaded successfully. Shape of features (X):", X.shape)
print("Shape of target (y):", y.shape)

# 2. Train a Random Forest Classifier
# Using default parameters for simplicity, but in a real scenario, hyperparameter tuning would be beneficial.
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X, y)

print("\nRandom Forest Classifier trained successfully.")

# 3. Print the top 5 most important features based on feature importance scores
feature_importances = pd.Series(rf_classifier.feature_importances_, index=X.columns)

# Sort features by importance in descending order and get the top 5
top_5_features = feature_importances.nlargest(5)

print("\nTop 5 most important features:")
print(top_5_features)

Dataset loaded successfully. Shape of features (X): (569, 30)
Shape of target (y): (569,)

Random Forest Classifier trained successfully.

Top 5 most important features:
worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


# **Question 7: Write a Python program to: ● Train a Bagging Classifier using Decision Trees on the Iris dataset ● Evaluate its accuracy and compare with a single Decision Tree**


In [4]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

print("Iris dataset loaded successfully. Shape of features (X):", X.shape)
print("Shape of target (y):", y.shape)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print(f"\nTraining set size: {len(X_train)} samples")
print(f"Testing set size: {len(X_test)} samples")

# 2. Train a single Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)

# Make predictions and evaluate accuracy for the single Decision Tree
y_pred_dt = dt_classifier.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
print(f"\nAccuracy of a single Decision Tree: {accuracy_dt:.4f}")

# 3. Train a Bagging Classifier using Decision Trees
# The base estimator is a DecisionTreeClassifier, passed via `estimator` (scikit-learn >= 1.2; older versions used `base_estimator`)
bag_classifier = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42), # Base estimator
    n_estimators=100, # Number of base estimators
    max_samples=1.0, # Draw as many samples as the training set contains for each base estimator (with replacement, see bootstrap)
    max_features=1.0, # Use all features for each base estimator
    bootstrap=True, # Sample with replacement
    random_state=42
)
bag_classifier.fit(X_train, y_train)

# Make predictions and evaluate accuracy for the Bagging Classifier
y_pred_bag = bag_classifier.predict(X_test)
accuracy_bag = accuracy_score(y_test, y_pred_bag)
print(f"Accuracy of Bagging Classifier (100 Decision Trees): {accuracy_bag:.4f}")

# 4. Compare accuracies
print("\n--- Comparison ---")
if accuracy_bag > accuracy_dt:
    print("The Bagging Classifier performed better than the single Decision Tree.")
elif accuracy_bag < accuracy_dt:
    print("The single Decision Tree performed better than the Bagging Classifier.")
else:
    print("Both models achieved the same accuracy.")


Iris dataset loaded successfully. Shape of features (X): (150, 4)
Shape of target (y): (150,)

Training set size: 105 samples
Testing set size: 45 samples

Accuracy of a single Decision Tree: 0.9333
Accuracy of Bagging Classifier (100 Decision Trees): 0.9333

--- Comparison ---
Both models achieved the same accuracy.


# **Question 8: Write a Python program to: ● Train a Random Forest Classifier ● Tune hyperparameters max_depth and n_estimators using GridSearchCV ● Print the best parameters and final accuracy.**

In [7]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset (reusing from previous question, but good to include for standalone execution)
iris = load_iris()
X = iris.data
y = iris.target

print("Iris dataset loaded successfully. Shape of features (X):", X.shape)
print("Shape of target (y):", y.shape)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print(f"\nTraining set size: {len(X_train)} samples")
print(f"Testing set size: {len(X_test)} samples")

# 2. Define the Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

# Define the parameter grid for GridSearchCV: max_depth and n_estimators as the question asks, plus two optional extra parameters
param_grid = {
    'n_estimators': [50, 100, 150],  # Number of trees in the forest
    'max_depth': [None, 10, 20],     # Maximum depth of the tree
    'min_samples_split': [2, 5],     # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2]       # Minimum number of samples required to be at a leaf node
}

# 3. Tune hyperparameters using GridSearchCV
print("\nStarting GridSearchCV for hyperparameter tuning...")
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=1, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("GridSearchCV completed.")

# 4. Print the best parameters and best score
print("\nBest parameters found:")
print(grid_search.best_params_)
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")

# Get the best model from GridSearchCV
best_rf_model = grid_search.best_estimator_

# Make predictions on the test set with the best model
y_pred = best_rf_model.predict(X_test)

# Calculate the final accuracy on the test set
final_accuracy = accuracy_score(y_test, y_pred)
print(f"Final accuracy on the test set with best parameters: {final_accuracy:.4f}")


Iris dataset loaded successfully. Shape of features (X): (150, 4)
Shape of target (y): (150,)

Training set size: 105 samples
Testing set size: 45 samples

Starting GridSearchCV for hyperparameter tuning...
Fitting 5 folds for each of 36 candidates, totalling 180 fits
GridSearchCV completed.

Best parameters found:
{'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 150}
Best cross-validation accuracy: 0.9619
Final accuracy on the test set with best parameters: 0.9111


# **Question 9: Write a Python program to: ● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset ● Compare their Mean Squared Errors (MSE).**

In [9]:
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor # Default base estimator of BaggingRegressor; passed explicitly below
from sklearn.metrics import mean_squared_error

# 1. Load the California Housing dataset
california_housing = fetch_california_housing()
X = california_housing.data
y = california_housing.target

print("California Housing dataset loaded successfully. Shape of features (X):", X.shape)
print("Shape of target (y):", y.shape)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"\nTraining set size: {len(X_train)} samples")
print(f"Testing set size: {len(X_test)} samples")

# 2. Train a Bagging Regressor
# Using DecisionTreeRegressor as the base estimator, which is the default
bag_regressor = BaggingRegressor(
    estimator=DecisionTreeRegressor(random_state=42),
    n_estimators=100, # Number of base estimators
    random_state=42,
    n_jobs=-1 # Use all available cores
)
bag_regressor.fit(X_train, y_train)

# Make predictions and calculate MSE for Bagging Regressor
y_pred_bag = bag_regressor.predict(X_test)
mse_bag = mean_squared_error(y_test, y_pred_bag)
print(f"\nBagging Regressor trained successfully. MSE: {mse_bag:.4f}")

# 3. Train a Random Forest Regressor
rf_regressor = RandomForestRegressor(
    n_estimators=100, # Number of trees in the forest
    random_state=42,
    n_jobs=-1 # Use all available cores
)
rf_regressor.fit(X_train, y_train)

# Make predictions and calculate MSE for Random Forest Regressor
y_pred_rf = rf_regressor.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)
print(f"Random Forest Regressor trained successfully. MSE: {mse_rf:.4f}")

# 4. Compare their Mean Squared Errors (MSE)
print("\n--- Comparison of MSE ---")
print(f"Bagging Regressor MSE:    {mse_bag:.4f}")
print(f"Random Forest Regressor MSE: {mse_rf:.4f}")

if mse_rf < mse_bag:
    print("The Random Forest Regressor achieved a lower (better) MSE.")
elif mse_bag < mse_rf:
    print("The Bagging Regressor achieved a lower (better) MSE.")
else:
    print("Both regressors achieved the same MSE.")


California Housing dataset loaded successfully. Shape of features (X): (20640, 8)
Shape of target (y): (20640,)

Training set size: 16512 samples
Testing set size: 4128 samples

Bagging Regressor trained successfully. MSE: 0.2559
Random Forest Regressor trained successfully. MSE: 0.2554

--- Comparison of MSE ---
Bagging Regressor MSE:    0.2559
Random Forest Regressor MSE: 0.2554
The Random Forest Regressor achieved a lower (better) MSE.


# **Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance. Explain your step-by-step approach to: ● Choose between Bagging or Boosting ● Handle overfitting ● Select base models ● Evaluate performance using cross-validation ● Justify how ensemble learning improves decision-making in this real-world context.**



## **Step-by-Step Approach to Using Ensemble Techniques for Loan Default Prediction**

### **1. Choosing Between Bagging and Boosting**

* **Bagging (e.g., Random Forest)** is preferred when the base model has **high variance** and you want to improve stability and reduce overfitting by averaging many independent models.
* **Boosting (e.g., XGBoost, LightGBM)** is preferred when the model suffers from **high bias**, and you want to sequentially reduce errors by giving more weight to misclassified samples.
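* **In loan-default prediction**, boosting often performs better in practice because its sequential error correction helps it capture complex non-linear relationships in customer and transaction data. This should still be confirmed by comparing both approaches under cross-validation.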

---

### **2. Handling Overfitting**

* Use **cross-validation** to detect overfitting and avoid overly optimistic performance estimates.
* Apply **regularization techniques** (L1/L2 penalties and shrinkage, i.e. a smaller learning rate).
* Limit model complexity (tree depth, number of estimators, minimum samples per leaf).
* Use **early stopping** based on validation performance.
* Use **subsampling** (row/column sampling in boosting) to increase generalization.
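
Early stopping in particular is easy to express in code. The sketch below is library-independent and assumes a per-round validation loss has already been recorded; the function name `best_round_with_patience` and the loss values are hypothetical, chosen only to show the mechanics.

```python
def best_round_with_patience(val_losses, patience):
    """Return the 0-based boosting round to keep.

    Training stops once the validation loss has gone `patience` rounds
    without beating the best value seen so far.
    """
    best_idx, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = i, loss      # new best round
        elif i - best_idx >= patience:
            break                              # no improvement for too long
    return best_idx

# Hypothetical validation losses, one per boosting round.
losses = [0.60, 0.52, 0.47, 0.45, 0.46, 0.46, 0.48, 0.44]
print(best_round_with_patience(losses, patience=3))  # 3
print(best_round_with_patience(losses, patience=5))  # 7
```

A small `patience` stops sooner and saves computation but can miss a late improvement, as the second call shows; the right value is a tuning choice.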

---

### **3. Selecting Base Models**

* Choose **diverse and complementary models**:

  * **Decision trees** (high-variance, good for bagging).
  * **Shallow trees / weak learners** (for boosting).
  * **Logistic Regression** for interpretability and calibration.
* In practice, **Random Forest** for bagging and **Gradient Boosted Trees** for boosting are usually strong choices for structured financial data.

---

### **4. Evaluating Performance Using Cross-Validation**

* Use **Stratified k-Fold Cross-Validation** to maintain the default/non-default class ratio in each fold.
* Tune hyperparameters inside the CV loop to avoid leakage.
* Evaluate using metrics relevant to imbalanced data:

  * **AUC-ROC**, **Precision-Recall AUC**, **F1**, or **Top-N% Precision**.
* If the data is time-dependent, use **time-series CV** to avoid using future data for training.
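
To show what stratification does, here is a from-scratch sketch that deals row indices into folds class by class. It is meant only to illustrate the idea; in practice scikit-learn's `StratifiedKFold` would be the usual choice. The function name `stratified_folds` and the 10% default rate are assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Split row indices into k folds that keep roughly the overall class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Hypothetical loan data: 50 defaults (1) among 500 customers.
labels = [1] * 50 + [0] * 450
folds = stratified_folds(labels, k=5)
for fold in folds:
    default_rate = sum(labels[i] for i in fold) / len(fold)
    print(len(fold), default_rate)
```

Every fold ends up with the same 10% default rate as the full dataset, so no fold is evaluated on too few defaults to give a meaningful score.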

---

### **5. Justification: Why Ensemble Learning Improves Decision-Making**

* **Higher predictive accuracy** leads to better identification of high-risk borrowers.
* **Reduced variance** (bagging) gives more stable and reliable predictions across different samples.
* **Reduced bias** (boosting) captures subtle patterns in transaction behaviour.
* **Usable risk scores**: once their calibration has been checked (and corrected if necessary), ensemble probabilities provide more accurate risk scores for credit decisions and pricing.
* **Lower financial risk**: improved classification reduces default losses and supports smarter loan approval thresholds.

