# Ensemble Learning | Assignment

# Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.

Ensemble Learning in **machine learning** is a technique where multiple models (often called *base learners*, or *weak learners* in the boosting setting) are trained and combined to solve the same problem. Instead of depending on a single model, ensemble methods combine the predictions of several models to achieve better performance.

### **Key Idea Behind Ensemble Learning**

The central idea is that:

 *A group of models, when combined properly, can perform better than any individual model alone.*

This works because:

* Different models may make different errors.
* By aggregating their predictions (through voting, averaging, or weighting), the errors can cancel out.
* The ensemble is usually more accurate and more stable than its individual members, and averaging-based ensembles such as bagging are also less prone to overfitting.
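
The error-cancelling argument above can be illustrated with a small simulation. This is only a sketch under a strong assumption: the models are treated as fully independent, each correct with the same probability (0.7 here, a made-up value). Real models are correlated, so the gain in practice is smaller.

```python
import random

def simulate_majority_vote(n_models=11, p_correct=0.7, n_trials=10_000, seed=0):
    """Estimate the accuracy of a majority vote over independent binary classifiers."""
    rng = random.Random(seed)
    ensemble_hits = 0
    for _ in range(n_trials):
        # Each model is independently correct with probability p_correct.
        correct_votes = sum(rng.random() < p_correct for _ in range(n_models))
        # The ensemble is right when more than half of the models are right.
        if correct_votes > n_models / 2:
            ensemble_hits += 1
    return ensemble_hits / n_trials

single_accuracy = 0.7
ensemble_accuracy = simulate_majority_vote(p_correct=single_accuracy)
print(f"Single model accuracy:  {single_accuracy:.3f}")
print(f"Majority-vote accuracy: {ensemble_accuracy:.3f}")
```

Under these idealized assumptions the exact binomial value for 11 voters is roughly 0.92, well above the 0.7 of any single model.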

### **Main Types of Ensemble Methods**

1. **Bagging (Bootstrap Aggregating):**

   * Trains many models in parallel on bootstrap samples (random subsets of the data drawn with replacement).
   * Final prediction is made by majority voting (classification) or averaging (regression).
   * Example: **Random Forest**.

2. **Boosting:**

   * Trains models sequentially. Each new model focuses on the mistakes of the previous ones.
   * Example: **AdaBoost, XGBoost, LightGBM**.

3. **Stacking:**

   * Combines predictions of multiple models using another model (called a *meta-learner*) to learn the best way to blend them.

---

# Question 2: What is the difference between Bagging and Boosting?

## **Bagging vs Boosting**

### **1. Bagging (Bootstrap Aggregating)**

* **Training style:** Models are trained **in parallel** on different random subsets of the training data (using sampling with replacement).
* **Goal:** Reduce **variance** (helps prevent overfitting).
* **How it works:** Each model votes/averages its prediction → the ensemble output is more stable.
* **Example:** **Random Forest**.

### **2. Boosting**

* **Training style:** Models are trained **sequentially**. Each new model focuses on the mistakes made by the previous models.
* **Goal:** Reduce **bias** (helps improve accuracy on hard-to-predict cases).
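* **How it works:** Each round concentrates on the examples the current ensemble gets wrong: AdaBoost increases the weights of misclassified examples, while gradient boosting fits the next model to the remaining errors (residuals). The final prediction is a weighted combination of all models.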
* **Example:** **AdaBoost, Gradient Boosting, XGBoost**.
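
To make the reweighting idea concrete, here is a minimal sketch of a single AdaBoost-style round for binary classification. The five data points and the weak learner's mistakes are hypothetical, and this shows only the weight update, not a full training loop.

```python
import math

# Hypothetical toy round: 5 training points with equal starting weights.
weights = [0.2] * 5
# True if the current weak learner classified the point correctly.
is_correct = [True, True, False, True, True]

# Weighted error of the weak learner.
error = sum(w for w, ok in zip(weights, is_correct) if not ok)
# Model weight (its "say" in the final vote): larger when the error is smaller.
alpha = 0.5 * math.log((1 - error) / error)

# Increase weights of misclassified points, decrease the others, then renormalize.
updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, is_correct)]
total = sum(updated)
updated = [w / total for w in updated]

print(f"error={error:.2f}, alpha={alpha:.3f}")
print("updated weights:", [round(w, 3) for w in updated])
```

After the update, the single misclassified point carries half of the total weight, so the next weak learner is pushed to get it right.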

---

### **Key Differences (Summary Table)**

| Aspect            | Bagging 🧺                      | Boosting 🚀                         |
| ----------------- | ------------------------------- | ----------------------------------- |
| Training style    | Parallel                        | Sequential                          |
| Focus             | Reduce **variance**             | Reduce **bias**                     |
| Data sampling     | Random subsets with replacement | Weighted sampling (focus on errors) |
| Model weight      | All models contribute equally   | Weighted by model performance       |
| Example algorithm | Random Forest                   | AdaBoost, XGBoost                   |

---

# Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

## **Bootstrap Sampling**

* **Definition:** Bootstrap sampling is a statistical technique where we create new datasets (called *bootstrap samples*) by **randomly selecting data points from the original dataset *with replacement***.
* **With replacement** means: the same data point can appear multiple times in a sample, while some points may not appear at all.
* Each bootstrap sample is usually the same size as the original dataset.
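
A minimal sketch of drawing one bootstrap sample with the standard library. The list of ten integers is just a stand-in for row indices of a real training set.

```python
import random

rng = random.Random(42)
original = list(range(10))  # stand-in for 10 training rows

# Draw a bootstrap sample: same size as the original, sampled with replacement.
bootstrap = rng.choices(original, k=len(original))
# Rows that were never drawn for this particular sample.
left_out = sorted(set(original) - set(bootstrap))

print("bootstrap sample:", bootstrap)
print("rows left out:   ", left_out)
```

Some rows typically appear more than once in `bootstrap`, while others do not appear at all.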

## **Role in Bagging (e.g., Random Forest)**

Bagging = **Bootstrap Aggregating** → the name itself comes from bootstrap sampling!

Here’s why it’s important:

1. **Diversity of Models:**

   * Each model (e.g., decision tree) is trained on a different bootstrap sample.
   * This makes each model slightly different, even though they’re trained on the same overall dataset.

2. **Reduced Variance:**

   * Individual models might overfit their training data.
   * But by training many models on varied bootstrap samples and averaging/voting their results, bagging reduces overfitting and variance.

3. **Random Forest Specific:**

   * In Random Forest, bootstrap sampling is used to build each decision tree.
   * Additionally, Random Forest adds **feature randomness**: at each tree split, only a random subset of features is considered.
   * This double randomness (data + features) makes the trees less correlated, improving ensemble performance.

---

# Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

## **Out-of-Bag (OOB) Samples**

When we use **bootstrap sampling** in Bagging (like in Random Forest):

* Each bootstrap sample is drawn **with replacement** from the original dataset.
* On average, about **63% of the distinct training points** end up in a given bootstrap sample.
* The remaining **~37% of data points** that were **not chosen** are called **Out-of-Bag (OOB) samples**.

👉 So, OOB samples are simply the data points **left out** when creating a bootstrap sample for training a particular model.
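
The 63% / 37% split follows from a short calculation, assuming a bootstrap sample of size $n$ is drawn uniformly with replacement from $n$ points:

$$
P(\text{point } i \text{ is never drawn}) = \left(1 - \frac{1}{n}\right)^{n} \xrightarrow{\,n \to \infty\,} e^{-1} \approx 0.368
$$

So each point has roughly a $1 - 0.368 = 0.632$ chance of appearing in a given bootstrap sample.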


## **OOB Score**

The **OOB score** is a way to estimate the model’s performance **without needing a separate validation set**.

**How it works:**

1. Train each base model (e.g., each decision tree in Random Forest) on its bootstrap sample.
2. For each data point in the dataset, look at the models where this point was **OOB** (i.e., not used for training).
3. Use those models to predict the label for that data point.
4. Compare predictions with the true labels → compute accuracy (or other metric).

This gives the **OOB score**, which is an **internal cross-validation estimate** of the ensemble’s generalization performance.
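
In scikit-learn, the steps above are available through the `oob_score=True` option of `RandomForestClassifier`. The sketch below assumes the Breast Cancer dataset and 200 trees purely for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each sample using only the trees
# that did not see it during training (bootstrap=True is the default).
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

print(f"OOB accuracy estimate: {rf.oob_score_:.4f}")
```

No train/test split is needed here, because every prediction that goes into `oob_score_` comes from trees that never trained on that sample.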


## **Why OOB Score is Useful**

* **No need for a separate validation set** → we can use all data for training.
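* Provides an **approximately unbiased estimate** of test error (similar to cross-validation), though it can be slightly pessimistic because each point is scored by only a subset of the trees.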
* Especially handy in Random Forests, where OOB scoring is commonly used as a quick measure of model accuracy.

---

# Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

## **1. Feature Importance in a Single Decision Tree**

* **How it’s measured:**
  A tree decides splits based on some *impurity measure* (like Gini impurity, entropy for classification, or variance for regression).

  * Each time a feature is used to split the data, the impurity reduction is recorded.
  * The importance of a feature = sum of all impurity reductions from splits using that feature.
  * Finally, the values are normalized (so they add up to 1).

* **Pros:**

  * Easy to compute and interpret.
  * Shows how influential a feature was in building that tree.

* **Cons:**

  * **Unstable:** A single decision tree can vary a lot with small data changes.
  * May give **biased importance**: impurity-based scores tend to favor features with many distinct values (high-cardinality categorical or continuous features).
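
The impurity reduction described under "How it's measured" can be computed by hand for one split. This sketch uses a hypothetical node of 10 samples. Note that scikit-learn additionally weights each reduction by the share of training samples that reach the node before summing per feature.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# Hypothetical node with 10 samples, split into two children by some feature.
parent = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
left = [0, 0, 0, 0, 1]
right = [0, 1, 1, 1, 1]

# Impurity reduction = parent impurity minus the size-weighted child impurities.
weighted_children = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
reduction = gini(parent) - weighted_children

print(f"parent gini = {gini(parent):.3f}")
print(f"reduction   = {reduction:.3f}")
```

Here the parent impurity is 0.5 and each child has impurity 0.32, so this split contributes a reduction of 0.18 to the feature that was used.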

## **2. Feature Importance in a Random Forest**

* **How it’s measured:**

  * Each tree in the forest gives its own feature importance (same impurity reduction method).
  * Random Forest averages these importances across all trees → producing a more robust score.

* **Alternative method (Permutation Importance):**

  * Random Forests can also measure feature importance by randomly shuffling a feature and checking how much model accuracy drops.
  * If accuracy falls a lot, that feature was important.

* **Pros:**

  * Much **more stable** and reliable because it averages across many trees.
  * Captures importance even if features interact in complex ways.
  * Permutation importance avoids the impurity-based bias toward high-cardinality features, especially when computed on held-out data.

* **Cons:**

  * Harder to interpret than a single tree.
  * Importance may be “diluted” across correlated features (they share credit).
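
Permutation importance, as described above, is available in scikit-learn as `sklearn.inspection.permutation_importance`. The sketch below assumes the Breast Cancer dataset, a 70/30 split, and 10 shuffles per feature. These are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42, stratify=data.target
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Shuffle each feature column n_repeats times on held-out data and
# record the mean drop in accuracy caused by the shuffling.
result = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=42, scoring="accuracy"
)

# Indices of the five features with the largest mean accuracy drop.
top5_idx = result.importances_mean.argsort()[::-1][:5]
for i in top5_idx:
    print(f"{data.feature_names[i]:<25} {result.importances_mean[i]:.4f}")
```

Because many features in this dataset are strongly correlated, individual permutation scores can look small: shuffling one feature leaves its correlated partners intact.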


## **Quick Comparison Table**

| Aspect           | Single Decision Tree 🌳                        | Random Forest 🌲🌲🌲                           |
| ---------------- | ---------------------------------------------- | ---------------------------------------------- |
| Basis            | Impurity reduction in one tree                 | Average importance across many trees           |
| Stability        | Unstable, sensitive to data                    | Stable, robust                                 |
| Bias             | Can be biased toward high-cardinality features | Reduced bias, esp. with permutation importance |
| Interpretability | Simple, easy to explain                        | More complex, aggregated importance            |
| Use case         | Small/simple problems                          | Large, complex datasets                        |

---

# Question 6: Write a Python program to:

● Load the Breast Cancer dataset using
`sklearn.datasets.load_breast_cancer()`

● Train a Random Forest Classifier

● Print the top 5 most important features based on feature importance scores.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# 1. Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# 2. Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# 3. Get feature importances
importances = rf.feature_importances_

# 4. Create a DataFrame for better visualization
feat_importances = pd.DataFrame({
    "Feature": feature_names,
    "Importance": importances
})

# 5. Sort and select top 5
top5 = feat_importances.sort_values(by="Importance", ascending=False).head(5)

# Print results
print("Top 5 Important Features:\n")
print(top5.to_string(index=False))
```

```text
Top 5 Important Features:

             Feature  Importance
          worst area    0.139357
worst concave points    0.132225
 mean concave points    0.107046
        worst radius    0.082848
     worst perimeter    0.080850
```


# Question 7: Write a Python program to:

● Train a Bagging Classifier using Decision Trees on the Iris dataset

● Evaluate its accuracy and compare with a single Decision Tree

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import sklearn
import sys

# Print versions for debugging
print(f"Python version: {sys.version}")
print(f"Scikit-learn version: {sklearn.__version__}")

# 1. Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 3. Train a single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
acc_dt = accuracy_score(y_test, y_pred_dt)

# 4. Train a Bagging Classifier with Decision Trees
bagging = BaggingClassifier(
    # `estimator` replaced the older `base_estimator` argument in scikit-learn 1.2
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
acc_bag = accuracy_score(y_test, y_pred_bag)

# 5. Print results
print("Accuracy of Single Decision Tree: {:.4f}".format(acc_dt))
print("Accuracy of Bagging Classifier  : {:.4f}".format(acc_bag))
```

```text
Python version: 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]
Scikit-learn version: 1.6.1
Accuracy of Single Decision Tree: 0.9333
Accuracy of Bagging Classifier  : 0.9333
```


# Question 8: Write a Python program to:

● Train a Random Forest Classifier

● Tune hyperparameters `max_depth` and `n_estimators` using GridSearchCV

● Print the best parameters and final accuracy

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# 1. Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# 2. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 3. Define Random Forest and hyperparameter grid
rf = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': [50, 100, 200],  # number of trees
    'max_depth': [None, 5, 10, 20]   # depth of trees
}

# 4. GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,              # 5-fold cross validation
    n_jobs=-1,         # use all CPU cores
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)

# 5. Best parameters and accuracy
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test)
final_acc = accuracy_score(y_test, y_pred)

print("Best Parameters:", best_params)
print("Final Accuracy on Test Set: {:.4f}".format(final_acc))
```

```text
Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy on Test Set: 0.9357
```


# Question 9: Write a Python program to:

● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset

● Compare their Mean Squared Errors (MSE)

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# 1. Load California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# 2. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Train Bagging Regressor with Decision Trees
bagging = BaggingRegressor(
    # `estimator` replaced the older `base_estimator` argument in scikit-learn 1.2
    estimator=DecisionTreeRegressor(),
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
mse_bag = mean_squared_error(y_test, y_pred_bag)

# 4. Train Random Forest Regressor
rf = RandomForestRegressor(
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# 5. Print results
print("Mean Squared Error (Bagging Regressor): {:.4f}".format(mse_bag))
print("Mean Squared Error (Random Forest Regressor): {:.4f}".format(mse_rf))
```

```text
Mean Squared Error (Bagging Regressor): 0.2568
Mean Squared Error (Random Forest Regressor): 0.2565
```


# Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.

You decide to use ensemble techniques to increase model performance.

Explain your step-by-step approach to:

  ● Choose between Bagging or Boosting

  ● Handle overfitting

  ● Select base models

  ● Evaluate performance using cross-validation

  ● Justify how ensemble learning improves decision-making in this real-world context.

## **Step 1: Choose Between Bagging or Boosting**

**Factors to consider:**

* **Data size:** Large datasets can benefit from **Bagging**, as it trains many models in parallel efficiently.
* **Bias vs Variance:**

  * **Bagging** reduces variance → useful if individual models (like Decision Trees) tend to overfit.
  * **Boosting** reduces bias → useful if single models underfit and can learn sequentially from mistakes.
* **Noise sensitivity:** Boosting can be sensitive to noisy data, which can lead to overfitting in financial datasets.

**Decision:**

* Start with **Bagging** (e.g., Random Forest) to get a robust baseline.
* If underfitting is detected, try **Boosting** (e.g., XGBoost, LightGBM) to improve predictive power.

## **Step 2: Handle Overfitting**

**Techniques:**

1. **Limit tree depth** (`max_depth`) and **minimum samples per leaf** to prevent overly complex trees.
2. **Use ensemble averaging**: Bagging naturally reduces overfitting by averaging predictions.
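3. **Regularization in Boosting:** In XGBoost, a smaller `learning_rate`, a limited `n_estimators` (ideally chosen with early stopping), and row/column subsampling (`subsample`, `colsample_bytree`) help control overfitting.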
4. **Cross-validation**: Monitor performance on validation sets to detect overfitting early.
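
A minimal sketch of these controls, under two stated assumptions: the data is synthetic (generated with `make_classification`, not real loan records), and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so the example needs no extra dependency. The specific parameter values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic stand-in for loan data (hypothetical; not real customer records).
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Bagging-style model: cap tree complexity via depth and leaf size.
rf = RandomForestClassifier(
    n_estimators=200, max_depth=8, min_samples_leaf=5, random_state=42
).fit(X, y)

# Boosting model: small learning rate plus early stopping on an internal
# validation split, so training halts once the validation score stalls.
gb = GradientBoostingClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=3,
    validation_fraction=0.2,
    n_iter_no_change=10,
    random_state=42,
).fit(X, y)

print("Boosting rounds actually used:", gb.n_estimators_)
```

With early stopping enabled, `n_estimators` acts as an upper bound and `n_estimators_` reports how many rounds were actually fitted.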

## **Step 3: Select Base Models**

* Common base models for ensemble techniques:

  * **Decision Trees** → widely used for Bagging and Boosting.
  * **Logistic Regression** → can be used in stacking ensembles for interpretability.
  * **Other weak learners** like small neural networks or SVMs if boosting/stacking is used.

**Financial context tip:**

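* Tree-based models are often preferred because individual decision trees are easy to explain, which is important for **regulatory compliance** in financial institutions; an ensemble of many trees is harder to interpret and typically relies on feature importance scores instead.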

## **Step 4: Evaluate Performance Using Cross-Validation**

1. Split data into **k folds** (e.g., 5 or 10).
2. Train ensemble models on **k-1 folds** and validate on the remaining fold.
3. Metrics to monitor:

   * **ROC-AUC** → captures model ability to distinguish defaulters vs non-defaulters.
   * **Precision/Recall** → especially if default cases are rare (imbalanced data).
   * **F1-Score** → balances precision and recall.
4. Average metrics across folds for a reliable performance estimate.

**Optional:** Use **Stratified K-Fold** to ensure class proportions are maintained in each fold.
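
The evaluation loop above could be sketched with scikit-learn's `cross_validate` and `StratifiedKFold`. The dataset is synthetic, with roughly 10% positives standing in for defaults, and the Random Forest settings are placeholders rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data: roughly 10% positives standing in for defaults.
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=42
)

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Score every fold on several metrics at once.
scores = cross_validate(
    model, X, y, cv=cv, scoring=["roc_auc", "precision", "recall", "f1"]
)

for metric in ["roc_auc", "precision", "recall", "f1"]:
    fold_scores = scores[f"test_{metric}"]
    print(f"{metric:<10} mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```

Reporting the standard deviation alongside the mean shows how much the estimate varies from fold to fold, which matters when the positive class is small.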

## **Step 5: Justify How Ensemble Learning Improves Decision-Making**

1. **Improved Accuracy:**

   * Combining multiple models can reduce variance (bagging) or bias (boosting), leading to more reliable predictions of defaults.

2. **Better Risk Assessment:**

   * Ensemble models are more stable, reducing the likelihood of misclassifying high-risk customers.

3. **Robustness to Noisy Data:**

   * Bagging mitigates the effect of outliers in financial transactions.

4. **Interpretability (with feature importance):**

   * Even ensemble models like Random Forest provide feature importance scores to identify key predictors of default.

5. **Regulatory and Business Confidence:**

   * Financial institutions can explain decisions to stakeholders using ensemble-derived insights while maintaining high predictive performance.

### **Step 6 (Optional Advanced Step): Stacking for Maximum Performance**

* Combine multiple ensembles (e.g., Random Forest + XGBoost + Logistic Regression) with a **meta-model** to capture complementary strengths.
* Helps improve predictions when different models capture different patterns in transaction history or demographics.
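
A possible shape for such a stack, using scikit-learn's `StackingClassifier`. As assumptions for this sketch, the data is synthetic and `GradientBoostingClassifier` is swapped in for XGBoost to avoid an extra dependency. Logistic Regression serves as the meta-model here rather than as a base model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for loan data (hypothetical; not real customer records).
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("gb", GradientBoostingClassifier(random_state=42)),
    ],
    # The meta-model learns how to blend the base models' predictions.
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # base-model predictions for the meta-model come from 5-fold CV
)
stack.fit(X_train, y_train)
stack_accuracy = stack.score(X_test, y_test)
print(f"Stacked test accuracy: {stack_accuracy:.4f}")
```

The internal cross-validation means the meta-model is trained on out-of-fold predictions, which reduces the risk of it simply trusting whichever base model overfits the most.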
