# **Ensemble Learning | Assignment**

**Question 1:** What is Ensemble Learning in machine learning? Explain the key idea behind it.

**Answer:**

Ensemble Learning in machine learning is a technique where multiple models (often called "learners" or "base models") are combined to solve the same problem with the goal of achieving better performance than any single model could on its own.

**Key Idea Behind Ensemble Learning:**

"A group of weak learners can come together to form a strong learner."

This principle is based on the idea that diverse models, when combined in a smart way, can correct each other’s errors, leading to improved accuracy, robustness, and generalization.
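The effect of combining models can be illustrated with a simplified calculation. The sketch below assumes binary classification, that every base model is correct with the same probability `p`, and that their errors are fully independent (an idealization that real ensembles only approximate). Under those assumptions the accuracy of a majority vote follows from the binomial distribution. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p: float, n_models: int) -> float:
    """Probability that a strict majority of n_models independent voters is correct.

    Assumes binary classification, an odd n_models (so there are no ties),
    and that each model is correct independently with probability p.
    """
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    needed = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


# Accuracy of the vote for ensembles of increasing size, each model 60% accurate.
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(0.6, n), 4))
```

With `p` above 0.5 the vote improves as models are added; with `p` below 0.5 it gets worse, which is why base learners need to be at least slightly better than chance.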

**Why Use Ensemble Learning?**
* Reduces overfitting (especially if models make uncorrelated errors)
* Improves accuracy
* Increases robustness to noisy data or outliers

**Common Types of Ensemble Methods:**
1. Bagging (Bootstrap Aggregating)
  * Trains multiple models independently on bootstrap samples of the data (random samples drawn with replacement).
  * Example: Random Forest
  * Goal: Reduce variance

2. Boosting
  * Trains models sequentially, where each new model focuses on fixing the errors of the previous ones.
  * Example: AdaBoost, Gradient Boosting, XGBoost
  * Goal: Primarily reduce bias (and sometimes variance too)

3. Stacking (Stacked Generalization)
  * Combines the predictions of multiple models using a meta-model, which learns to best combine their outputs.
  * Example: Blend logistic regression, decision trees, and SVM with a meta-learner on top.



**Question 2:** What is the difference between Bagging and Boosting?

**Answer:**

| Feature                  | **Bagging**                                                                           | **Boosting**                                                              |
| ------------------------ | ------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
| **Full Name**            | Bootstrap Aggregating                                                                 | Just Boosting (e.g., AdaBoost, XGBoost)                                   |
| **Main Goal**            | Reduce **variance**                                                                   | Reduce **bias** (and sometimes variance too)                              |
| **How Models Are Built** | Models are trained **independently** in parallel                                      | Models are trained **sequentially**, each correcting the previous         |
| **Data Sampling**        | Uses **bootstrapped samples** (random sampling **with replacement**)                  | Typically uses the **entire dataset**, reweighting misclassified points (AdaBoost) or fitting the remaining errors (gradient boosting) |
| **Model Focus**          | All models have **equal weight**                                                      | More focus is placed on **hard-to-classify** instances                    |
| **Combination Method**   | Typically by **averaging** (for regression) or **majority vote** (for classification) | Typically by **weighted voting** or **weighted sum**                      |
| **Overfitting Tendency** | Less prone to overfitting (especially with many trees)                                | More prone to overfitting if not tuned properly                           |
| **Example Algorithms**   | Random Forest                                                                         | AdaBoost, Gradient Boosting, XGBoost, LightGBM                            |


**Bagging:** "Let’s train multiple learners on different random subsets of data and combine their outputs to smooth out their individual errors."

**Boosting:** "Let’s train one model at a time, each one learning from the mistakes of the previous, so that the overall system becomes smarter with each step."
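To make "learning from the mistakes of the previous" concrete, here is a minimal sketch of one reweighting round in the style of classic discrete AdaBoost for binary labels. It is an illustration under that assumption, not how every boosting library works (gradient boosting, for instance, fits new models to residual errors instead of reweighting samples). The helper name `adaboost_reweight` and the toy inputs are hypothetical.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One AdaBoost-style weight update for binary classification.

    weights: current sample weights (summing to 1).
    misclassified: booleans, True where the current weak learner was wrong.
    Returns (alpha, new_weights), where alpha is the learner's vote weight.
    """
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    if not 0 < err < 0.5:
        raise ValueError("weighted error must be in (0, 0.5) for this sketch")
    alpha = 0.5 * math.log((1 - err) / err)
    # Up-weight mistakes, down-weight correct points, then renormalize.
    raw = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(raw)
    return alpha, [w / total for w in raw]


weights = [0.1] * 10                          # uniform start over 10 samples
misclassified = [True, True] + [False] * 8    # the learner got 2 of 10 wrong
alpha, new_weights = adaboost_reweight(weights, misclassified)
print(round(alpha, 3), [round(w, 4) for w in new_weights])
```

After the update the two misclassified points together carry half of the total weight, so the next learner is pushed to get them right.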

**Question 3:** What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

**Answer:**


Bootstrap sampling is a statistical technique that involves:
* Randomly sampling data points with replacement from the original dataset.
* This means that:
  * Some data points may appear more than once in the sample.
  * Some may not appear at all.
* The sample size is usually equal to the size of the original dataset.

This creates a new dataset that is similar to the original, but slightly varied.

**Role of Bootstrap Sampling in Bagging (e.g., Random Forest)**

In Bagging methods like Random Forest, bootstrap sampling is used to introduce diversity among the base models (like decision trees).

**Here's how it works:**
1. From the original training set, create multiple bootstrap samples.
2. Train one base model (e.g., decision tree) on each sample.

3. Combine all model predictions by:
  * Majority vote (for classification)
  * Averaging (for regression)

This process helps reduce variance, because:
* Each model sees a slightly different view of the data.
* Their errors are less likely to be correlated.
* Combining them leads to a more stable and generalizable final model.

**Example:**

Say your dataset has 100 rows. In bootstrap sampling:
* You draw 100 samples with replacement.
* Some rows may appear 2–3 times, others not at all.
* Do this for, say, 100 trees → 100 different training sets → 100 diverse trees.
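The 100-row example can be sketched with the standard library alone. This is a minimal illustration that samples row indices, not the actual rows of any dataset; the seed and variable names are arbitrary choices.

```python
import random
from collections import Counter

rng = random.Random(42)           # fixed seed so the run is repeatable
n_rows = 100
row_ids = list(range(n_rows))

# Draw n_rows indices with replacement: one bootstrap sample.
sample = rng.choices(row_ids, k=n_rows)

counts = Counter(sample)
repeated = [i for i, c in counts.items() if c > 1]
left_out = sorted(set(row_ids) - set(sample))   # rows never drawn

print("distinct rows drawn:", len(counts))
print("rows drawn more than once:", len(repeated))
print("rows left out:", len(left_out))
```

Repeating the draw with a different seed for each tree gives each tree its own training set.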



**Question 4:** What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

**Answer:**

When using bootstrap sampling in ensemble methods like Random Forest, each base learner (e.g., decision tree) is trained on a random sample drawn with replacement from the original dataset.

Because of this:
* Some data points are not included in the bootstrap sample for a given tree.
* These excluded data points are called Out-of-Bag (OOB) samples for that tree.

On average, about 36.8% (roughly one third) of the data points are left out of each bootstrap sample. The probability that a given point is never drawn in n draws with replacement is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows.
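The size of the left-out fraction can be checked numerically. The sketch below evaluates (1 - 1/n)^n, the chance that one specific row is never drawn in n draws with replacement, and compares it with its limit 1/e. `prob_left_out` is a hypothetical helper name.

```python
import math


def prob_left_out(n: int) -> float:
    """Probability that one specific row is never picked in n draws with replacement."""
    return (1 - 1 / n) ** n


for n in (10, 100, 1_000, 100_000):
    print(n, round(prob_left_out(n), 4))

print("limit 1/e:", round(math.exp(-1), 4))
```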

**What Is the OOB Score?**

The OOB score is a way to evaluate the model's performance without needing a separate validation set.

**How it works:**
1. Train each base learner on its bootstrap sample.
2. For each data point in the original dataset:
  * Find all the trees where this point was OOB (i.e., it wasn't used to train that tree).
  * Predict its label using only those trees.
3. Compare the predicted label (from OOB trees) to the true label.
4. Compute accuracy (for classification) or error (for regression) over all data points.

This is called the OOB score, and it's similar to cross-validation, but:
  * It's built-in to the training process.
  * It's computationally efficient since it reuses the same training data.
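In scikit-learn the steps above do not need to be coded by hand. A minimal sketch, assuming the Breast Cancer dataset and a forest of 200 trees (both arbitrary choices for illustration): passing `oob_score=True` makes the forest compute the OOB estimate during `fit` and expose it as `oob_score_`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each sample using only the trees
# whose bootstrap sample did not contain it (requires bootstrap=True, the default).
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
forest.fit(X, y)

oob_accuracy = forest.oob_score_   # OOB accuracy for a classifier
print(f"OOB accuracy: {oob_accuracy:.3f}")
```

No train/test split is needed here because every sample is scored only by trees that never trained on it.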

**Benefits of Using OOB Score:**
* No need for a separate validation set, saving data.
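* Approximately unbiased estimate of model performance (if dataset is large and well-represented); it can be slightly pessimistic because each point is scored by only a subset of the trees.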
* Very useful for hyperparameter tuning and early model evaluation.

**Question 5:** Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

**Answer:**

**Feature Importance in a Single Decision Tree**

**How it's calculated:**
* A decision tree splits the data based on features that maximize some criterion (e.g., Gini impurity or information gain).
* Feature importance is computed by summing up the total decrease in impurity caused by each feature across all the nodes where it's used.
* The more a feature reduces impurity, the more important it is considered.
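The impurity decrease of a single split can be worked out by hand. The sketch below uses Gini impurity on toy labels; the helper names and data are hypothetical. Libraries such as scikit-learn additionally weight each node's decrease by the share of samples reaching it and normalize the totals, which this sketch leaves out.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


parent = [0, 0, 0, 1, 1, 1]
perfect = impurity_decrease(parent, [0, 0, 0], [1, 1, 1])   # classes fully separated
weak = impurity_decrease(parent, [0, 0, 1], [0, 1, 1])      # classes still mixed
print(perfect, weak)
```

A feature that produces splits like the first one accumulates a much larger total decrease than one that only produces splits like the second.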

**Pros:**
* Easy to compute and interpret.
* Shows which features the tree relied on most.

**Cons:**
* High variance – small changes in the data can lead to very different trees and importances.
* May overemphasize features used near the root.
* Can be biased toward features with more categories or higher cardinality.

**Feature Importance in a Random Forest**

**How it's calculated:**
* Random Forest is an ensemble of decision trees trained on different bootstrap samples.
* Feature importance is computed by averaging the feature importances from all the individual trees.
* This is sometimes called "Mean Decrease in Impurity (MDI)".

An alternative (often better) method:

**→ Permutation Importance**
* Measures how much the model’s performance drops when you randomly shuffle the values of a feature.
* This ties importance directly to the model's predictive performance, ideally measured on held-out data rather than the training set.
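A minimal sketch of permutation importance using scikit-learn's `permutation_importance` helper, assuming the Breast Cancer dataset and a 70/30 split (illustrative choices). The importances are computed on the held-out split so they reflect generalization, not memorization.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Shuffle each feature column several times on held-out data and record
# the average drop in the model's score (accuracy by default here).
result = permutation_importance(
    forest, X_test, y_test, n_repeats=5, random_state=42
)

top5 = result.importances_mean.argsort()[::-1][:5]
for i in top5:
    print(f"{data.feature_names[i]}: {result.importances_mean[i]:.4f}")
```

The ranking may differ from the impurity-based one, particularly when several features are strongly correlated, as many of the size-related measurements in this dataset are.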

**Pros:**
* More stable and robust than a single tree.
* Less overfitting – benefits from averaging across many trees.
* Reduces bias toward high-cardinality features (especially when using permutation importance).

**Cons:**
* Can still be misleading with correlated features, because importance gets split among them (especially MDI).
* Less interpretable than a single tree due to model complexity.



| Aspect                   | **Decision Tree**                                                | **Random Forest**                                    |
| ------------------------ | ---------------------------------------------------------------- | ---------------------------------------------------- |
| **Method**               | Impurity reduction per feature                                   | Average impurity reduction across trees              |
| **Stability**            | Low – sensitive to data changes                                  | High – stable across resamples of the data           |
| **Bias Toward Features** | Biased toward high-cardinality features (many distinct values)   | Less biased (especially with permutation importance) |
| **Interpretability**     | Easy to understand                                               | Harder (many trees involved)                         |
| **Overfitting Risk**     | Higher                                                           | Lower                                                |


**Question 6:** Write a Python program to:
* Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
* Train a Random Forest Classifier
* Print the top 5 most important features based on feature importance scores.

(Include your Python code and output in the code box below.)

**Answer:**


In [2]:
# Load the Breast Cancer dataset
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

In [3]:
X = data.data
y = data.target

In [4]:
X

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

In [5]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

In [6]:
feature_names = data.feature_names

In [7]:
feature_names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

In [8]:
# Train a Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model

In [9]:
model.fit(X, y)

In [10]:
# Get feature importances
importances = model.feature_importances_
importances

array([0.03484323, 0.01522515, 0.06799034, 0.06046164, 0.00795845,
       0.01159704, 0.06691736, 0.10704566, 0.00342279, 0.00261508,
       0.0142637 , 0.00374427, 0.01008506, 0.02955283, 0.00472157,
       0.00561183, 0.00581969, 0.00375975, 0.00354597, 0.00594233,
       0.08284828, 0.01748526, 0.0808497 , 0.13935694, 0.01223202,
       0.01986386, 0.03733871, 0.13222509, 0.00817908, 0.00449731])

In [11]:
# Create a DataFrame for better readability
import pandas as pd
import numpy as np

feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

In [12]:
# Sort by importance (descending)
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

In [13]:
feature_importance_df.head(5)

|    | Feature              | Importance |
| -- | -------------------- | ---------- |
| 23 | worst area           | 0.139357   |
| 27 | worst concave points | 0.132225   |
| 7  | mean concave points  | 0.107046   |
| 20 | worst radius         | 0.082848   |
| 22 | worst perimeter      | 0.08085    |


**Question 7:** Write a Python program to:
* Train a Bagging Classifier using Decision Trees on the Iris dataset
* Evaluate its accuracy and compare with a single Decision Tree

(Include your Python code and output in the code box below.)

**Answer:**

In [14]:
# Load the Iris dataset
from sklearn.datasets import load_iris

iris = load_iris()

In [15]:
X, y = iris.data, iris.target

In [16]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [17]:
# Split into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [18]:
# Train a single Decision Tree
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)
dt

In [19]:
dt.fit(X_train, y_train)

In [20]:
dt_preds = dt.predict(X_test)
dt_preds

array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 2, 1, 1, 0,
       0])

In [21]:
from sklearn.metrics import accuracy_score
dt_accuracy = accuracy_score(y_test, dt_preds)
dt_accuracy

1.0

In [22]:
# Train a Bagging Classifier with Decision Trees
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagging = BaggingClassifier(
    estimator = DecisionTreeClassifier(),
    n_estimators = 100,
    random_state = 42
)

In [23]:
bagging.fit(X_train, y_train)

In [24]:
bagging_preds = bagging.predict(X_test)
bagging_preds

array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 2, 1, 1, 0,
       0])

In [25]:
from sklearn.metrics import accuracy_score

bagging_accuracy = accuracy_score(y_test, bagging_preds)
bagging_accuracy

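1.0

On this split both models classify all 45 test samples correctly (accuracy 1.0), so a single train/test split cannot separate them on Iris.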
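Because one small test split gives both models the same score, cross-validation is one way to get a finer comparison. The following is a minimal sketch, assuming 5-fold cross-validation on the full Iris data (an illustrative choice, not part of the original answer).

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(random_state=42)
# The base tree is passed positionally so the call does not depend on the
# keyword name (estimator vs. base_estimator) used by a given scikit-learn version.
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=42)

tree_scores = cross_val_score(tree, X, y, cv=5)
bagged_scores = cross_val_score(bagged, X, y, cv=5)

print(f"Decision tree: {tree_scores.mean():.3f} +/- {tree_scores.std():.3f}")
print(f"Bagging:       {bagged_scores.mean():.3f} +/- {bagged_scores.std():.3f}")
```

On a dataset as small and clean as Iris the gap is usually modest; bagging tends to help more on noisier data where single trees overfit.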

**Question 8:** Write a Python program to:
* Train a Random Forest Classifier
* Tune hyperparameters max_depth and n_estimators using GridSearchCV
* Print the best parameters and final accuracy

(Include your Python code and output in the code box below.)

**Answer:**

In [26]:
# Load dataset
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

In [27]:
X, y = data.data, data.target

In [28]:
X

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

In [29]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

In [30]:
# Split into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [31]:
# Define Random Forest and hyperparameter grid
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)

In [32]:
rf

In [33]:
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [3, 5, 7, None]
}
param_grid

{'n_estimators': [50, 100, 150], 'max_depth': [3, 5, 7, None]}

In [34]:
# GridSearchCV for hyperparameter tuning
from sklearn.model_selection import train_test_split, GridSearchCV

grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search

In [35]:
grid_search.fit(X_train, y_train)

In [36]:
# Get best model and evaluate on test set
best_rf = grid_search.best_estimator_
best_rf

In [37]:
y_pred = best_rf.predict(X_test)
y_pred

array([1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1])

In [38]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
accuracy

0.9707602339181286
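The question also asks for the best parameters, which the fitted search object exposes as `best_params_`. The sketch below is a self-contained variant that prints them together with the accuracy. It assumes a reduced grid and 3-fold cross-validation to keep the run short, so its results need not match the cells above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Reduced grid (an assumption made here to keep the run short).
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_params = search.best_params_        # dict of the winning settings
best_cv_score = search.best_score_       # mean cross-validated accuracy
test_accuracy = accuracy_score(y_test, search.predict(X_test))

print("Best parameters:", best_params)
print(f"Best CV accuracy: {best_cv_score:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

With the default `refit=True`, `search.predict` uses the best estimator refitted on the whole training set, the same model that `best_estimator_` returns.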

**Question 9:** Write a Python program to:
* Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
* Compare their Mean Squared Errors (MSE)

(Include your Python code and output in the code box below.)

**Answer:**

In [39]:
# Load the California Housing dataset
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()

In [40]:
X, y = data.data, data.target

In [41]:
X

array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
          37.88      , -122.23      ],
       [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
          37.86      , -122.22      ],
       [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
          37.85      , -122.24      ],
       ...,
       [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
          39.43      , -121.22      ],
       [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
          39.43      , -121.32      ],
       [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
          39.37      , -121.24      ]])

In [42]:
y

array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])

In [43]:
# Split into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [44]:
# Train Bagging Regressor
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

bagging = BaggingRegressor(
    estimator = DecisionTreeRegressor(),
    n_estimators = 100,
    random_state = 42
)
bagging

In [45]:
bagging.fit(X_train, y_train)

In [46]:
bagging_preds = bagging.predict(X_test)
bagging_preds

array([0.47841  , 0.73213  , 4.8208461, ..., 2.0260901, 1.37665  ,
       2.17219  ])

In [47]:
from sklearn.metrics import mean_squared_error

bagging_mse = mean_squared_error(y_test, bagging_preds)
bagging_mse

0.2568358813508342

In [48]:
# Train Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)
rf

In [49]:
rf.fit(X_train, y_train)

In [50]:
rf_preds = rf.predict(X_test)
rf_preds

array([0.47809  , 0.74566  , 4.8298161, ..., 2.0718201, 1.38519  ,
       2.14294  ])

In [51]:
from sklearn.metrics import mean_squared_error

rf_mse = mean_squared_error(y_test, rf_preds)
rf_mse

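0.25650512920799395

The two errors are nearly identical: about 0.2568 for the Bagging Regressor and 0.2565 for the Random Forest Regressor.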
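One likely reason the two results are so close: scikit-learn's RandomForestRegressor considers all features at every split by default, which makes it behave much like bagged decision trees. The sketch below compares bagging with two forest settings. It uses synthetic data from `make_regression` as a stand-in (an assumption, chosen only to avoid a dataset download), so the numbers are not comparable to the California Housing results above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, used only to avoid a download).
X, y = make_regression(n_samples=1000, n_features=8, n_informative=5,
                       noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

models = {
    "bagged trees": BaggingRegressor(DecisionTreeRegressor(), n_estimators=100,
                                     random_state=42),
    "forest, all features per split": RandomForestRegressor(
        n_estimators=100, max_features=1.0, random_state=42),
    "forest, 1/3 of features per split": RandomForestRegressor(
        n_estimators=100, max_features=1 / 3, random_state=42),
}

mse = {}
for name, reg in models.items():
    reg.fit(X_train, y_train)
    mse[name] = mean_squared_error(y_test, reg.predict(X_test))
    print(f"{name}: MSE = {mse[name]:.2f}")
```

Whether restricting `max_features` helps depends on the data, so it is worth treating as a hyperparameter to tune instead of assuming a fixed value.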

**Question 10:** You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.

You decide to use ensemble techniques to increase model performance.

Explain your step-by-step approach to:
* Choose between Bagging or Boosting
* Handle overfitting
* Select base models
* Evaluate performance using cross-validation
* Justify how ensemble learning improves decision-making in this real-world context.

(Include your Python code and output in the code box below.)

**Answer:**

In [52]:
# Simulate imbalanced financial dataset
from sklearn.datasets import make_classification
import numpy as np
import pandas as pd


X, y = make_classification(n_samples=5000, n_features=20, n_informative=10,
                           n_redundant=5, n_clusters_per_class=2, weights=[0.85, 0.15],
                           flip_y=0.01, random_state=42)

In [53]:
X

array([[ 2.944165  , -0.99859155,  3.39135329, ...,  1.10659839,
        -1.87544235,  0.50117701],
       [ 2.04669267, -0.55879034, -3.7928526 , ...,  0.5108216 ,
         0.13533451,  0.16753355],
       [ 2.7521623 , -0.55716509,  1.51575667, ..., -1.26949511,
        -1.83643433,  0.99577397],
       ...,
       [ 0.97510158,  1.30383499,  0.91193533, ...,  1.82331894,
        -1.03407531, -0.95743216],
       [ 0.16863448,  0.54430098, -2.86743148, ...,  1.7517107 ,
        -1.47239729,  2.27147103],
       [ 3.58430967,  1.75873178,  1.25991255, ..., -0.8824804 ,
        -2.00464194, -0.93524311]])

In [54]:
y

array([0, 0, 0, ..., 0, 0, 0])

In [55]:
# Train-test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

In [56]:
# Initialize Boosting model (XGBoost)
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=5,  # account for imbalance
    use_label_encoder=False,  # not used by recent XGBoost versions (triggers the warning shown below)
    eval_metric='logloss',
    random_state=42
)
model
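A common starting point for `scale_pos_weight`, suggested in the XGBoost documentation, is the ratio of negative to positive examples. The sketch below computes that ratio for hypothetical labels with the same 85/15 class mix as the simulation; `neg_pos_ratio` is a hypothetical helper name. The result is close to the value of 5 used above.

```python
from collections import Counter


def neg_pos_ratio(labels):
    """Ratio of negative (0) to positive (1) examples in a binary label list."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no positive examples")
    return counts[0] / counts[1]


# Hypothetical labels with the same 85/15 class mix used in the simulation.
labels = [0] * 850 + [1] * 150
ratio = neg_pos_ratio(labels)
print(round(ratio, 2))
```

In practice the ratio would be computed from `y_train` only, so that no information from the test split leaks into the model settings.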

In [57]:
# Fit and evaluate on test set
model.fit(X_train, y_train)

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


In [58]:
y_pred = model.predict(X_test)
y_pred

array([0, 1, 0, ..., 0, 1, 0])

In [59]:
y_prob = model.predict_proba(X_test)[:, 1]
y_prob

array([0.15101217, 0.9876425 , 0.26627436, ..., 0.04504946, 0.9805022 ,
       0.41090313], dtype=float32)

In [60]:
# Cross-validation (AUC as scoring)
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv

StratifiedKFold(n_splits=5, random_state=42, shuffle=True)

In [61]:
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='roc_auc')


In [62]:
print(cv_scores)

[0.9522151  0.97445273 0.95943326 0.94896459 0.9564721 ]


In [63]:
print(f"Mean AUC: {np.mean(cv_scores):.4f}")

Mean AUC: 0.9583


In [64]:
from sklearn.metrics import accuracy_score

print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.954


In [65]:
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

print("ROC AUC:", roc_auc_score(y_test, y_prob))

ROC AUC: 0.9654262238959261


In [66]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.97      0.97      1270
           1       0.85      0.84      0.85       230

    accuracy                           0.95      1500
   macro avg       0.91      0.91      0.91      1500
weighted avg       0.95      0.95      0.95      1500
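The report above uses the default 0.5 probability cutoff, at which recall for the default class (1) is 0.84. In lending, a missed default is often costlier than a false alarm, so the cutoff itself can be treated as a business decision. The sketch below shows how precision and recall shift when the threshold is lowered. The labels and probabilities are hypothetical toy values, not output of the model above, and `precision_recall_at` is a hypothetical helper.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall for the positive class at a given probability cutoff."""
    y_hat = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, h in zip(y_true, y_hat) if t == 1 and h == 1)
    fp = sum(1 for t, h in zip(y_true, y_hat) if t == 0 and h == 1)
    fn = sum(1 for t, h in zip(y_true, y_hat) if t == 1 and h == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels and predicted default probabilities, for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.30, 0.90]

for threshold in (0.5, 0.3):
    p, r = precision_recall_at(y_true, y_prob, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Lowering the cutoff catches more defaulters at the price of flagging more good customers. The right balance depends on the relative cost of each error type.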

