# Ensemble Learning: Assignment

1. What is Ensemble Learning in machine learning? Explain the key idea
behind it.

Ans.  Ensemble learning in machine learning is a technique where we combine multiple models (often called weak learners or base learners) to produce a stronger and more accurate final model.

Key Idea:

A group of models, when combined, can often perform better than any single model alone.

Just like asking several experts instead of relying on one person often gives a more reliable decision, ensemble learning aggregates predictions from different models to reduce errors, bias, and variance.

Why it works

Reduction of variance – Different models may make different mistakes; averaging them smooths out random errors.

Reduction of bias – Combining models, especially sequentially as in boosting, can capture more complex patterns than any single simple model.

Better generalization – More robust performance on unseen data.
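
The variance-reduction argument can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that every model is correct with the same probability p and that the models err independently; the function name and the value p = 0.6 are not part of the assignment. Real models are correlated, so actual gains are smaller.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n_models independent voters is right.

    Assumes each model is correct with probability p, independently of the others.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


for n in (1, 3, 5, 25, 101):
    acc = majority_vote_accuracy(n, 0.6)
    print(f"{n:>3} models -> majority-vote accuracy {acc:.4f}")
```

With p = 0.6, three voters already reach 0.648, and accuracy keeps rising as voters are added. The gain depends entirely on the models making different mistakes, which is why ensemble methods work to keep their base learners diverse.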


2. What is the difference between Bagging and Boosting?

Ans. Bagging (Bootstrap Aggregating) and Boosting are both ensemble learning techniques, but they work in very different ways.
In Bagging, multiple models are trained in parallel, each on a different random subset of the training data created through bootstrap sampling (sampling with replacement). These models are independent of each other, and their predictions are later combined, usually by majority voting for classification or averaging for regression. The main goal of Bagging is to reduce variance and make the final prediction more stable; Random Forest is a well-known example.

Boosting, on the other hand, trains models sequentially, where each new model focuses on correcting the mistakes made by the previous ones. In AdaBoost, data points that were misclassified earlier are given more weight so that subsequent models pay more attention to them; gradient boosting methods instead fit each new model to the residual errors (the gradient of the loss) of the current ensemble. Boosting primarily aims to reduce bias and improve accuracy, but it can overfit if run for too many rounds, especially on noisy data. Popular Boosting algorithms include AdaBoost, Gradient Boosting, XGBoost, and LightGBM.

In short, Bagging builds many independent models to average out their errors, while Boosting builds dependent models that learn from each other’s mistakes to progressively improve performance.
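
The reweighting step that makes Boosting sequential can be sketched in a few lines. This is the textbook discrete AdaBoost update for binary labels, not scikit-learn's exact implementation; the function name and the five-point example are hypothetical.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One AdaBoost-style weight update for binary classification.

    weights       -- current sample weights (positive, summing to 1)
    misclassified -- booleans, True where the current weak learner was wrong
    Returns (alpha, new_weights), where alpha is the learner's vote weight.
    """
    err = sum(w for w, miss in zip(weights, misclassified) if miss)
    if not 0.0 < err < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Up-weight mistakes, down-weight correct points, then renormalise.
    raw = [
        w * math.exp(alpha if miss else -alpha)
        for w, miss in zip(weights, misclassified)
    ]
    total = sum(raw)
    return alpha, [r / total for r in raw]


weights = [0.2] * 5  # five points, uniform starting weights
misclassified = [False, False, True, False, False]
alpha, new_weights = adaboost_reweight(weights, misclassified)
print(f"alpha = {alpha:.4f}")
print("new weights:", [round(w, 3) for w in new_weights])
```

After this update the single misclassified point carries half of the total weight (0.5) and each correctly classified point carries 0.125, so the next weak learner is pushed to get that point right. Bagging has no such step: every model is trained without looking at the others.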


3. What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?

Ans. Bootstrap sampling is a statistical technique where we create new datasets by randomly sampling from the original dataset with replacement.
This means a single data point from the original dataset can appear multiple times in a bootstrap sample, while some points might not appear at all.

How it works in Bagging

In Bagging methods like Random Forest, bootstrap sampling is used to give each model (or decision tree) a slightly different view of the data:

From the original dataset of size N, we randomly pick N samples with replacement to form a new dataset.

This process is repeated for each model in the ensemble, so every model is trained on a different bootstrap sample.

Because each model sees a different subset, they learn different patterns and make different errors.

Finally, their predictions are combined (by averaging or majority voting), which reduces variance and improves stability.


Key Role in Bagging:

Introduces diversity among models.

Prevents all models from making the same mistakes.

Helps reduce overfitting compared to using the entire dataset for each model.
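
The sampling step described above can be sketched with the standard library alone. The helper name, the seed, and the dataset size of 1000 rows are illustrative assumptions.

```python
import random


def bootstrap_indices(n: int, rng: random.Random) -> list:
    """Draw n row indices uniformly at random *with replacement*."""
    return [rng.randrange(n) for _ in range(n)]


rng = random.Random(42)  # fixed seed only so the run is repeatable
n = 1000
sample = bootstrap_indices(n, rng)
distinct = len(set(sample))

print("sample size        :", len(sample))
print("distinct rows drawn:", distinct)
print("rows left out      :", n - distinct)
```

The sample always has N entries, but because some rows are drawn more than once, noticeably fewer than N distinct rows appear in it. Repeating the draw with a different random state for each model is what gives every tree its own view of the data.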


4. What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?

Ans.  Out-of-Bag (OOB) samples are the data points not included in a bootstrap sample when building a model in Bagging methods like Random Forest.

Since bootstrap sampling is done with replacement, each bootstrap sample contains on average only about 63% of the distinct points in the original dataset; the remaining ~37% are left out of that sample, and these are the OOB samples for that model.
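
The 63% / 37% split follows from a short calculation, assuming N independent draws with replacement from N points, each point equally likely on every draw:

```latex
P(\text{point } i \text{ is never drawn}) = \left(1 - \frac{1}{N}\right)^{N}
\xrightarrow[N \to \infty]{} e^{-1} \approx 0.368,
\qquad
P(\text{point } i \text{ is drawn at least once}) \approx 1 - e^{-1} \approx 0.632 .
```

The limit is approached quickly, so the approximation is already good for datasets with a few dozen rows.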

How OOB Samples Are Used

For each model (e.g., each decision tree in a Random Forest), the OOB samples serve as a built-in validation set.

After training a model on its bootstrap sample, we test it on its OOB samples to measure performance without needing a separate validation dataset.

OOB Score

The OOB score is the prediction accuracy (or another metric; scikit-learn uses accuracy for classifiers and R² for regressors) computed over the OOB predictions of the entire ensemble:

For each data point in the original dataset, only consider predictions from the models where that point was OOB.

Compare these predictions to the actual labels.

Average the results to get the OOB score.

Benefits of OOB Evaluation:

No need for a separate test/validation set (saves data).

Provides a nearly unbiased (typically slightly conservative) estimate of generalization performance as a by-product of training.

Especially useful when data is limited.
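
The OOB procedure can be traced by hand on a toy problem. In the sketch below the 1-D dataset, the choice of 25 models, and the use of a 1-nearest-neighbour rule as a stand-in for a decision tree are all assumptions made to keep the example dependency-free.

```python
import random
from collections import Counter

rng = random.Random(0)

# Toy 1-D data (an assumption for illustration): label is 1 when x > 0.5.
X = [rng.random() for _ in range(200)]
y = [int(x > 0.5) for x in X]
n = len(X)


def predict_1nn(train_idx, x):
    """Label of the training point closest to x (stand-in for a decision tree)."""
    nearest = min(train_idx, key=lambda i: abs(X[i] - x))
    return y[nearest]


# "Train" 25 base models: each one is just its bootstrap sample of row indices.
bags = [[rng.randrange(n) for _ in range(n)] for _ in range(25)]
in_bag = [set(bag) for bag in bags]

correct = scored = 0
for i in range(n):
    # Only models that never saw point i are allowed to vote on it.
    votes = Counter(
        predict_1nn(bag, X[i]) for bag, seen in zip(bags, in_bag) if i not in seen
    )
    if not votes:  # point was in every bag (rare): no OOB prediction
        continue
    scored += 1
    correct += votes.most_common(1)[0][0] == y[i]

oob_score = correct / scored
print(f"OOB accuracy: {oob_score:.3f} ({scored} of {n} points scored)")
```

In scikit-learn the same bookkeeping is done internally: passing `oob_score=True` to `RandomForestClassifier` or `BaggingClassifier` makes the fitted model expose the result as `oob_score_`.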


5. Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.

Ans.  In a single Decision Tree, feature importance is measured based on how much each feature contributes to reducing impurity (like Gini impurity or entropy for classification, and variance for regression) when it’s used to split the data. For each split in the tree, the algorithm calculates the decrease in impurity, and these decreases are summed over all nodes where the feature appears. The higher the total decrease, the more “important” the feature is considered. However, because a single decision tree is sensitive to the specific training data, its feature importance can be unstable — if you train the tree on slightly different data, the importances might change a lot.

In a Random Forest, feature importance is calculated in a similar way, but it is averaged over many trees. Each tree is built on a bootstrap sample and uses random feature subsets for splitting, so the importance score for a feature comes from computing its impurity decrease in every tree and then averaging across trees. This aggregation makes the importance scores more robust and reliable than in a single tree, since the effect of noise or peculiarities in one dataset split is reduced. Random Forests can also be analysed with permutation importance, which measures how much the model's score drops when a feature's values are randomly shuffled. It is model-agnostic, can be computed on held-out data, and avoids the bias of impurity-based scores toward high-cardinality features.

In short:

Decision Tree → Importance from impurity reduction in one tree, often unstable.

Random Forest → Importance from impurity reduction averaged over many trees, more stable and generalizable.
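
The two importance measures can be put side by side. The following is a minimal sketch on the Breast Cancer dataset; the 70/30 split, the seed, and `n_repeats=10` are arbitrary choices, not prescribed by the assignment.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42, stratify=data.target
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Impurity-based importances: computed from the training data while fitting.
impurity = rf.feature_importances_

# Permutation importances: drop in test accuracy when one column is shuffled.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

# Show the five features ranked highest by impurity-based importance.
top = sorted(range(len(impurity)), key=lambda j: impurity[j], reverse=True)[:5]
print(f"{'feature':<25}{'impurity':>10}{'permutation':>13}")
for j in top:
    print(
        f"{data.feature_names[j]:<25}"
        f"{impurity[j]:>10.4f}{perm.importances_mean[j]:>13.4f}"
    )
```

Many features in this dataset are strongly correlated, so a feature with a large impurity-based score can still show a small permutation score: when one column is shuffled, the model can fall back on its correlated neighbours.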

6. Write a Python program to:
● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.


In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# 1. Load the dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# 2. Train a Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# 3. Get feature importance scores
feature_importances = pd.Series(model.feature_importances_, index=data.feature_names)

# 4. Sort and display top 5 features
top_features = feature_importances.sort_values(ascending=False).head(5)
print("Top 5 Most Important Features:\n")
print(top_features)

Top 5 Most Important Features:

worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


7. Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree.

In [5]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1) Data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 2) Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_acc = accuracy_score(y_test, dt.predict(X_test))

# 3) Bagging with Decision Trees (handle sklearn version differences)
common = dict(n_estimators=50, random_state=42)
try:
    # scikit-learn >= 1.2
    bag = BaggingClassifier(estimator=DecisionTreeClassifier(), **common)
except TypeError:
    # scikit-learn < 1.2
    bag = BaggingClassifier(base_estimator=DecisionTreeClassifier(), **common)

bag.fit(X_train, y_train)
bag_acc = accuracy_score(y_test, bag.predict(X_test))

# 4) Compare
print(f"Single Decision Tree Accuracy: {dt_acc:.4f}")
print(f"Bagging Classifier Accuracy  : {bag_acc:.4f}")


Single Decision Tree Accuracy: 0.9333
Bagging Classifier Accuracy  : 0.9333


8. Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy

In [6]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# 1. Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 3. Define model
rf = RandomForestClassifier(random_state=42)

# 4. Define parameter grid
param_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [None, 5, 10, 15]
}

# 5. GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)
grid_search.fit(X_train, y_train)

# 6. Best parameters
print("Best Parameters:", grid_search.best_params_)

# 7. Final accuracy with best params
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Final Accuracy: {accuracy:.4f}")


Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy: 0.9357


9. Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
● Compare their Mean Squared Errors (MSE)

In [7]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# 1. Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Train Bagging Regressor (using Decision Trees as base estimators)
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42
)
bagging_reg.fit(X_train, y_train)

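# 4. Train Random Forest Regressor
# Note: for regressors the default max_features considers all features at each
# split, so this model behaves almost like plain bagging of decision trees.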
rf_reg = RandomForestRegressor(
    n_estimators=50,
    random_state=42
)
rf_reg.fit(X_train, y_train)

# 5. Predictions
bagging_pred = bagging_reg.predict(X_test)
rf_pred = rf_reg.predict(X_test)

# 6. Calculate MSE
bagging_mse = mean_squared_error(y_test, bagging_pred)
rf_mse = mean_squared_error(y_test, rf_pred)

# 7. Compare results
print(f"Bagging Regressor MSE      : {bagging_mse:.4f}")
print(f"Random Forest Regressor MSE: {rf_mse:.4f}")


Bagging Regressor MSE      : 0.2579
Random Forest Regressor MSE: 0.2577


10. You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world
context.


Ans. 1. Choosing Between Bagging or Boosting

Start with the problem nature:

Loan default prediction is a classification problem, often with class imbalance (fewer defaults than non-defaults).

Boosting (e.g., XGBoost, LightGBM) is generally better when you want higher accuracy and can tolerate more complexity, because it reduces bias and focuses on hard-to-classify cases.

Bagging (e.g., Random Forest) is better when the dataset is noisy and you want stability by reducing variance.

Decision:
I’d try Boosting first because financial risk models benefit from focusing on edge cases (e.g., borderline defaulters). But I’d also benchmark against Bagging to see if it performs comparably without overfitting.

2. Handling Overfitting

For Bagging (Random Forest):

Limit max_depth of trees.

Increase min_samples_split or min_samples_leaf.

Use fewer features per split (max_features).

For Boosting:

Use smaller learning rates (learning_rate).

Limit number of boosting rounds (n_estimators).

Use early stopping with validation data.

Apply regularization (e.g., in XGBoost: max_depth, min_child_weight, and the L2/L1 penalties lambda and alpha).

Data-related methods:

Remove irrelevant features.

Use feature selection to reduce noise.

Apply SMOTE or class weights to handle class imbalance, resampling only inside the training folds so that validation data stays untouched.
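
Several of the Boosting controls listed above can be combined in one model. The sketch below uses scikit-learn's `GradientBoostingClassifier` on synthetic data from `make_classification`, which is only a stand-in for real loan data; all hyperparameter values are illustrative starting points, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for loan data (about 10% "defaults").
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=6,
    weights=[0.9, 0.1], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.05,       # small steps: each tree contributes less
    max_depth=3,              # shallow trees keep each learner weak
    subsample=0.8,            # row subsampling adds randomness
    validation_fraction=0.2,  # held out internally for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
gb.fit(X_train, y_train)

print("boosting rounds actually used:", gb.n_estimators_)
print(f"test accuracy: {gb.score(X_test, y_test):.4f}")
```

Setting `n_iter_no_change` turns `n_estimators` into a ceiling rather than a fixed count: training stops once the internal validation score stops improving, and `n_estimators_` reports how many rounds were kept.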

3. Selecting Base Models

Bagging: Decision Trees are a natural fit (Random Forest).

Boosting: Decision Trees are also the usual choice (e.g., Gradient Boosted Trees), typically kept shallow so that each learner stays weak and the ensemble generalizes better.

I’d start with tree-based models because:

They handle mixed data types well.

They capture non-linear patterns and interactions between variables without heavy preprocessing.

If using Stacking later, I might include Logistic Regression or LightGBM in the base layer for diversity.

4. Evaluating Performance with Cross-Validation

Use Stratified k-Fold Cross-Validation to preserve the proportion of defaulters vs. non-defaulters in each fold.

Evaluate using metrics beyond accuracy:

AUC-ROC → measures ranking ability (important for loan risk).

Precision-Recall AUC → especially useful for imbalanced datasets.

F1-score → balances precision and recall.

Plot validation curves to see how training and validation scores change with n_estimators, depth, etc., so that overfitting is detected early.
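
The evaluation plan above, including the Bagging versus Boosting benchmark, might look as follows. The data is again synthetic (`make_classification` as a stand-in for loan records), and the model settings are defaults chosen for illustration. `average_precision` is scikit-learn's scorer for summarizing the precision-recall curve.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for loan data (about 10% "defaults").
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=6,
    weights=[0.9, 0.1], random_state=0,
)

# Stratified folds keep the default rate roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["roc_auc", "average_precision", "f1"]
models = {
    "Random Forest (bagging)": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
    summary = ", ".join(f"{m}={v:.3f}" for m, v in results[name].items())
    print(f"{name:<25} {summary}")
```

Comparing both families under the same folds and metrics keeps the choice between Bagging and Boosting empirical rather than assumed. On real data any resampling such as SMOTE would need to sit inside the cross-validation loop.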

5. Justifying Ensemble Learning in This Real-World Context

Why it works here:

Loan default prediction has complex patterns — a single model might miss subtle risk signals.

Ensembles combine multiple models’ strengths, reducing the chance of relying on one biased viewpoint.

Business value:

Lower default rates → more accurate classification of high-risk customers means better loan approval decisions.

Optimized profit → approving safe borrowers while rejecting risky ones improves financial returns.

Regulatory compliance → robust models with better generalization lower the chance of unstable decisions, although individual decisions still need to be explained (for example with feature importance analysis).

Risk management → more consistent predictions reduce uncertainty in portfolio risk assessment.