# Assignment Code: DA-AG-014
# **Ensemble Learning | Assignment**

**Question. No. 1. What is Ensemble Learning in machine learning? Explain the key idea behind it.**


**Answer.** Ensemble learning is a machine learning approach that combines the predictions of multiple models to create a more accurate and robust final model. Instead of depending on a single predictive algorithm, ensemble methods integrate several weak learners (models with limited accuracy) to form a strong learner with better generalization.

The key idea behind ensemble learning is that different models capture different aspects of the dataset and make different errors. When their predictions are combined, these errors can cancel out, leading to improved performance. This approach is based on the principle that “a group decision is often better than an individual decision.”

Ensemble techniques can be broadly classified into:

1. Bagging (Bootstrap Aggregating)

2. Boosting

3. Stacking

**Advantages:**

Reduces variance (bagging) and bias (boosting).

Increases prediction stability.

Handles overfitting better than a single model in many cases.


Example: In spam detection, one model might misclassify certain emails, but combining multiple models like decision trees, Naïve Bayes, and logistic regression can significantly improve accuracy.
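The idea in the example above can be sketched with scikit-learn's `VotingClassifier`. This is a minimal illustration, not a spam detector: the data comes from `make_classification` as a synthetic stand-in (an assumption, since no email dataset is part of this assignment), and the model settings are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for spam / not spam).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Three different model families that tend to make different errors.
base_models = [
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("nb", GaussianNB()),
    ("logreg", LogisticRegression(max_iter=1000)),
]

# Test accuracy of each model on its own.
scores = {}
for name, model in base_models:
    scores[name] = model.fit(X_train, y_train).score(X_test, y_test)

# Hard voting: the ensemble predicts the class chosen by most base models.
ensemble = VotingClassifier(estimators=base_models, voting="hard")
scores["voting"] = ensemble.fit(X_train, y_train).score(X_test, y_test)

for name, acc in scores.items():
    print(f"{name:>7}: {acc:.3f}")
```

The ensemble is not guaranteed to beat every individual model on every dataset; the benefit depends on how different the base models' errors are.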


**Question. No. 2. What is the difference between Bagging and Boosting?**


**Answer.** Difference between Bagging and Boosting

Bagging (Bootstrap Aggregating) and Boosting are both ensemble learning techniques, but they differ in how they train and combine models.

| Aspect | Bagging | Boosting |
|---|---|---|
| Goal | Reduce variance and limit overfitting | Reduce bias and strengthen weak learners |
| Training | Models are trained independently on random subsets of the data | Models are trained sequentially, each focusing on correcting the errors of the previous model |
| Data sampling | Bootstrap sampling (random samples drawn with replacement) | Uses all data but assigns weights to emphasize misclassified instances |
| Model weighting | All models have equal weight in the final prediction | Models are given different weights based on performance |
| Error handling | Averages predictions to reduce variance | Adjusts model focus to reduce bias and errors |
| Examples | Random Forest | AdaBoost, Gradient Boosting, XGBoost |
| Best for | High-variance models (e.g., decision trees) | High-bias models or improving weak learners |

Bagging builds multiple independent models and combines them to stabilize predictions.

Boosting builds models sequentially, with each new model improving upon the weaknesses of the previous one.
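A small side-by-side sketch of the two approaches, assuming synthetic data from `make_classification` and arbitrary settings (50 estimators, 5-fold cross-validation). Bagging is paired with deep trees and boosting with depth-1 stumps, matching the "best for" distinction above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=600, n_features=20, n_informative=6, random_state=0
)

# Bagging: fully grown (high-variance) trees, each trained independently
# on its own bootstrap sample; predictions are combined by voting.
bagging = BaggingClassifier(
    DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0
)

# Boosting: depth-1 stumps (high-bias weak learners) trained one after
# another, with misclassified samples re-weighted between rounds.
boosting = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1, random_state=0),
    n_estimators=50,
    random_state=0,
)

bagging_acc = cross_val_score(bagging, X, y, cv=5).mean()
boosting_acc = cross_val_score(boosting, X, y, cv=5).mean()

print(f"Bagging  (deep trees): {bagging_acc:.3f}")
print(f"Boosting (stumps)    : {boosting_acc:.3f}")
```

The base estimator is passed positionally here because its keyword name differs between scikit-learn releases.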


**Question. No. 3. What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**

**Answer.** **Bootstrap sampling** is a resampling technique where multiple datasets are created by **randomly selecting data points from the original dataset with replacement.** This means some records may appear more than once, while others might be excluded in each sample. Each bootstrap sample is typically the same size as the original dataset.

**Role in Bagging and Random Forest:**


**Diversity Creation:** In Bagging, multiple models (e.g., decision trees) are trained on different bootstrap samples. This variation ensures each model learns different patterns, making the ensemble more robust.


**Variance Reduction:** Independent models trained on varied data make different mistakes. Combining their predictions (averaging in regression, majority voting in classification) reduces prediction variance and improves stability.


**Prevention of Overfitting:** Since each model sees only part of the data (with duplicates), they don’t all fit to the same noise, lowering the risk of overfitting.


**Out-of-Bag (OOB) Error Estimation:** On average, about 36.8% of the original records are left out of each bootstrap sample. These “out-of-bag” samples act as a built-in validation set to estimate model performance without needing a separate validation set.


**Scalability:** Bootstrap sampling allows parallel training of multiple models, improving computational efficiency in large datasets.

Example:

Suppose we have a dataset of 1,000 samples. Each decision tree in a Random Forest is trained on a bootstrap sample of 1,000 records randomly drawn with replacement. Some records will be repeated, and some will be missing. This process ensures diversity among trees, improving the ensemble’s predictive power.
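The 1,000-record example above can be simulated with NumPy. This is only a sketch of one bootstrap draw; the seed and variable names are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 1000
indices = np.arange(n_samples)

# One bootstrap sample: draw n_samples row indices WITH replacement.
bootstrap_idx = rng.choice(indices, size=n_samples, replace=True)

# Records drawn at least once, and records never drawn (out-of-bag).
in_bag = np.unique(bootstrap_idx)
oob_idx = np.setdiff1d(indices, in_bag)
oob_fraction = oob_idx.size / n_samples

print(f"bootstrap sample size : {bootstrap_idx.size}")
print(f"unique records drawn  : {in_bag.size}")
print(f"out-of-bag records    : {oob_idx.size}")
print(f"OOB fraction          : {oob_fraction:.3f}")
```

Repeating the draw with different seeds gives slightly different counts, but the out-of-bag fraction stays close to 0.37.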


**Question. No. 4. What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?**


**Answer.** Out-of-Bag (OOB) Samples and OOB Score in Ensemble Models

**OOB Samples:**

In Bagging-based models like Random Forest, each tree is trained on a bootstrap sample (random sampling with replacement) from the original dataset. Because of replacement, about 36.8% of the data is, on average, not selected for a given tree. These unused records are called Out-of-Bag (OOB) samples.
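The 36.8% figure follows from a short calculation, assuming the bootstrap sample has the same size $n$ as the original dataset. Each draw misses a given record with probability $1 - 1/n$, and the $n$ draws are independent, so

$$
P(\text{record is out-of-bag}) = \left(1 - \frac{1}{n}\right)^{n} \xrightarrow[n \to \infty]{} e^{-1} \approx 0.368
$$

For example, with $n = 1000$ this gives $0.999^{1000} \approx 0.3677$.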

**Role of OOB Samples:**


They act like a built-in validation set.
Since each tree’s OOB samples are unseen during training, they can be used to test that tree’s performance without needing a separate test dataset.

**OOB Score:**

The OOB score is the average prediction accuracy (or error rate) calculated by testing each training instance only on trees where it was an OOB sample.

**Steps:**

For each record, collect predictions from all trees where it was OOB.
Aggregate the predictions (majority voting for classification, mean for regression).
Compare with the actual value to compute accuracy (or another metric).
The final OOB score is this metric computed over all records. It plays a role similar to a cross-validation estimate, but it is obtained from a single training run.
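The steps above can be sketched by hand with NumPy and plain decision trees. This is an illustrative re-implementation under assumed settings (100 trees, `max_features="sqrt"`, the Breast Cancer dataset), not the code scikit-learn runs internally.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
n_samples, n_classes, n_trees = X.shape[0], 2, 100
rng = np.random.default_rng(0)

# votes[i, c] = number of trees that predicted class c for record i,
# counting only trees for which record i was out-of-bag.
votes = np.zeros((n_samples, n_classes), dtype=int)

for t in range(n_trees):
    # Bootstrap sample: n_samples indices drawn with replacement.
    boot_idx = rng.integers(0, n_samples, size=n_samples)
    oob_mask = np.ones(n_samples, dtype=bool)
    oob_mask[boot_idx] = False  # records never drawn are out-of-bag

    tree = DecisionTreeClassifier(max_features="sqrt", random_state=t)
    tree.fit(X[boot_idx], y[boot_idx])

    # Step 1: collect this tree's predictions for its OOB records only.
    oob_pred = tree.predict(X[oob_mask])
    votes[np.flatnonzero(oob_mask), oob_pred] += 1

# Step 2: aggregate by majority vote over the OOB trees of each record.
has_vote = votes.sum(axis=1) > 0
oob_prediction = votes.argmax(axis=1)

# Step 3: compare with the true labels.
oob_accuracy = (oob_prediction[has_vote] == y[has_vote]).mean()
print(f"Manual OOB accuracy: {oob_accuracy:.4f}")
```

In practice the manual loop is unnecessary: `RandomForestClassifier(oob_score=True)` computes the same kind of estimate during `fit` and exposes it as the `oob_score_` attribute.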

**Advantages:**

Eliminates the need for a separate validation set, saving data for training.
Provides a nearly unbiased estimate of generalization performance, since every prediction comes from trees that did not see that record.
Speeds up evaluation for large datasets.

**Question. No. 5. Compare feature importance analysis in a single Decision Tree vs. a Random Forest.**

**Answer.** Feature importance analysis differs between a single Decision Tree and a Random Forest mainly in terms of reliability, stability, and bias reduction:

1. Single Decision Tree:


Importance is computed based on the total decrease in impurity (e.g., Gini or entropy) brought by each feature across all its splits.
It can be unstable because small changes in the dataset may alter the splits drastically.
It may be biased toward features with many categories or continuous variables.
Interpretability is high, but generalization may be poor.

2. Random Forest:


Importance is averaged across many trees, reducing variance and overfitting risk.
Provides more stable and reliable importance rankings.
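Less sensitive to the quirks of any single split thanks to averaging over bootstrap samples and random feature selection, although impurity-based importance can still favor high-cardinality features; permutation importance is a common cross-check.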
Captures more general patterns by considering diverse decision boundaries.
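The comparison above can be illustrated on the Breast Cancer dataset. The sketch below fits one tree and one forest with arbitrary settings and puts their impurity-based importances side by side; the column names `single_tree` and `random_forest` are made up for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances from both models, one row per feature.
importance = pd.DataFrame(
    {
        "single_tree": tree.feature_importances_,
        "random_forest": forest.feature_importances_,
    },
    index=X.columns,
)

# Count how many features each model actually relied on.
n_used_tree = int((importance["single_tree"] > 0).sum())
n_used_forest = int((importance["random_forest"] > 0).sum())

print(importance.sort_values("random_forest", ascending=False).head(5).round(3))
print(f"features with non-zero importance: tree={n_used_tree}, forest={n_used_forest}")
```

Refitting the single tree with a different `random_state` or on a resampled dataset is a quick way to see how much its ranking can move compared with the forest's.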


**Question. No. 6. Write a Python program to:**

● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()

● Train a Random Forest Classifier

● Print the top 5 most important features based on feature importance scores.

**Answer.** (6.a)

In [None]:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

(6.b)

In [None]:

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

(6.c)

In [None]:

importances = pd.Series(rf.feature_importances_, index=X.columns)

top5 = importances.sort_values(ascending=False).head(5)

print("Top 5 most important features:")
print(top5)

Top 5 most important features:
worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


**Question. No. 7. Write a Python program to:**

● Train a Bagging Classifier using Decision Trees on the Iris dataset

● Evaluate its accuracy and compare with a single Decision Tree

**Answer.** (7.a)

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score


# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Baseline: a single decision tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
dt_acc = accuracy_score(y_test, dt_pred)

# Bagging ensemble: 50 decision trees, each trained on a bootstrap sample
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=50,
    random_state=42
)
bagging.fit(X_train, y_train)
bag_pred = bagging.predict(X_test)
bag_acc = accuracy_score(y_test, bag_pred)

(7.b)

In [None]:

print(f"Single Decision Tree Accuracy: {dt_acc:.4f}")
print(f"Bagging Classifier Accuracy : {bag_acc:.4f}")

Single Decision Tree Accuracy: 1.0000
Bagging Classifier Accuracy : 1.0000


**Question. No. 8. Write a Python program to:**

● Train a Random Forest Classifier

● Tune hyperparameters max_depth and n_estimators using GridSearchCV

● Print the best parameters and final accuracy

**Answer.** (8.a)

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


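# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the data for the final accuracy check
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Base model whose hyperparameters will be tuned
rf = RandomForestClassifier(random_state=42)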

(8.b)

In [None]:

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 5, 10]
}

# Run a 5-fold grid search over the parameter grid
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)

(8.c)

In [None]:

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Final accuracy on the held-out test set
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Final Accuracy: {accuracy:.4f}")

Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy: 1.0000


**Question. No. 9. Write a Python program to:**

● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset

● Compare their Mean Squared Errors (MSE)

**Answer.** (9.a)

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# 1. Load California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# 2. Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Bagging Regressor (with Decision Trees)
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(random_state=42),
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
bagging_reg.fit(X_train, y_train)
bag_pred = bagging_reg.predict(X_test)
bag_mse = mean_squared_error(y_test, bag_pred)

# 4. Random Forest Regressor
rf_reg = RandomForestRegressor(
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
rf_reg.fit(X_train, y_train)
rf_pred = rf_reg.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)

(9.b)

In [None]:

# 5. Print results
print(f"Bagging Regressor MSE       : {bag_mse:.4f}")
print(f"Random Forest Regressor MSE : {rf_mse:.4f}")

Bagging Regressor MSE       : 0.2579
Random Forest Regressor MSE : 0.2577

The two MSE values are almost identical because, in recent scikit-learn versions, RandomForestRegressor considers all features at every split by default (max_features=1.0), which makes it behave like bagged decision trees. Setting max_features to a smaller value would make the two models differ more.


**Question. No. 10. You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.**

You decide to use ensemble techniques to increase model performance.

Explain your step-by-step approach to:

● Choose between Bagging or Boosting

● Handle overfitting

● Select base models

● Evaluate performance using cross-validation

● Justify how ensemble learning improves decision-making in this real-world context.

**Answer.**

**1. Choosing between Bagging and Boosting**

Bagging (e.g., Random Forest) reduces variance by training multiple models in parallel on bootstrap samples. It’s robust to noise and works well if base learners overfit individually.
Boosting (e.g., XGBoost, LightGBM) reduces bias by sequentially improving weak learners, focusing on hard-to-classify cases.
Approach:

Start with Bagging if the dataset is large, features are noisy, and interpretability is important.
Try Boosting if model accuracy is priority and you can handle longer training times, especially if the initial models underfit.
Run a quick baseline with both to see which performs better on validation sets.
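A quick baseline of this kind might look like the sketch below. The loan data is not available here, so `make_classification` with roughly 10% positives serves as a hypothetical stand-in, and the two models use mostly default settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for loan data: about 10% of records are "default".
X, y = make_classification(
    n_samples=3000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

candidates = {
    "bagging (random forest)": RandomForestClassifier(
        n_estimators=200, random_state=0, n_jobs=-1
    ),
    "boosting (gradient boosting)": GradientBoostingClassifier(random_state=0),
}

# Fit each candidate and score it on the validation split with AUC-ROC.
val_auc = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_auc[name] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"{name:<30} validation AUC = {val_auc[name]:.3f}")
```

Whichever family scores better on the validation split becomes the starting point for tuning; a single split is noisy, so small differences should not be over-interpreted.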


**2. Handling Overfitting**

Bagging:

Limit tree depth (max_depth)
Use fewer features per split (max_features)
Increase number of estimators for stability.


Boosting:

Apply learning rate shrinkage
Set max depth of trees
Use subsampling of rows (subsample) and columns (colsample_bytree).


General:

Feature selection or regularization (L1/L2)
Early stopping with validation set.
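The boosting controls listed above can be combined in scikit-learn's `GradientBoostingClassifier`, shown here as a sketch on synthetic data with arbitrary values. Note that `colsample_bytree` is an XGBoost/LightGBM parameter; the closest scikit-learn option used below is `max_features`, which subsamples features per split rather than per tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: no real loan data here).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

booster = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.05,       # shrinkage: smaller steps per round
    max_depth=3,              # shallow trees
    subsample=0.8,            # row subsampling per round
    max_features="sqrt",      # feature subsampling per split
    validation_fraction=0.1,  # internal hold-out used for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
booster.fit(X_train, y_train)

print(f"boosting rounds actually fitted: {booster.n_estimators_} of 500")
print(f"test accuracy: {booster.score(X_test, y_test):.3f}")
```

If early stopping triggers, `n_estimators_` ends up below the configured maximum, which is a simple signal that additional rounds were no longer helping on the held-out fraction.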


**3. Selecting Base Models**

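Start with Decision Trees as base learners because they capture non-linear patterns, need no feature scaling, and work with numeric and encoded categorical features (some implementations also handle missing values natively).
For bagging, trees can be deep: deep trees have low bias, and averaging then reduces their variance.
For boosting, trees should be shallow: shallow trees have low variance, and sequential fitting then reduces their bias.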
Experiment with:

Logistic Regression (for interpretability) as a base model in bagging.
Decision Trees for non-linear patterns.


**4. Evaluating Performance with Cross-Validation**

Use Stratified k-Fold Cross-Validation (e.g., k=5 or 10) to preserve class distribution, since loan default is likely imbalanced.
Metrics:

AUC-ROC for discrimination ability.
F1-score to balance precision and recall; if false positives and false negatives have different costs, a weighted F-beta score or a cost-based metric is a better fit.
Precision/Recall curve if avoiding false negatives is crucial (e.g., approving risky loans).


Cross-validation ensures stable performance estimates across different splits.
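A sketch of this evaluation setup with `StratifiedKFold` and `cross_validate`, again on a synthetic imbalanced dataset standing in for the loan data. The model, `class_weight="balanced"`, and the three scorers are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in with about 10% positives ("default").
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)

# Stratified folds keep the default rate roughly equal in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0, n_jobs=-1
)

metrics = ["roc_auc", "average_precision", "f1"]
results = cross_validate(model, X, y, cv=cv, scoring=metrics)

# Report mean and spread across folds for each metric.
for metric in metrics:
    fold_scores = results[f"test_{metric}"]
    print(f"{metric:>17}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Reporting the spread across folds alongside the mean shows how stable each metric is, which matters when two candidate models are close.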

**5. Justification of Ensemble Learning in Decision-Making**

Loan default prediction impacts financial risk directly.
Ensemble learning:

Combines multiple models to reduce variance (Bagging) or bias (Boosting), improving predictive stability.
Captures complex patterns in customer demographics and transaction histories.
Reduces the risk of relying on a single weak model, leading to more reliable credit decisions.
Improves generalization to new, unseen customers, minimizing costly misclassifications.


In practice, this means fewer risky loans approved and fewer good customers rejected, improving profitability and customer trust.