# Ensemble Learning | Assignment

**Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.**

**Answer:**

Ensemble Learning is a technique where we combine multiple ML models to make a better and more stable prediction than using a single model.

Key idea:

Different models learn different patterns and make different mistakes

Combining their outputs → individual errors partly cancel out

Improves performance by reducing:

Variance (Bagging)

Bias (Boosting)
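The error-reduction idea can be checked with a little arithmetic. This is a minimal sketch under an idealized assumption: every model is correct with the same probability and the models make their errors independently. Real models are correlated, so the gain in practice is smaller. The function name `majority_vote_accuracy` is illustrative and not part of any library.

```python
from math import comb

def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of independent voters is correct.

    Assumes an odd number of models so that ties cannot occur.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )

single = majority_vote_accuracy(1, 0.7)
ensemble = majority_vote_accuracy(11, 0.7)
print(f"1 model:   {single:.3f}")
print(f"11 models: {ensemble:.3f}")
```

Under these assumptions the majority vote is noticeably more accurate than any single voter, which is the intuition behind voting and averaging in ensembles.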

**Question 2: What is the difference between Bagging and Boosting?**

**Answer:**

| **Feature** | Bagging | Boosting |
|---|---|---|
| **Training** | Models train in parallel | Models train sequentially |
| **Focus** | Reduce variance | Reduce bias |
| **Data** | Uses bootstrap samples | Focuses more on errors/residuals |
| **Example** | Random Forest | AdaBoost, Gradient Boosting, XGBoost |

**Bagging = stable + less overfitting**

**Boosting = often higher accuracy, but may overfit if not controlled**
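The contrast can be tried side by side. The sketch below is only an illustration: the Breast Cancer dataset, 50 estimators, and 5-fold cross-validation are arbitrary choices, not part of the assignment. The base tree is passed positionally to `BaggingClassifier` so the call does not depend on the keyword name used by a particular scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: unpruned trees, each fit independently on its own bootstrap sample.
bagging = BaggingClassifier(
    DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0
)

# Boosting: AdaBoost's default weak learner is a depth-1 tree (a stump);
# each round gives more weight to the samples earlier rounds got wrong.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

bagging_acc = cross_val_score(bagging, X, y, cv=5).mean()
boosting_acc = cross_val_score(boosting, X, y, cv=5).mean()
print(f"Bagging  CV accuracy: {bagging_acc:.3f}")
print(f"Boosting CV accuracy: {boosting_acc:.3f}")
```

Which one scores higher depends on the dataset and settings, so the numbers from one run should not be read as a general ranking.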

**Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**

**Answer:**

Bootstrap sampling means we create multiple training datasets by random sampling with replacement from the original dataset.

Role in Bagging / Random Forest:

Each tree gets a slightly different dataset

Trees become less correlated

Final prediction becomes stronger by averaging/voting

Result: lower variance + better generalization
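Sampling with replacement can be sketched with the standard library alone. For a dataset of n rows, the chance that a given row appears at least once in a bootstrap sample of size n approaches 1 - 1/e, roughly 63%. The sample size and seed below are arbitrary illustration values.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
n_samples = 10_000
indices = list(range(n_samples))

# One bootstrap sample: draw n_samples row indices *with replacement*.
bootstrap = rng.choices(indices, k=n_samples)

in_sample = set(bootstrap)
left_out = set(indices) - in_sample

unique_fraction = len(in_sample) / n_samples
left_out_fraction = len(left_out) / n_samples
print(f"Unique rows in the bootstrap sample: {unique_fraction:.3f}")
print(f"Rows never drawn:                    {left_out_fraction:.3f}")
```

Because each tree sees a different subset of rows (with some rows repeated), the trees disagree in useful ways, and averaging them lowers variance.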

**Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?**

**Answer:**

In bootstrap sampling, some data points are not selected in a bootstrap sample.
These unused points are called Out-of-Bag (OOB) samples.

OOB score usage:

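Each sample is predicted using only the trees whose bootstrap sample did not include it

Accuracy is calculated from those OOB predictions

Works like internal validation, so no separate validation set is needed

In scikit-learn's Random Forest: set oob_score=True, then read the oob_score_ attribute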
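A minimal sketch of reading the OOB score in scikit-learn follows. The dataset and the number of trees are illustrative choices; note that the model is fit on all rows, with no train/test split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# bootstrap=True is the default and is required for OOB scoring.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

# For classifiers, oob_score_ is the accuracy of the OOB predictions.
print("OOB accuracy:", rf.oob_score_)
```

With too few trees, some rows may never be out-of-bag, so the estimate becomes more reliable as `n_estimators` grows.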

**Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.**

**Answer:**

**Single Decision Tree:**

Feature importance is based on splits in one tree

Can be unstable

Changes a lot if data changes slightly

**Random Forest:**

Importance is averaged across many trees

More stable + reliable

Less noise effect

Random Forest feature importance is usually more trustworthy on real datasets, because averaging over many trees smooths out the quirks of any single tree.
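One way to see the difference is to count how many features each model credits at all. A single tree gives zero importance to every feature it never splits on, while a forest spreads importance across more features. The sketch below uses the Breast Cancer dataset and 200 trees as illustrative choices; both models report impurity-based importances.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Importances are normalized, so each vector sums to 1.
tree_used = int(np.count_nonzero(tree.feature_importances_))
forest_used = int(np.count_nonzero(forest.feature_importances_))

print("Features with non-zero importance (single tree):", tree_used)
print("Features with non-zero importance (forest):     ", forest_used)
```

Re-running with a different `random_state` for the single tree is a quick way to observe how much its ranking can shift.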

**Question 6: Write a Python program to:**

**● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()**

**● Train a Random Forest Classifier**

**● Print the top 5 most important features based on feature importance scores.**

(Include your Python code and output in the code box below.)

**Answer:** 👇


In [10]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Random Forest
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Feature importance
importances = rf.feature_importances_
feat_imp = pd.DataFrame({
    "Feature": feature_names,
    "Importance": importances
}).sort_values(by="Importance", ascending=False)

print("Top 5 Important Features:")
print(feat_imp.head(5))


Top 5 Important Features:
                 Feature  Importance
23            worst area    0.128549
27  worst concave points    0.128343
22       worst perimeter    0.127079
7    mean concave points    0.119801
20          worst radius    0.069273


**Question 7: Write a Python program to:**

**● Train a Bagging Classifier using Decision Trees on the Iris dataset**

**● Evaluate its accuracy and compare with a single Decision Tree**

(Include your Python code and output in the code box below.)

**Answer:** 👇

In [11]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
dt_acc = accuracy_score(y_test, dt_pred)

# Bagging Classifier
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    random_state=42
)
bag.fit(X_train, y_train)
bag_pred = bag.predict(X_test)
bag_acc = accuracy_score(y_test, bag_pred)

print("Decision Tree Accuracy:", dt_acc)
print("Bagging Classifier Accuracy:", bag_acc)


Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0
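Both models score 1.0 here because the Iris test split has only 30 rows, so a single split cannot separate them. As a hedged follow-up sketch, cross-validation over the full dataset gives a steadier comparison; the 5-fold stratified setup is an assumption, not part of the original task.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

dt = DecisionTreeClassifier(random_state=42)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=42)

# Same folds for both models, so the comparison is like for like.
dt_scores = cross_val_score(dt, X, y, cv=cv)
bag_scores = cross_val_score(bag, X, y, cv=cv)
print("Decision Tree CV accuracy:", dt_scores.mean())
print("Bagging CV accuracy:      ", bag_scores.mean())
```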


**Question 8: Write a Python program to:**

**● Train a Random Forest Classifier**

**● Tune hyperparameters max_depth and n_estimators using GridSearchCV**

**● Print the best parameters and final accuracy**

(Include your Python code and output in the code box below.)

**Answer:**

In [12]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Model
rf = RandomForestClassifier(random_state=42)

# Params
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7, None]
}

grid = GridSearchCV(rf, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)

# Best model accuracy
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print("Final Accuracy:", acc)


Best Parameters: {'max_depth': 7, 'n_estimators': 200}
Final Accuracy: 0.9649122807017544


**Question 9: Write a Python program to:**

**● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset**

**● Compare their Mean Squared Errors (MSE)**

(Include your Python code and output in the code box below.)

**Answer:**


In [13]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Bagging Regressor
bag_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=100,
    random_state=42
)
bag_reg.fit(X_train, y_train)
bag_pred = bag_reg.predict(X_test)
bag_mse = mean_squared_error(y_test, bag_pred)

# Random Forest Regressor
rf_reg = RandomForestRegressor(
    n_estimators=200,
    random_state=42
)
rf_reg.fit(X_train, y_train)
rf_pred = rf_reg.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)

print("Bagging Regressor MSE:", bag_mse)
print("Random Forest Regressor MSE:", rf_mse)


Bagging Regressor MSE: 0.25592438609899626
Random Forest Regressor MSE: 0.2539759249192041


**Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance.**

**Explain your step-by-step approach to:**

**● Choose between Bagging or Boosting**

**● Handle overfitting**

**● Select base models**

**● Evaluate performance using cross-validation**

**● Justify how ensemble learning improves decision-making in this real-world context.**

(Include your Python code and output in the code box below.)

**Answer:**

**Step 1: Choose Bagging or Boosting**

If the model is overfitting / variance is high → use Bagging (Random Forest)

If accuracy is low / bias is high → use Boosting (XGBoost/LightGBM/CatBoost)

For loan default: Boosting often works better on tabular datasets, but confirm this with cross-validation.

**Step 2: Handle overfitting**

Use cross-validation

Control depth: max_depth

Use regularization:

Random Forest → limit depth + min_samples_leaf

XGBoost → lambda, alpha, early stopping
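Early stopping can be sketched without installing XGBoost. The example below swaps in scikit-learn's `GradientBoostingClassifier` as a stand-in, which is a substitution and not the XGBoost API; the dataset and all hyperparameter values are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.05,
    max_depth=2,              # shallow trees as weak learners
    validation_fraction=0.2,  # held out from the training data internally
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
gb.fit(X_train, y_train)

print("Boosting rounds actually used:", gb.n_estimators_)
print("Test accuracy:", gb.score(X_test, y_test))
```

If training stops well before the upper bound, the later rounds were not improving the validation score, which is the overfitting signal early stopping is meant to catch.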

**Step 3: Select base models**

Bagging base model: DecisionTreeClassifier

Boosting base model: weak trees (stumps / shallow trees)

**Step 4: Evaluate using Cross Validation**

Use:

StratifiedKFold

Metrics:

ROC-AUC

F1-score

Recall (important: catch defaulters)

**Step 5: Why ensemble improves decision-making**

More stable predictions

Reduced noise impact

Better accuracy → fewer wrong loans approved

Helps risk team reduce losses

In [14]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

# Using breast cancer as example binary dataset (like default / no default)
data = load_breast_cancer()
X, y = data.data, data.target

model = RandomForestClassifier(n_estimators=200, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print("Cross-Validation ROC-AUC Scores:", scores)
print("Mean ROC-AUC:", scores.mean())


Cross-Validation ROC-AUC Scores: [0.99868981 0.97658041 0.98578042 0.99404762 0.99279007]
Mean ROC-AUC: 0.9895776684222476
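The run above reports only ROC-AUC on a fairly balanced dataset, while Step 4 also lists F1 and Recall. As a hedged sketch, `cross_validate` can score all three at once. The data here is synthetic: `make_classification` with about 10% positives is a stand-in for loan defaults, and `class_weight="balanced"` is one possible way to handle the imbalance, not a recommendation from the original answer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data: about 10% positives ("defaults").
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=42,
)

model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# A list of scorer names returns one "test_<name>" array per metric.
results = cross_validate(model, X, y, cv=cv, scoring=["roc_auc", "f1", "recall"])
for metric in ("roc_auc", "f1", "recall"):
    print(f"Mean {metric}: {results[f'test_{metric}'].mean():.3f}")
```

On imbalanced data, recall on the default class often lags far behind ROC-AUC, which is why the two are worth reporting together.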
