#Ensemble Learning | Assignment


#1: What is Ensemble Learning in machine learning? Explain the key idea behind it.

:- Ensemble Learning is a machine learning technique where multiple models are combined to make a final prediction.

Key idea:
Instead of relying on one model, ensemble learning combines several base models (often weak learners) so that their collective decision is typically more accurate, stable, and robust than that of a single model.
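
The key idea can be checked with a small calculation. This is only a sketch under a strong assumption: the models make independent errors and each one is correct with the same probability p (real models are correlated, so the gain is smaller in practice). The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n_models is correct.

    Assumes independent models, each correct with probability p
    (an idealised assumption, not a property of real ensembles).
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Each model is only slightly better than chance (p = 0.6),
# yet the majority vote improves as more models are added.
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Odd values of n are used so that a tie in the vote cannot occur.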

#2: What is the difference between Bagging and Boosting?

:- Bagging (Bootstrap Aggregating):
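Trains multiple models independently, each on a bootstrap sample (drawn with replacement) of the training data, and combines their predictions by averaging (regression) or majority voting (classification). It mainly reduces variance.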
Example: Random Forest

Boosting:
Trains models sequentially, where each new model focuses on correcting the errors of the previous ones. It mainly reduces bias.
Example: AdaBoost, Gradient Boosting
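
To make "focuses on correcting the errors" concrete, here is a minimal sketch of one reweighting round in the style of discrete AdaBoost. It assumes labels in {-1, +1}; the helper name `adaboost_reweight` and the toy labels are invented for illustration and are not part of any library.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost-style round: return (alpha, new_weights).

    Assumes labels and predictions are -1 or +1 and weights sum to 1.
    """
    # Weighted error of the current weak learner
    err = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    if not 0 < err < 0.5:
        raise ValueError("weak learner must beat chance and not be perfect")
    # Vote strength of this learner in the final ensemble
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified points (y * h = -1) are scaled up, correct ones down
    unnorm = [w * math.exp(-alpha * y * h)
              for w, y, h in zip(weights, y_true, y_pred)]
    total = sum(unnorm)
    return alpha, [u / total for u in unnorm]


y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]   # the second point is misclassified
weights = [0.2] * 5           # start from uniform weights

alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print("alpha:", alpha)
print("new weights:", new_weights)
```

After the update the misclassified point carries half of the total weight, so the next weak learner is pushed to get it right. Bagging has no such step: its models never see each other's errors.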

#3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

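:- Bootstrap sampling draws a sample of the same size as the original dataset by sampling with replacement, so some rows appear several times and others do not appear at all.

Role in Bagging / Random Forest: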

Each model (tree) is trained on a different bootstrap sample.

This creates diversity among the models, which lowers the variance of the combined prediction and so reduces overfitting.

The final prediction is made by averaging (regression) or voting (classification), which improves accuracy and stability.
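
A bootstrap sample can be drawn with the standard library alone. The sketch below is illustrative (the helper name `bootstrap_indices` is made up); it draws row indices with replacement and reports how many distinct rows ended up in the sample.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices from range(n) with replacement."""
    return [rng.randrange(n) for _ in range(n)]


rng = random.Random(42)   # fixed seed so the run is repeatable
n = 1000
idx = bootstrap_indices(n, rng)

unique_fraction = len(set(idx)) / n
print("rows in sample:", len(idx))
print("fraction of distinct rows:", unique_fraction)
```

For large n the fraction of distinct rows is close to 1 - 1/e, about 63%, so every tree sees a noticeably different version of the data.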

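#4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

:- Out-of-Bag (OOB) samples are the training rows that were not drawn into a given model's bootstrap sample (on average about 37% of the rows for each model).

OOB score usage:

Each model predicts only the samples that were out-of-bag for it.

For each data point, the predictions of the models for which it was OOB are aggregated.

The OOB score estimates model performance without a separate validation set, giving an approximately unbiased (often slightly pessimistic) estimate of generalization accuracy.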
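
In scikit-learn the OOB estimate is available by setting `oob_score=True` on a Random Forest. A minimal sketch, assuming the Breast Cancer dataset and 200 trees (both chosen only for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each row using only
# the trees whose bootstrap sample did not contain that row.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

print("OOB accuracy estimate:", rf.oob_score_)
```

No train-test split is needed here, because each row is only scored by trees that never saw it during training.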

#5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.


:- Decision Tree:
Feature importance is based on how much each feature reduces impurity in that single tree.
It is unstable: small changes in the data can alter the importances substantially.

Random Forest:
Feature importance is averaged across many trees, each trained on a different bootstrap sample.
It is more reliable and stable, capturing overall feature influence and reducing the dependence on the quirks of any single tree.
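
The difference can be seen by looking at the individual trees inside a fitted forest. The sketch below assumes the Breast Cancer dataset and 100 trees (illustrative choices); it compares the per-tree importances with their average.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(data.data, data.target)

# One row of importances per tree: shape (n_trees, n_features)
per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])

forest_mean = per_tree.mean(axis=0)   # averaged importance per feature
per_tree_std = per_tree.std(axis=0)   # spread across the individual trees

# Show the three features with the highest averaged importance
top = np.argsort(forest_mean)[::-1][:3]
for i in top:
    print(f"{data.feature_names[i]}: mean={forest_mean[i]:.3f}, "
          f"std across trees={per_tree_std[i]:.3f}")
```

A large spread across trees relative to the mean shows how much a single tree's ranking can differ from the averaged one.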

#6: Write a Python program to: ● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer() ● Train a Random Forest Classifier ● Print the top 5 most important features based on feature importance scores. (Include your Python code and output in the code box below.)

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Train Random Forest Classifier
rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)

# Get feature importance
importances = rf.feature_importances_
feature_importance_df = pd.DataFrame({
    "Feature": feature_names,
    "Importance": importances
})

# Sort and select top 5 features
top_5 = feature_importance_df.sort_values(by="Importance", ascending=False).head(5)

print(top_5)


                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


#7: Write a Python program to: ● Train a Bagging Classifier using Decision Trees on the Iris dataset ● Evaluate its accuracy and compare with a single Decision Tree (Include your Python code and output in the code box below.)

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
data = load_iris()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_preds = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_preds)

# Bagging Classifier with Decision Trees
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bagging.fit(X_train, y_train)
bag_preds = bagging.predict(X_test)
bag_accuracy = accuracy_score(y_test, bag_preds)

print("Decision Tree Accuracy:", dt_accuracy)
print("Bagging Classifier Accuracy:", bag_accuracy)


Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0


#8: Write a Python program to: ● Train a Random Forest Classifier ● Tune hyperparameters max_depth and n_estimators using GridSearchCV ● Print the best parameters and final accuracy (Include your Python code and output in the code box below.)

In [3]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Random Forest model
rf = RandomForestClassifier(random_state=42)

# Hyperparameter grid
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5, 10]
}

# GridSearchCV
grid = GridSearchCV(rf, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# Best model
best_model = grid.best_estimator_

# Final evaluation
y_pred = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid.best_params_)
print("Final Accuracy:", final_accuracy)


Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy: 0.9707602339181286


#9: Write a Python program to: ● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset ● Compare their Mean Squared Errors (MSE) (Include your Python code and output in the code box below.)

In [4]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Bagging Regressor
bagging = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42
)
bagging.fit(X_train, y_train)
bag_preds = bagging.predict(X_test)
bag_mse = mean_squared_error(y_test, bag_preds)

# Random Forest Regressor
rf = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_preds)

print("Bagging Regressor MSE:", bag_mse)
print("Random Forest Regressor MSE:", rf_mse)


Bagging Regressor MSE: 0.25787382250585034
Random Forest Regressor MSE: 0.25650512920799395


#10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance. Explain your step-by-step approach to: ● Choose between Bagging or Boosting ● Handle overfitting ● Select base models ● Evaluate performance using cross-validation ● Justify how ensemble learning improves decision-making in this real-world context.


:- 1. Choosing between Bagging and Boosting

If the data is noisy and the base models have high variance (for example, deep trees) → choose Bagging (e.g., Random Forest).

If the data has complex patterns that simple models underfit, so bias is high → choose Boosting (e.g., XGBoost, AdaBoost).

For loan default prediction, Boosting is often preferred because it concentrates on hard-to-classify cases such as defaulters, which can improve recall. It is also more sensitive to noisy labels, so both options should be compared with cross-validation.

2. Handling Overfitting

Use cross-validation to monitor generalization.

Apply regularization (limit tree depth; use a small learning rate in Boosting).

In Bagging, overfitting is reduced naturally by averaging multiple models.

In Boosting, control overfitting using early stopping and shallow trees.
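
Early stopping with shallow trees can be sketched with scikit-learn's `GradientBoostingClassifier`. The loan data is not available here, so the example assumes a synthetic, imbalanced dataset from `make_classification` as a stand-in; all parameter values are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical stand-in for loan data: about 10% positives ("defaults")
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=42,
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.05,       # small steps act as regularization
    max_depth=2,              # shallow trees (weak learners)
    validation_fraction=0.2,  # held out internally for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
gb.fit(X, y)

print("Boosting rounds actually used:", gb.n_estimators_)
```

If `n_estimators_` is well below the upper bound, training stopped because the internal validation score no longer improved.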

3. Selecting Base Models

Use Decision Trees as base learners:

They capture non-linear relationships.

Work well with mixed data (demographic + transaction).

Use shallow trees (weak learners) for Boosting.

Use full or moderately deep trees for Bagging.

4. Evaluating Performance (Cross-Validation)

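Use stratified k-fold cross-validation so that every fold keeps the same default rate and the estimate is stable.

Evaluate using:

Accuracy (can be misleading when defaults are rare)

Precision & Recall (important for defaults)

ROC-AUC (threshold-independent; with very rare defaults, also check precision-recall AUC)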

Compare ensemble results with a single model baseline.
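
The evaluation step could look like the sketch below. It assumes a synthetic, imbalanced dataset from `make_classification` in place of the real loan data, and compares a single Decision Tree baseline with a Random Forest; the model choices and parameters are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for loan data: about 10% positives ("defaults")
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=42,
)

# Stratified folds keep the default rate the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "roc_auc"]

models = {
    "single tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Average each metric over the five folds
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
    print(name, results[name])
```

Reporting several metrics side by side makes it visible when high accuracy hides poor recall on the default class.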

5. Why Ensemble Learning Improves Decision-Making

Combines multiple models → more robust predictions.

Reduces variance (Bagging) and bias (Boosting).

Improves detection of high-risk customers, reducing financial loss.

Leads to more reliable and data-driven credit decisions, although fairness still has to be checked separately.