In [None]:
'''Question 1: What is Ensemble Learning in machine learning? Explain the key idea
behind it.
Answer: Ensemble Learning is a machine learning technique in which multiple models (called base learners or weak learners) are trained and combined to make a single,
stronger predictive model.

Key Idea Behind Ensemble Learning

The core idea is that a group of models can perform better than any single model alone. Different models may make different errors; by combining them,
these errors can cancel out, leading to better accuracy, robustness, and generalization.

Question 2: What is the difference between Bagging and Boosting?
Answer: Bagging
- Models are trained independently (and can be trained in parallel)
- Uses bootstrap sampling (sampling with replacement)
- No dependency among models
- Reduces variance by averaging (or voting over) the predictions
- All data points are treated equally

Boosting
- Models are trained sequentially
- Uses weighted sampling or reweighting of the training data
- Each model depends on the previous ones
- Reduces bias by combining weak learners into a stronger one
- Misclassified data points are given higher weight

Question 3: What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?
Answer: Bootstrap sampling is a resampling technique in which multiple new training datasets are created by randomly sampling from the original dataset with replacement.
Each bootstrap sample has the same size as the original dataset, but some data points may appear multiple times while others may be left out.

Role of Bootstrap Sampling in Bagging

In Bagging (Bootstrap Aggregating) methods such as Random Forest, bootstrap sampling plays a crucial role:

Creates diverse training sets
- Each model (e.g., each decision tree in a Random Forest) is trained on a different bootstrap sample.
- This introduces variation among models.

Reduces variance
- Individual models may overfit, but averaging their predictions reduces variance and improves generalization.

Improves model robustness
- Since models see slightly different data, the ensemble becomes less sensitive to noise.

Out-of-Bag (OOB) estimation
- Each bootstrap sample contains, on average, about 63% of the distinct original data points (1 - 1/e ≈ 0.632 for large datasets).
- The remaining 37% or so (the out-of-bag samples) can be used to estimate model performance without a separate validation set.

Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?
Answer: Out-of-Bag (OOB) samples are the data points from the original dataset that are not selected in a particular bootstrap sample during training.

When bootstrap sampling is used (as in Bagging and Random Forest):

- Each model is trained on a bootstrap sample of the dataset.
- On average, about 63% of the distinct data points are included in a bootstrap sample.
- The remaining 37% or so are the Out-of-Bag (OOB) samples for that model.

Steps to compute the OOB score:
- For each data point, collect predictions only from the models for which this point was an OOB sample.
- Aggregate these predictions:
    - Majority voting for classification
    - Averaging for regression
- Compare the aggregated prediction with the true label.
- Compute an evaluation metric:
    - Accuracy for classification
    - Mean Squared Error (MSE) or R² for regression

Question 5: Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.
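Answer: Single Decision Tree
- Computed from the impurity reduction (Gini or entropy) at the splits of one tree
- Unstable: small changes in the data can greatly alter the importances
- Can concentrate importance on one dominant feature and hide correlated ones
- High risk of overfitting, which distorts the importance values

Random Forest
- Impurity reduction averaged across many trees
- More stable and reliable because of averaging
- Random feature subsets at each split give correlated features a chance to contribute, so importance is spread more evenly
- Less overfitting, so the importances are more trustworthy (impurity-based scores can still favor high-cardinality features)'''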
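
The 63% / 37% split quoted in Questions 3 and 4 can be checked empirically. The following is a minimal sketch, not part of the original assignment: the helper name `bootstrap_unique_fraction` and the sample sizes are illustrative assumptions. It draws bootstrap samples with NumPy and measures what fraction of the original points appear at least once.

```python
import numpy as np

def bootstrap_unique_fraction(n_samples, n_trials=200, seed=0):
    """Average fraction of distinct original points present in a bootstrap sample.

    Hypothetical helper written for illustration only.
    """
    rng = np.random.default_rng(seed)
    fractions = []
    for _ in range(n_trials):
        # One bootstrap sample: n_samples indices drawn with replacement
        idx = rng.integers(0, n_samples, size=n_samples)
        fractions.append(np.unique(idx).size / n_samples)
    return float(np.mean(fractions))

in_bag = bootstrap_unique_fraction(1000)
print("In-bag fraction:", round(in_bag, 3))
print("Out-of-bag fraction:", round(1 - in_bag, 3))
print("Theoretical limit 1 - 1/e:", round(1 - float(np.exp(-1)), 3))
```

The simulated in-bag fraction should land close to 1 - 1/e, which is where the "about 63%" figure comes from.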

In [1]:
'''Question 6: Write a Python program to:
● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.'''
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Train a Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importance scores
importances = rf.feature_importances_

# Create a DataFrame for better visualization
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Sort features by importance (descending)
feature_importance_df = feature_importance_df.sort_values(
    by='Importance', ascending=False
)

# Print top 5 most important features
print("Top 5 Most Important Features:")
print(feature_importance_df.head(5))

Top 5 Most Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850
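
The OOB estimate described in Question 4 can be obtained for a forest like the one above without holding out a test set. This is a hedged sketch rather than part of the original answer: it assumes scikit-learn's `oob_score=True` option, and the variable name `rf_oob` and the choice of 200 trees are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each training sample using only
# the trees whose bootstrap sample did not contain that sample
rf_oob = RandomForestClassifier(
    n_estimators=200,   # illustrative choice, not tuned
    oob_score=True,
    random_state=42
)
rf_oob.fit(X, y)

# For classifiers, oob_score_ is the accuracy of the OOB predictions
print("OOB accuracy estimate:", rf_oob.oob_score_)
```

Because every sample is scored only by trees that never saw it, this estimate plays a role similar to a validation score, although it is computed on the training data.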


In [2]:
'''Question 7: Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree'''
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_predictions = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_predictions)

# Train a Bagging Classifier using Decision Trees
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    random_state=42
)
bagging.fit(X_train, y_train)
bag_predictions = bagging.predict(X_test)
bag_accuracy = accuracy_score(y_test, bag_predictions)

# Print results
print("Decision Tree Accuracy:", dt_accuracy)
print("Bagging Classifier Accuracy:", bag_accuracy)


Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0
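
Both models reach 1.0 on this single 45-sample test split, so the comparison above cannot separate them. One possible follow-up, sketched here under the assumption that 10-fold stratified cross-validation is an acceptable protocol, is to compare mean accuracy over several folds. The variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

tree = DecisionTreeClassifier(random_state=42)
# The base estimator is passed positionally so the call does not depend on
# whether the keyword is named `estimator` or `base_estimator`
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=42)

tree_scores = cross_val_score(tree, X, y, cv=cv, scoring='accuracy')
bagged_scores = cross_val_score(bagged, X, y, cv=cv, scoring='accuracy')

print("Decision Tree CV accuracy: %.3f +/- %.3f" % (tree_scores.mean(), tree_scores.std()))
print("Bagging CV accuracy:       %.3f +/- %.3f" % (bagged_scores.mean(), bagged_scores.std()))
```

On a dataset as small and clean as Iris the two may still score very similarly. The fold-to-fold spread at least shows how much of any difference is noise.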


In [3]:
'''Question 8: Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy'''
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define the Random Forest model
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20]
}

# Perform Grid Search with Cross-Validation
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Train the model using GridSearchCV
grid_search.fit(X_train, y_train)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_model.predict(X_test)

# Calculate final accuracy
final_accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Hyperparameters:", grid_search.best_params_)
print("Final Test Accuracy:", final_accuracy)


Best Hyperparameters: {'max_depth': None, 'n_estimators': 200}
Final Test Accuracy: 0.9707602339181286
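
The best parameters alone do not show how close the other candidates were. As a sketch, the cross-validation table stored in `cv_results_` can be inspected. The grid here is deliberately smaller than the one above to keep the run short, and the names `search` and `results` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Reduced grid (assumption: chosen only to keep this sketch quick)
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={'n_estimators': [50, 100], 'max_depth': [None, 5]},
    cv=5,
    scoring='accuracy',
)
search.fit(X_train, y_train)

# One row per parameter combination, best-ranked first
columns = ['param_n_estimators', 'param_max_depth',
           'mean_test_score', 'std_test_score', 'rank_test_score']
results = pd.DataFrame(search.cv_results_)[columns].sort_values('rank_test_score')

print(results)
print("Best CV accuracy:", search.best_score_)
```

If the mean scores differ by less than their standard deviations, the "best" setting is probably not meaningfully better than the others.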


In [4]:
'''Question 9: Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
● Compare their Mean Squared Errors (MSE)'''
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Bagging Regressor with Decision Trees
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=100,
    random_state=42
)
bagging_reg.fit(X_train, y_train)

# Predict and calculate MSE for Bagging Regressor
bagging_predictions = bagging_reg.predict(X_test)
bagging_mse = mean_squared_error(y_test, bagging_predictions)

# Train Random Forest Regressor
rf_reg = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)
rf_reg.fit(X_train, y_train)

# Predict and calculate MSE for Random Forest Regressor
rf_predictions = rf_reg.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_predictions)

# Print results
print("Bagging Regressor MSE:", bagging_mse)
print("Random Forest Regressor MSE:", rf_mse)


Bagging Regressor MSE: 0.2568358813508342
Random Forest Regressor MSE: 0.25650512920799395
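
The two MSEs are nearly identical. A likely reason is that recent scikit-learn versions default `RandomForestRegressor` to `max_features=1.0`, so every feature is considered at each split and the forest behaves much like bagged decision trees. The sketch below varies `max_features` to show the feature-subsampling part of Random Forest. It uses the synthetic `make_friedman1` data instead of California Housing (an assumption made so the example runs without a download), so its numbers are not comparable to the output above.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (stand-in for the housing dataset)
X, y = make_friedman1(n_samples=1000, n_features=10, noise=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

mse_by_setting = {}
for max_features in (1.0, 1 / 3):
    # max_features=1.0 uses all features per split (bagging-like);
    # 1/3 considers a random third of the features at each split
    model = RandomForestRegressor(
        n_estimators=100, max_features=max_features, random_state=42
    )
    model.fit(X_train, y_train)
    mse_by_setting[max_features] = mean_squared_error(y_test, model.predict(X_test))

for setting, mse in mse_by_setting.items():
    print("max_features=%.2f -> MSE %.4f" % (setting, mse))
```

Whether the smaller `max_features` helps depends on the data, so it is usually treated as a hyperparameter to tune rather than a fixed choice.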


In [5]:
'''Question 10: You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world
context.'''
# Import required libraries
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import numpy as np

# -------------------------------------------------
# 1. Create a synthetic loan default dataset
# -------------------------------------------------
X, y = make_classification(
    n_samples=5000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    n_classes=2,
    weights=[0.7, 0.3],   # class imbalance (non-default vs default)
    random_state=42
)

# -------------------------------------------------
# 2. Define base and ensemble models
# -------------------------------------------------
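# Base model: a shallow tree (max_depth=5 limits overfitting) used as the baseline
dt = DecisionTreeClassifier(
    max_depth=5,
    random_state=42
)

# Bagging-based ensemble (Random Forest): averages many trees to reduce variance
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=8,
    random_state=42
)

# Boosting-based ensemble: shallow trees added sequentially to reduce bias;
# the small learning_rate shrinks each tree's contribution to curb overfitting
gb = GradientBoostingClassifier(
    n_estimators=150,
    learning_rate=0.05,
    max_depth=3,
    random_state=42
)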

# -------------------------------------------------
# 3. Cross-validation setup
# -------------------------------------------------
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# -------------------------------------------------
# 4. Evaluate models using ROC-AUC
# -------------------------------------------------
dt_scores = cross_val_score(dt, X, y, cv=cv, scoring='roc_auc')
rf_scores = cross_val_score(rf, X, y, cv=cv, scoring='roc_auc')
gb_scores = cross_val_score(gb, X, y, cv=cv, scoring='roc_auc')

# -------------------------------------------------
# 5. Print results
# -------------------------------------------------
print("Decision Tree ROC-AUC:", np.mean(dt_scores))
print("Random Forest (Bagging) ROC-AUC:", np.mean(rf_scores))
print("Gradient Boosting ROC-AUC:", np.mean(gb_scores))




Decision Tree ROC-AUC: 0.8824812505013269
Random Forest (Bagging) ROC-AUC: 0.9730804289868275
Gradient Boosting ROC-AUC: 0.96505493330973
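
For the "handle overfitting" part of the question, one option with boosting is early stopping. The following is a sketch under stated assumptions, not the approach used above: the smaller synthetic dataset, the `subsample`, `validation_fraction` and `n_iter_no_change` values, and the name `gb_es` are all illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Smaller synthetic loan-default-like dataset (assumption, for speed)
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10, n_redundant=5,
    weights=[0.7, 0.3], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

gb_es = GradientBoostingClassifier(
    n_estimators=500,          # upper bound; early stopping may use fewer trees
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,             # each tree is fitted on a random 80% of the rows
    validation_fraction=0.1,   # internal hold-out used to monitor the loss
    n_iter_no_change=10,       # stop after 10 rounds without improvement
    random_state=42,
)
gb_es.fit(X_train, y_train)

test_auc = roc_auc_score(y_test, gb_es.predict_proba(X_test)[:, 1])
print("Trees actually fitted:", gb_es.n_estimators_)
print("Held-out ROC-AUC:", test_auc)
```

Early stopping lets the validation loss decide how many trees to keep, which is one way to limit overfitting without hand-tuning `n_estimators`.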
