Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.

Ans: Ensemble Learning is a machine learning technique where multiple models (called base or weak learners) are combined to produce a single, stronger and more accurate model.

The key idea is that a group of diverse models, when combined, can perform better than any individual model.

Each model may make mistakes, but as long as those mistakes are not strongly correlated, aggregating the predictions tends to cancel them out, leading to better accuracy, stability, and generalization.
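
The error-cancelling effect can be illustrated with a small simulation. This is only a sketch under an optimistic assumption: every model is right 60% of the time and the models' errors are fully independent. The number of models, the sample count, and the 60% figure are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_models = 25        # hypothetical number of base models
n_samples = 10_000   # hypothetical number of test points
p_correct = 0.6      # assumed accuracy of each individual model

# correct[i, j] is True when model i classifies sample j correctly.
# Independence between models is the key (and optimistic) assumption here.
correct = rng.random((n_models, n_samples)) < p_correct

single_accuracy = correct[0].mean()
# For a binary problem, the majority vote is right when more than half
# of the models are right on that sample.
majority_accuracy = (correct.sum(axis=0) > n_models / 2).mean()

print(f"Single model accuracy:  {single_accuracy:.3f}")
print(f"Majority vote accuracy: {majority_accuracy:.3f}")
```

In practice the models' errors are correlated to some degree, so the real gain is usually smaller than in this idealized setting.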

Question 2: What is the difference between Bagging and Boosting?

Ans: Both Bagging and Boosting are ensemble learning techniques used to improve model performance, but they differ in how models are trained and combined.

1. Bagging (Bootstrap Aggregating)

Trains multiple models independently

Uses bootstrap samples (random sampling with replacement)

Combines predictions by averaging or majority voting


2. Boosting

Trains models sequentially

Each new model focuses more on previous errors

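Misclassified points receive higher weight (as in AdaBoost), or the new model is fit to the remaining errors (as in Gradient Boosting)

Combines predictions by a weighted vote or weighted sum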
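
The reweighting step in Boosting can be sketched with one round of the discrete AdaBoost update. AdaBoost is used here only as one example of Boosting; the labels and predictions below are made-up values, not output from a real model.

```python
import numpy as np

# Hypothetical labels and predictions from one weak learner (illustrative only)
y_true = np.array([1, 1, -1, -1, 1, -1, 1, -1])
y_pred = np.array([1, -1, -1, -1, 1, 1, 1, -1])

n = len(y_true)
weights = np.full(n, 1.0 / n)    # start with uniform sample weights
missed = y_pred != y_true        # True where the learner was wrong

error = weights[missed].sum()                 # weighted error of this learner
alpha = 0.5 * np.log((1 - error) / error)     # learner's weight in the final vote

# Raise the weights of misclassified points, lower the rest, then renormalize
weights = weights * np.exp(np.where(missed, alpha, -alpha))
weights = weights / weights.sum()

print("Weighted error:", error)
print("Learner weight (alpha):", round(float(alpha), 3))
print("Updated sample weights:", np.round(weights, 3))
```

The next weak learner would be trained with these updated weights, so the two misclassified points count for more than the six that were classified correctly.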

Question 3: What is Bootstrap Sampling and its Role in Bagging (Random Forest)?

Ans: Bootstrap sampling is a resampling technique where:

Multiple datasets are created by randomly sampling from the original dataset with replacement

Each bootstrap sample has the same size as the original dataset

Some observations may appear multiple times, while others may be left out

In Random Forest:

Each tree is trained on a different bootstrap sample

This introduces diversity among trees

Diversity reduces variance and helps limit overfitting once the trees' predictions are averaged
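
The sampling step described above can be sketched with NumPy. The dataset size of 1000 is a hypothetical value, and row indices stand in for the actual rows of a dataset.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 1000                  # hypothetical dataset size
indices = np.arange(n)    # stand-in for the row indices of the original data

# Draw n indices with replacement: this is one bootstrap sample
bootstrap_idx = rng.choice(indices, size=n, replace=True)

unique_rows = np.unique(bootstrap_idx)          # rows drawn at least once
left_out = np.setdiff1d(indices, unique_rows)   # rows never drawn

print("Bootstrap sample size:", len(bootstrap_idx))
print("Distinct rows drawn:  ", len(unique_rows))
print("Rows left out:        ", len(left_out))
print("Fraction left out:    ", len(left_out) / n)
```

Repeating this with a different random draw for each tree is what gives every tree in the forest a slightly different training set.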

Question 4: What are Out-of-Bag (OOB) Samples and OOB Score?

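Ans: Data points that are not selected in a particular bootstrap sample are called Out-of-Bag (OOB) samples for the tree trained on that sample.

On average, approximately 36.8% of the data is OOB for each tree, because the probability that a given point is never drawn in n draws with replacement is (1 - 1/n)^n ≈ 1/e ≈ 0.368.

The OOB score evaluates the model by predicting each data point using only the trees that did not see it during training, then comparing those predictions with the true labels. This gives a nearly unbiased estimate of the ensemble's accuracy without needing a separate validation set.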
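
As a minimal sketch, the expected OOB fraction and the OOB score can both be checked in scikit-learn. The Breast Cancer dataset and the choice of 200 trees are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
n = X.shape[0]

# Probability that a given row is never drawn in n draws with replacement
oob_fraction = (1 - 1 / n) ** n
print("Expected OOB fraction per tree:", round(oob_fraction, 4))

# oob_score=True asks scikit-learn to score each row using only the trees
# whose bootstrap sample did not contain that row
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

print("OOB accuracy:", round(rf.oob_score_, 4))
```

The OOB accuracy is obtained from the training run itself, so no data has to be held out to get it.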

Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

Ans: Feature importance explains which input features contribute most to a model’s predictions. While both Decision Trees and Random Forests can compute feature importance, the method, stability, and reliability differ.

Feature importance in a decision tree is based on impurity reduction within a single model, so it can change substantially with small changes in the training data. In a random forest it is averaged across many trees, each trained on a different bootstrap sample, which makes it more stable and reliable.
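
The comparison can be sketched as follows. This is an illustrative example, not a rigorous stability test: the two single trees differ only in their random seed, so their scores may or may not differ, and the dataset and model settings are assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target
names = data.feature_names

# Two single trees that differ only in their random seed
tree_a = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_b = DecisionTreeClassifier(random_state=1).fit(X, y)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Spread of each feature's importance across the forest's individual trees
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
spread = per_tree.std(axis=0)

# Show the forest's top 5 features next to the two single-tree scores
top = np.argsort(forest.feature_importances_)[::-1][:5]
for i in top:
    print(f"{names[i]:<25} tree_a={tree_a.feature_importances_[i]:.3f} "
          f"tree_b={tree_b.feature_importances_[i]:.3f} "
          f"forest={forest.feature_importances_[i]:.3f} "
          f"(std across trees {spread[i]:.3f})")
```

The standard deviation column shows how much individual trees disagree about a feature; the forest's score is the average over all of them.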

Question 6: Write a Python program to:
● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.
(Include your Python code and output in the code box below.)

Answer:

In [None]:
# Question 6: Random Forest Feature Importance

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importance
importances = rf.feature_importances_

# Create DataFrame for better visualization
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Sort and select top 5
top_5_features = feature_importance_df.sort_values(
    by='Importance', ascending=False
).head(5)

print("Top 5 Important Features:")
print(top_5_features)

Question 7: Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree

In [None]:
# Question 7: Bagging vs Single Decision Tree

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)

# Bagging Classifier with Decision Trees
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bagging.fit(X_train, y_train)
bag_pred = bagging.predict(X_test)
bag_accuracy = accuracy_score(y_test, bag_pred)

# Print results
print("Decision Tree Accuracy:", dt_accuracy)
print("Bagging Classifier Accuracy:", bag_accuracy)

Question 8: Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy

In [None]:
# Question 8: Random Forest Hyperparameter Tuning using GridSearchCV

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Random Forest model
rf = RandomForestClassifier(random_state=42)

# Hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 5, 10, 20]
}

# GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Train GridSearch
grid_search.fit(X_train, y_train)

# Best model
best_rf = grid_search.best_estimator_

# Predictions
y_pred = best_rf.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", grid_search.best_params_)
print("Final Accuracy:", accuracy)

Question 9: Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
● Compare their Mean Squared Errors (MSE)

In [None]:
# Question 9: Bagging Regressor vs Random Forest Regressor

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

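# Bagging Regressor with Decision Trees
# (uses the same number of trees as the Random Forest below,
# so the two MSE values are compared like for like)
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=100,
    random_state=42
)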
bagging_reg.fit(X_train, y_train)
bagging_pred = bagging_reg.predict(X_test)
bagging_mse = mean_squared_error(y_test, bagging_pred)

# Random Forest Regressor
rf_reg = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)
rf_reg.fit(X_train, y_train)
rf_pred = rf_reg.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)

# Print results
print("Bagging Regressor MSE:", bagging_mse)
print("Random Forest Regressor MSE:", rf_mse)

Question 10: You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world
context.

Ans:
1. Choosing Between Bagging and Boosting
Decision Criteria
Factor              Bagging            Boosting
Main goal           Reduce variance    Reduce bias
Data noise          Works well         Sensitive
Overfitting risk    Lower              Higher
Complexity          Lower              Higher
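
One way to make the choice empirically is to cross-validate one model from each family on the same data. The sketch below uses a synthetic dataset as a stand-in for the loan data (about 10% defaults); the dataset, the two models, and their settings are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for loan data: roughly 10% "default" cases (class 1)
X, y = make_classification(
    n_samples=1000, n_features=15, n_informative=6,
    weights=[0.9, 0.1], flip_y=0.02, random_state=42
)

candidates = {
    "Bagging (Random Forest)": RandomForestClassifier(
        n_estimators=100, random_state=42),
    "Boosting (Gradient Boosting)": GradientBoostingClassifier(
        n_estimators=100, random_state=42),
}

scores = {}
for name, model in candidates.items():
    # 5-fold cross-validated ROC-AUC; for classifiers an integer cv
    # uses stratified folds
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean ROC-AUC = {scores[name]:.3f}")
```

Which family wins depends on the data, so the criteria in the table are a starting point and the cross-validated scores make the final call.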


2. Handling Overfitting
Techniques Used

Data-level controls

Remove leakage features (e.g., post-loan information)

Handle missing values carefully

Feature scaling only if a non-tree base model is used (tree-based models do not need it)

Model-level controls

Limit tree depth (max_depth)

Use minimum samples per leaf

Early stopping (for boosting)

Feature subsampling (Random Forest)

Validation-based controls

Cross-validation

Monitor train vs validation performance gap
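
The model-level controls and the train vs validation gap can be sketched together. The synthetic dataset and the specific values of max_depth and min_samples_leaf are assumptions chosen for illustration, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic, noisy stand-in for loan data (about 10% positive class)
X, y = make_classification(
    n_samples=1000, n_features=15, n_informative=6,
    weights=[0.9, 0.1], flip_y=0.05, random_state=0
)

models = {
    "unconstrained": RandomForestClassifier(n_estimators=100, random_state=0),
    "constrained": RandomForestClassifier(
        n_estimators=100,
        max_depth=5,            # limit tree depth
        min_samples_leaf=10,    # require a minimum number of samples per leaf
        max_features="sqrt",    # feature subsampling at each split
        random_state=0,
    ),
}

gaps = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, scoring="roc_auc",
                        return_train_score=True)
    train_auc = cv["train_score"].mean()
    val_auc = cv["test_score"].mean()
    gaps[name] = train_auc - val_auc
    print(f"{name}: train AUC={train_auc:.3f}, "
          f"validation AUC={val_auc:.3f}, gap={gaps[name]:.3f}")
```

A large gap between training and validation scores is the signal to tighten the constraints; a small gap with a low validation score suggests the model is too constrained instead.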


3. Selecting Base Models
Preferred Base Learners

Decision Trees

Handle nonlinear relationships

Are easy to interpret individually (important in finance)

Are robust to outliers in the input features



4. Performance Evaluation Using Cross-Validation
Approach

Use Stratified K-Fold Cross-Validation

Ensures default vs non-default ratio is preserved

Evaluation Metrics

Accuracy alone is insufficient. I use:

ROC-AUC → overall discrimination power

Precision-Recall (PR-AUC) → more informative than accuracy for the rare default class

F1-Score → balance between precision & recall

Confusion Matrix → business impact visibility
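
The evaluation setup above can be sketched with scikit-learn. The synthetic dataset stands in for the loan data, and the model and its settings are assumptions for illustration; "average_precision" is used here as the summary of the precision-recall curve.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import (
    StratifiedKFold, cross_val_predict, cross_validate
)

# Synthetic stand-in for loan data: roughly 10% "default" cases (class 1)
X, y = make_classification(
    n_samples=1000, n_features=15, n_informative=6,
    weights=[0.9, 0.1], flip_y=0.02, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)

# Stratified folds keep the default vs non-default ratio in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

metrics = ["roc_auc", "average_precision", "f1"]
results = cross_validate(model, X, y, cv=cv, scoring=metrics)
for metric in metrics:
    print(f"{metric}: {results['test_' + metric].mean():.3f}")

# Out-of-fold predictions give one confusion matrix over the whole dataset
y_pred = cross_val_predict(model, X, y, cv=cv)
cm = confusion_matrix(y, y_pred)
print("Confusion matrix (rows = actual, columns = predicted):")
print(cm)
```

The bottom-left cell of the matrix counts defaults that the model missed, which is typically the most costly error for a lender.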

5. Business Benefits of Ensemble Learning

Higher Predictive Accuracy

Combines multiple weak models into a strong one

Reduced Risk

Lower variance → consistent loan decisions

Fewer extreme or unstable predictions

Better Customer Segmentation

Captures complex spending and repayment behavior

Explainability

Feature importance helps justify decisions

Required for compliance and audits

Scalability

Bagged models train their trees independently, so training can be parallelized as transaction data grows