# Assignment: Ensemble Learning

## 1 What is Ensemble Learning in machine learning? Explain the key idea behind it.
Answer:

Ensemble learning is a machine learning technique in which multiple models (often called “weak learners” or “base learners”) are trained and their predictions are combined.
The idea is that while individual models make errors, combining their predictions can reduce overall error and improve accuracy, provided the models do not all make the same mistakes.

Key idea:
“A group of weak models, when combined properly, can perform better than any of them individually, and often matches or beats a single strong model.”

Why it works:
- Different models may capture different patterns in the data.
- Errors made by one model can be offset by the others when those errors are not strongly correlated.
- Combining predictions reduces the impact of noise and overfitting.

Types of ensemble methods:
- Bagging (Bootstrap Aggregating): mainly reduces variance.
- Boosting: mainly reduces bias, and can also reduce variance.
- Stacking: combines multiple model types using a meta-model.
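
To make the key idea concrete, here is a minimal sketch that computes how accurate a majority vote is under an idealized assumption that is not part of the answer above: every model is correct with the same probability `p`, and the models' errors are independent. The function name `majority_vote_accuracy` is purely illustrative.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a majority of n_models independent voters is correct.

    Assumes an odd n_models (so there are no ties) and that every model is
    correct with the same probability p, independently of the others.
    """
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


for n in (1, 11, 101):
    acc = majority_vote_accuracy(n, 0.6)
    print(f"{n:>3} models at p=0.6 -> majority-vote accuracy {acc:.3f}")
```

Under these assumptions the vote becomes more accurate as models are added, as long as `p` is above 0.5. Real models are trained on related data and make correlated errors, so the gain in practice is smaller than this calculation suggests.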


## 2 What is the difference between Bagging and Boosting?
Answer:

| Feature | Bagging | Boosting |
|---|---|---|
| Goal | Reduce variance | Reduce bias (and variance) |
| Training data | Each model gets a random bootstrap sample (sampling with replacement) | Models are trained sequentially, each focusing on the errors of the previous model |
| Model training | Models are trained independently | Each new model depends on the previous one |
| Weighting of samples | All samples have equal weight | Misclassified samples get higher weight in the next round (AdaBoost); gradient boosting instead fits each new model to the remaining errors |
| Combination of predictions | Simple majority vote (classification) or average (regression) | Weighted vote (classification) or weighted sum (regression) |
| Example algorithms | Random Forest, Bagged Decision Trees | AdaBoost, Gradient Boosting, XGBoost |


Simple analogy:
Bagging = “Many people vote independently — the majority wins.”
Boosting = “One person speaks, makes mistakes, the next person learns from those mistakes and improves the answer.”
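
As a rough illustration of the comparison above, the sketch below fits one bagging ensemble and one boosting ensemble with scikit-learn and scores both with 5-fold cross-validation. The dataset (Breast Cancer) and the settings (50 estimators, default base learners) are choices made for this example only, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: 50 full decision trees (the default base estimator), each fit
# independently on its own bootstrap sample, then combined by voting/averaging.
bagging = BaggingClassifier(n_estimators=50, random_state=42)

# Boosting: 50 decision stumps (the default base estimator) fit one after
# another, each round giving more weight to previously misclassified samples.
boosting = AdaBoostClassifier(n_estimators=50, random_state=42)

bagging_scores = cross_val_score(bagging, X, y, cv=5, scoring="accuracy")
boosting_scores = cross_val_score(boosting, X, y, cv=5, scoring="accuracy")

print(f"Bagging  mean CV accuracy: {bagging_scores.mean():.3f}")
print(f"AdaBoost mean CV accuracy: {boosting_scores.mean():.3f}")
```

Which of the two scores higher depends on the dataset and the settings, so the numbers should be read as one data point rather than a general ranking.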


## 3 What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?
Answer:

Bootstrap sampling is a resampling technique where we create new training datasets by randomly selecting samples from the original dataset with replacement.
“With replacement” means that after picking a sample, we put it back into the dataset before picking the next one — so the same sample can appear multiple times in one bootstrap dataset.
Role in Bagging (e.g., Random Forest):
- Each base learner (like a decision tree) is trained on a different bootstrap sample.
- This introduces diversity among models, because each model sees a slightly different subset of the data.
- When predictions from these diverse models are averaged (or voted), variance is reduced and the overall model becomes more stable and accurate.

Example:
If you have 1,000 samples, a bootstrap sample also contains 1,000 points — but some points are repeated and some are missing. On average, about 63.2% of the original points are included in each bootstrap sample, and the rest are excluded (these excluded ones are the Out-of-Bag samples, explained next).
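
The 63.2% figure can be checked with a few lines of standard-library Python. This is a minimal sketch; the helper name `bootstrap_indices` and the seed are illustrative choices.

```python
import random


def bootstrap_indices(n_samples: int, rng: random.Random) -> list[int]:
    """Draw n_samples row indices uniformly at random, with replacement."""
    return rng.choices(range(n_samples), k=n_samples)


n = 1000
rng = random.Random(42)
sample = bootstrap_indices(n, rng)

in_bag = set(sample)                 # distinct rows drawn at least once
out_of_bag = set(range(n)) - in_bag  # rows never drawn

# Each row is missed by one draw with probability (1 - 1/n), so it is missed
# by all n draws with probability (1 - 1/n)^n, which approaches 1/e.
expected_fraction = 1 - (1 - 1 / n) ** n

print(f"bootstrap size:        {len(sample)}")
print(f"distinct rows in bag:  {len(in_bag) / n:.3f}")
print(f"expected fraction:     {expected_fraction:.3f}")
```

The observed fraction varies a little from draw to draw, but it stays close to the expected value of roughly 0.632 for a dataset of this size.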


## 4 What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?
Answer:

Out-of-Bag (OOB) samples are the data points not included in a particular bootstrap sample.
Since bootstrap sampling is done with replacement, about 36.8% of the original data is not selected in each sample — these are the OOB samples for that model.
OOB score in evaluation:
In Bagging methods like Random Forest, we can use OOB samples as a free validation set.
For each data point:
- Look at predictions from only the models where this point was OOB.
- Compare the aggregated prediction to the true label.

The OOB score is simply the accuracy (or R² in regression) computed from these OOB predictions.

Benefits of OOB evaluation:
- No need to create a separate validation set; the OOB method internally estimates the generalization performance.
- It's efficient because it uses data already generated during training.
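
In scikit-learn this is available through the `oob_score=True` option of `RandomForestClassifier`, which exposes the result as the `oob_score_` attribute. The sketch below compares it with accuracy on a held-out test set; the dataset, the split, and the 200 trees are choices made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# oob_score=True asks the forest to score each training row using only
# the trees whose bootstrap sample did not contain that row.
forest = RandomForestClassifier(
    n_estimators=200, oob_score=True, bootstrap=True, random_state=42
)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_              # estimated from training data alone
test_accuracy = forest.score(X_test, y_test)  # held-out check for comparison

print(f"OOB accuracy:  {oob_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

The two numbers are usually close, which is why the OOB score is a convenient estimate when data is too scarce to hold out a validation set.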


## 5 Compare feature importance analysis in a single Decision Tree vs. a Random Forest
Answer:

Feature importance in a single Decision Tree
In a decision tree, feature importance is calculated based on how much each feature reduces impurity (such as Gini impurity or entropy) when it is used to split the data. Every time the tree splits on a feature, the reduction in impurity from that split is added to that feature's importance score. After the tree is built, the scores are normalized so they sum to 1.
However, this approach can be unstable — small changes in the training data can lead to a different tree structure, and therefore very different importance scores. It can also be biased toward features with many unique values (for example, continuous numerical variables), which tend to produce bigger impurity reductions.
Feature importance in a Random Forest
A random forest is an ensemble of many decision trees, each trained on a different bootstrap sample of the data and often with a random subset of features considered at each split. For feature importance, the forest computes the same impurity reduction score for each tree individually, then averages these scores across all trees.
This averaging process makes the importance values much more stable and less sensitive to noise or small changes in the dataset. Because each split considers only a random subset of features, correlated or weaker features also get a chance to be used, so importance is spread more evenly. Averaging does not remove the bias of impurity-based importance toward high-cardinality features, however, so permutation importance on held-out data is a useful cross-check. While the importance scores in a random forest are more reliable, they are less directly interpretable than a single tree's top splits, because the scores come from an aggregate of many models rather than one clear structure.
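
The stability difference can be probed empirically. The sketch below refits a single tree and a forest on five bootstrap resamples of the Breast Cancer data and measures how much each feature's importance moves between fits. The number of resamples, the forest size, and the spread measure are arbitrary choices for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

tree_runs, forest_runs = [], []
for seed in range(5):
    # Resample rows with replacement to mimic a slightly different training set.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx])
    forest = RandomForestClassifier(n_estimators=50, random_state=seed).fit(
        X[idx], y[idx]
    )
    tree_runs.append(tree.feature_importances_)
    forest_runs.append(forest.feature_importances_)

tree_runs = np.array(tree_runs)      # shape: (resamples, features)
forest_runs = np.array(forest_runs)

# Average per-feature standard deviation across the resamples:
# a smaller value means the importance ranking is more reproducible.
tree_spread = tree_runs.std(axis=0).mean()
forest_spread = forest_runs.std(axis=0).mean()

print(f"Mean importance std, single tree:   {tree_spread:.4f}")
print(f"Mean importance std, random forest: {forest_spread:.4f}")
```

Both models normalize their importances to sum to 1, so the two spreads are on the same scale and can be compared directly.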

In [1]:
# 6

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# 1. Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# 2. Train a Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# 3. Get feature importance scores
importances = model.feature_importances_

# 4. Create a DataFrame for better readability
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# 5. Sort features by importance in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# 6. Print top 5 features
print("Top 5 Most Important Features:")
print(feature_importance_df.head(5).to_string(index=False))


Top 5 Most Important Features:
             Feature  Importance
          worst area    0.139357
worst concave points    0.132225
 mean concave points    0.107046
        worst radius    0.082848
     worst perimeter    0.080850


In [4]:
# 7

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# 2. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Train a single Decision Tree Classifier
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)

# 4. Train a Bagging Classifier using Decision Trees
#    (`estimator` replaces the older `base_estimator` name in scikit-learn >= 1.2)
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bagging_model.fit(X_train, y_train)
bagging_pred = bagging_model.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_pred)

# 5. Print results
print("Single Decision Tree Accuracy: {:.2f}%".format(dt_accuracy * 100))
print("Bagging Classifier Accuracy: {:.2f}%".format(bagging_accuracy * 100))


Single Decision Tree Accuracy: 100.00%
Bagging Classifier Accuracy: 100.00%


In [5]:
# 8

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Define the model
rf = RandomForestClassifier(random_state=42)

# 4. Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 150],   # Number of trees
    'max_depth': [None, 5, 10]        # Maximum depth of trees
}

# 5. Use GridSearchCV for tuning
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,               # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1           # Use all CPU cores
)

# 6. Fit the grid search to the training data
grid_search.fit(X_train, y_train)

# 7. Best parameters
print("Best Parameters:", grid_search.best_params_)

# 8. Evaluate the best model on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Final Accuracy on Test Set: {:.2f}%".format(accuracy * 100))


Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy on Test Set: 100.00%


In [6]:
# 9

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# 1. Load the California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Train Bagging Regressor with Decision Tree as base estimator
bagging_regressor = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42
)
bagging_regressor.fit(X_train, y_train)
bagging_preds = bagging_regressor.predict(X_test)
bagging_mse = mean_squared_error(y_test, bagging_preds)

# 4. Train Random Forest Regressor
rf_regressor = RandomForestRegressor(
    n_estimators=50,
    random_state=42
)
rf_regressor.fit(X_train, y_train)
rf_preds = rf_regressor.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_preds)

# 5. Print comparison
print(f"Bagging Regressor MSE: {bagging_mse:.4f}")
print(f"Random Forest Regressor MSE: {rf_mse:.4f}")


Bagging Regressor MSE: 0.2579
Random Forest Regressor MSE: 0.2577


# 10

1. Choosing between Bagging and Boosting
I would start by analyzing the problem type. Loan default prediction is a binary classification task that usually suffers from class imbalance (more “non-default” than “default” cases).
If my base models are high variance (e.g., decision trees) and I want to reduce overfitting, I would lean toward Bagging methods like Random Forest.
If my base models are underfitting (high bias, visible as high error on both training and test sets), I would choose Boosting methods like XGBoost, LightGBM, or AdaBoost, since they sequentially focus on hard-to-classify customers and reduce bias.
In practice, I would experiment with both, but boosting often performs better on structured/tabular financial data.

2. Handling Overfitting
For Bagging:
- Limit tree depth (max_depth) to prevent overly complex trees.
- Increase n_estimators until performance stabilizes; adding trees does not by itself cause overfitting in bagging.

For Boosting:
- Use a lower learning rate (learning_rate) with more estimators.
- Apply regularization (max_depth, min_child_weight, subsample, colsample_bytree; these are XGBoost parameter names).
- Use early stopping based on validation performance.

For both, apply cross-validation to detect overfitting before deployment.
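
The boosting controls listed above can be sketched with scikit-learn's `GradientBoostingClassifier`, used here as a stand-in for XGBoost so the example needs no extra dependency. The data is synthetic (`make_classification` with about 10% positives) and only imitates the class imbalance of a loan dataset; all parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for loan data: roughly 10% positives ("default").
X, y = make_classification(
    n_samples=3000, n_features=20, n_informative=6,
    weights=[0.9, 0.1], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = GradientBoostingClassifier(
    learning_rate=0.05,       # small steps, compensated by more trees
    n_estimators=500,         # upper bound; early stopping may use fewer
    max_depth=3,              # shallow trees act as weak learners
    subsample=0.8,            # row subsampling adds regularization
    validation_fraction=0.2,  # internal hold-out used for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
model.fit(X_train, y_train)

test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Trees actually fitted: {model.n_estimators_} of 500")
print(f"Test ROC AUC: {test_auc:.3f}")
```

Setting a generous `n_estimators` and letting early stopping pick the actual count avoids tuning the number of trees by hand. XGBoost and LightGBM offer comparable early-stopping options under their own parameter names.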

3. Selecting Base Models
For bagging: Decision trees are a good choice because they have high variance and benefit from averaging.
For boosting: Decision trees (usually shallow, max_depth=3-6) are also common because they are weak learners that boosting can improve.
I could also test logistic regression, SVMs, or neural networks as base models, depending on the computational budget, but shallow decision trees remain the usual choice and are comparatively easy to explain in a finance setting.

4. Evaluating Performance using Cross-Validation
Use Stratified K-Fold Cross-Validation to preserve the class distribution in each fold.
Metrics to evaluate:
- AUC-ROC → Measures ability to separate default vs non-default cases.
- Precision-Recall AUC → More informative for imbalanced datasets.
- F1-score → Balances precision and recall for the “default” class.

Ensure results are averaged across folds to avoid bias from one split.
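
A minimal sketch of this evaluation with scikit-learn's `StratifiedKFold` and `cross_validate` follows. The data is again synthetic, the Random Forest with `class_weight="balanced"` is just one possible model, and `average_precision` is used as the usual scikit-learn summary of the precision-recall curve.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data: roughly 10% positives ("default").
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=6,
    weights=[0.9, 0.1], random_state=42,
)

# Stratification keeps the ~10% default rate in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)

scoring = {"roc_auc": "roc_auc", "pr_auc": "average_precision", "f1": "f1"}
results = cross_validate(model, X, y, cv=cv, scoring=scoring)

for name in scoring:
    fold_scores = results[f"test_{name}"]
    print(f"{name:>8}: mean={fold_scores.mean():.3f}  std={fold_scores.std():.3f}")
```

Reporting the standard deviation next to the mean shows how much the estimate depends on the particular split, which matters when the positive class is small.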

5. Justifying Ensemble Learning in this Real-World Context

Loan default prediction requires high accuracy because false negatives (predicting “no default” for a customer who then defaults) can cause large financial losses.
Ensemble learning improves decision-making by:
- Combining multiple base learners to reduce variance (bagging) or bias (boosting).
- Handling noisy features and correlated variables better than a single model.
- Providing more stable predictions, which increases trust from stakeholders.
- Allowing feature importance analysis to help risk managers understand key drivers of default.

✅ Final takeaway:
In loan default prediction, I would begin with Boosting (e.g., XGBoost) for its ability to capture complex relationships, handle class imbalance through class weighting or resampling, validate performance with Stratified K-Fold CV, apply regularization to prevent overfitting, and justify the approach to management based on improved stability and accuracy compared to single models, with feature importance analysis keeping the model explainable.