# Ensemble Learning Assignment

**Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.**

->
- Ensemble learning is a machine learning approach where we combine multiple models to solve the same problem and then aggregate their results to get a better overall prediction.
- The key idea is that a group of models (often called “weak learners”) can collectively perform better than any single model alone.
- This works because different models may make different errors, and when combined, these errors can cancel each other out, improving accuracy and robustness.
- Common ensemble methods:
1. Bagging (e.g., Random Forest): builds multiple models in parallel on different data subsets and aggregates their predictions (voting for classification, averaging for regression).

2. Boosting (e.g., AdaBoost, XGBoost): builds models sequentially, with each new model focusing on correcting errors made by previous ones.

3. Stacking: combines predictions from multiple different models using another model (meta-learner) to make the final decision.
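
The error-cancelling idea can be illustrated with a small simulation. This is only a sketch under an idealized assumption: every model is right with the same probability (0.6 here, a made-up value) and the models make their mistakes independently of each other. Real models are correlated, so the gain in practice is smaller.

```python
import random

def majority_vote_accuracy(n_models, p_correct, n_trials=20_000, seed=0):
    """Estimate how often a majority vote of independent models is right."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Each model is independently correct with probability p_correct.
        correct_votes = sum(rng.random() < p_correct for _ in range(n_models))
        if correct_votes > n_models / 2:
            wins += 1
    return wins / n_trials

single = majority_vote_accuracy(n_models=1, p_correct=0.6)
ensemble = majority_vote_accuracy(n_models=25, p_correct=0.6)
print(f"single model: {single:.3f}, 25-model vote: {ensemble:.3f}")
```

Under these assumptions the 25-model vote is right noticeably more often than any single model, which is the intuition behind combining weak learners.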

**Question 2: What is the difference between Bagging and Boosting?**

->

**1. Bagging (Bootstrap Aggregating)**
- Trains multiple models in parallel on different random subsets of data (sampling with replacement).
- Each model works independently of the others.
- Final prediction is made by majority voting (classification) or averaging (regression).
- All models have equal weight in the final decision.
- Reduces variance, making it good for unstable models like decision trees.
- Example: Random Forest.

**2. Boosting**
- Trains models sequentially, with each new model focusing on the mistakes of the previous ones.
- Adjusts the weights of training samples to give more importance to misclassified cases.
- Models that perform better get higher weight in the final prediction.
- Primarily reduces bias; it can also reduce variance.
- More prone to overfitting if not tuned properly.
- Examples: AdaBoost, Gradient Boosting, XGBoost.
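
The sample reweighting used in Boosting can be sketched with one round of an AdaBoost-style update. This is a simplified illustration for binary classification, not the exact code of any library: the five samples, their starting weights, and the function name `adaboost_round` are made up for the example, and it assumes the weighted error is strictly between 0 and 1.

```python
import math

def adaboost_round(weights, is_correct):
    """One AdaBoost-style reweighting step for a binary weak learner."""
    # Weighted error of the current weak learner.
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    # Learner weight (its say in the final vote): larger when the error is smaller.
    alpha = 0.5 * math.log((1 - error) / error)
    # Increase the weights of misclassified samples, decrease the others.
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, is_correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.2] * 5                             # five samples, equal weight
is_correct = [True, True, True, True, False]    # the last one is misclassified
alpha, new_weights = adaboost_round(weights, is_correct)
print(round(alpha, 4), [round(w, 3) for w in new_weights])
# prints: 0.6931 [0.125, 0.125, 0.125, 0.125, 0.5]
```

After the update, the single misclassified sample carries half of the total weight, so the next model is pushed to get it right.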

**Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**

->
- Bootstrap sampling is a statistical technique where we create new datasets by randomly selecting samples with replacement from the original dataset.
- Because of replacement, some samples may appear multiple times in the new dataset, while others may be left out.
- Each bootstrap sample is usually the same size as the original dataset.

---
**Role in Bagging (e.g., Random Forest)**

- In Bagging, bootstrap sampling is used to generate different training subsets for each model in the ensemble.
- This ensures that each model sees a slightly different version of the data, which makes their predictions less correlated.
- When these diverse models are combined, the overall variance is reduced, leading to more stable and accurate predictions.
- In Random Forest, this bootstrapping is combined with random feature selection to further increase diversity among trees.
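
A minimal sketch of bootstrap sampling using only the standard library. The dataset here is just a list of row indices, and the size `n` and the seed are arbitrary choices for illustration.

```python
import math
import random

rng = random.Random(42)
n = 10_000
original = list(range(n))   # stand-in for the row indices of a dataset

# Draw n indices with replacement: this is one bootstrap sample.
bootstrap = rng.choices(original, k=n)

unique_fraction = len(set(bootstrap)) / n
print(f"unique rows in the bootstrap sample: {unique_fraction:.3f}")
print(f"theoretical limit 1 - 1/e:           {1 - math.exp(-1):.3f}")
```

For large datasets, roughly 63% of the original rows appear in a given bootstrap sample and the remaining rows are left out of it, which is why each tree sees a different view of the data.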

**Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?**

->

- In bagging methods like Random Forest, each decision tree is trained on a bootstrap sample created by sampling with replacement from the training data.
- Some data points are left out of this bootstrap sample.
- These unused points are called OOB samples for that specific tree.
- They are still part of the training dataset but are excluded from training that tree.
- These OOB samples act as validation data for that specific model.

**How is OOB score used to evaluate ensemble models?**

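- After training, each training sample is predicted using only the trees for which it was an OOB sample, and those trees' predictions are aggregated (majority vote or average).
- This is done for every training sample that was left out of at least one tree.
- The OOB score is the accuracy (or R² for regression) of these aggregated OOB predictions against the true labels.
- Because each prediction comes from trees that never saw the sample, this score gives an approximately unbiased estimate of performance without needing a separate validation set.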
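
In scikit-learn, an OOB estimate can be requested with `oob_score=True`. The sketch below assumes the Breast Cancer dataset used later in this assignment; the values of `n_estimators` and `random_state` are arbitrary choices for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks scikit-learn to score each training row using only
# the trees whose bootstrap sample did not contain that row.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

print(f"OOB accuracy: {rf.oob_score_:.3f}")
# Per-sample class probabilities built from OOB trees only.
print(rf.oob_decision_function_.shape)
```

No train/test split is needed here, because the OOB rows already play the role of held-out data.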

**Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.**

->
**Feature Importance in a Single Decision Tree**

- Calculated by measuring the reduction in impurity (e.g., Gini impurity or entropy) brought by each feature across all the splits where it is used.

- The contribution of each split is weighted by the number of samples it affects, and then summed for the feature.

- Since the tree is built on one dataset, the importance values may be unstable—small changes in data can lead to different feature rankings.
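
The weighted impurity reduction of a single split can be worked through by hand. This sketch uses Gini impurity and a made-up root split of 10 samples; the helper names are hypothetical. A feature's importance is then the sum of these values over every split that uses it, usually normalized so that all importances add up to 1.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity reduction of one split, weighted by the share of samples reaching it."""
    n_parent, n_left, n_right = sum(parent), sum(left), sum(right)
    child_impurity = (n_left * gini(left) + n_right * gini(right)) / n_parent
    return (n_parent / n_total) * (gini(parent) - child_impurity)

# Hypothetical root split: 10 samples (5 per class) split into two nodes of 5.
decrease = weighted_impurity_decrease(
    parent=[5, 5], left=[4, 1], right=[1, 4], n_total=10
)
print(round(decrease, 4))  # prints: 0.18
```

Here the parent impurity is 0.5 and each child has impurity 0.32, so the split removes 0.18 of impurity.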

**Feature Importance in a Random Forest**

- Computed in a similar way (reduction in impurity), but averaged across all trees in the forest.

- Because Random Forest uses bootstrap sampling and random feature selection at each split, the importance scores are more robust and stable compared to a single tree.

- Helps avoid overfitting to noise and gives a more reliable estimate of which features are truly important.

**Question 6: Write a Python program to:**
- Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
- Train a Random Forest Classifier
- Print the top 5 most important features based on feature importance scores.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

In [3]:
data = load_breast_cancer()
print(data.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

:Number of Instances: 569

:Number of Attributes: 30 numeric, predictive attributes and the class

:Attribute Information:
    - radius (mean of distances from center to points on the perimeter)
    - texture (standard deviation of gray-scale values)
    - perimeter
    - area
    - smoothness (local variation in radius lengths)
    - compactness (perimeter^2 / area - 1.0)
    - concavity (severity of concave portions of the contour)
    - concave points (number of concave portions of the contour)
    - symmetry
    - fractal dimension ("coastline approximation" - 1)

    The mean, standard error, and "worst" or largest (mean of the three
    worst/largest values) of these features were computed for each image,
    resulting in 30 features.  For instance, field 0 is Mean Radius, field
    10 is Radius SE, field 20 is Worst Radius.

    - 

In [4]:
data.data

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

In [5]:
data.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

In [6]:
data.feature_names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

In [7]:
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

In [9]:
df.isnull().sum()

| Column | Missing values |
|---|---|
| mean radius | 0 |
| mean texture | 0 |
| mean perimeter | 0 |
| mean area | 0 |
| mean smoothness | 0 |
| mean compactness | 0 |
| mean concavity | 0 |
| mean concave points | 0 |
| mean symmetry | 0 |
| mean fractal dimension | 0 |


In [10]:
X = df.drop('target', axis=1)
y = df['target']

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [12]:
# random_state is not set, so the accuracy and importances below can vary slightly between runs
model = RandomForestClassifier()
model.fit(X_train, y_train)

In [13]:
y_pred = model.predict(X_test)

In [14]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.956140350877193

In [15]:
pd.Series(model.feature_importances_, index = X.columns)

| Feature | Importance |
|---|---|
| mean radius | 0.02946 |
| mean texture | 0.014548 |
| mean perimeter | 0.035064 |
| mean area | 0.046239 |
| mean smoothness | 0.004773 |
| mean compactness | 0.002721 |
| mean concavity | 0.052334 |
| mean concave points | 0.069316 |
| mean symmetry | 0.00282 |
| mean fractal dimension | 0.003567 |


In [16]:
importances = pd.Series(model.feature_importances_, index = X.columns)
importances.sort_values(ascending=False).head(5)

| Feature | Importance |
|---|---|
| worst concave points | 0.167324 |
| worst area | 0.142274 |
| worst radius | 0.133386 |
| worst perimeter | 0.126325 |
| mean concave points | 0.069316 |


**Question 7: Write a Python program to:**
- Train a Bagging Classifier using Decision Trees on the Iris dataset
- Evaluate its accuracy and compare with a single Decision Tree

In [17]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [18]:
data = load_iris()
print(data.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

In [19]:
data.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [20]:
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB


In [22]:
df.isnull().sum()

| Column | Missing values |
|---|---|
| sepal length (cm) | 0 |
| sepal width (cm) | 0 |
| petal length (cm) | 0 |
| petal width (cm) | 0 |
| target | 0 |


In [23]:
X = df.drop('target', axis=1)
y = df['target']

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [25]:
# Single Decision Tree

In [26]:
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

In [27]:
dt_pred = dt.predict(X_test)

In [28]:
dt_accuracy = accuracy_score(y_test, dt_pred)

In [29]:
# Bagging Classifier with Decision Trees

In [30]:
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=20, random_state=1)
bagging.fit(X_train, y_train)

In [31]:
bag_preds = bagging.predict(X_test)
bag_accuracy = accuracy_score(y_test, bag_preds)

In [32]:
print("Single Decision Tree Accuracy:", dt_accuracy)
print("Bagging Classifier Accuracy:", bag_accuracy)

Single Decision Tree Accuracy: 0.9555555555555556
Bagging Classifier Accuracy: 0.9555555555555556


**Question 8: Write a Python program to:**
- Train a Random Forest Classifier
- Tune hyperparameters max_depth and n_estimators using GridSearchCV
- Print the best parameters and final accuracy

In [33]:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=1)

In [34]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [35]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

In [36]:
rf = RandomForestClassifier()

In [37]:
params = {'max_depth': [50, 70, 100, 80], 'n_estimators': [50, 100, 200]}

In [38]:
grid_search = GridSearchCV(estimator = rf, param_grid = params, cv = 5, verbose=2, scoring='accuracy')

In [39]:
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] END ......................max_depth=50, n_estimators=50; total time=   0.3s
[CV] END ......................max_depth=50, n_estimators=50; total time=   0.3s
[CV] END ......................max_depth=50, n_estimators=50; total time=   0.4s
[CV] END ......................max_depth=50, n_estimators=50; total time=   0.4s
[CV] END ......................max_depth=50, n_estimators=50; total time=   0.5s
[CV] END .....................max_depth=50, n_estimators=100; total time=   0.6s
[CV] END .....................max_depth=50, n_estimators=100; total time=   0.6s
[CV] END .....................max_depth=50, n_estimators=100; total time=   0.6s
[CV] END .....................max_depth=50, n_estimators=100; total time=   0.9s
[CV] END .....................max_depth=50, n_estimators=100; total time=   1.5s
[CV] END .....................max_depth=50, n_estimators=200; total time=   1.7s
[CV] END .....................max_depth=50, n_es

In [40]:
grid_search.best_params_

{'max_depth': 50, 'n_estimators': 100}

In [41]:
grid_search.best_estimator_

In [42]:
grid_search.best_score_

np.float64(0.8625)

In [43]:
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)

In [44]:
accuracy_score(y_test, y_pred)

0.87

**Question 9: Write a Python program to:**
- Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
- Compare their Mean Squared Errors (MSE)


In [45]:
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()

In [46]:
X = data.data
y = data.target

In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [48]:
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

In [49]:
# Bagging Regressor
bagging = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=20, random_state=1)
bagging.fit(X_train, y_train)

In [50]:
bag_pred = bagging.predict(X_test)
bag_mse = mean_squared_error(y_test, bag_pred)

In [51]:
# Random Forest Regressor
rf = RandomForestRegressor()
rf.fit(X_train, y_train)

In [52]:
# The forest was already fitted in the previous cell, so only predict here
rf_preds = rf.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_preds)

In [53]:
print('Bagging Regressor:', bag_mse)
print('Random Forest Regressor:', rf_mse)

Bagging Regressor: 0.2678648778131496
Random Forest Regressor: 0.2529800775888451


**Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.**

You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
- Choose between Bagging or Boosting
- Handle overfitting
- Select base models
- Evaluate performance using cross-validation
- Justify how ensemble learning improves decision-making in this real-world context.

->

---
**1. Choose Between Bagging or Boosting**

Loan default prediction is a classification task, typically with imbalanced classes (defaults are rare) and non-linear patterns. Bagging (e.g., Random Forest) reduces variance and works well when the base model overfits. Boosting (e.g., XGBoost, LightGBM) fits each new model to the errors of the previous ones, which mainly reduces bias and often captures subtle patterns better. Start with Boosting to capture subtle financial patterns, and train a Bagging model as a baseline for comparison.

---
**2. Handle Overfitting**

Limit tree depth, and in Boosting also limit the number of trees (adding trees to a Bagging ensemble does not by itself cause overfitting). Use a small learning rate in Boosting to control the influence of each tree. Apply early stopping based on validation performance. Remove irrelevant or highly correlated features. Use cross-validation to confirm generalization.
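
The depth, learning-rate, and early-stopping controls can be sketched with scikit-learn's `GradientBoostingClassifier`. No loan data is available here, so the example assumes a synthetic imbalanced dataset from `make_classification` as a stand-in, and all hyperparameter values are illustrative rather than tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for loan data (roughly 10% positives).
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of trees
    learning_rate=0.05,       # shrinks each tree's contribution
    max_depth=3,              # shallow trees
    validation_fraction=0.2,  # held out internally for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gb.fit(X_train, y_train)

print("trees actually fitted:", gb.n_estimators_)
print("test accuracy:", gb.score(X_test, y_test))
```

With early stopping enabled, `n_estimators_` reports how many trees were kept, which can be well below the upper bound.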

---
**3. Select Base Models**

For Bagging, use moderately deep decision trees. For Boosting, use shallow trees (depth 3–5). Try other learners if needed, but tree-based models usually handle mixed features best.

---
**4. Evaluate with Cross-Validation**

Use Stratified k-fold CV to maintain class proportions. Evaluate using Precision, Recall, F1-score, AUC-ROC, and Precision–Recall curve. Compare Bagging and Boosting results to select the better model.
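
A minimal sketch of this evaluation with scikit-learn, again assuming a synthetic imbalanced dataset as a stand-in for real loan data. The two models and their settings are illustrative choices, and the `models` and `results` dictionaries are just names used in this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for loan data (roughly 10% positives).
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0
)

# Stratified folds keep the class proportions the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["precision", "recall", "f1", "roc_auc", "average_precision"]
models = {
    "bagging (random forest)": RandomForestClassifier(n_estimators=100, random_state=0),
    "boosting (gradient boosting)": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Average each metric over the five folds.
    results[name] = {m: float(scores[f"test_{m}"].mean()) for m in scoring}
    print(name, {m: round(v, 3) for m, v in results[name].items()})
```

Here `average_precision` summarizes the Precision-Recall curve in a single number, which tends to be more informative than accuracy when defaults are rare.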

---

**5. Why Ensemble Improves Decisions**

Incorrect predictions are costly in finance, especially false negatives (approving a loan that later defaults). Ensembles combine multiple models for more stable and accurate predictions: Bagging mainly reduces variance, while Boosting mainly reduces bias. This supports more reliable loan approvals and fewer defaults, which in turn can improve profitability.