In [None]:
ENSEMBLE LEARNING

In [None]:
Question 1:  What is Ensemble Learning in machine learning? Explain the key idea behind it.

In [None]:
Ensemble Learning in machine learning is a method where multiple models (called base learners) are combined to solve a problem and improve performance.

**Key Idea:** The central idea is that a group of reasonably accurate, diverse models, when combined, often performs better than any single one of them, because their strengths complement each other and their individual errors partly cancel out. This typically improves accuracy, stability, and generalization.
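
To make the error-cancelling argument concrete, here is a small sketch under idealised assumptions that are not in the original answer: a binary task, an odd number of models, and errors that are fully independent. Real models are correlated, so the gain in practice is smaller. The function name `majority_vote_accuracy` is illustrative.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct."""
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Each model alone is right 60% of the time; the vote improves as models are added.
for n in (1, 5, 25):
    print(n, "models ->", round(majority_vote_accuracy(n, 0.6), 3))
```

The same formula also shows the limit of the idea: if each model is worse than chance (for example `p_correct=0.4`), majority voting makes the ensemble worse, not better.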

---


In [None]:
Question 2: What is the difference between Bagging and Boosting? 

In [None]:
**Bagging (Bootstrap Aggregating):**

* Trains multiple models in **parallel** on different random subsets of the data (sampled with replacement).
* Final prediction is made by **majority voting (classification)** or **averaging (regression)**.
* Aim: mainly reduces **variance** and helps limit overfitting (e.g., Random Forest).

**Boosting:**

* Trains multiple models **sequentially**, where each new model focuses on correcting the errors of the previous one.
* Final prediction is a **weighted combination** of all models.
* Aim: mainly reduces **bias** (and can also reduce variance); it often improves accuracy but can overfit noisy data if run for too many rounds (e.g., AdaBoost, XGBoost).

---

* **Bagging → Parallel, reduces variance**
* **Boosting → Sequential, mainly reduces bias**
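
The contrast can be sketched with scikit-learn on a synthetic dataset. The dataset, the choice of `AdaBoostClassifier` as the boosting method, and all hyperparameter values below are assumptions made for illustration, not part of the original answer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Bagging: full-depth trees, each fit independently on its own bootstrap sample.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bagging.fit(X_train, y_train)

# Boosting: depth-1 stumps fit one after another on reweighted data.
boosting = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0
)
boosting.fit(X_train, y_train)

bagging_acc = bagging.score(X_test, y_test)
boosting_acc = boosting.score(X_test, y_test)
print("Bagging accuracy: ", bagging_acc)
print("Boosting accuracy:", boosting_acc)
```

Note the different base learners: bagging starts from strong but unstable trees and averages them, while boosting starts from weak stumps and builds them up sequentially.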

---

In [None]:
Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

In [None]:
**Bootstrap Sampling** is a statistical technique where multiple new datasets are created by randomly sampling **with replacement** from the original dataset, so some samples may appear multiple times while others may be left out. Each new dataset is of the same size as the original.

**Role in Bagging (e.g., Random Forest):**
In Bagging, bootstrap sampling is used to create diverse training subsets for each base learner (e.g., decision tree). Since each model is trained on a slightly different dataset, they make different errors. When their outputs are combined (by voting or averaging), the overall variance is reduced and the model becomes more stable and accurate.
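
A minimal sketch of bootstrap sampling using only the standard library; the helper name `bootstrap_sample` and the toy data are illustrative assumptions.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


rng = random.Random(42)
data = list(range(10))

sample = bootstrap_sample(data, rng)
left_out = sorted(set(data) - set(sample))

print("bootstrap sample:", sample)    # same size as data, may contain repeats
print("left out:        ", left_out)  # rows this sample never drew
```

Calling `bootstrap_sample` once per base learner gives each learner its own slightly different training set, which is the source of diversity described above.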


In [None]:
Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

In [None]:
**Out-of-Bag (OOB) samples:**
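In bootstrap sampling, each model is trained on a random sample of the data drawn with replacement. For a dataset of n rows, the chance that a given row is never drawn is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows, so on average a little over **one-third of the data is left out** of each sample. These unseen rows are that model's **Out-of-Bag (OOB) samples**.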

**OOB Score:**
OOB samples act like a built-in validation set. Each training point is predicted using only the models that did not see it during training, and those predictions are aggregated (by voting or averaging). The accuracy (or error) computed from these OOB predictions is called the **OOB score**.

This provides an **approximately unbiased estimate of generalization performance** (it can be slightly pessimistic, since each point is scored by only a subset of the models) without needing a separate validation set.
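
As a sketch of how this looks in scikit-learn, `RandomForestClassifier` accepts `oob_score=True` and exposes the result as `oob_score_` after fitting. The synthetic dataset and the hyperparameter values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# oob_score=True scores every training row using only trees that never saw it.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

oob_acc = rf.oob_score_              # estimated from training data alone
test_acc = rf.score(X_test, y_test)  # held-out check, for comparison

print("OOB accuracy: ", oob_acc)
print("Test accuracy:", test_acc)
```

The held-out test split is kept here only to compare against the OOB estimate; the two numbers are usually close but not identical.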


In [None]:
Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

In [None]:
**Feature Importance in a Single Decision Tree:**

* Calculated based on how much each feature **reduces impurity** (e.g., Gini impurity, entropy) across the splits where it is used.
* Impurity-based importance is biased toward features with **many distinct values** (high-cardinality categorical or continuous features), and it is less reliable when the dataset is small.
* Since it depends on one tree, results can be **unstable** and vary with small changes in data.

**Feature Importance in a Random Forest:**

* Computed by **averaging feature importance scores** across all trees in the forest, which makes the ranking more stable than that of a single tree.
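
The comparison can be sketched on the Breast Cancer dataset by fitting one tree and one forest and looking at how much the per-tree importances vary inside the forest. The variable names and hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y, names = data.data, data.target, data.feature_names

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances of every individual tree in the forest: shape (n_trees, n_features).
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
forest_mean = per_tree.mean(axis=0)  # the averaged score described above
forest_std = per_tree.std(axis=0)    # how much individual trees disagree

# Compare the single tree against the forest on the forest's top 5 features.
top = np.argsort(forest_mean)[::-1][:5]
for i in top:
    print(
        f"{names[i]:<22} tree={tree.feature_importances_[i]:.3f}  "
        f"forest={forest_mean[i]:.3f} +/- {forest_std[i]:.3f}"
    )
```

A large spread across trees for a feature is a sign that any single tree's importance for it should not be trusted on its own.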


In [None]:
Question 6: Write a Python program to:
● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.
(Include your Python code and output in the code box below.)

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Train Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

# Get feature importances
importances = clf.feature_importances_
feature_importance_df = pd.DataFrame({
    "Feature": feature_names,
    "Importance": importances
}).sort_values(by="Importance", ascending=False)

# Print top 5 features
print("Top 5 Most Important Features:\n")
print(feature_importance_df.head(5))


Top 5 Most Important Features:

                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


In [None]:
Question 7: Write a Python program to: 
● Train a Bagging Classifier using Decision Trees on the Iris dataset 
● Evaluate its accuracy and compare with a single Decision Tree 
(Include your Python code and output in the code box below.)

In [5]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt_acc = accuracy_score(y_test, dt.fit(X_train, y_train).predict(X_test))

# Bagging with Decision Trees
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)
bag_acc = accuracy_score(y_test, bag.fit(X_train, y_train).predict(X_test))

print("Decision Tree Accuracy:", dt_acc)
print("Bagging Accuracy:", bag_acc)


Decision Tree Accuracy: 1.0
Bagging Accuracy: 1.0


In [None]:
Question 8: Write a Python program to: 
● Train a Random Forest Classifier 
● Tune hyperparameters max_depth and n_estimators using GridSearchCV 
● Print the best parameters and final accuracy 
(Include your Python code and output in the code box below.)

In [7]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Define model
rf = RandomForestClassifier(random_state=42)

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 5, 10]
}

# GridSearchCV
grid = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Best parameters and final accuracy
print("Best Parameters:", grid.best_params_)
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print("Final Accuracy:", accuracy_score(y_test, y_pred))


Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy: 0.935672514619883


In [None]:
Question 9: Write a Python program to: 
● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
● Compare their Mean Squared Errors (MSE) 
(Include your Python code and output in the code box below.)
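
No answer was recorded for this question. One possible sketch is below; the helper name `compare_bagging_and_forest`, the 70/30 split, and `n_estimators=50` are assumptions, and no output is shown because the program was not run as part of this notebook.

```python
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def compare_bagging_and_forest(X, y, n_estimators=50, seed=42):
    """Fit both regressors on one train/test split and return their test MSEs."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    models = {
        # BaggingRegressor uses a decision tree as its default base estimator.
        "Bagging Regressor": BaggingRegressor(
            n_estimators=n_estimators, random_state=seed
        ),
        "Random Forest Regressor": RandomForestRegressor(
            n_estimators=n_estimators, random_state=seed
        ),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[name] = mean_squared_error(y_test, model.predict(X_test))
    return results
```

For the California Housing data, load it with `X, y = fetch_california_housing(return_X_y=True)` from `sklearn.datasets` (the data is downloaded on first use) and print the dictionary returned by `compare_bagging_and_forest(X, y)`.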

In [None]:
Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world context.


In [None]:
As a data scientist predicting **loan default**, I would use the following step-by-step approach with **ensemble learning**:

---

#### 1. **Choosing Between Bagging or Boosting**

* **Bagging (e.g., Random Forest):** Best when the base model has **high variance** (it overfits); averaging many trees stabilizes noisy predictions and reduces overfitting.
* **Boosting (e.g., XGBoost, LightGBM):** Best when simple models have **high bias** (they underfit), since it sequentially corrects errors and often performs well on structured, tabular data like that found in finance.
  👉 For loan default prediction, I would **start with Boosting** because it can capture complex, non-linear patterns in customer transactions, and keep a Random Forest as a baseline for comparison.

---

#### 2. **Handling Overfitting**

* Constrain model complexity through **regularizing hyperparameters** (e.g., `max_depth`, `min_samples_split`, and a smaller `learning_rate` for boosting).
* Apply **early stopping** in boosting.
* Use **cross-validation** to tune hyperparameters.
* Reduce noise by feature selection or dimensionality reduction.
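
Early stopping can be sketched with scikit-learn's `GradientBoostingClassifier`, which holds out part of the training data when `n_iter_no_change` is set. The synthetic data and the specific values (500 rounds, patience of 10, 20% validation fraction) are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic, mildly imbalanced stand-in for loan data (illustrative only).
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.7, 0.3], random_state=42
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.1,
    max_depth=3,              # shallow trees limit model complexity
    validation_fraction=0.2,  # internal hold-out used to monitor the loss
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
gb.fit(X, y)

print("Boosting rounds actually fitted:", gb.n_estimators_)
```

If `n_estimators_` comes out well below the upper bound, the extra rounds were not improving the validation loss and would mostly have added overfitting risk.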

---

#### 3. **Selecting Base Models**

* For Bagging → **Decision Trees** (since they are high variance models, bagging stabilizes them).
* For Boosting → **Shallow Decision Trees** (depth-1 stumps or trees a few levels deep) as weak learners, since boosting combines many weak models into a strong one.

---

#### 4. **Evaluating Performance**

* Use **k-fold cross-validation** to assess model stability across different data splits.
* Compare metrics: **AUC-ROC, Precision, Recall, F1-score**, since in loan default prediction, **false negatives (predicting non-default but actually default)** are very costly.
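
A sketch of scoring several of these metrics in one pass with stratified folds, so each fold keeps the class ratio. The synthetic dataset and the choice of five folds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, mildly imbalanced stand-in for loan data (illustrative only).
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.7, 0.3], random_state=42
)

# Stratified folds preserve the default/non-default ratio in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["roc_auc", "precision", "recall", "f1"]

scores = cross_validate(
    GradientBoostingClassifier(random_state=42), X, y, cv=cv, scoring=metrics
)

for metric in metrics:
    fold_scores = scores[f"test_{metric}"]
    print(f"{metric:<10} mean={fold_scores.mean():.3f}  std={fold_scores.std():.3f}")
```

Reporting the standard deviation alongside the mean shows how stable each metric is across folds; recall on the default class is the one to watch when false negatives are costly.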

---

#### 5. **Justification of Ensemble Learning in Real-world Context**

* Loan default prediction involves **high risk and imbalanced data**.
* Ensemble methods improve **accuracy, robustness, and generalization** compared to single models.
* Boosting methods (like XGBoost) are widely used in finance due to their ability to capture **non-linear relationships** and reduce bias.
* More reliable predictions help banks in **better credit risk management, reduced losses, and informed decision-making**.


In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score

# Step 1: Create synthetic dataset (loan default = binary classification)
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10,
                           n_redundant=5, weights=[0.7, 0.3], random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Step 2: Bagging (Random Forest)
rf = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

# Step 3: Boosting (Gradient Boosting)
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb.fit(X_train, y_train)
gb_pred = gb.predict(X_test)

# Step 4: Evaluation
print("Random Forest Results:")
print(classification_report(y_test, rf_pred))
print("AUC:", roc_auc_score(y_test, rf.predict_proba(X_test)[:,1]))

print("\nGradient Boosting Results:")
print(classification_report(y_test, gb_pred))
print("AUC:", roc_auc_score(y_test, gb.predict_proba(X_test)[:,1]))

# Step 5: Cross-validation comparison
rf_cv = cross_val_score(rf, X, y, cv=5, scoring='roc_auc').mean()
gb_cv = cross_val_score(gb, X, y, cv=5, scoring='roc_auc').mean()

print("\nCross-validated AUC -> RF:", rf_cv, " | GB:", gb_cv)


Random Forest Results:
              precision    recall  f1-score   support

           0       0.90      0.99      0.94      1047
           1       0.96      0.75      0.84       453

    accuracy                           0.91      1500
   macro avg       0.93      0.87      0.89      1500
weighted avg       0.92      0.91      0.91      1500

AUC: 0.9589914208787431

Gradient Boosting Results:
              precision    recall  f1-score   support

           0       0.91      0.97      0.94      1047
           1       0.91      0.77      0.84       453

    accuracy                           0.91      1500
   macro avg       0.91      0.87      0.89      1500
weighted avg       0.91      0.91      0.91      1500

AUC: 0.9525228182697965
