# Ensemble Learning | Assignment

**Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.**


**Answer:**  
**Ensemble Learning** is a machine learning technique that combines the predictions of multiple individual models (called *base learners*, or *weak learners* when each one is only slightly better than chance) to create a more accurate and robust final model. By aggregating the strengths of several models, the ensemble often outperforms any single one of them.

## Key Idea Behind Ensemble Learning
The key idea is that **a group of weak models, when combined properly, can produce a strong model**. Each individual model makes some errors, but when the predictions are aggregated (for example, by averaging or voting), errors that are not strongly correlated across models tend to cancel out, which improves overall accuracy and generalization.
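
The cancelling-out effect can be illustrated with a small simulation. This is only a sketch under an idealized assumption: every model is a binary classifier that is correct with the same probability and errs independently of the others, which real models rarely do. The function name and the 0.6 accuracy figure are hypothetical choices for illustration.

```python
import random


def majority_vote_accuracy(n_models, p_correct, n_trials=20_000, seed=0):
    """Estimate how often a majority vote of independent binary models is right."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Each model is independently correct with probability p_correct.
        correct_votes = sum(rng.random() < p_correct for _ in range(n_models))
        if correct_votes > n_models / 2:
            wins += 1
    return wins / n_trials


single = majority_vote_accuracy(n_models=1, p_correct=0.6)
ensemble = majority_vote_accuracy(n_models=25, p_correct=0.6)
print(f"single model: {single:.3f}, 25-model vote: {ensemble:.3f}")
```

The vote of 25 such models is right noticeably more often than any one of them. When model errors are correlated, the gain shrinks, which is why ensemble methods work to keep their base models diverse.
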

## Types of Ensemble Methods
1. **Bagging (Bootstrap Aggregating)**  
   - Trains multiple models independently on different bootstrap samples (random draws with replacement) of the training data.  
   - Example: **Random Forest**  
   - Reduces variance and helps limit overfitting.

2. **Boosting**  
   - Trains models sequentially, where each new model focuses on correcting the errors of the previous ones.  
   - Example: **AdaBoost, Gradient Boosting, XGBoost**  
   - Reduces bias and improves model accuracy.

3. **Stacking**  
   - Combines multiple base models’ predictions using a **meta-model** (another model that learns how to best combine them).  
   - Example: Using Logistic Regression as a meta-model to combine Decision Tree and SVM outputs.
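
The stacking example above (Logistic Regression combining a Decision Tree and an SVM) might be sketched with scikit-learn's `StackingClassifier` as follows. The Iris dataset, the scaling step for the SVM, and all hyperparameter values are assumptions chosen for illustration, not part of the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Base models: a shallow tree and a scaled SVM.
base_models = [
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
    ("svm", make_pipeline(StandardScaler(), SVC(random_state=42))),
]

# The meta-model is trained on cross-validated predictions of the base models.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
stack_accuracy = stack.score(X_test, y_test)
print(f"Stacking test accuracy: {stack_accuracy:.4f}")
```

Training the meta-model on out-of-fold predictions (the `cv=5` argument) keeps it from simply trusting base models that have memorized the training data.
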


**Question 2: What is the difference between Bagging and Boosting?**

**Answer:**  
Bagging and Boosting are both ensemble learning techniques, but they differ in how they build and combine the base models.

| Feature | Bagging | Boosting |
|---------|---------|----------|
| **Full Form** | Bootstrap Aggregating | N/A |
| **Objective** | Mainly reduces **variance** and helps limit overfitting | Mainly reduces **bias** and improves accuracy |
| **How Models Are Trained** | Models are trained **independently** on different bootstrap samples of the data | Models are trained **sequentially**, where each new model focuses on the errors of previous models |
| **Data Sampling** | Random sampling **with replacement** (bootstrap) | Typically all data points are used; AdaBoost increases the weights of misclassified examples, while gradient boosting fits each new model to the current residual errors |
| **Prediction Combination** | **Voting** (for classification) or **averaging** (for regression) | **Weighted combination** based on model performance |
| **Example Algorithms** | Random Forest | AdaBoost, Gradient Boosting, XGBoost |
| **Error Handling** | Reduces **variance** by averaging errors | Reduces **bias** by learning from mistakes |
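
To make the reweighting in the Boosting column concrete, here is one round of the classic binary AdaBoost update worked through in plain Python. The five labels and predictions are hypothetical toy values, and the sketch assumes labels encoded as -1 and +1.

```python
import math

# Toy labels and one weak learner's predictions (hypothetical values).
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]  # the second example is misclassified

# Start with uniform sample weights.
weights = [1 / len(y_true)] * len(y_true)

# Weighted error of the weak learner.
error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)

# Learner weight (alpha): larger when the weighted error is smaller.
alpha = 0.5 * math.log((1 - error) / error)

# Raise the weights of misclassified examples, lower the rest, then normalize.
updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
total = sum(updated)
weights = [w / total for w in updated]

print(f"error={error:.2f}, alpha={alpha:.3f}")
print([round(w, 3) for w in weights])
```

After this round the single misclassified example carries half of the total weight (0.5), while each correctly classified example drops to 0.125, so the next weak learner is pushed to get that example right.
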



**Question 3: What is Bootstrap Sampling and What Role Does It Play in Bagging Methods Like Random Forest?**

**Answer:**  
**Bootstrap Sampling** is a statistical technique where multiple random samples are drawn **with replacement** from the original dataset. Each sample is of the same size as the original dataset, but because of sampling with replacement, some data points may appear multiple times while others may be left out.
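
A minimal sketch of drawing one bootstrap sample, using only the standard library. The 1,000 integers stand in for row indices of a dataset and are an assumption for illustration.

```python
import random

rng = random.Random(42)
dataset = list(range(1000))  # stand-in for 1,000 row indices

# Draw a bootstrap sample: same size as the data, sampled with replacement.
bootstrap = rng.choices(dataset, k=len(dataset))

in_bag = set(bootstrap)
out_of_bag = [i for i in dataset if i not in in_bag]

print(f"sample size:           {len(bootstrap)}")
print(f"unique rows in sample: {len(in_bag)}")
print(f"left-out rows:         {len(out_of_bag)}")
```

The sample has as many entries as the dataset, but fewer unique rows, because some rows are drawn more than once and others are never drawn.
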

## Role of Bootstrap Sampling in Bagging (e.g., Random Forest)
1. **Creating Diverse Training Sets:**  
   - Each base model (like a decision tree in Random Forest) is trained on a different bootstrap sample.  
   - This introduces **diversity** among the models, which is crucial for reducing variance.

2. **Reducing Overfitting:**  
   - Because each tree sees a slightly different dataset and the forest averages their predictions, Random Forest is less prone to overfitting the original dataset than a single tree.  

3. **Enabling Out-of-Bag (OOB) Error Estimation:**  
   - Data points not included in a bootstrap sample (called *out-of-bag samples*) can be used to estimate the model’s performance without needing a separate test set.


**Question 4: What are Out-of-Bag (OOB) Samples and How is OOB Score Used to Evaluate Ensemble Models?**

**Answer:**  
**Out-of-Bag (OOB) samples** are the data points **not included** in a bootstrap sample when training a base model in Bagging methods (like Random Forest). Since each bootstrap sample leaves out about **36.8% (roughly one-third) of the original data** on average, these left-out samples can be used for evaluation.
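
The roughly one-third figure can be derived directly. Assume a dataset of $n$ points and a bootstrap sample made of $n$ independent draws with replacement (the symbol $n$ is introduced here only for this derivation). A given point is missed by one draw with probability $1 - 1/n$, so it is missed by all $n$ draws with probability:

$$
P(\text{point } i \text{ is out-of-bag}) = \left(1 - \frac{1}{n}\right)^{n} \;\longrightarrow\; e^{-1} \approx 0.368 \quad \text{as } n \to \infty
$$

For example, with $n = 100$ the left-hand side is already about 0.366, so the limit is a good approximation even for small datasets.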

## Role of OOB Samples
1. **Evaluation Without a Separate Test Set:**  
   - OOB samples act as a **built-in validation set**, allowing estimation of model performance without needing an external test dataset.

2. **OOB Score Calculation:**  
   - Each data point that is OOB for some trees is predicted using only those trees.  
   - The predictions are aggregated (majority vote for classification, averaging for regression) to compute the **OOB score**, which reflects model accuracy or error.

3. **Advantages:**  
   - Saves data since no separate test set is needed.  
   - Provides a reasonable, though not strictly unbiased, estimate of generalization performance.  
   - Can support **hyperparameter tuning** and permutation-based **feature importance estimation**.
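
In scikit-learn, the OOB score is available by passing `oob_score=True` to the forest and reading `oob_score_` after fitting. The sketch below assumes the Breast Cancer dataset and 200 trees purely for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each row using only the trees
# whose bootstrap sample did not contain that row.
rf = RandomForestClassifier(
    n_estimators=200, oob_score=True, bootstrap=True, random_state=42
)
rf.fit(X, y)

oob_accuracy = rf.oob_score_
print(f"OOB accuracy estimate: {oob_accuracy:.4f}")
```

Note that the model is fitted on all rows, yet the estimate is still based on predictions for rows each tree did not see. With very few trees, some rows may never be out-of-bag, so a reasonably large `n_estimators` is advisable.
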

**Question 5: Compare Feature Importance Analysis in a Single Decision Tree vs. a Random Forest**

**Answer:**  
Feature importance measures how much each feature contributes to the predictive performance of a model. Both Decision Trees and Random Forests provide feature importance, but the way it is computed and interpreted differs.

| Feature | Decision Tree | Random Forest |
|---------|---------------|---------------|
| **Calculation** | Importance is based on how much a feature **reduces impurity** (like Gini or entropy) in the tree’s splits | Impurity-based importance is averaged over **all trees** in the forest, reducing the variance of a single tree’s estimate |
| **Stability** | Can be **unstable**, as small changes in data can lead to different splits and importance scores | More **stable and reliable** because it aggregates importance across multiple trees |
| **Bias Towards Certain Features** | Can be **biased toward features with more levels** (categorical) or continuous features with many split points | Impurity-based importance keeps this bias, although random feature subsets dilute it somewhat; permutation importance is a common cross-check |
| **Interpretation** | Shows importance for the **specific tree** | Shows overall importance for the **entire ensemble**, so it is less tied to one tree’s particular splits |
| **Use Case** | Good for quick analysis or simple datasets | Preferred for robust feature selection in complex datasets |
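
The "averaged over all trees" row can be checked by hand. The following sketch, which assumes the Breast Cancer dataset and default hyperparameters, averages the per-tree impurity importances and compares the result with the forest's own `feature_importances_`. It also counts how many features each model ignores entirely.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=42).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Average the impurity-based importances of the individual trees by hand.
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
manual_mean = per_tree.mean(axis=0)
manual_mean /= manual_mean.sum()

# Count features that receive exactly zero importance in each model.
n_zero_tree = int((tree.feature_importances_ == 0).sum())
n_zero_forest = int((forest.feature_importances_ == 0).sum())

print(f"features with zero importance: tree={n_zero_tree}, forest={n_zero_forest}")
print("manual mean matches forest:",
      np.allclose(manual_mean, forest.feature_importances_))
```

A single tree only assigns importance to the features it happens to split on, which is one reason its ranking is less stable than the forest's.
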


In [1]:
# Question 6: Random Forest Feature Importance on Breast Cancer Dataset

# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Train a Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importance scores
importances = rf.feature_importances_

# Create a DataFrame for better visualization
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Sort features by importance
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Print the top 5 most important features
top5_features = feature_importance_df.head(5)
print("Top 5 Most Important Features:\n")
print(top5_features)


Top 5 Most Important Features:

                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


In [4]:
# Question 7: Bagging Classifier vs Single Decision Tree on Iris Dataset

# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)

# Train a Bagging Classifier using Decision Trees
# `estimator` is the current parameter name; scikit-learn releases before 1.2 called it `base_estimator`
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50,
                            random_state=42, n_jobs=-1)
bagging.fit(X_train, y_train)
y_pred_bagging = bagging.predict(X_test)
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)

# Print the accuracies
print(f"Accuracy of Single Decision Tree: {accuracy_dt:.4f}")
print(f"Accuracy of Bagging Classifier: {accuracy_bagging:.4f}")


Accuracy of Single Decision Tree: 1.0000
Accuracy of Bagging Classifier: 1.0000


In [1]:
# Question 8: Random Forest Classifier with Hyperparameter Tuning using GridSearchCV

# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 3, 5, 7]
}

# Perform GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print(f"Best Hyperparameters: {best_params}")

# Evaluate the final model on the test set
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Final Accuracy on Test Set: {accuracy:.4f}")


Best Hyperparameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy on Test Set: 1.0000


In [2]:
# Question 9: Bagging Regressor vs Random Forest Regressor on California Housing Dataset

# Import necessary libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
california = fetch_california_housing()
X = california.data
y = california.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Bagging Regressor using Decision Trees
# (50 trees here vs. 100 in the forest below, so the comparison is not strictly like-for-like)
bagging = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=50,
                           random_state=42, n_jobs=-1)
bagging.fit(X_train, y_train)
y_pred_bagging = bagging.predict(X_test)
mse_bagging = mean_squared_error(y_test, y_pred_bagging)

# Train a Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Print Mean Squared Errors
print(f"Mean Squared Error of Bagging Regressor: {mse_bagging:.4f}")
print(f"Mean Squared Error of Random Forest Regressor: {mse_rf:.4f}")


Mean Squared Error of Bagging Regressor: 0.2579
Mean Squared Error of Random Forest Regressor: 0.2565


**Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.**

**You decide to use ensemble techniques to increase model performance. Explain your step-by-step approach to:**

 **● Choose between Bagging or Boosting**

 **● Handle overfitting**

 **● Select base models**

 **● Evaluate performance using cross-validation**

 **● Justify how ensemble learning improves decision-making in this real-world context.**

**Answer:**  

When predicting loan default at a financial institution, ensemble techniques can improve predictive accuracy, stability, and generalization. Here’s a step-by-step approach:

---

## 1. Choosing Between Bagging or Boosting
- **Bagging** (e.g., Random Forest) is suitable when the base models are **high-variance**, like Decision Trees, and we want to **reduce overfitting**.  
- **Boosting** (e.g., AdaBoost, Gradient Boosting, XGBoost) is suitable when base models are **weak learners** and the goal is to **reduce bias** by sequentially correcting errors.  
- **Decision:** Start with **Random Forest (Bagging)** to get robust predictions. If higher accuracy is needed, use **Boosting** for fine-tuned error correction.

---

## 2. Handling Overfitting
- **Bagging:** Reduces variance by averaging multiple trees trained on different bootstrap samples.  
- **Boosting:** Control overfitting by tuning parameters like `learning_rate`, `n_estimators`, and `max_depth`.  
- **Additional techniques:**  
  - Feature selection to remove irrelevant variables.  
  - Regularization (e.g., L1/L2 penalties in boosting models).  
  - Cross-validation to monitor generalization.
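
The boosting controls listed above might be combined as in the sketch below. It uses scikit-learn's `GradientBoostingClassifier` with built-in early stopping. The data is a synthetic, imbalanced stand-in for loan records (an assumption, since no real dataset is given), and every hyperparameter value is illustrative rather than recommended.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for loan data (about 10% positives).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=6,
    weights=[0.9, 0.1], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

gb = GradientBoostingClassifier(
    learning_rate=0.05,       # smaller steps, less aggressive fitting
    n_estimators=500,         # upper bound; early stopping may use fewer
    max_depth=3,              # shallow trees as weak learners
    subsample=0.8,            # row subsampling adds randomness
    validation_fraction=0.2,  # held out internally for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
gb.fit(X_train, y_train)

n_trees_used = gb.n_estimators_
test_accuracy = gb.score(X_test, y_test)
print(f"trees actually fitted: {n_trees_used} of 500")
print(f"test accuracy: {test_accuracy:.4f}")
```

Early stopping lets `n_estimators` act as a ceiling instead of a value that must be tuned exactly, since training halts once the internal validation score stops improving.
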

---

## 3. Selecting Base Models
- **Decision Trees** are commonly used because they:  
  - Capture non-linear patterns in customer data.  
  - Work well with both bagging and boosting.  
- For tabular financial data, **shallow trees** (`max_depth` of about 3 to 5) are often good base learners for boosting, while bagging usually works with deeper trees.  

---

## 4. Evaluating Performance Using Cross-Validation
- Use **k-fold cross-validation** (e.g., k=5 or 10), preferably **stratified** so that each fold keeps the same share of defaulters.  
- Metrics to track:  
  - **Accuracy** (overall correctness, though it can be misleading if defaults are rare)  
  - **Precision and Recall** (especially for detecting defaulters)  
  - **ROC-AUC** (for ranking risk probabilities)  
- Cross-validation estimates how well the model generalizes to unseen customers and helps detect overfitting.
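
The evaluation step could look like the following sketch, which scores a Random Forest on several metrics at once with `cross_validate`. The data is again a synthetic, imbalanced stand-in for loan records, and the use of `class_weight="balanced"` is an assumption added to cope with the rare positive class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for loan data (about 10% positives).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=6,
    weights=[0.9, 0.1], random_state=42,
)

model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)

# Stratified folds keep the class ratio the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["accuracy", "precision", "recall", "roc_auc"]
scores = cross_validate(model, X, y, cv=cv, scoring=metrics)

for metric in metrics:
    fold_scores = scores[f"test_{metric}"]
    print(f"{metric:>9}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Reporting the spread across folds alongside the mean gives a sense of how stable each metric is. On imbalanced data, accuracy tends to look strong even when recall on the minority class is weak, which is why the other metrics are tracked as well.
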

---

## 5. How Ensemble Learning Improves Decision-Making
- **Higher predictive accuracy:** Combining multiple models reduces individual model errors.  
- **Stability:** Aggregating predictions reduces the impact of outliers or noisy data.  
- **Better risk assessment:** Helps the institution identify high-risk customers more reliably.  
- **Informed decisions:** Accurate predictions enable better loan approval policies, interest rate assignments, and proactive interventions for at-risk customers.  

**Conclusion:**  
Using ensemble learning, such as Bagging or Boosting, allows financial institutions to build **robust, accurate, and reliable models** for predicting loan default, which ultimately improves risk management and decision-making.
