# Assignment

**Question 1: What is Ensemble Learning in machine learning? Explain the key idea
behind it.**

Ensemble Learning is a machine learning technique in which multiple individual models (called base learners) are combined to solve a single problem in order to achieve better performance than any single model alone.

**Key Idea Behind Ensemble Learning**:

The key idea of ensemble learning is that a group of diverse models can often make more accurate predictions than a single model. Each model learns different patterns or aspects of the data, so their individual errors partly cancel out when the predictions are combined using methods such as voting, averaging, or weighted averaging.

By combining multiple models, ensemble learning helps to:


1.   Reduce overfitting
2.   Improve accuracy
3.   Increase robustness and stability of predictions

**Example**

- In classification, multiple models vote for a class, and the class with the most votes is selected.

- In regression, predictions from multiple models are averaged to produce the final output.
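
The two combination rules above can be sketched in plain Python. This is a minimal illustration only: the predictions are made up, and the helper names `majority_vote` and `average_prediction` are hypothetical, not part of any library.

```python
from collections import Counter
from statistics import mean

def majority_vote(labels):
    """Return the most common label among the base learners' predictions."""
    return Counter(labels).most_common(1)[0][0]

def average_prediction(values):
    """Return the arithmetic mean of the base learners' numeric predictions."""
    return mean(values)

# Hypothetical predictions from three base learners for one sample
class_votes = ["spam", "ham", "spam"]
regression_outputs = [2.0, 3.0, 4.0]

final_class = majority_vote(class_votes)
final_value = average_prediction(regression_outputs)
print(final_class, final_value)
```

Here two of the three learners vote for the same class, so that class wins, and the regression output is the mean of the three values.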

**Question 2: What is the difference between Bagging and Boosting?**

Bagging (Bootstrap Aggregating) and Boosting are two popular ensemble learning techniques, but they differ in how models are trained and combined.

**Bagging (Bootstrap Aggregating)**:

Bagging focuses on reducing variance by training multiple models independently on different random samples of the dataset (created using bootstrapping). Each model has equal importance, and their predictions are combined using majority voting (for classification) or averaging (for regression).

**Key points:**

- Models are trained in parallel

- Each model has equal weight

- Helps reduce overfitting

- **Example**: Random Forest

**Boosting**

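Boosting focuses on reducing bias by training models sequentially. Each new model concentrates on the errors of the models before it: AdaBoost increases the weights of misclassified data points, while Gradient Boosting fits each new model to the remaining errors (residuals) of the current ensemble. Performance improves gradually as models are added.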

**Key points:**

- Models are trained sequentially

- Misclassified points get higher weight

- Helps improve model accuracy

- **Examples**: AdaBoost, Gradient Boosting, XGBoost
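
To make "misclassified points get higher weight" concrete, here is a rough sketch of one AdaBoost-style reweighting round in plain Python. The function name `adaboost_reweight` and the toy data are assumptions for illustration; Gradient Boosting and XGBoost do not use this exact rule, since they fit new trees to residuals or gradients instead.

```python
import math

def adaboost_reweight(weights, misclassified):
    """One AdaBoost-style round: return (alpha, new normalized weights).

    Assumes the weights sum to 1 and the weighted error is strictly
    between 0 and 1.
    """
    # Weighted error of the current weak learner
    error = sum(w for w, miss in zip(weights, misclassified) if miss)
    # Learner weight: larger when the weighted error is smaller
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Increase the weights of misclassified points, decrease the rest
    updated = [
        w * math.exp(alpha if miss else -alpha)
        for w, miss in zip(weights, misclassified)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Five training points with uniform weights; the learner misclassifies one
weights = [0.2] * 5
misclassified = [False, False, True, False, False]

alpha, new_weights = adaboost_reweight(weights, misclassified)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the update the single misclassified point carries half of the total weight, so the next learner is pushed to get it right. The value `alpha` is also the weight this learner would receive in the final vote, which is why the table below lists "Different weights" for Boosting.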



**Summary of Differences**


| Aspect          | Bagging                  | Boosting                 |
| --------------- | ------------------------ | ------------------------ |
| Training Method | Parallel                 | Sequential               |
| Focus           | Reducing variance        | Reducing bias            |
| Data Sampling   | Random bootstrap samples | Reweighted samples       |
| Model Weight    | Equal                    | Different weights        |
| Overfitting     | Reduced                  | Can overfit if not tuned |


**Conclusion**

Bagging improves stability by combining independent models, while Boosting builds stronger models by learning from previous mistakes.

**Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**

Bootstrap sampling is a statistical resampling technique in which multiple new datasets are created by randomly sampling data points from the original dataset with replacement. As a result, some observations may appear multiple times in a sample, while others may not appear at all.
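
As a small sketch of sampling with replacement, the snippet below draws one bootstrap sample from a toy dataset using only the standard library. The helper name `bootstrap_sample` and the ten-element dataset are assumptions made for illustration.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return rng.choices(data, k=len(data))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
original = list(range(10))

sample = bootstrap_sample(original, rng)
left_out = sorted(set(original) - set(sample))

print("bootstrap sample:", sample)
print("never drawn:", left_out)
```

The sample has the same size as the original data, but duplicates are allowed, so some values typically end up in `left_out`.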

**Role of Bootstrap Sampling in Bagging and Random Forest**

In Bagging (Bootstrap Aggregating) methods such as Random Forest, bootstrap sampling plays a crucial role by creating diverse training datasets for each model.

**Its key roles include:**

- Each decision tree in a Random Forest is trained on a different bootstrap sample of the original dataset.

- This introduces diversity among the models, as each tree sees a slightly different version of the data.

- Diversity among models helps in reducing variance and preventing overfitting.

- The final prediction is obtained by aggregating the predictions of all trees through majority voting (classification) or averaging (regression).

**Conclusion**

In summary, bootstrap sampling enables bagging methods like Random Forest to build multiple diverse models, which when combined, result in a more stable, accurate, and robust machine learning model.

**Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?**

Out-of-Bag (OOB) samples are the data points from the original dataset that are not selected in a bootstrap sample when training a model in bagging-based ensemble methods such as Random Forest.

**Out-of-Bag (OOB) Samples Explained**

During bootstrap sampling, each model is trained on a random sample drawn with replacement from the dataset. When the sample has the same size as the dataset, about 63% of the distinct original data points are included on average, while the remaining 37% or so are left out. These left-out data points are called Out-of-Bag samples.
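
The 63% / 37% split follows from a short calculation: a given point is missed by one draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches e^(-1) ≈ 0.368 as n grows. The sketch below checks this numerically; the function name `expected_oob_fraction` is made up for the example.

```python
import math

def expected_oob_fraction(n):
    """Probability that a given point is never drawn in n draws with replacement."""
    return (1.0 - 1.0 / n) ** n

for n in (10, 100, 1000, 100000):
    print(n, round(expected_oob_fraction(n), 4))

limit = math.exp(-1)  # the large-n limit of (1 - 1/n)^n
print("limit:", round(limit, 4))
```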

**Use of OOB Score in Model Evaluation**

The OOB score is used as an internal validation method to evaluate ensemble models without needing a separate test dataset.

- For each data point, predictions are made using only the models for which that point was an OOB sample.

- These predictions are aggregated and compared with the actual values.

- The resulting metric is called the OOB score: typically accuracy for classification, and an error or goodness-of-fit measure for regression (scikit-learn reports R² for regressors).
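
In scikit-learn, this procedure is available through the `oob_score=True` option of the bagging-based estimators. The following is a minimal sketch on the Breast Cancer dataset; the choice of 200 trees and the seed are arbitrary assumptions, not tuned values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each sample using only
# the trees whose bootstrap sample did not contain it
rf_oob = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,
    random_state=42,
)
rf_oob.fit(X, y)

oob_accuracy = rf_oob.oob_score_
print("OOB accuracy:", oob_accuracy)
```

No train/test split is needed here, because every sample is evaluated only by trees that never saw it during training.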

**Advantages of OOB Score**

- Eliminates the need for a separate validation set.

- Provides an approximately unbiased estimate of model performance, since each point is scored only by models that did not train on it.

- Efficient and saves computational resources.

**Conclusion**

OOB samples allow ensemble models like Random Forest to self-evaluate their performance, making OOB score a reliable and convenient metric for assessing model accuracy.

**Question 5: Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.**

Feature importance analysis helps identify which input features contribute most to a model's predictions. The way feature importance is calculated and interpreted differs between a single Decision Tree and a Random Forest.

**Feature Importance in a Single Decision Tree**

In a single Decision Tree, feature importance is based on how much each feature reduces impurity (such as Gini impurity or entropy) at the splits where it is used.

- Importance is calculated from one tree only.

- Features used near the top of the tree usually appear more important.

- Results are easy to interpret and visualize.

- However, the importance values can be unstable and highly sensitive to the training data.
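
The impurity reduction behind these scores can be sketched for a single split as follows. The helper names `gini` and `impurity_decrease` and the toy labels are assumptions for illustration; a tree's importance for a feature then adds up such decreases, weighted by the share of samples reaching each node, over all splits that use the feature.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Hypothetical node with 4 samples of each class and two candidate splits
parent = [0, 0, 0, 0, 1, 1, 1, 1]
perfect = impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1])
weak = impurity_decrease(parent, [0, 0, 0, 1], [0, 1, 1, 1])
print(perfect, weak)
```

The split that separates the classes completely removes all of the parent's impurity (0.5), while the split that leaves one wrong label on each side removes only 0.125, so the feature behind the first split would be credited with more importance.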

**Feature Importance in a Random Forest**

In a Random Forest, feature importance is computed by averaging the importance values across all trees in the forest.

- Importance is more stable and reliable due to aggregation.

- Reduces bias caused by a single tree's structure.

- Captures feature contributions across different subsets of data and features.

- Less interpretable at the individual tree level but more robust overall.
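
The averaging described above can be inspected directly in scikit-learn, because a fitted forest exposes its trees through `estimators_`. The sketch below is illustrative only (50 trees and the seed are arbitrary choices): it collects each tree's importances, averages them, and also reports how much one feature's importance varies from tree to tree.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# One row of importances per fitted tree
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

mean_importance = per_tree.mean(axis=0)  # average across trees
spread = per_tree.std(axis=0)            # tree-to-tree variation

top = int(np.argmax(mean_importance))
print("top feature index:", top)
print("mean importance:", mean_importance[top], "std across trees:", spread[top])
```

A large standard deviation relative to the mean indicates that individual trees disagree about a feature, which is the instability that averaging is meant to smooth out.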


**Comparison Summary**


| Aspect              | Decision Tree | Random Forest  |
| ------------------- | ------------- | -------------- |
| Number of Models    | Single tree   | Multiple trees |
| Stability           | Low           | High           |
| Sensitivity to Data | High          | Low            |
| Interpretability    | High          | Moderate       |
| Reliability         | Lower         | Higher         |

**Conclusion**

A single Decision Tree provides simple and interpretable feature importance, while a Random Forest offers more stable and reliable importance estimates by combining results from many trees.

**Question 6: Write a Python program to:**

**● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()**

**● Train a Random Forest Classifier**

**● Print the top 5 most important features based on feature importance scores.**

```python
# Import required libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Train Random Forest Classifier
rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)

# Get feature importance
feature_importance = pd.Series(rf.feature_importances_, index=X.columns)

# Get top 5 important features
top_5_features = feature_importance.sort_values(ascending=False).head(5)

# Print the results
print("Top 5 Most Important Features:")
print(top_5_features)
```

```
Top 5 Most Important Features:
worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64
```


**Explanation**

The Random Forest model calculates feature importance based on how much each feature reduces impurity across all trees. The top features contribute the most to predicting whether a tumor is malignant or benign.

**Question 7: Write a Python program to:**

**● Train a Bagging Classifier using Decision Trees on the Iris dataset**

**● Evaluate its accuracy and compare with a single Decision Tree**

```python
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_predictions = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_predictions)

# Train a Bagging Classifier with Decision Trees
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bagging.fit(X_train, y_train)
bagging_predictions = bagging.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_predictions)

# Print accuracies
print("Decision Tree Accuracy:", dt_accuracy)
print("Bagging Classifier Accuracy:", bagging_accuracy)
```

```
Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0
```


**Conclusion:**

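In this run both models reach an accuracy of 1.0 on the test split, so the Bagging Classifier shows no measurable gain over a single Decision Tree on Iris, which is a small and easily separable dataset. On noisier data, Bagging typically helps because it combines multiple trees trained on different bootstrap samples, which reduces variance.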

**Question 8: Write a Python program to:**

**● Train a Random Forest Classifier**

**● Tune hyperparameters max_depth and n_estimators using GridSearchCV**

**● Print the best parameters and final accuracy**

```python
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define Random Forest model
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 5, 10]
}

# Apply GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy'
)

grid_search.fit(X_train, y_train)

# Get best model
best_model = grid_search.best_estimator_

# Make predictions
y_pred = best_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", grid_search.best_params_)
print("Final Accuracy:", accuracy)
```

```
Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy: 0.9707602339181286
```


**Conclusion**

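GridSearchCV evaluates every combination in the grid with 5-fold cross-validation and selects the one with the best mean validation accuracy. In this run the selected values (`max_depth=None`, `n_estimators=100`) match scikit-learn's defaults, so the search confirms the default configuration rather than improving on it. The selected model reaches about 97.1% accuracy on the held-out test set.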

**Question 9: Write a Python program to:**

**● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset**

**● Compare their Mean Squared Errors (MSE)**

```python
# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Bagging Regressor
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42
)
bagging_reg.fit(X_train, y_train)
bagging_pred = bagging_reg.predict(X_test)

# Train Random Forest Regressor
rf_reg = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)
rf_reg.fit(X_train, y_train)
rf_pred = rf_reg.predict(X_test)

# Calculate Mean Squared Errors
bagging_mse = mean_squared_error(y_test, bagging_pred)
rf_mse = mean_squared_error(y_test, rf_pred)

# Print results
print("Bagging Regressor MSE:", bagging_mse)
print("Random Forest Regressor MSE:", rf_mse)
```

```
Bagging Regressor MSE: 0.25787382250585034
Random Forest Regressor MSE: 0.25650512920799395
```


**Conclusion:**

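In this run the Random Forest Regressor has a slightly lower Mean Squared Error (about 0.2565) than the Bagging Regressor (about 0.2579). The gap is small, and the comparison is not like-for-like: the forest uses 100 trees against 50 for bagging, and scikit-learn's `RandomForestRegressor` considers all features at each split by default (`max_features=1.0`), so with default settings it behaves much like bagged decision trees. The additional randomness from feature selection, which can improve generalization by decorrelating the trees, only comes into play when `max_features` is set below 1.0.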
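
In scikit-learn, the amount of feature randomness in a forest is controlled by the `max_features` parameter. The sketch below compares two settings while holding the number of trees and the seed fixed. It uses synthetic data from `make_regression` as a stand-in for the housing data (an assumption of this example), and which setting wins depends on the dataset, so no particular ranking should be expected.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the housing data (an assumption of this sketch)
X, y = make_regression(n_samples=500, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for max_features in (1.0, 0.5):
    # Same number of trees and same seed, so only max_features differs
    model = RandomForestRegressor(
        n_estimators=100, max_features=max_features, random_state=0
    )
    model.fit(X_train, y_train)
    results[max_features] = mean_squared_error(y_test, model.predict(X_test))

for max_features, mse in results.items():
    print(f"max_features={max_features}: MSE={mse:.2f}")
```

With `max_features=1.0` every split may use any feature, which corresponds to plain bagging of trees; with `max_features=0.5` each split sees a random half of the features.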

**Question 10: You are working as a data scientist at a financial institution to predict loan  default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance.**

**Explain your step-by-step approach to:**

**● Choose between Bagging or Boosting**

**● Handle overfitting**

**● Select base models**

**● Evaluate performance using cross-validation**

**● Justify how ensemble learning improves decision-making in this real-world
context.**



**Step-by-Step Approach to Loan Default Prediction Using Ensemble Learning**

**1. Choosing Between Bagging or Boosting**

**Decision:**

- Bagging (e.g., Random Forest) is preferred when:
    1. The base model has high variance
    2. Data is noisy
    3. Overfitting is a concern

- Boosting (e.g., Gradient Boosting, XGBoost) is preferred when:
    1. The model has high bias
    2. Complex relationships exist
    3. You want to focus on hard-to-classify defaulters

**In a financial loan default problem:**

- Default prediction usually has imbalanced data and complex patterns

- Boosting is often more effective, as it sequentially focuses on misclassified (high-risk) customers

👉 **Chosen approach:** Boosting, with Random Forest as a benchmark

**2. Handling Overfitting**

**To reduce overfitting:**

- Use ensemble models instead of a single model

- Apply regularization parameters:
    1. Limit tree depth
    2. Use a small learning rate (for boosting)

- Use cross-validation

- Use early stopping (in boosting)
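
As a rough sketch of how these controls fit together for boosting, the example below combines limited depth, a learning rate, and early stopping in scikit-learn's `GradientBoostingClassifier`. The data is synthetic and the specific values (500 rounds, 20% validation, patience of 10) are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

gb_early = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.1,        # shrinks each tree's contribution
    max_depth=3,              # limits tree depth
    validation_fraction=0.2,  # held out internally to monitor the loss
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
gb_early.fit(X, y)

rounds_used = gb_early.n_estimators_
print("boosting rounds actually fitted:", rounds_used)
```

If the validation loss stops improving, training ends before the 500-round limit, and `n_estimators_` reports how many rounds were actually fitted.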

**3. Selecting Base Models**

**Base Learners:**

Decision Trees (weak learners)

**Why Decision Trees?**

1. Handle non-linear relationships

2. Work well with mixed data (demographic + transaction)

3. Interpretable for financial decisions

**Ensemble Models Used:**

1. Random Forest (Bagging)

2. Gradient Boosting (Boosting)



**4. Evaluating Performance Using Cross-Validation**

**Why Cross-Validation?**

1. Ensures model stability

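2. Helps detect overfitting (it does not prevent data leakage by itself, so preprocessing should be fit within each training fold)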

3. Provides reliable performance estimate

**Metrics Used:**

1. Accuracy

2. ROC-AUC (important for default prediction)

3. Precision & Recall (cost-sensitive problem)
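
A possible way to compute these metrics together is scikit-learn's `cross_validate` with stratified folds, which keep the share of defaulters similar in every fold. The sketch below uses simulated data in place of real loan records, and the model settings are defaults rather than tuned values; both are assumptions of this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Simulated imbalanced data: roughly 30% positives stand in for defaulters
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, weights=[0.7, 0.3], random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = GradientBoostingClassifier(random_state=42)

scores = cross_validate(
    model, X, y, cv=cv,
    scoring=["accuracy", "roc_auc", "precision", "recall"],
)

for metric in ("accuracy", "roc_auc", "precision", "recall"):
    fold_scores = scores[f"test_{metric}"]
    print(f"{metric}: {fold_scores.mean():.3f} (+/- {fold_scores.std():.3f})")
```

Reporting the spread across folds alongside the mean gives a sense of how stable each metric is, and recall on the default class shows how many defaulters the model would miss, which accuracy alone can hide on imbalanced data.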

**5. How Ensemble Learning Improves Decision-Making**

**Ensemble learning:**

1. Reduces prediction risk

2. Improves default detection

3. Produces more stable credit decisions

4. Minimizes financial losses from false approvals



```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Simulated loan default dataset
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    weights=[0.7, 0.3],  # Imbalanced data
    random_state=42
)

# Bagging Model
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=6,
    random_state=42
)

# Boosting Model
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

# Cross-validation accuracy
rf_score = cross_val_score(rf, X, y, cv=5, scoring='accuracy').mean()
gb_score = cross_val_score(gb, X, y, cv=5, scoring='accuracy').mean()

print("Random Forest Accuracy:", rf_score)
print("Gradient Boosting Accuracy:", gb_score)
```

```
Random Forest Accuracy: 0.9040000000000001
Gradient Boosting Accuracy: 0.93
```


**Conclusion**

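In this simulated loan default example, both ensembles perform well under 5-fold cross-validation, with Gradient Boosting reaching 0.93 accuracy against about 0.904 for Random Forest. Boosting's sequential focus on previously misclassified, high-risk customers is a plausible reason for the gap, but the ranking can differ on real data and should be confirmed with ROC-AUC, precision, and recall. Used this way, ensemble learning can reduce financial risk, improve the accuracy of credit decisions, and make real-world deployment more robust.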