# Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.

- Ensemble learning is a technique in which multiple models (often called base learners, or weak learners when each one is only slightly better than chance) are trained and their predictions are combined, which usually gives better predictions than any single one of them.

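- Key Idea Behind Ensemble Learning

The central concept is that:

- "A group of diverse models, when combined, can perform better than any individual model alone."

The gain depends on the models making different errors, so that combining them cancels individual mistakes rather than repeating them.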
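The key idea can be checked with a small calculation. This is a minimal sketch that assumes binary classification, an odd number of models, and fully independent errors (real models are correlated, so the actual gain is smaller). The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb

def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n independent models is correct,
    when each model is correct with probability p."""
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )

# Each model is only 60% accurate; the vote improves as models are added.
for n in (1, 3, 11, 51):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

With three models at 70% accuracy each, the vote is right with probability 0.7³ + 3 · 0.7² · 0.3 = 0.784.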



# Question 2: What is the difference between Bagging and Boosting?

| Feature              | Bagging            | Boosting          |
| -------------------- | ------------------ | ----------------- |
| Learning type        | Parallel           | Sequential        |
| Focus                | Reduce variance    | Reduce bias       |
| Data sampling        | Bootstrap samples  | Weighted data     |
| Model dependency     | Independent models | Dependent models  |
| Sensitivity to noise | Less sensitive     | More sensitive    |
| Example algorithms   | Random Forest      | AdaBoost, XGBoost |
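The "Weighted data" entry for Boosting can be made concrete with one round of an AdaBoost-style weight update. This is a sketch of the classic discrete AdaBoost rule for binary labels, not the exact code of any library. The helper name `adaboost_reweight` is hypothetical, and it assumes the weighted error is strictly between 0 and 1.

```python
import math

def adaboost_reweight(weights, misclassified):
    """One boosting round: raise the weight of points the current learner got wrong."""
    err = sum(w for w, bad in zip(weights, misclassified) if bad)  # weighted error
    alpha = 0.5 * math.log((1 - err) / err)                        # learner's vote strength
    updated = [
        w * math.exp(alpha if bad else -alpha)
        for w, bad in zip(weights, misclassified)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]                     # renormalise to sum to 1

weights = [0.2] * 5                                  # start with uniform weights
misclassified = [False, False, True, False, False]   # the learner missed point 2
alpha, new_weights = adaboost_reweight(weights, misclassified)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the update the single misclassified point carries half of the total weight, so the next learner is pushed to get it right. Bagging, by contrast, never looks at earlier errors: every model draws its own bootstrap sample independently.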


# Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

Bootstrap sampling is a random sampling technique with replacement used to create multiple new datasets (called bootstrap samples) from an original dataset.
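A minimal sketch of drawing bootstrap samples with only the standard library. The helper name `bootstrap_sample` and the toy dataset are assumptions for illustration.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]

rng = random.Random(0)            # fixed seed only so the run is repeatable
original = list(range(10))
samples = [bootstrap_sample(original, rng) for _ in range(3)]
for s in samples:
    print(sorted(s))              # repeated values appear, some values are missing
```

Each sample has the same size as the original dataset, but typically contains duplicates and leaves some points out.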




- Role in Bagging (e.g., Random Forest)

In Bagging methods like Random Forest:

- Bootstrap sampling is used to create a different training dataset for each individual model (e.g., each decision tree).
- Each model sees a slightly different version of the data → this increases diversity among the models.
- When predictions are combined (by averaging or voting), the variance of the final model is reduced.
- Out-of-bag samples (the ~36.8% of data not included in each bootstrap sample) can be used for internal model validation without needing a separate validation set.
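The variance-reduction effect of averaging can be seen in a small simulation. This sketch replaces each tree with a stand-in "model" whose prediction is the true value plus independent noise. That independence is an assumption: real bagged trees are correlated, so their variance drops by less than shown here.

```python
import random
import statistics

rng = random.Random(42)

def noisy_prediction():
    """Stand-in for one high-variance model: true value 10 plus unit-variance noise."""
    return 10 + rng.gauss(0, 1)

def averaged_prediction(n_models):
    """Average the predictions of n_models independent stand-in models."""
    return statistics.fmean([noisy_prediction() for _ in range(n_models)])

single = [noisy_prediction() for _ in range(2000)]
bagged = [averaged_prediction(25) for _ in range(2000)]
var_single = statistics.variance(single)
var_bagged = statistics.variance(bagged)
print(f"variance of one model:     {var_single:.3f}")
print(f"variance of 25-model mean: {var_bagged:.3f}")
```

With independent noise the variance of the average shrinks roughly in proportion to 1 / (number of models).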

# Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

When we do bootstrap sampling in Bagging methods like Random Forest:

- We select N samples with replacement from a dataset of size N.
- Because sampling is with replacement, some data points are selected multiple times, while others are not selected at all for that model’s training set.
- The data points not selected are called Out-of-Bag (OOB) samples for that model.
- On average, about 36.8% of the original dataset is OOB for each bootstrap sample, because each point is missed by all N draws with probability $(1 - 1/N)^N \approx 1/e$ for large N.
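The 36.8% figure can be verified numerically. A short sketch (the function name `expected_oob_fraction` is chosen here for illustration):

```python
import math

def expected_oob_fraction(n: int) -> float:
    """Chance that one given point is never picked in n draws with replacement."""
    return (1 - 1 / n) ** n

for n in (10, 100, 10_000):
    print(n, round(expected_oob_fraction(n), 4))
print("limit 1/e:", round(math.exp(-1), 4))
```

The value approaches 1/e as the dataset grows, and it is already close for a few hundred points.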

**OOB Score**

The OOB score is an internal validation metric for Bagging models that uses OOB samples instead of a separate validation set.

**How it works:**

- For each tree in the Random Forest:
  - Train it on its bootstrap sample.
  - Use its OOB samples to make predictions.
- For each data point:
  - Collect predictions from all trees where that point was OOB.
- Compare the aggregated OOB predictions to the actual labels.

The proportion of correctly predicted OOB samples is the OOB accuracy (for classification); for regression, the analogous score is the OOB $R^2$.
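The procedure above can be sketched without any ML library. To keep it short, the decision tree is replaced by a toy "nearest class mean" learner on one feature. That learner, the synthetic data, and all function names are assumptions for illustration only, not how Random Forest is implemented.

```python
import random
from collections import Counter

def fit_nearest_mean(points):
    """Toy base learner: remember the mean feature value of each class."""
    sums, counts = Counter(), Counter()
    for x, label in points:
        sums[label] += x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in counts}

def predict(model, x):
    """Predict the class whose stored mean is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))

def oob_accuracy(data, n_models, rng):
    n = len(data)
    votes = [Counter() for _ in range(n)]              # OOB votes per data point
    for _ in range(n_models):
        in_bag = [rng.randrange(n) for _ in range(n)]  # bootstrap indices
        model = fit_nearest_mean([data[i] for i in in_bag])
        for i in set(range(n)) - set(in_bag):          # points this model never saw
            votes[i][predict(model, data[i][0])] += 1
    # Majority vote per point, skipping points that were never out-of-bag.
    scored = [(v.most_common(1)[0][0], data[i][1]) for i, v in enumerate(votes) if v]
    return sum(pred == true for pred, true in scored) / len(scored)

rng = random.Random(0)
data = [(rng.gauss(0, 1), 0) for _ in range(100)] + [(rng.gauss(3, 1), 1) for _ in range(100)]
score = oob_accuracy(data, n_models=30, rng=rng)
print(f"OOB accuracy: {score:.3f}")
```

Every prediction that enters the score comes from models that did not train on that point, which is why the OOB score behaves like a validation score.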


# Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

| Aspect               | Single Decision Tree                           | Random Forest                       |
| -------------------- | ---------------------------------------------- | ----------------------------------- |
| **Basis**            | Importance from one tree’s splits              | Averaged importance over many trees |
| **Stability**        | Unstable – small data change can alter ranking | Stable – robust to small changes    |
| **Bias**             | Can be biased toward high-cardinality features | Impurity-based importance keeps this bias; averaging reduces noise, not the bias |
| **Overfitting risk** | High if tree is deep                           | Lower due to ensemble averaging     |
| **Reliability**      | Less reliable                                  | More reliable and generalizable     |
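The "Basis" and "Stability" rows can be illustrated by averaging per-tree importance vectors. The numbers below are hypothetical values invented for this sketch, not output from a trained model.

```python
import statistics

# Hypothetical per-tree importance vectors for three features (each row sums to 1).
per_tree = [
    [0.70, 0.20, 0.10],
    [0.30, 0.55, 0.15],
    [0.55, 0.35, 0.10],
    [0.45, 0.40, 0.15],
]
features = ["f0", "f1", "f2"]

# Column-wise mean = forest-level importance; std shows how much single trees disagree.
forest_importance = [statistics.fmean(col) for col in zip(*per_tree)]
for name, column, mean in zip(features, zip(*per_tree), forest_importance):
    print(f"{name}: mean={mean:.3f}  std across trees={statistics.stdev(column):.3f}")
```

The second tree on its own would rank `f1` first, while the average over all four trees ranks `f0` first. A single tree's ranking can therefore differ from the forest's.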


# Question 6: Write a Python program to:

# ● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()

# ● Train a Random Forest Classifier

# ● Print the top 5 most important features based on feature importance scores.

In [None]:
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Create and train the Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importance scores
importances = rf.feature_importances_
feature_names = data.feature_names

# Create a DataFrame for easy sorting
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Sort by importance in descending order
feature_importance_df = feature_importance_df.sort_values(
    by='Importance',
    ascending=False
)

# Print the top 5 most important features
print("Top 5 Important Features:\n")
print(feature_importance_df.head(5))


Top 5 Important Features:

                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


# Question 7: Write a Python program to:

# ● Train a Bagging Classifier using Decision Trees on the Iris dataset

# ● Evaluate its accuracy and compare with a single Decision Tree

In [None]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, y_pred_dt)

# Train a Bagging Classifier using Decision Trees
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=50,         # number of trees
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
bagging_accuracy = accuracy_score(y_test, y_pred_bag)

# Print results
print(f"Accuracy of Single Decision Tree: {dt_accuracy:.4f}")
print(f"Accuracy of Bagging Classifier  : {bagging_accuracy:.4f}")

# Quick comparison message
if bagging_accuracy > dt_accuracy:
    print("\nBagging improved the accuracy compared to a single Decision Tree.")
else:
    print("\nBagging did not improve accuracy in this run.")

Accuracy of Single Decision Tree: 1.0000
Accuracy of Bagging Classifier  : 1.0000

Bagging did not improve accuracy in this run.


# Question 8: Write a Python program to:
# ● Train a Random Forest Classifier
# ● Tune hyperparameters max_depth and n_estimators using GridSearchCV

# ● Print the best parameters and final accuracy

In [None]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the dataset (Iris for example)
data = load_iris()
X = data.data
y = data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define the model
rf = RandomForestClassifier(random_state=42)

# Define the parameter grid
param_grid = {
    'max_depth': [None, 3, 5, 7],
    'n_estimators': [50, 100, 150]
}

# Create GridSearchCV object
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,               # 5-fold cross-validation
    n_jobs=-1,          # use all CPU cores
    scoring='accuracy'
)

# Fit the model
grid_search.fit(X_train, y_train)

# Get best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Predict with the best model
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Final Accuracy: {accuracy:.4f}")


Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy: 1.0000


# Question 9: Write a Python program to:

# ● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset

# ● Compare their Mean Squared Errors (MSE)

In [None]:
# Import libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Bagging Regressor with Decision Tree as base estimator
bagging = BaggingRegressor(
    estimator=DecisionTreeRegressor(random_state=42),
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
mse_bag = mean_squared_error(y_test, y_pred_bag)

# Train Random Forest Regressor
rf = RandomForestRegressor(
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Print results
print(f"Mean Squared Error (Bagging Regressor)    : {mse_bag:.4f}")
print(f"Mean Squared Error (Random Forest)        : {mse_rf:.4f}")

# Quick comparison message
if mse_rf < mse_bag:
    print("\nRandom Forest performed better (lower MSE).")
elif mse_rf > mse_bag:
    print("\nBagging Regressor performed better (lower MSE).")
else:
    print("\nBoth models performed equally well.")

Mean Squared Error (Bagging Regressor)    : 0.2579
Mean Squared Error (Random Forest)        : 0.2577

Random Forest performed better (lower MSE).


# Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance.

# Explain your step-by-step approach to:

# ● Choose between Bagging or Boosting

# ● Handle overfitting

# ● Select base models

# ● Evaluate performance using cross-validation

# ● Justify how ensemble learning improves decision-making in this real-world context.

- Step 1: Choosing Between Bagging and Boosting

Bagging is preferred when:

- Base models are high-variance and prone to overfitting (e.g., deep decision trees).
- Data contains a lot of noise → Bagging is more robust to noise.

Boosting is preferred when:

- You want to reduce bias and extract more complex patterns.
- You have moderately clean data and can afford longer training time.

For loan default prediction:

- Since accuracy and recall are critical (especially for catching defaults), and patterns may be subtle, Boosting (e.g., XGBoost, LightGBM) is often better.
- However, if the dataset is noisy or very large, Bagging (Random Forest) may be a safer start.

- Step 2: Handling Overfitting

For Bagging:

- Limit the depth of decision trees.
- Increase the number of estimators to stabilize predictions.

For Boosting:

- Use regularization parameters (learning rate, max_depth, subsample).
- Use early stopping with validation data.

Additional Steps:

- Perform feature selection or regularization (L1/L2 penalties).
- Ensure enough cross-validation folds to validate generalization.
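Early stopping for Boosting can be sketched as a simple rule over validation losses. The loss values and the `patience` parameter name are assumptions for this illustration; boosting libraries expose their own early-stopping options.

```python
def best_round_with_early_stopping(val_losses, patience):
    """Return (best_round, best_loss), stopping once `patience` rounds
    have passed without a new best validation loss."""
    best_round, best_loss = 0, float("inf")
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = round_idx, loss
        elif round_idx - best_round >= patience:
            break  # no improvement for `patience` rounds: stop adding learners
    return best_round, best_loss

# Hypothetical validation losses after each boosting round.
val_losses = [0.60, 0.48, 0.41, 0.38, 0.39, 0.40, 0.42, 0.37]
stopped = best_round_with_early_stopping(val_losses, patience=3)
print(stopped)
```

With `patience=3` training stops before the late improvement in the last round is seen. A larger patience would find it, at the cost of more training rounds.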

- Step 3: Selecting Base Models

Common base models:

- Decision Trees (most common for Bagging/Boosting).
- Logistic Regression (in stacking frameworks for interpretability).
- Gradient Boosted Trees for tabular data.

For this case:

- Start with Decision Trees (handle mixed feature types well).
- Consider Logistic Regression as a meta-learner if using Stacking.

- Step 4: Evaluating Performance with Cross-Validation

Split data → Stratified K-Fold Cross-Validation to preserve class balance.

Metrics:

- Primary: Recall (catch as many defaults as possible) or Precision-Recall AUC.
- Secondary: ROC-AUC, Accuracy, F1-score.

Procedure:

- For each fold: train → predict → compute metrics.
- Average results across folds.

Why CV:

- Ensures evaluation is not biased by a single train-test split.
- Detects overfitting early.
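Stratified splitting and the recall metric can be sketched in plain Python. The label distribution (10% defaults) is a hypothetical example and the helper names are made up for this sketch. In practice a library splitter would be used instead.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, rng):
    """Assign each index to one of k folds so every fold keeps roughly the class ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)  # deal indices out round-robin
    return folds

def recall(y_true, y_pred, positive=1):
    """Share of actual positives that were predicted positive."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0

# Hypothetical labels: 1 = default (10%), 0 = repaid (90%).
labels = [1] * 20 + [0] * 180
folds = stratified_folds(labels, k=5, rng=random.Random(0))
for fold in folds:
    print(len(fold), sum(labels[i] for i in fold))  # fold size, defaults in fold
```

Every fold ends up with the same share of defaults as the full dataset, so recall is computed on a comparable number of positive cases in each fold.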

- Step 5: Justifying Ensemble Learning in This Context

Why ensembles help in loan default prediction:

- Better Generalization → Different models capture different aspects of borrower behavior (transaction patterns, demographics).
- Reduced Variance → Bagging smooths predictions by averaging.
- Reduced Bias → Boosting focuses on correcting mistakes, improving detection of rare defaults.
- Robustness → Handles outliers and complex decision boundaries better than a single model.

Impact on Decision-Making:

- Higher recall → fewer missed defaults → reduced financial losses.
- Balanced precision → fewer false positives → avoids rejecting good borrowers unnecessarily.
- Data-driven, consistent decision-making → supports compliance and risk assessment.
