# ***Ensemble Learning***

1.  What is Ensemble Learning in machine learning? Explain the key idea
behind it.

    ANS :- Ensemble learning is a machine learning approach in which multiple models (often called base learners, or "weak learners" in boosting) are combined into a single, stronger predictor. The key idea is that diverse models make different errors, so aggregating their predictions cancels part of those errors out. The main techniques are bagging, boosting, and stacking, which differ in how the models are trained and combined: bagging mainly reduces variance, boosting mainly reduces bias, and stacking combines heterogeneous learners through a meta-model.


    # Key Idea Behind Ensemble Learning

`Wisdom of Crowds`:-

- Just as consulting several people usually yields better advice than asking one, combining models lets their individual errors partly cancel out.

`Error Reduction`:-

- Bagging reduces variance.
- Boosting reduces bias.
- Stacking learns how to weight heterogeneous models, which can improve generalization.

`Diversity Principle`:-

- Different models capture different aspects of the data, so their collective decision is more reliable, provided their errors are not strongly correlated.
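
The "wisdom of crowds" argument can be made concrete with a little arithmetic. The sketch below assumes an idealized ensemble of `n` classifiers that are each correct with probability `p` and whose errors are fully independent (an assumption real models only approximate), and computes how often a majority vote is right. The function name `majority_vote_accuracy` is illustrative only.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes an odd number of voters so that ties cannot occur.
    """
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    majority = n_models // 2 + 1
    # Sum the binomial probabilities of k correct votes for k >= majority.
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )


for n in (1, 3, 11, 51):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

With `p = 0.6`, three voters already reach 0.648, and the value keeps rising as more independent voters are added. When the models' errors are correlated, the gain is smaller, which is why diversity matters.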

2.  What is the difference between Bagging and Boosting?

    ANS :- Bagging trains models independently, each on a bootstrap sample of the training data, then aggregates their predictions (by voting or averaging) to reduce variance. It emphasizes stability through parallel diversity.
    
    Boosting trains models sequentially, with each new model focusing on the errors of the ones before it, which mainly reduces bias. It emphasizes accuracy through iterative refinement.
    
    Both are ensemble methods; the core distinction is variance reduction versus bias reduction, two complementary strategies for improving generalization.
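
To make "each model corrects prior errors" concrete, here is a minimal AdaBoost-style sketch in pure Python on a ten-point, one-dimensional toy dataset. The data and the helper names `best_stump`, `adaboost`, and `predict` are made up for illustration, and the sketch covers only the boosting side of the comparison. Each round fits a decision stump to the weighted data, then increases the weights of the points that stump misclassified.

```python
import math


def stump_predict(x, threshold, polarity):
    """A stump predicts `polarity` when x > threshold, otherwise `-polarity`."""
    return polarity if x > threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best 1-D stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, n_rounds):
    """Fit n_rounds stumps sequentially, reweighting the data after each one."""
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # this stump's vote weight
        ensemble.append((alpha, threshold, polarity))
        # Misclassified points (y * prediction == -1) get larger weights.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]  # not separable by any single stump

for rounds in (1, 3):
    model = adaboost(xs, ys, rounds)
    mistakes = sum(predict(model, x) != y for x, y in zip(xs, ys))
    print(rounds, "round(s):", mistakes, "training mistakes")
```

On this toy data a single stump misclassifies three points, while three boosted stumps classify all ten correctly. Bagging, by contrast, would train every stump independently on its own bootstrap sample and take an unweighted vote.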

3. What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?

    ANS :- Bootstrap sampling is a statistical resampling technique in which new datasets are created by drawing observations with replacement from the original data. Each sample is the same size as the original but may repeat some points while omitting others.
    
    In Bagging methods like Random Forest, bootstrap sampling ensures each model (e.g., a decision tree) is trained on a slightly different dataset. This diversity reduces correlation among models, stabilizes predictions, and lowers variance.
    
    It injects randomness, enabling ensemble averaging to smooth fluctuations, making the combined predictor more robust and generalizable than any single tree trained on the full dataset.
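
A minimal sketch of drawing one bootstrap sample with the standard library is shown below. The integers stand in for row indices of a training set, and `bootstrap_sample` is a hypothetical helper name, not part of any library.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return rng.choices(data, k=len(data))


rng = random.Random(42)        # fixed seed so the sketch is repeatable
data = list(range(10_000))     # stand-in for the row indices of a dataset

sample = bootstrap_sample(data, rng)
unique_fraction = len(set(sample)) / len(data)

print(len(sample))                # same size as the original dataset
print(round(unique_fraction, 3))  # roughly 0.63 for large datasets
```

Because some rows are drawn more than once, only about 63% of the distinct rows appear in any one sample, so each tree in a bagged ensemble sees a different view of the data.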


4.  What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?

    ANS :- Out-of-Bag (OOB) samples are the data points that are not selected in a given bootstrap sample during bagging. Because each bootstrap sample is drawn with replacement, on average about 36.8% (roughly one-third) of the original observations are left out of it. These excluded points form the OOB set for that model.
    
    The OOB score is computed by predicting each observation using only the models whose bootstrap sample did not contain it, then scoring the aggregated predictions against the true labels. This gives an approximately unbiased (typically slightly pessimistic) estimate of generalization performance without a separate validation set. In Random Forests, OOB scoring acts as a built-in alternative to cross-validation, which makes evaluation efficient.
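
The bookkeeping behind OOB evaluation can be sketched without any model at all. The simulation below uses made-up sizes (1,000 rows, 50 models) and records, for every row, which models never saw it during training. Those are the only models allowed to vote on that row when the OOB score is computed.

```python
import random

rng = random.Random(0)
n_rows, n_models = 1_000, 50

# For every row, collect the ids of the models that did NOT train on it.
oob_models = [[] for _ in range(n_rows)]
for model_id in range(n_models):
    in_bag = set(rng.choices(range(n_rows), k=n_rows))  # one bootstrap sample
    for row in range(n_rows):
        if row not in in_bag:
            oob_models[row].append(model_id)

avg_oob_share = sum(len(m) for m in oob_models) / (n_rows * n_models)
rows_with_oob_vote = sum(1 for m in oob_models if m)

print(round(avg_oob_share, 3))  # near (1 - 1/n)^n, which tends to 1/e
print(rows_with_oob_vote)       # rows that at least one model can score
```

With a few dozen models, practically every row is out-of-bag for some of them, which is why the OOB score can cover the whole training set.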


5. Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.

    ANS :- In a single Decision Tree, feature importance is measured by how much the splits on each feature reduce impurity (e.g., Gini or entropy), weighted by the number of samples reaching each split. Limitations: the measure is biased toward features with many distinct values, and it is unstable, since small changes in the data can alter the tree structure.
  
    In a Random Forest, feature importance is averaged across many trees built on bootstrap samples and random feature subsets. This aggregation stabilizes the importance values and highlights consistently useful predictors, although impurity-based importance still tends to favor high-cardinality features. Strength: ensemble averaging gives more reliable, more generalizable importance estimates than a single tree's analysis.
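
The contrast can be checked with a short scikit-learn sketch. The dataset and seeds are arbitrary choices for illustration. It counts how many features receive exactly zero importance in one tree versus a forest, and measures how much each feature's importance varies from tree to tree inside the forest.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# A single tree only credits the features it happened to split on.
tree_zero = int(np.sum(tree.feature_importances_ == 0))
forest_zero = int(np.sum(forest.feature_importances_ == 0))

# Importance of every feature in every individual tree of the forest.
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])

print("zero-importance features, single tree:", tree_zero)
print("zero-importance features, forest:", forest_zero)
print("largest per-tree std of an importance:", round(float(per_tree.std(axis=0).max()), 3))
```

The per-tree spread shows the instability of any one tree's ranking, and the forest's averaged values smooth it out.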


6.  Write a Python program to:
- Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
- Train a Random Forest Classifier
- Print the top 5 most important features based on feature importance scores.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Train a Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importance scores
importances = rf.feature_importances_

# Create a DataFrame for better visualization
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Sort by importance and select top 5
top_features = importance_df.sort_values(by='Importance', ascending=False).head(5)

# Print the top 5 features
print("Top 5 most important features:")
print(top_features)
```

```
Top 5 most important features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850
```


7.  Write a Python program to:
- Train a Bagging Classifier using Decision Trees on the Iris dataset
- Evaluate its accuracy and compare with a single Decision Tree

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, y_pred_dt)

# Train a Bagging Classifier using Decision Trees
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bagging = bagging.predict(X_test)
bagging_accuracy = accuracy_score(y_test, y_pred_bagging)

# Print results
print("Accuracy of Single Decision Tree:", dt_accuracy)
print("Accuracy of Bagging Classifier:", bagging_accuracy)
```

```
Accuracy of Single Decision Tree: 1.0
Accuracy of Bagging Classifier: 1.0
```


8.  Write a Python program to:
- Train a Random Forest Classifier
- Tune hyperparameters max_depth and n_estimators using GridSearchCV
- Print the best parameters and final accuracy

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define the Random Forest model
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 3, 5, 10]
}

# Perform GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)

# Get best parameters
print("Best Parameters:", grid_search.best_params_)

# Evaluate the best model on the held-out test set
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Final Accuracy on Test Set:", accuracy)
```

```
Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy on Test Set: 1.0
```


9.  Write a Python program to:
- Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
- Compare their Mean Squared Errors (MSE)

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Bagging Regressor using Decision Trees
# Note: 50 estimators here vs. 100 in the Random Forest below,
# so the comparison is not strictly like-for-like.
bagging = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)
bagging.fit(X_train, y_train)
y_pred_bagging = bagging.predict(X_test)
mse_bagging = mean_squared_error(y_test, y_pred_bagging)

# Train Random Forest Regressor
rf = RandomForestRegressor(
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Print results
print("Mean Squared Error (Bagging Regressor):", mse_bagging)
print("Mean Squared Error (Random Forest Regressor):", mse_rf)
```

```
Mean Squared Error (Bagging Regressor): 0.25787382250585034
Mean Squared Error (Random Forest Regressor): 0.25650512920799395
```


10. You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
- Choose between Bagging or Boosting
- Handle overfitting
- Select base models
- Evaluate performance using cross-validation
- Justify how ensemble learning improves decision-making in this real-world
context.

- `Understand the Problem and Data`:- Clearly define the objective (predicting loan default) and understand the characteristics of the customer demographic and transaction history data. This includes identifying the target variable, the feature types, and potential data quality issues.

- `Data Preprocessing and Feature Engineering`:- Perform necessary data cleaning, handle missing values, encode categorical features, scale numerical features, and engineer new features that could be relevant for loan default prediction (e.g., debt-to-income ratio, transaction frequency).

- `Choose Between Bagging or Boosting`:- Analyze the bias-variance trade-off for the loan default dataset. If the base models are likely to have high variance (e.g., deep decision trees), Bagging (like Random Forest) is preferred. If base models are high-bias, Boosting (like Gradient Boosting or XGBoost) might be more effective to reduce bias iteratively. Consider initial model performance and error analysis to guide this choice.

- `Select Base Models`:- For Bagging, the usual base models are fully grown or deep Decision Trees, as in Random Forest. For Boosting, weak learners such as shallow Decision Trees are typical (depth-1 stumps are the classic choice for AdaBoost). The choice depends on the ensemble method selected in the previous step and on the nature of the data.

- `Handle Overfitting`:- For Bagging, limit tree depth or set a minimum leaf size in the base models, use enough trees (n_estimators) that the averaged prediction stabilizes, and monitor the OOB score as a built-in validation check. For Boosting, lower the learning_rate, cap n_estimators and max_depth, subsample rows (subsample), and apply early stopping when the cross-validated score stops improving.

- `Evaluate Performance Using Cross-Validation`:- Employ k-fold cross-validation to get a robust estimate of the model's generalization performance. Use appropriate metrics for imbalanced datasets common in loan default prediction, such as precision, recall, F1-score, ROC AUC, and AUPRC, rather than just accuracy.

- `Justify Ensemble Learning for Loan Default`:- Combining multiple models reduces the risk of relying on a single, potentially biased or high-variance model. Ensemble methods give more stable, accurate, and robust default predictions, which supports better risk assessment, improved lending decisions, and reduced financial losses for the institution.

- `Summary`:- The overall approach is to understand and prepare the data, choose Bagging or Boosting based on the bias-variance behavior of the base models, control overfitting through regularization and OOB or early-stopping checks, and validate with stratified k-fold cross-validation using metrics suited to imbalanced classes.
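
The comparison and evaluation steps above can be sketched as follows. This is a hedged illustration, not a production pipeline. The data is synthetic (`make_classification` with roughly 10% positives standing in for defaults), and every hyperparameter value is an arbitrary starting point rather than a tuned recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for loan data: about 10% "default" labels.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=42,
)

models = {
    # Bagging family: depth limit and class weighting as overfitting/imbalance controls.
    "bagging (random forest)": RandomForestClassifier(
        n_estimators=100, max_depth=8, class_weight="balanced", random_state=42
    ),
    # Boosting family: shallow trees, shrinkage, and row subsampling as regularizers.
    "boosting (gradient boosting)": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, subsample=0.8,
        random_state=42,
    ),
}

# Stratified folds keep the default rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in models.items():
    scores = cross_validate(
        model, X, y, cv=cv, scoring=["roc_auc", "average_precision"]
    )
    results[name] = {
        "roc_auc": float(scores["test_roc_auc"].mean()),
        "average_precision": float(scores["test_average_precision"].mean()),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

On real loan data the same loop would run on the engineered features, and the family with the better cross-validated ROC AUC and average precision (AUPRC) would be carried forward for tuning.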