# **Ensemble Techniques**

1. What is Ensemble Learning in machine learning? Explain the key idea
behind it.
   - Ensemble learning combines multiple machine learning models (called base learners, or weak learners in the boosting setting) to produce a single prediction that is typically more accurate and robust than that of any individual model.

     Key idea: a group of diverse models, each with its own biases and errors, can compensate for one another's weaknesses when their predictions are aggregated. Averaging independently trained models mainly reduces variance (bagging), while training models sequentially on earlier errors mainly reduces bias (boosting).
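
The effect of aggregation can be illustrated with a small simulation. This is only a sketch under a strong assumption: the models are treated as fully independent voters that are each correct 60% of the time (the function name and the numbers are illustrative, not from any library). Real base learners are correlated, so the gain is smaller in practice.

```python
import random


def majority_vote_accuracy(n_models, p_correct, n_trials=20_000, seed=0):
    """Estimate how often a majority of independent voters is correct."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Each model is right with probability p_correct, independently.
        correct_votes = sum(rng.random() < p_correct for _ in range(n_models))
        if correct_votes > n_models / 2:
            wins += 1
    return wins / n_trials


single = majority_vote_accuracy(n_models=1, p_correct=0.6)
ensemble = majority_vote_accuracy(n_models=25, p_correct=0.6)
print(f"single model: {single:.3f}, 25-model majority vote: {ensemble:.3f}")
```

Under the independence assumption the 25-model vote is correct noticeably more often than any single voter, which is the intuition behind bagging.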


2. What is the difference between Bagging and Boosting?
   - Bagging (Bootstrap AGGregatING):
     - Creates multiple bootstrap samples from the training data
     - Trains the same type of model independently (in parallel) on each sample
     - Combines predictions via majority vote (classification) or averaging (regression)
     - Mainly reduces variance
     - Example: Random Forest (which also adds feature randomness)

   - Boosting:
     - Trains the first weak learner on the full dataset
     - Increases the weights of misclassified samples for the next learner (AdaBoost), or fits the next learner to the current residual errors (gradient boosting)
     - Trains learners sequentially, so each one focuses on the errors of its predecessors
     - Produces the final prediction as a weighted combination of all learners
     - Mainly reduces bias
     - Examples: AdaBoost, Gradient Boosting, XGBoost
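
The reweighting step in boosting can be sketched in a few lines. The example below assumes the classic discrete AdaBoost update for binary labels; the function name `adaboost_reweight` and the 10-sample scenario are hypothetical and only serve the illustration.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One discrete AdaBoost round: return (alpha, normalized new weights).

    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, miss in zip(weights, misclassified) if miss)
    alpha = 0.5 * math.log((1 - error) / error)  # learner's vote strength
    updated = [
        w * math.exp(alpha if miss else -alpha)
        for w, miss in zip(weights, misclassified)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


n = 10
weights = [1 / n] * n                      # start with uniform weights
misclassified = [False] * 8 + [True] * 2   # hypothetical: 2 of 10 are wrong
alpha, new_weights = adaboost_reweight(weights, misclassified)

print(f"alpha = {alpha:.3f}")
print(f"correct sample weight:       {new_weights[0]:.4f}")   # 0.0625
print(f"misclassified sample weight: {new_weights[-1]:.4f}")  # 0.2500
```

After the update the two misclassified samples together carry half of the total weight, so the next learner is pushed to get them right.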


3. What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?
   - Bootstrap sampling is a resampling technique that creates multiple training datasets by randomly drawing samples from the original dataset with replacement, where each bootstrap sample has the same size as the original data. On average, each bootstrap sample contains about 63.2% of the distinct original data points; the remaining draws are repeats of points already selected.

     Role in Bagging (e.g., Random Forest):
     - Creates diversity: each tree in a Random Forest is trained on a different bootstrap sample, so the trees see slightly different data and make less correlated errors.
     - Reduces variance: aggregating the trees (majority vote for classification, averaging for regression) averages out the noise and overfitting of individual trees.
     - Out-of-bag (OOB) validation: about 36.8% of the data is left out of each bootstrap sample and can be used to estimate ensemble performance without a separate validation set.
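
The 63.2% / 36.8% split follows from the chance that a given point is never drawn in n draws with replacement, (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. A quick standard-library check, with a dataset size and repeat count chosen only for illustration:

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement from range(n)."""
    return [rng.randrange(n) for _ in range(n)]


rng = random.Random(42)
n = 1000          # illustrative dataset size
n_repeats = 200   # number of bootstrap samples to average over

unique_fractions = []
for _ in range(n_repeats):
    sample = bootstrap_indices(n, rng)
    unique_fractions.append(len(set(sample)) / n)

in_bag = sum(unique_fractions) / n_repeats
out_of_bag = 1 - in_bag
print(f"in-bag (distinct) fraction: {in_bag:.3f}")
print(f"out-of-bag fraction:        {out_of_bag:.3f}")
print(f"theoretical OOB fraction:   {(1 - 1 / n) ** n:.3f}")
```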



4. What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?
   - Out-of-Bag (OOB) samples are the data points that are not selected by the bootstrap sampling for a particular tree in bagging-based ensembles like Random Forest, roughly 37% of the original dataset per tree. To compute the OOB score, each training point is predicted using only the trees that did not see it during training, and these predictions are scored against the true labels. This gives an internal validation metric without needing a separate validation set.

     Advantages of OOB score:
     - No data splitting needed: the full training set is used for both training and validation
     - Nearly unbiased estimate: comparable to cross-validation accuracy, and if anything slightly pessimistic, because each point is predicted by only a subset of the trees
     - Cheap: computed as a byproduct of the bootstrap process, with no extra model fits
     - Variable importance: permutation-based feature importance can be computed on the OOB samples
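
As a rough check of the claim that the OOB score tracks cross-validation, the sketch below fits a Random Forest on the bundled Breast Cancer dataset and compares `oob_score_` with 5-fold cross-validated accuracy. The dataset and hyperparameters are choices made for this illustration; the two numbers should be close but will not match exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# OOB estimate: one fit, each sample scored by trees that never saw it.
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)
oob_accuracy = rf.oob_score_

# Cross-validation estimate: five separate fits on held-out folds.
cv_accuracy = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5
).mean()

print(f"OOB accuracy:       {oob_accuracy:.4f}")
print(f"5-fold CV accuracy: {cv_accuracy:.4f}")
```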



5. Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.
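   - Single Decision Tree
     - Calculation: a feature's importance is the total reduction in impurity (Gini/entropy) over all splits that use the feature, with each split weighted by the proportion of samples reaching its node, then normalized so the importances sum to 1.
     - Bias: impurity-based importance favors features with many unique values, because they offer more candidate split points.
     - Instability: a single tree overfits easily, and small changes in the data can change its splits, so its importances are unreliable.
   - Random Forest
     - Calculation: the same impurity-based importance, averaged over all trees in the forest.
     - Stability: bootstrap sampling plus feature randomness yields more consistent rankings.
     - Smoothing: aggregation averages out the idiosyncrasies of individual trees, although the preference of impurity-based importance for high-cardinality features remains.
     - OOB validation: permutation importance can be computed on OOB samples, and it is less affected by that preference.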
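
A minimal comparison on the Breast Cancer dataset, assuming scikit-learn's impurity-based `feature_importances_` and `sklearn.inspection.permutation_importance`. Note one substitution: scikit-learn's helper permutes features on whatever data it is given, so this sketch uses a held-out test split rather than OOB samples.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0, stratify=data.target
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# A single tree concentrates importance on the few features it split on;
# the forest spreads it over many more features.
tree_used = int(np.sum(tree.feature_importances_ > 0))
forest_used = int(np.sum(forest.feature_importances_ > 0))
print(f"features with non-zero importance: tree={tree_used}, forest={forest_used}")

# Permutation importance of the forest, measured on held-out data.
perm = permutation_importance(forest, X_test, y_test, n_repeats=5, random_state=0)
for idx in np.argsort(perm.importances_mean)[::-1][:5]:
    print(f"{data.feature_names[idx]}: {perm.importances_mean[idx]:.4f}")
```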




In [1]:
#6. Write a Python program to:
#● Load the Breast Cancer dataset using
#sklearn.datasets.load_breast_cancer()
#● Train a Random Forest Classifier
#● Print the top 5 most important features based on feature importance scores.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# 1. Load Breast Cancer dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
feature_names = cancer.feature_names

# 2. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Train Random Forest
rf = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    oob_score=True  # Use OOB for internal validation
)
rf.fit(X_train, y_train)

# 4. Get feature importances with indices
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]  # descending order

# 5. Print top 5 most important features
print("Top 5 Most Important Features (Random Forest):")
print("Accuracy (OOB):", rf.oob_score_)
print()
for i in range(5):
    idx = indices[i]
    print(f"{i+1}. {feature_names[idx]}: {importances[idx]:.4f}")


Top 5 Most Important Features (Random Forest):
Accuracy (OOB): 0.9538461538461539

1. worst area: 0.1400
2. worst concave points: 0.1295
3. worst radius: 0.0977
4. mean concave points: 0.0909
5. worst perimeter: 0.0722


In [4]:
#7. Write a Python program to:
#● Train a Bagging Classifier using Decision Trees on the Iris dataset
#● Evaluate its accuracy and compare with a single Decision Tree
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# -------------------------------
# Train Single Decision Tree
# -------------------------------
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

# Predict and evaluate
dt_pred = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)

# -------------------------------
# Train Bagging Classifier
# -------------------------------
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    random_state=42
)

bagging.fit(X_train, y_train)

# Predict and evaluate
bagging_pred = bagging.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_pred)

# -------------------------------
# Print Results
# -------------------------------
print("Decision Tree Accuracy:", dt_accuracy)
print("Bagging Classifier Accuracy:", bagging_accuracy)


Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0
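
Both models score 1.0 on this particular 45-sample test split, so a single split cannot separate them. A hedged alternative is to compare them with repeated stratified cross-validation; the fold and repeat counts below are arbitrary choices for this sketch, and the base tree is passed positionally so the call works regardless of whether the installed scikit-learn names that argument `estimator` or `base_estimator`.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5 folds x 3 repeats = 15 accuracy estimates per model.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

tree_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=cv)
bag_scores = cross_val_score(
    BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42),
    X, y, cv=cv,
)

print(f"Decision Tree: {tree_scores.mean():.4f} +/- {tree_scores.std():.4f}")
print(f"Bagging:       {bag_scores.mean():.4f} +/- {bag_scores.std():.4f}")
```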


In [5]:
#8. Write a Python program to:
#● Train a Random Forest Classifier
#● Tune hyperparameters max_depth and n_estimators using GridSearchCV
#● Print the best parameters and final accuracy
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define Random Forest model
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 5, 10, 15]
}

# Apply GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy'
)

grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Best Parameters:", grid_search.best_params_)
print("Final Accuracy:", accuracy)


Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy: 1.0


In [7]:
#9. Write a Python program to:
#● Train a Bagging Regressor and a Random Forest Regressor on the California
#Housing dataset
#● Compare their Mean Squared Errors (MSE)
# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# 1. Load California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# 2. Train-test split (larger test set for regression)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Bagging Regressor (DecisionTreeRegressor default)
bag_reg = BaggingRegressor(
    n_estimators=100,
    random_state=42,
    oob_score=True
)
bag_reg.fit(X_train, y_train)
y_pred_bag = bag_reg.predict(X_test)
mse_bag = mean_squared_error(y_test, y_pred_bag)
oob_mse_bag = mean_squared_error(y_train, bag_reg.oob_prediction_)

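# 4. Random Forest Regressor
# Note: in recent scikit-learn versions RandomForestRegressor defaults to
# max_features=1.0 (all features at every split), so with default settings it
# behaves much like bagged trees; pass e.g. max_features="sqrt" to add
# feature randomness.
rf_reg = RandomForestRegressor(
    n_estimators=100,
    random_state=42,
    oob_score=True
)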
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)
oob_mse_rf = mean_squared_error(y_train, rf_reg.oob_prediction_)

# 5. Comparison
print("California Housing Price Prediction (target = median house value)")
print("=" * 60)
print("Bagging Regressor:")
print(f"  Test MSE:   {mse_bag:.4f}")
print(f"  OOB MSE:    {oob_mse_bag:.4f}")
print()
print("Random Forest Regressor:")
print(f"  Test MSE:   {mse_rf:.4f}")
print(f"  OOB MSE:    {oob_mse_rf:.4f}")
print()
# A positive difference means Random Forest has the lower test MSE
mse_diff = mse_bag - mse_rf
print(f"MSE difference (Bagging - RF): {mse_diff:.4f} ({mse_diff / mse_bag * 100:.1f}% of Bagging MSE)")


HTTPError: HTTP Error 403: Forbidden
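
The run above stopped with an HTTP 403 error, and `fetch_california_housing()` is the only call in the cell that downloads anything, so no MSE values were recorded. As a stand-in that needs no network access, the same comparison can be sketched on scikit-learn's bundled diabetes dataset. This is a substitute dataset, not the California Housing data, so the numbers are not comparable. The sketch also sets `max_features=0.5` on the forest (an illustrative value), since recent scikit-learn versions otherwise consider all features at each split and behave much like bagging.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Substitute dataset bundled with scikit-learn (no download required).
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

models = {
    "Bagging": BaggingRegressor(n_estimators=100, random_state=42),
    # max_features=0.5 is an illustrative choice to add feature randomness.
    "Random Forest": RandomForestRegressor(
        n_estimators=100, max_features=0.5, random_state=42
    ),
}

mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse[name] = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: test MSE = {mse[name]:.1f}")
```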

10. You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world
context.
   - Step 1: Choose Between Bagging and Boosting
Start with Boosting (Gradient Boosting/XGBoost) for loan default prediction.
Decision: XGBoost/LightGBM, which are strong and widely used choices for tabular financial data.

Step 2: Handle Overfitting
1. Early stopping: hold out a validation set and stop adding trees when the validation loss stops improving
2. Regularization: L1/L2 penalties (reg_alpha, reg_lambda), max_depth=6, min_child_weight=5
3. Subsampling: subsample=0.8 (rows), colsample_bytree=0.8 (columns)
4. Learning rate: start at 0.1; if validation performance is unstable, lower it and allow more trees
5. Cross-validation: 5-fold stratified K-fold

Monitor: a gap between training and validation scores above about 0.05 is a sign of overfitting.
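
The overfitting controls above can be sketched without XGBoost installed. The example below swaps in scikit-learn's `GradientBoostingClassifier`, whose `n_iter_no_change`, `validation_fraction`, `subsample` and `max_features` parameters play roles similar to early stopping and row/column subsampling. The synthetic imbalanced dataset and all hyperparameter values are assumptions for illustration, not tuned settings for real loan data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for loan data: roughly 10% positives ("defaults").
X, y = make_classification(
    n_samples=3000, n_features=20, n_informative=6,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound; early stopping may use fewer
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,            # row subsampling per tree
    max_features=0.8,         # feature subsampling per split
    validation_fraction=0.2,  # internal hold-out used for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gb.fit(X_train, y_train)

train_auc = roc_auc_score(y_train, gb.predict_proba(X_train)[:, 1])
val_auc = roc_auc_score(y_val, gb.predict_proba(X_val)[:, 1])
print(f"trees kept: {gb.n_estimators_}")
print(f"train AUC: {train_auc:.3f}, validation AUC: {val_auc:.3f}, "
      f"gap: {train_auc - val_auc:.3f}")
```

The printed gap is the quantity the monitoring rule refers to; the 0.05 cut-off is a rule of thumb from the text, not a library default.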

Step 3: Base Model Selection
Primary: XGBoost (Gradient Boosting)
- base_score = proportion of defaults in the training data (assuming default is coded as the positive class), used as the initial prediction
- scale_pos_weight = (number of non-defaults) / (number of defaults), to handle class imbalance

Fallback: Random Forest (Bagging)
- n_estimators=500, max_features='sqrt'
- class_weight='balanced'

Hybrid: stack the XGBoost and Random Forest predictions and feed them to a meta-learner.

Step 4: Cross-Validation Strategy
Stratified 5-fold CV, so that every fold preserves the default/non-default ratio.
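
A minimal sketch of stratified 5-fold evaluation, using the Random Forest fallback with `class_weight='balanced'` on synthetic imbalanced data. The dataset, the smaller `n_estimators` (chosen to keep the example fast) and the two metrics (ROC AUC and average precision) are assumptions for this illustration; the original answer does not name a metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data: roughly 10% positives ("defaults").
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=6,
    weights=[0.9, 0.1], random_state=0,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each test fold keeps (almost exactly) the overall default rate.
fold_rates = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]
print("default rate per fold:", [f"{rate:.3f}" for rate in fold_rates])

rf = RandomForestClassifier(
    n_estimators=100, max_features="sqrt",
    class_weight="balanced", random_state=0,
)
scores = cross_validate(rf, X, y, cv=cv, scoring=["roc_auc", "average_precision"])

for metric in ("roc_auc", "average_precision"):
    fold_scores = scores[f"test_{metric}"]
    print(f"{metric}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```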
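
Step 5: Ensemble Implementation
1. Train XGBoost (main)
2. Train Random Forest (backup)
3. Weighted voting: 0.7*XGB + 0.3*RF on the predicted default probabilities
4. Threshold tuning on the profit curve rather than using the default 0.5, because a missed default typically costs more than a rejected good loan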
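
The blending and threshold steps can be sketched with plain Python. Everything numeric here is hypothetical: the eight predicted probabilities, the outcomes, and the payoff of +100 for a repaid loan versus -500 for a default are made-up values chosen only to show the mechanics.

```python
def blend(p_xgb, p_rf, w_xgb=0.7, w_rf=0.3):
    """Weighted average of two models' predicted default probabilities."""
    return [w_xgb * a + w_rf * b for a, b in zip(p_xgb, p_rf)]


def profit_at_threshold(probs, defaulted, threshold,
                        gain_good=100.0, loss_default=-500.0):
    """Approve every loan whose predicted default probability is below threshold."""
    profit = 0.0
    for p, d in zip(probs, defaulted):
        if p < threshold:
            profit += loss_default if d else gain_good
    return profit


# Hypothetical model outputs and true outcomes (1 = defaulted).
p_xgb = [0.05, 0.10, 0.20, 0.35, 0.60, 0.80, 0.15, 0.45]
p_rf = [0.10, 0.20, 0.10, 0.45, 0.50, 0.70, 0.25, 0.35]
defaulted = [0, 0, 0, 1, 1, 1, 0, 0]

probs = blend(p_xgb, p_rf)
thresholds = [t / 20 for t in range(1, 20)]  # 0.05, 0.10, ..., 0.95
best_threshold = max(
    thresholds, key=lambda t: profit_at_threshold(probs, defaulted, t)
)
best_profit = profit_at_threshold(probs, defaulted, best_threshold)
print(f"best threshold: {best_threshold:.2f}, profit: {best_profit:.0f}")
# prints "best threshold: 0.20, profit: 400"
```

With these made-up payoffs the most profitable cut-off is well below 0.5, since one default wipes out the gain from five good loans. In practice the threshold would be chosen on validation folds, not on the data used for the final evaluation.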
