# Ensemble Learning

q.no.1: What is Ensemble Learning in machine learning? Explain the key idea
behind it.


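- Ensemble learning is a general machine learning paradigm in which the goal is to improve predictive performance (higher classification accuracy or more stable regression estimates) by combining the predictions of multiple individual models instead of relying on a single model. The term "ensemble" is borrowed from music and theater, where it refers to a group of performers working together to produce a better result than any individual performer could achieve alone.

Key ideas behind ensemble learning -
1. Diversity: Combining Different Perspectives. Models that make different (ideally uncorrelated) errors can compensate for one another when their predictions are combined.
2. Error Reduction (Bias vs. Variance). Averaging many models mainly reduces variance, while training models sequentially to correct earlier errors mainly reduces bias.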
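
To illustrate why combining diverse models can reduce error, the following minimal sketch computes the accuracy of a majority vote over several models. It assumes, purely for illustration, that every model is correct with the same probability and that the models' errors are fully independent, which real models only approximate. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes each model is right with probability p_correct, independently
    of the others (an idealization). Use an odd n_models to avoid ties.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


for n in (1, 5, 25, 101):
    acc = majority_vote_accuracy(n, 0.6)
    print(f"{n:>3} models -> majority-vote accuracy {acc:.4f}")
```

Under these assumptions the vote becomes more accurate as models are added. When the models' errors are correlated the gain is smaller, which is why diversity matters.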


q.no.2: What is the difference between Bagging and Boosting?

- Bagging -
1) Parallel and independent. Each model is trained independently of the others, so the models can be trained in parallel.
2) All individual models are treated equally (same voting power in the final result).
3) Each model is trained on a random sample drawn with replacement (a bootstrap sample) from the original training data.
4) Reduces variance. Aims to stabilize a high-variance model (like a deep decision tree) by averaging out the noise introduced by data variations.

Boosting -
1) Sequential and dependent. Models are trained one after another, and each new model focuses on the errors made by the models before it.
2) Models are weighted; better-performing models (those with lower error) have greater influence.
3) Each model is trained on the full dataset, but the data points (or targets/residuals) are adjusted to highlight previous errors.
4) Reduces bias. Aims to reduce the systematic errors left by simpler models by forcing successive models to correct them.



q.no.3: What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?

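- Bootstrap sampling (or simply bootstrapping) is a statistical resampling technique that estimates the characteristics of a population by repeatedly drawing samples, with replacement, from a single data sample. Each bootstrap sample has the same size as the original data, so some rows appear several times and others not at all.

Role -
1. Introducing Diversity (Reducing Variance). Each tree is trained on a different bootstrap sample, so the trees make different errors, and averaging their predictions reduces variance.

2. De-correlating the Trees (The Random Forest Advantage). Random Forest combines bootstrap sampling with a random subset of features considered at each split, which makes the trees less correlated than in plain bagging.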
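
As a small illustration of what a bootstrap sample looks like, the sketch below draws row indices with replacement using only the standard library. The helper name `bootstrap_indices` and the sample size are assumptions for the example. For large datasets a bootstrap sample is expected to contain about 63.2% of the distinct rows (1 - 1/e), leaving the rest out.

```python
import random


def bootstrap_indices(n_samples: int, rng: random.Random) -> list[int]:
    """Draw n_samples row indices uniformly at random, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(42)
n_samples = 10_000
sample = bootstrap_indices(n_samples, rng)

# Fraction of distinct original rows that made it into the sample.
unique_fraction = len(set(sample)) / n_samples
print(f"Rows drawn             : {len(sample)}")
print(f"Distinct rows in sample: {unique_fraction:.3f}")
print(f"Rows left out          : {1 - unique_fraction:.3f}")
```

Because every tree receives a different draw, every tree sees a slightly different training set, which is the source of diversity in bagging.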


q.no.4: What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?


- Out-of-Bag (OOB) samples are the data points from the original training set that were not included in the bootstrap sample used to train a specific base estimator (tree).

How the OOB score is used to evaluate models -
To calculate the OOB prediction for a single data point $x_i$:
1. Identify all the trees in the forest for which $x_i$ was an OOB sample (i.e., trees that never saw $x_i$ during training).
2. Have those trees make a prediction for $x_i$.
3. Aggregate these predictions (e.g., take the majority vote for classification or the average for regression).

The OOB score is then calculated by comparing these OOB predictions against the true target values for every data point in the original training set. Because each point is scored only by trees that never saw it, the OOB score acts as a built-in validation estimate that needs no separate hold-out set.
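
In scikit-learn this procedure is available through the `oob_score=True` option of the forest estimators. The following is a minimal sketch on the Breast Cancer dataset; the number of trees and the random seed are arbitrary choices for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks scikit-learn to score each training row using only
# the trees whose bootstrap sample did not contain that row.
forest = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,   # OOB samples only exist when bootstrapping is enabled
    oob_score=True,
    random_state=42,
)
forest.fit(X, y)

# For classifiers, oob_score_ is the accuracy of the OOB predictions.
print(f"OOB accuracy: {forest.oob_score_:.4f}")
# One row per training sample, one column per class (aggregated OOB votes).
print("OOB decision function shape:", forest.oob_decision_function_.shape)
```

Note that the whole dataset is passed to `fit` here, since the OOB estimate itself plays the role of the validation set.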


q.no.5: Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.


- Decision Tree -
1) Basis: Reduction in impurity (Gini or entropy) at each split.
2) Stability: Low. Small changes in the training data can produce a different tree and very different importance scores.
3) Correlated features: Biased. Heavily favors a single feature in a group of highly correlated features, ignoring the others.
4) Interpretability: High. The splits can be traced directly; the root node shows which feature gave the best first split.
5) Use: Useful for initial quick insights, especially for visualization and understanding the raw logic.

Random Forest -
1) Basis: Average reduction in impurity across all trees in the forest.
2) Stability: High. Very stable and robust due to averaging.
3) Correlated features: Fairer, but still a challenge. It is less biased because trees are trained on different data subsets and consider random feature subsets at each split, allowing correlated features to take turns being selected; the importance is then shared among them.
4) Interpretability: Low to moderate. The final score is an average across hundreds of trees, so tracing an individual decision path is impractical.
5) Use: Standard and reliable. Often used to select the final feature set, guide data collection, and support model explainability.
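
The comparison can be made concrete with a short sketch that fits one tree and one forest on the Breast Cancer dataset and prints their impurity-based importances side by side. The dataset choice, the seeds, and the use of the per-tree standard deviation as a rough stability indicator are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances; each vector is normalized to sum to 1.
tree_imp = tree.feature_importances_
forest_imp = forest.feature_importances_

# Spread of each feature's importance across the individual trees of the forest.
per_tree = np.array([est.feature_importances_ for est in forest.estimators_])
forest_std = per_tree.std(axis=0)

print("Features with non-zero importance (single tree):", int((tree_imp > 0).sum()))
print("Features with non-zero importance (forest)     :", int((forest_imp > 0).sum()))

# Top 5 features according to the forest, with the single tree's score alongside.
for idx in np.argsort(forest_imp)[::-1][:5]:
    print(
        f"{data.feature_names[idx]:<25} tree={tree_imp[idx]:.3f} "
        f"forest={forest_imp[idx]:.3f} (std across trees {forest_std[idx]:.3f})"
    )
```

A single tree typically assigns zero importance to many features it never splits on, while the forest spreads importance over more of them.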



q.no.6: Write a Python program to:
● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores

In [2]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

print("Loading Breast Cancer dataset...")
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("\nTraining Random Forest Classifier...")


rf_model = RandomForestClassifier(
    n_estimators=500,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)


rf_model.fit(X_train, y_train)

print("Training complete.")


feature_scores = rf_model.feature_importances_

feature_names = X.columns
importance_df = pd.Series(feature_scores, index=feature_names).sort_values(ascending=False)


top_n = 5
top_features = importance_df.head(top_n)

print(f"Top {top_n} Features by Gini Importance (Random Forest)")

for rank, (feature, score) in enumerate(top_features.items(), 1):

    print(f"[{rank}] {feature:<25} : {score:.4f}")

print(f"Total features considered: {len(feature_names)}")

Loading Breast Cancer dataset...

Training Random Forest Classifier...
Training complete.
Top 5 Features by Gini Importance (Random Forest)
[1] worst area                : 0.1300
[2] worst perimeter           : 0.1280
[3] worst concave points      : 0.1272
[4] worst radius              : 0.0928
[5] mean concave points       : 0.0890
Total features considered: 30


q.no.7: Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree

In [5]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score


print("Loading Iris dataset...")

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,
    stratify=y
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
print("\n[A] Training Single Decision Tree...")
dt_model = DecisionTreeClassifier(
    max_depth=None,
    random_state=4
)

dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)


print("[B] Training Bagging Classifier (Ensemble)...")


bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=None, random_state=42),
    n_estimators=100,
    max_samples=1.0,
    bootstrap=True,
    random_state=42,
    n_jobs=-1
)
bagging_model.fit(X_train, y_train)
bagging_pred = bagging_model.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_pred)



print("     MODEL ACCURACY COMPARISON (IRIS DATASET)")
print(f"1. Single Decision Tree Accuracy : {dt_accuracy:.4f}")
print(f"2. Bagging Classifier Accuracy : {bagging_accuracy:.4f}")


if bagging_accuracy > dt_accuracy:
    print("Conclusion: Bagging successfully improved the model's accuracy, reducing variance.")
elif bagging_accuracy < dt_accuracy:
    print("Conclusion: The single Decision Tree was slightly more accurate on this specific test set.")
else:
    print("Conclusion: Both models achieved the same accuracy.")

Loading Iris dataset...
Training set size: 105 samples
Testing set size: 45 samples

[A] Training Single Decision Tree...
[B] Training Bagging Classifier (Ensemble)...
     MODEL ACCURACY COMPARISON (IRIS DATASET)
1. Single Decision Tree Accuracy : 0.9778
2. Bagging Classifier Accuracy : 0.9333
Conclusion: The single Decision Tree was slightly more accurate on this specific test set.


q.no.8: Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy


In [None]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# --- 1. Load the Dataset ---
print("Loading Breast Cancer dataset...")
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target


X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print(f"Training set size: {X_train.shape[0]} samples")


print("\nSetting up GridSearchCV...")


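# max_depth and n_estimators are the parameters the question asks to tune;
# min_samples_split is searched as an additional, optional parameter.
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
}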


rf_base = RandomForestClassifier(random_state=42)

grid_search = GridSearchCV(
    estimator=rf_base,
    param_grid=param_grid,
    scoring='accuracy',
    cv=5,
    verbose=1,
    n_jobs=-1
)


print("Starting Grid Search...")
grid_search.fit(X_train, y_train)
print("Grid Search complete.")


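# Best hyperparameter combination found by cross-validation
best_params = grid_search.best_params_

# With the default refit=True, best_estimator_ is retrained on the whole training set
best_rf_model = grid_search.best_estimator_

# Evaluate the tuned model on the held-out test set
y_pred = best_rf_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)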


print("  Random Forest Tuning Results (GridSearchCV)")
print(f"Best Parameters Found: {best_params}")
print(f"Best Cross-Validation Score: {grid_search.best_score_:.4f}")
print(f"Final Test Set Accuracy: {final_accuracy:.4f}")


q.no.9: Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
● Compare their Mean Squared Errors (MSE)


In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error


print("Loading California Housing dataset...")
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

print(f"Training set size: {X_train.shape[0]} samples")

print("\n[A] Training Bagging Regressor...")


base_estimator = DecisionTreeRegressor(random_state=42)

bagging_model = BaggingRegressor(
    estimator=base_estimator,
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)

bagging_model.fit(X_train, y_train)
bagging_pred = bagging_model.predict(X_test)
bagging_mse = mean_squared_error(y_test, bagging_pred)
print("[B] Training Random Forest Regressor...")

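# scikit-learn's RandomForestRegressor defaults to max_features=1.0 (all features),
# which makes it behave like bagged trees; a smaller value enables feature randomness.
rf_model = RandomForestRegressor(
    n_estimators=100,
    max_features="sqrt",
    random_state=42,
    n_jobs=-1
)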

rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)

print("\n" + "="*70)
print("       REGRESSOR PERFORMANCE COMPARISON (CALIFORNIA HOUSING)")
print("="*70)
print(f"1. Bagging Regressor MSE      : {bagging_mse:.4f}")
print(f"2. Random Forest Regressor MSE: {rf_mse:.4f}")
print("-" * 70)

if rf_mse < bagging_mse:
    print("Conclusion: Random Forest achieved a lower MSE, demonstrating the benefit of feature randomness.")
elif rf_mse > bagging_mse:
    print("Conclusion: Bagging Regressor performed better on this test set.")
else:
    print("Conclusion: Both models performed equally well.")

q.no.10: You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world
context.


Explanation -
1. Choice Between Bagging or Boosting
Decision: Boosting (specifically, a highly optimized algorithm like XGBoost or CatBoost) is the preferred choice.

Justification:

Goal: Reducing Bias for High Accuracy: Simple models tend to underfit loan default data (high bias) because they struggle to capture the complex, non-linear relationships that lead a customer to default. Boosting's sequential, error-correcting nature is designed to reduce bias and often achieves very high accuracy on tabular data of this kind, which matters when millions of dollars are at stake.

Handling Imbalance: Defaults are rare events, so the classes are imbalanced. Both approaches can handle imbalance, but boosting (when combined with techniques like SMOTE or cost-sensitive learning) is often better at shifting its focus toward correctly predicting the rare "default" cases.


2. Base Model Selection
Choice: Decision Trees are a natural base model for both Bagging and Boosting.
Justification:

Boosting (Preferred): Use shallow decision trees (depth-1 trees are called "stumps"; a max_depth between 3 and 7 is common in gradient boosting). Shallow trees are weak, high-bias learners. Boosting combines many of these weak learners, allowing the ensemble to build a complex, low-bias model without over-relying on the rules of any single tree.

Bagging (Alternative): Use deep, fully grown trees. These are low-bias, high-variance learners, and averaging many of them reduces the variance.


3. Handling Overfitting (Regularization)
Overfitting is a major risk in boosting, as it continuously fits the residuals, which can eventually lead it to model noise.

Learning Rate (Shrinkage): Often the most effective technique. When adding a new tree's prediction to the ensemble, we scale it by a small learning rate. This forces the model to learn slowly, making the training process more robust and preventing the model from becoming too specialized to the training data.

Subsampling: Train each new tree on a random fraction of the training data (e.g., 70% of rows). This introduces randomness (similar to Bagging) and prevents trees from becoming overly correlated, stabilizing the model.

Column Subsampling (Feature Randomness): Train each tree using only a random subset of the features. This is especially useful for high-dimensional financial data to increase diversity and prevent reliance on any single feature.

Max Depth Constraint: Limiting the max_depth of the base trees, as mentioned above, acts as a primary form of regularization, preventing any single tree from becoming too complex.
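
The regularization settings described above, together with the cross-validated evaluation the question asks about, could be sketched as follows. This is only an illustration under stated assumptions: the data is a synthetic, imbalanced stand-in for real loan records, scikit-learn's `GradientBoostingClassifier` is used in place of XGBoost or CatBoost, ROC AUC is chosen as one reasonable metric for an imbalanced problem, and every hyperparameter value is an example rather than a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for loan data (about 10% "default" cases).
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=6,
    weights=[0.9, 0.1],
    random_state=42,
)

model = GradientBoostingClassifier(
    learning_rate=0.05,   # shrinkage: each tree contributes only a small step
    n_estimators=100,
    max_depth=3,          # shallow trees as weak learners
    subsample=0.7,        # row subsampling: each tree sees 70% of the rows
    max_features="sqrt",  # feature subsampling (applied per split in scikit-learn)
    random_state=42,
)

# Stratified folds keep the default rate roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print("ROC AUC per fold:", scores.round(4))
print(f"Mean ROC AUC: {scores.mean():.4f} (+/- {scores.std():.4f})")
```

A large gap between training and cross-validated scores, or a high variation across folds, would suggest that the regularization settings need to be tightened.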


