# Ensemble Learning

1. What is Ensemble Learning in machine learning? Explain the key idea behind it.

- Ensemble Learning is a machine learning technique in which multiple models (called base learners) are trained and combined to solve the same problem. Instead of relying on a single model, ensemble learning combines the predictions of several models to produce a more accurate and reliable result.

The key idea behind ensemble learning is that a group of weak or moderately accurate models, when combined, can often perform better than each individual model. This works when the models make different, ideally weakly correlated, errors: averaging or voting lets those errors partly cancel, which reduces variance and, with methods such as boosting, bias as well.
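As a rough illustration of why combining helps, the sketch below computes how often a majority vote is correct when every model is right with the same probability. It assumes the models err independently, which is an idealization (real models are usually correlated, so the gain is smaller), and the helper name `majority_vote_accuracy` is made up for this example.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes every model is correct with probability p_correct and that
    errors are independent. Use an odd n_models to avoid ties.
    """
    majority = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )


# Accuracy of the vote as the ensemble grows, with each model 60% accurate
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the vote becomes more accurate as models are added, provided each model is better than random guessing.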

2. What is the difference between Bagging and Boosting?

- Bagging and Boosting are two popular ensemble learning techniques used to improve model performance by combining multiple models.

Bagging (Bootstrap Aggregating) trains several models independently, each on a different bootstrap sample (a random sample of the training data drawn with replacement). The final prediction is made by averaging (regression) or majority voting (classification). Bagging mainly reduces variance, which helps limit overfitting. Random Forest is a common example of Bagging; it also selects a random subset of features at each split.

Boosting, on the other hand, trains models sequentially, with each new model concentrating on the errors of the models before it. AdaBoost does this by increasing the weights of misclassified samples, while Gradient Boosting fits each new model to the residual errors (the gradient of the loss) of the current ensemble. This mainly reduces bias and improves overall accuracy, although boosting can overfit noisy data if it runs for too many rounds.

In summary, Bagging builds independent models to reduce variance, while Boosting builds dependent models to improve accuracy by learning from errors.
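A minimal side-by-side sketch of the two approaches, using scikit-learn's `BaggingClassifier` and `AdaBoostClassifier` with their default base learners. The dataset (Breast Cancer) and the settings are assumptions chosen for illustration, so the scores say little about which method is better in general.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: 50 decision trees (the default base learner), each trained
# independently on its own bootstrap sample, combined by voting
bagging = BaggingClassifier(n_estimators=50, random_state=42)

# Boosting: 50 depth-1 trees (the AdaBoost default), trained one after
# another with misclassified samples reweighted at each round
boosting = AdaBoostClassifier(n_estimators=50, random_state=42)

bagging_acc = cross_val_score(bagging, X, y, cv=5).mean()
boosting_acc = cross_val_score(boosting, X, y, cv=5).mean()

print("Bagging CV accuracy:", bagging_acc)
print("Boosting CV accuracy:", boosting_acc)
```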


3. What is bootstrap sampling, and what role does it play in Bagging methods like Random Forest?

- Bootstrap sampling is a technique where multiple datasets are created from the original dataset by randomly sampling with replacement. This means some data points may appear multiple times, while others may not appear at all in a sample.

In Bagging methods like Random Forest:

- Each decision tree is trained on a different bootstrap sample.

- This creates diversity among the trees.

- Diversity reduces the variance of the combined prediction, which helps limit overfitting and improves generalization.
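Bootstrap sampling itself can be sketched with the standard library alone. The example below draws one bootstrap sample of row indices and checks which rows were never selected; the dataset size and the seed are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable
n = 1000                 # hypothetical number of training rows
indices = list(range(n))

# Draw n indices WITH replacement: some repeat, some never appear
bootstrap = rng.choices(indices, k=n)

in_bag = set(bootstrap)
left_out = [i for i in indices if i not in in_bag]
left_out_fraction = len(left_out) / n

print("Unique rows in the sample:", len(in_bag))
print("Rows never selected:", len(left_out))
print("Fraction left out:", left_out_fraction)
```

In a bagging ensemble, each tree would be trained on the rows in its own `bootstrap` list, so every tree sees a slightly different version of the data.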


4. What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

- Out-of-Bag (OOB) samples are the data points that are not selected in a bootstrap sample for training a particular model.

In Random Forest:

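- Each bootstrap sample leaves out about 36.8% (roughly 1/e) of the training data, so each tree has its own set of OOB samples.

- Each training sample is predicted using only the trees that did not see it during training.

- The OOB score is the accuracy (or R-squared for regression) of these aggregated OOB predictions.

The OOB score gives a reasonable estimate of generalization performance without needing a separate validation dataset, though it can be slightly pessimistic because each prediction uses only a subset of the trees.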
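In scikit-learn, the OOB estimate is available by passing `oob_score=True` to the forest and reading `oob_score_` after fitting. The sketch below also computes the expected left-out fraction, (1 - 1/n)^n, for comparison; the dataset and the number of trees are assumptions made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
n_samples = X.shape[0]

# Probability that a given row is missed by all n draws of one bootstrap sample
expected_oob_fraction = (1 - 1 / n_samples) ** n_samples

# oob_score=True makes the forest score each row using only the trees
# whose bootstrap sample did not contain that row
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

print("Expected OOB fraction per tree:", expected_oob_fraction)
print("OOB accuracy:", rf.oob_score_)
```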

5. Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

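- Single Decision Tree:

Feature importance is based on how much each feature reduces impurity (for example, Gini or entropy) across the splits that use it.

Results can be unstable: a small change in the training data can change the splits and therefore the ranking.

The tree may overfit, so its importances can reflect noise in the data.

- Random Forest:

Feature importance is the impurity-based importance averaged across many trees.

Averaging makes the ranking more stable and reliable.

It is less sensitive to noise and overfitting, although impurity-based importance can still favor high-cardinality features and split credit among correlated features.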
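One way to probe the stability difference is to refit each model on several resampled copies of the data and measure how much the importances move. The sketch below does this on the Breast Cancer dataset; the helper `importance_matrix`, the number of rounds, and the spread measure (mean per-feature standard deviation) are all choices made for this illustration rather than a standard procedure.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)


def importance_matrix(make_model, n_rounds=5):
    """Fit a fresh model on n_rounds bootstrap resamples; one importance row per fit."""
    rows = []
    for seed in range(n_rounds):
        X_boot, y_boot = resample(X, y, random_state=seed)
        model = make_model().fit(X_boot, y_boot)
        rows.append(model.feature_importances_)
    return np.array(rows)


tree_imp = importance_matrix(lambda: DecisionTreeClassifier(random_state=0))
forest_imp = importance_matrix(
    lambda: RandomForestClassifier(n_estimators=100, random_state=0)
)

# Average, over features, of how much each importance varies between resamples
tree_spread = tree_imp.std(axis=0).mean()
forest_spread = forest_imp.std(axis=0).mean()

print("Single tree importance spread:", tree_spread)
print("Random forest importance spread:", forest_spread)
```

A smaller spread means the ranking changes less when the training data changes; one would typically expect the forest's value to be the smaller of the two.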

In [1]:
# 6. Write a Python program to:
# ● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
# ● Train a Random Forest Classifier
# ● Print the top 5 most important features based on feature importance scores

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train model
rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)

# Feature importance
importance = pd.Series(rf.feature_importances_, index=data.feature_names)
top5 = importance.sort_values(ascending=False).head(5)

print(top5)


worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


In [2]:
# 7. Write a Python program to:
# ● Train a Bagging Classifier using Decision Trees on the Iris dataset
# ● Evaluate its accuracy and compare with a single Decision Tree

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)  # fixed seed for reproducible results
dt.fit(X_train, y_train)
dt_acc = accuracy_score(y_test, dt.predict(X_test))

# Bagging Classifier
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bag.fit(X_train, y_train)
bag_acc = accuracy_score(y_test, bag.predict(X_test))

print("Decision Tree Accuracy:", dt_acc)
print("Bagging Accuracy:", bag_acc)


Decision Tree Accuracy: 1.0
Bagging Accuracy: 1.0


In [3]:
# 8. Write a Python program to:
# ● Train a Random Forest Classifier
# ● Tune hyperparameters max_depth and n_estimators using GridSearchCV
# ● Print the best parameters and final accuracy

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10, 20]
}

rf = RandomForestClassifier(random_state=42)
grid = GridSearchCV(rf, param_grid, cv=5)
grid.fit(X, y)

# best_score_ is the mean cross-validated accuracy of the best parameter combination
print("Best Parameters:", grid.best_params_)
print("Best Accuracy:", grid.best_score_)


Best Parameters: {'max_depth': None, 'n_estimators': 100}
Best Accuracy: 0.9560937742586555


In [4]:
# 9. Write a Python program to:
# ● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
# ● Compare their Mean Squared Errors (MSE)

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

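# Note: by default BaggingRegressor builds 10 trees and RandomForestRegressor builds 100,
# so part of the MSE gap below may come from ensemble size rather than the method itself
bag = BaggingRegressor(random_state=42)
rf = RandomForestRegressor(random_state=42)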

bag.fit(X_train, y_train)
rf.fit(X_train, y_train)

bag_mse = mean_squared_error(y_test, bag.predict(X_test))
rf_mse = mean_squared_error(y_test, rf.predict(X_test))

print("Bagging MSE:", bag_mse)
print("Random Forest MSE:", rf_mse)


Bagging MSE: 0.27872374841230696
Random Forest MSE: 0.2542358390056568


10. Ensemble Learning for Loan Default Prediction

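- Choosing Bagging or Boosting:

Use Boosting (such as Gradient Boosting) when the goal is to maximize prediction accuracy and the data is reasonably clean.

If the data is noisy, prefer Bagging (such as Random Forest), because Boosting can overfit by concentrating on noisy or mislabeled samples.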

- Handling Overfitting:

Use cross-validation.

Limit tree depth.

Use regularization parameters.

- Selecting Base Models:

Decision Trees are preferred as base learners.

They capture non-linear relationships.

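- Evaluating Performance:

Use k-fold cross-validation, stratified so that every fold keeps the same share of defaults.

Evaluate using ROC-AUC, precision, and recall alongside accuracy, because accuracy alone can be misleading when defaults are a small minority of the loans.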

- Justification:
Ensemble learning improves decision-making by:

Reducing risk of wrong predictions

Improving accuracy

Making the model more stable and reliable
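The approach outlined in this answer can be sketched end to end as follows. No loan dataset is given here, so the example generates a synthetic, imbalanced stand-in with `make_classification` (about 10% positives playing the role of defaults); the feature counts and hyperparameter values are assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data: roughly 10% of rows are "defaults" (class 1)
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=8,
    weights=[0.9, 0.1],
    random_state=42,
)

# Boosting with overfitting controls: shallow trees, a modest learning rate,
# and row subsampling as a form of regularization
model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,
    random_state=42,
)

# Stratified k-fold keeps the default rate similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    model, X, y, cv=cv, scoring=["roc_auc", "precision", "recall"]
)

for metric in ("roc_auc", "precision", "recall"):
    print(metric, scores[f"test_{metric}"].mean())
```

With real loan data, the same structure would apply after replacing the synthetic `X` and `y`, and a bagging model such as `RandomForestClassifier` could be swapped in if the labels turn out to be noisy.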