# Ensemble Techniques

# Questions and Answers

Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.
- Ensemble Learning is a machine learning technique where multiple models are trained and their outputs are combined to make a final prediction. Instead of relying on a single model, ensemble methods aggregate the predictions of several models to improve accuracy, stability, and robustness.

- The key idea behind ensemble learning is that a group of diverse, relatively weak models can, when combined, often produce better results than any single one of them alone. By reducing bias, variance, or both, ensemble learning leads to improved generalization on unseen data.
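
The claim that many weak models can beat a single one can be checked with a small calculation. The sketch below assumes an idealized setting that real ensembles only approximate: every model is correct with the same probability `p`, and their errors are independent. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n_models independent models is
    correct, when each one is correct with probability p (use an odd n_models)."""
    majority = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )


# Each individual model is only 60% accurate (an assumed value).
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

With `p = 0.6`, five voters already reach about 0.683, and the accuracy keeps rising as more voters are added. In practice the models' errors are correlated, so the gain is smaller, which is why diversity among the models matters.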

Question 2: What is the difference between Bagging and Boosting?
- Bagging (Bootstrap Aggregating) trains multiple models independently on different random samples of the dataset and then combines their predictions, usually by averaging or voting. Its main goal is to reduce variance and thereby limit overfitting.

- Boosting, on the other hand, trains models sequentially, where each new model focuses on correcting the errors made by the previous ones. Boosting mainly aims to reduce bias (and can also reduce variance), and it gives more weight to difficult or misclassified data points.
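
To make "gives more weight to misclassified points" concrete, here is a minimal sketch of a single reweighting round in the style of discrete AdaBoost for binary labels. AdaBoost is only one boosting algorithm, chosen here as an assumption for illustration; the helper name `adaboost_reweight` is hypothetical, and the sketch assumes the weak learner's weighted error lies strictly between 0 and 0.5.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One AdaBoost-style round: return (alpha, new_weights).

    weights       -- current sample weights, summing to 1
    misclassified -- booleans, True where the current weak learner was wrong
    """
    # Weighted error of the current weak learner (assumed to be in (0, 0.5)).
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    # The learner's vote strength grows as its error shrinks.
    alpha = 0.5 * math.log((1 - error) / error)
    # Increase weights of misclassified points, decrease the others.
    updated = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.2] * 5
misclassified = [False, False, False, False, True]
alpha, new_weights = adaboost_reweight(weights, misclassified)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After this round the single misclassified point carries half of the total weight (0.5), while each correctly classified point drops to 0.125, so the next weak learner is pushed to get that point right. Bagging has no such step: every model is trained without looking at the others' mistakes.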

Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

- Bootstrap sampling is a technique where multiple training datasets are created by randomly sampling the original dataset with replacement. This means some data points may appear multiple times in a sample, while others may not appear at all.

- In Bagging methods like Random Forest, bootstrap sampling ensures diversity among individual trees. Each tree is trained on a different bootstrap sample, which helps reduce variance and improves the overall performance of the ensemble.
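
Sampling with replacement can be sketched in a few lines of standard-library Python. The toy dataset of ten row indices and the helper name `bootstrap_sample` are assumptions for illustration; a library such as scikit-learn does the equivalent resampling internally for each tree.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return rng.choices(data, k=len(data))


rng = random.Random(42)   # fixed seed so the run is repeatable
data = list(range(10))    # toy "dataset" of 10 row indices

for tree_id in range(3):
    sample = bootstrap_sample(data, rng)
    left_out = sorted(set(data) - set(sample))
    print(f"tree {tree_id}: sample={sorted(sample)} left out={left_out}")
```

Each simulated tree sees a sample of the same size as the original data, typically with some indices repeated and others missing, which is what makes the trees differ from one another.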

Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

- Out-of-Bag (OOB) samples are the data points that are not selected in a particular bootstrap sample. On average, about 37% of the data is left out of the bootstrap sample for each tree.

- The OOB score is calculated by predicting each training sample using only the trees that did not see it during training, and then scoring those predictions against the true labels. It provides a nearly unbiased estimate of model performance without needing a separate validation dataset.
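
The 37% figure follows from the sampling itself: a given row is missed by one draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The short sketch below compares that formula with a simulation; the sample size, the number of repetitions, and the helper name `oob_fraction` are arbitrary choices for illustration.

```python
import math
import random


def oob_fraction(n_samples, rng):
    """Fraction of rows never drawn in one bootstrap sample of size n_samples."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples


n = 10_000
rng = random.Random(0)

# Average the out-of-bag fraction over 20 simulated bootstrap samples.
simulated = sum(oob_fraction(n, rng) for _ in range(20)) / 20
# Probability that a given row is never picked in n draws with replacement.
exact = (1 - 1 / n) ** n

print(f"simulated={simulated:.4f} exact={exact:.4f} limit 1/e={math.exp(-1):.4f}")
```

In scikit-learn, passing `oob_score=True` to `RandomForestClassifier` makes the fitted model expose this evaluation as the `oob_score_` attribute.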

Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.
- In a single Decision Tree, feature importance is based on how much each feature reduces impurity at each split. However, this can be unstable because small changes in data may lead to very different trees.

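- In a Random Forest, feature importance is averaged across many trees, each trained on a different bootstrap sample with a random subset of features considered at each split. This averaging makes the estimate more stable and less dependent on the particular splits of any one tree, although impurity-based importance can still favor features with many distinct values.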
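
One way to see the stability difference is to refit both models on several resampled versions of the same data and measure how much each feature's importance moves. This is a rough sketch under assumed settings (5 resamples, 50 trees per forest, the Breast Cancer dataset); it illustrates the comparison rather than proving it.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.RandomState(0)

tree_imps, forest_imps = [], []
for seed in range(5):
    # Resample rows with replacement to mimic "a small change in the data".
    idx = rng.randint(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx])
    forest = RandomForestClassifier(n_estimators=50, random_state=seed).fit(X[idx], y[idx])
    tree_imps.append(tree.feature_importances_)
    forest_imps.append(forest.feature_importances_)

tree_imps = np.array(tree_imps)      # shape: (5 resamples, 30 features)
forest_imps = np.array(forest_imps)  # same shape

# Spread of each feature's importance across resamples, averaged over features.
print("mean std, single tree:  ", tree_imps.std(axis=0).mean())
print("mean std, random forest:", forest_imps.std(axis=0).mean())
```

One would expect the forest's spread to be the smaller of the two, since each forest importance is already an average over many trees.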

In [1]:
# Question 6: Write a Python program to:
# ● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
# ● Train a Random Forest Classifier
# ● Print the top 5 most important features based on feature importance scores

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train Random Forest
rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)

# Feature importance
feature_importance = pd.DataFrame({
    'Feature': data.feature_names,
    'Importance': rf.feature_importances_
})

# Top 5 features
top_5 = feature_importance.sort_values(by='Importance', ascending=False).head(5)
print(top_5)


                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


In [3]:
# Question 7: Write a Python program to:
# ● Train a Bagging Classifier using Decision Trees on the Iris dataset
# ● Evaluate its accuracy and compare with a single Decision Tree

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
data = load_iris()
X, y = data.data, data.target

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
dt_acc = accuracy_score(y_test, dt_pred)

# Bagging Classifier
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # `estimator` replaces `base_estimator` (scikit-learn >= 1.2)
    n_estimators=50,
    random_state=42
)
bag.fit(X_train, y_train)
bag_pred = bag.predict(X_test)
bag_acc = accuracy_score(y_test, bag_pred)

print("Decision Tree Accuracy:", dt_acc)
print("Bagging Classifier Accuracy:", bag_acc)


Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0


In [4]:
# Question 8: Write a Python program to:
# ● Train a Random Forest Classifier
# ● Tune hyperparameters max_depth and n_estimators using GridSearchCV
# ● Print the best parameters and final accuracy

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load data
data = load_iris()
X, y = data.data, data.target

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Model
rf = RandomForestClassifier(random_state=42)

# Grid Search
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 5, 10]
}

grid = GridSearchCV(rf, param_grid, cv=5)
grid.fit(X_train, y_train)

# Best model
best_model = grid.best_estimator_
pred = best_model.predict(X_test)

print("Best Parameters:", grid.best_params_)
print("Final Accuracy:", accuracy_score(y_test, pred))


Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy: 1.0


In [6]:
# Question 9: Write a Python program to:
# ● Load the California Housing dataset
# ● Train a Bagging Regressor and a Random Forest Regressor on it
# ● Compare their Mean Squared Errors (MSE)

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Bagging Regressor
bag = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # `estimator` replaces `base_estimator` (scikit-learn >= 1.2)
    n_estimators=50,
    random_state=42
)
bag.fit(X_train, y_train)
bag_pred = bag.predict(X_test)

# Random Forest Regressor
rf = RandomForestRegressor(n_estimators=50, random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

print("Bagging MSE:", mean_squared_error(y_test, bag_pred))
print("Random Forest MSE:", mean_squared_error(y_test, rf_pred))


Bagging MSE: 0.25787382250585034
Random Forest MSE: 0.25772464361712627


Question 10: Ensemble Learning for Loan Default Prediction

- Choosing Bagging or Boosting:
Boosting would be preferred because loan default prediction involves complex patterns and the cost of misclassification is high. Each boosting round concentrates on the cases that earlier models got wrong.

- Handling Overfitting:
Overfitting can be handled using cross-validation, limiting tree depth, regularization, and early stopping in boosting algorithms.

- Selecting Base Models:
Decision Trees are chosen as base learners because they handle non-linear relationships and mixed data types effectively.

- Evaluating Performance:
K-fold cross-validation is used along with metrics such as ROC-AUC, precision, recall, and F1-score to ensure robust evaluation.

- Justification of Ensemble Learning:
Ensemble learning improves decision-making by combining multiple perspectives, reducing individual model errors, and increasing predictive reliability. This leads to better risk assessment and more informed loan approval decisions.
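
The approach outlined above can be sketched with scikit-learn. Everything about the data here is an assumption: no loan dataset is given, so `make_classification` generates a synthetic stand-in with roughly 10% positive ("default") cases. The hyperparameter values are illustrative starting points, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data: about 10% of rows are "defaults" (class 1).
X, y = make_classification(
    n_samples=2000, n_features=15, n_informative=6,
    weights=[0.9, 0.1], random_state=42,
)

model = GradientBoostingClassifier(
    n_estimators=300,         # upper bound; early stopping may use fewer
    learning_rate=0.05,       # shrinkage acts as regularization
    max_depth=3,              # shallow decision trees as base learners
    validation_fraction=0.1,  # held-out share used for early stopping
    n_iter_no_change=10,      # stop when the validation score stalls for 10 rounds
    random_state=42,
)

# Stratified folds keep the default rate similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    model, X, y, cv=cv,
    scoring=["roc_auc", "precision", "recall", "f1"],
)

for metric in ("roc_auc", "precision", "recall", "f1"):
    print(f"{metric}: {scores['test_' + metric].mean():.3f}")
```

Reporting several metrics side by side matters here because accuracy alone can look high on imbalanced data even when most defaults are missed.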