1. What is Ensemble Learning in machine learning? Explain the key idea
behind it.
   - Ensemble learning is a technique that combines several models to make better predictions than any single model alone. The key idea is that a group of weak or average models, taken together, can be more accurate and more stable than any of its members. Each model in the group, called a “base learner,” may make different kinds of mistakes; when their outputs are combined, for example by voting or averaging, those errors tend to cancel out, provided they are not strongly correlated. Common ensemble methods include Bagging, Boosting, and Stacking.
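
The "errors cancel out" argument can be made concrete with a small calculation. The sketch below assumes an idealized case that real ensembles only approximate: every base learner is correct with the same probability `p` and the learners err independently. The helper name `majority_vote_accuracy` is introduced here for illustration only.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n_models independent
    classifiers, each correct with probability p, is correct.
    Assumes n_models is odd so that ties cannot occur."""
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Accuracy of the vote grows with ensemble size when p > 0.5
for n in (1, 5, 25, 101):
    print(f"{n:>3} models: {majority_vote_accuracy(n, 0.6):.4f}")
```

The gain depends on independence: if all learners make the same mistakes, voting cannot improve on a single model.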

2. What is the difference between Bagging and Boosting?
   - Bagging and Boosting are both ensemble learning methods, but they work in different ways. Bagging builds multiple models independently, each on a different random (bootstrap) sample of the training data, and then combines their results by voting or averaging. This mainly reduces variance, which limits overfitting and makes the model more stable. An example of bagging is the Random Forest algorithm. Boosting, on the other hand, builds models one after another, with each new model focusing on the errors made by the previous ones. The models are combined in a weighted sum (in AdaBoost, more accurate learners receive larger weights), which mainly reduces bias and improves accuracy. However, Boosting can overfit if the data is noisy, because it keeps concentrating on hard, possibly mislabeled, examples. Examples of boosting algorithms include AdaBoost and Gradient Boosting.
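
As a rough illustration of the contrast, the sketch below trains one bagging and one boosting ensemble with scikit-learn defaults. The synthetic dataset and all parameter values are assumptions chosen for the example, not part of the original answer. By default `BaggingClassifier` uses full decision trees trained independently, while `AdaBoostClassifier` uses depth-1 trees (stumps) trained sequentially.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative assumption)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Bagging: independent trees, each fit on its own bootstrap sample
bagging = BaggingClassifier(n_estimators=50, random_state=0)
bagging.fit(X_train, y_train)

# Boosting: stumps fit one after another on reweighted training data
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)
boosting.fit(X_train, y_train)

acc_bagging = bagging.score(X_test, y_test)
acc_boosting = boosting.score(X_test, y_test)
print(f"Bagging test accuracy:  {acc_bagging:.4f}")
print(f"Boosting test accuracy: {acc_boosting:.4f}")
```

Which method scores higher depends on the data, so the numbers from one run should not be read as a general ranking.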

3. What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?
   - Bootstrap sampling is a technique in which new training datasets are created by randomly selecting data points with replacement from the original dataset. This means some samples may appear more than once, while others may not appear at all; on average, a bootstrap sample of the same size as the original contains about 63% of the distinct data points. In Bagging methods like Random Forest, bootstrap sampling is important because it ensures that each decision tree is trained on a slightly different version of the data. As a result, the trees become diverse and make different errors. When their predictions are combined, usually by voting or averaging, the final result becomes more accurate and stable. This helps reduce overfitting and improves the overall performance of the model.
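
A minimal sketch of a single bootstrap draw, using NumPy only. The dataset size and seed are arbitrary assumptions. It shows that sampling with replacement repeats some rows and leaves others out, with the fraction of distinct rows close to 1 - 1/e ≈ 0.632.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000  # size of the (hypothetical) original dataset

# Draw n_samples row indices uniformly at random, with replacement
indices = rng.integers(0, n_samples, size=n_samples)

# Rows that appear at least once form the "in-bag" set
unique_fraction = np.unique(indices).size / n_samples
left_out_fraction = 1 - unique_fraction

print(f"Distinct rows in the bootstrap sample: {unique_fraction:.3f}")
print(f"Rows left out of the sample:           {left_out_fraction:.3f}")
```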

4. What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?
   - Out-of-Bag (OOB) samples are the data points that are not included in a particular bootstrap sample when training models in ensemble methods like Random Forest. Because each tree is trained on a bootstrap sample, about 37% (roughly one-third) of the original data is left out of each sample, and these unused points are that tree's OOB samples. The OOB score uses them to evaluate the model's performance without needing a separate test set. After training, each data point is predicted only by the trees that did not see it during training, and the accuracy of these aggregated predictions is calculated. This gives a reasonable estimate of how well the model performs on unseen data.
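
In scikit-learn, the OOB estimate can be requested with `oob_score=True`. The sketch below uses the Breast Cancer dataset and `n_estimators=200` as illustrative assumptions. With enough trees, almost every point is out-of-bag for at least one tree.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True scores each point using only the trees that did not
# see that point in their bootstrap sample
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

oob_accuracy = rf.oob_score_
print(f"OOB accuracy estimate: {oob_accuracy:.4f}")

# Per-sample class probabilities aggregated over out-of-bag trees only
print("OOB decision function shape:", rf.oob_decision_function_.shape)
```

No train/test split is needed here, because the out-of-bag points play the role of the held-out data.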


5. Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.
    - In a single Decision Tree, feature importance is calculated based on how much each feature helps reduce impurity, such as Gini impurity or entropy, when splitting the data. The more a feature contributes to making accurate splits, the higher its importance. However, the results from a single tree can be unstable because the tree is built on one dataset and may overfit to it. In a Random Forest, feature importance is determined by averaging the importance scores of each feature across many decision trees, each trained on different bootstrap samples and random subsets of features. This makes the results more reliable, balanced, and less affected by noise. Therefore, feature importance in a Random Forest is generally more accurate and stable than in a single Decision Tree.
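
The comparison can be sketched by fitting both models on the same data and inspecting `feature_importances_`. The dataset choice and the helper `top_features` are assumptions made for illustration. A single tree typically assigns nonzero importance to only a handful of features, while the forest spreads importance over many more.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=42).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)


def top_features(importances, names, k=5):
    """Return the k (name, importance) pairs with the highest importance."""
    order = np.argsort(importances)[::-1][:k]
    return [(str(names[i]), round(float(importances[i]), 3)) for i in order]


print("Single tree  :", top_features(tree.feature_importances_, data.feature_names))
print("Random forest:", top_features(forest.feature_importances_, data.feature_names))

# Count how many features each model actually relies on
n_used_by_tree = int(np.sum(tree.feature_importances_ > 0))
n_used_by_forest = int(np.sum(forest.feature_importances_ > 0))
print(f"Features with nonzero importance: tree={n_used_by_tree}, forest={n_used_by_forest}")
```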



In [1]:
# 6.  Write a Python program to:
# ● Load the Breast Cancer dataset using
# sklearn.datasets.load_breast_cancer()
# ● Train a Random Forest Classifier
# ● Print the top 5 most important features based on feature importance scores.

# Import libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Train a Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importance scores
importances = rf.feature_importances_

# Create a DataFrame
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Sort features by importance and get top 5
top_features = feature_importance_df.sort_values(by='Importance', ascending=False).head(5)

# Print the top 5 features
print("Top 5 most important features:")
print(top_features)



Top 5 most important features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


In [12]:
#  7. Write a Python program to:
# ● Train a Bagging Classifier using Decision Trees on the Iris dataset
# ● Evaluate its accuracy and compare with a single Decision Tree

# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
print(f"Accuracy of single Decision Tree: {round(accuracy_dt,4)}")

# Train a Bagging Classifier using Decision Trees
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,  # Number of trees
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
accuracy_bag = accuracy_score(y_test, y_pred_bag)
print(f"Accuracy of Bagging Classifier: {round(accuracy_bag,4)}")


Accuracy of single Decision Tree: 1.0
Accuracy of Bagging Classifier: 1.0


In [5]:
# 8. Write a Python program to:
# ● Train a Random Forest Classifier
# ● Tune hyperparameters max_depth and n_estimators using GridSearchCV
# ● Print the best parameters and final accuracy

# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Random Forest classifier
rf = RandomForestClassifier(random_state=42)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 2, 4, 6, 8]
}

# Use GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Evaluate the final model on the test set
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Final accuracy on test set: {accuracy:.4f}")


Best hyperparameters: {'max_depth': None, 'n_estimators': 100}
Final accuracy on test set: 1.0000


In [10]:
#  9. Write a Python program to:
# ● Train a Bagging Regressor and a Random Forest Regressor on the California
# Housing dataset
# ● Compare their Mean Squared Errors (MSE)

# Import libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Bagging Regressor using Decision Trees
bagging = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bagging = bagging.predict(X_test)
mse_bagging = mean_squared_error(y_test, y_pred_bagging)
print(f"Mean Squared Error of Bagging Regressor: {round(mse_bagging,4)}")

# Train a Random Forest Regressor
rf = RandomForestRegressor(
    n_estimators=50,
    random_state=42
)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)
print(f"Mean Squared Error of Random Forest Regressor: {round(mse_rf,4)}")


Mean Squared Error of Bagging Regressor: 0.2579
Mean Squared Error of Random Forest Regressor: 0.2577


10. You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world
context.
    - To predict loan defaults using ensemble techniques, I would start by choosing between Bagging and Boosting based on the dataset and model behavior. Bagging, as in Random Forest, is useful if individual models such as deep Decision Trees tend to overfit, because averaging predictions over many models reduces variance. Boosting, as in XGBoost or AdaBoost, is better if models underfit, because it sequentially corrects earlier errors to reduce bias. To handle overfitting, I would control model complexity through parameters like tree depth, use regularization, and apply cross-validation to check generalization. Base models would be chosen to suit the method: fully grown Decision Trees are common for Bagging because of their high variance, while shallow trees or other weak learners are ideal for Boosting. Performance would be evaluated using k-fold cross-validation with metrics such as accuracy, precision, recall, F1-score, and AUC-ROC; because defaults are usually much rarer than repayments, accuracy alone can be misleading. Ensemble learning improves decision-making in this real-world context by combining multiple models to produce more stable and accurate predictions, which helps the financial institution identify high-risk customers, manage risk effectively, and make consistent lending decisions while minimizing potential losses.
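
The evaluation step can be sketched with stratified k-fold cross-validation over several metrics at once. This is a sketch under stated assumptions: the data is synthetic, the roughly 10% positive rate is a stand-in for the default rate, and the model settings are illustrative rather than tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data; ~10% positives is an assumption
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10,
    weights=[0.9, 0.1], random_state=42,
)

# Stratified folds keep the class ratio the same in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["precision", "recall", "f1", "roc_auc"]

models = {
    "bagging (random forest)": RandomForestClassifier(n_estimators=100, random_state=42),
    "boosting (gradient boosting)": GradientBoostingClassifier(max_depth=3, random_state=42),
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Average each metric over the five folds
    results[name] = {m: float(scores[f"test_{m}"].mean()) for m in scoring}
    print(name, {m: round(v, 3) for m, v in results[name].items()})
```

Reporting recall and AUC-ROC alongside precision makes it visible when a model looks accurate only because it rarely predicts a default.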


In [8]:
# Import libraries
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Simulate a loan default dataset
X, y = make_classification(n_samples=5000, n_features=20, n_informative=15,
                           n_redundant=5, n_classes=2, random_state=42)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# ----------------- Bagging Classifier -----------------
bagging = BaggingClassifier(
    n_estimators=50,
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
accuracy_bag = accuracy_score(y_test, y_pred_bag)
auc_bag = roc_auc_score(y_test, bagging.predict_proba(X_test)[:,1])

# ----------------- Boosting Classifier -----------------
boosting = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
boosting.fit(X_train, y_train)
y_pred_boost = boosting.predict(X_test)
accuracy_boost = accuracy_score(y_test, y_pred_boost)
auc_boost = roc_auc_score(y_test, boosting.predict_proba(X_test)[:,1])

# ----------------- Cross-Validation -----------------
cv_scores_bag = cross_val_score(bagging, X_train, y_train, cv=5, scoring='accuracy')
cv_scores_boost = cross_val_score(boosting, X_train, y_train, cv=5, scoring='accuracy')

# ----------------- Print Results -----------------
print("Bagging Classifier Accuracy:", accuracy_bag)
print("Bagging Classifier AUC-ROC:", auc_bag)
print("Bagging CV Accuracy:", np.mean(cv_scores_bag))

print("\nBoosting Classifier Accuracy:", accuracy_boost)
print("Boosting Classifier AUC-ROC:", auc_boost)
print("Boosting CV Accuracy:", np.mean(cv_scores_boost))


Bagging Classifier Accuracy: 0.9093333333333333
Bagging Classifier AUC-ROC: 0.9671308523409363
Bagging CV Accuracy: 0.916

Boosting Classifier Accuracy: 0.908
Boosting Classifier AUC-ROC: 0.9670997287804011
Boosting CV Accuracy: 0.9151428571428571
