# Ensemble Learning | Assignment

1: What is Ensemble Learning in machine learning? Explain the key idea
behind it.
- Ensemble learning combines the predictions of multiple models to produce a result that is usually more stable and accurate than that of any individual model.

 The key idea is that a single model, such as one decision tree, tends to overfit. Tuning hyperparameters helps control this, but it does not always work well. In ensembling, several models each make a prediction, and these predictions are combined (for example by voting or averaging) into one final prediction, which typically improves accuracy.
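
 As a minimal sketch of the combining step, the snippet below applies majority voting to hand-written predictions. The five "models" and their predictions are hypothetical values made up for illustration, and `majority_vote` is a helper defined here, not a library function.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the individual model predictions."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical predictions from five models for three samples (illustrative only)
model_predictions = [
    [1, 0, 1],  # model A
    [1, 1, 0],  # model B
    [0, 0, 1],  # model C
    [1, 0, 1],  # model D
    [1, 0, 0],  # model E
]

# Combine column-wise: every model casts one vote for each sample
ensemble_prediction = [majority_vote(votes) for votes in zip(*model_predictions)]
print(ensemble_prediction)
```

 No single model agrees with the ensemble on every sample except models A and D, yet the vote still settles on one answer per sample, which is the stabilising effect described above.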

 2: What is the difference between Bagging and Boosting?

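 - Bagging trains all the models in parallel and independently of each other, each on its own random sample of the data, and then combines their predictions. It mainly reduces variance.
 - Boosting trains models sequentially, with each model learning from the errors of the previous one. It mainly reduces bias.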
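
 The contrast can be sketched with scikit-learn as follows. The choice of the Breast Cancer dataset, `GradientBoostingClassifier` as the boosting method, and the values of `n_estimators` and `random_state` are assumptions made for illustration, not part of the question.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# Bagging: every tree is fit independently on its own bootstrap sample
# (the default base estimator of BaggingClassifier is a decision tree)
bagging = BaggingClassifier(n_estimators=50, random_state=0)
bagging.fit(x_train, y_train)

# Boosting: trees are added one after another, each one fit to
# correct the errors of the ensemble built so far
boosting = GradientBoostingClassifier(n_estimators=50, random_state=0)
boosting.fit(x_train, y_train)

bagging_acc = accuracy_score(y_test, bagging.predict(x_test))
boosting_acc = accuracy_score(y_test, boosting.predict(x_test))
print("Bagging accuracy:", bagging_acc)
print("Boosting accuracy:", boosting_acc)
```

 Which of the two scores higher depends on the dataset and settings, so the numbers should be read as a comparison on this one split only.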

 3: What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?

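 - Bootstrap sampling means drawing random samples from the training data with replacement. Each sample is typically the same size as the original data, so some rows appear more than once and others are left out.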

   Each model (tree) is trained on a different bootstrap sample, so the trees become less correlated. This helps reduce overfitting and improves overall accuracy.
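
   A small sketch of drawing one bootstrap sample with NumPy is shown below. The ten-row "dataset" and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_samples = 10
indices = np.arange(n_samples)  # stand-in for the row indices of a dataset

# Draw n_samples indices WITH replacement: some rows repeat, some never appear
bootstrap_idx = rng.choice(indices, size=n_samples, replace=True)

# Rows that were never drawn for this particular sample
left_out = np.setdiff1d(indices, bootstrap_idx)

print("Bootstrap sample:", bootstrap_idx)
print("Rows never drawn:", left_out)
```

   Repeating the draw with a different seed gives a different sample, which is why each tree in a bagged ensemble sees slightly different data.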

4: What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?
 - Out-of-Bag (OOB) samples are the training rows that were not selected in a given tree's bootstrap sample.

   How the OOB score is used:
   Each row is predicted only by the trees that did not see it during training. These predictions are combined and compared with the true labels to calculate the OOB score, which acts like a validation accuracy without needing a separate validation set.
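
   In scikit-learn this can be sketched with the `oob_score=True` option of `RandomForestClassifier`. The dataset and the values of `n_estimators` and `random_state` below are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

x, y = load_breast_cancer(return_X_y=True)

# oob_score=True scores each training row using only the trees
# whose bootstrap sample did not contain that row
model = RandomForestClassifier(
    n_estimators=100, bootstrap=True, oob_score=True, random_state=0
)
model.fit(x, y)

print("OOB score:", model.oob_score_)
```

   Note that the model is fit on all rows here; no explicit train/test split is needed to obtain this estimate.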

5: Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.
 - Decision Tree:
Importance is based on the splits of one tree, so it can be unstable and sensitive to small changes in the data.

 - Random Forest:
Importance is averaged over many trees, so it is more stable and reliable, and less dependent on the particular splits of any one tree.
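
One way to probe this claim is to refit both models on resampled data and measure how much the importances move. The sketch below does that; the number of runs, the resampling scheme, and the use of standard deviation as the "spread" measure are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
x, y = data.data, data.target
rng = np.random.default_rng(0)

tree_runs, forest_runs = [], []
for seed in range(5):
    # Resample the rows to mimic a small change in the training data
    idx = rng.choice(len(x), size=len(x), replace=True)
    tree = DecisionTreeClassifier(random_state=seed).fit(x[idx], y[idx])
    forest = RandomForestClassifier(n_estimators=100, random_state=seed).fit(
        x[idx], y[idx]
    )
    tree_runs.append(tree.feature_importances_)
    forest_runs.append(forest.feature_importances_)

# Standard deviation of each feature's importance across runs, averaged over features
tree_spread = np.std(tree_runs, axis=0).mean()
forest_spread = np.std(forest_runs, axis=0).mean()

print("Average spread, single tree:  ", tree_spread)
print("Average spread, random forest:", forest_spread)
```

A smaller spread for the forest would be consistent with the answer above, though the exact values depend on the seeds and the data.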

In [None]:
# 6: Write a Python program to:
# ● Load the Breast Cancer dataset using
# sklearn.datasets.load_breast_cancer()
# ● Train a Random Forest Classifier
# ● Print the top 5 most important features based on feature importance scores

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

data = load_breast_cancer()
x,y = data.data, data.target

# train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state=32)

# model
model = RandomForestClassifier()
model.fit(x_train,y_train)

# Feature Importance

importance_df = pd.DataFrame({
    "Feature": data.feature_names,
    "Importance": model.feature_importances_
})

# Top 5 important features
top_5 = importance_df.sort_values(by="Importance", ascending=False).head(5)

print("Top 5 most important features:")
print(top_5)


Top 5 most important features:
                Feature  Importance
23           worst area    0.159700
20         worst radius    0.109397
22      worst perimeter    0.087272
2        mean perimeter    0.083987
7   mean concave points    0.080582


In [None]:
# 7: Write a Python program to:
# ● Train a Bagging Classifier using Decision Trees on the Iris dataset
# ● Evaluate its accuracy and compare with a single Decision Tree

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# load_dataset

data = load_iris()
x,y = data.data , data.target

# train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.30,random_state=32)

# model with bagging and decision tree
base_tree = DecisionTreeClassifier()
bagging_model = BaggingClassifier(estimator = base_tree, n_estimators = 15)
bagging_model.fit(x_train,y_train)
print("Accuracy:",accuracy_score(y_test,bagging_model.predict(x_test)))

# model with single decision tree

model = DecisionTreeClassifier()
model.fit(x_train,y_train)
print("Accuracy:",accuracy_score(y_test,model.predict(x_test)))


Accuracy: 0.9777777777777777
Accuracy: 0.9777777777777777


In [None]:
# 8: Write a Python program to:
# ● Train a Random Forest Classifier
# ● Tune hyperparameters max_depth and n_estimators using GridSearchCV
# ● Print the best parameters and final accuracy

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# load_dataset

data = load_iris()
x,y = data.data , data.target

# train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.30,random_state=32)

# model
hyperparam = {
    "max_depth": [2, 3, 4, 5, 6, 7, 10],
    "n_estimators": [5, 10, 20, 30, 50, 70, 100],
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=32),
    param_grid=hyperparam,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)
grid_search.fit(x_train,y_train)

print("Best Parameters:", grid_search.best_params_)
print("Accuracy:", accuracy_score(y_test, grid_search.best_estimator_.predict(x_test)))


Best Parameters: {'max_depth': 6, 'n_estimators': 5}
Accuracy: 1.0


In [None]:
# 9: Write a Python program to:
# ● Train a Bagging Regressor and a Random Forest Regressor on the California
# Housing dataset
# ● Compare their Mean Squared Errors (MSE)

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# load data
data = fetch_california_housing()
x,y = data.data , data.target

#train_test_split

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.30,random_state=32)


#model
bag_model = BaggingRegressor(n_estimators = 100, random_state = 32)
bag_model.fit(x_train,y_train)
print("MSE:",mean_squared_error(y_test,bag_model.predict(x_test)))

# model 2

rf_model = RandomForestRegressor(n_estimators = 100, random_state = 34)
rf_model.fit(x_train,y_train)
print("MSE:",mean_squared_error(y_test,rf_model.predict(x_test)))

MSE: 0.25357605123436444
MSE: 0.2575844635105253


10: You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:

● Choose between Bagging or Boosting

● Handle overfitting

● Select base models

● Evaluate performance using cross-validation

● Justify how ensemble learning improves decision-making in this real-world
context.


- I would choose bagging over boosting because the models can be trained in parallel on the transaction history data, and there is no need for each model to learn sequentially from the previous one.

- To control overfitting, tune hyperparameters such as max_depth, min_samples_split, and min_samples_leaf.

- A decision tree is a suitable base model because the data is noisy and non-linear, and decision trees handle such data well.

- Split the data into K parts

  Train on K-1 parts, test on 1 part (repeat for all parts)

  Check ROC-AUC, Precision, and Recall

  Compare single Decision Tree vs Bagged model

- Training many models and combining their predictions into one means we do not depend on a single model, so the result is more stable and accurate.
In finance, even a small improvement in accuracy can have a large financial impact, which makes ensembles highly valuable.