Question 1: What is Ensemble Learning in machine learning? Explain the key idea
behind it.
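- Ensemble Learning is a technique where multiple models are trained and combined to solve the same problem. Instead of relying on a single model, ensemble methods aggregate the predictions of several models, for example by voting or averaging. The key idea is that different models make different errors, so combining them lets those errors partly cancel out, giving a result that is usually more accurate and generalizes better than any single model.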
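
As a minimal illustration of aggregating predictions, the sketch below applies a majority vote to three sets of binary predictions. The prediction values are made up for the example and do not come from any trained model.

```python
import numpy as np

# Hypothetical class predictions (0 or 1) from three models on five samples.
preds = np.array([
    [1, 0, 1, 1, 0],  # model A
    [1, 1, 0, 1, 0],  # model B
    [0, 0, 1, 1, 1],  # model C
])

# Majority vote: a sample is labelled 1 when more than half of the models say 1.
votes_for_one = preds.sum(axis=0)
ensemble_pred = (votes_for_one > preds.shape[0] / 2).astype(int)

print(ensemble_pred)  # [1 0 1 1 0]
```

Each model is wrong on a different sample here, and the vote is not thrown off by any single disagreement.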

Question 2: What is the difference between Bagging and Boosting?
- Bagging involves training multiple models independently and in parallel on random subsets of the data to reduce variance and prevent overfitting, while Boosting involves training models sequentially, with each new model correcting the errors of the previous ones to reduce bias and increase overall accuracy.
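
The contrast can be sketched with scikit-learn, assuming `BaggingClassifier` as the bagging example and `AdaBoostClassifier` as the boosting example. The dataset and the number of estimators are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: fully grown trees, each fit independently on a bootstrap sample.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: shallow trees (AdaBoost's default is a depth-1 stump) fit one after
# another, with misclassified samples given more weight at each round.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

bagging_scores = cross_val_score(bagging, X, y, cv=5)
boosting_scores = cross_val_score(boosting, X, y, cv=5)
print(f"Bagging  mean CV accuracy: {bagging_scores.mean():.3f}")
print(f"Boosting mean CV accuracy: {boosting_scores.mean():.3f}")
```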

Question 3: What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?
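- Bootstrap sampling is a resampling technique that involves repeatedly drawing samples with replacement from an original dataset to create multiple new datasets. Each new dataset, called a bootstrap sample, has the same size as the original dataset, but because sampling is done with replacement, some observations from the original dataset may appear multiple times in a bootstrap sample, while others may not appear at all. In Bagging methods such as Random Forest, each base model (tree) is trained on its own bootstrap sample. This makes the trees differ from one another, so averaging their predictions reduces variance. The rows left out of a given sample also serve as out-of-bag data for validation.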
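
A minimal NumPy sketch of drawing one bootstrap sample follows. The dataset is just a range of row indices used as a stand-in, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
data = np.arange(n)  # stand-in for the row indices of a dataset

# Draw n rows with replacement: same size as the original, duplicates allowed.
bootstrap_sample = rng.choice(data, size=n, replace=True)

unique_fraction = np.unique(bootstrap_sample).size / n
print(f"Fraction of distinct original rows in the sample: {unique_fraction:.3f}")
# For large n this fraction is expected to be near 1 - 1/e (about 0.632).
```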

Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?
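- Each model in a bagging ensemble is trained on a bootstrap sample, so about 37% of the training rows are left out of any given sample. These left-out rows are that model's Out-of-Bag (OOB) samples. The OOB score is computed by predicting each training row using only the models that did not see it during training, and then scoring those predictions. It acts as a built-in validation estimate of the ensemble's generalization performance, which often removes the need for a separate validation dataset.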
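
In scikit-learn this can be sketched with the `oob_score=True` option of `RandomForestClassifier`. The dataset and the number of trees below are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each training row using only the
# trees whose bootstrap sample did not contain that row.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

print(f"OOB accuracy estimate: {forest.oob_score_:.3f}")
```

No train/test split is needed for this estimate, because every row is scored only by trees that never trained on it.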

Question 5: Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.
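- In a single decision tree, a feature's importance is the total impurity reduction (for example Gini or entropy) from the splits that use it, weighted by the number of samples reaching those splits. It reflects one particular tree, so it can change noticeably with small changes in the training data.
- A random forest builds many decision trees on bootstrapped samples of the data and averages their results. Feature importance is calculated by averaging impurity reduction across all trees, giving a more stable and reliable measure.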
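
The difference can be inspected directly, as in the sketch below. It assumes the breast cancer dataset and default hyperparameters purely for illustration, and it reads the per-tree importances from the forest's `estimators_` attribute.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# A single tree assigns importance only to the features it actually split on.
print("Features with non-zero importance (tree):  ",
      np.count_nonzero(tree.feature_importances_))
print("Features with non-zero importance (forest):",
      np.count_nonzero(forest.feature_importances_))

# Spread of each feature's importance across the individual trees of the forest.
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
print("Largest per-feature std across trees:", per_tree.std(axis=0).max().round(3))
```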

Question 6: Write a Python program to:
- Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
- Train a Random Forest Classifier
- Print the top 5 most important features based on feature importance scores.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

In [3]:
from sklearn.datasets import load_breast_cancer
df_cancer = load_breast_cancer()

In [4]:
X = df_cancer.data
y=df_cancer.target

In [4]:
from sklearn.model_selection import train_test_split
# Note: train_size=0.25 trains on 25% of the rows and tests on the remaining 75%
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.25, random_state=1)

from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=100, criterion='gini', random_state=1)

In [8]:
rf_clf.fit(X_train,y_train)

RandomForestClassifier(random_state=1)


In [10]:
y_pred = rf_clf.predict(X_test)

from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test,y_pred)
acc


0.936768149882904

In [17]:
rf_clf.feature_importances_

array([0.05389652, 0.01595761, 0.07389801, 0.04794027, 0.00295455,
       0.0027431 , 0.0662151 , 0.07248317, 0.00263747, 0.00285544,
       0.01817515, 0.01127383, 0.0106052 , 0.04203811, 0.00210209,
       0.00308731, 0.00328544, 0.00791706, 0.00341975, 0.0005601 ,
       0.11327344, 0.01281708, 0.14110245, 0.13690016, 0.00350552,
       0.01731845, 0.01640819, 0.10246519, 0.01023659, 0.00192766])

In [23]:
print(sorted(rf_clf.feature_importances_)[-5:])


[np.float64(0.07389801030097219), np.float64(0.10246518985384874), np.float64(0.11327343692935356), np.float64(0.136900164219124), np.float64(0.14110245078563585)]
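
The cell above prints only the five largest scores, not the features they belong to. In the importance array shown earlier, those scores sit at indices 22, 23, 20, 27 and 2, which correspond to worst perimeter, worst area, worst radius, worst concave points and mean perimeter in the dataset's `feature_names`. A minimal sketch that prints names and scores together is given below. It fits on the full dataset for brevity, so its scores will differ from the notebook's.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(data.data, data.target)

# argsort is ascending, so reverse it and keep the first five indices.
top5 = np.argsort(forest.feature_importances_)[::-1][:5]
for i in top5:
    print(f"{data.feature_names[i]:<25} {forest.feature_importances_[i]:.4f}")
```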


Question 7: Write a Python program to:
- Train a Bagging Classifier using Decision Trees on the Iris dataset
- Evaluate its accuracy and compare with a single Decision Tree


In [2]:
from sklearn.datasets import load_iris
df_irs = load_iris()

In [5]:
X = df_irs.data
y = df_irs.target
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=1)


In [28]:
from sklearn.tree import DecisionTreeClassifier
single_tree = DecisionTreeClassifier()

single_tree.fit(X_train,y_train)

DecisionTreeClassifier()


In [29]:
y_pred = single_tree.predict(X_test)
acc_single = accuracy_score(y_test,y_pred)

In [30]:
# Random Forest: bagged decision trees with random feature selection at each split
multi_tree = RandomForestClassifier(n_estimators=100, criterion='gini', random_state=1)

multi_tree.fit(X_train,y_train)

RandomForestClassifier(random_state=1)


In [45]:
y_pred_multi = multi_tree.predict(X_test)
acc_multi = accuracy_score(y_test,y_pred_multi)

print(f"Single decision tree accuracy: {acc_single:.4}; random forest accuracy: {acc_multi:.4}")

Single decision tree accuracy: 0.9737; random forest accuracy: 0.9737
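
The cells above use `RandomForestClassifier` as the ensemble. A bagging classifier in the literal sense of the question could be sketched with `BaggingClassifier` wrapping a decision tree, as below. The split and seeds mirror the cells above, but the resulting accuracies are not taken from the notebook and may differ.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

single_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Bagging: 100 decision trees, each trained on a bootstrap sample of X_train.
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=1), n_estimators=100, random_state=1
).fit(X_train, y_train)

acc_tree = accuracy_score(y_test, single_tree.predict(X_test))
acc_bag = accuracy_score(y_test, bagged_trees.predict(X_test))
print(f"Single tree: {acc_tree:.4f}, bagged trees: {acc_bag:.4f}")
```

On a small, easily separable dataset like Iris the two scores are often close, so a tie does not mean bagging has no effect in general.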


Question 8: Write a Python program to:
- Train a Random Forest Classifier
- Tune hyperparameters max_depth and n_estimators using GridSearchCV
- Print the best parameters and final accuracy


In [9]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()

params = {
    'n_estimators' : [10,20,50,100,200,300,500],
    'max_depth' : [1,2,3,5,7,10]
}

model = GridSearchCV(rf, param_grid=params, cv=5, verbose=3)

In [10]:
model.fit(X_train,y_train)

Fitting 5 folds for each of 42 candidates, totalling 210 fits
[CV 1/5] END ......max_depth=1, n_estimators=10;, score=0.913 total time=   0.0s
[CV 2/5] END ......max_depth=1, n_estimators=10;, score=0.913 total time=   0.0s
[CV 3/5] END ......max_depth=1, n_estimators=10;, score=0.727 total time=   0.0s
[CV 4/5] END ......max_depth=1, n_estimators=10;, score=0.818 total time=   0.0s
[CV 5/5] END ......max_depth=1, n_estimators=10;, score=0.773 total time=   0.0s
[CV 1/5] END ......max_depth=1, n_estimators=20;, score=1.000 total time=   0.0s
[CV 2/5] END ......max_depth=1, n_estimators=20;, score=0.957 total time=   0.0s
[CV 3/5] END ......max_depth=1, n_estimators=20;, score=0.682 total time=   0.0s
[CV 4/5] END ......max_depth=1, n_estimators=20;, score=0.682 total time=   0.0s
[CV 5/5] END ......max_depth=1, n_estimators=20;, score=0.727 total time=   0.0s
[CV 1/5] END ......max_depth=1, n_estimators=50;, score=0.696 total time=   0.0s
[CV 2/5] END ......max_depth=1, n_estimators=50

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [1, 2, 3, 5, 7, 10],
                         'n_estimators': [10, 20, 50, 100, 200, 300, 500]},
             verbose=3)


In [12]:
model.best_score_

np.float64(0.9553359683794467)

In [13]:
model.best_params_

{'max_depth': 5, 'n_estimators': 50}
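
`best_score_` is the mean cross-validated accuracy on the training folds, not accuracy on held-out data. A sketch of also reporting a final test accuracy follows. It assumes the Iris split used earlier and a smaller grid to keep it quick, so its best parameters need not match the ones above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# A smaller grid than the one above, to keep the sketch quick to run.
params = {"n_estimators": [10, 50, 100], "max_depth": [2, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid=params, cv=5)
search.fit(X_train, y_train)

# With refit=True (the default), predict() uses the best model refit on X_train.
test_acc = accuracy_score(y_test, search.predict(X_test))
print("Best parameters:", search.best_params_)
print(f"Mean CV accuracy of best parameters: {search.best_score_:.4f}")
print(f"Held-out test accuracy: {test_acc:.4f}")
```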

Question 9: Write a Python program to:
- Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
- Compare their Mean Squared Errors (MSE)

In [26]:
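# Uses scikit-learn's built-in diabetes regression dataset
# (the question statement names California Housing)
from sklearn.datasets import load_diabetes

dbt = load_diabetes()
X = dbt.data
y = dbt.target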

In [28]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)

In [29]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

lr = LinearRegression()
dtr = DecisionTreeRegressor()
svr = SVR()

from sklearn.ensemble import VotingRegressor

# VotingRegressor averages the predictions of the three base regressors
vr = VotingRegressor(estimators=[('lr', lr), ('dtr', dtr), ('svr', svr)])

In [30]:
vr.fit(X_train,y_train)
y_vr_pred = vr.predict(X_test)

from sklearn.metrics import mean_squared_error
mse_vr = mean_squared_error(y_test,y_vr_pred)

In [31]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor()

rfr.fit(X_train,y_train)

RandomForestRegressor()


In [32]:
y_rfr_pred = rfr.predict(X_test)

mse_rfr = mean_squared_error(y_test,y_rfr_pred)

print(f"The mean squared error for the voting regressor is {mse_vr}, while for the random forest regressor it is {mse_rfr}")

The mean squared error for the voting regressor is 3543.722644079226, while for the random forest regressor it is 3701.0169325842694
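
The cells above compare a `VotingRegressor` with a Random Forest on the diabetes dataset. A comparison closer to the question's wording, between `BaggingRegressor` and `RandomForestRegressor`, could be sketched as below. It keeps the diabetes data so that it runs without a download. Swapping in `fetch_california_housing(return_X_y=True)` should work the same way but is not run here.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Both ensembles average 100 trees grown on bootstrap samples of X_train.
bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100, random_state=1)
rfr = RandomForestRegressor(n_estimators=100, random_state=1)

mse = {}
for name, model in [("bagging", bag), ("random forest", rfr)]:
    model.fit(X_train, y_train)
    mse[name] = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: MSE = {mse[name]:.1f}")
```

With `max_features=1.0`, the regressor default shown in the notebook output, a Random Forest considers every feature at each split, so the two models are expected to behave very similarly.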


Question 10: You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
- Choose between Bagging or Boosting
- Handle overfitting
- Select base models
- Evaluate performance using cross-validation
- Justify how ensemble learning improves decision-making in this real-world
context.


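- Choose between Bagging or Boosting: For loan default prediction, I would choose Boosting (XGBoost or LightGBM), because each new model focuses on the errors of the previous ones, which reduces bias and usually gives higher accuracy on tabular data such as demographics and transaction history.
- Handle overfitting: Tune hyperparameters that limit model complexity, mainly tree depth and learning rate, and use early stopping so that training ends once validation performance stops improving.
- Select base models: Use shallow decision trees as the base models.
- Evaluate performance using cross-validation: Use stratified k-fold cross-validation so that every fold keeps the same share of defaults, and report AUC-ROC and PR-AUC. PR-AUC is especially informative when defaults are rare.
- Justify ensemble learning: Combining many models improves predictive accuracy and stability, which reduces financial risk because fewer defaults are missed and fewer reliable customers are wrongly rejected.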