**1. What is Ensemble Learning in machine learning? Explain the key idea behind it.**  

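Ensemble Learning combines the predictions of several models to get a result that is more accurate and more robust than any single model. Key idea: individual models make partly different errors, so aggregating them (by voting or averaging) cancels much of that error. Several weak learners together can therefore form a stronger model that overfits less and generalizes better. Common methods: Bagging, Boosting, Stacking.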
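
To make the key idea concrete, here is a minimal sketch that computes how often a majority vote is correct. It assumes a binary task and fully independent models that are each right with the same probability; the function name `majority_vote_accuracy` is hypothetical. Real ensembles have correlated errors, so the gain in practice is smaller than this idealized calculation suggests.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct."""
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Each model alone is right 60% of the time (illustrative value).
for n in (1, 5, 25, 101):
    print(f"{n:>3} models -> {majority_vote_accuracy(n, 0.6):.3f}")
```

Under these assumptions the vote becomes more reliable as models are added, which is the intuition behind combining weak learners.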

---

**2. What is the difference between Bagging and Boosting?**  

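Bagging trains multiple models independently (so they can run in parallel) on random bootstrap samples and combines them by voting or averaging. Boosting trains models sequentially: each new model concentrates on the samples its predecessors got wrong, for example by giving more weight to misclassified samples.

Bagging mainly reduces variance and overfitting, whereas Boosting mainly reduces bias and improves accuracy, at the cost of being more sensitive to noisy data.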
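
The "more weight to misclassified samples" step can be sketched with one reweighting round in the style of AdaBoost. AdaBoost is only one boosting algorithm and is used here as an illustrative assumption; the function name `adaboost_reweight` and the toy labels are hypothetical. The sketch assumes labels in {-1, +1} and a weighted error strictly between 0 and 1.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost-style round for labels in {-1, +1}: returns (alpha, new_weights)."""
    # Weighted error of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / sum(weights)
    # Vote strength of this learner; larger when its weighted error is smaller.
    alpha = 0.5 * math.log((1 - err) / err)
    # Up-weight mistakes, down-weight correct predictions, then renormalise.
    raw = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [w / total for w in raw]


y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]  # the weak learner misclassifies the sample at index 1
weights = [0.2] * 5          # boosting starts from uniform weights

alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print(f"alpha = {alpha:.3f}")
print("new weights:", [round(w, 3) for w in new_weights])
```

After the update the single misclassified sample carries half of the total weight, so the next learner is pushed to get it right. Bagging has no such step: every model sees an independently drawn sample.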

---

**3. What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**  

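Bootstrap sampling draws rows from the training set at random with replacement, usually as many rows as the original set, so some rows repeat and roughly 37% are left out of each sample. Each tree in Bagging trains on its own bootstrap sample, which introduces diversity among the trees and reduces the variance of their combined prediction. Random Forest additionally considers only a random subset of features at each split, which decorrelates the trees further.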
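
A minimal sketch of drawing one bootstrap sample with the standard library is shown below. The helper name `bootstrap_sample` and the stand-in data (row indices 0 to 9999) are hypothetical. With a sample as large as the original set, about 63% of the distinct rows are expected to appear, and the rest are left out.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return rng.choices(data, k=len(data))


rng = random.Random(42)
data = list(range(10_000))  # stand-in for row indices of a training set
sample = bootstrap_sample(data, rng)

unique_fraction = len(set(sample)) / len(data)
print(f"sample size: {len(sample)}")
print(f"fraction of distinct original rows in the sample: {unique_fraction:.3f}")
```

Repeating this with a different draw for every tree is what gives each tree a slightly different view of the data.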

---

**4. What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?**  

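OOB samples are the training rows that a given bootstrap sample leaves out. Each row is predicted using only the models that did not see it during training, and the OOB score is the accuracy of those predictions (scikit-learn reports R² for regressors). It estimates generalization performance without a separate test set.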
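
The OOB procedure can be sketched without any library. The example below is a toy under stated assumptions: the base learner is a hypothetical one-feature decision stump (`fit_stump`), the data is a synthetic 1-D problem, and the names `oob_accuracy` and `votes` are made up for illustration. Each point is scored only by the stumps whose bootstrap sample did not contain it.

```python
import random


def fit_stump(xs, ys):
    """Return the threshold t with the fewest training errors for: predict 1 if x > t."""
    values = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or [values[0]]

    def errors(t):
        return sum((x > t) != bool(y) for x, y in zip(xs, ys))

    return min(candidates, key=errors)


def oob_accuracy(xs, ys, n_estimators=25, seed=0):
    """Bag stumps and score each point using only stumps that never saw it."""
    rng = random.Random(seed)
    n = len(xs)
    votes = [[] for _ in range(n)]  # OOB predictions collected per point
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of indices
        t = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:  # point i is out-of-bag for this stump
                votes[i].append(int(xs[i] > t))
    # Points that were in every bootstrap sample have no OOB votes and are skipped.
    scored = [(v, y) for v, y in zip(votes, ys) if v]
    correct = sum((sum(v) * 2 > len(v)) == bool(y) for v, y in scored)
    return correct / len(scored)


xs = list(range(40))
ys = [int(x >= 20) for x in xs]  # label is 1 for the upper half
print(f"OOB accuracy: {oob_accuracy(xs, ys):.3f}")
```

In scikit-learn the same estimate is available by passing `oob_score=True` to a bagging or random forest estimator and reading `oob_score_` after fitting.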

---

**5. Compare feature importance analysis in a single Decision Tree vs. a Random Forest.**  

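Decision Tree: Importance is the total impurity reduction (e.g., Gini) contributed by each feature's splits. It is unstable, because a small change in the data can change the splits and therefore the ranking.

Random Forest: Importance is averaged across many trees trained on different bootstrap samples and feature subsets, so it is more robust and reliable.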
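
The difference in stability can be checked with a short experiment. This is a sketch under assumptions: it reuses the Breast Cancer dataset from the practical questions, and the number of resamples (5), the seeds, and the variable names `tree_top` and `per_tree` are illustrative choices. It fits a single tree on several bootstrap resamples and records its top feature, then shows the mean and spread of importances across the trees of one forest.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target
rng = np.random.default_rng(0)

# Top-ranked feature of a single tree fitted on five different bootstrap resamples.
tree_top = []
for _ in range(5):
    idx = rng.integers(0, len(X), size=len(X))  # resample rows with replacement
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    tree_top.append(str(data.feature_names[np.argmax(tree.feature_importances_)]))
print("Single-tree top feature per resample:", tree_top)

# The forest reports importances aggregated over its trees; the spread across
# trees shows how much an individual tree's importance can vary.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
per_tree = np.array([t.feature_importances_ for t in rf.estimators_])
for i in np.argsort(rf.feature_importances_)[::-1][:3]:
    print(f"{data.feature_names[i]:<22} forest={rf.feature_importances_[i]:.3f} "
          f"std across trees={per_tree[:, i].std():.3f}")
```

If the single-tree top feature changes between resamples while the forest ranking stays similar across seeds, that is the instability described above.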

---

**10. You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance. Explain your step-by-step approach.**  

Step-by-step approach:

Choose Bagging or Boosting:

- If the data is noisy → Bagging (reduces variance).
- If the data is clean with complex patterns → Boosting (reduces bias, improves accuracy).

Handle Overfitting:

- Use regularization (e.g., max_depth, min_samples_leaf).
- Limit the number of estimators in Boosting.
- Use cross-validation to monitor performance.

Select Base Models:

- Decision Trees are the most common base learners.
- Logistic Regression or SVM can also be used with Bagging.

Evaluate Performance:

- Use k-fold cross-validation.
- Metrics: Accuracy, Precision, Recall, F1-score, ROC-AUC. Loan defaults are typically a minority class, so Accuracy alone can be misleading; Recall and ROC-AUC deserve more attention.

Justification:

- Ensemble learning combines multiple models → reduces errors and variance.
- This improves robustness and reliability in predicting loan defaults → better decision-making.
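
The steps above can be sketched in code. This is only an illustration under assumptions: no real loan data is available here, so `make_classification` generates a synthetic stand-in with about 10% positives to mimic rare defaults, and the two models and all hyperparameter values are illustrative choices rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for loan data (assumption): ~10% positives mimic rare defaults.
X, y = make_classification(
    n_samples=1000, n_features=12, n_informative=6,
    weights=[0.9, 0.1], random_state=42,
)

models = {
    # Bagging-style ensemble; depth and leaf-size limits act as regularization.
    "random_forest": RandomForestClassifier(
        n_estimators=50, max_depth=6, min_samples_leaf=5, random_state=42
    ),
    # Boosting ensemble; a limited number of shallow trees curbs overfitting.
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=50, max_depth=3, learning_rate=0.1, random_state=42
    ),
}

# Stratified folds keep the default rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    results[name] = scores.mean()
    print(f"{name}: ROC-AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```

ROC-AUC is used for scoring because it is less affected by class imbalance than Accuracy; on real data the same loop could also report Recall or F1-score by changing the `scoring` argument.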

---

# Practical Questions

In [1]:
# 6: Write a Python program to:
# ● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
# ● Train a Random Forest Classifier
# ● Print the top 5 most important features based on feature importance scores

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load data
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Feature importances
importances = pd.Series(rf.feature_importances_, index=feature_names)
top5 = importances.sort_values(ascending=False).head(5)
print("Top 5 important features:")
print(top5)


Top 5 important features:
worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


In [5]:
# 7: Write a Python program to:
# ● Train a Bagging Classifier using Decision Trees on the Iris dataset
# ● Evaluate its accuracy and compare with a single Decision Tree

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
acc_dt = accuracy_score(y_test, y_pred_dt)

# Bagging Classifier (the `estimator` argument needs scikit-learn >= 1.2; older versions call it `base_estimator`)
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bag.fit(X_train, y_train)
y_pred_bag = bag.predict(X_test)
acc_bag = accuracy_score(y_test, y_pred_bag)

print(f"Decision Tree Accuracy: {acc_dt:.3f}")
print(f"Bagging Accuracy: {acc_bag:.3f}")

Decision Tree Accuracy: 1.000
Bagging Accuracy: 1.000


In [7]:
# 8. Write a Python program to:
# ● Train a Random Forest Classifier
# ● Tune hyperparameters max_depth and n_estimators using GridSearchCV
# ● Print the best parameters and final accuracy

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

# Hyperparameter tuning
param_grid = {'n_estimators': [50,100,150], 'max_depth': [2,3,4,None]}
rf = RandomForestClassifier(random_state=42)
grid = GridSearchCV(rf, param_grid, cv=5)
grid.fit(X_train, y_train)

# Evaluate
best_rf = grid.best_estimator_
y_pred = best_rf.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid.best_params_)
print(f"Accuracy: {acc:.3f}")

Best Parameters: {'max_depth': 2, 'n_estimators': 150}
Accuracy: 1.000


In [9]:
# 9: Write a Python program to:
# ● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
# ● Compare their Mean Squared Errors (MSE)

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load data
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

# Bagging Regressor (the `estimator` argument needs scikit-learn >= 1.2; older versions call it `base_estimator`)
bag_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42
)
bag_reg.fit(X_train, y_train)
y_pred_bag = bag_reg.predict(X_test)
mse_bag = mean_squared_error(y_test, y_pred_bag)

# Random Forest Regressor (note: 100 trees here vs. 50 for Bagging, so the comparison is not like-for-like)
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

print(f"Bagging Regressor MSE: {mse_bag:.3f}")
print(f"Random Forest MSE: {mse_rf:.3f}")

Bagging Regressor MSE: 0.258
Random Forest MSE: 0.257
