# *Ensemble Learning*

## 1. What is Ensemble Learning in machine learning? Explain the key idea behind it.

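Ensemble learning combines the predictions of multiple models (called base learners) to achieve better accuracy and stability than any single model.
The key idea: "Many weak models together can form a strong model." This works best when the base learners make different errors, so that their mistakes partly cancel out when the predictions are averaged or voted on.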
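
One way to see why this can work is to compute how often a majority vote is correct when each voter is only slightly better than chance. The sketch below assumes binary classification and fully independent errors, which real models only approximate; the function name `majority_vote_accuracy` is illustrative.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes binary classification, an odd number of models, and
    independent errors with the same accuracy for every model.
    """
    majority = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )


# Each model is right only 60% of the time; the vote improves as models are added
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

In practice base learners are correlated, so the gain is smaller than this idealized calculation suggests.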

## 2. What is the difference between Bagging and Boosting?

| Feature  | Bagging                                            | Boosting                                                   |
| -------- | -------------------------------------------------- | ---------------------------------------------------------- |
| Goal     | Mainly reduce variance                             | Mainly reduce bias                                         |
| Training | Models trained independently (can run in parallel) | Models trained sequentially                                |
| Sampling | Bootstrap samples (random, with replacement)       | Each new model focuses on the errors of the previous ones  |
| Examples | Random Forest                                      | AdaBoost, XGBoost                                          |
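
The contrast in the table can be tried out with scikit-learn. This is a minimal sketch on synthetic data (the dataset and its parameters are assumptions made only to keep the example self-contained), using `BaggingClassifier` and `AdaBoostClassifier` with their default base learners.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, used only so the example runs without external files
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# Bagging: 50 decision trees, each fit independently on its own bootstrap sample
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: 50 decision stumps fit one after another, each reweighting earlier mistakes
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

scores = {}
for name, model in [("Bagging", bagging), ("AdaBoost", boosting)]:
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {scores[name]:.3f}")
```

Which of the two scores higher depends on the data, so the numbers from this sketch should not be read as a general ranking.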


## 3. What is bootstrap sampling and what role does it play in Bagging methods like Random Forest? 

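- **Bootstrap sampling** means drawing rows at random, with replacement, from the training set to create a new sample, usually the same size as the original. Some rows appear more than once and others are left out.
- In **Bagging** (for example, Random Forest), each tree is trained on a different bootstrap sample. This makes the trees differ from one another, and averaging their predictions reduces variance and therefore overfitting.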
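
A bootstrap sample can be drawn with the standard library alone. The sketch below uses a list of row indices as a stand-in for a real training set (an assumption for illustration) and shows that only part of the original rows end up in any one sample.

```python
import random

random.seed(42)
n = 10_000
dataset = list(range(n))  # stand-in for the row indices of a training set

# Draw n rows with replacement: some rows repeat, others are never picked
bootstrap = random.choices(dataset, k=n)

unique_fraction = len(set(bootstrap)) / n
print(f"Sample size: {len(bootstrap)}")
# For large n this fraction is close to 1 - 1/e, roughly 0.632
print(f"Fraction of distinct rows: {unique_fraction:.3f}")
```

Because every tree sees a different subset of the rows, and with different repetition counts, the fitted trees are not identical.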

## 4. What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models? 

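OOB samples are the data points that were not selected in a given bootstrap sample (on average about 37% of the training set for each tree).
Each data point is predicted using only the trees that did not see it during training. Aggregating these predictions gives the OOB score, which acts like built-in cross-validation and needs no separate validation set.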
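
In scikit-learn the OOB score is available by passing `oob_score=True` to the forest. The sketch below compares it with 5-fold cross-validation on the Breast Cancer dataset; the choice of 200 trees and 5 folds is an assumption for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True scores each row using only the trees that did not train on it
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)
oob_accuracy = rf.oob_score_

# 5-fold cross-validation on the same data, for comparison
cv_accuracy = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=42), X, y, cv=5
).mean()

print(f"OOB accuracy: {oob_accuracy:.3f}")
print(f"5-fold CV accuracy: {cv_accuracy:.3f}")
```

The two estimates are usually close, and the OOB score comes from a single fit instead of five.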

## 5. Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

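- A single Decision Tree scores each feature by how much its splits reduce impurity (for example, Gini or entropy). The ranking can change noticeably with small changes in the training data, and features the tree never splits on get a score of zero.

- A Random Forest averages these importance scores across many trees, making the ranking more stable and reliable.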
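
The difference can be inspected directly. This sketch, which assumes the Breast Cancer dataset and 100 trees, counts how many features receive non-zero importance in each model and measures how much the importance of a feature varies across the individual trees of the forest.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# A single tree assigns zero importance to every feature it never splits on
tree_used = int(np.count_nonzero(tree.feature_importances_))
forest_used = int(np.count_nonzero(forest.feature_importances_))

# Spread of each feature's importance across the individual trees in the forest
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
spread = per_tree.std(axis=0)

print(f"Features with non-zero importance: tree={tree_used}, forest={forest_used}")
print(f"Largest per-tree standard deviation: {spread.max():.3f}")
```

A large per-tree spread is a reminder that any one tree's ranking is noisy, which is what averaging across the forest smooths out.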

## 6. Write a Python program to:
- Load the Breast Cancer dataset using `sklearn.datasets.load_breast_cancer()`
- Train a Random Forest Classifier
- Print the top 5 most important features based on feature importance scores

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Train model
rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)

# Sort features by importance (descending) and keep the top 5
importances = rf.feature_importances_
indices = importances.argsort()[::-1][:5]

for i in indices:
    print(f"{data.feature_names[i]}: {importances[i]:.4f}")
```

Output:

```
worst area: 0.1394
worst concave points: 0.1322
mean concave points: 0.1070
worst radius: 0.0828
worst perimeter: 0.0808
```


## 7. Write a Python program to:
- Train a Bagging Classifier using Decision Trees on the Iris dataset
- Evaluate its accuracy and compare with a single Decision Tree

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Data split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
acc_dt = accuracy_score(y_test, dt.predict(X_test))

# Bagging with Decision Trees
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)
bag.fit(X_train, y_train)
acc_bag = accuracy_score(y_test, bag.predict(X_test))

print("Decision Tree Accuracy:", acc_dt)
print("Bagging Accuracy:", acc_bag)
```

Output:

```
Decision Tree Accuracy: 1.0
Bagging Accuracy: 1.0
```
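
Both models reach an accuracy of 1.0 on this single 45-row test split, so the split cannot separate them. One possible follow-up, sketched here with 10-fold cross-validation (the fold count is an assumption), is to score every row once and compare the mean and spread of the fold accuracies.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=42
    ),
}

cv_scores = {}
for name, model in models.items():
    # 10-fold CV: every row is used for testing exactly once
    cv_scores[name] = cross_val_score(model, X, y, cv=10)
    print(f"{name}: {cv_scores[name].mean():.3f} +/- {cv_scores[name].std():.3f}")
```

On a dataset as small and clean as Iris the gap between the two is likely to be small either way.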


## 8. Write a Python program to: 
- Train a Random Forest Classifier 
- Tune hyperparameters max_depth and n_estimators using GridSearchCV 
- Print the best parameters and final accuracy 

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the data and hold out 30% for the final evaluation
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Search over tree depth and number of trees with 3-fold cross-validation
param_grid = {'max_depth': [3, 5, 7, None], 'n_estimators': [50, 100, 150]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
grid.fit(X_train, y_train)

# Evaluate the best model (refit on the full training set) on the held-out test set
best_model = grid.best_estimator_
acc = accuracy_score(y_test, best_model.predict(X_test))

print("Best Parameters:", grid.best_params_)
print("Final Accuracy:", acc)
```

Output:

```
Best Parameters: {'max_depth': 7, 'n_estimators': 50}
Final Accuracy: 0.9649122807017544
```


## 9. Write a Python program to: 
- Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset 
- Compare their Mean Squared Errors (MSE)

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Load the data and hold out 25% for testing
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Both ensembles use 50 trees so the comparison is like for like
bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=42)
rf = RandomForestRegressor(n_estimators=50, random_state=42)

bag.fit(X_train, y_train)
rf.fit(X_train, y_train)

# Compare test-set mean squared error
mse_bag = mean_squared_error(y_test, bag.predict(X_test))
mse_rf = mean_squared_error(y_test, rf.predict(X_test))

print("Bagging MSE:", mse_bag)
print("Random Forest MSE:", mse_rf)
```

Output:

```
Bagging MSE: 0.2582477439355284
Random Forest MSE: 0.2576688724828818
```
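
The two MSEs are nearly identical. A likely reason is that scikit-learn's `RandomForestRegressor` considers all features at every split by default, which makes it behave much like bagged decision trees. The sketch below makes the feature subsampling explicit through `max_features`; it uses synthetic data from `make_friedman1` (an assumption, chosen to avoid a dataset download), so its numbers are not comparable to the California Housing results.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only to keep the example self-contained
X, y = make_friedman1(n_samples=1000, n_features=10, noise=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = {
    "Bagging": BaggingRegressor(
        DecisionTreeRegressor(), n_estimators=50, random_state=42
    ),
    # max_features=1.0 means every split may use any feature
    "RF, all features": RandomForestRegressor(
        n_estimators=50, max_features=1.0, random_state=42
    ),
    # max_features=1/3 restricts each split to a random third of the features
    "RF, 1/3 of features": RandomForestRegressor(
        n_estimators=50, max_features=1 / 3, random_state=42
    ),
}

mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse[name] = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: MSE = {mse[name]:.3f}")
```

Whether restricting `max_features` helps depends on the dataset, so it is worth tuning and not something to assume.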


## 10. You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. 

You decide to use ensemble techniques to increase model performance. 

Explain your step-by-step approach to: 
- Choose between Bagging or Boosting 
- Handle overfitting 
- Select base models 
- Evaluate performance using cross-validation 
- Justify how ensemble learning improves decision-making in this real-world context.

1. Choose between Bagging or Boosting:
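    - If the base model overfits or the data is noisy (high variance) → use Bagging (Random Forest) to reduce variance.
    - If the model underfits (high bias) → use Boosting (XGBoost, AdaBoost) to reduce bias. Boosting tends to be more sensitive to noisy labels and outliers, so it needs more careful tuning.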

2. Handle Overfitting:
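    - Limit tree depth, cap the number of estimators (or use early stopping for Boosting), apply regularization such as a lower learning rate or subsampling, and check the result with cross-validation.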

3. Select Base Models:
    - Decision Trees (interpretable, and the usual base learner: deep trees for Bagging, shallow trees for Boosting) or Logistic Regression (stable, and a useful baseline).

4. Evaluate Performance:
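    - Use stratified k-fold cross-validation so that each fold keeps a similar default rate, and report metrics suited to imbalanced classes, such as ROC AUC, F1-score, and precision-recall, instead of accuracy alone.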

5. Justification:
    - Ensemble models combine many weak learners, which improves accuracy and reduces the risk of misclassifying potential defaulters.
    - This helps financial institutions make more reliable lending decisions.
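
The steps above can be sketched end to end. Everything specific here is an assumption for illustration: the data is synthetic with roughly 10% defaulters, the hyperparameters are untuned, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost or AdaBoost to avoid an extra dependency.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data: roughly 10% of rows are "default" (class 1)
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=42,
)

candidates = {
    # Bagging-style model: depth and leaf-size limits curb overfitting
    "Random Forest": RandomForestClassifier(
        n_estimators=100, max_depth=8, min_samples_leaf=5,
        class_weight="balanced", random_state=42,
    ),
    # Boosting model: shallow trees and a small learning rate act as regularization
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=100, max_depth=3, learning_rate=0.05, random_state=42,
    ),
}

# Stratified folds keep the default rate similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["roc_auc", "f1", "average_precision"]

results = {}
for name, model in candidates.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=metrics)
    results[name] = {m: float(scores[f"test_{m}"].mean()) for m in metrics}
    print(name, {m: round(v, 3) for m, v in results[name].items()})
```

With real loan data, the choice between the two candidates would rest on these cross-validated metrics together with the relative cost of missing a defaulter versus rejecting a good customer.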