# Ensemble Learning

## 1. What is Ensemble Learning in machine learning? Explain the key idea behind it.
Ensemble learning is a machine learning technique in which multiple models are combined to solve a problem and improve performance. Instead of relying on a single model, ensemble methods train several models and combine their outputs to make the final prediction. The key idea is that different models make different errors, so combining them reduces overall error (variance, bias, or both) and draws on the strengths of each model.
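A small simulation can make the key idea concrete. The sketch below assumes 11 hypothetical models that are each correct 70% of the time and whose errors are fully independent; real models are correlated, so the gain in practice is smaller. Under these assumptions the majority vote should be right roughly 92% of the time.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_trials, p_correct = 11, 10_000, 0.7  # illustrative values, not from any dataset

# Each row is one prediction task; each column records whether one model was correct.
correct = rng.random((n_trials, n_models)) < p_correct

# Accuracy of a single model (the first column).
single_accuracy = correct[:, 0].mean()

# The majority vote is correct when more than half of the models are correct.
vote_accuracy = (correct.sum(axis=1) > n_models / 2).mean()

print(f"Single model accuracy:  {single_accuracy:.3f}")
print(f"Majority vote accuracy: {vote_accuracy:.3f}")
```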

## 2. What is the difference between Bagging and Boosting?
Bagging (Bootstrap Aggregating) trains multiple models independently, each on a random sample drawn with replacement from the training data. The final output is a majority vote (classification) or an average (regression). Boosting is a sequential method in which each model tries to correct the errors made by the previous ones. Bagging mainly reduces variance, while Boosting mainly reduces bias. Bagging can build its models in parallel; Boosting must build them in sequence.
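The contrast can be sketched with scikit-learn. The choices below are assumptions for illustration: the Breast Cancer dataset, 50 estimators, full-depth trees for Bagging, and depth-1 trees (stumps) for Boosting. Scores will vary with the dataset and settings, so this is not a general ranking of the two methods.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: 50 full-depth trees, each fit independently on its own bootstrap sample.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: 50 depth-1 trees fit one after another,
# each one correcting the errors of the ensemble built so far.
boosting = GradientBoostingClassifier(n_estimators=50, max_depth=1, random_state=0)

bagging_scores = cross_val_score(bagging, X, y, cv=5)
boosting_scores = cross_val_score(boosting, X, y, cv=5)

print(f"Bagging  mean CV accuracy: {bagging_scores.mean():.3f}")
print(f"Boosting mean CV accuracy: {boosting_scores.mean():.3f}")
```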

## 3. What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?
Bootstrap sampling draws random samples from the dataset with replacement, so some records appear more than once in a sample and others do not appear at all. In Bagging methods like Random Forest, each decision tree is trained on its own bootstrap sample. This makes the trees differ from one another, and averaging over diverse trees reduces variance.
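As a minimal sketch, a bootstrap sample can be drawn with NumPy by sampling row indices with replacement. The dataset size of 1000 is an arbitrary assumption. For large datasets, about 63.2% (1 - 1/e) of the distinct rows are expected to appear in each sample.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 1000  # illustrative dataset size
indices = np.arange(n_samples)

# A bootstrap sample has the same size as the data and is drawn with replacement.
bootstrap_idx = rng.choice(indices, size=n_samples, replace=True)

# Count how many distinct rows made it into the sample.
unique_fraction = np.unique(bootstrap_idx).size / n_samples

# Rows drawn more than once are the duplicates mentioned in the text.
_, counts = np.unique(bootstrap_idx, return_counts=True)
repeated_rows = int((counts > 1).sum())

print(f"Fraction of distinct rows in the sample: {unique_fraction:.3f}")
print(f"Rows that appear more than once: {repeated_rows}")
```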

## 4. What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?
OOB samples are the data points not selected in a given bootstrap sample. Because bootstrap samples are drawn with replacement, about 37% of the data (roughly one third) is left out of each sample. These left-out points are called Out-of-Bag samples. The OOB score is the accuracy obtained when each training point is predicted using only the trees that did not see it during training. It acts like a cross-validation score, so performance can be estimated without a separate validation set.
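In scikit-learn, the OOB score is available by passing `oob_score=True` to a Random Forest. The sketch below assumes the Breast Cancer dataset and 200 trees, and holds out a test set only to show that the two estimates tend to be close. They will not match exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# oob_score=True scores each training sample using only the trees
# whose bootstrap sample did not contain that sample.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

oob_accuracy = rf.oob_score_
test_accuracy = rf.score(X_test, y_test)

print(f"OOB accuracy:  {oob_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```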

## 5. Compare feature importance analysis in a single Decision Tree vs. a Random Forest.
In a single decision tree, feature importance is calculated from how much each feature reduces impurity (for example, Gini index or entropy) across the splits that use it. But since it's only one tree, the importance can be unstable: a small change in the data can produce a different tree and a different ranking. In Random Forest, the importance is averaged over many trees, each trained on a different bootstrap sample and considering random feature subsets at each split, which gives more reliable and stable importance values.
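The instability of single-tree importances can be inspected directly, because a fitted scikit-learn forest exposes its individual trees through `estimators_`. The sketch below assumes the Breast Cancer dataset and 100 trees; it prints the forest-level importance next to the tree-to-tree standard deviation for the top features.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(data.data, data.target)

# Importances of each individual tree: shape (n_trees, n_features).
per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])

forest_importance = rf.feature_importances_  # average over the trees
spread = per_tree.std(axis=0)                # how much single trees disagree

# Show the five most important features according to the forest.
top = np.argsort(forest_importance)[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:<22} forest={forest_importance[i]:.3f} "
          f"tree-to-tree std={spread[i]:.3f}")
```

A large standard deviation relative to the mean suggests that any one tree, taken alone, could rank that feature very differently.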


## 6. Write a Python program to:
  - Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
  - Train a Random Forest Classifier
  - Print the top 5 most important features based on feature importance scores.

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Train Random Forest
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Get feature importances
importances = pd.Series(model.feature_importances_, index=X.columns)
top_5 = importances.sort_values(ascending=False).head(5)
print("Top 5 Important Features:")
print(top_5)


Top 5 Important Features:
worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


## 7. Write a Python program to:
  - Train a Bagging Classifier using Decision Trees on the Iris dataset
  - Evaluate its accuracy and compare with a single Decision Tree

In [2]:
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Single Decision Tree
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
tree_acc = accuracy_score(y_test, tree.predict(X_test))

# Bagging Classifier
bagging = BaggingClassifier(DecisionTreeClassifier(), random_state=42)
bagging.fit(X_train, y_train)
bagging_acc = accuracy_score(y_test, bagging.predict(X_test))

print("Decision Tree Accuracy:", tree_acc)
print("Bagging Classifier Accuracy:", bagging_acc)


Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0


## 8. Write a Python program to:
  - Train a Random Forest Classifier
  - Tune hyperparameters max_depth and n_estimators using GridSearchCV
  - Print the best parameters and final accuracy

In [3]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Load data
X, y = load_breast_cancer(return_X_y=True)

# Define model and grid
model = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [3, 5, None]
}

# Grid search
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X, y)

print("Best Parameters:", grid.best_params_)
# best_score_ is the mean cross-validated accuracy of the best parameter combination
print("Best Accuracy:", grid.best_score_)


Best Parameters: {'max_depth': 5, 'n_estimators': 100}
Best Accuracy: 0.9596180717279925


## 9. Write a Python program to:
  - Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
  - Compare their Mean Squared Errors (MSE)

In [4]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)

# Bagging Regressor (defaults: 10 decision trees)
bagging = BaggingRegressor(random_state=42)
bagging.fit(X_train, y_train)
bagging_pred = bagging.predict(X_test)
bagging_mse = mean_squared_error(y_test, bagging_pred)

# Random Forest Regressor (default: 100 trees)
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)

print("Bagging Regressor MSE:", bagging_mse)
print("Random Forest Regressor MSE:", rf_mse)


Bagging Regressor MSE: 0.27872374841230696
Random Forest Regressor MSE: 0.2542358390056568


## 10. You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance. Explain your step-by-step approach to:
  - Choose between Bagging or Boosting
  - Handle overfitting
  - Select base models
  - Evaluate performance using cross-validation
  - Justify how ensemble learning improves decision-making in this real-world context.

In this scenario we are predicting loan default using customer demographic and transaction history.

### Choose between Bagging or Boosting:
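I will choose Boosting (like Gradient Boosting or XGBoost) because it reduces bias and usually performs well on tabular classification problems like loan default. Defaults are typically the minority class, so I would combine Boosting with class weights or resampling, since Boosting alone does not correct class imbalance. A Random Forest remains a useful Bagging baseline, especially if the data is noisy and the boosted model overfits.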

### Handle Overfitting:
I will use techniques like:
  - Limit tree depth (max_depth)
  - Use regularization parameters (like learning_rate)
  - Early stopping
  - Cross-validation to monitor performance
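
The controls listed above map onto parameters of scikit-learn's `GradientBoostingClassifier`. The sketch below is one possible configuration, not a tuned one: the synthetic dataset (about 10% positives) stands in for real loan data, and every parameter value is an assumption chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for loan data: roughly 10% of rows are "defaults".
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound; early stopping may use fewer
    max_depth=3,              # limit tree depth
    learning_rate=0.05,       # shrink each tree's contribution
    subsample=0.8,            # fit each tree on a random 80% of the rows
    validation_fraction=0.2,  # held out internally to monitor early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gb.fit(X_train, y_train)

test_auc = roc_auc_score(y_test, gb.predict_proba(X_test)[:, 1])
print("Trees actually fitted:", gb.n_estimators_)
print(f"Test ROC-AUC: {test_auc:.3f}")
```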

### Select Base Models:
I will use shallow Decision Trees as base learners since they work well as weak learners in ensemble models. I might also try logistic regression as a base learner for Boosting, although trees are the usual choice.

### Evaluate performance using cross-validation:
I will use stratified k-fold cross-validation to maintain the class ratio in each fold. Because accuracy is misleading on imbalanced data, I will report metrics like ROC-AUC, precision, and recall.
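
This evaluation can be sketched with `StratifiedKFold` and `cross_validate`. The synthetic imbalanced dataset and the default `GradientBoostingClassifier` are assumptions standing in for the real loan data and the final model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data: roughly 10% of rows are "defaults".
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# Stratified folds keep the default rate roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = GradientBoostingClassifier(random_state=0)

metrics = ["roc_auc", "precision", "recall"]
results = cross_validate(model, X, y, cv=cv, scoring=metrics)

# Report the mean and spread of each metric across the five folds.
for metric in metrics:
    scores = results[f"test_{metric}"]
    print(f"{metric}: {scores.mean():.3f} +/- {scores.std():.3f}")
```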

### Justify how ensemble learning improves decision-making:
Ensemble models provide more stable and accurate predictions than a single model. In loan default prediction, this can reduce both false positives (reliable customers flagged as risky) and false negatives (likely defaulters approved). This improves risk assessment and helps the bank make better loan approval decisions.