# Theoretical Questions: Bagging, Random Forest & Ensemble Techniques

## Question 1: Can we use Bagging for regression problems?
Answer:  
Yes, Bagging can be used for regression problems. In regression, Bagging aggregates predictions from multiple base regressors (like Decision Trees) using averaging to reduce variance and improve accuracy.

## Question 2: What is the difference between multiple model training and single model training?
Answer:  
- Single model training: One model is trained on the entire dataset.  
- Multiple model training (ensemble): Multiple models are trained on different subsets or variations of the data and combined to improve overall performance and robustness.

## Question 3: Explain the concept of feature randomness in Random Forest.
Answer:  
Random Forest selects a random subset of features at each split of a tree, which introduces feature randomness. This reduces correlation between trees, improving model generalization.
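
As a rough illustration of the per-split feature sampling described above, the sketch below draws a fresh random subset of feature indices for each split. The helper name `candidate_features` and the square-root subset size are assumptions made for this example; the square-root rule is a common choice for classification, not something the answer above prescribes.

```python
import math
import random

def candidate_features(n_features, rng):
    """Return the feature indices that one split is allowed to consider."""
    k = max(1, int(math.sqrt(n_features)))  # assumed subset size: square-root rule
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)
# Each split draws its own random subset of the 30 available features.
for split in range(3):
    print(f"split {split}: {candidate_features(30, rng)}")
```

Because every split sees only a few candidate features, a single dominant feature cannot be chosen at the top of every tree, which is what lowers the correlation between trees.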

## Question 4: What is OOB (Out-of-Bag) Score?
Answer:  
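The OOB score is an internal validation estimate available in Bagging and Random Forest. Each tree is trained on a bootstrap sample, which on average leaves out about 37% (roughly one third) of the training rows; these are that tree's out-of-bag (OOB) samples. Each row is then predicted using only the trees that did not see it during training, which gives an estimate of model performance without a separate test set.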
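
The "roughly one third" figure can be checked with a small simulation. The sketch below uses only the standard library; the function name `oob_fraction` and the sample sizes are illustrative assumptions.

```python
import random

def oob_fraction(n_samples, rng):
    """Fraction of rows that never appear in one bootstrap sample."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples

rng = random.Random(42)
fractions = [oob_fraction(1000, rng) for _ in range(200)]
mean_oob = sum(fractions) / len(fractions)
# For large n the expected value approaches 1 - 1/e, about 0.368.
print(f"average OOB fraction: {mean_oob:.3f}")
```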

## Question 5: How can you measure the importance of features in a Random Forest model?
Answer:  
Feature importance can be measured using:  
- Mean decrease in impurity (Gini or MSE reduction)  
- Permutation importance (impact on model performance when feature values are shuffled)
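
The two measures can be compared side by side. This is a minimal sketch assuming scikit-learn is installed; the dataset, split, and `n_repeats=10` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Mean decrease in impurity: computed from the fitted trees (training data only).
mdi = rf.feature_importances_

# Permutation importance: drop in test score when one column is shuffled.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

# Show the five features with the largest permutation importance.
for i in perm.importances_mean.argsort()[::-1][:5]:
    name = data.feature_names[i]
    print(f"{name:<25} MDI={mdi[i]:.3f}  permutation={perm.importances_mean[i]:.3f}")
```

The two rankings need not agree: impurity-based scores are derived from training data, while permutation importance here is measured on held-out data.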

## Question 6: Explain the working principle of a Bagging Classifier.
Answer:  
- Bagging creates multiple bootstrap samples from the training data.  
- A base classifier (e.g., Decision Tree) is trained on each sample.  
- Predictions are combined via majority voting (classification) or averaging (regression) to reduce variance.
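
The three steps above can be sketched from scratch. This is a toy illustration rather than how any library implements Bagging: the base learner is a hypothetical one-feature threshold "stump", and the data are synthetic.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Pick the threshold on x that misclassifies the fewest training points."""
    xs = sorted({x for x, _ in sample})
    # Candidate thresholds: midpoints between neighbouring distinct x values.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])] or [xs[0]]

    def errors(t):
        # The stump predicts class 1 when x > t.
        return sum((x > t) != bool(label) for x, label in sample)

    return min(candidates, key=errors)

def bagging_fit(data, n_estimators, rng):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    return [fit_stump(rng.choices(data, k=len(data))) for _ in range(n_estimators)]

def bagging_predict(thresholds, x):
    """Majority vote over the stumps' class predictions."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
# Toy 1-D data: class 0 clusters near 2, class 1 clusters near 6.
data = [(rng.gauss(2, 1), 0) for _ in range(50)] + [(rng.gauss(6, 1), 1) for _ in range(50)]
model = bagging_fit(data, n_estimators=25, rng=rng)
print(bagging_predict(model, 1.0), bagging_predict(model, 7.0))
```

An odd number of estimators avoids tied votes in the two-class case.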

## Question 7: How do you evaluate a Bagging Classifier's performance?
Answer:  
- Accuracy, Precision, Recall, F1-score (classification)  
- OOB score for internal validation  
- Confusion matrix for detailed evaluation  

## Question 8: How does a Bagging Regressor work?
Answer:  
- Multiple regressors are trained on different bootstrap samples.  
- Predictions from each regressor are averaged to produce the final output, reducing variance.
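
The variance reduction from averaging can be seen in a small simulation. The sketch assumes the base regressors' errors are independent, which is optimistic: real bagged models are trained on overlapping data and are correlated, so the reduction in practice is smaller. `TRUE_VALUE` and `NOISE_SD` are made-up numbers.

```python
import random
import statistics

rng = random.Random(0)
TRUE_VALUE = 10.0  # hypothetical target value
NOISE_SD = 2.0     # assumed spread of a single regressor's error

def noisy_prediction():
    """One base regressor: unbiased, but with high variance."""
    return TRUE_VALUE + rng.gauss(0, NOISE_SD)

def bagged_prediction(n_estimators):
    """Average of several independent noisy predictions."""
    return statistics.fmean([noisy_prediction() for _ in range(n_estimators)])

single = [noisy_prediction() for _ in range(2000)]
bagged = [bagged_prediction(25) for _ in range(2000)]
var_single = statistics.variance(single)
var_bagged = statistics.variance(bagged)
print(f"single-model variance:     {var_single:.2f}")
print(f"25-model average variance: {var_bagged:.2f}")
```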

## Question 9: What is the main advantage of ensemble techniques?
Answer:  
By combining multiple models, ensemble techniques can reduce variance (as in Bagging), reduce bias (as in Boosting), or both, which typically makes them more robust and accurate than a single model.

## Question 10: What is the main challenge of ensemble methods?
Answer:  
- Increased computational cost and memory usage  
- Harder to interpret compared to single models

## Question 11: Explain the key idea behind ensemble techniques.
Answer:  
The key idea is the "wisdom of the crowd": when the individual models are reasonably accurate and make different errors, combining them usually gives better predictions than relying on any single model.
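
A small calculation shows why combining helps. The sketch assumes an idealized setting in which every model is correct with the same probability `p` and errors are fully independent, which real ensembles only approximate; `majority_accuracy` is a name chosen for this example.

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that a majority of n independent voters is right (n odd)."""
    need = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

# Accuracy of the majority vote as the number of 60%-accurate models grows.
for n in (1, 3, 11, 51):
    print(n, round(majority_accuracy(0.6, n), 3))
```

With three such models the majority is right with probability 3(0.6)²(0.4) + (0.6)³ = 0.648, already above the 0.6 of a single model, and the gain keeps growing with more voters.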

## Question 12: What is a Random Forest Classifier?
Answer:  
A Random Forest Classifier is an ensemble of Decision Trees built with Bagging plus feature randomness. Each tree votes for a class, and the forest predicts the class with the most votes.

## Question 13: What are the main types of ensemble techniques?
Answer:  
1. Bagging (Bootstrap Aggregating)  
2. Boosting (e.g., AdaBoost, XGBoost)  
3. Stacking  
4. Voting ensembles  

## Question 14: What is ensemble learning in machine learning?
Answer:  
Ensemble learning combines predictions from multiple models to improve generalization, reduce errors, and enhance robustness.

## Question 15: When should we avoid using ensemble methods?
Answer:  
- Very small datasets, where the extra complexity rarely pays off and flexible ensembles can overfit  
- When interpretability is crucial  
- Limited computational resources  

## Question 16: How does Bagging help in reducing overfitting?
Answer:  
By training multiple models on bootstrap samples and averaging their predictions, Bagging reduces model variance, thus lowering overfitting.

## Question 17: Why is Random Forest better than a single Decision Tree?
Answer:  
Random Forest is more robust because it reduces overfitting and variance by averaging predictions from multiple trees with feature randomness.
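
The variance argument can be made concrete with a standard identity. Assume, as a simplification, that the predictions $T_1(x), \dots, T_B(x)$ of the $B$ trees are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$ (symbols introduced here for illustration only). The variance of their average is then:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

Adding more trees shrinks only the second term, while feature randomness lowers $\rho$ and therefore shrinks the first.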

## Question 18: What is the role of bootstrap sampling in Bagging?
Answer:  
Bootstrap sampling draws random samples of the training data with replacement, so each base model sees a different version of the data. This makes the models less correlated with one another, which is what lets aggregation reduce variance.

## Question 19: What are some real-world applications of ensemble techniques?
Answer:  
- Fraud detection  
- Customer churn prediction  
- Stock price forecasting  
- Medical diagnosis  
- Credit scoring  

## Question 20: What is the difference between Bagging and Boosting?
Answer:  
| Feature   | Bagging                                    | Boosting                                                      |
|-----------|--------------------------------------------|---------------------------------------------------------------|
| Approach  | Parallel training (models are independent) | Sequential training (each model depends on the previous ones) |
| Goal      | Reduce variance                            | Primarily reduce bias (can also reduce variance)              |
| Weighting | Equal weights for models                   | Later models focus on the mistakes of earlier ones            |
| Examples  | Random Forest                              | AdaBoost, XGBoost                                             |
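
To make "later models focus on mistakes" concrete, here is one reweighting round of the classic discrete AdaBoost for binary labels. It is a sketch only: the function name and the choice of which samples are misclassified are hypothetical, and the weighted error is assumed to lie strictly between 0 and 1.

```python
import math

def adaboost_reweight(weights, is_wrong):
    """One AdaBoost round: return (alpha, new_weights)."""
    # Weighted error of the current weak learner (assumed 0 < err < 1).
    err = sum(w for w, wrong in zip(weights, is_wrong) if wrong)
    # Vote weight of this learner in the final ensemble.
    alpha = 0.5 * math.log((1 - err) / err)
    # Up-weight misclassified samples, down-weight correct ones, then normalize.
    scaled = [w * math.exp(alpha if wrong else -alpha) for w, wrong in zip(weights, is_wrong)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]

weights = [0.1] * 10                   # start from equal sample weights
is_wrong = [True, True] + [False] * 8  # hypothetical: first two samples misclassified
alpha, new_weights = adaboost_reweight(weights, is_wrong)
print(round(alpha, 3), [round(w, 4) for w in new_weights])
```

After the update the misclassified samples carry half of the total weight, so the next learner is pushed toward getting them right. Bagging has no such step: every bootstrap sample is drawn the same way.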


# Practical Exercises: Bagging, Random Forest & Ensemble Models

# Question 21: Train a Bagging Classifier using Decision Trees on a sample dataset and print model accuracy
"""
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging classifier
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
"""

# Question 22: Train a Bagging Regressor using Decision Trees and evaluate using Mean Squared Error (MSE)
"""
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

# Load dataset
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging Regressor
bag_reg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=10, random_state=42)
bag_reg.fit(X_train, y_train)
y_pred = bag_reg.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
"""

# Question 23: Train a Random Forest Classifier on the Breast Cancer dataset and print feature importance scores
"""
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X, y)
print("Feature Importances:", rf_clf.feature_importances_)
"""

# Question 24: Train a Random Forest Regressor and compare its performance with a single Decision Tree
"""
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Random Forest
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
print("Random Forest MSE:", mean_squared_error(y_test, y_pred_rf))

# Single Decision Tree
dt_reg = DecisionTreeRegressor(random_state=42)
dt_reg.fit(X_train, y_train)
y_pred_dt = dt_reg.predict(X_test)
print("Decision Tree MSE:", mean_squared_error(y_test, y_pred_dt))
"""

# Question 25: Compute the Out-of-Bag (OOB) Score for a Random Forest Classifier
"""
rf_clf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_clf_oob.fit(X, y)
print("OOB Score:", rf_clf_oob.oob_score_)
"""

# Question 26: Train a Bagging Classifier using SVM as a base estimator and print accuracy
"""
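from sklearn.svm import SVC

# SVC is a classifier, so train on a classification dataset (Breast Cancer)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)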

bag_svm = BaggingClassifier(SVC(), n_estimators=10, random_state=42)
bag_svm.fit(X_train, y_train)
y_pred = bag_svm.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
"""

# Question 27: Train a Random Forest Classifier with different numbers of trees and compare accuracy
"""
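# Use a classification dataset (Breast Cancer) with a held-out test split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
for n in [10, 50, 100, 200]: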
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    print(f"n_estimators={n}, Accuracy={accuracy_score(y_test, y_pred)}")
"""

# Question 28: Train a Bagging Classifier using Logistic Regression as base estimator and print AUC score
"""
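from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# AUC as computed below needs a binary target, so use the Breast Cancer data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# max_iter is raised to help the solver converge on these unscaled features
bag_lr = BaggingClassifier(LogisticRegression(max_iter=10000), n_estimators=10, random_state=42)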
bag_lr.fit(X_train, y_train)
y_pred_prob = bag_lr.predict_proba(X_test)[:,1]
print("AUC Score:", roc_auc_score(y_test, y_pred_prob))
"""

# Question 29: Train a Random Forest Regressor and analyze feature importance scores
"""
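# Regression needs a continuous target: reload California Housing
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)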
rf_reg.fit(X_train, y_train)
print("Feature Importances:", rf_reg.feature_importances_)
"""

# Question 30: Train an ensemble model using both Bagging and Random Forest and compare accuracy
"""
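# Both models are classifiers, so use a classification dataset (Breast Cancer)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)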
bag_clf.fit(X_train, y_train)
y_pred_bag = bag_clf.predict(X_test)

rf_clf = RandomForestClassifier(n_estimators=50, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)

print("Bagging Accuracy:", accuracy_score(y_test, y_pred_bag))
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
"""


# ------------------------------
# Assignment: Bagging & Random Forest | Questions 31–45
# Colab-Ready Template with 20-Mark Explanations
# ------------------------------

# Import common libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, mean_squared_error, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, precision_recall_curve
from sklearn.ensemble import BaggingClassifier, BaggingRegressor, RandomForestClassifier, RandomForestRegressor, StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris, load_breast_cancer, fetch_california_housing, load_wine
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

# ------------------------------
# Question 31
# ------------------------------
# Train a Random Forest Classifier and tune hyperparameters using GridSearchCV

"""
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf_clf = RandomForestClassifier(random_state=42)
param_grid = {'n_estimators':[50,100,150], 'max_depth':[None,5,10]}
grid_search = GridSearchCV(rf_clf, param_grid, cv=3)
grid_search.fit(X_train, y_train)
y_pred = grid_search.predict(X_test)
print("Best Params:", grid_search.best_params_)
print("Accuracy:", accuracy_score(y_test, y_pred))
"""

# Answer
# GridSearchCV allows tuning hyperparameters like number of trees and depth in Random Forest.
# This optimizes accuracy and generalization on unseen data.

# ------------------------------
# Question 32
# ------------------------------
# Train a Bagging Regressor with different numbers of base estimators and compare performance

"""
from sklearn.tree import DecisionTreeRegressor

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
for n in [5,10,20]:
    # The base estimator must be a regressor because the target is continuous
    bag_reg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=n, random_state=42)
    bag_reg.fit(X_train, y_train)
    y_pred = bag_reg.predict(X_test)
    print(f"n_estimators={n} MSE:", mean_squared_error(y_test, y_pred))
"""

# Answer
# Varying the number of base estimators affects variance reduction and predictive stability.
# More estimators generally improve performance but increase computation.

# ------------------------------
# Question 33
# ------------------------------
# Train a Random Forest Classifier and analyze misclassified samples

"""
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred = rf_clf.predict(X_test)
misclassified = X_test[y_test != y_pred]
print("Misclassified samples:", misclassified)
"""

# Answer
# Analyzing misclassified samples helps understand model weaknesses and guide feature engineering or hyperparameter tuning.

# ------------------------------
# Question 34
# ------------------------------
# Train a Bagging Classifier and compare its performance with a single Decision Tree Classifier

"""
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred_bag = bag_clf.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
print("Bagging Accuracy:", accuracy_score(y_test, y_pred_bag))
"""

# Answer
# Bagging reduces variance by averaging multiple trees’ predictions. It usually outperforms a single tree in terms of stability and accuracy.

# ------------------------------
# Question 35
# ------------------------------
# Train a Random Forest Classifier and visualize the confusion matrix

"""
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred = rf_clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix")
plt.show()
"""

# Answer
# The confusion matrix shows model performance per class, helping identify which classes are confused with one another.

# ------------------------------
# Question 36
# ------------------------------
# Train a Stacking Classifier using Decision Trees, SVM, and Logistic Regression, and compare accuracy

"""
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
estimators = [('dt', DecisionTreeClassifier()), ('svc', SVC(probability=True)), ('lr', LogisticRegression())]
stack_clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack_clf.fit(X_train, y_train)
y_pred = stack_clf.predict(X_test)
print("Stacking Classifier Accuracy:", accuracy_score(y_test, y_pred))
"""

# Answer
# Stacking combines predictions from multiple models to improve accuracy and leverage different model strengths.

# ------------------------------
# Question 37
# ------------------------------
# Train a Random Forest Classifier and print the top 5 most important features

"""
data = load_breast_cancer()
X, y = data.data, data.target
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X, y)
importances = rf_clf.feature_importances_
indices = np.argsort(importances)[::-1][:5]
# X is a NumPy array without column names, so take the names from the dataset object
print("Top 5 Features:", data.feature_names[indices].tolist())
"""

# Answer
# Feature importance scores highlight which features contribute most to predictions, useful for model interpretation and feature selection.

# ------------------------------
# Question 38
# ------------------------------
# Train a Bagging Classifier and evaluate performance using Precision, Recall, and F1-score

"""
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
print("Precision:", precision_score(y_test, y_pred, average='macro'))
print("Recall:", recall_score(y_test, y_pred, average='macro'))
print("F1-Score:", f1_score(y_test, y_pred, average='macro'))
"""

# Answer
# Evaluating precision, recall, and F1 ensures balanced performance across classes, especially when class distribution is uneven.

# ------------------------------
# Question 39
# ------------------------------
# Train a Random Forest Classifier and analyze the effect of max depth on accuracy

"""
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
for depth in [None, 2, 3, 4]:
    rf_clf = RandomForestClassifier(max_depth=depth, n_estimators=100, random_state=42)
    rf_clf.fit(X_train, y_train)
    y_pred = rf_clf.predict(X_test)
    print(f"Max Depth={depth} Accuracy:", accuracy_score(y_test, y_pred))
"""

# Answer
# Limiting max_depth helps control overfitting: shallow trees have lower variance but may underfit, while very deep trees may overfit.

# ------------------------------
# Question 40
# ------------------------------
# Train a Bagging Regressor using different base estimators (Decision Tree and KNeighbors) and compare performance

"""
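from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Regressors are required as base estimators because the target is continuous
estimators = [DecisionTreeRegressor(), KNeighborsRegressor()]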
for est in estimators:
    bag_reg = BaggingRegressor(est, n_estimators=10, random_state=42)
    bag_reg.fit(X_train, y_train)
    y_pred = bag_reg.predict(X_test)
    print(f"{type(est).__name__} MSE:", mean_squared_error(y_test, y_pred))
"""

# Answer
# Using different base estimators shows how the choice of model affects Bagging. Bagging tends to help high-variance learners such as unpruned Decision Trees more than stable ones such as k-nearest neighbors.

# ------------------------------
# Question 41
# ------------------------------
# Train a Random Forest Classifier and evaluate performance using ROC-AUC Score

"""
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
y_proba = rf_clf.predict_proba(X_test)[:,1]
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba))
"""

# Answer
# ROC-AUC quantifies the model's ability to distinguish between classes; higher values indicate better discriminative power.

# ------------------------------
# Question 42
# ------------------------------
# Train a Bagging Classifier and evaluate performance using cross-validation

"""
X, y = load_iris(return_X_y=True)
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=42)
scores = cross_val_score(bag_clf, X, y, cv=5)
print("Cross-Validation Accuracy Scores:", scores)
print("Mean CV Accuracy:", np.mean(scores))
"""

# Answer
# Cross-validation estimates how well the model generalizes and how stable its score is across multiple splits of the dataset.

# ------------------------------
# Question 43
# ------------------------------
# Train a Random Forest Classifier and plot the Precision-Recall curve

"""
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
y_proba = rf_clf.predict_proba(X_test)[:,1]
precision, recall, _ = precision_recall_curve(y_test, y_proba)
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.show()
"""

# Answer
# The precision-recall curve is useful for evaluating models on imbalanced datasets because it shows the trade-off between precision and recall as the decision threshold changes.

# ------------------------------
# Question 44
# ------------------------------
# Train a Stacking Classifier with Random Forest and Logistic Regression and compare accuracy

"""
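X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
estimators = [('rf', RandomForestClassifier(n_estimators=50, random_state=42)), ('lr', LogisticRegression(max_iter=1000))]
stack_clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack_clf.fit(X_train, y_train)
# Score on held-out data; accuracy measured on the training set would be optimistic
y_pred = stack_clf.predict(X_test)
print("Stacking Accuracy:", accuracy_score(y_test, y_pred))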
"""

# Answer
# Stacking trains a meta-learner on the base models' predictions, so it can draw on the strengths of both Random Forest and Logistic Regression. An accuracy gain is common but not guaranteed.

# ------------------------------
# Question 45
# ------------------------------
# Train a Bagging Regressor with different levels of bootstrap samples and compare performance

"""
from sklearn.tree import DecisionTreeRegressor

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
for bs in [0.5, 0.7, 1.0]:
    # max_samples is the fraction of training rows drawn for each bootstrap sample
    bag_reg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=10, bootstrap=True, max_samples=bs, random_state=42)
    bag_reg.fit(X_train, y_train)
    y_pred = bag_reg.predict(X_test)
    print(f"max_samples={bs} MSE:", mean_squared_error(y_test, y_pred))
"""

# Answer
# max_samples sets the fraction of the training set drawn for each bootstrap sample. Smaller fractions make the base estimators more diverse (less correlated) but give each one less data to learn from.

