1.Can we use Bagging for regression problems?

 - Yes. Bagging (Bootstrap Aggregating) is a general ensemble technique applicable to both classification and regression problems.
 - For regression, instead of taking a majority vote (like in classification), the predictions from the individual base regressors (e.g., Decision Trees) are typically averaged to produce the final ensemble prediction.

2.What is the difference between multiple model training and single model training?
- Single Model Training: Involves training one individual model (e.g., a single Decision Tree, Logistic Regression model, or SVM) on the entire training dataset. The goal is to find the best parameters for that one model to capture the patterns in the data. Its performance relies solely on the capability and suitability of that specific model type for the given data.
- Multiple Model Training (Ensemble Methods): Involves training multiple models, often of the same type (homogeneous ensemble) but sometimes of different types (heterogeneous ensemble). These models are typically trained on different subsets of the data or with different configurations. Their individual predictions are then combined (e.g., by voting or averaging) to produce a final prediction. The idea is that the combined output of multiple models is often more accurate and more robust than that of any single model.

3.Explain the concept of feature randomness in Random Forest

- Random Forest builds upon Bagging (which uses bootstrap sampling of data).
- In addition to sampling data points for each tree, Random Forest introduces randomness in feature selection during the tree building process.
- How it works: When splitting a node in a decision tree within the Random Forest, the algorithm doesn't consider all available features to find the best split. Instead, it randomly selects a subset of features. Common heuristics are sqrt(n_features) for classification and n_features / 3 for regression; note that scikit-learn's RandomForestRegressor considers all features by default (max_features=1.0), so the subset size has to be set explicitly there.
- Only the features in this random subset are evaluated to find the best split point for that node.
- Purpose: This feature randomness further decorrelates the trees in the forest. If one or a few features are very strong predictors, Bagging alone might still produce similar trees because those features would likely be selected near the root of most trees. By restricting the features available at each split, Random Forest forces trees to explore other, potentially less obvious, predictive features, leading to greater diversity among the trees and often better generalization (reduced variance).
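
The per-split feature sampling described above can be sketched in a few lines. This is only an illustration of the idea, not how any particular library implements it; the feature count and the seed are hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen only for reproducibility
n_features = 20                  # hypothetical number of features

# Common heuristic for classification: about sqrt(n_features) candidates per split
max_features = int(np.sqrt(n_features))

# A fresh random subset is drawn independently at every node split
candidates_node_a = rng.choice(n_features, size=max_features, replace=False)
candidates_node_b = rng.choice(n_features, size=max_features, replace=False)

print("Features considered at node A:", sorted(candidates_node_a.tolist()))
print("Features considered at node B:", sorted(candidates_node_b.tolist()))
```

Only the features in each printed subset would be searched for the best split at that node, which is why a dominant feature cannot appear at the root of every tree.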

4.What is OOB (Out-of-Bag) Score?
- The Out-of-Bag (OOB) score is a method for evaluating the performance of Bagging-based models (like Random Forest and BaggingClassifier/Regressor) without needing a separate validation set.
- How it works: Because Bagging uses bootstrap sampling, each base model (e.g., each tree in a Random Forest) is trained on only a subset (roughly 63.2%) of the original training data. The remaining data points (roughly 36.8%), which were not included in the bootstrap sample for a specific tree, are called the "Out-of-Bag" samples for that tree.

- To calculate the OOB score for the entire ensemble:
- For each data point in the original training set, identify all the trees that did not use this data point during their training (i.e., the trees for which this point was OOB).
- Make predictions for this data point using only those specific OOB trees (e.g., by majority vote or averaging).
- Compare this prediction to the actual target value for the data point.
- Calculate an overall metric (like accuracy for classification or R-squared/MSE for regression) based on these OOB predictions across all training data points.
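- Benefit: The OOB score provides a useful estimate of the model's performance on unseen data, similar in spirit to cross-validation, but obtained almost "for free" during the training process. Because each point is scored by only the trees that did not see it, the estimate tends to be slightly conservative rather than exactly unbiased.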
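
The 63.2% / 36.8% figures follow from the probability that a given point is never drawn in n draws with replacement, (1 - 1/n)^n, which approaches 1/e as n grows. A quick simulation, with an arbitrary sample size and seed chosen for illustration, shows the same numbers:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
n_samples = 10_000              # illustrative dataset size

# One bootstrap sample: draw n_samples row indices with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

in_bag_fraction = np.unique(bootstrap_idx).size / n_samples
oob_fraction = 1.0 - in_bag_fraction

# Theoretical value: 1 - (1 - 1/n)^n, which tends to 1 - 1/e
expected_in_bag = 1.0 - (1.0 - 1.0 / n_samples) ** n_samples

print(f"In-bag fraction (simulated): {in_bag_fraction:.3f}")
print(f"In-bag fraction (formula):   {expected_in_bag:.3f}")
print(f"OOB fraction (simulated):    {oob_fraction:.3f}")
```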

5.How can you measure the importance of features in a Random Forest model?
- Random Forests offer built-in mechanisms to estimate feature importance, indicating how much each feature contributes to the model's predictive power. Common methods include:
- Mean Decrease in Impurity (MDI) / Gini Importance:
- Calculated during training.
- For each feature, it measures the total reduction in the node impurity criterion (like Gini impurity for classification or variance reduction/MSE for regression) averaged over all trees in the forest whenever that feature is used to split a node.
- Features that result in larger impurity decreases are considered more important.
- Pros: Fast to compute (available directly after training).
- Cons: Can be biased towards high-cardinality features (features with many unique values) and might overestimate the importance of correlated features.
- Mean Decrease in Accuracy (MDA) / Permutation Importance:
- Calculated after training, often on a validation set or the OOB samples.
- For a specific feature:
- Measure the model's baseline accuracy (or other metric) on the dataset.
- Randomly shuffle (permute) the values of only that feature in the dataset, breaking its relationship with the target variable.
- Measure the model's accuracy again on this permuted dataset.
- The decrease in accuracy caused by shuffling the feature indicates its importance. A larger drop means the feature was more important.
- Pros: Less biased by feature cardinality, captures interactions better, can be calculated on any fitted model.
- Cons: Computationally more expensive, especially with many features or large datasets.

6.Explain the working principle of a Bagging Classifier
- A Bagging Classifier aims to improve the stability and accuracy of a base classification algorithm (like a Decision Tree, SVM, etc.) by reducing variance.
Steps:
- Bootstrap Sampling: Create multiple (say, B) different training datasets from the original training data. Each dataset is created by sampling with replacement from the original data. Each bootstrap sample typically has the same size as the original dataset but contains duplicate instances and omits some original instances (the OOB samples).
- Base Model Training: Train one instance of the base classifier (e.g., a Decision Tree) independently on each of the B bootstrap samples. This results in B different classifiers.
- Aggregation (Voting): To make a prediction for a new, unseen data point, pass it through all B trained classifiers. Collect the individual predictions from each classifier.
- The final prediction of the Bagging Classifier is determined by majority voting among the B individual predictions (e.g., if 7 out of 10 trees predict 'Class A' and 3 predict 'Class B', the final prediction is 'Class A'). For probabilistic predictions, the probabilities from each base model can be averaged.

7.How do you evaluate a Bagging Classifier’s performance?
- You evaluate a Bagging Classifier using standard classification metrics, typically calculated on a hold-out test set (data not used during training at all). Common metrics include:
- Accuracy: Overall percentage of correct predictions.
- Precision: Of the instances predicted as positive, how many actually are positive (TP / (TP + FP)). Important when False Positives are costly.
- Recall (Sensitivity): Of the actual positive instances, how many were correctly identified (TP / (TP + FN)). Important when False Negatives are costly.
- F1-Score: The harmonic mean of Precision and Recall (2 * (Precision * Recall) / (Precision + Recall)). Useful for imbalanced datasets.
- AUC (Area Under the ROC Curve): Measures the model's ability to distinguish between positive and negative classes across different probability thresholds.
- Confusion Matrix: A table showing True Positives, True Negatives, False Positives, and False Negatives, providing a detailed breakdown of performance.
- Additionally, as mentioned before, the OOB Score can be used as an internal estimate of performance without needing a separate test set during development or hyperparameter tuning.
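
As a worked example of the formulas above, the snippet below computes the metrics from a confusion matrix. The four counts are made-up numbers used purely for illustration.

```python
# Hypothetical confusion-matrix counts for a binary classifier
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-Score:  {f1:.3f}")
```

Here precision is higher than recall because the hypothetical model misses 20 positives (FN) but raises only 10 false alarms (FP); the F1-Score sits between the two.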

8.How does a Bagging Regressor work?
- Similar to the Bagging Classifier, but adapted for regression tasks.
Steps:
- Bootstrap Sampling: Create B bootstrap samples from the original training data (with replacement).
- Base Model Training: Train one instance of the base regressor (e.g., a Decision Tree Regressor) independently on each of the B bootstrap samples.
- Aggregation (Averaging): To make a prediction for a new data point, pass it through all B trained regressors and collect the individual numerical predictions. The final prediction of the Bagging Regressor is the average (mean) of the B individual predictions.
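
The three steps can be written out by hand to make the mechanics visible. This is a minimal sketch, not a replacement for `BaggingRegressor`; the dataset, `B = 25`, and the seeds are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

B = 25  # number of bootstrap samples / base regressors (illustrative choice)
models = []
for b in range(B):
    # Step 1: bootstrap sampling, draw len(X) row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: train one base regressor on this bootstrap sample
    models.append(DecisionTreeRegressor(random_state=b).fit(X[idx], y[idx]))

# Step 3: aggregation, average the B predictions for each point
X_new = X[:3]  # a few rows reused as stand-ins for new data
all_preds = np.stack([m.predict(X_new) for m in models])  # shape (B, 3)
bagged_pred = all_preds.mean(axis=0)

print("Individual predictions shape:", all_preds.shape)
print("Bagged predictions:", bagged_pred)
```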

9.What is the main advantage of ensemble techniques?
- The primary advantage is generally improved predictive performance (accuracy for classification, lower error for regression) and increased robustness compared to a single model. They achieve this by:
- Reducing Variance (Bagging, Random Forests): Averaging predictions from diverse models trained on different data subsets smooths out noise and reduces sensitivity to the specific training data.
- Reducing Bias (Boosting): Sequentially focusing on misclassified examples helps the model learn complex patterns it might otherwise miss.
- Combining the strengths of different models (Stacking).

12.What is a Random Forest Classifier?
A Random Forest Classifier is a specific type of ensemble learning method primarily used for classification tasks. It operates by constructing a multitude (a "forest") of Decision Trees during training time.
Key Characteristics:
- It uses Bagging (bootstrap sampling of data) to train each tree on a different random subset of the training data.
- It introduces feature randomness: at each node split during tree construction, only a random subset of features is considered for finding the best split.
- The final prediction for a new instance is obtained by taking the majority vote of the predictions from all individual trees in the forest.
- It's known for its good performance "out-of-the-box", robustness to overfitting, and ability to handle high-dimensional data.
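
The majority-vote step can be illustrated with a handful of hypothetical tree outputs. The votes below are invented for the example.

```python
from collections import Counter

# Hypothetical class votes from 10 trees for one new instance
tree_votes = ["Class A"] * 7 + ["Class B"] * 3

vote_counts = Counter(tree_votes)
final_prediction, n_votes = vote_counts.most_common(1)[0]

print(f"Votes: {dict(vote_counts)}")
print(f"Final prediction: {final_prediction} ({n_votes} of {len(tree_votes)} votes)")
```

Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes; the two approaches usually agree, but they can differ when trees are uncertain.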

13.What are the main types of ensemble techniques?
The three most common types are:
- Bagging (Bootstrap Aggregating): Trains multiple base models independently in parallel on different bootstrap samples of the data. Combines predictions by voting (classification) or averaging (regression). Aims primarily to reduce variance (e.g., Random Forest).
- Boosting: Trains base models sequentially. Each new model focuses on correcting the errors made by the previous models (e.g., by up-weighting misclassified samples). Combines predictions through a weighted vote or sum. Aims primarily to reduce bias (e.g., AdaBoost, Gradient Boosting Machines (GBM), XGBoost, LightGBM, CatBoost).
- Stacking (Stacked Generalization): Trains multiple different base models (often diverse types). Then, trains a final "meta-model" whose input features are the predictions made by the base models on the training data (often using cross-validation predictions to prevent leakage). The meta-model learns how to best combine the base model predictions.
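
Of the three types, Stacking is the one not demonstrated elsewhere in this notebook, so here is a minimal sketch using scikit-learn's `StackingClassifier`. The choice of base models, meta-model, dataset, and `cv=5` are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Base models of different types (illustrative selection)
base_models = [
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=42)),
]

# The meta-model is fit on cross-validated base predictions (cv=5)
# so it never sees predictions made on a base model's own training rows
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print(f"Stacking Classifier Accuracy: {stack_accuracy:.4f}")
```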

What is ensemble learning in machine learning?
Ensemble learning is a machine learning paradigm where multiple learning algorithms (base models) are strategically combined to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Instead of relying on a single model, ensemble methods leverage the collective intelligence of several models.

14.When should we avoid using ensemble methods?
- Need for High Interpretability: If explaining precisely how a prediction is made is crucial (e.g., in some legal or medical contexts), the complexity of ensembles can be a drawback compared to simpler models like linear regression or single decision trees.
- Strict Computational Constraints: If training time, prediction latency, or memory usage is severely limited, the overhead of training and running multiple models might be prohibitive.
- Very Small Datasets: With extremely limited data, creating diverse bootstrap samples for Bagging might be difficult, and the complexity of ensembles might lead to overfitting despite their variance-reducing nature. A simpler model might generalize better.
- When a Single Model Performs Exceptionally Well: If a well-tuned single model already achieves the desired performance and generalization, the added complexity and computational cost of an ensemble might not be justified.
- Real-time Prediction Requirements: While many ensembles are fast, some (especially large ones or complex boosting models) might introduce latency unacceptable for hard real-time systems.

16.How does Bagging help in reducing overfitting?
- Overfitting occurs when a model learns the training data too well, including its noise and specific idiosyncrasies, leading to poor performance on unseen data. Single complex models like deep Decision Trees are prone to overfitting.
- Bagging reduces overfitting primarily by reducing variance:
- Diverse Training Sets: Each base model is trained on a slightly different subset of the data due to bootstrap sampling.
- Averaging/Voting: When predictions from these diverse models are combined (averaged for regression, voted for classification), the individual errors and overfitting tendencies tend to cancel each other out. A single tree might latch onto noise in its specific bootstrap sample, but it's unlikely that all trees will overfit the same noise patterns in the same way. The averaging/voting process smooths out these individual model eccentricities, resulting in a more stable and generalizable final prediction.
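
The cancelling effect of averaging can be seen in a small simulation. It assumes an idealized setting in which every model's error is independent unit-variance noise; real bagged trees are positively correlated, so the reduction in practice is smaller than shown here.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
true_value = 5.0                # hypothetical target being predicted
B = 25                          # models per ensemble (illustrative)
n_trials = 10_000

# Each "model" predicts the true value plus independent noise (idealized)
single_preds = true_value + rng.normal(0.0, 1.0, size=(n_trials, B))
ensemble_preds = single_preds.mean(axis=1)  # average over the B models

var_single = single_preds[:, 0].var()
var_ensemble = ensemble_preds.var()

print(f"Variance of a single model:     {var_single:.4f}")
print(f"Variance of the averaged model: {var_ensemble:.4f}")
print(f"Ratio: {var_single / var_ensemble:.1f} (about B = {B} for independent errors)")
```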

17.Why is Random Forest better than a single Decision Tree?
Random Forest generally outperforms a single Decision Tree for several reasons:
- Reduced Overfitting: As explained above, the combination of Bagging and feature randomness significantly reduces the variance and makes the model less prone to overfitting compared to a single, potentially deep, Decision Tree.
- Increased Robustness: The ensemble nature makes Random Forest less sensitive to noise in the data and variations in the training set. Small changes in the data are less likely to drastically alter the final model.
- Improved Accuracy: By averaging the predictions of many decorrelated trees, Random Forest often achieves higher accuracy than a single optimized Decision Tree. The ensemble can capture more complex patterns without overfitting as severely.
- No Need for Pruning (typically): Single Decision Trees often require careful pruning to prevent overfitting. Random Forests are less sensitive to the depth of individual trees because the ensemble averaging combats overfitting.

What are some real-world applications of ensemble techniques?
Ensemble methods are widely used across various domains due to their high performance:
- Finance: Credit scoring, fraud detection, stock market prediction.
- Healthcare: Disease diagnosis (e.g., cancer detection from images), predicting patient outcomes, drug discovery.
- E-commerce: Recommendation systems, customer churn prediction, sentiment analysis of reviews.
- Image Recognition & Computer Vision: Object detection, image classification.
- Remote Sensing: Land cover classification, environmental monitoring.
- Bioinformatics: Gene expression analysis, protein structure prediction.
- Machine Learning Competitions: Winning solutions on platforms like Kaggle often involve sophisticated ensembles (especially Gradient Boosting and Stacking).

What is the difference between Bagging and Boosting?

| Feature | Bagging (e.g., Random Forest) | Boosting (e.g., AdaBoost, GBM) |
|---|---|---|
| Model Training | Parallel (Independent) | Sequential (Dependent) |
| Goal | Decrease Variance | Decrease Bias (and often Variance too) |
| Data Sampling | Bootstrap sampling (random subsets with replacement) | Often uses all data, but weights samples |
| Model Weighting | Typically equal weight (voting/averaging) | Weighted (models performing better get higher weight) |
| Focus | Create diverse models on different data views | Focus on correcting errors of prior models |
| Sensitivity | Less sensitive to outliers | Can be sensitive to outliers / noisy data |







In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification, make_regression, load_breast_cancer
from sklearn.metrics import accuracy_score, mean_squared_error, roc_auc_score
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import BaggingClassifier, BaggingRegressor, RandomForestClassifier, RandomForestRegressor
import matplotlib.pyplot as plt
# Set random state for reproducibility
random_seed = 42
print("\n--- 1. Bagging Classifier with Decision Trees ---")
# Generate sample classification data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=random_seed)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_seed)

# Define the base estimator
base_dt = DecisionTreeClassifier(random_state=random_seed)

# Create and train the Bagging Classifier
# n_estimators: number of trees
bagging_clf = BaggingClassifier(estimator=base_dt, n_estimators=50,
                                random_state=random_seed, n_jobs=-1) # n_jobs=-1 uses all CPU cores
bagging_clf.fit(X_train, y_train)

# Make predictions
y_pred = bagging_clf.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Bagging Classifier (Decision Tree base) Accuracy: {accuracy:.4f}")


--- 1. Bagging Classifier with Decision Trees ---
Bagging Classifier (Decision Tree base) Accuracy: 0.8733


In [2]:
# Train a Bagging Regressor using Decision Trees and evaluate using Mean Squared Error (MSE)
print("\n--- 2. Bagging Regressor with Decision Trees ---")
# Generate sample regression data
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15,
                       noise=0.1, random_state=random_seed)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_seed)

# Define the base estimator
base_dtr = DecisionTreeRegressor(random_state=random_seed)

# Create and train the Bagging Regressor
bagging_reg = BaggingRegressor(estimator=base_dtr, n_estimators=50,
                               random_state=random_seed, n_jobs=-1)
bagging_reg.fit(X_train, y_train)

# Make predictions
y_pred = bagging_reg.predict(X_test)

# Evaluate MSE
mse = mean_squared_error(y_test, y_pred)
print(f"Bagging Regressor (Decision Tree base) MSE: {mse:.4f}")


--- 2. Bagging Regressor with Decision Trees ---
Bagging Regressor (Decision Tree base) MSE: 17888.1752


In [None]:
#Train a Random Forest Classifier on the Breast Cancer dataset and print feature importance scores
print("\n--- 3. Random Forest Classifier & Feature Importance ---")
# Load Breast Cancer dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
feature_names = cancer.feature_names

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_seed)

# Create and train the Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=random_seed, n_jobs=-1)
rf_clf.fit(X_train, y_train)

# Evaluate accuracy (optional, but good practice)
y_pred = rf_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest Classifier Accuracy (Breast Cancer): {accuracy:.4f}")

# Get feature importances (Mean Decrease in Impurity)
importances = rf_clf.feature_importances_
indices = np.argsort(importances)[::-1] # Sort features by importance

print("\nFeature Importance Scores (MDI):")
for i in range(X.shape[1]):
    print(f"{i + 1}. Feature '{feature_names[indices[i]]}' ({importances[indices[i]]:.4f})")
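
The MDI scores printed above are computed from the training process and can favor high-cardinality or correlated features. As a complementary check, permutation importance can be computed on the held-out test set with `sklearn.inspection.permutation_importance`. The snippet below is a self-contained sketch; `n_repeats=5` and the seed are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Shuffle each feature n_repeats times on held-out data and record the score drop
result = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=42)

# Show the five features whose shuffling hurts accuracy the most
top = result.importances_mean.argsort()[::-1][:5]
for rank, j in enumerate(top, start=1):
    print(f"{rank}. {cancer.feature_names[j]}: "
          f"{result.importances_mean[j]:.4f} +/- {result.importances_std[j]:.4f}")
```

Because many Breast Cancer features are strongly correlated, individual permutation scores can be small even for useful features: the model can fall back on a correlated neighbor when one column is shuffled.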

In [3]:
 #Train a Random Forest Regressor and compare its performance with a single Decision Tree
print("\n--- 4. Random Forest Regressor vs Single Decision Tree ---")
# Generate sample regression data (can reuse from example 2)
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15,
                       noise=0.1, random_state=random_seed)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_seed)

# --- Single Decision Tree Regressor ---
single_dtr = DecisionTreeRegressor(random_state=random_seed)
single_dtr.fit(X_train, y_train)
y_pred_single = single_dtr.predict(X_test)
mse_single = mean_squared_error(y_test, y_pred_single)
print(f"Single Decision Tree Regressor MSE: {mse_single:.4f}")

# --- Random Forest Regressor ---
rf_reg = RandomForestRegressor(n_estimators=100, random_state=random_seed, n_jobs=-1)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)
print(f"Random Forest Regressor MSE: {mse_rf:.4f}")

if mse_rf < mse_single:
    print("Random Forest Regressor performed better (lower MSE).")
else:
    print("Single Decision Tree Regressor performed better or equal (lower/equal MSE).")


--- 4. Random Forest Regressor vs Single Decision Tree ---
Single Decision Tree Regressor MSE: 42008.5183
Random Forest Regressor MSE: 17137.6945
Random Forest Regressor performed better (lower MSE).


In [None]:
#Compute the Out-of-Bag (OOB) Score for a Random Forest Classifier
print("\n--- 5. Random Forest Classifier OOB Score ---")
# Use Breast Cancer data from example 3
X, y = load_breast_cancer(return_X_y=True)

# Split data (although OOB is calculated on training data, we split for consistency)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_seed)

# Create and train RF Classifier WITH OOB_SCORE=TRUE
# Note: Needs enough estimators for OOB score to be reliable (e.g., >= 10)
rf_clf_oob = RandomForestClassifier(n_estimators=100, random_state=random_seed,
                                    oob_score=True, # <<< Enable OOB score calculation
                                    n_jobs=-1)
rf_clf_oob.fit(X_train, y_train) # OOB score calculated during fit

# Access the OOB score
oob_accuracy = rf_clf_oob.oob_score_
print(f"Random Forest Classifier OOB Accuracy: {oob_accuracy:.4f}")

# Compare with test set accuracy (optional)
y_pred_test = rf_clf_oob.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred_test)
print(f"Random Forest Classifier Test Set Accuracy: {test_accuracy:.4f}")

In [None]:
#Train a Bagging Classifier using SVM as a base estimator and print accuracy
print("\n--- 6. Bagging Classifier with SVM Base ---")
# Generate sample classification data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=random_seed)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_seed)

# Define the base estimator (SVM Classifier)
# Note: SVM can be slow, especially for Bagging with many estimators/large data
# Using a smaller n_estimators here for speed
base_svm = SVC(probability=True, random_state=random_seed) # probability=True exposes predict_proba so Bagging can average probabilities (slower to fit)

# Create and train the Bagging Classifier
bagging_clf_svm = BaggingClassifier(estimator=base_svm, n_estimators=10, # Fewer estimators due to SVM speed
                                    random_state=random_seed, n_jobs=-1)
bagging_clf_svm.fit(X_train, y_train)

# Make predictions
y_pred = bagging_clf_svm.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Bagging Classifier (SVM base) Accuracy: {accuracy:.4f}")

In [None]:
# Train a Random Forest Classifier with different numbers of trees and compare accuracy
print("\n--- 7. Random Forest Classifier Accuracy vs Number of Trees ---")
# Use Breast Cancer data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_seed)

n_estimators_list = [10, 50, 100, 200, 500]
accuracies = []

print("Comparing RF accuracy for different n_estimators:")
for n_estimators in n_estimators_list:
    rf_clf = RandomForestClassifier(n_estimators=n_estimators, random_state=random_seed, n_jobs=-1)
    rf_clf.fit(X_train, y_train)
    y_pred = rf_clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"  n_estimators = {n_estimators}: Accuracy = {accuracy:.4f}")

# Optional: Plot the results
# plt.figure(figsize=(8, 5))
# plt.plot(n_estimators_list, accuracies, marker='o')
# plt.title("Random Forest Accuracy vs. Number of Trees (Breast Cancer)")
# plt.xlabel("Number of Estimators (Trees)")
# plt.ylabel("Test Set Accuracy")
# plt.grid(True)
# plt.show()

In [4]:
 #Train a Bagging Classifier using Logistic Regression as a base estimator and print AUC score
print("\n--- 8. Bagging Classifier with Logistic Regression Base (AUC) ---")
# Generate sample classification data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=random_seed)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_seed)

# Define the base estimator (Logistic Regression)
base_lr = LogisticRegression(solver='liblinear', random_state=random_seed) # liblinear good for smaller datasets

# Create and train the Bagging Classifier
bagging_clf_lr = BaggingClassifier(estimator=base_lr, n_estimators=50,
                                   random_state=random_seed, n_jobs=-1)
bagging_clf_lr.fit(X_train, y_train)

# Make probability predictions for AUC
y_pred_proba = bagging_clf_lr.predict_proba(X_test)[:, 1] # Probability of positive class

# Evaluate AUC
auc = roc_auc_score(y_test, y_pred_proba)
print(f"Bagging Classifier (Logistic Regression base) AUC: {auc:.4f}")


--- 8. Bagging Classifier with Logistic Regression Base (AUC) ---
Bagging Classifier (Logistic Regression base) AUC: 0.9075


In [6]:
#Train a Random Forest Regressor and analyze feature importance scores
print("\n--- 9. Random Forest Regressor & Feature Importance Analysis ---")
# Generate sample regression data
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15,
                       noise=0.1, random_state=random_seed)
# Create dummy feature names for demonstration
feature_names = [f'feature_{i}' for i in range(X.shape[1])]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_seed)

# Create and train the Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=random_seed, n_jobs=-1)
rf_reg.fit(X_train, y_train)

# Evaluate MSE (optional)
y_pred = rf_reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Random Forest Regressor MSE: {mse:.4f}")

# Get feature importances (Mean Decrease in Impurity/Variance Reduction)
importances = rf_reg.feature_importances_
indices = np.argsort(importances)[::-1]

print("\nFeature Importance Scores (MDI/Variance Reduction):")
# Print top 10 features
for i in range(min(10, X.shape[1])):
    print(f"{i + 1}. Feature '{feature_names[indices[i]]}' ({importances[indices[i]]:.4f})")

# Analysis: Features with higher scores contribute more, on average, to reducing
# the variance (or MSE) at the splits in the trees across the forest.
# In this synthetic dataset, we expect the 15 informative features to generally
# have higher importance scores than the non-informative ones. Note that
# make_regression shuffles feature columns by default (shuffle=True), so the
# informative features are not necessarily the first 15 columns.


--- 9. Random Forest Regressor & Feature Importance Analysis ---
Random Forest Regressor MSE: 17137.6945

Feature Importance Scores (MDI/Variance Reduction):
1. Feature 'feature_3' (0.2123)
2. Feature 'feature_2' (0.1586)
3. Feature 'feature_8' (0.0885)
4. Feature 'feature_14' (0.0865)
5. Feature 'feature_13' (0.0844)
6. Feature 'feature_9' (0.0734)
7. Feature 'feature_0' (0.0716)
8. Feature 'feature_12' (0.0385)
9. Feature 'feature_6' (0.0273)
10. Feature 'feature_16' (0.0228)


In [7]:
#Train an ensemble model using both Bagging and Random Forest and compare accuracy.
print("\n--- 10. Comparing Bagging (DT) vs Random Forest Classifier Accuracy ---")
# Use Breast Cancer data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_seed)

# --- Bagging Classifier (with Decision Trees) ---
base_dt = DecisionTreeClassifier(random_state=random_seed)
bagging_clf = BaggingClassifier(estimator=base_dt, n_estimators=100, # Use same n_estimators for fair comparison
                                random_state=random_seed, n_jobs=-1)
bagging_clf.fit(X_train, y_train)
y_pred_bagging = bagging_clf.predict(X_test)
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)
print(f"Bagging Classifier (DT base) Accuracy: {accuracy_bagging:.4f}")

# --- Random Forest Classifier ---
# Note: RF is essentially Bagging(DT) + Feature Randomness
rf_clf = RandomForestClassifier(n_estimators=100, random_state=random_seed, n_jobs=-1)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Classifier Accuracy: {accuracy_rf:.4f}")

if accuracy_rf > accuracy_bagging:
    print("Random Forest performed better.")
elif accuracy_bagging > accuracy_rf:
    print("Bagging Classifier (DT base) performed better.")
else:
    print("Both models performed equally well.")

# Random Forest often slightly outperforms a standard Bagging Classifier
# with Decision Trees due to the added decorrelation from feature randomness,
# especially when some features are very dominant. A single train/test split
# like this one is not enough to establish that difference reliably.


--- 10. Comparing Bagging (DT) vs Random Forest Classifier Accuracy ---
Bagging Classifier (DT base) Accuracy: 0.9591
Random Forest Classifier Accuracy: 0.9708
Random Forest performed better.
