# Ensemble Learning

## Assignment

**Question 1:  What is Ensemble Learning in machine learning? Explain the key idea
behind it.**

**Answer:** **Ensemble Learning** in machine learning is a technique in which multiple base learners (models) are trained and their predictions are combined, with the aim of outperforming any single model operating in isolation.

**(Key Idea)**:

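The central thesis is collective intelligence: when the base models make different, weakly correlated errors, aggregating their predictions (by voting or averaging) lets those errors partially cancel, so the ensemble is typically more accurate and more stable than its individual members. In enterprise terms, it is a risk-diversification and performance-optimization strategy: no single model's weakness dominates the outcome.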
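The error-cancellation argument can be made concrete with a little arithmetic. The sketch below assumes an idealized setting that real ensembles only approximate: every model is correct with the same probability and the models err independently. The function name `majority_vote_accuracy` and the figures (11 models, 70% accuracy) are illustrative choices, not part of the assignment.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes each model is right with probability p_correct and that
    errors are independent (an idealization).
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


single = 0.70
ensemble = majority_vote_accuracy(11, single)
print(f"single model: {single:.3f}, 11-model majority vote: {ensemble:.3f}")
```

In practice base models are correlated, so the real gain is smaller than this idealized figure; that is why methods such as bagging deliberately inject diversity.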

**Question 2: What is the difference between Bagging and Boosting?**

**Answer:** **Bagging (Bootstrap Aggregating)**

**Core objective**: Risk mitigation through variance reduction.

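**Execution model**: Multiple models are trained independently (and can therefore run in parallel) on bootstrap samples drawn with replacement from the training set; their predictions are then averaged or combined by majority vote.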

**Error strategy**: Treats all observations as equal, with no prioritization of failure cases.

**Enterprise value**: Improves stability and robustness, especially for high-variance models like Decision Trees.

**Typical use case**: When overfitting is the primary bottleneck.

**Example**: Random Forest

**Boosting**

**Core objective**: Performance acceleration via bias reduction.

**Execution model**: Models are trained sequentially, each iteration focusing on prior errors.

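**Error strategy**: Misclassified data points are up-weighted (AdaBoost), or each new model is fit to the residual errors of the current ensemble (Gradient Boosting), to drive continuous improvement.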

**Enterprise value**: Delivers higher predictive accuracy on complex patterns.

**Typical use case**: When underfitting or weak learners limit business outcomes.

**Examples**: AdaBoost, Gradient Boosting, XGBoost
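To illustrate what "up-weighting" means in the boosting case, here is a minimal sketch of a single AdaBoost-style weight update for binary classification. The helper name `adaboost_reweight` and the toy inputs are hypothetical, and the sketch assumes the weighted error lies strictly between 0 and 1.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One AdaBoost-style weight update (binary classification sketch).

    weights: current sample weights, assumed to sum to 1.
    misclassified: booleans, True where the current weak learner was wrong.
    """
    # Weighted error of the current weak learner (assumed 0 < eps < 1)
    eps = sum(w for w, m in zip(weights, misclassified) if m)
    # Vote strength of this learner: larger when the error is smaller
    alpha = 0.5 * math.log((1 - eps) / eps)
    # Increase weights of mistakes, decrease weights of correct predictions
    updated = [
        w * math.exp(alpha if m else -alpha)
        for w, m in zip(weights, misclassified)
    ]
    total = sum(updated)
    return [w / total for w in updated], alpha


weights = [0.1] * 10                        # uniform starting weights
misclassified = [True, True] + [False] * 8  # learner got 2 of 10 wrong
new_weights, alpha = adaboost_reweight(weights, misclassified)
print("misclassified weight:", round(new_weights[0], 4))
print("correct weight:", round(new_weights[2], 4))
```

After the update, the misclassified points together carry half of the total weight, so the next learner is pushed to focus on them. Bagging has no counterpart to this step: every bootstrap sample is drawn with uniform probabilities.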


**Question 3: What is bootstrap sampling, and what role does it play in Bagging methods
like Random Forest?**

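**Answer:** Bootstrap sampling draws a sample of size n from a training set of n observations by sampling *with replacement*, so some observations appear more than once in the sample and others do not appear at all.

**Strategic Role in Bagging & Random Forest**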

*   In Bagging (Bootstrap Aggregating), each base learner (e.g., a Decision Tree) is trained on a different bootstrap sample, creating model diversity.

*   In Random Forest, bootstrap sampling, combined with considering only a random subset of features at each split, de-correlates the trees, reducing variance and stabilizing predictions.

*   The observations not selected in a bootstrap sample become Out-of-Bag (OOB) data, which enables internal performance validation without a separate test set.
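The points above can be sketched with the standard library alone. The helper `bootstrap_indices`, the sample size of 1000, and the seed are illustrative assumptions; libraries such as scikit-learn perform this resampling internally.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices uniformly at random, with replacement."""
    return [rng.randrange(n) for _ in range(n)]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
n = 1000
sample = bootstrap_indices(n, rng)

in_bag = set(sample)             # distinct rows that made it into the sample
oob = set(range(n)) - in_bag     # rows never drawn: the Out-of-Bag set

print("sample size:", len(sample))
print("distinct in-bag fraction:", len(in_bag) / n)
print("out-of-bag fraction:", len(oob) / n)
```

For large n the out-of-bag fraction approaches (1 - 1/n)^n ≈ 1/e ≈ 0.37, which is where the commonly quoted 63% / 37% split comes from.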






**Question 4: What are Out-of-Bag (OOB) samples, and how is the OOB score used to
evaluate ensemble models?**

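**Answer:** Out-of-Bag (OOB) samples are the data points not selected during bootstrap sampling when training ensemble models such as Bagging or Random Forests.

**Strategic Value Proposition:**

*   Each base model is trained on a bootstrap sample that contains, on average, about 63% of the distinct training observations.
*   The remaining ~37% automatically become that model's OOB data and act as a built-in validation set.

**OOB Score – Performance KPI:**

*   Each observation is predicted using only the models that did not see it during training, and these predictions are aggregated across models.
*   The resulting OOB score (accuracy for classifiers and R² for regressors in scikit-learn) provides a near-unbiased estimate of generalization performance without a separate validation set.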
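As a minimal sketch of how this looks in scikit-learn, the snippet below enables `oob_score=True` on a Random Forest and reads the fitted `oob_score_` attribute. The dataset (Breast Cancer), the number of trees, and the seed are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each row using only the trees
# whose bootstrap sample did not contain that row (bootstrap=True is the default)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

oob_accuracy = rf.oob_score_  # OOB accuracy for a classifier
print("OOB accuracy:", round(oob_accuracy, 4))
```

No train/test split is needed here because every prediction that feeds the OOB score comes from trees that never saw the row in question.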





**Question 5: Compare feature importance analysis in a single Decision Tree vs. a
Random Forest**

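**Answer:** **Decision Trees explain decisions**  
They provide transparent, rule-based logic that enables straightforward stakeholder communication, auditability, and rapid diagnostic insight. A single tree's feature importances come from the impurity reduction achieved by each feature's splits in that one tree, so they are easy to trace back to specific rules. However, they are susceptible to variance and overfitting: a small change in the training data can change the splits and reorder the ranking, and a feature that is never chosen for a split receives zero importance even if it is informative.

**Random Forests optimize decisions**  
They leverage ensemble intelligence to drive higher accuracy, stability, and generalization at scale. Feature importances are averaged over many trees trained on different bootstrap samples and feature subsets, which yields a more stable ranking and spreads credit across correlated features. While individual decision paths are less interpretable, the aggregate outcome is materially stronger for production-grade AI systems.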
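The contrast can be inspected directly. The following sketch fits one Decision Tree and one Random Forest on the Breast Cancer dataset (an illustrative choice, as are the seed and the 200 trees) and compares their impurity-based `feature_importances_`, including the spread of importances across the forest's individual trees.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

tree_imp = tree.feature_importances_      # importances from one tree
forest_imp = forest.feature_importances_  # importances averaged over all trees

# Spread of each feature's importance across the forest's individual trees
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
forest_std = per_tree.std(axis=0)

print("features with zero importance (tree):  ", int((tree_imp == 0).sum()))
print("features with zero importance (forest):", int((forest_imp == 0).sum()))
for i in np.argsort(forest_imp)[::-1][:5]:
    print(f"{data.feature_names[i]:<25} tree={tree_imp[i]:.3f} "
          f"forest={forest_imp[i]:.3f} (std across trees {forest_std[i]:.3f})")
```

A large standard deviation across trees shows how much a single tree's ranking can vary; averaging is what makes the forest's ranking more dependable.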

**Question 6: Write a Python program to:   
● Load the Breast Cancer dataset using   
sklearn.datasets.load_breast_cancer()      
● Train a Random Forest Classifier   
● Print the top 5 most important features based on feature importance scores.**

In [1]:
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Step 1: Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Step 2: Train a Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X, y)

# Step 3: Extract feature importances
importances = rf_model.feature_importances_

# Step 4: Create a DataFrame for better visualization
feature_importance_df = pd.DataFrame({"Feature": feature_names, "Importance": importances})

# Step 5: Sort features by importance (descending order)
top_5_features = feature_importance_df.sort_values(by="Importance", ascending=False).head(5)

# Step 6: Print top 5 important features
print(top_5_features)


                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


**Question 7: Write a Python program to:    
● Train a Bagging Classifier using Decision Trees on the Iris dataset     
● Evaluate its accuracy and compare with a single Decision Tree**

In [2]:
# Import core libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Single Decision Tree (Baseline)

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

dt_predictions = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_predictions)


# Bagging Classifier (Ensembles)
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42)
bagging.fit(X_train, y_train)

bagging_predictions = bagging.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_predictions)

# Print results
print("Decision Tree Accuracy:", dt_accuracy)
print("Bagging Classifier Accuracy:", bagging_accuracy)


Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0


**Question 8: Write a Python program to:      
● Train a Random Forest Classifier     
● Tune hyperparameters max_depth and n_estimators using GridSearchCV     
● Print the best parameters and final accuracy**  

In [3]:
# Import required libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_wine()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}

# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring="accuracy")

# Fit model
grid_search.fit(X_train, y_train)

# Best model from GridSearch
best_model = grid_search.best_estimator_

# Predictions on test data
y_pred = best_model.predict(X_test)

# Final accuracy
final_accuracy = accuracy_score(y_test, y_pred)

# Output results
print("Best Hyperparameters:", grid_search.best_params_)
print("Final Accuracy:", final_accuracy)


Best Hyperparameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy: 1.0


**Question 9: Write a Python program to:      
● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset    
● Compare their Mean Squared Errors (MSE)**

In [4]:
# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

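# Initialize models
# Note: BaggingRegressor defaults to 10 decision trees, while RandomForestRegressor
# defaults to 100, so the MSE gap below reflects ensemble size as well as method
bagging_model = BaggingRegressor(random_state=42)
random_forest_model = RandomForestRegressor(random_state=42)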

# Train the models
bagging_model.fit(X_train, y_train)
random_forest_model.fit(X_train, y_train)

# Make predictions
bagging_predictions = bagging_model.predict(X_test)
rf_predictions = random_forest_model.predict(X_test)

# Calculate Mean Squared Error
bagging_mse = mean_squared_error(y_test, bagging_predictions)
rf_mse = mean_squared_error(y_test, rf_predictions)

# Print results
print("Bagging Regressor MSE:", bagging_mse)
print("Random Forest Regressor MSE:", rf_mse)


Bagging Regressor MSE: 0.2824242776841025
Random Forest Regressor MSE: 0.2553684927247781


**Question 10: You are working as a data scientist at a financial institution to predict loan     
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.   
Explain your step-by-step approach to:    
● Choose between Bagging or Boosting     
● Handle overfitting    
● Select base models    
● Evaluate performance using cross-validation      
● Justify how ensemble learning improves decision-making in this real-world
context.**

In [5]:
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Simulated loan default dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=5, random_state=42)

# Bagging-style model (Random Forest: bagged trees plus random feature selection)
bagging_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Boosting model
boosting_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Cross-validation evaluation
bagging_score = cross_val_score(bagging_model, X, y, cv=5).mean()
boosting_score = cross_val_score(boosting_model, X, y, cv=5).mean()

print("Bagging (Random Forest) Accuracy:", round(bagging_score, 4))
print("Boosting (Gradient Boosting) Accuracy:", round(boosting_score, 4))


Bagging (Random Forest) Accuracy: 0.933
Boosting (Gradient Boosting) Accuracy: 0.918
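In a loan-default setting, defaults are usually the minority class, so accuracy alone can look good while most defaulters are missed. The sketch below is one possible way to extend the evaluation step, under assumptions that are not in the original: a simulated 90/10 class split, `class_weight="balanced"` for the Random Forest, and ROC-AUC and recall as the metrics of interest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Simulated imbalanced data: roughly 10% "default" cases (an assumption)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_redundant=5, weights=[0.9, 0.1], random_state=42)

# Stratified folds keep the default rate similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42),
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=["roc_auc", "recall"])
    results[name] = {
        "roc_auc": scores["test_roc_auc"].mean(),
        "recall": scores["test_recall"].mean(),
    }
    print(f"{name}: ROC-AUC={results[name]['roc_auc']:.3f}, "
          f"recall={results[name]['recall']:.3f}")
```

Recall on the default class indicates how many actual defaulters the model catches at the default 0.5 threshold, while ROC-AUC summarizes ranking quality across all thresholds; which one matters more depends on the relative cost of a missed default versus a rejected good customer.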
