# 🤝 Ensemble Learning

This notebook explores ensemble methods to boost performance  
by combining multiple machine learning models.

---

## 🎯 Purpose

To evaluate ensemble strategies—Voting, Bagging, Boosting, and Stacking—  
and compare them with the best single model from earlier work.

Each method will be briefly introduced and tested using the same dataset and evaluation metrics.

## 📦 Dataset

Same processed dataset used in earlier notebooks:  
[Titanic - Machine Learning from Disaster](https://www.kaggle.com/c/titanic)  
via public repository: [Data Science Dojo GitHub](https://github.com/datasciencedojo/datasets)

## 📦 1. Load the Feature-Engineered Dataset

I begin by loading the fully processed Titanic dataset,  
which includes all engineered features created in earlier notebooks.

I also prepare the input matrix `X` and target vector `y`  
to be used consistently across all ensemble models.

In [65]:
import pandas as pd

# Load the processed dataset with all engineered features
df = pd.read_csv("feature_engineered_titanic.csv")

# One-hot encode 'Sex'
df['Sex_male'] = (df['Sex'] == 'male').astype(int)
df['Sex_female'] = (df['Sex'] == 'female').astype(int)
df = df.drop(columns=['Sex'])

# Define features and target
expected_columns = [
    'Fare', 'Embarked_C', 'Embarked_Q', 'Embarked_S',
    'Cabin_A', 'Cabin_B', 'Cabin_C', 'Cabin_D', 'Cabin_E',
    'Cabin_F', 'Cabin_G', 'Cabin_T', 'Cabin_U',
    'Pclass_1', 'Pclass_2', 'Pclass_3',
    'FamilySize', 'IsAlone', 'LowFare', 'AgeBin',
    'ModerateFamily', 'IsCherbourg', 'FemaleFirstSecondClass',
    'IsChildOrElderly', 'LowFare_3rdClass',
    'Sex_male', 'Sex_female', 'Age', 'Parch', 'SibSp'
]
X_all = df[expected_columns]
y_all = df['Survived']


Instead of dropping unnecessary columns, I explicitly select the features used during training. This avoids any mismatch in the number or order of features and keeps the training and prediction data consistent.
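
As a minimal sketch of this explicit-selection idea, the helper below fails early when an expected column is missing and otherwise returns the columns in a fixed order. The function name `select_features` and the toy frame are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def select_features(df: pd.DataFrame, expected_columns: list[str]) -> pd.DataFrame:
    """Return df restricted to expected_columns, in exactly that order."""
    missing = [col for col in expected_columns if col not in df.columns]
    if missing:
        # Fail early rather than let the model see a shifted feature matrix
        raise KeyError(f"Missing expected feature columns: {missing}")
    return df[expected_columns]


# Toy frame with extra columns and a different column order (illustrative values)
toy = pd.DataFrame({
    "Survived": [0, 1],
    "Sex_male": [1, 0],
    "Fare": [7.25, 71.28],
    "PassengerId": [1, 2],
})

X_toy = select_features(toy, ["Fare", "Sex_male"])
print(list(X_toy.columns))  # ['Fare', 'Sex_male']
```

Selecting by name instead of dropping columns means that extra columns in new data are ignored, while a missing column raises an error instead of silently shifting features.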

## 🧪 2. Ensemble Experiments

I test several ensemble methods to boost predictive performance.  
The order reflects increasing model complexity and learning power:

| Step | Method            | Description                                                     |
|------|-------------------|-----------------------------------------------------------------|
| 2-1  | Voting Classifier | Soft-voting average of several models' predicted probabilities  |
| 2-2  | Bagging           | Random Forest to reduce variance                                |
| 2-3  | Boosting          | Sequentially trained trees that correct earlier errors          |
| 2-4  | Stacking          | Meta-learner that combines base model predictions               |

### 2-1. Voting Classifier

Voting is a simple ensemble technique that combines predictions from multiple models.  
In **soft voting**, I average the predicted class probabilities and choose the class with the highest average.

I selected three strong models that not only perform well individually,  
but also differ in learning strategies — allowing them to complement each other.

This helps balance out the weaknesses of individual models and often leads to more stable performance.

**Models Used:**
- Random Forest (Bagging-based)
- Gradient Boosting (Sequential boosting)
- XGBoost (Optimized boosting)
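
To make the soft-voting rule concrete, here is a small worked example with made-up probabilities (they are not outputs of the models above). It contrasts soft voting with hard (majority) voting for a binary target, assuming a 0.5 decision threshold.

```python
import numpy as np

# Hypothetical P(survived) from three models (columns) for four passengers (rows)
proba = np.array([
    [0.90, 0.40, 0.45],   # passenger 0
    [0.20, 0.30, 0.10],   # passenger 1
    [0.55, 0.60, 0.10],   # passenger 2
    [0.45, 0.48, 0.95],   # passenger 3
])

# Soft voting: average the probabilities, then threshold at 0.5
soft_pred = (proba.mean(axis=1) >= 0.5).astype(int)

# Hard voting: each model casts a 0/1 vote, and the majority wins
hard_pred = ((proba >= 0.5).sum(axis=1) >= 2).astype(int)

print("soft:", soft_pred)  # soft: [1 0 0 1]
print("hard:", hard_pred)  # hard: [0 0 1 0]
```

Passengers 0 and 3 show how one confident model can outweigh two lukewarm ones under soft voting, while hard voting counts only the votes.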

In [66]:
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, f1_score
from xgboost import XGBClassifier

# Define individual models
rf = RandomForestClassifier(random_state=42)
gb = GradientBoostingClassifier(random_state=42)
xgb = XGBClassifier(random_state=42, eval_metric='logloss')

# Voting ensemble
voting = VotingClassifier(
    estimators=[('rf', rf), ('gb', gb), ('xgb', xgb)],
    voting='soft'
)

# Evaluate with cross-validation
scores = cross_val_score(voting, X_all, y_all, cv=5, scoring=make_scorer(f1_score))
print(f"Voting Classifier: Mean F1 Score = {scores.mean():.4f}")

Voting Classifier: Mean F1 Score = 0.7682


### 2-2. Bagging (Random Forest)

Bagging, short for Bootstrap Aggregating, is an ensemble technique that trains multiple models on different bootstrap samples of the data (random samples drawn with replacement).  
Each model learns independently, and their predictions are aggregated (e.g., by voting or averaging).

**Random Forest** is a classic bagging model that builds multiple decision trees,  
each trained on a different bootstrap sample and a random subset of features.

This strategy helps:
- Reduce **variance** by averaging diverse predictions
- Reduce overfitting compared to a single decision tree
- Maintain strong performance with relatively low tuning effort

**Key Concept:**  
By combining many "weakly correlated" trees, Random Forest achieves robust and stable predictions.
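
The bootstrap step itself can be sketched in a few lines. This is an illustration with NumPy on index positions only, not Random Forest's internal code. It shows that a sample drawn with replacement contains only about 63% of the distinct original rows, which is one reason the trees end up seeing different data.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000  # illustrative dataset size, not the Titanic row count

# A bootstrap sample draws n_samples row indices *with replacement*
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Fraction of distinct original rows that made it into the sample
unique_fraction = np.unique(bootstrap_idx).size / n_samples
print(f"Unique rows in bootstrap sample: {unique_fraction:.3f}")
# For large n the expected fraction approaches 1 - 1/e (about 0.632)
```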

In [67]:
from sklearn.ensemble import RandomForestClassifier

# Define Random Forest model (Bagging)
rf_model = RandomForestClassifier(random_state=42)

# Evaluate with 5-fold cross-validation using F1 score
rf_scores = cross_val_score(rf_model, X_all, y_all, cv=5, scoring=make_scorer(f1_score))

# Print results
print(f"Random Forest (Bagging): Mean F1 Score = {rf_scores.mean():.4f}")

Random Forest (Bagging): Mean F1 Score = 0.7346


### 2-3. Boosting: Gradient Boosting & XGBoost

Boosting is a powerful ensemble technique that builds models sequentially.  
Each new model focuses on the mistakes made by the previous ones, reducing bias and improving overall accuracy.

I test two popular boosting algorithms:

**Models Used:**
- GradientBoostingClassifier: scikit-learn's standard gradient boosting with decision trees  
- XGBClassifier: An optimized implementation of gradient boosting with built-in L1/L2 regularization and parallelized tree construction

These models often achieve top performance on structured datasets like Titanic.
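
The "each new model focuses on previous mistakes" idea can be sketched by hand. The example below is a simplified illustration for squared-error regression on synthetic data, where each stump is fit to the current residuals. The classifiers used in this notebook optimize a classification loss instead, so treat this as a picture of the mechanism, not of their exact internals.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())          # stage 0: constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    residuals = y - prediction                  # errors of the current ensemble
    stump = DecisionTreeRegressor(max_depth=1, random_state=0)
    stump.fit(X, residuals)                     # the next tree targets those errors
    prediction += learning_rate * stump.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"Training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Each round lowers the training error a little. The learning rate controls how much of each correction is applied.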

In [68]:
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

# Define models
gb_model = GradientBoostingClassifier(random_state=42)
xgb_model = XGBClassifier(random_state=42, eval_metric='logloss')

# Evaluate with 5-fold cross-validation
gb_scores = cross_val_score(gb_model, X_all, y_all, cv=5, scoring=make_scorer(f1_score))
xgb_scores = cross_val_score(xgb_model, X_all, y_all, cv=5, scoring=make_scorer(f1_score))

# Print results
print(f"Gradient Boosting: Mean F1 Score = {gb_scores.mean():.4f}")
print(f"XGBoost: Mean F1 Score = {xgb_scores.mean():.4f}")

Gradient Boosting: Mean F1 Score = 0.7472
XGBoost: Mean F1 Score = 0.7471


### 2-4. Stacking Classifier

Stacking is a more advanced ensemble method that combines multiple base models using a meta-model.
Instead of voting or averaging, the meta-model learns how to best combine the predictions of the base models.

This method often achieves higher accuracy by leveraging the strengths of different algorithms.

**Base Models:**
- Random Forest
- Gradient Boosting
- XGBoost

**Meta-model:**
- Logistic Regression (simple and effective for final decision making)
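
As a rough sketch of what "the meta-model learns how to combine predictions" means, the example below builds the meta-features by hand on synthetic data. Each base model contributes out-of-fold probabilities, and a logistic regression is fit on those columns. The data, the reduced `n_estimators`, and the variable names are assumptions for illustration. XGBoost is left out here only to keep the sketch dependency-light.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data, not the Titanic features
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=42),
    GradientBoostingClassifier(n_estimators=50, random_state=42),
]

# One column per base model: out-of-fold P(class 1) for every row
meta_features = np.column_stack([
    cross_val_predict(model, X_demo, y_demo, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])

# The meta-model only ever sees predictions made on held-out folds
meta_model = LogisticRegression().fit(meta_features, y_demo)
print(meta_features.shape, meta_model.coef_.round(2))
```

Using out-of-fold predictions keeps the meta-model from learning on predictions that the base models made for rows they were trained on.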

In [69]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

# Define base models
base_models = [
    ('rf', RandomForestClassifier(random_state=42)),
    ('gb', GradientBoostingClassifier(random_state=42)),
    ('xgb', XGBClassifier(random_state=42, eval_metric='logloss'))
]

# Meta-model
meta_model = LogisticRegression()

# Stacking ensemble
stacking = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_model,
    cv=5,
    passthrough=False
)

# Evaluate with 5-fold CV using F1 score
stacking_scores = cross_val_score(
    stacking, X_all, y_all, cv=5, scoring=make_scorer(f1_score)
)

print(f"Stacking Ensemble: Mean F1 Score = {stacking_scores.mean():.4f}")

Stacking Ensemble: Mean F1 Score = 0.7693


## 📊 3. Performance Comparison

After evaluating all ensemble methods using 5-fold cross-validation with the F1 score,  
I summarize the results below:


In [70]:
# Define scorer
scorer = make_scorer(f1_score)

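# Define the stacking model used for the comparison and the export.
# Note: unlike section 2-4, this variant sets class_weight='balanced' on the
# Random Forest and the meta-model, so its score differs from the 0.7693 above.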
stacking = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(random_state=42, class_weight='balanced')),
        ('gb', GradientBoostingClassifier(random_state=42)),
        ('xgb', XGBClassifier(random_state=42, eval_metric='logloss'))
    ],
    final_estimator=LogisticRegression(class_weight='balanced')
)


# Store mean F1 scores directly
model_scores = {
    "Voting Ensemble": cross_val_score(
        VotingClassifier(estimators=[
            ('rf', RandomForestClassifier(random_state=42)),
            ('gb', GradientBoostingClassifier(random_state=42)),
            ('xgb', XGBClassifier(random_state=42, eval_metric='logloss'))
        ], voting='soft'),
        X_all, y_all, cv=5, scoring=scorer
    ).mean(),
    
    "Bagging (Random Forest)": cross_val_score(
        RandomForestClassifier(random_state=42),
        X_all, y_all, cv=5, scoring=scorer
    ).mean(),
    
    "Boosting (Gradient Boosting)": cross_val_score(
        GradientBoostingClassifier(random_state=42),
        X_all, y_all, cv=5, scoring=scorer
    ).mean(),
    
    "Stacking Ensemble": cross_val_score(
        stacking,
        X_all, y_all, cv=5, scoring=scorer
    ).mean()
}

# Display as DataFrame (with clean index)
score_df = pd.DataFrame(
    list(model_scores.items()), columns=["Model", "Mean F1 Score"]
).reset_index(drop=True)

display(score_df)

| Model                        | Mean F1 Score |
|------------------------------|---------------|
| Voting Ensemble              | 0.768189      |
| Bagging (Random Forest)      | 0.734582      |
| Boosting (Gradient Boosting) | 0.747230      |
| Stacking Ensemble            | 0.756582      |


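Voting achieved the best score in this comparison (0.768), with Stacking second (0.757). The Stacking model here uses `class_weight='balanced'`, unlike the version in section 2-4, which scored 0.7693. The gaps between methods are small, so these results suggest, rather than confirm, that combining diverse models gives stronger predictions than a single method alone.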
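
Because the mean scores are close, it can help to look at the spread across folds as well. The sketch below shows one way to report mean and standard deviation with a shuffled, seeded `StratifiedKFold`. It runs on synthetic data with an assumed model configuration, so its numbers say nothing about the Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data, not the Titanic features
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Shuffled, seeded folds make the split reproducible across models
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_demo, y_demo, cv=cv, scoring="f1",
)

print(f"F1 = {fold_scores.mean():.4f} ± {fold_scores.std():.4f}")
```

If two methods differ by less than the fold-to-fold standard deviation, the ranking between them may not be stable.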

## 🚀 Exporting the Final Model for Real-World Use

In [71]:
import joblib

# Refit the comparison-cell stacking model (class_weight='balanced' variant) on the full dataset
stacking.fit(X_all, y_all)

joblib.dump(stacking, '../titanic-project/deployment/model.pkl')

['../titanic-project/deployment/model.pkl']
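
For completeness, here is a minimal round-trip sketch of how a saved model might be loaded and used later. It uses a tiny stand-in model and a temporary file, not the actual `model.pkl`. With the real model, the input would need the same feature columns, in the same order, as `expected_columns`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model; in the notebook this would be the fitted `stacking` object
X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.9, 0.1]])
y_demo = np.array([0, 1, 0, 1])
model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")   # hypothetical temporary path
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)  # True
```

A pickled model should generally be loaded with the same library versions that were used to save it.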

## 🧠 Summary

In this notebook, I tested four ensemble methods to improve prediction performance:

- **Voting**: Combined three well-performing, diverse models  
- **Bagging**: Applied Random Forest to reduce variance  
- **Boosting**: Used Gradient Boosting and XGBoost for sequential error correction  
- **Stacking**: Blended base models using a meta-model

In the final comparison, **Voting** gave the best F1 score (0.768), followed by Stacking (0.757 with balanced class weights; the unweighted version in section 2-4 scored 0.769).  
This suggests that combining models with different learning styles can lead to more robust results.

It also suggests that ensemble learning is a useful next step after building strong individual models.