# 🔧 Hyperparameter Tuning & Cross-Validation

This notebook begins **Project 2** of the Titanic survival prediction project.  
We aim to improve our baseline model performance by tuning hyperparameters using `GridSearchCV`  
and evaluating results through cross-validation.

---

## 🎯 Purpose

To optimize the most promising model from Project 1 by:
- Defining a grid of hyperparameter values for tuning  
- Selecting the best combination using `GridSearchCV` with F1 scoring  
- Validating performance with cross-validation for generalizability

## 📦 Dataset

Same processed dataset used in earlier notebooks:  
[Titanic - Machine Learning from Disaster](https://www.kaggle.com/c/titanic)  
via public repository: [Data Science Dojo GitHub](https://github.com/datasciencedojo/datasets)

## 📦 1. Load the Dataset

We use the same fully processed training dataset created in earlier notebooks.  
Columns not used as model inputs (PassengerId, Name, Ticket, Cabin, Title) are excluded.

In [69]:
import pandas as pd

# Load processed full training data
df_train = pd.read_csv("feature_engineered_titanic.csv")

# Select features and target
X_all = df_train.drop(columns=['Survived', 'Name', 'Ticket', 'PassengerId', 'Cabin', 'Title'])
y_all = df_train['Survived']

## 🛠️ 2. Define Parameter Grid

To start the tuning process, we define a set of hyperparameters for `GradientBoostingClassifier`.  
The selected parameters control the number of trees, learning rate, tree depth, and subsampling strategy.

This grid will be used in `GridSearchCV` to evaluate all combinations and select the best-performing one.

In [70]:
from sklearn.ensemble import GradientBoostingClassifier

param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 4],
    'min_samples_split': [2, 5],
    'subsample': [0.8, 1.0],
    'max_features': ['sqrt']
}

- `max_depth`: limited to small values (3–4) to avoid overfitting on a relatively small dataset  
- `n_estimators`: tested up to 200 to balance training time and performance  
- `subsample` & `max_features`: introduced to inject randomness and reduce variance
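
As a quick sanity check on the size of this grid, the number of candidates can be counted before any model is trained. The sketch below uses scikit-learn's `ParameterGrid` and assumes 5-fold cross-validation; the variable names `n_candidates` and `n_folds` are ours, chosen for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined in the cell above
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 4],
    'min_samples_split': [2, 5],
    'subsample': [0.8, 1.0],
    'max_features': ['sqrt'],
}

# Number of distinct combinations: 2 * 2 * 2 * 2 * 2 * 1
n_candidates = len(ParameterGrid(param_grid))

# Assumption: 5-fold cross-validation, so each candidate is fit 5 times
n_folds = 5

print("Candidates:", n_candidates)            # 32
print("Total fits:", n_candidates * n_folds)  # 160
```

Because `max_features` has a single value, it does not enlarge the grid; it only fixes that setting for every candidate.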

## ⚙️ 3. Run GridSearchCV

We'll use `GridSearchCV` to test all parameter combinations defined above.  
The evaluation will be based on **F1 score**, as it balances precision and recall,  
which matters for a dataset with moderate class imbalance such as Titanic.

We use 5-fold cross-validation so that every candidate is scored on data it was not trained on,  
giving a more reliable estimate of how well the model generalizes.
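
To make the F1 metric concrete, here is a small worked example. The labels are invented purely for illustration (they are not Titanic data); the point is only to show how F1 is derived from precision and recall, and how it can differ from accuracy.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels, invented for illustration (1 = survived, 0 = did not survive)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4

# F1 is the harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
f1 = f1_score(y_true, y_pred)

print("Accuracy :", accuracy_score(y_true, y_pred))  # 0.7
print("Precision:", precision)
print("Recall   :", recall)
print("F1       :", f1)  # 4/7, about 0.571
```

Here accuracy looks reasonable at 0.7 even though half of the positive cases are missed, while F1 (about 0.571) reflects that weakness.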

In [71]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, f1_score

# Define the model
gb_model = GradientBoostingClassifier(random_state=42)

# Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=gb_model,
    param_grid=param_grid,
    scoring='f1',
    cv=5,
    n_jobs=-1,
    verbose=1
)

# Fit to the full training set
grid_search.fit(X_all, y_all)

Fitting 5 folds for each of 32 candidates, totalling 160 fits
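
When `cv` is an integer and the estimator is a classifier, scikit-learn uses stratified folds without shuffling. If shuffled folds with a fixed seed are preferred, the splitter can be built explicitly. The sketch below is one way to do that on a toy target (not the Titanic data); the names `cv` and `fold_rates` are ours.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy binary target with 40% positives (illustrative only)
y = np.array([1] * 40 + [0] * 60)
X = np.zeros((len(y), 1))  # feature values do not affect how the folds are drawn

# Explicit splitter: stratified, shuffled, reproducible
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Positive rate in each validation fold; stratification keeps it at 0.4
fold_rates = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]
print(fold_rates)
```

Such a splitter could then be passed as `cv=cv` to `GridSearchCV` in place of `cv=5`; note that doing so would change the folds and therefore the scores reported in this notebook.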


## 📊 4. Review Best Parameters & Score

After running `GridSearchCV`, we check which combination of hyperparameters performed best  
based on the average F1 score from 5-fold cross-validation.

This step helps us understand how tuning impacted the model and what settings worked well.

In [72]:
# Show best hyperparameters and score
print("Best Parameters:", grid_search.best_params_)
print("Best F1 Score:", grid_search.best_score_)

Best Parameters: {'learning_rate': 0.1, 'max_depth': 4, 'max_features': 'sqrt', 'min_samples_split': 2, 'n_estimators': 100, 'subsample': 0.8}
Best F1 Score: 0.7650879809547317


This configuration suggests that the deeper of the two tree depths tested (`max_depth=4`),  
the higher learning rate (0.1), and row subsampling (`subsample=0.8`) worked best  
for balancing precision and recall on the Titanic dataset.

## 📈 5. Compare Tuned Model vs. Original

Now that we have the best hyperparameters from tuning,<br>
we retrain the optimized model and compare its performance with the original baseline model.

To ensure reproducibility, we fixed `random_state=42` across model training and tuning steps.

In [73]:
from sklearn.base import clone
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

# Baseline model
baseline_model = GradientBoostingClassifier(random_state=42)

# Same split
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

# Train & evaluate
baseline_model.fit(X_train, y_train)
# Clone the best estimator so the model stored in grid_search is not refit in place
tuned_model = clone(grid_search.best_estimator_)
tuned_model.fit(X_train, y_train)

baseline_preds = baseline_model.predict(X_test)
tuned_preds = tuned_model.predict(X_test)

print("Baseline F1:", f1_score(y_test, baseline_preds))
print("Tuned F1   :", f1_score(y_test, tuned_preds))

Baseline F1: 0.7777777777777778
Tuned F1   : 0.7755102040816326


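While the tuned model scored slightly lower on the test split (0.7755 vs. 0.7778),<br>
the gap is small enough that a single 20% hold-out split may not reliably separate the two models.

Note also that the grid search was fit on `X_all`, so the rows in this test split<br>
had already taken part in hyperparameter selection.

To check this, we compared mean F1 scores from 5-fold CV: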

In [74]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, f1_score

# Baseline model CV
baseline = GradientBoostingClassifier(random_state=42)
cv_scores = cross_val_score(baseline, X_all, y_all, cv=5, scoring=make_scorer(f1_score))
print("Baseline CV F1 Mean:", cv_scores.mean())

# Tuned model CV (already evaluated during grid search)
print("Tuned CV F1 Mean   :", grid_search.best_score_)

Baseline CV F1 Mean: 0.7621449757013689
Tuned CV F1 Mean   : 0.7650879809547317


The tuned model achieved a slightly higher CV mean F1 score (0.7651 vs. 0.7621),<br>
a difference of about 0.003.

Because the CV mean averages over five validation folds, it is a steadier estimate than the single test split.<br>
Still, the tuned score is the best of 32 candidates and therefore slightly optimistic,<br>
so this gap is best read as a marginal improvement rather than clear evidence of better generalization.
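
A mean score alone does not show how much the folds vary. One way to check consistency is to score both models on the same folds and report the spread as well. The sketch below uses synthetic stand-in data from `make_classification` so that it runs on its own; in this notebook, `X` and `y` would be replaced by `X_all` and `y_all`. The tuned settings are copied from the best parameters reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption); use X_all, y_all in the notebook
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# One shared splitter so both models are scored on identical folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

baseline = GradientBoostingClassifier(random_state=42)
tuned = GradientBoostingClassifier(
    learning_rate=0.1, max_depth=4, max_features='sqrt',
    min_samples_split=2, n_estimators=100, subsample=0.8,
    random_state=42,
)

baseline_scores = cross_val_score(baseline, X, y, cv=cv, scoring='f1')
tuned_scores = cross_val_score(tuned, X, y, cv=cv, scoring='f1')

print(f"Baseline: {baseline_scores.mean():.4f} +/- {baseline_scores.std():.4f}")
print(f"Tuned   : {tuned_scores.mean():.4f} +/- {tuned_scores.std():.4f}")
print("Per-fold difference (tuned - baseline):", tuned_scores - baseline_scores)
```

If the per-fold differences change sign from fold to fold, or the standard deviations are much larger than the gap between the means, the two models are hard to distinguish on this data.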

## 🔍 Extended Exploration: Grid Search CV Results

To extend the earlier model comparison,<br>
we further examine the full cross-validation results from GridSearchCV.

- During evaluation, we observed an unexpected pattern:<br>
the tuned model’s test F1 score was slightly lower than the baseline,<br>
while its CV mean F1 score was higher.

- This raised a key question:<br>
Which hyperparameter combinations contributed most to performance gains?

By exploring the full CV results,<br>
we aim to identify what made the tuned model more generalizable,<br>
and understand which parameter choices led to stronger overall performance.

In [75]:
# Collect the full grid search results into a DataFrame
cv_results = pd.DataFrame(grid_search.cv_results_)

# Extract top 5 parameter combinations from tuning results
top5 = cv_results.sort_values(by="mean_test_score", ascending=False).head(5)

# Select key columns to display and renumber the rows from 0
top5_summary = top5[[
    "params",
    "mean_test_score",
    "rank_test_score"
]].reset_index(drop=True)

# Display the summary table
top5_summary

| | params | mean_test_score | rank_test_score |
|---|---|---|---|
| 0 | `{'learning_rate': 0.1, 'max_depth': 4, 'max_fe...` | 0.765088 | 1 |
| 1 | `{'learning_rate': 0.05, 'max_depth': 4, 'max_f...` | 0.761912 | 2 |
| 2 | `{'learning_rate': 0.05, 'max_depth': 4, 'max_f...` | 0.760255 | 3 |
| 3 | `{'learning_rate': 0.1, 'max_depth': 4, 'max_fe...` | 0.759532 | 4 |
| 4 | `{'learning_rate': 0.1, 'max_depth': 3, 'max_fe...` | 0.759089 | 5 |


### What Contributed to Better Generalization?

To understand why the tuned model achieved better generalization,<br>
we examined the top 5 parameter combinations from GridSearchCV.

Common patterns observed:

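- `max_depth = 4`: four of the top five configurations used the deeper setting, suggesting slightly deeper trees captured more complex patterns<br>
- `subsample = 0.8`: row subsampling may have helped reduce overfitting<br>
- `max_features = 'sqrt'`: shared by every candidate, since it was the only value in the grid, so the search itself cannot show whether it helped

Tree depth and subsampling are therefore the most likely factors behind the tuned model's slightly higher CV score.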

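Interestingly, most top configurations used 100 trees,<br>
but the second-best setup used 200 together with the lower learning rate (0.05).
This is consistent with the usual trade-off in gradient boosting, where a smaller learning rate needs more estimators to reach a similar fit,
and it suggests that larger `n_estimators` values at `learning_rate=0.05` are worth exploring.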
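
Looking only at the top five rows can hide how each hyperparameter behaves across the whole grid. One complementary view is the average CV score for each value of a parameter, taken over all candidates. The sketch below shows the idea on synthetic stand-in data with a deliberately small grid (`small_grid` and `marginals` are hypothetical names); in this notebook, the same `groupby` could be applied to a DataFrame built from `grid_search.cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a reduced grid (assumptions, to keep the example fast)
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
small_grid = {
    'max_depth': [3, 4],
    'subsample': [0.8, 1.0],
    'n_estimators': [20, 40],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    small_grid, scoring='f1', cv=3,
)
search.fit(X, y)

# cv_results_ exposes one 'param_<name>' column per tuned hyperparameter
results = pd.DataFrame(search.cv_results_)
param_cols = [f"param_{name}" for name in small_grid]

# Mean CV score for each value of each hyperparameter, averaged over the rest of the grid
marginals = {
    col: results.groupby(col)['mean_test_score'].mean()
    for col in param_cols
}

for col, scores in marginals.items():
    print(scores, end="\n\n")
```

A parameter whose values give nearly identical averages has little influence within the tested range, whereas a clear gap points to a setting worth refining in a follow-up search.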

## 🧠 Summary

In this notebook, we tuned the hyperparameters of the Gradient Boosting model  
to improve performance on the Titanic survival prediction task.

Here's what we covered:

- Defined a grid of hyperparameter combinations to explore
- Applied `GridSearchCV` with 5-fold cross-validation, using F1 score as the metric
- Identified the best combination:
  - `max_depth = 4`, `n_estimators = 100`, `learning_rate = 0.1`, `min_samples_split = 2`, `subsample = 0.8`, `max_features = 'sqrt'`
- Compared the tuned model to the baseline:
  - Test F1 was slightly lower, but CV mean F1 was higher
- Tabulated and interpreted GridSearch results to understand tuning impact
  - Found that the top-ranked configurations shared common characteristics (e.g. `max_depth = 4` in four of the top five)

✅ These insights help validate the tuning process,  
though the CV gain over the baseline is small (about 0.003 F1), so the improvement in generalization should be treated as tentative.