# **Heart Failure Clinical Records Dataset: Hyperparameter Optimization**

**Objective**  
Tune three classifiers (RandomForest, GradientBoosting, LogisticRegression) with Optuna, scoring each trial by its mean F1 over 5-fold StratifiedKFold cross-validation. The best model is then refit on the full training set and saved for further evaluation.

In [1]:
import joblib
import optuna
import warnings


import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


sns.set(style='whitegrid', font_scale=1.2)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)


In [2]:
X = pd.read_csv("../data/processed/X_train.csv")
y = pd.read_csv("../data/processed/y_train.csv").squeeze()


print("X shape is", X.shape)
print("Target distribution is \n", y.value_counts(normalize=True))


X shape is (239, 16)
Target distribution is 
 death_event
0    0.677824
1    0.322176
Name: proportion, dtype: float64


## **Define Evaluation Strategy**

In [3]:
f1 = make_scorer(f1_score)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
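
F1 is used here because roughly a third of the samples belong to the positive class, so accuracy alone can look acceptable for a model that never predicts a death event. The sketch below illustrates this with a hand-rolled F1 on a made-up fold; the helper names and the label lists are hypothetical and are not taken from the dataset.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives for class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def f1_manual(y_true, y_pred):
    """F1 for the positive class; returns 0.0 when there are no true positives."""
    tp, fp, fn = confusion_counts(y_true, y_pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy_manual(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical fold: 48 samples, 15 of them positive (about 31%).
y_true_demo = [1] * 15 + [0] * 33

# A classifier that always predicts the majority class.
always_negative = [0] * 48

# A classifier that finds 10 of the 15 positives and raises 5 false alarms.
some_positives = [1] * 10 + [0] * 5 + [1] * 5 + [0] * 28

print("always negative:", accuracy_manual(y_true_demo, always_negative),
      f1_manual(y_true_demo, always_negative))
print("some positives: ", accuracy_manual(y_true_demo, some_positives),
      f1_manual(y_true_demo, some_positives))
```

The majority-class predictor keeps a high accuracy but scores an F1 of zero, which is one plausible reading of the near-zero trial values that can appear when a model is regularized very strongly.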


## **Optuna Helper Functions**

In [4]:
def objective_rf(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 10, 200),
        "max_depth": trial.suggest_int("max_depth", 2, 10),
        "max_features": trial.suggest_categorical("max_features", ["sqrt", "log2"]),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 10),
        "random_state": 42,
        "n_jobs": -1,
    }

    model = RandomForestClassifier(**params)
    scores = cross_val_score(model, X, y, cv=cv, scoring=f1)
    return scores.mean()


In [5]:
def objective_gb(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "random_state": 42,
    }

    model = GradientBoostingClassifier(**params)
    scores = cross_val_score(model, X, y, cv=cv, scoring=f1)
    return scores.mean()


In [6]:
def objective_lr(trial):
    penalty = trial.suggest_categorical("penalty", ["l1", "l2"])
    solver = trial.suggest_categorical("solver", ["liblinear", "saga", "lbfgs"])
    valid_combinations = {
        "liblinear": ["l1", "l2"],
        "saga": ["l1", "l2"],
        "lbfgs": ["l2"] 
    }

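    # lbfgs only supports the l2 penalty; skip unsupported pairs instead of
    # letting LogisticRegression raise. Pruned trials still count toward n_trials.
    if penalty not in valid_combinations[solver]:
        raise optuna.exceptions.TrialPruned()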

    params = {
        # suggest_loguniform is deprecated; suggest_float(..., log=True) is the equivalent call
        "C": trial.suggest_float("C", 1e-4, 10, log=True),
        "penalty": penalty,
        "solver": solver,
        "random_state": 42,
        "max_iter": 1000
    }

    model = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(**params))
    ])

    scores = cross_val_score(model, X, y, cv=cv, scoring=f1)
    return scores.mean()


## **Run Optuna Trials**

In [7]:
# RandomForest Optimization
study_rf = optuna.create_study(direction="maximize")
study_rf.optimize(objective_rf, n_trials=30)
joblib.dump(study_rf, "../models/optuna_study_rf.pkl")


[I 2025-07-16 10:38:15,941] A new study created in memory with name: no-name-d326e4b2-933b-4fe8-bf2d-65903cae0a1e
[I 2025-07-16 10:38:17,026] Trial 0 finished with value: 0.6970478435995677 and parameters: {'n_estimators': 112, 'max_depth': 2, 'max_features': 'log2', 'min_samples_split': 9}. Best is trial 0 with value: 0.6970478435995677.
[I 2025-07-16 10:38:18,400] Trial 1 finished with value: 0.6578557095078834 and parameters: {'n_estimators': 180, 'max_depth': 2, 'max_features': 'log2', 'min_samples_split': 6}. Best is trial 0 with value: 0.6970478435995677.
[I 2025-07-16 10:38:19,751] Trial 2 finished with value: 0.7605926779313876 and parameters: {'n_estimators': 181, 'max_depth': 7, 'max_features': 'log2', 'min_samples_split': 2}. Best is trial 2 with value: 0.7605926779313876.
[I 2025-07-16 10:38:20,553] Trial 3 finished with value: 0.7601894521249359 and parameters: {'n_estimators': 87, 'max_depth': 9, 'max_features': 'sqrt', 'min_samples_split': 7}. Best is trial 2 with value:

['../models/optuna_study_rf.pkl']

In [8]:
optuna.visualization.plot_optimization_history(study_rf).show()

In [9]:
optuna.visualization.plot_param_importances(study_rf).show()

In [10]:
# GradientBoosting Optimization
study_gb = optuna.create_study(direction="maximize")
study_gb.optimize(objective_gb, n_trials=30)
joblib.dump(study_gb, "../models/optuna_study_gb.pkl")


[I 2025-07-16 10:38:47,287] A new study created in memory with name: no-name-66731211-0a3a-4bc5-82b5-737f923cf424
[I 2025-07-16 10:38:49,084] Trial 0 finished with value: 0.6982010582010582 and parameters: {'n_estimators': 220, 'max_depth': 5, 'learning_rate': 0.012188282873943574, 'subsample': 0.9132052627190831}. Best is trial 0 with value: 0.6982010582010582.
[I 2025-07-16 10:38:50,939] Trial 1 finished with value: 0.7029615595132837 and parameters: {'n_estimators': 178, 'max_depth': 6, 'learning_rate': 0.012224123567353442, 'subsample': 0.8388256124051876}. Best is trial 1 with value: 0.7029615595132837.
[I 2025-07-16 10:38:53,428] Trial 2 finished with value: 0.700936100936101 and parameters: {'n_estimators': 235, 'max_depth': 6, 'learning_rate': 0.013157804787961623, 'subsample': 0.8846284884934338}. Best is trial 1 with value: 0.7029615595132837.
[I 2025-07-16 10:38:53,678] Trial 3 finished with value: 0.7591630591630592 and parameters: {'n_estimators': 57, 'max_depth': 2, 'lear

['../models/optuna_study_gb.pkl']

In [11]:
optuna.visualization.plot_optimization_history(study_gb).show()

In [12]:
optuna.visualization.plot_param_importances(study_gb).show()

In [13]:
# Logistic Regression Optimization
study_lr = optuna.create_study(direction="maximize")
study_lr.optimize(objective_lr, n_trials=30)
joblib.dump(study_lr, "../models/optuna_study_lr.pkl")


[I 2025-07-16 10:39:13,144] A new study created in memory with name: no-name-d88c7383-748d-454d-b332-9038ac8e7f67
[I 2025-07-16 10:39:13,146] Trial 0 pruned. 
[I 2025-07-16 10:39:13,147] Trial 1 pruned. 
[I 2025-07-16 10:39:13,148] Trial 2 pruned. 
[I 2025-07-16 10:39:13,191] Trial 3 finished with value: 0.6999430199430199 and parameters: {'penalty': 'l2', 'solver': 'saga', 'C': 0.10805319126005959}. Best is trial 3 with value: 0.6999430199430199.
[I 2025-07-16 10:39:13,240] Trial 4 finished with value: 0.7271212121212121 and parameters: {'penalty': 'l2', 'solver': 'liblinear', 'C': 0.01783066028800066}. Best is trial 4 with value: 0.7271212121212121.
[I 2025-07-16 10:39:13,296] Trial 5 finished with value: 0.025 and parameters: {'penalty': 'l2', 'solver': 'lbfgs', 'C': 0.0022068808299220903}. Best is trial 4 with value: 0.7271212121212121.
[I 2025-07-16 10:39:13,330] Trial 6 finished with value: 0.0 and parameters: {'penalty': 'l1', 'solver': 'saga', 'C': 0.00012295048506210437}. Best

['../models/optuna_study_lr.pkl']

In [14]:
optuna.visualization.plot_optimization_history(study_lr).show()

In [15]:
optuna.visualization.plot_param_importances(study_lr).show()

## **Compare Best Models**

In [16]:
best_rf = RandomForestClassifier(**study_rf.best_params, random_state=42, n_jobs=-1)
best_gb = GradientBoostingClassifier(**study_gb.best_params, random_state=42)
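best_lr = Pipeline([
    ("scaler", StandardScaler()),
    # best_params holds only the sampled values, so max_iter must be passed
    # again to match the configuration used during tuning
    ("clf", LogisticRegression(**study_lr.best_params, random_state=42, max_iter=1000))
])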

models = {
    "Random Forest": best_rf,
    "Gradient Boosting": best_gb,
    "Logistic Regression": best_lr
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring=f1)
    print(f"{name} had a Mean F1 of {scores.mean():.4f} ± {scores.std():.4f}")


Random Forest had a Mean F1 of 0.7687 ± 0.0816
Gradient Boosting had a Mean F1 of 0.7592 ± 0.0721
Logistic Regression had a Mean F1 of 0.7421 ± 0.0626


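The **Best Model** is RandomForestClassifier with a **Best F1 Score** of ~0.7687 (mean over 5 folds). Its margin over the other two models is smaller than the fold-to-fold standard deviation, so the ranking should be read as tentative.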
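
When mean scores are this close, it can help to look at per-fold differences rather than only at the means, since all models are evaluated on the same splits. The following sketch shows the idea on synthetic data; `X_demo`, `y_demo`, `rf_demo` and `lr_demo` are hypothetical stand-ins, not the tuned models from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption: not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=240, n_features=12, weights=[0.68, 0.32], random_state=0
)

# A fixed random_state yields identical splits for every model.
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42)
lr_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

rf_scores = cross_val_score(rf_demo, X_demo, y_demo, cv=cv_demo, scoring="f1")
lr_scores = cross_val_score(lr_demo, X_demo, y_demo, cv=cv_demo, scoring="f1")

# Paired differences: positive values mean the forest won that fold.
fold_diff = rf_scores - lr_scores
print("per-fold difference:", np.round(fold_diff, 4))
print(f"mean difference: {fold_diff.mean():.4f} ± {fold_diff.std():.4f}")
print("folds won by the forest:", int((fold_diff > 0).sum()), "of", len(fold_diff))
```

If the sign of the difference flips between folds, the apparent winner may simply reflect how the data happened to be split.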

In [17]:
best_model = best_rf
best_model.fit(X, y)

joblib.dump(best_model, "../models/best_model.pkl")
print("Best model saved in best_model.pkl")


Best model saved in best_model.pkl
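
For the follow-up evaluation, the saved model can be reloaded with `joblib.load` and applied to data it has not seen. The sketch below checks a save and load round trip on synthetic data; the temporary path and the `demo_model` name are hypothetical, and the cross-validated F1 above should be treated as an optimistic estimate until it is confirmed on held-out data.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data (assumption: not the real dataset).
X_demo, y_demo = make_classification(n_samples=100, n_features=8, random_state=0)

demo_model = RandomForestClassifier(n_estimators=20, random_state=42)
demo_model.fit(X_demo, y_demo)

# Round trip through disk, mirroring joblib.dump(best_model, ...) above.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_model, model_path)
    loaded_model = joblib.load(model_path)

# The reloaded estimator should reproduce the original predictions exactly.
same_predictions = np.array_equal(
    demo_model.predict(X_demo), loaded_model.predict(X_demo)
)
print("predictions identical after reload:", same_predictions)
```

Pickled estimators are tied to the library versions that produced them, so loading the file with a different scikit-learn version may warn or fail.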
