# **Heart Failure Clinical Records Dataset: Hyperparameter Optimization**

**Objective**  
Tune three classifiers (RandomForestClassifier, GradientBoostingClassifier, LogisticRegression) with Optuna, scoring each trial by its mean F1 over 5-fold StratifiedKFold cross-validation. The model with the best cross-validated F1 is then refit on the full training set and saved for further evaluation.

In [1]:
import joblib
import optuna
import warnings


import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


sns.set(style='whitegrid', font_scale=1.2)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)


In [2]:
X = pd.read_csv("../data/processed/X_train.csv")
y = pd.read_csv("../data/processed/y_train.csv").squeeze()


print("X shape is", X.shape)
print("Target distribution is \n", y.value_counts(normalize=True))


X shape is (239, 16)
Target distribution is 
 death_event
0    0.677824
1    0.322176
Name: proportion, dtype: float64


## **Define Evaluation Strategy**

In [3]:
f1 = make_scorer(f1_score)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
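
Stratification matters here because only about 32% of the rows are positive. The following is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are stand-ins, not the real training set) that checks each validation fold keeps roughly the same positive rate:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real target: 239 rows, 77 positives (~32%).
rng = np.random.default_rng(0)
y_demo = np.zeros(239, dtype=int)
y_demo[:77] = 1
rng.shuffle(y_demo)
X_demo = rng.normal(size=(239, 3))  # feature values do not affect the split

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_rates = []
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(cv_demo.split(X_demo, y_demo)):
    rate = y_demo[val_idx].mean()  # share of positives in this validation fold
    fold_rates.append(rate)
    fold_sizes.append(len(val_idx))
    print(f"fold {fold}: n_val={len(val_idx)}, positive rate={rate:.3f}")
```

With a plain `KFold`, the positive rate could drift between folds, which makes per-fold F1 scores noisier on a dataset this small.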


## **Optuna Helper Functions**

In [4]:
def objective_rf(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 10, 200),
        "max_depth": trial.suggest_int("max_depth", 2, 10),
        "max_features": trial.suggest_categorical("max_features", ["sqrt", "log2"]),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 10),
        "random_state": 42,
        "n_jobs": -1,
    }

    model = RandomForestClassifier(**params)
    scores = cross_val_score(model, X, y, cv=cv, scoring=f1)
    return scores.mean()


In [5]:
def objective_gb(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "random_state": 42,
    }

    model = GradientBoostingClassifier(**params)
    scores = cross_val_score(model, X, y, cv=cv, scoring=f1)
    return scores.mean()


In [6]:
def objective_lr(trial):
    penalty = trial.suggest_categorical("penalty", ["l1", "l2"])
    solver = trial.suggest_categorical("solver", ["liblinear", "saga", "lbfgs"])
    # Penalties each solver supports; lbfgs cannot fit an L1 penalty.
    valid_combinations = {
        "liblinear": ["l1", "l2"],
        "saga": ["l1", "l2"],
        "lbfgs": ["l2"],
    }

    # Skip unsupported solver/penalty pairs instead of letting the fit fail.
    if penalty not in valid_combinations[solver]:
        raise optuna.exceptions.TrialPruned()

    params = {
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform.
        "C": trial.suggest_float("C", 1e-4, 10, log=True),
        "penalty": penalty,
        "solver": solver,
        "random_state": 42,
        "max_iter": 1000,
    }

    model = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(**params))
    ])

    scores = cross_val_score(model, X, y, cv=cv, scoring=f1)
    return scores.mean()
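
The objective above prunes any trial that draws an unsupported solver/penalty pair, which uses up part of the 30-trial budget. One possible alternative, sketched here under assumptions, is to sample a single categorical that lists only the valid pairs. The names `VALID_LR_COMBOS`, `suggest_lr_params`, and the `"solver_penalty"` parameter are hypothetical and not part of the original notebook:

```python
# Each entry is a solver/penalty pair taken from valid_combinations above.
VALID_LR_COMBOS = ["liblinear:l1", "liblinear:l2", "saga:l1", "saga:l2", "lbfgs:l2"]


def suggest_lr_params(trial):
    """Sample only valid solver/penalty pairs, so no trial needs pruning."""
    combo = trial.suggest_categorical("solver_penalty", VALID_LR_COMBOS)
    solver, penalty = combo.split(":")
    return {
        "C": trial.suggest_float("C", 1e-4, 10, log=True),
        "penalty": penalty,
        "solver": solver,
        "random_state": 42,
        "max_iter": 1000,
    }
```

Inside an objective, the returned dictionary could be passed straight to `LogisticRegression(**params)`. One trade-off: `study.best_params` would then hold `"solver_penalty"` rather than separate `"solver"` and `"penalty"` keys, so the final model would need the same string split before it is rebuilt.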


## **Run Optuna Trials**

In [7]:
# RandomForest Optimization
study_rf = optuna.create_study(direction="maximize")
study_rf.optimize(objective_rf, n_trials=30)
joblib.dump(study_rf, "../models/optuna_study_rf.pkl")


[I 2025-07-09 23:11:42,486] A new study created in memory with name: no-name-2e53950e-bb8f-4db0-b1e0-5c253524b8ca
[I 2025-07-09 23:11:43,328] Trial 0 finished with value: 0.7527162014230979 and parameters: {'n_estimators': 82, 'max_depth': 5, 'max_features': 'sqrt', 'min_samples_split': 2}. Best is trial 0 with value: 0.7527162014230979.
[I 2025-07-09 23:11:44,441] Trial 1 finished with value: 0.7696258629517806 and parameters: {'n_estimators': 129, 'max_depth': 9, 'max_features': 'sqrt', 'min_samples_split': 6}. Best is trial 1 with value: 0.7696258629517806.
[I 2025-07-09 23:11:44,881] Trial 2 finished with value: 0.7649350649350649 and parameters: {'n_estimators': 40, 'max_depth': 8, 'max_features': 'sqrt', 'min_samples_split': 10}. Best is trial 1 with value: 0.7696258629517806.
[I 2025-07-09 23:11:46,325] Trial 3 finished with value: 0.7482846902201741 and parameters: {'n_estimators': 154, 'max_depth': 3, 'max_features': 'log2', 'min_samples_split': 9}. Best is trial 1 with value:

['../models/optuna_study_rf.pkl']

In [8]:
optuna.visualization.plot_optimization_history(study_rf).show()

In [9]:
optuna.visualization.plot_param_importances(study_rf).show()

In [10]:
# GradientBoosting Optimization
study_gb = optuna.create_study(direction="maximize")
study_gb.optimize(objective_gb, n_trials=30)
joblib.dump(study_gb, "../models/optuna_study_gb.pkl")


[I 2025-07-09 23:12:14,410] A new study created in memory with name: no-name-2f4e953d-ae2c-41d1-80b3-53a1005cf9d6
[I 2025-07-09 23:12:17,147] Trial 0 finished with value: 0.6950738916256158 and parameters: {'n_estimators': 288, 'max_depth': 4, 'learning_rate': 0.036338777201032416, 'subsample': 0.6869454907701353}. Best is trial 0 with value: 0.6950738916256158.
[I 2025-07-09 23:12:19,289] Trial 1 finished with value: 0.7136904761904761 and parameters: {'n_estimators': 162, 'max_depth': 7, 'learning_rate': 0.16981038969313456, 'subsample': 0.802010996635006}. Best is trial 1 with value: 0.7136904761904761.
[I 2025-07-09 23:12:20,935] Trial 2 finished with value: 0.6649755923589595 and parameters: {'n_estimators': 105, 'max_depth': 7, 'learning_rate': 0.023219071729951996, 'subsample': 0.9475270880478569}. Best is trial 1 with value: 0.7136904761904761.
[I 2025-07-09 23:12:23,672] Trial 3 finished with value: 0.7190476190476189 and parameters: {'n_estimators': 236, 'max_depth': 7, 'lear

['../models/optuna_study_gb.pkl']

In [11]:
optuna.visualization.plot_optimization_history(study_gb).show()

In [12]:
optuna.visualization.plot_param_importances(study_gb).show()

In [13]:
# Logistic Regression Optimization
study_lr = optuna.create_study(direction="maximize")
study_lr.optimize(objective_lr, n_trials=30)
joblib.dump(study_lr, "../models/optuna_study_lr.pkl")


[I 2025-07-09 23:12:52,733] A new study created in memory with name: no-name-b00281f2-17d5-4819-8da6-79e1782c07d8
[I 2025-07-09 23:12:52,783] Trial 0 finished with value: 0.7297423266388783 and parameters: {'penalty': 'l1', 'solver': 'saga', 'C': 0.9014272327055006}. Best is trial 0 with value: 0.7297423266388783.
[I 2025-07-09 23:12:52,785] Trial 1 pruned. 
[I 2025-07-09 23:12:52,827] Trial 2 finished with value: 0.3401785794409297 and parameters: {'penalty': 'l2', 'solver': 'lbfgs', 'C': 0.011618012343917726}. Best is trial 0 with value: 0.7297423266388783.
[I 2025-07-09 23:12:52,857] Trial 3 finished with value: 0.0 and parameters: {'penalty': 'l1', 'solver': 'liblinear', 'C': 0.006132206935824387}. Best is trial 0 with value: 0.7297423266388783.
[I 2025-07-09 23:12:52,888] Trial 4 finished with value: 0.7421342098761453 and parameters: {'penalty': 'l2', 'solver': 'liblinear', 'C': 1.254578684860916}. Best is trial 4 with value: 0.7421342098761453.
[I 2025-07-09 23:12:52,923] Trial 

['../models/optuna_study_lr.pkl']

In [14]:
optuna.visualization.plot_optimization_history(study_lr).show()

In [15]:
optuna.visualization.plot_param_importances(study_lr).show()

## **Compare Best Models**

In [16]:
best_rf = RandomForestClassifier(**study_rf.best_params, random_state=42, n_jobs=-1)
best_gb = GradientBoostingClassifier(**study_gb.best_params, random_state=42)
best_lr = Pipeline([
    ("scaler", StandardScaler()),
    # max_iter matches the value used inside objective_lr during tuning.
    ("clf", LogisticRegression(**study_lr.best_params, random_state=42, max_iter=1000))
])

models = {
    "Random Forest": best_rf,
    "Gradient Boosting": best_gb,
    "Logistic Regression": best_lr
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring=f1)
    print(f"{name} had a Mean F1 of {scores.mean():.4f} ± {scores.std():.4f}")


Random Forest had a Mean F1 of 0.7762 ± 0.0943
Gradient Boosting had a Mean F1 of 0.7443 ± 0.0815
Logistic Regression had a Mean F1 of 0.7467 ± 0.0622
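
To put the gaps between models in context, the reported spreads can be converted to rough standard errors. This is only a sketch of the arithmetic: the numbers are copied from the output above, and because cross-validation folds share training data, the result is an approximation rather than a formal test.

```python
import math

# (mean F1, standard deviation) copied from the printed comparison.
# scores.std() on a NumPy array uses ddof=0, i.e. the population formula.
results = {
    "Random Forest": (0.7762, 0.0943),
    "Gradient Boosting": (0.7443, 0.0815),
    "Logistic Regression": (0.7467, 0.0622),
}
n_folds = 5

std_err = {}
for name, (mean, std) in results.items():
    # Sample std is std * sqrt(n / (n - 1)); dividing by sqrt(n) gives std / sqrt(n - 1).
    std_err[name] = std / math.sqrt(n_folds - 1)
    print(f"{name}: {mean:.4f} with approx. standard error {std_err[name]:.4f}")

gap = results["Random Forest"][0] - results["Logistic Regression"][0]
print(f"Random Forest minus Logistic Regression: {gap:.4f}")
```

The gap between Random Forest and Logistic Regression (about 0.03) is smaller than Random Forest's approximate standard error (about 0.047), so the ranking is best read as tentative on a dataset of 239 rows.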


The **best model** is RandomForestClassifier, with a mean cross-validated **F1 score** of about 0.776 (± 0.094) in the comparison above.

In [17]:
best_model = best_rf
best_model.fit(X, y)

joblib.dump(best_model, "../models/best_model.pkl")
print("Best model saved in best_model.pkl")


Best model saved in best_model.pkl
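
For later evaluation, the saved estimator can be reloaded with `joblib.load`. The sketch below demonstrates the round trip on synthetic data and a temporary file, both of which are assumptions for illustration; the real notebook would load `../models/best_model.pkl` and predict on a held-out test set.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the real notebook uses X_train.csv / y_train.csv.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "best_model.pkl"  # hypothetical temporary location
    joblib.dump(model, path)
    reloaded = joblib.load(path)

# The reloaded estimator should reproduce the original predictions exactly.
same = np.array_equal(model.predict(X_demo), reloaded.predict(X_demo))
print("predictions match:", same)

# Class probabilities for the first three rows, one column per class.
proba = reloaded.predict_proba(X_demo[:3])
print(proba.shape)
```

Pickled scikit-learn models are generally only safe to reload with the same library version that wrote them, so recording the scikit-learn version alongside the file is a reasonable precaution.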
