# **Heart Failure Predictor**
Luca-Andrei Codorean, 30233-1 CTI-RO @2025

This project implements a classifier that aims to predict cases of heart failure.
The dataset used can be found at: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction/data

Before running the solution, install the dependencies listed in ``requirements.txt`` with the following command: ```pip install -r requirements.txt```

## Data Preprocessing

The very first step is data preprocessing and visualization. The first function takes one of the three datasets produced by running the ```src/data_loader.py``` script.
The ```data_loader``` script splits the dataset into three parts: train, test, and val.
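
The ```data_loader``` script itself is not shown here. As a rough illustration only, a three-way split of this kind could be produced with scikit-learn's ```train_test_split```. The function name ```split_dataset```, the 70/15/15 ratios, the stratification and the seed are all assumptions, not the project's actual settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_dataset(df: pd.DataFrame, target: str = "HeartDisease",
                  val_size: float = 0.15, test_size: float = 0.15,
                  seed: int = 42):
    """Split df into train/val/test, stratified on the target column.

    The ratios and the seed are illustrative assumptions.
    """
    # First carve out the test set.
    train_val, test = train_test_split(
        df, test_size=test_size, stratify=df[target], random_state=seed
    )
    # Then split the remainder; val_size is rescaled so that it stays
    # relative to the full dataset rather than to the remainder.
    relative_val = val_size / (1.0 - test_size)
    train, val = train_test_split(
        train_val, test_size=relative_val, stratify=train_val[target],
        random_state=seed
    )
    return train, val, test
```

Stratifying on the target keeps the proportion of positive cases roughly equal across the three splits.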

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

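def preprocess_data(dataset_path: str):
    """Load a CSV split, label-encode its string columns and standardize the features."""
    dataset_df = pd.read_csv(dataset_path)

    scaler  = StandardScaler()
    le_dict = {}

    # Encode every string column as integers and keep the encoder for later decoding.
    for column in dataset_df.select_dtypes(include=["object"]).columns:
        le = LabelEncoder()
        dataset_df[column] = le.fit_transform(dataset_df[column])
        le_dict[column] = le

    # Separate the target from the features.
    y = dataset_df["HeartDisease"]
    X = dataset_df.drop(columns=["HeartDisease"])

    # Standardize the features (zero mean, unit variance) and keep the column names.
    X_scaled = scaler.fit_transform(X)
    X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)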

    return X_scaled_df, y, le_dict, scaler
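
Note that ```preprocess_data``` fits new encoders and a new scaler on whichever file it receives, so each split is transformed with its own statistics. A common alternative is to fit on the training split only and reuse the fitted objects for the other splits. The sketch below illustrates that idea; ```fit_preprocessor``` and ```apply_preprocessor``` are hypothetical helpers, not part of the project.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler


def fit_preprocessor(train_df: pd.DataFrame, target: str = "HeartDisease"):
    """Fit one LabelEncoder per string column and one scaler, on training data only."""
    features = train_df.drop(columns=[target])
    encoded = features.copy()
    le_dict = {}
    for column in features.select_dtypes(include=["object"]).columns:
        le_dict[column] = LabelEncoder().fit(features[column])
        encoded[column] = le_dict[column].transform(features[column])
    scaler = StandardScaler().fit(encoded)
    return le_dict, scaler


def apply_preprocessor(df: pd.DataFrame, le_dict: dict, scaler: StandardScaler,
                       target: str = "HeartDisease"):
    """Transform any split with the already fitted encoders and scaler."""
    y = df[target]
    X = df.drop(columns=[target]).copy()
    for column, le in le_dict.items():
        # LabelEncoder.transform raises ValueError on categories unseen during fit.
        X[column] = le.transform(X[column])
    X_scaled = pd.DataFrame(scaler.transform(X), columns=X.columns, index=X.index)
    return X_scaled, y
```

With this arrangement the validation and test features are expressed on the same scale as the training features.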

### Plotting the histograms and class distribution

An issue with plotting the histograms was identified. In the early stages of development, the columns containing strings instead of numbers could not be plotted as histograms. They were therefore plotted as class distribution diagrams, first as bar charts, then as pie charts.

Understanding the meaning of these columns was also a problem, so they were mapped accordingly; see the ```preprocess_data``` function.

In short, histograms are used for numerical columns, whereas class distribution diagrams are used for the other columns.

In [2]:
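import os

import pandas as pd
import matplotlib.pyplot as plt

def plot_visualization(X, le_dict, scaler, output_dir: str):
    os.makedirs(output_dir, exist_ok=True)

    # Undo the scaling so the plots show values in their original units.
    temp = scaler.inverse_transform(X)
    df = pd.DataFrame(temp, columns=X.columns)

    # Restore the original string labels of the encoded columns.
    for key in le_dict:
        label_encoder = le_dict[key]
        # Round before casting: inverse scaling can return e.g. 1.9999999 instead of 2.
        df[key] = label_encoder.inverse_transform(df[key].round().astype(int))

    for column in df.columns:
        plt.figure(figsize=(8, 6))
        if pd.api.types.is_numeric_dtype(df[column]):
            # Numerical column: histogram.
            plt.title(f"Histogram of {column}")
            plt.xlabel(column)
            df[column].plot(kind='hist', bins=30, color='skyblue', edgecolor='black')
            plt.ylabel("Frequency")
        else:
            # Categorical column: pie chart of the class distribution.
            plt.title(f"Class distribution of {column}")
            category_counts = df[column].value_counts()
            category_counts.plot(kind='pie', autopct='%1.1f%%', startangle=90, colors=['lightgreen', 'skyblue', 'lightcoral', "yellow"])
            plt.ylabel("")

        plt.tight_layout()
        # os.path.join avoids nesting double quotes inside the f-string,
        # which is a syntax error before Python 3.12.
        plt.savefig(os.path.join(output_dir, f"{column}.png"))
        plt.close()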


In [3]:
import os

dataset_formatted_file_path = "/home/luca/SI/Project/data/preprocessed/"

train_dataset_path = "/home/luca/SI/Project/data/raw/train.csv"
dataset_formatted_file_name = "train.csv"

validation_dataset_path = "/home/luca/SI/Project/data/raw/val.csv"
val_dataset_formatted_file_name = "val.csv"  # distinct name, so the train file name above is not overwritten

(X, y, le_dict, scaler) = preprocess_data(dataset_path=train_dataset_path)
(X_val, y_val, _, _)    = preprocess_data(dataset_path=validation_dataset_path)  

plot_visualization(X=X, le_dict=le_dict, scaler=scaler, output_dir="/home/luca/SI/Project/outputs/data_vizualization")
X.to_csv(os.path.join(dataset_formatted_file_path, dataset_formatted_file_name), index=False)

## Constructing the model

After the data has been pre-processed, the first step is to combine the output of the pre-processing phase into a ```HeartFailureDataset``` object, which makes it easier to structure a multi-layer neural network. The ```HeartFailureDataset``` class was created for this purpose; its implementation can be found in ```src/data_set.py```.
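
The class itself lives in the repository and is not reproduced here. As a minimal sketch of what such a wrapper might look like, assuming it simply stores the scaled features and the labels as float tensors, consider the hypothetical ```TabularDataset``` below. The class name and the ```n_features``` attribute are illustrative, not taken from the project.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset


class TabularDataset(Dataset):
    """Hypothetical stand-in for HeartFailureDataset: features and labels as tensors."""

    def __init__(self, X: pd.DataFrame, y: pd.Series):
        self.X = torch.tensor(np.asarray(X, dtype=np.float32))
        self.y = torch.tensor(np.asarray(y, dtype=np.float32))
        # Number of feature columns, useful for sizing the model's input layer.
        self.n_features = self.X.shape[1]

    def __len__(self) -> int:
        return len(self.X)

    def __getitem__(self, idx: int):
        # Returns one (features, label) pair; a DataLoader stacks these into batches.
        return self.X[idx], self.y[idx]
```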

Once the dataset object is set up, its attributes can be used to initialize the ```HeartFailureMLP``` object, which implements the model to be trained. Its implementation is available in ```src/model.py```.
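
The actual architecture is defined in the repository. For orientation only, a small multi-layer perceptron for binary classification could look like the hypothetical ```SimpleMLP``` below. The layer sizes are arbitrary, and the default of 11 input features is an assumption about the number of feature columns. The final sigmoid is there because the training cell uses ```nn.BCELoss```, which expects probabilities rather than raw logits.

```python
import torch
import torch.nn as nn


class SimpleMLP(nn.Module):
    """Hypothetical stand-in for HeartFailureMLP; layer sizes are illustrative."""

    def __init__(self, in_features: int = 11, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden // 2),
            nn.ReLU(),
            nn.Linear(hidden // 2, 1),
            nn.Sigmoid(),  # maps the single output to a probability in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Input shape (batch, in_features), output shape (batch, 1).
        return self.net(x)
```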

An important hyperparameter for the training process is the ```batch_size``` used by the dataloader. The model should be tested with multiple values of the ```BATCH_SIZE``` parameter in order to get the best results. The same goes for the ```LEARNING_RATE``` parameter.

The ```EPOCHS``` parameter denotes the number of times the algorithm goes through the entire training dataset.
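
To make the interaction between these parameters concrete, the sketch below enumerates a small grid of candidate values and computes how many optimizer updates one epoch performs for each batch size. The candidate values and the training-set size of 640 are illustrative assumptions, not tuned or measured numbers.

```python
import math
from itertools import product

# Candidate values are illustrative assumptions, not tuned results.
BATCH_SIZES = [8, 16, 32]
LEARNING_RATES = [1e-2, 1e-3, 1e-4]


def updates_per_epoch(n_samples: int, batch_size: int) -> int:
    """Optimizer steps in one pass over the data (the last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)


def hyperparameter_grid(batch_sizes, learning_rates):
    """Return every (batch_size, learning_rate) combination to try."""
    return list(product(batch_sizes, learning_rates))


if __name__ == "__main__":
    n_train = 640  # hypothetical training-set size
    for batch_size, lr in hyperparameter_grid(BATCH_SIZES, LEARNING_RATES):
        steps = updates_per_epoch(n_train, batch_size)
        print(f"batch_size={batch_size} lr={lr} -> {steps} updates per epoch")
```

A smaller batch size means more, noisier updates per epoch, which is one reason the learning rate and the batch size are usually tuned together.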

Thus, the first code fragment handles the initialization of the different hyperparameters and of the model.
A further analysis will be done to observe the model's reaction to different optimizers, such as Adam, SGD, and SGD with momentum.
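
One way to organize such a comparison is a small factory that builds the optimizer from a name, as sketched below. The helper ```make_optimizer``` is hypothetical, the momentum value of 0.9 is an assumed default, and the linear layer is only a placeholder model.

```python
import torch.nn as nn
import torch.optim as optim


def make_optimizer(name: str, params, lr: float = 1e-3):
    """Build one of the optimizers to compare; the 0.9 momentum is an assumed value."""
    if name == "adam":
        return optim.Adam(params, lr=lr)
    if name == "sgd":
        return optim.SGD(params, lr=lr)
    if name == "sgd_momentum":
        return optim.SGD(params, lr=lr, momentum=0.9)
    raise ValueError(f"Unknown optimizer: {name}")


# Each optimizer gets a freshly initialized model so that the runs are comparable.
for name in ("adam", "sgd", "sgd_momentum"):
    demo_model = nn.Linear(11, 1)  # placeholder model for illustration
    demo_optimizer = make_optimizer(name, demo_model.parameters())
    print(name, type(demo_optimizer).__name__)
```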


In [15]:
from torch.utils.data   import DataLoader
from src.data_set       import HeartFailureDataset
from src.model          import HeartFailureMLP

import torch
import torch.optim as optim
import torch.nn as nn
 
BATCH_SIZE      = 16
LEARNING_RATE   = 1e-3
EPOCHS          = 1000
DEVICE          = torch.device("cuda" if torch.cuda.is_available() else "cpu") 

train_dataset   = HeartFailureDataset(X, y)
val_dataset     = HeartFailureDataset(X_val, y_val)
train_loader    = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader      = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)  # no need to shuffle for evaluation

model = HeartFailureMLP().to(DEVICE)
criterion = nn.BCELoss()  
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

In [12]:
from sklearn.metrics            import accuracy_score, precision_score, recall_score, f1_score
from torch.utils.tensorboard    import SummaryWriter

writer = SummaryWriter(log_dir="runs/heart_failure_prediction")

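def compute_metrics(all_preds, all_labels, phase, epoch):
    """Compute the classification metrics and log them to TensorBoard under `phase`."""
    acc  = accuracy_score(all_labels, all_preds)
    # zero_division=0 avoids warnings when an epoch predicts no positive cases.
    prec = precision_score(all_labels, all_preds, zero_division=0)
    rec  = recall_score(all_labels, all_preds, zero_division=0)
    f1   = f1_score(all_labels, all_preds, zero_division=0)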

    writer.add_scalar(f"{phase}/Accuracy", acc, epoch)
    writer.add_scalar(f"{phase}/Precision", prec, epoch)
    writer.add_scalar(f"{phase}/Recall", rec, epoch)
    writer.add_scalar(f"{phase}/F1-Score", f1, epoch)

    return acc, prec, rec, f1
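
For readers unfamiliar with the four scores logged above, the sketch below recomputes them from raw counts for a small made-up example. The helper ```binary_metrics``` and the sample labels are purely illustrative; the project itself relies on the scikit-learn functions.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Made-up labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))  # (0.75, 0.75, 0.75, 0.75)
```

In a medical setting, recall is often the score to watch, since a false negative corresponds to a missed case of heart disease.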

In [None]:
import torch
import os

def train_model(model, train_loader, val_loader, criterion, optimizer, epochs):
    best_loss = float('inf')
    
    for epoch in range(epochs):

        # Training phase
        model.train()
        running_loss, total = 0.0, 0
        train_preds, train_labels = [], []

        for inputs, labels in train_loader:
            inputs, labels = inputs.to(DEVICE), labels.to(DEVICE)

            optimizer.zero_grad()
            # Flatten to shape (batch,); unlike squeeze(), this keeps one dimension for a batch of one.
            outputs = model(inputs).reshape(-1)
            labels = labels.float()

            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item() * inputs.size(0)
            total += labels.size(0)

            preds = (outputs > 0.5).float()
            train_preds.extend(preds.cpu().numpy())
            train_labels.extend(labels.cpu().numpy())

        epoch_loss = running_loss / total

        compute_metrics(train_preds, train_labels, "train", epoch)
        writer.add_scalar('Loss/train', epoch_loss, epoch)  
        print(f"Epoch [{epoch+1}/{epochs}] Training Loss: {epoch_loss:.4f}")

        # Validation phase
        model.eval()
        val_loss, val_total = 0.0, 0
        val_preds, val_labels = [], []

        with torch.no_grad(): 
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(DEVICE), labels.to(DEVICE)
                # Flatten to shape (batch,); unlike squeeze(), this keeps one dimension for a batch of one.
                outputs = model(inputs).reshape(-1)
                labels = labels.float()

                loss = criterion(outputs, labels)
                val_loss += loss.item() * inputs.size(0)
                val_total += labels.size(0)

                preds = (outputs > 0.5).float()
                val_preds.extend(preds.cpu().numpy())
                val_labels.extend(labels.cpu().numpy())

        val_epoch_loss = val_loss / val_total
        
        compute_metrics(val_preds, val_labels, "val", epoch)
        writer.add_scalar('Loss/val', val_epoch_loss, epoch) 
        print(f"Validation Loss: {val_epoch_loss:.4f}")

        if val_epoch_loss < best_loss:
            best_loss = val_epoch_loss
            os.makedirs("./models", exist_ok=True)
            torch.save(model.state_dict(), "./models/predat_model_best.pth")
            print("Model saved!")




train_model(model, train_loader, val_loader, criterion, optimizer, EPOCHS)
torch.save(model.state_dict(), "./models/codiax_model_final.pth")

Epoch [1/1000] Training Loss: 0.6359
Validation Loss: 0.5609
Model saved!
Epoch [2/1000] Training Loss: 0.4858
Validation Loss: 0.4256
Model saved!
Epoch [3/1000] Training Loss: 0.3647
Validation Loss: 0.3783
Model saved!
Epoch [4/1000] Training Loss: 0.3494
Validation Loss: 0.3738
Model saved!
Epoch [5/1000] Training Loss: 0.3277
Validation Loss: 0.3700
Model saved!
Epoch [6/1000] Training Loss: 0.3304
Validation Loss: 0.3654
Model saved!
Epoch [7/1000] Training Loss: 0.3100
Validation Loss: 0.3599
Model saved!
Epoch [8/1000] Training Loss: 0.3017
Validation Loss: 0.3603
Epoch [9/1000] Training Loss: 0.3005
Validation Loss: 0.3523
Model saved!
Epoch [10/1000] Training Loss: 0.2909
Validation Loss: 0.3552
Epoch [11/1000] Training Loss: 0.2984
Validation Loss: 0.3493
Model saved!
Epoch [12/1000] Training Loss: 0.2877
Validation Loss: 0.3466
Model saved!
Epoch [13/1000] Training Loss: 0.2976
Validation Loss: 0.3468
Epoch [14/1000] Training Loss: 0.2872
Validation Loss: 0.3407
Model saved