# Heart Attack Analysis

This work trains and compares 3 different Machine Learning models, namely:
- Neural Network (PyTorch framework)
- Logistic Regression (Scikit-learn framework)
- Random Forest (Scikit-learn framework)


## Dataset
Models were trained and evaluated on a Heart Attack Dataset 
([source](https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset)).
To reproduce, download the dataset and put the CSV file in the same directory as this notebook.

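Dataset attributes (some column names in the CSV differ slightly from the names below, e.g. `exng`, `caa`, `restecg`, `thalachh`, and `output` for the target):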

- Age : Age of the patient
- Sex : Sex of the patient
- exang: exercise induced angina (1 = yes; 0 = no)
- ca: number of major vessels (0-3)
- cp : chest pain type
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic
- trtbps : resting blood pressure (in mm Hg)
- chol : cholesterol in mg/dl fetched via BMI sensor
- fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
- rest_ecg : resting electrocardiographic results
    - Value 0: normal
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
- thalach : maximum heart rate achieved
- target : 0 = less chance of heart attack; 1 = more chance of heart attack

## Attributes distribution

![features_dist](imgs/features_dist.png)

## Results

All three models perform comparably: the differences between them are within the range caused by changing the
random state. For a dataset this small (303 records), training time is not an issue for any of the models, and the
more sophisticated methods (neural network and random forest) show no improvement over logistic regression. This
holds even though random forest models are generally very effective on tabular data.

![accuracy_score](imgs/accuracy_score.png)

*Accuracy score on test partition of the Heart Attack Dataset for all models.*


![f1_score](imgs/f1_score.png)

*F1 score on test partition of the Heart Attack Dataset for all models.*
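
For reference, the F1 score is the harmonic mean of precision and recall, which for binary labels reduces to 2TP / (2TP + FP + FN). The following is a minimal plain-Python sketch of that definition. The helper name `f1_binary` and the toy labels are illustrative assumptions; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def f1_binary(y_true, y_pred):
    """F1 for binary labels (positive class = 1): 2TP / (2TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # With no positives at all the ratio is undefined; this sketch returns 0.0.
    return 2 * tp / denom if denom else 0.0


# Toy labels, not taken from the dataset: 3 TP, 1 FP, 1 FN.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
print(f1_binary(y_true, y_pred))  # 0.75
```

Unlike accuracy, this measure ignores true negatives, so the two scores can rank models differently when the classes are imbalanced.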

### Install required packages.

In [1]:
%%capture
%pip install kaleido numpy nbformat pandas plotly torch scikit-learn

## Code

### Imports and loading dataset

In [2]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import torch
from plotly.subplots import make_subplots
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

TARGET_COLUMN = "output"
df = pd.read_csv("heart.csv")
df.describe()

| | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.366337 | 0.683168 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 0.326733 | 1.039604 | 1.39934 | 0.729373 | 2.313531 | 0.544554 |
| std | 9.082101 | 0.466011 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.52586 | 22.905161 | 0.469794 | 1.161075 | 0.616226 | 1.022606 | 0.612277 | 0.498835 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 47.5 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 55.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 274.5 | 0.0 | 1.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |


### Features distributions

In [3]:
fig = make_subplots(rows=(len(df.columns) + 1)// 2, cols=2)
fig.update_layout(autosize=False, width=1600, height=1600)

for i, (column_name, column_contents) in enumerate(df.items()):
    fig.add_trace(
        go.Histogram(
            x=column_contents,
            name=column_name,
            nbinsx=50,
        ),
        row=1 + i // 2,
        col=1 + i % 2,
        
    )
fig.write_image("imgs/features_dist.png")
fig.show()


Extracting the target column and one-hot encoding the categorical columns.

In [4]:
y = df[[TARGET_COLUMN]]
y = y.to_numpy().astype(np.float32)

x = df.drop(TARGET_COLUMN, axis=1)

categorical_cols = ["sex", "cp", "fbs", "restecg", "exng", "slp", "caa", "thall"]
numerical_cols = list(set(x.columns) - set(categorical_cols))

x = pd.get_dummies(x, columns=categorical_cols, drop_first=True)

categorical_cols, numerical_cols

(['sex', 'cp', 'fbs', 'restecg', 'exng', 'slp', 'caa', 'thall'],
 ['trtbps', 'age', 'oldpeak', 'thalachh', 'chol'])
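
To illustrate what `drop_first=True` does: a categorical column with k distinct values becomes k - 1 indicator columns, and the first level is represented implicitly by all indicators being zero. Below is a plain-Python sketch of that idea, not the pandas implementation; the helper `one_hot_drop_first` and the sample codes are hypothetical.

```python
def one_hot_drop_first(values, prefix):
    """Encode a categorical column as k - 1 indicator columns.

    The smallest level is dropped; it is implied when all indicators are 0.
    """
    levels = sorted(set(values))
    kept = levels[1:]  # drop the first level
    return {
        f"{prefix}_{level}": [int(v == level) for v in values]
        for level in kept
    }


cp = [0, 2, 1, 3, 0]  # hypothetical chest-pain codes
encoded = one_hot_drop_first(cp, "cp")
print(encoded)
# {'cp_1': [0, 0, 1, 0, 0], 'cp_2': [0, 1, 0, 0, 0], 'cp_3': [0, 0, 0, 1, 0]}
```

Assuming every integer code between the minimum and maximum in the summary table actually occurs, the eight categorical columns would contribute 1 + 3 + 1 + 2 + 1 + 2 + 4 + 3 = 17 indicator columns, which together with the 5 numerical columns gives 22 features.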

### Data split and normalize

The data is split into 3 partitions:
- train (75%)
- validation (12.5%)
- test (12.5%)

Numerical columns are min-max scaled; the scaler is fitted on the training partition only.

In [5]:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.75, random_state=0, shuffle=True, stratify=y)

# Normalize the data.
scaler = MinMaxScaler()
x_train[numerical_cols] = scaler.fit_transform(x_train[numerical_cols])
x_test[numerical_cols] = scaler.transform(x_test[numerical_cols])

x_train = x_train.to_numpy().astype(np.float32)
x_test = x_test.to_numpy().astype(np.float32)
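
Because the scaler is fitted on the training partition and only applied to the other data, values outside the training range map outside [0, 1]. A minimal plain-Python sketch of that behaviour follows; the class name `SimpleMinMax` and the numbers are made up for illustration, while the notebook uses scikit-learn's `MinMaxScaler`.

```python
class SimpleMinMax:
    """Single-feature min-max scaler: x -> (x - min) / (max - min)."""

    def fit(self, values):
        # Statistics come from the data passed to fit() only.
        self.lo = min(values)
        self.hi = max(values)
        return self

    def transform(self, values):
        span = self.hi - self.lo
        if span == 0:
            # Constant feature: this sketch maps every value to 0.0.
            return [0.0 for _ in values]
        return [(v - self.lo) / span for v in values]


train_chol = [200.0, 250.0, 300.0]  # hypothetical training values
test_chol = [225.0, 320.0]          # 320 lies above the training maximum

scaler_1d = SimpleMinMax().fit(train_chol)
print(scaler_1d.transform(train_chol))  # [0.0, 0.5, 1.0]
print(scaler_1d.transform(test_chol))   # [0.25, 1.2]
```

Fitting on the training data alone keeps information about the held-out records from leaking into preprocessing.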

Extract validation partition from test.

In [6]:
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, train_size=0.5, random_state=0, shuffle=True, stratify=y_test)

x_train.shape, y_train.shape, x_val.shape, y_val.shape, x_test.shape, y_test.shape

((227, 22), (227, 1), (38, 22), (38, 1), (38, 22), (38, 1))
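
The partition sizes above follow from applying the two splits in sequence. The arithmetic can be sketched as below, assuming the requested share is rounded down and the remainder goes to the other partition (an assumption that is consistent with the printed shapes); `partition_sizes` is a hypothetical helper, not part of the notebook.

```python
import math


def partition_sizes(n_records, train_frac, test_frac_of_rest):
    """Sizes of train/validation/test after two consecutive splits."""
    n_train = math.floor(n_records * train_frac)
    n_rest = n_records - n_train
    n_test = math.floor(n_rest * test_frac_of_rest)
    n_val = n_rest - n_test
    return n_train, n_val, n_test


sizes = partition_sizes(303, 0.75, 0.5)
print(sizes)  # (227, 38, 38)
```

With 303 records the realised shares are therefore about 74.9%, 12.5% and 12.5%.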

## Neural network model

In [7]:
BATCH_SIZE=16
LEARNING_RATE=1e-5
EPOCHS = 100
ACC_THRESHOLD = 0.5
HIDDEN_SIZE = 128

train_dataset = TensorDataset(torch.tensor(x_train), torch.tensor(y_train))
val_dataset = TensorDataset(torch.tensor(x_val), torch.tensor(y_val))
test_dataset = TensorDataset(torch.tensor(x_test), torch.tensor(y_test))

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE)

In [8]:
net = nn.Sequential(
    nn.Linear(x_train.shape[1], HIDDEN_SIZE),
    nn.ReLU(),
    nn.BatchNorm1d(HIDDEN_SIZE),
    nn.Linear(HIDDEN_SIZE, HIDDEN_SIZE),
    nn.ReLU(),
    nn.BatchNorm1d(HIDDEN_SIZE),
    nn.Linear(HIDDEN_SIZE, HIDDEN_SIZE),
    nn.ReLU(),
    nn.BatchNorm1d(HIDDEN_SIZE),
    nn.Linear(HIDDEN_SIZE, 1),
)

criterion = nn.functional.binary_cross_entropy_with_logits
optimizer = torch.optim.Adam(net.parameters(), lr=LEARNING_RATE)
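
To put the size of this network in perspective against 227 training records, its learnable parameters can be counted by hand. The sketch below assumes 22 input features (the width of `x_train` printed earlier) and counts weights plus biases for each linear layer and a scale plus shift per feature for each batch-norm layer; the helper names are hypothetical.

```python
def linear_params(n_in, n_out):
    return n_in * n_out + n_out  # weights + biases


def batchnorm_params(n_features):
    return 2 * n_features  # learnable scale and shift


def mlp_param_count(n_inputs, hidden, n_hidden_layers=3):
    """Parameters of n_hidden_layers x (Linear, ReLU, BatchNorm) + Linear(hidden, 1)."""
    total = 0
    width_in = n_inputs
    for _ in range(n_hidden_layers):
        total += linear_params(width_in, hidden) + batchnorm_params(hidden)
        width_in = hidden
    return total + linear_params(width_in, 1)


n_params = mlp_param_count(22, 128)
print(n_params)  # 36865
```

That is roughly 160 parameters per training record, which is one reason a large network has little room to outperform a linear model here.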

In [9]:
def bin_acc_with_logits(pred_logit: torch.Tensor, tgt: torch.Tensor) -> float:
    pred_bin = torch.sigmoid(pred_logit) > ACC_THRESHOLD

    return torch.sum(pred_bin == tgt).item() / len(tgt)

def train():
    for epoch in range(1, EPOCHS + 1):
        train_loss = 0
        train_acc = 0
        val_loss = 0
        val_acc = 0

        net.train()

        for x_batch, y_batch in train_loader:
            optimizer.zero_grad()

            y_pred = net(x_batch)
            loss = criterion(y_pred, y_batch)

            train_acc += bin_acc_with_logits(y_pred, y_batch)

            loss.backward()
            optimizer.step()

            train_loss += loss.item()
        
        net.eval()

        # Gradients are not needed for validation.
        with torch.no_grad():
            for x_batch, y_batch in val_loader:
                y_pred = net(x_batch)
                loss = criterion(y_pred, y_batch)

                val_acc += bin_acc_with_logits(y_pred, y_batch)
                val_loss += loss.item()
    
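        # Note: losses are per-batch means summed over batches and then divided by the
        # number of records, so the printed values are not per-sample averages.
        print(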
            f"Epoch {epoch:>03d}: train_loss: {train_loss/len(train_loader.dataset):.4f} --- val_loss:"
            f" {val_loss/len(val_loader.dataset):.4f} --- train_acc: {100*train_acc/len(train_loader):.2f}% --- val_acc:"
            f" {100*val_acc/len(val_loader):.2f}%"
        )

train()

Epoch 001: train_loss: 0.0453 --- val_loss: 0.0543 --- train_acc: 51.67% --- val_acc: 55.56%
Epoch 002: train_loss: 0.0439 --- val_loss: 0.0524 --- train_acc: 57.92% --- val_acc: 61.11%
Epoch 003: train_loss: 0.0426 --- val_loss: 0.0500 --- train_acc: 60.42% --- val_acc: 67.36%
Epoch 004: train_loss: 0.0415 --- val_loss: 0.0489 --- train_acc: 63.33% --- val_acc: 73.61%
Epoch 005: train_loss: 0.0404 --- val_loss: 0.0484 --- train_acc: 66.25% --- val_acc: 75.69%
Epoch 006: train_loss: 0.0394 --- val_loss: 0.0485 --- train_acc: 68.75% --- val_acc: 75.69%
Epoch 007: train_loss: 0.0384 --- val_loss: 0.0484 --- train_acc: 70.42% --- val_acc: 75.69%
Epoch 008: train_loss: 0.0375 --- val_loss: 0.0480 --- train_acc: 71.67% --- val_acc: 75.69%
Epoch 009: train_loss: 0.0366 --- val_loss: 0.0477 --- train_acc: 74.17% --- val_acc: 73.61%
Epoch 010: train_loss: 0.0358 --- val_loss: 0.0473 --- train_acc: 74.58% --- val_acc: 73.61%
Epoch 011: train_loss: 0.0350 --- val_loss: 0.0470 --- train_acc: 75.0
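
As a reading aid for the loss values, here is a plain-Python sketch of binary cross-entropy on a single logit, written in the usual numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)). It is an illustration of the formula rather than PyTorch's implementation, and `bce_with_logits` is a hypothetical name.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy for one logit/target pair."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


print(bce_with_logits(0.0, 1.0))  # ln(2): loss of an uninformative prediction
print(bce_with_logits(4.0, 1.0))  # confident and correct: small loss
print(bce_with_logits(4.0, 0.0))  # confident and wrong: large loss
```

An uninformative model scores about ln(2) ≈ 0.693 per batch. Since the training loop sums the per-batch means and divides by the number of records, such a model would print roughly 0.693 · 15 / 227 ≈ 0.046 for the training partition (15 batches of up to 16 records), which is close to the first-epoch value.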

In [10]:
# Full model prediction on test set.
net.eval()  # make sure batch norm uses its running statistics
with torch.no_grad():
    net_pred = net(torch.tensor(x_test))
net_pred = (torch.sigmoid(net_pred) > ACC_THRESHOLD).numpy()
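
Because the sigmoid is monotonic and sigmoid(0) = 0.5, thresholding the probability at `ACC_THRESHOLD = 0.5` is equivalent to checking the sign of the logit. A small plain-Python check of that equivalence, using made-up logits:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


logits = [-2.0, -0.1, 0.0, 0.3, 5.0]  # illustrative values only
by_probability = [sigmoid(z) > 0.5 for z in logits]
by_logit = [z > 0.0 for z in logits]

print(by_probability)  # [False, False, False, True, True]
print(by_logit)        # [False, False, False, True, True]
```

For any other threshold t in (0, 1), the equivalent cut on the logit is log(t / (1 - t)).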

## Logistic Regression model

In [11]:
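# Note: RidgeClassifier fits a linear model with an L2-penalized least-squares objective,
# not a logistic loss; sklearn.linear_model.LogisticRegression is the logistic-regression estimator.
clf = RidgeClassifier(random_state=0).fit(x_train, y_train.squeeze())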

pred_train = clf.predict(x_train)
pred_val = clf.predict(x_val)
pred_test = clf.predict(x_test)
lr_pred = pred_test

acc_train = accuracy_score(y_train, pred_train)
acc_val = accuracy_score(y_val, pred_val)
acc_test = accuracy_score(y_test, pred_test)

acc_train, acc_val, acc_test

(0.8678414096916299, 0.7894736842105263, 0.9210526315789473)

## Random Forest model

In [12]:
forest_clf = RandomForestClassifier(random_state=0).fit(x_train, y_train.squeeze())

pred_train = forest_clf.predict(x_train)
pred_val = forest_clf.predict(x_val)
pred_test = forest_clf.predict(x_test)
rf_pred = pred_test

acc_train = accuracy_score(y_train, pred_train)
acc_val = accuracy_score(y_val, pred_val)
acc_test = accuracy_score(y_test, pred_test)

acc_train, acc_val, acc_test

(1.0, 0.7368421052631579, 0.868421052631579)

## Results

In [13]:
preds = [net_pred, lr_pred, rf_pred]

scores = {
    "Model": ["Neural Network", "Logistic Regression", "Random Forest"],
    "Accuracy Score": [accuracy_score(y_test, y_pred) for y_pred in preds],
    "F1 Score": [f1_score(y_test, y_pred) for y_pred in preds],
}

scores

{'Model': ['Neural Network', 'Logistic Regression', 'Random Forest'],
 'Accuracy Score': [0.9210526315789473, 0.9210526315789473, 0.868421052631579],
 'F1 Score': [0.9302325581395349, 0.9268292682926829, 0.8717948717948718]}
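
With only 38 test records, accuracy moves in steps of about 2.6 percentage points, so small gaps between models correspond to very few records. The sketch below converts the reported accuracies back into counts of correct predictions; `N_TEST` and `correct_counts` are illustrative names.

```python
N_TEST = 38  # size of the test partition

step = 1 / N_TEST
print(f"{step:.4f}")  # 0.0263: accuracy change caused by a single record

reported = {
    "Neural Network": 0.9210526315789473,
    "Logistic Regression": 0.9210526315789473,
    "Random Forest": 0.868421052631579,
}
correct_counts = {name: round(acc * N_TEST) for name, acc in reported.items()}
print(correct_counts)
# {'Neural Network': 35, 'Logistic Regression': 35, 'Random Forest': 33}
```

The random forest therefore differs from the other two models by two test records.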

In [14]:
fig = px.bar(scores, x="Model", y="Accuracy Score", range_y=[0.85, 1])
fig.write_image("imgs/accuracy_score.png")
fig.show()

In [15]:
fig = px.bar(scores, x="Model", y="F1 Score", range_y=[0.85, 1])
fig.write_image("imgs/f1_score.png")
fig.show()