<a href="https://www.kaggle.com/code/cmosbattery/heart-attack-prediction-with-svm?scriptVersionId=217656143" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Heart Attack Prediction with SVM: From EDA to Hyperparameter Tuning

This notebook demonstrates the process of predicting heart attack risk using a Support Vector Machine (SVM) model. The workflow includes Exploratory Data Analysis (EDA), outlier treatment, and visualizations such as scatter plots and heatmaps. Mutual information is used to rank the features by relevance, followed by model training with standard scaling in a pipeline. Hyperparameter tuning improves the model's accuracy on the testing set from 78% to 91%.
The dataset was collected at [Zheen hospital in Erbil, Iraq](https://data.mendeley.com/datasets/wmhctcrt5v/1) and was accessed via the [Heart Attack Dataset on Kaggle](https://www.kaggle.com/datasets/sukhmandeepsinghbrar/heart-attack-dataset). The features used to predict heart attack risk are Age, Gender, Heart Rate, Systolic Blood Pressure, Diastolic Blood Pressure, Blood Sugar, CK-MB, and Troponin.

# 1. Setup

In this section, we import the necessary libraries to get started. Additional libraries, specifically for training the SVM model, will be imported as we progress through the process.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

# 2. Preprocessing and EDA

In this section, we explore the dataset by reviewing summary statistics for each feature, checking for null values and duplicates, encoding the categorical target, and identifying potential outliers.

In [2]:
dataset = pd.read_csv('/kaggle/input/heart-attack-dataset/Medicaldataset.csv')
dataset

,Age,Gender,Heart rate,Systolic blood pressure,Diastolic blood pressure,Blood sugar,CK-MB,Troponin,Result
0,63,1,66,160,83,160.0,1.80,0.012,negative
1,20,1,94,98,46,296.0,6.75,1.060,positive
2,56,1,64,160,77,270.0,1.99,0.003,negative
3,66,1,70,120,55,270.0,13.87,0.122,positive
4,54,1,64,112,65,300.0,1.08,0.003,negative
...,...,...,...,...,...,...,...,...,...
1314,44,1,94,122,67,204.0,1.63,0.006,negative
1315,66,1,84,125,55,149.0,1.33,0.172,positive
1316,45,1,85,168,104,96.0,1.24,4.250,positive
1317,54,1,58,117,68,443.0,5.80,0.359,positive


In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1319 entries, 0 to 1318
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       1319 non-null   int64  
 1   Gender                    1319 non-null   int64  
 2   Heart rate                1319 non-null   int64  
 3   Systolic blood pressure   1319 non-null   int64  
 4   Diastolic blood pressure  1319 non-null   int64  
 5   Blood sugar               1319 non-null   float64
 6   CK-MB                     1319 non-null   float64
 7   Troponin                  1319 non-null   float64
 8   Result                    1319 non-null   object 
dtypes: float64(3), int64(5), object(1)
memory usage: 92.9+ KB


In [4]:
dataset.describe()

,Age,Gender,Heart rate,Systolic blood pressure,Diastolic blood pressure,Blood sugar,CK-MB,Troponin
count,1319.0,1319.0,1319.0,1319.0,1319.0,1319.0,1319.0,1319.0
mean,56.193328,0.659591,78.336619,127.170584,72.269143,146.634344,15.274306,0.360942
std,13.638173,0.474027,51.63027,26.12272,14.033924,74.923045,46.327083,1.154568
min,14.0,0.0,20.0,42.0,38.0,35.0,0.321,0.001
25%,47.0,0.0,64.0,110.0,62.0,98.0,1.655,0.006
50%,58.0,1.0,74.0,124.0,72.0,116.0,2.85,0.014
75%,65.0,1.0,85.0,143.0,81.0,169.5,5.805,0.0855
max,103.0,1.0,1111.0,223.0,154.0,541.0,300.0,10.3


In [5]:
dataset.duplicated().unique()

array([False])
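
The `array([False])` output means no row is an exact duplicate of an earlier one. As a minimal sketch, the same null and duplicate checks can also be expressed as counts, which are often easier to read at a glance. The `toy` frame below is made up for illustration and is not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value and one duplicated row.
toy = pd.DataFrame({
    "Age": [63, 20, 56, 20],
    "Troponin": [0.012, 1.060, np.nan, 1.060],
})

null_counts = toy.isna().sum()            # missing values per column
duplicate_rows = toy.duplicated().sum()   # rows identical to an earlier row

print(null_counts)
print("Duplicate rows:", duplicate_rows)
```

On the real data, `dataset.isna().sum()` and `dataset.duplicated().sum()` would be the equivalent calls.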

In [6]:
for column in dataset:
    print(f"Unique values of {column} column: \n{dataset[column].unique()} \n")

Unique values of Age column: 
[ 63  20  56  66  54  52  38  61  49  65  45  64  47  86  37  60  48  30
  50  72  42  35  68  34  44  55  58  40  46  57  28  29  80  90  62  53
  75  19  77  71  43  67  51  59  36  70  78  69  73  41  82  32  25  26
  76  33  39  91  21  31  74  22  79  81  27  83  24  85  88 100  23  14
  87 103  84] 

Unique values of Gender column: 
[1 0] 

Unique values of Heart rate column: 
[  66   94   64   70   61   40   60   76   81   73   72   92  135   63
   65  125   62   58   93   96   95   97   91   87   77   80   82   83
   78   90   59   57   98 1111  102  103  105   74   85   75   71   68
   67   56   89   88   86   79  100   69   84  110  120  122  119  116
  114   55   53   54  117  112  108  134  111  101  113   51   52   99
  132   50  107  104   49   46   20   36   45] 

Unique values of Systolic blood pressure column: 
[160  98 120 112 179 214 154 166 150 199 122 118 114 100 107 109 151 110
 104 106 152 134 135 131 137 121 145 136 156 155 105  91 

## Treating an outlier in 'Heart rate' column
According to the data, one record has a heart rate of 1111, which is not physiologically possible and is most likely a data-entry error. We replace it with the median value of the heart rate.

In [7]:
# Compute the median heart rate
median_heart_rate = dataset["Heart rate"].median()

# Use .loc[] to explicitly modify the column in the original DataFrame
dataset.loc[dataset["Heart rate"] == 1111, "Heart rate"] = median_heart_rate

dataset['Heart rate'].unique()

array([ 66,  94,  64,  70,  61,  40,  60,  76,  81,  73,  72,  92, 135,
        63,  65, 125,  62,  58,  93,  96,  95,  97,  91,  87,  77,  80,
        82,  83,  78,  90,  59,  57,  98,  74, 102, 103, 105,  85,  75,
        71,  68,  67,  56,  89,  88,  86,  79, 100,  69,  84, 110, 120,
       122, 119, 116, 114,  55,  53,  54, 117, 112, 108, 134, 111, 101,
       113,  51,  52,  99, 132,  50, 107, 104,  49,  46,  20,  36,  45])
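
The cell above targets the single value 1111 directly. A slightly more general sketch, under the assumption that any heart rate outside a plausibility range should be treated the same way, is shown below. The bounds `LOW` and `HIGH` are hypothetical choices for illustration, not values from the dataset's documentation, and the median is computed from the valid values only.

```python
import pandas as pd

# Hypothetical sample of heart rates containing one impossible value.
heart_rate = pd.Series([66, 94, 64, 1111, 70, 61], name="Heart rate")

LOW, HIGH = 20, 250  # assumed plausibility bounds (illustrative only)

valid = heart_rate.between(LOW, HIGH)          # True where the value is plausible
median_valid = heart_rate[valid].median()      # median of plausible values only
cleaned = heart_rate.where(valid, median_valid)  # keep valid values, replace the rest

print(cleaned.tolist())
```

Because the median is robust to a single extreme value, this gives the same replacement as the notebook's approach in most cases; the difference only matters when several implausible values are present.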

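## Encoding the Target Variable
The categorical target must be encoded as integers before training. `factorize()` assigns codes in order of first appearance, so here `negative` becomes 0 and `positive` becomes 1.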

In [8]:
svm_dataset = dataset.copy()
svm_dataset['Result'], _ = svm_dataset['Result'].factorize()

svm_dataset

,Age,Gender,Heart rate,Systolic blood pressure,Diastolic blood pressure,Blood sugar,CK-MB,Troponin,Result
0,63,1,66,160,83,160.0,1.80,0.012,0
1,20,1,94,98,46,296.0,6.75,1.060,1
2,56,1,64,160,77,270.0,1.99,0.003,0
3,66,1,70,120,55,270.0,13.87,0.122,1
4,54,1,64,112,65,300.0,1.08,0.003,0
...,...,...,...,...,...,...,...,...,...
1314,44,1,94,122,67,204.0,1.63,0.006,0
1315,66,1,84,125,55,149.0,1.33,0.172,1
1316,45,1,85,168,104,96.0,1.24,4.250,1
1317,54,1,58,117,68,443.0,5.80,0.359,1
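
Since `factorize()` numbers the labels by order of first appearance, the resulting codes depend on which label happens to come first in the data. The sketch below, using made-up label sequences, shows that behaviour next to an explicit mapping that does not depend on row order.

```python
import pandas as pd

# Made-up label sequence that starts with "negative", like the dataset does.
result = pd.Series(["negative", "positive", "negative", "positive"])
codes, uniques = result.factorize()

# If "positive" came first, the codes would be swapped.
flipped_codes, flipped_uniques = pd.Series(["positive", "negative"]).factorize()

# An explicit mapping fixes the meaning of 0 and 1 regardless of row order.
explicit = result.map({"negative": 0, "positive": 1})

print(codes.tolist(), list(uniques))
print(flipped_codes.tolist(), list(flipped_uniques))
print(explicit.tolist())
```

An explicit `map` is a reasonable alternative when the encoded model will be reused on data whose row order may differ.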


# 3. Data Visualization

In this section, we will perform a series of data visualizations showcasing the features and their relationships with one another. We will also conduct Mutual Information feature selection to identify which features have the most influence on determining the outcome of heart attack risk.

## A. Age Group vs Result

In [9]:
def categorize_age(age):
    if age <= 18:
        return "Minor"
    elif age <= 25:
        return "Young Adult"
    elif age <= 40:
        return "Adult"
    elif age <= 60:
        return "Middle-aged"
    else:
        return "Senior"

dataset["Age Group"] = dataset['Age'].apply(categorize_age)

age_group = dataset.groupby(["Age Group", "Result"]).size().reset_index(name="Count")

custom_order = ["Senior", "Middle-aged", "Adult", "Young Adult", "Minor"]
age_group["Age Group"] = pd.Categorical(age_group["Age Group"], categories=custom_order, ordered=True)

age_group = age_group.sort_values("Age Group")

age_group

,Age Group,Result,Count
5,Senior,negative,148
6,Senior,positive,366
2,Middle-aged,negative,256
3,Middle-aged,positive,373
0,Adult,negative,91
1,Adult,positive,63
7,Young Adult,negative,13
8,Young Adult,positive,8
4,Minor,negative,1
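
The `categorize_age` function bins ages with a chain of `<=` comparisons. As an alternative sketch, `pd.cut` can express the same bin edges declaratively; the `ages` values below are chosen only to exercise the boundaries.

```python
import numpy as np
import pandas as pd

# Ages picked to sit on and just past each boundary.
ages = pd.Series([14, 18, 19, 25, 26, 40, 41, 60, 61, 103])

# Right-closed bins reproduce the `<=` comparisons in categorize_age.
bins = [-np.inf, 18, 25, 40, 60, np.inf]
labels = ["Minor", "Young Adult", "Adult", "Middle-aged", "Senior"]

age_groups = pd.cut(ages, bins=bins, labels=labels)
print(age_groups.tolist())
```

`pd.cut` returns an ordered categorical (youngest to oldest here), so a separate `pd.Categorical` step is only needed if a different display order, such as the notebook's `custom_order`, is wanted.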


In [10]:
age_group_fig = px.bar(
    age_group,
    x="Age Group",
    y="Count",
    color="Result",
    title="Distribution of Positive and Negative Heart Attack Cases by Age Group",
    barmode="group"
)


age_group_fig.show()
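
The bar chart compares raw counts, which makes the large groups dominate visually. A short sketch of the share of positive cases per age group, computed from the counts in the table above, may make the groups easier to compare. The `positive_share` name is introduced here for illustration.

```python
import pandas as pd

# Counts copied from the age_group table above.
counts = pd.DataFrame({
    "Age Group": ["Senior", "Senior", "Middle-aged", "Middle-aged",
                  "Adult", "Adult", "Young Adult", "Young Adult", "Minor"],
    "Result": ["negative", "positive", "negative", "positive",
               "negative", "positive", "negative", "positive", "negative"],
    "Count": [148, 366, 256, 373, 91, 63, 13, 8, 1],
})

# One row per age group, one column per result; missing pairs count as 0.
pivot = counts.pivot_table(
    index="Age Group", columns="Result", values="Count",
    aggfunc="sum", fill_value=0,
)

positive_share = pivot["positive"] / pivot.sum(axis=1)
print(positive_share.round(3))
```

The smallest groups contain very few records (a single Minor, 21 Young Adults), so their shares should be read with caution.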

## B. Age Group and Gender

In [11]:
def categorize_gender(gender):
    if gender == 1:
        return "Male"
    elif gender == 0:
        return "Female"

age_group_gender = dataset.copy()
age_group_gender["Gender"] = age_group_gender["Gender"].apply(categorize_gender)
age_group_gender = age_group_gender[["Age Group", "Gender", "Result"]]
age_group_gender_ct = pd.crosstab(
    [age_group_gender["Age Group"], 
    age_group_gender["Gender"]],
    age_group_gender["Result"]
)

age_group_gender_ct

Age Group,Gender,Result: negative,Result: positive
Adult,Female,34,14
Adult,Male,57,49
Middle-aged,Female,93,109
Middle-aged,Male,163,264
Minor,Female,1,0
Senior,Female,70,122
Senior,Male,78,244
Young Adult,Female,4,2
Young Adult,Male,9,6


In [12]:
age_group_gender_fig = px.sunburst(
    age_group_gender,
    path=["Age Group", "Gender", "Result"],
    title="Sunburst Chart of Age Group, Gender, and Result"
)

age_group_gender_fig.show()

## C. Age and Blood Sugar

In [13]:
# Only Age and Blood sugar are plotted here, so the names reflect that
age_sugar = dataset[["Age", "Blood sugar", "Result"]]
age_sugar_fig = px.scatter(
    age_sugar, 
    x="Age", 
    y="Blood sugar", 
    color="Result",
    title="Scatter Plot of Age and Blood Sugar"
)
age_sugar_fig.show()


## D. Systolic and Diastolic Blood Pressure

In [14]:
systolic_diastolic = dataset[["Systolic blood pressure", "Diastolic blood pressure", "Result"]]
systolic_diastolic


,Systolic blood pressure,Diastolic blood pressure,Result
0,160,83,negative
1,98,46,positive
2,160,77,negative
3,120,55,positive
4,112,65,negative
...,...,...,...
1314,122,67,negative
1315,125,55,positive
1316,168,104,positive
1317,117,68,positive


In [15]:
systolic_diastolic_fig = px.density_heatmap(
    systolic_diastolic, 
    x="Systolic blood pressure", 
    y="Diastolic blood pressure",
    facet_col="Result",
    title="Density Heatmap of Systolic and Diastolic Blood Pressure"
)

systolic_diastolic_fig.show()

## E. Systolic Blood Pressure, Blood Sugar, and Troponin

In [16]:
systolic_sugar_troponin = dataset[["Systolic blood pressure", "Blood sugar", "Troponin", "Result"]]
systolic_sugar_troponin

,Systolic blood pressure,Blood sugar,Troponin,Result
0,160,160.0,0.012,negative
1,98,296.0,1.060,positive
2,160,270.0,0.003,negative
3,120,270.0,0.122,positive
4,112,300.0,0.003,negative
...,...,...,...,...
1314,122,204.0,0.006,negative
1315,125,149.0,0.172,positive
1316,168,96.0,4.250,positive
1317,117,443.0,0.359,positive


In [17]:
systolic_sugar_troponin_fig = px.scatter_3d(
    systolic_sugar_troponin,
    x="Systolic blood pressure",
    y="Blood sugar",
    z="Troponin",
    color="Result",
    title="3-Dimensional Scatter Plot of Troponin, Blood Sugar and Systolic Blood Pressure"
)

systolic_sugar_troponin_fig.update_traces(marker=dict(size=1.5))
systolic_sugar_troponin_fig.show()

## F. Troponin and CK-MB

In [18]:
troponin_ckmb = dataset[["Troponin", "CK-MB", "Result"]]
troponin_ckmb

,Troponin,CK-MB,Result
0,0.012,1.80,negative
1,1.060,6.75,positive
2,0.003,1.99,negative
3,0.122,13.87,positive
4,0.003,1.08,negative
...,...,...,...
1314,0.006,1.63,negative
1315,0.172,1.33,positive
1316,4.250,1.24,positive
1317,0.359,5.80,positive


In [19]:
troponin_ckmb_fig = px.scatter(
    troponin_ckmb, 
    x="Troponin", 
    y="CK-MB", 
    color="Result",
    title="Scatter Plot of Troponin and CK-MB"
)
troponin_ckmb_fig.show()

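## G. Feature Selection with Mutual Information

Mutual information allows us to identify which features carry the most information about the heart attack outcome. A score of 0 means the estimator found no dependence between the feature and the target.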

In [20]:
from sklearn.feature_selection import mutual_info_classif

predict_vars = svm_dataset.copy()
target_var = predict_vars.pop("Result")

mi_scores = mutual_info_classif(predict_vars, target_var)
mi_scores = pd.Series(mi_scores, name="MI Scores", index=predict_vars.columns)
mi_scores = mi_scores.sort_values(ascending=False)

mi_scores = mi_scores.reset_index()
mi_scores.columns = ['Feature', 'MI Score']
mi_scores

,Feature,MI Score
0,Troponin,0.371678
1,CK-MB,0.130149
2,Age,0.042184
3,Diastolic blood pressure,0.024647
4,Systolic blood pressure,0.014248
5,Gender,0.0
6,Heart rate,0.0
7,Blood sugar,0.0


In [21]:
mi_scores_fig = px.bar(
    mi_scores, 
    x="MI Score", 
    y="Feature",
    title="MI Score of the Features"
)
mi_scores_fig.show()
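
`mutual_info_classif` adds a small amount of random noise to continuous features, so its scores can vary slightly between runs unless `random_state` is set. It also accepts a `discrete_features` mask for columns such as a binary gender flag. The sketch below illustrates both options on synthetic data; the column names `signal`, `noise`, and `binary` are hypothetical and the data is generated, not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500

# Synthetic features: one that determines the target, two that do not.
signal = rng.normal(size=n)
toy_X = pd.DataFrame({
    "signal": signal,
    "noise": rng.normal(size=n),
    "binary": rng.integers(0, 2, size=n),
})
toy_y = (signal > 0).astype(int)

# Boolean mask marking which columns should be treated as discrete.
discrete_mask = toy_X.columns == "binary"

# Fixing random_state makes repeated calls return identical scores.
scores_a = mutual_info_classif(
    toy_X, toy_y, discrete_features=discrete_mask, random_state=0
)
scores_b = mutual_info_classif(
    toy_X, toy_y, discrete_features=discrete_mask, random_state=0
)

toy_scores = pd.Series(scores_a, index=toy_X.columns).sort_values(ascending=False)
print(toy_scores)
```

Applied to the notebook's data, the analogous change would be to pass a mask for the `Gender` column and a fixed `random_state`; the exact scores shown above may then differ slightly.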

# 4. Training The Baseline Model

In this section, we will train the SVM for predictive modeling by building a pipeline with scaling. We will first build a baseline SVM model to evaluate its performance on the dataset and analyze for potential cases of overfitting or underfitting.

## A. Data Splitting

In [22]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(predict_vars, target_var, test_size=0.2, random_state=69)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (1055, 8)
X_test shape: (264, 8)
y_train shape: (1055,)
y_test shape: (264,)
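
The split above is purely random. When the classes are imbalanced, passing `stratify` keeps the class proportions the same in both subsets. This is a minimal sketch on made-up labels (35% class 0, 65% class 1), not a rerun of the notebook's split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 2 features, 35 negatives and 65 positives.
toy_X = np.arange(200).reshape(100, 2)
toy_y = np.array([0] * 35 + [1] * 65)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=69, stratify=toy_y
)

print("Positive share overall:", toy_y.mean())
print("Positive share in train:", y_tr.mean())
print("Positive share in test:", y_te.mean())
```

In the notebook this would correspond to adding `stratify=target_var` to the existing call, which would change the split and therefore the reported metrics.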


## B. Building the Pipeline with Scaling

In [23]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])
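
Putting the scaler inside the pipeline means its mean and variance are learned from whatever data `fit` receives, so the test set never influences the scaling. The sketch below checks this on synthetic data generated with `make_classification`; the `toy_` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real features and target.
toy_X, toy_y = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0
)

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC()),
])
toy_pipeline.fit(X_tr, y_tr)

# The fitted scaler's statistics come from the training rows only.
fitted_scaler = toy_pipeline.named_steps["scaler"]
uses_train_stats = np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0))
print("Scaler fitted on training data only:", uses_train_stats)

# predict() applies the same scaling before calling the SVM.
toy_predictions = toy_pipeline.predict(X_te)
```

The same property carries over to cross-validation: when the pipeline is passed to a search or CV routine, the scaler is refit on each training fold.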

## C. Training the Model

In [24]:
svm_pipeline.fit(X_train, y_train)


## D. Evaluation of the Baseline Model

In [25]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Accuracy
accuracy_baseline_train = svm_pipeline.score(X_train, y_train)
accuracy_baseline_test = svm_pipeline.score(X_test, y_test)

def present_accuracies(accuracy_train, accuracy_test, model):
    
    present_accuracies_fig = make_subplots(rows=1, cols=2, specs=[[{'type': 'domain'}, {'type': 'domain'}]],
                        subplot_titles=['Training Accuracy', 'Testing Accuracy'])
    
    # Training Accuracy
    present_accuracies_fig.add_trace(go.Pie(
        labels=["Correct Predictions", "Incorrect Predictions"],
        values=[accuracy_train, 1 - accuracy_train],
        name="Training Accuracy",
    ), row=1, col=1)
    
    # Testing Accuracy
    present_accuracies_fig.add_trace(go.Pie(
        labels=["Correct Predictions", "Incorrect Predictions"],
        values=[accuracy_test, 1 - accuracy_test],
        name="Testing Accuracy",
    ), row=1, col=2)
    
    present_accuracies_fig.update_layout(
        title_text="Training vs Testing Accuracy of the " + model,
        title_x=0.5
    )
    
    present_accuracies_fig.show()

present_accuracies(accuracy_baseline_train, accuracy_baseline_test, "Baseline Model")

In [26]:
from sklearn.metrics import classification_report

# Classification report
y_pred_baseline = svm_pipeline.predict(X_test)

classif_report = classification_report(y_test, y_pred_baseline)

print("Classification Report for the Testing Dataset: \n")
print(classif_report)

Classification Report for the Testing Dataset: 

              precision    recall  f1-score   support

           0       0.73      0.60      0.66        93
           1       0.80      0.88      0.84       171

    accuracy                           0.78       264
   macro avg       0.76      0.74      0.75       264
weighted avg       0.78      0.78      0.77       264



In [27]:
from sklearn.metrics import confusion_matrix

# Confusion Matrix
conf_matrix_baseline = confusion_matrix(y_test, y_pred_baseline)

def present_conf(conf_matrix):
    # Rows are the actual classes, columns are the predicted classes
    conf_matrix_df = pd.DataFrame(
        conf_matrix, 
        index=["Actual Negative", "Actual Positive"], 
        columns=["Predicted Negative", "Predicted Positive"]
    )
    
    # Generic title: this helper is reused for both the baseline and the optimized model
    conf_matrix_fig = px.imshow(
        conf_matrix_df, 
        text_auto=True, 
        title="Confusion Matrix for the Testing Dataset"
    )
    
    conf_matrix_fig.show()

present_conf(conf_matrix_baseline)
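
The precision and recall in the classification report can be derived directly from the four cells of a binary confusion matrix. The sketch below does this for a small made-up set of labels (`y_true` and `y_hat` are illustrative, not the notebook's predictions), with class 1 as the positive class.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 4 actual negatives followed by 6 actual positives.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the cells in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(tn, fp, fn, tp)
print(round(precision, 3), round(recall, 3))
```

In a screening context, false negatives (missed heart attacks) are usually the costlier error, which is why recall on the positive class is worth reading alongside accuracy.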

# 5. Hyperparameter Tuning Using GridSearchCV

In this section, we will perform hyperparameter tuning with GridSearchCV to find the optimal hyperparameters that achieve the highest accuracy for the model.

## A. Setting up the Hyperparameters

In [28]:
param_grid = {
    'svm__C': [0.1, 1, 10, 100],  # Regularization parameter
    'svm__gamma': [1, 0.1, 0.01, 0.001],  # Kernel coefficient
    'svm__kernel': ['rbf', 'linear', 'poly', 'sigmoid']  # Different kernel types
}
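
This grid contains 4 × 4 × 4 = 64 combinations, each fitted once per cross-validation fold. Since `gamma` only affects the `rbf`, `poly`, and `sigmoid` kernels, the linear kernel is fitted several times with settings that differ only in an unused parameter. The sketch below counts the combinations with `ParameterGrid` and shows one possible way to split the grid; `split_grid` is a hypothetical alternative, not what the notebook uses.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook.
full_grid = {
    "svm__C": [0.1, 1, 10, 100],
    "svm__gamma": [1, 0.1, 0.01, 0.001],
    "svm__kernel": ["rbf", "linear", "poly", "sigmoid"],
}

# Hypothetical alternative: gamma is searched only for kernels that use it.
split_grid = [
    {"svm__kernel": ["linear"], "svm__C": [0.1, 1, 10, 100]},
    {
        "svm__kernel": ["rbf", "poly", "sigmoid"],
        "svm__C": [0.1, 1, 10, 100],
        "svm__gamma": [1, 0.1, 0.01, 0.001],
    },
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_split)
```

`GridSearchCV` accepts a list of dictionaries like `split_grid` directly as its `param_grid` argument.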

## B. Performing the Tuning

We search for the best hyperparameter combination using GridSearchCV with 5-fold cross-validation.

In [29]:
from sklearn.model_selection import GridSearchCV

# Perform GridSearchCV
grid_search = GridSearchCV(svm_pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

## C. Evaluation of Hyperparameters

In [30]:
# Results of GridSearchCV
results = pd.DataFrame(grid_search.cv_results_)

results['params_str'] = results['params'].apply(lambda x: str(x))

top_results = results[['params_str', 'mean_test_score']].sort_values(by='mean_test_score', ascending=False).head(5)

top_results_fig = px.bar(
    top_results, 
    x='mean_test_score', 
    y='params_str', 
    labels={'mean_test_score': 'Accuracy', 'params_str': 'Hyperparameter Combination'},
    title='Top 5 Hyperparameter Combinations and their Accuracy',
    category_orders={'params_str': top_results['params_str'].tolist()}
) 

top_results_fig.show()

In [31]:
best_score = grid_search.best_score_

best_params = str(grid_search.best_params_)

best_params_fig = px.pie(
    values=[best_score, 1 - best_score], 
    names=["Correct Predictions", "Incorrect Predictions"],
    title="Highest Accuracy Achieved by the Optimized Hyperparameters" 
        + "<br>" + best_params
)

best_params_fig.show()

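The combination with the highest mean cross-validated accuracy is the one reported by `grid_search.best_params_` and shown in the chart title above: `svm__C` is the regularization parameter, `svm__gamma` is the kernel coefficient, and `svm__kernel` is the kernel type.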

# 6. Evaluation of the Optimized Model

In this section, we will evaluate the performance of the optimized model by analyzing its training and testing accuracies, as well as precision, recall, F1-score, and the confusion matrix.

In [32]:
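from sklearn.metrics import accuracy_score

# Accuracy
# best_estimator_ is the pipeline refit on the full training set (refit=True by default)
y_pred_optimized = grid_search.best_estimator_.predict(X_test)

# Score on the training set itself, to match how the baseline was measured;
# grid_search.best_score_ is the mean cross-validated accuracy, not the training accuracy
accuracy_optimized_train = grid_search.best_estimator_.score(X_train, y_train)
accuracy_optimized_test = accuracy_score(y_test, y_pred_optimized)

present_accuracies(accuracy_optimized_train, accuracy_optimized_test, "Optimized Model")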

In [33]:
# Classification Report
print("Classification Report of the Optimized Model for the Testing Dataset: \n")
print(classification_report(y_test, y_pred_optimized))

Classification Report of the Optimized Model for the Testing Dataset: 

              precision    recall  f1-score   support

           0       0.83      0.91      0.87        93
           1       0.95      0.90      0.92       171

    accuracy                           0.91       264
   macro avg       0.89      0.91      0.90       264
weighted avg       0.91      0.91      0.91       264



In [34]:
conf_matrix_optimized = confusion_matrix(y_test, y_pred_optimized)

present_conf(conf_matrix_optimized)

# 7. Exporting the Model for Web Application Use

In this section, we export the optimized model for use in a web application. We convert it to an ONNX model, which can then be run in JavaScript.

In [35]:
# Installing the skl2onnx to convert the sklearn models to onnx models
!pip install skl2onnx



In [36]:
import onnx
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Convert SVM model to ONNX format
onnx_model = convert_sklearn(
    grid_search.best_estimator_, 
    initial_types=[('input', FloatTensorType([None, len(X_train.columns)]))]
)

# Save ONNX model
onnx.save_model(onnx_model, 'svm_model.onnx')

# 8. Conclusion

In this project, we followed an end-to-end workflow to predict heart attack risk using a Support Vector Machine (SVM) model. We started with Exploratory Data Analysis (EDA), treating an outlier and visualizing the data with scatter plots and heatmaps to understand the dataset. After ranking the features with mutual information, we trained the model with standard scaling and optimized it through hyperparameter tuning with GridSearchCV, increasing test accuracy from 78% to 91%. Finally, we exported the optimized model to the ONNX format, which can be run from JavaScript so that predictions are made directly within a web application without the need for a backend server.