## `Influence of Machine Learning on the Diagnosis of Cardiovascular Diseases`

### `Problem Statement`

Cardiovascular diseases (CVDs) are the leading cause of mortality globally, with at least `80%` of deaths associated with heart disease and strokes in individuals below the age of `70 years` (Rana et al., 2025). Indicators of CVD include elevated biomarker levels in the blood, significant chest pain, and abnormal electrocardiogram (ECG) readings. However, many patients with CVD are difficult to diagnose by ECG because their readings appear largely normal. AI approaches could therefore make a significant difference in how CVD is diagnosed and, ultimately, improve patient outcomes.

### `Objectives`

#### `General Objective`
- To develop a machine learning predictive model to aid in the diagnosis of cardiovascular diseases.

#### `Specific Objectives`

- To examine the applications of machine learning in the diagnosis of CVDs.
- To evaluate the accuracy of machine learning tools in the diagnosis of CVDs.
- To design a machine learning model to diagnose CVDs.
- To evaluate and validate the developed CVD diagnosis model using performance metrics.

### `Data Understanding`

The source of the data is [Kaggle](https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset).

`DISCLAIMER:` Kaggle datasets are meant for simulation and modelling, and do not necessarily reflect real-world contexts.

The data used in this study comprises the following features:

| Feature | Feature type | Column | Data type / values |
|---|---|---|---|
| `Age` | Objective Feature | age | int (days) |
| `Height` | Objective Feature | height | int (cm) |
| `Weight` | Objective Feature | weight | float (kg) |
| `Gender` | Objective Feature | gender | categorical code |
| `Systolic blood pressure` | Examination Feature | ap_hi | int |
| `Diastolic blood pressure` | Examination Feature | ap_lo | int |
| `Cholesterol` | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| `Glucose` | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| `Smoking` | Subjective Feature | smoke | binary |
| `Alcohol intake` | Subjective Feature | alco | binary |
| `Physical activity` | Subjective Feature | active | binary |
| `Presence or absence of cardiovascular disease` | Target Variable | cardio | binary |
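
The feature descriptions above can be turned into a light sanity check before modelling. The sketch below is illustrative only: `check_schema`, `EXPECTED_COLUMNS` and `ALLOWED_CODES` are hypothetical names that are not part of the notebook, and the binary features are assumed to be coded as `0`/`1`.

```python
import pandas as pd

# Column names follow the feature list; the helper names here are hypothetical.
EXPECTED_COLUMNS = ["age", "gender", "height", "weight", "ap_hi", "ap_lo",
                    "cholesterol", "gluc", "smoke", "alco", "active", "cardio"]

# Assumption: binary features are coded 0/1; ordinal features use codes 1 to 3.
ALLOWED_CODES = {
    "cholesterol": {1, 2, 3},
    "gluc": {1, 2, 3},
    "smoke": {0, 1},
    "alco": {0, 1},
    "active": {0, 1},
    "cardio": {0, 1},
}


def check_schema(df):
    """Return a list of problems; an empty list means the frame matches the description."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    for column, allowed in ALLOWED_CODES.items():
        if column in df.columns:
            # .tolist() converts NumPy integers to plain Python ints
            unexpected = set(df[column].unique().tolist()) - allowed
            if unexpected:
                problems.append(f"{column}: unexpected codes {sorted(unexpected)}")
    return problems


# Two illustrative records with the same layout as the dataset
sample = pd.DataFrame({
    "age": [18393, 20228], "gender": [2, 1], "height": [168, 156],
    "weight": [62.0, 85.0], "ap_hi": [110, 140], "ap_lo": [80, 90],
    "cholesterol": [1, 3], "gluc": [1, 1], "smoke": [0, 0],
    "alco": [0, 0], "active": [1, 1], "cardio": [0, 1],
})
print(check_schema(sample))
```

In the notebook, the same function could be called on `data` right after loading; any non-empty result would point at columns that need cleaning.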

### `Performance Metrics`

- `Precision`: The target precision for this diagnosis support system is in the range `0.80 - 0.95+`. Precision is critical for minimizing unnecessary treatments.
- `Recall`: The target recall is `0.85` or higher. Recall is important for not missing real disease cases.
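
To make the two targets concrete, here is a minimal sketch of how precision and recall are computed from true and predicted labels. The function `precision_recall` and the toy labels are hypothetical and serve only as illustration; they are not results from this study.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # diseased, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # diseased, missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels (made up for illustration): 1 = CVD present, 0 = CVD absent
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

precision, recall = precision_recall(y_true, y_pred)
meets_precision_target = precision >= 0.80
meets_recall_target = recall >= 0.85
print(precision, recall, meets_precision_target, meets_recall_target)
```

With these toy labels there are 4 true positives, 1 false positive and 1 false negative, so both scores equal 4/5; the precision target is met while the recall target is not.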

### `Libraries`

In [3]:
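# Custom helper module
from functions import duplicated
# Third-party libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport
# Standard library
import warnings
warnings.filterwarnings('ignore')  # Suppress warnings to keep the notebook output readable
# Render matplotlib figures inline in the notebook
%matplotlib inline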

### `Load the data`

In [4]:
# Semicolon-separated file; the first column (id) becomes the index
data = pd.read_csv("archive/cardio_train.csv", index_col=0, sep=";")
data.head(10)

id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,17474,1,156,56.0,100,60,1,1,0,0,0,0
8,21914,1,151,67.0,120,80,2,2,0,0,0,0
9,22113,1,157,93.0,130,80,3,1,0,0,1,0
12,22584,2,178,95.0,130,90,3,3,0,0,1,1
13,17668,1,158,71.0,110,70,1,1,0,0,1,0
14,19834,1,164,68.0,110,60,1,1,0,0,0,0
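
The effect of `sep=";"` and `index_col=0` can be reproduced on a small in-memory sample. This is a sketch only: the inline text below assumes the file starts with an `id` column followed by the twelve features, which is inferred from the output above rather than from the file itself.

```python
import io

import pandas as pd

# Assumed file layout: semicolon-separated values with "id" as the first column
raw = (
    "id;age;gender;height;weight;ap_hi;ap_lo;cholesterol;gluc;smoke;alco;active;cardio\n"
    "0;18393;2;168;62.0;110;80;1;1;0;0;1;0\n"
    "1;20228;1;156;85.0;140;90;3;1;0;0;1;1\n"
)

# sep=";" splits on semicolons; index_col=0 turns the id column into the index
sample = pd.read_csv(io.StringIO(raw), index_col=0, sep=";")
print(sample.index.name)  # id
print(sample.shape)       # (2, 12)
```

Because `id` becomes the index, it is excluded from the twelve remaining columns and cannot be picked up as a feature by accident.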


In [23]:
# Data report 
ProfileReport(data).to_file("DataReport.html")


In [6]:
# Duplicated records
duplicated(data)

'Number of duplicated records: 24'

- The data has no missing records.
- The data has 24 duplicated records.
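
The custom `duplicated` helper is not shown in this notebook. As a rough guide, the two findings above could be reproduced with built-in pandas methods. The function name `summarize_quality` and the toy frame are hypothetical.

```python
import pandas as pd


def summarize_quality(df):
    """Count missing cells and fully duplicated rows (hypothetical helper)."""
    return {
        # Total number of NaN cells across all columns
        "missing": int(df.isna().sum().sum()),
        # Rows identical to an earlier row; the index is not compared
        "duplicated": int(df.duplicated().sum()),
    }


# Toy frame for illustration: the third row repeats the first
toy = pd.DataFrame({"age": [50, 55, 50], "cardio": [0, 1, 0]})
print(summarize_quality(toy))  # {'missing': 0, 'duplicated': 1}
```

`DataFrame.duplicated()` marks only the second and later occurrences by default, so each repeated record is counted once.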

### `Data Handling`

#### `Age`
- Change the age column from days to years.
- `Assumption:` One year is equal to 365 days.

In [22]:
data["age"] = data["age"] // 365  # Floor division rounds the age down to whole years
data.head(10)

id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,50,2,168,62.0,110,80,1,1,0,0,1,0
1,55,1,156,85.0,140,90,3,1,0,0,1,1
2,51,1,165,64.0,130,70,3,1,0,0,0,1
3,48,2,169,82.0,150,100,1,1,0,0,1,1
4,47,1,156,56.0,100,60,1,1,0,0,0,0
8,60,1,151,67.0,120,80,2,2,0,0,0,0
9,60,1,157,93.0,130,80,3,1,0,0,1,0
12,61,2,178,95.0,130,90,3,3,0,0,1,1
13,48,1,158,71.0,110,70,1,1,0,0,1,0
14,54,1,164,68.0,110,60,1,1,0,0,0,0
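
As a worked check of the conversion, the first three ages from the raw data can be converted by hand. This is a small sketch using the same 365-day assumption; the variable names are illustrative.

```python
import pandas as pd

# First three raw ages (in days) from the loaded data
age_days = pd.Series([18393, 20228, 18857], name="age")

# Floor division keeps whole years and discards the leftover days
age_years = age_days // 365
print(age_years.tolist())  # [50, 55, 51]

# divmod shows what is discarded: 18393 days = 50 years and 143 days
years, leftover_days = divmod(18393, 365)
print(years, leftover_days)  # 50 143
```

The results match the `age` values in the converted table above.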


### `Exploratory Data Analysis`

#### `Age`

### `References`

- Rana, N., Sharma, K., & Sharma, A. (2025). Diagnostic strategies using AI and ML in cardiovascular diseases: Challenges and future perspectives. In Deep Learning and Computer Vision: Models and Biomedical Applications: Volume 1 (pp. 135-165). Singapore: Springer Nature Singapore. https://link.springer.com/chapter/10.1007/978-981-96-1285-7_7