## Heart Disease Dataset Variables (with Severity Notes)

* **`ChestPainType`**: Type of chest pain.
    * `TA`: Typical Angina -> 🚨 **High Risk** (Classic symptom of coronary artery disease)
    * `ATA`: Atypical Angina -> ⚠️ **Moderate Risk**
    * `NAP`: Non-Anginal Pain -> ✅ **Low Risk** (Likely not heart-related)
    * `ASY`: Asymptomatic -> ❓ **Silent/Sneaky** (No chest pain; can still be a strong disease signal in ML models)
* **`RestingBP`**: Resting blood pressure [mm Hg].
    * *Note:* Higher values -> ⚠️ **Worse** (Hypertension strain)
* **`Cholesterol`**: Serum cholesterol [mg/dl].
    * *Note:* Higher values -> ⚠️ **Worse** (Risk of blockage)
* **`FastingBS`**: Fasting blood sugar.
    * `0`: Normal (≤ 120 mg/dl) -> ✅ **Good**
    * `1`: High (> 120 mg/dl) -> ⚠️ **Diabetes Risk** (Damages vessels)
* **`RestingECG`**: Resting electrocardiogram results.
    * `Normal`: Normal -> ✅ **Good**
    * `ST`: ST-T abnormality -> ⚠️ **Bad** (Ischemia/Oxygen lack)
    * `LVH`: Left Ventricular Hypertrophy -> 🚨 **Severe** (Thickened heart muscle)
* **`MaxHR`**: Maximum heart rate achieved.
    * *Note:* Lower max achievable HR often indicates -> ⚠️ **Weaker Heart**
* **`ExerciseAngina`**: Exercise-induced angina.
    * `N`: No -> ✅ **Good**
    * `Y`: Yes -> 🚨 **Critical Flag** (Heart struggles under stress)
* **`Oldpeak`**: ST depression induced by exercise vs rest.
    * *Note:* Higher value (>1.0) -> 🚨 **Severe Ischemia**
* **`ST_Slope`**: The slope of the peak exercise ST segment.
    * `Up`: Upsloping -> ✅ **Healthy Response**
    * `Flat`: Flat -> ⚠️ **Warning**
    * `Down`: Downsloping -> ☠️ **Worst** (Strongest sign of disease)
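
The category codes listed above can double as a lightweight data dictionary for sanity-checking the raw file. The following is a minimal sketch, assuming the column names and codes documented in this list; `EXPECTED_CODES`, `unexpected_codes`, and the toy frame are hypothetical names and made-up rows for illustration, not part of the project.

```python
import pandas as pd

# Allowed codes per categorical column, copied from the variable list above.
EXPECTED_CODES = {
    'ChestPainType': {'TA', 'ATA', 'NAP', 'ASY'},
    'RestingECG': {'Normal', 'ST', 'LVH'},
    'ExerciseAngina': {'N', 'Y'},
    'ST_Slope': {'Up', 'Flat', 'Down'},
}


def unexpected_codes(df, expected=EXPECTED_CODES):
    """Return {column: set of values that are not documented codes}."""
    problems = {}
    for col, allowed in expected.items():
        if col not in df.columns:
            continue
        found = set(df[col].dropna().unique()) - allowed
        if found:
            problems[col] = found
    return problems


# Toy rows (made up for illustration); 'XX' is a deliberately invalid code.
toy = pd.DataFrame({
    'ChestPainType': ['TA', 'ASY', 'XX'],
    'RestingECG': ['Normal', 'ST', 'LVH'],
    'ExerciseAngina': ['N', 'Y', 'N'],
    'ST_Slope': ['Up', 'Flat', 'Down'],
})
print(unexpected_codes(toy))
```

A check like this would need to run on the raw frame, before any column is re-encoded as 0/1.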

In [None]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from heart_failure_prediction.config import RAW_DATA_DIR

In [None]:
path = os.path.join(RAW_DATA_DIR, 'heart.csv')
heart = pd.read_csv(path)

In [None]:
heart.head()

### Encoding binary values as 0/1

In [None]:
heart['ExerciseAngina'] = (
    heart['ExerciseAngina'].replace({'N': '0', 'Y': '1'}).astype(int)
)
heart['Sex'] = heart['Sex'].replace({'M': '0', 'F': '1'}).astype(int)

### Data types

In [None]:
heart.dtypes

### Missing and duplicated values

In [None]:
print(f'Null value count: {heart.isna().sum().sum()}')

In [None]:
print(f'Duplicated row count: {heart.duplicated().sum()}')
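
A single grand total hides where the problems are. As a hedged sketch on a made-up toy frame (the values below are not from `heart.csv`), the same pandas calls can report missing values per column and list the duplicated rows themselves:

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one missing entry and one repeated row.
toy = pd.DataFrame({
    'Age': [40, 49, 37, 49],
    'Cholesterol': [289.0, 180.0, np.nan, 180.0],
    'HeartDisease': [0, 1, 0, 1],
})

# Missing values per column rather than one grand total.
null_per_column = toy.isna().sum()
print(null_per_column)

# keep=False marks every member of a duplicate group, not just the repeats.
duplicate_rows = toy[toy.duplicated(keep=False)]
print(duplicate_rows)
```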

### Data distribution

In [None]:
cols = heart.columns

n_cols = 4
n_rows = int(np.ceil(len(cols) / n_cols))

plt.figure(figsize=(3 * n_cols, 2 * n_rows))

# One histogram per column, laid out on an n_rows x n_cols grid
for i, col in enumerate(cols, start=1):
    plt.subplot(n_rows, n_cols, i)
    heart[col].hist(bins=50)
    plt.title(col)

# Adjust spacing once, after all subplots exist
plt.tight_layout()
plt.show()

In [None]:
heart['HeartDisease'].value_counts(normalize=True) * 100

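A serum cholesterol of 0 mg/dl is not physiologically possible, so zeros in `Cholesterol` most likely stand for missing measurements. Treat them as NaN.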

In [None]:
heart['Cholesterol'] = heart['Cholesterol'].replace({0: np.nan})
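
`Cholesterol` may not be the only column where 0 stands in for a missing measurement. A minimal sketch for counting such zeros, assuming that 0 is implausible for the columns in the hypothetical `ZERO_IS_SUSPECT` list; the toy rows are made up for illustration:

```python
import pandas as pd

# Columns where a value of 0 is not physiologically plausible.
# This list is an assumption for illustration; adjust it to the data at hand.
ZERO_IS_SUSPECT = ['RestingBP', 'Cholesterol', 'MaxHR']

# Toy rows (made-up values).
toy = pd.DataFrame({
    'RestingBP': [140, 0, 130, 120],
    'Cholesterol': [289, 0, 0, 214],
    'MaxHR': [172, 156, 98, 108],
    'Oldpeak': [0.0, 1.0, 0.0, 1.5],  # 0 is a valid Oldpeak value
})

# Count zeros only in the columns where zero cannot be a real measurement.
zero_counts = (toy[ZERO_IS_SUSPECT] == 0).sum()
print(zero_counts)
```

Columns such as `Oldpeak` or `FastingBS` are left out on purpose because 0 is a legitimate value there.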

In [None]:
print(f'Null value count: {heart.isna().sum().sum()}')

In [None]:
# Heart-disease rate for rows with (0) and without (1) a cholesterol measurement
chol_missing = heart['Cholesterol'].isna().astype(int).rename('chol_missing')
print(heart.groupby(chol_missing)['HeartDisease'].mean())

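Rows with missing cholesterol show a higher heart-disease rate, so the missingness itself appears to be informative, possibly because sicker patients were less likely to have the value recorded. Add a missing-value indicator feature when imputing `Cholesterol` later on.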
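
The indicator-plus-imputation idea can be sketched as follows. This is only an illustration on made-up rows: median imputation and the column name `Cholesterol_missing` are assumptions, since the notebook has not yet chosen an imputation method.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values); NaN marks a missing cholesterol measurement.
toy = pd.DataFrame({
    'Cholesterol': [200.0, np.nan, 240.0, np.nan, 220.0],
    'HeartDisease': [0, 1, 0, 1, 1],
})

# 1) Record the missingness before imputation overwrites it.
toy['Cholesterol_missing'] = toy['Cholesterol'].isna().astype(int)

# 2) Fill the gaps with the median of the observed values.
median_chol = toy['Cholesterol'].median()
toy['Cholesterol'] = toy['Cholesterol'].fillna(median_chol)

print(toy)
```

In a modelling pipeline the median should be computed on the training split only and then reused for validation and test data, so that no information leaks across splits.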

### Correlation

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(
    heart.select_dtypes(np.number).corr(), annot=True, fmt='.2f', cmap='coolwarm'
)
plt.show()
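
A full heatmap can be hard to scan when the main question is which features track the target. A hedged sketch, using made-up toy rows in place of the real frame, that ranks numeric features by the absolute strength of their Pearson correlation with `HeartDisease`:

```python
import pandas as pd

# Toy rows (made-up values) standing in for the numeric columns.
toy = pd.DataFrame({
    'Age': [40, 49, 37, 48, 54, 39, 45, 58],
    'MaxHR': [172, 156, 98, 108, 122, 170, 160, 99],
    'Oldpeak': [0.0, 1.0, 0.0, 1.5, 2.0, 0.0, 0.0, 3.0],
    'HeartDisease': [0, 1, 0, 1, 1, 0, 0, 1],
})

# Pearson correlation of every numeric feature with the target,
# ordered by absolute strength so the strongest associations come first.
target_corr = (
    toy.select_dtypes('number')
    .corr()['HeartDisease']
    .drop('HeartDisease')
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Pearson correlation only captures linear association, so a weak value here does not rule out a useful feature.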

In [None]:
sns.countplot(x='ChestPainType', hue='HeartDisease', data=heart)

In [None]:
sns.countplot(x='RestingECG', hue='HeartDisease', data=heart)

In [None]:
sns.countplot(x='ST_Slope', hue='HeartDisease', data=heart)
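
Count plots show absolute numbers, which makes categories of different sizes hard to compare. As a minimal sketch on made-up rows, the heart-disease rate within each category can be computed directly, because the mean of a 0/1 target is the positive rate of the group:

```python
import pandas as pd

# Toy rows (made-up values).
toy = pd.DataFrame({
    'ST_Slope': ['Up', 'Up', 'Up', 'Up', 'Flat', 'Flat', 'Flat', 'Down'],
    'HeartDisease': [0, 0, 0, 1, 1, 1, 0, 1],
})

# Share of patients with heart disease within each category.
rate_by_slope = toy.groupby('ST_Slope')['HeartDisease'].mean()
print(rate_by_slope)
```

The same pattern would apply to `ChestPainType` and `RestingECG`; rates from very small groups should be read with caution.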