## Heart Disease Dataset Variables: Cheat Sheet with Severity Notes

* **`ChestPainType`**: Type of chest pain.
    * `TA`: Typical Angina -> 🚨 **High Risk** (Classic heart attack warning)
    * `ATA`: Atypical Angina -> ⚠️ **Moderate Risk**
    * `NAP`: Non-Anginal Pain -> ✅ **Low Risk** (Likely not heart-related)
    * `ASY`: Asymptomatic -> ❓ **Silent/Sneaky** (Often dangerous in ML models)
* **`RestingBP`**: Resting blood pressure [mm Hg].
    * *Note:* Higher values -> ⚠️ **Worse** (Hypertension strain)
* **`Cholesterol`**: Serum cholesterol [mg/dl].
    * *Note:* Higher values -> ⚠️ **Worse** (Risk of blockage)
* **`FastingBS`**: Fasting blood sugar.
    * `0`: Normal (<= 120 mg/dl) -> ✅ **Good**
    * `1`: High (> 120 mg/dl) -> ⚠️ **Diabetes Risk** (Damages vessels)
* **`RestingECG`**: Resting electrocardiogram results.
    * `Normal`: Normal -> ✅ **Good**
    * `ST`: ST-T abnormality -> ⚠️ **Bad** (Ischemia/Oxygen lack)
    * `LVH`: Left Ventricular Hypertrophy -> 🚨 **Severe** (Thickened heart muscle)
* **`MaxHR`**: Maximum heart rate achieved.
    * *Note:* Lower max achievable HR often indicates -> ⚠️ **Weaker Heart**
* **`ExerciseAngina`**: Exercise-induced angina.
    * `N`: No -> ✅ **Good**
    * `Y`: Yes -> 🚨 **Critical Flag** (Heart fails under stress)
* **`Oldpeak`**: ST depression induced by exercise relative to rest.
    * *Note:* Higher value (>1.0) -> 🚨 **Severe Ischemia**
* **`ST_Slope`**: The slope of the peak exercise ST segment.
    * `Up`: Upsloping -> ✅ **Healthy Response**
    * `Flat`: Flat -> ⚠️ **Warning**
    * `Down`: Downsloping -> ☠️ **Worst** (Strongest sign of disease)

In [None]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from heart_failure_prediction.config import RAW_DATA_DIR
import heart_failure_prediction.plots as plots

In [None]:
path = os.path.join(RAW_DATA_DIR, 'heart.csv')
heart = pd.read_csv(path)

In [None]:
heart.head()

In [None]:
heart.describe().T

### Encoding binary values as 0/1

In [None]:
heart['ExerciseAngina'] = heart['ExerciseAngina'].map({'N': 0, 'Y': 1})
heart['Sex'] = heart['Sex'].map({'M': 0, 'F': 1})

### Data types

In [None]:
heart.dtypes

After mapping the binary columns to 0/1, three categorical columns remain: `ChestPainType`, `RestingECG`, and `ST_Slope`.
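As a quick cross-check of the list above, the non-numeric columns can be pulled out programmatically. This is a minimal sketch on a toy frame with made-up values; only the column names mirror the dataset, and `toy` and `categorical_cols` are names introduced here for illustration.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    'Age': [40, 49, 37],
    'Sex': [0, 1, 0],  # already mapped to 0/1
    'ChestPainType': ['ATA', 'NAP', 'ASY'],
    'RestingECG': ['Normal', 'ST', 'LVH'],
    'ST_Slope': ['Up', 'Flat', 'Down'],
})

# Treat every non-numeric column as categorical.
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()
print(categorical_cols)
```

On the real data, the same call on `heart` should return the three columns named above, assuming no other text columns are present.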

### Missing and duplicated values

In [None]:
print(f'Null value count: {heart.isna().sum().sum()}')

In [None]:
print(f'Duplicated row count: {heart.duplicated().sum()}')

There are no missing (NaN) values and no duplicated rows.

### Data distribution

In [None]:
plots.plot_distributions(heart)

In [None]:
heart['HeartDisease'].value_counts(normalize=True) * 100

The target classes are roughly balanced.

A serum cholesterol of 0 is not physiologically possible, so the zeros most likely encode missing values.
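Before replacing anything, it can help to see how many rows are affected. The sketch below counts zero readings on a toy column with made-up values; `toy`, `zero_mask`, `n_zero`, and `share_zero` are illustrative names.

```python
import pandas as pd

# Made-up readings; zeros stand in for the suspicious values.
toy = pd.DataFrame({'Cholesterol': [289, 0, 180, 0, 214]})

zero_mask = toy['Cholesterol'] == 0
n_zero = int(zero_mask.sum())   # number of zero readings
share_zero = zero_mask.mean()   # fraction of rows affected

print(f'Zero cholesterol readings: {n_zero} ({share_zero:.1%})')
```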

In [None]:
heart['Cholesterol'] = heart['Cholesterol'].replace({0: np.nan})

In [None]:
print(f'Null value count: {heart.isna().sum().sum()}')

In [None]:
# Compare the heart disease rate between rows with and without a cholesterol reading
chol_missing = heart['Cholesterol'].isna().rename('chol_missing')
print(heart.groupby(chol_missing)['HeartDisease'].mean())

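Patients with missing cholesterol have a higher heart disease rate, which may mean the measurement was skipped for patients in critical condition. A missing-value indicator should be added when imputing later on.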
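One way the indicator-plus-imputation idea could look is sketched below on toy data. Median imputation and the column name `Cholesterol_missing` are assumptions made for illustration; the notebook does not fix either choice.

```python
import numpy as np
import pandas as pd

# Made-up readings with gaps where the original value was 0.
toy = pd.DataFrame({'Cholesterol': [289.0, np.nan, 180.0, np.nan, 214.0]})

# Flag the missing readings before imputation overwrites them.
toy['Cholesterol_missing'] = toy['Cholesterol'].isna().astype(int)

# Median imputation is an assumption; in a real pipeline the median
# should be computed on the training split only to avoid leakage.
median_chol = toy['Cholesterol'].median()
toy['Cholesterol'] = toy['Cholesterol'].fillna(median_chol)

print(toy)
```

Keeping the flag lets a model use the fact that the value was missing, which the imputed number alone would hide.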

### Looking for outliers

In [None]:
num_cols = ['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak']
plt.figure(figsize=(15, 10))
for i, col in enumerate(num_cols):
    plt.subplot(2, 3, i + 1)
    sns.boxplot(x=heart[col])
    plt.title(col)
plt.show()

A `RestingBP` of 0 is impossible, so those rows are dropped. The remaining outliers are medically plausible.
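To list the flagged points rather than read them off a plot, the 1.5 × IQR rule that boxplot whiskers use by default can be applied directly. This is a sketch on made-up values; `iqr_outlier_mask` is a hypothetical helper, not part of the project's `plots` module.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Made-up resting blood pressure values, including one impossible 0.
toy_bp = pd.Series([120, 130, 125, 140, 135, 0, 128], name='RestingBP')
mask = iqr_outlier_mask(toy_bp)

print(toy_bp[mask].tolist())
```

The rule only flags candidates; whether a flagged value is an error or a real extreme still needs the kind of domain check described above.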

In [None]:
zero_restingbp = heart.loc[heart['RestingBP'] == 0].index
heart.drop(zero_restingbp, axis=0, inplace=True)

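A negative `Oldpeak` suggests ST elevation, which is still bad and worse than 0. For regression-type models it may make sense to transform it to its absolute value, because the risk does not grow monotonically with the raw value. Decision trees can handle the raw values as they are.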
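A possible form of that transformation is sketched below on made-up values. The extra sign flag is an assumption added here so that information about ST elevation is not lost; `Oldpeak_abs` and `Oldpeak_negative` are illustrative column names.

```python
import pandas as pd

# Made-up Oldpeak values, including negative ones.
toy = pd.DataFrame({'Oldpeak': [0.0, 1.5, -0.5, 2.0, -1.0]})

# Magnitude of the deviation, regardless of direction.
toy['Oldpeak_abs'] = toy['Oldpeak'].abs()

# Optional flag that preserves the direction (assumption, not in the notebook).
toy['Oldpeak_negative'] = (toy['Oldpeak'] < 0).astype(int)

print(toy)
```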

### Correlation

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(
    heart.select_dtypes(np.number).corr(), annot=True, fmt='.2f', cmap='coolwarm'
)
plt.show()

No pair of features is strongly correlated. `ExerciseAngina` is the feature most strongly correlated with the target.
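Reading the target column off a heatmap is error-prone, so the correlations with `HeartDisease` can also be ranked by absolute value. The sketch uses a toy frame with made-up values, so its numbers say nothing about the real data; `target_corr` is an illustrative name.

```python
import pandas as pd

# Made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    'ExerciseAngina': [0, 1, 1, 0, 1, 0],
    'MaxHR': [170, 120, 160, 165, 115, 150],
    'HeartDisease': [0, 1, 1, 0, 1, 0],
})

# Correlation of each feature with the target, strongest first.
target_corr = (
    toy.corr()['HeartDisease']
    .drop('HeartDisease')
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(target_corr)
```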

In [None]:
plots.plot_categorical_countplots(heart, 'HeartDisease')

In [None]:
print(heart.groupby('ChestPainType')['HeartDisease'].mean())

In [None]:
print(heart.groupby('ST_Slope')['HeartDisease'].mean())

Patients with the asymptomatic (`ASY`) chest pain type are much more likely to be in the positive group. A possible explanation is that patients who feel no pain ignore other symptoms, so the disease develops unchecked.

Patients whose `ST_Slope` is not `Up` (that is, `Flat` or `Down`) are at much greater risk of being in the positive group.
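Group means like these are easier to trust with the group sizes next to them, since a rate computed from a handful of patients is noisy. A minimal sketch on made-up values, where `summary` is an illustrative name:

```python
import pandas as pd

# Made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    'ST_Slope': ['Up', 'Up', 'Flat', 'Flat', 'Down', 'Up'],
    'HeartDisease': [0, 0, 1, 1, 1, 1],
})

# Positive rate and number of patients per category.
summary = toy.groupby('ST_Slope')['HeartDisease'].agg(rate='mean', n='count')

print(summary)
```

The same call works for `ChestPainType` or any other categorical column by swapping the group key.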