# Heart Disease Prediction Model - Exploratory Analysis #
## Author: Madison Little ##
## Date: Sep. 4, 2024 ##

This initial report explores the data's quality, completeness, and distribution.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from charts import categorical_ct
from charts import grouped_categorical_ct

The two functions imported above, categorical_ct and grouped_categorical_ct, are used throughout the report to create similar charts for each of the categorical variables.
I defined them in this notebook so I would not have to read the categorical_mapping.json file each time the function is called or pass the json data as a parameter to the function.  This way, the functions can access the mapping as a global variable.

In [2]:
df = pd.read_csv('data/heart_disease_prediction.csv')
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [3]:
print(f"Features: {len(df.columns)}")
print(f"Observations: {len(df)}")

Features: 12
Observations: 918


In [4]:
df.describe(include='all')

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
count,918.0,918,918,918.0,918.0,918.0,918,918.0,918,918.0,918,918.0
unique,,2,4,,,,3,,2,,3,
top,,M,ASY,,,,Normal,,N,,Flat,
freq,,725,496,,,,552,,547,,460,
mean,53.510893,,,132.396514,198.799564,0.233115,,136.809368,,0.887364,,0.553377
std,9.432617,,,18.514154,109.384145,0.423046,,25.460334,,1.06657,,0.497414
min,28.0,,,0.0,0.0,0.0,,60.0,,-2.6,,0.0
25%,47.0,,,120.0,173.25,0.0,,120.0,,0.0,,0.0
50%,54.0,,,130.0,223.0,0.0,,138.0,,0.6,,1.0
75%,60.0,,,140.0,267.0,0.0,,156.0,,1.5,,1.0


## Observations ##
Patient ages range from 28 to 77, with a mean of about 53.5 years.

RestingBP and Cholesterol appear to have missing values: their minimums are 0.0, a value that is not plausible for either measurement.

MaxHR is more ambiguous.  There is a minimum value of 60 BPM, but the lower quartile is 120 BPM. This may mean that 60 is a placeholder value for an unrecorded heart rate. If these heart rates were achieved during exercise, it is unlikely that 60 BPM was a recorded value.

The Cholesterol column has a max value of 603, which is about 2.5 times the 240 mg/dl level generally considered "high" cholesterol.  Based on the IQR, this is an extreme outlier.  According to the American Heart Association, a cholesterol level this high can be caused by a genetic disease called familial hypercholesterolemia.
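
The IQR claim can be checked by hand. The sketch below is an illustration rather than part of the original analysis: it copies the Cholesterol quartiles from the describe() output and applies the conventional Tukey fences (1.5 x IQR for mild outliers, 3 x IQR for extreme ones). Note that these quartiles still include the 0 placeholders, so the fences would shift somewhat after cleaning.

```python
# Quartiles copied from the describe() output above (Cholesterol column).
q1, q3 = 173.25, 267.0
iqr = q3 - q1

# Tukey fences: values above q3 + 1.5 * IQR are mild outliers,
# values above q3 + 3 * IQR are extreme outliers.
mild_upper = q3 + 1.5 * iqr
extreme_upper = q3 + 3.0 * iqr

max_cholesterol = 603  # maximum value reported in the text

print(f"IQR: {iqr}")
print(f"Mild upper fence: {mild_upper}")
print(f"Extreme upper fence: {extreme_upper}")
print(f"Max is an extreme outlier: {max_cholesterol > extreme_upper}")
```

With these quartiles the extreme fence is 548.25, so a reading of 603 falls beyond it.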


In [5]:
fig = categorical_ct(df, 'Sex')
fig.update_layout(
    title="Sex of Patients",
    xaxis_title="Sex",
    yaxis_title="Count of Patients")


There are more male patients represented in the data than female patients (~79% men).

In [6]:
fig = categorical_ct(df, 'ChestPainType')
fig.update_layout(
    title="Type of Patient Chest Pain",
    xaxis_title="Chest Pain Type",
    yaxis_title="Count of Patients")

Asymptomatic (meaning no chest pain) is the most common chest pain type in the data, with 496 of 918 patients.

In [7]:
fig = categorical_ct(df, 'HeartDisease')
fig.update_layout(
    title="Heart Disease Cases",
    xaxis_title="Test Result",
    yaxis_title="Count of Patients")

The dataset is fairly evenly split between patients with heart disease (about 55%) and those without (about 45%). This means we have ample data for both classes, and we should be able to classify a test patient into one of the two categories.
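
Because HeartDisease is coded as 0/1, its mean in the describe() output is the share of positive cases. The short sketch below recovers approximate class counts from that mean and the row count; the variable names are only illustrative.

```python
# Values copied from the describe() output above.
n_patients = 918
positive_share = 0.553377  # mean of the 0/1 HeartDisease column

# The mean of a binary column equals the fraction of rows equal to 1.
positives = round(positive_share * n_patients)
negatives = n_patients - positives

print(f"With heart disease: {positives} ({positives / n_patients:.1%})")
print(f"Without heart disease: {negatives} ({negatives / n_patients:.1%})")
```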

In [8]:
fig = categorical_ct(df, 'FastingBS')
fig.update_layout(
    title="Patient Blood Sugar Levels",
    xaxis_title="Blood Sugar Range",
    yaxis_title="Count of Patients")

The data is already split into two categorical blood sugar ranges: above or below 120 mg/dl.  A quick search revealed that the normal range is under 120 mg/dl, so this report treats the lower range as non-diabetic.  Most patients in the data (about 77%) fall in that range.

In [9]:

fig = categorical_ct(df, 'RestingECG')
fig.update_layout(
    title='Patient Resting Electrocardiogram Results',
    xaxis_title="Test Result",
    yaxis_title="Count of Patients")

Most patients (552 of 918, about 60%) have a "normal" ECG test result. The other two categories are both "abnormal" readings.  If we combine the two abnormal groups, we get a much closer split between normal and abnormal ECG readings.
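
One way to combine the two abnormal groups is sketched below. It uses a small made-up sample rather than the real data, and the 'Abnormal' label is a hypothetical name chosen for illustration; in the notebook the same idea would apply to df['RestingECG'].

```python
import pandas as pd

# Hypothetical toy sample using the three category labels seen in the data.
ecg = pd.Series(['Normal', 'ST', 'Normal', 'LVH', 'Normal', 'ST'], name='RestingECG')

# Keep 'Normal' as-is and collapse every other reading into one 'Abnormal' label.
ecg_binary = ecg.where(ecg == 'Normal', 'Abnormal')

counts = ecg_binary.value_counts()
print(counts)
```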

In [10]:

fig = categorical_ct(df, 'ExerciseAngina')
fig.update_layout(
    title='Presence of Exercise-Induced Angina',
    xaxis_title="Angina Present?",
    yaxis_title="Count of Patients")


Most patients do not experience angina after exercise, but the split is fairly close (547 of 918, about 60%, answered "N").  It is unclear how this metric relates to the "ChestPainType" data.

In [11]:
fig = categorical_ct(df, 'ST_Slope')
fig.update_layout(
    title="Patients' ECG ST-Segment Slope",
    xaxis_title="Test Result",
    yaxis_title="Count of Patients")

In [12]:
fig = grouped_categorical_ct(df, grouper='Sex', counter='HeartDisease', colors=['red', 'green'])
fig.update_layout(
    title='Heart Disease in Patients by Sex',
    xaxis_title="Sex",
    yaxis_title="Count of Patients",
    legend_title="Heart Disease Result")


In [13]:
fig = grouped_categorical_ct(df, grouper='ChestPainType', counter='HeartDisease', colors=['red', 'green'])
fig.update_layout(
    title='Presence of Heart Disease By Chest Pain Type',
    xaxis_title="Chest Pain Type",
    yaxis_title="Count of Patients",
    legend_title="Heart Disease Result")

77% of all heart disease cases had no chest pain symptoms. Strangely, a greater percentage of healthy patients reported chest pain than patients with heart disease.  Non-anginal pain was the most common type of chest pain reported by patients with heart disease.  The most common pain reported by healthy patients was atypical angina. Typical angina was the least reported chest pain type overall, with almost equal numbers of healthy and heart disease patients.
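
Percentages like these can be read from a normalized cross-tabulation instead of being estimated from the chart. The sketch below uses a small made-up sample, so its numbers do not match the real data; in the notebook the same call would take df['ChestPainType'] and df['HeartDisease'].

```python
import pandas as pd

# Hypothetical toy sample; not the real dataset.
toy = pd.DataFrame({
    'ChestPainType': ['ASY', 'ASY', 'ASY', 'NAP', 'ATA', 'ATA', 'NAP', 'TA'],
    'HeartDisease':  [1, 1, 1, 1, 0, 0, 0, 0],
})

# normalize='columns' rescales each HeartDisease column to sum to 1, so each
# cell is the share of that class reporting a given chest pain type.
shares = pd.crosstab(toy['ChestPainType'], toy['HeartDisease'], normalize='columns')
print(shares.round(2))
```

Reading down the HeartDisease = 1 column gives the share of heart disease cases in each chest pain category.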

In [14]:
fig = grouped_categorical_ct(df, grouper='FastingBS', counter='HeartDisease', colors=['red', 'green'])
fig.update_layout(
    title='Heart Disease in Patients by Blood Sugar Level',
    xaxis_title="Blood Sugar Range",
    yaxis_title="Count of Patients",
    legend_title="Heart Disease Result")

Non-diabetic patients are more common than diabetic patients overall.  Among non-diabetic patients, there are about the same number of patients with and without heart disease.  However, a much larger proportion of diabetic patients also have heart disease.  While the absence of diabetes does not imply the absence of heart disease, the presence of diabetes is a strong indicator of heart disease in this data.

In [15]:
fig = grouped_categorical_ct(df, grouper='RestingECG', counter='HeartDisease', colors=['red', 'green'])
fig.update_layout(
    title='Heart Disease in Patients by ECG Result',
    xaxis_title="ECG Reading",
    yaxis_title="Count of Patients",
    legend_title="Heart Disease Result")

There are fairly similar distributions of healthy and heart disease patients across the ECG reading categories.  The ST-T wave abnormality group has the most drastic difference in frequency.  It is more common for a person with heart disease to display an ST-T wave abnormality than a healthy patient, which could indicate another symptom or risk factor for heart disease.

In [16]:
fig = grouped_categorical_ct(df, grouper='ExerciseAngina', counter='HeartDisease', colors=['red', 'green'])
fig.update_layout(
    title='Exercise-Induced Angina and Heart Disease',
    xaxis_title="Angina Induced by Exercise?",
    yaxis_title="Count of Patients",
    legend_title="Heart Disease Result")

Most patients who have heart disease experience exercise-induced angina, and very few healthy patients display this symptom.  This may be a good feature to use in our predictive model. It is important to note that there is still a significant portion of patients with heart disease who do not experience angina after exercising (false negative risk).

In [17]:
fig = grouped_categorical_ct(df, grouper='ST_Slope', counter='HeartDisease', colors=['red', 'green'])
fig.update_layout(
    title='Heart Disease in Patients by ST-Segment Slope',
    xaxis_title="ST-Segment Slope",
    yaxis_title="Count of Patients",
    legend_title="Heart Disease Result")

After researching this measurement, I found that an upward slope of the ST-segment of an ECG reading is considered normal.  Downward or horizontal (flat) slopes are considered ST-segment depressions.

A large majority of those with heart disease display a flat ST-segment slope, while most healthy patients display a normal upward slope.  There are fewer patients overall that display a downward slope, but this is more common among those with heart disease rather than healthy individuals. Some healthy patients display a flat ST-segment slope. This could be an indicator that the patient is at risk of heart disease given that the upslope is more common among healthy individuals. More grouping analysis is needed for this metric, but it could make another strong feature for my model.

In [18]:
# Cleaning the values that don't make sense: RestingBP and Cholesterol
df[df['RestingBP']==0]

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
449,55,M,NAP,0,0,0,Normal,155,N,1.5,Flat,1


Only one patient has a recorded resting BP of 0, so we will remove that record.

In [19]:
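# .copy() gives an independent frame, so adding columns later does not raise SettingWithCopyWarning
df = df[df['RestingBP'] != 0].copy()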

In [20]:
df[df['Cholesterol']==0]

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
293,65,M,ASY,115,0,0,Normal,93,Y,0.0,Flat,1
294,32,M,TA,95,0,1,Normal,127,N,0.7,Up,1
295,61,M,ASY,105,0,1,Normal,110,Y,1.5,Up,1
296,50,M,ASY,145,0,1,Normal,139,Y,0.7,Flat,1
297,57,M,ASY,110,0,1,ST,131,Y,1.4,Up,1
...,...,...,...,...,...,...,...,...,...,...,...,...
514,43,M,ASY,122,0,0,Normal,120,N,0.5,Up,1
515,63,M,NAP,130,0,1,ST,160,N,3.0,Flat,0
518,48,M,NAP,102,0,1,ST,110,Y,1.0,Down,1
535,56,M,ASY,130,0,0,LVH,122,Y,1.0,Flat,1


Far more patients are missing a cholesterol reading.  We will fill these values with the median cholesterol of patients who share the same sex, age group, and heart disease status.

In [21]:
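# Bucket ages into 5-year ranges so medians come from comparable age groups
age_bins = list(range(0, 100, 5))
df['AgeRange'] = pd.cut(df['Age'], bins=age_bins)
# Compute group medians from recorded (non-zero) cholesterol values only
df_reduced = df[df['Cholesterol']!=0]
medians = df_reduced.groupby(['Sex', 'AgeRange', 'HeartDisease'], observed=True).agg(MedCholesterol=('Cholesterol', 'median'))
# Attach each patient's group median, then use it to replace the 0 placeholders
df = df.merge(medians, how='left', on=['Sex', 'AgeRange', 'HeartDisease'])
df['Cholesterol'] = np.where(df['Cholesterol'] == 0, df['MedCholesterol'], df['Cholesterol'])
df = df.drop(columns=['MedCholesterol'])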
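
A group with no recorded cholesterol value has no median, so a left merge like the one above would leave NaN for its patients. The sketch below shows one possible way to handle that case on a small made-up frame: it is an alternative formulation using groupby().transform(), with a fallback to the overall median that is an assumption of this example, not something the original analysis does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample; 0 marks a missing cholesterol reading.
toy = pd.DataFrame({
    'Sex':          ['M', 'M', 'M', 'F', 'F'],
    'HeartDisease': [1, 1, 1, 0, 0],
    'Cholesterol':  [200, 0, 260, 0, 0],
})

# Treat the 0 placeholder as missing so it is ignored by median().
chol = toy['Cholesterol'].replace(0, np.nan)

# Median of recorded values within each (Sex, HeartDisease) group, aligned
# to the original rows; NaN where a group has no recorded value at all.
group_median = chol.groupby([toy['Sex'], toy['HeartDisease']]).transform('median')

# Fill from the group median first, then fall back to the overall median.
toy['Cholesterol'] = chol.fillna(group_median).fillna(chol.median())

print(toy)
print("Remaining missing values:", toy['Cholesterol'].isna().sum())
```

Here the female group has no recorded readings, so its two rows receive the overall median instead of staying empty.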


For the model, I will look at Age, Sex, ChestPainType, Cholesterol, ExerciseAngina, ST_Slope, and FastingBS as features.
First, I will convert the categorical variables into binary dummy variables. Sex and ExerciseAngina are binary, but need to be converted to integer values.  ChestPainType and ST_Slope need to be converted into dummy variables.

In [22]:
features = ['Age', 'Sex', 'ChestPainType', 'Cholesterol', 'ExerciseAngina', 'ST_Slope', 'FastingBS']

In [23]:
result = pd.get_dummies(data=df[features], prefix=['ChestPainType', 'ST_Slope'], 
               columns=['ChestPainType', 'ST_Slope'], dtype=int)

# Dropping the categories considered baseline/normal to avoid dummy variable trap
result = result.drop(columns=['ChestPainType_ASY', 'ST_Slope_Up'])
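
To illustrate why one category is dropped, the sketch below encodes a small made-up ST_Slope column. Before the drop, the indicator columns in every row sum to 1, which makes them redundant with a model intercept; after the drop, a row of all zeros stands for the baseline category ('Up').

```python
import pandas as pd

# Hypothetical toy sample using the three ST_Slope labels from the data.
toy = pd.DataFrame({'ST_Slope': ['Up', 'Flat', 'Down', 'Up']})

dummies = pd.get_dummies(toy, columns=['ST_Slope'], prefix=['ST_Slope'], dtype=int)

# Each row has exactly one indicator set, so the columns always sum to 1.
row_sums = dummies.sum(axis=1)

# Drop the baseline; 'Up' is now represented by zeros in the remaining columns.
encoded = dummies.drop(columns=['ST_Slope_Up'])
print(encoded)
```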

In [24]:
result['Sex'] = result['Sex'].map({'M':0, 'F':1})
result['ExerciseAngina'] = result['ExerciseAngina'].map({'N':0, 'Y':1})
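
Series.map returns NaN for any value that is not a key in the dictionary, so an unexpected label would silently become a missing value. A quick check like the sketch below, shown on a made-up series with a deliberately mistyped label, can catch that before modeling.

```python
import pandas as pd

# Hypothetical toy sample; 'm' is a deliberately inconsistent label.
raw = pd.Series(['M', 'F', 'm', 'F'], name='Sex')
mapping = {'M': 0, 'F': 1}

encoded = raw.map(mapping)

# Labels absent from the mapping come back as NaN; list them for review.
unmapped = sorted(raw[encoded.isna()].unique())
print("Unmapped labels:", unmapped)
```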


In [25]:
result

Unnamed: 0,Age,Sex,Cholesterol,ExerciseAngina,FastingBS,ChestPainType_ATA,ChestPainType_NAP,ChestPainType_TA,ST_Slope_Down,ST_Slope_Flat
0,40,0,289.0,0,0,1,0,0,0,0
1,49,1,180.0,0,0,0,1,0,0,1
2,37,0,283.0,0,0,1,0,0,0,0
3,48,1,214.0,1,0,0,0,0,0,1
4,54,0,195.0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...
912,45,0,264.0,0,0,0,0,1,0,1
913,68,0,193.0,0,1,0,0,0,0,1
914,57,0,131.0,1,0,0,0,0,0,1
915,57,1,236.0,0,0,1,0,0,0,1
