# Heart Disease Prediction Model - Exploratory Analysis #
## Author: Madison Little ##
## Date: Sep. 4, 2024 ##

This initial report explores the data's quality, completeness, and distribution.

In [154]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import json

In [155]:
# The mappings file contains expanded versions of the 
# categorical values found in the data.  This helps the 
# charts to be more readable.
with open('data/categorical_mappings.json', 'r') as f:
    mappings = json.load(f)

In [156]:
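def categorical_ct(df:pd.DataFrame, field:str):
    '''
    Chart the frequency of each category within a categorical field.

    Input: dataframe and the name of the categorical field
    Output: bar chart (the caller sets titles with fig.update_layout)
    '''
    # Expand the raw codes into readable labels, then count each category
    data = df[field].replace(mappings[field]).value_counts()
    # Pass plain lists so the chart does not depend on the length of df
    fig = px.bar(x=data.index.tolist(),
                 y=data.tolist())
    return fig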
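
To make the replace-then-count step inside `categorical_ct` concrete, here is a minimal sketch on a toy frame. The `toy_mappings` dictionary is a hypothetical stand-in: the real contents of `data/categorical_mappings.json` are not shown in this report, so its shape is an assumption.

```python
import pandas as pd

# Hypothetical mapping, assumed to have the shape
# {field name: {raw code: readable label}}
toy_mappings = {"Sex": {"M": "Male", "F": "Female"}}

toy = pd.DataFrame({"Sex": ["M", "F", "M", "M"]})

# Same steps as categorical_ct: expand the codes, then count each category
counts = toy["Sex"].replace(toy_mappings["Sex"]).value_counts().to_dict()
print(counts)
```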
    

In [157]:
def grouped_categorical_ct(df:pd.DataFrame, grouper:str, counter:str, colors:list=None):
    '''
    Chart two categorical variables, where one is a grouping condition 
    and the other is counted (frequency).
    Ex. Split data into those with and without heart disease. Then, 
    count how many men and women are in each group.

    Input: the dataframe, the name of the field to group by, and the 
    name of the field to count. Optional: list of bar colors in order
    Output: grouped bar chart
    '''
    # one row per (grouper, counter) pair with its frequency
    df = df.groupby([grouper, counter]).size().reset_index(name='Count')

    #ensures that bars appear in consistent order across charts
    df = df.sort_values([counter],ascending=[False]) 
    
    df[grouper] = df[grouper].replace(mappings[grouper])
    df[counter] = df[counter].replace(mappings[counter])

    fig = px.bar(df, 
                x=grouper,
                y="Count",
                color_discrete_sequence=colors,
                color=counter,
                barmode='group')
    return fig
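
The table that `grouped_categorical_ct` hands to `px.bar` is in long format, with one row per (grouper, counter) pair. A minimal sketch on made-up rows, using `groupby(...).size()` as one common way to build such a table:

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "F"],
    "HeartDisease": ["1", "0", "0", "1", "1"],
})

# One row per (Sex, HeartDisease) pair with its frequency
grouped = (
    toy.groupby(["Sex", "HeartDisease"])
       .size()
       .reset_index(name="Count")
)
print(grouped)
```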

In [158]:
df = pd.read_csv('data/heart_disease_prediction.csv')
df.head()

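| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |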


In [159]:
print(f"Features: {len(df.columns)}")
print(f"Observations: {len(df)}")

Features: 12
Observations: 918


In [160]:
# Cast the following binary fields from int to string so they are treated as categorical values
df['FastingBS'] = df['FastingBS'].astype('str')
df['HeartDisease'] = df['HeartDisease'].astype('str')

In [161]:
df.describe(include='all')

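| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 918.0 | 918 | 918 | 918.0 | 918.0 | 918.0 | 918 | 918.0 | 918 | 918.0 | 918 | 918.0 |
| unique | | 2 | 4 | | | 2.0 | 3 | | 2 | | 3 | 2.0 |
| top | | M | ASY | | | 0.0 | Normal | | N | | Flat | 1.0 |
| freq | | 725 | 496 | | | 704.0 | 552 | | 547 | | 460 | 508.0 |
| mean | 53.510893 | | | 132.396514 | 198.799564 | | | 136.809368 | | 0.887364 | | |
| std | 9.432617 | | | 18.514154 | 109.384145 | | | 25.460334 | | 1.06657 | | |
| min | 28.0 | | | 0.0 | 0.0 | | | 60.0 | | -2.6 | | |
| 25% | 47.0 | | | 120.0 | 173.25 | | | 120.0 | | 0.0 | | |
| 50% | 54.0 | | | 130.0 | 223.0 | | | 138.0 | | 0.6 | | |
| 75% | 60.0 | | | 140.0 | 267.0 | | | 156.0 | | 1.5 | | |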


## Observations ##
Patient ages range from 28 to 77, with a mean of about 53.5 years.

RestingBP and Cholesterol appear to contain missing values recorded as zeros: both have a minimum of 0.0, which is not a plausible reading for either measurement.
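
One way to size this problem is to count the zeros in each column and recode them as missing. The sketch below uses a made-up three-row frame rather than the real file, and treating every zero as missing is an assumption that should be checked against the data source.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "RestingBP": [140, 0, 130],
    "Cholesterol": [289, 0, 0],
})
cols = ["RestingBP", "Cholesterol"]

# How many zeros does each measurement contain?
zero_counts = (toy[cols] == 0).sum()

# Recode zeros as NaN so they no longer distort means and quartiles
cleaned = toy[cols].replace(0, np.nan)

print(zero_counts.to_dict())
print(cleaned.isna().sum().to_dict())
```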

MaxHR is more ambiguous. The minimum is 60 BPM, while the lower quartile is 120 BPM. This may mean that 60 is a placeholder for an unrecorded heart rate: if these heart rates were reached during exercise, it is unlikely that 60 BPM was a genuine reading.

The cholesterol column has a max value of 603, roughly 2.5 times the 240 mg/dL level that is generally considered "high" cholesterol. Based on the IQR, this is an extreme outlier. According to the American Heart Association, cholesterol levels this high can be caused by an inherited condition called familial hypercholesterolemia.
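
The IQR claim can be checked by hand with Tukey's fences, using the Cholesterol quartiles from the `describe()` output above. This is a sketch of the arithmetic only; note that those quartiles still include the zero placeholders, so the fences would shift somewhat once the zeros are handled.

```python
# Cholesterol quartiles, copied from the describe() output
q1, q3 = 173.25, 267.0
iqr = q3 - q1

# Tukey's fences: beyond 1.5 * IQR is an outlier, beyond 3 * IQR is extreme
upper_fence = q3 + 1.5 * iqr
extreme_fence = q3 + 3.0 * iqr

max_cholesterol = 603  # maximum reported in the observations

print(iqr, upper_fence, extreme_fence)
print(max_cholesterol > extreme_fence)
```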


In [162]:
fig = categorical_ct(df, 'Sex')
fig.update_layout(
    title="Sex of Patients",
    xaxis_title="Sex",
    yaxis_title="Count of Patients")


Male patients far outnumber female patients in the data (725 of 918, about 79% men).

In [163]:
fig = categorical_ct(df, 'ChestPainType')
fig.update_layout(
    title="Type of Patient Chest Pain",
    xaxis_title="Chest Pain Type",
    yaxis_title="Count of Patients")

Asymptomatic (meaning no chest pain) is the most common category in the data, covering 496 of 918 patients (about 54%).

In [164]:
fig = categorical_ct(df, 'HeartDisease')
fig.update_layout(
    title="Heart Disease Cases",
    xaxis_title="Test Result",
    yaxis_title="Count of Patients")

The dataset is fairly evenly split: 508 patients (about 55%) have heart disease and 410 (about 45%) do not. Both classes are well represented, so a classifier should have enough examples of each to learn to assign a test patient to one of the two categories.
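
The class shares can be read off with `value_counts(normalize=True)`. The sketch below rebuilds the class column from the counts in the `describe()` output (508 of 918 patients have `HeartDisease` equal to 1) instead of loading the file.

```python
import pandas as pd

# Rebuild the class column from the describe() counts:
# 508 of 918 patients have HeartDisease == 1, so 410 have 0
labels = pd.Series(["1"] * 508 + ["0"] * (918 - 508), name="HeartDisease")

# normalize=True returns shares instead of raw counts
shares = labels.value_counts(normalize=True).round(3)
print(shares.to_dict())
```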

In [165]:
fig = categorical_ct(df, 'FastingBS')
fig.update_layout(
    title="Patient Blood Sugar Levels",
    xaxis_title="Blood Sugar Range",
    yaxis_title="Count of Patients")

The data is already split into two categorical blood sugar ranges: above or below 120 mg/dl. A quick search suggested that the normal range is under 120 mg/dl, so most patients in the data (704 of 918, about 77%) are likely non-diabetic.

In [166]:

fig = categorical_ct(df, 'RestingECG')
fig.update_layout(
    title='Patient Resting Electrocardiogram Results',
    xaxis_title="Test Result",
    yaxis_title="Count of Patients")

Most patients (552 of 918, about 60%) have a "normal" ECG test result. The other two categories are both "abnormal". Combining them gives 366 abnormal readings, a much closer split between normal and abnormal.
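
Collapsing the two abnormal categories into one could look like the sketch below. The rows are made up, and `"OtherAbnormal"` is a hypothetical placeholder for the third category, whose raw code does not appear in the preview rows.

```python
import pandas as pd

# Made-up rows; "OtherAbnormal" is a hypothetical placeholder label
ecg = pd.Series(["Normal", "ST", "Normal", "OtherAbnormal", "ST"])

# Keep "Normal" as-is and relabel everything else as "Abnormal"
binary = ecg.where(ecg == "Normal", "Abnormal")
print(binary.value_counts().to_dict())

# With the real counts from describe(): 552 normal readings out of 918
abnormal = 918 - 552
print(abnormal)
```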

In [167]:

fig = categorical_ct(df, 'ExerciseAngina')
fig.update_layout(
    title='Presence of Exercise-Induced Angina',
    xaxis_title="Angina Present?",
    yaxis_title="Count of Patients")


Most patients do not experience exercise-induced angina (547 of 918, about 60%), though a substantial minority do. It is unclear how this metric relates to the "ChestPainType" data.

In [168]:
fig = categorical_ct(df, 'ST_Slope')
fig.update_layout(
    title='Patients\' ECG ST-Segment Slope',
    xaxis_title="Test Result",
    yaxis_title="Count of Patients")

In [169]:
fig = grouped_categorical_ct(df, grouper='Sex', counter='HeartDisease', colors=['red', 'green'])
fig.update_layout(
    title='Heart Disease in Patients by Sex',
    xaxis_title="Sex",
    yaxis_title="Count of Patients",
    legend_title="Heart Disease Result")


In [175]:
fig = grouped_categorical_ct(df, grouper='ChestPainType', counter='HeartDisease', colors=['red', 'green'])
fig.update_layout(
    title='Presence of Heart Disease By Chest Pain Type',
    xaxis_title="Chest Pain Type",
    yaxis_title="Count of Patients",
    legend_title="Heart Disease Result")

77% of all heart disease cases had no chest pain symptoms. Strangely, more healthy patients reported chest pain than patients with heart disease.  Non-anginal pain was the most common type of chest pain reported by patients with heart disease.  The most common pain reported by healthy patients was atypical angina. Typical angina was the least reported chest pain type overall, with almost equal numbers of healthy and heart disease patients.
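
Percentages like the one above can be computed with a column-normalized cross-tabulation instead of being read off the chart. A minimal sketch on made-up rows, not the real data:

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "ChestPainType": ["ASY", "ASY", "ASY", "NAP", "ATA", "ATA"],
    "HeartDisease":  ["1",   "1",   "0",   "1",   "0",   "0"],
})

# normalize="columns": each HeartDisease column sums to 1, so a cell is the
# share of that outcome group that falls in a given chest pain type
shares = pd.crosstab(toy["ChestPainType"], toy["HeartDisease"],
                     normalize="columns")
print(shares.round(2))
```

In this toy frame, two of the three heart disease cases are asymptomatic, so the `ASY` row of column `"1"` is about 0.67.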

In [176]:
fig = grouped_categorical_ct(df, grouper='FastingBS', counter='HeartDisease', colors=['red', 'green'])
fig.update_layout(
    title='Heart Disease in Patients by Blood Sugar Level',
    xaxis_title="Blood Sugar Range",
    yaxis_title="Count of Patients",
    legend_title="Heart Disease Result")

Non-diabetic patients are more common than diabetic patients overall. Among non-diabetic patients, there are about the same number of patients with and without heart disease. However, a much larger proportion of diabetic patients also have heart disease. High blood sugar may therefore be associated with heart disease.
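
Because the two blood sugar groups differ so much in size, comparing the heart disease rate within each group is more informative than comparing raw bar heights. A sketch on made-up rows:

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "FastingBS":    ["0", "0", "0", "0", "1", "1", "1", "1"],
    "HeartDisease": ["1", "0", "1", "0", "1", "1", "1", "0"],
})

# Share of patients with heart disease inside each blood sugar group.
# Both columns are strings in this notebook, so compare against "1" first.
rate = toy["HeartDisease"].eq("1").groupby(toy["FastingBS"]).mean()
print(rate.to_dict())
```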

In [177]:
fig = grouped_categorical_ct(df, grouper='RestingECG', counter='HeartDisease', colors=['red', 'green'])
fig.update_layout(
    title='Heart Disease in Patients by ECG Result',
    xaxis_title="ECG Reading",
    yaxis_title="Count of Patients",
    legend_title="Heart Disease Result")

There are fairly similar distributions of healthy and heart disease patients across the ECG reading categories.  The ST-T wave abnormality group has the most drastic difference in frequency.  It is more common for a person with heart disease to display an ST-T wave abnormality than a healthy patient, which could indicate another symptom or risk factor for heart disease.

In [178]:
fig = grouped_categorical_ct(df, grouper='ExerciseAngina', counter='HeartDisease', colors=['red', 'green'])
fig.update_layout(
    title='Exercise-Induced Angina and Heart Disease',
    xaxis_title="Angina Induced by Exercise?",
    yaxis_title="Count of Patients",
    legend_title="Heart Disease Result")

Most patients who have heart disease experience exercise-induced angina, and very few healthy patients display this symptom. This may be a good feature to use in our predictive model. It is important to note that a significant portion of patients with heart disease still do not experience angina during exercise, so relying on this feature alone risks false negatives.

In [179]:
fig = grouped_categorical_ct(df, grouper='ST_Slope', counter='HeartDisease', colors=['red', 'green'])
fig.update_layout(
    title='Heart Disease in Patients by ST-Segment Slope',
    xaxis_title="ST-Segment Slope",
    yaxis_title="Count of Patients",
    legend_title="Heart Disease Result")

After researching this measurement, I found that an upward slope of the ST-segment of an ECG reading is considered normal.  Downward or horizontal (flat) slopes are considered ST-segment depressions.

A large majority of those with heart disease display a flat ST-segment slope, while most healthy patients display a normal upward slope. Fewer patients overall display a downward slope, but it is more common among those with heart disease than among healthy individuals. Some healthy patients display a flat ST-segment slope. This could indicate that the patient is at risk of heart disease, given that the upward slope is more common among healthy individuals. More grouping analysis is needed for this metric, but it could make another strong feature for the model.