In [1]:
import sys
import os

sys.path.append(os.path.abspath('..'))

from src.visualization import (
    plot_continuous_distribution, 
    plot_categorical_ratio, 
    plot_outlier_analysis,
    plot_correlation_heatmap
)


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('../data/raw/heart_failure_clinical_records_dataset.csv')

sns.countplot(x='DEATH_EVENT', data=df)
plt.title('Distribution of survival (0) vs. death (1)')
plt.show()


In [None]:
summary = df.describe().T 
display(summary)

Based on the probability density function (PDF) of age, most survivors are around age 60, while the distribution for patients who died has a broader right tail extending toward age 90.

From here on, I will use a smoothed alternative to histograms: the kernel density estimate (KDE).
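
To make concrete what a KDE computes, here is a minimal hand-written sketch with NumPy. The function name `gaussian_kde_1d`, the bandwidth, and the toy ages are my own assumptions for illustration; the plots in this notebook come from `plot_continuous_distribution`, whose internals are not shown here.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Hypothetical ages, not taken from the dataset
toy_ages = np.array([45, 50, 55, 60, 60, 62, 65, 70, 75, 90])
grid = np.linspace(0, 140, 1401)
density = gaussian_kde_1d(toy_ages, grid, bandwidth=5.0)

# A density should integrate to approximately 1 (simple Riemann sum)
step = grid[1] - grid[0]
area = float(density.sum() * step)
print(round(area, 3))
```

The bandwidth controls the smoothing: a larger value merges nearby bumps into one broad peak, a smaller value follows individual points more closely.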


#1. According to some medical research, the most important factors are age, ejection fraction and serum creatinine.

I will plot a probability density function for each group so that the shapes can be compared fairly even though the groups differ in size. As the survival count plot above showed, there are approximately twice as many survivors as deaths in our dataset.
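
A small sketch of why density scaling matters when group sizes differ. The two groups below are synthetic random draws (not the dataset), and the 2:1 size ratio only mimics the imbalance described above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical groups with a 2:1 size ratio
survived = rng.normal(60, 10, size=200)
died = rng.normal(65, 13, size=100)

bins = np.linspace(0, 130, 27)  # 26 bins, each 5 years wide
width = bins[1] - bins[0]

# Raw counts depend on group size
counts_s, _ = np.histogram(survived, bins=bins)
counts_d, _ = np.histogram(died, bins=bins)

# density=True rescales each histogram so that its area is 1
dens_s, _ = np.histogram(survived, bins=bins, density=True)
dens_d, _ = np.histogram(died, bins=bins, density=True)

area_s = float((dens_s * width).sum())
area_d = float((dens_d * width).sum())
print(counts_s.sum(), counts_d.sum())
print(area_s, area_d)
```

With raw counts the larger group dominates every bin; after density scaling both curves enclose the same area, so only their shapes differ.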

In [None]:
# 1. age, ejection fraction and serum creatinine

for col in ['age', 'ejection_fraction', 'serum_creatinine']:
    plot_continuous_distribution(df, col)




Age distribution:
The KDE plot shows a clear distribution shift: in this dataset, older patients have a higher probability of the death event.
The "Dead" distribution appears stochastically larger than the "Survivors" distribution, meaning its mass sits at higher ages.
The "Survivors" distribution is closer to a normal shape, while the "Dead" distribution has a higher mean and a higher variance.

Ejection fraction distribution: the plot shows that patients with a low ejection fraction (below 30%) were more likely to die. The "Dead" curve peaks at approximately 25%. The "Survivors" distribution is bimodal: it has a large peak at around 40% (early but stable heart failure) and a second, much smaller peak at around 60% (a healthy value). The two curves suggest a "critical threshold" at approximately 35%.
Below that value, the density of the red "Dead" curve is much higher than that of the blue "Survivors" curve. This makes an ejection fraction of 35% a natural candidate cutoff for survival prediction, although a visual crossover is not by itself a test of statistical significance.
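
One way to check the 35% threshold numerically is to compare death rates on each side of it. The sketch below uses a handful of made-up records, so the rates it prints are illustrative only; with the real data, `toy` would be replaced by `df`. The group labels are my own naming.

```python
import pandas as pd

# Hypothetical toy records, not rows from the dataset
toy = pd.DataFrame({
    "ejection_fraction": [20, 25, 25, 30, 35, 38, 40, 45, 60, 60],
    "DEATH_EVENT":       [1,  1,  0,  1,  0,  1,  0,  0,  0,  0],
})

threshold = 35  # candidate cutoff read off the KDE plot
low_ef = toy["ejection_fraction"] < threshold
ef_group = low_ef.map({True: "EF < 35", False: "EF >= 35"}).rename("ef_group")

# The mean of a 0/1 column is the death rate within each group
death_rate = toy.groupby(ef_group)["DEATH_EVENT"].mean()
print(death_rate)
```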

Serum creatinine:
The "Survivors" distribution has low variance: it is very narrow and peaks at approximately 1.
The "Dead" distribution, in contrast, has a higher variance and is heavily right-skewed, with a long tail that reaches 6.0 and beyond. Even a small increase in serum creatinine is therefore associated with a higher probability of the death event.
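
The visual impressions above (narrow vs. wide, symmetric vs. right-skewed) can be backed by per-group summary statistics. This is a sketch on made-up values, so the numbers are not results from the dataset; on the real data the same call would be applied to `df`.

```python
import pandas as pd

# Hypothetical serum creatinine values for six survivors (0) and six deaths (1)
toy = pd.DataFrame({
    "serum_creatinine": [0.8, 0.9, 1.0, 1.0, 1.1, 1.2,
                         0.9, 1.1, 1.3, 1.8, 2.5, 6.0],
    "DEATH_EVENT":      [0, 0, 0, 0, 0, 0,
                         1, 1, 1, 1, 1, 1],
})

# Median for location, variance for spread, skewness for tail asymmetry
spread = toy.groupby("DEATH_EVENT")["serum_creatinine"].agg(["median", "var", "skew"])
print(spread)
```

A positive skew that is clearly larger in one group is the numerical counterpart of the long right tail seen in the plot.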



In [None]:
#2. The messy variables
for col in ['platelets', 'creatinine_phosphokinase']:
    plot_outlier_analysis(df, col)

Platelets:
The boxplot relates platelet concentration in the blood to the death event. The medians of the "Survivors" and "Dead" groups are almost the same, and the box heights are close enough to be considered essentially identical. This suggests that platelet concentration does not have a strong effect on mortality in this dataset.
We also observe many data points above 400,000. Because this is medical data, they cannot simply be treated as noise and removed. Such high platelet counts can indicate thrombocytosis, which increases the risk of blood clots and can be dangerous for heart stability.

Creatine phosphokinase:
The same pattern appears here. The medians and box heights are almost identical, indicating that creatine phosphokinase does not play a significant role in mortality in this dataset. Both boxes sit near the bottom of the scale: most patients have low creatine phosphokinase levels, and the distribution is highly right-skewed.
However, there are also several extreme values (especially among survivors) that reach up to 6000, and some patients who died had values above 6000.
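
To count how many points a boxplot would draw as outliers, the usual 1.5 × IQR rule can be applied directly. This is a sketch: the helper `tukey_outliers` and the toy values are hypothetical, and I am assuming that `plot_outlier_analysis` draws standard boxplots with default whiskers.

```python
import pandas as pd

def tukey_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    s = pd.Series(values, dtype=float)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    mask = (s < q1 - k * iqr) | (s > q3 + k * iqr)
    return s[mask]

# Hypothetical creatine phosphokinase values, not taken from the dataset
toy_cpk = [60, 80, 110, 150, 250, 580, 600, 6500]
flagged = tukey_outliers(toy_cpk)
print(flagged.tolist())
```

Flagging a value this way only marks it for inspection; as noted above, extreme values in medical data may be genuine and should not be dropped automatically.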

In [None]:
#3. Categorical variables
for col in ['high_blood_pressure', 'smoking', 'diabetes', 'sex', 'anaemia']:
    plot_categorical_ratio(df, col)

In [None]:
# correlation heatmap
# I will use Spearman because the data is skewed and has outliers.
plot_correlation_heatmap(df, 'spearman')

With this, we have identified the three clinical factors most strongly correlated with the death event:
1. **Serum Creatinine - with a moderate positive correlation (0.37)**
 --> The kidney-heart link: high serum creatinine indicates kidney dysfunction, which strains the heart by causing fluid buildup and hypertension, increasing the risk of heart failure.
2. **Ejection Fraction - with a moderate negative correlation (-0.29)**
 --> A low value means the heart's main pumping chamber (left ventricle) is not pumping enough blood out to the body, which means a high risk of death.
3. **Age - with a weaker positive correlation (0.22)**
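
Spearman correlation is the Pearson correlation computed on ranks, which is why it is less sensitive to extreme values. The sketch below demonstrates this on made-up numbers; the helper `spearman_via_ranks` is my own illustration, not the code behind `plot_correlation_heatmap`.

```python
import pandas as pd

def spearman_via_ranks(frame, a, b):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    ranks = frame[[a, b]].rank()  # tied values receive their average rank
    return ranks[a].corr(ranks[b])

# Hypothetical toy records, not rows from the dataset
toy = pd.DataFrame({
    "serum_creatinine": [0.8, 0.9, 1.0, 1.1, 1.3, 1.9, 2.7, 9.0],
    "DEATH_EVENT":      [0,   0,   0,   1,   0,   1,   1,   1],
})

# Make the largest value ten times more extreme; the ordering stays the same
extreme = toy.copy()
extreme.loc[7, "serum_creatinine"] = 90.0

spearman = spearman_via_ranks(toy, "serum_creatinine", "DEATH_EVENT")
spearman_extreme = spearman_via_ranks(extreme, "serum_creatinine", "DEATH_EVENT")
pearson = toy["serum_creatinine"].corr(toy["DEATH_EVENT"])
pearson_extreme = extreme["serum_creatinine"].corr(extreme["DEATH_EVENT"])

print(spearman, spearman_extreme)  # identical, since the ranks did not change
print(pearson, pearson_extreme)    # differ, since Pearson uses the raw values
```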

Time has the strongest negative correlation on the heatmap (-0.54). The time variable records the number of days the patient was observed: the follow-up ended either when the patient died or when data collection stopped with the patient still alive. A longer follow-up therefore goes with a lower probability of death. This variable can cause "data leakage": a model can learn that a short follow-up time implies that the death event occurred, but the length of the follow-up is not known at the moment a prediction would be made for a real patient.
Including the "time" variable would give the model very high accuracy on this dataset, but the model would fail in a real-world scenario.

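A minimal sketch of how the leaky column could be excluded before modeling. The toy rows and the names `leaky_columns`, `X` and `y` are assumptions for illustration; in the notebook the same `drop` call would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy records with the same column names as the dataset
toy = pd.DataFrame({
    "age":               [55, 62, 70, 81],
    "ejection_fraction": [40, 25, 38, 20],
    "serum_creatinine":  [1.0, 1.9, 1.1, 2.7],
    "time":              [250, 12, 180, 8],
    "DEATH_EVENT":       [0, 1, 0, 1],
})

# Columns that would not be known when predicting for a new patient
leaky_columns = ["time"]

X = toy.drop(columns=["DEATH_EVENT"] + leaky_columns)  # features
y = toy["DEATH_EVENT"]                                  # target
print(list(X.columns))
```
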
In [None]:
# let's see for Pearson
# plot_correlation_heatmap(df, 'pearson')