# **Heart Failure Clinical Records Dataset: Exploratory Data Analysis (EDA)**

**Objective**  
Gain deep insight into the UCI Heart Failure Clinical Records dataset before modeling.

In [1]:
!pip install jupyter_dash ucimlrepo



In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns


from dash import html, dcc, Input, Output
from jupyter_dash import JupyterDash
from plotly.subplots import make_subplots
from ucimlrepo import fetch_ucirepo

sns.set_context("talk")
plt.rcParams["figure.figsize"] = (10,6)
RANDOM_STATE = 42

##  **Data Acquisition**

The dataset is fetched directly via `ucimlrepo` (ID 519), so each run retrieves the current copy from the UCI ML Repository.


In [3]:
repo = fetch_ucirepo(id=519)
X = repo.data.features
y = repo.data.targets
df = pd.concat([X, y], axis=1)

print(f"Dataset shape: {df.shape}")


Dataset shape: (299, 13)
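
Because every run downloads the data again, a local snapshot can help keep later results reproducible. The following is a minimal sketch under stated assumptions: `load_or_fetch`, `cache_path`, and `fetch_fn` are hypothetical names introduced for illustration, and the demo uses a tiny stand-in frame rather than the real download.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_or_fetch(cache_path, fetch_fn):
    """Return the cached CSV if it exists; otherwise call fetch_fn() and cache it.

    fetch_fn is a hypothetical zero-argument callable returning a DataFrame,
    for example a thin wrapper around the fetch_ucirepo call used above.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        return pd.read_csv(cache_path)
    frame = fetch_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(cache_path, index=False)
    return frame


def _fake_fetch():
    # Stand-in for the real download (illustrative values, not patient data).
    return pd.DataFrame({"age": [60.0, 75.0], "death_event": [0, 1]})


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "heart_failure.csv"
    first = load_or_fetch(path, _fake_fetch)   # fetches and writes the cache
    second = load_or_fetch(path, _fake_fetch)  # reads the cached file
    cache_hit = first.equals(second)

print(f"Cached copy matches fetched copy: {cache_hit}")
```

In the notebook itself, `fetch_fn` could wrap the `pd.concat([X, y], axis=1)` step so the cached file holds features and target together.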


##  **Metadata and Variable Information**

Presentation of repository‑level metadata and detailed variable descriptions to inform subsequent feature engineering.


In [4]:
print("Repository Metadata")
display(repo.metadata)

Repository Metadata


{'uci_id': 519,
 'name': 'Heart Failure Clinical Records',
 'repository_url': 'https://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records',
 'data_url': 'https://archive.ics.uci.edu/static/public/519/data.csv',
 'abstract': 'This dataset contains the medical records of 299 patients who had heart failure, collected during their follow-up period, where each patient profile has 13 clinical features.',
 'area': 'Health and Medicine',
 'tasks': ['Classification', 'Regression', 'Clustering'],
 'characteristics': ['Multivariate'],
 'num_instances': 299,
 'num_features': 12,
 'feature_types': ['Integer', 'Real'],
 'demographics': ['Age', 'Sex'],
 'target_col': ['death_event'],
 'index_col': None,
 'has_missing_values': 'no',
 'missing_values_symbol': None,
 'year_of_dataset_creation': 2020,
 'last_updated': 'Mon Feb 26 2024',
 'dataset_doi': '10.24432/C5Z89R',
 'creators': [],
 'intro_paper': {'ID': 286,
  'type': 'NATIVE',
  'title': 'Machine learning can predict survival of patie

In [5]:
print("Variable Descriptions")
display(pd.DataFrame(repo.variables))


Variable Descriptions


| | name | role | type | demographic | description | units | missing_values |
|---|---|---|---|---|---|---|---|
| 0 | age | Feature | Integer | Age | age of the patient | years | no |
| 1 | anaemia | Feature | Binary | | decrease of red blood cells or hemoglobin | | no |
| 2 | creatinine_phosphokinase | Feature | Integer | | level of the CPK enzyme in the blood | mcg/L | no |
| 3 | diabetes | Feature | Binary | | if the patient has diabetes | | no |
| 4 | ejection_fraction | Feature | Integer | | percentage of blood leaving the heart at each ... | % | no |
| 5 | high_blood_pressure | Feature | Binary | | if the patient has hypertension | | no |
| 6 | platelets | Feature | Continuous | | platelets in the blood | kiloplatelets/mL | no |
| 7 | serum_creatinine | Feature | Continuous | | level of serum creatinine in the blood | mg/dL | no |
| 8 | serum_sodium | Feature | Integer | | level of serum sodium in the blood | mEq/L | no |
| 9 | sex | Feature | Binary | Sex | woman or man | | no |


## **Data Quality Checks**

Verification of missingness, data types, and basic integrity before any transformation.


In [6]:
print("Missing values per column")
display(df.isna().sum())

Missing values per column


age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
death_event                 0
dtype: int64

In [7]:
print("Data types:")
display(df.dtypes)

Data types:


age                         float64
anaemia                       int64
creatinine_phosphokinase      int64
diabetes                      int64
ejection_fraction             int64
high_blood_pressure           int64
platelets                   float64
serum_creatinine            float64
serum_sodium                  int64
sex                           int64
smoking                       int64
time                          int64
death_event                   int64
dtype: object
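
The two cells above cover missingness and dtypes; the "basic integrity" part can be sketched as a few extra checks. This is an illustrative sketch, not part of the original notebook: `integrity_report` and its arguments are hypothetical names, and the toy frame contains deliberately broken values rather than real patient data.

```python
import pandas as pd


def integrity_report(frame, binary_cols, positive_cols):
    """Collect basic integrity findings as a dict of check name -> result."""
    return {
        # Columns that contain at least one missing value.
        "missing": frame.columns[frame.isna().any()].tolist(),
        # Binary flags that hold anything other than 0 or 1.
        "non_binary": [c for c in binary_cols if not frame[c].isin([0, 1]).all()],
        # Measurements that should be strictly positive but are not.
        "non_positive": [c for c in positive_cols if (frame[c] <= 0).any()],
        # Number of rows that exactly repeat an earlier row.
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Toy frame with one deliberate problem per check (not real patient data).
toy = pd.DataFrame({
    "age": [60.0, 72.0, 72.0, None],
    "sex": [0, 1, 1, 2],
    "serum_creatinine": [1.1, 0.0, 0.0, 1.4],
})
report = integrity_report(toy, binary_cols=["sex"], positive_cols=["serum_creatinine"])
print(report)
```

Applied to `df`, the binary list would presumably be the 0/1 flags (`anaemia`, `diabetes`, `high_blood_pressure`, `sex`, `smoking`, `death_event`) and the positive list the laboratory measurements.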

##  **Univariate Analysis**

###  **Continuous Features**

Statistical summaries and distribution visualizations for all numeric variables.


In [8]:
continuous_cols = [
    "age","creatinine_phosphokinase","ejection_fraction",
    "platelets","serum_creatinine","serum_sodium","time"
]
display(df[continuous_cols].describe().T)



| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 299.0 | 60.833893 | 11.894809 | 40.0 | 51.0 | 60.0 | 70.0 | 95.0 |
| creatinine_phosphokinase | 299.0 | 581.839465 | 970.287881 | 23.0 | 116.5 | 250.0 | 582.0 | 7861.0 |
| ejection_fraction | 299.0 | 38.083612 | 11.834841 | 14.0 | 30.0 | 38.0 | 45.0 | 80.0 |
| platelets | 299.0 | 263358.029264 | 97804.236869 | 25100.0 | 212500.0 | 262000.0 | 303500.0 | 850000.0 |
| serum_creatinine | 299.0 | 1.39388 | 1.03451 | 0.5 | 0.9 | 1.1 | 1.4 | 9.4 |
| serum_sodium | 299.0 | 136.625418 | 4.412477 | 113.0 | 134.0 | 137.0 | 140.0 | 148.0 |
| time | 299.0 | 130.26087 | 77.614208 | 4.0 | 73.0 | 115.0 | 203.0 | 285.0 |


In [9]:
df_cont = df[continuous_cols].melt(var_name="Feature", value_name="Value")
fig_cont = px.histogram(
    df_cont, x="Value",
    facet_col="Feature", facet_col_wrap=3,
    nbins=15,
    title="Distributions of Continuous Features",
    height=750
)

fig_cont.for_each_annotation(lambda ann: ann.update(text=ann.text.split("=")[-1]))
fig_cont.update_layout(showlegend=False)
fig_cont.show()


*   Age spans roughly 40–95 years, peaking around 55–75 years.

*   CPK (`creatinine_phosphokinase`) is extremely right-skewed: most values lie below 500 mcg/L, but a handful exceed 3,000 mcg/L (maximum 7,861).

*   Ejection fraction clusters between 25 % and 45 %, with fewer patients above 50 %.

*   Platelets are approximately normally distributed, with most values between 200,000 and 400,000 kiloplatelets/mL.

*   Serum creatinine is right-skewed: most values are below 2 mg/dL, but outliers reach above 8 mg/dL.

*   Serum sodium concentrates between 130 and 145 mEq/L, reflecting typical homeostatic ranges, with a few low values down to 113 mEq/L.

*   Follow-up time spans 4–285 days and is spread broadly across that range, indicating variable censoring times.
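
The right skew noted for CPK and serum creatinine can be quantified with sample skewness, and a log transform is one common way to reduce it before modeling. The sketch below is illustrative only: it uses synthetic log-normal values as a stand-in for a CPK-like feature, and `cpk_like` is a hypothetical name, not a column of `df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed values standing in for a CPK-like feature (not real data).
toy = pd.Series(rng.lognormal(mean=5.5, sigma=1.0, size=500), name="cpk_like")

skew_raw = toy.skew()            # sample skewness on the original scale
skew_log = np.log1p(toy).skew()  # skewness after a log(1 + x) transform

print(f"skewness raw: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

On the real data, the same comparison could be run with `df[continuous_cols].skew()` to decide which features, if any, merit a transform.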



###  **Categorical Features**

Frequency counts for binary flags and target variable.


In [10]:
categorical_cols = ["anaemia","diabetes","high_blood_pressure","sex","smoking","death_event"]
# Melt to long form: one row per (feature, 0/1 value); the histogram does the counting.
df_cat = df[categorical_cols].melt(var_name="Feature", value_name="Value")
fig_cat = px.histogram(
    df_cat, x="Value",
    color="Value",
    facet_col="Feature", facet_col_wrap=3,
    title="Counts of Categorical Features",
    height=750
)
fig_cat.for_each_annotation(lambda ann: ann.update(text=ann.text.split("=")[-1]))
fig_cat.update_layout(showlegend=False)
fig_cat.show()

*   ~32 % of patients died during follow-up (≈ 96/299).

*   ~65 % are male and ~35 % female.

*   ~68 % are non-smokers and ~32 % smokers.

*   Anaemia and diabetes are each present in ~40 % of cases.

*   High blood pressure is present in ~35 % of patients.

The outcome is moderately imbalanced (roughly two survivors for every death), so class weighting or resampling is worth considering when fitting classifiers such as logistic regression. The binary covariates are split between about 35/65 and 40/60, so none is rare enough to cause sparsity problems.
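
As a worked example of class weighting, the sketch below computes inverse-frequency weights, `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn documents for `class_weight="balanced"`. The helper name `balanced_class_weights` is hypothetical, and the label vector is rebuilt from the counts reported above (96 deaths out of 299) rather than read from `df`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Labels rebuilt from the reported counts: 203 survivors (0) and 96 deaths (1).
labels = [0] * 203 + [1] * 96
weights = balanced_class_weights(labels)
print({cls: round(w, 3) for cls, w in weights.items()})
```

With these weights each class contributes the same total weight to the loss, so the minority class (deaths) is not dominated by the majority.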

##  **Bivariate Analysis**

###  **Continuous vs. Outcome**

Boxplots to contrast distributions of key biomarkers by survival status.


In [11]:
key_feats = ["ejection_fraction","serum_creatinine","age"]
df_key = df.melt(id_vars="death_event", value_vars=key_feats,
                 var_name="Feature", value_name="Value")
fig_box = px.box(
    df_key, x="death_event", y="Value",
    color="death_event",
    facet_col="Feature", facet_col_wrap=3,
    title="Key Feature Distributions by Outcome",
    height=400
)
fig_box.for_each_annotation(lambda ann: ann.update(text=ann.text.split("=")[-1]))
fig_box.show()



*   Ejection Fraction: Median drops from ~38 % (survivors) to ~30 % (decedents). Survivors exhibit a wider range (17–80 %) versus decedents (14–70 %).

*   Serum Creatinine: Decedents show a higher median (~1.3 mg/dL) versus survivors (~1.0 mg/dL), with pronounced upper-tail outliers (up to ~9 mg/dL).

*   Age: Median increases from ~60 years (survivors) to ~65 years (decedents). Upper outliers above 90 years are concentrated among decedents.



Lower ejection fraction, elevated serum creatinine, and advanced age are each associated with higher mortality in this sample. These are unadjusted, univariate contrasts, not causal estimates.
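
The medians and ranges read off the boxplots can also be tabulated directly with a grouped aggregation. The sketch below shows the pattern on a toy frame; the values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "death_event":       [0, 0, 0, 1, 1, 1],
    "ejection_fraction": [35, 38, 60, 20, 30, 40],
    "serum_creatinine":  [0.9, 1.0, 1.2, 1.1, 1.3, 2.5],
})

# One row per outcome class, with median/min/max for each feature.
summary = (
    toy.groupby("death_event")[["ejection_fraction", "serum_creatinine"]]
       .agg(["median", "min", "max"])
)
print(summary)
```

Swapping `toy` for `df` and the column list for `key_feats` would give exact numbers to set beside the approximate values quoted above.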

###  **Correlation Matrix**

Identification of multicollinearity and candidate features via heatmap of Pearson correlations.


In [12]:
corr = df[continuous_cols].corr()
fig_corr = px.imshow(
    corr,
    text_auto=".2f",
    aspect="auto",
    title="Correlation Matrix of Continuous Features",
    width=600, height=600,
    color_continuous_scale='RdBu_r'
)
fig_corr.update_layout(margin=dict(l=20, r=20, t=40, b=20))
fig_corr.show()

Correlations are weak overall (|r| of roughly 0.2 or below), indicating low multicollinearity, which is favorable for most classifiers.

The negative time–age correlation suggests older patients tended to have shorter follow-up (perhaps due to earlier events).

The positive creatinine–age correlation is modest and is consistent with age-related renal decline.

Low feature interdependence simplifies interpretation (e.g., SHAP) and supports tree-based or regularized linear methods without extensive decorrelation preprocessing.
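
Reading pairwise values off a heatmap is error-prone, so it can help to list the pairs that exceed a chosen threshold. The following is a minimal sketch: `correlated_pairs` and its `threshold` default are hypothetical choices for illustration, and the demo runs on synthetic columns rather than `df`.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.2, method="pearson"):
    """List feature pairs whose absolute correlation exceeds `threshold`."""
    corr = frame.corr(method=method)
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() > threshold]
    # Order by absolute strength, strongest first.
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order)


rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # strongly tied to a
    "c": rng.normal(size=200),                 # independent noise
})
pairs = correlated_pairs(toy, threshold=0.5)
print(pairs)
```

Given the skew in CPK and serum creatinine, passing `method="spearman"` may be a reasonable robustness check alongside the Pearson values shown in the heatmap.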