# Heart Failure Prediction
Heart failure is a serious condition in which the heart cannot pump blood efficiently, leading to complications such as organ damage and increased mortality. This project aims to detect early signs of the condition so that timely medical intervention can change a patient's outcome. To that end, we develop a predictive model that assesses heart failure risk from patient data.

Using **supervised learning** techniques and a comprehensive dataset, we aim to provide accurate early warnings to healthcare professionals. The model is integrated into a user-friendly desktop application with a CustomTkinter GUI, allowing medical practitioners to input patient information and receive risk assessments, ultimately enabling timely interventions and improving patient outcomes.

## 1. Problem Definition
### Problem Identification  
- Heart failure is a serious medical condition where the heart is unable to pump blood efficiently, leading to complications and potentially fatal outcomes. Early detection and prediction of heart failure can significantly improve patient management and reduce mortality rates.  

### Impact Analysis  
- Delayed diagnosis of heart failure can result in severe complications, increased hospitalization rates, higher medical costs, and a reduced quality of life for patients. Early prediction allows for timely medical intervention, improving patient outcomes.  

### Root Cause Exploration  
- Diagnosing heart failure is challenging due to the complex nature of cardiovascular diseases, variations in symptoms, and the need for multiple diagnostic tests such as ECG, echocardiography, and biomarker analysis.  

### Scope Clarification  
- This issue affects healthcare providers, cardiologists, patients with cardiovascular risk factors, and medical researchers. It is particularly critical in areas with limited access to specialized cardiac care.  

### Stakeholder Involvement  
- Key stakeholders include physicians, cardiologists, healthcare providers, medical researchers, and patients at risk of heart failure. Predictive models can assist medical professionals in making data-driven decisions.  

### Current Limitations  
- Traditional methods of diagnosing heart failure rely on symptoms, medical history, and imaging tests, which can be time-consuming and may miss early signs of deterioration.  

### Desired Outcomes  
- The goal is to develop a predictive model that can assess heart failure risk based on patient data, providing early warnings to healthcare providers and allowing timely intervention.  

### Constraints and Challenges  
- Challenges include the availability and quality of medical data, potential bias in datasets, interpretability of machine learning models, and regulatory considerations for medical AI applications.  

### Potential Risks  
- Inaccurate predictions may lead to false alarms or missed diagnoses, impacting patient trust and medical decision-making. Ethical concerns regarding patient data privacy and the model’s reliability must also be addressed.  

### Technical Details  
- **Type of Learning:** Supervised learning (classification model using patient health data).  
- **Dataset Type:** Tabular dataset of patient health records, with clinical parameters such as age, sex, chest pain type, resting blood pressure, cholesterol, fasting blood sugar, resting ECG results, and maximum heart rate.  
- **Deployment:** A desktop GUI application built with CustomTkinter that lets healthcare professionals enter patient data and receive risk predictions.

## 2. Data Collection  
We will be using the **Heart Failure Prediction Dataset** available on Kaggle, provided through the following link: [Heart Failure Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction)  

### Dataset Information  
This dataset combines several heart disease datasets that were previously available only separately. **5 heart datasets** are merged over **11 common features**, which, according to the dataset's Kaggle description, makes it the **largest heart disease dataset** available for research purposes.  

The five datasets used for its curation are:  
- **Cleveland:** 303 observations  
- **Hungarian:** 294 observations  
- **Switzerland:** 123 observations  
- **Long Beach VA:** 200 observations  
- **Statlog (Heart) Data Set:** 270 observations  

**Total observations:** 1190  
**Duplicated observations:** 272  
**Final dataset size:** 918 observations

First, we import the libraries required for the project.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
import kagglehub

Next, we download the Heart Failure Prediction Dataset from Kaggle, read the `heart.csv` file, and display it.

In [3]:
# Download latest version of the dataset.
path = kagglehub.dataset_download("fedesoriano/heart-failure-prediction")

data = pd.read_csv(path + '/heart.csv')
data

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0
...,...,...,...,...,...,...,...,...,...,...,...,...
913,45,M,TA,110,264,0,Normal,132,N,1.2,Flat,1
914,68,M,ASY,144,193,1,Normal,141,N,3.4,Flat,1
915,57,M,ASY,130,131,0,Normal,115,Y,1.2,Flat,1
916,57,F,ATA,130,236,0,LVH,174,N,0.0,Flat,1


In [4]:
# Display the number of rows & columns.
# The 12 columns are the 11 input features plus the `HeartDisease` target.
print("Number of Records: " + str(data.shape[0]))
print("Number of Columns: " + str(data.shape[1]))

Number of Records: 918
Number of Columns: 12


## 3. Data Cleaning & Preprocessing
In this step, we inspect the dataset, check it for duplicates and missing values, filter outliers, and rescale selected features.

In [5]:
data.describe() # Describes the mean, standard deviation, and other information.

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease
count,918.0,918.0,918.0,918.0,918.0,918.0,918.0
mean,53.510893,132.396514,198.799564,0.233115,136.809368,0.887364,0.553377
std,9.432617,18.514154,109.384145,0.423046,25.460334,1.06657,0.497414
min,28.0,0.0,0.0,0.0,60.0,-2.6,0.0
25%,47.0,120.0,173.25,0.0,120.0,0.0,0.0
50%,54.0,130.0,223.0,0.0,138.0,0.6,1.0
75%,60.0,140.0,267.0,0.0,156.0,1.5,1.0
max,77.0,200.0,603.0,1.0,202.0,6.2,1.0


In [6]:
data.info() # Describes the types of data.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


In [7]:
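# 3.1 Identify and remove duplicates (if there are any).

duplicates = data.duplicated()
# Sum the boolean mask to count duplicated rows;
# `data[duplicates].size` would count cells (rows * columns) instead.
print("Number of Duplicates: " + str(duplicates.sum()))
data = data.drop_duplicates()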

Number of Duplicates: 0
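
Since the dataset contains no duplicates, the check above never shows what removal looks like. The following is a minimal sketch on a toy frame (the values are invented for illustration) that contrasts counting duplicated rows with counting cells, and then drops the repeats:

```python
import pandas as pd

# Toy frame with invented values; the third row repeats the first.
toy = pd.DataFrame({"Age": [40, 49, 40], "RestingBP": [140, 160, 140]})

dup_mask = toy.duplicated()          # True for every repeat of an earlier row
n_duplicates = int(dup_mask.sum())   # number of duplicated rows
n_cells = toy[dup_mask].size         # rows * columns, not a row count

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print("Duplicated rows:", n_duplicates)
print("Cells in duplicated rows:", n_cells)
print("Rows after removal:", len(deduplicated))
```

Both counts are zero when there are no duplicates, so the difference only becomes visible once at least one repeated row exists.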


In [8]:
# 3.2 Identify and deal with any missing values (if there are any).

print("Number of Missing Values: \n" + str(data.isnull().sum()))

Number of Missing Values: 
Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64
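
`isnull()` only finds values stored as `NaN`. The `describe()` output above reports a minimum of 0 for both `RestingBP` and `Cholesterol`, which is not physiologically plausible and may indicate missing measurements recorded as 0. The sketch below shows one way to surface such placeholders; the toy values are invented, and treating 0 as "missing" is an assumption that should be checked against the dataset documentation:

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; zeros stand in for unrecorded measurements.
toy = pd.DataFrame({
    "RestingBP": [140, 0, 130, 138],
    "Cholesterol": [289, 180, 0, 0],
})

# Columns where a value of 0 is assumed to mean "not measured".
suspect_columns = ["RestingBP", "Cholesterol"]

# Count the zero placeholders per column.
zero_counts = (toy[suspect_columns] == 0).sum()

# Convert the placeholders to NaN so that isnull() can see them.
with_nan = toy.copy()
for column in suspect_columns:
    with_nan[column] = with_nan[column].replace(0, np.nan)

missing_counts = with_nan.isnull().sum()
print(zero_counts)
print(missing_counts)
```

Once the zeros are converted, they can be imputed (for example with the column median) or the affected rows can be dropped, depending on how many there are.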


In [9]:
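# 3.3 Remove outliers from the dataset using the 1.5 * IQR rule.
numerical_features = ['Age', 'RestingBP', 'Cholesterol', 'MaxHR']
data_cleaned = data.copy()  # Copy so that `data` itself is left untouched.

for feature in numerical_features:
    # Quartiles are computed on the full dataset, not on the partially filtered one.
    Q1 = data[feature].quantile(0.25)
    Q3 = data[feature].quantile(0.75)

    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Keep only the rows whose value lies within [lower_bound, upper_bound].
    data_cleaned = data_cleaned[(data_cleaned[feature] >= lower_bound) & (data_cleaned[feature] <= upper_bound)]

# Report how many rows the filter removed.
print("Rows removed: " + str(len(data) - len(data_cleaned)))
data_cleaned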

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0
...,...,...,...,...,...,...,...,...,...,...,...,...
913,45,M,TA,110,264,0,Normal,132,N,1.2,Flat,1
914,68,M,ASY,144,193,1,Normal,141,N,3.4,Flat,1
915,57,M,ASY,130,131,0,Normal,115,Y,1.2,Flat,1
916,57,F,ATA,130,236,0,LVH,174,N,0.0,Flat,1
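
To make the effect of the IQR rule concrete, take the `Cholesterol` quartiles reported by `describe()`: Q1 = 173.25 and Q3 = 267.0, so IQR = 93.75 and the bounds are 173.25 − 140.625 = 32.625 and 267.0 + 140.625 = 407.625. Every row with a cholesterol value of 0 therefore falls below the lower bound and is filtered out. The same logic can be sketched as two small helpers; the function names `iqr_bounds` and `remove_iqr_outliers` are hypothetical, and the toy series uses invented values:

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) bounds of the k * IQR rule for a series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_iqr_outliers(frame, columns, k=1.5):
    """Keep only the rows that lie within the IQR bounds for every listed column."""
    keep = pd.Series(True, index=frame.index)
    for column in columns:
        lower, upper = iqr_bounds(frame[column], k)
        # between() is inclusive on both ends, matching >= and <=.
        keep &= frame[column].between(lower, upper)
    return frame[keep]


# Toy data with invented values; 100 is an obvious outlier.
toy = pd.DataFrame({"MaxHR": [10, 12, 14, 16, 18, 100]})
lower, upper = iqr_bounds(toy["MaxHR"])
filtered = remove_iqr_outliers(toy, ["MaxHR"])

print("Bounds:", lower, upper)
print("Rows before:", len(toy), "after:", len(filtered))
```

Building a single boolean mask and applying it once leaves the input frame unmodified and makes it easy to report how many rows were dropped.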


In [10]:
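# 3.4 Data transformation (min-max scaling).
# Work on a copy: `normalized_data = data` would alias `data` and overwrite it in place.
# Note that this scales the full dataset, not the outlier-filtered `data_cleaned`.
normalized_data = data.copy()
chol = data['Cholesterol']
# Rescale Cholesterol from [min, max] to the range [0, 30].
normalized_data['Cholesterol'] = (chol - chol.min()) / (chol.max() - chol.min()) * 30
normalized_data.describe()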

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease
count,918.0,918.0,918.0,918.0,918.0,918.0,918.0
mean,53.510893,132.396514,9.890526,0.233115,136.809368,0.887364,0.553377
std,9.432617,18.514154,5.441997,0.423046,25.460334,1.06657,0.497414
min,28.0,0.0,0.0,0.0,60.0,-2.6,0.0
25%,47.0,120.0,8.619403,0.0,120.0,0.0,0.0
50%,54.0,130.0,11.094527,0.0,138.0,0.6,1.0
75%,60.0,140.0,13.283582,0.0,156.0,1.5,1.0
max,77.0,200.0,30.0,1.0,202.0,6.2,1.0
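
The transformation above is min-max scaling to the range [0, 30]: `scaled = (x - min) / (max - min) * 30`. As a quick check, the sketch below applies the same formula to the `Cholesterol` minimum, median, 75th percentile, and maximum from the first `describe()` output, and should reproduce the corresponding values in the table above. The helper name `min_max_scale` is hypothetical:

```python
import pandas as pd


def min_max_scale(series, new_max=1.0):
    """Linearly rescale a series from [min, max] to [0, new_max]."""
    span = series.max() - series.min()
    if span == 0:
        # A constant column has no spread; map every value to 0.
        return series * 0.0
    return (series - series.min()) / span * new_max


# Cholesterol min, 50%, 75%, and max taken from the describe() output.
cholesterol = pd.Series([0.0, 223.0, 267.0, 603.0])
scaled = min_max_scale(cholesterol, new_max=30)

print(scaled.round(6).tolist())
```

Because the function returns a new series, the original column stays available for later steps such as plotting raw values.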


## 4. Exploratory Data Analysis (EDA)

In this step, we explore the data and look for relationships between the features and the `HeartDisease` target.
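
As a starting point, the sketch below computes per-class means and the correlation of each numeric feature with the target. It runs on the nine rows displayed earlier in this notebook, so the numbers only illustrate the mechanics and say nothing about the full dataset; in the notebook itself, `sample` would be replaced by the cleaned `data` frame:

```python
import io

import pandas as pd

# The nine rows shown in the dataset preview above (not the full dataset).
SAMPLE_CSV = """Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0
45,M,TA,110,264,0,Normal,132,N,1.2,Flat,1
68,M,ASY,144,193,1,Normal,141,N,3.4,Flat,1
57,M,ASY,130,131,0,Normal,115,Y,1.2,Flat,1
57,F,ATA,130,236,0,LVH,174,N,0.0,Flat,1
"""
sample = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Restrict to numeric columns; categorical ones need encoding first.
numeric = sample.select_dtypes(include="number")

# Mean of every numeric feature, split by target class.
group_means = numeric.groupby("HeartDisease").mean()

# Pearson correlation of each numeric feature with the target.
corr_with_target = numeric.corr()["HeartDisease"].drop("HeartDisease")

print(group_means)
print(corr_with_target.sort_values(ascending=False))
```

The full correlation matrix from `numeric.corr()` can also be passed to `sns.heatmap(..., annot=True)` for a visual overview, using the seaborn import from the start of the notebook.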

## 5. Feature Engineering & Selection

In this step, we select and derive the features that are most relevant to the prediction task.
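
Five of the columns (`Sex`, `ChestPainType`, `RestingECG`, `ExerciseAngina`, `ST_Slope`) are stored as text, as the `data.info()` output shows, and most classifiers need numeric input. One common option is one-hot encoding, sketched here on the first five rows of the dataset preview. This illustrates one possible approach and is not necessarily the encoding the final model uses:

```python
import io

import pandas as pd

# The first five rows shown in the dataset preview above.
SAMPLE_CSV = """Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0
"""
sample = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Text-valued columns, as listed by data.info().
categorical_columns = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]

# Replace each categorical column with one 0/1 indicator column per category.
encoded = pd.get_dummies(sample, columns=categorical_columns, dtype=int)

# Separate the input features from the target label.
X = encoded.drop(columns="HeartDisease")
y = encoded["HeartDisease"]

print(X.columns.tolist())
print(X.shape, y.tolist())
```

On the full dataset the number of indicator columns grows with the number of distinct categories; passing `drop_first=True` to `pd.get_dummies` removes one redundant column per feature, which can help linear models.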