# **Heart Disease Prediction**

![image](Heart3.png)

## 1. **Business Understanding**

### a. Introduction

Heart disease, particularly coronary artery disease (CAD), is a leading cause of morbidity and mortality worldwide. Early detection and accurate diagnosis are crucial for effective treatment and prevention of severe outcomes. Machine learning has the potential to significantly enhance the predictive accuracy for CAD, offering a powerful tool for healthcare providers. This project aims to leverage a comprehensive heart disease dataset, created by combining five well-known datasets, to develop a predictive model for CAD.

### b. Problem Statement
#### **What is the prevailing circumstance?**

Despite advances in medical technology, the early detection and diagnosis of coronary artery disease remain challenging. Traditional diagnostic methods can be time-consuming and costly, and they may not always be accurate.

#### **What problem is being addressed?**

The project addresses the challenge of predicting coronary artery disease early and accurately. A predictive model can supplement current diagnostic methods by providing timely and reliable insights, which are essential for early intervention and better patient outcomes.

#### **How does the project aim to solve the problem?**

The project aims to develop a machine learning model using a dataset that combines five established heart disease datasets. Training the model on a larger and more diverse dataset with 11 common features is expected to improve predictive accuracy. The model will help healthcare providers identify high-risk patients, thereby facilitating early diagnosis and personalized treatment plans.

### c. Objectives

#### Main Objective
- To develop a machine learning model that can predict the presence of coronary artery disease based on patient data.
#### Specific Objectives
- To clean and preprocess the combined heart disease dataset, ensuring data quality and consistency.
- To select and engineer features that significantly contribute to the prediction of CAD.
- To train and validate multiple machine learning models to identify the most accurate and interpretable model.
- To implement the model in a user-friendly interface for healthcare providers.
### d. Notebook Structure
i. Business Understanding <br>

ii. Data Understanding<br>

iii. Exploratory Data Analysis<br>

iv. Data Preprocessing<br>

v. Modeling<br>

vi. Evaluation<br>

vii. Conclusion<br>

viii. Recommendation<br>

ix. Next Steps

### e. Stakeholders
- Healthcare Providers: Physicians, cardiologists, and healthcare professionals who will use the model to aid in diagnosis and treatment.
- Patients: Individuals at risk of or suffering from heart disease who will benefit from early and accurate diagnosis.
- Healthcare Institutions: Hospitals and clinics aiming to improve patient outcomes and optimize resource allocation.
- Researchers: Academics and professionals focused on advancing medical research and machine learning applications in healthcare.
- Insurance Companies: Organizations that can utilize predictive models to assess risk and manage patient care costs effectively.
### f. Metric of Success

- Accuracy: The model is considered successful if it achieves an accuracy above 85%.
- Precision and Recall: High precision and recall to ensure reliability and minimize false positives/negatives.
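
As a minimal sketch of how the metrics above are defined, the following pure-Python snippet computes accuracy, precision, and recall from binary labels. The function name `classification_metrics` and the label lists are hypothetical and only illustrate the definitions; they are not results from this project.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for binary labels (1 = disease)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical labels, used only to exercise the function
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In a screening context, recall is often the metric to watch most closely, because a false negative corresponds to a patient with disease who is missed.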

## 2. **Data Understanding**

The data used in this project was obtained from: [Kaggle](https://www.kaggle.com/datasets/mexwell/heart-disease-dataset).

The dataset consists of a single CSV file with 1,190 instances and 11 features, plus a `target` label. It combines five heart disease datasets to support CAD-related machine learning research and to improve clinical diagnosis and early treatment.








### Import Libraries

In [41]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Loading the datasets

In [42]:
Heart = pd.read_csv('heart_disease_dataset.csv')
Heart.head()

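|   | age | sex | chest pain type | resting bp s | cholesterol | fasting blood sugar | resting ecg | max heart rate | exercise angina | oldpeak | ST slope | target |
|---|-----|-----|-----------------|--------------|-------------|---------------------|-------------|----------------|-----------------|---------|----------|--------|
| 0 | 40 | 1 | 2 | 140 | 289 | 0 | 0 | 172 | 0 | 0.0 | 1 | 0 |
| 1 | 49 | 0 | 3 | 160 | 180 | 0 | 0 | 156 | 0 | 1.0 | 2 | 1 |
| 2 | 37 | 1 | 2 | 130 | 283 | 0 | 1 | 98 | 0 | 0.0 | 1 | 0 |
| 3 | 48 | 0 | 4 | 138 | 214 | 0 | 0 | 108 | 1 | 1.5 | 2 | 1 |
| 4 | 54 | 1 | 3 | 150 | 195 | 0 | 0 | 122 | 0 | 0.0 | 1 | 0 |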


In [43]:
# Check the shape of the dataset (rows, columns)
Heart.shape

(1190, 12)

In [44]:
# Check column dtypes and non-null counts
Heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1190 entries, 0 to 1189
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   age                  1190 non-null   int64  
 1   sex                  1190 non-null   int64  
 2   chest pain type      1190 non-null   int64  
 3   resting bp s         1190 non-null   int64  
 4   cholesterol          1190 non-null   int64  
 5   fasting blood sugar  1190 non-null   int64  
 6   resting ecg          1190 non-null   int64  
 7   max heart rate       1190 non-null   int64  
 8   exercise angina      1190 non-null   int64  
 9   oldpeak              1190 non-null   float64
 10  ST slope             1190 non-null   int64  
 11  target               1190 non-null   int64  
dtypes: float64(1), int64(11)
memory usage: 111.7 KB


In [45]:
# Summary statistics for each column
Heart.describe()

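|   | age | sex | chest pain type | resting bp s | cholesterol | fasting blood sugar | resting ecg | max heart rate | exercise angina | oldpeak | ST slope | target |
|---|-----|-----|-----------------|--------------|-------------|---------------------|-------------|----------------|-----------------|---------|----------|--------|
| count | 1190.0 | 1190.0 | 1190.0 | 1190.0 | 1190.0 | 1190.0 | 1190.0 | 1190.0 | 1190.0 | 1190.0 | 1190.0 | 1190.0 |
| mean | 53.720168 | 0.763866 | 3.232773 | 132.153782 | 210.363866 | 0.213445 | 0.698319 | 139.732773 | 0.387395 | 0.922773 | 1.62437 | 0.528571 |
| std | 9.358203 | 0.424884 | 0.93548 | 18.368823 | 101.420489 | 0.409912 | 0.870359 | 25.517636 | 0.48736 | 1.086337 | 0.610459 | 0.499393 |
| min | 28.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 60.0 | 0.0 | -2.6 | 0.0 | 0.0 |
| 25% | 47.0 | 1.0 | 3.0 | 120.0 | 188.0 | 0.0 | 0.0 | 121.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 50% | 54.0 | 1.0 | 4.0 | 130.0 | 229.0 | 0.0 | 0.0 | 140.5 | 0.0 | 0.6 | 2.0 | 1.0 |
| 75% | 60.0 | 1.0 | 4.0 | 140.0 | 269.75 | 0.0 | 2.0 | 160.0 | 1.0 | 1.6 | 2.0 | 1.0 |
| max | 77.0 | 1.0 | 4.0 | 200.0 | 603.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 3.0 | 1.0 |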


### Data Cleaning

In [46]:
# Check the column names in the dataset
print("Column names of the dataset:")
print(list(Heart.columns))

Column names of the dataset:
['age', 'sex', 'chest pain type', 'resting bp s', 'cholesterol', 'fasting blood sugar', 'resting ecg', 'max heart rate', 'exercise angina', 'oldpeak', 'ST slope', 'target']


In [47]:
# Check for duplicates
Heart.duplicated().sum()

272

In [48]:
# View duplicate rows
duplicates = Heart[Heart.duplicated()]
print(duplicates)


      age  sex  chest pain type  resting bp s  cholesterol  \
163    49    0                2           110          208   
604    58    1                3           150          219   
887    63    1                1           145          233   
888    67    1                4           160          286   
889    67    1                4           120          229   
...   ...  ...              ...           ...          ...   
1156   42    1                3           130          180   
1157   61    1                4           140          207   
1158   66    1                4           160          228   
1159   46    1                4           140          311   
1160   71    0                4           112          149   

      fasting blood sugar  resting ecg  max heart rate  exercise angina  \
163                     0            0             160                0   
604                     0            1             118                1   
887                     1     

In [49]:
# Remove duplicate rows, keeping the first occurrence of each
Heart = Heart.drop_duplicates()

# Verify removal
duplicate_count_after = Heart.duplicated().sum()
print(f"Number of duplicate rows after cleaning: {duplicate_count_after}")


Number of duplicate rows after cleaning: 0
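
The duplicate handling above can be sketched on a small frame. The `toy` DataFrame below is hypothetical and only illustrates two points: `duplicated()` flags the later copies by default (not the first occurrence), and `drop_duplicates()` keeps the original index labels unless the index is reset. Resetting the index is an optional extra step, not something the notebook does.

```python
import pandas as pd

# Hypothetical rows: row 2 repeats row 0, and row 3 repeats row 1
toy = pd.DataFrame({
    "age": [49, 58, 49, 58, 63],
    "cholesterol": [208, 219, 208, 219, 233],
    "target": [0, 1, 0, 1, 0],
})

# Default keep="first": only the later copies are flagged
later_copies = int(toy.duplicated().sum())

# keep=False: every row that has a twin is flagged, including the first one
all_copies = int(toy.duplicated(keep=False).sum())

# Drop the later copies and renumber the rows from 0
deduped = toy.drop_duplicates().reset_index(drop=True)

print(later_copies, all_copies, len(deduped))
```

Without `reset_index(drop=True)`, the cleaned frame would keep gaps in its index where rows were removed, which matters only if later code relies on positional row labels.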


In [50]:
# Check for missing values.
# Store the result in a new variable so the Heart DataFrame is not overwritten.
missing_values = Heart.isnull().any()
missing_values

age                    False
sex                    False
chest pain type        False
resting bp s           False
cholesterol            False
fasting blood sugar    False
resting ecg            False
max heart rate         False
exercise angina        False
oldpeak                False
ST slope               False
target                 False
dtype: bool

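- This shows that the dataset has no null values. However, the summary statistics above report a minimum of 0 for `resting bp s` and `cholesterol`, which is not physiologically plausible and may indicate missing readings recorded as zero.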
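
A null check does not catch missing readings that were recorded as zero, and the summary statistics earlier report a minimum of 0 for both `resting bp s` and `cholesterol`. The sketch below counts zeros per column on a hypothetical `toy` frame; treating zeros in these two columns as suspect is an assumption to confirm against the dataset documentation before imputing or dropping anything.

```python
import pandas as pd

# Hypothetical values; only the column names come from the dataset
toy = pd.DataFrame({
    "resting bp s": [140, 0, 130, 120],
    "cholesterol": [289, 0, 0, 214],
})

# Columns where a zero is assumed to be a placeholder rather than a real reading
suspect_cols = ["resting bp s", "cholesterol"]

# Count zeros per column; isnull() reports nothing for these rows
zero_counts = (toy[suspect_cols] == 0).sum()
null_counts = toy.isnull().sum()

print(zero_counts)
print(null_counts)
```

On the real data, the same expression with `Heart` in place of `toy` would show how many rows are affected.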

## 3. Exploratory Data Analysis

In [None]:

# Set up the figure and axis
plt.figure(figsize=(8, 6))

# Create a box plot to compare the distribution of age between individuals with and without heart disease
sns.boxplot(x='target', y='age', data=Heart)

# Add labels and title
plt.xlabel('Presence of Heart Disease (0 = No, 1 = Yes)')
plt.ylabel('Age')
plt.title('Distribution of Age by Heart Disease Presence')

# Show the plot
plt.show()
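
A box plot is easier to read alongside the numbers it summarizes. As a minimal sketch, the snippet below computes per-group age statistics with `groupby`; the `toy` frame is hypothetical and stands in for the cleaned `Heart` DataFrame.

```python
import pandas as pd

# Hypothetical ages and labels (0 = no heart disease, 1 = heart disease)
toy = pd.DataFrame({
    "age": [40, 49, 37, 48, 54, 60],
    "target": [0, 1, 0, 1, 0, 1],
})

# Count, median, minimum, and maximum age within each target group
age_by_target = toy.groupby("target")["age"].agg(["count", "median", "min", "max"])

print(age_by_target)
```

The median column corresponds to the center line of each box in the plot.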
