### Heart Disease Predictors

#### Background:
Heart disease, also known as cardiovascular disease, is a broad term that encompasses various conditions affecting the heart and circulatory system. It is the leading cause of death worldwide. Because the heart is one of the body’s most essential organs, its disorders can affect other organs and body systems as well. There are many types of heart disease; the most common involve narrowing or blockage of the coronary arteries, valve dysfunction, enlargement of the heart, and other problems that can result in heart attacks or heart failure. [Source](https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-%28cvds%29?)

#### Dataset Description:
This dataset dates from 1988 and combines data collected from four sources: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes in total, including the target variable; however, most published studies use a subset of 14 of these features, and the file loaded here contains those 14 columns. The `target` variable indicates whether a patient has heart disease and is stored as an integer: 0 for no disease and 1 for presence of disease.

#### Objective: 
The objective of this project is to use exploratory analysis to identify predictors of heart disease in the provided dataset.

Data Source: [Kaggle](https://www.google.com/search?q=kaggle%2Finput%2Fheart-disease%2Fheart.csv&rlz=1C1KNTJ_enNG1087NG1088&oq=kaggle%2Finput%2Fheart-disease%2Fheart.csv&gs_lcrp=EgZjaHJvbWUqBggAEEUYOzIGCAAQRRg7MgYIARBFGDrSAQgxMTM0ajBqN6gCCLACAfEFy5I5ff9HO2Y&sourceid=chrome&ie=UTF-8)

In [1]:
import pandas as pd
data = pd.read_csv('heart.csv')

In [2]:
data.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [3]:
#Display column data types
data.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


### Data Glossary
- age: age in years
- sex: gender
     - 1 = male
     - 0 = female
- cp: chest pain type
    - value 0: typical angina
    - value 1: atypical angina
    - value 2: non-anginal pain
    - value 3: asymptomatic
- trestbps: resting blood pressure (in mm Hg on admission to the hospital)
- chol: serum cholesterol in mg/dl
- fbs: (fasting blood sugar > 120 mg/dl)
    - 1 = true
    - 0 = false
- restecg: resting electrocardiographic results
    - value 0: normal
    - value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
- thalach: maximum heart rate achieved
- exang: exercise induced angina
    - 1 = yes
    - 0 = no
- oldpeak: ST depression induced by exercise relative to rest
- slope: the slope of the peak exercise ST segment
    - value 0: upsloping
    - value 1: flat
    - value 2: downsloping
- ca: number of major vessels (0-3) colored by fluoroscopy
- thal:
   - 0 = error (in the original dataset 0 maps to NaN)
   - 1 = fixed defect
   - 2 = normal
   - 3 = reversible defect
- target (the label):
    - 0 = no disease
    - 1 = disease

### Note: 

Regarding the target label, the integers describe the diagnosis of heart disease, which is the angiographic disease status:
- Value 0: < 50% diameter narrowing
- Value 1: > 50% diameter narrowing

### Also from the discussion forum of the dataset, the following was noted:

- Rows #93, 159, 164, 165 and 252 have ca = 4, which is incorrect. In the original Cleveland dataset they are NaNs.
- Rows #49 and 282 have thal = 0, which is also incorrect. They are also NaNs in the original dataset.

Therefore, the affected rows will be dropped (7 rows in total).
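
The check described in the forum notes can be sketched in code. This is a minimal illustration, not the notebook's own method: the helper name `find_invalid_rows`, the `VALID_CODES` dictionary, and the toy frame are assumptions made for the example, and the allowed codes are taken from the glossary above.

```python
import pandas as pd

# Allowed codes per the data glossary (assumption: only ca and thal are checked here).
VALID_CODES = {
    "ca": {0, 1, 2, 3},
    "thal": {1, 2, 3},
}

def find_invalid_rows(df: pd.DataFrame, valid_codes: dict) -> pd.DataFrame:
    """Return the rows where any listed column holds an undocumented code."""
    bad = pd.Series(False, index=df.index)
    for column, allowed in valid_codes.items():
        bad |= ~df[column].isin(allowed)
    return df[bad]

# Toy frame standing in for heart.csv; the values are illustrative only.
toy = pd.DataFrame({"ca": [0, 4, 2, 1], "thal": [2, 3, 0, 1]})
invalid = find_invalid_rows(toy, VALID_CODES)
print(invalid.index.tolist())  # [1, 2]
```

Run against the real `data` frame, the same helper should list the seven rows named above, which gives a quick way to confirm the forum's row numbers before dropping anything.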

## Data Cleaning
As noted above, the seven affected rows are dropped.

In [5]:
data = data[data['ca'] < 4]    # drop rows with the invalid ca value (4)
data = data[data['thal'] > 0]  # drop rows with the invalid thal value (0)

In [6]:
data.info()

print(f'The length of the data is now {len(data)} instead of 303!')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 296 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       296 non-null    int64  
 1   sex       296 non-null    int64  
 2   cp        296 non-null    int64  
 3   trestbps  296 non-null    int64  
 4   chol      296 non-null    int64  
 5   fbs       296 non-null    int64  
 6   restecg   296 non-null    int64  
 7   thalach   296 non-null    int64  
 8   exang     296 non-null    int64  
 9   oldpeak   296 non-null    float64
 10  slope     296 non-null    int64  
 11  ca        296 non-null    int64  
 12  thal      296 non-null    int64  
 13  target    296 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 34.7 KB
The length of the data is now 296 instead of 303!


### Renaming Columns

The dataset’s feature names are abbreviated, which makes them difficult to interpret. Even the full medical or technical terms can be complex, so to make the data more readable, I’ll rename the dataframe’s columns using the detailed descriptions provided by the UCI data repository. In addition, I’ll replace the numerical category codes (0, 1, 2, etc.) with their corresponding medical meanings (for example, ‘typical angina’, ‘atypical angina’, and so on).

In [7]:
# Rename the columns for readability
data = data.rename(
    columns = {
    'age': 'Age',
    'sex': 'Sex',
    'cp': 'Chest_Pain_Type',
    'trestbps': 'Resting_Blood_Pressure',
    'chol': 'Serum_Cholesterol',
    'fbs': 'Fasting_Blood_Sugar',
    'restecg': 'Resting_Electrocardiogram',
    'thalach': 'Max_Heart_Rate',
    'exang': 'Exercise_Induced_Angina',
    'oldpeak': 'ST_Depression_Exercise',
    'slope': 'ST_Segment_Slope',
    'ca': 'Num_Major_Vessels',
    'thal': 'Thalassemia',
    'target': 'Heart_Disease_Presence'},
  errors = 'raise')

In [8]:
heart_data = data.copy()

In [11]:
# Replace categorical codes with descriptive labels
heart_data['Sex'] = heart_data['Sex'].replace({0: 'female', 1: 'male'})
heart_data['Chest_Pain_Type'] = heart_data['Chest_Pain_Type'].replace({
    0: 'typical angina',
    1: 'atypical angina',
    2: 'non-anginal pain',
    3: 'asymptomatic'
})

# Units are mg/dl, as in the data glossary
heart_data['Fasting_Blood_Sugar'] = heart_data['Fasting_Blood_Sugar'].replace({
    0: 'lower than 120 mg/dl',
    1: 'greater than 120 mg/dl'
})

heart_data['Resting_Electrocardiogram'] = heart_data['Resting_Electrocardiogram'].replace({
    0: 'normal',
    1: 'ST-T wave abnormality',
    2: 'left ventricular hypertrophy'
})

heart_data['Exercise_Induced_Angina'] = heart_data['Exercise_Induced_Angina'].replace({
    0: 'no',
    1: 'yes'
})

heart_data['ST_Segment_Slope'] = heart_data['ST_Segment_Slope'].replace({
    0: 'upsloping',
    1: 'flat',
    2: 'downsloping'
})

# Codes follow the data glossary: 1 = fixed defect, 2 = normal, 3 = reversible defect
heart_data['Thalassemia'] = heart_data['Thalassemia'].replace({
    1: 'fixed defect',
    2: 'normal',
    3: 'reversible defect'
})


In [10]:
heart_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 296 entries, 0 to 302
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Age                        296 non-null    int64  
 1   Sex                        296 non-null    object 
 2   Chest_Pain_Type            296 non-null    object 
 3   Resting_Blood_Pressure     296 non-null    int64  
 4   Serum_Cholesterol          296 non-null    int64  
 5   Fasting_Blood_Sugar        296 non-null    object 
 6   Resting_Electrocardiogram  296 non-null    object 
 7   Max_Heart_Rate             296 non-null    int64  
 8   Exercise_Induced_Angina    296 non-null    object 
 9   ST_Depression_Exercise     296 non-null    float64
 10  ST_Segment_Slope           296 non-null    object 
 11  Num_Major_Vessels          296 non-null    int64  
 12  Thalassemia                296 non-null    object 
 13  Heart_Disease_Presence     296 non-null    int64  

In [12]:
heart_data.head(10)

Unnamed: 0,Age,Sex,Chest_Pain_Type,Resting_Blood_Pressure,Serum_Cholesterol,Fasting_Blood_Sugar,Resting_Electrocardiogram,Max_Heart_Rate,Exercise_Induced_Angina,ST_Depression_Exercise,ST_Segment_Slope,Num_Major_Vessels,Thalassemia,Heart_Disease_Presence
0,63,male,asymptomatic,145,233,greater than 120 mg/dl,normal,150,no,2.3,upsloping,0,fixed defect,1
1,37,male,non-anginal pain,130,250,lower than 120 mg/dl,ST-T wave abnormality,187,no,3.5,upsloping,0,normal,1
2,41,female,atypical angina,130,204,lower than 120 mg/dl,normal,172,no,1.4,downsloping,0,normal,1
3,56,male,atypical angina,120,236,lower than 120 mg/dl,ST-T wave abnormality,178,no,0.8,downsloping,0,normal,1
4,57,female,typical angina,120,354,lower than 120 mg/dl,ST-T wave abnormality,163,yes,0.6,downsloping,0,normal,1
5,57,male,typical angina,140,192,lower than 120 mg/dl,ST-T wave abnormality,148,no,0.4,flat,0,fixed defect,1
6,56,female,atypical angina,140,294,lower than 120 mg/dl,normal,153,no,1.3,flat,0,normal,1
7,44,male,atypical angina,120,263,lower than 120 mg/dl,ST-T wave abnormality,173,no,0.0,downsloping,0,reversible defect,1
8,52,male,non-anginal pain,172,199,greater than 120 mg/dl,ST-T wave abnormality,162,no,0.5,downsloping,0,reversible defect,1
9,57,male,non-anginal pain,150,168,lower than 120 mg/dl,ST-T wave abnormality,174,no,1.6,downsloping,0,normal,1
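
The recoding step above relies on `.replace`, which silently leaves any value that is missing from the mapping unchanged, so a stray code would end up mixed in with the text labels. One possible safeguard is sketched below using `Series.map`, which yields NaN for unmapped values and so makes them easy to detect. The helper name `recode_strict` and the toy series are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def recode_strict(series: pd.Series, mapping: dict) -> pd.Series:
    """Map codes to labels, raising if any code has no label in the mapping."""
    recoded = series.map(mapping)
    # Values that became NaN had no entry in the mapping (or were already missing).
    unmapped = series[recoded.isna()].unique().tolist()
    if unmapped:
        raise ValueError(f"unmapped codes in {series.name!r}: {unmapped}")
    return recoded

# Toy series standing in for the 'Sex' column; the values are illustrative only.
toy = pd.Series([0, 1, 1, 0], name="Sex")
labels = recode_strict(toy, {0: "female", 1: "male"})
print(labels.tolist())  # ['female', 'male', 'male', 'female']
```

The trade-off is strictness: the helper stops on the first column with an unexpected code rather than passing it through, which suits a dataset where every code is documented in the glossary.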


### Statistical Summary of the Heart Data

To gain a quick overview of the dataset’s numerical features, the **`.describe()`** method is used. It generates key summary statistics such as **count**, **mean**, **standard deviation**, **minimum**, **maximum**, and the **quartile (25%, 50%, 75%) values** for each numeric column.

This summary helps identify data distribution patterns, detect potential outliers, and understand the overall range and central tendency of each feature, all of which are essential for exploring the characteristics of patients in the heart disease dataset.


In [13]:
heart_data.describe()

Unnamed: 0,Age,Resting_Blood_Pressure,Serum_Cholesterol,Max_Heart_Rate,ST_Depression_Exercise,Num_Major_Vessels,Heart_Disease_Presence
count,296.0,296.0,296.0,296.0,296.0,296.0,296.0
mean,54.523649,131.60473,247.155405,149.560811,1.059122,0.679054,0.540541
std,9.059471,17.72662,51.977011,22.970792,1.166474,0.939726,0.499198
min,29.0,94.0,126.0,71.0,0.0,0.0,0.0
25%,48.0,120.0,211.0,133.0,0.0,0.0,0.0
50%,56.0,130.0,242.5,152.5,0.8,0.0,1.0
75%,61.0,140.0,275.25,166.0,1.65,1.0,1.0
max,77.0,200.0,564.0,202.0,6.2,3.0,1.0
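
The quartiles in this table can be turned into a simple outlier check. The sketch below uses the common 1.5 × IQR rule, which is an assumption chosen for illustration rather than a method taken from the notebook; the helper name `iqr_outliers` and the toy values are likewise hypothetical. With the cholesterol quartiles reported above (211 and 275.25), the upper fence would be 275.25 + 1.5 × 64.25 = 371.625 mg/dL, so the maximum of 564 falls well outside it.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Toy cholesterol values; illustrative only, not rows from heart.csv.
toy_chol = pd.Series([211, 242, 250, 275, 230, 564], name="Serum_Cholesterol")
print(iqr_outliers(toy_chol).tolist())  # [564]
```

On the real data, the call would be `iqr_outliers(heart_data['Serum_Cholesterol'])`. Flagged values are candidates for inspection, not automatically errors.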


The statistical summary of the heart dataset shows that the mean patient age is about **54.5 years** (median **56**), with ages ranging from **29 to 77**. The mean **resting blood pressure** is around **132 mmHg**, and **serum cholesterol** averages **247 mg/dL**. The cholesterol maximum of **564 mg/dL** lies far above the 75th percentile (**275.25 mg/dL**), so it is a likely outlier worth inspecting.

The **maximum heart rate achieved** averages about **150 bpm** and ranges from **71 to 202 bpm**. The **ST depression** values vary widely, from **0.0 to 6.2**, reflecting differences in the heart's response to exercise. At least half of the patients have **no major vessels** colored by fluoroscopy (median = 0), and about **54%** are labeled as having **heart disease**, so the two classes are roughly balanced.

Overall, the data describes a predominantly middle-aged to older population with roughly balanced classes and a wide range of cardiovascular health indicators.
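
A natural follow-up to the overall summary, in line with the stated objective, is to compare the same statistics between the two target classes. The sketch below shows one way this might be done with `groupby`; the toy frame and its numbers are invented for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Toy frame with two numeric features and the label; values are made up.
toy = pd.DataFrame({
    "Max_Heart_Rate": [150, 187, 120, 110],
    "ST_Depression_Exercise": [0.5, 0.0, 2.5, 3.0],
    "Heart_Disease_Presence": [1, 1, 0, 0],
})

# Mean of each numeric feature within each target class.
group_means = toy.groupby("Heart_Disease_Presence").mean()
print(group_means)
```

Applied to `heart_data`, the numeric columns would need to be selected first, since the recoded categorical columns hold text labels. Features whose group means differ noticeably are candidates for closer examination as predictors.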
