# Heart Disease Dataset

## About the data

This dataset contains various health indicators and risk factors related to heart disease. Parameters such as age, gender, blood pressure, cholesterol levels, smoking habits, and exercise patterns have been collected to analyze heart disease risk and contribute to health research. The dataset can be used by healthcare professionals, researchers, and data analysts to examine trends related to heart disease, identify risk factors, and perform various health-related analyses.


The attributes included are:

Age: The individual's age.<br>
Gender: The individual's gender (Male or Female).<br>
Blood Pressure: The individual's blood pressure (systolic).<br>
Cholesterol Level: The individual's total cholesterol level.<br>
Exercise Habits: The individual's exercise habits (Low, Medium, High).<br>
Smoking: Whether the individual smokes or not (Yes or No).<br>
Family Heart Disease: Whether there is a family history of heart disease (Yes or No).<br>
Diabetes: Whether the individual has diabetes (Yes or No).<br>
BMI: The individual's body mass index.<br>
High Blood Pressure: Whether the individual has high blood pressure (Yes or No).<br>
Low HDL Cholesterol: Whether the individual has low HDL cholesterol (Yes or No).<br>
High LDL Cholesterol: Whether the individual has high LDL cholesterol (Yes or No).<br>
Alcohol Consumption: The individual's alcohol consumption level (None, Low, Medium, High).<br>
Stress Level: The individual's stress level (Low, Medium, High).<br>
Sleep Hours: The number of hours the individual sleeps.<br>
Sugar Consumption: The individual's sugar consumption level (Low, Medium, High).<br>
Triglyceride Level: The individual's triglyceride level.<br>
Fasting Blood Sugar: The individual's fasting blood sugar level.<br>
CRP Level: The C-reactive protein level (a marker of inflammation).<br>
Homocysteine Level: The individual's homocysteine level (an amino acid that affects blood vessel health).<br>
Heart Disease Status: The individual's heart disease status (Yes or No)
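
The value sets listed above can double as a quick sanity check once the file is loaded. The sketch below is not part of the original notebook: `EXPECTED_VALUES`, `unexpected_values`, and the three-row `sample` frame are hypothetical names and data used only for illustration, and only a few of the categorical columns are covered.

```python
import pandas as pd

# Allowed categories, copied from the attribute descriptions above.
EXPECTED_VALUES = {
    "Gender": {"Male", "Female"},
    "Exercise Habits": {"Low", "Medium", "High"},
    "Smoking": {"Yes", "No"},
    "Alcohol Consumption": {"None", "Low", "Medium", "High"},
}


def unexpected_values(data, expected=EXPECTED_VALUES):
    """Map each checked column to the set of non-null values not listed for it."""
    report = {}
    for column, allowed in expected.items():
        if column not in data.columns:
            continue  # skip columns absent from this frame
        found = set(data[column].dropna().unique()) - allowed
        if found:
            report[column] = found
    return report


# Hypothetical sample; "M" is a deliberate mismatch.
sample = pd.DataFrame({
    "Gender": ["Male", "M", "Female"],
    "Exercise Habits": ["Low", "High", None],
    "Smoking": ["Yes", "No", "No"],
})

print(unexpected_values(sample))
```

An empty dictionary means every checked column contains only the documented categories.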

## Data overview

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("heart_disease.csv")
df.head()

| | Age | Gender | Blood Pressure | Cholesterol Level | Exercise Habits | Smoking | Family Heart Disease | Diabetes | BMI | High Blood Pressure | ... | High LDL Cholesterol | Alcohol Consumption | Stress Level | Sleep Hours | Sugar Consumption | Triglyceride Level | Fasting Blood Sugar | CRP Level | Homocysteine Level | Heart Disease Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 56.0 | Male | 153.0 | 155.0 | High | Yes | Yes | No | 24.991591 | Yes | ... | No | High | Medium | 7.633228 | Medium | 342.0 |  | 12.969246 | 12.38725 | No |
| 1 | 69.0 | Female | 146.0 | 286.0 | High | No | Yes | Yes | 25.221799 | No | ... | No | Medium | High | 8.744034 | Medium | 133.0 | 157.0 | 9.355389 | 19.298875 | No |
| 2 | 46.0 | Male | 126.0 | 216.0 | Low | No | No | No | 29.855447 | No | ... | Yes | Low | Low | 4.44044 | Low | 393.0 | 92.0 | 12.709873 | 11.230926 | No |
| 3 | 32.0 | Female | 122.0 | 293.0 | High | Yes | Yes | No | 24.130477 | Yes | ... | Yes | Low | High | 5.249405 | High | 293.0 | 94.0 | 12.509046 | 5.961958 | No |
| 4 | 60.0 | Male | 166.0 | 242.0 | Low | Yes | Yes | Yes | 20.486289 | Yes | ... | No | Low | High | 7.030971 | High | 263.0 | 154.0 | 10.381259 | 8.153887 | No |


In [3]:
df.shape

(10000, 21)

The data has 10,000 instances and 21 attributes, listed below.

In [4]:
df.columns

Index(['Age', 'Gender', 'Blood Pressure', 'Cholesterol Level',
       'Exercise Habits', 'Smoking', 'Family Heart Disease', 'Diabetes', 'BMI',
       'High Blood Pressure', 'Low HDL Cholesterol', 'High LDL Cholesterol',
       'Alcohol Consumption', 'Stress Level', 'Sleep Hours',
       'Sugar Consumption', 'Triglyceride Level', 'Fasting Blood Sugar',
       'CRP Level', 'Homocysteine Level', 'Heart Disease Status'],
      dtype='object')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Age                   9971 non-null   float64
 1   Gender                9981 non-null   object 
 2   Blood Pressure        9981 non-null   float64
 3   Cholesterol Level     9970 non-null   float64
 4   Exercise Habits       9975 non-null   object 
 5   Smoking               9975 non-null   object 
 6   Family Heart Disease  9979 non-null   object 
 7   Diabetes              9970 non-null   object 
 8   BMI                   9978 non-null   float64
 9   High Blood Pressure   9974 non-null   object 
 10  Low HDL Cholesterol   9975 non-null   object 
 11  High LDL Cholesterol  9974 non-null   object 
 12  Alcohol Consumption   9968 non-null   object 
 13  Stress Level          9978 non-null   object 
 14  Sleep Hours           9975 non-null   float64
 15  Sugar Consumption   

In [6]:
df.describe()

| | Age | Blood Pressure | Cholesterol Level | BMI | Sleep Hours | Triglyceride Level | Fasting Blood Sugar | CRP Level | Homocysteine Level |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 9971.0 | 9981.0 | 9970.0 | 9978.0 | 9975.0 | 9974.0 | 9978.0 | 9974.0 | 9980.0 |
| mean | 49.296259 | 149.75774 | 225.425577 | 29.077269 | 6.991329 | 250.734409 | 120.142213 | 7.472201 | 12.456271 |
| std | 18.19397 | 17.572969 | 43.575809 | 6.307098 | 1.753195 | 87.067226 | 23.584011 | 4.340248 | 4.323426 |
| min | 18.0 | 120.0 | 150.0 | 18.002837 | 4.000605 | 100.0 | 80.0 | 0.003647 | 5.000236 |
| 25% | 34.0 | 134.0 | 187.0 | 23.658075 | 5.449866 | 176.0 | 99.0 | 3.674126 | 8.723334 |
| 50% | 49.0 | 150.0 | 226.0 | 29.079492 | 7.003252 | 250.0 | 120.0 | 7.472164 | 12.409395 |
| 75% | 65.0 | 165.0 | 263.0 | 34.520015 | 8.531577 | 326.0 | 141.0 | 11.255592 | 16.140564 |
| max | 80.0 | 180.0 | 300.0 | 39.996954 | 9.999952 | 400.0 | 160.0 | 14.997087 | 19.999037 |


## Null values

In [7]:
df.isnull().sum()

Age                     29
Gender                  19
Blood Pressure          19
Cholesterol Level       30
Exercise Habits         25
Smoking                 25
Family Heart Disease    21
Diabetes                30
BMI                     22
High Blood Pressure     26
Low HDL Cholesterol     25
High LDL Cholesterol    26
Alcohol Consumption     32
Stress Level            22
Sleep Hours             25
Sugar Consumption       30
Triglyceride Level      26
Fasting Blood Sugar     22
CRP Level               26
Homocysteine Level      20
Heart Disease Status     0
dtype: int64

Every column is missing less than 2% of its values (at most 32 of 10,000), so rows containing null values are dropped.

In [8]:
df = df.dropna()
df.isnull().sum()

Age                     0
Gender                  0
Blood Pressure          0
Cholesterol Level       0
Exercise Habits         0
Smoking                 0
Family Heart Disease    0
Diabetes                0
BMI                     0
High Blood Pressure     0
Low HDL Cholesterol     0
High LDL Cholesterol    0
Alcohol Consumption     0
Stress Level            0
Sleep Hours             0
Sugar Consumption       0
Triglyceride Level      0
Fasting Blood Sugar     0
CRP Level               0
Homocysteine Level      0
Heart Disease Status    0
dtype: int64
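
The 2% figure refers to each column separately; how many rows `dropna()` removes depends on how the nulls are spread across rows. A minimal sketch of both checks, using a hypothetical four-row frame and a hypothetical helper `missing_summary` (neither comes from the original notebook):

```python
import pandas as pd


def missing_summary(data):
    """Return per-column null percentages and the share of rows with any null."""
    per_column_pct = data.isnull().mean() * 100
    rows_with_null_pct = data.isnull().any(axis=1).mean() * 100
    return per_column_pct, rows_with_null_pct


# Hypothetical toy data: one null in each column, on different rows.
toy = pd.DataFrame({
    "Age": [56.0, None, 46.0, 32.0],
    "BMI": [24.99, 25.22, None, 24.13],
})

per_column_pct, rows_with_null_pct = missing_summary(toy)
print(per_column_pct)      # 25% missing in each column
print(rows_with_null_pct)  # 50% of rows contain at least one null
```

In this notebook the per-column null counts sum to 500 and `df.info()` later reports 9500 rows, which implies that each null sits in a different row and that 5% of the rows are removed.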

## Datatypes

In [9]:
df.dtypes

Age                     float64
Gender                   object
Blood Pressure          float64
Cholesterol Level       float64
Exercise Habits          object
Smoking                  object
Family Heart Disease     object
Diabetes                 object
BMI                     float64
High Blood Pressure      object
Low HDL Cholesterol      object
High LDL Cholesterol     object
Alcohol Consumption      object
Stress Level             object
Sleep Hours             float64
Sugar Consumption        object
Triglyceride Level      float64
Fasting Blood Sugar     float64
CRP Level               float64
Homocysteine Level      float64
Heart Disease Status     object
dtype: object

In [10]:
# Age and Blood Pressure are recorded as whole numbers, so store them as integers
df['Age'] = df['Age'].astype(int)
df['Blood Pressure'] = df['Blood Pressure'].astype(int)

In [11]:
# Round the continuous measurements to two decimal places
df['BMI'] = df['BMI'].round(2)
df['Sleep Hours'] = df['Sleep Hours'].round(2)
df['CRP Level'] = df['CRP Level'].round(2)
df['Homocysteine Level'] = df['Homocysteine Level'].round(2)

In [12]:
df.dtypes

Age                       int32
Gender                   object
Blood Pressure            int32
Cholesterol Level       float64
Exercise Habits          object
Smoking                  object
Family Heart Disease     object
Diabetes                 object
BMI                     float64
High Blood Pressure      object
Low HDL Cholesterol      object
High LDL Cholesterol     object
Alcohol Consumption      object
Stress Level             object
Sleep Hours             float64
Sugar Consumption        object
Triglyceride Level      float64
Fasting Blood Sugar     float64
CRP Level               float64
Homocysteine Level      float64
Heart Disease Status     object
dtype: object
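
`astype(int)` silently truncates any fractional part, so it can help to confirm that a float column holds only whole numbers before casting. A minimal sketch of such a check; `to_int_if_whole` and the sample values are hypothetical and not part of the original notebook:

```python
import pandas as pd


def to_int_if_whole(series):
    """Cast a float Series to int only when no value has a fractional part."""
    if series.isnull().any():
        raise ValueError("column contains nulls; handle them before casting")
    if not (series % 1 == 0).all():
        raise ValueError("column contains non-integer values")
    return series.astype(int)


ages = pd.Series([56.0, 69.0, 46.0], name="Age")
print(to_int_if_whole(ages).tolist())  # [56, 69, 46]

bmi = pd.Series([24.99, 25.22], name="BMI")
try:
    to_int_if_whole(bmi)
except ValueError as err:
    print(f"BMI not converted: {err}")
```

The `int32` in the output above likely reflects the platform's default integer width; on other systems the same cast may give `int64`.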

In [13]:
df.head()

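| | Age | Gender | Blood Pressure | Cholesterol Level | Exercise Habits | Smoking | Family Heart Disease | Diabetes | BMI | High Blood Pressure | ... | High LDL Cholesterol | Alcohol Consumption | Stress Level | Sleep Hours | Sugar Consumption | Triglyceride Level | Fasting Blood Sugar | CRP Level | Homocysteine Level | Heart Disease Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 69 | Female | 146 | 286.0 | High | No | Yes | Yes | 25.22 | No | ... | No | Medium | High | 8.74 | Medium | 133.0 | 157.0 | 9.36 | 19.3 | No |
| 2 | 46 | Male | 126 | 216.0 | Low | No | No | No | 29.86 | No | ... | Yes | Low | Low | 4.44 | Low | 393.0 | 92.0 | 12.71 | 11.23 | No |
| 3 | 32 | Female | 122 | 293.0 | High | Yes | Yes | No | 24.13 | Yes | ... | Yes | Low | High | 5.25 | High | 293.0 | 94.0 | 12.51 | 5.96 | No |
| 4 | 60 | Male | 166 | 242.0 | Low | Yes | Yes | Yes | 20.49 | Yes | ... | No | Low | High | 7.03 | High | 263.0 | 154.0 | 10.38 | 8.15 | No |
| 5 | 25 | Male | 152 | 257.0 | Low | Yes | No | No | 28.14 | No | ... | No | Low | Medium | 5.5 | Low | 126.0 | 91.0 | 4.3 | 10.82 | No |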


## Outliers

In [14]:
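def value_outliers(attribute):
    """Return the rows of df whose `attribute` value is an outlier under the 1.5 * IQR rule."""
    # Interquartile range of the column
    Q1 = df[attribute].quantile(0.25)
    Q3 = df[attribute].quantile(0.75)
    IQR = Q3 - Q1
    # Values more than 1.5 * IQR beyond the quartiles count as outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = df[(df[attribute] < lower_bound) | (df[attribute] > upper_bound)]
    return outliers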
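
To see what the rule flags, it can be applied to a small column with one deliberately extreme value. The sketch below uses a variant of the function that takes the frame as an argument instead of reading the global `df`; `iqr_outliers`, its `k` parameter, and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd


def iqr_outliers(data, attribute, k=1.5):
    """Return rows whose `attribute` lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = data[attribute].quantile(0.25)
    q3 = data[attribute].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    mask = (data[attribute] < lower_bound) | (data[attribute] > upper_bound)
    return data[mask]


# Hypothetical toy data: 400 is far above the other readings.
toy = pd.DataFrame({"Blood Pressure": [120, 125, 130, 135, 140, 400]})
flagged = iqr_outliers(toy, "Blood Pressure")
print(flagged)
```

Here Q1 = 126.25 and Q3 = 138.75, so the bounds are 107.5 and 157.5 and only the row with 400 is returned.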

#### Blood pressure

In [15]:
outliers = value_outliers("Blood Pressure")
outliers

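| | Age | Gender | Blood Pressure | Cholesterol Level | Exercise Habits | Smoking | Family Heart Disease | Diabetes | BMI | High Blood Pressure | ... | High LDL Cholesterol | Alcohol Consumption | Stress Level | Sleep Hours | Sugar Consumption | Triglyceride Level | Fasting Blood Sugar | CRP Level | Homocysteine Level | Heart Disease Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |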


#### Cholesterol level

In [16]:
outliers = value_outliers("Cholesterol Level")
outliers

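| | Age | Gender | Blood Pressure | Cholesterol Level | Exercise Habits | Smoking | Family Heart Disease | Diabetes | BMI | High Blood Pressure | ... | High LDL Cholesterol | Alcohol Consumption | Stress Level | Sleep Hours | Sugar Consumption | Triglyceride Level | Fasting Blood Sugar | CRP Level | Homocysteine Level | Heart Disease Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |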


#### BMI

In [17]:
outliers = value_outliers("BMI")
outliers

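| | Age | Gender | Blood Pressure | Cholesterol Level | Exercise Habits | Smoking | Family Heart Disease | Diabetes | BMI | High Blood Pressure | ... | High LDL Cholesterol | Alcohol Consumption | Stress Level | Sleep Hours | Sugar Consumption | Triglyceride Level | Fasting Blood Sugar | CRP Level | Homocysteine Level | Heart Disease Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |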


#### Sleep hours

In [18]:
outliers = value_outliers("Sleep Hours")
outliers

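| | Age | Gender | Blood Pressure | Cholesterol Level | Exercise Habits | Smoking | Family Heart Disease | Diabetes | BMI | High Blood Pressure | ... | High LDL Cholesterol | Alcohol Consumption | Stress Level | Sleep Hours | Sugar Consumption | Triglyceride Level | Fasting Blood Sugar | CRP Level | Homocysteine Level | Heart Disease Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |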


#### Triglyceride level

In [19]:
outliers = value_outliers("Triglyceride Level")
outliers

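| | Age | Gender | Blood Pressure | Cholesterol Level | Exercise Habits | Smoking | Family Heart Disease | Diabetes | BMI | High Blood Pressure | ... | High LDL Cholesterol | Alcohol Consumption | Stress Level | Sleep Hours | Sugar Consumption | Triglyceride Level | Fasting Blood Sugar | CRP Level | Homocysteine Level | Heart Disease Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |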


#### Fasting blood sugar

In [20]:
outliers = value_outliers("Fasting Blood Sugar")
outliers

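| | Age | Gender | Blood Pressure | Cholesterol Level | Exercise Habits | Smoking | Family Heart Disease | Diabetes | BMI | High Blood Pressure | ... | High LDL Cholesterol | Alcohol Consumption | Stress Level | Sleep Hours | Sugar Consumption | Triglyceride Level | Fasting Blood Sugar | CRP Level | Homocysteine Level | Heart Disease Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |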


#### CRP level

In [21]:
outliers = value_outliers("CRP Level")
outliers

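| | Age | Gender | Blood Pressure | Cholesterol Level | Exercise Habits | Smoking | Family Heart Disease | Diabetes | BMI | High Blood Pressure | ... | High LDL Cholesterol | Alcohol Consumption | Stress Level | Sleep Hours | Sugar Consumption | Triglyceride Level | Fasting Blood Sugar | CRP Level | Homocysteine Level | Heart Disease Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |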


#### Homocysteine level

In [22]:
outliers = value_outliers("Homocysteine Level")
outliers

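| | Age | Gender | Blood Pressure | Cholesterol Level | Exercise Habits | Smoking | Family Heart Disease | Diabetes | BMI | High Blood Pressure | ... | High LDL Cholesterol | Alcohol Consumption | Stress Level | Sleep Hours | Sugar Consumption | Triglyceride Level | Fasting Blood Sugar | CRP Level | Homocysteine Level | Heart Disease Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |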


None of the attributes checked has values outside the 1.5 × IQR bounds, so there are no outliers to handle.
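
This conclusion is consistent with the summary statistics from `df.describe()`: for every numeric column, the minimum and maximum fall inside the 1.5 × IQR bounds. The sketch below redoes that arithmetic with values copied from the `describe()` output, which was computed before the null rows were dropped, so the quartiles of the cleaned data may differ slightly. The names `SUMMARY` and `iqr_bounds` are hypothetical.

```python
# (min, Q1, Q3, max) copied from the df.describe() output earlier in the notebook.
SUMMARY = {
    "Age": (18.0, 34.0, 65.0, 80.0),
    "Blood Pressure": (120.0, 134.0, 165.0, 180.0),
    "Cholesterol Level": (150.0, 187.0, 263.0, 300.0),
    "BMI": (18.002837, 23.658075, 34.520015, 39.996954),
    "Sleep Hours": (4.000605, 5.449866, 8.531577, 9.999952),
    "Triglyceride Level": (100.0, 176.0, 326.0, 400.0),
    "Fasting Blood Sugar": (80.0, 99.0, 141.0, 160.0),
    "CRP Level": (0.003647, 3.674126, 11.255592, 14.997087),
    "Homocysteine Level": (5.000236, 8.723334, 16.140564, 19.999037),
}


def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) bounds of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


columns_with_outliers = []
for column, (minimum, q1, q3, maximum) in SUMMARY.items():
    lower, upper = iqr_bounds(q1, q3)
    if minimum < lower or maximum > upper:
        columns_with_outliers.append(column)
    print(f"{column}: bounds [{lower:.2f}, {upper:.2f}], range [{minimum}, {maximum}]")

print(columns_with_outliers)  # [] -> no column has values outside its bounds
```

For Blood Pressure, for example, the IQR is 165 - 134 = 31, giving bounds of 87.5 and 211.5, while the observed values run from 120 to 180.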

In [23]:
df

| | Age | Gender | Blood Pressure | Cholesterol Level | Exercise Habits | Smoking | Family Heart Disease | Diabetes | BMI | High Blood Pressure | ... | High LDL Cholesterol | Alcohol Consumption | Stress Level | Sleep Hours | Sugar Consumption | Triglyceride Level | Fasting Blood Sugar | CRP Level | Homocysteine Level | Heart Disease Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 69 | Female | 146 | 286.0 | High | No | Yes | Yes | 25.22 | No | ... | No | Medium | High | 8.74 | Medium | 133.0 | 157.0 | 9.36 | 19.30 | No |
| 2 | 46 | Male | 126 | 216.0 | Low | No | No | No | 29.86 | No | ... | Yes | Low | Low | 4.44 | Low | 393.0 | 92.0 | 12.71 | 11.23 | No |
| 3 | 32 | Female | 122 | 293.0 | High | Yes | Yes | No | 24.13 | Yes | ... | Yes | Low | High | 5.25 | High | 293.0 | 94.0 | 12.51 | 5.96 | No |
| 4 | 60 | Male | 166 | 242.0 | Low | Yes | Yes | Yes | 20.49 | Yes | ... | No | Low | High | 7.03 | High | 263.0 | 154.0 | 10.38 | 8.15 | No |
| 5 | 25 | Male | 152 | 257.0 | Low | Yes | No | No | 28.14 | No | ... | No | Low | Medium | 5.50 | Low | 126.0 | 91.0 | 4.30 | 10.82 | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 25 | Female | 136 | 243.0 | Medium | Yes | No | No | 18.79 | Yes | ... | Yes | Medium | High | 6.83 | Medium | 343.0 | 133.0 | 3.59 | 19.13 | Yes |
| 9996 | 38 | Male | 172 | 154.0 | Medium | No | No | No | 31.86 | Yes | ... | Yes |  | High | 8.25 | Low | 377.0 | 83.0 | 2.66 | 9.72 | Yes |
| 9997 | 73 | Male | 152 | 201.0 | High | Yes | No | Yes | 26.90 | No | ... | Yes |  | Low | 4.44 | Low | 248.0 | 88.0 | 4.41 | 9.49 | Yes |
| 9998 | 23 | Male | 142 | 299.0 | Low | Yes | No | Yes | 34.96 | Yes | ... | Yes | Medium | High | 8.53 | Medium | 113.0 | 153.0 | 7.22 | 11.87 | Yes |


In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9500 entries, 1 to 9999
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Age                   9500 non-null   int32  
 1   Gender                9500 non-null   object 
 2   Blood Pressure        9500 non-null   int32  
 3   Cholesterol Level     9500 non-null   float64
 4   Exercise Habits       9500 non-null   object 
 5   Smoking               9500 non-null   object 
 6   Family Heart Disease  9500 non-null   object 
 7   Diabetes              9500 non-null   object 
 8   BMI                   9500 non-null   float64
 9   High Blood Pressure   9500 non-null   object 
 10  Low HDL Cholesterol   9500 non-null   object 
 11  High LDL Cholesterol  9500 non-null   object 
 12  Alcohol Consumption   9500 non-null   object 
 13  Stress Level          9500 non-null   object 
 14  Sleep Hours           9500 non-null   float64
 15  Sugar Consumption    

In [25]:
df.to_excel("processed_heart_disease.xlsx")
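
By default the row index is written as the first column of the exported file, and reading the file back then produces an extra unnamed column. The sketch below illustrates this with `to_csv` and an in-memory buffer so that it runs without an Excel engine installed; `to_excel` accepts the same `index` parameter. The toy frame is hypothetical and not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical two-row frame whose index starts at 1, as after dropping row 0.
toy = pd.DataFrame({"Age": [69, 46], "Gender": ["Female", "Male"]}, index=[1, 2])

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)  # ['Unnamed: 0', 'Age', 'Gender']

# index=False leaves the index out of the file.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['Age', 'Gender']
```

Whether to keep the index depends on whether the original row numbers are needed downstream.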