# Heart Disease Prediction using BLR

#### Background:
In recent years, the field of healthcare has seen a significant transformation with the advent of data science. Data science in healthcare involves the application of statistical methods, machine learning techniques, and computational algorithms to analyze and interpret complex healthcare data. 

The dataset originally comes from the CDC and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to collect data on the health status of U.S. residents. As described by the CDC: "Established in 1984 with 15 states, BRFSS now collects data in all 50 states, the District of Columbia, and three U.S. territories."


#### Attributes:

###### Demographic:
    BMI: Body Mass Index (Numeric)
    Sex: Male or Female (Nominal)
    Age: Age category of the patient, 13 categories such as `55-59` or `80 or older` (Categorical)
    Race: Race of the patient, 6 categories (Factor)

    

###### Behavioral:

    Smoking: Whether or not the patient is a current smoker (Factor) Values: `Yes` or `No`
    Alcohol Drinking: Whether the patient consumes alcohol (Factor) Values: `Yes` or `No`
    Difficulty Walking: Whether the patient has difficulty walking (Factor) Values: `Yes` or `No`
    Physical Activity: Whether the patient does physical activity such as running, walking, skipping, etc. (Factor) Values: `Yes` or `No`
    Sleep Time: Average sleep time in hours (Numeric)
    
    
###### Information on medical history:

    Physical Health: For how many days during the past 30 days was your physical health not good? (Numeric)
    Mental Health: For how many days during the past 30 days was your mental health not good? (Numeric)
    General Health: General health of the patient (Factor)
    Asthma: Whether the patient has asthma (Factor) Values: `Yes` or `No`
    Kidney Disease: Whether the patient has kidney disease (Factor) Values: `Yes` or `No`
    Skin Cancer: Whether the patient has skin cancer (Factor) Values: `Yes` or `No`
    Stroke: Whether the patient has had a stroke (Factor) Values: `Yes` or `No`
    Diabetic: Whether the patient has diabetes (Factor) Values: `Yes`, `No`, `Yes (during pregnancy)`, or `No, borderline`
   
    
###### Information on current medical condition:

    This dataset records whether each patient is suffering from heart disease.
    Heart Disease: Whether the patient is suffering from heart disease (Factor) Values: `Yes` or `No`
    
    
    
###### Target variable to predict:
Heart Disease, modeled as a function of demographic, medical history, and behavioural factors (binary: “1” means “Yes”, “0” means “No”).
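
The raw `HeartDisease` column holds the strings `Yes` and `No`, so the 0/1 coding described here has to be applied explicitly. A minimal sketch of one way to do that follows; the helper name `encode_target` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def encode_target(series: pd.Series) -> pd.Series:
    """Map 'Yes'/'No' strings to 1/0; unexpected labels raise an error."""
    mapping = {"Yes": 1, "No": 0}
    cleaned = series.str.strip()
    unknown = set(cleaned.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"Unexpected target labels: {sorted(unknown)}")
    return cleaned.map(mapping).astype("int64")

# Toy data standing in for the real HeartDisease column (hypothetical values)
toy = pd.DataFrame({"HeartDisease": ["No", "Yes", "No", "No"]})
toy["HeartDisease_bin"] = encode_target(toy["HeartDisease"])
print(toy)
```

Raising on unknown labels, rather than letting them become missing values, keeps a stray spelling such as `yes` from quietly shrinking the positive class.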


# Step 1: Data Management and Exploratory Data Analysis


In [2]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

import os
# Machine-specific path: point this at your local copy of the raw data
os.chdir(r"C:\Users\willi\GitHub\Capstone_Project\data\raw")

In [None]:
# Load Datasets

BH = pd.read_csv('Behaviour.csv')
DG = pd.read_csv('Demographics.csv')
HeartD = pd.read_csv('Heart Disease.csv')
MedicalH = pd.read_csv('Medical History.csv')
print("✅ Datasets Loaded Successfully")



✅ Datasets Loaded Successfully


In [None]:
# Check the shape of the datasets

print("Shape of Behaviour Dataset:", BH.shape)
print("Shape of Demographics Dataset:", DG.shape)   
print("Shape of Heart Disease Dataset:", HeartD.shape)
print("Shape of Medical History Dataset:", MedicalH.shape)

Shape of Behaviour Dataset: (319795, 6)
Shape of Demographics Dataset: (319795, 5)
Shape of Heart Disease Dataset: (319795, 2)
Shape of Medical History Dataset: (319795, 9)


In [5]:
# Preview Data Structure

print("\n--- Behaviour Info ---")
BH.info()

print("\n--- Demographics Info ---")
DG.info()

print("\n--- Heart Disease Info ---")
HeartD.info()

print("\n--- Medical History Info ---")
MedicalH.info()





--- Behaviour Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319795 entries, 0 to 319794
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   pid               319795 non-null  object
 1   Smoking           319795 non-null  object
 2   AlcoholDrinking   319795 non-null  object
 3   DiffWalking       319795 non-null  object
 4   PhysicalActivity  319795 non-null  object
 5   SleepTime         319795 non-null  int64 
dtypes: int64(1), object(5)
memory usage: 14.6+ MB

--- Demographics Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319795 entries, 0 to 319794
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   pid          319795 non-null  object 
 1   BMI          319795 non-null  float64
 2   Sex          319795 non-null  object 
 3   AgeCategory  319795 non-null  object 
 4   Race         319795 non-null  object 
dtypes: float64(1), object(4)
[output truncated]

In [6]:
# Count Missing Values

print("Missing in Demographics:\n", DG.isnull().sum())
print("\nMissing in Medical History:\n", MedicalH.isnull().sum())
print("\nMissing in Behaviour:\n", BH.isnull().sum())
print("\nMissing in Heart Disease:\n", HeartD.isnull().sum())


Missing in Demographics:
 pid            0
BMI            0
Sex            0
AgeCategory    0
Race           0
dtype: int64

Missing in Medical History:
 pid               0
PhysicalHealth    0
MentalHealth      0
GenHealth         0
Asthma            0
KidneyDisease     0
SkinCancer        0
Stroke            0
Diabetic          0
dtype: int64

Missing in Behaviour:
 pid                 0
Smoking             0
AlcoholDrinking     0
DiffWalking         0
PhysicalActivity    0
SleepTime           0
dtype: int64

Missing in Heart Disease:
 pid             0
HeartDisease    0
dtype: int64
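
`isnull()` only counts true missing values. Blank or whitespace-only strings in the text columns would not be counted, so a complementary check might look like the sketch below. The helper `count_blank_strings` and the toy frame are assumptions made for illustration.

```python
import pandas as pd

def count_blank_strings(df: pd.DataFrame) -> pd.Series:
    """Count empty or whitespace-only strings in each non-numeric column."""
    text_cols = df.select_dtypes(exclude="number")
    return text_cols.apply(lambda col: col.astype(str).str.strip().eq("").sum())

# Toy frame with one whitespace-only entry (hypothetical values)
toy = pd.DataFrame({
    "pid": ["PID-01", "PID-02", "PID-03"],
    "Smoking": ["Yes", " ", "No"],
    "SleepTime": [5, 7, 8],
})
blank_counts = count_blank_strings(toy)
print(blank_counts)
```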


In [7]:
# Check for Duplicates

print("Duplicates in Demographics:", DG.duplicated().sum())
print("Duplicates in Medical History:", MedicalH.duplicated().sum())
print("Duplicates in Behaviour:", BH.duplicated().sum())
print("Duplicates in Heart Disease:", HeartD.duplicated().sum())


Duplicates in Demographics: 0
Duplicates in Medical History: 0
Duplicates in Behaviour: 0
Duplicates in Heart Disease: 0
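
`duplicated()` with no arguments compares entire rows, so it cannot flag two different records that share a `pid`. Because `pid` is the key used to combine the tables, checking it on its own is a useful extra step. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy frame in which one pid appears twice with different BMI values
toy = pd.DataFrame({
    "pid": ["PID-01", "PID-02", "PID-02"],
    "BMI": [16.6, 20.34, 21.0],
})

full_row_dupes = toy.duplicated().sum()          # identical rows only
key_dupes = toy.duplicated(subset="pid").sum()   # repeated key values
print(full_row_dupes, key_dupes)
```

Here the full-row check reports nothing while the key check reports one repeat, which is the case that would inflate row counts in a merge.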


Summary

In [9]:
# Summary view of the datasets


summary = pd.DataFrame({
    "Dataset": ["Demographics", "Medical History", "Behaviour", "Heart Disease"],
    "Rows": [DG.shape[0], MedicalH.shape[0], BH.shape[0], HeartD.shape[0]],
    "Columns": [DG.shape[1], MedicalH.shape[1], BH.shape[1], HeartD.shape[1]],
    "Duplicates": [
        DG.duplicated().sum(),
        MedicalH.duplicated().sum(),
        BH.duplicated().sum(),
        HeartD.duplicated().sum()
    ]
})
summary


|   | Dataset | Rows | Columns | Duplicates |
|---|---|---|---|---|
| 0 | Demographics | 319795 | 5 | 0 |
| 1 | Medical History | 319795 | 9 | 0 |
| 2 | Behaviour | 319795 | 6 | 0 |
| 3 | Heart Disease | 319795 | 2 | 0 |


# Step 2: Merge Datasets 

All four datasets will be merged into one master dataset for EDA and cleaning. Each variable will then be examined and either kept or dropped based on its association with the target variable.

All datasets share a `pid` (Patient ID) column. We use an inner merge on `pid` to keep only the records present in all datasets.

In [10]:
# Merge Demographics and Behaviour on pid

df_1 = pd.merge(DG, BH, on="pid", how="inner")

# Merge result with Medical History
df_2 = pd.merge(df_1, MedicalH, on="pid", how="inner")

# Merge second result with Heart Disease
master_data = pd.merge(df_2, HeartD, on="pid", how="inner")

print("✅ Datasets Merged Successfully")

✅ Datasets Merged Successfully
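
An inner merge silently drops rows whose `pid` is missing from either side and multiplies rows if a key repeats. One way to guard against both is the `validate` argument of `pd.merge` plus a row-count check. The sketch below uses small made-up frames rather than the project files.

```python
import pandas as pd

# Toy stand-ins for two of the source tables (hypothetical values)
demo = pd.DataFrame({"pid": ["PID-01", "PID-02", "PID-03"], "BMI": [16.6, 20.34, 26.58]})
behav = pd.DataFrame({"pid": ["PID-01", "PID-02", "PID-03"], "SleepTime": [5, 7, 8]})

# validate="one_to_one" raises pandas.errors.MergeError if pid repeats on either side
merged = pd.merge(demo, behav, on="pid", how="inner", validate="one_to_one")

# An inner merge that loses no records keeps the original row count
assert len(merged) == len(demo) == len(behav)
print(merged.shape)
```

The master dataset's reported shape of (319795, 19) matches the 319795 rows of each source table, which is consistent with no records being lost in the merge.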


Inspect the master dataset

In [11]:
# Shape and columns
print("Shape of master dataset:", master_data.shape)
print("Columns in master dataset:", master_data.columns.tolist())

# Preview first few rows
master_data.head()


Shape of master dataset: (319795, 19)
Columns in master dataset: ['pid', 'BMI', 'Sex', 'AgeCategory', 'Race', 'Smoking', 'AlcoholDrinking', 'DiffWalking', 'PhysicalActivity', 'SleepTime', 'PhysicalHealth', 'MentalHealth', 'GenHealth', 'Asthma', 'KidneyDisease', 'SkinCancer', 'Stroke', 'Diabetic', 'HeartDisease']


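|   | pid | BMI | Sex | AgeCategory | Race | Smoking | AlcoholDrinking | DiffWalking | PhysicalActivity | SleepTime | PhysicalHealth | MentalHealth | GenHealth | Asthma | KidneyDisease | SkinCancer | Stroke | Diabetic | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PID-01 | 16.6 | Female | 55-59 | White | Yes | No | No | Yes | 5 | 3 | 30 | Very good | Yes | No | Yes | No | Yes | No |
| 1 | PID-02 | 20.34 | Female | 80 or older | White | No | No | No | Yes | 7 | 0 | 0 | Very good | No | No | No | Yes | No | No |
| 2 | PID-03 | 26.58 | Male | 65-69 | White | Yes | No | No | Yes | 8 | 20 | 30 | Fair | Yes | No | No | No | Yes | No |
| 3 | PID-04 | 24.21 | Female | 75-79 | White | No | No | No | No | 6 | 0 | 0 | Good | No | No | Yes | No | No | No |
| 4 | PID-05 | 23.71 | Female | 40-44 | White | No | No | Yes | Yes | 8 | 28 | 0 | Very good | No | No | No | No | No | No |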


After reviewing all the columns, we have decided to keep all of them for now, since every column has a perceived association with heart disease.
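
One simple way to move from a perceived to a measured association is to compare the heart disease rate across the levels of a categorical variable. The sketch below does this with `pd.crosstab` on a small made-up sample; the values are illustrative and are not results from the dataset.

```python
import pandas as pd

# Toy sample (hypothetical values, not drawn from the real data)
toy = pd.DataFrame({
    "Smoking":      ["Yes", "Yes", "No", "No", "Yes", "No"],
    "HeartDisease": ["Yes", "No",  "No", "No", "Yes", "No"],
})

# Row-normalised crosstab: share of HeartDisease outcomes within each Smoking level
rates = pd.crosstab(toy["Smoking"], toy["HeartDisease"], normalize="index")
print(rates)
```

Applied to `master_data`, the same call could be repeated for each factor column to see which ones show the largest difference in the `Yes` share.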

# Step 3: Summary Table

Next, we create a summary table of each variable to understand its values, variation, missing data, and any inconsistent entries.

In [15]:
# Descriptive Summary of the master dataset
summary_stats = master_data.describe(include='all').transpose()
summary_stats['missing'] = master_data.isnull().sum()
summary_stats['unique'] = master_data.nunique()

print("\n--- Descriptive Summary of Master Dataset ---\n", summary_stats)


--- Descriptive Summary of Master Dataset ---
                      count  unique         top    freq       mean       std  \
pid                 319795  319795  PID-319795       1        NaN       NaN   
BMI               319795.0    3604         NaN     NaN  28.325399    6.3561   
Sex                 319795       2      Female  167805        NaN       NaN   
AgeCategory         319795      13       65-69   34151        NaN       NaN   
Race                319795       6       White  245212        NaN       NaN   
Smoking             319795       2          No  187887        NaN       NaN   
AlcoholDrinking     319795       2          No  298018        NaN       NaN   
DiffWalking         319795       2          No  275385        NaN       NaN   
PhysicalActivity    319795       2         Yes  247957        NaN       NaN   
SleepTime         319795.0      24         NaN     NaN   7.097075  1.436007   
PhysicalHealth    319795.0      31         NaN     NaN    3.37171   7.95085   
[remaining rows truncated]
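
Because the `describe` output wraps across several blocks when printed, a narrower per-variable table can be easier to read. The sketch below is one possible layout on toy data; the helper name `variable_summary` is an assumption and does not appear in the original notebook.

```python
import pandas as pd

def variable_summary(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, missing count, unique count, most frequent value."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
        "unique": df.nunique(),
        "top": df.mode().iloc[0],  # first mode of each column
    })

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "Sex": ["Female", "Female", "Male"],
    "SleepTime": [5, 7, 7],
})
summary_table = variable_summary(toy)
print(summary_table)
```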

In [18]:
print(master_data.head())

      pid    BMI     Sex  AgeCategory   Race Smoking AlcoholDrinking  \
0  PID-01  16.60  Female        55-59  White     Yes              No   
1  PID-02  20.34  Female  80 or older  White      No              No   
2  PID-03  26.58    Male        65-69  White     Yes              No   
3  PID-04  24.21  Female        75-79  White      No              No   
4  PID-05  23.71  Female        40-44  White      No              No   

  DiffWalking PhysicalActivity  SleepTime  PhysicalHealth  MentalHealth  \
0          No              Yes          5               3            30   
1          No              Yes          7               0             0   
2          No              Yes          8              20            30   
3          No               No          6               0             0   
4         Yes              Yes          8              28             0   

   GenHealth Asthma KidneyDisease SkinCancer Stroke Diabetic HeartDisease  
0  Very good    Yes            No       