# Table of Contents

1. [Introduction](#Introduction)
2. [Data Loading and Exploration](#Data-Loading-and-Exploration)
3. [Data Cleaning and Preprocessing](#Data-Cleaning-and-Preprocessing)
4. [Exploratory Data Analysis](#Exploratory-Data-Analysis)
5. [Predictive Modeling](#Predictive-Modeling)
6. [Conclusion and Future Work](#Conclusion-and-Future-Work)

# Introduction

There is an intriguing relationship between lifestyle factors and heart disease, and this dataset gives us the opportunity to explore several facets of cardiovascular health. In this notebook we load the data, clean it, visualize it, and build a model that predicts heart disease status. If you find any part of this notebook useful, feel free to upvote it.

In [1]:
# Import required libraries and suppress warnings
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
# Render figures inline in the notebook
%matplotlib inline

import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc
from sklearn.inspection import permutation_importance

# Set seaborn style for better visuals
sns.set(style='whitegrid')

print('All libraries imported successfully')

All libraries imported successfully


In [2]:
# Data Loading and Basic Exploration
# Display full column contents without truncation
pd.set_option('display.max_colwidth', None)
# Display all columns
pd.set_option('display.max_columns', None)
# Load the dataset; the path is relative to the notebook's directory
df = pd.read_csv('../data/dataset/heart_disease.csv', encoding='ascii', delimiter=',')

# Print the shape and a random sample of rows to get an initial glimpse of the data
print('Dataset shape:', df.shape)
print('Dataset preview:')
display(df.sample(20))

# Display information about data types and non-null counts
df.info()

Dataset shape: (10000, 21)
Dataset preview:


Unnamed: 0,Age,Gender,Blood Pressure,Cholesterol Level,Exercise Habits,Smoking,Family Heart Disease,Diabetes,BMI,High Blood Pressure,Low HDL Cholesterol,High LDL Cholesterol,Alcohol Consumption,Stress Level,Sleep Hours,Sugar Consumption,Triglyceride Level,Fasting Blood Sugar,CRP Level,Homocysteine Level,Heart Disease Status
6699,65.0,Male,160.0,250.0,Medium,No,Yes,Yes,24.406224,No,Yes,Yes,Low,Medium,7.028653,Medium,288.0,102.0,4.53019,9.852174,No
3276,35.0,Male,152.0,298.0,High,Yes,Yes,Yes,18.756222,Yes,No,No,,Low,4.214585,High,316.0,109.0,11.51513,16.909763,No
6918,29.0,Female,159.0,285.0,Medium,Yes,No,Yes,38.304633,Yes,No,No,Medium,High,5.528442,High,211.0,125.0,14.800753,17.296138,No
2124,76.0,Male,177.0,256.0,Low,Yes,Yes,No,29.57826,Yes,No,No,Medium,Low,5.989524,Low,182.0,160.0,0.830706,12.092233,No
2229,43.0,Female,146.0,151.0,Low,No,Yes,No,30.376095,No,No,Yes,Medium,Low,9.30094,Low,149.0,84.0,1.829368,9.711392,No
3003,31.0,Female,135.0,178.0,Medium,No,Yes,No,23.725,No,Yes,Yes,Low,Medium,4.02636,Medium,263.0,154.0,7.859166,6.309483,No
4650,79.0,Female,178.0,284.0,Medium,No,No,Yes,26.554921,Yes,Yes,No,High,Medium,8.847793,Medium,219.0,125.0,11.434804,18.638214,No
8590,40.0,Female,169.0,171.0,High,Yes,Yes,Yes,38.260155,No,No,Yes,,High,9.168673,Low,223.0,100.0,11.10962,15.309295,Yes
9547,78.0,Male,125.0,241.0,High,No,Yes,Yes,18.401492,Yes,No,Yes,,High,9.92139,Medium,153.0,140.0,6.602327,16.696622,Yes
5522,34.0,Female,151.0,160.0,High,Yes,No,No,33.060953,No,No,No,High,Low,5.125101,Low,139.0,158.0,8.377061,8.18016,No


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Age                   9971 non-null   float64
 1   Gender                9981 non-null   object 
 2   Blood Pressure        9981 non-null   float64
 3   Cholesterol Level     9970 non-null   float64
 4   Exercise Habits       9975 non-null   object 
 5   Smoking               9975 non-null   object 
 6   Family Heart Disease  9979 non-null   object 
 7   Diabetes              9970 non-null   object 
 8   BMI                   9978 non-null   float64
 9   High Blood Pressure   9974 non-null   object 
 10  Low HDL Cholesterol   9975 non-null   object 
 11  High LDL Cholesterol  9974 non-null   object 
 12  Alcohol Consumption   7414 non-null   object 
 13  Stress Level          9978 non-null   object 
 14  Sleep Hours           9975 non-null   float64
 15  Sugar Consumption   

In [3]:
# Data Cleaning and Preprocessing

# Check for missing values in the dataset
missing_values = df.isnull().sum()
print('Missing values per column:')
print(missing_values)

# If missing values exist, we have options to impute, drop, or otherwise handle them
# In our case, RandomForestClassifier does not natively accept NaNs so we need to address them.
# Here we choose to drop rows with any missing values. Alternatively, one could use an imputer.
df = df.dropna()

# Confirm that there are no missing values after cleanup
print('Missing values after cleaning:')
print(df.isnull().sum())

# Examine data types for potential conversion (if any dates were present, here we would infer and convert them)
print('Data types:')
print(df.dtypes)

Missing values per column:
Age                       29
Gender                    19
Blood Pressure            19
Cholesterol Level         30
Exercise Habits           25
Smoking                   25
Family Heart Disease      21
Diabetes                  30
BMI                       22
High Blood Pressure       26
Low HDL Cholesterol       25
High LDL Cholesterol      26
Alcohol Consumption     2586
Stress Level              22
Sleep Hours               25
Sugar Consumption         30
Triglyceride Level        26
Fasting Blood Sugar       22
CRP Level                 26
Homocysteine Level        20
Heart Disease Status       0
dtype: int64
Missing values after cleaning:
Age                     0
Gender                  0
Blood Pressure          0
Cholesterol Level       0
Exercise Habits         0
Smoking                 0
Family Heart Disease    0
Diabetes                0
BMI                     0
High Blood Pressure     0
Low HDL Cholesterol     0
High LDL Cholesterol    0
Alcohol 
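The cleaning cell mentions imputation as an alternative to dropping rows. A minimal sketch of that option follows, assuming median imputation for numeric columns and most-frequent imputation for categorical ones. The toy frame and its values are made up for illustration; they are not rows from the dataset, and the notebook itself uses `dropna()`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the heart disease data (values are made up)
toy = pd.DataFrame({
    "Age": [65.0, np.nan, 29.0, 76.0],
    "BMI": [24.4, 18.8, np.nan, 29.6],
    "Alcohol Consumption": ["Low", np.nan, "Medium", "Medium"],
})

num_cols = toy.select_dtypes(include=[np.number]).columns
cat_cols = toy.select_dtypes(exclude=[np.number]).columns

imputed = toy.copy()
# Numeric columns: fill gaps with the column median
for col in num_cols:
    imputed[col] = toy[col].fillna(toy[col].median())
# Categorical columns: fill gaps with the most frequent level
for col in cat_cols:
    imputed[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(imputed)
print("Remaining missing values:", imputed.isnull().sum().sum())
```

Imputing keeps every row, which matters here because a single column (Alcohol Consumption) accounts for most of the missing values. To avoid leakage in a real workflow, the fill values would be computed on the training split only.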

In [4]:
"""# Exploratory Data Analysis (EDA)

# Let's get summary statistics for numeric features
print('Summary statistics for numeric columns:')
display(df.describe())

# Create a correlation heatmap of numeric variables if there are four or more numeric columns
numeric_df = df.select_dtypes(include=[np.number])
if numeric_df.shape[1] >= 4:
    plt.figure(figsize=(10, 8))
    sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
    plt.title('Correlation Heatmap of Numeric Features')
    plt.show()

# Create pair plot for numeric variables to examine distributions and relationships
sns.pairplot(numeric_df)
plt.suptitle('Pair Plot for Numeric Features', y=1.02)
plt.show()

# Plot histograms for select numeric columns
numeric_columns = numeric_df.columns
for col in numeric_columns:
    plt.figure(figsize=(6, 4))
    sns.histplot(df[col], kde=True)
    plt.title(f'Histogram of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()

# For categorical variables, display count plots
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    plt.figure(figsize=(6, 4))
    sns.countplot(data=df, x=col, palette='viridis')
    plt.title(f'Count Plot of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    plt.show()"""

In [5]:
# Drop rows with missing target values if any exist (should be none after cleaning)
df_model = df.dropna(subset=['Heart Disease Status']).copy()

# Separate features and target
X = df_model.drop('Heart Disease Status', axis=1)
y = df_model['Heart Disease Status']

# Check for any remaining missing values in features
if X.isnull().sum().sum() > 0:
    print('Warning: Found missing values in features. Consider imputing or further cleaning.')

# Identify categorical columns that need encoding
categorical_cols = X.select_dtypes(include=['object']).columns
print('Categorical columns to be encoded:', list(categorical_cols))

# Apply one-hot encoding to categorical features
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Categorical columns to be encoded: ['Gender', 'Exercise Habits', 'Smoking', 'Family Heart Disease', 'Diabetes', 'High Blood Pressure', 'Low HDL Cholesterol', 'High LDL Cholesterol', 'Alcohol Consumption', 'Stress Level', 'Sugar Consumption']


In [6]:

# Initialize and train the RandomForestClassifier
model = RandomForestClassifier(
    n_estimators=100,          # try 100–300 trees
    max_depth=5,               # limit tree depth
    min_samples_split=10,      # increase minimum samples required to split
    min_samples_leaf=5,        # increase minimum samples in leaf nodes
    max_features='sqrt',       # limit number of features considered at each split
    random_state=42,
    class_weight='balanced'
)
model.fit(X_train, y_train)
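The import cell brings in `accuracy_score`, `confusion_matrix`, `roc_curve`, and `auc`, but the fitted model is never scored on the test split. The sketch below shows one way those metrics could be applied. It runs on synthetic data from `make_classification` so that it is self-contained; the names `X_demo`, `y_demo`, and `clf` are illustrative and are not variables from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 80% negative, 20% positive)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = RandomForestClassifier(
    n_estimators=100, max_depth=5, class_weight="balanced", random_state=42
)
clf.fit(X_tr, y_tr)

y_pred = clf.predict(X_te)
y_score = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

acc = accuracy_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred)  # rows: true class, columns: predicted class
fpr, tpr, _ = roc_curve(y_te, y_score)
roc_auc = auc(fpr, tpr)

print(f"Accuracy: {acc:.3f}")
print("Confusion matrix:")
print(cm)
print(f"ROC AUC: {roc_auc:.3f}")
```

With the notebook's string labels, `roc_curve` would need `pos_label='Yes'`, and the score column would be the one matching `'Yes'` in `model.classes_`. Because the classes are imbalanced, the confusion matrix and ROC AUC are likely more informative than accuracy alone.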

In [7]:
import joblib
# Save model and columns
joblib.dump(model, "../Models/heart_disease_model.pkl")
joblib.dump(X.columns.tolist(), "../Models/heart_model_columns.pkl")

['../Models/heart_model_columns.pkl']
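Saving the column list alongside the model is what lets a separate script rebuild the feature layout later. A minimal round-trip sketch follows, assuming a tiny stand-in model and a temporary directory instead of the `../Models/` paths used above; the feature names and values are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in model; feature names and values are hypothetical
columns = ["Age", "BMI"]
X_small = np.array([[65.0, 24.4], [35.0, 18.8], [29.0, 38.3], [76.0, 29.6]])
y_small = np.array(["No", "No", "Yes", "Yes"])
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pkl")
    columns_path = os.path.join(tmp, "columns.pkl")
    joblib.dump(clf, model_path)
    joblib.dump(columns, columns_path)

    # Reload both artifacts, as a separate scoring script would
    loaded_model = joblib.load(model_path)
    loaded_columns = joblib.load(columns_path)

# The reloaded model should reproduce the original predictions
same = bool((loaded_model.predict(X_small) == clf.predict(X_small)).all())
print(loaded_columns, same)
```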

In [8]:
df["Heart Disease Status"].value_counts()

Heart Disease Status
No     5632
Yes    1435
Name: count, dtype: int64
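The counts above show roughly four "No" rows for every "Yes" row. With an imbalance like this, a stratified split keeps the class ratio the same in the training and test sets. The sketch below illustrates this with `train_test_split(..., stratify=...)`, using toy labels that reproduce the class counts above and a placeholder feature column (`row_id` is hypothetical). The notebook's own split does not pass `stratify`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels with the same class counts as the value_counts() output above
y_toy = pd.Series(["No"] * 5632 + ["Yes"] * 1435, name="Heart Disease Status")
X_toy = pd.DataFrame({"row_id": range(len(y_toy))})  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Class proportions should be nearly identical in both splits
print(y_tr.value_counts(normalize=True).round(3))
print(y_te.value_counts(normalize=True).round(3))
```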

In [9]:
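# Select the row to predict (index 8445 is part of the training split, so this is a sanity check rather than an out-of-sample test)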
row = df.loc[8445].drop("Heart Disease Status")

# Convert to DataFrame for encoding
row_df = pd.DataFrame([row])

# Apply one-hot encoding without drop_first: a single row has only one level per
# column, so drop_first=True would drop every dummy column for this row
row_encoded = pd.get_dummies(row_df, columns=categorical_cols)

# Reindex to match the columns of X (fill missing columns with 0)
row_encoded = row_encoded.reindex(columns=X.columns, fill_value=0)

# Predict using the trained model
prediction = model.predict(row_encoded)
print("Predicted Heart Disease Status:", prediction[0])

Predicted Heart Disease Status: Yes
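When a single row is one-hot encoded for prediction, `drop_first=True` is a pitfall worth knowing about. The row contains exactly one level per categorical column, so that level is treated as the "first" one and dropped, and reindexing then fills the column with 0. The sketch below demonstrates the effect on a hypothetical two-column frame; the column names echo the dataset but the values are made up.

```python
import pandas as pd

# Hypothetical training frame with one categorical feature
train = pd.DataFrame({
    "Age": [65.0, 35.0, 29.0],
    "Gender": ["Male", "Female", "Male"],
})
train_encoded = pd.get_dummies(train, columns=["Gender"], drop_first=True)
train_columns = train_encoded.columns  # Age, Gender_Male

# A single new row to score
row_df = pd.DataFrame([{"Age": 50.0, "Gender": "Male"}])

# With drop_first=True the only level present ('Male') is dropped,
# so reindexing fills Gender_Male with 0 and the row looks like 'Female'.
lossy = pd.get_dummies(row_df, columns=["Gender"], drop_first=True)
lossy = lossy.reindex(columns=train_columns, fill_value=0)

# Encoding every level first, then reindexing, keeps Gender_Male set.
safe = pd.get_dummies(row_df, columns=["Gender"])
safe = safe.reindex(columns=train_columns, fill_value=0)

print("drop_first=True :", int(lossy.loc[0, "Gender_Male"]))
print("all levels kept :", int(safe.loc[0, "Gender_Male"]))
```

Reindexing against the training columns also discards any baseline-level dummy that training dropped, so encoding all levels for the single row is safe.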


In [10]:
y_train

1020     No
7870     No
3021     No
794      No
8445    Yes
       ... 
5360     No
7375     No
7425     No
7649     No
1223     No
Name: Heart Disease Status, Length: 5653, dtype: object


# Conclusion and Future Work

In this notebook we loaded and cleaned a heart disease dataset, outlined an exploratory analysis with several visualization techniques, and built a random forest model to classify heart disease status. We handled missing values by dropping every row that contains a NaN, which avoids errors when fitting the RandomForestClassifier. This reduced the dataset from 10,000 to 7,067 rows, with most of the loss coming from the 2,586 missing Alcohol Consumption values. The held-out test split was not scored here, so evaluating the model is a natural next step. Future work could also try imputation instead of dropping rows, look more deeply at feature interactions, or use an alternative model such as HistGradientBoostingClassifier, which handles missing values natively.

The approach shown here is suited to exploratory data analysis and rapid prototyping. If you found this notebook engaging or useful, please consider upvoting.