<a href="https://colab.research.google.com/github/parvichakravarti/CAREPOINT-Hospital-Dashboard-PowerBI/blob/main/Heart~Disease~Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Heart Disease Prediction Model
An AI model that predicts heart disease risk from user health data.

- [GitHub](https://github.com/parvichakravarti)


## 📊 Dataset Summary

- **Source**: [Centers for Disease Control and Prevention (CDC)](https://www.cdc.gov/)
- **Dataset**: [Annual 2020 Behavioral Risk Factor Surveillance System (BRFSS)](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease)
- **Samples**: BRFSS surveys 400,000+ adults in the United States; the cleaned CSV used in this notebook contains 319,795 rows
- **Features include**:
  - Demographics (age, sex, race)
  - Lifestyle choices (smoking, alcohol use, physical activity)
  - Physical and mental health status
  - Medical conditions (diabetes, high blood pressure, etc.)

## Sample Output

**Person Data:**  
![Person Data](https://github.com/marianoluiz/heart-disease-predictor/blob/main/img/predictX.png?raw=1)

**Diagnosis:**  
![Diagnosis](https://github.com/marianoluiz/heart-disease-predictor/blob/main/img/result.png?raw=1)

### Install dependencies

In [None]:
# Install dependencies
!pip install pandas==2.3.0 scikit-learn==1.7.0 tensorflow[and-cuda]==2.19.0 kaggle==1.7.4.5



In [None]:
# Download the dataset with the Kaggle API, or manually from the link in the dataset summary.
!kaggle datasets download -d kamilpytlak/personal-key-indicators-of-heart-disease

# Read Documentation: https://www.kaggle.com/docs/api#getting-started-installation-&-authentication

Dataset URL: https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease
License(s): CC0-1.0
Downloading personal-key-indicators-of-heart-disease.zip to /mnt/drive_d/projs/Heart Disease Predictor
  0%|                                               | 0.00/21.4M [00:00<?, ?B/s]
100%|███████████████████████████████████████| 21.4M/21.4M [00:00<00:00, 334MB/s]
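The download leaves a zip archive in the working directory, while the loading cell further down reads `./heartdisease2020_dataset_by_kamil/heart_2020_cleaned.csv`. The notebook does not show the extraction step, so the following is only a minimal sketch using Python's standard `zipfile` module. The helper name `extract_dataset` is hypothetical, the archive name is taken from the download log, and the target directory is an assumption based on the path in the loading cell.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, target_dir):
    """Extract every file in zip_path into target_dir and return the archived names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return archive.namelist()


# Archive name from the download log; target directory assumed from the loading cell.
ZIP_PATH = Path("personal-key-indicators-of-heart-disease.zip")
TARGET_DIR = Path("heartdisease2020_dataset_by_kamil")

# Only extract when the archive is actually present in the working directory.
if ZIP_PATH.exists():
    print(extract_dataset(ZIP_PATH, TARGET_DIR))
```

If the archive was downloaded manually to another location, pass that path to `extract_dataset` instead.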


In [None]:
# Import dependencies
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split


### Process the dataset

In [None]:
# Load the dataset

# Change .csv name and directory as needed
dataset = pd.read_csv('./heartdisease2020_dataset_by_kamil/heart_2020_cleaned.csv')
dataset.rename(columns={
    'HeartDisease': 'heart_disease',
    'BMI': 'bmi',
    'Smoking': 'smoking',
    'AlcoholDrinking': 'alcohol_drinking',
    'Stroke': 'stroke',
    'PhysicalHealth': 'physical_health',
    'MentalHealth': 'mental_health',
    'DiffWalking': 'diff_walking',
    'Sex': 'sex',
    'AgeCategory': 'age_category',
    'Race': 'race',
    'Diabetic': 'diabetic',
    'PhysicalActivity': 'physical_activity',
    'GenHealth': 'gen_health',
    'SleepTime': 'sleep_time',
    'Asthma': 'asthma',
    'KidneyDisease': 'kidney_disease',
    'SkinCancer': 'skin_cancer'
}, inplace=True)

# Map ordinal categories to integers
age_map = {
    '18-24': 0,
    '25-29': 1,
    '30-34': 2,
    '35-39': 3,
    '40-44': 4,
    '45-49': 5,
    '50-54': 6,
    '55-59': 7,
    '60-64': 8,
    '65-69': 9,
    '70-74': 10,
    '75-79': 11,
    '80 or older':  12,
}

gen_health_map = {
    'Poor': 0,
    'Fair': 1,
    'Good': 2,
    'Very good': 3,
    'Excellent': 4
}

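# Assign the mapped columns back; replace(..., inplace=True) on a column selection stops working in pandas 3.0
dataset['age_category'] = dataset['age_category'].map(age_map)
dataset['gen_health'] = dataset['gen_health'].map(gen_health_map)
dataset['heart_disease'] = dataset['heart_disease'].map({'Yes': 1, 'No': 0})

# Split into features (x) and target (y)
x = dataset.drop(columns=['heart_disease'])
y = dataset['heart_disease']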

x = pd.get_dummies(x, dtype=int)

# Use scikit-learn to split data for training and testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)


In [None]:
# See dataset modifications

print('\nx data:\n\n', x)
print('\ny data:\n\n', y)


x data:

           bmi  physical_health  mental_health  age_category  gen_health  \
0       16.60              3.0           30.0             7           3   
1       20.34              0.0            0.0            12           3   
2       26.58             20.0           30.0             9           1   
3       24.21              0.0            0.0            11           2   
4       23.71             28.0            0.0             4           3   
...       ...              ...            ...           ...         ...   
319790  27.41              7.0            0.0             8           1   
319791  29.84              0.0            0.0             3           3   
319792  24.24              0.0            0.0             5           2   
319793  32.81              0.0            0.0             1           2   
319794  46.56              0.0            0.0            12           2   

        sleep_time  smoking_No  smoking_Yes  alcohol_drinking_No  \
0              5.0  

### Train the model

In [None]:
# Declare the model
model = tf.keras.Sequential([
    tf.keras.Input(shape=(34,)),                                        # 34 features after one-hot encoding
    tf.keras.layers.Dense(64, activation='relu'),                       # Hidden Layer
    tf.keras.layers.Dense(32, activation='relu'),                       # Hidden Layer
    tf.keras.layers.Dense(1, activation='sigmoid')                      # Output for Binary Classification
])

# Configure the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

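# Stop early if val_loss has not improved for 3 consecutive epochs
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)

# Train on the training split only; fitting on the full x, y would leak the test rows into training
model.fit(x_train, y_train, epochs=8, validation_data=(x_test, y_test), callbacks=[early_stop])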

Epoch 1/8


9994/9994 ━━━━━━━━━━━━━━━━━━━━ 14s 1ms/step - accuracy: 0.9120 - loss: 0.2432 - val_accuracy: 0.9136 - val_loss: 0.2320
Epoch 2/8
9994/9994 ━━━━━━━━━━━━━━━━━━━━ 10s 977us/step - accuracy: 0.9156 - loss: 0.2280 - val_accuracy: 0.9139 - val_loss: 0.2317
Epoch 3/8
9994/9994 ━━━━━━━━━━━━━━━━━━━━ 10s 1ms/step - accuracy: 0.9160 - loss: 0.2269 - val_accuracy: 0.9141 - val_loss: 0.2304
Epoch 4/8
9994/9994 ━━━━━━━━━━━━━━━━━━━━ 10s 1ms/step - accuracy: 0.9157 - loss: 0.2274 - val_accuracy: 0.9141 - val_loss: 0.2339
Epoch 5/8
9994/9994 ━━━━━━━━━━━━━━━━━━━━ 10s 1ms/step - accuracy: 0.9162 - loss: 0.2251 - val_accuracy: 0.9142 - val_loss: 0.2302
Epoch 6/8
9994/9994 ━━━━━━━━━━━━━━━━━━━━ 10s 1ms/step - accuracy: 0.9161 - loss: 0.2264 - val_accuracy: 0.9139 - val_loss: 0.2295
Epoch 7/8
9994/9994

<keras.src.callbacks.history.History at 0x7d1bd5946450>
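The `EarlyStopping` callback above is configured with `patience=3`, meaning training halts once `val_loss` has gone three epochs in a row without beating its best value. The following is a simplified pure-Python sketch of that patience rule, not Keras's actual implementation; the function name `early_stop_epoch` is hypothetical, and the loss values are copied from the six complete epochs in the training log.

```python
def early_stop_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified patience rule: stop once the monitored loss has failed to
    improve on its best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best value: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# val_loss values from the six complete epochs in the training log
logged = [0.2320, 0.2317, 0.2304, 0.2339, 0.2302, 0.2295]
print(early_stop_epoch(logged, patience=3))  # None: never 3 non-improving epochs in a row
print(early_stop_epoch(logged, patience=1))  # 4: epoch 4 is the first without improvement
```

With these logged values the loss keeps reaching new minima often enough that a patience of 3 is never exhausted within the first six epochs.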

### Evaluate the Model

In [None]:
# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test)
print('The accuracy of the model is :', accuracy)

1999/1999 ━━━━━━━━━━━━━━━━━━━━ 2s 825us/step - accuracy: 0.9112 - loss: 0.2351
The accuracy of the model is : 0.9135071039199829
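Accuracy on its own can hide poor detection of the positive class: when one label dominates, a model that always predicts the majority class still scores high. As a hedged sketch, precision and recall can be computed next to accuracy from the true labels and predicted probabilities. The helper `binary_metrics` and the toy data below are hypothetical and made up for illustration; in this notebook the inputs would presumably be `y_test` and the flattened output of `model.predict(x_test)`.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Compute accuracy, precision and recall for binary labels and probabilities."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob > threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy data (made up): 1 positive among 10 rows, and a model that never predicts it
toy_true = [0] * 9 + [1]
toy_prob = [0.1] * 10
print(binary_metrics(toy_true, toy_prob))
# {'accuracy': 0.9, 'precision': 0.0, 'recall': 0.0}
```

The toy model reaches 90% accuracy while finding none of the positive cases, which is why recall is worth checking alongside the accuracy printed above.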


### Export the model

In [None]:
# Save the model
model.save('heartrisk_detector_model.keras')

### Try the Model

In [None]:
# Define people to diagnose
data = [
    # Underweight White female aged 40-44 with high mental distress, doesn’t smoke or drink, 6 hours of sleep, very good general health.
    [17.8, 2.0, 25.0, 4, 3, 6.0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0],
    # Healthy Asian male with no reported health issues, sleeps well, no smoking/drinking, excellent general health.
    [22.5, 0.0, 0.0, 8, 4, 9.0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    # Obese Black male smoker with stroke history and mobility issues, fair general health and low sleep.
    [35.6, 12.0, 10.0, 9, 1, 4.0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0],
    # Overweight Hispanic female aged 80 or older, no health issues, doesn’t smoke or drink, good general health.
    [26.2, 0.0, 0.0, 12, 2, 7.0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0],
    # Obese male drinker, severe physical/mental distress, poor general health, limited mobility.
    [30.1, 30.0, 20.0, 11, 0, 5.0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0],
    # Normal-weight Hispanic female, light physical/mental health issues, sleeps well, nonsmoker.
    [19.4, 1.0, 2.0, 5, 3, 10.0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0],
    # Severely obese Black male smoker, poor physical and mental health, limited mobility, fair general health.
    [41.2, 6.0, 15.0, 10, 1, 6.0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0],
    # Overweight female from Other race group, no issues, good sleep and health, nonsmoker.
    [28.0, 0.0, 0.0, 6, 2, 8.0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0],
    # Normal-weight Native American female with asthma, mild physical issues, very good general health, sleeps okay.
    [24.3, 3.0, 0.0, 7, 3, 7.0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0],
    # Obese Hispanic male drinker, major health issues and stroke, limited mobility, poor health.
    [38.7, 20.0, 30.0, 9, 0, 4.0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1],
]

# Define the column names
columns = [
    "bmi", "physical_health", "mental_health", "age_category", "gen_health", "sleep_time",
    "smoking_No", "smoking_Yes", "alcohol_drinking_No", "alcohol_drinking_Yes",
    "stroke_No", "stroke_Yes", "diff_walking_No", "diff_walking_Yes",
    "sex_Female", "sex_Male",
    "race_American Indian/Alaskan Native", "race_Asian", "race_Black", "race_Hispanic",
    "race_Other", "race_White", "diabetic_No", "diabetic_No, borderline diabetes", "diabetic_Yes",
    "diabetic_Yes (during pregnancy)", "physical_activity_No", "physical_activity_Yes",
    "asthma_No", "asthma_Yes", "kidney_disease_No", "kidney_disease_Yes",
    "skin_cancer_No", "skin_cancer_Yes"
]

# Create the DataFrame
predictX = pd.DataFrame(data, columns=columns)

# Predict
prediction = model.predict(predictX)

for index, result in enumerate(prediction):
    prob = result[0]

    if prob > 0.8:
        risk = "Very High Risk"
    elif prob > 0.6:
        risk = "High Risk"
    elif prob > 0.4:
        risk = "Moderate Risk"
    elif prob > 0.2:
        risk = "Low Risk"
    else:
        risk = "Very Low Risk"

    print(f"\nPerson {index + 1}: {risk} of Heart Disease | Predicted Probability: {prob:.2f}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step

Person 1: Very Low Risk of Heart Disease | Predicted Probability: 0.01

Person 2: Very Low Risk of Heart Disease | Predicted Probability: 0.01

Person 3: High Risk of Heart Disease | Predicted Probability: 0.64

Person 4: Very Low Risk of Heart Disease | Predicted Probability: 0.10

Person 5: Moderate Risk of Heart Disease | Predicted Probability: 0.44

Person 6: Very Low Risk of Heart Disease | Predicted Probability: 0.01

Person 7: Moderate Risk of Heart Disease | Predicted Probability: 0.50

Person 8: Very Low Risk of Heart Disease | Predicted Probability: 0.02

Person 9: Very Low Risk of Heart Disease | Predicted Probability: 0.03

Person 10: High Risk of Heart Disease | Predicted Probability: 0.69
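Typing 34-element vectors by hand makes it easy to set the wrong one-hot position. One possible alternative, sketched here under assumptions, is to describe each person as a dictionary of raw answers and derive the row from the column list. The names `COLUMNS`, `encode_person`, `risk_label`, and `example_person` are hypothetical; the column order and the risk thresholds are copied from the cell above.

```python
COLUMNS = [
    "bmi", "physical_health", "mental_health", "age_category", "gen_health", "sleep_time",
    "smoking_No", "smoking_Yes", "alcohol_drinking_No", "alcohol_drinking_Yes",
    "stroke_No", "stroke_Yes", "diff_walking_No", "diff_walking_Yes",
    "sex_Female", "sex_Male",
    "race_American Indian/Alaskan Native", "race_Asian", "race_Black", "race_Hispanic",
    "race_Other", "race_White", "diabetic_No", "diabetic_No, borderline diabetes", "diabetic_Yes",
    "diabetic_Yes (during pregnancy)", "physical_activity_No", "physical_activity_Yes",
    "asthma_No", "asthma_Yes", "kidney_disease_No", "kidney_disease_Yes",
    "skin_cancer_No", "skin_cancer_Yes",
]
NUMERIC = set(COLUMNS[:6])  # the first six columns hold plain numbers


def encode_person(person):
    """Turn a dict of raw answers into a feature row ordered like COLUMNS."""
    row = []
    for column in COLUMNS:
        if column in NUMERIC:
            row.append(person[column])
        else:
            # "race_Asian" -> field "race", category "Asian"
            field, _, category = column.rpartition("_")
            row.append(1 if person[field] == category else 0)
    return row


def risk_label(prob):
    """Map a predicted probability to the risk bands used in the prediction loop."""
    if prob > 0.8:
        return "Very High Risk"
    if prob > 0.6:
        return "High Risk"
    if prob > 0.4:
        return "Moderate Risk"
    if prob > 0.2:
        return "Low Risk"
    return "Very Low Risk"


# Person 2 from the list above, written as raw answers
example_person = {
    "bmi": 22.5, "physical_health": 0.0, "mental_health": 0.0,
    "age_category": 8, "gen_health": 4, "sleep_time": 9.0,
    "smoking": "No", "alcohol_drinking": "No", "stroke": "No", "diff_walking": "No",
    "sex": "Male", "race": "Asian", "diabetic": "No", "physical_activity": "No",
    "asthma": "No", "kidney_disease": "No", "skin_cancer": "No",
}
print(encode_person(example_person))
print(risk_label(0.64))  # High Risk
```

Rows built this way can be passed to `pd.DataFrame(rows, columns=COLUMNS)` in place of the hand-written `data` list.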
