# Heart Disease Predictive Model

**This notebook contains a model made by Myles Ezeanii.**

I used the UCI Heart Disease dataset (cited below) to build a logistic regression model that predicts the presence of heart disease from features such as:

---
* **Age**
* **Sex** (Male/Female)
* **Resting Blood Pressure** (mm Hg)
* **Serum Cholesterol** (mg/dl)
* **Fasting Blood Sugar**
  * True: Over 120 mg/dl
  * False: 120 mg/dl or under
* **Resting Electrocardiographic Results**
  * 0: Represents "normal" electrocardiographic results
  * 1: Represents an ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV).
  * 2: Represents "probable or definite left ventricular hypertrophy" by Estes' criteria.
* **Maximum Heart Rate** (measured during exercise)
* **Exercise-induced Angina** (Yes/No)
* **Oldpeak**
  * ST depression induced by exercise relative to rest.
* **Slope of peak exercise ST segment**
* **Number of major vessels (0-3) colored by fluoroscopy**
* **Thallium stress test result**
  * Normal, Fixed Defect, Reversible Defect
---

The "goal" field (the `num` column in the data) indicates the presence of heart disease in the patient.

It is an integer from 0 to 4: 0 means no presence, and 1 through 4 indicate varying levels of presence.
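
The model in this notebook is trained on the five-level target as-is. A common alternative, shown here only as a sketch, is to collapse levels 1 through 4 into a single "disease present" class. The `y_demo` frame and its values are illustrative stand-ins, not rows taken from the dataset.

```python
import pandas as pd

# Illustrative target values only; the real labels come from heartDisease.data.targets
y_demo = pd.DataFrame({"num": [0, 2, 1, 0, 3]})

# Collapse 1-4 (any level of disease) into 1, keeping 0 as "no disease"
y_binary = (y_demo["num"] > 0).astype(int)

print(y_binary.tolist())  # [0, 1, 1, 0, 1]
```

Whether a binary target suits the diagnostic goal better than the five-level one is a modelling decision this sketch does not settle.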

---
To find the best-performing model parameters, I used a grid search over candidate hyperparameters.

My overall goal is for the model to serve as a diagnostic tool that assists healthcare professionals in determining whether individuals are at risk of heart disease.

---
Janosi, A., Steinbrunn, W., Pfisterer, M., & Detrano, R. (1989). Heart Disease [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C52P4X.

In [None]:
!pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [None]:
from ucimlrepo import fetch_ucirepo
import pandas as pd

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Fetch the dataset from the UCI Machine Learning Repository
heartDisease = fetch_ucirepo(id=45)

# Get the data as pandas dataframes.
X = heartDisease.data.features
y = heartDisease.data.targets # 0 represents No Heart Disease, 1-4 represents heart disease to varying degrees

print(y)

     num
0      0
1      2
2      1
3      0
4      0
..   ...
298    1
299    2
300    3
301    1
302    0

[303 rows x 1 columns]


In [None]:
# Split data into training and testing sets
XTrained, XTested, yTrained, yTested = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Create an imputer to replace missing values with the mean
imputer = SimpleImputer(strategy='mean')

In [None]:
# Fit the imputer on the training data and transform both training and testing data
XTrainedImputed = imputer.fit_transform(XTrained)
XTestedImputed = imputer.transform(XTested)  # Use the imputer fitted on XTrained to avoid data leakage

In [None]:
# Our current Parameter Grid
param_grid = {
    'penalty': ['l1', 'l2', 'elasticnet', None],
    'multi_class': ['auto', 'ovr', 'multinomial'],
    'C': [0.1, 1, 10, 100],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'warm_start': [True, False],
    'l1_ratio': [None, 0.1, 0.5, 0.9],
}
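
As a quick check on the size of this search, the number of candidate combinations can be counted with scikit-learn's `ParameterGrid`. The sketch below copies the grid above into `param_grid_demo` (a name introduced here for illustration) and multiplies by the 5 folds used later in the notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, repeated so the snippet runs on its own
param_grid_demo = {
    'penalty': ['l1', 'l2', 'elasticnet', None],
    'multi_class': ['auto', 'ovr', 'multinomial'],
    'C': [0.1, 1, 10, 100],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'warm_start': [True, False],
    'l1_ratio': [None, 0.1, 0.5, 0.9],
}

# 4 * 3 * 4 * 5 * 2 * 4 combinations, each fitted once per cross-validation fold
n_candidates = len(ParameterGrid(param_grid_demo))
n_fits = n_candidates * 5

print(n_candidates, n_fits)  # 1920 9600
```

Not every combination is valid: each solver supports only some penalties, so a full Cartesian grid like this one includes settings that cannot be fitted.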

In [None]:
# Initialize and train the Logistic Regression model
model = LogisticRegression(l1_ratio=None, max_iter=1000, multi_class='ovr', warm_start=True)
model.fit(XTrainedImputed, yTrained.values.ravel())

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
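
The warning above suggests scaling the data. One way to do that, sketched here under assumptions, is to chain the imputer, a `StandardScaler`, and the classifier in a single pipeline so that every step is fitted on the training split only. The data below is synthetic (`make_classification`), and the names `X_demo`, `y_demo`, and `scaled_model` are illustrative; this is not the notebook's actual run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 13 features, as in the heart disease dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)
X_demo = X_demo * np.linspace(1, 200, 13)  # put the columns on very different scales
X_demo[::25, 3] = np.nan                   # inject a few missing values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Impute, then scale, then classify; all three steps are fitted on X_tr only
scaled_model = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
scaled_model.fit(X_tr, y_tr)

print(f"Accuracy: {scaled_model.score(X_te, y_te)}")
```

Standardized features usually let the default `lbfgs` solver converge in far fewer iterations, although the effect on this dataset has not been measured here.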


In [None]:
# Make predictions on the test set
yPrediction = model.predict(XTestedImputed)

In [None]:
# Calculate accuracy
accuracy = accuracy_score(yTested, yPrediction)
print(f"Accuracy: {accuracy}")

Accuracy: 0.5409836065573771
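
An accuracy figure on a five-class, imbalanced target is easier to interpret next to a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on small made-up label arrays (`y_train_demo`, `y_test_demo`); the labels are illustrative and the baseline for the real data has not been computed here.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Illustrative labels only (not the notebook's data): class 0 is the most common
y_train_demo = np.array([0, 0, 0, 1, 2, 0, 3, 1, 0, 4])
y_test_demo = np.array([0, 1, 0, 2, 0])

# The "most_frequent" strategy ignores the features, so zeros suffice
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# Always predict the most common training label
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)

baseline_accuracy = accuracy_score(y_test_demo, baseline.predict(X_test_demo))
print(f"Baseline accuracy: {baseline_accuracy}")  # Baseline accuracy: 0.6
```

Running the same baseline on `yTrained` and `yTested` would show how much of the reported accuracy comes from simply predicting "no disease".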


In [None]:
from sklearn.model_selection import GridSearchCV

# Use a cross-validated grid search to determine which parameters give the most accurate model

gridSearch = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=5, n_jobs=-1)

# Fit the GridSearchCV object to the training data
gridSearch.fit(XTrainedImputed, yTrained.values.ravel())

# Best Hyperparameters & Best Score
best_params = gridSearch.best_params_
best_score = gridSearch.best_score_
print(f"Best Hyperparameters: {best_params}")
print(f"Best Accuracy: {best_score}")

# Train a new model with the best hyperparameters
# (max_iter is not part of best_params, so it reverts to its default here)
best_model = LogisticRegression(**best_params)
best_model.fit(XTrainedImputed, yTrained.values.ravel())

# Make predictions on the test set using the best model
yPrediction = best_model.predict(XTestedImputed)

# Calculate accuracy
accuracy = accuracy_score(yTested, yPrediction)
print(f"Accuracy with Best Model: {accuracy}")

4280 fits failed out of a total of 9600.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
480 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py", line 1194, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py", line 67, in _check_solver

Best Hyperparameters: {'C': 1, 'l1_ratio': None, 'multi_class': 'auto', 'penalty': 'l1', 'solver': 'liblinear', 'warm_start': True}
Best Accuracy: 0.6281462585034013
Accuracy with Best Model: 0.5737704918032787
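
The failed fits reported above come from solver and penalty combinations that scikit-learn rejects. A possible way to avoid them, sketched here on synthetic data, is to pass `GridSearchCV` a list of smaller grids that each pair a solver with penalties it supports, and to reuse `best_estimator_`, which keeps settings such as `max_iter` from the base estimator. The names `valid_grid`, `X_demo`, and `y_demo`, along with the specific values chosen, are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the heart disease dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

# Each sub-grid only lists penalties that its solver supports
valid_grid = [
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": [0.1, 1, 10]},
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [0.1, 1, 10]},
    {
        "solver": ["saga"],
        "penalty": ["elasticnet"],
        "l1_ratio": [0.1, 0.5, 0.9],
        "C": [0.1, 1, 10],
    },
]

search = GridSearchCV(
    LogisticRegression(max_iter=5000),
    valid_grid,
    scoring="accuracy",
    cv=5,
    error_score="raise",  # surface any invalid combination instead of scoring it as nan
)
search.fit(X_demo, y_demo)

# Already refitted on all of the training data, with max_iter=5000 preserved
best_model = search.best_estimator_
print(search.best_params_)
print(f"Best CV accuracy: {search.best_score_}")
```

This grid has 3 + 6 + 9 = 18 candidates, so a 5-fold search runs 90 fits, none of which should fail.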


