##Introduction

This notebook creates and fits models that predict whether individuals with certain risk factors will experience heart disease or a heart attack. Medical professionals could use such a model to intervene before high-risk individuals actually experience these cardiac events. A proactive individual with relatively low risk could also use it to help minimize their risk of cardiac events.

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


##Import Libraries

First, we will import the Python libraries we will use to create and evaluate our models.

In [46]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier


##Load Data

Next, we will load our data.

In [11]:
df = pd.read_csv('/content/drive/MyDrive/Coding Dojo/Raw Data/heart_disease_health_indicators_BRFSS2015.csv')
df.head()

Unnamed: 0,HeartDiseaseorAttack,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,Diabetes,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253680 entries, 0 to 253679
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   HeartDiseaseorAttack  253680 non-null  float64
 1   HighBP                253680 non-null  float64
 2   HighChol              253680 non-null  float64
 3   CholCheck             253680 non-null  float64
 4   BMI                   253680 non-null  float64
 5   Smoker                253680 non-null  float64
 6   Stroke                253680 non-null  float64
 7   Diabetes              253680 non-null  float64
 8   PhysActivity          253680 non-null  float64
 9   Fruits                253680 non-null  float64
 10  Veggies               253680 non-null  float64
 11  HvyAlcoholConsump     253680 non-null  float64
 12  AnyHealthcare         253680 non-null  float64
 13  NoDocbcCost           253680 non-null  float64
 14  GenHlth               253680 non-null  float64
 15  

In [13]:
df.shape

(253680, 22)

##Data Cleaning

We saw in our EDA that the dataset has no null values and every point is properly formatted.

We did find duplicated rows, so we will drop them.

In [6]:
# confirm duplicated values
df.duplicated().any()

True

In [14]:
# drop duplicated values
df.drop_duplicates(keep = 'first', inplace = True)

In [15]:
# confirm duplicates are dropped
df.duplicated().any()

False
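As a small illustration of the cleaning step above, the sketch below counts duplicated rows rather than only checking whether any exist, and compares row counts before and after dropping them. The `toy` frame and its values are made up for the example and are not taken from the survey data.

```python
import pandas as pd

# Toy frame standing in for the survey data (values are illustrative only)
toy = pd.DataFrame({'HighBP': [1.0, 0.0, 1.0, 1.0],
                    'BMI':    [40.0, 25.0, 40.0, 28.0]})

# Count fully duplicated rows instead of only checking whether any exist
n_dupes = toy.duplicated().sum()
print('Duplicate rows:', n_dupes)

# Drop them without mutating the original frame, then compare row counts
deduped = toy.drop_duplicates(keep='first')
print('Rows before:', len(toy), '| rows after:', len(deduped))
```

Reporting the count makes it easier to judge how much data the cleaning step removes.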

##Pre-processing

We will begin our pre-processing by checking the balance of our target, then declaring our target vector and features matrix.  We will then perform a train-test split.

In [45]:
# check balance of target
df['HeartDiseaseorAttack'].value_counts(normalize = True)

0.0    0.896784
1.0    0.103216
Name: HeartDiseaseorAttack, dtype: float64

In [18]:
# declare features (X) and target (y)
y = df['HeartDiseaseorAttack']
X = df.drop(columns = 'HeartDiseaseorAttack')

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
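Because the positive class makes up only about 10% of the rows, one option worth considering is a stratified split, which keeps the class ratio the same in the training and test sets. The sketch below uses synthetic stand-in arrays (`X_demo`, `y_demo`), not the notebook's data, and is not the split used for the results reported here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 rows with a 10% positive class
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([1] * 100 + [0] * 900)

# stratify=y_demo keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42)

print('Train positive rate:', y_tr.mean())
print('Test positive rate: ', y_te.mean())
```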

Our target is very unbalanced: only about 10% of the rows belong to the positive class.  This can make modeling difficult.

Also, our dataset is large.  This can drastically slow model production.

We will deal with both of these issues by applying random under-sampling to the training data.

In [33]:
# import imbalanced-learn
from imblearn.under_sampling import RandomUnderSampler

# utilize RandomUnderSampler to build usable training data
rus = RandomUnderSampler(random_state = 42, replacement = True)
X_rus, y_rus = rus.fit_resample(X_train, y_train)

In [34]:
print('Original Dataset Shape:\n', y_train.value_counts())
print('\nResampled Dataset Shape:\n', y_rus.value_counts())

Original Dataset Shape:
 0.0    154552
1.0     17783
Name: HeartDiseaseorAttack, dtype: int64

Resampled Dataset Shape:
 1.0    17783
0.0    17783
Name: HeartDiseaseorAttack, dtype: int64


We see that our dataset for training has reduced from 172,335 rows to 35,566 rows. Also, our target is now balanced.
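To make the idea behind random under-sampling concrete, here is a minimal sketch written with pandas and NumPy only. The helper `random_under_sample` is hypothetical and is not the imbalanced-learn implementation; it also samples without replacement, whereas the cell above passes `replacement = True`.

```python
import numpy as np
import pandas as pd

def random_under_sample(X, y, seed=42):
    """Hypothetical helper: keep n_min randomly chosen rows per class."""
    rng = np.random.default_rng(seed)
    n_min = y.value_counts().min()          # size of the smallest class
    keep = []
    for label in y.unique():
        idx = y.index[y == label].to_numpy()
        # sample without replacement so no row is picked twice
        keep.extend(rng.choice(idx, size=n_min, replace=False))
    return X.loc[keep], y.loc[keep]

# Toy data: 90 negatives and 10 positives (illustrative only)
y_toy = pd.Series([0] * 90 + [1] * 10)
X_toy = pd.DataFrame({'feature': range(100)})

X_bal, y_bal = random_under_sample(X_toy, y_toy)
print(y_bal.value_counts())
```

Every class ends up with as many rows as the smallest class, which is why the training set above shrinks to twice the number of positive rows.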

We will now utilize this data for model creation.

In [35]:
# instantiate num_selector
num_sel = make_column_selector(dtype_include = ['number'])

# instantiate preprocessor
preprocessor = make_column_transformer((StandardScaler(), num_sel))

##Model Selection

With our pre-processing complete, we will create a function to test default models in an attempt to determine which algorithms will provide the best results.

In [37]:
def evaluate_default(pipe):
  pipe.fit(X_rus, y_rus)

  train_pred = pipe.predict(X_train)
  test_pred = pipe.predict(X_test)

  print('Training Metrics:\n', classification_report(y_train, train_pred))
  print('\nTest Metrics:\n', classification_report(y_test, test_pred))

In [38]:
# test Logistic Regression
log_pipe = make_pipeline(preprocessor, LogisticRegression(random_state = 42))

evaluate_default(log_pipe)

Training Metrics:
               precision    recall  f1-score   support

         0.0       0.97      0.74      0.84    154552
         1.0       0.25      0.78      0.38     17783

    accuracy                           0.74    172335
   macro avg       0.61      0.76      0.61    172335
weighted avg       0.89      0.74      0.79    172335


Test Metrics:
               precision    recall  f1-score   support

         0.0       0.97      0.74      0.84     51512
         1.0       0.26      0.78      0.39      5934

    accuracy                           0.74     57446
   macro avg       0.61      0.76      0.61     57446
weighted avg       0.89      0.74      0.79     57446



Our default Logistic Regression achieved the highest test accuracy (0.74) and the highest positive-class test F1-score (0.39) among the algorithms we tested.  Its positive-class test recall of 0.78 is slightly below the 0.81 reached by the boosted models.  Positive-class recall is important for our specific data because we want to avoid incorrectly telling people that they will not experience heart disease. Given Logistic Regression's overall performance, we will tune it for optimal results to consider for our final model.
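Recall for the positive class is the share of actual heart disease cases that the model flags, so each false negative lowers it. The sketch below works through the arithmetic on a handful of made-up labels; `y_true` and `y_pred` are illustrative and unrelated to the reports above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (illustrative only): 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print('False negatives (missed cases):', fn)
print('Recall    = TP / (TP + FN) =', recall_score(y_true, y_pred))
print('Precision = TP / (TP + FP) =', precision_score(y_true, y_pred))
```

Here one of the four actual positives is missed, giving a recall of 3/4, while two false alarms among five positive predictions give a precision of 3/5.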

In [39]:
# test Random Forest
rf_pipe = make_pipeline(preprocessor, RandomForestClassifier(random_state = 42))

evaluate_default(rf_pipe)

Training Metrics:
               precision    recall  f1-score   support

         0.0       1.00      0.73      0.85    154552
         1.0       0.30      1.00      0.46     17783

    accuracy                           0.76    172335
   macro avg       0.65      0.87      0.65    172335
weighted avg       0.93      0.76      0.81    172335


Test Metrics:
               precision    recall  f1-score   support

         0.0       0.97      0.71      0.82     51512
         1.0       0.24      0.79      0.37      5934

    accuracy                           0.72     57446
   macro avg       0.60      0.75      0.59     57446
weighted avg       0.89      0.72      0.77     57446



In [41]:
# test KNN
knn_pipe = make_pipeline(preprocessor, KNeighborsClassifier())

evaluate_default(knn_pipe)

Training Metrics:
               precision    recall  f1-score   support

         0.0       0.97      0.70      0.82    154552
         1.0       0.24      0.83      0.37     17783

    accuracy                           0.72    172335
   macro avg       0.61      0.76      0.60    172335
weighted avg       0.90      0.72      0.77    172335


Test Metrics:
               precision    recall  f1-score   support

         0.0       0.96      0.70      0.81     51512
         1.0       0.22      0.75      0.34      5934

    accuracy                           0.70     57446
   macro avg       0.59      0.72      0.57     57446
weighted avg       0.88      0.70      0.76     57446



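Both Random Forest and KNN were outperformed on test accuracy and positive-class F1-score by the default Logistic Regression, as well as by all 3 boosted models that were tested.  Random Forest also shows signs of overfitting, with a positive-class recall of 1.00 on the training data but 0.79 on the test data.  As such, neither will be tuned for consideration for our model.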

In [42]:
# test GBC
gbc_pipe = make_pipeline(preprocessor, GradientBoostingClassifier(random_state = 42))

evaluate_default(gbc_pipe)

Training Metrics:
               precision    recall  f1-score   support

         0.0       0.97      0.72      0.83    154552
         1.0       0.25      0.81      0.38     17783

    accuracy                           0.73    172335
   macro avg       0.61      0.76      0.60    172335
weighted avg       0.90      0.73      0.78    172335


Test Metrics:
               precision    recall  f1-score   support

         0.0       0.97      0.72      0.83     51512
         1.0       0.25      0.81      0.38      5934

    accuracy                           0.73     57446
   macro avg       0.61      0.76      0.60     57446
weighted avg       0.90      0.73      0.78     57446



In [43]:
# test Light GBM
lgbm_pipe = make_pipeline(preprocessor, LGBMClassifier(random_state = 42))

evaluate_default(lgbm_pipe)

Training Metrics:
               precision    recall  f1-score   support

         0.0       0.97      0.72      0.83    154552
         1.0       0.25      0.82      0.38     17783

    accuracy                           0.73    172335
   macro avg       0.61      0.77      0.60    172335
weighted avg       0.90      0.73      0.78    172335


Test Metrics:
               precision    recall  f1-score   support

         0.0       0.97      0.72      0.83     51512
         1.0       0.25      0.81      0.38      5934

    accuracy                           0.73     57446
   macro avg       0.61      0.76      0.60     57446
weighted avg       0.90      0.73      0.78     57446



In [44]:
# test XGBOOST
xgb_pipe = make_pipeline(preprocessor, XGBClassifier(random_state = 42))

evaluate_default(xgb_pipe)

Training Metrics:
               precision    recall  f1-score   support

         0.0       0.97      0.72      0.83    154552
         1.0       0.25      0.81      0.38     17783

    accuracy                           0.73    172335
   macro avg       0.61      0.76      0.60    172335
weighted avg       0.90      0.73      0.78    172335


Test Metrics:
               precision    recall  f1-score   support

         0.0       0.97      0.72      0.83     51512
         1.0       0.25      0.81      0.38      5934

    accuracy                           0.73     57446
   macro avg       0.61      0.76      0.60     57446
weighted avg       0.90      0.73      0.78     57446



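All boosted models (GradientBoostingClassifier, LightGBM, and XGBoost) performed nearly identically in the default testing.  They achieved the highest positive-class test recall (0.81) of our default models, with a test accuracy (0.73) just below that of Logistic Regression.  We will tune GradientBoostingClassifier for consideration for our final model.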

##Tuning Candidates for Final Model Consideration

We will utilize GridSearchCV to optimize the Logistic Regression and GradientBoostingClassifier models before testing them to determine the final model for implementation.

##Logistic Regression Hyperparameter Tuning

In [47]:
# setting parameters for grid search
log_params = {'logisticregression__penalty': ['l1', 'l2', 'elasticnet', 'none'],
              'logisticregression__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
              'logisticregression__C': [.001, .01, .1, 1, 10, 100, 1000],
              'logisticregression__random_state': [42]}

In [48]:
# utilize GridSearchCV to optimize LogisticRegression
log_grid = GridSearchCV(log_pipe, log_params)
log_grid.fit(X_rus, y_rus)

  "Setting penalty='none' will ignore the C and l1_ratio parameters"

GridSearchCV(estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('standardscaler',
                                                                         StandardScaler(),
                                                                         <sklearn.compose._column_transformer.make_column_selector object at 0x7f3d0e5d1350>)])),
                                       ('logisticregression',
                                        LogisticRegression(random_state=42))]),
             param_grid={'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100,
                                                   1000],
                         'logisticregression__penalty': ['l1', 'l2',
                                                         'elasticnet', 'none'],
                         'logisticregression__random_state': [42],
                         'logisticregression__solver': ['newton-cg', 'lbfgs',
                    

In [50]:
# get optimal hyperparameters for Logistic Regression
log_grid.best_params_

{'logisticregression__C': 1,
 'logisticregression__penalty': 'l2',
 'logisticregression__random_state': 42,
 'logisticregression__solver': 'sag'}

In [51]:
# set optimized Logistic Regression
best_log = log_grid.best_estimator_
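Not every solver supports every penalty, so a full cross-product of solvers and penalties contains combinations that cannot be fit. `GridSearchCV` also accepts a list of dictionaries, which allows each solver to be paired only with penalties it supports. The sketch below only counts candidates and does not run a search. As an assumption for the example, it leaves out `'elasticnet'` (which also needs an `l1_ratio` value) and the unpenalized option; `full_grid` and `valid_grid` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]

# Full cross-product, shaped like the grid searched above
full_grid = {
    'logisticregression__penalty': ['l1', 'l2', 'elasticnet', 'none'],
    'logisticregression__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'logisticregression__C': C_values,
}

# One dictionary per group of solvers, listing only supported penalties
valid_grid = [
    {'logisticregression__solver': ['newton-cg', 'lbfgs', 'sag'],
     'logisticregression__penalty': ['l2'],
     'logisticregression__C': C_values},
    {'logisticregression__solver': ['liblinear', 'saga'],
     'logisticregression__penalty': ['l1', 'l2'],
     'logisticregression__C': C_values},
]

n_full = len(ParameterGrid(full_grid))
n_valid = len(ParameterGrid(valid_grid))
print('Candidates in full grid: ', n_full)
print('Candidates in valid grid:', n_valid)
```

A list like `valid_grid` can be passed as `param_grid`, and `scoring='recall'` can be set on `GridSearchCV` if positive-class recall, rather than the default accuracy, should drive the selection.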

##GBC Hyperparameter Tuning

This grid search is taking excessively long to run: 3 attempts at an hour each did not finish.  Final edits will be done when it is completed.

"Sometimes it's not done, but it's due." - Josh

In [61]:
# set parameters for gridsearch (commented out until the search can finish)
# note: an integer min_samples_split must be at least 2
#gbc_params = {'gradientboostingclassifier__loss': ['deviance', 'exponential'],
#              'gradientboostingclassifier__n_estimators': [10, 25, 50, 100],
#              'gradientboostingclassifier__criterion': ['friedman_mse', 'squared_error', 'absolute_error'],
#              'gradientboostingclassifier__min_samples_split': [2, 3, 4],
#              'gradientboostingclassifier__min_samples_leaf': [1, 2, 3],
#              'gradientboostingclassifier__max_depth': [1, 3, 5],
#              'gradientboostingclassifier__random_state': [42]}

In [62]:
#gbc_grid = GridSearchCV(gbc_pipe, gbc_params)
#gbc_grid.fit(X_rus, y_rus)



KeyboardInterrupt: ignored
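One likely reason for the long run time is the size of the grid: the number of candidates is the product of the option counts, and each candidate is fit once per cross-validation fold (5 by default). A randomized search that tries a fixed number of candidates is one possible way to bound the cost. The sketch below runs on small synthetic data (`X_demo`, `y_demo`) with a reduced, illustrative `param_distributions`; it is not the search the notebook intends to run.

```python
import math

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Option counts of the grid above: loss, n_estimators, criterion,
# min_samples_split, min_samples_leaf, max_depth
n_candidates = math.prod([2, 4, 3, 3, 3, 3])
n_fits = n_candidates * 5            # default 5-fold cross-validation
print(n_candidates, 'candidates ->', n_fits, 'fits')

# Small synthetic problem so the example runs quickly
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=42)
pipe = make_pipeline(StandardScaler(),
                     GradientBoostingClassifier(random_state=42))

param_distributions = {
    'gradientboostingclassifier__n_estimators': [10, 25, 50],
    'gradientboostingclassifier__max_depth': [1, 3, 5],
    'gradientboostingclassifier__min_samples_split': [2, 3, 4],
    'gradientboostingclassifier__min_samples_leaf': [1, 2, 3],
}

# n_iter caps how many candidates are tried: here 5 candidates x 3 folds
search = RandomizedSearchCV(pipe, param_distributions, n_iter=5, cv=3,
                            scoring='recall', random_state=42)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Both `GridSearchCV` and `RandomizedSearchCV` also accept an `n_jobs` argument to spread the fits across CPU cores.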

In [None]:
# view optimal GBC hyperparameters
#gbc_grid.best_params_

In [None]:
# set optimized GradientBoosting Classifier
#best_gbc = gbc_grid.best_estimator_

##Test Tuned Models

In [63]:
#evaluate optimized logistic regression
evaluate_default(best_log)

Training Metrics:
               precision    recall  f1-score   support

         0.0       0.97      0.74      0.84    154552
         1.0       0.25      0.78      0.38     17783

    accuracy                           0.74    172335
   macro avg       0.61      0.76      0.61    172335
weighted avg       0.89      0.74      0.79    172335


Test Metrics:
               precision    recall  f1-score   support

         0.0       0.97      0.74      0.84     51512
         1.0       0.26      0.78      0.39      5934

    accuracy                           0.74     57446
   macro avg       0.61      0.76      0.61     57446
weighted avg       0.89      0.74      0.79     57446



In [65]:
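# test logistic regression fit on the full (not under-sampled) training set
# note: this refits best_log in place, so full_log and best_log are the same object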
full_log = best_log.fit(X_train, y_train)

log_train = full_log.predict(X_train)
log_test = full_log.predict(X_test)

print('Training Metrics:\n', classification_report(y_train, log_train))
print('\nTest Metrics:\n', classification_report(y_test, log_test))

Training Metrics:
               precision    recall  f1-score   support

         0.0       0.91      0.99      0.95    154552
         1.0       0.54      0.12      0.20     17783

    accuracy                           0.90    172335
   macro avg       0.72      0.56      0.57    172335
weighted avg       0.87      0.90      0.87    172335


Test Metrics:
               precision    recall  f1-score   support

         0.0       0.91      0.99      0.95     51512
         1.0       0.57      0.12      0.20      5934

    accuracy                           0.90     57446
   macro avg       0.74      0.56      0.57     57446
weighted avg       0.87      0.90      0.87     57446



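Trained on the full training set, our Logistic Regression model reaches 90% test accuracy.  It correctly identifies 99% of the people who do not have heart disease or attacks, and 57% of the people it predicts to have heart disease or attacks do have these conditions.  However, it identifies only 12% of the people who actually have heart disease or attacks, compared with 78% when the same model is fit on the under-sampled data.  For now, Logistic Regression will be the model moving forward into production.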
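The positive-class test recall of 0.12 in the report above reflects the default decision rule: `predict` labels a row positive only when the estimated probability is at least 0.5, which rarely happens when about 90% of the training rows are negative. One option, sketched below on synthetic imbalanced data, is to apply a lower cut-off to `predict_proba`. The thresholds and data are illustrative assumptions, not tuned values for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 10% positives), illustrative only
X_demo, y_demo = make_classification(n_samples=5000, n_features=10,
                                     weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]   # estimated probability of class 1

# Lowering the cut-off flags more rows as positive, so recall cannot drop
recalls = {}
for threshold in (0.5, 0.3, 0.1):
    pred = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, pred)
    print(f'threshold={threshold}: recall={recalls[threshold]:.2f}')
```

The trade-off is lower precision, since more people without the condition are flagged. `LogisticRegression(class_weight='balanced')` is another option that can be compared in the same way.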

In [None]:
# evaluate GBC
#evaluate_default(best_gbc)

In [None]:
# test GBC using entire dataset
#full_gbc = best_gbc.fit(X_train, y_train)
#gbc_train = full_gbc.predict(X_train)
#gbc_test = full_gbc.predict(X_test)
#print('Training Metrics:\n', classification_report(y_train, gbc_train))
#print('\nTest Metrics:\n', classification_report(y_test, gbc_test))

Once the grid search is completed, we will evaluate the GBC and re-evaluate which model will move into production.