## Modeling on Heart Disease Kaggle Project
CoderGirl - Data Science June 2020. 

Full disclosure: I fell a few weeks behind during the machine learning portion of the course (I had a baby on March 4th). I completed the Coursera/Google courses and the relevant notebooks and homework, but I did not get to practice the modeling code as much as I would have liked in order to get a firm grasp on it. I understand the larger context of modeling and machine learning, and in particular the model I chose to use (random forest); however, I had some assistance in writing the code portion of this notebook. I plan to keep building my modeling knowledge as I work through some of my own private data sets.

## Import necessary libraries and data set for analysis and models.

In [1]:
import os
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

warnings.filterwarnings("ignore")

In [2]:
heartdisease= pd.read_csv(r'C:\Users\marie\OneDrive\Desktop\CoderGirl\dev\heart.csv')


In [3]:
heartdisease.head()

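| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |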


## Feature Engineering
From my EDA, the features cp (chest pain type), restecg (resting ECG), slope, and ca (number of vessels colored by fluoroscopy) looked the most valuable for modeling. I use feature engineering to give the categorical codes descriptive labels, convert those categories into numeric indicator columns, and then use `StandardScaler` to standardize the data to a mean of 0 and a standard deviation of 1. Standardizing rescales each column; it does not change the shape of its distribution.
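To make the encoding and standardization steps concrete, here is a minimal sketch on a made-up four-row frame. The `toy` frame and its values are illustrative assumptions, not rows from the heart disease data. It one-hot encodes a labeled column with `pd.get_dummies` and then checks what `StandardScaler` does to each column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one numeric column and one labeled categorical column
toy = pd.DataFrame({
    "age": [40, 50, 60, 70],
    "slope": ["flat", "upsloping", "flat", "downsloping"],
})

# drop_first=True drops the first category in sorted order ("downsloping")
toy_encoded = pd.get_dummies(toy, drop_first=True)

# Standardize every column: subtract the column mean, divide by the column SD
scaled = StandardScaler().fit_transform(toy_encoded.astype(float))
toy_scaled = pd.DataFrame(scaled, columns=toy_encoded.columns)

col_means = toy_scaled.mean().to_numpy()
col_sds = toy_scaled.std(ddof=0).to_numpy()  # StandardScaler uses the population SD
print(col_means, col_sds)
```

Each column ends up with mean 0 and standard deviation 1, including the 0/1 indicator columns.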

In [4]:
# Replace integer codes with descriptive labels via Series.replace, which avoids
# chained assignment; codes missing from a mapping (e.g. 0) are left unchanged.
heartdisease['cp'] = heartdisease['cp'].replace(
    {1: 'typical angina', 2: 'atypical angina', 3: 'non-anginal pain', 4: 'asymptomatic'})
heartdisease['restecg'] = heartdisease['restecg'].replace(
    {0: 'normal', 1: 'ST-T wave abnormality', 2: 'left ventricular hypertrophy'})
heartdisease['slope'] = heartdisease['slope'].replace(
    {1: 'upsloping', 2: 'flat', 3: 'downsloping'})
heartdisease['thal'] = heartdisease['thal'].replace(
    {1: 'normal', 2: 'fixed defect', 3: 'reversable defect'})


In [5]:
heartdisease1=pd.get_dummies(heartdisease,drop_first=True)

In [6]:
heartdisease1.head()

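| | age | sex | trestbps | chol | fbs | thalach | exang | oldpeak | ca | target | cp_atypical angina | cp_non-anginal pain | cp_typical angina | restecg_left ventricular hypertrophy | restecg_normal | slope_flat | slope_upsloping | thal_fixed defect | thal_normal | thal_reversable defect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 145 | 233 | 1 | 150 | 0 | 2.3 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 37 | 1 | 130 | 250 | 0 | 187 | 0 | 3.5 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 41 | 0 | 130 | 204 | 0 | 172 | 0 | 1.4 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 3 | 56 | 1 | 120 | 236 | 0 | 178 | 0 | 0.8 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 4 | 57 | 0 | 120 | 354 | 0 | 163 | 1 | 0.6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |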


In [7]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
features = heartdisease1.drop(columns=["target"])
# Reuse the frame's own column names so every scaled column keeps its correct label
# (the hand-typed list was in a different order than the columns being scaled)
X = pd.DataFrame(sc_X.fit_transform(features),
                 columns=features.columns, index=features.index)

In [8]:
y=heartdisease['target']

## Creating train/test sets

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20,stratify=y, random_state=5)
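The `stratify=y` argument in the split above keeps the class balance the same in the train and test sets. A minimal sketch of that effect, assuming made-up labels with a 70/30 split (the `X_demo` and `y_demo` names are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 negatives and 30 positives
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=5
)

# With stratify, both splits keep the 30% positive rate of the full data
train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(train_pos_rate, test_pos_rate)
```

Without `stratify`, a small test set can end up with a noticeably different share of positives than the training set, which makes the test metrics harder to compare.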

## Create a model and predict test set

In [10]:
from sklearn.linear_model import LogisticRegression
logit = LogisticRegression(random_state = 0)
logit.fit(X_train, y_train)

y_pred = logit.predict(X_test)


In [11]:
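from sklearn.model_selection import cross_val_score
# ROC AUC here is computed from hard 0/1 predictions, not predicted probabilities
roc=roc_auc_score(y_test, y_pred)
# cross_val_score clones logit and refits it on 10 folds of the test split, so this is
# a cross-validated accuracy on the test data, not the accuracy of the model fitted above
accuracies = cross_val_score(estimator = logit, X = X_test, y = y_test, cv = 10)
acc = accuracies.mean()
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)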

results = pd.DataFrame([['Base - Logistic Regression', acc,prec,rec, f1,roc]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])

results

| | Model | Accuracy | Precision | Recall | F1 Score | ROC |
|---|---|---|---|---|---|---|
| 0 | Base - Logistic Regression | 0.848095 | 0.857143 | 0.909091 | 0.882353 | 0.86526 |
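`roc_auc_score` accepts predicted probabilities as well as hard class labels, and the two can give different values because labels reflect only a single decision threshold. The sketch below compares them on synthetic data from `make_classification`; the data and the `X_syn`, `clf`, and related names are stand-ins, not the heart disease set or this notebook's variables.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the heart disease set)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.20, stratify=y_syn, random_state=5
)

clf = LogisticRegression(random_state=0).fit(Xs_train, ys_train)

# AUC from hard 0/1 predictions: reflects only the default 0.5 threshold
auc_from_labels = roc_auc_score(ys_test, clf.predict(Xs_test))

# AUC from positive-class probabilities: ranks the samples across all thresholds
auc_from_proba = roc_auc_score(ys_test, clf.predict_proba(Xs_test)[:, 1])

print(auc_from_labels, auc_from_proba)
```

The probability-based value is the usual way to report ROC AUC, since it measures how well the model ranks positives above negatives.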


In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix,accuracy_score
random_forest = RandomForestClassifier(n_estimators=500,criterion='entropy',max_depth=5).fit(X_train, y_train)
y_pred_random = random_forest.predict(X_test)

## Metric of Model Performance

In [13]:
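# Score the random forest's own predictions (y_pred_random), not the logistic regression's y_pred
roc=roc_auc_score(y_test, y_pred_random)
accuracies = cross_val_score(estimator = random_forest, X = X_test, y = y_test, cv = 10)
acc = accuracies.mean()
prec = precision_score(y_test, y_pred_random)
rec = recall_score(y_test, y_pred_random)
f1 = f1_score(y_test, y_pred_random)

model_results = pd.DataFrame([['Random Forest', acc,prec,rec, f1,roc]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
results = pd.concat([results, model_results], sort=True)
results

| | Accuracy | F1 Score | Model | Precision | ROC | Recall |
|---|---|---|---|---|---|---|
| 0 | 0.848095 | 0.882353 | Base - Logistic Regression | 0.857143 | 0.86526 | 0.909091 |
| 0 | 0.789048 | 0.882353 | Random Forest | 0.857143 | 0.86526 | 0.909091 |

In the output recorded above, the Random Forest row repeats the logistic regression's Precision, Recall, F1 Score, and ROC because that run scored `y_pred` rather than `y_pred_random`; only the Accuracy (0.789048) came from the random forest itself.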
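One alternative to cross-validating on the held-out test split is to cross-validate with `StratifiedKFold` (imported at the top but not used) and keep the test split for a single final check. A minimal sketch on synthetic data; the data, the 50-tree forest, and the `forest` and `folds` names are assumptions for illustration rather than this notebook's setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a training set
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)

# Fixing random_state makes the forest's results repeatable between runs
forest = RandomForestClassifier(
    n_estimators=50, criterion="entropy", max_depth=5, random_state=0
)

# 5 stratified folds: each fold keeps roughly the same class balance
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_accuracies = cross_val_score(forest, X_syn, y_syn, cv=folds, scoring="accuracy")
mean_accuracy = fold_accuracies.mean()
print(fold_accuracies, mean_accuracy)
```

The spread of the per-fold accuracies gives a rough sense of how much a single accuracy number could move with a different split.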
