Title: Prediction of Coronary Heart Disease Using Supervised Machine Learning

Module 5 Lecture B

Objective: To gain hands-on experience applying the different phases of the machine learning lifecycle to create several supervised machine learning models that predict Coronary Heart Disease (CHD).



1. Data pre-processing - Raw data importation and dataset preparation

In [None]:
# Import the Colab files helper, which opens a file picker on your computer
# so that you can upload the dataset

In [None]:
from google.colab import files

In [None]:
#To upload the files or datasets to the colab environment,
# type the below python code in the cell

In [None]:
data_file = files.upload()

In [None]:
import pandas as pd # to convert the uploaded data into DataFrame structures
# that facilitate data preparation, transformation, and analysis in Python
import numpy as np # Numerical Python, NumPy for short,
# typically used for scientific computations

In [None]:
# Create a variable named chds
# to hold the Coronary Heart Disease dataset in memory
# after pandas reads the raw data and builds a DataFrame

In [None]:
chds = pd.read_csv('CHD2.csv')

In [None]:
# when chds is used, Python knows to look for the data associated with that name.

In [None]:
# From now on, you can refer to CHD dataset by its variable name, chds.

In [None]:
chds.head() # displays the first five rows (cases or observations)
# so you can check that the dataset has been successfully uploaded to Google Colab

In [None]:
chds.columns

In [None]:
# The CoronaryDisease column is our target (label, or outcome) variable

In [None]:
# Focus on CoronaryDisease column to get a count of the number
# of patients with CHD and without CHD

In [None]:
print(chds.CoronaryDisease.value_counts())

In [None]:
# Above result shows it's imbalanced
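
The imbalance can also be expressed as proportions rather than raw counts. The following is a minimal sketch, assuming a small made-up label column in place of chds.CoronaryDisease; the variable names and the counts are hypothetical and are not taken from the CHD dataset.

```python
import pandas as pd

# Hypothetical stand-in for chds.CoronaryDisease (0 = no CHD, 1 = CHD).
# The counts are made up for illustration only.
labels = pd.Series([0] * 6 + [1] * 3, name="CoronaryDisease")

counts = labels.value_counts()                 # absolute count per class
shares = labels.value_counts(normalize=True)   # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()  # majority size / minority size

print(counts)
print(shares.round(3))
print("Majority-to-minority ratio:", imbalance_ratio)
```

On the real data, the same calls would be applied to chds.CoronaryDisease.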

2. Data Visualization - Visually explore the attributes of the dataset

In [None]:
# Plot it to visualize it

In [None]:
# Import visualization/plot libraries

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Visualization - Bar plot to visualize the extent to which our target or outcome variable is imbalanced

In [None]:
fig, ax = plt.subplots(figsize=(6,4)) # The width of the figure approximately 6 inches, and its height 4 inches
sns.countplot(x= 'CoronaryDisease', data=chds, palette = "Set1")
_ = plt.title('Coronary Heart Disease (CHD) or Not')
_ = plt.xlabel('0 = No Coronary Heart Disease, 1 = Coronary Heart Disease')

In [None]:
# Above bar plot shows that the number of patients who have CHD
# is about half of the patients who do not have CHD.

In [None]:
# This means that our dataset is imbalanced leaning more towards the absence of CHD

In [None]:
chds.info()

In [None]:
chds.shape # To give number of rows and columns

In [None]:
chds.dtypes # To display the different data types

3. Perform descriptive data analysis to summarize the data, visualize and identify outliers, null/blank values

In [None]:
# Descriptive Statistics for 10 columns

In [None]:
chds.describe()

In [None]:
# Transpose the display using T

In [None]:
chds.describe().T

In [None]:
# Check whether there are null/blank values using the isnull() method

In [None]:
chds.isnull().head(20)

In [None]:
# Let's try to get a count as another check for null/blank values using sum() function

In [None]:
chds.isnull().sum()
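
The null checks above can be condensed into a per-column percentage and a single yes/no answer. This is a sketch under stated assumptions: the toy DataFrame below is hypothetical, with one missing value planted on purpose, and only borrows two column names used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing value.
toy = pd.DataFrame({
    "sbp": [120.0, np.nan, 150.0, 135.0],
    "age": [45, 52, 61, 38],
})

missing_counts = toy.isnull().sum()         # missing cells per column
missing_share = toy.isnull().mean() * 100   # percent of rows missing per column
any_missing = toy.isnull().values.any()     # True if any cell is missing

print(missing_counts)
print(missing_share)
print("Any missing values?", any_missing)
```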

In [None]:
# Generate histograms to examine the distributions of the dataset

In [None]:
p = chds.hist(figsize=(18,18))

4. Compute the correlation coefficients between all the features or variables

In [None]:
# Calculate the correlation coefficients with pandas corr() and display them as a heatmap using seaborn (imported above)

In [None]:
plt.figure(figsize=(17,10)) # The width of the figure is approximately 17 inches, and its height 10 inches
p = sns.heatmap(chds.corr(), annot=True,cmap = 'YlGnBu')
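
A heatmap of every pairwise correlation can be hard to read when the main interest is the target. The sketch below, which assumes a synthetic dataset with hypothetical columns, pulls out only the correlations with CoronaryDisease and ranks them by absolute size.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration: the target depends on age, not on noise.
rng = np.random.default_rng(0)
n = 200
age = rng.integers(15, 65, size=n)
noise = rng.normal(size=n)
target = (age + rng.normal(scale=10, size=n) > 45).astype(int)
toy = pd.DataFrame({"age": age, "noise": noise, "CoronaryDisease": target})

# Correlation of each feature with the target, excluding the target itself.
corr_with_target = toy.corr()["CoronaryDisease"].drop("CoronaryDisease")

# Rank by absolute value so strong negative correlations are not buried.
order = corr_with_target.abs().sort_values(ascending=False).index
target_corr = corr_with_target.reindex(order)

print(target_corr)
```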

5. Prepare the dataset for Model Building

In [None]:
# Separate Independent variables (X) from Dependent variable (Y) OR Target or Outcome column

In [None]:
# With the below code, drop the CoronaryDisease column that has the label,
# aka the dependent variable or the outcome.

In [None]:
X = chds.drop('CoronaryDisease', axis=1) # Create storage location X
# to put the remaining columns or the Independent Variables.

In [None]:
# The remaining columns are the independent (predictor) variables: the factors that
# influence Y (the dependent variable, target, or outcome), i.e. whether a patient will have CHD or not

In [None]:
# Create storage location Y for the Target variable or the dependent variable
# that has labels showing whether a patient has CHD or not

In [None]:
Y = chds['CoronaryDisease']

In [None]:
# Import the train_test_split function from sklearn

In [None]:
from sklearn.model_selection import train_test_split

6. Dataset Split into (2) subsets: train (70%) and test (30%). Use the train dataset for training the model, and the test dataset for predictions

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=7)

In [None]:
X_train

In [None]:
X_test
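
Because the target is imbalanced, one option is to keep the class proportions equal in the train and test subsets. The following is a sketch of that idea using the stratify argument of train_test_split. The data are synthetic, and the names X_toy and y_toy are hypothetical; this variant is not the split used in the rest of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 300 rows, one third of them positive.
rng = np.random.default_rng(7)
X_toy = pd.DataFrame({
    "sbp": rng.normal(138, 20, 300),
    "age": rng.integers(15, 65, 300),
})
y_toy = pd.Series([1] * 100 + [0] * 200, name="CoronaryDisease")

# stratify=y_toy keeps the share of each class the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=7, stratify=y_toy
)

print("Positive share in train:", y_tr.mean())
print("Positive share in test: ", y_te.mean())
```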

7. Conduct model building, trying several algorithms: a) Decision Tree, b) Support Vector Machine, c) Random Forest, d) Logistic Regression, e) K-Nearest Neighbors, and f) XGBoost, in accordance with scikit-learn developers (2019).

In [None]:
# a) Use Decision Tree algorithm (nonparametric) to build the predictive model (Rashidi et al., 2019)

In [None]:
# Import relevant sklearn libraries for Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

In [None]:
dectree = DecisionTreeClassifier()
dectree.fit(X_train, Y_train)

In [None]:
# Use below code to obtain Decision Tree accuracy score

In [None]:
dectree_pred = dectree.predict(X_test)

In [None]:
print("Accuracy Score =", format(metrics.accuracy_score(Y_test,dectree_pred)))

In [None]:
# Accuracy is low.
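
On an imbalanced target, accuracy alone can hide how many CHD cases a model misses. The sketch below shows how a confusion matrix and per-class recall could complement the accuracy score. The labels and predictions are hypothetical and are not results from the CHD data.

```python
from sklearn.metrics import classification_report, confusion_matrix, recall_score

# Hypothetical true labels and predictions (0 = no CHD, 1 = CHD).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

cm = confusion_matrix(y_true, y_pred)   # rows = actual, columns = predicted
tn, fp, fn, tp = cm.ravel()             # unpack the 2x2 matrix

print(cm)
print(classification_report(y_true, y_pred, target_names=["No CHD", "CHD"]))
print("Recall for CHD (sensitivity):", recall_score(y_true, y_pred))
```

With the notebook's own objects, the same functions would take Y_test and dectree_pred.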

In [None]:
# b) Use Support Vector Machine algorithm (nonparametric) to build the predictive model (Rashidi et al., 2019)

In [None]:
# Import sklearn support vector machine

In [None]:
from sklearn.svm import SVC

In [None]:
supportv_model = SVC()
supportv_model.fit(X_train, Y_train)

In [None]:
supportv_pred = supportv_model.predict(X_test)

In [None]:
print("Accuracy Score =", metrics.accuracy_score(Y_test, supportv_pred))

In [None]:
# Accuracy score is better than Decision Tree
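
Support vector machines are sensitive to the scale of the features, so a common variant standardizes the inputs before fitting. The following is a sketch of that variant on synthetic data. The dataset, the exaggerated first column, and the name scaled_svc are assumptions for illustration, and no claim is made about how scaling would change the score on the CHD data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 9-feature dataset with a roughly 65/35 class split.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=9, n_informative=4,
    weights=[0.65, 0.35], random_state=7,
)
X_syn[:, 0] *= 1000  # exaggerate one feature's scale on purpose

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=7, stratify=y_syn
)

# The scaler is fitted on the training data only, then applied to the test data.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_tr, y_tr)
acc = scaled_svc.score(X_te, y_te)

print("Accuracy Score =", acc)
```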

In [None]:
# c) Use Randomforest (nonparametric) to build the predictive model (Rashidi et al., 2019)

In [None]:
# Import Scikit Learn Randomforest library

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
ranfc = RandomForestClassifier(n_estimators=200)
ranfc.fit(X_train, Y_train)

In [None]:
# Use below code to obtain Randomforest accuracy score on training and test datasets

In [None]:
ranfc_train = ranfc.predict(X_train)

In [None]:
print("Accuracy_Score =", metrics.accuracy_score(Y_train, ranfc_train))

In [None]:
# The training accuracy above is optimistic; if it is far higher than
# the test accuracy computed below, the model is overfitting

In [None]:
#Let's use the Test dataset

In [None]:
ranfc_test = ranfc.predict(X_test)

In [None]:
print("Accuracy_Score =", metrics.accuracy_score(Y_test, ranfc_test))
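
A single train/test split gives one accuracy figure that depends on the random split. The sketch below estimates accuracy with 5-fold stratified cross-validation instead. It runs on synthetic data, and the fold count, seed, and number of trees are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the nine predictor columns and the binary target.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=9, n_informative=4, random_state=7
)

forest = RandomForestClassifier(n_estimators=50, random_state=7)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

# One accuracy score per fold; each fold is held out exactly once.
scores = cross_val_score(forest, X_syn, y_syn, cv=folds, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean(), "Std:", scores.std())
```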

In [None]:
# d) Use LogisticRegression algorithm (parametric) to build
# the predictive model (Rashidi et al., 2019)

In [None]:
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')

In [None]:
lrsmodel = LogisticRegression()

In [None]:
lrsmodel.fit(X_train, Y_train)
lrsmodel.score(X_test, Y_test)

In [None]:
# With an accuracy score of 0.69, Logistic Regression is the next best model
# for predicting Coronary Heart Disease. But it is a parametric algorithm.
# Because our data does not appear to be normally distributed,
# we do not choose a parametric algorithm like Logistic Regression (Rashidi et al., 2019)
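
Silencing all warnings also hides convergence warnings from the solver. An alternative, sketched below on synthetic data, is to standardize the features and raise max_iter so the solver has room to converge. The value 1000 and the feature names attached to the coefficients are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Column names follow the notebook; the values are synthetic.
feature_names = ["sbp", "tobacco", "ldl", "adiposity", "famhist",
                 "typea", "obesity", "alcohol", "age"]
X_syn, y_syn = make_classification(
    n_samples=400, n_features=9, n_informative=4, random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=7
)

# Scaling plus a higher iteration cap, instead of suppressing warnings.
lr_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr_pipe.fit(X_tr, y_tr)
lr_acc = lr_pipe.score(X_te, y_te)

# One coefficient per (standardized) feature.
coefs = pd.Series(lr_pipe.named_steps["logisticregression"].coef_[0],
                  index=feature_names)

print("Accuracy Score =", lr_acc)
print(coefs)
```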

In [None]:
# e) Use K-Nearest Neighbors algorithm (nonparametric)
# to build the predictive model (Rashidi et al., 2019)

In [None]:
# Import sklearn.neighbors KNeighborsClassifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knneigh = KNeighborsClassifier(n_neighbors=3)
knneigh.fit(X_train, Y_train)

In [None]:
knneigh_pred = knneigh.predict(X_test)

In [None]:
print("Accuracy_Score =", metrics.accuracy_score(Y_test, knneigh_pred))
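
The choice n_neighbors=3 is one of many possible settings. The sketch below shows how the number of neighbors could be chosen by cross-validated grid search, with scaling included because K-Nearest Neighbors relies on distances. The data are synthetic, and the candidate values of k are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_syn, y_syn = make_classification(
    n_samples=300, n_features=9, n_informative=4, random_state=7
)

# Scale first, then classify; the step names are used in the parameter grid.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
candidate_k = [3, 5, 7, 9, 11]  # arbitrary candidates for illustration

grid = GridSearchCV(pipe, {"knn__n_neighbors": candidate_k},
                    cv=5, scoring="accuracy")
grid.fit(X_syn, y_syn)
best_k = grid.best_params_["knn__n_neighbors"]

print("Best k:", best_k)
print("Cross-validated accuracy at best k:", grid.best_score_)
```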

In [None]:
# f) Use XGBClassifier algorithm (nonparametric) to build the predictive model (Rashidi et al., 2019)

In [None]:
from xgboost import XGBClassifier

In [None]:
xgboost_model = XGBClassifier()

In [None]:
xgboost_model.fit(X_train, Y_train)

In [None]:
# Use the test dataset to make predictions

In [None]:
xgboost_ypred = xgboost_model.predict(X_test)

In [None]:
print("Accuracy_Score =", metrics.accuracy_score(Y_test, xgboost_ypred))
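
The accuracy scores above are printed one model at a time. The sketch below collects train and test accuracy for several models into a single table. It uses synthetic data, and XGBClassifier is left out so that the example depends only on scikit-learn; it could be added to the dictionary in the same way. The scores it prints are not the CHD results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_syn, y_syn = make_classification(
    n_samples=400, n_features=9, n_informative=4, random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=7
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=7),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=7),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=3),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rows.append({
        "model": name,
        "train_accuracy": model.score(X_tr, y_tr),
        "test_accuracy": model.score(X_te, y_te),
    })

# Best test accuracy first; a large train/test gap hints at overfitting.
results = (pd.DataFrame(rows)
           .sort_values("test_accuracy", ascending=False)
           .reset_index(drop=True))
print(results)
```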

In [None]:
# For the Random Forest predictive model, determine which
# variables or features influence its performance the most

In [None]:
# Visualize feature importances

In [None]:
(pd.Series(ranfc.feature_importances_, index=X.columns).plot(kind='barh'))
plt.title("Feature Importances")
plt.xlabel("Feature Importances")

In [None]:
# Above results are consistent with the correlations shown in the above heatmap with the Dependent variable, Y.
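
The feature_importances_ attribute is computed from the training data. A complementary check, sketched below, is permutation importance measured on held-out data. The data are synthetic, so the column names are only labels borrowed from this notebook, and the resulting ranking says nothing about the real CHD features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Names follow the notebook; the values are synthetic, not patient data.
feature_names = ["sbp", "tobacco", "ldl", "adiposity", "famhist",
                 "typea", "obesity", "alcohol", "age"]
X_arr, y_syn = make_classification(
    n_samples=400, n_features=9, n_informative=3, n_redundant=0,
    random_state=7,
)
X_syn = pd.DataFrame(X_arr, columns=feature_names)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=7
)

forest = RandomForestClassifier(n_estimators=100, random_state=7)
forest.fit(X_tr, y_tr)

# Impurity-based importances (from training) sorted from high to low.
impurity = pd.Series(forest.feature_importances_,
                     index=feature_names).sort_values(ascending=False)

# Permutation importances: drop in test accuracy when a column is shuffled.
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=7)
permutation = pd.Series(perm.importances_mean,
                        index=feature_names).sort_values(ascending=False)

print(impurity)
print(permutation)
```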

8. Select the model with the highest accuracy and save it.

RandomForest has the highest accuracy (69.8%), followed by Logistic Regression (69.0%) and XGBoost (68.3%). Since the dataset does not appear to come from a normal distribution, let's adopt and save the RandomForest model (ranfc) for CHD predictions on new patients: it not only has the highest accuracy, it is also a nonparametric algorithm that does not depend on whether the data came from a normal distribution.

In [None]:
# Import pickle and use the dumps() function to serialize the trained model (ranfc) in memory

In [None]:
import pickle

In [None]:
saved_model = pickle.dumps(ranfc)

In [None]:
# Create a variable, ranfc_from_pickle, to hold the model loaded back with loads()

In [None]:
ranfc_from_pickle = pickle.loads(saved_model)

In [None]:
ranfc_from_pickle.predict(X_test)
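
pickle.dumps() keeps the serialized model in memory only, so it is lost when the session ends. The sketch below writes a model to a file with pickle.dump() and reads it back with pickle.load(). The model is a small forest trained on synthetic data, and the file name ranfc_model.pkl is hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on synthetic data.
X_syn, y_syn = make_classification(n_samples=200, n_features=9, random_state=7)
model = RandomForestClassifier(n_estimators=20, random_state=7).fit(X_syn, y_syn)

# Hypothetical file name inside a temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), "ranfc_model.pkl")

with open(model_path, "wb") as fh:   # write the model to disk
    pickle.dump(model, fh)

with open(model_path, "rb") as fh:   # read it back, e.g. in a later session
    restored = pickle.load(fh)

# The restored model should reproduce the original predictions.
same = np.array_equal(model.predict(X_syn), restored.predict(X_syn))
print("Predictions identical after reload:", same)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.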

In [None]:
# Display the data in the X_test dataset to use for further testing

In [None]:
X_test

In [None]:
# Each of the above patients has data points for the X variables
# Simply type in the data points in the trained and saved model
# for specific patient to predict the target or Y

In [None]:
# Below is an example: for patient #402, type the data points
# into the trained model using the predict function and see what the output is

In [None]:
ranfc.predict([[162, 6.94, 4.55, 33.36, 1, 52, 27.09, 32.06, 43]]) # for patient #402

In [None]:
# Above result, 0, means no CHD is predicted for patient #402

In [None]:
ranfc.predict([[124, 4.20, 2.94, 27.59, 0, 50, 30.31, 85.06, 30]]) # for patient #204

In [None]:
# Above result, 0, means no CHD is predicted for patient #204

In [None]:
# Let's get a prediction for patient #399

In [None]:
ranfc.predict([[132, 0.00, 6.63, 29.58, 1, 37, 29.41, 2.57, 62]]) # for patient #399

In [None]:
# Above result, 0, means no CHD is predicted for patient #399

In [None]:
# New patient #1 data: sbp (200) tobacco (13)
# ldl (4) adiposity (28.61) famhist (1) typea (12) obesity (19) alcohol (2.06) age (63)

In [None]:
ranfc.predict([[200, 13, 4, 28.61, 1, 12, 19, 2.06, 63]])

In [None]:
# Above result, 1, means CHD is predicted for this new patient #1
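
Passing a bare list of numbers relies on getting the column order exactly right. The sketch below names each value with the column labels listed for new patient #1 and also asks the model for class probabilities. The model is a toy forest fitted on random numbers, so its output is not clinically meaningful and is not the prediction obtained above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

cols = ["sbp", "tobacco", "ldl", "adiposity", "famhist",
        "typea", "obesity", "alcohol", "age"]

# Toy training data (random numbers), used only to obtain a fitted model.
rng = np.random.default_rng(7)
X_toy = pd.DataFrame(rng.normal(size=(200, 9)), columns=cols)
y_toy = (X_toy["age"] + X_toy["tobacco"] > 0).astype(int)
toy_model = RandomForestClassifier(n_estimators=50, random_state=7)
toy_model.fit(X_toy, y_toy)

# New patient #1, with every value labeled by its column name.
new_patient = pd.DataFrame(
    [[200, 13, 4, 28.61, 1, 12, 19, 2.06, 63]], columns=cols
)

pred = toy_model.predict(new_patient)         # predicted class, 0 or 1
proba = toy_model.predict_proba(new_patient)  # [P(no CHD), P(CHD)]

print("Predicted class:", pred[0])
print("Class probabilities:", proba[0])
```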

In [None]:
# For new patient #2, do the same

In [None]:
# New patient #2 data: sbp (145) tobacco (11)
# ldl (5) adiposity (16.2) famhist (0) typea (79) obesity (30) alcohol (2.62) age (38)

In [None]:
# Pass the new data to the trained model, as below,
# to get a CHD prediction for new patient #2

In [None]:
ranfc.predict([[145, 11, 5, 16.2, 0, 79, 30, 2.62, 38]]) # for patient #2

In [None]:
# For new patient #3, do the same

In [None]:
# New patient #3 data: sbp (200) tobacco (25)
# ldl (2) adiposity (32.27) famhist (0) typea (80) obesity (1) alcohol (56.06) age (60)

In [None]:
# Pass the new data to the trained model, as below,
# to get a CHD prediction for new patient #3

In [None]:
ranfc.predict([[200, 25, 2, 32.27, 0, 80, 1, 56.06, 60]]) # for patient #3

In [None]:
# For new patient #4, do the same

In [None]:
# New patient #4 data: sbp (118) tobacco (3)
# ldl (2) adiposity (10.05) famhist (1) typea (20) obesity (10) alcohol (0) age (17)

In [None]:
ranfc.predict([[118, 3, 2, 10.05, 1, 20, 10, 0, 17]]) # for patient #4

References

scikit-learn developers (2019). 1. Supervised learning - scikit-learn 0.22 documentation. Scikit-Learn.org.

Rashidi,....