**DAT 402 Project 1** <br> <ul> <li> This data set was taken from Kaggle (https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data). Using a Naive Bayes classification model, we predict whether or not a patient will die based on all of the independent variables. </li> <li> This project was done by Pierce Renio. </li> </ul>

Start by importing necessary libraries.

In [132]:
#These libraries were also used in HW1
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import MultinomialNB, CategoricalNB

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [133]:
df = pd.read_csv('heart_failure_clinical_records_dataset.csv', sep=',')

print(df.shape)
df.head()


(299, 13)


Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1


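Each row is a patient. The variable "DEATH_EVENT" is the dependent variable we are trying to predict from the other 12 independent variables. All of the variables are numeric: the binary indicators (anaemia, diabetes, high_blood_pressure, sex, smoking, DEATH_EVENT) are already coded as 0/1, and the remaining columns are integer or floating-point measurements.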
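The column types can be checked directly rather than read off the printed rows. The sketch below does not load the CSV; it rebuilds a small stand-in frame from the first two rows shown above (a subset of the columns, under the hypothetical name `sample`) and separates the float columns from the integer ones.

```python
import pandas as pd

# Stand-in frame: first two rows of the output above, subset of columns
sample = pd.DataFrame({
    "age": [75.0, 55.0],
    "anaemia": [0, 0],
    "platelets": [265000.0, 263358.03],
    "serum_creatinine": [1.9, 1.1],
    "DEATH_EVENT": [1, 1],
})

# Every column is numeric, but the dtypes differ
print(sample.dtypes)

float_cols = [c for c in sample.columns if pd.api.types.is_float_dtype(sample[c])]
int_cols = [c for c in sample.columns if pd.api.types.is_integer_dtype(sample[c])]
print("float columns:", float_cols)
print("integer columns:", int_cols)
```

On the full data set the same check would be `df.dtypes`.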

## Data cleaning

<br>

Since every variable is already numeric, there is no need to remove any variables or convert any of them to numbers.
We will remove any observations that contain missing (NA) values, if they exist.

In [134]:
#Count missing values per column, then drop any rows that contain them
print(df.isnull().sum())
df = df.dropna()

There are no missing values, so no further cleaning is needed.

<br>

Let's look at the class proportions for DEATH_EVENT.

In [135]:
df['DEATH_EVENT'].value_counts(normalize=True)

0    0.67893
1    0.32107
Name: DEATH_EVENT, dtype: float64

More patients survived than died in this data set (about 68% versus 32%). Next we split the data into a training set and a test set, and then fit Naive Bayes on the training data.

In [136]:
#Setting seed
np.random.seed(125432)

#Shuffling rows of the data set (sample() draws from the NumPy seed set above)
data_randomized = df.sample(frac=1)

#Splitting the data (70/30)
trainsize = round(len(data_randomized) * 0.7)

training_set = data_randomized[:trainsize].reset_index(drop=True)
test_set = data_randomized[trainsize:].reset_index(drop=True)

#Sanity check
print(training_set.shape)
print(test_set.shape)

(209, 13)
(90, 13)


We check that the DEATH_EVENT proportions in the training and test sets are roughly equal.

In [137]:
training_set['DEATH_EVENT'].value_counts(normalize=True)

0    0.684211
1    0.315789
Name: DEATH_EVENT, dtype: float64

In [138]:
test_set['DEATH_EVENT'].value_counts(normalize=True)

0    0.666667
1    0.333333
Name: DEATH_EVENT, dtype: float64
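The shuffle-and-slice split above happens to give similar class proportions, but that is not guaranteed on a data set this small. One alternative, sketched here on synthetic stand-in data (the frame `toy` and its `feature` column are hypothetical), is scikit-learn's `train_test_split` with `stratify`, which keeps the proportions aligned by construction. This is not the split used in this project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with a similar class balance (32% positives)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=300),
    "DEATH_EVENT": np.array([1] * 96 + [0] * 204),
})

# stratify keeps the DEATH_EVENT proportions nearly equal in both parts
train_toy, test_toy = train_test_split(
    toy, test_size=0.3, stratify=toy["DEATH_EVENT"], random_state=0
)

train_rate = train_toy["DEATH_EVENT"].mean()
test_rate = test_toy["DEATH_EVENT"].mean()
print(train_rate, test_rate)
```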

To fit the Naive Bayes model we must separate the features from the target. The data frame for the x values holds every column except DEATH_EVENT. The y values are a single column containing only DEATH_EVENT.

In [157]:
trainX = training_set.iloc[:,:-1]
trainy = training_set['DEATH_EVENT']

colnames = trainX.columns

#Sanity check: total number of cells in trainX (209 rows x 12 features)
trainX.size

2508


We split the test set into features and target in the same way.

In [155]:
testX = test_set.iloc[:,:-1]
testy = test_set['DEATH_EVENT']


#Sanity check
testX.shape

(90, 12)


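Since trainy and testy are already Bernoulli (0/1) variables, label encoding leaves their values unchanged; it is applied below only for consistency. The features are ordinal-encoded so that CategoricalNB receives non-negative integer category codes.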

In [172]:
#For the dependent variable (already 0/1, so the codes are unchanged)
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
trainBrnli = le.fit_transform(trainy)

#For the independent variables: map each distinct value to an integer code
enc = OrdinalEncoder()
trainX = enc.fit_transform(trainX)

trainX = pd.DataFrame(trainX, columns=colnames)

#Sanity check
trainX.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time
0,20.0,0.0,123.0,1.0,7.0,0.0,90.0,6.0,9.0,1.0,0.0,43.0
1,20.0,0.0,132.0,1.0,4.0,0.0,71.0,15.0,0.0,1.0,1.0,84.0
2,25.0,0.0,23.0,1.0,8.0,1.0,29.0,6.0,16.0,1.0,0.0,15.0
3,25.0,1.0,6.0,1.0,10.0,0.0,23.0,5.0,13.0,0.0,0.0,59.0
4,22.0,0.0,76.0,0.0,2.0,1.0,65.0,5.0,16.0,1.0,1.0,3.0
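The encoded values above are no longer the raw measurements: `OrdinalEncoder` replaces each value with its position among the sorted distinct values of that column. The small sketch below illustrates this using the serum_sodium values from the first five rows of the original data shown earlier (the names `values`, `toy_enc` and `codes` are only for this illustration).

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# serum_sodium values from the first five rows of the raw data
values = np.array([[130], [136], [129], [137], [116]])

toy_enc = OrdinalEncoder()
codes = toy_enc.fit_transform(values)

# categories_ holds the sorted distinct values; each code indexes into it
print(toy_enc.categories_[0])  # [116 129 130 136 137]
print(codes.ravel())           # [2. 3. 1. 4. 0.]
```

Because the codes depend on which values are present when the encoder is fit, a column with many distinct values (such as platelets) ends up with many categories, each seen only a few times.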


In [173]:
model = CategoricalNB() 
model.fit(trainX,trainBrnli)

CategoricalNB()

With the model now trained, we will first evaluate it on the training set.

In [174]:
yhattrain = model.predict(trainX)

We can see how effective the model is on the training data by looking at a confusion matrix and the accuracy score.

In [175]:
pd.crosstab(yhattrain, trainBrnli)

col_0,0,1
row_0,,
0,141,4
1,2,62


In [176]:
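#Predictions are passed first here, so rows are predicted labels and columns are true labels
#(sklearn's documented order is confusion_matrix(y_true, y_pred), which would transpose this)
confusion_matrix(yhattrain, trainy)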

array([[141,   4],
       [  2,  62]])

In [177]:
accuracy_score(yhattrain, trainy)

0.9712918660287081

The proportion of correctly predicted outputs is calculated by: $\frac{141+62}{141+4+2+62} \approx 97.13\%$. <br> 
Because the model is being scored on the same data it was fit to, this accuracy is likely optimistic and might not be nearly as high for the test set. If the test accuracy is much lower, this is an example of overfitting (low bias, high variance).
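The same arithmetic can be done directly on the matrix: the correct predictions sit on the diagonal, so accuracy is the trace divided by the total count. A minimal sketch, with the training matrix typed in by hand:

```python
import numpy as np

# Training-set confusion matrix from the output above
cm_train = np.array([[141, 4],
                     [2, 62]])

# Diagonal entries are the correct predictions
train_accuracy = np.trace(cm_train) / cm_train.sum()
print(round(train_accuracy, 4))  # 0.9713
```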

We will now evaluate the model on the test set.

In [178]:
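#Note: fit_transform refits both encoders on the test set, so the integer codes
#are assigned independently of the codes learned from the training set
testBrnli = le.fit_transform(testy)

testX = enc.fit_transform(testX)

testX = pd.DataFrame(testX, columns=colnames)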

In [179]:
yhattest = model.predict(testX)
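Calling `fit_transform` on the test set relearns the codes from the test rows, so a given code can stand for a different raw value than it did during training. Simply switching to `enc.transform(testX)` would, with the encoder's default settings, raise an error for any value not seen in training. One possible workaround, sketched below on synthetic data, is to bin the continuous features with `KBinsDiscretizer`, learning the bin edges from the training rows only. This swaps in a different preprocessing step from the one used in this project; all names and data here are hypothetical, and the result is not comparable to the numbers reported in this notebook.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-in data: two continuous features and a binary label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X_train, X_test = X[:140], X[140:]
y_train, y_test = y[:140], y[140:]

# Learn the bin edges from the training rows only ...
binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform")
X_train_binned = binner.fit_transform(X_train).astype(int)
# ... and reuse the same edges for the test rows (transform, not fit_transform)
X_test_binned = binner.transform(X_test).astype(int)

# min_categories covers a bin that never occurs in the training rows
nb = CategoricalNB(min_categories=4)
nb.fit(X_train_binned, y_train)
test_predictions = nb.predict(X_test_binned)
print((test_predictions == y_test).mean())
```

Test values outside the training range fall into the first or last bin, so every test code is one the model already knows.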

In [180]:
#Predictions are passed first here, so rows are predicted labels and columns are true labels
confusion_matrix(yhattest, testBrnli)

array([[50, 15],
       [10, 15]])

In [181]:
accuracy_score(yhattest, testBrnli)

0.7222222222222222

The proportion of correctly predicted outputs is calculated by: $\frac{50+15}{50+15+10+15} \approx 72.22\%$. <br> 
This accuracy score is much lower than the 97.13% obtained on the training set, which is a sign of overfitting. For reference, always predicting the majority class (survival) would score about 66.67% on this test set. If the model is overfitting, then its bias is low and its variance is high, so its error on unseen data is higher.
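Accuracy alone hides how the errors are distributed between the two classes. As a rough sketch, the test matrix above can be broken down further by hand. It assumes the orientation produced by the call above, with predictions passed first, so that rows are predicted labels and columns are true labels.

```python
import numpy as np

# Test-set matrix from above: rows = predicted label, columns = true label
cm_test = np.array([[50, 15],
                    [10, 15]])

accuracy = np.trace(cm_test) / cm_test.sum()            # (50 + 15) / 90
recall_death = cm_test[1, 1] / cm_test[:, 1].sum()      # 15 / (15 + 15)
precision_death = cm_test[1, 1] / cm_test[1, :].sum()   # 15 / (10 + 15)
# Accuracy of always predicting the majority class (true label 0)
baseline = cm_test[:, 0].sum() / cm_test.sum()          # 60 / 90

print(round(accuracy, 4), recall_death, precision_death, round(baseline, 4))
```

Under that reading, the model identifies half of the patients who died in the test set, and 60% of its predicted deaths are correct.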