**Heart Failure** <br> <ul> <li> This data set was taken from Kaggle (https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data). Using the decision tree and Random Forest methods, we predict whether a patient will die based on all of the independent variables.</li> <li> The data cleaning is the same as in the Naive Bayes classification project performed on the same data.</li> <li> This project was done by Pierce Renio.</li> </ul>

Start by importing necessary libraries.

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import OrdinalEncoder 
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier


from sklearn.metrics import accuracy_score, confusion_matrix

In [4]:
df=pd.read_csv('heart_failure_clinical_records_dataset.csv', sep=',')
df

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.00,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.00,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.00,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.00,2.7,116,0,0,8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
294,62.0,0,61,1,38,1,155000.00,1.1,143,1,1,270,0
295,55.0,0,1820,0,38,0,270000.00,1.2,139,0,0,271,0
296,45.0,0,2060,1,60,0,742000.00,0.8,138,0,0,278,0
297,45.0,0,2413,0,38,0,140000.00,1.4,140,1,1,280,0


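Each row is a patient. The variable "DEATH_EVENT" is the dependent variable we are trying to predict from the other 12 independent variables. All of the variables are stored as numbers: the binary categorical variables (anaemia, diabetes, high_blood_pressure, sex, smoking) are already encoded as 0/1, and age, platelets, and serum_creatinine are floating-point values rather than integers.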
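Column types can be checked directly with `DataFrame.dtypes`. The sketch below is only an illustration: it rebuilds the first three rows of the table above (a subset of the columns) in a small frame named `sample`, which is a hypothetical name and not part of the notebook.

```python
import pandas as pd

# First three rows copied from the table above (a subset of the columns).
sample = pd.DataFrame(
    {
        "age": [75.0, 55.0, 65.0],
        "anaemia": [0, 0, 0],
        "platelets": [265000.00, 263358.03, 162000.00],
        "serum_creatinine": [1.9, 1.1, 1.3],
        "DEATH_EVENT": [1, 1, 1],
    }
)

# dtypes reports the storage type pandas inferred for each column.
dtypes = sample.dtypes

# Collect the columns stored as floating point rather than integer.
float_columns = [name for name, dtype in dtypes.items() if dtype.kind == "f"]

print(dtypes)
print(float_columns)
```

On the full data the same check would be `df.dtypes`.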

## Data cleaning

<br>

Since every variable is already numeric, there is no need to remove or encode any variables.
We will remove any observations that contain missing (NaN) values, if they exist.

In [8]:
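# Count missing values per column (a notebook cell only displays its last
# expression, so this result is not shown here)
df.isnull().sum()
# Drop any rows that contain missing values
df = df.dropna()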

There are no missing values, so no further cleaning is needed.

<br>

Let's look at the class proportions for DEATH_EVENT.

In [9]:
df['DEATH_EVENT'].value_counts(normalize=True)

0    0.67893
1    0.32107
Name: DEATH_EVENT, dtype: float64
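These proportions give a useful reference point: a model that always predicts the most common class would already be right about 68% of the time on the full data. A minimal sketch of that majority-class baseline, using only the two proportions printed above (the variable names are illustrative):

```python
# Class proportions reported by value_counts(normalize=True) above.
proportions = {0: 0.67893, 1: 0.32107}

# A majority-class baseline always predicts the most common label.
majority_label = max(proportions, key=proportions.get)
baseline_accuracy = proportions[majority_label]

print(majority_label, baseline_accuracy)
```

Any accuracy reported later is easier to judge against this baseline than against 50%.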

More patients survived than died in this data set. Next, we split the data into training (80%) and test (20%) sets so that the decision tree model can be fit on the training data.

In [14]:
X = df.drop(columns=["DEATH_EVENT"])
y = df["DEATH_EVENT"]
trainX, testX, trainy, testy = train_test_split(X,y, test_size=0.2, random_state = 53)
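Because the classes are imbalanced, a plain random split can leave the test set with a somewhat different share of deaths than the training set. One option, which this notebook does not use, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic stand-in labels (203 survivors and 96 deaths, consistent with the proportions above for 299 patients) and a placeholder feature column; `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in labels: 203 survivors and 96 deaths (an assumption that
# matches the 0.67893 / 0.32107 proportions for 299 patients).
y_demo = np.array([0] * 203 + [1] * 96)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo asks for (approximately) the same class balance in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=53, stratify=y_demo
)

print(len(y_tr), len(y_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

With stratification, the death rate in each subset stays close to the overall rate of about 0.32.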

We now create the decision tree model and fit it to the training data. It will first predict the training data itself.

In [15]:
modeldt = DecisionTreeClassifier(criterion = "entropy", random_state=53)
modeldt.fit(trainX, trainy)

DecisionTreeClassifier(criterion='entropy', random_state=53)
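With `criterion="entropy"`, the tree chooses splits that most reduce the Shannon entropy of the class labels (the information gain). The sketch below illustrates the calculation by hand. The root counts are the 164 survivors and 75 deaths in the training set (as the training confusion matrix reports); the candidate split is invented purely for illustration, and the function names are hypothetical helpers, not scikit-learn APIs.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Entropy reduction from splitting `parent` into `children`."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Training-set class counts: [survivors, deaths].
root = [164, 75]

# Hypothetical split, invented for illustration only.
left, right = [150, 20], [14, 55]

print(entropy(root))
print(information_gain(root, [left, right]))
```

A pure node (all one class) has entropy 0, and a 50/50 node has entropy 1 bit, so larger gains mean the split separates the classes more cleanly.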

In [16]:
yhattrain = modeldt.predict(trainX)

In [17]:
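# Note: scikit-learn's signature is confusion_matrix(y_true, y_pred). With the
# predictions passed first, rows are predicted labels and columns are true labels.
confusion_matrix(yhattrain,trainy)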

array([[164,   0],
       [  0,  75]])

In [18]:
accuracy_score(yhattrain,trainy)

1.0
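A training accuracy of 1.0 is typical of a decision tree with no depth limit, which keeps splitting until every leaf is pure, so it says little about performance on new patients. The sketch below compares an unrestricted tree with a depth-limited one. It uses synthetic data from `make_classification` as a stand-in (not the heart failure records), and `max_depth=3` is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the heart failure records): 299 rows, 12 features.
X_demo, y_demo = make_classification(
    n_samples=299, n_features=12, n_informative=4, flip_y=0.1, random_state=53
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=53
)

results = {}
for depth in (None, 3):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(
        criterion="entropy", max_depth=depth, random_state=53
    )
    tree.fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for this depth setting.
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (train_acc, test_acc) in results.items():
    print(depth, round(train_acc, 3), round(test_acc, 3))
```

The gap between training and test accuracy, rather than the training accuracy itself, is the number to watch.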

The model will now predict the test set.

In [20]:
yhattest = modeldt.predict(testX)

In [21]:
# Predictions are passed first, so rows = predicted labels, columns = true labels
confusion_matrix(yhattest,testy)

array([[38,  7],
       [ 1, 14]])

In [22]:
accuracy_score(yhattest,testy)

0.8666666666666667

The decision tree classifies 86.67% of the test set correctly (52 of 60 patients). <br><br>
The Random Forest model will now be created. It will first predict the training set.

In [23]:
modelrf = RandomForestClassifier(criterion = 'entropy', random_state = 53)
modelrf.fit(trainX, trainy)

RandomForestClassifier(criterion='entropy', random_state=53)

In [25]:
yhattrain_rf = modelrf.predict(trainX)

In [26]:
confusion_matrix(yhattrain_rf, trainy)

array([[164,   0],
       [  0,  75]])

In [27]:
accuracy_score(yhattrain_rf, trainy)

1.0

Just like the decision tree model, the Random Forest model correctly predicted all of the training data. <br> <br>The Random Forest model will now predict the test set.

In [28]:
yhattest_rf = modelrf.predict(testX)

In [29]:
# Predictions are passed first, so rows = predicted labels, columns = true labels
confusion_matrix(yhattest_rf, testy)

array([[39, 10],
       [ 0, 11]])

In [30]:
accuracy_score(yhattest_rf, testy)

0.8333333333333334

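The Random Forest classifies 83.33% of the test set correctly (50 of 60 patients).<br><br><br>The decision tree model made 2 more correct predictions on the test set than the Random Forest model (52 versus 50). It identified 14 of the 21 deaths, compared with 11 for the Random Forest, at the cost of 1 false alarm.<br>On this single test split, the decision tree model is slightly more effective at predicting the death event for heart failure patients than the Random Forest model. With only 60 test patients, a difference of 2 predictions is small, so the ranking may not hold on other splits.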
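Accuracy alone hides how each model handles the death class. The sketch below derives accuracy, recall, and precision for deaths from the two test confusion matrices printed above. It assumes the layout that follows from the argument order used in this notebook (rows are predicted labels, columns are true labels); `summarize` is a hypothetical helper.

```python
def summarize(matrix):
    """Metrics for a 2x2 matrix laid out as rows = predicted, columns = actual."""
    (tn, fn), (fp, tp) = matrix
    total = tn + fn + fp + tp
    return {
        "accuracy": (tn + tp) / total,
        "recall": tp / (tp + fn),     # share of actual deaths that were caught
        "precision": tp / (tp + fp),  # share of predicted deaths that were real
    }


# Test-set confusion matrices copied from the outputs above.
decision_tree = summarize([[38, 7], [1, 14]])
random_forest = summarize([[39, 10], [0, 11]])

for name, metrics in (("decision tree", decision_tree), ("random forest", random_forest)):
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Recall is the more relevant figure when missing a death is costlier than raising a false alarm.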