# Stepwise Logistic Regression in Predicting Mortality by Heart Failure

Stepwise regression is a semi-automated process of building a model by successively adding or removing variables based solely on the t-statistics of their estimated coefficients ("Duke People," n.d.). In this notebook, we use this method to find the best predictors of mortality from heart failure (```DEATH_EVENT```) among the available variables. Note that R's ```step()``` function, which is used here, ranks candidate models by AIC rather than by t-statistics. The pROC package is used to generate the ROC curve and the AUC; the modelling itself is done with R's built-in stats package.

In [None]:
# Load packages: tidyverse (provides glimpse()) and pROC (ROC curves and AUC)

library(tidyverse)
library(pROC)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

## Import Dataset
The dataset is imported from Kaggle and has 299 observations and 13 variables, with no missing values (NA). The column ```DEATH_EVENT``` is the dependent variable we want to predict; it is binary (0 = alive, 1 = death). Because the dependent variable is binary, logistic regression is an appropriate model.

In [None]:
hfdata<-read.csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')
head(hfdata)

In [None]:
glimpse(hfdata)
table(is.na(hfdata))

## Stepwise Regression
We start by fitting two models: ```null_model```, which has no predictors (intercept only), and ```full_model```, which has all available predictors. These are the starting point and the upper bound of the search. Forward stepwise selection begins at the null model and, at each step, adds the single predictor that lowers the AIC the most, stopping when no addition improves it; it does not evaluate every possible combination of predictors. We use the ```glm()``` function with the binomial family (logistic regression), then pass both models to the ```step()``` function with ```direction = "forward"```.

In [None]:
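# Intercept-only model: the starting point of the forward search
null_model <- glm(DEATH_EVENT ~ 1, data = hfdata, family = 'binomial')
# Model with every available predictor: the upper bound of the search
full_model <- glm(DEATH_EVENT ~ ., data = hfdata, family = 'binomial')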
step_model <- step(null_model, 
                   scope = list(lower = null_model,
                                upper = full_model),
                   direction = "forward")

```step()``` selects predictors by minimising the **Akaike Information Criterion** (AIC), as traced in the output above. The selected predictors and the significance of their coefficients are shown in the model's summary below.

In [None]:
summary(step_model)
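
To make the selection rule concrete, here is a minimal sketch on simulated data (not the heart failure dataset). The data frame `sim`, its columns `x1` to `x3` and `y`, and the model names are made up for illustration. `add1()` lists the AIC of each single-variable addition to the null model, which is the comparison a forward search makes at its first step.

```r
set.seed(42)

# Simulated data (assumption for illustration): y depends on x1 and x2, x3 is noise
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.5 * sim$x1 + 0.8 * sim$x2))

null_sim <- glm(y ~ 1, data = sim, family = binomial)

# AIC of the null model ("<none>") and of each one-variable addition
first_step <- add1(null_sim, scope = ~ x1 + x2 + x3)
print(first_step)

# Full forward search between the null model and the three-predictor model
fwd_sim <- step(null_sim,
                scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
                direction = "forward",
                trace = 0)
print(formula(fwd_sim))
print(c(null = AIC(null_sim), forward = AIC(fwd_sim)))
```

Because a forward search only accepts a step when it lowers the AIC, the final model's AIC can never be higher than the null model's.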

## Probability & Confusion Matrix
After the model is built, we use the same dataset as the test set and predict the probability of ```DEATH_EVENT``` for each observation. Because the model is evaluated on the data it was fitted to, the resulting metrics are likely to be optimistic. We then classify each observation as death or alive based on its predicted probability. A probability of 0.5 (50%) is commonly used as the cutoff: any probability of 0.5 or higher is classified as death, and anything lower as alive.

In [None]:
hfdata$prob<-predict(step_model,hfdata,type='response')
hfdata$death_pred<-ifelse(hfdata$prob>=0.5,1,0)

After classification, we can generate the confusion matrix to see how accurately the model predicts ```DEATH_EVENT```. The confusion matrix below shows that 250 out of 299 observations are predicted correctly (83.6% accuracy). The model also gets more survivors right (184 correct predictions) than deaths (66 correct predictions), although these counts should be compared against the size of each class (the row totals of the table) before judging per-class performance.

In [None]:
# Rows are the actual outcomes, columns are the predicted outcomes
table(Actual = hfdata$DEATH_EVENT, Predicted = hfdata$death_pred)
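
As a rough illustration of how accuracy and the per-class rates are read off a confusion matrix, the sketch below uses ten made-up observations rather than the heart failure data. The vectors `actual` and `prob` and the object `cm` are hypothetical names chosen for this example.

```r
# Toy data (made up for illustration): true outcomes and predicted probabilities
actual <- c(0, 0, 0, 0, 1, 1, 1, 0, 1, 0)
prob   <- c(0.10, 0.40, 0.60, 0.20, 0.80, 0.30, 0.90, 0.50, 0.70, 0.05)

# Same cutoff as in the notebook: probability >= 0.5 is classified as death (1)
pred <- ifelse(prob >= 0.5, 1, 0)

# Fix the factor levels so the table is always 2 x 2
cm <- table(Actual = factor(actual, levels = 0:1),
            Predicted = factor(pred, levels = 0:1))
print(cm)

accuracy    <- sum(diag(cm)) / sum(cm)        # share of all predictions that are correct
sensitivity <- cm["1", "1"] / sum(cm["1", ])  # share of actual deaths predicted as death
specificity <- cm["0", "0"] / sum(cm["0", ])  # share of actual survivors predicted as alive

print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

Dividing by the row totals is what makes the two classes comparable when one is larger than the other.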

## ROC & AUC
In addition, to examine the trade-off between sensitivity and specificity, we can generate an ROC curve using the pROC package and compute the area under the curve (AUC) from it.

In [None]:
# Use the predicted probabilities, not the 0/1 labels, so the curve covers every cutoff
plot(roc(hfdata$DEATH_EVENT, hfdata$prob), col = 'red')

In [None]:
# AUC computed from the predicted probabilities rather than the 0/1 labels
auc(roc(hfdata$DEATH_EVENT, hfdata$prob))
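
One way to see what the AUC measures, and why the score passed to the ROC calculation matters, is to compute it by hand on toy data. The sketch below uses base R only; `rank_auc()` is a hypothetical helper written for this example (it is not part of pROC), and it assumes that higher scores indicate the positive class. It relies on the rank-sum identity: the AUC is the share of (death, survivor) pairs in which the death receives the higher score, with ties counted as half.

```r
# Toy data (made up for illustration): true outcomes and predicted probabilities
actual <- c(0, 0, 0, 0, 1, 1, 1, 0, 1, 0)
prob   <- c(0.10, 0.40, 0.60, 0.20, 0.80, 0.30, 0.90, 0.50, 0.70, 0.05)

# AUC via the rank-sum (Mann-Whitney) identity; tied scores get average ranks
rank_auc <- function(actual, score) {
  r <- rank(score)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_prob  <- rank_auc(actual, prob)                       # scores are probabilities
auc_label <- rank_auc(actual, ifelse(prob >= 0.5, 1, 0))  # scores are 0/1 labels

print(c(probabilities = auc_prob, hard_labels = auc_label))
```

With hard 0/1 labels the ROC curve has a single interior point, so the AUC reduces to the average of sensitivity and specificity at that one cutoff; with probabilities it summarises every cutoff.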

In conclusion, the accuracy and AUC values indicate that the stepwise logistic regression model is a decent model for predicting mortality from heart failure, bearing in mind that both were measured on the same data used to fit the model.