# Logistic Regression

In this notebook, we will use the package scikit-learn to perform logistic regression. As with linear regression, the first step is to load the data we will use into a format suitable for training a logistic regression model.

## Loading Data

We again read an external file into a `DataFrame`, loading the readmission data from a lesson earlier this week.

In [None]:
# load and view data
import pandas as pd
readmission_data = pd.read_csv('logreg_readmission_multivariate.csv')
readmission_data.head()

## Univariate Model

Just as with linear regression, the code below separates the data into features `X_train` and outcomes `y_train`, where the outcome in this case is readmission. The next part of the code imports the `LogisticRegression` class and fits the model. The `penalty` parameter is set so that no regularization penalty is applied; other values of this parameter would modify how the model is trained, which we will discuss in a lesson next week. Finally, we extract the coefficient values as before and print them for viewing.

In [None]:
# separate data into features and outcomes
X_train = readmission_data[['Length of Stay']]
y_train = readmission_data.Readmission

# create and train model
from sklearn.linear_model import LogisticRegression
# penalty=None disables regularization (scikit-learn 1.2+; older releases used penalty='none')
logistic_reg_model1 = LogisticRegression(penalty=None)
logistic_reg_model1.fit(X_train, y_train)
   
#extract coefficients
theta0 = logistic_reg_model1.intercept_[0]
theta1 = logistic_reg_model1.coef_[0][0]
print("fitted theta1=%s, fitted theta0=%s" % (theta1, theta0))

We use these coefficients to plot the logistic curve showing probability of readmission against length of stay. Note that this time, the variable `new_y` that gives the fitted curve is computed by the logistic expression, `1/(1+np.exp(-theta0-theta1*new_x))`. We have passed the `scatter` function the additional parameter `alpha`, which makes each blue data point mostly transparent, so the plot appears darker wherever many data points share the same `Length of Stay` value. We can see that the left side of the plot has many dark blue points associated with no readmission, while the right side of the plot, corresponding to a longer length of stay, is associated more with readmission.

In [None]:
#load plotting library and numerical library
import matplotlib.pyplot as plt
import numpy as np

#plot raw data
plt.scatter(x=readmission_data['Length of Stay'], y=readmission_data['Readmission'], alpha=0.03)
plt.title('Readmission vs Length of Stay')
plt.xlabel('Length of Stay')
plt.ylabel('Probability of Readmission')

#plot fitted logistic curve
new_x = np.arange(start=0,stop=22,step=0.1)
new_y = 1/(1+np.exp(-theta0-theta1*new_x))
plt.plot(new_x,new_y,color="red")
plt.show()
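
As a quick check on the logistic expression used for `new_y`, the sketch below compares the manual formula with the probabilities the fitted model reports itself. It is only an illustration under stated assumptions: the data are synthetic stand-ins (not the course file), `demo_model` is a hypothetical name, and the model is fitted with scikit-learn's default settings rather than the settings used above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the readmission data (hypothetical values, not the course file)
rng = np.random.default_rng(0)
length_of_stay = rng.uniform(0, 21, size=500)
true_prob = 1 / (1 + np.exp(-(-3.0 + 0.3 * length_of_stay)))
readmitted = rng.binomial(1, true_prob)

# Fit with scikit-learn's default settings (these include L2 regularization)
demo_model = LogisticRegression()
demo_model.fit(length_of_stay.reshape(-1, 1), readmitted)

theta0 = demo_model.intercept_[0]
theta1 = demo_model.coef_[0][0]

# Manual logistic curve versus the model's own probability for class 1
new_x = np.arange(start=0, stop=22, step=0.1)
manual_prob = 1 / (1 + np.exp(-theta0 - theta1 * new_x))
model_prob = demo_model.predict_proba(new_x.reshape(-1, 1))[:, 1]

# True when the two curves agree up to floating-point tolerance
print(np.allclose(manual_prob, model_prob))
```

If the two agree, the red curve drawn from `theta0` and `theta1` is the same curve the model uses when it predicts probabilities.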

Just as with linear regression, we can get predictions for a given set of features using the `.predict()` method. For logistic regression, this gives a value of 1 for datapoints with a predicted probability above 0.5, and a value of 0 otherwise. To get the raw probabilities, we can use the `.predict_proba()` method, which returns two columns: the predicted probabilities for 0 and 1, respectively. We use this method below to predict the readmission probabilities for three new lengths of stay.

In [None]:
#create new Data Frame with new Lengths of Stay
readmission_prediction_df = pd.DataFrame({"Length of Stay":[1.0,7.0,14.0]})
#make predictions of readmission probability (returns column for 0 and 1)
predictions = logistic_reg_model1.predict_proba(readmission_prediction_df[['Length of Stay']])
#add predictions to Data Frame 
readmission_prediction_df['Readmission Probability'] = predictions[:,1]
print(readmission_prediction_df)
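
The relationship between `.predict()` and `.predict_proba()` can be sketched as follows. This is a minimal illustration on synthetic data (hypothetical values, with `demo_model` and `X_demo` as made-up names), not a rerun of the readmission model: it checks that the two probability columns sum to 1 and that thresholding the second column at 0.5 reproduces the predicted labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical values, not the course file)
rng = np.random.default_rng(1)
length_of_stay = rng.uniform(0, 21, size=500)
readmitted = rng.binomial(1, 1 / (1 + np.exp(3.0 - 0.3 * length_of_stay)))
X_demo = length_of_stay.reshape(-1, 1)

demo_model = LogisticRegression().fit(X_demo, readmitted)

# One row per datapoint; columns follow the order of demo_model.classes_
proba = demo_model.predict_proba(X_demo)
labels = demo_model.predict(X_demo)

# Apply the 0.5 cutoff to the class-1 column by hand
thresholded = (proba[:, 1] > 0.5).astype(int)

print(demo_model.classes_)
print(np.allclose(proba.sum(axis=1), 1.0))
print(np.array_equal(labels, thresholded))
```

Working from the probability column directly also makes it possible to try a cutoff other than 0.5 if a different trade-off between the two kinds of error is wanted.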

If we consider predicted probabilities of over 0.5 to be a predicted readmission, we can look at true and false classifications using a confusion matrix. This matrix shows the number of datapoints for which correct and incorrect predictions were made, for either a true value of 0 or a true value of 1. This is a useful way of evaluating the fit of a logistic model.

In [None]:
#Visualize predictions vs. true values 
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(logistic_reg_model1, X_train, y_train)
plt.show()
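
The counts behind such a plot can also be read off numerically with `confusion_matrix` from `sklearn.metrics`. The sketch below uses synthetic data (hypothetical values and names such as `demo_model`), so the numbers it prints say nothing about the readmission data; it only shows how the four cells are laid out.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic stand-in data (hypothetical values, not the course file)
rng = np.random.default_rng(2)
length_of_stay = rng.uniform(0, 21, size=500)
y_demo = rng.binomial(1, 1 / (1 + np.exp(3.0 - 0.3 * length_of_stay)))
X_demo = length_of_stay.reshape(-1, 1)

demo_model = LogisticRegression().fit(X_demo, y_demo)
predicted = demo_model.predict(X_demo)

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_demo, predicted)

# For a 0/1 outcome the flattened order is:
# true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cm.ravel()

print(cm)
print("accuracy from matrix:", (tn + tp) / cm.sum())
```

The diagonal cells hold the correct predictions, so their share of the total equals the accuracy reported by the model's `.score()` method.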

## Multivariate Logistic Regression

Logistic regression with more than one feature proceeds very similarly. We simply specify the additional features we want in our training data – in this case they are age, gender, and number of medical conditions.

In [None]:
# separate data into features and outcomes
X_train = readmission_data[['Length of Stay', 'Age', 'Gender (Female=1)',
       'No. of Medical Conditions']]
y_train = readmission_data.Readmission
# create and train model
from sklearn.linear_model import LogisticRegression
# penalty=None disables regularization (scikit-learn 1.2+; older releases used penalty='none')
logistic_reg_model2 = LogisticRegression(penalty=None)
logistic_reg_model2.fit(X_train, y_train)
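
With several features, the fitted model holds one coefficient per column, in the same order as the columns of the training data. The sketch below pairs each coefficient with its feature name. It assumes a synthetic `DataFrame` that reuses the column names above with made-up values; `demo_df`, `demo_model`, and the `max_iter=1000` setting are illustrative choices rather than part of the lesson.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with the same column names (hypothetical values)
rng = np.random.default_rng(3)
n = 500
demo_df = pd.DataFrame({
    'Length of Stay': rng.uniform(0, 21, size=n),
    'Age': rng.integers(20, 90, size=n),
    'Gender (Female=1)': rng.integers(0, 2, size=n),
    'No. of Medical Conditions': rng.integers(0, 6, size=n),
})
logit = (-4.0 + 0.2 * demo_df['Length of Stay'] + 0.02 * demo_df['Age']
         + 0.3 * demo_df['No. of Medical Conditions']).to_numpy()
demo_df['Readmission'] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

feature_columns = ['Length of Stay', 'Age', 'Gender (Female=1)',
                   'No. of Medical Conditions']

# max_iter is raised here only to give the solver room on unscaled features
demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(demo_df[feature_columns], demo_df['Readmission'])

# coef_ has shape (1, n_features); label each entry with its column name
coefficients = pd.Series(demo_model.coef_[0], index=feature_columns)
print("intercept:", demo_model.intercept_[0])
print(coefficients)
```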

As before, we can visualize model performance by looking at a confusion matrix.

In [None]:
#Visualize predictions vs. true values
ConfusionMatrixDisplay.from_estimator(logistic_reg_model2, X_train, y_train)
plt.show()

Note that the counts in the cells where the true and predicted labels do not match are lower than in the univariate case, indicating an improved fit on the training data.
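
One way to make this comparison concrete is to add up the off-diagonal cells of each confusion matrix. The sketch below does this for a one-feature and a two-feature model on synthetic data; the data, the helper `misclassified`, and the model names are hypothetical, and the printed counts are not results for the readmission data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic stand-in data (hypothetical values, not the course file)
rng = np.random.default_rng(4)
n = 500
length_of_stay = rng.uniform(0, 21, size=n)
conditions = rng.integers(0, 6, size=n)
logit = -4.0 + 0.2 * length_of_stay + 0.6 * conditions
y_demo = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_uni = length_of_stay.reshape(-1, 1)
X_multi = np.column_stack([length_of_stay, conditions])

uni_model = LogisticRegression(max_iter=1000).fit(X_uni, y_demo)
multi_model = LogisticRegression(max_iter=1000).fit(X_multi, y_demo)

def misclassified(model, X, y):
    """Sum of the off-diagonal confusion-matrix cells (hypothetical helper)."""
    cm = confusion_matrix(y, model.predict(X))
    return int(cm.sum() - np.trace(cm))

uni_errors = misclassified(uni_model, X_uni, y_demo)
multi_errors = misclassified(multi_model, X_multi, y_demo)
print("univariate misclassified:", uni_errors)
print("multivariate misclassified:", multi_errors)
```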