# **Logistic Regression**

## **Overview**

So far, you've learned about linear regression, where you built a model to predict a continuous outcome from an input variable. In that case, the outcome variable, blood glucose, is continuous and can be modeled well by a line.

But what if your outcome variable is instead categorical, like a disease diagnosis? 

### The solution

In data science (particularly in scikit-learn), we assign a numerical value (like 1 and 0) to each category of a categorical variable (like 'Yes Diabetes' and 'No Diabetes').

If we fit a linear regression model with this disease outcome as the dependent variable and plotted it, we'd notice something funny.

We only have *two* possible outcome values, 1 and 0, but the fitted line extends without bound in both directions, so it predicts values below 0 and above 1.
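To make the problem concrete, here is a minimal sketch that fits an ordinary linear regression to a binary label. The data are made up for illustration (a single glucose-like feature and ten hypothetical patients), not taken from the dataset used later.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one glucose-like feature and a binary label
# (0 = 'No Diabetes', 1 = 'Yes Diabetes').
x = np.array([70, 85, 90, 100, 110, 125, 140, 155, 170, 190], dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

line = LinearRegression().fit(x, y)

# Predict at one very low and one very high input value.
low, high = line.predict(np.array([[40.0], [250.0]]))

# Both predictions fall outside the valid 0-to-1 range for a class label.
print(low, high)
```

The low input yields a negative prediction and the high input yields a prediction above 1, neither of which makes sense as a diagnosis.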

So is there a better way to model our relationship?

The logistic function produces a sigmoidal (S-shaped) curve that *squeezes* any input into the range between 0 and 1.

$f(x) = \frac{1}{1+e^{-x}}$
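As a quick check of the formula, the sketch below implements it with NumPy and evaluates it at a few arbitrary inputs. The function name `sigmoid` is just a label chosen here.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Arbitrary inputs ranging from very negative to very positive.
inputs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
outputs = sigmoid(inputs)

# Outputs increase with the input, stay strictly between 0 and 1,
# and equal exactly 0.5 when the input is 0.
print(outputs)
```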

Now, by feeding the output of our linear model into the logistic function, we've generated a function whose output lies between 0 and 1 and that we can interpret as a *classifier*. (Note: despite its name, logistic regression is actually a classifier.)

By default, any predicted value on our logistic curve above y=0.5 would be *classified* as a member of the positive class (i.e. 'Yes Diabetes') and any predicted value below y=0.5 would be *classified* as a member of the negative class (i.e. 'No Diabetes').
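The decision rule can be sketched in a few lines. The intercept and slope below are hypothetical numbers chosen for illustration, not coefficients fitted to any real data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear model: z = intercept + slope * glucose
intercept, slope = -12.0, 0.1

glucose = np.array([90.0, 110.0, 130.0, 160.0])

# Pass the linear output through the logistic function.
probability = sigmoid(intercept + slope * glucose)

# Values above 0.5 go to the positive class (1), the rest to the negative class (0).
predicted_class = (probability > 0.5).astype(int)
print(predicted_class)
```

With these made-up coefficients, the two lower glucose values land in the negative class and the two higher values land in the positive class.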







### Code Setup

In [0]:
import pandas as pd

#importing the LogisticRegression model from scikit-learn
from sklearn.linear_model import LogisticRegression

#importing a built-in function for automatically creating train and test sets from your dataset
from sklearn.model_selection import train_test_split

#importing an accuracy metric to evaluate the trained model
from sklearn.metrics import accuracy_score

### Dataset Import

In this tutorial, we're going to use an existing dataset downloaded from Kaggle (a popular data science/machine learning competition website). The [dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database) itself was provided by the National Institute of Diabetes and Digestive and Kidney Diseases.

The dataset includes 768 individuals and 8 associated features (e.g. age, BMI, 2-hour serum insulin, and others). The last column contains the diagnosis that the investigators provided for the individual's diabetic status and is '1' if they have diabetes or '0' if not.

In [0]:
url = 'https://raw.githubusercontent.com/amurugan19/medmldatasets/master/diabetes.csv'

#importing the csv file into a dataframe
data = pd.read_csv(url)

#### Train/Test Split

The next step is to generate a train and test set for our model from our original dataset. This can be done manually, but luckily, scikit-learn includes a handy function, called *train_test_split*, that will automatically split your original dataset.

In [0]:
#creating a dataframe variable called X to store our input features (e.g. age, BMI, etc.)
#pd.read_csv already treats the first line of the file as the header, so we keep every row
X = data.iloc[:, 0:-1]

#separate variable called y to store our labels (i.e. 1 and 0 for Y or N diabetes diagnosis)
y = data.iloc[:, -1]

#using a built-in function to generate the new train and test arrays
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
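Note that `train_test_split` shuffles the rows randomly, so the split (and any accuracy computed from it) changes from run to run. The sketch below, on made-up stand-in data rather than the diabetes dataset, shows two optional parameters of `train_test_split`: `random_state` to make the split repeatable and `stratify` to keep the class proportions the same in both sets. The names `X_demo` and `y_demo` are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 8 features, 35% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = np.array([1] * 35 + [0] * 65)

# random_state fixes the shuffle; stratify preserves the class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)    # 80 training rows, 20 test rows
print(y_tr.mean(), y_te.mean())  # fraction of positive labels in each set
```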

### Model Building

At this point, we've successfully created our train and test sets. Now it's time to build our logistic regression model! 

In [0]:
model = LogisticRegression(max_iter=500)

#training our model using our training inputs (X) and training labels (y)
model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Since our model has approximately *learned* the relationship between our inputs (like age, BMI, 2-hour serum insulin level, etc.) and an individual's diabetes status in our training set, we can now use the model to *predict* the diabetes status for the individuals in our test set!


In [0]:
#creating a new array to store our model's predictions for the test set
y_pred = model.predict(X_test)

#printing our predictions 
print(y_pred)

[0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 0 1
 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 0 1 1 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0
 0 0 1 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 1
 1 0 0 0 1 1]


As you can see in the previous cell's output, our model has generated predictions for each patient in our test set (with 1 corresponding to 'Diabetes' and 0 corresponding to 'No Diabetes'). 

Unfortunately, plotting and visualizing this model is non-trivial, since our dataset contains 8 input variables, meaning that our data and model exist in an *8-dimensional space* (and most of us can't really visualize anything beyond the x-y-z axes of a 3-dimensional space)!

However, we can still evaluate the accuracy of our model's predictions against the *ground truth* values (i.e. each individual's actual diabetes diagnosis) provided beforehand by the study investigators. 

In [0]:
#this is one of many pre-written evaluation functions in the scikit-learn metrics module
score = accuracy_score(y_test, y_pred)

print("Logistic Regression Accuracy: ", score)

Logistic Regression Accuracy:  0.7402597402597403
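Accuracy is simply the fraction of predictions that match the ground truth. The sketch below checks this by hand on ten made-up labels (not the model output above) and also prints a confusion matrix, which breaks the errors down by type. The arrays `y_true_demo` and `y_pred_demo` are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical ground truth and predictions for 10 patients.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Fraction of positions where prediction and ground truth agree.
manual_accuracy = np.mean(y_true_demo == y_pred_demo)
sklearn_accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(manual_accuracy, sklearn_accuracy)

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
```

In this toy example 7 of 10 predictions are correct, and the confusion matrix shows that most of the mistakes are missed positive cases, a detail that a single accuracy number hides.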


And that's all the code you need to generate a complete Logistic Regression binary classifier! For next steps, there are [various model parameters](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) that you can customize to improve performance for your classification task. Using the same dataset, you can also try other classifier models like Random Forests or K-nearest-neighbors that we'll discuss elsewhere.
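As one possible starting point, the sketch below compares a few values of the regularization parameter `C` using 5-fold cross-validation. It runs on synthetic data from `make_classification` rather than the diabetes dataset, and the candidate values of `C` and the feature scaling step are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 500 rows, 8 features, binary label.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

scores = {}
for C in [0.01, 1.0, 100.0]:
    # Smaller C means stronger regularization.
    clf = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=500))
    # Mean accuracy across 5 cross-validation folds.
    scores[C] = cross_val_score(clf, X_syn, y_syn, cv=5).mean()

print(scores)
```

Cross-validation is used here so that the test set stays untouched while comparing settings.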


## Multi-Class Classification

In the previous example, we used a dataset where the outcome measure was *binary*, that is, there were only two possibilities: diabetes or no diabetes. But what if you have more than two outcome classes? Imagine a neurology classification task where the outcomes include 3 types of bleeds: epidural hematoma, subdural hematoma, and subarachnoid hemorrhage. This type of task then becomes a *multi-class* (here, a 3-class) classification.

Multi-class classification can be done using a "one vs. rest" approach for Logistic Regression, but it becomes a bit more complicated.

Each of the three classes above is, in turn, classified against the union of the other classes (e.g. epidural vs. {subdural examples $\cup$ subarachnoid examples}); this effectively transforms our multi-class classification task into several binary classification tasks. A separate logistic regression model is then trained on each of these "binary" tasks, and a new example is assigned to the class whose model gives it the highest score.
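A minimal sketch of this idea uses scikit-learn's `OneVsRestClassifier` wrapper around `LogisticRegression`. The data are synthetic, with classes 0, 1, and 2 standing in for the three bleed types; no real imaging or clinical data are involved.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class data standing in for the three bleed types.
X_mc, y_mc = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3,
    n_clusters_per_class=1, random_state=0,
)

# Trains one binary logistic regression model per class (class vs. the rest).
ovr = OneVsRestClassifier(LogisticRegression(max_iter=500))
ovr.fit(X_mc, y_mc)

print(len(ovr.estimators_))  # number of underlying binary models

# Per-class scores for the first 5 examples, and the resulting predictions.
probs = ovr.predict_proba(X_mc[:5])
preds = ovr.predict(X_mc[:5])
print(probs)
print(preds)
```

Each prediction corresponds to the class whose binary model assigns the highest score to that example.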

Alternatively, you may choose to use other classification models like [*k nearest neighbors*](https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification), which are inherently suited to multi-class classification.

The implementation details are beyond the scope of this tutorial, but the scikit-learn documentation as well as Medium tutorials are a good next step.
