# Binary Logistic Regression in Python #

Killian McKee

## Overview ##

1. [What is Logistic Regression?](#section1)
2. [Pros and Cons of Logistic Regression](#section2)
3. [When to use Logistic Regression](#section3)
4. [Key Parameters](#section4)
5. [Walkthrough: Binary Logistic Regression](#section5)
6. [Review](#section6) 
7. [Related Topics](#section7) 
8. [Sources](#section8) 

<a id='section1'></a>

### What is Logistic Regression? ###

Logistic regression is a classification algorithm that predicts the probability that an observation belongs to a category (usually one of two). In binomial (binary) logistic regression, the dependent variable is binary: each observation is coded as 1 (success, yes) or 0 (failure, no). With more than two classes, the model is extended to multinomial logistic regression, which estimates a probability for each class. A binary logistic regression model predicts the probability that Y=1 given X.


<img src='log_vs_linear.jpeg'>

<img src='log_reg.gif'>

<a id='section2'></a>

### Pros and Cons of Logistic Regression ###

#### Pros ####

1. Easy to interpret, since the output is a probability
2. Can also be used for ranking 
3. Fast
4. Low variance 

#### Cons ####

1. Prone to high bias 

<a id='section3'></a>

### When to use Binomial Logistic Regression ###

Binomial logistic regression is good at predicting whether something falls into one of two categories (win/lose, pass/fail, sick/healthy, etc.) when the following conditions are met:

1. The dependent variable is binary i.e. there are two possible outcomes
2. The independent variables should be independent of one another i.e. there is no (or very little) multicollinearity
3. Sample sizes are large
4. The independent variables are linearly related to the log odds
5. Only meaningful variables are included (using a method like PCA or Lasso can help with finding meaningful variables). 
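
Condition 2 can be checked before fitting. The sketch below is one possible approach, not a prescribed method: it builds a small made-up feature matrix (the variables `x1`, `x2`, `x3` and the 0.9 threshold are hypothetical choices, not taken from the iris data) and flags pairs of features whose correlation is close to ±1.

```python
import numpy as np

# Hypothetical features: x2 is almost a multiple of x1, x3 is unrelated noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.05, size=200)
x3 = rng.normal(size=200)

features = np.column_stack([x1, x2, x3])

# Pairwise Pearson correlations between columns (rowvar=False: columns are variables).
corr = np.corrcoef(features, rowvar=False)

# Flag feature pairs whose absolute correlation exceeds an arbitrary threshold.
threshold = 0.9
flagged = [
    (i, j)
    for i in range(corr.shape[0])
    for j in range(i + 1, corr.shape[1])
    if abs(corr[i, j]) > threshold
]
print(np.round(corr, 2))
print("Highly correlated pairs (column indices):", flagged)
```

A flagged pair suggests dropping or combining one of the two features before fitting.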

<a id='section4'></a>

### Key Parameters ### 

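The equation for logistic regression with a single feature $x$ is $P = \frac{1}{1 + e^{-(b_0 + b_1 x)}}$, where $P$ is the predicted probability that Y=1. The key parameters:

1. $b_0$ (the intercept) controls how far to the left or right the curve sits on a graph
2. $b_1$ (the coefficient) controls the steepness of the curve (slope)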
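
To see the effect of each parameter, here is a small sketch using only the standard library. It assumes the standard form P = 1/(1 + e^-(b0 + b1 x)); the function name `logistic` and the parameter values are illustrative choices.

```python
import math

def logistic(x, b0, b1):
    """Probability that Y=1 for input x under a one-feature logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# The curve crosses P = 0.5 where b0 + b1*x = 0, i.e. at x = -b0 / b1.
b0, b1 = -3.0, 1.5
midpoint = -b0 / b1
print(midpoint, logistic(midpoint, b0, b1))

# A larger b1 makes the curve steeper: the same step past the midpoint
# gives a probability further from 0.5.
for slope in (0.5, 1.5, 5.0):
    intercept = -slope * midpoint  # keep the midpoint fixed at x = 2
    print(slope, round(logistic(midpoint + 0.5, intercept, slope), 3))
```

Changing b0 while holding b1 fixed moves the midpoint, which is the left/right shift described above.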

Typically, logistic regression is fitted with a package such as scikit-learn; its parameters are documented [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

### Other Notes ### 

Logistic regression works best with large sample sizes. [Cross validation](https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f) estimates how well a model generalizes to new data, and is especially valuable when sample sizes are small.
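
As a rough illustration of cross validation, the sketch below scores a logistic regression with 5-fold cross validation. It assumes scikit-learn's bundled copy of the iris data (`load_iris`) rather than the `vega_datasets` frame used in the walkthrough, and `max_iter=1000` is an arbitrary setting chosen to avoid convergence warnings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# scikit-learn's bundled iris data; keep only the first two classes
# (setosa = 0, versicolor = 1) to make the problem binary.
X, y = load_iris(return_X_y=True)
mask = y < 2
X, y = X[mask], y[mask]

# 5-fold cross validation: fit on 4 folds, score on the held-out fold, repeat.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

Each row is held out exactly once, so the mean score uses all of the data for evaluation without ever scoring a row the model was trained on.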

<a id='section5'></a>

### Binary Logistic Regression Walkthrough ###

Let's walk through a simple example of how we might perform a binary logistic regression to predict what species of iris we are looking at given features like petal length and width. 

In [1]:
import numpy as np 
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression,LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import seaborn as sns 
from vega_datasets import data 




In [2]:
#load the datasets 
iris_df=data.iris()

#examine the dataset
iris_df.head()


   petalLength  petalWidth  sepalLength  sepalWidth species
0          1.4         0.2          5.1         3.5  setosa
1          1.4         0.2          4.9         3.0  setosa
2          1.3         0.2          4.7         3.2  setosa
3          1.5         0.2          4.6         3.1  setosa
4          1.4         0.2          5.0         3.6  setosa


In [3]:
#since we are interested in predicting the species, let's see how many unique values this column contains
print(iris_df['species'].nunique())

#we are only going to predict two species, so let's drop the virginica species from our dataframe
iris_df=iris_df[iris_df['species']!='virginica']


3


In [4]:
#now we select our X data so that we can perform a train test split
#this will be everything except the species column, which is our dependent variable

X=iris_df.iloc[:,:-1]
print(X.head())
print(X.shape)

#our y data will be our dependent variable (species)

y=iris_df.iloc[:,-1]
y.head()

   petalLength  petalWidth  sepalLength  sepalWidth
0          1.4         0.2          5.1         3.5
1          1.4         0.2          4.9         3.0
2          1.3         0.2          4.7         3.2
3          1.5         0.2          4.6         3.1
4          1.4         0.2          5.0         3.6
(100, 4)


0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
Name: species, dtype: object

In [5]:
#now we split up our training and testing data 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


In [6]:
#check the shape of the training data 
#we have 75 rows with 4 columns (petal length, petal width, sepal length, sepal width)
X_train.shape

(75, 4)
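
The 75 rows follow from `train_test_split` holding out 25% of the data for testing by default. As a hedged sketch on made-up data (the arrays below are hypothetical), the split size can be set explicitly with `test_size`, and `stratify` keeps the class proportions similar in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 4 features, two balanced classes.
X = np.arange(400).reshape(100, 4)
y = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)
# With stratify=y, each part keeps the 50/50 class balance as closely as possible.
print(np.bincount(y_train), np.bincount(y_test))
```

Stratifying matters most when one class is rare, since a purely random split could leave very few examples of it in the test set.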

In [7]:
#Time to fit our regression model! 
#use this method when cross validating is less of a concern
classifier = LogisticRegression(random_state=18)
classifier.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=18, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
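
Once fitted, the classifier can return probabilities rather than hard labels, which is what makes the output easy to interpret. This is a minimal sketch, assuming scikit-learn's bundled copy of the iris data instead of the `vega_datasets` frame used above; the name `clf` and the `max_iter` value are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Binary problem: keep only classes 0 (setosa) and 1 (versicolor).
X, y = load_iris(return_X_y=True)
X, y = X[y < 2], y[y < 2]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Each row holds [P(class 0), P(class 1)]; the two values sum to 1.
proba = clf.predict_proba(X_test[:5])
print(proba.round(3))

# predict() returns the class with the larger probability.
print(clf.predict(X_test[:5]))
```

The probabilities can also be sorted to rank observations from most to least likely to belong to a class.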

In [8]:
#here we fit a logistic regression model with built-in cross validation, which would be the preferred method when
#computational constraints are not high or the model is small
#we fit on the training data only, so the test set stays unseen until we score the model
#we see the model is still getting 100% accuracy, likely because of the clean sample data 

classifier = LogisticRegressionCV(cv=5, random_state=0).fit(X_train, y_train)
classifier.score(X_test,y_test)

1.0

In [9]:
#let's compute a confusion matrix to see how we did
#we can see the model was 100% accurate on our test data, which should typically raise suspicion but in this case is ok
#because the data is very clean. 

y_pred = classifier.predict(X_test)
#store the result under a new name so we don't overwrite the imported confusion_matrix function
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[13  0]
 [ 0 12]]


If you're looking for information on confusion matrices, look [here](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/).
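
The usual summary metrics can be derived directly from a confusion matrix. Here is a small sketch in plain Python, using the matrix printed above and treating versicolor (the second row and column) as the positive class, which is an arbitrary choice for illustration:

```python
# Rows are true classes, columns are predicted classes (scikit-learn's layout).
cm = [[13, 0],
      [0, 12]]

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Because the off-diagonal entries are both zero, every metric equals 1.0 here; any misclassification would show up as a non-zero `fp` or `fn`.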

In [10]:
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(classifier.score(X_test, y_test)))

Accuracy of logistic regression classifier on test set: 1.00


In [11]:
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        13
 versicolor       1.00      1.00      1.00        12

avg / total       1.00      1.00      1.00        25



<a id='section6'></a>

### Review ###

We learned that logistic regression is a useful tool for solving classification problems when the independent variables are meaningful and linearly related to the log odds (see the full list of criteria for binomial regression above). Next, we walked through a basic binomial logistic regression and used cross validation when fitting the model. Lastly, we used a confusion matrix and a classification report to evaluate our logit model on withheld test data.

<a id='section7'></a>

### Related Topics ### 

1. [Multinomial Logistic Regression](https://en.wikipedia.org/wiki/Multinomial_logistic_regression)
2. [The Math Behind Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression) 


<a id='section8'></a>

### Sources ### 

1. https://www.saedsayad.com/logistic_regression.htm
2. https://datascienceplus.com/building-a-logistic-regression-in-python-step-by-step/
3. http://aritter.github.io/courses/5525_slides/logistic_regression.pdf