### Logistic Regression in Statsmodels

In [1]:
import pandas as pd
import statsmodels.formula.api as sm

In [2]:
# Load in the dataset
df = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
df.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color,is_red,high_quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,1.0,0.0
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red,1.0,0.0
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red,1.0,0.0
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red,1.0,0.0
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,1.0,0.0


#### Below, we fit a logistic regression model using the statsmodels formula interface (formulas are parsed by patsy)

The formula says that `high_quality` (coded as 1 or 0) depends on (`~`) the following attributes:
 `residual_sugar`, `pH`, and `alcohol`.
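Written out as an equation, the formula corresponds to the standard logistic regression model below. The symbols $p$, $\beta_i$ and $x$ are notation introduced here for illustration; they do not appear in the notebook's code.

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1\,x_{\text{sugar}} + \beta_2\,x_{\text{pH}} + \beta_3\,x_{\text{alcohol}},
\qquad p = P(\text{high quality} = 1)
$$

The left-hand side is the log-odds of a wine being high quality, and the fit estimates one $\beta$ per term plus the intercept $\beta_0$.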


In [3]:
model = sm.logit(
    "high_quality ~ residual_sugar + pH + alcohol",
    data = df
).fit()

Optimization terminated successfully.
         Current function value: 0.418431
         Iterations 6


In [4]:
model.summary()

0,1,2,3
Dep. Variable:,high_quality,No. Observations:,6497.0
Model:,Logit,Df Residuals:,6493.0
Method:,MLE,Df Model:,3.0
Date:,"Mon, 27 Mar 2017",Pseudo R-squ.:,0.1557
Time:,15:27:39,Log-Likelihood:,-2718.5
converged:,True,LL-Null:,-3219.8
,,LLR p-value:,5.039e-217

0,1,2,3,4,5
,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Intercept,-11.7871,0.803,-14.674,0.000,-13.361 -10.213
residual_sugar,0.0471,0.009,5.441,0.000,0.030 0.064
pH,0.1419,0.217,0.654,0.513,-0.283 0.567
alcohol,0.8946,0.031,28.600,0.000,0.833 0.956


In [5]:
import math

In [6]:
math.exp(1)

2.718281828459045

In [7]:
math.exp(0.8946)  # <-- this is e ^ 0.8946

# For each 1-unit increase in alcohol, holding residual_sugar and pH fixed,
# the ODDS of a wine being classified as "high-quality" are multiplied
# by about 2.446. This is the ODDS RATIO associating quality of wine with
# alcohol content. It is a ratio of odds, not of probabilities.

# e ^ 0.8946 is the ODDS RATIO
# 0.8946 is the LOG-ODDS RATIO

2.4463570509072867

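In the table above:
- `coef` holds the coefficient estimated for each feature, on the log-odds scale.
  For example, `0.8946` for alcohol is the change in the log-odds of `high_quality` per 1-unit increase in alcohol, holding the other features fixed.
  Because this coefficient is positive, the probability of `high_quality` increases as alcohol content increases.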
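To make the log-odds interpretation concrete, here is a minimal sketch that plugs the coefficients from the summary table into the inverse-logit function by hand. The helper names (`predicted_probability`, `odds`) are made up for this illustration, and the input values are taken from the first row of `df.head()`.

```python
import math

# Coefficients copied from the summary table above (log-odds scale).
coef = {"Intercept": -11.7871, "residual_sugar": 0.0471, "pH": 0.1419, "alcohol": 0.8946}

def predicted_probability(residual_sugar, pH, alcohol):
    """Inverse-logit of the linear predictor built from the fitted coefficients."""
    log_odds = (coef["Intercept"]
                + coef["residual_sugar"] * residual_sugar
                + coef["pH"] * pH
                + coef["alcohol"] * alcohol)
    return 1.0 / (1.0 + math.exp(-log_odds))

def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)

# First row of df.head(): residual_sugar=1.9, pH=3.51, alcohol=9.4
p_base = predicted_probability(1.9, 3.51, 9.4)
p_plus = predicted_probability(1.9, 3.51, 10.4)  # same wine, alcohol + 1 unit

odds_ratio = odds(p_plus) / odds(p_base)  # equals exp(0.8946) up to rounding
prob_ratio = p_plus / p_base              # smaller than the odds ratio

print("P(high_quality):", round(p_base, 4), "->", round(p_plus, 4))
print("odds ratio:", round(odds_ratio, 4), " probability ratio:", round(prob_ratio, 4))
```

The odds ratio is the same wherever you start, while the ratio of probabilities depends on the baseline wine. That is why the coefficient is best read as a statement about odds.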

#### We can add interaction effects as well
- The `:` operator in patsy formula syntax adds an interaction term: `a:b` lets the effect of `a` depend on the level of `b` (for two numeric variables, the term is their product)
- The `*` operator is shorthand for both original terms plus their interaction: `a * b` expands to `a + b + a:b`

In [8]:
model = sm.logit(
    "high_quality ~ residual_sugar * alcohol",
    data = df
).fit()

Optimization terminated successfully.
         Current function value: 0.416779
         Iterations 6


In [9]:
model.summary()

0,1,2,3
Dep. Variable:,high_quality,No. Observations:,6497.0
Model:,Logit,Df Residuals:,6493.0
Method:,MLE,Df Model:,3.0
Date:,"Mon, 27 Mar 2017",Pseudo R-squ.:,0.159
Time:,15:42:00,Log-Likelihood:,-2707.8
converged:,True,LL-Null:,-3219.8
,,LLR p-value:,1.109e-221

0,1,2,3,4,5
,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Intercept,-13.0762,0.532,-24.600,0.000,-14.118 -12.034
residual_sugar,0.3601,0.066,5.431,0.000,0.230 0.490
alcohol,1.0565,0.047,22.343,0.000,0.964 1.149
residual_sugar:alcohol,-0.0302,0.006,-4.763,0.000,-0.043 -0.018
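With an interaction term in the model, the alcohol coefficient alone no longer describes the effect of alcohol: the change in log-odds per unit of alcohol becomes `alcohol + residual_sugar:alcohol * residual_sugar`. The sketch below evaluates that slope using the coefficients from the table above. The sugar levels are arbitrary values picked for illustration, and `alcohol_log_odds_slope` is a hypothetical helper name.

```python
import math

# Coefficients copied from the interaction model's summary table above.
b_alcohol = 1.0565
b_interaction = -0.0302  # residual_sugar:alcohol

def alcohol_log_odds_slope(residual_sugar):
    """Change in log-odds per 1-unit increase in alcohol at a given sugar level."""
    return b_alcohol + b_interaction * residual_sugar

for sugar in (1.9, 10.0, 20.0):  # illustrative sugar levels, chosen arbitrarily
    slope = alcohol_log_odds_slope(sugar)
    print(f"residual_sugar={sugar}: slope={slope:.4f}, odds ratio={math.exp(slope):.3f}")
```

Because the interaction coefficient is negative, the estimated benefit of extra alcohol shrinks as residual sugar rises.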


In [10]:
model = sm.logit(
    "high_quality ~ residual_sugar + alcohol + residual_sugar:alcohol",
    data = df
).fit()

Optimization terminated successfully.
         Current function value: 0.416779
         Iterations 6


In [12]:
model.summary()

0,1,2,3
Dep. Variable:,high_quality,No. Observations:,6497.0
Model:,Logit,Df Residuals:,6493.0
Method:,MLE,Df Model:,3.0
Date:,"Mon, 27 Mar 2017",Pseudo R-squ.:,0.159
Time:,15:43:52,Log-Likelihood:,-2707.8
converged:,True,LL-Null:,-3219.8
,,LLR p-value:,1.109e-221

0,1,2,3,4,5
,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Intercept,-13.0762,0.532,-24.600,0.000,-14.118 -12.034
residual_sugar,0.3601,0.066,5.431,0.000,0.230 0.490
alcohol,1.0565,0.047,22.343,0.000,0.964 1.149
residual_sugar:alcohol,-0.0302,0.006,-4.763,0.000,-0.043 -0.018


In [13]:
model = sm.logit(
    "high_quality ~ color",
    data = df
).fit()

Optimization terminated successfully.
         Current function value: 0.491511
         Iterations 6


In [14]:
model.summary()

0,1,2,3
Dep. Variable:,high_quality,No. Observations:,6497.0
Model:,Logit,Df Residuals:,6495.0
Method:,MLE,Df Model:,1.0
Date:,"Mon, 27 Mar 2017",Pseudo R-squ.:,0.008222
Time:,15:44:03,Log-Likelihood:,-3193.3
converged:,True,LL-Null:,-3219.8
,,LLR p-value:,3.429e-13

0,1,2,3,4,5
,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Intercept,-1.8514,0.073,-25.355,0.000,-1.995 -1.708
color[T.white],0.5647,0.081,6.985,0.000,0.406 0.723
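For a categorical predictor, patsy picks a baseline level (here red, since only `color[T.white]` appears in the table) and reports the other level as a shift in log-odds relative to it. As a rough sketch, the two fitted probabilities can be recovered by hand. `inv_logit` is a helper defined here for illustration.

```python
import math

def inv_logit(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

intercept = -1.8514   # log-odds of high_quality for the baseline level (red)
coef_white = 0.5647   # shift in log-odds for color[T.white]

p_red = inv_logit(intercept)
p_white = inv_logit(intercept + coef_white)

print("P(high_quality | red)   =", round(p_red, 3))
print("P(high_quality | white) =", round(p_white, 3))
print("odds ratio, white vs red =", round(math.exp(coef_white), 3))
```

With a single binary predictor, these fitted probabilities should match the observed share of high-quality wines within each color.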


In [15]:
model = sm.logit(
    "high_quality ~ color : quality",
    data = df
).fit()

PerfectSeparationError: Perfect separation detected, results not available
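This error means some combination of predictors splits the 0s and 1s without any overlap, so the maximum-likelihood estimate does not exist. A likely cause here is that `high_quality` is derived directly from `quality`. That is an assumption: the notebook does not show how the column was built. The sketch below uses hypothetical toy data (not the wine dataset) to show the symptom: when the classes are perfectly separated, the log-likelihood keeps improving as the slope grows, so the optimizer never reaches a finite optimum.

```python
import math

# Toy data (hypothetical): y is 1 exactly when x > 6, so x separates the classes.
x = [3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 0, 1, 1]

def log_likelihood(slope, boundary=6.5):
    """Bernoulli log-likelihood of a logit model with log-odds = slope * (x - boundary)."""
    total = 0.0
    for xi, yi in zip(x, y):
        z = slope * (xi - boundary)
        # log(sigmoid(z)) when y == 1, log(1 - sigmoid(z)) when y == 0
        total += -math.log1p(math.exp(-z)) if yi == 1 else -math.log1p(math.exp(z))
    return total

slopes = (1, 5, 25)
lls = [log_likelihood(s) for s in slopes]
for s, ll in zip(slopes, lls):
    print(f"slope={s:>2}: log-likelihood={ll:.6f}")
```

The log-likelihood climbs toward 0 as the slope increases and never peaks. Dropping the predictor that encodes the outcome is the usual remedy.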