This lab on Logistic Regression is a Python adaptation of pp. 154-161 of "An Introduction to Statistical Learning with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Adapted by R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016).

In [2]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

# 4.6.2 Logistic Regression

Let's return to the ${\tt Smarket}$ data from ${\tt ISLR}$. 

In [3]:
df = pd.read_csv('Smarket.csv', usecols=range(1,10), index_col=0, parse_dates=True)
df.describe()

           Lag1      Lag2      Lag3      Lag4      Lag5    Volume     Today
count    1250.0    1250.0    1250.0    1250.0    1250.0    1250.0    1250.0
mean   0.003834  0.003919  0.001716  0.001636   0.00561  1.478305  0.003138
std    1.136299   1.13628  1.138703  1.138774   1.14755  0.360357  1.136334
min      -4.922    -4.922    -4.922    -4.922    -4.922   0.35607    -4.922
25%     -0.6395   -0.6395     -0.64     -0.64     -0.64    1.2574   -0.6395
50%       0.039     0.039    0.0385    0.0385    0.0385   1.42295    0.0385
75%     0.59675   0.59675   0.59675   0.59675     0.597  1.641675   0.59675
max       5.733     5.733     5.733     5.733     5.733   3.15247     5.733


In this lab, we will fit a logistic regression model to predict ${\tt Direction}$ using ${\tt Lag1}$ through ${\tt Lag5}$ and ${\tt Volume}$. We'll build our model using the ${\tt glm()}$ function, which is part of the ${\tt formula}$ submodule of ${\tt statsmodels}$.

In [4]:
import statsmodels.formula.api as smf

We can use an ${\tt R}$-like formula string to separate the predictors from the response.

In [5]:
formula = 'Direction ~ Lag1+Lag2+Lag3+Lag4+Lag5+Volume'

The ${\tt glm()}$ function fits **generalized linear models**, a class of models that includes logistic regression. The syntax of ${\tt glm()}$ is similar to that of ${\tt ols()}$, except that we must pass in the argument ${\tt family=sm.families.Binomial()}$ to tell ${\tt statsmodels}$ to run a logistic regression rather than some other type of generalized linear model.
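Logistic regression models the log-odds of the response as a linear function of the predictors, so a fitted model turns a linear predictor into a probability through the logistic function. The following is a minimal sketch in plain Python, not part of the original lab; the helper name `logistic` and the example inputs are illustrative assumptions.

```python
import math

def logistic(eta):
    """Inverse of the logit link: map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical linear predictors (intercept + sum of coefficient * predictor).
for eta in (-2.0, 0.0, 2.0):
    print(eta, round(logistic(eta), 3))
```

A linear predictor of zero corresponds to a probability of exactly 0.5, and the function is symmetric around that point.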

In [6]:
formula = 'Direction ~ Lag1+Lag2+Lag3+Lag4+Lag5+Volume'
model = smf.glm(formula=formula, data=df, family=sm.families.Binomial())
result = model.fit()
print(result.summary())

                          Generalized Linear Model Regression Results                           
Dep. Variable:     ['Direction[Down]', 'Direction[Up]']   No. Observations:                 1250
Model:                                              GLM   Df Residuals:                     1243
Model Family:                                  Binomial   Df Model:                            6
Link Function:                                    logit   Scale:                          1.0000
Method:                                            IRLS   Log-Likelihood:                -863.79
Date:                                  Mon, 09 Mar 2020   Deviance:                       1727.6
Time:                                          11:02:21   Pearson chi2:                 1.25e+03
No. Iterations:                                       4                                         
Covariance Type:                              nonrobust                                         
                 coef    std e

In [8]:
print("Coefficients")
print(result.params)
print(' ')
print("p-Values")
print(result.pvalues)
print(' ')

Coefficients
Intercept    0.126000
Lag1         0.073074
Lag2         0.042301
Lag3        -0.011085
Lag4        -0.009359
Lag5        -0.010313
Volume      -0.135441
dtype: float64
 
p-Values
Intercept    0.600700
Lag1         0.145232
Lag2         0.398352
Lag3         0.824334
Lag4         0.851445
Lag5         0.834998
Volume       0.392404
dtype: float64
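
One way to read the p-values above programmatically is sketched below. The values are copied from the printed output; the 0.05 cutoff is a conventional choice assumed here, not something the lab specifies.

```python
# p-values copied from the fitted model's output above.
p_values = {
    "Intercept": 0.600700,
    "Lag1": 0.145232,
    "Lag2": 0.398352,
    "Lag3": 0.824334,
    "Lag4": 0.851445,
    "Lag5": 0.834998,
    "Volume": 0.392404,
}

alpha = 0.05  # assumed significance level (conventional, not from the lab)

# Term with the smallest p-value, and all terms below the cutoff.
smallest = min(p_values, key=p_values.get)
significant = [name for name, p in p_values.items() if p < alpha]

print(smallest, p_values[smallest])
print(significant)
```

Under this assumed cutoff, no term is flagged, and ${\tt Lag1}$ has the smallest p-value.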
 


In [8]:
predictions = result.predict()
print(predictions[0:10])

[0.49291587 0.51853212 0.51886117 0.48477764 0.48921884 0.49304354
 0.50734913 0.49077084 0.48238647 0.51116222]


In [10]:
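# as_matrix() was removed in pandas 1.0; to_numpy() is the replacement.
# Pair each original label with the 0/1 value the model actually fit.
print(np.column_stack((df["Direction"].to_numpy(), result.model.endog)))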

[['Up' 0.0]
 ['Up' 0.0]
 ['Down' 1.0]
 ...
 ['Up' 0.0]
 ['Down' 1.0]
 ['Down' 1.0]]

In [11]:
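# The model codes "Down" as 1.0 (see the output above), so predict() returns
# P(Direction == "Down"); probabilities below 0.5 therefore map to "Up".
predictions_nominal = [ "Up" if x < 0.5 else "Down" for x in predictions]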
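
The thresholding step can be sketched on its own. The snippet below reuses the first four fitted probabilities printed earlier and assumes, as the ${\tt endog}$ column above indicates, that each probability refers to ${\tt Down}$. The helper name `to_label` is hypothetical.

```python
def to_label(p_down, threshold=0.5):
    """Convert P(Down) into a class label using a fixed threshold."""
    return "Down" if p_down >= threshold else "Up"

# First four fitted probabilities from the output above.
probs = [0.49291587, 0.51853212, 0.51886117, 0.48477764]
labels = [to_label(p) for p in probs]
print(labels)
```

Flipping the comparison would silently invert every prediction, which is why it is worth checking how the response was coded before thresholding.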

In [13]:
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(df["Direction"], predictions_nominal))

[[145 457]
 [141 507]]
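
As a quick check, overall accuracy and per-class recall can be recovered from the confusion matrix by hand. This sketch hard-codes the counts printed above and relies on scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order (${\tt Down}$, ${\tt Up}$).

```python
# Confusion matrix copied from the output above.
# Rows: true Down, true Up. Columns: predicted Down, predicted Up.
cm = [[145, 457],
      [141, 507]]

correct = cm[0][0] + cm[1][1]            # diagonal = correct predictions
total = sum(sum(row) for row in cm)      # all observations
accuracy = correct / total

down_recall = cm[0][0] / sum(cm[0])      # correctly predicted Down / all true Down
up_recall = cm[1][1] / sum(cm[1])        # correctly predicted Up / all true Up

print(round(accuracy, 3), round(down_recall, 3), round(up_recall, 3))
```

These figures agree with the accuracy and recall columns of the classification report in the next cell.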


In [15]:
print(classification_report(df["Direction"], predictions_nominal, digits=3))

              precision    recall  f1-score   support

        Down      0.507     0.241     0.327       602
          Up      0.526     0.782     0.629       648

    accuracy                          0.522      1250
   macro avg      0.516     0.512     0.478      1250
weighted avg      0.517     0.522     0.483      1250



In [16]:
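# Train on 2001-2004 and hold out 2005. x_train and x_test keep the
# Direction column because the formula interface reads it from the data frame.
x_train = df.loc[:'2004']
y_train = df.loc[:'2004', 'Direction']

x_test = df.loc['2005':]
y_test = df.loc['2005':, 'Direction']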
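
A date-based split like the one above can be sanity-checked on a toy frame. The following is a minimal sketch with made-up values, assuming a sorted ${\tt DatetimeIndex}$ as in the lab's data frame; the names `toy`, `train` and `test` are illustrative.

```python
import pandas as pd

# Toy frame with a DatetimeIndex; the values are made up for illustration.
idx = pd.to_datetime(["2003-05-01", "2004-12-31", "2005-01-03", "2005-06-15"])
toy = pd.DataFrame({"Lag1": [0.4, -0.2, 0.1, 0.9]}, index=idx)

train = toy.loc[:"2004"]   # everything up to and including the end of 2004
test = toy.loc["2005":]    # everything from the start of 2005 onwards

# The two subsets should cover all rows and share none.
overlap = train.index.intersection(test.index)
print(len(train), len(test), len(overlap))
```

Checking that the subsets do not overlap guards against accidentally evaluating on rows the model was fit to.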

In [17]:
model = smf.glm(formula=formula, data=x_train, family=sm.families.Binomial())
result = model.fit()

Notice that we have trained and tested our model on two completely separate
data sets: training was performed using only the dates before 2005,
and testing was performed using only the dates in 2005. Finally, we compute
the predictions for 2005 and compare them to the actual movements
of the market over that time period.

In [22]:
predictions = result.predict(x_test)
predictions_nominal = [ "Up" if x < 0.5 else "Down" for x in predictions]
print(classification_report(y_test, predictions_nominal, digits=3))

              precision    recall  f1-score   support

        Down      0.443     0.694     0.540       111
          Up      0.564     0.312     0.402       141

    accuracy                          0.480       252
   macro avg      0.503     0.503     0.471       252
weighted avg      0.511     0.480     0.463       252



In [25]:
formula = 'Direction ~ Lag1+Lag2'
# Note: this fit uses the full data set (df), which includes the 2005 test rows;
# pass data=x_train instead for a strictly out-of-sample evaluation.
model = smf.glm(formula=formula, data=df, family=sm.families.Binomial())
result = model.fit()
# This will test your new model
predictions = result.predict(x_test)
predictions_nominal = [ "Up" if x < 0.5 else "Down" for x in predictions]
print(classification_report(y_test, predictions_nominal, digits=3))

              precision    recall  f1-score   support

        Down      0.500     0.081     0.140       111
          Up      0.564     0.936     0.704       141

    accuracy                          0.560       252
   macro avg      0.532     0.509     0.422       252
weighted avg      0.536     0.560     0.455       252



In [27]:
print(result.predict(pd.DataFrame([[1.2,1.1],[1.5,-0.8]], columns = ["Lag1","Lag2"])))

0    0.515122
1    0.499355
dtype: float64
