Created: 12/4/2019  
DATA 512 Human Centered Data Science  
Peter Meleney  

# Chapter 4: Logistic Regression

## Overview

In this lab we will:
1. Review the Smarket data set.
1. Fit a logistic regression model.

In [2]:
#Imports for Chapter 4: Logistic Regression
import numpy as np
import math
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.linear_model import LogisticRegression #The version of logistic regression we will be implementing

In [11]:
#allows for matplotlib charts to plot inline in a jupyter notebook.
%matplotlib inline 

#Increases the maximum number of characters displayed per column in a dataframe to 150
pd.options.display.max_colwidth = 150

## The Smarket Dataset

The Smarket data set is a record of 1250 observations of the "daily percentage returns for the S&P 500 stock index between 2001 and 2005." [1] The objective of this section is to use the **Lag** columns (**Lag1** through **Lag5**) and the **Volume** column to predict the value of the **Direction** column.

This is a very simple form of what is called technical analysis: we use the market's past fluctuations to predict its future movement. In this dataset either the **Today** column or the **Direction** column could serve as the target variable, while only the **Lag** and **Volume** columns are allowable predictors. Here we predict the **Direction** column because it is a binary class, which suits a binary classifier like logistic regression.

The data and the description table below are taken from [1].

In [16]:
data_description = pd.DataFrame([["Year", "The year that the observation was recorded."],
["Lag1","Percentage return for previous day."],
["Lag2", "Percentage return for 2 days previous"],
["Lag3", "Percentage return for 3 days previous"],  
["Lag4", "Percentage return for 4 days previous"],
["Lag5", "Percentage return for 5 days previous"],
["Volume", "Volume of shares traded (number of daily shares traded in billions)"],  
["Today", "Percentage return for today"], 
["Direction", "A factor with levels Down and Up indicating whether the market had a positive or negative return on a given day"]])

data_description.columns = ['Name', "Description"]
data_description.index = data_description.iloc[:,0]
data_description.drop('Name', axis = 1, inplace=True)
data_description

Name,Description
Year,The year that the observation was recorded.
Lag1,Percentage return for previous day.
Lag2,Percentage return for 2 days previous
Lag3,Percentage return for 3 days previous
Lag4,Percentage return for 4 days previous
Lag5,Percentage return for 5 days previous
Volume,Volume of shares traded (number of daily shares traded in billions)
Today,Percentage return for today
Direction,A factor with levels Down and Up indicating whether the market had a positive or negative return on a given day


## Import and Review Data

In [6]:
df = pd.read_csv('data/Smarket.csv', index_col=0)

In [7]:
df.head()

,Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction
1,2001,0.381,-0.192,-2.624,-1.055,5.01,1.1913,0.959,Up
2,2001,0.959,0.381,-0.192,-2.624,-1.055,1.2965,1.032,Up
3,2001,1.032,0.959,0.381,-0.192,-2.624,1.4112,-0.623,Down
4,2001,-0.623,1.032,0.959,0.381,-0.192,1.276,0.614,Up
5,2001,0.614,-0.623,1.032,0.959,0.381,1.2057,0.213,Up
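
The rows above show the lag structure directly: each row's **Lag1** is the previous row's **Today**, and **Lag2** is the value from two rows back. As a minimal sketch of how such columns could be derived from a single return series, the cell below rebuilds **Lag1**, **Lag2**, and **Direction** from the five **Today** values printed above. The names `today` and `lags` are hypothetical, and deriving **Direction** from the sign of **Today** is an assumption based on the column descriptions, not a documented step from [1].

```python
import pandas as pd

# Today values copied from the five rows printed by df.head() above.
today = pd.Series([0.959, 1.032, -0.623, 0.614, 0.213], name="Today")

# shift(k) moves each value k rows down, so row i holds the return from k days earlier.
lags = pd.DataFrame({f"Lag{k}": today.shift(k) for k in (1, 2)})
lags["Today"] = today

# Assumption: Direction is "Up" when the day's return is positive, otherwise "Down".
lags["Direction"] = today.gt(0).map({True: "Up", False: "Down"})

print(lags)
```

The first rows of the lag columns are `NaN` here because this short series has no earlier history to draw from; the Smarket file already includes those earlier returns.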


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1250 entries, 1 to 1250
Data columns (total 9 columns):
Year         1250 non-null int64
Lag1         1250 non-null float64
Lag2         1250 non-null float64
Lag3         1250 non-null float64
Lag4         1250 non-null float64
Lag5         1250 non-null float64
Volume       1250 non-null float64
Today        1250 non-null float64
Direction    1250 non-null object
dtypes: float64(7), int64(1), object(1)
memory usage: 97.7+ KB


In [15]:
df.describe()

,Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today
count,1250.0,1250.0,1250.0,1250.0,1250.0,1250.0,1250.0,1250.0
mean,2003.016,0.003834,0.003919,0.001716,0.001636,0.00561,1.478305,0.003138
std,1.409018,1.136299,1.13628,1.138703,1.138774,1.14755,0.360357,1.136334
min,2001.0,-4.922,-4.922,-4.922,-4.922,-4.922,0.35607,-4.922
25%,2002.0,-0.6395,-0.6395,-0.64,-0.64,-0.64,1.2574,-0.6395
50%,2003.0,0.039,0.039,0.0385,0.0385,0.0385,1.42295,0.0385
75%,2004.0,0.59675,0.59675,0.59675,0.59675,0.597,1.641675,0.59675
max,2005.0,5.733,5.733,5.733,5.733,5.733,3.15247,5.733


## Logistic Regression

In [22]:
# Predictors: the five lagged returns plus trading volume
X = df[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume']]
# Encode the target as 1 (Up) / 0 (Down)
y = df['Direction'].map({'Up': 1, 'Down': 0})
model = LogisticRegression(solver='lbfgs')
model.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)
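
A fitted `LogisticRegression` can return both class probabilities and hard class labels. The sketch below is illustrative only: so that it runs on its own, it uses randomly generated stand-in data rather than the Smarket data, and the names `X_demo`, `y_demo`, and `demo_model` are hypothetical. It shows that, for a binary problem, `predict` corresponds to thresholding the second column of `predict_proba` at 0.5, and that `score` reports the fraction of correct predictions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data (NOT Smarket): 200 rows, 6 predictors, noisy binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = (X_demo[:, 0] + rng.normal(size=200) > 0).astype(int)

demo_model = LogisticRegression(solver='lbfgs')
demo_model.fit(X_demo, y_demo)

# Column 0 is P(class 0), column 1 is P(class 1); each row sums to 1.
proba = demo_model.predict_proba(X_demo)

# Hard labels: class 1 wherever P(class 1) exceeds 0.5.
pred = demo_model.predict(X_demo)

# Mean accuracy on the data the model was trained on.
acc = demo_model.score(X_demo, y_demo)
print("training accuracy:", round(acc, 3))
```

Accuracy measured on the training data tends to be optimistic, so a held-out set would be needed for an honest estimate.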

Again we can read `model.coef_` and `model.intercept_` to get the coefficients and the intercept of the fitted model, and again we can use a loop to print each variable name alongside its coefficient.

In [29]:
# Pair each predictor name with its fitted coefficient
for name, coef in zip(X.columns, model.coef_[0]):
    print(name + ": " + str(round(coef, 3)))

print("intercept: " + str(round(model.intercept_[0], 3)))

Lag1: -0.073
Lag2: -0.042
Lag3: 0.011
Lag4: 0.009
Lag5: 0.01
Volume: 0.132
intercept: -0.121
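
To see what these numbers mean, the coefficients can be plugged into the logistic function by hand. The sketch below does this for the first row shown by `df.head()`, using the rounded coefficients printed above, so the result is only approximate. The names `coefs`, `row`, `sigmoid`, `log_odds`, and `p_up` are hypothetical. Note also that the model printout shows `penalty='l2'`, so these are regularized estimates.

```python
import math

# Rounded coefficients and intercept copied from the output above.
coefs = {"Lag1": -0.073, "Lag2": -0.042, "Lag3": 0.011,
         "Lag4": 0.009, "Lag5": 0.01, "Volume": 0.132}
intercept = -0.121

# Predictor values from the first row of df.head().
row = {"Lag1": 0.381, "Lag2": -0.192, "Lag3": -2.624,
       "Lag4": -1.055, "Lag5": 5.01, "Volume": 1.1913}

def sigmoid(z):
    """Logistic function: maps log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Linear predictor (log-odds of "Up"), then the implied probability.
log_odds = intercept + sum(coefs[name] * row[name] for name in coefs)
p_up = sigmoid(log_odds)

print("log-odds:", round(log_odds, 4))
print("P(Up):", round(p_up, 4))
```

The log-odds come out slightly positive, so the estimated probability of an "Up" day is just over one half. A positive coefficient, such as the one for **Volume**, raises the log-odds of "Up" as that predictor increases, and a negative one lowers it.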


## References

[1] Hastie, Trevor. ISLR v1.2, RDocumentation. https://www.rdocumentation.org/packages/ISLR/versions/1.2/topics/Smarket. Last updated October 19, 2017. Accessed December 4, 2019.