# **Exercise on Smarket data: practice on financial data**
In this lab we will examine the Smarket data, which is part of the ISLP library. This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005. For each date, we have recorded the percentage returns for each of the five previous trading days, `Lag1` through `Lag5`. We have also recorded `Volume` (the number of shares traded on the previous day, in billions), `Today` (the percentage return on the date in question) and `Direction` (whether the market was Up or Down on this date).

In [15]:
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots 
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,summarize)
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, auc, roc_curve,roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

In [2]:
from ISLP import confusion_table
from ISLP.models import contrast
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

Now we are ready to load the `Smarket` data.

In [3]:
Smarket = load_data('Smarket') 
Smarket

| | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2001 | 0.381 | -0.192 | -2.624 | -1.055 | 5.010 | 1.19130 | 0.959 | Up |
| 1 | 2001 | 0.959 | 0.381 | -0.192 | -2.624 | -1.055 | 1.29650 | 1.032 | Up |
| 2 | 2001 | 1.032 | 0.959 | 0.381 | -0.192 | -2.624 | 1.41120 | -0.623 | Down |
| 3 | 2001 | -0.623 | 1.032 | 0.959 | 0.381 | -0.192 | 1.27600 | 0.614 | Up |
| 4 | 2001 | 0.614 | -0.623 | 1.032 | 0.959 | 0.381 | 1.20570 | 0.213 | Up |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1245 | 2005 | 0.422 | 0.252 | -0.024 | -0.584 | -0.285 | 1.88850 | 0.043 | Up |
| 1246 | 2005 | 0.043 | 0.422 | 0.252 | -0.024 | -0.584 | 1.28581 | -0.955 | Down |
| 1247 | 2005 | -0.955 | 0.043 | 0.422 | 0.252 | -0.024 | 1.54047 | 0.130 | Up |
| 1248 | 2005 | 0.130 | -0.955 | 0.043 | 0.422 | 0.252 | 1.42236 | -0.298 | Down |


We can see what the variable names are.

In [4]:
Smarket.columns

Index(['Year', 'Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume', 'Today',
       'Direction'],
      dtype='object')

We compute the correlation matrix using the `corr()` method for data frames, which produces a matrix that contains all of the pairwise correlations among the variables. 

In [5]:
Smarket_drop = Smarket.loc[:, Smarket.columns != 'Direction'] #drop the qualitative column
Smarket_drop.corr()

| | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today |
|---|---|---|---|---|---|---|---|---|
| Year | 1.0 | 0.0297 | 0.030596 | 0.033195 | 0.035689 | 0.029788 | 0.539006 | 0.030095 |
| Lag1 | 0.0297 | 1.0 | -0.026294 | -0.010803 | -0.002986 | -0.005675 | 0.04091 | -0.026155 |
| Lag2 | 0.030596 | -0.026294 | 1.0 | -0.025897 | -0.010854 | -0.003558 | -0.043383 | -0.01025 |
| Lag3 | 0.033195 | -0.010803 | -0.025897 | 1.0 | -0.024051 | -0.018808 | -0.041824 | -0.002448 |
| Lag4 | 0.035689 | -0.002986 | -0.010854 | -0.024051 | 1.0 | -0.027084 | -0.048414 | -0.0069 |
| Lag5 | 0.029788 | -0.005675 | -0.003558 | -0.018808 | -0.027084 | 1.0 | -0.022002 | -0.03486 |
| Volume | 0.539006 | 0.04091 | -0.043383 | -0.041824 | -0.048414 | -0.022002 | 1.0 | 0.014592 |
| Today | 0.030095 | -0.026155 | -0.01025 | -0.002448 | -0.0069 | -0.03486 | 0.014592 | 1.0 |


As one would expect, the correlations between the lagged return variables and today’s return are close to zero. The only substantial correlation is between `Year` and `Volume`. By plotting the data we see that `Volume` is increasing over time. In other words, the average number of shares traded daily increased from 2001 to 2005.
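
One way to check the upward trend in `Volume` without a plot is to average it within each year. The sketch below does this with `groupby`. The small data frame is made-up toy data standing in for `Smarket`, and `mean_volume_by_year` is a hypothetical helper name, not part of ISLP.

```python
import pandas as pd


def mean_volume_by_year(df: pd.DataFrame) -> pd.Series:
    """Average the Volume column within each Year (hypothetical helper)."""
    return df.groupby("Year")["Volume"].mean()


# Toy stand-in for Smarket; these numbers are illustrative, not real data.
toy = pd.DataFrame({
    "Year":   [2001, 2001, 2002, 2002, 2003, 2003],
    "Volume": [1.0, 1.2, 1.3, 1.5, 1.6, 1.8],
})

yearly = mean_volume_by_year(toy)
print(yearly)

# True when each yearly mean is at least as large as the previous one
print(yearly.is_monotonic_increasing)
```

On the real data, `mean_volume_by_year(Smarket)` would return one value per year from 2001 to 2005.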

We now train classifiers on the observations from 2001 through 2004, using the five lagged returns `Lag1` through `Lag5` as predictors, and then use each fitted model to predict the market direction in 2005.

In [6]:
X = Smarket.iloc[:, 1:6]  # predictors: columns 1-5 are Lag1 through Lag5 (Volume is not included)
y = Smarket.Direction  # response: 'Up' or 'Down'

In [7]:
y = y.map({'Up': 1, 'Down': 0})  # encode the response as 1 (Up) or 0 (Down)

In [8]:
train = (Smarket.Year < 2005)  # boolean mask: True for 2001-2004 (training), False for 2005 (test)
X_train, X_test = X.loc[train], X.loc[~train]
y_train, y_test = y.loc[train], y.loc[~train]
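
The split above is chronological rather than random, so each model is evaluated only on data that comes after its training period. A minimal sketch of the same boolean-mask pattern on a made-up frame follows. The helper `split_by_year` and the toy values are assumptions for illustration only.

```python
import pandas as pd


def split_by_year(X, y, years, first_test_year):
    """Rows before first_test_year go to training, the rest to testing."""
    train = years < first_test_year
    return X.loc[train], X.loc[~train], y.loc[train], y.loc[~train]


# Toy stand-in for Smarket; the values are illustrative, not real data.
toy = pd.DataFrame({
    "Year": [2003, 2003, 2004, 2004, 2005, 2005],
    "Lag1": [0.1, -0.2, 0.3, -0.4, 0.5, -0.6],
    "Direction": ["Up", "Down", "Up", "Down", "Up", "Down"],
})
X_toy = toy[["Lag1"]]
y_toy = toy["Direction"].map({"Up": 1, "Down": 0})

X_tr, X_te, y_tr, y_te = split_by_year(X_toy, y_toy, toy["Year"], 2005)
print(len(X_tr), len(X_te))  # 4 training rows, 2 test rows
```

In this lab, the counts in each confusion table sum to 252, which is the number of 2005 observations held out for testing.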

# Predicting market direction with a decision tree

In [9]:
# Candidate values for the cost-complexity pruning parameter ccp_alpha
tuned_parameters = [{"ccp_alpha": [1, 0.1, 0.01, 0.001, 0.0001]}]
# 10-fold cross-validation on the training set, scored by accuracy
treeCV = GridSearchCV(DecisionTreeClassifier(random_state=67), tuned_parameters, scoring='accuracy', cv=10)
# For details, see https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
treeCV.fit(X_train, y_train)
print("Best parameters set found on validation set:")
print()
print(treeCV.best_params_)
print()
print("Grid scores on validation set:")
print()
# Mean and standard deviation of the accuracy across folds, one pair per candidate
means = treeCV.cv_results_["mean_test_score"]
stds = treeCV.cv_results_["std_test_score"]
for mean, std, params in zip(means, stds, treeCV.cv_results_["params"]):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std, params))

Best parameters set found on validation set:

{'ccp_alpha': 0.001}

Grid scores on validation set:

0.508 (+/-0.003) for {'ccp_alpha': 1}
0.508 (+/-0.003) for {'ccp_alpha': 0.1}
0.508 (+/-0.003) for {'ccp_alpha': 0.01}
0.537 (+/-0.051) for {'ccp_alpha': 0.001}
0.530 (+/-0.051) for {'ccp_alpha': 0.0001}
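
The three largest values of `ccp_alpha` give identical scores of 0.508, which is consistent with a tree pruned back to a single leaf that always predicts the majority class. The sketch below illustrates that effect on synthetic data from `make_classification` (an assumption for illustration, not the Smarket data). With the default Gini criterion, the root impurity of a two-class problem is at most 0.5, so `ccp_alpha=1` removes every split.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with five predictors (illustrative only)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Unpruned tree versus a tree with a very large pruning penalty
full = DecisionTreeClassifier(random_state=0).fit(X_syn, y_syn)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=1.0).fit(X_syn, y_syn)

print("leaves without pruning:", full.get_n_leaves())
print("leaves with ccp_alpha=1:", pruned.get_n_leaves())

# Effective alphas at which subtrees are pruned away, from smallest to largest
path = full.cost_complexity_pruning_path(X_syn, y_syn)
print("largest effective alpha:", path.ccp_alphas.max())
```

The values in `path.ccp_alphas` could also serve as a data-driven grid for `ccp_alpha` in place of a fixed list of powers of ten.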


In [10]:
# predict test set labels
ypred_tree = treeCV.predict(X_test)
accuracy_score(y_test,ypred_tree)

0.5119047619047619

In [11]:
confusion_table(ypred_tree, y_test)

| Predicted \ Truth | 0 | 1 |
|---|---|---|
| 0 | 55 | 67 |
| 1 | 56 | 74 |


# Predicting market direction with a random forest

In [114]:
# Candidate values for max_features, the number of predictors considered at each split
tuned_parameters = [{"max_features": [1, 2, 3, 4, 5]}]
# 5-fold cross-validation on the training set, scored by accuracy
rfCV = GridSearchCV(RandomForestClassifier(n_estimators=500, bootstrap=True, oob_score=True, random_state=234), tuned_parameters, scoring='accuracy', cv=5)
# For details, see https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
rfCV.fit(X_train, y_train)
print("Best parameters set found on validation set:")
print()
print(rfCV.best_params_)
print()
print("Grid scores on validation set:")
print()
# Mean and standard deviation of the accuracy across folds, one pair per candidate
means = rfCV.cv_results_["mean_test_score"]
stds = rfCV.cv_results_["std_test_score"]
for mean, std, params in zip(means, stds, rfCV.cv_results_["params"]):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std, params))

Best parameters set found on validation set:

{'max_features': 3}

Grid scores on validation set:

0.480 (+/-0.039) for {'max_features': 1}
0.481 (+/-0.040) for {'max_features': 2}
0.494 (+/-0.049) for {'max_features': 3}
0.484 (+/-0.032) for {'max_features': 4}
0.492 (+/-0.046) for {'max_features': 5}
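
The forest above is built with `oob_score=True`, but the out-of-bag estimate is never read. Because `GridSearchCV` refits the best model on the full training set by default, the estimate should be available as `rfCV.best_estimator_.oob_score_`. The sketch below shows the attribute on synthetic data (an assumption for illustration, not the Smarket data). Note that setting `max_features` equal to the number of predictors, here 5, amounts to bagging.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data with five predictors (illustrative only)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

oob = {}
for m in (1, 3, 5):  # m = 5 considers every predictor at each split (bagging)
    rf = RandomForestClassifier(
        n_estimators=100, max_features=m, oob_score=True, random_state=0
    )
    rf.fit(X_syn, y_syn)
    # Accuracy on observations left out of each tree's bootstrap sample
    oob[m] = rf.oob_score_

for m, score in oob.items():
    print(f"max_features={m}: OOB accuracy = {score:.3f}")
```

The out-of-bag accuracy comes at no extra fitting cost and offers a second check on the cross-validated scores.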


In [115]:
# predict test set labels
ypred_rf = rfCV.predict(X_test)
accuracy_score(y_test,ypred_rf)

0.5119047619047619

In [119]:
confusion_table(ypred_rf, y_test)

| Predicted \ Truth | 0 | 1 |
|---|---|---|
| 0 | 46 | 58 |
| 1 | 65 | 83 |


# Predicting market direction with gradient boosting

In [141]:
# Base model: 100 boosted trees, each with depth at most 5
grd = GradientBoostingClassifier(max_depth=5, n_estimators=100, random_state=234)
# For details, see https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier
# Candidate values for the learning rate (shrinkage applied to each tree's contribution)
tuned_parameters = [{"learning_rate": [0.001, 0.01, 0.1, 1, 10]}]
# 10-fold cross-validation on the training set, scored by accuracy
grdCV = GridSearchCV(grd, tuned_parameters, scoring='accuracy', cv=10)
grdCV.fit(X_train, y_train)
print("Best parameters set found on validation set:")
print()
print(grdCV.best_params_)
print()
print("Grid scores on validation set:")
print()
# Mean and standard deviation of the accuracy across folds, one pair per candidate
means = grdCV.cv_results_["mean_test_score"]
stds = grdCV.cv_results_["std_test_score"]
for mean, std, params in zip(means, stds, grdCV.cv_results_["params"]):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std, params))

Best parameters set found on validation set:

{'learning_rate': 10}

Grid scores on validation set:

0.508 (+/-0.054) for {'learning_rate': 0.001}
0.506 (+/-0.069) for {'learning_rate': 0.01}
0.497 (+/-0.045) for {'learning_rate': 0.1}
0.500 (+/-0.056) for {'learning_rate': 1}
0.510 (+/-0.026) for {'learning_rate': 10}
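
The cross-validated accuracies above differ by less than their standard deviations, so the selected `learning_rate` of 10 is not clearly separated from the alternatives. One way to look more closely at how the learning rate interacts with the number of trees is `staged_predict`, which yields predictions after each boosting iteration. The sketch below uses synthetic data and smaller settings than the cell above; both are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic two-class data with five predictors (illustrative only)
X_syn, y_syn = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

staged_acc = {}
for lr in (0.01, 0.1, 1.0):
    gb = GradientBoostingClassifier(
        learning_rate=lr, n_estimators=50, max_depth=2, random_state=0
    )
    gb.fit(X_tr, y_tr)
    # One test-set accuracy per boosting iteration
    staged_acc[lr] = [
        accuracy_score(y_te, pred) for pred in gb.staged_predict(X_te)
    ]

for lr, accs in staged_acc.items():
    best_iter = max(range(len(accs)), key=accs.__getitem__) + 1
    print(f"learning_rate={lr}: best test accuracy {max(accs):.3f} at iteration {best_iter}")
```

Smaller learning rates usually need more trees to reach the same fit, so tuning `learning_rate` jointly with `n_estimators` may be more informative than tuning it alone.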


In [142]:
# predict test set labels
ypred_grd = grdCV.predict(X_test)
accuracy_score(y_test,ypred_grd)

0.5634920634920635

In [143]:
confusion_table(ypred_grd, y_test)

| Predicted \ Truth | 0 | 1 |
|---|---|---|
| 0 | 75 | 74 |
| 1 | 36 | 67 |
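
The three confusion tables can be compared against a naive baseline that always predicts 1 (Up). The sketch below recomputes the accuracies from the counts reported in this lab. The helper `summarize_table` is a hypothetical name introduced here for illustration.

```python
def summarize_table(pred0_true0, pred0_true1, pred1_true0, pred1_true1):
    """Return (accuracy, precision for predicted Up, accuracy of always predicting Up)."""
    total = pred0_true0 + pred0_true1 + pred1_true0 + pred1_true1
    accuracy = (pred0_true0 + pred1_true1) / total
    up_precision = pred1_true1 / (pred1_true0 + pred1_true1)
    always_up = (pred0_true1 + pred1_true1) / total
    return accuracy, up_precision, always_up


# Counts copied from the confusion tables, read row by row:
# (pred 0 / truth 0, pred 0 / truth 1, pred 1 / truth 0, pred 1 / truth 1)
tables = {
    "decision tree": (55, 67, 56, 74),
    "random forest": (46, 58, 65, 83),
    "gradient boosting": (75, 74, 36, 67),
}

results = {name: summarize_table(*counts) for name, counts in tables.items()}
for name, (acc, prec, base) in results.items():
    print(f"{name}: accuracy={acc:.3f}, Up precision={prec:.3f}, always-Up baseline={base:.3f}")
```

With these counts, always predicting Up would be correct on 141 of the 252 test days (about 0.560). Only the gradient boosting model, with 142 correct predictions, edges past that baseline, and only by a single observation.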
