# Classification and Regression on Stock Market Data
Joshua Huang and Andy Lee


Our initial goal was to apply machine learning to stock market data, and we started by looking at the classification problem of predicting closing price trends based on previous data. After improving our models we features, we then looked to quantitative prediction with regression.


In [1]:
import quandl
import pandas as pd
import numpy as np
import datetime
import sys
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.model_selection import train_test_split
pd.options.mode.chained_assignment = None 

We started by pulling stock data using the duandl library. Each dataframe initially contained the open/close/high/low price of the stock for each day from 1983-2018. Our goal was to predict whether the stock price would increase or decrease the following day based on previous data. We used five different models to see which ones would perform better on different data sets, with the three we went over in class as well as the Random Forest Classifier and the SVC, which we found to do better on the benchmark data.

In [2]:
model1 = GaussianNB()
model2 = KNeighborsClassifier()
model3 = BernoulliNB()
model4 = RandomForestClassifier()
model5 = SVC()

In [3]:
dfA = quandl.get("WIKI/AMD")
dfA.tail()
dfI = quandl.get("WIKI/INTC")
dfI.tail()
dfN = quandl.get("WIKI/NVDA")
dfN.tail(10)

Unnamed: 0_level_0,Open,High,Low,Close,Volume,Ex-Dividend,Split Ratio,Adj. Open,Adj. High,Adj. Low,Adj. Close,Adj. Volume
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
2018-03-14,249.45,250.55,246.22,248.74,12573947.0,0.0,1.0,249.45,250.55,246.22,248.74,12573947.0
2018-03-15,249.29,252.62,247.64,249.34,9992834.0,0.0,1.0,249.29,252.62,247.64,249.34,9992834.0
2018-03-16,250.0,251.25,248.48,250.48,9634107.0,0.0,1.0,250.0,251.25,248.48,250.48,9634107.0
2018-03-19,248.18,249.35,237.0,241.0,17472128.0,0.0,1.0,248.18,249.35,237.0,241.0,17472128.0
2018-03-20,241.12,251.15,241.12,249.58,13725505.0,0.0,1.0,241.12,251.15,241.12,249.58,13725505.0
2018-03-21,249.32,252.0,247.33,248.56,10841782.0,0.0,1.0,249.32,252.0,247.33,248.56,10841782.0
2018-03-22,246.0,247.88,240.341,241.85,13663927.0,0.0,1.0,246.0,247.88,240.341,241.85,13663927.0
2018-03-23,242.4,242.67,232.52,232.97,18225390.0,0.0,1.0,242.4,242.67,232.52,232.97,18225390.0
2018-03-26,238.0,244.53,235.9,244.48,15130542.0,0.0,1.0,238.0,244.53,235.9,244.48,15130542.0
2018-03-27,247.75,250.0,219.845,225.52,34462113.0,0.0,1.0,247.75,250.0,219.845,225.52,34462113.0


In [None]:
def sign(x):
    if x > 0:
        return -1
    else:
        return 1
    
def calcS(df):
    temp = df.shift(1) - df
    temp['delt'] = temp['Close'].apply(sign)
    return temp

def add_features(s):
    arr = [c for c in s]
#     add_nasdaq(arr)
    return np.array(arr)

We then calculated the change in price by shifting the dataframe and subtracting it from itself. We took the sign of this value which became the label for each day. We took a benchmark of the classifiers without adding in extra features and saw a pretty low accuracy of 48.1% for naive bayes, but up to 52% for SVC and Random Forest Classifier.

In [4]:
dft = calcS(dfA)
dft.head()
train = dfA.head(6829)
test = dfA.tail(2000)
ltrain = dft.head(6829)
ltest = dft.tail(2000)
ltrain.shape

NameError: name 'calcS' is not defined

In [175]:
# X_train, X_test, y_train, y_test = train_test_split(xtrain2, xlabels2, test_size=0.33, random_state=42)
# summarize the fit of the model
ntrain = np.array(train)
ntrain = [add_features(a) for a in ntrain]
tlabels = np.array(ltrain['delt'])
ytest = np.array(test)
ytest = [add_features(a) for a in ytest]
ylabels = np.array(ltest['delt'])
model1.fit(ntrain, tlabels)
predicted1 = model1.predict(ytest)
expected1 = ylabels

model2.fit(ntrain, tlabels)
predicted2 = model2.predict(ytest)
expected2 = ylabels

model3.fit(ntrain, tlabels)
predicted3 = model3.predict(ytest)
expected3 = ylabels

model4.fit(ntrain, tlabels)
predicted3 = model3.predict(ytest)
expected3 = ylabels

model5.fit(ntrain, tlabels)
predicted3 = model3.predict(ytest)
expected3 = ylabels
print("Gaussian NB\n")
print(metrics.accuracy_score(expected1, predicted1))
print(metrics.classification_report(expected1, predicted1))
print(metrics.confusion_matrix(expected1, predicted1))
print("\n")

print("K-Neighbors Classifier\n")
print(metrics.accuracy_score(expected2, predicted2))
print(metrics.classification_report(expected2, predicted2))
print(metrics.confusion_matrix(expected2, predicted2))
print("\n")

print("Bernoulli NB\n")
print(metrics.accuracy_score(expected3, predicted3))
print(metrics.classification_report(expected3, predicted3))
print(metrics.confusion_matrix(expected3, predicted3))
print("\n")

print("Random Forest\n")
print(metrics.accuracy_score(expected3, predicted3))
print(metrics.classification_report(expected3, predicted3))
print(metrics.confusion_matrix(expected3, predicted3))
print("\n")

print("SVC\n")
print(metrics.accuracy_score(expected3, predicted3))
print(metrics.classification_report(expected3, predicted3))
print(metrics.confusion_matrix(expected3, predicted3))
print("\n")

Gaussian NB

0.481
             precision    recall  f1-score   support

         -1       0.47      0.61      0.53       952
          1       0.51      0.36      0.42      1048

avg / total       0.49      0.48      0.47      2000

[[585 367]
 [671 377]]


K-Neighbors Classifier

0.4985
             precision    recall  f1-score   support

         -1       0.47      0.49      0.48       952
          1       0.52      0.50      0.51      1048

avg / total       0.50      0.50      0.50      2000

[[469 483]
 [520 528]]


Bernoulli NB

0.524
             precision    recall  f1-score   support

         -1       0.00      0.00      0.00       952
          1       0.52      1.00      0.69      1048

avg / total       0.27      0.52      0.36      2000

[[   0  952]
 [   0 1048]]


Random Forest

0.524
             precision    recall  f1-score   support

         -1       0.00      0.00      0.00       952
          1       0.52      1.00      0.69      1048

avg / total       0.27  

  'precision', 'predicted', average, warn_for)


We then looked at features we could add and decided on several that would provide information about global markets as well as historical data on AMD. The first feature was data on the NASDAQ index, which helps define performance of the entire US economy as a whole. We also incorporated google trends data, as a larger trend value could mean larger variance in stock price the following day. Finally, we incorporated company specific data, in calculating the moving-average of 5 and 10 days, essentially the average closing price over the period of time, because crossing the moving average or pulling away from it can define the start of a bull or bear trend. We then added features based on financial ratios found on the balance sheet such as beta, the volatility of the stock, as well as price/earnings ratio, a figure that compares the stock price to how much investers make from the stock on average.

In [178]:
quandl.ApiConfig.api_key = 'bjxPxJDJRbGxYNqxstKH'
cols = [2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018]
ncols = ["Tdate", "IndexV", "H", "L", "TargetMarket", "DMV"]
dfAR = pd.read_csv('AMD.csv', sep=',', names=cols,
                    encoding='latin-1')
dfn = pd.read_csv('NASDAQ.csv', sep=',', names=ncols,
                    encoding='latin-1')
dfn = dfn[1:]
dfn = dfn[::-1]
dfAR = dfAR[3:]
dfn.head()
dfNratio = pd.read_csv('out.csv', sep=',', header = 0, index_col=0,
                    encoding='latin-1')
dftrends = pd.read_csv('trends.csv', sep=',', header = 0, index_col=0,
                    encoding='latin-1')
dfAmd = pd.read_csv('Amd_historical.csv', sep=',', header = 0, index_col=0,
                    encoding='latin-1')
dfNratio.head()
dftrends.head()
dfAmd.head()


Unnamed: 0_level_0,Open,High,Low,Close,Volume,Adj. Open,Adj. High,Adj. Low,Adj. Close,Adj. Volume,5-day Moving,10-day Moving
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
3/21/83,35.88,36.13,35.25,36.0,127700.0,8.97,9.03,8.81,9.0,510800.0,35.0,34.56
3/22/83,34.88,35.88,34.0,34.0,82000.0,8.72,8.97,8.5,8.5,328000.0,35.0,34.56
3/23/83,34.0,35.25,33.88,34.88,106800.0,8.5,8.81,8.47,8.72,427200.0,35.0,34.56
3/24/83,34.88,35.13,34.63,35.13,98300.0,8.72,8.78,8.66,8.78,393200.0,35.0,34.56
3/25/83,35.63,36.25,35.0,35.5,52600.0,8.91,9.06,8.75,8.87,210400.0,35.1,34.56


In [179]:
def f(s):
    a = [str(x) for x in s]
    a[1] = str(1)
    return "/".join(a)
    

In [180]:
dftrends['d2'] = dftrends.index
dftrends['d2'] = dftrends['d2'].apply(lambda x: str(x)[:7])
dftrends.head()

Unnamed: 0_level_0,AMD,isPartial,d2
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
5/1/11,48,False,5/1/11
6/1/11,52,False,6/1/11
7/1/11,56,False,7/1/11
8/1/11,57,False,8/1/11
9/1/11,60,False,9/1/11


Because financial data is extremely difficult to find without paying for premium services, for example, quandl requires a $675 dollar subscription to access the same financial figures we read in through a csv, we had to shorten the stock data to only the years 2011-2018, which we could find additional features to add.;

In [181]:
dfAc = dfAmd[-1727:]
dfAc['d2'] = dfAc.index
dfAc['d2'] = dfAc['d2'].apply(lambda x: f(str(x).split('/')))
dfAc.head()

Unnamed: 0_level_0,Open,High,Low,Close,Volume,Adj. Open,Adj. High,Adj. Low,Adj. Close,Adj. Volume,5-day Moving,10-day Moving,d2
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
5/16/11,8.77,8.95,8.65,8.67,16573500.0,8.77,8.95,8.65,8.67,16573500.0,8.86,8.86,5/1/11
5/17/11,8.62,8.62,8.38,8.54,20017900.0,8.62,8.62,8.38,8.54,20017900.0,8.78,8.84,5/1/11
5/18/11,8.53,8.73,8.52,8.67,17954500.0,8.53,8.73,8.52,8.67,17954500.0,8.74,8.83,5/1/11
5/19/11,8.62,8.71,8.55,8.64,15699700.0,8.62,8.71,8.55,8.64,15699700.0,8.68,8.8,5/1/11
5/20/11,8.58,8.7,8.54,8.62,16694300.0,8.58,8.7,8.54,8.62,16694300.0,8.63,8.77,5/1/11


In [182]:
s1 = pd.merge(dfAc, dftrends, how='left', on=['d2'])
s1 = s1.drop(columns=['d2', 'isPartial'])
s1.head()

Unnamed: 0,Open,High,Low,Close,Volume,Adj. Open,Adj. High,Adj. Low,Adj. Close,Adj. Volume,5-day Moving,10-day Moving,AMD
0,8.77,8.95,8.65,8.67,16573500.0,8.77,8.95,8.65,8.67,16573500.0,8.86,8.86,48
1,8.62,8.62,8.38,8.54,20017900.0,8.62,8.62,8.38,8.54,20017900.0,8.78,8.84,48
2,8.53,8.73,8.52,8.67,17954500.0,8.53,8.73,8.52,8.67,17954500.0,8.74,8.83,48
3,8.62,8.71,8.55,8.64,15699700.0,8.62,8.71,8.55,8.64,15699700.0,8.68,8.8,48
4,8.58,8.7,8.54,8.62,16694300.0,8.58,8.7,8.54,8.62,16694300.0,8.63,8.77,48


In [183]:
def add_features(s):
    arr = [c for c in s]
#     add_nasdaq(arr)
    return np.array(arr)

In [184]:
# def add_nasdaq(a):
#     a.extend([a for a in dfn.iloc[i][1:]])
#     return a

In [185]:
def calcS(df):
    temp = df.shift(1) - df
    temp['delt'] = temp['Close'].apply(sign)
    return temp
    

In [186]:
def sign(x):
    if x > 0:
        return -1
    else:
        return 1

In [193]:
dfAc = dfAc.drop(columns=['d2'])
dfAc.head()

ValueError: labels ['d2'] not contained in axis

In [194]:
dfAc.head()
dfAcs = calcS(dfAc)
dfAcs.head()

Unnamed: 0_level_0,Open,High,Low,Close,Volume,Adj. Open,Adj. High,Adj. Low,Adj. Close,Adj. Volume,5-day Moving,10-day Moving,delt
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
5/16/11,,,,,,,,,,,,,1
5/17/11,0.15,0.33,0.27,0.13,-3444400.0,0.15,0.33,0.27,0.13,-3444400.0,0.08,0.02,-1
5/18/11,0.09,-0.11,-0.14,-0.13,2063400.0,0.09,-0.11,-0.14,-0.13,2063400.0,0.04,0.01,1
5/19/11,-0.09,0.02,-0.03,0.03,2254800.0,-0.09,0.02,-0.03,0.03,2254800.0,0.06,0.03,-1
5/20/11,0.04,0.01,0.01,0.02,-994600.0,0.04,0.01,0.01,0.02,-994600.0,0.05,0.03,-1


Only for 2011-2018 with Nasdaq data

In [195]:
dfAcs.shape
featureAmd = [add_features(a) for a in np.array(s1)]
for i in range(len(featureAmd)):
    temp = [float(a) for a in featureAmd[i]]
    temp.extend([float(a) for a in dfNratio.iloc[i]])
    featureAmd[i] = np.array(temp)

In [196]:
#split the data
xtrain = featureAmd[:1000]
xlabels = np.array(dfAcs['delt'])[:1000]
ytest = featureAmd[1000:]
ylables = np.array(dfAcs['delt'])[1000:]

In [197]:
#check for null values 
x2 = [item for sublist in xtrain for item in sublist]
for i in range(len(xtrain)):
    for j in range(len(xtrain[i])):
        if not np.isfinite(xtrain[i][j]):
            print(i, j)
            print("here")

With additional features, our gaussian naive bayes accuracy increased to 54.7%, which was significantly higher than our initial accuracy, however the accuracy on some of the other models actually decreased. When looking into why this might happen, we found that the features tend to help the accuracy of the naive bayes, however extra features for decision tree classifiers such as k-neighbors classifier can overfit data, which can lead to overall worse predictions with added features.

In [198]:
model1.fit(xtrain, xlabels)
predicted1 = model1.predict(ytest)
expected1 = ylables

model2.fit(xtrain, xlabels)
predicted2 = model2.predict(ytest)
expected2 = ylables

model3.fit(xtrain, xlabels)
predicted3 = model3.predict(ytest)
expected3 = ylables

model4.fit(xtrain, xlabels)
predicted3 = model3.predict(ytest)
expected3 = ylables

model5.fit(xtrain, xlabels)
predicted3 = model3.predict(ytest)
expected3 = ylables
print("Gaussian NB\n")
print(metrics.accuracy_score(expected1, predicted1))
print(metrics.classification_report(expected1, predicted1))
print(metrics.confusion_matrix(expected1, predicted1))
print("\n")

print("K-Neighbors Classifier\n")
print(metrics.accuracy_score(expected2, predicted2))
print(metrics.classification_report(expected2, predicted2))
print(metrics.confusion_matrix(expected2, predicted2))
print("\n")

print("Bernoulli NB\n")
print(metrics.accuracy_score(expected3, predicted3))
print(metrics.classification_report(expected3, predicted3))
print(metrics.confusion_matrix(expected3, predicted3))
print("\n")

print("Random Forest\n")
print(metrics.accuracy_score(expected3, predicted3))
print(metrics.classification_report(expected3, predicted3))
print(metrics.confusion_matrix(expected3, predicted3))
print("\n")

print("SVC\n")
print(metrics.accuracy_score(expected3, predicted3))
print(metrics.classification_report(expected3, predicted3))
print(metrics.confusion_matrix(expected3, predicted3))
print("\n")

Gaussian NB

0.547455295735901
             precision    recall  f1-score   support

         -1       0.52      0.07      0.13       331
          1       0.55      0.94      0.69       396

avg / total       0.54      0.55      0.44       727

[[ 24 307]
 [ 22 374]]


K-Neighbors Classifier

0.48005502063273725
             precision    recall  f1-score   support

         -1       0.43      0.40      0.41       331
          1       0.52      0.54      0.53       396

avg / total       0.48      0.48      0.48       727

[[134 197]
 [181 215]]


Bernoulli NB

0.4718019257221458
             precision    recall  f1-score   support

         -1       0.45      0.75      0.56       331
          1       0.53      0.24      0.33       396

avg / total       0.50      0.47      0.44       727

[[248  83]
 [301  95]]


Random Forest

0.4718019257221458
             precision    recall  f1-score   support

         -1       0.45      0.75      0.56       331
          1       0.53      0.2