# Final Project
## Group Members: Jesse Zou, Andy Li, Yuhan Zheng, Zhiyao Bao

# Introduction

* What is the data science problem you are trying to solve?
    * We are trying to predict the next-day direction of the stock market (whether the price will go up or down) for the DOW30, SP500, and NASDAQ indexes, each modeled separately.
* Why does the problem matter?
    * First, a predictive model gives us a data-driven analysis of the market and the economy. The stock market is influenced by human emotion and irrational sentiment, so being able to predict short-term movements in market prices also lets us model human behavior. Additionally, it may help day traders earn money.
* What could the results of your predictive model be used for?
    * The model can be used by people who trade in the stock market to make better decisions and potentially generate more profit on their investments.
* Why would we want to be able to predict the thing you’re trying to predict?
    * Predicting the direction of stock prices helps us better understand how the stock market behaves.
* Then describe the dataset that you will use to tackle this problem
    * The dataset contains a data point for each day of the SP500, DOW30, and NASDAQ. These are stock indexes, that is, a sort of average for the entire stock market. By doing an analysis on this dataset, we are capturing the entire market and economy as a whole.
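The prediction target described above (up or down for the next day) can be illustrated with a minimal sketch. The datasets used here already ship with a `LABEL` column, so this is only an assumption about how such a label could be derived; the function name `next_day_direction` and the `Close` series are hypothetical.

```python
import pandas as pd

def next_day_direction(close: pd.Series) -> pd.Series:
    """Return 1 where the next day's value is higher than today's, else 0.

    The last day has no following day, so it is dropped.
    """
    next_close = close.shift(-1)          # value of the following day
    label = (next_close > close).astype(int)
    return label.iloc[:-1]                # last row has no "next day"

# Hypothetical closing values for five consecutive days.
prices = pd.Series([100.0, 101.5, 101.0, 103.2, 103.2], name="Close")
labels = next_day_direction(prices)
print(labels.tolist())
```

Note that an unchanged price counts as "down" (0) in this sketch; a real labeling scheme may treat ties differently.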

In [2]:
# Imports
import warnings
#warnings.simplefilter("ignore")
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
import time
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score, cross_val_predict, GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.utils._testing import ignore_warnings
from sklearn.exceptions import ConvergenceWarning

In [2]:
%matplotlib inline

# Load datasets

We load the datasets once here and reuse them throughout, which avoids the overhead of re-reading the CSV files at each step of the training.

In [3]:
# Read data
data_dow = pd.read_csv("DOW30.csv")
data_sp = pd.read_csv("SP500.csv")
data_nas = pd.read_csv("NASDAQ.csv")

# Data Cleaning

Before performing data cleaning, we first verify whether or not it is even necessary. Some data sets come fully intact without inconsistencies, so we check if the data sets contain any null values before continuing on.

In [8]:
def need_data_cleaning():
    if data_dow.isnull().values.any() or data_sp.isnull().values.any() or data_nas.isnull().values.any():
        print("has None value in dataset")
    else:
        print("do not need data cleaning")

need_data_cleaning()

do not need data cleaning


The output is "do not need data cleaning", which means that none of the three datasets we are going to use (DOW30.csv, SP500.csv, and NASDAQ.csv) contains a null value. Since missing values are the only issue we check for, we perform no further data cleaning.
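The check above only reports whether any null value exists. A slightly more detailed report could be sketched as follows; the helper name `summarize_quality` and the toy data frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def summarize_quality(df: pd.DataFrame) -> dict:
    """Collect a few basic data-quality indicators for one data frame."""
    return {
        # Number of missing values in each column.
        "null_counts": df.isnull().sum().to_dict(),
        # Number of rows that exactly repeat an earlier row.
        "duplicate_rows": int(df.duplicated().sum()),
        # Columns that would need encoding or dropping before modeling.
        "non_numeric_columns": df.select_dtypes(exclude="number").columns.tolist(),
    }

# Toy frame standing in for one of the index datasets.
toy = pd.DataFrame({
    "Date": ["2020-01-02", "2020-01-03", "2020-01-03", "2020-01-06"],
    "feature": [0.5, 1.0, 1.0, np.nan],
    "LABEL": [1, 0, 0, 1],
})
report = summarize_quality(toy)
print(report)
```

Applied to `data_dow`, `data_sp`, and `data_nas`, a report like this would also surface duplicated days or non-numeric columns, which the null check alone does not cover.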

# Data Exploration

The data was then explored by using a box plot and a line graph. 

Using the box plot, the data set can be summarized into its minimum, 1st quartile, median, 3rd quartile, and maximum, along with any outliers that are contained in the data.

Using the line graph, the data is visualized in a way that may help determine trends that are not as easily detected using machine learning algorithms.

In [4]:
def box_plot(x):
    red_square = dict(markerfacecolor='r', marker='s')
    fig, ax = plt.subplots()
    ax.set_title('Horizontal Boxes')
    ax.boxplot(x, vert=False, flierprops=red_square)

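# Plot the first numeric column as an example (assumes at least one numeric
# column exists); substitute the column of interest.
first_numeric = data_dow.select_dtypes(include="number").columns[0]
box_plot(data_dow[first_numeric])

def line_graph(x, y):
    # Draw y against x as a simple line chart.
    fig, ax = plt.subplots()
    ax.plot(x, y)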

# Models

The SVM model performs three steps:

- Scales the data so that it is normalized, which reduces runtime and prevents some features from outweighing others.
- Performs dimensionality reduction using PCA to reduce runtime.
- Runs SVM.

Using these steps, we can reduce the time spent training the SVM model without severely impacting the accuracy.

In [None]:
# SVM
def SVM_trainer(data_X, data_Y):
    svm_scaler = StandardScaler()
    svm_pca = PCA()
    svm = SVC()

    svm_ppl = Pipeline(steps=[('scaler', svm_scaler), ('pca', svm_pca), ('svm', svm)])

    svm_param_grid = {
        'pca__n_components': list(range(1, 11)),
        'svm__kernel': ['linear', 'rbf', 'poly']
    }

    svm_grid_search = GridSearchCV(svm_ppl, svm_param_grid, cv=5, scoring='accuracy')
#     svm_scores = cross_val_score(svm_grid_search, data_X, data_Y, cv=10)
#     svm_preds = cross_val_predict(svm_grid_search, data_X, data_Y, cv=10)
#     print("Accuracy:", svm_scores.mean()*100, "%")
#     print("classification report:\n",classification_report(data_Y, svm_preds))
    return svm_grid_search

# SVM_trainer(data_X, data_Y)
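`SVM_trainer` returns an unfitted `GridSearchCV` object, so the number of PCA components and the kernel are only chosen once it is fitted. A minimal sketch of that step is shown below on synthetic data from `make_classification`; the data, the reduced parameter grid, and the variable names are assumptions made to keep the example small and fast.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the feature matrix and the up/down labels.
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           random_state=0)

# Same three stages as the SVM pipeline: scale, reduce dimensions, classify.
pipeline = Pipeline(steps=[("scaler", StandardScaler()),
                           ("pca", PCA()),
                           ("svm", SVC())])

# Smaller grid than in the notebook so that the sketch runs quickly.
param_grid = {"pca__n_components": [1, 2, 3],
              "svm__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)  # tries every parameter combination with 5-fold CV

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```

Because `n_components` cannot exceed the number of input features, a grid such as `range(1, 11)` presumes the data has at least 10 feature columns.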

In [None]:
# KNN
def KNN_trainer(data_X, data_Y):
    scaler = StandardScaler()
    pca = PCA()
    knn_classifier = KNeighborsClassifier(n_neighbors=7)
    ppl = Pipeline(steps=[('scaler', scaler), ('pca', pca), ('knn', knn_classifier)])
#     scores = cross_val_score(ppl, data_X, data_Y, cv=5) 
#     print("Accuracy:", scores.mean()*100, "%")

    param_grid = {
        'pca__n_components': list(range(1, 11)),
        'knn__n_neighbors': list(range(1, 26))
    }

    knn_grid_search = GridSearchCV(ppl, param_grid, cv=5, scoring='accuracy')
#     knn_grid_search.fit(data_X, data_Y)
#     print("Best parameters:", knn_grid_search.best_params_)
#     print("Best score:", knn_grid_search.best_score_*100, "%")

#     knn_nested_score = cross_val_score(knn_grid_search, data_X, data_Y, cv=5)
#     print("Accuracy:", knn_nested_score.mean()*100, "%")
    return knn_grid_search
# knn_grid_search = KNN_trainer(data_X, data_Y)

The Neural Network model performs three steps:

- Scales the data so that it is normalized, which reduces runtime and prevents some features from outweighing others.
- Tests varying hidden layer sizes and activation functions to determine which gives the best performance.
- Runs the neural network.

Using these steps, we can reduce the time spent training the Neural Network model without severely impacting the accuracy.

In [3]:
# Neural Network
@ignore_warnings(category=ConvergenceWarning)
def NN_trainer(data_X, data_Y):
    nn_scaler = StandardScaler()
    nn = MLPClassifier()

    nn_ppl = Pipeline(steps=[('scaler', nn_scaler), ('nn', nn)])
    nn_param_grid = {
        'nn__hidden_layer_sizes': list(range(30, 61, 10)),
        'nn__activation': ['logistic', 'tanh', 'relu']
    }
    nn_grid_search = GridSearchCV(nn_ppl, nn_param_grid, cv=5, scoring='accuracy')
#     nn_scores = cross_val_score(nn_grid_search, data_X, data_Y, cv=5)
#     print("Accuracy:", nn_scores.mean()*100, "%")
    return nn_grid_search

# nn_grid_search = NN_trainer(data_X, data_Y)

In [4]:
# Ensemble
@ignore_warnings(category=ConvergenceWarning)
def ensamble_trainer(svm_grid_search, knn_grid_search, nn_grid_search, data_X, data_Y):
    eclf = VotingClassifier(
        estimators=[('svm', svm_grid_search), ('knn', knn_grid_search), ('nn', nn_grid_search)],
        voting='hard')
    for clf, label in zip([svm_grid_search, knn_grid_search, nn_grid_search, eclf], ['SVM', 'KNN', 'Neural Network', 'Ensemble']):
        scores = cross_val_score(clf, data_X, data_Y, scoring='accuracy', cv=5)
        print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
        
# ensamble_trainer(svm_grid_search, knn_grid_search, nn_grid_search, data_X, data_Y)
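The ensemble uses `voting='hard'`, meaning each model casts one vote per sample and the majority class wins. The idea can be sketched with plain NumPy as below; the function `hard_vote` and the prediction arrays are hypothetical and only illustrate the voting rule, not the library's internals.

```python
import numpy as np

def hard_vote(predictions: np.ndarray) -> np.ndarray:
    """Majority vote over integer class labels.

    predictions has shape (n_models, n_samples). Ties go to the
    smaller label because argmax returns the first maximum.
    """
    votes = []
    for sample_votes in predictions.T:       # one column per sample
        counts = np.bincount(sample_votes)   # votes per class label
        votes.append(int(np.argmax(counts)))
    return np.array(votes)

# Hypothetical predictions of three models for four days (1 = up, 0 = down).
preds = np.array([
    [1, 0, 1, 1],  # SVM
    [0, 0, 1, 0],  # KNN
    [1, 1, 1, 0],  # neural network
])
ensemble = hard_vote(preds)
print(ensemble.tolist())
```

With three voters and two classes a tie cannot occur, which is one reason an odd number of models is convenient for hard voting.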

# Results with Various Feature Engineering

We tried three different feature engineering setups.

## 1: without tracking the previous day; no date, TEDSpread, EFFR
* In the first setup, we removed the date because we wanted to see whether we could get good prediction results under the assumption that each day’s stock price is independent of the previous days. We also removed the TEDSpread and EFFR features because we initially believed they were not closely related to the prediction of the stock price.

In [None]:
# DOW30
data_dow_processed = data_dow.drop(['Date', 'TEDSpread', 'EFFR'],axis=1)
# data_dow_processed.head()
data_dow_Y = data_dow_processed['LABEL']
data_dow_X = data_dow_processed.drop(['LABEL'],axis=1)
svm_grid_search = SVM_trainer(data_dow_X, data_dow_Y)
knn_grid_search = KNN_trainer(data_dow_X, data_dow_Y)
nn_grid_search = NN_trainer(data_dow_X, data_dow_Y)
ensamble_trainer(svm_grid_search, knn_grid_search, nn_grid_search, data_dow_X, data_dow_Y)

##### Results
    Accuracy: 0.54 (+/- 0.00) [SVM]
    Accuracy: 0.50 (+/- 0.03) [KNN]
    Accuracy: 0.52 (+/- 0.04) [Neural Network]
    Accuracy: 0.53 (+/- 0.03) [Ensemble]

In [None]:
# SP500
data_sp_processed = data_sp.drop(['Date', 'TEDSpread', 'EFFR'],axis=1)
data_sp_Y = data_sp_processed['LABEL']
data_sp_X = data_sp_processed.drop(['LABEL'],axis=1)
svm_grid_search = SVM_trainer(data_sp_X, data_sp_Y)
knn_grid_search = KNN_trainer(data_sp_X, data_sp_Y)
nn_grid_search = NN_trainer(data_sp_X, data_sp_Y)
ensamble_trainer(svm_grid_search, knn_grid_search, nn_grid_search, data_sp_X, data_sp_Y)

##### Results
    Accuracy: 0.53 (+/- 0.04) [SVM]
    Accuracy: 0.50 (+/- 0.02) [KNN]
    Accuracy: 0.53 (+/- 0.03) [Neural Network]
    Accuracy: 0.51 (+/- 0.04) [Ensemble]

In [None]:
# NASDAQ
data_nas_processed = data_nas.drop(['Date', 'TEDSpread', 'EFFR'],axis=1)
data_nas_Y = data_nas_processed['LABEL']
data_nas_X = data_nas_processed.drop(['LABEL'],axis=1)
svm_grid_search = SVM_trainer(data_nas_X, data_nas_Y)
knn_grid_search = KNN_trainer(data_nas_X, data_nas_Y)
nn_grid_search = NN_trainer(data_nas_X, data_nas_Y)
ensamble_trainer(svm_grid_search, knn_grid_search, nn_grid_search, data_nas_X, data_nas_Y)

##### Results
    Accuracy: 0.56 (+/- 0.00) [SVM]
    Accuracy: 0.47 (+/- 0.04) [KNN]
    Accuracy: 0.56 (+/- 0.00) [Neural Network]
    Accuracy: 0.55 (+/- 0.00) [Ensemble]

## 2: tracking the previous day; no date, TEDSpread, EFFR
* In the second setup, we added the previous day's labels from the other two datasets as features, and we removed the date, TEDSpread, and EFFR as we did in the first. Comparing the first and the second setups lets us judge whether adding the previous day's labels of the other two indexes improves accuracy.

In [None]:
# preprocess data
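def process_data(target_dataset, dataset_label1, dataset_label2, label1, label2):
    # Drop the columns that are not used as features in this setup.
    data_processed = target_dataset.drop(['Date', 'TEDSpread', 'EFFR'],axis=1)
    # Take the second column (position 1) of each of the other two datasets.
    labels1 = dataset_label1.iloc[0:, 1]
    labels2 = dataset_label2.iloc[0:, 1]
    # Shift each added column down one row so a day only sees the previous day's value.
    data_processed[label1] = labels1
    data_processed[label1] = data_processed[label1].shift(periods=1, fill_value=-1)
    data_processed[label2] = labels2
    data_processed[label2] = data_processed[label2].shift(periods=1, fill_value=-1)
    # The first row has no previous day (it holds the -1 fill value), so drop it.
    data_processed = data_processed.iloc[1: , :]
    return data_processed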
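The effect of `shift(periods=1, fill_value=-1)` followed by dropping the first row can be seen on a small example. This is a sketch with made-up label values; the frame and column names `target`, `other`, and `OTHER_PREV` are hypothetical.

```python
import pandas as pd

# Made-up labels for four consecutive days (1 = up, 0 = down).
target = pd.DataFrame({"LABEL": [1, 0, 1, 1]})
other = pd.DataFrame({"LABEL": [0, 1, 0, 0]})

aligned = target.copy()
# Move the other index's labels down one row: day t now holds day t-1's label.
aligned["OTHER_PREV"] = other["LABEL"].shift(periods=1, fill_value=-1)
# Day 0 has no previous day and only holds the -1 placeholder, so drop it.
aligned = aligned.iloc[1:, :]

print(aligned)
```

After the shift, each remaining row pairs a day's own label with the other index's label from the day before, so no same-day information from the other index is used as a feature.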

In [None]:
# DOW30
data_dow_processed = process_data(data_dow, data_sp, data_nas, "SP500", "NASDAQ")
# print(data_dow_processed.head())
data_dow_Y = data_dow_processed['LABEL']
data_dow_X = data_dow_processed.drop(['LABEL'],axis=1)
svm_grid_search = SVM_trainer(data_dow_X, data_dow_Y)
knn_grid_search = KNN_trainer(data_dow_X, data_dow_Y)
nn_grid_search = NN_trainer(data_dow_X, data_dow_Y)
ensamble_trainer(svm_grid_search, knn_grid_search, nn_grid_search, data_dow_X, data_dow_Y)

##### Results
    Accuracy: 0.53 (+/- 0.01) [SVM]
    Accuracy: 0.51 (+/- 0.03) [KNN]
    Accuracy: 0.52 (+/- 0.03) [Neural Network]
    Accuracy: 0.53 (+/- 0.02) [Ensemble]

In [None]:
# SP500
data_sp_processed = process_data(data_sp, data_dow, data_nas, "DOW30", "NASDAQ")
data_sp_Y = data_sp_processed['LABEL']
data_sp_X = data_sp_processed.drop(['LABEL'],axis=1)
svm_grid_search = SVM_trainer(data_sp_X, data_sp_Y)
knn_grid_search = KNN_trainer(data_sp_X, data_sp_Y)
nn_grid_search = NN_trainer(data_sp_X, data_sp_Y)
ensamble_trainer(svm_grid_search, knn_grid_search, nn_grid_search, data_sp_X, data_sp_Y)

##### Results
    Accuracy: 0.55 (+/- 0.00) [SVM]
    Accuracy: 0.50 (+/- 0.02) [KNN]
    Accuracy: 0.54 (+/- 0.01) [Neural Network]
    Accuracy: 0.54 (+/- 0.01) [Ensemble]

In [None]:
# NASDAQ
data_nas_processed = process_data(data_nas, data_dow, data_sp, "DOW30", "SP500")
data_nas_Y = data_nas_processed['LABEL']
data_nas_X = data_nas_processed.drop(['LABEL'],axis=1)
svm_grid_search = SVM_trainer(data_nas_X, data_nas_Y)
knn_grid_search = KNN_trainer(data_nas_X, data_nas_Y)
nn_grid_search = NN_trainer(data_nas_X, data_nas_Y)
ensamble_trainer(svm_grid_search, knn_grid_search, nn_grid_search, data_nas_X, data_nas_Y)

##### Results
    Accuracy: 0.55 (+/- 0.02) [SVM]
    Accuracy: 0.50 (+/- 0.04) [KNN]
    Accuracy: 0.56 (+/- 0.00) [Neural Network]
    Accuracy: 0.54 (+/- 0.02) [Ensemble]

## 3: tracking the previous day; no date
* In the third setup, we added the previous day's labels from the other two datasets as features, and we removed only the date. Comparing the second and third setups lets us judge whether adding the TEDSpread and EFFR features improves accuracy.

In [None]:
# preprocess data
def process_data_2(target_dataset, dataset_label1, dataset_label2, label1, label2):
    # Same steps as process_data, except that TEDSpread and EFFR are kept as features.
    data_processed = target_dataset.drop(['Date'],axis=1)
    labels1 = dataset_label1.iloc[0:, 1]
    labels2 = dataset_label2.iloc[0:, 1]
    data_processed[label1] = labels1
    data_processed[label1] = data_processed[label1].shift(periods=1, fill_value=-1)
    data_processed[label2] = labels2
    data_processed[label2] = data_processed[label2].shift(periods=1, fill_value=-1)
    data_processed = data_processed.iloc[1: , :]
    return data_processed

In [None]:
# DOW30
data_dow_processed = process_data_2(data_dow, data_sp, data_nas, "SP500", "NASDAQ")
# print(data_dow_processed.head())
data_dow_Y = data_dow_processed['LABEL']
data_dow_X = data_dow_processed.drop(['LABEL'],axis=1)
svm_grid_search = SVM_trainer(data_dow_X, data_dow_Y)
knn_grid_search = KNN_trainer(data_dow_X, data_dow_Y)
nn_grid_search = NN_trainer(data_dow_X, data_dow_Y)
ensamble_trainer(svm_grid_search, knn_grid_search, nn_grid_search, data_dow_X, data_dow_Y)

##### Results
    Accuracy: 0.54 (+/- 0.01) [SVM]
    Accuracy: 0.50 (+/- 0.03) [KNN]
    Accuracy: 0.53 (+/- 0.03) [Neural Network]
    Accuracy: 0.53 (+/- 0.02) [Ensemble]

In [None]:
# SP500
data_sp_processed = process_data_2(data_sp, data_dow, data_nas, "DOW30", "NASDAQ")
data_sp_Y = data_sp_processed['LABEL']
data_sp_X = data_sp_processed.drop(['LABEL'],axis=1)
svm_grid_search = SVM_trainer(data_sp_X, data_sp_Y)
knn_grid_search = KNN_trainer(data_sp_X, data_sp_Y)
nn_grid_search = NN_trainer(data_sp_X, data_sp_Y)
ensamble_trainer(svm_grid_search, knn_grid_search, nn_grid_search, data_sp_X, data_sp_Y)

##### Results
    Accuracy: 0.55 (+/- 0.00) [SVM]
    Accuracy: 0.50 (+/- 0.02) [KNN]
    Accuracy: 0.53 (+/- 0.03) [Neural Network]
    Accuracy: 0.53 (+/- 0.02) [Ensemble]

In [None]:
# NASDAQ
data_nas_processed = process_data_2(data_nas, data_dow, data_sp, "DOW30", "SP500")
data_nas_Y = data_nas_processed['LABEL']
data_nas_X = data_nas_processed.drop(['LABEL'],axis=1)
svm_grid_search = SVM_trainer(data_nas_X, data_nas_Y)
knn_grid_search = KNN_trainer(data_nas_X, data_nas_Y)
nn_grid_search = NN_trainer(data_nas_X, data_nas_Y)
ensamble_trainer(svm_grid_search, knn_grid_search, nn_grid_search, data_nas_X, data_nas_Y)

##### Results
    Accuracy: 0.55 (+/- 0.02) [SVM]
    Accuracy: 0.48 (+/- 0.05) [KNN]
    Accuracy: 0.55 (+/- 0.01) [Neural Network]
    Accuracy: 0.54 (+/- 0.02) [Ensemble]
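To compare the three setups side by side, the accuracies printed in the Results blocks above can be collected into one table. The sketch below only aggregates those reported numbers (mean over the three indexes per setup and model); the variable names are illustrative, and the averaging is an assumption about one reasonable way to summarize them.

```python
import pandas as pd

models = ["SVM", "KNN", "Neural Network", "Ensemble"]

# Mean accuracies copied from the Results blocks: (setup, index) -> per-model values.
reported = {
    (1, "DOW30"):  [0.54, 0.50, 0.52, 0.53],
    (1, "SP500"):  [0.53, 0.50, 0.53, 0.51],
    (1, "NASDAQ"): [0.56, 0.47, 0.56, 0.55],
    (2, "DOW30"):  [0.53, 0.51, 0.52, 0.53],
    (2, "SP500"):  [0.55, 0.50, 0.54, 0.54],
    (2, "NASDAQ"): [0.55, 0.50, 0.56, 0.54],
    (3, "DOW30"):  [0.54, 0.50, 0.53, 0.53],
    (3, "SP500"):  [0.55, 0.50, 0.53, 0.53],
    (3, "NASDAQ"): [0.55, 0.48, 0.55, 0.54],
}

index = pd.MultiIndex.from_tuples(list(reported.keys()), names=["setup", "market"])
table = pd.DataFrame(list(reported.values()), index=index, columns=models)

# Average each model's accuracy over the three indexes, per setup.
summary = table.groupby(level="setup").mean()
print(summary.round(3))
```

The per-setup means differ by about one percentage point or less for every model, which is smaller than several of the standard deviations printed alongside the individual results.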