# CS440 Term Project

Data Source: https://finance.yahoo.com/

### Project Description

In today’s modern world, staying updated on technological advancements has become a necessity for businesses to satisfy their customers, achieve desired business goals, and most importantly outsmart the competition. In recent years, the financial sector has seen a rapid acceleration in the use of Artificial Intelligence (AI) and Machine Learning (ML) due to improved software and hardware. This has led to better outcomes for both consumers and businesses. Hedge funds were the primary users of AI and ML in financial services, but in the last few years, the spread of ML applications was seen in banks, insurance firms, and other financial institutions, to name a few. But the most steadfast and steep increase has been seen in the stock markets. AI and ML are shaping the future of stock markets. Using different techniques and deep learning algorithms, it analyzes millions of data points, predicts forecast markets with better accuracy, and as a result, there is a higher probability for higher profits and returns. 

The prediction of the volatile and unpredictable stock market has been challenging in recent years, since there are so many factors to take into consideration, such as economic factors, interest rate changes, and fiscal policy. While humans remain a large part of the trading, the stock market has become more efficient and accurate because of the recent developments in AI and ML. These techniques have made it easier for beginners to invest in the stock market. AI and ML also use pattern recognition and help gather unbiased information which leads to better predictions for traders and investors. 

The goal of this project is to train stock market datasets using different AI and ML algorithms such as Neural Networks, k-Nearest Neighbors (kNN), and Logistic Regression and try to find the best predictions. We aim to get results using the above three algorithms and see which ones best predict the outcomes of the stock market. The goal is to see which algorithm more accurately and efficiently predicts the stock market. The benefits of using these algorithms to predict outcomes are that it extracts noise and leaves out as much signal as possible and time complexity is less, hence results are available faster. 

### Data Description

This dataset includes the historical daily prices and volume information for US stocks and ETFs trading on NASDAQ, NYSE, and NYSE MKT. We do not believe using stock open and close prices and volume will be enough to predict accurately. We will need to use other ‘technical’ data like RSI which represents the relative strength of a stock and other indicators along with the open and close price.

In [None]:
# Imports
import numpy as np
import pandas as pd
import math
import pandas_ta as pta
from matplotlib import pyplot as plt
%matplotlib inline
#import talib as ta
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv1D, Flatten, MaxPooling1D, Dropout, LeakyReLU, LSTM
from sklearn import model_selection
from sklearn.preprocessing import MinMaxScaler, StandardScaler, MinMaxScaler
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from pandas.plotting import register_matplotlib_converters
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer



### S&P 500

In [None]:
# Read SpyData CSV
spyData = pd.read_csv('Data/XOM.csv', sep = ",")

#Compute RSI
"""
rsi = pta.rsi(np.array(spyData.Close), length = 14)
"""

#Compute EMA**

#Compute STOCH %K & %D

stochKD = pta.stoch(low = spyData.Low, high = spyData.High, close = spyData.Close, k=14, d=3, smooth_k = 9)
stochKD = stochKD.iloc[10: , :]
spyData = spyData.iloc[23: , :]

spyData.reset_index(inplace = True)
spyData = spyData.drop("index", 1)

spyData["%K"] = np.array(stochKD.STOCHk_14_3_9)
spyData["%D"] = np.array(stochKD.STOCHd_14_3_9)

#Adding the slope the stoch lines (1 pos, 0 neg)
""""
stochSlope = []
for i in range(spyData.shape[0]):
    if(spyData["%K"][i] > spyData["%D"][i]):
        stochSlope.append(1)
    if(spyData["%K"][i] < spyData["%D"][i]):
        stochSlope.append(0)

spyData["Stoch Slope"] = np.array(stochSlope)
"""

#Compute Price 1 | 0 price in 10d (1 = up & 0 = down)

tenDay = []
for i in range(spyData.shape[0]-10):
    if(spyData.Close[i] - spyData.Close[i+10] < 0):
        tenDay.append(1)
    else:
        tenDay.append(0)
spyData = spyData.iloc[:-10]
spyData["Ten Day Gain"] = np.array(tenDay)

In [None]:
#Remove Columns

spyData = spyData.drop(["Adj Close", "Volume","Date", "%K", "%D"], 1)
#spyData = spyData.drop("Stoch Slope", 1)

In [None]:
spyData

In [None]:
n = spyData.shape[0]
splitRow = int(n * 0.80)
spyData2 = spyData.to_numpy()

#Splits
x_train = spyData2[:splitRow, :-1]
y_train = spyData2[:splitRow, -1]
x_test = spyData2[splitRow:, :-1]
y_test = spyData2[splitRow:, -1]

In [None]:
x_train

In [None]:
#norm = MinMaxScaler()
#x_train = norm.fit_transform(x_train)

x_train = x_train.reshape(x_train.shape[0], x_train.shape[1], 1)
x_test = x_test.reshape(x_test.shape[0], x_test.shape[1], 1)


In [13]:
model = Sequential()
model.add(Dense(12, input_dim=4, activation="relu"))
model.add(Dense(12, activation="relu"))
model.add(Dense(1, activation="softmax")) #SOFTMAX

In [14]:
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

In [15]:
model.fit(x_train,y_train, epochs=150, batch_size=1000)


Epoch 1/150
Epoch 2/150
Epoch 3/150
Epoch 4/150
Epoch 5/150
Epoch 6/150
Epoch 7/150
Epoch 8/150
Epoch 9/150
Epoch 10/150
Epoch 11/150
Epoch 12/150
Epoch 13/150
Epoch 14/150
Epoch 15/150
Epoch 16/150
Epoch 17/150
Epoch 18/150
Epoch 19/150
Epoch 20/150
Epoch 21/150
Epoch 22/150
Epoch 23/150
Epoch 24/150
Epoch 25/150
Epoch 26/150
Epoch 27/150
Epoch 28/150
Epoch 29/150
Epoch 30/150
Epoch 31/150
Epoch 32/150
Epoch 33/150
Epoch 34/150
Epoch 35/150
Epoch 36/150
Epoch 37/150
Epoch 38/150
Epoch 39/150
Epoch 40/150
Epoch 41/150
Epoch 42/150
Epoch 43/150
Epoch 44/150
Epoch 45/150
Epoch 46/150
Epoch 47/150
Epoch 48/150
Epoch 49/150
Epoch 50/150
Epoch 51/150
Epoch 52/150
Epoch 53/150
Epoch 54/150
Epoch 55/150
Epoch 56/150
Epoch 57/150
Epoch 58/150
Epoch 59/150
Epoch 60/150
Epoch 61/150
Epoch 62/150
Epoch 63/150
Epoch 64/150
Epoch 65/150
Epoch 66/150
Epoch 67/150
Epoch 68/150
Epoch 69/150
Epoch 70/150
Epoch 71/150
Epoch 72/150
Epoch 73/150
Epoch 74/150
Epoch 75/150
Epoch 76/150
Epoch 77/150
Epoch 78

<keras.callbacks.History at 0x2d5c8def940>

In [16]:
pred = model.predict(x_test)
pred

array([[1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],

### TESLA

#### Data Preprocessing

In [None]:
# Read In Data
tsla = pd.read_csv('Data/TSLA.csv', sep = ",")
tsla.head()

#### Convolutional Neural Network

In [None]:
tsla1 = tsla.dropna()
tsla1['Open10'] = tsla1['Open'].shift(periods = -14)
tsla1 = tsla1.dropna()
tsla1['label'] = np.where(tsla1['Open'] < tsla1['Open10'], 1, 0)
tsla1a = tsla1.drop(labels = ['Date', 'Adj Close', 'Open10'], axis = 1)
tsla1a

In [None]:
x, _ = tsla1a.shape
splitRow = int(x * 0.80)
tsla1b = tsla1a.to_numpy()

# Training data and testing data
trainX = tsla1b[:splitRow, :-1]
trainY = tsla1b[:splitRow, -1]
testX = tsla1b[splitRow:, :-1]
testY = tsla1b[splitRow:, -1] 

xPlot = range(0, x)
plt.figure()
plt.title('Data Separation')
plt.grid(True)
plt.ylabel('Open Price')
plt.plot(xPlot[:splitRow], trainX[:,0], 'blue', label='Train data')
plt.plot(xPlot[splitRow:], testX[:,0], 'red', label='Test data')
plt.legend()
plt.show()
plt.close()

In [None]:
norm = MinMaxScaler()
trainX = norm.fit_transform(trainX)

trainX = trainX.reshape(trainX.shape[0], trainX.shape[1], 1)
testX = testX.reshape(testX.shape[0], testX.shape[1], 1)

nFeatures = trainX.shape[1]
epochs = 20
batchSize = 1000
nOutput = 1
kernelSize = 1


In [None]:
model = Sequential()
model.add(Conv1D(filters = 32, kernel_size = kernelSize, padding = 'same', activation = 'relu', input_shape = (nFeatures, 1)))
model.add(Conv1D(filters = 64, kernel_size = kernelSize, padding = 'same'))
model.add(LeakyReLU(alpha = 0.01))
model.add(MaxPooling1D(pool_size = (1)))
model.add(Conv1D(filters = 128, kernel_size = kernelSize, padding = 'same'))
model.add(LeakyReLU(alpha = 0.01))
model.add(Flatten())
model.add(Dense(256,))
model.add(LeakyReLU(alpha=0.01))
model.add(Dropout(0.8))
model.add(Dense(nOutput, activation='sigmoid'))
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
model.summary()

fitReturn = model.fit(trainX, trainY, validation_data=(testX, testY), epochs = epochs, batch_size = batchSize, verbose = 1)

In [None]:
plt.title('Loss')
plt.grid(True)
plt.xlabel('Epochs')
plt.ylabel('Values')
plt.plot(fitReturn.history['loss'], 'blue', label='Train Loss')
plt.plot(fitReturn.history['val_loss'], 'red', label='Test Loss')
plt.legend()
plt.show()
plt.close()

In [None]:
plt.title('Accuracy')
plt.grid(True)
plt.xlabel('Epochs')
plt.ylabel('Values')
plt.plot(fitReturn.history['accuracy'], 'blue', label='Train Accuracy')
plt.plot(fitReturn.history['val_accuracy'], 'red', label='Test Accuracy')
plt.legend()
plt.show()
plt.close()

In [None]:
predictY = model.predict(testX)
predictY = predictY[:, 0]

# Basic Counting
testY0 = (testY == 0).sum()
testY1 = (testY == 1).sum()
print("Test Set - Sell signal : "+str(testY0))
print("Test Set - Buy signal  : "+str(testY1))
print("="*40)
predictY0 = (predictY == 0).sum()
predictY1 = (predictY == 1).sum()
print("Predicted - Sell signal : "+str(predictY0))
print("Predicted - Buy signal  : "+str(predictY1))

In [None]:
accuracy = accuracy_score(testY, predictY)
precision = precision_score(testY, predictY)
print('Accuracy: '+str(accuracy))
print('Precision: '+str(precision))

#### Logistic Regression

In [None]:
tsla2 = tsla.dropna()
tsla2

In [None]:
tsla2['S_10'] = tsla2['Close'].rolling(window = 14).mean()
tsla2['Corr'] = tsla2['Close'].rolling(window = 14).corr(tsla2['S_10'])
#tsla2['RSI'] = ta.RSI(np.array(tsla2['Close']), timeperiod = 14)
# The difference between the open price of yesterday and today
tsla2['Open-Close'] = tsla2['Open'] - tsla2['Close'].shift(1)
# The difference close price of yesterday and the open price of today
tsla2['Open-Open'] = tsla2['Open'] - tsla2['Open'].shift(1)
tsla2 = tsla2.dropna()
tsla2 = tsla2.drop(['Date', 'Adj Close', 'Volume'], axis = 1)
X = tsla2.iloc[:,:9]
tsla2
X

In [None]:
y = np.where(tsla2['Close'].shift(-1) > tsla2['Close'],1,-1)

In [None]:
split = int(0.7 * len(tsla2))
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

In [None]:
model = LogisticRegression()
model = model.fit (X_train, y_train)
pd.DataFrame(zip(X.columns, np.transpose(model.coef_)))

In [None]:
probability = model.predict_proba(X_test)
print(probability)

In [None]:
probability = model.predict_proba(X_test)
print(probability)

predicted = model.predict(X_test)

In [None]:
print(metrics.confusion_matrix(y_test, predicted))
print(metrics.classification_report(y_test, predicted))
print(model.score(X_test,y_test))

In [None]:
cross_val = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
print(cross_val)
print(cross_val.mean())

#### k-Nearest Neigbors

In [None]:
tsla3 = tsla.dropna()
tsla3

In [None]:
tsla3 = tsla3[['Open', 'High', 'Close', 'Low']]
tsla3['Open-Close'] = tsla3.Open - tsla3.Close
tsla3['High-Low'] = tsla3.High = tsla3.Low
tsla3 = tsla3.dropna()
X_knn = tsla3[['Open-Close', 'High-Low']]
Y_knn = np.where(tsla3['Close'].shift(-1) > tsla3['Close'], 1, -1)
split3 = int(0.7 * len(tsla3))

X_train_knn = tsla3[:split3]
Y_train_knn = Y_knn[:split3]
X_test_knn = tsla3[split3:]
Y_test_knn = Y_knn[split3:]

In [None]:
train = []
test = []

for i in range(1, 151, 5):
    knn = KNeighborsClassifier(n_neighbors = i * 10)
    knn.fit(X_train_knn, Y_train_knn)
    train.append(accuracy_score(Y_train_knn, knn.predict(X_train_knn)))
    test.append(accuracy_score(Y_test_knn, knn.predict(X_test_knn)))

In [None]:
plt.plot(train)
plt.plot(test)

In [None]:
print("Average of the training set =", round(sum(train)/len(train), 2))
print("Average of the testing set =", round(sum(test)/len(test), 2))

### LSTM

In [None]:
plt.figure(figsize=(18,5))
plt.title('Close Price History')
plt.plot(tsla['Close'])
plt.xlabel('Data', fontsize=15 )
plt.ylabel('Close Price USD ($)', fontsize=15)
plt.show()

In [None]:
#Create a dataframe with only the 'Create Column
data = tsla.filter(['Close'])
print(data)
#convert the dataframe to a numpy array
dataset = data.values
#get the number of rows to train the model on
training_data_len= math.ceil(len(dataset)* 0.85)

In [None]:
#Scale the all of the data to be values between 0 and 1 
scaler = MinMaxScaler(feature_range=(0,1))
scaled_data = scaler.fit_transform(dataset)

In [None]:
#Create the scaled training data set 
train_data = scaled_data[0:training_data_len  , : ]
#Split the data into x_train and y_train data sets
x_train=[]
y_train = []
for i in range(180,len(train_data)):
    x_train.append(train_data[i-180:i,0])
    y_train.append(train_data[i,0])


In [None]:
x_train, y_train = np.array(x_train), np.array(y_train)
#Reshape the data into the shape accepted by the LSTM
x_train = np.reshape(x_train, (x_train.shape[0],x_train.shape[1],1))
x_train.shape

In [None]:
model = Sequential()
model.add(LSTM(units=50, return_sequences=True,input_shape=(x_train.shape[1],1)))
model.add(LSTM(units=50, return_sequences=False))
model.add(Dense(units=25))
model.add(Dense(units=1))

In [None]:
#Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')
#Train the model
model.fit(x_train, y_train, batch_size=1, epochs=1)

In [None]:
#Test data set
test_data = scaled_data[training_data_len - 180: , : ]
#Create the x_test and y_test data sets
x_test = []
y_test =  dataset[training_data_len : , : ] 
for i in range(180,len(test_data)):
    x_test.append(test_data[i-180:i,0])

In [None]:
x_test = np.array(x_test)
x_test = np.reshape(x_test, (x_test.shape[0],x_test.shape[1],1))
#Getting the models predicted price values
predictions = model.predict(x_test) 
predictions = scaler.inverse_transform(predictions)
#Undo scaling
rmse = math.sqrt(mean_squared_error(y_test, predictions))
mae = mean_absolute_error(y_test, predictions)

print(rmse)
print(mae)

In [None]:
# Visualising the results
plt.plot(y_test,color='red',label='Real Tesla Stock price')
plt.plot(predictions,color='blue',label='Predicted Tesla Stock price')
plt.title('Tesla stock price prediction using LSTM')
plt.xlabel('Timeline (13th July- 14th August 2020)')
plt.ylabel('Stock Price')
plt.legend()
plt.show()