# Simple Stock Price Prediction Model
Here I implement a simple MLP neural network for stock price prediction. My GitHub profile is htjames0, and all code and files can be found in the repository.

In [26]:
import tensorflow as tf
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

## Data Methods

### Introduction
The data methods convert the pre-split training and test data into clean, usable arrays that are fed to the model. The first method, getTrainData, converts the training data into the appropriate scale, size, and format. The second method, getTestData, prepares the test data used to measure how accurately the model predicts the price of a stock. The test data is not used until all training parameters are tuned and the model is fully optimized on the training data. Hyperparameter tuning includes finding the optimal step size, activation function, and number of hidden neurons, in addition to the optimal window (e.g. 15, 30, 45, or 90 days). Each method takes the same parameters, data and window. The data parameter is self-explanatory: pass test_data to getTestData and train_data to getTrainData. The window parameter is an integer that sets how many days to look back in order to predict the price of the stock for the next day. This parameter might be the most confusing, but it is important because the training data is structured around it.
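
As a rough sketch of what the hyperparameter search mentioned above might cover, the snippet below enumerates candidate configurations with `itertools.product`. Only the window values come from the text; the step sizes, activation names, and neuron counts are placeholder assumptions, not values used in this project.

```python
from itertools import product

# Candidate values; only the windows come from the text above,
# the rest are illustrative assumptions.
windows = [15, 30, 45, 90]
step_sizes = [0.001, 0.01, 0.1]
activations = ["relu", "tanh"]
hidden_neurons = [5, 10, 20]

# Every combination is one configuration to train and compare
grid = [
    {"window": w, "step_size": s, "act": a, "hidden_neurons": h}
    for w, s, a, h in product(windows, step_sizes, activations, hidden_neurons)
]

print(len(grid))   # 4 * 3 * 2 * 3 = 72 configurations
print(grid[0])
```

Each configuration would be trained and compared using the training data only, so the test data stays untouched until the end.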

#### Example 1. Understanding the Window:
> For example, suppose the window is 15 days and we have training data from January 1st, 2000 to January 1st, 2002.  The window will start at January 16th, 2000 and move forward in time by one day per iteration.  For the first iteration, the features (Open, High, Low, and Volume) for the past 15 days will be appended into an array, and the label (Closing Price) will be recorded for January 16th.  So for the first iteration the data will look something like this, 


> We see that the closing price and feature data is recorded.  The second iteration will look like this, 


> This process will continue from January 16th, 2000 until January 1st, 2002 in this example, each time incrementing by one day and looking back to collect the previous 15 days' worth of data.  The number of iterations is the number of days in the training data set minus the window length.  Here there are roughly 504 trading days (252 per year) and a window of 15, so the loop would iterate 489 times to collect all the data.  After this is done, the data will be reshaped into a 3D array in which each iteration is one layer.
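
To make the iteration count concrete, here is a minimal sketch that uses integer day indices in place of real dates (day 0 standing in for the first day of the data set). The 504-day length is the rough figure from the example, not the size of an actual data set.

```python
window = 15
n_days = 504  # roughly two years of trading days, as in the example

# Each iteration pairs the previous `window` days (features) with the current day (label)
iterations = [(range(i - window, i), i) for i in range(window, n_days)]

first_days, first_label = iterations[0]
second_days, second_label = iterations[1]

print(first_days[0], first_days[-1], first_label)     # 0 14 15
print(second_days[0], second_days[-1], second_label)  # 1 15 16
print(len(iterations))                                # 489
```

The first label sits on the 16th day (index 15), and each later iteration shifts both the window and the label forward by one day.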

The data will be structured into a 3D array where the rows span the length of the window, the columns are the features, and the layers are the iterations needed to collect the data. One can visualize this as stacking each successive iteration behind the previous one to form an oblong Rubik's Cube shape.  Another way to visualize it is as a deck of baseball cards where each iteration is one card, the length of the card is the number of rows, and the width of the card is the number of features (Figure 1).  The top card is the most distant timeframe of data and the bottom card is the most recent.  

#### Example 1 cont. 
> So continuing the example from above, the rows will be the number of days into the past, 15 days in this example.  The columns are the features (Open, High, Low, and Volume), and the layers represent each iteration of collecting data for the corresponding window.
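
The "deck of cards" layout can be sketched with NumPy. The values below are placeholders standing in for Open, High, Low, and Volume; only the shapes matter here.

```python
import numpy as np

window, n_features, n_days = 15, 4, 504

# Placeholder values standing in for Open, High, Low, and Volume
data = np.arange(n_days * n_features, dtype=float).reshape(n_days, n_features)

# Stack one (window, n_features) "card" per iteration
cards = np.stack([data[i - window:i] for i in range(window, n_days)])

print(cards.shape)      # (489, 15, 4): layers, rows, columns
print(cards[0].shape)   # (15, 4): the oldest card
```

Consecutive cards overlap in all but one row, since the window only moves forward by one day per iteration.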


### Training Data Method
Below is the implementation of the getTrainData method.  Comments in the code explain how it achieves the behavior described above.  The method takes the pre-split training data and the window length, and returns the x_train and y_train data to be fed to the model. 

In [27]:
def getTrainData(data, window=15):
    #initializing feature and label arrays
    x_train = []
    y_train = []

    #iterating from the end of the first window to the length of the data
    #to get a window of days for the data points; the label is the closing
    #price of the day that immediately follows that specific window
    for i in range(window, len(data)):
        x_train.append(data.iloc[i-window:i,[1,2,3,6]])  #the indexing here is specific to the AMD dataset
        y_train.append(data.iloc[i,4])
    
    #converting to numpy array     
    x_train = np.array(x_train)
    y_train = np.array(y_train)

    #3D array in python - (layer, row, column)
    #layer is each iteration of the loop, 1169
    #row is number of days used to predict next day, 90
    #column is the feature, Open, High, Low, or Volume 
    #4 is a bit hardcoded here and could be changed  to be an input variable of the method
    x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 4))
    
    return x_train, y_train
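
The comments above note that the column positions and the feature count are hardcoded. One possible alternative, sketched here under the assumption that the data has columns named `Open`, `High`, `Low`, `Close`, and `Volume`, selects columns by name so the feature count follows from the selection. The function name `get_windows_by_name` and its `feature_cols` and `label_col` parameters are hypothetical, and the toy values are made up.

```python
import numpy as np
import pandas as pd

def get_windows_by_name(data, window=15,
                        feature_cols=("Open", "High", "Low", "Volume"),
                        label_col="Close"):
    # Pull features and labels out once as NumPy arrays
    features = data[list(feature_cols)].to_numpy()
    labels = data[label_col].to_numpy()

    # Same sliding window as getTrainData: `window` past days -> current day's label
    x = np.array([features[i - window:i] for i in range(window, len(data))])
    y = labels[window:]
    return x, y

# Toy frame with made-up values, only to show the resulting shapes
n = 40
toy = pd.DataFrame({
    "Open": np.linspace(10, 20, n),
    "High": np.linspace(11, 21, n),
    "Low": np.linspace(9, 19, n),
    "Close": np.linspace(10.5, 20.5, n),
    "Volume": np.linspace(1, 2, n),
})

x, y = get_windows_by_name(toy, window=15)
print(x.shape, y.shape)   # (25, 15, 4) (25,)
```

Because the array is built directly in (layer, row, column) order, no reshape is needed afterwards.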

### Test Data Method: 
The implementation of getTestData is quite similar to the getTrainData method above. The method takes the parameters data and window, and returns the x_test and y_test data used to test the model.  This data will be used to measure how accurately the model predicts the stock price of the next day.  

In [45]:
#need to think about how the test data is being generated:
#how the window comes into play here...need to feed it 90 days' worth of training data for
#the first iteration of test data generation
    
def getTestData(data, window=90):
    #initializing arrays
    x_test = []
    y_test = []
    
    #iterating over the test data to gather the appropriate features 
    #for the correct label
    for i in range(window, len(data)):
        x_test.append(data.iloc[i-window:i,[1,2,3,6]])
        y_test.append(data.iloc[i,4])
        
    #reshaping the data into a 3D array format
    #converting the lists to numpy arrays
    x_test = np.array(x_test)
    y_test = np.array(y_test)
    
    #changing array shape (uses x_test's own shape rather than the global x_train)
    x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 4))
    
    return x_test, y_test

## The Model


In [103]:
def modelMLP(features, step_size, hidden_neurons=5, act='relu', loss_fxn='mean_squared_error'):
    
    #method inputs
        #features - feature data used to get the shape of the input layer
        #step_size - learning rate for the Adam optimizer
        #hidden_neurons - number of neurons in hidden layer, default of five
        #act - activation function of the output layer, default ReLU (the hidden layer always uses ReLU)
        #loss_fxn - loss function, default as mean squared error
    
    #defining model
    model = tf.keras.models.Sequential()
   
    #layers
    #input shape is (window, number of features) so it matches the 3D feature array
    model.add(tf.keras.layers.Dense(units=hidden_neurons,
                                    input_shape=(features.shape[1], features.shape[2]),
                                    activation='relu')
                                   )
    model.add(tf.keras.layers.Dense(units = 1,
                                    activation=act)
                                   )
    
    #optimizer
    opt = tf.keras.optimizers.Adam(learning_rate=step_size)
    
    #compile - loss fxn MSE
    model.compile(optimizer=opt,
                  loss=loss_fxn,
                  metrics=[tf.keras.metrics.RootMeanSquaredError()])
   
    return model 
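
A Keras `Dense` layer acts on the last axis of its input, so with a 3D input of shape (samples, window, features) the stacked layers return one value per day in each window rather than one value per sample. The NumPy sketch below mimics that behavior with random weights and compares it with flattening each window first. Flattening is shown as one possible way to get a single prediction per window, not as this notebook's method; the `dense` helper and the sample count of 8 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 90, 4))   # (samples, window, features), random placeholder data

def dense(inputs, units, rng):
    # Dense-style layer: random weights applied to the last axis, then ReLU
    w = rng.normal(size=(inputs.shape[-1], units))
    b = np.zeros(units)
    return np.maximum(inputs @ w + b, 0.0)

# Applied directly to the 3D input: one output per day in each window
out_3d = dense(dense(x, 5, rng), 1, rng)
print(out_3d.shape)     # (8, 90, 1)

# Flattened first: one output per sample
x_flat = x.reshape(len(x), -1)    # (8, 360)
out_flat = dense(dense(x_flat, 5, rng), 1, rng)
print(out_flat.shape)   # (8, 1)
```

In Keras, the flattened variant would correspond to placing a `tf.keras.layers.Flatten` layer ahead of the first `Dense` layer.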

In [104]:
#train function
def modelTrain(model, feature, label, epochs, batch_size):

    #feeding features and labels to model, model iterates epoch number of times
    #using batch_size number of data points per iteration
    history = model.fit(x=feature,
                        y=label,
                        batch_size=batch_size,
                        epochs=epochs)

    #weights and bias of the first (hidden) layer
    #get_weights() returns [kernel_0, bias_0, kernel_1, bias_1, ...]
    trained_weight = model.get_weights()[0]
    trained_bias = model.get_weights()[1]

    #historical data of model for each epoch
    epochs = history.epoch
    hist = pd.DataFrame(history.history)

    #rmse for each epoch
    rmse = hist["root_mean_squared_error"]

    return trained_weight, trained_bias, epochs, rmse
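
The `root_mean_squared_error` values read from the training history are the square root of the mean squared error. A small NumPy sketch with made-up prices shows the calculation:

```python
import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 13.0])   # made-up closing prices
y_pred = np.array([11.0, 12.0, 9.0, 13.0])    # made-up predictions

mse = np.mean((y_true - y_pred) ** 2)   # (1 + 0 + 4 + 0) / 4 = 1.25
rmse = np.sqrt(mse)

print(mse)    # 1.25
print(rmse)   # about 1.118
```

Because RMSE is in the same units as the closing price, it is easier to interpret than the raw MSE loss.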

In [105]:
#test function
def modelTest(model, feature, label): 
    #evaluate returns [loss, RMSE] because the model is compiled with an RMSE metric
    loss = model.evaluate(feature, label, verbose=0)
    
    return loss

## Execution


In [107]:
#defining constants 
window = 90

#importing data
data = pd.read_csv('AMD.csv')
data = data.round(3)
data['Volume'] = data['Volume']/1000000

#dividing the data set into train and test data
train_end = data[data.Date=='2021-01-04'].index[0]
train_data = data.iloc[:train_end,:]
test_data = data.iloc[train_end - window:,:]

#calling the Data Methods to get the appropriate data 
x_train, y_train = getTrainData(train_data, window=window)
x_test, y_test = getTestData(test_data, window=window)

#creating model
model = modelMLP(x_train, step_size=0.01, hidden_neurons=5)
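
The test slice above starts `window` rows before `train_end` so that the first test label, the row at `train_end`, has a full window of history behind it. A toy sketch with a 10-row frame and a 3-day window (both made-up sizes) illustrates the indexing:

```python
import pandas as pd

window = 3
data = pd.DataFrame({"Close": range(10)})   # toy data; row i has Close == i
train_end = 6                               # stand-in for the index of the split date

train_data = data.iloc[:train_end, :]
test_data = data.iloc[train_end - window:, :]

# First label produced by the windowing loop on test_data
first_label_row = test_data.iloc[window]

print(train_data.index[-1])        # 5
print(first_label_row["Close"])    # 6, the row at train_end
print(len(test_data) - window)     # 4 test samples (rows 6 to 9)
```

The overlapping rows appear in the test set only as input features; every test label comes from a row at or after `train_end`.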
