# Creating a neural network to predict AAPL red or green days
In this third round of building neural network models, I explore a breadth-wise approach to training data: we consider how the model performs when trained on data from Apple Inc. and other tech giants. Alongside this notebook, I will also train a model that takes a depth-wise approach, where we feed the model more data about a single security beyond the quantitative features of its stock. From these two, we will hopefully get a better picture of how to approach using AI as a tool for trading.

## Considerations
### Securities
* AAPL
* MSFT
* AMZN
* GOOG
* FB
* TSLA
### Features
* price at open
* highest traded price on day
* lowest traded price on day
* price at close (raw)
* adjusted close (accounting for corporate actions such as splits and dividends)
* trading volume on day

### Goal
Given the above features for multiple securities, we want the neural network to predict whether the current day is red or green for AAPL from the preceding `num_prev_days` days of data (50 in the code below). In this model, a green day means a strictly higher close than the previous day; a flat day counts as red.


## 1. Get data
Downloading historical data from Yahoo Finance.

In [1]:
import yfinance as yf
import numpy as np
from sklearn import preprocessing

We consider market data from 2014-2018

In [2]:
start_date = '2014-01-01'
end_date = '2019-01-01'

In [3]:
def my_normalize(arr):
    # Scale a series so that its first value becomes 1.0 (assumes arr[0] != 0)
    scaling_factor = 1 / arr[0]
    return [round(i * scaling_factor, 8) for i in arr]
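To make the effect of this normalization concrete, here is a minimal sketch of the same idea in NumPy. The helper name `normalize_by_first` and the toy numbers are hypothetical and not part of the original notebook; the point is only that series on very different scales (prices, volumes) end up as ratios relative to the first day of the window.

```python
import numpy as np

def normalize_by_first(arr):
    """Divide every value by the first one, so the series starts at 1.0.

    Hypothetical NumPy counterpart of my_normalize; assumes arr[0] != 0.
    """
    arr = np.asarray(arr, dtype=float)
    return np.round(arr / arr[0], 8)

# Toy values, made up for illustration only
prices = [200.0, 210.0, 190.0]
volumes = [4_000_000, 5_000_000, 3_000_000]

print(normalize_by_first(prices))
print(normalize_by_first(volumes))
```

Both outputs start at 1.0 and express later days as a fraction of the first day, so a price column and a volume column become directly comparable.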

In [4]:
aapl = yf.download('AAPL', start=start_date, end=end_date, progress=False)
aapl.rename(columns={'Close':'AAPL_Close'}, inplace=True)
msft = yf.download('MSFT', start=start_date, end=end_date, progress=False)
amzn = yf.download('AMZN', start=start_date, end=end_date, progress=False)
goog = yf.download('GOOG', start=start_date, end=end_date, progress=False)
fb = yf.download('FB', start=start_date, end=end_date, progress=False)
tsla = yf.download('TSLA', start=start_date, end=end_date, progress=False)

In [5]:
import pandas as pd
combined = pd.concat([aapl, msft, amzn, goog, fb, tsla], axis=1)
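One thing to be aware of: since only the AAPL close column is renamed, a plain `pd.concat(..., axis=1)` of frames that share column names leaves repeated labels such as `Open` in the result. The sketch below shows one possible way to keep every column identifiable by prefixing it with its ticker. The helper `combine_with_prefix` and the toy frames are assumptions for illustration, not what the notebook does.

```python
import pandas as pd

def combine_with_prefix(frames):
    """Concatenate per-ticker frames side by side with ticker-prefixed columns.

    frames: dict mapping a ticker string to a DataFrame (hypothetical helper).
    """
    prefixed = [df.add_prefix(f"{ticker}_") for ticker, df in frames.items()]
    return pd.concat(prefixed, axis=1)

# Toy frames standing in for downloaded data
idx = pd.date_range("2014-01-02", periods=3, freq="B")
toy = {
    "AAPL": pd.DataFrame({"Open": [1.0, 2.0, 3.0], "Close": [1.5, 2.5, 3.5]}, index=idx),
    "MSFT": pd.DataFrame({"Open": [4.0, 5.0, 6.0], "Close": [4.5, 5.5, 6.5]}, index=idx),
}

combined_toy = combine_with_prefix(toy)
print(list(combined_toy.columns))
```

With this layout the AAPL close would be addressed as `AAPL_Close`, matching the name the notebook already uses, and every other column would be unique as well.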

## 2. Format data
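We wish to create (`num_prev_days` days by 36 features) arrays, one per sample, in which each column is normalized by its first value. The 36 features are the 6 features listed above for each of the 6 securities.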

In [7]:
training_data = []
num_prev_days = 50  # length of the look-back window, in trading days
num_features = len(combined.axes[1])
def create_training_data(df):
    prepped = []
    for i in range(len(df) - num_prev_days - 1):
        # getting data into chunks for normalization
        data = [ [] for i in range(num_features) ]
        for j in range(num_prev_days):
            row = df.values[i + j]
            for k in range(num_features):
                data[k].append(row[k])
        # normalizing using my_normalize
        normed = [ my_normalize(data[i]) for i in range(num_features) ]
        # rebuilding into a single row
        prepping = [ [] for i in range(num_prev_days) ]
        for a in range(num_prev_days):
            for b in range(num_features):
                prepping[a].append(normed[b][a])
        # calculating 0 = red or 1 = green from the close-to-close change
        delta = df['AAPL_Close'].iloc[i+num_prev_days] - df['AAPL_Close'].iloc[i+num_prev_days-1]
        result = int(delta > 0)  # flat days (delta == 0) count as red
        prepped.append([prepping, result])
    return prepped
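For comparison, the same windowing can be sketched in vectorized NumPy. This is an illustrative alternative under stated assumptions, not the notebook's method: the function `make_windows` is hypothetical, it relies on `numpy.lib.stride_tricks.sliding_window_view` (available in NumPy 1.20 and later), and it produces one more sample than the loop above because it also uses the last window that still has a following day.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(values, close, window):
    """Build normalized look-back windows and red/green labels.

    values: (n_days, n_features) array of raw features
    close:  (n_days,) array of the closing price used for the label
    Returns X with shape (n_days - window, window, n_features) and y of 0/1.
    """
    values = np.asarray(values, dtype=float)
    close = np.asarray(close, dtype=float)
    # sliding_window_view appends the window axis last: (n_windows, n_features, window)
    windows = sliding_window_view(values, window_shape=window, axis=0)
    # reorder to (n_windows, window, n_features); drop the final window, which has no next day
    windows = windows.transpose(0, 2, 1)[:-1]
    # divide each column by its first value within the window
    X = windows / windows[:, :1, :]
    # label 1 (green) if the close after the window exceeds the window's last close
    y = (close[window:] > close[window - 1:-1]).astype(int)
    return X, y

# Toy data: column 0 is a price-like series, column 1 a volume-like series
toy_values = np.array([[1, 10], [3, 20], [5, 30], [4, 40], [6, 50], [8, 60]], dtype=float)
X_toy, y_toy = make_windows(toy_values, toy_values[:, 0], window=3)
print(X_toy.shape, y_toy)
```

The first row of every window is all ones by construction, which is a quick way to check that the normalization was applied per window and per column.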

In [8]:
training_data = create_training_data(combined)


### Randomize

In [9]:
import random
random.shuffle(training_data)

In [10]:
X = []
y = []
for features, label in training_data:
    X.append(features)
    y.append(label)

## 3. Create model

In [11]:
import tensorflow as tf
x_train = np.array(X)  # shape: (samples, num_prev_days, num_features)
y_train = np.array(y)  # 0 = red, 1 = green

### Layers
* Input: flatten (each window of days by features becomes a single vector)
* Hidden: 2 layers of 50 neurons each, rectified linear unit (ReLU) activation function
* Output: 2 neurons, softmax

Note: Through trial and error, I found this configuration to perform reasonably well while avoiding overfitting.
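As a quick check on the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes the window of 50 days by 36 features used in this notebook; `dense_params` is a hypothetical helper written for this illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases for one fully connected layer."""
    return n_inputs * n_units + n_units

window_days, n_features = 50, 36      # num_prev_days and 6 securities x 6 features
flat_inputs = window_days * n_features
layer_sizes = [50, 50, 2]             # two hidden layers and the softmax output

counts = []
n_in = flat_inputs
for units in layer_sizes:
    counts.append(dense_params(n_in, units))
    n_in = units

print(flat_inputs, counts, sum(counts))
```

The flattened input has 1,800 values and the three dense layers hold 90,050, 2,550 and 102 parameters, 92,702 in total. Five years of daily data yield only roughly 1,200 windows, so the first hidden layer alone has far more weights than there are training samples.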

In [87]:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(50, activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(50, activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(2, activation=tf.nn.softmax))

### Optimizer and loss function
* adam - This is a stochastic gradient descent method, based on an "adaptive estimation of first-order and second-order moments" of the gradient. (I may write a walkthrough on SGD. Given a random variable $X$ and integer $k>0$, the $k$-th moment of $X$ is $\mathbb{E}[X^k]$.)
* binary crossentropy - This loss function is useful in binary classification. I am not using it to its full potential in this model: it is more commonly paired with a single sigmoid output than with the two-unit softmax used here, and it is especially helpful when we wish to train multiple binary classifiers at once.
    * The specific formula for calculating loss is given below, where $N$ is the output size, $y_i$ is the true label, and $\hat{y}_i$ is the predicted probability.
    * $\mathrm{Loss} = - \frac{1}{N} \sum_{i=1}^{N} \left[ y_i \cdot \log \hat{y}_i + (1-y_i) \cdot \log (1-\hat{y}_i) \right]$
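A small numeric example may help when reading loss values. The sketch below implements the binary crossentropy formula directly in NumPy (the function and the toy predictions are illustrative assumptions, not Keras internals) and compares a confident, correct prediction with an uninformed one.

```python
import math
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary crossentropy; predictions are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    return float(-np.mean(losses))

# One green day (1) and one red day (0)
confident = binary_crossentropy([1, 0], [0.9, 0.1])   # correct and confident
uninformed = binary_crossentropy([1, 0], [0.5, 0.5])  # always predicts 0.5

print(confident, uninformed, math.log(2))
```

A model that always outputs 0.5 scores exactly $\ln 2 \approx 0.693$, so a loss close to that value (such as the evaluation loss reported in section 5) suggests the predictions carry little information beyond a coin flip.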

In [88]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
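To illustrate what "adaptive estimation of first-order and second-order moments" means for the `adam` optimizer, here is a minimal sketch of the Adam update rule for a single scalar parameter. It is a plain-Python illustration, not the Keras implementation; the hyperparameter defaults are commonly used values and the objective $f(\theta) = \theta^2$ is a made-up toy problem.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a scalar parameter (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad        # running estimate of E[g]   (first moment)
    v = beta2 * v + (1 - beta2) * grad ** 2   # running estimate of E[g^2] (second moment)
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy problem: minimize f(theta) = theta**2, whose gradient is 2 * theta
theta, m, v = 1.0, 0.0, 0.0
history = []
for t in range(1, 4):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
    history.append(theta)

print(history)
```

On the first step the bias-corrected moments reduce to the gradient and its square, so the parameter moves by almost exactly the learning rate regardless of the gradient's magnitude. That scale invariance is the "adaptive" part.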

## 4. Train model

In [89]:
model.fit(x_train, y_train, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x14dddaa60>

## 5. Evaluating model from prices on more recent days
Using market data from 2019-01-01 up to the day the notebook was run.

In [80]:
import yfinance as yf
from datetime import date
# from dateutil.relativedelta import relativedelta

In [81]:
test_start = '2019-01-01'
new_aapl = yf.download('AAPL', start=test_start, end=date.today(), progress=False)
new_aapl.rename(columns={'Close':'AAPL_Close'}, inplace=True)
new_msft = yf.download('MSFT', start=test_start, end=date.today(), progress=False)
new_amzn = yf.download('AMZN', start=test_start, end=date.today(), progress=False)
new_goog = yf.download('GOOG', start=test_start, end=date.today(), progress=False)
new_fb = yf.download('FB', start=test_start, end=date.today(), progress=False)
new_tsla = yf.download('TSLA', start=test_start, end=date.today(), progress=False)


In [82]:
new_combined = pd.concat([new_aapl, new_msft, new_amzn, new_goog, new_fb, new_tsla], axis=1)

In [83]:
val1 = create_training_data(new_combined)
X_eval1 = []
y_eval1 = []
for features, label in val1:
    X_eval1.append(features)
    y_eval1.append(label)


In [90]:
val1_loss, val1_acc = model.evaluate(np.array(X_eval1), np.array(y_eval1))
print(val1_loss)
print(val1_acc)

0.6942554116249084
0.5508317947387695


### Is it better than guessing?
As a sanity check, compare the model against blind guessing. Run this cell several times to get a feel for how a coin-flip strategy performs; its expected accuracy is 50%.

In [91]:
right = 0
for day in y_eval1:
    if day == random.randint(0,1):
        right += 1
guessing = right/len(y_eval1)

In [92]:
print(guessing)

0.49353049907578556
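A single coin-flip run is noisy, so two further reference points may be useful: the accuracy of always predicting the more common class, and how far the model's accuracy sits from 50% in units of the coin flip's standard error. The sketch below is an illustration under stated assumptions. The helper names and the toy labels are hypothetical, and the sample count of 541 is inferred from the two accuracies printed above (298/541 and 267/541), not read from the data.

```python
import math
import numpy as np

def majority_baseline(labels):
    """Accuracy of always predicting the more frequent class."""
    p_green = float(np.mean(labels))
    return max(p_green, 1 - p_green)

def coin_flip_z(model_acc, n_samples):
    """Distance of model_acc from 0.5, in standard errors of a fair coin flip."""
    std_err = math.sqrt(0.25 / n_samples)
    return (model_acc - 0.5) / std_err

# Toy labels for illustration: three green days and one red day
toy_labels = [1, 1, 1, 0]
print(majority_baseline(toy_labels))

# Reported accuracy, with the inferred (assumed) number of evaluation windows
z_reported = coin_flip_z(0.5508317947387695, 541)
print(z_reported)
```

Under these assumptions the reported accuracy sits a little over two standard errors above a coin flip. If green days make up more than half of `y_eval1`, however, the majority baseline is the fairer comparison, and it can be computed with `majority_baseline(y_eval1)`.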


## 6. Conclusions
Creating this notebook decreased my confidence in this approach to using AI. Adding a greater breadth of similar numerical data did not seem to help the model. I also discovered a major flaw in my previous AAPL_NN notebook: the normalization process diminished the influence of every feature except volume. After fixing that, I will work on the fourth notebook, which will look at a greater depth of data. There I will also explore categorical data and look into KNN algorithms, and I will work on implementing PCA in this third notebook.

I believe the lack of improvement in this third notebook stems from how little value the additional data brings. PCA should help prune the data and speed up the computations. The look into categorical data on AAPL in the fourth notebook will hopefully bring in new information.
