In [3]:
import sys
!{sys.executable} -m pip install -r '../requirements.txt'

Collecting absl-py==0.13.0
  Downloading absl_py-0.13.0-py3-none-any.whl (132 kB)
Collecting argon2-cffi==21.1.0
  Downloading argon2_cffi-21.1.0-cp35-abi3-macosx_10_14_x86_64.whl (38 kB)
Collecting astunparse==1.6.3
  Downloading astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Collecting attrs==21.2.0
  Downloading attrs-21.2.0-py2.py3-none-any.whl (53 kB)
Collecting bleach==4.1.0
  Downloading bleach-4.1.0-py2.py3-none-any.whl (157 kB)
Collecting cachetools==4.2.2
  Downloading cachetools-4.2.2-py3-none-any.whl (11 kB)
Collecting certifi==2021.5.30
  Using cached certifi-2021.5.30-py2.py3-none-any.whl (145 kB)
Collecting cffi==1.14.6
  Downloading cffi-1.14.6-cp38-cp38-macosx_10_9_x86_64.whl (176 kB)

Collecting packaging==21.0
  Using cached packaging-21.0-py3-none-any.whl (40 kB)
Collecting pandas==1.3.2
  Downloading pandas-1.3.2-cp38-cp38-macosx_10_9_x86_64.whl (11.4 MB)
Collecting pandas-datareader==0.10.0
  Downloading pandas_datareader-0.10.0-py3-none-any.whl (109 kB)
Collecting parso==0.8.2
  Downloading parso-0.8.2-py2.py3-none-any.whl (94 kB)
Collecting Pillow==8.3.2
  Downloading Pillow-8.3.2-cp38-cp38-macosx_10_10_x86_64.whl (3.0 MB)
Collecting prometheus-client==0.11.0
  Downloading prometheus_client-0.11.0-py2.py3-none-any.whl (56 kB)
Collecting prompt-toolkit==3.0.20
  Downloading prompt_toolkit-3.0

In [4]:
# Disable warnings
import warnings
warnings.filterwarnings('ignore')

# Stock Market Predictions with a Long Short-Term Memory Neural Network

Say you're planning to invest in the stock market and want to model price fluctuations, using the history of a sequence of prices to predict future prices as accurately as possible. When the data were observed at a constant time increment and each observation depends directly on one or more previous observations (a stock's price tomorrow depends directly on its price today), you need a time series model. In this workshop, we'll start by investigating two well-known models, then compare their prediction accuracy to that of an LSTM neural network.

Adapted from: https://www.datacamp.com/community/tutorials/lstm-python-stock-market

### Import Necessary Packages

In [6]:
from pandas_datareader import data
import matplotlib.pyplot as plt
import pandas as pd
import datetime as dt
import urllib.request, json
import os
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler

ModuleNotFoundError: No module named 'pandas_datareader'

### Load Data

The tutorial that this workshop was adapted from outlines two sources of data for use in the remainder of the workshop. For simplicity, we'll stick with the Kaggle data set that was provided. Feel free to experiment with different stock symbols (i.e., data for different companies' stock prices), although I can't guarantee that everything will work, as I've only tested the code with the Kaggle data for Hewlett-Packard (HP). Generally speaking, stock prices can be measured with the following metrics:
- Open: Opening stock price of a time period
- Close: Closing stock price of a time period
- High: Highest stock price of a time period
- Low: Lowest stock price of a time period

Note that these metrics can be analyzed for various time intervals (e.g., daily, hourly, 15 minutes, 5 minutes, etc.), but in this workshop, we'll focus on daily prices over the course of multiple years. In theory, you could build a model on any interval of time you have data for, but exploring the benefits and drawbacks of such variations is outside the scope of this workshop.

In [None]:
data_source = 'kaggle' # alphavantage or kaggle

if data_source == 'alphavantage':
    api_key = 'WFYTJ3SYJWWDIHQ8'

    # American Airlines stock market prices
    ticker = 'AAL'

    # JSON file with all the stock market data for AAL from the last 20 years
    url_string = f'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol={ticker}&outputsize=full&apikey={api_key}'

    # Save data to this file
    file_to_save = f'../data/raw/stock_market_data-{ticker}.csv'

    # If you haven't already saved data,
    # Go ahead and grab the data from the url
    # And store date, low, high, volume, close, open values to a Pandas DataFrame
    if not os.path.exists(file_to_save):
        with urllib.request.urlopen(url_string) as url:
            data = json.loads(url.read().decode())
            # extract stock market data
            data = data['Time Series (Daily)']
            df = pd.DataFrame(columns=['Date', 'Low', 'High', 'Close', 'Open'])
            for k,v in data.items():
                date = dt.datetime.strptime(k, '%Y-%m-%d')
                data_row = [date.date(), float(v['3. low']), float(v['2. high']), float(v['4. close']), float(v['1. open'])]
                df.loc[-1,:] = data_row
                df.index = df.index + 1
        # Write the file first so the message is only printed once the save succeeded
        df.to_csv(file_to_save)
        print(f'Data saved to : {file_to_save}')

    # If the data is already there, just load it from the CSV
    else:
        print('File already exists. Loading data from CSV')
        df = pd.read_csv(file_to_save)

else:
    # You will be using HP's data. Feel free to experiment with other data.
    # But while doing so, be careful to have a large enough dataset and also pay attention to the data normalization
    df = pd.read_csv(os.path.join('../data/external/Stocks', 'hpq.us.txt'), delimiter=',', usecols=['Date', 'Open', 'High', 'Low', 'Close'])
    print('Loaded data from the Kaggle repository')


### Sort and Check Data

Time series data must be ordered by time. Otherwise, you would be training your model on an arbitrary sequence of observations, which may be detrimental to its efficacy.
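
As a minimal sketch of this check, the snippet below sorts a small, made-up frame by date and confirms the result is in chronological order. The `toy` frame and its values are hypothetical and are not part of the Kaggle data; parsing the strings with `pd.to_datetime` is an extra precaution assumed here, so that the sort is chronological even for date formats that do not sort correctly as text.

```python
import pandas as pd

# Hypothetical toy frame with rows deliberately out of chronological order
toy = pd.DataFrame({
    'Date': ['2020-01-03', '2020-01-01', '2020-01-02'],
    'Low': [11.0, 9.0, 10.0],
    'High': [13.0, 11.0, 12.0],
})

# Parse the strings as dates so the sort is chronological rather than textual
toy['Date'] = pd.to_datetime(toy['Date'])
toy_sorted = toy.sort_values('Date').reset_index(drop=True)

# True only if every date is on or after the one before it
is_ordered = toy_sorted['Date'].is_monotonic_increasing
print(is_ordered)
```

Resetting the index after sorting keeps positional and label-based lookups in agreement, which matters if later code indexes rows by integer label.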

In [None]:
# Sort DataFrame by date
df = df.sort_values('Date')
df.head()

In [None]:
plt.figure(figsize=(18, 9))
plt.plot(range(df.shape[0]), (df['Low'] + df['High']) / 2.0)
plt.xticks(range(0, df.shape[0], 500), df['Date'].loc[::500], rotation=45)
plt.xlabel('Date', fontsize=18)
plt.ylabel('Mid Price', fontsize=18)
plt.show()

### Split Data

As usual, split your data into training and testing sets so that the model is validated on its predictions for observations it has never seen before.
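
For time series, the split is usually made at a single cut-off point rather than by random sampling, so that every test observation comes after every training observation. The helper below is a rough sketch of that idea; `chronological_split` and the toy `prices` array are illustrative names, not part of the original tutorial.

```python
import numpy as np

def chronological_split(series, n_train):
    """Split a 1-D array into (train, test) at a cut-off, without shuffling."""
    series = np.asarray(series)
    if not 0 < n_train < series.size:
        raise ValueError('n_train must leave at least one point on each side')
    return series[:n_train], series[n_train:]

# Hypothetical, already time-ordered prices
prices = np.arange(10, dtype=float)
train, test = chronological_split(prices, 7)
print(train.size, test.size)
```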

In [None]:
# First calculate the mid prices from the highest and lowest
high_prices = df.loc[:, 'High'].to_numpy()
low_prices = df.loc[:, 'Low'].to_numpy()
mid_prices = (high_prices + low_prices) / 2.0

# Split data into training and test sets
train_data = mid_prices[:11000]
test_data = mid_prices[11000:]

### Normalize Data

Before training a model, you must normalize the data. Since different time periods have different value ranges, we normalize by splitting the training series into windows of a specified size (2500 here) and scaling each window to the range 0 to 1 separately; the test data are then scaled with the scaler fitted on the last training window. We then smooth **only** the training data, using an exponential moving average, to reduce the amount of noise our models encounter.
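
To make the windowed scaling concrete, here is a small sketch in plain NumPy that stands in for `MinMaxScaler`: each block of `window` points is rescaled using only that block's own minimum and maximum. The function name `windowed_min_max` and the `demo` values are assumptions made for illustration.

```python
import numpy as np

def windowed_min_max(values, window):
    """Scale each consecutive block of `window` points to [0, 1] on its own."""
    values = np.asarray(values, dtype=float)
    scaled = np.empty_like(values)
    for start in range(0, values.size, window):
        block = values[start:start + window]
        lo, hi = block.min(), block.max()
        # Guard against a constant block, where hi - lo would be zero
        span = hi - lo if hi > lo else 1.0
        scaled[start:start + window] = (block - lo) / span
    return scaled

# Hypothetical series whose price level jumps sharply between the two windows
demo = np.array([1.0, 2.0, 3.0, 100.0, 150.0, 200.0])
demo_scaled = windowed_min_max(demo, window=3)
print(demo_scaled)
```

With a single global scaler, the first three values would be squashed close to zero. Scaling per window keeps the early part of the series visible to the model, at the cost of a possible jump where two windows meet.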

In [None]:
# Scale the data to be between 0 and 1
# Remember: normalize both test and train data with respect to the training data only,
# because you are not supposed to have access to the test data
scaler = MinMaxScaler()
train_data = train_data.reshape(-1, 1)
test_data = test_data.reshape(-1, 1)

# Train the Scaler with training data and smooth data
smoothing_window_size = 2500
for di in range(0, 10000, smoothing_window_size):
    scaler.fit(train_data[di:di + smoothing_window_size, :])
    train_data[di:di + smoothing_window_size, :] = scaler.transform(train_data[di:di + smoothing_window_size, :])

# You normalize the last bit of remaining data
scaler.fit(train_data[di + smoothing_window_size:, :])
train_data[di + smoothing_window_size:, :] = scaler.transform(train_data[di + smoothing_window_size:, :])

# Flatten the training data back to 1-D (the test data is flattened below)
train_data = train_data.reshape(-1)

# Normalize test data
test_data = scaler.transform(test_data).reshape(-1)

# Now perform exponential moving average smoothing
# So the data will have a smoother curve than the original ragged data
EMA = 0.0
gamma = 0.1
for ti in range(11000):
    EMA = gamma * train_data[ti] + (1 - gamma) * EMA
    train_data[ti] = EMA

# Used for visualization and test purposes
all_mid_data = np.concatenate([train_data,test_data], axis=0)

## One-Step Ahead Prediction via Averaging

We will compare different methods of modeling the stock price time series we have based on Mean Squared Error (MSE), which is calculated by averaging the squared error of each prediction we generate over all observations.
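
As a quick illustration of the metric, the sketch below computes the MSE for three made-up predictions. The helper name `mean_squared_error` and the numbers are illustrative only.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between predictions and truth."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))

# Squared errors are 0, 0.25 and 1, so the mean is 1.25 / 3
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
print(f'{mse:.5f}')
```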

### Standard Average
$$x_{t+1}=\frac{1}{N}\sum_{i=t-N+1}^{t} x_i$$
In this case, the prediction at time $t+1$ is the average of the $N$ most recent stock prices, observed from time $t-N+1$ through time $t$. In the code below, this window length is `window_size`.
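
The same one-step-ahead average can also be computed without an explicit loop. The sketch below uses a cumulative sum; `moving_average_predictions` is a hypothetical helper, and it should give the same values as averaging the `window` points immediately before each predicted index.

```python
import numpy as np

def moving_average_predictions(series, window):
    """Predict series[t] as the mean of the `window` values before it.

    Returns predictions for indices window, window + 1, ..., len(series) - 1.
    """
    series = np.asarray(series, dtype=float)
    # csum[k] holds the sum of series[:k]
    csum = np.concatenate(([0.0], np.cumsum(series)))
    t = np.arange(window, series.size)
    # Sum of series[t - window:t], divided by the window length
    return (csum[t] - csum[t - window]) / window

series = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
preds = moving_average_predictions(series, window=2)
print(preds)
```

On this steadily rising toy series, every prediction falls below the value it is trying to predict, which is the lag that a plain average shows when prices trend.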

In [None]:
window_size = 100
N = train_data.size
std_avg_predictions = []
std_avg_x = []
mse_errors = []

for pred_idx in range(window_size, N):

    # pred_idx never reaches N inside this loop, so the date can always be read from df
    date = df.loc[pred_idx, 'Date']

    std_avg_predictions.append(np.mean(train_data[pred_idx - window_size:pred_idx]))
    mse_errors.append((std_avg_predictions[-1] - train_data[pred_idx])**2)
    std_avg_x.append(date)

# Report the plain mean of the squared errors, matching the MSE definition above
print(f'MSE for standard averaging: {np.mean(mse_errors):.5f}')

In [None]:
plt.figure(figsize=(18,9))
plt.plot(range(df.shape[0]), all_mid_data, color='b', label='True')
plt.plot(range(window_size, N), std_avg_predictions, color='orange', label='Prediction')
# plt.xticks(range(0, df.shape[0], 50), df['Date'].loc[::50], rotation=45)
plt.xlabel('Date')
plt.ylabel('Mid Price')
plt.legend(fontsize=18)
plt.show()

Notice that the model's predictions follow the actual behavior of the stock prices fairly accurately, although it seems to lag behind the actual price movement in the market by a few days. It seems as though this model is relatively useful for making short-term price predictions (i.e., a day or two ahead), but we will continue to investigate further.

### Exponential Moving Average
$$x_{t+1}=EMA_t=\gamma\times EMA_{t-1}+(1-\gamma)x_t$$
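Here, $EMA_0=0$ and $EMA_t$ is the exponential moving average maintained over time. When predicting the price for time $t+1$, $\gamma$ is the weight given to the previous moving average $EMA_{t-1}$, and $1-\gamma$ is the weight given to the most recent observation $x_t$, so a smaller $\gamma$ makes the prediction track recent prices more closely. In the code below, $\gamma$ is the `decay` variable.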
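
A small worked example may help with the recursion. The sketch below applies the formula to a constant series with $\gamma=0.5$; the function name `ema_predictions` is illustrative. Because $EMA_0=0$, the predictions start at zero and close half of the remaining gap at each step.

```python
import numpy as np

def ema_predictions(series, gamma):
    """One-step-ahead EMA predictions, starting from EMA_0 = 0."""
    series = np.asarray(series, dtype=float)
    preds = np.empty(series.size)
    ema = 0.0
    preds[0] = ema  # no observation is available before the first point
    for t in range(1, series.size):
        # Weight gamma on the previous average, 1 - gamma on the latest observation
        ema = gamma * ema + (1.0 - gamma) * series[t - 1]
        preds[t] = ema
    return preds

ema_preds = ema_predictions([1.0, 1.0, 1.0, 1.0], gamma=0.5)
print(ema_preds)
```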

In [None]:
window_size = 100
N = train_data.size

run_avg_predictions = []
run_avg_x = []

mse_errors = []

running_mean = 0.0
run_avg_predictions.append(running_mean)

decay = 0.5

for pred_idx in range(1, N):

    running_mean = running_mean * decay + (1.0 - decay)*train_data[pred_idx - 1]
    run_avg_predictions.append(running_mean)
    mse_errors.append((run_avg_predictions[-1] - train_data[pred_idx])**2)
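    # Look up the date for this prediction instead of reusing a stale `date` variable
    run_avg_x.append(df.loc[pred_idx, 'Date'])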

# Report the plain mean of the squared errors, matching the MSE definition above
print(f'MSE for EMA averaging: {np.mean(mse_errors):.5f}')

In [None]:
plt.figure(figsize=(18,9))
plt.plot(range(df.shape[0]), all_mid_data, color='b', label='True')
plt.plot(range(0, N), run_avg_predictions, color='orange', label='Prediction')
#plt.xticks(range(0, df.shape[0], 50), df['Date'].loc[::50], rotation=45)
plt.xlabel('Date')
plt.ylabel('Mid Price')
plt.legend(fontsize=18)
plt.show()

The line of predictions mirrors the actual stock price movement almost perfectly, but is that really useful? In practical applications, you would ideally like to make predictions for times $t+1$, $t+2$, and so on. With the two models we just explored, however, you can only ever predict a single step ahead (time $t+1$). What if you instead wanted to make a prediction 30 days in advance? For this purpose, we will explore the use of long short-term memory neural networks.
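
One way to see the single-step limitation is to try forecasting several steps ahead with the EMA by feeding each forecast back in as if it were an observation. The sketch below does exactly that; `recursive_ema_forecast` and the toy history are hypothetical. Since $\gamma\,EMA_t+(1-\gamma)\,EMA_t=EMA_t$, every forecast after the first repeats the same value.

```python
import numpy as np

def recursive_ema_forecast(history, gamma, horizon):
    """Forecast `horizon` steps ahead by feeding EMA forecasts back into the EMA."""
    ema = 0.0
    for x in history:
        ema = gamma * ema + (1.0 - gamma) * x
    forecasts = []
    for _ in range(horizon):
        forecasts.append(ema)
        # No real observation exists yet, so substitute the forecast itself
        ema = gamma * ema + (1.0 - gamma) * forecasts[-1]
    return np.array(forecasts)

forecast = recursive_ema_forecast([1.0, 2.0, 3.0], gamma=0.5, horizon=5)
print(forecast)
```

The forecast is a flat line, so under this recursive scheme a 30-day-ahead EMA prediction carries no more information than the 1-day-ahead one.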