# Predicting Oil Prices Using an RNN with LSTM

Researchers have found that recurrent neural networks (RNNs) with long short-term memory (LSTM) units can outperform traditional forecasting models such as ARIMA on certain time series. (For an example, see [A comparison of artificial neural network and time series models for forecasting commodity prices](https://www.sciencedirect.com/science/article/pii/0925231295000208).)

This Python 3 notebook demonstrates how to apply an RNN with LSTM to forecast weekly West Texas crude oil prices. The data used to train and test the model covers the period from 01/03/1986 to 03/30/2018 and was downloaded from the [Federal Reserve Bank of St. Louis](https://fred.stlouisfed.org).

## Setup

1. Download the file with West Texas crude oil prices from [here](https://raw.githubusercontent.com/lee-zhg/timeseries-rnn-lab-part1/master/data/WCOILWTICO.csv) to your local system. The file is named WCOILWTICO.csv.

2. Click on the data icon at the top right of the notebook window, then select and upload the <b>WCOILWTICO.csv</b> file.
![Data icon](https://github.com/djccarew/timeseries-rnn-lab-part1/raw/master/images/ss6.png) 

3. Once the file is uploaded, place your cursor in the code cell below and select <b>Insert to code->Insert pandas Dataframe</b>.
![Insert code](https://github.com/djccarew/timeseries-rnn-lab-part1/raw/master/images/ss7.png) 
This will insert the code to load the file from Object Storage into a DataFrame.

4. Rename the inserted DataFrame variable to <b>df_data_1</b> if it has a different name.

5. Run each cell in the notebook in order, reading the description of what each cell does before you run it.

## Import Data

In [None]:
# With your cursor in this cell, insert the code to read the dataset into a DataFrame as instructed in step 3) 
# of the setup instructions above


In [None]:
# New version of imported DataFrame indexed by the DATE column
# Make sure the variable name on the right of the assignment statement matches the name inserted
# into the code cell above
data =  df_data_1.set_index('DATE')
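
If you are running the notebook outside the hosted environment, the same `data` frame can be built directly with pandas. The sketch below is only an illustration: it reads two made-up rows from an in-memory string instead of the real file, and it assumes the CSV has the `DATE` and `WCOILWTICO` columns used throughout this notebook. Parsing `DATE` as a datetime is an optional extra, not something the original cell does.

```python
import io
import pandas as pd

# Two made-up rows standing in for the real file (values are illustrative only)
sample_csv = io.StringIO(
    "DATE,WCOILWTICO\n"
    "2018-03-23,60.0\n"
    "2018-03-30,61.0\n"
)

# For the real file, pass its path instead of sample_csv,
# e.g. pd.read_csv('WCOILWTICO.csv', parse_dates=['DATE'])
df_data_1 = pd.read_csv(sample_csv, parse_dates=['DATE'])

# Same indexing step as the cell above
data = df_data_1.set_index('DATE')
print(data.shape)
```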

## Build the model

In [None]:
# Required imports
from math import sqrt
from numpy import concatenate
from matplotlib import pyplot
import pandas as pd
from datetime import datetime
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
import plotly.offline as py
import plotly.graph_objs as go
import numpy as np
py.init_notebook_mode(connected=True)
%matplotlib inline


In [None]:
# Plot the data read in 
cop_trace = go.Scatter(x=data.index, y=data['WCOILWTICO'], name= 'Price')
py.iplot([cop_trace])

In [None]:
# Create a scaled version of the data with oil prices normalized between 0 and 1
values = data['WCOILWTICO'].values.reshape(-1,1)
values = values.astype('float32')
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
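
To see what the scaling step does, here is a minimal NumPy sketch of the same min-max formula, `(x - min) / (max - min)`, together with its inverse. The prices are toy values chosen for easy arithmetic, not values from the dataset, and the names `lo` and `hi` are introduced here for illustration.

```python
import numpy as np

# Toy prices in USD (illustrative, not taken from the dataset)
prices = np.array([[20.0], [50.0], [80.0]], dtype='float32')

lo, hi = prices.min(), prices.max()

# Forward transform: map [lo, hi] onto [0, 1]
scaled_by_hand = (prices - lo) / (hi - lo)

# Inverse transform: map [0, 1] back onto [lo, hi]
restored = scaled_by_hand * (hi - lo) + lo

print(scaled_by_hand.ravel())
print(restored.ravel())
```

The inverse step is the calculation `scaler.inverse_transform` performs later in the notebook when predictions are converted back to USD.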

In [None]:
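# Split the data into training and testing sets, keeping the rows in chronological order
# The first 70% of the data is used for training; the remaining 30% is held out for testing
# (it is also passed to Keras as validation data when the model is fit)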
train_size = int(len(scaled) * 0.7)
test_size = len(scaled) - train_size
train, test = scaled[0:train_size,:], scaled[train_size:len(scaled),:]
print(len(train), len(test))
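
The split is chronological rather than random, so the model is always tested on weeks that come after the ones it was trained on. A small sketch with a stand-in series of 100 consecutive integers (an assumption made here so the ordering is easy to see):

```python
import numpy as np

# Stand-in for `scaled`: 100 "weeks" numbered 0..99 in a single column
series = np.arange(100, dtype='float32').reshape(-1, 1)

train_size = int(len(series) * 0.7)
train, test = series[:train_size], series[train_size:]

# The last training week immediately precedes the first test week
print(len(train), len(test))
print(train[-1, 0], test[0, 0])
```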

In [None]:
# Builds the model inputs (X) and targets (Y) from a series using a sliding window.
# Each X sample holds prev_periods consecutive values and the matching Y is the value
# that immediately follows them, so the result has len(dataset) - prev_periods samples.
#
# Example with prev_periods set to 2:
# Original series (weeks 1 - 5) = 1.05, 1.15, 1.25, 1.35, 1.45
# X = [1.05, 1.15], [1.15, 1.25], [1.25, 1.35]
# Y = 1.25, 1.35, 1.45
#
def gen_datasets(dataset, prev_periods=1):
    dataX, dataY = [], []
    for i in range(len(dataset) - prev_periods):
        a = dataset[i:(i + prev_periods), 0]
        dataX.append(a)
        dataY.append(dataset[i + prev_periods, 0])
    print(len(dataY))
    return np.array(dataX), np.array(dataY)
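
As a quick check of the windowing logic, the sketch below runs the same loop on the five-week toy series with a window of 2. The helper name `make_windows` is hypothetical; it repeats the body of `gen_datasets` without the `print` call so that the example is self-contained.

```python
import numpy as np

def make_windows(dataset, prev_periods=1):
    """Same sliding-window logic as gen_datasets, minus the print."""
    X, y = [], []
    for i in range(len(dataset) - prev_periods):
        X.append(dataset[i:i + prev_periods, 0])   # prev_periods inputs
        y.append(dataset[i + prev_periods, 0])     # the value that follows
    return np.array(X), np.array(y)

# Toy series: one column, five weeks
weeks = np.array([[1.05], [1.15], [1.25], [1.35], [1.45]])

X, y = make_windows(weeks, prev_periods=2)
print(X)
print(y)
```

Five values with a window of 2 yield three samples, because the last two values have no later value to serve as a target.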

In [None]:
# Generate training and testing data
# We'll use a sliding window size of 1 week to predict the next week's price 
prev_periods = 1
trainX, trainY = gen_datasets(train, prev_periods)
testX, testY = gen_datasets(test, prev_periods)

In [None]:
# Reshape into a numpy array of shape (m, 1, prev_periods) where m is the number of training or testing samples
trainX = np.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
testX = np.reshape(testX, (testX.shape[0], 1, testX.shape[1]))
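
A Keras LSTM layer expects 3-D input of shape (samples, timesteps, features). The reshape above presents each window as a single timestep with `prev_periods` features. The sketch below shows that layout on a toy array with a window of 2, next to the alternative of treating each lag as its own timestep. The alternative is shown only for comparison and is not what this notebook uses.

```python
import numpy as np

# Three toy samples, each a window of 2 scaled prices
X = np.array([[0.1, 0.2],
              [0.2, 0.3],
              [0.3, 0.4]])

# Layout used in this notebook: 1 timestep, prev_periods features
as_features = np.reshape(X, (X.shape[0], 1, X.shape[1]))

# Alternative layout: prev_periods timesteps, 1 feature each
as_timesteps = np.reshape(X, (X.shape[0], X.shape[1], 1))

print(as_features.shape, as_timesteps.shape)
```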

In [None]:
# Build and train the RNN - this should take a few minutes
model = Sequential()
model.add(LSTM(100, input_shape=(trainX.shape[1], trainX.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
history = model.fit(trainX, trainY, epochs=50, batch_size=32, validation_data=(testX, testY), verbose=0, shuffle=False)

In [None]:
# Check out MSE, RMSE, MAE for training and testing data
training_error = model.evaluate(trainX, trainY, verbose=0)
print('Training error: %.5f MSE (%.5f RMSE) %.5f MAE' % (training_error[0], sqrt(training_error[0]), training_error[1]))
testing_error = model.evaluate(testX, testY, verbose=0)
print('Testing error: %.5f MSE (%.5f RMSE) %.5f MAE' % (testing_error[0], sqrt(testing_error[0]), testing_error[1]))
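
The errors printed above are in scaled units between 0 and 1, not dollars. Because min-max scaling is a linear map, an RMSE or MAE in scaled units converts to USD by multiplying by the price range (max - min). The sketch below checks this on made-up numbers; `lo` and `hi` are hypothetical minimum and maximum prices, not values from the dataset.

```python
import numpy as np

lo, hi = 10.0, 110.0   # hypothetical min and max price in USD

actual_scaled = np.array([0.20, 0.50, 0.80])
pred_scaled = np.array([0.25, 0.45, 0.90])

# RMSE in scaled units
rmse_scaled = np.sqrt(np.mean((actual_scaled - pred_scaled) ** 2))

# Undo the min-max scaling, then recompute the error in USD
actual_usd = actual_scaled * (hi - lo) + lo
pred_usd = pred_scaled * (hi - lo) + lo
rmse_usd = np.sqrt(np.mean((actual_usd - pred_usd) ** 2))

# The two results differ exactly by the factor (hi - lo)
print(rmse_scaled, rmse_usd, rmse_scaled * (hi - lo))
```

MSE, being a squared quantity, scales by (max - min) squared instead.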

In [None]:
# Plot training and test loss vs epoch number
pyplot.plot(history.history['loss'], label='training loss')
pyplot.plot(history.history['val_loss'], label='test loss')
pyplot.legend()
pyplot.show()

In [None]:
# Plot prediction vs actual using scaled values (0, 1)
yhat_test = model.predict(testX)
print(yhat_test.shape)
pyplot.plot(yhat_test, color='red', label='prediction')
pyplot.plot(testY, color='blue', label='actual')
pyplot.legend()
pyplot.show()

In [None]:
# Convert scaled prices back to original scale (USD)
yhat_test_inverse = scaler.inverse_transform(yhat_test.reshape(-1, 1))
testY_inverse = scaler.inverse_transform(testY.reshape(-1, 1))

# Add dates back
dates = data.tail(len(testX)).index
testY_reshape = testY_inverse.reshape(len(testY_inverse))
yhat_test_reshape = yhat_test_inverse.reshape(len(yhat_test_inverse))


In [None]:
# Calculate MSE and RMSE based on original USD prices
mse = mean_squared_error(testY_inverse, yhat_test_inverse)
rmse = sqrt(mse)
print('Test MSE(USD): %.3f Test RMSE(USD): %.3f' % (mse, rmse))

In [None]:
# Plot actual vs predicted using actual dates and USD
actual = go.Scatter(x=dates, y=testY_reshape, line = dict(color = ('rgb(0, 0, 255)'), width = 4), name= 'Actual Price')
predicted = go.Scatter(x=dates, y=yhat_test_reshape, line = dict(color = ('rgb(255, 0, 0)'), width = 4), name= 'Predicted Price')
py.iplot([predicted, actual])

## Run new data through model
During data preparation, the last week's price (3/30/2018) was never used as a model input because there was no price for the following week (4/6/2018) to serve as its target. Let's use that value to predict the price for the week of 4/6/2018.

In [None]:
# Grab the last prev_periods weeks of normalized data and reshape them into the shape expected by the model
scaled_last_prices = scaled[len(scaled) - prev_periods:len(scaled),:]
scaled_last_prices = np.reshape(scaled_last_prices, (1, 1, prev_periods))

print(scaled_last_prices)
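
The slice and reshape above are written in terms of `prev_periods`, so they would still work if the window were larger than one week. A small sketch on a stand-in array, assuming a hypothetical window of 2 (the names `scaled_demo` and `window` are introduced here for illustration):

```python
import numpy as np

# Stand-in for `scaled`: six values evenly spaced from 0 to 1, one column
scaled_demo = np.linspace(0.0, 1.0, 6).reshape(-1, 1)
window = 2   # hypothetical window size; the notebook uses prev_periods = 1

# Last `window` rows, shape (window, 1)
last = scaled_demo[len(scaled_demo) - window:len(scaled_demo), :]

# Single sample, one timestep, `window` features
last = np.reshape(last, (1, 1, window))

print(last.shape)
print(last)
```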


In [None]:
# Predict the price for the week of 4/6/2018 using the model
# Note that the prediction is on the normalized (0, 1) scale
next_price_prediction = model.predict(scaled_last_prices)


In [None]:
# Transform the scaled prediction back to a USD price
next_price_inverse = scaler.inverse_transform(next_price_prediction.reshape(-1, 1))
print(next_price_inverse)

Congratulations! You have successfully built an RNN to forecast oil prices.