
# Predicting the Dow Jones with News

## General Data flow for a Text Related Business Problem

![title](resources/textmining.png)

# Problem Statement & Reference Architecture

* **Aim**: Use Reddit News Headlines to predict the movement of Dow Jones Industrial Average.   


* **Data Source**: https://www.kaggle.com/aaron7sun/stocknews 


* **Data Description**: Dow Jones details on Open, High, Low and Close for each day from 2008-08-08 to 2016-07-01 and headlines for those dates from Reddit News. 


* **Methodology**: For this project, we will use GloVe to create our word embeddings and CNNs followed by LSTMs to build our model. This model is based off the work done in this paper https://www.aclweb.org/anthology/C/C16/C16-1229.pdf.

![basic](resources/basic_intent.png)

# Installation Prerequisites

In [3]:
!apt-get update  && apt-get install -y --allow-downgrades --no-install-recommends git wget 


Ign:1 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64  InRelease
Ign:2 http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64  InRelease
Hit:3 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64  Release
Hit:4 http://archive.ubuntu.com/ubuntu xenial InRelease
Hit:5 http://security.ubuntu.com/ubuntu xenial-security InRelease
Hit:6 http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64  Release
Hit:8 http://archive.ubuntu.com/ubuntu xenial-updates InRelease        
Hit:9 http://archive.ubuntu.com/ubuntu xenial-backports InRelease
Reading package lists... Done                     
Reading package lists... Done
Building dependency tree       
Reading state information... Done
git is already the newest version (1:2.7.4-0ubuntu1.4).
wget is already the newest version (1.17.1-1ubuntu1.4).
0 upgraded, 0 newly installed, 0 to remove and 73 not upgraded.


In [4]:
!apt-get -y install graphviz

Reading package lists... Done
Building dependency tree       
Reading state information... Done
graphviz is already the newest version (2.38.0-12ubuntu2.1).
0 upgraded, 0 newly installed, 0 to remove and 73 not upgraded.


In [5]:
!pip install nltk keras

[33mYou are using pip version 9.0.1, however version 18.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


In [6]:
!pip install pydot

[33mYou are using pip version 9.0.1, however version 18.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


In [7]:
!pip install graphviz

[33mYou are using pip version 9.0.1, however version 18.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


In [None]:
!wget http://nlp.stanford.edu/data/glove.840B.300d.zip

In [None]:
!unzip glove.840B.300d.zip

# Imports

In [5]:
import pandas as pd
import numpy as np
import tensorflow as tf
import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.metrics import median_absolute_error as mae
from sklearn.metrics import mean_squared_error as mse
import matplotlib.pyplot as plt

  from ._conv import register_converters as _register_converters


In [6]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
# Keras Imports
from keras.models import Sequential
from keras import initializers
from keras.layers import Dropout, Activation, Embedding, Convolution1D, MaxPooling1D, Input, Dense, add, \
                         BatchNormalization, Flatten, Reshape, Concatenate
from keras.layers.recurrent import LSTM, GRU
from keras.callbacks import Callback, ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
from keras.models import Model
from keras.optimizers import Adam, SGD, RMSprop
from keras import regularizers
from keras.utils.vis_utils import plot_model
import re

Using TensorFlow backend.


In [129]:
dj = pd.read_csv("/storage/DowJones.csv")
news = pd.read_csv("/storage/News.csv")

## Inspect the data

In [130]:
dj.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2016-07-01,17924.240234,18002.380859,17916.910156,17949.369141,82160000,17949.369141
1,2016-06-30,17712.759766,17930.609375,17711.800781,17929.990234,133030000,17929.990234
2,2016-06-29,17456.019531,17704.509766,17456.019531,17694.679688,106380000,17694.679688
3,2016-06-28,17190.509766,17409.720703,17190.509766,17409.720703,112190000,17409.720703
4,2016-06-27,17355.210938,17355.210938,17063.080078,17140.240234,138740000,17140.240234


In [131]:
dj.isnull().sum() #No missing data

Date         0
Open         0
High         0
Low          0
Close        0
Volume       0
Adj Close    0
dtype: int64

In [132]:
news.isnull().sum() #No missing data

Date    0
News    0
dtype: int64

In [133]:
news.head()

Unnamed: 0,Date,News
0,2016-07-01,A 117-year-old woman in Mexico City finally re...
1,2016-07-01,IMF chief backs Athens as permanent Olympic host
2,2016-07-01,"The president of France says if Brexit won, so..."
3,2016-07-01,British Man Who Must Give Police 24 Hours' Not...
4,2016-07-01,100+ Nobel laureates urge Greenpeace to stop o...


In [134]:
print(dj.shape)
print(news.shape)

(1989, 7)
(73608, 2)


In [135]:
# Compare the number of unique dates. We want matching values.
print(len(set(dj.Date)))
print(len(set(news.Date)))

1989
2943


In [136]:
# Remove the extra dates that are in news
news = news[news.Date.isin(dj.Date)]

In [137]:
print(len(set(dj.Date)))
print(len(set(news.Date)))

1989
1989


In [138]:
# Remove unwanted features - keep the 'Open' price only
dj = dj.drop(['High','Low','Close','Volume','Adj Close'], 1)
dj.head()

Unnamed: 0,Date,Open
0,2016-07-01,17924.240234
1,2016-06-30,17712.759766
2,2016-06-29,17456.019531
3,2016-06-28,17190.509766
4,2016-06-27,17355.210938


In [139]:
# Calculate the difference in opening prices between the following and current day.
# The model will try to predict the change in Open value based on the today's news.
dj = dj.set_index('Date')
dj.head()

Unnamed: 0_level_0,Open
Date,Unnamed: 1_level_1
2016-07-01,17924.240234
2016-06-30,17712.759766
2016-06-29,17456.019531
2016-06-28,17190.509766
2016-06-27,17355.210938


In [140]:
# Target variable = Tomorrow's Open Price - Today's Open Price
dj = -1 * dj.diff(periods=1)

In [141]:
dj.head()

Unnamed: 0_level_0,Open
Date,Unnamed: 1_level_1
2016-07-01,
2016-06-30,211.480468
2016-06-29,256.740235
2016-06-28,265.509765
2016-06-27,-164.701172


In [142]:
dj['Date'] = dj.index
dj = dj.reset_index(drop=True)

In [143]:
dj.head()

Unnamed: 0,Open,Date
0,,2016-07-01
1,211.480468,2016-06-30
2,256.740235,2016-06-29
3,265.509765,2016-06-28
4,-164.701172,2016-06-27


In [147]:
# Remove top row since it has a null value.
dj = dj[dj.Open.notnull()]

In [73]:
# Check if there are any more null values.
dj.isnull().sum()

Open    0
Date    0
dtype: int64

## Combine the two datasets - For each date, get all the headlines and the price

In [148]:
# Create a list of the opening prices and their corresponding daily headlines from the news
# Define/Initialize the variables
price = []
headlines = []

# For all the rows in the dataframe
for row in dj.iterrows():
    # define a new variable to store all the headlines for the day
    daily_headlines = []
    # Spot the date in the given row
    date = row[1]['Date']
    # Store the price for the date
    price.append(row[1]['Open'])
    for row_ in news[news.Date==date].iterrows():
        daily_headlines.append(row_[1]['News'])

    # Append the headlines for the date
    headlines.append(daily_headlines)
    # Track progress
    if len(price) % 500 == 0:
        print(len(price))

500
1000
1500


<table size="100">
    <tr>
        <td>headlines</td>
        <td>price</td>
    </tr>
    <tr>
        <td>headline-1, headline-2 ..., headline-n</td>
        <td>211.48</td>
    </tr>
</table>

In [149]:
# Check how headlines look like
headlines[:1], price[:1]

([['Jamaica proposes marijuana dispensers for tourists at airports following legalisation: The kiosks and desks would give people a license to purchase up to 2 ounces of the drug to use during their stay',
   "Stephen Hawking says pollution and 'stupidity' still biggest threats to mankind: we have certainly not become less greedy or less stupid in our treatment of the environment over the past decade",
   'Boris Johnson says he will not run for Tory party leadership',
   'Six gay men in Ivory Coast were abused and forced to flee their homes after they were pictured signing a condolence book for victims of the recent attack on a gay nightclub in Florida',
   'Switzerland denies citizenship to Muslim immigrant girls who refused to swim with boys: report',
   'Palestinian terrorist stabs israeli teen girl to death in her bedroom',
   'Puerto Rico will default on $1 billion of debt on Friday',
   'Republic of Ireland fans to be awarded medal for sportsmanship by Paris mayor.',
   "Afghan s

## Clean up the price list

In [77]:
price[:2]

[211.48046800000157, 256.7402349999975]

In [78]:
# Normalize opening prices (target values)
max_price = max(price)
min_price = min(price)
mean_price = np.mean(price)
def normalize(price):
    return ((price-min_price)/(max_price-min_price))

In [79]:
norm_price = []
for p in price:
    norm_price.append(normalize(p))

In [80]:
# Check that normalization worked well
print(min(norm_price))
print(max(norm_price))
print(np.mean(norm_price))

0.0
1.0
0.4551577545098642


In [81]:
# Compare the number of headlines for each day
print(max(len(i) for i in headlines))
print(min(len(i) for i in headlines))
print(np.mean([len(i) for i in headlines]))

25
22
24.996478873239436


In [82]:
norm_price[:2]

[0.5780280759194737, 0.6047364662478155]

## Clean up the headlines list

In [83]:
# remove contractions
def decontracted(phrase):
    if "'" in phrase:
        # specific
        phrase = re.sub(r"won't", "will not", phrase)
        phrase = re.sub(r"can\'t", "can not", phrase)

        # general
        phrase = re.sub(r"n\'t", " not", phrase)
        phrase = re.sub(r"\'re", " are", phrase)
        phrase = re.sub(r"\'s", " is", phrase)
        phrase = re.sub(r"\'d", " would", phrase)
        phrase = re.sub(r"\'ll", " will", phrase)
        phrase = re.sub(r"\'t", " not", phrase)
        phrase = re.sub(r"\'ve", " have", phrase)
        phrase = re.sub(r"\'m", " am", phrase)
    return phrase

text = "I should've gone to dentist so my teeth wouldn't hurt"
text1 = "But I am good now"
print(decontracted(text))
print(decontracted(text1))

I should have gone to dentist so my teeth would not hurt
But I am good now


In [84]:
def clean_text(text):
    '''Remove unwanted characters and format the text to create fewer nulls word embeddings'''
    
    # Convert words to lower case
    text = text.lower()
    
    # Replace contractions with their longer forms 
    if True:
        text = text.split()
        new_text = []
        # Remove the contractions
        for word in text:
            new_text.append(decontracted(word))
        # Recreate the sentence
        text = " ".join(new_text)
    
    # Format words and remove unwanted characters
    text = re.sub(r'&amp;', '', text) 
    text = re.sub(r'0,0', '00', text) 
    text = re.sub(r'[_"\-;%()|.,+&=*%.,!?:#@\[\]]', ' ', text)
    text = re.sub(r'\'', ' ', text)
    text = re.sub(r'\$', ' $ ', text)
    text = re.sub(r'u s ', ' united states ', text)
    text = re.sub(r'u n ', ' united nations ', text)
    text = re.sub(r'u k ', ' united kingdom ', text)
    text = re.sub(r'j k ', ' jk ', text)
    text = re.sub(r' s ', ' ', text)
    text = re.sub(r' yr ', ' year ', text)
    text = re.sub(r' l g b t ', ' lgbt ', text)
    text = re.sub(r'0km ', '0 km ', text)
    
    # Remove stop words
    text = text.split()
    stops = set(stopwords.words("english"))
    text = [w for w in text if not w in stops]
    text = " ".join(text)

    return text

In [85]:
# Clean the headlines
clean_headlines = []

for daily_headlines in headlines:
    clean_daily_headlines = []
    for headline in daily_headlines:
        clean_daily_headlines.append(clean_text(headline))
    clean_headlines.append(clean_daily_headlines)

In [86]:
# Take a look at some headlines to ensure everything was cleaned well
clean_headlines[:2]

[['jamaica proposes marijuana dispensers tourists airports following legalisation kiosks desks would give people license purchase 2 ounces drug use stay',
  'stephen hawking says pollution istupidity still biggest threats mankind certainly become less greedy less stupid treatment environment past decade',
  'boris johnson says run tory party leadership',
  'six gay men ivory coast abused forced flee homes pictured signing condolence book victims recent attack gay nightclub florida',
  'switzerland denies citizenship muslim immigrant girls refused swim boys report',
  'palestinian terrorist stabs israeli teen girl death bedroom',
  'puerto rico default $ 1 billion debt friday',
  'republic ireland fans awarded medal sportsmanship paris mayor',
  'afghan suicide bomber kills 40 bbc news',
  'us airstrikes kill least 250 isis fighters convoy outside fallujah official says',
  'turkish cop took istanbul gunman hailed hero',
  'cannabis compounds could treat alzheimer removing plaque formin

In [87]:
print('Roughly the number of unique words in English: {}'.format(len({word: None 
                                                                      for headlines in clean_headlines 
                                                                      for headline in headlines 
                                                                      for word in headline.split()})))


Roughly the number of unique words in English: 36311


In [88]:
# Create the word vocab
import collections
words = [word for headlines in clean_headlines for headline in headlines for word in headline.split()]
word_counts = collections.Counter(words)

In [89]:
word_counts

Counter({'encampment': 2,
         'stop': 443,
         'discount': 3,
         'messages': 56,
         'humane': 8,
         'djibouti': 3,
         'donors': 14,
         'offsets': 1,
         'sane': 4,
         'supporting': 51,
         'spectacle': 3,
         'fingerprinting': 3,
         'weaponized': 2,
         'tallies': 1,
         'lourdes': 1,
         'marshall': 9,
         'stille': 1,
         'permission': 44,
         'umayyad': 1,
         'multiplied': 2,
         'float': 6,
         'propulsion': 1,
         'drawbridge': 1,
         'obey': 10,
         'obliterated': 2,
         '220': 12,
         'purchases': 11,
         '\\r\\nleast': 1,
         'iscores': 6,
         'isuperstate': 1,
         'elements': 10,
         'iohannis': 1,
         'bala': 2,
         'educator': 1,
         'mathematical': 3,
         'dispensing': 1,
         'masterminded': 5,
         'pinned': 5,
         'reagan': 7,
         'originates': 2,
         'code': 40,
     

# A note on Word Embeddings

![word_embed](resources/wordvectors.png)

**Reference**: https://nlp.stanford.edu/projects/glove/

## We are going to use Glove embeddings to initialize our weights while designing our neural network. Let's load the same so that we can ensure our headline corpus' vocabulary matches where possible with Glove Embedding vocabulary.

In [90]:
# Load GloVe's embeddings
embeddings_index = {}
with open('/storage/glove.840B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split(' ')
        word = values[0]
        embedding = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = embedding

print('Word embeddings:', len(embeddings_index))

Word embeddings: 2196016


## It is not necessary that we will have embeddings for all the words in Glove. So to limit such cases by limiting vocabulary by applying simple logic:  Remove the words that are "rare" and are not available in Glove 

In [91]:
# Limit the vocab that we will use to words that appear ≥ threshold or are in GloVe

# Define threshold
threshold = 10

#dictionary to convert words to integers
vocab_to_int = {} 

value = 0
for word, count in word_counts.items():
    if count >= threshold or word in embeddings_index:
        vocab_to_int[word] = value
        value += 1

In [92]:
len(vocab_to_int)

31295

In [93]:
# Special tokens that will be added to our vocab
codes = ["<UNK>","<PAD>"]   

# Add codes to vocab
for code in codes:
    vocab_to_int[code] = len(vocab_to_int)

# Dictionary to convert integers to words
int_to_vocab = {}
for word, value in vocab_to_int.items():
    int_to_vocab[value] = word

usage_ratio = round(len(vocab_to_int) / len(word_counts),4)*100

print("Total Number of Unique Words:", len(word_counts))
print("Number of Words we will use:", len(vocab_to_int))
print("Percent of Words we will use: {}%".format(usage_ratio))

Total Number of Unique Words: 36311
Number of Words we will use: 31297
Percent of Words we will use: 86.19%


## For the words which are common within headlines but are absent in Glove corpus, we will have to randomly initialize them. Over the training, those values will be finetuned along with those of Glove vectors.

In [94]:
# Need to use 300 for embedding dimensions to match GloVe's vectors.
embedding_dim = 300

nb_words = len(vocab_to_int)
# Create matrix with default values of zero
word_embedding_matrix = np.zeros((nb_words, embedding_dim))
for word, i in vocab_to_int.items():
    if word in embeddings_index:
        word_embedding_matrix[i] = embeddings_index[word]
    else:
        # If word not in GloVe, create a random embedding for it
        new_embedding = np.array(np.random.uniform(-1.0, 1.0, embedding_dim))
        embeddings_index[word] = new_embedding
        word_embedding_matrix[i] = new_embedding

# Check if value matches len(vocab_to_int)
print(len(word_embedding_matrix))

31297


## Convert the word sequences to equivalent integer sequences so that it can be used as input to the model

In [95]:
# Change the text from words to integers
# If word is not in vocab, replace it with <UNK> (unknown)
word_count = 0
unk_count = 0

headlines_sequence = []

for daily_headline in clean_headlines:
    daily_headlines_seq = []
    for headline in daily_headline:
        headline_seq = []
        for word in headline.split():
            word_count += 1
            if word in vocab_to_int:
                headline_seq.append(vocab_to_int[word])
            else:
                headline_seq.append(vocab_to_int["<UNK>"])
                unk_count += 1
        daily_headlines_seq.append(headline_seq)
    headlines_sequence.append(daily_headlines_seq)

unk_percent = round(unk_count/word_count,4)*100

print("Total number of words in headlines:", word_count)
print("Total number of UNKs in headlines:", unk_count)
print("Percent of words that are UNK: {}%".format(unk_percent))

Total number of words in headlines: 616686
Total number of UNKs in headlines: 7139
Percent of words that are UNK: 1.16%


In [96]:
headlines_sequence[:1]

[[[8360,
   20565,
   19256,
   19168,
   17819,
   20828,
   15764,
   12124,
   11032,
   19408,
   25123,
   2789,
   18631,
   6474,
   17388,
   12559,
   22398,
   8484,
   26502,
   27409],
  [16862,
   14562,
   901,
   16737,
   31295,
   11377,
   6836,
   17685,
   17400,
   12268,
   12889,
   29411,
   9522,
   29411,
   15354,
   26168,
   20363,
   14140,
   25118],
  [27739, 17704, 901, 8984, 16932, 1061, 27865],
  [14686,
   9139,
   19757,
   27910,
   3303,
   27874,
   25047,
   11990,
   29231,
   15065,
   22813,
   20457,
   27166,
   3110,
   6341,
   15917,
   9139,
   21947,
   18701],
  [30931, 13051, 7358, 29302, 22960, 13349, 13000, 6569, 25169, 28630],
  [13142, 2979, 1432, 23369, 6350, 29519, 27270, 11725],
  [2111, 12164, 4539, 15399, 18032, 22733, 21772, 7266],
  [3357, 25529, 27909, 2136, 20726, 16722, 28127, 503],
  [2389, 5401, 17577, 20263, 29717, 30951, 8034],
  [8479,
   3025,
   12825,
   13041,
   3992,
   17954,
   4865,
   15724,
   30017,
   

## Ensure that the variations in the number of news headlines each day and length of each headlines are handled by taking an average number of headlines each day and average length per headline 

In [97]:
# Find the length of headlines
lengths = []
for headlines in headlines_sequence:
    for headline in headlines:
        lengths.append(len(headline))

# Create a dataframe so that the values can be inspected
lengths = pd.DataFrame(lengths, columns=['counts'])

In [98]:
lengths.describe()

Unnamed: 0,counts
count,49693.0
mean,12.409917
std,6.789827
min,1.0
25%,7.0
50%,10.0
75%,16.0
max,41.0


## Limit the length of a day's news to 200 words, and the length of any headline to 16 words. These values are chosen to not have an excessively long training time and balance the number of headlines used and the number of words from each headline.

In [99]:
max_headline_length = 16
max_daily_length = 200
pad_headlines = []

# For each date in all the dates available
for headlines in headlines_sequence:
    pad_daily_headlines = []
    # for each headline for each date
    for headline in headlines:
        # Add headline if it is less than max length
        if len(headline) <= max_headline_length:
            for word in headline:
                pad_daily_headlines.append(word)
        # Limit headline if it is more than max length  
        else:
            headline = headline[:max_headline_length]
            for word in headline:
                pad_daily_headlines.append(word)
    
    # Pad daily_headlines if they are less than max length
    if len(pad_daily_headlines) < max_daily_length:
        for i in range(max_daily_length-len(pad_daily_headlines)):
            pad = vocab_to_int["<PAD>"]
            pad_daily_headlines.append(pad)
    # Limit daily_headlines if they are more than max length
    else:
        pad_daily_headlines = pad_daily_headlines[:max_daily_length]
    pad_headlines.append(pad_daily_headlines)

## Split data into training and testing sets.
## Validating data will be created during training.

In [100]:
x_train, x_test, y_train, y_test = train_test_split(pad_headlines, norm_price, test_size = 0.15, random_state = 2)

x_train = np.array(x_train)
x_test = np.array(x_test)
y_train = np.array(y_train)
y_test = np.array(y_test)

In [101]:
# Check the lengths
print(len(x_train))
print(len(x_test))

1689
299


# Model Building

## The CNN-RNN architecture
![cnn-rnn](resources/cnn-1d-rnn.jpg)

## 1. Define the hyperparameters

In [102]:
filter_length = 5
dropout = 0.5
learning_rate = 0.001
weights = initializers.TruncatedNormal(mean=0.0, stddev=0.1, seed=2)
nb_filter = 16
rnn_output_size = 128
hidden_dims = 128

## 2. Create the model

In [103]:
def build_model():
    
    model = Sequential()
    
    # Layer 1 - Embedding
    model.add(Embedding(nb_words, 
                         embedding_dim,
                         weights=[word_embedding_matrix], 
                         input_length=max_daily_length))
    model.add(Dropout(dropout))
    
    # Layer 2 - Convolution 1 with dropout
    model.add(Convolution1D(filters = nb_filter, 
                             kernel_size = filter_length, 
                             padding = 'same',
                             activation = 'relu'))
    model.add(Dropout(dropout))    

    # Layer 3 - Convolution 2 with Dropout 
    model.add(Convolution1D(filters = nb_filter, 
                                 kernel_size = filter_length, 
                                 padding = 'same',
                                 activation = 'relu'))
    model.add(Dropout(dropout))    

    # Layer 4 - RNN with dropout
    model.add(LSTM(rnn_output_size, 
                    activation=None,
                    kernel_initializer=weights,
                    dropout = dropout))    

    # Layer 5 - Dense FFN with Dropout
    model.add(Dense(hidden_dims, kernel_initializer=weights))
    model.add(Dropout(dropout))
    
    model.add(Dense(1, 
                    kernel_initializer = weights,
                    name='output'))

    model.compile(loss='mean_squared_error',
                  optimizer=Adam(lr=learning_rate,clipvalue=1.0))
    return model

## 3. Fit the model

In [104]:
model = build_model()
print()
save_best_weights = 'best_weights.h5'

callbacks = [ModelCheckpoint(save_best_weights, monitor='val_loss', save_best_only=True),
            EarlyStopping(monitor='val_loss', patience=5, verbose=1, mode='auto'),
            ReduceLROnPlateau(monitor='val_loss', factor=0.2, verbose=1, patience=3)]

history = model.fit([x_train],
                    y_train,
                    batch_size=128,
                    epochs=100,
                    validation_split=0.15,
                    verbose=True,
                    shuffle=True,
                    callbacks = callbacks)
print(model.summary())


Current model: LR=0.001, Dropout=0.3

Train on 1435 samples, validate on 254 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100

Epoch 00015: ReduceLROnPlateau reducing learning rate to 0.00020000000949949026.
Epoch 16/100
Epoch 17/100
Epoch 00017: early stopping
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 200, 300)          9389100   
_________________________________________________________________
dropout_13 (Dropout)         (None, 200, 300)          0         
_________________________________________________________________
conv1d_7 (Conv1D)            (None, 200, 16)           24016     
_________________________________________________________________
dropout_14 (Dropout)         (None, 200, 16)           0        

## 4. Predict using the model

In [105]:
predictions = model.predict([x_test], verbose = True)



In [106]:
# Compare testing loss to training and validating loss
mse(y_test, predictions)

0.007034644158361947

In [107]:
# Revert prediction back to actual scale
def unnormalize(price):
    '''Revert values to their unnormalized amounts'''
    price = price*(max_price-min_price)+min_price
    return(price)

In [108]:
# Store back-scaled predictions
unnorm_predictions = []
for pred in predictions:
    unnorm_predictions.append(unnormalize(pred))

# Store back-scaled actuals
unnorm_y_test = []
for y in y_test:
    unnorm_y_test.append(unnormalize(y))

In [109]:
# Calculate the median absolute error for the predictions
mae(unnorm_y_test, unnorm_predictions)

76.57360821093789

In [150]:
pd.Series(unnorm_y_test).describe()

count    299.000000
mean       7.094101
std      139.532324
min     -673.139648
25%      -54.689941
50%       10.759766
75%       87.465332
max      541.050782
dtype: float64

## Make Your Own Predictions

Below is the code necessary to make your own predictions. I found that the predictions are most accurate when there is no padding included in the input data. In the create_news variable, I have some default news that you can use, which is from April 30th, 2017. Just change the text to whatever you want, then see the impact your new headline will have.

In [110]:
def news_to_int(news):
    '''Convert your created news into integers'''
    ints = []
    for word in news.split():
        if word in vocab_to_int:
            ints.append(vocab_to_int[word])
        else:
            ints.append(vocab_to_int['<UNK>'])
    return ints

In [111]:
def padding_news(news):
    '''Adjusts the length of your created news to fit the model's input values.'''
    padded_news = news
    if len(padded_news) < max_daily_length:
        for i in range(max_daily_length-len(padded_news)):
            padded_news.append(vocab_to_int["<PAD>"])
    elif len(padded_news) > max_daily_length:
        padded_news = padded_news[:max_daily_length]
    return padded_news

In [151]:
# Default news that you can use

create_news =  "Woman says note from Chinese 'prisoner' was hidden in new purse. \
               21,000 AT&T workers poised for Monday strike \
               housands march against Trump climate policies in D.C., across USA \
               Kentucky judge won't hear gay adoptions because it's not in the child's \"best interest\" \
               Multiple victims shot in UTC area apartment complex \
               Drones Lead Police to Illegal Dumping in Riverside County | NBC Southern California \
               An 86-year-old Californian woman has died trying to fight a man who was allegedly sexually assaulting her 61-year-old friend. \
               Fyre Festival Named in $5Million+ Lawsuit after Stranding Festival-Goers on Island with Little Food, No Security. \
               The \"Greatest Show on Earth\" folds its tent for good \
               U.S.-led fight on ISIS have killed 352 civilians: Pentagon \
               Woman offers undercover officer sex for $25 and some Chicken McNuggets \
               Ohio bridge refuses to fall down after three implosion attempts \
               Jersey Shore MIT grad dies in prank falling from library dome \
               New York graffiti artists claim McDonald's stole work for latest burger campaign \
               SpaceX to launch secretive satellite for U.S. intelligence agency \
               Severe Storms Leave a Trail of Death and Destruction Through the U.S. \
               Hamas thanks N. Korea for its support against ‘Israeli occupation’ \
               Baker Police officer arrested for allegedly covering up details in shots fired investigation \
               Miami doctor’s call to broker during baby’s delivery leads to $33.8 million judgment \
               Minnesota man gets 15 years for shooting 5 Black Lives Matter protesters \
               South Australian woman facing possible 25 years in Colombian prison for drug trafficking \
               The Latest: Deal reached on funding government through Sept. \
               Russia flaunts Arctic expansion with new military bases"

clean_news = clean_text(create_news)

int_news = news_to_int(clean_news)

pad_news = padding_news(int_news)

pad_news = np.array(pad_news).reshape((1,-1))

pred = model.predict([pad_news])

price_change = unnormalize(pred)

print("The Dow should open: {} from the previous open.".format(np.round(price_change[0][0],2)))

The Dow should open: -23.75 from the previous open.
