# Assignment 4 - Using NLP to play the stock market

In this assignment, we'll use everything we've learned to analyze corporate news and pick stocks. Be aware that in this assignment, we're trying to beat the benchmark of random chance (aka better than 50%).

This assignment will involve building three models:

**1. An RNN based on word inputs**

**2. A CNN based on character inputs**

**3. A neural net architecture that merges the previous two models**

You will apply these models to predicting whether a stock return will be positive or negative in the same day of a news publication.

## Your X - Reuters news data

Reuters is a news outlet that reports on corporations, among many other things. Stored in the `news_reuters.csv` file is news data listed in columns. The corresponding columns are the `ticker`, `name of company`, `date of publication`, `headline`, `first sentence`, and `news category`.

In this assignment it is up to you to decide how to clean this dataset. For instance, many of the first sentences contain a location name showing where the reporting is done. This is largely irrevant information and will probably just make your data noisier. You can also choose to subset on a certain news category, which might enhance your model performance and also limit the size of your data.

## Your Y - Stock information from Yahoo! Finance

Trading data from Yahoo! Finance was collected and then normalized using the [S&P 500](https://en.wikipedia.org/wiki/S%26P_500_Index). This is stored in the `stockReturns.json` file. 

In our dataset, the ticker for the S&P is `^GSPC`. Each ticker is compared the S&P and then judged on whether it is outperforming (positive value) or under-performing (negative value) the S&P. Each value is reported on a daily interval from 2004 to now.

Below is a diagram of the data in the json file. Note there are three types of data: short: 1 day return, mid: 7 day return, long 28 day return.

```
          term (short/mid/long)
         /         |         \
   ticker A   ticker B   ticker C
      /   \      /   \      /   \
  date1 date2 date1 date2 date1 date2
```

You will need to pick a length of time to focus on (day, week, month). You are welcome to train models on each dataset as well.  

Transform the return data such that the outcome will be binary:

```
label[y < 0] = 0
label[y >= 0] = 1
```

Finally, this data needs needs to be joined on the date and ticker - For each date of news publication, we want to join the corresponding corporation's news on its return information. We make the assumption that the day's return will reflect the sentiment of the news, regardless of timing.


# Your models - RNN, CNN, and RNN+CNN

For your RNN model, it needs to be based on word inputs, embedding the word inputs, encoding them with an RNN layer, and finally a decoding step (such as softmax or some other choice).

Your CNN model will be based on characters. For reference on how to do this, look at the CNN class demonstration in the course repository.

Finally you will combine the architecture for both of these models, either [merging](https://github.com/ShadyF/cnn-rnn-classifier) using the [Functional API](https://keras.io/getting-started/functional-api-guide/) or [stacking](http://www.aclweb.org/anthology/S17-2134). See the links for reference.

For each of these models, you will need to:
1. Create a train and test set, retaining the same test set for every model
2. Show the architecture for each model, printing it in your python notebook
2. Report the peformance according to some metric
3. Compare the performance of all of these models in a table (precision and recall)
4. Look at your labeling and print out the underlying data compared to the labels - for each model print out 2-3 examples of a good classification and a bad classification. Make an assertion why your model does well or poorly on those outputs.
5. For each model, calculate the return from the three most probable positive stock returns. Compare it to the actual return. Print this information in a table.

### Good luck!

## Load and Clean Raw Data

In [39]:
# Utility libraries
import os
import pickle
import numpy as np
import pandas as pd
import re
import calendar

# Prepocessing libraries
from sklearn.model_selection import train_test_split
import gensim

from keras.models import Model
from keras.layers import Input, concatenate, Concatenate, TimeDistributed
from keras.layers import Dense, Bidirectional, Dropout, Flatten, merge 
from keras.layers import Conv1D, Conv2D, MaxPool1D
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, GRU
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical

In [40]:
import keras.backend as K

def precision(y_true, y_pred):
    """Precision metric.

    Only computes a batch-wise average of precision.

    Computes the precision, a metric for multi-label classification of
    how many selected items are relevant.
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision


def recall(y_true, y_pred):
    """Recall metric.

    Only computes a batch-wise average of recall.

    Computes the recall, a metric for multi-label classification of
    how many relevant items are selected.
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

In [41]:
dataPath = '../data'
reutersFile = 'news_reuters.csv'
stockFile = 'stockReturns.json'

rawX = pd.read_csv(os.path.join(dataPath, reutersFile), header=None, 
                   names=['ticker', 'company', 'pub_date', 'headline', 'first_sent', 'category'])
rawY = pd.read_json(os.path.join(dataPath, stockFile))
# rawY = json.load(os.path.join(dataPath, stockFile))

#### Reformat and Merge Data

In [42]:
def reformat_y_data(data, tickerType='mid'):
    """Convert stock data into binary postive/negative"""
    tmp = data[tickerType].apply(pd.Series)
    tmp = tmp.stack().rename('price', inplace=True).reset_index()
    tmp['y'] = np.where(tmp['price'] >= 0, 1, 0)
    tmp.rename(columns={'level_0': 'ticker', 'level_1': 'pub_date'}, inplace=True)
    return tmp

def clean_and_merge_data(X, Y):
    """Filter X to only those tickers with stock data"""
    y_tickers = set(Y['ticker'])
    X = X.loc[X['ticker'].isin(y_tickers)]
    # Make sure data types are the same for merge    
    Y['pub_date'] = Y['pub_date'].astype(rawX['pub_date'].dtype)
    Y['ticker'] = Y['ticker'].astype(rawX['ticker'].dtype)
    return X.merge(Y, on=['ticker', 'pub_date'], how='left')

In [43]:
cleanY = reformat_y_data(rawY, 'short')

merged = clean_and_merge_data(rawX, cleanY)

#### Clean up text columns and tokenize data

In [44]:
def clean_text(sent):
    """Clean up text data by:
    
    1. Replacing double spaces into a single space
    2. Replace U.S. to United States so U won't get deleted with next 
       replacement
    3. Remove all capitalized words at the beginning of the 
       sentence, since those are mostly places (aka NEW YORK)
    4. Remove unnecessary punctuation (hyphens and asterisks)
    5. Remove dates
    """
    monthStrings = list(calendar.month_name)[1:] + list(calendar.month_abbr)[1:]
    monthPattern = '|'.join(monthStrings)
    
    sent = re.sub(r' +', ' ', sent)
    sent = re.sub(r'U.S.', 'United States', sent)
    sent = re.sub(r'^(\W?[A-Z\s\d]+\b-?)', '', sent)
    sent = re.sub(r'^ ?\W ', '', sent)
    sent = re.sub(r'({}) \d+'.format(monthPattern), '', sent)
    
    # replace double spaces one more time after previous cleaning 
    sent = re.sub(r' +', ' ', sent)
    return sent 

In [45]:
def tokenize_sent(col):
    """Tokenize string into a sequence of words"""
    return [text_to_word_sequence(text, lower=False) for text in col]

def filt_to_one(x, random_state=10):
    """Filter dataset so that there is only one observation per day.
    
    If there is more than one record, will use the topStory record
    if one exists.  If one doesn't or there are 2 topStory records
    then it will randomly select one of the observations.
    """
    if x.shape[0] > 1:
        if 'topStory' in x['category'].unique():
            x = x.loc[x['category'] == 'topStory']
        if x.shape[0] > 1:
            x = x.sample(n=1, random_state=random_state)
    return x

In [46]:
# Clean up text
merged['headline'] = merged.headline.apply(clean_text)
merged['first_sent'] = merged.first_sent.apply(clean_text)

# Turn sentences into tokens
merged['headline_token'] = tokenize_sent(merged.headline)
merged['first_sent_token'] = tokenize_sent(merged.first_sent)

# Get one record per company/day
finalData = merged.groupby(by=['ticker', 'pub_date']).apply(filt_to_one)

# Combine Headline and First Sentence into one text 
finalData['final_text'] = finalData['headline_token'] + finalData.first_sent_token

# Remove observations with missing stock price
finalData.dropna(inplace=True)

In [47]:
# split into train and test
train, test = train_test_split(finalData, test_size = .2, random_state=10)

#### Create Lexicon and Transform Data to Integers for Modeling

In [48]:
class lexiconTransformer():
    """Create a lexicon and transform sentences and
       to indexes for use in the model."""
    
    def __init__(self, words_min_freq = 1, unknown_word_token = u'<UNK>',
                 savePath='models', saveName='stock_word_lexicon'):
        self.words_min_freq = words_min_freq
        self.words_lexicon = None
        self.unknown_word_token = unknown_word_token
        self.indx_to_words_dict = None
        self.savePath = savePath
        self.saveName = saveName + '.pkl'
    
    def fit(self, sents):
        """Create lexicon based on sentences"""
        self.make_words_lexicon(sents)        
        self.make_lexicon_reverse()
        self.save_lexicon()
                
    def transform(self, sents):
        sents_indxs = self.tokens_to_idxs(sents, self.words_lexicon)
        return sents_indxs

    def fit_transform(self, sents):
        self.fit(sents)
        return self.transform(sents)
        
    def make_words_lexicon(self, sents_token):
        """Wrapper for words lexicon"""
        self.words_lexicon = self.make_lexicon(sents_token, self.words_min_freq,
                                               self.unknown_word_token)

    def make_lexicon(self, token_seqs, min_freq=1, unknown = u'<UNK>'):
        """Create lexicon from input based on a frequency

            Parameters:
            
            token_seqs
            ----------
               A list of a list of input tokens that will be used to create the lexicon
            
            min_freq
            --------
               Number of times the token needs to be in the corpus to be included in the
               lexicon.  Otherwise, will be replaced with the "unknown" entry
            
            unknown
            -------
               The word in the lexicon that should be used for tokens not existing in lexicon.
               This can be a value that already exists in input list.  For instance, in 
               Named Entity Recognition, a value of "other" or "O" may already be a tag 
               and so having "other" and "unknown" are the same thing!
        """
        # Count how often each word appears in the text.
        token_counts = {}
        for seq in token_seqs:
            for token in seq:
                if token in token_counts:
                    token_counts[token] += 1
                else:
                    token_counts[token] = 1

        # Then, assign each word to a numerical index. 
        # Filter words that occur less than min_freq times.
        lexicon = [token for token, count in token_counts.items() if count >= min_freq]
        
        # Have to delete unknown value from token list so not a gap in lexicon values when
        # turning it into a lexicon (aka, if unknown == OTHER and that is the 7th value, 
        # then 7 won't exist in the lexicon which may cause issues)
        if unknown in lexicon:
            lexicon.remove(unknown)

        # Indices start at 1. 0 is reserved for padding, and 1 is reserved for unknown words.
        lexicon = {token:idx + 2 for idx,token in enumerate(lexicon)}
        
        lexicon[unknown] = 1 # Unknown words are those that occur fewer than min_freq times
        lexicon_size = len(lexicon)
        return lexicon
    
    def save_lexicon(self):
        "Save lexicons by pickling them"
        if not os.path.exists(self.savePath):
            os.makedirs(self.savePath)
        with open(os.path.join(self.savePath, self.saveName), 'wb') as f:
            pickle.dump(self.words_lexicon, f)
                        
    def load_lexicon(self):
        with open(os.path.join(self.savePath, self.saveName), 'rb') as f:
            self.words_lexicon = pickle.load(f)
                    
        self.make_lexicon_reverse()
        
    def make_lexicon_reverse(self):
        self.indx_to_words_dict = self.get_lexicon_lookup(self.words_lexicon)
    
    def get_lexicon_lookup(self, lexicon):
        '''Make a dictionary where the string representation of 
           a lexicon item can be retrieved from its numerical index'''
        lexicon_lookup = {idx: lexicon_item for lexicon_item, idx in lexicon.items()}
        return lexicon_lookup
    
    def tokens_to_idxs(self, token_seqs, lexicon):
        """Transform tokens to numeric indexes or <UNK> if doesn't exist"""
        idx_seqs = [[lexicon[token] if token in lexicon else lexicon['<UNK>'] for 
                                 token in token_seq] for token_seq in token_seqs]
        return idx_seqs

In [49]:
lexicon = lexiconTransformer(words_min_freq=2)

lexicon.fit(train['final_text'])

In [50]:
train['finalText_indx'] = lexicon.transform(train['final_text'])
test['finalText_indx'] = lexicon.transform(test['final_text'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


In [51]:
def get_max_seq_len(sents):
    return max([len(idx_seq) for idx_seq in sents])

def pad_idx_seqs(idx_seqs, max_seq_len):
    # Keras provides a convenient padding function; 
    padded_idxs = pad_sequences(sequences=idx_seqs, maxlen=max_seq_len)
    return padded_idxs

In [52]:
# Get length of longest sequence
max_seq_len = get_max_seq_len(train['finalText_indx'])

#Add one to max length for offsetting sequence by 1
train_padded_words = pad_idx_seqs(train['finalText_indx'], 
                                  max_seq_len + 1) 

test_padded_words = pad_idx_seqs(test['finalText_indx'], 
                                  max_seq_len + 1) 

train_y = to_categorical(train['y'])
test_y = to_categorical(test['y'])

## Model 1: RNN

In [53]:
n_out = 2
nb_epoch = 50

In [54]:
def create_embed_matrix(model, lexicon, embed_size):
    "Create a weight matrix for words"
    vocab_size = len(lexicon)
    embedding_matrix = np.zeros((vocab_size, embed_size))
    n = 0
    word_list = list(lexicon)
    for i in range(vocab_size):
        word = word_list[i]
        if word in model.wv.vocab:
            embedding_vector = model.wv[word]
            if embedding_vector is not None:
                embedding_matrix[n] = embedding_vector[:embed_size]
                n += 1

    return embedding_matrix[:n, :]


In [55]:
w2v = gensim.models.KeyedVectors.load_word2vec_format('../data/GoogleNews-vectors-negative300.bin.gz', binary=True)

In [56]:
word_embed_len = 200
word_embed_matrix = create_embed_matrix(w2v, lexicon.words_lexicon, 
                                   word_embed_len)



In [57]:
def create_rnn_model(seq_input_len, embed_matrix, 
                     n_RNN_nodes, n_dense_nodes, 
                     recurrent_dropout=0.2, 
                     drop_out=.2, n_out=2):
    
    word_input = Input(shape=(seq_input_len,), name='word_input_layer')
        
    word_embeddings = Embedding(input_dim=embed_matrix.shape[0],
                                output_dim=embed_matrix.shape[1],
                                weights=[embed_matrix], 
                                mask_zero=True, 
                                name='word_embedding_layer')(word_input) 

    hidden_layer1 = Bidirectional(LSTM(units=n_RNN_nodes, return_sequences=True, 
                                      recurrent_dropout=recurrent_dropout, 
                                      dropout=drop_out, name='hidden_layer1'))(word_embeddings)
    
    hidden_layer2 = Bidirectional(LSTM(units=n_RNN_nodes, return_sequences=False, 
                                      recurrent_dropout=recurrent_dropout,
                                      dropout=drop_out, name='hidden_layer2'))(hidden_layer1)

    dense_layer = Dense(units=n_dense_nodes, activation='relu', name='dense_layer')(hidden_layer2)

    drop_out3 = Dropout(drop_out)(dense_layer)

    output_layer = Dense(units=n_out, activation='softmax',
                         name='output_layer')(drop_out3)

    model = Model(inputs=[word_input], outputs=output_layer)
    model.compile(loss='categorical_crossentropy', optimizer="adam", 
                  metrics=['accuracy', recall, precision])

    return model 

In [58]:
rnn_model = create_rnn_model(seq_input_len=train_padded_words.shape[-1],
                             embed_matrix=word_embed_matrix, 
                             recurrent_dropout=.4, drop_out=.5,
                             n_RNN_nodes=500, n_dense_nodes=500, n_out=n_out)

In [59]:
def train_and_test_model(model, x_train, y_train, x_test, y_test, 
                         modelSaveName, modelSavePath='models',
                         batch_size=128, epochs=3, validation_split=.1):
    """Train model, save weights, and predict data"""
    
    print(model.summary())
    
    filepath = os.path.join(modelSavePath, modelSaveName + '.hdf5')
    checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1)
    callbacks_list = [checkpoint]
    model.fit(x=x_train, y=y_train, batch_size=batch_size, 
              epochs=epochs, validation_split=validation_split, 
              callbacks=callbacks_list)
    
    score, acc, rec, prec = model.evaluate(x_test, y_test, batch_size=batch_size)
    return (model, acc, rec, prec)    

In [60]:
rnn_res = train_and_test_model(rnn_model, train_padded_words, train_y, 
                               test_padded_words, test_y, 'rnn_model',
                               epochs=nb_epoch)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
word_input_layer (InputLayer (None, 102)               0         
_________________________________________________________________
word_embedding_layer (Embedd (None, 102, 200)          2356000   
_________________________________________________________________
bidirectional_5 (Bidirection (None, 102, 1000)         2804000   
_________________________________________________________________
bidirectional_6 (Bidirection (None, 1000)              6004000   
_________________________________________________________________
dense_layer (Dense)          (None, 500)               500500    
_________________________________________________________________
dropout_5 (Dropout)          (None, 500)               0         
_________________________________________________________________
output_layer (Dense)         (None, 2)                 1002      
Total para


Epoch 00026: saving model to models/rnn_model.hdf5
Epoch 27/50

Epoch 00027: saving model to models/rnn_model.hdf5
Epoch 28/50

Epoch 00028: saving model to models/rnn_model.hdf5
Epoch 29/50

Epoch 00029: saving model to models/rnn_model.hdf5
Epoch 30/50

Epoch 00030: saving model to models/rnn_model.hdf5
Epoch 31/50

Epoch 00031: saving model to models/rnn_model.hdf5
Epoch 32/50

Epoch 00032: saving model to models/rnn_model.hdf5
Epoch 33/50

Epoch 00033: saving model to models/rnn_model.hdf5
Epoch 34/50

Epoch 00034: saving model to models/rnn_model.hdf5
Epoch 35/50

Epoch 00035: saving model to models/rnn_model.hdf5
Epoch 36/50

Epoch 00036: saving model to models/rnn_model.hdf5
Epoch 37/50

Epoch 00037: saving model to models/rnn_model.hdf5
Epoch 38/50

Epoch 00038: saving model to models/rnn_model.hdf5
Epoch 39/50

Epoch 00039: saving model to models/rnn_model.hdf5
Epoch 40/50

Epoch 00040: saving model to models/rnn_model.hdf5
Epoch 41/50

Epoch 00041: saving model to models/rnn

In [61]:
rnn_res

(<keras.engine.training.Model at 0x7f2777c9c5f8>,
 0.5267778764805111,
 0.5267778764805111,
 0.5267778764805111)

## Model 2: CNN

In [62]:
def vectorize_sentences(data, lexicon, maxlen=200):
    X = []
    for sentences in data:
        x = [lexicon[token] if token in lexicon else lexicon['<UNK>'] for 
                                 token in sentences]
        x2 = np.eye(len(char_indices) + 1)[x]
        X.append(x2)
    return (pad_sequences(X, maxlen=maxlen))

def create_cnn_model(char_maxlen, vocab_size,
                     nb_filter=100, filter_kernels = [4] * 4,
                     pool_size=3, n_dense_nodes=100,
                     drop_out=.2, n_out=2):

    inputs = Input(shape=(char_maxlen, vocab_size), name='char_input_layer')

    conv1 = Conv1D(nb_filter, kernel_size=filter_kernels[0],
                  padding='valid', activation='relu',
                  input_shape=(char_maxlen, vocab_size))(inputs)
    
    maxpool1 = MaxPool1D(pool_size=pool_size)(conv1)

    conv2 = Conv1D(nb_filter, kernel_size=filter_kernels[1],
                          padding='valid', activation='relu')(maxpool1)
    maxpool2 = MaxPool1D(pool_size=pool_size)(conv2)

    conv3 = Conv1D(nb_filter, kernel_size=filter_kernels[2],
                          padding='valid', activation='relu')(maxpool2)

    conv4 = Conv1D(nb_filter, kernel_size=filter_kernels[3],
                          padding='valid', activation='relu')(conv3)

    maxpool3 = MaxPool1D(pool_size=pool_size)(conv4)
    flatten = Flatten()(maxpool3)

    dense_layer = Dense(n_dense_nodes, activation='relu')(flatten)
    dropout = Dropout(drop_out)(dense_layer)

    output_layer = Dense(n_out, activation='softmax', name='output')(dropout)

    model = Model(inputs=inputs, outputs=output_layer)

    model.compile(loss='categorical_crossentropy', optimizer="adam", 
                  metrics=['accuracy', recall, precision])    
    return model 

In [63]:
char_maxlen = 1024 
nb_filter = 128
dense_outputs = 1024
filter_kernels = [7, 5, 5, 3]
pool_size = 5

In [64]:
# Turn all tokens into one string and then all obs 
# into one overall string
trainTokensAsString = train.final_text.apply(lambda x: ' '.join(x))
testTokensAsString = test.final_text.apply(lambda x: ' '.join(x))
oneTxt = ' '.join(trainTokensAsString)

# Get info about characters
chars = set(oneTxt)
vocab_size = len(chars) + 1
print('total chars:', vocab_size)
char_indices = dict((c, i + 2) for i, c in enumerate(chars))
indices_char = dict((i + 2, c) for i, c in enumerate(chars))

char_indices['<UNK>'] = 1
indices_char[1] = '<UNK>'

total chars: 86


In [65]:
trainCharData = vectorize_sentences(trainTokensAsString, char_indices, char_maxlen)
testCharData = vectorize_sentences(testTokensAsString, char_indices, char_maxlen)

In [66]:
cnn_model = create_cnn_model(char_maxlen=char_maxlen, 
                             vocab_size=vocab_size,
                             nb_filter=nb_filter, 
                             filter_kernels=filter_kernels,
                             pool_size=pool_size, 
                             n_dense_nodes=dense_outputs,
                             drop_out=.5, 
                             n_out=n_out)

In [67]:
cnn_res = train_and_test_model(cnn_model, trainCharData[:, :, 1:],
                               train_y, 
                               testCharData[:, :, 1:], 
                               test_y, 
                               'cnn_model',
                               epochs=nb_epoch)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
char_input_layer (InputLayer (None, 1024, 86)          0         
_________________________________________________________________
conv1d_9 (Conv1D)            (None, 1018, 128)         77184     
_________________________________________________________________
max_pooling1d_7 (MaxPooling1 (None, 203, 128)          0         
_________________________________________________________________
conv1d_10 (Conv1D)           (None, 199, 128)          82048     
_________________________________________________________________
max_pooling1d_8 (MaxPooling1 (None, 39, 128)           0         
_________________________________________________________________
conv1d_11 (Conv1D)           (None, 35, 128)           82048     
_________________________________________________________________
conv1d_12 (Conv1D)           (None, 33, 128)           49280     
__________


Epoch 00024: saving model to models/cnn_model.hdf5
Epoch 25/50

Epoch 00025: saving model to models/cnn_model.hdf5
Epoch 26/50

Epoch 00026: saving model to models/cnn_model.hdf5
Epoch 27/50

Epoch 00027: saving model to models/cnn_model.hdf5
Epoch 28/50

Epoch 00028: saving model to models/cnn_model.hdf5
Epoch 29/50

Epoch 00029: saving model to models/cnn_model.hdf5
Epoch 30/50

Epoch 00030: saving model to models/cnn_model.hdf5
Epoch 31/50

Epoch 00031: saving model to models/cnn_model.hdf5
Epoch 32/50

Epoch 00032: saving model to models/cnn_model.hdf5
Epoch 33/50

Epoch 00033: saving model to models/cnn_model.hdf5
Epoch 34/50

Epoch 00034: saving model to models/cnn_model.hdf5
Epoch 35/50

Epoch 00035: saving model to models/cnn_model.hdf5
Epoch 36/50

Epoch 00036: saving model to models/cnn_model.hdf5
Epoch 37/50

Epoch 00037: saving model to models/cnn_model.hdf5
Epoch 38/50

Epoch 00038: saving model to models/cnn_model.hdf5
Epoch 39/50

Epoch 00039: saving model to models/cnn

## Model 3: RNN+CNN

In [68]:
def create_cnn_rnn_model(rnn_input_len, char_maxlen, vocab_size,
                         embed_matrix, n_RNN_nodes, 
                         nb_filter=100, filter_kernels = [4] * 4,
                         pool_size=3, n_dense_nodes=100,
                         recurrent_dropout=0.2, 
                         drop_out=.2, n_out=2):
    
    word_input = Input(shape=(rnn_input_len,), name='word_input_layer')
    char_input = Input(shape=(char_maxlen, vocab_size), name='char_input_layer')
    
    word_embeddings = Embedding(input_dim=embed_matrix.shape[0],
                                output_dim=embed_matrix.shape[1],
                                weights=[embed_matrix], 
                                mask_zero=True, 
                                name='word_embedding_layer')(word_input) 

    rnn_output1 = Bidirectional(LSTM(units=n_RNN_nodes, return_sequences=True, 
                                      recurrent_dropout=recurrent_dropout, 
                                      dropout=drop_out, name='hidden_layer1'))(word_embeddings)
    
    rnn_output2 = Bidirectional(LSTM(units=n_RNN_nodes, return_sequences=False, 
                                      recurrent_dropout=recurrent_dropout,
                                      dropout=drop_out, name='hidden_layer2'))(rnn_output1)
            
    conv1 = Conv1D(nb_filter, kernel_size=filter_kernels[0],
                  padding='valid', activation='relu',
                  input_shape=(char_maxlen, vocab_size))(char_input)

    maxpool1 = MaxPool1D(pool_size=pool_size)(conv1)

    conv2 = Conv1D(nb_filter, kernel_size=filter_kernels[1],
                          padding='valid', activation='relu')(maxpool1)
    maxpool2 = MaxPool1D(pool_size=pool_size)(conv2)

    conv3 = Conv1D(nb_filter, kernel_size=filter_kernels[2],
                          padding='valid', activation='relu')(maxpool2)

    conv4 = Conv1D(nb_filter, kernel_size=filter_kernels[3],
                          padding='valid', activation='relu')(conv3)

    maxpool3 = MaxPool1D(pool_size=pool_size)(conv4)
    cnn_output = Flatten()(maxpool3)

    merged_layer = concatenate([cnn_output, rnn_output2])
    
    dense_layer1 = Dense(n_dense_nodes, activation='relu', name='dense_layer')(merged_layer)
    drop_out1 = Dropout(drop_out)(dense_layer1)
    dense_layer2 = Dense(n_dense_nodes, activation='relu')(drop_out1)
    drop_out2 = Dropout(drop_out)(dense_layer2)
    
    main_output = Dense(n_out, activation='softmax', name='output_layer')(drop_out2)

    model = Model(inputs=[word_input, char_input], outputs=[main_output])

    model.compile(loss='categorical_crossentropy', optimizer="adam", 
                  metrics=['accuracy', recall, precision])    

    return model 

In [69]:
cnn_rnn_model = create_cnn_rnn_model(rnn_input_len=train_padded_words.shape[-1], 
                                     char_maxlen=char_maxlen, 
                                     vocab_size=vocab_size,
                                     embed_matrix=word_embed_matrix, 
                                     n_RNN_nodes=500,
                                     nb_filter=nb_filter, 
                                     filter_kernels=filter_kernels,
                                     pool_size=pool_size, 
                                     n_dense_nodes=400,
                                     recurrent_dropout=0.4, 
                                     drop_out=.5, 
                                     n_out=n_out)
                             

In [70]:
cnn_rnn_res = train_and_test_model(cnn_rnn_model, 
                               [train_padded_words, trainCharData[:, :, 1:]],
                               train_y, 
                               [test_padded_words, testCharData[:, :, 1:]],
                               test_y, 
                               'cnn_rnn_model',
                               epochs=nb_epoch)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
char_input_layer (InputLayer)   (None, 1024, 86)     0                                            
__________________________________________________________________________________________________
conv1d_13 (Conv1D)              (None, 1018, 128)    77184       char_input_layer[0][0]           
__________________________________________________________________________________________________
max_pooling1d_10 (MaxPooling1D) (None, 203, 128)     0           conv1d_13[0][0]                  
__________________________________________________________________________________________________
conv1d_14 (Conv1D)              (None, 199, 128)     82048       max_pooling1d_10[0][0]           
__________________________________________________________________________________________________
max_poolin

Epoch 15/50

Epoch 00015: saving model to models/cnn_rnn_model.hdf5
Epoch 16/50

Epoch 00016: saving model to models/cnn_rnn_model.hdf5
Epoch 17/50

Epoch 00017: saving model to models/cnn_rnn_model.hdf5
Epoch 18/50

Epoch 00018: saving model to models/cnn_rnn_model.hdf5
Epoch 19/50

Epoch 00019: saving model to models/cnn_rnn_model.hdf5
Epoch 20/50

Epoch 00020: saving model to models/cnn_rnn_model.hdf5
Epoch 21/50

Epoch 00021: saving model to models/cnn_rnn_model.hdf5
Epoch 22/50

Epoch 00022: saving model to models/cnn_rnn_model.hdf5
Epoch 23/50

Epoch 00023: saving model to models/cnn_rnn_model.hdf5
Epoch 24/50

Epoch 00024: saving model to models/cnn_rnn_model.hdf5
Epoch 25/50

Epoch 00025: saving model to models/cnn_rnn_model.hdf5
Epoch 26/50

Epoch 00026: saving model to models/cnn_rnn_model.hdf5
Epoch 27/50

Epoch 00027: saving model to models/cnn_rnn_model.hdf5
Epoch 28/50

Epoch 00028: saving model to models/cnn_rnn_model.hdf5
Epoch 29/50

Epoch 00029: saving model to models

Epoch 45/50

Epoch 00045: saving model to models/cnn_rnn_model.hdf5
Epoch 46/50

Epoch 00046: saving model to models/cnn_rnn_model.hdf5
Epoch 47/50

Epoch 00047: saving model to models/cnn_rnn_model.hdf5
Epoch 48/50

Epoch 00048: saving model to models/cnn_rnn_model.hdf5
Epoch 49/50

Epoch 00049: saving model to models/cnn_rnn_model.hdf5
Epoch 50/50

Epoch 00050: saving model to models/cnn_rnn_model.hdf5


In [71]:
cnn_rnn_res

(<keras.engine.training.Model at 0x7f28682d5908>,
 0.5201931517306286,
 0.5201931517306286,
 0.5201931517306286)

1. Create a train and test set, retaining the same test set for every model
2. Show the architecture for each model, printing it in your python notebook
2. Report the peformance according to some metric
3. Compare the performance of all of these models in a table (precision and recall)
4. Look at your labeling and print out the underlying data compared to the labels - for each model print out 2-3 examples of a good classification and a bad classification. Make an assertion why your model does well or poorly on those outputs.
5. For each model, calculate the return from the three most probable positive stock returns. Compare it to the actual return. Print this information in a table.

#### Put results in a table

In [72]:
pd.DataFrame.from_records([cnn_res[1:4], rnn_res[1:4], cnn_rnn_res[1:4]], 
                          columns=['accuracy', 'recall', 'precision'], 
                         index=['cnn_mod', 'rnn_mod', 'cnn_rnn_mod'])

Unnamed: 0,accuracy,recall,precision
cnn_mod,0.532046,0.532046,0.532046
rnn_mod,0.526778,0.526778,0.526778
cnn_rnn_mod,0.520193,0.520193,0.520193


In [73]:
def print_classifications(classifications, classType, test_y, test_text):
    texts = [' '.join(sent) for sent in test_text[classifications.index]]
    stock_movements = np.where(test_y[classifications.index], 'positive', 'negative')
    
    print('Examples of {} predictions:\n'.format(classType))
    for i in range(len(texts)):
        print('Stock movement was {}'.format(stock_movements[i]))
        print('News info:\n{}'.format(texts[i]))
        print('')

In [74]:
def predict_and_print_samples(model, modelName, test_x, test_y=test['y'], test_text = test['final_text']):
    """"Print out predictions of the model"""
    print('Stats for {} model'.format(modelName))
    
    res = model.predict(test_x)
    class_res = np.apply_along_axis(np.argmax, axis=1, arr=res)

    comparisons = class_res == test_y
    good_class = comparisons.loc[comparisons == True].sample(n=3)
    bad_class = comparisons.loc[comparisons == False].sample(n=3)

    print_classifications(good_class, 'correct', test_y, test_text)
    print_classifications(bad_class, 'INcorrect', test_y, test_text)

    
    top3MostProbPosArg = np.argsort(res[:, 1])[-3:]
    top3Y = test_y.iloc[top3MostProbPosArg]
    top3Probs = pd.Series(res[top3MostProbPosArg, 1], index=top3Y.index)
    top3Data = pd.concat([top3Y, top3Probs], axis=1)
    top3Data.columns = ['Actual', 'PositiveProb']
    print('')
    print('Top 3 Most Positive Probability:')
    print(top3Data)


In [75]:
predict_and_print_samples(rnn_res[0], 'RNN', test_padded_words)

predict_and_print_samples(cnn_res[0], 'CNN', testCharData[:, :, 1:])

predict_and_print_samples(cnn_rnn_res[0], 'CNN_RNN', [test_padded_words, testCharData[:, :, 1:]])

Stats for RNN model
Examples of correct predictions:

Stock movement was positive
News info:
Obama corporate giants announce plan to boost suppliers President Barack Obama is enlisting several major United States and multinational companies to draw attention to an initiative aimed at helping small businesses expand and hire workers

Stock movement was positive
News info:
Lloyds RBS CEOs reassure staff on strength after Brexit vote The chief executives of Lloyds Banking Group and Royal Bank of Scotland moved to reassure thousands of workers that their state backed companies would weather the turmoil sparked by Britain's decision to quit the European Union

Stock movement was positive
News info:
UK sells more Lloyds shares has raised over 10 bln pounds Government planning retail sale later this year Adds further details on plans for Lloyds RBS sales

Examples of INcorrect predictions:

Stock movement was negative
News info:
Raytheon board on adopted by laws to implement proxy access On b