# Assignment 4 - Using NLP to play the stock market

In this assignment, we'll use everything we've learned to analyze corporate news and pick stocks. Be aware that in this assignment, we're trying to beat the benchmark of random chance (aka better than 50%).

This assignment will involve building three models:

**1. An RNN based on word inputs**

**2. A CNN based on character inputs**

**3. A neural net architecture that merges the previous two models**

You will apply these models to predicting whether a stock return will be positive or negative in the same day of a news publication.

## Your X - Reuters news data

Reuters is a news outlet that reports on corporations, among many other things. Stored in the `news_reuters.csv` file is news data listed in columns. The corresponding columns are the `ticker`, `name of company`, `date of publication`, `headline`, `first sentence`, and `news category`.

In this assignment it is up to you to decide how to clean this dataset. For instance, many of the first sentences contain a location name showing where the reporting is done. This is largely irrevant information and will probably just make your data noisier. You can also choose to subset on a certain news category, which might enhance your model performance and also limit the size of your data.

## Your Y - Stock information from Yahoo! Finance

Trading data from Yahoo! Finance was collected and then normalized using the [S&P 500](https://en.wikipedia.org/wiki/S%26P_500_Index). This is stored in the `stockReturns.json` file. 

In our dataset, the ticker for the S&P is `^GSPC`. Each ticker is compared the S&P and then judged on whether it is outperforming (positive value) or under-performing (negative value) the S&P. Each value is reported on a daily interval from 2004 to now.

Below is a diagram of the data in the json file. Note there are three types of data: short: 1 day return, mid: 7 day return, long 28 day return.

```
          term (short/mid/long)
         /         |         \
   ticker A   ticker B   ticker C
      /   \      /   \      /   \
  date1 date2 date1 date2 date1 date2
```

You will need to pick a length of time to focus on (day, week, month). You are welcome to train models on each dataset as well.  

Transform the return data such that the outcome will be binary:

```
label[y < 0] = 0
label[y >= 0] = 1
```

Finally, this data needs needs to be joined on the date and ticker - For each date of news publication, we want to join the corresponding corporation's news on its return information. We make the assumption that the day's return will reflect the sentiment of the news, regardless of timing.


# Your models - RNN, CNN, and RNN+CNN

For your RNN model, it needs to be based on word inputs, embedding the word inputs, encoding them with an RNN layer, and finally a decoding step (such as softmax or some other choice).

Your CNN model will be based on characters. For reference on how to do this, look at the CNN class demonstration in the course repository.

Finally you will combine the architecture for both of these models, either [merging](https://github.com/ShadyF/cnn-rnn-classifier) using the [Functional API](https://keras.io/getting-started/functional-api-guide/) or [stacking](http://www.aclweb.org/anthology/S17-2134). See the links for reference.

For each of these models, you will need to:
1. Create a train and test set, retaining the same test set for every model
2. Show the architecture for each model, printing it in your python notebook
2. Report the peformance according to some metric
3. Compare the performance of all of these models in a table (precision and recall)
4. Look at your labeling and print out the underlying data compared to the labels - for each model print out 2-3 examples of a good classification and a bad classification. Make an assertion why your model does well or poorly on those outputs.
5. For each model, calculate the return from the three most probable positive stock returns. Compare it to the actual return. Print this information in a table.

### Good luck!

## Model 1: RNN

In [221]:
import json
import pandas as pd
import numpy as np
# Read reuters data
news= pd.read_csv("news_reuters.csv", header=0)
news.columns=["ticker", "name_of_company", "date", "headline", "first_sentence", "news_category"]
stock_info= pd.read_json('stockReturns.json')
stock_info.head()
#stock_info_long= stock_info[0]

Unnamed: 0,long,mid,short
AAPL,"{'20040106': -0.0023, '20040107': -0.0016, '20...","{'20040106': 0.06760000000000001, '20040107': ...","{'20040106': -0.0013000000000000002, '20040107..."
ABB,"{'20040106': 0.09630000000000001, '20040107': ...","{'20040106': 0.09340000000000001, '20040107': ...","{'20040106': 0.0015, '20040107': -0.0107000000..."
ABMD,"{'20040106': 0.08360000000000001, '20040107': ...","{'20040106': 0.039400000000000004, '20040107':...","{'20040106': 0.0102, '20040107': 0.0217, '2004..."
ABR,"{'20040413': 0.0367, '20040414': 0.0053, '2004...","{'20040413': 0.0082, '20040414': 0.01970000000...","{'20040413': 0.013900000000000001, '20040414':..."
ACAD,"{'20040602': -0.049300000000000004, '20040603'...","{'20040602': -0.0821, '20040603': -0.0611, '20...","{'20040602': -0.0346, '20040603': -0.0005, '20..."


In [222]:
news.head()

Unnamed: 0,ticker,name_of_company,date,headline,first_sentence,news_category
0,AA,Alcoa Corporation,20110708,Global markets weekahead: Lacking conviction,LONDON Investors are unlikely to gain strong c...,normal
1,AA,Alcoa Corporation,20110708,Jobs halt Wall Street rally investors eye ear...,NEW YORK Stocks fell on Friday as a weak jobs ...,topStory
2,AA,Alcoa Corporation,20110708,REFILE-TABLE-Australia's top carbon polluters,CANBERRA July 8 Following is a list of Austr...,normal
3,AA,Alcoa Corporation,20110708,US STOCKS-Jobs data hits stocks but earnings ...,* Google slumps on downgrade one of Nasdaq's...,normal
4,AA,Alcoa Corporation,20110708,US STOCKS-Jobs halt Wall St rally investors e...,* Dow off 0.5 pct S&P down 0.7 pct Nasdaq o...,normal


In [228]:
y = stock_info['long']
y = pd.DataFrame.from_dict(y)
y.reset_index(inplace = True)
y.head()

Unnamed: 0,index,long
0,AAPL,"{'20040106': -0.0023, '20040107': -0.0016, '20..."
1,ABB,"{'20040106': 0.09630000000000001, '20040107': ..."
2,ABMD,"{'20040106': 0.08360000000000001, '20040107': ..."
3,ABR,"{'20040413': 0.0367, '20040414': 0.0053, '2004..."
4,ACAD,"{'20040602': -0.049300000000000004, '20040603'..."


In [229]:
yy = y['long'].apply(pd.Series)
yy.head()

Unnamed: 0,20040106,20040107,20040108,20040109,20040113,20040114,20040115,20040116,20040121,20040122,...,20180308,20180309,20180313,20180314,20180315,20180316,20180320,20180321,20180322,20180323
0,-0.0023,-0.0016,-0.0376,-0.0423,-0.0556,-0.0741,-0.0411,0.0269,0.0021,0.0235,...,0.0108,0.0045,-0.0034,0.0019,0.0056,0.0052,0.016,0.0209,0.0389,0.0046
1,0.0963,0.0916,0.1032,-0.0069,0.0404,0.0003,-0.0378,-0.0684,-0.0017,-0.0395,...,-0.0155,-0.0069,0.0214,0.0227,0.0138,0.0122,0.0087,0.0134,0.0206,0.0639
2,0.0836,0.0283,-0.0199,-0.0829,-0.0392,0.0117,-0.0115,0.0258,-0.0825,-0.1271,...,0.0416,0.0403,0.0403,0.064,0.0417,0.0388,0.0416,0.0387,0.0523,0.0492
3,,,,,,,,,,,...,0.0641,0.0431,0.0493,0.032,0.0204,0.0171,-0.0076,-0.0121,-0.0239,-0.0415
4,,,,,,,,,,,...,-0.0978,-0.1037,-0.3875,-0.3022,-0.2314,-0.2368,-0.2455,-0.2055,-0.2092,-0.2117


In [230]:
yy['ticker'] = y['index']
yy.set_index('ticker', inplace = True)
yy = yy.stack()
yy = yy.to_frame(name=None)
yy.reset_index(inplace = True)
yy.columns = ['ticker','date','price value']
yy.head()

Unnamed: 0,ticker,date,price value
0,AAPL,20040106,-0.0023
1,AAPL,20040107,-0.0016
2,AAPL,20040108,-0.0376
3,AAPL,20040109,-0.0423
4,AAPL,20040113,-0.0556


In [231]:
yy["price value"]= (yy['price value'] > 0).astype(int)
yy.head()

Unnamed: 0,ticker,date,price value
0,AAPL,20040106,0
1,AAPL,20040107,0
2,AAPL,20040108,0
3,AAPL,20040109,0
4,AAPL,20040113,0


In [232]:
print(len(set(yy["ticker"])))

439


In [233]:
print(len(set(news["ticker"])))
news_ticker= set(news["ticker"])
yy_ticker= set(yy["ticker"])

2224


In [235]:
merged_data= news.merge(yy, left_on='ticker', right_on='ticker', how='inner')
df=merged_data.drop_duplicates(subset='first_sentence', keep='first')

In [239]:
print(df.shape)
df.head()

(37040, 8)


Unnamed: 0,ticker,name_of_company,date_x,headline,first_sentence,news_category,date_y,price value
0,AAPL,1-800 FLOWERSCOM Inc,20140414,Apple antitrust compliance off to a promising ...,"NEW YORK Apple Inc has made a ""promising start...",topStory,20040106,0
2718,AAPL,1-800 FLOWERSCOM Inc,20140414,Apple antitrust compliance off to a promising ...,"NEW YORK April 14 Apple Inc has made a ""promi...",normal,20040106,0
5436,AAPL,1-800 FLOWERSCOM Inc,20140414,COLUMN-How to avoid the trouble coming to the ...,(The opinions expressed here are those of the ...,normal,20040106,0
8154,AAPL,1-800 FLOWERSCOM Inc,20140414,How to avoid the trouble coming to the tech se...,CHICAGO A resounding shot across the bow has b...,normal,20040106,0
10872,AAPL,1-800 FLOWERSCOM Inc,20140415,Apple cannot escape U.S. states' e-book antitr...,NEW YORK Apple Inc on Tuesday lost an attempt ...,normal,20040106,0


In [None]:
X= df["first_sentence"]
Y= df["price value"]

In [237]:
from nltk.tokenize import WordPunctTokenizer,sent_tokenize, word_tokenize
from collections import Counter
from string import punctuation, ascii_lowercase
import regex as re
from tqdm import tqdm
from geotext import GeoText
stop_words = set(stopwords.words('english'))
spec_char = ['~','!','@','#','$','%','^','&','*','(',')','_','+','|','}','{',
                                               ':','"',"'",'?','>','<','`','-','=',';','/','.',',','.)']
loc= ['NEW','YORK','JERSEY']
city=[]
for sen in list_sentences:
    places= GeoText(sen)
    city.append(places.cities)
city_flat_list = [item for sublist in city for item in sublist]
city_list= list(set(city_flat_list))
# setup tokenizer
tokenizer = WordPunctTokenizer()

vocab = Counter()

def text_to_wordlist(text, lower=False):
    
    # Tokenize
    text = tokenizer.tokenize(text)
    text= [t for t in text if t not in stop_words]
    text= [c for c in text if c not in spec_char]
    text= [l for l in text if l not in city_list]
    text= [n for n in text if n not in loc]
    
    # Return a list of words
    vocab.update(text)
    
    return text

def process_comments(list_sentences, lower=False):
    comments = []
    for text in tqdm(list_sentences):
        txt = text_to_wordlist(text, lower=lower)
        comments.append(txt)
    return comments


list_sentences = list(df["first_sentence"].fillna("NAN_WORD").values)
comments = process_comments(list_sentences, lower=True)
print(comments[0:5])

100%|██████████| 37040/37040 [00:08<00:00, 4289.84it/s]

[['Apple', 'Inc', 'made', 'promising', 'start', 'enhancing', 'antitrust', 'compliance', 'program', 'found', 'liable', 'last', 'year', 'conspiring', 'raise', 'e', 'book', 'prices', 'work', 'required', 'court', 'appointed', 'monitor', 'said', 'Monday'], ['April', '14', 'Apple', 'Inc', 'made', 'promising', 'start', 'enhancing', 'antitrust', 'compliance', 'program', 'found', 'liable', 'last', 'year', 'conspiring', 'raise', 'e', 'book', 'prices', 'work', 'required', 'court', 'appointed', 'monitor', 'said', 'Monday'], ['The', 'opinions', 'expressed', 'author', 'columnist', 'Reuters'], ['A', 'resounding', 'shot', 'across', 'bow', 'fired', 'tech', 'sector', 'recent', 'weeks', 'The', 'tech', 'heavy', 'Nasdaq', 'Composite', 'Index', 'nearly', '5', 'percent', 'April', 'Friday', 'close', 'Nasdaq', 'Biotechnology', 'Index', '21', 'percent', 'record', 'closing', 'high', 'February', '25', 'Many', 'sector', 'flagships', 'newcomers', 'crosshairs'], ['Apple', 'Inc', 'Tuesday', 'lost', 'attempt', 'dismis




In [240]:
print(len(set([item for sublist in comments for item in sublist])))

30046


In [241]:
import pickle

def make_lexicon(token_seq, min_freq=1):
    # First, count how often each word appears in the text.
    token_counts = {}
    for seq in token_seq:
        for token in seq:
            if token in token_counts:
                token_counts[token] += 1
            else:
                token_counts[token] = 1

    # Then, assign each word to a numerical index. Filter words that occur less than min_freq times.
    lexicon = [token for token, count in token_counts.items() if count >= min_freq]
    # Indices start at 1. 0 is reserved for padding, and 1 is reserved for unknown words.
    lexicon = {token:idx + 2 for idx,token in enumerate(lexicon)}
    lexicon[u'<UNK>'] = 1 # Unknown words are those that occur fewer than min_freq times
    lexicon_size = len(lexicon)

    print("LEXICON SAMPLE ({} total items):".format(len(lexicon)))
    print(dict(list(lexicon.items())[:20]))
    
    return lexicon

print("WORDS:")
words_lexicon = make_lexicon(comments)

WORDS:
LEXICON SAMPLE (30047 total items):
{'Apple': 2, 'Inc': 3, 'made': 4, 'promising': 5, 'start': 6, 'enhancing': 7, 'antitrust': 8, 'compliance': 9, 'program': 10, 'found': 11, 'liable': 12, 'last': 13, 'year': 14, 'conspiring': 15, 'raise': 16, 'e': 17, 'book': 18, 'prices': 19, 'work': 20, 'required': 21}


In [242]:
def tokens_to_idxs(token_seqs, lexicon):
    idx_seqs = [[lexicon[token] if token in lexicon else lexicon['<UNK>'] for token in token_seq]  
                                                                     for token_seq in token_seqs]
    return idx_seqs

token_idx = tokens_to_idxs(comments, words_lexicon)

from keras.preprocessing.sequence import pad_sequences

def pad_idx_seqs(idx_seqs, max_seq_len):
    # Keras provides a convenient padding function; 
    padded_idxs = pad_sequences(sequences=idx_seqs, maxlen=max_seq_len)
    return padded_idxs

max_seq_len = max([len(idx_seq) for idx_seq in token_idx]) # Get length of longest sequence
train_padded_words = pad_idx_seqs(token_idx, 
                                  max_seq_len + 1)

print("WORDS:\n", train_padded_words)
print("SHAPE:", train_padded_words.shape, "\n")


WORDS:
 [[   0    0    0 ...   24   25   26]
 [   0    0    0 ...   24   25   26]
 [   0    0    0 ...   32   33   34]
 ...
 [   0    0    0 ...  227  228 3864]
 [   0    0    0 ...  227  228 3864]
 [   0    0    0 ... 3645  903  191]]
SHAPE: (37040, 97) 



In [292]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test= train_test_split(train_padded_words,Y, test_size=0.3)


In [294]:
from __future__ import print_function

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM

max_features = 30048
maxlen = 97  # cut texts after this number of words (among top max_features most common words)
batch_size = 32

# Building RNN model
model_rnn = Sequential()
model_rnn.add(Embedding(max_features, 128))
model_rnn.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model_rnn.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
adam = optimizers.Adam(lr = 0.001)
model_rnn.compile(loss='binary_crossentropy',
              optimizer= adam,
              metrics=['accuracy'])

model_rnn.fit(X_train, y_train,batch_size=batch_size,epochs=100,validation_data=(X_test, y_test))
score,acc = model.evaluate(X_test, y_test)


Train on 25928 samples, validate on 11112 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100


Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 79/100
Epoch 80/100
Epoch 81/100
Epoch 82/100
Epoch 83/100
Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100


IndexError: invalid index to scalar variable.

In [295]:
print('Test score:', score)
print('Test accuracy:', acc)

Test score: 0.1574888670357134
Test accuracy: 0.9729121670266379


## Model 2: CNN

In [279]:
from keras.models import Sequential
from keras.layers import Dropout, Activation, Conv1D, MaxPooling1D, Embedding, Flatten
from keras import optimizers

num_features = 30048
sequence_length = 97
embedding_dimension = 100

def model_cnn():
    model = Sequential()
    
    # use Embedding layer to create vector representation of each word => it is fine-tuned every iteration
    model.add(Embedding(input_dim = 900000, output_dim = embedding_dimension, input_length = sequence_length))
    model.add(Conv1D(filters = 50, kernel_size = 5, strides = 1, padding = 'valid'))
    model.add(MaxPooling1D(2, padding = 'valid'))
    
    model.add(Flatten())
    
    model.add(Dense(10))
    model.add(Activation('relu'))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    
    adam = optimizers.Adam(lr = 0.001)
    
    model.compile(loss='binary_crossentropy', optimizer=adam , metrics=['accuracy'])
    
    return model

model_cnn = model_cnn()

history = model_cnn.fit(X_train, y_train, batch_size = 50, epochs = 100, validation_split = 0.2, verbose = 0)

results = model_cnn.evaluate(X_test, y_test)
print('Test accuracy: ', results[1])

Test accuracy:  0.9036177105831533


In [290]:
print('Test score: ', results[0])

Test score:  0.7152643856751553


## Model 3: RNN+CNN

In [296]:
# Embedding
maxfeatures = 30048
max_len = 97
embedding_size = 128

# Convolution
kernel_size = 5
filters = 64
pool_size = 4

# LSTM
lstm_output_size = 70

# Training
batch_size = 50
epochs = 100

model = Sequential()
model.add(Embedding(maxfeatures, embedding_size, input_length=max_len))
model.add(Dropout(0.25))
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(MaxPooling1D(pool_size=pool_size))
model.add(LSTM(lstm_output_size))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(X_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(X_test, y_test))
score, acc = model.evaluate(X_test, y_test, batch_size=batch_size)

Train...
Train on 25928 samples, validate on 11112 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100


Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 79/100
Epoch 80/100
Epoch 81/100
Epoch 82/100
Epoch 83/100
Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100
Test score: 0.6342775312188352
Test accuracy: 0.9146868214708511


In [297]:
print('Test score:', score)
print('Test accuracy:', acc)

Test score: 0.6342775312188352
Test accuracy: 0.9146868214708511
