# Assignment 4 - Using NLP to play the stock market

In this assignment, we'll use everything we've learned to analyze corporate news and pick stocks. Be aware that the benchmark we're trying to beat is random chance (i.e., accuracy better than 50%).

This assignment will involve building three models:

**1. An RNN based on word inputs**

**2. A CNN based on character inputs**

**3. A neural net architecture that merges the previous two models**

You will apply these models to predict whether a stock's return will be positive or negative on the same day as a news publication.

## Your X - Reuters news data

Reuters is a news outlet that reports on corporations, among many other things. The `news_reuters.csv` file stores the news data in six columns: `ticker`, `name of company`, `date of publication`, `headline`, `first sentence`, and `news category`.

In this assignment, it is up to you to decide how to clean this dataset. For instance, many of the first sentences contain a location name showing where the reporting was done. This is largely irrelevant information and will probably just make your data noisier. You can also choose to subset on a certain news category, which might enhance your model performance and also limit the size of your data.

## Your Y - Stock information from Yahoo! Finance

Trading data from Yahoo! Finance was collected and then normalized using the [S&P 500](https://en.wikipedia.org/wiki/S%26P_500_Index). This is stored in the `stockReturns.json` file. 

In our dataset, the ticker for the S&P is `^GSPC`. Each ticker is compared to the S&P and then judged on whether it is outperforming (positive value) or underperforming (negative value) the S&P. Each value is reported on a daily interval from 2004 to now.

Below is a diagram of the data in the JSON file. Note there are three types of data: short (1-day return), mid (7-day return), and long (28-day return).

```
          term (short/mid/long)
         /         |         \
   ticker A   ticker B   ticker C
      /   \      /   \      /   \
  date1 date2 date1 date2 date1 date2
```

You will need to pick a length of time to focus on (day, week, month). You are welcome to train models on each dataset as well.  
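
The nested layout in the diagram can be flattened into one row per ticker and date before any labeling. The following is a minimal sketch on a toy dictionary: the helper name `flatten_returns`, the `ret` column name, and the `YYYYMMDD` string date keys are assumptions for illustration, not facts about `stockReturns.json`.

```python
import pandas as pd

# Toy stand-in for the JSON file; tickers, dates, and values are made up.
nested_returns = {
    'short': {'AA': {'20110708': -0.004, '20110711': 0.010},
              'AAPL': {'20110708': 0.012}},
    'mid': {'AA': {'20110708': 0.020}},
}


def flatten_returns(nested, term):
    """Return one row per (ticker, date) for the chosen term (hypothetical helper)."""
    rows = [
        {'ticker': ticker, 'pub_date': date, 'ret': value}
        for ticker, by_date in nested[term].items()
        for date, value in by_date.items()
    ]
    return pd.DataFrame(rows, columns=['ticker', 'pub_date', 'ret'])


short_returns = flatten_returns(nested_returns, 'short')
print(short_returns)
```

Picking a different term only changes the `term` argument, so the same helper could produce the weekly or monthly table as well.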

Transform the return data such that the outcome will be binary:

```
label[y < 0] = 0
label[y >= 0] = 1
```

Finally, this data needs to be joined on the date and ticker: for each date of news publication, we want to join the corresponding corporation's news to its return information. We make the assumption that the day's return will reflect the sentiment of the news, regardless of timing.
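
A minimal sketch of the labeling and join on toy data follows. It assumes the return dates arrive as strings while the news dates are integers, which may or may not match the real files; the frame names and the `ret` column are hypothetical. The `indicator=True` option of `DataFrame.merge` makes it easy to count news rows that found no return.

```python
import numpy as np
import pandas as pd

# Toy news and return tables; all values are made up for illustration.
news = pd.DataFrame({
    'ticker': ['AA', 'AA', 'AAPL'],
    'pub_date': [20110708, 20110711, 20110708],
    'headline': ['headline one', 'headline two', 'headline three'],
})
returns = pd.DataFrame({
    'ticker': ['AA', 'AAPL'],
    'pub_date': ['20110708', '20110708'],  # assumed string dates
    'ret': [-0.004, 0.012],
})

# Align the key dtypes, otherwise the merge cannot match any rows.
returns['pub_date'] = returns['pub_date'].astype(int)

# Binary outcome: 1 for a non-negative return, 0 otherwise.
returns['y'] = np.where(returns['ret'] >= 0, 1, 0)

joined = news.merge(returns, on=['ticker', 'pub_date'], how='left', indicator=True)
unmatched = int((joined['_merge'] == 'left_only').sum())

# Keep only the news rows that received a label.
labelled = (joined.loc[joined['_merge'] == 'both']
                  .drop(columns='_merge')
                  .assign(y=lambda d: d['y'].astype(int)))
print(unmatched, labelled[['ticker', 'pub_date', 'y']].values.tolist())
```

Dropping unmatched rows is one possible choice; they could also be inspected first to see whether they fall on non-trading days.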


## Your models - RNN, CNN, and RNN+CNN

Your RNN model needs to be based on word inputs: embed the words, encode them with an RNN layer, and finish with a decoding step (such as softmax or some other choice).

Your CNN model will be based on characters. For reference on how to do this, look at the CNN class demonstration in the course repository.
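
Before a character CNN can be trained, each text has to become a fixed-length sequence of character indices. The sketch below shows one way to do that with NumPy; the helper names and the convention of 0 for padding and 1 for unseen characters are assumptions chosen to mirror the word lexicon used later in this notebook, not the method from the class demonstration.

```python
import numpy as np


def build_char_index(texts):
    """Map each character seen in texts to an index starting at 2 (hypothetical helper)."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i + 2 for i, ch in enumerate(chars)}  # 0 = padding, 1 = unknown


def encode_chars(texts, char_index, max_len):
    """Encode texts as a (n_texts, max_len) integer array, truncating and right-padding with 0."""
    out = np.zeros((len(texts), max_len), dtype=np.int32)
    for row, text in enumerate(texts):
        ids = [char_index.get(ch, 1) for ch in text[:max_len]]
        out[row, :len(ids)] = ids
    return out


char_index = build_char_index(['ab', 'abc'])
encoded = encode_chars(['ba', 'cabz'], char_index, max_len=4)
print(encoded)
```

The resulting array can feed an embedding layer or be one-hot encoded, depending on how the CNN is set up.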

Finally, you will combine the architectures of both models, either [merging](https://github.com/ShadyF/cnn-rnn-classifier) them using the [Functional API](https://keras.io/getting-started/functional-api-guide/) or [stacking](http://www.aclweb.org/anthology/S17-2134) them. See the links for reference.
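
As a rough illustration of the merging option, the sketch below builds a two-input model with the Keras Functional API: a word branch (embedding plus LSTM) and a character branch (embedding plus `Conv1D`), concatenated before a sigmoid output. The function name, layer sizes, and input lengths are arbitrary assumptions, and this is only one of many reasonable layouts.

```python
from keras.layers import (Input, Embedding, LSTM, Conv1D,
                          GlobalMaxPooling1D, Dense, concatenate)
from keras.models import Model


def build_merged_model(vocab_size, n_chars, max_words, max_chars):
    """Hypothetical two-branch classifier; all sizes are illustrative."""
    # Word branch: indices -> embeddings -> LSTM encoding
    word_in = Input(shape=(max_words,), dtype='int32', name='words')
    w = Embedding(vocab_size, 32, mask_zero=True)(word_in)
    w = LSTM(16)(w)

    # Character branch: indices -> embeddings -> 1D convolution -> max pooling
    char_in = Input(shape=(max_chars,), dtype='int32', name='chars')
    c = Embedding(n_chars, 8)(char_in)
    c = Conv1D(filters=16, kernel_size=3, activation='relu')(c)
    c = GlobalMaxPooling1D()(c)

    # Merge the two encodings and predict the probability of a positive return
    merged_vec = concatenate([w, c])
    out = Dense(1, activation='sigmoid')(merged_vec)

    model = Model(inputs=[word_in, char_in], outputs=out)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model


model = build_merged_model(vocab_size=1000, n_chars=60, max_words=20, max_chars=100)
model.summary()
```

Calling `model.summary()` also covers the requirement to print the architecture in the notebook.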

For each of these models, you will need to:
1. Create a train and test set, retaining the same test set for every model
2. Show the architecture for each model, printing it in your Python notebook
3. Report the performance according to some metric
4. Compare the performance of all of these models in a table (precision and recall)
5. Look at your labeling and print out the underlying data compared to the labels. For each model, print out 2-3 examples of a good classification and a bad classification, and explain why your model does well or poorly on those outputs.
6. For each model, calculate the return from the three most probable positive stock returns. Compare it to the actual return. Print this information in a table.
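
The comparison table and the top-three return can be sketched as follows. The model names, probabilities, labels, and returns are made-up toy values, and averaging the realized returns of the three highest-probability predictions is only one interpretation of the last requirement.

```python
import numpy as np
import pandas as pd


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def top_k_mean_return(probs, returns, k=3):
    """Mean realized return of the k examples with the highest predicted probability."""
    order = np.argsort(probs)[::-1][:k]
    return float(np.mean(np.asarray(returns)[order]))


# Toy test set shared by every model; all numbers are invented.
y_true = np.array([1, 0, 1, 1, 0])
returns = np.array([0.02, -0.01, 0.03, 0.01, -0.02])
model_probs = {
    'rnn': np.array([0.9, 0.6, 0.7, 0.2, 0.1]),
    'cnn': np.array([0.4, 0.3, 0.8, 0.6, 0.7]),
}

rows = []
for name, probs in model_probs.items():
    precision, recall = precision_recall(y_true, (probs >= 0.5).astype(int))
    rows.append({'model': name, 'precision': precision, 'recall': recall,
                 'top3_mean_return': top_k_mean_return(probs, returns)})

comparison = pd.DataFrame(rows).set_index('model')
print(comparison)
```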

### Good luck!

## Load and Clean Raw Data

In [191]:
# Utility libraries
import os
import pickle
import numpy as np
import pandas as pd
import re
import calendar

# Preprocessing libraries
from sklearn.model_selection import train_test_split
import gensim

from keras.models import Model
from keras.layers import Input, concatenate, Concatenate, TimeDistributed
from keras.layers import Dense, Bidirectional, Dropout, Conv1D, Conv2D
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, GRU
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.text import Tokenizer, text_to_word_sequence

In [2]:
dataPath = '../data'
reutersFile = 'news_reuters.csv'
stockFile = 'stockReturns.json'

rawX = pd.read_csv(os.path.join(dataPath, reutersFile), header=None, 
                   names=['ticker', 'company', 'pub_date', 'headline', 'first_sent', 'category'])
rawY = pd.read_json(os.path.join(dataPath, stockFile))
# rawY = json.load(os.path.join(dataPath, stockFile))

In [3]:
rawX.loc[rawX.pub_date == 20110708]

| | ticker | company | pub_date | headline | first_sent | category |
|---|---|---|---|---|---|---|
| 1 | AA | Alcoa Corporation | 20110708 | Global markets weekahead: Lacking conviction | LONDON Investors are unlikely to gain strong c... | normal |
| 2 | AA | Alcoa Corporation | 20110708 | Jobs halt Wall Street rally investors eye ear... | NEW YORK Stocks fell on Friday as a weak jobs ... | topStory |
| 3 | AA | Alcoa Corporation | 20110708 | REFILE-TABLE-Australia's top carbon polluters | CANBERRA July 8 Following is a list of Austr... | normal |
| 4 | AA | Alcoa Corporation | 20110708 | US STOCKS-Jobs data hits stocks but earnings ... | * Google slumps on downgrade one of Nasdaq's... | normal |
| 5 | AA | Alcoa Corporation | 20110708 | US STOCKS-Jobs halt Wall St rally investors e... | * Dow off 0.5 pct S&P down 0.7 pct Nasdaq o... | normal |
| 6 | AA | Alcoa Corporation | 20110708 | Wall St Week Ahead: Recipe for a rally? Beat l... | NEW YORK July 8 Wall Street heads into earni... | normal |
| 7 | AA | Alcoa Corporation | 20110708 | Wall St Week Ahead: Recipe for a rally? Beat l... | NEW YORK Wall Street heads into earnings seaso... | normal |
| 1106 | AAPL | Apple Inc | 20110708 | Analysis: Young startups demand steeper prices... | SAN FRANCISCO A year ago Mike Maples's invest... | topStory |
| 1107 | AAPL | Apple Inc | 20110708 | CORRECTED-UPDATE 1-TSMC UMC post lower June s... | (Corrects headline lead paragraph to show TS... | normal |
| 1108 | AAPL | Apple Inc | 20110708 | PRESS DIGEST - China | BEIJING/SHANGHAI July 8 Chinese newspapers a... | normal |


In [4]:
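def reformat_y_data(data, tickerType='mid'):
    """Reshape the returns for one term into long format and add a binary label"""
    # Expand each ticker's {date: return} mapping into one column per date
    tmp = data[tickerType].apply(pd.Series)
    # One row per (ticker, date); the 'price' column holds the return relative to the S&P
    tmp = tmp.stack().rename('price').reset_index()
    tmp['y'] = np.where(tmp['price'] >= 0, 1, 0)
    tmp.rename(columns={'level_0': 'ticker', 'level_1': 'pub_date'}, inplace=True)
    return tmp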

In [5]:
cleanY = reformat_y_data(rawY, 'mid')

In [6]:
def clean_and_merge_data(X, Y):
    """Keep only news for tickers that have stock data, then left-join the returns and labels"""
    y_tickers = set(Y['ticker'])
    X = X.loc[X['ticker'].isin(y_tickers)]
    return X.merge(Y, on=['ticker', 'pub_date'], how='left')


In [110]:
merged = clean_and_merge_data(rawX, cleanY)

#### Clean up text columns

In [125]:
def clean_text(sent):
    """Clean up text data by:
    
    1. Collapsing repeated spaces into a single space
    2. Replacing U.S. with United States so the U won't get deleted 
       by the next replacement
    3. Removing all capitalized words at the beginning of the 
       sentence, since those are mostly places (e.g. NEW YORK)
    4. Removing unnecessary leading punctuation (hyphens and asterisks)
    5. Removing dates
    """
    monthStrings = list(calendar.month_name)[1:] + list(calendar.month_abbr)[1:]
    monthPattern = '|'.join(monthStrings)
    
    sent = re.sub(r' +', ' ', sent)
    # Escape the dots so only the literal "U.S." is replaced
    sent = re.sub(r'U\.S\.', 'United States', sent)
    sent = re.sub(r'^(\W?[A-Z\s\d]+\b-?)', '', sent)
    sent = re.sub(r'^ ?\W ', '', sent)
    sent = re.sub(r'({}) \d+'.format(monthPattern), '', sent)
    
    # replace double spaces one more time after previous cleaning 
    sent = re.sub(r' +', ' ', sent)
    return sent 
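
To see what the dateline and date patterns actually remove, the following self-contained sketch applies steps 3 and 5 to a few first sentences modeled on the rows shown earlier. The helper name `strip_dateline` is hypothetical, and the sketch adds a leading `\b` to the month pattern and a final `strip()`, neither of which is in the cell above.

```python
import calendar
import re

MONTHS = list(calendar.month_name)[1:] + list(calendar.month_abbr)[1:]
# Leading run of capitals, digits, and spaces, e.g. "NEW YORK "
DATELINE = re.compile(r'^(\W?[A-Z\s\d]+\b-?)')
# A month name followed by a day number, e.g. "July 8"
MONTH_DAY = re.compile(r'\b(?:{}) \d+'.format('|'.join(MONTHS)))


def strip_dateline(sent):
    """Remove a leading all-caps dateline and any 'Month day' dates (hypothetical helper)."""
    sent = DATELINE.sub('', sent)
    sent = MONTH_DAY.sub('', sent)
    return re.sub(r' +', ' ', sent).strip()


examples = [
    'NEW YORK Stocks fell on Friday as a weak jobs report',
    'CANBERRA July 8 Following is a list',
    'SAN FRANCISCO A year ago',
]
for text in examples:
    print(repr(strip_dateline(text)))
```

The third example is worth a look: because the dateline pattern accepts any run of capitals and spaces, a sentence that starts with a one-letter capitalized word such as "A" loses that word as well. Whether that matters for the models is a judgment call.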

In [186]:
class lexiconTransformer():
    """Create a lexicon and transform tokenized sentences
       to indexes for use in the model."""
    
    def __init__(self, words_min_freq = 1, unknown_word_token = u'<UNK>',
                 savePath='models', saveName='stock_word_lexicon'):
        self.words_min_freq = words_min_freq
        self.words_lexicon = None
        self.unknown_word_token = unknown_word_token
        self.indx_to_words_dict = None
        self.savePath = savePath
        self.saveName = saveName + '.pkl'
    
    def fit(self, sents):
        """Create lexicon based on sentences"""
        self.make_words_lexicon(sents)        
        self.make_lexicon_reverse()
        self.save_lexicon()
                
    def transform(self, sents):
        sents_indxs = self.tokens_to_idxs(sents, self.words_lexicon)
        return sents_indxs

    def fit_transform(self, sents):
        self.fit(sents)
        return self.transform(sents)
        
    def make_words_lexicon(self, sents_token):
        """Wrapper for words lexicon"""
        self.words_lexicon = self.make_lexicon(sents_token, self.words_min_freq,
                                               self.unknown_word_token)

    def make_lexicon(self, token_seqs, min_freq=1, unknown = u'<UNK>'):
        """Create lexicon from input based on a frequency

            Parameters:
            
            token_seqs
            ----------
               A list of a list of input tokens that will be used to create the lexicon
            
            min_freq
            --------
               Number of times the token needs to be in the corpus to be included in the
               lexicon.  Otherwise, will be replaced with the "unknown" entry
            
            unknown
            -------
               The word in the lexicon that should be used for tokens not existing in lexicon.
               This can be a value that already exists in input list.  For instance, in 
               Named Entity Recognition, a value of "other" or "O" may already be a tag 
               and so having "other" and "unknown" are the same thing!
        """
        # Count how often each word appears in the text.
        token_counts = {}
        for seq in token_seqs:
            for token in seq:
                if token in token_counts:
                    token_counts[token] += 1
                else:
                    token_counts[token] = 1

        # Then, assign each word to a numerical index. 
        # Filter words that occur less than min_freq times.
        lexicon = [token for token, count in token_counts.items() if count >= min_freq]
        
        # Have to delete unknown value from token list so not a gap in lexicon values when
        # turning it into a lexicon (aka, if unknown == OTHER and that is the 7th value, 
        # then 7 won't exist in the lexicon which may cause issues)
        if unknown in lexicon:
            lexicon.remove(unknown)

        # Real tokens start at index 2: 0 is reserved for padding and 1 for unknown words.
        lexicon = {token:idx + 2 for idx,token in enumerate(lexicon)}
        
        lexicon[unknown] = 1 # Unknown words are those that occur fewer than min_freq times
        return lexicon
    
    def save_lexicon(self):
        "Save lexicons by pickling them"
        if not os.path.exists(self.savePath):
            os.makedirs(self.savePath)
        with open(os.path.join(self.savePath, self.saveName), 'wb') as f:
            pickle.dump(self.words_lexicon, f)
                        
    def load_lexicon(self):
        with open(os.path.join(self.savePath, self.saveName), 'rb') as f:
            self.words_lexicon = pickle.load(f)
                    
        self.make_lexicon_reverse()
        
    def make_lexicon_reverse(self):
        self.indx_to_words_dict = self.get_lexicon_lookup(self.words_lexicon)
    
    def get_lexicon_lookup(self, lexicon):
        '''Make a dictionary where the string representation of 
           a lexicon item can be retrieved from its numerical index'''
        lexicon_lookup = {idx: lexicon_item for lexicon_item, idx in lexicon.items()}
        return lexicon_lookup
    
    def tokens_to_idxs(self, token_seqs, lexicon):
        """Transform tokens to numeric indexes, using the unknown-word index for tokens not in the lexicon"""
        # Look up the configured unknown token instead of hard-coding '<UNK>'
        unknown_idx = lexicon[self.unknown_word_token]
        idx_seqs = [[lexicon.get(token, unknown_idx) for token in token_seq]
                    for token_seq in token_seqs]
        return idx_seqs
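
The core of the class, stripped of saving and loading, fits in two small functions. This is a compact sketch of the same idea using `collections.Counter`; the function names are hypothetical and the toy sentences are made up.

```python
from collections import Counter


def build_lexicon(token_seqs, min_freq=1, unknown='<UNK>'):
    """Map tokens seen at least min_freq times to indices; 0 = padding, 1 = unknown."""
    counts = Counter(tok for seq in token_seqs for tok in seq)
    kept = [tok for tok, n in counts.items() if n >= min_freq and tok != unknown]
    lexicon = {tok: idx + 2 for idx, tok in enumerate(kept)}
    lexicon[unknown] = 1
    return lexicon


def to_indices(token_seqs, lexicon, unknown='<UNK>'):
    """Replace each token by its index, falling back to the unknown index."""
    unk = lexicon[unknown]
    return [[lexicon.get(tok, unk) for tok in seq] for seq in token_seqs]


sents = [['apple', 'shares', 'rise'], ['apple', 'shares', 'fall'], ['apple', 'sues']]
toy_lexicon = build_lexicon(sents, min_freq=2)
print(toy_lexicon)
print(to_indices(sents, toy_lexicon))
```

With `min_freq=2`, only `apple` and `shares` keep their own index and every other token maps to 1.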

In [179]:
def tokenize_sent(col):
    """Tokenize string into a sequence of words"""
    return [text_to_word_sequence(text, lower=False) for text in col]


def create_lexicon(df):
    """Create a lexicon using both headline and first_sentence"""
    pass

In [130]:
merged['headline'] = merged.headline.apply(clean_text)
merged['first_sent'] = merged.first_sent.apply(clean_text)

In [181]:
merged['headline_token'] = tokenize_sent(merged.headline)
merged['first_sent_token'] = tokenize_sent(merged.first_sent)

In [192]:
lexicon = lexiconTransformer(words_min_freq=5)

In [193]:
# Series.append was removed in pandas 2.0; pd.concat works across versions
lexicon.fit(pd.concat([merged['headline_token'], merged['first_sent_token']]))

## Model 1: RNN

In [131]:
w2v = gensim.models.KeyedVectors.load_word2vec_format('../data/GoogleNews-vectors-negative300.bin.gz', binary=True)
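
One common use of pretrained vectors is to initialize the embedding layer with a matrix whose rows line up with the lexicon indices. The sketch below assumes the lexicon convention used above (0 for padding, 1 for unknown) and uses a plain dictionary as a stand-in for the loaded vectors, since gensim `KeyedVectors` supports the same `token in vectors` and `vectors[token]` lookups. The function name is hypothetical.

```python
import numpy as np


def build_embedding_matrix(lexicon, vectors, dim):
    """Return (matrix, n_found); rows without a pretrained vector stay zero."""
    # One row per index: len(lexicon) entries plus row 0 for padding
    matrix = np.zeros((len(lexicon) + 1, dim), dtype=np.float32)
    n_found = 0
    for token, idx in lexicon.items():
        if token in vectors:
            matrix[idx] = vectors[token]
            n_found += 1
    return matrix, n_found


# Toy lexicon and 4-dimensional vectors, made up for illustration.
toy_lexicon = {'apple': 2, 'shares': 3, '<UNK>': 1}
toy_vectors = {'apple': np.ones(4, dtype=np.float32)}

embedding_matrix, n_found = build_embedding_matrix(toy_lexicon, toy_vectors, dim=4)
print(embedding_matrix.shape, n_found)
```

Checking `n_found` against the lexicon size gives a quick sense of how much of the vocabulary the pretrained vectors cover.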

In [159]:
Tokenizer?

In [161]:
from collections import Counter

In [162]:
Counter?

In [153]:
tmp = Tokenizer(lower=False, oov_token='<UNK>', )

In [178]:
text_to_word_sequence

In [165]:
tst = 

In [175]:
zz = Counter([item for sublist in tst for item in sublist])

In [174]:
len(zz)

30818

In [169]:
zz

Counter({'apple': 9298,
         'antitrust': 877,
         'compliance': 129,
         'off': 1565,
         'to': 50560,
         'a': 32759,
         'promising': 72,
         'start': 535,
         'monitor': 131,
         'how': 410,
         'avoid': 208,
         'the': 49910,
         'trouble': 66,
         'coming': 261,
         'tech': 874,
         'sector': 563,
         'cannot': 92,
         'escape': 29,
         'united': 4591,
         'states': 4462,
         "states'": 23,
         'e': 871,
         'book': 342,
         'cases': 145,
         'judge': 1398,
         'keep': 332,
         'steve': 285,
         "jobs'": 47,
         'personality': 6,
         'out': 1580,
         'of': 35804,
         'trial': 950,
         'companies': 1656,
         'smartphone': 848,
         'makers': 192,
         'carriers': 103,
         'embrace': 23,
         'anti': 253,
         'theft': 52,
         'initiative': 28,
         'wall': 2176,
         'st': 849,
         ...

In [154]:
tmp.fit_on_texts(merged.headline.append(merged.first_sent).tolist())

In [157]:
tmp.word_index

{'to': 1,
 'the': 2,
 'of': 3,
 'in': 4,
 'a': 5,
 'on': 6,
 'and': 7,
 'for': 8,
 'its': 9,
 'said': 10,
 'with': 11,
 'as': 12,
 'new': 13,
 'inc': 14,
 's': 15,
 'by': 16,
 'apple': 17,
 'it': 18,
 'from': 19,
 'at': 20,
 'u': 21,
 'that': 22,
 'barclays': 23,
 'has': 24,
 'london': 25,
 'bank': 26,
 'billion': 27,
 'is': 28,
 '1': 29,
 'company': 30,
 'after': 31,
 'million': 32,
 'up': 33,
 'will': 34,
 'an': 35,
 'york': 36,
 'united': 37,
 'co': 38,
 'shares': 39,
 'states': 40,
 'says': 41,
 'over': 42,
 '2': 43,
 'deal': 44,
 'percent': 45,
 'year': 46,
 'corp': 47,
 'oil': 48,
 'thursday': 49,
 't': 50,
 'business': 51,
 'tuesday': 52,
 'bp': 53,
 'wednesday': 54,
 'group': 55,
 'may': 56,
 'more': 57,
 'sales': 58,
 '5': 59,
 '3': 60,
 'monday': 61,
 'was': 62,
 'market': 63,
 'be': 64,
 'chief': 65,
 'friday': 66,
 'plc': 67,
 'pct': 68,
 'executive': 69,
 'financial': 70,
 'reported': 71,
 'than': 72,
 'not': 73,
 'google': 74,
 'profit': 75,
 'buy': 76,
 'sources': 77,
 ...

In [None]:
tmp.fit_on_sequences(merged.headline)

## Model 2: CNN

## Model 3: RNN+CNN