# Introduction to Recurrent Neural Networks (RNNs)

## Learning stock embeddings for portfolio optimization using bidirectional RNNs

In [124]:
#Import dependencies
!pip install gensim
import numpy as np
import pandas as pd
import re

You should consider upgrading via the '/Users/CHIRAG/opt/miniconda3/bin/python3 -m pip install --upgrade pip' command.[0m


For the purposes of this assignment, we will focus on training a classifier for 15 stocks from the S&P 500. The goal of our classifier is as follows:
We are interested in training a bidirectional RNN model that learns a relationship between news taglines related to the 15 stocks $\{m_1, \ldots, m_{15}\}$ that we have selected and the prices of those stocks. Define $p_i^{(t)}$ to be the price of stock $m_i$ on day $t$. Then, we can formally define our objective as follows:

Let $y_i^{(t)} = \begin{cases} 1 & p_i^{(t)} \geq p_i^{(t - 1)} \\ 0 & p_i^{(t)} < p_i^{(t - 1)} \end{cases}$. Suppose our dataset $D = \{N^{(t)}\}_{t_{in} \leq t \leq t_f}$, where $N^{(t)}$ is a collection of all the articles from day $t$ and $t_{in}$ and $t_f$ represent the dates of the earliest and latest articles in our dataset resepctively. Then, we want to learn a mapping $\hat y_i^{(t)} = f(N^{(t - \mu)} \cup \ldots \cup N^{(t)})$ such that $\hat y_i^{(t)}$ accurately predicts $y_i^{(t)}$. More specifically, as is often the case with classification problems, we want to minimize the loss function given by the mean cross-entropy loss for all $15$ stocks:
$$\mathcal{L} = \frac{1}{15} \sum_{i = 1}^{15} \mathcal{L}_i = \frac{1}{15} \sum_{i = 1}^{15} \left( \frac{-1}{t_f - t_{in}} \sum_{t = t_{in}}^{t_f} \big(y_i^{(t)} \log \hat y_i^{(t)} + (1 - y_i^{(t)}) \log (1 - \hat y_i^{(t)}) \right)$$
Here, we choose to use $\mu = 4$, so we aim to classify the price movement of stock $m_i$ on day $t$, given by $p_i^{(t)}$, using news information from days $[t-4, t]$, i.e., articles $\{N^{(t - 4)}, N^{(t - 3)}, N^{(t - 2)}, N^{(t - 1)}, N^{(t)}\}$. Notice that we are including information from day $t$, so we are not *predicting* the price movement but rather identifying a relationship between the stock price movement and the information contained in the news taglines from day $t$ and the previous 4 days.

## Generating word embeddings

The code below uses news tagline data from Reuters (data sourced from https://github.com/vedic-partap/Event-Driven-Stock-Prediction-using-Deep-Learning/blob/master/input/news_reuters.csv) to create word embeddings for all of the articles in our dataset using a pretrained BERT encoder and a Word2Vec model that we are training on our data (don't worry if you don't know what this means yet). Our dataset contains news articles from 2011 to 2017 so we should have enough data to build a fairly accurate classifier. You will explore algorithms for generating word embeddings in more detail later in the course but for this assignment, we have done the work for you so that you can focus on building RNN models for your stock movement classifier.

<br>

The main idea is to convert all of the qualitative textual information that we have in each article tagline into a quantitative feature that we can use when training our classifier. Let $s_i \in \mathbb{R}^{64}$ represent the stock embedding that we are trying to learn for stock $m_i$. We then define the following quantities:

Let $n_i^{(t)}$ be a news article from day $t$, for some $1 \leq i \leq |N^{(t)}|$. We associate 2 embedding vectors $K_i^{(t)} \in \mathbb{R}^{64}$ and $V_i^{(t)} \in \mathbb{R}^{256}$ with the article $n_i^{(t)}$, which we have computed for you below. We define $score(n_i^{(t)}, s_j) = K_i^{(t)} \cdot s_j$ and the softmax variable $$\alpha_i^{(t)} = \frac{\exp(score(n_i^{(t)}, s_j)}{\sum_{n_k^{(t)} \in N^{(t)}}exp(score(n_k^{(t)}, s_j))}$$

Finally, we define the market status of stock $m_j$ on day $t$, given by $m_j^{(t)} = \sum_{n_i^{(t)} \in N^{(t)}} \alpha_i^{(t)} V_i^{(t)}$. This is the input to the classifier that you will build and train on the dataset to learn the stock embeddings $\{s_j\}_{1 \leq j \leq 15}$.

In [198]:
reuters_data = pd.read_csv('news_reuters.csv').dropna() # load Reuters stock news data as Pandas dataframe

# we are only interested in the stocks that have the most news data so that our classifier can have a good corpus of
# training data to learn from
top15_tickers = list(reuters_data["Ticker"].value_counts()[:15].index)
filtered_data = reuters_data[reuters_data["Ticker"].isin(top15_tickers)].copy()
filtered_data.head()

Unnamed: 0,Ticker,Name,Date,Headline,Tagline,Rating
1074,AAPL,1-800 FLOWERSCOM Inc,20140414,Apple antitrust compliance off to a promising ...,"NEW YORK Apple Inc has made a ""promising start...",topStory
1075,AAPL,1-800 FLOWERSCOM Inc,20140414,Apple antitrust compliance off to a promising ...,"NEW YORK April 14 Apple Inc has made a ""promi...",normal
1076,AAPL,1-800 FLOWERSCOM Inc,20140414,COLUMN-How to avoid the trouble coming to the ...,(The opinions expressed here are those of the ...,normal
1077,AAPL,1-800 FLOWERSCOM Inc,20140414,How to avoid the trouble coming to the tech se...,CHICAGO A resounding shot across the bow has b...,normal
1078,AAPL,1-800 FLOWERSCOM Inc,20140415,Apple cannot escape U.S. states' e-book antitr...,NEW YORK Apple Inc on Tuesday lost an attempt ...,normal


In [199]:
# get unique words from all taglines
corpus = list(reuters_data["Tagline"])
split_corpus = [re.split("\W+", c) for c in corpus]
words = set()
occurences = {}
for c in split_corpus:
    for k in c:
        w = k.lower()
        words.add(w)
        if occurences.get(w, None):
            occurences[w] += 1
        else:
            occurences[w] = 1
words = pd.Series(list(words))

# compute inverse document frequency for each word
idfs = {}
for word in words:
    idfs[word] = np.log(len(corpus) / occurences[word])
    
# train Word2Vec model on our corpus
import gensim.models

class iter_corpus:
    """An iterator that yields sentences from the corpus. """
    def __init__(self, corpus):
        self.corpus = []
        for tag in corpus:
            sentences = re.split("\.", tag)
            for s in sentences:
                tokens = re.split("\W+", s)
                self.corpus.append(tokens)
    def __iter__(self):
        for sentence in self.corpus:
            yield sentence

sentences = iter_corpus(corpus)
model = gensim.models.Word2Vec(sentences=sentences, size=64, min_count=1)

In [None]:
for r in range(64):
    filtered_data["K{}".format(r)] = 0
for idx, row in filtered_data.iterrows():
    tag = row["Tagline"]
    k = np.zeros(64)
    norm_factor = 1
    words = re.split("\W+", tag)
    freq = {} # dictionary for frequency of each word in the tagline
    for word in words:
        if freq.get(word, None):
            freq[word] += 1
        else:
            freq[word] = 1
    for word in words:
        if (word in model.wv.vocab) and (idfs.get(word, None) != None):
            tf = np.log(1 + freq[word]) # term frequency
            idf = idfs[word] # inverse document frequency
            gamma = tf * idf # gamma = TF-IDF score
            k += gamma * model.wv[word]
            norm_factor += gamma
    k /= norm_factor
    for r in range(64):
        filtered_data.iloc[idx, filtered_data.columns.get_loc("K{}".format(r))] = k[r]

In [224]:
filtered_data

Unnamed: 0,Ticker,Name,Date,Headline,Tagline,Rating,K,K0,K1,K2,...,K54,K55,K56,K57,K58,K59,K60,K61,K62,K63
1074,AAPL,1-800 FLOWERSCOM Inc,20140414,Apple antitrust compliance off to a promising ...,"NEW YORK Apple Inc has made a ""promising start...",topStory,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1075,AAPL,1-800 FLOWERSCOM Inc,20140414,Apple antitrust compliance off to a promising ...,"NEW YORK April 14 Apple Inc has made a ""promi...",normal,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1076,AAPL,1-800 FLOWERSCOM Inc,20140414,COLUMN-How to avoid the trouble coming to the ...,(The opinions expressed here are those of the ...,normal,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1077,AAPL,1-800 FLOWERSCOM Inc,20140414,How to avoid the trouble coming to the tech se...,CHICAGO A resounding shot across the bow has b...,normal,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1078,AAPL,1-800 FLOWERSCOM Inc,20140415,Apple cannot escape U.S. states' e-book antitr...,NEW YORK Apple Inc on Tuesday lost an attempt ...,normal,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
184859,TAPR,Barclays Inverse US Treasury Composite ETN,20170209,BRIEF-Ultra Petroleum says Barclays agreed to ...,* Ultra Petroleum- on Feb 8 in connection wit...,normal,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
184860,TAPR,Barclays Inverse US Treasury Composite ETN,20170209,MOVES-Barclays Nasdaq RenCap AXA BC Partners,Feb 9 The following financial services industr...,topStory,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
184861,TAPR,Barclays Inverse US Treasury Composite ETN,20170217,Barclays Citi gave South Africa watchdog info...,JOHANNESBURG Feb 17 Barclays Plc and Citigrou...,normal,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
184862,TAPR,Barclays Inverse US Treasury Composite ETN,20170217,Barclays Citi helped South Africa with forex ...,JOHANNESBURG Barclays Plc and Citigroup appr...,topStory,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
