# Recurrent Neural Network

Some applications of deep learning involve temporal dependencies, i.e. the output depends not just on the current input but also on past inputs. RNNs are similar to feed-forward networks, but in addition they have *memory*.<br>

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/RNNs%20-%20Temporal%20Dependencies.png?raw=1" width="300" height="40%"></img>

In RNNs, the current output *y* depends not only on the current input *x*, but also on a memory element *s*, which takes past inputs into account. 

RNNs also attempt to address the need of capturing information in previous inputs by maintaining internal memory elements called *States.*<br><br>

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/RNNs-%20States.png?raw=1" width="300"></img>
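
The dependence of the output on both the current input and the stored state can be illustrated with a minimal scalar sketch. The weight names `w_x`, `w_s`, `w_y` and their values are assumptions chosen for illustration, not values from this project.

```python
import math

def rnn_step(x_t, s_prev, w_x, w_s, w_y):
    """One step of a scalar RNN: update the state, then compute the output."""
    s_t = math.tanh(w_x * x_t + w_s * s_prev)  # new state mixes input and memory
    y_t = w_y * s_t                            # output is read from the state
    return y_t, s_t

def run_rnn(inputs, w_x=0.5, w_s=0.8, w_y=1.0):
    """Feed a sequence through the cell, carrying the state forward."""
    s = 0.0  # initial state: no memory yet
    outputs = []
    for x_t in inputs:
        y_t, s = rnn_step(x_t, s, w_x, w_s, w_y)
        outputs.append(y_t)
    return outputs

# Same final input (1.0), different histories -> different final outputs.
out_a = run_rnn([0.0, 0.0, 1.0])
out_b = run_rnn([1.0, 1.0, 1.0])
print(out_a[-1], out_b[-1])
```

Both sequences end with the same input, yet the final outputs differ because the state carries information from the earlier inputs.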

## Applications of RNNs

1. Predicting the next word in a sentence, which requires looking at the *last few words instead of just the current one.*

2. Sentiment Analysis
3. Speech Recognition
4. Time Series Prediction
5. NLP
6. Gesture Recognition

## Structure of RNNs
Below are the folded and unfolded structures of an RNN - <br>

| Folded RNN                                                    | Un-folded RNN                                               |
|---------------------------------------------------------------|-------------------------------------------------------------|
| <img  src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/RNN-%20Folded%20Model.png?raw=1" width="500"></img> | <img  src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/RNNs%20-%20Unfolded.png?raw=1" width="500"></img> |



# Backpropagation Through Time (BPTT)

Let's look at timestep t=3. The gradient of the error with respect to Wx depends on the state vector S3 and its predecessors S2 and S1.<br>

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/BPTT.png?raw=1" width="600"></img><br>

Looking at the pattern above while calculating the *accumulative gradient*, we can generalize the formula for Backpropagation Through Time (BPTT) as follows - <br>

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/General%20formula%20for%20BPTT.png?raw=1" width="300"></img><br>
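
For readers who cannot see the images, one common way to write the accumulated gradient is sketched below. The symbols are assumptions: `E_N` is the error at step N, `y_N` the output, `s_i` the state at step i, and the last factor in each term is the direct dependence of `s_i` on `W_x` with the earlier state held fixed.

```latex
\frac{\partial E_3}{\partial W_x}
  = \sum_{i=1}^{3}
    \frac{\partial E_3}{\partial y_3}\,
    \frac{\partial y_3}{\partial s_i}\,
    \frac{\partial s_i}{\partial W_x}
\qquad\Longrightarrow\qquad
\frac{\partial E_N}{\partial W_x}
  = \sum_{i=1}^{N}
    \frac{\partial E_N}{\partial y_N}\,
    \frac{\partial y_N}{\partial s_i}\,
    \frac{\partial s_i}{\partial W_x}
```

Each term in the sum is the contribution that reaches `W_x` through one of the states `s_1 ... s_N`.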



# Drawbacks of RNNs

## Vanishing Gradient Problem

In RNNs, if we back-propagate through more than about 8-9 time steps, the contribution of information (the gradient) keeps decreasing geometrically over time. This is known as the *vanishing gradient problem.* This is where the **LSTM** comes into the picture.<br>

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/LSTM%20Intro.png?raw=1" width="600"></img>
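
The geometric decay can be made concrete with a small sketch. It assumes, purely for illustration, that the gradient is scaled by the same factor (0.5 here) at every time step. Real networks scale by different amounts per step.

```python
def gradient_contribution(factor, steps):
    """Size of a gradient signal after being scaled by `factor` at each of `steps` time steps."""
    return factor ** steps

# With a per-step factor below 1 the contribution shrinks geometrically.
for steps in (1, 4, 8, 16):
    print(steps, gradient_contribution(0.5, steps))
```

After 8 steps the contribution is already below half a percent of its original size, which is why distant inputs barely influence the weight update.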

## Exploding Gradient Problem

In RNNs we can also have the opposite problem, called the *exploding gradient* problem, in which the value of the gradient grows uncontrollably. A simple solution for the exploding gradient problem is **Gradient Clipping.**

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/Gradient%20Clipping.png?raw=1" width="500"></img>
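
Clipping by norm can be sketched in plain Python as below. This is an illustrative stand-in, not the PyTorch implementation. The training code in this notebook relies on `nn.utils.clip_grad_norm_` instead. The function name `clip_by_norm` and the example numbers are assumptions.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale `grads` so its L2 norm is at most `max_norm`; direction is preserved."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)          # small enough already: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)   # original norm is 50
print(clipped)
```

Gradients whose norm is already below the threshold pass through untouched, so clipping only intervenes when the gradient is unusually large.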




# Long Short Term Memory Cells (LSTM Cells)

## Basics of LSTM

A basic RNN is unable to retain the long-term memory needed to predict, for example, whether the current picture is that of a wolf or a dog. This is where the LSTM comes into the picture. The LSTM cell allows a recurrent system to learn over many time steps without the fear of losing information due to the vanishing gradient problem. It is fully differentiable, which lets us use backpropagation to update the weights. Below is a sample mathematical model of an LSTM cell - <br>

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/01.lstm_cell.png?raw=1"></img><br>


In an LSTM, we would expect the following behaviour -


| Expected Behaviour of LSTM                                                                   | Reference Diagram                                                       |
|----------------------------------------------------------------------------------------------|-------------------------------------------------------------------------|
| 1. Long Term Memory (LTM) and Short Term Memory (STM) to combine and produce correct output. | <img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/05.%20lstm_basics_1.png?raw=1" width="300px" height="230px"></img> |
| 2. LTM and STM and event should update the new LTM.                                          | <img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/06.%20lstm_basics_2.png?raw=1" width="530px" height="250px"></img>  |
| 3. LTM and STM and event should update the new STM.                                          | <img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/07.%20lstm_basics_3.png?raw=1" width="530px" height="250px"></img>          |



## How Do LSTMs Work?

| LSTM consists of 4 types of gates -  <br>1. Forget Gate<br>  2. Learn Gate<br> 3. Remember Gate<br> 4. Use Gate<br> | <img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/10.%20lstm_architecture_02.png?raw=1" width="530px" height="250px"></img> |
|-------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|

### LSTM Explained
Assume the following - 
1. LTM = Elephant
2. STM = Fish
3. Event = Wolf/Dog

| LSTM Operations                                                                                                                                                                                            | Reference Video                                      |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------|
| **LSTM places LTM, STM and Event as follows -**<br> 1. Forget Gate = LTM<br>  2. Learn Gate = STM + Event<br> 3. Remember Gate = LTM + STM + Event<br> 4. Use Gate = LTM + STM + Event<br> 5. In the end, LTM and STM are updated.<br> | <img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/Animated%20GIF-downsized_large.gif?raw=1"></img> |


## General Architecture of LSTM 

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/LSTM%20Architecture.png?raw=1" width="600"></img>




## Learn Gate
Learn gate takes into account **short-term memory and event** and then ignores a part of it and retains only a part of information.<br>
<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/11.%20learn_gate.png?raw=1" height="200px" width="500px"></img>

### Mathematically Explained
STM and Event are combined through an **activation function** (tanh), and the result is then multiplied by an **ignore factor** as follows -<br>

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/12.lean_gate_equation.png?raw=1" height="200px" width="500px"></img>

## Forget Gate
The forget gate takes the LTM and decides which parts of it to keep and which parts are useless and should be forgotten. The LTM is multiplied by a **forget factor** in order to forget the useless parts. <br>
<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/13.%20forget_gate.png?raw=1" height="200px" width="500px"></img>

## Remember Gate
Remember gate takes LTM coming from Forget gate and STM coming from Learn gate and combines them together. Mathematically, remember gate adds LTM and STM.<br><br>
<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/14.%20remember_gate.png?raw=1" height="200px" width="400px"></img> <img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/15.%20remember_gate_equation.png?raw=1" height="200px" width="450px"></img>

## Use Gate
The use gate takes what is useful from the LTM and what is useful from the STM and generates a new STM, which also serves as the output.<br><br>
<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/16.%20use_gate.png?raw=1" height="200px" width="400px"></img> <img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/17.%20use_gate_equation.png?raw=1" height="200px" width="450px"></img>
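
The four gates can be put together in a minimal scalar sketch. It assumes each gate computes a weighted sum of STM and the event, and that the use gate follows the standard LSTM form (a sigmoid filter applied to `tanh` of the new LTM). The exact equations in the diagrams above may differ. The weight dictionary `w` and its values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(ltm, stm, event, w):
    """One scalar LSTM step using the four gates described above.

    `w` maps a gate name to hypothetical (w_stm, w_event, bias) weights.
    """
    def affine(name):
        w_stm, w_event, bias = w[name]
        return w_stm * stm + w_event * event + bias

    # Learn gate: candidate information from STM and the event, scaled by an ignore factor
    learned = math.tanh(affine("learn")) * sigmoid(affine("ignore"))
    # Forget gate: keep only a fraction of the old long-term memory
    kept = ltm * sigmoid(affine("forget"))
    # Remember gate: new long-term memory is what was kept plus what was learned
    new_ltm = kept + learned
    # Use gate: new short-term memory (the output) is a filtered view of the new LTM
    new_stm = math.tanh(new_ltm) * sigmoid(affine("use"))
    return new_ltm, new_stm

weights = {name: (0.5, 0.5, 0.0) for name in ("learn", "ignore", "forget", "use")}
new_ltm, new_stm = lstm_step(ltm=1.0, stm=0.0, event=0.0, w=weights)
print(new_ltm, new_stm)
```

With an empty STM and event, nothing is learned and the forget factor is 0.5, so half of the old LTM survives into the new LTM.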







# RNNs and LSTM for Text Generation


## Drawbacks of one-hot encoding 

Consider an excerpt from a book, which contains a large vocabulary. To feed these words into an RNN we can one-hot encode them, but each word then becomes a giant vector that is all zeros except for a single entry, as shown below:<br>

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/One%20hot%20encoded%20vectors.png?raw=1" width="500"></img>

We then pass this one-hot encoded vector into the hidden layer of the RNN. The result is a huge matrix multiplication in which most of the values are zeros because of the initial one-hot encoding, which is really *computationally inefficient*.<br>

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/Computationally%20in-efficient.png?raw=1" width="500"></img>

This is where *Embeddings* come into picture.


## Word Embeddings

Word embedding is a general technique for reducing the dimensionality of text data, and embedding models can also learn some interesting traits about the words in a vocabulary.<br>

Embeddings can improve the ability of neural networks to learn from text data by representing them as *lower dimensional vectors.*

The idea is that multiplying a one-hot encoded vector by a weight matrix returns only the row of the matrix that corresponds to the 1, i.e. the "on" input unit.<br><br>

Hence, instead of doing the matrix multiplication, we use the weight matrix as a look-up table, and instead of representing words as one-hot vectors, we encode each word as a unique integer.
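
The equivalence between the one-hot multiplication and a row look-up can be checked with a small sketch. The 4-word vocabulary, the 3-dimensional embedding values and the helper names are made up for illustration.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def vec_mat_mul(vec, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix))) for c in range(n_cols)]

# Hypothetical 4-word vocabulary with 3-dimensional embeddings
embedding = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
word_id = 2
via_matmul = vec_mat_mul(one_hot(word_id, len(embedding)), embedding)
via_lookup = embedding[word_id]   # same row, without any multiplication
print(via_matmul == via_lookup)
```

The look-up touches one row, while the multiplication visits every entry of the matrix only to multiply most of them by zero.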


<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/Embedding%20Lookup.png?raw=1" width="500"></img>




# Look-up Tables

In the example above, the word "heart" is encoded as the integer 958, so we can look up its embedding vector in the 958th row of the embedding weight matrix. This is called a *look-up table*.

## Dimensions of Look-up table

If we have a vocabulary of 10k words, then the embedding weight matrix has 10k rows. The width of the table is called the *embedding dimension*.


<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/embedding_lookup_table.png?raw=1" width="500"></img>




# Word2Vec Models

The Word2Vec model provides much more efficient representations by finding vectors that represent words.<br>

There are 2 architectures for implementing Word2Vec -
1. CBOW (Continuous Bag Of Words)
2. Skip-gram



<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/word2vec_architectures.png?raw=1" width="500"></img>
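
In the skip-gram architecture, each word is used to predict the words around it. The sketch below only shows how (center, context) training pairs could be built from a token list. The function name, the fixed `window` size and the sample sentence are assumptions, and details of real Word2Vec training such as negative sampling are left out.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:                      # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["stocks", "rose", "on", "friday"], window=1)
print(pairs)
```

CBOW reverses the direction: the context words are the input and the center word is the prediction target.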


We have implemented *Talking Points* using the *Skip-gram* model.

# Load Data

## Description:

We have gathered the data for training our model from Kaggle's dataset [US Financial News Articles](https://www.kaggle.com/jeet2016/us-financial-news-articles?)


### Context

The data set is rich with metadata, containing the source of the article, the time it was published, the author details and images related to every article. 
Excellent for text analysis and combined with any other related entity dataset, it could give some astounding results.

### Content

The main zip file contains 5 folders, one for each month of 2018 from January to May.

JSON files have articles based on the following criteria:

News publishers: Bloomberg.com, CNBC.com, reuters.com, wsj.com, fortune.com<br>
Language: Only English<br>
Country: United States<br>
News Category: Financial news only<br>

The source for the articles is the Webhose.io archive (https://Webhose.io/archive).


In [None]:
# TODO :: Add !wget commandfile

'https://drive.google.com/file/d/1CJwas1hzNf9cY6_kZ4WnG2BhoOFxQsAc/view?usp=sharing'
#os.chdir('49948-90823-bundle-archive')

In [None]:
!pip install PyDrive

In [None]:
import os
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [None]:
import os
os.chdir('2018_01_112b52537b67659ad3609a234388c50a')  # chdir returns None, so there is nothing to print
print(os.getcwd())

In [None]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
download = drive.CreateFile({'id': '1CJwas1hzNf9cY6_kZ4WnG2BhoOFxQsAc'})
download.GetContentFile('stock_news.zip')

In [None]:
!unzip 'stock_news.zip'

In [None]:

os.chdir('./49948-90823-bundle-archive/2018_04_112b52537b67659ad3609a234388c50a')
os.getcwd()

'/content/49948-90823-bundle-archive/2018_04_112b52537b67659ad3609a234388c50a'

## Combining Stock News

Here, we append the stock news from our dataset into one common txt file.

In [None]:
import json
import os
import glob
import pprint
keywordList = []
news_feed = []
print(os.getcwd())
path = os.getcwd()
for filename in glob.glob(os.path.join(path, '*.json')): #only process .JSON files in folder. 
    print(filename)     
    with open(filename) as currentFile:
        data=currentFile.read().replace('\n', '')
        json_data = json.loads(data)
        title =json_data["title"]
        news = json_data["text"]
        news_feed.append(title)
        news_feed.append(news)

# with open('Untitled document.txt', 'r') as f:
#   data = f.readlines()
#   news_feed.append(data)

In [None]:
print(len(news_feed))

126490


In [None]:
print(news_feed[:10])



In [None]:
# write all titles and articles to a single text file
with open('stock_twitter_news.txt', 'w') as f:
    f.write('.'.join(news_feed))

# Pre-processing Stock News 

The following section pre-processes our text file so that -
1. Punctuation marks are converted into tokens, so a period is changed to a bracketed period token (`<PERIOD>`).
2. The text is lowercased and split on whitespace, which returns a list of words.
3. Each word is mapped to an integer using the lookup tables built by `create_lookup_tables`.
4. The integer-encoded text and the lookup tables are saved to `preprocess.p`.
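
The punctuation-to-token step can be tried in isolation with the sketch below. It mirrors the replace, lowercase and split sequence used in `preprocess_and_save_data`. The `tokenize` helper and the two-entry token table are hypothetical and only serve the example.

```python
def tokenize(text, token_dict):
    """Replace punctuation with word-like tokens, lowercase, and split on whitespace."""
    for symbol, token in token_dict.items():
        # pad the token with spaces so it becomes a separate word
        text = text.replace(symbol, " {} ".format(token))
    return text.lower().split()

# Hypothetical two-entry token table, for illustration only
tokens = tokenize("Stocks rose. Bonds fell, too.", {".": "<PERIOD>", ",": "<COMMA>"})
print(tokens)
```

Because the text is lowercased after the replacement, the tokens end up in lowercase as well, which is why the generation code later searches for `token.lower()`.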

In [None]:
import os
import pickle
import torch


SPECIAL_WORDS = {'PADDING': '<PAD>'}


def load_data(path):
    """
    Load Dataset from File
    """
    input_file = os.path.join(path)
    with open(input_file, "r") as f:
        data = f.read()

    return data


def preprocess_and_save_data(dataset_path, token_lookup, create_lookup_tables):
    """
    Preprocess Text Data
    """
    text = load_data(dataset_path)
    
    # Ignore notice, since we don't use it for analysing the data
    text = text[81:]

    token_dict = token_lookup()
    for key, token in token_dict.items():
        text = text.replace(key, ' {} '.format(token))

    text = text.lower()
    text = text.split()

    vocab_to_int, int_to_vocab = create_lookup_tables(text + list(SPECIAL_WORDS.values()))
    int_text = [vocab_to_int[word] for word in text]
    pickle.dump((int_text, vocab_to_int, int_to_vocab, token_dict), open('preprocess.p', 'wb'))


def load_preprocess():
    """
    Load the preprocessed data as a tuple (int_text, vocab_to_int, int_to_vocab, token_dict)
    """
    return pickle.load(open('preprocess.p', mode='rb'))


def save_model(filename, decoder):
    save_filename = os.path.splitext(os.path.basename(filename))[0] + '.pt'
    torch.save(decoder, save_filename)


def load_model(filename):
    save_filename = os.path.splitext(os.path.basename(filename))[0] + '.pt'
    return torch.load(save_filename)


In [None]:
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
# load in data

data_dir = 'stock_twitter_news.txt'
text = load_data(data_dir)

# Statistics of our dataset

Here we are printing some statistics of our dataset such as number of unique words, number of lines and average number of words in each line.

In [None]:
view_line_range = (0, 10)

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
import numpy as np

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len({word: None for word in text.split()})))

lines = text.split('\n')
print('Number of lines: {}'.format(len(lines)))
word_count_line = [len(line.split()) for line in lines]
print('Average number of words in each line: {}'.format(np.average(word_count_line)))

print()
print('The lines {} to {}:'.format(*view_line_range))
print('\n'.join(text.split('\n')[view_line_range[0]:view_line_range[1]]))

Dataset Stats
Roughly the number of unique words: 785718
Number of lines: 900003
Average number of words in each line: 26.306250090277477

The lines 0 to 10:
At war with Alibaba: Top brands fight China e-commerce giant.It was looking like a banner year for business in China . The U.S. clothing company was expecting a 20 percent jump in online sales on Alibaba's Tmall, thanks to the e-commerce giant's massive reach.
But executives soon learned that what Alibaba gives, it can also take away.
The company refused to sign an exclusive contract with Alibaba, and instead participated in a big sale promotion with its archrival, JD.com . Tmall punished them by taking steps to cut traffic to their storefront, two executives told The Associated Press.
They said advertising banners vanished from prominent spots in Tmall sales showrooms, the company was blocked from special sales and products stopped appearing in top search results.
The well-known American brand saw its Tmall sales plummet 10 to 20

# Vocab2int & Int2vocab

Here we are creating 2 dictionaries to convert words to integers (`vocab_to_int`) and integers to vocab (`int_to_vocab`). The integers are assigned in descending order of the frequency, so the most frequent word, "the",  is given the integer "0" and the next most frequent word is given "1" and so on.

In [None]:
from collections import Counter

def create_lookup_tables(text):
    """
    Create lookup tables for vocabulary
    :param text: The news text split into words
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    # TODO: Implement Function
    word_count = Counter(text)
    sorted_vocab = sorted(word_count, key = word_count.get, reverse=True)
    int_to_vocab = {ii:word for ii, word in enumerate(sorted_vocab)}
    vocab_to_int = {word:ii for ii, word in int_to_vocab.items()}
    
    # return tuple
    return (vocab_to_int, int_to_vocab)



In [None]:
def token_lookup():
    """
    Generate a dict to turn punctuation into a token.
    :return: Tokenized dictionary where the key is the punctuation and the value is the token
    """
    # TODO: Implement Function
    # every punctuation mark gets its own bracketed token
    token = dict()
    token['.'] = '<PERIOD>'
    token[','] = '<COMMA>'
    token['"'] = '<QUOTATION_MARK>'
    token[';'] = '<SEMICOLON>'
    token['!'] = '<EXCLAMATION_MARK>'
    token['?'] = '<QUESTION_MARK>'
    token['('] = '<LEFT_PAREN>'
    token[')'] = '<RIGHT_PAREN>'
    token['-'] = '<DASH>'  # previously mapped to QUESTION_MARK, which collided with '?'
    token['\n'] = '<NEW_LINE>'
    return token


In [None]:
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
# pre-process training data
preprocess_and_save_data(data_dir, token_lookup, create_lookup_tables)

In [None]:
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
# load_preprocess() is defined in a cell above, so no separate helper module is imported

int_text, vocab_to_int, int_to_vocab, token_dict = load_preprocess()

In [None]:
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
import torch

# Check for a GPU
train_on_gpu = torch.cuda.is_available()
if not train_on_gpu:
    print('No GPU found. Please use a GPU to train your neural network.')

No GPU found. Please use a GPU to train your neural network.


# Batching Data

 We'll use `TensorDataset` to provide a known format to our dataset; in combination with DataLoader, it will handle batching, shuffling, and other dataset iteration functions.<br>
We can create data with TensorDataset by passing in feature and target tensors. Then create a DataLoader as usual.

```python
data = TensorDataset(feature_tensors, target_tensors)
data_loader = torch.utils.data.DataLoader(data, batch_size=batch_size)
```

For example, say we have these as input:<br>
```
words = [1, 2, 3, 4, 5, 6, 7]
sequence_length = 4
```
Our first feature_tensor should contain the values:<br>
```
[1, 2, 3, 4]
```
And the corresponding target_tensor should just be the next "word"/tokenized word value:<br>
```
5
```
This should continue with the second feature_tensor, target_tensor being:<br>
```
[2, 3, 4, 5]  # features
6             # target
```

In [None]:
from torch.utils.data import TensorDataset, DataLoader
import torch
import numpy as np


def batch_data(words, sequence_length, batch_size):
    """
    Batch the neural network data using DataLoader
    :param words: The word ids of the news text
    :param sequence_length: The sequence length of each batch
    :param batch_size: The size of each batch; the number of sequences in a batch
    :return: DataLoader with batched data
    """
    # TODO: Implement function
    n_batches = len(words)//batch_size
    x, y = [], []
    words = words[:n_batches*batch_size]
    
    for ii in range(0, len(words)-sequence_length):
        i_end = ii+sequence_length        
        batch_x = words[ii:ii+sequence_length]
        x.append(batch_x)
        batch_y = words[i_end]
        y.append(batch_y)
    
    data = TensorDataset(torch.from_numpy(np.asarray(x)), torch.from_numpy(np.asarray(y)))
    data_loader = DataLoader(data, shuffle=True, batch_size=batch_size)
        
    
    # return a dataloader
    return data_loader

# there is no test for this function, but you are encouraged to create
# print statements and tests of your own


In [None]:
# test dataloader

test_text = range(50)
t_loader = batch_data(test_text, sequence_length=5, batch_size=10)

data_iter = iter(t_loader)
sample_x, sample_y = next(data_iter)  # DataLoader iterators support the built-in next()

print(sample_x.shape)
print(sample_x)
print()
print(sample_y.shape)
print(sample_y)

torch.Size([10, 5])
tensor([[39, 40, 41, 42, 43],
        [42, 43, 44, 45, 46],
        [36, 37, 38, 39, 40],
        [ 4,  5,  6,  7,  8],
        [13, 14, 15, 16, 17],
        [44, 45, 46, 47, 48],
        [ 1,  2,  3,  4,  5],
        [ 5,  6,  7,  8,  9],
        [11, 12, 13, 14, 15],
        [14, 15, 16, 17, 18]])

torch.Size([10])
tensor([44, 47, 41,  9, 18, 49,  6, 10, 16, 19])


# Talking Points Model

## General Architecture

### Embedding Layer

The model takes our word tokens and first passes them through an embedding layer. This layer is responsible for converting our word tokens (integers) into embeddings of a specific size. These word embeddings are then fed to the next layer of LSTM cells. <br>

The main purpose of using embedding layer is dimensionality reduction.

### Contiguous LSTM Layer

Our LSTM layer is defined by *hidden state size and number of layers*. At each step, an LSTM cell will produce an output and a new hidden state. The hidden state will be passed to next cell as input (memory representation.)

### Final Fully Connected Linear Layer

The output generated by the LSTM cell is then fed into a *fully-connected linear layer.* This layer is responsible for mapping the LSTM output to the desired output size, which here is the vocabulary size.

The linear layer returns one raw score per word. Applying a softmax to these scores gives the probability distribution of the most likely next word.<br><br>


<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/Talking Points Model.png?raw=1" height="500"></img>
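
A probability distribution over the vocabulary is usually obtained by applying softmax to the raw scores of the final layer, which is what the generation code in this notebook does with `F.softmax`. The sketch below shows that conversion in plain Python. The three scores stand in for a hypothetical 3-word vocabulary.

```python
import math

def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    shift = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for a 3-word vocabulary
print(probs)
```

The ordering of the scores is preserved, so the word with the highest score also receives the highest probability.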






In [None]:
import torch.nn as nn

class RNN(nn.Module):
    
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5):
        """
        Initialize the PyTorch RNN Module
        :param vocab_size: The number of input dimensions of the neural network (the size of the vocabulary)
        :param output_size: The number of output dimensions of the neural network
        :param embedding_dim: The size of embeddings, should you choose to use them        
        :param hidden_dim: The size of the hidden layer outputs
        :param dropout: dropout to add in between LSTM/GRU layers
        """
        super(RNN, self).__init__()
        # TODO: Implement function
        
        # define embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # define lstm layer
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout, batch_first=True)
        
        
        # set class variables
        self.vocab_size = vocab_size
        self.output_size = output_size
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        
        # define model layers
        self.fc = nn.Linear(hidden_dim, output_size)
    
    
    def forward(self, x, hidden):
        """
        Forward propagation of the neural network
        :param x: The input to the neural network
        :param hidden: The hidden state        
        :return: Two Tensors, the output of the neural network and the latest hidden state
        """
        # TODO: Implement function   
        batch_size = x.size(0)
        x=x.long()
        
        # embedding and lstm_out 
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
        
        # flatten the LSTM outputs to (batch_size * seq_length, hidden_dim)
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        # fully connected layer produces raw word scores (logits); no activation is applied here
        out = self.fc(lstm_out)
        
        # reshape to (batch_size, seq_length, output_size)
        out = out.view(batch_size, -1, self.output_size)
        
        # keep only the scores from the last time step of each sequence
        out = out[:, -1]

        # return one batch of output word scores and the hidden state
        return out, hidden
    
    
    def init_hidden(self, batch_size):
        '''
        Initialize the hidden state of an LSTM/GRU
        :param batch_size: The batch_size of the hidden state
        :return: hidden state of dims (n_layers, batch_size, hidden_dim)
        '''
        # create 2 new zero tensors of size n_layers * batch_size * hidden_dim
        weights = next(self.parameters()).data
        if(train_on_gpu):
            hidden = (weights.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(), 
                     weights.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weights.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                     weights.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        # initialize hidden state with zero weights, and move to GPU if available
        
        return hidden


In [None]:
def forward_back_prop(rnn, optimizer, criterion, inp, target, hidden):
    """
    Forward and backward propagation on the neural network
    :param rnn: The PyTorch Module that holds the neural network
    :param optimizer: The PyTorch optimizer for the neural network
    :param criterion: The PyTorch loss function
    :param inp: A batch of input to the neural network
    :param target: The target output for the batch of input
    :param hidden: The hidden state from the previous batch
    :return: The loss and the latest hidden state Tensor
    """
    
    # move the model and data to GPU, if available; otherwise use them as-is
    # (previously inputs/targets were only defined on the GPU path)
    inputs, targets = inp, target
    if train_on_gpu:
        rnn.cuda()
        inputs, targets = inp.cuda(), target.cuda()
    
    # detach the hidden state from its history so that gradients
    # are not back-propagated through previous batches
    h = tuple([each.data for each in hidden])
    
    rnn.zero_grad()
    
    output, h = rnn(inputs, h)
    
    loss = criterion(output, targets)
    
    # perform backpropagation, clip gradients to limit exploding gradients, and update weights
    loss.backward()
    nn.utils.clip_grad_norm_(rnn.parameters(), 5)
    optimizer.step()

    # return the loss over a batch and the hidden state produced by our model
    return loss.item(), h


In [None]:
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""

def train_rnn(rnn, batch_size, optimizer, criterion, n_epochs, show_every_n_batches=100):
    batch_losses = []
    
    rnn.train()

    print("Training for %d epoch(s)..." % n_epochs)
    for epoch_i in range(1, n_epochs + 1):
        
        # initialize hidden state
        hidden = rnn.init_hidden(batch_size)
        
        for batch_i, (inputs, labels) in enumerate(train_loader, 1):
            
            # make sure you iterate over completely full batches, only
            n_batches = len(train_loader.dataset)//batch_size
            if(batch_i > n_batches):
                break
            
            # forward, back prop
            loss, hidden = forward_back_prop(rnn, optimizer, criterion, inputs, labels, hidden)          
            # record loss
            batch_losses.append(loss)

            # printing loss stats
            if batch_i % show_every_n_batches == 0:
                print('Epoch: {:>4}/{:<4}  Loss: {}\n'.format(
                    epoch_i, n_epochs, np.average(batch_losses)))
                batch_losses = []

    # returns a trained rnn
    return rnn

In [None]:
# Data params
# Sequence Length
sequence_length = 10  # of words in a sequence
# Batch Size
batch_size = 128

# data loader - do not change
train_loader = batch_data(int_text, sequence_length, batch_size)

In [None]:
# Training parameters
# Number of Epochs
num_epochs = 10
# Learning Rate
learning_rate = 0.001

# Model parameters
# Vocab size
vocab_size = len(vocab_to_int)
# Output size
output_size = vocab_size
# Embedding Dimension
embedding_dim = 200
# Hidden Dimension
hidden_dim = 250
# Number of RNN Layers
n_layers = 2

# Show stats for every n number of batches
show_every_n_batches = 500

In [None]:
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""

# create model and move to gpu if available
rnn = RNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5)
if train_on_gpu:
    rnn.cuda()

# defining loss and optimization functions for training
optimizer = torch.optim.Adam(rnn.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

# training the model
trained_rnn = train_rnn(rnn, batch_size, optimizer, criterion, num_epochs, show_every_n_batches)

# saving the trained model
save_model('./save/trained_rnn', trained_rnn)
print('Model Trained and Saved')

In [None]:
"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
import torch.nn.functional as F

def generate(rnn, prime_id, int_to_vocab, token_dict, pad_value, predict_len=100):
    """
    Generate text using the neural network
    :param rnn: The PyTorch Module that holds the trained neural network
    :param prime_id: The word id to start the first prediction
    :param int_to_vocab: Dict of word id keys to word values
    :param token_dict: Dict of punctuation keys to punctuation token values
    :param pad_value: The value used to pad a sequence
    :param predict_len: The length of text to generate
    :return: The generated text
    """
    rnn.eval()
    
    # create a sequence (batch_size=1) with the prime_id
    current_seq = np.full((1, sequence_length), pad_value)
    current_seq[-1][-1] = prime_id
    predicted = [int_to_vocab[prime_id]]
    
    for _ in range(predict_len):
        if train_on_gpu:
            current_seq = torch.LongTensor(current_seq).cuda()
        else:
            current_seq = torch.LongTensor(current_seq)
        
        # initialize the hidden state
        hidden = rnn.init_hidden(current_seq.size(0))
        
        # get the output of the rnn
        output, _ = rnn(current_seq, hidden)
        
        # get the next word probabilities
        p = F.softmax(output, dim=1).data
        if(train_on_gpu):
            p = p.cpu() # move to cpu
         
        # use top_k sampling to get the index of the next word
        top_k = 5
        p, top_i = p.topk(top_k)
        top_i = top_i.numpy().squeeze()
        
        # select the likely next word index with some element of randomness
        p = p.numpy().squeeze()
        word_i = np.random.choice(top_i, p=p/p.sum())
        
        # retrieve that word from the dictionary
        word = int_to_vocab[word_i]
        predicted.append(word)     
        
        # the generated word becomes the next "current sequence" and the cycle can continue
        current_seq = np.roll(current_seq.cpu(), -1, 1)
        current_seq[-1][-1] = word_i
    
    gen_sentences = ' '.join(predicted)
    
    # Replace punctuation tokens
    for key, token in token_dict.items():
        ending = ' ' if key in ['\n', '(', '"'] else ''
        gen_sentences = gen_sentences.replace(' ' + token.lower(), key)
    gen_sentences = gen_sentences.replace('\n ', '\n')
    gen_sentences = gen_sentences.replace('( ', '(')
    
    # return all the sentences
    return gen_sentences
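
The top-k sampling step inside `generate` can be looked at in isolation. Below is a minimal sketch in plain Python, assuming a small made-up list of scores instead of real network output; `softmax` and `sample_top_k` are illustrative helper names and are not part of the notebook.

```python
import math
import random

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_top_k(scores, k=5):
    """Sample one index from the k highest-probability entries."""
    probs = softmax(scores)
    # indices of the k most likely words, most likely first
    top_i = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    top_p = [probs[i] for i in top_i]
    # random.choices treats the weights as relative, so no manual renormalization is needed
    return random.choices(top_i, weights=top_p, k=1)[0]

# hypothetical scores for an 8-word vocabulary
scores = [0.1, 2.0, -1.0, 3.5, 0.3, 1.2, -0.5, 2.7]
next_word_id = sample_top_k(scores, k=3)
```

Restricting the draw to the top `k` candidates keeps some randomness in the generated text while ruling out very unlikely words.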

In [None]:
# run the cell multiple times to get different results!
gen_length = 100 # modify the length to your preference
prime_word = 'corona' # name for starting the script

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
# def generate(rnn, prime_id, int_to_vocab, token_dict, pad_value, predict_len=100):
pad_word = SPECIAL_WORDS['PADDING']
generated_script = generate(trained_rnn, vocab_to_int[prime_word], int_to_vocab, token_dict, vocab_to_int[pad_word], gen_length)
print(generated_script)

corona. m. est
** the dollar slipped 0. 2 percent to 3. 5 percent.
in the past decade, the dow jones industrial average was up 0. 2 percent at $1, 326. 00 a barrel.
a weaker u. s. dollar slipped from a basket of major currencies in 2018.
the dow jones industrial average index was down 0. 3 percent to the lowest level in the past five months, ” the official said.
the ministry is not immediately available to comment


# Sentiment Analysis on Stock Data

Sentiment analysis on stock data can be used to one's advantage. For an extreme example of how social media influences the stock market, take a look at Elon Musk's tweet about Tesla's stock. <br>

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/Elon_musk_tesla.png?raw=1" width="500"></img>

Within hours of the message being posted online, Tesla's market value **crashed by $14 billion** and Musk's own stake in the company reportedly **fell by $3 billion.**

In [None]:
# https://drive.google.com/file/d/14hxhKgpoZTODy1B2BHTam23WYXiX4YL5/view?usp=sharing

!pip install PyDrive

In [4]:
import os
os.chdir('/content')

In [5]:
import os
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [7]:

download = drive.CreateFile({'id': '14hxhKgpoZTODy1B2BHTam23WYXiX4YL5'})
download.GetContentFile('dataset.zip')

In [8]:
!unzip dataset

Archive:  dataset.zip
   creating: data/
  inflating: data/labels.txt         
  inflating: __MACOSX/data/._labels.txt  
  inflating: data/reviews.txt        
  inflating: __MACOSX/data/._reviews.txt  


In [10]:
import numpy as np

reviews = []
labels = []


#read data from text files
with open('./data/reviews.txt', 'r') as f:
    reviews = f.read()
with open('./data/labels.txt', 'r') as f:
    labels = f.read()

In [12]:
print(reviews[:2000])
print()
print(labels[:20])

bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   
story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turn

## Data pre-processing

The first step when building a neural network model is getting your data into the proper form to feed into the network. Since we're using embedding layers, we'll need to encode each word with an integer. We'll also want to clean it up a bit.

You can see an example of the reviews data above. Here are the processing steps we'll want to take:
> * We'll want to get rid of periods and extraneous punctuation.
> * Also, you might notice that the reviews are delimited with newline characters `\n`. To deal with those, I'm going to split the text into each review using `\n` as the delimiter.
> * Then I can combine all the reviews back together into one big string.

First, let's remove all punctuation. Then get all the text without the newlines and split it into individual words.
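
To make these steps concrete, here is a small sketch on a made-up two-review string. The `raw` text and the `toy_` names are assumptions for illustration only; the real data is processed in the cells that follow.

```python
from string import punctuation

# hypothetical corpus: two reviews separated by a newline
raw = "Great movie, loved it!\nTerrible plot... bad acting."

text = raw.lower()                                        # standardize case
text = ''.join(c for c in text if c not in punctuation)   # drop punctuation
toy_reviews = text.split('\n')                            # one string per review
toy_words = ' '.join(toy_reviews).split()                 # flat list of all words
```

Note that the newline survives the punctuation filter, which is what lets us split into individual reviews afterwards.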

In [13]:
from string import punctuation

print(punctuation)

# get rid of punctuation
reviews = reviews.lower() # lowercase, standardize
all_text = ''.join([c for c in reviews if c not in punctuation])

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [14]:
# split by new lines and spaces
reviews_split = all_text.split('\n')
all_text = ' '.join(reviews_split)

# create a list of words
words = all_text.split()

In [None]:
words[:30]

['bromwell',
 'high',
 'is',
 'a',
 'cartoon',
 'comedy',
 'it',
 'ran',
 'at',
 'the',
 'same',
 'time',
 'as',
 'some',
 'other',
 'programs',
 'about',
 'school',
 'life',
 'such',
 'as',
 'teachers',
 'my',
 'years',
 'in',
 'the',
 'teaching',
 'profession',
 'lead',
 'me']

### Encoding the words

The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our reviews into integers so they can be passed into the network.

> We will now encode the words with integers. Build a dictionary that maps words to integers. Later we're going to pad our input vectors with zeros, so make sure the integers **start at 1, not 0**.
> Also, convert the reviews to integers and store the reviews in a new list called `reviews_int`.
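
As a quick illustration of the encoding, the sketch below builds the word-to-integer dictionary for a tiny made-up corpus. The `toy_` names and the corpus are assumptions; the pattern mirrors what we do on the full data.

```python
from collections import Counter

# hypothetical two-review corpus, already lowercased and stripped of punctuation
toy_reviews = ["good good good movie", "bad movie"]
toy_words = ' '.join(toy_reviews).split()

toy_counts = Counter(toy_words)
# most frequent word first; ids start at 1 so that 0 stays free for padding
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_vocab_to_int = {word: i for i, word in enumerate(toy_vocab, 1)}

# each review becomes a list of integers
toy_reviews_int = [[toy_vocab_to_int[w] for w in r.split()] for r in toy_reviews]
```

Because the vocabulary is sorted by frequency, the most common words get the smallest ids.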

In [None]:
from collections import Counter

counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word:ii for ii, word in enumerate(vocab, 1)}

In [None]:
reviews_int = []
'''
reviews_split contains multiple reviews 
reviews_int will be 2-D array
'''
for review in reviews_split:
  reviews_int.append([vocab_to_int[word] for word in review.split()])
print(len(vocab_to_int))
print(reviews_int[:10])

74072
[[21025, 308, 6, 3, 1050, 207, 8, 2138, 32, 1, 171, 57, 15, 49, 81, 5785, 44, 382, 110, 140, 15, 5194, 60, 154, 9, 1, 4975, 5852, 475, 71, 5, 260, 12, 21025, 308, 13, 1978, 6, 74, 2395, 5, 613, 73, 6, 5194, 1, 24103, 5, 1983, 10166, 1, 5786, 1499, 36, 51, 66, 204, 145, 67, 1199, 5194, 19869, 1, 37442, 4, 1, 221, 883, 31, 2988, 71, 4, 1, 5787, 10, 686, 2, 67, 1499, 54, 10, 216, 1, 383, 9, 62, 3, 1406, 3686, 783, 5, 3483, 180, 1, 382, 10, 1212, 13583, 32, 308, 3, 349, 341, 2913, 10, 143, 127, 5, 7690, 30, 4, 129, 5194, 1406, 2326, 5, 21025, 308, 10, 528, 12, 109, 1448, 4, 60, 543, 102, 12, 21025, 308, 6, 227, 4146, 48, 3, 2211, 12, 8, 215, 23], [63, 4, 3, 125, 36, 47, 7472, 1395, 16, 3, 4181, 505, 45, 17, 3, 622, 134, 12, 6, 3, 1279, 457, 4, 1721, 207, 3, 10624, 7373, 300, 6, 667, 83, 35, 2116, 1086, 2989, 34, 1, 898, 46417, 4, 8, 13, 5096, 464, 8, 2656, 1721, 1, 221, 57, 17, 58, 794, 1297, 832, 228, 8, 43, 98, 123, 1469, 59, 147, 38, 1, 963, 142, 29, 667, 123, 1, 13584, 410, 61, 9

In [None]:
# stats about vocabulary
print('Unique words: ', len((vocab_to_int)))  # should ~ 74000+
print()

# print tokens in first review
print('Tokenized review: \n', reviews_int[:1])

Unique words:  74072

Tokenized review: 
 [[21025, 308, 6, 3, 1050, 207, 8, 2138, 32, 1, 171, 57, 15, 49, 81, 5785, 44, 382, 110, 140, 15, 5194, 60, 154, 9, 1, 4975, 5852, 475, 71, 5, 260, 12, 21025, 308, 13, 1978, 6, 74, 2395, 5, 613, 73, 6, 5194, 1, 24103, 5, 1983, 10166, 1, 5786, 1499, 36, 51, 66, 204, 145, 67, 1199, 5194, 19869, 1, 37442, 4, 1, 221, 883, 31, 2988, 71, 4, 1, 5787, 10, 686, 2, 67, 1499, 54, 10, 216, 1, 383, 9, 62, 3, 1406, 3686, 783, 5, 3483, 180, 1, 382, 10, 1212, 13583, 32, 308, 3, 349, 341, 2913, 10, 143, 127, 5, 7690, 30, 4, 129, 5194, 1406, 2326, 5, 21025, 308, 10, 528, 12, 109, 1448, 4, 60, 543, 102, 12, 21025, 308, 6, 227, 4146, 48, 3, 2211, 12, 8, 215, 23]]


In [None]:
labels_split = labels.split('\n')
labels_to_int = np.array([1 if label=='positive' else 0 for label in labels_split])

# count how many reviews there are of each length, then print the longest length
review_len_counts = Counter([len(x) for x in reviews_int])
print(max(review_len_counts))

2514


In [None]:
# outlier review stats
review_lens = Counter([len(x) for x in reviews_int])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

Zero-length reviews: 1
Maximum review length: 2514


### Removing Outliers

As an additional pre-processing step, we want to make sure that our reviews are in good shape for standard processing. That is, our network will expect a standard input text size, and so, we'll want to shape our reviews into a specific length. We'll approach this task in two main steps:

1. Getting rid of extremely long or short reviews; the outliers
2. Padding/truncating the remaining data so that we have reviews of the same length.

Before we pad our review text, we should check for reviews of extremely short or long lengths; outliers that may mess with our training.
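
The important detail when dropping outliers is that reviews and labels must be filtered with the same indices, otherwise they fall out of alignment. A minimal sketch with made-up values (the `toy_` lists are assumptions):

```python
# hypothetical encoded reviews; the middle one is empty
toy_reviews_int = [[4, 8], [], [15, 16, 23]]
toy_labels = [1, 0, 0]

# indices of reviews that contain at least one token
keep_idx = [i for i, review in enumerate(toy_reviews_int) if len(review) != 0]

# apply the same indices to both lists so they stay aligned
toy_reviews_int = [toy_reviews_int[i] for i in keep_idx]
toy_labels = [toy_labels[i] for i in keep_idx]
```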

In [None]:
print('Number of reviews before removing outliers: ', len(reviews_int))

## remove any reviews/labels with zero length from the reviews_int list.

non_zero_idx = [ii for ii, review in enumerate(reviews_int) if len(review)!=0]
reviews_int = [reviews_int[ii] for ii in non_zero_idx]
encoded_labels = np.array([labels_to_int[ii] for ii in non_zero_idx])


print('Number of reviews after removing outliers: ', len(reviews_int))

Number of reviews before removing outliers:  25001
Number of reviews after removing outliers:  25000


---
## Padding sequences

To deal with both short and very long reviews, we'll pad or truncate all our reviews to a specific length. For reviews shorter than some `seq_length`, we'll pad with 0s. For reviews longer than `seq_length`, we can truncate them to the first `seq_length` words. A good `seq_length`, in this case, is 200.

> Here, we have defined a function that returns an array `features` that contains the padded data, of a standard size, that we'll pass to the network. 
* The data should come from `reviews_int`, since we want to feed integers to the network. 
* Each row should be `seq_length` elements long. 
* For reviews shorter than `seq_length` words, **left pad** with 0s. That is, if the review is `['best', 'movie', 'ever']`, `[117, 18, 128]` as integers, the row will look like `[0, 0, 0, ..., 0, 117, 18, 128]`. 
* For reviews longer than `seq_length`, use only the first `seq_length` words as the feature vector.

As a small example, if the `seq_length=10` and an input review is: 
```
[117, 18, 128]
```
The resultant, padded sequence should be: 

```
[0, 0, 0, 0, 0, 0, 0, 117, 18, 128]
```


In [None]:
def pad_features(reviews_int, seq_length):
  features = np.zeros((len(reviews_int), seq_length), dtype=int)
  for i, row in enumerate(reviews_int):
    features[i, -len(row):] = np.array(row)[:seq_length]
  
  return features
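
It may not be obvious why the single slice assignment handles both short and long reviews. The sketch below repeats the same logic under a different name (`pad_features_demo`, an illustrative copy) and runs it on one short and one long row, assuming NumPy is available.

```python
import numpy as np

def pad_features_demo(reviews_int, seq_length):
    # same logic as pad_features, repeated so this sketch is self-contained
    features = np.zeros((len(reviews_int), seq_length), dtype=int)
    for i, row in enumerate(reviews_int):
        # short row: -len(row) selects the last len(row) slots, so zeros stay on the left
        # long row: -len(row) is clipped to the start, and [:seq_length] truncates the row
        features[i, -len(row):] = np.array(row)[:seq_length]
    return features

demo = pad_features_demo([[117, 18, 128], [1, 2, 3, 4, 5, 6, 7]], seq_length=5)
```

One caveat: an empty row would make the assignment fail, which is one reason the zero-length reviews are removed first.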

In [None]:
# Test your implementation!

seq_length = 200
features = pad_features(reviews_int, seq_length)

## test statements - do not change - ##
assert len(features)==len(reviews_int), "Your features should have as many rows as reviews."
assert len(features[0])==seq_length, "Each feature row should contain seq_length values."

# print first 10 values of the first 30 reviews
print(features[:30,:10])

In [None]:
split_frac = 0.8

split_idx = int(len(features)*split_frac)

train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]

test_idx = int(len(remaining_x)*0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]

In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

# dataloaders
batch_size = 50

# make sure to SHUFFLE your data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

In [None]:
# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)  # iterators no longer expose .next() in recent PyTorch versions

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

Sample input size:  torch.Size([50, 200])
Sample input: 
 tensor([[   10,   254,   131,  ...,    17,    88,     2],
        [    0,     0,     0,  ...,     2,   425,  1470],
        [   10,    43,  2368,  ...,    12,    40,     6],
        ...,
        [    0,     0,     0,  ...,   115,    17,   273],
        [   10, 17710,    11,  ...,   277,     1,  1020],
        [    0,     0,     0,  ...,    33,    70,  1553]])

Sample label size:  torch.Size([50])
Sample label: 
 tensor([1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1,
        0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1,
        0, 1])


In [None]:
# First checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

Training on GPU.


---
# Sentiment Network with PyTorch

Here we define the network.


<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/Sentiment%20Analysis%20network.png?raw=1" width="500"></img>

The layers are as follows:
1. An [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) that converts our word tokens (integers) into embeddings of a specific size.
2. An [LSTM layer](https://pytorch.org/docs/stable/nn.html#lstm) defined by a hidden_state size and number of layers
3. A fully-connected output layer that maps the LSTM layer outputs to a desired output_size
4. A sigmoid activation layer which turns all outputs into a value 0-1; return **only the last sigmoid output** as the output of this network.
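
To see what "only the last sigmoid output" means in terms of shapes, here is a rough NumPy sketch of the reshaping done after the LSTM. The random arrays stand in for the LSTM output and the fully-connected weights, the bias is left out, and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, hidden_dim = 2, 4, 3

# stand-ins for the LSTM output and the fully-connected weights (random, hypothetical)
lstm_out = rng.normal(size=(batch_size, seq_len, hidden_dim))
fc_weight = rng.normal(size=(hidden_dim, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

flat = lstm_out.reshape(-1, hidden_dim)      # (batch_size * seq_len, hidden_dim)
sig_out = sigmoid(flat @ fc_weight)          # (batch_size * seq_len, 1)
sig_out = sig_out.reshape(batch_size, -1)    # (batch_size, seq_len)
last = sig_out[:, -1]                        # (batch_size,): one score per review
```

The result is the same as applying the output layer only to the final time step of each sequence.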

### The Embedding Layer

We need to add an [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) because there are 74000+ words in our vocabulary. It is massively inefficient to one-hot encode that many classes. So, instead of one-hot encoding, we can have an embedding layer and use that layer as a lookup table. You could train an embedding layer using Word2Vec, then load it here. But, it's fine to just make a new layer, using it for only dimensionality reduction, and let the network learn the weights.
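
The "lookup table" idea can be checked with a few lines of NumPy: selecting rows of a weight matrix gives the same result as multiplying one-hot vectors by that matrix, without ever building the one-hot vectors. The table here is random and the sizes are made up for illustration.

```python
import numpy as np

vocab_size_demo, embedding_dim_demo = 6, 3
rng = np.random.default_rng(0)
# hypothetical embedding weights: one row per word id
embedding_table = rng.normal(size=(vocab_size_demo, embedding_dim_demo))

token_ids = np.array([2, 5, 2])

# lookup: pick rows of the table directly
looked_up = embedding_table[token_ids]

# equivalent one-hot route: build a (3, 6) one-hot matrix and multiply
one_hot = np.eye(vocab_size_demo)[token_ids]
via_matmul = one_hot @ embedding_table
```

With 74000+ words, each one-hot vector would be that long, which is why the direct lookup is so much cheaper.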


### The LSTM Layer(s)

We'll create an [LSTM](https://pytorch.org/docs/stable/nn.html#lstm) to use in our recurrent network, which takes in an input_size, a hidden_dim, a number of layers, a dropout probability (for dropout between multiple layers), and a batch_first parameter.

Most of the time, your network will have better performance with more layers, typically 2-3. Adding more layers allows the network to learn really complex relationships. 

In [None]:
import torch.nn as nn

class SentimentRNN(nn.Module):
  def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
    super(SentimentRNN, self).__init__()

    self.output_size = output_size
    self.n_layers = n_layers
    self.hidden_dim = hidden_dim

    self.embedding = nn.Embedding(vocab_size, embedding_dim)
    self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=drop_prob, batch_first=True)

    self.dropout = nn.Dropout(0.3)
    self.fc = nn.Linear(hidden_dim, output_size)
    self.sig = nn.Sigmoid()

  def forward(self, x, hidden):
    batch_size = x.size(0)
    x = x.long()
    embeds = self.embedding(x)
    lstm_out, hidden = self.lstm(embeds, hidden)
    lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
    out = self.dropout(lstm_out)
    out = self.fc(out)
    sig_out = self.sig(out)

    sig_out = sig_out.view(batch_size, -1)
    sig_out = sig_out[:, -1]

    return sig_out, hidden
  
  def init_hidden(self, batch_size):
    weight = next(self.parameters()).data
    if(train_on_gpu):
      hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(), 
                weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
    else:
      # CPU branch: zero-initialize both the hidden state and the cell state
      hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(), 
                weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
      
    return hidden

In [None]:
# Instantiate the model w/ hyperparams
vocab_size = len(vocab_to_int)+1 # +1 for the 0 padding + our word tokens
output_size = 1
embedding_dim = 400
hidden_dim = 256
n_layers = 2

net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(net)

SentimentRNN(
  (embedding): Embedding(74073, 400)
  (lstm): LSTM(400, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)


In [None]:
# loss and optimization functions
lr=0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)

In [None]:
epochs = 4
counter = 0 
print_every = 100
clip = 5
if(train_on_gpu):
  net.cuda()

net.train()
for e in range(epochs):
  h = net.init_hidden(batch_size)
  for inputs, labels in train_loader:
    counter += 1
    if(train_on_gpu):
      inputs, labels = inputs.cuda(), labels.cuda()
    h = tuple([each.data for each in h])
    net.zero_grad()
    output, h = net(inputs, h)
    loss = criterion(output.squeeze(), labels.float())
    loss.backward()
    # clip gradients in place to help prevent exploding gradients
    nn.utils.clip_grad_norm_(net.parameters(), clip)
    optimizer.step()

    if counter % print_every == 0:
      val_h = net.init_hidden(batch_size)
      val_losses = []
      net.eval()
      for inputs, labels in valid_loader:
        val_h = tuple([each.data for each in val_h])
        if(train_on_gpu):
          inputs, labels = inputs.cuda(), labels.cuda()
        output, val_h = net(inputs, val_h)
        val_loss = criterion(output.squeeze(), labels.float())
        val_losses.append(val_loss.item())
      net.train()
      print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))



Epoch: 1/4... Step: 100... Loss: 0.693711... Val Loss: 0.653396
Epoch: 1/4... Step: 200... Loss: 0.544701... Val Loss: 0.571027
Epoch: 1/4... Step: 300... Loss: 0.562405... Val Loss: 0.557840
Epoch: 1/4... Step: 400... Loss: 0.646383... Val Loss: 0.633329
Epoch: 2/4... Step: 500... Loss: 0.678114... Val Loss: 0.709534
Epoch: 2/4... Step: 600... Loss: 0.566871... Val Loss: 0.530177
Epoch: 2/4... Step: 700... Loss: 0.494823... Val Loss: 0.530239
Epoch: 2/4... Step: 800... Loss: 0.345891... Val Loss: 0.454900
Epoch: 3/4... Step: 900... Loss: 0.307721... Val Loss: 0.481526
Epoch: 3/4... Step: 1000... Loss: 0.329811... Val Loss: 0.469423
Epoch: 3/4... Step: 1100... Loss: 0.299512... Val Loss: 0.490447
Epoch: 3/4... Step: 1200... Loss: 0.238392... Val Loss: 0.461704
Epoch: 4/4... Step: 1300... Loss: 0.311766... Val Loss: 0.491983
Epoch: 4/4... Step: 1400... Loss: 0.336497... Val Loss: 0.480622
Epoch: 4/4... Step: 1500... Loss: 0.345086... Val Loss: 0.544988
Epoch: 4/4... Step: 1600... Loss: 

In [None]:
# Get test data loss and accuracy

test_losses = [] # track loss
num_correct = 0

# init hidden state
h = net.init_hidden(batch_size)

net.eval()
# iterate over test data
for inputs, labels in test_loader:

    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    h = tuple([each.data for each in h])

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # get predicted outputs
    output, h = net(inputs, h)
    
    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer
    
    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

Test loss: 0.488
Test accuracy: 0.812


In [None]:
from string import punctuation

def tokenize_movie_review(test_review):
  test_review = test_review.lower()
  test_text = ''.join([c for c in test_review if c not in punctuation])
  test_words = test_text.split()
  test_ints = []
  test_ints.append([vocab_to_int[word] for word in test_words])
  return test_ints
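
One thing to watch for: looking a word up with `vocab_to_int[word]` raises a `KeyError` if the review contains a word that was never seen in training. A possible workaround, sketched below, is to skip unknown words. `tokenize_skip_unknown` and the small vocabulary are assumptions for illustration, not part of the notebook.

```python
from string import punctuation

def tokenize_skip_unknown(text, vocab_to_int):
    """Tokenize like tokenize_movie_review, but drop words missing from the vocabulary."""
    text = text.lower()
    text = ''.join(c for c in text if c not in punctuation)
    return [[vocab_to_int[w] for w in text.split() if w in vocab_to_int]]

toy_vocab_to_int = {'it': 8, 'was': 14, 'bad': 76}   # hypothetical subset of the vocabulary
tokens = tokenize_skip_unknown("It was zzzbad, bad!", toy_vocab_to_int)
```

Skipping words silently changes the input, so for text that is far from the training domain it is worth checking how many words were dropped.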

In [None]:
test_review_neg = "It was a very bad movie. Terrible acting."
tokenized_review = tokenize_movie_review(test_review_neg)
print(tokenize_movie_review(test_review_neg))

[[8, 14, 3, 55, 76, 18, 388, 113]]


In [None]:
seq_length = 200
features = pad_features(tokenized_review, seq_length)
print(features)

[[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   8  14   3  55  76  18
  388 113]]


In [None]:
feature_tensor = torch.from_numpy(features)
print(feature_tensor.size(0))

1


In [None]:
def predict(net, test_review, seq_length=200):
  net.eval()
  # tokenize and pad the review to the length the network was trained on
  test_ints = tokenize_movie_review(test_review)
  features = pad_features(test_ints, seq_length)
  feature_tensor = torch.from_numpy(features)
  batch_size = feature_tensor.size(0)
  h = net.init_hidden(batch_size)
  if(train_on_gpu):
    feature_tensor=feature_tensor.cuda()
  output, h = net(feature_tensor, h)
  pred = torch.round(output.squeeze())
  print('Prediction value, pre-rounding: {:.6f}'.format(output.item()))
  # print the label and also return it, so callers can store the result
  sentiment = "Positive" if pred.item()==1 else "Negative"
  print(sentiment)
  return sentiment

In [None]:
# negative test review
test_review_neg = 'Stocks are going down as new corona virus cases surge'
seq_length=200 # good to use the length that was trained on

predict(net, test_review_neg, seq_length)

Prediction value, pre-rounding: 0.359334
Negative


In [None]:
# positive test review
test_review_pos = 'Stock market is booming and growing post corona pandemic.'
seq_length=200 # good to use the length that was trained on

predict(net, test_review_pos, seq_length)

Prediction value, pre-rounding: 0.717415
Positive


# Converting Result into CSV File

Finally, we write the generated talking points and their predicted sentiment to a CSV file, which can then be served through an API.

In [None]:
import csv

In [None]:
generated_talking_points = x.split('.')

In [None]:
import csv
with open('talking_agenda.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["SN", "Talking Point", "Sentiment Prediction"])
    for i, points in enumerate(generated_talking_points):
      sentiment_predicted = predict(net, points, seq_length)
      writer.writerow([i, points, sentiment_predicted])
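
On the consuming side, a file with this header can be read back as a list of dictionaries with `csv.DictReader`. The sketch below uses an in-memory buffer and made-up rows instead of the real `talking_agenda.csv`, so the row contents are assumptions.

```python
import csv
import io

# hypothetical rows standing in for the generated talking points
rows = [
    [0, "the dollar slipped 0", "Negative"],
    [1, "stocks rallied on friday", "Positive"],
]

# write to an in-memory buffer in the same layout as talking_agenda.csv
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(["SN", "Talking Point", "Sentiment Prediction"])
writer.writerows(rows)

# read it back: each record is a dict keyed by the header names
buffer.seek(0)
records = list(csv.DictReader(buffer))
```

Note that every value comes back as a string, so the serial number needs an explicit `int(...)` if it is used as a number.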