# Sentiment Analysis with Recurrent Neural Networks


![top-twitter-emojis.jpg](https://blog.emojipedia.org/content/images/size/w2000/2018/01/top-twitter-emojis.jpg)

We'll build a recurrent neural network and train it to do sentiment classification!

The full data loading and processing code is provided for you, as is the code for a vanilla Elman-RNN. You'll be completing the implementation of an LSTM-RNN and then comparing results.

## Data Processing


### Load data
Make sure you've downloaded the Stanford Sentiment Treebank that was used in lab last week. You can find it [here](http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip).

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import re
import random

random.seed(1)
sst_home = '../data/trees'

# Let's do 2-way positive/negative classification instead of 5-way
easy_label_map = {0:0, 1:0, 2:None, 3:1, 4:1}
# Labels 0 and 1 in the 5-way classification map to 0 in the 2-way task, and 3 and 4 map to 1.
# Label 2 maps to None because we don't have a neutral class.

PADDING = "<PAD>"
UNKNOWN = "<UNK>"
max_seq_length = 20

def load_sst_data(path):
    data = []
    with open(path) as f:
        for i, line in enumerate(f): 
            example = {}
            example['label'] = easy_label_map[int(line[1])]
            if example['label'] is None:
                continue
            
            # Strip out the parse information and the phrase labels---we don't need those here
            text = re.sub(r'\s*(\(\d)|(\))\s*', '', line)
            example['text'] = text[1:]
            data.append(example)

    random.seed(1)
    random.shuffle(data)
    return data
     
training_set = load_sst_data(sst_home + '/train.txt')
dev_set = load_sst_data(sst_home + '/dev.txt')
test_set = load_sst_data(sst_home + '/test.txt')
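
To see what the label mapping and the regular expression in `load_sst_data` do to a single line, here is a minimal sketch. The helper name `parse_sst_line` and the two sample lines are made up for illustration (they follow the treebank's bracketed format but are not taken from the dataset).

```python
import re

easy_label_map = {0: 0, 1: 0, 2: None, 3: 1, 4: 1}

def parse_sst_line(line):
    """Hypothetical helper: return (binary_label, text), or None for neutral lines."""
    # The root label is the digit right after the opening parenthesis.
    label = easy_label_map[int(line[1])]
    if label is None:
        return None
    # Same pattern as in load_sst_data: drop "(<digit>" openers and ")" closers.
    text = re.sub(r'\s*(\(\d)|(\))\s*', '', line)
    # The substitution leaves one leading space, which is sliced off.
    return label, text[1:]

print(parse_sst_line("(4 (2 A) (4 (3 great) (2 movie)))\n"))  # (1, 'A great movie')
print(parse_sst_line("(2 (2 It) (2 exists))\n"))              # None
```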

<br>

Next, we'll extract the vocabulary from the data, index each token, and finally convert the sentences into lists of indexed tokens. We are also padding and truncating all sentences to be of length=20. (Why? Think about how to handle batching. This is not the only way to do it! This is just simple.)
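
As a small illustration of why fixed-length sequences help with batching, the sketch below indexes, truncates, and right-pads two toy sentences so that they form a rectangular batch. The helper names `build_vocab` and `encode` and the toy sentences are assumptions for this example. Unlike the notebook code, it sorts the vocabulary so that the indices are reproducible.

```python
PADDING, UNKNOWN = "<PAD>", "<UNK>"

def build_vocab(sentences):
    # Sorted for reproducible indices (the notebook relies on set order instead).
    words = sorted({tok for s in sentences for tok in s.split()})
    return {w: i for i, w in enumerate([PADDING, UNKNOWN] + words)}

def encode(sentence, word_indices, max_len):
    # Unknown tokens map to UNKNOWN; long sentences are cut, short ones padded.
    ids = [word_indices.get(tok, word_indices[UNKNOWN]) for tok in sentence.split()]
    ids = ids[:max_len]
    return ids + [word_indices[PADDING]] * (max_len - len(ids))

vocab = build_vocab(["a fine film", "a dull film indeed"])
batch = [encode(s, vocab, max_len=4)
         for s in ["a fine film", "a dull plot twist indeed"]]
print(batch)  # [[2, 5, 4, 0], [2, 3, 1, 1]]
```

Every row has the same length, so the rows can be stacked into a single tensor of shape (batch_size, seq_len).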

In [2]:
import collections
import numpy as np

def tokenize(string):
    return string.split()

def build_dictionary(training_datasets):
    """
    Extract vocabulary and build dictionary.
    """  
    word_counter = collections.Counter()
    for i, dataset in enumerate(training_datasets):
        for example in dataset:
            word_counter.update(tokenize(example['text']))
        
    vocabulary = set([word for word in word_counter])
    vocabulary = list(vocabulary)
    vocabulary = [PADDING, UNKNOWN] + vocabulary
        
    word_indices = dict(zip(vocabulary, range(len(vocabulary))))

    return word_indices, len(vocabulary)

def sentences_to_padded_index_sequences(word_indices, datasets):
    """
    Annotate datasets with feature vectors. Adding right-sided padding. 
    """
    for i, dataset in enumerate(datasets):
        for example in dataset:
            example['text_index_sequence'] = torch.zeros(max_seq_length)

            token_sequence = tokenize(example['text'])

            for j in range(max_seq_length):
                if j >= len(token_sequence):
                    # Right-pad sentences shorter than max_seq_length.
                    index = word_indices[PADDING]
                else:
                    # Map out-of-vocabulary tokens to UNKNOWN.
                    index = word_indices.get(token_sequence[j], word_indices[UNKNOWN])
                example['text_index_sequence'][j] = index

            example['text_index_sequence'] = example['text_index_sequence'].long().view(1,-1)
            example['label'] = torch.LongTensor([example['label']])


word_to_ix, vocab_size = build_dictionary([training_set])
sentences_to_padded_index_sequences(word_to_ix, [training_set, dev_set, test_set])

In [3]:
print("Size of training dataset:", len(training_set))
print("\nFirst padded and indexified example in training data:\n", training_set[0])

Size of training dataset: 6920

First padded and indexified example in training data:
 {'label': tensor([0]), 'text': 'Yet another entry in the sentimental oh-those-wacky-Brits genre that was ushered in by The Full Monty and is still straining to produce another smash hit .', 'text_index_sequence': tensor([[ 5957,  6674, 10700,  2187,  2957,  7959, 12589, 10807, 14335,  5618,
         13260,  2187, 11685, 10763,  2265,  6099, 15783, 10090,  5653,  4847]])}


<br>

### Batchify data
We're going to be doing mini-batch training. The following code makes data iterators and a batchifying function.

In [4]:
# This is the iterator we'll use during training. 
# It's a generator that gives you one batch at a time.
def data_iter(source, batch_size):
    dataset_size = len(source)
    start = -1 * batch_size
    order = list(range(dataset_size))
    random.shuffle(order)

    while True:
        start += batch_size
        if start > dataset_size - batch_size:
            # Start another epoch.
            start = 0
            random.shuffle(order)   
        batch_indices = order[start:start + batch_size]
        batch = [source[index] for index in batch_indices]
        yield batch

# This is the iterator we use when we're evaluating our model. 
# It gives a list of batches that you can then iterate through.
def eval_iter(source, batch_size):
    batches = []
    dataset_size = len(source)
    start = -1 * batch_size
    order = list(range(dataset_size))
    random.shuffle(order)

    while start < dataset_size - batch_size:
        start += batch_size
        batch_indices = order[start:start + batch_size]
        batch = [source[index] for index in batch_indices]
        # Keep only full batches; a trailing partial batch is dropped.
        if len(batch) == batch_size:
            batches.append(batch)
        
    return batches

# The following function gives batches of vectors and labels, 
# these are the inputs to your model and loss function
def get_batch(batch):
    vectors = []
    labels = []
    for example in batch:  # avoid shadowing the built-in `dict`
        vectors.append(example["text_index_sequence"])
        labels.append(example["label"])
    return vectors, labels
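
The lists returned by `get_batch` still need to be stacked into tensors before they reach the model. The sketch below shows the shapes involved on two toy examples (the index values are arbitrary). It uses `squeeze(1)` rather than the bare `squeeze()` used elsewhere in this notebook; that is a deliberate variation, since a bare `squeeze()` would also remove the batch dimension if the batch size were 1.

```python
import torch

# Two toy examples in the same layout the data pipeline produces:
# a (1, seq_len) index tensor and a (1,) label tensor.
batch = [
    {"text_index_sequence": torch.tensor([[4, 7, 0]]), "label": torch.tensor([1])},
    {"text_index_sequence": torch.tensor([[9, 2, 5]]), "label": torch.tensor([0])},
]

vectors = torch.stack([ex["text_index_sequence"] for ex in batch])  # (2, 1, 3)
labels = torch.stack([ex["label"] for ex in batch])                 # (2, 1)

vectors = vectors.squeeze(1)  # (batch_size, seq_len) = (2, 3)
labels = labels.squeeze(1)    # (batch_size,) = (2,)
print(tuple(vectors.shape), tuple(labels.shape))
```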


## Evaluation

We'll be looking at accuracy as our evaluation metric.
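
As a quick illustration, accuracy is the fraction of examples whose highest-scoring class matches the label. The helper name `accuracy` and the toy logits below are made up for this sketch.

```python
import torch

def accuracy(logits, labels):
    """Fraction of rows whose argmax over classes equals the label."""
    predicted = logits.argmax(dim=1)
    return (predicted == labels).sum().item() / labels.size(0)

logits = torch.tensor([[2.0, 0.1],
                       [0.3, 1.2],
                       [0.9, 0.4],
                       [0.2, 0.8]])
labels = torch.tensor([0, 1, 1, 1])
print(accuracy(logits, labels))  # 0.75 (the third example is misclassified)
```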

In [5]:
# This function outputs the accuracy on the dataset, we will use it during training.
def evaluate(model, data_iter, is_lstm):
    model.eval()
    correct = 0
    total = 0
    for i in range(len(data_iter)):
        vectors, labels = get_batch(data_iter[i])
        vectors = torch.stack(vectors).squeeze()
        labels = torch.stack(labels).squeeze()
        
        if is_lstm:
            hidden, c_t = model.init_hidden()
            output, hidden = model(vectors, hidden, c_t)
        else:
            hidden = model.init_hidden()
            output, hidden = model(vectors, hidden)
        
        _, predicted = torch.max(output.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
      
    return correct / float(total)

## Elman-RNN

Note that when you're actually building and using these models for research or applications, you will never want to build them from scratch like we are today. This is simply for demonstration, and because it's a very useful exercise to do these things from scratch at least once.
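
For reference, the recurrence that the `ElmanRNN` class computes can be written as follows. The symbols are notation chosen for this note, not names from the code: $e_t$ is the embedding of the $t$-th token, $[h_{t-1}; e_t]$ is the concatenation of the previous hidden state and the embedding, $W, b$ are the parameters of `inlinear`, $V, c$ are the parameters of `hid2logits`, and $T$ is the sequence length (20 here).

$$
h_0 = 0, \qquad
h_t = \tanh\left(W\,[h_{t-1};\, e_t] + b\right) \quad (t = 1, \dots, T), \qquad
\text{logits} = V h_T + c
$$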

In [6]:
class ElmanRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, output_size, batch_size):
        super(ElmanRNN, self).__init__()
        
        self.embed = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.embedding_size = embedding_dim
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.batch_size = batch_size
        
        self.inlinear = nn.Linear(embedding_dim + hidden_size, hidden_size)
        self.hid2logits = nn.Linear(hidden_size, output_size)
        self.init_weights()
    
    def forward(self, x, hidden):
        x_emb = self.embed(x)                
        embs = torch.chunk(x_emb, x_emb.size()[1], 1)
        
        def step(emb, hid):
            combined = torch.cat((hid, emb), 1)
            hid = torch.tanh(self.inlinear(combined))
            return hid

        for i in range(len(embs)):
            hidden = step(embs[i].squeeze(), hidden)
        
        output = self.hid2logits(hidden)
        return output, hidden

    def init_hidden(self):
        h0 = torch.zeros(self.batch_size, self.hidden_size)
        return h0
    
    def init_weights(self):
        initrange = 0.1
        lin_layers = [self.inlinear, self.hid2logits]
        em_layer = [self.embed]
     
        for layer in lin_layers+em_layer:
            layer.weight.data.uniform_(-initrange, initrange)
            if layer in lin_layers:
                layer.bias.data.fill_(0)

### Training loop

In [7]:
def training_loop(batch_size, num_epochs, model, loss_, optim, training_iter, dev_iter, train_eval_iter, is_lstm=False):
    step = 0
    epoch = 0
    total_batches = int(len(training_set) / batch_size)
    while epoch <= num_epochs:
        model.train()
        vectors, labels = get_batch(next(training_iter)) 
        vectors = torch.stack(vectors).squeeze() # batch_size, seq_len
        labels = torch.stack(labels).squeeze()
    
        model.zero_grad()
        
        if is_lstm:
            #assert "Not yet implemented."
            hidden, cell_state = model.init_hidden()
            output, hidden = model(vectors, hidden, cell_state)
        else:
            hidden = model.init_hidden()
            output, hidden = model(vectors, hidden)

        lossy = loss_(output, labels)
        lossy.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
        optim.step()
        
        if step % total_batches == 0:
            model.eval()
            if epoch % 20 == 0:
                # .item() extracts the Python float from the 0-dim loss tensor.
                print("Epoch %i; Step %i; Loss %f; Train acc: %f; Dev acc %f"
                      % (epoch, step, lossy.item(),
                         evaluate(model, train_eval_iter, is_lstm),
                         evaluate(model, dev_iter, is_lstm)))
            epoch += 1
        step += 1
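
The training loop calls `torch.nn.utils.clip_grad_norm_` after `backward()` to rescale the gradients whenever their overall norm exceeds a threshold (5.0 above). A minimal sketch of the effect on a single toy parameter, using a threshold of 1.0 chosen only for this example:

```python
import torch

# A toy parameter whose gradient has L2 norm 5 (a 3-4-5 triangle).
w = torch.nn.Parameter(torch.zeros(2))
w.grad = torch.tensor([3.0, 4.0])

# Returns the total gradient norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_([w], max_norm=1.0)

print(float(norm_before))   # approximately 5.0
print(w.grad.norm().item()) # approximately 1.0: same direction, rescaled length
```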

### Train model!

We've provided the hyperparameters you should use. We're also only evaluating on a part of the dev set to speed things along.
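
Because `eval_iter` keeps only full batches, slicing 500 examples with a batch size of 256 leaves a single batch of 256 examples to evaluate on. The sketch below reproduces that arithmetic with a simplified, unshuffled helper; the name `full_batches` is an assumption for this example, not part of the notebook.

```python
def full_batches(items, batch_size):
    """Split items into consecutive batches, dropping a trailing partial batch."""
    last_start = len(items) - batch_size
    return [items[s:s + batch_size] for s in range(0, last_start + 1, batch_size)]

batches = full_batches(list(range(500)), 256)
print(len(batches), len(batches[0]))  # 1 256
```

This is also why the reported accuracies are multiples of 1/256.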

In [8]:
# Hyperparameters 
input_size = vocab_size
num_labels = 2 
hidden_dim = 24
embedding_dim = 8
batch_size = 256
learning_rate = 0.2  # Not used below: the Adam optimizer is created with lr=0.001
#learning_rate = 0.0004
num_epochs = 500



# Build, initialize, and train model
rnn = ElmanRNN(vocab_size, embedding_dim, hidden_dim, num_labels, batch_size)
rnn.init_weights()

# Loss and Optimizer
loss = nn.CrossEntropyLoss()  
optimizer = torch.optim.Adam(rnn.parameters(), lr=0.001)

# Train the model
training_iter = data_iter(training_set, batch_size)
train_eval_iter = eval_iter(training_set[:500], batch_size)
dev_iter = eval_iter(dev_set[:500], batch_size)

training_loop(batch_size, num_epochs, rnn, loss, optimizer, training_iter, dev_iter, train_eval_iter, is_lstm=False)



Epoch 0; Step 0; Loss 0.693210; Train acc: 0.519531; Dev acc 0.511719
Epoch 20; Step 540; Loss 0.001854; Train acc: 1.000000; Dev acc 0.742188
Epoch 40; Step 1080; Loss 0.000520; Train acc: 1.000000; Dev acc 0.738281
Epoch 60; Step 1620; Loss 0.000450; Train acc: 1.000000; Dev acc 0.726562
Epoch 80; Step 2160; Loss 0.000089; Train acc: 1.000000; Dev acc 0.738281
Epoch 100; Step 2700; Loss 0.000032; Train acc: 1.000000; Dev acc 0.726562
Epoch 120; Step 3240; Loss 0.000021; Train acc: 1.000000; Dev acc 0.726562
Epoch 140; Step 3780; Loss 0.000015; Train acc: 1.000000; Dev acc 0.726562
Epoch 160; Step 4320; Loss 0.000009; Train acc: 1.000000; Dev acc 0.726562
Epoch 180; Step 4860; Loss 0.000007; Train acc: 1.000000; Dev acc 0.726562
Epoch 200; Step 5400; Loss 0.000005; Train acc: 1.000000; Dev acc 0.722656
Epoch 220; Step 5940; Loss 0.000003; Train acc: 1.000000; Dev acc 0.722656
Epoch 240; Step 6480; Loss 0.000003; Train acc: 1.000000; Dev acc 0.722656
Epoch 260; Step 7020; Loss 0.000002

<br>

## ☆ Implement LSTM! ☆

Now we'll implement an LSTM-RNN! For a quick refresher on LSTMs, have a look at [Olah's blog post](https://colah.github.io/posts/2015-08-Understanding-LSTMs/).
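
As a compact reminder, the standard LSTM update can be written as follows. The notation is chosen for this note: $e_t$ is the token embedding, $[h_{t-1}; e_t]$ is the concatenation of the previous hidden state and the embedding, $\sigma$ is the logistic sigmoid, and $\odot$ is element-wise multiplication. In the implementation below, `sig1`, `sig2`, `sig3`, and `tan1` appear to play the roles of the forget gate, input gate, output gate, and candidate cell state, respectively.

$$
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1};\, e_t] + b_f\right) \\
i_t &= \sigma\left(W_i\,[h_{t-1};\, e_t] + b_i\right) \\
o_t &= \sigma\left(W_o\,[h_{t-1};\, e_t] + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1};\, e_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$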

_This bears repeating: you should never actually implement modules like LSTM from scratch if the library has a clean, optimized implementation of it (as PyTorch does)._
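
For comparison, here is a minimal sketch of what a classifier built on PyTorch's own `nn.LSTM` might look like. The class name `LibraryLSTMClassifier` and the toy sizes are assumptions for this example; it is not the model trained in this notebook, and its initialization and hidden-state handling differ from the from-scratch version.

```python
import torch
import torch.nn as nn

class LibraryLSTMClassifier(nn.Module):
    """Hypothetical classifier that delegates the recurrence to nn.LSTM."""
    def __init__(self, vocab_size, embedding_dim, hidden_size, output_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        self.hid2logits = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x_emb = self.embed(x)            # (batch, seq_len, embedding_dim)
        _, (h_n, _) = self.lstm(x_emb)   # h_n: (num_layers, batch, hidden_size)
        return self.hid2logits(h_n[-1])  # (batch, output_size)

model = LibraryLSTMClassifier(vocab_size=100, embedding_dim=8,
                              hidden_size=24, output_size=2)
logits = model(torch.randint(0, 100, (4, 20)))  # 4 sentences of 20 token ids
print(tuple(logits.shape))  # (4, 2)
```

When no initial state is passed, `nn.LSTM` starts from zero hidden and cell states, so no `init_hidden` call is needed here.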

In [11]:
import numpy as np

def sigmoid(x): 
    return 1. / (1 + np.exp(-x))

class LSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, output_size, batch_size):
        super(LSTM, self).__init__()
        
        self.embed = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.embedding_size = embedding_dim
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.batch_size = batch_size
               
        """
        YOUR CODE GOES HERE
        
        """ 
        self.sig1 = nn.Linear(embedding_dim + hidden_size, hidden_size)
        self.sig2 = nn.Linear(embedding_dim + hidden_size, hidden_size)
        self.sig3 = nn.Linear(embedding_dim + hidden_size, hidden_size)
        self.tan1 = nn.Linear(embedding_dim + hidden_size, hidden_size)
        
        
        self.hid2logits = nn.Linear(hidden_size, output_size)
        self.init_weights()
        
    
    
    def rand_arr(self, a, b, *args): 
        np.random.seed(0)
        return np.random.rand(*args) * (b - a) + a
    
    def forward(self, x, hidden, c):
        x_emb = self.embed(x)
        embs = torch.chunk(x_emb, x_emb.size()[1], 1)       
        
        """
        YOUR CODE GOES HERE
        """    
        def step(emb, hid, c):
            combined = torch.cat((hid, emb), 1)
            
            sig1 = torch.sigmoid(self.sig1(combined))
            sig2 = torch.sigmoid(self.sig2(combined))
            sig3 = torch.sigmoid(self.sig3(combined))
            tan1 = torch.tanh(self.tan1(combined))
            
            x1 = torch.mul(c, sig1)
            x2 = torch.mul(sig2, tan1)
            c = x1 + x2
            hid = torch.mul(sig3, torch.tanh(c))
            
            return hid, c

        for i in range(len(embs)):
            hidden, c = step(embs[i].squeeze(), hidden, c)
        
        

        output = self.hid2logits(hidden)
        return output, hidden

    def init_hidden(self):
        h0 = torch.zeros(self.batch_size, self.hidden_size)
        c0 = torch.zeros(self.batch_size, self.hidden_size)
        return h0, c0
    
    def init_weights(self):
        initrange = 0.1
        # The following is a placeholder, you need to add some code.
        lin_layers = [self.hid2logits]
        em_layer = [self.embed]
     
        for layer in lin_layers+em_layer:
            layer.weight.data.uniform_(-initrange, initrange)
            if layer in lin_layers:
                layer.bias.data.fill_(0)

### Train LSTM

Let's train the LSTM-RNN and see how its performance compares with the Elman-RNN.

In [12]:
# Hyperparameters 
input_size = vocab_size
num_labels = 2
hidden_dim = 24
embedding_dim = 8
batch_size = 256
learning_rate = 1.5
num_epochs = 300


# Build, initialize, and train model
lstm = LSTM(vocab_size, embedding_dim, hidden_dim, num_labels, batch_size)
lstm.init_weights()

# Loss and Optimizer
loss = nn.CrossEntropyLoss()  
optimizer = torch.optim.SGD(lstm.parameters(), lr=learning_rate)

# Train the model
training_iter = data_iter(training_set, batch_size)
train_eval_iter = eval_iter(training_set[0:500], batch_size)
dev_iter = eval_iter(dev_set[:500], batch_size)

training_loop(batch_size, num_epochs, lstm, loss, optimizer, training_iter, dev_iter, train_eval_iter, is_lstm=True)



Epoch 0; Step 0; Loss 0.694011; Train acc: 0.531250; Dev acc 0.531250
Epoch 20; Step 540; Loss 0.692793; Train acc: 0.531250; Dev acc 0.531250
Epoch 40; Step 1080; Loss 0.692907; Train acc: 0.531250; Dev acc 0.527344
Epoch 60; Step 1620; Loss 0.700164; Train acc: 0.515625; Dev acc 0.496094
Epoch 80; Step 2160; Loss 0.615319; Train acc: 0.675781; Dev acc 0.550781
Epoch 100; Step 2700; Loss 0.446279; Train acc: 0.800781; Dev acc 0.605469
Epoch 120; Step 3240; Loss 0.035270; Train acc: 0.980469; Dev acc 0.703125
Epoch 140; Step 3780; Loss 0.049518; Train acc: 0.996094; Dev acc 0.742188
Epoch 160; Step 4320; Loss 0.017907; Train acc: 0.996094; Dev acc 0.746094
Epoch 180; Step 4860; Loss 0.001254; Train acc: 0.996094; Dev acc 0.730469
Epoch 200; Step 5400; Loss 0.022535; Train acc: 0.996094; Dev acc 0.726562
Epoch 220; Step 5940; Loss 0.001470; Train acc: 0.996094; Dev acc 0.714844
Epoch 240; Step 6480; Loss 0.000775; Train acc: 1.000000; Dev acc 0.707031
Epoch 260; Step 7020; Loss 0.000370