# Recurrent Neural Network (Vanilla)

## Sentiment Analysis

We'll create a sentiment analysis model with PyTorch and TorchText on movie reviews using the **IMDb** dataset.

We'll be using a recurrent neural network (RNN), which reads a sequence of words one at a time and outputs a hidden state for each word (each word is sometimes called a step). The hidden state from one word is fed back in together with the next word in the sentence, until the final word has been fed into the RNN. This final hidden state is then used to predict the sentiment of the sentence.
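
To make the recurrence concrete, here is a minimal pure-Python sketch of one vanilla RNN step, $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$. The helper names `rnn_step` and `run_rnn`, the toy weights, and the vector sizes are all made up for illustration; a real model learns its weights and works on batches of tensors.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One step of a vanilla RNN: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    h_new = []
    for i in range(len(h_prev)):
        total = b[i]
        total += sum(w_x[i][j] * x[j] for j in range(len(x)))
        total += sum(w_h[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new

def run_rnn(inputs, w_x, w_h, b):
    """Feed a sequence of vectors through the RNN and return the final hidden state."""
    h = [0.0] * len(b)  # the initial hidden state is all zeros
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
    return h

# Hypothetical toy weights: 2-dimensional inputs, 3-dimensional hidden state.
w_x = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
w_h = [[0.1, 0.0, 0.2], [0.0, 0.1, -0.1], [0.3, -0.2, 0.1]]
b = [0.0, 0.1, -0.1]

sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three stand-in "word" vectors
final_hidden = run_rnn(sentence, w_x, w_h, b)
print(final_hidden)
```

Only `final_hidden` would be passed on to the classifier; the intermediate hidden states exist solely to carry information forward through the sentence.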

## Table of contents

1. **[Preparing the dataset](#import_corpus)**

    - [Vocabulary word-ID mappings](#word2id)    
    
    - [Getting train, test and validation batches](#ttv)
    
    
2. **[Loss Function](#loss)**

3. **[RNN Model](#rnn)**

4. **[Train the model](#train)**

In [1]:
import random
import torch
from torchtext import data
from torchtext import datasets
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

## Preparing the dataset

In [2]:
SEED = 1234

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)

TEXT = data.Field(tokenize='spacy', batch_first=True)
LABEL = data.LabelField(tensor_type=torch.FloatTensor)

In [3]:
train, test = datasets.IMDB.splits(TEXT, LABEL)

In [4]:
print('len(train):', len(train))
print('len(test):', len(test))

len(train): 25000
len(test): 25000


In [5]:
print(vars(train[0]))

{'text': ['I', "'m", 'grateful', 'to', 'Cesar', 'Montano', 'and', 'his', 'crew', 'in', 'reviving', 'the', 'once', '-', 'moribund', 'Visayan', 'film', 'understorey', '.', '"', 'Panaghoy', '"', 'is', 'hopefully', 'the', 'forerunner', 'of', 'a', 'resurgence', 'in', 'this', 'vernacular', '(', 'that', 'claims', 'more', 'speakers', 'than', 'Tagalog', ')', '.', 'The', 'dialect', 'and', 'lifestyle', 'details', 'are', 'accurately', 'reminiscent', 'of', 'this', 'region', 'of', 'the', 'Philippines', '.', 'Downside', ':', 'the', 'corny', 'and', 'stilted', 'acting', 'of', 'the', 'American', 'antagonist', '.', 'The', 'other', 'item', 'that', 'I', 'did', "n't", 'appreciate', 'was', 'the', 'lack', 'of', 'authenticity', 'in', 'the', '"', 'period', '"', 'costume', 'of', 'the', 'same', 'character', ',', 'and', 'above', 'all', ',', 'his', 'bright', 'red', 'kit', '-', 'car', 'that', 'I', 'suppose', 'was', 'meant', 'to', 'pass', 'for', 'a', '1930s', 'roadster', '.', 'Without', 'those', 'small', 'yet', 'glar

By default, `split()` divides the data 70/30, so we use it to carve a validation set out of the training set.

In [6]:
# random.seed() returns None, so seed first and then pass the generator state
random.seed(SEED)
train, valid = train.split(random_state=random.getstate())

In [7]:
print('train size:', len(train))
print('validation size:', len(valid))
print('test size:', len(test))

train size: 17500
validation size: 7500
test size: 25000


### Vocabulary word-ID mappings

The number of unique words in our training set is over 100,000, which means that our one-hot vectors would have over 100,000 dimensions! This would make training slow, and the model might not fit on your GPU (if you're using one).

There are two ways of effectively cutting down our vocabulary: we can either keep only the top $n$ most common words or ignore words that appear fewer than $m$ times. We'll do the former, keeping only the top 25,000 words.

What do we do with words that appear in examples but that we have cut from the vocabulary? We replace them with a special unknown token, `<unk>`. For example, if the sentence was "This film is great and I love it" but the word "love" was not in the vocabulary, it would become "This film is great and I `<unk>` it".
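
The truncation and replacement described above can be sketched in plain Python. This is only an illustration of the idea, not how TorchText implements it; `build_vocab` and `replace_unknown` are hypothetical helpers and the two-sentence corpus is invented.

```python
from collections import Counter

def build_vocab(tokenized_sentences, max_size):
    """Return the set of the max_size most frequent tokens in the corpus."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(max_size)}

def replace_unknown(tokens, vocab, unk_token="<unk>"):
    """Replace every token that is not in the vocabulary with the unknown token."""
    return [tok if tok in vocab else unk_token for tok in tokens]

corpus = [
    "This film is great and I love it".split(),
    "This film is great and I hate it".split(),
]

# Seven tokens occur twice; "love" and "hate" occur once and fall outside the top 7.
vocab = build_vocab(corpus, max_size=7)
replaced = replace_unknown(corpus[0], vocab)
print(" ".join(replaced))
```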

In [8]:
TEXT.build_vocab(train, max_size=25000)
LABEL.build_vocab(train)

In [9]:
print('len(TEXT.vocab):', len(TEXT.vocab))
print('len(LABEL.vocab):', len(LABEL.vocab))

len(TEXT.vocab): 25002
len(LABEL.vocab): 2


Why is the vocab size 25002 and not 25000? One of the additional tokens is the **unk** token and the other is a **pad** token.

When we feed sentences into our model, we feed a batch of them at a time, and all sentences in a batch need to be the same length. To ensure this, any sentence shorter than the longest one in the batch is padded with the pad token.
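
As a rough sketch of what padding does to a batch (the `pad_batch` helper and the example sentences are hypothetical; TorchText handles this for us when it builds batches):

```python
def pad_batch(batch, pad_token="<pad>"):
    """Pad every tokenized sentence to the length of the longest one in the batch."""
    max_len = max(len(sent) for sent in batch)
    return [sent + [pad_token] * (max_len - len(sent)) for sent in batch]

batch = [
    ["I", "loved", "it"],
    ["This", "film", "is", "great", "fun"],
    ["Terrible"],
]

padded = pad_batch(batch)
for sent in padded:
    print(sent)
```

After padding, every row has the same number of tokens, so the batch can be stored as a single rectangular tensor.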

We can also view the most common words in the vocabulary.

In [10]:
print(TEXT.vocab.freqs.most_common(20))

[('the', 203291), (',', 192395), ('.', 164688), ('and', 109470), ('a', 108891), ('of', 100849), ('to', 93550), ('is', 76441), ('in', 61513), ('I', 54165), ('it', 53764), ('that', 49578), ('"', 43898), ("'s", 43008), ('this', 42327), ('-', 37097), ('/><br', 35564), ('was', 34970), ('as', 30477), ('movie', 29771)]


We can also inspect the vocabulary directly using either the `stoi` (string to int) dictionary or the `itos` (int to string) list.

In [11]:
print(TEXT.vocab.itos[0])

<unk>


In [12]:
print(TEXT.vocab.itos[566])
print(TEXT.vocab.itos[16])
print(TEXT.vocab.itos[24])

How
this
film
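
The two mappings are inverses of each other. The following toy sketch mimics them with a plain list and a `defaultdict`; the six-word vocabulary is invented, and placing `<pad>` at index 1 is an assumption made for this example.

```python
from collections import defaultdict

# Toy vocabulary: index 0 is <unk>, matching the output above;
# <pad> at index 1 is assumed here for illustration.
itos = ["<unk>", "<pad>", "this", "film", "is", "great"]

# Unknown strings fall back to index 0, the <unk> token.
stoi = defaultdict(int, {tok: idx for idx, tok in enumerate(itos)})

def numericalize(tokens):
    """Map a list of tokens to their integer IDs."""
    return [stoi[tok] for tok in tokens]

ids = numericalize(["this", "film", "is", "awesome"])
print(ids)
print([itos[i] for i in ids])
```

The out-of-vocabulary word "awesome" maps to index 0, so converting the IDs back to strings yields `<unk>` in its place.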


We can also check the labels, ensuring 0 is for negative and 1 is for positive.

In [13]:
print(LABEL.vocab.stoi)

defaultdict(<function _default_unk_index at 0x7feb9824e730>, {'neg': 0, 'pos': 1})


### Getting train, test and validation batches

The final step of preparing the data is creating the iterators.

**BucketIterator** first sorts the examples using the `sort_key` (here, the length of each sentence) and then partitions them into buckets. Each time the iterator is called, it returns a batch of examples drawn from the same bucket, so the examples in a batch have similar lengths and the amount of padding is minimized.

In [14]:
BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train, valid, test), 
    batch_size=BATCH_SIZE, 
    sort_key=lambda x: len(x.text), 
    repeat=False)
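
To see why grouping by length helps, the sketch below counts padding tokens for batches built in arrival order versus batches built after sorting by length. It is a simplified stand-in for the idea only, not `BucketIterator`'s actual algorithm; `make_batches`, `count_padding` and the sentence lengths are made up.

```python
def make_batches(examples, batch_size):
    """Split a list of examples into consecutive batches."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]

def count_padding(batches):
    """Total number of pad tokens needed to make each batch rectangular."""
    total = 0
    for batch in batches:
        longest = max(len(ex) for ex in batch)
        total += sum(longest - len(ex) for ex in batch)
    return total

# Hypothetical sentence lengths: a mix of very short and very long reviews.
lengths = [3, 50, 4, 48, 5, 52, 2, 49]
examples = [["tok"] * n for n in lengths]

unsorted_batches = make_batches(examples, batch_size=2)
sorted_batches = make_batches(sorted(examples, key=len), batch_size=2)

print("padding without sorting:", count_padding(unsorted_batches))  # 185
print("padding with sorting:   ", count_padding(sorted_batches))    # 5
```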

## Loss Function

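The loss function here is **Binary Cross Entropy**. For the $n$-th example it is

$$ \ell_n = - w_n \left[ t_n \cdot \log x_n
+ (1 - t_n) \cdot \log (1 - x_n) \right] $$

where $x_n$ is the predicted probability of the positive class, $t_n \in \{0, 1\}$ is the target label, and $w_n$ is an optional per-example weight (1 by default).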
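
A small pure-Python sketch of this formula for a single example follows; the function name `binary_cross_entropy` and the sample probabilities are chosen for illustration. It also shows a useful reference point: a prediction of 0.5 gives a loss of $\ln 2 \approx 0.693$ regardless of the target.

```python
import math

def binary_cross_entropy(x, t, w=1.0):
    """Loss for one example: x is the predicted probability, t the 0/1 target."""
    return -w * (t * math.log(x) + (1 - t) * math.log(1 - x))

confident_right = binary_cross_entropy(0.9, 1)   # small loss
uninformed = binary_cross_entropy(0.5, 1)        # ln 2, the "coin flip" loss
confident_wrong = binary_cross_entropy(0.1, 1)   # large loss

print(f"{confident_right:.3f} {uninformed:.3f} {confident_wrong:.3f}")
```

Keeping the $\ln 2$ value in mind makes it easy to tell from a training log whether a binary classifier is doing better than guessing.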

## RNN model

In [92]:
input_dim = len(TEXT.vocab)
embedding_dim = 100
hidden_dim = 256
output_dim = 1
n_epochs = 5

#### Architecture

<img style="float: left;" src="rnn.png">

In [93]:
class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super(RNN, self).__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.linear_out = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, inputs):
        embedded = self.embedding(inputs)
        output, hidden = self.rnn(embedded) # output: (batch size, sent len, hid dim)
                                            # hidden: (1, batch size, hid dim)
        
        return self.linear_out(hidden.squeeze(0)) # squeeze(0) drops the leading dim, giving (batch size, hid dim)
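
The tensor shapes can be checked on random data. This is a standalone sketch with arbitrary sizes (4 sentences of 7 tokens, 10-dimensional embeddings, 16 hidden units) rather than the notebook's actual dimensions.

```python
import torch
import torch.nn as nn

batch_size, sent_len, emb_dim, hid_dim = 4, 7, 10, 16

rnn = nn.RNN(emb_dim, hid_dim, batch_first=True)
embedded = torch.randn(batch_size, sent_len, emb_dim)  # stand-in for embedding output

output, hidden = rnn(embedded)
print(output.shape)  # torch.Size([4, 7, 16])
print(hidden.shape)  # torch.Size([1, 4, 16])

# For a single-layer, unidirectional RNN the final hidden state is the
# output at the last time step.
last_step = output[:, -1, :]
print(torch.allclose(last_step, hidden.squeeze(0)))
```

Note that `batch_first=True` only affects the layout of the input and of `output`; `hidden` keeps the layer dimension first, which is why `squeeze(0)` is applied before the linear layer.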

#### Instantiate model, optimizer and loss

In [94]:
model = RNN(input_dim, embedding_dim, hidden_dim, output_dim)
optimizer = optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()

Using `.to(device)`, we can place the model and the criterion on the GPU when one is available, falling back to the CPU otherwise.

In [95]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = model.to(device)
criterion = criterion.to(device)

The `binary_accuracy` function first passes the predictions through a sigmoid, squashing the values into the range 0 to 1, and then rounds them to the nearest integer. Any value above 0.5 is rounded to 1 (positive sentiment) and any value below 0.5 to 0 (negative sentiment).

In [96]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))  # torch.sigmoid replaces the deprecated F.sigmoid
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum()/len(correct)
    return acc
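
Here is a worked example of the same computation in plain Python, using five made-up raw model outputs (logits). The names `sigmoid` and `binary_accuracy_py` are illustrative only.

```python
import math

def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_accuracy_py(logits, labels):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1.0 if sigmoid(z) > 0.5 else 0.0 for z in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

logits = [2.3, -1.2, 0.4, -0.7, 3.1]   # hypothetical model outputs
labels = [1.0, 0.0, 0.0, 0.0, 1.0]

acc = binary_accuracy_py(logits, labels)
print(acc)  # 0.8: only the third example is misclassified
```

Because `sigmoid(z) > 0.5` exactly when `z > 0`, the predicted class is determined by the sign of the raw output.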

## Train the model

The train function iterates over all examples, a batch at a time.

`model.train()` puts the model in "training mode", in which layers such as dropout and batch normalization use their training-time behavior. Although this model uses neither, calling it is good practice.

In [97]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        predictions = model(batch.text).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

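`model.eval()` puts the model in "evaluation mode", which switches dropout and batch normalization to their inference behavior (dropout is disabled and batch normalization uses its running statistics). We also wrap the loop in `torch.no_grad()`, since no gradients are needed during evaluation.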

In [98]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
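
The difference between the two modes is easiest to see on a layer that actually behaves differently in each. The sketch below uses a standalone `nn.Dropout` layer, which is not part of this notebook's model, purely as an illustration.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()             # training mode: elements are randomly zeroed
y_train = drop(x)        # surviving elements are scaled by 1 / (1 - p) = 2

drop.eval()              # evaluation mode: dropout is a no-op
y_eval = drop(x)

print(sorted(set(y_train.tolist())))  # values drawn from {0.0, 2.0}
print(torch.equal(y_eval, x))
```

Calling `model.train()` or `model.eval()` on a whole model simply propagates this mode switch to every submodule.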

We then train the model through multiple epochs, an epoch being a complete pass through all examples in the split.

In [99]:
for epoch in range(n_epochs):

    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc*100:.2f}%, Val. Loss: {valid_loss:.3f}, Val. Acc: {valid_acc*100:.2f}%')


Epoch: 01, Train Loss: 0.696, Train Acc: 50.38%, Val. Loss: 0.694, Val. Acc: 48.83%
Epoch: 02, Train Loss: 0.693, Train Acc: 50.33%, Val. Loss: 0.694, Val. Acc: 48.78%
Epoch: 03, Train Loss: 0.693, Train Acc: 50.06%, Val. Loss: 0.694, Val. Acc: 49.31%
Epoch: 04, Train Loss: 0.693, Train Acc: 50.20%, Val. Loss: 0.694, Val. Acc: 49.23%
Epoch: 05, Train Loss: 0.693, Train Acc: 50.44%, Val. Loss: 0.694, Val. Acc: 51.76%


Finally, we compute the metrics we actually care about: the test loss and accuracy.

In [100]:
test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc*100:.2f}%')


Test Loss: 0.682, Test Acc: 58.15%


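You may have noticed that the loss barely decreases and the accuracy is poor: a loss of about 0.693 is $\ln 2$, which is what a model predicting 0.5 for every example would score, and the validation accuracy stays near the 50% chance level. This is due to several issues with the model, which we'll improve in the next notebook.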