# BI-LSTM with pre-trained embeddings, Adam and Dropout

## Sentiment Analysis

We'll create a sentiment analysis model with PyTorch and TorchText on movie reviews using the **IMDb** dataset.

We'll be using a recurrent neural network (RNN), which reads a sequence of words and outputs a hidden state for each word (sometimes called a step). The hidden state from one word is fed back into the RNN together with the next word in the sentence, until the final word has been processed. This final hidden state is then used to predict the sentiment of the sentence.

## LSTM

We use a variant of the RNN architecture called Long Short-Term Memory (LSTM). Why is an LSTM better than a standard RNN? The hidden state can be thought of as a "memory" of the words seen by the model. A standard RNN is difficult to train because the gradient decays exponentially along the sequence, causing the RNN to "forget" what happened earlier in the sequence. LSTMs have an extra recurrent state called a cell, which can be thought of as the "memory" of the LSTM and can retain information for many time steps. LSTMs also use multiple gates, which control the flow of information into and out of the memory.

<img style="float: left;" src="LSTM.png">
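To make the roles of the cell and the gates concrete, here is a minimal sketch of a single LSTM step on scalar values, written in plain Python rather than PyTorch. The function `lstm_step`, the layout of `params` and the hand-picked weights are assumptions made for illustration; a real `nn.LSTM` uses learned weight matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params maps a gate name to (w_x, w_h, bias)."""
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre_activation('input'))        # how much new information to write
    f = sigmoid(pre_activation('forget'))       # how much of the old cell state to keep
    o = sigmoid(pre_activation('output'))       # how much of the cell state to expose
    g = math.tanh(pre_activation('candidate'))  # candidate values to write
    c = f * c_prev + i * g                      # updated cell state (the "memory")
    h = o * math.tanh(c)                        # updated hidden state
    return h, c

# Hypothetical weights: forget gate close to 1, input gate close to 0.
params = {
    'input': (0.0, 0.0, -10.0),
    'forget': (0.0, 0.0, 10.0),
    'output': (0.0, 0.0, 0.0),
    'candidate': (1.0, 0.0, 0.0),
}

h, c = 0.0, 1.0  # start with something "stored" in the cell
for x in [0.3, -1.2, 0.7] * 10:  # 30 time steps
    h, c = lstm_step(x, h, c, params)

print('cell state after 30 steps:', c)
```

With the forget gate almost fully open and the input gate almost closed, the cell state stays close to its starting value of 1.0 across all 30 steps, which is the "remembering" behaviour described above.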

## Bidirectional RNN

The concept behind a bidirectional RNN is simple. As well as having an RNN process the words in the sentence from the first to the last, we have a second RNN process them from the last to the first. PyTorch handles both directions inside a single layer: the returned final hidden state, hidden, holds the forward RNN's hidden state after the last word of the sentence and the backward RNN's hidden state after the first word, both of which are the final hidden states of their respective RNNs. We concatenate the two before making the prediction.

<img style="float: left;" src="BI-RNN.png">
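The idea can be sketched without PyTorch using a toy scalar RNN. The function `run_rnn`, its weights and the input values are hypothetical and chosen only to show which hidden states end up in the final representation.

```python
import math

def run_rnn(inputs, w_in=0.5, w_rec=0.8):
    """Toy scalar RNN: returns the hidden state after each input."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

sentence = [1.0, -2.0, 0.5, 3.0]  # stand-in for four embedded words

forward_states = run_rnn(sentence)               # reads first -> last
backward_states = run_rnn(sentence[::-1])[::-1]  # reads last -> first, re-aligned to word order

# The forward RNN finishes on the last word, the backward RNN on the first word.
final_hidden = [forward_states[-1], backward_states[0]]
print(final_hidden)
```

Each entry of `final_hidden` has seen the whole sentence, but from opposite ends, which is why the two are combined before the prediction.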

## Regularization

Although we've added improvements to our model, each one adds additional parameters. Without going into too much detail about overfitting, the more parameters you have in your model, the higher the probability that you'll overfit (have a low train error but a high validation/test error). To combat this, we use regularization. More specifically, we use a method of regularization called dropout. Dropout works by randomly dropping out (setting to 0) neurons during a forward pass. The probability that each neuron is dropped out is set by a hyperparameter, and each neuron with dropout applied is considered independently. One theory about why dropout works is that a model with parameters dropped out can be seen as a "weaker" model (one with fewer parameters). The predictions from all these "weaker" models (one for each forward pass) get averaged together in the parameters of the model. Thus, your one model can be thought of as an ensemble of weaker models, none of which are over-parameterized and thus should not overfit.

<img style="float: left;" src="dropout.png">
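Below is a minimal plain-Python sketch of dropout. The helper name `dropout` and the seed are assumptions for illustration. It follows the "inverted dropout" convention, which `nn.Dropout` also uses: surviving values are scaled by 1 / (1 - p) during training so that nothing needs to change at evaluation time.

```python
import random

def dropout(values, p, training=True, rng=random):
    """Zero each value independently with probability p (inverted dropout)."""
    if not training or p == 0.0:
        return list(values)  # dropout is a no-op in evaluation mode
    keep_prob = 1.0 - p
    # Survivors are scaled by 1 / keep_prob so the expected activation is unchanged.
    return [v / keep_prob if rng.random() >= p else 0.0 for v in values]

rng = random.Random(1234)  # hypothetical seed, for a repeatable demo
activations = [1.0] * 1000

train_out = dropout(activations, p=0.5, rng=rng)
eval_out = dropout(activations, p=0.5, training=False)

print('fraction dropped in training:', train_out.count(0.0) / len(train_out))
print('mean in training:', sum(train_out) / len(train_out))
print('mean in evaluation:', sum(eval_out) / len(eval_out))
```

Roughly half of the activations are zeroed on each training pass, and a different half each time, while evaluation leaves the activations untouched.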

## Table of contents

1. **[Preparing the dataset](#import_corpus)**

    - [Vocabulary word-ID mappings](#word2id)    
    
    - [Getting train, test and validation batches](#ttv)
    
    
2. **[Loss Function](#loss)**

3. **[LSTM Model](#lstm)**

4. **[Train the model](#train)**

In [1]:
import random
from sklearn.metrics import roc_auc_score
import torch
from torchtext import data
from torchtext import datasets
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

<a id='import_corpus'></a>
## Preparing the dataset

In [2]:
SEED = 1234

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)

TEXT = data.Field(tokenize='spacy', batch_first=True)
LABEL = data.LabelField(tensor_type=torch.FloatTensor)



In [3]:
train, test = datasets.IMDB.splits(TEXT, LABEL)

In [4]:
print('len(train):', len(train))
print('len(test):', len(test))

len(train): 25000
len(test): 25000


In [5]:
print(vars(train[0]))

{'text': ['I', "'m", 'grateful', 'to', 'Cesar', 'Montano', 'and', 'his', 'crew', 'in', 'reviving', 'the', 'once', '-', 'moribund', 'Visayan', 'film', 'understorey', '.', '"', 'Panaghoy', '"', 'is', 'hopefully', 'the', 'forerunner', 'of', 'a', 'resurgence', 'in', 'this', 'vernacular', '(', 'that', 'claims', 'more', 'speakers', 'than', 'Tagalog', ')', '.', 'The', 'dialect', 'and', 'lifestyle', 'details', 'are', 'accurately', 'reminiscent', 'of', 'this', 'region', 'of', 'the', 'Philippines', '.', 'Downside', ':', 'the', 'corny', 'and', 'stilted', 'acting', 'of', 'the', 'American', 'antagonist', '.', 'The', 'other', 'item', 'that', 'I', 'did', "n't", 'appreciate', 'was', 'the', 'lack', 'of', 'authenticity', 'in', 'the', '"', 'period', '"', 'costume', 'of', 'the', 'same', 'character', ',', 'and', 'above', 'all', ',', 'his', 'bright', 'red', 'kit', '-', 'car', 'that', 'I', 'suppose', 'was', 'meant', 'to', 'pass', 'for', 'a', '1930s', 'roadster', '.', 'Without', 'those', 'small', 'yet', 'glar

By default, split() divides the examples 70/30. We use it to carve a validation set out of the training set.

In [6]:
random.seed(SEED)  # random.seed() returns None, so seed first and pass the state explicitly
train, valid = train.split(random_state=random.getstate())

In [7]:
print('train size:', len(train))
print('validation size:', len(valid))
print('test size:', len(test))

train size: 17500
validation size: 7500
test size: 25000


<a id='word2id'></a>
### Vocabulary word-ID mappings

The number of unique words in our training set is over 100,000, which means that our one-hot vectors would have over 100,000 dimensions! This would make training slow, and the model might not fit on your GPU (if you're using one).

There are two ways of effectively cutting down our vocabulary: we can either take only the top $n$ most common words or ignore words that appear fewer than $n$ times. We'll do the former, keeping only the top 25,000 words.

What do we do with words that appear in examples but have been cut from the vocabulary? We replace them with a special unknown token, `<unk>`. For example, if the sentence was "This film is great and I love it" but the word "love" was not in the vocabulary, it would become "This film is great and I `<unk>` it".
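As a rough illustration of what capping the vocabulary does, here is a plain-Python sketch. The helpers `build_vocab` and `numericalize` and the three-sentence corpus are hypothetical; TorchText's own implementation differs in its details.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size):
    """Keep the max_size most frequent tokens, plus two special tokens."""
    counts = Counter(token for text in tokenized_texts for token in text)
    itos = ['<unk>', '<pad>'] + [token for token, _ in counts.most_common(max_size)]
    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to IDs, sending out-of-vocabulary tokens to <unk>."""
    return [stoi.get(token, stoi['<unk>']) for token in tokens]

corpus = [
    ['this', 'film', 'is', 'great'],
    ['this', 'film', 'is', 'bad'],
    ['I', 'love', 'this', 'film'],
]

itos, stoi = build_vocab(corpus, max_size=3)
print(itos)
print(numericalize(['I', 'love', 'this', 'film'], stoi))
```

Here "I" and "love" fall outside the three most common words, so both map to the index of `<unk>`, while "this" and "film" keep their own IDs.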

In [8]:
TEXT.build_vocab(train, max_size=25000, vectors="glove.6B.100d")
LABEL.build_vocab(train)

In [9]:
print('len(TEXT.vocab):', len(TEXT.vocab))
print('len(LABEL.vocab):', len(LABEL.vocab))

len(TEXT.vocab): 25002
len(LABEL.vocab): 2


Why is the vocab size 25002 and not 25000? One of the additional tokens is the **unk** token and the other is a **pad** token.

When we feed sentences into our model, we feed a batch of them at a time, i.e. more than one at a time, and all sentences in the batch need to be the same length. To ensure this, any sentence shorter than the longest one in the batch is padded with the pad token.
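Padding can be sketched in a few lines of plain Python. The helper `pad_batch` is hypothetical; TorchText performs the equivalent step internally when it builds a batch.

```python
def pad_batch(batch, pad_token='<pad>'):
    """Right-pad every sentence to the length of the longest one in the batch."""
    max_len = max(len(sentence) for sentence in batch)
    return [sentence + [pad_token] * (max_len - len(sentence)) for sentence in batch]

batch = [
    ['this', 'film', 'is', 'great'],
    ['terrible'],
    ['I', 'love', 'it'],
]

for sentence in pad_batch(batch):
    print(sentence)
```

After padding, every sentence in the batch has four tokens, so the batch can be stored as a single rectangular tensor.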

We can also view the most common words in the vocabulary.

In [10]:
print(TEXT.vocab.freqs.most_common(20))

[('the', 203291), (',', 192395), ('.', 164688), ('and', 109470), ('a', 108891), ('of', 100849), ('to', 93550), ('is', 76441), ('in', 61513), ('I', 54165), ('it', 53764), ('that', 49578), ('"', 43898), ("'s", 43008), ('this', 42327), ('-', 37097), ('/><br', 35564), ('was', 34970), ('as', 30477), ('movie', 29771)]


We can also inspect the vocabulary directly using either the stoi (string to int) dictionary or the itos (int to string) list.

In [11]:
print(TEXT.vocab.itos[0])

<unk>


In [12]:
print(TEXT.vocab.itos[566])
print(TEXT.vocab.itos[16])
print(TEXT.vocab.itos[24])

How
this
film


We can also check the labels, ensuring 0 is for negative and 1 is for positive.

In [13]:
print(LABEL.vocab.stoi)

defaultdict(<function _default_unk_index at 0x7fedc33ec268>, {'neg': 0, 'pos': 1})


<a id='ttv'></a>
### Getting train, test and validation batches

The final step of preparing the data is creating the iterators.

**BucketIterator** first sorts the examples using the sort_key (here, the length of the sentences) and then partitions them into buckets. When the iterator is called, it returns a batch of examples from the same bucket, so each batch contains examples of a similar length, which minimizes the amount of padding.

In [14]:
BATCH_SIZE = 32

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train, valid, test), 
    batch_size=BATCH_SIZE, 
    sort_key=lambda x: len(x.text), 
    repeat=False)
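To see why grouping by length helps, the sketch below counts pad tokens for batches drawn in the original order versus batches drawn after sorting by length. It is a simplification: the helpers `make_batches` and `padding_needed` and the sentence lengths are made up, and the real BucketIterator also shuffles within and across buckets.

```python
def make_batches(examples, batch_size):
    """Split a list of examples into consecutive batches."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]

def padding_needed(batches):
    """Total number of pad tokens needed to make every batch rectangular."""
    total = 0
    for batch in batches:
        longest = max(len(example) for example in batch)
        total += sum(longest - len(example) for example in batch)
    return total

lengths = [3, 50, 4, 48, 5, 52, 2, 49]  # hypothetical sentence lengths
examples = [['tok'] * n for n in lengths]

unsorted_batches = make_batches(examples, batch_size=2)
sorted_batches = make_batches(sorted(examples, key=len), batch_size=2)

print('padding without sorting:', padding_needed(unsorted_batches))
print('padding with sorting:', padding_needed(sorted_batches))
```

On this toy data, sorting by length cuts the padding from 185 tokens to 5, so far less computation is spent on tokens that carry no information.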

<a id='loss'></a>
## Loss Function

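The loss function here is **Binary Cross Entropy**. For example $n$, it is

$$ \ell_n = - w_n \left[ t_n \cdot \log x_n
+ (1 - t_n) \cdot \log (1 - x_n) \right] $$

where $x_n$ is the predicted probability of the positive class, $t_n \in \{0, 1\}$ is the target label and $w_n$ is an optional per-example weight. We use nn.BCEWithLogitsLoss, which applies the sigmoid to the model's raw output to obtain $x_n$ before computing this loss.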
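As a worked example of the formula, the plain-Python sketch below evaluates the loss for a single prediction with $w_n = 1$. The helper names and the raw model output of 2.0 are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(x, t, w=1.0):
    """Loss for one example: x is a probability in (0, 1), t is 0 or 1."""
    return -w * (t * math.log(x) + (1 - t) * math.log(1 - x))

logit = 2.0         # hypothetical raw model output
x = sigmoid(logit)  # predicted probability of the positive class

loss_if_positive = binary_cross_entropy(x, t=1)
loss_if_negative = binary_cross_entropy(x, t=0)

print(f'probability: {x:.4f}')
print(f'loss when the label is 1: {loss_if_positive:.4f}')
print(f'loss when the label is 0: {loss_if_negative:.4f}')
```

A confident prediction costs little when it is right (about 0.127 here) and a lot when it is wrong (about 2.127), which is what pushes the model towards well-calibrated outputs.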

<a id='lstm'></a>
## LSTM model

In [15]:
input_dim = len(TEXT.vocab)
embedding_dim = 100
hidden_dim = 256
output_dim = 1
n_epochs = 5
dropout = 0.5

In [16]:
class LSTM(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, dropout):
        super(LSTM, self).__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        # nn.LSTM's own dropout argument only acts between stacked layers, so it is
        # omitted for this single-layer LSTM (passing it only triggers a warning)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, 
                            bidirectional=True,
                            batch_first=True)
        self.linear_out = nn.Linear(hidden_dim*2, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, inputs):
        embedded = self.embedding(inputs) # (batch size, sent len, emb dim)
        # output: (batch size, sent len, hid dim * num directions)
        # hidden: (num layers * num directions, batch size, hid dim)
        output, (hidden, cell) = self.lstm(embedded)
        
        # concatenate the final forward (hidden[-2]) and backward (hidden[-1]) states
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)) # (batch size, hid dim * 2)
        
        return self.linear_out(hidden) # (batch size, output dim)

#### Instantiate model, optimizer and loss

In [17]:
model = LSTM(input_dim, embedding_dim, hidden_dim, output_dim, dropout)
optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()


Next, we copy the pre-trained GloVe word embeddings we loaded earlier into the embedding layer of our model.

In [18]:
pretrained_embeddings = TEXT.vocab.vectors

# Check embeddings size
print(pretrained_embeddings.shape) # (vocab size, embedding dim)

# Replace the initial weights of the embedding layer with the pre-trained embeddings
model.embedding.weight.data.copy_(pretrained_embeddings)

torch.Size([25002, 100])


tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [-0.0709, -0.2084,  0.1463,  ...,  0.2397,  0.9118,  0.5148],
        [ 0.1532,  0.0293,  0.2073,  ...,  0.1447,  0.1620,  0.8675],
        [-0.4321, -0.3363,  0.3794,  ...,  0.5315, -0.1863,  0.3735]])
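The first two rows of the tensor above are all zeros; these presumably belong to the unk and pad tokens, which sit at indices 0 and 1 and have no GloVe vector. The plain-Python sketch below shows the idea of lining up pre-trained vectors with vocabulary indices. The helper `build_embedding_matrix` and the three-dimensional vectors are made up for illustration.

```python
def build_embedding_matrix(itos, pretrained, dim):
    """Row i holds the vector for itos[i]; tokens without a vector get zeros."""
    return [list(pretrained.get(token, [0.0] * dim)) for token in itos]

itos = ['<unk>', '<pad>', 'the', 'film']  # toy vocabulary
pretrained = {                            # made-up "pre-trained" vectors
    'the': [0.1, 0.2, 0.3],
    'film': [0.4, 0.5, 0.6],
}

matrix = build_embedding_matrix(itos, pretrained, dim=3)
for token, row in zip(itos, matrix):
    print(token, row)
```

The resulting matrix has one row per vocabulary entry and one column per embedding dimension, matching the (vocab size, embedding dim) shape printed above.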

Using .to, we place the model and the criterion on the GPU if one is available, falling back to the CPU otherwise.

In [19]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = model.to(device)
criterion = criterion.to(device)

This function first feeds the predictions through a sigmoid, squashing the values between 0 and 1, and then rounds them to the nearest integer. Any value greater than 0.5 is rounded to 1 (a positive sentiment) and any value below 0.5 to 0 (a negative sentiment).

In [20]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds)) # F.sigmoid is deprecated
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum()/len(correct)
    
    return acc

In [21]:
def get_auc(preds, y):
    return roc_auc_score(y.cpu().detach().numpy(), preds.cpu().detach().numpy())
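One caveat: AUC is undefined when all labels belong to a single class, and roc_auc_score raises a ValueError in that case, which can happen when it is called on a single batch. The sketch below is a plain-Python illustration of the pairwise definition of AUC with a guard for that case. The function `roc_auc` and the toy batches are hypothetical and are not a replacement for scikit-learn's implementation.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when only one class is present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Two toy batches of (labels, raw model outputs); each holds a single class.
batches = [([1, 1], [2.0, 1.5]), ([0, 0], [-1.0, 1.7])]

per_batch = [roc_auc(labels, scores) for labels, scores in batches]

# Pooling every batch before scoring gives a well-defined AUC.
all_labels = [y for labels, _ in batches for y in labels]
all_scores = [s for _, scores in batches for s in scores]
pooled = roc_auc(all_labels, all_scores)

print('per batch:', per_batch)
print('pooled:', pooled)
```

Collecting predictions and labels over the whole split and scoring them once is one way to avoid the single-class problem. It also yields the AUC of the full split, which is not the same quantity as the average of per-batch AUCs.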

<a id='train'></a>
## Train the model

The train function iterates over all examples, a batch at a time.

model.train() puts the model in "training mode", which turns on dropout and batch normalization. This model uses dropout, so the call is needed for the dropout layer to be active during training.

In [22]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    epoch_auc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        predictions = model(batch.text).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)
        auc = get_auc(predictions, batch.label)
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        epoch_auc += auc
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator), epoch_auc / len(iterator)

model.eval() puts the model in "evaluation mode", which turns off dropout and batch normalization.

In [23]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    epoch_auc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)
            auc = get_auc(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
            epoch_auc += auc
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator), epoch_auc / len(iterator)

We then train the model through multiple epochs, an epoch being a complete pass through all examples in the split.

In [24]:
for epoch in range(n_epochs):

    train_loss, train_acc, train_auc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc, valid_auc = evaluate(model, valid_iterator, criterion)
    
    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc*100:.2f}%, Train AUC: {train_auc*100:.2f}%, Val. Loss: {valid_loss:.3f}, Val. Acc: {valid_acc*100:.2f}%, Val. AUC: {valid_auc*100:.2f}%')



Epoch: 01, Train Loss: 0.659, Train Acc: 60.39%, Train AUC: 67.30%, Val. Loss: 0.636, Val. Acc: 67.53%, Val. AUC: 78.57%
Epoch: 02, Train Loss: 0.414, Train Acc: 81.89%, Train AUC: 89.66%, Val. Loss: 0.322, Val. Acc: 86.91%, Val. AUC: 93.97%
Epoch: 03, Train Loss: 0.209, Train Acc: 92.36%, Train AUC: 97.52%, Val. Loss: 0.293, Val. Acc: 88.42%, Val. AUC: 95.04%
Epoch: 04, Train Loss: 0.111, Train Acc: 96.56%, Train AUC: 99.17%, Val. Loss: 0.369, Val. Acc: 87.99%, Val. AUC: 94.89%
Epoch: 05, Train Loss: 0.056, Train Acc: 98.50%, Train AUC: 99.70%, Val. Loss: 0.467, Val. Acc: 86.97%, Val. AUC: 94.14%


Finally, we compute the metrics we actually care about: the test loss, accuracy and AUC.

In [25]:
test_loss, test_acc, test_auc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc*100:.2f}%, Test AUC: {test_auc*100:.2f}%')



ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.

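The ValueError above is raised because get_auc is evaluated on every batch, and roc_auc_score is undefined for a batch whose labels all belong to one class.

Adam greatly improves the model's performance.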