# CS6493 - Tutorial 3

## RNNs for Sentiment Analysis on the IMDB dataset

In this tutorial, we show how to use recurrent neural networks (RNNs) for sentiment analysis. Sentiment analysis is a well-studied topic in the NLP community. In general, it can be divided into the following sub-topics:

- **Sentiment analysis with binary classification**: detect whether a sentence is positive or negative;
- **Multi-class sentiment analysis**: detect the sentiment of a sentence, such as `happy`, `angry`, `excited`, etc.;
- **Aspect-based sentiment analysis**: a fine-grained sentiment task which focuses on the sentiment polarity of a specific aspect;
- **Multi-modal sentiment analysis**: detect sentiments based on the text, audio and images.

Here, we give a simple demonstration of binary classification sentiment analysis on movie reviews, using the [IMDB dataset](http://ai.stanford.edu/~amaas/data/sentiment/).

**Reminder:** Please check your experimental environment first. We use `torch==1.10.0`, `torchtext`, and `datasets` in this demo.

We will use the following techniques in this tutorial:

- pre-trained word embeddings, i.e., `GloVe`;
- packed padded sequence;
- `bidirectional LSTM`;
- regularization, i.e., `Dropout`;
- the optimizer `Adam`.

In [None]:
!pip install --upgrade datasets



In [None]:
import sys
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torchtext
import datasets
from torchtext.vocab import build_vocab_from_iterator
import tqdm
import random

In [None]:
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

<torch._C.Generator at 0x7fa4bfa82930>

## Preparing data

First, we use `datasets.load_dataset()` to load the IMDB dataset. The released dataset has no validation split, so we will later split off part of the training set for validation.

In [None]:
train_data, test_data = datasets.load_dataset('imdb', split=['train', 'test'])




Alternatively, we can download the dataset manually, upload it to our server, and load it with `datasets.load_from_disk()`.

In [None]:
# train_data = datasets.load_from_disk('imdb')['train']
# test_data = datasets.load_from_disk('imdb')['test']

We truncate long sequences to `max_length=256` tokens and filter out rare words with `min_freq=5`. We will use `pack_padded_sequence()`, which makes the RNN process only the non-padded elements of each sequence. To do this, the RNN has to know the actual length of each sequence, so we record it during the pre-processing stage.
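The truncation and rare-word filtering described above can be sketched in plain Python. This is only an illustration on toy data: `build_vocab` and `truncate` are hypothetical helpers, not the `torchtext` functions used in the next cell.

```python
from collections import Counter

def build_vocab(token_lists, min_freq, specials):
    """Hypothetical helper: specials first, then tokens seen at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = list(specials) + kept
    return {tok: i for i, tok in enumerate(itos)}

def truncate(tokens, max_length):
    """Crop a token list and record the length that the RNN will later need."""
    tokens = tokens[:max_length]
    return tokens, len(tokens)

reviews = [
    "a great great film".split(),
    "a dull film".split(),
    "a great cast".split(),
]
vocab_demo = build_vocab(reviews, min_freq=2, specials=["<unk>", "<pad>"])
tokens, length = truncate(reviews[0], max_length=3)
# Tokens missing from the vocabulary fall back to the <unk> index.
ids = [vocab_demo.get(tok, vocab_demo["<unk>"]) for tok in tokens]
dull_id = vocab_demo.get("dull", vocab_demo["<unk>"])
```

Here `dull` and `cast` occur only once, so they are dropped from the toy vocabulary and map to `<unk>`.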

In [None]:
max_length = 256
min_freq = 5
special_tokens = ['<unk>', '<pad>']

tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
def tokenize_data(example):
    tokens = tokenizer(example['text'])[:max_length]
    length = len(tokens)
    example['tokens'] = tokens
    example['length'] = length
    return example

train_data = train_data.map(tokenize_data)
test_data = test_data.map(tokenize_data)

def yield_tokens(data_iter):
    for item in data_iter:
        yield item['tokens']

vocab = build_vocab_from_iterator(yield_tokens(train_data), min_freq=min_freq, specials=special_tokens)



In [None]:
unk_index = vocab['<unk>'] # unknown token
pad_index = vocab['<pad>'] # pad token
vocab.set_default_index(unk_index)

Next, we split the original training data into training and validation sets.

In [None]:
splited_ = train_data.train_test_split(test_size = 0.1)
train_data, valid_data = splited_['train'], splited_['test']



In [None]:
train_data

Dataset({
    features: ['text', 'label', 'tokens', 'length'],
    num_rows: 22500
})

In [None]:
def vectorize_data(example):
    ids = [vocab[token] for token in example['tokens']]
    example['ids'] = ids
    return example

train_data = train_data.map(vectorize_data)
valid_data = valid_data.map(vectorize_data)
test_data = test_data.map(vectorize_data)



## Build the model

We use the Long Short-Term Memory (LSTM) network, which uses multiple gates to control the flow of information into and out of its memory cell. Some key points:

- `bidirectional`: defaults to `False`. Set it to `True` to use a bi-LSTM;
- Explicitly initializing the LSTM weights is encouraged, as it can speed up convergence;
- To alleviate overfitting, we add `dropout` to the model. Dropout works by randomly zeroing out neurons in a layer during a forward pass;
- In batch training, we use `pack_padded_sequence()` so that the model only processes the non-padded elements of each sequence. With packing, `hidden` and `cell` both come from the last non-padded element of each sequence. Without it, they would come from the last element of the padded sequence, which is probably a pad token;
- The `lengths` argument of `pack_padded_sequence()` must be a CPU tensor.
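The effect of packing on the final hidden state can be checked with a small experiment. This is a minimal sketch on random toy tensors; the sizes are arbitrary assumptions and unrelated to the model below.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)

# Two sequences with true lengths 5 and 2; the second is zero-padded to length 5.
batch = torch.randn(2, 5, 4)
batch[1, 2:] = 0.0
lengths = torch.tensor([5, 2])  # kept on the CPU

packed = nn.utils.rnn.pack_padded_sequence(batch, lengths, batch_first=True,
                                           enforce_sorted=False)
with torch.no_grad():
    _, (h_packed, _) = lstm(packed)          # stops at each true length
    _, (h_padded, _) = lstm(batch)           # also steps through the padding
    _, (h_short, _) = lstm(batch[1:2, :2])   # the short sequence on its own

# With packing, the short sequence ends in the same state as when run alone.
matches_packed = torch.allclose(h_packed[0, 1], h_short[0, 0], atol=1e-6)
# Without packing, the state keeps changing while the LSTM reads pad positions.
matches_padded = torch.allclose(h_padded[0, 1], h_short[0, 0], atol=1e-6)
```

Only the packed run should reproduce the state of the unpadded short sequence.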

In [None]:
class LSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_class, n_layers, bidirectional,
                 dropout_rate, pad_index):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_index)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, bidirectional=bidirectional,
                            dropout=dropout_rate, batch_first=True)
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, num_class)
        self.dropout = nn.Dropout(dropout_rate)
        self.initialize_weights()
    
    def initialize_weights(self):
        nn.init.xavier_normal_(self.embedding.weight)
        nn.init.xavier_normal_(self.fc.weight)
        nn.init.zeros_(self.fc.bias)
        for name, param in self.lstm.named_parameters():
            if "bias" in name:
                nn.init.zeros_(param)
            elif "weight" in name:
                nn.init.orthogonal_(param)
        
    def forward(self, ids, length):
        # ids = [batch size, seq len]
        # length = [batch size]
        embedded = self.dropout(self.embedding(ids))
        # embedded = [batch size, seq len, embedding dim]
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, length, batch_first=True, 
                                                            enforce_sorted=False)
        packed_output, (hidden, cell) = self.lstm(packed_embedded)
        # hidden = [n layers * n directions, batch size, hidden dim]
        # cell = [n layers * n directions, batch size, hidden dim]
        output, output_length = nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True)
        # output = [batch size, seq len, hidden dim * n directions]
        if self.lstm.bidirectional:
            hidden = self.dropout(torch.cat([hidden[-1], hidden[-2]], dim=-1))
            # hidden = [batch size, hidden dim * 2]
        else:
            hidden = self.dropout(hidden[-1])
            # hidden = [batch size, hidden dim]
        prediction = self.fc(hidden)
        # prediction = [batch size, output dim]
        return prediction
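The indexing `hidden[-2]` and `hidden[-1]` in `forward` relies on how PyTorch lays out the final hidden states of a bidirectional LSTM. The sketch below checks that layout on toy tensors of full length, without packing; all sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim, n_layers = 3, 2
lstm = nn.LSTM(4, hidden_dim, n_layers, bidirectional=True, batch_first=True)
x = torch.randn(2, 5, 4)  # [batch size, seq len, input dim]
with torch.no_grad():
    output, (hidden, cell) = lstm(x)

# hidden: [n layers * 2, batch size, hidden dim]; the last two rows are the top layer.
forward_last = hidden[-2]   # forward direction, after the final time step
backward_last = hidden[-1]  # backward direction, after reading back to the first step

# output concatenates [forward, backward] features of the top layer at every step.
fwd_ok = torch.allclose(forward_last, output[:, -1, :hidden_dim])
bwd_ok = torch.allclose(backward_last, output[:, 0, hidden_dim:])
```

Concatenating these two rows gives one `hidden dim * 2` feature vector per example, which matches the input size of the final linear layer.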

Now, we create a model instance with specific arguments.

In [None]:
# model parameters
vocab_size = len(vocab)
embedding_dim = 300
hidden_dim = 300
num_class = 2 # negative/positive
n_layers = 2
bidirectional = True # use bi-lstm
dropout_rate = 0.5
device = "cuda" if torch.cuda.is_available() else "cpu"  # call the function; the bare attribute is always truthy

model = LSTM(vocab_size, embedding_dim, hidden_dim, num_class, n_layers, bidirectional, dropout_rate, 
             pad_index)

Let us take a look at the total parameters of the model.

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 11,079,902 trainable parameters
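As a rough cross-check, the reported total can be reproduced by hand from the layer shapes. The sketch below assumes the standard `torch.nn.LSTM` parameter layout (four gates, two bias vectors per direction and layer); `lstm_params` is a hypothetical helper written for this illustration.

```python
def lstm_params(input_dim, hidden_dim, n_layers, bidirectional):
    """Trainable parameters of an LSTM without projection, from its weight shapes."""
    n_dir = 2 if bidirectional else 1
    total = 0
    for layer in range(n_layers):
        in_dim = input_dim if layer == 0 else hidden_dim * n_dir
        # weight_ih (4H x in), weight_hh (4H x H), bias_ih (4H), bias_hh (4H)
        per_direction = 4 * hidden_dim * (in_dim + hidden_dim) + 2 * 4 * hidden_dim
        total += n_dir * per_direction
    return total

embedding_dim = hidden_dim = 300
lstm_total = lstm_params(embedding_dim, hidden_dim, n_layers=2, bidirectional=True)
fc_total = (2 * hidden_dim) * 2 + 2  # Linear(600 -> 2): weights plus biases
# Whatever remains of the reported total must belong to the embedding table.
embedding_total = 11_079_902 - lstm_total - fc_total
implied_vocab_size, remainder = divmod(embedding_total, embedding_dim)
```

Under these assumptions the LSTM accounts for 3,609,600 parameters and the classifier for 1,202, leaving 7,469,100 for the embedding table, which corresponds to a vocabulary of 24,897 entries. Most of the parameters therefore sit in the embedding layer.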


Instead of randomly initializing the word embeddings, we use the pre-trained `glove.6B.300d` word embeddings, whose dimension matches `embedding_dim = 300`. Alternatively, you can try other word embeddings listed on [this page](https://pytorch.org/text/stable/vocab.html#pretrained-word-embeddings).

In [None]:
glove_embed = torchtext.vocab.GloVe(name="6B", dim=300) # "6B": trained on a 6-billion-token corpus; word dimension: 300.
pretrained_word_embeddings = glove_embed.get_vecs_by_tokens(vocab.get_itos())
model.embedding.weight.data = pretrained_word_embeddings
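As an alternative to assigning `.weight.data` directly, an embedding layer can be built from a matrix of pre-trained vectors with `nn.Embedding.from_pretrained`. This is a sketch of that swapped-in approach, not what the cell above does; the random matrix is a stand-in for the GloVe vectors and `pad_index = 1` is an assumed value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pad_index = 1  # assumed position of '<pad>' in the vocabulary
# Stand-in for glove_embed.get_vecs_by_tokens(vocab.get_itos()): [vocab size, dim]
pretrained = torch.randn(6, 4)

# freeze=False keeps the vectors trainable, so they are fine-tuned with the model.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False,
                                         padding_idx=pad_index)

ids = torch.tensor([[2, 3, 1]])
embedded = embedding(ids)  # [batch size, seq len, dim]
```

Setting `freeze=True` instead would keep the pre-trained vectors fixed during training, which is one way to reduce the number of trainable parameters.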

## Train the model

We use the `Adam` optimizer to update the model parameters. `Adam` adapts the learning rate for each parameter, giving parameters that are updated more frequently lower learning rates and parameters that are updated infrequently higher learning rates. More information about Adam (and other optimizers) can be found [here](http://ruder.io/optimizing-gradient-descent/index.html).
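To make the per-parameter scaling concrete, here is a minimal scalar sketch of one Adam update, following the standard update rule with bias correction. `adam_step` is a hypothetical helper for illustration; the default values of `beta1`, `beta2` and `eps` are assumed to match the `torch.optim.Adam` defaults, and `lr` is the value used in this tutorial.

```python
import math

def adam_step(param, grad, m, v, t, lr=5e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter at step t (1-based)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Two parameters whose gradients differ by four orders of magnitude.
small, _, _ = adam_step(1.0, grad=0.01, m=0.0, v=0.0, t=1)
large, _, _ = adam_step(1.0, grad=100.0, m=0.0, v=0.0, t=1)
step_small = 1.0 - small
step_large = 1.0 - large
```

On the first step both parameters move by roughly `lr`, regardless of gradient magnitude, because each update is divided by that parameter's own gradient scale.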

In [None]:
def collate_fn(batch):
    batch_ids = [torch.tensor(i['ids']) for i in batch]
    batch_ids = nn.utils.rnn.pad_sequence(batch_ids, padding_value=pad_index, batch_first=True)
    batch_length = [torch.tensor(i['length']) for i in batch]
    batch_length = torch.stack(batch_length)
    batch_label = [torch.tensor(i['label']) for i in batch]
    batch_label = torch.stack(batch_label)
    batch = {'ids': batch_ids,
             'length': batch_length,
             'label': batch_label}
    return batch

In [None]:
def train(dataloader, model, criterion, optimizer, device):

    model.train()
    epoch_losses = []
    epoch_accs = []

    for batch in tqdm.tqdm(dataloader, desc='training...'):
        ids = batch['ids'].to(device)
        length = batch['length']
        label = batch['label'].to(device)
        prediction = model(ids, length)
        loss = criterion(prediction, label)
        accuracy = get_accuracy(prediction, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_losses.append(loss.item())
        epoch_accs.append(accuracy.item())

    return epoch_losses, epoch_accs

def evaluate(dataloader, model, criterion, device):
    
    model.eval()
    epoch_losses = []
    epoch_accs = []

    with torch.no_grad():
        for batch in tqdm.tqdm(dataloader, desc='evaluating...'):
            ids = batch['ids'].to(device)
            length = batch['length']
            label = batch['label'].to(device)
            prediction = model(ids, length)
            loss = criterion(prediction, label)
            accuracy = get_accuracy(prediction, label)
            epoch_losses.append(loss.item())
            epoch_accs.append(accuracy.item())

    return epoch_losses, epoch_accs

def get_accuracy(prediction, label):
    batch_size, _ = prediction.shape
    predicted_classes = prediction.argmax(dim=-1)
    correct_predictions = predicted_classes.eq(label).sum()
    accuracy = correct_predictions / batch_size
    return accuracy

During the training stage, we save the model that performs best on the validation set.

In [None]:
# training hyper-parameters
n_epochs = 10
lr = 5e-4
batch_size = 256
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
# train_data = to_map_style_dataset(train_data)
# valid_data = to_map_style_dataset(valid_data)
# test_data = to_map_style_dataset(test_data)

train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, collate_fn=collate_fn, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid_data, batch_size=batch_size, collate_fn=collate_fn, shuffle=False)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, collate_fn=collate_fn, shuffle=False)

best_valid_loss = float('inf')

train_losses = []
train_accs = []
valid_losses = []
valid_accs = []

model = model.to(device)

for epoch in range(n_epochs):

    train_loss, train_acc = train(train_loader, model, criterion, optimizer, device)
    valid_loss, valid_acc = evaluate(valid_loader, model, criterion, device)

    train_losses.extend(train_loss)
    train_accs.extend(train_acc)
    valid_losses.extend(valid_loss)
    valid_accs.extend(valid_acc)
    
    epoch_train_loss = np.mean(train_loss)
    epoch_train_acc = np.mean(train_acc)
    epoch_valid_loss = np.mean(valid_loss)
    epoch_valid_acc = np.mean(valid_acc)
    
    if epoch_valid_loss < best_valid_loss:
        best_valid_loss = epoch_valid_loss
        torch.save(model.state_dict(), 'lstm.pt')
    
    print(f'epoch: {epoch+1}')
    print(f'train_loss: {epoch_train_loss:.3f}, train_acc: {epoch_train_acc:.3f}')
    print(f'valid_loss: {epoch_valid_loss:.3f}, valid_acc: {epoch_valid_acc:.3f}')

training...: 100%|██████████| 88/88 [00:44<00:00,  1.97it/s]
evaluating...: 100%|██████████| 10/10 [00:02<00:00,  4.75it/s]


epoch: 1
train_loss: 0.593, train_acc: 0.673
valid_loss: 0.426, valid_acc: 0.798


training...: 100%|██████████| 88/88 [00:44<00:00,  1.99it/s]
evaluating...: 100%|██████████| 10/10 [00:02<00:00,  4.36it/s]


epoch: 2
train_loss: 0.456, train_acc: 0.789
valid_loss: 0.461, valid_acc: 0.785


training...: 100%|██████████| 88/88 [00:45<00:00,  1.93it/s]
evaluating...: 100%|██████████| 10/10 [00:02<00:00,  4.67it/s]


epoch: 3
train_loss: 0.449, train_acc: 0.800
valid_loss: 0.355, valid_acc: 0.848


training...: 100%|██████████| 88/88 [00:45<00:00,  1.92it/s]
evaluating...: 100%|██████████| 10/10 [00:02<00:00,  4.65it/s]


epoch: 4
train_loss: 0.351, train_acc: 0.851
valid_loss: 0.361, valid_acc: 0.854


training...: 100%|██████████| 88/88 [00:46<00:00,  1.89it/s]
evaluating...: 100%|██████████| 10/10 [00:02<00:00,  4.71it/s]


epoch: 5
train_loss: 0.415, train_acc: 0.820
valid_loss: 0.589, valid_acc: 0.617


training...: 100%|██████████| 88/88 [00:46<00:00,  1.90it/s]
evaluating...: 100%|██████████| 10/10 [00:02<00:00,  4.63it/s]


epoch: 6
train_loss: 0.522, train_acc: 0.722
valid_loss: 0.357, valid_acc: 0.857


training...: 100%|██████████| 88/88 [00:46<00:00,  1.90it/s]
evaluating...: 100%|██████████| 10/10 [00:02<00:00,  4.71it/s]


epoch: 7
train_loss: 0.346, train_acc: 0.853
valid_loss: 0.331, valid_acc: 0.865


training...: 100%|██████████| 88/88 [00:46<00:00,  1.89it/s]
evaluating...: 100%|██████████| 10/10 [00:02<00:00,  4.67it/s]


epoch: 8
train_loss: 0.315, train_acc: 0.869
valid_loss: 0.362, valid_acc: 0.847


training...: 100%|██████████| 88/88 [00:46<00:00,  1.88it/s]
evaluating...: 100%|██████████| 10/10 [00:02<00:00,  4.72it/s]


epoch: 9
train_loss: 0.277, train_acc: 0.887
valid_loss: 0.293, valid_acc: 0.882


training...: 100%|██████████| 88/88 [00:46<00:00,  1.88it/s]
evaluating...: 100%|██████████| 10/10 [00:02<00:00,  4.71it/s]


epoch: 10
train_loss: 0.235, train_acc: 0.909
valid_loss: 0.290, valid_acc: 0.881
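The checkpoint rule in the training loop saves the model whenever the validation loss improves. Replaying that rule on the validation losses logged above shows which epochs would have overwritten `lstm.pt`. This is a sketch only: the values are the three-decimal figures from the log, so differences below that precision are not visible.

```python
# Validation losses for epochs 1..10, copied from the training log above.
valid_losses_by_epoch = [0.426, 0.461, 0.355, 0.361, 0.589,
                         0.357, 0.331, 0.362, 0.293, 0.290]

best = float("inf")
saved_at = []
for epoch, loss in enumerate(valid_losses_by_epoch, start=1):
    if loss < best:  # same condition as in the training loop
        best = loss
        saved_at.append(epoch)
```

With these logged values the checkpoint is written at epochs 1, 3, 7, 9 and 10, so the saved file holds the weights from the final epoch.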


## Evaluate on the test set

Now, we load the checkpoint that performed best on the validation set and evaluate it on the test set.

In [None]:
model.load_state_dict(torch.load('lstm.pt'))

test_loss, test_acc = evaluate(test_loader, model, criterion, device)

epoch_test_loss = np.mean(test_loss)
epoch_test_acc = np.mean(test_acc)

print(f'test_loss: {epoch_test_loss:.3f}, test_acc: {epoch_test_acc:.3f}')

evaluating...: 100%|██████████| 98/98 [00:21<00:00,  4.62it/s]

test_loss: 0.331, test_acc: 0.861





## Test on a random example

In [None]:
def predict_sentiment(text, model, tokenizer, vocab, device):
    model.eval()  # make sure dropout is disabled at inference time
    tokens = tokenizer(text)
    ids = [vocab[t] for t in tokens]
    length = torch.LongTensor([len(ids)])  # lengths stay on the CPU
    tensor = torch.LongTensor(ids).unsqueeze(dim=0).to(device)
    with torch.no_grad():  # no gradients are needed for prediction
        prediction = model(tensor, length).squeeze(dim=0)
    probability = torch.softmax(prediction, dim=-1)
    predicted_class = prediction.argmax(dim=-1).item()
    predicted_probability = probability[predicted_class].item()
    return predicted_class, predicted_probability

In [None]:
text = "This film is terrible!"

predicted_label, prob = predict_sentiment(text, model, tokenizer, vocab, device)
print(f"'{text}' is {'positive' if predicted_label else 'negative'} with probability {prob}.")

'This film is terrible!' is negative with probability 0.9612252712249756.


## Practice

Please try more experimental settings and hyper-parameters to obtain better performance. You can consider the following aspects:

- Comparing the training and validation accuracy shows that the model still overfits the training set, so try to alleviate the overfitting.
- In our example, we use `LSTM` as the backbone. Please try some other RNNs, like `GRU`.
- Please try more pre-trained word embeddings.
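As a starting point for the second suggestion, the sketch below swaps the LSTM for a GRU. It is a simplified, hypothetical variant (a single bidirectional layer, no dropout or custom initialization) with toy sizes, not a tuned replacement for the model above. The main API difference is that `nn.GRU` returns only a hidden state, with no cell state.

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    """Hypothetical single-layer bidirectional GRU classifier."""
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_class, pad_index):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_index)
        self.gru = nn.GRU(embedding_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden_dim * 2, num_class)

    def forward(self, ids, length):
        # ids = [batch size, seq len], length = [batch size] (CPU tensor)
        packed = nn.utils.rnn.pack_padded_sequence(
            self.embedding(ids), length, batch_first=True, enforce_sorted=False)
        _, hidden = self.gru(packed)  # a GRU returns no cell state
        # hidden[-2] / hidden[-1]: forward / backward final states
        return self.fc(torch.cat([hidden[-2], hidden[-1]], dim=-1))

torch.manual_seed(0)
demo_model = GRUClassifier(vocab_size=20, embedding_dim=8, hidden_dim=6,
                           num_class=2, pad_index=1)
ids = torch.tensor([[4, 5, 6, 7], [8, 9, 1, 1]])  # second row padded with index 1
logits = demo_model(ids, torch.tensor([4, 2]))    # [batch size, num_class]
```

The `train`, `evaluate` and `collate_fn` functions from this tutorial should work unchanged with such a model, since its `forward` takes the same `ids` and `length` arguments.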

In [None]:
!nvidia-smi

Mon Jan 30 07:00:04 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P0    26W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces