# COMP4167 Natural Language Processing
# Practical III
# Transformers with Sentiment Analysis

In this notebook we will be using the transformer model, BERT (Bidirectional Encoder Representations from Transformers) model to perform sentiment analysis of movie review dataset IMDB.

In this notebook we will:
- Use the [transformers library](https://github.com/huggingface/transformers) to get pre-trained BERT model and use it as our embedding layer. 
- We will freeze (not train) the transformer and only train the model which learns from the representations produced by the transformer. 
- We will use a multi-layer bi-directional GRU, however any model can learn from these representations.

## Preparing Data

The transformer has been pre-trained with a specific vocabulary, which means we need to train with the exact same vocabulary and also tokenize our data in the same way that the transformer did when it was initially trained.

The transformers library has tokenizers for each of the transformer models provided. We are using the BERT model which ignores casing. We get this by loading the pre-trained `bert-base-uncased` tokenizer. This is one of the smaller BERT models but still contains 110 million parameters.

In [2]:
!pip install torchtext==0.17.1
!pip install torchdata==0.7.1

Collecting torchtext==0.17.1
  Obtaining dependency information for torchtext==0.17.1 from https://files.pythonhosted.org/packages/b2/cd/add336798f3ebfecaadf1f5aed211216ae9ecb6af3d82e62ecc0af9bbdd8/torchtext-0.17.1-cp311-cp311-win_amd64.whl.metadata
  Downloading torchtext-0.17.1-cp311-cp311-win_amd64.whl.metadata (7.6 kB)
Collecting torch==2.2.1 (from torchtext==0.17.1)
  Obtaining dependency information for torch==2.2.1 from https://files.pythonhosted.org/packages/59/1f/4975d1ab3ed2244053876321ef65bc02935daed67da76c6e7d65900772a3/torch-2.2.1-cp311-cp311-win_amd64.whl.metadata
  Downloading torch-2.2.1-cp311-cp311-win_amd64.whl.metadata (26 kB)
Collecting typing-extensions>=4.8.0 (from torch==2.2.1->torchtext==0.17.1)
  Obtaining dependency information for typing-extensions>=4.8.0 from https://files.pythonhosted.org/packages/f9/de/dc04a3ea60b22624b51c703a84bbe0184abcd1d0b9bc8074b5d6b7ab90bb/typing_extensions-4.10.0-py3-none-any.whl.metadata
  Downloading typing_extensions-4.10.0-py3-n

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.2.0 requires torch==2.2.0, but you have torch 2.2.1 which is incompatible.
torchvision 0.17.0 requires torch==2.2.0, but you have torch 2.2.1 which is incompatible.




In [1]:
import warnings
warnings.filterwarnings('ignore')

import torch
from tqdm import tqdm
import random
import numpy as np

#Set the random seeds for deterministic results.
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [2]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

The `tokenizer` has a `vocab` attribute which contains the actual vocabulary we will be using. We can check how many tokens are in it by checking its length.

In [3]:
len(tokenizer.vocab)

30522

Using the tokenizer is as simple as calling `tokenizer.tokenize` on a string. This will tokenize and lower case the data in a way that is consistent with the pre-trained transformer model.

In [4]:
tokens = tokenizer.tokenize('Hello WORLD how ARE yoU?')

print(tokens)

['hello', 'world', 'how', 'are', 'you', '?']


We can numericalize tokens using our vocabulary using `tokenizer.convert_tokens_to_ids`.

In [5]:
indexes = tokenizer.convert_tokens_to_ids(tokens)

print(indexes)

[7592, 2088, 2129, 2024, 2017, 1029]


The transformer was also pre-trained with special tokens to mark the beginning and end of the sentence, detailed [here](https://huggingface.co/transformers/model_doc/bert.html#transformers.BertModel). As well as a standard padding and unknown token. We can also get these from the tokenizer.

- CLS = Classification token
- SEP = Seperating token
- Pad = Padding token
- UNK = Unknown token


In [6]:
init_token = tokenizer.cls_token
eos_token = tokenizer.sep_token
pad_token = tokenizer.pad_token
unk_token = tokenizer.unk_token

print(init_token, eos_token, pad_token, unk_token)

[CLS] [SEP] [PAD] [UNK]


We can get the indexes of the special tokens by converting them using the vocabulary...

In [7]:
init_token_idx = tokenizer.convert_tokens_to_ids(init_token)
eos_token_idx = tokenizer.convert_tokens_to_ids(eos_token)
pad_token_idx = tokenizer.convert_tokens_to_ids(pad_token)
unk_token_idx = tokenizer.convert_tokens_to_ids(unk_token)

print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)

101 102 0 100


...or by explicitly getting them from the tokenizer.

In [8]:
init_token_idx = tokenizer.cls_token_id
eos_token_idx = tokenizer.sep_token_id
pad_token_idx = tokenizer.pad_token_id
unk_token_idx = tokenizer.unk_token_id

print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)

101 102 0 100


Another thing we need to handle is that the model was pre-trained on sequences with a defined maximum length - it does not know how to handle sequences longer than it has been trained on. We can get the maximum length of these input sizes by checking the `max_model_input_sizes` for the version of the transformer we want to use. In this case, it is 512 tokens.

In [9]:
max_input_length = tokenizer.max_model_input_sizes['bert-base-uncased']

print(max_input_length)

512


We will define a utility function to tokenize sentences based on the tokenizer provided by the transformers model.

In [10]:
def tokenize_and_cut(sentence):
    tokens = tokenizer.tokenize(sentence) 
    tokens = tokens[:max_input_length-2]
    return tokens

## Setup data pre-processing pipeline 

Now we define how the data will be loaded and converted into tokens. This will make use of the `tokenize_and_cut` functionality we defined above and will also add the special tokens - making note that we are defining them to be their index value and not their string value, i.e. `100` instead of `[UNK]`.

## Data split
Now we load the data and create the validation splits.
IMDB is available within torchtext.

In [11]:
!pip install portalocker
from torchtext import datasets
from torchtext.data.functional import to_map_style_dataset

train_dataset  = to_map_style_dataset(datasets.IMDB(split='train'))
test_dataset   = to_map_style_dataset(datasets.IMDB(split='test'))

# Split train into training set and validation set
train_size = int(0.8 * len(train_dataset))
test_size = len(train_dataset) - train_size
train_dataset, valid_dataset = torch.utils.data.random_split(train_dataset, [train_size, test_size])




## Prepare iterators
We create training and testing iterators. To avoid memory problems during training we will use a small batch size.

There are 2 key arguments we pass to the `DataLoader`:
- `batch_sampler` - this function defines how we split the dataset into batches. In our example we're using bucket sampling which groups texts of similar lengths together, minimising the amount of padding we need to do and speeding up training time.
- `collate_fn` - this is a function which takes a batch of data (label,text) and converts this into a input for the model. This involves tokenizing and vectorizing the data using the tokenizer we defined previously. We also add the special tokens.

In [12]:
from torch.utils.data import DataLoader,Sampler
from torch.nn.utils.rnn import pad_sequence

BATCH_SIZE = 32

train_list = list(train_dataset)

def vectorize_batch(batch):
    '''Take a batch of (text,label) pairs and return tensors ready for input to the model.'''
    label_list, text_list = [], []
    for (_label, _text) in batch:
        label_list.append(int(_label)-1)
        tokens = tokenizer.convert_tokens_to_ids(tokenize_and_cut(_text))
        text_list.append(torch.tensor([init_token_idx] + tokens + [eos_token_idx]))
    return pad_sequence(text_list,
                        padding_value=pad_token_idx,
                        batch_first=True), torch.tensor(label_list)
                                      
class BucketSampler(Sampler):
    def __init__(self, dataset, batch_size):
        train_list = list(dataset)
        indices = [(i, len(tokenizer(s[1]))) for i, s in enumerate(train_list)]
        random.shuffle(indices)
        self.batch_size = batch_size
        
        # create pool of indices with similar lengths 
        self.pooled_indices = []
        for i in range(0, len(indices), self.batch_size * 100):
            self.pooled_indices.extend(sorted(indices[i:i + self.batch_size * 100], key=lambda x: x[1]))
        self.pooled_indices = [x[0] for x in self.pooled_indices]
        
    def __iter__(self):
        self.count = 0
        return self
    
    def __next__(self):
        # yield indices for current batch
        if self.count >= len(self.pooled_indices)-self.batch_size:
            raise StopIteration
        self.count += 1
        
        for i in range(0, len(self.pooled_indices), self.batch_size):
            return self.pooled_indices[self.count:self.count + self.batch_size]
        
    def __len__(self):
        return len(self.pooled_indices)


train_iterator  = DataLoader(train_dataset, collate_fn=vectorize_batch, batch_sampler=BucketSampler(train_dataset, BATCH_SIZE))
valid_iterator  = DataLoader(valid_dataset, collate_fn=vectorize_batch, batch_size=BATCH_SIZE)
test_iterator   = DataLoader(test_dataset,  collate_fn=vectorize_batch, batch_size=BATCH_SIZE)

Token indices sequence length is longer than the specified maximum sequence length for this model (554 > 512). Running this sequence through the model will result in indexing errors


We can check a batch of examples. We can see that the words/tokens have been now replaced by their indexes.

In [13]:
for X, Y in train_iterator:
    print("input_ids:",X)
    print("labels:", Y)
    break

input_ids: tensor([[  101,  2026,  2035,  ...,     0,     0,     0],
        [  101,  1045,  2031,  ...,     0,     0,     0],
        [  101,  2096,  2023,  ...,     0,     0,     0],
        ...,
        [  101,  5525,  1037,  ...,     0,     0,     0],
        [  101,  2009,  1005,  ...,     0,     0,     0],
        [  101, 14833,  9923,  ...,     0,     0,     0]])
labels: tensor([1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1,
        1, 0, 0, 1, 1, 1, 1, 1])


We can use the `convert_ids_to_tokens` method to transform these indexes back into readable tokens.

In [14]:
print(tokenizer.convert_ids_to_tokens(X[0]))

['[CLS]', 'my', 'all', '-', 'time', 'favorite', 'movie', '!', 'i', 'have', 'seen', 'many', 'movies', ',', 'but', 'this', 'one', 'beats', 'them', 'all', '!', 'excel', '##ent', 'acting', ',', 'wonderful', 'story', '.', 'you', 'will', ',', 'as', 'a', '"', 'normal', '"', 'caring', 'person', 'start', 'to', 'love', 'george', '.', 'alto', '##ugh', 'he', 'is', 'an', 'actor', ',', 'he', 'is', 'also', 'himself', 'and', 'a', 'very', 'lo', '##vable', 'person', '.', 'and', 'mab', '##y', 'most', 'important', 'thing', ':', 'you', 'will', 'learn', 'to', 'respect', '&', 'look', 'different', 'to', 'people', 'with', 'down', 'syndrome', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '

## Build the Model

Next, we'll load the pre-trained model, making sure to load the same model as we did for the tokenizer.

In [15]:
from transformers import BertTokenizer, BertModel

bert = BertModel.from_pretrained('bert-base-uncased')

Downloading model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Now we can define our classification model. 

Instead of using an embedding layer to get embeddings for our text, we'll be using the pre-trained transformer model. These embeddings will then be fed into a GRU to produce a prediction for the sentiment of the input sentence. We get the embedding dimension size (called the `hidden_size`) from the transformer via its config attribute. The rest of the initialisation is standard.

Within the forward pass, we freeze the transformer by using `no_grad` to ensure no gradients are calculated over this part of the model.



In [16]:
import torch.nn as nn

class BERTGRUSentiment(nn.Module):
    def __init__(self,
                 bert,
                 hidden_dim,
                 output_dim,
                 n_layers,
                 bidirectional,
                 dropout):
        
        super().__init__()
        self.bert = bert
        embedding_dim = bert.config.to_dict()['hidden_size']
        self.rnn = nn.GRU(embedding_dim,
                          hidden_dim,
                          num_layers = n_layers,
                          bidirectional = bidirectional,
                          batch_first = True,
                          dropout = 0 if n_layers < 2 else dropout)
        self.out = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
        with torch.no_grad():
            embedded = self.bert(text, return_dict=False)[0]
        _, hidden = self.rnn(embedded)
        if self.rnn.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        else:
            hidden = self.dropout(hidden[-1,:,:])
                
        output = self.out(hidden)
        return output

Next, we create an instance of our model using standard hyperparameters.

In [17]:
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.25

model = BERTGRUSentiment(bert,
                         HIDDEN_DIM,
                         OUTPUT_DIM,
                         N_LAYERS,
                         BIDIRECTIONAL,
                         DROPOUT)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

In order to freeze paramers (not train them) we need to set their `requires_grad` attribute to `False`. To do this, we simply loop through all of the `named_parameters` in our model and if they're a part of the `bert` transformer model, we set `requires_grad = False`. 

In [18]:
for name, param in model.named_parameters():                
    if name.startswith('bert'):
        param.requires_grad = False

We can double check the names of the trainable parameters, ensuring none of them belong to BERT. We can see, they are all the parameters of the GRU (`rnn`) and the linear layer (`out`).

In [19]:
for name, param in model.named_parameters():                
    if param.requires_grad:
        print(name)

rnn.weight_ih_l0
rnn.weight_hh_l0
rnn.bias_ih_l0
rnn.bias_hh_l0
rnn.weight_ih_l0_reverse
rnn.weight_hh_l0_reverse
rnn.bias_ih_l0_reverse
rnn.bias_hh_l0_reverse
rnn.weight_ih_l1
rnn.weight_hh_l1
rnn.bias_ih_l1
rnn.bias_hh_l1
rnn.weight_ih_l1_reverse
rnn.weight_hh_l1_reverse
rnn.bias_ih_l1_reverse
rnn.bias_hh_l1_reverse
out.weight
out.bias


## Train the Model

As is standard, we define our optimizer and criterion (loss function).

In [20]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

In [21]:
criterion = nn.BCEWithLogitsLoss()

Place the model and criterion onto the GPU (if available)

In [22]:
model = model.to(device)
criterion = criterion.to(device)

Next, we'll define functions for: calculating accuracy, performing a training epoch, performing an evaluation epoch and calculating how long a training/evaluation epoch takes.

In [23]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch
    """
    # round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() # convert into float for division 
    acc = correct.sum() / len(correct)
    return acc

In [24]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for text, label in tqdm(iterator):
        text   = text.to(device)
        label = label.to(device)
        optimizer.zero_grad()
        predictions = model(text).squeeze(1)
        loss = criterion(predictions.to(torch.float32), label.to(torch.float32))
        acc = binary_accuracy(predictions, label)
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [25]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
        for text, label in tqdm(iterator):
            # Move everything to the right device
            text  = text.to(device)
            label = label.to(device)
            predictions = model(text).squeeze(1)
            loss = criterion(predictions.to(torch.float32), label.to(torch.float32))
            acc = binary_accuracy(predictions, label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [26]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

## Training Loop
Finally, we'll train our model. This takes considerably longer than any of the previous models due to the size of the transformer. Even though we are not training any of the transformer's parameters we still need to pass the data through the model which takes a considerable amount of time on a standard GPU.

- To save training time, I uploaded an already trained model for you to use. Please download it from here (https://durhamuniversity-my.sharepoint.com/:u:/g/personal/lkjs74_durham_ac_uk/EecLvOrAhdlDqmkIcFBQmEsBE5lIVwbMkz7LB9lNCmjdxA?e=fIXt1D)


In [1]:
import torch
print(torch.cuda.is_available())


False


In [2]:
import torch
print(torch.version.cuda)


None


In [34]:
load_saved_model = True # set to True in order to load a previously trained model 

if load_saved_model:
    p = torch.load('bert-sentiment-model.pt', map_location='cpu')
    
    model.rnn.load_state_dict({k[4:]: v for k, v in p.items() if k.startswith('rnn')})
    model.out.load_state_dict({k[4:]: v for k, v in p.items() if k.startswith('out')})
#     model.load_state_dict(torch.load('bert-sentiment-model.pt', map_location='cpu'))

else:
    # Train the model
    N_EPOCHS = 4

    best_valid_loss = float('inf')

    for epoch in range(N_EPOCHS):
        start_time = time.time()

        train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
        valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

        end_time = time.time()

        epoch_mins, epoch_secs = epoch_time(start_time, end_time)

        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), 'bert-sentiment-model.pt')

        print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
        print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
        print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

## Test the model 

In [33]:
test_loss, test_acc = evaluate(model, test_iterator, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

 25%|██▍       | 195/782 [41:56<2:06:16, 12.91s/it]


KeyboardInterrupt: 

## Inference

We'll use the model to test the sentiment of some sequences. We tokenize the input sequence, trim it down to the maximum length, add the special tokens to either side, convert it to a tensor, add a batch dimension and then pass it through our model.

In [30]:
def predict_sentiment(model, tokenizer, sentence):
    model.eval()
    tokens = tokenizer.tokenize(sentence)
    tokens = tokens[:max_input_length-2]
    indexed = [init_token_idx] + tokenizer.convert_tokens_to_ids(tokens) + [eos_token_idx]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(0)
    prediction = torch.sigmoid(model(tensor))
    return prediction.item()

In [31]:
predict_sentiment(model, tokenizer, "This film is terrible")

0.011647679843008518

In [32]:
predict_sentiment(model, tokenizer, "This film is great")

0.9848048686981201

Exercises
=========
-  Try to test the model with different sentences of your choice.   
-  Try with more layers, more hidden units, and more sentences.
-  Replace the GRU with LSTM and analyse the difference in performance.