# Fine-tuning BERT

In this notebook, you will download a pre-trained BERT network that has already been fine-tuned to predict whether two sentences are semantically equivalent (a two-class classification problem). Because Colab GPUs are not always available, you will run the model on the CPU, which is of course slower than a GPU.

Note that the Hugging Face Transformers library (which you used in the tokenization notebook) makes these tasks far easier by providing functions that download trained models. Here we spell out the code instead, to give you an impression of what happens behind the scenes.

First, we install the D2L module - **if you get an error, do not restart the session!** Just continue, things should work fine.

In [None]:
!pip install d2l==1.0.3

In [None]:
import os
import re
import torch
from torch import nn
import torch.nn.functional as F
from d2l import torch as d2l
import json
import multiprocessing
import tensorflow_datasets as tfds   # if Colab complains, first install (similar to installation of d2l above)
import numpy as np

Here, we use a dataset distributed through TensorFlow Datasets. It is part of GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/), a collection of resources for training, evaluating, and analyzing natural language understanding systems. In particular, we focus on the Microsoft Research Paraphrase Corpus (MRPC), a corpus of sentence pairs automatically extracted from online news sources, with human annotations indicating whether the sentences in each pair are semantically equivalent.

In [None]:
tfds.disable_progress_bar()
glue, info = tfds.load('glue/mrpc', with_info=True,
                       # It's small, load the whole dataset
                       batch_size=-1)

The MRPC dataset contains training data, test data and validation data. Items in the datasets contain pairs of sentences (sentence1 and sentence2) with the associated label indicating that the two sentences are equivalent (1) or not (0). In the test data, the label is set to -1 (unlabelled).
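
To make this layout concrete, here is a minimal sketch of how each split is represented in the rest of the notebook: a tuple of three parallel lists (first sentences, second sentences, labels). The sentences are made up for illustration and are not taken from MRPC.

```python
from collections import Counter

# Toy stand-in for one labelled split: three parallel lists.
toy_split = (
    ["A man is eating lunch.", "The sky is blue today."],   # sentence1
    ["A man eats his lunch.", "Stocks fell on Monday."],    # sentence2
    [1, 0],                                                 # 1 = equivalent, 0 = not
)

# Toy stand-in for an unlabelled test split: every label is -1.
toy_test = (["Some sentence."], ["Another sentence."], [-1])

sentences1, sentences2, labels = toy_split
label_counts = Counter(labels)           # how often each label occurs
test_counts = Counter(toy_test[2])

print(label_counts, test_counts)
```

Counting labels this way is a quick check that the training and validation splits contain only 0 and 1, while the test split contains only -1.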

In [None]:
def extract_text(s):
    # Remove parentheses, which we do not use
    s = re.sub('\\(', '', s)
    s = re.sub('\\)', '', s)
    # Substitute two or more consecutive whitespace with space
    s = re.sub('\\s{2,}', ' ', s)
    return s.strip()

def read_preprocess(data):
    """Read the dataset into sentence1, sentence2, and labels."""
    label_set = {'not_equivalent': 0, 'equivalent': 1}
    sentences1 = [extract_text(s.numpy().decode()) for s in data['sentence1']]
    sentences2 = [extract_text(s.numpy().decode()) for s in data['sentence2']]
    labels = [s.numpy() for s in data['label']]
    return sentences1, sentences2, labels

print('Labels: ', info.features['label'].names)
train_data      = read_preprocess(glue['train'])
test_data       = read_preprocess(glue['test'])
validation_data = read_preprocess(glue['validation'])

for data in [train_data, test_data, validation_data]:
    print([[row for row in data[2]].count(i) for i in [-1,0,1]])
    # For train, test, validation: print number of cases with label -1,0,1

Some examples of sentence pairs:

In [None]:
sents1, sents2, labels = train_data
print(sents1[0], '<->', sents2[0], ': ', labels[0])
sents1, sents2, labels = validation_data
print(sents1[0], '<->', sents2[0], ': ', labels[0])

We will now load a pre-trained BERT, specifically the base model with 12 layers, 768 hidden units, 12 attention heads and about 110M parameters (there are also small and large versions). We also load the vocabulary that was obtained when pre-training the model.

First, we download the course data, which contains the vocabulary and the model parameters. This may take a few minutes, depending on your Internet connection.

In [None]:
!git clone https://git.wur.nl/bioinformatics/grs34806-deep-learning-course-data.git data

In [None]:
# Change to GPU for speedup
device = torch.device('cpu')

# Define an empty vocabulary to load the predefined vocabulary
vocab = d2l.Vocab()
with open(os.path.join('data', 'vocab.json')) as f:
    vocab.idx_to_token = json.load(f)
vocab.token_to_idx = {token: idx for idx, token in enumerate(vocab.idx_to_token)}

# Instantiate the network architecture
bert = d2l.BERTModel(len(vocab), num_hiddens=768, ffn_num_hiddens=3072,
                     num_heads=12, num_blks=12, dropout=0.1, max_len=512)
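
The vocabulary loaded above is, in essence, a pair of lookup tables: a list that maps indices to tokens and a dictionary that maps tokens back to indices. The sketch below illustrates this with a tiny, made-up vocabulary; the tokens and the helper `to_ids` are hypothetical and only mimic how unknown tokens fall back to `<unk>`.

```python
# Hypothetical toy vocabulary; the real one is read from data/vocab.json.
idx_to_token = ['<unk>', '<pad>', '<cls>', '<sep>', 'the', 'bank', 'river']
token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}

def to_ids(tokens, unk='<unk>'):
    """Map tokens to indices; tokens not in the vocabulary map to <unk>."""
    return [token_to_idx.get(t, token_to_idx[unk]) for t in tokens]

ids = to_ids(['the', 'river', 'bank', 'overflowed'])
round_trip = [idx_to_token[i] for i in ids]
print(ids, round_trip)
```

Because 'overflowed' is not in the toy vocabulary, it is mapped to the index of `<unk>` and the round trip does not recover the original word.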

Let's have a quick look at what `bert` looks like; we will use most (but not all) of this network below for our dedicated classifier.

In [None]:
bert

## Exercise 1

In the `bert` architecture above, check that you recognize the encoder blocks (how many?), each with four submodules (which ones?).

In [None]:
class GLUEBERTDataset(torch.utils.data.Dataset):
    def __init__(self, dataset, max_len, vocab=None):
        # Unpack all tokens
        all_tokens = [[s1_tokens, s2_tokens]
            for s1_tokens, s2_tokens in zip(*[
                d2l.tokenize([s.lower() for s in sentences])
                    for sentences in dataset[:2]])]

        self.labels = torch.tensor(dataset[2])
        self.vocab = vocab
        self.max_len = max_len
        (self.all_token_ids, self.all_segments,
         self.valid_lens) = self._preprocess(all_tokens)
        print('Read ' + str(len(self.all_token_ids)) + ' examples')

    def _preprocess(self, all_tokens):
        # Use 4 worker processes; the context manager shuts the pool down afterwards
        with multiprocessing.Pool(4) as pool:
            out = pool.map(self._mp_worker, all_tokens)
        all_token_ids = [token_ids for token_ids, segments, valid_len in out]
        all_segments = [segments for token_ids, segments, valid_len in out]
        valid_lens = [valid_len for token_ids, segments, valid_len in out]
        return (torch.tensor(all_token_ids, dtype=torch.long),
                torch.tensor(all_segments, dtype=torch.long), torch.tensor(valid_lens))

    def _mp_worker(self, all_tokens):
        s1_tokens, s2_tokens = all_tokens
        self._truncate_pair_of_tokens(s1_tokens, s2_tokens)
        tokens, segments = d2l.get_tokens_and_segments(s1_tokens, s2_tokens)
        token_ids = self.vocab[tokens] + [self.vocab['<pad>']] \
                             * (self.max_len - len(tokens))
        segments = segments + [0] * (self.max_len - len(segments))
        valid_len = len(tokens)
        return token_ids, segments, valid_len

    def _truncate_pair_of_tokens(self, s1_tokens, s2_tokens):
        # Reserve slots for '<CLS>', '<SEP>', and '<SEP>' tokens for the BERT input
        while len(s1_tokens) + len(s2_tokens) > self.max_len - 3:
            if len(s1_tokens) > len(s2_tokens):
                s1_tokens.pop()
            else:
                s2_tokens.pop()

    def __getitem__(self, idx):
        return (self.all_token_ids[idx].to(device),
                self.all_segments[idx].to(device),
                self.valid_lens[idx].to(device)), self.labels[idx].to(device)

    def __len__(self):
        return len(self.all_token_ids)

## The dataset for fine-tuning BERT

For the task on the GLUE dataset, we define a customized dataset class `GLUEBERTDataset`.
In each example, the two sentences form a pair of text sequences packed into one BERT input sequence. Segment IDs are used to distinguish the two text sequences.
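The maximum length of a BERT input sequence is fixed in advance (`max_len`, here 128). If a sentence pair is too long, the last token of the longer of the two sentences is removed repeatedly until the pair, together with the three special tokens `<cls>`, `<sep>` and `<sep>`, fits within `max_len`. Shorter inputs are padded with `<pad>` tokens up to `max_len`.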
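
The truncation and packing steps can be sketched in plain Python as follows. This is a simplified illustration that works on lists of token strings; the function names `truncate_pair` and `pack_pair` are hypothetical, and the real class additionally converts tokens to vocabulary indices.

```python
def truncate_pair(s1, s2, max_len):
    """Drop trailing tokens from the longer sentence until the pair fits."""
    s1, s2 = list(s1), list(s2)          # copy, so the inputs stay untouched
    while len(s1) + len(s2) > max_len - 3:   # 3 slots for <cls>, <sep>, <sep>
        if len(s1) > len(s2):
            s1.pop()
        else:
            s2.pop()
    return s1, s2

def pack_pair(s1, s2, max_len, pad='<pad>'):
    """Build one BERT input: tokens, segment IDs and the unpadded length."""
    s1, s2 = truncate_pair(s1, s2, max_len)
    tokens = ['<cls>'] + s1 + ['<sep>'] + s2 + ['<sep>']
    # Segment 0 covers <cls>, sentence 1 and its <sep>; segment 1 the rest.
    segments = [0] * (len(s1) + 2) + [1] * (len(s2) + 1)
    valid_len = len(tokens)
    tokens += [pad] * (max_len - valid_len)
    segments += [0] * (max_len - valid_len)
    return tokens, segments, valid_len

tokens, segments, valid_len = pack_pair(
    ['a', 'b', 'c', 'd'], ['x', 'y', 'z'], max_len=8)
print(tokens, segments, valid_len)
```

With `max_len=8`, only five slots are available for the two sentences, so one token is removed from each: first from the longer first sentence, then from the second once both have the same length.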

In [None]:
max_len = 128
batch_size = 8 # We use a small batch size here for demonstration purposes

train_set      = GLUEBERTDataset(train_data, max_len, vocab)
validation_set = GLUEBERTDataset(validation_data, max_len, vocab)
test_set       = GLUEBERTDataset(test_data, max_len, vocab)

# We use the validation set for testing (the test set is unlabelled)
train_iter      = torch.utils.data.DataLoader(train_set, batch_size, shuffle=True)
validation_iter = torch.utils.data.DataLoader(validation_set, batch_size, shuffle=False)
test_iter       = torch.utils.data.DataLoader(test_set, batch_size, shuffle=False)

Next, we create a network out of parts of the BERT model (its encoder and hidden layer), followed by a simple linear output layer with 2 units, one for each of our two classes:

In [None]:
class BERTClassifier(nn.Module):
    def __init__(self, bert):
        super(BERTClassifier, self).__init__()
        # Note how we extract various pieces of the BERT model defined above
        self.encoder = bert.encoder
        self.hidden = bert.hidden
        # 768 is the dimension of the hidden state of bert.hidden
        self.output = nn.Sequential(nn.Linear(768, 2))

    def forward(self, inputs):
        tokens_X, segments_X, valid_lens_x = inputs
        encoded_X = self.encoder(tokens_X, segments_X, valid_lens_x)
        return self.output(self.hidden(encoded_X[:, 0, :]))

net = BERTClassifier(bert)
net.eval()

Let's now load the parameters of the model, obtained by pre-training BERT and then fine-tuning it on this task.

In [None]:
# map_location makes sure the parameters end up on the device selected above
net.load_state_dict(torch.load("data/GLUEBERT.net", map_location=device))

## Exercise 2

You can use the code below to investigate the context dependency of the BERT embeddings of the same word - in this case "bank", which gets a different 768D embedding depending on the sentence in which it is used. Do the similarities make sense? See if you can come up with some different sentences where the same words are used in different contexts.

In [None]:
def get_bert_encoding(net, tokens):
    toks, segments = d2l.get_tokens_and_segments(tokens)
    token_ids = torch.tensor(vocab[toks], device=device).unsqueeze(0)
    segments = torch.tensor(segments, device=device).unsqueeze(0)
    valid_len = torch.tensor(len(toks), device=device).unsqueeze(0)
    encoded_X = net(token_ids, segments, valid_len)
    return encoded_X

bert.to(device)
bert.eval()

tokens_a = 'i walked along the road to get cash from my bank'.split()
tokens_b = 'we managed to open a savings account at the local bank'.split()
tokens_c = 'i swam across the river to get to its other bank'.split()

# Index 0 holds <cls>, so 'bank' (the 11th word of each sentence) is at index 11.
enc_a = get_bert_encoding(bert.encoder, tokens_a)[:,11,:]
enc_b = get_bert_encoding(bert.encoder, tokens_b)[:,11,:]
enc_c = get_bert_encoding(bert.encoder, tokens_c)[:,11,:]

print(F.cosine_similarity(enc_a,enc_b))
print(F.cosine_similarity(enc_a,enc_c))
print(F.cosine_similarity(enc_b,enc_c))

Now we can try out the model:

In [None]:
# Test the network on a batch of sentence pairs
net.to(device)
net.eval()

with torch.no_grad():
  X,y = next(iter(train_iter))
  yhat = np.argmax(net(X).detach().cpu().numpy(),axis=1)

for i in range(len(yhat)):
  print('Input: ', ' '.join([vocab.idx_to_token[j] for j in X[0][i]]))
  print('Prediction: ', yhat[i], ' Label: ', y[i].item())
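
Comparing predictions and labels by eye only works for a handful of examples. MRPC results are usually summarised as accuracy and F1 score for the "equivalent" class; the sketch below computes both in plain Python. The predictions and labels are made up, and the helper `accuracy_and_f1` is hypothetical rather than part of the notebook.

```python
def accuracy_and_f1(predictions, labels, positive=1):
    """Accuracy and F1 score of the positive class for two label lists."""
    pairs = list(zip(predictions, labels))
    correct = sum(p == y for p, y in pairs)
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1

# Made-up predictions and labels for six sentence pairs.
acc, f1 = accuracy_and_f1([1, 1, 0, 1, 0, 1], [1, 0, 0, 1, 1, 1])
print(acc, f1)
```

To apply this to the model, you could collect `yhat` and `y` over all batches of `validation_iter` and pass the two lists to the function.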

In [None]:
# To test our own sentences, first create a GLUEBERTDataset with them
s1 = "We use this as a test sentence to see whether the network works."
s2 = "To test if the network works we use this sentence."

#s1 = "Google is a very large company indeed."
#s2 = "And now for something completely different."

data = ([extract_text(s1)], [extract_text(s2)], [0])
print(data)

my_set = GLUEBERTDataset(data, max_len, vocab)
my_iter = torch.utils.data.DataLoader(my_set, 1)

with torch.no_grad():
  X, y = next(iter(my_iter))
  yhat = np.argmax(net(X).detach().cpu().numpy(),axis=1)
print('Prediction: ', yhat)

Instead of looking only at the predicted label (as above), we can also look at the probabilities the model assigns to the two possible labels (not equivalent and equivalent):
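
As a reminder of how the two raw outputs (logits) of the network are turned into probabilities, here is a minimal pure-Python softmax. The logit values are made up for illustration; index 0 stands for "not equivalent" and index 1 for "equivalent", following the label coding above.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)                        # subtract the max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-1.2, 2.3]                       # made-up network outputs
probs = softmax(logits)
predicted = probs.index(max(probs))        # same label as argmax on the logits
print(probs, predicted)
```

Softmax does not change which class is largest, so the predicted label is the same as with `argmax` on the logits; the probabilities additionally show how confident the model is.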

In [None]:
torch.nn.functional.softmax(net(X).detach(),dim=1)

## Exercise 3
Try changing specific keywords in the sentence pairs above, to get an idea of how much the model understands about language. For example, replace "network" by "model" or by "car" in one of the two sentences of the first pair. Look at the resulting prediction (equivalent or not) and also at the underlying probabilities that the model gives as output.

# Answers

## Exercise 1

You can clearly identify 12 encoder blocks, as follows:

```
TransformerEncoderBlock(
        (attention): MultiHeadAttention(
          (attention): DotProductAttention(
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (W_q): Linear(in_features=0, out_features=768, bias=True)
          (W_k): Linear(in_features=0, out_features=768, bias=True)
          (W_v): Linear(in_features=0, out_features=768, bias=True)
          (W_o): Linear(in_features=0, out_features=768, bias=True)
        )
        (addnorm1): AddNorm(
          (dropout): Dropout(p=0.1, inplace=False)
          (ln): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
        (ffn): PositionWiseFFN(
          (dense1): Linear(in_features=0, out_features=3072, bias=True)
          (relu): ReLU()
          (dense2): Linear(in_features=0, out_features=768, bias=True)
        )
        (addnorm2): AddNorm(
          (dropout): Dropout(p=0.1, inplace=False)
          (ln): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
      )
```
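
The layer sizes in this printout also allow a rough sanity check of the "110M parameters" figure. The sketch below counts weights and biases for the original BERT-base layout. It assumes a vocabulary of 30,522 tokens, as in the original BERT release; the vocabulary used in this notebook may differ, and the d2l implementation may differ in small details, so treat the total as an approximation.

```python
def linear(n_in, n_out):
    """Parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

def layer_norm(n):
    """Parameters of a layer normalisation: scale plus shift."""
    return 2 * n

hidden, ffn, blocks, max_len = 768, 3072, 12, 512
vocab_size = 30522   # assumption: vocabulary size of the original BERT-base

attention = 4 * linear(hidden, hidden)              # W_q, W_k, W_v, W_o
feed_forward = linear(hidden, ffn) + linear(ffn, hidden)
per_block = attention + feed_forward + 2 * layer_norm(hidden)

# Token, position and segment embeddings, plus one layer normalisation.
embeddings = (vocab_size + max_len + 2) * hidden + layer_norm(hidden)
pooler = linear(hidden, hidden)                     # the hidden layer on <cls>

total = blocks * per_block + embeddings + pooler
print(per_block, total)
```

Under these assumptions, each encoder block holds about 7.1M parameters and the whole model about 109.5M, which is where the rounded figure of 110M comes from.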

At the end, you will see an MLM layer (masked language modelling) and an NSP layer (next sentence prediction); both were used in pre-training. When fine-tuning BERT, we take the output of the encoder instead.

## Exercise 2

If all goes well, you should get a similarity of ~0.96 for the two sentences that use "bank" to mean "financial institution", and lower (but still reasonably high) for the pairs with one sentence using "bank" as "riverside".
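
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. A minimal pure-Python version, applied to short made-up vectors instead of 768-dimensional embeddings, could look like this:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Values close to 1 mean the two embeddings point in nearly the same direction, values around 0 mean they are unrelated, and negative values mean they point in opposite directions.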

There are many more homonyms that you can play with, see e.g. https://www.yourdictionary.com/articles/examples-homonyms.

## Exercise 3

Changing "network" in the first sentence to "model" or "car" indeed changes the classification.