In [None]:
%matplotlib inline

import collections
import random
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import sklearn.metrics
import torch
import transformers

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# BERT

For most machine learning tasks, you're better off **fine-tuning** a **pre-trained** neural network than training from scratch.
This means taking a model that was trained on a large data set for a task that is not exactly the one you need (usually a self-supervised task) and then continuing to train it (fine-tuning it) on a small data set that is precisely what you need.
Because the pre-trained model already knows the basics of how language works, it can start extracting useful information from the small training set right away.
This technique, called **transfer learning**, lets you get much better results using a small data set.
The idea proved so useful that people started training models specifically to publish them as **source models** for others to perform transfer learning with.

One of the most popular pre-trained models is [BERT](https://aclanthology.org/N19-1423/) (Bidirectional Encoder Representations from Transformers), which was trained by self-supervised learning on a large corpus of text from Wikipedia and the BooksCorpus.
It was trained on two self-supervised tasks: **masked language modelling** (MLM) and **next sentence prediction** (NSP).

MLM is the task of predicting tokens that are missing from a sentence.
This is done by randomly replacing some tokens in a sentence with the special token `[MASK]` and training the model to predict the original tokens at those positions.
Masked language modelling is contrasted with **causal language modelling**, which predicts each token from the tokens before it and is what the language models we've used up to now did.
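
To make the masking step concrete, here is a minimal sketch in plain Python.
The function `mask_tokens` and its parameters `mask_prob` and `seed` are hypothetical names made up for this illustration, and the sketch simplifies BERT's actual procedure by always substituting `[MASK]` and by scoring predictions only at the masked positions.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    '''Return (masked_tokens, targets) for a simplified MLM training example.'''
    rng = random.Random(seed)
    masked = []
    targets = []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append('[MASK]')
            targets.append(token)  # The model is trained to recover this token.
        else:
            masked.append(token)
            targets.append(None)  # No prediction is scored at this position.
    return masked, targets

tokens = ['the', 'dog', 'bit', 'the', 'cat', '.']
(masked, targets) = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

Which tokens get masked depends on the seed, so a real training loop would draw a fresh mask every time a sentence is seen.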

NSP is a task for predicting if two sentences were found next to each other or not.
This is done by using two special tokens: `[SEP]` (separator) and `[CLS]` (class).
The separator token is placed between the two sentences which are concatenated together into a single sequence.
The word-in-context vector produced for the class token is passed into a binary classifier that decides whether the second sentence really followed the first or was picked at random.
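
As an illustration of the input layout, the sketch below builds a sentence pair in plain Python.
The function name `build_pair_input` is hypothetical; the trailing `[SEP]` and the segment numbers (which HuggingFace calls `token_type_ids`) are details of BERT's input format that are not covered in the text above.

```python
def build_pair_input(sentence_a, sentence_b):
    '''Concatenate two tokenised sentences the way BERT expects a sentence pair.'''
    tokens = ['[CLS]'] + sentence_a + ['[SEP]'] + sentence_b + ['[SEP]']
    # Segment 0 covers [CLS], the first sentence and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    return (tokens, segment_ids)

(tokens, segment_ids) = build_pair_input(['the', 'dog', 'barked'], ['it', 'was', 'loud'])
print(tokens)
print(segment_ids)
```

The vector at position 0 (the `[CLS]` position) is the one that would be fed to the binary classifier.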

## HuggingFace

The easiest way to use BERT is to use [HuggingFace](https://huggingface.co/models), a library of pre-trained transformers that are readily usable and downloadable.
To use HuggingFace you just need to install the `transformers` Python library using pip.

### Tokeniser

Since it's pre-trained, BERT has its own vocabulary that you need to use.
It also comes with its own tokeniser that you are expected to use.

The vocabulary consists of tokens called **word pieces**, which makes it possible to avoid using unknown tokens (although not completely).
At its extreme, the tokeniser can break down a word into its individual characters and treat these characters as separate tokens.
The vocabulary would then consist of all the characters in the alphabet, digits, punctuation, and so on.
While avoiding most unknown tokens, such a tokeniser would turn sentences into token sequences that are too long and that require a lot of memory to work with.
So in addition to the individual characters, the vocabulary also includes commonly occurring substrings that are found in words.
BERT's tokeniser automatically identifies these substrings and makes them a single token.
Any unknown substrings are broken into smaller substrings and treated as separate tokens.

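BERT's vocabulary was extracted with the WordPiece algorithm, a close relative of the **byte pair encoding** algorithm (BPE) that differs mainly in how it scores candidate pairs. BPE is the simpler of the two to describe and it works as follows: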

1. Start with a corpus of text from which to extract a vocabulary and tokenise it into whole words.
1. Collect all the individual characters from all the words, treat these as tokens, and add them to the vocabulary.
1. Look for the most frequent pair of adjacent vocabulary tokens in the corpus.
1. Concatenate this pair, replace all occurrences of it in the corpus with a single token, and add it to the vocabulary.
1. Repeat the previous two steps until the vocabulary reaches a chosen size (something like 50 000 tokens).

Let's take the word 'banana' as our example corpus.
We treat the characters of the word as individual tokens and add them to our vocabulary:

    Corpus: b a n a n a
    Vocabulary: {a, b, n}

The frequencies of the pairs of adjacent tokens in the corpus are as follows:

    ba: 1
    an: 2
    na: 2

The pairs 'an' and 'na' are tied, so let's say that we pick 'an' as the most frequent pair.
We replace all instances of this in the corpus with a single token and add it to the vocabulary:

    Corpus: b an an a
    Vocabulary: {a, b, n, an}

The frequencies of the new pairs of adjacent tokens are now:

    ban: 1
    anan: 1
    ana: 1

Keep replacing the most frequent pair of tokens with a single token until you reach a desired vocabulary size.
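
The procedure above can be sketched in a few lines of plain Python.
This is only an illustration under stated assumptions: `learn_bpe` and `num_merges` are hypothetical names, the stopping rule is a number of merges rather than a vocabulary size, and ties are broken in favour of the pair that was seen first.

```python
import collections

def learn_bpe(words, num_merges):
    '''Learn a BPE vocabulary from a list of words.'''
    # Each word starts as a list of single-character tokens.
    corpus = [list(word) for word in words]
    vocab = sorted({ch for word in corpus for ch in word})
    merges = []
    for _ in range(num_merges):
        # Count every pair of adjacent tokens in the corpus.
        pair_counts = collections.Counter()
        for word in corpus:
            for pair in zip(word, word[1:]):
                pair_counts[pair] += 1
        if not pair_counts:
            break  # Every word is a single token, so nothing is left to merge.
        # max() returns the first maximal pair, so ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merged = best[0] + best[1]
        merges.append(best)
        vocab.append(merged)
        # Replace every occurrence of the pair with the merged token.
        new_corpus = []
        for word in corpus:
            new_word = []
            i = 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    new_word.append(merged)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_corpus.append(new_word)
        corpus = new_corpus
    return (vocab, merges, corpus)

(vocab, merges, corpus) = learn_bpe(['banana'], num_merges=1)
print(corpus)  # [['b', 'an', 'an', 'a']]
print(vocab)   # ['a', 'b', 'n', 'an']
```

A single merge reproduces the 'banana' example above; asking for more merges keeps growing the vocabulary with longer tokens.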

Unlike other tokenisers, such as [the one used by GPT](https://gpt-tokenizer.dev/), the BERT tokeniser does not include the space between words as a token.
Instead, each token has two versions: one that can only be used at the beginning of a word and one that can only appear elsewhere in the word.
This allows BERT to know where multi-token words begin and end in a sentence.
In fact, BERT represents tokens that can only be used after the first token of a word by putting a '##' in front of them.
For example, the vocabulary shown above would actually be:

    Vocabulary: {b, ##a, ##n, ##an}

And the word 'banana' would be represented as:

    b ##an ##an ##a
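
Once the vocabulary is fixed, a word can be split by repeatedly taking the longest piece that is in the vocabulary, which is the greedy strategy that BERT's tokeniser follows.
The sketch below is a simplified plain-Python version of that idea; `wordpiece_tokenise` is a hypothetical name and the toy vocabulary is the one from the example above.

```python
def wordpiece_tokenise(word, vocab, unk_token='[UNK]'):
    '''Split one word into the longest vocabulary pieces, left to right.'''
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first and shrink until one is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # Non-initial pieces carry the '##' prefix.
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # No piece fits, so the whole word becomes unknown.
        pieces.append(piece)
        start = end
    return pieces

vocab = {'b', '##a', '##n', '##an'}
print(wordpiece_tokenise('banana', vocab))  # ['b', '##an', '##an', '##a']
print(wordpiece_tokenise('cab', vocab))     # ['[UNK]']
```

The second call shows why unknown tokens cannot be avoided completely: 'c' is not in the toy vocabulary, so no split of 'cab' is possible.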

In HuggingFace, the BERT tokeniser can be loaded as follows:

In [None]:
tokeniser = transformers.BertTokenizer.from_pretrained('bert-base-cased')

And this is how you tokenise your text:

In [None]:
texts = [
    'The dog bit the cat.',
    'It was unbelievable.',
]

# No need to tokenise the text into a list of words first!
# BERT cannot process texts longer than 512 tokens (because of the position embeddings).
token_indexes = tokeniser(texts, return_tensors='pt', padding=True, truncation=True, max_length=512).to(device)

# Token indexes.
print('indexes:')
print(token_indexes['input_ids'])
print()

# Pad mask.
print('mask:')
print(token_indexes['attention_mask'])
print()

# Readable tokens.
print('tokens:')
print(tokeniser.convert_ids_to_tokens(token_indexes['input_ids'][0]))
print(tokeniser.convert_ids_to_tokens(token_indexes['input_ids'][1]))

### Masked language modelling

This is how you load the BERT model for masked language modelling (it will spend time downloading the model from HuggingFace the first time you run this but then will use the cached model for future calls):

In [None]:
bert = transformers.BertForMaskedLM.from_pretrained('bert-base-cased').to(device)

Now we can replace a token in the tokenised sentences with the mask token:

In [None]:
token_indexes['input_ids'][0, 3] = tokeniser.mask_token_id
token_indexes['input_ids'][1, 4] = tokeniser.mask_token_id

print(tokeniser.convert_ids_to_tokens(token_indexes['input_ids'][0]))
print(tokeniser.convert_ids_to_tokens(token_indexes['input_ids'][1]))

Finally we can get the predictions made by BERT:

In [None]:
logits = bert(token_indexes['input_ids'], attention_mask=token_indexes['attention_mask']).logits
probs = torch.softmax(logits, dim=2)
vocab = tokeniser.get_vocab()

print(tokeniser.convert_ids_to_tokens(token_indexes['input_ids'][0]))
for (prob, token) in sorted(zip(probs[0, 3, :].tolist(), vocab), reverse=True)[:5]:
    print(token, prob)
print()

print(tokeniser.convert_ids_to_tokens(token_indexes['input_ids'][1]))
for (prob, token) in sorted(zip(probs[1, 4, :].tolist(), vocab), reverse=True)[:5]:
    print(token, prob)

Note that asking for the prediction at a non-mask position will just return the token that is already there with almost 100% probability:

In [None]:
print(tokeniser.convert_ids_to_tokens(token_indexes['input_ids'][0]))
print(tokeniser.convert_ids_to_tokens([token_indexes['input_ids'][0][2]]))
for (prob, token) in sorted(zip(probs[0, 2, :].tolist(), vocab), reverse=True)[:5]:
    print(token, prob)

You don't need to limit yourself to only one mask.
You can use multiple masks to perform **text in-filling**, that is, filling a span of text rather than a single token.
Keep in mind that you can't just predict tokens for all the masks at once because the predictions made for each mask do not take into account what the other masks will be replaced with.
The way you fill in multiple masks is exactly the same as when we performed text generation: you predict a token for a single mask, put it in place of said mask, and then perform a completely new prediction for the next mask with the previous mask filled in.
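
The fill-one-then-predict-again loop can be sketched independently of BERT.
In the sketch below, `fill_masks` is a hypothetical helper and `toy_predict` is a made-up stand-in for a masked language model that returns a dictionary of token probabilities for one position; filling the leftmost mask first and always taking the most probable token are just one possible choice.

```python
def fill_masks(tokens, predict, mask_token='[MASK]'):
    '''Fill masks left to right, re-predicting after every replacement.'''
    tokens = list(tokens)  # Work on a copy so the caller's list is untouched.
    while mask_token in tokens:
        position = tokens.index(mask_token)  # Leftmost remaining mask.
        probs = predict(tokens, position)  # Fresh prediction that sees earlier fills.
        tokens[position] = max(probs, key=probs.get)
    return tokens

def toy_predict(tokens, position):
    '''Hypothetical stand-in for a masked language model.'''
    if tokens[position - 1] == 'big':
        return {'dog': 0.9, 'big': 0.1}
    return {'big': 0.6, 'dog': 0.4}

print(fill_masks(['the', '[MASK]', '[MASK]', 'bit', 'the', 'cat'], toy_predict))
# ['the', 'big', 'dog', 'bit', 'the', 'cat']
```

With the toy model, predicting both masks at once would give 'big' twice, while filling them one at a time lets the second prediction react to the first.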

In [None]:
texts = [
    'The [MASK] [MASK] bit the cat.' # The tokeniser will turn '[MASK]' into the special token.
]

token_indexes = tokeniser(texts, return_tensors='pt', padding=True, truncation=True, max_length=512).to(device)

print(tokeniser.convert_ids_to_tokens(token_indexes['input_ids'][0]))

We can get the indexes of all the mask tokens like this:

In [None]:
mask_token_mask = token_indexes['input_ids'][0] == tokeniser.mask_token_id
print('mask_token_mask:')
print(mask_token_mask)
print()

token_positions = torch.arange(token_indexes['input_ids'].shape[1], dtype=torch.int64, device=device)
print('token_positions:')
print(token_positions)
print()

mask_token_positions = token_positions[mask_token_mask]
print('mask_token_positions:')
print(mask_token_positions)

Next, we can predict the token that can replace one mask:

In [None]:
logits = bert(token_indexes['input_ids'], attention_mask=token_indexes['attention_mask']).logits
probs = torch.softmax(logits, dim=2)
print('mask 1:')
for (prob, token) in sorted(zip(probs[0, mask_token_positions[0], :].tolist(), vocab), reverse=True)[:5]:
    print(token, prob)
print()

# No token has been replaced yet, so the probabilities for mask 2 come from the same forward pass as mask 1.
print('mask 2:')
for (prob, token) in sorted(zip(probs[0, mask_token_positions[1], :].tolist(), vocab), reverse=True)[:5]:
    print(token, prob)
print()

print('replacing mask 1 with most probable token:')
first_mask = mask_token_positions[0]
token_indexes['input_ids'][0, first_mask] = probs[0, first_mask, :].argmax()
print(tokeniser.convert_ids_to_tokens(token_indexes['input_ids'][0]))
print()

# Predict again now that the first mask has been filled in.
logits = bert(token_indexes['input_ids'], attention_mask=token_indexes['attention_mask']).logits
probs = torch.softmax(logits, dim=2)
print('mask 2 now:')
for (prob, token) in sorted(zip(probs[0, mask_token_positions[1], :].tolist(), vocab), reverse=True)[:5]:
    print(token, prob)

How would you force the two masks to be filled with a single word that is made of two tokens?

#### Pseudo-log-likelihood score

It's not possible to get the probability of a sentence using a masked language model.
Remember how we decomposed the probability of a sentence into the product of multiple probabilities made from prefixes of the sentence?
Masked language models do not use prefixes so the token probabilities do not combine into the mathematical definition of the probability of a sentence.
But this did not stop people from doing it anyway, with good results.
It's called a [**pseudo-log-likelihood score**](https://aclanthology.org/2020.acl-main.240/) (that is, not the true log-likelihood) and it's done as follows:

1. Replace the first token in the sentence with a mask.
1. Get the log probability of the original token in the mask's place.
1. Put the original token back.
1. Repeat the previous three steps for every token.
1. Add up all the log probabilities into a single score.
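
Written as a formula, the steps above amount to the following.
The notation is an assumption made here for illustration: $w_t$ is the token at position $t$ of a sentence $W$, and $W_{\setminus t}$ is the sentence with position $t$ replaced by a mask.

$$\mathrm{PLL}(W) = \sum_{t=1}^{|W|} \log P_{\mathrm{MLM}}(w_t \mid W_{\setminus t})$$

Compare this with the true log-likelihood from a causal language model, where each token is conditioned only on its prefix:

$$\log P(W) = \sum_{t=1}^{|W|} \log P(w_t \mid w_1, \dots, w_{t-1})$$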

In [None]:
texts = [
    'The dog bit the cat.'
]
token_indexes = tokeniser(texts, return_tensors='pt', padding=True, truncation=True, max_length=512).to(device)

print(tokeniser.convert_ids_to_tokens(token_indexes['input_ids'][0]))
print()

score = 0.0
for (i, token_index) in enumerate(token_indexes['input_ids'][0, 1:-1].tolist(), start=1):
    token_indexes['input_ids'][0, i] = tokeniser.mask_token_id
    
    logits = bert(token_indexes['input_ids'], attention_mask=token_indexes['attention_mask']).logits
    pseudo_logprobs = torch.log_softmax(logits, dim=2)
    token_pseudo_logprob = pseudo_logprobs[0, i, token_index].tolist()
    
    print(tokeniser.convert_ids_to_tokens(token_indexes['input_ids'][0]), token_pseudo_logprob)
    
    score += token_pseudo_logprob
    
    token_indexes['input_ids'][0, i] = token_index
print()

print('score:', score)

Remember that this is not a real probability, so there is no point in converting the log probability to a probability.
It should only be treated as a score for ranking.

### Fine-tuning BERT

Masked language modelling is pretty cool, but not as cool as incorporating BERT into your model and fine-tuning it to do your bidding.
What we'll do is to crack open BERT and use the output of a hidden layer inside BERT as a part of our model.
We can then train our model together with BERT.
We'll use it to perform sentiment analysis and part of speech tagging.

Here is how you access a hidden layer:

In [None]:
# BERT has 13 hidden layers, the first being the embedding layer.
hidden_layers = bert(
    token_indexes['input_ids'],
    attention_mask=token_indexes['attention_mask'],
    output_hidden_states=True
).hidden_states

hidden_layer = hidden_layers[7] # The middle layers are usually the most transferable.
print(hidden_layer.shape)

The important thing about these pre-trained transformers is that they are normal PyTorch modules, and so you can treat them as a layer in your own module.

There are certain things you need to keep in mind when fine-tuning:

* BERT makes use of dropout internally so you have to use `model.train()` and `model.eval()` to say whether the calls you're making on the model are for optimisation or to get predictions.
* You should use a smaller learning rate for BERT than for your own parameters.
    This is to avoid **catastrophic forgetting**, which is when the pre-trained model overfits on your data and forgets what it was pre-trained on.
    We usually use a learning rate of `2E-5` on BERT.
    You can do this easily in PyTorch, as shown below.
* Due to the dropout inside BERT, it will have slightly unstable learning progress, which is normal, provided that the error tends to go down.

#### Text classification

In [None]:
train_x = [
    'I like it .',
    'I hate it .',
    'I don\'t hate it .',
    'I don\'t like it .',
]
train_y = torch.tensor([
    [1],
    [0],
    [1],
    [0],
], dtype=torch.float32, device=device)

token_indexes = tokeniser(train_x, return_tensors='pt', padding=True, truncation=True, max_length=512).to(device)
train_x_indexed = token_indexes['input_ids']
train_x_mask = token_indexes['attention_mask']

In [None]:
class Model(torch.nn.Module):

    def __init__(self):
        super().__init__()
        self.bert = transformers.BertForMaskedLM.from_pretrained('bert-base-cased') # Make sure you're using a fresh copy of the original BERT and not a fine-tuned one.
        self.output_layer = torch.nn.Linear(768, 1)
    
    def forward(self, x_indexed, pad_mask):
        word_in_context_vecs = self.bert(x_indexed, attention_mask=pad_mask, output_hidden_states=True).hidden_states[7]
        cls_vec = word_in_context_vecs[:, 0, :] # Use the CLS token to represent the entire text.
        return self.output_layer(cls_vec)

model = Model()
model.to(device)

# Use a normal learning rate for the parameters we created and a tiny learning rate for BERT.
optimiser = torch.optim.Adam([
    {'params': model.output_layer.parameters(), 'lr': 0.1},
    {'params': model.bert.parameters(), 'lr': 2E-5}
])

print('epoch', 'error')
train_errors = []
for epoch in range(1, 40+1):
    optimiser.zero_grad()
    model.train()
    logits = model(train_x_indexed, train_x_mask)
    train_error = torch.nn.functional.binary_cross_entropy_with_logits(logits, train_y)
    train_errors.append(train_error.cpu().detach().tolist())
    train_error.backward()
    optimiser.step()
    model.eval()

    if epoch%1 == 0:
        print(epoch, train_errors[-1])
print()

with torch.no_grad():
    print('text', 'output')
    outputs = torch.sigmoid(model(train_x_indexed, train_x_mask))
    for (text, y) in zip(train_x, outputs):
        print(text, y.round(decimals=1).cpu().tolist())

(fig, ax) = plt.subplots(1, 1)
ax.set_xlabel('epoch')
ax.set_ylabel('$E$')
ax.plot(range(1, len(train_errors) + 1), train_errors, color='blue', linestyle='-', linewidth=3)
ax.grid()

#### Text tagging

Given that BERT works on subword tokens rather than at the word level, using BERT to tag full words is a bit of a challenge.
For example, "I don't like it." becomes:

In [None]:
print(tokeniser.convert_ids_to_tokens(token_indexes['input_ids'][3]))

This 5-word sentence results in 9 tokens (including `[CLS]` and `[SEP]`), which in turn results in 9 predictions.
How can you make predictions for whole words?

The usual approach is to consider only the first token of every word and to mask out all the other subword tokens.
Here's a function that converts a list of words with tags into a list of tokens, a mask, and aligned tags:

In [None]:
def get_aligned_tokens_and_tags(tokeniser, text_tokens, text_tags, no_tag=0):
    '''
    Convert a list of word-tokenised texts with a tag for each word into BERT tokenised texts with aligned tags.
    '''
    text_indexes = []
    text_aligned_tags = []
    text_tag_masks = []
    
    for (tokens, tags) in zip(text_tokens, text_tags):
        # Replicate the process of tokenising a sentence by concatenating the tokenised words.
        
        # Start with the CLS token.
        indexes = [tokeniser.cls_token_id]
        tag_mask = [False]
        aligned_tags = [no_tag]
        
        # Add the tokens of each word.
        for (tag, word) in zip(tags, tokens):
            for (i, index) in enumerate(tokeniser(word)['input_ids'][1:-1]): # For every subword token in the current word:
                indexes.append(index)
                if i == 0: # Only the first token of the word gets a true tag.
                    tag_mask.append(True)
                    aligned_tags.append(tag)
                else:
                    tag_mask.append(False)
                    aligned_tags.append(no_tag)
        
        # End with the SEP token.
        indexes.append(tokeniser.sep_token_id)
        tag_mask.append(False)
        aligned_tags.append(no_tag)
        
        # Add the new tokenised sentence to the list.
        text_indexes.append(indexes)
        text_aligned_tags.append(aligned_tags)
        text_tag_masks.append(tag_mask)
    
    # Pad the tokens.
    max_len = max(len(indexes) for indexes in text_indexes)
    text_token_masks = []
    for (i, (indexes, tag_mask, aligned_tags)) in enumerate(zip(text_indexes, text_tag_masks, text_aligned_tags)):
        num_tokens = len(indexes)
        text_token_masks.append([1]*num_tokens + [0]*(max_len - num_tokens))
        text_indexes[i].extend([tokeniser.pad_token_id]*(max_len - num_tokens))
        text_tag_masks[i].extend([False]*(max_len - num_tokens))
        text_aligned_tags[i].extend([no_tag]*(max_len - num_tokens))
    
    return (
        torch.tensor(text_indexes, dtype=torch.int64, device=device),
        torch.tensor(text_token_masks, dtype=torch.bool, device=device),
        torch.tensor(text_aligned_tags, dtype=torch.int64, device=device),
        torch.tensor(text_tag_masks, dtype=torch.bool, device=device),
    )

In [None]:
texts = [
    'I like it .'.split(' '),
    'I don\'t like it .'.split(' '),
]
tags = [
    'PRON VERB PROP .'.split(' '),
    'PRON VERB ADP PROP .'.split(' '),
]

tag_set = ['PAD', 'PRON', 'VERB', 'ADP', 'PROP', '.']
tag_indexes = [[tag_set.index(tag) for tag in text] for text in tags]

(token_indexes, token_mask, aligned_tags, tag_mask) = get_aligned_tokens_and_tags(tokeniser, texts, tag_indexes)
# The tensors returned by get_aligned_tokens_and_tags are already on `device`, so no `.to(device)` call is needed.

print('token_indexes:')
print(token_indexes)
print()
print('readable token_indexes:')
for text in token_indexes:
    print(tokeniser.convert_ids_to_tokens(text))
print()
print('token_mask:')
print(token_mask)
print()
print('aligned_tags:')
print(aligned_tags)
print()
print('readable aligned_tags:')
for text in aligned_tags:
    print([tag_set[index] for index in text])
print()
print('tag_mask:')
print(tag_mask)

And now we can preprocess our data.

In [None]:
train_x = [
    'I like it .'.split(' '),
    'I hate it .'.split(' '),
    'I don\'t hate it .'.split(' '),
    'I don\'t like it .'.split(' '),
]
train_y = [
    ['PRON', 'VERB', 'PROP', '.'],
    ['PRON', 'VERB', 'PROP', '.'],
    ['PRON', 'VERB', 'ADP', 'PROP', '.'],
    ['PRON', 'VERB', 'ADP', 'PROP', '.'],
]

tag_set = ['PAD'] + sorted({tag for text in train_y for tag in text})
train_y_indexed = [[tag_set.index(tag) for tag in text] for text in train_y]

(token_indexes, token_mask, aligned_tags, tag_mask) = get_aligned_tokens_and_tags(tokeniser, train_x, train_y_indexed)
# The tensors returned by get_aligned_tokens_and_tags are already on `device`, so no `.to(device)` call is needed.

In [None]:
class Model(torch.nn.Module):

    def __init__(self, num_tags):
        super().__init__()
        self.bert = transformers.BertForMaskedLM.from_pretrained('bert-base-cased')
        self.output_layer = torch.nn.Linear(768, num_tags)
    
    def forward(self, x, mask):
        vecs = self.bert(x, attention_mask=mask, output_hidden_states=True).hidden_states[7]
        return self.output_layer(vecs)

model = Model(len(tag_set))
model.to(device)

optimiser = torch.optim.Adam([
    {'params': model.output_layer.parameters(), 'lr': 0.1},
    {'params': model.bert.parameters(), 'lr': 2E-5}
])

print('epoch', 'error')
train_errors = []
for epoch in range(1, 10+1):
    optimiser.zero_grad()
    model.train()
    logits = model(token_indexes, token_mask)
    train_token_errors = torch.nn.functional.cross_entropy(logits.transpose(1, 2), aligned_tags, reduction='none')
    train_token_errors = train_token_errors.masked_fill(~tag_mask, 0.0)
    train_error = train_token_errors.sum()/tag_mask.sum()
    train_errors.append(train_error.cpu().detach().tolist())
    train_error.backward()
    optimiser.step()
    model.eval()

    if epoch%1 == 0:
        print(epoch, train_errors[-1])
print()

with torch.no_grad():
    print('text', 'output')
    output = torch.softmax(model(token_indexes, token_mask), dim=2)
    for (text, indexes, mask, y) in zip(train_x, token_indexes, tag_mask, output):
        tokens = tokeniser.convert_ids_to_tokens(indexes)
        tags = [tag_set[probs.argmax()] if m else '-' for (probs, m) in zip(y, mask)]
        print('text  :', ' '.join(text))
        print('tokens:', *[f'{token: <6s}' for token in tokens])
        print('tags  :', *[f'{tag: <6s}' for tag in tags])
        print()

(fig, ax) = plt.subplots(1, 1)
ax.set_xlabel('epoch')
ax.set_ylabel('$E$')
ax.plot(range(1, len(train_errors) + 1), train_errors, color='blue', linestyle='-', linewidth=3)
ax.grid()

## Exercises

### Movie reviews with BERT

Redo the movie review sentiment classification task we were doing in earlier topics but this time use BERT.
Note that we don't need to tokenise, pad, or replace out-of-vocabulary tokens.
Also note that your batch size might need to be tiny to fit BERT into your VRAM.
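
One way to work with tiny batches is to loop over groups of row indexes, as in the plain-Python sketch below.
The helper `iter_batches` and its parameters are hypothetical names made up for this illustration, not part of PyTorch or HuggingFace.

```python
import random

def iter_batches(num_items, batch_size, shuffle=True, seed=0):
    '''Yield lists of item indexes, each at most batch_size long.'''
    indexes = list(range(num_items))
    if shuffle:
        random.Random(seed).shuffle(indexes)  # Visit the items in a random order.
    for start in range(0, num_items, batch_size):
        yield indexes[start:start + batch_size]

for batch in iter_batches(num_items=10, batch_size=4, shuffle=False):
    print(batch)
```

Each list of indexes can then be used to pick the matching rows out of the token index, mask, and target tensors before calling the model.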

In [None]:
train_df = pd.read_csv('../data_set/sentiment/train.csv')
test_df = pd.read_csv('../data_set/sentiment/test.csv')

train_x = train_df['text']
train_y = train_df['class']
test_x = test_df['text']
test_y = test_df['class']
categories = ['neg', 'pos']
cat2idx = {cat: i for (i, cat) in enumerate(categories)}

train_y_indexed = torch.tensor(
    train_y.map(cat2idx.get).to_numpy()[:, None],
    dtype=torch.float32, device=device
)
test_y_indexed = test_y.map(cat2idx.get).to_numpy()[:, None]

tokeniser = transformers.BertTokenizer.from_pretrained('bert-base-cased')

token_indexes = tokeniser(train_x.tolist(), return_tensors='pt', padding=True, truncation=True, max_length=512).to(device)
train_x_indexed = token_indexes['input_ids']
train_x_mask = token_indexes['attention_mask']

token_indexes = tokeniser(test_x.tolist(), return_tensors='pt', padding=True, truncation=True, max_length=512).to(device)
test_x_indexed = token_indexes['input_ids']
test_x_mask = token_indexes['attention_mask']