# Week 9: Using Transformer Models


## Getting started
If working on your own machine, make sure the Hugging Face `transformers` package is installed:

`conda install -c huggingface transformers`

or

`pip install transformers`

Of course, if working on Google Colab, you won't need to do this. Whatever environment you are using, check whether the following code runs. It should output a negative label with a high score!


In [1]:
from transformers import pipeline
print(pipeline('sentiment-analysis')('I hate you'))

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.9991129040718079}]


The following is adapted from an older version of the Hugging Face quickstart to transformers tutorial:
https://huggingface.co/transformers/v2.4.0/quickstart.html
We will be looking at the BERT introduction (but feel free to have a look at GPT-2 etc. as well!)

First of all we need some key imports. We are going to be using the pre-trained `bert-base-uncased` model, so this cell instantiates a tokenizer for this model. Logging is also switched on so we can see more of what's going on. The first time you run it, the tokenizer's vocabulary will be downloaded and cached. The cached version will be used on subsequent runs, if it is available (not on Google Colab).

In [2]:
import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM

# OPTIONAL: if you want to have more information on what's happening under the hood, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')



Now we are going to tokenize some text.  This will demonstrate the 'wordpiece' vocabulary used by BERT as well as the fact that we need to introduce special `[CLS]` and `[SEP]` tokens in the input.

In [3]:
# Tokenize input
text = "[CLS] Who was elected as British prime minister in 1951? [SEP] Sir Winston Leonard Spencer Churchill was a British politician, statesman, army officer and writer, who was Prime Minister of the United Kingdom from 1940 to 1945 and again from 1951 to 1955. [SEP]"
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)

['[CLS]', 'who', 'was', 'elected', 'as', 'british', 'prime', 'minister', 'in', '1951', '?', '[SEP]', 'sir', 'winston', 'leonard', 'spencer', 'churchill', 'was', 'a', 'british', 'politician', ',', 'statesman', ',', 'army', 'officer', 'and', 'writer', ',', 'who', 'was', 'prime', 'minister', 'of', 'the', 'united', 'kingdom', 'from', '1940', 'to', '1945', 'and', 'again', 'from', '1951', 'to', '1955', '.', '[SEP]']


In [4]:
# Tokenize input
text = "[CLS] What are igneous rocks? [SEP] Igneous rocks form when hot , molten rock crystallizes and solidifies. [SEP] "
tokenized_text= tokenizer.tokenize(text)
print(tokenized_text)

['[CLS]', 'what', 'are', 'ign', '##eous', 'rocks', '?', '[SEP]', 'ign', '##eous', 'rocks', 'form', 'when', 'hot', ',', 'molten', 'rock', 'crystal', '##li', '##zes', 'and', 'solid', '##ifies', '.', '[SEP]']


Note that the tokenizer is not breaking down all words according to their morphology -- only rare words.  Reasonably frequent words such as `elected` are left as whole words.  Rarer words such as `solidifies` are broken down.
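
To make the splitting behaviour concrete, here is a minimal sketch of greedy longest-match-first splitting, which is the general idea behind wordpiece tokenization. The function name `wordpiece` and the tiny `toy_vocab` are hypothetical and only for illustration; the real tokenizer uses the full `bert-base-uncased` vocabulary and also handles punctuation and casing.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lower-cased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matches, so the whole word is treated as unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real bert-base-uncased vocabulary
toy_vocab = {"elected", "solid", "##ifies", "ign", "##eous", "rocks"}

for w in ["elected", "solidifies", "igneous"]:
    print(w, wordpiece(w, toy_vocab))
```

A word that is in the vocabulary comes back whole, whereas a word that is not is covered by the longest pieces that are.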

Now we are going to mask out one of the words in the text. For the purposes of this demonstration, I have chosen token 11 (`form`, counting from 0) but you could try different tokens. Remember that during training the tokens to mask are chosen randomly.
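
For comparison, random selection of the kind used in training could be sketched roughly as follows. This is a simplified illustration, not the actual pre-training code: the function `random_mask` and the default `mask_prob=0.15` are assumptions made here, and the sketch always substitutes `[MASK]`, whereas BERT's pre-training recipe sometimes keeps the original token or swaps in a random one.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def random_mask(tokens, mask_prob=0.15, seed=None):
    """Return a masked copy of tokens plus the positions that were masked."""
    rng = random.Random(seed)
    # Special tokens are never candidates for masking
    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL_TOKENS]
    # Mask roughly mask_prob of the candidates, but at least one if any exist
    n_to_mask = min(len(candidates), max(1, round(mask_prob * len(candidates))))
    positions = sorted(rng.sample(candidates, n_to_mask))
    masked = list(tokens)
    for i in positions:
        masked[i] = "[MASK]"
    return masked, positions


tokens = ['[CLS]', 'what', 'are', 'ign', '##eous', 'rocks', '?', '[SEP]']
masked, positions = random_mask(tokens, seed=0)
print(masked, positions)
```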


In [5]:
# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 11
tokenized_text[masked_index] = '[MASK]'
print(tokenized_text)

['[CLS]', 'what', 'are', 'ign', '##eous', 'rocks', '?', '[SEP]', 'ign', '##eous', 'rocks', '[MASK]', 'when', 'hot', ',', 'molten', 'rock', 'crystal', '##li', '##zes', 'and', 'solid', '##ifies', '.', '[SEP]']


In [6]:
print(len(tokenized_text))

25


We are now going to try to use the masked language model to predict this word.

First we need to convert the tokens into a list of vocabulary ids.

In [7]:
# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
print(indexed_tokens)

[101, 2054, 2024, 16270, 14769, 5749, 1029, 102, 16270, 14769, 5749, 103, 2043, 2980, 1010, 23548, 2600, 6121, 3669, 11254, 1998, 5024, 14144, 1012, 102]


We need segment ids to define whether a token is in the first or second sentence.

In [8]:
def make_segment_ids(list_of_tokens):
    #this function assumes that up to and including the first '[SEP]' is the first segment, anything afterwards is the second segment
    current_id=0
    segment_ids=[]
    for token in list_of_tokens:
        segment_ids.append(current_id)
        if token == '[SEP]':
            current_id=1
    return segment_ids

segment_ids=make_segment_ids(tokenized_text)
print(segment_ids)

[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [9]:
# Convert inputs to PyTorch tensors
#this just wraps things up in multi-dimensional tensors rather than as flat lists.
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segment_ids])
print(tokens_tensor)
print(segments_tensors)

tensor([[  101,  2054,  2024, 16270, 14769,  5749,  1029,   102, 16270, 14769,
          5749,   103,  2043,  2980,  1010, 23548,  2600,  6121,  3669, 11254,
          1998,  5024, 14144,  1012,   102]])
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1]])


Now we need to encode the input using the `bert-base-uncased` model.


In [10]:
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')

# Set the model in evaluation mode to deactivate the DropOut modules
# This is IMPORTANT to have reproducible results during evaluation!
model.eval()

# If you have a GPU, put everything on cuda - otherwise comment this out to run on CPU
#tokens_tensor = tokens_tensor.to('cuda')
#segments_tensors = segments_tensors.to('cuda')
#model.to('cuda')

# Predict hidden states features for each layer
with torch.no_grad():
    # See the models docstrings for the detail of the inputs
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    # Transformers models return an output object that can be indexed like a tuple.
    # See the models docstrings for the detail of all the outputs
    # In our case, the first element of outputs is the output of the last layer of the Bert model (all tokens)
    # The second element, outputs[1], is a "pooled_output" representation of the CLS token only (rather than all tokens).
    # It involves an extra layer, which is why it is not the same as the CLS row of outputs[0]!
    encoded_layers = outputs[0]
# We have encoded our input sequence in a FloatTensor of shape (batch size, sequence length, model hidden dimension)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [11]:
# We have encoded our input sequence in a FloatTensor of shape (batch size, sequence length, model hidden dimension)
print(encoded_layers.shape)

torch.Size([1, 25, 768])


In [12]:
encoded_layers

tensor([[[-0.6674,  0.9458, -0.5036,  ..., -0.5302,  0.1874,  0.6336],
         [ 0.3067, -0.0335,  0.0011,  ...,  0.1267, -0.3640, -0.9253],
         [ 0.4349,  0.3408,  0.5464,  ..., -0.4371, -0.1396,  0.6499],
         ...,
         [ 0.2870,  0.5662, -0.0956,  ..., -0.5547, -0.4900, -0.2980],
         [ 0.6253,  0.0615, -0.2790,  ...,  0.0038, -0.5781, -0.4948],
         [ 0.6399,  0.0483, -0.2593,  ...,  0.0093, -0.5864, -0.4718]]])

In [13]:
#outputs[1] is a representation of the CLS token of shape (batch size, model hidden dimension)
outputs[1].shape

torch.Size([1, 768])

In [14]:
outputs[1]

tensor([[-0.9920, -0.8534, -0.9981,  0.9898,  0.9694, -0.7778,  0.9918,  0.7435,
         -0.9951, -1.0000, -0.9537,  0.9990,  0.9948,  0.9125,  0.9905, -0.9797,
         -0.9501, -0.9055,  0.7429, -0.9396,  0.9539,  1.0000, -0.7118,  0.8053,
          0.8925,  1.0000, -0.9784,  0.9881,  0.9908,  0.8765, -0.9697,  0.7592,
         -0.9981, -0.6847, -0.9970, -0.9994,  0.9093, -0.9470, -0.6564, -0.6737,
         -0.9759,  0.8515,  1.0000,  0.4217,  0.8609, -0.7738, -1.0000,  0.7504,
         -0.9733,  0.9991,  0.9951,  0.9958,  0.8245,  0.9274,  0.9106, -0.8983,
          0.6171,  0.6578, -0.7522, -0.9329, -0.8695,  0.8373, -0.9920, -0.9770,
          0.9985,  0.9901, -0.7740, -0.8020, -0.7395,  0.4467,  0.9907,  0.7470,
         -0.6924, -0.9460,  0.9862,  0.7678, -0.8415,  1.0000, -0.9650, -0.9968,
          0.9861,  0.9893,  0.8187, -0.9405,  0.9565, -1.0000,  0.9414, -0.6060,
         -0.9974,  0.7915,  0.9118, -0.7666,  0.9441,  0.8653, -0.8785, -0.9049,
         -0.8647, -0.9939, -

We can also predict the masked token as follows. `BertForMaskedLM` adds a language-modelling head on top of the last layer of the BERT model, so `outputs[0]` now holds a score for every vocabulary item at every position. We then find the token id which maximises the score at the masked position.

In [15]:
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# If you have a GPU, put everything on cuda
#tokens_tensor = tokens_tensor.to('cuda')
#segments_tensors = segments_tensors.to('cuda')
#model.to('cuda')

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    predictions = outputs[0]

        
# find the token id which maximises the prediction for the masked token and then convert this back to a word
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


form


Did BERT correctly predict the masked token?
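
Looking only at the single best token hides how confident the model was and what the runners-up were. The sketch below shows one way to list the top few candidates with their probabilities. To keep it runnable without downloading a model, it uses a hypothetical five-word vocabulary (`toy_vocab`) and made-up scores (`toy_logits`) for one masked position.

```python
import torch

# Hypothetical stand-ins: a 5-word vocabulary and invented scores for one masked position
toy_vocab = ["form", "formed", "occur", "are", "melt"]
toy_logits = torch.tensor([4.0, 2.5, 2.0, 0.5, -1.0])

probs = torch.softmax(toy_logits, dim=-1)  # turn raw scores into probabilities
top = torch.topk(probs, k=3)               # the 3 largest probabilities and their ids

for p, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{toy_vocab[idx]}\t{p:.3f}")
```

With the real model, `predictions[0, masked_index]` would take the place of `toy_logits`, and `tokenizer.convert_ids_to_tokens` would take the place of the list lookup.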

### Exercise 0
Mask each token in turn and see what BERT predicts.   How accurate are its predictions?  As an extension, you could look at masking multiple words in the sequence.

## Representing Sentential Meaning
We are going to be looking at different strategies for representing sentential meaning:
* CLS token representation
* centroid/sum of output embeddings

The file `examples.txt` contains some example sentences.

### Exercise 1
Read in the sentences and store them as a list of sentences.  Add `[CLS]` and `[SEP]` tokens to the beginning and end of each and then pass them through the bert-base-uncased tokenizer.

When encoding sentences, it is actually more typical to pool the hidden states from one particular layer (at depth n) rather than use the output layer.  We can access the hidden states of every layer by passing `output_hidden_states=True` to the model.

In [16]:
model = BertModel.from_pretrained('bert-base-uncased')


model.eval()

# Predict hidden states features for each layer
with torch.no_grad():
    # See the models docstrings for the detail of the inputs
    outputs = model(tokens_tensor, token_type_ids=segments_tensors, output_hidden_states=True)



In [17]:
outputs.to_tuple()

(tensor([[[-0.6674,  0.9458, -0.5036,  ..., -0.5302,  0.1874,  0.6336],
          [ 0.3067, -0.0335,  0.0011,  ...,  0.1267, -0.3640, -0.9253],
          [ 0.4349,  0.3408,  0.5464,  ..., -0.4371, -0.1396,  0.6499],
          ...,
          [ 0.2870,  0.5662, -0.0956,  ..., -0.5547, -0.4900, -0.2980],
          [ 0.6253,  0.0615, -0.2790,  ...,  0.0038, -0.5781, -0.4948],
          [ 0.6399,  0.0483, -0.2593,  ...,  0.0093, -0.5864, -0.4718]]]),
 tensor([[-0.9920, -0.8534, -0.9981,  0.9898,  0.9694, -0.7778,  0.9918,  0.7435,
          -0.9951, -1.0000, -0.9537,  0.9990,  0.9948,  0.9125,  0.9905, -0.9797,
          -0.9501, -0.9055,  0.7429, -0.9396,  0.9539,  1.0000, -0.7118,  0.8053,
           0.8925,  1.0000, -0.9784,  0.9881,  0.9908,  0.8765, -0.9697,  0.7592,
          -0.9981, -0.6847, -0.9970, -0.9994,  0.9093, -0.9470, -0.6564, -0.6737,
          -0.9759,  0.8515,  1.0000,  0.4217,  0.8609, -0.7738, -1.0000,  0.7504,
          -0.9733,  0.9991,  0.9951,  0.9958,  0.8245,  0.

In [18]:
print(len(outputs))
for i in range(len(outputs)):
    try:
        print(outputs[i].shape)
    except AttributeError:
        # outputs[2] is a tuple of tensors, so it has no .shape attribute
        print(len(outputs[i]))

3
torch.Size([1, 25, 768])
torch.Size([1, 768])
13


Here:
* outputs[0] contains the output representation of each token
* outputs[1] is a representation of the first token (after being put through an additional layer)
* outputs[2] is a tuple.  Each element is the hidden layer at depth n.  There are 13 elements: the embedding output followed by the output of each of the 12 layers.  If we want the last layer then we need outputs[2][-1]
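
The "additional layer" behind outputs[1] is a dense layer followed by a tanh, applied to the first token only, which is why every value printed for outputs[1] earlier lies between -1 and 1. A minimal sketch of that computation is given below. The tensor `last_hidden_state` is a hypothetical random stand-in for outputs[0], the hidden size is shrunk to 8, and the dense layer has random weights, whereas the real model loads trained ones.

```python
import torch

torch.manual_seed(0)
hidden_size = 8  # bert-base-uncased uses 768; a small value keeps the sketch readable

# Hypothetical stand-in for outputs[0]: (batch size, sequence length, hidden size)
last_hidden_state = torch.randn(1, 5, hidden_size)

# The extra layer: dense + tanh, applied to the first ([CLS]) token only.
# Random weights here; the real model uses trained weights.
dense = torch.nn.Linear(hidden_size, hidden_size)
with torch.no_grad():
    cls_vector = last_hidden_state[:, 0]    # shape (1, hidden_size)
    pooled = torch.tanh(dense(cls_vector))  # shape (1, hidden_size)

print(cls_vector.shape, pooled.shape)
```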


In [19]:
#outputs[2][-1] is the last hidden layer also output as outputs[0]
outputs[2][-1]

tensor([[[-0.6674,  0.9458, -0.5036,  ..., -0.5302,  0.1874,  0.6336],
         [ 0.3067, -0.0335,  0.0011,  ...,  0.1267, -0.3640, -0.9253],
         [ 0.4349,  0.3408,  0.5464,  ..., -0.4371, -0.1396,  0.6499],
         ...,
         [ 0.2870,  0.5662, -0.0956,  ..., -0.5547, -0.4900, -0.2980],
         [ 0.6253,  0.0615, -0.2790,  ...,  0.0038, -0.5781, -0.4948],
         [ 0.6399,  0.0483, -0.2593,  ...,  0.0093, -0.5864, -0.4718]]])

In [20]:
#so if you want the penultimate hidden layer you need outputs[2][-2]
outputs[2][-2]

tensor([[[-3.4760e-01,  6.1722e-01, -6.2986e-01,  ..., -5.5268e-02,
          -4.7414e-01,  1.1962e+00],
         [ 2.4667e-01, -1.1385e-01, -3.2224e-01,  ...,  3.9890e-01,
          -4.0355e-01, -1.5206e+00],
         [-1.5780e-01,  2.0670e-01, -3.2949e-01,  ..., -7.0685e-01,
          -8.9300e-02,  1.3601e+00],
         ...,
         [ 3.1460e-01,  4.2031e-01,  1.2169e-01,  ..., -4.4048e-01,
          -5.6069e-01, -1.8098e-01],
         [ 4.6541e-02,  1.3745e-02, -3.7530e-02,  ...,  1.9264e-02,
          -1.4171e-02, -4.5845e-03],
         [ 4.5347e-02,  8.7523e-03, -3.7831e-02,  ...,  1.8386e-02,
          -1.5910e-02, -1.1673e-03]]])

### Exercise 2
* Encode each sentence using the output representation for its CLS token. Note that you do not need to mask the CLS token; we are just interested in the output layer embedding for this token.  You can use outputs[0][0, 0] (the first token of the first sequence in the batch) or outputs[1] as a representation of the CLS token, but you will get different results because outputs[1] has gone through an additional layer (trained for next sentence prediction during pre-training, and for classification if the model has been fine-tuned).
* Use cosine similarity to determine all pairwise similarities for the sentences.
* Identify the 10 most similar pairs of sentences using this sentence encoding.

In [21]:
## this is a handy way of finding the cosine similarity between two tensors
# see https://pytorch.org/docs/stable/generated/torch.nn.CosineSimilarity.html
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)

#you use this as:
print(encoded_layers[0,0],encoded_layers[0,1])
output=cos(encoded_layers[0,0],encoded_layers[0,1])
print(output.item())

tensor([-6.6741e-01,  9.4580e-01, -5.0361e-01,  7.6330e-02, -1.2090e+00,
         1.2795e-01,  9.3474e-01,  1.1574e+00, -1.9790e-01,  9.3847e-02,
        -7.1620e-01, -4.7296e-01, -2.1994e-01,  1.0530e+00,  3.7979e-01,
        -1.6461e-01, -2.0667e-01,  1.0136e+00,  3.5675e-01, -2.8986e-02,
         1.4404e-01, -2.0457e-01, -1.0594e-01,  9.7709e-03,  9.4189e-03,
        -5.7983e-01, -1.4214e-01, -4.3860e-01,  6.5073e-01, -4.3919e-01,
        -2.3003e-01,  1.1294e+00, -5.0243e-01, -1.7202e-01,  3.5664e-01,
        -6.0078e-02,  1.5559e-01, -5.6304e-02,  3.3006e-01,  1.5681e-01,
        -1.7278e-01,  1.3906e-01,  5.0966e-01,  3.3143e-01, -1.8095e-01,
        -3.5913e-01, -1.8010e+00, -2.8397e-01, -3.3440e-01, -3.4394e-01,
         1.7959e-01, -6.9789e-02,  8.8394e-01,  4.0845e-01, -3.3890e-01,
         1.4026e+00, -9.4726e-01,  3.8425e-01,  1.7327e-01,  9.4517e-01,
        -6.2579e-02, -7.5702e-02, -6.6680e-01, -4.6924e-01,  4.4900e-01,
         9.2037e-01, -2.3056e-01,  8.5764e-01, -8.6
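
Once every sentence has a vector, ranking all pairs is a matter of scoring each combination once and sorting. Here is a rough sketch using plain Python lists so that it runs on its own. The names `cosine`, `most_similar_pairs` and `toy_vectors` are hypothetical, and the 3-dimensional vectors merely stand in for 768-dimensional BERT encodings.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_pairs(vectors, k):
    """Return the k highest-scoring (similarity, i, j) triples with i < j."""
    scored = [(cosine(vectors[i], vectors[j]), i, j)
              for i, j in combinations(range(len(vectors)), 2)]
    return sorted(scored, reverse=True)[:k]


# Hypothetical toy "sentence vectors" standing in for real BERT encodings
toy_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.2]]
for sim, i, j in most_similar_pairs(toy_vectors, k=2):
    print(i, j, round(sim, 3))
```

`combinations` scores each unordered pair exactly once, so a sentence is never compared with itself and no pair is counted twice.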

### Exercise 3
a) Repeat exercise 2 but use the centroid of all of the output embeddings as the representation of a sentence.

b) Experiment with pooling over different hidden layers.  The penultimate layer (-2) is often felt to be a good choice, as it is far enough away from the original uncontextualised word embeddings but also not too close to the output predictions.
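
As a starting point, the centroid at a chosen layer could be computed along the following lines. This is only a sketch: `hidden_states` is a hypothetical random stand-in for outputs[2] so that the code runs without a model, and the helper `centroid` is a name introduced here. It averages over every token, including `[CLS]` and `[SEP]`; whether to leave those out is a choice you may want to experiment with.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for outputs[2]: 13 tensors of shape (batch size, sequence length, hidden size).
# Random values are used so the sketch runs without downloading a model.
hidden_states = tuple(torch.randn(1, 25, 768) for _ in range(13))


def centroid(hidden_states, layer=-1):
    """Mean of the token embeddings at the chosen layer, for the first item in the batch."""
    return hidden_states[layer][0].mean(dim=0)


last = centroid(hidden_states, layer=-1)
penultimate = centroid(hidden_states, layer=-2)
print(last.shape, penultimate.shape)
```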

### Extension 1
The MRPC.zip file contains training, dev and test splits for the Microsoft Research Paraphrase Corpus.  In this corpus, a quality of '1' indicates that the two sentences are considered to be paraphrases and '0' indicates that they are not.

Can you build a classifier on top of the BERT pre-trained model, trained on the training split of MRPC, which predicts whether 2 sentences are paraphrases or not?

Note this does not require you to fine-tune the BERT model.  You can use outputs from BERT as input to your separate classifier.  I would suggest a single neural layer which uses the representation from exercise 2 or 3 as input, built using scikit-learn or torch.   
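
To show the overall shape of such a classifier, here is a minimal torch sketch of a single linear layer trained on fixed features. Everything data-related is a hypothetical stand-in: `u` and `v` are random vectors in place of BERT sentence encodings, and the labels are random in place of the MRPC quality column, so nothing meaningful is learned. The pair representation (both vectors concatenated with their absolute difference) is one possible choice, not something the exercise prescribes.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for real data: random "sentence vectors" for 200 sentence pairs.
# In practice u and v would come from BERT and labels from the MRPC quality column.
n_pairs, dim = 200, 16
u = torch.randn(n_pairs, dim)
v = torch.randn(n_pairs, dim)
labels = (torch.rand(n_pairs) > 0.5).float()

# One possible pair representation (an assumption): [u, v, |u - v|]
features = torch.cat([u, v, (u - v).abs()], dim=1)

classifier = torch.nn.Linear(features.shape[1], 1)  # a single neural layer
loss_fn = torch.nn.BCEWithLogitsLoss()
optimiser = torch.optim.SGD(classifier.parameters(), lr=0.05)

losses = []
for epoch in range(100):
    optimiser.zero_grad()
    logits = classifier(features).squeeze(1)
    loss = loss_fn(logits, labels)
    loss.backward()
    optimiser.step()
    losses.append(loss.item())

# Threshold the predicted probabilities at 0.5 to get 0/1 paraphrase labels
with torch.no_grad():
    predictions = (torch.sigmoid(classifier(features).squeeze(1)) > 0.5).long()
print(losses[0], losses[-1], predictions.shape)
```

With real data you would fit on the training split only, tune on dev, and report accuracy on test.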