## Exploring tokenization for classification with BERT and RoBERTa

In this section we explore how BERT and RoBERTa tokenize single texts and pairs of texts, and which special tokens they add.

In [13]:
from transformers import AutoTokenizer

#### BERT

In [14]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Let's start with tokenizing a single text.

In [24]:
example_text = ["Here is some text to encode"]
encoded_input = tokenizer(example_text, return_tensors='pt')
for key, value in encoded_input.items():
    print(f"{key}: {value.numpy().tolist()}")

input_ids: [[101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 102]]
token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1]]


As you can see, we get three different things for each text:
- `input_ids` - The vocabulary indices of the tokens.
- `attention_mask` - Tells the Transformer which positions hold real tokens (`1`) and which hold padding (`0`), so that padding tokens are left out of the attention calculations. Since we only have one text, no padding is needed to bring all texts to the same length, so every position has its mask set to `1`.
- `token_type_ids` - As seen in class, BERT was designed to receive pairs of texts, and it learns a separate embedding for each text in the pair. The `token_type_ids` indicate which of the two texts each token belongs to. Here we have a single text, so it is `0` for all tokens.
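
To see what the attention mask looks like once padding is involved, here is a minimal pure-Python sketch. The `pad_batch` helper and the second, shorter ID sequence are made up for illustration; the pad ID `0` is assumed to be the `[PAD]` token of `bert-base-uncased`.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every ID list to the longest one and build the matching attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    padded_ids, padded_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        padded_ids.append(ids + [pad_id] * n_pad)             # real tokens, then padding
        padded_mask.append([1] * len(ids) + [0] * n_pad)      # 1 = real token, 0 = padding
    return padded_ids, padded_mask

# First row: the IDs printed above.
# Second row: a shorter, hypothetical sequence reusing IDs from that output.
batch = [
    [101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 102],
    [101, 2182, 2003, 3793, 102],
]
padded_ids, padded_mask = pad_batch(batch)
for ids, mask in zip(padded_ids, padded_mask):
    print(ids)
    print(mask)
```

In practice we would not do this by hand: passing `padding=True` to the tokenizer produces the padded `input_ids` and the matching `attention_mask` for us.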

In [25]:
input_ids = encoded_input['input_ids'][0]
token_type_ids = encoded_input['token_type_ids'][0]
attention_mask = encoded_input['attention_mask'][0]

# Use `mask` as the loop variable so the `attention_mask` tensor is not overwritten
for input_id, token_type_id, mask in zip(input_ids, token_type_ids, attention_mask):
    print(f"Token with ID {input_id}, corresponding to '{tokenizer.decode(input_id)}' - attention_mask={mask.item()}, token_type_id={token_type_id.item()}")

Token with ID 101, corresponding to '[CLS]' - attention_mask=1, token_type_id=0
Token with ID 2182, corresponding to 'here' - attention_mask=1, token_type_id=0
Token with ID 2003, corresponding to 'is' - attention_mask=1, token_type_id=0
Token with ID 2070, corresponding to 'some' - attention_mask=1, token_type_id=0
Token with ID 3793, corresponding to 'text' - attention_mask=1, token_type_id=0
Token with ID 2000, corresponding to 'to' - attention_mask=1, token_type_id=0
Token with ID 4372, corresponding to 'en' - attention_mask=1, token_type_id=0
Token with ID 16044, corresponding to '##code' - attention_mask=1, token_type_id=0
Token with ID 102, corresponding to '[SEP]' - attention_mask=1, token_type_id=0


Looking at each token individually, we can see that the `[CLS]` and `[SEP]` tokens seen in class were added. Note also that "encode" was split into the two sub-word pieces `en` and `##code`. Let's now check how pairs of texts are tokenized:

In [26]:
example_text = [("Here is some text to encode", "Here is some more text to encode")]
encoded_input = tokenizer(example_text, return_tensors='pt')
for key, value in encoded_input.items():
    print(f"{key}: {value.numpy().tolist()}")

input_ids: [[101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 102, 2182, 2003, 2070, 2062, 3793, 2000, 4372, 16044, 102]]
token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]


In [27]:
input_ids = encoded_input['input_ids'][0]
token_type_ids = encoded_input['token_type_ids'][0]
attention_mask = encoded_input['attention_mask'][0]

# Use `mask` as the loop variable so the `attention_mask` tensor is not overwritten
for input_id, token_type_id, mask in zip(input_ids, token_type_ids, attention_mask):
    print(f"Token with ID {input_id}, corresponding to '{tokenizer.decode(input_id)}' - attention_mask={mask.item()}, token_type_id={token_type_id.item()}")

Token with ID 101, corresponding to '[CLS]' - attention_mask=1, token_type_id=0
Token with ID 2182, corresponding to 'here' - attention_mask=1, token_type_id=0
Token with ID 2003, corresponding to 'is' - attention_mask=1, token_type_id=0
Token with ID 2070, corresponding to 'some' - attention_mask=1, token_type_id=0
Token with ID 3793, corresponding to 'text' - attention_mask=1, token_type_id=0
Token with ID 2000, corresponding to 'to' - attention_mask=1, token_type_id=0
Token with ID 4372, corresponding to 'en' - attention_mask=1, token_type_id=0
Token with ID 16044, corresponding to '##code' - attention_mask=1, token_type_id=0
Token with ID 102, corresponding to '[SEP]' - attention_mask=1, token_type_id=0
Token with ID 2182, corresponding to 'here' - attention_mask=1, token_type_id=1
Token with ID 2003, corresponding to 'is' - attention_mask=1, token_type_id=1
Token with ID 2070, corresponding to 'some' - attention_mask=1, token_type_id=1
Token with ID 2062, corresponding to 'more' -

As you can see, after each text in the pair the `[SEP]` token was added. Further, each token of the first text has `token_type_id=0`, while all tokens of the second text have `token_type_id=1`.
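
The layout of the pair can be reproduced by hand, which makes the rule explicit. This is only a sketch of the pattern observed in the output above, not the tokenizer's actual implementation; `build_bert_pair` is a hypothetical helper, and the IDs are copied from the printed `input_ids`.

```python
CLS_ID, SEP_ID = 101, 102  # IDs of [CLS] and [SEP] in the output above

def build_bert_pair(ids_a, ids_b):
    """Assemble [CLS] A [SEP] B [SEP] and the matching token_type_ids."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # [CLS], text A and the first [SEP] get type 0; text B and the final [SEP] get type 1
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids

text_a = [2182, 2003, 2070, 3793, 2000, 4372, 16044]        # "here is some text to en ##code"
text_b = [2182, 2003, 2070, 2062, 3793, 2000, 4372, 16044]  # "here is some more text to en ##code"
pair_ids, pair_types = build_bert_pair(text_a, text_b)
print(pair_ids)
print(pair_types)
```

The two printed lists match the `input_ids` and `token_type_ids` returned by the tokenizer for this pair.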

#### RoBERTa
We do the same with RoBERTa:

In [28]:
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")

In [29]:
example_text = [("Here is some text to encode", "Here is some more text to encode")]
encoded_input = tokenizer(example_text, return_tensors='pt')
for key, value in encoded_input.items():
    print(f"{key}: {value.numpy().tolist()}")

input_ids: [[0, 11773, 16, 103, 2788, 7, 46855, 2, 2, 11773, 16, 103, 55, 2788, 7, 46855, 2]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]


In [32]:
input_ids = encoded_input['input_ids'][0]
attention_mask = encoded_input['attention_mask'][0]

# Use `mask` as the loop variable so the `attention_mask` tensor is not overwritten
for input_id, mask in zip(input_ids, attention_mask):
    print(f"Token with ID {input_id}, corresponding to '{tokenizer.decode(input_id)}' - attention_mask={mask.item()}")

Token with ID 0, corresponding to '<s>' - attention_mask=1
Token with ID 11773, corresponding to 'Here' - attention_mask=1
Token with ID 16, corresponding to ' is' - attention_mask=1
Token with ID 103, corresponding to ' some' - attention_mask=1
Token with ID 2788, corresponding to ' text' - attention_mask=1
Token with ID 7, corresponding to ' to' - attention_mask=1
Token with ID 46855, corresponding to ' encode' - attention_mask=1
Token with ID 2, corresponding to '</s>' - attention_mask=1
Token with ID 2, corresponding to '</s>' - attention_mask=1
Token with ID 11773, corresponding to 'Here' - attention_mask=1
Token with ID 16, corresponding to ' is' - attention_mask=1
Token with ID 103, corresponding to ' some' - attention_mask=1
Token with ID 55, corresponding to ' more' - attention_mask=1
Token with ID 2788, corresponding to ' text' - attention_mask=1
Token with ID 7, corresponding to ' to' - attention_mask=1
Token with ID 46855, corresponding to ' encode' - attention_mask=1
Token

We can see two things:
1. RoBERTa does not use `token_type_ids`: it does not add separate embeddings to tell the two texts of a pair apart.
2. Instead of `[CLS]` and `[SEP]`, it adds `<s>` and `</s>`, and the two texts of a pair are separated by two consecutive `</s>` tokens. We can use `<s>` as we would use `[CLS]`, since both always appear as the first token.
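
As with BERT, the pair layout can be rebuilt by hand to make the pattern explicit. The sketch below only mirrors what the output above shows; `build_roberta_pair` is a hypothetical helper, and the IDs are copied from the printed `input_ids`.

```python
BOS_ID, EOS_ID = 0, 2  # IDs of <s> and </s> in the output above

def build_roberta_pair(ids_a, ids_b):
    """Assemble <s> A </s></s> B </s>; no token_type_ids are produced."""
    return [BOS_ID] + ids_a + [EOS_ID, EOS_ID] + ids_b + [EOS_ID]

text_a = [11773, 16, 103, 2788, 7, 46855]      # "Here is some text to encode"
text_b = [11773, 16, 103, 55, 2788, 7, 46855]  # "Here is some more text to encode"
roberta_pair_ids = build_roberta_pair(text_a, text_b)
print(roberta_pair_ids)
```

The printed list matches the `input_ids` returned by the RoBERTa tokenizer for this pair.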

## Freezing parameters of a pre-trained model

In [33]:
from transformers import AutoModel

We now want to be able to freeze any layers of the pre-trained transformer during fine-tuning. To do this, we just need to set `param.requires_grad = False` for the correct parameters. Unfortunately, each model may be structured in a different way. Thus, we need to inspect the model we will be using to learn how to access the correct modules for freezing:

In [47]:
bert_model = AutoModel.from_pretrained("bert-base-uncased")
bert_model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

We see the full module list of the loaded model:
1. `embeddings` - This module holds the token embeddings, the positional embeddings and the token type embeddings that distinguish the first and second texts.
2. `encoder` - The transformer itself, with a list of 12 identical transformer layers.
3. `pooler` - An additional layer provided by HuggingFace that pools the sequence of embeddings into a single text embedding. **We will ignore this, as we want to have full control over how we pool the sequence and get the final logits or probabilities**. If you get a model that has already been trained for the task you need, you will want to reuse its `pooler` and final layers, as they are already trained.
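
To give an idea of what pooling the sequence ourselves could look like, here is a minimal sketch of masked mean pooling, one common alternative to taking the first token (`last_hidden_state[:, 0]`). The `mean_pool` helper and the small tensors are made up for illustration; a real `last_hidden_state` from `bert-base-uncased` would have a hidden size of 768.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average the token embeddings of each text, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1)                            # avoid division by zero
    return summed / counts

# Stand-in for an encoder output: 2 texts, 4 tokens each, hidden size 3
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
pool_mask = torch.tensor([[1, 1, 1, 1],
                          [1, 1, 0, 0]])  # the second text has two padding tokens
pooled = mean_pool(hidden, pool_mask)
print(pooled.shape)  # one vector per text
print(pooled)        # the second row averages only its first two tokens
```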

Now we can access each module separately to freeze a specific number of layers:

In [49]:
num_frozen_layers = 5
# Freeze the first `num_frozen_layers` layers of the model
for layer in bert_model.encoder.layer[:num_frozen_layers]:
    for param in layer.parameters():
        param.requires_grad = False
# Freeze initial embeddings
for param in bert_model.embeddings.parameters():
    param.requires_grad = False
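
A quick way to check that the freeze did what we wanted is to count the parameters that still require gradients. The sketch below uses a hypothetical toy model (`ToyModel`) that only mimics the `embeddings` and `encoder.layer` attribute names, so it runs without downloading anything; `count_parameters` is our own helper, not part of `transformers`.

```python
import torch.nn as nn

def count_parameters(module):
    """Return (trainable, total) parameter counts of a module."""
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    total = sum(p.numel() for p in module.parameters())
    return trainable, total

# Hypothetical stand-in that copies only the attribute layout of the real model
class ToyEncoder(nn.Module):
    def __init__(self, num_layers=4, hidden=8):
        super().__init__()
        self.layer = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(num_layers)])

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embeddings = nn.Embedding(10, 8)
        self.encoder = ToyEncoder()

toy_model = ToyModel()
# Same freezing pattern as above: the first 2 layers plus the embeddings
for layer in toy_model.encoder.layer[:2]:
    for param in layer.parameters():
        param.requires_grad = False
for param in toy_model.embeddings.parameters():
    param.requires_grad = False

trainable, total = count_parameters(toy_model)
print(f"{trainable} of {total} parameters remain trainable")  # 144 of 368
```

The same helper can be called on `bert_model` after the freezing loop to confirm that only the unfrozen layers are left trainable.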

Similarly, we can study the modules of the RoBERTa model, and see how we need to access them in order to freeze the desired parameters:

In [48]:
roberta_model = AutoModel.from_pretrained("FacebookAI/roberta-base")
roberta_model

Some weights of RobertaModel were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(50265, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0-11): 12 x RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dr

It has the same structure as BERT! The same code will freeze the parameters correctly.

## Training for a simple text classification task

In [46]:
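# `tokenizer` was last assigned the RoBERTa tokenizer, so load a dedicated BERT tokenizer
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
example_text = ["Here is some text to encode"]
encoded_input = bert_tokenizer(example_text, return_tensors='pt')
output = bert_model(**encoded_input)
output.keys()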

odict_keys(['last_hidden_state', 'pooler_output'])
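
Since we ignore `pooler_output`, a classifier has to be built on top of `last_hidden_state`. The following is a minimal sketch of one possible head that uses the first-token embedding; the `ClsHead` class, the dropout rate and `num_labels=2` are assumptions for illustration, and a random tensor stands in for the real encoder output (1 text, 9 tokens, hidden size 768, as in the example above).

```python
import torch
import torch.nn as nn

class ClsHead(nn.Module):
    """Hypothetical classification head applied to the first-token embedding."""
    def __init__(self, hidden_size, num_labels, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state):
        cls_embedding = last_hidden_state[:, 0]  # (batch, hidden): embedding of [CLS] / <s>
        return self.classifier(self.dropout(cls_embedding))  # (batch, num_labels) logits

head = ClsHead(hidden_size=768, num_labels=2)
fake_hidden = torch.randn(1, 9, 768)  # stand-in for output.last_hidden_state
logits = head(fake_hidden)
print(logits.shape)  # torch.Size([1, 2])
```

With the real model, `head(output.last_hidden_state)` would give one row of logits per input text.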

## Zero-shot text classification with LLMs

## Zero-shot text classification with sentence-transformers