Source: https://huggingface.co/learn/nlp-course/chapter2/5?fw=pt

# Handling multiple sequences (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
# !pip install datasets evaluate transformers[sentencepiece]

In the previous section, we explored the simplest of use cases: running inference on a single short sequence. However, some questions already emerge:

- How do we handle multiple sequences? <span style="color:green">Batching (by default)</span>
- How do we handle multiple sequences of <i>different lengths</i>? <span style="color:green">Padding</span>
- Are vocabulary indices the only inputs that allow a model to work well? <span style="color:green">Attention mask</span>
- Is there such a thing as too long a sequence? <span style="color:green">Yes, use max sequence length to truncate</span>

Let’s see what kinds of problems these questions pose, and how we can solve them using the 🤗 Transformers API.

## Models expect a batch of inputs
In the previous exercise you saw how sequences get translated into lists of numbers. Let’s convert this list of numbers to a tensor and send it to the model:

In [15]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)  # We have skipped one step: "prepare_for_model(...)"
print(f"input_ids.shape = {input_ids.shape}")
# This line will fail: the model expects a batch dimension, but input_ids is 1-D.
model(input_ids)

input_ids.shape = torch.Size([14])


RuntimeError: The size of tensor a (14) must match the size of tensor b (512) at non-singleton dimension 1

Oh no! Why did this fail? We followed the steps from the pipeline in section 2.

<span style="color:red">The problem is that we sent a single sequence to the model, whereas 🤗 Transformers models expect multiple sentences by default</span>. Here we tried to do everything the tokenizer did behind the scenes when we applied it to a sequence. But if you look closely, you’ll see that the tokenizer didn’t just convert the list of input IDs into a tensor, it added a dimension on top of it:

In [18]:
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print (f'tokenized_inputs["input_ids"].shape = {tokenized_inputs["input_ids"].shape}')
print(tokenized_inputs["input_ids"])

tokenized_inputs["input_ids"].shape = torch.Size([1, 16])
tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])
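The tokenizer output differs from the manually built list of 14 IDs in two ways: it has a leading batch dimension, and it holds two extra IDs (16 in total). The following is a minimal sketch in plain Python, with no model required. It assumes that 101 and 102 are the `[CLS]` and `[SEP]` IDs of this checkpoint's vocabulary, as the printed tensor suggests. The helper name `to_model_input` is hypothetical and not part of 🤗 Transformers.

```python
# Token IDs obtained from tokenize() + convert_tokens_to_ids() for the example sentence.
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

CLS_ID = 101  # assumption: [CLS] ID for this vocabulary
SEP_ID = 102  # assumption: [SEP] ID for this vocabulary


def to_model_input(token_ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Hypothetical helper: add special tokens, then wrap in a batch of size 1."""
    with_specials = [cls_id] + list(token_ids) + [sep_id]
    return [with_specials]  # the outer list plays the role of the batch dimension


batch = to_model_input(ids)
print(len(batch), len(batch[0]))  # batch size and sequence length
```

Passing `torch.tensor(batch)` to the model would then supply a 2-D tensor, the layout the tokenizer produces with `return_tensors="pt"`.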


<span style="color:green">Let’s try again and add a new dimension</span>. The cell below prints the input IDs as well as the resulting logits.

In [20]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids]) # Add a new dimension
print("Input IDs:", input_ids)
print (f"input_ids.shape = {input_ids.shape}")

output = model(input_ids)
print("Logits:", output.logits)

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
input_ids.shape = torch.Size([1, 14])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


<span style="color:blue"><i>Batching</i> is the act of sending multiple sentences through the model, all at once.</span> If you only have one sentence, you can just build a batch with a single sequence, or repeat that sequence to fill a larger batch, as in this cell:

In [23]:
batched_ids = [ids, ids]
input_ids = torch.tensor(batched_ids)
output = model(input_ids)
print("Logits:", output.logits)

Logits: tensor([[-2.7276,  2.8789],
        [-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


This is a batch of two identical sequences!

Batching allows the model to work when you feed it multiple sentences. Using multiple sequences is just as simple as building a batch with a single sequence. <span style="color:red">There’s a second issue, though. When you’re trying to batch together two (or more) sentences, they might be of different lengths. If you’ve ever worked with tensors before, you know that they need to be of rectangular shape, so you won’t be able to convert the list of input IDs into a tensor directly</span>. <span style="color:green">To work around this problem, we usually <b><i>pad</i></b> the inputs.</span>

## Padding the inputs

The following list of lists cannot be converted to a tensor:

In [25]:
batched_ids = [
    [200, 200, 200],
    [200, 200]
]

torch.tensor(batched_ids)

ValueError: expected sequence of length 3 at dim 1 (got 2)

<span style="color:green">In order to work around this, we’ll use <i>padding</i> to make our tensors have a rectangular shape. Padding makes sure all our sentences have the same length by adding a special word called the <i>padding token</i> to the sentences with fewer values</span>. For example, if you have 10 sentences with 10 words and 1 sentence with 20 words, padding will ensure all the sentences have 20 words. In our example, the resulting tensor looks like this:

In [27]:
padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]

torch.tensor(batched_ids)

tensor([[200, 200, 200],
        [200, 200, 100]])
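The same padding step can be sketched for any number of sequences. The helper below is hypothetical (`pad_batch` is not a 🤗 Transformers function) and works on plain Python lists, so the result can be passed to `torch.tensor` afterwards. It pads on the right, which is an assumption that matches the example above.

```python
def pad_batch(sequences, pad_id):
    """Hypothetical helper: right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]


padded = pad_batch([[200, 200, 200], [200, 200]], pad_id=100)
print(padded)  # [[200, 200, 200], [200, 200, 100]]
```

Every row now has the same length, so the nested list is rectangular and converts to a tensor without error.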

<span style="color:green">The padding token ID can be found in `tokenizer.pad_token_id`</span>. Let’s use it and send our two sentences through the model individually and batched together:

In [29]:
print (tokenizer.pad_token_id)

0


In [36]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

print (model.num_labels, model.config.label2id, "\n")

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

2 {'NEGATIVE': 0, 'POSITIVE': 1} 

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)


<span style="color:red">There’s something wrong with the logits in our batched predictions: the second row should be the same as the logits for the second sentence, but we’ve got completely different values!</span>

<span style="color:blue">This is because the <b>key feature of Transformer models is attention layers that <i>contextualize</i> each token</b>. These will take into account the padding tokens since they attend to all of the tokens of a sequence. To get the same result when passing individual sentences of different lengths through the model or when passing a batch with the same sentences and padding applied, we need to tell those attention layers to ignore the padding tokens. This is done by using an <b>attention mask.</b></span>

## Attention masks

<span style="color:green"><i>Attention masks</i> are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).</span>
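To see why a 0 in the mask makes a token invisible, it helps to look at a single softmax over attention scores. The sketch below is a simplified illustration in plain Python, not the library's implementation: the scores are made up, and masked positions are set to negative infinity before the softmax so that they receive a weight of exactly 0.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_softmax(scores, mask):
    """Softmax in which positions with mask == 0 are set to -inf and get weight 0."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    return softmax(masked)


scores_unpadded = [2.0, 1.0]      # made-up scores for two real tokens
scores_padded = [2.0, 1.0, 0.5]   # same scores plus one for a padding token

reference = softmax(scores_unpadded)                   # what the unpadded sequence gets
no_mask = softmax(scores_padded)                       # padding steals some weight
with_mask = masked_softmax(scores_padded, [1, 1, 0])   # padding gets weight 0

print(reference, no_mask, with_mask)
```

Without the mask, the padding position takes a share of the attention weight and shifts the weights of the real tokens. With the mask, the real tokens get the same weights as in the unpadded sequence.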

Let’s complete the previous example with an attention mask:

In [39]:
sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)


In [37]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)


Now we get the same logits for the second sentence in the batch.

Notice how the last value of the second sequence is a padding ID, and how the matching position in the attention mask is 0.

<img src="images/attention-mask-1.png" style="width:650px;" title="Padding">
<img src="images/attention-mask-2.png" style="width:700px;" title="Padding and attention mask">
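Padding and mask construction go hand in hand, so they can be done in one pass. The helper below is a hypothetical sketch (`pad_with_mask` is not part of 🤗 Transformers). It builds the mask from the original sequence lengths rather than by comparing IDs against the padding ID, so a real token that happens to share the padding ID is never masked by mistake.

```python
def pad_with_mask(sequences, pad_id):
    """Hypothetical helper: right-pad sequences and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# pad_id=0 matches the tokenizer.pad_token_id printed earlier for this checkpoint.
input_ids, attention_mask = pad_with_mask([[200, 200, 200], [200, 200]], pad_id=0)
print(input_ids)       # [[200, 200, 200], [200, 200, 0]]
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```

Both lists have the same rectangular shape, so each can be wrapped in `torch.tensor` and passed to the model as `input_ids` and `attention_mask`.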

## Longer sequences

<span style="color:red">With Transformer models, there is a limit to the length of the sequences we can pass to the model. Most models handle sequences of up to 512 or 1024 tokens, and will crash when asked to process longer sequences</span>. <span style="color:green">There are two solutions to this problem:</span>

- <span style="color:green">Use a model with a longer supported sequence length.</span>
- <span style="color:green">Truncate your sequences.</span>

Models have different supported sequence lengths, and some specialize in handling very long sequences. [Longformer](https://huggingface.co/transformers/model_doc/longformer.html) is one example, and another is [LED](https://huggingface.co/transformers/model_doc/led.html). If you’re working on a task that requires very long sequences, we recommend you take a look at those models.

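Otherwise, we recommend you truncate your sequences to a maximum number of tokens, held here in a `max_sequence_length` variable. Note that the slice is applied to the list of token IDs, not to the raw string, since the model's limit is counted in tokens:

In [40]:
max_sequence_length = 8
print(ids)
trunc_ids = ids[:max_sequence_length]  # Truncate token IDs, not characters
print(trunc_ids)

[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662]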
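When truncation is applied to token IDs that will carry special tokens, a plain slice would cut off the closing one. The sketch below is a hypothetical helper, not a 🤗 Transformers function. It assumes 101 and 102 are the `[CLS]` and `[SEP]` IDs for this checkpoint and reserves two positions for them, so the result never exceeds `max_length`.

```python
def truncate_ids(token_ids, max_length, cls_id=101, sep_id=102):
    """Hypothetical helper: cut to max_length tokens in total, keeping [CLS] and [SEP]."""
    if max_length < 2:
        raise ValueError("max_length must leave room for both special tokens")
    body = list(token_ids)[: max_length - 2]  # reserve two slots for the special tokens
    return [cls_id] + body + [sep_id]


# Token IDs of the example sentence, without special tokens.
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

truncated = truncate_ids(ids, max_length=8)
print(truncated)  # [101, 1045, 1005, 2310, 2042, 3403, 2005, 102]
```

In practice the tokenizer can do this itself: calling it with `truncation=True` and a `max_length` value truncates while accounting for special tokens.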
