# Handling multiple sequences (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

## Models expect a batch of inputs

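In practice, you will often want to group sentences of different lengths into a single batch. Before getting there, note that even a single sequence has to be shaped as a batch: passing a plain list of IDs to the model as a 1-D tensor results in an error.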

In [11]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
print(f"Tokens: {tokens}\n")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"IDs: {ids}\n")
input_ids = torch.tensor(ids)
print(f"Input IDs: {input_ids}\n")
# The following line will fail.
# model(input_ids)

Tokens: ['i', "'", 've', 'been', 'waiting', 'for', 'a', 'hugging', '##face', 'course', 'my', 'whole', 'life', '.']

IDs: [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

Input IDs: tensor([ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
         2026,  2878,  2166,  1012])



The problem is that we sent a **single sequence** to the model.

Transformers models expect **multiple sentences by default**.

In [5]:
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])


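Here, the tokenizer:
1. Converted the list of input IDs into a tensor.
2. Added a dimension on top of it.
3. Added the special tokens `[CLS]` (ID 101) and `[SEP]` (ID 102) around the sequence, which is why this tensor has two more values than the `ids` list printed earlier.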
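
To make the added dimension concrete, here is a minimal sketch that compares tensor shapes with and without the batch dimension. The three IDs are a shortened illustrative input, and `unsqueeze(0)` is just one way to add the dimension by hand; it is not a claim about what the tokenizer does internally.

```python
import torch

# A few illustrative token IDs (the first three values of the `ids` list above)
ids = [1045, 1005, 2310]

flat = torch.tensor(ids)     # 1-D tensor: a single sequence, no batch dimension
batched = flat.unsqueeze(0)  # 2-D tensor: a batch that contains one sequence

print(flat.shape)     # torch.Size([3])
print(batched.shape)  # torch.Size([1, 3])

# Wrapping the list in another list builds the same batched tensor
same = torch.tensor([ids])
print(torch.equal(batched, same))  # True
```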

*Batching* is the act of sending multiple sentences through the model, all at once. 

If you only have one sentence, you can just build a batch with a single sequence:

In [12]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
print(f"Tokens: {tokens}\n")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"IDs: {ids}\n")

# Wrap `ids` in a list to add the batch dimension
input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

Tokens: ['i', "'", 've', 'been', 'waiting', 'for', 'a', 'hugging', '##face', 'course', 'my', 'whole', 'life', '.']

IDs: [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


### Exercise

Convert this `batched_ids` list into a tensor and pass it through your model. Check that you obtain the same logits as before (but twice)!

In [26]:
batched_ids = [ids, ids]
batched_input_ids = torch.tensor(batched_ids)
print(f"Batched input IDs: {batched_input_ids}\n")

output = model(batched_input_ids)

print(f"Logits: {output.logits}")

Batched input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012],
        [ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])

Logits: tensor([[-2.7276,  2.8789],
        [-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


### Padding the Inputs

The following list of lists cannot be converted to a tensor because its rows have different lengths:

In [None]:
batched_ids = [
    [200, 200, 200],
    [200, 200]
]
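
As a quick check of that claim, the sketch below tries the conversion and catches the resulting error. The names `ragged_ids` and `conversion_failed` are illustrative only, and the exact wording of the error message may vary between PyTorch versions.

```python
import torch

ragged_ids = [
    [200, 200, 200],
    [200, 200],
]

try:
    torch.tensor(ragged_ids)
    conversion_failed = False
except ValueError as error:
    # PyTorch refuses to build a tensor whose rows have different lengths
    conversion_failed = True
    print(f"Conversion failed: {error}")
```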

We’ll use padding to make our tensors have a rectangular shape.

Padding makes sure all our sentences have the same length by adding a special word called the *padding token* to the **sentences with fewer values**.

In [None]:
padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]
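
The padding step above can be sketched as a small helper. This is a minimal illustration rather than how the library pads internally: `pad_batch` is a hypothetical name, it pads on the right only, and the value 100 simply mirrors the placeholder `padding_id` used in the cell above. With a real model, the tokenizer's own `pad_token_id` should be used.

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence in `batch` with `pad_id` up to the longest length."""
    max_length = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (max_length - len(ids)) for ids in batch]


padded = pad_batch([[200, 200, 200], [200, 200]], pad_id=100)
print(padded)  # [[200, 200, 200], [200, 200, 100]]
```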

In [30]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
print(f"Tokenizer padding value = {tokenizer.pad_token_id}\n")
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(f"Batched ids with padding: {batched_ids}\n")

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

Tokenizer padding value = 0

Batched ids with padding: [[200, 200, 200], [200, 200, 0]]

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward0>)


### Attention Masks

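In the padded batch above, the logits for the second row do not match the logits obtained for the second sequence on its own. This is because the attention layers also attended to the padding token. An attention mask tells the model to ignore it.

Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s:
- 1s indicate the corresponding tokens should be attended to
- 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).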
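
To give an intuition for what "ignored by the attention layers" means, here is a toy sketch of a masked softmax in plain Python. It assumes the common approach of replacing masked scores with negative infinity before the softmax; whether a given model implements masking exactly this way is an assumption. The scores are made up, and the function expects at least one unmasked position.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over `scores`, giving zero weight to positions where `mask` is 0."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    highest = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5]  # made-up attention scores for three positions
mask = [1, 1, 0]          # the last position is padding

weights = masked_softmax(scores, mask)
print(weights)  # the last weight is 0.0; the first two sum to 1
```

Because the padded position receives a weight of zero, it cannot influence the representation of the real tokens.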



In [31]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
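
Rather than writing the mask by hand, it can be derived from the lengths of the unpadded sequences. The helper below is a hypothetical sketch (the name `build_attention_mask` is not part of any library) and assumes padding is added on the right.

```python
def build_attention_mask(batch):
    """Return a 0/1 mask per sequence: 1 for real tokens, 0 for right-side padding."""
    max_length = max(len(ids) for ids in batch)
    return [[1] * len(ids) + [0] * (max_length - len(ids)) for ids in batch]


unpadded = [[200, 200, 200], [200, 200]]
print(build_attention_mask(unpadded))  # [[1, 1, 1], [1, 1, 0]]
```

Building the mask from lengths, instead of comparing each ID with the padding ID, avoids masking a real token that happens to share that ID.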


#### Exercise

Apply the tokenization manually on the two sentences used in section 2 (“I’ve been waiting for a HuggingFace course my whole life.” and “I hate this so much!”). Pass them through the model and check that you get the same logits as in section 2. Now batch them together using the padding token, then create the proper attention mask. Check that you obtain the same results when going through the model!

In [62]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Load the model in this cell too, so that the cell runs on its own
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sentences = ['I’ve been waiting for a HuggingFace course my whole life.', 'I hate this so much!']

# 1: Tokenize the sentences
tokenized_list = [tokenizer.tokenize(sentence) for sentence in sentences]
print(f"Tokenized list: {tokenized_list}\n")

# 2: Convert tokens to IDs
input_ids_list = [tokenizer.convert_tokens_to_ids(tokens) for tokens in tokenized_list]
print(f"Input ID list: {input_ids_list}\n")

# 3: Find the maximum sequence length
max_length = max(len(ids) for ids in input_ids_list)

# 4: Pad the shorter sequences with the tokenizer's padding token ID
padded_input_ids_list = [ids + [tokenizer.pad_token_id] * (max_length - len(ids)) for ids in input_ids_list]
print(f"Padded input ID list: {padded_input_ids_list}\n")

# 5: Generate attention masks from the original lengths (1 = real token, 0 = padding)
attention_mask_list = [
    [1] * len(ids) + [0] * (max_length - len(ids))
    for ids in input_ids_list
]

print(f"Attention mask list: {attention_mask_list}\n")

output = model(torch.tensor(padded_input_ids_list), 
               attention_mask = torch.tensor(attention_mask_list))

print(f"Logits: {output.logits}\n")

# 6: Post-process the logits into probabilities with softmax
predictions = torch.nn.functional.softmax(output.logits, dim = -1)

print(model.config.id2label)
print(predictions)

Tokenized list: [['i', '’', 've', 'been', 'waiting', 'for', 'a', 'hugging', '##face', 'course', 'my', 'whole', 'life', '.'], ['i', 'hate', 'this', 'so', 'much', '!']]

Input ID list: [[1045, 1521, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012], [1045, 5223, 2023, 2061, 2172, 999]]

Padded input ID list: [[1045, 1521, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012], [1045, 5223, 2023, 2061, 2172, 999, 0, 0, 0, 0, 0, 0, 0, 0]]

Attention mask list: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]

Logits: tensor([[-2.5720,  2.6852],
        [ 3.1931, -2.6685]], grad_fn=<AddmmBackward0>)

{0: 'NEGATIVE', 1: 'POSITIVE'}
tensor([[0.0052, 0.9948],
        [0.9972, 0.0028]], grad_fn=<SoftmaxBackward0>)
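
To see how the final post-processing step turns logits into probabilities, the sketch below reimplements softmax in plain Python and applies it to the logits printed above. It is only an illustration of the formula; `torch.nn.functional.softmax` remains the call used in the notebook.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the output above (already rounded to four decimals there)
batch_logits = [[-2.5720, 2.6852], [3.1931, -2.6685]]

for logits in batch_logits:
    print([round(p, 4) for p in softmax(logits)])
# [0.0052, 0.9948]
# [0.9972, 0.0028]
```

The first sentence is classified as `POSITIVE` and the second as `NEGATIVE`, matching the `id2label` mapping printed above.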


## Longer Sequences

There is a limit to the length of the sequences we can pass to the models. Most models handle sequences of up to 512 or 1024 tokens.

Solutions: 
- Use a model with a longer supported sequence length.
- Truncate your sequences.

Some models can handle big sequences:
- [Longformer](https://huggingface.co/transformers/model_doc/longformer.html) 
- [LED](https://huggingface.co/transformers/model_doc/led.html)

In [None]:
# Assumes `sequence` is a list of token IDs and `max_sequence_length` is the limit of your model
sequence = sequence[:max_sequence_length]
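
The truncation above can be sketched on a list of token IDs as follows. The helper name `truncate` and the values used are illustrative assumptions; the actual limit depends on the model you use.

```python
def truncate(ids, max_length):
    """Keep at most the first `max_length` token IDs of a sequence."""
    return ids[:max_length]


long_ids = list(range(10))  # stand-in for a long list of token IDs

print(truncate(long_ids, max_length=4))  # [0, 1, 2, 3]
print(truncate([7, 8], max_length=4))    # [7, 8] (shorter sequences are unchanged)
```

In practice, you can also let the tokenizer do this for you by passing `truncation=True` together with `max_length` when calling it.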