# Handling multiple sequences

## Models expect a batch of inputs

Hugging Face Transformers models expect multiple sentences (a batch) by default.

The code below fails because it sends a single sequence to the model, whereas the model expects a batch of sequences.

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
# This line will fail.
model(input_ids)



IndexError: too many indices for tensor of dimension 1

What the tokenizer actually does behind the scenes: it converts the list of input IDs into a tensor and adds a dimension on top of it.

In [2]:
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])
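The extra pair of brackets in the output above is the added dimension. As a rough illustration with plain Python lists rather than tensors, the sketch below shows how wrapping a list of IDs in another list changes its shape. The helper `nested_shape` is hypothetical (it is not part of any library) and assumes the nested lists are rectangular.

```python
def nested_shape(x):
    """Return the shape of a rectangular nested list as a tuple."""
    shape = []
    while isinstance(x, list):
        shape.append(len(x))
        # Descend into the first element; stop if the list is empty.
        x = x[0] if x else None
    return tuple(shape)


ids = [1045, 1005, 2310]  # a single sequence of token IDs
batch = [ids]             # the same sequence wrapped in a batch of size 1

print(nested_shape(ids))    # one dimension: (sequence_length,)
print(nested_shape(batch))  # two dimensions: (batch_size, sequence_length)
```

The model expects the second, two-dimensional layout, which is why the one-dimensional tensor was rejected.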


We can also add this dimension ourselves, which amounts to building a batch.
```Batching``` is the act of sending multiple sentences through the model all at once.
If we only have one sentence, we can build a batch containing a single sequence, as in the code below.

Issue with ```batching```: when you batch two (or more) sentences together, they might be of different lengths. To work around this problem, we usually pad the inputs.

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])  # Add the batch dimension here: ids -> [ids]
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)



Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


## Padding

The following list of lists (a batch of two sequences) cannot be converted to a tensor, because the two rows have different lengths:

```
batched_ids = [
    [200, 200, 200],
    [200, 200]
]
```

To work around this, we use ```padding``` to give our tensors a rectangular shape, which tensors require.

Padding makes sure all our sentences have the same length by adding a special token called the ```padding token``` to the shorter sentences.

For example, if you have 10 sentences with 10 words and 1 sentence with 20 words, padding will ensure all the sentences have 20 words.

```
padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]
```
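The padding step can be sketched in plain Python. The helper `pad_batch` below is a hypothetical name, not a library function; it assumes the batch is not empty and pads on the right, up to the length of the longest sequence.

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence in `batch` to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    # Build new lists so the input sequences are left unmodified.
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]


padding_id = 100  # placeholder value, same as in the example above
batched_ids = pad_batch([[200, 200, 200], [200, 200]], padding_id)
print(batched_ids)  # [[200, 200, 200], [200, 200, 100]]
```

The result is rectangular, so it can be converted to a tensor.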

Applying the padding token:

The padding token ID can be found in ```tokenizer.pad_token_id```. Let's use it and send our two sentences through the model individually, and then batched together.

In [5]:
# Applying the padding token 
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print("Sequence 1 Sequence Classification Result (Individual)")
print(model(torch.tensor(sequence1_ids)).logits)
print("")
print("Sequence 2 Sequence Classification Result (Individual)")
print(model(torch.tensor(sequence2_ids)).logits)
print("")
print("Sequence 1 and 2 Batched Sequence Classification Result (Batched)")
print(model(torch.tensor(batched_ids)).logits)

Sequence 1 Sequence Classification Result (Individual)
tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)

Sequence 2 Sequence Classification Result (Individual)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)

Sequence 1 and 2 Batched Sequence Classification Result (Batched)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)


Something is wrong with the logits in our batched predictions: the second row should match the logits for the second sentence on its own, but we get completely different values.

This is because the key feature of Transformer models is attention layers that contextualize each token. These layers attend to all of the tokens of a sequence, so they take the padding tokens into account.

To get the same result whether we pass sentences of different lengths through the model individually or as a padded batch, we need to tell those attention layers to ignore the padding tokens. This is done by using an ```attention mask```.
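To see why a padding token changes the result, and why ignoring it restores the result, here is a deliberately simplified sketch of a single attention step in plain Python. The scores are made up for illustration, and this is not how the model's attention layers are actually implemented; it only shows the effect of excluding one position from a softmax.

```python
import math


def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up attention scores from one token to the others (illustrative only).
real_scores = [1.0, 2.0]         # unpadded sequence: two real tokens
padded_scores = [1.0, 2.0, 0.5]  # same sequence plus one padding token

alone = softmax(real_scores)
unmasked = softmax(padded_scores)

# Ignoring the padding: set its score to -inf before the softmax,
# so its weight becomes exactly 0.
mask = [1, 1, 0]
masked_scores = [s if m == 1 else float("-inf") for s, m in zip(padded_scores, mask)]
masked = softmax(masked_scores)

print("alone:   ", alone)
print("unmasked:", unmasked)
print("masked:  ", masked)
```

Without the mask, the padding position takes some of the weight away from the real tokens. With the mask, the weights on the real tokens match the unpadded case.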

## Attention masks

Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 
- 1s indicate the corresponding tokens should be attended to
- 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).

In [6]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0], # 0 -> ignore the padding
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

# now we get the same logits for the second sentence in the batch

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
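The mask above was written by hand. As a sketch, it could also be derived from the padded IDs. The helper `build_attention_mask` is hypothetical, and it assumes the padding ID never appears as a real token in the sequences. The `pad_id` value here is a placeholder; in practice it would come from ```tokenizer.pad_token_id```.

```python
def build_attention_mask(batched_ids, pad_id):
    """Return 1 for every real token and 0 for every padding token."""
    return [
        [0 if token_id == pad_id else 1 for token_id in seq]
        for seq in batched_ids
    ]


pad_id = 0  # assumed placeholder for the padding token ID
batched_ids = [
    [200, 200, 200],
    [200, 200, pad_id],
]

attention_mask = build_attention_mask(batched_ids, pad_id)
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```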


## Handling longer sequences

There is a limit to the lengths of the sequences we can pass to models. Most models handle sequences of up to 512 or 1024 tokens, and will crash when asked to process longer sequences. Solutions:
- Use a model with a longer supported sequence length
- Truncate your sequences

Models have different supported sequence lengths, and some specialize in handling very long sequences. Longformer is one example, and another is LED. If you’re working on a task that requires very long sequences, we recommend you take a look at those models.

Otherwise, we recommend you truncate your sequences to a maximum length of your choosing (held here in a variable named max_sequence_length):
```sequence = sequence[:max_sequence_length]```
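The same slicing can be applied to every sequence in a batch of token IDs. This is a minimal sketch: `truncate_batch` is a hypothetical helper, and the limit of 4 is only for illustration, far smaller than any real model's limit.

```python
def truncate_batch(batch, max_sequence_length):
    """Keep at most `max_sequence_length` token IDs from the start of each sequence."""
    return [seq[:max_sequence_length] for seq in batch]


max_sequence_length = 4  # illustrative value only
batch = [
    [200, 201, 202, 203, 204, 205],  # longer than the limit: gets cut
    [300, 301],                      # shorter than the limit: unchanged
]

truncated = truncate_batch(batch, max_sequence_length)
print(truncated)  # [[200, 201, 202, 203], [300, 301]]
```

Note that slicing drops tokens from the end of the sequence, so anything after the cut-off is lost to the model.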