# Handling multiple sequences (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
!pip install datasets evaluate transformers[sentencepiece]


In [23]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

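tokens = tokenizer.tokenize(sequence)  # list of subword strings
ids = tokenizer.convert_tokens_to_ids(tokens)  # list of integer token IDs
input_ids = torch.tensor(ids)  # 1-D tensor of shape [14]: no batch dimension
# This line will fail because the model expects a 2-D [batch_size, sequence_length] input.
model(input_ids)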

IndexError: too many indices for tensor of dimension 1

In [4]:
tokens

['i',
 "'",
 've',
 'been',
 'waiting',
 'for',
 'a',
 'hugging',
 '##face',
 'course',
 'my',
 'whole',
 'life',
 '.']

In [5]:
ids

[1045,
 1005,
 2310,
 2042,
 3403,
 2005,
 1037,
 17662,
 12172,
 2607,
 2026,
 2878,
 2166,
 1012]

In [9]:
ids.shape()  # fails: ids is a plain Python list with no .shape; use len(ids) instead

AttributeError: 'list' object has no attribute 'shape'

In [6]:
input_ids

tensor([ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
         2026,  2878,  2166,  1012])

In [7]:
input_ids.shape

torch.Size([14])

In [11]:
# Fix suggested by Gemini: add a batch dimension to the input.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

# Wrap ids in a list so the tensor gains a batch dimension of size 1
input_ids = torch.tensor([ids])
# input_ids is now 2-dimensional: [batch_size, sequence_length]

# The forward pass now runs without error
output = model(input_ids)
print(output.logits)

tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


In [14]:
[ids]

[[1045,
  1005,
  2310,
  2042,
  3403,
  2005,
  1037,
  17662,
  12172,
  2607,
  2026,
  2878,
  2166,
  1012]]



```
What do the outer [] around the list mean in [ids]?

Okay, let's break down the meaning of [ids] in the context of your code.

In the line input_ids = torch.tensor([ids]), the outer square brackets [] are creating a new list with a single element: the ids list. This effectively adds a batch dimension to the tensor that's created by torch.tensor().

Let me illustrate with an example:

ids: Represents a list of token IDs, let's say [1045, 1005, 2310, ...]. This is a 1-dimensional list.
[ids]: Creates a new list containing ids as its only element. It would look like this: [[1045, 1005, 2310, ...]]. This is now a 2-dimensional list (a list of lists).
torch.tensor([ids]): Takes this 2-dimensional list and creates a 2-dimensional PyTorch tensor. The first dimension represents the batch size (in this case, 1 because there's only one list inside), and the second dimension represents the sequence length (the number of tokens in your sequence).

Why is this necessary?

Most transformer models are designed to process batches of sequences efficiently. Even if you're working with a single sequence, the model expects the input to have a batch dimension. By adding the extra [] around ids, you're creating a batch with a single sequence, satisfying the model's input requirements.

In simpler terms:

Think of it like this – the model expects a tray of items, even if you only have one item. You need to put your single item on the tray (by creating a list containing your item) before handing it to the model. This "tray" is the batch dimension, and [ids] is the way you put your "item" (the ids list) onto the tray.
```
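
The nesting described above can be checked without PyTorch. The sketch below uses a hypothetical helper, `nested_shape`, which is not part of any library, to report the dimensions of a nested list. It assumes every sublist at a given depth has the same length.

```python
def nested_shape(value):
    """Return the dimensions of a rectangular nested list as a tuple."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        if not value:
            break
        value = value[0]  # descend into the first element to measure the next level
    return tuple(shape)


# Token IDs copied from the notebook output above
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

print(nested_shape(ids))    # (14,): one dimension, the sequence length
print(nested_shape([ids]))  # (1, 14): batch size 1, sequence length 14
```

The second tuple matches the [batch_size, sequence_length] layout that `torch.tensor([ids])` produces.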



In [None]:
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])

In [30]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch.nn.functional as F

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

probabilities = F.softmax(output.logits, dim=-1)  # dim=-1 applies softmax over the last dimension (the classes)
print("Probabilities:", probabilities)

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)
Probabilities: tensor([[0.0037, 0.9963]], grad_fn=<SoftmaxBackward0>)


In [29]:
# My fix: let the tokenizer build the inputs so the special tokens are included.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch.nn.functional as F

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokenized_inputs = tokenizer(sequence, return_tensors="pt")
# print("tokenized_inputs", tokenized_inputs["input_ids"])

input_ids = tokenized_inputs["input_ids"]
print("Input IDs:", input_ids)

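# Passing only input_ids works for a single unpadded sequence; for padded batches,
# prefer model(**tokenized_inputs) so the attention mask is passed as well.
output = model(tokenized_inputs["input_ids"])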
print("Logits:", output.logits)

probabilities = F.softmax(output.logits, dim=-1)  # dim=-1 applies softmax over the last dimension (the classes)
print("Probabilities:", probabilities)

Input IDs: tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])
Logits: tensor([[-1.5607,  1.6123]], grad_fn=<AddmmBackward0>)
Probabilities: tensor([[0.0402, 0.9598]], grad_fn=<SoftmaxBackward0>)


In [27]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

# Use the tokenizer directly to get input IDs
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
# print("tokenized_inputs", tokenized_inputs["input_ids"])

input_ids = tokenized_inputs["input_ids"]
print("Input IDs:", input_ids)

# Pass the tokenized inputs to the model
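# Passing only input_ids works for a single unpadded sequence; for padded batches,
# prefer model(**tokenized_inputs) so the attention mask is passed as well.
output = model(tokenized_inputs["input_ids"])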
print("Logits:", output.logits)

Input IDs: tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])
Logits: tensor([[-1.5607,  1.6123]], grad_fn=<AddmmBackward0>)




```
When I use the tokenizer directly, I get a completely different result:
Input IDs: tensor([[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]]) Logits: tensor([[-1.5607, 1.6123]], grad_fn=<AddmmBackward0>)
compared with the result when [ids] is the input:
Input IDs: tensor([[ 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]]) Logits: tensor([[-2.7276, 2.8789]], grad_fn=<AddmmBackward0>)

Okay, let's address the difference in results when using tokenizer(sequence, return_tensors="pt") versus manually creating the input tensor with [ids].

The difference in results is due to the special tokens that the tokenizer automatically adds when you use tokenizer(sequence, return_tensors="pt"). Specifically, it adds:

[CLS] token (ID 101): This token is added at the beginning of the sequence and is used for classification tasks.
[SEP] token (ID 102): This token is added at the end of the sequence to mark the separation between different segments (if any).

These special tokens are crucial for the model to understand the structure of the input and perform correctly.

When you manually create the input tensor with [ids]:

You are only including the token IDs of the words in the sequence.
You are missing the special tokens ([CLS] and [SEP]) that the model expects.

When you use tokenizer(sequence, return_tensors="pt"):

The tokenizer automatically adds the necessary special tokens ([CLS] and [SEP]).
The model receives the input in the expected format, leading to more accurate results.

Why the difference in logits:

The logits represent the model's raw predictions before applying the softmax function. Since the input is different in both cases (with and without special tokens), the model's internal calculations and predictions will also differ, resulting in different logits.

In summary:

Always use tokenizer(sequence, return_tensors="pt") to prepare your input for transformer models.
This ensures that the necessary special tokens are added, and the model receives the input in the expected format, leading to better performance and accuracy.
If you need to control the specific tokens added (e.g., in more complex scenarios), you can explore the tokenizer's documentation for options like add_special_tokens and return_special_tokens_mask. But for most common cases, the default behavior of tokenizer(sequence, return_tensors="pt") is sufficient.
```
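
As a rough illustration of the point above, the tokenizer's output for this sentence can be reproduced from the manually converted IDs by adding the two special-token IDs by hand. The IDs 101 and 102 are taken from the outputs earlier in this notebook. The variable names are only illustrative, and in practice the tokenizer should do this step.

```python
CLS_ID = 101  # [CLS], as seen at the start of the tokenizer's output
SEP_ID = 102  # [SEP], as seen at the end of the tokenizer's output

# IDs produced by tokenize() + convert_tokens_to_ids(), without special tokens
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

# Wrap the sequence by hand, the way the tokenizer does for a single sentence
with_special_tokens = [CLS_ID] + ids + [SEP_ID]

# IDs printed by tokenizer(sequence, return_tensors="pt") earlier in the notebook
tokenizer_output = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
                    2607, 2026, 2878, 2166, 1012, 102]

print(with_special_tokens == tokenizer_output)  # True
print(len(ids), len(with_special_tokens))       # 14 16
```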





```
The probabilities differ too:

Logits: tensor([[-2.7276, 2.8789]], grad_fn=<AddmmBackward0>) Probabilities: tensor([[0.0037, 0.9963]], grad_fn=<SoftmaxBackward0>)

and

Logits: tensor([[-1.5607, 1.6123]], grad_fn=<AddmmBackward0>) Probabilities: tensor([[0.0402, 0.9598]], grad_fn=<SoftmaxBackward0>)

You're correct; there's still a difference in the probabilities even after applying the softmax function. This is expected and is a direct consequence of the difference in the logits we discussed earlier.

Here's why the probabilities differ:

Different Logits: As we established, the logits (raw predictions) are different because of the presence or absence of special tokens ([CLS] and [SEP]) in the input. The model processes these tokens, and they influence the internal calculations, leading to different logit values.

Softmax's Role: The softmax function converts the logits into probabilities, but it does so in a way that preserves the relative relationships between the logits.

In the first case (manual input with [ids]): The logits are [-2.7276, 2.8789]. The softmax function essentially exponentiates these values and then normalizes them to sum to 1. This results in probabilities of approximately [0.0037, 0.9963].

In the second case (using tokenizer(sequence, return_tensors="pt")): The logits are [-1.5607, 1.6123]. Again, softmax exponentiates and normalizes these values, resulting in probabilities of approximately [0.0402, 0.9598].

The Key Takeaway:

Even though the probabilities are different, they still reflect the same general trend: the model is more confident in the second class (positive sentiment) in both cases. However, the magnitude of confidence is slightly different.

Why the difference in confidence:

The special tokens ([CLS] and [SEP]) are part of the input format the model was fine-tuned on, so an input without them differs from what the model learned to expect. In this case, the correctly formatted input yields a slightly lower confidence (0.9598) than the manual input without special tokens (0.9963).

Practical implications:

The difference in probabilities is generally not a major concern as long as the overall trend and prediction remain consistent.
The probabilities obtained using tokenizer(sequence, return_tensors="pt") (with special tokens) are considered more reliable and are preferred for real-world applications.
In research or fine-tuning scenarios, you might want to analyze these differences further to understand the impact of special tokens on the model's behavior.
```
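
The exponentiate-and-normalize step described above can be reproduced by hand. This is a minimal plain-Python sketch of softmax, not PyTorch's implementation; the logits are copied from the two runs in this notebook.

```python
import math


def softmax(logits):
    """Exponentiate each logit and normalize so the results sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]


manual_logits = [-2.7276, 2.8789]     # input built with [ids], no special tokens
tokenizer_logits = [-1.5607, 1.6123]  # input built by the tokenizer

print([round(p, 4) for p in softmax(manual_logits)])     # [0.0037, 0.9963]
print([round(p, 4) for p in softmax(tokenizer_logits)])  # [0.0402, 0.9598]
```

Both results agree with the `F.softmax` outputs shown in the cells above.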



In [31]:
batched_ids = [
    [200, 200, 200],
    [200, 200]
]

In [32]:
padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]
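
The manual padding in the cell above can be generalized to any number of sequences. A minimal sketch, assuming right-padding to the longest sequence; `pad_batch` is a hypothetical helper written for this note, not a Transformers function.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence with pad_id to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


padding_id = 100  # same placeholder value as in the cell above
batched_ids = pad_batch([[200, 200, 200], [200, 200]], padding_id)
print(batched_ids)  # [[200, 200, 200], [200, 200, 100]]
```

After padding, the batch is rectangular, so it can be converted to a tensor.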

In [33]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)


This comparison shows that once padding is added, the logits for the second row of the batch differ from those of sequence 2 run on its own, so an attention mask is needed to get the same result as for sequence 2.
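
To see why a mask can restore the unpadded result, here is a toy sketch of the idea in plain Python. It is not DistilBERT's actual implementation: it assumes the common approach of excluding masked positions before the softmax that produces attention weights, and the scores are made-up numbers.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight to positions where mask is 0."""
    kept = [s for s, m in zip(scores, mask) if m == 1]
    top = max(kept)  # subtract the max of the kept scores for numerical stability
    exps = [math.exp(s - top) if m == 1 else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 3.0]  # hypothetical attention scores; the last position is padding

unmasked = masked_softmax(scores, [1, 1, 1])   # padding takes part in attention
masked = masked_softmax(scores, [1, 1, 0])     # padding is ignored
unpadded = masked_softmax(scores[:2], [1, 1])  # the sequence without any padding

print(unmasked[2] > 0)         # True: without a mask, the padding position gets weight
print(masked[2])               # 0.0
print(masked[:2] == unpadded)  # True: the real tokens get the same weights as when unpadded
```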

In [34]:
batched_ids

[[200, 200, 200], [200, 200, 0]]

In [37]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]


outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
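
The padded IDs and the matching mask can be built together so they never drift apart. A minimal sketch, assuming right-padding; `pad_with_mask` is a hypothetical helper, and the pad ID of 0 is the value of `tokenizer.pad_token_id` shown earlier in this notebook.

```python
def pad_with_mask(sequences, pad_id):
    """Return (padded_ids, attention_mask) for a list of token-ID lists."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, mask


pad_token_id = 0  # assumption: matches the tokenizer.pad_token_id output above
batched_ids, attention_mask = pad_with_mask([[200, 200, 200], [200, 200]], pad_token_id)
print(batched_ids)     # [[200, 200, 200], [200, 200, 0]]
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```

The mask is derived from the original lengths rather than by comparing IDs with the pad ID, so a real token that happens to share that ID is never masked. In practice, calling the tokenizer on a list of sentences with `padding=True` returns both `input_ids` and `attention_mask`.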


In [40]:
sequence[:6]

"I've b"

In [38]:
max_sequence_length = 512  # must be defined before use; 512 is DistilBERT's token limit
sequence = sequence[:max_sequence_length]  # note: this slices characters, not tokens
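
Truncation is usually applied to token IDs rather than to the characters of the raw string. The following is a minimal sketch, assuming a 512-token limit and the [CLS]/[SEP] IDs 101 and 102 seen earlier; `truncate_ids` is a hypothetical helper, not a Transformers function.

```python
CLS_ID, SEP_ID = 101, 102  # special-token IDs seen in the tokenizer output earlier


def truncate_ids(ids, max_length):
    """Trim ids so that adding [CLS] and [SEP] stays within max_length tokens."""
    body = ids[: max_length - 2]  # reserve two slots for the special tokens
    return [CLS_ID] + body + [SEP_ID]


long_ids = [200] * 1000  # stand-in for an overly long tokenized sequence
truncated = truncate_ids(long_ids, max_length=512)

print(len(truncated))               # 512
print(truncated[0], truncated[-1])  # 101 102
```

When calling the tokenizer directly, passing `truncation=True` (optionally with `max_length`) handles this step.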