## Input IDs

In [1]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence = "A Titan RTX has 24GB of VRAM"

In [2]:
tokenized_sequence = tokenizer.tokenize(sequence)

In [3]:
print(tokenized_sequence)

['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']


In [4]:
inputs = tokenizer(sequence)

In [5]:
encoded_sequence = inputs["input_ids"]
print(encoded_sequence)

[101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]


In [6]:
decoded_sequence = tokenizer.decode(encoded_sequence)
print(decoded_sequence)

[CLS] A Titan RTX has 24GB of VRAM [SEP]


## Attention mask

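The attention mask is an optional argument used when batching sequences together. It tells the model which tokens should be attended to and which should not: 1 marks a real token, and 0 marks a padding position added to bring a shorter sequence up to the length of the longest one in the batch.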

In [7]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."
encoded_sequence_a = tokenizer(sequence_a)["input_ids"]
encoded_sequence_b = tokenizer(sequence_b)["input_ids"]

In [8]:
len(encoded_sequence_a), len(encoded_sequence_b)

(8, 19)

In [9]:
padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)

In [10]:
padded_sequences["input_ids"]

[[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [101,
  1188,
  1110,
  170,
  1897,
  1263,
  4954,
  119,
  1135,
  1110,
  1120,
  1655,
  2039,
  1190,
  1103,
  4954,
  138,
  119,
  102]]

In [11]:
padded_sequences["attention_mask"]

[[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
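The padding and mask shown above can be reproduced by hand. The following is a minimal sketch in plain Python, not the tokenizer's actual implementation; `pad_batch` is a hypothetical helper, and it assumes right-padding with a pad token ID of 0, which matches the output above.

```python
def pad_batch(batch_ids, pad_token_id=0):
    """Right-pad each list of token IDs to the longest length in the batch.

    Hypothetical helper for illustration only; pad_token_id=0 is assumed
    from the padded output shown above.
    """
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        # Real tokens keep their IDs; padding positions get pad_token_id.
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 = attend to this position, 0 = ignore it (padding).
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs copied from the two encoded sequences above.
short = [101, 1188, 1110, 170, 1603, 4954, 119, 102]
long = [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110,
        1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]

batch = pad_batch([short, long])
print(batch["attention_mask"][0])
```

The mask has the same shape as the padded input IDs, so the model can skip exactly the positions that were filled in.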

## Token Type IDs

In [12]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"
encoded_dict = tokenizer(sequence_a, sequence_b)
decoded = tokenizer.decode(encoded_dict["input_ids"])

In [13]:
print(decoded)

[CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]


In [14]:
encoded_dict['token_type_ids']

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

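The special tokens in the decoded output above ([CLS] at the start and [SEP] after each sequence) are enough for some models to understand where one sequence ends and another begins. Other models, such as BERT, also use token type IDs (also called segment IDs): a binary mask that identifies which of the two sequences each token belongs to. In the output above, the tokens of the first sequence, including [CLS] and the first [SEP], are marked 0, and the tokens of the second sequence, including the final [SEP], are marked 1.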
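As a rough illustration of how this mask lines up with a sentence pair, the sketch below builds BERT-style pair inputs by hand. `build_pair_inputs` is a hypothetical helper, not a library function; the [CLS] and [SEP] IDs (101 and 102) are taken from the encoded output earlier in this document, and the content token IDs are made-up placeholders.

```python
def build_pair_inputs(ids_a, ids_b, cls_id=101, sep_id=102):
    """Join two token ID lists as [CLS] A [SEP] B [SEP] with segment IDs.

    Hypothetical helper for illustration; cls_id and sep_id are the values
    seen in the bert-base-cased output above.
    """
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sequence A and the first [SEP];
    # segment 1 covers sequence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids


# Placeholder IDs: 8 tokens per sequence, matching the lengths in the example above.
ids_a = list(range(1000, 1008))
ids_b = list(range(2000, 2008))

input_ids, token_type_ids = build_pair_inputs(ids_a, ids_b)
print(token_type_ids)
```

With eight tokens in each sequence, this gives ten 0s followed by nine 1s, the same pattern as the tokenizer output above.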

## Position IDs

The position IDs (position_ids) are used by the model to identify each token’s position in the list of tokens. They are an optional parameter: if position_ids is not passed to the model, the IDs are created automatically as absolute positional embeddings.
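To make the default concrete, here is a minimal sketch of absolute position IDs, assuming the simplest scheme in which positions count up from 0 for every sequence in the batch. `default_position_ids` is a hypothetical helper; individual models may build their defaults differently, so check the model's documentation before relying on this.

```python
def default_position_ids(batch_size, seq_length):
    """Absolute positions 0 .. seq_length - 1, repeated for each sequence.

    Hypothetical helper; assumes positions start at 0, which is an
    assumption about the model rather than a rule for every architecture.
    """
    return [list(range(seq_length)) for _ in range(batch_size)]


position_ids = default_position_ids(batch_size=2, seq_length=5)
print(position_ids)
```

Each row is identical because an absolute position depends only on where a token sits in its sequence, not on which token it is.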

## Labels

Labels are an optional argument that can be passed so that the model computes the loss itself. The expected labels differ according to the model head, for example:

For sequence classification models (e.g., BertForSequenceClassification), the model expects a tensor of dimension (batch_size) with each value of the batch corresponding to the expected label of the entire sequence.

For token classification models (e.g., BertForTokenClassification), the model expects a tensor of dimension (batch_size, seq_length) with each value corresponding to the expected label of each individual token.

For masked language modeling (e.g., BertForMaskedLM), the model expects a tensor of dimension (batch_size, seq_length) with each value corresponding to the expected label of each individual token: the labels being the token ID for the masked token, and values to be ignored for the rest (usually -100).

For sequence-to-sequence tasks (e.g., BartForConditionalGeneration, MBartForConditionalGeneration), the model expects a tensor of dimension (batch_size, tgt_seq_length) with each value corresponding to the target sequences associated with each input sequence. During training, both BART and T5 create the appropriate decoder_input_ids and decoder attention masks internally, so they usually do not need to be supplied. This does not apply to models leveraging the Encoder-Decoder framework. See the documentation of each model for more information on each specific model’s labels.

The base models (e.g., BertModel) do not accept labels, as these are the base transformer models, simply outputting features.
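The masked language modeling case is the least obvious of the label layouts described above, so here is a minimal sketch of it in plain Python. `mask_for_mlm` is a hypothetical helper, the masked positions are chosen by hand rather than sampled, and the mask token ID of 103 is an assumption about the bert-base-cased vocabulary; in practice it should be read from the tokenizer.

```python
IGNORE_INDEX = -100  # value usually used for positions the loss should skip


def mask_for_mlm(input_ids, masked_positions, mask_token_id):
    """Return (masked_inputs, labels) for one sequence.

    Hypothetical helper for illustration. Labels hold the original token ID
    at masked positions and IGNORE_INDEX everywhere else.
    """
    masked_inputs = [
        mask_token_id if i in masked_positions else tok
        for i, tok in enumerate(input_ids)
    ]
    labels = [
        tok if i in masked_positions else IGNORE_INDEX
        for i, tok in enumerate(input_ids)
    ]
    return masked_inputs, labels


# First tokens of the encoded sequence from the Input IDs section, closed with [SEP].
input_ids = [101, 138, 18696, 155, 1942, 3190, 1144, 102]

# mask_token_id=103 is an assumption, not a value shown in this document.
masked_inputs, labels = mask_for_mlm(input_ids, {2, 6}, mask_token_id=103)
print(masked_inputs)
print(labels)
```

The labels keep the same length as the input, which is why the expected shape is (batch_size, seq_length) even though only a few positions contribute to the loss.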