# Inspecting Special Tokens

Inspect the special tokens, token IDs, and attention masks that a Hugging Face tokenizer produces.

Load the Tokenizer

In [19]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Look at Special Tokens

In [20]:
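print("CLS token:", tokenizer.cls_token)  # Classification token, placed at the start of every sequence
print("SEP token:", tokenizer.sep_token)  # Separator token, marks the end of each segment
print("PAD token:", tokenizer.pad_token)  # Padding token, fills shorter sequences up to a common length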

CLS token: [CLS]
SEP token: [SEP]
PAD token: [PAD]


Tokenize a Single Sentence

In [21]:
text = "I love Hugging Face!"
tokens = tokenizer(text)
print(tokens)

{'input_ids': [101, 1045, 2293, 17662, 2227, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
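
The first and last IDs in this output (101 and 102) are the [CLS] and [SEP] tokens that the tokenizer wrapped around the five ordinary token IDs. Below is a minimal pure-Python sketch that marks those positions. It assumes the bert-base-uncased IDs 0, 101, and 102 for [PAD], [CLS], and [SEP], and `label_special_tokens` is a hypothetical helper written for illustration, not part of transformers.

```python
# Special-token IDs assumed for bert-base-uncased (illustrative lookup table).
SPECIAL_TOKENS = {0: "[PAD]", 101: "[CLS]", 102: "[SEP]"}


def label_special_tokens(input_ids):
    """Return one label per ID: the special token's name, or None for ordinary tokens."""
    return [SPECIAL_TOKENS.get(token_id) for token_id in input_ids]


# IDs copied from the output above for "I love Hugging Face!"
input_ids = [101, 1045, 2293, 17662, 2227, 999, 102]
labels = label_special_tokens(input_ids)
print(labels)
# ['[CLS]', None, None, None, None, None, '[SEP]']
```

With the tokenizer loaded, `tokenizer.convert_ids_to_tokens(input_ids)` should give the token string for every position, not only the special ones.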


Tokenize Multiple Sentences

In [22]:
text = ["I love Hugging Face! Its very convenient.", "I am learning a lot!"]
tokens = tokenizer(text)
print(tokens)

{'input_ids': [[101, 1045, 2293, 17662, 2227, 999, 2049, 2200, 14057, 1012, 102], [101, 1045, 2572, 4083, 1037, 2843, 999, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}


Padding and Truncation

In [23]:
texts = ["I love Hugging Face!", "AI is amazing!"]
batch = tokenizer(texts, padding=True, truncation=True, max_length=9, return_tensors="pt")
print(batch)

{'input_ids': tensor([[  101,  1045,  2293, 17662,  2227,   999,   102],
        [  101,  9932,  2003,  6429,   999,   102,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0]])}
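
The second row of this output gained one trailing 0 in `input_ids` and a matching 0 in `attention_mask`. The pure-Python sketch below reproduces that step on plain lists to show what padding to the longest sequence amounts to. `pad_batch` is a hypothetical helper, and the pad ID of 0 is taken from the output above; this is not how the library implements padding internally.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad each ID list to the longest length and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should not attend to.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


unpadded = [
    [101, 1045, 2293, 17662, 2227, 999, 102],  # "I love Hugging Face!"
    [101, 9932, 2003, 6429, 999, 102],         # "AI is amazing!"
]
padded = pad_batch(unpadded)
print(padded["input_ids"][1])       # [101, 9932, 2003, 6429, 999, 102, 0]
print(padded["attention_mask"][1])  # [1, 1, 1, 1, 1, 1, 0]
```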


* Tokenized single and multiple sentences; each sequence starts with [CLS] (ID 101) and ends with [SEP] (ID 102), and [PAD] fills out shorter sequences
* Observed padding, which brings every sequence in the batch to the same length by appending [PAD] tokens (ID 0); their attention_mask entries are 0, so the model ignores them
* Enabled truncation, which drops tokens beyond max_length; neither sentence here exceeds 9 tokens, so nothing was actually cut
* max_length sets the truncation limit; it also sets the padded length when padding="max_length" is used, whereas padding=True pads only to the longest sequence in the batch (7 tokens here)
* Returned the data as tensors (return_tensors="pt"), multi-dimensional arrays that models commonly consume when handling batches of data
* Data can be returned as PyTorch tensors ("pt"), TensorFlow tensors ("tf"), NumPy arrays ("np"), or plain Python lists (the default)
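
With max_length=9, neither 7-token nor 6-token sequence in the cell above is long enough to be cut. The sketch below shows what truncation would do with a smaller limit, assuming the default right-side truncation of a single sequence in which [CLS] and [SEP] are kept and only the tokens between them are dropped. `truncate_single` is a hypothetical helper operating on plain lists, not a transformers function, and it assumes max_length is at least 2.

```python
def truncate_single(input_ids, max_length):
    """Shorten one [CLS] ... [SEP] sequence to max_length, keeping both special tokens."""
    if len(input_ids) <= max_length:
        return list(input_ids)
    cls_id, sep_id = input_ids[0], input_ids[-1]
    content = input_ids[1:-1]
    keep = max(max_length - 2, 0)  # reserve two slots for [CLS] and [SEP]
    return [cls_id] + content[:keep] + [sep_id]


ids = [101, 1045, 2293, 17662, 2227, 999, 102]  # "I love Hugging Face!"
print(truncate_single(ids, max_length=9))  # [101, 1045, 2293, 17662, 2227, 999, 102]
print(truncate_single(ids, max_length=5))  # [101, 1045, 2293, 17662, 102]
```

Rerunning the tokenizer call with a smaller `max_length` is a quick way to check this behaviour against the real library.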
