<a href="https://colab.research.google.com/github/ngzhiwei517/Transformers/blob/main/Chapter2/Tokenizers(PyTorch).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenizers (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
!pip install datasets evaluate transformers[sentencepiece]

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [2]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']


In [3]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Why Use AutoTokenizer?

Flexibility: It reads the configuration of the checkpoint you name, like "bert-base-cased", and instantiates the matching tokenizer class for you.

Ease of Use: You don’t have to remember the specific tokenizer class for each model, which reduces the chance of loading a mismatched tokenizer.

In [5]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

# **input_ids 🆔**

These are the token IDs corresponding to the subword tokens.

The special tokens 101 and 102 are [CLS] (the classification token placed at the start of the sequence) and [SEP] (the separator placed at the end of each sentence), respectively, for BERT.


# **attention_mask 👁️**

This tells the model which tokens to pay attention to and which to ignore (padding).

1 means the token is real (not padding), while 0 means it’s padding.

Since you have no padding here, it’s all 1s.
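To see when the mask would contain 0s, here is a minimal pure-Python sketch of padding a batch by hand. The helper name `pad_batch` is hypothetical, and the pad ID of 0 is an assumption matching BERT's [PAD] token. In practice the tokenizer does this for you when called with `padding=True`.

```python
PAD_ID = 0  # assumption: [PAD] has ID 0 in the BERT vocabulary


def pad_batch(batch, pad_id=PAD_ID):
    """Hypothetical helper: pad ID lists to equal length and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


long_ids = [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102]  # from the output above
short_ids = [101, 7993, 102]  # [CLS] Using [SEP]

padded = pad_batch([long_ids, short_ids])
print(padded["attention_mask"])
```

The shorter sequence is extended with six pad IDs, and its mask carries six 0s in the same positions so the model can skip them.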

# **What are token_type_ids Used For?**

Single Sentence: All token_type_ids are 0.

Sentence Pairs: 0 for the first sentence and 1 for the second sentence.

In [7]:
tokenizer("Transformers are amazing!", "They power modern NLP models.")

{'input_ids': [101, 25267, 1132, 6929, 106, 102, 1220, 1540, 2030, 21239, 2101, 3584, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Breaking This Down:

First Sentence: [101, 25267, 1132, 6929, 106, 102]

Token type 0 for each token, including [CLS] (101) and the first [SEP] (102).

Second Sentence: [1220, 1540, 2030, 21239, 2101, 3584, 119, 102]

Token type 1 for each token, including the final [SEP] (102).

Why Do We Use This?
It helps the model distinguish between the two sentences.

This is crucial for tasks like question answering and next sentence prediction.
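The layout of the sentence-pair output can be reproduced by hand. The sketch below is a pure-Python illustration, not the library's implementation: `build_pair` is a hypothetical helper, and the ID lists are copied from the tokenizer output above with the special tokens removed.

```python
CLS_ID, SEP_ID = 101, 102  # BERT special-token IDs seen in the outputs above


def build_pair(ids_a, ids_b):
    """Hypothetical helper: join two ID lists as [CLS] A [SEP] B [SEP]."""
    first = [CLS_ID] + ids_a + [SEP_ID]   # segment 0 keeps [CLS] and the first [SEP]
    second = ids_b + [SEP_ID]             # segment 1 keeps the final [SEP]
    return {
        "input_ids": first + second,
        "token_type_ids": [0] * len(first) + [1] * len(second),
    }


sentence_a = [25267, 1132, 6929, 106]                     # "Transformers are amazing!"
sentence_b = [1220, 1540, 2030, 21239, 2101, 3584, 119]   # "They power modern NLP models."

pair = build_pair(sentence_a, sentence_b)
print(pair["input_ids"])
print(pair["token_type_ids"])
```

The result has six 0s followed by eight 1s, which matches the `token_type_ids` the tokenizer returned for this pair.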



# `Example`


Why is token_type_ids Critical Here?
It helps the model distinguish the question from the context, even though they are in the same input sequence. The question is passed first, so its tokens get type 0 and the context tokens get type 1.

This lets the BERT model see where the question ends and the context begins, which helps it locate the correct answer span.



In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
context = "Transformers are neural networks used for NLP."
question = "What are transformers used for?"

tokenizer(question, context)



{'input_ids': [101, 1327, 1132, 11303, 1468, 1215, 1111, 136, 102, 25267, 1132, 18250, 6379, 1215, 1111, 21239, 2101, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
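As a rough illustration of how the two segments can be told apart, the sketch below splits the encoding above by token type. `split_segments` is a hypothetical helper written for this notebook, not part of the Transformers API; the dictionary is copied from the output above.

```python
encoding = {
    "input_ids": [101, 1327, 1132, 11303, 1468, 1215, 1111, 136, 102,
                  25267, 1132, 18250, 6379, 1215, 1111, 21239, 2101, 119, 102],
    "token_type_ids": [0, 0, 0, 0, 0, 0, 0, 0, 0,
                       1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
}


def split_segments(enc):
    """Hypothetical helper: separate IDs by their token type (0 or 1)."""
    pairs = list(zip(enc["input_ids"], enc["token_type_ids"]))
    segment_0 = [tok for tok, typ in pairs if typ == 0]
    segment_1 = [tok for tok, typ in pairs if typ == 1]
    return segment_0, segment_1


question_ids, context_ids = split_segments(encoding)
context_start = encoding["token_type_ids"].index(1)  # first position of the context

print(question_ids)
print(context_ids)
print(context_start)
```

The question segment holds 9 IDs (including [CLS] and its [SEP]), and the context starts at position 9.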

In [10]:
tokenizer.save_pretrained("directory_on_my_computer")

('directory_on_my_computer/tokenizer_config.json',
 'directory_on_my_computer/special_tokens_map.json',
 'directory_on_my_computer/vocab.txt',
 'directory_on_my_computer/added_tokens.json',
 'directory_on_my_computer/tokenizer.json')

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]

In [None]:
decoded_string = tokenizer.decode([7993, 170, 13809, 23763, 2443, 1110, 3014])
print(decoded_string)

Using a Transformer network is simple
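The three steps above (tokenize, convert to IDs, decode) can be sketched without the library. This is a simplified illustration of greedy longest-match subword splitting, the idea behind WordPiece tokenization; the toy vocabulary, its IDs, and all function names here are hypothetical, and the real tokenizer also normalizes text and splits punctuation, which is omitted.

```python
# Toy vocabulary: hypothetical IDs for illustration, not BERT's real vocabulary.
TOY_VOCAB = {"[UNK]": 0, "Using": 1, "a": 2, "Trans": 3, "##former": 4,
             "network": 5, "is": 6, "simple": 7}
ID_TO_TOKEN = {idx: tok for tok, idx in TOY_VOCAB.items()}


def tokenize_word(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text, vocab=TOY_VOCAB):
    return [p for word in text.split() for p in tokenize_word(word, vocab)]


def tokens_to_ids(tokens, vocab=TOY_VOCAB):
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]


def decode(ids):
    # Join with spaces, then glue continuation pieces back onto the previous piece.
    return " ".join(ID_TO_TOKEN[i] for i in ids).replace(" ##", "")


toy_tokens = tokenize("Using a Transformer network is simple")
toy_ids = tokens_to_ids(toy_tokens)
print(toy_tokens)
print(toy_ids)
print(decode(toy_ids))
```

Decoding does more than map IDs back to tokens: it also merges pieces marked with `##` so the original words are restored.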