# Tokenizers (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
!pip install datasets evaluate transformers[sentencepiece]


In [2]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']


In [12]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [5]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [13]:
tokenizer("I’ve been waiting for a HuggingFace course my whole life.")

{'input_ids': [101, 146, 787, 1396, 1151, 2613, 1111, 170, 20164, 10932, 2271, 7954, 1736, 1139, 2006, 1297, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [16]:
var = tokenizer("I’ve been waiting for a HuggingFace course my whole life.").input_ids

```
input_ids: A list of numerical IDs representing each token in the input sequence. These IDs map to the tokenizer's vocabulary.
attention_mask: A list indicating which tokens should be attended to by the model (typically 1 for relevant tokens and 0 for padding tokens).
token_type_ids: A list used to distinguish different segments of the input (e.g., in a question-answering task, to separate the question and the context).
```
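The role of `attention_mask` is easiest to see once padding is involved. The following is a minimal sketch in plain Python, not the Transformers implementation: `pad_batch` is a hypothetical helper, the second sequence is made up for illustration, and the pad ID of 0 is an assumption that matches the `[PAD]` entry in BERT vocabularies.

```python
def pad_batch(batch, pad_id=0):
    """Pad every ID list to the longest length and build matching attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102],  # IDs from the output above
    [101, 7993, 102],  # shorter, made-up sequence: [CLS] Using [SEP]
]
padded = pad_batch(batch)
print(padded["input_ids"][1])       # [101, 7993, 102, 0, 0, 0, 0, 0, 0]
print(padded["attention_mask"][1])  # [1, 1, 1, 0, 0, 0, 0, 0, 0]
```

With a single unpadded sentence, as in the cells above, every mask entry is 1.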





```
Question: in this lesson the token_type_ids are all 0. What does that mean?

What all-zero token_type_ids mean

token_type_ids tell a transformer model which segment of the input each token belongs to. This matters for tasks that take more than one input sequence, such as question answering or sentence-pair classification.

When every token_type_id is 0, two things hold:

Single-sequence input: the tokenizer received only one sequence, so there are no segments to tell apart.

One shared segment embedding: BERT-style models add a segment embedding, selected by token_type_ids, to each token embedding. With every ID set to 0, all tokens receive the same segment embedding and are treated as one sequence.

Example:

For the sentence "Using a Transformer network is simple", the tokenizer returns token_type_ids that are all 0: the whole sentence is one segment, so the model has no parts to distinguish.

Contrast with multiple segments:

In question answering the input typically has two segments, the question and the context passage, and token_type_ids are assigned as follows:

0: tokens belonging to the question.
1: tokens belonging to the context passage.

This tells the model which tokens play which role.

In summary: all-zero token_type_ids mean the input was encoded as a single sequence, which is the usual case for a lone sentence or paragraph.
```
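To make the two-segment case concrete, here is a small sketch of how segment IDs line up with a BERT-style `[CLS] A [SEP] B [SEP]` layout. `build_pair` is a hypothetical helper written for this note, and the question and context tokens are illustrative strings, not the output of a real tokenizer.

```python
def build_pair(tokens_a, tokens_b=None):
    """Lay out [CLS] A [SEP] (plus B [SEP] if given) with matching segment IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    token_type_ids = [0] * len(tokens)  # first segment, including [CLS] and its [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        token_type_ids += [1] * (len(tokens_b) + 1)  # second segment and final [SEP]
    return tokens, token_type_ids


# Single sequence: every segment ID is 0
single_tokens, single_types = build_pair(
    ["Using", "a", "Trans", "##former", "network", "is", "simple"]
)
print(single_types)  # [0, 0, 0, 0, 0, 0, 0, 0, 0]

# Question + context: the second segment is marked with 1
pair_tokens, pair_types = build_pair(
    ["Is", "it", "simple", "?"], ["It", "is", "simple", "."]
)
print(pair_types)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```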



In [6]:
tokenizer.save_pretrained("directory_on_my_computer")

('directory_on_my_computer/tokenizer_config.json',
 'directory_on_my_computer/special_tokens_map.json',
 'directory_on_my_computer/vocab.txt',
 'directory_on_my_computer/added_tokens.json',
 'directory_on_my_computer/tokenizer.json')

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
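The `Trans` / `##former` split comes from subword tokenization. The sketch below shows a greedy longest-match-first split in the style of WordPiece, assuming a hypothetical seven-entry vocabulary; the real tokenizer has a much larger vocabulary and also normalizes text and splits punctuation first, which this sketch skips.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, just large enough for the example sentence
toy_vocab = {"Using", "a", "Trans", "##former", "network", "is", "simple"}

sequence = "Using a Transformer network is simple"
tokens = [piece for word in sequence.split() for piece in wordpiece(word, toy_vocab)]
print(tokens)  # ['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
```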


In [9]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]
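Conceptually, converting tokens to IDs is a vocabulary lookup, and the full `tokenizer(...)` call additionally wraps the result in special tokens. A minimal sketch, assuming a toy vocabulary that contains only the token/ID pairs visible in the outputs above; `tokens_to_ids` and `add_special_tokens` are hypothetical helpers, not library functions.

```python
# Toy vocabulary built from the IDs printed earlier in this notebook
toy_vocab = {
    "[CLS]": 101, "[SEP]": 102,
    "Using": 7993, "a": 170, "Trans": 13809, "##former": 23763,
    "network": 2443, "is": 1110, "simple": 3014,
}


def tokens_to_ids(tokens, vocab):
    """Look each token up in the vocabulary."""
    return [vocab[token] for token in tokens]


def add_special_tokens(ids, vocab):
    """Wrap a single sequence as [CLS] ... [SEP]."""
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]


tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]
ids = tokens_to_ids(tokens, toy_vocab)
print(ids)                                 # [7993, 170, 13809, 23763, 2443, 1110, 3014]
print(add_special_tokens(ids, toy_vocab))  # [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102]
```

The second line matches the `input_ids` returned by `tokenizer("Using a Transformer network is simple")` earlier, which is why that list is two entries longer than the output of `convert_tokens_to_ids`.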


In [10]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

Using a transformer network is simple


`decode` converts token IDs back into words.

In [17]:
decoded_string = tokenizer.decode(var)
print(decoded_string)

[CLS] I ’ ve been waiting for a HuggingFace course my whole life. [SEP]
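Decoding can be pictured as the reverse lookup followed by gluing `##` pieces onto the preceding token. This is only a rough sketch with a toy ID-to-token table taken from the outputs above; `toy_decode` is a hypothetical helper, and unlike the real `decode` it does not tidy spacing around punctuation (the real output above joins `life` and `.`).

```python
# Toy reverse vocabulary using the IDs printed earlier in this notebook
id_to_token = {
    101: "[CLS]", 102: "[SEP]",
    7993: "Using", 170: "a", 13809: "Trans", 23763: "##former",
    2443: "network", 1110: "is", 3014: "simple",
}


def toy_decode(ids, id_to_token):
    """Map IDs back to tokens and merge ## continuation pieces."""
    text = ""
    for token_id in ids:
        token = id_to_token[token_id]
        if token.startswith("##"):
            text += token[2:]    # continuation: attach without a space
        elif text:
            text += " " + token  # new word: separate with a space
        else:
            text = token         # first token
    return text


print(toy_decode([7993, 170, 13809, 23763, 2443, 1110, 3014], id_to_token))
# Using a Transformer network is simple
print(toy_decode([101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], id_to_token))
# [CLS] Using a Transformer network is simple [SEP]
```

As in the real output above, special tokens show up in the decoded string when the IDs include them.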
