<a href="https://colab.research.google.com/github/ngzhiwei517/Transformers/blob/main/Chapter2/Tokenizers(PyTorch).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenizers (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

In [None]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Why Use AutoTokenizer?

Flexibility: It selects the correct tokenizer class from the checkpoint name you provide, such as "bert-base-cased", so the same line of code works across model families.

Ease of Use: You don’t have to remember the specific tokenizer class for each model, which reduces the chance of mistakes.

In [None]:
tokenizer("Using a Transformer network is simple")

# **input_ids 🆔**

These are the token IDs corresponding to the subword tokens.

For BERT, the special token IDs 101 and 102 are [CLS] (added at the start of the sequence) and [SEP] (added at the end of each sentence), respectively.


# **attention_mask 👁️**

This tells the model which tokens to pay attention to and which to ignore (padding).

1 means the token is real (not padding), while 0 means it’s padding.

Since you have no padding here, it’s all 1s.
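
To see when the mask would contain 0s, here is a minimal pure-Python sketch of padding a batch by hand. The token IDs in `batch` are made up for illustration, `pad_batch` is a hypothetical helper (not part of Transformers), and `PAD_ID = 0` is an assumption about the padding ID.

```python
# Toy padding sketch: the IDs are made up, and 0 is assumed to be the [PAD] ID.
PAD_ID = 0

def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sequence to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real IDs, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real, 0 = padding
    return input_ids, attention_mask

batch = [[101, 7, 8, 9, 102], [101, 7, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

The shorter sequence gets two padding IDs, and its mask ends in two 0s so the model can ignore those positions.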

# **What are token_type_ids Used For?**

token_type_ids (also called segment IDs) mark which sentence each token belongs to.

Single Sentence: All token_type_ids are 0.

Sentence Pairs: 0 for the first sentence and 1 for the second sentence.

In [None]:
tokenizer("Transformers are amazing!", "They power modern NLP models.")

Breaking This Down:

First Sentence: [101, 16297, 1132, 12245, 106, 102]

Token type 0 for each token, including the opening [CLS] (101) and the first [SEP] (102).

Second Sentence: [1387, 1700, 1292, 1565, 10883, 1647, 119, 102]

Token type 1 for each token, including the closing [SEP] (102).

Why Do We Use This?

It helps the model distinguish between the two sentences.

This is crucial for tasks like question answering and next sentence prediction.
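
The pair layout can be sketched in plain Python. This is an illustration only: `build_pair` is a hypothetical helper, and the token lists are hand-written rather than produced by a real tokenizer.

```python
def build_pair(tokens_a, tokens_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with matching segment IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids

tokens, token_type_ids = build_pair(["How", "are", "you", "?"], ["Fine", "."])
for tok, seg in zip(tokens, token_type_ids):
    print(f"{tok:>6}  {seg}")
```

Each [SEP] takes the segment ID of the sentence it closes, which matches the breakdown above.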



# `Example`


Why are token_type_ids critical here?

In the question-answering input below, they help the model distinguish the question from the context, even though both are in the same input sequence.

This lets the BERT model tell where the question ends and the context begins, so it can locate the correct answer span.



In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
context = "Transformers are neural networks used for NLP."
question = "What are transformers used for?"

tokenizer(question, context)



In [None]:
tokenizer.save_pretrained("directory_on_my_computer")

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

This tokenizer is a subword tokenizer: it splits words until it obtains tokens that can be represented by its vocabulary.


Why Split "Transformer" into "transform" and "##er"?
> The BERT tokenizer uses WordPiece tokenization, which tries to keep the vocabulary small while still covering most words.

Here’s a simplified view of what might be in vocab.txt:


transform  
##er  
transformation  
network  
is  
simple  
...

Notice that "transformer" is not included, but its components "transform" and "##er" are.



---


🧠 Why This is Smart

This approach handles both common and rare words efficiently.

It can represent words such as "transforming" (transform + ##ing) or "transformers" (transform + ##ers) by combining pieces that are already in the vocabulary, without adding a new entry for each word.
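
The splitting idea can be sketched as a greedy longest-match-first search over a toy vocabulary. This is a simplified illustration, not the library's implementation: `wordpiece` and `toy_vocab` are hypothetical names, and the vocabulary is invented and far smaller than a real vocab.txt.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1  # try a shorter candidate
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Invented toy vocabulary for illustration only.
toy_vocab = {"transform", "##er", "##ers", "##ing", "network", "is", "simple"}
for word in ["transformer", "transformers", "transforming", "network", "xyz"]:
    print(word, "->", wordpiece(word, toy_vocab))
```

Words that are in the vocabulary stay whole, words that are not get split into known pieces, and a word with no matching pieces falls back to the unknown token.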




In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

In [None]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

# **1. tokenizer.tokenize() (Tokenization Only)**



* Does not add special tokens like [CLS] or [SEP].

* Returns a list of strings.

* Does not generate an attention mask or token type IDs.



# **2. tokenizer() (Full Encoding)**


Includes tokenization plus:

* Adds the [CLS] (start) and [SEP] (end) special tokens.

* Converts tokens to IDs using the model’s vocabulary.

* Adds token type IDs and an attention mask.

* Returns output that is ready for model input (pass return_tensors="pt" to get PyTorch tensors).

In short:

* tokenizer.tokenize() → Shows the internal tokenization process (subword splitting).

* tokenizer() → Directly prepares the full input for the model.
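
The gap between the two calls can be sketched without the library. Everything here is assumed for illustration: `encode` is a hypothetical helper, and the IDs in `toy_vocab` are invented rather than taken from a real checkpoint.

```python
# Hypothetical toy vocabulary; real IDs come from the checkpoint's vocab.txt.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "Using": 4, "a": 5, "transform": 6, "##er": 7}

def encode(tokens, vocab):
    """Turn already-tokenized text into model-style inputs."""
    with_special = ["[CLS]"] + tokens + ["[SEP]"]  # add special tokens
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in with_special]  # tokens to IDs
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # single sentence: all segment 0
        "attention_mask": [1] * len(input_ids),  # no padding: all 1s
    }

tokens = ["Using", "a", "transform", "##er"]  # the tokenize()-style result
encoded = encode(tokens, toy_vocab)           # the tokenizer()-style result
print(encoded)
```

The list of strings is the intermediate result, while the dictionary adds the extra fields a model expects.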

