<a href="https://colab.research.google.com/github/hollyemblem/raschka-llm-from-scratch/blob/main/chapter_2_byte_pair_encoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Chapter 2

#### Exercise 2.1 Byte pair encoding of unknown words

"Try the BPE tokenizer from the tiktoken library on the unknown words “Akwirw ier” and print the individual token IDs. Then, call the decode function on each of the resulting integers in this list to reproduce the mapping shown in figure 2.11."

In [1]:
!pip install tiktoken



In [2]:
from importlib.metadata import version
import tiktoken
print("tiktoken version:", version("tiktoken"))

tiktoken version: 0.12.0


In [3]:
tokenizer = tiktoken.get_encoding("gpt2")

In [6]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces "
    "of someunknownPlace."
)
# Adjacent string literals are joined with no separator, hence the trailing
# space after "terraces".
# allowed_special lets "<|endoftext|>" in the text map to the special token.
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
# From the book:
# "First, the <|endoftext|> token is assigned a relatively large token ID, namely, 50256.
# In fact, the BPE tokenizer, which was used to train models such as GPT-2, GPT-3, and the
# original model used in ChatGPT, has a total vocabulary size of 50,257, with <|endoftext|>
# being assigned the largest token ID."
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]


In [7]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.
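
The vocabulary size quoted in the note above can be sanity-checked with simple arithmetic. The sketch below assumes the commonly cited breakdown of the GPT-2 vocabulary (256 byte-level base tokens, 50,000 learned merges, and one special token); that breakdown and the variable names are assumptions for illustration and do not come from the quoted passage.

```python
# Assumed breakdown of the GPT-2 vocabulary (not stated in the quoted passage).
n_byte_tokens = 256   # one base token per possible byte value
n_merges = 50_000     # learned BPE merges
n_special = 1         # <|endoftext|>

vocab_size = n_byte_tokens + n_merges + n_special
endoftext_id = vocab_size - 1  # IDs start at 0, so the largest ID is size - 1

print(vocab_size, endoftext_id)
```

In the notebook itself, `tokenizer.n_vocab` and `tokenizer.eot_token` should report the same two numbers.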


In [8]:
# From the book:
# "Second, the BPE tokenizer encodes and decodes unknown words, such as someunknownPlace, correctly.
# The BPE tokenizer can handle any unknown word. How does it achieve this without using <|unk|> tokens?"
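
The short answer to the question above is that BPE falls back to smaller subword units, down to single characters (bytes, in GPT-2's case), so every input can be represented. The following is a minimal sketch of that idea, not tiktoken's implementation: the merge table `TOY_MERGES` and the function `toy_bpe_split` are hypothetical, and the four merges were written by hand for this illustration rather than learned from data.

```python
# Toy illustration only: a hand-written merge table, NOT GPT-2's learned merges.
TOY_MERGES = [("i", "e"), ("ie", "r"), ("A", "k"), ("i", "r")]  # priority order


def toy_bpe_split(word, merges=TOY_MERGES):
    """Split a word into subword pieces by applying merges in priority order."""
    # Start from single characters, so no input is ever "unknown".
    pieces = list(word)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(pieces):
            # Join the pair if it appears at this position; otherwise keep the piece.
            if i + 1 < len(pieces) and pieces[i] == left and pieces[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces


print(toy_bpe_split("Akwirw"))  # pieces with a matching merge get joined
print(toy_bpe_split("xyz"))     # no merge applies, so it stays as characters
```

Because the pieces always concatenate back to the input, encoding followed by decoding is lossless even for words the tokenizer has never seen.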

In [9]:
unknown_text = "Akwirw ier"
ids = tokenizer.encode(unknown_text, allowed_special={"<|endoftext|>"})

In [10]:
print(ids)

[33901, 86, 343, 86, 220, 959]


In [12]:
# Decode each token ID on its own to see which substring it stands for.
for i in ids:
    print(tokenizer.decode([i]))

Ak
w
ir
w
 
ier


![Figure 2.11](https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781633437166/files/Images/2-11.png)
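
As a quick check of the mapping, the per-token output above can be paired with the IDs and joined back together. This sketch uses only the values printed in this notebook, so it runs without the tokenizer; the variable names `pieces` and `reconstructed` are introduced here for illustration.

```python
# Values copied from the notebook output above; no tokenizer is needed here.
ids = [33901, 86, 343, 86, 220, 959]
pieces = ["Ak", "w", "ir", "w", " ", "ier"]

for token_id, piece in zip(ids, pieces):
    # repr() makes the whitespace-only token visible.
    print(f"{token_id:>5} -> {piece!r}")

# Concatenating the decoded pieces should give back the original input.
reconstructed = "".join(pieces)
print(reconstructed == "Akwirw ier")
```

Note that ID 86 appears twice and decodes to "w" both times: the same substring always maps to the same token ID.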