<div class="alert alert-block alert-warning">

Since implementing BPE can be relatively complicated, we will use an existing
open-source Python library called tiktoken (https://github.com/openai/tiktoken).

The library implements the BPE algorithm very efficiently, with its core written in Rust.
</div>
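
To give a rough idea of what the library does internally, the following is a minimal character-level sketch of the BPE merge loop. It is a toy written for illustration only, not tiktoken's implementation: the function names (`most_frequent_pair`, `merge_pair`, `toy_bpe`) are hypothetical, and real tokenizers work on bytes, pre-split the text, and learn their merges from a large corpus.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent pair of tokens, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with the joined token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def toy_bpe(text, num_merges):
    """Start from single characters and greedily apply up to `num_merges` merges."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:  # fewer than two tokens left, nothing to merge
            break
        tokens = merge_pair(tokens, pair)
    return tokens


print(toy_bpe("low lower lowest", 2))
# ['low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't']
```

After two merges the frequent fragment "low" has become a single token, while the rarer endings remain split into characters. Joining the tokens always gives back the original text.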

In [1]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.9.0


In [3]:
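# Import the submodule explicitly: a bare `import importlib` does not
# guarantee that `importlib.metadata` is loaded.
import importlib.metadata
import tiktoken

print("Version of tiktoken : ", importlib.metadata.version("tiktoken"))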

Version of tiktoken :  0.9.0


In [6]:
tokenizer = tiktoken.get_encoding("o200k_base")

In [11]:
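# Note: the two adjacent string literals are concatenated without a space,
# so the text contains "terracesof" (visible in the decoded output).
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
    "of someunknownPlace."
)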

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[13225, 11, 621, 481, 1299, 17966, 30, 220, 199999, 730, 290, 7334, 32758, 173297, 1440, 1236, 33936, 18099, 13]


In [9]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.
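
To see how an unfamiliar word such as someunknownPlace is split into subword tokens, each ID can be decoded on its own. The helper below is a small sketch; `token_pieces` is a hypothetical name, and it assumes only that the tokenizer object offers `encode(text, ...)` and `decode(list_of_ids)`, as the encoding created above does.

```python
def token_pieces(tokenizer, text, **encode_kwargs):
    """Return (token_id, decoded_piece) pairs for `text`.

    Assumes `tokenizer` provides `encode(text, ...)` and `decode(list_of_ids)`.
    Extra keyword arguments (for example `allowed_special`) are passed to `encode`.
    """
    ids = tokenizer.encode(text, **encode_kwargs)
    return [(token_id, tokenizer.decode([token_id])) for token_id in ids]


# In this notebook (requires the `tokenizer` defined above):
# for token_id, piece in token_pieces(tokenizer, "someunknownPlace"):
#     print(token_id, repr(piece))
```

Decoding one ID at a time works cleanly for ASCII text like this example. For text with multi-byte characters, a single token may hold only part of a character, so an individual piece can show up as a replacement character even though decoding the full list is correct.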


<div class="alert alert-block alert-warning">

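We can make two noteworthy observations based on the token IDs and decoded text
above.

First, the <|endoftext|> token is assigned a relatively large token ID, namely 199999.

The o200k_base encoding loaded above is the BPE tokenizer used by models such as
GPT-4o and GPT-4o mini. Its regular BPE tokens occupy the lower IDs, and special tokens
such as <|endoftext|> are placed at the top of the ID range. Note that 199999 is the ID
of <|endoftext|>, not the vocabulary size; the size reported by tiktoken is available as
tokenizer.n_vocab.

Second, the tokenizer encodes and decodes the unknown word someunknownPlace correctly:
BPE breaks words that are not in its vocabulary into smaller subword tokens, so it can
handle arbitrary text.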
</div>
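
Statements about the vocabulary size and special-token IDs can be checked against the encoding object itself. The sketch below reads three attributes that tiktoken encodings expose (`n_vocab`, `eot_token`, `special_tokens_set`); the wrapper `describe_vocab` is a hypothetical helper added here for illustration, and the printed values should be treated as authoritative rather than any number quoted in prose.

```python
def describe_vocab(tokenizer):
    """Collect basic vocabulary facts from a tiktoken-style encoding object."""
    return {
        "n_vocab": tokenizer.n_vocab,      # vocabulary size reported by the encoding
        "eot_token": tokenizer.eot_token,  # token ID of <|endoftext|>
        "special_tokens": sorted(tokenizer.special_tokens_set),
    }


# In this notebook (requires the `tokenizer` defined above):
# print(describe_vocab(tokenizer))
```

Comparing `eot_token` with `n_vocab` shows directly whether <|endoftext|> holds the last ID in the encoding or whether other special tokens sit above it.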

In [12]:
# Example: round-trip a short string through the tokenizer
integers = tokenizer.encode("Hello, Pankaj Kumar")
print(integers)
strings = tokenizer.decode(integers)
print(strings)

[13225, 11, 398, 1104, 1255, 70737]
Hello, Pankaj Kumar
