# **Comparing various BPE Implementations**

## Using BPE from `tiktoken`

In [1]:
from importlib.metadata import version

print(f"tiktoken version: {version('tiktoken')}")

tiktoken version: 0.12.0


In [2]:
import tiktoken

tik_tokenizer = tiktoken.get_encoding("gpt2")

text = "Hello, world. Is this-- a test?"

In [3]:
integers = tik_tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]


In [4]:
strings = tik_tokenizer.decode(integers)

print(strings)

Hello, world. Is this-- a test?


In [5]:
print(tik_tokenizer.n_vocab)

50257
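
As a quick sanity check, the value 50257 can be broken down by hand: GPT-2's byte-level BPE starts from 256 single-byte tokens, adds 50,000 learned merges, and reserves one special token, `<|endoftext|>`. The variable names below are illustrative only and do not come from any of the libraries used here.

```python
# Breakdown of the GPT-2 vocabulary size (variable names are illustrative)
num_byte_tokens = 256      # one base token per possible byte value
num_merges = 50_000        # learned BPE merges
num_special_tokens = 1     # <|endoftext|>, which has token ID 50256

vocab_size = num_byte_tokens + num_merges + num_special_tokens
print(vocab_size)
```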


## Using the original BPE implementation used in GPT-2

In [6]:
from bpe_openai_gpt2 import get_encoder, download_vocab

In [7]:
download_vocab()

Fetching encoder.json: 1.04Mit [00:02, 394kit/s]                                                    
Fetching vocab.bpe: 457kit [00:02, 165kit/s]                                                        


In [8]:
orig_tokenizer = get_encoder(model_name="gpt2_model", models_dir=".")

In [9]:
integers = orig_tokenizer.encode(text)

print(integers)

[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]


In [10]:
strings = orig_tokenizer.decode(integers)

print(strings)

Hello, world. Is this-- a test?


## Using BPE via Hugging Face transformers

In [11]:
import transformers

transformers.__version__

'4.57.3'

In [12]:
from transformers import GPT2Tokenizer

hf_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")


In [13]:
hf_tokenizer(strings)["input_ids"]

[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]

In [14]:
from transformers import GPT2TokenizerFast

hf_tokenizer_fast = GPT2TokenizerFast.from_pretrained("gpt2")

In [15]:
hf_tokenizer_fast(strings)["input_ids"]

[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]
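
All four tokenizers above return the same IDs for the sample sentence. Rather than comparing the printed lists by eye, the check can be sketched as a small helper. `encoders_agree` is a hypothetical name, not part of any of these libraries; it only assumes that each tokenizer can be wrapped in a callable that maps a string to a list of integers. The two toy byte-level encoders stand in for real tokenizers so that the sketch runs without extra packages.

```python
def encoders_agree(encoders, text):
    """Return True if every encoder produces the same token IDs for `text`.

    `encoders` maps a label to a callable that takes a string and returns
    a sequence of integer token IDs (hypothetical helper, for illustration).
    """
    results = {name: list(encode(text)) for name, encode in encoders.items()}
    reference = next(iter(results.values()), None)
    for name, ids in results.items():
        if ids != reference:
            print(f"{name} differs from the first encoder")
            return False
    return True


# Toy stand-ins for real tokenizers: both map text to its UTF-8 byte values
toy_encoders = {
    "bytes_a": lambda t: list(t.encode("utf-8")),
    "bytes_b": lambda t: [b for b in bytes(t, "utf-8")],
}
print(encoders_agree(toy_encoders, "Hello, world. Is this-- a test?"))
```

In this notebook, the dictionary could instead hold `tik_tokenizer.encode`, `orig_tokenizer.encode`, and wrappers such as `lambda t: hf_tokenizer(t)["input_ids"]` for the two Hugging Face tokenizers.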

## Performance benchmark

In [17]:
with open("../01_main-code/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

Original OpenAI GPT-2 tokenizer

In [18]:
%timeit orig_tokenizer.encode(raw_text)

6.29 ms ± 399 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Tiktoken OpenAI GPT-2 tokenizer

In [19]:
%timeit tik_tokenizer.encode(raw_text)

1.41 ms ± 86.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


Hugging Face OpenAI GPT-2 tokenizer

In [20]:
%timeit hf_tokenizer(raw_text)["input_ids"]

Token indices sequence length is longer than the specified maximum sequence length for this model (5145 > 1024). Running this sequence through the model will result in indexing errors


15.1 ms ± 700 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [21]:
%timeit hf_tokenizer(raw_text, max_length=5145, truncation=True)["input_ids"]

15.6 ms ± 538 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [22]:
%timeit hf_tokenizer_fast(raw_text)["input_ids"]

Token indices sequence length is longer than the specified maximum sequence length for this model (5145 > 1024). Running this sequence through the model will result in indexing errors


5.65 ms ± 365 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [23]:
%timeit hf_tokenizer_fast(raw_text, max_length=5145, truncation=True)["input_ids"]

5.76 ms ± 447 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
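
`%timeit` is an IPython magic, so the measurements above only run inside a notebook or an IPython shell. A rough equivalent for a plain Python script can be sketched with the standard-library `timeit` module. `time_encoder` is a hypothetical helper name, and the default repeat and loop counts are assumptions chosen to mirror the "7 runs, 100 loops each" reported above; its numbers are not guaranteed to match `%timeit` exactly.

```python
import statistics
import timeit


def time_encoder(encode, text, repeat=7, number=100):
    """Return (mean, population std. dev.) of the per-call time in ms.

    `encode` is any callable that takes a string (hypothetical helper).
    """
    # timeit.repeat returns the total time of `number` calls for each run
    totals = timeit.repeat(lambda: encode(text), repeat=repeat, number=number)
    per_call_ms = [total / number * 1e3 for total in totals]
    return statistics.mean(per_call_ms), statistics.pstdev(per_call_ms)


# Toy stand-in for a tokenizer so the sketch runs without extra packages
sample = "Hello, world. Is this-- a test? " * 100
mean_ms, std_ms = time_encoder(str.split, sample)
print(f"{mean_ms:.3f} ms ± {std_ms:.3f} ms per loop")
```

To time the tokenizers from this notebook, pass a callable such as `tik_tokenizer.encode` together with `raw_text` in place of the toy stand-in.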


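On this text, tiktoken is the clear winner: about 1.4 ms per call, versus roughly 5.7 ms for `GPT2TokenizerFast`, 6.3 ms for the original OpenAI implementation, and 15 ms for `GPT2Tokenizer`. That speed is likely one reason it is so widely used.

Why is it so much faster?

(Honestly, I don't know. tiktoken's core is written in Rust, but so is the backend of `GPT2TokenizerFast`, so that alone doesn't explain the gap. Will look into it someday.)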