# DEMO 2: **Byte-Pair Encoding Tokenization**
---

Demo run of the Byte-Pair Encoding (BPE) tokenizer training process on the TinyStories dataset. This run was performed on a 2023 MacBook Pro (M3 Pro).

The implementation uses a greedy approach for inference. Greedy inference keeps tokenization fast at runtime, and some studies, such as [Greed is All You Need: An Evaluation of Tokenizer Inference Methods](https://arxiv.org/pdf/2403.01289) (2024), show that it also yields good benchmark results, especially on morphologically motivated tasks.
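
One common form of greedy inference is longest-prefix matching: at each position, take the longest vocabulary entry that matches and move on. The sketch below illustrates that idea on a toy vocabulary. Whether `bpe_transformer` does exactly this is an assumption; `greedy_encode` and `toy_vocab` are hypothetical names used only for illustration.

```python
def greedy_encode(word: bytes, vocab: dict[int, bytes]) -> list[int]:
    """Encode `word` by repeatedly taking the longest matching vocab entry."""
    token_to_id = {token: idx for idx, token in vocab.items()}
    max_len = max(len(token) for token in token_to_id)
    ids = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shrink it byte by byte.
        for end in range(min(len(word), i + max_len), i, -1):
            piece = word[i:end]
            if piece in token_to_id:
                ids.append(token_to_id[piece])
                i = end
                break
        else:
            raise ValueError(f"no vocab entry matches at byte offset {i}")
    return ids


# Hypothetical toy vocabulary (id -> bytes), same layout as `bpe.vocab`.
toy_vocab = {0: b" ", 1: b"t", 2: b"h", 3: b"e", 4: b"n",
             5: b" t", 6: b"he", 7: b" the"}

print(greedy_encode(b" then", toy_vocab))  # [7, 4]
print(greedy_encode(b"then", toy_vocab))   # [1, 6, 4]
```

A vocabulary that contains every single byte guarantees that the inner loop always finds a match, so the `ValueError` branch only triggers for incomplete toy vocabularies like this one.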

## BPETrainer

Now, let's instantiate a BPE trainer with the `BPETrainer` class. The only special token we'll consider is the end-of-text token of the TinyStories dataset, `"<|endoftext|>"`. Let's try a vocab size of `10000`:

In [1]:
from bpe_transformer.tokenization import BPETrainer

special_tokens = ["<|endoftext|>"]
bpe = BPETrainer(vocab_size=10000, special_tokens=special_tokens)

## Training
---

Before training, the pre-tokenization routines are called to pre-process the training data, following the pattern used for GPT-2. Special tokens are always kept intact and never take part in the training process.

For more details, see `1_pretokenization.ipynb`.
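
As a rough illustration of this step, the sketch below splits the text on special tokens first, so they never reach the pattern, and then counts pre-tokens in each chunk. It uses a simplified, standard-library approximation of the GPT-2 pattern (the original relies on Unicode property classes that `re` does not support). `GPT2_LIKE` and `pretokenize` are hypothetical names, not the notebook's actual API.

```python
import re
from collections import Counter

# Simplified stand-in for the GPT-2 pre-tokenization pattern (assumption):
# contractions, optional-space + letters, digits, punctuation, whitespace.
GPT2_LIKE = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"
    r"| ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+"
)


def pretokenize(text: str, special_tokens: list[str]) -> Counter:
    """Count pre-tokens (as bytes), never crossing a special-token boundary."""
    if special_tokens:
        # Longest first, so overlapping special tokens split correctly.
        ordered = sorted(special_tokens, key=len, reverse=True)
        chunks = re.split("|".join(re.escape(t) for t in ordered), text)
    else:
        chunks = [text]
    counts = Counter()
    for chunk in chunks:
        for match in GPT2_LIKE.finditer(chunk):
            counts[match.group().encode("utf-8")] += 1
    return counts


counts = pretokenize("Once upon a time.<|endoftext|>Once more.", ["<|endoftext|>"])
print(counts[b"Once"], counts[b" upon"], counts[b"."])  # 2 1 2
```

Because the split happens before matching, no pre-token can contain part of `<|endoftext|>`, which is what keeps special tokens out of the merge statistics.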

**Defining input variables:** the training data file path and the number of workers for parallel pre-tokenization. For this notebook, we'll use the maximum number of CPUs available.

In [2]:
from pathlib import Path
from multiprocessing import cpu_count

input_path = Path("../data/TinyStoriesV2-GPT4-train.txt")
n_cpus = cpu_count()

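We can train our `bpe` by calling `bpe.train(..)` and track the total time taken. For memory tracking, we use `tracemalloc`. Note that `tracemalloc` only traces memory allocated by Python in the current process, so allocations made inside the worker processes are not counted in the reported peak.

Later in the notebook, we'll also write the resulting vocabulary and merges to disk for further inspection.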

In [3]:
import tracemalloc

from time import time

if __name__ == "__main__":
    tracemalloc.start()
    start = time()
    bpe.train(input_path=input_path, num_processes=n_cpus)
    end = time()
    total_time = end - start
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

After training, we can inspect the learned vocabulary with `bpe.vocab` and the merges performed with `bpe.merges`. The cell below reports the peak memory usage and training time, then prints the last 30 vocabulary entries and the first 30 merges.


In [5]:
l_50 = 50 * "="
l_50_2 = 50 * "-"
print(l_50)
print(f"Peak memory usage: {peak / 1024**3:.2f} GB")
print(l_50_2)
print(f"TRAINING TIME (seconds): {total_time}")
print(l_50)
print("\n")


print(l_50)
print("BPE Tokenizer vocab (post-training, last 30 tokens)")
print(l_50)

print(list(bpe.vocab.items())[-30:])

print("\n")
print(l_50)
print("BPE Tokenizer merges list (post-training, first 30 merges)")
print(l_50)
print(bpe.merges[:30])

Peak memory usage: 0.05 GB
--------------------------------------------------
TRAINING TIME (seconds): 1645.2148880958557


BPE Tokenizer vocab (post-training, last 30 tokens)
[(9970, b' Din'), (9971, b'dest'), (9972, b'Maddy'), (9973, b'Everything'), (9974, b'Curious'), (9975, b' racers'), (9976, b' patients'), (9977, b' muster'), (9978, b' god'), (9979, b' deserves'), (9980, b' aloud'), (9981, b' Things'), (9982, b'aps'), (9983, b'Use'), (9984, b' Squee'), (9985, b' Dragon'), (9986, b' tours'), (9987, b' meets'), (9988, b' marvel'), (9989, b' Rusty'), (9990, b' Liza'), (9991, b' Jet'), (9992, b'Froggy'), (9993, b' wrapper'), (9994, b' Reddy'), (9995, b' Hops'), (9996, b' Crusty'), (9997, b' whiskers'), (9998, b' nicest'), (9999, b' improving')]


BPE Tokenizer merges list (post-training, first 30 merges)
[(b' ', b't'), (b'h', b'e'), (b' ', b'a'), (b' ', b's'), (b' ', b'w'), (b'n', b'd'), (b' t', b'he'), (b'e', b'd'), (b' ', b'b'), (b' t', b'o'), (b' a', b'nd'), (b' ', b'h'), (b' ', b'
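
The merges printed above are pairs of byte strings, listed in the order they were learned. As a rough illustration of how such a list can arise, here is a minimal, unoptimized sketch of the BPE training loop over pre-token counts. `train_bpe_toy` is a hypothetical helper, not the `BPETrainer` implementation, and breaking ties by the lexicographically greater pair is an assumption.

```python
from collections import Counter


def train_bpe_toy(word_counts: dict[bytes, int], num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to `num_merges` merges from pre-token counts."""
    # Start with every pre-token split into single bytes.
    words = {tuple(bytes([b]) for b in word): n for word, n in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by pre-token frequency.
        pair_counts = Counter()
        for symbols, n in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += n
        if not pair_counts:
            break
        # Most frequent pair wins; ties go to the greater pair (assumption).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_words = {}
        for symbols, n in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_words[key] = new_words.get(key, 0) + n
        words = new_words
    return merges


toy_merges = train_bpe_toy({b" the": 5, b" to": 3, b"he": 2}, num_merges=3)
print(toy_merges)  # [(b' ', b't'), (b'h', b'e'), (b' t', b'he')]
```

Recounting all pairs on every iteration is quadratic in practice; it is kept here only because it makes each merge step easy to follow.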

Let's check the longest token in our vocab:

In [6]:
print(2 * l_50)
longest_token = max(bpe.vocab.items(), key=lambda x: len(x[1]))
print(f"Longest token ID: {longest_token[0]}")
print(f"Longest token: {longest_token[1]}")
print(f"Length: {len(longest_token[1])} bytes")
print(f"Decoded: {longest_token[1].decode('utf-8', errors='replace')}")
print(2 * l_50)

Longest token ID: 7164
Longest token: b' accomplishment'
Length: 15 bytes
Decoded:  accomplishment


The longest token is ` accomplishment`, a complete word preceded by a space.

Let's export our vocab and merges (we'll write `.txt` files in this notebook for quick, casual inspection):

In [7]:
from os import makedirs

output_dir = "."
makedirs(output_dir, exist_ok=True)

# Export vocabulary: bpe.vocab maps token ID -> token bytes
with open(f"{output_dir}/bpe_vocab.txt", "w", encoding="utf-8") as f:
    for idx, token in bpe.vocab.items():
        f.write(f"{idx}\t{token!r}\n")

# Export merges: each entry is a (left, right) pair of byte strings
with open(f"{output_dir}/bpe_merges.txt", "w", encoding="utf-8") as f:
    for left, right in bpe.merges:
        f.write(f"{left!r} + {right!r} -> {left + right!r}\n")

print(f"Saved to {output_dir}/bpe_vocab.txt and {output_dir}/bpe_merges.txt")

Saved to ./bpe_vocab.txt and ./bpe_merges.txt
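
If a text export like this needs to be read back later, one option is to store each token as its Python `bytes` literal and parse it with `ast.literal_eval`. The round trip below is a minimal sketch that assumes one `<id>\t<bytes literal>` line per entry; `save_vocab` and `load_vocab` are hypothetical helpers, and it writes to a temporary directory rather than the files above.

```python
import ast
import tempfile
from pathlib import Path


def save_vocab(vocab: dict[int, bytes], path: Path) -> None:
    """Write one '<id>\\t<bytes literal>' line per vocabulary entry."""
    with open(path, "w", encoding="utf-8") as f:
        for idx, token in vocab.items():
            # repr() escapes tabs, newlines and non-ASCII bytes.
            f.write(f"{idx}\t{token!r}\n")


def load_vocab(path: Path) -> dict[int, bytes]:
    """Parse a file written by `save_vocab` back into a dict."""
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            idx_str, token_repr = line.rstrip("\n").split("\t", 1)
            vocab[int(idx_str)] = ast.literal_eval(token_repr)
    return vocab


toy_vocab = {0: b"\n", 1: b" the", 2: b"\xc3\xa9", 3: b"\t"}
with tempfile.TemporaryDirectory() as tmp:
    vocab_path = Path(tmp) / "toy_vocab.txt"
    save_vocab(toy_vocab, vocab_path)
    restored = load_vocab(vocab_path)

print(restored == toy_vocab)  # True
```

A binary format such as `pickle` or JSON with base64-encoded tokens would also work; the text form is used here only because it stays readable in an editor.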
