# DEMO: Pre-Tokenization

Demo run of pre-tokenization over a text file on a 2023 MacBook Pro (M3 Pro).

## Pre-tokenization Pattern

Regex-based pre-tokenization pattern (used by GPT-2; Radford et al., 2019), taken from github.com/openai/tiktoken/pull/234/files.

In [1]:
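# Pre-tokenization pattern. The \p{L} and \p{N} classes need the third-party
# `regex` module; the standard-library `re` module rejects them.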
PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
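
To make the behaviour of the pattern concrete, here is a minimal sketch that applies an approximation of it to a short string. It uses only the standard-library `re` module so that it runs without extra dependencies. As an assumption for illustration, `[^\W\d_]` stands in for `\p{L}` and `\d` for `\p{N}`, which is equivalent on plain ASCII text but not on all Unicode input. `APPROX_PAT` and `sample` are hypothetical names that are not part of the original code.

```python
import re

# Approximation of PAT for the standard-library `re` module (assumption:
# the original pattern is meant for the third-party `regex` package).
#   \p{L}  ~  [^\W\d_]   (word characters that are neither digits nor "_")
#   \p{N}  ~  \d         (decimal digits only)
APPROX_PAT = re.compile(
    r"'(?:[sdmt]|ll|ve|re)"   # English contraction suffixes: 's, 'd, 'll, ...
    r"| ?[^\W\d_]+"           # optional leading space + run of letters
    r"| ?\d+"                 # optional leading space + run of digits
    r"| ?(?:[^\s\w]|_)+"      # optional leading space + punctuation/symbols
    r"|\s+(?!\S)"             # whitespace not followed by a non-space char
    r"|\s+"                   # any remaining whitespace
)

sample = "Hello world, it's 2023!"
pieces = APPROX_PAT.findall(sample)
print(pieces)
# ['Hello', ' world', ',', ' it', "'s", ' 2023', '!']
```

The leading space stays attached to the word that follows it, and joining the pieces reproduces the input exactly.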

We will process the TinyStories validation .txt file, downloaded from Hugging Face, to keep the demonstration fast.

In [2]:
from pathlib import Path

# Input file to pretokenize: TinyStories validation .txt file, from Hugging Face.
INPUT_PATH = Path("../data/TinyStoriesV2-GPT4-valid.txt")

## Run Pre-tokenization

Both parallel and serial runtimes are displayed for comparison.

Check the number of available CPU cores:

In [3]:
from multiprocessing import cpu_count

n_cpus = cpu_count()
print(n_cpus)

11


We will run pre-tokenization using all available CPU cores (`n_cpus`) on the same MacBook Pro (2023, M3 Pro).

In [4]:
from bpe_transformer.tokenization.preprocessing import pretokenize

# End-of-text token, used as the delimiter when splitting the file into chunks.
split_token = b"<|endoftext|>"
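
The internals of `pretokenize` are not shown in this notebook. As a rough sketch of the general idea, assuming the input is split on the end-of-text token and per-chunk counts are merged afterwards, the toy example below counts whitespace-separated words as a stand-in for the regex-based pre-tokenization. All function names here (`split_documents`, `count_words`, `merge_counts`) are hypothetical and are not the library's API.

```python
from collections import Counter

SPLIT_TOKEN = b"<|endoftext|>"


def split_documents(data: bytes, split_token: bytes = SPLIT_TOKEN) -> list[bytes]:
    """Split raw bytes on the delimiter, dropping empty pieces."""
    return [doc for doc in data.split(split_token) if doc]


def count_words(doc: bytes) -> Counter:
    """Toy stand-in for pre-tokenization: count whitespace-separated
    words, keyed by a tuple of their byte values."""
    return Counter(tuple(word) for word in doc.split())


def merge_counts(partials) -> Counter:
    """Sum per-chunk counters into a single counter."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


data = b"the cat sat<|endoftext|>the dog sat<|endoftext|>"
docs = split_documents(data)
totals = merge_counts(map(count_words, docs))
print(totals[tuple(b"the")])  # 2
```

Because each chunk is counted independently, the built-in `map` could in principle be swapped for a process pool's `map`, which is presumably where a parallel version gains its speed. Splitting on the delimiter also ensures that no pre-token spans two documents.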

In [5]:
import time

# perf_counter() is a monotonic, high-resolution clock intended for timing.
start_time = time.perf_counter()
pretokens_counter = pretokenize(file_path=INPUT_PATH, split_token=split_token, n_workers=n_cpus)
end_time = time.perf_counter()

print("\n=== Parallelized Version ===")
print(f"Processed {sum(pretokens_counter.values())} tokens in {end_time - start_time:.2f} seconds")
print(f"Found {len(pretokens_counter)} unique tokens")
print(f"Most common tokens: {pretokens_counter.most_common(10)}")


# Serial run
start_time = time.perf_counter()
pretokens_counter = pretokenize(file_path=INPUT_PATH, split_token=split_token, parallel_processing=False)
end_time = time.perf_counter()
print("\n=== Serial Version ===")
print(f"Processed {sum(pretokens_counter.values())} tokens in {end_time - start_time:.2f} seconds")
print(f"Found {len(pretokens_counter)} unique tokens")
print(f"Most common tokens: {pretokens_counter.most_common(10)}")


=== Parallelized Version ===
Processed 5501965 tokens in 0.56 seconds
Found 13113 unique tokens
Most common tokens: [((46,), 421616), ((44,), 235432), ((32, 116, 104, 101), 211031), ((32, 97, 110, 100), 196057), ((32, 97), 152161), ((10,), 152067), ((32, 116, 111), 150493), ((32, 119, 97, 115), 108019), ((32, 84, 104, 101, 121), 52425), ((32, 105, 116), 51670)]

=== Serial Version ===
Processed 5501965 tokens in 3.07 seconds
Found 13113 unique tokens
Most common tokens: [((46,), 421616), ((44,), 235432), ((32, 116, 104, 101), 211031), ((32, 97, 110, 100), 196057), ((32, 97), 152161), ((10,), 152067), ((32, 116, 111), 150493), ((32, 119, 97, 115), 108019), ((32, 84, 104, 101, 121), 52425), ((32, 105, 116), 51670)]
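
The counter keys in the output above appear to be tuples of UTF-8 byte values. A small sketch for turning them back into readable strings is shown below; `decode_key` is a hypothetical helper, and the entries are copied from the first six items of the output above.

```python
# First six entries copied from the "Most common tokens" output above.
most_common = [
    ((46,), 421616),
    ((44,), 235432),
    ((32, 116, 104, 101), 211031),
    ((32, 97, 110, 100), 196057),
    ((32, 97), 152161),
    ((10,), 152067),
]


def decode_key(key):
    """Convert a tuple of byte values back into a string."""
    return bytes(key).decode("utf-8", errors="replace")


for key, count in most_common:
    # repr() makes leading spaces and newlines visible.
    print(f"{decode_key(key)!r}: {count}")
# '.': 421616
# ',': 235432
# ' the': 211031
# ' and': 196057
# ' a': 152161
# '\n': 152067
```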


In this run, pre-tokenizing the chunks in parallel on all available cores (`cpu_count()` = 11) took 0.56 s against 3.07 s for the serial run, i.e. approx. **18\%** of the serial time (a speedup of about 5.5x).
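
A single timed run can be noisy, so the comparison is more robust when each variant is repeated and the best time is kept. The sketch below shows one way to do that, together with the ratio computed from the timings reported above; `best_of` is a hypothetical helper and is not part of `bpe_transformer`.

```python
import time


def best_of(fn, repeats=5):
    """Call fn() `repeats` times; return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return best, result


# Timings reported in the output above, in seconds.
parallel_s, serial_s = 0.56, 3.07
print(f"parallel / serial = {parallel_s / serial_s:.1%}")  # 18.2%
print(f"speedup = {serial_s / parallel_s:.1f}x")           # 5.5x
```

With the notebook's variables in scope, a call such as `best_of(lambda: pretokenize(file_path=INPUT_PATH, split_token=split_token, n_workers=n_cpus))` would time the parallel variant.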

It may be worth implementing this step in a compiled language such as C++ or Rust to reduce the runtime further.