```python
# RegexTokenizer comes from this repo; `text` is any training corpus string loaded earlier
tokenizer = RegexTokenizer()
tokenizer.train(text, 276, verbose=True)  # vocab size 276 = 256 raw byte tokens + 20 merges
valtext = "Many common characters, including numerals, punctuation, and other symbols, are unified within the standard and are not treated as specific to any given writing system. Unicode encodes thousands of emoji, with the continued development thereof conducted by the Consortium as a part of the standard.[4] Moreover, the widespread adoption of Unicode was in large part responsible for the initial popularization of emoji outside of Japan. Unicode is ultimately capable of encoding more than 1.1 million characters."
valtext2 = tokenizer.decode(tokenizer.encode(valtext))
print(valtext2 == valtext)
# True
```
- Implements the basic BPE algorithm for tokenizer training, as described here: https://en.wikipedia.org/wiki/Byte-pair_encoding (a minimal sketch of the training loop appears below)
- Supports regex splitting of the text before BPE is applied, as introduced in GPT-2 here: https://github.com/openai/gpt-2/blob/master/src/encoder.py (the GPT-4 split pattern is used by default; see the split sketch below)
- gpt4_tokenizer.py matches tiktoken's cl100k_base encoding/decoding (a comparison sketch is shown below); support for the tiktoken vocab follows the implementation here: https://github.com/karpathy/minbpe/blob/master/minbpe/gpt4.py#L29
- Does not yet support special tokens (like '<|endoftext|>').
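
For reference, the heart of BPE training is a loop that repeatedly replaces the most frequent pair of adjacent tokens with a new token id. The sketch below is a minimal standalone version (the helper names follow karpathy's minbpe; this is illustrative, not the exact code in this repo):

```python
def get_stats(ids):
    # count how often each adjacent pair of token ids occurs
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # replace every occurrence of `pair` in `ids` with the new token `idx`
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    ids = list(text.encode("utf-8"))    # start from raw bytes (tokens 0..255)
    merges = {}
    for idx in range(256, vocab_size):  # one new token per merge
        stats = get_stats(ids)
        if not stats:
            break                       # nothing left to merge
        pair = max(stats, key=stats.get)  # most frequent adjacent pair
        ids = merge(ids, pair, idx)
        merges[pair] = idx
    return merges
```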
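The regex pre-split chunks the text so that merges never cross chunk boundaries (e.g. letters never merge with the punctuation that follows them). A minimal sketch, shown with GPT-2's shorter pattern for readability (the GPT-4 variant used by default here is similar but adds case-insensitive contractions and caps number runs at 3 digits), and using the third-party `regex` module, which the `\p{...}` classes require:

```python
import regex as re  # stdlib `re` does not support \p{...} classes

# GPT-2's split pattern from the encoder.py linked above
GPT2_SPLIT_PATTERN = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

print(GPT2_SPLIT_PATTERN.findall("Hello world, it's 2024!"))
# ['Hello', ' world', ',', ' it', "'s", ' 2024', '!']
# BPE is then trained and applied within each chunk independently.
```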
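A quick way to sanity-check the cl100k_base claim is to compare against tiktoken directly. A sketch, assuming `tiktoken` is installed and that gpt4_tokenizer.py exposes a `GPT4Tokenizer` class with `encode`/`decode` (the import path and class name here are assumptions):

```python
import tiktoken
from gpt4_tokenizer import GPT4Tokenizer  # assumed import path in this repo

enc = tiktoken.get_encoding("cl100k_base")
tokenizer = GPT4Tokenizer()

text = "hello world!!!? (and some unicode: 你好) lol123 😉"
assert tokenizer.encode(text) == enc.encode(text)        # identical token ids
assert tokenizer.decode(tokenizer.encode(text)) == text  # lossless roundtrip
print("ok")
```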