Pure Elixir BPE tokenizer compatible with OpenAI's tiktoken. No NIFs, no Python, no external dependencies.
Supports cl100k_base (GPT-4, GPT-3.5) and o200k_base (GPT-4o) encodings.

    # Encode text to token IDs
    Tiktokenex.encode("Hello, world!")
    #=> [9906, 11, 1917, 0]

    # Decode back to text
    Tiktokenex.decode([9906, 11, 1917, 0])
    #=> "Hello, world!"

    # Count tokens
    Tiktokenex.count("Hello, world!")
    #=> 4

    # See the BPE chunks
    Tiktokenex.encode_to_chunks("Hello, world!")
    #=> ["Hello", ",", " world", "!"]

    # Use o200k_base encoding
    Tiktokenex.encode("Hello", :o200k_base)

Add to your mix.exs as a git or path dependency:

    def deps do
      [
        # git
        {:tiktokenex, git: "https://github.com/phiat/tiktokenex.git"},
        # …or a sibling working copy for development
        {:tiktokenex, path: "../tiktokenex"}
      ]
    end

BPE rank files are not tracked in git. Fetch them once with the bundled justfile recipe:

    git clone https://github.com/phiat/tiktokenex.git
    cd tiktokenex
    just setup   # mix deps.get + downloads cl100k_base + o200k_base into priv/ranks/

Or download manually:

    mkdir -p priv/ranks
    curl -o priv/ranks/cl100k_base.tiktoken \
      https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
    curl -o priv/ranks/o200k_base.tiktoken \
      https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken

- Pre-tokenization (`Pretokenizer`): splits text into coarse chunks using tiktoken's regex patterns
- BPE encoding (`BPE`): applies byte-pair encoding merges using rank tables
- Rank loading (`Ranks`): parses `.tiktoken` rank files, caches in `persistent_term`
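
To illustrate the BPE encoding stage listed above, here is a minimal sketch of a greedy lowest-rank merge loop over a single chunk. It is not the library's actual `BPE` module: `BpeSketch`, `encode_chunk/2`, and the toy rank table are hypothetical names and values chosen for illustration, and the loop favours clarity over speed.

```elixir
defmodule BpeSketch do
  # Toy rank table (hypothetical values, NOT real cl100k_base ranks).
  # A lower rank means the pair is merged earlier; the rank is also the token ID.
  @ranks %{"a" => 0, "b" => 1, "c" => 2, "ab" => 3, "abc" => 4}

  # Encode one pre-tokenized chunk into a list of token IDs.
  # Assumes every single byte of the chunk has an entry in the rank table.
  def encode_chunk(chunk, ranks \\ @ranks) when is_binary(chunk) do
    chunk
    |> split_bytes()
    |> merge(ranks)
    |> Enum.map(&Map.fetch!(ranks, &1))
  end

  # Start from one part per byte.
  defp split_bytes(chunk), do: for(<<byte <- chunk>>, do: <<byte>>)

  # Repeatedly merge the adjacent pair with the lowest rank until none is left.
  defp merge(parts, ranks) do
    case best_pair(parts, ranks) do
      nil -> parts
      {index, _rank} -> parts |> merge_at(index) |> merge(ranks)
    end
  end

  # Find {index, rank} of the lowest-ranked mergeable pair (leftmost on ties).
  defp best_pair(parts, ranks) do
    parts
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.with_index()
    |> Enum.flat_map(fn {[left, right], index} ->
      case Map.fetch(ranks, left <> right) do
        {:ok, rank} -> [{index, rank}]
        :error -> []
      end
    end)
    |> Enum.min_by(fn {_index, rank} -> rank end, fn -> nil end)
  end

  # Join the parts at index and index + 1 into a single part.
  defp merge_at(parts, index) do
    {head, [left, right | tail]} = Enum.split(parts, index)
    head ++ [left <> right | tail]
  end
end
```

With the toy table, `BpeSketch.encode_chunk("abc")` first merges `"a"` and `"b"` into `"ab"`, then `"ab"` and `"c"` into `"abc"`, and returns `[4]`. This version rescans all pairs after every merge, which is simple but quadratic in the chunk length.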
The algorithm matches tiktoken's output exactly. See test/ for reference vectors.
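
For the rank loading stage, a `.tiktoken` file holds one entry per line: a base64-encoded token, a space, and its integer rank. The sketch below shows one way to parse that format and cache the result in `:persistent_term`. `RanksSketch`, `parse/1`, `load/1`, and the cache key shape are hypothetical and not taken from the library's `Ranks` module.

```elixir
defmodule RanksSketch do
  # Parse the contents of a .tiktoken file into a map of token bytes => rank.
  # Each non-empty line looks like "<base64 token> <rank>".
  def parse(contents) when is_binary(contents) do
    contents
    |> String.split("\n", trim: true)
    |> Map.new(fn line ->
      [encoded, rank] = String.split(line, " ", parts: 2)
      {Base.decode64!(encoded), String.to_integer(rank)}
    end)
  end

  # Load a rank file once and cache it under a per-path key (assumed key shape).
  # Later calls for the same path read the cached map instead of the file.
  def load(path) do
    key = {__MODULE__, path}

    case :persistent_term.get(key, nil) do
      nil ->
        ranks = path |> File.read!() |> parse()
        :persistent_term.put(key, ranks)
        ranks

      ranks ->
        ranks
    end
  end
end
```

`:persistent_term` suits this access pattern because the table is written once and read many times; updating or erasing a persistent term is comparatively expensive.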

| Function | Description |
|---|---|
| `encode(text, encoding)` | Text to token ID list |
| `decode(ids, encoding)` | Token IDs back to text |
| `encode_to_chunks(text, encoding)` | Text to BPE chunk strings |
| `count(text, encoding)` | Token count |

Default encoding is :cl100k_base. Pass :o200k_base as the second argument for GPT-4o tokenization.
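
As a usage sketch, a token count can gate text against a model's context limit. `TokenBudget` and `fits?/4` are hypothetical names, not part of tiktokenex; the only library call assumed is `Tiktokenex.count/2` as listed in the table, and the counter is injectable so the helper can be exercised without the rank files.

```elixir
defmodule TokenBudget do
  # Hypothetical helper: true when `text` fits within `max_tokens`.
  # The default counter assumes tiktokenex is installed and its rank files are present.
  def fits?(text, max_tokens, encoding \\ :cl100k_base, counter \\ &Tiktokenex.count/2) do
    counter.(text, encoding) <= max_tokens
  end
end
```

With the library available, `TokenBudget.fits?("Hello, world!", 8)` would compare the count for that string (4, per the usage example earlier in this README) against the budget.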

    just check   # mix test + credo + compile-with-warnings-as-errors
    just test    # tests only

MIT. See LICENSE.