Description
Summary
In o200k_harmony, two special token names share the same token id 200018:
- <|endofprompt|> → 200018
- <|reserved_200018|> → 200018 (conflict)
Token ids must be unique within an encoding.
Reproduction
import tiktoken
from collections import defaultdict
print(tiktoken.__version__) # expect 0.12.0
enc = tiktoken.get_encoding("o200k_harmony")
sp = enc._special_tokens
print(sp) # shows '<|endofprompt|>': 200018 and '<|reserved_200018|>': 200018
# Optional: explicit duplicate-id check
id2names = defaultdict(list)
for name, tid in sp.items():
    id2names[tid].append(name)
dups = {tid: names for tid, names in id2names.items() if len(names) > 1}
print(dups)
# -> {200018: ['<|endofprompt|>', '<|reserved_200018|>']}
Actual
<|reserved_200018|> duplicates <|endofprompt|> (both id=200018).
Expected
No two special token names share the same token id.
Root Cause
tiktoken_ext/openai_public.py bulk-generates reserved_* specials for [200013, 201088) without excluding ids already used by explicit specials, introducing <|reserved_200018|> as a duplicate of <|endofprompt|>.
- This is the full code segment where the bug is introduced (tiktoken/tiktoken_ext/openai_public.py, lines 95 to 145 in 97e49cb):

def o200k_base():
    mergeable_ranks = load_tiktoken_bpe(
        "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken",
        expected_hash="446a9538cb6c348e3516120d7c08b09f57c36495e2acfffe59a5bf8b0cfb1a2d",
    )
    special_tokens = {ENDOFTEXT: 199999, ENDOFPROMPT: 200018}
    # This regex could be made more efficient. If I was the one working on this encoding, I would
    # have done a few other things differently too, e.g. I think you can allocate tokens more
    # efficiently across languages.
    pat_str = "|".join(
        [
            r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
            r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
            r"""\p{N}{1,3}""",
            r""" ?[^\s\p{L}\p{N}]+[\r\n/]*""",
            r"""\s*[\r\n]+""",
            r"""\s+(?!\S)""",
            r"""\s+""",
        ]
    )
    return {
        "name": "o200k_base",
        "pat_str": pat_str,
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": special_tokens,
    }

def o200k_harmony():
    base_enc = o200k_base()
    name = "o200k_harmony"
    pat_str = base_enc["pat_str"]
    mergeable_ranks = base_enc["mergeable_ranks"]
    special_tokens = {
        **base_enc["special_tokens"],
        "<|startoftext|>": 199998,
        "<|endoftext|>": 199999,
        "<|reserved_200000|>": 200000,
        "<|reserved_200001|>": 200001,
        "<|return|>": 200002,
        "<|constrain|>": 200003,
        "<|reserved_200004|>": 200004,
        "<|channel|>": 200005,
        "<|start|>": 200006,
        "<|end|>": 200007,
        "<|message|>": 200008,
        "<|reserved_200009|>": 200009,
        "<|reserved_200010|>": 200010,
        "<|reserved_200011|>": 200011,
        "<|call|>": 200012,
    } | {f"<|reserved_{i}|>": i for i in range(200013, 201088)}
- At line 100, <|endofprompt|> is already registered with the id 200018 (tiktoken/tiktoken_ext/openai_public.py, line 100 in 97e49cb):

special_tokens = {ENDOFTEXT: 199999, ENDOFPROMPT: 200018}
- However, at line 145, the code adds every id in [200013, 201088) as a reserved token, which mistakenly re-uses the already-assigned id 200018 (tiktoken/tiktoken_ext/openai_public.py, line 145 in 97e49cb); the overlap check after this list makes the collision concrete:

} | {f"<|reserved_{i}|>": i for i in range(200013, 201088)}
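For illustration, a minimal standalone check (ids copied from the o200k_harmony() snippet above; nothing here is part of tiktoken itself) showing that the bulk range overlaps exactly one explicitly assigned id:

# Ids explicitly assigned before the bulk comprehension runs
# (taken from the snippet above; 200018 comes from <|endofprompt|>).
explicit_ids = {199998, 199999, 200000, 200001, 200002, 200003, 200004, 200005,
                200006, 200007, 200008, 200009, 200010, 200011, 200012, 200018}

# Ids produced by the bulk reserved_* comprehension.
bulk_ids = set(range(200013, 201088))

print(sorted(explicit_ids & bulk_ids))  # -> [200018]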
Fix
- Remove <|reserved_200018|>, which duplicates <|endofprompt|> (both id=200018).
- When generating reserved_* specials, skip ids already defined in special_tokens to prevent future collisions (see the sketch after this list).
- Implemented in PR #458.
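A minimal sketch of the guard from the second bullet, assuming the special_tokens dict built in o200k_harmony() above (the exact change in PR #458 may differ):

# Only mint a reserved placeholder for ids that are not already taken by an
# explicitly named special token (e.g. 200018 = <|endofprompt|>).
used_ids = set(special_tokens.values())
special_tokens |= {
    f"<|reserved_{i}|>": i
    for i in range(200013, 201088)
    if i not in used_ids
}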
Tests
Add tests/test_token_ids_unique.py to enforce token-id uniqueness across all encodings:
- special token ids are unique (no two names share the same id);
- mergeable vocab ids are unique when _mergeable_ranks is exposed.
The test fails before this change and passes after.
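One possible shape for that test (names and structure are illustrative; the file added alongside the fix may differ):

# tests/test_token_ids_unique.py (illustrative sketch)
import tiktoken


def test_special_token_ids_unique():
    for name in tiktoken.list_encoding_names():
        enc = tiktoken.get_encoding(name)
        ids = list(enc._special_tokens.values())
        assert len(ids) == len(set(ids)), f"duplicate special token id in {name}"


def test_mergeable_rank_ids_unique():
    for name in tiktoken.list_encoding_names():
        enc = tiktoken.get_encoding(name)
        ranks = getattr(enc, "_mergeable_ranks", None)
        if ranks is None:
            continue  # only check encodings that expose their mergeable vocab
        ids = list(ranks.values())
        assert len(ids) == len(set(ids)), f"duplicate mergeable rank in {name}"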
Compatibility
No behavior change to encoding/decoding. Only removes the duplicate special token entry.
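For example, with the fix applied, the affected id still round-trips as before (a quick sanity check, not an exhaustive test):

import tiktoken

enc = tiktoken.get_encoding("o200k_harmony")

# <|endofprompt|> keeps its id, and the id still decodes back to the same text.
assert enc.encode("<|endofprompt|>", allowed_special={"<|endofprompt|>"}) == [200018]
assert enc.decode([200018]) == "<|endofprompt|>"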
Environment
- tiktoken version: 0.12.0
- Python: 3.13.7
- OS: macOS/Linux/Windows (reproducible)