bug: o200k_harmony duplicates special token id 200018 #457

@jinzhuer

Summary

In o200k_harmony, two special token names share the same token id 200018:

  • <|endofprompt|> → 200018
  • <|reserved_200018|> → 200018 (conflict)

Token ids must be unique within an encoding.


Reproduction

import tiktoken
from collections import defaultdict

print(tiktoken.__version__)  # expect 0.12.0

enc = tiktoken.get_encoding("o200k_harmony")
sp = enc._special_tokens
print(sp)  # shows '<|endofprompt|>': 200018 and '<|reserved_200018|>': 200018

# Optional: explicit duplicate-id check
id2names = defaultdict(list)
for name, tid in sp.items():
    id2names[tid].append(name)
dups = {tid: names for tid, names in id2names.items() if len(names) > 1}
print(dups)
# -> {200018: ['<|endofprompt|>', '<|reserved_200018|>']}

Actual

<|reserved_200018|> duplicates <|endofprompt|> (both id=200018).

Expected

No two special token names share the same token id.

Root Cause

tiktoken_ext/openai_public.py bulk-generates reserved_* specials for [200013, 201088) without excluding ids already used by explicit specials, introducing <|reserved_200018|> as a duplicate of <|endofprompt|>.

  • This is the full code segment where the bug is introduced:

    def o200k_base():
        mergeable_ranks = load_tiktoken_bpe(
            "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken",
            expected_hash="446a9538cb6c348e3516120d7c08b09f57c36495e2acfffe59a5bf8b0cfb1a2d",
        )
        special_tokens = {ENDOFTEXT: 199999, ENDOFPROMPT: 200018}
        # This regex could be made more efficient. If I was the one working on this encoding, I would
        # have done a few other things differently too, e.g. I think you can allocate tokens more
        # efficiently across languages.
        pat_str = "|".join(
            [
                r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
                r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
                r"""\p{N}{1,3}""",
                r""" ?[^\s\p{L}\p{N}]+[\r\n/]*""",
                r"""\s*[\r\n]+""",
                r"""\s+(?!\S)""",
                r"""\s+""",
            ]
        )
        return {
            "name": "o200k_base",
            "pat_str": pat_str,
            "mergeable_ranks": mergeable_ranks,
            "special_tokens": special_tokens,
        }

    def o200k_harmony():
        base_enc = o200k_base()
        name = "o200k_harmony"
        pat_str = base_enc["pat_str"]
        mergeable_ranks = base_enc["mergeable_ranks"]
        special_tokens = {
            **base_enc["special_tokens"],
            "<|startoftext|>": 199998,
            "<|endoftext|>": 199999,
            "<|reserved_200000|>": 200000,
            "<|reserved_200001|>": 200001,
            "<|return|>": 200002,
            "<|constrain|>": 200003,
            "<|reserved_200004|>": 200004,
            "<|channel|>": 200005,
            "<|start|>": 200006,
            "<|end|>": 200007,
            "<|message|>": 200008,
            "<|reserved_200009|>": 200009,
            "<|reserved_200010|>": 200010,
            "<|reserved_200011|>": 200011,
            "<|call|>": 200012,
        } | {f"<|reserved_{i}|>": i for i in range(200013, 201088)}

  • At line 100, <|endofprompt|> is already registered with the id 200018:

    special_tokens = {ENDOFTEXT: 199999, ENDOFPROMPT: 200018}

  • However, at line 145, the code adds every id in the half-open range [200013, 201088) as a reserved token, which mistakenly includes the already-used id 200018:

    } | {f"<|reserved_{i}|>": i for i in range(200013, 201088)}
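
The collision survives the merge because the dict union deduplicates by key (the token name), never by value (the token id). A minimal sketch of that behavior, using small toy ids in place of the real ones (the ids 12, 13..19, and 18 are illustrative assumptions, not values from tiktoken):

```python
# Toy ids stand in for the real ones; the structure mirrors the snippet above.
base = {"<|endofprompt|>": 18}
explicit = {**base, "<|call|>": 12}
reserved = {f"<|reserved_{i}|>": i for i in range(13, 20)}

# The union only collapses entries with the same name, so both names
# that point at id 18 are kept.
merged = explicit | reserved

names_for_18 = sorted(name for name, tid in merged.items() if tid == 18)
print(names_for_18)  # -> ['<|endofprompt|>', '<|reserved_18|>']
```

Since the two names differ, neither entry overwrites the other, and the id ends up registered twice.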

Fix

  • Remove <|reserved_200018|> which duplicates <|endofprompt|> (both id=200018).
  • When generating reserved_*, skip ids already defined in special_tokens to prevent future collisions.
  • Implemented in PR #458.
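
One way the second bullet could look is sketched below. The helper name `build_reserved` is hypothetical and the explicit token set is abbreviated; this is not necessarily how PR #458 implements it.

```python
def build_reserved(explicit: dict[str, int], start: int, stop: int) -> dict[str, int]:
    """Return reserved_* entries for [start, stop), skipping ids already in use."""
    used = set(explicit.values())
    return {f"<|reserved_{i}|>": i for i in range(start, stop) if i not in used}


# Abbreviated explicit specials (assumption: only two shown for brevity).
explicit = {"<|endofprompt|>": 200018, "<|call|>": 200012}
special_tokens = explicit | build_reserved(explicit, 200013, 201088)

print("<|reserved_200018|>" in special_tokens)  # -> False
print(len(set(special_tokens.values())) == len(special_tokens))  # -> True
```

Filtering against the ids already taken, rather than hard-coding 200018, keeps the range safe if another explicit special is later assigned an id inside it.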

Tests

Add tests/test_token_ids_unique.py to enforce token-id uniqueness across all encodings:

  • special token ids are unique (no two names share the same id);
  • mergeable vocab ids are unique when _mergeable_ranks is exposed.

The test fails before this change and passes after.
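
A minimal sketch of the kind of check such a test could rely on. The helper name `duplicate_ids` is hypothetical and the sample mappings are made up for illustration; the actual contents of tests/test_token_ids_unique.py may differ.

```python
from collections import Counter


def duplicate_ids(token_to_id):
    """Return the sorted ids that are assigned to more than one token."""
    counts = Counter(token_to_id.values())
    return sorted(tid for tid, n in counts.items() if n > 1)


# Made-up mappings: the first reproduces the collision, the second is clean.
broken = {"<|endofprompt|>": 200018, "<|reserved_200018|>": 200018, "<|call|>": 200012}
fixed = {"<|endofprompt|>": 200018, "<|call|>": 200012}

print(duplicate_ids(broken))  # -> [200018]
print(duplicate_ids(fixed))   # -> []
```

In a real test, the same helper could be applied to `enc._special_tokens` (the private attribute used in the reproduction above) for each name returned by `tiktoken.list_encoding_names()`, asserting that the result is empty.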

Compatibility

Encoding and decoding behavior is unchanged; the fix only removes the duplicate special token entry.

Environment

  • tiktoken version: 0.12.0
  • Python: 3.13.7
  • OS: macOS/Linux/Windows (reproducible)
