Description
Summary
In o200k_harmony, two special token names share the same token id 200018:
- <|endofprompt|> → 200018
- <|reserved_200018|> → 200018 (conflict)
Token ids must be unique within an encoding.
Reproduction
import tiktoken
from collections import defaultdict
print(tiktoken.__version__) # expect 0.12.0
enc = tiktoken.get_encoding("o200k_harmony")
sp = enc._special_tokens
print(sp) # shows '<|endofprompt|>': 200018 and '<|reserved_200018|>': 200018
# Optional: explicit duplicate-id check
id2names = defaultdict(list)
for name, tid in sp.items():
    id2names[tid].append(name)
dups = {tid: names for tid, names in id2names.items() if len(names) > 1}
print(dups)
# -> {200018: ['<|endofprompt|>', '<|reserved_200018|>']}
Actual
<|reserved_200018|> duplicates <|endofprompt|> (both id=200018).
Expected
No two special token names share the same token id.
Root Cause
tiktoken_ext/openai_public.py bulk-generates reserved_* specials for [200013, 201088) without excluding ids already used by explicit specials, introducing <|reserved_200018|> as a duplicate of <|endofprompt|>.
- This is the full code segment where the bug is introduced (tiktoken/tiktoken_ext/openai_public.py, lines 95 to 145 in 97e49cb):

def o200k_base():
    mergeable_ranks = load_tiktoken_bpe(
        "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken",
        expected_hash="446a9538cb6c348e3516120d7c08b09f57c36495e2acfffe59a5bf8b0cfb1a2d",
    )
    special_tokens = {ENDOFTEXT: 199999, ENDOFPROMPT: 200018}
    # This regex could be made more efficient. If I was the one working on this encoding, I would
    # have done a few other things differently too, e.g. I think you can allocate tokens more
    # efficiently across languages.
    pat_str = "|".join(
        [
            r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
            r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
            r"""\p{N}{1,3}""",
            r""" ?[^\s\p{L}\p{N}]+[\r\n/]*""",
            r"""\s*[\r\n]+""",
            r"""\s+(?!\S)""",
            r"""\s+""",
        ]
    )
    return {
        "name": "o200k_base",
        "pat_str": pat_str,
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": special_tokens,
    }

def o200k_harmony():
    base_enc = o200k_base()
    name = "o200k_harmony"
    pat_str = base_enc["pat_str"]
    mergeable_ranks = base_enc["mergeable_ranks"]
    special_tokens = {
        **base_enc["special_tokens"],
        "<|startoftext|>": 199998,
        "<|endoftext|>": 199999,
        "<|reserved_200000|>": 200000,
        "<|reserved_200001|>": 200001,
        "<|return|>": 200002,
        "<|constrain|>": 200003,
        "<|reserved_200004|>": 200004,
        "<|channel|>": 200005,
        "<|start|>": 200006,
        "<|end|>": 200007,
        "<|message|>": 200008,
        "<|reserved_200009|>": 200009,
        "<|reserved_200010|>": 200010,
        "<|reserved_200011|>": 200011,
        "<|call|>": 200012,
    } | {f"<|reserved_{i}|>": i for i in range(200013, 201088)}
- At line 100, <|endofprompt|> is already registered with the id 200018 (tiktoken/tiktoken_ext/openai_public.py, line 100 in 97e49cb):

special_tokens = {ENDOFTEXT: 199999, ENDOFPROMPT: 200018}
- However, at line 145, the code adds every id in [200013, 201088) as a reserved token, which mistakenly re-uses the already-assigned id 200018 (tiktoken/tiktoken_ext/openai_public.py, line 145 in 97e49cb); the overlap check after this list makes the collision concrete:

} | {f"<|reserved_{i}|>": i for i in range(200013, 201088)}
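For illustration, a minimal standalone check (ids copied from the o200k_harmony() snippet above; nothing here is part of tiktoken itself) showing that the bulk range overlaps exactly one explicitly assigned id:

# Ids explicitly assigned before the bulk comprehension runs
# (taken from the snippet above; 200018 comes from <|endofprompt|>).
explicit_ids = {199998, 199999, 200000, 200001, 200002, 200003, 200004, 200005,
                200006, 200007, 200008, 200009, 200010, 200011, 200012, 200018}

# Ids produced by the bulk reserved_* comprehension.
bulk_ids = set(range(200013, 201088))

print(sorted(explicit_ids & bulk_ids))  # -> [200018]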
Fix
- Remove <|reserved_200018|>, which duplicates <|endofprompt|> (both id=200018).
- When generating reserved_* specials, skip ids already defined in special_tokens to prevent future collisions (see the sketch after this list).
- Implemented in PR #458.
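A minimal sketch of the guard from the second bullet, assuming the special_tokens dict built in o200k_harmony() above (the exact change in PR #458 may differ):

# Only mint a reserved placeholder for ids that are not already taken by an
# explicitly named special token (e.g. 200018 = <|endofprompt|>).
used_ids = set(special_tokens.values())
special_tokens |= {
    f"<|reserved_{i}|>": i
    for i in range(200013, 201088)
    if i not in used_ids
}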
Tests
Add tests/test_token_ids_unique.py to enforce token-id uniqueness across all encodings:
- special token ids are unique (no two names share the same id);
- mergeable vocab ids are unique when _mergeable_ranks is exposed.
The test fails before this change and passes after.
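One possible shape for that test (names and structure are illustrative; the file added alongside the fix may differ):

# tests/test_token_ids_unique.py (illustrative sketch)
import tiktoken


def test_special_token_ids_unique():
    for name in tiktoken.list_encoding_names():
        enc = tiktoken.get_encoding(name)
        ids = list(enc._special_tokens.values())
        assert len(ids) == len(set(ids)), f"duplicate special token id in {name}"


def test_mergeable_rank_ids_unique():
    for name in tiktoken.list_encoding_names():
        enc = tiktoken.get_encoding(name)
        ranks = getattr(enc, "_mergeable_ranks", None)
        if ranks is None:
            continue  # only check encodings that expose their mergeable vocab
        ids = list(ranks.values())
        assert len(ids) == len(set(ids)), f"duplicate mergeable rank in {name}"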
Compatibility
No behavior change to encoding/decoding. Only removes the duplicate special token entry.
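For example, with the fix applied, the affected id still round-trips as before (a quick sanity check, not an exhaustive test):

import tiktoken

enc = tiktoken.get_encoding("o200k_harmony")

# <|endofprompt|> keeps its id, and the id still decodes back to the same text.
assert enc.encode("<|endofprompt|>", allowed_special={"<|endofprompt|>"}) == [200018]
assert enc.decode([200018]) == "<|endofprompt|>"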
Environment
- tiktoken version: 0.12.0
- Python: 3.13.7
- OS: macOS/Linux/Windows (reproducible)