Skip to content

Fix tokenization of <|constrain|> content type in rendering#47

Merged
dkundel-openai merged 1 commit intoopenai:mainfrom
dzhulgakov:fix-contrain-tokenization
Aug 9, 2025
Merged

Fix tokenization of <|constrain|> content type in rendering#47
dkundel-openai merged 1 commit intoopenai:mainfrom
dzhulgakov:fix-contrain-tokenization

Conversation

@dzhulgakov
Copy link
Contributor

The recommended way to render a constrained JSON clause in Harmony is .with_content_type("<|constrain|>json"). Unfortunately, the current code treats it as user input and doesn't tokenize into special tokens.

Thus, the previous tool calls get rendered wrong into the token space by encoding.render_conversation:

From rendering =======
' <' '|' 'con' 'strain' '|' '>' 'json' '<|message|>' '{}' '<|call|>'

After re-encoding or as produced by model =======
' ' '<|constrain|>' 'json' '<|message|>' '{}' '<|call|>'
image

This confuses the model and leads to invalid syntax for subsequent tool calling, especially on 20B model. See #27 (comment) for more context

If one renders text and encodes it back with special tokens allowed, the output is fixed. However, it opens an opportunity for prompt injection attacks as user input may now contain special tokens.

Instead, this PR handles the <|constrain|> special case explicitly. Alternative would be to add explicit API like .with_content_type("json", contrain=True) but it feels like a bigger change.

Minimal repro

Script
from openai_harmony import load_harmony_encoding, Conversation, Message, Role

encoding = load_harmony_encoding("HarmonyGptOss")

tool_call_message = (
    Message.from_role_and_content(Role.ASSISTANT, "{}")
    .with_recipient("functions.dummy")
    .with_channel("commentary")
    .with_content_type("<|constrain|>json")
)
conversation = Conversation.from_messages([tool_call_message])
tokens = encoding.render_conversation(conversation=conversation)
text = encoding.decode_utf8(tokens)
reencoded_tokens = encoding.encode(text, allowed_special="all")
text_reencoded = encoding.decode_utf8(reencoded_tokens)

print(f"{tokens == reencoded_tokens=}")
print(f"{text == text_reencoded=}")

print("text =======")
print(text)
print()
print("reencoded text =======")
print(text_reencoded)
print()

print("tokens =======")
print(" ".join([repr(encoding.decode([token])) for token in tokens]))
print()
print("reencoded tokens =======")
print(" ".join([repr(encoding.decode([token])) for token in reencoded_tokens]))
print()

output before this PR

tokens == reencoded_tokens=False
text == text_reencoded=True
text =======
<|start|>assistant to=functions.dummy<|channel|>commentary <|constrain|>json<|message|>{}<|call|>

reencoded text =======
<|start|>assistant to=functions.dummy<|channel|>commentary <|constrain|>json<|message|>{}<|call|>

tokens =======
'<|start|>' 'assistant' ' to' '=' 'functions' '.d' 'ummy' '<|channel|>' 'comment' 'ary' ' <' '|' 'con' 'strain' '|' '>' 'json' '<|message|>' '{}' '<|call|>'

reencoded tokens =======
'<|start|>' 'assistant' ' to' '=' 'functions' '.d' 'ummy' '<|channel|>' 'comment' 'ary' ' ' '<|constrain|>' 'json' '<|message|>' '{}' '<|call|>'

output after this PR

tokens == reencoded_tokens=True
text == text_reencoded=True
text =======
<|start|>assistant to=functions.dummy<|channel|>commentary <|constrain|>json<|message|>{}<|call|>

reencoded text =======
<|start|>assistant to=functions.dummy<|channel|>commentary <|constrain|>json<|message|>{}<|call|>

tokens =======
'<|start|>' 'assistant' ' to' '=' 'functions' '.d' 'ummy' '<|channel|>' 'comment' 'ary' ' ' '<|constrain|>' 'json' '<|message|>' '{}' '<|call|>'

reencoded tokens =======
'<|start|>' 'assistant' ' to' '=' 'functions' '.d' 'ummy' '<|channel|>' 'comment' 'ary' ' ' '<|constrain|>' 'json' '<|message|>' '{}' '<|call|>'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants