fix(kimi-k25): preserve thinking content in offline data formatting#115
Conversation
Kimi K2.x is a thinking model whose chat template emits the opening <think>
as part of the generation prompt, so a response captured from an inference
server is '{reasoning}</think>{answer}' — it has the closing </think> but not
the opening one. KimiK25Parser.format() only checked startswith('<think>') and
prepended an empty <think></think>, producing a malformed assistant turn:
'<think></think>{reasoning}</think>{answer}' (dangling close, reasoning outside
the think block). This corrupts offline EAGLE3 training data for K2.6.
Add KimiK25Parser._recover_missing_think_open(): if content has a </think> but
no opening <think>, restore the opener so the turn is a proper
<think>{reasoning}</think>{answer} block. Plain answers (no think tags) still
get the empty <think></think>; well-formed content is unchanged.
Adds TestKimiK25ThinkRecovery (3 cases).
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Add has_unbalanced_thinking_tags() and an aggregated load-time warning in load_conversation_dataset(): if formatted samples have an unequal number of <think>/</think> tags, surface it (count + likely cause + fix) before training instead of silently corrupting hidden-state extraction. This is the automatic guard for the dropped-opening-<think> class of bug (thinking model emits the opening <think> in the generation prompt, so captured responses lack it). Adds TestThinkBalanceCheck (5 cases). Signed-off-by: Qingyang Wu <willqywu@gmail.com>
When data is generated with a reasoning parser (e.g. --reasoning-parser kimi_k2),
the reasoning is returned in a separate reasoning_content/thinking field and
content is the bare answer. KimiK25Parser.format() previously ignored that field
and emitted only <think></think>{answer}, silently dropping the reasoning from
training (even though has_thinking_content() detects it).
Add _merge_reasoning_field(): for the last assistant turn, rebuild
<think>{reasoning}</think>{answer} from the reasoning field (fields kept in sync
with has_thinking_content) when content has no inline think tags. Composes with
_recover_missing_think_open so both generation paths (inline raw text vs
separated reasoning field) produce a correct single think block. Earlier turns
keep thinking stripped.
Adds TestKimiK25ReasoningField (6 cases).
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
8610f3b to
8e91dfc
Compare
The load-time has_unbalanced_thinking_tags() warning ran on every sample, but in defer/decode mode formatted_prompt ends with the generation-prompt opening <think> (legitimately unbalanced), firing a false-positive warning on every sample. Guard it to the offline path (not defer_tokenization), where the full conversation is re-tokenized and the malformation actually matters. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Historical (non-last) assistant turns drop their thinking via _strip_thinking, whose regex needs a complete <think>...</think>. A dropped-opener historical turn (reasoning</think>answer) therefore wasn't stripped and stayed malformed (<think></think>reasoning</think>answer). Recover the opening <think> before stripping so the block is matched and removed cleanly. Closes the multi-turn gap. Adds TestKimiK25MultiTurnDroppedOpener. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Note that format() produces a hybrid of the native template's two modes: non-last turns match preserve_thinking=False (reasoning stripped, = serving context), the last turn matches preserve_thinking=True (reasoning kept, = what the model generates / what the draft trains on). Single-turn data equals native preserve_thinking=True exactly. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 477d6f2e5a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if "</think>" in content and not content.lstrip().startswith("<think>"): | ||
| return "<think>" + content |
There was a problem hiding this comment.
Keep auto masking in sync with recovered thinking
When this branch fires for raw Kimi output like reasoning</think>answer, the formatted sample now contains non-empty thinking, but has_thinking_content() (used by _resolve_last_turn_loss_only() when last_turn_loss_only='auto') still returns false because the raw message has no opening <think> tag or reasoning field. In multi-turn offline tokenization with the default auto setting, these recovered thinking samples are therefore treated as non-thinking and receive loss on every assistant turn instead of only the final turn; please mirror this dropped-opener detection in the thinking-content check or pass a normalized signal into the mask decision.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — confirmed and fixed in fdaa2fd.
has_thinking_content (used by last_turn_loss_only='auto') only matched a literal opening <think> or a separate reasoning field, so the recovered {reasoning}</think>{answer} shape was treated as non-thinking and got loss on every assistant turn instead of only the last. I factored the dropped-opener detection into a shared _has_dropped_think_opener helper now used by both the detection path and KimiK25Parser._recover_missing_think_open, so masking and recovery can no longer drift.
has_thinking_content (used by last_turn_loss_only='auto') only matched a
literal opening <think> or a separate reasoning field, so the recovered
{reasoning}</think>{answer} shape was treated as non-thinking and got loss
on every assistant turn instead of only the last. Factor the dropped-opener
detection into a shared _has_dropped_think_opener helper used by both
detection and KimiK25Parser._recover_missing_think_open so they can't drift.
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
1e6d4ea to
84ada13
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 84ada13545
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if not isinstance(msg, dict) or msg.get("role") != "assistant": | ||
| continue | ||
| content = msg.get("content", "") |
There was a problem hiding this comment.
Don't mark empty dropped-opener turns as thinking
When the saved Kimi response is just </think>answer (the generation prompt supplied <think> but the model emitted no reasoning before closing it), recovery correctly formats this as an empty <think></think> block, but this condition still makes has_thinking_content() return true. With last_turn_loss_only='auto' on multi-turn offline tokenization, those no-reasoning samples will be masked as thinking samples and only the final assistant turn gets loss, dropping supervision for earlier assistant answers; require non-whitespace content before the first </think> before treating a dropped opener as real thinking.
Useful? React with 👍 / 👎.
Summary
KimiK25Parser.format()corrupts assistant turns from thinking models (Kimi K2.x) when their text is re-tokenized for offline training data. The opening<think>is emitted by the chat template as part of the generation prompt, so a captured response is{reasoning}</think>{answer}(closing tag, no opener) — or, with a reasoning parser, the reasoning is in a separate field andcontentis the bare answer. In both casesformat()produced a malformed turn and silently corrupted the supervised sequence / loss mask.This is latent on
mainbecause the online/decode path captures generated tokens verbatim (no re-tokenization); it only surfaces in offline EAGLE3 data prep.Changes
_recover_missing_think_open— restore a dropped opening<think>whencontenthas a</think>but no opener (raw, no reasoning parser)._merge_reasoning_field— rebuild<think>{reasoning}</think>{answer}from a separatereasoning_content/thinkingfield (reasoning-parser output, e.g.--reasoning-parser kimi_k2); fields kept in sync withhas_thinking_content._strip_thinking, so dropped-opener historical turns strip cleanly (multi-turn).has_unbalanced_thinking_tags+ an aggregated load-time warning inload_conversation_dataset(scoped to the offline path, so the decode generation-prompt<think>doesn't false-positive).The two recovery paths compose and are mutually exclusive (inline vs separated). Plain answers still get
<think></think>; well-formed content is unchanged.Behavior: before → after (last assistant turn unless noted)
content={reasoning}</think>{answer}<think></think>{reasoning}</think>{answer}(dangling</think>, reasoning outside think block)<think>{reasoning}</think>{answer}--reasoning-parser kimi_k2)reasoning_content/thinkingfield<think></think>{answer}(reasoning silently dropped)<think>{reasoning}</think>{answer}content=<think>…</think>…content={answer}<think></think>{answer}<think></think>{answer}Tests
TestKimiK25ThinkRecovery,TestKimiK25ReasoningField,TestKimiK25MultiTurnDroppedOpener,TestThinkBalanceCheck— 44 parser tests pass.