Skip to content

fix(kimi-k25): preserve thinking content in offline data formatting#115

Merged
qywu merged 8 commits into
mainfrom
qywu/kimi-k25-think-recover
Jun 4, 2026
Merged

fix(kimi-k25): preserve thinking content in offline data formatting#115
qywu merged 8 commits into
mainfrom
qywu/kimi-k25-think-recover

Conversation

@qywu
Copy link
Copy Markdown
Collaborator

@qywu qywu commented Jun 3, 2026

Summary

KimiK25Parser.format() corrupts assistant turns from thinking models (Kimi K2.x) when their text is re-tokenized for offline training data. The opening <think> is emitted by the chat template as part of the generation prompt, so a captured response is {reasoning}</think>{answer} (closing tag, no opener) — or, with a reasoning parser, the reasoning is in a separate field and content is the bare answer. In both cases format() produced a malformed turn and silently corrupted the supervised sequence / loss mask.

This is latent on main because the online/decode path captures generated tokens verbatim (no re-tokenization); it only surfaces in offline EAGLE3 data prep.

Changes

  • _recover_missing_think_open — restore a dropped opening <think> when content has a </think> but no opener (raw, no reasoning parser).
  • _merge_reasoning_field — rebuild <think>{reasoning}</think>{answer} from a separate reasoning_content/thinking field (reasoning-parser output, e.g. --reasoning-parser kimi_k2); fields kept in sync with has_thinking_content.
  • Apply recovery on non-last turns before _strip_thinking, so dropped-opener historical turns strip cleanly (multi-turn).
  • has_unbalanced_thinking_tags + an aggregated load-time warning in load_conversation_dataset (scoped to the offline path, so the decode generation-prompt <think> doesn't false-positive).

The two recovery paths compose and are mutually exclusive (inline vs separated). Plain answers still get <think></think>; well-formed content is unchanged.

Behavior: before → after (last assistant turn unless noted)

# data source reasoning carried in upstream (before) this PR (after)
1 no reasoning parser inline content = {reasoning}</think>{answer} <think></think>{reasoning}</think>{answer} (dangling </think>, reasoning outside think block) <think>{reasoning}</think>{answer}
2 reasoning parser (--reasoning-parser kimi_k2) separate reasoning_content/thinking field <think></think>{answer} (reasoning silently dropped) <think>{reasoning}</think>{answer}
3 inline well-formed content = <think>…</think>… ✅ unchanged ✅ unchanged
4 plain answer (no reasoning) content = {answer} <think></think>{answer} <think></think>{answer}
5 earlier (non-last) turns — well-formed / field / dropped-opener either ❌ dropped-opener stays malformed ✅ thinking stripped (all cases)

Tests

TestKimiK25ThinkRecovery, TestKimiK25ReasoningField, TestKimiK25MultiTurnDroppedOpener, TestThinkBalanceCheck — 44 parser tests pass.

Found while building an offline EAGLE3 overfit for Kimi-K2.6-NVFP4.

qywu added 3 commits June 2, 2026 21:42
Kimi K2.x is a thinking model whose chat template emits the opening <think>
as part of the generation prompt, so a response captured from an inference
server is '{reasoning}</think>{answer}' — it has the closing </think> but not
the opening one. KimiK25Parser.format() only checked startswith('<think>') and
prepended an empty <think></think>, producing a malformed assistant turn:
'<think></think>{reasoning}</think>{answer}' (dangling close, reasoning outside
the think block). This corrupts offline EAGLE3 training data for K2.6.

Add KimiK25Parser._recover_missing_think_open(): if content has a </think> but
no opening <think>, restore the opener so the turn is a proper
<think>{reasoning}</think>{answer} block. Plain answers (no think tags) still
get the empty <think></think>; well-formed content is unchanged.

Adds TestKimiK25ThinkRecovery (3 cases).

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Add has_unbalanced_thinking_tags() and an aggregated load-time warning in
load_conversation_dataset(): if formatted samples have an unequal number of
<think>/</think> tags, surface it (count + likely cause + fix) before training
instead of silently corrupting hidden-state extraction. This is the automatic
guard for the dropped-opening-<think> class of bug (thinking model emits the
opening <think> in the generation prompt, so captured responses lack it).

Adds TestThinkBalanceCheck (5 cases).

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
When data is generated with a reasoning parser (e.g. --reasoning-parser kimi_k2),
the reasoning is returned in a separate reasoning_content/thinking field and
content is the bare answer. KimiK25Parser.format() previously ignored that field
and emitted only <think></think>{answer}, silently dropping the reasoning from
training (even though has_thinking_content() detects it).

Add _merge_reasoning_field(): for the last assistant turn, rebuild
<think>{reasoning}</think>{answer} from the reasoning field (fields kept in sync
with has_thinking_content) when content has no inline think tags. Composes with
_recover_missing_think_open so both generation paths (inline raw text vs
separated reasoning field) produce a correct single think block. Earlier turns
keep thinking stripped.

Adds TestKimiK25ReasoningField (6 cases).

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
@qywu qywu force-pushed the qywu/kimi-k25-think-recover branch from 8610f3b to 8e91dfc Compare June 3, 2026 04:42
qywu added 2 commits June 2, 2026 21:48
The load-time has_unbalanced_thinking_tags() warning ran on every sample,
but in defer/decode mode formatted_prompt ends with the generation-prompt
opening <think> (legitimately unbalanced), firing a false-positive warning on
every sample. Guard it to the offline path (not defer_tokenization), where the
full conversation is re-tokenized and the malformation actually matters.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Historical (non-last) assistant turns drop their thinking via _strip_thinking,
whose regex needs a complete <think>...</think>. A dropped-opener historical
turn (reasoning</think>answer) therefore wasn't stripped and stayed malformed
(<think></think>reasoning</think>answer). Recover the opening <think> before
stripping so the block is matched and removed cleanly. Closes the multi-turn gap.

Adds TestKimiK25MultiTurnDroppedOpener.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
@qywu qywu marked this pull request as ready for review June 3, 2026 04:56
Note that format() produces a hybrid of the native template's two modes:
non-last turns match preserve_thinking=False (reasoning stripped, = serving
context), the last turn matches preserve_thinking=True (reasoning kept, = what
the model generates / what the draft trains on). Single-turn data equals
native preserve_thinking=True exactly.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 477d6f2e5a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread torchspec/data/parse.py Outdated
Comment on lines +437 to +438
if "</think>" in content and not content.lstrip().startswith("<think>"):
return "<think>" + content
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep auto masking in sync with recovered thinking

When this branch fires for raw Kimi output like reasoning</think>answer, the formatted sample now contains non-empty thinking, but has_thinking_content() (used by _resolve_last_turn_loss_only() when last_turn_loss_only='auto') still returns false because the raw message has no opening <think> tag or reasoning field. In multi-turn offline tokenization with the default auto setting, these recovered thinking samples are therefore treated as non-thinking and receive loss on every assistant turn instead of only the final turn; please mirror this dropped-opener detection in the thinking-content check or pass a normalized signal into the mask decision.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — confirmed and fixed in fdaa2fd.

has_thinking_content (used by last_turn_loss_only='auto') only matched a literal opening <think> or a separate reasoning field, so the recovered {reasoning}</think>{answer} shape was treated as non-thinking and got loss on every assistant turn instead of only the last. I factored the dropped-opener detection into a shared _has_dropped_think_opener helper now used by both the detection path and KimiK25Parser._recover_missing_think_open, so masking and recovery can no longer drift.

@qywu qywu requested a review from yubofredwang June 3, 2026 05:55
@qywu qywu marked this pull request as draft June 3, 2026 05:56
qywu added 2 commits June 3, 2026 11:17
has_thinking_content (used by last_turn_loss_only='auto') only matched a
literal opening <think> or a separate reasoning field, so the recovered
{reasoning}</think>{answer} shape was treated as non-thinking and got loss
on every assistant turn instead of only the last. Factor the dropped-opener
detection into a shared _has_dropped_think_opener helper used by both
detection and KimiK25Parser._recover_missing_think_open so they can't drift.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
@qywu qywu force-pushed the qywu/kimi-k25-think-recover branch from 1e6d4ea to 84ada13 Compare June 3, 2026 18:17
@yubofredwang yubofredwang marked this pull request as ready for review June 4, 2026 06:23
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 84ada13545

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread torchspec/data/parse.py
Comment on lines 65 to 67
if not isinstance(msg, dict) or msg.get("role") != "assistant":
continue
content = msg.get("content", "")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don't mark empty dropped-opener turns as thinking

When the saved Kimi response is just </think>answer (the generation prompt supplied <think> but the model emitted no reasoning before closing it), recovery correctly formats this as an empty <think></think> block, but this condition still makes has_thinking_content() return true. With last_turn_loss_only='auto' on multi-turn offline tokenization, those no-reasoning samples will be masked as thinking samples and only the final assistant turn gets loss, dropping supervision for earlier assistant answers; require non-whitespace content before the first </think> before treating a dropped opener as real thinking.

Useful? React with 👍 / 👎.

@qywu qywu merged commit 482a071 into main Jun 4, 2026
2 checks passed
@qywu qywu deleted the qywu/kimi-k25-think-recover branch June 4, 2026 06:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants