fix(kimi-k25): preserve thinking content in offline data formatting by qywu · Pull Request #115 · lightseekorg/TorchSpec

qywu · 2026-06-03T04:36:16Z

Summary

KimiK25Parser.format() corrupts assistant turns from thinking models (Kimi K2.x) when their text is re-tokenized for offline training data. The opening <think> is emitted by the chat template as part of the generation prompt, so a captured response is {reasoning}</think>{answer} (closing tag, no opener) — or, with a reasoning parser, the reasoning is in a separate field and content is the bare answer. In both cases format() produced a malformed turn and silently corrupted the supervised sequence / loss mask.

This is latent on main because the online/decode path captures generated tokens verbatim (no re-tokenization); it only surfaces in offline EAGLE3 data prep.

Changes

_recover_missing_think_open — restore a dropped opening <think> when content has a </think> but no opener (raw, no reasoning parser).
_merge_reasoning_field — rebuild <think>{reasoning}</think>{answer} from a separate reasoning_content/thinking field (reasoning-parser output, e.g. --reasoning-parser kimi_k2); fields kept in sync with has_thinking_content.
Apply recovery on non-last turns before _strip_thinking, so dropped-opener historical turns strip cleanly (multi-turn).
has_unbalanced_thinking_tags + an aggregated load-time warning in load_conversation_dataset (scoped to the offline path, so the decode generation-prompt <think> doesn't false-positive).

The two recovery paths compose and are mutually exclusive (inline vs separated). Plain answers still get <think></think>; well-formed content is unchanged.

Behavior: before → after (last assistant turn unless noted)

#	data source	reasoning carried in	upstream (before)	this PR (after)
1	no reasoning parser	inline `content` = `{reasoning}</think>{answer}`	❌ `<think></think>{reasoning}</think>{answer}` (dangling `</think>`, reasoning outside think block)	✅ `<think>{reasoning}</think>{answer}`
2	reasoning parser (`--reasoning-parser kimi_k2`)	separate `reasoning_content`/`thinking` field	❌ `<think></think>{answer}` (reasoning silently dropped)	✅ `<think>{reasoning}</think>{answer}`
3	inline well-formed	`content` = `<think>…</think>…`	✅ unchanged	✅ unchanged
4	plain answer (no reasoning)	`content` = `{answer}`	✅ `<think></think>{answer}`	✅ `<think></think>{answer}`
5	earlier (non-last) turns — well-formed / field / dropped-opener	either	❌ dropped-opener stays malformed	✅ thinking stripped (all cases)

Tests

TestKimiK25ThinkRecovery, TestKimiK25ReasoningField, TestKimiK25MultiTurnDroppedOpener, TestThinkBalanceCheck — 44 parser tests pass.

Found while building an offline EAGLE3 overfit for Kimi-K2.6-NVFP4.

Kimi K2.x is a thinking model whose chat template emits the opening <think> as part of the generation prompt, so a response captured from an inference server is '{reasoning}</think>{answer}' — it has the closing </think> but not the opening one. KimiK25Parser.format() only checked startswith('<think>') and prepended an empty <think></think>, producing a malformed assistant turn: '<think></think>{reasoning}</think>{answer}' (dangling close, reasoning outside the think block). This corrupts offline EAGLE3 training data for K2.6. Add KimiK25Parser._recover_missing_think_open(): if content has a </think> but no opening <think>, restore the opener so the turn is a proper <think>{reasoning}</think>{answer} block. Plain answers (no think tags) still get the empty <think></think>; well-formed content is unchanged. Adds TestKimiK25ThinkRecovery (3 cases). Signed-off-by: Qingyang Wu <willqywu@gmail.com>

Add has_unbalanced_thinking_tags() and an aggregated load-time warning in load_conversation_dataset(): if formatted samples have an unequal number of <think>/</think> tags, surface it (count + likely cause + fix) before training instead of silently corrupting hidden-state extraction. This is the automatic guard for the dropped-opening-<think> class of bug (thinking model emits the opening <think> in the generation prompt, so captured responses lack it). Adds TestThinkBalanceCheck (5 cases). Signed-off-by: Qingyang Wu <willqywu@gmail.com>

When data is generated with a reasoning parser (e.g. --reasoning-parser kimi_k2), the reasoning is returned in a separate reasoning_content/thinking field and content is the bare answer. KimiK25Parser.format() previously ignored that field and emitted only <think></think>{answer}, silently dropping the reasoning from training (even though has_thinking_content() detects it). Add _merge_reasoning_field(): for the last assistant turn, rebuild <think>{reasoning}</think>{answer} from the reasoning field (fields kept in sync with has_thinking_content) when content has no inline think tags. Composes with _recover_missing_think_open so both generation paths (inline raw text vs separated reasoning field) produce a correct single think block. Earlier turns keep thinking stripped. Adds TestKimiK25ReasoningField (6 cases). Signed-off-by: Qingyang Wu <willqywu@gmail.com>

The load-time has_unbalanced_thinking_tags() warning ran on every sample, but in defer/decode mode formatted_prompt ends with the generation-prompt opening <think> (legitimately unbalanced), firing a false-positive warning on every sample. Guard it to the offline path (not defer_tokenization), where the full conversation is re-tokenized and the malformation actually matters. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

Historical (non-last) assistant turns drop their thinking via _strip_thinking, whose regex needs a complete <think>...</think>. A dropped-opener historical turn (reasoning</think>answer) therefore wasn't stripped and stayed malformed (<think></think>reasoning</think>answer). Recover the opening <think> before stripping so the block is matched and removed cleanly. Closes the multi-turn gap. Adds TestKimiK25MultiTurnDroppedOpener. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

Note that format() produces a hybrid of the native template's two modes: non-last turns match preserve_thinking=False (reasoning stripped, = serving context), the last turn matches preserve_thinking=True (reasoning kept, = what the model generates / what the draft trains on). Single-turn data equals native preserve_thinking=True exactly. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 477d6f2e5a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-03T05:55:09Z

+        if "</think>" in content and not content.lstrip().startswith("<think>"):
+            return "<think>" + content


Keep auto masking in sync with recovered thinking

When this branch fires for raw Kimi output like reasoning</think>answer, the formatted sample now contains non-empty thinking, but has_thinking_content() (used by _resolve_last_turn_loss_only() when last_turn_loss_only='auto') still returns false because the raw message has no opening <think> tag or reasoning field. In multi-turn offline tokenization with the default auto setting, these recovered thinking samples are therefore treated as non-thinking and receive loss on every assistant turn instead of only the final turn; please mirror this dropped-opener detection in the thinking-content check or pass a normalized signal into the mask decision.

Useful? React with 👍 / 👎.

Good catch — confirmed and fixed in fdaa2fd.

has_thinking_content (used by last_turn_loss_only='auto') only matched a literal opening <think> or a separate reasoning field, so the recovered {reasoning}</think>{answer} shape was treated as non-thinking and got loss on every assistant turn instead of only the last. I factored the dropped-opener detection into a shared _has_dropped_think_opener helper now used by both the detection path and KimiK25Parser._recover_missing_think_open, so masking and recovery can no longer drift.

has_thinking_content (used by last_turn_loss_only='auto') only matched a literal opening <think> or a separate reasoning field, so the recovered {reasoning}</think>{answer} shape was treated as non-thinking and got loss on every assistant turn instead of only the last. Factor the dropped-opener detection into a shared _has_dropped_think_opener helper used by both detection and KimiK25Parser._recover_missing_think_open so they can't drift. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 84ada13545

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-04T06:25:52Z

        if not isinstance(msg, dict) or msg.get("role") != "assistant":
            continue
        content = msg.get("content", "")


Don't mark empty dropped-opener turns as thinking

When the saved Kimi response is just </think>answer (the generation prompt supplied <think> but the model emitted no reasoning before closing it), recovery correctly formats this as an empty <think></think> block, but this condition still makes has_thinking_content() return true. With last_turn_loss_only='auto' on multi-turn offline tokenization, those no-reasoning samples will be masked as thinking samples and only the final assistant turn gets loss, dropping supervision for earlier assistant answers; require non-whitespace content before the first </think> before treating a dropped opener as real thinking.

Useful? React with 👍 / 👎.

qywu added 3 commits June 2, 2026 21:42

qywu force-pushed the qywu/kimi-k25-think-recover branch from 8610f3b to 8e91dfc Compare June 3, 2026 04:42

qywu added 2 commits June 2, 2026 21:48

qywu marked this pull request as ready for review June 3, 2026 04:56

chatgpt-codex-connector Bot reviewed Jun 3, 2026

View reviewed changes

qywu requested a review from yubofredwang June 3, 2026 05:55

qywu marked this pull request as draft June 3, 2026 05:56

qywu added 2 commits June 3, 2026 11:17

chore: remove verbose comments from kimi-k25 think recovery

84ada13

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

qywu force-pushed the qywu/kimi-k25-think-recover branch from 1e6d4ea to 84ada13 Compare June 3, 2026 18:17

yubofredwang approved these changes Jun 3, 2026

View reviewed changes

yubofredwang marked this pull request as ready for review June 4, 2026 06:23

chatgpt-codex-connector Bot reviewed Jun 4, 2026

View reviewed changes

qywu merged commit 482a071 into main Jun 4, 2026
2 checks passed

qywu deleted the qywu/kimi-k25-think-recover branch June 4, 2026 06:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(kimi-k25): preserve thinking content in offline data formatting#115

fix(kimi-k25): preserve thinking content in offline data formatting#115
qywu merged 8 commits into
mainfrom
qywu/kimi-k25-think-recover

qywu commented Jun 3, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 3, 2026

Uh oh!

qywu Jun 3, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		if "</think>" in content and not content.lstrip().startswith("<think>"):
		return "<think>" + content

Conversation

qywu commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Behavior: before → after (last assistant turn unless noted)

Tests

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

qywu Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

qywu commented Jun 3, 2026 •

edited

Loading