fix(reasoning): stop <think> leaking into content when autoparser is in pure-content mode #9991

Merged
mudler merged 1 commit into master from fix/9985-autoparser-reasoning-leak on May 25, 2026
Conversation

@localai-bot
Collaborator

Summary

Fixes #9985 — qwen3-4b (and the rest of the qwen3 family) was returning the <think>...</think> block inside the OpenAI content field instead of in a separate reasoning field. Regression from v4.0.0, introduced by the C++ autoparser ChatDeltas path (#9224).

Root cause

When LocalAI templates a thinking model outside of jinja (the default for the qwen3 gallery), llama.cpp's chat parser falls back to a "pure content" PEG parser. It dumps the entire raw response — <think> tags and all — into ChatDelta.Content and leaves ChatDelta.ReasoningContent empty. The Go side in chat.go then preferred the autoparser's content over tokenCallback's correctly-split result, so the tags leaked through.

Debug log showing the bug:

[ChatDeltas] non-streaming Predict received deltas from C++ autoparser total_deltas=1
[ChatDeltas] non-SSE no-tools: overriding result with C++ autoparser deltas content_len=376 reasoning_len=0
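
As a rough illustration of the split that the Go side had already performed correctly before the override discarded it, the sketch below separates a single `<think>...</think>` block from the visible answer. `splitThink` is a hypothetical helper written for this example; it is not LocalAI's `ExtractReasoningWithConfig`, whose signature and configuration are not shown in this PR description. Treating an unterminated block as reasoning is also an assumption of the sketch.

```go
package main

import "strings"

// splitThink is a hypothetical, simplified stand-in for a Go-side reasoning
// extractor. It handles one <think>...</think> block and nothing else.
func splitThink(raw string) (reasoning, content string) {
	before, rest, found := strings.Cut(raw, "<think>")
	if !found {
		// No reasoning tags: pass the content through unchanged.
		return "", raw
	}
	inner, after, closed := strings.Cut(rest, "</think>")
	if !closed {
		// Assumption of this sketch: an unterminated block is all reasoning.
		return strings.TrimSpace(rest), strings.TrimSpace(before)
	}
	// Text before and after the block is the user-visible content.
	return strings.TrimSpace(inner), strings.TrimSpace(before + after)
}
```

With this helper, a pure-content payload such as `<think>plan</think>Hello` would be reported as reasoning `plan` and content `Hello`, which is the shape the OpenAI-style response is expected to have.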

Fix shape

  • Conditional fallback. applyAutoparserOverride (extracted from chat.go's inline override) now runs Go-side ExtractReasoningWithConfig when the autoparser delivered content but no reasoning. When the autoparser DID populate ReasoningContent, we trust it untouched — jinja-enabled installs are not regressed.
  • Streaming gets a sticky preferAutoparser flag. It flips on the first chunk where the autoparser classified reasoning_content; until then the streaming worker uses the Go-side extractor's deltas.
  • Realtime mirrors the non-streaming fallback.
  • gallery/qwen3.yaml now enables use_jinja:true so the autoparser classifies <think> natively for the 20+ qwen3 family entries sharing this template. The Go-side fallback still covers older on-disk installs and any future imported models without jinja.
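
The conditional fallback in the first bullet can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `chat.go`: the `chatDelta` and `result` types, `applyOverride`, and `splitThink` are hypothetical names that model only the two fields mentioned above, and `splitThink` stands in for the real Go-side extractor.

```go
package main

import "strings"

// chatDelta models only the two autoparser fields discussed in this PR.
type chatDelta struct {
	Content          string
	ReasoningContent string
}

// result is a hypothetical holder for what the Go side already computed.
type result struct {
	Content   string
	Reasoning string
}

// splitThink is a simplified stand-in for the Go-side reasoning extractor.
func splitThink(raw string) (reasoning, content string) {
	before, rest, found := strings.Cut(raw, "<think>")
	if !found {
		return "", raw
	}
	inner, after, closed := strings.Cut(rest, "</think>")
	if !closed {
		return strings.TrimSpace(rest), strings.TrimSpace(before)
	}
	return strings.TrimSpace(inner), strings.TrimSpace(before + after)
}

// applyOverride sketches the conditional fallback: trust the autoparser when
// it classified reasoning, otherwise re-run the Go-side split on its content.
func applyOverride(existing result, deltas []chatDelta) result {
	if len(deltas) == 0 {
		// No autoparser output: keep what tokenCallback produced.
		return existing
	}
	var content, reasoning strings.Builder
	for _, d := range deltas {
		content.WriteString(d.Content)
		reasoning.WriteString(d.ReasoningContent)
	}
	if reasoning.Len() > 0 {
		// Autoparser populated reasoning (jinja path): pass through untouched.
		return result{Content: content.String(), Reasoning: reasoning.String()}
	}
	// Pure-content mode: the tags may still be inside the content.
	r, c := splitThink(content.String())
	return result{Content: c, Reasoning: r}
}
```

The key design point is that the fallback is gated on an empty reasoning field, so installs where the autoparser already classifies `<think>` never go through the Go-side split.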

Test plan

  • go test ./core/http/endpoints/openai/ ./core/http/endpoints/openresponses/ ./pkg/reasoning/ ./pkg/functions/ — green
  • New Ginkgo specs in chat_test.go covering:
    • autoparser delivered <think> in content + empty reasoning → split correctly (red without fix, green with fix)
    • autoparser already populated reasoning → passthrough untouched (no-regression on jinja path)
    • plain content, no reasoning tags → passthrough
    • empty <think></think> block from qwen3 /no_think → tags stripped, no spurious reasoning field
    • empty chatDeltas → returns existing result
  • golangci-lint run --new-from-merge-base=master — 0 new issues
  • End-to-end against running qwen3-4b (Q4_K_M):
    • Default thinking mode: content clean, reasoning in its own field
    • /no_think mode: empty think block stripped cleanly
    • Streaming: reasoning chunks delivered in delta.reasoning, content chunks clean
    • use_jinja:true variant (working-autoparser baseline): content_len=39 reasoning_len=376 from autoparser — Go-side fallback bypassed as expected

🤖 Generated with Claude Code

…in pure-content mode

When LocalAI templates a thinking model outside of jinja (the default for
the qwen3 gallery family), llama.cpp's chat parser falls back to a
"pure content" PEG parser that dumps the entire raw response into
ChatDelta.Content with an empty ReasoningContent. The Go side then
trusted that content verbatim and overrode tokenCallback's
correctly-split reasoning, so <think>...</think> blocks ended up in the
OpenAI `content` field. Regression from v4.0.0 introduced when the
autoparser ChatDeltas path was added (#9224).

The override now runs Go-side reasoning extraction defensively when the
autoparser delivered content but no reasoning. The streaming worker
gains a sticky preferAutoparser flag that flips on the first chunk
where the autoparser classified reasoning_content; until then we use
the streaming Go-side extractor. Realtime mirrors the non-streaming
fallback. When the autoparser already populated ReasoningContent we
trust it untouched, so jinja-enabled installs are not regressed.
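
The sticky flag described above could look something like the sketch below. The names `streamSelector`, `pick`, and `chunk` are hypothetical and exist only for illustration; the real streaming worker is not shown in this description. Whether the chunk that flips the flag is itself taken from the autoparser is an assumption of this sketch.

```go
package main

// chatDelta models only the two autoparser fields discussed in this PR.
type chatDelta struct {
	Content          string
	ReasoningContent string
}

// chunk is a hypothetical per-token payload sent to the client.
type chunk struct {
	Content   string
	Reasoning string
}

// streamSelector holds the sticky preference for one streaming response.
type streamSelector struct {
	preferAutoparser bool
}

// pick chooses, for one streamed chunk, between the autoparser's delta and
// the delta produced by the streaming Go-side extractor.
func (s *streamSelector) pick(auto chatDelta, goSide chunk) chunk {
	if auto.ReasoningContent != "" {
		// First classified reasoning chunk seen: flip and never reset.
		s.preferAutoparser = true
	}
	if s.preferAutoparser {
		return chunk{Content: auto.Content, Reasoning: auto.ReasoningContent}
	}
	// Until the autoparser proves it classifies reasoning, use the Go side.
	return goSide
}
```

Because the flag never resets, a stream does not switch sources back and forth once the autoparser has shown that it separates reasoning on its own.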

gallery/qwen3.yaml now enables use_jinja, letting the autoparser
classify <think> natively for all 20+ qwen3 family entries that share
this template.

Fixes #9985

Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@mudler mudler merged commit 1c6c3ad into master May 25, 2026
57 checks passed
@mudler mudler deleted the fix/9985-autoparser-reasoning-leak branch May 25, 2026 20:39
Development

Successfully merging this pull request may close these issues.

Regression: Reasoning/thinking output provided as regular output

2 participants