Skip to content

fix(streaming/tools): stop healing-marker stubs from gating off content#9999

Merged
mudler merged 2 commits into
masterfrom
fix/9988-streaming-tools-empty
May 25, 2026
Merged

fix(streaming/tools): stop healing-marker stubs from gating off content#9999
mudler merged 2 commits into
masterfrom
fix/9988-streaming-tools-empty

Conversation

@localai-bot
Copy link
Copy Markdown
Collaborator

Summary

Fixes #9988 — when streaming chat completions with tools, qwen3-4b (and any model that emits a JSON tool call without jinja-driven autoparsing) was dribbling only the first {" to the client and then nothing.

Root cause

For each chunk, chat_stream_workers.go runs functions.ParseJSONIterative on the accumulated content to look for a tool call. The PEG parser (parseJSONWithStack) heals partial input by inserting a random-integer healing marker, e.g. { becomes {"4310046988783340008":1}. removeHealingMarkerFromJSON only stripped the marker from values, so the synthetic key survived. The streaming detector saw a "result", ran the inner loop with continue (because there was no name), and then unconditionally bumped lastEmittedCount = len(jsonResults) — gating off all further content emission for the rest of the stream.

Debug logs from the repro showed exactly this: 15 content chunks from the autoparser, ParseJSONIterative matching 8 times with tool_calls=0, and a final tool_calls=1 text_content="" decision when the deferred end-of-stream code parsed the full JSON.

Fix shape

Three changes, each with its own regression test:

  1. pkg/functions/iterative_parser.goremoveHealingMarkerFromJSON strips the marker from keys (preserving the model-typed prefix if any), and drops the entry when the truncated key is empty. Existing healing-marker tests for {, [, and { "code) still pass.
  2. pkg/functions/parse.goParseJSONIterative skips empty-after-healing maps so the parser surface stays clean.
  3. core/http/endpoints/openai/chat_stream_workers.go — the streaming JSON tool-call detector breaks (not continues) on entries without a usable name, and bumps lastEmittedCount only past successfully-emitted entries. Defense-in-depth against any future partial-parse shape.

Test plan

  • go test ./pkg/functions/ — green, including a new DescribeTable over 8 partial-JSON-prefix shapes ({, {", {"n, {"na, {"name", {"name":, {"name":", {"name":"ans) that verifies no healing-marker characters leak into result keys.
  • New test confirms the two shapes that should produce no result at all ({, {") return empty results.
  • New test confirms a clean tool call {"name":"answer","arguments":{"message":"Hi"}} still parses with no marker-keyed garbage.
  • New test confirms {"name":"answer" (partial, missing close) returns {name: "answer"} with marker-only keys dropped.
  • End-to-end against qwen3-4b with stream: true + tools: [...]:
    • Before this PR: client receives data: {"content":"{\""} and then nothing.
    • After this PR: content flows token-by-token, the tool_call chunk lands once the model commits a name, and finish_reason: tool_calls is set correctly.

Related

Companion to #9991 (issue #9985<think> leaking into content). Both were qwen3-4b regressions vs v4.0.0; fixing them together restores the v4.0.0 user-visible behavior for the most common qwen3 streaming flows.

🤖 Generated with Claude Code

mudler added 2 commits May 25, 2026 20:41
When the C++ autoparser is in pure-content fallback mode (e.g. qwen3
without --jinja) and the model emits a tool call as JSON, the streaming
worker calls ParseJSONIterative on each new chunk. parseJSONWithStack
heals partial input like `{` into `{"<marker>":1}` where <marker> is a
random integer. removeHealingMarkerFromJSON only stripped the marker
from values, so the synthetic key survived and downstream callers saw
a stub object with a random-looking key.

chat_stream_workers.go's JSON tool-call detector then bumped
lastEmittedCount past the stub even though no real tool call was
emitted, gating off ALL subsequent content chunks. The qwen3 + tools +
streaming case ended up dribbling only the first `{"` to clients and
then nothing, even when the model went on to call the noAction
`answer({"message": "…"})` pseudo-tool.

Three changes, each with its own regression test:

* removeHealingMarkerFromJSON now strips the marker suffix from keys
  too, dropping the entry when the truncated key is empty. Inputs like
  `{` no longer leak `{"<marker>":1}` to callers; partial keys like
  `{ "code` still preserve the model-typed prefix `code`.

* ParseJSONIterative skips empty-after-healing maps so a healed `{`
  doesn't surface as a stub result.

* The streaming JSON detector now breaks (not continues) on entries
  without a usable `name`, and only bumps lastEmittedCount past
  successfully-emitted entries. Defense-in-depth against any future
  partial-parse shape.

The parser tests cover eight partial-JSON-prefix shapes and verify no
marker characters leak into keys, plus the two early shapes (`{`,
`{"`) that should not surface a stub at all.

Fixes #9988

Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Extract the JSON tool-call streaming emit loop into emitJSONToolCallDeltas
and unit-test it against every shape that can hit the streaming worker:

* the bug case — a healing-marker stub at index 0 must NOT bump
  lastEmittedCount, so subsequent content chunks keep flowing;
* the autoparser-correctly-working case — empty jsonResults (because
  the C++ autoparser cleared the raw text and delivers tool calls via
  TokenUsage.ChatDeltas) is a no-op, leaving the deferred end-of-stream
  emitter to ship the autoparser's tool calls;
* a single complete tool call — emit one chunk, advance to 1;
* arguments arriving as a JSON-string vs as a nested object — both
  serialize to the wire as JSON-string arguments;
* multiple parallel tool calls — one chunk each;
* a real tool call followed by a partial stub — emit the real one,
  stop at the stub, resume on a later chunk once the stub completes.

Locks down the no-regression guarantee the user asked for: this PR's
fix is scoped to the pure-content fallback path; when the autoparser
actually classifies tool calls (jinja-recognized chat format with tool
support), the helper is a no-op and nothing changes.

Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@mudler mudler merged commit f17d99f into master May 25, 2026
57 checks passed
@mudler mudler deleted the fix/9988-streaming-tools-empty branch May 25, 2026 21:55
mudler added a commit that referenced this pull request May 25, 2026
…iling reasoning chunk

When the C++ autoparser is in pure-content fallback mode (qwen3-4b after
model emits a tool-call JSON in non-thinking mode, the streaming worker
ended the SSE stream with a spurious

    data: {...,"delta":{"reasoning":"{\"name\":\"exec\",\"arguments\":...}"}}

chunk carrying the same JSON that was already in delta.tool_calls.

The Go-side ReasoningExtractor is configured from
DetectThinkingStartToken, which scans the model's jinja chat template
verbatim and finds <think> inside an {% if enable_thinking %} block
without evaluating the conditional. Every output chunk then runs through
PrependThinkingTokenIfNeeded, which synthesizes a <think> in front and
makes ExtractReasoning treat everything after as reasoning. The autoparser
correctly classifies zero reasoning (qwen3's tool format isn't on
llama.cpp's recognized-tool list, so all tokens land in
ChatDelta.Content), but processStreamWithTools then preferred
extractor.Reasoning() over functions.ReasoningFromChatDeltas at the
end-of-stream flush — handing the polluted Go-side state to
buildDeferredToolCallChunks, which emitted it as a trailing reasoning
chunk.

Two changes:

* Add a sticky preferAutoparser flag to processStreamWithTools, mirroring
  the analogous flag in processStream from #9985. Once any ChatDelta
  carries content or reasoning, the flag stays on for the rest of the
  stream and the worker stops falling back to the Go-side extractor for
  per-token deltas. This avoids the per-chunk leak path and the cumulative
  pollution.

* Extract chooseDeferredReasoning, a small helper that selects the
  end-of-stream reasoning source. When preferAutoparser is set, return
  functions.ReasoningFromChatDeltas(chatDeltas); otherwise fall back to
  extractor.Reasoning() (the correct source for vLLM and other backends
  with no autoparser).

The helper has a focused test suite covering both sides of the contract:
autoparser-active with empty reasoning (the qwen3 case — the fix's
purpose), autoparser-active with real reasoning_content
(jinja-with-recognized-format models), and autoparser-not-active with
genuine Go-side reasoning (vLLM-style backends).

E2E with combined #9988 and this fix on qwen3-4b post-#9985 gallery
shape: 18 content chunks of the tool-call JSON, 1 tool_call chunk with
name='exec' and the right arguments, finish_reason=tool_calls, and zero
reasoning chunks — down from one polluted reasoning chunk before this
fix.

Depends on #9999 (the streaming JSON tool-call gating bug for qwen3) to
make the trailing chunk observable end-to-end; the helper unit tests are
independent.

Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
mudler added a commit that referenced this pull request May 26, 2026
…iling reasoning chunk (#10000)

When the C++ autoparser is in pure-content fallback mode (qwen3-4b after
model emits a tool-call JSON in non-thinking mode, the streaming worker
ended the SSE stream with a spurious

    data: {...,"delta":{"reasoning":"{\"name\":\"exec\",\"arguments\":...}"}}

chunk carrying the same JSON that was already in delta.tool_calls.

The Go-side ReasoningExtractor is configured from
DetectThinkingStartToken, which scans the model's jinja chat template
verbatim and finds <think> inside an {% if enable_thinking %} block
without evaluating the conditional. Every output chunk then runs through
PrependThinkingTokenIfNeeded, which synthesizes a <think> in front and
makes ExtractReasoning treat everything after as reasoning. The autoparser
correctly classifies zero reasoning (qwen3's tool format isn't on
llama.cpp's recognized-tool list, so all tokens land in
ChatDelta.Content), but processStreamWithTools then preferred
extractor.Reasoning() over functions.ReasoningFromChatDeltas at the
end-of-stream flush — handing the polluted Go-side state to
buildDeferredToolCallChunks, which emitted it as a trailing reasoning
chunk.

Two changes:

* Add a sticky preferAutoparser flag to processStreamWithTools, mirroring
  the analogous flag in processStream from #9985. Once any ChatDelta
  carries content or reasoning, the flag stays on for the rest of the
  stream and the worker stops falling back to the Go-side extractor for
  per-token deltas. This avoids the per-chunk leak path and the cumulative
  pollution.

* Extract chooseDeferredReasoning, a small helper that selects the
  end-of-stream reasoning source. When preferAutoparser is set, return
  functions.ReasoningFromChatDeltas(chatDeltas); otherwise fall back to
  extractor.Reasoning() (the correct source for vLLM and other backends
  with no autoparser).

The helper has a focused test suite covering both sides of the contract:
autoparser-active with empty reasoning (the qwen3 case — the fix's
purpose), autoparser-active with real reasoning_content
(jinja-with-recognized-format models), and autoparser-not-active with
genuine Go-side reasoning (vLLM-style backends).

E2E with combined #9988 and this fix on qwen3-4b post-#9985 gallery
shape: 18 content chunks of the tool-call JSON, 1 tool_call chunk with
name='exec' and the right arguments, finish_reason=tool_calls, and zero
reasoning chunks — down from one polluted reasoning chunk before this
fix.

Depends on #9999 (the streaming JSON tool-call gating bug for qwen3) to
make the trailing chunk observable end-to-end; the helper unit tests are
independent.

Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Regression: AI responds with {" when requested with tools and stream=true

2 participants