fix(streaming/tools): stop healing-marker stubs from gating off content by localai-bot · Pull Request #9999 · mudler/LocalAI

localai-bot · 2026-05-25T20:41:51Z

Summary

Fixes #9988 — when streaming chat completions with tools, qwen3-4b (and any model that emits a JSON tool call without jinja-driven autoparsing) was dribbling only the first {" to the client and then nothing.

Root cause

For each chunk, chat_stream_workers.go runs functions.ParseJSONIterative on the accumulated content to look for a tool call. The PEG parser (parseJSONWithStack) heals partial input by inserting a random-integer healing marker, e.g. { becomes {"4310046988783340008":1}. removeHealingMarkerFromJSON only stripped the marker from values, so the synthetic key survived. The streaming detector saw a "result", ran the inner loop with continue (because there was no name), and then unconditionally bumped lastEmittedCount = len(jsonResults) — gating off all further content emission for the rest of the stream.

Debug logs from the repro showed exactly this: 15 content chunks from the autoparser, ParseJSONIterative matching 8 times with tool_calls=0, and a final tool_calls=1 text_content="" decision when the deferred end-of-stream code parsed the full JSON.

Fix shape

Three changes, each with its own regression test:

- pkg/functions/iterative_parser.go: removeHealingMarkerFromJSON strips the marker from keys (preserving the model-typed prefix if any), and drops the entry when the truncated key is empty. Existing healing-marker tests for `{`, `[`, and `{ "code` still pass.
- pkg/functions/parse.go: ParseJSONIterative skips empty-after-healing maps so the parser surface stays clean.
- core/http/endpoints/openai/chat_stream_workers.go: the streaming JSON tool-call detector breaks (not continues) on entries without a usable name, and bumps lastEmittedCount only past successfully-emitted entries. Defense-in-depth against any future partial-parse shape.
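
The key-stripping behaviour described in the first change can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `stripMarker` is a hypothetical stand-in for removeHealingMarkerFromJSON, the decoded-JSON representation (`map[string]any`) is an assumption, and the healed shape assumed for `{ "code` (the marker appended to the model-typed key prefix) is inferred from the description rather than taken from the parser.

```go
package main

import (
	"fmt"
	"strings"
)

// stripMarker is a hypothetical stand-in for removeHealingMarkerFromJSON.
// It truncates string values and object keys at the healing marker and
// drops entries whose key is empty after truncation.
func stripMarker(v any, marker string) any {
	switch t := v.(type) {
	case map[string]any:
		out := make(map[string]any, len(t))
		for k, val := range t {
			if i := strings.Index(k, marker); i >= 0 {
				k = k[:i] // keep only the prefix the model actually typed
				if k == "" {
					continue // marker-only key: purely synthetic, drop it
				}
			}
			out[k] = stripMarker(val, marker)
		}
		return out
	case []any:
		out := make([]any, 0, len(t))
		for _, item := range t {
			out = append(out, stripMarker(item, marker))
		}
		return out
	case string:
		if i := strings.Index(t, marker); i >= 0 {
			return t[:i]
		}
		return t
	default:
		return v
	}
}

func main() {
	const marker = "4310046988783340008"

	// Assumed healed form of `{`: the whole key is synthetic.
	stub := map[string]any{marker: float64(1)}
	fmt.Println(len(stripMarker(stub, marker).(map[string]any))) // 0

	// Assumed healed form of `{ "code`: the prefix "code" is model-typed.
	partial := map[string]any{"code" + marker: float64(1)}
	_, kept := stripMarker(partial, marker).(map[string]any)["code"]
	fmt.Println(kept) // true
}
```

With stubs reduced to empty maps, a caller can skip them outright, which is what the second change (skipping empty-after-healing maps in ParseJSONIterative) relies on.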

Test plan

- `go test ./pkg/functions/`: green, including a new DescribeTable over 8 partial-JSON-prefix shapes (`{`, `{"`, `{"n`, `{"na`, `{"name"`, `{"name":`, `{"name":"`, `{"name":"ans`) that verifies no healing-marker characters leak into result keys.
- New test confirms the two shapes that should produce no result at all (`{`, `{"`) return empty results.
- New test confirms a clean tool call `{"name":"answer","arguments":{"message":"Hi"}}` still parses with no marker-keyed garbage.
- New test confirms `{"name":"answer"` (partial, missing close) returns {name: "answer"} with marker-only keys dropped.
- End-to-end against qwen3-4b with stream: true + tools: [...]:
  - Before this PR: client receives data: {"content":"{\""} and then nothing.
  - After this PR: content flows token-by-token, the tool_call chunk lands once the model commits a name, and finish_reason: tool_calls is set correctly.

When the C++ autoparser is in pure-content fallback mode (e.g. qwen3 without --jinja) and the model emits a tool call as JSON, the streaming worker calls ParseJSONIterative on each new chunk. parseJSONWithStack heals partial input like `{` into `{"<marker>":1}` where <marker> is a random integer. removeHealingMarkerFromJSON only stripped the marker from values, so the synthetic key survived and downstream callers saw a stub object with a random-looking key.

chat_stream_workers.go's JSON tool-call detector then bumped lastEmittedCount past the stub even though no real tool call was emitted, gating off ALL subsequent content chunks. The qwen3 + tools + streaming case ended up dribbling only the first `{"` to clients and then nothing, even when the model went on to call the noAction `answer({"message": "…"})` pseudo-tool.

Three changes, each with its own regression test:

* removeHealingMarkerFromJSON now strips the marker suffix from keys too, dropping the entry when the truncated key is empty. Inputs like `{` no longer leak `{"<marker>":1}` to callers; partial keys like `{ "code` still preserve the model-typed prefix `code`.
* ParseJSONIterative skips empty-after-healing maps so a healed `{` doesn't surface as a stub result.
* The streaming JSON detector now breaks (not continues) on entries without a usable `name`, and only bumps lastEmittedCount past successfully-emitted entries. Defense-in-depth against any future partial-parse shape.

The parser tests cover eight partial-JSON-prefix shapes and verify no marker characters leak into keys, plus the two early shapes (`{`, `{"`) that should not surface a stub at all.

Fixes #9988

Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Extract the JSON tool-call streaming emit loop into emitJSONToolCallDeltas and unit-test it against every shape that can hit the streaming worker:

* the bug case: a healing-marker stub at index 0 must NOT bump lastEmittedCount, so subsequent content chunks keep flowing;
* the autoparser-correctly-working case: empty jsonResults (because the C++ autoparser cleared the raw text and delivers tool calls via TokenUsage.ChatDeltas) is a no-op, leaving the deferred end-of-stream emitter to ship the autoparser's tool calls;
* a single complete tool call: emit one chunk, advance to 1;
* arguments arriving as a JSON-string vs as a nested object: both serialize to the wire as JSON-string arguments;
* multiple parallel tool calls: one chunk each;
* a real tool call followed by a partial stub: emit the real one, stop at the stub, resume on a later chunk once the stub completes.

Locks down the no-regression guarantee the user asked for: this PR's fix is scoped to the pure-content fallback path; when the autoparser actually classifies tool calls (jinja-recognized chat format with tool support), the helper is a no-op and nothing changes.

Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
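
The contract this commit describes for the extracted helper can be sketched as below. This is only an approximation under stated assumptions: `emitToolCallDeltas`, the `toolCallDelta` struct, and the `(deltas, newCount)` return shape are hypothetical and are not the signature of the real emitJSONToolCallDeltas. An empty arguments string for a call that has no arguments yet is likewise an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toolCallDelta is a simplified, hypothetical wire shape for one tool-call chunk.
type toolCallDelta struct {
	Index     int
	Name      string
	Arguments string // always a JSON string on the wire
}

// emitToolCallDeltas walks results starting at lastEmitted, emits one delta per
// entry with a usable name, and stops at the first entry without one. The
// returned count only advances past entries that were actually emitted.
func emitToolCallDeltas(results []map[string]any, lastEmitted int) ([]toolCallDelta, int) {
	var out []toolCallDelta
	for i := lastEmitted; i < len(results); i++ {
		name, _ := results[i]["name"].(string)
		if name == "" {
			break // stub or partial entry: stop here, retry on a later chunk
		}
		args := "" // assumption: no arguments yet means an empty string
		switch a := results[i]["arguments"].(type) {
		case string:
			args = a // already a JSON string
		case map[string]any:
			if b, err := json.Marshal(a); err == nil {
				args = string(b) // nested object serialized to a JSON string
			}
		}
		out = append(out, toolCallDelta{Index: i, Name: name, Arguments: args})
		lastEmitted = i + 1
	}
	return out, lastEmitted
}

func main() {
	results := []map[string]any{
		{"name": "answer", "arguments": map[string]any{"message": "Hi"}},
		{}, // stub left behind by a partial parse
	}
	deltas, next := emitToolCallDeltas(results, 0)
	fmt.Println(len(deltas), next) // 1 1
}
```

Because the count stops at the stub, a later call with the completed second entry resumes at index 1 and emits only that entry.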

…iling reasoning chunk

When the C++ autoparser is in pure-content fallback mode (qwen3-4b) and the model emits a tool-call JSON in non-thinking mode, the streaming worker ended the SSE stream with a spurious data: {...,"delta":{"reasoning":"{\"name\":\"exec\",\"arguments\":...}"}} chunk carrying the same JSON that was already in delta.tool_calls.

The Go-side ReasoningExtractor is configured from DetectThinkingStartToken, which scans the model's jinja chat template verbatim and finds <think> inside an {% if enable_thinking %} block without evaluating the conditional. Every output chunk then runs through PrependThinkingTokenIfNeeded, which synthesizes a <think> in front and makes ExtractReasoning treat everything after as reasoning. The autoparser correctly classifies zero reasoning (qwen3's tool format isn't on llama.cpp's recognized-tool list, so all tokens land in ChatDelta.Content), but processStreamWithTools then preferred extractor.Reasoning() over functions.ReasoningFromChatDeltas at the end-of-stream flush, handing the polluted Go-side state to buildDeferredToolCallChunks, which emitted it as a trailing reasoning chunk.

Two changes:

* Add a sticky preferAutoparser flag to processStreamWithTools, mirroring the analogous flag in processStream from #9985. Once any ChatDelta carries content or reasoning, the flag stays on for the rest of the stream and the worker stops falling back to the Go-side extractor for per-token deltas. This avoids the per-chunk leak path and the cumulative pollution.
* Extract chooseDeferredReasoning, a small helper that selects the end-of-stream reasoning source. When preferAutoparser is set, return functions.ReasoningFromChatDeltas(chatDeltas); otherwise fall back to extractor.Reasoning() (the correct source for vLLM and other backends with no autoparser).

The helper has a focused test suite covering both sides of the contract: autoparser-active with empty reasoning (the qwen3 case, which is the fix's purpose), autoparser-active with real reasoning_content (jinja-with-recognized-format models), and autoparser-not-active with genuine Go-side reasoning (vLLM-style backends).

E2E with combined #9988 and this fix on qwen3-4b post-#9985 gallery shape: 18 content chunks of the tool-call JSON, 1 tool_call chunk with name='exec' and the right arguments, finish_reason=tool_calls, and zero reasoning chunks, down from one polluted reasoning chunk before this fix.

Depends on #9999 (the streaming JSON tool-call gating bug for qwen3) to make the trailing chunk observable end-to-end; the helper unit tests are independent.

Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
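
The two changes in this commit (the sticky flag and the end-of-stream source selection) could be sketched as follows. Everything here is a simplified assumption for illustration: the `chatDelta` struct, `stickyPreferAutoparser`, and the parameter lists are hypothetical, and `reasoningFromChatDeltas` merely imitates what functions.ReasoningFromChatDeltas is described as providing. The real chooseDeferredReasoning may take different arguments.

```go
package main

import (
	"fmt"
	"strings"
)

// chatDelta is a hypothetical, reduced view of one autoparser delta.
type chatDelta struct {
	Content   string
	Reasoning string
}

// reasoningFromChatDeltas concatenates the reasoning the autoparser classified.
func reasoningFromChatDeltas(deltas []chatDelta) string {
	var b strings.Builder
	for _, d := range deltas {
		b.WriteString(d.Reasoning)
	}
	return b.String()
}

// stickyPreferAutoparser turns the flag on once any delta carries content or
// reasoning, and never turns it off again for the rest of the stream.
func stickyPreferAutoparser(prefer bool, deltas []chatDelta) bool {
	if prefer {
		return true
	}
	for _, d := range deltas {
		if d.Content != "" || d.Reasoning != "" {
			return true
		}
	}
	return false
}

// chooseDeferredReasoning picks the end-of-stream reasoning source: the
// autoparser's view when it is active, otherwise the Go-side extractor's.
func chooseDeferredReasoning(preferAutoparser bool, deltas []chatDelta, extractorReasoning string) string {
	if preferAutoparser {
		return reasoningFromChatDeltas(deltas)
	}
	return extractorReasoning
}

func main() {
	// qwen3-style case: the autoparser put everything in Content, while the
	// Go-side extractor misread the same JSON as reasoning.
	deltas := []chatDelta{{Content: `{"name":"exec"}`}}
	prefer := stickyPreferAutoparser(false, deltas)
	got := chooseDeferredReasoning(prefer, deltas, `{"name":"exec"}`)
	fmt.Printf("%q\n", got) // ""
}
```

Keeping the selection in a pure function is what lets the three cases named in the commit message be tested without driving a full stream.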


mudler added 2 commits May 25, 2026 20:41

localai-bot mentioned this pull request May 25, 2026

fix(streaming/tools): don't leak prefill-misclassified content as trailing reasoning chunk #10000

Merged

5 tasks

mudler merged commit f17d99f into master May 25, 2026
57 checks passed

mudler deleted the fix/9988-streaming-tools-empty branch May 25, 2026 21:55

BrewTestBot mentioned this pull request May 27, 2026

localai 4.3.2 Homebrew/homebrew-core#285003

Merged
