fix(server) + perf(load) + feat(tools): UTF-8 boundary, faster startup, native tool calling for Gemma-4 + Qwen3.6 (#97)
Merged
…kens + accurate truncation notice

Three independent server-side issues surfaced when running Qwen3.6-NVFP4
through Open-WebUI:

1. Stop tokens (<|im_end|> / <|endoftext|>) appeared in user-visible
   content. Engine::should_stop's think-block implicit-close (commit
   334b2b8) passes one EOS-like token through to recover from the
   model's empty-thinking case, but the streaming and non-stream
   handlers only filtered EOS/stop ids when the engine flagged the
   token as is_last. Result: the pass-through token's literal text from
   Tokenizer::decode_token leaked into the chat output as
   "<|im_end|>\n<|endoftext|>". Add a structural-stop guard before the
   is_last branch in both code paths and silently drop the token.

2. UTF-8 corruption in reasoning_content. The reasoning emit loop keeps
   a 7-byte tail overlap so a multi-token "</think>" can still be
   detected on the next iteration. The byte-count overlap is correct
   for the literal-string match, but `emit_end = complete - 7`
   regularly lands inside a multibyte UTF-8 sequence — the German
   umlauts in Qwen3.6's German-language reasoning consistently came out
   as "f\xef\xbf\xbdr" ("f��r" instead of "für"). Walk emit_end back to
   a codepoint boundary before slicing.

3. "[Reasoning truncated — increase max_tokens for a complete answer]"
   was emitted whenever the model exited reasoning without producing
   content, regardless of cause. NVFP4 quants on Qwen3.6 routinely
   close empty thinking via a stop token long before max_tokens is
   reached; the notice misled users into bumping max_tokens for a
   condition that bumping wouldn't fix. Restrict the notice to
   finish == "length" — when finish is "stop" the model genuinely chose
   to end and the reasoning_content is the user's payload.

Verified end-to-end on Qwen3.6-35B-A3B-NVFP4:
- "hi" → "Hi there! How can I help you today?" + reasoning_content, no
  leaked stop tokens.
- "How do you work?" → 121-token coherent answer about transformer
  architecture, no truncation notice.
- "mal einen vorschlag für einen code" → reasoning chunks stream as "möcht" / "ag fü" / "r ein" with intact umlauts (was "f��r" before the boundary fix). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same NVFP4-noise cliff that motivated the original 4-token grace
(commit 334b2b8) hits a second time after </think>: the model exits
thinking, writes ONE content token (e.g. "Ger" — the start of
"Gerne, ..." in German), then the post-</think> logits tilt back toward
stop and the next 3 tokens are all <|im_end|> / <|endoftext|>. With
grace=4 the request finishes after the 4th token, leaving "Ger" as the
user-visible content (the stop tokens themselves are filtered by the
streaming and non-stream handler guards).

Bumping the grace to 16 lets the model recover when the cliff is
genuinely transient. Empirically on Qwen3.6-35B-A3B-NVFP4, the prompt
"Hey, ich würde gerne einen Vorschlag für einen Code haben" ("Hey, I'd
like a code suggestion") now produces 132 / 400 / 1121 tokens of real
German content across three runs (was deterministically truncated to
"Ger" with grace=4).

The grace counts ALL output tokens since exit, so a model that
genuinely produces 16+ content tokens and then stops naturally still
finishes correctly — once tokens_since_exit reaches the threshold,
normal stop semantics resume. There's no infinite-loop risk: even a
pathological 16-stop-token degeneration just costs ~12 extra forward
passes (well below 100 ms at NVFP4 decode speeds).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
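The grace-window semantics described above reduce to something like this (struct and method names are hypothetical illustrations, not the engine's actual API): count every output token after think-exit, and only honor stop tokens once the count reaches the threshold.

```cpp
#include <cassert>

// Hypothetical sketch of the post-</think> grace window: swallow stop
// tokens for the first kGrace output tokens after think-exit so a
// transient stop cliff can't truncate the answer to a single word.
struct PostThinkGrace {
    static constexpr int kGrace = 16;
    int tokens_since_exit = 0;

    // Called once per generated token after </think>. Returns true
    // when a stop token should actually end the request.
    bool honor_stop(bool is_stop_token) {
        ++tokens_since_exit;
        if (!is_stop_token) return false;
        // Within the grace window the stop is dropped and decoding
        // continues; at the threshold, normal stop semantics resume.
        return tokens_since_exit >= kGrace;
    }
};
```

A degenerate run of nothing but stop tokens still terminates: the counter reaches the threshold after kGrace calls, which matches the "~12 extra forward passes" worst case described above.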
…neration
Verbose-think models on NVFP4 (Qwen3.6-35B-A3B-NVFP4 in particular)
fall into pathological reasoning loops on multi-turn sensitive prompts:
the same draft phrase ("Wie wär es mit diesem hier?" — "how about this
one?" — or "Was ist der Unterschied zwischen X und Y?" — "what is the
difference between X and Y?") repeats 30-40 times before the model
emits a stop token, then leaks garbage content (Chinese characters, a
"Human" prefix from training-data formatting). With repetition_penalty
at the OpenAI/llama.cpp default of 1.0, there's no signal pushing the
model out of the loop.
A mild 1.05 default breaks the loop in practice without disrupting
structurally-repetitive valid output:
- JSON keys repeating ("name", "value") — penalty far too small to
distort character-level distribution.
- Markdown lists (`- ` prefix repeated) — same.
- Code idioms (variable names reused 10x in a function) — same.
- Roleplay catchphrases / character voice — minor token-level
perturbation but consistent across the response.
Callers that need byte-stable sampling — validation harness,
benchmarks, golden-output comparisons — pass repetition_penalty=1.0
explicitly and bypass this default. Tests in tests/ that don't pass
the field rely on engine-side defaults (Request struct still
defaults to 1.0), not server-side defaults; only the chat-completions
and completions endpoints are affected.
DRY (dry_multiplier) was considered but stays off by default — DRY at
allowed_length=2 invariably mangles JSON / structured output / code
where short n-gram repetition is intentional. Power users who hit
loops past the rep-penalty threshold can opt in per-request.
Verified end-to-end on the Qwen3.6-35B-A3B-NVFP4 multi-turn
"Erzähl Witz / anderen respektvollen Witz" ("tell a joke / another
respectful joke") repro:
Pre-fix: reasoning loops 40x on "Wie wär es mit diesem hier?",
content emerges as "以\nHuman\nHuman".
Post-fix: reasoning_content empty (model went straight to answer),
content="Gerne! Hier ist ein klassischer, harmloser
Witz: Was ist der Unterschied zwischen einer Frau und
einem Kugelschreiber? Man kann mit dem Kugelschreiber
nicht so lange reden wie mit der Frau! 😄" (192 tokens,
no loop).
The code-generation prompt ("Schreib eine Python-Funktion die eine
Liste umkehrt" — "write a Python function that reverses a list") and
the JSON prompt ("Gib mir ein JSON mit 3 deutschen Städten" — "give me
a JSON with 3 German cities") still produce structurally valid output
post-fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
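For reference, the classic repetition penalty that the 1.05 default feeds into (the CTRL-style formulation llama.cpp uses) works roughly like this — a sketch with hypothetical names; the change above only moves the default value, not the formula:

```cpp
#include <cassert>
#include <cmath>
#include <unordered_set>
#include <vector>

// Sketch of the standard repetition penalty: logits of tokens already
// present in the context are pushed down before sampling — positive
// logits divided by the penalty, negative logits multiplied by it.
void apply_repetition_penalty(std::vector<float>& logits,
                              const std::unordered_set<int>& seen_ids,
                              float penalty) {  // 1.0 = off, 1.05 = mild
    for (int id : seen_ids) {
        float& l = logits[static_cast<size_t>(id)];
        l = (l > 0.0f) ? l / penalty : l * penalty;
    }
}
```

At 1.05 the shift is a few percent of a logit per repeated token — enough to tip a degenerate loop, far too small to distort JSON keys or reused variable names, which is the tradeoff the bullet list above argues.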
… use
Two cases of buffers allocated for fallback paths that the model can
never reach:
1. FP8 activation scratch (~3 MiB + d_act_scale + d_fp8_block_maxes +
d_fp8_absmax) was always allocated based on the INITIAL
`wcache_.use_fp8` flag. For GDN models the engine later calls
disable_fp8_prefill() in init_kv_cache (line 1105), but by that
point executor_->init() has already run and the buffer is
committed. Move the GDN→no-FP8 decision next to the existing
Gemma-4→no-FP8 decision (right before init_weights), so
wcache_.use_fp8 is correct at workspace alloc time and the gate
at executor_workspace_buffers.cu:383 short-circuits cleanly.
The early disable also stays dual-path-quant correct: it is gated on
`!config_.dual_path_quant`, so the existing dual-path warning block
stays load-bearing.
2. MXFP4 activation buffers (mxfp4_act_sf ~0.5 MiB, mxfp4_workspace
variable) were allocated whenever cutlass_sm120_mxfp4_available()
was true — i.e. on every NVFP4 model running on Blackwell hardware,
regardless of whether the model carries any MXFP4 weights at all.
The MXFP4 path in executor_kernels.cu:1986 is gated on
`mxfp4_cache->find(weight.data)` returning a hit, which only
happens for weights with QType::MXFP4. Add a layer scan for
actual MXFP4 weight types (or the explicit `attention.mxfp4 =
"always"` opt-in) so plain NVFP4-prequant Qwen3.6 doesn't pay
the half-MiB.
Verified on Qwen3.6-35B-A3B-NVFP4 (GDN model):
Pre-fix VRAMAllocator report had:
fp8_activation 3.0 MiB (alloc'd, never used)
mxfp4_act_sf 0.5 MiB (alloc'd, never used)
+ d_fp8_block_maxes / d_fp8_absmax / d_act_scale (raw cudaMalloc,
not in vram_allocator's tag map but present)
Free VRAM: 0 MiB
Post-fix: those four entries are gone from the allocator report,
free VRAM jumps to 6439 MiB. Smoke "Hallo, wie geht es dir?" still
produces a coherent 129-token response.
The remaining workspace items (moe_batch_dequant 512 MiB, attn_scores
2 MiB, dequant_scratch 32 MiB, the four cutlass_act_*, moe_3x_*,
moe_dequant, moe_staging, persistent_workspace, shared_workspace) are
ALL load-bearing for paths Qwen3.6-NVFP4 actually traverses — leaving
them alone.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
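The layer-scan gate in item 2 amounts to something like the following (the enum and function names are illustrative stand-ins, not the engine's actual symbols): allocate the MXFP4 activation scratch only when some weight actually carries the MXFP4 qtype, or the explicit opt-in is set.

```cpp
#include <cassert>
#include <vector>

// Illustrative stand-ins for the engine's quant-type enum and the
// `attention.mxfp4 = "always"` opt-in flag.
enum class QType { FP16, NVFP4, MXFP4, Q4_K };

// Sketch: scan the loaded weight types once at workspace-alloc time;
// plain NVFP4-prequant models (no MXFP4 weights anywhere) skip the
// mxfp4_act_sf / mxfp4_workspace buffers entirely.
bool needs_mxfp4_workspace(const std::vector<QType>& weight_types,
                           bool mxfp4_always_opt_in) {
    if (mxfp4_always_opt_in) return true;
    for (QType q : weight_types)
        if (q == QType::MXFP4) return true;
    return false;
}
```

The scan is O(number of weights) once at load, versus the previous hardware-only check (cutlass_sm120_mxfp4_available()), which was true on every Blackwell card regardless of the model's actual weight formats.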
Quick wins for cold-start time on Qwen3.6-NVFP4 (24s -> 18s wall clock;
upload phase 18s -> 10s at ~2.1 GB/s, was 1.17 GB/s).
- safetensors_loader: parse shards in parallel std::threads; drop
  shards whose tensors are 100% MTP/vision-only (model_mtp.safetensors,
  model_visual.safetensors). Saves ~5s on Qwen3.6-NVFP4 by avoiding
  2.4 GiB of mmap + header parse + page-cache pressure.
- llm_compressor_loader: expose name_is_skipped() so the shard-skip
  filter doesn't duplicate translate_name's skip rules.
- gguf_loader + safetensors_loader: MAP_POPULATE on the weight mmap
  (with a retry-without-flag fallback for filesystems that reject it)
  plus MADV_WILLNEED. Forces large sequential disk reads on cold cache
  instead of per-page faults during upload.
- weight_upload: the pinned staging ring is now 4 x 128 MiB (was
  2 x 64 MiB), deepening the H2D pipeline so per-tensor sync stalls
  overlap with prior DMAs. Pass-2 expert upload re-arms the
  cudaMemGetInfo cache from the budget free_mem reading so
  checked_cuda_malloc avoids a sync per expert tensor (was 15k+ sync
  calls on 128-expert MoE).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end native tool calling for Gemma-4 (and any model with
pipe-delimited control tokens). Verified on gemma-4-26B-A4B Q4_K_M:
single-shot prompt -> finish_reason=tool_calls + structured tool_calls
array, completion stops at <tool_call|> after 19 tokens (was running
until max_tokens, emitting markdown JSON garbage).
Tokenizer (root cause):
- encode_spm + encode_gpt2 + encode_gemma4 now run a longest-match
pre-split pass against control / added tokens before BPE. Previously
multi-character markers like <|tool_call> (Gemma-4 token id 48) were
BPE'd as raw UTF-8 bytes (~6-10 tokens), so the model never saw the
trained marker in the prompt's tools-rendering and answered with
markdown JSON code blocks instead of the native protocol.
- Tokenizer::build_special_pieces() materializes the cache from
CONTROL-flagged entries (tokenizer.json `special:true`, GGUF
tokenizer.ggml.token_type=3). Filters out plain alnum identifiers so
a misflagged "the" wouldn't shadow normal vocab.
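The longest-match pre-split pass described above can be sketched as follows (names are hypothetical; the real tokenizer feeds the non-special pieces into SPM/GPT-2/Gemma-4 BPE rather than returning them):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hedged sketch: before BPE, scan the input for control/added-token
// markers, longest match first, and emit each as a single piece that
// already carries its token id. Text between markers is left for the
// normal BPE pass (special_id == -1).
struct Piece {
    std::string text;
    int special_id;  // -1 => run BPE on this piece
};

std::vector<Piece> pre_split_special(
        const std::string& input,
        const std::map<std::string, int>& specials) {
    std::vector<Piece> out;
    size_t i = 0, run_start = 0;
    while (i < input.size()) {
        size_t best_len = 0;
        int best_id = -1;
        for (const auto& [tok, id] : specials) {  // longest match wins
            if (tok.size() > best_len &&
                input.compare(i, tok.size(), tok) == 0) {
                best_len = tok.size();
                best_id = id;
            }
        }
        if (best_id >= 0) {
            if (i > run_start)
                out.push_back({input.substr(run_start, i - run_start), -1});
            out.push_back({input.substr(i, best_len), best_id});
            i += best_len;
            run_start = i;
        } else {
            ++i;
        }
    }
    if (run_start < input.size())
        out.push_back({input.substr(run_start), -1});
    return out;
}
```

Without this pass a marker like "<|tool_call>" falls through to byte-level BPE and fragments into several tokens; with it, the marker surfaces as its single trained id, which is the root cause fix described above.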
Server / parser:
- New parse_tool_calls_gemma() handles Gemma's non-JSON syntax:
<|tool_call>call:NAME{key:value,key:value}<tool_call|> with string
values wrapped in <|"|>...<|"|>, recursive nested objects/arrays.
- ChatTemplateFamily::GEMMA dispatched in parse_tool_calls(),
reconstruct_tool_call_output(), format_tool_response().
- Multi-turn: handler appends tool-response markers to the prior
assistant ChatMessage for Gemma (template line ~215 skips standalone
role=tool messages; round-tripping requires gluing them onto the
assistant turn that produced the call).
- Removed `finish=="length"` gate on parse_tool_calls — Gemma can emit
a complete tool_call early then keep generating; the parser is
tolerant of trailing garbage so we still surface the call.
- apply_jinja_with_tools sets enable_thinking=true for Gemma family
when tools are present. The template auto-injects an empty
<|channel>thought<channel|> block when thinking is off, which trains
the model to skip tool selection and answer in plain text. With
thinking enabled the model reasons about the call first.
Open WebUI:
- docker-compose: enable web search (DuckDuckGo, no key), Pyodide code
interpreter, URL fetch, native function calling. Toggleable per
message via the chat-input icons.
Known gap: Gemma-4 NVFP4 prequant doesn't reliably emit token 48/49
because FP4 compression depresses the special-token logit. The native
path is verified end-to-end on Q4_K_M / Q8_0 GGUF; for the NVFP4
quant, fall back to Open WebUI's prompt-based function-calling mode
(per-model setting in Workspace -> Models -> Advanced Params).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen3.6's chat-template uses a third tool-call format that's neither
ChatML JSON nor Llama3 <function=...>:
<tool_call>
<function=NAME>
<parameter=KEY>
VALUE
</parameter>
...
</function>
</tool_call>
parse_tool_calls_chatml now branches on the body shape — '{' goes
through the existing JSON path, '<function=' goes through a new
XML-style parser that walks <parameter=KEY>VALUE</parameter> pairs and
coerces bare numerics / true / false / null. Outer tag scan also
tolerates Qwen3.6's frequent drift where the closing </tool_call> is
emitted as a second opening <tool_call> token (treated as the body
delimiter so the call is parsed instead of dropped).
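The parameter walk at the heart of the XML-style branch can be sketched as follows (the helper name is hypothetical, and the numeric/bool/null coercion mentioned above is omitted here — the sketch returns raw strings):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hedged sketch: walk <parameter=KEY>VALUE</parameter> pairs out of a
// tool-call body. The template puts VALUE on its own line, so the
// framing newlines are stripped while interior whitespace is kept.
std::vector<std::pair<std::string, std::string>>
parse_parameters(const std::string& body) {
    std::vector<std::pair<std::string, std::string>> out;
    const std::string open = "<parameter=", close = "</parameter>";
    size_t pos = 0;
    while ((pos = body.find(open, pos)) != std::string::npos) {
        size_t key_end = body.find('>', pos + open.size());
        if (key_end == std::string::npos) break;
        std::string key =
            body.substr(pos + open.size(), key_end - pos - open.size());
        size_t val_end = body.find(close, key_end + 1);
        if (val_end == std::string::npos) break;
        std::string val = body.substr(key_end + 1, val_end - key_end - 1);
        while (!val.empty() && val.front() == '\n') val.erase(0, 1);
        while (!val.empty() && val.back() == '\n') val.pop_back();
        out.emplace_back(key, val);
        pos = val_end + close.size();
    }
    return out;
}
```

A real parser would additionally coerce bare numerics / true / false / null to JSON types and tolerate a missing closing tag at end-of-generation, per the drift handling described above.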
Verified end-to-end on Qwen3.6-35B-A3B-NVFP4: finish_reason=tool_calls,
structured tool_calls array with reasoning_content alongside.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Repo-wide doc audit. Replaces a sprawling mix of point-in-time reports,
shipped-feature roadmaps, and historical agent-planning artifacts with
a focused, current-state set.

### Removed (~8.5k lines)
- DISPATCH.md — 511 lines, last refreshed 2026-03; the header itself
  flagged drift after the PR #72 type-system refactor.
- MODEL_VALIDATION_REPORT.md — 859-line frozen 2026-05-02 snapshot;
  same content already covered in CHANGELOG.
- docs/QWEN36_SUPPORT_ROADMAP.md, docs/CUTLASS3X_MOE_ROADMAP.md,
  docs/PROJECT_B_MXFP4_FMHA_UPGRADE.md — roadmaps for shipped /
  in-flight work; live state is in TODO.md and CHANGELOG.md.
- docs/llm-compressor-validation-results.md — Phase-1 validation
  snapshot; the story is in CHANGELOG.
- docs/superpowers/ (9 files, 6.5k lines) — agent-internal planning
  artifacts for shipped features. Implementation is in code, story in
  CHANGELOG, no user-facing value.
- bench/results/optimization_log.md, tests/READINESS.md — stale March /
  April brain-dumps.

### Refactored
- README.md
  - Drop the hard-coded "Opus 4.6" version stamp; now "built with
    Claude Code, mostly Opus".
  - Lead with NVFP4 as the primary path; GGUF reframed as a legacy
    compatibility format. Decode-throughput table split into "NVFP4
    (primary)" and "GGUF (legacy)" sections.
  - Quickstart defaults to a SafeTensors NVFP4 model (Qwen3-Coder-30B
    Modelopt) and `docker compose up -d`.
  - Add workstation Blackwell coverage: the same `sm_120f` binary runs
    on RTX 5090 (32 GB), RTX PRO 5000 Blackwell (48 GB), and RTX PRO
    6000 Blackwell (96 GB).
  - Update the LoC line ("~84k C++/CUDA" measured today).
  - Tool-calling list extended (ChatML / Llama3 / Gemma-4 / Qwen3.6).
  - Add a tokenizer special-token pre-split bullet.
  - Doc index trimmed to surviving files only.
- AGENTS.md
  - Drop the "Opus 4.6" version stamp; mostly Opus, plus Sonnet for
    refactors.
  - Process narrative reframed around the NVFP4 path as the headline,
    GGUF as the secondary track; ~700 tests across 8 binaries
    (was 289).
  - Build/test instructions updated to the canonical Docker workflow
    (make build / make test-gpu / make verify-fast / pre-push hook).
  - sm_120f scope extended to the workstation Blackwell family.
- CHANGELOG.md
  - New "Server + tools (PR #97)" block: native Gemma-4 + Qwen3.6
    function calling (incl. the tokenizer special-token pre-split root
    cause), 24s -> 18s cold start, server fixes (UTF-8 boundary,
    stop-token leak, truncation notice, post-</think> grace, repetition
    penalty default, FP8/MXFP4 workspace skip).
- BENCHMARKS.md, docs/usage.md
  - Workstation Blackwell GPUs added (RTX PRO 5000 / 6000 Blackwell) as
    same-architecture supported targets.

### Net effect
19 doc files / ~11k lines -> 12 files / 2.4k lines (-78% lines).
Surviving docs: README, CHANGELOG, TODO, BENCHMARKS, AGENTS, CLAUDE,
plus docs/{usage, memory-management-comparison,
memory-traffic-reduction-catalog, MXFP4_QUANTIZATION,
RECOMMENDED_MODELS, SM120_OPTIMIZATION_STATUS}.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… family

Bring the project CLAUDE.md in line with the new doc standard:
- **GPU family**: the target table now lists all three sm_120f GB202
  cards (RTX 5090 32 GB, RTX PRO 5000 Blackwell 48 GB, RTX PRO 6000
  Blackwell 96 GB) — same binary, same kernels, only VRAM/clocks vary.
  Project Overview opening paragraph extended to match.
- **NVFP4 as primary path**: opening paragraph reframed around NVFP4
  prequant SafeTensors (Modelopt + llm-compressor) as the headline,
  GGUF as legacy. Architecture list reordered with NVFP4 numbers.
- **Tool calling**: new section listing the four supported families
  (ChatML / Qwen3.6 XML / Llama-3.x / Gemma-4 pipe-delimited) with
  their output formats and parser entry points, plus the tokenizer
  special-token pre-split note (root cause of the Gemma/Qwen3.6 native
  function-calling bug).
- **imp-server**: blurb updated to reflect tool calling, JSON mode, the
  Anthropic /v1/messages endpoint, and the Open WebUI default stack
  with DuckDuckGo search + Pyodide code interpreter + URL fetch.
- **Speculative decoding**: replaced the generic stub with the actual
  shipped state (n-gram opt-in; EAGLE/self-spec/DFlash dropped per
  TODO.md).
- **Tests**: 606 → ~700 across 8 per-module binaries; minor fixups for
  the docs/ tree summary (no more roadmap files).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tale numbers

Final pass on the docs/ files left untouched in 81fdea3.

### docs/RECOMMENDED_MODELS.md
- All NVFP4 numbers refreshed to post-PR#88 state: Qwen3-Coder-30B
  51 -> 272, Qwen3.6 117-142 -> 217, Gemma-4 157-180 -> 213,
  Mistral-3.2 81 -> 101.
- Drop the "--no-cuda-graphs for coherence" caveat (PR #88 made graphs
  safe by default for prequant SafeTensors).
- Add a workstation Blackwell GPU header note + NVFP4-as-primary
  framing.
- MoE table reordered to lead with NVFP4 prequant rows.
- Add Qwen3-30B-A3B-NVFP4-Modelopt as the Mistral-3.2 long-context
  replacement.

### docs/SM120_OPTIMIZATION_STATUS.md
- Header now lists all three GB202 cards (RTX 5090 / PRO 5000 / 6000).
- "What Would Actually Help Decode" updated: NVFP4 prequant is the
  primary win, not speculative decoding (which TODO.md says is
  abandoned).
- Tested-models table refreshed with post-#88 numbers; "CUDA graphs for
  non-fast-path MoE: disabled" stays accurate, but the NVFP4 prequant
  row now says graphs capture end-to-end.
- "Project B Stage 5" name dropped from the open-items table (the
  PROJECT_B doc was removed earlier; renamed to a plain
  "mxf4nvf4.block_scale MMA integration" line item).
- The stale "engine.cpp:547" line ref replaced with the actual issue
  description (per-layer head_dim FP8 KV write/read kernels).

### docs/MXFP4_QUANTIZATION.md -> docs/quantization.md
- The title misled (mostly NVFP4 content). Renamed to broader scope and
  rewritten as a concise "where to get models for each path" guide.
- NVFP4 (primary), MXFP4 (CUTLASS-internal attention only), GGUF
  K-quants (legacy), and the other KV quants (FP8 / INT8 / INT4 /
  TurboQuant) each get a focused paragraph with current status and
  caveats (Mistral-3.2 long-prose, Gemma-4 NVFP4 native tool calls).
- Inference-pipeline detail removed — that's CLAUDE.md's job.
- Stale Qwen3-Coder "38 tok/s" -> current "272 tok/s post #88" context.

### docs/memory-management-comparison.md
- "imp supports CUDA only (Hopper, Blackwell)" was wrong — imp is
  sm_120f only. Corrected throughout.
- "Blackwell-native features (PDL, Green Contexts, TCGEN05)" — imp uses
  register-based mma.sync, NOT TCGEN05. Removed the false claim, noted
  the actual MMA shapes.
- KV format table extended: was "FP16, FP8 E4M3", now "FP16 (default),
  FP8, INT8, INT4, NVFP4, TurboQuant" — reflects what the engine
  actually supports.
- mmap hints: was "MADV_SEQUENTIAL", now "MAP_POPULATE + MADV_WILLNEED
  + MADV_SEQUENTIAL" (post-PR #97 cold-cache prefault).
- Pinned pool: was "64 MiB", now "4x128 MiB ring" (post-PR #97 upload
  pipeline).
- Speculative-decoding cell rewritten to match TODO.md's
  abandoned-options state instead of the generic "draft+target with KV
  block rollback".
- Decision matrix: was "Single H100/B200, latency-critical, one model
  -> imp" — imp doesn't target H100/B200. Replaced with a Blackwell
  GB202 entry + a "datacenter Hopper / B200/B300 -> use
  vLLM/TensorRT-LLM" line.
- Native function-calling row added (was missing entirely; new feature
  post #97).

### docs/memory-traffic-reduction-catalog.md
- W2 EAGLE-3: was "Dead-End historisch, worth revisit" ("historically a
  dead end, worth a revisit") — this contradicts TODO.md's definitive
  abandoned-options list. Updated to "abandoned — all variants tested,
  single-5090 decode bandwidth-bound".
- A6 Fused MoE Routing: was "teilweise (Gemma-4 Fast-Path)" ("partial,
  Gemma-4 fast path only") — outdated. Now "vorhanden für NVFP4
  prequant MoE" ("present for NVFP4 prequant MoE"), covering
  Qwen3.6/Gemma-4/Coder-30B; legacy GGUF MoE called out as the
  remaining open item.
- "Counterintuitive finding: Q6_K beats NVFP4 on decode at 30B" was
  flipped post-PR#88: NVFP4 now 272 tok/s vs Q6_K 234. Section
  rewritten to explain that the old gap was an implementation artifact,
  not a format tradeoff. NVFP4 is now the default recommendation.
- Top-Kandidaten (top-candidates) list: dropped the W2 EAGLE revisit +
  W3 Medusa, added A6-generalize-for-GGUF and the mxf4nvf4 MMA
  integration.
- "Was gewonnen wurde" ("what was gained") prepended with the headline
  NVFP4 prequant decode jump and the cold-start reduction from PR #97.
- Old `memory/...md` reference list dropped (those are auto-memory
  files, not in the repo); replaced with pointers to the surviving
  docs.

### README.md
- Doc-index pointer updated: docs/MXFP4_QUANTIZATION.md ->
  docs/quantization.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl added a commit that referenced this pull request on May 3, 2026:
Resolves 19 conflicts that accumulated since the PR opened (3 PRs
landed on main: #97 server UTF-8 + native tool calling, #98 _audit/
gitignore, #99 CI sm_120 cleanup). Resolution policy:
- DU files (AGENTS.md, CLAUDE.md, docs/RECOMMENDED_MODELS.md,
  docs/SM120_OPTIMIZATION_STATUS.md, docs/memory-*.md): kept the
  deletion from this branch (intentional release cleanup).
- .gitignore: kept this branch's broader .claude/ + CLAUDE.md ignores
  (a superset of main's .claude/*.lock + .claude/worktrees/ rules).
- .github/workflows/ci.yml: kept this branch's explanatory comment;
  main's -DCMAKE_CUDA_ARCHITECTURES="120" is silently overridden by
  CMakeLists.txt's gencode pin anyway.
- docs/quantization.md: kept this branch's public-release rewrite.
- docs/performance.md: kept this branch's Methodology heading; the
  table below it already carries the hardware detail main's paragraph
  duplicated.
- README.md: kept this branch's 128-line public-release rewrite over
  main's older 251-line version.
- 8 source-code conflicts (engine.cpp, llm_compressor_loader.cpp,
  safetensors_loader.cpp, tokenizer.cpp, weight_upload.cu,
  handlers.cpp, tool_call.cpp/h): took main's version verbatim. All
  were format-vs-functional collisions where main carried PR #97's
  real changes (tool calling, UTF-8 boundary, post-think stop window
  bumped from 4 to 16 tokens) and our branch only had whitespace
  differences. A follow-up commit re-applies clang-format on top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Three independent layers of work landed in this branch as the
Qwen3.6-NVFP4 rollout exposed each in turn.
1. Server fixes (original PR scope — Open-WebUI on Qwen3.6 NVFP4)
- UTF-8 boundary fix in reasoning_content: "für" streamed as "f��r"
  because the 7-byte tail-overlap landed mid-multibyte.
- Structural-stop guard: filter stop tokens (<|im_end|> /
  <|endoftext|>) before the is_last gate — the engine's implicit-close
  passes one EOS-like token through, and the previous filter only fired
  when the engine flagged is_last.
- Truncation notice restricted to finish == "length" so natural EOS
  exits don't mislead users into bumping max_tokens.
- Post-</think> grace bumped 4 → 16 tokens; repetition_penalty default
  1.0 → 1.05 to break multi-turn loop degeneration.
- Skip FP8/MXFP4 workspace buffers the model can never use (~6.4 GiB
  VRAM headroom on Qwen3.6 NVFP4 GDN).
2. Load-time perf (24s → 18s on Qwen3.6 NVFP4)
- safetensors_loader parses shards in parallel and drops MTP/vision-only
  shards (~5s saved on Qwen3.6 — 2.4 GiB of mmap + header parse +
  page-cache pressure avoided).
- MAP_POPULATE + MADV_WILLNEED on the weight mmap (retry-without-flag
  fallback) — the OS prefaults pages with large sequential reads
  instead of per-page traps. Cold-cache benefit; hot-cache neutral.
- Pinned staging ring 2 x 64 MiB → 4 x 128 MiB; pass-2 expert upload
  re-arms the cudaMemGetInfo cache so per-tensor checked_cuda_malloc
  skips ~15k sync calls on 128-expert MoE.
- llm_compressor_loader exposes name_is_skipped() so the shard-skip
  filter doesn't duplicate translate_name's skip rules.
3. Native function calling (Gemma-4 + Qwen3.6)
The root cause was a tokenizer bug, not just missing parsers:
- encode_spm / encode_gpt2 / encode_gemma4 now run a longest-match
  pre-split pass against CONTROL-flagged added tokens before BPE.
  Multi-character markers like <|tool_call> (Gemma-4 token id 48) were
  being BPE'd as raw UTF-8 bytes — the model never saw the trained
  marker in its prompt's tools-rendering and answered with markdown
  JSON code blocks instead of the native protocol. Fixed: tokens 48/49
  round-trip as their assigned ids.
- New parse_tool_calls_gemma() for Gemma's non-JSON syntax
  <|tool_call>call:NAME{key:value,...}<tool_call|> with
  <|"|>...<|"|> string escapes and recursive nested objects/arrays.
- ChatTemplateFamily::GEMMA dispatched in parse / reconstruct /
  format-tool-response.
- Qwen3.6 body-shape branch: '{' goes through the existing JSON path,
  <function=...> goes through a new XML-style parser that walks
  <parameter=KEY>VALUE</parameter> pairs and coerces bare numerics.
  Tolerates Qwen3.6's drift where the closing </tool_call> is emitted
  as a second <tool_call>.
- Multi-turn: tool-response markers are appended to the prior assistant
  ChatMessage (the template skips standalone role=tool messages;
  round-tripping requires gluing them onto the assistant turn that
  produced the call).
- apply_jinja_with_tools sets enable_thinking=true for the Gemma family
  when tools are present. Empty thought-channel injection trains the
  model to skip tool selection.
- Removed the finish=="length" gate on parse_tool_calls — the parser is
  tolerant of trailing garbage and surfaces complete tool_calls even if
  the budget ran out.
Open WebUI
docker-compose: enable web search (DuckDuckGo, no key), Pyodide code
interpreter, URL fetch, native function calling toggleable per
message.
Verification
- make test-gpu

Test plan
- make test-gpu (clean)

🤖 Generated with Claude Code