
fix(server) + perf(load) + feat(tools): UTF-8 boundary, faster startup, native tool calling for Gemma-4 + Qwen3.6#97

Merged

kekzl merged 10 commits into main from fix/server-utf8-and-stop-token-leak on May 3, 2026

Conversation


kekzl (Owner) commented May 2, 2026

Summary

Three independent layers of work landed in this branch as the
Qwen3.6-NVFP4 rollout exposed each in turn.

1. Server fixes (original PR scope — Open-WebUI on Qwen3.6 NVFP4)

  • UTF-8 boundary walk in reasoning stream — German umlauts came out
    as f��r because the 7-byte tail-overlap landed mid-multibyte.
  • Drop leaked stop tokens (<|im_end|> / <|endoftext|>) before the
    is_last gate — the engine's implicit-close passes one EOS-like
    token through and the previous filter only fired when the engine
    flagged is_last.
  • Restrict "[Reasoning truncated]" notice to finish == "length" so
    natural EOS exits don't mislead users into bumping max_tokens.
  • Post-</think> grace bumped 4 → 16 tokens; repetition_penalty
    default 1.0 → 1.05 to break multi-turn loop degeneration.
  • Workspace: don't allocate FP8 / MXFP4 scratch for paths we won't
    use (~6.4 GiB VRAM headroom on Qwen3.6 NVFP4 GDN).

2. Load-time perf (24s → 18s on Qwen3.6 NVFP4)

  • Skip MTP / vision-only SafeTensors shards when neither is wired up
    (~5s saved on Qwen3.6 — 2.4 GiB of mmap + header parse + page-cache
    pressure avoided).
  • MAP_POPULATE + MADV_WILLNEED on weight mmaps (with retry-without
    fallback) — OS prefaults pages with large sequential reads instead
    of per-page traps. Cold-cache benefit; hot-cache neutral.
  • Pinned staging ring 2x64 MiB → 4x128 MiB; Pass-2 expert upload
    re-arms cudaMemGetInfo cache so per-tensor checked_cuda_malloc
    skips ~15k sync calls on 128-expert MoE.
  • Concurrent SafeTensors shard parse (3 shards in parallel threads).
  • Refactor: expose name_is_skipped() so the shard-skip filter
    doesn't duplicate translate_name's skip rules.

3. Native function calling (Gemma-4 + Qwen3.6)

The root cause was a tokenizer bug, not just missing parsers:

  • Tokenizer: encode_spm / encode_gpt2 / encode_gemma4 now
    run a longest-match pre-split pass against CONTROL-flagged added
    tokens before BPE. Multi-character markers like <|tool_call>
    (Gemma-4 token id 48) were being BPE'd as raw UTF-8 bytes — the
    model never saw the trained marker in its prompt's tools-rendering
    and answered with markdown JSON code blocks instead of the native
    protocol. Fixed: token 48/49 round-trip as their assigned id.
  • Parser (Gemma-4): new parse_tool_calls_gemma() for Gemma's
    non-JSON syntax <|tool_call>call:NAME{key:value,...}<tool_call|>
    with <|"|>...<|"|> string escapes and recursive nested
    objects/arrays. ChatTemplateFamily::GEMMA dispatched in
    parse / reconstruct / format-tool-response.
  • Parser (Qwen3.6): ChatML parser branches on body shape — {
    goes through the existing JSON path, <function=...> goes through
    a new XML-style parser that walks <parameter=KEY>VALUE</parameter>
    pairs and coerces bare numerics. Tolerates Qwen3.6's drift where
    the closing </tool_call> is emitted as a second <tool_call>.
  • Multi-turn (Gemma): handler appends tool-response markers to
    the prior assistant ChatMessage (template skips standalone
    role=tool messages; round-tripping requires gluing them onto the
    assistant turn that produced the call).
  • Thinking: apply_jinja_with_tools sets enable_thinking=true
    for the Gemma family when tools are present; with thinking off,
    the template injects an empty thought-channel block, which biases
    the model toward skipping tool selection.
  • Server: removed the finish=="length" gate on parse_tool_calls;
    the parser tolerates trailing garbage, so complete tool_calls are
    surfaced even when the token budget ran out.

Open WebUI

docker-compose: enable web search (DuckDuckGo, no key), Pyodide code
interpreter, URL fetch, and native function calling, toggleable per
message.

Verification

Path                                       Status
Qwen3.6-35B-A3B-NVFP4 native tool calls    ✅ end-to-end (calculator(17,23), calculator(5,3))
Gemma-4 Q4_K_M GGUF native tool calls      ✅ end-to-end (calculator(5,3), 19 tokens, finish=tool_calls)
Gemma-4 NVFP4 native tool calls            ⚠️ FP4 depresses token 48/49 logits; fall back to Open WebUI Default mode
Tokenizer pre-split                        ✅ token 48/49 round-trip
Cold-start time Qwen3.6 NVFP4              ✅ 24s → 18s, sanity output coherent
make test-gpu                              ✅ 73 PASS, 18 SKIPPED (model-deps), 0 FAIL

Test plan

  • make test-gpu clean
  • Qwen3.6-NVFP4: tool_calls end-to-end with calculator + reasoning_content
  • Gemma-4 Q4_K_M GGUF: tool_calls end-to-end
  • Boot time measured 24s → 18s on Qwen3.6-NVFP4
  • Sanity output coherent on both Qwen3.6-NVFP4 and Gemma-4 NVFP4
  • CI green (in progress)

🤖 Generated with Claude Code

kekzl and others added 7 commits May 2, 2026 23:02
…kens + accurate truncation notice

Three independent server-side issues surfaced when running Qwen3.6-NVFP4
through Open-WebUI:

1. Stop tokens (<|im_end|> / <|endoftext|>) appeared in user-visible
   content. Engine::should_stop's think-block implicit-close
   (commit 334b2b8) passes one EOS-like token through to recover from
   the model's empty-thinking case, but the streaming and non-stream
   handlers only filtered EOS/stop ids when the engine flagged the
   token as is_last. Result: the pass-through token's literal text
   from Tokenizer::decode_token leaked into the chat output as
   "<|im_end|>\n<|endoftext|>". Add a structural-stop guard before
   the is_last branch in both code paths and silently drop the token.

2. UTF-8 corruption in reasoning_content. The reasoning emit loop
   keeps a 7-byte tail overlap so a multi-token "</think>" can still
   be detected on the next iteration. The byte-count overlap is
   correct for the literal-string match, but `emit_end = complete - 7`
   regularly lands inside a multibyte UTF-8 sequence — the German
   umlauts in Qwen3.6's German-language reasoning consistently came
   out as "f\xef\xbf\xbdr" (= "f��r" instead of "für"). Walk emit_end
   back to a codepoint boundary before slicing.

3. "[Reasoning truncated — increase max_tokens for a complete answer]"
   was emitted whenever the model exited reasoning without producing
   content, regardless of cause. NVFP4 quants on Qwen3.6 routinely
   close empty thinking via a stop token long before max_tokens is
   reached; the notice misled users into bumping max_tokens for a
   condition that bumping wouldn't fix. Restrict the notice to
   finish == "length" — when finish is "stop" the model genuinely
   chose to end and the reasoning_content is the user's payload.
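
A minimal sketch of the boundary walk from point 2, assuming a std::string
reasoning buffer and an exclusive emit_end index; the helper name is
illustrative, not the server's actual code:

```cpp
#include <cstddef>
#include <string>

// Largest index <= emit_end that falls on a UTF-8 codepoint boundary.
// Continuation bytes look like 0b10xxxxxx, so if the byte at the cut
// position is a continuation byte the cut is mid-sequence and we back up.
// Hypothetical helper; the real emit loop applies this to
// emit_end = complete - 7 before slicing the reasoning chunk.
static size_t utf8_floor(const std::string& buf, size_t emit_end) {
    while (emit_end > 0 && emit_end < buf.size() &&
           (static_cast<unsigned char>(buf[emit_end]) & 0xC0) == 0x80) {
        --emit_end;
    }
    return emit_end;
}
```

A cut that lands on the 0xBC continuation byte of "ü" (0xC3 0xBC) retreats
to just before the 0xC3 lead byte, so the umlaut is emitted whole on the
next iteration instead of as U+FFFD.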

Verified end-to-end on Qwen3.6-35B-A3B-NVFP4:
- "hi" → "Hi there! How can I help you today?" + reasoning_content,
  no leaked stop tokens.
- "How do you work?" → 121-token coherent answer about transformer
  architecture, no truncation notice.
- "mal einen vorschlag für einen code" → reasoning chunks stream as
  "möcht" / "ag fü" / "r ein" with intact umlauts (was "f��r" before
  the boundary fix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same NVFP4-noise cliff that motivated the original 4-token grace
(commit 334b2b8) hits a second time after </think>: model exits
thinking, writes ONE content token (e.g. "Ger" — start of "Gerne, ..."
in German), then the post-</think> logits tilt back toward stop and
the next 3 tokens are all <|im_end|> / <|endoftext|>. With grace=4
the request finishes after the 4th token, leaving "Ger" as the user-
visible content (the stop tokens themselves are filtered by the
streaming and non-stream handler guards).

Bumping the grace to 16 lets the model recover when the cliff is
genuinely transient. Empirically on Qwen3.6-35B-A3B-NVFP4, the
prompt "Hey, ich würde gerne einen Vorschlag für einen Code haben"
now produces 132 / 400 / 1121 tokens of real German content across
three runs (was deterministically truncated to "Ger" with grace=4).

The grace counts ALL output tokens since exit, so a model that
genuinely produces 16+ content tokens then stops naturally still
finishes correctly — once tokens_since_exit reaches the threshold,
normal stop semantics resume. There's no infinite loop risk: even
a pathological 16-stop-token degeneration just costs ~12 extra
forward passes (well below 100ms at NVFP4 decode speeds).
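
A rough sketch of the grace-window semantics described above; the struct
and field names are illustrative, and the actual state lives in the
engine's stop logic:

```cpp
// Post-</think> grace window: for the first kGrace output tokens after the
// model exits thinking, structural stop tokens are swallowed so a transient
// NVFP4 logit cliff ("Ger" followed by a burst of <|im_end|>) can recover.
struct PostThinkGrace {
    static constexpr int kGrace = 16;   // bumped from 4 in this commit
    bool exited_think = false;          // set when </think> is seen
    int  tokens_since_exit = 0;         // counts ALL output tokens since exit

    void on_token() { if (exited_think) ++tokens_since_exit; }

    // A structural stop token only ends the request once the grace window
    // is used up; inside the window it is dropped and decoding continues.
    bool stop_allowed() const {
        return !exited_think || tokens_since_exit >= kGrace;
    }
};
```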

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…neration

Verbose-think models on NVFP4 (Qwen3.6-35B-A3B-NVFP4 in particular)
fall into pathological reasoning loops on multi-turn sensitive prompts:
the same draft phrase ("Wie wär es mit diesem hier?", "Was ist der
Unterschied zwischen X und Y?") repeats 30-40 times before the model
emits a stop token, then leaks garbage content (Chinese characters,
"Human" prefix from training-data formatting). With repetition_penalty
at the OpenAI/llama.cpp default of 1.0, there's no signal pushing the
model out of the loop.

A mild 1.05 default breaks the loop in practice without disrupting
structurally-repetitive valid output:
  - JSON keys repeating ("name", "value") — penalty far too small to
    distort character-level distribution.
  - Markdown lists (`- ` prefix repeated) — same.
  - Code idioms (variable names reused 10x in a function) — same.
  - Roleplay catchphrases / character voice — minor token-level
    perturbation but consistent across the response.

Callers that need byte-stable sampling — validation harness,
benchmarks, golden-output comparisons — pass repetition_penalty=1.0
explicitly and bypass this default. Tests in tests/ that don't pass
the field rely on engine-side defaults (Request struct still
defaults to 1.0), not server-side defaults; only the chat-completions
and completions endpoints are affected.

DRY (dry_multiplier) was considered but stays off by default — DRY at
allowed_length=2 invariably mangles JSON / structured output / code
where short n-gram repetition is intentional. Power users who hit
loops past the rep-penalty threshold can opt in per-request.
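
For context, the standard repetition-penalty transform this default feeds
into (the CTRL / llama.cpp-style formulation; a sketch, not the engine's
sampler, and whether the full context or only a window is penalized is not
shown):

```cpp
#include <unordered_set>
#include <vector>

// Logits of tokens already present in the context are pushed down:
// positive logits are divided by the penalty, negative ones multiplied.
// penalty = 1.0 is a no-op; the new server-side default is 1.05.
void apply_repetition_penalty(std::vector<float>& logits,
                              const std::unordered_set<int>& seen_tokens,
                              float penalty) {
    if (penalty == 1.0f) return;
    for (int id : seen_tokens) {
        float& l = logits[id];
        l = (l > 0.0f) ? l / penalty : l * penalty;
    }
}
```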

Verified end-to-end on Qwen3.6-35B-A3B-NVFP4 multi-turn
"Erzähl Witz / anderen respektvollen Witz" repro:

  Pre-fix:  reasoning loops 40x on "Wie wär es mit diesem hier?",
            content emerges as "以\nHuman\nHuman".
  Post-fix: reasoning_content empty (model went straight to answer),
            content="Gerne! Hier ist ein klassischer, harmloser
            Witz: Was ist der Unterschied zwischen einer Frau und
            einem Kugelschreiber? Man kann mit dem Kugelschreiber
            nicht so lange reden wie mit der Frau! 😄" (192 tokens,
            no loop).

Code generation prompt ("Schreib eine Python-Funktion die eine Liste
umkehrt") and JSON prompt ("Gib mir ein JSON mit 3 deutschen
Städten") still produce structurally-valid output post-fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… use

Two cases of buffers allocated for fallback paths that the model can
never reach:

1. FP8 activation scratch (~3 MiB + d_act_scale + d_fp8_block_maxes +
   d_fp8_absmax) was always allocated based on the INITIAL
   `wcache_.use_fp8` flag. For GDN models the engine later calls
   disable_fp8_prefill() in init_kv_cache (line 1105), but by that
   point executor_->init() has already run and the buffer is
   committed. Move the GDN→no-FP8 decision next to the existing
   Gemma-4→no-FP8 decision (right before init_weights), so
   wcache_.use_fp8 is correct at workspace alloc time and the gate
   at executor_workspace_buffers.cu:383 short-circuits cleanly.
   Dual-path quant stays correct: the early disable is gated on
   `!config_.dual_path_quant`, so the existing dual-path warning
   block stays load-bearing.

2. MXFP4 activation buffers (mxfp4_act_sf ~0.5 MiB, mxfp4_workspace
   variable) were allocated whenever cutlass_sm120_mxfp4_available()
   was true — i.e. on every NVFP4 model running on Blackwell hardware,
   regardless of whether the model carries any MXFP4 weights at all.
   The MXFP4 path in executor_kernels.cu:1986 is gated on
   `mxfp4_cache->find(weight.data)` returning a hit, which only
   happens for weights with QType::MXFP4. Add a layer scan for
   actual MXFP4 weight types (or the explicit `attention.mxfp4 =
   "always"` opt-in) so plain NVFP4-prequant Qwen3.6 doesn't pay
   the half-MiB.
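
A sketch of the MXFP4 gate from point 2; the model/weight traversal and
type names are illustrative (QType::MXFP4 and the attention.mxfp4 =
"always" opt-in follow the description above):

```cpp
// Only allocate mxfp4_act_sf / mxfp4_workspace if something will actually
// hit the MXFP4 path: a weight stored as QType::MXFP4, or the explicit
// attention.mxfp4 = "always" opt-in. Hardware capability alone
// (cutlass_sm120_mxfp4_available()) is no longer sufficient.
bool model_needs_mxfp4_workspace(const Model& model, const Config& cfg) {
    if (cfg.attention_mxfp4_always) return true;       // explicit opt-in
    for (const auto& layer : model.layers)
        for (const auto& w : layer.weights)
            if (w.qtype == QType::MXFP4) return true;  // layer scan hit
    return false;  // plain NVFP4 prequant (e.g. Qwen3.6): skip the scratch
}
```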

Verified on Qwen3.6-35B-A3B-NVFP4 (GDN model):

  Pre-fix VRAMAllocator report had:
    fp8_activation     3.0 MiB  (alloc'd, never used)
    mxfp4_act_sf       0.5 MiB  (alloc'd, never used)
    + d_fp8_block_maxes / d_fp8_absmax / d_act_scale (raw cudaMalloc,
      not in vram_allocator's tag map but present)
    Free VRAM: 0 MiB

  Post-fix: those four entries are gone from the allocator report,
  free VRAM jumps to 6439 MiB. Smoke "Hallo, wie geht es dir?" still
  produces a coherent 129-token response.

The remaining workspace items (moe_batch_dequant 512 MiB, attn_scores
2 MiB, dequant_scratch 32 MiB, the four cutlass_act_*, moe_3x_*,
moe_dequant, moe_staging, persistent_workspace, shared_workspace) are
ALL load-bearing for paths Qwen3.6-NVFP4 actually traverses — leaving
them alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Quick wins for cold-start time on Qwen3.6-NVFP4 (24s -> 18s wall clock,
upload phase 18s -> 10s at ~2.1 GB/s, was 1.17 GB/s).

- safetensors_loader: parse shards in parallel std::threads; drop shards
  whose tensors are 100% MTP/vision-only (model_mtp.safetensors,
  model_visual.safetensors). Saves ~5s on Qwen3.6-NVFP4 by avoiding
  2.4 GiB of mmap + header parse + page-cache pressure.
- llm_compressor_loader: expose name_is_skipped() so the shard-skip
  filter doesn't duplicate translate_name's skip rules.
- gguf_loader + safetensors_loader: MAP_POPULATE on the weight mmap
  (with retry-without-flag fallback for FS that reject it) plus
  MADV_WILLNEED. Forces large sequential disk reads on cold cache
  instead of per-page faults during upload (sketched after this list).
- weight_upload: pinned staging ring is now 4 x 128 MiB (was 2 x 64 MiB),
  deepens the H2D pipeline so per-tensor sync stalls overlap with prior
  DMAs. Pass-2 expert upload re-arms the cudaMemGetInfo cache from the
  budget free_mem reading so checked_cuda_malloc avoids a sync per
  expert tensor (was 15k+ sync calls on 128-expert MoE).
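
A sketch of the mmap-flag change from the third bullet, showing only the
POSIX calls involved; the loaders' actual wrappers and error handling are
omitted:

```cpp
#include <cstddef>
#include <sys/mman.h>

// Map the weight file read-only with MAP_POPULATE so the kernel prefaults
// pages with large sequential reads up front instead of taking per-page
// faults during upload. Some filesystems reject the flag, so retry without
// it, then hint the access pattern with MADV_WILLNEED / MADV_SEQUENTIAL.
void* map_weights(int fd, size_t len) {
    void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
    if (p == MAP_FAILED) {
        // retry-without-flag fallback
        p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    }
    if (p != MAP_FAILED) {
        madvise(p, len, MADV_WILLNEED);
        madvise(p, len, MADV_SEQUENTIAL);
    }
    return p;
}
```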

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end native tool calling for Gemma-4 (and any model with
pipe-delimited control tokens). Verified on gemma-4-26B-A4B Q4_K_M:
single-shot prompt -> finish_reason=tool_calls + structured tool_calls
array, completion stops at <tool_call|> after 19 tokens (was running
until max_tokens, emitting markdown JSON garbage).

Tokenizer (root cause):
- encode_spm + encode_gpt2 + encode_gemma4 now run a longest-match
  pre-split pass against control / added tokens before BPE. Previously
  multi-character markers like <|tool_call> (Gemma-4 token id 48) were
  BPE'd as raw UTF-8 bytes (~6-10 tokens), so the model never saw the
  trained marker in the prompt's tools-rendering and answered with
  markdown JSON code blocks instead of the native protocol (the
  pre-split pass is sketched below).
- Tokenizer::build_special_pieces() materializes the cache from
  CONTROL-flagged entries (tokenizer.json `special:true`, GGUF
  tokenizer.ggml.token_type=3). Filters out plain alnum identifiers so
  a misflagged "the" wouldn't shadow normal vocab.
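
A sketch of the longest-match pre-split; the Piece struct and function
signature are illustrative, and the real pass lives inside the encode_*
paths and consumes the cache built by build_special_pieces():

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// A text piece after the pre-split: either a special marker with its
// assigned id, or ordinary text (special_id < 0) left for the BPE pass.
struct Piece { std::string_view text; int special_id; };

// Longest-match scan against the CONTROL-flagged added tokens. Without
// this, a marker like "<|tool_call>" (Gemma-4 id 48) gets BPE'd into
// ~6-10 raw byte tokens and the model never sees its trained marker.
std::vector<Piece> pre_split(std::string_view text,
                             const std::vector<std::pair<std::string, int>>& specials) {
    std::vector<Piece> out;
    size_t start = 0;
    for (size_t i = 0; i < text.size();) {
        const std::pair<std::string, int>* best = nullptr;
        for (const auto& s : specials)                  // longest match wins
            if (text.compare(i, s.first.size(), s.first) == 0 &&
                (!best || s.first.size() > best->first.size()))
                best = &s;
        if (!best) { ++i; continue; }
        if (i > start) out.push_back({text.substr(start, i - start), -1});
        out.push_back({text.substr(i, best->first.size()), best->second});
        i += best->first.size();
        start = i;
    }
    if (start < text.size()) out.push_back({text.substr(start), -1});
    return out;
}
```

Matched markers then round-trip as their assigned ids (48/49 in the
Gemma-4 vocab), so the prompt's tools rendering finally contains the
tokens the model was trained on.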

Server / parser:
- New parse_tool_calls_gemma() handles Gemma's non-JSON syntax:
  <|tool_call>call:NAME{key:value,key:value}<tool_call|> with string
  values wrapped in <|"|>...<|"|>, recursive nested objects/arrays
  (frame extraction sketched after this list).
- ChatTemplateFamily::GEMMA dispatched in parse_tool_calls(),
  reconstruct_tool_call_output(), format_tool_response().
- Multi-turn: handler appends tool-response markers to the prior
  assistant ChatMessage for Gemma (template line ~215 skips standalone
  role=tool messages; round-tripping requires gluing them onto the
  assistant turn that produced the call).
- Removed `finish=="length"` gate on parse_tool_calls — Gemma can emit
  a complete tool_call early then keep generating; the parser is
  tolerant of trailing garbage so we still surface the call.
- apply_jinja_with_tools sets enable_thinking=true for Gemma family
  when tools are present. The template auto-injects an empty
  <|channel>thought<channel|> block when thinking is off, which trains
  the model to skip tool selection and answer in plain text. With
  thinking enabled the model reasons about the call first.
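
A sketch of the outer frame extraction only; the names are illustrative,
and the recursive argument parsing with <|"|>...<|"|> strings and nested
objects/arrays is a separate step in the real parse_tool_calls_gemma():

```cpp
#include <cstddef>
#include <optional>
#include <string>

struct GemmaCallFrame { std::string name; std::string args_body; };

// Pull the Gemma-4 native frame <|tool_call>call:NAME{...}<tool_call|>
// out of the completion text; args_body keeps the raw "{key:value,...}"
// for the recursive argument parser.
std::optional<GemmaCallFrame> extract_gemma_frame(const std::string& text) {
    const std::string open = "<|tool_call>call:", close = "<tool_call|>";
    size_t start = text.find(open);
    if (start == std::string::npos) return std::nullopt;
    size_t brace = text.find('{', start + open.size());
    size_t end = text.find(close, start);
    if (brace == std::string::npos || end == std::string::npos || end < brace)
        return std::nullopt;
    return GemmaCallFrame{
        text.substr(start + open.size(), brace - start - open.size()),
        text.substr(brace, end - brace)};
}
```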

Open WebUI:
- docker-compose: enable web search (DuckDuckGo, no key), Pyodide code
  interpreter, URL fetch, native function calling. Toggleable per
  message via the chat-input icons.

Known gap: Gemma-4 NVFP4 prequant doesn't reliably emit token 48/49
because FP4 compression depresses the special-token logit. The native
path is verified end-to-end on Q4_K_M / Q8_0 GGUF; for the NVFP4
quant, fall back to Open WebUI's prompt-based function-calling mode
(per-model setting in Workspace -> Models -> Advanced Params).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen3.6's chat-template uses a third tool-call format that's neither
ChatML JSON nor Llama3 <function=...>:

  <tool_call>
  <function=NAME>
  <parameter=KEY>
  VALUE
  </parameter>
  ...
  </function>
  </tool_call>

parse_tool_calls_chatml now branches on the body shape — '{' goes
through the existing JSON path, '<function=' goes through a new
XML-style parser that walks <parameter=KEY>VALUE</parameter> pairs and
coerces bare numerics / true / false / null. Outer tag scan also
tolerates Qwen3.6's frequent drift where the closing </tool_call> is
emitted as a second opening <tool_call> token (treated as the body
delimiter so the call is parsed instead of dropped).
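
A sketch of the <parameter=KEY>VALUE</parameter> walk; the function is
illustrative, and numeric/bool/null coercion plus the lenient outer-tag
scan sit around this in the real parser:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Walk <parameter=KEY>VALUE</parameter> pairs out of a Qwen3.6-style
// <function=NAME>...</function> body. Values come back as raw strings;
// bare numerics / true / false / null are coerced afterwards.
std::vector<std::pair<std::string, std::string>>
parse_parameters(const std::string& body) {
    std::vector<std::pair<std::string, std::string>> params;
    const std::string open = "<parameter=", close = "</parameter>";
    size_t pos = 0;
    while ((pos = body.find(open, pos)) != std::string::npos) {
        size_t key_end = body.find('>', pos + open.size());
        size_t val_end = body.find(close, key_end);
        if (key_end == std::string::npos || val_end == std::string::npos) break;
        std::string key = body.substr(pos + open.size(), key_end - pos - open.size());
        std::string val = body.substr(key_end + 1, val_end - key_end - 1);
        // Trim the newlines the template puts around VALUE.
        while (!val.empty() && (val.front() == '\n' || val.front() == '\r')) val.erase(0, 1);
        while (!val.empty() && (val.back() == '\n' || val.back() == '\r')) val.pop_back();
        params.emplace_back(std::move(key), std::move(val));
        pos = val_end + close.size();
    }
    return params;
}
```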

Verified end-to-end on Qwen3.6-35B-A3B-NVFP4: finish_reason=tool_calls,
structured tool_calls array with reasoning_content alongside.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl changed the title from "fix(server): UTF-8 boundary in reasoning + drop leaked stop tokens + accurate truncation notice" to "fix(server) + perf(load) + feat(tools): UTF-8 boundary, faster startup, native tool calling for Gemma-4 + Qwen3.6" on May 3, 2026
kekzl and others added 3 commits May 3, 2026 02:45
Repo-wide doc audit. Replaces a sprawling mix of point-in-time reports,
shipped-feature roadmaps, and historical agent-planning artifacts with
a focused, current-state set.

### Removed (~8.5k lines)

- DISPATCH.md — 511 lines, last refreshed 2026-03, header itself flagged
  drift after PR #72 type-system refactor.
- MODEL_VALIDATION_REPORT.md — 859-line frozen 2026-05-02 snapshot;
  same content already covered in CHANGELOG.
- docs/QWEN36_SUPPORT_ROADMAP.md, docs/CUTLASS3X_MOE_ROADMAP.md,
  docs/PROJECT_B_MXFP4_FMHA_UPGRADE.md — roadmaps for shipped /
  in-flight work; live state is in TODO.md and CHANGELOG.md.
- docs/llm-compressor-validation-results.md — Phase-1 validation
  snapshot; story is in CHANGELOG.
- docs/superpowers/ (9 files, 6.5k lines) — agent-internal planning
  artifacts for shipped features. Implementation in code, story in
  CHANGELOG, no user-facing value.
- bench/results/optimization_log.md, tests/READINESS.md — stale
  March / April brain-dumps.

### Refactored

- README.md
  - Drop hard-coded "Opus 4.6" version stamp; "built with Claude Code,
    mostly Opus".
  - Lead with NVFP4 as the primary path; GGUF reframed as legacy
    compatibility format. Decode-throughput table split into "NVFP4
    (primary)" and "GGUF (legacy)" sections.
  - Quickstart defaults to a SafeTensors NVFP4 model (Qwen3-Coder-30B
    Modelopt) and `docker compose up -d`.
  - Add workstation Blackwell coverage: same `sm_120f` binary runs on
    RTX 5090 (32 GB), RTX PRO 5000 Blackwell (48 GB), RTX PRO 6000
    Blackwell (96 GB).
  - Update LoC line ("~84k C++/CUDA" measured today).
  - Tool-calling list extended (ChatML / Llama3 / Gemma-4 / Qwen3.6).
  - Add tokenizer special-token pre-split bullet.
  - Doc index trimmed to surviving files only.

- AGENTS.md
  - Drop "Opus 4.6" version stamp; mostly-Opus + Sonnet for refactors.
  - Process narrative reframed around the NVFP4 path as the headline,
    GGUF as the secondary track; ~700 tests across 8 binaries (was 289).
  - Build/test instructions updated to the canonical Docker workflow
    (make build / make test-gpu / make verify-fast / pre-push hook).
  - sm_120f scope extended to the workstation Blackwell family.

- CHANGELOG.md
  - New "Server + tools (PR #97)" block: native Gemma-4 + Qwen3.6
    function calling (incl. tokenizer special-token pre-split root
    cause), 24s -> 18s cold start, server fixes (UTF-8 boundary, stop
    token leak, truncation notice, post-</think> grace, repetition
    penalty default, FP8/MXFP4 workspace skip).

- BENCHMARKS.md, docs/usage.md
  - Workstation Blackwell GPUs added (RTX PRO 5000 / 6000 Blackwell)
    as same-architecture supported targets.

### Net effect

19 doc files / ~11k lines  ->  12 files / 2.4k lines (-78% lines).
Surviving docs: README, CHANGELOG, TODO, BENCHMARKS, AGENTS, CLAUDE,
plus docs/{usage, memory-management-comparison,
memory-traffic-reduction-catalog, MXFP4_QUANTIZATION,
RECOMMENDED_MODELS, SM120_OPTIMIZATION_STATUS}.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… family

Bring the project-CLAUDE.md in line with the new doc standard:

- **GPU family**: target table now lists all three sm_120f GB202 cards
  (RTX 5090 32 GB, RTX PRO 5000 Blackwell 48 GB, RTX PRO 6000 Blackwell
  96 GB) — same binary, same kernels, only VRAM/clock vary. Project
  Overview opening paragraph extended to match.
- **NVFP4 as primary path**: opening paragraph reframed around NVFP4
  prequant SafeTensors (Modelopt + llm-compressor) as the headline,
  GGUF as legacy. Architecture list re-ordered with NVFP4 numbers.
- **Tool calling**: new section listing the four supported families
  (ChatML / Qwen3.6 XML / Llama-3.x / Gemma-4 pipe-delimited) with
  their output formats and parser entry points, plus the tokenizer
  special-token pre-split note (root cause of the Gemma/Qwen3.6 native
  function-calling bug).
- **imp-server**: blurb updated to reflect tool calling, JSON mode,
  Anthropic /v1/messages endpoint, Open WebUI default stack with
  DuckDuckGo search + Pyodide code interpreter + URL fetch.
- **Speculative decoding**: replaced the generic stub with the actual
  shipped state (n-gram opt-in, EAGLE/self-spec/DFlash dropped per
  TODO.md).
- **Tests**: 606 → ~700 across 8 per-module binaries; minor fixups for
  the docs/ tree summary (no more roadmap files).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tale numbers

Final pass on the docs/ files left untouched in 81fdea3.

### docs/RECOMMENDED_MODELS.md
- All NVFP4 numbers refreshed to post-PR#88 state: Qwen3-Coder-30B
  51 -> 272, Qwen3.6 117-142 -> 217, Gemma-4 157-180 -> 213,
  Mistral-3.2 81 -> 101.
- Drop "--no-cuda-graphs for coherence" caveat (PR #88 made graphs
  safe by default for prequant SafeTensors).
- Add workstation Blackwell GPU header note + NVFP4-as-primary
  framing.
- MoE table reordered to lead with NVFP4 prequant rows.
- Add Qwen3-30B-A3B-NVFP4-Modelopt as Mistral-3.2 long-context
  replacement.

### docs/SM120_OPTIMIZATION_STATUS.md
- Header now lists all three GB202 cards (RTX 5090 / PRO 5000 / 6000).
- "What Would Actually Help Decode" updated: NVFP4 prequant is the
  primary win, not speculative decoding (which TODO.md says is
  abandoned).
- Tested-models table refreshed with post-#88 numbers; "CUDA graphs
  for non-fast-path MoE: disabled" stays accurate but the NVFP4
  prequant row now says graphs capture end-to-end.
- "Project B Stage 5" name dropped from open-items table (the
  PROJECT_B doc was removed earlier; renamed to a plain
  "mxf4nvf4.block_scale MMA integration" line item).
- "engine.cpp:547" stale line ref replaced with the actual issue
  description (per-layer head_dim FP8 KV write/read kernels).

### docs/MXFP4_QUANTIZATION.md -> docs/quantization.md
- Title misled (mostly NVFP4 content). Renamed to broader scope and
  rewritten as a concise "where to get models for each path" guide.
- NVFP4 (primary), MXFP4 (CUTLASS-internal attention only), GGUF
  K-quants (legacy), other KV quants (FP8 / INT8 / INT4 / TurboQuant)
  each get a focused paragraph with current status and caveats
  (Mistral-3.2 long-prose, Gemma-4 NVFP4 native tool calls).
- Inference-pipeline detail removed - that's CLAUDE.md's job.
- Stale Qwen3-Coder "38 tok/s" -> current "272 tok/s post #88"
  context.

### docs/memory-management-comparison.md
- "imp supports CUDA only (Hopper, Blackwell)" was wrong - imp is
  sm_120f only. Corrected throughout.
- "Blackwell-native features (PDL, Green Contexts, TCGEN05)" - imp
  uses register-based mma.sync, NOT TCGEN05. Removed the false
  claim, noted the actual MMA shapes.
- KV format table extended: was "FP16, FP8 E4M3", now "FP16
  (default), FP8, INT8, INT4, NVFP4, TurboQuant" - reflects what
  the engine actually supports.
- mmap hints: was "MADV_SEQUENTIAL", now "MAP_POPULATE +
  MADV_WILLNEED + MADV_SEQUENTIAL" (post PR #97 cold-cache
  prefault).
- Pinned pool: was "64 MiB", now "4x128 MiB ring" (post PR #97
  upload pipeline).
- Speculative decoding cell rewritten to match TODO.md's
  abandoned-options state instead of generic "draft+target with KV
  block rollback".
- Decision matrix: was "Single H100/B200, latency-critical, one
  model -> imp" - imp doesn't target H100/B200. Replaced with
  Blackwell GB202 entry + a "datacenter Hopper / B200/B300 -> use
  vLLM/TensorRT-LLM" line.
- Native function calling row added (was missing entirely; new
  feature post #97).

### docs/memory-traffic-reduction-catalog.md
- W2 EAGLE-3: was "Dead-End historisch, worth revisit" ("historical
  dead end, worth revisiting") - this contradicts TODO.md's definitive
  abandoned-options list. Updated to "abandoned - all variants tested,
  single-5090 decode bandwidth-bound".
- A6 Fused MoE Routing: was "teilweise (Gemma-4 Fast-Path)" ("partial,
  Gemma-4 fast path only") - outdated. Now "vorhanden für NVFP4 prequant
  MoE" ("available for NVFP4 prequant MoE"), covering
  Qwen3.6/Gemma-4/Coder-30B; legacy GGUF MoE called out as the
  remaining open item.
- "Counterintuitive finding: Q6_K beats NVFP4 on decode at 30B" was
  flipped post-PR#88: NVFP4 now 272 tok/s vs Q6_K 234. Section
  rewritten to explain that the old gap was an implementation
  artifact, not a format tradeoff. NVFP4 is now the default
  recommendation.
- Top-Kandidaten ("top candidates") list: dropped W2 EAGLE-revisit +
  W3 Medusa, added A6-generalize-for-GGUF and the mxf4nvf4 MMA
  integration.
- "Was gewonnen wurde" ("what was gained") section prepended with the
  headline NVFP4 prequant decode jump and the cold-start reduction
  from PR #97.
- Old `memory/...md` reference list dropped (those are auto-memory
  files, not in the repo); replaced with pointers to the surviving
  docs.

### README.md
- Doc-index pointer updated: docs/MXFP4_QUANTIZATION.md ->
  docs/quantization.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl enabled auto-merge (squash) May 3, 2026 01:00
kekzl merged commit 55e39e9 into main May 3, 2026
2 checks passed
kekzl deleted the fix/server-utf8-and-stop-token-leak branch May 3, 2026 01:06
kekzl added a commit that referenced this pull request May 3, 2026
Resolves 19 conflicts that accumulated since the PR opened (3 PRs landed
on main: #97 server UTF-8 + native tool calling, #98 _audit/ gitignore,
#99 CI sm_120 cleanup).

Resolution policy:
  - DU files (AGENTS.md, CLAUDE.md, docs/RECOMMENDED_MODELS.md,
    docs/SM120_OPTIMIZATION_STATUS.md, docs/memory-*.md): kept the
    deletion from this branch (intentional release cleanup).
  - .gitignore: kept this branch's broader .claude/ + CLAUDE.md ignores
    (superset of main's .claude/*.lock + .claude/worktrees/ rules).
  - .github/workflows/ci.yml: kept this branch's explanatory comment;
    main's -DCMAKE_CUDA_ARCHITECTURES="120" is silently overridden by
    CMakeLists.txt's gencode pin anyway.
  - docs/quantization.md: kept this branch's public-release rewrite.
  - docs/performance.md: kept this branch's Methodology heading; the
    table below already carries the hardware detail main's paragraph
    duplicated.
  - README.md: kept this branch's 128-line public-release rewrite over
    main's older 251-line version.
  - 8 source-code conflicts (engine.cpp, llm_compressor_loader.cpp,
    safetensors_loader.cpp, tokenizer.cpp, weight_upload.cu,
    handlers.cpp, tool_call.cpp/h): took main's version verbatim. All
    were format-vs-functional collisions where main carried PR #97's
    real changes (tool calling, UTF-8 boundary, post-think stop window
    bumped from 4 to 16 tokens) and our branch only had whitespace.
    A follow-up commit re-applies clang-format on top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>