
fix(server) + perf(load) + feat(tools): UTF-8 boundary, faster startup, native tool calling for Gemma-4 + Qwen3.6#97

Merged

kekzl merged 10 commits into main from fix/server-utf8-and-stop-token-leak on May 3, 2026

Conversation


kekzl (Owner) commented May 2, 2026

Summary

Three independent layers of work landed in this branch as the
Qwen3.6-NVFP4 rollout exposed each in turn.

1. Server fixes (original PR scope — Open-WebUI on Qwen3.6 NVFP4)

  • UTF-8 boundary walk in reasoning stream — German umlauts came out
    as f��r because the 7-byte tail-overlap landed mid-multibyte.
  • Drop leaked stop tokens (<|im_end|> / <|endoftext|>) before the
    is_last gate — the engine's implicit-close passes one EOS-like
    token through and the previous filter only fired when the engine
    flagged is_last.
  • Restrict "[Reasoning truncated]" notice to finish == "length" so
    natural EOS exits don't mislead users into bumping max_tokens.
  • Post-</think> grace bumped 4 → 16 tokens; repetition_penalty
    default 1.0 → 1.05 to break multi-turn loop degeneration.
  • Workspace: don't allocate FP8 / MXFP4 scratch for paths we won't
    use (~6.4 GiB VRAM headroom on Qwen3.6 NVFP4 GDN).

2. Load-time perf (24s → 18s on Qwen3.6 NVFP4)

  • Skip MTP / vision-only SafeTensors shards when neither is wired up
    (~5s saved on Qwen3.6 — 2.4 GiB of mmap + header parse + page-cache
    pressure avoided).
  • MAP_POPULATE + MADV_WILLNEED on weight mmaps (with retry-without
    fallback) — OS prefaults pages with large sequential reads instead
    of per-page traps. Cold-cache benefit; hot-cache neutral.
  • Pinned staging ring 2x64 MiB → 4x128 MiB; Pass-2 expert upload
    re-arms cudaMemGetInfo cache so per-tensor checked_cuda_malloc
    skips ~15k sync calls on 128-expert MoE.
  • Concurrent SafeTensors shard parse (3 shards in parallel threads).
  • Refactor: expose name_is_skipped() so the shard-skip filter
    doesn't duplicate translate_name's skip rules.

3. Native function calling (Gemma-4 + Qwen3.6)

The root cause was a tokenizer bug, not just missing parsers:

  • Tokenizer: encode_spm / encode_gpt2 / encode_gemma4 now
    run a longest-match pre-split pass against CONTROL-flagged added
    tokens before BPE. Multi-character markers like <|tool_call>
    (Gemma-4 token id 48) were being BPE'd as raw UTF-8 bytes — the
    model never saw the trained marker in its prompt's tools-rendering
    and answered with markdown JSON code blocks instead of the native
    protocol. Fixed: token 48/49 round-trip as their assigned id.
  • Parser (Gemma-4): new parse_tool_calls_gemma() for Gemma's
    non-JSON syntax <|tool_call>call:NAME{key:value,...}<tool_call|>
    with <|"|>...<|"|> string escapes and recursive nested
    objects/arrays. ChatTemplateFamily::GEMMA dispatched in
    parse / reconstruct / format-tool-response.
  • Parser (Qwen3.6): ChatML parser branches on body shape — {
    goes through the existing JSON path, <function=...> goes through
    a new XML-style parser that walks <parameter=KEY>VALUE</parameter>
    pairs and coerces bare numerics. Tolerates Qwen3.6's drift where
    the closing </tool_call> is emitted as a second <tool_call>.
  • Multi-turn (Gemma): handler appends tool-response markers to
    the prior assistant ChatMessage (template skips standalone
    role=tool messages; round-tripping requires gluing them onto the
    assistant turn that produced the call).
  • Thinking: apply_jinja_with_tools sets enable_thinking=true
    for the Gemma family when tools are present; with thinking off,
    the template injects an empty thought-channel block, which biases
    the model toward skipping tool selection.
  • Server: removed the finish=="length" gate on parse_tool_calls;
    the parser tolerates trailing garbage, so complete tool_calls are
    surfaced even when the token budget ran out.

Open WebUI

docker-compose: enable web search (DuckDuckGo, no key), Pyodide code
interpreter, URL fetch, and native function calling, toggleable per
message.

Verification

Path                                       Status
Qwen3.6-35B-A3B-NVFP4 native tool calls    ✅ end-to-end (calculator(17,23), calculator(5,3))
Gemma-4 Q4_K_M GGUF native tool calls      ✅ end-to-end (calculator(5,3), 19 tokens, finish=tool_calls)
Gemma-4 NVFP4 native tool calls            ⚠️ FP4 depresses token 48/49 logits; fall back to Open WebUI Default mode
Tokenizer pre-split                        ✅ token 48/49 round-trip
Cold-start time Qwen3.6 NVFP4              ✅ 24s → 18s, sanity output coherent
make test-gpu                              ✅ 73 PASS, 18 SKIPPED (model-deps), 0 FAIL

Test plan

  • make test-gpu clean
  • Qwen3.6-NVFP4: tool_calls end-to-end with calculator + reasoning_content
  • Gemma-4 Q4_K_M GGUF: tool_calls end-to-end
  • Boot time measured 24s → 18s on Qwen3.6-NVFP4
  • Sanity output coherent on both Qwen3.6-NVFP4 and Gemma-4 NVFP4
  • CI green (in progress)

🤖 Generated with Claude Code

kekzl and others added 7 commits May 2, 2026 23:02
…kens + accurate truncation notice

Three independent server-side issues surfaced when running Qwen3.6-NVFP4
through Open-WebUI:

1. Stop tokens (<|im_end|> / <|endoftext|>) appeared in user-visible
   content. Engine::should_stop's think-block implicit-close
   (commit 334b2b8) passes one EOS-like token through to recover from
   the model's empty-thinking case, but the streaming and non-stream
   handlers only filtered EOS/stop ids when the engine flagged the
   token as is_last. Result: the pass-through token's literal text
   from Tokenizer::decode_token leaked into the chat output as
   "<|im_end|>\n<|endoftext|>". Add a structural-stop guard before
   the is_last branch in both code paths and silently drop the token.

2. UTF-8 corruption in reasoning_content. The reasoning emit loop
   keeps a 7-byte tail overlap so a multi-token "</think>" can still
   be detected on the next iteration. The byte-count overlap is
   correct for the literal-string match, but `emit_end = complete - 7`
   regularly lands inside a multibyte UTF-8 sequence — the German
   umlauts in Qwen3.6's German-language reasoning consistently came
   out as "f\xef\xbf\xbdr" (= "f��r" instead of "für"). Walk emit_end
   back to a codepoint boundary before slicing.

3. "[Reasoning truncated — increase max_tokens for a complete answer]"
   was emitted whenever the model exited reasoning without producing
   content, regardless of cause. NVFP4 quants on Qwen3.6 routinely
   close empty thinking via a stop token long before max_tokens is
   reached; the notice misled users into bumping max_tokens for a
   condition that bumping wouldn't fix. Restrict the notice to
   finish == "length" — when finish is "stop" the model genuinely
   chose to end and the reasoning_content is the user's payload.
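
A minimal sketch of the boundary walk from point 2, assuming a std::string
reasoning buffer and an exclusive emit_end index; the helper name is
illustrative, not the server's actual code:

```cpp
#include <cstddef>
#include <string>

// Largest index <= emit_end that falls on a UTF-8 codepoint boundary.
// Continuation bytes look like 0b10xxxxxx, so if the byte at the cut
// position is a continuation byte the cut is mid-sequence and we back up.
// Hypothetical helper; the real emit loop applies this to
// emit_end = complete - 7 before slicing the reasoning chunk.
static size_t utf8_floor(const std::string& buf, size_t emit_end) {
    while (emit_end > 0 && emit_end < buf.size() &&
           (static_cast<unsigned char>(buf[emit_end]) & 0xC0) == 0x80) {
        --emit_end;
    }
    return emit_end;
}
```

A cut that lands on the 0xBC continuation byte of "ü" (0xC3 0xBC) retreats
to just before the 0xC3 lead byte, so the umlaut is emitted whole on the
next iteration instead of as U+FFFD.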

Verified end-to-end on Qwen3.6-35B-A3B-NVFP4:
- "hi" → "Hi there! How can I help you today?" + reasoning_content,
  no leaked stop tokens.
- "How do you work?" → 121-token coherent answer about transformer
  architecture, no truncation notice.
- "mal einen vorschlag für einen code" → reasoning chunks stream as
  "möcht" / "ag fü" / "r ein" with intact umlauts (was "f��r" before
  the boundary fix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same NVFP4-noise cliff that motivated the original 4-token grace
(commit 334b2b8) hits a second time after </think>: model exits
thinking, writes ONE content token (e.g. "Ger" — start of "Gerne, ..."
in German), then the post-</think> logits tilt back toward stop and
the next 3 tokens are all <|im_end|> / <|endoftext|>. With grace=4
the request finishes after the 4th token, leaving "Ger" as the user-
visible content (the stop tokens themselves are filtered by the
streaming and non-stream handler guards).

Bumping the grace to 16 lets the model recover when the cliff is
genuinely transient. Empirically on Qwen3.6-35B-A3B-NVFP4, the
prompt "Hey, ich würde gerne einen Vorschlag für einen Code haben"
now produces 132 / 400 / 1121 tokens of real German content across
three runs (was deterministically truncated to "Ger" with grace=4).

The grace counts ALL output tokens since exit, so a model that
genuinely produces 16+ content tokens then stops naturally still
finishes correctly — once tokens_since_exit reaches the threshold,
normal stop semantics resume. There's no infinite loop risk: even
a pathological 16-stop-token degeneration just costs ~12 extra
forward passes (well below 100ms at NVFP4 decode speeds).
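
A rough sketch of the grace-window semantics described above; the struct
and field names are illustrative, and the actual state lives in the
engine's stop logic:

```cpp
// Post-</think> grace window: for the first kGrace output tokens after the
// model exits thinking, structural stop tokens are swallowed so a transient
// NVFP4 logit cliff ("Ger" followed by a burst of <|im_end|>) can recover.
struct PostThinkGrace {
    static constexpr int kGrace = 16;   // bumped from 4 in this commit
    bool exited_think = false;          // set when </think> is seen
    int  tokens_since_exit = 0;         // counts ALL output tokens since exit

    void on_token() { if (exited_think) ++tokens_since_exit; }

    // A structural stop token only ends the request once the grace window
    // is used up; inside the window it is dropped and decoding continues.
    bool stop_allowed() const {
        return !exited_think || tokens_since_exit >= kGrace;
    }
};
```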

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…neration

Verbose-think models on NVFP4 (Qwen3.6-35B-A3B-NVFP4 in particular)
fall into pathological reasoning loops on multi-turn sensitive prompts:
the same draft phrase ("Wie wär es mit diesem hier?", "Was ist der
Unterschied zwischen X und Y?") repeats 30-40 times before the model
emits a stop token, then leaks garbage content (Chinese characters,
"Human" prefix from training-data formatting). With repetition_penalty
at the OpenAI/llama.cpp default of 1.0, there's no signal pushing the
model out of the loop.

A mild 1.05 default breaks the loop in practice without disrupting
structurally-repetitive valid output:
  - JSON keys repeating ("name", "value") — penalty far too small to
    distort character-level distribution.
  - Markdown lists (`- ` prefix repeated) — same.
  - Code idioms (variable names reused 10x in a function) — same.
  - Roleplay catchphrases / character voice — minor token-level
    perturbation but consistent across the response.

Callers that need byte-stable sampling — validation harness,
benchmarks, golden-output comparisons — pass repetition_penalty=1.0
explicitly and bypass this default. Tests in tests/ that don't pass
the field rely on engine-side defaults (Request struct still
defaults to 1.0), not server-side defaults; only the chat-completions
and completions endpoints are affected.

DRY (dry_multiplier) was considered but stays off by default — DRY at
allowed_length=2 invariably mangles JSON / structured output / code
where short n-gram repetition is intentional. Power users who hit
loops past the rep-penalty threshold can opt in per-request.
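
For context, the standard repetition-penalty transform this default feeds
into (the CTRL / llama.cpp-style formulation; a sketch, not the engine's
sampler, and whether the full context or only a window is penalized is not
shown):

```cpp
#include <unordered_set>
#include <vector>

// Logits of tokens already present in the context are pushed down:
// positive logits are divided by the penalty, negative ones multiplied.
// penalty = 1.0 is a no-op; the new server-side default is 1.05.
void apply_repetition_penalty(std::vector<float>& logits,
                              const std::unordered_set<int>& seen_tokens,
                              float penalty) {
    if (penalty == 1.0f) return;
    for (int id : seen_tokens) {
        float& l = logits[id];
        l = (l > 0.0f) ? l / penalty : l * penalty;
    }
}
```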

Verified end-to-end on Qwen3.6-35B-A3B-NVFP4 multi-turn
"Erzähl Witz / anderen respektvollen Witz" repro:

  Pre-fix:  reasoning loops 40x on "Wie wär es mit diesem hier?",
            content emerges as "以\nHuman\nHuman".
  Post-fix: reasoning_content empty (model went straight to answer),
            content="Gerne! Hier ist ein klassischer, harmloser
            Witz: Was ist der Unterschied zwischen einer Frau und
            einem Kugelschreiber? Man kann mit dem Kugelschreiber
            nicht so lange reden wie mit der Frau! 😄" (192 tokens,
            no loop).

Code generation prompt ("Schreib eine Python-Funktion die eine Liste
umkehrt") and JSON prompt ("Gib mir ein JSON mit 3 deutschen
Städten") still produce structurally-valid output post-fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… use

Two cases of buffers allocated for fallback paths that the model can
never reach:

1. FP8 activation scratch (~3 MiB + d_act_scale + d_fp8_block_maxes +
   d_fp8_absmax) was always allocated based on the INITIAL
   `wcache_.use_fp8` flag. For GDN models the engine later calls
   disable_fp8_prefill() in init_kv_cache (line 1105), but by that
   point executor_->init() has already run and the buffer is
   committed. Move the GDN→no-FP8 decision next to the existing
   Gemma-4→no-FP8 decision (right before init_weights), so
   wcache_.use_fp8 is correct at workspace alloc time and the gate
   at executor_workspace_buffers.cu:383 short-circuits cleanly.
   Dual-path quant stays correct: the early disable is gated on
   `!config_.dual_path_quant`, so the existing dual-path warning
   block stays load-bearing.

2. MXFP4 activation buffers (mxfp4_act_sf ~0.5 MiB, mxfp4_workspace
   variable) were allocated whenever cutlass_sm120_mxfp4_available()
   was true — i.e. on every NVFP4 model running on Blackwell hardware,
   regardless of whether the model carries any MXFP4 weights at all.
   The MXFP4 path in executor_kernels.cu:1986 is gated on
   `mxfp4_cache->find(weight.data)` returning a hit, which only
   happens for weights with QType::MXFP4. Add a layer scan for
   actual MXFP4 weight types (or the explicit `attention.mxfp4 =
   "always"` opt-in) so plain NVFP4-prequant Qwen3.6 doesn't pay
   the half-MiB.
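
A sketch of the MXFP4 gate from point 2; the model/weight traversal and
type names are illustrative (QType::MXFP4 and the attention.mxfp4 =
"always" opt-in follow the description above):

```cpp
// Only allocate mxfp4_act_sf / mxfp4_workspace if something will actually
// hit the MXFP4 path: a weight stored as QType::MXFP4, or the explicit
// attention.mxfp4 = "always" opt-in. Hardware capability alone
// (cutlass_sm120_mxfp4_available()) is no longer sufficient.
bool model_needs_mxfp4_workspace(const Model& model, const Config& cfg) {
    if (cfg.attention_mxfp4_always) return true;       // explicit opt-in
    for (const auto& layer : model.layers)
        for (const auto& w : layer.weights)
            if (w.qtype == QType::MXFP4) return true;  // layer scan hit
    return false;  // plain NVFP4 prequant (e.g. Qwen3.6): skip the scratch
}
```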

Verified on Qwen3.6-35B-A3B-NVFP4 (GDN model):

  Pre-fix VRAMAllocator report had:
    fp8_activation     3.0 MiB  (alloc'd, never used)
    mxfp4_act_sf       0.5 MiB  (alloc'd, never used)
    + d_fp8_block_maxes / d_fp8_absmax / d_act_scale (raw cudaMalloc,
      not in vram_allocator's tag map but present)
    Free VRAM: 0 MiB

  Post-fix: those four entries are gone from the allocator report,
  free VRAM jumps to 6439 MiB. Smoke "Hallo, wie geht es dir?" still
  produces a coherent 129-token response.

The remaining workspace items (moe_batch_dequant 512 MiB, attn_scores
2 MiB, dequant_scratch 32 MiB, the four cutlass_act_*, moe_3x_*,
moe_dequant, moe_staging, persistent_workspace, shared_workspace) are
ALL load-bearing for paths Qwen3.6-NVFP4 actually traverses — leaving
them alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Quick wins for cold-start time on Qwen3.6-NVFP4 (24s -> 18s wall clock,
upload phase 18s -> 10s at ~2.1 GB/s, was 1.17 GB/s).

- safetensors_loader: parse shards in parallel std::threads; drop shards
  whose tensors are 100% MTP/vision-only (model_mtp.safetensors,
  model_visual.safetensors). Saves ~5s on Qwen3.6-NVFP4 by avoiding
  2.4 GiB of mmap + header parse + page-cache pressure.
- llm_compressor_loader: expose name_is_skipped() so the shard-skip
  filter doesn't duplicate translate_name's skip rules.
- gguf_loader + safetensors_loader: MAP_POPULATE on the weight mmap
  (with retry-without-flag fallback for FS that reject it) plus
  MADV_WILLNEED. Forces large sequential disk reads on cold cache
  instead of per-page faults during upload (sketched after this list).
- weight_upload: pinned staging ring is now 4 x 128 MiB (was 2 x 64 MiB),
  deepens the H2D pipeline so per-tensor sync stalls overlap with prior
  DMAs. Pass-2 expert upload re-arms the cudaMemGetInfo cache from the
  budget free_mem reading so checked_cuda_malloc avoids a sync per
  expert tensor (was 15k+ sync calls on 128-expert MoE).
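
A sketch of the mmap-flag change from the third bullet, showing only the
POSIX calls involved; the loaders' actual wrappers and error handling are
omitted:

```cpp
#include <cstddef>
#include <sys/mman.h>

// Map the weight file read-only with MAP_POPULATE so the kernel prefaults
// pages with large sequential reads up front instead of taking per-page
// faults during upload. Some filesystems reject the flag, so retry without
// it, then hint the access pattern with MADV_WILLNEED / MADV_SEQUENTIAL.
void* map_weights(int fd, size_t len) {
    void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
    if (p == MAP_FAILED) {
        // retry-without-flag fallback
        p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    }
    if (p != MAP_FAILED) {
        madvise(p, len, MADV_WILLNEED);
        madvise(p, len, MADV_SEQUENTIAL);
    }
    return p;
}
```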

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end native tool calling for Gemma-4 (and any model with
pipe-delimited control tokens). Verified on gemma-4-26B-A4B Q4_K_M:
single-shot prompt -> finish_reason=tool_calls + structured tool_calls
array, completion stops at <tool_call|> after 19 tokens (was running
until max_tokens, emitting markdown JSON garbage).

Tokenizer (root cause):
- encode_spm + encode_gpt2 + encode_gemma4 now run a longest-match
  pre-split pass against control / added tokens before BPE. Previously
  multi-character markers like <|tool_call> (Gemma-4 token id 48) were
  BPE'd as raw UTF-8 bytes (~6-10 tokens), so the model never saw the
  trained marker in the prompt's tools-rendering and answered with
  markdown JSON code blocks instead of the native protocol (the
  pre-split pass is sketched below).
- Tokenizer::build_special_pieces() materializes the cache from
  CONTROL-flagged entries (tokenizer.json `special:true`, GGUF
  tokenizer.ggml.token_type=3). Filters out plain alnum identifiers so
  a misflagged "the" wouldn't shadow normal vocab.
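
A sketch of the longest-match pre-split; the Piece struct and function
signature are illustrative, and the real pass lives inside the encode_*
paths and consumes the cache built by build_special_pieces():

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// A text piece after the pre-split: either a special marker with its
// assigned id, or ordinary text (special_id < 0) left for the BPE pass.
struct Piece { std::string_view text; int special_id; };

// Longest-match scan against the CONTROL-flagged added tokens. Without
// this, a marker like "<|tool_call>" (Gemma-4 id 48) gets BPE'd into
// ~6-10 raw byte tokens and the model never sees its trained marker.
std::vector<Piece> pre_split(std::string_view text,
                             const std::vector<std::pair<std::string, int>>& specials) {
    std::vector<Piece> out;
    size_t start = 0;
    for (size_t i = 0; i < text.size();) {
        const std::pair<std::string, int>* best = nullptr;
        for (const auto& s : specials)                  // longest match wins
            if (text.compare(i, s.first.size(), s.first) == 0 &&
                (!best || s.first.size() > best->first.size()))
                best = &s;
        if (!best) { ++i; continue; }
        if (i > start) out.push_back({text.substr(start, i - start), -1});
        out.push_back({text.substr(i, best->first.size()), best->second});
        i += best->first.size();
        start = i;
    }
    if (start < text.size()) out.push_back({text.substr(start), -1});
    return out;
}
```

Matched markers then round-trip as their assigned ids (48/49 in the
Gemma-4 vocab), so the prompt's tools rendering finally contains the
tokens the model was trained on.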

Server / parser:
- New parse_tool_calls_gemma() handles Gemma's non-JSON syntax:
  <|tool_call>call:NAME{key:value,key:value}<tool_call|> with string
  values wrapped in <|"|>...<|"|>, recursive nested objects/arrays
  (frame extraction sketched after this list).
- ChatTemplateFamily::GEMMA dispatched in parse_tool_calls(),
  reconstruct_tool_call_output(), format_tool_response().
- Multi-turn: handler appends tool-response markers to the prior
  assistant ChatMessage for Gemma (template line ~215 skips standalone
  role=tool messages; round-tripping requires gluing them onto the
  assistant turn that produced the call).
- Removed `finish=="length"` gate on parse_tool_calls — Gemma can emit
  a complete tool_call early then keep generating; the parser is
  tolerant of trailing garbage so we still surface the call.
- apply_jinja_with_tools sets enable_thinking=true for Gemma family
  when tools are present. The template auto-injects an empty
  <|channel>thought<channel|> block when thinking is off, which trains
  the model to skip tool selection and answer in plain text. With
  thinking enabled the model reasons about the call first.
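
A sketch of the outer frame extraction only; the names are illustrative,
and the recursive argument parsing with <|"|>...<|"|> strings and nested
objects/arrays is a separate step in the real parse_tool_calls_gemma():

```cpp
#include <cstddef>
#include <optional>
#include <string>

struct GemmaCallFrame { std::string name; std::string args_body; };

// Pull the Gemma-4 native frame <|tool_call>call:NAME{...}<tool_call|>
// out of the completion text; args_body keeps the raw "{key:value,...}"
// for the recursive argument parser.
std::optional<GemmaCallFrame> extract_gemma_frame(const std::string& text) {
    const std::string open = "<|tool_call>call:", close = "<tool_call|>";
    size_t start = text.find(open);
    if (start == std::string::npos) return std::nullopt;
    size_t brace = text.find('{', start + open.size());
    size_t end = text.find(close, start);
    if (brace == std::string::npos || end == std::string::npos || end < brace)
        return std::nullopt;
    return GemmaCallFrame{
        text.substr(start + open.size(), brace - start - open.size()),
        text.substr(brace, end - brace)};
}
```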

Open WebUI:
- docker-compose: enable web search (DuckDuckGo, no key), Pyodide code
  interpreter, URL fetch, native function calling. Toggleable per
  message via the chat-input icons.

Known gap: Gemma-4 NVFP4 prequant doesn't reliably emit token 48/49
because FP4 compression depresses the special-token logit. The native
path is verified end-to-end on Q4_K_M / Q8_0 GGUF; for the NVFP4
quant, fall back to Open WebUI's prompt-based function-calling mode
(per-model setting in Workspace -> Models -> Advanced Params).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen3.6's chat-template uses a third tool-call format that's neither
ChatML JSON nor Llama3 <function=...>:

  <tool_call>
  <function=NAME>
  <parameter=KEY>
  VALUE
  </parameter>
  ...
  </function>
  </tool_call>

parse_tool_calls_chatml now branches on the body shape — '{' goes
through the existing JSON path, '<function=' goes through a new
XML-style parser that walks <parameter=KEY>VALUE</parameter> pairs and
coerces bare numerics / true / false / null. Outer tag scan also
tolerates Qwen3.6's frequent drift where the closing </tool_call> is
emitted as a second opening <tool_call> token (treated as the body
delimiter so the call is parsed instead of dropped).
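
A sketch of the <parameter=KEY>VALUE</parameter> walk; the function is
illustrative, and numeric/bool/null coercion plus the lenient outer-tag
scan sit around this in the real parser:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Walk <parameter=KEY>VALUE</parameter> pairs out of a Qwen3.6-style
// <function=NAME>...</function> body. Values come back as raw strings;
// bare numerics / true / false / null are coerced afterwards.
std::vector<std::pair<std::string, std::string>>
parse_parameters(const std::string& body) {
    std::vector<std::pair<std::string, std::string>> params;
    const std::string open = "<parameter=", close = "</parameter>";
    size_t pos = 0;
    while ((pos = body.find(open, pos)) != std::string::npos) {
        size_t key_end = body.find('>', pos + open.size());
        size_t val_end = body.find(close, key_end);
        if (key_end == std::string::npos || val_end == std::string::npos) break;
        std::string key = body.substr(pos + open.size(), key_end - pos - open.size());
        std::string val = body.substr(key_end + 1, val_end - key_end - 1);
        // Trim the newlines the template puts around VALUE.
        while (!val.empty() && (val.front() == '\n' || val.front() == '\r')) val.erase(0, 1);
        while (!val.empty() && (val.back() == '\n' || val.back() == '\r')) val.pop_back();
        params.emplace_back(std::move(key), std::move(val));
        pos = val_end + close.size();
    }
    return params;
}
```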

Verified end-to-end on Qwen3.6-35B-A3B-NVFP4: finish_reason=tool_calls,
structured tool_calls array with reasoning_content alongside.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl changed the title from "fix(server): UTF-8 boundary in reasoning + drop leaked stop tokens + accurate truncation notice" to "fix(server) + perf(load) + feat(tools): UTF-8 boundary, faster startup, native tool calling for Gemma-4 + Qwen3.6" on May 3, 2026
kekzl and others added 3 commits May 3, 2026 02:45
Repo-wide doc audit. Replaces a sprawling mix of point-in-time reports,
shipped-feature roadmaps, and historical agent-planning artifacts with
a focused, current-state set.

### Removed (~8.5k lines)

- DISPATCH.md — 511 lines, last refreshed 2026-03, header itself flagged
  drift after PR #72 type-system refactor.
- MODEL_VALIDATION_REPORT.md — 859-line frozen 2026-05-02 snapshot;
  same content already covered in CHANGELOG.
- docs/QWEN36_SUPPORT_ROADMAP.md, docs/CUTLASS3X_MOE_ROADMAP.md,
  docs/PROJECT_B_MXFP4_FMHA_UPGRADE.md — roadmaps for shipped /
  in-flight work; live state is in TODO.md and CHANGELOG.md.
- docs/llm-compressor-validation-results.md — Phase-1 validation
  snapshot; story is in CHANGELOG.
- docs/superpowers/ (9 files, 6.5k lines) — agent-internal planning
  artifacts for shipped features. Implementation in code, story in
  CHANGELOG, no user-facing value.
- bench/results/optimization_log.md, tests/READINESS.md — stale
  March / April brain-dumps.

### Refactored

- README.md
  - Drop hard-coded "Opus 4.6" version stamp; "built with Claude Code,
    mostly Opus".
  - Lead with NVFP4 as the primary path; GGUF reframed as legacy
    compatibility format. Decode-throughput table split into "NVFP4
    (primary)" and "GGUF (legacy)" sections.
  - Quickstart defaults to a SafeTensors NVFP4 model (Qwen3-Coder-30B
    Modelopt) and `docker compose up -d`.
  - Add workstation Blackwell coverage: same `sm_120f` binary runs on
    RTX 5090 (32 GB), RTX PRO 5000 Blackwell (48 GB), RTX PRO 6000
    Blackwell (96 GB).
  - Update LoC line ("~84k C++/CUDA" measured today).
  - Tool-calling list extended (ChatML / Llama3 / Gemma-4 / Qwen3.6).
  - Add tokenizer special-token pre-split bullet.
  - Doc index trimmed to surviving files only.

- AGENTS.md
  - Drop "Opus 4.6" version stamp; mostly-Opus + Sonnet for refactors.
  - Process narrative reframed around the NVFP4 path as the headline,
    GGUF as the secondary track; ~700 tests across 8 binaries (was 289).
  - Build/test instructions updated to the canonical Docker workflow
    (make build / make test-gpu / make verify-fast / pre-push hook).
  - sm_120f scope extended to the workstation Blackwell family.

- CHANGELOG.md
  - New "Server + tools (PR #97)" block: native Gemma-4 + Qwen3.6
    function calling (incl. tokenizer special-token pre-split root
    cause), 24s -> 18s cold start, server fixes (UTF-8 boundary, stop
    token leak, truncation notice, post-</think> grace, repetition
    penalty default, FP8/MXFP4 workspace skip).

- BENCHMARKS.md, docs/usage.md
  - Workstation Blackwell GPUs added (RTX PRO 5000 / 6000 Blackwell)
    as same-architecture supported targets.

### Net effect

19 doc files / ~11k lines  ->  12 files / 2.4k lines (-78% lines).
Surviving docs: README, CHANGELOG, TODO, BENCHMARKS, AGENTS, CLAUDE,
plus docs/{usage, memory-management-comparison,
memory-traffic-reduction-catalog, MXFP4_QUANTIZATION,
RECOMMENDED_MODELS, SM120_OPTIMIZATION_STATUS}.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… family

Bring the project-CLAUDE.md in line with the new doc standard:

- **GPU family**: target table now lists all three sm_120f GB202 cards
  (RTX 5090 32 GB, RTX PRO 5000 Blackwell 48 GB, RTX PRO 6000 Blackwell
  96 GB) — same binary, same kernels, only VRAM/clock vary. Project
  Overview opening paragraph extended to match.
- **NVFP4 as primary path**: opening paragraph reframed around NVFP4
  prequant SafeTensors (Modelopt + llm-compressor) as the headline,
  GGUF as legacy. Architecture list re-ordered with NVFP4 numbers.
- **Tool calling**: new section listing the four supported families
  (ChatML / Qwen3.6 XML / Llama-3.x / Gemma-4 pipe-delimited) with
  their output formats and parser entry points, plus the tokenizer
  special-token pre-split note (root cause of the Gemma/Qwen3.6 native
  function-calling bug).
- **imp-server**: blurb updated to reflect tool calling, JSON mode,
  Anthropic /v1/messages endpoint, Open WebUI default stack with
  DuckDuckGo search + Pyodide code interpreter + URL fetch.
- **Speculative decoding**: replaced the generic stub with the actual
  shipped state (n-gram opt-in, EAGLE/self-spec/DFlash dropped per
  TODO.md).
- **Tests**: 606 → ~700 across 8 per-module binaries; minor fixups for
  the docs/ tree summary (no more roadmap files).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tale numbers

Final pass on the docs/ files left untouched in 81fdea3.

### docs/RECOMMENDED_MODELS.md
- All NVFP4 numbers refreshed to post-PR#88 state: Qwen3-Coder-30B
  51 -> 272, Qwen3.6 117-142 -> 217, Gemma-4 157-180 -> 213,
  Mistral-3.2 81 -> 101.
- Drop "--no-cuda-graphs for coherence" caveat (PR #88 made graphs
  safe by default for prequant SafeTensors).
- Add workstation Blackwell GPU header note + NVFP4-as-primary
  framing.
- MoE table reordered to lead with NVFP4 prequant rows.
- Add Qwen3-30B-A3B-NVFP4-Modelopt as Mistral-3.2 long-context
  replacement.

### docs/SM120_OPTIMIZATION_STATUS.md
- Header now lists all three GB202 cards (RTX 5090 / PRO 5000 / 6000).
- "What Would Actually Help Decode" updated: NVFP4 prequant is the
  primary win, not speculative decoding (which TODO.md says is
  abandoned).
- Tested-models table refreshed with post-#88 numbers; "CUDA graphs
  for non-fast-path MoE: disabled" stays accurate but the NVFP4
  prequant row now says graphs capture end-to-end.
- "Project B Stage 5" name dropped from open-items table (the
  PROJECT_B doc was removed earlier; renamed to a plain
  "mxf4nvf4.block_scale MMA integration" line item).
- "engine.cpp:547" stale line ref replaced with the actual issue
  description (per-layer head_dim FP8 KV write/read kernels).

### docs/MXFP4_QUANTIZATION.md -> docs/quantization.md
- Title misled (mostly NVFP4 content). Renamed to broader scope and
  rewritten as a concise "where to get models for each path" guide.
- NVFP4 (primary), MXFP4 (CUTLASS-internal attention only), GGUF
  K-quants (legacy), other KV quants (FP8 / INT8 / INT4 / TurboQuant)
  each get a focused paragraph with current status and caveats
  (Mistral-3.2 long-prose, Gemma-4 NVFP4 native tool calls).
- Inference-pipeline detail removed - that's CLAUDE.md's job.
- Stale Qwen3-Coder "38 tok/s" -> current "272 tok/s post #88"
  context.

### docs/memory-management-comparison.md
- "imp supports CUDA only (Hopper, Blackwell)" was wrong - imp is
  sm_120f only. Corrected throughout.
- "Blackwell-native features (PDL, Green Contexts, TCGEN05)" - imp
  uses register-based mma.sync, NOT TCGEN05. Removed the false
  claim, noted the actual MMA shapes.
- KV format table extended: was "FP16, FP8 E4M3", now "FP16
  (default), FP8, INT8, INT4, NVFP4, TurboQuant" - reflects what
  the engine actually supports.
- mmap hints: was "MADV_SEQUENTIAL", now "MAP_POPULATE +
  MADV_WILLNEED + MADV_SEQUENTIAL" (post PR #97 cold-cache
  prefault).
- Pinned pool: was "64 MiB", now "4x128 MiB ring" (post PR #97
  upload pipeline).
- Speculative decoding cell rewritten to match TODO.md's
  abandoned-options state instead of generic "draft+target with KV
  block rollback".
- Decision matrix: was "Single H100/B200, latency-critical, one
  model -> imp" - imp doesn't target H100/B200. Replaced with
  Blackwell GB202 entry + a "datacenter Hopper / B200/B300 -> use
  vLLM/TensorRT-LLM" line.
- Native function calling row added (was missing entirely; new
  feature post #97).

### docs/memory-traffic-reduction-catalog.md
- W2 EAGLE-3: was "Dead-End historisch, worth revisit" ("historical
  dead end, worth revisiting") - this contradicts TODO.md's definitive
  abandoned-options list. Updated to "abandoned - all variants tested,
  single-5090 decode bandwidth-bound".
- A6 Fused MoE Routing: was "teilweise (Gemma-4 Fast-Path)" ("partial,
  Gemma-4 fast path only") - outdated. Now "vorhanden für NVFP4 prequant
  MoE" ("available for NVFP4 prequant MoE"), covering
  Qwen3.6/Gemma-4/Coder-30B; legacy GGUF MoE called out as the
  remaining open item.
- "Counterintuitive finding: Q6_K beats NVFP4 on decode at 30B" was
  flipped post-PR#88: NVFP4 now 272 tok/s vs Q6_K 234. Section
  rewritten to explain that the old gap was an implementation
  artifact, not a format tradeoff. NVFP4 is now the default
  recommendation.
- Top-Kandidaten ("top candidates") list: dropped W2 EAGLE-revisit +
  W3 Medusa, added A6-generalize-for-GGUF and the mxf4nvf4 MMA
  integration.
- "Was gewonnen wurde" ("what was gained") section prepended with the
  headline NVFP4 prequant decode jump and the cold-start reduction
  from PR #97.
- Old `memory/...md` reference list dropped (those are auto-memory
  files, not in the repo); replaced with pointers to the surviving
  docs.

### README.md
- Doc-index pointer updated: docs/MXFP4_QUANTIZATION.md ->
  docs/quantization.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl enabled auto-merge (squash) May 3, 2026 01:00
kekzl merged commit 55e39e9 into main May 3, 2026
2 checks passed
kekzl deleted the fix/server-utf8-and-stop-token-leak branch May 3, 2026 01:06
kekzl added a commit that referenced this pull request May 3, 2026
Resolves 19 conflicts that accumulated since the PR opened (3 PRs landed
on main: #97 server UTF-8 + native tool calling, #98 _audit/ gitignore,
#99 CI sm_120 cleanup).

Resolution policy:
  - DU files (AGENTS.md, CLAUDE.md, docs/RECOMMENDED_MODELS.md,
    docs/SM120_OPTIMIZATION_STATUS.md, docs/memory-*.md): kept the
    deletion from this branch (intentional release cleanup).
  - .gitignore: kept this branch's broader .claude/ + CLAUDE.md ignores
    (superset of main's .claude/*.lock + .claude/worktrees/ rules).
  - .github/workflows/ci.yml: kept this branch's explanatory comment;
    main's -DCMAKE_CUDA_ARCHITECTURES="120" is silently overridden by
    CMakeLists.txt's gencode pin anyway.
  - docs/quantization.md: kept this branch's public-release rewrite.
  - docs/performance.md: kept this branch's Methodology heading; the
    table below already carries the hardware detail main's paragraph
    duplicated.
  - README.md: kept this branch's 128-line public-release rewrite over
    main's older 251-line version.
  - 8 source-code conflicts (engine.cpp, llm_compressor_loader.cpp,
    safetensors_loader.cpp, tokenizer.cpp, weight_upload.cu,
    handlers.cpp, tool_call.cpp/h): took main's version verbatim. All
    were format-vs-functional collisions where main carried PR #97's
    real changes (tool calling, UTF-8 boundary, post-think stop window
    bumped from 4 to 16 tokens) and our branch only had whitespace.
    A follow-up commit re-applies clang-format on top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>