
runtime: clear diagnostic for MXFP4 FP16-fallback VRAM oversubscription#60

Merged
kekzl merged 4 commits into main from fix/mxfp4-vram-oversubscribe-diag on Apr 25, 2026
Conversation

@kekzl (Owner) commented Apr 25, 2026

Summary

Qwen3.5-27B-mxfp4 GDN hit a confusing illegal-memory-access at the first decode forward on a 32 GiB GPU. Root cause: the MXFP4 → FP16 decode fallback (needed because GDN forward kernels read weights directly, bypassing the MXFP4 dispatcher) tried to allocate ~48 GiB of FP16 weights while ~12 GiB of raw MXFP4 data was already resident, for a total of ~60 GiB against 32 GiB of VRAM. On WSL2/WDDM `cudaMalloc` happily pages over the device boundary, but cuBLASLt then fails to obtain its internal workspace at runtime → status 14 (CUBLAS_STATUS_INTERNAL_ERROR), followed by an IMA in the next `cudaMemcpyAsync`.

This PR doesn't make a 27B MXFP4 model fit on 32 GiB — that needs structural changes (host-side dequant during upload, freeing MXFP4 raw after replace). It surfaces the failure clearly so users can diagnose it in seconds instead of minutes.

Changes

  1. Pre-flight VRAM check before the bulk MXFP4→FP16 alloc in `pre_dequant_weights`. If the request would leave less than 2 GiB headroom, log a "model too large" error and skip the alloc. `IMP_MXFP4_FP16_FALLBACK=force` bypasses for debugging.

  2. cublasGemmEx fallback now checks return status in `gemm_cublaslt_generic`. When both cublasLt and the GemmEx fallback fail with status 14 on the same shape, log an ERROR with dtypes and dimensions instead of silently leaving garbage in the output buffer and IMAing in the next kernel.

Diagnostic output (Qwen3.5-27B-mxfp4 on RTX 5090)

Before:
```
[ERROR] cudaMemcpyAsync(...) — illegal memory access was encountered
```

After:
```
[ERROR] MXFP4 FP16 fallback would oversubscribe VRAM
(need 47.7 GiB + 2.0 GiB runtime headroom, 9.4 GiB free).
Model is too large for this GPU with the FP16 decode fallback.
Use a smaller quant or a smaller model.
[WARN] gemm: cublasLtMatmul failed (status 14) M=5 K=6144 N=5120 ...
[ERROR] gemm: cublasGemmEx fallback also failed (status 14) M=5 K=6144 N=5120
dtA=2 dtB=2 dtC=2. Output buffer is garbage; expect downstream IMA.
[ERROR] cudaMemcpyAsync(...) — illegal memory access was encountered
```

The user now sees the causal chain in order: oversubscribed VRAM → cuBLAS rejection → IMA.

Test plan

  • `imp-tests --gtest_brief=1`: 79/94 pass (15 skipped, model-dependent — same as main)
  • Qwen3.5-4B Q8_0 GDN: coherent output, tg=170 tok/s
  • Qwen3-4B MXFP4: coherent ("Paris.")
  • Qwen3.6-35B Q4_K_M: coherent (no FP16 fallback path triggered)
  • Qwen3.5-27B-mxfp4: now logs the precise oversubscription error before the IMA

🤖 Generated with Claude Code

kekzl and others added 4 commits April 25, 2026 17:46
upload_weight() for gdn_alpha/gdn_beta ran with raw_quant=false, which
host-dequants Q8_0 → FP16 and overwrites weight.data, but never updates
L.gdn_*_qtype. gemm_dispatch then saw qtype=Q8_0 and re-interpreted the
FP16 bytes as Q8_0 blocks → ~80× too-large α/β projection → g_t collapses
to 0/1, recurrent state decays in 2-3 tokens, output degenerates to
" the the the …".

Drop the _RAW(false) carve-out so α/β upload like every other Q8_0 weight.

Verified Qwen3.5-4B Q8_0 (pp=13676, tg=220 tok/s, coherent), Qwen3.5-9B
Q8_0 (pp=9483, tg=140 tok/s, coherent), Qwen3.6-35B Q4_K_M (no regression).
79/94 tests pass, 4/4 DegenerationTest pass, 2/2 GDNModelTest pass with
IMP_TEST_MODEL_GDN=Qwen3.5-4B-Q8_0.gguf.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
upload_embeddings_and_output() already updated tok_emb_qtype after a host
Q8_0/Q6_K → FP16 dequant, but out_proj_qtype was passed by value and the
matching update was missing. For models with separate (untied) out_proj
and a quant type outside {Q6_K, Q8_0, Q4_0} (e.g. Q4_K, Q5_K, Q4_1, Q2_K,
Q3_K, MXFP4), the host dequant path would silently leave qtype pointing
at the original block format, mirroring the GDN α/β bug fixed in the
parent commit.

Take out_proj_qtype by reference and apply the same F16 update branch.
Most common GGUF schemes (Q4_K_M, Q5_K_M) use Q6_K for the output tensor
and weight-tied models bypass this path entirely, so no real-world model
in the repo currently triggers the issue — the change is defensive.

Tests: full GDN + Degeneration + EndToEnd + Embedding suites pass
(14/19 with model-dependent skips); Qwen3.5-4B/9B Q8_0 still coherent;
Qwen3.6-35B Q4_K_M still emits "Paris."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen3.5-27B-mxfp4 hit a confusing illegal-memory-access at the first
decode forward on a 32 GiB RTX 5090. Root cause: the MXFP4 → FP16 decode
fallback (needed for GDN models, where forward kernels read weights
directly without going through the MXFP4 dispatcher) tried to allocate
~48 GiB of FP16 weights while ~12 GiB of MXFP4 raw was already resident,
total 60 GiB > 32 GiB VRAM. On WSL2/WDDM cudaMalloc happily pages over
the device boundary, but cuBLASLt then fails to allocate its internal
workspace at runtime → status 14 (INTERNAL_ERROR) followed by an IMA in
the next cudaMemcpyAsync.

Two defensive changes:

1. Pre-flight VRAM check before the bulk MXFP4→FP16 alloc. If the
   request would leave less than 2 GiB headroom for cuBLAS workspaces
   and KV writes, log a clear "model too large for this GPU" error and
   skip the alloc. Set IMP_MXFP4_FP16_FALLBACK=force to bypass for
   debugging (still IMAs, but with an informed user).

2. Check the cublasGemmEx fallback return code in gemm_cublaslt_generic.
   When both cublasLt and the GemmEx fallback fail (now observed on the
   same shape that triggered the original IMA), surface that as an
   ERROR log including dtypes and dimensions instead of silently
   leaving garbage in the output and IMAing in the next kernel.

Tests: full suite 79/94 pass (15 model-dependent skips, same as before).
Qwen3.5-4B Q8_0 GDN, Qwen3-4B MXFP4, Qwen3.6-35B Q4_K_M still emit
coherent output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The comment claimed "GDN forward reads weights directly, not through
gemm_dispatch" — true at PR #8 time, but no longer accurate after the
GDN refactor moved every linear projection (ssm_in, ssm_out, gdn_gate,
gdn_alpha, gdn_beta) through gemm_dispatch. The scan kernel itself only
touches FP32 tensors (A_log, dt_bias, conv1d, ssm_norm) which are never
quantized to MXFP4.

The actual reason the GDN fallback is preserved is more painful: when
`mxfp4_gemv_available` is forced on for GDN (skipping the 48 GiB FP16
fallback that doesn't fit on 32 GiB GPUs), cuBLASLt returns status 14
(INTERNAL_ERROR) on the MXFP4 prefill dispatch for a number of GDN-shape
weights — notably Qwen3.5-27B's ssm_out at K=6144 N=5120 and FFN at
K=17408 N=5120 — and an IMA cascades through every subsequent
cudaMemcpyAsync. Tracking + suspect list in
qwen35_27b_mxfp4_ima_2026_04_25.md. Until that's fixed, leave the
historical fallback path active and update the comment so future
readers don't waste time chasing the wrong mental model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kekzl kekzl merged commit 0596ee4 into main Apr 25, 2026
2 checks passed
@kekzl kekzl deleted the fix/mxfp4-vram-oversubscribe-diag branch April 25, 2026 22:54
kekzl added a commit that referenced this pull request Apr 27, 2026
* runtime: FP32 attention S-matrix + Qwen3.5 QK-norm split

Two coupled fixes from PR #62 (569af55) that apply cleanly to current main.
The third change in that commit — dropping the GDN→FP16 fallback in
executor_pre_dequant.cu — is *intentionally skipped* here: PR #60 added a
diagnostic comment on main explaining that the cuBLAS INTERNAL_ERROR (status 14)
cascade for GDN-shape MXFP4 weights is still not root-caused, so the `has_gdn`
guard stays. Qwen3.5-27B-mxfp4 remains incoherent for an unrelated reason
per the original PR #62 author's notes, so dropping the guard wouldn't
unblock working usage today.

1. **attention_cublas.cu** — Use the FP32 S-matrix whenever the scratch
   buffer fits, not just when scale==1.0 (Gemma-4). The FP16-S path can
   accumulate enough round-off across deep layers to NaN at attention
   layers in the second half of the model. The scale==1.0 gate only
   triggered on Gemma-4 by accident; Qwen3.5-27B-mxfp4 (head_dim=512,
   scale=1/sqrt(512)) hits the same FP16 round-off → all-NaN attention
   output at L59 → garbage logits.

2. **executor_attention.cu (a)** — Pass the full `attn_scores_` tensor
   (capacity = max seq_len^2) to `attention_cublas_prefill` instead of
   constructing a sub-view sized for the current n. The FP32-fits check
   inside attention_cublas reads the passed S tensor's shape, not the
   underlying buffer's capacity, so the sub-view always made FP32-fits
   evaluate false even when the real allocation was 200× larger. With
   the full buffer passed, the FP32-S path actually fires for prefill
   shapes that fit in the buffer.

3. **executor_attention.cu (b)** — Detect attn_q_norm/attn_k_norm with
   norm_dim < head_dim and reshape Q/K accordingly. Qwen3.5-27B-mxfp4
   ships with head_dim=512 for Q, head_dim=256 for K, and a single
   (256,) norm weight intended to be applied per 256-dim chunk along the
   head. The previous code reshaped Q to [n*nh, hd=512] and called
   rmsnorm with the (256,) weight — reading 256 elements past the end
   of the weight buffer and producing the L59 attention NaN that
   poisoned the rest of the forward pass. New `split_norm_dim` lambda
   computes `norm_dim` from the weight's shape and reshapes the Q/K
   view to `[n*nh*(hd/norm_dim), norm_dim]`. No-op when
   norm_dim == hd (the common case).

Verified clean cherry-pick with no overlap against the recently-merged
PR #65 (server reasoning_content split + Gemma-4 NVFP4 multi-turn).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* tools: drop unreferenced create_test_mxfp4.py

Last MXFP4 converter remnant. The production convert_mxfp4.py was already
removed in PR #8 (e93deff). create_test_mxfp4.py only generated a tiny
synthetic GGUF for ad-hoc loader testing and is referenced by no test,
build target, doc, or script. Drop it to finish the converter cleanup.

MXFP4 input format is now exclusively third-party GGUFs (Unsloth,
bartowski, gpt-oss); own quantization pipeline is NVFP4 via NVIDIA
Model Optimizer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kekzl added a commit that referenced this pull request Apr 30, 2026

* runtime: FP32 attention S-matrix + Qwen3.5 QK-norm split
* tools: drop unreferenced create_test_mxfp4.py

Same two commit messages as the Apr 27 entry above, re-applied with commit hashes 8f8d2b3 and 5b18d2e in place of 569af55 and e93deff.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>