fix(attention): Track E PV repack — m16n8k16 A-operand layout (re-enables Track E) by kekzl · Pull Request #353 · kekzl/imp

kekzl · 2026-05-21T15:42:42Z

Context

PR #352 was meant to ship this fix but auto-merge fired on the disable-hotfix commit before the actual fix commit was pushed. Main is currently running with Track E unconditionally disabled (early `return false;` at launcher entry).

This PR re-enables Track E with the actual root-cause fix.

Root cause

The PV repack (S_frag → P_frag for the second mma) used the wrong A-operand interleaving for `mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32`.

Correct PTX m16n8k16 A-frag per-lane layout (row × k-half interleaved):

`a[0] = (row_a, k<8)` → accumulates into `d[0,1]` (row_a output)
`a[1] = (row_b, k<8)` → accumulates into `d[2,3]` (row_b output)
`a[2] = (row_a, k≥8)` → accumulates into `d[0,1]`
`a[3] = (row_b, k≥8)` → accumulates into `d[2,3]`

Buggy code grouped by row-then-k:

`a[1] = S[row_a, k≥8]` ← WRONG (should be row_b k<8)
`a[2] = S[row_b, k<8]` ← WRONG (should be row_a k≥8)

Row_a softmax weights ended up in d[2,3] (row_b output) and vice versa. Original test fill (magnitude 0.125) was too uniform to expose this — error stayed within 5e-3 tolerance. Real Qwen3 attention scores produced ~0.116 abs error → garbage output.

Fix

P_frag construction — corrected interleaving order in PV mma call.
O_frag rescale & epilogue store — restored standard mappings (`frag[0,1] → row_a = lane/4`, `frag[2,3] → row_b = lane/4 + 8`).
Test fill magnitude — bumped from `* 0.125f` to `* 1.0f` so future bugs of this class don't slip through.
Removes the unconditional `return false` from fix(attention): Track E PV repack — m16n8k16 A-operand layout #352.

Verification

```
Prompt: "What is 17 + 25?"
With #352 (Track E disabled, runs cuBLAS): coherent output
With this fix (Track E enabled): "Okay, I need to figure out how to solve this problem..." (coherent)
```

Test suite at stricter magnitude=1.0 fill:

6/6 TrackE_Correctness tests PASS (1 hd=512 SKIP, expected)
3/3 TrackE_Features tests PASS (SWA, softcap, combined)
max_abs = 0.000244 (FP16 rounding floor only)

Test plan

All 9 TrackE_* tests pass at magnitude=1.0
Real Qwen3-8B Q8_0 inference produces coherent output
verify-fast pre-push hook green
Re-validate perf gains from perf(attention): Track E warp-spec 4+4 + perf validation report #351 are intact after the fix (no reason to expect regression, but worth a quick A/B post-merge)

🤖 Generated with Claude Code

Root cause of degeneration on real model weights: the P_frag construction for the PV mma was swapping the row/k-half interleaving of the m16n8k16 A-operand layout. PTX m16n8k16 A-frag per-lane layout: a[0] = (row_a, k<8) → accumulates into d[0,1] (row_a output) a[1] = (row_b, k<8) → accumulates into d[2,3] (row_b output) a[2] = (row_a, k≥8) → accumulates into d[0,1] (row_a output) a[3] = (row_b, k≥8) → accumulates into d[2,3] (row_b output) Where row_a = lane/4 and row_b = lane/4 + 8. The buggy code grouped by row-then-k: a[0] = S[row_a, sa=k<8] OK a[1] = S[row_a, sb=k≥8] WRONG — should be row_b k<8 a[2] = S[row_b, sa=k<8] WRONG — should be row_a k≥8 a[3] = S[row_b, sb=k≥8] OK Net effect: row_a softmax weights partially accumulated into d[2,3] (the row_b output slot) and vice versa. With the original uniform test fill (magnitude 0.125), all row values were similar so the swap produced ~5e-3 error — within the test tolerance. On real Qwen3 attention weights (highly non-uniform, magnitude ~1.0), the swap produces ~0.116 abs error → garbage logits → degenerate output ("，，，" or "ffff..." infinite repetition). Fix: rewrite P_frag construction with correct interleaving. Also restored standard mappings throughout: - O_frag rescaling: scale_prev[0] for [0,1], scale_prev[1] for [2,3] - Epilogue normalize: row_l[0] for [0,1], row_l[1] for [2,3] - Epilogue store: frag[0,1] → abs_row_a (lane/4), frag[2,3] → abs_row_b (lane/4 + 8) Test fill bumped from magnitude 0.125 to 1.0 — the original fill was too forgiving and let this bug ship. Now max_abs = 0.000244 (fp16 rounding only) across all 9 TrackE_* tests at the tighter magnitude. Real prompt verification (Qwen3-8B Q8_0, "What is 17 + 25?"): Before fix: "，，，，，，，，，，..." After fix: "Okay, I need to figure out how to solve this problem..." Supersedes the hotfix-disable in 6f2ce41 (which is now reverted via this enable-with-fix). Net of both commits on this branch = a single PV-repack fix with no production behavior change beyond bug elimination. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Multi-model smoke test (2026-05-21) revealed Track E is broken on most production models, not just one. The PV-repack fix in #353 resolved ONE bug class (m16n8k16 A-operand interleaving) but more remain. Test results with Track E enabled (post #353): Qwen3-8B Q8_0 → "Okay, I need to figure out..." OK Qwen3.6-35B NVFP4 → "The user is asking for the sum..." OK Qwen3-8B NVFP4 → "1. 1. 1. 1." DEGENERATE Gemma-4-26B Q8_0 → "층-다스-층-다스-..." DEGENERATE (Korean) Gemma-4-26B Q4_K_M → "the-land-land-land-..." DEGENERATE Gemma-4-26B NVFP4 → "<eos><eos>" IMMEDIATE EOS Test results with Track E DISABLED (this commit): Qwen3-8B Q8_0 → "Okay, I need to figure out..." OK Qwen3-8B NVFP4 → "Okay, let's see." OK Gemma-4-26B Q8_0 → "17 + 25 = 42<turn|>" OK Suspects for the remaining bug classes: - hd=256 path (Gemma-4 SWA layers) uses Br=32 Bkv=32 — different register/smem layout than hd=128 Br=64 Bkv=64, may have its own indexing bugs that the magnitude=1.0 test fill still misses - GDN+attention hybrid (Qwen3.5/3.6) passes attention K/V with different shapes that Track E may misinterpret - NVFP4-weight models produce attention activations with different distribution from Q8_0; some numerical path may overflow/underflow The TrackE_Correctness tests at magnitude=1.0 fill PASS for all configs (max_abs = 0.000244), so the test fill is still not representative enough of real distributions. Need richer test data before re-enabling. Track E stays in tree for investigation; production reverts to cuBLAS / FMHA via the existing dispatcher branches. Note: Qwen3.5-4B has a pre-existing cudaFree illegal-memory-access error unrelated to Track E (reproduces with this disable in place). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Track E v1 (PR #350 + PR #353) shipped with bugs that the uniform fill_fp16_deterministic test couldn't catch. Multi-model smoke test revealed degeneration on Gemma-4 (all 3 quants), Qwen3-8B NVFP4, and others. PR #354 disabled Track E in production after multiple fix attempts didn't resolve all cases. This commit adds the groundwork for a future Track E v2: 1. tests/test_mma_layout_probe.cu — empirical layout verification - LdmatrixX4_SamePointer: confirms same-ptr ldmatrix produces row 0 replicated (the suspected Q-load bug in Track E v1) - M16N8K16_DFrag: runs known-A × known-B through ldmatrix + mma and dumps per-lane fragments for verification against PTX ISA spec 2. docs/superpowers/specs/2026-05-21-track-e-v2-groundwork.md - Per-model degeneration table - Specific bugs identified (P_frag fixed in #353, Q-load + r[]→a[] ordering open) - Recommended v2 approach: WMMA-based (like attention_fmha_sm120.cu) - Test-rig requirements: Gaussian/real-checkpoint fill, multi-model smoke gate in CI Track E kernel itself stays disabled. cuBLAS / FMHA handle all prefill. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot enabled auto-merge (squash) May 21, 2026 15:42

github-actions Bot merged commit 92807d4 into main May 21, 2026
3 checks passed

kekzl mentioned this pull request May 21, 2026

fix(attention): disable Track E — multi-model degeneration (URGENT) #354

Merged

4 tasks

kekzl mentioned this pull request May 21, 2026

track-e/v2-groundwork: layout probe + bug analysis #355

Merged

3 tasks

kekzl mentioned this pull request May 21, 2026

feat(attention): Track E v2 — WMMA-based correct rewrite #356

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(attention): Track E PV repack — m16n8k16 A-operand layout (re-enables Track E)#353

fix(attention): Track E PV repack — m16n8k16 A-operand layout (re-enables Track E)#353
github-actions[bot] merged 1 commit into
mainfrom
fix/track-e-pv-repack

kekzl commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kekzl commented May 21, 2026

Context

Root cause

Fix

Verification

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant