Skip to content

fix(attention): Track E PV repack — m16n8k16 A-operand layout (re-enables Track E)#353

Merged
github-actions[bot] merged 1 commit into
mainfrom
fix/track-e-pv-repack
May 21, 2026
Merged

fix(attention): Track E PV repack — m16n8k16 A-operand layout (re-enables Track E)#353
github-actions[bot] merged 1 commit into
mainfrom
fix/track-e-pv-repack

Conversation

@kekzl
Copy link
Copy Markdown
Owner

@kekzl kekzl commented May 21, 2026

Context

PR #352 was meant to ship this fix but auto-merge fired on the disable-hotfix commit before the actual fix commit was pushed. Main is currently running with Track E unconditionally disabled (early `return false;` at launcher entry).

This PR re-enables Track E with the actual root-cause fix.

Root cause

The PV repack (S_frag → P_frag for the second mma) used the wrong A-operand interleaving for `mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32`.

Correct PTX m16n8k16 A-frag per-lane layout (row × k-half interleaved):

  • `a[0] = (row_a, k<8)` → accumulates into `d[0,1]` (row_a output)
  • `a[1] = (row_b, k<8)` → accumulates into `d[2,3]` (row_b output)
  • `a[2] = (row_a, k≥8)` → accumulates into `d[0,1]`
  • `a[3] = (row_b, k≥8)` → accumulates into `d[2,3]`

Buggy code grouped by row-then-k:

  • `a[1] = S[row_a, k≥8]` ← WRONG (should be row_b k<8)
  • `a[2] = S[row_b, k<8]` ← WRONG (should be row_a k≥8)

Row_a softmax weights ended up in d[2,3] (row_b output) and vice versa. Original test fill (magnitude 0.125) was too uniform to expose this — error stayed within 5e-3 tolerance. Real Qwen3 attention scores produced ~0.116 abs error → garbage output.

Fix

  1. P_frag construction — corrected interleaving order in PV mma call.
  2. O_frag rescale & epilogue store — restored standard mappings (`frag[0,1] → row_a = lane/4`, `frag[2,3] → row_b = lane/4 + 8`).
  3. Test fill magnitude — bumped from `* 0.125f` to `* 1.0f` so future bugs of this class don't slip through.
  4. Removes the unconditional `return false` from fix(attention): Track E PV repack — m16n8k16 A-operand layout #352.

Verification

```
Prompt: "What is 17 + 25?"
With #352 (Track E disabled, runs cuBLAS): coherent output
With this fix (Track E enabled): "Okay, I need to figure out how to solve this problem..." (coherent)
```

Test suite at stricter magnitude=1.0 fill:

  • 6/6 TrackE_Correctness tests PASS (1 hd=512 SKIP, expected)
  • 3/3 TrackE_Features tests PASS (SWA, softcap, combined)
  • max_abs = 0.000244 (FP16 rounding floor only)

Test plan

🤖 Generated with Claude Code

Root cause of degeneration on real model weights: the P_frag construction
for the PV mma was swapping the row/k-half interleaving of the m16n8k16
A-operand layout.

PTX m16n8k16 A-frag per-lane layout:
  a[0] = (row_a, k<8)   → accumulates into d[0,1]  (row_a output)
  a[1] = (row_b, k<8)   → accumulates into d[2,3]  (row_b output)
  a[2] = (row_a, k≥8)   → accumulates into d[0,1]  (row_a output)
  a[3] = (row_b, k≥8)   → accumulates into d[2,3]  (row_b output)

Where row_a = lane/4 and row_b = lane/4 + 8.

The buggy code grouped by row-then-k:
  a[0] = S[row_a, sa=k<8] OK
  a[1] = S[row_a, sb=k≥8] WRONG — should be row_b k<8
  a[2] = S[row_b, sa=k<8] WRONG — should be row_a k≥8
  a[3] = S[row_b, sb=k≥8] OK

Net effect: row_a softmax weights partially accumulated into d[2,3]
(the row_b output slot) and vice versa. With the original uniform
test fill (magnitude 0.125), all row values were similar so the
swap produced ~5e-3 error — within the test tolerance.

On real Qwen3 attention weights (highly non-uniform, magnitude ~1.0),
the swap produces ~0.116 abs error → garbage logits → degenerate
output (",,," or "ffff..." infinite repetition).

Fix: rewrite P_frag construction with correct interleaving. Also
restored standard mappings throughout:
- O_frag rescaling: scale_prev[0] for [0,1], scale_prev[1] for [2,3]
- Epilogue normalize: row_l[0] for [0,1], row_l[1] for [2,3]
- Epilogue store: frag[0,1] → abs_row_a (lane/4), frag[2,3] → abs_row_b (lane/4 + 8)

Test fill bumped from magnitude 0.125 to 1.0 — the original fill was
too forgiving and let this bug ship. Now max_abs = 0.000244 (fp16
rounding only) across all 9 TrackE_* tests at the tighter magnitude.

Real prompt verification (Qwen3-8B Q8_0, "What is 17 + 25?"):
  Before fix: ",,,,,,,,,,..."
  After fix:  "Okay, I need to figure out how to solve this problem..."

Supersedes the hotfix-disable in 6f2ce41 (which is now reverted via
this enable-with-fix). Net of both commits on this branch = a single
PV-repack fix with no production behavior change beyond bug elimination.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot enabled auto-merge (squash) May 21, 2026 15:42
@github-actions github-actions Bot merged commit 92807d4 into main May 21, 2026
3 checks passed
github-actions Bot pushed a commit that referenced this pull request May 21, 2026
Multi-model smoke test (2026-05-21) revealed Track E is broken on most
production models, not just one. The PV-repack fix in #353 resolved
ONE bug class (m16n8k16 A-operand interleaving) but more remain.

Test results with Track E enabled (post #353):
  Qwen3-8B Q8_0        → "Okay, I need to figure out..."  OK
  Qwen3.6-35B NVFP4    → "The user is asking for the sum..." OK
  Qwen3-8B NVFP4       → "1. 1. 1. 1."                 DEGENERATE
  Gemma-4-26B Q8_0     → "층-다스-층-다스-..."         DEGENERATE (Korean)
  Gemma-4-26B Q4_K_M   → "the-land-land-land-..."      DEGENERATE
  Gemma-4-26B NVFP4    → "<eos><eos>"                  IMMEDIATE EOS

Test results with Track E DISABLED (this commit):
  Qwen3-8B Q8_0        → "Okay, I need to figure out..."  OK
  Qwen3-8B NVFP4       → "Okay, let's see."             OK
  Gemma-4-26B Q8_0     → "17 + 25 = 42<turn|>"          OK

Suspects for the remaining bug classes:
  - hd=256 path (Gemma-4 SWA layers) uses Br=32 Bkv=32 — different
    register/smem layout than hd=128 Br=64 Bkv=64, may have its own
    indexing bugs that the magnitude=1.0 test fill still misses
  - GDN+attention hybrid (Qwen3.5/3.6) passes attention K/V with
    different shapes that Track E may misinterpret
  - NVFP4-weight models produce attention activations with different
    distribution from Q8_0; some numerical path may overflow/underflow

The TrackE_Correctness tests at magnitude=1.0 fill PASS for all
configs (max_abs = 0.000244), so the test fill is still not
representative enough of real distributions. Need richer test data
before re-enabling.

Track E stays in tree for investigation; production reverts to
cuBLAS / FMHA via the existing dispatcher branches.

Note: Qwen3.5-4B has a pre-existing cudaFree illegal-memory-access
error unrelated to Track E (reproduces with this disable in place).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions Bot pushed a commit that referenced this pull request May 21, 2026
Track E v1 (PR #350 + PR #353) shipped with bugs that the uniform
fill_fp16_deterministic test couldn't catch. Multi-model smoke test
revealed degeneration on Gemma-4 (all 3 quants), Qwen3-8B NVFP4, and
others. PR #354 disabled Track E in production after multiple fix
attempts didn't resolve all cases.

This commit adds the groundwork for a future Track E v2:

1. tests/test_mma_layout_probe.cu — empirical layout verification
   - LdmatrixX4_SamePointer: confirms same-ptr ldmatrix produces row 0
     replicated (the suspected Q-load bug in Track E v1)
   - M16N8K16_DFrag: runs known-A × known-B through ldmatrix + mma
     and dumps per-lane fragments for verification against PTX ISA spec

2. docs/superpowers/specs/2026-05-21-track-e-v2-groundwork.md
   - Per-model degeneration table
   - Specific bugs identified (P_frag fixed in #353, Q-load + r[]→a[]
     ordering open)
   - Recommended v2 approach: WMMA-based (like attention_fmha_sm120.cu)
   - Test-rig requirements: Gaussian/real-checkpoint fill, multi-model
     smoke gate in CI

Track E kernel itself stays disabled. cuBLAS / FMHA handle all prefill.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant