refactor(attention): archive FMHA sm_120 cluster kernel #295

Merged
github-actions[bot] merged 2 commits into main from refactor/archive-fmha-cluster on May 20, 2026

kekzl commented May 20, 2026

Summary

  • Archive `src/compute/attention_fmha_sm120_cluster.cu` (1102 LOC, defines both FP16 + FP8 cluster entry points) to `docs/archive/fmha_sm120_cluster/` with a resurrection memo
  • Remove `try_fmha_sm120_cluster_prefill` + `try_fmha_sm120_fp8_cluster_prefill` forward decls + call sites
  • Remove `attention.no_fmha_cluster` config field (parser + seed_from_env)
  • Remove the conditional CMakeLists entry
  • Delete the FP16 + FP8 cluster test suites (23 `ClusterPath` / `ClusterMatchesLegacy` gtests total: 13 in `test_attention_fmha_sm120.cu` + 10 in `test_fmha_fp8.cu`). The commit body says "8"; that was a count error in the spec text that the implementer preserved verbatim. The real number is 23.
  • Update `docs/roadmap.md` cluster section to reflect archival (follow-up commit `defffb3`)

Why

Two A/B refutation memos established that the cluster path is no faster than legacy and produces bit-identical output:

  • `fmha_tma_lever_refuted_2026_05_14.md` — TMA bulk-store on sm_120 underperforms cp.async 0.31×-0.79×
  • `m5_slice2_cluster_refuted_2026_05_17.md` — 4-model A/B sweep: noise-dominated, output bit-identical

The path has been default-off since #204. The test `FmhaSm120Test.ClusterPathNonAligned` was failing on main without anyone noticing, which confirms the code was unexercised. This PR retires the opt-in.

Side effect

The pre-existing `FmhaSm120Test.ClusterPathNonAligned` failure on main is resolved by removing the test along with the code.

Test plan

  • `make build` green
  • `make verify-fast` green (`=== verify fast: OK ===`, no failing tests)
  • Pre-push hook verify-fast green
  • Spec compliance reviewer ✅ (FP8 expansion justified — both entry points live in the same .cu)
  • Code quality reviewer ✅ Approved (with two Important issues both addressed: roadmap.md updated in `defffb3`, test count documented here)

Phase 2 Task 1 of `docs/superpowers/specs/2026-05-20-architecture-refactor-roadmap-design.md`.

🤖 Generated with Claude Code

kekzl and others added 2 commits May 20, 2026 10:24
The cluster kernel was opt-in (default off) per
m5_slice2_cluster_refuted_2026_05_17.md and produced bit-identical
output to the non-cluster kernel under all measured configurations.
fmha_tma_lever_refuted_2026_05_14.md established that the TMA-style
distributed-shared-memory pattern it relied on underperforms cp.async
on sm_120.

A side effect: the test FmhaSm120Test.ClusterPathNonAligned, which was
failing on main without being noticed, is removed along with its code.

Archive at docs/archive/fmha_sm120_cluster/ with a resurrection memo
describing how to re-introduce on a future GPU SKU.

Removes ~1100 LOC + the no_fmha_cluster runtime config field + 8
ClusterPath* gtests.

Phase 2 of docs/superpowers/specs/2026-05-20-architecture-refactor-roadmap-design.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The cluster section still said "code retained as opt-in for future
hardware where the signal might emerge." That stopped being true the
moment the kernel was archived to docs/archive/fmha_sm120_cluster/
and the attention.no_fmha_cluster config field was removed. A reader
following the original paragraph would conclude they could flip the
flag back on; the flag no longer exists.

Update the section header from "investigated and shelved" to
"archived", note the Phase 2 archival, and point at the resurrection
memo. The original investigation history is preserved.

Follow-up to a1e221a on the same branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions bot enabled auto-merge (squash) May 20, 2026 08:30
github-actions bot merged commit 6025108 into main May 20, 2026
3 checks passed
kekzl added a commit that referenced this pull request May 20, 2026
* refactor(attention): collapse prefill dispatcher gate

After archiving the cluster, mxf4nvf4, and naive paths in Tasks 1-3,
the dispatcher's prefill gate at executor_attention.cu collapses from
a multi-clause cascade to a clean two-branch switch:

  if S-matrix fits AND not non-Gemma-4 sliding:
      cuBLAS QK^T + softmax + PV
  else:
      FMHA fallback chain

The decode switch at the bottom of the function is untouched
(already flat per Phase 2 spec).
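
The two-branch prefill gate described above can be sketched in C++ as follows. This is a minimal illustration, not the code in executor_attention.cu: every identifier here (`PrefillPath`, `PrefillGateInputs`, `select_prefill_path`, and the three boolean fields) is hypothetical, and the meaning of "S-matrix fits" is assumed to be a capacity check computed elsewhere.

```cpp
// Hypothetical names throughout; not the identifiers used in executor_attention.cu.
enum class PrefillPath {
    CublasQkSoftmaxPv,   // cuBLAS QK^T + softmax + PV
    FmhaFallbackChain    // FMHA fallback chain
};

struct PrefillGateInputs {
    bool s_matrix_fits;   // assumed: result of a capacity check on the S = QK^T matrix
    bool sliding_window;  // assumed: the layer uses sliding-window attention
    bool is_gemma4;       // assumed: the model is a Gemma-4 variant
};

// cuBLAS is the default; FMHA is taken on capacity or non-Gemma-4 sliding.
constexpr PrefillPath select_prefill_path(const PrefillGateInputs& in) {
    const bool non_gemma4_sliding = in.sliding_window && !in.is_gemma4;
    if (in.s_matrix_fits && !non_gemma4_sliding) {
        return PrefillPath::CublasQkSoftmaxPv;
    }
    return PrefillPath::FmhaFallbackChain;
}

// Compile-time spot checks of the truth table.
static_assert(select_prefill_path({true, false, false}) == PrefillPath::CublasQkSoftmaxPv,
              "fits, no sliding window: cuBLAS");
static_assert(select_prefill_path({true, true, false}) == PrefillPath::FmhaFallbackChain,
              "fits, but non-Gemma-4 sliding: FMHA");
```

Expressing the gate as a pure function of three booleans keeps the policy testable without a GPU; the real dispatcher presumably derives these inputs from model and workspace state.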

executor_attention.cu: 1277 -> 1253 LOC (-24).

Also drops the now-orphaned attention.no_cublas knob from RuntimeConfig
(src/runtime/config.{h,cpp}) and imp.conf.example. After Tasks 1-4 the
flag had no remaining readers in the engine; the prefill gate that used
it is now a hard policy (cuBLAS default, FMHA fallback on capacity or
non-Gemma-4 sliding).

Phase 2 of docs/superpowers/specs/2026-05-20-architecture-refactor-roadmap-design.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(arch): close Phase 2 of refactor roadmap

Phase 2 (decluttering the attention dispatcher) is done. PRs that landed:
- #295 archive FMHA sm_120 cluster (~1100 LOC + 23 cluster gtests)
- #296 archive FMHA mxf4nvf4 + remove attention.fmha_blockscale orphan
- #297 archive attention_naive + inline parity ref into test
- #298 collapse prefill gate to 2-branch switch + remove attention.no_cublas

Closeout updates:
1. docs/superpowers/specs/...-roadmap-design.md — Phase 2 status line
   with PR list + deferred soft PRs (5/6/7).
2. docs/architecture.md — refresh "Attention dispatcher" section to
   show the new two-branch gate, drop the stale 4-clause snippet.
3. docs/architecture.dot — same in the diagram source (drop ad_naive
   node, rename cluster label to "cuBLAS default / FMHA fallback",
   simplify ad_gate label).
4. docs/architecture.svg + .png — regenerated from the new .dot via
   `docker run --rm -v $(pwd)/docs:/d nshine/dot dot -T{svg,png} ...`.

Phase 3 (cleaning up pre-dequant + the quant zoo) may now begin. A new
writing-plans output is required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>