[Feature Request] SM_120/SM_121 CollectiveBuilder specialization for tensor/token-scaled FP8 grouped GEMM (ptr-array)

# [Feature Request] SM_120/SM_121 `CollectiveBuilder` specialization for tensor/token-scaled FP8 grouped GEMM (ptr-array)

## Summary

CUTLASS 4.5 ships SM_120 grouped GEMM (`GroupProblemShape` + ptr-array layouts) for **blockwise-scaled FP8**, **NVFP4**, and **MXFP8** quantization schemes — but not for **tensor/token-scaled FP8** (scale factors applied in the epilogue, not the mainloop). This is the FP8 variant most commonly produced by Hugging Face quantization tooling (e.g. `compressed-tensors`, `llm-compressor` "FP8-Dynamic"), used by Gemma 4, Mixtral FP8, Qwen FP8, and other production MoE checkpoints.

Datacenter Blackwell (SM_100) supports this code path via `cutlass::gemm::KernelPtrArrayTmaWarpSpecialized1SmSm100` + matching `cutlass::epilogue::PtrArrayTmaWarpSpecialized1Sm` epilogue schedule, both successfully wired up in vLLM's `grouped_mm_c3x_sm100.cu`. The analogous SM_120 path does not exist yet — empirically confirmed on real NVIDIA DGX Spark (SM_121) hardware.

## What appears to exist in CUTLASS 4.5

| Variant | Status on SM_120 | Where |
|---|---|---|
| NVFP4 grouped GEMM | ✅ Working | `examples/79_blackwell_geforce_gemm/79d_blackwell_geforce_nvfp4_grouped_gemm.cu` |
| FP8 **blockwise** grouped GEMM | ✅ Working | `examples/87_blackwell_geforce_gemm_blockwise/87c_blackwell_geforce_fp8_bf16_grouped_gemm_groupwise.cu` |
| MXFP8 blockscaled grouped GEMM | ✅ Working | `test/unit/gemm/device/sm120_blockscaled_tensorop_gemm/` |
| **Tensor/token-scaled FP8 grouped GEMM** | ❌ **No working CollectiveBuilder specialization** | — |

## What appears missing

`include/cutlass/gemm/dispatch_policy.hpp` defines:

```cpp
template<int SchedulerPipelineStageCount_>
struct KernelPtrArrayTmaWarpSpecializedCooperativeSm120 : KernelPtrArrayTmaWarpSpecializedCooperative { ... };

template<int SchedulerPipelineStageCount_>
struct KernelPtrArrayTmaWarpSpecializedPingpongSm120 : KernelPtrArrayTmaWarpSpecializedPingpong { ... };
```

A GitHub code search confirms these have **zero usages anywhere in CUTLASS** (no examples, no tests, no `CollectiveBuilder<...>::CollectiveOp` specialization). Attempting to use either as the `KernelSchedule` template parameter to `cutlass::gemm::collective::CollectiveBuilder<arch::Sm120, OpClassTensorOp, float_e4m3_t, LayoutA*, ..., KernelSchedule, void>` results in the standard "Could not build a collective for given parameters" `static_assert`.

The equivalent SM_100 ptr-array kernels (`KernelPtrArrayTmaWarpSpecialized1SmSm100`, `KernelPtrArrayTmaWarpSpecialized2SmSm100`) DO have working `CollectiveBuilder` specializations and are used in real production via vLLM's `grouped_mm_c3x_sm100.cu`. The dispatch-policy types alone aren't enough — the corresponding `CollectiveBuilder` template specialization needs to be implemented for SM_120.

## Downstream impact

This is the missing piece that prevents CUTLASS MoE on **any consumer Blackwell GPU** for the most common FP8 quantization scheme:

- NVIDIA GB10 (DGX Spark) — SM_121
- NVIDIA RTX 5090 / 5080 / 5070 — SM_120
- Any future SM_12x consumer Blackwell variant

Affected vLLM models running with sub-optimal Triton MoE backend instead of native CUTLASS:
- `RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic` (Gemma 4 MoE)
- Other RedHatAI / neuralmagic / compressed-tensors FP8 MoE checkpoints
- Mistral FP8 MoE variants
- Any model quantized via `llm-compressor` with per-tensor or per-token FP8 scaling

Downstream vLLM tracking issue: https://github.com/vllm-project/vllm/issues/43507

## What "shipping this" looks like

The pattern is already established for SM_100. The equivalent SM_120 path would be:

```cpp
// In include/cutlass/gemm/collective/builders/sm120_*.inl  (new file or extension)
template <int Stages>
struct CollectiveBuilder<
    arch::Sm120, arch::OpClassTensorOp,
    float_e4m3_t, LayoutA*, AlignmentA,
    float_e4m3_t, LayoutB*, AlignmentB,
    float,                                    // accumulator
    TileShape, ClusterShape,
    gemm::collective::StageCountAutoCarveout<...>,
    KernelPtrArrayTmaWarpSpecializedCooperativeSm120<Stages>,
    void
> {
    using CollectiveOp = ...;  // analog of the SM_100 implementation
};
```

Plus a matching `cutlass::epilogue::PtrArrayTmaWarpSpecializedSm120` epilogue schedule (or whatever the consumer-Blackwell naming convention is) that `cutlass::epilogue::collective::EpilogueScheduleAuto` can dispatch to when `ArchTag = Sm120`.

Consumer Blackwell (SM_120/121) doesn't have 2-CTA clusters, so the SM_100 `2Sm` variants wouldn't have an analog — only the `1Sm`/single-CTA equivalent is needed for v1.

## Local validation environment available

I have a physical NVIDIA DGX Spark (SM_121) configured with vLLM source-build infrastructure ready to test. Once a CUTLASS release includes the missing specialization, I can:

1. Rebuild the vLLM Docker image (CUDA 13.0, `TORCH_CUDA_ARCH_LIST="12.0a;12.1a"`) with the new CUTLASS pulled in
2. Run the existing vLLM test suite for FP8 MoE on real SM_121 silicon
3. Provide numerical correctness comparisons against the current Triton MoE backend
4. Provide throughput benchmarks (vLLM's `benchmark_moe.py` or end-to-end inference)
5. Submit the vLLM-side PR that wires it up (the `grouped_mm_c3x_sm120.cu` analog of the SM_100 file)

Happy to validate any pre-release / branch builds. I expect to be one of the first non-NVIDIA users with a Spark in the wild willing to run kernel validation.

## Reproduction of the gap on real SM_121 hardware (DGX Spark, 2026-05-23)

Container: vLLM main HEAD (`4438b6e`), CUTLASS 4.5.x, CUDA 13.0.2, `TORCH_CUDA_ARCH_LIST=12.0a;12.1a;12.1+PTX`, `gencode arch=compute_120a,code=sm_120a`.

Tried, in order, as `KernelSchedule` for an analog of `vllm/csrc/libtorch_stable/quantization/w8a8/cutlass/moe/grouped_mm_c3x_sm100.cu`:

| Schedule | Error |
|---|---|
| `KernelPtrArrayTmaWarpSpecialized1SmSm120` | `namespace "cutlass::gemm" has no member "..."` (symbol doesn't exist, was pre-release name) |
| `KernelScheduleSm120Blockwise` | `Could not build a collective for given parameters` — expects blockwise-scaled `tuple<LayoutA*, LayoutSFA*>` layout; our model has plain `LayoutA*` |
| `KernelPtrArrayTmaWarpSpecializedCooperativeSm120<2>` | `Could not build a collective for given parameters` — no `CollectiveBuilder` specialization for this combination on SM_120 |

The third failure is the actionable one: the dispatch policy is defined but no `CollectiveBuilder<...>::CollectiveOp` exists to consume it.

---

**AI assistance disclosure**: This issue was researched and drafted with AI (Claude) assistance over a multi-hour investigation that included reading CUTLASS dispatch policy headers, running four physical compile iterations on real NVIDIA DGX Spark hardware (SM_121), and reading vLLM's grouped GEMM source. All build commands and compile errors are from actual runs on SM_121 silicon. The human submitter (Tyler Merritt) reviewed the diagnosis end-to-end and is the validator for any test builds.


Schedule	Error
`KernelPtrArrayTmaWarpSpecialized1SmSm120`	`namespace "cutlass::gemm" has no member "..."` (symbol doesn't exist, was pre-release name)
`KernelScheduleSm120Blockwise`	`Could not build a collective for given parameters` — expects blockwise-scaled `tuple<LayoutA, LayoutSFA>` layout; our model has plain `LayoutA*`
`KernelPtrArrayTmaWarpSpecializedCooperativeSm120<2>`	`Could not build a collective for given parameters` — no `CollectiveBuilder` specialization for this combination on SM_120

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature Request] SM_120/SM_121 CollectiveBuilder specialization for tensor/token-scaled FP8 grouped GEMM (ptr-array) #3263

[Feature Request] SM_120/SM_121 `CollectiveBuilder` specialization for tensor/token-scaled FP8 grouped GEMM (ptr-array)

Summary

What appears to exist in CUTLASS 4.5

What appears missing

Downstream impact

What "shipping this" looks like

Local validation environment available

Reproduction of the gap on real SM_121 hardware (DGX Spark, 2026-05-23)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Variant	Status on SM_120	Where
NVFP4 grouped GEMM	✅ Working	`examples/79_blackwell_geforce_gemm/79d_blackwell_geforce_nvfp4_grouped_gemm.cu`
FP8 blockwise grouped GEMM	✅ Working	`examples/87_blackwell_geforce_gemm_blockwise/87c_blackwell_geforce_fp8_bf16_grouped_gemm_groupwise.cu`
MXFP8 blockscaled grouped GEMM	✅ Working	`test/unit/gemm/device/sm120_blockscaled_tensorop_gemm/`
Tensor/token-scaled FP8 grouped GEMM	❌ No working CollectiveBuilder specialization	—

[Feature Request] SM_120/SM_121 CollectiveBuilder specialization for tensor/token-scaled FP8 grouped GEMM (ptr-array) #3263

Description

[Feature Request] SM_120/SM_121 CollectiveBuilder specialization for tensor/token-scaled FP8 grouped GEMM (ptr-array)

Summary

What appears to exist in CUTLASS 4.5

What appears missing

Downstream impact

What "shipping this" looks like

Local validation environment available

Reproduction of the gap on real SM_121 hardware (DGX Spark, 2026-05-23)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

[Feature Request] SM_120/SM_121 `CollectiveBuilder` specialization for tensor/token-scaled FP8 grouped GEMM (ptr-array)