Skip to content

[Feature Request] SM_120/SM_121 CollectiveBuilder specialization for tensor/token-scaled FP8 grouped GEMM (ptr-array) #3263

@tgmerritt

Description

@tgmerritt

[Feature Request] SM_120/SM_121 CollectiveBuilder specialization for tensor/token-scaled FP8 grouped GEMM (ptr-array)

Summary

CUTLASS 4.5 ships SM_120 grouped GEMM (GroupProblemShape + ptr-array layouts) for blockwise-scaled FP8, NVFP4, and MXFP8 quantization schemes — but not for tensor/token-scaled FP8 (scale factors applied in the epilogue, not the mainloop). This is the FP8 variant most commonly produced by Hugging Face quantization tooling (e.g. compressed-tensors, llm-compressor "FP8-Dynamic"), used by Gemma 4, Mixtral FP8, Qwen FP8, and other production MoE checkpoints.

Datacenter Blackwell (SM_100) supports this code path via cutlass::gemm::KernelPtrArrayTmaWarpSpecialized1SmSm100 + matching cutlass::epilogue::PtrArrayTmaWarpSpecialized1Sm epilogue schedule, both successfully wired up in vLLM's grouped_mm_c3x_sm100.cu. The analogous SM_120 path does not exist yet — empirically confirmed on real NVIDIA DGX Spark (SM_121) hardware.

What appears to exist in CUTLASS 4.5

Variant Status on SM_120 Where
NVFP4 grouped GEMM ✅ Working examples/79_blackwell_geforce_gemm/79d_blackwell_geforce_nvfp4_grouped_gemm.cu
FP8 blockwise grouped GEMM ✅ Working examples/87_blackwell_geforce_gemm_blockwise/87c_blackwell_geforce_fp8_bf16_grouped_gemm_groupwise.cu
MXFP8 blockscaled grouped GEMM ✅ Working test/unit/gemm/device/sm120_blockscaled_tensorop_gemm/
Tensor/token-scaled FP8 grouped GEMM No working CollectiveBuilder specialization

What appears missing

include/cutlass/gemm/dispatch_policy.hpp defines:

template<int SchedulerPipelineStageCount_>
struct KernelPtrArrayTmaWarpSpecializedCooperativeSm120 : KernelPtrArrayTmaWarpSpecializedCooperative { ... };

template<int SchedulerPipelineStageCount_>
struct KernelPtrArrayTmaWarpSpecializedPingpongSm120 : KernelPtrArrayTmaWarpSpecializedPingpong { ... };

A GitHub code search confirms these have zero usages anywhere in CUTLASS (no examples, no tests, no CollectiveBuilder<...>::CollectiveOp specialization). Attempting to use either as the KernelSchedule template parameter to cutlass::gemm::collective::CollectiveBuilder<arch::Sm120, OpClassTensorOp, float_e4m3_t, LayoutA*, ..., KernelSchedule, void> results in the standard "Could not build a collective for given parameters" static_assert.

The equivalent SM_100 ptr-array kernels (KernelPtrArrayTmaWarpSpecialized1SmSm100, KernelPtrArrayTmaWarpSpecialized2SmSm100) DO have working CollectiveBuilder specializations and are used in real production via vLLM's grouped_mm_c3x_sm100.cu. The dispatch-policy types alone aren't enough — the corresponding CollectiveBuilder template specialization needs to be implemented for SM_120.

Downstream impact

This is the missing piece that prevents CUTLASS MoE on any consumer Blackwell GPU for the most common FP8 quantization scheme:

  • NVIDIA GB10 (DGX Spark) — SM_121
  • NVIDIA RTX 5090 / 5080 / 5070 — SM_120
  • Any future SM_12x consumer Blackwell variant

Affected vLLM models running with sub-optimal Triton MoE backend instead of native CUTLASS:

  • RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic (Gemma 4 MoE)
  • Other RedHatAI / neuralmagic / compressed-tensors FP8 MoE checkpoints
  • Mistral FP8 MoE variants
  • Any model quantized via llm-compressor with per-tensor or per-token FP8 scaling

Downstream vLLM tracking issue: vllm-project/vllm#43507

What "shipping this" looks like

The pattern is already established for SM_100. The equivalent SM_120 path would be:

// In include/cutlass/gemm/collective/builders/sm120_*.inl  (new file or extension)
template <int Stages>
struct CollectiveBuilder<
    arch::Sm120, arch::OpClassTensorOp,
    float_e4m3_t, LayoutA*, AlignmentA,
    float_e4m3_t, LayoutB*, AlignmentB,
    float,                                    // accumulator
    TileShape, ClusterShape,
    gemm::collective::StageCountAutoCarveout<...>,
    KernelPtrArrayTmaWarpSpecializedCooperativeSm120<Stages>,
    void
> {
    using CollectiveOp = ...;  // analog of the SM_100 implementation
};

Plus a matching cutlass::epilogue::PtrArrayTmaWarpSpecializedSm120 epilogue schedule (or whatever the consumer-Blackwell naming convention is) that cutlass::epilogue::collective::EpilogueScheduleAuto can dispatch to when ArchTag = Sm120.

Consumer Blackwell (SM_120/121) doesn't have 2-CTA clusters, so the SM_100 2Sm variants wouldn't have an analog — only the 1Sm/single-CTA equivalent is needed for v1.

Local validation environment available

I have a physical NVIDIA DGX Spark (SM_121) configured with vLLM source-build infrastructure ready to test. Once a CUTLASS release includes the missing specialization, I can:

  1. Rebuild the vLLM Docker image (CUDA 13.0, TORCH_CUDA_ARCH_LIST="12.0a;12.1a") with the new CUTLASS pulled in
  2. Run the existing vLLM test suite for FP8 MoE on real SM_121 silicon
  3. Provide numerical correctness comparisons against the current Triton MoE backend
  4. Provide throughput benchmarks (vLLM's benchmark_moe.py or end-to-end inference)
  5. Submit the vLLM-side PR that wires it up (the grouped_mm_c3x_sm120.cu analog of the SM_100 file)

Happy to validate any pre-release / branch builds. I expect to be one of the first non-NVIDIA users with a Spark in the wild willing to run kernel validation.

Reproduction of the gap on real SM_121 hardware (DGX Spark, 2026-05-23)

Container: vLLM main HEAD (4438b6e), CUTLASS 4.5.x, CUDA 13.0.2, TORCH_CUDA_ARCH_LIST=12.0a;12.1a;12.1+PTX, gencode arch=compute_120a,code=sm_120a.

Tried, in order, as KernelSchedule for an analog of vllm/csrc/libtorch_stable/quantization/w8a8/cutlass/moe/grouped_mm_c3x_sm100.cu:

Schedule Error
KernelPtrArrayTmaWarpSpecialized1SmSm120 namespace "cutlass::gemm" has no member "..." (symbol doesn't exist, was pre-release name)
KernelScheduleSm120Blockwise Could not build a collective for given parameters — expects blockwise-scaled tuple<LayoutA*, LayoutSFA*> layout; our model has plain LayoutA*
KernelPtrArrayTmaWarpSpecializedCooperativeSm120<2> Could not build a collective for given parameters — no CollectiveBuilder specialization for this combination on SM_120

The third failure is the actionable one: the dispatch policy is defined but no CollectiveBuilder<...>::CollectiveOp exists to consume it.


AI assistance disclosure: This issue was researched and drafted with AI (Claude) assistance over a multi-hour investigation that included reading CUTLASS dispatch policy headers, running four physical compile iterations on real NVIDIA DGX Spark hardware (SM_121), and reading vLLM's grouped GEMM source. All build commands and compile errors are from actual runs on SM_121 silicon. The human submitter (Tyler Merritt) reviewed the diagnosis end-to-end and is the validator for any test builds.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions