[Feature Request] SM_120/SM_121 CollectiveBuilder specialization for tensor/token-scaled FP8 grouped GEMM (ptr-array)
Summary
CUTLASS 4.5 ships SM_120 grouped GEMM (GroupProblemShape + ptr-array layouts) for blockwise-scaled FP8, NVFP4, and MXFP8 quantization schemes — but not for tensor/token-scaled FP8 (scale factors applied in the epilogue, not the mainloop). This is the FP8 variant most commonly produced by Hugging Face quantization tooling (e.g. compressed-tensors, llm-compressor "FP8-Dynamic"), used by Gemma 4, Mixtral FP8, Qwen FP8, and other production MoE checkpoints.
Datacenter Blackwell (SM_100) supports this code path via cutlass::gemm::KernelPtrArrayTmaWarpSpecialized1SmSm100 + matching cutlass::epilogue::PtrArrayTmaWarpSpecialized1Sm epilogue schedule, both successfully wired up in vLLM's grouped_mm_c3x_sm100.cu. The analogous SM_120 path does not exist yet — empirically confirmed on real NVIDIA DGX Spark (SM_121) hardware.
What appears to exist in CUTLASS 4.5
| Variant |
Status on SM_120 |
Where |
| NVFP4 grouped GEMM |
✅ Working |
examples/79_blackwell_geforce_gemm/79d_blackwell_geforce_nvfp4_grouped_gemm.cu |
| FP8 blockwise grouped GEMM |
✅ Working |
examples/87_blackwell_geforce_gemm_blockwise/87c_blackwell_geforce_fp8_bf16_grouped_gemm_groupwise.cu |
| MXFP8 blockscaled grouped GEMM |
✅ Working |
test/unit/gemm/device/sm120_blockscaled_tensorop_gemm/ |
| Tensor/token-scaled FP8 grouped GEMM |
❌ No working CollectiveBuilder specialization |
— |
What appears missing
include/cutlass/gemm/dispatch_policy.hpp defines:
template<int SchedulerPipelineStageCount_>
struct KernelPtrArrayTmaWarpSpecializedCooperativeSm120 : KernelPtrArrayTmaWarpSpecializedCooperative { ... };
template<int SchedulerPipelineStageCount_>
struct KernelPtrArrayTmaWarpSpecializedPingpongSm120 : KernelPtrArrayTmaWarpSpecializedPingpong { ... };
A GitHub code search confirms these have zero usages anywhere in CUTLASS (no examples, no tests, no CollectiveBuilder<...>::CollectiveOp specialization). Attempting to use either as the KernelSchedule template parameter to cutlass::gemm::collective::CollectiveBuilder<arch::Sm120, OpClassTensorOp, float_e4m3_t, LayoutA*, ..., KernelSchedule, void> results in the standard "Could not build a collective for given parameters" static_assert.
The equivalent SM_100 ptr-array kernels (KernelPtrArrayTmaWarpSpecialized1SmSm100, KernelPtrArrayTmaWarpSpecialized2SmSm100) DO have working CollectiveBuilder specializations and are used in real production via vLLM's grouped_mm_c3x_sm100.cu. The dispatch-policy types alone aren't enough — the corresponding CollectiveBuilder template specialization needs to be implemented for SM_120.
Downstream impact
This is the missing piece that prevents CUTLASS MoE on any consumer Blackwell GPU for the most common FP8 quantization scheme:
- NVIDIA GB10 (DGX Spark) — SM_121
- NVIDIA RTX 5090 / 5080 / 5070 — SM_120
- Any future SM_12x consumer Blackwell variant
Affected vLLM models running with sub-optimal Triton MoE backend instead of native CUTLASS:
RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic (Gemma 4 MoE)
- Other RedHatAI / neuralmagic / compressed-tensors FP8 MoE checkpoints
- Mistral FP8 MoE variants
- Any model quantized via
llm-compressor with per-tensor or per-token FP8 scaling
Downstream vLLM tracking issue: vllm-project/vllm#43507
What "shipping this" looks like
The pattern is already established for SM_100. The equivalent SM_120 path would be:
// In include/cutlass/gemm/collective/builders/sm120_*.inl (new file or extension)
template <int Stages>
struct CollectiveBuilder<
arch::Sm120, arch::OpClassTensorOp,
float_e4m3_t, LayoutA*, AlignmentA,
float_e4m3_t, LayoutB*, AlignmentB,
float, // accumulator
TileShape, ClusterShape,
gemm::collective::StageCountAutoCarveout<...>,
KernelPtrArrayTmaWarpSpecializedCooperativeSm120<Stages>,
void
> {
using CollectiveOp = ...; // analog of the SM_100 implementation
};
Plus a matching cutlass::epilogue::PtrArrayTmaWarpSpecializedSm120 epilogue schedule (or whatever the consumer-Blackwell naming convention is) that cutlass::epilogue::collective::EpilogueScheduleAuto can dispatch to when ArchTag = Sm120.
Consumer Blackwell (SM_120/121) doesn't have 2-CTA clusters, so the SM_100 2Sm variants wouldn't have an analog — only the 1Sm/single-CTA equivalent is needed for v1.
Local validation environment available
I have a physical NVIDIA DGX Spark (SM_121) configured with vLLM source-build infrastructure ready to test. Once a CUTLASS release includes the missing specialization, I can:
- Rebuild the vLLM Docker image (CUDA 13.0,
TORCH_CUDA_ARCH_LIST="12.0a;12.1a") with the new CUTLASS pulled in
- Run the existing vLLM test suite for FP8 MoE on real SM_121 silicon
- Provide numerical correctness comparisons against the current Triton MoE backend
- Provide throughput benchmarks (vLLM's
benchmark_moe.py or end-to-end inference)
- Submit the vLLM-side PR that wires it up (the
grouped_mm_c3x_sm120.cu analog of the SM_100 file)
Happy to validate any pre-release / branch builds. I expect to be one of the first non-NVIDIA users with a Spark in the wild willing to run kernel validation.
Reproduction of the gap on real SM_121 hardware (DGX Spark, 2026-05-23)
Container: vLLM main HEAD (4438b6e), CUTLASS 4.5.x, CUDA 13.0.2, TORCH_CUDA_ARCH_LIST=12.0a;12.1a;12.1+PTX, gencode arch=compute_120a,code=sm_120a.
Tried, in order, as KernelSchedule for an analog of vllm/csrc/libtorch_stable/quantization/w8a8/cutlass/moe/grouped_mm_c3x_sm100.cu:
| Schedule |
Error |
KernelPtrArrayTmaWarpSpecialized1SmSm120 |
namespace "cutlass::gemm" has no member "..." (symbol doesn't exist, was pre-release name) |
KernelScheduleSm120Blockwise |
Could not build a collective for given parameters — expects blockwise-scaled tuple<LayoutA*, LayoutSFA*> layout; our model has plain LayoutA* |
KernelPtrArrayTmaWarpSpecializedCooperativeSm120<2> |
Could not build a collective for given parameters — no CollectiveBuilder specialization for this combination on SM_120 |
The third failure is the actionable one: the dispatch policy is defined but no CollectiveBuilder<...>::CollectiveOp exists to consume it.
AI assistance disclosure: This issue was researched and drafted with AI (Claude) assistance over a multi-hour investigation that included reading CUTLASS dispatch policy headers, running four physical compile iterations on real NVIDIA DGX Spark hardware (SM_121), and reading vLLM's grouped GEMM source. All build commands and compile errors are from actual runs on SM_121 silicon. The human submitter (Tyler Merritt) reviewed the diagnosis end-to-end and is the validator for any test builds.
[Feature Request] SM_120/SM_121
CollectiveBuilderspecialization for tensor/token-scaled FP8 grouped GEMM (ptr-array)Summary
CUTLASS 4.5 ships SM_120 grouped GEMM (
GroupProblemShape+ ptr-array layouts) for blockwise-scaled FP8, NVFP4, and MXFP8 quantization schemes — but not for tensor/token-scaled FP8 (scale factors applied in the epilogue, not the mainloop). This is the FP8 variant most commonly produced by Hugging Face quantization tooling (e.g.compressed-tensors,llm-compressor"FP8-Dynamic"), used by Gemma 4, Mixtral FP8, Qwen FP8, and other production MoE checkpoints.Datacenter Blackwell (SM_100) supports this code path via
cutlass::gemm::KernelPtrArrayTmaWarpSpecialized1SmSm100+ matchingcutlass::epilogue::PtrArrayTmaWarpSpecialized1Smepilogue schedule, both successfully wired up in vLLM'sgrouped_mm_c3x_sm100.cu. The analogous SM_120 path does not exist yet — empirically confirmed on real NVIDIA DGX Spark (SM_121) hardware.What appears to exist in CUTLASS 4.5
examples/79_blackwell_geforce_gemm/79d_blackwell_geforce_nvfp4_grouped_gemm.cuexamples/87_blackwell_geforce_gemm_blockwise/87c_blackwell_geforce_fp8_bf16_grouped_gemm_groupwise.cutest/unit/gemm/device/sm120_blockscaled_tensorop_gemm/What appears missing
include/cutlass/gemm/dispatch_policy.hppdefines:A GitHub code search confirms these have zero usages anywhere in CUTLASS (no examples, no tests, no
CollectiveBuilder<...>::CollectiveOpspecialization). Attempting to use either as theKernelScheduletemplate parameter tocutlass::gemm::collective::CollectiveBuilder<arch::Sm120, OpClassTensorOp, float_e4m3_t, LayoutA*, ..., KernelSchedule, void>results in the standard "Could not build a collective for given parameters"static_assert.The equivalent SM_100 ptr-array kernels (
KernelPtrArrayTmaWarpSpecialized1SmSm100,KernelPtrArrayTmaWarpSpecialized2SmSm100) DO have workingCollectiveBuilderspecializations and are used in real production via vLLM'sgrouped_mm_c3x_sm100.cu. The dispatch-policy types alone aren't enough — the correspondingCollectiveBuildertemplate specialization needs to be implemented for SM_120.Downstream impact
This is the missing piece that prevents CUTLASS MoE on any consumer Blackwell GPU for the most common FP8 quantization scheme:
Affected vLLM models running with sub-optimal Triton MoE backend instead of native CUTLASS:
RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic(Gemma 4 MoE)llm-compressorwith per-tensor or per-token FP8 scalingDownstream vLLM tracking issue: vllm-project/vllm#43507
What "shipping this" looks like
The pattern is already established for SM_100. The equivalent SM_120 path would be:
Plus a matching
cutlass::epilogue::PtrArrayTmaWarpSpecializedSm120epilogue schedule (or whatever the consumer-Blackwell naming convention is) thatcutlass::epilogue::collective::EpilogueScheduleAutocan dispatch to whenArchTag = Sm120.Consumer Blackwell (SM_120/121) doesn't have 2-CTA clusters, so the SM_100
2Smvariants wouldn't have an analog — only the1Sm/single-CTA equivalent is needed for v1.Local validation environment available
I have a physical NVIDIA DGX Spark (SM_121) configured with vLLM source-build infrastructure ready to test. Once a CUTLASS release includes the missing specialization, I can:
TORCH_CUDA_ARCH_LIST="12.0a;12.1a") with the new CUTLASS pulled inbenchmark_moe.pyor end-to-end inference)grouped_mm_c3x_sm120.cuanalog of the SM_100 file)Happy to validate any pre-release / branch builds. I expect to be one of the first non-NVIDIA users with a Spark in the wild willing to run kernel validation.
Reproduction of the gap on real SM_121 hardware (DGX Spark, 2026-05-23)
Container: vLLM main HEAD (
4438b6e), CUTLASS 4.5.x, CUDA 13.0.2,TORCH_CUDA_ARCH_LIST=12.0a;12.1a;12.1+PTX,gencode arch=compute_120a,code=sm_120a.Tried, in order, as
KernelSchedulefor an analog ofvllm/csrc/libtorch_stable/quantization/w8a8/cutlass/moe/grouped_mm_c3x_sm100.cu:KernelPtrArrayTmaWarpSpecialized1SmSm120namespace "cutlass::gemm" has no member "..."(symbol doesn't exist, was pre-release name)KernelScheduleSm120BlockwiseCould not build a collective for given parameters— expects blockwise-scaledtuple<LayoutA*, LayoutSFA*>layout; our model has plainLayoutA*KernelPtrArrayTmaWarpSpecializedCooperativeSm120<2>Could not build a collective for given parameters— noCollectiveBuilderspecialization for this combination on SM_120The third failure is the actionable one: the dispatch policy is defined but no
CollectiveBuilder<...>::CollectiveOpexists to consume it.AI assistance disclosure: This issue was researched and drafted with AI (Claude) assistance over a multi-hour investigation that included reading CUTLASS dispatch policy headers, running four physical compile iterations on real NVIDIA DGX Spark hardware (SM_121), and reading vLLM's grouped GEMM source. All build commands and compile errors are from actual runs on SM_121 silicon. The human submitter (Tyler Merritt) reviewed the diagnosis end-to-end and is the validator for any test builds.