
[v1.0] TF32 Kernel Further Optimization (CUTLASS-level) #53

@m96-chan

Description

Overview

Continue TF32 Tensor Core GEMM optimization beyond the v0.2.3 baseline (27.38 TFLOPS), using NVIDIA CUTLASS as the reference implementation.

Current Status

  • Achieved: 27.38 TFLOPS on RTX 3090 Ti (8192×8192)
  • cuBLAS reference: ~59 TFLOPS (the kernel reaches ~46% of cuBLAS; ~54% gap remaining)

Remaining Optimization Opportunities

High Priority

  • Swizzled shared memory layout (bank conflict elimination)
  • Multi-stage pipeline (4+ stages)
  • PTX ldmatrix instruction for fragment loading

Medium Priority

  • m16n8k4 instruction variant
  • Warp specialization (producer/consumer warps)
  • Dynamic shared memory (100KB on Ampere)

Future / Architecture-Specific

  • Hopper TF32x3 mode (3× throughput on H100)
  • Epilogue fusion (bias + activation)
  • Auto-tuner for tile sizes (BM/BN/BK)

Key Observations from v0.2.3

  1. Register pressure is the primary limiter - the acc[2][8][4] accumulator alone is 2×8×4 = 64 registers per thread
  2. Shared memory limits occupancy - 37KB/block → ~1 block/SM = 16.7% occupancy
  3. High variance (±1-2 TFLOPS) due to GPU boost clocks

Failed Attempts (Reference)

Attempt               Result        Cause
3-stage pipeline      -28%          50% more smem reduced occupancy
512-thread config     -17%          Changed access patterns
BM=64 smaller tiles   -28% on 8192  Reduced parallelism
Manual kk unroll      -6%           Increased register pressure
BK=8 for occupancy    -7% on 4096   Doubled K iterations

Target

  • RTX 3090 Ti: 35+ TFLOPS (cuBLAS parity)
  • A100: 60+ TFLOPS
  • H100: 90+ TFLOPS (TF32x3)
