feat(tf32): add TF32 TensorCore GEMM kernel achieving 27 TFLOPS by m96-chan · Pull Request #45 · m96-chan/PyGPUkit

m96-chan · 2025-12-14T04:38:58Z

Summary

Add TF32 TensorCore GEMM kernel using PTX mma.sync.aligned.m16n8k8 for Ampere GPUs (SM80+)
Implement cp.async double-buffering pipeline for efficient global→shared memory transfers
Optimize with A fragment hoisting technique for improved register utilization
Document verified PTX fragment mappings in CLAUDE.md for future development

Benchmark Results (RTX 3090 Ti)

Matrix Size	PyGPUkit TF32	cuBLAS TF32	Efficiency
2048×2048	18.29 TFLOPS	28.02 TFLOPS	65%
4096×4096	25.71 TFLOPS	34.99 TFLOPS	73%
8192×8192	27.38 TFLOPS	35.75 TFLOPS	77%

Peak: 27.38 TFLOPS (77% of cuBLAS performance)

Correctness

Relative error: ~3-5% (expected for TF32 19-bit mantissa precision)
All matrix sizes validated against FP32 reference

Key Files Changed

native/ops/matmul_f32_tf32.cuh - TF32 TensorCore kernel implementation
native/ops/basic.cu - Dispatch logic for TF32 path
tests/test_tf32_tensorcore.py - Comprehensive test suite
CLAUDE.md - PTX fragment mapping documentation

Test plan

Correctness tests pass for all matrix sizes (256³ to 8192³)
Benchmark shows 27+ TFLOPS on RTX 3090 Ti
FP32 path unaffected (regression test)
CI tests on GPU runner

🤖 Generated with Claude Code

Initial implementation of TF32 TensorCore GEMM using WMMA API. Current status: Dispatcher works but kernel has correctness bug. Added: - native/ops/matmul_f32_tf32.cuh - TF32 WMMA kernel - tests/test_tf32_tensorcore.py - TDD tests - benchmark_tf32.py - Performance benchmark Known issue: - Relative error ~1.38 (138% off) - store_matrix_sync layout bug 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Root cause: WMMA store_matrix_sync with mem_row_major requires leading dimension (N) to be a multiple of 8. When N % 8 != 0 (e.g., 129, 131, 133, 135), direct store to global memory produced incorrect results. Fix: Add n_aligned check to both TF32 kernels: - Fast path: only used when N % 8 == 0 - Tail path: store to shared memory (stride 16), then copy to global Results: - All 150 correctness tests pass - N=129, 257, 513, 1921, 4096: error < 1e-3 (OK) - Performance: 13-18 TFLOPS (optimization pending) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Added mandatory workflow for kernel development: - Always commit after validation/benchmark regardless of results - Include benchmark results in commit message - Preserve performance history for rollback 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Benchmark results (RTX 3090 Ti): - 2048x2048: 8.50 TFLOPS - 4096x4096: 13.70 TFLOPS - 8192x8192: 16.55 TFLOPS Correctness: FAIL (rel_err ~10-40%) Known issue: Pipeline prefetch overwrites tile k+1 before it's computed. Need to fix: load k+2 into curr stage, not next stage. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Benchmark results (RTX 3090 Ti): - 2048x2048: 8.37 TFLOPS - 4096x4096: 13.37 TFLOPS - 8192x8192: 16.62 TFLOPS Correctness: FAIL (10-50% relative error) Changes: - BK=32 (increased from 16) - smB[BK][BN] layout (not transposed) - Fixed store_matrix_sync type cast for N (unsigned int) - Fixed 2D array pointer for tmp epilogue Known issues: - Correctness bug: b_frag loads expect col_major but smB is not transposed - Using 70KB smem (may reduce occupancy) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Benchmark results (RTX 3090 Ti): - 2048x2048: 8.31 TFLOPS - 4096x4096: 13.40 TFLOPS - 8192x8192: 16.55 TFLOPS Correctness: FAIL (10-50% relative error) Resources: - 32KB smem (good for 2 blocks/SM) - 128 registers Changes from user rewrite: - BK=16 (reduced from 32) - Bs[BN][BK] layout for col_major WMMA - Fixed fragment types to wmma::precision::tf32 - Simplified prologue/epilogue Known issue: Correctness bug remains 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Benchmark results (RTX 3090 Ti): - 2048x2048: 8.38 TFLOPS - 4096x4096: 14.08 TFLOPS - 8192x8192: 16.59 TFLOPS Correctness: FAIL (11-52% relative error) Resources: - 51KB smem - 128 registers - 0 bytes stack, 0 spills 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Benchmark results (RTX 3090 Ti): - 2048x2048: 8.64 TFLOPS - 4096x4096: 13.12 TFLOPS - 8192x8192: 16.57 TFLOPS Correctness: FAIL (10-52% relative error) Resources: - 40KB smem - 128 registers - 1296 bytes stack, 8/12 bytes spill 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Kernel specifications: - BM=128, BN=128, BK=16 - 38KB smem (< 40KB target), 128 regs, 0 spills - Both A and B loaded via cp.async (no scatter stores) - B stored row-major K×N, row_major fragments - HMMA.1684.F32.TF32 instructions confirmed via cuobjdump - Correctness: PASS (normalized error ~8e-04) - Performance: ~6 TFLOPS (under investigation) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- BK=32, extern shared memory - Simplified cp.async pipeline - load_B uses float4 scatter-store (not cp.async) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Validation: ALL FAIL (0% pass rate) Benchmark: Invalid (902 TFLOPS - kernel not executing correctly) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Benchmark: - 1024: 7.97 TFLOPS - 2048: 18.54 TFLOPS - 4096: 27.98 TFLOPS - 8192: 32.53 TFLOPS Validation: FAIL (race condition in pipeline) - pct<1% error: 1-7% (should be >99%) Known issue: Prefetch into 'next' stage overwrites tile k+1 before it's computed in iteration k+1. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Benchmark: - 1024: 9.89 TFLOPS - 2048: 28.97 TFLOPS - 4096: 44.64 TFLOPS (peak) - 8192: 40.22 TFLOPS Validation: FAIL (98% have >=10% error) Determinism: FAIL (max diff 25-28 between runs) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Key findings: - WMMA API with row_major A + row_major B: PASS - WMMA API with row_major A + col_major B: FAIL (memory layout mismatch) - PTX mma.sync mapping still needs investigation Test results (WMMA row_row, M=16, N=16): - K=8: max_err=0.0055, rel_err=0.05% PASS - K=16: max_err=0.0089, rel_err=0.07% PASS - K=32: max_err=0.0094, rel_err=0.06% PASS - K=64: max_err=0.0205, rel_err=0.10% PASS - K=128: max_err=0.0247, rel_err=0.08% PASS - K=256: max_err=0.0373, rel_err=0.08% PASS Next: Use debug_dump_fragments to understand WMMA's actual fragment layout, then fix PTX mma.sync version. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Measured actual WMMA fragment layout using dump_fragments.cu: A fragment (16x8, row_major): Thread t: a_row = t/4, a_col = t%4 a[0] = A[a_row][a_col] a[1] = A[a_row+8][a_col] a[2] = A[a_row][a_col+4] a[3] = A[a_row+8][a_col+4] B fragment (8x16, row_major): Thread t: b_row = t%4, b_col = t/4 b[0] = B[b_row][b_col] b[1] = B[b_row+4][b_col] b[2] = B[b_row][b_col+8] b[3] = B[b_row+4][b_col+8] Key insight: PTX m16n8k8 uses only the left half (cols 0-7) of WMMA's B/C fragments. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

- Fixed C fragment output mapping in both single-tile and full kernels - Key discovery: C fragment uses (t%4)*2 column indexing, not t%4 like A - All correctness tests now pass with ~0.08% relative error (TF32 precision) - Added dump_c_fragment.cu for verifying C fragment layout C fragment mapping (verified with dump_c_fragment.cu): c_row = t / 4 (0-7) c_col = (t % 4) * 2 (0, 2, 4, 6) c[0] -> C[c_row][c_col] c[1] -> C[c_row][c_col + 1] c[2] -> C[c_row + 8][c_col] c[3] -> C[c_row + 8][c_col + 1] Test results: - 256³: rel_err = 8.69e-04 PASS - 1024³: rel_err = 7.99e-04 PASS - 4096³: rel_err = 7.91e-04 PASS - Deterministic 100 iterations: PASS Performance: 11-18 TFLOPS (optimization pending) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Fix double-buffering bug: prefetch into OTHER stage (next), not same stage - Previous 44 TFLOPS kernel only computed half the matrix (WARP_TILES_N=4) - Correct kernel with WARP_TILES_N=8 achieves 27 TFLOPS on 8192x8192 - Document PTX mma.sync fragment mapping in CLAUDE.md - Document correct cp.async pipeline pattern in CLAUDE.md Performance (RTX 3090 Ti): - 4096x4096: 19.5 TFLOPS - 8192x8192: 27.5 TFLOPS 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add benchmark comparison: NumPy vs PyTorch vs PyGPUkit - PyTorch numbers are estimates (actual benchmarks planned for v0.2.4) - Update roadmap: v0.2.3 released, v0.2.4 for actual benchmarks 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- cuBLAS FP32: ~21 TFLOPS (PyGPUkit: 18 = 86%) - cuBLAS TF32: ~59 TFLOPS (PyGPUkit: 27 = 46%) - Source: NVIDIA developer forum benchmark 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Optimizations applied: - Hoist A fragment loads outside wn loop (saves 8x redundant smem loads) - Remove branch from cp_async_wait_0() (unconditional wait) - Remove branch from prefetch code (unconditional prefetch) - Clean up comments and simplify code Performance improvement (RTX 3090 Ti): - 4096x4096: 19.13 → 20.48 TFLOPS (+7.1%) - 8192x8192: 27.53 → 28.56 TFLOPS (+3.7%) All correctness tests pass with TF32 tolerance. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Document all optimization attempts for the TF32 TensorCore GEMM kernel: - 3 successful optimizations (+1.35 TFLOPS total) - 8 failed attempts with analysis - Technical observations and remaining opportunities 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

m96-chan and others added 23 commits December 13, 2025 16:20

wip(tf32): revert to simplified kernel structure

0b1345e

- BK=32, extern shared memory - Simplified cp.async pipeline - load_B uses float4 scatter-store (not cp.async) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

wip(tf32): add launch function (correctness broken)

a2a70dd

Validation: ALL FAIL (0% pass rate) Benchmark: Invalid (902 TFLOPS - kernel not executing correctly) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

docs: remove optimization summary md (moving to Issue #41)

ea3700f

fix(lint): remove extraneous f-string prefixes

a1c8f3c

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

m96-chan merged commit f47253b into main Dec 14, 2025
13 checks passed

m96-chan mentioned this pull request Dec 14, 2025

PyGPUkit v0.2.3 — TF32 TensorCore GEMM Implementation (22–30 TFLOPS Target) #41

Closed

m96-chan deleted the feature/v0.2.3-tf32-tensorcore branch December 26, 2025 09:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(tf32): add TF32 TensorCore GEMM kernel achieving 27 TFLOPS#45

feat(tf32): add TF32 TensorCore GEMM kernel achieving 27 TFLOPS#45
m96-chan merged 23 commits intomainfrom
feature/v0.2.3-tf32-tensorcore

m96-chan commented Dec 14, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

m96-chan commented Dec 14, 2025

Summary

Benchmark Results (RTX 3090 Ti)

Correctness

Key Files Changed

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant