Autoresearch for SASS-level GPU kernel optimization. Give it a compiled CUDA kernel, go to sleep, wake up to an optimized binary. No source code changes.
Inspired by @karpathy/autoresearch and autokernel. Instead of rewriting kernels in Triton, autosass optimizes the compiler's own GPU assembly (SASS) — reordering instructions, setting register cache hints, and renaming registers to reduce bank conflicts. Uses the gpuasm.com MCP API for disassembly and reassembly.
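The bank-conflict idea can be sketched with a toy model. A minimal sketch, assuming bank = register index mod 4; the actual SM86 banking rules (and the reuse-cache exception this ignores) are what program.md documents:

```python
def ffma_bank_conflicts(a, b, c, num_banks=4):
    """Count register-bank collisions among one FFMA's three source
    operands, under a simplified model where an operand's bank is its
    register index modulo num_banks.  Two sources in the same bank
    cost an extra cycle; renaming one of them breaks the conflict."""
    banks = [r % num_banks for r in (a, b, c)]
    return len(banks) - len(set(banks))
```

For example, `FFMA Rd, R0, R4, R8` puts all three sources in bank 0 under this model (two conflicts), while `FFMA Rd, R0, R1, R2` has none.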
Give autosass any compiled CUDA kernel (.cubin). It will:
- Disassemble the kernel via the gpuasm.com MCP API
- Find FFMA (fused multiply-add) blocks in the inner loop
- Reorder instructions to maximize register file cache hits
- Set reuse flags so hardware caches register values between instructions
- Rename registers via liveness analysis to break bank conflicts
- Reassemble with selective binary patching (only changed instructions are re-encoded)
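The reuse-flag step above can be sketched in a few lines. This is a minimal sketch, assuming instructions have already been parsed into (dst, a, b, c) register-name tuples; real SASS parsing and encoding are handled by the gpuasm.com API, not this code:

```python
def mark_reuse(instrs):
    """Append .reuse to an FFMA source operand when the next FFMA
    reads the same register in the same operand slot.  The hardware
    reuse cache is per-slot, so only same-slot repeats qualify.
    `instrs` is a list of (dst, a, b, c) register-name tuples, a
    simplified stand-in for disassembled SASS."""
    out = []
    for i, (dst, a, b, c) in enumerate(instrs):
        a_flag = b_flag = ""
        if i + 1 < len(instrs):
            _, next_a, next_b, _ = instrs[i + 1]
            if a == next_a:
                a_flag = ".reuse"
            if b == next_b:
                b_flag = ".reuse"
        out.append(f"FFMA {dst}, {a}{a_flag}, {b}{b_flag}, {c}")
    return out
```

A register flagged `.reuse` is read from the operand reuse cache by the next instruction instead of from the register file, which also sidesteps any bank conflict on that port.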
The agent reads program.md — the optimization playbook — which describes the SM86 register file cache model, bank conflict rules, and the MCP API. It runs optimize.py, benchmarks with bench_cubin.py, and keeps or reverts each change based on the measured runtime.
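The keep-or-revert loop is a plain greedy search. A minimal sketch with injected apply/revert/measure callbacks (the function and parameter names here are illustrative, not the actual optimize.py interface):

```python
def optimize_loop(apply_opt, revert_opt, measure, passes):
    """Greedy keep-or-revert over a list of optimization passes:
    apply each pass, benchmark, and keep it only if it beats the
    best runtime seen so far; otherwise revert it."""
    best = measure()          # baseline runtime (e.g. milliseconds)
    kept = []
    for p in passes:
        apply_opt(p)
        t = measure()
        if t < best:
            best = t          # improvement: keep the patched cubin
            kept.append(p)
        else:
            revert_opt(p)     # regression or no change: roll back
    return best, kept
```

Because each pass is benchmarked in isolation against the current best, a regression from one pass never survives into the final binary.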
Each optimization takes ~30 seconds. Benchmarking takes ~60 seconds.
Requirements: NVIDIA GPU, Python 3.10+, a compiled CUDA kernel.
```shell
# Clone
git clone https://github.com/gpuasm/autosass.git
cd autosass

# Optimize a kernel
python optimize.py --cubin kernel.cubin --kernel myKernel --out optimized.cubin

# Benchmark (patches cubin into executable, compares baseline vs optimized)
python bench_cubin.py optimized.cubin

# Dry run (no GPU needed — checks MCP connectivity + analysis)
python optimize.py --cubin kernel.cubin --kernel myKernel --dry-run
```

Spin up Claude or any coding agent in this directory:
Read program.md. Optimize the kernel in sgemm_kernel_10.cubin.
The agent will disassemble, find optimization opportunities, apply them, reassemble, and benchmark. program.md covers the microarchitecture model, the optimization strategy, and crash recovery.
- Huerta et al., "Analyzing Modern NVIDIA GPU cores," 2025
- siboehm/SGEMM_CUDA — SGEMM kernel implementations
- gpuasm.com — SASS assembler/disassembler + MCP API
- autokernel — autoresearch for GPU kernels (Triton/CUDA C++)
MIT
