Autoresearch for SASS-level GPU kernel optimization. Give it a compiled CUDA kernel, go to sleep, wake up to an optimized binary. No source code changes.
Inspired by @karpathy/autoresearch and autokernel. Instead of rewriting kernels in Triton, autosass optimizes the compiler's own GPU assembly (SASS) — reordering instructions, setting register cache hints, and renaming registers to reduce bank conflicts. Uses the gpuasm.com MCP API for disassembly and reassembly.
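The bank-conflict idea can be sketched with a toy model. A minimal sketch, assuming bank = register index mod 4; the actual SM86 banking rules (and the reuse-cache exception this ignores) are what program.md documents:

```python
def ffma_bank_conflicts(a, b, c, num_banks=4):
    """Count register-bank collisions among one FFMA's three source
    operands, under a simplified model where an operand's bank is its
    register index modulo num_banks.  Two sources in the same bank
    cost an extra cycle; renaming one of them breaks the conflict."""
    banks = [r % num_banks for r in (a, b, c)]
    return len(banks) - len(set(banks))
```

For example, `FFMA Rd, R0, R4, R8` puts all three sources in bank 0 under this model (two conflicts), while `FFMA Rd, R0, R1, R2` has none.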
Give autosass any compiled CUDA kernel (.cubin). It will:
- Disassemble the kernel via the gpuasm.com MCP API
- Find FFMA (fused multiply-add) blocks in the inner loop
- Reorder instructions to maximize register file cache hits
- Set reuse flags so hardware caches register values between instructions
- Rename registers via liveness analysis to break bank conflicts
- Reassemble with selective binary patching (only changed instructions are re-encoded)
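The reuse-flag step above can be sketched in a few lines. This is a minimal sketch, assuming instructions have already been parsed into (dst, a, b, c) register-name tuples; real SASS parsing and encoding are handled by the gpuasm.com API, not this code:

```python
def mark_reuse(instrs):
    """Append .reuse to an FFMA source operand when the next FFMA
    reads the same register in the same operand slot.  The hardware
    reuse cache is per-slot, so only same-slot repeats qualify.
    `instrs` is a list of (dst, a, b, c) register-name tuples, a
    simplified stand-in for disassembled SASS."""
    out = []
    for i, (dst, a, b, c) in enumerate(instrs):
        a_flag = b_flag = ""
        if i + 1 < len(instrs):
            _, next_a, next_b, _ = instrs[i + 1]
            if a == next_a:
                a_flag = ".reuse"
            if b == next_b:
                b_flag = ".reuse"
        out.append(f"FFMA {dst}, {a}{a_flag}, {b}{b_flag}, {c}")
    return out
```

A register flagged `.reuse` is read from the operand reuse cache by the next instruction instead of from the register file, which also sidesteps any bank conflict on that port.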
The agent reads program.md — the optimization playbook — which describes the SM86 register file cache model, bank conflict rules, and the MCP API. It runs optimize.py, benchmarks with bench_cubin.py, and keeps or reverts each change based on the measured runtime.
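The keep-or-revert loop is a plain greedy search. A minimal sketch with injected apply/revert/measure callbacks (the function and parameter names here are illustrative, not the actual optimize.py interface):

```python
def optimize_loop(apply_opt, revert_opt, measure, passes):
    """Greedy keep-or-revert over a list of optimization passes:
    apply each pass, benchmark, and keep it only if it beats the
    best runtime seen so far; otherwise revert it."""
    best = measure()          # baseline runtime (e.g. milliseconds)
    kept = []
    for p in passes:
        apply_opt(p)
        t = measure()
        if t < best:
            best = t          # improvement: keep the patched cubin
            kept.append(p)
        else:
            revert_opt(p)     # regression or no change: roll back
    return best, kept
```

Because each pass is benchmarked in isolation against the current best, a regression from one pass never survives into the final binary.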
Each optimization takes ~30 seconds. Benchmarking takes ~60 seconds.
Requirements: NVIDIA GPU, Python 3.10+, a compiled CUDA kernel.
```shell
# Clone
git clone https://github.com/gpuasm/autosass.git
cd autosass

# Optimize a kernel
python optimize.py --cubin kernel.cubin --kernel myKernel --out optimized.cubin

# Benchmark (patches cubin into executable, compares baseline vs optimized)
python bench_cubin.py optimized.cubin

# Dry run (no GPU needed — checks MCP connectivity + analysis)
python optimize.py --cubin kernel.cubin --kernel myKernel --dry-run
```

Spin up Claude or any coding agent in this directory:
Read program.md. Optimize the kernel in sgemm_kernel_10.cubin.
The agent will disassemble, find optimization opportunities, apply them, reassemble, and benchmark. program.md covers the microarchitecture model, the optimization strategy, and crash recovery.
- Huerta et al., "Analyzing Modern NVIDIA GPU cores," 2025
- siboehm/SGEMM_CUDA — SGEMM kernel implementations
- gpuasm.com — SASS assembler/disassembler + MCP API
- autokernel — autoresearch for GPU kernels (Triton/CUDA C++)
MIT
