cute-kernel-lab

This repo is a dedicated kernel-research lab: a stable harness around deeply case-local workspaces.

The point is not just to mutate one kernel file. The point is to let an agent or engineer run a full research program inside one case without polluting the rest of the repo:

stable benchmark harness
stable logging, plots, and webhook flow
per-case serving glue
per-case operator families
per-case profiler and parity helpers
per-case tests, notes, and prior-art docs

Development model

Every optimization target is a case:

one model snapshot
one hardware target
one primary objective
one fixed benchmark suite
one mutable workspace
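
As a rough illustration of the list above, a case can be pictured as a small immutable record. This is only a sketch: `CaseSpec` and its field names are hypothetical and are not the schema of case.yaml, and the example values are borrowed from the stamping example elsewhere in this README.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class CaseSpec:
    """Hypothetical in-memory view of one optimization case."""

    model_snapshot: str        # one model snapshot, e.g. a local model path
    hardware_target: str       # one hardware target
    primary_objective: str     # the single metric the case optimizes
    benchmark_suite: str       # fixed for the lifetime of the case
    workspace: PurePosixPath   # the only part that is expected to change


# Illustrative values only; the objective name is inferred from the case slug.
case = CaseSpec(
    model_snapshot="models/my-model",
    hardware_target="NVIDIA RTX PRO 6000 Blackwell Workstation Edition 96GB",
    primary_objective="single_stream_tok_s",
    benchmark_suite="benchmarks/",
    workspace=PurePosixPath(
        "cases/my_model/rtx_pro_6000_single_stream_tok_s/workspace"
    ),
)
```

Freezing the record mirrors the intent that everything except the workspace contents stays fixed while a case is being optimized.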

The framework under src/cute_kernel_lab/** stays boring and stable. The case workspace is where the real changes happen.

That gives you three clean layers:

src/cute_kernel_lab/** Stable platform code: serving shim, benchmark runner, logging, plotting, notifications.
cases/<model>/<case>/** One optimization target: case config, benchmark definitions, run history, program instructions.
cases/<model>/<case>/workspace/** A self-contained kernel project: runtime package, operator families, bench helpers, tests, docs, scripts.
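
One way to make the three layers concrete is a small path classifier, for example to flag a change that touches platform code during a case-local experiment. This is a sketch under the directory conventions described above; `layer_of` is a hypothetical helper, not part of the repo.

```python
from pathlib import PurePosixPath


def layer_of(path: str) -> str:
    """Return which layer a repo-relative path belongs to."""
    parts = PurePosixPath(path).parts
    if parts[:2] == ("src", "cute_kernel_lab"):
        return "platform"
    # cases/<model>/<case>/... needs at least three leading components.
    if len(parts) >= 3 and parts[0] == "cases":
        if len(parts) >= 4 and parts[3] == "workspace":
            return "workspace"
        return "case"
    return "other"


print(layer_of("cases/my_model/my_case/workspace/ops/cute/kernel.py"))
# prints workspace
```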

Repository layout

cute-kernel-lab/
├── .codex/skills/                 # portable Codex skills
├── docs/                          # repo-local lab guidance
├── templates/                     # copyable case starters
├── src/cute_kernel_lab/
│   ├── api/                       # OpenAI-compatible shim
│   ├── bench/                     # benchmark runner + score flattening
│   ├── optimize/                  # logging, plotting, Discord notifier
│   └── serving/                   # backend interfaces + mock backend
├── cases/
│   └── <model>/<case>/
│       ├── benchmarks/            # fixed suites for that target
│       ├── workspace/
│       │   ├── backend/           # serving glue, env defaults, model loader
│       │   ├── ops/               # operator families: cute, triton, ptx, cuda
│       │   ├── bench/             # parity and profiling helpers
│       │   ├── tests/             # workspace-local correctness checks
│       │   ├── docs/              # running notes, design docs, prior art
│       │   ├── scripts/           # case-local build/profile wrappers
│       │   ├── serve_custom_backend.py
│       │   ├── launch_transformers_server.py
│       │   └── kernel_manifest.yaml
│       ├── runs/
│       ├── case.yaml
│       └── program.md
├── scripts/
│   ├── new_kernel_case.sh
│   ├── new_kernel_case.py
│   ├── serve_case.py
│   ├── evaluate_case.py
│   └── run-kernel-hillclimb.sh
└── .env.example

Why this stays clean

The stable harness still owns:

starting servers
running benchmarks
scoring
logging history
plotting best-so-far curves
sending webhook updates
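
For reference, a best-so-far curve is just the running best of the scored runs in order. A minimal sketch, assuming higher scores are better (the harness's own plotting code may differ):

```python
from itertools import accumulate


def best_so_far(scores):
    """Running maximum of a score series, assuming higher is better."""
    return list(accumulate(scores, max))


print(best_so_far([101.0, 99.5, 104.2, 103.0, 107.8]))
# prints [101.0, 101.0, 104.2, 104.2, 107.8]
```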

Debug webhooks through the repo config loader rather than by inspecting raw shell environment variables:

uv run python scripts/check_webhook_status.py

Every evaluate_case.py record now includes:

webhook_enabled_by_config
webhook_disabled_by_flag
webhook_expected
webhook_attempted
webhook_sent
webhook_status
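
The field names suggest how the six values relate to each other. The sketch below is one plausible reading, not the actual evaluate_case.py logic: it assumes that `webhook_expected` means "enabled by config and not disabled by flag", and the status strings and the `send` callable are hypothetical.

```python
def webhook_record(enabled_by_config, disabled_by_flag, send):
    """Build the six webhook fields for one evaluation record.

    `send` is a hypothetical zero-argument callable that posts the update
    and raises on failure.
    """
    expected = enabled_by_config and not disabled_by_flag
    attempted = False
    sent = False
    status = "skipped"  # assumed label; the real status values may differ
    if expected:
        attempted = True
        try:
            send()
            sent = True
            status = "ok"
        except Exception as exc:  # keep the evaluation alive if delivery fails
            status = f"error: {exc}"
    return {
        "webhook_enabled_by_config": enabled_by_config,
        "webhook_disabled_by_flag": disabled_by_flag,
        "webhook_expected": expected,
        "webhook_attempted": attempted,
        "webhook_sent": sent,
        "webhook_status": status,
    }
```

Under this reading, a record with `webhook_expected` true and `webhook_sent` false points at a delivery problem, while `webhook_expected` false points at configuration or a flag.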

The case workspace owns:

runtime policy and load path
operator-family code
case-local build helpers
parity and profiler scripts
running notes and design docs

That lets you be much more thorough inside a case without turning the repo itself into a pile of one-off experiment files.

Start a new hillclimb series

Stamp a new case from the template instead of building a workspace by hand:

./scripts/new_kernel_case.sh \
  --model-slug my_model \
  --case-slug rtx_pro_6000_single_stream_tok_s \
  --model-name "org/model" \
  --model-path "models/my-model" \
  --hardware "NVIDIA RTX PRO 6000 Blackwell Workstation Edition 96GB"

The stamped case now includes a deeper workspace by default:

workspace/backend/: serving glue, env defaults, model loading, hook installation, parity helpers
workspace/ops/: separate surfaces for cute, triton, ptx, and raw cuda
workspace/bench/: local parity and one-shot profiling helpers
workspace/tests/: workspace-local correctness checks
workspace/docs/: notes, development plan, prior art
workspace/scripts/: build, parity, and profile wrappers
workspace/scripts/fetch_upstream_refs.sh: optional read-only upstream example fetcher
workspace/docs/tri_dao_examples.md: curated file-level example map from FlashAttention, Mamba, QuACK, SonicMoE, and related Dao-AILab repos

Default starter priorities:

do not assume the answer is CuTe
do not default to CUTLASS C++
keep CuTe, Triton, PTX, raw CUDA, and runtime-policy ideas all available in the same case

Recommended start sequence:

1. Stamp the case.
2. Pick a fresh run tag:

export CUTE_KERNEL_LAB_RUN_TAG=kernel_hillclimb_apr06

3. Verify the model snapshot exists at the stamped model_path.
4. Run the stamped setup checks:

cd cases/my_model/rtx_pro_6000_single_stream_tok_s
./workspace/scripts/check_setup.sh --build --emit-ptx

5. Baseline the copied case:

cd cases/my_model/rtx_pro_6000_single_stream_tok_s
CUTE_KERNEL_LAB_RUN_TAG=kernel_hillclimb_apr06 ./workspace/run_hillclimb.sh "baseline"

6. Keep early mutations env-only.
7. For a deep kernel pass, fetch the curated upstream examples:

./workspace/scripts/fetch_upstream_refs.sh --pack tri-dao

8. Run a prompt-level parity probe before every replay, streamer, or mutating-buffer benchmark.
9. Only promote a branch into checked-in defaults after the source-default path survives its own benchmark.
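
If you prefer to generate the run tag rather than type it, something like the following would produce tags in the same shape as the example above. This is a sketch: `make_run_tag` is a hypothetical helper, and the `<prefix>_<mon><dd>` pattern is inferred from the single example tag, not from a documented convention.

```python
from datetime import date

# Fixed table so the tag does not depend on the process locale.
_MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
           "jul", "aug", "sep", "oct", "nov", "dec")


def make_run_tag(prefix="kernel_hillclimb", today=None):
    """Return a tag such as kernel_hillclimb_apr06 for the given date."""
    today = today or date.today()
    return f"{prefix}_{_MONTHS[today.month - 1]}{today.day:02d}"


print(make_run_tag(today=date(2025, 4, 6)))
# prints kernel_hillclimb_apr06
```

Note that a date-only tag is not unique across runs started on the same day, so add a suffix if you need more than one series per day.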

Deep hillclimb rhythm

Do not let a long run collapse into endless tiny sweeps.

The intended rhythm for a serious case is:

baseline the case
profile the real harness
do a short prior-art pass
write down the next structural candidate set
spend the next scored block on one family
re-ablate older thin-wrapper or cache wins after the deeper boundary lands

Good signs that it is time to escalate:

several near-ties from the same thin wrapper family
thread-count or graph-step sweeps moving less than normal run variance
top profiler buckets are not changing
an older default-on branch has not been rechecked since a deeper boundary landed
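
The "less than normal run variance" sign can be checked mechanically. A minimal sketch, assuming you have a handful of repeated baseline scores to estimate noise from; the function name and the two-standard-deviation threshold are illustrative choices, not repo policy:

```python
from statistics import stdev


def sweep_within_noise(sweep_scores, baseline_repeats, k=2.0):
    """True when the sweep's total spread is within k baseline std-devs."""
    if len(baseline_repeats) < 2:
        raise ValueError("need at least two baseline repeats to estimate variance")
    noise = k * stdev(baseline_repeats)
    spread = max(sweep_scores) - min(sweep_scores)
    return spread <= noise


baseline = [100.0, 101.0, 99.0, 100.5]               # same config, repeated
print(sweep_within_noise([100.2, 100.9, 100.4], baseline))
# prints True
```

A result of True means the sweep is not distinguishing its settings from noise, which is a reason to stop sweeping that knob and move to a deeper target.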

Good deeper targets:

attention-entry prep such as qkv -> rope -> cache write
MLP super-boundaries instead of isolated linears
allocation / output-reuse / scratch-arena cleanup
graph-safe state ownership and replay boundaries
cache layout and batch-only routing in throughput cases

In throughput cases with a separate batch-1 guardrail, it is valid to route an experiment only through the high-batch lane when that lane is the real target metric and the guardrail stays healthy.
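
The acceptance rule in that last paragraph can be written down as a two-condition check. This is a sketch with hypothetical names; how the guardrail floor is chosen (for example, a fraction of the batch-1 baseline) is left to the case.

```python
def accept_high_batch_candidate(target_new, target_best,
                                guardrail_new, guardrail_floor):
    """Accept only if the high-batch target improves and batch-1 stays healthy.

    All scores are assumed to be higher-is-better throughput numbers.
    """
    improves_target = target_new > target_best
    guardrail_healthy = guardrail_new >= guardrail_floor
    return improves_target and guardrail_healthy
```

For example, a candidate that lifts the high-batch score from 900 to 950 while batch-1 stays at 48 against a floor of 47 would be accepted; the same gain with batch-1 at 45 would not.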

Workspace blueprint

Read workspace-blueprint.md before starting a new case. It is the repo-level reference for:

how to split a case workspace
which concerns belong in backend/ vs ops/ vs bench/
when to use CuTe, Triton, PTX, raw CUDA, or runtime policy
how to keep the case autonomous without polluting the framework

What the scaffold gives you

OpenAI-compatible FastAPI serving shim
benchmark runner and score flattening
run history and plots
Discord/webhook notifier
generic baseline launcher
copyable deep workspace structure
PTX and raw-CUDA starter surfaces
case-local parity and profiling helpers
case-local setup checks
case-local notes and prior-art docs
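
"Score flattening" here presumably means turning nested benchmark results into a flat mapping that is easy to log and plot. The sketch below shows one common way to do that with dotted keys; it is an assumption about the idea, not the runner's actual implementation, and the metric names are made up.

```python
def flatten_scores(nested, prefix="", sep="."):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_scores(value, name, sep))
        else:
            flat[name] = value
    return flat


print(flatten_scores({"decode": {"tok_s": 123.4}, "ttft_ms": 41.0}))
# prints {'decode.tok_s': 123.4, 'ttft_ms': 41.0}
```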

Benchmark surfaces now include two clean patterns:

OpenAI-compatible HTTP benchmarks for serving-facing cases
shell_json benchmarks for cases that should drive an external harness or repo-local CLI directly and return metrics as JSON

Use shell_json when the real benchmark already exists as a stable script and faking the workload through an OpenAI server would distort the target.
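
Conceptually, a shell_json benchmark runs a command and reads metrics back as JSON. A minimal sketch of that contract, assuming the command prints a single JSON object on stdout; `run_shell_json` and the stand-in command are hypothetical, and the real benchmark definition format lives in the case's benchmarks directory.

```python
import json
import subprocess
import sys


def run_shell_json(cmd, timeout_s=600):
    """Run an external benchmark command and parse its stdout as JSON metrics."""
    proc = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout_s, check=True
    )
    metrics = json.loads(proc.stdout)
    if not isinstance(metrics, dict):
        raise ValueError("expected a JSON object of metrics on stdout")
    return metrics


# Stand-in for a real harness script: prints one JSON object and exits.
fake_harness = [
    sys.executable, "-c",
    "import json; print(json.dumps({'tok_s': 123.4}))",
]
print(run_shell_json(fake_harness))
# prints {'tok_s': 123.4}
```

With `check=True`, a non-zero exit from the harness raises instead of being scored, which keeps a crashed run from looking like a slow one.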

GPU selection

This repo is pinned to the NVIDIA RTX PRO 6000 Blackwell Workstation Edition.

Use one of these before any GPU work:

. scripts/selected-gpu.sh

or:

scripts/with-selected-gpu.sh uv run python scripts/evaluate_case.py --case ...

Both set CUDA_VISIBLE_DEVICES to the repo-owned GPU UUID so serving, benchmarking, profiling, and compile steps stay on the same card.
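
A case-local script can guard against running on the wrong card by checking the variable before doing any GPU work. This is a sketch: `require_pinned_gpu` is a hypothetical helper, and it relies only on the fact that CUDA accepts `GPU-<uuid>` values in CUDA_VISIBLE_DEVICES.

```python
import os


def require_pinned_gpu(env=None):
    """Return the pinned GPU UUID, or raise if the process is not pinned."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES", "")
    # Reject unset values, bare indices such as "0", and multi-GPU lists.
    if not value.startswith("GPU-") or "," in value:
        raise RuntimeError(
            "CUDA_VISIBLE_DEVICES is not pinned to a single GPU UUID; "
            "source scripts/selected-gpu.sh first"
        )
    return value
```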

Python environment

This repo is uv-first:

uv sync creates or updates .venv
uv run ... executes inside that environment
.python-version pins the interpreter

Keep the shared environment case-agnostic. Put case-specific setup and operator notes in the case workspace, not in the repo root.

Codex workflow

If you use Codex in this repo:

read AGENTS.md first
use the copied kernel skill at SKILL.md
keep repo-specific observations in docs/
stamp new cases from templates/kernel_hillclimb_case/
treat the workspace as a real operator project, not just one mutable kernel file
