This repo is a dedicated kernel-research lab: a stable harness around deeply case-local workspaces.
The point is not just to mutate one kernel file. The point is to let an agent or engineer run a full research program inside one case without polluting the rest of the repo:
- stable benchmark harness
- stable logging, plots, and webhook flow
- per-case serving glue
- per-case operator families
- per-case profiler and parity helpers
- per-case tests, notes, and prior-art docs
Every optimization target is a case:
- one model snapshot
- one hardware target
- one primary objective
- one fixed benchmark suite
- one mutable workspace
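As a minimal sketch, the five properties of a case can be written down as a small immutable record. The `KernelCase` class, its field names, and the `case_root` helper are hypothetical and only for illustration; the real schema lives in each case's `case.yaml` and may differ. The example values are taken from the stamping command in this README.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class KernelCase:
    """Hypothetical in-memory view of one case; not the real case.yaml schema."""

    model_snapshot: str   # one model snapshot
    hardware: str         # one hardware target
    objective: str        # one primary objective
    benchmark_suite: str  # one fixed benchmark suite
    workspace: Path       # one mutable workspace


def case_root(model_slug: str, case_slug: str) -> Path:
    # Mirrors the cases/<model>/<case>/ layout used in this repo.
    return Path("cases") / model_slug / case_slug


root = case_root("my_model", "rtx_pro_6000_single_stream_tok_s")
case = KernelCase(
    model_snapshot="models/my-model",
    hardware="NVIDIA RTX PRO 6000 Blackwell Workstation Edition 96GB",
    objective="single_stream_tok_s",  # assumed objective name
    benchmark_suite="benchmarks/",
    workspace=root / "workspace",
)
print(case.workspace.as_posix())
```

Freezing the record reflects the idea that only the workspace contents change during a run; the target definition itself stays fixed.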
The framework under src/cute_kernel_lab/** stays boring and stable. The case workspace is where the real changes happen.
That gives you three clean layers:
- src/cute_kernel_lab/**: Stable platform code: serving shim, benchmark runner, logging, plotting, notifications.
- cases/<model>/<case>/**: One optimization target: case config, benchmark definitions, run history, program instructions.
- cases/<model>/<case>/workspace/**: A self-contained kernel project: runtime package, operator families, bench helpers, tests, docs, scripts.
cute-kernel-lab/
├── .codex/skills/                 # portable Codex skills
├── docs/                          # repo-local lab guidance
├── templates/                     # copyable case starters
├── src/cute_kernel_lab/
│   ├── api/                       # OpenAI-compatible shim
│   ├── bench/                     # benchmark runner + score flattening
│   ├── optimize/                  # logging, plotting, Discord notifier
│   └── serving/                   # backend interfaces + mock backend
├── cases/
│   └── <model>/<case>/
│       ├── benchmarks/            # fixed suites for that target
│       ├── workspace/
│       │   ├── backend/           # serving glue, env defaults, model loader
│       │   ├── ops/               # operator families: cute, triton, ptx, cuda
│       │   ├── bench/             # parity and profiling helpers
│       │   ├── tests/             # workspace-local correctness checks
│       │   ├── docs/              # running notes, design docs, prior art
│       │   ├── scripts/           # case-local build/profile wrappers
│       │   ├── serve_custom_backend.py
│       │   ├── launch_transformers_server.py
│       │   └── kernel_manifest.yaml
│       ├── runs/
│       ├── case.yaml
│       └── program.md
├── scripts/
│   ├── new_kernel_case.sh
│   ├── new_kernel_case.py
│   ├── serve_case.py
│   ├── evaluate_case.py
│   └── run-kernel-hillclimb.sh
└── .env.example
The stable harness still owns:
- starting servers
- running benchmarks
- scoring
- logging history
- plotting best-so-far curves
- sending webhook updates
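To illustrate what a best-so-far curve is, here is a minimal sketch that turns a list of per-run scores into the running best. The `best_so_far` helper and the sample scores are hypothetical; the harness's own plotting code is not shown here and may work differently.

```python
from itertools import accumulate


def best_so_far(scores, higher_is_better=True):
    """Return the running best score after each run."""
    pick = max if higher_is_better else min
    return list(accumulate(scores, pick))


# Made-up tok/s scores from five consecutive runs.
history = [101.2, 99.8, 104.5, 103.9, 107.1]
print(best_so_far(history))  # [101.2, 101.2, 104.5, 104.5, 107.1]
```

For a latency objective, pass `higher_is_better=False` so the curve tracks the running minimum instead.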
Webhook debugging should go through the repo config loader, not raw shell env:
uv run python scripts/check_webhook_status.py
Every evaluate_case.py record now includes:
- webhook_enabled_by_config
- webhook_disabled_by_flag
- webhook_expected
- webhook_attempted
- webhook_sent
- webhook_status
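One plausible way these six fields relate is sketched below. The field names come from the list above, but the derivation logic, the status strings (`"skipped"`, `"ok"`, `"error: ..."`), and the `webhook_fields` helper are assumptions for illustration, not the actual evaluate_case.py implementation.

```python
def webhook_fields(enabled_by_config, disabled_by_flag, send):
    """Build the webhook_* portion of a record (assumed semantics).

    `send` is any zero-argument callable that posts the update and
    raises on failure.
    """
    expected = enabled_by_config and not disabled_by_flag
    record = {
        "webhook_enabled_by_config": enabled_by_config,
        "webhook_disabled_by_flag": disabled_by_flag,
        "webhook_expected": expected,
        "webhook_attempted": False,
        "webhook_sent": False,
        "webhook_status": "skipped",  # assumed status value
    }
    if not expected:
        return record
    record["webhook_attempted"] = True
    try:
        send()
    except Exception as exc:
        record["webhook_status"] = f"error: {exc}"  # assumed status format
    else:
        record["webhook_sent"] = True
        record["webhook_status"] = "ok"  # assumed status value
    return record


print(webhook_fields(True, False, send=lambda: None)["webhook_status"])
```

Recording "expected" separately from "attempted" and "sent" makes it possible to tell a disabled webhook apart from one that was tried and failed.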
The case workspace owns:
- runtime policy and load path
- operator-family code
- case-local build helpers
- parity and profiler scripts
- running notes and design docs
That lets you be much more thorough inside a case without turning the repo itself into a pile of one-off experiment files.
Stamp a new case from the template instead of building a workspace by hand:
./scripts/new_kernel_case.sh \
--model-slug my_model \
--case-slug rtx_pro_6000_single_stream_tok_s \
--model-name "org/model" \
--model-path "models/my-model" \
--hardware "NVIDIA RTX PRO 6000 Blackwell Workstation Edition 96GB"
The stamped case now includes a deeper workspace by default:
- workspace/backend/: serving glue, env defaults, model loading, hook installation, parity helpers
- workspace/ops/: separate surfaces for cute, triton, ptx, and raw cuda
- workspace/bench/: local parity and one-shot profiling helpers
- workspace/tests/: workspace-local correctness checks
- workspace/docs/: notes, development plan, prior art
- workspace/scripts/: build, parity, and profile wrappers
- workspace/scripts/fetch_upstream_refs.sh: optional read-only upstream example fetcher
- workspace/docs/tri_dao_examples.md: curated file-level example map from FlashAttention, Mamba, QuACK, SonicMoE, and related Dao-AILab repos
Default starter priorities:
- do not assume the answer is CuTe
- do not default to CUTLASS C++
- keep CuTe, Triton, PTX, raw CUDA, and runtime-policy ideas all available in the same case
Recommended start sequence:
- Stamp the case.
- Pick a fresh run tag:
  export CUTE_KERNEL_LAB_RUN_TAG=kernel_hillclimb_apr06
- Verify the model snapshot exists at the stamped model_path.
- Run the stamped setup checks:
  cd cases/my_model/rtx_pro_6000_single_stream_tok_s
  ./workspace/scripts/check_setup.sh --build --emit-ptx
- Baseline the copied case:
  cd cases/my_model/rtx_pro_6000_single_stream_tok_s
  CUTE_KERNEL_LAB_RUN_TAG=kernel_hillclimb_apr06 ./workspace/run_hillclimb.sh "baseline"
- Keep early mutations env-only.
- For a deep kernel pass, fetch the curated upstream examples:
  ./workspace/scripts/fetch_upstream_refs.sh --pack tri-dao
- Run a prompt-level parity probe before every replay, streamer, or mutating-buffer benchmark.
- Only promote a branch into checked-in defaults after the source-default path survives its own benchmark.
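The prompt-level parity probe in the sequence above can be pictured as a token-by-token comparison between a reference path and the candidate path for the same prompt. This is only a sketch under that assumption: `first_divergence` and the token IDs are hypothetical, and the stamped parity helpers in workspace/bench/ may compare logits or text instead.

```python
def first_divergence(reference, candidate):
    """Return the index of the first differing token, or None if identical."""
    for index, (ref_tok, cand_tok) in enumerate(zip(reference, candidate)):
        if ref_tok != cand_tok:
            return index
    if len(reference) != len(candidate):
        # One sequence is a strict prefix of the other.
        return min(len(reference), len(candidate))
    return None


# Made-up greedy-decode token IDs for one prompt.
reference_tokens = [12, 87, 87, 3, 401, 9]
candidate_tokens = [12, 87, 87, 3, 77, 9]
print(first_divergence(reference_tokens, candidate_tokens))  # 4
```

Reporting the divergence index rather than a plain pass/fail makes it easier to see whether a mismatch appears immediately or only after many decode steps.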
Do not let a long run collapse into endless tiny sweeps.
The intended rhythm for a serious case is:
- baseline the case
- profile the real harness
- do a short prior-art pass
- write down the next structural candidate set
- spend the next scored block on one family
- re-ablate older thin-wrapper or cache wins after the deeper boundary lands
Good signs that it is time to escalate:
- several near-ties from the same thin wrapper family
- thread-count or graph-step sweeps moving less than normal run variance
- top profiler buckets are not changing
- an older default-on branch has not been rechecked since a deeper boundary landed
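The second sign in the list above (sweeps moving less than normal run variance) can be checked numerically. The sketch below is one simple way to do it, assuming you have a few repeated baseline runs to estimate noise; `sweep_is_within_noise`, the factor `k`, and the sample numbers are hypothetical choices, not a harness feature.

```python
import statistics


def sweep_is_within_noise(baseline_repeats, sweep_scores, k=2.0):
    """True if the whole sweep spans less than k standard deviations of noise.

    baseline_repeats: scores from re-running the same config (needs >= 2).
    sweep_scores: one score per swept setting.
    """
    noise = statistics.stdev(baseline_repeats)
    spread = max(sweep_scores) - min(sweep_scores)
    return spread <= k * noise


# Made-up tok/s numbers.
repeats = [100.0, 101.0, 99.0, 100.5, 99.5]
thread_sweep = [100.2, 100.9, 100.4]
print(sweep_is_within_noise(repeats, thread_sweep))  # True
```

When this returns True for several sweeps in a row, the knob being swept is probably not the bottleneck and a deeper structural target is the better use of the next scored block.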
Good deeper targets:
- attention-entry prep such as qkv -> rope -> cache write
- MLP super-boundaries instead of isolated linears
- allocation / output-reuse / scratch-arena cleanup
- graph-safe state ownership and replay boundaries
- cache layout and batch-only routing in throughput cases
In throughput cases with a separate batch-1 guardrail, it is valid to route an experiment only through the high-batch lane when that lane is the real target metric and the guardrail stays healthy.
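A minimal sketch of that acceptance rule follows. It assumes both metrics are higher-is-better and that "healthy" means the batch-1 guardrail drops by no more than a small relative tolerance; the function name, the 2% default, and the sample numbers are all hypothetical.

```python
def accept_high_batch_only(
    target_before,
    target_after,
    guard_before,
    guard_after,
    guard_tolerance=0.02,
):
    """Accept only if the high-batch target improves and batch-1 stays healthy."""
    improves = target_after > target_before
    guard_floor = guard_before * (1.0 - guard_tolerance)
    guard_ok = guard_after >= guard_floor
    return improves and guard_ok


# High-batch throughput rises, batch-1 guardrail dips by 1%: accepted.
print(accept_high_batch_only(1000.0, 1080.0, 50.0, 49.5))  # True
# Same gain, but batch-1 drops by 6%: rejected.
print(accept_high_batch_only(1000.0, 1080.0, 50.0, 47.0))  # False
```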
Read workspace-blueprint.md before starting a new case. It is the repo-level reference for:
- how to split a case workspace
- which concerns belong in backend/ vs ops/ vs bench/
- when to use CuTe, Triton, PTX, raw CUDA, or runtime policy
- how to keep the case autonomous without polluting the framework
What the repo provides today:
- OpenAI-compatible FastAPI serving shim
- benchmark runner and score flattening
- run history and plots
- Discord/webhook notifier
- generic baseline launcher
- copyable deep workspace structure
- PTX and raw-CUDA starter surfaces
- case-local parity and profiling helpers
- case-local setup checks
- case-local notes and prior-art docs
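"Score flattening" in the list above is not spelled out in this README. One common reading is turning nested benchmark results into a single level of dotted keys so they can be logged and plotted uniformly. The sketch below shows that idea only; `flatten_scores`, the dotted-key convention, and the sample metrics are assumptions, not the code in src/cute_kernel_lab/bench/.

```python
def flatten_scores(nested, prefix=""):
    """Flatten nested metric dicts into {"outer.inner": value} form."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_scores(value, name))
        else:
            flat[name] = value
    return flat


# Made-up nested benchmark result.
result = {"decode": {"tok_s": 120.5, "ttft_ms": 35.0}, "parity": {"passed": True}}
print(flatten_scores(result))
```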
Benchmark surfaces now include two clean patterns:
- OpenAI-compatible HTTP benchmarks for serving-facing cases
- shell_json benchmarks for cases that should drive an external harness or repo-local CLI directly and return metrics as JSON
Use shell_json when the real benchmark already exists as a stable script and faking the workload through an OpenAI server would distort the target.
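The core of a shell_json-style benchmark is small: run a command, read JSON from its stdout, and treat the result as metrics. The sketch below shows that pattern with the standard library. `run_shell_json`, the timeout default, and the demo command are hypothetical; the real runner's config keys and error handling are not documented here.

```python
import json
import subprocess
import sys


def run_shell_json(command, timeout_s=600):
    """Run a command and parse its stdout as a JSON object of metrics."""
    proc = subprocess.run(
        command,
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,  # a non-zero exit raises CalledProcessError
    )
    metrics = json.loads(proc.stdout)
    if not isinstance(metrics, dict):
        raise ValueError("benchmark must print a JSON object")
    return metrics


# Stand-in for an external harness: a one-liner that prints metrics as JSON.
demo = [sys.executable, "-c", "import json; print(json.dumps({'tok_s': 123.4}))"]
print(run_shell_json(demo))  # {'tok_s': 123.4}
```

Anything the external script logs should go to stderr so that stdout stays a single parseable JSON document.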
This repo is pinned to the NVIDIA RTX PRO 6000 Blackwell Workstation Edition.
Use one of these before any GPU work:
. scripts/selected-gpu.sh
or:
scripts/with-selected-gpu.sh uv run python scripts/evaluate_case.py --case ...
Both set CUDA_VISIBLE_DEVICES to the repo-owned GPU UUID so serving, benchmarking, profiling, and compile steps stay on the same card.
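For readers who drive the repo from Python rather than the shell, the same pinning idea can be sketched as a subprocess wrapper that sets CUDA_VISIBLE_DEVICES for the child only. `run_on_gpu` and the placeholder UUID are hypothetical; the real UUID is owned by the repo scripts above and is not reproduced here.

```python
import os
import subprocess
import sys


def run_on_gpu(command, gpu_uuid):
    """Run a command with CUDA_VISIBLE_DEVICES pinned to one GPU UUID."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_uuid)
    return subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )


# Placeholder only; substitute the UUID that the repo scripts select.
PLACEHOLDER_UUID = "GPU-00000000-0000-0000-0000-000000000000"
probe = [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
print(run_on_gpu(probe, PLACEHOLDER_UUID).stdout.strip())
```

Pinning by UUID rather than by index keeps the selection stable if device enumeration order changes.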
This repo is uv-first:
- uv sync creates or updates .venv
- uv run ... executes inside that environment
- .python-version pins the interpreter
Keep the shared environment case-agnostic. Put case-specific setup and operator notes in the case workspace, not in the repo root.
If you use Codex in this repo: