ncky/autocute

cute-kernel-lab

This repo is a dedicated kernel-research lab: a stable harness around deeply case-local workspaces.

The point is not just to mutate one kernel file. The point is to let an agent or engineer run a full research program inside one case without polluting the rest of the repo:

  • stable benchmark harness
  • stable logging, plots, and webhook flow
  • per-case serving glue
  • per-case operator families
  • per-case profiler and parity helpers
  • per-case tests, notes, and prior-art docs

Development model

Every optimization target is a case:

  • one model snapshot
  • one hardware target
  • one primary objective
  • one fixed benchmark suite
  • one mutable workspace

The framework under src/cute_kernel_lab/** stays boring and stable. The case workspace is where the real changes happen.

That gives you three clean layers:

  1. src/cute_kernel_lab/**: stable platform code (serving shim, benchmark runner, logging, plotting, notifications).
  2. cases/<model>/<case>/**: one optimization target (case config, benchmark definitions, run history, program instructions).
  3. cases/<model>/<case>/workspace/**: a self-contained kernel project (runtime package, operator families, bench helpers, tests, docs, scripts).
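
The three layers can be told apart from the path alone. The following is a minimal sketch, not part of the repo: `classify_layer` is a hypothetical helper, and the layer names it returns are labels chosen for illustration. It assumes repo-relative POSIX paths shaped like the globs above.

```python
from pathlib import PurePosixPath


def classify_layer(path: str) -> str:
    """Map a repo-relative path to one of the three layers (hypothetical helper)."""
    parts = PurePosixPath(path).parts
    # Layer 1: stable platform code under src/cute_kernel_lab/**
    if parts[:2] == ("src", "cute_kernel_lab"):
        return "framework"
    # Layers 2 and 3 both live under cases/<model>/<case>/
    if len(parts) >= 3 and parts[0] == "cases":
        # Layer 3: the mutable workspace nested inside the case
        if len(parts) >= 4 and parts[3] == "workspace":
            return "workspace"
        # Layer 2: case config, benchmarks, run history, program.md
        return "case"
    # Anything else (scripts/, docs/, templates/, ...) is outside the layers
    return "other"
```

A check like this could back a pre-commit guard that flags experiment commits touching the framework layer, if you want the separation enforced rather than just documented.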

Repository layout

cute-kernel-lab/
├── .codex/skills/                 # portable Codex skills
├── docs/                          # repo-local lab guidance
├── templates/                     # copyable case starters
├── src/cute_kernel_lab/
│   ├── api/                       # OpenAI-compatible shim
│   ├── bench/                     # benchmark runner + score flattening
│   ├── optimize/                  # logging, plotting, Discord notifier
│   └── serving/                   # backend interfaces + mock backend
├── cases/
│   └── <model>/<case>/
│       ├── benchmarks/            # fixed suites for that target
│       ├── workspace/
│       │   ├── backend/           # serving glue, env defaults, model loader
│       │   ├── ops/               # operator families: cute, triton, ptx, cuda
│       │   ├── bench/             # parity and profiling helpers
│       │   ├── tests/             # workspace-local correctness checks
│       │   ├── docs/              # running notes, design docs, prior art
│       │   ├── scripts/           # case-local build/profile wrappers
│       │   ├── serve_custom_backend.py
│       │   ├── launch_transformers_server.py
│       │   └── kernel_manifest.yaml
│       ├── runs/
│       ├── case.yaml
│       └── program.md
├── scripts/
│   ├── new_kernel_case.sh
│   ├── new_kernel_case.py
│   ├── serve_case.py
│   ├── evaluate_case.py
│   └── run-kernel-hillclimb.sh
└── .env.example

Why this stays clean

The stable harness still owns:

  • starting servers
  • running benchmarks
  • scoring
  • logging history
  • plotting best-so-far curves
  • sending webhook updates
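
As an illustration of what a best-so-far curve is, here is a minimal sketch. It is not the harness's plotting code: `best_so_far` is a hypothetical name, and it assumes each run yields a single scalar score, with higher being better by default (as for tokens per second).

```python
from itertools import accumulate


def best_so_far(scores, higher_is_better=True):
    """Return the running best of a score history (hypothetical helper)."""
    # accumulate() applies max/min pairwise, so element i is the best of scores[: i + 1]
    pick = max if higher_is_better else min
    return list(accumulate(scores, pick))
```

Plotting this series next to the raw scores makes regressions and plateaus easy to spot: the curve only moves when a run actually beats every earlier run.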

Debug webhooks through the repo config loader rather than by inspecting raw shell environment variables:

uv run python scripts/check_webhook_status.py

Every evaluate_case.py record now includes:

  • webhook_enabled_by_config
  • webhook_disabled_by_flag
  • webhook_expected
  • webhook_attempted
  • webhook_sent
  • webhook_status
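
These fields make it possible to spot a silently dropped notification. The sketch below is one way to read them, under loud assumptions: the record is available as a Python dict, the five flag fields are booleans, and `webhook_expected` means "enabled by config and not disabled by flag". None of that is confirmed by the harness, and `webhook_problems` is a hypothetical helper.

```python
def webhook_problems(record: dict) -> list[str]:
    """List inconsistencies among the webhook fields of one record (hypothetical)."""
    enabled = bool(record.get("webhook_enabled_by_config"))
    disabled = bool(record.get("webhook_disabled_by_flag"))
    expected = bool(record.get("webhook_expected"))
    attempted = bool(record.get("webhook_attempted"))
    sent = bool(record.get("webhook_sent"))

    problems = []
    # Assumed relationship: expected == enabled by config and not disabled by flag
    if expected != (enabled and not disabled):
        problems.append("webhook_expected disagrees with config and flag")
    if expected and not attempted:
        problems.append("webhook expected but not attempted")
    if attempted and not sent:
        # webhook_status is reported as-is; its type is not assumed here
        problems.append(f"webhook attempted but not sent (status={record.get('webhook_status')!r})")
    if sent and not attempted:
        problems.append("webhook marked sent without an attempt")
    return problems
```

An empty list means the record is self-consistent under these assumptions. Anything else is a starting point for scripts/check_webhook_status.py.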

The case workspace owns:

  • runtime policy and load path
  • operator-family code
  • case-local build helpers
  • parity and profiler scripts
  • running notes and design docs

That lets you be much more thorough inside a case without turning the repo itself into a pile of one-off experiment files.

Start a new hillclimb series

Stamp a new case from the template instead of building a workspace by hand:

./scripts/new_kernel_case.sh \
  --model-slug my_model \
  --case-slug rtx_pro_6000_single_stream_tok_s \
  --model-name "org/model" \
  --model-path "models/my-model" \
  --hardware "NVIDIA RTX PRO 6000 Blackwell Workstation Edition 96GB"

The stamped case now includes a deeper workspace by default:

  • workspace/backend/
    • serving glue, env defaults, model loading, hook installation, parity helpers
  • workspace/ops/
    • separate surfaces for cute, triton, ptx, and raw cuda
  • workspace/bench/
    • local parity and one-shot profiling helpers
  • workspace/tests/
    • workspace-local correctness checks
  • workspace/docs/
    • notes, development plan, prior art
  • workspace/scripts/
    • build, parity, and profile wrappers
  • workspace/scripts/fetch_upstream_refs.sh
    • optional read-only upstream example fetcher
  • workspace/docs/tri_dao_examples.md
    • curated file-level example map from FlashAttention, Mamba, QuACK, SonicMoE, and related Dao-AILab repos

Default starter priorities:

  • do not assume the answer is CuTe
  • do not default to CUTLASS C++
  • keep CuTe, Triton, PTX, raw CUDA, and runtime-policy ideas all available in the same case

Recommended start sequence:

  1. Stamp the case.
  2. Pick a fresh run tag:
     export CUTE_KERNEL_LAB_RUN_TAG=kernel_hillclimb_apr06
  3. Verify the model snapshot exists at the stamped model_path.
  4. Run the stamped setup checks:
     cd cases/my_model/rtx_pro_6000_single_stream_tok_s
     ./workspace/scripts/check_setup.sh --build --emit-ptx
  5. Baseline the copied case:
     cd cases/my_model/rtx_pro_6000_single_stream_tok_s
     CUTE_KERNEL_LAB_RUN_TAG=kernel_hillclimb_apr06 ./workspace/run_hillclimb.sh "baseline"
  6. Keep early mutations env-only.
  7. For a deep kernel pass, fetch the curated upstream examples:
     ./workspace/scripts/fetch_upstream_refs.sh --pack tri-dao
  8. Run a prompt-level parity probe before every replay, streamer, or mutating-buffer benchmark.
  9. Only promote a branch into checked-in defaults after the source-default path survives its own benchmark.

Deep Hillclimb Rhythm

Do not let a long run collapse into endless tiny sweeps.

The intended rhythm for a serious case is:

  1. baseline the case
  2. profile the real harness
  3. do a short prior-art pass
  4. write down the next structural candidate set
  5. spend the next scored block on one family
  6. re-ablate older thin-wrapper or cache wins after the deeper boundary lands

Good signs that it is time to escalate:

  • several near-ties from the same thin wrapper family
  • thread-count or graph-step sweeps moving less than normal run variance
  • top profiler buckets are not changing
  • an older default-on branch has not been rechecked since a deeper boundary landed
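
The "less than normal run variance" signal can be made concrete. This is a minimal sketch under stated assumptions, not a harness feature: `sweep_is_within_noise` is a hypothetical helper, noise is estimated as the sample standard deviation of repeated baseline runs, and the threshold `k` is an arbitrary choice.

```python
from statistics import mean, stdev


def sweep_is_within_noise(baseline_runs, sweep_scores, k=2.0):
    """True when no sweep point moves more than k baseline stdevs from the baseline mean."""
    if len(baseline_runs) < 2:
        raise ValueError("need at least two baseline runs to estimate variance")
    center = mean(baseline_runs)
    noise = stdev(baseline_runs)  # sample standard deviation of repeated baselines
    largest_move = max(abs(score - center) for score in sweep_scores)
    return largest_move <= k * noise
```

When this returns True for a thread-count or graph-step sweep, the sweep is not telling you anything the baseline repeats did not, which is a reasonable cue to move to a structural candidate.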

Good deeper targets:

  • attention-entry prep such as qkv -> rope -> cache write
  • MLP super-boundaries instead of isolated linears
  • allocation / output-reuse / scratch-arena cleanup
  • graph-safe state ownership and replay boundaries
  • cache layout and batch-only routing in throughput cases

In throughput cases with a separate batch-1 guardrail, it is valid to route an experiment only through the high-batch lane when that lane is the real target metric and the guardrail stays healthy.
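
The guardrail rule above amounts to a two-condition accept check. A minimal sketch, assuming both metrics are higher-is-better throughput numbers and that "healthy" means the batch-1 score stays within a fractional tolerance of its previous value. The function name and the 2% default are hypothetical, not repo policy.

```python
def accept_high_batch_experiment(
    target_before: float,
    target_after: float,
    guard_before: float,
    guard_after: float,
    guard_tolerance: float = 0.02,
) -> bool:
    """Accept only if the high-batch target improves and the batch-1 guardrail holds."""
    target_improved = target_after > target_before
    # The guardrail may dip by at most guard_tolerance (a fraction of its old value)
    guard_floor = guard_before * (1.0 - guard_tolerance)
    return target_improved and guard_after >= guard_floor
```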

Workspace blueprint

Read workspace-blueprint.md before starting a new case. It is the repo-level reference for:

  • how to split a case workspace
  • which concerns belong in backend/ vs ops/ vs bench/
  • when to use CuTe, Triton, PTX, raw CUDA, or runtime policy
  • how to keep the case autonomous without polluting the framework

What the scaffold gives you

  • OpenAI-compatible FastAPI serving shim
  • benchmark runner and score flattening
  • run history and plots
  • Discord/webhook notifier
  • generic baseline launcher
  • copyable deep workspace structure
  • PTX and raw-CUDA starter surfaces
  • case-local parity and profiling helpers
  • case-local setup checks
  • case-local notes and prior-art docs

Benchmark surfaces now include two clean patterns:

  • OpenAI-compatible HTTP benchmarks for serving-facing cases
  • shell_json benchmarks for cases that should drive an external harness or repo-local CLI directly and return metrics as JSON

Use shell_json when the real benchmark already exists as a stable script and faking the workload through an OpenAI server would distort the target.
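
To illustrate the shell_json idea, here is a minimal sketch of a runner that executes a command and reads its metrics from stdout. It is not the harness's implementation: `run_shell_json`, the timeout default, and the `tok_s` metric name are all hypothetical, and it assumes the command prints exactly one JSON object on stdout and nothing else.

```python
import json
import subprocess
import sys

# Stand-in for a real benchmark script: prints one JSON object of metrics.
EXAMPLE_COMMAND = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'tok_s': 123.4}))",
]


def run_shell_json(command, timeout_s=600.0):
    """Run a benchmark command and parse its stdout as a JSON metrics object."""
    completed = subprocess.run(
        command,
        capture_output=True,  # keep stdout for parsing, stderr for diagnostics
        text=True,
        timeout=timeout_s,
        check=True,  # a non-zero exit raises CalledProcessError
    )
    metrics = json.loads(completed.stdout)
    if not isinstance(metrics, dict):
        raise ValueError("expected a JSON object of metrics on stdout")
    return metrics
```

Calling `run_shell_json(EXAMPLE_COMMAND)` returns the parsed dict. A real benchmark script would keep progress output on stderr so that stdout stays parseable.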

GPU selection

This repo is pinned to the NVIDIA RTX PRO 6000 Blackwell Workstation Edition.

Use one of these before any GPU work:

. scripts/selected-gpu.sh

or:

scripts/with-selected-gpu.sh uv run python scripts/evaluate_case.py --case ...

Both set CUDA_VISIBLE_DEVICES to the repo-owned GPU UUID so serving, benchmarking, profiling, and compile steps stay on the same card.
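
A case-local script can fail fast when it is launched without the GPU selection. The following is a minimal sketch, not something the repo ships: `selected_gpu` is a hypothetical helper, and it assumes the scripts export a single full-length UUID of the form `GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx`. CUDA itself also accepts indices and abbreviated UUIDs, so this check is deliberately stricter.

```python
import os
import re

# Full-length NVIDIA GPU UUID as printed by `nvidia-smi -L`
_GPU_UUID_RE = re.compile(
    r"^GPU-[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$"
)


def selected_gpu(env=None):
    """Return the pinned GPU UUID, or raise if the selection was not applied."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES", "").strip()
    if not value:
        raise RuntimeError(
            "CUDA_VISIBLE_DEVICES is not set; source scripts/selected-gpu.sh first"
        )
    if not _GPU_UUID_RE.match(value):
        raise RuntimeError(f"CUDA_VISIBLE_DEVICES={value!r} is not a single GPU UUID")
    return value
```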

Python environment

This repo is uv-first:

  • uv sync creates or updates .venv
  • uv run ... executes inside that environment
  • .python-version pins the interpreter

Keep the shared environment case-agnostic. Put case-specific setup and operator notes in the case workspace, not in the repo root.

Codex workflow

If you use Codex in this repo:

  • read AGENTS.md first
  • use the copied kernel skill at SKILL.md
  • keep repo-specific observations in docs/
  • stamp new cases from templates/kernel_hillclimb_case/
  • treat the workspace as a real operator project, not just one mutable kernel file
