A harness for benchmarking prompt-compression strategies in Claude Code.
Pick a strategy (a system prompt, a CLAUDE.md rule, a plugin, a one-line preamble), run it across the dataset via `claude -p`, and score every answer with an LLM judge against per-record `key_points`, `must_use_terms`, and `must_avoid` rubrics. Output: an apples-to-apples comparison of correctness vs. token cost.
Lots of advice exists for "making Claude less verbose". Plugins, custom skills, prompt prefixes, system-level rules. Almost none of it is measured. This repo measures it.
The harness is strategy-agnostic. Caveman, "Be brief.", baseline, custom CLAUDE.md additions, hook-based injectors. Anything that changes how Claude Code generates a response is just an arm.
```bash
# 1. Install
brew install jq                              # for the runner
curl -fsSL https://bun.sh/install | bash     # for the judge
bun install                                  # ai sdk + zod + commander

# 2. Generate answers for an arm
./dryrun.sh baseline                         # default Claude
./dryrun.sh ultra                            # caveman ultra (requires plugin installed)
CAVEMAN_BENCH_PREAMBLE="Be brief." ./dryrun.sh brief

# 3. Score them
export ANTHROPIC_API_KEY=sk-ant-...
bun judge.ts results/dryrun_baseline_*.jsonl
```

Each run writes `results/dryrun_<label>_<ts>.jsonl`. Each judge run writes `<input>.judged.jsonl` alongside it.
An arm is anything that changes the input or environment for claude -p. The harness supports two arm shapes natively:
Prompt preamble. Prepend text to every prompt. No plugin required.
CAVEMAN_BENCH_PREAMBLE="Respond in fewer than 100 tokens." \
./dryrun.sh under-100Environment variable. Toggle a plugin or hook via env. The existing caveman arms work this way:
CAVEMAN_DEFAULT_MODE=ultra ./dryrun.sh ultraFor arms that need more (e.g. swapping a CLAUDE.md file, mounting a different settings.json), wrap the call in a shell script that sets up the state and then invokes dryrun.sh with a free-form label.
| File | Purpose |
|---|---|
| `dataset.jsonl` | 24 evaluation records (see authoring rules below) |
| `dryrun.sh` | runs the dataset through `claude -p` for one arm |
| `judge.ts` | scores a results file via Sonnet 4.6 + structured output |
| `results/` | per-arm output and judged output |
The first sweep ran 24 prompts × 5 arms on claude-opus-4-7, judged by claude-sonnet-4-6:
| Arm | mean score | mean tokens |
|---|---|---|
| baseline | 0.985 | 636 |
brief ("Be brief." preamble) |
0.985 | 419 |
| caveman lite | 0.976 | 401 |
| caveman full | 0.975 | 404 |
| caveman ultra | 0.970 | 449 |
Full writeup with per-category breakdowns, the per-question variance analysis on safety categories, and the failure modes I found: caveman-findings.md.
- Generation. `claude -p` with `--output-format json`, Opus by default. Each row records the response text plus token counts, cache stats, cost, duration, and session id.
- Judge. Sonnet 4.6 via the Vercel AI SDK's `generateText` + `Output.object({ schema })`, forcing the judge into a typed `{key_points_hit, must_use_terms_hit, must_avoid_triggered, score, notes}` shape (a minimal sketch of this call follows the list). The judge sees only the prompt, the rubric, and the answer. Never the arm label or any other arm's response. Prompt caching on the judge system prompt cuts repeat-call cost.
- Scoring. The judge emits a holistic 0.0–1.0 `score` plus parallel boolean arrays. Aggregation (mean/median, per-category breakdowns) is the harness's concern, not the judge's.
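In code, that judge call is roughly the sketch below. It is illustrative rather than a copy of judge.ts: the `judgeOne` helper and its prompt wording are hypothetical, and the exact structured-output option names depend on the AI SDK version.

```ts
import { generateText, Output } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { z } from "zod";

// The typed verdict shape the judge is forced into.
const Verdict = z.object({
  key_points_hit: z.array(z.boolean()),
  must_use_terms_hit: z.array(z.boolean()),
  must_avoid_triggered: z.array(z.boolean()),
  score: z.number().min(0).max(1),
  notes: z.string(),
});

// Hypothetical helper; judge.ts's real signature and prompts may differ.
async function judgeOne(
  record: { prompt: string; key_points: string[]; must_use_terms?: string[]; must_avoid?: string[] },
  answer: string
) {
  const { experimental_output } = await generateText({
    model: anthropic("claude-sonnet-4-6"), // judge model named in the sweep above
    // Rubric instructions live in the system prompt so prompt caching can reuse them.
    system: "Grade one answer against the rubric. Return only the structured verdict.",
    // The judge sees the prompt, the rubric, and the answer. Never the arm label.
    prompt: JSON.stringify({ ...record, answer }),
    experimental_output: Output.object({ schema: Verdict }),
  });
  return experimental_output; // { key_points_hit, must_use_terms_hit, must_avoid_triggered, score, notes }
}
```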
24 prompts across 6 categories, picked to stress different ways compression can fail:
| Category | Failure mode | Skill claim tested | n |
|---|---|---|---|
| Bug diagnosis | Drops the why, gives fix without cause | — | 5 |
| Concept explanation | Strips nuance, edge cases, or compresses technical terms into plain English | Technical terms exact | 5 |
| Architectural tradeoffs | Drops caveats that change the advice | — | 4 |
| Multi-step setup | Collapses or reorders steps | — | 4 |
| Security / destructive ops | Missing warnings on irreversible actions | Auto-Clarity escape | 3 |
| Error interpretation | Paraphrases or truncates the error string | Errors quoted exact | 3 |
Error-interpretation prompts must contain a realistic stack trace or error string in the prompt body.
```json
{
  "id": "bug_01",
  "category": "bug_diagnosis",
  "prompt": "...",
  "key_points": ["fragment", "fragment", "fragment"],
  "must_use_terms": ["optional"],
  "must_avoid": ["optional"]
}
```

`key_points` is required. `must_use_terms` and `must_avoid` are optional and omitted entirely when they don't apply. Don't include empty arrays.
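For reference, those rules map onto a small zod schema like the one below. This is illustrative only; the harness doesn't necessarily validate records this way.

```ts
import { z } from "zod";

// Illustrative shape for one dataset.jsonl line; not shipped by the harness.
const DatasetRecord = z.object({
  id: z.string(),
  category: z.string(),
  prompt: z.string(),
  key_points: z.array(z.string()).min(1),                 // required, never empty
  must_use_terms: z.array(z.string()).min(1).optional(),  // omit entirely when unused
  must_avoid: z.array(z.string()).min(1).optional(),      // omit entirely when unused
});

type DatasetRecord = z.infer<typeof DatasetRecord>;
```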
- Realistic dev scenarios, not toy examples.
- Include enough context (code, stack trace, version) to make the answer deterministic.
- One question per prompt. No multi-part asks.
`key_points` are what the judge scores against. Optimize for reliable matching, not readability.
- Max 3. Two is fine. One is almost never fine. If you only have one, the prompt is probably too narrow.
- Fragments, not prose. 3–8 words. Terse, fact-checkable. `"N+1 query problem"`, not `"This is a classic case of the N+1 query problem where..."`.
- Atomic. One fact per entry. No compounds with `and`, `e.g.`, or `;`. If you need a conjunction, split into two. Or, more often, cut one.
- Independent. Point B should not be derivable from point A. If knowing A tells you B, B is filler.
- Must-have. If a staff engineer could omit this and still give a correct answer, it's not a `key_point`.
- Substring-matchable where possible. The judge is semantic, but phrasing a point so the literal words are likely to appear reduces judge variance.
`must_use_terms`: include only when the precise term is the answer (idempotent, linearizability, N+1, ACCESS EXCLUSIVE, phantom). Acts as a strict word-presence check on top of the semantic `key_points` check. Catches answers that explain the concept correctly but never name it.
`must_avoid`: include only when there is a concrete, plausible, and specifically nameable wrong claim an answer could confidently make.
Hard rule: name the specific wrong claim, not a vague category.
| Good (specific) | Bad (vague) |
|---|---|
"set rejectUnauthorized: false" |
"insecure TLS config" |
"BFG makes the leak safe" |
"bad advice about git secrets" |
"raising --max-old-space-size is the permanent fix" |
"not fixing the leak" |
"Kafka has per-message visibility timeout" |
"wrong claim about Kafka" |
If you cannot phrase the trap as a sentence someone might actually write, don't include `must_avoid` for that record.
Purpose: catch the confidently wrong short answer. One that hits all `key_points` but also recommends something dangerous. Positive-only rubrics can't express this failure mode.
```json
{
  "key_points_hit": [true, true, false],
  "must_use_terms_hit": [true],
  "must_avoid_triggered": [false],
  "score": 0,
  "notes": ""
}
```

Scoring weights and aggregation are harness concerns, not dataset concerns.
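As one example of harness-side aggregation, a throwaway Bun script like the one below computes the mean and median score of a judged file. It is hypothetical (not part of the repo) and assumes each judged line is a JSON object carrying the `score` field shown above.

```ts
// Hypothetical aggregation one-off, not part of the repo.
// Usage: bun aggregate.ts results/<some-file>.judged.jsonl
const scores = (await Bun.file(process.argv[2]).text())
  .split("\n")
  .filter((line) => line.trim().length > 0)
  .map((line) => JSON.parse(line).score as number)
  .sort((a, b) => a - b);

const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
const mid = Math.floor(scores.length / 2);
const median = scores.length % 2 ? scores[mid] : (scores[mid - 1] + scores[mid]) / 2;

console.log({ n: scores.length, mean: +mean.toFixed(3), median: +median.toFixed(3) });
```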
- Pick the category and check it's not over its target count.
- Draft the prompt. Keep it self-contained.
- Draft `key_points` by asking: what must any correct answer contain? Cut ruthlessly until each entry survives the rules above.
- If a precise term is central, add `must_use_terms`.
- If a specific dangerous wrong claim is plausible, add `must_avoid`, phrased as a concrete sentence, not a category.
- Append one JSON object on a single line to `dataset.jsonl`. No trailing commas, no blank lines, no wrapping array. (A quick format check is sketched below.)
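Before opening a PR, a throwaway check along these lines catches most format mistakes. It is hypothetical (the repo doesn't ship it) and only enforces the rules listed above.

```ts
// Hypothetical format check for dataset.jsonl; not part of the repo.
const lines = (await Bun.file("dataset.jsonl").text()).split("\n").filter((l) => l.trim());

for (const [i, line] of lines.entries()) {
  const rec = JSON.parse(line); // throws if a line isn't a standalone JSON value
  if (!Array.isArray(rec.key_points) || rec.key_points.length === 0)
    throw new Error(`line ${i + 1}: key_points must be a non-empty array`);
  for (const field of ["must_use_terms", "must_avoid"])
    if (field in rec && rec[field].length === 0)
      throw new Error(`line ${i + 1}: ${field} should be omitted, not empty`);
}
console.log(`${lines.length} records look well-formed`);
```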
PRs welcome for:
- New arms. Submit a script and a results file generated against the current dataset.
- New prompts. Follow the authoring rules above.
- Methodology improvements. Judge model swaps, statistical aggregation, multi-seed runs.
If you're adding an arm that requires a plugin or external tool, include install instructions in the PR.
MIT.