cc-compression-bench

A harness for benchmarking prompt-compression strategies in Claude Code.

Pick a strategy (a system prompt, a CLAUDE.md rule, a plugin, a one-line preamble), run it across the dataset via claude -p, and score every answer with an LLM judge against per-record key_points, must_use_terms, and must_avoid rubrics. Output: an apples-to-apples comparison of correctness vs. token cost.

Why

Lots of advice exists for "making Claude less verbose". Plugins, custom skills, prompt prefixes, system-level rules. Almost none of it is measured. This repo measures it.

The harness is strategy-agnostic. Caveman, "Be brief.", baseline, custom CLAUDE.md additions, hook-based injectors. Anything that changes how Claude Code generates a response is just an arm.

Quick start

# 1. Install
brew install jq            # for the runner
curl -fsSL https://bun.sh/install | bash   # for the judge

bun install                # ai sdk + zod + commander

# 2. Generate answers for an arm
./dryrun.sh baseline       # default Claude
./dryrun.sh ultra          # caveman ultra (requires plugin installed)
CAVEMAN_BENCH_PREAMBLE="Be brief." ./dryrun.sh brief

# 3. Score them
export ANTHROPIC_API_KEY=sk-ant-...
bun judge.ts results/dryrun_baseline_*.jsonl

Each run writes results/dryrun_<label>_<ts>.jsonl. Each judge run writes <input>.judged.jsonl alongside it.

Adding an arm

An arm is anything that changes the input or environment for claude -p. The harness supports two arm shapes natively:

Prompt preamble. Prepend text to every prompt. No plugin required.

CAVEMAN_BENCH_PREAMBLE="Respond in fewer than 100 tokens." \
  ./dryrun.sh under-100

Environment variable. Toggle a plugin or hook via env. The existing caveman arms work this way:

CAVEMAN_DEFAULT_MODE=ultra ./dryrun.sh ultra

For arms that need more (e.g. swapping a CLAUDE.md file, mounting a different settings.json), wrap the call in a shell script that sets up the state and then invokes dryrun.sh with a free-form label.
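
A minimal sketch of such a wrapper, assuming the global memory file lives at ~/.claude/CLAUDE.md and that arms/terse.md is a rules file you supply (both paths are illustrative, not part of the harness):

#!/usr/bin/env bash
# Illustrative wrapper arm: swap in a stricter CLAUDE.md, run the sweep,
# and restore the original file on exit either way.
set -euo pipefail
cp ~/.claude/CLAUDE.md /tmp/CLAUDE.md.bak
trap 'cp /tmp/CLAUDE.md.bak ~/.claude/CLAUDE.md' EXIT
cp arms/terse.md ~/.claude/CLAUDE.md
./dryrun.sh terse-claude-md        # free-form label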

Repo layout

dataset.jsonl          24 evaluation records (see authoring rules below)
dryrun.sh              runs the dataset through `claude -p` for one arm
judge.ts               scores a results file via Sonnet 4.6 + structured output
results/               per-arm output and judged output

Results

The first sweep ran 24 prompts × 5 arms on claude-opus-4-7, judged by claude-sonnet-4-6:

Arm                             mean score   mean tokens
baseline                        0.985        636
brief ("Be brief." preamble)    0.985        419
caveman lite                    0.976        401
caveman full                    0.975        404
caveman ultra                   0.970        449

Full writeup with per-category breakdowns, the per-question variance analysis on safety categories, and the failure modes I found: caveman-findings.md.

Methodology

  • Generation. claude -p with --output-format json, Opus by default. Each row records the response text plus token counts, cache stats, cost, duration, and session id.
  • Judge. Sonnet 4.6 via Vercel AI SDK generateText + Output.object({ schema }), forcing the judge into a typed {key_points_hit, must_use_terms_hit, must_avoid_triggered, score, notes} shape. The judge sees only the prompt, the rubric, and the answer. Never the arm label or any other arm's response. Prompt caching on the judge system prompt cuts repeat-call cost.
  • Scoring. Judge emits a holistic 0.0–1.0 score plus parallel boolean arrays. Aggregation (mean/median, per-category breakdowns) is the harness's concern, not the judge's.
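
Because aggregation stays outside the judge, it can be redone after the fact with the jq the runner already requires. A sketch, assuming each judged line carries the judge's top-level score (per the judge contract below) plus the record's category; the real judged-row schema is whatever judge.ts writes:

# mean score for one arm
jq -s 'map(.score) | add / length' results/dryrun_baseline_*.judged.jsonl

# per-category means
jq -s 'group_by(.category) | map({category: .[0].category, mean_score: (map(.score) | add / length)})' \
  results/dryrun_baseline_*.judged.jsonl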

Dataset

24 prompts across 6 categories, picked to stress different ways compression can fail:

Category                      Failure mode                                          Skill claim tested      n
Bug diagnosis                 Drops the why, gives fix without cause                                        5
Concept explanation           Strips nuance, edge cases, or compresses              Technical terms exact   5
                              technical terms into plain English
Architectural tradeoffs       Drops caveats that change the advice                                          4
Multi-step setup              Collapses or reorders steps                                                   4
Security / destructive ops    Missing warnings on irreversible actions              Auto-Clarity escape     3
Error interpretation          Paraphrases or truncates the error string             Errors quoted exact     3

Error-interpretation prompts must contain a realistic stack trace or error string in the prompt body.

Record shape

{
  "id": "bug_01",
  "category": "bug_diagnosis",
  "prompt": "...",
  "key_points": ["fragment", "fragment", "fragment"],
  "must_use_terms": ["optional"],
  "must_avoid": ["optional"]
}

key_points is required. must_use_terms and must_avoid are optional and omitted entirely when they don't apply. Don't include empty arrays.

Authoring rules

Prompts

  • Realistic dev scenarios, not toy examples.
  • Include enough context (code, stack trace, version) to make the answer deterministic.
  • One question per prompt. No multi-part asks.

key_points. Evaluator-facing fragments

What the judge scores against. Optimize for reliable matching, not readability.

  • Max 3. Two is fine. One is almost never fine. If you only have one, the prompt is probably too narrow.
  • Fragments, not prose. 3–8 words. Terse, fact-checkable. "N+1 query problem" not "This is a classic case of the N+1 query problem where...".
  • Atomic. One fact per entry. No compounds joined with "and", "e.g.", or ";". If you need a conjunction, split into two. Or, more often, cut one.
  • Independent. Point B should not be derivable from point A. If knowing A tells you B, B is filler.
  • Must-have. If a staff engineer could omit this and still give a correct answer, it's not a key_point.
  • Substring-matchable where possible. The judge is semantic, but phrasing a point so the literal words are likely to appear reduces judge variance.
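
Putting those rules together, a key_points array for a hypothetical N+1 prompt might look like this (illustrative only, not a record from dataset.jsonl):

"key_points": ["N+1 query problem", "eager-load the association"]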

must_use_terms. Optional, for terminology precision

Include only when the precise term is the answer (idempotent, linearizability, N+1, ACCESS EXCLUSIVE, phantom). Acts as a strict word-presence check on top of the semantic key_points check. Catches answers that explain the concept correctly but never name it.

must_avoid. Optional, for dangerous wrong claims

Include only when there is a concrete, plausible, and specifically nameable wrong claim an answer could confidently make.

Hard rule: name the specific wrong claim, not a vague category.

Good (specific)                                        Bad (vague)
"set rejectUnauthorized: false"                        "insecure TLS config"
"BFG makes the leak safe"                              "bad advice about git secrets"
"raising --max-old-space-size is the permanent fix"    "not fixing the leak"
"Kafka has per-message visibility timeout"             "wrong claim about Kafka"

If you cannot phrase the trap as a sentence someone might actually write, don't include must_avoid for that record.

Purpose: catch the confidently wrong short answer. One that hits all key_points but also recommends something dangerous. Positive-only rubrics can't express this failure mode.

Judge contract

{
  "key_points_hit": [true, true, false],
  "must_use_terms_hit": [true],
  "must_avoid_triggered": [false],
  "score": 0,
  "notes": ""
}

Scoring weights and aggregation are harness concerns, not dataset concerns.
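
As one illustration of that split, a harness-side weight can be computed from the judged rows alone, e.g. the fraction of key_points hit, zeroed whenever a must_avoid claim fires (illustrative, not the formula behind the results above; assumes the judged row also keeps the record's id):

# illustrative harness-side weighting over judged rows
jq '{id, weighted: (if ((.must_avoid_triggered // []) | any) then 0
     else ((.key_points_hit | map(select(.)) | length) / (.key_points_hit | length)) end)}' \
  results/dryrun_baseline_*.judged.jsonl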

Adding a prompt

  1. Pick the category and check it's not over its target count.
  2. Draft the prompt. Keep it self-contained.
  3. Draft key_points by asking: what must any correct answer contain? Cut ruthlessly until each entry survives the rules above.
  4. If a precise term is central, add must_use_terms.
  5. If a specific dangerous wrong claim is plausible, add must_avoid. Phrased as a concrete sentence, not a category.
  6. Append one JSON object on a single line to dataset.jsonl. No trailing commas, no blank lines, no wrapping array.
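
A quick sanity check on the appended line, using the jq already required by the runner (an illustrative lint, not a script in the repo):

# flag records that break the shape rules: missing or more than 3 key_points,
# or optional arrays that are present but empty
jq -c 'select(((.key_points // []) | length) as $n
       | $n == 0 or $n > 3
         or (has("must_use_terms") and ((.must_use_terms | length) == 0))
         or (has("must_avoid") and ((.must_avoid | length) == 0)))' dataset.jsonl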

Contributing

PRs welcome for:

  • New arms. Submit a script and a results file generated against the current dataset.
  • New prompts. Follow the authoring rules above.
  • Methodology improvements. Judge model swaps, statistical aggregation, multi-seed runs.

If you're adding an arm that requires a plugin or external tool, include install instructions in the PR.

License

MIT.
