A harness for benchmarking prompt-compression strategies in Claude Code.
Pick a strategy (a system prompt, a CLAUDE.md rule, a plugin, a one-line preamble), run it across the dataset via `claude -p`, and score every answer with an LLM judge against per-record `key_points`, `must_use_terms`, and `must_avoid` rubrics. Output: an apples-to-apples comparison of correctness vs. token cost.
Lots of advice exists for "making Claude less verbose". Plugins, custom skills, prompt prefixes, system-level rules. Almost none of it is measured. This repo measures it.
The harness is strategy-agnostic. Caveman, "Be brief.", baseline, custom CLAUDE.md additions, hook-based injectors. Anything that changes how Claude Code generates a response is just an arm.
```bash
# 1. Install
brew install jq                              # for the runner
curl -fsSL https://bun.sh/install | bash     # for the judge
bun install                                  # ai sdk + zod + commander

# 2. Generate answers for an arm
./dryrun.sh baseline                         # default Claude
./dryrun.sh ultra                            # caveman ultra (requires plugin installed)
CAVEMAN_BENCH_PREAMBLE="Be brief." ./dryrun.sh brief

# 3. Score them
export ANTHROPIC_API_KEY=sk-ant-...
bun judge.ts results/dryrun_baseline_*.jsonl
```

Each run writes `results/dryrun_<label>_<ts>.jsonl`. Each judge run writes `<input>.judged.jsonl` alongside it.
An arm is anything that changes the input or environment for claude -p. The harness supports two arm shapes natively:
Prompt preamble. Prepend text to every prompt. No plugin required.
CAVEMAN_BENCH_PREAMBLE="Respond in fewer than 100 tokens." \
./dryrun.sh under-100Environment variable. Toggle a plugin or hook via env. The existing caveman arms work this way:
CAVEMAN_DEFAULT_MODE=ultra ./dryrun.sh ultraFor arms that need more (e.g. swapping a CLAUDE.md file, mounting a different settings.json), wrap the call in a shell script that sets up the state and then invokes dryrun.sh with a free-form label.
| File | Purpose |
|---|---|
| `dataset.jsonl` | 24 evaluation records (see authoring rules below) |
| `dryrun.sh` | runs the dataset through `claude -p` for one arm |
| `judge.ts` | scores a results file via Sonnet 4.6 + structured output |
| `results/` | per-arm output and judged output |
The first sweep ran 24 prompts × 5 arms on claude-opus-4-7, judged by claude-sonnet-4-6:
| Arm | mean score | mean tokens |
|---|---|---|
| baseline | 0.985 | 636 |
brief ("Be brief." preamble) |
0.985 | 419 |
| caveman lite | 0.976 | 401 |
| caveman full | 0.975 | 404 |
| caveman ultra | 0.970 | 449 |
Full writeup with per-category breakdowns, the per-question variance analysis on safety categories, and the failure modes I found: caveman-findings.md.
- Generation. `claude -p` with `--output-format json`, Opus by default. Each row records the response text plus token counts, cache stats, cost, duration, and session id.
- Judge. Sonnet 4.6 via the Vercel AI SDK's `generateText` + `Output.object({ schema })`, forcing the judge into a typed `{key_points_hit, must_use_terms_hit, must_avoid_triggered, score, notes}` shape (a minimal sketch of this call follows the list). The judge sees only the prompt, the rubric, and the answer. Never the arm label or any other arm's response. Prompt caching on the judge system prompt cuts repeat-call cost.
- Scoring. The judge emits a holistic 0.0–1.0 `score` plus parallel boolean arrays. Aggregation (mean/median, per-category breakdowns) is the harness's concern, not the judge's.
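In code, that judge call is roughly the sketch below. It is illustrative rather than a copy of judge.ts: the `judgeOne` helper and its prompt wording are hypothetical, and the exact structured-output option names depend on the AI SDK version.

```ts
import { generateText, Output } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { z } from "zod";

// The typed verdict shape the judge is forced into.
const Verdict = z.object({
  key_points_hit: z.array(z.boolean()),
  must_use_terms_hit: z.array(z.boolean()),
  must_avoid_triggered: z.array(z.boolean()),
  score: z.number().min(0).max(1),
  notes: z.string(),
});

// Hypothetical helper; judge.ts's real signature and prompts may differ.
async function judgeOne(
  record: { prompt: string; key_points: string[]; must_use_terms?: string[]; must_avoid?: string[] },
  answer: string
) {
  const { experimental_output } = await generateText({
    model: anthropic("claude-sonnet-4-6"), // judge model named in the sweep above
    // Rubric instructions live in the system prompt so prompt caching can reuse them.
    system: "Grade one answer against the rubric. Return only the structured verdict.",
    // The judge sees the prompt, the rubric, and the answer. Never the arm label.
    prompt: JSON.stringify({ ...record, answer }),
    experimental_output: Output.object({ schema: Verdict }),
  });
  return experimental_output; // { key_points_hit, must_use_terms_hit, must_avoid_triggered, score, notes }
}
```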
24 prompts across 6 categories, picked to stress different ways compression can fail:
| Category | Failure mode | Skill claim tested | n |
|---|---|---|---|
| Bug diagnosis | Drops the why, gives fix without cause | — | 5 |
| Concept explanation | Strips nuance, edge cases, or compresses technical terms into plain English | Technical terms exact | 5 |
| Architectural tradeoffs | Drops caveats that change the advice | — | 4 |
| Multi-step setup | Collapses or reorders steps | — | 4 |
| Security / destructive ops | Missing warnings on irreversible actions | Auto-Clarity escape | 3 |
| Error interpretation | Paraphrases or truncates the error string | Errors quoted exact | 3 |
Error-interpretation prompts must contain a realistic stack trace or error string in the prompt body.
```json
{
  "id": "bug_01",
  "category": "bug_diagnosis",
  "prompt": "...",
  "key_points": ["fragment", "fragment", "fragment"],
  "must_use_terms": ["optional"],
  "must_avoid": ["optional"]
}
```

`key_points` is required. `must_use_terms` and `must_avoid` are optional and omitted entirely when they don't apply. Don't include empty arrays.
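For reference, those rules map onto a small zod schema like the one below. This is illustrative only; the harness doesn't necessarily validate records this way.

```ts
import { z } from "zod";

// Illustrative shape for one dataset.jsonl line; not shipped by the harness.
const DatasetRecord = z.object({
  id: z.string(),
  category: z.string(),
  prompt: z.string(),
  key_points: z.array(z.string()).min(1),                 // required, never empty
  must_use_terms: z.array(z.string()).min(1).optional(),  // omit entirely when unused
  must_avoid: z.array(z.string()).min(1).optional(),      // omit entirely when unused
});

type DatasetRecord = z.infer<typeof DatasetRecord>;
```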
- Realistic dev scenarios, not toy examples.
- Include enough context (code, stack trace, version) to make the answer deterministic.
- One question per prompt. No multi-part asks.
`key_points` are what the judge scores against. Optimize for reliable matching, not readability.
- Max 3. Two is fine. One is almost never fine. If you only have one, the prompt is probably too narrow.
- Fragments, not prose. 3–8 words. Terse, fact-checkable. `"N+1 query problem"`, not `"This is a classic case of the N+1 query problem where..."`.
- Atomic. One fact per entry. No compounds with `and`, `e.g.`, or `;`. If you need a conjunction, split into two. Or, more often, cut one.
- Independent. Point B should not be derivable from point A. If knowing A tells you B, B is filler.
- Must-have. If a staff engineer could omit this and still give a correct answer, it's not a `key_point`.
- Substring-matchable where possible. The judge is semantic, but phrasing a point so the literal words are likely to appear reduces judge variance.
`must_use_terms`: include only when the precise term is the answer (idempotent, linearizability, N+1, ACCESS EXCLUSIVE, phantom). Acts as a strict word-presence check on top of the semantic `key_points` check. Catches answers that explain the concept correctly but never name it.
`must_avoid`: include only when there is a concrete, plausible, and specifically nameable wrong claim an answer could confidently make.
Hard rule: name the specific wrong claim, not a vague category.
| Good (specific) | Bad (vague) |
|---|---|
"set rejectUnauthorized: false" |
"insecure TLS config" |
"BFG makes the leak safe" |
"bad advice about git secrets" |
"raising --max-old-space-size is the permanent fix" |
"not fixing the leak" |
"Kafka has per-message visibility timeout" |
"wrong claim about Kafka" |
If you cannot phrase the trap as a sentence someone might actually write, don't include `must_avoid` for that record.
Purpose: catch the confidently wrong short answer. One that hits all `key_points` but also recommends something dangerous. Positive-only rubrics can't express this failure mode.
```json
{
  "key_points_hit": [true, true, false],
  "must_use_terms_hit": [true],
  "must_avoid_triggered": [false],
  "score": 0,
  "notes": ""
}
```

Scoring weights and aggregation are harness concerns, not dataset concerns.
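As one example of harness-side aggregation, a throwaway Bun script like the one below computes the mean and median score of a judged file. It is hypothetical (not part of the repo) and assumes each judged line is a JSON object carrying the `score` field shown above.

```ts
// Hypothetical aggregation one-off, not part of the repo.
// Usage: bun aggregate.ts results/<some-file>.judged.jsonl
const scores = (await Bun.file(process.argv[2]).text())
  .split("\n")
  .filter((line) => line.trim().length > 0)
  .map((line) => JSON.parse(line).score as number)
  .sort((a, b) => a - b);

const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
const mid = Math.floor(scores.length / 2);
const median = scores.length % 2 ? scores[mid] : (scores[mid - 1] + scores[mid]) / 2;

console.log({ n: scores.length, mean: +mean.toFixed(3), median: +median.toFixed(3) });
```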
- Pick the category and check it's not over its target count.
- Draft the prompt. Keep it self-contained.
- Draft `key_points` by asking: what must any correct answer contain? Cut ruthlessly until each entry survives the rules above.
- If a precise term is central, add `must_use_terms`.
- If a specific dangerous wrong claim is plausible, add `must_avoid`, phrased as a concrete sentence, not a category.
- Append one JSON object on a single line to `dataset.jsonl`. No trailing commas, no blank lines, no wrapping array. (A quick format check is sketched below.)
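Before opening a PR, a throwaway check along these lines catches most format mistakes. It is hypothetical (the repo doesn't ship it) and only enforces the rules listed above.

```ts
// Hypothetical format check for dataset.jsonl; not part of the repo.
const lines = (await Bun.file("dataset.jsonl").text()).split("\n").filter((l) => l.trim());

for (const [i, line] of lines.entries()) {
  const rec = JSON.parse(line); // throws if a line isn't a standalone JSON value
  if (!Array.isArray(rec.key_points) || rec.key_points.length === 0)
    throw new Error(`line ${i + 1}: key_points must be a non-empty array`);
  for (const field of ["must_use_terms", "must_avoid"])
    if (field in rec && rec[field].length === 0)
      throw new Error(`line ${i + 1}: ${field} should be omitted, not empty`);
}
console.log(`${lines.length} records look well-formed`);
```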
PRs welcome for:
- New arms. Submit a script and a results file generated against the current dataset.
- New prompts. Follow the authoring rules above.
- Methodology improvements. Judge model swaps, statistical aggregation, multi-seed runs.
If you're adding an arm that requires a plugin or external tool, include install instructions in the PR.
MIT.