The vendor-neutral benchmark for AI infrastructure primitives.
OCR · web search · vector DBs · rerankers · retrieval · extraction · chunking · crawling · memory
Modern AI products are assembled from infrastructure primitives — OCR, web search, vector databases, rerankers, retrieval, extraction, chunking, crawling, memory. Choosing the right one is mostly folklore today. Primitive Bench turns that choice into evidence.
No primitive wins every slice. We publish per-slice, per-constraint results with confidence intervals and statistical separability — never a single global leaderboard.
This repo is the public trust anchor: the harness engine, the statistics library, the adapter SDK, the per-primitive eval packages, and the public dev splits. Held-out golden answers never live here — they sit behind the private eval server, so the scores stay honest.
| Primitive | Package | What it measures | Status |
|---|---|---|---|
| Web search | eval-websearch | hit@k against golden-URL equivalence classes, sliced by intent | ✅ Live |
| Extraction | eval-extraction | token survival of clean main-content extraction | ✅ Live |
| OCR | eval-ocr | text fidelity across document types | 🚧 Planned |
| Vector DBs | eval-vectordb | recall / latency / cost across index configs | 🚧 Planned |
| Rerankers | eval-reranker | nDCG / MAP uplift over first-stage retrieval | 🚧 Planned |
| Retrieval | eval-retrieval | nDCG@k, MAP@k, MRR@k, Recall@k | 🚧 Planned |
| Chunking | eval-chunking | downstream retrieval quality by chunk strategy | 🚧 Planned |
| Crawling | eval-crawl | coverage & freshness of fetched content | 🚧 Planned |
| Memory | — | long-horizon recall (LoCoMo-style) | 🗺️ Roadmap |
Filling in a 🚧 is the highest-impact first contribution — see CONTRIBUTING.md.
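
To make the web search metric in the table concrete, here is a minimal sketch of hit@k scored against a golden-URL equivalence class. It assumes an equivalence class is simply a set of URLs that all count as the same correct answer. The function names and the normalization rules (drop scheme, `www.`, fragment, trailing slash) are illustrative assumptions, not the eval-websearch implementation.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Hypothetical normalization: drop scheme, 'www.', fragment, and trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{path}{query}"


def hit_at_k(ranked_urls: list[str], golden_class: set[str], k: int) -> int:
    """Return 1 if any of the top-k results falls in the golden equivalence class, else 0."""
    golden = {normalize_url(u) for u in golden_class}
    return int(any(normalize_url(u) in golden for u in ranked_urls[:k]))


def mean_hit_at_k(items: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average hit@k over (ranked results, golden class) pairs."""
    if not items:
        raise ValueError("no items to score")
    return sum(hit_at_k(ranked, golden, k) for ranked, golden in items) / len(items)
```

Slicing by intent would then amount to grouping items by an intent label and calling `mean_hit_at_k` once per group.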
- No fake #1. A winner is named for a slice only when it's statistically separable from the runner-up (McNemar p < α, non-overlapping CIs); otherwise we publish a tie band.
- Real evals, not reviews. Every claim is backed by a canonical, citable statistic — McNemar, Wilson intervals, seeded bootstrap, Bradley-Terry / Elo.
- Reproducible by anyone. Deterministic seeds, pinned versions, and public dev splits reproduce public runs bit-for-bit.
- Neutral arbiter. No pay-to-rank. Three-tier ground truth (verified-external, authoritative-registry, sentinel-planted) with canary markers for contamination detection.
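
The separability rule in the first principle can be sketched with two of the statistics named above. This is a minimal stdlib-only illustration, assuming the exact (binomial) form of McNemar's test on paired per-item outcomes and a default α of 0.05. The function names and the `verdict` helper are hypothetical and are not the bench-stats API.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    b = items only system A got right, c = items only system B got right.
    Under the null hypothesis the discordant items split as Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def verdict(b: int, c: int, alpha: float = 0.05) -> str:
    """Name a winner only when the paired difference is significant; otherwise a tie."""
    if mcnemar_exact(b, c) >= alpha:
        return "tie"
    return "A" if b > c else "B"
```

For example, `verdict(10, 0)` names A, while `verdict(6, 4)` returns a tie because the discordant split is well within chance.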
Packages aren't on PyPI yet — run from a clone for now.

    uv sync
    uv run bench run --primitive ocr --config configs/ocr.yaml
    uv run bench view ./runs/<run_id>

The `bench` CLI scaffolds a config (`bench init`), runs an eval (`bench run`), summarizes slices with separability badges (`bench view`), and submits to the held-out eval server for scores only (`bench submit`).

Primitive Bench uses the proven harness shape — dataset → Task → Adapter → Scorer → result schema (converging with EleutherAI lm-eval, UK AISI Inspect, and Stanford HELM).
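
The harness shape can be illustrated with a toy pipeline. This is a sketch under stated assumptions: the `Task`, `Result`, `Adapter`, `EchoAdapter`, `exact_match`, and `run_eval` names are invented for illustration, and the real types are defined in bench-schemas.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Protocol


@dataclass(frozen=True)
class Task:
    """One dataset item: an input plus its reference answer."""
    item_id: str
    query: str
    expected: str


@dataclass(frozen=True)
class Result:
    """Hypothetical result record; the real schema lives in bench-schemas."""
    item_id: str
    output: str
    score: float


class Adapter(Protocol):
    """Anything that turns a task input into a provider response."""
    def run(self, query: str) -> str: ...


class EchoAdapter:
    """Toy adapter that upper-cases the query, standing in for a vendor call."""
    def run(self, query: str) -> str:
        return query.upper()


def exact_match(output: str, expected: str) -> float:
    """Toy scorer: 1.0 on exact match, else 0.0."""
    return 1.0 if output == expected else 0.0


def run_eval(dataset: Iterable[Task], adapter: Adapter,
             scorer: Callable[[str, str], float]) -> list[Result]:
    """dataset -> Task -> Adapter -> Scorer -> result records."""
    results = []
    for task in dataset:
        output = adapter.run(task.query)
        results.append(Result(task.item_id, output, scorer(output, task.expected)))
    return results
```

Because the adapter and scorer are passed in rather than hard-coded, swapping a vendor or a metric does not touch the loop itself, which is the property the shared harness shape is meant to provide.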
The Gate. bench-schemas is the frozen contract (v0.1.0): every package imports types only
from it and writes only files it owns — no shared mutable state. That boundary is what lets the build
lanes run in parallel without colliding. See apps/docs/DECISIONS.md (D-03)
and the methodology.

    packages/
      bench-schemas/      # THE FROZEN CONTRACT: RunManifest, ItemResult, SliceResult, ScorerOutput, AdapterSpec
      bench-core/         # harness engine: deterministic seeding, run/manifest, per-run dirs
      bench-stats/        # McNemar, Wilson, bootstrap CIs, hit@k, nDCG/MAP/MRR, Bradley-Terry
      bench-adapters/     # provider/primitive adapter SDK (lm-eval registry pattern)
      eval-*/             # one package per primitive: public golden dev set + scorer + slice defs
    apps/
      cli/                # the `bench` CLI: init / run / view / submit
      docs/               # methodology + DECISIONS.md
    golden-sets-public/   # PUBLIC dev splits only (canary-marked). Held-out answers NEVER here.

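
The layout pairs deterministic seeding (bench-core) with bootstrap CIs (bench-stats). As a rough illustration of how a seed makes an interval repeatable, here is a percentile bootstrap for a mean score using only the standard library. The `bootstrap_ci` name, its parameters, and the choice of the percentile method are assumptions for this sketch, not the bench-stats API.

```python
import random
from statistics import mean


def bootstrap_ci(scores: list[float], seed: int, n_resamples: int = 2000,
                 level: float = 0.95) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean, driven by a private seeded RNG."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # local RNG: no dependence on global random state
    n = len(scores)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 + level) / 2 * n_resamples))
    return means[lo_idx], means[hi_idx]
```

Calling `bootstrap_ci(scores, seed=7)` twice on the same Python version yields the same interval, which is why recording the seed alongside pinned versions matters for reproducing a run.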
We love contributions big and small — a new vendor adapter, a slice that separates two adapters, or a
whole stubbed primitive. Start with CONTRIBUTING.md; the best first issue is
implementing one of the 🚧 verticals using eval-websearch / eval-extraction as the template.
- Code: Apache-2.0.
- Public datasets under golden-sets-public/: CC-BY-4.0.
- Third-party attribution is in NOTICE. We learn from lm-evaluation-harness, Inspect, HELM, ann-benchmarks, VectorDBBench, and OmniDocBench, and we do not vendor GPL/commercial-dual code.
Built by the Primitive Bench team · primitivebench.com · LinkedIn