Primitive Bench

The vendor-neutral benchmark for AI infrastructure primitives.
OCR · web search · vector DBs · rerankers · retrieval · extraction · chunking · crawling · memory

Website · Docs · Methodology · Contributing · LinkedIn

Modern AI products are assembled from infrastructure primitives — OCR, web search, vector databases, rerankers, retrieval, extraction, chunking, crawling, memory. Choosing the right one is mostly folklore today. Primitive Bench turns that choice into evidence.

"One winner is a lie."

No primitive wins every slice. We publish per-slice, per-constraint results with confidence intervals and statistical separability — never a single global leaderboard.

This repo is the public trust anchor: the harness engine, the statistics library, the adapter SDK, the per-primitive eval packages, and the public dev splits. Held-out golden answers never live here — they sit behind the private eval server, so the scores stay honest.

What we benchmark

Primitive	Package	What it measures	Status
Web search	`eval-websearch`	hit@k against golden-URL equivalence classes, sliced by intent	✅ Live
Extraction	`eval-extraction`	token survival of clean main-content extraction	✅ Live
OCR	`eval-ocr`	text fidelity across document types	🚧 Planned
Vector DBs	`eval-vectordb`	recall / latency / cost across index configs	🚧 Planned
Rerankers	`eval-reranker`	nDCG / MAP uplift over first-stage retrieval	🚧 Planned
Retrieval	`eval-retrieval`	nDCG@k, MAP@k, MRR@k, Recall@k	🚧 Planned
Chunking	`eval-chunking`	downstream retrieval quality by chunk strategy	🚧 Planned
Crawling	`eval-crawl`	coverage & freshness of fetched content	🚧 Planned
Memory	—	long-horizon recall (LoCoMo-style)	🗺️ Roadmap

Filling in a 🚧 is the highest-impact first contribution — see CONTRIBUTING.md.
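As an illustration of the web search row, hit@k against a golden-URL equivalence class can be sketched as follows. This is a minimal sketch, not the `eval-websearch` scorer: the normalization rule, the item layout (`intent`, `ranked`, `golden`), and the function names are all assumptions made for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Hypothetical equivalence rule: ignore scheme, 'www.', query, fragment, trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower().removeprefix("www.")
    path = parts.path.rstrip("/")
    return f"{host}{path}"


def hit_at_k(ranked_urls: list[str], golden_class: set[str], k: int) -> bool:
    """True if any of the top-k results matches any member of the golden class."""
    golden = {normalize_url(u) for u in golden_class}
    return any(normalize_url(u) in golden for u in ranked_urls[:k])


def hit_rate_by_slice(items: list[dict], k: int) -> dict[str, float]:
    """Average hit@k per intent slice (item keys are assumed, not the real schema)."""
    hits: dict[str, list[bool]] = defaultdict(list)
    for item in items:
        hits[item["intent"]].append(hit_at_k(item["ranked"], item["golden"], k))
    return {intent: sum(flags) / len(flags) for intent, flags in hits.items()}


items = [
    {"intent": "navigational",
     "ranked": ["https://www.example.com/docs/", "https://other.org"],
     "golden": {"http://example.com/docs"}},
    {"intent": "navigational",
     "ranked": ["https://a.org", "https://b.org"],
     "golden": {"https://c.org"}},
    {"intent": "informational",
     "ranked": ["https://x.org", "https://y.org/page?ref=1"],
     "golden": {"https://y.org/page"}},
]
print(hit_rate_by_slice(items, k=2))
```

Reporting one rate per intent rather than a pooled average is what keeps slice-level differences visible.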

Why Primitive Bench

No fake #1. A winner is named for a slice only when it's statistically separable from the runner-up (McNemar p < α, non-overlapping CIs); otherwise we publish a tie band.
Real evals, not reviews. Every claim is backed by a canonical, citable statistic — McNemar, Wilson intervals, seeded bootstrap, Bradley-Terry / Elo.
Reproducible by anyone. Deterministic seeds, pinned versions, and public dev splits reproduce public runs bit-for-bit.
Neutral arbiter. No pay-to-rank. Three-tier ground truth (verified-external, authoritative-registry, sentinel-planted) with canary markers for contamination detection.
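The separability rule above (McNemar p < α together with non-overlapping CIs) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the `bench-stats` API: the function names, the choice of the exact binomial form of McNemar's test, and the 95% Wilson interval (z = 1.96) are all choices made for the example.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    b: items adapter A got right and B got wrong; c: the reverse.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2 * tail)


def separable(a_correct: list[bool], b_correct: list[bool], alpha: float = 0.05) -> bool:
    """Name a winner only if McNemar rejects AND the Wilson intervals do not overlap."""
    b = sum(x and not y for x, y in zip(a_correct, b_correct))
    c = sum(y and not x for x, y in zip(a_correct, b_correct))
    n = len(a_correct)
    lo_a, hi_a = wilson_interval(sum(a_correct), n)
    lo_b, hi_b = wilson_interval(sum(b_correct), n)
    overlap = lo_a <= hi_b and lo_b <= hi_a
    return mcnemar_exact_p(b, c) < alpha and not overlap
```

When `separable` returns False, the two adapters would fall into the same tie band for that slice.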

Quickstart

Packages aren't on PyPI yet — run from a clone for now.

uv sync
uv run bench run --primitive ocr --config configs/ocr.yaml
uv run bench view ./runs/<run_id>

The bench CLI scaffolds a config (bench init), runs an eval (bench run), summarizes slices with separability badges (bench view), and submits to the held-out eval server for scores only (bench submit).

How it works

Primitive Bench uses the harness shape that EleutherAI lm-eval, UK AISI Inspect, and Stanford HELM have converged on: dataset → Task → Adapter → Scorer → result schema.

The Gate. bench-schemas is the frozen contract (v0.1.0): every package imports types only from it and writes only files it owns — no shared mutable state. That boundary is what lets the build lanes run in parallel without colliding. See apps/docs/DECISIONS.md (D-03) and the methodology.
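To make the dataset → Task → Adapter → Scorer → result flow concrete, here is a minimal sketch. Only the type name `ItemResult` comes from this README; its fields, the `Task` fields, the `Adapter` protocol, `run_eval`, `EchoAdapter`, and `exact_match` are hypothetical and do not reflect the actual `bench-schemas` v0.1.0 contract.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Protocol


@dataclass(frozen=True)
class Task:
    """One dataset item (hypothetical fields)."""
    item_id: str
    slice_name: str
    query: str
    golden: str


@dataclass(frozen=True)
class ItemResult:
    """Per-item result; frozen so downstream packages cannot mutate it."""
    item_id: str
    slice_name: str
    score: float


class Adapter(Protocol):
    """Anything that turns a query into an output, e.g. a vendor wrapper."""
    def run(self, query: str) -> str: ...


# A scorer compares an adapter output with the golden answer.
Scorer = Callable[[str, str], float]


def run_eval(tasks: Iterable[Task], adapter: Adapter, scorer: Scorer) -> list[ItemResult]:
    """Run every task through the adapter, score it, and emit immutable results."""
    return [
        ItemResult(t.item_id, t.slice_name, scorer(adapter.run(t.query), t.golden))
        for t in tasks
    ]


class EchoAdapter:
    """Toy adapter for the example: upper-cases the query."""
    def run(self, query: str) -> str:
        return query.upper()


def exact_match(output: str, golden: str) -> float:
    return 1.0 if output == golden else 0.0


tasks = [Task("1", "demo", "abc", "ABC"), Task("2", "demo", "abc", "xyz")]
results = run_eval(tasks, EchoAdapter(), exact_match)
print([r.score for r in results])
```

Because the adapter and scorer only meet through the shared types, a new vendor adapter or a new scorer can be added without touching the other.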

Repo layout

packages/
  bench-schemas/   # THE FROZEN CONTRACT — RunManifest, ItemResult, SliceResult, ScorerOutput, AdapterSpec
  bench-core/      # harness engine: deterministic seeding, run/manifest, per-run dirs
  bench-stats/     # McNemar, Wilson, bootstrap CIs, hit@k, nDCG/MAP/MRR, Bradley-Terry
  bench-adapters/  # provider/primitive adapter SDK (lm-eval registry pattern)
  eval-*/          # one package per primitive: public golden dev set + scorer + slice defs
apps/
  cli/             # the `bench` CLI: init / run / view / submit
  docs/            # methodology + DECISIONS.md
golden-sets-public/  # PUBLIC dev splits only (canary-marked). Held-out answers NEVER here.
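The layout lists deterministic seeding under bench-core and bootstrap CIs under bench-stats. A seeded percentile bootstrap, which returns the same interval on every run for a given seed, might look like the sketch below. The function name, signature, and defaults are assumptions for illustration, not the library's actual interface.

```python
import random


def bootstrap_ci(
    scores: list[float],
    seed: int,
    n_resamples: int = 2000,
    level: float = 0.95,
) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean, reproducible for a fixed seed."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo_idx = max(0, round((1 - level) / 2 * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 + level) / 2 * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


scores = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
print(bootstrap_ci(scores, seed=42))
```

Passing the seed explicitly, rather than relying on global state, is one simple way to let anyone rerun a public split and get the same interval.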

Contributing

We love contributions big and small — a new vendor adapter, a slice that separates two adapters, or a whole stubbed primitive. Start with CONTRIBUTING.md; the best first issue is implementing one of the 🚧 verticals using eval-websearch / eval-extraction as the template.

License

Code: Apache-2.0.
Public datasets under golden-sets-public/: CC-BY-4.0.
Third-party attribution is in NOTICE. We learn from lm-evaluation-harness, Inspect, HELM, ann-benchmarks, VectorDBBench, and OmniDocBench, but we do not vendor GPL or commercial dual-licensed code.

Built by the Primitive Bench team · primitivebench.com · LinkedIn
