Skip to content

primitive-bench/benchpublic

Repository files navigation

Primitive Bench

Primitive Bench

The vendor-neutral benchmark for AI infrastructure primitives.
OCR · web search · vector DBs · rerankers · retrieval · extraction · chunking · crawling · memory

License: Apache 2.0 PRs welcome Python 3.11+

Website · Docs · Methodology · Contributing · LinkedIn


The vendor-neutral benchmark for AI infrastructure primitives

Modern AI products are assembled from infrastructure primitives — OCR, web search, vector databases, rerankers, retrieval, extraction, chunking, crawling, memory. Choosing the right one is mostly folklore today. Primitive Bench turns that choice into evidence.

"One winner is a lie."

No primitive wins every slice. We publish per-slice, per-constraint results with confidence intervals and statistical separability — never a single global leaderboard.

This repo is the public trust anchor: the harness engine, the statistics library, the adapter SDK, the per-primitive eval packages, and the public dev splits. Held-out golden answers never live here — they sit behind the private eval server, so the scores stay honest.

What we benchmark

Primitive Package What it measures Status
Web search eval-websearch hit@k against golden-URL equivalence classes, sliced by intent ✅ Live
Extraction eval-extraction token survival of clean main-content extraction ✅ Live
OCR eval-ocr text fidelity across document types 🚧 Planned
Vector DBs eval-vectordb recall / latency / cost across index configs 🚧 Planned
Rerankers eval-reranker nDCG / MAP uplift over first-stage retrieval 🚧 Planned
Retrieval eval-retrieval nDCG@k, MAP@k, MRR@k, Recall@k 🚧 Planned
Chunking eval-chunking downstream retrieval quality by chunk strategy 🚧 Planned
Crawling eval-crawl coverage & freshness of fetched content 🚧 Planned
Memory long-horizon recall (LoCoMo-style) 🗺️ Roadmap

Filling in a 🚧 is the highest-impact first contribution — see CONTRIBUTING.md.

Why Primitive Bench

  • No fake #1. A winner is named for a slice only when it's statistically separable from the runner-up (McNemar p < α, non-overlapping CIs); otherwise we publish a tie band.
  • Real evals, not reviews. Every claim is backed by a canonical, citable statistic — McNemar, Wilson intervals, seeded bootstrap, Bradley-Terry / Elo.
  • Reproducible by anyone. Deterministic seeds, pinned versions, and public dev splits reproduce public runs bit-for-bit.
  • Neutral arbiter. No pay-to-rank. Three-tier ground truth (verified-external, authoritative-registry, sentinel-planted) with canary markers for contamination detection.

Quickstart

Packages aren't on PyPI yet — run from a clone for now.

uv sync
uv run bench run --primitive ocr --config configs/ocr.yaml
uv run bench view ./runs/<run_id>

The bench CLI scaffolds a config (bench init), runs an eval (bench run), summarizes slices with separability badges (bench view), and submits to the held-out eval server for scores only (bench submit).

How it works

Primitive Bench uses the proven harness shape — dataset → Task → Adapter → Scorer → result schema (converging with EleutherAI lm-eval, UK AISI Inspect, and Stanford HELM).

The Gate. bench-schemas is the frozen contract (v0.1.0): every package imports types only from it and writes only files it owns — no shared mutable state. That boundary is what lets the build lanes run in parallel without colliding. See apps/docs/DECISIONS.md (D-03) and the methodology.

Repo layout

packages/
  bench-schemas/   # THE FROZEN CONTRACT — RunManifest, ItemResult, SliceResult, ScorerOutput, AdapterSpec
  bench-core/      # harness engine: deterministic seeding, run/manifest, per-run dirs
  bench-stats/     # McNemar, Wilson, bootstrap CIs, hit@k, nDCG/MAP/MRR, Bradley-Terry
  bench-adapters/  # provider/primitive adapter SDK (lm-eval registry pattern)
  eval-*/          # one package per primitive: public golden dev set + scorer + slice defs
apps/
  cli/             # the `bench` CLI: init / run / view / submit
  docs/            # methodology + DECISIONS.md
golden-sets-public/  # PUBLIC dev splits only (canary-marked). Held-out answers NEVER here.

Contributing

We love contributions big and small — a new vendor adapter, a slice that separates two adapters, or a whole stubbed primitive. Start with CONTRIBUTING.md; the best first issue is implementing one of the 🚧 verticals using eval-websearch / eval-extraction as the template.

License

  • Code: Apache-2.0.
  • Public datasets under golden-sets-public/: CC-BY-4.0.
  • Third-party attribution is in NOTICE. We learn from lm-evaluation-harness, Inspect, HELM, ann-benchmarks, VectorDBBench, and OmniDocBench — and we do not vendor GPL/commercial-dual code.

Built by the Primitive Bench team · primitivebench.com · LinkedIn

Releases

No releases published

Packages

 
 
 

Contributors

Languages