htihle/open_closed_gap

Open vs closed AI model gap

How far behind the closed-weight frontier are open-weight models, and is the gap growing or shrinking? This repo measures the backward-facing delay: for each accuracy threshold the open frontier reaches on a benchmark, how much earlier did a closed model first reach it? It reproduces the "when closed and open models first reached each threshold" analysis (from Håvard Ihle's WeirdML post) across many benchmarks, using data from the Epoch AI Benchmarking Hub plus a few external leaderboards.
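The backward-facing delay can be illustrated with a minimal sketch. This is not the code in analysis.py: the function names, the `(release_date, score)` tuple layout, the toy scores, and the choice of `>=` for "reaches" are all assumptions made for illustration.

```python
from datetime import date

def first_crossing(results, threshold):
    """Earliest release date among models scoring at or above `threshold`.

    `results` is a list of (release_date, score) pairs (hypothetical layout).
    Returns None if no model reaches the threshold.
    """
    dates = [released for released, score in results if score >= threshold]
    return min(dates) if dates else None

def backward_delay_days(closed, open_, threshold):
    """Days between the closed and the open first-crossers of `threshold`."""
    closed_date = first_crossing(closed, threshold)
    open_date = first_crossing(open_, threshold)
    if closed_date is None or open_date is None:
        return None  # one side never reached this score: no datapoint
    return (open_date - closed_date).days

# Toy scores in percent, invented purely for illustration (not real results).
closed = [(date(2023, 3, 1), 40), (date(2024, 5, 1), 60)]
open_ = [(date(2023, 9, 1), 35), (date(2024, 1, 1), 45), (date(2025, 1, 1), 62)]

for threshold in (40, 60, 70):
    print(threshold, backward_delay_days(closed, open_, threshold))
```

In this toy setup the 40% threshold gives a delay of 306 days, the 60% threshold gives 245 days, and 70% yields no datapoint because neither side reaches it.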

Two parallel analyses

Identical except for how "open" is defined:

  • open_vs_closed/ — open = any open-weight model (Llama, Mistral, Phi, Gemma, gpt-oss … plus DeepSeek, Qwen, GLM, Kimi, …).
  • open_chinese_vs_closed/ — open = Chinese open-weight models only.

Groups are assigned by joining each benchmark to data/epoch_capabilities_index.csv: open = Model accessibility starts with "Open weights"; closed = "API access" or "Hosted access (no API)". The Chinese analysis additionally requires Country contains "China" (sourced from the index, not the often-blank benchmark CSVs).
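The grouping rule above can be sketched as a small function. This is a sketch under assumptions, not the repo's implementation: the function name is hypothetical, the closed labels are matched exactly, open-weight models without "China" in Country are dropped from the Chinese analysis, and the example accessibility and country strings are illustrative values.

```python
def assign_group(accessibility, country="", chinese_only=False):
    """Map one row of the capabilities index to "open", "closed", or None.

    `accessibility` stands for the "Model accessibility" column and
    `country` for the "Country" column of epoch_capabilities_index.csv.
    """
    accessibility = (accessibility or "").strip()
    if accessibility.startswith("Open weights"):
        # Chinese-only analysis: keep an open model only if Country mentions China.
        if chinese_only and "China" not in (country or ""):
            return None
        return "open"
    # Assumption: closed labels are matched exactly, not by prefix.
    if accessibility in ("API access", "Hosted access (no API)"):
        return "closed"
    return None  # anything else is excluded from both groups

# Illustrative rows (the label and country strings are example values).
print(assign_group("Open weights (unrestricted)", "China", chinese_only=True))
print(assign_group("Open weights (unrestricted)", "France", chinese_only=True))
print(assign_group("API access", "United States of America"))
```

Because the country comes from the index rather than the benchmark CSVs, a function like this would be applied after the join, once per matched model.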

Headline: the two analyses give nearly the same answer. Private, contamination-resistant benchmarks show roughly 1.7–1.9× the gap of public ones, and broadening "open" beyond Chinese models moves the absolute gaps by under two weeks (about 80% of the open first-crossers are Chinese models). See each folder's CATEGORIES.md for the numbers.

[Figure: Open vs closed gap, private vs public benchmarks]

Each marker is one benchmark × accuracy threshold; the y-axis is how many months after the closed frontier the open frontier reached that score. The blue/red curves are the public vs private (contamination-resistant) trends, and company logos mark notable open-model releases. open_vs_closed/plots/gaps_over_time_by_category.png shows the same data split by capability category instead.

Per-benchmark "delay timelines"

Every accepted benchmark also gets its own figure: for each score threshold, the green (closed) and blue (open) first-crossers, with a red bar showing the delay between them. Two examples (all 17 live in each folder's plots/, with a combined grid in plots/timeline_overview.png):

[Figures: GPQA Diamond, science/knowledge (public); WeirdML, ML coding (private)]

Reproduce

pip install pandas matplotlib numpy
cd open_vs_closed && python3 analysis.py          # or: cd open_chinese_vs_closed

Each script reads the frozen CSVs in ../data/ and writes figures + a machine-readable plots/_datapoints.csv (every accepted datapoint: benchmark, threshold, first-crosser models/dates, gap). The input data is committed so the figures reproduce exactly; regenerate_data.py documents (and can re-derive) how that frozen data was built from raw Epoch data + the small hand-edits in data_patches/. Caveat: Epoch's Hub is a live dataset, so re-fetching tracks current data, which may differ from the committed snapshot behind the published figures.

How benchmarks were selected & datapoints accepted

A datapoint (benchmark × threshold) is kept only if both first-crossers — the earliest closed model and the earliest open model above that score — are plausibly the genuine earliest models of their type, judged per-pair by independent reviews (briefs in REVIEW_INSTRUCTIONS.md, per-benchmark reasoning in justifications/). Selection, verdicts, owner overrides, and the public/private + capability-category breakdowns are documented in each folder's CATEGORIES.md (and open_chinese_vs_closed/SELECTION.md).

Score provenance (please read before citing)

The accepted benchmarks differ a lot in how trustworthy their scores are: some are run by a single independent evaluator in one harness, others aggregate self-reported numbers. provenance_audit/ has a per-benchmark audit and a blog-ready summary table (APPENDIX.md). In short:

  • Cleanly run: GPQA Diamond, MATH Level 5, OTIS Mock AIME (Epoch-run); WeirdML, SimpleBench, METR (single-operator).
  • Self-reported / submission-based: GSM8K, MMLU, MMLU-Pro, Aider Polyglot, Terminal-Bench (and HLE's open side).
  • Specific caveats: FrontierMath and ARC-AGI both involve possible contamination on the closed side (OpenAI's FrontierMath problem access; ARC's semi-private set being sent to commercial APIs).

Closed-side contamination doesn't make open look artificially good, but note the direction: inflating closed scores makes the closed frontier cross thresholds earlier, which widens the measured gap. So if anything those two benchmarks may overstate the gap, which is a reason for caution about the private-side numbers, not reassurance.
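The direction of that bias can be checked with a toy calculation. Everything here is invented for illustration (the scores, the dates, and the 10-point inflation); the helper is a hypothetical stand-in, not the repo's code.

```python
from datetime import date

def gap_days(closed, open_, threshold):
    """Days from the closed first-crosser to the open first-crosser."""
    closed_dates = [d for d, score in closed if score >= threshold]
    open_dates = [d for d, score in open_ if score >= threshold]
    if not closed_dates or not open_dates:
        return None
    return (min(open_dates) - min(closed_dates)).days

# Hypothetical "true" closed scores (percent) and open scores.
closed_true = [(date(2024, 1, 1), 20), (date(2024, 7, 1), 30)]
open_scores = [(date(2025, 1, 1), 31)]

# Assume contamination adds 10 points to every reported closed score.
inflation = 10
closed_reported = [(d, score + inflation) for d, score in closed_true]

honest_gap = gap_days(closed_true, open_scores, 30)
inflated_gap = gap_days(closed_reported, open_scores, 30)
print(honest_gap, inflated_gap)
```

With the uninflated scores the closed frontier crosses 30% on 2024-07-01 (gap of 184 days); with the inflated scores it appears to cross on 2024-01-01, so the measured gap grows to 366 days even though the open side is unchanged.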

Layout

data/                      frozen input CSVs + epoch_capabilities_index.csv + icons/ (shared)
data_patches/              the hand-edits applied on top of raw Epoch data (audit trail)
regenerate_data.py         re-derives data/ from upstream sources + data_patches/
open_vs_closed/            analysis.py, plots/, justifications/, CATEGORIES.md, …
open_chinese_vs_closed/    same, for the Chinese-only open definition
provenance_audit/          per-benchmark score-provenance audit + APPENDIX.md

Data sources & licensing

Code is MIT (see LICENSE). Benchmark data is redistributed from third parties:

  • Epoch AI Benchmarking Hub — most CSVs; CC-BY 4.0, cite Epoch AI (see data/README.md).
  • WeirdML (Håvard Tveit Ihle), SimpleBench (AI Explained), METR time horizons, fiction.live (FictionLiveBench), ARC Prize (ARC-AGI/-2), TIGER-Lab (MMLU-Pro) — mirrored via Epoch or fetched directly; see provenance_audit/ for sources per benchmark.

Provider logos under data/icons/ are trademarks of their respective owners, used here only as figure markers.
