How far behind the closed-weight frontier are open-weight models, and is the gap growing or shrinking? This repo measures the backward-facing delay: for each accuracy threshold the open frontier reaches on a benchmark, how much earlier did a closed model first reach it? It reproduces the "when closed and open models first reached each threshold" analysis (from Håvard Ihle's WeirdML post) across many benchmarks, using data from the Epoch AI Benchmarking Hub plus a few external leaderboards.
Two folders hold the two analyses, identical except for how "open" is defined:
- open_vs_closed/: open = any open-weight model (Llama, Mistral, Phi, Gemma, gpt-oss … plus DeepSeek, Qwen, GLM, Kimi, …).
- open_chinese_vs_closed/: open = Chinese open-weight models only.
Groups are assigned by joining each benchmark to data/epoch_capabilities_index.csv:
open = Model accessibility starts with "Open weights"; closed = "API access"
or "Hosted access (no API)". The Chinese analysis additionally requires Country
contains "China" (sourced from the index, not the often-blank benchmark CSVs).
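
The grouping rule above can be sketched in pandas as follows. This is a minimal illustration, not the code in analysis.py: the function name `assign_group`, the `Model` column, and the toy rows are assumptions made for the example. Only the `Model accessibility` and `Country` column names and the matching rules come from the description above.

```python
import numpy as np
import pandas as pd

def assign_group(index: pd.DataFrame, chinese_only: bool = False) -> pd.Series:
    """Label each index row as 'open', 'closed', or 'excluded'."""
    access = index["Model accessibility"].fillna("")
    is_open = access.str.startswith("Open weights")
    is_closed = access.isin(["API access", "Hosted access (no API)"])
    if chinese_only:
        # The Chinese-only analysis additionally requires Country to contain "China".
        in_china = index["Country"].fillna("").str.contains("China", regex=False)
        is_open = is_open & in_china
    labels = np.select(
        [is_open.to_numpy(dtype=bool), is_closed.to_numpy(dtype=bool)],
        ["open", "closed"],
        default="excluded",
    )
    return pd.Series(labels, index=index.index, name="group")

# Toy rows with hypothetical model names, for illustration only.
toy_index = pd.DataFrame({
    "Model": ["model-a", "model-b", "model-c", "model-d"],
    "Model accessibility": ["Open weights (unrestricted)", "API access",
                            "Open weights (restricted use)", None],
    "Country": ["China", "United States of America", "France", "China"],
})
groups_any_open = assign_group(toy_index)
groups_chinese = assign_group(toy_index, chinese_only=True)
```

Rows with a missing or unrecognised accessibility value fall into neither group, which is why the sketch carries an explicit "excluded" label.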
Headline: the two analyses give nearly the same answer. Private,
contamination-resistant benchmarks show roughly 1.7–1.9× the gap of public ones,
and broadening "open" beyond Chinese models moves the absolute gaps by under two weeks
(the open frontier is ~80% Chinese among first-crossers). See each folder's
CATEGORIES.md for the numbers.
Each marker is one benchmark × accuracy threshold; the y-axis is how many months
after the closed frontier the open frontier reached that score. The blue/red curves are
the public vs private (contamination-resistant) trends, and company logos mark notable
open-model releases. open_vs_closed/plots/gaps_over_time_by_category.png shows the same
data split by capability category instead.
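
As a rough sketch of what a single marker represents, the gap for one benchmark × threshold can be computed from the first-crossing dates of each group. The column names `date` and `score`, the `>=` comparison, and the 30.44-day month are assumptions made for this example; the repo's analysis.py may differ on all three.

```python
import pandas as pd

DAYS_PER_MONTH = 30.44  # assumed average month length

def first_crossing(scores: pd.DataFrame, threshold: float):
    """Earliest date on which any model in `scores` reaches the threshold, else None."""
    hits = scores.loc[scores["score"] >= threshold, "date"]
    return hits.min() if not hits.empty else None

def gap_months(closed: pd.DataFrame, open_: pd.DataFrame, threshold: float):
    """Months by which the open first-crosser trails the closed one, or None."""
    closed_date = first_crossing(closed, threshold)
    open_date = first_crossing(open_, threshold)
    if closed_date is None or open_date is None:
        return None  # one side never reached this threshold
    return (open_date - closed_date).days / DAYS_PER_MONTH

# Toy scores, for illustration only.
toy_closed = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-07-01"]),
    "score": [0.5, 0.8],
})
toy_open = pd.DataFrame({
    "date": pd.to_datetime(["2023-06-01", "2024-07-01"]),
    "score": [0.5, 0.8],
})
gap_at_half = gap_months(toy_closed, toy_open, 0.5)
gap_unreached = gap_months(toy_closed, toy_open, 0.9)
```

A threshold that either side never reaches yields no datapoint, so the measurement only looks backward from scores the open frontier has already achieved.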
Every accepted benchmark also gets its own figure: for each score threshold, the
green (closed) and blue (open) first-crossers, with a red bar showing the delay between
them. Two examples (all 17 live in each folder's plots/, with a combined grid in
plots/timeline_overview.png):
- GPQA Diamond: science/knowledge (public)
- WeirdML: ML coding (private)
pip install pandas matplotlib numpy
cd open_vs_closed && python3 analysis.py # or: cd open_chinese_vs_closed
Each script reads the frozen CSVs in ../data/ and writes figures + a machine-readable
plots/_datapoints.csv (every accepted datapoint: benchmark, threshold, first-crosser
models/dates, gap). The input data is committed so the figures reproduce exactly;
regenerate_data.py documents (and can re-derive) how that frozen data was built from
raw Epoch data + the small hand-edits in data_patches/. Caveat: Epoch's Hub is a
live dataset, so re-fetching tracks current data, which may differ from the committed
snapshot behind the published figures.
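
One way to consume plots/_datapoints.csv downstream is a per-benchmark summary. The sketch below assumes the file has columns named `benchmark` and `gap_months`; those names are guesses for illustration, so check the actual CSV header before using it.

```python
import pandas as pd

def median_gap_by_benchmark(datapoints: pd.DataFrame) -> pd.Series:
    """Median gap per benchmark, largest first (column names are assumed)."""
    return (
        datapoints.groupby("benchmark")["gap_months"]
        .median()
        .sort_values(ascending=False)
    )

# Toy rows standing in for pd.read_csv("plots/_datapoints.csv"); values are illustrative.
toy_datapoints = pd.DataFrame({
    "benchmark": ["bench-x", "bench-x", "bench-y"],
    "gap_months": [2.0, 4.0, 10.0],
})
summary = median_gap_by_benchmark(toy_datapoints)
```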
A datapoint (benchmark × threshold) is kept only if both first-crossers — the
earliest closed model and the earliest open model above that score — are plausibly the
genuine earliest models of their type, judged per-pair by independent reviews (briefs in
REVIEW_INSTRUCTIONS.md, per-benchmark reasoning in justifications/). Selection,
verdicts, owner overrides, and the public/private + capability-category breakdowns are
documented in each folder's CATEGORIES.md (and open_chinese_vs_closed/SELECTION.md).
The accepted benchmarks differ a lot in how trustworthy their scores are — some are run
by a single independent evaluator in one harness, others aggregate self-reported numbers.
provenance_audit/ has a per-benchmark audit and a blog-ready summary table
(APPENDIX.md). In short:
- Cleanly run: GPQA Diamond, MATH Level 5, OTIS Mock AIME (Epoch-run); WeirdML, SimpleBench, METR (single-operator).
- Self-reported / submission-based: GSM8K, MMLU, MMLU-Pro, Aider Polyglot, Terminal-Bench (and HLE's open side).
- Specific caveats: FrontierMath and ARC-AGI. Both involve possible contamination on the closed side (OpenAI's FrontierMath problem access; ARC's semi-private set being sent to commercial APIs).

That doesn't make open look artificially good, but note the
direction: inflating closed scores makes the closed frontier cross thresholds earlier,
which widens the measured gap. So if anything those two benchmarks may overstate the
gap, which is a reason for caution about the private-side numbers, not reassurance.
- data/: frozen input CSVs + epoch_capabilities_index.csv + icons/ (shared)
- data_patches/: the hand-edits applied on top of raw Epoch data (audit trail)
- regenerate_data.py: re-derives data/ from upstream sources + data_patches/
- open_vs_closed/: analysis.py, plots/, justifications/, CATEGORIES.md, …
- open_chinese_vs_closed/: same, for the Chinese-only open definition
- provenance_audit/: per-benchmark score-provenance audit + APPENDIX.md
Code is MIT (see LICENSE). Benchmark data is redistributed from third parties:
- Epoch AI Benchmarking Hub: most CSVs; CC-BY 4.0, cite Epoch AI (see
data/README.md).
- WeirdML (Håvard Tveit Ihle), SimpleBench (AI Explained), METR time horizons,
fiction.live (FictionLiveBench), ARC Prize (ARC-AGI/-2), TIGER-Lab (MMLU-Pro):
mirrored via Epoch or fetched directly; see
provenance_audit/ for sources per benchmark.
Provider logos under data/icons/ are trademarks of their respective owners, used here
only as figure markers.


