Open vs closed AI model gap

How far behind the closed-weight frontier are open-weight models, and is the gap growing or shrinking? This repo measures the backward-facing delay: for each accuracy threshold the open frontier reaches on a benchmark, how much earlier did a closed model first reach it? It reproduces the "when closed and open models first reached each threshold" analysis (from Håvard Ihle's WeirdML post) across many benchmarks, using data from the Epoch AI Benchmarking Hub plus a few external leaderboards.
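As an illustration of the backward-facing delay, here is a minimal sketch on made-up data. It is not the code in analysis.py: the column names (`group`, `date`, `score`), the `>=` comparison, and the 30.44 days-per-month conversion are all assumptions chosen for the example.

```python
import pandas as pd

DAYS_PER_MONTH = 30.44  # assumed average month length, for illustration only


def first_crossing(df, group, threshold):
    """Earliest date at which any model in `group` scored >= threshold, else None."""
    hits = df[(df["group"] == group) & (df["score"] >= threshold)]
    return hits["date"].min() if not hits.empty else None


def backward_gap_months(df, threshold):
    """Months between the closed and open first-crossers for one threshold."""
    open_date = first_crossing(df, "open", threshold)
    closed_date = first_crossing(df, "closed", threshold)
    if open_date is None or closed_date is None:
        return None  # one side never reached the threshold
    return (open_date - closed_date).days / DAYS_PER_MONTH


# Toy leaderboard with hypothetical models and scores.
scores = pd.DataFrame({
    "model": ["closed-a", "closed-b", "open-a", "open-b"],
    "group": ["closed", "closed", "open", "open"],
    "date": pd.to_datetime(["2023-01-01", "2023-07-01", "2023-06-01", "2024-01-01"]),
    "score": [0.50, 0.70, 0.45, 0.72],
})

for threshold in (0.5, 0.7, 0.8):
    print(threshold, backward_gap_months(scores, threshold))
```

A positive value means the open frontier reached the threshold after the closed frontier; `None` means at least one side has not reached it yet.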

Two parallel analyses

Identical except for how "open" is defined:

open_vs_closed/ — open = any open-weight model (Llama, Mistral, Phi, Gemma, gpt-oss … plus DeepSeek, Qwen, GLM, Kimi, …).
open_chinese_vs_closed/ — open = Chinese open-weight models only.

Groups are assigned by joining each benchmark to data/epoch_capabilities_index.csv: open = Model accessibility starts with "Open weights"; closed = "API access" or "Hosted access (no API)". The Chinese analysis additionally requires Country contains "China" (sourced from the index, not the often-blank benchmark CSVs).
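The grouping rule above can be sketched in pandas as follows. This is a hedged illustration, not the repo's implementation: the join key `Model`, the helper names, and the toy rows are assumptions; only the `Model accessibility` and `Country` column names and the matching rules come from the description above.

```python
import pandas as pd

CLOSED_VALUES = ("API access", "Hosted access (no API)")


def assign_group(accessibility):
    """Map a 'Model accessibility' value to 'open', 'closed', or None."""
    if not isinstance(accessibility, str):
        return None  # model missing from the index
    if accessibility.startswith("Open weights"):
        return "open"
    if accessibility in CLOSED_VALUES:
        return "closed"
    return None


def label_models(bench, index, key="Model"):
    """Join a benchmark table to the index and add group / Chinese-open flags."""
    # Drop the benchmark's own Country column so the index is the only source.
    bench = bench.drop(columns=["Country"], errors="ignore")
    merged = bench.merge(
        index[[key, "Model accessibility", "Country"]], on=key, how="left"
    )
    merged["group"] = merged["Model accessibility"].map(assign_group)
    is_china = merged["Country"].fillna("").str.contains("China")
    merged["is_chinese_open"] = (merged["group"] == "open") & is_china
    return merged


# Toy inputs with hypothetical model names.
bench = pd.DataFrame({
    "Model": ["model-a", "model-b", "model-c", "model-d"],
    "score": [0.61, 0.74, 0.58, 0.80],
    "Country": [None, None, None, None],  # often blank in benchmark CSVs
})
index = pd.DataFrame({
    "Model": ["model-a", "model-b", "model-c", "model-d"],
    "Model accessibility": [
        "Open weights (unrestricted)", "API access",
        "Open weights (restricted use)", "Unreleased",
    ],
    "Country": ["China", "United States of America", "France", "China"],
})

labeled = label_models(bench, index)
print(labeled[["Model", "group", "is_chinese_open"]])
```

Models whose accessibility matches neither rule get no group, so they drop out of both frontiers in this sketch.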

Headline: the two analyses give nearly the same answer. Private, contamination-resistant benchmarks show roughly 1.7–1.9× the gap of public ones, and broadening "open" beyond Chinese models moves the absolute gaps by under two weeks (about 80% of open first-crossers are Chinese models). See each folder's CATEGORIES.md for the numbers.

In the headline plot, each marker is one benchmark × accuracy threshold; the y-axis is how many months after the closed frontier the open frontier reached that score. The blue/red curves are the public vs private (contamination-resistant) trends, and company logos mark notable open-model releases. open_vs_closed/plots/gaps_over_time_by_category.png shows the same data split by capability category instead.

Per-benchmark "delay timelines"

Every accepted benchmark also gets its own figure: for each score threshold, the green (closed) and blue (open) first-crossers, with a red bar showing the delay between them. Two examples (all 17 live in each folder's plots/, with a combined grid in plots/timeline_overview.png):

GPQA Diamond: science/knowledge (public)
WeirdML: ML coding (private)

Reproduce

pip install pandas matplotlib numpy
cd open_vs_closed && python3 analysis.py          # or: cd open_chinese_vs_closed

Each script reads the frozen CSVs in ../data/ and writes figures + a machine-readable plots/_datapoints.csv (every accepted datapoint: benchmark, threshold, first-crosser models/dates, gap). The input data is committed so the figures reproduce exactly; regenerate_data.py documents (and can re-derive) how that frozen data was built from raw Epoch data + the small hand-edits in data_patches/. Caveat: Epoch's Hub is a live dataset, so re-fetching tracks current data, which may differ from the committed snapshot behind the published figures.
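Once the datapoints file exists, a per-benchmark summary is a few lines of pandas. The sketch below runs on a made-up table; the column names `benchmark`, `threshold`, and `gap_months` are assumptions, so check the actual header of plots/_datapoints.csv before adapting it.

```python
import pandas as pd


def summarize_gaps(datapoints, gap_col="gap_months"):
    """Count accepted thresholds and take the median gap per benchmark."""
    return (
        datapoints.groupby("benchmark")[gap_col]
        .agg(n_thresholds="count", median_gap="median")
        .sort_values("median_gap", ascending=False)
    )


# Toy stand-in for the datapoints file (hypothetical names and values).
datapoints = pd.DataFrame({
    "benchmark": ["bench-x", "bench-x", "bench-x", "bench-y", "bench-y"],
    "threshold": [0.3, 0.4, 0.5, 0.6, 0.7],
    "gap_months": [6.0, 8.0, 10.0, 3.0, 5.0],
})

summary = summarize_gaps(datapoints)
print(summary)
```

To run it on real output, replace the toy table with `pd.read_csv("plots/_datapoints.csv")` from inside an analysis folder and pass the real gap column name via `gap_col`.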

How benchmarks were selected & datapoints accepted

A datapoint (benchmark × threshold) is kept only if both first-crossers — the earliest closed model and the earliest open model above that score — are plausibly the genuine earliest models of their type, judged per-pair by independent reviews (briefs in REVIEW_INSTRUCTIONS.md, per-benchmark reasoning in justifications/). Selection, verdicts, owner overrides, and the public/private + capability-category breakdowns are documented in each folder's CATEGORIES.md (and open_chinese_vs_closed/SELECTION.md).

Score provenance (please read before citing)

The accepted benchmarks differ a lot in how trustworthy their scores are: some are run by a single independent evaluator in one harness, others aggregate self-reported numbers. provenance_audit/ has a per-benchmark audit and a blog-ready summary table (APPENDIX.md). In short:

Cleanly run: GPQA Diamond, MATH Level 5, OTIS Mock AIME (Epoch-run); WeirdML, SimpleBench, METR (single-operator).
Self-reported / submission-based: GSM8K, MMLU, MMLU-Pro, Aider Polyglot, Terminal-Bench (and HLE's open side).
Specific caveats: FrontierMath and ARC-AGI, both of which involve possible contamination on the closed side (OpenAI's FrontierMath problem access; ARC's semi-private set being sent to commercial APIs).

Possible closed-side contamination doesn't make open look artificially good, but note the direction: inflating closed scores makes the closed frontier cross thresholds earlier, which widens the measured gap. So if anything those two benchmarks may overstate the gap, which is a reason for caution about the private-side numbers, not reassurance.
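The direction of that bias can be checked on a toy example. All dates, scores, and the 0.03 inflation below are invented for illustration; the point is only that raising closed scores can pull the closed first-crossing date earlier and so lengthen the measured delay.

```python
from datetime import date


def first_crossing(dates, scores, threshold):
    """Earliest date whose score is >= threshold, else None."""
    hits = [d for d, s in zip(dates, scores) if s >= threshold]
    return min(hits) if hits else None


threshold = 0.40

# Hypothetical closed-model results, and one open model that crosses later.
closed_dates = [date(2024, 1, 1), date(2024, 4, 1), date(2024, 7, 1)]
closed_scores = [0.30, 0.38, 0.45]
open_first = first_crossing([date(2024, 10, 1)], [0.41], threshold)

# The same closed results with an assumed +0.03 contamination boost.
inflated_scores = [s + 0.03 for s in closed_scores]

clean_first = first_crossing(closed_dates, closed_scores, threshold)
inflated_first = first_crossing(closed_dates, inflated_scores, threshold)

gap_clean = (open_first - clean_first).days
gap_inflated = (open_first - inflated_first).days
print(gap_clean, gap_inflated)
```

In this toy case the boost moves the closed first-crosser from the July model to the April model, so the measured gap grows even though nothing about the open model changed.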

Layout

data/                      frozen input CSVs + epoch_capabilities_index.csv + icons/ (shared)
data_patches/              the hand-edits applied on top of raw Epoch data (audit trail)
regenerate_data.py         re-derives data/ from upstream sources + data_patches/
open_vs_closed/            analysis.py, plots/, justifications/, CATEGORIES.md, …
open_chinese_vs_closed/    same, for the Chinese-only open definition
provenance_audit/          per-benchmark score-provenance audit + APPENDIX.md

Data sources & licensing

Code is MIT (see LICENSE). Benchmark data is redistributed from third parties:

Epoch AI Benchmarking Hub — most CSVs; CC-BY 4.0, cite Epoch AI (see data/README.md).
WeirdML (Håvard Tveit Ihle), SimpleBench (AI Explained), METR time horizons, fiction.live (FictionLiveBench), ARC Prize (ARC-AGI/-2), TIGER-Lab (MMLU-Pro) — mirrored via Epoch or fetched directly; see provenance_audit/ for sources per benchmark.

Provider logos under data/icons/ are trademarks of their respective owners, used here only as figure markers.
