Recreates the "Which Table Format Do LLMs Understand Best?" blog experiment using the Inspect evaluation framework. The evaluation feeds 1,000 synthetic employee records to a model in 11 different formats and asks 1,000 numeric lookup questions (salary, age, years of experience, project count) to measure accuracy and token usage. The dataset generation is deterministic so results are reproducible across runs and models.
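Reproducibility comes from seeding the generators, so every run sees the same records and the same lookup questions. The sketch below illustrates the idea only; field names, value ranges, and question wording are assumptions, and the real generators live in `evals/table_formats_eval.py`.

```python
# Illustrative sketch only -- the actual generators in evals/table_formats_eval.py
# may use different field names, ranges, and question wording.
import random

FIELDS = ["age", "salary", "years_experience", "project_count"]

def make_records(num_records: int = 1000, seed: int = 42) -> list[dict]:
    """Generate synthetic employee records with a fixed seed so runs are reproducible."""
    rng = random.Random(seed)
    return [
        {
            "name": f"Employee {i}",
            "age": rng.randint(22, 65),
            "salary": rng.randint(40_000, 200_000),
            "years_experience": rng.randint(0, 40),
            "project_count": rng.randint(1, 30),
        }
        for i in range(num_records)
    ]

def make_questions(records: list[dict], seed: int = 42) -> list[tuple[str, int]]:
    """One numeric lookup question per record, paired with its expected answer."""
    rng = random.Random(seed)
    questions = []
    for record in records:
        field = rng.choice(FIELDS)
        question = f"What is the {field.replace('_', ' ')} of {record['name']}?"
        questions.append((question, record[field]))
    return questions
```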
- `evals/table_formats_eval.py` – Inspect tasks (one per format) with shared dataset/question generation utilities.
- `scripts/run_benchmarks.py` – Helper runner that loops over formats/models and shells out to `inspect eval`.
- `.env` – Provide `OPENAI_API_KEY` (or other provider keys) before running evaluations.
Using uv:
# Install dependencies based on pyproject + uv.lock
uv sync
# Run Inspect directly via uv (no manual venv activation needed)
uv run inspect eval evals/table_formats_eval.py@table_formats_markdown_kv \
--model openai/gpt-4.1-mini \
--limit 25
# Call the helper script the same way
uv run python scripts/run_benchmarks.py --models openai/gpt-4.1-mini --limit 50
Run an individual evaluation directly with Inspect (the example uses OpenAI's `gpt-4.1-mini`):
uv run inspect eval evals/table_formats_eval.py@table_formats_markdown_kv \
--model openai/gpt-4.1-mini \
--log-dir logs/gpt-4.1-mini/markdown-kv
# Tweak dataset size if needed (example: 200 records)
uv run inspect eval evals/table_formats_eval.py@table_formats_markdown_kv \
--model openai/gpt-4.1-mini \
-T num_records=200
Use `--limit 25` for a quick smoke test before running the full 1,000 samples.
The helper script executes all (or a chosen subset of) formats across one or more models:
uv run python scripts/run_benchmarks.py \
--models openai/gpt-4.1-mini openai/gpt-4.1-nano \
--formats markdown_kv markdown_table json csv \
--limit 200 \
--num-records 200 \
--inspect-args --display plain
Omit `--limit` to reproduce the full benchmark. Logs for each run are written under `inspect-logs/<model>/<format>` by default; add `--no-logs` to suppress log files.
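Conceptually, the helper does little more than loop over the model/format grid and shell out to `inspect eval`. The sketch below is a simplified illustration, not the actual interface of `scripts/run_benchmarks.py`, which handles flag parsing, format subsets, and logging differently.

```python
# Simplified sketch of what a runner like scripts/run_benchmarks.py does:
# loop over models and formats, shelling out to `inspect eval` for each pair.
import subprocess

def run_benchmarks(models: list[str], formats: list[str], limit: int | None = None) -> None:
    for model in models:
        for fmt in formats:
            cmd = [
                "inspect", "eval",
                f"evals/table_formats_eval.py@table_formats_{fmt}",
                "--model", model,
                "--log-dir", f"inspect-logs/{model}/{fmt}",
            ]
            if limit is not None:
                cmd += ["--limit", str(limit)]
            subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_benchmarks(["openai/gpt-4.1-mini"], ["markdown_kv", "json"], limit=25)
```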
Inspect prints aggregate accuracy after each run. Token usage per sample and per run is recorded in the log directories mentioned above (see the `metrics.json` files for structured data). These logs mirror the blog's reporting (accuracy plus usage) and make it easy to compare models side-by-side.
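For a quick side-by-side view you can aggregate those files yourself. The keys read below (`accuracy`, `total_tokens`) are assumptions about the `metrics.json` schema, so inspect a real file and adjust before relying on the output.

```python
# Hypothetical aggregation script -- the "accuracy" and "total_tokens" keys are
# assumptions; check an actual metrics.json and adapt to its real schema.
import json
from pathlib import Path

for metrics_path in sorted(Path("inspect-logs").rglob("metrics.json")):
    label = metrics_path.parent.relative_to("inspect-logs")   # <model>/<format>
    data = json.loads(metrics_path.read_text())
    accuracy = data.get("accuracy", "n/a")
    tokens = data.get("total_tokens", "n/a")
    print(f"{str(label):45} accuracy={accuracy} tokens={tokens}")
```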
- Add additional Inspect model identifiers to the `--models` list, or set `INSPECT_EVAL_MODEL` globally.
- To prototype new formats, extend `FORMAT_SPECS`/`FORMAT_ORDER` in `evals/table_formats_eval.py` with a new formatter function (see the sketch below). The helper script picks up new tasks automatically if you follow the naming scheme `table_formats_<format>`.
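A rough sketch of what that extension might look like, assuming `FORMAT_SPECS` maps a format name to a callable that renders the record list as a string and `FORMAT_ORDER` is a list of format names; check the real structures in `evals/table_formats_eval.py` before copying this.

```python
# Sketch only: assumes FORMAT_SPECS maps a format name to a callable taking the
# list of record dicts and returning a string, and FORMAT_ORDER lists format names.
# Verify the actual structures in evals/table_formats_eval.py first.

def format_pipe_kv(records: list[dict]) -> str:
    """Hypothetical 'pipe_kv' format: one record per line as pipe-separated key=value pairs."""
    return "\n".join(
        " | ".join(f"{key}={value}" for key, value in record.items())
        for record in records
    )

# Registering it would then look roughly like:
# FORMAT_SPECS["pipe_kv"] = format_pipe_kv
# FORMAT_ORDER.append("pipe_kv")
# ...which yields a task named table_formats_pipe_kv that the helper script picks up.
```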
Each full benchmark run issues 11,000 model calls per model. Start with a small `--limit` to validate credentials and expected behaviour before committing to the full cost.