Recreates the "Which Table Format Do LLMs Understand Best?" blog experiment using the Inspect evaluation framework. The evaluation feeds 1,000 synthetic employee records to a model in 11 different formats and asks 1,000 numeric lookup questions (salary, age, years of experience, project count) to measure accuracy and token usage. The dataset generation is deterministic so results are reproducible across runs and models.
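Reproducibility comes from seeding the generators, so every run sees the same records and the same lookup questions. The sketch below illustrates the idea only; field names, value ranges, and question wording are assumptions, and the real generators live in `evals/table_formats_eval.py`.

```python
# Illustrative sketch only -- the actual generators in evals/table_formats_eval.py
# may use different field names, ranges, and question wording.
import random

FIELDS = ["age", "salary", "years_experience", "project_count"]

def make_records(num_records: int = 1000, seed: int = 42) -> list[dict]:
    """Generate synthetic employee records with a fixed seed so runs are reproducible."""
    rng = random.Random(seed)
    return [
        {
            "name": f"Employee {i}",
            "age": rng.randint(22, 65),
            "salary": rng.randint(40_000, 200_000),
            "years_experience": rng.randint(0, 40),
            "project_count": rng.randint(1, 30),
        }
        for i in range(num_records)
    ]

def make_questions(records: list[dict], seed: int = 42) -> list[tuple[str, int]]:
    """One numeric lookup question per record, paired with its expected answer."""
    rng = random.Random(seed)
    questions = []
    for record in records:
        field = rng.choice(FIELDS)
        question = f"What is the {field.replace('_', ' ')} of {record['name']}?"
        questions.append((question, record[field]))
    return questions
```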
- `evals/table_formats_eval.py` – Inspect tasks (one per format) with shared dataset/question generation utilities.
- `scripts/run_benchmarks.py` – Helper runner that loops over formats/models and shells out to `inspect eval`.
- `.env` – Provide `OPENAI_API_KEY` (or other provider keys) before running evaluations.
Using uv:
# Install dependencies based on pyproject + uv.lock
uv sync
# Run Inspect directly via uv (no manual venv activation needed)
uv run inspect eval evals/table_formats_eval.py@table_formats_markdown_kv \
--model openai/gpt-4.1-mini \
--limit 25
# Call the helper script the same way
uv run python scripts/run_benchmarks.py --models openai/gpt-4.1-mini --limit 50
Run an individual evaluation directly with Inspect (the example uses OpenAI's `gpt-4.1-mini`):
uv run inspect eval evals/table_formats_eval.py@table_formats_markdown_kv \
--model openai/gpt-4.1-mini \
--log-dir logs/gpt-4.1-mini/markdown-kv
# Tweak dataset size if needed (example: 200 records)
uv run inspect eval evals/table_formats_eval.py@table_formats_markdown_kv \
--model openai/gpt-4.1-mini \
-T num_records=200
Use `--limit 25` for a quick smoke test before running the full 1,000 samples.
The helper script executes all (or a chosen subset of) formats across one or more models:
uv run python scripts/run_benchmarks.py \
--models openai/gpt-4.1-mini openai/gpt-4.1-nano \
--formats markdown_kv markdown_table json csv \
--limit 200 \
--num-records 200 \
--inspect-args --display plain
Omit `--limit` to reproduce the full benchmark. Logs for each run are written under `inspect-logs/<model>/<format>` by default; add `--no-logs` to suppress log files.
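Conceptually, the helper does little more than loop over the model/format grid and shell out to `inspect eval`. The sketch below is a simplified illustration, not the actual interface of `scripts/run_benchmarks.py`, which handles flag parsing, format subsets, and logging differently.

```python
# Simplified sketch of what a runner like scripts/run_benchmarks.py does:
# loop over models and formats, shelling out to `inspect eval` for each pair.
import subprocess

def run_benchmarks(models: list[str], formats: list[str], limit: int | None = None) -> None:
    for model in models:
        for fmt in formats:
            cmd = [
                "inspect", "eval",
                f"evals/table_formats_eval.py@table_formats_{fmt}",
                "--model", model,
                "--log-dir", f"inspect-logs/{model}/{fmt}",
            ]
            if limit is not None:
                cmd += ["--limit", str(limit)]
            subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_benchmarks(["openai/gpt-4.1-mini"], ["markdown_kv", "json"], limit=25)
```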
Inspect prints aggregate accuracy after each run. Token usage per sample and per run is recorded in the log directories mentioned above (see the `metrics.json` files for structured data). These logs mirror the blog's reporting (accuracy plus usage) and make it easy to compare models side-by-side.
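For a quick side-by-side view you can aggregate those files yourself. The keys read below (`accuracy`, `total_tokens`) are assumptions about the `metrics.json` schema, so inspect a real file and adjust before relying on the output.

```python
# Hypothetical aggregation script -- the "accuracy" and "total_tokens" keys are
# assumptions; check an actual metrics.json and adapt to its real schema.
import json
from pathlib import Path

for metrics_path in sorted(Path("inspect-logs").rglob("metrics.json")):
    label = metrics_path.parent.relative_to("inspect-logs")   # <model>/<format>
    data = json.loads(metrics_path.read_text())
    accuracy = data.get("accuracy", "n/a")
    tokens = data.get("total_tokens", "n/a")
    print(f"{str(label):45} accuracy={accuracy} tokens={tokens}")
```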
- Add additional Inspect model identifiers to the `--models` list, or set `INSPECT_EVAL_MODEL` globally.
- To prototype new formats, extend `FORMAT_SPECS`/`FORMAT_ORDER` in `evals/table_formats_eval.py` with a new formatter function (see the sketch below). The helper script picks up new tasks automatically if you follow the naming scheme `table_formats_<format>`.
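A rough sketch of what that extension might look like, assuming `FORMAT_SPECS` maps a format name to a callable that renders the record list as a string and `FORMAT_ORDER` is a list of format names; check the real structures in `evals/table_formats_eval.py` before copying this.

```python
# Sketch only: assumes FORMAT_SPECS maps a format name to a callable taking the
# list of record dicts and returning a string, and FORMAT_ORDER lists format names.
# Verify the actual structures in evals/table_formats_eval.py first.

def format_pipe_kv(records: list[dict]) -> str:
    """Hypothetical 'pipe_kv' format: one record per line as pipe-separated key=value pairs."""
    return "\n".join(
        " | ".join(f"{key}={value}" for key, value in record.items())
        for record in records
    )

# Registering it would then look roughly like:
# FORMAT_SPECS["pipe_kv"] = format_pipe_kv
# FORMAT_ORDER.append("pipe_kv")
# ...which yields a task named table_formats_pipe_kv that the helper script picks up.
```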
Each full benchmark run issues 11,000 model calls per model. Start with a small `--limit` to validate credentials and expected behaviour before committing to the full cost.