Getting Started

Install

pip install itemeval        # or: uv add itemeval (in a project) / uv tool install itemeval (as a CLI)

Scaffold a runnable study and drive it end to end. It runs free on the mock provider, so no API key is needed:

itemeval init my_study      # writes my_study/config.yaml (templates resolve from the package)
cd my_study
itemeval status   config.yaml
itemeval estimate config.yaml
itemeval generate config.yaml --yes
itemeval grade    config.yaml --yes
itemeval export   config.yaml

init writes only config.yaml. Its builtin: prompt and rubric references resolve from templates packaged inside itemeval, so nothing else is needed to run. Pass --with-templates to also copy those templates into my_study/prompts/ and my_study/rubrics/ as editable starters; the config's references are then rewritten to point at the local copies. Outputs land under the directory you run the commands from, so after cd my_study they are written to my_study/studies/my_study/.
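The command sequence above can also be driven from a script. The following is a minimal sketch, not part of itemeval: the helper names build_commands and run_pipeline are hypothetical, the stage order and the --yes flags are copied from the listing above, and it assumes the CLI exits with a nonzero status when a stage fails.

```python
import shlex
import subprocess

# Stage order taken from the command listing above; --yes is passed only to
# the two stages that are shown with it.
STAGES = [
    ("status", False),
    ("estimate", False),
    ("generate", True),
    ("grade", True),
    ("export", False),
]


def build_commands(config_path="config.yaml", executable="itemeval"):
    """Return one argv list per stage, in pipeline order."""
    commands = []
    for stage, needs_yes in STAGES:
        argv = [executable, stage, config_path]
        if needs_yes:
            argv.append("--yes")
        commands.append(argv)
    return commands


def run_pipeline(config_path="config.yaml", dry_run=True):
    """Print each command; also execute it unless dry_run is set."""
    for argv in build_commands(config_path):
        print(shlex.join(argv))
        if not dry_run:
            # Assumption: a failing stage returns a nonzero exit code,
            # in which case check=True stops the pipeline here.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

With dry_run=True the script only prints the five commands, which is a cheap way to confirm the order before running anything.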

API keys are read from the environment (OPENAI_API_KEY, ANTHROPIC_API_KEY, OPENROUTER_API_KEY, ...) following inspect_ai's provider conventions.
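Before a paid run it can help to confirm which keys the process can actually see. This is a small sketch under stated assumptions: the helper key_status is hypothetical, and it checks only the three variable names listed above (other providers use their own names).

```python
import os

# Variable names listed in the text; other providers use their own names.
KEY_VARS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "OPENROUTER_API_KEY"]


def key_status(environ=None):
    """Map each variable name to True if it is set to a non-empty value."""
    env = os.environ if environ is None else environ
    return {name: bool(env.get(name)) for name in KEY_VARS}


if __name__ == "__main__":
    for name, present in key_status().items():
        # Report presence only; never print the key itself.
        print(f"{name}: {'set' if present else 'missing'}")
```

The mock provider needs none of these, so a report of all three missing is fine for the demo.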

Install from source (development)

git clone https://github.com/luozm/itemeval && cd itemeval
uv sync                       # creates ./.venv from pyproject.toml + uv.lock
./.venv/bin/python -m pytest  # one test downloads a small public HF dataset

The repo ships example configs under configs/; the 5-minute demo below uses one. No key is needed — it runs on the free mockllm/* provider.

The 5-minute demo (zero paid API calls)

configs/usamo_demo.yaml: 6 USAMO 2025 problems (public HuggingFace dataset, revision-pinned) solved by three mock models under two builtin: prompts, two replications each, graded by a mock judge against the builtin:standard rubric.
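To get a feel for the size of such a design, the cells can be enumerated directly. This sketch assumes the design is fully crossed; the labels below are placeholders, not the real item ids, model ids, or prompt names from the config, and expand_grid is a hypothetical helper rather than an itemeval function.

```python
from itertools import product


def expand_grid(items, models, prompts, replications):
    """Enumerate every (item, model, prompt, replication) cell of a fully crossed design."""
    return list(product(items, models, prompts, range(1, replications + 1)))


# Placeholder labels: the real ids and names come from the config.
items = [f"item_{i}" for i in range(1, 7)]   # 6 problems
models = ["mock_a", "mock_b", "mock_c"]      # 3 mock models
prompts = ["prompt_1", "prompt_2"]           # 2 prompts

full = expand_grid(items, models, prompts, replications=2)
print(len(full))    # 6 * 3 * 2 * 2 = 72

# Restricting to the first 2 items, as the dev policy is described as doing.
subset = expand_grid(items[:2], models, prompts, replications=2)
print(len(subset))  # 2 * 3 * 2 * 2 = 24
```

How status groups and counts these cells is not specified here, so treat the totals as arithmetic on the stated design rather than as the tool's output.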

./.venv/bin/itemeval status   configs/usamo_demo.yaml   # see the expanded grid (nothing run yet)
./.venv/bin/itemeval estimate configs/usamo_demo.yaml   # projected cost per stage
./.venv/bin/itemeval generate configs/usamo_demo.yaml --yes
./.venv/bin/itemeval grade    configs/usamo_demo.yaml --yes
./.venv/bin/itemeval export   configs/usamo_demo.yaml
./.venv/bin/itemeval status   configs/usamo_demo.yaml   # everything 24/24 complete

After this, studies/usamo_demo/ contains the full output tree — see Outputs and Schemas. The analysis-ready file is:

import pandas as pd
df = pd.read_parquet("studies/usamo_demo/export/gradings_long.parquet")
# one row per grading event: item x model x prompt x replication x grader x rubric
df[["item_id", "model", "prompt_name", "replication", "score", "reasoning"]]
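A typical first step with the long-format table is to average scores per model and prompt. The sketch below assumes score is numeric and uses a small made-up frame in place of gradings_long.parquet, so the values are illustrative only; the column names are the ones shown above, and summarize is a hypothetical helper.

```python
import pandas as pd


def summarize(df):
    """Mean score and row count per model x prompt, pooled over items and replications."""
    return (
        df.groupby(["model", "prompt_name"])["score"]
        .agg(mean_score="mean", n="count")
        .reset_index()
    )


# Made-up rows standing in for gradings_long.parquet; scores are illustrative only.
demo = pd.DataFrame({
    "item_id":     ["p1", "p1", "p1", "p1", "p2", "p2", "p2", "p2"],
    "model":       ["m_a", "m_a", "m_b", "m_b", "m_a", "m_a", "m_b", "m_b"],
    "prompt_name": ["minimal", "standard"] * 4,
    "replication": [1] * 8,
    "score":       [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
})

summary = summarize(demo)
print(summary)
```

On the real export, replace demo with the df loaded by read_parquet. If several graders or rubrics are present, adding their columns to the groupby keys keeps them from being pooled together.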

Re-running any command is safe: completed work is skipped (reported as skipped: complete), and because inspect_ai caches responses, even a --force re-run of identical calls costs nothing.

Adapting it to your study

Scaffold a study: itemeval init my_study (installed), or copy configs/usamo_demo.yaml (from a clone). Point benchmark.datasets at your HuggingFace dataset and adjust benchmark.mapping to its column names (Configuration).
Choose prompts: keep the packaged builtin:minimal / builtin:standard, or write your own — one Markdown file per variant in prompts/solver/<name>.md containing an {input} placeholder, referenced by its bare name. (Run itemeval init --with-templates to start from editable copies of the built-ins.)
Pick grading: either facets.scorer: exact_match | multiple_choice | numeric (free, no LLM) or judge grading — keep builtin:standard or add rubric files in rubrics/<name>.md with {input} and {solution} placeholders, plus a graders: section naming the judge models.
Swap in real models: replace mockllm/... with real inspect model ids (openai/gpt-5-mini, anthropic/claude-haiku-4-5, openrouter/deepseek/deepseek-v3.2, ...).
Keep budget.policy: dev until the pipeline looks right — dev runs only the first 2 items. Then switch to full-interactive or full-batch.
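When writing your own prompt and rubric files, a quick check for the required placeholders can catch mistakes before any run. This is a minimal sketch: missing_placeholders is a hypothetical helper, the template strings are invented examples, and it only looks for the literal {input} and {solution} tokens described above without assuming anything about how itemeval substitutes them.

```python
import re


def missing_placeholders(template_text, required):
    """Return the required placeholder names absent from the template, sorted."""
    found = set(re.findall(r"\{(\w+)\}", template_text))
    return sorted(set(required) - found)


# Invented template texts for illustration.
solver_prompt = "Solve the problem.\n\n{input}\n"
rubric = "Problem:\n{input}\n\nGrade the answer on a 0-1 scale."

print(missing_placeholders(solver_prompt, ["input"]))        # []
print(missing_placeholders(rubric, ["input", "solution"]))   # ['solution']
```

To check real files, read each one with pathlib.Path(...).read_text() and pass the text in: solver prompts need input, rubrics need input and solution.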

Always run estimate before the first paid run, and refresh pricing first:

itemeval estimate my_study/config.yaml --refresh-pricing

Where to next

The tutorials walk through each typical use case end to end, with real (cheap) runs:

Score a verifiable benchmark — the full pipeline on AIME 2025 for ~2¢. Start here.
Grade with an LLM judge — rubric-based judging of open-ended answers.
Compare models and prompts — a crossed design with replications, analyzed in pandas.
Add a second judge at $0 generation — the two-stage payoff: judge-sensitivity studies.
Scale up without surprises — policies, gates, batch mode, and the savings report.

Driving itemeval from an AI agent (or wiring it into one)? Hand the agent the Agent Guide.
