-
Notifications
You must be signed in to change notification settings - Fork 0
Getting Started
pip install itemeval # or: uv add itemeval (in a project) / uv tool install itemeval (as a CLI)Scaffold a runnable study and drive it — runs free on the mock provider, no key:
itemeval init my_study # writes my_study/config.yaml (templates resolve from the package)
cd my_study
itemeval status config.yaml
itemeval estimate config.yaml
itemeval generate config.yaml --yes
itemeval grade config.yaml --yes
itemeval export config.yamlinit writes only config.yaml; its builtin: prompt/rubric references resolve
from templates packaged inside itemeval, so nothing else is needed to run. Pass
--with-templates to also copy those templates into my_study/prompts/ and
my_study/rubrics/ as editable starters (the config's references are rewritten
to point at the local copies). Outputs land under the current directory:
my_study/studies/my_study/.
API keys are read from the environment (OPENAI_API_KEY, ANTHROPIC_API_KEY,
OPENROUTER_API_KEY, ...) following inspect_ai's provider conventions.
git clone https://github.com/luozm/itemeval && cd itemeval
uv sync # creates ./.venv from pyproject.toml + uv.lock
./.venv/bin/python -m pytest # one test downloads a small public HF datasetThe repo ships example configs under configs/; the 5-minute demo below uses one.
No key is needed — it runs on the free mockllm/* provider.
configs/usamo_demo.yaml: 6 USAMO 2025 problems (public HuggingFace dataset,
revision-pinned) solved by three mock models under two builtin: prompts, two
replications each, graded by a mock judge against the builtin:standard rubric.
./.venv/bin/itemeval status configs/usamo_demo.yaml # see the expanded grid (nothing run yet)
./.venv/bin/itemeval estimate configs/usamo_demo.yaml # projected cost per stage
./.venv/bin/itemeval generate configs/usamo_demo.yaml --yes
./.venv/bin/itemeval grade configs/usamo_demo.yaml --yes
./.venv/bin/itemeval export configs/usamo_demo.yaml
./.venv/bin/itemeval status configs/usamo_demo.yaml # everything 24/24 completeAfter this, studies/usamo_demo/ contains the full output tree — see
Outputs and Schemas. The analysis-ready file is:
import pandas as pd
df = pd.read_parquet("studies/usamo_demo/export/gradings_long.parquet")
# one row per grading event: item x model x prompt x replication x grader x rubric
df[["item_id", "model", "prompt_name", "replication", "score", "reasoning"]]Re-run any command — completed work is skipped (skipped: complete), and
inspect_ai's response cache means even --force re-runs of identical calls
cost nothing.
-
Scaffold a study:
itemeval init my_study(installed), or copyconfigs/usamo_demo.yaml(from a clone). Pointbenchmark.datasetsat your HuggingFace dataset and adjustbenchmark.mappingto its column names (Configuration). -
Choose prompts: keep the packaged
builtin:minimal/builtin:standard, or write your own — one Markdown file per variant inprompts/solver/<name>.mdcontaining an{input}placeholder, referenced by its bare name. (Runitemeval init --with-templatesto start from editable copies of the built-ins.) -
Pick grading: either
facets.scorer: exact_match | multiple_choice | numeric(free, no LLM) or judge grading — keepbuiltin:standardor add rubric files inrubrics/<name>.mdwith{input}and{solution}placeholders, plus agraders:section naming the judge models. -
Swap in real models: replace
mockllm/...with real inspect model ids (openai/gpt-5-mini,anthropic/claude-haiku-4-5,openrouter/deepseek/deepseek-v3.2, ...). -
Keep
budget.policy: devuntil the pipeline looks right — dev runs only the first 2 items. Then switch tofull-interactiveorfull-batch.
Always run estimate before the first paid run, and refresh pricing first:
itemeval estimate my_study/config.yaml --refresh-pricingThe tutorials walk every typical use case end to end, with real (cheap) runs:
- Score a verifiable benchmark — the full pipeline on AIME 2025 for ~2¢. Start here.
- Grade with an LLM judge — rubric-based judging of open-ended answers.
- Compare models and prompts — a crossed design with replications, analyzed in pandas.
- Add a second judge at $0 generation — the two-stage payoff: judge-sensitivity studies.
- Scale up without surprises — policies, gates, batch mode, and the savings report.
Driving itemeval from an AI agent (or wiring it into one)? Hand the agent the Agent Guide.