Getting Started
git clone https://github.com/luozm/itemeval && cd itemeval
uv sync # creates ./.venv from pyproject.toml + uv.lock
./.venv/bin/python -m pytest # 158 tests; one downloads a small public HF dataset

API keys are read from the environment (OPENAI_API_KEY, ANTHROPIC_API_KEY,
OPENROUTER_API_KEY, ...) following inspect_ai's provider conventions. No
key is needed for the demo below: it runs on the free mockllm/* provider.
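
Before switching to paid providers, it can help to check which keys are actually visible to the process. The snippet below is a minimal sketch and not part of itemeval: the helper name `missing_keys` is hypothetical, and the list covers only the three variables named above.

```python
import os

# Variable names taken from the text above; other providers use their own names.
PROVIDER_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "OPENROUTER_API_KEY"]


def missing_keys(names, env=None):
    """Return the variable names that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_keys(PROVIDER_KEYS)
if missing:
    print("not set:", ", ".join(missing))
else:
    print("all listed provider keys are set")
```

Only the keys for the providers your config references need to be set; the mock demo needs none.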
The repo ships configs/usamo_demo.yaml: 6 USAMO 2025 problems (public
HuggingFace dataset, revision-pinned) solved by three mock models under two
prompts, two replications each, graded by a mock judge against a rubric.
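
To make the size of that grid concrete, here is a rough sketch of the expansion as a plain Cartesian product. The labels and the `expand_grid` helper are hypothetical and do not reflect itemeval's internals; the `max_items=2` case mirrors the dev budget policy described further down, which runs only the first 2 items.

```python
from itertools import product

# Hypothetical labels; the real ids live in configs/usamo_demo.yaml.
items = [f"problem_{i}" for i in range(1, 7)]  # 6 USAMO 2025 problems
models = ["mock_a", "mock_b", "mock_c"]        # three mock models
prompts = ["prompt_1", "prompt_2"]             # two prompts
replications = [1, 2]                          # two replications each


def expand_grid(items, models, prompts, replications, max_items=None):
    """Enumerate every (item, model, prompt, replication) cell of the study."""
    selected = items if max_items is None else items[:max_items]
    return list(product(selected, models, prompts, replications))


full = expand_grid(items, models, prompts, replications)
dev = expand_grid(items, models, prompts, replications, max_items=2)
print(len(full), len(dev))  # 72 24
```

The count of 24 for a two-item run matches the 24/24 in the final status line of the demo, which would be consistent with the demo config using the dev policy; that reading is an assumption, not something stated in the text.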
./.venv/bin/itemeval status configs/usamo_demo.yaml # see the expanded grid (nothing run yet)
./.venv/bin/itemeval estimate configs/usamo_demo.yaml # projected cost per stage
./.venv/bin/itemeval generate configs/usamo_demo.yaml --yes
./.venv/bin/itemeval grade configs/usamo_demo.yaml --yes
./.venv/bin/itemeval export configs/usamo_demo.yaml
./.venv/bin/itemeval status configs/usamo_demo.yaml # everything 24/24 complete

After this, studies/usamo_demo/ contains the full output tree; see
Outputs and Schemas. The analysis-ready file is:
import pandas as pd
df = pd.read_parquet("studies/usamo_demo/export/gradings_long.parquet")
# one row per grading event: item x model x prompt x replication x grader x rubric
df[["item_id", "model", "prompt_name", "replication", "score", "reasoning"]]

Re-run any command: completed work is skipped (skipped: complete), and
inspect_ai's response cache means even --force re-runs of identical calls
cost nothing.
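
Because the export is long-format (one row per grading event), a typical first analysis is a mean score per model and prompt. The sketch below uses only the standard library and a few synthetic rows; the column names come from the snippet above, while the row values, the score scale, and the `mean_score_by` helper are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Synthetic rows shaped like gradings_long.parquet; values are illustrative only.
rows = [
    {"item_id": "p1", "model": "mock_a", "prompt_name": "plain", "replication": 1, "score": 1.0},
    {"item_id": "p1", "model": "mock_a", "prompt_name": "plain", "replication": 2, "score": 0.0},
    {"item_id": "p1", "model": "mock_b", "prompt_name": "plain", "replication": 1, "score": 1.0},
    {"item_id": "p2", "model": "mock_b", "prompt_name": "plain", "replication": 1, "score": 1.0},
]


def mean_score_by(rows, keys):
    """Average the score column within each combination of the given keys."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row["score"])
    return {group: mean(scores) for group, scores in groups.items()}


summary = mean_score_by(rows, ["model", "prompt_name"])
for group, value in sorted(summary.items()):
    print(group, value)
```

With the real file loaded into pandas, the equivalent is `df.groupby(["model", "prompt_name"])["score"].mean()`.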
- Copy a config: start from configs/usamo_demo.yaml. Point benchmark.datasets at your HuggingFace dataset and adjust benchmark.mapping to its column names (see Configuration).
- Write prompts: one Markdown file per prompt variant in prompts/solver/<name>.md, containing an {input} placeholder.
- Pick grading: either facets.scorer: exact_match | multiple_choice | numeric (free, no LLM), or judge grading, which needs rubric files in rubrics/<name>.md with {input} and {solution} placeholders, plus a graders: section naming the judge models.
- Swap in real models: replace mockllm/... with real inspect model ids (openai/gpt-5-mini, anthropic/claude-haiku-4-5, openrouter/deepseek/deepseek-v3.2, ...).
- Keep budget.policy: dev until the pipeline looks right; dev runs only the first 2 items. Then switch to full-interactive or full-batch.
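
A missing placeholder in a prompt or rubric file is an easy mistake to make before a paid run. The following is a minimal sketch of a pre-flight check, assuming placeholders are literal `{input}` and `{solution}` tokens as described above; the helper names and sample templates are hypothetical, and itemeval's own rendering may work differently.

```python
import re


def missing_placeholders(template, required):
    """Return the required placeholder names that do not appear in the template."""
    return [name for name in required if "{" + name + "}" not in template]


def fill(template, values):
    """Substitute {name} tokens in one pass, leaving all other braces untouched."""
    if not values:
        return template
    pattern = re.compile(r"\{(" + "|".join(map(re.escape, values)) + r")\}")
    return pattern.sub(lambda m: values[m.group(1)], template)


# Hypothetical file contents for prompts/solver/<name>.md and rubrics/<name>.md.
solver_prompt = "Solve the following problem and justify each step.\n\n{input}\n"
rubric = "Problem:\n{input}\n\nReference solution:\n{solution}\n\nGrade the answer."

print(missing_placeholders(solver_prompt, ["input"]))  # []
print(missing_placeholders(rubric, ["input", "solution"]))  # []
print(fill(solver_prompt, {"input": "Prove that x^{2} >= 0 for real x."}))
```

The single-pass regex substitution is used here instead of `str.format` because math problems often contain LaTeX braces such as `x^{2}`, which `str.format` would try to interpret as fields.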
Always run estimate before the first paid run, and refresh pricing first:
./.venv/bin/itemeval estimate configs/my_study.yaml --refresh-pricing