Error Handling
itemeval's rule is flag, never silently drop: every failure mode during a run leaves a durable, inspectable record, and re-running is always safe. This page is the single reference for what can go wrong, how it's recorded, how it's reported, and what re-running does.
A failure is one of two kinds:
- Setup errors raise an exception and stop the command before any model is called (bad config, missing template, dataset load failure, cost gate).
- Run-time failures never abort the stage: they are captured per row, written to the store, and summarized. inspect runs with fail_on_error=False, so one bad sample never sinks a whole condition.
During generate/grade, each failed sample resolves into exactly one of three
channels. All three keep the row; none is ever dropped.
| Channel | What it is | Recorded as | Reported as | Re-run behavior |
|---|---|---|---|---|
| Sample error | Provider/API failure: timeout, rate-limit exhaustion, 5xx, content filter, refusal | error set; solution/judge_completion null | errors=N (run summary; status err column) | Re-attempted; the row is pending again |
| Parse failure (grade only) | Judge replied, but no valid {"score": <number>} block | parse_ok=false, parse_error=<code>, judge_completion=raw text, score null | parse_failures=N (run summary; status parse_fail column) | Final; not retried, use grade --force |
| Empty completion | Generation finished with no API error but blank text (typically a reasoning model whose max_tokens was spent entirely on hidden reasoning) | error null, solution blank, stop_reason usually max_tokens | empty=N (status); grade prints a count + stop-reason breakdown | Governed by solvers.on_empty |
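The channel table can be read as a small classification rule over a stored row. The following is a minimal sketch, assuming rows are plain dicts carrying the column names from the table; the function names and the "ok" label are hypothetical and not part of itemeval.

```python
from collections import Counter


def classify_generate_row(row: dict) -> str:
    """Classify one generate row as 'error', 'empty', or 'ok'."""
    if row.get("error") is not None:
        return "error"  # sample error: error set, solution null
    solution = row.get("solution")
    if solution is None or not solution.strip():
        return "empty"  # no API error, but blank text
    return "ok"


def classify_grade_row(row: dict) -> str:
    """Classify one grade row as 'error', 'parse_failure', or 'ok'."""
    # Check error first: a failed judge call also has parse_ok=false.
    if row.get("error") is not None:
        return "error"
    if row.get("parse_ok") is False:
        return "parse_failure"
    return "ok"


def summarize(rows, classify) -> Counter:
    """Count rows per channel, similar in spirit to the run summary totals."""
    return Counter(classify(row) for row in rows)
```

For example, `summarize(grade_rows, classify_grade_row)` gives per-channel counts for a list of grade rows.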
inspect retries each erroring sample once in-eval (retry_on_error=1) before the
error is recorded. A recorded error row has error set and no solution. On
the next generate/grade of the same command, errored rows are treated as
incomplete and re-attempted (already-succeeded samples are served from inspect's
response cache, so they are not re-paid).
Judge parsing is strict: it tries fenced json code blocks last-to-first, then any
raw JSON object, and records an exact failure code in parse_error:
| Code | Meaning |
|---|---|
| no_json_object | No JSON object found anywhere in the completion |
| no_score_in_json | A JSON object was found, but it has no score key |
| score_not_numeric | score is present but not coercible to a finite number (or is a bool) |
| score_not_finite | score parsed to NaN/inf |
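The parsing order and failure codes above can be illustrated with a short sketch. This is not itemeval's parser: `parse_score` and its helpers are hypothetical, and it assumes the first candidate object that decodes is the one checked for a score key, which the text above does not specify.

```python
import json
import math
import re

FENCE = "`" * 3  # built indirectly so the example can live inside a fenced block
FENCED_JSON = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)


def _raw_objects(text):
    """Yield every JSON object that decodes starting at some '{' in text."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(text, i)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            yield obj


def parse_score(completion):
    """Return (score, parse_error); exactly one of the two is None."""
    candidates = []
    # Fenced json blocks first, last block before earlier ones.
    for body in reversed(FENCED_JSON.findall(completion)):
        try:
            obj = json.loads(body)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            candidates.append(obj)
    if not candidates:  # fall back to any raw JSON object
        candidates = list(_raw_objects(completion))
    if not candidates:
        return None, "no_json_object"
    obj = candidates[0]  # assumption: first decodable candidate wins
    if "score" not in obj:
        return None, "no_score_in_json"
    value = obj["score"]
    if isinstance(value, bool):  # bool is an int subclass, so reject it explicitly
        return None, "score_not_numeric"
    try:
        score = float(value)
    except (TypeError, ValueError):
        return None, "score_not_numeric"
    if not math.isfinite(score):
        return None, "score_not_finite"
    return score, None
```

Returning the code alongside the score, instead of raising, mirrors the idea that a parse failure is a result to be stored and not an exception.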
Parse failures are results, not errors: the row is kept with parse_ok=false
and the raw judge_completion for inspection, and is final — re-running
grade will not retry it. To re-grade, either change the rubric (its hash
changes, starting a fresh grade condition) or run grade --force. A sample-level
error during grading (the judge call itself failed) is the sample error
channel instead — error set, parse_ok=false, parse_error null — and is
re-attempted, not final.
An empty completion is a completed generation with no error but no gradable
text. This is a distinct channel from both errors and parse failures,
controlled by solvers.on_empty:
| Policy | Effect |
|---|---|
| skip (default) | Excluded from grading; surfaced in the grade summary and the status empty column |
| rerun | Also treated as not-done by generate, so a later generate re-attempts them (raise max_tokens / lower reasoning_effort first; an identical request hits the response cache and stays empty) |
| grade | Sent to the judge as-is (an empty answer, usually scored low) |
The usual cause is too small a max_tokens for a reasoning model — size the cap
for hidden reasoning plus the visible answer. See
Configuration for details.
If an entire inspect_ai.eval(...) raises — a misconfigured task, an
unreachable provider, an auth failure — itemeval catches it, records the
condition as status="error" with the exception message, and continues to the
next condition. No rows are written for that condition. The CLI prints:
[2/4] gpt-5-mini_builtin-standard_default ERROR: PrerequisiteError: ...
The command's exit code is 1 if any condition errored. Other conditions in the same run still complete and persist normally.
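The catch-record-continue behavior described above can be sketched as a loop. This is an illustration only: `run_conditions`, the `run_eval` callable, the record layout, and the "success" status value are all assumptions, not itemeval internals.

```python
def run_conditions(conditions, run_eval):
    """Run every condition, recording failures without stopping the run.

    `conditions` is a list of condition ids; `run_eval` is any callable that
    may raise. Returns (records, exit_code).
    """
    records = []
    total = len(conditions)
    for index, condition in enumerate(conditions, start=1):
        try:
            run_eval(condition)
        except Exception as exc:
            message = f"{type(exc).__name__}: {exc}"
            records.append({"condition": condition, "status": "error", "message": message})
            print(f"[{index}/{total}] {condition} ERROR: {message}")
        else:
            # "success" is a placeholder status chosen for this sketch.
            records.append({"condition": condition, "status": "success", "message": None})
    # Exit code 1 if any condition errored; the others still completed.
    exit_code = 1 if any(r["status"] == "error" for r in records) else 0
    return records, exit_code
```

The key design point is that the exit code is computed after the loop, so one failing condition never prevents later conditions from running.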
Setup errors raise an exception, print itemeval: error: <message>, and stop:
| Exception | Cause | Exit code |
|---|---|---|
| ConfigError | YAML shape/validation failure, bad grader/template reference | 2 |
| TemplateError | Missing template file or required placeholder | 2 |
| AdapterError | Dataset load or field-mapping failure | 2 |
| StoreError | Parquet schema/IO problem (e.g. grade with no solutions) | 1 |
| BudgetError | Pricing refresh / estimator failure | 1 |
The Python API raises these exceptions directly instead of mapping them to exit codes.
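One way to picture the exception-to-exit-code mapping is a thin CLI wrapper around a command. The classes below are stand-ins that only reuse the names from the table; the shared base class, the `exit_code` attribute, the `run_cli` function, and the choice of stderr are assumptions made for this sketch.

```python
import sys


class ItemevalError(Exception):
    """Hypothetical common base; unexpected errors default to exit code 1."""
    exit_code = 1


class ConfigError(ItemevalError):
    exit_code = 2


class TemplateError(ItemevalError):
    exit_code = 2


class AdapterError(ItemevalError):
    exit_code = 2


class StoreError(ItemevalError):
    exit_code = 1


class BudgetError(ItemevalError):
    exit_code = 1


def run_cli(command) -> int:
    """Call `command()` and translate setup errors into an exit code."""
    try:
        command()
    except ItemevalError as exc:
        print(f"itemeval: error: {exc}", file=sys.stderr)
        return exc.exit_code
    return 0
```

Code calling the Python API would skip a wrapper like this and catch the exception types directly.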
The cost gate can decline to proceed; this is a deliberate stop, not a failure:
| Situation | Exit code |
|---|---|
| Projection exceeds confirm_above_usd and confirmation is needed in a non-TTY shell | 3 |
| Projection exceeds budget.max_usd (hard cap; --yes does not override) | 4 |
Pass --yes to auto-confirm in scripts/CI, and set budget.max_usd as the
un-overridable backstop. See Budget and Costs.
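The gate's decision order can be sketched as a pure function. This is a simplified model under stated assumptions: `cost_gate` and its parameters are hypothetical, `None` stands for an unset threshold, and "exceeds" is read as a strict greater-than comparison.

```python
def cost_gate(projected_usd, confirm_above_usd, max_usd, *,
              yes=False, is_tty=True, confirm=None):
    """Return the exit code the gate would yield: 0 proceed, 3 declined, 4 over cap.

    `confirm` is an optional callable returning True/False that stands in for
    an interactive prompt.
    """
    # Hard cap is checked first; --yes does not override it.
    if max_usd is not None and projected_usd > max_usd:
        return 4
    # Below the confirmation threshold: nothing to ask.
    if confirm_above_usd is None or projected_usd <= confirm_above_usd:
        return 0
    if yes:  # --yes auto-confirms in scripts/CI
        return 0
    if not is_tty:  # confirmation needed but nobody can answer
        return 3
    return 0 if (confirm is not None and confirm()) else 3
```

Checking the hard cap before the `yes` flag is what makes `budget.max_usd` a backstop that automation cannot bypass.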
| Code | Meaning |
|---|---|
| 0 | success |
| 1 | unexpected error, or at least one condition failed during a run |
| 2 | config / template / adapter error (and argparse usage errors) |
| 3 | cost gate declined, or confirmation required in a non-interactive shell |
| 4 | projected cost exceeds budget.max_usd |
- Run summary (stdout): per-condition lines plus totals (rows written, errors, parse_failures, and the empty-solution line).
- itemeval status: the completion matrix (generate done / err / empty, grade done / err / parse_fail), per condition.
- Stores: solutions.parquet (error, stop_reason) and gradings.parquet (error, parse_ok, parse_error, judge_completion). See Outputs and Schemas.
- log_index.parquet: per-eval status and completed/total sample counts.
- Raw .eval logs: full inspect evidence, including stack traces, retries, and per-sample events. The store is the source of truth; the logs are the receipts.
The parquet store is keyed, so re-invoking a command is always safe and never duplicates work:
- Completed rows skip; the response cache means already-paid calls aren't re-charged.
- Sample errors re-run.
- Empty completions re-run only under on_empty: rerun.
- Parse failures stay final (use --force to redo).
- --force re-runs everything selected; --condition <id|prefix|slug> (and --grader/--rubric for grade) narrows a run.
- When a planned run would overwrite existing rows (--force, epoch extension after raising replications, on_empty: rerun), the pre-gate block states it ("this run replaces N existing rows (…)") as part of the single money confirmation, and the run JSON carries rows_replaced.
Interrupting a run (Ctrl-C, a crash, a provider outage mid-run) needs no special handling — just run the same command again.
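The re-run rules above reduce to one predicate per row. The following is a minimal sketch, assuming rows are dicts with the store's column names and that a missing row is represented as `None`; `is_pending` is a hypothetical helper, not an itemeval function.

```python
def is_pending(row, *, stage, on_empty="skip", force=False):
    """Return True if a re-run should attempt this row again.

    `row` is the stored row for a key, or None if nothing was written yet.
    `stage` is "generate" or "grade".
    """
    if force or row is None:
        return True  # --force redoes everything selected; missing rows always run
    if row.get("error") is not None:
        return True  # sample errors are always re-attempted
    if stage == "generate":
        solution = row.get("solution") or ""
        # Empty completions re-run only under on_empty: rerun.
        return on_empty == "rerun" and not solution.strip()
    # grade: successful rows and parse failures are both final.
    return False
```

Because the predicate depends only on what is already in the store, running the same command after an interruption picks up exactly the rows that are missing or errored.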