Error Handling

Error Handling and Reporting

itemeval's rule is flag, never silently drop: every failure mode during a run leaves a durable, inspectable record, and re-running is always safe. This page is the single reference for what can go wrong, how it's recorded, how it's reported, and what re-running does.

A failure is one of two kinds:

Setup errors raise an exception and stop the command before any model is called (bad config, missing template, dataset load failure, cost gate).
Run-time failures never abort the stage — they are captured per row, written to the store, and summarized. inspect runs with fail_on_error=False, so one bad sample never sinks a whole condition.

Run-time failures: the three row-level channels

During generate/grade, each sample resolves into exactly one of these. All three keep the row — none is ever dropped.

Channel	What it is	Recorded as	Reported as	Re-run behavior
Sample error	Provider/API failure: timeout, rate-limit exhaustion, 5xx, content filter, refusal	`error` set; `solution`/`judge_completion` null	`errors=N` (run summary; `status` err column)	Re-attempted — the row is pending again
Parse failure (grade only)	Judge replied, but no valid `{"score": <number>}` block	`parse_ok=false`, `parse_error=<code>`, `judge_completion`=raw text, `score` null	`parse_failures=N` (run summary; `status` parse_fail column)	Final — not retried; use `grade --force`
Empty completion	Generation finished with no API error but blank text (typically a reasoning model whose `max_tokens` was spent entirely on hidden reasoning)	`error` null, `solution` blank, `stop_reason` usually `max_tokens`	`empty=N` (`status`); `grade` prints a count + stop-reason breakdown	Governed by `solvers.on_empty`

Sample errors

inspect retries each erroring sample once in-eval (retry_on_error=1) before the error is recorded. A recorded error row has error set and no solution. On the next generate/grade of the same command, errored rows are treated as incomplete and re-attempted (already-succeeded samples are served from inspect's response cache, so they are not re-paid).

Parse failures (judge output)

Judge parsing is strict — fenced ```json blocks last-to-first, then any raw JSON object — with exact failure codes in parse_error:

Code	Meaning
`no_json_object`	No JSON object found anywhere in the completion
`no_score_in_json`	A JSON object was found, but it has no `score` key
`score_not_numeric`	`score` is present but not coercible to a finite number (or is a bool)
`score_not_finite`	`score` parsed to `NaN`/`inf`

Parse failures are results, not errors: the row is kept with parse_ok=false and the raw judge_completion for inspection, and is final — re-running grade will not retry it. To re-grade, either change the rubric (its hash changes, starting a fresh grade condition) or run grade --force. A sample-level error during grading (the judge call itself failed) is the sample error channel instead — error set, parse_ok=false, parse_error null — and is re-attempted, not final.

Empty completions

A completed generation with no error but no gradable text. This is a distinct channel from both errors and parse failures, controlled by solvers.on_empty:

Policy	Effect
`skip` (default)	Excluded from grading; surfaced in the `grade` summary and the `status` empty column
`rerun`	Also treated as not-done by `generate`, so a later `generate` re-attempts them (raise `max_tokens` / lower `reasoning_effort` first — an identical request hits the response cache and stays empty)
`grade`	Sent to the judge as-is (an empty answer, usually scored low)

The usual cause is too small a max_tokens for a reasoning model — size the cap for hidden reasoning plus the visible answer. See Configuration for details.

Eval-level (whole-condition) failures

If an entire inspect_ai.eval(...) raises — a misconfigured task, an unreachable provider, an auth failure — itemeval catches it, records the condition as status="error" with the exception message, and continues to the next condition. No rows are written for that condition. The CLI prints:

[2/4] gpt-5-mini_builtin-standard_default  ERROR: PrerequisiteError: ...

The command's exit code is 1 if any condition errored. Other conditions in the same run still complete and persist normally.

Setup errors (before any model call)

These raise an exception, print itemeval: error: <message>, and stop:

Exception	Cause	Exit code
`ConfigError`	YAML shape/validation failure, bad grader/template reference	2
`TemplateError`	Missing template file or required placeholder	2
`AdapterError`	Dataset load or field-mapping failure	2
`StoreError`	Parquet schema/IO problem (e.g. `grade` with no solutions)	1
`BudgetError`	Pricing refresh / estimator failure	1

The Python API raises these exceptions directly instead of mapping them to exit codes.

Budget gate (refusals, not errors)

The cost gate can decline to proceed; this is a deliberate stop, not a failure:

Situation	Exit code
Projection exceeds `confirm_above_usd` and confirmation is needed in a non-TTY shell	3
Projection exceeds `budget.max_usd` (hard cap; `--yes` does not override)	4

Pass --yes to auto-confirm in scripts/CI, and set budget.max_usd as the un-overridable backstop. See Budget and Costs.

Exit codes (all commands)

Code	Meaning
0	success
1	unexpected error, or at least one condition failed during a run
2	config / template / adapter error (and argparse usage errors)
3	cost gate declined, or confirmation required in a non-interactive shell
4	projected cost exceeds `budget.max_usd`

Where failures are visible after a run

Run summary (stdout) — per-condition lines plus totals: rows written, errors, parse_failures, and the empty-solution line.
itemeval status — the completion matrix: generate done / err / empty, grade done / err / parse_fail, per condition.
Stores — solutions.parquet (error, stop_reason) and gradings.parquet (error, parse_ok, parse_error, judge_completion). See Outputs and Schemas.
log_index.parquet — per-eval status and completed/total sample counts.
Raw .eval logs — full inspect evidence: stack traces, retries, per-sample events. The store is the source of truth; the logs are the receipts.

Retry and resume — re-run the same command

The parquet store is keyed, so re-invoking a command is always safe and never duplicates work:

Completed rows skip; the response cache means already-paid calls aren't re-charged.
Sample errors re-run.
Empty completions re-run only under on_empty: rerun.
Parse failures stay final (use --force to redo).
--force re-runs everything selected; --condition <id|prefix|slug> (and --grader/--rubric for grade) narrows a run.
When a planned run would overwrite existing rows (--force, epoch extension after raising replications, on_empty: rerun), the pre-gate block states it — this run replaces N existing rows (…) — as part of the single money confirmation, and the run JSON carries rows_replaced.

Interrupting a run (Ctrl-C, a crash, a provider outage mid-run) needs no special handling — just run the same command again.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error Handling

Error Handling and Reporting

Run-time failures: the three row-level channels

Sample errors

Parse failures (judge output)

Empty completions

Eval-level (whole-condition) failures

Setup errors (before any model call)

Budget gate (refusals, not errors)

Exit codes (all commands)

Where failures are visible after a run

Retry and resume — re-run the same command

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally