# Error Handling and Reporting itemeval's rule is **flag, never silently drop**: every failure mode during a run leaves a durable, inspectable record, and re-running is always safe. This page is the single reference for what can go wrong, how it's recorded, how it's reported, and what re-running does. A failure is one of two kinds: - **Setup errors** raise an exception and stop the command before any model is called (bad config, missing template, dataset load failure, cost gate). - **Run-time failures** never abort the stage — they are captured per row, written to the store, and summarized. inspect runs with `fail_on_error=False`, so one bad sample never sinks a whole condition. ## Run-time failures: the three row-level channels During `generate`/`grade`, each sample resolves into exactly one of these. All three keep the row — none is ever dropped. | Channel | What it is | Recorded as | Reported as | Re-run behavior | |---|---|---|---|---| | **Sample error** | Provider/API failure: timeout, rate-limit exhaustion, 5xx, content filter, refusal | `error` set; `solution`/`judge_completion` null | `errors=N` (run summary; `status` **err** column) | **Re-attempted** — the row is pending again | | **Parse failure** *(grade only)* | Judge replied, but no valid `{"score": }` block | `parse_ok=false`, `parse_error=`, `judge_completion`=raw text, `score` null | `parse_failures=N` (run summary; `status` **parse_fail** column) | **Final** — not retried; use `grade --force` | | **Empty completion** | Generation finished with no API error but blank text (typically a reasoning model whose `max_tokens` was spent entirely on hidden reasoning) | `error` null, `solution` blank, `stop_reason` usually `max_tokens` | `empty=N` (`status`); `grade` prints a count + stop-reason breakdown | Governed by [`solvers.on_empty`](Configuration.md) | ### Sample errors inspect retries each erroring sample once in-eval (`retry_on_error=1`) before the error is recorded. A recorded error row has `error` set and no `solution`. On the next `generate`/`grade` of the same command, errored rows are treated as incomplete and re-attempted (already-succeeded samples are served from inspect's response cache, so they are not re-paid). ### Parse failures (judge output) Judge parsing is strict — fenced ```json blocks last-to-first, then any raw JSON object — with exact failure codes in `parse_error`: | Code | Meaning | |---|---| | `no_json_object` | No JSON object found anywhere in the completion | | `no_score_in_json` | A JSON object was found, but it has no `score` key | | `score_not_numeric` | `score` is present but not coercible to a finite number (or is a bool) | | `score_not_finite` | `score` parsed to `NaN`/`inf` | Parse failures are **results, not errors**: the row is kept with `parse_ok=false` and the raw `judge_completion` for inspection, and is **final** — re-running `grade` will not retry it. To re-grade, either change the rubric (its hash changes, starting a fresh grade condition) or run `grade --force`. A sample-level *error* during grading (the judge call itself failed) is the **sample error** channel instead — `error` set, `parse_ok=false`, `parse_error` null — and is re-attempted, not final. ### Empty completions A completed generation with no error but no gradable text. This is a distinct channel from both errors and parse failures, controlled by `solvers.on_empty`: | Policy | Effect | |---|---| | `skip` *(default)* | Excluded from grading; surfaced in the `grade` summary and the `status` **empty** column | | `rerun` | Also treated as not-done by `generate`, so a later `generate` re-attempts them (raise `max_tokens` / lower `reasoning_effort` first — an identical request hits the response cache and stays empty) | | `grade` | Sent to the judge as-is (an empty answer, usually scored low) | The usual cause is too small a `max_tokens` for a reasoning model — size the cap for hidden reasoning **plus** the visible answer. See [Configuration](Configuration.md) for details. ## Eval-level (whole-condition) failures If an entire `inspect_ai.eval(...)` raises — a misconfigured task, an unreachable provider, an auth failure — itemeval catches it, records the condition as `status="error"` with the exception message, and **continues to the next condition**. No rows are written for that condition. The CLI prints: ``` [2/4] gpt-5-mini_builtin-standard_default ERROR: PrerequisiteError: ... ``` The command's exit code is **1** if any condition errored. Other conditions in the same run still complete and persist normally. ## Setup errors (before any model call) These raise an exception, print `itemeval: error: `, and stop: | Exception | Cause | Exit code | |---|---|---| | `ConfigError` | YAML shape/validation failure, bad grader/template reference | 2 | | `TemplateError` | Missing template file or required placeholder | 2 | | `AdapterError` | Dataset load or field-mapping failure | 2 | | `StoreError` | Parquet schema/IO problem (e.g. `grade` with no solutions) | 1 | | `BudgetError` | Pricing refresh / estimator failure | 1 | The Python API raises these exceptions directly instead of mapping them to exit codes. ## Budget gate (refusals, not errors) The cost gate can decline to proceed; this is a deliberate stop, not a failure: | Situation | Exit code | |---|---| | Projection exceeds `confirm_above_usd` and confirmation is needed in a non-TTY shell | 3 | | Projection exceeds `budget.max_usd` (hard cap; `--yes` does **not** override) | 4 | Pass `--yes` to auto-confirm in scripts/CI, and set `budget.max_usd` as the un-overridable backstop. See [Budget and Costs](Budget-and-Costs.md). ## Exit codes (all commands) | Code | Meaning | |------|---------| | 0 | success | | 1 | unexpected error, or at least one condition failed during a run | | 2 | config / template / adapter error (and argparse usage errors) | | 3 | cost gate declined, or confirmation required in a non-interactive shell | | 4 | projected cost exceeds `budget.max_usd` | ## Where failures are visible after a run - **Run summary (stdout)** — per-condition lines plus totals: `rows written`, `errors`, `parse_failures`, and the empty-solution line. - **`itemeval status`** — the completion matrix: generate `done / err / empty`, grade `done / err / parse_fail`, per condition. - **Stores** — `solutions.parquet` (`error`, `stop_reason`) and `gradings.parquet` (`error`, `parse_ok`, `parse_error`, `judge_completion`). See [Outputs and Schemas](Outputs-and-Schemas.md). - **`log_index.parquet`** — per-eval `status` and completed/total sample counts. - **Raw `.eval` logs** — full inspect evidence: stack traces, retries, per-sample events. The store is the source of truth; the logs are the receipts. ## Retry and resume — re-run the same command The parquet store is keyed, so re-invoking a command is always safe and never duplicates work: - **Completed** rows skip; the response cache means already-paid calls aren't re-charged. - **Sample errors** re-run. - **Empty completions** re-run only under `on_empty: rerun`. - **Parse failures** stay final (use `--force` to redo). - `--force` re-runs everything selected; `--condition ` (and `--grader`/`--rubric` for grade) narrows a run. - When a planned run would overwrite existing rows (`--force`, epoch extension after raising `replications`, `on_empty: rerun`), the pre-gate block states it — `this run replaces N existing rows (…)` — as part of the single money confirmation, and the run JSON carries `rows_replaced`. Interrupting a run (Ctrl-C, a crash, a provider outage mid-run) needs no special handling — just run the same command again.