# Python API

Everything the CLI does is available programmatically — `import itemeval`
exposes one function per pipeline step, plus the config models. Use whichever
fits: the CLI for terminal/script workflows, the Python API for notebooks and
orchestration code. Both drive the exact same internals and produce the same
on-disk outputs.

## The surface

```python
import itemeval

itemeval.load_config(path, work_dir=None)  # YAML -> ExperimentConfig (raises ConfigError)
itemeval.prepare_study(cfg)       # config -> PreparedStudy (datasets, templates, grid, plan, pricing)
itemeval.estimate_study(prep)     # -> Estimate (projected calls/tokens/$ per stage)
itemeval.run_generate(prep)       # -> GenerateResult (stage 1; writes solutions store)
itemeval.run_grade(prep)          # -> GradeResult (stage 2; writes gradings store)
itemeval.export_study(cfg)        # -> ExportResult (writes export/ tables)
itemeval.build_status(cfg, prep)  # -> StatusReport (completion matrix)

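itemeval.ExperimentConfig         # the validated config model
itemeval.Item                     # the canonical item model
itemeval.ItemevalError            # base class for config/template/dataset errors
itemeval.BudgetExceededError      # raised when a budget cap would be exceeded
itemeval.__version__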
```

These names (`itemeval.__all__`) are the entire supported Python surface.
`_`-prefixed modules remain importable but carry **no stability promise** —
pre-1.0 they may change in any minor release.

## A complete run

```python
import itemeval

cfg = itemeval.load_config("configs/my_study.yaml")
prep = itemeval.prepare_study(cfg)          # loads datasets (pins revisions), expands the grid

est = itemeval.estimate_study(prep)
print(f"projected: generate ${est.generate.usd:.2f}, grade ${est.grade.usd:.2f}")
for w in est.warnings:
    print("warning:", w)

gen = itemeval.run_generate(prep, max_usd=20)  # raises BudgetExceededError before any API
                                               # call if the remaining projection > $20
assert not any(c.status == "error" for c in gen.conditions), gen.conditions

graded = itemeval.run_grade(prep)
print(f"{graded.rows_written} gradings, {graded.parse_failures} parse failures")

exported = itemeval.export_study(cfg)
report = itemeval.build_status(cfg, prep)
```

Then analyze:

```python
import pandas as pd
df = pd.read_parquet(cfg.study_dir / "export" / "gradings_long.parquet")
```

## Result objects

Every function returns a pydantic model, so `.model_dump()` and
`.model_dump_json()` are available on each result:

| Call | Returns | Key fields |
|---|---|---|
| `estimate_study` | `Estimate` | `generate`/`grade` (`.calls`, `.input_tokens`, `.output_tokens`, `.usd`, `.unpriced_models`, per-condition list), `total_usd`, `warnings`, `pricing` (provenance) |
| `run_generate` | `GenerateResult` | `run_id`, `conditions` (per-condition `status` run/skipped/error, `rows_written`, `errors`, `usd`), `rows_written`, `total_usd`, `manifest_path` |
| `run_grade` | `GradeResult` | same fields as `GenerateResult`, plus `parse_failures` |
| `export_study` | `ExportResult` | `rows`, output paths, `generation_usd`, `grading_usd`, `internally_reconciled`, `cost` (savings + per-provider `CostReport`), `pricing` (provenance) |
| `build_status` | `StatusReport` | datasets, item counts, per-condition `expected/completed/errors/parse_failures`, spend, manifests |
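
Because stage results report failures through `conditions` rather than exceptions, a small helper that tallies statuses can be handy. The sketch below is duck-typed and runs without the library: `status_counts`, `failed_conditions`, and `fake_result` are hypothetical names, and the stand-in object only mimics the `conditions`, `status`, and `rows_written` fields listed in the table.

```python
from collections import Counter
from types import SimpleNamespace


def status_counts(result):
    """Tally per-condition statuses ("run", "skipped", "error") on a stage result."""
    return Counter(c.status for c in result.conditions)


def failed_conditions(result):
    """Return the conditions whose status is "error"."""
    return [c for c in result.conditions if c.status == "error"]


# Stand-in for a GenerateResult/GradeResult (assumption: only these fields are mimicked).
fake_result = SimpleNamespace(
    conditions=[
        SimpleNamespace(status="run", rows_written=40),
        SimpleNamespace(status="skipped", rows_written=0),
        SimpleNamespace(status="error", rows_written=3),
    ]
)

counts = status_counts(fake_result)
failed = failed_conditions(fake_result)
print(dict(counts))
print(len(failed), "condition(s) failed")
```

With a real run, the same helpers should accept the object returned by `itemeval.run_generate(prep)` or `itemeval.run_grade(prep)`, provided the fields match the table above.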

## Useful keyword arguments

```python
itemeval.prepare_study(cfg, refresh_pricing_table=True)   # pull OpenRouter prices first

itemeval.run_generate(prep,
    force=True,                          # re-run completed work (rows replaced)
    condition_filter=["gpt-5-mini_minimal"],  # id / id-prefix / slug, like --condition
    display="rich",                      # inspect_ai progress UI (default "none")
)

itemeval.run_grade(prep,
    graders=["judge_b"],                 # like --grader
    rubrics=["strict"],                  # like --rubric
    force=False, condition_filter=None, display="none",
)

itemeval.estimate_study(prep)            # reads the solutions store automatically so
                                         # judge estimates use real stored solutions
```

## Differences from the CLI

1. **Consent is a parameter, never a prompt.** The CLI's interactive
   `confirm_above_usd` gate does not exist here; instead pass
   `max_usd=` to `run_generate`/`run_grade` — when the stage's *remaining*
   projection (completed work is never re-counted) exceeds it, the function
   raises `itemeval.BudgetExceededError` **before any API call**. The config's
   `budget.max_usd` hard cap is enforced the same way on this surface, so the
   cap holds everywhere. A library never prompts — it would hang notebooks
   and CI.
2. **No printing.** Information arrives as return values; condition-level
   eval failures are reported in `result.conditions` (status `"error"`), not
   raised — check them.
3. **Exceptions instead of exit codes.** Config/template/dataset problems
   raise `itemeval.ItemevalError` subclasses (`ConfigError`, `TemplateError`,
   `AdapterError`, `StoreError`, `BudgetError`); budget caps raise
   `itemeval.BudgetExceededError`. `ItemevalError` and `BudgetExceededError`
   are public exports; the narrower classes live in an internal module
   pre-1.0.
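
The first and third points combine into a common pattern: pass a cap and treat the budget exception as an ordinary outcome. The sketch below runs without the library, so `run_with_cap`, `FakeBudgetExceededError`, and `fake_stage` are hypothetical stand-ins. The stand-in stage only imitates the documented behavior of raising before doing any work when the remaining projection exceeds the cap.

```python
class FakeBudgetExceededError(Exception):
    """Stand-in for itemeval.BudgetExceededError (assumption, for illustration only)."""


def run_with_cap(stage, prep, max_usd, budget_error):
    """Run one stage under a cap; return (result, None) or (None, message)."""
    try:
        return stage(prep, max_usd=max_usd), None
    except budget_error as exc:
        return None, str(exc)


def fake_stage(prep, max_usd=None):
    """Stand-in stage: refuses up front when the remaining projection exceeds the cap."""
    remaining = prep["remaining_usd"]
    if max_usd is not None and remaining > max_usd:
        raise FakeBudgetExceededError(
            f"remaining ${remaining:.2f} exceeds cap ${max_usd:.2f}"
        )
    return {"rows_written": 10}


blocked, blocked_msg = run_with_cap(
    fake_stage, {"remaining_usd": 25.0}, 20, FakeBudgetExceededError
)
allowed, allowed_msg = run_with_cap(
    fake_stage, {"remaining_usd": 5.0}, 20, FakeBudgetExceededError
)
print(blocked, blocked_msg)
print(allowed, allowed_msg)
```

With the real library, the equivalent call would presumably be `run_with_cap(itemeval.run_generate, prep, 20, itemeval.BudgetExceededError)`. Returning the message instead of re-raising is a design choice for notebooks; orchestration code may prefer to let the exception propagate.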

## Notes

- **Output location.** `load_config(path)` anchors inputs (prompts/rubrics) to
  the config file's directory and outputs (the `studies/` tree) to the current
  working directory. Pass `load_config(path, work_dir="/some/dir")` to anchor
  outputs elsewhere — the analogue of the CLI's `-C/--base-dir`. An in-memory
  `ExperimentConfig` (no file) has no config directory, so it anchors *both*
  inputs and outputs to `work_dir` (CWD by default).
- `import itemeval` is lightweight — heavy dependencies (inspect_ai, pandas)
  load lazily on first use of a pipeline function.
- `prepare_study` touches the HF Hub on first run (revision resolution +
  download); afterwards the lock file + local cache make it effectively
  offline.
- Stages call inspect's `eval()` serially, one condition at a time — don't
  run two stages concurrently in one process or share a study directory
  across processes.
- Everything remains resumable and idempotent exactly as documented in
  [Pipeline Concepts](Pipeline-Concepts.md): calling `run_generate` twice is
  safe because the second call skips completed conditions.
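
The anchoring rules from the output-location note can be restated as a few lines of `pathlib`. This is only an illustration of the rules as described, not the library's implementation; `resolve_anchors` is a hypothetical helper, and the paths are made up.

```python
from pathlib import Path


def resolve_anchors(config_path=None, work_dir=None):
    """Return (input_dir, output_dir) following the anchoring rules described above."""
    # Outputs go to work_dir, or to the current working directory by default.
    output_dir = Path(work_dir) if work_dir is not None else Path.cwd()
    if config_path is None:
        # In-memory config: no config directory, so inputs share the output anchor.
        return output_dir, output_dir
    # File-based config: inputs (prompts/rubrics) sit next to the config file.
    return Path(config_path).parent, output_dir


from_file = resolve_anchors("/proj/configs/my_study.yaml", "/data/out")
in_memory = resolve_anchors(None, "/data/out")
default_out = resolve_anchors("/proj/configs/my_study.yaml")
print(from_file)
print(in_memory)
print(default_out)
```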