# Architecture: what each module does and why it exists

itemeval is ~2,950 effective lines across 28 modules — a few percent of the
size of inspect_ai, which it deliberately does **not** duplicate. "Thin"
here means: inspect_ai keeps the hard runtime problems (async execution,
~20 providers, rate limiting, retries, response/prompt caching, batch APIs,
`.eval` transcripts); itemeval only adds the experiment-design layer that
inspect_ai explicitly does not have. Every module below traces to one of the
five README features.

## Naming convention

Every internal module is `_`-prefixed: the PEP 8 convention for non-public
API, which this repo's style rules require. Python does not enforce
visibility, so the underscore is the contract: **only** `itemeval.__init__`
(4 names) and the CLI are public, and everything else can be refactored
freely pre-1.0 without breaking users. `cli.py` has no underscore because it
is the console-script entry point declared in `pyproject.toml`.

## Module map

Sizes are total lines (incl. docstrings). ◆ = candidate for future
consolidation — kept separate for clarity/test-ownership, not necessity.

### Foundation (leaf modules, ~100 lines)

| Module | Lines | Why it exists |
|---|---|---|
| `_errors.py` | 25 | One exception hierarchy → deterministic CLI exit codes (2 vs 1). |
| `_util.py` | 46 | Canonical JSON + sha256 (condition ids, manifests), atomic writes, the token heuristic. Used by nearly every module. |
| `_item.py` ◆ | 29 | The canonical `Item` model — the interface between adapters and both stages. Could live in `_config.py`; separate because it's exported. |
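
As an illustration of the `_util.py` row, here is a minimal sketch of canonical JSON hashing and an atomic write. The function names, the 12-character truncation, and the temp-file naming are assumptions made for this example, not the module's actual API.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def canonical_json(obj) -> str:
    """Serialize with sorted keys and fixed separators so equal data yields equal text."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def content_hash(obj, length: int = 12) -> str:
    """Hex sha256 of the canonical JSON, truncated to `length` characters (assumed length)."""
    digest = hashlib.sha256(canonical_json(obj).encode("utf-8")).hexdigest()
    return digest[:length]


def atomic_write_text(path, text: str) -> None:
    """Write to a temp file in the target directory, then rename it over the target."""
    path = Path(path)
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
        os.replace(tmp, path)  # atomic when source and target share a filesystem
    except BaseException:
        # Leave no partial temp file behind if the write or rename failed.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Because the hash is taken over the canonical text, `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` get the same id, which is the property condition ids and manifests depend on.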

### Feature 1 — benchmark adapters (`adapters/`, ~180 lines)

inspect_ai's `hf_dataset()` loads rows into `Sample`s for one eval; it has no
revision-pinning policy, no lock file, no `grading_scheme`/metadata mapping
contract, no cross-dataset id-uniqueness check.

| Module | Lines | Why |
|---|---|---|
| `_base.py` | 98 | Adapter protocol + registry (ROADMAP post-0.1: github/local adapters), `dataset_locks.json` ("revision pinned at first run"), multi-dataset orchestration. |
| `_hf.py` | 85 | The one concrete adapter: `datasets.load_dataset(revision=...)` + the exact column→Item mapping rules. |
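
The "revision pinned at first run" policy can be sketched as follows. The flat dataset-to-revision layout of the lock file and the `resolve_latest` callback are hypothetical; the real `dataset_locks.json` format is not shown in this document.

```python
import json
from pathlib import Path
from typing import Callable


def pinned_revision(lock_path, dataset: str, resolve_latest: Callable[[str], str]) -> str:
    """Return the locked revision for `dataset`, pinning it the first time it is seen."""
    lock_path = Path(lock_path)
    locks = {}
    if lock_path.exists():
        locks = json.loads(lock_path.read_text(encoding="utf-8"))
    if dataset not in locks:
        # First run: resolve once and record the result.
        # Later runs reuse the recorded value and never call resolve_latest again.
        locks[dataset] = resolve_latest(dataset)
        lock_path.write_text(json.dumps(locks, indent=2, sort_keys=True), encoding="utf-8")
    return locks[dataset]
```

The returned value is what an adapter would pass as `revision=...` to `datasets.load_dataset`, so a study keeps reading the same snapshot even after the upstream dataset changes.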

### Feature 2 — design grids (`design/`, ~200 lines)

inspect_ai has tasks, not experiment designs. Nothing in it represents "3
models × 2 prompts × 2 model-configs, fully crossed, with stable cell ids".

| Module | Lines | Why |
|---|---|---|
| `_ids.py` ◆ | 25 | Slug + content-hash id algorithm (tiny, but the stability contract deserves its own tested unit). |
| `_grid.py` | 178 | Facet crossing → `GenCondition`/`GradeCondition` lists; param resolution (facet over solver defaults); template placeholder validation. |
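
A fully crossed grid with stable cell ids can be sketched like this. The facet names, the slug rule, and the 8-character hash are assumptions for illustration; the real `GenCondition`/`GradeCondition` models carry more fields than the plain dicts used here.

```python
import hashlib
import itertools
import json
import re


def slugify(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics to a single hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def cell_id(cell: dict) -> str:
    """Readable slug plus a short hash of the cell's canonical JSON."""
    canonical = json.dumps(cell, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]
    slug = slugify("-".join(str(cell[name]) for name in sorted(cell)))
    return f"{slug}-{digest}"


def cross(facets: dict) -> list:
    """Cartesian product of all facet levels; one dict per cell, with an `id` key."""
    names = sorted(facets)  # sorted so the declaration order of facets is irrelevant
    cells = []
    for combo in itertools.product(*(facets[name] for name in names)):
        cell = dict(zip(names, combo))
        cells.append({"id": cell_id(cell), **cell})
    return cells
```

With 3 models, 2 prompts, and 2 model-configs, `cross` returns 12 cells. Each id depends only on the cell's content, so adding a fourth model later leaves the existing 12 ids unchanged.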

### Feature 3 — the two-stage pipeline (`generate/` + `grade/` + glue, ~1,070 lines)

This is the package's reason to exist. inspect_ai's model-graded scorers run
*inside* the generating eval, so judge calls share the solver's logs, get no
separate batching, caching, or cost accounting, and adding a rubric later
means re-running generation. Decoupling the two stages requires exactly what
these modules do.

| Module | Lines | Why |
|---|---|---|
| `generate/_task.py` | 61 | items + condition → inspect `Task` (samples, epochs, GenerateConfig, cache policy). |
| `generate/_params.py` ◆ | 50 | Requested-vs-effective sampling params from model events (ROADMAP M2 checkbox: provider-forced values must be visible). |
| `generate/_run.py` | 370 | The stage orchestrator: resume computation, serial `eval()` calls, log→rows extraction, usage→USD, ledger/log-index writes, error containment. Biggest module because it owns the inspect↔store boundary; also exports helpers `grade/_run` reuses. |
| `grade/_verifiable.py` | 84 | exact/MC/numeric scorers as pure $0 functions over stored text (inspect scorers want a live `TaskState`, not a parquet row). |
| `grade/_parse.py` | 94 | The strict judge-output contract with exact failure codes; "flagged, never dropped" lives here. |
| `grade/_judge.py` | 93 | Stored solutions + rubric → a fresh inspect `Task` (judge-as-task), format suffix, prompt-cache hint. |
| `grade/_run.py` | 304 | Grade orchestrator: pending computation, verifiable vs judge dispatch, grader×rubric filters. Never writes to the solutions store. |
| `_mockmodels.py` | 68 | `mockllm/*` pass-through: deterministic outputs + fabricated usage so demos/tests/CI run the entire pipeline for $0. Dev affordance — the only module a user never needs. |
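
To make "pure $0 functions over stored text" concrete, here is a minimal sketch of the three verifiable scorer kinds. The function names and matching rules (normalization, letter range A-E, last-number extraction) are assumptions; the actual rules in `grade/_verifiable.py` may differ.

```python
import math
import re


def score_exact(answer: str, target: str) -> bool:
    """String equality after lowercasing and collapsing whitespace."""
    def normalize(text: str) -> str:
        return " ".join(text.split()).lower()
    return normalize(answer) == normalize(target)


def score_multiple_choice(answer: str, target: str) -> bool:
    """Compare the last standalone capital letter A-E in the answer with the target."""
    letters = re.findall(r"\b([A-E])\b", answer)
    return bool(letters) and letters[-1] == target.strip().upper()


def score_numeric(answer: str, target: float, rel_tol: float = 1e-6) -> bool:
    """Compare the last number in the answer with the target, within a relative tolerance."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", answer.replace(",", ""))
    return bool(numbers) and math.isclose(float(numbers[-1]), target, rel_tol=rel_tol)
```

None of these functions calls a model or needs a live `TaskState`: each takes the stored solution text and a target, so grading can be re-run over a parquet row at any time without spending tokens.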

### Feature 4 — item-response store & export (`store/`, ~560 lines)

inspect logs are per-eval `.eval` files. Cross-run accumulation, keyed
upserts, resume predicates, and the long-format join do not exist there.

| Module | Lines | Why |
|---|---|---|
| `_base.py` | 59 | The one upsert engine: concat → dedup-on-key → schema-cast → atomic replace. |
| `_layout.py` ◆ | 47 | Single source of truth for every path in a study dir. |
| `_solutions.py` | 70 | Solutions schema (36 cols) + `items_to_run` resume predicate. |
| `_gradings.py` | 80 | Gradings schema (30 cols) + `pending_solutions` (parse failures are final; errors are retried). |
| `_items.py` ◆ | 47 | Items snapshot (analysis joins need item text without re-downloading). |
| `_logs.py` ◆ | 38 | Raw-log index: store row ↔ `.eval` transcript audit trail. |
| `_ledger.py` ◆ | 36 | Cost ledger schema. |
| `_export.py` | 183 | The 45-column long-table join + CSV mirrors + internal reconciliation. |

The five small schema modules could be one file; they are split so each
table's schema+predicates are independently owned and tested.
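
The upsert sequence in `_base.py` (concat → dedup-on-key → schema-cast → atomic replace) can be sketched with the standard library alone. This example uses a JSON Lines file as a stand-in for the real parquet tables, and the `upsert` signature and the schema-as-dict-of-casts representation are assumptions.

```python
import json
import os
import tempfile
from pathlib import Path


def upsert(path, new_rows, key, schema):
    """Merge `new_rows` into the table at `path`; a later row wins on a key collision."""
    path = Path(path)
    old_rows = []
    if path.exists():
        with path.open(encoding="utf-8") as fh:
            old_rows = [json.loads(line) for line in fh if line.strip()]

    # 1. concat: existing rows first, new rows last
    combined = old_rows + list(new_rows)

    # 2. dedup on key: a later row replaces an earlier one with the same key
    by_key = {}
    for row in combined:
        by_key[tuple(row[k] for k in key)] = row

    # 3. schema-cast: fixed column set and types; missing values become None
    rows = [
        {col: None if row.get(col) is None else cast(row[col]) for col, cast in schema.items()}
        for row in by_key.values()
    ]

    # 4. atomic replace: write a sibling temp file, then rename it over the table
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row, sort_keys=True) + "\n")
    os.replace(tmp, path)
    return rows
```

Funnelling every table through one function like this means the per-table modules only have to declare a schema and a key, and an interrupted run cannot leave a half-written table behind.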

### Feature 5 — budget layer (`budget/`, ~430 lines)

inspect_ai has no notion of dollars at all — no pricing, estimation, or
spending gates.

| Module | Lines | Why |
|---|---|---|
| `_pricing.py` | 131 | Pricing table model, packaged seed, OpenRouter refresh, model→price lookup, token→USD math. |
| `_policies.py` ◆ | 40 | dev / full-interactive / full-batch → effective plan (items limit, replications, batch flag). |
| `_estimator.py` | 187 | Per-stage projection: heuristic tokens × grid × prices; uses stored solutions for judge sizing. |
| `_gate.py` | 69 | confirm_above_usd / max_usd / --yes / interactive decision table → exit codes 3/4. |
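
One plausible reading of the `_gate.py` decision table is sketched below. The inputs `confirm_above_usd`, `max_usd`, and `--yes` come from the row above; the `Decision` enum, the `gate` function, and the precedence rules (in particular, that the hard cap is not overridden by `--yes`) are assumptions. How the refusals map to exit codes 3 and 4 is not stated here, so the sketch returns named decisions instead.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    PROCEED = "proceed"
    ASK = "ask"                                # prompt the user before spending
    REFUSE_UNCONFIRMED = "refuse_unconfirmed"  # confirmation needed but cannot be obtained
    REFUSE_OVER_MAX = "refuse_over_max"        # estimate exceeds the hard cap


def gate(
    estimate_usd: float,
    confirm_above_usd: Optional[float],
    max_usd: Optional[float],
    yes: bool,
    interactive: bool,
) -> Decision:
    """Decide whether a run with the given cost estimate may start."""
    # Assumed precedence: the hard cap is checked first and --yes does not override it.
    if max_usd is not None and estimate_usd > max_usd:
        return Decision.REFUSE_OVER_MAX
    # Below the confirmation threshold (or no threshold set): nothing to confirm.
    if confirm_above_usd is None or estimate_usd <= confirm_above_usd:
        return Decision.PROCEED
    # Above the threshold: --yes pre-confirms, otherwise a terminal is required.
    if yes:
        return Decision.PROCEED
    return Decision.ASK if interactive else Decision.REFUSE_UNCONFIRMED
```

Keeping the gate a pure function of five inputs makes the whole table easy to enumerate in tests, leaving the CLI to translate the decision into a prompt or an exit code.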

### Orchestration & UX (~1,030 lines)

| Module | Lines | Why |
|---|---|---|
| `_config.py` | 209 | The YAML contract: every config model, strict validation, `load_config`. Shared by everything; designed first. |
| `_templates.py` | 79 | Content-hashed prompt/rubric registry + brace-safe rendering (`str.format` would explode on LaTeX/JSON). |
| `_manifest.py` | 183 | The README reproducibility promise as a pydantic schema + writer + post-run effective-params backfill. |
| `_prepare.py` | 83 | `prepare_study()`: config → datasets+templates+grid+plan+pricing, computed once, shared by all five commands. |
| `_status.py` | 153 | Completion matrix: expected vs done vs errors vs parse-failures per condition (M1/M6 exit). |
| `cli.py` | 325 | argparse wiring, gate enforcement, output formatting, exit-code mapping. |
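
The brace-safe rendering mentioned for `_templates.py` can be approximated with a regex that substitutes only known placeholder names. This is a sketch under that assumption; the function name and the placeholder syntax are illustrative, not the module's real interface.

```python
import re

# A placeholder is a brace pair wrapping a bare identifier, e.g. {question}.
_PLACEHOLDER = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def render(template: str, values: dict) -> str:
    """Replace {name} only when `name` is a known key; leave every other brace alone."""
    def substitute(match):
        name = match.group(1)
        return str(values[name]) if name in values else match.group(0)
    return _PLACEHOLDER.sub(substitute, template)


template = r'Solve \frac{a}{b} for {question}. Reply as {"score": 1}.'
rendered = render(template, {"question": "x"})
```

Calling `template.format(question="x")` on the same string raises `KeyError`, because `str.format` treats `{a}` as a replacement field. The sketch still has a blind spot: a LaTeX group whose content happens to equal a placeholder name would be substituted, which is one reason to validate placeholders up front.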

## Could it be fewer files?

Yes. Merging the ◆ modules would give ~14 files with identical behavior; the
split exists for one-concern-per-file clarity and per-module tests, not
because each file is architecturally load-bearing. What you should **not**
expect to shrink is the behavior itself: each module maps to a checked
ROADMAP box or a README promise, and the total is still under 3k lines on top
of a >100k-line runtime.