# Architecture
itemeval is ~2,950 effective lines across 28 modules — a few percent of the
size of inspect_ai, which it deliberately does not duplicate. "Thin"
here means: inspect_ai keeps the hard runtime problems (async execution,
~20 providers, rate limiting, retries, response/prompt caching, batch APIs,
.eval transcripts); itemeval only adds the experiment-design layer that
inspect_ai explicitly does not have. Every module below traces to one of the
five README features.
Every internal module is _-prefixed (PEP 8 convention for non-public API,
mandated by this repo's conventions). Python has no enforced visibility; the
underscore is the contract: only itemeval.__init__ (4 names) and the
CLI are public, so everything else can be refactored freely pre-1.0 without
breaking users. cli.py has no underscore because it is the console-script
entry point declared in pyproject.toml.
Sizes are total lines (incl. docstrings). ◆ = candidate for future consolidation — kept separate for clarity/test-ownership, not necessity.
| Module | Lines | Why it exists |
|---|---|---|
| `_errors.py` | 25 | One exception hierarchy → deterministic CLI exit codes (2 vs 1). |
| `_util.py` | 46 | Canonical JSON + sha256 (condition ids, manifests), atomic writes, the token heuristic. Used by nearly every module. |
| `_item.py` ◆ | 29 | The canonical Item model: the interface between adapters and both stages. Could live in `_config.py`; separate because it's exported. |
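
The `_util.py` row names three techniques: canonical JSON, sha256 content hashing, and atomic writes. The sketch below shows one common way to combine them, using only the standard library. The function names, the 12-character truncation, and the temp-file strategy are assumptions made for illustration, not the module's actual API.

```python
import hashlib
import json
import os
import tempfile


def canonical_json(obj) -> str:
    """Serialize with sorted keys and fixed separators so equal data yields equal text."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def content_hash(obj, length: int = 12) -> str:
    """Hex sha256 of the canonical JSON, truncated (the length is an assumption)."""
    digest = hashlib.sha256(canonical_json(obj).encode("utf-8")).hexdigest()
    return digest[:length]


def atomic_write_text(path: str, text: str) -> None:
    """Write to a temp file in the target directory, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if the write or the rename failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Because keys are sorted before hashing, `content_hash({"a": 1, "b": 2})` and `content_hash({"b": 2, "a": 1})` give the same id, which is the property a stable condition id or manifest hash relies on.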
inspect_ai's hf_dataset() loads rows into Samples for one eval; it has no
revision-pinning policy, no lock file, no grading_scheme/metadata mapping
contract, no cross-dataset id-uniqueness check.
| Module | Lines | Why |
|---|---|---|
| `_base.py` | 98 | Adapter protocol + registry (ROADMAP post-0.1: github/local adapters), `dataset_locks.json` ("revision pinned at first run"), multi-dataset orchestration. |
| `_hf.py` | 85 | The one concrete adapter: `datasets.load_dataset(revision=...)` + the exact column→Item mapping rules. |
inspect_ai has tasks, not experiment designs. Nothing in it represents "3 models × 2 prompts × 2 model-configs, fully crossed, with stable cell ids".
| Module | Lines | Why |
|---|---|---|
| `_ids.py` ◆ | 25 | Slug + content-hash id algorithm (tiny, but the stability contract deserves its own tested unit). |
| `_grid.py` | 178 | Facet crossing → GenCondition/GradeCondition lists; param resolution (facet over solver defaults); template placeholder validation. |
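
To make "fully crossed, with stable cell ids" concrete, here is a minimal sketch of facet crossing with slug + content-hash ids. The function names, the 8-character hash suffix, and the plain-dict cell shape are hypothetical; the real `GenCondition`/`GradeCondition` models are not shown in this document.

```python
import hashlib
import itertools
import json
import re


def slugify(text: str) -> str:
    """Lowercase and collapse every non-alphanumeric run into a single hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def cross_facets(facets: dict[str, list[str]]) -> list[dict[str, str]]:
    """Return one cell per combination of facet values, each with a content-derived id."""
    names = sorted(facets)  # fixed facet order, independent of config key order
    cells = []
    for combo in itertools.product(*(facets[name] for name in names)):
        cell = dict(zip(names, combo))
        payload = json.dumps(cell, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]
        slug = slugify("-".join(cell[name] for name in names))
        # The id depends only on the cell's content, never on its position in the grid.
        cells.append({"id": f"{slug}-{digest}", **cell})
    return cells


grid = cross_facets(
    {
        "model": ["model-a", "model-b", "model-c"],
        "prompt": ["plain", "cot"],
        "model_config": ["cold", "warm"],
    }
)
print(len(grid))  # 3 x 2 x 2 = 12 cells
```

Deriving the id from content rather than from the cell's index means that adding a fourth model later leaves the twelve existing ids unchanged, so previously stored results still match their cells.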
This is the package's reason to exist. inspect's model-graded scorers run inside the generating eval: judge calls would share the solver's logs, get no separate batching/caching/cost accounting, and adding a rubric later would re-run generation. Decoupling requires exactly what these modules do.
| Module | Lines | Why |
|---|---|---|
| `generate/_task.py` | 61 | items + condition → inspect Task (samples, epochs, GenerateConfig, cache policy). |
| `generate/_params.py` ◆ | 50 | Requested-vs-effective sampling params from model events (ROADMAP M2 checkbox: provider-forced values must be visible). |
| `generate/_run.py` | 370 | The stage orchestrator: resume computation, serial `eval()` calls, log→rows extraction, usage→USD, ledger/log-index writes, error containment. Biggest module because it owns the inspect↔store boundary; also exports helpers `grade/_run` reuses. |
| `grade/_verifiable.py` | 84 | exact/MC/numeric scorers as pure $0 functions over stored text (inspect scorers want a live TaskState, not a parquet row). |
| `grade/_parse.py` | 94 | The strict judge-output contract with exact failure codes; "flagged, never dropped" lives here. |
| `grade/_judge.py` | 93 | Stored solutions + rubric → a fresh inspect Task (judge-as-task), format suffix, prompt-cache hint. |
| `grade/_run.py` | 304 | Grade orchestrator: pending computation, verifiable vs judge dispatch, grader×rubric filters, solutions store never written. |
| `_mockmodels.py` | 68 | `mockllm/*` pass-through: deterministic outputs + fabricated usage so demos/tests/CI run the entire pipeline for $0. Dev affordance: the only module a user never needs. |
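
The point of `grade/_verifiable.py` is that exact, multiple-choice, and numeric grading need no model call, only the stored answer text. A rough sketch of what such pure functions can look like follows. The function names, normalization rules, answer-extraction heuristics, and tolerances are all assumptions, not the package's actual rules.

```python
import math
import re


def score_exact(answer: str, target: str) -> bool:
    """Exact match after collapsing whitespace and ignoring case."""
    def normalize(text: str) -> str:
        return " ".join(text.split()).casefold()
    return normalize(answer) == normalize(target)


def score_multiple_choice(answer: str, target: str) -> bool:
    """Compare the last standalone letter A-E in the answer with the target letter."""
    letters = re.findall(r"\b([A-E])\b", answer)
    return bool(letters) and letters[-1] == target.strip().upper()


def score_numeric(answer: str, target: float, rel_tol: float = 1e-6) -> bool:
    """Compare the last number found in the answer with the target, within a tolerance."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?", answer)
    if not numbers:
        return False
    return math.isclose(float(numbers[-1]), target, rel_tol=rel_tol, abs_tol=1e-9)
```

Each function takes plain strings and returns a boolean, so it can be applied to a stored row at any time, long after generation finished, without a live `TaskState`.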
inspect logs are per-eval .eval files. Cross-run accumulation, keyed
upserts, resume predicates, and the long-format join do not exist there.
| Module | Lines | Why |
|---|---|---|
| `_base.py` | 59 | The one upsert engine: concat → dedup-on-key → schema-cast → atomic replace. |
| `_layout.py` ◆ | 47 | Single source of truth for every path in a study dir. |
| `_solutions.py` | 70 | Solutions schema (36 cols) + `items_to_run` resume predicate. |
| `_gradings.py` | 80 | Gradings schema (30 cols) + `pending_solutions` (parse-failures final, errors retry). |
| `_items.py` ◆ | 47 | Items snapshot (analysis joins need item text without re-downloading). |
| `_logs.py` ◆ | 38 | Raw-log index: store row ↔ `.eval` transcript audit trail. |
| `_ledger.py` ◆ | 36 | Cost ledger schema. |
| `_export.py` | 183 | The 45-column long-table join + CSV mirrors + internal reconciliation. |
The five small schema modules could be one file; they are split so each table's schema+predicates are independently owned and tested.
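
The upsert pipeline named in the `_base.py` row (concat → dedup-on-key → schema-cast → atomic replace) can be sketched as follows. To stay dependency-free, the sketch stores rows as JSON Lines rather than parquet. `upsert_rows`, the last-write-wins rule, and the `schema` mapping of column name to cast function are assumptions for illustration.

```python
import json
import os
import tempfile


def upsert_rows(path, new_rows, key, schema):
    """Merge new_rows into the table at path; return the rows that were written."""
    # 1. concat: existing rows first, new rows after them.
    existing = []
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            existing = [json.loads(line) for line in handle if line.strip()]
    # 2. dedup on key: a later row overwrites an earlier row with the same key.
    merged = {}
    for row in existing + list(new_rows):
        merged[tuple(row[column] for column in key)] = row
    # 3. schema cast: keep only declared columns, coerce types, preserve missing as None.
    rows = [
        {column: None if row.get(column) is None else cast(row[column])
         for column, cast in schema.items()}
        for row in merged.values()
    ]
    # 4. atomic replace: write a temp file in the same directory, then rename it.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            for row in rows:
                handle.write(json.dumps(row, sort_keys=True) + "\n")
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return rows
```

With a key such as `("item_id", "condition_id")`, re-running a failed cell replaces its old row instead of appending a duplicate, which is what lets results accumulate across runs while a resume predicate checks only for the presence of a key.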
inspect_ai has no notion of dollars at all — no pricing, estimation, or spending gates.
| Module | Lines | Why |
|---|---|---|
| `_pricing.py` | 131 | Pricing table model, packaged seed, OpenRouter refresh, model→price lookup, token→USD math. |
| `_policies.py` ◆ | 40 | dev / full-interactive / full-batch → effective plan (items limit, replications, batch flag). |
| `_estimator.py` | 187 | Per-stage projection: heuristic tokens × grid × prices; uses stored solutions for judge sizing. |
| `_gate.py` | 69 | `confirm_above_usd` / `max_usd` / `--yes` / interactive decision table → exit codes 3/4. |
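
The token→USD math and the spending gate are small enough to sketch. Everything specific here is an assumption: prices quoted in USD per million tokens, the names `tokens_to_usd`, `GateDecision`, and `decide`, and the precedence of the rules. The table above does not say which outcome maps to exit code 3 and which to 4, so the sketch returns a decision and leaves the exit-code mapping out.

```python
from enum import Enum
from typing import Optional


class GateDecision(Enum):
    PROCEED = "proceed"
    ASK = "ask"                        # prompt the user interactively
    REFUSE_NEEDS_YES = "needs_yes"     # non-interactive run without --yes
    REFUSE_OVER_MAX = "over_max"       # hard budget cap exceeded


def tokens_to_usd(input_tokens: int, output_tokens: int,
                  usd_per_mtok_in: float, usd_per_mtok_out: float) -> float:
    """Cost of a call, given prices in USD per million tokens (assumed unit)."""
    return (input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out) / 1_000_000


def decide(estimate_usd: float, confirm_above_usd: Optional[float],
           max_usd: Optional[float], yes: bool, interactive: bool) -> GateDecision:
    """One possible decision table; the real precedence rules may differ."""
    # Assumed: the hard cap wins over everything, including --yes.
    if max_usd is not None and estimate_usd > max_usd:
        return GateDecision.REFUSE_OVER_MAX
    # Cheap runs, or runs pre-approved with --yes, go ahead without a prompt.
    if confirm_above_usd is None or estimate_usd <= confirm_above_usd or yes:
        return GateDecision.PROCEED
    return GateDecision.ASK if interactive else GateDecision.REFUSE_NEEDS_YES


estimate = tokens_to_usd(1_000_000, 500_000, usd_per_mtok_in=2.0, usd_per_mtok_out=8.0)
print(estimate)  # 6.0
print(decide(estimate, confirm_above_usd=5.0, max_usd=50.0, yes=False, interactive=False))
```

Returning a decision value instead of calling `sys.exit` directly keeps the gate a pure function that can be unit-tested, with the CLI layer owning the mapping to exit codes.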

| Module | Lines | Why |
|---|---|---|
| `_config.py` | 209 | The YAML contract: every config model, strict validation, `load_config`. Shared by everything; designed first. |
| `_templates.py` | 79 | Content-hashed prompt/rubric registry + brace-safe rendering (`str.format` would explode on LaTeX/JSON). |
| `_manifest.py` | 183 | The README reproducibility promise as a pydantic schema + writer + post-run effective-params backfill. |
| `_prepare.py` | 83 | `prepare_study()`: config → datasets+templates+grid+plan+pricing, computed once, shared by all five commands. |
| `_status.py` | 153 | Completion matrix: expected vs done vs errors vs parse-failures per condition (M1/M6 exit). |
| `cli.py` | 325 | argparse wiring, gate enforcement, output formatting, exit-code mapping. |
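
The `_templates.py` row notes that `str.format` breaks on prompts containing LaTeX or JSON braces. One way to get brace-safe rendering is a single regex pass that substitutes only known placeholder names and leaves every other brace untouched. The `render` function and the identifier-shaped placeholder syntax below are assumptions, not the module's actual rules.

```python
import re

# Only {identifier} counts as a placeholder; {1}, {"key": 1} and "{}" never match.
_PLACEHOLDER = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def render(template: str, values: dict[str, str]) -> str:
    """Substitute known placeholders in one pass; leave all other braces as written."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        # Unknown names (for example the {a} in \frac{a}{b}) are kept verbatim.
        return values[name] if name in values else match.group(0)
    return _PLACEHOLDER.sub(substitute, template)


template = r'Solve \frac{1}{2} + {question}. Reply as {"answer": <number>}.'
print(render(template, {"question": "x"}))
# template.format(question="x") would raise IndexError on the {1} in \frac{1}{2}.
```

Because `re.sub` does not rescan replaced text, a substituted value that itself contains `{question}` is not expanded a second time, so item text cannot inject placeholders into the prompt.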
The file count could shrink: merging the ◆ modules would give ~14 files with identical behavior. The split optimizes for one concern per file and per-module tests; it does not mean each file is architecturally load-bearing. What should not be expected to shrink is the behavior itself: each module maps to a checked ROADMAP box or a README promise, and the total is still under 3k lines on top of a >100k-line runtime.