Add benchmark validation suite, CLI command, and reporting artifacts by jam-sudo · Pull Request #6 · jam-sudo/Omega

jam-sudo · 2026-02-24T19:52:51Z

Motivation

Provide a reproducible validation/benchmark layer to track simulator behavior against reference PK curves and detect regressions.
Enable automated, deterministic benchmark runs and human-readable reporting to speed validation during development.

Description

Added a benchmark suite under benchmarks/ containing configs (benchmarks/configs/*.yaml), synthetic reference datasets (benchmarks/datasets/*.csv), and acceptance thresholds (benchmarks/expected/acceptance.json).
Implemented physio_sim.validation package with benchmarks.py (suite runner that loads configs, runs simulate(), interpolates sim->obs, computes AUC, Cmax, Tmax, RMSE, evaluates pass/fail versus thresholds, and writes artifacts) and report.py (builds report.md).
Exposed a non-breaking CLI subcommand benchmark in physio_sim.cli so the suite runs with python -m physio_sim.cli benchmark --suite benchmarks --out outputs/benchmarks and exits with code 1 when any case fails acceptance.
Added deterministic seed support for the suite by using existing configure_random_seed in configs, and wrote smoke/determinism tests in tests/test_benchmark_smoke.py to assert outputs and reproducibility.
Documentation: added docs/validation.md describing benchmark philosophy, dataset format, and instructions; README updated with quick example.
Files added/modified (high level): benchmarks/ (configs, datasets, expected JSON), physio_sim/validation/__init__.py, physio_sim/validation/benchmarks.py, physio_sim/validation/report.py, physio_sim/cli.py (benchmark command), docs/validation.md, tests/test_benchmark_smoke.py, and pyproject.toml (include new package).

Testing

Ran formatting and lint: python -m ruff format --check . and python -m ruff check . (passed).
Static typing: python -m mypy physio_sim (passed).
Unit tests: python -m pytest -q (all tests passed; 17 passed).
Executed benchmark runner end-to-end: python -m physio_sim.cli benchmark --suite benchmarks --out outputs/benchmarks, which produced outputs/benchmarks/<drug>/overlay.png, metrics.json, summary.json, and report.md and completed successfully (benchmarks passed in this run).
Note: bundled datasets (caffeine, warfarin, metoprolol) are synthetic example datasets and are explicitly documented in docs/validation.md.

Codex Task

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3916f2f55a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-02-24T19:55:51Z

+
+
+def run_benchmark_suite(suite_dir: Path, out_dir: Path) -> bool:
+    repo_root = Path.cwd()


Resolve benchmark file paths from suite directory

run_benchmark_suite anchors relative subject_file/dataset paths to Path.cwd() rather than the provided suite_dir, so running benchmark --suite /absolute/path/to/benchmarks from a different working directory looks for files under that unrelated cwd (for example /tmp/examples/subject_default.yaml) and crashes. This makes valid --suite invocations fail unless callers also cd to the repo root, which is a brittle regression for CLI and programmatic use.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-02-24T19:55:51Z

+    summary = {
+        "suite": str(suite_dir),
+        "thresholds": acceptance,
+        "overall_pass": all(item["pass"] for item in results),


Fail when benchmark suite executes zero cases

overall_pass is derived from all(item["pass"] for item in results) without verifying that any configs were run, and in Python all([]) is True. If configs/*.yaml is empty or not discovered, the command emits a passing summary and exits 0 even though no benchmarks executed, which can silently mask broken validation in CI.

Useful? React with 👍 / 👎.

Data fixes (6 entries, cross-referenced against FDA labels): valacyclovir: dose 20→1000mg (FDA Valtrex: 1g dose for acyclovir Cmax) carglumic acid: flagged uncertain (mg/kg dosing in pediatric label) darifenacin: dose 400→15mg (max clinical dose is 15mg) sertraline: Cmax 0.165→0.033 (FDA Zoloft 50mg: 33 ng/mL) ketorolac: dose 10→30mg (Cmax 2.52 matches 30mg FDA Toradol) cetirizine: dose 60→10mg (FDA Zyrtec: 10mg Cmax=311 ng/mL) AD filter refinement: - Thienopyridine SMARTS fixed: [#7]1[#6][#6]c2[#16]ccc2[#6]1 (catches clopidogrel) - Extreme lipophilicity threshold: logP 6.0→5.5 (catches sonidegib logP=5.2) In-domain holdout (54 drugs): AAFE: 1.923 [1.691, 2.206] %2-fold: 63.0% %3-fold: 85.2% Session total: AAFE 3.520 → 1.923 (-45.4%), zero model changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add benchmark validation suite and CLI reporting

3916f2f

jam-sudo added the codex label Feb 24, 2026 — with ChatGPT Codex Connector

jam-sudo merged commit 98698ce into main Feb 24, 2026
2 checks passed

chatgpt-codex-connector Bot reviewed Feb 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add benchmark validation suite, CLI command, and reporting artifacts#6

Add benchmark validation suite, CLI command, and reporting artifacts#6
jam-sudo merged 1 commit intomainfrom
codex/add-benchmark-validation-suite

jam-sudo commented Feb 24, 2026

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Feb 24, 2026

Uh oh!

chatgpt-codex-connector Bot Feb 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant



		def run_benchmark_suite(suite_dir: Path, out_dir: Path) -> bool:
		repo_root = Path.cwd()

Conversation

jam-sudo commented Feb 24, 2026

Motivation

Description

Testing

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant