Skip to content

Add benchmark validation suite, CLI command, and reporting artifacts#6

Merged
jam-sudo merged 1 commit intomainfrom
codex/add-benchmark-validation-suite
Feb 24, 2026
Merged

Add benchmark validation suite, CLI command, and reporting artifacts#6
jam-sudo merged 1 commit intomainfrom
codex/add-benchmark-validation-suite

Conversation

@jam-sudo
Copy link
Copy Markdown
Owner

Motivation

  • Provide a reproducible validation/benchmark layer to track simulator behavior against reference PK curves and detect regressions.
  • Enable automated, deterministic benchmark runs and human-readable reporting to speed validation during development.

Description

  • Added a benchmark suite under benchmarks/ containing configs (benchmarks/configs/*.yaml), synthetic reference datasets (benchmarks/datasets/*.csv), and acceptance thresholds (benchmarks/expected/acceptance.json).
  • Implemented physio_sim.validation package with benchmarks.py (suite runner that loads configs, runs simulate(), interpolates sim->obs, computes AUC, Cmax, Tmax, RMSE, evaluates pass/fail versus thresholds, and writes artifacts) and report.py (builds report.md).
  • Exposed a non-breaking CLI subcommand benchmark in physio_sim.cli so the suite runs with python -m physio_sim.cli benchmark --suite benchmarks --out outputs/benchmarks and exits with code 1 when any case fails acceptance.
  • Added deterministic seed support for the suite by using existing configure_random_seed in configs, and wrote smoke/determinism tests in tests/test_benchmark_smoke.py to assert outputs and reproducibility.
  • Documentation: added docs/validation.md describing benchmark philosophy, dataset format, and instructions; README updated with quick example.
  • Files added/modified (high level): benchmarks/ (configs, datasets, expected JSON), physio_sim/validation/__init__.py, physio_sim/validation/benchmarks.py, physio_sim/validation/report.py, physio_sim/cli.py (benchmark command), docs/validation.md, tests/test_benchmark_smoke.py, and pyproject.toml (include new package).

Testing

  • Ran formatting and lint: python -m ruff format --check . and python -m ruff check . (passed).
  • Static typing: python -m mypy physio_sim (passed).
  • Unit tests: python -m pytest -q (all tests passed; 17 passed).
  • Executed benchmark runner end-to-end: python -m physio_sim.cli benchmark --suite benchmarks --out outputs/benchmarks, which produced outputs/benchmarks/<drug>/overlay.png, metrics.json, summary.json, and report.md and completed successfully (benchmarks passed in this run).
  • Note: bundled datasets (caffeine, warfarin, metoprolol) are synthetic example datasets and are explicitly documented in docs/validation.md.

Codex Task

@jam-sudo jam-sudo merged commit 98698ce into main Feb 24, 2026
2 checks passed
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3916f2f55a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".



def run_benchmark_suite(suite_dir: Path, out_dir: Path) -> bool:
repo_root = Path.cwd()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Resolve benchmark file paths from suite directory

run_benchmark_suite anchors relative subject_file/dataset paths to Path.cwd() rather than the provided suite_dir, so running benchmark --suite /absolute/path/to/benchmarks from a different working directory looks for files under that unrelated cwd (for example /tmp/examples/subject_default.yaml) and crashes. This makes valid --suite invocations fail unless callers also cd to the repo root, which is a brittle regression for CLI and programmatic use.

Useful? React with 👍 / 👎.

summary = {
"suite": str(suite_dir),
"thresholds": acceptance,
"overall_pass": all(item["pass"] for item in results),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Fail when benchmark suite executes zero cases

overall_pass is derived from all(item["pass"] for item in results) without verifying that any configs were run, and in Python all([]) is True. If configs/*.yaml is empty or not discovered, the command emits a passing summary and exits 0 even though no benchmarks executed, which can silently mask broken validation in CI.

Useful? React with 👍 / 👎.

jam-sudo added a commit that referenced this pull request Mar 22, 2026
Data fixes (6 entries, cross-referenced against FDA labels):
  valacyclovir: dose 20→1000mg (FDA Valtrex: 1g dose for acyclovir Cmax)
  carglumic acid: flagged uncertain (mg/kg dosing in pediatric label)
  darifenacin: dose 400→15mg (max clinical dose is 15mg)
  sertraline: Cmax 0.165→0.033 (FDA Zoloft 50mg: 33 ng/mL)
  ketorolac: dose 10→30mg (Cmax 2.52 matches 30mg FDA Toradol)
  cetirizine: dose 60→10mg (FDA Zyrtec: 10mg Cmax=311 ng/mL)

AD filter refinement:
  - Thienopyridine SMARTS fixed: [#7]1[#6][#6]c2[#16]ccc2[#6]1 (catches clopidogrel)
  - Extreme lipophilicity threshold: logP 6.0→5.5 (catches sonidegib logP=5.2)

In-domain holdout (54 drugs):
  AAFE: 1.923 [1.691, 2.206]
  %2-fold: 63.0%   %3-fold: 85.2%

Session total: AAFE 3.520 → 1.923 (-45.4%), zero model changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant