Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3916f2f55a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
|
|
||
| def run_benchmark_suite(suite_dir: Path, out_dir: Path) -> bool: | ||
| repo_root = Path.cwd() |
There was a problem hiding this comment.
Resolve benchmark file paths from suite directory
run_benchmark_suite anchors relative subject_file/dataset paths to Path.cwd() rather than the provided suite_dir, so running benchmark --suite /absolute/path/to/benchmarks from a different working directory looks for files under that unrelated cwd (for example /tmp/examples/subject_default.yaml) and crashes. This makes valid --suite invocations fail unless callers also cd to the repo root, which is a brittle regression for CLI and programmatic use.
Useful? React with 👍 / 👎.
| summary = { | ||
| "suite": str(suite_dir), | ||
| "thresholds": acceptance, | ||
| "overall_pass": all(item["pass"] for item in results), |
There was a problem hiding this comment.
Fail when benchmark suite executes zero cases
overall_pass is derived from all(item["pass"] for item in results) without verifying that any configs were run, and in Python all([]) is True. If configs/*.yaml is empty or not discovered, the command emits a passing summary and exits 0 even though no benchmarks executed, which can silently mask broken validation in CI.
Useful? React with 👍 / 👎.
Data fixes (6 entries, cross-referenced against FDA labels): valacyclovir: dose 20→1000mg (FDA Valtrex: 1g dose for acyclovir Cmax) carglumic acid: flagged uncertain (mg/kg dosing in pediatric label) darifenacin: dose 400→15mg (max clinical dose is 15mg) sertraline: Cmax 0.165→0.033 (FDA Zoloft 50mg: 33 ng/mL) ketorolac: dose 10→30mg (Cmax 2.52 matches 30mg FDA Toradol) cetirizine: dose 60→10mg (FDA Zyrtec: 10mg Cmax=311 ng/mL) AD filter refinement: - Thienopyridine SMARTS fixed: [#7]1[#6][#6]c2[#16]ccc2[#6]1 (catches clopidogrel) - Extreme lipophilicity threshold: logP 6.0→5.5 (catches sonidegib logP=5.2) In-domain holdout (54 drugs): AAFE: 1.923 [1.691, 2.206] %2-fold: 63.0% %3-fold: 85.2% Session total: AAFE 3.520 → 1.923 (-45.4%), zero model changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Motivation
Description
benchmarks/containing configs (benchmarks/configs/*.yaml), synthetic reference datasets (benchmarks/datasets/*.csv), and acceptance thresholds (benchmarks/expected/acceptance.json).physio_sim.validationpackage withbenchmarks.py(suite runner that loads configs, runssimulate(), interpolates sim->obs, computesAUC,Cmax,Tmax,RMSE, evaluates pass/fail versus thresholds, and writes artifacts) andreport.py(buildsreport.md).benchmarkinphysio_sim.cliso the suite runs withpython -m physio_sim.cli benchmark --suite benchmarks --out outputs/benchmarksand exits with code1when any case fails acceptance.configure_random_seedin configs, and wrote smoke/determinism tests intests/test_benchmark_smoke.pyto assert outputs and reproducibility.docs/validation.mddescribing benchmark philosophy, dataset format, and instructions; README updated with quick example.benchmarks/(configs, datasets, expected JSON),physio_sim/validation/__init__.py,physio_sim/validation/benchmarks.py,physio_sim/validation/report.py,physio_sim/cli.py(benchmark command),docs/validation.md,tests/test_benchmark_smoke.py, andpyproject.toml(include new package).Testing
python -m ruff format --check .andpython -m ruff check .(passed).python -m mypy physio_sim(passed).python -m pytest -q(all tests passed; 17 passed).python -m physio_sim.cli benchmark --suite benchmarks --out outputs/benchmarks, which producedoutputs/benchmarks/<drug>/overlay.png,metrics.json,summary.json, andreport.mdand completed successfully (benchmarks passed in this run).caffeine,warfarin,metoprolol) are synthetic example datasets and are explicitly documented indocs/validation.md.Codex Task