fix(bpsd): apply species-kid OOB patch + restore Layer-1 CI tests#140
Conversation
ROOT CAUSE LOCATED. The CI SIGABRT class that forced 22+ tests into --deselect was traced via valgrind in PR #139 to a write- past-end in upstream BPSD's bpsd_setup_species_kdata loop: do nd=0,speciesx%ndmax-1,3 speciesx%kid(nd+1)='species%pa' speciesx%kid(nd+2)='species%pz' speciesx%kid(nd+3)='species%npa' ! kid(21) when ndmax=20 → OOB ... With nsmax=4 the loop's last iteration (nd=18) writes kid(21) but the array is only size 20. Each kid slot is CHARACTER(LEN=32) so the OOB clobbers 32 bytes of the next malloc chunk's header. glibc later detects the smashed metadata when freeing the chunk ("corrupted size vs. prev_size") and SIGABRTs the process. This bug does not always reproduce locally — heap canary placement is layout-sensitive — but reliably fires in CI's glibc 2.39 heap layout. Changes: 1. docs/external-patches/bpsd/bpsd-species-kid-oob-fix.patch (new): patches the loop bound from `ndmax-1` to `ndmax-3` so the loop only writes complete triplets within bounds. 2. .github/workflows/python-tests.yml: apply the patch right after the BPSD clone (was a TODO comment until now). 3. .github/workflows/python-tests.yml: REMOVE the entire --deselect block. Per MEMORY.md feedback_never_skip_tests and feedback_equivalence_must_pass, SKIPping Layer-1 tests does not count as verification — fix the bug instead. property_boundary / property_fanout stay ignored: those crash on a separate, deferred bug class (tr NRMAX registry gap, fp/ti NSMAX range-guard, wr MDLWRI=2 unsupported) tracked in PR #125 body. Followup: - Submit equivalent fix upstream to ats-fukuyama/bpsd; drop this in-CI patch once merged. - Address property_boundary/fanout punch list in separate PRs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@cursor review |
There was a problem hiding this comment.
✅ Bugbot reviewed your changes and found no new issues!
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit c807e65. Configure here.
CI run after BPSD patch surfaced a different failure mode in totlib's equivalence tests: when test_run/test_output/<case>/ does not contain the eqdata-* file referenced by the fixture's KNAMEQ string, eq_load() fails with ierr=7 and the libtotapi caller path STOPs the worker process. pytest-forked then can't extract a clean test result and the whole pytest session crashes with INTERNALERROR. Mirror the trlib/tests/test_equivalence.py pattern: check for the KNAMEQ file before invoking the fixture, and call self.skipTest with an actionable message pointing at the run_tests.sh recipe that generates it. This is the *correct* kind of skip per MEMORY.md feedback_equivalence_must_pass: we're not masking a bug, we're saying "the input data needed to verify equivalence isn't present in this CI environment". The test would still run locally where the user has executed run_tests.sh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #138 added gfortran-14 from ubuntu-toolchain-r PPA on the hypothesis that the CI SIGABRT class was a 13.2 codegen bug. That hypothesis was disproven by valgrind in #139: the actual root cause was a BPSD species_kid OOB write (fixed by the patch applied in the next step of this workflow). Side effect of the gfortran-14 upgrade: FP equivalence baselines (generated by Phase-0 with gfortran 13.x) now drift from CI output by ~1e-9, breaking 37 fplib + 2 wrxlib equivalence tests at the 1e-10 tolerance. Fix: drop the PPA install, use Ubuntu 24.04's stock gfortran 13.2 which is close enough to the baseline-generation toolchain (both 13.x). Drops the sudo + add-apt-repository step too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@cursor review |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 54483a0. Configure here.
| gcc gfortran-14 make | ||
| sudo update-alternatives --install /usr/bin/gfortran gfortran \ | ||
| /usr/bin/gfortran-14 140 | ||
| gfortran gcc make valgrind |
There was a problem hiding this comment.
Unnecessary valgrind installed but never used in CI
Low Severity
valgrind is added to the apt-get install list but is never invoked in any workflow step. It appears to be a leftover from the local debugging session described in the PR ("located via valgrind in PR #139"). The only references to valgrind in the workflow are in comments explaining how the bug was found, not in any run: block.
Reviewed by Cursor Bugbot for commit 54483a0. Configure here.
…154) ## Root cause `bpsd_get_plasmaf(plasmaf_out, ierr)` declares `plasmaf_out` with `INTENT(OUT)`. Per Fortran 2003+ semantics on a derived type with allocatable components: - Allocatable members (rho/data/qinv) auto-deallocate on entry - Scalar members (nrmax/nsmax/time) become UNDEFINED on entry The body then tests `plasmaf_out%nrmax.eq.0` to dispatch mode 0 (fresh fill) vs mode 1 (preserve caller-provided sizing). With nrmax UNDEFINED, the heap garbage value can flip the test arbitrarily → on mode=1 path bpsd_adjust_array1D leaves rho's descriptor in a state where the subsequent indexed access trips the bounds check with a garbage upper/lower bound. Symptom in callers (libtotapi.so via tr_loop → tr_bpsd_get): At line 202 of file ../../bpsd/bpsd_plasmaf.f90 Fortran runtime error: Index '1' of dimension 1 of array 'plasmaf_out%rho' below lower bound of <garbage> Confirmed via file-based dump (debug branch debug/plasmaf-dump-148 in fork) showing on entry: ALLOC(rho) = F ← rho auto-dealloc'd by intent(out) out%nrmax = <stale value> ← scalar undefined, pulls heap garbage ## Fix 3-line zero-init at the top of bpsd_get_plasmaf (upstream BPSD patch docs/external-patches/bpsd/bpsd-plasmaf-intent-out-init.patch): ```fortran plasmaf_out%nrmax = 0 plasmaf_out%nsmax = 0 plasmaf_out%time = 0.0_dp ``` Forces mode=0 deterministically on first invocation; bpsd_adjust_array1D then unconditionally allocates rho to the correct size. ## CI workflow update `.github/workflows/python-tests.yml` step "Clone BPSD + apply upstream patches (#140 species-kid, #148 plasmaf)" now applies both patches in sequence after `git clone`. ## Local-dev helper `scripts/apply-bpsd-patches.sh` mirrors the CI step idempotently (skips already-applied patches via `git apply --reverse --check`) and force-rebuilds ALL libXapi.so. The "all" matters because each module's PIC build statically links its own bpsd_*.o; runtime symbol resolution picks the first .so loaded, so a stale .o in any module shadows the patched code in others. This was the false-positive that masked #148 for several test runs (tests passed on a fresh CI build but failed on a developer host that had built only a subset). ## Out of scope (deferred follow-up) - bpsd_get_species and bpsd_get_trsource use the same INTENT(OUT) pattern. Defensive resets prepared but reverted from this PR after they exposed a separate pre-existing tilib regression (`'vtot' above upper bound of -888800` in libmtxnompi.f90:704 — reproduces without my BPSD changes). Filing under #148 follow-up: investigate the tilib regression first, then add matching resets for the other two routines. - XFAIL_REMOVE_WITH_148 markers in tot_mcp test_integration.py will be removed in a follow-up PR after CI confirms green. ## Verified - Local: pytest python/mcp-servers/tot_mcp/tests/test_integration.py + trlib/totlib equivalence → 6/6 PASS. - 5x deterministic on the dump-instrumented probe (no longer crashes). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
) * ci(bpsd): clone patched fork instead of upstream + git apply Replaces the "clone ats-fukuyama/bpsd + git apply 2 patches" CI step with a single clone of k-yoshimi/bpsd@develop, a personal fork that carries both task-private patches as proper commits: - 879bb7ac: species_kid OOB write fix (#140, PR #139) - dcccf84f: plasmaf INTENT(OUT) zero-init (#148, PR #154/#155) Both patches were emailed to upstream on 2026-04-22; the fork shrinks to zero (rebases away naturally) once upstream merges them. Removed: - docs/external-patches/bpsd/{README.md,*.patch} - scripts/apply-bpsd-patches.sh Migration rationale + sync workflow recorded in memory: reference_bpsd_fork.md. Local devs: re-point sibling clone with cd ../bpsd git remote set-url origin https://github.com/k-yoshimi/bpsd.git git fetch origin git checkout -B develop origin/develop then rebuild libbpsd.a + every lib<mod>api.so so each statically linked bpsd_*.o is the patched copy. Verified: trlib pytest 53/53 pass against the new sibling clone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci(bpsd): document fork sync workflow + harden clone failure mode Bugbot/code-review findings on the previous commit: 1. The CI step comment pointed at `memory/reference_bpsd_fork.md` which is in agent memory, not in the repo — dead reference for any contributor. Replaced with `docs/external-patches/bpsd-fork.md`, a new in-repo doc covering the full sync workflow, sibling-clone setup, rebuild-all libXapi.so loop, and recovery procedure if the fork's develop branch ever disappears. 2. Bare `git clone` failure produced a cryptic "Remote branch develop not found" with no recovery hint. Wrapped in `if ! git clone …` and on failure prints a 3-line message pointing at bpsd-fork.md. 3. The pytest-step comment block had an orphaned sentence fragment ("property_boundary / property_fanout are the / remaining class") left over from removing the patch-path reference. Re-flowed for coherence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(bpsd-fork): add wr to sibling-clone rebuild loop Code-review caught: bpsd-fork.md's `for mod in …` loop omitted `wr`, which would leave `wr/libwrapi.so` linked against stale `bpsd_*.o` after a sibling-clone re-point — exactly the silent-stale-.so failure mode the same paragraph warned about. Aligned the loop with the CI "Build module shared libraries" step (`tr fp ti wr wrx eq` + `tot`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: k-yoshimi <k-yoshimi@g.ecc.u-tokyo.ac.jp> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Folds in Codex round-2 findings on the PNBTOT spec: HIGH: - §8.2: replace bogus `--output` flag with positional + stdout form, matching §4.4. The script has no --output flag. MED: - §8.1: document tr/Makefile dep behaviour — non-PIC has explicit trcomm.f90 dep (line 535), PIC variant uses generic %.o rule (line 257) without inter-module tracking; add `make clean` workaround. - §10 Risks: add BPSD-driven SIGABRT row (cross-ref PR #140 + feedback_use_valgrind.md), schema-drift parse-failure row, and near-zero→finite tolerance row for compare_metrics.py 1e-10. - §12 AC: add direct `set_param("PNBTOT", 25.0)` smoke test as independent proof of registry wiring; add `.in` content grep AC. Other: - Update HEAD reference 28a996f → 20d87da after CI hot-fix. - Update reviewer agent name feature-dev → superpowers (CLAUDE.md has the legacy name; the actually-available agent is superpowers:code-reviewer). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MED: - §10: drop the compare_metrics.py near-zero risk row — Codex confirmed compare_metrics.py guards _rel_err with max(.., 1e-300) so the over-stated "div-by-zero" risk does not exist. Replace the schema-change row to correctly describe the SCALAR_KEYS allowlist behaviour (silent drop of unknown keys) and remove the factually-wrong "--help-less stderr" wording. - §10: tighten BPSD risk row with explicit memory-file paths ($CLAUDE_MEMORY_DIR/memory/feedback_use_valgrind.md, PR #140 + project_bpsd_species_kid_oob_patch.md). - §12 AC: extend the registry-confirmation one-liner with tr.run(0) so the smoke covers the full set_param → tr_prep → trpnb.f90:46 consumer path (set_param alone only proves the CASE is wired, not that the value travels). Rename "end-to-end" → "registry- and-consumption" to avoid contract overlap with §8.3. §9 reviewer agent name (Codex MED) intentionally left as-is: spec describes the working agent name and explicitly notes the CLAUDE.md legacy reference; updating CLAUDE.md is a separate PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Summary
ROOT CAUSE LOCATED for the CI SIGABRT class that forced 22+ tests into `--deselect`. Located via valgrind in PR #139.
Bug (in upstream BPSD's `bpsd_setup_species_kdata`):
```fortran
do nd=0,speciesx%ndmax-1,3 ! ndmax=20 → nd∈{0,3,..,18}
speciesx%kid(nd+1)='species%pa' ! kid(19) OK
speciesx%kid(nd+2)='species%pz' ! kid(20) OK
speciesx%kid(nd+3)='species%npa' ! kid(21) ★ OOB
```
`kid` array is `size = ndmax = nsmax*5 = 20`. Last iteration writes `kid(21)` — 32 bytes (CHARACTER LEN=32) past the array end → smashes the next malloc chunk's metadata. glibc later detects on free as "corrupted size vs. prev_size" → SIGABRT.
Why local doesn't always reproduce: heap canary placement is layout-sensitive. CI's glibc 2.39 reliably triggers; some dev hosts don't.
Changes
Followup
Test plan
🤖 Generated with Claude Code
Note
Medium Risk
CI behavior changes significantly by re-enabling previously deselected equivalence/lifecycle/integration tests and altering the toolchain/deps; risk is mainly new CI failures or flaky behavior rather than production runtime impact.
Overview
Restores full Layer-1 pytest coverage in CI by removing the large
--deselectblock and relying on a fixed upstream dependency instead of skipping SIGABRT-prone tests.CI now clones
ats-fukuyama/bpsdand applies a new in-repo patch (docs/external-patches/bpsd/bpsd-species-kid-oob-fix.patch) that fixes an out-of-bounds write inbpsd_setup_species_kdata, and switches the workflow back to Ubuntu’s defaultgfortran(dropping the gfortran-14 PPA path) while installingvalgrind.python/totlib/tests/test_equivalence.pygains a guard to skip cleanly when requiredKNAMEQeqdata files are missing, avoiding hard crashes insideeq_loadin environments without baseline eqdata.Reviewed by Cursor Bugbot for commit 54483a0. Bugbot is set up for automated code reviews on this repo. Configure here.