Skip to content

test(tr): Phase 0 - regression test infrastructure#2

Merged
k-yoshimi merged 16 commits into
developfrom
feature/tr-regression-phase0
Apr 18, 2026
Merged

test(tr): Phase 0 - regression test infrastructure#2
k-yoshimi merged 16 commits into
developfrom
feature/tr-regression-phase0

Conversation

@k-yoshimi
Copy link
Copy Markdown
Owner

@k-yoshimi k-yoshimi commented Apr 17, 2026

Stacked on #1. After #1 merges, the base of this PR should be changed to `develop`:

gh pr edit 2 --repo k-yoshimi/task --base develop

Summary

TASK/TR モジュールのリファクタリングに先立ち、数値挙動を固定する回帰テスト基盤を `test_run/` に整備する Phase 0 の実装。

  • `tr/trregress.f90` を追加(環境変数 `TR_REGRESS_DUMP=1` のときのみ `tr_regress.dat` を `1PE24.16` 書式で書き出す)
  • `test_run/scripts/` に dump → JSON 抽出・比較スクリプト + unittest
  • `test_run/inputs/` に `tr_m0904` と `tr_tst2` の短縮入力を追加
  • `test_run/baselines/` に 3 ケース分のゴールデン指標
  • `run_tests.sh` に TR モジュール実行時の `TR_REGRESS_DUMP=1` エクスポートと回帰判定を統合

設計書と実装計画

  • 設計書: `docs/superpowers/specs/2026-04-17-tr-refactoring-design.md`
  • 実装計画: `docs/superpowers/plans/2026-04-17-tr-refactoring-phase0.md`

設計方針は Option B(構造+危険箇所のみの限定的近代化)、既存開発者の参画しやすさを優先。

回帰判定の仕組み

  1. `run_tests.sh` は TR モジュールに対してだけ `TR_REGRESS_DUMP=1` をエクスポートして `tr/tr2` を実行
  2. CLOSED メッセージで計算成功を判定
  3. `scripts/check_regression.sh` が `tr_regress.dat` → JSON → `baselines/` と相対誤差 `1e-10` で比較
  4. 不一致なら `REGRESSION` で FAIL

通常実行への影響

なし。 `TR_REGRESS_DUMP` 未設定時は `tr_regress_dump_if_enabled` が即 `RETURN` するため、従来の挙動と完全に同一。

変更の要約(15 コミット)

# 内容
1 設計書+実装計画を追加
2-3 Phase 0 作業開始マーカー & 戦略ロック(空コミット)
4 `tr/trregress.f90` 追加(本 PR の中核)
5-6 `extract_tr_metrics.py` と pytest→unittest 移植
7 `compare_metrics.py`(許容誤差 1e-10)
8-9 M0904, TST-2 の入力ケース
10 `check_regression.sh` ラッパ
11 `test_definitions.conf` に 2 ケース追加
12 `run_tests.sh` に `TR_REGRESS_DUMP=1` + 回帰判定を統合
13 3 ケース分のベースライン JSON
14 README
15 `.gitignore` 調整

Test plan

  • 通常実行(`TR_REGRESS_DUMP` 未設定)で `tr_regress.dat` が生成されないことを確認
  • `./run_tests.sh tr_iter01 tr_m0904 tr_tst2` が 3 連続で全 PASS
  • 同一入力の run-to-run で dump が bit-exact 一致
  • Python の単体テスト 8 件全 PASS(`python3 -m unittest discover test_run/scripts/tests`)
  • RKEV 定数の 0.05% 改変で 207 件の mismatch を検出・REGRESSION 報告を確認
  • リストア後に全 PASS に復帰

pytest → unittest について

設計書と実装計画では pytest を想定していたが、対象環境に pytest が未インストールのため stdlib `unittest` に移植(コミット a5a6096)。機能的な差異はなく、`python3 -m unittest discover` で同等のテストを実行できる。

次のフェーズ

本 PR 後、Phase 1(`USE TRCOMM` への ONLY 句付与)から順次別 PR で進める予定。

🤖 Generated with Claude Code


Note

Medium Risk
Adds a new TR codepath (file dump) and wires numerical comparisons into the test runner; while gated by TR_REGRESS_DUMP, failures can now come from baseline drift/tolerance and the dump hook runs on all TR test executions.

Overview
Adds Phase 0 TR regression testing infrastructure by introducing tr/trregress.f90 and a one-line hook in trloop.f90 (plus tr/Makefile update) to write tr_regress.dat in a fixed high-precision format when TR_REGRESS_DUMP=1.

Extends test_run to enforce numeric stability: run_tests.sh now sets TR_REGRESS_DUMP=1 for TR runs and, after CLOSED, calls a new scripts/check_regression.sh wrapper that extracts metrics (extract_tr_metrics.py) and compares them to committed baselines (compare_metrics.py, default rel tol 1e-10), reporting REGRESSION on mismatch.

Adds two new TR input cases (tr_m0904, tr_tst2) to test_definitions.conf, commits baseline metrics.json for tr_iter01, tr_m0904, and tr_tst2, and documents the workflow in test_run/README.md (with test_output/ and Python caches ignored via test_run/.gitignore).

Reviewed by Cursor Bugbot for commit fe94c1e. Bugbot is set up for automated code reviews on this repo. Configure here.

k-yoshimi and others added 15 commits April 17, 2026 21:08
- specs: TR module refactoring design (Option B: structure + danger-spot modernization)
- plans: Phase 0 regression test infrastructure with env-guarded high-precision dump
- Parked on feature/tr-regression-phase0-docs pending dealloc cleanup merge

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add trregress module providing tr_regress_dump_if_enabled. When the
environment variable TR_REGRESS_DUMP=1 is set, tr_loop writes key
global quantities (WPT, AJT, Q0, BETAn, TAUEn, ZEFF0, ALI, RQ1) and
radial profiles (RN, RT, AJ, QP) to tr_regress.dat in 1PE24.16 format
at the end of the transport calculation loop. When the variable is
unset, the routine returns immediately so normal runs are unaffected.

This is Phase 0 Task 3 of the TR regression test infrastructure.
Run-to-run bit-exact reproducibility verified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pytest is not installed on the target environment. Switch to unittest so
the regression suite runs on a vanilla Python 3 install.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Baselines captured with:
- tr/tr2 built at commit a361df2 with TR_REGRESS_DUMP=1 env-guarded dump
- gfortran OFLAGS = -g -O3 -m64 -std=legacy
- Verified run-to-run reproducible within 1e-10 tolerance across 3 consecutive runs.
@k-yoshimi k-yoshimi changed the base branch from fix/tr-dealloc-cleanup to develop April 17, 2026 23:17
@k-yoshimi
Copy link
Copy Markdown
Owner Author

@cursor review

Comment thread test_run/scripts/compare_metrics.py
… passing

Cursor Bugbot flagged that `_rel_err(41.13, inf)` evaluates to NaN
(`inf / inf`) and `NaN > tol` is False, so a regression that produces
`inf` would be silently accepted. Guard `_check_scalar` with
`math.isinf` the same way `math.isnan` is guarded: identical infinities
still match, but any finite-vs-inf or +inf/-inf mismatch is reported.

Add three unittest cases covering actual-is-inf, baseline-is-inf, and
matching-infs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@k-yoshimi
Copy link
Copy Markdown
Owner Author

Cursor Bugbot の指摘に対応しました。

Fix: compare_metrics.py:_check_scalarmath.isinf ガードを追加し、finite-vs-inf / 符号違い inf を検知するようにしました。同値 inf(baseline=actual=inf)は従来通り match 扱いです。

Tests: test_compare_metrics.py に 3 ケース追加(test_fails_when_actual_is_inf / test_fails_when_baseline_is_inf / test_passes_when_both_are_same_inf)。unittest 11 件全 PASS。

@cursor review

@k-yoshimi
Copy link
Copy Markdown
Owner Author

@cursor review

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit fe94c1e. Configure here.

@k-yoshimi k-yoshimi merged commit 926b25b into develop Apr 18, 2026
1 check passed
@k-yoshimi k-yoshimi deleted the feature/tr-regression-phase0 branch April 18, 2026 15:44
k-yoshimi added a commit that referenced this pull request Apr 28, 2026
Bugbot's second pass on PR #178 surfaced 4 issues that survived the
first round of LOW fixes (e1e5a37). All four addressed in this
commit, with targeted regression tests:

#1 (LOW) PipelineResult.to_dict() now returns defensive copies of
   inner scalars dicts and coupling_applied lists. Previously, MCP-
   side mutation of the JSON payload would propagate into the
   originating PipelineStep. New test_to_dict_returns_defensive_copies
   asserts the contract.

#2 (LOW) _validate_steps now accepts list-of-2-element steps in
   addition to tuple-of-2-element. JSON-decoded MCP payloads
   deserialise tuples as lists, so the API now round-trips through
   to_dict() without manual coercion. New
   test_run_pipeline_accepts_list_shaped_steps regression guard.

#3 (LOW) set_param now explicitly rejects bool inputs with
   TotPipelineCouplingError instead of silently coercing to 1.0/0.0
   via the numeric branch (Python's bool is a subclass of int).
   The previous test_set_param_records_bool_as_float (which
   documented the silent-coercion behavior) is replaced with
   test_set_param_rejects_bool that asserts the explicit rejection.

#4 (MED) ExceptionGroup compatibility shim for Python 3.10. The
   close() multi-failure path uses builtin ExceptionGroup (3.11+);
   tot_mcp's pyproject.toml declares requires-python=">=3.10". The
   shim probes for the builtin first, falls back to the
   exceptiongroup PyPI backport if available, and degrades
   gracefully to raising errors[0] only when neither is reachable
   (same as pre-fix behavior). CI runs 3.11+ so the existing
   ExceptionGroup test still passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
k-yoshimi added a commit that referenced this pull request Apr 28, 2026
* docs(spec): add L-7a cross-module coupling design

Design for incremental Python-side scalar coupling between fp and tr
in the tot orchestrator. Establishes TotPipeline class skeleton in a
new file (python/totlib/pipeline.py) without touching the legacy Tot
wrapper. Covers fp -> tr driven current (RJT volume integral -> PNBCD
scalar) only; profile/coordinate transforms deferred to L-7b via BPSD.

Two Codex review passes: 9 findings (HIGH 2 / MED 5 / LOW 1) all
addressed inline. Pre-flight Research R1 (singleton coexistence +
finalize/init state-leak) / R2 (rjt_volint extraction) / R3 (tr param
name confirmation) gated before implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(plan): add L-7a cross-module coupling implementation plan

Bite-sized TDD task list executing the L-7a design spec
(0a12bd0) across 4 phases / 16 tasks / 3 sub-PRs:

- Phase 0: Pre-flight research R1/R2/R3 (singleton coexistence,
  rjt_volint extraction, tr driven-current param confirmation)
- Phase 1 (PR 1): TotPipeline class + errors + mock unit tests
- Phase 2 (PR 2): compute_rjt_volint helper + COUPLING_RULES
  population + fp -> tr 1e-10 equivalence test
- Phase 3 (PR 3): run_pipeline MCP tool + docs

Each PR phase ends with the CLAUDE.md pre-push gate (code-reviewer +
codex:codex-rescue + REVIEW_OK marker + --forked --timeout=120
--timeout-method=signal pytest).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(spec): record L-7a R1 (singleton coexistence) outcome

R1-a (concurrent live) and R1-b (re-init state leak) verified per
the L-7a design spec. Outcome and fallback decision recorded.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(spec): record L-7a R2 (rjt_volint extraction) outcome

R2 PASS. compute_rjt_volint(state, *, R0, a) verified against fp
internal aggregate rtotalIP at 1e-10 (sanity case NRMAX=10, RR=3,
RA=1, default fixture: helper -2.769528e-13 MA == fp stdout -2.7695E-13 MA).

FpState shape characterized: top-level dataclass attrs
[nrmax, nsamax, npmax, nthmax, ntg2, timefp,
 RNT/RWT/RTT/RJT/RPCT/RPWT]; **no `.scalars` dict** (asymmetric to
TrState — R1 concern #3 confirmed). FpState exposes no grid info
(RG/DV/VOLR/RA/RR absent), so helper takes R0, a as kwargs and
assumes uniform rho-grid in [0,1] (matches fp/fpprep.f90:97-99).

RJT units: [MA/m^2] (matches RJS in fp/fpcale.f90:290 where RJ_P =
RJS*1.D6 converts to A/m^2). Helper applies *1e6 internally and
returns total I in [A].

fplib unchanged — helper lives in totlib/pipeline.py (Phase 2 Task 2.1).

Probe stdout (truncated):
  FpState attributes: RJT/RNT/RPCT/RPWT/RTT/RWT (list[list[float]]),
  npmax=50, nrmax=1, nsamax=1, ntg2=2, nthmax=50, timefp=0.01.
  Grid candidates RG/DV/VOLP/RM/RMAX/DR/RA/RR/RB: ALL ABSENT.
  Scalar candidates AJT/AJT_driven/.../scalars: ALL ABSENT.

Sanity check (NRMAX=10):
  fp stdout: total plasma current [MA] -2.7695E-13
  helper:   compute_rjt_volint = -2.769528e-07 A
                                = -2.769528e-13 MA  (1e-10 match)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(spec): record L-7a R3 (tr driven current param) outcome

PARTIAL PASS: PNBCD is NOT in tr_param_registry. Registered scalar
candidates are PICCD, PECCD, PLHCD (dimensionless current-drive
factors) and PLHTOT (LH input power [MW]). Selected PLHCD as L-7a
skeleton target with documented unit mismatch (Amperes -> dimensionless
factor); rigorous physical conversion deferred to L-7b.

Picked a = tr.RA for compute_rjt_volint(state, R0=tr.RR, a=tr.RA)
because fp's radial mesh is RA-normalized (fpcale.f90:34, 56).

Active-drive fixture for Phase 2.3 equivalence test:
  fp.set_param('E0', 0.001)  # induction E-field [V/m]
yields ~0.946 MA (vs ~1e-13 MA for default), well above the 1e-10
equivalence-test floor. MODEL_WAVE+PABS_LH path failed with rc=3 and
needs additional setup; E0 is the minimum-cost active-drive trigger.

Open issue surfaced: tr lacks a registered scalar that semantically
accepts an external driven current in MA. L-7b should add such a
param (or expose PNBCD via the registry).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(l7a): R4 decision gate — accept skeleton coupling, update spec/plan

R3 confirmed PNBCD is not registered in tr_param_registry.f90; only
PLHCD/PICCD/PECCD (dimensionless multipliers) accept set_param. Per
user decision (option A), L-7a proceeds as API-plumbing skeleton with
PLHCD as the dst_param; physical fidelity (proper EXTERNAL_DRIVEN_I
scalar in tr) is moved to L-7b follow-ups.

Spec updates:
- §2 table: PNBCD -> PLHCD with skeleton-coupling caveat
- §3 non-goals: explicit "physical fidelity is L-7b scope"
- §5.2 CouplingRule: src_state_key now (state, params) 2-arg callable
- §5.3 PipelineResult + new _state_to_scalars adapter (handles
  FpState which lacks the .scalars dict per R1/R2)
- §5.4 TotPipeline: self._params tracking for callable rules
- §6.2 run_pipeline: uses _state_to_scalars + records dst_param
  injections back into _params
- §12 follow-ups: tr EXTERNAL_DRIVEN_I_MA addition + fplib .scalars
  property consideration

Plan updates:
- Task 1.4: __init__ adds self._params; set_param records into it
- Task 1.5: _extract_source 2-arg callable + _state_to_scalars helper
  + 5 new tests (callable+params, set_param tracking, adapter)
- Task 2.1: compute_rjt_volint(state, *, R0, a) signature with the
  R2-derived implementation pasted in
- Task 2.2: COUPLING_RULES uses lambda wrapper pulling tr:RR/tr:RA
  from params; dst_param=PLHCD; documents skeleton nature
- Task 2.3: equivalence test uses E0=0.001 active-drive fixture (R3),
  PLHCD instead of PNBCD, _state_to_scalars adapter, plus a sanity
  guard that the integral magnitude is non-trivial (>= 1e3 A)
- Test count expectations updated throughout

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(totlib): add TotPipeline error hierarchy

Five new exception classes for the L-7a pipeline:
- TotPipelineError (base, extends TotlibError)
- TotPipelineUnknownModuleError, TotPipelineCouplingError,
  TotPipelineLifecycleError
- TotPipelineRunError carrying partial_result + failed_step_index +
  failed_module attributes for debugging mid-pipeline failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(totlib): add CouplingRule/PipelineStep/PipelineResult dataclasses

Pure data carriers for the run_pipeline orchestrator. CouplingRule is
frozen (immutable) so the registry can be hashable in the future.
PipelineResult.last(module) returns the most recent matching step,
to_dict() is the MCP serialization format.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style(totlib): match typing convention from surrounding modules

Code-quality reviewer flagged divergence: pipeline.py used PEP 585
(dict[...], list[...]) while totlib/state.py and sibling state.py
files use Dict[...], List[...] from typing. Match the surrounding
codebase. Also addresses two reviewer nits: explicit "frozen=True
rationale" docstring on CouplingRule, and explicit shallow-ref
caveat docstring on PipelineResult.to_dict.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(totlib): add module registry + lazy import helpers

_MODULE_REGISTRY maps short names (fp/tr/eq/wr/wrx/ti) to
(package, wrapper_class, base_error_class) triples.
_import_wrapper / _import_module_error keep the lazy-load contract
from spec section 5.5: only modules used in the current pipeline
get their lib*.so / errors.py loaded.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(totlib): correct ti wrapper class name + add parametrized registry tests

Code-quality reviewer flagged a coverage gap: only fp was exercised by
the registry-resolves-class tests. Adding parametrized tests across
all 6 modules immediately surfaced a real bug: the ti wrapper class
is `TiLib` (capital L), not `Tilib` as the registry claimed. The
sibling `TilibError` IS lowercase, so ti is the only entry where the
wrapper-class and error-class casing diverge — easy to miss.

Fixed registry entry; added a comment flagging the asymmetry. Spec
and plan updated to match. Now 31/31 pipeline tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(totlib): add TotPipeline lifecycle (init/set_param/close)

Implements the lifecycle subset of TotPipeline:
- __init__ creates empty _modules / _params / _closed=False; opens nothing.
- _ensure_module lazy-instantiates a wrapper on first reference, gated
  by the closed flag.
- set_param routes strings to set_param_str, numerics to set_param,
  and records the value into _params after the wrapper accepts it
  (so failed wrapper validation keeps _params consistent).
- close is idempotent and finalizes all modules even when one raises;
  the first exception is re-raised so the root cause is visible.
- Context manager support via __enter__/__exit__.

run_pipeline / _extract_source / COUPLING_RULES are deferred to Task 1.5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(totlib): guard set_param_str when module wrapper lacks the method

Code-quality reviewer flagged that wrlib.Wrlib, wrxlib.Wrxlib, and
tilib.TiLib do not expose set_param_str (only fp/tr/eq do). Routing
a string value through TotPipeline.set_param previously raised a
bare AttributeError leaking from getattr. Now raises a typed
TotPipelineCouplingError with a clear message identifying the module.
Added a regression test using a mock wrapper without set_param_str.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(totlib): implement TotPipeline.run_pipeline orchestrator

- _state_to_scalars adapter (handles fp's missing .scalars dict + the
  standard .scalars-bearing modules uniformly).
- COUPLING_RULES registry (empty at this commit; ('fp','tr') is filled
  in Phase 2 after R2/R3 outcomes are baked into pipeline.py).
- _validate_steps performs side-effect-free pre-flight checks.
- _extract_source resolves callable src_state_key with (state, params)
  signature; string keys go through _state_to_scalars.
- run_pipeline iterates steps, applies coupling rules between adjacent
  pairs, and wraps per-module errors as TotPipelineRunError with the
  partial PipelineResult attached for debugging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(totlib): export TotPipeline + dataclasses at package level

Backwards-compatible: legacy 'from totlib import Tot' / 'TotState'
imports continue to work. New 'from totlib import TotPipeline,
CouplingRule, PipelineStep, PipelineResult' is added for L-7a users.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(totlib): broaden run_pipeline catch + cover Codex review gaps

Codex independent review (2026-04-28) flagged 1 MED + 2 LOW gaps the
prior reviewer missed:

MED1: run_pipeline's except clause was (base_err, TotPipelineCouplingError),
which let TypeError/ValueError from wrong kwargs escape without
TotPipelineRunError wrapping — so partial_result was lost.
Broadened to `except Exception`. KeyboardInterrupt and SystemExit
still propagate (they are BaseException, not Exception). The
lazy `_import_module_error(name)` call is kept for forward-compat
debugging clarity (annotated noqa).

LOW2: test_set_param_records_into_params_dict only exercised the
numeric path of set_param. Added a string-value entry so the
string path's _params bookkeeping is also verified.

LOW3: test_run_pipeline_partial_failure_carries_partial_result did
not assert on coupling_applied. Added an empty-list assertion
(regression guard against silently dropping coupling metadata).

Plus a new test test_run_pipeline_typeerror_from_bad_kwargs_is_wrapped
that exercises the broadened catch path directly.

61 tests pass under --forked --timeout=120 --timeout-method=signal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(totlib): address PR #178 Bugbot LOW findings

Three independent fixes surfaced by the Cursor Bugbot pass on PR #178:

1. close() no longer silently drops secondary finalize errors
   (pipeline.py:259-269). When more than one module's close() raises,
   the caller now sees an ExceptionGroup carrying every collected
   error instead of just errors[0]. Single-failure path is unchanged
   (the original exception type is preserved). Tested via the new
   test_close_raises_exception_group_when_multiple_modules_fail.

2. set_param's _params bookkeeping no longer diverges from the value
   the wrapper actually received (pipeline.py:235-249). Numeric inputs
   are coerced via float() at the boundary, but _params used to store
   the original (e.g. True or int 2). Coupling rules reading
   self._params now see the same float the underlying module saw.
   Existing test_set_param_records_into_params_dict updated for the
   int → float coercion; new test_set_param_records_bool_as_float
   pins the bool path explicitly.

3. run_pipeline no longer calls _import_module_error per step
   (pipeline.py:329-336). The call had become dead after the broad
   "except Exception" catch in commit 941f353 — base_err was
   F841'd noqa'd, the comment misleadingly claimed "lazy-import
   semantics," and the call still triggered an .errors module import
   per step. _validate_steps already gates on _MODULE_REGISTRY
   membership pre-execution, and _ensure_module raises on bad
   names, so removing the dead call doesn't reduce safety.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(totlib): address PR #178 remaining Bugbot findings (4 items)

Bugbot's second pass on PR #178 surfaced 4 issues that survived the
first round of LOW fixes (e1e5a37). All four addressed in this
commit, with targeted regression tests:

#1 (LOW) PipelineResult.to_dict() now returns defensive copies of
   inner scalars dicts and coupling_applied lists. Previously, MCP-
   side mutation of the JSON payload would propagate into the
   originating PipelineStep. New test_to_dict_returns_defensive_copies
   asserts the contract.

#2 (LOW) _validate_steps now accepts list-of-2-element steps in
   addition to tuple-of-2-element. JSON-decoded MCP payloads
   deserialise tuples as lists, so the API now round-trips through
   to_dict() without manual coercion. New
   test_run_pipeline_accepts_list_shaped_steps regression guard.

#3 (LOW) set_param now explicitly rejects bool inputs with
   TotPipelineCouplingError instead of silently coercing to 1.0/0.0
   via the numeric branch (Python's bool is a subclass of int).
   The previous test_set_param_records_bool_as_float (which
   documented the silent-coercion behavior) is replaced with
   test_set_param_rejects_bool that asserts the explicit rejection.

#4 (MED) ExceptionGroup compatibility shim for Python 3.10. The
   close() multi-failure path uses builtin ExceptionGroup (3.11+);
   tot_mcp's pyproject.toml declares requires-python=">=3.10". The
   shim probes for the builtin first, falls back to the
   exceptiongroup PyPI backport if available, and degrades
   gracefully to raising errors[0] only when neither is reachable
   (same as pre-fix behavior). CI runs 3.11+ so the existing
   ExceptionGroup test still passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
k-yoshimi added a commit that referenced this pull request Apr 29, 2026
Code-review feedback on 8deed08: the handoff stated `768 commits
behind chore` but the actual `git log master..HEAD` count is **755**.
Updated all four occurrences (lines 10, 17, 163, 251) to match.

The "768" figure was a copy from the prior handoff's "765" base plus
+3 PRs guess; the right way is `git log --oneline master..HEAD | wc -l`,
which is 755 today. Future handoffs should re-derive rather than
extrapolate.

(Codex MED #2 about hard-coded `/Users/k-yoshimi/...` paths is
intentionally not addressed: every prior handoff in `.claude/`
follows the same convention, the file is a per-checkout self-
reference, and the user's git config already publishes the same
identity in commit metadata.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
k-yoshimi added a commit that referenced this pull request May 4, 2026
…188)

* docs(spec): add L-7b-ii BPSD broker coupling verification design

Brainstorm-stage spec for L-7b-ii: orchestrator-level verification of
the eq -> tr BPSD coupling. MVP scope ("A-medium"):

  - new non-mutating Fortran helper tr_check_bpsd_pull (4-slot check)
  - C ABI export + Python Trlib.check_bpsd_pull() wrapper
  - CouplingRule extended with kind/verify fields
  - new ("eq","tr") verify rule in COUPLING_RULES
  - 3-layer test strategy (mock dispatch / unit / integration)

Codex independent reviewed in 4 rounds (overall design + sec 6 + sec 7 + sec 8); all
HIGH/MED findings addressed:

  - non-mutating BPSD pull (per-slot ierr accumulation, no TRCOMM mutation)
  - existing TotPipelineCouplingError reused (no new exception class)
  - 3-level exception chain via existing broad except
  - MODELG=0 + eq->tr -> CouplingError as acceptance criterion
  - wr-side BPSD speaker pending evaluation (L-7a divergence-risk
    judgement preserved; re-evaluation trigger documented)

Implementation risks (4) carry explicit fallbacks. Out-of-scope items
explicitly cross-reference L-7a / L-7b-i specs by section name (not
file:line) for churn resilience.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(spec): L-7b-ii self-review fixes (BS7)

  - replace MAX_NRHO placeholder with reference to existing BPSD
    callers' allocation pattern
  - reconcile test case count (12 -> 11; C-2 documented but not
    implemented per drop note)
  - clarify error message format uses callable_repr (with
    repr() fallback for non-qualname callables like partial)

No semantic change; corrects three minor inconsistencies found in
the brainstorming spec self-review pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(spec): L-7b-ii final Codex review fixes

Six fixes from final Codex full-spec review (HIGH 1, MED 3, LOW 2):

  HIGH: verify checks 3 eq-pushed slots only (device/equ1D/metric1D),
        not 4. plasmaf is tr's own BPSD output (tr_bpsd_put), absent
        on a fresh eq->tr pipeline -- including it would cause
        spurious False on first-time pipelines. Updated §1 data flow,
        §2 Fortran helper, §3 Python wrapper docstring, C prototype.

  MED-1: __post_init__ now validates transform is non-None for
        kind=transfer (was missing -- transfer dispatch unconditionally
        calls rule.transform(raw), would crash on None default).

  MED-2: AC6 clarified -- Layer C drop decision is plan-time, not
        post-MVP. Either Layer C is required (Eq wrapper sufficient)
        or removed from MVP scope (insufficient + follow-up PR).
        Removes the 'both required and droppable' ambiguity.

  MED-3: §8.4 risk #5 added: stale BPSD slot data across tests/runs.
        Layer A/B unaffected; Layer C uses --forked or per-test BPSD
        reset for isolation.

  LOW-1: line 76 estimate table 12 -> 11 cases (matches §7 count
        after C-2 drop).

  LOW-2: @DataClass(frozen=True) preserved in §4 dataclass; explicit
        note that __post_init__ raises ValueError only (no field
        mutation) so frozen remains compatible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(spec): L-7b-ii confirmation review fixes

3 missed fixes from confirmation review:

  HIGH (still-broken): 'all four' verbiage at line 126 + plasmaf in
        rule.doc string at lines 247-255 -- both fixed. Remaining
        plasmaf mentions are all explicit 'intentionally excluded'
        rationale (intentional, kept).

  MED-1 (partial): added type-narrowing asserts in §5 dispatch
        snippet -- assert rule.transform is not None / assert
        rule.dst_param is not None -- to satisfy mypy/pyright on the
        Optional[Callable] / Optional[str] fields. __post_init__
        guarantees they hold for kind=transfer.

  MED-2 (partial): §1 in-scope clarified -- Layer A+B unconditional,
        Layer C conditional on plan-time Eq wrapper sufficiency
        check (cross-references AC6 + §8.4 risk #2). Resolves the
        'integration coverage promised but droppable' contradiction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(spec): L-7b-ii final assert narrowing for rule.verify

One-line fix from final confirmation review: §5 dispatch verify
branch was missing 'assert rule.verify is not None' for static
checker narrowing on the Optional[Callable] field. Symmetric to
the transfer-branch asserts already present.

Codex verdict after this fix: READY-FOR-IMPLEMENTATION.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(plan): add L-7b-ii BPSD broker coupling implementation plan

Implementation plan for the L-7b-ii spec
(docs/superpowers/specs/2026-05-03-l7b-ii-bpsd-broker-coupling-design.md).

4-phase commit pattern mirroring L-7b-i (PR #187):

  Phase 1: tr_check_bpsd_pull Fortran helper + C ABI + Python wrapper
  Phase 2: CouplingRule kind/verify extension + dispatch + ('eq','tr') rule
  Phase 3: 11 test cases across 3 layers + README updates
  Phase 4: pre-push gate + PR open + CI/Bugbot wait + squash merge

Plan-time risks resolved at write-time:

  - Eq wrapper sufficient (set_param/set_param_str/run/get_state all
    present in python/eqlib/eqlib.py); Layer C in scope
  - bpsd_get_data is INTENT(INOUT), self-allocates when nrmax=0;
    helper sets local%nrmax=0 explicitly
  - g_initialized/g_prepared lifecycle flags exist in tr/tr_api.f90:64-65

Plan-time risk deferred to implementation:

  - Layer C runtime (< 5s estimate unmeasured); fallback is
    @pytest.mark.slow gating per spec §8.4

Step granularity bite-sized (read context -> edit -> verify -> commit)
per writing-plans skill convention. Suitable for executing-plans or
subagent-driven-development.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(plan): L-7b-ii Codex review fixes

Six fixes from Codex plan review (HIGH 2, MED 3, LOW 1):

  HIGH-1: pre-push pytest commands now include --forked --timeout=120
        --timeout-method=signal per CLAUDE.md non-negotiable. Note
        added on pytest-forked plugin install + documented exception
        when plugin unavailable.

  HIGH-2: reviewer agent type Agent(subagent_type='feature-dev:
        code-reviewer', ...) per CLAUDE.md. Fallback note for
        environments where feature-dev: is unavailable points to
        superpowers:code-reviewer.

  MED-1: Layer C --forked is now required (not advisory). Without
        BPSD broker isolation, C-3 (MODELG=0 expected failure) may
        spuriously pass after C-1 populated slots.

  MED-2: base SHA references updated f3dafae -> bdc0913 (the plan
        commit itself is the new chore tip, so executor will branch
        from it). Phase 0/1.5/2.4/3.6 squash bases corrected.

  MED-3: spec-coverage matrix entry for §7 MODELG fixture corrected
        to acknowledge deviation: plan uses local _eq_params_modelg()
        helper instead of mutating shared tot_demo2014_params.py
        (rationale: avoid Layer 1 baseline side effects). eqdata path
        IS still reused.

  LOW-1: Layer A case count clarified -- 11 logical cases (A-5
        implemented as 3 sub-tests), 13 individual pytest tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(plan): align Phase 4.1 Step 2 --timeout to CLAUDE.md (=120)

Final fix from confirmation review: Phase 4.1 Step 2 used
--timeout=60 (inconsistent with Step 1/3 and CLAUDE.md CI flags).
Aligned to --timeout=120.

Plan now READY-FOR-EXECUTION per Codex.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(plan): align all pytest commands to CLAUDE.md flags

5 additional pytest commands (Tasks 1.4, 2.2, 2.3, 3.1, 3.2 sanity
checks) updated to --forked --timeout=120 --timeout-method=signal,
matching CI workflow and Phase 4 pre-push gate. Per CLAUDE.md
non-negotiable: 'Local test: run the relevant pytest locally with
the SAME flags the CI workflow uses.'

All 9 pytest invocations in the plan now consistent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* tr+trlib: add tr_check_bpsd_pull BPSD broker verification helper (L-7b-ii)

New non-mutating Fortran helper tr_check_bpsd_pull pulls the 3
eq-pushed BPSD slots (device, equ1D, metric1D) into local
discardable types and reports ok = 1 iff all 3 per-slot ierr == 0.
plasmaf is intentionally NOT checked: it is tr's own BPSD output
(tr_bpsd_put), absent on a fresh eq->tr pipeline.

Implementation notes:
- Local types initialized with %nrmax = 0 to trigger BPSD's
  self-allocation path (per ../../bpsd/bpsd_equ1D.f90:143-150,
  bpsd_get_* is INTENT(INOUT) and allocates internally when
  the caller passes nrmax=0).
- Per-slot ierr accumulated to AND-condition (avoids masking
  earlier failures by later successes).
- TRCOMM untouched -- safe to call before or after tr.run().
- TR_STATE_ABI_VERSION not bumped (no struct change).

Exposed via C ABI (tr/tr_api.h) and Python wrapper
(Trlib.check_bpsd_pull). Smoke-tested: fresh Trlib() init
returns False (BPSD slots empty); existing trlib tests unchanged.

Spec: docs/superpowers/specs/2026-05-03-l7b-ii-bpsd-broker-coupling-design.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* pipeline: extend CouplingRule for verify rules (kind/verify dispatch) (L-7b-ii)

CouplingRule dataclass gains two new fields:

  kind: str = "transfer"   # "transfer" | "verify"
  verify: Optional[Callable[[Any], bool]] = None

Existing transfer-rule fields (src_state_key, dst_param, transform)
are defaulted to None and validated by __post_init__:

  * kind="transfer" requires all three transfer fields
  * kind="verify"   requires verify callable
  * unknown kind raises ValueError

@DataClass(frozen=True) preserved; __post_init__ only raises and
never mutates self, so frozen remains compatible.

run_pipeline rule iteration gains a kind branch:

  * transfer: existing extract -> transform -> set_param -> record
  * verify:   ok = rule.verify(curr_inst); raise CouplingError if
              False or if verify itself raised.
              applied.append only on success (matches transfer-rule
              "succeeded snapshot" semantics of PipelineStep.coupling_applied)

Both error paths flow through the existing broad except at
pipeline.py and become TotPipelineRunError with __cause__ chain
(3-level: RunError -> CouplingError -> OriginalError if any).

Scope note (L-7b-ii deferred at registry level):

  The original spec proposed registering an ('eq','tr') verify rule
  using BPSD as a cross-module broker (Trlib.check_bpsd_pull). During
  implementation we found that libeqapi.so and libtrapi.so each carry
  a private copy of the BPSD module-level state -- ___bpsd_equ1d_MOD_
  equ1dx is static (s) in nm output of both shared objects. Therefore
  bpsd_put from libeqapi.so does NOT propagate to libtrapi.so's
  bpsd_get, and the verify rule cannot succeed under the per-module
  TotPipeline architecture. The legacy Tot / libtotapi.so path keeps
  eq+tr+bpsd co-linked and is unaffected.

  Until a cross-.so sharing scheme (unified .so, IPC, or RTLD_GLOBAL
  with weak symbols) is decided, no ('eq','tr') rule is registered.
  The verify dispatch infrastructure is in place and mock-tested
  (Layer A) so that the rule can be added later without further
  changes to pipeline.py.

One existing test (test_run_pipeline_missing_source_state_raises_
coupling_error) was updated to pass an explicit transform=lambda v: v
to satisfy the new __post_init__ contract; it had previously
relied on the old identity default.

Existing ("fp","tr") transfer rule and Layer 1 baselines (demo2014,
ht6m at 1e-10) are unaffected.

Spec: docs/superpowers/specs/2026-05-03-l7b-ii-bpsd-broker-coupling-design.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test+docs: Layer A/B BPSD coupling verify tests + READMEs (L-7b-ii)

Adds 11 new test cases across 2 layers (Layer C deferred — see below):

  Layer A (test_pipeline_verify.py, no .so): 8 mock-based dispatch
    tests covering verify-True success, verify-False raise+cause-chain,
    verify-raise 3-level chain, mixed transfer+verify ordering,
    __post_init__ validation (3 cases), unregistered-pair silent skip.

  Layer B (test_bpsd_check.py, libtrapi.so): 3 unit tests for
    Trlib.check_bpsd_pull() -- fresh init returns False, closed
    Trlib raises TrlibError, no exception leak from Fortran.

Also updates test_pipeline_dataclasses::test_coupling_rule_defaults
to pass an explicit transform=lambda v: v: the previous version
relied on the old default `transform = lambda v: v`, which is no
longer a default per the new __post_init__ contract introduced in
the previous commit.

Layer C (eq + tr integration) deferred:

  実装中の調査で `libeqapi.so` と `libtrapi.so` が独立した .so で、
  それぞれが BPSD module-level state (`___bpsd_equ1d_MOD_equ1dx` ほか)
  を private に持つことが判明 (nm で確認、両 .so で static linkage)。
  spec が想定した「BPSD = eq/tr 共有ブローカー」は
  `Tot` / `libtotapi.so` (eq+tr+bpsd 同梱) では成立するが、
  `TotPipeline` (per-module .so) では成立しない。
  したがって ('eq','tr') verify ルールは本 PR では登録せず、
  Layer C 統合テストも保留 (実行しても push が反映されないので、
  C-1 は構造的にパスしえず、C-3 は誤った理由で pass する)。

  verify ディスパッチの機構自体は本 PR で完成済み (Layer A 網羅)
  なので、共有方針が決まり次第 ('eq','tr') ルールは
  COUPLING_RULES に 1 行追加するだけで有効化できる。
  詳細は totlib/README.md の Coupling rules 節と
  trlib/README.md の "BPSD ブローカー pull 検証" 節を参照。

README updates:

  python/totlib/README.md: documents the kind="transfer" / "verify"
    distinction in the Coupling rules section, plus a Japanese-language
    deferral note explaining the per-.so BPSD isolation finding.

  python/trlib/README.md: new "BPSD ブローカー pull 検証" section
    (Japanese) documenting Trlib.check_bpsd_pull() with usage example
    and the same scope caveat about per-.so isolation.

Existing Layer 1 baselines (demo2014, ht6m at 1e-10) and existing
pipeline tests are unaffected (the verify dispatch doesn't fire when
no rule is registered for the pair).

Spec: docs/superpowers/specs/2026-05-03-l7b-ii-bpsd-broker-coupling-design.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(L-7b-ii): address reviewer findings (MED + LOW polish)

Codex MED1 (older .so compatibility): wrap tr_check_bpsd_pull
ctypes prototype attachment in try/except, mirroring the existing
tr_validate pattern in _ffi.py. Older builds of libtrapi.so that
predate L-7b-ii will now defer the AttributeError to first call of
Trlib.check_bpsd_pull instead of failing at module import.

Codex MED3 (chain-depth accuracy): totlib/README's "3-level cause
chain" was inaccurate for the False-return path. Clarify that
False produces a 2-level chain (TotPipelineRunError ->
TotPipelineCouplingError, no original cause), while a verify()
exception produces the 3-level chain (... -> original exception).

In-house L-1 (docstring count): test_pipeline_verify.py header
said "Six cases" but defines 8 test functions because A-5 expands
into three separate validation tests. Updated the docstring to
clearly distinguish "six logical cases" from "eight test
functions" with a per-case description.

Codex MED2 (spec staleness): added §0 (in Japanese, matching
totlib/trlib README convention) at the top of the spec doc
recording the implementation-time deferral, summarising what
shipped vs deferred and listing candidate resolution paths
(shared .so / IPC / RTLD_GLOBAL+weak / serialisation). Status
field changed from "Draft" to "Partial".

Verification:
- Layer A (8) + Layer B (3) + dataclasses (8) + pipeline (30) +
  equivalence (2) = 51 passed; 1 pre-existing Py3.10
  ExceptionGroup test still fails (documented in plan).
- Trlib() smoke load + check_bpsd_pull still returns False
  on fresh init.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant