test(tr): Phase 0 - regression test infrastructure by k-yoshimi · Pull Request #2 · k-yoshimi/task

k-yoshimi · 2026-04-17T23:16:34Z

Stacked on #1. After #1 merges, the base of this PR should be changed to `develop`:

gh pr edit 2 --repo k-yoshimi/task --base develop

Summary

TASK/TR モジュールのリファクタリングに先立ち、数値挙動を固定する回帰テスト基盤を `test_run/` に整備する Phase 0 の実装。

`tr/trregress.f90` を追加（環境変数 `TR_REGRESS_DUMP=1` のときのみ `tr_regress.dat` を `1PE24.16` 書式で書き出す）
`test_run/scripts/` に dump → JSON 抽出・比較スクリプト + unittest
`test_run/inputs/` に `tr_m0904` と `tr_tst2` の短縮入力を追加
`test_run/baselines/` に 3 ケース分のゴールデン指標
`run_tests.sh` に TR モジュール実行時の `TR_REGRESS_DUMP=1` エクスポートと回帰判定を統合

設計書と実装計画

設計書: `docs/superpowers/specs/2026-04-17-tr-refactoring-design.md`
実装計画: `docs/superpowers/plans/2026-04-17-tr-refactoring-phase0.md`

設計方針は Option B（構造＋危険箇所のみの限定的近代化）、既存開発者の参画しやすさを優先。

回帰判定の仕組み

`run_tests.sh` は TR モジュールに対してだけ `TR_REGRESS_DUMP=1` をエクスポートして `tr/tr2` を実行
CLOSED メッセージで計算成功を判定
`scripts/check_regression.sh` が `tr_regress.dat` → JSON → `baselines/` と相対誤差 `1e-10` で比較
不一致なら `REGRESSION` で FAIL

通常実行への影響

なし。 `TR_REGRESS_DUMP` 未設定時は `tr_regress_dump_if_enabled` が即 `RETURN` するため、従来の挙動と完全に同一。

変更の要約（15 コミット）

#	内容
1	設計書＋実装計画を追加
2-3	Phase 0 作業開始マーカー & 戦略ロック（空コミット）
4	`tr/trregress.f90` 追加（本 PR の中核）
5-6	`extract_tr_metrics.py` と pytest→unittest 移植
7	`compare_metrics.py`（許容誤差 1e-10）
8-9	M0904, TST-2 の入力ケース
10	`check_regression.sh` ラッパ
11	`test_definitions.conf` に 2 ケース追加
12	`run_tests.sh` に `TR_REGRESS_DUMP=1` + 回帰判定を統合
13	3 ケース分のベースライン JSON
14	README
15	`.gitignore` 調整

Test plan

通常実行（`TR_REGRESS_DUMP` 未設定）で `tr_regress.dat` が生成されないことを確認
`./run_tests.sh tr_iter01 tr_m0904 tr_tst2` が 3 連続で全 PASS
同一入力の run-to-run で dump が bit-exact 一致
Python の単体テスト 8 件全 PASS（`python3 -m unittest discover test_run/scripts/tests`）
RKEV 定数の 0.05% 改変で 207 件の mismatch を検出・REGRESSION 報告を確認
リストア後に全 PASS に復帰

pytest → unittest について

設計書と実装計画では pytest を想定していたが、対象環境に pytest が未インストールのため stdlib `unittest` に移植（コミット a5a6096）。機能的な差異はなく、`python3 -m unittest discover` で同等のテストを実行できる。

次のフェーズ

本 PR 後、Phase 1（`USE TRCOMM` への ONLY 句付与）から順次別 PR で進める予定。

🤖 Generated with Claude Code

Note

Medium Risk
Adds a new TR codepath (file dump) and wires numerical comparisons into the test runner; while gated by TR_REGRESS_DUMP, failures can now come from baseline drift/tolerance and the dump hook runs on all TR test executions.

Overview
Adds Phase 0 TR regression testing infrastructure by introducing tr/trregress.f90 and a one-line hook in trloop.f90 (plus tr/Makefile update) to write tr_regress.dat in a fixed high-precision format when TR_REGRESS_DUMP=1.

Extends test_run to enforce numeric stability: run_tests.sh now sets TR_REGRESS_DUMP=1 for TR runs and, after CLOSED, calls a new scripts/check_regression.sh wrapper that extracts metrics (extract_tr_metrics.py) and compares them to committed baselines (compare_metrics.py, default rel tol 1e-10), reporting REGRESSION on mismatch.

Adds two new TR input cases (tr_m0904, tr_tst2) to test_definitions.conf, commits baseline metrics.json for tr_iter01, tr_m0904, and tr_tst2, and documents the workflow in test_run/README.md (with test_output/ and Python caches ignored via test_run/.gitignore).

^{Reviewed by Cursor Bugbot for commit fe94c1e. Bugbot is set up for automated code reviews on this repo. Configure here.}

- specs: TR module refactoring design (Option B: structure + danger-spot modernization) - plans: Phase 0 regression test infrastructure with env-guarded high-precision dump - Parked on feature/tr-regression-phase0-docs pending dealloc cleanup merge Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add trregress module providing tr_regress_dump_if_enabled. When the environment variable TR_REGRESS_DUMP=1 is set, tr_loop writes key global quantities (WPT, AJT, Q0, BETAn, TAUEn, ZEFF0, ALI, RQ1) and radial profiles (RN, RT, AJ, QP) to tr_regress.dat in 1PE24.16 format at the end of the transport calculation loop. When the variable is unset, the routine returns immediately so normal runs are unaffected. This is Phase 0 Task 3 of the TR regression test infrastructure. Run-to-run bit-exact reproducibility verified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pytest is not installed on the target environment. Switch to unittest so the regression suite runs on a vanilla Python 3 install.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ests.sh

Baselines captured with: - tr/tr2 built at commit a361df2 with TR_REGRESS_DUMP=1 env-guarded dump - gfortran OFLAGS = -g -O3 -m64 -std=legacy - Verified run-to-run reproducible within 1e-10 tolerance across 3 consecutive runs.

k-yoshimi · 2026-04-17T23:43:39Z

@cursor review

… passing Cursor Bugbot flagged that `_rel_err(41.13, inf)` evaluates to NaN (`inf / inf`) and `NaN > tol` is False, so a regression that produces `inf` would be silently accepted. Guard `_check_scalar` with `math.isinf` the same way `math.isnan` is guarded: identical infinities still match, but any finite-vs-inf or +inf/-inf mismatch is reported. Add three unittest cases covering actual-is-inf, baseline-is-inf, and matching-infs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

k-yoshimi · 2026-04-18T04:20:09Z

Cursor Bugbot の指摘に対応しました。

Fix: compare_metrics.py:_check_scalar に math.isinf ガードを追加し、finite-vs-inf / 符号違い inf を検知するようにしました。同値 inf（baseline=actual=inf）は従来通り match 扱いです。

Tests: test_compare_metrics.py に 3 ケース追加（test_fails_when_actual_is_inf / test_fails_when_baseline_is_inf / test_passes_when_both_are_same_inf）。unittest 11 件全 PASS。

@cursor review

k-yoshimi · 2026-04-18T04:50:07Z

@cursor review

cursor

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

^{Reviewed by Cursor Bugbot for commit fe94c1e. Configure here.}

Bugbot's second pass on PR #178 surfaced 4 issues that survived the first round of LOW fixes (e1e5a37). All four addressed in this commit, with targeted regression tests: #1 (LOW) PipelineResult.to_dict() now returns defensive copies of inner scalars dicts and coupling_applied lists. Previously, MCP- side mutation of the JSON payload would propagate into the originating PipelineStep. New test_to_dict_returns_defensive_copies asserts the contract. #2 (LOW) _validate_steps now accepts list-of-2-element steps in addition to tuple-of-2-element. JSON-decoded MCP payloads deserialise tuples as lists, so the API now round-trips through to_dict() without manual coercion. New test_run_pipeline_accepts_list_shaped_steps regression guard. #3 (LOW) set_param now explicitly rejects bool inputs with TotPipelineCouplingError instead of silently coercing to 1.0/0.0 via the numeric branch (Python's bool is a subclass of int). The previous test_set_param_records_bool_as_float (which documented the silent-coercion behavior) is replaced with test_set_param_rejects_bool that asserts the explicit rejection. #4 (MED) ExceptionGroup compatibility shim for Python 3.10. The close() multi-failure path uses builtin ExceptionGroup (3.11+); tot_mcp's pyproject.toml declares requires-python=">=3.10". The shim probes for the builtin first, falls back to the exceptiongroup PyPI backport if available, and degrades gracefully to raising errors[0] only when neither is reachable (same as pre-fix behavior). CI runs 3.11+ so the existing ExceptionGroup test still passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(spec): add L-7a cross-module coupling design Design for incremental Python-side scalar coupling between fp and tr in the tot orchestrator. Establishes TotPipeline class skeleton in a new file (python/totlib/pipeline.py) without touching the legacy Tot wrapper. Covers fp -> tr driven current (RJT volume integral -> PNBCD scalar) only; profile/coordinate transforms deferred to L-7b via BPSD. Two Codex review passes: 9 findings (HIGH 2 / MED 5 / LOW 1) all addressed inline. Pre-flight Research R1 (singleton coexistence + finalize/init state-leak) / R2 (rjt_volint extraction) / R3 (tr param name confirmation) gated before implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(plan): add L-7a cross-module coupling implementation plan Bite-sized TDD task list executing the L-7a design spec (0a12bd0) across 4 phases / 16 tasks / 3 sub-PRs: - Phase 0: Pre-flight research R1/R2/R3 (singleton coexistence, rjt_volint extraction, tr driven-current param confirmation) - Phase 1 (PR 1): TotPipeline class + errors + mock unit tests - Phase 2 (PR 2): compute_rjt_volint helper + COUPLING_RULES population + fp -> tr 1e-10 equivalence test - Phase 3 (PR 3): run_pipeline MCP tool + docs Each PR phase ends with the CLAUDE.md pre-push gate (code-reviewer + codex:codex-rescue + REVIEW_OK marker + --forked --timeout=120 --timeout-method=signal pytest). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(spec): record L-7a R1 (singleton coexistence) outcome R1-a (concurrent live) and R1-b (re-init state leak) verified per the L-7a design spec. Outcome and fallback decision recorded. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(spec): record L-7a R2 (rjt_volint extraction) outcome R2 PASS. compute_rjt_volint(state, *, R0, a) verified against fp internal aggregate rtotalIP at 1e-10 (sanity case NRMAX=10, RR=3, RA=1, default fixture: helper -2.769528e-13 MA == fp stdout -2.7695E-13 MA). FpState shape characterized: top-level dataclass attrs [nrmax, nsamax, npmax, nthmax, ntg2, timefp, RNT/RWT/RTT/RJT/RPCT/RPWT]; **no `.scalars` dict** (asymmetric to TrState — R1 concern #3 confirmed). FpState exposes no grid info (RG/DV/VOLR/RA/RR absent), so helper takes R0, a as kwargs and assumes uniform rho-grid in [0,1] (matches fp/fpprep.f90:97-99). RJT units: [MA/m^2] (matches RJS in fp/fpcale.f90:290 where RJ_P = RJS*1.D6 converts to A/m^2). Helper applies *1e6 internally and returns total I in [A]. fplib unchanged — helper lives in totlib/pipeline.py (Phase 2 Task 2.1). Probe stdout (truncated): FpState attributes: RJT/RNT/RPCT/RPWT/RTT/RWT (list[list[float]]), npmax=50, nrmax=1, nsamax=1, ntg2=2, nthmax=50, timefp=0.01. Grid candidates RG/DV/VOLP/RM/RMAX/DR/RA/RR/RB: ALL ABSENT. Scalar candidates AJT/AJT_driven/.../scalars: ALL ABSENT. Sanity check (NRMAX=10): fp stdout: total plasma current [MA] -2.7695E-13 helper: compute_rjt_volint = -2.769528e-07 A = -2.769528e-13 MA (1e-10 match) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(spec): record L-7a R3 (tr driven current param) outcome PARTIAL PASS: PNBCD is NOT in tr_param_registry. Registered scalar candidates are PICCD, PECCD, PLHCD (dimensionless current-drive factors) and PLHTOT (LH input power [MW]). Selected PLHCD as L-7a skeleton target with documented unit mismatch (Amperes -> dimensionless factor); rigorous physical conversion deferred to L-7b. Picked a = tr.RA for compute_rjt_volint(state, R0=tr.RR, a=tr.RA) because fp's radial mesh is RA-normalized (fpcale.f90:34, 56). Active-drive fixture for Phase 2.3 equivalence test: fp.set_param('E0', 0.001) # induction E-field [V/m] yields ~0.946 MA (vs ~1e-13 MA for default), well above the 1e-10 equivalence-test floor. MODEL_WAVE+PABS_LH path failed with rc=3 and needs additional setup; E0 is the minimum-cost active-drive trigger. Open issue surfaced: tr lacks a registered scalar that semantically accepts an external driven current in MA. L-7b should add such a param (or expose PNBCD via the registry). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(l7a): R4 decision gate — accept skeleton coupling, update spec/plan R3 confirmed PNBCD is not registered in tr_param_registry.f90; only PLHCD/PICCD/PECCD (dimensionless multipliers) accept set_param. Per user decision (option A), L-7a proceeds as API-plumbing skeleton with PLHCD as the dst_param; physical fidelity (proper EXTERNAL_DRIVEN_I scalar in tr) is moved to L-7b follow-ups. Spec updates: - §2 table: PNBCD -> PLHCD with skeleton-coupling caveat - §3 non-goals: explicit "physical fidelity is L-7b scope" - §5.2 CouplingRule: src_state_key now (state, params) 2-arg callable - §5.3 PipelineResult + new _state_to_scalars adapter (handles FpState which lacks the .scalars dict per R1/R2) - §5.4 TotPipeline: self._params tracking for callable rules - §6.2 run_pipeline: uses _state_to_scalars + records dst_param injections back into _params - §12 follow-ups: tr EXTERNAL_DRIVEN_I_MA addition + fplib .scalars property consideration Plan updates: - Task 1.4: __init__ adds self._params; set_param records into it - Task 1.5: _extract_source 2-arg callable + _state_to_scalars helper + 5 new tests (callable+params, set_param tracking, adapter) - Task 2.1: compute_rjt_volint(state, *, R0, a) signature with the R2-derived implementation pasted in - Task 2.2: COUPLING_RULES uses lambda wrapper pulling tr:RR/tr:RA from params; dst_param=PLHCD; documents skeleton nature - Task 2.3: equivalence test uses E0=0.001 active-drive fixture (R3), PLHCD instead of PNBCD, _state_to_scalars adapter, plus a sanity guard that the integral magnitude is non-trivial (>= 1e3 A) - Test count expectations updated throughout Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(totlib): add TotPipeline error hierarchy Five new exception classes for the L-7a pipeline: - TotPipelineError (base, extends TotlibError) - TotPipelineUnknownModuleError, TotPipelineCouplingError, TotPipelineLifecycleError - TotPipelineRunError carrying partial_result + failed_step_index + failed_module attributes for debugging mid-pipeline failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(totlib): add CouplingRule/PipelineStep/PipelineResult dataclasses Pure data carriers for the run_pipeline orchestrator. CouplingRule is frozen (immutable) so the registry can be hashable in the future. PipelineResult.last(module) returns the most recent matching step, to_dict() is the MCP serialization format. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * style(totlib): match typing convention from surrounding modules Code-quality reviewer flagged divergence: pipeline.py used PEP 585 (dict[...], list[...]) while totlib/state.py and sibling state.py files use Dict[...], List[...] from typing. Match the surrounding codebase. Also addresses two reviewer nits: explicit "frozen=True rationale" docstring on CouplingRule, and explicit shallow-ref caveat docstring on PipelineResult.to_dict. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(totlib): add module registry + lazy import helpers _MODULE_REGISTRY maps short names (fp/tr/eq/wr/wrx/ti) to (package, wrapper_class, base_error_class) triples. _import_wrapper / _import_module_error keep the lazy-load contract from spec section 5.5: only modules used in the current pipeline get their lib*.so / errors.py loaded. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(totlib): correct ti wrapper class name + add parametrized registry tests Code-quality reviewer flagged a coverage gap: only fp was exercised by the registry-resolves-class tests. Adding parametrized tests across all 6 modules immediately surfaced a real bug: the ti wrapper class is `TiLib` (capital L), not `Tilib` as the registry claimed. The sibling `TilibError` IS lowercase, so ti is the only entry where the wrapper-class and error-class casing diverge — easy to miss. Fixed registry entry; added a comment flagging the asymmetry. Spec and plan updated to match. Now 31/31 pipeline tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(totlib): add TotPipeline lifecycle (init/set_param/close) Implements the lifecycle subset of TotPipeline: - __init__ creates empty _modules / _params / _closed=False; opens nothing. - _ensure_module lazy-instantiates a wrapper on first reference, gated by the closed flag. - set_param routes strings to set_param_str, numerics to set_param, and records the value into _params after the wrapper accepts it (so failed wrapper validation keeps _params consistent). - close is idempotent and finalizes all modules even when one raises; the first exception is re-raised so the root cause is visible. - Context manager support via __enter__/__exit__. run_pipeline / _extract_source / COUPLING_RULES are deferred to Task 1.5. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(totlib): guard set_param_str when module wrapper lacks the method Code-quality reviewer flagged that wrlib.Wrlib, wrxlib.Wrxlib, and tilib.TiLib do not expose set_param_str (only fp/tr/eq do). Routing a string value through TotPipeline.set_param previously raised a bare AttributeError leaking from getattr. Now raises a typed TotPipelineCouplingError with a clear message identifying the module. Added a regression test using a mock wrapper without set_param_str. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(totlib): implement TotPipeline.run_pipeline orchestrator - _state_to_scalars adapter (handles fp's missing .scalars dict + the standard .scalars-bearing modules uniformly). - COUPLING_RULES registry (empty at this commit; ('fp','tr') is filled in Phase 2 after R2/R3 outcomes are baked into pipeline.py). - _validate_steps performs side-effect-free pre-flight checks. - _extract_source resolves callable src_state_key with (state, params) signature; string keys go through _state_to_scalars. - run_pipeline iterates steps, applies coupling rules between adjacent pairs, and wraps per-module errors as TotPipelineRunError with the partial PipelineResult attached for debugging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(totlib): export TotPipeline + dataclasses at package level Backwards-compatible: legacy 'from totlib import Tot' / 'TotState' imports continue to work. New 'from totlib import TotPipeline, CouplingRule, PipelineStep, PipelineResult' is added for L-7a users. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(totlib): broaden run_pipeline catch + cover Codex review gaps Codex independent review (2026-04-28) flagged 1 MED + 2 LOW gaps the prior reviewer missed: MED1: run_pipeline's except clause was (base_err, TotPipelineCouplingError), which let TypeError/ValueError from wrong kwargs escape without TotPipelineRunError wrapping — so partial_result was lost. Broadened to `except Exception`. KeyboardInterrupt and SystemExit still propagate (they are BaseException, not Exception). The lazy `_import_module_error(name)` call is kept for forward-compat debugging clarity (annotated noqa). LOW2: test_set_param_records_into_params_dict only exercised the numeric path of set_param. Added a string-value entry so the string path's _params bookkeeping is also verified. LOW3: test_run_pipeline_partial_failure_carries_partial_result did not assert on coupling_applied. Added an empty-list assertion (regression guard against silently dropping coupling metadata). Plus a new test test_run_pipeline_typeerror_from_bad_kwargs_is_wrapped that exercises the broadened catch path directly. 61 tests pass under --forked --timeout=120 --timeout-method=signal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(totlib): address PR #178 Bugbot LOW findings Three independent fixes surfaced by the Cursor Bugbot pass on PR #178: 1. close() no longer silently drops secondary finalize errors (pipeline.py:259-269). When more than one module's close() raises, the caller now sees an ExceptionGroup carrying every collected error instead of just errors[0]. Single-failure path is unchanged (the original exception type is preserved). Tested via the new test_close_raises_exception_group_when_multiple_modules_fail. 2. set_param's _params bookkeeping no longer diverges from the value the wrapper actually received (pipeline.py:235-249). Numeric inputs are coerced via float() at the boundary, but _params used to store the original (e.g. True or int 2). Coupling rules reading self._params now see the same float the underlying module saw. Existing test_set_param_records_into_params_dict updated for the int → float coercion; new test_set_param_records_bool_as_float pins the bool path explicitly. 3. run_pipeline no longer calls _import_module_error per step (pipeline.py:329-336). The call had become dead after the broad "except Exception" catch in commit 941f353 — base_err was F841'd noqa'd, the comment misleadingly claimed "lazy-import semantics," and the call still triggered an .errors module import per step. _validate_steps already gates on _MODULE_REGISTRY membership pre-execution, and _ensure_module raises on bad names, so removing the dead call doesn't reduce safety. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(totlib): address PR #178 remaining Bugbot findings (4 items) Bugbot's second pass on PR #178 surfaced 4 issues that survived the first round of LOW fixes (e1e5a37). All four addressed in this commit, with targeted regression tests: #1 (LOW) PipelineResult.to_dict() now returns defensive copies of inner scalars dicts and coupling_applied lists. Previously, MCP- side mutation of the JSON payload would propagate into the originating PipelineStep. New test_to_dict_returns_defensive_copies asserts the contract. #2 (LOW) _validate_steps now accepts list-of-2-element steps in addition to tuple-of-2-element. JSON-decoded MCP payloads deserialise tuples as lists, so the API now round-trips through to_dict() without manual coercion. New test_run_pipeline_accepts_list_shaped_steps regression guard. #3 (LOW) set_param now explicitly rejects bool inputs with TotPipelineCouplingError instead of silently coercing to 1.0/0.0 via the numeric branch (Python's bool is a subclass of int). The previous test_set_param_records_bool_as_float (which documented the silent-coercion behavior) is replaced with test_set_param_rejects_bool that asserts the explicit rejection. #4 (MED) ExceptionGroup compatibility shim for Python 3.10. The close() multi-failure path uses builtin ExceptionGroup (3.11+); tot_mcp's pyproject.toml declares requires-python=">=3.10". The shim probes for the builtin first, falls back to the exceptiongroup PyPI backport if available, and degrades gracefully to raising errors[0] only when neither is reachable (same as pre-fix behavior). CI runs 3.11+ so the existing ExceptionGroup test still passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Code-review feedback on 8deed08: the handoff stated `768 commits behind chore` but the actual `git log master..HEAD` count is **755**. Updated all four occurrences (lines 10, 17, 163, 251) to match. The "768" figure was a copy from the prior handoff's "765" base plus +3 PRs guess; the right way is `git log --oneline master..HEAD | wc -l`, which is 755 today. Future handoffs should re-derive rather than extrapolate. (Codex MED #2 about hard-coded `/Users/k-yoshimi/...` paths is intentionally not addressed: every prior handoff in `.claude/` follows the same convention, the file is a per-checkout self- reference, and the user's git config already publishes the same identity in commit metadata.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…188) * docs(spec): add L-7b-ii BPSD broker coupling verification design Brainstorm-stage spec for L-7b-ii: orchestrator-level verification of the eq -> tr BPSD coupling. MVP scope ("A-medium"): - new non-mutating Fortran helper tr_check_bpsd_pull (4-slot check) - C ABI export + Python Trlib.check_bpsd_pull() wrapper - CouplingRule extended with kind/verify fields - new ("eq","tr") verify rule in COUPLING_RULES - 3-layer test strategy (mock dispatch / unit / integration) Codex independent reviewed in 4 rounds (overall design + sec 6 + sec 7 + sec 8); all HIGH/MED findings addressed: - non-mutating BPSD pull (per-slot ierr accumulation, no TRCOMM mutation) - existing TotPipelineCouplingError reused (no new exception class) - 3-level exception chain via existing broad except - MODELG=0 + eq->tr -> CouplingError as acceptance criterion - wr-side BPSD speaker pending evaluation (L-7a divergence-risk judgement preserved; re-evaluation trigger documented) Implementation risks (4) carry explicit fallbacks. Out-of-scope items explicitly cross-reference L-7a / L-7b-i specs by section name (not file:line) for churn resilience. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(spec): L-7b-ii self-review fixes (BS7) - replace MAX_NRHO placeholder with reference to existing BPSD callers' allocation pattern - reconcile test case count (12 -> 11; C-2 documented but not implemented per drop note) - clarify error message format uses callable_repr (with repr() fallback for non-qualname callables like partial) No semantic change; corrects three minor inconsistencies found in the brainstorming spec self-review pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(spec): L-7b-ii final Codex review fixes Six fixes from final Codex full-spec review (HIGH 1, MED 3, LOW 2): HIGH: verify checks 3 eq-pushed slots only (device/equ1D/metric1D), not 4. plasmaf is tr's own BPSD output (tr_bpsd_put), absent on a fresh eq->tr pipeline -- including it would cause spurious False on first-time pipelines. Updated §1 data flow, §2 Fortran helper, §3 Python wrapper docstring, C prototype. MED-1: __post_init__ now validates transform is non-None for kind=transfer (was missing -- transfer dispatch unconditionally calls rule.transform(raw), would crash on None default). MED-2: AC6 clarified -- Layer C drop decision is plan-time, not post-MVP. Either Layer C is required (Eq wrapper sufficient) or removed from MVP scope (insufficient + follow-up PR). Removes the 'both required and droppable' ambiguity. MED-3: §8.4 risk #5 added: stale BPSD slot data across tests/runs. Layer A/B unaffected; Layer C uses --forked or per-test BPSD reset for isolation. LOW-1: line 76 estimate table 12 -> 11 cases (matches §7 count after C-2 drop). LOW-2: @DataClass(frozen=True) preserved in §4 dataclass; explicit note that __post_init__ raises ValueError only (no field mutation) so frozen remains compatible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(spec): L-7b-ii confirmation review fixes 3 missed fixes from confirmation review: HIGH (still-broken): 'all four' verbiage at line 126 + plasmaf in rule.doc string at lines 247-255 -- both fixed. Remaining plasmaf mentions are all explicit 'intentionally excluded' rationale (intentional, kept). MED-1 (partial): added type-narrowing asserts in §5 dispatch snippet -- assert rule.transform is not None / assert rule.dst_param is not None -- to satisfy mypy/pyright on the Optional[Callable] / Optional[str] fields. __post_init__ guarantees they hold for kind=transfer. MED-2 (partial): §1 in-scope clarified -- Layer A+B unconditional, Layer C conditional on plan-time Eq wrapper sufficiency check (cross-references AC6 + §8.4 risk #2). Resolves the 'integration coverage promised but droppable' contradiction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(spec): L-7b-ii final assert narrowing for rule.verify One-line fix from final confirmation review: §5 dispatch verify branch was missing 'assert rule.verify is not None' for static checker narrowing on the Optional[Callable] field. Symmetric to the transfer-branch asserts already present. Codex verdict after this fix: READY-FOR-IMPLEMENTATION. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(plan): add L-7b-ii BPSD broker coupling implementation plan Implementation plan for the L-7b-ii spec (docs/superpowers/specs/2026-05-03-l7b-ii-bpsd-broker-coupling-design.md). 4-phase commit pattern mirroring L-7b-i (PR #187): Phase 1: tr_check_bpsd_pull Fortran helper + C ABI + Python wrapper Phase 2: CouplingRule kind/verify extension + dispatch + ('eq','tr') rule Phase 3: 11 test cases across 3 layers + README updates Phase 4: pre-push gate + PR open + CI/Bugbot wait + squash merge Plan-time risks resolved at write-time: - Eq wrapper sufficient (set_param/set_param_str/run/get_state all present in python/eqlib/eqlib.py); Layer C in scope - bpsd_get_data is INTENT(INOUT), self-allocates when nrmax=0; helper sets local%nrmax=0 explicitly - g_initialized/g_prepared lifecycle flags exist in tr/tr_api.f90:64-65 Plan-time risk deferred to implementation: - Layer C runtime (< 5s estimate unmeasured); fallback is @pytest.mark.slow gating per spec §8.4 Step granularity bite-sized (read context -> edit -> verify -> commit) per writing-plans skill convention. Suitable for executing-plans or subagent-driven-development. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(plan): L-7b-ii Codex review fixes Six fixes from Codex plan review (HIGH 2, MED 3, LOW 1): HIGH-1: pre-push pytest commands now include --forked --timeout=120 --timeout-method=signal per CLAUDE.md non-negotiable. Note added on pytest-forked plugin install + documented exception when plugin unavailable. HIGH-2: reviewer agent type Agent(subagent_type='feature-dev: code-reviewer', ...) per CLAUDE.md. Fallback note for environments where feature-dev: is unavailable points to superpowers:code-reviewer. MED-1: Layer C --forked is now required (not advisory). Without BPSD broker isolation, C-3 (MODELG=0 expected failure) may spuriously pass after C-1 populated slots. MED-2: base SHA references updated f3dafae -> bdc0913 (the plan commit itself is the new chore tip, so executor will branch from it). Phase 0/1.5/2.4/3.6 squash bases corrected. MED-3: spec-coverage matrix entry for §7 MODELG fixture corrected to acknowledge deviation: plan uses local _eq_params_modelg() helper instead of mutating shared tot_demo2014_params.py (rationale: avoid Layer 1 baseline side effects). eqdata path IS still reused. LOW-1: Layer A case count clarified -- 11 logical cases (A-5 implemented as 3 sub-tests), 13 individual pytest tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(plan): align Phase 4.1 Step 2 --timeout to CLAUDE.md (=120) Final fix from confirmation review: Phase 4.1 Step 2 used --timeout=60 (inconsistent with Step 1/3 and CLAUDE.md CI flags). Aligned to --timeout=120. Plan now READY-FOR-EXECUTION per Codex. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(plan): align all pytest commands to CLAUDE.md flags 5 additional pytest commands (Tasks 1.4, 2.2, 2.3, 3.1, 3.2 sanity checks) updated to --forked --timeout=120 --timeout-method=signal, matching CI workflow and Phase 4 pre-push gate. Per CLAUDE.md non-negotiable: 'Local test: run the relevant pytest locally with the SAME flags the CI workflow uses.' All 9 pytest invocations in the plan now consistent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * tr+trlib: add tr_check_bpsd_pull BPSD broker verification helper (L-7b-ii) New non-mutating Fortran helper tr_check_bpsd_pull pulls the 3 eq-pushed BPSD slots (device, equ1D, metric1D) into local discardable types and reports ok = 1 iff all 3 per-slot ierr == 0. plasmaf is intentionally NOT checked: it is tr's own BPSD output (tr_bpsd_put), absent on a fresh eq->tr pipeline. Implementation notes: - Local types initialized with %nrmax = 0 to trigger BPSD's self-allocation path (per ../../bpsd/bpsd_equ1D.f90:143-150, bpsd_get_* is INTENT(INOUT) and allocates internally when the caller passes nrmax=0). - Per-slot ierr accumulated to AND-condition (avoids masking earlier failures by later successes). - TRCOMM untouched -- safe to call before or after tr.run(). - TR_STATE_ABI_VERSION not bumped (no struct change). Exposed via C ABI (tr/tr_api.h) and Python wrapper (Trlib.check_bpsd_pull). Smoke-tested: fresh Trlib() init returns False (BPSD slots empty); existing trlib tests unchanged. Spec: docs/superpowers/specs/2026-05-03-l7b-ii-bpsd-broker-coupling-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * pipeline: extend CouplingRule for verify rules (kind/verify dispatch) (L-7b-ii) CouplingRule dataclass gains two new fields: kind: str = "transfer" # "transfer" | "verify" verify: Optional[Callable[[Any], bool]] = None Existing transfer-rule fields (src_state_key, dst_param, transform) are defaulted to None and validated by __post_init__: * kind="transfer" requires all three transfer fields * kind="verify" requires verify callable * unknown kind raises ValueError @DataClass(frozen=True) preserved; __post_init__ only raises and never mutates self, so frozen remains compatible. run_pipeline rule iteration gains a kind branch: * transfer: existing extract -> transform -> set_param -> record * verify: ok = rule.verify(curr_inst); raise CouplingError if False or if verify itself raised. applied.append only on success (matches transfer-rule "succeeded snapshot" semantics of PipelineStep.coupling_applied) Both error paths flow through the existing broad except at pipeline.py and become TotPipelineRunError with __cause__ chain (3-level: RunError -> CouplingError -> OriginalError if any). Scope note (L-7b-ii deferred at registry level): The original spec proposed registering an ('eq','tr') verify rule using BPSD as a cross-module broker (Trlib.check_bpsd_pull). During implementation we found that libeqapi.so and libtrapi.so each carry a private copy of the BPSD module-level state -- ___bpsd_equ1d_MOD_ equ1dx is static (s) in nm output of both shared objects. Therefore bpsd_put from libeqapi.so does NOT propagate to libtrapi.so's bpsd_get, and the verify rule cannot succeed under the per-module TotPipeline architecture. The legacy Tot / libtotapi.so path keeps eq+tr+bpsd co-linked and is unaffected. Until a cross-.so sharing scheme (unified .so, IPC, or RTLD_GLOBAL with weak symbols) is decided, no ('eq','tr') rule is registered. The verify dispatch infrastructure is in place and mock-tested (Layer A) so that the rule can be added later without further changes to pipeline.py. One existing test (test_run_pipeline_missing_source_state_raises_ coupling_error) was updated to pass an explicit transform=lambda v: v to satisfy the new __post_init__ contract; it had previously relied on the old identity default. Existing ("fp","tr") transfer rule and Layer 1 baselines (demo2014, ht6m at 1e-10) are unaffected. Spec: docs/superpowers/specs/2026-05-03-l7b-ii-bpsd-broker-coupling-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test+docs: Layer A/B BPSD coupling verify tests + READMEs (L-7b-ii) Adds 11 new test cases across 2 layers (Layer C deferred — see below): Layer A (test_pipeline_verify.py, no .so): 8 mock-based dispatch tests covering verify-True success, verify-False raise+cause-chain, verify-raise 3-level chain, mixed transfer+verify ordering, __post_init__ validation (3 cases), unregistered-pair silent skip. Layer B (test_bpsd_check.py, libtrapi.so): 3 unit tests for Trlib.check_bpsd_pull() -- fresh init returns False, closed Trlib raises TrlibError, no exception leak from Fortran. Also updates test_pipeline_dataclasses::test_coupling_rule_defaults to pass an explicit transform=lambda v: v: the previous version relied on the old default `transform = lambda v: v`, which is no longer a default per the new __post_init__ contract introduced in the previous commit. Layer C (eq + tr integration) deferred: 実装中の調査で `libeqapi.so` と `libtrapi.so` が独立した .so で、それぞれが BPSD module-level state (`___bpsd_equ1d_MOD_equ1dx` ほか) を private に持つことが判明 (nm で確認、両 .so で static linkage)。 spec が想定した「BPSD = eq/tr 共有ブローカー」は `Tot` / `libtotapi.so` (eq+tr+bpsd 同梱) では成立するが、 `TotPipeline` (per-module .so) では成立しない。したがって ('eq','tr') verify ルールは本 PR では登録せず、 Layer C 統合テストも保留 (実行しても push が反映されないので、 C-1 は構造的にパスしえず、C-3 は誤った理由で pass する)。 verify ディスパッチの機構自体は本 PR で完成済み (Layer A 網羅) なので、共有方針が決まり次第 ('eq','tr') ルールは COUPLING_RULES に 1 行追加するだけで有効化できる。詳細は totlib/README.md の Coupling rules 節と trlib/README.md の "BPSD ブローカー pull 検証" 節を参照。 README updates: python/totlib/README.md: documents the kind="transfer" / "verify" distinction in the Coupling rules section, plus a Japanese-language deferral note explaining the per-.so BPSD isolation finding. python/trlib/README.md: new "BPSD ブローカー pull 検証" section (Japanese) documenting Trlib.check_bpsd_pull() with usage example and the same scope caveat about per-.so isolation. Existing Layer 1 baselines (demo2014, ht6m at 1e-10) and existing pipeline tests are unaffected (the verify dispatch doesn't fire when no rule is registered for the pair). Spec: docs/superpowers/specs/2026-05-03-l7b-ii-bpsd-broker-coupling-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(L-7b-ii): address reviewer findings (MED + LOW polish) Codex MED1 (older .so compatibility): wrap tr_check_bpsd_pull ctypes prototype attachment in try/except, mirroring the existing tr_validate pattern in _ffi.py. Older builds of libtrapi.so that predate L-7b-ii will now defer the AttributeError to first call of Trlib.check_bpsd_pull instead of failing at module import. Codex MED3 (chain-depth accuracy): totlib/README's "3-level cause chain" was inaccurate for the False-return path. Clarify that False produces a 2-level chain (TotPipelineRunError -> TotPipelineCouplingError, no original cause), while a verify() exception produces the 3-level chain (... -> original exception). In-house L-1 (docstring count): test_pipeline_verify.py header said "Six cases" but defines 8 test functions because A-5 expands into three separate validation tests. Updated the docstring to clearly distinguish "six logical cases" from "eight test functions" with a per-case description. Codex MED2 (spec staleness): added §0 (in Japanese, matching totlib/trlib README convention) at the top of the spec doc recording the implementation-time deferral, summarising what shipped vs deferred and listing candidate resolution paths (shared .so / IPC / RTLD_GLOBAL+weak / serialisation). Status field changed from "Draft" to "Partial". Verification: - Layer A (8) + Layer B (3) + dataclasses (8) + pipeline (30) + equivalence (2) = 51 passed; 1 pre-existing Py3.10 ExceptionGroup test still fails (documented in plan). - Trlib() smoke load + check_bpsd_pull still returns False on fresh init. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

k-yoshimi and others added 15 commits April 17, 2026 21:08

chore: start Phase 0 regression test scaffolding

8f15e7f

docs: lock comparison strategy (trregress dump, tol 1e-10)

526b89d

test(tr): add metric extractor reading tr_regress.dat

ccc4cfd

test(tr): port extract_tr_metrics tests from pytest to stdlib unittest

a5a6096

pytest is not installed on the target environment. Switch to unittest so the regression suite runs on a vanilla Python 3 install.

test(tr): add metric comparison tool (default tolerance 1e-10)

05bdd52

test(tr): add M0904 regression input (modelg=2, EQ-independent)

9e14630

test(tr): add TST-2 regression input

63f6762

test(tr): add regression check wrapper script

9ad4abb

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test(tr): register M0904 and TST-2 regression cases

802648a

test(tr): enable TR_REGRESS_DUMP and wire regression check into run_t…

c2dcaba

…ests.sh

docs(test_run): document regression workflow with dump mechanism

afabb8f

chore(test_run): ignore generated outputs and python caches

0e0f324

k-yoshimi changed the base branch from fix/tr-dealloc-cleanup to develop April 17, 2026 23:17

cursor Bot reviewed Apr 17, 2026

View reviewed changes

Comment thread test_run/scripts/compare_metrics.py

k-yoshimi mentioned this pull request Apr 18, 2026

docs(tr): Phase L library-ization design spec #3

Merged

3 tasks

cursor Bot reviewed Apr 18, 2026

View reviewed changes

k-yoshimi merged commit 926b25b into develop Apr 18, 2026
1 check passed

This was referenced Apr 18, 2026

docs(ti): Phase L implementation plans for ti library-ization #6

Merged

docs(tr): Phase L-7 library documentation #62

Merged

k-yoshimi deleted the feature/tr-regression-phase0 branch April 18, 2026 15:44

k-yoshimi mentioned this pull request May 11, 2026

feat(tot): mirror AJRFT triangle onto TOT orchestrator (#191, PR #187 follow-up) #193

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(tr): Phase 0 - regression test infrastructure#2

test(tr): Phase 0 - regression test infrastructure#2
k-yoshimi merged 16 commits into
developfrom
feature/tr-regression-phase0

k-yoshimi commented Apr 17, 2026 •

edited by cursor Bot

Loading

Uh oh!

k-yoshimi commented Apr 17, 2026

Uh oh!

Uh oh!

k-yoshimi commented Apr 18, 2026

Uh oh!

k-yoshimi commented Apr 18, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

k-yoshimi commented Apr 17, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

設計書と実装計画

回帰判定の仕組み

通常実行への影響

変更の要約（15 コミット）

Test plan

pytest → unittest について

次のフェーズ

Uh oh!

k-yoshimi commented Apr 17, 2026

Uh oh!

Uh oh!

k-yoshimi commented Apr 18, 2026

Uh oh!

k-yoshimi commented Apr 18, 2026

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

k-yoshimi commented Apr 17, 2026 •

edited by cursor Bot

Loading