trop: methodology-review tracker promotion + test_methodology_trop.py by igerber · Pull Request #491 · igerber/diff-diff

igerber · 2026-05-25T14:32:54Z

Summary

Promotes TROP (Athey, Imbens, Qu & Viviano 2025, Triply Robust Panel Estimators, arXiv:2508.21536) on the methodology-review tracker — In Progress → Complete (paper method="local") with full Verified Components / Test Coverage / Deviations / Outstanding Concerns / R Parity structure mirroring HAD (HeterogeneousAdoptionDiD methodology-review-tracker promotion: In Progress -> Complete #473) / ContinuousDiD (ContinuousDiD methodology-review-tracker promotion: In Progress -> Complete #476) / DCDH (ChaisemartinDHaultfoeuille (DCDH) methodology-review-tracker promotion: In Progress -> Complete #481) / WooldridgeDiD (PR-B: WooldridgeDiD tracker promotion + methodology bundle #486) precedents.
New tests/test_methodology_trop.py (10 paper-equation-anchored classes, 36 tests): Eq. 2 nuclear-norm prox + plain prox-gradient monotonicity + weighted-solver; Eq. 3 unit-distance untreated-only mask + time-decay formula (direct hand-computed assertions); Eqs. 4-5 / Algorithm 1 LOOCV control-set semantics + cycling-search convergence; Corollary 1 single-draw sanity checks across three balance conditions; Theorem 5.1 MC-ranking realisation (TROP RMSE < DID RMSE on factor-confounded DGP); Section 2.2 DID + MC reductions; Eq. 13 / Algorithm 2 per-(i, t) estimation; Algorithm 3 stratified pairs bootstrap; Section 3 / Eq. 6 factor-DGP recovery; plus a TestTROPDeviations class locking 11 documented library deviations.
Net tests/test_trop.py cleanup: 9 methodology-relevant tests migrated to the new file (TestMethodologyVerification, paper-conformance subset of TestPaperConformanceFixes, 3 of 4 TestTROPNuclearNormSolver tests, TestCyclingSearch::test_cycling_search_converges, TestTROPvsSDID::test_trop_handles_factor_dgp); the TestPaperConformanceFixes and TestTROPvsSDID shells are deleted.

Methodology references

Method: TROP (Triply Robust Panel) method="local" — paper Algorithm 2. method="global" is a library-side efficiency adaptation (REGISTRY) and stays defensively covered in tests/test_trop.py::TestTROPGlobalMethod.
Paper / source link: Athey, S., Imbens, G.W., Qu, Z., & Viviano, D. (2025). Triply Robust Panel Estimators. arXiv:2508.21536. Paper review on file at `docs/methodology/papers/athey-2025-review.md` (merged PR docs: add retrospective paper reviews for TROP and Wooldridge ETWFE #443).
Intentional deviations (all documented under REGISTRY `## TROP` + METHODOLOGY_REVIEW `#### TROP`):
- Gap Add comprehensive code review for diff-diff library #5 (paper review): paper Section 5 says weights sum to one; Eq. 3 writes unnormalised exponential weights. Library matches Eq. 2 (unnormalised). Locked by direct kernel inspection.
- Gap Improve code quality and reduce technical debt #9: paper assumes balanced panels; library accepts unbalanced (control / pre-treatment cell drops).
- Rank selection is implicit via nuclear-norm soft-thresholding (paper Section 5.3 + Appendix); corrects an earlier REGISTRY overclaim that listed cv / ic / elbow methods. No discrete `rank_selection` constructor parameter.
- `λ_nn = ∞` → 1e10 internal sentinel with original-value storage on results.
Outstanding Concerns (deferred): Equation 14 covariate extension + Theorem 8.1 covariate triple robustness (`TROP.fit()` has no `covariates` kwarg; non-support locked via `inspect.signature`); SC / SDID reductions (paper-claimed under "specific (omega, theta) weight choices" not provided in the paper text); Eq. 10 direct numerical reconstruction (requires exposing internal per-(i, t) θ / ω weight vectors); R parity (paper-author reference implementation "forthcoming").
Scope note: Theorem 5.1 is verified as a simulation sanity check (TROP RMSE < DID RMSE under LOOCV-tuned weights), NOT as a direct fixed-weight conditional-bias-bound lock. The Matrix Completion reduction is verified as code-path activation (effective_rank > 0 + beats DID baseline), NOT as equivalence against an independent MC reference.

Validation

Tests added/updated: `tests/test_methodology_trop.py` (new, 36 tests across 10 classes); `tests/test_trop.py` (-784 lines, 9 tests migrated, 2 class shells deleted).
All 36 new methodology tests pass under `pytest tests/test_methodology_trop.py -W error::UserWarning` (no UserWarnings fire after `max_iter=500` bumps on convergence-sensitive tests).
7 rounds of local `/ai-review-local --backend codex` to ✅ Looks good with no findings.
No source-code changes to `diff_diff/trop*.py`.

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Closes the Athey, Imbens, Qu & Viviano (2025) Triply Robust Panel Estimators (arXiv:2508.21536) primary-source review on the methodology tracker. PR-A (paper review on file at docs/methodology/papers/athey- 2025-review.md) was previously merged as #443; this PR is the F.L.I.P. consolidation: new tests/test_methodology_trop.py with paper-equation- numbered Verified Components walk-through (10 classes, 36 tests covering Eq. 2 nuclear-norm prox / FISTA / weighted-prox, Eq. 3 unit + time weights, Eqs. 4-5 + Algorithm 1 LOOCV with two-stage cycling, Corollary 1 three-condition unbiasedness, Theorem 5.1 MC-ranking realisation, Section 2.2 DID + MC reductions, Eq. 13 + Algorithm 2 per-(i, t) estimation, Algorithm 3 stratified pairs bootstrap, Section 3 / Eq. 6 factor-DGP recovery, plus a TestTROPDeviations class locking 11 documented library deviations). Migrated from tests/test_trop.py: TestMethodologyVerification (5 tests -> TestTROPEquation6FactorDGPRecovery), four paper-conformance tests + one weighted-solver convergence test from TestPaperConformanceFixes (distributed across the new equation-numbered classes), three prox / FISTA / weighted-objective tests from TestTROPNuclearNormSolver (-> TestTROPNuclearNormProx), plus a cycling-convergence test from TestCyclingSearch and the factor-DGP smoke from TestTROPvsSDID. The TestPaperConformanceFixes and TestTROPvsSDID shells are deleted; TestTROPNuclearNormSolver retains its single defensive test_zero_weights_no_division_error. METHODOLOGY_REVIEW.md TROP row promoted In Progress -> Complete (paper method="local") with merge date 2026-05-24, full Verified Components / Test Coverage / Deviations / Outstanding Concerns / R Parity structure mirroring HAD (PR #473), ContinuousDiD (PR #476), DCDH (PR #481), WooldridgeDiD (PR #486). The methodology promotion is scoped to the paper-aligned method="local" path (paper Algorithm 2); method="global" is a library-side efficiency adaptation per REGISTRY and stays defensively covered in tests/test_trop.py::TestTROPGlobalMethod. Documented deviations: Gap #5 (unnormalised weights match Eq. 2, not Section 5 sum-to-one) — locked by a direct kernel-weight inspection test against TROP._compute_observation_weights; Gap #9 (control / pre- treatment cell drops supported beyond paper's balanced-panel assumption); rank selection is implicit via nuclear-norm soft-thresholding (no discrete rank_selection constructor parameter — corrects an earlier REGISTRY overclaim that listed cv / ic / elbow methods); lambda_nn=inf -> 1e10 internal sentinel with original-value storage on results. Outstanding Concerns (deferred): Equation 14 covariate extension (TROP.fit() has no `covariates` kwarg; non-support locked by TestTROPDeviations::test_covariates_not_supported via inspect.signature to guard against future **kwargs) and Theorem 8.1 deferred until use cases motivate. SC / SDID reductions paper-claimed under "specific (omega, theta) weight choices" not provided in the paper text; cross- language anchor deferred until paper-author code clarifies the weight map. Eq. 10 direct numerical reconstruction deferred — requires exposing the internal per-(i, t) theta / omega weight vectors. R parity deferred ("forthcoming" per the paper). Methodology sign-off scope: paper-aligned identification ingredients (Eq. 2 prox + Eq. 3 weights + Eqs. 4-5 LOOCV + Algorithms 1-3 + Corollary 1 single-draw sanity checks + Eq. 6 simulation recovery + DID reduction + documented deviations) are directly locked. Theorem 5.1 is verified as a simulation sanity check (TROP RMSE < DID RMSE under LOOCV-tuned weights), NOT a direct fixed-weight conditional-bias-bound lock. The Matrix Completion reduction is verified as code-path activation (effective_rank > 0 + beats DID baseline), NOT equivalence against an independent MC reference. Plain (non-accelerated) prox- gradient objective monotonicity is locked; the shipped accelerated FISTA outer loop does NOT guarantee per-step monotonicity (Nesterov momentum gives O(1/k^2) but not monotonicity) and is not directly tested. REGISTRY.md ## TROP section gains a Verified Components expansion: 10 ticked requirements + four **Note:** / **Note (paper resolution):** / **Note (deferral):** annotations consolidating the deviation surface. No source-code changes to diff_diff/trop*.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-25T14:38:54Z

Overall Assessment

✅ Looks good

Executive Summary

The diff is documentation/tests only; no diff_diff/trop*.py estimator code changed.
I found no unmitigated P0/P1 issues on methodology correctness or inference behavior in the changed surface.
There are three P3 documentation-scope issues: the weight-normalization ambiguity is marked as resolved even though the paper-review file still says it is unresolved, the new tracker wording overstates Section 6.1 support, and one note misstates plain prox-gradient monotonicity as “FISTA monotonicity.”
The test migration from tests/test_trop.py into tests/test_methodology_trop.py looks coverage-preserving for the moved methodology cases.
This review is static only: the sandbox here does not have pytest or numerical deps like numpy, so I could not execute the new test file.

Methodology

Severity P3. Impact: the promotion now marks TROP Complete and labels Gap Add comprehensive code review for diff-diff library #5 as a “paper resolution” in METHODOLOGY_REVIEW.md:L829-L830 and docs/methodology/REGISTRY.md:L2393-L2394, but the underlying paper review still says weight normalization is an open source ambiguity and explicitly says not to promote until it is resolved against the source in docs/methodology/papers/athey-2025-review.md:L12-L13, docs/methodology/papers/athey-2025-review.md:L267-L267, and docs/methodology/papers/athey-2025-review.md:L352-L352. That leaves the methodology trail internally inconsistent. Concrete fix: either update the paper-review file with the actual source-backed resolution/clarification and keep the promotion, or soften the new wording to a documented deviation pending source clarification.
Severity P3. Impact: the new Eq. 13 / Algorithm 2 language says “general assignment patterns (Section 6.1) including staggered adoption and heterogeneous treatment effects” are supported in METHODOLOGY_REVIEW.md:L842-L842 and docs/methodology/REGISTRY.md:L2390-L2390, but the implementation and source-review context still require an absorbing-state treatment path and reject non-absorbing/on-off patterns in diff_diff/trop.py:L392-L402, docs/methodology/REGISTRY.md:L2361-L2369, and docs/methodology/papers/athey-2025-review.md:L280-L282. Concrete fix: narrow that wording to absorbing-state staggered adoption / heterogeneous per-cell effects, or explicitly say that Section 6.1 switching/on-off patterns remain out of scope.

Code Quality

No findings in the changed code/test surface.

Performance

No findings in the changed code/test surface.

Maintainability

No findings beyond the documentation inconsistencies noted above.

Tech Debt

No findings. The remaining covariate-extension, SC/SDID-reduction, and R-parity gaps are explicitly documented as deferred, which is acceptable under this project’s review rules.

Security

No findings.

Documentation/Tests

Severity P3. Impact: the new Eq. 10 note says the test class verifies “FISTA monotonicity” in METHODOLOGY_REVIEW.md:L865-L865 and docs/methodology/REGISTRY.md:L2392-L2392, but the new methodology file explicitly says it only verifies plain prox-gradient monotonicity and that the accelerated FISTA loop is not monotone in tests/test_methodology_trop.py:L8-L16 and tests/test_methodology_trop.py:L384-L396. That can mislead future reviewers about what the test actually proves. Concrete fix: replace “FISTA monotonicity” with “plain prox-gradient monotonicity” or “prox-gradient ingredient” in both docs.
Residual verification gap: I could not run tests/test_methodology_trop.py in this sandbox because pytest and numerical runtime dependencies are unavailable here.

…scope + plain prox-grad wording Address 3 P3 documentation-scope findings from CI codex on PR #491 (verdict was ✅ Looks good; no P0/P1). P3 - Gap #5 paper-review inconsistency: METHODOLOGY_REVIEW.md and REGISTRY.md marked weight normalization as a "paper resolution", but the paper review at docs/methodology/papers/athey-2025-review.md still said "open source-ambiguity item ... do not promote until resolved against the source". Updated paper-review file with explicit "resolution status" header noting Gap #5 is resolved as a LIBRARY-SIDE choice (deliberate deviation from Section 5 sum-to-one), locked by TestTROPDeviations::test_unnormalized_weights_match_eq2. The source-side ambiguity remains open pending paper-author clarification; the unnormalized form is now the documented library contract. P3 - Eq. 13 / Algorithm 2 general-assignment overclaim: the wording "general assignment patterns (Section 6.1) including staggered adoption and heterogeneous treatment effects" was too broad — the implementation rejects non-absorbing D-matrices via the absorbing- state validator in trop_local.py. Narrowed to "absorbing-state staggered adoption / heterogeneous per-cell effects (paper Remark 6.1)" + explicit "Section 6.1 non-absorbing / on-off / switching assignment patterns are OUT OF SCOPE" with pointer to the rejection- contract test (TestTROPDeviations::test_non_absorbing_treatment_rejected_with_value_error). P3 - "FISTA monotonicity" wording in the Eq. 10 note: the test explicitly says it only verifies plain prox-gradient monotonicity and that the accelerated FISTA outer loop is not monotone (Nesterov momentum). Replaced "FISTA monotonicity" with "plain prox-gradient monotonicity — NOT the shipped accelerated FISTA outer loop, which uses Nesterov momentum and does not guarantee per-step monotonicity" in METHODOLOGY_REVIEW Outstanding Concerns + REGISTRY Eq. 10 note. No source-code changes to diff_diff/trop*.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-25T14:47:44Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 6744b8857f5ed38653a5ddcd2f14f46f87cd268f

Overall Assessment

✅ Looks good

Executive Summary

This remains a docs/tests-only PR; no diff_diff/trop*.py estimator code changed.
The prior re-review issues are mostly resolved: the promotion is now explicitly scoped to method="local", Section 6.1 non-absorbing/on-off patterns are called out as out of scope, and the methodology docs now distinguish plain prox-gradient checks from the accelerated FISTA loop.
I found no unmitigated P0/P1 issues. The remaining items are P3 documentation/traceability cleanup only.
The test migration looks coverage-preserving on the moved methodology surface: tests/test_methodology_trop.py now carries 36 methodology tests, while tests/test_trop.py still carries 107 defensive/API tests.
Static review only: this environment does not have pytest, numpy, pandas, or scipy, so I could not execute the new suite.

Methodology

Severity P3. Impact: docs/methodology/REGISTRY.md:L2393-L2393 still labels Gap Add comprehensive code review for diff-diff library #5 as Note (paper resolution), but the updated paper review now explicitly says the paper-side ambiguity remains open and only the library-side choice is resolved as a documented deviation (docs/methodology/papers/athey-2025-review.md:L14-L14, docs/methodology/papers/athey-2025-review.md:L354-L354). This is not a blocker because the deviation is documented, but the label still overstates how fully the source discrepancy was resolved. Concrete fix: rename the REGISTRY/CHANGELOG wording to Note (library-side choice) or Note (documented deviation) while keeping the current explanation (CHANGELOG.md:L14-L14).

Code Quality

No findings in the changed surface.

Performance

No findings in the changed surface.

Maintainability

Severity P3. Impact: the new methodology file still contains provenance pointers to deleted class names (TestPaperConformanceFixes, TestTROPvsSDID, TestMethodologyVerification) and its TestTROPDeviations docstring says the class covers bootstrap-failure/FISTA-warning tests that actually remain in tests/test_trop.py. That makes the new methodology surface harder to navigate and slightly overstates what lives in this file (tests/test_methodology_trop.py:L347-L353, tests/test_methodology_trop.py:L667-L669, tests/test_methodology_trop.py:L848-L850, tests/test_methodology_trop.py:L1143-L1146, tests/test_methodology_trop.py:L1254-L1256, tests/test_methodology_trop.py:L1506-L1508, tests/test_methodology_trop.py:L1592-L1594, tests/test_methodology_trop.py:L1913-L1930). Concrete fix: rewrite those “Origin”/“Tests cover” docstrings to either say “ported from removed pre-migration class X” or point to current live anchors only.

Tech Debt

No findings. The remaining TROP gaps called out here (Eq. 14 covariates, SC/SDID reductions, R parity) are explicitly documented as deferred in METHODOLOGY_REVIEW.md:L860-L867, so they are acceptable under this repo’s review rules.

Security

No findings.

Documentation/Tests

Severity P3. Impact: the main methodology docs correctly narrow Eq. 2 coverage to the plain prox-gradient ingredient, not the accelerated FISTA loop, but the changelog still summarizes the new surface as prox / FISTA / weighted-prox (CHANGELOG.md:L14-L14 vs. tests/test_methodology_trop.py:L8-L16, tests/test_methodology_trop.py:L382-L396, METHODOLOGY_REVIEW.md:L836-L836). That residual shorthand can reintroduce the same FISTA-scope confusion the earlier review identified. Concrete fix: align the changelog wording with the narrower phrasing already used in the methodology docs.
Verification note: I could not run tests/test_methodology_trop.py in this sandbox because the runtime dependencies are not installed.

…HANGELOG wording Address 3 P3 findings from CI codex R2 on PR #491 (verdict was ✅ Looks good; no P0/P1). P3 - Note label overclaim: REGISTRY/CHANGELOG used "Note (paper resolution)" but the resolution is library-side only (the paper-side ambiguity remains open per the updated paper-review file). Renamed to "Note (library-side choice)" in both REGISTRY and CHANGELOG; expanded the note text to make the open paper-side status explicit and add a "will be revisited" line for when the paper-author reference lands. P3 - Provenance docstrings reference deleted classes: 10 references to TestPaperConformanceFixes / TestTROPvsSDID / TestMethodologyVerification formerly pointed at `tests/test_trop.py::ClassName::method` as if the classes still existed there. Rewrote each provenance pointer to explicitly say "pre-migration class X in test_trop.py (deleted in the methodology-promotion PR)" so future readers don't try to navigate to a non-existent class. Also tightened the TestTROPDeviations docstring: removed two "Tests cover" bullets (bootstrap-failure 5% warning, FISTA convergence warning) that overclaimed in-class coverage — those defensive tests live in test_trop.py and are cross-referenced; the "Cross-references for context" header is now explicit that these are NOT locked by tests in TestTROPDeviations. Added the three actual in-class tests that were missing from the cover list (lambda inf grid rejection, n_bootstrap < 2 rejection, safe_inference NaN-propagation). P3 - CHANGELOG "prox / FISTA / weighted-prox" wording: aligned with the methodology docs' narrower phrasing — "Eq. 2 soft-threshold SVD prox / plain prox-gradient monotonicity on a toy setup / weighted- prox solver (the shipped accelerated FISTA outer loop is NOT directly tested for per-step monotonicity because Nesterov momentum does not guarantee it)". The earlier shorthand was the same FISTA-scope overclaim that R1 P3 fixed in the methodology docs. No source-code changes to diff_diff/trop*.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-25T14:58:18Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 0ff934ce8f561fb398f146b5fec02dee87b8643b

Overall Assessment

✅ Looks good

Executive Summary

This remains a docs/tests-only PR for TROP promotion; no diff_diff/trop*.py estimator logic changed.
Affected method: TROP, specifically the paper-aligned method="local" methodology promotion, with some shared Eq. 2 solver-internal coverage discussed for local/global helper code.
The prior re-review concerns are mostly resolved: the promotion is explicitly scoped to method="local", the monotonicity claim is narrowed to plain prox-gradient rather than the accelerated FISTA loop, and migrated-test provenance is now clearly marked as pre-migration/deleted.
I found no unmitigated P0/P1 issues.
One minor P3 documentation/traceability inconsistency remains in the TROP paper review cross-reference.
Static review only: syntax compilation succeeded for the changed test files, but I could not run pytest because this environment lacks pytest, numpy, pandas, and scipy.

Methodology

Severity P3. Impact: docs/methodology/papers/athey-2025-review.md:L14 still tells readers the REGISTRY note is labeled **Note (paper resolution):**, but the actual promoted registry now uses docs/methodology/REGISTRY.md:L2393 **Note (library-side choice):**. The underlying deviation is documented correctly, so this is not a blocker, but it leaves the methodology traceability chain slightly inconsistent. Concrete fix: update the paper-review cross-reference at line 14 to the current label text, or refer to the weight-normalization note without hard-coding the label.

Code Quality

No findings in the changed surface.

Performance

No findings in the changed surface.

Maintainability

No findings. The prior navigation/provenance issue appears resolved: the migrated methodology tests now clearly identify deleted pre-migration classes instead of implying those classes still exist.

Tech Debt

No findings. The remaining TROP limitations in this PR are explicitly documented as deferred in METHODOLOGY_REVIEW.md, so they are acceptable under the repo’s review rules.

Security

No findings.

Documentation/Tests

No new blocking issues.
Verification note: syntax-only compilation of tests/test_methodology_trop.py and tests/test_trop.py succeeded; runtime execution was not possible in this sandbox because the test/runtime dependencies are not installed.

Address single P3 from CI codex R3 on PR #491 (verdict was ✅ Looks good; no P0/P1/P2). P3 - Paper-review cross-reference drift: docs/methodology/papers/athey- 2025-review.md L14 hard-coded "**Note (paper resolution):**" as the REGISTRY label, but my R2 fix renamed the registry note to "**Note (library-side choice):**". Replaced the hard-coded label reference with a structural pointer ("the weight-normalization note in the ## TROP block of REGISTRY.md") so the cross-reference stays correct if the label is renamed again. No source-code changes to diff_diff/trop*.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-25T15:04:13Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 7b0ef9577d724002b50b9eb59a8686f7f21d4ce8

Overall Assessment

✅ Looks good

Executive Summary

This remains a docs/tests-only TROP promotion; no diff_diff/trop*.py estimator logic changed.
Affected method: TROP, with the promotion correctly scoped to paper-aligned method="local"; method="global" remains documented as a library-side adaptation.
The prior P3 paper-review label mismatch appears resolved.
I found no unmitigated P0/P1 issues.
Two minor P3 traceability issues remain in the newly added methodology documentation.
Static review only: this environment does not have pytest, numpy, pandas, or scipy, so I could not run the new tests; the changed Python files do parse successfully.

Methodology

Severity P3. Impact: the new promotion text says the non-absorbing D-matrix rejection is locked by TestTROPDeviations::test_non_absorbing_treatment_rejected_with_value_error, but the added test is actually named test_event_style_d_rejected_with_value_error. That leaves the new methodology evidence chain slightly broken exactly where the PR documents the Section 6.1 scope restriction. Concrete fix: either rename the test or update the doc references in METHODOLOGY_REVIEW.md:L842 and docs/methodology/REGISTRY.md:L2390 to match tests/test_methodology_trop.py:L2132.

Code Quality

No findings in the changed surface.

Performance

No findings. The new heavier checks are appropriately isolated behind @pytest.mark.slow and ci_params scaling.

Maintainability

Severity P3. Impact: the PR adds several hard-coded REGISTRY L22xx/L23xx references that are already stale after the same edit set, for example METHODOLOGY_REVIEW.md:L845, METHODOLOGY_REVIEW.md:L857, tests/test_methodology_trop.py:L2021, and tests/test_methodology_trop.py:L2299. This does not affect behavior, but it makes the new methodology trail fragile and harder to audit over time. Concrete fix: replace raw line-number references with stable section/item names (for example, REGISTRY.md ## TROP -> "Inference CI distribution") or drop the line numbers entirely.

Tech Debt

No findings. The remaining TROP limitations called out here (Eq. 10 reconstruction, Eq. 14 covariates, SC/SDID reduction mapping, R parity) are explicitly documented as deferred, so they are acceptable non-blockers.

Security

No findings.

Documentation/Tests

No additional blocking issues beyond the P3 traceability items above.
Verification note: I could not execute pytest here because the environment lacks the test/runtime dependencies. I did verify that the changed Python files parse successfully with Python’s built-in compile().

…e numbers Address 2 P3 findings from CI codex R4 on PR #491 (verdict was ✅ Looks good; no P0/P1). P3 - Test-name cross-ref drift: METHODOLOGY_REVIEW and REGISTRY cited `TestTROPDeviations::test_non_absorbing_treatment_rejected_with_value_error` as the lock for the Section 6.1 non-absorbing rejection contract, but the actual implemented test is named `test_event_style_d_rejected_with_value_error`. Event-style D is one specific non-absorbing pattern, and the same absorbing-state validator catches all 1→0 transitions. Updated both doc references to point to the actual test name and clarified the scope (the validator covers all non-absorbing patterns, not just event-style). P3 - Stale REGISTRY line-number references: 7 `REGISTRY L22xx/L23xx` hard-coded references across METHODOLOGY_REVIEW.md and tests/test_methodology_trop.py (in METHODOLOGY_REVIEW Verified Components + Deviations sections, plus several TestTROPDeviations test docstrings) drift as REGISTRY.md gets edited. Replaced each with the semantic section name (e.g., `REGISTRY ## TROP "λ_nn=∞ implementation" edge-case note`, `REGISTRY ## TROP "Inference CI distribution" note`, etc.) so future edits to REGISTRY.md cannot orphan these cross-refs. No source-code changes to diff_diff/trop*.py. 36 methodology tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-25T15:16:51Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 4481836418907195a7b3138fb365b4259dc5435e

Overall Assessment

✅ Looks good

Executive Summary

This is still a docs/tests-only TROP promotion; I found no changes to diff_diff/trop*.py estimator logic.
The prior re-review traceability issues appear resolved: the event-style rejection test name now matches the tracker references, and the fragile raw line-number cross-references are gone.
I found no unmitigated P0 or P1 issues.
Two minor P3 source-traceability wording issues remain in the methodology promotion surface.
Static review only: this environment lacks numpy/pytest, so I could not execute the new tests. I did verify parser-only compilation of the changed Python files.

Methodology

Severity P3. Impact: the promotion trail still hinges on a v2-pinned paper review in docs/methodology/papers/athey-2025-review.md, while TROP is now marked Complete in METHODOLOGY_REVIEW.md. The current arXiv entry is v3, so the review record does not show whether the latest source version was diffed before promotion. Concrete fix: update the paper review to v3, or add a short note stating that v3 was checked and made no TROP-relevant methodology changes before keeping the estimator at Complete. citeturn8view0
Severity P3. Impact: the new TROP prose now describes Eq. 10 as a “paper-side asymptotic identity” in tests/test_methodology_trop.py, METHODOLOGY_REVIEW.md, and docs/methodology/REGISTRY.md, but the paper review on file describes Eq. 10 as a balancing representation/decomposition in docs/methodology/papers/athey-2025-review.md. Concrete fix: replace “asymptotic identity” with “balancing representation” or “decomposition” across the newly added tracker/test prose, and optionally in CHANGELOG.md.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings beyond the methodology traceability items above.

Tech Debt

No findings. The remaining TROP limitations called out here are documented as deviations/deferrals and are acceptable non-blockers for this promotion.

Security

No findings.

Documentation/Tests

No additional findings beyond the two P3 methodology items above.
I could not run pytest in this environment because numpy/pytest are unavailable. The changed Python files parse successfully via compile().

…epresentation" wording Address 2 P3 findings from CI codex R5 on PR #491 (verdict was ✅ Looks good; no P0/P1). P3 - Paper version pinning: paper review on file is v2-pinned (arXiv:2508.21536v2) but the current arXiv version is v3. Added an explicit "Version-pinning note (2026-05-25)" to the paper review file acknowledging that the methodology promotion ships against v2 and that a formal v2-vs-v3 delta-check has NOT been performed. Articulated the action item: refresh the review against the most recent arXiv version when the paper-author reference implementation lands ("forthcoming") and re-validate the verified-component checklist. This is honest deferral rather than an unverified "no methodology changes" claim. P3 - Eq. 10 wording drift: my new prose called Eq. 10 a "paper-side asymptotic identity" but the paper review on file describes it as a "balancing representation / decomposition" (per paper Section 5.2). Replaced "asymptotic identity" with "balancing representation" / "balancing representation / decomposition" across tests/test_methodology_trop.py (module + class docstrings), METHODOLOGY_REVIEW.md (Verified Components Eq. 2 bullet), and docs/methodology/REGISTRY.md (Eq. 10 note). CHANGELOG already used "balancing-decomposition pointer" — no change there. No source-code changes to diff_diff/trop*.py. 36 methodology tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-25T15:28:23Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 5f7f6eef93d710dd22eb65c4d5cd50eb45496e67

Overall Assessment

✅ Looks good

Executive Summary

This is still a docs/tests-only TROP promotion; I found no changes to diff_diff/trop*.py estimator logic.
I found no unmitigated P0 or P1 issues.
The prior Eq. 10 wording issue appears resolved: the changed surfaces now consistently describe Eq. 10 as a balancing representation/decomposition rather than an “asymptotic identity.”
One P3 traceability item remains around the paper-version audit trail for the TROP promotion.
I could not run pytest here because pytest, numpy, pandas, and scipy are unavailable; this is a static review plus parse-only validation of the changed Python files.

Methodology

Severity P3. Impact: the promotion is still anchored to a v2-pinned paper review that explicitly says no formal v2-vs-v3 delta-check has been performed in docs/methodology/papers/athey-2025-review.md:L8-L8, while TROP is now marked Complete in METHODOLOGY_REVIEW.md:L829-L867. That is not a demonstrated methodology bug, but it leaves the “Complete” promotion with a source-traceability gap. Concrete fix: add a short v2→v3 delta note now stating either that no TROP-relevant methodology changed, or enumerate the changed sections and confirm the tracker mapping was rechecked. The current arXiv record is v3, and the v3 HTML at least restructures the exact-unbiasedness material by adding a 5.1.1 Conditions for exact unbiasedness subsection that is not present in the v2 HTML. (arxiv.org)
No other methodology findings. The earlier Eq. 10 terminology issue looks fixed in tests/test_methodology_trop.py:L17-L24, METHODOLOGY_REVIEW.md:L836-L865, and docs/methodology/REGISTRY.md:L2392-L2394.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings. The split between methodology locks in tests/test_methodology_trop.py:L1-L2383 and defensive/unit coverage in tests/test_trop.py:L1978-L2018 is cleaner than before.

Tech Debt

No findings. The remaining TROP limitations in the changed docs are documented as deferred scope rather than silent behavior changes.

Security

No findings.

Documentation/Tests

No additional PR defect beyond the P3 methodology traceability note above.
I could not execute the new tests in this environment because pytest, numpy, pandas, and scipy are not installed. I did verify that the changed Python files parse successfully via compile().

Address single P3 from CI codex R6 on PR #491 (verdict was ✅ Looks good; no P0/P1/P2). P3 - Paper-version audit trail: codex surfaced a specific v2→v3 structural change (v3 adds Section 5.1.1 "Conditions for exact unbiasedness" subsection not present in v2). My R5 version-pinning note acknowledged the general v3 gap but didn't cite this specific change. Updated the note to: - Cite the codex finding (v3 adds 5.1.1 subsection) - Note this APPEARS to be restructuring (Corollary 1 material moved into its own subsection), not a methodological change to the underlying three balance conditions - Make the action item more specific: confirm v3's 5.1.1 is the same set of three balance conditions as v2's Corollary 1 when the paper-author reference implementation lands This is honest deferral with a concrete refresh-time check. I attempted a WebFetch of the arXiv abstract page to do the v2-vs-v3 diff inline, but the HTML page only exposes the submission history (size delta only) not the actual PDF/TeX diff — a real diff requires downloading both PDFs which is outside the methodology-promotion scope. No source-code changes to diff_diff/trop*.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-25T15:34:21Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 29a74d61f63b08ece7669f544525db323163a7b3

Overall Assessment

✅ Looks good

Executive Summary

Re-review result: no unmitigated P0/P1 findings. This is still a docs/tests-only TROP promotion, and the promotion is correctly scoped to paper-aligned method="local" while keeping method="global" explicitly deferred.
The previous paper-version traceability concern is now documented clearly in docs/methodology/papers/athey-2025-review.md:L8-L12.
One P2 methodology/docs issue remains: the promoted tracker says the weighted non-uniform prox path “monotonically decreases” the weighted objective, but the added test only checks final-vs-initial decrease and the shipped solver uses FISTA, which does not guarantee per-iteration monotonicity.
I could not run pytest here because pytest, numpy, pandas, and scipy are unavailable. Static validation was limited to compile() and AST inspection of the changed test files.

Methodology

Severity P2. Impact: the promoted verification claim is stronger than the evidence in the PR. METHODOLOGY_REVIEW.md:L836-L836 says the weighted non-uniform prox path “monotonically decreases the weighted objective,” but tests/test_methodology_trop.py:L429-L468 only asserts obj_final <= obj_init, and the actual solver diff_diff/trop_local.py:L610-L690 uses FISTA/Nesterov acceleration, which the test class itself already notes is not per-step monotone tests/test_methodology_trop.py:L325-L333. Concrete fix: either soften the tracker text to “decreases relative to initialization / reduces singular-value mass,” or add an explicit plain weighted prox-gradient monotonicity test and scope it away from the accelerated solver.
No other methodology findings. The method="local" promotion scope and method="global" deferral are stated clearly in METHODOLOGY_REVIEW.md:L832-L865.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

No additional findings.
Static validation only: both changed test files compile via compile(), and the file-level inventories match the PR narrative: 36 tests in tests/test_methodology_trop.py and 107 tests in tests/test_trop.py. I could not execute the suite in this environment.

Address single P2 from CI codex R7 on PR #491 (verdict was ✅ Looks good; no P0/P1). P2 - Weighted-prox monotonicity overclaim: METHODOLOGY_REVIEW prose said the weighted non-uniform prox path "monotonically decreases the weighted objective" but: - The test (test_local_nonuniform_weights_objective) only asserts obj_final <= obj_init + 1e-10 (final-vs-initial check), NOT per-iteration monotonicity - The shipped solver _weighted_nuclear_norm_solve uses FISTA / Nesterov acceleration which is allowed to have transient per-step objective increases (already acknowledged in the TestTROPNuclearNormProx class docstring + the test_plain_prox_gradient_objective_decreases scope note) Softened the METHODOLOGY_REVIEW Eq. 2 bullet to: - "at-or-below initialisation (final-vs-initial check ... NOT per-iteration monotonicity — the accelerated FISTA loop is allowed to have transient per-step increases)" Added a "Scope note" to the test docstring explicitly distinguishing this final-vs-initial check from the plain prox-gradient per-step monotonicity verified by test_plain_prox_gradient_objective_decreases. No source-code changes to diff_diff/trop*.py. 6 TestTROPNuclearNormProx tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-25T15:43:05Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 6da12fc771261d0e1e5c322eddb60646583210d4

Overall Assessment

✅ Looks good

Executive Summary

Previous re-review concern is resolved: the tracker now correctly scopes monotonicity to the plain prox-gradient ingredient and limits the shipped weighted-solver claim to final-vs-initial objective decrease (METHODOLOGY_REVIEW.md:L836-L845, tests/test_methodology_trop.py:L383-L477).
No unmitigated P0/P1 findings. The promotion is correctly limited to paper-aligned method="local" and explicitly defers method="global" (METHODOLOGY_REVIEW.md:L832-L833).
Severity P2: the Section 2.2 DID-reduction check is described as a TWFE benchmark, but the helper actually calls DifferenceInDifferences, not TwoWayFixedEffects, so the prose slightly overstates what was verified (tests/test_methodology_trop.py:L299-L313, tests/test_methodology_trop.py:L1281-L1327, METHODOLOGY_REVIEW.md:L841-L845, docs/methodology/REGISTRY.md:L2391-L2391).
The new paper-review note materially improves traceability around the version-pinned source and deferred follow-up (docs/methodology/papers/athey-2025-review.md:L8-L12).
I could not run pytest here because pytest, numpy, pandas, and scipy are unavailable in this environment; validation was static only.

Methodology

Severity P2. Impact: the new “DID/TWFE reduction verified” wording is stronger than the actual test evidence. _fit_did() is labeled “Fit a TWFE DiD” but instantiates DifferenceInDifferences() on outcome ~ treat * post_flag (tests/test_methodology_trop.py:L299-L313), while the library’s actual TWFE estimator is a separate class, TwoWayFixedEffects (diff_diff/twfe.py:L22-L27). DifferenceInDifferences.fit() builds the basic [const, D, T, D×T] design unless fixed effects are explicitly added (diff_diff/estimators.py:L467-L469). That means the new tracker text in METHODOLOGY_REVIEW.md:L841-L845 and docs/methodology/REGISTRY.md:L2391-L2391 currently evidences a DiD-style benchmark on a clean panel, not a literal TWFE comparator.
Concrete fix: either replace _fit_did() with a real TwoWayFixedEffects benchmark using a binary post indicator and unit FE, or soften the tracker/test wording from “TWFE-DiD” to “DiD-style benchmark.”
No additional methodology findings. The prior monotonicity wording problem is fixed by the narrower claims in METHODOLOGY_REVIEW.md:L836-L845 and the matching split tests in tests/test_methodology_trop.py:L383-L477.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No new findings. The remaining deferred items are now clearly tracked in the TROP entry rather than presented as verified scope (METHODOLOGY_REVIEW.md:L860-L867).

Security

No findings.

Documentation/Tests

No additional findings beyond the P2 above.
Static validation only: I reviewed the migrated tests and surrounding implementation, but could not execute the suite in this environment because pytest and the scientific Python stack are unavailable.

Address single P2 from CI codex R8 on PR #491 (verdict was ✅ Looks good; no P0/P1). P2 - "TWFE-DiD" wording vs actual comparator class: the methodology test uses `_fit_did()` which instantiates `DifferenceInDifferences` on `outcome ~ treat * post_flag`, but the METHODOLOGY_REVIEW + REGISTRY prose said "TWFE-DiD". The library distinguishes `DifferenceInDifferences` (basic 2×2 DiD design) from `TwoWayFixedEffects` (separate explicit-FE class). On the balanced two-period block- assignment panel used in the DID-reduction test, the two coincide numerically (and paper Section 2.2 uses "DID/TWFE" interchangeably for this special case), but the prose overstated by naming the `TwoWayFixedEffects` class. Softened wording across all surfaces: - `_fit_did()` docstring now explains it's the basic 2×2 DiD estimator (not `TwoWayFixedEffects`) - METHODOLOGY_REVIEW + REGISTRY DID-reduction bullets now say "library's basic-DiD estimator (`DifferenceInDifferences` ... numerically coincides with TWFE on balanced two-period block- assignment panels per paper Section 2.2's 'DID/TWFE' interchangeable framing — though `TwoWayFixedEffects` is a separate explicit-FE class)" - Test class + Theorem 5.1 docstrings: "TWFE-DiD" → "the DiD benchmark" No source-code changes to diff_diff/trop*.py. TestTROPSpecialCases + TestTROPTheorem51TripleRobustness still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-25T15:51:14Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 32efd04533f466800194fab38ab7b800374b8f83

Overall Assessment

✅ Looks good

Executive Summary

Previous DID/TWFE wording concern is substantially resolved: the tracker now benchmarks the Section 2.2 reduction against the library’s basic DiD helper and limits TWFE equivalence to the balanced 2-period special case (METHODOLOGY_REVIEW.md:L841-L842, docs/methodology/REGISTRY.md:L2391-L2391, tests/test_methodology_trop.py:L299-L322).
No unmitigated P0/P1 findings in the changed TROP methodology promotion. The estimator implementation is unchanged; this PR is docs/tests only.
Severity P2: the promotion is marked Complete against generic arXiv:2508.21536, but the companion paper review now says the audit is still pinned to v2 and no v2→v3 delta check has been done for the exact sections this PR promotes, including Corollary 1 and Theorem 5.1 (METHODOLOGY_REVIEW.md:L827-L845, docs/methodology/papers/athey-2025-review.md:L4-L12, tests/test_methodology_trop.py:L998-L1253).
Static validation only: pytest is unavailable in this environment. I confirmed the changed test files are syntactically valid with an in-memory compile() check, and git diff --check is clean.

Methodology

Severity P2. Impact: the TROP entry now claims Status | Complete and enumerates verified components in both the tracker and registry, but the companion paper review explicitly says the reviewed source is still arXiv v2 and that a v2→v3 delta check has not been performed for Sections 2.2, 5.2-5.3, 6.1-6.2, Corollary 1, Theorem 5.1, and Appendix Theorem 8.1. That leaves the methodology sign-off version-ambiguous at the exact spots newly promoted in this PR. Concrete fix: either complete the v2→v3 source audit before keeping the Complete promotion, or add a visible **Note:** version-pinned to arXiv v2 pending v3 refresh in the TROP blocks of METHODOLOGY_REVIEW.md and docs/methodology/REGISTRY.md. Refs: METHODOLOGY_REVIEW.md:L827-L845, docs/methodology/REGISTRY.md:L2379-L2395, docs/methodology/papers/athey-2025-review.md:L4-L12.
No additional methodology findings. The documented weight-normalization and covariate-scope deviations are properly labeled as deviations/deferrals rather than being presented as defects.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No new findings. Existing TROP follow-ups in TODO.md remain unrelated to this promotion.

Security

No findings.

Documentation/Tests

No additional findings beyond the P2 above. The new methodology file is careful about what is directly verified versus only smoke-checked or deferred; I just could not execute the suite in this environment because pytest is not installed.

…IEW + REGISTRY Address single P2 from CI codex R9 on PR #491 (verdict was ✅ Looks good; no P0/P1). P2 - Version-pinning visibility: the v2→v3 paper-version gap surfaced as a P3 in R5 (general acknowledgement) and R6 (specific Section 5.1.1 delta noted in paper review file). R9 escalates because the paper review honestly admits the gap but the TROP entry in METHODOLOGY_REVIEW and REGISTRY still says "Complete" without a visible caveat, making the methodology sign-off version-ambiguous at the exact sections newly promoted (Eqs. 2-3, Algorithms 1-3, Section 2.2, Section 5.2-5.3, Section 6.1-6.2, Theorem 5.1, Corollary 1, Appendix Theorem 8.1). Added a visible "**Version Pinning**" subsection to METHODOLOGY_REVIEW TROP entry immediately below the Status row, and a "**Note (version pinning):**" annotation to REGISTRY ## TROP section immediately below the Primary source line. Both: - Anchor the promotion on arXiv:2508.21536v2 explicitly - List the current version (v3, submitted 2026-02-09) - Enumerate the sections covered by the promotion that have NOT been v2→v3 delta-checked - Cite the known v3 structural change (Section 5.1.1 added) - Action-item to refresh when paper-author reference implementation lands Status row also updated to "**Complete** (paper `method="local"`, version-pinned to arXiv v2 — see Version Pinning below)" so the caveat is visible even from a quick scan of the status table. No source-code changes to diff_diff/trop*.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-25T15:57:43Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 867090603517036d169cd1fb4e7f7692d67ee71f

Overall Assessment

✅ Looks good

Executive Summary

The previous re-review concern is addressed: the TROP promotion is now explicitly version-pinned to arXiv v2 in the tracker, registry, and paper review (METHODOLOGY_REVIEW.md:L829-L834, docs/methodology/REGISTRY.md:L2247-L2247, docs/methodology/papers/athey-2025-review.md:L8-L12).
No estimator code changed; this PR is a docs/tests promotion for the paper-aligned TROP(method="local") surface.
I did not find any unmitigated P0/P1 methodology defects in the shipped TROP implementation or in the migrated test coverage.
Two P2 documentation inconsistencies remain inside the promoted ## TROP registry entry: stale rank-selection prose and a stale claim that TROP returns separate factor loadings/scores.
Static review only: the changed test files AST-parse clean, but I could not run pytest here because pytest, numpy, and the project test dependencies are unavailable in this environment.

Methodology

Severity P2. Impact: the promoted TROP registry entry is still internally inconsistent on rank diagnostics. The new checklist correctly says rank selection is implicit via nuclear-norm shrinkage, but the same ## TROP block still says TROP warns when estimated rank is too high/low and that rank selection is automatic via cross-validation / information criterion / elbow. The shipped result surface only computes and stores effective_rank; I could not find a corresponding public rank-warning or discrete rank-selection surface. Concrete fix: remove or rewrite the stale bullets so the whole TROP block matches the actual contract: implicit shrinkage only, effective_rank diagnostic only, no rank_selection mode list. Refs: docs/methodology/REGISTRY.md, docs/methodology/REGISTRY.md, docs/methodology/REGISTRY.md, diff_diff/trop.py, diff_diff/trop_results.py.
Severity P2. Impact: the same registry checklist still marks “Returns factor loadings and scores for interpretation” as satisfied, but TROP results only expose factor_matrix and effective_rank; there are no separate loading/score outputs on TROPResults. That overstates the promoted methodology/API surface. Concrete fix: change the checklist item to match the shipped surface (factor_matrix / effective_rank), or implement and test separate loading/score outputs before leaving it checked. Refs: docs/methodology/REGISTRY.md, diff_diff/trop.py, diff_diff/trop_results.py.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings beyond the two registry inconsistencies above.

Tech Debt

No new tracked tech-debt items. The two registry inconsistencies above are not currently recorded in TODO.md.

Security

No findings.

Documentation/Tests

No additional findings. tests/test_methodology_trop.py and tests/test_trop.py parse cleanly, but runtime verification was not possible here because pytest and project dependencies are missing from the environment.

…or-loadings overclaim Address 2 P2 internal-inconsistency findings from CI codex R10 on PR #491 (verdict was ✅ Looks good; no P0/P1). Both predate this PR but became visible after the methodology-promotion checklist cleanup made the rest of the ## TROP block inconsistent. P2 - Stale rank-selection prose: REGISTRY ## TROP block had three locations claiming rank-selection automation that contradict the new checklist: 1. Assumption checks bullet: "Warns if estimated rank seems too high/low relative to panel dimensions" — no such warning exists in the shipped code (only effective_rank is computed/stored). Removed. 2. Edge cases bullet: "Rank selection: automatic via cross-validation, information criterion, or elbow" — same overclaim as the previously-corrected checklist line. Replaced with the actual contract (implicit via nuclear-norm soft-thresholding; effective_rank is a diagnostic only; no rank_selection constructor parameter) + back-reference to the corrected Requirements checklist bullet + explicit note that this was corrected in the 2026-05-24 promotion PR. P2 - "Returns factor loadings and scores" overclaim: the Requirements checklist had `- [x] Returns factor loadings and scores for interpretation`, but TROPResults only exposes factor_matrix (n_periods × n_units) and effective_rank — no separate loading / score fields. Rewrote the checklist item to describe the actual API surface (`factor_matrix` and `effective_rank`) and flag the overclaim as a 2026-05-24 promotion-PR correction. No source-code changes to diff_diff/trop*.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-25T16:16:32Z

🔁 AI review rerun (requested by @igerber)

Head SHA: c7bd6e942d88f63f867e7fe5c66bc935f4cfbf59

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings. The prior rerun findings are fixed, and the remaining issues I found are documentation/test overclaims rather than shipped estimator defects.

Executive Summary

The previous rerun findings are addressed: the TROP registry no longer claims discrete cv / ic / elbow rank selection or separate factor loadings/scores, and now matches the shipped TROP/TROPResults surface at docs/methodology/REGISTRY.md:2349-2386, diff_diff/trop.py:126-147, and diff_diff/trop_results.py:109-143.
No estimator code changed in this PR; the review scope is the TROP methodology-promotion docs plus the new tests/test_methodology_trop.py.
One methodology-level documentation overclaim remains: the new “DID reduction verified” promotion text is stronger than what the added test actually locks.
One smaller docs/tests mapping issue remains: the new unbalanced-panel coverage note points to the wrong defensive test surface and appears to overstate donor-support coverage.
Static review only: tests/test_methodology_trop.py and tests/test_trop.py AST-parse clean, but I could not run pytest here because this environment is missing numpy / project test dependencies.

Methodology

Severity P2. Impact: the promoted “DID reduction verified” claim is stronger than the new test actually establishes. tests/test_methodology_trop.py:1290-1335 compares TROP on a 10-period panel to DifferenceInDifferences fitted as a pooled outcome ~ treat * post_flag regression via _fit_did() at tests/test_methodology_trop.py:299-322. But that helper explicitly notes the library’s basic DiD surface only coincides with TWFE on a balanced two-period panel, while the test DGP is multi-period (n_pre=6, n_post=4). Combined with the underlying DiD design matrix at diff_diff/estimators.py:160-168 and diff_diff/estimators.py:467-489, this means the current test is a friendly-DGP sanity check, not a direct Section 2.2 reduction lock. The promotion text in METHODOLOGY_REVIEW.md:843-843 and docs/methodology/REGISTRY.md:2392-2392 should not call that verified reduction yet. Concrete fix: either change the reduction test to a true two-period block-assignment panel where the basic DiD comparator is algebraically the right target, or keep the multi-period panel and compare against TwoWayFixedEffects; until then, soften the tracker prose from “DID reduction verified” to “DiD benchmark sanity check.”

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

Severity P3. Impact: the new unbalanced-panel coverage note points reviewers to the wrong defensive surface and overstates what is actually locked. docs/methodology/REGISTRY.md:2390-2390 and METHODOLOGY_REVIEW.md:857-857 cite tests/test_trop.py::TestDMatrixValidation for missing-treated-cell / thinner-donor-support coverage, but that class only covers absorbing-state validation at tests/test_trop.py:776-910. The actual unbalanced-panel regressions live under tests/test_trop.py:1644-1975, and I did not find a dedicated thinner-donor-support regression. Concrete fix: repoint those notes to the actual unbalanced-panel tests and drop the donor-support claim unless a dedicated test is added.
No additional findings. Static verification note: both changed test files AST-parse clean, but runtime execution was not possible in this environment because numpy / test dependencies are unavailable.

… unbalanced-panel xref Address 1 P2 + 1 P3 from CI codex R11 on PR #491 (verdict was ✅ Looks good; no P0/P1). P2 - "DID reduction verified" overclaim: the test test_did_reduction_lambda_nn_inf_uniform_weights uses a 10-period panel (n_pre=6, n_post=4) and compares TROP to DifferenceInDifferences, but `_fit_did()`'s own docstring notes that the library's basic DiD only coincides with TWFE on a balanced two-period panel. So the multi-period test is a friendly-DGP sanity check, NOT a direct algebraic-equivalence lock of paper Section 2.2's DID/TWFE reduction. Softened METHODOLOGY_REVIEW + REGISTRY wording from "DID reduction verified" to "DiD benchmark sanity check (NOT a direct algebraic- equivalence proof)" + explicit caveat that "a direct Section 2.2 reduction lock (true 2-period block-assignment panel where basic DiD is the algebraic target, OR a comparison against TwoWayFixedEffects with explicit unit FE) is deferred". The test remains useful as a friendly-DGP smoke check, but the prose now matches the evidence. P3 - Wrong unbalanced-panel test cross-reference: prose cited tests/test_trop.py::TestDMatrixValidation for "missing-treated-cell and thinner-donor-support coverage", but TestDMatrixValidation only covers absorbing-state validation (L776-910). The actual unbalanced- panel regressions are in TestPR110FeedbackRound8 (3 tests: test_unbalanced_panel_d_matrix_validation, ..._real_violation_still_caught, ..._multiple_missing_periods). Repointed the cross-references in both METHODOLOGY_REVIEW and REGISTRY; also dropped the "thinner-donor- support" claim since no dedicated regression exists for that surface. No source-code changes to diff_diff/trop*.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-25T16:28:43Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 6fb8013c9804180eb1b75562143ecdd3311ea311

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

Re-review scope: the prior rerun findings are addressed. The Section 2.2 tracker text is now framed as a DiD benchmark sanity check, and the unbalanced-panel note now points to the actual regression tests rather than the old validation class.
No estimator code changed in this PR; the substantive scope is the TROP methodology-promotion docs plus the new tests/test_methodology_trop.py.
One methodology-doc issue remains: the new version-pinning note records specific v3-paper claims even though it also says no v3 PDF delta-check was performed, and those specifics are attributed to a prior AI review rather than a primary-source audit.
One smaller docs/tests issue remains: some residual Section 2.2 prose still overstates the DiD reduction test, and one tracker sentence misdescribes the helper DGP as having no time trend.
Static review only: I verified the changed docs/tests against the current TROP code, and confirmed the new file contains 36 tests across 10 classes, but I could not run pytest here because importing the package fails without numpy in this environment.

Methodology

Severity P2; Impact: the new version-pinning note turns unverified v3-paper details into methodology-tracker facts even though the same note says no v3 PDF audit was performed, and the details are explicitly attributed to “CI codex” rather than the paper itself. That weakens the source-material standard for a tracker entry being promoted to Complete. Concrete fix: either reduce the note to “review remains pinned to arXiv v2; v3 audit pending” until someone actually checks v3, or perform a real v2→v3 PDF delta review and replace the AI-sourced wording with source-grounded findings. References: METHODOLOGY_REVIEW.md:L832-L832, docs/methodology/REGISTRY.md:L2247-L2247, docs/methodology/papers/athey-2025-review.md:L8-L12.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The newly called-out TROP deferrals (Eq. 10, Eq. 14, SC/SDID reductions, R parity) are explicitly documented in the tracker/registry, so I treated them as tracked deferred work rather than blockers.

Security

No findings.

Documentation/Tests

Severity P3; Impact: residual Section 2.2 prose still overstates what the new reduction test proves. The new file header says the DiD reduction is “verified as an algebraic match,” the test docstring says the reduction is “exact in expectation,” and the tracker summary describes the DGP as having “no time trends,” but the helper used by that test includes additive time effects and the implemented assertion is only ATT closeness against the basic DiD benchmark. Concrete fix: align the remaining Section 2.2 wording to the already-correct “benchmark sanity check” framing, and describe _make_no_factor_panel() as “no interactive FE; additive unit/time effects” rather than “no time trends.” References: tests/test_methodology_trop.py:L45-L55, tests/test_methodology_trop.py:L277-L295, tests/test_methodology_trop.py:L1291-L1297, METHODOLOGY_REVIEW.md:L843-L843.

…Section 2.2 wording Address 1 P2 + 1 P3 from CI codex R12 on PR #491 (verdict was ✅ Looks good; no P0/P1). P2 - Version-pinning note carrying AI-sourced v3 details: the 2026-05-25 version-pinning notes recorded specific v3-paper claims ("Section 5.1.1 Conditions for exact unbiasedness added") attributed to CI codex rather than a primary-source audit. Reviewer noted that turning unverified AI claims into methodology-tracker facts weakens the source-material standard for a Complete-promoted entry. Stripped the AI-sourced specifics from all three notes (METHODOLOGY_REVIEW, REGISTRY, paper-review file) and reduced to bare "review pinned to v2; no v3 PDF delta-check performed; action item to refresh on next substantive release". This is honest deferral without surfacing unverified specifics. P3 - Section 2.2 residual wording overclaim: even after R11 softening, residual prose in the module docstring + test docstring + tracker summary still used "TWFE-clean (no factor structure, no time trends)" language. The DGP helper _make_no_factor_panel actually has additive time_fe = 0.2*t (not "no time trends"). Aligned with the already- correct "benchmark sanity check" framing: - Module docstring: "TWFE-clean panel" → "no-interactive-FE panel (additive unit + time effects only)"; "verified as an algebraic match" → "benchmark sanity check" - test_did_reduction docstring: "exact in expectation" → "benchmark sanity check, not an algebraic-equivalence proof"; explicit note that the multi-period DGP is the canonical comparator but not the algebraic target - METHODOLOGY_REVIEW: "TWFE-clean ... no time trends" → "no-interactive-FE ... additive unit + time effects only" No source-code changes to diff_diff/trop*.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-25T16:40:30Z

🔁 AI review rerun (requested by @igerber)

Head SHA: df90c812f4721537d21f44f48d22635702868193

Overall Assessment

Looks good - no unmitigated P0/P1 findings.

Executive Summary

Re-review scope: the prior rerun's two doc/test issues appear addressed. The TROP promotion is now explicitly anchored to arXiv v2, and the missing v2->v3 audit is framed as deferred rather than as settled v3 methodology. METHODOLOGY_REVIEW.md:L829-L832, docs/methodology/REGISTRY.md:L2245-L2247, docs/methodology/papers/athey-2025-review.md:L8-L18
The Section 2.2 wording is now consistent: the DiD reduction is described as a benchmark sanity check on an additive unit + time effects panel, not an algebraic equivalence proof. METHODOLOGY_REVIEW.md:L843-L843, tests/test_methodology_trop.py:L45-L59, tests/test_methodology_trop.py:L271-L285, tests/test_methodology_trop.py:L1294-L1307
No TROP estimator code changed in this PR; the substantive scope is methodology docs plus the new dedicated methodology test file, with legacy methodology cases migrated out of tests/test_trop.py. CHANGELOG.md:L14-L14, tests/test_methodology_trop.py:L109-L120, tests/test_trop.py:L1978-L1990
Deferred items such as Eq. 10 numerical reconstruction, Eq. 14 covariates, SC/SDID reductions, and R parity are explicitly documented, so I treated them as accepted non-blocking debt rather than methodology defects. METHODOLOGY_REVIEW.md:L862-L869, docs/methodology/REGISTRY.md:L2393-L2396
One minor issue remains: the Eq. 10 "structural pointer" test asserts nonzero factor rank on an additive-FE/noise DGP, which is not a source-grounded consequence of the paper and risks locking a solver artifact rather than the intended low-rank behavior. tests/test_methodology_trop.py:L623-L660, docs/methodology/REGISTRY.md:L2393-L2393
Review limitation: static only. I could not run tests/test_methodology_trop.py here because pytest and numpy are unavailable in this environment.

Methodology

Severity P3; Impact: TestTROPNuclearNormProx.test_factor_matrix_consistent_with_treatment_effects builds a panel with additive unit effects, linear time trend, and iid noise, but then treats effective_rank > 0 as evidence of a non-trivial Eq. 10 low-rank factor term. That is not implied by the paper's setup, where additive FE are already absorbed by alpha_i/beta_t; on this DGP, a near-zero L_hat is methodologically plausible. The linked registry text therefore overstates what this test verifies. Concrete fix: either switch this test to an explicit interactive-factor DGP if nonzero L_hat is the point, or drop the effective_rank > 0 assertion and keep the test limited to shape/finiteness plus finite treatment_effects. tests/test_methodology_trop.py:L623-L660, docs/methodology/REGISTRY.md:L2393-L2393

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The remaining TROP gaps are documented as deferred in the tracker/registry, so I treated them as accepted non-blocking debt rather than review defects. METHODOLOGY_REVIEW.md:L862-L869, docs/methodology/REGISTRY.md:L2394-L2396

Security

No findings.

Documentation/Tests

No additional findings.
Review limitation: I could not execute the new methodology suite here because python -m pytest fails with No module named pytest, and importing diff_diff fails with ModuleNotFoundError: No module named 'numpy'.

…-factor DGP Address single P3 from CI codex R13 on PR #491 (verdict was ✅ Looks good; no P0/P1/P2). P3 - effective_rank > 0 assertion locks solver artifact on no-factor DGP: test_factor_matrix_consistent_with_treatment_effects builds a panel with additive unit + time effects + iid noise only (no interactive factor structure). Per the paper's framework, additive FE are absorbed by alpha_i / beta_t, so a near-zero L_hat is methodologically correct on this DGP. The R6 addition of `assert results.effective_rank > 0` was meant to guard the "non-triviality" claim that R5 codex flagged, but on a no-factor DGP this could lock a solver artifact (e.g., regularisation under- shrinkage) rather than the intended low-rank behavior. Removed the effective_rank > 0 assertion from test_factor_matrix_consistent_with_treatment_effects; replaced with a docstring note explaining why no rank claim is made (test DGP has no interactive factors). Updated REGISTRY Eq. 10 note to drop the "non-triviality" claim and explicitly state the test is a structural pointer only (shape + finiteness + treatment_effects existence), with the rationale why effective_rank is not asserted on this DGP. If a future test wants to assert non-triviality of L_hat as Eq. 10 evidence, it would need an explicit interactive-factor DGP — that test is deferred (the test surface here remains the structural pointer per the original promotion plan). No source-code changes to diff_diff/trop*.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-25T16:52:59Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 974ffcc28d80bf4fdb8084a6afdeaf10769434c4

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings.

Executive Summary

Re-review scope: the prior Eq. 10 concern is addressed. The new structural-pointer test no longer asserts nonzero rank on an additive-FE DGP, and the registry/tracker now explicitly scope that surface to shape/finiteness rather than L_hat magnitude. tests/test_methodology_trop.py:L649-L668, docs/methodology/REGISTRY.md:L2393-L2393, METHODOLOGY_REVIEW.md:L843-L844
The PR is docs/tests-only for TROP; I did not find a new methodology mismatch in the estimator implementation itself.
The new tracker/registry text correctly treats the v2→v3 paper audit, Eq. 14 covariates, SC/SDID reductions, and R parity as deferred rather than silently claiming they are complete. METHODOLOGY_REVIEW.md:L829-L869, docs/methodology/REGISTRY.md:L2245-L2247
One minor issue remains: several new docstrings in tests/test_methodology_trop.py still describe rank/non-triviality checks that the code intentionally no longer performs. tests/test_methodology_trop.py:L17-L24, tests/test_methodology_trop.py:L355-L358, tests/test_methodology_trop.py:L493-L496, tests/test_methodology_trop.py:L554-L555, tests/test_methodology_trop.py:L603-L606
Review limitation: static only. pytest/numpy are unavailable in this environment, so I could not execute the new suite; I only verified the changed test files parse.

Methodology

No findings. The methodology surfaces in this diff are aligned with the current TROP implementation, and the remaining paper mismatches are documented as deviations or deferrals rather than being left implicit.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The remaining TROP limitations in this PR are already tracked in the methodology docs as deferred work, so I treated them as accepted non-blocking debt.

Security

No findings.

Documentation/Tests

Severity P3; Impact: the new methodology test file still contains stale prose saying the Eq. 10 structural-pointer test checks a “non-trivial” factor_matrix, and that test_weighted_nuclear_norm_objective_recovers_att checks a “non-zero effective_rank”, but the code now only checks shape/finiteness (or effective_rank >= 0). That leaves the file internally inconsistent with the corrected registry note and can mislead future reviewers into thinking the old rank-based claim is still locked. Concrete fix: update those docstrings to match the actual assertions, or reintroduce a nonzero-rank assertion only on a DGP where nonzero L_hat is source-grounded. tests/test_methodology_trop.py:L17-L24, tests/test_methodology_trop.py:L355-L358, tests/test_methodology_trop.py:L493-L496, tests/test_methodology_trop.py:L554-L555, tests/test_methodology_trop.py:L603-L606

…ective_rank" docstring claims Address single P3 from CI codex R14 on PR #491 (verdict was ✅ Looks good; no P0/P1/P2). P3 - Stale rank/non-triviality docstrings: after the R13 fix removed the `effective_rank > 0` assertion from test_factor_matrix_consistent_with_treatment_effects, several docstrings still described the test as checking "non-triviality" / "non-zero effective_rank", which the code intentionally no longer does. The inconsistency could mislead future readers into thinking the old rank-based claim is still locked. Updated five locations to match the actual assertions: - Module docstring Eq. 10 line: "non-triviality" → "shape + finiteness ... + treatment_effects populated with finite entries" - TestTROPNuclearNormProx class docstring: similar correction + explicit "does NOT assert non-triviality of L_hat magnitude" - test_factor_matrix_consistent_with_treatment_effects docstring: explicit "**No non-triviality / magnitude claim**" with rationale - test_weighted_nuclear_norm_objective_recovers_att docstring: "non-zero effective_rank" → "non-negative effective_rank" (matches the actual `assert effective_rank >= 0` in the body) with explicit note that the active code path is verified by positive-ATT recovery, not by a non-zero rank claim - The remaining "non-trivial" / "non-triviality" mentions in the file are now either (a) about a different test surface, (b) in a NEGATION clarifying what the test does NOT assert, or (c) on the MC reduction test where the claim IS source-grounded (interactive- factor DGP) No source-code changes to diff_diff/trop*.py. 6 TestTROPNuclearNormProx tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-25T17:06:51Z

🔁 AI review rerun (requested by @igerber)

Head SHA: b1734506f24fc9a4d0a99df0ab6f61d4dd53c56f

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings.

Executive Summary

Re-review scope: the prior stale-prose finding is resolved. The new methodology file now explicitly scopes the Eq. 10 test as a structural pointer and removes the earlier “non-trivial/non-zero rank” overclaim. tests/test_methodology_trop.py:L17-L26, tests/test_methodology_trop.py:L350-L363, tests/test_methodology_trop.py:L601-L627
This PR is docs/tests-only for TROP; there are no estimator changes in diff_diff/trop*.py.
The promotion text correctly limits “Complete” to the paper-aligned method="local" surface and treats method="global" as a documented library-side adaptation, not as paper-validated methodology. METHODOLOGY_REVIEW.md:L826-L869, docs/methodology/REGISTRY.md:L2398-L2425
The important deviations/deferrals are explicitly documented rather than left implicit: weight normalization, unbalanced panels, Eq. 10 non-lock, Eq. 14 covariates, SC/SDID reductions, R parity, and the v2-pinned / v3-pending source audit. METHODOLOGY_REVIEW.md:L832-L869, docs/methodology/REGISTRY.md:L2247-L2396, docs/methodology/papers/athey-2025-review.md:L8-L18
The implementation surfaces I spot-checked still match the promoted claims: Eq. 3 masking/weights, per-cell ATT aggregation, and safe_inference NaN/t-based inference behavior. diff_diff/trop_local.py:L442-L583, diff_diff/trop.py:L666-L814, diff_diff/utils.py:L177-L211
Review limitation: static only. pytest is unavailable here, and this environment also lacks numpy, so I could not execute the new suite.

Methodology

No findings.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The remaining TROP gaps are explicitly documented as deferred work in the registry/tracker, so I treated them as accepted non-blocking debt.

Security

No findings.

Documentation/Tests

No findings. The previous P3 about stale methodology-test prose appears resolved in this revision.

igerber added the ready-for-ci Triggers CI test workflows label May 25, 2026

igerber removed the ready-for-ci Triggers CI test workflows label May 25, 2026

igerber added the ready-for-ci Triggers CI test workflows label May 25, 2026

igerber merged commit 7c5dba1 into main May 25, 2026
31 of 32 checks passed

igerber deleted the feature/trop-methodology-review-tracker-promotion branch May 25, 2026 20:02

Conversation

igerber commented May 25, 2026

Summary

Methodology references

Validation

Security / privacy

Uh oh!

github-actions Bot commented May 25, 2026

Uh oh!

github-actions Bot commented May 25, 2026

Uh oh!

github-actions Bot commented May 25, 2026

Uh oh!

github-actions Bot commented May 25, 2026

Uh oh!

github-actions Bot commented May 25, 2026

Uh oh!

github-actions Bot commented May 25, 2026

Uh oh!

github-actions Bot commented May 25, 2026

Uh oh!

github-actions Bot commented May 25, 2026

Uh oh!

github-actions Bot commented May 25, 2026

Uh oh!

github-actions Bot commented May 25, 2026

Uh oh!

github-actions Bot commented May 25, 2026

Uh oh!

github-actions Bot commented May 25, 2026

Uh oh!

github-actions Bot commented May 25, 2026

Uh oh!

github-actions Bot commented May 25, 2026

Uh oh!

github-actions Bot commented May 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant