Close BR/DR gap #6: target-parameter block in schemas (#347)
Closes BR/DR foundation gap #6 from project_br_dr_foundation.md: BusinessReport and DiagnosticReport now name what the headline scalar actually represents as an estimand, for each of the 16 result classes. Baker et al. (2025) Step 2 ("define the target parameter") was previously in BR's next_steps list but not done by BR itself — this PR closes that gap.

New top-level ``target_parameter`` block (additive schema change; experimental per REPORTING.md stability policy):

    {
        "name": str,                # stakeholder-facing name
        "definition": str,          # plain-English description
        "aggregation": str,         # machine-readable dispatch tag
        "headline_attribute": str,  # which raw result attribute
        "reference": str,           # REGISTRY.md citation pointer
    }

Schema placement: top-level block (user preference, selected via AskUserQuestion in planning).

Aggregation tags include "simple", "event_study", "group", "2x2", "twfe", "iw", "stacked", "ddd", "staggered_ddd", "synthetic", "factor_model", "M", "l", "l_x", "l_fd", "l_x_fd", "dose_overall", "pt_all_combined", "pt_post_single_baseline", "unknown".

Per-estimator dispatch lives in the new ``diff_diff/_reporting_helpers.py::describe_target_parameter`` (own module rather than business_report / diagnostic_report to avoid circular-import risk — plan-review LOW #7). All 17 result classes are covered (16 from _APPLICABILITY + BaconDecompositionResults); exhaustiveness is locked in by TestTargetParameterCoversEveryResultClass.

Fit-time config reads:
- ``EfficientDiDResults.pt_assumption`` branches the aggregation tag between pt_all_combined and pt_post_single_baseline.
- ``StackedDiDResults.clean_control`` varies the definition clause (never_treated / strict / not_yet_treated).
- ``ChaisemartinDHaultfoeuilleResults.L_max`` + ``covariate_residuals`` + ``linear_trends_effects`` branch the dCDH estimand between DID_M / DID_l / DID^X_l / DID^{fd}_l / DID^{X,fd}_l.
Fixed-tag branches (per plan-review CRITICAL #1 and #2):
- ``CallawaySantAnna`` / ``ImputationDiD`` / ``TwoStageDiD`` / ``WooldridgeDiD``: the fit-time ``aggregate`` kwarg does not change the ``overall_att`` scalar — it only populates additional horizon / group tables on the result object. Disambiguating those tables in prose is tracked under gap #9.
- ``ContinuousDiDResults``: the PT-vs-SPT regime is a user-level assumption, not a library setting. Emits a single "dose_overall" tag with a disjunctive definition naming both regime readings (ATT^loc under PT, ATT^glob under SPT).

Prose rendering:
- BR ``_render_summary``: emits "Target parameter: <name>." after the headline sentence (short name only; the full definition lives in the full_report and schema).
- BR ``_render_full_report``: "## Target Parameter" section between "## Headline" and "## Identifying Assumption".
- DR ``_render_overall_interpretation``: mirror sentence.
- DR ``_render_dr_full_report``: "## Target Parameter" section with name, definition, aggregation tag, headline attribute, and reference.

Cross-surface parity: both BR and DR consume the same helper (the single source of truth), so their ``target_parameter`` blocks are byte-identical (verified by TestTargetParameterCrossSurfaceParity).

Tests: 37 new (TestTargetParameterPerEstimator + TestTargetParameterFitConfigReads + TestTargetParameterCoversEveryResultClass + TestTargetParameterCrossSurfaceParity + TestTargetParameterProseRendering). Existing BR/DR top-level-key contract tests updated to include ``target_parameter``. Total 319 tests pass (282 prior + 37 new).

Docs: REPORTING.md gains a "Target parameter" section documenting the per-estimator dispatch and schema shape. business_report.rst and diagnostic_report.rst note the new field with a pointer to REPORTING.md. CHANGELOG entry under Unreleased.
Out of scope: REGISTRY.md per-estimator "Target parameter" sub-sections (plan-review additional note); the reporting-layer doc in REPORTING.md is the current source of truth. A follow-up docs PR can land those sub-sections if maintainers want the registry to own the canonical wording directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
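The schema block and name-based dispatch described above can be sketched as follows. This is a stand-alone illustration: the single table entry, its wording, and the fallback behavior are stand-ins inferred from the PR description, not the library's actual dispatch table.

```python
# Illustrative sketch of the target_parameter block and dispatch described
# in this PR. The real helper lives in diff_diff/_reporting_helpers.py and
# covers 17 result classes; the single entry below is a stand-in.
_TARGET_PARAMETERS = {
    "CallawaySantAnnaResults": {
        "name": "Overall group-time ATT",
        "definition": "Weighted average of ATT(g, t) over treated groups.",
        "aggregation": "group",
        "headline_attribute": "overall_att",
        "reference": "REGISTRY.md Sec. CallawaySantAnna",
    },
}

def describe_target_parameter(results):
    """Return the five-key target_parameter block for a result object,
    dispatching on the result class name."""
    block = _TARGET_PARAMETERS.get(type(results).__name__)
    if block is None:
        # Unregistered class: emit the documented "unknown" fallback tag.
        return {
            "name": "Unknown",
            "definition": "No estimand registered for this result class.",
            "aggregation": "unknown",
            "headline_attribute": None,
            "reference": "",
        }
    return dict(block)  # copy so callers cannot mutate the table

class CallawaySantAnnaResults:
    """Stub standing in for the real result class."""

block = describe_target_parameter(CallawaySantAnnaResults())
```

Because both BR and DR would call the same helper, cross-surface parity of the emitted block follows by construction.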
Overall Assessment
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
Local execution was not possible in this review environment because the Python environment is missing …
…x dCDH headline_attribute

R1 surfaced three P1s, all legitimate:

1. StackedDiD wording mismatch. Claimed ``overall_att`` is a treated-share-weighted aggregate across sub-experiments; the actual implementation (``stacked_did.py`` ~line 541) computes ``overall_att`` as the simple average of post-treatment event-study coefficients ``delta_h`` with a delta-method SE. Per-horizon ``delta_h`` is the paper's ``theta_kappa^e`` cross-event aggregate, but the headline is an equally-weighted average over those per-horizon coefficients, not a separate cross-event weighting at the ATT level. Definition rewritten to describe the actual estimand.

2. Dead ``TwoWayFixedEffectsResults`` branch. ``TwoWayFixedEffects`` is a subclass of ``DifferenceInDifferences`` and its ``fit()`` returns ``DiDResults`` — there is no separate TWFE result class, so the ``type(results).__name__ == "TwoWayFixedEffectsResults"`` dispatch branch was unreachable on any real fit. Removed the dead branch and rewrote the ``DiDResults`` branch to cover both 2x2 DiD and TWFE interpretations explicitly (both estimators route here). Follow-up for a future PR: persist estimator provenance on ``DiDResults`` (or return a dedicated TWFE result class) so the branch can split again; documented inline.

3. dCDH ``headline_attribute="att"``. Both dCDH branches (``DID_M`` for ``L_max=None``, ``DID_l``/derivatives for ``L_max >= 1``) named ``"att"`` as the headline attribute, but ``ChaisemartinDHaultfoeuilleResults`` stores the headline in ``overall_att`` (``chaisemartin_dhaultfoeuille_results.py:357``). Fixed both branches to ``"overall_att"``; downstream consumers using the machine-readable contract now point at the correct attribute.

Tests: the new ``TestTargetParameterRealFitIntegration`` covers the gap R1 P2 flagged — prior coverage was stub-based and would not have caught any of the three P1s.
Four new real-fit tests:
- ``TwoWayFixedEffects().fit(...)`` returns ``DiDResults``; the target-parameter block uses the shared DiD/TWFE branch.
- ``StackedDiD(...).fit(...)`` on a staggered panel; the ``headline_attribute`` matches the actual attribute and the definition names the event-study-coefficient estimand.
- ``ChaisemartinDHaultfoeuille().fit(...)`` on a reversible-treatment panel (both ``DID_M`` and ``DID_l`` regimes); ``headline_attribute == "overall_att"`` and the named attribute actually exists on the real fit object.

Existing stub-based dispatch tests updated: the ``test_twfe_results`` test is now ``test_did_results_mentions_twfe`` (asserts the DiD branch describes both estimators). The dCDH stub tests now also assert ``headline_attribute == "overall_att"``.

All 323 BR/DR tests pass (319 prior + 4 new real-fit integration).

Out of scope (plan-review MEDIUM #2 — centralizing report metadata in a single registry shared by estimator outputs and reporting helpers): queued as a separate PR. The current approach (string dispatch on ``type(results).__name__`` + REGISTRY.md references) works but is brittle; a centralized registry is the principled fix for the TWFE-dispatch-dead-code class of bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

🔁 AI review rerun (requested by @igerber)
Code Quality / Performance / Security: no findings. Maintainability: no findings beyond the methodology drift noted above. Tech Debt: no deferrable findings. Static review only: this environment lacks …
R2 surfaced one P1 methodology finding: the dCDH dynamic branch flattened every ``L_max >= 1`` into a generic ``DID_l`` estimand, but the library's actual ``overall_att`` contract is:
- ``L_max = None`` -> ``DID_M`` (Phase 1 per-period aggregate).
- ``L_max = 1`` -> ``DID_1`` (single-horizon per-group estimand, Equation 3 of the dynamic companion paper — NOT the generic ``DID_l``).
- ``L_max >= 2`` (no ``trends_linear``) -> ``delta`` (cost-benefit cross-horizon aggregate, Lemma 4; ``chaisemartin_dhaultfoeuille.py:2602-2634``).
- ``trends_linear = True`` AND ``L_max >= 2`` -> ``overall_att`` is intentionally NaN by design (``chaisemartin_dhaultfoeuille.py:2828-2834``). No scalar aggregate; per-horizon level effects live on ``linear_trends_effects[l]``.

Fix: ``describe_target_parameter()`` now mirrors the result class's own ``_estimand_label()`` at ``chaisemartin_dhaultfoeuille_results.py:454-490``. New aggregation tags: ``DID_1`` / ``DID_1_x`` / ``DID_1_fd`` / ``DID_1_x_fd`` for single-horizon, ``delta`` / ``delta_x`` for cost-benefit, and ``no_scalar_headline`` for the trends+L_max>=2 suppression case. On the no-scalar case, ``headline_attribute`` is ``None`` so downstream consumers do not point at a field whose value is NaN by design.

Tests: added stub-based branches for every new case (``DID_1``, ``DID_1^X``, ``delta``, ``delta^X``, trends + L_max>=2 no-scalar, trends + L_max=1 still-has-scalar) and split the real-fit integration test into ``L_max=1`` and ``L_max=2`` real-panel cases so the contract is enforced end-to-end per R2 P2. The parameterized ``test_dcdh_config_branches_tag`` now covers 10 cases and also asserts that ``headline_attribute`` flips to ``None`` only on the no-scalar case.

Docs: the ``REPORTING.md`` dCDH section was rewritten to match the corrected dispatch, including the ``no_scalar_headline`` case and the L_max=None/1/>=2 contract.

332 BR/DR tests pass.
Out of scope (still open from R1): centralizing report metadata in a single registry shared by estimator outputs and reporting helpers (plan-review MEDIUM #2 / R1 P2 maintainability). The current string dispatch on ``type(results).__name__`` + explicit REGISTRY.md citations is source-faithful but requires manual mirroring of result-class contracts; a centralized registry is the principled fix. Tracked for a follow-up PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
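The corrected ``overall_att`` contract above condenses to a small dispatch sketch. The function name is illustrative and the covariate branches (``DID_1_x``, ``delta_x``, etc.) are omitted for brevity:

```python
def dcdh_aggregation_tag(l_max, trends_linear):
    """Map a dCDH fit configuration to its aggregation tag, following the
    overall_att contract listed above (covariate branches omitted)."""
    if l_max is None:
        return "DID_M"               # Phase 1 per-period aggregate
    if trends_linear and l_max >= 2:
        return "no_scalar_headline"  # overall_att is NaN by design
    if l_max == 1:
        return "DID_1"               # single-horizon per-group estimand
    return "delta"                   # cost-benefit cross-horizon aggregate
```

Note the ordering: the no-scalar check precedes the ``L_max`` branches, so ``trends_linear=True`` with ``L_max=1`` still yields a scalar ``DID_1`` tag, matching the "trends + L_max=1 still-has-scalar" test case.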
/ai-review

🔁 AI review rerun (requested by @igerber): ✅ Looks good
…fit no-scalar test

R3 approved (✅) with two non-blocking follow-ups; this commit addresses both.

P2 (docs): ``REPORTING.md`` and ``business_report.rst`` still listed the obsolete dCDH aggregation tags (``"DID_l"``, ``"l"``, ``"l_x"``, ``"l_fd"``, ``"l_x_fd"``) and documented ``headline_attribute`` as always a string, even though R2 replaced those with ``"DID_1"`` / ``"DID_1_x"`` / ``"DID_1_fd"`` / ``"DID_1_x_fd"`` / ``"delta"`` / ``"delta_x"`` / ``"no_scalar_headline"`` and introduced the ``headline_attribute=None`` no-scalar case. Consumers wiring dispatch logic off the docs would have pointed at tags the helper no longer emits. Rewrote the ``aggregation`` enum in REPORTING.md as a full per-estimator dispatch list, and updated the ``headline_attribute`` description to name the ``None`` case explicitly. The ``business_report.rst`` summary replaces ``DID_l`` with ``DID_1`` / cost-benefit delta and adds a pointer to the no-scalar case.

P3 (tests): added ``test_dcdh_trends_linear_with_l_max_geq_2_fit_real`` — a real-fit regression that exercises the ``ChaisemartinDHaultfoeuille(..., L_max=2, trends_linear=True)`` path end-to-end. Asserts that (a) ``fit.overall_att`` is NaN by design (matching ``chaisemartin_dhaultfoeuille.py:2828-2834``), (b) ``linear_trends_effects`` is populated, (c) the target-parameter block emits ``aggregation="no_scalar_headline"`` and ``headline_attribute is None``, and (d) the definition references ``linear_trends_effects``. Previously this branch was only stub-tested; now the reporting-layer integration is pinned by a live dCDH fit.

333 BR/DR tests pass. Black and ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

🔁 AI review rerun (requested by @igerber): the previous re-review's non-blocking doc/test items look addressed, but I found one new unmitigated …
Code Quality / Performance / Maintainability / Security: no findings. Tech Debt: no additional findings; neither issue above is tracked in …
…ridge method-aware
R4 surfaced one P1 + one P2, both addressed.
P1 (methodology): the dCDH no-scalar branch was documented in the
schema but not plumbed through BR/DR rendering. When
``aggregation="no_scalar_headline"`` and ``headline_attribute=None``
(``trends_linear=True`` + ``L_max>=2``), BR/DR still extracted
``overall_att`` (NaN by design) and narrated it via the estimation-
failure path, producing internally inconsistent output — the
``target_parameter`` block said "no scalar aggregate; consult
linear_trends_effects" while the headline prose told users to
inspect rank deficiency.
Fix (both surfaces):
- BR ``_build_schema``: compute ``target_parameter`` BEFORE
``_extract_headline``; if the aggregation tag is
``no_scalar_headline``, route through a dedicated headline block
with ``status="no_scalar_by_design"`` / ``effect=None`` /
``sign="none"`` and an explicit ``reason`` field naming the
``linear_trends_effects`` alternative.
- BR ``_render_headline_sentence``: detect
``status == "no_scalar_by_design"`` and emit explicit "does not
produce a scalar aggregate effect ... by design" prose instead
of the non-finite / estimation-failure sentence.
- BR ``_build_caveats``: the existing ``sign == "undefined"``
estimation-failure caveat does not fire because we emit
``sign == "none"`` (not ``"undefined"``) on the no-scalar case.
- DR ``_execute``: analogous headline-metric short-circuit with
``status="no_scalar_by_design"`` on detection of the
no_scalar_headline tag.
- DR ``_render_overall_interpretation``: explicit no-scalar
sentence takes precedence over the non-finite estimation-failure
branch.
P2 (Wooldridge method awareness): the Wooldridge branch previously
labeled every fit as ASF-based, but REGISTRY.md Sec. WooldridgeDiD
splits OLS ETWFE (observation-count-weighted average of ATT(g,t)
from a saturated regression) from the nonlinear (logit / Poisson)
ASF path. Branch on ``results.method`` ("ols" -> coefficient-
aggregation wording; other -> ASF wording).
Tests: added 4 end-to-end regressions.
- ``test_dcdh_trends_linear_no_scalar_propagates_through_br``:
real dCDH fit with ``trends_linear=True`` + ``L_max=2``; asserts
BR schema emits ``status="no_scalar_by_design"``, summary prose
contains "no scalar" / "does not produce a scalar", does NOT
contain "rank deficiency" / "estimation failed", and caveats do
NOT include ``estimation_failure``.
- ``test_dcdh_trends_linear_no_scalar_propagates_through_dr``:
mirror on the DR side (``headline_metric`` status and
``overall_interpretation`` prose).
- ``test_wooldridge_ols``: asserts the OLS branch names
ATT(g,t) aggregation and does NOT include "ASF" in the name.
- ``test_wooldridge_nonlinear``: asserts logit/poisson routes
through the ASF branch.
336 BR/DR tests pass. Black and ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
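The ordering fix described above (compute ``target_parameter`` before extracting the headline, then short-circuit on the no-scalar tag) can be illustrated stand-alone. Field names follow the PR prose; the "ok" branch is a simplification, not the library's real headline builder:

```python
import math

def build_headline(target_parameter, overall_att):
    """Sketch of the BR routing above: the target_parameter block is
    computed first, and the no-scalar tag short-circuits headline
    extraction before overall_att (NaN by design) is ever narrated."""
    if target_parameter["aggregation"] == "no_scalar_headline":
        return {
            "status": "no_scalar_by_design",
            "effect": None,
            # "none" rather than "undefined", so the existing
            # estimation-failure caveat does not fire.
            "sign": "none",
            "reason": "No scalar aggregate by design; "
                      "consult linear_trends_effects.",
        }
    return {"status": "ok", "effect": overall_att,
            "sign": "positive" if overall_att > 0 else "non-positive"}

headline = build_headline({"aggregation": "no_scalar_headline"}, math.nan)
```

The renderer can then branch on ``status`` alone, so the prose never drifts into the non-finite / estimation-failure wording on this path.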
/ai-review

🔁 AI review rerun (requested by @igerber): ✅ Looks good
…e docstring + full_report test

R5 approved (✅) with two small follow-throughs from R4.

P2: DR ``_render_dr_full_report`` still formatted ``value`` / ``se`` / ``p_value`` straight into the top ``**Headline**:`` line on the no-scalar-by-design dCDH branch. With those fields all ``None`` (by design), the markdown rendered as ``**Headline**: ... = None (SE None, p = None)`` even though the "## Target Parameter" section below correctly explained the suppression. Added a ``status == "no_scalar_by_design"`` branch that emits ``**Headline**: no scalar aggregate by design.`` plus the headline's ``reason`` field.

P3 (stale docstring): the ``_reporting_helpers.py`` top-level docstring still described Wooldridge ``overall_att`` as always ASF-based. R4 split that into OLS vs nonlinear dispatch; updated the docstring bullet to match.

P3 (test gap): the dCDH no-scalar real-fit regression asserted ``run_all()`` + ``interpretation`` but not ``full_report()``. That gap is exactly how the R5 P2 malformed headline could slip through unpinned. Extended the test to assert that the ``**Headline**: no scalar aggregate by design`` line appears in the markdown and the raw ``= None (SE None, p = None)`` pattern does NOT appear.

336 BR/DR tests pass. Black clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

🔁 AI review rerun (requested by @igerber): ✅ Looks good
…full.txt schema

Two P3 cleanups from R6.

P3 #1: the StackedDiD ``target_parameter.definition`` embedded an internal implementation line reference (``stacked_did.py`` around line 541). That pointer is not methodology source material and will go stale under routine estimator edits even when the estimand itself is unchanged. Removed the reference; the definition now stands on paper/registry terms alone.

P3 #2: ``diff_diff/guides/llms-full.txt`` listed the pre-PR BR/DR schema top-level keys and omitted ``target_parameter``, so agent-facing documentation disagreed with the runtime schema. Added ``target_parameter`` to both schema-key lists (BR around line 1779 and DR around line 1844). Documented the field shape (``name`` / ``definition`` / ``aggregation`` / ``headline_attribute`` / ``reference``), the dispatch tag set, and the ``headline_attribute=None`` / ``aggregation="no_scalar_headline"`` edge case for the dCDH ``trends_linear=True, L_max>=2`` fit. Also noted the ``headline.status="no_scalar_by_design"`` value so guide-driven agents can dispatch correctly. UTF-8 fingerprint preserved per ``feedback_llms_guide_utf8_fingerprint.md`` (``tests/test_guides.py`` passes).

354 BR/DR + guide tests pass (337 BR/DR + 17 guide). Black clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Formatting-only follow-up to the R6 edit — the previous commit landed the StackedDiD-line-reference cleanup before black could reflow the affected block.
/ai-review

🔁 AI review rerun (requested by @igerber)
Code Quality / Performance: no findings. Tech Debt / Security: no findings. Documentation/Tests: no additional PR-specific findings. Residual risk: runtime validation was not possible here because …
…y vs ES_avg note

Two P1 findings from R7, both addressed.

P1 #1 (schema version bump): the new ``headline.status`` / ``headline_metric.status`` value ``"no_scalar_by_design"`` added in R4 for the dCDH ``trends_linear=True, L_max>=2`` configuration is a breaking change per the REPORTING.md stability policy (new status-enum values are breaking — agents doing an exhaustive match will break on unknown enums). Bumped ``BUSINESS_REPORT_SCHEMA_VERSION`` and ``DIAGNOSTIC_REPORT_SCHEMA_VERSION`` from ``"1.0"`` to ``"2.0"``, updated the in-tree schema-version tests (one explicit ``== "1.0"`` assertion and six ``"schema_version": "1.0"`` stub dicts in the BR / DR test files), added a REPORTING.md "Schema version 2.0" note, and documented the bump in the CHANGELOG Unreleased entry. The schemas remain marked experimental, so the formal deprecation policy does not yet apply.

P1 #2 (EfficientDiD library vs paper estimand): both EfficientDiD branches now explicitly state that BR/DR's headline ``overall_att`` is the library's cohort-size-weighted average over post-treatment ``(g, t)`` cells, NOT the paper's ``ES_avg`` uniform event-time average. The regime (PT-All / PT-Post) describes identification; the aggregation choice is a separate library-level policy documented in REGISTRY.md Sec. EfficientDiD. Added ``cohort-size-weighted`` + ``ES_avg`` / ``post-treatment`` assertions to ``test_efficient_did_pt_all`` and ``test_efficient_did_pt_post`` so the wording is pinned.

354 BR/DR + guide + target-parameter tests pass. Black and ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
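The breaking-enum concern above is about consumers doing exhaustive matches on ``headline.status``. A hypothetical consumer-side pattern (not part of this PR or the library) that tolerates status values introduced by future schema bumps might look like:

```python
def narrate_headline(headline):
    """Hypothetical consumer-side dispatch on headline.status that
    degrades gracefully when a newer schema version introduces a
    status value this consumer does not know."""
    status = headline.get("status")
    if status == "ok":
        return "Effect: {:.3f}".format(headline["effect"])
    if status == "no_scalar_by_design":
        return "No scalar aggregate by design; see target_parameter."
    # Unknown enum value, e.g. added in a future schema bump.
    return "Unrecognized headline status {!r}; not narrating a scalar.".format(status)
```

A consumer written this way survives an additive enum change; one matching exhaustively does not, which is why the policy treats new enum values as breaking.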
/ai-review

🔁 AI review rerun (requested by @igerber)
…n tests

Both P3 cleanups from R8.

P3 #1 (TROP wording in rst): the ``business_report.rst`` summary listed TROP's target parameter as "factor-model residual", which matches neither the helper nor the REGISTRY definition. Both say the TROP target parameter is a factor-model-adjusted weighted average over treated cells (not a residual). Fixed the rst wording to "factor-model-adjusted ATT".

P3 #2 (Bacon branch untested): the exhaustiveness guard iterates ``_APPLICABILITY``, but ``BaconDecompositionResults`` is a diagnostic read-out on the DR side and is NOT listed in ``_APPLICABILITY`` (BR rejects it with a TypeError). The helper branch for Bacon therefore slipped through the 16-class guard. Added two regressions:
- ``test_bacon_decomposition`` (unit-level, direct helper call): asserts aggregation / headline_attribute / definition wording / the Goodman-Bacon reference.
- ``test_dr_with_bacon_result_emits_target_parameter`` (integration): passes a real ``BaconDecompositionResults`` from ``bacon_decompose`` on a staggered panel through DR, asserts the ``target_parameter`` block propagates into DR's schema, and confirms the named ``headline_attribute`` (``twfe_estimate``) exists on the real fit object.

356 BR/DR + guide + target-parameter tests pass. Black and ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

🔁 AI review rerun (requested by @igerber)
…rface no-scalar dispatch

R9 surfaced a real P1 edge case: the helper inferred ``trends_linear=True`` from ``linear_trends_effects is not None``, but the estimator can set ``linear_trends_effects = None`` when the cumulated-horizon dict is empty (no estimable horizons) while still unconditionally NaN-ing ``overall_att`` under ``trends_linear=True`` + ``L_max >= 2`` (``chaisemartin_dhaultfoeuille.py:2828-2834``). The inference missed that case — an empty-horizon fit would fall through to the ``delta`` branch, BR/DR would extract ``overall_att`` (NaN), and the headline would be narrated as an estimation failure instead of "no scalar by design."

Fix:
- Persisted the fit-time ``trends_linear`` flag explicitly on ``ChaisemartinDHaultfoeuilleResults`` as a new ``Optional[bool]`` field (with a docstring note).
- The dCDH estimator now threads ``_is_trends_linear`` into the result constructor at ``chaisemartin_dhaultfoeuille.py:3139``.
- ``describe_target_parameter()`` reads the persisted flag first and only falls back to the ``linear_trends_effects is not None`` inference when the flag is absent (older cached fits predating the persisted field).

BR/DR no-scalar routing is unchanged — both surfaces dispatch on ``target_parameter["aggregation"] == "no_scalar_headline"``, which now fires correctly for empty-surface fits too.

Tests:
- ``test_dcdh_trends_linear_with_l_max_geq_2_emits_no_scalar_headline`` updated to pass ``trends_linear=True`` explicitly (primary contract).
- ``test_dcdh_trends_linear_empty_surface_still_no_scalar``: new R9 regression. ``linear_trends_effects=None`` + ``trends_linear=True`` (the empty-surface case) routes to the no-scalar branch.
- ``test_dcdh_legacy_fit_without_persisted_flag_still_routes_correctly``: preserves backwards compatibility for cached fits that predate the persisted flag — the legacy ``linear_trends_effects is not None`` fallback still fires.

368 tests pass (BR/DR + guide + target-parameter + dCDH methodology). Black and ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

🔁 AI review rerun (requested by @igerber)
…DH reporting consumers
R10 found my R9 fix was partial. The persisted ``trends_linear``
flag was only read by ``describe_target_parameter``. Three other
dCDH reporting paths still inferred trends-linear from
``linear_trends_effects is not None`` and silently mis-labeled
empty-surface fits as ``delta`` or omitted the linear-trends
identification clause:
1. ``ChaisemartinDHaultfoeuilleResults._horizon_label`` and
``_estimand_label`` (also reached via
``to_dataframe("overall")``) — per-horizon labels and overall
estimand label.
2. ``ChaisemartinDHaultfoeuilleResults.summary`` — the
covariate/trend-adjusted tag in the overall-results summary.
3. ``BusinessReport._describe_assumption`` dCDH branch — the
identifying-assumption prose that names ``DID^{fd}_l`` vs
``DID_l``.
Fix: added a ``_has_trends_linear()`` helper on
``ChaisemartinDHaultfoeuilleResults`` that reads the persisted
flag first and falls back to the legacy inference, and rewired
all three result-class callsites to use it. BR's
``_describe_assumption`` branch gained a matching persisted-first-
then-inference lookup via ``getattr(results, "trends_linear", None)``.
Tests: new ``test_dcdh_empty_surface_propagates_to_assumption_and_native_label``
stubs a ``ChaisemartinDHaultfoeuilleResults`` with
``trends_linear=True``, ``L_max=2``, and
``linear_trends_effects=None`` (the exact R9/R10 edge case), then
asserts:
- ``stub._estimand_label()`` returns ``DID^{fd}_l`` /
``DID^{X,fd}_l``, NOT ``delta``.
- ``stub.to_dataframe("overall")`` does not label the overall row
as ``delta``.
- BR's ``_describe_assumption`` description includes the linear-
trends / first-differenced identification clause.
527 tests pass across BR/DR + guide + target-parameter + dCDH
methodology + dCDH unit suites. Black and ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

🔁 AI review rerun (requested by @igerber): ✅ Looks good
R11 was ✅ with one P3: REGISTRY anchors OLS ETWFE in Wooldridge (2025) and reserves Wooldridge (2023) for the nonlinear ASF extension. The target-parameter helper's OLS branch was citing 2023 in both the definition prose and the ``reference`` field. Updated both to ``Wooldridge (2025)``. The nonlinear branch's joint ``Wooldridge (2023, 2025)`` reference is unchanged.

334 BR/DR + target-parameter tests pass. Black clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

🔁 AI review rerun (requested by @igerber)
…gation tag
R12 identified two issues, both addressed.
P1 (empty-surface dead-end guidance): on
``trends_linear=True, L_max>=2, linear_trends_effects=None`` (no
horizons survived), the PR's no-scalar prose still told users to
"see ``linear_trends_effects``" even though the dict is empty,
and ``to_dataframe("linear_trends")`` raised the wrong
remediation ("Pass ``trends_linear=True`` to fit()") — which the
user already did. Fixed by distinguishing the populated-surface
case from the empty-surface subcase in three places:
- ``describe_target_parameter`` (dCDH no-scalar branch): the
``definition`` on empty surfaces now names the empty state
explicitly ("no cumulated level effects survived estimation")
and points at re-fit remediation, rather than pointing at the
nonexistent horizon dict.
- ``BusinessReport._build_schema`` (no-scalar headline): the
``reason`` field branches on
``getattr(self._results, "linear_trends_effects", None) is None``
and emits the empty-state message accordingly.
- ``DiagnosticReport._execute`` (no-scalar headline): mirror
branching for the DR ``headline_metric`` reason + name.
- ``ChaisemartinDHaultfoeuilleResults.to_dataframe("linear_trends")``
now returns an empty DataFrame with the expected columns when
``trends_linear=True`` is already active but no horizons
survived. The "Pass ``trends_linear=True`` to fit()" error
fires only when the user actually did not request it.
P2 (ambiguous aggregation tag): both ``DifferenceInDifferences``
and ``TwoWayFixedEffects`` return ``DiDResults``, so the old
``"2x2"`` aggregation tag was not faithful for TWFE fits that can
be weighted averages with forbidden later-vs-earlier weights.
Renamed to ``"did_or_twfe"`` — a neutral tag that signals the
ambiguity until estimator provenance is persisted. Downstream
agents dispatching on the tag now know not to treat TWFE fits as
clean 2x2. REPORTING.md updated to enumerate the new tag.
Tests: 3 new regressions pin the empty-surface contract
(target-parameter definition names the empty state, BR/DR
headline reasons avoid "see linear_trends_effects", and
``to_dataframe("linear_trends")`` returns an empty frame rather
than raising). Existing ``"2x2"`` assertions updated to
``"did_or_twfe"``.
502 BR/DR + target-parameter + dCDH unit tests pass. Black clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
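The ``to_dataframe("linear_trends")`` remediation fix above distinguishes "never requested" from "requested but empty". A minimal sketch, using a plain dict-of-lists in place of a real DataFrame and an assumed column set:

```python
LINEAR_TRENDS_COLUMNS = ("horizon", "effect", "se")  # assumed column names

def linear_trends_table(trends_linear, linear_trends_effects):
    """Distinguish 'never requested' from 'requested but empty', per the
    R12 remediation fix above."""
    if not trends_linear:
        # Remediation is only correct when the user did NOT request trends.
        raise ValueError("Pass trends_linear=True to fit().")
    rows = linear_trends_effects or {}
    return {
        "columns": LINEAR_TRENDS_COLUMNS,
        "rows": [(h, effect, None) for h, effect in sorted(rows.items())],
    }

empty = linear_trends_table(True, None)  # empty surface: no error, no rows
```

Returning an empty table on the empty-surface subcase avoids telling the user to do something they already did.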
/ai-review

🔁 AI review rerun (requested by @igerber): needs changes; highest unmitigated severity: P1.
…n; branch dCDH native label on empty surface
R13 identified three remaining surfaces that still hardcoded
"see ``linear_trends_effects``" on the empty-surface subcase
(``trends_linear=True, L_max>=2, linear_trends_effects=None``):
1. BR ``_render_headline_sentence`` (headline prose used by
``headline()``, ``summary()``, and ``full_report()``).
2. DR ``_render_overall_interpretation`` (top-level paragraph).
3. dCDH ``ChaisemartinDHaultfoeuilleResults._estimand_label``
(also surfaced via ``to_dataframe("overall")``).
Fix: BR and DR renderers now read the headline-level ``reason``
field (already branched on populated-vs-empty surface in
``_build_schema`` / ``_execute``), so the rendered prose never
drifts from the schema message. ``_estimand_label`` on dCDH
results gains an empty-surface branch that returns
``DID^{fd}_l (no cumulated level effects survived estimation)``
(or ``DID^{X,fd}_l (...)`` when covariates are active) instead
of pointing at an empty dict.
Docs: REPORTING.md and business_report.rst now document the
``headline.reason``-based populated-vs-empty branching for
agents dispatching on ``aggregation="no_scalar_headline"``.
Tests: 4 new regressions pin the rendered-prose contract on
the empty-surface stub: (a) BR ``headline()`` /
``summary()`` / ``full_report()`` emit the empty-surface
remediation wording, NOT "see linear_trends_effects"; (b) DR
``interpretation`` does the same; (c) dCDH ``_estimand_label``
returns the empty-state label rather than pointing at the
empty dict; (d) ``to_dataframe("overall")`` surfaces the
empty-state label.
505 BR/DR + target-parameter + dCDH tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

/ai-review

🔁 AI review rerun (requested by @igerber): highest unmitigated severity: P1.
|
When trends_linear=True + L_max>=2 + linear_trends_effects=None
(empty-surface subcase) AND covariate_residuals is populated, the
reporting-layer prose now emits the covariate-adjusted label
``DID^{X,fd}_l`` rather than the bare ``DID^{fd}_l``. Propagates to:
- ``_reporting_helpers.describe_target_parameter``: empty-surface
definition_text now interpolates the already-control-aware
estimand_label (``DID^{X,fd}_l`` when has_controls, else
``DID^{fd}_l``).
- ``BusinessReport._build_schema``: reads ``covariate_residuals`` to
select the empty-surface label used in ``headline.reason``.
- ``DiagnosticReport._execute``: mirrors BR's control-aware label
selection for ``headline_metric.reason``.
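A minimal sketch of the control-aware selection those three consumers now share, assuming it keys off whether ``covariate_residuals`` is populated (the function name and signature are illustrative):

```python
def empty_surface_label(covariate_residuals):
    """Pick the empty-surface estimand label from fit-time controls."""
    has_controls = bool(covariate_residuals)
    return "DID^{X,fd}_l" if has_controls else "DID^{fd}_l"

# describe_target_parameter, BusinessReport._build_schema and
# DiagnosticReport._execute all interpolate the same label, so no
# consumer can go stale on the covariate-adjusted subcase.
assert empty_surface_label({"age": 0.12}) == "DID^{X,fd}_l"
assert empty_surface_label(None) == "DID^{fd}_l"
```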
Regression test added: ``test_dcdh_empty_surface_with_controls_*``
(three tests covering target_parameter definition, BR/DR reason
fields, and rendered prose surfaces). Asserts every consumer emits
``DID^{X,fd}_l`` on the covariate-adjusted empty-surface subcase
and does NOT emit bare ``DID^{fd}_l`` as a stale fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review R14 addressed: control-aware empty-surface label. The covariate-adjusted empty-surface subcase (trends_linear=True + L_max>=2 + covariate_residuals populated + linear_trends_effects=None) now emits ``DID^{X,fd}_l``.
🔁 AI review rerun (requested by @igerber). Overall assessment: ✅ Looks good.
Summary
Adds a top-level ``target_parameter`` block to both BusinessReport and DiagnosticReport schemas, naming what the headline scalar represents as an estimand for each of the 17 covered result classes (16 estimator classes from ``_APPLICABILITY`` + BaconDecompositionResults). Fields: ``name``, ``definition``, ``aggregation`` (machine-readable dispatch tag), ``headline_attribute`` (raw result attribute), ``reference`` (REGISTRY.md citation pointer). Additive schema change (experimental per REPORTING.md stability policy). Dispatch lives in ``diff_diff/_reporting_helpers.py::describe_target_parameter`` (own module to avoid circular-import risk; both BR and DR import from it). Three branches read fit-time config: EfficientDiD's ``pt_assumption``, StackedDiD's ``clean_control``, and dCDH's ``L_max`` / ``covariate_residuals`` / ``linear_trends_effects``. The rest emit a fixed tag.
Methodology references (required if estimator / math changes)
``_reporting_helpers.py``: Callaway & Sant'Anna (2021); Sun & Abraham (2021); Borusyak-Jaravel-Spiess (2024); Gardner (2022); Wing-Freedman-Hollingsworth (2024); Wooldridge (2023); Chen-Sant'Anna-Xie (2025); Callaway-Goodman-Bacon-Sant'Anna (2024); Ortiz-Villavicencio & Sant'Anna (2025); de Chaisemartin & D'Haultfoeuille (2020, 2024); Arkhangelsky et al. (2021); Goodman-Bacon (2021). All already cited in REGISTRY.md (feedback_verify_claims.md). Two plan-review catches documented in the code:
- ``ImputationDiDResults`` / ``TwoStageDiDResults`` do not persist ``aggregate`` on the result object; ``overall_att`` is always the sample-mean aggregation regardless of fit-time config. Emitted as the fixed ``"simple"`` tag; the definition names this explicitly so the user is not misled.
- ``ContinuousDiDResults`` has no PT-vs-SPT regime attribute; the definition is disjunctive (``ATT^loc`` under PT, ``ATT^glob`` under SPT) so the user can pick the interpretation that matches their assumption.
Validation
- ``tests/test_target_parameter.py`` (new, 37 tests across per-estimator dispatch, fit-config branching, cross-surface parity, exhaustiveness, and prose rendering).
- ``tests/test_business_report.py`` + ``tests/test_diagnostic_report.py`` top-level-key contract tests updated to include ``target_parameter``. Total 319 BR/DR tests pass.
- ``TestTargetParameterCoversEveryResultClass`` iterates ``_APPLICABILITY`` and asserts every result-class name gets a non-default, non-empty target-parameter block. New result classes (e.g., HAD when it lands) will fail this test until an explicit branch is added.
- ``TestTargetParameterCrossSurfaceParity`` asserts BR and DR emit byte-identical target-parameter blocks on the same fit (both consume the same helper).
Security / privacy
Generated with Claude Code