Close BR/DR gap #6: target-parameter block in schemas (#347)
Closes BR/DR foundation gap #6 from project_br_dr_foundation.md: BusinessReport and DiagnosticReport now name what the headline scalar actually represents as an estimand, for each of the 16 result classes. Baker et al. (2025) Step 2 ("define the target parameter") was previously in BR's next_steps list but not done by BR itself — this PR closes that gap.

New top-level ``target_parameter`` block (additive schema change; experimental per REPORTING.md stability policy):

    {
        "name": str,                # stakeholder-facing name
        "definition": str,          # plain-English description
        "aggregation": str,         # machine-readable dispatch tag
        "headline_attribute": str,  # which raw result attribute
        "reference": str,           # REGISTRY.md citation pointer
    }

Schema placement: top-level block (user preference, selected via AskUserQuestion in planning).

Aggregation tags include "simple", "event_study", "group", "2x2", "twfe", "iw", "stacked", "ddd", "staggered_ddd", "synthetic", "factor_model", "M", "l", "l_x", "l_fd", "l_x_fd", "dose_overall", "pt_all_combined", "pt_post_single_baseline", "unknown".

Per-estimator dispatch lives in the new ``diff_diff/_reporting_helpers.py::describe_target_parameter`` (own module rather than business_report / diagnostic_report to avoid circular-import risk — plan-review LOW #7). All 17 result classes are covered (16 from _APPLICABILITY + BaconDecompositionResults); exhaustiveness is locked in by TestTargetParameterCoversEveryResultClass.

Fit-time config reads:
- ``EfficientDiDResults.pt_assumption`` branches the aggregation tag between pt_all_combined and pt_post_single_baseline.
- ``StackedDiDResults.clean_control`` varies the definition clause (never_treated / strict / not_yet_treated).
- ``ChaisemartinDHaultfoeuilleResults.L_max`` + ``covariate_residuals`` + ``linear_trends_effects`` branch the dCDH estimand between DID_M / DID_l / DID^X_l / DID^{fd}_l / DID^{X,fd}_l.
Fixed-tag branches (per plan-review CRITICAL #1 and #2):
- ``CallawaySantAnna`` / ``ImputationDiD`` / ``TwoStageDiD`` / ``WooldridgeDiD``: the fit-time ``aggregate`` kwarg does not change the ``overall_att`` scalar — it only populates additional horizon / group tables on the result object. Disambiguating those tables in prose is tracked under gap #9.
- ``ContinuousDiDResults``: the PT-vs-SPT regime is a user-level assumption, not a library setting. Emits a single "dose_overall" tag with a disjunctive definition naming both regime readings (ATT^loc under PT, ATT^glob under SPT).

Prose rendering:
- BR ``_render_summary``: emits "Target parameter: <name>." after the headline sentence (short name only; the full definition lives in the full_report and schema).
- BR ``_render_full_report``: "## Target Parameter" section between "## Headline" and "## Identifying Assumption".
- DR ``_render_overall_interpretation``: mirror sentence.
- DR ``_render_dr_full_report``: "## Target Parameter" section with name, definition, aggregation tag, headline attribute, and reference.

Cross-surface parity: both BR and DR consume the same helper (the single source of truth), so their ``target_parameter`` blocks are byte-identical (verified by TestTargetParameterCrossSurfaceParity).

Tests: 37 new (TestTargetParameterPerEstimator + TestTargetParameterFitConfigReads + TestTargetParameterCoversEveryResultClass + TestTargetParameterCrossSurfaceParity + TestTargetParameterProseRendering). Existing BR/DR top-level-key contract tests updated to include ``target_parameter``. Total 319 tests pass (282 prior + 37 new).

Docs: REPORTING.md gains a "Target parameter" section documenting the per-estimator dispatch and schema shape. business_report.rst and diagnostic_report.rst note the new field with a pointer to REPORTING.md. CHANGELOG entry under Unreleased.
Out of scope: REGISTRY.md per-estimator "Target parameter" sub-sections (plan-review additional note); the reporting-layer doc in REPORTING.md is the current source of truth. A follow-up docs PR can land those sub-sections if maintainers want the registry to own the canonical wording directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
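The schema block and name-based dispatch described above can be sketched as follows. This is a stand-alone illustration: the single table entry, its wording, and the fallback behavior are stand-ins inferred from the PR description, not the library's actual dispatch table.

```python
# Illustrative sketch of the target_parameter block and dispatch described
# in this PR. The real helper lives in diff_diff/_reporting_helpers.py and
# covers 17 result classes; the single entry below is a stand-in.
_TARGET_PARAMETERS = {
    "CallawaySantAnnaResults": {
        "name": "Overall group-time ATT",
        "definition": "Weighted average of ATT(g, t) over treated groups.",
        "aggregation": "group",
        "headline_attribute": "overall_att",
        "reference": "REGISTRY.md Sec. CallawaySantAnna",
    },
}

def describe_target_parameter(results):
    """Return the five-key target_parameter block for a result object,
    dispatching on the result class name."""
    block = _TARGET_PARAMETERS.get(type(results).__name__)
    if block is None:
        # Unregistered class: emit the documented "unknown" fallback tag.
        return {
            "name": "Unknown",
            "definition": "No estimand registered for this result class.",
            "aggregation": "unknown",
            "headline_attribute": None,
            "reference": "",
        }
    return dict(block)  # copy so callers cannot mutate the table

class CallawaySantAnnaResults:
    """Stub standing in for the real result class."""

block = describe_target_parameter(CallawaySantAnnaResults())
```

Because both BR and DR would call the same helper, cross-surface parity of the emitted block follows by construction.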
Overall Assessment
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
Local execution was not possible in this review environment because the Python environment is missing …
…x dCDH headline_attribute

R1 surfaced three P1s, all legitimate:

1. StackedDiD wording mismatch. Claimed ``overall_att`` is a treated-share-weighted aggregate across sub-experiments; the actual implementation (``stacked_did.py`` ~line 541) computes ``overall_att`` as the simple average of post-treatment event-study coefficients ``delta_h`` with a delta-method SE. Per-horizon ``delta_h`` is the paper's ``theta_kappa^e`` cross-event aggregate, but the headline is an equally-weighted average over those per-horizon coefficients, not a separate cross-event weighting at the ATT level. Definition rewritten to describe the actual estimand.

2. Dead ``TwoWayFixedEffectsResults`` branch. ``TwoWayFixedEffects`` is a subclass of ``DifferenceInDifferences`` and its ``fit()`` returns ``DiDResults`` — there is no separate TWFE result class, so the ``type(results).__name__ == "TwoWayFixedEffectsResults"`` dispatch branch was unreachable on any real fit. Removed the dead branch and rewrote the ``DiDResults`` branch to cover both 2x2 DiD and TWFE interpretations explicitly (both estimators route here). Follow-up for a future PR: persist estimator provenance on ``DiDResults`` (or return a dedicated TWFE result class) so the branch can split again; documented inline.

3. dCDH ``headline_attribute="att"``. Both dCDH branches (``DID_M`` for ``L_max=None``, ``DID_l``/derivatives for ``L_max >= 1``) named ``"att"`` as the headline attribute, but ``ChaisemartinDHaultfoeuilleResults`` stores the headline in ``overall_att`` (``chaisemartin_dhaultfoeuille_results.py:357``). Fixed both branches to ``"overall_att"``; downstream consumers using the machine-readable contract now point at the correct attribute.

Tests: the new ``TestTargetParameterRealFitIntegration`` covers the gap R1 P2 flagged — prior coverage was stub-based and would not have caught any of the three P1s.
Four new real-fit tests:
- ``TwoWayFixedEffects().fit(...)`` returns ``DiDResults``; the target-parameter block uses the shared DiD/TWFE branch.
- ``StackedDiD(...).fit(...)`` on a staggered panel; the ``headline_attribute`` matches the actual attribute and the definition names the event-study-coefficient estimand.
- ``ChaisemartinDHaultfoeuille().fit(...)`` on a reversible-treatment panel (both ``DID_M`` and ``DID_l`` regimes); ``headline_attribute == "overall_att"`` and the named attribute actually exists on the real fit object.

Existing stub-based dispatch tests updated: the ``test_twfe_results`` test is now ``test_did_results_mentions_twfe`` (asserts the DiD branch describes both estimators). The dCDH stub tests now also assert ``headline_attribute == "overall_att"``.

All 323 BR/DR tests pass (319 prior + 4 new real-fit integration).

Out of scope (plan-review MEDIUM #2 — centralizing report metadata in a single registry shared by estimator outputs and reporting helpers): queued as a separate PR. The current approach (string dispatch on ``type(results).__name__`` + REGISTRY.md references) works but is brittle; a centralized registry is the principled fix for the TWFE-dispatch-dead-code class of bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

🔁 AI review rerun (requested by @igerber)
Code Quality / Performance / Security: no findings. Maintainability: no findings beyond the methodology drift noted above. Tech Debt: no deferrable findings. Static review only: this environment lacks …
R2 surfaced one P1 methodology finding: the dCDH dynamic branch flattened every ``L_max >= 1`` into a generic ``DID_l`` estimand, but the library's actual ``overall_att`` contract is:
- ``L_max = None`` -> ``DID_M`` (Phase 1 per-period aggregate).
- ``L_max = 1`` -> ``DID_1`` (single-horizon per-group estimand, Equation 3 of the dynamic companion paper — NOT the generic ``DID_l``).
- ``L_max >= 2`` (no ``trends_linear``) -> ``delta`` (cost-benefit cross-horizon aggregate, Lemma 4; ``chaisemartin_dhaultfoeuille.py:2602-2634``).
- ``trends_linear = True`` AND ``L_max >= 2`` -> ``overall_att`` is intentionally NaN by design (``chaisemartin_dhaultfoeuille.py:2828-2834``). No scalar aggregate; per-horizon level effects live on ``linear_trends_effects[l]``.

Fix: ``describe_target_parameter()`` now mirrors the result class's own ``_estimand_label()`` at ``chaisemartin_dhaultfoeuille_results.py:454-490``. New aggregation tags: ``DID_1`` / ``DID_1_x`` / ``DID_1_fd`` / ``DID_1_x_fd`` for single-horizon, ``delta`` / ``delta_x`` for cost-benefit, and ``no_scalar_headline`` for the trends+L_max>=2 suppression case. On the no-scalar case, ``headline_attribute`` is ``None`` so downstream consumers do not point at a field whose value is NaN by design.

Tests: added stub-based branches for every new case (``DID_1``, ``DID_1^X``, ``delta``, ``delta^X``, trends + L_max>=2 no-scalar, trends + L_max=1 still-has-scalar) and split the real-fit integration test into ``L_max=1`` and ``L_max=2`` real-panel cases so the contract is enforced end-to-end per R2 P2. The parameterized ``test_dcdh_config_branches_tag`` now covers 10 cases and also asserts that ``headline_attribute`` flips to ``None`` only on the no-scalar case.

Docs: the ``REPORTING.md`` dCDH section was rewritten to match the corrected dispatch, including the ``no_scalar_headline`` case and the L_max=None/1/>=2 contract.

332 BR/DR tests pass.
Out of scope (still open from R1): centralizing report metadata in a single registry shared by estimator outputs and reporting helpers (plan-review MEDIUM #2 / R1 P2 maintainability). The current string dispatch on ``type(results).__name__`` + explicit REGISTRY.md citations is source-faithful but requires manual mirroring of result-class contracts; a centralized registry is the principled fix. Tracked for a follow-up PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
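The corrected ``overall_att`` contract above condenses to a small dispatch sketch. The function name is illustrative and the covariate branches (``DID_1_x``, ``delta_x``, etc.) are omitted for brevity:

```python
def dcdh_aggregation_tag(l_max, trends_linear):
    """Map a dCDH fit configuration to its aggregation tag, following the
    overall_att contract listed above (covariate branches omitted)."""
    if l_max is None:
        return "DID_M"               # Phase 1 per-period aggregate
    if trends_linear and l_max >= 2:
        return "no_scalar_headline"  # overall_att is NaN by design
    if l_max == 1:
        return "DID_1"               # single-horizon per-group estimand
    return "delta"                   # cost-benefit cross-horizon aggregate
```

Note the ordering: the no-scalar check precedes the ``L_max`` branches, so ``trends_linear=True`` with ``L_max=1`` still yields a scalar ``DID_1`` tag, matching the "trends + L_max=1 still-has-scalar" test case.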
/ai-review

🔁 AI review rerun (requested by @igerber): ✅ Looks good
…fit no-scalar test

R3 approved (✅) with two non-blocking follow-ups; this commit addresses both.

P2 (docs): ``REPORTING.md`` and ``business_report.rst`` still listed the obsolete dCDH aggregation tags (``"DID_l"``, ``"l"``, ``"l_x"``, ``"l_fd"``, ``"l_x_fd"``) and documented ``headline_attribute`` as always a string, even though R2 replaced those with ``"DID_1"`` / ``"DID_1_x"`` / ``"DID_1_fd"`` / ``"DID_1_x_fd"`` / ``"delta"`` / ``"delta_x"`` / ``"no_scalar_headline"`` and introduced the ``headline_attribute=None`` no-scalar case. Consumers wiring dispatch logic off the docs would have pointed at tags the helper no longer emits. Rewrote the ``aggregation`` enum in REPORTING.md as a full per-estimator dispatch list, and updated the ``headline_attribute`` description to name the ``None`` case explicitly. The ``business_report.rst`` summary replaces ``DID_l`` with ``DID_1`` / cost-benefit delta and adds a pointer to the no-scalar case.

P3 (tests): added ``test_dcdh_trends_linear_with_l_max_geq_2_fit_real`` — a real-fit regression that exercises the ``ChaisemartinDHaultfoeuille(..., L_max=2, trends_linear=True)`` path end-to-end. Asserts that (a) ``fit.overall_att`` is NaN by design (matching ``chaisemartin_dhaultfoeuille.py:2828-2834``), (b) ``linear_trends_effects`` is populated, (c) the target-parameter block emits ``aggregation="no_scalar_headline"`` and ``headline_attribute is None``, and (d) the definition references ``linear_trends_effects``. Previously this branch was only stub-tested; now the reporting-layer integration is pinned by a live dCDH fit.

333 BR/DR tests pass. Black and ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

🔁 AI review rerun (requested by @igerber): the previous re-review's non-blocking doc/test items look addressed, but I found one new unmitigated …
Code Quality / Performance / Maintainability / Security: no findings. Tech Debt: no additional findings; neither issue above is tracked in …
…ridge method-aware
R4 surfaced one P1 + one P2, both addressed.
P1 (methodology): the dCDH no-scalar branch was documented in the
schema but not plumbed through BR/DR rendering. When
``aggregation="no_scalar_headline"`` and ``headline_attribute=None``
(``trends_linear=True`` + ``L_max>=2``), BR/DR still extracted
``overall_att`` (NaN by design) and narrated it via the estimation-
failure path, producing internally inconsistent output — the
``target_parameter`` block said "no scalar aggregate; consult
linear_trends_effects" while the headline prose told users to
inspect rank deficiency.
Fix (both surfaces):
- BR ``_build_schema``: compute ``target_parameter`` BEFORE
``_extract_headline``; if the aggregation tag is
``no_scalar_headline``, route through a dedicated headline block
with ``status="no_scalar_by_design"`` / ``effect=None`` /
``sign="none"`` and an explicit ``reason`` field naming the
``linear_trends_effects`` alternative.
- BR ``_render_headline_sentence``: detect
``status == "no_scalar_by_design"`` and emit explicit "does not
produce a scalar aggregate effect ... by design" prose instead
of the non-finite / estimation-failure sentence.
- BR ``_build_caveats``: the existing ``sign == "undefined"``
estimation-failure caveat does not fire because we emit
``sign == "none"`` (not ``"undefined"``) on the no-scalar case.
- DR ``_execute``: analogous headline-metric short-circuit with
``status="no_scalar_by_design"`` on detection of the
no_scalar_headline tag.
- DR ``_render_overall_interpretation``: explicit no-scalar
sentence takes precedence over the non-finite estimation-failure
branch.
P2 (Wooldridge method awareness): the Wooldridge branch previously
labeled every fit as ASF-based, but REGISTRY.md Sec. WooldridgeDiD
splits OLS ETWFE (observation-count-weighted average of ATT(g,t)
from a saturated regression) from the nonlinear (logit / Poisson)
ASF path. Branch on ``results.method`` ("ols" -> coefficient-
aggregation wording; other -> ASF wording).
Tests: added 4 end-to-end regressions.
- ``test_dcdh_trends_linear_no_scalar_propagates_through_br``:
real dCDH fit with ``trends_linear=True`` + ``L_max=2``; asserts
BR schema emits ``status="no_scalar_by_design"``, summary prose
contains "no scalar" / "does not produce a scalar", does NOT
contain "rank deficiency" / "estimation failed", and caveats do
NOT include ``estimation_failure``.
- ``test_dcdh_trends_linear_no_scalar_propagates_through_dr``:
mirror on the DR side (``headline_metric`` status and
``overall_interpretation`` prose).
- ``test_wooldridge_ols``: asserts the OLS branch names
ATT(g,t) aggregation and does NOT include "ASF" in the name.
- ``test_wooldridge_nonlinear``: asserts logit/poisson routes
through the ASF branch.
336 BR/DR tests pass. Black and ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
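The ordering fix described above (compute ``target_parameter`` before extracting the headline, then short-circuit on the no-scalar tag) can be illustrated stand-alone. Field names follow the PR prose; the "ok" branch is a simplification, not the library's real headline builder:

```python
import math

def build_headline(target_parameter, overall_att):
    """Sketch of the BR routing above: the target_parameter block is
    computed first, and the no-scalar tag short-circuits headline
    extraction before overall_att (NaN by design) is ever narrated."""
    if target_parameter["aggregation"] == "no_scalar_headline":
        return {
            "status": "no_scalar_by_design",
            "effect": None,
            # "none" rather than "undefined", so the existing
            # estimation-failure caveat does not fire.
            "sign": "none",
            "reason": "No scalar aggregate by design; "
                      "consult linear_trends_effects.",
        }
    return {"status": "ok", "effect": overall_att,
            "sign": "positive" if overall_att > 0 else "non-positive"}

headline = build_headline({"aggregation": "no_scalar_headline"}, math.nan)
```

The renderer can then branch on ``status`` alone, so the prose never drifts into the non-finite / estimation-failure wording on this path.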
/ai-review

🔁 AI review rerun (requested by @igerber): ✅ Looks good
…e docstring + full_report test

R5 approved (✅) with two small follow-throughs from R4.

P2: DR ``_render_dr_full_report`` still formatted ``value`` / ``se`` / ``p_value`` straight into the top ``**Headline**:`` line on the no-scalar-by-design dCDH branch. With those fields all ``None`` (by design), the markdown rendered as ``**Headline**: ... = None (SE None, p = None)`` even though the "## Target Parameter" section below correctly explained the suppression. Added a ``status == "no_scalar_by_design"`` branch that emits ``**Headline**: no scalar aggregate by design.`` plus the headline's ``reason`` field.

P3 (stale docstring): the ``_reporting_helpers.py`` top-level docstring still described Wooldridge ``overall_att`` as always ASF-based. R4 split that into OLS vs nonlinear dispatch; updated the docstring bullet to match.

P3 (test gap): the dCDH no-scalar real-fit regression asserted ``run_all()`` + ``interpretation`` but not ``full_report()``. That gap is exactly how the R5 P2 malformed headline could slip through unpinned. Extended the test to assert that the ``**Headline**: no scalar aggregate by design`` line appears in the markdown and the raw ``= None (SE None, p = None)`` pattern does NOT appear.

336 BR/DR tests pass. Black clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

🔁 AI review rerun (requested by @igerber): ✅ Looks good
…full.txt schema

Two P3 cleanups from R6.

P3 #1: the StackedDiD ``target_parameter.definition`` embedded an internal implementation line reference (``stacked_did.py`` around line 541). That pointer is not methodology source material and will go stale under routine estimator edits even when the estimand itself is unchanged. Removed the reference; the definition now stands on paper/registry terms alone.

P3 #2: ``diff_diff/guides/llms-full.txt`` listed the pre-PR BR/DR schema top-level keys and omitted ``target_parameter``, so agent-facing documentation disagreed with the runtime schema. Added ``target_parameter`` to both schema-key lists (BR around line 1779 and DR around line 1844). Documented the field shape (``name`` / ``definition`` / ``aggregation`` / ``headline_attribute`` / ``reference``), the dispatch tag set, and the ``headline_attribute=None`` / ``aggregation="no_scalar_headline"`` edge case for the dCDH ``trends_linear=True, L_max>=2`` fit. Also noted the ``headline.status="no_scalar_by_design"`` value so guide-driven agents can dispatch correctly. UTF-8 fingerprint preserved per ``feedback_llms_guide_utf8_fingerprint.md`` (``tests/test_guides.py`` passes).

354 BR/DR + guide tests pass (337 BR/DR + 17 guide). Black clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Formatting-only follow-up to the R6 edit — the previous commit landed the StackedDiD-line-reference cleanup before black could reflow the affected block.
/ai-review

🔁 AI review rerun (requested by @igerber)
Code Quality / Performance: no findings. Tech Debt / Security: no findings. Documentation/Tests: no additional PR-specific findings. Residual risk: runtime validation was not possible here because …
…y vs ES_avg note

Two P1 findings from R7, both addressed.

P1 #1 (schema version bump): the new ``headline.status`` / ``headline_metric.status`` value ``"no_scalar_by_design"`` added in R4 for the dCDH ``trends_linear=True, L_max>=2`` configuration is a breaking change per the REPORTING.md stability policy (new status-enum values are breaking — agents doing an exhaustive match will break on unknown enums). Bumped ``BUSINESS_REPORT_SCHEMA_VERSION`` and ``DIAGNOSTIC_REPORT_SCHEMA_VERSION`` from ``"1.0"`` to ``"2.0"``, updated the in-tree schema-version tests (one explicit ``== "1.0"`` assertion and six ``"schema_version": "1.0"`` stub dicts in the BR / DR test files), added a REPORTING.md "Schema version 2.0" note, and documented the bump in the CHANGELOG Unreleased entry. The schemas remain marked experimental, so the formal deprecation policy does not yet apply.

P1 #2 (EfficientDiD library vs paper estimand): both EfficientDiD branches now explicitly state that BR/DR's headline ``overall_att`` is the library's cohort-size-weighted average over post-treatment ``(g, t)`` cells, NOT the paper's ``ES_avg`` uniform event-time average. The regime (PT-All / PT-Post) describes identification; the aggregation choice is a separate library-level policy documented in REGISTRY.md Sec. EfficientDiD. Added ``cohort-size-weighted`` + ``ES_avg`` / ``post-treatment`` assertions to ``test_efficient_did_pt_all`` and ``test_efficient_did_pt_post`` so the wording is pinned.

354 BR/DR + guide + target-parameter tests pass. Black and ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
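The breaking-enum concern above is about consumers doing exhaustive matches on ``headline.status``. A hypothetical consumer-side pattern (not part of this PR or the library) that tolerates status values introduced by future schema bumps might look like:

```python
def narrate_headline(headline):
    """Hypothetical consumer-side dispatch on headline.status that
    degrades gracefully when a newer schema version introduces a
    status value this consumer does not know."""
    status = headline.get("status")
    if status == "ok":
        return "Effect: {:.3f}".format(headline["effect"])
    if status == "no_scalar_by_design":
        return "No scalar aggregate by design; see target_parameter."
    # Unknown enum value, e.g. added in a future schema bump.
    return "Unrecognized headline status {!r}; not narrating a scalar.".format(status)
```

A consumer written this way survives an additive enum change; one matching exhaustively does not, which is why the policy treats new enum values as breaking.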
/ai-review

🔁 AI review rerun (requested by @igerber)
…n tests

Both P3 cleanups from R8.

P3 #1 (TROP wording in rst): the ``business_report.rst`` summary listed TROP's target parameter as "factor-model residual", which matches neither the helper nor the REGISTRY definition. Both say the TROP target parameter is a factor-model-adjusted weighted average over treated cells (not a residual). Fixed the rst wording to "factor-model-adjusted ATT".

P3 #2 (Bacon branch untested): the exhaustiveness guard iterates ``_APPLICABILITY``, but ``BaconDecompositionResults`` is a diagnostic read-out on the DR side and is NOT listed in ``_APPLICABILITY`` (BR rejects it with a TypeError). The helper branch for Bacon therefore slipped through the 16-class guard. Added two regressions:
- ``test_bacon_decomposition`` (unit-level, direct helper call): asserts aggregation / headline_attribute / definition wording / the Goodman-Bacon reference.
- ``test_dr_with_bacon_result_emits_target_parameter`` (integration): passes a real ``BaconDecompositionResults`` from ``bacon_decompose`` on a staggered panel through DR, asserts the ``target_parameter`` block propagates into DR's schema, and confirms the named ``headline_attribute`` (``twfe_estimate``) exists on the real fit object.

356 BR/DR + guide + target-parameter tests pass. Black and ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

🔁 AI review rerun (requested by @igerber)
…rface no-scalar dispatch

R9 surfaced a real P1 edge case: the helper inferred ``trends_linear=True`` from ``linear_trends_effects is not None``, but the estimator can set ``linear_trends_effects = None`` when the cumulated-horizon dict is empty (no estimable horizons) while still unconditionally NaN-ing ``overall_att`` under ``trends_linear=True`` + ``L_max >= 2`` (``chaisemartin_dhaultfoeuille.py:2828-2834``). The inference missed that case — an empty-horizon fit would fall through to the ``delta`` branch, BR/DR would extract ``overall_att`` (NaN), and the headline would be narrated as an estimation failure instead of "no scalar by design."

Fix:
- Persisted the fit-time ``trends_linear`` flag explicitly on ``ChaisemartinDHaultfoeuilleResults`` as a new ``Optional[bool]`` field (with a docstring note).
- The dCDH estimator now threads ``_is_trends_linear`` into the result constructor at ``chaisemartin_dhaultfoeuille.py:3139``.
- ``describe_target_parameter()`` reads the persisted flag first and only falls back to the ``linear_trends_effects is not None`` inference when the flag is absent (older cached fits predating the persisted field).

BR/DR no-scalar routing is unchanged — both surfaces dispatch on ``target_parameter["aggregation"] == "no_scalar_headline"``, which now fires correctly for empty-surface fits too.

Tests:
- ``test_dcdh_trends_linear_with_l_max_geq_2_emits_no_scalar_headline`` updated to pass ``trends_linear=True`` explicitly (primary contract).
- ``test_dcdh_trends_linear_empty_surface_still_no_scalar``: new R9 regression. ``linear_trends_effects=None`` + ``trends_linear=True`` (the empty-surface case) routes to the no-scalar branch.
- ``test_dcdh_legacy_fit_without_persisted_flag_still_routes_correctly``: preserves backwards compatibility for cached fits that predate the persisted flag — the legacy ``linear_trends_effects is not None`` fallback still fires.

368 tests pass (BR/DR + guide + target-parameter + dCDH methodology). Black and ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

🔁 AI review rerun (requested by @igerber)
…DH reporting consumers
R10 found my R9 fix was partial. The persisted ``trends_linear``
flag was only read by ``describe_target_parameter``. Three other
dCDH reporting paths still inferred trends-linear from
``linear_trends_effects is not None`` and silently mis-labeled
empty-surface fits as ``delta`` or omitted the linear-trends
identification clause:
1. ``ChaisemartinDHaultfoeuilleResults._horizon_label`` and
``_estimand_label`` (also reached via
``to_dataframe("overall")``) — per-horizon labels and overall
estimand label.
2. ``ChaisemartinDHaultfoeuilleResults.summary`` — the
covariate/trend-adjusted tag in the overall-results summary.
3. ``BusinessReport._describe_assumption`` dCDH branch — the
identifying-assumption prose that names ``DID^{fd}_l`` vs
``DID_l``.
Fix: added a ``_has_trends_linear()`` helper on
``ChaisemartinDHaultfoeuilleResults`` that reads the persisted
flag first and falls back to the legacy inference, and rewired
all three result-class callsites to use it. BR's
``_describe_assumption`` branch gained a matching persisted-first-
then-inference lookup via ``getattr(results, "trends_linear", None)``.
Tests: new ``test_dcdh_empty_surface_propagates_to_assumption_and_native_label``
stubs a ``ChaisemartinDHaultfoeuilleResults`` with
``trends_linear=True``, ``L_max=2``, and
``linear_trends_effects=None`` (the exact R9/R10 edge case), then
asserts:
- ``stub._estimand_label()`` returns ``DID^{fd}_l`` /
``DID^{X,fd}_l``, NOT ``delta``.
- ``stub.to_dataframe("overall")`` does not label the overall row
as ``delta``.
- BR's ``_describe_assumption`` description includes the linear-
trends / first-differenced identification clause.
527 tests pass across BR/DR + guide + target-parameter + dCDH
methodology + dCDH unit suites. Black and ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

🔁 AI review rerun (requested by @igerber): ✅ Looks good
R11 was ✅ with one P3: REGISTRY anchors OLS ETWFE in Wooldridge (2025) and reserves Wooldridge (2023) for the nonlinear ASF extension. The target-parameter helper's OLS branch was citing 2023 in both the definition prose and the ``reference`` field. Updated both to ``Wooldridge (2025)``. The nonlinear branch's joint ``Wooldridge (2023, 2025)`` reference is unchanged.

334 BR/DR + target-parameter tests pass. Black clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

🔁 AI review rerun (requested by @igerber)
…gation tag
R12 identified two issues, both addressed.
P1 (empty-surface dead-end guidance): on
``trends_linear=True, L_max>=2, linear_trends_effects=None`` (no
horizons survived), the PR's no-scalar prose still told users to
"see ``linear_trends_effects``" even though the dict is empty,
and ``to_dataframe("linear_trends")`` raised the wrong
remediation ("Pass ``trends_linear=True`` to fit()") — which the
user already did. Fixed by distinguishing the populated-surface
case from the empty-surface subcase in three places:
- ``describe_target_parameter`` (dCDH no-scalar branch): the
``definition`` on empty surfaces now names the empty state
explicitly ("no cumulated level effects survived estimation")
and points at re-fit remediation, rather than pointing at the
nonexistent horizon dict.
- ``BusinessReport._build_schema`` (no-scalar headline): the
``reason`` field branches on
``getattr(self._results, "linear_trends_effects", None) is None``
and emits the empty-state message accordingly.
- ``DiagnosticReport._execute`` (no-scalar headline): mirror
branching for the DR ``headline_metric`` reason + name.
- ``ChaisemartinDHaultfoeuilleResults.to_dataframe("linear_trends")``
now returns an empty DataFrame with the expected columns when
``trends_linear=True`` is already active but no horizons
survived. The "Pass ``trends_linear=True`` to fit()" error
fires only when the user actually did not request it.
P2 (ambiguous aggregation tag): both ``DifferenceInDifferences``
and ``TwoWayFixedEffects`` return ``DiDResults``, so the old
``"2x2"`` aggregation tag was not faithful for TWFE fits that can
be weighted averages with forbidden later-vs-earlier weights.
Renamed to ``"did_or_twfe"`` — a neutral tag that signals the
ambiguity until estimator provenance is persisted. Downstream
agents dispatching on the tag now know not to treat TWFE fits as
clean 2x2. REPORTING.md updated to enumerate the new tag.
Tests: 3 new regressions pin the empty-surface contract
(target-parameter definition names the empty state, BR/DR
headline reasons avoid "see linear_trends_effects", and
``to_dataframe("linear_trends")`` returns an empty frame rather
than raising). Existing ``"2x2"`` assertions updated to
``"did_or_twfe"``.
502 BR/DR + target-parameter + dCDH unit tests pass. Black clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
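The ``to_dataframe("linear_trends")`` remediation fix above distinguishes "never requested" from "requested but empty". A minimal sketch, using a plain dict-of-lists in place of a real DataFrame and an assumed column set:

```python
LINEAR_TRENDS_COLUMNS = ("horizon", "effect", "se")  # assumed column names

def linear_trends_table(trends_linear, linear_trends_effects):
    """Distinguish 'never requested' from 'requested but empty', per the
    R12 remediation fix above."""
    if not trends_linear:
        # Remediation is only correct when the user did NOT request trends.
        raise ValueError("Pass trends_linear=True to fit().")
    rows = linear_trends_effects or {}
    return {
        "columns": LINEAR_TRENDS_COLUMNS,
        "rows": [(h, effect, None) for h, effect in sorted(rows.items())],
    }

empty = linear_trends_table(True, None)  # empty surface: no error, no rows
```

Returning an empty table on the empty-surface subcase avoids telling the user to do something they already did.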
/ai-review

🔁 AI review rerun (requested by @igerber): needs changes; highest unmitigated severity: P1.
…n; branch dCDH native label on empty surface
R13 identified three remaining surfaces that still hardcoded
"see ``linear_trends_effects``" on the empty-surface subcase
(``trends_linear=True, L_max>=2, linear_trends_effects=None``):
1. BR ``_render_headline_sentence`` (headline prose used by
``headline()``, ``summary()``, and ``full_report()``).
2. DR ``_render_overall_interpretation`` (top-level paragraph).
3. dCDH ``ChaisemartinDHaultfoeuilleResults._estimand_label``
(also surfaced via ``to_dataframe("overall")``).
Fix: BR and DR renderers now read the headline-level ``reason``
field (already branched on populated-vs-empty surface in
``_build_schema`` / ``_execute``), so the rendered prose never
drifts from the schema message. ``_estimand_label`` on dCDH
results gains an empty-surface branch that returns
``DID^{fd}_l (no cumulated level effects survived estimation)``
(or ``DID^{X,fd}_l (...)`` when covariates are active) instead
of pointing at an empty dict.
Docs: REPORTING.md and business_report.rst now document the
``headline.reason``-based populated-vs-empty branching for
agents dispatching on ``aggregation="no_scalar_headline"``.
Tests: 4 new regressions pin the rendered-prose contract on
the empty-surface stub: (a) BR ``headline()`` /
``summary()`` / ``full_report()`` emit the empty-surface
remediation wording, NOT "see linear_trends_effects"; (b) DR
``interpretation`` does the same; (c) dCDH ``_estimand_label``
returns the empty-state label rather than pointing at the
empty dict; (d) ``to_dataframe("overall")`` surfaces the
empty-state label.
505 BR/DR + target-parameter + dCDH tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

/ai-review

🔁 AI review rerun (requested by @igerber): highest unmitigated severity: P1.
|
When trends_linear=True + L_max>=2 + linear_trends_effects=None
(empty-surface subcase) AND covariate_residuals is populated, the
reporting-layer prose now emits the covariate-adjusted label
``DID^{X,fd}_l`` rather than the bare ``DID^{fd}_l``. Propagates to:
- ``_reporting_helpers.describe_target_parameter``: empty-surface
definition_text now interpolates the already-control-aware
estimand_label (``DID^{X,fd}_l`` when has_controls, else
``DID^{fd}_l``).
- ``BusinessReport._build_schema``: reads ``covariate_residuals`` to
select the empty-surface label used in ``headline.reason``.
- ``DiagnosticReport._execute``: mirrors BR's control-aware label
selection for ``headline_metric.reason``.
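A minimal sketch of the control-aware selection those three consumers now share, assuming it keys off whether ``covariate_residuals`` is populated (the function name and signature are illustrative):

```python
def empty_surface_label(covariate_residuals):
    """Pick the empty-surface estimand label from fit-time controls."""
    has_controls = bool(covariate_residuals)
    return "DID^{X,fd}_l" if has_controls else "DID^{fd}_l"

# describe_target_parameter, BusinessReport._build_schema and
# DiagnosticReport._execute all interpolate the same label, so no
# consumer can go stale on the covariate-adjusted subcase.
assert empty_surface_label({"age": 0.12}) == "DID^{X,fd}_l"
assert empty_surface_label(None) == "DID^{fd}_l"
```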
Regression test added: ``test_dcdh_empty_surface_with_controls_*``
(three tests covering target_parameter definition, BR/DR reason
fields, and rendered prose surfaces). Asserts every consumer emits
``DID^{X,fd}_l`` on the covariate-adjusted empty-surface subcase
and does NOT emit bare ``DID^{fd}_l`` as a stale fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review R14 addressed: control-aware empty-surface label. The covariate-adjusted empty-surface subcase (trends_linear=True + L_max>=2 + covariate_residuals populated + linear_trends_effects=None) now emits ``DID^{X,fd}_l``.
🔁 AI review rerun (requested by @igerber). Overall assessment: ✅ Looks good.
Summary
Adds a top-level ``target_parameter`` block to both BusinessReport and DiagnosticReport schemas, naming what the headline scalar represents as an estimand for each of the 17 covered result classes (16 estimator classes from ``_APPLICABILITY`` + BaconDecompositionResults). Fields: ``name``, ``definition``, ``aggregation`` (machine-readable dispatch tag), ``headline_attribute`` (raw result attribute), ``reference`` (REGISTRY.md citation pointer). Additive schema change (experimental per REPORTING.md stability policy). Dispatch lives in ``diff_diff/_reporting_helpers.py::describe_target_parameter`` (own module to avoid circular-import risk; both BR and DR import from it). Three branches read fit-time config: EfficientDiD's ``pt_assumption``, StackedDiD's ``clean_control``, and dCDH's ``L_max`` / ``covariate_residuals`` / ``linear_trends_effects``. The rest emit a fixed tag.
Methodology references (required if estimator / math changes)
``_reporting_helpers.py``: Callaway & Sant'Anna (2021); Sun & Abraham (2021); Borusyak-Jaravel-Spiess (2024); Gardner (2022); Wing-Freedman-Hollingsworth (2024); Wooldridge (2023); Chen-Sant'Anna-Xie (2025); Callaway-Goodman-Bacon-Sant'Anna (2024); Ortiz-Villavicencio & Sant'Anna (2025); de Chaisemartin & D'Haultfoeuille (2020, 2024); Arkhangelsky et al. (2021); Goodman-Bacon (2021). All already cited in REGISTRY.md (feedback_verify_claims.md). Two plan-review catches documented in the code:
- ``ImputationDiDResults`` / ``TwoStageDiDResults`` do not persist ``aggregate`` on the result object; ``overall_att`` is always the sample-mean aggregation regardless of fit-time config. Emitted as the fixed ``"simple"`` tag; the definition names this explicitly so the user is not misled.
- ``ContinuousDiDResults`` has no PT-vs-SPT regime attribute; the definition is disjunctive (``ATT^loc`` under PT, ``ATT^glob`` under SPT) so the user can pick the interpretation that matches their assumption.
Validation
- ``tests/test_target_parameter.py`` (new, 37 tests across per-estimator dispatch, fit-config branching, cross-surface parity, exhaustiveness, and prose rendering).
- ``tests/test_business_report.py`` + ``tests/test_diagnostic_report.py`` top-level-key contract tests updated to include ``target_parameter``. Total 319 BR/DR tests pass.
- ``TestTargetParameterCoversEveryResultClass`` iterates ``_APPLICABILITY`` and asserts every result-class name gets a non-default, non-empty target-parameter block. New result classes (e.g., HAD when it lands) will fail this test until an explicit branch is added.
- ``TestTargetParameterCrossSurfaceParity`` asserts BR and DR emit byte-identical target-parameter blocks on the same fit (both consume the same helper).
Security / privacy
Generated with Claude Code