
Add BusinessReport and DiagnosticReport (experimental preview)#318

Merged
igerber merged 49 commits into main from business-report
Apr 19, 2026

Conversation

@igerber
Owner

igerber commented Apr 18, 2026

Summary

  • Ships BusinessReport and DiagnosticReport as an experimental preview foundation: plain-English narrative output + a stable to_dict() schema, dispatching across all 16 result types. Core promise is the schema shape and estimator-native routing; the narrative prose will continue to evolve.
  • DiagnosticReport orchestrates seven diagnostic checks (parallel trends, pre-trends power, HonestDiD sensitivity, Goodman-Bacon, heterogeneity, design-effect, EPV) and routes SyntheticDiD (pre-treatment fit + in-time placebo + zeta sensitivity), EfficientDiD (hausman_pretest), and TROP (factor-model fit) through their native validation surfaces instead of the generic event-study path.
  • BusinessReport auto-constructs DiagnosticReport by default so .summary() mentions pre-trends, robustness, and design-effect findings in one call. Power-aware phrasing (via compute_pretrends_power) tiers the no_detected_violation language into well-powered / moderately-powered / underpowered so a well-designed study is not under-sold.
  • Both schemas are marked experimental in CHANGELOG and README; wording, verdict thresholds (0.05 / 0.30), and schema shape will change before this is promoted as a first-class feature.
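The power-aware phrasing can be illustrated with a minimal sketch. The function name, the tier cutoffs, and the phrasing strings here are hypothetical; the real logic is driven by compute_pretrends_power:

```python
def power_tier(mdv: float, benchmark: float) -> str:
    """Bin pre-trends power into a phrasing tier.

    mdv is the minimal detectable violation from the power analysis;
    benchmark is the violation size that would matter for the study.
    The 0.5 / 1.0 cutoffs are illustrative, not the library's values.
    """
    ratio = mdv / benchmark
    if ratio <= 0.5:
        return "well_powered"        # test could catch even small violations
    if ratio <= 1.0:
        return "moderately_powered"
    return "underpowered"            # a clean result is weak evidence here


# A clean pre-trends test is then phrased by tier, so a well-designed
# study is not under-sold (strings are illustrative):
PHRASING = {
    "well_powered": "no violation detected; the test was well powered to find one",
    "moderately_powered": "no violation detected; power to detect one was moderate",
    "underpowered": "no violation detected, but the test had little power to find one",
}
```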

Methodology references (required if estimator / math changes)

  • Methods consumed (no new inference): compute_pretrends_power (Roth 2022), HonestDiD.sensitivity (Rambachan & Roth 2023), bacon_decompose (Goodman-Bacon 2021), check_parallel_trends + check_parallel_trends_robust, compute_deff_diagnostics, estimator-native hausman_pretest (Chen, Sant'Anna & Xie 2025), SyntheticDiD pre_treatment_fit / in_time_placebo / sensitivity_to_zeta_omega / get_weight_concentration.
  • Workflow reference: Baker et al. (2025) "Difference-in-Differences Designs: A Practitioner's Guide" — DR covers steps 3, 6, 7; BR covers step 8.
  • Intentional deviations (all documented with the - **Note:** label in docs/methodology/REPORTING.md):
    • No traffic-light pass/fail gates — severity via natural-language phrasing.
    • Pre-trends verdict thresholds (0.30 / 0.05) are explicit diff-diff heuristics, not a field convention.
    • DiagnosticReport does not call check_parallel_trends on event-study or staggered result objects (uses joint Wald / Bonferroni on pre-period coefficients instead).
    • Estimator-native routing for SDiD / TROP / EfficientDiD instead of generic diagnostics.
    • Placebo battery opt-in and reserved-but-skipped in MVP.
    • No arithmetic translation of log-points.
    • Schemas marked experimental; formal deprecation policy is a committed follow-up.
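The 0.30 / 0.05 verdict thresholds amount to a simple binning of the pre-trends p-value. A sketch (the function name and verdict labels are assumed from the PR text):

```python
def pretrends_verdict(p_value: float) -> str:
    """Bin a pre-trends joint p-value into a verdict.

    The 0.30 / 0.05 cutoffs are the explicit diff-diff heuristics noted
    above, not a field convention.
    """
    if p_value >= 0.30:
        return "no_detected_violation"  # comfortably far from rejection
    if p_value >= 0.05:
        return "inconclusive"           # gray zone: neither clean nor violated
    return "detected_violation"         # conventional rejection of parallel trends
```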

Validation

  • Tests added: tests/test_business_report.py (32 tests covering schema contract, BR/DR integration, honest_did_results= passthrough, unit labels, log-points caveat, significance-chasing guard, pre-trends verdict bins, power-aware phrasing, NaN ATT, appendix toggle, BaconDecompositionResults TypeError, survey-metadata passthrough, single-knob alpha, outcome_direction verb selection, error-provenance passthrough).
  • Tests added: tests/test_diagnostic_report.py (37 tests covering applicability matrix across result types, schema contract + JSON round-trip, precomputed= passthrough, joint-Wald vs Bonferroni alignment with a closed-form χ² check, Bonferroni fallback on missing / misaligned interaction_indices and missing vcov, verdict / power-tier boundaries, EfficientDiD hausman pathway, SDiD native diagnostics, error-doesn't-break-report).
  • tests/test_guides.py still passes (UTF-8 fingerprint preserved in llms-full.txt).
  • All 121 targeted tests pass in 0.36s; black, ruff, mypy clean on the two new modules.
  • Manual smoke test on CallawaySantAnna(aggregate="event_study") produces a readable strategy-doc-style summary; .to_dict() round-trips through json.dumps.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes (diff-based + filename + full-file pre-commit scans clean on changed diff_diff/ sources).

Docs

  • New: docs/methodology/REPORTING.md records all methodology deviations under - **Note:** labels. docs/methodology/REGISTRY.md cross-links to it. docs/api/business_report.rst and docs/api/diagnostic_report.rst registered in a new "Reporting" section in docs/api/index.rst. docs/doc-deps.yaml tracks both modules.
  • Modified: README adds an "Experimental preview" subsection under "For Data Scientists" that explicitly flags the schema / wording as unstable. CHANGELOG.md entry in [Unreleased] is marked experimental. ROADMAP.md moves BR/DR to Recently Shipped and splits the bundled bullet so context-aware practitioner_next_steps() remains queued as its own follow-up. diff_diff/guides/llms-full.txt documents both public APIs and schemas; llms-practitioner.txt notes that DR covers Baker steps 3/6/7 in one call.

Known follow-ups (not in this PR)

  • With/without-covariates comparison (Baker et al. require this as mandatory reporting).
  • Event-study plot embedding (currently referenced by name only).
  • Simple-2x2 placebo battery (schema reserves the placebo key; not yet implemented).
  • Real-dataset validation + worked-example parity (Card-Krueger, MPDTA, Favara-Imbs).
  • Target-parameter disambiguation in the narrative.
  • Dedicated tutorial notebook + PPT/HTML export.
  • Arithmetic unit translation for log-points.
  • Per-cohort narrative rendering when heterogeneity is flagged.

These gaps are why the schema is experimental and why tutorials / external positioning do not route practitioners through BR/DR yet.

🤖 Generated with Claude Code

igerber and others added 3 commits April 18, 2026 15:34
Implements the 'practitioner-ready output' roadmap pair: plain-English
stakeholder narratives from any of the 16 fitted result types, backed by
a stable AI-legible to_dict() schema (single source of truth; prose renders
from the dict).

BusinessReport:
- summary()/full_report()/export_markdown() surface stakeholder text
- to_dict()/to_json() expose the structured schema for AI agents
- Optional outcome_label/outcome_unit/business_question/treatment_label
  for context; single-knob alpha drives CI level and phrasing threshold
- auto_diagnostics=True (default) constructs an internal DiagnosticReport
  so the summary mentions pre-trends, sensitivity, and design effects in
  one call; diagnostics=<DR|DRResults|None> overrides explicitly
- Rejects BaconDecompositionResults with a helpful TypeError

DiagnosticReport:
- Orchestrates check_parallel_trends, compute_pretrends_power,
  HonestDiD.sensitivity (grid form yielding breakdown_M), bacon_decompose,
  compute_deff_diagnostics, results.epv_diagnostics, plus heterogeneity
  (CV + range + sign consistency)
- Estimator-native routing: SDiD uses pre_treatment_fit + in_time_placebo
  + sensitivity_to_zeta_omega; EfficientDiD uses native hausman_pretest;
  TROP surfaces factor-model fit (effective_rank / loocv_score / lambdas)
- Lazy: construction is free; run_all() triggers compute and caches
- precomputed={...} escape hatch for user-supplied diagnostic results
- Power-aware phrasing tiers (well/moderately/underpowered) drive the
  'no_detected_violation' verdict prose rather than always hedging
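The "construction is free; run_all() triggers compute and caches" behavior follows a standard lazy-evaluation pattern. A sketch with hypothetical names:

```python
class LazyDiagnostics:
    """Construction stores callables only; nothing runs until run_all()."""

    def __init__(self, checks):
        self._checks = checks      # mapping: check name -> zero-arg callable
        self._cache = None         # populated on first run_all()

    def run_all(self):
        if self._cache is None:    # compute once, then reuse
            self._cache = {name: check() for name, check in self._checks.items()}
        return self._cache


calls = []
report = LazyDiagnostics({"parallel_trends": lambda: calls.append("pt") or "ran"})
report.run_all()
report.run_all()                   # second call hits the cache
assert calls == ["pt"]             # the check executed exactly once
```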

Docs:
- New docs/methodology/REPORTING.md records design deviations via the
  '- **Note:**' label pattern (no-traffic-light gates, pre-trends verdict
  thresholds, unit-translation policy, schema stability policy)
- docs/methodology/REGISTRY.md cross-links to REPORTING.md
- New docs/api/business_report.rst and docs/api/diagnostic_report.rst
  registered under a new 'Reporting' section in docs/api/index.rst
- docs/doc-deps.yaml tracks both modules
- README adds a stakeholder-report example under 'For Data Scientists'
- CHANGELOG marks both schemas experimental for this release
- ROADMAP moves BR/DR to Recently Shipped; splits the original bundled
  bullet so context-aware practitioner_next_steps() remains queued
- diff_diff/guides/llms-full.txt documents the public API and both
  schemas for AI agents; llms-practitioner.txt notes that DR covers
  Baker steps 3/6/7 in one call

Tests: 61 new tests (32 BR + 29 DR) cover schema contract,
applicability matrix across all 16 result types, JSON round-trip,
precomputed-sensitivity passthrough (no re-compute), error handling,
power-tier/verdict boundaries, unit-label behavior, significance-chasing
guard, NaN ATT, include_appendix toggle, BaconDecompositionResults
TypeError, survey metadata passthrough, and alpha single-knob behavior.
All pass in 0.26s under pytest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…provenance

Follows up on review findings on the prior commit:

- **P1 Wald test coverage** — add targeted tests for ``_pt_event_study``
  (``TestJointWaldAlignment``):
    * joint_wald runs when pre-period keys align with ``interaction_indices``
    * computed chi-squared statistic matches a closed-form expectation
    * Bonferroni fallback when ``interaction_indices`` is missing
    * Bonferroni fallback when the key namespace is misaligned
    * Bonferroni fallback when ``vcov`` is missing
  Also document the alignment contract and fallback rule inline near the
  Wald codepath so the invariant is discoverable without reading tests.

- **P2 outcome_direction** — implement direction-aware verbs in the
  headline sentence via ``_direction_verb``:
    * ``higher_is_better`` + positive effect -> "lifted"
    * ``higher_is_better`` + negative effect -> "reduced"
    * ``lower_is_better``  + positive effect -> "worsened"
    * ``lower_is_better``  + negative effect -> "improved"
    * ``None``             -> neutral "increased" / "decreased"
  Covered by ``TestOutcomeDirection`` with three scenarios.

- **P2 warning provenance** — populate top-level ``schema["warnings"]``
  from every section that ended in ``status="error"`` so agents do not
  have to scan each section dict to discover diagnostic failures.
  ``DiagnosticReportResults.warnings`` now mirrors the top-level list.
  Covered by ``TestWarningsPassthrough``.

- **P2 string dispatch** — add an inline note above ``_APPLICABILITY``
  explaining the ``type(results).__name__`` convention (mirrors
  ``practitioner._HANDLERS`` to avoid circular imports) and pointing at
  the applicability-matrix test as the regression guard.
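The string-dispatch convention from the last item can be sketched as follows; the table contents here are illustrative, not the library's actual applicability matrix:

```python
# Dispatch on the class *name* rather than the class object, mirroring
# practitioner._HANDLERS, so the report module does not need to import
# every estimator (avoiding circular imports).
_APPLICABILITY = {
    "CallawaySantAnnaResults": {"parallel_trends", "pretrends_power", "sensitivity"},
    "SyntheticDiDResults": {"pre_treatment_fit", "in_time_placebo"},
}


def applicable_checks(results) -> set:
    """Look up checks by type name; unknown types get no checks."""
    return _APPLICABILITY.get(type(results).__name__, set())


class CallawaySantAnnaResults:     # stand-in class with the matching name
    pass


assert "sensitivity" in applicable_checks(CallawaySantAnnaResults())
```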

No behavioral changes outside the review items; existing tests remain
unchanged. 121 tests pass, black / ruff / mypy clean.
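The warning-provenance change described above amounts to folding every errored section into one top-level list. A sketch with assumed schema keys:

```python
def collect_warnings(sections: dict) -> list:
    """Mirror each section that ended in status='error' into a flat list,
    so agents need not scan every section dict for failures."""
    return [
        f"{name}: {section.get('error', 'diagnostic failed')}"
        for name, section in sections.items()
        if section.get("status") == "error"
    ]


sections = {
    "parallel_trends": {"status": "ran"},
    "sensitivity": {"status": "error", "error": "missing vcov"},
}
assert collect_warnings(sections) == ["sensitivity: missing vcov"]
```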

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the "Stakeholder-ready report from any fit" subsection framing
with "Experimental preview: BusinessReport and DiagnosticReport" and
reword the introductory paragraph to emphasize that wording, verdict
thresholds, and schema shape will change. Drop the expected-output
comment from the example (the prose will evolve) and invite feedback.

This matches the foundation-not-shipped-feature posture: the schema and
narrative prototype are worth validating in isolation, but the library
still lacks several items a methodologically rigorous practitioner would
expect (covariate comparison, event-study plot embedding, 2x2 placebo
battery, real-dataset validation, target-parameter clarity, tutorial
integration). Keeping external framing conservative until those gaps
close.

No functional changes; only README prose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

⛔ Blocker

Static review only; I did not run the new tests in this workspace because the local Python environment is missing numpy.

Executive Summary

  • The documented HonestDiDResults passthrough can produce a false “robust across the full grid” claim even when the caller supplied only a single-M sensitivity result.
  • EPV reporting is silently wrong on existing dict-backed result objects: the new report can say 0 low-EPV cells and min_epv=None even when the fitted result already flagged low-EPV cells.
  • CallawaySantAnnaResults loses two advertised diagnostics because the applicability gate requires a nonexistent results.vcov; that skips both HonestDiD sensitivity and pre-trends power on the flagship CS event-study path.
  • The CS pre-trends path also ignores the available event_study_vcov and falls back to Bonferroni, which is an undocumented deviation from the documented joint-Wald route.
  • The “all 16 result types” contract is already broken for ContinuousDiDResults: both report classes miss its existing SE/p-value/CI fields.

Methodology

  1. P0 The single-M HonestDiD passthrough is rendered as full-grid robustness. DiagnosticReport explicitly accepts a single-M HonestDiDResults object and tags it as conclusion="single_M_precomputed", but both renderers equate breakdown_M is None with “remains significant across the full grid.” See diff_diff/diagnostic_report.py:L932-L958, diff_diff/diagnostic_report.py:L1678-L1686, diff_diff/business_report.py:L109-L111, and diff_diff/business_report.py:L996-L1003. Impact: BusinessReport(honest_did_results=compute_honest_did(...)) and DiagnosticReport(precomputed={"sensitivity": honest_result}) can overstate robustness from one checked M value to the entire Rambachan-Roth grid. Concrete fix: preserve the single_M_precomputed state through prose rendering and phrase it as “at M=<value> the CI excludes zero”; reserve “across the full grid” for true SensitivityResults grids.

  2. P0 EPV diagnostics are silently wrong for existing dict-backed result objects. _check_epv() expects low_epv_cells/min_epv attributes and hardcodes threshold = 10, but current result surfaces store epv_diagnostics as dicts plus epv_threshold. See diff_diff/diagnostic_report.py:L1087-L1104, diff_diff/staggered_results.py:L126-L130, diff_diff/triple_diff.py:L108-L112, diff_diff/staggered_triple_diff_results.py:L87-L90, and the existing low-EPV semantics in diff_diff/staggered.py:L1767-L1778. Impact: the report can return status="ran" with n_cells_low=0 and min_epv=None even when the fitted estimator already detected low EPV, which is a silent false-clean diagnostic. Concrete fix: detect dict-backed epv_diagnostics, compute low_cells from diag.get("is_low"), derive min_epv from diag.get("epv"), and read results.epv_threshold instead of hardcoding 10.

  3. P1 CallawaySantAnnaResults sensitivity and pre-trends power are skipped entirely because applicability is gated on results.vcov. The new gate requires results.vcov for both checks, but CS exposes event_study_vcov/event_study_vcov_index, and the underlying helper implementations already support CS without results.vcov. See diff_diff/diagnostic_report.py:L451-L467, diff_diff/diagnostic_report.py:L505-L529, diff_diff/staggered_results.py:L118-L120, diff_diff/pretrends.py:L605-L634, diff_diff/honest_did.py:L663-L775, and the BusinessReport passthrough hook in diff_diff/business_report.py:L236-L245. Impact: BR/DR skip two of the PR’s headline diagnostics on one of the core supported estimators, and honest_did_results= cannot rescue that path because skipped checks are never executed. Concrete fix: gate these checks on actual helper support and precomputed overrides, not on raw results.vcov existence.

  4. P1 The CS pre-trends route ignores the available event-study covariance and falls back to Bonferroni. The methodology note says staggered/event-study PT should use a joint Wald test unless covariance is unavailable, but _pt_event_study() only looks at results.vcov plus interaction_indices; it never checks CS’s event_study_vcov. See docs/methodology/REPORTING.md:L44-L50, diff_diff/diagnostic_report.py:L700-L761, diff_diff/staggered_results.py:L118-L120, and the existing CS covariance handling in diff_diff/honest_did.py:L738-L756. Impact: CS can get a different pre-trends p-value/verdict than the documented method, and the change is not documented as a deviation in the registry. Concrete fix: teach _pt_event_study() to use event_study_vcov/event_study_vcov_index before falling back to Bonferroni.

Code Quality

  1. P1 The “any of the 16 result types” headline extraction is already broken for ContinuousDiDResults. Both report classes assume overall_se / overall_p_value / overall_conf_int, but ContinuousDiD stores overall_att_se / overall_att_p_value / overall_att_conf_int. See diff_diff/business_report.py:L322-L358, diff_diff/diagnostic_report.py:L1392-L1422, diff_diff/continuous_did_results.py:L115-L119, and the public “any of the 16 result types” claims in diff_diff/business_report.py:L83-L91 and diff_diff/diagnostic_report.py:L233-L236. Impact: ContinuousDiD reports silently drop existing inference fields and under-specify the headline. Concrete fix: add overall_att_* fallbacks or centralize headline extraction behind a shared utility.

Performance

  • No material findings in the diff.

Maintainability

  1. P2 The headline-extraction mapping is duplicated in both report classes instead of being shared, even though the repo already has a reusable fallback pattern. See diff_diff/business_report.py:L322-L358, diff_diff/diagnostic_report.py:L1392-L1422, and diff_diff/power.py:L476-L488. Impact: schema drift like the ContinuousDiDResults mismatch has to be fixed in two places and is likely to recur. Concrete fix: move scalar-headline extraction into one shared helper with alias lists for nonstandard result classes.

Tech Debt

  • No separate deferrable findings. None of the issues above are tracked under TODO.md:L51-L99, so no mitigation applies.

Security

  • No findings.

Documentation/Tests

  1. P2 The new tests cover only a subset of the advertised surfaces, and the passthrough tests only exercise grid-shaped fake sensitivity objects. See tests/test_business_report.py:L68-L105, tests/test_diagnostic_report.py:L76-L120, tests/test_business_report.py:L188-L205, and tests/test_diagnostic_report.py:L277-L286. Impact: the single-M HonestDiD path, dict-backed EPV surfaces, the CS event_study_vcov route, and ContinuousDiDResults all shipped without regression coverage. Concrete fix: add explicit tests for those four paths.

Path to Approval

  1. Preserve the single_M_precomputed sensitivity state through DR/BR rendering, and add regression tests that pass an actual single-M HonestDiDResults object into both DiagnosticReport(precomputed=...) and BusinessReport(honest_did_results=...).
  2. Rewrite _check_epv() to support dict-backed epv_diagnostics, compute n_cells_low/min_epv from the dict contents, and read results.epv_threshold; add regressions on at least one CS result and one DDD result with low EPV.
  3. Fix applicability gating for CS/SA power/sensitivity to use the real helper support matrix and to honor precomputed overrides; specifically, CS should not require results.vcov for HonestDiD or pre-trends power.
  4. Update _pt_event_study() to use event_study_vcov/event_study_vcov_index when present before falling back to Bonferroni; add a CS analytical-fit test asserting method == "joint_wald".
  5. Centralize headline extraction with overall_att_* aliases and add ContinuousDiDResults coverage in both report test suites.

Fixes the five issues the CI reviewer flagged against the initial
BusinessReport / DiagnosticReport PR:

P0 — Single-M HonestDiDResults passthrough was being narrated as
  "robust across the full grid" because both renderers checked
  ``breakdown_M is None`` and fell through to the grid-wide phrasing.
  Preserves the ``conclusion="single_M_precomputed"`` state through
  both BR._render_summary and DR._render_overall_interpretation; the
  point check is now rendered as "at M=<value>, the robust CI
  (excludes|includes) zero — run HonestDiD.sensitivity() across a
  grid for a breakdown value." Regression tests in
  ``TestSingleMSensitivityPrecomputed`` cover both DR.summary() and
  BR(honest_did_results=...).summary().
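The fix boils down to checking the conclusion tag before interpreting breakdown_M=None. A sketch with assumed section keys:

```python
def sensitivity_phrase(section: dict) -> str:
    """Render HonestDiD sensitivity without overclaiming a single-M check."""
    if section.get("conclusion") == "single_M_precomputed":
        verb = "excludes" if section.get("ci_excludes_zero") else "includes"
        return (
            f"at M={section['M']}, the robust CI {verb} zero; run "
            "HonestDiD.sensitivity() across a grid for a breakdown value"
        )
    if section.get("breakdown_M") is None:
        # Only a true grid result may claim grid-wide robustness.
        return "remains significant across the full grid (no breakdown)"
    return f"breakdown at M={section['breakdown_M']}"
```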

P0 — EPV diagnostics were silently reporting 0 low cells and
  min_epv=None on every fit because _check_epv() expected
  ``low_epv_cells`` / ``min_epv`` attributes but the library's
  ``epv_diagnostics`` convention is a dict keyed by cell identifier
  with per-cell ``{"is_low": bool, "epv": float}`` entries. Rewrites
  _check_epv() to handle the dict shape, counts low cells via
  ``v.get("is_low")``, derives min_epv from the ``epv`` values, and
  reads ``results.epv_threshold`` instead of hardcoding 10. Legacy
  object-shape fallback retained for custom subclasses. Regression
  tests in ``TestEPVDictBacked`` cover low-cell detection, no-low
  clean case, and the configurable threshold.
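The dict-backed convention ({cell: {"is_low": bool, "epv": float}}) can be summarized as in this sketch; the helper name and return shape are assumptions:

```python
def summarize_epv(epv_diagnostics: dict, epv_threshold: float = 10) -> dict:
    """Count low-EPV cells and find min EPV from the dict-backed shape,
    reading the threshold from the result instead of hardcoding 10."""
    epvs = [
        cell["epv"]
        for cell in epv_diagnostics.values()
        if cell.get("epv") is not None
    ]
    return {
        "n_cells_low": sum(bool(c.get("is_low")) for c in epv_diagnostics.values()),
        "min_epv": min(epvs) if epvs else None,
        "threshold": epv_threshold,
    }


diag = {
    "(2010, A)": {"is_low": True, "epv": 4.0},
    "(2011, B)": {"is_low": False, "epv": 22.0},
}
assert summarize_epv(diag, epv_threshold=8) == {
    "n_cells_low": 1, "min_epv": 4.0, "threshold": 8,
}
```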

P1 — CallawaySantAnnaResults sensitivity + pretrends_power were
  skipped entirely because the applicability gate required
  ``results.vcov``, but CS exposes ``event_study_vcov`` /
  ``event_study_vcov_index`` alongside a populated
  ``event_study_effects`` surface. ``HonestDiD.sensitivity_analysis``
  and ``compute_pretrends_power`` already handle CS via those
  attributes, so the gate now accepts any of the three covariance
  sources. Also honors precomputed overrides regardless of gate.
  Regression tests in ``TestCSEventStudyVCovSupport`` confirm both
  checks are applicable on an aggregated CS fit.

P1 — _pt_event_study() was forcing Bonferroni on CS even though
  event_study_vcov + event_study_vcov_index were available. Added a
  second covariance source branch that builds an index map from
  ``event_study_vcov_index`` and reports
  ``method="joint_wald_event_study"`` on the Wald path. Regression
  test in ``TestCSJointWaldViaEventStudyVCov`` verifies the
  closed-form chi-squared statistic on a CS stub with known
  coefficients and diagonal vcov.
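The closed-form check rests on the Wald statistic for H0: all pre-period coefficients are zero; with a diagonal vcov it reduces to a sum of squared z-scores. A dependency-light sketch (the p-value step against the chi-squared distribution is omitted):

```python
import numpy as np


def joint_wald_stat(pre_coefs, vcov) -> float:
    """Wald statistic beta' V^{-1} beta; chi-squared with len(beta) df under H0."""
    beta = np.asarray(pre_coefs, dtype=float)
    V = np.asarray(vcov, dtype=float)
    return float(beta @ np.linalg.solve(V, beta))


# With a diagonal vcov the statistic equals sum((coef / se) ** 2):
stat = joint_wald_stat([0.10, -0.20], np.diag([0.05**2, 0.10**2]))
assert abs(stat - (2.0**2 + 2.0**2)) < 1e-9
```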

P1 — ContinuousDiDResults headline extraction was broken: BR and DR
  both assumed ``overall_se`` / ``overall_p_value`` /
  ``overall_conf_int`` but ContinuousDiDResults stores them as
  ``overall_att_se`` / ``overall_att_p_value`` /
  ``overall_att_conf_int``. Centralized the attribute-alias lookup in
  a new module-level ``_extract_scalar_headline`` helper that both
  report classes call; the helper accepts multiple SE / p / CI alias
  names per point-estimate row and tries them in order.
  Regression test in ``TestContinuousDiDHeadline`` confirms the
  helper resolves the ``overall_att_*`` aliases.
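The shared-helper pattern (attribute aliases tried in order) looks roughly like this; the names follow the commit description, but the exact signature is an assumption:

```python
def _first_attr(results, names):
    """Return the first non-None attribute among the candidate names."""
    for name in names:
        value = getattr(results, name, None)
        if value is not None:
            return value
    return None


def extract_scalar_headline(results) -> dict:
    """Resolve standard and ContinuousDiD-style overall_att_* aliases."""
    return {
        "se": _first_attr(results, ("overall_se", "overall_att_se")),
        "p_value": _first_attr(results, ("overall_p_value", "overall_att_p_value")),
        "conf_int": _first_attr(results, ("overall_conf_int", "overall_att_conf_int")),
    }


class FakeContinuousDiDResults:    # stub mimicking the nonstandard field names
    overall_att_se = 0.12
    overall_att_p_value = 0.03
    overall_att_conf_int = (0.01, 0.49)


assert extract_scalar_headline(FakeContinuousDiDResults())["se"] == 0.12
```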

P2 — Headline extraction was duplicated in BR and DR. Eliminated by
  the shared helper above.

131 targeted tests pass (BR + DR + guides + practitioner); black,
ruff, and mypy clean on the new modules.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 2577d6c7e6e88a716fc4e4979572f4677f301637


Overall Assessment

⛔ Blocker

Static review only. I could not run the new tests in this workspace because pytest is unavailable, and importing diff_diff fails without numpy.

Executive Summary

  • The previously reported EPV dict-surface bug, ContinuousDiD headline mapping, and most CS applicability / event_study_vcov wiring issues look resolved.
  • BusinessReport and DiagnosticReport still mislabel confidence intervals: BR relabels stored intervals to the caller's alpha without recomputation, and DR hardcodes 95% CI in prose.
  • BusinessReport.full_report() still turns a single-M HonestDiD passthrough into “robust across full grid,” so one public renderer still overstates robustness.
  • The new pre-trends collector treats synthetic reference-marker / zero-observation rows as real pre-period evidence and keeps NaN p-values in the Bonferroni fallback, which can silently yield false-clean PT verdicts.
  • The new power-aware PT wording is driven by compute_pretrends_power() even on staggered event-study results where the helper ignores available full covariance, which is an undocumented methodology deviation.

Methodology

  • P0 CI / significance presentation is statistically mislabeled across the new report surfaces. In _extract_headline() at diff_diff/business_report.py:L313, BR reads the stored CI from the fitted result but overwrites its level with self._context.alpha; in _render_headline_sentence() at diff_diff/business_report.py:L889 that relabeled level is rendered as if it were the actual interval. DR separately hardcodes 95% CI in prose at _render_overall_interpretation() in diff_diff/diagnostic_report.py:L1751. This contradicts the documented single-knob alpha contract in docs/methodology/REPORTING.md:L111. Impact: BusinessReport(results, alpha=0.10) can label a stored 95% interval as 90%, DR can label a stored 90% interval as 95%, and BR's significance phrasing can diverge from the interval actually shown. Concrete fix: either recompute the interval with an existing inference utility such as safe_inference() when alpha != results.alpha, or preserve the fitted result's original CI level and ignore / reject mismatched alpha for headline rendering; DR prose should use the extracted interval level instead of a hardcoded 95.

  • P0 The prior single-M HonestDiD robustness blocker is only partially fixed. The summary path now respects conclusion == "single_M_precomputed", but BusinessReport.full_report() still prints Breakdown M: robust across full grid (no breakdown) whenever breakdown_M is non-numeric at diff_diff/business_report.py:L1109, even though single-M passthrough is explicitly encoded with breakdown_M = None at diff_diff/diagnostic_report.py:L985. Impact: full_report() still overclaims Rambachan-Roth robustness from a single checked M value. Concrete fix: mirror the summary renderer's conclusion == "single_M_precomputed" branch inside full_report() and add a regression on BusinessReport(...).full_report().

  • P0 The new pre-trends fallback is not reference-marker / NaN safe. _collect_pre_period_coefs() accepts any negative-time event-study row with non-None effect and se at diff_diff/diagnostic_report.py:L1615, which pulls in the synthetic reference markers emitted by CS / stacked / two-stage / imputation event-study outputs at diff_diff/staggered_aggregation.py:L712, diff_diff/stacked_did.py:L503, diff_diff/two_stage.py:L1353, and diff_diff/imputation.py:L1685. _pt_event_study() then Bonferroni-adjusts on p_value is not None rather than np.isfinite(p_value) at diff_diff/diagnostic_report.py:L810. Impact: universal-base CS results can miss the available joint-Wald path, Bonferroni multipliers can be inflated by synthetic rows, and all-NaN fallback inputs can collapse to joint_p=1.0 / no_detected_violation instead of an inconclusive PT result. Concrete fix: exclude explicit reference markers and rows with n_groups == 0 / n_obs == 0 or non-finite SE / p-values before both applicability and PT computation; if nothing valid remains, return skipped / inconclusive, not a clean p-value.

  • P1 The new power-aware PT wording is an undocumented diagonal approximation on staggered event-study results. DiagnosticReport now runs compute_pretrends_power() whenever an event-study surface exists at diff_diff/diagnostic_report.py:L451 and diff_diff/diagnostic_report.py:L827, but for CS / SA the helper still drops to np.diag(ses**2) at diff_diff/pretrends.py:L605 even though full event_study_vcov is available and already used by HonestDiD, as shown by tests/test_honest_did.py:L1165. The documented reporting method says DR uses compute_pretrends_power() “whenever the result has an event-study surface with a vcov” at docs/methodology/REPORTING.md:L76. Impact: the new “well-powered / moderately-powered / underpowered” wording can materially change based on an undocumented covariance simplification rather than Roth-style full pre-period covariance. Concrete fix: either teach compute_pretrends_power() to consume event_study_vcov / event_study_vcov_index, or suppress power-tiered prose whenever only a diagonal fallback is available; if the fallback is intentional, document that deviation explicitly in REPORTING.md.
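Recomputing rather than relabeling the interval is straightforward under a normal approximation. A sketch (the review suggests routing through an existing inference utility such as safe_inference instead):

```python
from statistics import NormalDist


def recompute_ci(estimate: float, se: float, alpha: float):
    """Build a (1 - alpha) CI from the point estimate and SE,
    instead of relabeling an interval stored at a different level."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * se, estimate + z * se


lo95, hi95 = recompute_ci(1.0, 0.5, alpha=0.05)
lo90, hi90 = recompute_ci(1.0, 0.5, alpha=0.10)
assert hi90 - lo90 < hi95 - lo95   # the 90% interval is strictly narrower
```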

Code Quality

Performance

No material findings in the diff.

Maintainability

No separate maintainability findings beyond the incomplete public-API contract above.

Tech Debt

No separate deferrable findings. None of the P0 / P1 issues above are tracked under TODO.md:L37, so TODO tracking does not mitigate them.

Security

No findings.

Documentation/Tests

  • P2 Regression coverage still misses the blocker paths above, and one new test locks in the CI-label bug as expected behavior. TestAlphaKnob only asserts that the displayed CI level flips to 90 / 95 at tests/test_business_report.py:L497, there is no full_report() regression for single-M HonestDiD despite the prior blocker at tests/test_diagnostic_report.py:L418, and the PT tests do not cover reference-marker / NaN pre-period rows at tests/test_diagnostic_report.py:L305. Impact: the current suite would let the blocker-level output bugs above ship or recur. Concrete fix: add regressions for alpha != results.alpha CI consistency, BusinessReport.full_report() on single-M passthrough, and universal-base / reference-marker PT cases for CS / stacked / two-stage / imputation stubs.

Path to Approval

  1. Make headline CI rendering alpha-safe: either recompute intervals with an existing inference utility when the requested alpha differs from the fitted result's alpha, or preserve the stored CI level and reject mismatched relabeling; make DR prose use the real interval level instead of a hardcoded 95.
  2. Apply the single_M_precomputed branch to BusinessReport.full_report() and add a regression that asserts no “full grid” claim appears there.
  3. Filter synthetic reference markers, zero-observation rows, and non-finite pre-period p-values out of _collect_pre_period_coefs() / Bonferroni PT; if no valid pre-periods remain, emit skipped or inconclusive instead of a clean verdict.
  4. Align the power-tiering method with the documented covariance contract: either wire event_study_vcov into compute_pretrends_power() or skip / downgrade power-aware wording when only a diagonal fallback is available, and document any retained deviation in REPORTING.md.
  5. Either implement the remaining advertised precomputed passthrough keys or narrow the documented accepted-key set and add explicit validation plus tests.

P0 fixes:

1. **CI-level mislabeling across BR and DR.** BR's ``_extract_headline``
   was reading the stored CI from the fitted result and relabeling it
   with the caller's ``alpha``, so ``BusinessReport(results, alpha=0.10)``
   would print a stored 95% interval under a "90% CI" label. DR's
   ``_render_overall_interpretation`` hardcoded "95% CI" in prose,
   inverting the same bug when the caller used a non-default alpha.
   BR now recomputes the interval via ``safe_inference`` when the
   caller's alpha differs from the fit's; DR prose reads the
   headline's alpha to derive the CI level string. Regression in
   ``TestAlphaKnob.test_ci_bounds_recomputed_when_alpha_differs_from_result``.

2. **``full_report()`` single-M HonestDiD rendering.** The summary path
   was fixed earlier, but the structured-markdown path still emitted
   "Breakdown M: robust across full grid (no breakdown)" for a
   single-M passthrough (which has ``breakdown_M=None`` by
   construction, not because it's grid-wide robust). Added the
   ``conclusion == "single_M_precomputed"`` branch in
   ``_render_full_report``. Regression in
   ``TestFullReportSingleM.test_full_report_does_not_claim_full_grid_for_single_m``.

3. **Reference-marker / NaN pre-period filtering.** ``_collect_pre_period_coefs``
   was accepting any negative-time event-study row with non-None effect
   and SE, which pulled in the universal-base reference marker
   (``effect=0, se=NaN, n_groups=0``) emitted by CS / SA /
   ImputationDiD / Stacked event-study output as a real pre-period
   coefficient. ``_pt_event_study`` Bonferroni also treated ``NaN``
   p-values as valid by checking ``is not None`` rather than
   ``np.isfinite``. The combination could produce a false-clean
   ``no_detected_violation`` verdict on fits whose only "evidence" was
   synthetic. Now drop rows with ``n_groups == 0`` and any row whose
   effect, SE, or p-value is non-finite before both applicability and
   PT computation; if no valid entries remain, the check returns
   ``skipped`` rather than a clean p-value. Regressions in
   ``TestReferenceMarkerAndNaNFiltering``.
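The filtering rule in item 3 can be sketched as follows, assuming dict-shaped event-study rows; the keys mirror the description above, not the real result schema.

```python
import math

def filter_pre_period_rows(rows: list[dict]) -> list[dict]:
    """Drop synthetic reference markers and non-finite rows before the
    Bonferroni / joint-Wald step (sketch; keys mirror the fix note, not
    the real result schema)."""
    kept = []
    for row in rows:
        if row.get("n_groups") == 0:  # universal-base reference marker
            continue
        vals = (row.get("effect"), row.get("se"), row.get("p_value"))
        if any(v is None or not math.isfinite(v) for v in vals):
            continue
        kept.append(row)
    return kept

rows = [
    # CS/SA-style reference marker: effect=0, se=NaN, n_groups=0
    {"rel_time": -1, "effect": 0.0, "se": float("nan"),
     "p_value": float("nan"), "n_groups": 0},
    # a real pre-period coefficient
    {"rel_time": -2, "effect": 0.1, "se": 0.05,
     "p_value": 0.04, "n_groups": 12},
]
valid = filter_pre_period_rows(rows)
assert len(valid) == 1
```

If `valid` comes back empty, the check should report ``skipped`` rather than synthesize a clean p-value from nothing.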

P1 fixes:

4. **Power-tier covariance source annotation.** ``compute_pretrends_power``
   currently drops to ``np.diag(ses**2)`` for CS / SA / ImputationDiD /
   Stacked / etc. even when the full ``event_study_vcov`` is attached
   on the result. The diagonal-only MDV can be optimistic because it
   ignores correlations across pre-periods; promoting that to
   ``well_powered`` would overstate the evidence. The
   ``pretrends_power`` schema section now records ``covariance_source``
   (one of ``full_pre_period_vcov`` /
   ``diag_fallback_available_full_vcov_unused`` / ``diag_fallback``),
   BR downgrades ``well_powered`` → ``moderately_powered`` when we
   know the diagonal approximation was the only input, and
   ``docs/methodology/REPORTING.md`` documents this as a known
   conservative deviation pending the right long-term fix in
   ``pretrends.py``.

5. **``precomputed=`` contract validation.** The docstring advertised
   passthrough for ``placebo``, ``design_effect``, ``heterogeneity``,
   and ``epv`` but only four checks actually respected it
   (``parallel_trends``, ``sensitivity``, ``pretrends_power``,
   ``bacon``). Narrowed the docstring to match reality and added a
   ``ValueError`` that rejects unsupported ``precomputed=`` keys at
   construction. Regressions in ``TestPrecomputedValidation``. The
   remaining sections (``design_effect``, ``heterogeneity``, ``epv``)
   are read-outs from the fitted result with no expensive call to
   bypass; there is no scenario where a user-supplied override helps.
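The construction-time validation in item 5 amounts to a set-difference check; a minimal sketch, with the accepted-key set taken from the narrowed docstring above:

```python
_ALLOWED_PRECOMPUTED = frozenset(
    {"parallel_trends", "sensitivity", "pretrends_power", "bacon"}
)

def validate_precomputed(precomputed: dict) -> None:
    """Fail fast on unsupported precomputed= keys (sketch of the
    construction-time ValueError described above)."""
    unsupported = set(precomputed) - _ALLOWED_PRECOMPUTED
    if unsupported:
        raise ValueError(
            f"unsupported precomputed keys: {sorted(unsupported)}; "
            f"accepted: {sorted(_ALLOWED_PRECOMPUTED)}"
        )

validate_precomputed({"bacon": {"weights": [0.7, 0.3]}})  # accepted key: no error
try:
    validate_precomputed({"placebo": {}})  # read-out section: rejected
except ValueError as exc:
    rejected = str(exc)
assert "placebo" in rejected
```

Failing at construction rather than at render time means a typo in a passthrough key cannot silently degrade into a full recomputation.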

139 targeted tests pass; black, ruff, and mypy clean on the new modules.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 989f71ab04b41bd8c94796904afd23e6265f66e5


Overall Assessment

⛔ Blocker

Static review only. I could not execute the added tests here because this workspace does not have numpy/pytest.

Executive Summary

  • The prior hardcoded 95% CI wording and the full_report() single-M HonestDiD overclaim look fixed.
  • A new blocker remains in the alpha-override path: BusinessReport now recomputes headline CIs from (att, se) with generic safe_inference(), which discards finite-df and bootstrap inference contracts.
  • DiagnosticReport still overstates diagnostic support across result types: it marks pretrends_power and sensitivity applicable for estimator families that the backing helpers do not support, so valid fits can default to error sections.
  • The earlier pre-period filtering fix is only partial: n_groups==0 markers are filtered, but n_obs==0 synthetic reference rows from Stacked/TwoStage/Imputation are still counted during support gating.
  • The diagonal-covariance power downgrade is now explicitly documented in docs/methodology/REPORTING.md, so I did not count that deviation as a defect.
  • The new tests do not cover bootstrap/finite-df alpha overrides or the unsupported estimator families, so the issues above would likely slip through.

Methodology

  • Severity: P0. Impact: when alpha != results.alpha, BusinessReport recomputes the displayed CI via safe_inference(att, se, alpha=alpha) with no df and no bootstrap-path handling, even though several result surfaces use finite-df or percentile-bootstrap inference. That yields wrong interval bounds and can again decouple the shown CI from the stored p-value/significance surface on survey, TROP, Wooldridge, ContinuousDiD, dCDH bootstrap, etc. diff_diff/business_report.py:L333-L355, diff_diff/utils.py:L175-L209, diff_diff/trop.py:L849-L858, diff_diff/continuous_did_results.py:L48-L64, diff_diff/chaisemartin_dhaultfoeuille_results.py:L158-L173. Concrete fix: only recompute when the report can recreate the estimator’s original inference contract (same distribution, same df, or bootstrap quantiles); otherwise preserve the fitted CI and its native level, or reject mismatched alpha.

  • Severity: P1. Impact: DiagnosticReport marks pretrends_power applicable for ImputationDiDResults, TwoStageDiDResults, StackedDiDResults, EfficientDiDResults, StaggeredTripleDiffResults, WooldridgeDiDResults, and dCDH, but compute_pretrends_power() only supports MultiPeriodDiDResults, CallawaySantAnnaResults, and SunAbrahamResults. On top of that, _collect_pre_period_coefs() still does not drop n_obs==0 synthetic reference rows, so Stacked/TwoStage/Imputation can count reference markers as real pre-period evidence during gating. This produces avoidable pretrends_power errors or misclassified applicability on valid fits. diff_diff/diagnostic_report.py:L98-L172, diff_diff/diagnostic_report.py:L474-L494, diff_diff/diagnostic_report.py:L1680-L1752, diff_diff/pretrends.py:L605-L667, diff_diff/stacked_did.py:L500-L511, diff_diff/two_stage.py:L1353-L1360, diff_diff/imputation.py:L1685-L1692. Concrete fix: restrict pretrends_power to helper-supported families until adapters exist, and filter n_obs==0 rows alongside n_groups==0.

  • Severity: P1. Impact: DiagnosticReport marks sensitivity applicable for SunAbrahamResults, ImputationDiDResults, StackedDiDResults, EfficientDiDResults, StaggeredTripleDiffResults, WooldridgeDiDResults, and dCDH, then calls HonestDiD.sensitivity_analysis() generically. The backing helper only has adapters for MultiPeriodDiDResults, CallawaySantAnnaResults, and dCDH; dCDH itself is misrouted because DR never consults placebo_event_study, which is the pre-period surface HonestDiD expects there. Result: default reports land in error for several advertised families and skip available placebo-based dCDH sensitivity evidence. diff_diff/diagnostic_report.py:L88-L170, diff_diff/diagnostic_report.py:L495-L518, diff_diff/diagnostic_report.py:L966-L1007, diff_diff/honest_did.py:L568-L667, diff_diff/honest_did.py:L834-L860. Concrete fix: only auto-run sensitivity for helper-supported families, and add a dCDH-specific adapter that uses placebo_event_study / existing honest_did_results.

  • Severity: P3. Impact: the staggered-estimator diagonal-covariance downgrade for power wording is explicitly documented, so it is mitigated and not a defect under the review rules. docs/methodology/REPORTING.md:L102-L115. Concrete fix: none required for approval.

Code Quality

  • Severity: P2. Impact: the reserved placebo schema contract is internally inconsistent. REPORTING.md says the "placebo" key is always rendered as {"status": "skipped"} in MVP, but no result type includes "placebo" in _APPLICABILITY, so the implementation usually falls through to "not_applicable", and the new test explicitly blesses both. That weakens the stated “stable AI-legible schema” promise. docs/methodology/REPORTING.md:L36-L42, diff_diff/diagnostic_report.py:L72-L175, diff_diff/diagnostic_report.py:L435-L443, diff_diff/diagnostic_report.py:L561-L570, tests/test_diagnostic_report.py:L266-L271. Concrete fix: pick one status, make code/docs match, and assert the exact value in tests.

Performance

No material findings in the diff.

Maintainability

No separate maintainability findings beyond the over-broad applicability/support matrix issues above.

Tech Debt

No mitigation applies here. None of the P0/P1 items above are tracked under TODO.md’s “Tech Debt from Code Reviews” section, so TODO-tracking does not downgrade them. Ref: TODO.md:L51-L99.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the reporting tests only instantiate DiDResults, MultiPeriodDiDResults, CallawaySantAnnaResults, EfficientDiDResults, and SyntheticDiDResults, and the alpha regression only covers the analytic MultiPeriod case. They do not exercise the unsupported-family routing or the bootstrap/finite-df alpha override path, so the current suite would not catch the two approval-blocking issues above. tests/test_diagnostic_report.py:L76-L129, tests/test_diagnostic_report.py:L216-L258, tests/test_business_report.py:L496-L520. Concrete fix: add explicit report tests for SunAbraham, ImputationDiD, StackedDiD, WooldridgeDiD, StaggeredTripleDiff, and dCDH, plus one survey finite-df case and one percentile-bootstrap case with alpha != results.alpha.

Path to Approval

  1. Make BusinessReport’s alpha override method-aware: for bootstrap or finite-df inference surfaces, either recompute with the same stored inference contract or refuse the override instead of calling plain safe_inference(att, se, alpha=alpha).
  2. Narrow pretrends_power routing to helper-supported result types until adapters exist, and filter n_obs==0 synthetic reference rows out of _collect_pre_period_coefs() so support gating ignores reference markers.
  3. Narrow sensitivity routing to helper-supported result types, and add a dCDH-specific path that uses placebo_event_study / honest_did_results rather than the generic event-study pre-period collector.

P0 fix:

* **Alpha override was inference-contract-blind.** Previously, whenever
  the caller's ``alpha`` differed from the result's, BR recomputed the
  displayed CI via ``safe_inference(att, se, alpha=alpha)`` with no
  ``df`` and no bootstrap handling — silently discarding the
  ``bootstrap_distribution`` / finite-df inference contracts used by
  TROP, ContinuousDiD, dCDH-bootstrap, survey fits, SDiD jackknife,
  etc. BR now detects bootstrap-backed (``inference_method='bootstrap'``
  or non-None ``bootstrap_distribution`` or ``variance_method in
  {bootstrap, jackknife, placebo}``) and finite-df (``df_survey > 0``)
  inference paths and preserves the fitted CI at its native level in
  those cases, recording an informational caveat noting that the
  caller's alpha still drives phrasing but the native interval is
  shown. Regressions in ``TestAlphaOverrideBootstrapAndFiniteDF``
  cover both the bootstrap and finite-df survey paths.

P1 fixes:

* **``pretrends_power`` over-broad applicability.** The matrix had
  marked the check applicable for ImputationDiD, TwoStage, Stacked,
  EfficientDiD, StaggeredTripleDiff, Wooldridge, and dCDH, but
  ``compute_pretrends_power`` only has adapters for MultiPeriod, CS,
  and SA; the other families were landing in ``error``. Narrowed the
  applicability matrix to match the real helper support.

* **``sensitivity`` over-broad applicability.** HonestDiD only adapts
  MultiPeriod, CS, and dCDH (via ``placebo_event_study``). The matrix
  had also included SA / Imputation / Stacked / EfficientDiD /
  StaggeredTripleDiff / Wooldridge. Narrowed to the supported set. The
  dCDH-specific instance gate now checks ``placebo_event_study`` rather
  than the generic ``event_study_effects`` so HonestDiD's dCDH branch
  is reached instead of the generic event-study collector.

* **``n_obs == 0`` reference-marker filter.** Stacked / TwoStage /
  Imputation emit synthetic reference-period markers using ``n_obs=0``
  rather than CS / SA's ``n_groups=0`` flag. ``_collect_pre_period_coefs``
  now drops rows with either sentinel so the Bonferroni denominator
  and joint-Wald index are not inflated by non-informative rows.
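The narrowed applicability matrix from the first two bullets can be sketched as a class-name-keyed mapping. The rows below are illustrative, not a reproduction of the real ``_APPLICABILITY`` table; membership follows the narrowing described in these notes (power adapters for MultiPeriod / CS / SA; HonestDiD adapters for MultiPeriod / CS / dCDH).

```python
# Sketch of a narrowed applicability matrix; rows illustrative only.
_APPLICABILITY = {
    "MultiPeriodDiDResults": frozenset(
        {"parallel_trends", "pretrends_power", "sensitivity"}),
    "CallawaySantAnnaResults": frozenset(
        {"parallel_trends", "pretrends_power", "sensitivity"}),
    "SunAbrahamResults": frozenset(
        {"parallel_trends", "pretrends_power"}),  # no HonestDiD adapter
    "ChaisemartinDHaultfoeuilleResults": frozenset(
        {"parallel_trends", "sensitivity"}),  # via placebo_event_study
    "StackedDiDResults": frozenset({"parallel_trends"}),
}

def check_status(result_cls_name: str, check: str) -> str:
    """not_applicable instead of error for unsupported families (sketch)."""
    supported = _APPLICABILITY.get(result_cls_name, frozenset())
    return "applicable" if check in supported else "not_applicable"

assert check_status("StackedDiDResults", "pretrends_power") == "not_applicable"
assert check_status("SunAbrahamResults", "sensitivity") == "not_applicable"
```

Keying on helper support rather than on "has event-study output" is what turns the former ``error`` sections into honest ``not_applicable`` ones.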

P2 fix:

* **``placebo`` schema inconsistency.** ``REPORTING.md`` said
  ``placebo`` is always rendered as ``{"status": "skipped"}`` in MVP,
  but no result type had ``placebo`` in its applicability frozenset, so
  implementation fell through to ``"not_applicable"``. Now every
  DiagnosticReport.to_dict() returns ``placebo`` with ``status="skipped"``
  regardless of estimator, matching the stated contract.

Regression tests for each finding added in
``TestNarrowedApplicabilityAndPlaceboSchema`` and
``TestAlphaOverrideBootstrapAndFiniteDF``. 146 targeted tests pass;
black, ruff, mypy clean on the new modules.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 959f84e80d60c8b9f8a785d0177aa99aa83179aa


Overall Assessment

⛔ Blocker

Static review only. I could not execute the added tests in this workspace because numpy is not installed.

Executive Summary

  • The prior rerun’s main routing/filtering blockers look addressed: pretrends_power/sensitivity applicability now matches helper support, dCDH sensitivity gates on placebo_event_study, CS PT can use event_study_vcov, and pre-period reference-marker filtering now handles both n_groups == 0 and n_obs == 0.
  • A new blocker remains in BusinessReport’s alpha override: wild-bootstrap DiDResults are still treated as analytical, so alpha != results.alpha silently replaces the stored wild-bootstrap CI with a normal-approximation CI.
  • BusinessReport’s new “Identifying Assumption” section is not source-faithful for at least ContinuousDiD, TripleDifference, and StaggeredTripleDiff; it falls back to generic DiD/group-time PT language that contradicts the Methodology Registry and in-code guidance.
  • The new tests cover analytical/bootstrap/finite-d.f. alpha handling, but not the wild_bootstrap path, and they do not assert estimator-specific assumption text for the nonstandard designs above.

Methodology

  • Severity: P0. Impact: BusinessReport(..., alpha=...) preserves native CIs for "bootstrap" and survey finite-d.f. fits, but not for DifferenceInDifferences(..., inference="wild_bootstrap"). Those results store inference_method="wild_bootstrap" and a percentile-bootstrap CI, yet BR falls through to safe_inference(att, se, alpha=alpha), which silently swaps in a normal-theory interval while leaving the bootstrap p-value/significance surface in place. That is wrong statistical output with no warning. Concrete fix: treat inference_method == "wild_bootstrap" as bootstrap-like and preserve the fitted CI/level, or reject alpha overrides whenever the native inference surface cannot be faithfully recreated. Refs: diff_diff/business_report.py:L333-L390, diff_diff/estimators.py:L440-L490, diff_diff/utils.py:L617-L638.

  • Severity: P1. Impact: BusinessReport renders ContinuousDiDResults with the fallback “standard DiD parallel-trends assumption plus no anticipation” text, but the registry says ContinuousDiD uses dose-specific PT / strong PT with different identified targets. That is an undocumented methodology mismatch in a new public-facing assumption block. Concrete fix: add a ContinuousDiDResults branch in _describe_assumption() that mirrors the registry’s PT/SPT language, or suppress the assumption section until a vetted summary exists. Refs: diff_diff/business_report.py:L634-L640, docs/methodology/REGISTRY.md:L663-L677.

  • Severity: P1. Impact: TripleDifferenceResults and StaggeredTripleDiffResults are described as ordinary/group-time PT designs, but the registry and practitioner guidance say DDD identification is weaker than separate DiD parallel trends and depends on the triple-difference cancellation structure across cells/cohorts. The new “Identifying Assumption” section therefore states the wrong identifying logic for these methods. Concrete fix: add dedicated DDD assumption text for both result classes, or omit the assumption section for DDD until a source-backed summary is written. Refs: diff_diff/business_report.py:L614-L632, docs/methodology/REGISTRY.md:L1556-L1585, docs/methodology/REGISTRY.md:L1654-L1719, diff_diff/practitioner.py:L722-L739.

Code Quality

  • No material findings beyond the methodology issues above.

Performance

  • No material findings in the changed files.

Maintainability

  • No separate maintainability findings beyond the incorrect assumption mapping above.

Tech Debt

  • No mitigation applies. Neither of the P0/P1 issues above is tracked in TODO.md’s “Tech Debt from Code Reviews” section, so they remain unmitigated for assessment purposes. Ref: TODO.md:L51-L99.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new BR tests cover analytic/bootstrap/finite-d.f. alpha overrides, but not inference="wild_bootstrap", and there are no assertions that the assumption block is estimator-specific for ContinuousDiD, TripleDifference, or StaggeredTripleDiff. The current suite would not catch the two approval-blocking issues above. Concrete fix: add one DifferenceInDifferences(inference="wild_bootstrap") alpha-mismatch regression, plus direct BusinessReport.to_dict()["assumption"] assertions for the nonstandard estimators against registry-backed wording. Refs: tests/test_business_report.py:L496-L606, tests/test_diagnostic_report.py:L854-L874.

Path to Approval

  1. In BusinessReport._extract_headline(), preserve the native CI when results.inference_method == "wild_bootstrap" (or more generally when the native inference surface is non-analytic and cannot be recreated), instead of calling safe_inference(att, se, alpha=alpha). Add a regression test using a wild-bootstrap DiDResults.
  2. Replace _describe_assumption()’s generic fallback/group-time text with source-backed mappings for ContinuousDiDResults, TripleDifferenceResults, and StaggeredTripleDiffResults, or suppress the assumption section for unmapped estimators. Add direct tests of the emitted assumption block for those classes.

P0 fix:

* **``inference_method == 'wild_bootstrap'`` was not detected as
  bootstrap-like.** My prior bootstrap check caught ``'bootstrap'`` and
  ``variance_method in {bootstrap, jackknife, placebo}`` plus an
  attached ``bootstrap_distribution``, but ``DifferenceInDifferences(
  inference='wild_bootstrap')`` returns ``inference_method='wild_bootstrap'``
  and a percentile-bootstrap CI without necessarily attaching the raw
  distribution. The override path silently replaced that CI with a
  normal-approximation one. Fixed by matching both
  ``'bootstrap'`` and ``'wild_bootstrap'``; the preserved-CI caveat
  now calls out "wild cluster bootstrap" specifically when that path
  triggered. Regression: ``TestWildBootstrapAlphaOverride``.

P1 fix:

* **``_describe_assumption()`` emitted generic DiD PT text for
  ContinuousDiD / TripleDifference / StaggeredTripleDiff**, all of
  which have identifying logic different from ordinary group-time PT
  per ``docs/methodology/REGISTRY.md``. Replaced the generic fallback
  with source-backed branches:

  - ``ContinuousDiDResults``: two-level parallel trends (PT vs Strong
    PT) per Callaway, Goodman-Bacon & Sant'Anna (2024), with explicit
    mention of ATT(d|d), ATT(d), ACRT identification sets.
  - ``TripleDifferenceResults`` / ``StaggeredTripleDiffResults``:
    triple-difference cancellation across the 2x2x2 cells per
    Ortiz-Villavicencio & Sant'Anna (2025); notes that identification
    is weaker than ordinary DiD PT and depends on additive
    separability across the three dimensions.

  The ``parallel_trends_variant`` schema field gains two new values:
  ``"dose_pt_or_strong_pt"`` and ``"triple_difference_cancellation"``.
  Direct regressions in ``TestAssumptionBlockSourceFaithful`` assert
  registry-backed language (attribution phrases + method names) is
  present and generic group-time PT text is absent.

150 targeted tests pass; black, ruff, mypy clean on the new modules.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 18, 2026

/ai-review

1 similar comment

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 345f65cc492306a995fc89423aec85ae5c4eb340


Overall Assessment

⛔ Blocker

Static review only. The rerun fixes the narrow wild_bootstrap path and the previously flagged ContinuousDiD / DDD assumption text, but one broader statistical-output bug remains and two methodology-reporting issues are still unmitigated.

Executive Summary

  • The prior rerun’s specific wild_bootstrap CI issue and the ContinuousDiD / TripleDifference / StaggeredTripleDiff assumption-text regressions are addressed.
  • BusinessReport(alpha=...) still silently swaps in normal-theory confidence intervals for most bootstrap-backed result classes that store bootstrap inference in top-level fields rather than inference_method; that is a P0 output bug.
  • DiagnosticReport drops methodology-critical runtime warnings from delegated diagnostics, so default CallawaySantAnna(base_period='varying') fits can be narrated as HonestDiD-robust with no caveat even though the helper explicitly warns that those bounds are not valid for interpretation.
  • BusinessReport still describes dCDH as a generic group-time parallel-trends estimator, which is not source-faithful to the registry’s joiner/leaver transition-based identification.
  • Tests and docs do not cover the remaining bootstrap-surface bug or the varying-base HonestDiD warning path, and the new README example uses the unsafe Callaway-Sant’Anna default.

Methodology

  • Severity: P0. Impact: BusinessReport._extract_headline() only preserves native intervals when it sees inference_method, bootstrap_distribution, variance_method, or positive df_survey. Most bootstrap-backed result classes here do not expose those markers; instead they copy bootstrap p_value/conf_int into the top-level result fields during fit. As a result, BusinessReport(..., alpha=...) can still show a bootstrap p-value/significance label next to a freshly recomputed normal-approximation CI with no warning for at least Callaway-Sant’Anna, Sun-Abraham, ImputationDiD, TwoStageDiD, EfficientDiD, StaggeredTripleDiff, ContinuousDiD, and dCDH. Concrete fix: treat bootstrap_results is not None, n_bootstrap > 0, and analogous bootstrap-backed top-level surfaces as non-recreatable inference contracts and preserve the fitted CI level (or reject alpha override) across all such result classes. Refs: diff_diff/business_report.py:333, diff_diff/staggered.py:1877, diff_diff/continuous_did.py:480, diff_diff/chaisemartin_dhaultfoeuille_results.py:158.

  • Severity: P1. Impact: DiagnosticReport auto-runs HonestDiD.sensitivity_analysis() but only records exceptions; non-fatal methodology warnings are discarded. For CallawaySantAnnaResults, HonestDiD explicitly warns that base_period='varying' is not valid for interpreted bounds, and Callaway-Sant’Anna defaults to base_period='varying'. The new README example and test fixture use that default, so BR/DR can narrate sensitivity findings as if they were valid with no caveat. That is a missing assumption/warning check on a methodology-sensitive path. Concrete fix: capture runtime warnings from delegated diagnostics, surface them in the schema/report text, and skip or mark HonestDiD sensitivity as inconclusive unless base_period='universal' (or the user explicitly opts into the warned path). Refs: diff_diff/diagnostic_report.py:997, diff_diff/diagnostic_report.py:617, diff_diff/honest_did.py:675, diff_diff/staggered.py:295, docs/methodology/REGISTRY.md:2230.

  • Severity: P1. Impact: BusinessReport still routes ChaisemartinDHaultfoeuilleResults through the generic “group-time ATT parallel trends + no anticipation” text. The registry describes dCDH as a reversible-treatment estimator built from joiner/leaver/stable-control transition comparisons (DID_M, DID_l), not a group-time ATT design. That is a public identifying-assumption mismatch in the new assumption block. Concrete fix: add a dCDH-specific assumption description grounded in the registry’s transition-set language, or suppress the assumption section for dCDH until a vetted summary exists. Refs: diff_diff/business_report.py:666, docs/methodology/REGISTRY.md:460.

Code Quality

  • No material findings beyond the methodology/reporting issues above.

Performance

  • No material findings in the changed files.

Maintainability

  • No separate maintainability findings beyond the missing coverage and reporting-surface issues below.

Tech Debt

  • No mitigation applies. None of the blocker items above is tracked under “Tech Debt from Code Reviews” in TODO.md:51, and P0/P1 findings would not be downgraded by TODO tracking anyway.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new regressions only cover bootstrap cases that advertise inference_method or a raw bootstrap distribution; they do not cover the real bootstrap-backed result shapes that expose bootstrap inference via top-level overall_conf_int/overall_p_value only, and they do not assert any caveat/skip behavior for Callaway-Sant’Anna base_period='varying' HonestDiD. The README example also uses the varying-base default while advertising one-call sensitivity. Concrete fix: add regressions with representative bootstrapped CallawaySantAnnaResults, ContinuousDiDResults, and ChaisemartinDHaultfoeuilleResults shapes, plus a CS varying-base DR/BR test that requires a surfaced warning or skipped sensitivity; update docs/examples to use base_period='universal' whenever sensitivity is shown. Refs: tests/test_business_report.py:523, tests/test_business_report.py:609, tests/test_business_report.py:90, README.md:99.

Path to Approval

  1. In BusinessReport._extract_headline(), preserve native CI surfaces for all bootstrap-backed result classes, not just those with inference_method / variance_method markers. Add regressions for at least one staggered bootstrap result, one ContinuousDiDResults bootstrap result, and one dCDH bootstrap result.
  2. Capture and surface delegated diagnostic warnings in DiagnosticReport and BusinessReport. For CallawaySantAnnaResults with base_period!='universal', do not narrate HonestDiD robustness as ordinary “computed” sensitivity without a prominent warning or skip.
  3. Replace _describe_assumption()’s generic ChaisemartinDHaultfoeuilleResults text with source-backed dCDH identification language, or suppress the assumption block for dCDH until that text is written.
  4. Update docs/tests to cover the two paths above, and change any CS sensitivity example to base_period='universal' or disable auto sensitivity.

- P0: ``_extract_headline`` now detects ``bootstrap_results is not None``
  and ``n_bootstrap > 0`` in addition to ``inference_method`` /
  ``bootstrap_distribution`` / ``variance_method`` / ``df_survey``.
  Many staggered / continuous / dCDH result classes copy bootstrap-
  derived se/p/conf_int into their top-level fields without advertising
  ``inference_method``; alpha override must preserve their fitted CI
  rather than silently swapping in a normal-approximation interval.
- P1: ``DiagnosticReport._check_sensitivity`` wraps the HonestDiD call
  in ``warnings.catch_warnings(record=True)`` and propagates captured
  messages onto the returned section dict. ``run_all`` aggregates
  per-section warnings into the top-level ``warnings`` list so both DR
  and BR surface them. CallawaySantAnna fits with
  ``base_period='varying'`` are preemptively skipped at the applicability
  gate with a methodology-critical reason, since HonestDiD explicitly
  warns those bounds are not valid for interpretation. BR renders the
  skip as a warning-severity caveat under a new ``sensitivity_skipped``
  topic.
- P1: ``_describe_assumption`` now gives
  ``ChaisemartinDHaultfoeuilleResults`` a source-backed description of
  transition-based identification (joiners / leavers / stable-control
  transitions, DID_M / DID_l building blocks, non-binary dose matching,
  reversible treatment) rather than generic group-time ATT PT text.
- P2: README example now uses
  ``CallawaySantAnna(base_period='universal')`` so the advertised
  one-call sensitivity path actually runs. Both ``cs_fit`` fixtures
  updated likewise.
- Regressions:
  ``TestBootstrapResultsAndNBootstrapDetection`` (four cases incl.
  dCDH-shaped stub and the analytic zero-bootstrap guard),
  ``TestDCDHAssumptionTransitionBased`` (source-faithful language
  assertions), ``TestCSVaryingBaseSensitivitySkipped`` (DR schema reason
  + BR caveat surfacing).
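The warning-capture pattern from the second bullet is standard library machinery; a minimal sketch, with the check function and section dict shaped per the notes above rather than the real DiagnosticReport internals:

```python
import warnings

def run_check_with_warnings(check_fn):
    """Run a delegated diagnostic while recording non-fatal warnings so
    they can be surfaced in the report schema (sketch of the
    catch_warnings(record=True) pattern described above)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # keep duplicate warnings visible
        section = check_fn()
    section["warnings"] = [str(w.message) for w in caught]
    return section

def noisy_check():
    # Stand-in for a delegated helper that warns instead of raising.
    warnings.warn("base_period='varying' bounds are not valid for interpretation")
    return {"status": "computed"}

section = run_check_with_warnings(noisy_check)
assert section["status"] == "computed"
assert any("not valid" in msg for msg in section["warnings"])
```

Recording (rather than suppressing) is the point: a "computed" status with an attached methodology warning is narratable, whereas a silently discarded warning lets BR overclaim robustness.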

150 → 115 targeted tests passing; black / ruff / mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 18, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 311a7bec980510beb3dd315ff2c1881a396c8079


Overall Assessment

⚠️ Needs changes

Static review only. I could not run the added tests locally in this environment because numpy is unavailable.

Executive Summary

  • The prior P1+ items from the last review look addressed: native bootstrap/finite-df CIs are now preserved on alpha override, varying-base Callaway-Sant’Anna HonestDiD is skipped with a methodology note, and the dCDH assumption text now matches the registry.
  • Severity: P1. The new heterogeneity diagnostic can mix pre-period placebo coefficients and synthetic reference rows into post-treatment “heterogeneity” summaries, which silently changes the estimand being described.
  • Severity: P1. DiagnosticReport says dCDH supports a parallel-trends check, but the actual pre-period extractor never reads placebo_event_study, so the assumption check is skipped even when valid placebo horizons are present.
  • Severity: P2. The API docs still use default CallawaySantAnna() examples despite the new universal-base sensitivity rule, and the reporting methodology doc overstates some pretrends-power routing relative to the implemented applicability matrix.
  • The new test files cover the earlier regressions well, but they do not pin the two new methodology paths above.

Methodology

  • Severity: P1. Impact: _check_heterogeneity() computes range/CV/sign consistency from _collect_effect_scalars(), and that helper falls back to raw event_study_effects or period_effects when group_effects is absent. For MultiPeriodDiDResults, period_effects explicitly contains both pre- and post-treatment coefficients; for staggered event-study outputs, event_study_effects includes relative-time coefficients and can include synthetic reference rows that other helpers already filter out. That means the new “heterogeneity” block can be driven by placebo/reference terms rather than treatment effects, producing false sign flips or distorted dispersion with no warning. Concrete fix: restrict heterogeneity inputs to post-treatment effect surfaces only (post_period_effects for MultiPeriodDiDResults; positive/identified event-study horizons only for staggered fits, excluding n_groups==0 / n_obs==0 markers), or rename/split this into a separate dynamic-effects diagnostic. Refs: diff_diff/diagnostic_report.py:1236, diff_diff/diagnostic_report.py:1421, diff_diff/results.py:329, diff_diff/results.py:394, diff_diff/diagnostic_report.py:1796.
  • Severity: P1. Impact: dCDH is marked as parallel_trends-applicable and routed to the event-study PT path, but _collect_pre_period_coefs() only knows how to read pre_period_effects or negative keys inside event_study_effects. dCDH stores its pre-period surface in placebo_event_study, and the rest of the library already treats that as the negative-horizon event-study analogue. As written, BR/DR will skip the dCDH assumption check even when valid placebo horizons exist. That is a missing assumption check on a methodology-sensitive path. Concrete fix: special-case ChaisemartinDHaultfoeuilleResults so the PT check reads finite-SE negative horizons from placebo_event_study and runs the same Wald/Bonferroni logic, or remove dCDH from PT applicability until that path exists. Refs: diff_diff/diagnostic_report.py:165, diff_diff/diagnostic_report.py:195, diff_diff/diagnostic_report.py:458, diff_diff/diagnostic_report.py:1745, diff_diff/chaisemartin_dhaultfoeuille_results.py:411, diff_diff/visualization/_event_study.py:670.
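The post-treatment restriction proposed in the first finding can be sketched as a row filter; keys mirror the review text (``rel_time``, ``n_groups``, ``n_obs``), not the real result schema.

```python
import math

def post_treatment_rows(event_study_rows: list[dict]) -> list[dict]:
    """Restrict heterogeneity inputs to identified post-treatment
    horizons (sketch of the proposed fix; keys illustrative)."""
    kept = []
    for row in event_study_rows:
        if row.get("rel_time", -1) < 0:
            continue  # pre-period / placebo horizon: a different estimand
        if row.get("n_groups") == 0 or row.get("n_obs") == 0:
            continue  # synthetic reference marker
        effect = row.get("effect")
        if effect is None or not math.isfinite(effect):
            continue
        kept.append(row)
    return kept

rows = [
    {"rel_time": -1, "effect": 0.0, "n_groups": 0},   # reference marker
    {"rel_time": -2, "effect": 0.05, "n_groups": 8},  # placebo coefficient
    {"rel_time": 0, "effect": 0.4, "n_groups": 8},
    {"rel_time": 1, "effect": 0.5, "n_groups": 7},
]
assert [r["rel_time"] for r in post_treatment_rows(rows)] == [0, 1]
```

Mixing the first two rows into a range/CV/sign summary is exactly the estimand drift the finding describes: dispersion would then reflect placebo noise, not treatment-effect heterogeneity.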

Code Quality

  • No material findings beyond the methodology issues above.

Performance

  • No material findings in the changed files.

Maintainability

  • No material maintainability finding beyond the two methodology bugs above.

Tech Debt

  • No independent deferrable finding. The two P1 issues above are not tracked under TODO.md:51, and they would not be mitigated by TODO tracking under the review policy anyway.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Change heterogeneity extraction so it only uses post-treatment effect surfaces, and explicitly filters out pre-period/placebo/reference-marker rows.
  2. Implement the dCDH parallel-trends check on placebo_event_study negative horizons, or remove dCDH from PT applicability until that path is implemented.
  3. Add regressions for both fixes, and update the public CallawaySantAnna auto-diagnostics examples to use base_period="universal" where sensitivity is being showcased.
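Step 2 above, reading finite-SE negative horizons from placebo_event_study, might look roughly like this sketch. The mapping shape (horizon keyed to estimate/se dicts) and all field names are assumptions for illustration, not the dCDH result's actual layout.

```python
import math

def pre_period_coefs_from_placebo(placebo_event_study):
    """Collect placebo (negative-horizon) coefficients with usable SEs
    so the same Wald/Bonferroni pre-trends logic can run on dCDH fits.
    """
    coefs = []
    for horizon in sorted(placebo_event_study):
        if horizon >= 0:
            continue  # only pre-treatment placebo horizons qualify
        row = placebo_event_study[horizon]
        se = row.get("se")
        if se is None or not math.isfinite(se) or se <= 0:
            continue  # degenerate SE: drop from the joint test
        coefs.append((horizon, row["estimate"], se))
    return coefs

placebo = {
    -1: {"estimate": 0.02, "se": 0.05},
    -2: {"estimate": -0.01, "se": float("inf")},  # dropped: infinite SE
    -3: {"estimate": 0.03, "se": 0.04},
}
print(pre_period_coefs_from_placebo(placebo))
# [(-3, 0.03, 0.04), (-1, 0.02, 0.05)]
```

Returning an empty list when no valid horizons exist gives the caller a natural signal to skip the check with an explicit reason rather than silently.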

- P1 heterogeneity: ``_collect_effect_scalars`` no longer mixes
  pre- and post-treatment coefficients into the CV / range /
  sign-consistency summary. ``MultiPeriodDiDResults`` now routes through
  ``post_period_effects``; staggered event-study fits filter to
  ``rel_time >= 0`` AND exclude reference markers
  (``n_groups == 0`` / ``n_obs == 0``) AND exclude non-finite rows;
  CS ``group_time_effects`` filters to ``t >= g`` post-treatment cells.
  ``_heterogeneity_source`` now names the post-only surface (e.g.,
  ``post_period_effects`` / ``event_study_effects_post`` /
  ``group_time_effects_post``) so downstream tooling can verify the
  estimand being summarized.
- P1 dCDH parallel trends: ``_collect_pre_period_coefs`` now reads
  ``placebo_event_study`` as the pre-period surface for
  ``ChaisemartinDHaultfoeuilleResults``. dCDH is advertised as
  PT-applicable in ``_APPLICABILITY`` but the extractor previously
  only looked at ``pre_period_effects`` / negative-horizon
  ``event_study_effects``, silently skipping the PT check on valid
  placebo fits.
- P2: API RST examples (``docs/api/business_report.rst``,
  ``docs/api/diagnostic_report.rst``) updated to construct
  ``CallawaySantAnna(base_period="universal")`` so the advertised
  auto-diagnostics path runs sensitivity instead of being skipped.
  ``docs/methodology/REPORTING.md`` pretrends-power routing text now
  matches the implemented applicability matrix ({MultiPeriod, CS, SA})
  rather than claiming general "event-study with vcov" applicability.
- Regressions:
  ``TestDCDHParallelTrendsViaPlaceboEventStudy`` (two cases: runs when
  ``placebo_event_study`` populated, skips when missing) and
  ``TestHeterogeneityPostTreatmentOnly`` (extractor returns post-only
  scalars for MultiPeriod; event-study filter drops pre-period and
  reference-marker rows).

115 -> 109 targeted tests passing; black / ruff / mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: df2636bde8210935f1946f33eab78233773a8b2c


Overall Assessment

⛔ Blocker

Static review only. I could not execute the new tests locally because numpy is unavailable in this sandbox.

Executive Summary

  • The prior methodology blockers from the last review appear fixed: dCDH PT now reads placebo_event_study, and heterogeneity now restricts itself to post-treatment surfaces instead of mixing in pre/reference rows (diff_diff/diagnostic_report.py:L1828-L1932, diff_diff/diagnostic_report.py:L1421-L1566).
  • The explicit reporting-layer deviations are properly documented in docs/methodology/REPORTING.md and are not defects under the review policy.
  • P0: BusinessReport(alpha != results.alpha) still silently breaks native inference contracts for ordinary finite-d.f. analytical fits, and it can even invent a finite CI when the underlying fit’s inference is undefined.
  • P2: DiagnosticReport.pretrends_power["power_at_M_1"] is mislabeled; it currently stores power at violation_magnitude (default M = mdv), not at M = 1.0.
  • The new regression suite covers many alpha-override cases, but it still does not pin the standard analytical finite-d.f. path that causes the blocker above.

Methodology

  • P0. Impact: BusinessReport._extract_headline() preserves native CIs only when it detects bootstrap-like inference or survey_metadata.df_survey > 0; otherwise it recomputes via safe_inference(att, se, alpha=alpha) with no df, which means a z-based CI. That is not source-faithful for several existing result surfaces. Ordinary analytical DiDResults inherit finite-d.f. t-based inference from LinearRegression.get_inference(), MultiPeriodDiD explicitly uses safe_inference(..., df=df), and TROP uses df_trop = max(1, n_valid_treated - 1). On an alpha override, BR therefore silently swaps the fitted CI family to a z-based one. Worse, replicate-survey paths deliberately use df=0 so safe_inference() returns all-NaN inference, but BR can still recompute and display a finite CI on override. That is wrong statistical output with no warning (diff_diff/business_report.py:L333-L421, diff_diff/linalg.py:L2073-L2163, diff_diff/estimators.py:L1330-L1435, diff_diff/trop.py:L849-L851, diff_diff/estimators.py:L430-L439, diff_diff/utils.py:L175-L209). Concrete fix: do not recompute a CI unless the result explicitly exposes enough metadata to reproduce its native inference contract. At minimum, preserve the fitted CI whenever the fit used any finite-d.f. or undefined-d.f. path; the robust fix is to propagate an explicit result-level inference descriptor such as inference_df / inference_distribution and branch on that.
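The "preserve unless reproducible" rule in the concrete fix above can be sketched as follows. The `inference_distribution` argument plays the role of the result-level inference descriptor the review proposes; it is a hypothetical field, not an existing attribute, and the helper itself is illustrative.

```python
from statistics import NormalDist

def headline_ci(att, se, fitted_ci, native_alpha, requested_alpha,
                inference_distribution=None):
    """Return (ci, ci_level). Recompute only when the native inference
    contract is a plain normal approximation that (att, se) can
    reproduce; otherwise preserve the fitted CI at its native level.
    """
    if requested_alpha == native_alpha:
        return fitted_ci, 1 - native_alpha
    if inference_distribution == "normal" and se is not None:
        z = NormalDist().inv_cdf(1 - requested_alpha / 2)
        return (att - z * se, att + z * se), 1 - requested_alpha
    # finite-df t, bootstrap, or undefined-df fits: never swap the
    # inference family behind the caller's back
    return fitted_ci, 1 - native_alpha

# t-based fit: an alpha override preserves the fitted CI and level
print(headline_ci(1.0, 0.5, (0.02, 1.98), 0.05, 0.10,
                  inference_distribution="t"))
# ((0.02, 1.98), 0.95)
```

Under this rule an undefined-d.f. replicate-survey fit whose fitted CI is `(nan, nan)` simply passes the NaN surface through instead of inventing a finite interval.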

Code Quality

  • P2. Impact: DiagnosticReport._check_pretrends_power() writes pp.power into power_at_M_1, but PreTrendsPowerResults.power is defined at violation_magnitude, and when M=None the helper sets violation_magnitude = mdv. So the schema advertises “power at M=1” while actually storing power at MDV (or another supplied M). That is a silent schema-contract mismatch for to_dict() consumers (diff_diff/diagnostic_report.py:L976-L987, diff_diff/pretrends.py:L43-L54, diff_diff/pretrends.py:L843-L848). Concrete fix: either rename the field to match the actual quantity and also expose violation_magnitude, or explicitly compute/store a separate power_at_M_1.
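A corrected schema block per the suggested fix might look like this sketch. The dataclass is a minimal stand-in for PreTrendsPowerResults, and the output field names beyond those quoted in the finding are illustrative.

```python
from dataclasses import dataclass

@dataclass
class PreTrendsPower:  # minimal stand-in for PreTrendsPowerResults
    power: float                # power evaluated AT violation_magnitude
    violation_magnitude: float  # defaults to the MDV when M is None
    mdv: float

def pretrends_power_block(pp):
    """Schema that names the quantity actually stored, instead of the
    misleading power_at_M_1 label."""
    return {
        "power_at_violation_magnitude": pp.power,
        "violation_magnitude": pp.violation_magnitude,
        "minimum_detectable_violation": pp.mdv,
    }

pp = PreTrendsPower(power=0.62, violation_magnitude=0.8, mdv=0.8)
print(pretrends_power_block(pp)["power_at_violation_magnitude"])  # 0.62
```

Exposing `violation_magnitude` alongside the power lets `to_dict()` consumers see exactly which M the statistic was evaluated at.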

Performance

No findings.

Maintainability

No independent findings beyond the schema-label issue above.

Tech Debt

No independent findings. The blocker above is not tracked under TODO.md:L51, and it would not be mitigated by TODO tracking under the review policy.

Security

No findings.

Documentation/Tests

  • P2. Impact: the new alpha-override regressions cover bootstrap, wild-bootstrap, positive df_survey, and bootstrap-marker detection, but they still do not cover the ordinary analytical finite-d.f. path (DiDResults / MultiPeriodDiD / TROP) or the undefined-d.f. survey path. That left the blocker above unpinned (tests/test_business_report.py:L499-L609, tests/test_business_report.py:L805-L869). Concrete fix: add regressions asserting that BusinessReport(alpha != results.alpha) preserves native CI semantics for an analytical t-based fit and keeps all inference fields NaN when the underlying fit used df=0 / undefined replicate d.f.

Path to Approval

  1. Fix BusinessReport._extract_headline() so alpha overrides never recompute z-based CIs for fits whose native inference used finite d.f. or undefined d.f. If BR cannot reconstruct the original inference contract exactly, it should preserve the fitted CI instead of changing distributions.
  2. Add targeted regressions for BusinessReport(alpha != results.alpha) on at least one analytical finite-d.f. result surface and one undefined-d.f. survey surface.
  3. Correct the power_at_M_1 schema field, either by renaming it to the quantity actually returned or by computing a real M=1.0 power value.

- P0 alpha override: ``BusinessReport._extract_headline`` no longer
  recomputes a normal-z CI when the caller's ``alpha`` differs from
  the fit's native alpha. Recomputing via
  ``safe_inference(att, se, alpha)`` silently swapped t-based
  inference (``DifferenceInDifferences`` via
  ``LinearRegression.get_inference()``, ``MultiPeriodDiD`` via
  ``safe_inference(..., df=df)``, TROP's ``df_trop``) for a normal-z
  CI, and could invent a finite CI on undefined-df
  (replicate-survey ``df_survey=0``) fits whose native inference
  was NaN. The
  fitted CI is now preserved at its native level on any alpha
  mismatch; a ``display_alpha`` / ``phrasing_alpha`` split keeps the
  CI level at the fit's native level while significance phrasing
  (``is_significant``, ``near_significance_threshold``) uses the
  caller's requested alpha. The inference label in the override
  caveat now distinguishes bootstrap, wild bootstrap, jackknife /
  placebo, finite-df survey, undefined-df replicate, and ordinary
  analytical (native degrees of freedom).
- P2 schema mislabel: ``DiagnosticReport`` pretrends-power section
  renames ``power_at_M_1`` to ``power_at_violation_magnitude`` and
  adds an explicit ``violation_magnitude`` field. The underlying
  ``PreTrendsPowerResults.power`` is power at ``violation_magnitude``
  (which defaults to the MDV when the caller passes ``M=None``),
  not power at ``M=1.0`` as the prior label implied.
- Test updates: the round-2
  ``test_ci_bounds_recomputed_when_alpha_differs_from_result``
  assumed recomputation was the correct
  behavior; renamed to
  ``test_alpha_mismatch_preserves_fitted_ci_at_native_level`` and
  inverted the bounds expectations. ``test_alpha_drives_ci_level``
  narrowed to the equal-alpha case.
  ``test_n_bootstrap_zero_does_not_trigger_preserve_path`` replaced
  by ``test_n_bootstrap_zero_still_preserves_on_alpha_mismatch``.
  New ``TestAnalyticalFiniteDfAlphaOverride`` pins the P0 fix on
  real ``DifferenceInDifferences`` / ``MultiPeriodDiD`` fits and on
  a ``df_survey=0`` stub (NaN CI must propagate).
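The ``display_alpha`` / ``phrasing_alpha`` split described in the first bullet can be sketched as below. This is a minimal illustration under assumed names; only `is_significant` and `near_significance_threshold` come from the text above, the rest is hypothetical.

```python
def alpha_split(p_value, native_alpha, requested_alpha, near_band=0.02):
    """CI level stays at the fit's native level; significance phrasing
    follows the caller's requested alpha."""
    return {
        "ci_level": 1 - native_alpha,  # display side: fitted CI unchanged
        "is_significant": p_value < requested_alpha,  # phrasing side
        "near_significance_threshold": (
            requested_alpha <= p_value < requested_alpha + near_band
        ),
    }

print(alpha_split(p_value=0.07, native_alpha=0.05, requested_alpha=0.10))
```

A p-value of 0.07 is phrased as significant at the caller's alpha of 0.10 while the reported interval remains the fitted 95% CI.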

112 targeted tests passing; black / ruff / mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d8fa66d094ec8a193f67fc3907b84f282842a28d


Overall Assessment

⛔ Blocker

Static review only. I could not execute the new tests in this sandbox because numpy/pytest are unavailable.

Executive Summary

  • The earlier dCDH-specific issues from the prior review look addressed; the remaining problems are elsewhere in the new reporting layer.
  • P0: BusinessReport(alpha != results.alpha) is still not source-faithful for ordinary analytical finite-d.f. fits and undefined replicate-d.f. survey fits. It can silently swap the fit’s native inference contract for a normal-approximation CI.
  • P1: The new prose layer ignores method-specific PT semantics. It narrates 2x2 slope-difference checks and EfficientDiD’s Hausman PT-All vs PT-Post comparison as if they were generic pre-period event-study tests, and BusinessReport also collapses EfficientDiD’s identifying assumption to generic group-time PT.
  • P2: DiagnosticReport.pretrends_power["power_at_M_1"] is still mislabeled; it stores power at violation_magnitude, not necessarily at M=1.
  • P2: The new regression suite still locks in the wrong analytical alpha-override behavior, so CI would not catch the blocker above.

Methodology

  • Severity: P0. BusinessReport._extract_headline() still falls through to safe_inference(att, se, alpha=alpha) when it does not detect bootstrap or positive df_survey (diff_diff/business_report.py:333, diff_diff/business_report.py:415). That is not source-faithful for ordinary analytical fits: DiDResults inherit fitted regression d.f. from LinearRegression.get_inference() (diff_diff/linalg.py:2120, diff_diff/linalg.py:2140), MultiPeriodDiD explicitly uses safe_inference(..., df=df) (diff_diff/estimators.py:1389), and replicate-d.f.-undefined survey paths deliberately return NaN inference even though survey_metadata.df_survey is reset to None (diff_diff/estimators.py:436, diff_diff/estimators.py:437). Impact: on BusinessReport(alpha != results.alpha), the report can silently replace a t-based or undefined inference surface with a z-based finite CI. Concrete fix: do not recompute CIs on alpha mismatch unless the result exposes an explicit, reproducible inference descriptor; otherwise preserve the fitted CI/NaN surface and keep the requested alpha only for significance phrasing/schema flags.

  • Severity: P1. The PT/assumption prose is not method-aware. DiagnosticReport.summary() and BusinessReport.summary() hard-code “pre-treatment event-study coefficients/data” wording for any ran PT result (diff_diff/diagnostic_report.py:2058, diff_diff/business_report.py:1147), even though the same section can be populated by a 2x2 slope-difference check (diff_diff/diagnostic_report.py:786) or EfficientDiD’s Hausman PT-All vs PT-Post comparison (diff_diff/diagnostic_report.py:1629). Separately, BusinessReport._describe_assumption() groups EfficientDiDResults under generic group-time PT text (diff_diff/business_report.py:716) even though the registry treats PT-Post and PT-All as distinct identification regimes and documents the Hausman pretest on the post-treatment ES vector (docs/methodology/REGISTRY.md:736, docs/methodology/REGISTRY.md:907). Impact: DiD/EfficientDiD reports describe the wrong hypothesis and, for EfficientDiD, the wrong identifying assumption. Concrete fix: branch prose on parallel_trends.method (slope_difference, joint_wald, hausman, synthetic_fit) and build the EfficientDiD assumption block from results.pt_assumption (and ideally control_group).
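The method-aware branching this fix calls for can be sketched with a subject lookup keyed on parallel_trends.method. The subject strings follow the methods named in the finding; the lookup table and helper are illustrative sketches, not the library's code.

```python
_PT_SUBJECT = {
    "slope_difference": "The pre-period slope-difference test",
    "hausman": "The Hausman PT-All vs PT-Post pretest",
    "joint_wald": "Pre-treatment event-study coefficients",
    "synthetic_fit": "The synthetic-control pre-treatment fit",
}

def pt_sentence(method, p_value):
    """Build a source-faithful PT sentence instead of hard-coding
    event-study wording for every method."""
    subject = _PT_SUBJECT.get(method, "Pre-treatment event-study coefficients")
    stat = "joint p" if method == "joint_wald" else "p"
    return f"{subject} ({stat} = {p_value:.3f})"

print(pt_sentence("hausman", 0.41))
# The Hausman PT-All vs PT-Post pretest (p = 0.410)
```

Single-statistic tests get a plain `p` label while joint-Wald paths keep `joint p`, so the prose no longer implies a joint test that was never run.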

Code Quality

  • Severity: P2. DiagnosticReport.pretrends_power["power_at_M_1"] is still populated from PreTrendsPowerResults.power (diff_diff/diagnostic_report.py:984, diff_diff/diagnostic_report.py:1006), but that quantity is defined at violation_magnitude, not specifically at M=1 (diff_diff/pretrends.py:49). Impact: to_dict() consumers read a mislabeled statistic whenever the helper evaluated MDV or any custom M. Concrete fix: rename the field to power_at_violation_magnitude and expose violation_magnitude, or explicitly compute/store a separate power_at_M_1.

Performance

  • No findings.

Maintainability

  • No independent findings beyond the method-specific prose issue above.

Tech Debt

  • No independent findings. The P0/P1 items above are not tracked in TODO.md:51, so TODO tracking does not mitigate them.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. The new regression suite still asserts the opposite of the desired analytical behavior: test_n_bootstrap_zero_does_not_trigger_preserve_path expects an n_bootstrap=0 fit to recompute a 90% CI on alpha override (tests/test_business_report.py:862). There is still no targeted regression on a real analytical DiDResults/MultiPeriodDiDResults surface or an undefined replicate-d.f. survey surface. Impact: CI would not catch the blocker above, and one test currently codifies the bug. Concrete fix: replace that stub expectation with regressions on actual finite-d.f. analytical results and undefined-d.f. replicate-survey results.

Path to Approval

  1. Rewrite BusinessReport._extract_headline() so alpha mismatches never trigger a blind safe_inference(att, se, alpha) recomputation unless the result explicitly exposes the inference contract needed to reproduce it exactly.
  2. Make PT prose method-aware in both DiagnosticReport and BusinessReport, and make the EfficientDiD assumption block read results.pt_assumption instead of using generic staggered-DiD text.
  3. Fix the pre-trends-power schema by renaming power_at_M_1 or by computing a real M=1.0 power value and exposing violation_magnitude.
  4. Replace the current analytical alpha-override stub test with regressions on real DiDResults/MultiPeriodDiDResults outputs and an undefined-d.f. replicate-survey path.

Round 8 flagged one legitimate P1 (method-aware PT prose +
EfficientDiD assumption). The round-8 review's P0 alpha override,
P2 power_at_M_1 rename, and P2 analytical-fit regression findings
are regressions on its own round-7 assessment — all three were
addressed in commit d8fa66d and remain in place at HEAD:
``safe_inference(att, se, alpha)`` is gone from ``_extract_headline``,
``power_at_M_1`` is renamed to ``power_at_violation_magnitude`` with
``violation_magnitude`` exposed, and
``test_n_bootstrap_zero_does_not_trigger_preserve_path`` was replaced
by ``test_n_bootstrap_zero_still_preserves_on_alpha_mismatch`` plus
the new ``TestAnalyticalFiniteDfAlphaOverride`` suite with real
``DiDResults`` / ``MultiPeriodDiD`` fits and an undefined-d.f.
replicate-survey stub. The static reviewer appears to have described
the pre-round-7 state of those paths; grep confirms the fixes are
present at this SHA.

Legitimate round-8 P1 fix:

- ``BusinessReport._describe_assumption`` now has an
  ``EfficientDiDResults``-specific branch that reads
  ``results.pt_assumption`` (``"all"`` vs ``"post"``) and, when
  present, ``results.control_group``. PT-All surfaces Lemma 2.1 /
  over-identification + Hausman pretest language; PT-Post surfaces
  Corollary 3.2 / just-identified single-baseline DiD language.
  EfficientDiD is pulled out of the generic group-time ATT frozenset
  per REGISTRY.md §EfficientDiD lines 736-738 and 907.
- BR summary and DR ``_render_overall_interpretation`` PT sentences
  now branch on ``parallel_trends.method``. New ``_pt_method_subject``
  / ``_pt_subject_phrase`` helpers return source-faithful subjects
  ("The pre-period slope-difference test" for ``slope_difference``,
  "The Hausman PT-All vs PT-Post pretest" for ``hausman``,
  "Pre-treatment event-study coefficients" for Wald / Bonferroni
  paths, "The synthetic-control pre-treatment fit" for SDiD, "The
  factor-model pre-treatment fit" for TROP). A matching
  ``_pt_method_stat_label`` emits ``joint p`` vs ``p`` so
  single-statistic tests (slope_difference, hausman) are not
  mislabeled
  with a joint-Wald style statistic label.
- Regressions: ``TestEfficientDiDAssumptionPtAllPtPost`` (three
  cases: PT-All, PT-Post, control_group passthrough) and
  ``TestMethodAwarePTProse`` (BR slope-difference wording on a
  crafted schema; DR hausman wording on a real ``EfficientDiD`` fit).
  New ``edid_fit`` fixture added to ``tests/test_business_report.py``.

117 targeted tests passing; black / ruff / mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 7b5c0ad2d1cbcac47d331372054077ead41e5c33


Overall Assessment

⛔ Blocker

Static review only. I could not execute the new tests in this sandbox because numpy / pytest are unavailable.

Executive Summary

  • Several earlier issues do look fixed in the live PR head: dCDH PT now routes through placebo_event_study, reference-marker / NaN pre-period filtering landed, unsupported precomputed= keys now raise, EPV dict handling is corrected, and single-M HonestDiD prose no longer overclaims grid-wide robustness.
  • The blocker from the last review remains: BusinessReport(alpha != results.alpha) still recomputes normal-approximation CIs for ordinary analytical fits, which is not source-faithful to the fitted inference contract and can also invent finite intervals for undefined-d.f. replicate-survey surfaces.
  • EfficientDiD reporting is still not source-faithful: the assumption block remains generic staggered-DiD PT text, and the PT prose still does not branch on the native hausman path.
  • DiagnosticReport._pt_hausman() reruns the native pretest with default design settings instead of the fitted result’s stored control_group / anticipation, so the report can diagnose a different design than the estimate it is summarizing.
  • pretrends_power["power_at_M_1"] is still mislabeled; it stores power at violation_magnitude, not necessarily at M=1.
  • The new tests still lock in the analytical-alpha bug instead of catching it.

Methodology

  • Severity: P0. Impact: BusinessReport._extract_headline() still falls back to safe_inference(att, se, alpha) for alpha mismatches that are not explicitly tagged as bootstrap or df_survey > 0, so ordinary analytical results that were fit with finite-d.f. t-inference are silently rewritten as normal-approximation CIs; rank-deficient replicate-survey surfaces with df=0 can also be turned from all-NaN inference back into finite intervals. That changes statistical output with no warning. Concrete fix: on alpha != results.alpha, preserve the fitted CI / NaN surface unless the exact native inference contract needed to reproduce it is explicitly available; do not use a generic safe_inference(att, se, alpha) fallback for these fits. business_report.py#L333 linalg.py#L2160 estimators.py#L1389 utils.py#L203 chaisemartin_dhaultfoeuille.py#L5174
  • Severity: P1. Impact: the EfficientDiD assumption block is still reported as generic group-time parallel trends, and the PT prose still talks about generic “pre-treatment data” / “pre-treatment event-study coefficients” instead of the PT-All vs PT-Post Hausman pretest. That is not source-faithful to the registry, which treats PT-Post and PT-All as distinct identification regimes and documents the Hausman test on the post-treatment ES(e) vector. Concrete fix: branch _describe_assumption() on results.pt_assumption and results.control_group, and branch BR/DR PT prose on parallel_trends.method so hausman is named explicitly as the PT-All vs PT-Post pretest. business_report.py#L716 business_report.py#L1132 diagnostic_report.py#L2051 REGISTRY.md#L736 REGISTRY.md#L907 REPORTING.md#L52
  • Severity: P1. Impact: DiagnosticReport._pt_hausman() reruns EfficientDiD.hausman_pretest() with only data/outcome/unit/time/first_treat/alpha, so fits that were estimated with non-default control_group or anticipation are diagnosed under a different design than the one being reported. EfficientDiDResults already stores both fields, and the native pretest signature accepts them. Concrete fix: pass through at least results.control_group and results.anticipation; if the original fit used settings you cannot faithfully reconstruct, extend the report API to accept them or skip the check with an explicit reason instead of silently rerunning defaults. diagnostic_report.py#L1568 efficient_did.py#L1511 efficient_did_results.py#L138
  • Severity: P2. Impact: DiagnosticReport.pretrends_power["power_at_M_1"] is still populated from PreTrendsPowerResults.power, but that quantity is defined as power at violation_magnitude, which defaults to MDV when M=None, not specifically at M=1. Schema consumers will misread the statistic whenever a non-1 violation magnitude was evaluated. Concrete fix: rename the field to power_at_violation_magnitude and expose violation_magnitude, or explicitly compute a separate M=1.0 power value. diagnostic_report.py#L975 pretrends.py#L49 pretrends.py#L843
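The parameter pass-through in the Hausman-propagation fix can be sketched as below. The attribute names `control_group` and `anticipation` come from the finding itself; the helper and the stand-in result class are hypothetical.

```python
def hausman_pretest_kwargs(results, base_kwargs):
    """Forward the fitted design settings so the pretest diagnoses the
    same design as the reported estimate, not the defaults."""
    kwargs = dict(base_kwargs)
    for field in ("control_group", "anticipation"):
        value = getattr(results, field, None)
        if value is not None:
            kwargs[field] = value
    return kwargs

class FittedResult:  # stand-in for EfficientDiDResults
    control_group = "last_cohort"
    anticipation = 1

print(hausman_pretest_kwargs(FittedResult(), {"alpha": 0.05}))
# {'alpha': 0.05, 'control_group': 'last_cohort', 'anticipation': 1}
```

If a stored field is missing or `None`, the native pretest's own default applies, which matches the "skip or fall back explicitly" spirit of the fix.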

Code Quality

  • No independent findings beyond the methodology issues above.

Performance

  • No findings.

Maintainability

  • No independent findings beyond the EfficientDiD parameter-propagation issue above.

Tech Debt

  • No mitigating TODO.md entry covers the P0/P1 issues above, so TODO tracking does not downgrade them. TODO.md#L51

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new regression suite still asserts the wrong analytical-alpha behavior, so CI would bless the blocker instead of catching it. test_alpha_drives_ci_level, test_ci_bounds_recomputed_when_alpha_differs_from_result, and test_n_bootstrap_zero_does_not_trigger_preserve_path all expect analytical fits to recompute a 90% CI on alpha override. Concrete fix: replace those expectations with regressions on real DifferenceInDifferences / MultiPeriodDiD / undefined-d.f. survey surfaces that assert native-CI preservation, and add a focused EfficientDiD test that verifies control_group / anticipation are propagated into the Hausman path. tests/test_business_report.py#L500 tests/test_business_report.py#L507 tests/test_business_report.py#L862

Path to Approval

  1. Change BusinessReport._extract_headline() so alpha != results.alpha never triggers a generic safe_inference(att, se, alpha) recomputation for analytical finite-d.f. or undefined-d.f. surfaces; preserve the fitted CI / NaN inference unless the exact native contract is reproducible.
  2. Make EfficientDiD reporting source-faithful by branching the assumption block on results.pt_assumption / results.control_group and branching BR/DR PT prose on parallel_trends.method, with explicit Hausman PT-All vs PT-Post wording.
  3. Propagate EfficientDiDResults.control_group and EfficientDiDResults.anticipation into _pt_hausman(); if the original fit used settings the report cannot reconstruct exactly, skip with an explicit reason instead of rerunning defaults.
  4. Replace the current analytical-alpha tests with real native-inference regressions and add an EfficientDiD Hausman-propagation regression.

Round 9 flagged one legitimate new finding (Hausman propagation).
The other four findings are regressions on the reviewer's own
round-7 and round-8 assessments — grep at HEAD confirms all four
were fixed in earlier commits and remain in place:

- P0 alpha override: ``safe_inference(att, se, alpha)`` was removed
  from ``BusinessReport._extract_headline`` in round 7 (commit d8fa66d).
  The only remaining references are in the explanatory comment that
  documents why we preserve rather than recompute.
  ``grep safe_inference diff_diff/business_report.py`` returns only
  the comment lines.
- P1 EfficientDiD assumption + PT prose: addressed in round 8
  (commit 7b5c0ad). ``_describe_assumption`` has an EfficientDiD-
  specific branch reading ``pt_assumption`` (PT-All vs PT-Post per
  Corollary 3.2 / Lemma 2.1). BR and DR summary prose branch on
  ``parallel_trends.method`` via the new ``_pt_method_subject`` /
  ``_pt_subject_phrase`` helpers, so the 2x2 slope-difference and
  Hausman paths get source-correct subjects.
- P2 ``power_at_M_1`` rename: addressed in round 7. Grep confirms
  the field is ``power_at_violation_magnitude`` with
  ``violation_magnitude`` exposed; only the explanatory comment
  references the old name.
- P2 test regressions:
  ``test_ci_bounds_recomputed_when_alpha_differs_from_result`` and
  ``test_n_bootstrap_zero_does_not_trigger_preserve_path`` were
  removed in round 7; the new
  ``TestAnalyticalFiniteDfAlphaOverride`` suite locks in the
  preserve-always behavior on real ``DiDResults`` / ``MultiPeriodDiD``
  fits plus an undefined-d.f. replicate-survey stub.

Legitimate round-9 P1 fix:

- ``DiagnosticReport._pt_hausman`` now reads ``control_group`` and
  ``anticipation`` from the fitted result and forwards them to
  ``EfficientDiD.hausman_pretest``. The prior code passed only
  ``data/outcome/unit/time/first_treat/alpha``, so a fit that used
  ``control_group='last_cohort'`` or a non-zero ``anticipation``
  was silently diagnosed under the default design rather than its
  own. New ``TestHausmanPretestPropagatesFitDesign`` regression
  uses ``unittest.mock.patch`` to verify both fields propagate.
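A regression of the shape described above, using `unittest.mock`, might look roughly like this. Everything here is a simplified stand-in for illustration; the real test patches the library's actual pretest entry point.

```python
from unittest import mock

class FakeResults:  # stand-in for a fitted EfficientDiD result
    control_group = "last_cohort"
    anticipation = 1

def run_pt_hausman(results, pretest):
    """Simplified version of the diagnostic call under test: it must
    forward the fitted design settings to the native pretest."""
    return pretest(control_group=results.control_group,
                   anticipation=results.anticipation)

pretest = mock.Mock(return_value={"p_value": 0.3})
run_pt_hausman(FakeResults(), pretest)
# Fails loudly if the defaults were used instead of the fitted design:
pretest.assert_called_once_with(control_group="last_cohort", anticipation=1)
print("design settings propagated")
```

Mocking the pretest keeps the regression fast and pins the contract (which kwargs reach the native call) rather than the pretest's numerics.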

118 targeted tests passing; black / ruff / mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 883830351b40b4d97ceafd231aea9d0a16133be1


Overall Assessment

✅ Looks good — the prior P1s appear resolved. BusinessReport now finite-gates undefined CIs before rendering diff_diff/business_report.py:L1693-L1702 and no longer maps the composite StaggeredTripleDiff control total into sample["n_control"] on the fixed never-treated path diff_diff/business_report.py:L571-L629, with targeted regressions in tests/test_business_report.py:L1291-L1398. I did not find any remaining unmitigated P0/P1 issues in the changed inference/reporting logic, and the event-study PT routing, survey finite-df handling, estimator-native diagnostics, and power-aware phrasing align with the documented reporting notes in docs/methodology/REPORTING.md:L64-L163.

Static review only: this sandbox does not have pytest or numpy, so I could not execute the suite.

Executive Summary

  • The earlier re-review blockers are fixed; I did not find a remaining P0/P1 in the changed estimator/reporting paths.
  • The new DiagnosticReport PT routing, survey F-reference branch, native SDiD/TROP/EfficientDiD handling, and power-tier logic match documented deviations rather than introducing a new undocumented methodology change.
  • Severity P2: BusinessReport still fails to narrate the fixed never-enabled comparison cohort for fixed StaggeredTripleDiff; the schema is correct, but summary() and full_report() omit that fixed control information.
  • Severity P3: the regression added for the triple-difference fixed-control fix only checks that the old wrong 500 control phrasing disappeared; it does not assert that the correct never-enabled count appears in rendered prose.
  • Static review only; the sandbox lacks pytest and numpy.

Methodology

  • Severity: P2. Impact: the fix correctly moves fixed StaggeredTripleDiff sample metadata onto n_never_enabled and clears n_control, but the renderers only surface n_never_enabled inside the dynamic-control branch. That means summary() falls through to a generic Sample: N observations. sentence and full_report() omits the fixed comparison cohort entirely, even though the registry says the valid fixed comparison in this mode is the never-enabled cohort. Relevant code: diff_diff/business_report.py:L582-L629, diff_diff/business_report.py:L1898-L1935, diff_diff/business_report.py:L2107-L2137, docs/methodology/REGISTRY.md:L1730-L1732. Concrete fix: add a non-dynamic fixed-control render path for StaggeredTripleDiff that prints the never-enabled count in both summary() and full_report(), while still keeping n_control reserved for true fixed control totals.
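The non-dynamic fixed-control render path this fix proposes could look like the sketch below. The `sample` keys follow the schema fields named in the finding; the function itself and its flags are illustrative, not the renderer's real signature.

```python
def render_sample_sentence(sample, is_staggered_triple_diff, dynamic_control):
    """Surface the never-enabled comparison cohort on the fixed path
    instead of falling through to the generic sentence."""
    n = sample["n_obs"]
    never_enabled = sample.get("n_never_enabled") or 0
    if is_staggered_triple_diff and not dynamic_control and never_enabled > 0:
        return (f"Sample: {n} observations "
                f"({sample['n_treated']} treated, "
                f"{never_enabled} never-enabled).")
    return f"Sample: {n} observations."

sample = {"n_obs": 800, "n_treated": 500, "n_never_enabled": 300}
print(render_sample_sentence(sample, True, False))
# Sample: 800 observations (500 treated, 300 never-enabled).
```

Keeping `n_control` reserved for true fixed control totals, the fixed triple-diff path names the never-enabled cohort explicitly in both `summary()` and `full_report()`.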

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The remaining non-blocking items above are not currently tracked in TODO.md:L51-L80.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: the new regression for the fixed StaggeredTripleDiff control-group contract verifies that the old composite-control wording disappeared and that the schema carries n_never_enabled, but it does not assert that summary() or full_report() actually render the correct fixed comparison count. That gap is why the remaining prose omission above is still unguarded. Relevant test: tests/test_business_report.py:L1291-L1347. Concrete fix: extend that regression to assert the fixed path renders the never-enabled comparison count in both summary() and full_report(), not just that 500 control is absent.

P2 methodology (StaggeredTripleDiff fixed-control prose incomplete).
Round-37 moved StaggeredTripleDiff's fixed
``control_group="never_treated"`` schema to ``n_never_enabled``
(REGISTRY.md line 1730 names the never-enabled cohort as the valid
fixed comparison) and cleared the composite ``n_control_units``
total from ``n_control``.
The renderers, however, only surface ``n_never_enabled`` inside the
``is_dynamic_control`` branch — so the fixed ``never_treated`` path
fell through to the generic ``Sample: N observations.`` sentence
and the full report omitted the fixed comparison cohort entirely.

Added dedicated fixed-never-enabled branches to both renderers:

  * ``_render_summary`` emits ``Sample: N observations (N_t treated,
    N_ne never-enabled).`` when the estimator is
    ``StaggeredTripleDiffResults``, the dynamic branch is not
    active, and ``n_never_enabled > 0``;
  * ``_render_full_report`` emits a dedicated bullet
    ``- Never-enabled units (fixed comparison cohort): N_ne`` under
    the same condition.
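The branch shape can be sketched as follows. This is a minimal illustrative stand-in: the stub ``StaggeredTripleDiffResults`` class, its field names, and the 800/300 counts mirror the discussion above, not the library's actual class or renderer.

```python
from dataclasses import dataclass

# Illustrative stand-in for the library's result object; the field
# names (n_obs, n_treated, n_never_enabled, is_dynamic_control) mirror
# the schema discussed in the review, not the real class definition.
@dataclass
class StaggeredTripleDiffResults:
    n_obs: int
    n_treated: int
    n_never_enabled: int
    is_dynamic_control: bool = False

def render_sample_sentence(res) -> str:
    """Render the summary() sample sentence with a dedicated
    fixed-never-enabled branch, instead of falling through to the
    generic 'Sample: N observations.' wording."""
    if isinstance(res, StaggeredTripleDiffResults):
        if res.is_dynamic_control:
            pass  # dynamic-control branch (unchanged, elided here)
        elif res.n_never_enabled > 0:
            return (
                f"Sample: {res.n_obs} observations "
                f"({res.n_treated} treated, "
                f"{res.n_never_enabled} never-enabled)."
            )
    # Generic fallback for all other result types.
    return f"Sample: {res.n_obs} observations."

res = StaggeredTripleDiffResults(n_obs=800, n_treated=300, n_never_enabled=300)
print(render_sample_sentence(res))
# -> Sample: 800 observations (300 treated, 300 never-enabled).
```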

P3 coverage. Round-37 regression only asserted absence of the
wrong ``500 control`` wording; it did not positively assert the
valid never-enabled comparison cohort appeared in rendered prose,
which is why the P2 above slipped through. Regressions extended:

  * ``test_never_treated_mode_summary_renders_never_enabled_count``
    asserts ``300 never-enabled`` appears in ``summary()`` AND the
    generic fallback ``Sample: 800 observations.`` does not fire;
  * new ``test_never_treated_mode_full_report_renders_never_enabled_count``
    asserts the sample section of ``full_report()`` names
    ``never-enabled`` and the ``300`` count while omitting any
    bare ``- Control: 500`` line.

278 BR / DR / practitioner tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber


igerber commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 86c4a5a6c764bf3dd6eabace258e7d6f79c2eb41


Overall Assessment

✅ Looks good — the prior re-review issue around fixed StaggeredTripleDiff(control_group="never_treated") reporting appears resolved, and I did not find a new unmitigated P0/P1 in the changed estimator/reporting logic. Static review only: pytest and numpy were unavailable in this sandbox, so verification was limited to diff inspection and ast.parse on the changed Python files.

Executive Summary

Methodology

No findings. The methodology-sensitive changes reviewed here are either consistent with the registry, including fixed StaggeredTripleDiff comparisons using the never-enabled cohort, or explicitly documented as reporting-layer deviations such as event-study PT via joint-Wald/Bonferroni and estimator-native routing for SDiD/TROP/EfficientDiD. docs/methodology/REGISTRY.md:1730, docs/methodology/REPORTING.md:64, docs/methodology/REPORTING.md:86

Code Quality

No findings.

Performance

No findings.

Maintainability

  • Severity: P3. Impact: BusinessReport.summary() is written to quote the dropped undefined-preperiod count when available, but _lift_pre_trends() strips both n_dropped_undefined and DR’s detailed reason. That means BR’s schema and prose collapse a more informative DR result into the generic inconclusive sentence, even when DR already knows exactly how many pre-period rows failed and why. Relevant locations: diff_diff/business_report.py:699, diff_diff/business_report.py:1795, diff_diff/diagnostic_report.py:999. Concrete fix: preserve n_dropped_undefined and reason in BR’s pre_trends block, then add a regression that asserts the count survives into BusinessReport.to_dict()["pre_trends"] and/or rendered summary text for an undefined-pretrend fit.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

  • Severity: P3. Impact: _check_design_effect() now emits band_label="improves_precision" for deff < 0.95, and tests assert that behavior, but REPORTING.md still only says the section includes a label and the _check_design_effect() docstring still says the documented bands start with <1.05 -> trivial. That leaves the experimental schema contract under-documented and makes the inline documentation incorrect. Relevant locations: diff_diff/diagnostic_report.py:1563, docs/methodology/REPORTING.md:30, tests/test_diagnostic_report.py:1264, tests/test_business_report.py:3175. Concrete fix: update docs/methodology/REPORTING.md and the _check_design_effect() docstring to list the full enum and thresholds, including improves_precision, or remove the claim that the older band table is the documented contract.

P3 maintainability (inconclusive-PT provenance dropped at the lift).
DR's ``_pt_event_study`` inconclusive branch emits
``n_dropped_undefined`` (count of real pre-periods rejected for
undefined inference) and a detailed ``reason`` tying the skip to
the ``safe_inference`` contract. BR's ``_lift_pre_trends`` was
dropping both fields at the lift boundary, so BR's own summary
renderer — which is already written to quote the count when
available — never saw it. Preserve both fields on the BR
``pre_trends`` block so the schema carries the same provenance DR
does, and summary prose can render "1 pre-period row had undefined
inference" instead of the generic "pre-period inference was
undefined" fallback.
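The lift-boundary fix can be sketched as below, assuming dict-shaped blocks. The field names (``n_dropped_undefined``, ``reason``) come from the discussion above; the surrounding schema keys are illustrative.

```python
def lift_pre_trends(dr_block: dict) -> dict:
    """Lift DR's parallel-trends block onto the BR schema, preserving
    the inconclusive-path provenance fields instead of dropping them
    at the boundary."""
    lifted = {
        "status": dr_block.get("status"),
        "method": dr_block.get("method"),
        "verdict": dr_block.get("verdict"),
    }
    # Provenance for the inconclusive path: copy through when present
    # so summary prose can quote the exact dropped-row count.
    for key in ("n_dropped_undefined", "reason"):
        if dr_block.get(key) is not None:
            lifted[key] = dr_block[key]
    return lifted

def render_inconclusive(pre_trends: dict) -> str:
    """Quote the count when available; otherwise fall back to the
    generic sentence."""
    n = pre_trends.get("n_dropped_undefined")
    if n:
        rows = "row" if n == 1 else "rows"
        return f"{n} pre-period {rows} had undefined inference"
    return "pre-period inference was undefined"

block = {"status": "ran", "method": "inconclusive", "verdict": None,
         "n_dropped_undefined": 1, "reason": "safe_inference: se == 0"}
print(render_inconclusive(lift_pre_trends(block)))
# -> 1 pre-period row had undefined inference
```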

P3 docs drift. Round-35 added ``band_label="improves_precision"``
for ``deff < 0.95`` and code/tests exercise that enum value, but
``REPORTING.md`` still described only the four-band table
(``trivial`` / ``slightly_reduces`` / ``materially_reduces`` /
``large_warning``) and ``_check_design_effect``'s docstring listed
the same old table. Updated both surfaces to enumerate the full
five-value enum with explicit threshold rules and the intuition
for the precision-improving band.

Tests: 1 new regression.

  * ``TestInconclusivePTProvenancePreservedOnBRSchema``: NaN-p
    ``StackedDiDResults`` fit (Bonferroni-only surface) lifts
    through to BR with ``n_dropped_undefined==1`` on the BR
    ``pre_trends`` block, a populated ``reason``, and the summary
    renderer quotes "1 pre-period row had undefined inference".

All BR / DR / practitioner tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 9d8ad3496ddc283e7d8947ab46a790ed00ba7287


Overall Assessment

⚠️ Needs changes — I found two unmitigated P1 methodology issues in the new reporting layer. Static review only: this sandbox lacks numpy/pytest, so I could not run the added test suites.

Executive Summary

  • The prior re-review P3 items appear resolved: BusinessReport now preserves inconclusive parallel-trends provenance, and the design-effect docs/schema now agree on the band_label enum including improves_precision. diff_diff/business_report.py:L699-L724, docs/methodology/REPORTING.md:L30-L42, diff_diff/diagnostic_report.py:L1563-L1582
  • P1: survey-backed Goodman-Bacon replay is unweighted because the new report APIs do not accept or forward survey_design, even though bacon_decompose() and the Methodology Registry explicitly support survey-weighted decomposition. diff_diff/diagnostic_report.py:L1494-L1530, diff_diff/bacon.py:L1056-L1064, docs/methodology/REGISTRY.md:L2396-L2402
  • P1: survey-backed 2x2 parallel-trends checks can still run through unweighted check_parallel_trends() with no skip/warning path. diff_diff/diagnostic_report.py:L949-L989, diff_diff/utils.py:L641-L647
  • The other reporting deviations I checked — event-study PT via joint Wald/Bonferroni, survey finite-df F references, estimator-native routing, and the 0.05/0.30 pre-trend bins — are documented in docs/methodology/REPORTING.md and therefore are not defects under this rubric. docs/methodology/REPORTING.md:L75-L174

Methodology

  • Severity P1. Impact: DiagnosticReport and BusinessReport never accept a survey_design object, so _check_bacon() always calls bacon_decompose() without it. On survey-backed fits, that silently reports an unweighted Goodman-Bacon decomposition for a different design than the fitted estimator, even though the helper and the registry both support survey-weighted Bacon. diff_diff/diagnostic_report.py:L292-L318, diff_diff/diagnostic_report.py:L696-L733, diff_diff/diagnostic_report.py:L1494-L1530, diff_diff/business_report.py:L127-L147, diff_diff/business_report.py:L288-L300, diff_diff/bacon.py:L1056-L1064, diff_diff/bacon.py:L1139-L1140, docs/methodology/REGISTRY.md:L2396-L2402, docs/methodology/REPORTING.md:L22-L28
    Concrete fix: add survey_design to both report constructors, thread it through the auto-diagnostic path, and pass it to bacon_decompose(..., survey_design=survey_design); if the caller omits it on a survey-backed fit, skip with an explicit reason instead of replaying unweighted.
  • Severity P1. Impact: the DiDResults PT route still dispatches directly to check_parallel_trends(), and the applicability gate does not exclude survey-backed results. Because check_parallel_trends() has no survey_design parameter, the report can emit an unweighted “parallel trends” verdict for a survey-weighted DiD fit with no warning. diff_diff/diagnostic_report.py:L92-L93, diff_diff/diagnostic_report.py:L537-L570, diff_diff/diagnostic_report.py:L949-L989, diff_diff/utils.py:L641-L647, diff_diff/results.py:L107-L127, docs/methodology/REPORTING.md:L22-L28
    Concrete fix: either implement a survey-aware 2x2 PT diagnostic and thread survey_design through it, or make parallel_trends skip for survey-backed DiDResults unless the caller supplies precomputed={'parallel_trends': ...}.
  • No other methodology findings. The main reporting-layer deviations I checked are explicitly documented and therefore non-blocking. docs/methodology/REPORTING.md:L75-L174

Code Quality

No findings. I also did not find new inline inference / partial-NaN-guard anti-patterns in the changed files.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No separate findings. I did not see the two P1 survey-replay issues tracked under TODO.md’s “Tech Debt from Code Reviews” section, so they are not mitigated by existing debt tracking. TODO.md:L51-L80

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: the new API docs tell users to pass raw panel + column-name kwargs to enable Bacon/PT replay, but they do not describe any survey-design contract, which makes the two survey-path bugs above easy to trigger in normal use. I also did not find corresponding survey-backed coverage in the new report tests. docs/api/business_report.rst:L17-L24, docs/api/diagnostic_report.rst:L29-L37
    Concrete fix: update both API pages and docs/methodology/REPORTING.md alongside the code fix, and add end-to-end survey-backed regressions for both DiagnosticReport and BusinessReport.

Path to Approval

  1. Add survey_design to DiagnosticReport(...) and BusinessReport(...), store/forward it, and pass it through _check_bacon() to bacon_decompose(..., survey_design=...).
  2. Change the DiDResults PT path so survey-backed fits do not call check_parallel_trends() unweighted: either implement a survey-aware helper or skip with an explicit reason unless precomputed['parallel_trends'] is supplied.
  3. Add a regression that proves survey-weighted Bacon replay from both DiagnosticReport and BusinessReport uses the supplied survey design.
  4. Add a regression that proves survey-weighted simple DiD PT either uses the new survey-aware path or skips loudly, and document that contract in the two API pages plus docs/methodology/REPORTING.md.

Round-40 landed two P1 methodology findings on the reporting layer,
both instances of the same silent-failure class: survey-backed fits
routing through diagnostic helpers that don't accept
``survey_design``, silently emitting unweighted results for a
weighted estimate.

Survey-design threading (``diagnostic_report.py``,
``business_report.py``):

- ``DiagnosticReport`` and ``BusinessReport`` now accept
  ``survey_design=<SurveyDesign>``. BR forwards to the auto-constructed
  DR; DR threads through ``bacon_decompose(survey_design=...)``.
- When ``results.survey_metadata`` is set but ``survey_design`` is
  not supplied, Bacon and the simple 2x2 parallel-trends helper
  skip with an explicit reason instead of replaying an unweighted
  decomposition / verdict for a design that does not match the
  estimate. Precomputed passthroughs remain honored.
- Defense-in-depth skips added at the runner level
  (``_check_bacon`` / ``_pt_two_x_two``) in case future callers
  bypass the applicability gate.
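The skip-with-reason gate can be sketched as below, under assumed argument shapes; the real gate lives on ``DiagnosticReport`` and checks additional conditions (precomputed passthroughs, estimator type).

```python
def bacon_applicability(results_survey_metadata, survey_design):
    """Decide whether the Goodman-Bacon check may run.

    Skips with an explicit reason when a survey-backed fit would
    otherwise be replayed unweighted. Returns (run, skip_reason).
    Argument names mirror the PR discussion, not the real signature.
    """
    if results_survey_metadata is not None and survey_design is None:
        return False, (
            "skipped: fit is survey-weighted but no survey_design was "
            "supplied; refusing to replay an unweighted decomposition"
        )
    return True, None

# Survey-backed fit, no design supplied: skip loudly.
run, reason = bacon_applicability(
    results_survey_metadata={"weights": "pweight"},
    survey_design=None,
)
print(run, reason)
```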

Docs:

- ``REPORTING.md`` documents the new ``survey_design`` contract
  and the skip-with-reason policy alongside the existing
  finite-df PT note.
- ``business_report.rst`` and ``diagnostic_report.rst`` surface
  the kwarg with a pointer to the methodology file.

Tests: 7 new regressions (4 DR + 3 BR) covering PT skip with reason
on survey-backed DiDResults, precomputed PT override honored, Bacon
skip without survey_design, and survey_design forwarded via
``unittest.mock.patch``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: cf2f592c69937faabec0bfde4f3c4449d51e8f38


Overall Assessment

✅ Looks good — no unmitigated P0 or P1 findings.

Executive Summary

  • The two prior survey-weighting blockers from the last AI review appear fixed: survey_design now threads through BR/DR into Goodman-Bacon replay, and survey-backed simple-DiD PT is skipped instead of replaying unweighted diagnostics. diff_diff/business_report.py:L221-L227, diff_diff/business_report.py:L309-L322, diff_diff/diagnostic_report.py:L593-L612, diff_diff/diagnostic_report.py:L785-L794, diff_diff/diagnostic_report.py:L1605-L1628, tests/test_diagnostic_report.py:L2068-L2195, tests/test_business_report.py:L3781-L3854
  • Other reviewer-driven contract fixes also look correct: EfficientDiD now persists cluster for fit-faithful Hausman replay, and practitioner step tags were retuned so BR/DR no longer suppress next-step advice that the report never actually executed. diff_diff/efficient_did_results.py:L151-L157, diff_diff/efficient_did.py:L1094-L1102, diff_diff/practitioner.py:L391-L396, diff_diff/practitioner.py:L573-L582, diff_diff/practitioner.py:L650-L655, diff_diff/practitioner.py:L699-L708
  • I did not find a new silent correctness bug in estimator math, weighting, variance/SE, identification assumptions, or defaults in the changed code.
  • One minor P3 remains: the survey-threading docs overstate what survey_design unlocks for simple 2x2 PT. docs/methodology/REPORTING.md:L84-L94, docs/api/business_report.rst:L26-L32, docs/api/diagnostic_report.rst:L26-L33
  • Static review only: this sandbox lacks pytest and numpy, so I could not execute the new tests locally.

Methodology

No findings. The reporting-layer deviations I checked against docs/methodology/REGISTRY.md and docs/methodology/REPORTING.md are documented, and the previous survey-replay P1s are now handled conservatively rather than emitting mismatched weighted/unweighted diagnostics. docs/methodology/REGISTRY.md:L3080-L3087, docs/methodology/REPORTING.md:L75-L120, diff_diff/diagnostic_report.py:L593-L612, diff_diff/diagnostic_report.py:L785-L794, diff_diff/diagnostic_report.py:L1605-L1628

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. I did not see any blocker-grade issue that depends on TODO.md mitigation; the only remaining item below is P3 and non-blocking. TODO.md:L51-L110

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: the new reporting docs and constructor docstrings read as if supplying survey_design can make the survey-backed simple 2x2 parallel-trends path fit-faithful, but the implementation intentionally skips that path unconditionally because utils.check_parallel_trends() has no survey-aware variant. The implementation is conservative and does not emit wrong numbers, but the public contract is overstated. docs/methodology/REPORTING.md:L84-L94, docs/api/business_report.rst:L26-L32, docs/api/diagnostic_report.rst:L26-L33, diff_diff/business_report.py:L129-L137, diff_diff/diagnostic_report.py:L271-L279, diff_diff/diagnostic_report.py:L593-L612, diff_diff/diagnostic_report.py:L1036-L1045, tests/test_diagnostic_report.py:L2068-L2115, tests/test_business_report.py:L3781-L3815
    Concrete fix: reword the survey sections to say survey_design only enables fit-faithful Bacon replay; survey-backed 2x2 PT remains skip-only unless the caller supplies precomputed={'parallel_trends': ...}. Add a regression that passes both survey_metadata and survey_design on DiDResults and asserts the PT block still skips with the survey-aware-precomputed message.

Round-41 landed a single P3 finding (PR overall assessment ✅): the
survey-threading docs and constructor docstrings overstated what
``survey_design`` unlocks. Supplying the design enables fit-faithful
Goodman-Bacon replay, but the simple 2x2 parallel-trends helper
(``utils.check_parallel_trends``) has no survey-aware variant, so
survey-backed ``DiDResults`` PT is skipped unconditionally regardless
of ``survey_design`` — the helper cannot consume the design even when
it is available.

Docs:

- ``REPORTING.md`` now separates the Bacon replay contract (where
  ``survey_design`` is load-bearing) from the 2x2 PT contract
  (skip-only; use ``precomputed={'parallel_trends': ...}`` to opt in).
- ``business_report.rst`` / ``diagnostic_report.rst`` mirror the
  split and point the reader at the precomputed-PT opt-in.
- Constructor docstrings on both classes clarify the scope.

Tests: added
``test_survey_backed_did_skips_2x2_pt_even_when_survey_design_supplied``
which passes both ``survey_metadata`` AND ``survey_design`` on a
``DiDResults`` stub and asserts PT still skips with a reason naming
the precomputed-PT opt-in (not ``survey_design``).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 020d53748f06078a2db21a59e8ad8ed65028b175


Overall Assessment

⚠️ Needs changes — two unmitigated P1 methodology/reporting issues remain.

Executive Summary

Methodology

  • Severity P1. Impact: when every genuine pre-period coefficient has undefined inference, _collect_pre_period_coefs() returns ([], n_dropped_undefined>0), but both the applicability gate and _pt_event_study() treat that as “no pre-period coefficients available” and skip the PT check entirely. That suppresses the explicit method="inconclusive" / n_dropped_undefined path the module already uses for mixed valid+undefined surfaces, and BusinessReport can then omit the identifying-assumption warning instead of telling the reader PT was unassessed due to undefined inference. Concrete fix: treat (no valid pre-period coefficients, but n_dropped_undefined > 0) as an inconclusive PT result, not a skip, in both the gate and runner; add a regression with all pre-period rows undefined (for example, all se=0 or all per-period p_value=NaN) and assert parallel_trends.status == "ran", method == "inconclusive", and BR summary says “inconclusive.” diff_diff/diagnostic_report.py, diff_diff/diagnostic_report.py, diff_diff/diagnostic_report.py, diff_diff/business_report.py
  • Severity P1. Impact: BusinessReport’s “Identifying Assumption” section groups ImputationDiDResults and TwoStageDiDResults into the same generic “parallel trends across treatment cohorts and time periods (group-time ATT)” description used for CS-style estimators, but the Methodology Registry defines BJS/Gardner identification through an untreated-only additive FE / untreated-outcome model plus no anticipation. Because this section is explicitly presenting source methodology to users, that is an undocumented source-material mismatch. Concrete fix: split ImputationDiDResults and TwoStageDiDResults into their own _describe_assumption() branches and mirror the registry’s untreated-observations FE wording; add schema/render tests that assert those descriptions and reject the current “group-time ATT” phrasing for these two estimators. diff_diff/business_report.py, docs/methodology/REGISTRY.md, docs/methodology/REGISTRY.md

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No new blocker-grade findings. The pre-existing EfficientDiD last_cohort/anticipation limitation remains tracked in TODO.md and does not change this assessment.

Security

  • No findings.

Documentation/Tests

  • No additional PR-side findings beyond the missing regressions implied by the two P1s above.
  • I could not run tests/test_business_report.py or tests/test_diagnostic_report.py in this environment because pytest and numpy are unavailable.

Path to Approval

  1. Fix DiagnosticReport so all-undefined pre-period event-study surfaces emit an explicit inconclusive PT block instead of skipped, and add a regression covering that all-undefined case end-to-end through both DR and BR.
  2. Rewrite the ImputationDiDResults and TwoStageDiDResults assumption text in BusinessReport to match the Methodology Registry, and add schema/render tests that lock those source-faithful descriptions in place.

Round-42 landed two P1 findings:

1. All-undefined pre-period surface routed to ``skipped`` instead of
   ``inconclusive`` (``diagnostic_report.py``). When every pre-row is
   dropped by ``_collect_pre_period_coefs`` for undefined inference
   (all ``se <= 0`` / non-finite effect/se), the collector returns
   ``([], n_dropped_undefined > 0)``. Both the applicability gate and
   ``_pt_event_study`` treated that as "no coefficients available" and
   skipped, letting BR drop the identifying-assumption warning. Fixed
   both sites to detect the all-undefined case and route to the
   explicit ``method="inconclusive"`` runner alongside the
   partial-undefined case already covered by R33. BR's existing inconclusive
   phrasing lifts through unchanged.

2. Source-faithful assumption text for ``ImputationDiDResults`` and
   ``TwoStageDiDResults`` (``business_report.py``). BR's
   ``_describe_assumption`` was grouping both with CS / SA / Wooldridge
   under the generic "parallel trends across treatment cohorts and
   time periods (group-time ATT)" template, but BJS (2024) and Gardner
   (2022) both identify through an untreated-potential-outcome model:
   unit+time FE fitted on untreated observations (``Omega_0`` =
   never-treated + not-yet-treated) deliver the counterfactual, and
   the identifying restriction is on ``E[Y_it(0)] = alpha_i + beta_t``
   — not on cohort-time ATT equality. Split each into its own branch
   mirroring REGISTRY.md §ImputationDiD (lines 1000-1013) and
   §TwoStageDiD (lines 1113-1128), including the Gardner-BJS
   algebraic-equivalence note.
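The routing fix in item 1 can be sketched as follows. The ``(effect, se)`` row shape and helper names are illustrative; the real collector inspects full event-study coefficient rows.

```python
import math

def collect_pre_period_coefs(rows):
    """Split pre-period rows into (valid, n_dropped_undefined),
    dropping rows with non-finite effect/se or se <= 0 under the
    safe-inference contract described above."""
    valid, dropped = [], 0
    for effect, se in rows:
        if not (math.isfinite(effect) and math.isfinite(se)) or se <= 0:
            dropped += 1
        else:
            valid.append((effect, se))
    return valid, dropped

def pt_status(rows):
    """Route the PT check: an all-undefined surface becomes an
    explicit inconclusive result, not a skip."""
    valid, n_dropped = collect_pre_period_coefs(rows)
    if not valid:
        if n_dropped > 0:
            return {"status": "ran", "method": "inconclusive",
                    "n_pre_periods": 0, "n_dropped_undefined": n_dropped}
        return {"status": "skipped",
                "reason": "no pre-period coefficients available"}
    return {"status": "ran", "method": "joint_wald",
            "n_pre_periods": len(valid)}

# Both pre-rows have se == 0: inconclusive, not skipped.
print(pt_status([(0.1, 0.0), (0.2, 0.0)]))
```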

Tests: 3 new regressions.
- ``test_all_pre_periods_undefined_yields_inconclusive_not_skipped``:
  all pre-rows with ``se == 0``, asserts DR emits ``method="inconclusive"``
  / ``status="ran"`` / ``n_pre_periods=0`` / ``n_dropped_undefined=2``,
  and BR summary emits "inconclusive".
- ``test_imputation_did_assumption_uses_untreated_fe_model`` and
  ``test_two_stage_did_assumption_uses_untreated_fe_model``: lock the
  new ``parallel_trends_variant="untreated_outcome_fe_model"`` tag,
  require the registry-backed source attribution and untreated-subset
  detail, and reject the pre-R42 generic-PT template.
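The estimator-specific dispatch in item 2 can be sketched as below; the prose strings are paraphrases of the registry wording discussed above, not the library's exact text.

```python
def describe_assumption(results_type: str) -> str:
    """Return identifying-assumption prose per estimator type,
    splitting ImputationDiD / TwoStageDiD out of the generic
    group-time-ATT template."""
    # BJS (2024) / Gardner (2022): identification via an
    # untreated-potential-outcome model, not cohort-time ATT equality.
    untreated_fe = (
        "untreated-outcome model: unit and time fixed effects fitted on "
        "untreated observations (never-treated + not-yet-treated) deliver "
        "the counterfactual E[Y_it(0)] = alpha_i + beta_t, plus no "
        "anticipation"
    )
    branches = {
        "ImputationDiDResults": untreated_fe,
        "TwoStageDiDResults": untreated_fe,
        "CallawaySantAnnaResults": (
            "parallel trends across treatment cohorts and time periods "
            "(group-time ATT)"
        ),
    }
    return branches.get(results_type, "parallel trends")

print(describe_assumption("ImputationDiDResults"))
```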

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 5788e26300ef27398649a38c84bb43cf64e2a022


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings in the merged diff.

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: BusinessReport’s docstring and API docs tell users to opt into survey-aware 2x2 PT by supplying precomputed={'parallel_trends': ...}, but BusinessReport.__init__ has no precomputed parameter, and the auto-diagnostic path only synthesizes precomputed from honest_did_results. A user following the documented BusinessReport API will hit a constructor error; the working path today is to build DiagnosticReport(..., precomputed=...) separately and pass it via diagnostics=. Concrete fix: either add a precomputed passthrough to BusinessReport and forward it in _resolve_diagnostics(), or rewrite the BR docs/docstring to point users to diagnostics=DiagnosticReport(..., precomputed=...); then add a smoke test for whichever public contract you keep. (docs/api/business_report.rst:12, docs/api/business_report.rst:26, diff_diff/business_report.py:129, diff_diff/business_report.py:144, diff_diff/business_report.py:300, diff_diff/diagnostic_report.py:284)
  • Static review only: I could not execute tests/test_business_report.py or tests/test_diagnostic_report.py here because the sandbox is missing pytest, numpy, pandas, and scipy.

Round-43 assessment was ✅ with two P2 findings; both are boundary /
contract mismatches rather than methodology defects.

1. DEFF==1.05 boundary inconsistency. REPORTING.md defines the
   ``trivial`` band as ``0.95 <= deff < 1.05`` (half-open) and
   ``slightly_reduces`` as starting at ``1.05``. The ``is_trivial``
   flag in both DR's ``_check_design_effect`` and BR's sample-block
   copy used ``<= 1.05`` (closed), so exactly ``deff == 1.05`` landed
   in the ``slightly_reduces`` band AND was flagged ``is_trivial=True``
   — internally inconsistent, and the flag suppressed the non-trivial
   prose the documented threshold says should fire. Aligned both
   ``is_trivial`` bounds with REPORTING.md's half-open interval.

2. ``BusinessReport`` did not accept the ``precomputed=`` dict that
   its docstring and API docs advertised as the opt-in path for
   survey-aware 2x2 PT (``precomputed={'parallel_trends': ...}``) and
   other escape hatches. The auto path only synthesized
   ``{"sensitivity": honest_did_results}``, so a user following the BR
   docs hit a
   ``TypeError`` on ``__init__``. Added ``precomputed=`` kwarg to
   ``BusinessReport``, eager key validation mirroring DR's set (keys:
   ``parallel_trends`` / ``sensitivity`` / ``pretrends_power`` /
   ``bacon``), and forwarded the merged dict to the auto-constructed
   DR. ``honest_did_results`` remains a shorthand for ``sensitivity``;
   explicit ``precomputed['sensitivity']`` wins on conflict.
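The half-open band logic in item 1 can be sketched as below. Only the 0.95 and 1.05 cut points are documented above; the upper thresholds (1.5, 2.0) and band order past ``slightly_reduces`` are placeholder assumptions.

```python
def deff_band(deff: float):
    """Classify a design effect into the five-value band_label enum
    using half-open intervals, so deff == 1.05 lands in
    'slightly_reduces' and is NOT flagged trivial."""
    if deff < 0.95:
        label = "improves_precision"
    elif deff < 1.05:  # half-open: 0.95 <= deff < 1.05
        label = "trivial"
    elif deff < 1.5:   # upper cut is an assumed placeholder
        label = "slightly_reduces"
    elif deff < 2.0:   # assumed placeholder
        label = "materially_reduces"
    else:
        label = "large_warning"
    # is_trivial aligned with the documented half-open band,
    # not the old closed `<= 1.05` bound.
    is_trivial = 0.95 <= deff < 1.05
    return label, is_trivial

print(deff_band(1.05))
# -> ('slightly_reduces', False)
```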

Tests: 5 new regressions.
- ``test_deff_exactly_1_05_is_slightly_reduces_not_trivial`` +
  ``test_deff_just_under_1_05_is_trivial`` cover the exact-boundary
  + adjacent-point schema behavior across DR and BR.
- ``TestBusinessReportPrecomputedPassthrough`` covers the happy path
  (PT precomputed unlocks survey-backed 2x2), eager key validation
  (unknown key raises ``ValueError`` at BR construction), and the
  explicit-vs-shorthand precedence rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: e2115bcf7e7acdf42e7a8182e7941f9a676220fe


Overall Assessment

⚠️ Needs changes — one unmitigated P1 methodology issue remains.

Executive Summary

  • The prior re-review’s non-blocking issues appear fixed.
  • Remaining blocker: precomputed["sensitivity"] now bypasses the varying-base Callaway-Sant’Anna HonestDiD guard, so the new report layer can present robustness as if it were directly valid for the displayed fit.
  • The Methodology Registry is explicit that Callaway-Sant’Anna pre-treatment effects are only methodologically comparable under base_period="universal", and the HonestDiD note says varying-base Callaway-Sant’Anna changes the restriction’s interpretation.
  • No other unmitigated P0/P1 issues stood out in the changed source.

Methodology

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

Path to Approval

  1. Add an estimator-aware guard so DiagnosticReport / BusinessReport cannot narrate precomputed["sensitivity"] as ordinary HonestDiD robustness on CallawaySantAnnaResults(base_period != "universal").
  2. If supporting a separate universal-base refit is important, encode that provenance explicitly in the API/schema and carry it into BR/DR prose as a warning/caveat.
  3. Add regression tests covering both DiagnosticReport and BusinessReport on that exact interaction.

Round-44 landed one P1 methodology finding: ``precomputed['sensitivity']``
(and BR's ``honest_did_results`` shorthand that merges into it)
bypassed the varying-base ``CallawaySantAnna`` guard.

The applicability gate's precomputed early-return fired before the
``base_period != 'universal'`` check, so a displayed CS fit with
``base_period='varying'`` would get its sensitivity section unlocked
and the Rambachan-Roth bounds narrated as ordinary robustness — even
though HonestDiD explicitly warns those bounds are not valid for
interpretation on the consecutive-comparison pre-period surface
``varying`` base produces (REGISTRY.md §CallawaySantAnna line 410,
§HonestDiD line 2458). Narrating the bounds alongside the displayed
varying-base fit mixes provenance the bounds do not support, which
is the silent-failure pattern the varying-base auto-path skip was
designed to prevent.

Fixes:

- ``diagnostic_report.py`` ``__init__``: raise ``ValueError`` when
  ``precomputed['sensitivity']`` is supplied on
  ``CallawaySantAnnaResults`` with ``base_period != 'universal'``,
  mirroring the existing SDiD/TROP rejection pattern for
  methodology-incompatible passthroughs.
- ``diagnostic_report.py`` ``_instance_skip_reason``: reorder the
  sensitivity gate so the CS varying-base check fires BEFORE the
  precomputed early-return (defense-in-depth behind the ``__init__``
  raise; also protects against callers that mutate ``_precomputed``
  post-construction).
- ``business_report.py`` ``__init__``: raise on the same interaction
  when either ``honest_did_results`` or ``precomputed['sensitivity']``
  is supplied (or both — the error names each rejected input).
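The eager guard can be sketched as below with a stub result class; the real check also covers BR's ``honest_did_results`` shorthand and names each rejected input in the error.

```python
class CallawaySantAnnaResults:
    # Illustrative stub; the real class carries full estimation output.
    def __init__(self, base_period: str):
        self.base_period = base_period

def validate_precomputed_sensitivity(results, precomputed) -> None:
    """Reject precomputed['sensitivity'] on a varying-base CS fit at
    construction time, before any narration can occur."""
    if "sensitivity" not in (precomputed or {}):
        return
    if (isinstance(results, CallawaySantAnnaResults)
            and results.base_period != "universal"):
        raise ValueError(
            "precomputed['sensitivity'] is not supported for "
            "CallawaySantAnnaResults with base_period="
            f"{results.base_period!r}; HonestDiD bounds require "
            "base_period='universal'"
        )

try:
    validate_precomputed_sensitivity(
        CallawaySantAnnaResults(base_period="varying"),
        {"sensitivity": object()},
    )
except ValueError as exc:
    print("rejected:", exc)
```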

Tests: 5 new regressions in ``TestCSVaryingBaseSensitivityRejectsPrecomputed``
covering both DR and BR, both passthrough paths, the union-error case,
and the universal-base positive case (supported and not rejected).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 1085e724c0af85a409994a95d053c2107af9d486


Overall Assessment

⚠️ Needs changes — one unmitigated P1 methodology issue remains.

Executive Summary

  • The prior re-review blocker around CallawaySantAnnaResults(base_period="varying") plus precomputed HonestDiD sensitivity appears fixed by the new BR/DR guards and regression coverage.
  • I did not find a new P0 inference/NaN/SE bug in the changed reporting logic or in the small result-metadata threading changes (anticipation / cluster persistence).
  • One methodology issue remains: BusinessReport still turns a Bacon/TWFE contamination diagnostic into “re-estimate with a heterogeneity-robust estimator” advice even when the current fit is already heterogeneity-robust.
  • Goodman-Bacon is explicitly a TWFE decomposition, so on CS/SA/BJS/Gardner/Wooldridge/EfficientDiD/Stacked-style fits the right interpretation is “TWFE would be a bad benchmark on this rollout,” not “replace the estimator you already used.”
  • Static review only; I did not execute the test suite in this environment.

Methodology

  • Severity P1. Impact: BusinessReport converts a large Bacon forbidden-weight share into a caveat that says “Re-estimate with a heterogeneity-robust estimator (CS / SA / BJS / Gardner)” for any fit whose auto-constructed diagnostics produced Bacon output. But Goodman-Bacon is explicitly a decomposition of TWFE weights, not of the heterogeneity-robust estimators this report layer also summarizes. On already-robust fits, this is methodologically wrong: the diagnostic says the rollout design would contaminate a hypothetical TWFE benchmark, not that the displayed estimator needs replacement. DiagnosticReport partly preserves that distinction with “if not already in use,” but BusinessReport drops the guard and renders the stronger recommendation in stakeholder-facing caveats / full reports. Concrete fix: make the Bacon caveat estimator-aware. For already-robust estimators, rephrase to “this rollout would make a TWFE benchmark unreliable” (or suppress the action item entirely); keep “switch to a heterogeneity-robust estimator” only for genuinely TWFE-style contexts. References: diff_diff/business_report.py:L1523-L1540, diff_diff/business_report.py:L2397-L2403, diff_diff/diagnostic_report.py:L3131-L3138, diff_diff/bacon.py:L1-L10, diff_diff/bacon.py:L67-L86.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings beyond the methodology issue above.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

No separate findings, but this PR should add a regression for the P1 above: a non-TWFE fit with forbidden_weight > 0.10 should not produce a “switch to a heterogeneity-robust estimator” recommendation in BusinessReport. Best target files: tests/test_business_report.py and tests/test_diagnostic_report.py.

Path to Approval

  1. Make the Bacon warning/caveat estimator-aware so already-robust fits are described as avoiding TWFE contamination rather than needing replacement.
  2. Add a regression on at least one already-robust estimator (for example CS or Wooldridge) with a high forbidden-weight Bacon block, asserting BusinessReport does not recommend switching estimators on that path.

The round-45 review raised one P1 methodology finding: ``BusinessReport``
emitted the Bacon "re-estimate with a heterogeneity-robust estimator
(CS / SA / BJS / Gardner)" caveat on every fit whose Bacon block had
``forbidden_weight > 0.10``, including fits that were already produced
by one of those robust estimators.

Goodman-Bacon is explicitly a decomposition of TWFE weights
(``bacon.py`` header, Goodman-Bacon 2021). On a displayed fit that is
already heterogeneity-robust (CS / SA / BJS / Gardner / Wooldridge /
EfficientDiD / Stacked / dCDH / TripleDifference /
StaggeredTripleDiff / SDiD / TROP), a high forbidden-weight share is a
statement about what TWFE WOULD have done on this rollout, not a claim
that the displayed estimator needs replacement. DiagnosticReport
partly preserved this distinction in its prose with an "if not
already in use" guard; BusinessReport dropped it and rendered the
stronger recommendation in stakeholder-facing caveats / full reports.

Fix (``business_report.py`` ``_build_caveats``):

- Introduce ``_TWFE_STYLE_RESULTS = {DiDResults, MultiPeriodDiDResults,
  TwoWayFixedEffectsResults}`` — the fits for which the switch-to-
  robust recommendation is load-bearing.
- Keep the original message for TWFE-style fits.
- Rephrase for already-robust fits: "TWFE benchmark would be
  materially biased on this rollout; the displayed estimator is
  already heterogeneity-robust, so this is a statement about the
  rollout design (avoid reporting TWFE alongside this fit), not about
  the current result's validity."
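
The branch described above can be sketched as follows. This is a minimal illustration, not the actual ``_build_caveats`` implementation: the result classes are empty stubs standing in for diff_diff's result types, and the 0.10 threshold and message wording mirror the PR description.

```python
# Minimal sketch of the estimator-aware Bacon caveat. Result classes
# are stubs; threshold and wording follow the PR text, not the library.

class DiDResults: pass
class MultiPeriodDiDResults: pass
class TwoWayFixedEffectsResults: pass
class CallawaySantAnnaResults: pass  # stand-in for an already-robust fit

# Fits for which the switch-to-robust recommendation is load-bearing.
_TWFE_STYLE_RESULTS = (
    DiDResults,
    MultiPeriodDiDResults,
    TwoWayFixedEffectsResults,
)

def bacon_caveat(results, forbidden_weight):
    """Return a caveat string for a high forbidden-weight share, or None."""
    if forbidden_weight <= 0.10:
        return None
    if isinstance(results, _TWFE_STYLE_RESULTS):
        # Original message: only load-bearing for TWFE-style fits.
        return ("Re-estimate with a heterogeneity-robust estimator "
                "(CS / SA / BJS / Gardner).")
    # Already-robust fit: describe the rollout design, not the estimator.
    return ("A TWFE benchmark would be materially biased on this rollout; "
            "the displayed estimator is already heterogeneity-robust, so "
            "this reflects the rollout design, not the current result's "
            "validity.")
```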

Tests: 3 new regressions in ``TestBaconCaveatEstimatorAware``:
- CS-like fit with high forbidden-weight does NOT recommend switching.
- Spot-check the same rule across SA / Imputation / TwoStage / Stacked
  / Wooldridge / dCDH / EfficientDiD.
- ``MultiPeriodDiDResults`` (TWFE event-study) DOES keep the switch
  recommendation.
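
A hedged sketch of what those regressions might look like in pytest style, with the caveat logic stubbed inline (``make_report`` and the estimator-kind strings are hypothetical stand-ins for building a BusinessReport over a high forbidden-weight Bacon block, not the library API):

```python
# Hypothetical shape of the TestBaconCaveatEstimatorAware regressions.
# make_report is a stub, not the real BusinessReport construction path.

def make_report(estimator_kind, forbidden_weight=0.35):
    caveats = []
    if forbidden_weight > 0.10:
        if estimator_kind in {"twfe", "multi_period"}:
            caveats.append(
                "Re-estimate with a heterogeneity-robust estimator")
        else:
            caveats.append(
                "A TWFE benchmark would be materially biased on this rollout")
    return {"caveats": caveats}

def test_cs_fit_does_not_recommend_switching():
    caveats = make_report("callaway_santanna")["caveats"]
    assert all("Re-estimate" not in c for c in caveats)
    assert any("TWFE benchmark" in c for c in caveats)

def test_twfe_event_study_keeps_switch_recommendation():
    caveats = make_report("multi_period")["caveats"]
    assert any("Re-estimate" in c for c in caveats)
```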

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: dcfb4fea110702d661b38face15b78e539a74085


Overall assessment

✅ Looks good — the previous P1 from the last AI review appears resolved, and I did not find a new unmitigated P0/P1 in the changed diff.

Executive summary

  • The prior Goodman-Bacon/TWFE blocker is fixed. BusinessReport now limits the “switch to a heterogeneity-robust estimator” recommendation to TWFE-style fits and reframes high forbidden-weight Bacon output as a TWFE-benchmark/rollout-design warning for already-robust estimators diff_diff/business_report.py:L1523-L1568 tests/test_business_report.py:L4088-L4120.
  • Methodology-sensitive reporting deviations are documented rather than silent: event-study PT routing, survey replay guards, and the 0.30 / 0.05 pre-trends verdict heuristics are explicitly recorded in the reporting note docs/methodology/REPORTING.md:L75-L197.
  • The new PT/precomputed applicability logic surfaces explicit skipped, error, or inconclusive states instead of manufacturing clean inference when required inputs are missing or undefined diff_diff/diagnostic_report.py:L582-L857.
  • I did not find a new NaN/SE/inference anti-pattern or an obvious propagation gap in the touched result-threading changes.
  • Static review only; I did not execute the test suite in this environment.
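
The explicit-state contract described in the third bullet can be sketched as follows; the names (``CheckOutcome``, ``parallel_trends_check``) and status strings are illustrative assumptions, not the diff_diff API.

```python
# Sketch of surfacing explicit check states ("skipped", "error", "ok")
# instead of fabricating clean inference when inputs are missing or
# undefined. Names are illustrative, not the library API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CheckOutcome:
    status: str    # "ok" | "skipped" | "error" | "inconclusive"
    detail: str
    p_value: Optional[float] = None

def parallel_trends_check(pre_estimates, pre_ses):
    """Return an explicit outcome rather than silently producing a number."""
    if not pre_estimates:
        return CheckOutcome("skipped", "no pre-treatment estimates available")
    if len(pre_estimates) != len(pre_ses) or any(
        se is None or se <= 0 for se in pre_ses
    ):
        return CheckOutcome("error", "standard errors missing or non-positive")
    # A real implementation would run the joint pre-trends test here;
    # this sketch only demonstrates the explicit-state contract.
    return CheckOutcome("ok", f"{len(pre_estimates)} pre-periods checked")
```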

Methodology

  • No findings.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: I did not find new untracked deferred work introduced by this PR. The existing Wooldridge items in TODO.md:L76-L80 are pre-existing and are not blockers for this reporting-layer change. Concrete fix: None required for approval.

Security

  • No findings.

Documentation/Tests

  • No findings.

igerber added the ready-for-ci (Triggers CI test workflows) label Apr 19, 2026
igerber merged commit 2e9447e into main Apr 19, 2026
23 of 24 checks passed
igerber deleted the business-report branch April 19, 2026 22:46
igerber added a commit that referenced this pull request Apr 20, 2026
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per
project SemVer convention, minor bumps are reserved for new estimators or new
module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults
(PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner-
  ready output layer. Plain-English narrative summaries across all 16 result types,
  with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear
  regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the
  upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A
  contract), heterogeneity + within-group-varying PSU under Binder TSL, and
  PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review - docs/performance-scenarios.md documents 5-7
  realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339)
continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default
CI after false-positive flakes; perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml,
diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released:
2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the
comparison-link footer. CITATION.cff retains v3.1.3 versioned DOI in identifiers;
the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in
a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>