Add BusinessReport and DiagnosticReport (experimental preview) #318
Implements the 'practitioner-ready output' roadmap pair: plain-English
stakeholder narratives from any of the 16 fitted result types, backed by
a stable AI-legible to_dict() schema (single source of truth; prose renders
from the dict).
BusinessReport:
- summary()/full_report()/export_markdown() surface stakeholder text
- to_dict()/to_json() expose the structured schema for AI agents
- Optional outcome_label/outcome_unit/business_question/treatment_label
for context; single-knob alpha drives CI level and phrasing threshold
- auto_diagnostics=True (default) constructs an internal DiagnosticReport
so the summary mentions pre-trends, sensitivity, and design effects in
one call; diagnostics=<DR|DRResults|None> overrides explicitly
- Rejects BaconDecompositionResults with a helpful TypeError
DiagnosticReport:
- Orchestrates check_parallel_trends, compute_pretrends_power,
HonestDiD.sensitivity (grid form yielding breakdown_M), bacon_decompose,
compute_deff_diagnostics, results.epv_diagnostics, plus heterogeneity
(CV + range + sign consistency)
- Estimator-native routing: SDiD uses pre_treatment_fit + in_time_placebo
+ sensitivity_to_zeta_omega; EfficientDiD uses native hausman_pretest;
TROP surfaces factor-model fit (effective_rank / loocv_score / lambdas)
- Lazy: construction is free; run_all() triggers compute and caches
- precomputed={...} escape hatch for user-supplied diagnostic results
- Power-aware phrasing tiers (well/moderately/underpowered) drive the
'no_detected_violation' verdict prose rather than always hedging
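The lazy-construction pattern above can be sketched standalone (the class body, attribute names, and placeholder section here are illustrative assumptions, not the library's internals):

```python
class LazyDiagnosticReport:
    """Minimal sketch: construction stores references only; run_all()
    computes once and caches, so repeated calls are free. The
    precomputed= escape hatch lets user-supplied results win."""

    def __init__(self, results, precomputed=None):
        self._results = results
        self._precomputed = dict(precomputed or {})
        self._cache = None  # nothing computed at construction time

    def run_all(self):
        if self._cache is None:
            # Placeholder for the real diagnostic orchestration.
            sections = {"parallel_trends": {"status": "skipped"}}
            sections.update(self._precomputed)  # user-supplied overrides win
            self._cache = sections
        return self._cache
```

Repeated `run_all()` calls return the same cached dict, which is what makes construction free.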
Docs:
- New docs/methodology/REPORTING.md records design deviations via the
'- **Note:**' label pattern (no-traffic-light gates, pre-trends verdict
thresholds, unit-translation policy, schema stability policy)
- docs/methodology/REGISTRY.md cross-links to REPORTING.md
- New docs/api/business_report.rst and docs/api/diagnostic_report.rst
registered under a new 'Reporting' section in docs/api/index.rst
- docs/doc-deps.yaml tracks both modules
- README adds a stakeholder-report example under 'For Data Scientists'
- CHANGELOG marks both schemas experimental for this release
- ROADMAP moves BR/DR to Recently Shipped; splits the original bundled
bullet so context-aware practitioner_next_steps() remains queued
- diff_diff/guides/llms-full.txt documents the public API and both
schemas for AI agents; llms-practitioner.txt notes that DR covers
Baker steps 3/6/7 in one call
Tests: 61 new tests (32 BR + 29 DR) cover schema contract,
applicability matrix across all 16 result types, JSON round-trip,
precomputed-sensitivity passthrough (no re-compute), error handling,
power-tier/verdict boundaries, unit-label behavior, significance-chasing
guard, NaN ATT, include_appendix toggle, BaconDecompositionResults
TypeError, survey metadata passthrough, and alpha single-knob behavior.
All pass in 0.26s under pytest.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…provenance
Follows up on review findings on the prior commit:
- **P1 Wald test coverage** — add targeted tests for ``_pt_event_study``
(``TestJointWaldAlignment``):
* joint_wald runs when pre-period keys align with ``interaction_indices``
* computed chi-squared statistic matches a closed-form expectation
* Bonferroni fallback when ``interaction_indices`` is missing
* Bonferroni fallback when the key namespace is misaligned
* Bonferroni fallback when ``vcov`` is missing
Also document the alignment contract and fallback rule inline near the
Wald codepath so the invariant is discoverable without reading tests.
- **P2 outcome_direction** — implement direction-aware verbs in the
headline sentence via ``_direction_verb``:
* ``higher_is_better`` + positive effect -> "lifted"
* ``higher_is_better`` + negative effect -> "reduced"
* ``lower_is_better`` + positive effect -> "worsened"
* ``lower_is_better`` + negative effect -> "improved"
* ``None`` -> neutral "increased" / "decreased"
Covered by ``TestOutcomeDirection`` with three scenarios.
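The verb table above can be sketched as a free function (the real ``_direction_verb`` lives in the report module; this standalone signature and the zero-effect handling are assumptions):

```python
def direction_verb(outcome_direction, effect):
    """Map outcome_direction and the effect's sign to a headline verb."""
    if outcome_direction is None:
        return "increased" if effect > 0 else "decreased"
    if outcome_direction == "higher_is_better":
        return "lifted" if effect > 0 else "reduced"
    if outcome_direction == "lower_is_better":
        return "worsened" if effect > 0 else "improved"
    raise ValueError(f"unknown outcome_direction: {outcome_direction!r}")
```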
- **P2 warning provenance** — populate top-level ``schema["warnings"]``
from every section that ended in ``status="error"`` so agents do not
have to scan each section dict to discover diagnostic failures.
``DiagnosticReportResults.warnings`` now mirrors the top-level list.
Covered by ``TestWarningsPassthrough``.
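The aggregation rule can be sketched as follows (the ``"message"`` section key and the label format are assumptions; only the ``status="error"`` convention comes from the text above):

```python
def collect_warnings(schema):
    """Hoist every section that ended in status="error" into a flat
    top-level list, so agents need not scan each section dict."""
    warnings = []
    for name, section in schema.items():
        if isinstance(section, dict) and section.get("status") == "error":
            warnings.append(f"{name}: {section.get('message', 'diagnostic failed')}")
    return warnings
```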
- **P2 string dispatch** — add an inline note above ``_APPLICABILITY``
explaining the ``type(results).__name__`` convention (mirrors
``practitioner._HANDLERS`` to avoid circular imports) and pointing at
the applicability-matrix test as the regression guard.
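A minimal sketch of the string-dispatch convention: keys are class *names*, so the table needs no imports of the result classes, which is how the circular-import issue is avoided. The table entries and the stand-in class below are illustrative, not the library's real matrix:

```python
# Keys are type(results).__name__ strings, not classes.
_APPLICABILITY = {
    "MultiPeriodDiDResults": frozenset({"parallel_trends", "pretrends_power"}),
    "BaconDecompositionResults": frozenset(),
}

def applicable_checks(results):
    """Dispatch on the class name; unknown types get an empty set."""
    return _APPLICABILITY.get(type(results).__name__, frozenset())

class MultiPeriodDiDResults:  # stand-in with the right __name__
    pass
```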
No behavioral changes outside the review items; existing tests remain
unchanged. 121 tests pass, black / ruff / mypy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the "Stakeholder-ready report from any fit" subsection framing with "Experimental preview: BusinessReport and DiagnosticReport" and reword the introductory paragraph to emphasize that wording, verdict thresholds, and schema shape will change. Drop the expected-output comment from the example (the prose will evolve) and invite feedback.
This matches the foundation-not-shipped-feature posture: the schema and narrative prototype are worth validating in isolation, but the library still lacks several items a methodologically rigorous practitioner would expect (covariate comparison, event-study plot embedding, 2x2 placebo battery, real-dataset validation, target-parameter clarity, tutorial integration). External framing stays conservative until those gaps close.
No functional changes; only README prose.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Overall Assessment: ⛔ Blocker. Static review only; I did not run the new tests in this workspace because the local Python environment is missing [review truncated]. [Executive Summary, Methodology, Code Quality, Performance, Maintainability, Tech Debt, Security, Documentation/Tests, and Path to Approval sections collapsed.]
|
Fixes the five issues the CI reviewer flagged against the initial
BusinessReport / DiagnosticReport PR:
P0 — Single-M HonestDiDResults passthrough was being narrated as
"robust across the full grid" because both renderers checked
``breakdown_M is None`` and fell through to the grid-wide phrasing.
Preserves the ``conclusion="single_M_precomputed"`` state through
both BR._render_summary and DR._render_overall_interpretation; the
point check is now rendered as "at M=<value>, the robust CI
(excludes|includes) zero — run HonestDiD.sensitivity() across a
grid for a breakdown value." Regression tests in
``TestSingleMSensitivityPrecomputed`` cover both DR.summary() and
BR(honest_did_results=...).summary().
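The fixed renderer branch can be sketched standalone (argument names and the fallback strings are assumptions; the key point is that a single-M passthrough has ``breakdown_M=None`` by construction and must not read as grid-wide robustness):

```python
def render_sensitivity(conclusion, breakdown_M, M=None, ci_excludes_zero=False):
    """Render the HonestDiD sensitivity verdict without conflating a
    single-M passthrough with grid-wide robustness."""
    if conclusion == "single_M_precomputed":
        verdict = "excludes" if ci_excludes_zero else "includes"
        return (f"at M={M}, the robust CI {verdict} zero -- run "
                "HonestDiD.sensitivity() across a grid for a breakdown value")
    if breakdown_M is None:
        return "robust across the full grid (no breakdown)"
    return f"breakdown at M={breakdown_M}"
```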
P0 — EPV diagnostics were silently reporting 0 low cells and
min_epv=None on every fit because _check_epv() expected
``low_epv_cells`` / ``min_epv`` attributes but the library's
``epv_diagnostics`` convention is a dict keyed by cell identifier
with per-cell ``{"is_low": bool, "epv": float}`` entries. Rewrites
_check_epv() to handle the dict shape, counts low cells via
``v.get("is_low")``, derives min_epv from the ``epv`` values, and
reads ``results.epv_threshold`` instead of hardcoding 10. Legacy
object-shape fallback retained for custom subclasses. Regression
tests in ``TestEPVDictBacked`` cover low-cell detection, no-low
clean case, and the configurable threshold.
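The dict-shape handling can be sketched as follows (the helper name and return keys are assumptions; the per-cell ``{"is_low": bool, "epv": float}`` convention and the legacy attribute names come from the text above):

```python
import math

def check_epv(epv_diagnostics):
    """Count low-EPV cells and derive min_epv from the dict-keyed-by-cell
    convention, with a legacy object-shape fallback for custom subclasses."""
    if isinstance(epv_diagnostics, dict):
        n_low = sum(1 for v in epv_diagnostics.values() if v.get("is_low"))
        epvs = [v["epv"] for v in epv_diagnostics.values()
                if isinstance(v.get("epv"), (int, float)) and math.isfinite(v["epv"])]
        min_epv = min(epvs) if epvs else None
    else:  # legacy object shape
        n_low = getattr(epv_diagnostics, "low_epv_cells", 0)
        min_epv = getattr(epv_diagnostics, "min_epv", None)
    return {"n_low_cells": n_low, "min_epv": min_epv}
```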
P1 — CallawaySantAnnaResults sensitivity + pretrends_power were
skipped entirely because the applicability gate required
``results.vcov``, but CS exposes ``event_study_vcov`` /
``event_study_vcov_index`` alongside a populated
``event_study_effects`` surface. ``HonestDiD.sensitivity_analysis``
and ``compute_pretrends_power`` already handle CS via those
attributes, so the gate now accepts any of the three covariance
sources. Also honors precomputed overrides regardless of gate.
Regression tests in ``TestCSEventStudyVCovSupport`` confirm both
checks are applicable on an aggregated CS fit.
P1 — _pt_event_study() was forcing Bonferroni on CS even though
event_study_vcov + event_study_vcov_index were available. Added a
second covariance source branch that builds an index map from
``event_study_vcov_index`` and reports
``method="joint_wald_event_study"`` on the Wald path. Regression
test in ``TestCSJointWaldViaEventStudyVCov`` verifies the
closed-form chi-squared statistic on a CS stub with known
coefficients and diagonal vcov.
P1 — ContinuousDiDResults headline extraction was broken: BR and DR
both assumed ``overall_se`` / ``overall_p_value`` /
``overall_conf_int`` but ContinuousDiDResults stores them as
``overall_att_se`` / ``overall_att_p_value`` /
``overall_att_conf_int``. Centralized the attribute-alias lookup in
a new module-level ``_extract_scalar_headline`` helper that both
report classes call; the helper accepts multiple SE / p / CI alias
names per point-estimate row and tries them in order.
Regression test in ``TestContinuousDiDHeadline`` confirms the
helper resolves the ``overall_att_*`` aliases.
P2 — Headline extraction was duplicated in BR and DR. Eliminated by
the shared helper above.
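The core idea of the shared helper is an ordered attribute-alias lookup; a standalone sketch (the free-function form and name are assumptions):

```python
def first_attr(results, names):
    """Try attribute aliases in order; return the first non-None value."""
    for name in names:
        value = getattr(results, name, None)
        if value is not None:
            return value
    return None
```

For example, ``first_attr(fit, ("overall_se", "overall_att_se"))`` would resolve the ContinuousDiD ``overall_att_*`` alias after the generic name misses.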
131 targeted tests pass (BR + DR + guides + practitioner); black,
ruff, and mypy clean on the new modules.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber). Overall Assessment: ⛔ Blocker. Static review only; I could not run the new tests in this workspace. [Executive Summary, Methodology, and Code Quality sections collapsed.]
Performance: no material findings in the diff. Maintainability: no separate findings beyond the incomplete public-API contract above. Tech Debt: no separate deferrable findings; none of the P0 / P1 issues above are tracked under TODO.md:L37, so TODO tracking does not mitigate them. Security: no findings. [Documentation/Tests and Path to Approval sections collapsed.]
|
P0 fixes:
1. **CI-level mislabeling across BR and DR.** BR's ``_extract_headline`` was reading the stored CI from the fitted result and relabeling it with the caller's ``alpha``, so ``BusinessReport(results, alpha=0.10)`` would print a stored 95% interval under a "90% CI" label. DR's ``_render_overall_interpretation`` hardcoded "95% CI" in prose, inverting the same bug when the caller used a non-default alpha. BR now recomputes the interval via ``safe_inference`` when the caller's alpha differs from the fit's; DR prose reads the headline's alpha to derive the CI level string. Regression in ``TestAlphaKnob.test_ci_bounds_recomputed_when_alpha_differs_from_result``.
2. **``full_report()`` single-M HonestDiD rendering.** The summary path was fixed earlier, but the structured-markdown path still emitted "Breakdown M: robust across full grid (no breakdown)" for a single-M passthrough (which has ``breakdown_M=None`` by construction, not because it is grid-wide robust). Added the ``conclusion == "single_M_precomputed"`` branch in ``_render_full_report``. Regression in ``TestFullReportSingleM.test_full_report_does_not_claim_full_grid_for_single_m``.
3. **Reference-marker / NaN pre-period filtering.** ``_collect_pre_period_coefs`` was accepting any negative-time event-study row with non-None effect and SE, which pulled in the universal-base reference marker (``effect=0, se=NaN, n_groups=0``) emitted by CS / SA / ImputationDiD / Stacked event-study output as a real pre-period coefficient. ``_pt_event_study`` Bonferroni also treated ``NaN`` p-values as valid by checking ``is not None`` rather than ``np.isfinite``. The combination could produce a false-clean ``no_detected_violation`` verdict on fits whose only "evidence" was synthetic. Now drop rows with ``n_groups == 0`` and any row whose effect, SE, or p-value is non-finite before both applicability and PT computation; if no valid entries remain, the check returns ``skipped`` rather than a clean p-value. Regressions in ``TestReferenceMarkerAndNaNFiltering``.
P1 fixes:
4. **Power-tier covariance source annotation.** ``compute_pretrends_power`` currently drops to ``np.diag(ses**2)`` for CS / SA / ImputationDiD / Stacked / etc. even when the full ``event_study_vcov`` is attached on the result. The diagonal-only MDV can be optimistic because it ignores correlations across pre-periods; promoting that to ``well_powered`` would overstate the evidence. The ``pretrends_power`` schema section now records ``covariance_source`` (one of ``full_pre_period_vcov`` / ``diag_fallback_available_full_vcov_unused`` / ``diag_fallback``), BR downgrades ``well_powered`` → ``moderately_powered`` when we know the diagonal approximation was the only input, and ``docs/methodology/REPORTING.md`` documents this as a known conservative deviation pending the right long-term fix in ``pretrends.py``.
5. **``precomputed=`` contract validation.** The docstring advertised passthrough for ``placebo``, ``design_effect``, ``heterogeneity``, and ``epv``, but only four checks actually respected it (``parallel_trends``, ``sensitivity``, ``pretrends_power``, ``bacon``). Narrowed the docstring to match reality and added a ``ValueError`` that rejects unsupported ``precomputed=`` keys at construction. Regressions in ``TestPrecomputedValidation``. The remaining sections (``design_effect``, ``heterogeneity``, ``epv``) are read-outs from the fitted result with no expensive call to bypass; there is no scenario where a user-supplied override helps.
139 targeted tests pass; black, ruff, and mypy clean on the new modules.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
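The pre-period filtering rule can be sketched standalone (the dict keys mirror the commit message; the row shape is otherwise an assumption):

```python
import math

def valid_pre_period_rows(rows):
    """Drop synthetic reference markers (n_groups == 0) and any row whose
    effect, SE, or p-value is non-finite; an empty result should make the
    caller return a skipped status rather than a clean p-value."""
    def _finite(x):
        return isinstance(x, (int, float)) and math.isfinite(x)
    return [r for r in rows
            if r.get("n_groups", 1) != 0
            and _finite(r.get("effect"))
            and _finite(r.get("se"))
            and _finite(r.get("p_value"))]
```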
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber). Overall Assessment: ⛔ Blocker. Static review only; I could not execute the added tests here because this workspace does not have [review truncated]. [Executive Summary, Methodology, and Code Quality sections collapsed.]
Performance: no material findings in the diff. Maintainability: no separate findings beyond the over-broad applicability/support matrix issues above. Tech Debt: no mitigation applies here. Security: no findings. [Documentation/Tests and Path to Approval sections collapsed.]
|
P0 fix:
* **Alpha override was inference-contract-blind.** Previously, whenever
the caller's ``alpha`` differed from the result's, BR recomputed the
displayed CI via ``safe_inference(att, se, alpha=alpha)`` with no
``df`` and no bootstrap handling — silently discarding the
``bootstrap_distribution`` / finite-df inference contracts used by
TROP, ContinuousDiD, dCDH-bootstrap, survey fits, SDiD jackknife,
etc. BR now detects bootstrap-backed (``inference_method='bootstrap'``
or non-None ``bootstrap_distribution`` or ``variance_method in
{bootstrap, jackknife, placebo}``) and finite-df (``df_survey > 0``)
inference paths and preserves the fitted CI at its native level in
those cases, recording an informational caveat noting that the
caller's alpha still drives phrasing but the native interval is
shown. Regressions in ``TestAlphaOverrideBootstrapAndFiniteDF``
cover both the bootstrap and finite-df survey paths.
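The detection rule described above can be sketched as a standalone predicate (attribute names follow the commit message; the defaults and free-function form are assumptions, and later rounds widen the rule further):

```python
def preserve_native_ci(results):
    """True when the fitted CI should be preserved at its native level:
    bootstrap-backed inference or a finite-df survey design."""
    if getattr(results, "inference_method", None) == "bootstrap":
        return True
    if getattr(results, "bootstrap_distribution", None) is not None:
        return True
    if getattr(results, "variance_method", None) in {"bootstrap", "jackknife", "placebo"}:
        return True
    df_survey = getattr(results, "df_survey", None)
    return df_survey is not None and df_survey > 0
```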
P1 fixes:
* **``pretrends_power`` over-broad applicability.** The matrix had
marked the check applicable for ImputationDiD, TwoStage, Stacked,
EfficientDiD, StaggeredTripleDiff, Wooldridge, and dCDH, but
``compute_pretrends_power`` only has adapters for MultiPeriod, CS,
and SA; the other families were landing in ``error``. Narrowed the
applicability matrix to match the real helper support.
* **``sensitivity`` over-broad applicability.** HonestDiD only adapts
MultiPeriod, CS, and dCDH (via ``placebo_event_study``). The matrix
had also included SA / Imputation / Stacked / EfficientDiD /
StaggeredTripleDiff / Wooldridge. Narrowed to the supported set. The
dCDH-specific instance gate now checks ``placebo_event_study`` rather
than the generic ``event_study_effects`` so HonestDiD's dCDH branch
is reached instead of the generic event-study collector.
* **``n_obs == 0`` reference-marker filter.** Stacked / TwoStage /
Imputation emit synthetic reference-period markers using ``n_obs=0``
rather than CS / SA's ``n_groups=0`` flag. ``_collect_pre_period_coefs``
now drops rows with either sentinel so the Bonferroni denominator
and joint-Wald index are not inflated by non-informative rows.
P2 fix:
* **``placebo`` schema inconsistency.** ``REPORTING.md`` said
``placebo`` is always rendered as ``{"status": "skipped"}`` in MVP,
but no result type had ``placebo`` in its applicability frozenset, so
implementation fell through to ``"not_applicable"``. Now every
DiagnosticReport.to_dict() returns ``placebo`` with ``status="skipped"``
regardless of estimator, matching the stated contract.
Regression tests for each finding added in
``TestNarrowedApplicabilityAndPlaceboSchema`` and
``TestAlphaOverrideBootstrapAndFiniteDF``. 146 targeted tests pass;
black, ruff, mypy clean on the new modules.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber). Overall Assessment: ⛔ Blocker. Static review only; I could not execute the added tests in this workspace. [Executive Summary, Methodology, Code Quality, Performance, Maintainability, Tech Debt, Security, Documentation/Tests, and Path to Approval sections collapsed.]
|
P0 fix:
* **``inference_method == 'wild_bootstrap'`` was not detected as
bootstrap-like.** My prior bootstrap check caught ``'bootstrap'`` and
``variance_method in {bootstrap, jackknife, placebo}`` plus an
attached ``bootstrap_distribution``, but ``DifferenceInDifferences(
inference='wild_bootstrap')`` returns ``inference_method='wild_bootstrap'``
and a percentile-bootstrap CI without necessarily attaching the raw
distribution. The override path silently replaced that CI with a
normal-approximation one. Fixed by matching both
``'bootstrap'`` and ``'wild_bootstrap'``; the preserved-CI caveat
now calls out "wild cluster bootstrap" specifically when that path
triggered. Regression: ``TestWildBootstrapAlphaOverride``.
P1 fix:
* **``_describe_assumption()`` emitted generic DiD PT text for
ContinuousDiD / TripleDifference / StaggeredTripleDiff**, all of
which have identifying logic different from ordinary group-time PT
per ``docs/methodology/REGISTRY.md``. Replaced the generic fallback
with source-backed branches:
- ``ContinuousDiDResults``: two-level parallel trends (PT vs Strong
PT) per Callaway, Goodman-Bacon & Sant'Anna (2024), with explicit
mention of ATT(d|d), ATT(d), ACRT identification sets.
- ``TripleDifferenceResults`` / ``StaggeredTripleDiffResults``:
triple-difference cancellation across the 2x2x2 cells per
Ortiz-Villavicencio & Sant'Anna (2025); notes that identification
is weaker than ordinary DiD PT and depends on additive
separability across the three dimensions.
The ``parallel_trends_variant`` schema field gains two new values:
``"dose_pt_or_strong_pt"`` and ``"triple_difference_cancellation"``.
Direct regressions in ``TestAssumptionBlockSourceFaithful`` assert
registry-backed language (attribution phrases + method names) is
present and generic group-time PT text is absent.
150 targeted tests pass; black, ruff, mypy clean on the new modules.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber). Overall Assessment: ⛔ Blocker. Static review only; the rerun fixes the narrow [review truncated]. [Executive Summary, Methodology, Code Quality, Performance, Maintainability, Tech Debt, Security, Documentation/Tests, and Path to Approval sections collapsed.]
|
- P0: ``_extract_headline`` now detects ``bootstrap_results is not None`` and ``n_bootstrap > 0`` in addition to ``inference_method`` / ``bootstrap_distribution`` / ``variance_method`` / ``df_survey``. Many staggered / continuous / dCDH result classes copy bootstrap-derived se/p/conf_int into their top-level fields without advertising ``inference_method``; alpha override must preserve their fitted CI rather than silently swapping in a normal-approximation interval.
- P1: ``DiagnosticReport._check_sensitivity`` wraps the HonestDiD call in ``warnings.catch_warnings(record=True)`` and propagates captured messages onto the returned section dict. ``run_all`` aggregates per-section warnings into the top-level ``warnings`` list so both DR and BR surface them. CallawaySantAnna fits with ``base_period='varying'`` are preemptively skipped at the applicability gate with a methodology-critical reason, since HonestDiD explicitly warns those bounds are not valid for interpretation. BR renders the skip as a warning-severity caveat under a new ``sensitivity_skipped`` topic.
- P1: ``_describe_assumption`` now gives ``ChaisemartinDHaultfoeuilleResults`` a source-backed description of transition-based identification (joiners / leavers / stable-control transitions, DID_M / DID_l building blocks, non-binary dose matching, reversible treatment) rather than generic group-time ATT PT text.
- P2: README example now uses ``CallawaySantAnna(base_period='universal')`` so the advertised one-call sensitivity path actually runs. Both ``cs_fit`` fixtures updated likewise.
- Regressions: ``TestBootstrapResultsAndNBootstrapDetection`` (four cases incl. a dCDH-shaped stub and the analytic zero-bootstrap guard), ``TestDCDHAssumptionTransitionBased`` (source-faithful language assertions), ``TestCSVaryingBaseSensitivitySkipped`` (DR schema reason + BR caveat surfacing).
150 -> 115 targeted tests passing; black / ruff / mypy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber). Overall Assessment: static review only; I could not run the added tests locally in this environment. [Executive Summary, Methodology, Code Quality, Performance, Maintainability, Tech Debt, Security, Documentation/Tests, and Path to Approval sections collapsed.]
|
- P1 heterogeneity: ``_collect_effect_scalars`` no longer mixes pre-
and post-treatment coefficients into the CV / range / sign-
consistency summary. ``MultiPeriodDiDResults`` now routes through
``post_period_effects``; staggered event-study fits filter to
``rel_time >= 0`` AND exclude reference markers
(``n_groups == 0`` / ``n_obs == 0``) AND exclude non-finite rows;
CS ``group_time_effects`` filters to ``t >= g`` post-treatment cells.
``_heterogeneity_source`` now names the post-only surface (e.g.,
``post_period_effects`` / ``event_study_effects_post`` /
``group_time_effects_post``) so downstream tooling can verify the
estimand being summarized.
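The post-only filter for event-study fits can be sketched standalone (row keys mirror the commit message; the dict shape is otherwise an assumption):

```python
import math

def post_treatment_effects(event_study_rows):
    """Keep rel_time >= 0 rows, drop reference markers (n_groups == 0 or
    n_obs == 0), and drop non-finite effects, so the heterogeneity
    summary sees only real post-treatment coefficients."""
    out = []
    for r in event_study_rows:
        if r.get("rel_time", -1) < 0:
            continue
        if r.get("n_groups", 1) == 0 or r.get("n_obs", 1) == 0:
            continue
        effect = r.get("effect")
        if isinstance(effect, (int, float)) and math.isfinite(effect):
            out.append(effect)
    return out
```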
- P1 dCDH parallel trends: ``_collect_pre_period_coefs`` now reads
``placebo_event_study`` as the pre-period surface for
``ChaisemartinDHaultfoeuilleResults``. dCDH is advertised as
PT-applicable in ``_APPLICABILITY`` but the extractor previously
only looked at ``pre_period_effects`` / negative-horizon
``event_study_effects``, silently skipping the PT check on valid
placebo fits.
- P2: API RST examples (``docs/api/business_report.rst``,
``docs/api/diagnostic_report.rst``) updated to construct
``CallawaySantAnna(base_period="universal")`` so the advertised
auto-diagnostics path runs sensitivity instead of being skipped.
``docs/methodology/REPORTING.md`` pretrends-power routing text now
matches the implemented applicability matrix ({MultiPeriod, CS, SA})
rather than claiming general "event-study with vcov" applicability.
- Regressions:
``TestDCDHParallelTrendsViaPlaceboEventStudy`` (two cases: runs when
``placebo_event_study`` populated, skips when missing) and
``TestHeterogeneityPostTreatmentOnly`` (extractor returns post-only
scalars for MultiPeriod; event-study filter drops pre-period and
reference-marker rows).
115 -> 109 targeted tests passing; black / ruff / mypy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber). Overall Assessment: ⛔ Blocker. Static review only; I could not execute the new tests locally. [Executive Summary, Methodology, and Code Quality sections collapsed.]
Performance: no findings. Maintainability: no independent findings beyond the schema-label issue above. Tech Debt: no independent findings; the blocker above is not tracked under TODO.md:L51, and it would not be mitigated by TODO tracking under the review policy. Security: no findings. [Documentation/Tests and Path to Approval sections collapsed.]
|
- P0 alpha override: ``BusinessReport._extract_headline`` no longer recomputes a normal-z CI when the caller's ``alpha`` differs from the fit's native alpha. Recomputing via ``safe_inference(att, se, alpha)`` silently swapped t-based inference (``DifferenceInDifferences`` via ``LinearRegression.get_inference()``, ``MultiPeriodDiD`` via ``safe_inference(..., df=df)``, TROP's ``df_trop``) for a normal-z CI, and could invent a finite CI on undefined-df (replicate-survey ``df_survey=0``) fits whose native inference was NaN. The fitted CI is now preserved at its native level on any alpha mismatch; a ``display_alpha`` / ``phrasing_alpha`` split keeps the CI level at the fit's native level while significance phrasing (``is_significant``, ``near_significance_threshold``) uses the caller's requested alpha. The inference label in the override caveat now distinguishes bootstrap, wild bootstrap, jackknife / placebo, finite-df survey, undefined-df replicate, and ordinary analytical (native degrees of freedom).
- P2 schema mislabel: the ``DiagnosticReport`` pretrends-power section renames ``power_at_M_1`` to ``power_at_violation_magnitude`` and adds an explicit ``violation_magnitude`` field. The underlying ``PreTrendsPowerResults.power`` is power at ``violation_magnitude`` (which defaults to the MDV when the caller passes ``M=None``), not power at ``M=1.0`` as the prior label implied.
- Test updates: the round-2 ``test_ci_bounds_recomputed_when_alpha_differs_from_result`` assumed recomputation was the correct behavior; renamed to ``test_alpha_mismatch_preserves_fitted_ci_at_native_level`` with the bounds expectations inverted. ``test_alpha_drives_ci_level`` narrowed to the equal-alpha case. ``test_n_bootstrap_zero_does_not_trigger_preserve_path`` replaced by ``test_n_bootstrap_zero_still_preserves_on_alpha_mismatch``. New ``TestAnalyticalFiniteDfAlphaOverride`` pins the P0 fix on real ``DifferenceInDifferences`` / ``MultiPeriodDiD`` fits and on a ``df_survey=0`` stub (NaN CI must propagate).
112 targeted tests passing; black / ruff / mypy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber). Overall Assessment: ⛔ Blocker. Static review only; I could not execute the new tests in this sandbox. [Executive Summary, Methodology, Code Quality, Performance, Maintainability, Tech Debt, Security, Documentation/Tests, and Path to Approval sections collapsed.]
|
Round 8 flagged one legitimate P1 (method-aware PT prose + EfficientDiD assumption). The round-8 review's P0 alpha override, P2 power_at_M_1 rename, and P2 analytical-fit regression findings are regressions on its own round-7 assessment; all three were addressed in commit d8fa66d and remain in place at HEAD: ``safe_inference(att, se, alpha)`` is gone from ``_extract_headline``, ``power_at_M_1`` is renamed to ``power_at_violation_magnitude`` with ``violation_magnitude`` exposed, and ``test_n_bootstrap_zero_does_not_trigger_preserve_path`` was replaced by ``test_n_bootstrap_zero_still_preserves_on_alpha_mismatch`` plus the new ``TestAnalyticalFiniteDfAlphaOverride`` suite with real ``DiDResults`` / ``MultiPeriodDiD`` fits and an undefined-d.f. replicate-survey stub. The static reviewer appears to have described the pre-round-7 state of those paths; grep confirms the fixes are present at this SHA.
Legitimate round-8 P1 fix:
- ``BusinessReport._describe_assumption`` now has an ``EfficientDiDResults``-specific branch that reads ``results.pt_assumption`` (``"all"`` vs ``"post"``) and, when present, ``results.control_group``. PT-All surfaces Lemma 2.1 / over-identification + Hausman pretest language; PT-Post surfaces Corollary 3.2 / just-identified single-baseline DiD language. EfficientDiD is pulled out of the generic group-time ATT frozenset per REGISTRY.md §EfficientDiD lines 736-738 and 907.
- BR summary and DR ``_render_overall_interpretation`` PT sentences now branch on ``parallel_trends.method``. New ``_pt_method_subject`` / ``_pt_subject_phrase`` helpers return source-faithful subjects ("The pre-period slope-difference test" for ``slope_difference``, "The Hausman PT-All vs PT-Post pretest" for ``hausman``, "Pre-treatment event-study coefficients" for Wald / Bonferroni paths, "The synthetic-control pre-treatment fit" for SDiD, "The factor-model pre-treatment fit" for TROP). A matching ``_pt_method_stat_label`` emits ``joint p`` vs ``p`` so single-statistic tests (slope_difference, hausman) are not mis-labeled with a joint-Wald style statistic label.
- Regressions: ``TestEfficientDiDAssumptionPtAllPtPost`` (three cases: PT-All, PT-Post, control_group passthrough) and ``TestMethodAwarePTProse`` (BR slope-difference wording on a crafted schema; DR hausman wording on a real ``EfficientDiD`` fit). New ``edid_fit`` fixture added to ``tests/test_business_report.py``.
117 targeted tests passing; black / ruff / mypy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
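The subject and statistic-label lookups can be sketched standalone; the subject strings are quoted from the commit message, but the method-key spellings other than ``slope_difference`` and ``hausman`` are assumptions:

```python
def pt_method_subject(method):
    """Map a parallel-trends method key to a source-faithful sentence subject."""
    subjects = {
        "slope_difference": "The pre-period slope-difference test",
        "hausman": "The Hausman PT-All vs PT-Post pretest",
        "joint_wald_event_study": "Pre-treatment event-study coefficients",
        "bonferroni": "Pre-treatment event-study coefficients",
    }
    return subjects.get(method, "The pre-trends check")

def pt_method_stat_label(method):
    """Single-statistic tests get "p"; joint tests get "joint p"."""
    return "p" if method in {"slope_difference", "hausman"} else "joint p"
```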
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber). Overall Assessment: ⛔ Blocker. Static review only; I could not execute the new tests in this sandbox. [Executive Summary, Methodology, Code Quality, Performance, Maintainability, Tech Debt, Security, Documentation/Tests, and Path to Approval sections collapsed.]
|
Round 9 flagged one legitimate new finding (Hausman propagation). The other four findings are regressions on the reviewer's own round-7 and round-8 assessments; grep at HEAD confirms all four were fixed in earlier commits and remain in place:
- P0 alpha override: ``safe_inference(att, se, alpha)`` was removed from ``BusinessReport._extract_headline`` in round 7 (commit d8fa66d). The only remaining references are in the explanatory comment that documents why we preserve rather than recompute. ``grep safe_inference diff_diff/business_report.py`` returns only the comment lines.
- P1 EfficientDiD assumption + PT prose: addressed in round 8 (commit 7b5c0ad). ``_describe_assumption`` has an EfficientDiD-specific branch reading ``pt_assumption`` (PT-All vs PT-Post per Corollary 3.2 / Lemma 2.1). BR and DR summary prose branch on ``parallel_trends.method`` via the new ``_pt_method_subject`` / ``_pt_subject_phrase`` helpers, so the 2x2 slope-difference and Hausman paths get source-correct subjects.
- P2 ``power_at_M_1`` rename: addressed in round 7. Grep confirms the field is ``power_at_violation_magnitude`` with ``violation_magnitude`` exposed; only the explanatory comment references the old name.
- P2 test regressions: ``test_ci_bounds_recomputed_when_alpha_differs_from_result`` and ``test_n_bootstrap_zero_does_not_trigger_preserve_path`` were removed in round 7; the new ``TestAnalyticalFiniteDfAlphaOverride`` suite locks in the preserve-always behavior on real ``DiDResults`` / ``MultiPeriodDiD`` fits plus an undefined-d.f. replicate-survey stub.
Legitimate round-9 P1 fix:
- ``DiagnosticReport._pt_hausman`` now reads ``control_group`` and ``anticipation`` from the fitted result and forwards them to ``EfficientDiD.hausman_pretest``. The prior code passed only ``data/outcome/unit/time/first_treat/alpha``, so a fit that used ``control_group='last_cohort'`` or a non-zero ``anticipation`` was silently diagnosed under the default design rather than its own. New ``TestHausmanPretestPropagatesFitDesign`` regression uses ``unittest.mock.patch`` to verify both fields propagate.
118 targeted tests passing; black / ruff / mypy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber). Overall Assessment: ✅ Looks good; the prior P1s appear resolved. Static review only: this sandbox does not have [review truncated]. [Executive Summary, Methodology, Code Quality, Performance, Maintainability, Tech Debt, Security, and Documentation/Tests sections collapsed.]
|
P2 methodology (StaggeredTripleDiff fixed-control prose incomplete).
Round-37 moved StaggeredTripleDiff's fixed ``control_group="never_
treated"`` schema to ``n_never_enabled`` (REGISTRY.md line 1730
names the never-enabled cohort as the valid fixed comparison) and
cleared the composite ``n_control_units`` total from ``n_control``.
The renderers, however, only surface ``n_never_enabled`` inside the
``is_dynamic_control`` branch — so the fixed ``never_treated`` path
fell through to the generic ``Sample: N observations.`` sentence
and the full report omitted the fixed comparison cohort entirely.
Added dedicated fixed-never-enabled branches to both renderers:
* ``_render_summary`` emits ``Sample: N observations (N_t treated,
N_ne never-enabled).`` when the estimator is
``StaggeredTripleDiffResults``, the dynamic branch is not
active, and ``n_never_enabled > 0``;
* ``_render_full_report`` emits a dedicated bullet
``- Never-enabled units (fixed comparison cohort): N_ne`` under
the same condition.
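The renderer branch described above can be sketched as a standalone sentence builder. The helper name and plain-argument signature are illustrative; the real renderers work off the report's schema dict.

```python
def render_sample_sentence(estimator: str, is_dynamic_control: bool,
                           n_obs: int, n_treated: int,
                           n_never_enabled: int) -> str:
    """Sketch of the fixed never-enabled branch in the summary renderer."""
    if (estimator == "StaggeredTripleDiffResults"
            and not is_dynamic_control
            and n_never_enabled > 0):
        return (f"Sample: {n_obs} observations ({n_treated} treated, "
                f"{n_never_enabled} never-enabled).")
    # Generic fallback the fixed never_treated path used to hit.
    return f"Sample: {n_obs} observations."
```

With the counts from the regression tests, the fixed path now names the never-enabled comparison cohort instead of falling through to the generic sentence.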
P3 coverage. Round-37 regression only asserted absence of the
wrong ``500 control`` wording; it did not positively assert the
valid never-enabled comparison cohort appeared in rendered prose,
which is why the P2 above slipped through. Regressions extended:
* ``test_never_treated_mode_summary_renders_never_enabled_count``
asserts ``300 never-enabled`` appears in ``summary()`` AND the
generic fallback ``Sample: 800 observations.`` does not fire;
* new ``test_never_treated_mode_full_report_renders_never_enabled_count``
asserts the sample section of ``full_report()`` names
``never-enabled`` and the ``300`` count while omitting any
bare ``- Control: 500`` line.
278 BR / DR / practitioner tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

🔁 AI review rerun (requested by @igerber). Overall assessment: ✅ Looks good; the prior re-review issue around fixed-control prose appears resolved. Methodology: no findings; the methodology-sensitive changes reviewed here are consistent with the registry. Code Quality, Performance, Tech Debt, Security: no findings.
P3 maintainability (inconclusive-PT provenance dropped at the lift).
DR's ``_pt_event_study`` inconclusive branch emits
``n_dropped_undefined`` (count of real pre-periods rejected for
undefined inference) and a detailed ``reason`` tying the skip to
the ``safe_inference`` contract. BR's ``_lift_pre_trends`` was
dropping both fields at the lift boundary, so BR's own summary
renderer — which is already written to quote the count when
available — never saw it. Preserve both fields on the BR
``pre_trends`` block so the schema carries the same provenance DR
does, and summary prose can render "1 pre-period row had undefined
inference" instead of the generic "pre-period inference was
undefined" fallback.
P3 docs drift. Round-35 added ``band_label="improves_precision"``
for ``deff < 0.95`` and code/tests exercise that enum value, but
``REPORTING.md`` still described only the four-band table
(``trivial`` / ``slightly_reduces`` / ``materially_reduces`` /
``large_warning``) and ``_check_design_effect``'s docstring listed
the same old table. Updated both surfaces to enumerate the full
five-value enum with explicit threshold rules and the intuition
for the precision-improving band.
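The five-value enum can be sketched as a half-open-interval classifier. The 0.95 and 1.05 cut points are the documented ones; the upper cut points used here (1.5 and 3.0) are illustrative placeholders, not the values REPORTING.md actually specifies.

```python
def deff_band(deff: float) -> str:
    """Classify a design effect into the five-band enum (half-open bins)."""
    if deff < 0.95:
        return "improves_precision"   # weighting tightened the CI
    if deff < 1.05:
        return "trivial"              # 0.95 <= deff < 1.05, half-open
    if deff < 1.5:                    # illustrative upper bound
        return "slightly_reduces"
    if deff < 3.0:                    # illustrative upper bound
        return "materially_reduces"
    return "large_warning"
```

The half-open bins make each boundary land in exactly one band, which is the property the later boundary fix relies on.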
Tests: 1 new regression.
* ``TestInconclusivePTProvenancePreservedOnBRSchema``: NaN-p
``StackedDiDResults`` fit (Bonferroni-only surface) lifts
through to BR with ``n_dropped_undefined==1`` on the BR
``pre_trends`` block, a populated ``reason``, and the summary
renderer quotes "1 pre-period row had undefined inference".
All BR / DR / practitioner tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

🔁 AI review rerun (requested by @igerber). Code Quality: no findings; no new inline-inference / partial-NaN-guard anti-patterns in the changed files. Performance, Maintainability, Security: no findings. Tech Debt: no separate findings.
Round-40 landed two P1 methodology findings on the reporting layer, both instances of the same silent-failure class: survey-backed fits routing through diagnostic helpers that don't accept ``survey_design``, silently emitting unweighted results for a weighted estimate.

Survey-design threading (``diagnostic_report.py``, ``business_report.py``):
- ``DiagnosticReport`` and ``BusinessReport`` now accept ``survey_design=<SurveyDesign>``. BR forwards to the auto-constructed DR; DR threads through ``bacon_decompose(survey_design=...)``.
- When ``results.survey_metadata`` is set but ``survey_design`` is not supplied, Bacon and the simple 2x2 parallel-trends helper skip with an explicit reason instead of replaying an unweighted decomposition / verdict for a design that does not match the estimate. Precomputed passthroughs remain honored.
- Defense-in-depth skips added at the runner level (``_check_bacon`` / ``_pt_two_x_two``) in case future callers bypass the applicability gate.

Docs:
- ``REPORTING.md`` documents the new ``survey_design`` contract and the skip-with-reason policy alongside the existing finite-df PT note.
- ``business_report.rst`` and ``diagnostic_report.rst`` surface the kwarg with a pointer to the methodology file.

Tests: 7 new regressions (4 DR + 3 BR) covering PT skip with reason on survey-backed DiDResults, precomputed PT override honored, Bacon skip without survey_design, and survey_design forwarded via ``unittest.mock.patch``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
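The skip-with-reason gate can be sketched as follows; the function name and the plain-argument interface are illustrative stand-ins for the applicability gate on the result object.

```python
def bacon_applicability(survey_metadata, survey_design):
    """Decide whether the Goodman-Bacon replay may run.

    A survey-backed fit without an explicit design must skip with a
    reason rather than silently replay an unweighted decomposition.
    """
    if survey_metadata is not None and survey_design is None:
        return ("skipped",
                "fit is survey-weighted but no survey_design was supplied; "
                "an unweighted Bacon replay would not match the estimate")
    return ("ran", None)
```

The same shape guards the runner level, so a caller that bypasses the gate still hits the skip.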
/ai-review

🔁 AI review rerun (requested by @igerber). Overall assessment: ✅ Looks good; no unmitigated P0 or P1 findings. Methodology: no findings against the documented reporting-layer deviations. Code Quality, Performance, Maintainability, Tech Debt, Security: no findings.
Round-41 landed a single P3 finding (PR overall assessment ✅): the
survey-threading docs and constructor docstrings overstated what
``survey_design`` unlocks. Supplying the design enables fit-faithful
Goodman-Bacon replay, but the simple 2x2 parallel-trends helper
(``utils.check_parallel_trends``) has no survey-aware variant, so
survey-backed ``DiDResults`` PT is skipped unconditionally regardless
of ``survey_design`` — the helper cannot consume the design even when
it is available.
Docs:
- ``REPORTING.md`` now separates the Bacon replay contract (where
``survey_design`` is load-bearing) from the 2x2 PT contract
(skip-only; use ``precomputed={'parallel_trends': ...}`` to opt in).
- ``business_report.rst`` / ``diagnostic_report.rst`` mirror the
split and point the reader at the precomputed-PT opt-in.
- Constructor docstrings on both classes clarify the scope.
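The asymmetry can be sketched as a second gate next to the Bacon one (names illustrative): ``survey_design`` is load-bearing for the Bacon replay but irrelevant to the 2x2 PT helper, which only a precomputed result can unlock.

```python
def pt_2x2_status(survey_metadata, survey_design, precomputed):
    """2x2 parallel-trends gate: survey-backed fits skip unconditionally
    unless the user supplies a precomputed PT result."""
    if precomputed and "parallel_trends" in precomputed:
        return "ran"  # user-supplied result is honored as-is
    if survey_metadata is not None:
        # survey_design does NOT help here: the simple helper
        # (utils.check_parallel_trends) has no survey-aware variant.
        return "skipped"
    return "ran"
```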
Tests: added
``test_survey_backed_did_skips_2x2_pt_even_when_survey_design_supplied``
which passes both ``survey_metadata`` AND ``survey_design`` on a
``DiDResults`` stub and asserts PT still skips with a reason naming
the precomputed-PT opt-in (not ``survey_design``).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

🔁 AI review rerun (requested by @igerber).
Round-42 landed two P1 findings:

1. All-undefined pre-period surface routed to ``skipped`` instead of ``inconclusive`` (``diagnostic_report.py``). When every pre-row is dropped by ``_collect_pre_period_coefs`` for undefined inference (all ``se <= 0`` / non-finite effect/se), the collector returns ``([], n_dropped_undefined > 0)``. Both the applicability gate and ``_pt_event_study`` treated that as "no coefficients available" and skipped, letting BR drop the identifying-assumption warning. Fixed both sites to detect the all-undefined case and route to the explicit ``method="inconclusive"`` runner alongside the partial-undefined case already covered by R33. BR's existing inconclusive phrasing lifts through unchanged.

2. Source-faithful assumption text for ``ImputationDiDResults`` and ``TwoStageDiDResults`` (``business_report.py``). BR's ``_describe_assumption`` was grouping both with CS / SA / Wooldridge under the generic "parallel trends across treatment cohorts and time periods (group-time ATT)" template, but BJS (2024) and Gardner (2022) both identify through an untreated-potential-outcome model: unit+time FE fitted on untreated observations (``Omega_0`` = never-treated + not-yet-treated) deliver the counterfactual, and the identifying restriction is on ``E[Y_it(0)] = alpha_i + beta_t`` — not on cohort-time ATT equality. Split each into its own branch mirroring REGISTRY.md §ImputationDiD (lines 1000-1013) and §TwoStageDiD (lines 1113-1128), including the Gardner-BJS algebraic-equivalence note.

Tests: 3 new regressions.
- ``test_all_pre_periods_undefined_yields_inconclusive_not_skipped``: all pre-rows with ``se == 0``; asserts DR emits ``method="inconclusive"`` / ``status="ran"`` / ``n_pre_periods=0`` / ``n_dropped_undefined=2``, and BR summary emits "inconclusive".
- ``test_imputation_did_assumption_uses_untreated_fe_model`` and ``test_two_stage_did_assumption_uses_untreated_fe_model``: lock the new ``parallel_trends_variant="untreated_outcome_fe_model"`` tag, require the registry-backed source attribution and untreated-subset detail, and reject the pre-R42 generic-PT template.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
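The routing fix can be sketched as follows; the helper names mirror the commit's, but the plain-tuple interface is illustrative.

```python
import math

def collect_pre_period_coefs(rows):
    """Keep pre-period rows with defined inference; count the drops."""
    kept, dropped = [], 0
    for effect, se in rows:
        if not (math.isfinite(effect) and math.isfinite(se)) or se <= 0:
            dropped += 1  # undefined inference: se <= 0 or non-finite
        else:
            kept.append((effect, se))
    return kept, dropped

def pt_route(rows):
    """'skipped' only when there are truly no pre-rows; an all-undefined
    surface routes to the explicit inconclusive runner instead."""
    kept, dropped = collect_pre_period_coefs(rows)
    if not kept:
        return "inconclusive" if dropped > 0 else "skipped"
    return "event_study"
```

The pre-fix behavior collapsed both empty-``kept`` cases into ``skipped``, which is exactly what let BR drop the warning.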
/ai-review

🔁 AI review rerun (requested by @igerber). Overall assessment: ✅ Looks good; no unmitigated P0/P1 findings in the merged diff.
Round-43 assessment was ✅ with two P2 findings; both are boundary /
contract mismatches rather than methodology defects.
1. DEFF==1.05 boundary inconsistency. REPORTING.md defines the
``trivial`` band as ``0.95 <= deff < 1.05`` (half-open) and
``slightly_reduces`` as starting at ``1.05``. The ``is_trivial``
flag in both DR's ``_check_design_effect`` and BR's sample-block
copy used ``<= 1.05`` (closed), so exactly ``deff == 1.05`` landed
in the ``slightly_reduces`` band AND was flagged ``is_trivial=True``
— internally inconsistent, and the flag suppressed the non-trivial
prose the documented threshold says should fire. Aligned both
``is_trivial`` bounds with REPORTING.md's half-open interval.
2. ``BusinessReport`` did not accept the ``precomputed=`` dict that
its docstring and API docs advertised as the opt-in path for
survey-aware 2x2 PT (``precomputed={'parallel_trends': ...}``) and
other escape hatches. The auto path only synthesized ``{"sensitivity":
honest_did_results}``, so a user following the BR docs hit a
``TypeError`` on ``__init__``. Added ``precomputed=`` kwarg to
``BusinessReport``, eager key validation mirroring DR's set (keys:
``parallel_trends`` / ``sensitivity`` / ``pretrends_power`` /
``bacon``), and forwarded the merged dict to the auto-constructed
DR. ``honest_did_results`` remains a shorthand for ``sensitivity``;
explicit ``precomputed['sensitivity']`` wins on conflict.
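The validation and precedence rule can be sketched as a hypothetical helper; the real kwarg handling lives in ``BusinessReport.__init__``.

```python
_ALLOWED_KEYS = {"parallel_trends", "sensitivity", "pretrends_power", "bacon"}

def merge_precomputed(precomputed=None, honest_did_results=None):
    """Eagerly validate precomputed keys; honest_did_results is a
    shorthand for 'sensitivity' and loses on conflict."""
    merged = dict(precomputed or {})
    unknown = set(merged) - _ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown precomputed keys: {sorted(unknown)}")
    if honest_did_results is not None and "sensitivity" not in merged:
        merged["sensitivity"] = honest_did_results
    return merged
```

Eager validation means a typo fails at construction rather than surfacing later as a silently ignored key.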
Tests: 5 new regressions.
- ``test_deff_exactly_1_05_is_slightly_reduces_not_trivial`` +
``test_deff_just_under_1_05_is_trivial`` cover the exact-boundary
+ adjacent-point schema behavior across DR and BR.
- ``TestBusinessReportPrecomputedPassthrough`` covers the happy path
(PT precomputed unlocks survey-backed 2x2), eager key validation
(unknown key raises ``ValueError`` at BR construction), and the
explicit-vs-shorthand precedence rule.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

🔁 AI review rerun (requested by @igerber). Code Quality, Performance, Maintainability, Tech Debt, Security: no findings.
Round-44 landed one P1 methodology finding: ``precomputed['sensitivity']`` (and BR's ``honest_did_results`` shorthand that merges into it) bypassed the varying-base ``CallawaySantAnna`` guard. The applicability gate's precomputed early-return fired before the ``base_period != 'universal'`` check, so a displayed CS fit with ``base_period='varying'`` would get its sensitivity section unlocked and the Rambachan-Roth bounds narrated as ordinary robustness — even though HonestDiD explicitly warns those bounds are not valid for interpretation on the consecutive-comparison pre-period surface ``varying`` base produces (REGISTRY.md §CallawaySantAnna line 410, §HonestDiD line 2458). Narrating the bounds alongside the displayed varying-base fit mixes provenance the bounds do not support, which is the silent-failure pattern the varying-base auto-path skip was designed to prevent.

Fixes:
- ``diagnostic_report.py`` ``__init__``: raise ``ValueError`` when ``precomputed['sensitivity']`` is supplied on ``CallawaySantAnnaResults`` with ``base_period != 'universal'``, mirroring the existing SDiD/TROP rejection pattern for methodology-incompatible passthroughs.
- ``diagnostic_report.py`` ``_instance_skip_reason``: reorder the sensitivity gate so the CS varying-base check fires BEFORE the precomputed early-return (defense-in-depth behind the ``__init__`` raise; also protects against callers that mutate ``_precomputed`` post-construction).
- ``business_report.py`` ``__init__``: raise on the same interaction when either ``honest_did_results`` or ``precomputed['sensitivity']`` is supplied (or both — the error names each rejected input).

Tests: 5 new regressions in ``TestCSVaryingBaseSensitivityRejectsPrecomputed`` covering both DR and BR, both passthrough paths, the union-error case, and the universal-base positive case (supported and not rejected).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
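The constructor guard can be sketched as follows; the plain string arguments stand in for the fitted result object, and the error text paraphrases rather than quotes the library's message.

```python
def validate_sensitivity_passthrough(estimator: str, base_period: str,
                                     precomputed) -> None:
    """Reject precomputed HonestDiD sensitivity on a varying-base CS fit."""
    if (estimator == "CallawaySantAnnaResults"
            and base_period != "universal"
            and "sensitivity" in (precomputed or {})):
        raise ValueError(
            "precomputed['sensitivity'] is not interpretable for a "
            "CallawaySantAnna fit with base_period != 'universal'; "
            "refit with base_period='universal' to narrate HonestDiD bounds"
        )
```

Running the same check again inside the applicability gate, before the precomputed early-return, is the defense-in-depth half of the fix.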
/ai-review

🔁 AI review rerun (requested by @igerber). Code Quality, Performance, Tech Debt, Security: no findings. Maintainability: no findings beyond the methodology issue above. Documentation/Tests: no separate findings, but this PR should add a regression for the P1 above covering a non-TWFE fit.
Round-45 landed one P1 methodology finding: ``BusinessReport`` emitted
the Bacon "re-estimate with a heterogeneity-robust estimator (CS / SA
/ BJS / Gardner)" caveat on every fit whose Bacon block had
``forbidden_weight > 0.10``, including fits that were already produced
by one of those robust estimators.
Goodman-Bacon is explicitly a decomposition of TWFE weights
(``bacon.py`` header, Goodman-Bacon 2021). On a displayed fit that is
already heterogeneity-robust (CS / SA / BJS / Gardner / Wooldridge /
EfficientDiD / Stacked / dCDH / TripleDifference /
StaggeredTripleDiff / SDiD / TROP), a high forbidden-weight share is a
statement about what TWFE WOULD have done on this rollout, not a claim
that the displayed estimator needs replacement. DR partly preserved
this in its prose with an "if not already in use" guard; BR dropped
that distinction and rendered the stronger recommendation in
stakeholder-facing caveats / full reports.
Fix (``business_report.py`` ``_build_caveats``):
- Introduce ``_TWFE_STYLE_RESULTS = {DiDResults, MultiPeriodDiDResults,
TwoWayFixedEffectsResults}`` — the fits for which the switch-to-
robust recommendation is load-bearing.
- Keep the original message for TWFE-style fits.
- Rephrase for already-robust fits: "TWFE benchmark would be
materially biased on this rollout; the displayed estimator is
already heterogeneity-robust, so this is a statement about the
rollout design (avoid reporting TWFE alongside this fit), not about
the current result's validity."
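The estimator-aware caveat selection can be sketched as set membership on result-type names; the message strings here paraphrase the commit rather than quote the library's prose.

```python
_TWFE_STYLE_RESULTS = {
    "DiDResults", "MultiPeriodDiDResults", "TwoWayFixedEffectsResults",
}

def bacon_caveat(result_type: str, forbidden_weight: float):
    """High forbidden weight: recommend switching only for TWFE-style fits."""
    if forbidden_weight <= 0.10:
        return None
    if result_type in _TWFE_STYLE_RESULTS:
        # The switch recommendation is load-bearing for these fits.
        return ("Re-estimate with a heterogeneity-robust estimator "
                "(CS / SA / BJS / Gardner).")
    # Already-robust fits get a statement about the rollout design instead.
    return ("A TWFE benchmark would be materially biased on this rollout; "
            "the displayed estimator is already heterogeneity-robust, so "
            "avoid reporting TWFE alongside this fit.")
```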
Tests: 3 new regressions in ``TestBaconCaveatEstimatorAware``:
- CS-like fit with high forbidden-weight does NOT recommend switching.
- Spot-check the same rule across SA / Imputation / TwoStage / Stacked
/ Wooldridge / dCDH / EfficientDiD.
- ``MultiPeriodDiDResults`` (TWFE event-study) DOES keep the switch
recommendation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

🔁 AI review rerun (requested by @igerber). Overall assessment: ✅ Looks good; the previous P1 from the last AI review appears resolved, and no new unmitigated P0/P1 found in the changed diff.
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per project SemVer convention, minor bumps are reserved for new estimators or new module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults (PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview): practitioner-ready output layer. Plain-English narrative summaries across all 16 result types, with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation: kernel infrastructure, local linear regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion: cell-period IF allocator (Class A contract), heterogeneity + within-group-varying PSU under Binder TSL, and PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review: docs/performance-scenarios.md documents 5-7 realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339) continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default CI after false-positive flakes; the perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml, diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released: 2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the comparison-link footer. CITATION.cff retains the v3.1.3 versioned DOI in identifiers; the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
`BusinessReport` and `DiagnosticReport` as an experimental preview foundation: plain-English narrative output + a stable `to_dict()` schema, dispatching across all 16 result types. Core promise is the schema shape and estimator-native routing; the narrative prose will continue to evolve. `DiagnosticReport` orchestrates seven diagnostic checks (parallel trends, pre-trends power, HonestDiD sensitivity, Goodman-Bacon, heterogeneity, design-effect, EPV) and routes SyntheticDiD (pre-treatment fit + in-time placebo + zeta sensitivity), EfficientDiD (`hausman_pretest`), and TROP (factor-model fit) through their native validation surfaces instead of the generic event-study path. `BusinessReport` auto-constructs `DiagnosticReport` by default so `.summary()` mentions pre-trends, robustness, and design-effect findings in one call. Power-aware phrasing (via `compute_pretrends_power`) tiers the `no_detected_violation` language into well-powered / moderately-powered / underpowered so a well-designed study is not under-sold.

Methodology references (required if estimator / math changes)
- `compute_pretrends_power` (Roth 2022), `HonestDiD.sensitivity` (Rambachan & Roth 2023), `bacon_decompose` (Goodman-Bacon 2021), `check_parallel_trends` + `check_parallel_trends_robust`, `compute_deff_diagnostics`, estimator-native `hausman_pretest` (Chen, Sant'Anna & Xie 2025), SyntheticDiD `pre_treatment_fit` / `in_time_placebo` / `sensitivity_to_zeta_omega` / `get_weight_concentration`.
- Design deviations are recorded under the `- **Note:**` label in `docs/methodology/REPORTING.md`; e.g. `DiagnosticReport` does not call `check_parallel_trends` on event-study or staggered result objects (it uses joint Wald / Bonferroni on pre-period coefficients instead).

Validation
- `tests/test_business_report.py` (32 tests covering schema contract, BR/DR integration, `honest_did_results=` passthrough, unit labels, log-points caveat, significance-chasing guard, pre-trends verdict bins, power-aware phrasing, NaN ATT, appendix toggle, `BaconDecompositionResults` `TypeError`, survey-metadata passthrough, single-knob alpha, `outcome_direction` verb selection, error-provenance passthrough).
- `tests/test_diagnostic_report.py` (37 tests covering applicability matrix across result types, schema contract + JSON round-trip, `precomputed=` passthrough, joint-Wald vs Bonferroni alignment with a closed-form χ² check, Bonferroni fallback on missing / misaligned `interaction_indices` and missing `vcov`, verdict / power-tier boundaries, EfficientDiD hausman pathway, SDiD native diagnostics, error-doesn't-break-report).
- `tests/test_guides.py` still passes (UTF-8 fingerprint preserved in `llms-full.txt`).
- `black`, `ruff`, `mypy` clean on the two new modules.
- `CallawaySantAnna(aggregate="event_study")` produces a readable strategy-doc-style summary; `.to_dict()` round-trips through `json.dumps`.

Security / privacy
diff_diff/sources).Docs
- `docs/methodology/REPORTING.md` records all methodology deviations under `- **Note:**` labels. `docs/methodology/REGISTRY.md` cross-links to it.
- `docs/api/business_report.rst` and `docs/api/diagnostic_report.rst` registered in a new "Reporting" section in `docs/api/index.rst`. `docs/doc-deps.yaml` tracks both modules.
- `CHANGELOG.md` entry in `[Unreleased]` is marked experimental. `ROADMAP.md` moves BR/DR to Recently Shipped and splits the bundled bullet so context-aware `practitioner_next_steps()` remains queued as its own follow-up.
- `diff_diff/guides/llms-full.txt` documents both public APIs and schemas; `llms-practitioner.txt` notes that DR covers Baker steps 3/6/7 in one call.

Known follow-ups (not in this PR)
- (`placebo` key; not yet implemented).

These gaps are why the schema is experimental and why tutorials / external positioning do not route practitioners through BR/DR yet.
🤖 Generated with Claude Code