
Close BR/DR gap #4: canonical-dataset regression guards + wording fixes#341

Merged
igerber merged 3 commits into main from br-dr-canonical-validation
Apr 20, 2026

Conversation

@igerber
Owner

@igerber igerber commented Apr 19, 2026

Summary

  • Adds tests/test_br_dr_canonical_datasets.py — pytest regression guards that run BusinessReport on Card-Krueger (1994), mpdta (Callaway-Sant'Anna 2021 benchmark), and Castle Doctrine (Cheng-Hoekstra 2013 under both CS and SA). Assertions are property-level (direction of ATT, pre-trends verdict bin, HonestDiD breakdown-M tier, cross-estimator consistency), not exact-match, so they survive small data-aggregation drift between the bundled dataset and the published author sample.
  • Fixes two prose bugs surfaced by running BR on real data:
    • str.capitalize() lowercased every character after the first, flattening embedded abbreviations ("the NJ minimum-wage increase" -> "The nj minimum-wage increase") and proper-noun phrases ("Castle Doctrine law adoption" -> "Castle doctrine law adoption"). Replaced with a _sentence_first_upper helper that preserves user-supplied casing.
    • The HonestDiD fragile sentence rendered breakdown_M == 0 as "violations reach 0x the pre-period variation", a degenerate zero-times-anything construction. At breakdown_M <= 0.05, both BR's summary and DR's overall-interpretation now say "includes zero even at the smallest parallel-trends violations on the sensitivity grid". Cross-surface parity: the render helpers were fixed in both BR and DR.
  • Closes BR/DR foundation gap #4 (real-dataset validation) from the external-positioning gap list.

Uses the _construct_* fallback data from diff_diff.datasets so the regression tests have no network dependency (same pattern tests/test_datasets.py uses).

Methodology references (required if estimator / math changes)

  • Method name(s): no new methodology. The changes are reporting-layer prose fixes plus a property-level regression test suite on existing estimators.
  • Paper / source link(s): canonical references embedded in test docstrings — Card & Krueger (1994) AER; Callaway & Sant'Anna (2021) JOE; Cheng & Hoekstra (2013) JHR; Sun & Abraham (2021) JOE. Already cited in REGISTRY.md.
  • Any intentional deviations from the source (and why): None. Sensitivity tier thresholds (breakdown_M > 1.0 for robust, < 0.5 for fragile) match the existing BR/DR conventions documented in docs/methodology/REPORTING.md.
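The tier thresholds quoted above can be sketched as a small classifier. The function name and the "moderate" middle label are assumptions for illustration — only the > 1.0 robust and < 0.5 fragile cutoffs are stated in the PR:

```python
def sensitivity_tier(breakdown_m: float) -> str:
    """Map a HonestDiD breakdown M to the BR/DR tier labels.

    Thresholds follow the conventions quoted in the PR
    (breakdown_M > 1.0 robust, < 0.5 fragile); this is a sketch,
    not the library's actual API.
    """
    if breakdown_m > 1.0:
        return "robust"
    if breakdown_m < 0.5:
        return "fragile"
    return "moderate"


print(sensitivity_tier(1.5))   # -> "robust"
print(sensitivity_tier(0.03))  # -> "fragile"
```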

Validation

  • Tests added/updated: tests/test_br_dr_canonical_datasets.py (new, 7 tests); tests/test_business_report.py (+5 tests in TestCanonicalValidationSurfaceFixes pinning the wording fixes).
  • Backtest / simulation / notebook evidence: N/A. The canonical datasets themselves are the validation surface — direction, PT verdict tier, and HonestDiD tier are asserted against published-applied-work interpretations.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

igerber and others added 2 commits April 19, 2026 19:45
Closes BR/DR foundation gap #4 (real-dataset validation) from the
external-positioning gap list in ``project_br_dr_foundation.md``.

Validation artifact:

- ``docs/validation/validate_br_dr_canonical.py`` runs BusinessReport
  / DiagnosticReport on Card-Krueger (1994), mpdta (Callaway-Sant'Anna
  2021 benchmark), and Castle Doctrine (Cheng-Hoekstra 2013 under
  both CS and SA), dumping summary + full_report + selected to_dict
  blocks for each.
- ``docs/validation/br_dr_canonical_validation.md`` is the regenerable
  raw output.
- ``docs/validation/br_dr_canonical_findings.md`` is the hand-written
  synthesis: direction / verdict / sensitivity tier all match canonical
  interpretations, with two small wording bugs surfaced and fixed in
  this PR and two larger gaps queued as follow-up (SA HonestDiD
  applicability, target-parameter disambiguation).

Wording fixes:

1. Treatment-label capitalization. ``str.capitalize()`` lowercased
   every character after the first, flattening embedded abbreviations
   (``"the NJ minimum-wage increase"`` → ``"The nj minimum-wage
   increase"``) and proper-noun phrases (``"Castle Doctrine law
   adoption"`` → ``"Castle doctrine law adoption"``). Replaced with a
   ``_sentence_first_upper`` helper that preserves user-supplied
   casing.

2. ``breakdown_M == 0`` phrasing. The HonestDiD fragile sentence
   quoted ``{breakdown_M:.2g}x the pre-period variation``, which
   renders as a degenerate ``0x`` on the exact-zero case surfaced by
   Cheng-Hoekstra. At ``breakdown_M <= 0.05`` (covers 0 and near-zero
   values), both BR's summary and DR's overall_interpretation now say
   "includes zero even at the smallest parallel-trends violations on
   the sensitivity grid" instead.

Tests: 5 new regressions in
``TestCanonicalValidationSurfaceFixes`` covering both fixes + three
boundary cases (exact zero, small positive, normal fragile value).

Not in scope: Favara-Imbs (dCDH reversible-treatment dataset not
bundled), ImputationDiD / TwoStageDiD on canonical data (needed to
exercise the R42 untreated-outcome FE assumption branch on real
data), SA HonestDiD applicability gap. All tracked in the findings
doc for follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sion tests

Per review: the validation script + findings doc were one-shot
artifacts that would age poorly. Replace them with
``tests/test_br_dr_canonical_datasets.py`` — pytest regression
guards that assert canonical properties (direction, PT verdict tier,
HonestDiD breakdown_M tier, cross-estimator consistency) on each
canonical fit.

Uses the ``_construct_*`` fallback data from ``diff_diff.datasets``
so tests have no network dependency (same pattern
``test_datasets.py`` already uses).

Tests cover:

- Card-Krueger (1994): positive sign, CI includes zero, "consistent
  with no effect" prose.
- mpdta (CS 2021 benchmark): negative ATT, breakdown_M > 1.0,
  no_detected_violation pre-trends.
- Castle Doctrine (Cheng-Hoekstra 2013) under CS: positive ATT,
  clear_violation pre-trends, fragile sensitivity (breakdown_M < 0.5).
- Castle Doctrine cross-estimator consistency: SA agrees with CS on
  direction and PT verdict bin.
- Treatment-label capitalization bugs: ``NJ`` abbreviation and
  ``Castle Doctrine`` proper noun preserved through BR's sentence
  capitalization.
- ``breakdown_M == 0`` edge case: BR summary uses smallest-grid-point
  wording, not the degenerate ``0x`` multiplier.
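A property-level guard in the style listed above might look like the following sketch. The function name, argument names, and numeric placeholders are illustrative — they are not the project's actual test code or the published estimates:

```python
def assert_mpdta_properties(att: float, ci_upper: float, breakdown_m: float) -> None:
    """Property-level regression guard: asserts signs and tier bins,
    never exact point estimates, so small data-aggregation drift
    between the bundled dataset and the author sample does not
    break the test."""
    assert att < 0, "mpdta benchmark: ATT should be negative"
    assert ci_upper < 0, "mpdta benchmark: CI should exclude zero"
    assert breakdown_m > 1.0, "mpdta benchmark: sensitivity should be robust-tier"


# Placeholder values standing in for a real fit result.
assert_mpdta_properties(att=-0.04, ci_upper=-0.01, breakdown_m=1.5)
print("properties hold")
```

In the real suite these assertions would run against a `BusinessReport` fit on the bundled dataset; the point is that each assertion constrains a direction or a tier, not a decimal.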

Drops:

- ``docs/validation/validate_br_dr_canonical.py`` — one-shot script,
  replaced by the regression tests.
- ``docs/validation/br_dr_canonical_validation.md`` — raw dump,
  regenerable on demand if needed but not checked in.
- ``docs/validation/br_dr_canonical_findings.md`` — summary now
  lives in the regression-test docstrings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment
⚠️ Needs changes

Executive Summary

  • No estimator math, weighting, variance, or identification-assumption logic changes are in scope here; the diff is confined to BR/DR prose and new regression tests.
  • The str.capitalize() replacement in BusinessReport looks correct, and I did not find remaining report-code uses of that pattern.
  • P1: the new near-zero HonestDiD wording overstates what the evaluated sensitivity grid established. It treats a small breakdown_M as proof that the smallest grid point failed, which is not what breakdown_M means.
  • The new regression test explicitly locks in that semantic mismatch by asserting “smallest grid point” wording for breakdown_M=0.03.
  • I could not execute the tests in this environment because pytest and numpy are unavailable, so this is a static diff review.

Methodology

  • P1 Impact: The new BR/DR branch for breakdown_M <= 0.05 silently violates the HonestDiD sensitivity contract. SensitivityResults.breakdown_M is defined as the smallest M where the robust CI includes zero, while M_values is the evaluated grid. HonestDiD also supports grids that start at 0, and the docs show that as a normal usage pattern. On a supported input like M_values=[0, 0.25, 0.5, ...] with breakdown_M=0.03, the CI still excludes zero at the smallest grid point (M=0); it only fails between 0 and 0.25. The new prose nonetheless says the CI “includes zero even at the smallest … violations on the sensitivity grid,” which is stronger than the underlying result supports. That is a semantic-contract violation in the reporting layer for HonestDiD. Locations: diff_diff/business_report.py:L2118-L2131, diff_diff/diagnostic_report.py:L3121-L3136, diff_diff/honest_did.py:L372-L389, diff_diff/honest_did.py:L2551-L2588, diff_diff/honest_did.py:L2604-L2645, docs/api/honest_did.rst:L151-L156, docs/methodology/REPORTING.md:L17-L24.
    Concrete fix: drive this wording from the actual evaluated grid (sens["grid"] / M_values), not from breakdown_M <= 0.05 alone. Only use “smallest … on the sensitivity grid” when the smallest evaluated grid point itself is non-robust. Otherwise use grid-free wording like “fragile to very small nonzero violations (breakdown M≈0.03)” or keep the numeric multiplier.

Code Quality

  • No additional findings. The casing helper itself is a contained fix, and I found no remaining report-code str.capitalize() usage.

Performance

  • No findings. The code-path cost is negligible; only the new tests add runtime.

Maintainability

  • No additional findings beyond the P1 above. The only maintainability issue is that the new near-zero prose is keyed off a hard-coded sentinel instead of the actual grid data already present in the schema.

Tech Debt

  • No TODO.md mitigation applies to the P1 above, and per the review policy it would not be TODO-eligible anyway because it is a silent interpretation bug in user-facing output.

Security

  • No findings.

Documentation/Tests

  • P3 Impact: The added regression coverage reinforces the incorrect breakdown_M=0.03 -> smallest grid point behavior, and there is still no direct regression on the mirrored DR wording path. Locations: tests/test_business_report.py:L4100-L4131, diff_diff/diagnostic_report.py:L3121-L3144.
    Concrete fix: replace the 0.03 test with a precomputed sensitivity case whose grid starts at 0 and assert both BR and DR do not claim “smallest … on the sensitivity grid”; add a separate positive-control case where the smallest evaluated grid point is already non-robust.

Path to Approval

  1. In both BR and DR, derive the near-zero HonestDiD wording from sens["grid"] / M_values, not from breakdown_M <= 0.05 alone.
  2. Add paired regressions for both report surfaces:
    1. a grid starting at 0 with breakdown_M≈0.03, where the smallest grid point is still robust, and
    2. a grid whose smallest evaluated point is already non-robust, where the “smallest … on the sensitivity grid” wording is actually justified.

…valuated grid

R1 caught a semantic bug in the round-1 canonical-validation wording
fix. ``breakdown_M`` is the smallest M at which the robust CI
includes zero — an interpolated threshold between grid points — not
a claim about any specific grid point. Keying the "smallest grid
point fails" wording off ``breakdown_M <= 0.05`` was wrong: on a
grid starting at M=0 where the smallest evaluated point is still
robust (CI excludes zero), a small ``breakdown_M=0.03`` means
fragility emerges BETWEEN grid points, not at M=0.

Fix (both BR and DR):

- Added a ``_smallest_failing_grid_m`` helper (paired helpers in
  ``business_report.py`` and ``diagnostic_report.py``, intentionally
  duplicated with cross-reference comments per the parity rule from
  ``feedback_cross_surface_parity_audit.md``).
- Helper returns the smallest evaluated M on the grid if that point
  has ``robust_to_zero == False``, else ``None``.
- Fragile-sensitivity wording now fires "smallest M evaluated on the
  sensitivity grid (M = X)" ONLY when the helper returns a value;
  otherwise falls through to the numeric multiplier ``{bkd:.2g}x``.
- Castle Doctrine (Cheng-Hoekstra 2013) CS fit: grid starts at
  M=0.5, every point non-robust — new wording quotes "(M = 0.5)"
  instead of "0x the pre-period variation".
- Reviewer's counterexample (grid ``[0, 0.25, ...]`` with bkd=0.03,
  smallest point robust): wording falls through to "0.03x the
  pre-period variation", not "smallest grid point".

Tests:

- Rewrote ``TestCanonicalValidationSurfaceFixes`` on the BR side to
  build sensitivity schemas with explicit grids. Added paired cases:
  (a) smallest M fails, assert "smallest M evaluated"; (b) smallest
  M robust, breakdown 0.03, assert multiplier wording is used.
- Added ``TestDRFragilePhrasingIsGridAware`` on the DR side mirroring
  the same paired cases against ``_render_overall_interpretation``.
- Updated the Castle Doctrine canonical-dataset regression test to
  assert ``"M = 0.5"`` appears (actual smallest evaluated grid point).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 20, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 6e24014c3077052c39f35adcef2e4961f79b2ec2


Overall Assessment
✅ Looks good


Methodology

No findings. The only affected method is HonestDiD reporting-layer narration. The new grid-aware branch in diff_diff/business_report.py:L2142-L2166 and diff_diff/diagnostic_report.py:L3147-L3173 is consistent with the HonestDiD breakdown_M definition and search behavior in diff_diff/honest_did.py:L387-L402 and diff_diff/honest_did.py:L2604-L2645, and I found no undocumented deviation from the HonestDiD registry contract in docs/methodology/REGISTRY.md:L2412-L2461.

Code Quality

No findings. _sentence_first_upper() in diff_diff/business_report.py:L1881-L1896 cleanly fixes the changed headline path, and I did not find remaining capitalize( usage in BR/DR code. The prior semantic bug is now covered on both report surfaces in tests/test_business_report.py:L4117-L4193 and tests/test_diagnostic_report.py:L1894-L2006.

Performance

No findings. The production change is constant-time/reporting-only; the only meaningful runtime cost is the added test coverage.

Maintainability

No findings. The BR/DR fixes are mirrored, and the paired regression tests materially reduce the risk of the two surfaces drifting on this behavior.

Tech Debt

No findings. The prior correctness issue was fixed directly rather than deferred, and I did not identify any new TODO-eligible debt in the changed code.

Security

No findings.

Documentation/Tests

No findings from static review. The new canonical-dataset suite plus the targeted BR/DR phrasing regressions cover the re-review issue directly in tests/test_br_dr_canonical_datasets.py:L145-L361, tests/test_business_report.py:L4117-L4193, and tests/test_diagnostic_report.py:L1894-L2006. I could not run them in this environment because pytest and numpy are unavailable.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 20, 2026
@igerber igerber merged commit 752f2b6 into main Apr 20, 2026
23 of 24 checks passed
@igerber igerber deleted the br-dr-canonical-validation branch April 20, 2026 11:42