
Close BR/DR gap #4: canonical-dataset regression guards + wording fixes#341

Merged
igerber merged 3 commits into main from br-dr-canonical-validation
Apr 20, 2026

Conversation

@igerber
Owner

@igerber igerber commented Apr 19, 2026

Summary

  • Adds tests/test_br_dr_canonical_datasets.py — pytest regression guards that run BusinessReport on Card-Krueger (1994), mpdta (Callaway-Sant'Anna 2021 benchmark), and Castle Doctrine (Cheng-Hoekstra 2013 under both CS and SA). Assertions are property-level (direction of ATT, pre-trends verdict bin, HonestDiD breakdown-M tier, cross-estimator consistency), not exact-match, so they survive small data-aggregation drift between the bundled dataset and the published author sample.
  • Fixes two prose bugs surfaced by running BR on real data:
    • str.capitalize() lowercased every character after the first, flattening embedded abbreviations ("the NJ minimum-wage increase" -> "The nj minimum-wage increase") and proper-noun phrases ("Castle Doctrine law adoption" -> "Castle doctrine law adoption"). Replaced with a _sentence_first_upper helper that preserves user-supplied casing.
    • The HonestDiD fragile sentence rendered breakdown_M == 0 as "violations reach 0x the pre-period variation", a degenerate zero-times-anything construction. At breakdown_M <= 0.05, both BR's summary and DR's overall-interpretation now say "includes zero even at the smallest parallel-trends violations on the sensitivity grid". Cross-surface parity: the render helpers were fixed in both BR and DR.
  • Closes BR/DR foundation gap #4 (real-dataset validation) from the external-positioning gap list.

Uses the _construct_* fallback data from diff_diff.datasets so the regression tests have no network dependency (same pattern tests/test_datasets.py uses).

Methodology references (required if estimator / math changes)

  • Method name(s): no new methodology. The changes are reporting-layer prose fixes plus a property-level regression test suite on existing estimators.
  • Paper / source link(s): canonical references embedded in test docstrings — Card & Krueger (1994) AER; Callaway & Sant'Anna (2021) JOE; Cheng & Hoekstra (2013) JHR; Sun & Abraham (2021) JOE. Already cited in REGISTRY.md.
  • Any intentional deviations from the source (and why): None. Sensitivity tier thresholds (breakdown_M > 1.0 for robust, < 0.5 for fragile) match the existing BR/DR conventions documented in docs/methodology/REPORTING.md.
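The tier thresholds quoted above can be sketched as a small classifier. The function name and the "moderate" middle label are assumptions for illustration — only the > 1.0 robust and < 0.5 fragile cutoffs are stated in the PR:

```python
def sensitivity_tier(breakdown_m: float) -> str:
    """Map a HonestDiD breakdown M to the BR/DR tier labels.

    Thresholds follow the conventions quoted in the PR
    (breakdown_M > 1.0 robust, < 0.5 fragile); this is a sketch,
    not the library's actual API.
    """
    if breakdown_m > 1.0:
        return "robust"
    if breakdown_m < 0.5:
        return "fragile"
    return "moderate"


print(sensitivity_tier(1.5))   # -> "robust"
print(sensitivity_tier(0.03))  # -> "fragile"
```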

Validation

  • Tests added/updated: tests/test_br_dr_canonical_datasets.py (new, 7 tests); tests/test_business_report.py (+5 tests in TestCanonicalValidationSurfaceFixes pinning the wording fixes).
  • Backtest / simulation / notebook evidence: N/A. The canonical datasets themselves are the validation surface — direction, PT verdict tier, and HonestDiD tier are asserted against published-applied-work interpretations.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

igerber and others added 2 commits April 19, 2026 19:45
Closes BR/DR foundation gap #4 (real-dataset validation) from the
external-positioning gap list in ``project_br_dr_foundation.md``.

Validation artifact:

- ``docs/validation/validate_br_dr_canonical.py`` runs BusinessReport
  / DiagnosticReport on Card-Krueger (1994), mpdta (Callaway-Sant'Anna
  2021 benchmark), and Castle Doctrine (Cheng-Hoekstra 2013 under
  both CS and SA), dumping summary + full_report + selected to_dict
  blocks for each.
- ``docs/validation/br_dr_canonical_validation.md`` is the regenerable
  raw output.
- ``docs/validation/br_dr_canonical_findings.md`` is the hand-written
  synthesis: direction / verdict / sensitivity tier all match canonical
  interpretations, with two small wording bugs surfaced and fixed in
  this PR and two larger gaps queued as follow-up (SA HonestDiD
  applicability, target-parameter disambiguation).

Wording fixes:

1. Treatment-label capitalization. ``str.capitalize()`` lowercased
   every character after the first, flattening embedded abbreviations
   (``"the NJ minimum-wage increase"`` → ``"The nj minimum-wage
   increase"``) and proper-noun phrases (``"Castle Doctrine law
   adoption"`` → ``"Castle doctrine law adoption"``). Replaced with a
   ``_sentence_first_upper`` helper that preserves user-supplied
   casing.

2. ``breakdown_M == 0`` phrasing. The HonestDiD fragile sentence
   quoted ``{breakdown_M:.2g}x the pre-period variation``, which
   renders as a degenerate ``0x`` on the exact-zero case surfaced by
   Cheng-Hoekstra. At ``breakdown_M <= 0.05`` (covers 0 and near-zero
   values), both BR's summary and DR's overall_interpretation now say
   "includes zero even at the smallest parallel-trends violations on
   the sensitivity grid" instead.

Tests: 5 new regressions in
``TestCanonicalValidationSurfaceFixes`` covering both fixes + three
boundary cases (exact zero, small positive, normal fragile value).

Not in scope: Favara-Imbs (dCDH reversible-treatment dataset not
bundled), ImputationDiD / TwoStageDiD on canonical data (needed to
exercise the R42 untreated-outcome FE assumption branch on real
data), SA HonestDiD applicability gap. All tracked in the findings
doc for follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sion tests

Per review: the validation script + findings doc were one-shot
artifacts that would age poorly. Replace them with
``tests/test_br_dr_canonical_datasets.py`` — pytest regression
guards that assert canonical properties (direction, PT verdict tier,
HonestDiD breakdown_M tier, cross-estimator consistency) on each
canonical fit.

Uses the ``_construct_*`` fallback data from ``diff_diff.datasets``
so tests have no network dependency (same pattern
``test_datasets.py`` already uses).

Tests cover:

- Card-Krueger (1994): positive sign, CI includes zero, "consistent
  with no effect" prose.
- mpdta (CS 2021 benchmark): negative ATT, breakdown_M > 1.0,
  no_detected_violation pre-trends.
- Castle Doctrine (Cheng-Hoekstra 2013) under CS: positive ATT,
  clear_violation pre-trends, fragile sensitivity (breakdown_M < 0.5).
- Castle Doctrine cross-estimator consistency: SA agrees with CS on
  direction and PT verdict bin.
- Treatment-label capitalization bugs: ``NJ`` abbreviation and
  ``Castle Doctrine`` proper noun preserved through BR's sentence
  capitalization.
- ``breakdown_M == 0`` edge case: BR summary uses smallest-grid-point
  wording, not the degenerate ``0x`` multiplier.
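A property-level guard in the style listed above might look like the following sketch. The function name, argument names, and numeric placeholders are illustrative — they are not the project's actual test code or the published estimates:

```python
def assert_mpdta_properties(att: float, ci_upper: float, breakdown_m: float) -> None:
    """Property-level regression guard: asserts signs and tier bins,
    never exact point estimates, so small data-aggregation drift
    between the bundled dataset and the author sample does not
    break the test."""
    assert att < 0, "mpdta benchmark: ATT should be negative"
    assert ci_upper < 0, "mpdta benchmark: CI should exclude zero"
    assert breakdown_m > 1.0, "mpdta benchmark: sensitivity should be robust-tier"


# Placeholder values standing in for a real fit result.
assert_mpdta_properties(att=-0.04, ci_upper=-0.01, breakdown_m=1.5)
print("properties hold")
```

In the real suite these assertions would run against a `BusinessReport` fit on the bundled dataset; the point is that each assertion constrains a direction or a tier, not a decimal.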

Drops:

- ``docs/validation/validate_br_dr_canonical.py`` — one-shot script,
  replaced by the regression tests.
- ``docs/validation/br_dr_canonical_validation.md`` — raw dump,
  regenerable on demand if needed but not checked in.
- ``docs/validation/br_dr_canonical_findings.md`` — summary now
  lives in the regression-test docstrings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment
⚠️ Needs changes

Executive Summary

  • No estimator math, weighting, variance, or identification-assumption logic changes are in scope here; the diff is confined to BR/DR prose and new regression tests.
  • The str.capitalize() replacement in BusinessReport looks correct, and I did not find remaining report-code uses of that pattern.
  • P1: the new near-zero HonestDiD wording overstates what the evaluated sensitivity grid established. It treats a small breakdown_M as proof that the smallest grid point failed, which is not what breakdown_M means.
  • The new regression test explicitly locks in that semantic mismatch by asserting “smallest grid point” wording for breakdown_M=0.03.
  • I could not execute the tests in this environment because pytest and numpy are unavailable, so this is a static diff review.

Methodology

  • P1 Impact: The new BR/DR branch for breakdown_M <= 0.05 silently violates the HonestDiD sensitivity contract. SensitivityResults.breakdown_M is defined as the smallest M where the robust CI includes zero, while M_values is the evaluated grid. HonestDiD also supports grids that start at 0, and the docs show that as a normal usage pattern. On a supported input like M_values=[0, 0.25, 0.5, ...] with breakdown_M=0.03, the CI still excludes zero at the smallest grid point (M=0); it only fails between 0 and 0.25. The new prose nonetheless says the CI “includes zero even at the smallest … violations on the sensitivity grid,” which is stronger than the underlying result supports. That is a semantic-contract violation in the reporting layer for HonestDiD. Locations: diff_diff/business_report.py:L2118-L2131, diff_diff/diagnostic_report.py:L3121-L3136, diff_diff/honest_did.py:L372-L389, diff_diff/honest_did.py:L2551-L2588, diff_diff/honest_did.py:L2604-L2645, docs/api/honest_did.rst:L151-L156, docs/methodology/REPORTING.md:L17-L24.
    Concrete fix: drive this wording from the actual evaluated grid (sens["grid"] / M_values), not from breakdown_M <= 0.05 alone. Only use “smallest … on the sensitivity grid” when the smallest evaluated grid point itself is non-robust. Otherwise use grid-free wording like “fragile to very small nonzero violations (breakdown M≈0.03)” or keep the numeric multiplier.

Code Quality

  • No additional findings. The casing helper itself is a contained fix, and I found no remaining report-code str.capitalize() usage.

Performance

  • No findings. The code-path cost is negligible; only the new tests add runtime.

Maintainability

  • No additional findings beyond the P1 above. The only maintainability issue is that the new near-zero prose is keyed off a hard-coded sentinel instead of the actual grid data already present in the schema.

Tech Debt

  • No TODO.md mitigation applies to the P1 above, and per the review policy it would not be TODO-eligible anyway because it is a silent interpretation bug in user-facing output.

Security

  • No findings.

Documentation/Tests

  • P3 Impact: The added regression coverage reinforces the incorrect breakdown_M=0.03 -> smallest grid point behavior, and there is still no direct regression on the mirrored DR wording path. Locations: tests/test_business_report.py:L4100-L4131, diff_diff/diagnostic_report.py:L3121-L3144.
    Concrete fix: replace the 0.03 test with a precomputed sensitivity case whose grid starts at 0 and assert both BR and DR do not claim “smallest … on the sensitivity grid”; add a separate positive-control case where the smallest evaluated grid point is already non-robust.

Path to Approval

  1. In both BR and DR, derive the near-zero HonestDiD wording from sens["grid"] / M_values, not from breakdown_M <= 0.05 alone.
  2. Add paired regressions for both report surfaces:
    1. a grid starting at 0 with breakdown_M≈0.03, where the smallest grid point is still robust, and
    2. a grid whose smallest evaluated point is already non-robust, where the “smallest … on the sensitivity grid” wording is actually justified.

…valuated grid

R1 caught a semantic bug in the round-1 canonical-validation wording
fix. ``breakdown_M`` is the smallest M at which the robust CI
includes zero — an interpolated threshold between grid points — not
a claim about any specific grid point. Keying the "smallest grid
point fails" wording off ``breakdown_M <= 0.05`` was wrong: on a
grid starting at M=0 where the smallest evaluated point is still
robust (CI excludes zero), a small ``breakdown_M=0.03`` means
fragility emerges BETWEEN grid points, not at M=0.

Fix (both BR and DR):

- Added a ``_smallest_failing_grid_m`` helper (paired helpers in
  ``business_report.py`` and ``diagnostic_report.py``, intentionally
  duplicated with cross-reference comments per the parity rule from
  ``feedback_cross_surface_parity_audit.md``).
- Helper returns the smallest evaluated M on the grid if that point
  has ``robust_to_zero == False``, else ``None``.
- Fragile-sensitivity wording now fires "smallest M evaluated on the
  sensitivity grid (M = X)" ONLY when the helper returns a value;
  otherwise falls through to the numeric multiplier ``{bkd:.2g}x``.
- Castle Doctrine (Cheng-Hoekstra 2013) CS fit: grid starts at
  M=0.5, every point non-robust — new wording quotes "(M = 0.5)"
  instead of "0x the pre-period variation".
- Reviewer's counterexample (grid ``[0, 0.25, ...]`` with bkd=0.03,
  smallest point robust): wording falls through to "0.03x the
  pre-period variation", not "smallest grid point".

Tests:

- Rewrote ``TestCanonicalValidationSurfaceFixes`` on the BR side to
  build sensitivity schemas with explicit grids. Added paired cases:
  (a) smallest M fails, assert "smallest M evaluated"; (b) smallest
  M robust, breakdown 0.03, assert multiplier wording is used.
- Added ``TestDRFragilePhrasingIsGridAware`` on the DR side mirroring
  the same paired cases against ``_render_overall_interpretation``.
- Updated the Castle Doctrine canonical-dataset regression test to
  assert ``"M = 0.5"`` appears (actual smallest evaluated grid point).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 20, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 6e24014c3077052c39f35adcef2e4961f79b2ec2


Overall Assessment
✅ Looks good


Methodology

No findings. The only affected method is HonestDiD reporting-layer narration. The new grid-aware branch in diff_diff/business_report.py:L2142-L2166 and diff_diff/diagnostic_report.py:L3147-L3173 is consistent with the HonestDiD breakdown_M definition and search behavior in diff_diff/honest_did.py:L387-L402 and diff_diff/honest_did.py:L2604-L2645, and I found no undocumented deviation from the HonestDiD registry contract in docs/methodology/REGISTRY.md:L2412-L2461.

Code Quality

No findings. _sentence_first_upper() in diff_diff/business_report.py:L1881-L1896 cleanly fixes the changed headline path, and I did not find remaining capitalize( usage in BR/DR code. The prior semantic bug is now covered on both report surfaces in tests/test_business_report.py:L4117-L4193 and tests/test_diagnostic_report.py:L1894-L2006.

Performance

No findings. The production change is constant-time/reporting-only; the only meaningful runtime cost is the added test coverage.

Maintainability

No findings. The BR/DR fixes are mirrored, and the paired regression tests materially reduce the risk of the two surfaces drifting on this behavior.

Tech Debt

No findings. The prior correctness issue was fixed directly rather than deferred, and I did not identify any new TODO-eligible debt in the changed code.

Security

No findings.

Documentation/Tests

No findings from static review. The new canonical-dataset suite plus the targeted BR/DR phrasing regressions cover the re-review issue directly in tests/test_br_dr_canonical_datasets.py:L145-L361, tests/test_business_report.py:L4117-L4193, and tests/test_diagnostic_report.py:L1894-L2006. I could not run them in this environment because pytest and numpy are unavailable.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 20, 2026
@igerber igerber merged commit 752f2b6 into main Apr 20, 2026
23 of 24 checks passed
@igerber igerber deleted the br-dr-canonical-validation branch April 20, 2026 11:42