Close BR/DR gap #4: canonical-dataset regression guards + wording fixes #341
Closes BR/DR foundation gap #4 (real-dataset validation) from the external-positioning gap list in ``project_br_dr_foundation.md``.

Validation artifacts:

- ``docs/validation/validate_br_dr_canonical.py`` runs BusinessReport / DiagnosticReport on Card-Krueger (1994), mpdta (Callaway-Sant'Anna 2021 benchmark), and Castle Doctrine (Cheng-Hoekstra 2013, under both CS and SA), dumping summary + full_report + selected to_dict blocks for each.
- ``docs/validation/br_dr_canonical_validation.md`` is the regenerable raw output.
- ``docs/validation/br_dr_canonical_findings.md`` is the hand-written synthesis: direction / verdict / sensitivity tier all match the canonical interpretations, with two small wording bugs surfaced and fixed in this PR and two larger gaps queued as follow-up (SA HonestDiD applicability, target-parameter disambiguation).

Wording fixes:

1. Treatment-label capitalization. ``str.capitalize()`` lowercased every character after the first, flattening embedded abbreviations (``"the NJ minimum-wage increase"`` → ``"The nj minimum-wage increase"``) and proper-noun phrases (``"Castle Doctrine law adoption"`` → ``"Castle doctrine law adoption"``). Replaced with a ``_sentence_first_upper`` helper that preserves user-supplied casing.
2. ``breakdown_M == 0`` phrasing. The HonestDiD fragile sentence quoted ``{breakdown_M:.2g}x the pre-period variation``, which renders as a degenerate ``0x`` in the exact-zero case surfaced by Cheng-Hoekstra. At ``breakdown_M <= 0.05`` (covering 0 and near-zero values), both BR's summary and DR's overall_interpretation now say "includes zero even at the smallest parallel-trends violations on the sensitivity grid" instead.

Tests: 5 new regressions in ``TestCanonicalValidationSurfaceFixes`` covering both fixes plus three boundary cases (exact zero, small positive, normal fragile value).
Not in scope: Favara-Imbs (dCDH reversible-treatment dataset not bundled), ImputationDiD / TwoStageDiD on canonical data (needed to exercise the R42 untreated-outcome FE assumption branch on real data), and the SA HonestDiD applicability gap. All tracked in the findings doc for follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sion tests

Per review: the validation script + findings doc were one-shot artifacts that would age poorly. Replace them with ``tests/test_br_dr_canonical_datasets.py`` — pytest regression guards that assert canonical properties (direction, PT verdict tier, HonestDiD breakdown_M tier, cross-estimator consistency) on each canonical fit. Uses the ``_construct_*`` fallback data from ``diff_diff.datasets`` so the tests have no network dependency (the same pattern ``test_datasets.py`` already uses).

Tests cover:

- Card-Krueger (1994): positive sign, CI includes zero, "consistent with no effect" prose.
- mpdta (CS 2021 benchmark): negative ATT, breakdown_M > 1.0, no_detected_violation pre-trends.
- Castle Doctrine (Cheng-Hoekstra 2013) under CS: positive ATT, clear_violation pre-trends, fragile sensitivity (breakdown_M < 0.5).
- Castle Doctrine cross-estimator consistency: SA agrees with CS on direction and PT verdict bin.
- Treatment-label capitalization bugs: the ``NJ`` abbreviation and the ``Castle Doctrine`` proper noun are preserved through BR's sentence capitalization.
- ``breakdown_M == 0`` edge case: BR summary uses the smallest-grid-point wording, not the degenerate ``0x`` multiplier.

Drops:

- ``docs/validation/validate_br_dr_canonical.py`` — one-shot script, replaced by the regression tests.
- ``docs/validation/br_dr_canonical_validation.md`` — raw dump, regenerable on demand if needed but not checked in.
- ``docs/validation/br_dr_canonical_findings.md`` — summary now lives in the regression-test docstrings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
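A property-level guard of the kind described above can be sketched as follows (hypothetical function and argument names; the real tests fit the estimators on the bundled data and read these values off the report objects):

```python
def check_mpdta_properties(att: float, breakdown_m: float, pt_verdict: str) -> None:
    """Assert canonical properties of an mpdta (CS 2021) fit.

    Property-level, not exact-match: the sign of the ATT, the
    robustness tier of breakdown_M, and the pre-trends verdict bin
    all survive small data-aggregation drift between the bundled
    dataset and the published author sample, where exact point
    estimates would not.
    """
    assert att < 0, "canonical mpdta ATT direction flipped"
    assert breakdown_m > 1.0, "mpdta sensitivity left the robust tier"
    assert pt_verdict == "no_detected_violation", "pre-trends verdict changed bins"
```

Such a guard passes for any fit with the canonical sign and tiers, and fails loudly the moment a code change flips one of them.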
…valuated grid
R1 caught a semantic bug in the round-1 canonical-validation wording
fix. ``breakdown_M`` is the smallest M at which the robust CI
includes zero — an interpolated threshold between grid points — not
a claim about any specific grid point. Keying the "smallest grid
point fails" wording off ``breakdown_M <= 0.05`` was wrong: on a
grid starting at M=0 where the smallest evaluated point is still
robust (CI excludes zero), a small ``breakdown_M=0.03`` means
fragility emerges BETWEEN grid points, not at M=0.
Fix (both BR and DR):
- Added a ``_smallest_failing_grid_m`` helper (paired helpers in
``business_report.py`` and ``diagnostic_report.py``, intentionally
duplicated with cross-reference comments per the parity rule from
``feedback_cross_surface_parity_audit.md``).
- Helper returns the smallest evaluated M on the grid if that point
has ``robust_to_zero == False``, else ``None``.
- Fragile-sensitivity wording now fires "smallest M evaluated on the
sensitivity grid (M = X)" ONLY when the helper returns a value;
otherwise falls through to the numeric multiplier ``{bkd:.2g}x``.
- Castle Doctrine (Cheng-Hoekstra 2013) CS fit: grid starts at
M=0.5, every point non-robust — new wording quotes "(M = 0.5)"
instead of "0x the pre-period variation".
- Reviewer's counterexample (grid ``[0, 0.25, ...]`` with bkd=0.03,
smallest point robust): wording falls through to "0.03x the
pre-period variation", not "smallest grid point".
Tests:
- Rewrote ``TestCanonicalValidationSurfaceFixes`` on the BR side to
build sensitivity schemas with explicit grids. Added paired cases:
(a) smallest M fails, assert "smallest M evaluated"; (b) smallest
M robust, breakdown 0.03, assert multiplier wording is used.
- Added ``TestDRFragilePhrasingIsGridAware`` on the DR side mirroring
the same paired cases against ``_render_overall_interpretation``.
- Updated the Castle Doctrine canonical-dataset regression test to
assert ``"M = 0.5"`` appears (actual smallest evaluated grid point).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber). Head SHA:

Overall Assessment / Executive Summary

- Methodology: No findings. The only affected method is HonestDiD reporting-layer narration. The new grid-aware branch in diff_diff/business_report.py:L2142-L2166 and diff_diff/diagnostic_report.py:L3147-L3173 is consistent with the HonestDiD
- Code Quality: No findings.
- Performance: No findings. The production change is constant-time/reporting-only; the only meaningful runtime cost is the added test coverage.
- Maintainability: No findings. The BR/DR fixes are mirrored, and the paired regression tests materially reduce the risk of the two surfaces drifting on this behavior.
- Tech Debt: No findings. The prior correctness issue was fixed directly rather than deferred, and I did not identify any new TODO-eligible debt in the changed code.
- Security: No findings.
- Documentation/Tests: No findings from static review. The new canonical-dataset suite plus the targeted BR/DR phrasing regressions cover the re-review issue directly in tests/test_br_dr_canonical_datasets.py:L145-L361, tests/test_business_report.py:L4117-L4193, and tests/test_diagnostic_report.py:L1894-L2006. I could not run them in this environment because
Summary

- ``tests/test_br_dr_canonical_datasets.py`` — pytest regression guards that run ``BusinessReport`` on Card-Krueger (1994), mpdta (Callaway-Sant'Anna 2021 benchmark), and Castle Doctrine (Cheng-Hoekstra 2013, under both CS and SA). Assertions are property-level (direction of ATT, pre-trends verdict bin, HonestDiD breakdown-M tier, cross-estimator consistency), not exact-match, so they survive small data-aggregation drift between the bundled dataset and the published author sample.
- ``str.capitalize()`` lowercased every character after the first, flattening embedded abbreviations (``"the NJ minimum-wage increase"`` -> ``"The nj minimum-wage increase"``) and proper-noun phrases (``"Castle Doctrine law adoption"`` -> ``"Castle doctrine law adoption"``). Replaced with a ``_sentence_first_upper`` helper that preserves user-supplied casing.
- ``breakdown_M == 0`` rendered as ``"violations reach 0x the pre-period variation"`` — a degenerate sentence (zero times anything). At ``breakdown_M <= 0.05``, both BR's summary and DR's overall-interpretation now say ``"includes zero even at the smallest parallel-trends violations on the sensitivity grid"``. Cross-surface parity: both the BR and DR render helpers are fixed.
- Uses the ``_construct_*`` fallback data from ``diff_diff.datasets`` so the regression tests have no network dependency (same pattern ``tests/test_datasets.py`` uses).

Methodology references (required if estimator / math changes)

- Tier thresholds (``breakdown_M > 1.0`` for robust, ``< 0.5`` for fragile) match the existing BR/DR conventions documented in ``docs/methodology/REPORTING.md``.

Validation

- ``tests/test_br_dr_canonical_datasets.py`` (new, 7 tests); ``tests/test_business_report.py`` (+5 tests in ``TestCanonicalValidationSurfaceFixes`` pinning the wording fixes).

Security / privacy
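The tier convention referenced above can be expressed as a small classifier (hypothetical helper and middle-tier name; the two thresholds are the ones quoted from ``docs/methodology/REPORTING.md``):

```python
def breakdown_tier(breakdown_m: float) -> str:
    """Bin a HonestDiD breakdown_M value into the BR/DR tiers.

    breakdown_M > 1.0 -> robust (the effect survives parallel-trends
    violations larger than the observed pre-period variation);
    breakdown_M < 0.5 -> fragile; values in between land in a
    middle tier (name assumed here for illustration).
    """
    if breakdown_m > 1.0:
        return "robust"
    if breakdown_m < 0.5:
        return "fragile"
    return "intermediate"
```

For example, the mpdta fit (breakdown_M > 1.0) bins as ``"robust"`` while the Castle Doctrine CS fit (breakdown_M < 0.5) bins as ``"fragile"``, matching the canonical-dataset assertions above.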
Generated with Claude Code