Extend PR #312 Y-normalization contract into SDID diagnostic methods #328
Conversation
PR #312 centered and scaled Y once in SyntheticDiD.fit() to avoid catastrophic cancellation in the SDID double-difference at extreme Y. The fix did not reach the two public diagnostic methods on SyntheticDiDResults: in_time_placebo() and sensitivity_to_zeta_omega() both re-run Frank-Wolfe on the original-scale fit snapshot with original-scale zetas, so at Y ~ 1e9 they reproduce the same silent failure the main path was fixed for.

Capture Y_shift and Y_scale on the fit snapshot, apply the same (Y - shift) / scale normalization inside the two diagnostic methods, pass zeta / Y_scale and min_decrease derived from the normalized noise level to the FW weight solvers, and rescale att / pre_fit_rmse back to original-Y units before reporting. Unit-weight diagnostics (max_unit_weight, effective_n) are scale-invariant and reported without rescaling.

Regression coverage:
- TestDiagnosticScaleParity: att / pre_fit_rmse must scale by |a| across (Y -> a*Y + b) for both diagnostic methods; a no-effect DGP at Y ~ 1e9 must produce finite placebo atts landing within 5 * noise_level.
- TestHeterogeneousAndRampingScale: cross-unit heterogeneous scale (units spanning 1e6 to 1e9) and cross-period ramping (trend growing 4 orders of magnitude across periods) must leave the fit and both diagnostic surfaces finite. These are the heterogeneity pathways the original TestScaleEquivariance (affine-only) suite did not cover.

Covered by audit axis A (numerical precision / scale fragility). Findings D-4 and D-4b from docs/audits/silent-failures-findings.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
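A minimal sketch of the normalize-then-rescale contract described above. The helper names and snapshot access are hypothetical, not the library's actual API; only the (Y - shift) / scale, zeta / Y_scale, and rescale-by-Y_scale arithmetic is taken from this PR.

```python
import numpy as np

def normalize_for_diagnostics(Y, snapshot):
    """Re-apply the fit-time normalization before a diagnostic re-runs
    the Frank-Wolfe weight solvers. Helper and field access are a
    sketch; only the arithmetic mirrors the PR's contract."""
    shift = getattr(snapshot, "Y_shift", 0.0)  # legacy snapshots: identity
    scale = getattr(snapshot, "Y_scale", 1.0)
    Y_norm = (np.asarray(Y, dtype=float) - shift) / scale
    zeta_norm = snapshot.zeta / scale          # solver sees normalized zeta
    return Y_norm, zeta_norm, scale

def rescale_outputs(att_norm, pre_fit_rmse_norm, scale):
    # att and pre_fit_rmse are differences/residuals of Y, so they pick
    # up the multiplicative scale; the additive shift cancels.
    return att_norm * scale, pre_fit_rmse_norm * scale
```

Unit-weight diagnostics (max_unit_weight, effective_n) come out of the solver already scale-free, so under this sketch they would bypass rescale_outputs entirely.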
Pre-fix _SyntheticDiDFitSnapshot objects don't carry Y_shift / Y_scale fields. After this PR those fields default to (0.0, 1.0), which makes the new normalization path a pure no-op on the legacy snapshot. Add a regression test that overwrites results_._fit_snapshot with a manually-constructed snapshot using only the pre-PR fields, then confirms both in_time_placebo() and sensitivity_to_zeta_omega() preserve schema and row count. Locks in the backward-compatibility contract so future refactors don't accidentally tighten the normalization path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
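The legacy no-op contract can be sketched with a defaulted dataclass. The stand-in below simplifies the snapshot's field set; only the Y_shift / Y_scale names and their (0.0, 1.0) defaults come from this PR.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class FitSnapshot:          # simplified stand-in for _SyntheticDiDFitSnapshot
    Y: Any                  # pre-PR field: original-scale outcome matrix
    zeta: float             # pre-PR field: original-scale regularization
    Y_shift: float = 0.0    # new field: identity default for legacy snapshots
    Y_scale: float = 1.0    # new field: identity default for legacy snapshots

def normalization_is_noop(snap: FitSnapshot) -> bool:
    """True when (Y - shift) / scale is exactly the identity, i.e. the
    diagnostic path behaves as it did before the PR."""
    return snap.Y_shift == 0.0 and snap.Y_scale == 1.0
```

A snapshot constructed with only the pre-PR fields therefore flows through the new normalization code unchanged, which is exactly the contract the proposed backward-compatibility regression test would pin down.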
Overall Assessment
✅ Looks good

Executive Summary

Methodology
The PR changes library diagnostics rather than the paper-defined ATT or SE estimators. Within that scope, the implementation is consistent with the documented SDID contract: fit already normalizes Y.

Code Quality
The snapshot plumbing is narrow and backward compatible.

Performance
The extra subtraction/division is confined to post-fit diagnostics and does not touch the main fit or variance hot paths.

Maintainability
The implementation, registry note, and new tests now describe the same diagnostic normalization contract, which reduces future drift risk.

Tech Debt
I did not see a new silent-correctness issue that would need TODO tracking, and there is no conflicting tracked limitation.

Security
The retained fit snapshot is still excluded from pickling, so the new fields do not widen the serialization/privacy surface. See diff_diff/results.py:L808-L823.

Documentation/Tests
The registry update accurately describes the new behavior, and the added tests cover the right regression surfaces for this PR. See docs/methodology/REGISTRY.md:L1522-L1534 and tests/test_methodology_sdid.py:L2301-L2563.
CI was hitting false-positive failures on tests that assert wall-clock elapsed time, e.g. TestCallawaySantAnnaSEAccuracy.test_timing_performance failing on PR #328 with "Estimation took 0.103s, expected <0.1s" — a 3% overshoot that reflects runner noise (BLAS path variation, neighbor VM contention, cold caches), not a real regression. The existing "20x margin" comment acknowledged the problem but no fixed threshold can absorb CI wall-clock variance.

Mark the two CS timing tests @pytest.mark.slow so they're excluded from default CI (existing addopts already uses `-m 'not slow'`). Tests remain runnable on-demand via `pytest -m slow` for local benchmarking — same pattern the TROP suite uses per CLAUDE.md.

Tests marked:
- test_se_accuracy.py::TestCallawaySantAnnaSEAccuracy::test_timing_performance (method-level marker — rest of the SE-correctness class still runs)
- test_se_accuracy.py::TestPerformanceRegression (class-level marker — all three parametrized timing cases move out of default CI together)

Left unchanged: test_methodology_honest_did.py::test_m0_short_circuit uses wall-clock as a proxy for "short-circuit path taken" — that's a correctness signal, not a performance signal. Marking it slow would remove the check. Added a TODO.md entry to replace it with a mock/spy in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
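The opt-out pattern can be sketched as follows. The test names match the commit message above, but the bodies and the config comment are illustrative stand-ins, not the repository's actual code.

```python
# Assumes the project's pytest config already carries
#   addopts = -m 'not slow'
# so marked tests are skipped by default and run on demand
# via `pytest -m slow`.
import pytest

@pytest.mark.slow                 # method-level: only this test leaves default CI
def test_timing_performance():
    ...                           # wall-clock timing assertion would live here

@pytest.mark.slow                 # class-level: every test in the class leaves together
class TestPerformanceRegression:
    def test_timing_case(self):   # illustrative body
        ...
```

Applying the marker at method level keeps the rest of a correctness-focused class in default CI, while the class-level marker moves all parametrized timing cases out as one unit.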
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per project SemVer convention, minor bumps are reserved for new estimators or new module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults (PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner-ready output layer. Plain-English narrative summaries across all 16 result types, with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A contract), heterogeneity + within-group-varying PSU under Binder TSL, and PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review - docs/performance-scenarios.md documents 5-7 realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339) continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default CI after false-positive flakes; the perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml, diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released: 2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the comparison-link footer. CITATION.cff retains the v3.1.3 versioned DOI in identifiers; the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Capture `Y_shift` and `Y_scale` on `_SyntheticDiDFitSnapshot` so that `SyntheticDiDResults.in_time_placebo()` and `SyntheticDiDResults.sensitivity_to_zeta_omega()` can reproduce the exact normalization the main fit path applies.
- Pass `zeta / Y_scale` and a `min_decrease` derived from the normalized noise level, then rescale `att` / `pre_fit_rmse` back to original-Y units.
- `max_unit_weight` and `effective_n` are scale-invariant and reported unchanged.
- Update the `SyntheticDiD` registry note to describe the new diagnostic contract (replaces the prior "diagnostics run on original-scale snapshot" caveat).

Methodology references (required if estimator / math changes)
- Affected methods: `SyntheticDiDResults.in_time_placebo()`, `SyntheticDiDResults.sensitivity_to_zeta_omega()`.
- synth-inference/synthdid#71 (R-package version of the extreme-Y cancellation issue).
- No change to the estimand: the solution is unchanged under the joint rescaling `(Y, ζ) → (Y/s, ζ/s)`; the change only prevents catastrophic cancellation at extreme Y that PR #312 (Fix SyntheticDiD catastrophic cancellation at extreme Y scale) already fixed on the main `fit()` path.

Validation
- `tests/test_methodology_sdid.py`:
  - `TestDiagnosticScaleParity` — asserts `att` / `pre_fit_rmse` scale by `|a|` across `(Y -> a*Y + b)` for both diagnostic methods; checks finite placebo output at Y ~ 1e9 on a no-effect DGP.
  - `TestHeterogeneousAndRampingScale` — cross-unit heterogeneous scale (units spanning 1e6-1e9) and cross-period ramping trend must leave fit + both diagnostics finite (D-4b regression coverage).
  - `TestDiagnosticSnapshotBackwardCompat` — asserts that a `_SyntheticDiDFitSnapshot` constructed without `Y_shift` / `Y_scale` (using the new defaults of `0.0` and `1.0`) still drives both diagnostics with the same schema and row count — locks in the legacy-snapshot contract.
- Full `tests/test_methodology_sdid.py` suite green (135 tests).

Security / privacy
Audit context: resolves findings D-4 and D-4b from the axis-A (numerical precision / scale fragility) slice of the in-flight silent-failures audit. Follows the SDID coordination protocol; no in-flight SDID work to coordinate with (main-path fix #312 already landed).
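The scale-parity property asserted by TestDiagnosticScaleParity (att scaling by |a| under Y -> a*Y + b) can be illustrated on a plain 2x2 double-difference, a simplified stand-in for the full weighted SDID estimator:

```python
import numpy as np

def double_diff_att(Y, treated, post):
    """Unweighted double-difference; a simplified stand-in for SDID,
    enough to show affine equivariance of the ATT."""
    tr, co = Y[treated], Y[~treated]
    pre = ~post
    return (tr[:, post].mean() - tr[:, pre].mean()) - (
        co[:, post].mean() - co[:, pre].mean())

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 8))
treated = np.array([False] * 4 + [True] * 2)
post = np.array([False] * 5 + [True] * 3)

att = double_diff_att(Y, treated, post)
a, b = 1e9, 3.5                           # affine map Y -> a*Y + b
att_scaled = double_diff_att(a * Y + b, treated, post)
# the shift b cancels in the double-difference; the slope a passes through
assert np.isclose(att_scaled, a * att)
```

The equivariance is exact in real arithmetic; the audit finding is about floating-point evaluation of the full Frank-Wolfe pipeline at Y ~ 1e9, which is what normalizing first and rescaling afterwards protects against.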