Extend PR #312 Y-normalization contract into SDID diagnostic methods #328
Conversation
PR #312 centered and scaled Y once in SyntheticDiD.fit() to avoid catastrophic cancellation in the SDID double-difference at extreme Y. The fix did not reach the two public diagnostic methods on SyntheticDiDResults: in_time_placebo() and sensitivity_to_zeta_omega() both re-run Frank-Wolfe on the original-scale fit snapshot with original-scale zetas, so at Y ~ 1e9 they reproduce the same silent failure the main path was fixed for.

Capture Y_shift and Y_scale on the fit snapshot, apply the same (Y - shift) / scale normalization inside the two diagnostic methods, pass zeta / Y_scale and min_decrease derived from the normalized noise level to the FW weight solvers, and rescale att / pre_fit_rmse back to original-Y units before reporting. Unit-weight diagnostics (max_unit_weight, effective_n) are scale-invariant and reported without rescaling.

Regression coverage:
- TestDiagnosticScaleParity: att / pre_fit_rmse must scale by |a| across (Y -> a*Y + b) for both diagnostic methods; a no-effect DGP at Y ~ 1e9 must produce finite placebo atts landing within 5 * noise_level.
- TestHeterogeneousAndRampingScale: cross-unit heterogeneous scale (units spanning 1e6 to 1e9) and cross-period ramping (trend growing 4 orders of magnitude across periods) must leave the fit and both diagnostic surfaces finite. These are the heterogeneity pathways the original TestScaleEquivariance (affine-only) suite did not cover.

Covered by audit axis A (numerical precision / scale fragility). Findings D-4 and D-4b from docs/audits/silent-failures-findings.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
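A minimal sketch of the normalize-then-rescale contract described above. The helper names and snapshot access are hypothetical, not the library's actual API; only the (Y - shift) / scale, zeta / Y_scale, and rescale-by-Y_scale arithmetic is taken from this PR.

```python
import numpy as np

def normalize_for_diagnostics(Y, snapshot):
    """Re-apply the fit-time normalization before a diagnostic re-runs
    the Frank-Wolfe weight solvers. Helper and field access are a
    sketch; only the arithmetic mirrors the PR's contract."""
    shift = getattr(snapshot, "Y_shift", 0.0)  # legacy snapshots: identity
    scale = getattr(snapshot, "Y_scale", 1.0)
    Y_norm = (np.asarray(Y, dtype=float) - shift) / scale
    zeta_norm = snapshot.zeta / scale          # solver sees normalized zeta
    return Y_norm, zeta_norm, scale

def rescale_outputs(att_norm, pre_fit_rmse_norm, scale):
    # att and pre_fit_rmse are differences/residuals of Y, so they pick
    # up the multiplicative scale; the additive shift cancels.
    return att_norm * scale, pre_fit_rmse_norm * scale
```

Unit-weight diagnostics (max_unit_weight, effective_n) come out of the solver already scale-free, so under this sketch they would bypass rescale_outputs entirely.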
Pre-fix _SyntheticDiDFitSnapshot objects don't carry Y_shift / Y_scale fields. After this PR those fields default to (0.0, 1.0), which makes the new normalization path a pure no-op on the legacy snapshot. Add a regression test that overwrites results_._fit_snapshot with a manually-constructed snapshot using only the pre-PR fields, then confirms both in_time_placebo() and sensitivity_to_zeta_omega() preserve schema and row count. Locks in the backward-compatibility contract so future refactors don't accidentally tighten the normalization path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
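The legacy no-op contract can be sketched with a defaulted dataclass. The stand-in below simplifies the snapshot's field set; only the Y_shift / Y_scale names and their (0.0, 1.0) defaults come from this PR.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class FitSnapshot:          # simplified stand-in for _SyntheticDiDFitSnapshot
    Y: Any                  # pre-PR field: original-scale outcome matrix
    zeta: float             # pre-PR field: original-scale regularization
    Y_shift: float = 0.0    # new field: identity default for legacy snapshots
    Y_scale: float = 1.0    # new field: identity default for legacy snapshots

def normalization_is_noop(snap: FitSnapshot) -> bool:
    """True when (Y - shift) / scale is exactly the identity, i.e. the
    diagnostic path behaves as it did before the PR."""
    return snap.Y_shift == 0.0 and snap.Y_scale == 1.0
```

A snapshot constructed with only the pre-PR fields therefore flows through the new normalization code unchanged, which is exactly the contract the proposed backward-compatibility regression test would pin down.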
Overall Assessment
✅ Looks good

Executive Summary

Methodology
The PR changes library diagnostics rather than the paper-defined ATT or SE estimators. Within that scope, the implementation is consistent with the documented SDID contract: fit already normalizes Y.

Code Quality
The snapshot plumbing is narrow and backward compatible.

Performance
The extra subtraction/division is confined to post-fit diagnostics and does not touch the main fit or variance hot paths.

Maintainability
The implementation, registry note, and new tests now describe the same diagnostic normalization contract, which reduces future drift risk.

Tech Debt
I did not see a new silent-correctness issue that would need TODO tracking, and there is no conflicting tracked limitation.

Security
The retained fit snapshot is still excluded from pickling, so the new fields do not widen the serialization/privacy surface. See diff_diff/results.py:L808-L823.

Documentation/Tests
The registry update accurately describes the new behavior, and the added tests cover the right regression surfaces for this PR. See docs/methodology/REGISTRY.md:L1522-L1534 and tests/test_methodology_sdid.py:L2301-L2563.
CI was hitting false-positive failures on tests that assert wall-clock elapsed time, e.g. TestCallawaySantAnnaSEAccuracy.test_timing_performance failing on PR #328 with "Estimation took 0.103s, expected <0.1s" — a 3% overshoot that reflects runner noise (BLAS path variation, neighbor VM contention, cold caches), not a real regression. The existing "20x margin" comment acknowledged the problem but no fixed threshold can absorb CI wall-clock variance.

Mark the two CS timing tests @pytest.mark.slow so they're excluded from default CI (existing addopts already uses `-m 'not slow'`). Tests remain runnable on-demand via `pytest -m slow` for local benchmarking — same pattern the TROP suite uses per CLAUDE.md.

Tests marked:
- test_se_accuracy.py::TestCallawaySantAnnaSEAccuracy::test_timing_performance (method-level marker — rest of the SE-correctness class still runs)
- test_se_accuracy.py::TestPerformanceRegression (class-level marker — all three parametrized timing cases move out of default CI together)

Left unchanged: test_methodology_honest_did.py::test_m0_short_circuit uses wall-clock as a proxy for "short-circuit path taken" — that's a correctness signal, not a performance signal. Marking it slow would remove the check. Added a TODO.md entry to replace it with a mock/spy in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
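The opt-out pattern can be sketched as follows. The test names match the commit message above, but the bodies and the config comment are illustrative stand-ins, not the repository's actual code.

```python
# Assumes the project's pytest config already carries
#   addopts = -m 'not slow'
# so marked tests are skipped by default and run on demand
# via `pytest -m slow`.
import pytest

@pytest.mark.slow                 # method-level: only this test leaves default CI
def test_timing_performance():
    ...                           # wall-clock timing assertion would live here

@pytest.mark.slow                 # class-level: every test in the class leaves together
class TestPerformanceRegression:
    def test_timing_case(self):   # illustrative body
        ...
```

Applying the marker at method level keeps the rest of a correctness-focused class in default CI, while the class-level marker moves all parametrized timing cases out as one unit.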
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per project SemVer convention, minor bumps are reserved for new estimators or new module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults (PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner-ready output layer. Plain-English narrative summaries across all 16 result types, with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A contract), heterogeneity + within-group-varying PSU under Binder TSL, and PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review - docs/performance-scenarios.md documents 5-7 realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339) continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default CI after false-positive flakes; the perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml, diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released: 2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the comparison-link footer. CITATION.cff retains the v3.1.3 versioned DOI in identifiers; the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Capture `Y_shift` and `Y_scale` on `_SyntheticDiDFitSnapshot` so that `SyntheticDiDResults.in_time_placebo()` and `SyntheticDiDResults.sensitivity_to_zeta_omega()` can reproduce the exact normalization the main fit path applies.
- Pass `zeta / Y_scale` and a `min_decrease` derived from the normalized noise level, then rescale `att` / `pre_fit_rmse` back to original-Y units.
- `max_unit_weight` and `effective_n` are scale-invariant and reported unchanged.
- Update the `SyntheticDiD` registry note to describe the new diagnostic contract (replaces the prior "diagnostics run on original-scale snapshot" caveat).

Methodology references (required if estimator / math changes)
- Affected methods: `SyntheticDiDResults.in_time_placebo()`, `SyntheticDiDResults.sensitivity_to_zeta_omega()`.
- synth-inference/synthdid#71 (R-package version of the extreme-Y cancellation issue).
- No change to the estimand: the solution is unchanged under the joint rescaling `(Y, ζ) → (Y/s, ζ/s)`; the change only prevents catastrophic cancellation at extreme Y that PR #312 (Fix SyntheticDiD catastrophic cancellation at extreme Y scale) already fixed on the main `fit()` path.

Validation
- `tests/test_methodology_sdid.py`:
  - `TestDiagnosticScaleParity` — asserts `att` / `pre_fit_rmse` scale by `|a|` across `(Y -> a*Y + b)` for both diagnostic methods; checks finite placebo output at Y ~ 1e9 on a no-effect DGP.
  - `TestHeterogeneousAndRampingScale` — cross-unit heterogeneous scale (units spanning 1e6-1e9) and cross-period ramping trend must leave fit + both diagnostics finite (D-4b regression coverage).
  - `TestDiagnosticSnapshotBackwardCompat` — asserts that a `_SyntheticDiDFitSnapshot` constructed without `Y_shift` / `Y_scale` (using the new defaults of `0.0` and `1.0`) still drives both diagnostics with the same schema and row count — locks in the legacy-snapshot contract.
- Full `tests/test_methodology_sdid.py` suite green (135 tests).

Security / privacy
Audit context: resolves findings D-4 and D-4b from the axis-A (numerical precision / scale fragility) slice of the in-flight silent-failures audit. Follows the SDID coordination protocol; no in-flight SDID work to coordinate with (main-path fix #312 already landed).
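The scale-parity property asserted by TestDiagnosticScaleParity (att scaling by |a| under Y -> a*Y + b) can be illustrated on a plain 2x2 double-difference, a simplified stand-in for the full weighted SDID estimator:

```python
import numpy as np

def double_diff_att(Y, treated, post):
    """Unweighted double-difference; a simplified stand-in for SDID,
    enough to show affine equivariance of the ATT."""
    tr, co = Y[treated], Y[~treated]
    pre = ~post
    return (tr[:, post].mean() - tr[:, pre].mean()) - (
        co[:, post].mean() - co[:, pre].mean())

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 8))
treated = np.array([False] * 4 + [True] * 2)
post = np.array([False] * 5 + [True] * 3)

att = double_diff_att(Y, treated, post)
a, b = 1e9, 3.5                           # affine map Y -> a*Y + b
att_scaled = double_diff_att(a * Y + b, treated, post)
# the shift b cancels in the double-difference; the slope a passes through
assert np.isclose(att_scaled, a * att)
```

The equivariance is exact in real arithmetic; the audit finding is about floating-point evaluation of the full Frank-Wolfe pipeline at Y ~ 1e9, which is what normalizing first and rescaling afterwards protects against.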