
Extend PR #312 Y-normalization contract into SDID diagnostic methods#328

Merged
igerber merged 2 commits into main from fix/sdid-diagnostics-scale-parity
Apr 19, 2026

Conversation

@igerber (Owner) commented Apr 19, 2026

Summary

  • Capture Y_shift and Y_scale on _SyntheticDiDFitSnapshot so that SyntheticDiDResults.in_time_placebo() and SyntheticDiDResults.sensitivity_to_zeta_omega() can reproduce the exact normalization the main fit path applies.
  • Normalize the relevant Y arrays before running Frank-Wolfe inside each diagnostic, pass zeta / Y_scale and a min_decrease derived from the normalized noise level, then rescale att / pre_fit_rmse back to original-Y units. max_unit_weight and effective_n are scale-invariant and reported unchanged.
  • Update the SyntheticDiD registry note to describe the new diagnostic contract (replaces the prior "diagnostics run on original-scale snapshot" caveat).
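The contract in the bullets above can be sketched as follows. This is a hedged illustration, not the library's code: `run_frank_wolfe` and the dict keys are hypothetical stand-ins for the actual FW weight solvers and snapshot fields.

```python
import numpy as np

# Hypothetical sketch of the diagnostic contract: normalize Y with the
# shift/scale captured on the fit snapshot, solve on the normalized
# problem with a correspondingly scaled zeta, then map scale-carrying
# outputs back to original-Y units.
def diagnostic_on_normalized(Y, Y_shift, Y_scale, zeta, run_frank_wolfe):
    Y_norm = (Y - Y_shift) / Y_scale               # same transform fit() applies
    out = run_frank_wolfe(Y_norm, zeta / Y_scale)  # FW invariant under (Y, z) -> (Y/s, z/s)
    return {
        "att": out["att"] * Y_scale,               # shift cancels in differences
        "pre_fit_rmse": out["rmse"] * Y_scale,     # dispersion rescales by Y_scale
        "max_unit_weight": out["max_w"],           # weights are scale-invariant
    }
```

Note that only the shift-free, scale-carrying quantities are multiplied back by `Y_scale`; unit weights pass through untouched, matching the bullet above.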

Methodology references (required if estimator / math changes)

  • Method name(s): SyntheticDiDResults.in_time_placebo(), SyntheticDiDResults.sensitivity_to_zeta_omega().
  • Paper / source link(s): Arkhangelsky et al. (2021); synth-inference/synthdid#71 (R-package version of the extreme-Y cancellation issue).
  • Any intentional deviations from the source (and why): None. The diagnostic contract is a no-op on well-scaled Y (τ is location-invariant and scale-equivariant, FW is invariant under (Y, ζ) → (Y/s, ζ/s)); the change only prevents catastrophic cancellation at extreme Y that PR #312 (Fix SyntheticDiD catastrophic cancellation at extreme Y scale) already fixed on the main fit() path.
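A tiny numerical illustration (illustrative numbers only, not library code) of why extreme Y levels are dangerous without the normalization:

```python
import numpy as np

# float64 carries roughly 15-16 significant decimal digits; near magnitude
# 1e9 the spacing between representable values is about 1.2e-7, so any
# structure much smaller than that is rounded away once the offset is added.
signal = np.linspace(-1e-8, 1e-8, 5)     # fine structure in Y
offset = 1e9                             # extreme level, as in the audit finding
roundtrip = (signal + offset) - offset   # the offset cancels; the precision does not
err = float(np.max(np.abs(roundtrip - signal)))
# Centering first, (Y - shift) / scale as the fit path does, preserves the
# fine structure exactly -- which is why the diagnostics must do the same.
```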

Validation

  • Tests added/updated in tests/test_methodology_sdid.py:
    • TestDiagnosticScaleParity — asserts att / pre_fit_rmse scale by |a| across (Y -> a*Y + b) for both diagnostic methods; checks finite placebo output at Y ~ 1e9 on a no-effect DGP.
    • TestHeterogeneousAndRampingScale — cross-unit heterogeneous scale (units spanning 1e6-1e9) and cross-period ramping trend must leave fit + both diagnostics finite (D-4b regression coverage).
    • TestDiagnosticSnapshotBackwardCompat — asserts that a _SyntheticDiDFitSnapshot constructed without Y_shift / Y_scale (using the new defaults of 0.0 and 1.0) still drives both diagnostics with the same schema and row count — locks in the legacy-snapshot contract.
  • Full tests/test_methodology_sdid.py suite green (135 tests).
  • Backtest / simulation / notebook evidence: N/A — scale-parity regression coverage above is the substantive evidence.
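The parity property those tests lock in can be shown on a toy difference-in-means standing in for the real estimator (no diff_diff API names are assumed): under Y → a·Y + b, location-invariant contrasts transform by a, and dispersion measures by |a|.

```python
import numpy as np

rng = np.random.default_rng(42)
Y = rng.normal(size=(20, 12))
a, b = -3.5, 1e6

def toy_att(M):
    # treated-block mean minus control-block mean: b cancels, a passes through
    return M[10:, 6:].mean() - M[10:, :6].mean()

def toy_rmse(M):
    # residual dispersion: b cancels, magnitude scales by |a|
    return M[:10].std()

assert np.isclose(toy_att(a * Y + b), a * toy_att(Y))
assert np.isclose(toy_rmse(a * Y + b), abs(a) * toy_rmse(Y))
```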

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Audit context: resolves findings D-4 and D-4b from the axis-A (numerical precision / scale fragility) slice of the in-flight silent-failures audit. Follows the SDID coordination protocol; no in-flight SDID work to coordinate with (main-path fix #312 already landed).

igerber and others added 2 commits April 19, 2026 06:27
PR #312 centered and scaled Y once in SyntheticDiD.fit() to avoid
catastrophic cancellation in the SDID double-difference at extreme Y.
The fix did not reach the two public diagnostic methods on
SyntheticDiDResults: in_time_placebo() and sensitivity_to_zeta_omega()
both re-run Frank-Wolfe on the original-scale fit snapshot with
original-scale zetas, so at Y ~ 1e9 they reproduce the same silent
failure the main path was fixed for.

Capture Y_shift and Y_scale on the fit snapshot, apply the same
(Y - shift) / scale normalization inside the two diagnostic methods,
pass zeta / Y_scale and min_decrease derived from the normalized
noise level to the FW weight solvers, and rescale att / pre_fit_rmse
back to original-Y units before reporting. Unit-weight diagnostics
(max_unit_weight, effective_n) are scale-invariant and reported
without rescaling.

Regression coverage:
- TestDiagnosticScaleParity: att / pre_fit_rmse must scale by |a|
  across (Y -> a*Y + b) for both diagnostic methods; a no-effect
  DGP at Y ~ 1e9 must produce finite placebo atts landing within
  5 * noise_level.
- TestHeterogeneousAndRampingScale: cross-unit heterogeneous scale
  (units spanning 1e6 to 1e9) and cross-period ramping (trend
  growing 4 orders of magnitude across periods) must leave the fit
  and both diagnostic surfaces finite. These are the heterogeneity
  pathways the original TestScaleEquivariance (affine-only) suite
  did not cover.

Covered by audit axis A (numerical precision / scale fragility).
Findings D-4 and D-4b from docs/audits/silent-failures-findings.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix _SyntheticDiDFitSnapshot objects don't carry Y_shift / Y_scale
fields. After this PR those fields default to (0.0, 1.0), which makes
the new normalization path a pure no-op on the legacy snapshot.

Add a regression test that overwrites results_._fit_snapshot with a
manually-constructed snapshot using only the pre-PR fields, then
confirms both in_time_placebo() and sensitivity_to_zeta_omega()
preserve schema and row count. Locks in the backward-compatibility
contract so future refactors don't accidentally tighten the
normalization path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
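The backward-compatibility mechanism described in this commit follows a standard pattern: add the new fields with identity defaults as trailing fields, so legacy construction sites keep working and the normalization degenerates to a no-op. A hedged reconstruction (field names other than Y_shift/Y_scale are illustrative, not the library's actual schema):

```python
from dataclasses import dataclass

@dataclass
class FitSnapshot:
    Y: tuple               # stand-in for the retained outcome data
    zeta_omega: float      # stand-in for the stored regularization input
    Y_shift: float = 0.0   # identity shift for pre-PR snapshots
    Y_scale: float = 1.0   # identity scale for pre-PR snapshots

# Old call sites that never heard of the new fields still construct fine:
legacy = FitSnapshot(Y=(1.0, 2.0), zeta_omega=0.5)
# (y - 0.0) / 1.0 == y, so the normalization path is exactly a no-op here.
```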
@github-actions

Overall Assessment

✅ Looks good

Executive Summary

Methodology

  • No findings.

The PR changes library diagnostics rather than the paper-defined ATT or SE estimators. Within that scope, the implementation is consistent with the documented SDID contract: fit already normalizes Y, rescales user-visible outputs, and stores original-scale zeta_*/noise_level; the diagnostics now replay the same optimization problem on the normalized scale and report back on the original scale. That is a documented numerical-stability extension, not an undocumented methodology deviation. See diff_diff/synthetic_did.py:L419-L506, diff_diff/results.py:L1153-L1264, diff_diff/results.py:L1340-L1420, and docs/methodology/REGISTRY.md:L1522-L1534.

Code Quality

  • No findings.

The snapshot plumbing is narrow and backward compatible: Y_shift and Y_scale were added as defaulted trailing fields, and the only production constructor updated is the fit path. See diff_diff/results.py:L653-L677 and diff_diff/synthetic_did.py:L659-L671.

Performance

  • No findings.

The extra subtraction/division is confined to post-fit diagnostics and does not touch the main fit or variance hot paths.

Maintainability

  • No findings.

The implementation, registry note, and new tests now describe the same diagnostic normalization contract, which reduces future drift risk.

Tech Debt

  • No findings.

I did not see a new silent-correctness issue that would need TODO tracking, and there is no conflicting tracked limitation in TODO.md that changes the assessment.

Security

  • No findings.

The retained fit snapshot is still excluded from pickling, so the new fields do not widen the serialization/privacy surface. See diff_diff/results.py:L808-L823.

Documentation/Tests

  • No findings.

The registry update accurately describes the new behavior, and the added tests cover the right regression surfaces for this PR. See docs/methodology/REGISTRY.md:L1522-L1534 and tests/test_methodology_sdid.py:L2301-L2563.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 19, 2026
igerber added a commit that referenced this pull request Apr 19, 2026
CI was hitting false-positive failures on tests that assert wall-clock
elapsed time, e.g. TestCallawaySantAnnaSEAccuracy.test_timing_performance
failing on PR #328 with "Estimation took 0.103s, expected <0.1s" — a 3%
overshoot that reflects runner noise (BLAS path variation, neighbor VM
contention, cold caches), not a real regression. The existing "20x margin"
comment acknowledged the problem but no fixed threshold can absorb CI
wall-clock variance.

Mark the two CS timing tests @pytest.mark.slow so they're excluded from
default CI (existing addopts already uses `-m 'not slow'`). Tests remain
runnable on-demand via `pytest -m slow` for local benchmarking — same
pattern the TROP suite uses per CLAUDE.md.

Tests marked:
- test_se_accuracy.py::TestCallawaySantAnnaSEAccuracy::test_timing_performance
  (method-level marker — rest of the SE-correctness class still runs)
- test_se_accuracy.py::TestPerformanceRegression (class-level marker — all
  three parametrized timing cases move out of default CI together)

Left unchanged: test_methodology_honest_did.py::test_m0_short_circuit uses
wall-clock as a proxy for "short-circuit path taken" — that's a correctness
signal, not a performance signal. Marking it slow would remove the check.
Added a TODO.md entry to replace it with a mock/spy in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
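The marker pattern this commit applies looks like the following sketch (test bodies are hypothetical placeholders):

```python
import pytest

# Timing tests marked "slow" are excluded from the default run when the
# project config sets addopts = "-m 'not slow'"; they stay runnable
# on demand via `pytest -m slow` for local benchmarking.

@pytest.mark.slow            # method-level marker: only this test is excluded
def test_timing_performance():
    pass

@pytest.mark.slow            # class-level marker: every test in the class moves out together
class TestPerformanceRegression:
    def test_case(self):
        pass
```

Custom markers should also be registered (e.g. under `markers` in pytest.ini or pyproject.toml) to avoid unknown-marker warnings under `--strict-markers`.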
@igerber igerber merged commit cd67d60 into main Apr 19, 2026
23 of 25 checks passed
@igerber igerber deleted the fix/sdid-diagnostics-scale-parity branch April 19, 2026 11:56
igerber added a commit that referenced this pull request Apr 20, 2026
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per
project SemVer convention, minor bumps are reserved for new estimators or new
module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults
(PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner-
  ready output layer. Plain-English narrative summaries across all 16 result types,
  with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear
  regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the
  upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A
  contract), heterogeneity + within-group-varying PSU under Binder TSL, and
  PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review - docs/performance-scenarios.md documents 5-7
  realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339)
continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default
CI after false-positive flakes; perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml,
diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released:
2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the
comparison-link footer. CITATION.cff retains v3.1.3 versioned DOI in identifiers;
the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in
a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>