Skip to content

Close axis-C/J silent-failures audit: B-spline derivative + PA survey cache#339

Merged
igerber merged 4 commits intomainfrom
fix/axis-cj-closeouts
Apr 19, 2026
Merged

Close axis-C/J silent-failures audit: B-spline derivative + PA survey cache#339
igerber merged 4 commits intomainfrom
fix/axis-cj-closeouts

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented Apr 19, 2026

Summary

Bundles the two remaining S-complexity findings from the Phase 2 audit — closes Phase 3 execution.

  • Address code review feedback for rank_control_units #12 ContinuousDiD B-spline degenerate knot (axis C, Minor, continuous_did_bspline.py:153): the per-basis derivative loop swallowed `ValueError` from `scipy.interpolate.BSpline`, leaving affected columns of the derivative design matrix as zero with no signal. ContinuousDiD analytical inference then fed a biased `dPsi` into SE computation. Fix aggregates failed-basis indices and emits one `UserWarning` naming them. The all-identical-knot degenerate case (single dose value) remains silently handled — zero derivatives are mathematically correct there.
  • Prepare 0.6.0 release #28 PowerAnalysis survey-design cache staleness (axis J, Major, `power.py:171-180`): `_build_survey_design()` cached its return value on first call and never invalidated. Mutating `config.survey_design` after `init` silently returned the stale design. Cache dropped entirely — default construction is microseconds and user-provided designs are reference copies, so the cache never earned its complexity.

Audit state post-merge: all 28 actionable Phase-2 findings resolved. Three P1 follow-ups remain logged in `TODO.md` from PR #337 (Rust-backend algorithmic divergences: FW/PGD mismatch in `compute_synthetic_weights`, TROP grid-search on rank-deficient Y, TROP bootstrap RNG unification) — those are post-audit cleanup work.

Methodology references

  • Method name(s): N/A — no methodology changes (silent-failure surface fixes only).
  • Paper / source link(s): N/A.
  • Any intentional deviations from the source (and why): None.

Validation

  • Tests added/updated:
    • `tests/test_continuous_did.py::TestBSplineDerivativeDegenerateBasis` — 3 new tests: single-dose silent contract, forced-`ValueError` aggregate warning, clean-knot no-warning regression.
    • `tests/test_power.py::TestSurveyPowerConfigDesignStaleness` — 3 new tests: mutation picks up new design, clear-to-None falls back to default, repeat-calls-equivalent regression.
  • `pytest tests/test_continuous_did.py tests/test_power.py -m 'not slow'` → 259 passed, 20 deselected locally.
  • REGISTRY: new Notes under §ContinuousDiD (B-spline derivative behavior) and §PowerAnalysis (SurveyPowerConfig cache semantics).

Security / privacy

Confirm no secrets/PII in this PR: Yes.


Generated with Claude Code

… cache

Bundles the two remaining S-complexity findings from the Phase 2 audit,
closing Phase 3 execution.

Finding #12 — ContinuousDiD B-spline degenerate knot (axis C, Minor,
`continuous_did_bspline.py:153`): `bspline_derivative_design_matrix`
silently swallowed `ValueError` from `scipy.interpolate.BSpline` in the
per-basis derivative loop, leaving affected columns of the derivative
design matrix as zero with no user-visible signal. Downstream
ContinuousDiD analytical inference then fed a biased `dPsi` into SE
computation. Fix aggregates failed-basis indices and emits ONE
`UserWarning` naming them. The all-identical-knot degenerate case
(single dose value, `knots[0] == knots[-1]`) remains silently handled —
derivatives there are mathematically zero, well-defined, and always
have been.

Finding #28 — PowerAnalysis survey-design cache staleness (axis J,
Major, `power.py:171-180`): `_build_survey_design()` populated
`self._cached_survey_design` on first call and never invalidated.
Mutating `config.survey_design` after `__init__` silently returned the
stale cached design. Default construction is microseconds and
user-provided designs are reference copies, so the cache never earned
its cost. Fix drops the cache entirely; method now reflects live
`self.survey_design` every call.

Six new tests:
- `tests/test_continuous_did.py::TestBSplineDerivativeDegenerateBasis` (3):
  single-dose silent contract, `ValueError`-forced aggregate warning,
  happy-path no-warning regression.
- `tests/test_power.py::TestSurveyPowerConfigDesignStaleness` (3):
  mutate-survey_design-picks-up-new, clearing-falls-back-to-default,
  repeat-calls-equivalent regression.

REGISTRY notes added under §ContinuousDiD (edge cases) and §PowerAnalysis
(`survey_config` section).

Audit state post-PR: all 28 actionable Phase-2 findings resolved (26 in
prior PRs; #12 + #28 here). Three P1 follow-ups remain logged in
`TODO.md` from PR #337's discovered divergences (FW/PGD algorithmic
mismatch in `compute_synthetic_weights`, TROP grid-search on rank-
deficient Y, TROP bootstrap RNG unification). Those are post-audit
cleanup work, not Phase-3 scope.

No behavioral changes on clean inputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Overall Assessment

✅ Looks good

Executive Summary

  • Affected methodology surface: ContinuousDiD’s B-spline derivative/ACRT path and the survey-power SurveyPowerConfig helper.
  • The diff does not change estimator formulas, weighting rules, variance formulas, or defaults; it converts two previously silent behaviors into surfaced, documented behavior.
  • Both touched behaviors are documented in the Methodology Registry at docs/methodology/REGISTRY.md:L726 and docs/methodology/REGISTRY.md:L2586-L2587.
  • No unmitigated P0/P1 findings in the diff.
  • One P3 informational note: the new ContinuousDiD warning/note understates that the zeroed derivative matrix affects ACRT point estimates, not just inference.
  • Test execution was not possible in this environment because pytest is not installed; the test review below is source-based.

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The unrelated Rust/TROP follow-ups referenced in the PR body are already tracked in TODO.md:L86-L88.

Security

  • No findings.

Documentation/Tests

CI reviewer correctly noted that dPsi feeds both ACRT point estimates
(continuous_did.py:1026-1046, bootstrap path at 1524-1561) and the
analytical/bootstrap inference, not just inference alone. Updated the
warning text and the REGISTRY note to say both outputs are biased when
derivative construction fails.

No behavior change; text-only accuracy fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 19, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 2250c67933c1e2316c22cab366d6353ccd50afe4


Overall Assessment

✅ Looks good

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: The new note at docs/methodology/REGISTRY.md#L2586 says mutating config.survey_design “or any other config field” used to return a stale cached design. _build_survey_design() only depends on self.survey_design, so fields such as n_strata, icc, or weight_variation never changed the returned SurveyDesign. The note overstates the bug surface and could mislead future triage. Concrete fix: Narrow the wording to reassignment/clearing of config.survey_design, or toggling between None and a user-supplied SurveyDesign.
  • No test-design findings. The new regressions at tests/test_continuous_did.py#L116 and tests/test_power.py#L2842 cover the intended silent-failure paths, but I could not execute them in this environment because pytest is unavailable.

Reviewer noted the previous registry note said "any other config field"
could trigger the stale cache, but _build_survey_design() only depends
on self.survey_design. Narrowed the note to call out survey_design
reassignment specifically (replace user-supplied design or toggle
between None and a user-supplied design). Also explicitly states the
other config fields never influenced the returned design, to prevent
misleading future triage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 19, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 1e3d001fb71e38cc8b072fc32d2b2a2a89e06e5b


Overall Assessment

✅ Looks good

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings. Removing the tiny cache in diff_diff/power.py:L169-L187 adds only cheap SurveyDesign(...) construction on the default path and does not look material relative to per-simulation DGP generation and estimator fitting.

Maintainability

Tech Debt

  • No findings. The follow-up Rust/TROP items referenced in the PR body are already tracked in TODO.md:L86-L88 and do not block this PR.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: tests/test_continuous_did.py:L138-L189 checks that the aggregate warning fires and that failed basis columns stay zero, but it never checks that the non-failed derivative columns still match the normal computation. A future regression that zeroes or corrupts the entire derivative matrix after any ValueError could still pass this test. Concrete fix: compute a baseline derivative matrix for the same inputs without the patching hook and assert equality on the non-failed columns, or at minimum assert that at least one unaffected non-intercept column remains non-zero.
  • No other findings. The added regressions in tests/test_continuous_did.py:L116-L205 and tests/test_power.py:L2842-L2895 are otherwise well targeted to the silent-failure surfaces this PR closes.

Reviewer noted the partial-failure test only asserted failed columns
are zero and the warning fires, but didn't verify non-failed columns
remain correctly populated. A regression zeroing the entire derivative
matrix on any ValueError would have passed. Added a baseline comparison
(un-patched bspline_derivative_design_matrix) and asserted equality on
unaffected columns, plus a sanity check that at least one unaffected
non-intercept column is non-zero.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 19, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 0cf60dddc8cbcaff0b98790c046f1397eec9d5e8


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the changed surface. On inspection, the prior P3 test-gap from the last AI review appears addressed.

Executive Summary

Methodology

No findings. Cross-checking the ContinuousDiD methodology reference at docs/methodology/continuous-did.md:1, the new registry note at docs/methodology/REGISTRY.md:726, the helper change at diff_diff/continuous_did_bspline.py:143, and the ACRT/bootstrapping call sites at diff_diff/continuous_did.py:1026 and diff_diff/continuous_did.py:1524 shows no change to estimands, identification assumptions, weighting, or variance/SE formulas. The power-side change in diff_diff/power.py:169 is cache semantics only, not a methodological change.

Code Quality

No findings.

Performance

No findings. Dropping _cached_survey_design in diff_diff/power.py:169 only adds trivial default SurveyDesign construction on the survey_design is None path, which is not material relative to simulation and estimator fit costs.

Maintainability

No findings. The implementation, tests, and registry notes are aligned across diff_diff/continuous_did_bspline.py:168, diff_diff/power.py:169, docs/methodology/REGISTRY.md:726, and docs/methodology/REGISTRY.md:2586.

Tech Debt

No findings. The Rust/TROP follow-ups referenced in the PR body are already tracked in TODO.md:86, TODO.md:87, and TODO.md:88, so they are non-blocking under the stated rubric.

Security

No findings.

Documentation/Tests

No findings. The previous re-review’s P3 test concern is addressed by the new unaffected-column baseline assertions in tests/test_continuous_did.py:191 and tests/test_continuous_did.py:206, and the survey-design staleness coverage now exercises both replacement and reset-to-default cases in tests/test_power.py:2843 and tests/test_power.py:2866. Residual limitation: I could not execute the tests in this sandbox because pytest is unavailable.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 19, 2026
@igerber igerber merged commit cbb8814 into main Apr 19, 2026
23 of 24 checks passed
@igerber igerber deleted the fix/axis-cj-closeouts branch April 19, 2026 23:37
igerber added a commit that referenced this pull request Apr 20, 2026
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per
project SemVer convention, minor bumps are reserved for new estimators or new
module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults
(PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner-
  ready output layer. Plain-English narrative summaries across all 16 result types,
  with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear
  regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the
  upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A
  contract), heterogeneity + within-group-varying PSU under Binder TSL, and
  PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review - docs/performance-scenarios.md documents 5-7
  realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339)
continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default
CI after false-positive flakes; perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml,
diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released:
2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the
comparison-link footer. CITATION.cff retains v3.1.3 versioned DOI in identifiers;
the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in
a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant