Close axis-C/J silent-failures audit: B-spline derivative + PA survey cache by igerber · Pull Request #339 · igerber/diff-diff

igerber · 2026-04-19T21:40:05Z

Summary

Bundles the two remaining S-complexity findings from the Phase 2 audit — closes Phase 3 execution.

Address code review feedback for rank_control_units #12 ContinuousDiD B-spline degenerate knot (axis C, Minor, continuous_did_bspline.py:153): the per-basis derivative loop swallowed `ValueError` from `scipy.interpolate.BSpline`, leaving affected columns of the derivative design matrix as zero with no signal. ContinuousDiD analytical inference then fed a biased `dPsi` into SE computation. Fix aggregates failed-basis indices and emits one `UserWarning` naming them. The all-identical-knot degenerate case (single dose value) remains silently handled — zero derivatives are mathematically correct there.
Prepare 0.6.0 release #28 PowerAnalysis survey-design cache staleness (axis J, Major, `power.py:171-180`): `_build_survey_design()` cached its return value on first call and never invalidated. Mutating `config.survey_design` after `init` silently returned the stale design. Cache dropped entirely — default construction is microseconds and user-provided designs are reference copies, so the cache never earned its complexity.

Audit state post-merge: all 28 actionable Phase-2 findings resolved. Three P1 follow-ups remain logged in `TODO.md` from PR #337 (Rust-backend algorithmic divergences: FW/PGD mismatch in `compute_synthetic_weights`, TROP grid-search on rank-deficient Y, TROP bootstrap RNG unification) — those are post-audit cleanup work.

Methodology references

Method name(s): N/A — no methodology changes (silent-failure surface fixes only).
Paper / source link(s): N/A.
Any intentional deviations from the source (and why): None.

Validation

Tests added/updated:
- `tests/test_continuous_did.py::TestBSplineDerivativeDegenerateBasis` — 3 new tests: single-dose silent contract, forced-`ValueError` aggregate warning, clean-knot no-warning regression.
- `tests/test_power.py::TestSurveyPowerConfigDesignStaleness` — 3 new tests: mutation picks up new design, clear-to-None falls back to default, repeat-calls-equivalent regression.
`pytest tests/test_continuous_did.py tests/test_power.py -m 'not slow'` → 259 passed, 20 deselected locally.
REGISTRY: new Notes under §ContinuousDiD (B-spline derivative behavior) and §PowerAnalysis (SurveyPowerConfig cache semantics).

Security / privacy

Confirm no secrets/PII in this PR: Yes.

Generated with Claude Code

… cache Bundles the two remaining S-complexity findings from the Phase 2 audit, closing Phase 3 execution. Finding #12 — ContinuousDiD B-spline degenerate knot (axis C, Minor, `continuous_did_bspline.py:153`): `bspline_derivative_design_matrix` silently swallowed `ValueError` from `scipy.interpolate.BSpline` in the per-basis derivative loop, leaving affected columns of the derivative design matrix as zero with no user-visible signal. Downstream ContinuousDiD analytical inference then fed a biased `dPsi` into SE computation. Fix aggregates failed-basis indices and emits ONE `UserWarning` naming them. The all-identical-knot degenerate case (single dose value, `knots[0] == knots[-1]`) remains silently handled — derivatives there are mathematically zero, well-defined, and always have been. Finding #28 — PowerAnalysis survey-design cache staleness (axis J, Major, `power.py:171-180`): `_build_survey_design()` populated `self._cached_survey_design` on first call and never invalidated. Mutating `config.survey_design` after `__init__` silently returned the stale cached design. Default construction is microseconds and user-provided designs are reference copies, so the cache never earned its cost. Fix drops the cache entirely; method now reflects live `self.survey_design` every call. Six new tests: - `tests/test_continuous_did.py::TestBSplineDerivativeDegenerateBasis` (3): single-dose silent contract, `ValueError`-forced aggregate warning, happy-path no-warning regression. - `tests/test_power.py::TestSurveyPowerConfigDesignStaleness` (3): mutate-survey_design-picks-up-new, clearing-falls-back-to-default, repeat-calls-equivalent regression. REGISTRY notes added under §ContinuousDiD (edge cases) and §PowerAnalysis (`survey_config` section). Audit state post-PR: all 28 actionable Phase-2 findings resolved (26 in prior PRs; #12 + #28 here). Three P1 follow-ups remain logged in `TODO.md` from PR #337's discovered divergences (FW/PGD algorithmic mismatch in `compute_synthetic_weights`, TROP grid-search on rank- deficient Y, TROP bootstrap RNG unification). Those are post-audit cleanup work, not Phase-3 scope. No behavioral changes on clean inputs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-19T21:46:34Z

Overall Assessment

✅ Looks good

Executive Summary

Affected methodology surface: ContinuousDiD’s B-spline derivative/ACRT path and the survey-power SurveyPowerConfig helper.
The diff does not change estimator formulas, weighting rules, variance formulas, or defaults; it converts two previously silent behaviors into surfaced, documented behavior.
Both touched behaviors are documented in the Methodology Registry at docs/methodology/REGISTRY.md:L726 and docs/methodology/REGISTRY.md:L2586-L2587.
No unmitigated P0/P1 findings in the diff.
One P3 informational note: the new ContinuousDiD warning/note understates that the zeroed derivative matrix affects ACRT point estimates, not just inference.
Test execution was not possible in this environment because pytest is not installed; the test review below is source-based.

Methodology

Severity: P3
Impact: The new warning in diff_diff/continuous_did_bspline.py:L168-L180 and the matching registry note in docs/methodology/REGISTRY.md:L726 say ContinuousDiD inference may be biased when derivative construction fails. But the same dPsi matrix is also used for the ACRT point estimates and bootstrap ACRT surfaces at diff_diff/continuous_did.py:L1026-L1046 and diff_diff/continuous_did.py:L1524-L1561. The behavior is documented, so this is informational rather than blocking, but the wording understates the scope of affected outputs.
Concrete fix: Broaden the warning/registry wording to say that derivative-based ACRT estimates and inference may be biased, or explicitly set derivative-based outputs to NaN when derivative construction fails.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The unrelated Rust/TROP follow-ups referenced in the PR body are already tracked in TODO.md:L86-L88.

Security

No findings.

Documentation/Tests

No findings.
The new documentation coverage is present at docs/methodology/REGISTRY.md:L726 and docs/methodology/REGISTRY.md:L2586-L2587.
The added tests are sensible helper-level regressions at tests/test_continuous_did.py:L116-L205 and tests/test_power.py:L2842-L2895, but I could not execute them here because python -m pytest fails with No module named pytest.

CI reviewer correctly noted that dPsi feeds both ACRT point estimates (continuous_did.py:1026-1046, bootstrap path at 1524-1561) and the analytical/bootstrap inference, not just inference alone. Updated the warning text and the REGISTRY note to say both outputs are biased when derivative construction fails. No behavior change; text-only accuracy fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T21:47:56Z

/ai-review

github-actions · 2026-04-19T21:56:15Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 2250c67933c1e2316c22cab366d6353ccd50afe4

Overall Assessment

✅ Looks good

Executive Summary

Re-review result: no unmitigated P0/P1 findings in the changed surface.
The earlier informational concern is resolved: the ContinuousDiD warning and registry note now explicitly state that derivative-construction failures can bias both ACRT point estimates and analytical/bootstrap inference, matching downstream dPsi use in diff_diff/continuous_did.py#L1028, diff_diff/continuous_did.py#L1225, diff_diff/continuous_did.py#L1525, and docs/methodology/REGISTRY.md#L726.
The ContinuousDiD change is methodology-adjacent but not a formula, weighting, variance, identification, or default-behavior change; it converts a silent helper failure into a surfaced, documented warning.
The survey-power change is a narrow cache removal in diff_diff/power.py#L169; all visible survey-fit builders still obtain survey_design through that helper at diff_diff/power.py#L407, diff_diff/power.py#L424, diff_diff/power.py#L443, and diff_diff/power.py#L460.
One P3 documentation nit remains: the new power registry note slightly overstates the stale-cache surface at docs/methodology/REGISTRY.md#L2586.
Source-based review only for execution: python -m pytest tests/test_continuous_did.py tests/test_power.py -m 'not slow' could not run here because pytest is not installed.

Methodology

No findings. Cross-checking docs/methodology/continuous-did.md#L206, diff_diff/continuous_did_bspline.py#L168, diff_diff/continuous_did.py#L1028, diff_diff/continuous_did.py#L1225, and diff_diff/continuous_did.py#L1525 shows the PR does not change ATT/ACRT estimands, identification assumptions, weighting, or SE formulas; it only surfaces a previously silent derivative-construction failure. That is consistent with the source material’s derivative-based ACRT setup. (nber.org)

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: The new note at docs/methodology/REGISTRY.md#L2586 says mutating config.survey_design “or any other config field” used to return a stale cached design. _build_survey_design() only depends on self.survey_design, so fields such as n_strata, icc, or weight_variation never changed the returned SurveyDesign. The note overstates the bug surface and could mislead future triage. Concrete fix: Narrow the wording to reassignment/clearing of config.survey_design, or toggling between None and a user-supplied SurveyDesign.
No test-design findings. The new regressions at tests/test_continuous_did.py#L116 and tests/test_power.py#L2842 cover the intended silent-failure paths, but I could not execute them in this environment because pytest is unavailable.

Reviewer noted the previous registry note said "any other config field" could trigger the stale cache, but _build_survey_design() only depends on self.survey_design. Narrowed the note to call out survey_design reassignment specifically (replace user-supplied design or toggle between None and a user-supplied design). Also explicitly states the other config fields never influenced the returned design, to prevent misleading future triage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T22:01:30Z

/ai-review

github-actions · 2026-04-19T22:10:23Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 1e3d001fb71e38cc8b072fc32d2b2a2a89e06e5b

Overall Assessment

✅ Looks good

Executive Summary

Re-review result: the prior documentation nit is resolved, and I found no unmitigated P0/P1 issues in the changed surface.
Methodology-adjacent impact is confined to ContinuousDiD’s continuous-treatment B-spline derivative path. Cross-checking docs/methodology/continuous-did.md:L206-L223, docs/methodology/REGISTRY.md:L718-L726, diff_diff/continuous_did_bspline.py:L143-L181, diff_diff/continuous_did.py:L1026-L1046, and diff_diff/continuous_did.py:L1524-L1563 shows no change to estimands, identification, weighting, or SE formulas.
The documented all-same-dose contract is preserved: the helper still handles the degenerate single-dose case silently, consistent with docs/methodology/REGISTRY.md:L718-L726 and the higher-level warning path in diff_diff/continuous_did.py:L974-L980.
The survey-power fix in diff_diff/power.py:L169-L187 removes stale SurveyDesign caching without changing power methodology, and the visible survey fit builders still use the live helper at diff_diff/power.py:L403-L460.
One new P3 test gap remains: the partial-failure derivative regression in tests/test_continuous_did.py:L138-L189 does not verify that unaffected derivative columns remain correctly populated.
I could not execute the target tests here because this sandbox is missing importable numpy / pytest.

Methodology

No findings. Affected method: ContinuousDiD continuous-treatment B-spline sieve ACRT derivative evaluation. The PR converts a silent helper failure into a documented warning while leaving the source-aligned formulas unchanged in docs/methodology/continuous-did.md:L206-L223, diff_diff/continuous_did.py:L1026-L1046, and diff_diff/continuous_did.py:L1524-L1563. The survey-power cache removal in diff_diff/power.py:L169-L187 is not a methodology change.

Code Quality

No findings.

Performance

No findings. Removing the tiny cache in diff_diff/power.py:L169-L187 adds only cheap SurveyDesign(...) construction on the default path and does not look material relative to per-simulation DGP generation and estimator fitting.

Maintainability

No findings. The code comments, docstrings, and registry notes are now aligned across diff_diff/continuous_did_bspline.py:L143-L181, diff_diff/power.py:L169-L187, docs/methodology/REGISTRY.md:L726, and docs/methodology/REGISTRY.md:L2586.

Tech Debt

No findings. The follow-up Rust/TROP items referenced in the PR body are already tracked in TODO.md:L86-L88 and do not block this PR.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: tests/test_continuous_did.py:L138-L189 checks that the aggregate warning fires and that failed basis columns stay zero, but it never checks that the non-failed derivative columns still match the normal computation. A future regression that zeroes or corrupts the entire derivative matrix after any ValueError could still pass this test. Concrete fix: compute a baseline derivative matrix for the same inputs without the patching hook and assert equality on the non-failed columns, or at minimum assert that at least one unaffected non-intercept column remains non-zero.
No other findings. The added regressions in tests/test_continuous_did.py:L116-L205 and tests/test_power.py:L2842-L2895 are otherwise well targeted to the silent-failure surfaces this PR closes.

Reviewer noted the partial-failure test only asserted failed columns are zero and the warning fires, but didn't verify non-failed columns remain correctly populated. A regression zeroing the entire derivative matrix on any ValueError would have passed. Added a baseline comparison (un-patched bspline_derivative_design_matrix) and asserted equality on unaffected columns, plus a sanity check that at least one unaffected non-intercept column is non-zero. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T22:11:53Z

/ai-review

github-actions · 2026-04-19T22:20:11Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 0cf60dddc8cbcaff0b98790c046f1397eec9d5e8

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the changed surface. On inspection, the prior P3 test-gap from the last AI review appears addressed.

Executive Summary

The prior re-review concern on ContinuousDiD test coverage appears fixed: the new regression now checks that unaffected derivative columns match the unpatched baseline in tests/test_continuous_did.py:191 and tests/test_continuous_did.py:206.
Affected methods are ContinuousDiD’s continuous-treatment B-spline ACRT derivative path and the survey-aware SurveyPowerConfig plumbing used by the simulation power helpers.
Methodology cross-check: the ContinuousDiD formulas and use of dPsi in both point estimates and bootstrap are unchanged in diff_diff/continuous_did.py:1026 and diff_diff/continuous_did.py:1524; the new warning behavior is explicitly documented in docs/methodology/REGISTRY.md:726, so this is not an undocumented methodology defect.
The survey-power fix removes stale SurveyDesign caching in diff_diff/power.py:169 and matches the new registry note at docs/methodology/REGISTRY.md:2586 without changing power formulas or identification assumptions.
I could not run the targeted pytest subset here because pytest is not installed in this environment.

Methodology

No findings. Cross-checking the ContinuousDiD methodology reference at docs/methodology/continuous-did.md:1, the new registry note at docs/methodology/REGISTRY.md:726, the helper change at diff_diff/continuous_did_bspline.py:143, and the ACRT/bootstrapping call sites at diff_diff/continuous_did.py:1026 and diff_diff/continuous_did.py:1524 shows no change to estimands, identification assumptions, weighting, or variance/SE formulas. The power-side change in diff_diff/power.py:169 is cache semantics only, not a methodological change.

Code Quality

No findings.

Performance

No findings. Dropping _cached_survey_design in diff_diff/power.py:169 only adds trivial default SurveyDesign construction on the survey_design is None path, which is not material relative to simulation and estimator fit costs.

Maintainability

No findings. The implementation, tests, and registry notes are aligned across diff_diff/continuous_did_bspline.py:168, diff_diff/power.py:169, docs/methodology/REGISTRY.md:726, and docs/methodology/REGISTRY.md:2586.

Tech Debt

No findings. The Rust/TROP follow-ups referenced in the PR body are already tracked in TODO.md:86, TODO.md:87, and TODO.md:88, so they are non-blocking under the stated rubric.

Security

No findings.

Documentation/Tests

No findings. The previous re-review’s P3 test concern is addressed by the new unaffected-column baseline assertions in tests/test_continuous_did.py:191 and tests/test_continuous_did.py:206, and the survey-design staleness coverage now exercises both replacement and reset-to-default cases in tests/test_power.py:2843 and tests/test_power.py:2866. Residual limitation: I could not execute the tests in this sandbox because pytest is unavailable.

…-failures audit Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per project SemVer convention, minor bumps are reserved for new estimators or new module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults (PR #318) add a new public API surface and drive this bump. Headline work: - PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner- ready output layer. Plain-English narrative summaries across all 16 result types, with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md. - PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the upcoming HeterogeneousAdoptionDiD estimator. - PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A contract), heterogeneity + within-group-varying PSU under Binder TSL, and PSU-level Hall-Mammen wild bootstrap at cell granularity. - PR #333 performance review - docs/performance-scenarios.md documents 5-7 realistic practitioner workflows; benchmark harness extended. Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339) continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J. CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default CI after false-positive flakes; perf-review harness is the principled replacement. Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml, diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released: 2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the comparison-link footer. CITATION.cff retains v3.1.3 versioned DOI in identifiers; the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber added the ready-for-ci Triggers CI test workflows label Apr 19, 2026

igerber merged commit cbb8814 into main Apr 19, 2026
23 of 24 checks passed

igerber deleted the fix/axis-cj-closeouts branch April 19, 2026 23:37

igerber mentioned this pull request Apr 20, 2026

Release 3.2.0: BusinessReport preview, dCDH survey completion, silent-failures audit #342

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Close axis-C/J silent-failures audit: B-spline derivative + PA survey cache#339

Close axis-C/J silent-failures audit: B-spline derivative + PA survey cache#339
igerber merged 4 commits intomainfrom
fix/axis-cj-closeouts

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented Apr 19, 2026

Summary

Methodology references

Validation

Security / privacy

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant