Surface row-count for axis-E silent coercion / drop sites #331
Overall Assessment: The estimator math / weighting / SE behavior touched here looks fine.
Three findings from the CI AI review on PR #331:

- P1 (newly identified): `ContinuousDiD.fit()` still silently rewrote `dose` to 0.0 for rows with `first_treat == 0` but nonzero dose, in the same preprocessing block the PR had otherwise cleaned up. Counting those rows and emitting a `UserWarning` with the row count before the zeroing closes the axis-E slice for this estimator. REGISTRY updated with a matching **Note:** and two regression tests added (silent on clean data, warns with correct row count on polluted data).
- P2 (warning wording): The axis-E excess-drop warning in `_compute_outcome_changes` hardcoded "check_parallel_trends", but the function is called from `check_parallel_trends_robust` and `equivalence_test_trends`, not from the simple `check_parallel_trends`. Add a `caller_label` kwarg and thread the correct name from each public entry point; regression tests cover both callers surfacing the warning with their own name.
- P2 (inf asymmetry): The inf-first-treat counter used `np.isinf`, which counts both +inf and -inf, but the actual recode on the next line (`.replace([np.inf, float("inf")], 0)`) only touches +inf. A -inf input would have triggered a warning about "treating as never-treated" even though no such recode happened. Switch to `np.isposinf` so the warning count matches the recode set; add a test confirming -inf does not fire the warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
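The `np.isinf` vs `np.isposinf` asymmetry is easy to demonstrate in isolation. A minimal sketch (not the library's code; the `first_treat` values are made up) showing why the old counter over-reports relative to a +inf-only recode:

```python
import numpy as np
import pandas as pd

first_treat = pd.Series([np.inf, -np.inf, 5.0, 0.0])

n_isinf = int(np.isinf(first_treat).sum())      # counts +inf AND -inf
n_posinf = int(np.isposinf(first_treat).sum())  # counts only +inf

# The recode itself only touches +inf, exactly as in the PR:
recoded = first_treat.replace([np.inf, float("inf")], 0)
n_recoded = int((first_treat != recoded).sum())  # rows actually changed

assert n_posinf == n_recoded  # warning count matches the recode set
assert n_isinf != n_recoded   # the old counter would over-report
```

The -inf row compares equal to itself after `.replace`, so it never enters `n_recoded`, which is precisely why warning on it was misleading.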
/ai-review
🔁 AI review rerun (requested by @igerber): the prior re-review findings on …
/ai-review
🔁 AI review rerun (requested by @igerber)
/ai-review
🔁 AI review rerun (requested by @igerber)

Overall Assessment: ✅ Looks good

- Code Quality: No findings.
- Performance: No findings.
- Maintainability: No findings.
- Tech Debt: No findings. I did not identify a new deferrable item in the changed paths that needs …
- Security: No findings.
- Documentation/Tests: Static-review note: test execution was not possible here because …
Three long-standing axis-E (silent data sanitization) sites now emit a `UserWarning` with the affected row count before the coercion, so users can no longer be silently shifted between the treated and control sides of an estimator, or have rows vanish from a diagnostic pass, without a signal.

- WooldridgeDiD: NaN cohort values were filled to 0 (never-treated) both in `_filter_sample` and in `fit()`. Both now warn with the NaN row count before the fillna (finding #24).
- ContinuousDiD: `first_treat=inf` was replaced with 0 silently. `fit()` now counts inf rows and warns before the replace, before any downstream drop-zero-dose / negative-dose validation (finding #26).
- `_compute_outcome_changes` (the `check_parallel_trends` diff helper) dropped NaN first-differences without reporting the count. It now distinguishes the expected first-period-per-unit drops from excess drops caused by gaps / NaN outcomes and warns with the breakdown when excess drops are detected (finding #27).

Finding #25 (TROP D-matrix coercion) was verified during scoping to already be resolved in `trop_local.py:60-66` via `n_missing_structural` + returned `missing_mask`; no code change required.

REGISTRY updated under both WooldridgeDiD and ContinuousDiD to document the new warning contract. Covered by audit axis E (data sanitization). Findings #24, #26, #27 from docs/audits/silent-failures-findings.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
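The warn-before-coerce contract described above can be sketched as follows. This is illustrative, not the library's implementation; the function name `normalize_first_treat` is hypothetical, and only the `first_treat` column name comes from the PR:

```python
import warnings
import numpy as np
import pandas as pd

def normalize_first_treat(df: pd.DataFrame) -> pd.DataFrame:
    """Count affected rows and warn BEFORE the inf -> 0 recode,
    so the coercion is never silent (hypothetical sketch)."""
    n_inf = int(np.isposinf(df["first_treat"]).sum())
    if n_inf:
        warnings.warn(
            f"{n_inf} row(s) have first_treat=inf; "
            "treating as never-treated (first_treat=0).",
            UserWarning,
            stacklevel=2,
        )
    out = df.copy()
    out["first_treat"] = out["first_treat"].replace([np.inf], 0)
    return out

df = pd.DataFrame(
    {"unit": [1, 1, 2, 2], "first_treat": [np.inf, np.inf, 2.0, 2.0]}
)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clean = normalize_first_treat(df)

user_msgs = [str(w.message) for w in caught
             if issubclass(w.category, UserWarning)]
assert any("2 row(s)" in m for m in user_msgs)  # row-level count
assert (clean["first_treat"] == [0, 0, 2, 2]).all()
```

The key design point is ordering: the count is taken from the untouched column, so the warning reports exactly the rows the subsequent `.replace` will change.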
- P2 (wooldridge): Extract shared `_warn_and_fill_nan_cohort(df, cohort, stacklevel)` helper used by both `_filter_sample` and `fit()`. Removes the copy-paste warning block that was flagged as a future drift risk.
- P2 (tests): Add `test_inf_first_treat_warning_counts_rows_not_units` on a 4-unit x 3-period panel. 2 units carry inf across all 3 periods (6 inf rows, 2 inf units): the warning must report 6, not 2, because `.replace(inf, 0)` is row-level.
- P3 (utils wording): The `_compute_outcome_changes` excess-drop warning said "gaps or NaN outcomes" but the code actually counts all NaN first-differences. Rephrased to "additional NaN first-differences (e.g. NaN outcomes or unit-period gaps upstream)" so the message doesn't over-claim what the helper can detect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI AI re-review flagged (P1) that the previous commit claimed "-inf will be rejected by downstream validators" in both the code comment and REGISTRY.md, but no such validator existed. After the `+inf -> 0` normalization, `first_treat < 0` units fell out of both the treated (g > 0) and never-treated (g == 0) masks, so the affected units were silently excluded from the estimator: exactly the axis-E silent failure the PR was closing.

- `ContinuousDiD.fit()` now validates `first_treat < 0` explicitly post-normalization and raises `ValueError` with the row count. -inf, -2, and any other negative value are all rejected.
- REGISTRY.md note rewritten to match the implemented behavior.
- Existing -inf test replaced with one that asserts `pytest.raises(ValueError)` matching the row-count message, plus a positive regression test confirming the +inf warning stays silent on panels with only valid 0/positive `first_treat` values.
- `tests/test_utils.py::test_silent_on_balanced_panel` tightened: the balanced-panel silence assertion now filters on any warning containing "dropped", so a regression that changed the warning label would no longer hide a genuine drop signal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
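The post-normalization check can be sketched as below. The function name `validate_first_treat` is hypothetical and the error wording is illustrative; only the `first_treat < 0` condition and the row-count contract come from the commit:

```python
import numpy as np
import pandas as pd

def validate_first_treat(first_treat: pd.Series) -> None:
    """After the +inf -> 0 normalization, any negative value
    (including -inf) satisfies neither g > 0 nor g == 0, so
    reject it loudly instead of dropping it silently (sketch)."""
    n_negative = int((first_treat < 0).sum())
    if n_negative:
        raise ValueError(
            f"{n_negative} row(s) have negative first_treat; these "
            "units belong to neither the treated nor the "
            "never-treated pool."
        )

validate_first_treat(pd.Series([0.0, 3.0, 0.0]))  # valid panel: no error

msg = ""
try:
    validate_first_treat(pd.Series([-np.inf, -2.0, 1.0]))
except ValueError as e:
    msg = str(e)
assert "2 row(s)" in msg  # -inf and -2 both counted
```

Note the check is placed after normalization on purpose: +inf has already been recoded to 0 (with its own warning), so anything still negative is genuinely unclassifiable rather than a value awaiting recode.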
…erage
Two new P1 findings from CI AI re-review:
1. ContinuousDiD.fit() still accepted NaN `first_treat` values. NaN
survives preprocessing but satisfies neither the treated (g > 0)
nor never-treated (g == 0) mask, so affected units were silently
excluded from both pools — same silent-failure shape as the
already-rejected `first_treat < 0`. Reject NaN explicitly with
ValueError + row count.
2. `StaggeredTripleDifference.fit()` silently recoded
`first_treat=np.inf -> 0` via `.replace([np.inf, float("inf")], 0)`.
Its sibling `staggered.py:1508-1519` already surfaces this with a
UserWarning; this PR mirrors that contract so the estimator no
longer shifts units between treated and never-treated pools
without signaling. REGISTRY gets a matching **Note:** under the
StaggeredTripleDifference section.
Regression tests:
- test_nan_first_treat_raises_with_row_count (ContinuousDiD) on a
4-unit x 3-period panel with 3 NaN rows.
- test_inf_first_treat_works (StaggeredTripleDifference) upgraded
from silent-success to `pytest.warns(UserWarning, match=row-count)`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
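The mask-exclusion shape described in finding 1 is directly observable in pandas, because NaN compares false against everything. Illustrative values only:

```python
import numpy as np
import pandas as pd

g = pd.Series([np.nan, 0.0, 2.0])  # NaN, never-treated, treated

treated = g > 0         # NaN -> False
never_treated = g == 0  # NaN -> False

# The NaN unit falls through both masks and would vanish silently:
in_neither = ~(treated | never_treated)
assert int(in_neither.sum()) == 1
```

This is why the fix raises on NaN rather than warning: unlike inf, there is no sensible pool to recode a NaN cohort into at this stage.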
d103eec to 65d1bd8
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per project SemVer convention, minor bumps are reserved for new estimators or new module-level public API; BusinessReport / DiagnosticReport / DiagnosticReportResults (PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview): practitioner-ready output layer. Plain-English narrative summaries across all 16 result types, with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation: kernel infrastructure, local linear regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion: cell-period IF allocator (Class A contract), heterogeneity + within-group-varying PSU under Binder TSL, and PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review: docs/performance-scenarios.md documents 5-7 realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339) continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default CI after false-positive flakes; the perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml, diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released: 2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the comparison-link footer. CITATION.cff retains the v3.1.3 versioned DOI in identifiers; the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- `wooldridge.py`: NaN cohort values were silently filled to 0 (never-treated) in both `_filter_sample` and `fit()`. Extract a shared `_warn_and_fill_nan_cohort(df, cohort, stacklevel)` helper so both entry paths emit one `UserWarning` reporting the affected row count before the `.fillna(0)`.
- `continuous_did.py`: `first_treat=inf` was silently coerced to 0 via `.replace([inf, -inf], 0)`. `fit()` now counts inf rows and emits a `UserWarning` with the row count before the coercion.
- `_compute_outcome_changes` (`utils.py`): `.dropna(subset=["_outcome_change"])` silently removed the first period per unit. The helper now distinguishes the expected first-period-per-unit drops from excess drops and emits a `UserWarning` with the breakdown when excess NaN first-differences are present.

Finding #25 (TROP D-matrix coercion) was verified during scoping to already be resolved at `trop_local.py:60-66`: it warns with `n_missing_structural` and returns `missing_mask`. No code change required.

Methodology references (required if estimator / math changes)
- No estimator / math changes; the only diagnostic touched is the `check_parallel_trends` diagnostic helper.
- The coercions themselves (NaN cohort -> 0, inf first_treat -> 0, first-period dropna) are unchanged.
- The new warning contracts are documented in docs/methodology/REGISTRY.md.

Validation
- `tests/test_wooldridge.py::TestCohortNaNWarning`: `fit()` + `_filter_sample` both warn with the NaN-cohort row count; clean-cohort run is silent.
- `tests/test_continuous_did.py::TestEdgeCases::test_inf_first_treat_normalization`: extended to assert the warning with the expected row count.
- `tests/test_continuous_did.py::TestEdgeCases::test_no_inf_first_treat_no_warning`: silent path coverage.
- `tests/test_continuous_did.py::TestEdgeCases::test_inf_first_treat_warning_counts_rows_not_units`: on a 4-unit x 3-period panel with 2 inf units (6 inf rows), the warning reports 6, locking in the row-level contract.
- `tests/test_utils.py::TestComputeOutcomeChanges::test_silent_on_balanced_panel`: clean balanced panel must not warn.
- `tests/test_utils.py::TestComputeOutcomeChanges::test_warns_on_nan_outcomes_with_excess_drop_count`: excess NaN-outcome drops surface via warning.
- `tests/test_wooldridge.py` + `tests/test_continuous_did.py` + `tests/test_utils.py` = 258 tests green.

Security / privacy
Audit context: resolves findings #24, #26, #27 from the axis-E (silent data sanitization) slice of the in-flight silent-failures audit. Finding #25 (TROP D-matrix) verified as already-resolved during scoping. Closes axis E.
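The expected-vs-excess drop breakdown that `_compute_outcome_changes` now reports can be sketched as follows. This is a hedged illustration of the arithmetic, not the helper's actual implementation; only the `_outcome_change` column name comes from the PR description:

```python
import numpy as np
import pandas as pd

# First-differencing always yields one NaN per unit (its first period),
# so only drops BEYOND n_units signal gaps or NaN outcomes.
df = pd.DataFrame({
    "unit":   [1, 1, 1, 2, 2, 2],
    "period": [1, 2, 3, 1, 2, 3],
    "y":      [1.0, 2.0, np.nan, 1.0, 1.5, 2.0],  # one NaN outcome
})
df = df.sort_values(["unit", "period"])
df["_outcome_change"] = df.groupby("unit")["y"].diff()

n_nan = int(df["_outcome_change"].isna().sum())
n_expected = df["unit"].nunique()  # one first-period NaN per unit
n_excess = n_nan - n_expected      # anything beyond that is suspicious

assert n_nan == 3       # 2 first-period NaNs + 1 from the NaN outcome
assert n_excess == 1    # this is the count the new warning surfaces
```

On a clean balanced panel `n_excess` is 0, which is exactly the condition the `test_silent_on_balanced_panel` test locks in.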