Fix HonestDiD identified set and inference (methodology review)#248
Conversation
Paper review of Rambachan & Roth (2023) revealed 6 structural issues in the HonestDiD implementation. This commit fixes the core identified set computation and adds paper-aligned inference procedures:

- F1: DeltaRM now constrains first differences (not levels), using union-of-polyhedra decomposition per Lemma 2.2
- F2: LP now pins delta_pre = beta_pre via equality constraints, matching the paper's Equations 5-6
- F3: DeltaSD constraint matrix accounts for delta_0 = 0 at the pre-post boundary (T+Tbar-1 rows, not T+Tbar-2)
- F4: Optimal FLCI for DeltaSD (Section 4.1) replaces naive bound extension; ARP hybrid framework added for DeltaRM (Section 3.2)
- F5-F6: REGISTRY equation corrections documented in paper review

New functions: _cv_alpha, _compute_worst_case_bias, _compute_optimal_flci, _compute_pre_first_differences, _construct_constraints_rm_component, _solve_rm_bounds_union, _setup_moment_inequalities, _enumerate_vertices, _compute_arp_test, _arp_confidence_set

Paper review: docs/methodology/papers/rambachan-roth-2023-review.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
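The F2 fix (pinning delta_pre = beta_pre while optimizing over Delta^SD) can be sketched as a pair of linear programs. This is an illustrative reconstruction, not the repo's code: `identified_set_bounds`, its argument names, and the simplified constraint matrix (which ignores the F3 delta_0 = 0 boundary handling) are all assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def identified_set_bounds(beta, l_vec, n_pre, M):
    """Bound theta = l'(beta_post - delta_post) over Delta^SD(M),
    with delta_pre pinned to beta_pre (hypothetical sketch of F2)."""
    T = len(beta)
    n_post = T - n_pre
    # Objective: l'delta_post (delta_pre is fixed by the equalities below).
    c = np.concatenate([np.zeros(n_pre), np.asarray(l_vec, float)])
    # Simplified Delta^SD(M): |second differences of delta| <= M.
    # (The real F3 fix also inserts the delta_0 = 0 boundary period.)
    D2 = np.zeros((T - 2, T))
    for i in range(T - 2):
        D2[i, i:i + 3] = [1.0, -2.0, 1.0]
    A_ub = np.vstack([D2, -D2])
    b_ub = np.full(2 * (T - 2), float(M))
    # F2: equality constraints pin delta_pre = beta_pre.
    A_eq = np.hstack([np.eye(n_pre), np.zeros((n_pre, n_post))])
    b_eq = np.asarray(beta[:n_pre], float)
    free = [(None, None)] * T          # linprog defaults to x >= 0 otherwise
    lo = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=free)
    hi = linprog(-c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=free)
    theta_point = float(np.asarray(l_vec) @ np.asarray(beta[n_pre:]))
    # lo.fun = min l'delta_post, hi.fun = -max l'delta_post
    return theta_point + hi.fun, theta_point - lo.fun
```

With zero pre-trend coefficients the set collapses to a point at M=0 and widens symmetrically as M grows, matching the sensitivity-analysis interpretation.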
…hort-circuit

Performance improvements to optimal FLCI computation without changing the underlying Nelder-Mead algorithm:

- cv_alpha: Newton's method (5 iterations) replaces bisection (100 iterations)
- Centrosymmetric bias: 1 LP solve instead of 2 for Delta^SD
- M=0 short-circuit: skip optimization when identified set is a point
- Looser tolerances: fatol=1e-6, xatol=1e-5 (sufficient precision)
- Warm-start support: v_pre_init parameter for sensitivity grids

Benchmark: 9-value sensitivity grid now runs in 0.1s (was ~9 minutes). Unit test suite: 2 minutes (was 8m49s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
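The bisection-to-Newton swap can be sketched as follows. The critical value c solves P(|N(b, 1)| <= c) = 1 - alpha (a folded normal), and the derivative of the folded-normal CDF is available in closed form, so a handful of Newton steps suffice. `cv_alpha` here is a hypothetical standalone version; the repo's `_cv_alpha` signature may differ.

```python
import numpy as np
from scipy.stats import norm

def cv_alpha(b, alpha=0.05, iters=5):
    """Critical value c with P(|N(b, 1)| <= c) = 1 - alpha via Newton
    (sketch; not the repo's exact implementation)."""
    b = abs(b)
    c = b + norm.ppf(1 - alpha / 2)  # upper bound on the root: start here
    for _ in range(iters):
        f = norm.cdf(c - b) - norm.cdf(-c - b) - (1 - alpha)
        fprime = norm.pdf(c - b) + norm.pdf(c + b)  # folded-normal density
        c -= f / fprime
    return c
```

At b=0 this reduces to the usual two-sided normal critical value; as the worst-case bias b grows, c grows with it.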
…nstraints

- DeltaSD: second differences (was first differences), all periods (was pre-only)
- DeltaRM: first differences (was absolute levels)
- Identified set: document delta_pre = beta_pre pinning constraint
- Inference: document optimal FLCI for SD, ARP hybrid for RM
- Update requirements checklist to reflect implemented features

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ARP hybrid confidence sets for Delta^RM attempted a 200-point grid search × 5000 simulations per fit() call, causing a 37-minute test runtime. The moment inequality transformation needs further calibration before it produces valid CIs consistently. Disable ARP for now and use the conservative naive FLCI for RM CIs; the ARP infrastructure is retained for future enablement.

Full test suite: 63/63 pass in 0.28s (was 37 minutes).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests verify the corrected implementation against Rambachan & Roth (2023):

- TestDeltaSDConstraintMatrix: delta_0=0 boundary, bridge constraint, row count (T+Tbar-1), hand-computed 2+2 case
- TestIdentifiedSetLP: delta_pre=beta_pre pinning, M=0 linear extrapolation, three-period analytical case from Section 2.3
- TestDeltaRMFirstDifferences: first-difference constraints (not levels), boundary term, Mbar=0 point identification, monotonicity
- TestOptimalFLCI: cv_alpha folded normal, optimal narrower than naive, M=0 short-circuit instant
- TestBreakdownValueMethodology: significance monotonicity in M

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
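The T+Tbar-1 row count asserted in TestDeltaSDConstraintMatrix follows from inserting the normalized delta_0 = 0 period between pre and post before taking second differences. A minimal sketch, where the function name and construction are illustrative rather than the repo's:

```python
import numpy as np

def sd_second_diff_matrix(n_pre, n_post):
    """Second-difference matrix for Delta^SD over the full path
    (delta_pre, delta_0 = 0, delta_post), with the delta_0 column
    dropped since that coefficient is identically zero."""
    full = n_pre + 1 + n_post            # pre periods, delta_0, post periods
    D2 = np.zeros((full - 2, full))
    for i in range(full - 2):
        D2[i, i:i + 3] = [1.0, -2.0, 1.0]
    return np.delete(D2, n_pre, axis=1)  # delta_0 = 0: its column vanishes
```

For the hand-computed 2+2 case this yields 3 rows (T + Tbar - 1 = 2 + 2 - 1), and the middle "bridge" row couples delta_{-1} and delta_1 across the boundary with the -2 weight landing on the dropped delta_0.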
Document all 6 corrections (DeltaRM first-diffs, LP equality constraints, DeltaSD boundary, optimal FLCI, REGISTRY equations, performance). Note outstanding ARP calibration work and the R benchmark comparison.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Overall Assessment

⛔ Blocker

Static review note: I could not execute the HonestDiD tests in this environment because

Executive Summary

Methodology

Code Quality
No separate code-quality findings beyond the methodology defects above.

Performance
No material performance regressions identified from the diff itself.

Maintainability
No separate maintainability findings beyond the RM/ARP documentation/method-label mismatch already called out.

Tech Debt

Security
No findings.

Documentation/Tests

Path to Approval
…f, labels

P0: Fix _compute_worst_case_bias to use the correct bias direction (v, not v-l) over centered Delta^SD. The bias is v'delta with delta_pre=0 centering, making bias nonzero when l has post-period components. Previously the bias was identically zero, making the FLCI a pure variance CI.

P1: Thread df through _compute_optimal_flci for survey inference. The M=0 path honors df via _get_critical_value; df<=0 returns NaN.

P1: Distinguish LP infeasibility (status=2 -> NaN bounds) from unboundedness (status=3 -> inf bounds) in _solve_bounds_lp.

P1: Fix RM ci_method from "C-LF" to "FLCI" (the code always uses the naive FLCI). Update the _compute_rm_bounds docstring and REGISTRY.md deviation note to accurately describe the unconditional naive FLCI fallback.

P2: Fix breakdown assertion from (lb<=0 or ub>=0) to (lb<=0<=ub).

P3: Update DeltaRM/DeltaSDRM docstrings to describe first-difference constraints (was absolute levels).

P3: Add tech debt entry to TODO.md for RM ARP calibration.

Regression tests: bias nonzero for M>0, CI width increases with M, infeasible LP returns NaN, breakdown with weak effect.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
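The P1 status fix can be sketched with scipy.optimize.linprog, whose HiGHS backend reports status 2 for infeasible and 3 for unbounded problems. `solve_bounds_lp` below is an illustrative stand-in for the repo's _solve_bounds_lp, not its actual code:

```python
import numpy as np
from scipy.optimize import linprog

def solve_bounds_lp(c, A_ub, b_ub):
    """Minimize c'x subject to A_ub x <= b_ub with free variables.
    Infeasible -> NaN (restriction refuted); unbounded -> -inf."""
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * len(c))
    if res.status == 2:    # infeasible: the identified set is empty
        return float("nan")
    if res.status == 3:    # unbounded: no finite bound in this direction
        return float("-inf")
    return res.fun
```

Collapsing both failure modes to one sentinel would conflate "the smoothness restriction is refuted by the data" with "the restriction places no bound at all", which call for different downstream handling.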
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment

Static review note:

Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
…ation

P1: Gate df<=0 -> NaN at the top of _compute_optimal_flci for all M values, honoring the project's inference contract for undefined survey df.

P1: The M=0 SE now includes the pre-period variance contribution via the extrapolation weight vector, not just l'Sigma_post l.

P1: _compute_smoothness_bounds propagates NaN from infeasible LP bounds to the CI, preventing finite CIs for refuted restrictions.

P3: Updated the HonestDiD class docstring to match the corrected Delta^RM first-difference definition.

P3: The METHODOLOGY_REVIEW.md survey variance checklist now distinguishes RM/M=0 (verified) from M>0 smoothness (asymptotic normal only).

P2: Added fit-level tests for infeasible smoothness CI and df_survey=0. Updated the width monotonicity test for M>0 only (M=0 uses a different SE).

85/85 tests pass in 0.75s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment

Static review based on the diff plus local methodology docs/source inspection; I did not rerun the full HonestDiD suite here.

Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
The previous FLCI bias computation was wrong: it either pinned
delta_pre=0 (making bias zero) or left levels unconstrained (making
bias infinite). The paper (Section 4.1.1) parameterizes the estimator
in terms of weights on pre-treatment first differences (slopes):
theta_hat = l'beta_post - sum_s w_s (beta_s - beta_{s-1})
with constraint sum(w)=1 (linear trend invariance). The bias LP
operates in first-difference space where Delta^SD is a bounded
polyhedron |fd_{i+1} - fd_i| <= M, making the LP well-posed.
New helper functions:
- _build_fd_transform: maps first-differences to levels (C matrix)
- _build_fd_smoothness_constraints: Delta^SD in fd-space
- _w_to_v: maps slope weights w to estimator direction v
_cv_alpha now supports optional df parameter using scipy.stats.nct
(folded non-central t) for survey inference on the smoothness path.
M=0 naturally falls out: bias=0 when sum(w)=1, optimizer just
minimizes variance, giving exact linear-extrapolation weights
including pre-period contribution to SE.
87/87 tests pass. New regressions: survey df=2 widens smoothness CI,
M=0 SE includes pre-period variance, bias nonzero for M>0.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
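The two fd-space helpers described above can be sketched as follows. These are illustrative reconstructions; the repo's _build_fd_transform and _build_fd_smoothness_constraints may differ in orientation and boundary handling.

```python
import numpy as np

def build_fd_transform(T):
    """C maps first differences to levels: level_t = sum_{s<=t} fd_s,
    so C is the lower-triangular matrix of ones."""
    return np.tril(np.ones((T, T)))

def build_fd_smoothness_constraints(T, M):
    """Delta^SD in fd-space: |fd_{i+1} - fd_i| <= M, stacked as
    A @ fd <= b. This is a bounded polyhedron in the difference
    directions, which is what keeps the bias LP well-posed."""
    D = np.zeros((T - 1, T))
    for i in range(T - 1):
        D[i, i:i + 2] = [-1.0, 1.0]
    return np.vstack([D, -D]), np.full(2 * (T - 1), float(M))
```

Working in slopes rather than levels is the key design choice: levels are unconstrained under Delta^SD (any constant shift is allowed), but slope increments are box-bounded by M.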
Update two remaining instances of "max pre-period violation" to "max pre-period first difference" in the result summary and class docstring.

Downgrade METHODOLOGY_REVIEW.md status to "Complete (pending R comparison)" to accurately reflect that R benchmark validation is still outstanding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review |
The outstanding concerns and deviations now accurately reflect that:

- DeltaSD uses the paper's optimal FLCI (Section 4.1) — not a deviation
- DeltaRM uses naive FLCI — documented deviation from R's ARP hybrid
- Note format matches the project's reviewer-recognized labels

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment

⛔ Blocker

Static review only; I did not rerun the HonestDiD suite here because the sandbox Python is missing repo dependencies such as

Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
P0: The linear-trend-neutrality condition sum(w)=1 is only correct for
l=e_1 (theta=tau_1). For general l, Eq. 17 requires sum(w) = sum_j j*l_j.
Also, T slopes are needed (including boundary slope s=0), not T-1.
With T=1: the boundary slope gives v=[target_sum, l], so the FLCI
center for the three-period model is correctly beta_1 + beta_{-1}.
With Tbar>1 and default l: target_sum = sum_j j*(1/Tbar), ensuring
the fd-space bias LP is bounded and multi-post CIs are finite.
P2: Added regressions for T=1/Tbar=1/M=0 center, multi-post M=0 and
M>0 with default l_vec. All produce correct finite CIs.
P3: Replaced local absolute path in paper review with DOI reference.
90/90 tests pass in 1.20s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
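The corrected neutrality condition can be checked numerically: under a pure linear trend delta_t = slope * t, the post coefficients contribute slope * sum_j j*l_j while each pre-period slope equals `slope`, so the bias vanishes exactly when sum(w) = sum_j j*l_j. A hypothetical check (`trend_bias` is not a repo function):

```python
import numpy as np

def trend_bias(w, l_vec, slope=1.0):
    """Bias of the slope-weight estimator under delta_t = slope * t:
    l'delta_post - sum(w) * slope = slope * (sum_j j*l_j - sum(w))."""
    post = slope * np.arange(1, len(l_vec) + 1) @ np.asarray(l_vec, float)
    return post - slope * np.sum(w)
```

For the default multi-post l = (1/Tbar, ..., 1/Tbar) the target is sum_j j/Tbar rather than 1, which is exactly the P0 correction above.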
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment

Static review only. I could not run the HonestDiD suite here because this sandbox does not have

Executive Summary
Methodology
No additional P0/P1 issues stood out in the changed

Code Quality
No material findings beyond the methodology/documentation drift noted elsewhere.

Performance
Maintainability
No material findings beyond the stale methodology notes called out below.

Tech Debt
Security
No findings.

Documentation/Tests
Path to Approval
…ze RM LP

P1: Add a REGISTRY.md deviation note and runtime UserWarning for method="combined" (Delta^SDRM): the paper proves the FLCI is NOT consistent for this restriction class (Proposition 4.2).

P3: Update stale sum(w)=1 comments to sum(w)=sum_j j*l_j in the _compute_worst_case_bias docstring, the _compute_optimal_flci docstring, and the METHODOLOGY_REVIEW.md deviation notes.

P3: Simplify _solve_rm_bounds_union to a single LP using max(pre_diffs) instead of looping over all pre-period first differences (components with smaller bounds are nested inside the max-component bounds).

90/90 tests pass in 1.31s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
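The nesting argument behind the single-LP simplification: under Delta^RM each component bounds the post deviation by Mbar times one pre-period first difference, and the resulting symmetric intervals are nested, so their union is just the interval from the largest |pre fd|. A toy illustration (the name `rm_union_bound` is hypothetical):

```python
import numpy as np

def rm_union_bound(pre_diffs, Mbar):
    """Union of the per-component bounds [-Mbar*|fd_i|, Mbar*|fd_i|]
    collapses to the single max-component bound."""
    per_component = [Mbar * abs(d) for d in pre_diffs]
    return max(per_component)
```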
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment

⛔ Blocker

Static review only. This sandbox does not have

Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
NaN comparisons (ci_lb <= 0 <= ci_ub) return False, causing undefined CIs from infeasible LPs or df=0 to be silently classified as "significant" in is_significant, significance_stars, summary(), repr(), _find_breakdown(), and breakdown_value().

Added a _ci_is_finite() guard on HonestDiDResults and finite-check helpers in _find_breakdown/breakdown_value. NaN CIs now:

- is_significant = False (not True)
- significance_stars = "" (not "*")
- repr shows "(undefined)" instead of "*"
- breakdown search treats them as "not significant"

Added regression: an infeasible fit is checked for is_significant=False, no stars, and "undefined" in repr.

90/90 tests pass in 1.15s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
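The root cause is that every comparison with NaN is False, so `not (ci_lb <= 0 <= ci_ub)` evaluates to True for an undefined CI. A minimal standalone sketch of the guard (the repo's _ci_is_finite lives on HonestDiDResults; these free functions are illustrative):

```python
import math

def ci_is_finite(ci_lb, ci_ub):
    # NaN (and inf) CIs are "undefined": never treat them as evidence.
    return math.isfinite(ci_lb) and math.isfinite(ci_ub)

def is_significant(ci_lb, ci_ub):
    if not ci_is_finite(ci_lb, ci_ub):
        return False  # undefined CI: refuse to classify as significant
    return not (ci_lb <= 0.0 <= ci_ub)
```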
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment

Static review only.

Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
P1: Updated the inference description to accurately reflect that Delta^SD smoothness uses a folded non-central t (scipy.stats.nct) when df_survey is present, while the Delta^RM and naive FLCI paths use a standard t via _get_critical_value. Previously described as uniformly using _get_critical_value.

P3: Changed the stale "M=0 corresponds to exact parallel trends" to "M=0 enforces linear trend extrapolation", matching the paper and the later edge-case note in the same section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
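The survey-df path can be sketched with scipy.stats.nct: the folded non-central t critical value c solves P(|T| <= c) = 1 - alpha for T ~ nct(df, nc). A bracketing root-finder is used here for simplicity (the function name is illustrative, and the repo's Newton-based _cv_alpha path is analogous):

```python
from scipy.optimize import brentq
from scipy.stats import nct

def cv_alpha_t(nc, df, alpha=0.05):
    """Folded non-central t critical value: the c solving
    nct.cdf(c, df, nc) - nct.cdf(-c, df, nc) = 1 - alpha."""
    f = lambda c: nct.cdf(c, df, nc) - nct.cdf(-c, df, nc) - (1 - alpha)
    return brentq(f, 1e-8, abs(nc) + 50.0)
```

With nc=0 this reduces to the ordinary two-sided t critical value; a positive noncentrality (worst-case bias over SE) widens it.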
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment

✅ Looks good

Static review only. I could not run the HonestDiD tests in this sandbox because the available Python runtime does not have

Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
The M=0 short-circuit test asserted < 0.1s, but CI runners (GitHub Actions ubuntu-latest py3.13) can take 0.11s due to import overhead. Relaxed to 0.5s, which still validates the short-circuit (the non-optimized path takes several seconds).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Methodology references (required if estimator / math changes)
Validation
- tests/test_methodology_honest_did.py (17 new tests): constraint matrices, LP bounds, first-difference constraints, optimal FLCI, breakdown monotonicity
- tests/test_honest_did.py (63 tests updated): imports fixed, constraint expectations updated
- docs/methodology/papers/rambachan-roth-2023-review.md

Security / privacy
Generated with Claude Code