From fe3ab98594f309ed4d16331edcc9f669c0f5c4bc Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 15 May 2026 18:42:00 -0400 Subject: [PATCH 1/2] TODO.md cleanup: remove SHIPPED rows, refresh stale refs, add tiered backlog MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Wave 1 of multi-wave tech-debt paydown. Docs-only — no estimator code touched. Changes: - Delete 3 SHIPPED rows marked `Done`: HAD trends_lin (PR #389), HAD yatchew_hr_test(null="mean_independence") (post-PR #392), HAD Phase 5 follow-up tutorial T22 (PR #440). - Fix stray blank line inside Methodology/Correctness table (was rendering as two fragments). - Correct dCDH survey cell-period allocator row's PR column "PR 2" → "#408" (the actual dcdh-by-path-survey-design merge). - Refresh Known Limitations line refs to the current state of estimators.py (`:778-784` → `:1647`; `:567-588` → `:890-911`). - Regenerate Large Module Files table from `wc -l diff_diff/*.py` ≥1000 lines. Add 11 newly-eligible modules (chaisemartin_dhaultfoeuille.py 8636, had_pretests.py 4951, had.py 4593, etc.). Update the header sentence to make the 3000/2000/1000 thresholds explicit instead of silently relaxing the documented "< 1000 lines" target. - Rewrite Standard Error Consistency section — `vcov_type` has subsumed the proposed `se_type` knob; point readers at the existing methodology row for the remaining 8-standalone-estimator threading work. - Modernize Test Coverage section to reference `pytest.importorskip` rather than a stale "21 visualization tests" count. - Add new `### Prioritized Tech-Debt Backlog` subsection: Tier A/B/C/D ordering by effort × risk, anchoring back to existing source-of-truth table rows so the tables remain canonical. Seven Tier A quick wins, thirteen Tier B mid-size methodology items, nine Tier C derivation items, eight Tier D deferred/research items. The new tier structure exists so subsequent waves can be picked off without re-litigating priority each turn. 
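The regeneration rule for the table can be sketched in Python: a minimal, hedged equivalent of `wc -l diff_diff/*.py` filtered at the 1000-line threshold (the directory and module names below are throwaway demo values so the sketch runs anywhere; they are not files from this repo):

```python
from pathlib import Path
import tempfile

def modules_at_or_over(pkg_dir: Path, threshold: int = 1000) -> list[tuple[str, int]]:
    """Return (filename, line_count) pairs for modules >= threshold, largest first."""
    counts = [(p.name, len(p.read_text().splitlines())) for p in pkg_dir.glob("*.py")]
    return sorted((c for c in counts if c[1] >= threshold), key=lambda c: -c[1])

# Throwaway demo directory standing in for diff_diff/.
with tempfile.TemporaryDirectory() as d:
    pkg = Path(d)
    (pkg / "big_module.py").write_text("# pad\n" * 1200)
    (pkg / "small_module.py").write_text("# pad\n" * 400)
    rows = modules_at_or_over(pkg)

print(rows)  # -> [('big_module.py', 1200)]
```

The real invocation is just `wc -l`; the sketch only mirrors the ≥1000 filter and descending sort so the 3000/2000/1000 tiers can be reproduced mechanically.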
Wave 2 candidate is the clubsandwich_cr2_golden.json regeneration from R (row 87 / Tier A). Co-Authored-By: Claude Opus 4.7 (1M context) --- TODO.md | 126 ++++++++++++++++++++++++++++++++++++++++---------------- 1 file changed, 90 insertions(+), 36 deletions(-) diff --git a/TODO.md b/TODO.md index 7572e088..e061befa 100644 --- a/TODO.md +++ b/TODO.md @@ -12,8 +12,8 @@ Current limitations that may affect users: | Issue | Location | Priority | Notes | |-------|----------|----------|-------| -| MultiPeriodDiD wild bootstrap not supported | `estimators.py:778-784` | Low | Edge case | -| `predict()` raises NotImplementedError | `estimators.py:567-588` | Low | Rarely needed | +| MultiPeriodDiD wild bootstrap not supported (falls back to analytical) | `estimators.py:1647` | Low | Edge case | +| `predict()` raises NotImplementedError | `estimators.py:890-911` | Low | Rarely needed | For survey-specific limitations (NotImplementedError paths), see the [Current Limitations](docs/survey-roadmap.md#current-limitations) section @@ -23,28 +23,38 @@ of survey-roadmap.md. ### Large Module Files -Target: < 1000 lines per module for maintainability. Updated 2026-03-29. +Target: ideally < 1000 lines per module; modules ≥3000 lines are candidates for splitting, 2000-3000 are monitored, 1000-2000 are accepted as a cohesion / scope trade-off. Updated 2026-05-15. 
| File | Lines | Action | |------|-------|--------| -| `power.py` | 2588 | Consider splitting (power analysis + MDE + sample size) | -| `linalg.py` | 2289 | Monitor — unified backend, splitting would hurt cohesion | -| `staggered.py` | 2275 | Monitor — grew with survey support | -| `imputation.py` | 2009 | Monitor | -| `triple_diff.py` | 1921 | Monitor | -| `utils.py` | 1902 | Monitor | -| `two_stage.py` | 1708 | Monitor | -| `survey.py` | 1646 | Monitor — grew with Phase 6 features | -| `continuous_did.py` | 1626 | Monitor | -| `honest_did.py` | 1511 | Acceptable | -| `sun_abraham.py` | 1540 | Acceptable | -| `estimators.py` | 1357 | Acceptable | -| `trop_local.py` | 1261 | Acceptable | -| `trop_global.py` | 1251 | Acceptable | -| `prep.py` | 1225 | Acceptable | -| `pretrends.py` | 1105 | Acceptable | -| `trop.py` | 981 | Split done — trop_global.py + trop_local.py | -| `visualization/` | 4172 | Subpackage (split across 7 files) — OK | +| `chaisemartin_dhaultfoeuille.py` | 8636 | Consider splitting (per-path / placebos / survey IF / aggregation) | +| `had_pretests.py` | 4951 | Consider splitting (Stute / Yatchew / QUG / joint pretests) | +| `had.py` | 4593 | Consider splitting (continuous / mass-point / event-study / survey paths) | +| `staggered.py` | 3963 | Consider splitting — grew through survey + aggregation features | +| `linalg.py` | 3601 | Consider splitting (vcov surfaces) — unified backend, splitting would hurt cohesion | +| `diagnostic_report.py` | 3380 | Consider splitting (per-method renderers + provenance) | +| `power.py` | 3196 | Consider splitting (power analysis + MDE + sample size) | +| `synthetic_did.py` | 2819 | Monitor — variance methods + survey paths | +| `honest_did.py` | 2785 | Monitor | +| `business_report.py` | 2653 | Monitor — per-method narrative renderers | +| `imputation.py` | 2475 | Monitor | +| `survey.py` | 2466 | Monitor — grew with Phase 6 features | +| `utils.py` | 2396 | Monitor | +| `prep_dgp.py` | 2057 | Monitor | +| 
`triple_diff.py` | 2053 | Monitor | +| `estimators.py` | 1991 | Acceptable | +| `two_stage.py` | 1985 | Acceptable | +| `chaisemartin_dhaultfoeuille_results.py` | 1981 | Acceptable | +| `prep.py` | 1876 | Acceptable | +| `efficient_did.py` | 1793 | Acceptable | +| `sun_abraham.py` | 1713 | Acceptable | +| `continuous_did.py` | 1682 | Acceptable | +| `results.py` | 1676 | Acceptable | +| `staggered_triple_diff.py` | 1619 | Acceptable | +| `bacon.py` | 1144 | Acceptable | +| `pretrends.py` | 1133 | Acceptable | +| `conley.py` | 1006 | Acceptable | +| `visualization/` | 4316 | Subpackage (split across 7 files) — OK | --- @@ -57,7 +67,7 @@ Deferred items from PR reviews that were not addressed before merge. | Issue | Location | PR | Priority | |-------|----------|----|----------| | dCDH: Phase 1 per-period placebo DID_M^pl has NaN SE (no IF derivation for the per-period aggregation path). Multi-horizon placebos (L_max >= 1) have valid SE. | `chaisemartin_dhaultfoeuille.py` | #294 | Low | -| dCDH: Survey cell-period allocator's post-period attribution is a library convention, not derived from the observation-level survey linearization. MC coverage is empirically close to nominal on the test DGP; a formal derivation (or a covariance-aware two-cell alternative) is deferred. Documented in REGISTRY.md survey IF expansion Note. | `chaisemartin_dhaultfoeuille.py`, `docs/methodology/REGISTRY.md` | PR 2 | Medium | +| dCDH: Survey cell-period allocator's post-period attribution is a library convention, not derived from the observation-level survey linearization. MC coverage is empirically close to nominal on the test DGP; a formal derivation (or a covariance-aware two-cell alternative) is deferred. Documented in REGISTRY.md survey IF expansion Note. 
| `chaisemartin_dhaultfoeuille.py`, `docs/methodology/REGISTRY.md` | #408 | Medium | | dCDH: Parity test SE/CI assertions only cover pure-direction scenarios; mixed-direction SE comparison is structurally apples-to-oranges (cell-count vs obs-count weighting). | `test_chaisemartin_dhaultfoeuille_parity.py` | #294 | Low | | dCDH by_path: negative-baseline path regression (e.g. `(-1, 0, 0, 0)`) is not yet exercised. The existing negative-D test (`test_negative_integer_D_supported`) only covers paths with negative values in non-baseline positions like `(0, -1, -1, -1)`, which does not trigger the R `substr(path, 1, 1)` bug regime (the bug needs a multi-character baseline). Add a switcher fixture with `D_{g,1} = -1` and assert the resulting path tuple key. | `tests/test_chaisemartin_dhaultfoeuille.py` | #419 | Low | | dCDH by_path: per-path placebo heterogeneity (`predict_het` rows for negative horizons) is currently NaN-filled in `to_dataframe(level="by_path")` `het_*` columns and unpopulated in `path_heterogeneity_effects`. R `did_multiplegt_dyn(..., by_path, predict_het)` forwards `predict_het` into each per-path `did_multiplegt_main` call alongside `placebo`, so R likely emits placebo het rows we do not yet mirror. Validate R's actual placebo predict_het output, then either implement parity or document the deviation explicitly. | `diff_diff/chaisemartin_dhaultfoeuille.py`, `diff_diff/chaisemartin_dhaultfoeuille_results.py`, `tests/test_chaisemartin_dhaultfoeuille_parity.py` | #422 | Medium | @@ -66,7 +76,6 @@ Deferred items from PR reviews that were not addressed before merge. 
| ImputationDiD dense `(A0'A0).toarray()` scales O((U+T+K)^2), OOM risk on large panels | `imputation.py` | #141 | Medium (deferred — only triggers when sparse solver fails) | | Multi-absorb weighted demeaning needs iterative alternating projections for N > 1 absorbed FE with survey weights; unweighted multi-absorb also uses single-pass (pre-existing, exact only for balanced panels) | `estimators.py` | #218 | Medium | | EfficientDiD `control_group="last_cohort"` trims at `last_g - anticipation` but REGISTRY says `t >= last_g`. With `anticipation=0` (default) these are identical. With `anticipation>0`, code is arguably more conservative (excludes anticipation-contaminated periods). Either align REGISTRY with code or change code to `t < last_g` — needs design decision. | `efficient_did.py` | #230 | Low | - | TripleDifference power: `generate_ddd_data` is a fixed 2×2×2 cross-sectional DGP — no multi-period or unbalanced-group support. Add a `generate_ddd_panel_data` for panel DDD power analysis. | `prep_dgp.py`, `power.py` | #208 | Low | | Survey design resolution/collapse patterns are inconsistent across panel estimators — ContinuousDiD rebuilds unit-level design in SE code, EfficientDiD builds once in fit(), StackedDiD re-resolves on stacked data; extract shared helpers for panel-to-unit collapse, post-filter re-resolution, and metadata recomputation | `continuous_did.py`, `efficient_did.py`, `stacked_did.py` | #226 | Low | | Survey-weighted Silverman bandwidth in EfficientDiD conditional Omega* — `_silverman_bandwidth()` uses unweighted mean/std for bandwidth selection; survey-weighted statistics would better reflect the population distribution but is a second-order refinement | `efficient_did_covariates.py` | — | Low | @@ -104,15 +113,12 @@ Deferred items from PR reviews that were not addressed before merge. | `HeterogeneousAdoptionDiD` mass-point: `vcov_type in {"hc2", "hc2_bm"}` raises `NotImplementedError` pending a 2SLS-specific leverage derivation. 
The OLS leverage `x_i' (X'X)^{-1} x_i` is wrong for 2SLS; the correct finite-sample correction uses `x_i' (Z'X)^{-1} (...) (X'Z)^{-1} x_i`. Needs derivation plus an R / Stata (`ivreg2 small robust`) parity anchor. | `diff_diff/had.py::_fit_mass_point_2sls` | Phase 2a | Medium | | `HeterogeneousAdoptionDiD` survey-design API consolidation, **next minor bump**: drop the deprecated `survey=` and `weights=` kwargs on all 8 HAD surfaces (`HeterogeneousAdoptionDiD.fit`, `did_had_pretest_workflow`, `qug_test`, `stute_test`, `yatchew_hr_test`, `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`); only `survey_design=` remains. Also fold the legacy back-end `weights=` paths (e.g. `_aggregate_unit_weights` ad-hoc routing) into the unified `_resolve_survey_for_fit`-driven path. The `_make_trivial_resolved` underscore alias on `survey.py` stays (one-line, harmless). DeprecationWarning ships in this PR; the removal PR is ~50 LoC of cleanup. | `diff_diff/had.py`, `diff_diff/had_pretests.py` | next minor bump | Medium | | `HeterogeneousAdoptionDiD` continuous paths: thread `cluster=` through `bias_corrected_local_linear` (Phase 1c's wrapper already supports cluster; Phase 2a ignores it with a `UserWarning` on the continuous path to keep scope tight). | `diff_diff/had.py`, `diff_diff/local_linear.py` | Phase 2a | Low | -| `HeterogeneousAdoptionDiD` Eq 17 / Eq 18 linear-trend detrending: SHIPPED in PR #389 (Phase 4 R-parity, 2026-04). Exposed as `trends_lin: bool = False` keyword-only kwarg on `HeterogeneousAdoptionDiD.fit(aggregate="event_study")`, `joint_pretrends_test`, `joint_homogeneity_test`. Mirrors R `DIDHAD::did_had(..., trends_lin=TRUE)`. Pierce-Schott published-number parity (paper p=0.51 / p=0.40) deferred indefinitely (LBD-restricted analysis panel); replaced by end-to-end R-package parity at `tests/test_did_had_parity.py`. 
| `diff_diff/had_pretests.py::joint_pretrends_test`, `diff_diff/had.py` | Phase 4 (shipped) | Done | | `HeterogeneousAdoptionDiD` `trends_lin × survey_design` follow-up: per-group linear-trend slope under survey weighting (weighted slope estimator? per-PSU slope?) is not derived from the paper. PR #389 raises `NotImplementedError` on the combination across all 3 trends_lin surfaces. If user demand emerges, derive the weighted variant and lift the gate. | `diff_diff/had.py::HeterogeneousAdoptionDiD.fit`, `diff_diff/had_pretests.py::joint_pretrends_test`, `diff_diff/had_pretests.py::joint_homogeneity_test` | follow-up | Low | -| `HeterogeneousAdoptionDiD` `yatchew_hr_test(null="mean_independence")` mode: SHIPPED post-PR #392 (2026-04). Adds `null: Literal["linearity", "mean_independence"]` keyword-only kwarg mirroring R `YatchewTest::yatchew_test(order=0)`. Default `"linearity"` is bit-exact backcompat. `tests/test_did_had_parity.py::TestYatchewParity` now routes placebo rows through `null="mean_independence"` (R `order=0`) and effect rows through `null="linearity"` (R `order=1`); parity holds at `atol=1e-10` after the documented `× G/(G-1)` finite-sample convention shift. | `diff_diff/had_pretests.py::yatchew_hr_test` | follow-up (shipped) | Done | | `HeterogeneousAdoptionDiD` Stute family Stata-bridge parity: PR #389 R-parity covers the full HAD fit + Yatchew surfaces but skips Stute family (`stute_test`, `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`) because no R `Stutetest` package exists publicly (chaisemartinPackages publishes only the Stata `stute_test` module; the paper cites a 2024c R Stutetest module that is not on GitHub or CRAN). Stata-bridge parity would add `benchmarks/stata/generate_stute_golden.do` + a Stata installation requirement. Low priority unless user demand emerges. 
| `benchmarks/stata/`, `tests/test_stute_test_parity.py` | follow-up | Low | | `HeterogeneousAdoptionDiD` Phase 3 Stute performance: Appendix D vectorized matrix form replaces the per-iteration OLS refit with a single precomputed `M = I - X(X'X)^{-1}X'` applied to `eps * eta`. Functionally identical, ~2x faster. Shipped literal-refit form in Phase 3 to match paper text and keep reviewer surface small. | `diff_diff/had_pretests.py::stute_test` | Phase 3 | Low | | `HeterogeneousAdoptionDiD` Phase 3 R-parity: Phase 3 ships coverage-rate validation on synthetic DGPs (not tight point parity against `chaisemartin::stute_test` / `yatchew_test`). Tight numerical parity requires aligning bootstrap seed semantics and `B` across numpy/R and is deferred. | `tests/test_had_pretests.py` | Phase 3 | Low | | `HeterogeneousAdoptionDiD` Phase 3 nprobust bandwidth for Stute: some Stute variants on continuous regressors use nprobust-style optimal bandwidth selection. Phase 3 uses OLS residuals from a 2-parameter linear fit (no bandwidth selection). nprobust integration is a future enhancement; not in paper scope. | `diff_diff/had_pretests.py::stute_test` | Phase 3 | Low | | `HeterogeneousAdoptionDiD` Phase 4: Pierce-Schott (2016) replication harness; reproduce paper Figure 2 values and Table 1 coverage rates. | `benchmarks/`, `tests/` | Phase 2a | Low | -| `HeterogeneousAdoptionDiD` Phase 5 follow-up tutorial — SHIPPED. T22 (`docs/tutorials/22_had_survey_design.ipynb` + `tests/test_t22_had_survey_design_drift.py`) landed as the follow-up to PR #432; demonstrates the now-supported `SurveyDesign(strata=...)` path through HAD + `did_had_pretest_workflow` end-to-end on a BRFSS-shape household-panel design. T20 HAD brand-campaign (PR #394), T21 HAD pretest workflow (PR #409), and `practitioner_next_steps()` HAD handlers + `llms-full.txt` HeterogeneousAdoptionDiD section + Choosing-an-Estimator row (Phase 5 wave 1, PR #402) landed earlier. 
| `tutorials/`, `tests/test_t22_*_drift.py` | Phase 2a (shipped) | Done | | `HeterogeneousAdoptionDiD` time-varying dose on event study: Phase 2b REJECTS panels where `D_{g,t}` varies within a unit for `t >= F` (the aggregation uses `D_{g, F}` as the single regressor for all horizons, paper Appendix B.2 constant-dose convention). A follow-up PR could add a time-varying-dose estimator for these panels; current behavior is front-door rejection with a redirect to `ChaisemartinDHaultfoeuille`. | `diff_diff/had.py::_validate_had_panel_event_study` | Phase 2b | Low | | `HeterogeneousAdoptionDiD` repeated-cross-section support: paper Section 2 defines HAD on panel OR repeated cross-section, but Phase 2a is panel-only. RCS inputs (disjoint unit IDs between periods) are rejected by the balanced-panel validator with the generic "unit(s) do not appear in both periods" error. A follow-up PR will add an RCS identification path based on pre/post cell means (rather than unit-level first differences), with its own validator and a distinct `data_mode` / API surface. | `diff_diff/had.py::_validate_had_panel`, `diff_diff/had.py::_aggregate_first_difference` | Phase 2a | Medium | | SyntheticDiD: bootstrap cross-language parity anchor against R's default `synthdid::vcov(method="bootstrap")` (refit; rebinds `opts` per draw) or Julia `Synthdid.jl::src/vcov.jl::bootstrap_se` (refit by construction). Same-library validation (placebo-SE tracking, AER §6.3 MC truth) is in place; a cross-language anchor is desirable to bolster the methodology contract. Julia is the cleanest target — minimal wrapping work and refit-native vcov. Tolerance target: 1e-6 on Monte Carlo samples (different BLAS + RNG paths preclude 1e-10). The R-parity fixture from the previous release was deleted because it pinned the now-removed fixed-weight path. | `benchmarks/R/`, `benchmarks/julia/`, `tests/` | follow-up | Low | @@ -146,18 +152,66 @@ Deferred items from PR reviews that were not addressed before merge. 
--- -### Standard Error Consistency +### Prioritized Tech-Debt Backlog + +Ordered paydown view across the tables above. Tier A → D is by effort × risk, not severity — every item here already carries its own `Low / Medium` priority in the source-of-truth tables. The intent is to give a flat ordering to draw from wave-by-wave without re-litigating priority each time. Anchors point to the location reference of the originating row. + +#### Tier A — Quick wins (≤1 day, ≤3 CI rounds expected) + +- Regenerate `benchmarks/data/clubsandwich_cr2_golden.json` from R via `benchmarks/R/generate_clubsandwich_golden.R` (currently `source: python_self_reference`) +- HonestDiD `test_m0_short_circuit`: replace wall-clock `elapsed < 0.5s` proxy with a state flag (`tests/test_methodology_honest_did.py:246`) +- EfficientDiD `control_group="last_cohort"` REGISTRY-vs-code alignment with `anticipation>0` (`efficient_did.py`, one design decision) +- TripleDifference: add `generate_ddd_panel_data` for panel DDD power analysis (`prep_dgp.py`, `power.py`) +- TROP: extract shared data-setup helper between `fit()` and `_fit_global()` (~150 LoC dedup; `trop.py`, `trop_global.py`, `trop_local.py`) +- WooldridgeDiD: emit warning when canonical-link is violated (`wooldridge.py`; W2023 Prop 3.1) +- HonestDiD vertex-rejection diagnostic counter on ARP enumeration exhaust (`honest_did.py:1907`) + +(SyntheticDiD `placebo_effects` → `variance_effects` rename moved to Tier B — the user-facing field rename + one-release deprecation alias is too large for ≤1 day / ≤3 CI rounds.) 
+ +#### Tier B — Mid-size methodology (5-10 CI rounds expected, per memory cascade priors) + +- Thread `vcov_type` through 8 standalone estimators: `CallawaySantAnna`, `SunAbraham`, `ImputationDiD`, `TwoStageDiD`, `TripleDifference`, `StackedDiD`, `WooldridgeDiD`, `EfficientDiD` (none currently expose `self.vcov_type`) +- SyntheticDiD: rename internal `placebo_effects` → `variance_effects` AND public `placebo_effects` field with deprecation alias retained for one release (`synthetic_did.py`, `results.py`) +- StaggeredTripleDifference R parity: commit CSV fixtures + add covariate-adjusted scenarios + aggregation-SE assertions (`tests/test_methodology_staggered_triple_diff.py`, `benchmarks/R/benchmark_staggered_triplediff.R`) +- StaggeredTripleDifference: per-cohort group-effect SE WIF override for exact R `triplediff` match (`staggered_triple_diff.py`) +- WooldridgeDiD: QMLE Stata-parity `qmle` weight type + Stata golden values (`wooldridge.py`, `linalg.py`, `tests/test_wooldridge.py`) +- WooldridgeDiD: optional `weights="cohort_share"` on `aggregate()` (`wooldridge_results.py`) +- HAD survey-design API consolidation: drop deprecated `survey=`/`weights=` kwargs (`had.py`, `had_pretests.py`; gated on next minor bump) +- Survey-design resolution / collapse helper extraction across `continuous_did.py`, `efficient_did.py`, `stacked_did.py` +- dCDH heterogeneity df threading: t-distribution at heterogeneity surface (or formalize the tolerance constant) (`chaisemartin_dhaultfoeuille.py`) +- dCDH by_path placebo `predict_het` parity vs R `did_multiplegt_dyn(..., by_path, predict_het)` (`chaisemartin_dhaultfoeuille.py`, `chaisemartin_dhaultfoeuille_results.py`) +- Rust local-method solver path unification to `solve_wls_svd` + bootstrap-weight RNG parity audit (`rust/src/trop.rs`, `rust/src/bootstrap.rs`) +- AI review CI workflow-contract pin test expansion (`tests/test_openai_review.py`) +- In-site Sphinx render of `REPORTING.md` and `REGISTRY.md` (`docs/conf.py` + `:doc:` 
link migration) + +#### Tier C — Heavy / derivation required + +- HonestDiD Δ^RM ARP conditional/hybrid confidence sets (`honest_did.py`) +- Weighted one-way Bell-McCaffrey + weighted CR2 Bell-McCaffrey + HC2/CR2 on absorbed-FE (linalg derivations + R parity harness) (`linalg.py`, `estimators.py::DifferenceInDifferences.fit`, `estimators.py::MultiPeriodDiD.fit`, `twfe.py::fit`) +- Multi-absorb weighted demeaning: alternating-projection iteration for N>1 absorb + weights (`estimators.py`) +- ImputationDiD dense `(A0'A0).toarray()` OOM: alternative dense fallback or richer sparse strategy (`imputation.py:1531`) +- HAD mass-point `vcov_type ∈ {hc2, hc2_bm}`: 2SLS-specific leverage derivation (`had.py::_fit_mass_point_2sls`) +- HAD repeated-cross-section identification path (`had.py::_validate_had_panel`) +- HAD time-varying-dose event study estimator (`had.py::_validate_had_panel_event_study`) +- Conley + `survey_design` (`linalg.py::_validate_vcov_args`, `conley.py`) +- SyntheticDiD `vcov_type="conley"` (`synthetic_did.py::SyntheticDiD` — new analytical sandwich path OR spatial-block bootstrap) + +#### Tier D — Deferred / research (no active action planned) + +- HAD survey-aware support-endpoint test (`had_pretests.py::qug_test`; waits on literature — endpoint EVT × survey-aware functional CLT) +- HAD joint cross-horizon analytical covariance / unweighted event-study sup-t band (low user demand) +- HAD Phase 4.5 replicate-weight pretests (BRR/Fay/JK1/JKn/SDR composition derivation) +- HAD Stute family Stata-bridge parity (no R `Stutetest` package exists publicly) +- HAD `trends_lin × survey_design` weighted-slope derivation +- Phase 1c lprobust follow-ups (`vce` modes, weights, multi-eval grid, clustered-DGP auto-bandwidth) — deferred to Phase 2+ of `bias_corrected_local_linear` +- TestWorkflowDoesNotExecutePRHeadCode (CodeQL #14) residual bypass paths — diminishing return given documented threat model +- All remaining `Low`-priority Performance and Testing/Docs rows 
(R-script-per-test, CS R covariate-adjusted IRLS benchmark, doc-deps integrity CI, Rust faer SVD overhead, etc.) -Different estimators compute SEs differently. Consider unified interface. +--- -| Estimator | Default SE Type | -|-----------|-----------------| -| DifferenceInDifferences | HC1 or cluster-robust | -| TwoWayFixedEffects | Always cluster-robust (unit level) | -| CallawaySantAnna | Simple difference-in-means SE | -| SyntheticDiD | Bootstrap or placebo-based | +### Standard Error Consistency -**Action**: Consider adding `se_type` parameter for consistency across estimators. +`vcov_type` has subsumed the previously-proposed `se_type` knob — `DifferenceInDifferences` and `TwoWayFixedEffects` expose the full surface (`hc1`, `hc2`, `hc2_bm`, `cr1`, `cr2`, `conley`, `bootstrap`). Threading `vcov_type` through the 8 standalone estimators (`CallawaySantAnna`, `SunAbraham`, `ImputationDiD`, `TwoStageDiD`, `TripleDifference`, `StackedDiD`, `WooldridgeDiD`, `EfficientDiD`) remains open and is tracked as a single methodology row in the table above (Phase 1a row). ### Type Annotations @@ -176,7 +230,7 @@ Deprecated parameters still present for backward compatibility: ## Test Coverage -**Note**: 21 visualization tests are skipped when matplotlib unavailable—this is expected. +Visualization tests skip when matplotlib / plotly are not installed (see `pytest.importorskip` markers in `tests/test_visualization*.py`). --- From 4e829a28db1049678abc7c8c5517e4b8cef04925 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 15 May 2026 18:49:06 -0400 Subject: [PATCH 2/2] Address PR #447 R1 review (2 P2, 1 self-audit) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P2 — SE-consistency rewrite misstated the public API The rewrite listed cr1, cr2, bootstrap as vcov_type values. 
The actual validated set in linalg.py::_VALID_VCOV_TYPES is {"classical","hc1","hc2","hc2_bm","conley"}; cluster-robust variance is obtained via cluster= alongside the heteroscedasticity kind (hc1+cluster ⇒ CR1, hc2_bm+cluster ⇒ CR2 Bell-McCaffrey), and wild cluster bootstrap is a separate inference="wild_bootstrap" path on the same estimator. Rewrote the SE Consistency paragraph to match the actual API. P2 — Large Module Files table omitted 8 modules already ≥1000 lines The refreshed inventory in TODO.md lines 24-56 missed: _nprobust_port.py (1412), practitioner.py (1402), trop_global.py (1350), trop_local.py (1339), local_linear.py (1332), wooldridge.py (1305), chaisemartin_dhaultfoeuille_bootstrap.py (1175), stacked_did.py (1050). Mechanically regenerated from wc -l diff_diff/*.py >= 1000; all 35 current ≥1000-line modules are now listed (verified via comm). Self-audit fix: linalg.py's Action cell read "Consider splitting (vcov surfaces) — unified backend, splitting would hurt cohesion" — self-contradictory. Reworded to "Consider splitting only if cohesion can be preserved" so the threshold rule and the cohesion constraint can both be honored. 
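The hc1+cluster ⇒ CR1 mapping described above, as a generic textbook sketch (plain OLS plus the Liang-Zeger sandwich in numpy; `cr1_vcov` is a hypothetical name for illustration, not diff_diff's implementation):

```python
import numpy as np

def cr1_vcov(X: np.ndarray, y: np.ndarray, groups: np.ndarray) -> np.ndarray:
    """CR1 (Liang-Zeger) cluster-robust covariance for OLS coefficients."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    resid = y - X @ (XtX_inv @ X.T @ y)
    # Meat: sum over clusters of the outer product of within-cluster score sums.
    meat = np.zeros((k, k))
    for g in np.unique(groups):
        score = X[groups == g].T @ resid[groups == g]
        meat += np.outer(score, score)
    G = np.unique(groups).size
    c = G / (G - 1) * (n - 1) / (n - k)  # CR1 small-sample factor
    return c * XtX_inv @ meat @ XtX_inv

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(40), rng.normal(size=40)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=40)
V = cr1_vcov(X, y, groups=np.repeat(np.arange(8), 5))
print(V.shape)  # -> (2, 2)
```

CR2 Bell-McCaffrey replaces the scalar factor with per-cluster leverage adjustments plus Satterthwaite df, which is why the weighted/absorbed-FE variants remain open rows in the table.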
Co-Authored-By: Claude Opus 4.7 (1M context) --- TODO.md | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/TODO.md b/TODO.md index e061befa..07b5110b 100644 --- a/TODO.md +++ b/TODO.md @@ -31,7 +31,7 @@ Target: ideally < 1000 lines per module; modules ≥3000 lines are candidates fo | `had_pretests.py` | 4951 | Consider splitting (Stute / Yatchew / QUG / joint pretests) | | `had.py` | 4593 | Consider splitting (continuous / mass-point / event-study / survey paths) | | `staggered.py` | 3963 | Consider splitting — grew through survey + aggregation features | -| `linalg.py` | 3601 | Consider splitting (vcov surfaces) — unified backend, splitting would hurt cohesion | +| `linalg.py` | 3601 | Consider splitting (vcov surfaces) only if cohesion can be preserved — unified backend; vcov / solver paths are tightly coupled | | `diagnostic_report.py` | 3380 | Consider splitting (per-method renderers + provenance) | | `power.py` | 3196 | Consider splitting (power analysis + MDE + sample size) | | `synthetic_did.py` | 2819 | Monitor — variance methods + survey paths | @@ -51,8 +51,16 @@ Target: ideally < 1000 lines per module; modules ≥3000 lines are candidates fo | `continuous_did.py` | 1682 | Acceptable | | `results.py` | 1676 | Acceptable | | `staggered_triple_diff.py` | 1619 | Acceptable | +| `_nprobust_port.py` | 1412 | Acceptable | +| `practitioner.py` | 1402 | Acceptable | +| `trop_global.py` | 1350 | Acceptable | +| `trop_local.py` | 1339 | Acceptable | +| `local_linear.py` | 1332 | Acceptable | +| `wooldridge.py` | 1305 | Acceptable | +| `chaisemartin_dhaultfoeuille_bootstrap.py` | 1175 | Acceptable | | `bacon.py` | 1144 | Acceptable | | `pretrends.py` | 1133 | Acceptable | +| `stacked_did.py` | 1050 | Acceptable | | `conley.py` | 1006 | Acceptable | | `visualization/` | 4316 | Subpackage (split across 7 files) — OK | @@ -211,7 +219,7 @@ Ordered paydown view across the tables above. 
Tier A → D is by effort × risk, ### Standard Error Consistency -`vcov_type` has subsumed the previously-proposed `se_type` knob — `DifferenceInDifferences` and `TwoWayFixedEffects` expose the full surface (`hc1`, `hc2`, `hc2_bm`, `cr1`, `cr2`, `conley`, `bootstrap`). Threading `vcov_type` through the 8 standalone estimators (`CallawaySantAnna`, `SunAbraham`, `ImputationDiD`, `TwoStageDiD`, `TripleDifference`, `StackedDiD`, `WooldridgeDiD`, `EfficientDiD`) remains open and is tracked as a single methodology row in the table above (Phase 1a row). +`vcov_type` has subsumed the previously-proposed `se_type` knob. `DifferenceInDifferences` and `TwoWayFixedEffects` accept `vcov_type ∈ {"classical", "hc1", "hc2", "hc2_bm", "conley"}` (the validated set in `linalg.py::_VALID_VCOV_TYPES`); cluster-robust variance is obtained by passing `cluster=` alongside the heteroscedasticity kind (`hc1 + cluster` ⇒ CR1 Liang-Zeger; `hc2_bm + cluster` ⇒ CR2 Bell-McCaffrey, gated by the open weighted-CR2 / absorbed-FE rows in the table above); wild cluster bootstrap is a separate `inference="wild_bootstrap"` path on the same estimator. Threading `vcov_type` through the 8 standalone estimators (`CallawaySantAnna`, `SunAbraham`, `ImputationDiD`, `TwoStageDiD`, `TripleDifference`, `StackedDiD`, `WooldridgeDiD`, `EfficientDiD`) remains open and is tracked as a single methodology row in the table above (Phase 1a row). ### Type Annotations