Skip to content

BaconDecomposition methodology audit (Goodman-Bacon 2021)#454

Merged
igerber merged 5 commits into
mainfrom
feature/bacon-methodology-audit
May 16, 2026
Merged

BaconDecomposition methodology audit (Goodman-Bacon 2021)#454
igerber merged 5 commits into
mainfrom
feature/bacon-methodology-audit

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 16, 2026

Summary

  • Promotes BaconDecomposition from In ProgressComplete (R parity goldens pending) in METHODOLOGY_REVIEW.md. Operationalizes the paper review landed in PR Add Goodman-Bacon (2021) paper review #451 against diff_diff/bacon.py.
  • Theorem 1 exact-weights rewrite (bacon.py:_recompute_exact_weights): the prior "exact" path was missing the (1 - n_kU) factor in subsample variance, did not square the sample share, and added an extraneous unit_share factor. The post-hoc sum-to-1 normalization masked the relative-weight error but produced ~0.3% decomposition error vs TWFE. Rewrite implements Eqs. 7-9 + 10e-g exactly; TWFE-vs-weighted-sum identity now holds at atol=1e-10.
  • Default weights flipped from "approximate" to "exact" at three entry points (BaconDecomposition.__init__, bacon_decompose, TwoWayFixedEffects.decompose). Approximate remains opt-in. Survey-design behavior change: exact mode rejects time-varying within-unit survey columns; users with such inputs migrate to weights="approximate". Documented in CHANGELOG + the bacon_decompose() docstring.
  • Always-treated warn+remap per paper footnote 11 with ordered-time logic (first_treat <= min(time), excluding sentinels {0, np.inf}): works correctly on event-time panels with negative or zero-crossing labels. Remapping uses an internal column; user's first_treat column is preserved unchanged. Count surfaced as new BaconDecompositionResults.n_always_treated_remapped field. Loop gate for treated_vs_never comparisons uses POST-remap U count so panels whose U is entirely remapped always-treated still emit β̂_{kU} terms.
  • REGISTRY.md ## BaconDecomposition block replaced with paper-review-sourced content + four sub-notes (weight modes, always-treated remap, R parity status, unbalanced-panel deviation). Explicit removal: the prior block's "Weights may be negative for later-vs-earlier comparisons" claim was incorrect — Theorem 1 weights are strictly positive and sum to 1; negative weights are an estimand-level phenomenon at the ATT level, not estimator-level.
  • New tests/test_methodology_bacon.py (~28 tests across 6 classes): hand-checks of Eqs. 7-9 + 10b-d at atol=1e-10 on a hand-derived balanced panel, R parity (skip on missing goldens), always-treated remap mechanics, edge cases (negative time, unbalanced panel, no-untreated, single-cohort), weight modes, survey-design narrowing.
  • New R generator benchmarks/R/generate_bacon_golden.R covering three DGP fixtures. JSON goldens deferred until bacondecomp R package is installed (tracked in TODO.md).

Methodology references (required if estimator / math changes)

  • Method name(s): BaconDecomposition (bacon.py), TwoWayFixedEffects.decompose (twfe.py)
  • Paper / source link(s): Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics 225(2), 254-277. DOI: 10.1016/j.jeconom.2021.03.014. Paper review on file: docs/methodology/papers/goodman-bacon-2021-review.md (PR Add Goodman-Bacon (2021) paper review #451).
  • Any intentional deviations from the source (and why): (a) Unbalanced panels accepted with a UserWarning (paper Appendix A's proof assumes balanced; library extension, Theorem 1 identity holds only approximately); (b) weights="approximate" opt-in fast path with simplified variance not present in R bacondecomp (numerical output may differ from R); (c) weights="exact" requires within-unit-constant survey columns (paper notation does not specify a survey design; library convention from the exact-path per-unit aggregation). All three documented in REGISTRY.md ## BaconDecomposition Notes / Deviation block.

Validation

  • Tests added/updated: tests/test_methodology_bacon.py (new, ~28 tests, 6 classes); tests/test_bacon.py updated (test_weighted_sum_equals_twfe tolerance tightened to < 1e-10; TestWeightsParameter renamed test_approximate_weights_defaulttest_exact_weights_default + test_approximate_weights_opt_in). Hand-calculable Theorem 1 DGP derived in-test with weights {0.3, 0.4, 0.1, 0.2} precomputed from Eqs. 10e-g.
  • Backtest / simulation / notebook evidence (if applicable): 60 tests pass in the methodology-bacon + test-bacon files (3 skipped for R parity). 96 tests pass across all bacon/decompose callers in the broader regression sweep (pytest -k 'bacon or decompose').

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Promotes BaconDecomposition from In Progress → Complete (R parity pending)
in METHODOLOGY_REVIEW.md. Operationalizes the paper review landed in
PR #451 against diff_diff/bacon.py.

Audit findings and corrections:

1. Theorem 1 exact-weights rewrite (bacon.py:_recompute_exact_weights).
   The prior "exact" mode did not actually compute Eqs. 7-9 / 10e-g —
   it was missing the (1 - n_kU) factor in the within-subsample treatment
   variance, did not square the sample share, and added an extraneous
   unit_share factor not present in the paper. The post-hoc sum-to-1
   normalization masked the relative-weight error but produced ~0.3%
   decomposition error vs TWFE (0.007 absolute on a 3-cohort + never-
   treated DGP). Rewrote the function to compute exact numerators of
   Eqs. 10e/f/g and let post-hoc normalization handle the V_hat^D
   denominator (Theorem 1 guarantees V_hat^D = sum(numerators)). Now
   matches TWFE at atol=1e-10 on noisy and hand-calculable DGPs.

2. Default `weights` flipped from "approximate" to "exact" at 3 entry
   points: BaconDecomposition.__init__() (bacon.py:397), bacon_decompose()
   (bacon.py:1064), TwoWayFixedEffects.decompose() (twfe.py:684). The
   approximate path remains opt-in for speed-sensitive diagnostic loops.
   diff_diff/diagnostic_report.py:1740 updated to pass explicit
   weights="exact". The existing test_weighted_sum_equals_twfe tolerance
   was tightened from < 0.1 to < 1e-10 to lock the Theorem 1 algebraic
   identity contract.

   **Survey-design behavior change**: weights="exact" routes through
   _validate_unit_constant_survey, which rejects survey designs whose
   weights / strata / PSU / FPC columns vary within a unit across
   periods. The previous approximate default tolerated time-varying
   within-unit survey weights via observation-level weighted means.
   Migration: pass weights="approximate" explicitly to retain the
   legacy path. Documented in CHANGELOG Changed entry and the new
   bacon_decompose() docstring survey_design parameter block.

3. Always-treated warn+remap per paper footnote 11 (bacon.py:fit()).
   Units with first_treat <= min(time) (excluding never-treated
   sentinels 0 and np.inf) are auto-remapped to U via an internal
   column (__bacon_first_treat_internal__), preserving the user's
   first_treat column unchanged. The count is exposed on the result
   via a new BaconDecompositionResults.n_always_treated_remapped field
   and rendered in summary() when nonzero.

   **n_never_treated reports TRUE never-treated only**, computed from
   the original user column before remap — remapped always-treated
   units appear separately as n_always_treated_remapped, no double-
   counting.

   **Loop gate uses POST-remap U count**: the treated_vs_never
   comparison loop gates on n_units_in_U_bucket (post-remap) so panels
   whose U is composed entirely of remapped always-treated units still
   emit beta_kU^{2x2} terms. Without this distinction the loop would
   silently drop those terms and break the Theorem 1 identity.

   **Ordered-time logic**: detection uses first_treat <= min(time)
   (not positive-sign restriction), so event-time panels with
   negative or zero-crossing period labels (e.g. time ∈ [-2,..,3])
   work correctly. A cohort at first_treat=-1 on such a panel is a
   valid timing group; a cohort at first_treat=-3 is remapped to U.
   Both timing_groups filters updated to exclude only the U
   sentinels, not positive values.

REGISTRY.md replacement:

- Replaced ## BaconDecomposition block with paper-review-sourced content
  plus four sub-notes (weight modes, always-treated remap with ordered-
  time logic, R parity status, unbalanced-panel deviation).
- Explicitly removed the prior block's "Weights may be negative for
  later-vs-earlier comparisons" claim. Theorem 1 weights are strictly
  positive and sum to 1 (positivity is the headline of the theorem);
  negative weights are an estimand-level phenomenon (Borusyak-Jaravel
  2017, de Chaisemartin-D'Haultfoeuille 2020) at the ATT level, not
  estimator-level.
- Narrowed the machine-precision claim to balanced panels only; the
  unbalanced-panel library extension is documented as an explicit
  Deviation block (Goodman-Bacon Appendix A's proof assumes balanced
  panels; under unbalance, the Theorem 1 identity holds only
  approximately, though outputs remain finite).

New artifacts:

- tests/test_methodology_bacon.py (~1050 lines, ~28 tests across 6
  classes):
    - TestBaconHandCalculation: hand-checks Eqs. 7-9 + 10b-d at
      atol=1e-10 on a minimal hand-derived balanced panel (weights
      {0.3, 0.4, 0.1, 0.2} hand-computed from sample shares and
      treatment-share variances).
    - TestBaconParityR: skips on missing R goldens.
    - TestBaconAlwaysTreatedRemap: regression-tests warn+remap
      mechanics including user-data-preservation, U-bucket-only-from-
      remap (the bug from the loop-gate fix), negative first_treat
      as a valid cohort (event-time encoding), and remap of negative
      first_treat below min(time).
    - TestBaconEdgeCases: no-untreated, single-cohort, unbalanced
      panel (finite but NOT machine precision), constant-ATT recovery.
    - TestBaconWeightModes: locks exact-is-default contract.
    - TestBaconSurveyDesignNarrowing: survey_design composes with
      exact mode + warn+remap; defaulted BaconDecomposition(),
      bacon_decompose(), and the survey-time-varying-weights
      rejection contract are pinned.

- benchmarks/R/generate_bacon_golden.R (234 lines): R generator script
  for bacondecomp::bacon() parity goldens across 3 DGP fixtures.
  JSON goldens deferred until bacondecomp R package is installed in
  the local R library (TODO.md follow-up row).

CHANGELOG entries: ### Changed (default flip + survey behavior change),
TODO.md: prior BaconDecomposition row replaced by narrower R-parity
goldens deferral row.

Test results: 96 passed, 3 skipped (R parity) across all bacon/decompose
callers (test_bacon.py, test_methodology_bacon.py, test_business_report,
test_methodology_twfe, test_practitioner, test_target_parameter,
test_survey_phase3, test_survey_phase6).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the feature/bacon-methodology-audit branch from 6b20a4d to 61b30bb Compare May 16, 2026 14:05
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 61b30bb200739930d13aecfdb14a53dd0098ffc6


Overall Assessment
Needs changes

Executive Summary

  • I did not find an unmitigated P0/P1 methodology defect in the exact-weight rewrite itself. The Eqs. 7-9 / 10e-g implementation, always-treated remap, and documented unbalanced-panel extension are internally consistent with the updated registry and paper-review notes.
  • Severity P1: the reporting wrapper now hardcodes weights="exact" but does not expose the documented weights="approximate" fallback, so survey-backed DiagnosticReport/BusinessReport Bacon replays can now degrade to an error on panels with within-unit-varying survey columns.
  • Severity P3: the new zero-crossing support language is broader than the actual API contract, because first_treat == 0 is still reserved as the never-treated sentinel.
  • Severity P3: user-facing parity wording now overstates the validation state; R parity goldens are still tracked as pending in the methodology docs.

Methodology

  • Severity: P3. Impact: the PR now says negative or zero-crossing labels “work correctly,” but 0 is still hardcoded as a never-treated sentinel, so a real cohort with first_treat == 0 would be folded into U rather than treated as a timing group. That is contract/documentation drift rather than a proven flaw in the new Theorem 1 math, but it is still worth fixing because the PR explicitly broadens the supported labeling story. Concrete fix: either narrow the wording to exclude treatment cohorts at 0, or add explicit validation / configurable sentinel support before advertising zero-crossing support. diff_diff/bacon.py:L469-L481, diff_diff/bacon.py:L552-L559, diff_diff/bacon.py:L592-L600, docs/methodology/REGISTRY.md:L2674-L2675

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity: P1. Impact: DiagnosticReport has no public way to select Bacon’s weight mode, yet _check_bacon() now hardcodes weights="exact". That breaks the PR’s own documented migration path for survey-backed panels whose weights / PSU / strata / FPC vary within unit: core Bacon still supports weights="approximate" for that case, but the report wrapper can only produce an "error" block instead of a usable replay. This is the “incomplete parameter propagation” anti-pattern on a changed public behavior. Concrete fix: add a bacon_weights (or similarly named) kwarg to DiagnosticReport, forward it through BusinessReport, and thread it into bacon_decompose(); alternatively, catch the exact-mode unit-constancy failure and downgrade to an explicit skip/fallback path. Add a regression for survey-backed staggered reports with within-unit-varying survey columns. diff_diff/diagnostic_report.py:L316-L330, diff_diff/diagnostic_report.py:L1740-L1747, diff_diff/bacon.py:L517-L524, diff_diff/bacon.py:L1256-L1265, docs/methodology/REPORTING.md:L223-L233

Tech Debt

  • No findings. The deferred R-golden work is already tracked in TODO.md, so I am not counting that as a blocker.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: the new example text says exact mode is “matching R bacondecomp::bacon() at atol=1e-6,” but the repo’s own methodology tracking still says parity goldens are pending and the parity tests skip when the JSON is absent. That overstates the level of shipped validation for a statistical contract. Concrete fix: soften the wording to “intended to match” until the goldens land, or commit benchmarks/data/r_bacondecomp_golden.json in the same PR. diff_diff/bacon.py:L1281-L1283, docs/methodology/REGISTRY.md:L2661-L2671, METHODOLOGY_REVIEW.md:L935-L936

Path to Approval

  1. Expose Bacon’s weight mode on DiagnosticReport and its BusinessReport passthrough, or implement an explicit skip/fallback path instead of hard-failing the Bacon block on exact-mode survey validation.
  2. Add DR/BR regression coverage for a survey-backed staggered panel with within-unit-varying survey columns, asserting the exact-mode failure surface and the approximate-mode success path.

Verification note: I could not execute tests/test_bacon.py or tests/test_methodology_bacon.py in this environment because pytest and the project’s Python deps (for example pandas) are not installed here.

P1 (Maintainability): DiagnosticReport._check_bacon now emits a
structured `status="skipped"` block when bacon_decompose raises the
within-unit-varying-survey ValueError from `_validate_unit_constant_survey`,
instead of falling through to the opaque `status="error"` block. The skip
reason names the precomputed={'bacon': ...} + explicit weights="approximate"
escape hatch so survey-backed reports on panels with time-varying within-
unit weights / strata / PSU / FPC retain a documented migration path.
Other ValueErrors (and other exception types) still surface via the
existing `error` block. New regression at
TestBaconSurveyDesignNarrowing::test_diagnostic_report_skips_with_structured_reason_on_time_varying_survey.

P3 (Methodology, zero-crossing wording): clarify that "negative or
zero-crossing labels work correctly" applies to the **time axis** only
(`time` column with event-time labels like [-2,..,3]), not to
`first_treat`. `first_treat ∈ {0, np.inf}` remains reserved as the
never-treated sentinel and is not configurable today; a real treatment
cohort with `first_treat == 0` would be folded into U and should be
re-labeled. Updated docstring, code comment, REGISTRY note, and
CHANGELOG entry to surface this restriction explicitly.

P3 (Documentation/Tests, R parity claim): soften the bacon_decompose
docstring example from "matching R bacondecomp::bacon() at atol=1e-6"
to "intended to match R bacondecomp::bacon() at atol=1e-6 (R parity
goldens pending  see TODO.md)" since the JSON goldens are still
unshipped. REGISTRY weight-modes note refined to call out hand-calc /
TWFE-identity validation at atol=1e-10 vs. the pending direct R parity.

Tests: 61 pass in test_methodology_bacon.py + test_bacon.py (one new
regression), 3 skipped (R parity); 97 pass across the broader
bacon/decompose surface.
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: c88667e0b54b72c3da258829385dfe9950c01fbe


Overall Assessment
✅ Looks good

Executive Summary

  • No unmitigated P0/P1 findings in the changed Bacon estimator paths. The exact-weight rewrite in diff_diff/bacon.py:L812-L970 matches the Theorem 1 / Eqs. 7-9, 10e-g contract documented in docs/methodology/REGISTRY.md:L2613-L2648 and docs/methodology/papers/goodman-bacon-2021-review.md:L50-L79.
  • The always-treated handling now matches Goodman-Bacon footnote 11: remap happens before decomposition, the treated_vs_never loop now gates on the post-remap U bucket, and regression tests cover the previously missing U-only-remapped and negative-time cases. diff_diff/bacon.py:L545-L670, tests/test_methodology_bacon.py:L539-L707
  • The previous P1 on reporting wrappers is addressed. DiagnosticReport._check_bacon() now returns a structured skip with an explicit weights="approximate" / precomputed={'bacon': ...} escape hatch, and BusinessReport already forwards the required passthrough args. diff_diff/diagnostic_report.py:L1737-L1772, diff_diff/business_report.py:L404-L420
  • Remaining issues are P3-only documentation drift: some high-level prose still uses the old positive-only always-treated condition, and one CHANGELOG sentence still overstates direct R parity.
  • I could not run pytest here because pytest is not installed in this environment.

Methodology

  • Severity: P3. Impact: high-level methodology prose still says the always-treated remap applies to 0 < first_treat <= min(time), but the implementation and the detailed registry note now use ordered-time first_treat <= min(time) excluding sentinels. That can mislead users with negative-coded / event-time panels about what gets remapped to U. Concrete fix: align the summary wording in docs/methodology/REGISTRY.md:L2607-L2611, METHODOLOGY_REVIEW.md:L923-L941, and tests/test_methodology_bacon.py:L19-L20 / L491-L492 to the detailed wording already used in docs/methodology/REGISTRY.md:L2673-L2676 and diff_diff/bacon.py:L467-L487.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings. The prior re-review blocker on the report wrapper is resolved by the new structured-skip path in diff_diff/diagnostic_report.py:L1749-L1772.

Tech Debt

  • Severity: P3. Impact: direct R parity goldens are still outstanding, but this is explicitly tracked in TODO.md:L77-L77, and the PR ships both the generator and skip-aware parity tests, so it is mitigated and non-blocking under the stated rubric. Concrete fix: generate and commit benchmarks/data/r_bacondecomp_golden.json once bacondecomp is available.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: CHANGELOG.md:L16-L16 still says the new default exact path is “matching R bacondecomp::bacon(),” while the methodology review, registry, TODO, and the new parity tests all state that direct R parity is still pending until the golden JSON is generated. That overstates the shipped validation state. Concrete fix: soften that sentence to “intended to match” or “R parity pending” until benchmarks/data/r_bacondecomp_golden.json lands. METHODOLOGY_REVIEW.md:L929-L936, TODO.md:L77-L77, tests/test_methodology_bacon.py:L300-L309

R2 verdict was ✅ Looks good with three P3 informational items. Closes
all three:

1. Prose alignment: high-level summary text in `REGISTRY.md`,
   `METHODOLOGY_REVIEW.md`, and the test module docstring/class
   docstring used the old positive-only `0 < first_treat <= min(time)`
   condition for the always-treated remap, while the implementation and
   the detailed registry note already use ordered-time logic excluding
   the `{0, np.inf}` sentinels. Aligned the summary wording across all
   four surfaces (REGISTRY.md:2609, METHODOLOGY_REVIEW.md:923 +
   Corrections Made entry, test_methodology_bacon.py header + class
   docstring) to match the detailed-note phrasing.

2. CHANGELOG R-parity claim: the Changed entry still said the new
   default is "matching R `bacondecomp::bacon()`" while the methodology
   review, registry, TODO, and parity tests all show direct R parity is
   pending until the goldens land. Softened to "intended to match"
   with the validation state spelled out (hand-calc + TWFE-vs-weighted-
   sum identity at atol=1e-10; direct R bit parity pending).

3. R parity goldens TODO row: non-blocking, already tracked in TODO.md;
   no action required this PR.

Tests: 61 pass in test_methodology_bacon.py + test_bacon.py (same as
post-R1), 3 skipped (R parity).
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 785c65f92f615c5654b625171b2e3e973dfea341


Overall Assessment
✅ Looks good

Executive Summary

Methodology

  • No findings.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3. Impact: direct bacondecomp parity is still pending until the golden JSON is generated, but this is explicitly tracked and the new parity tests skip cleanly, so it is mitigated and non-blocking under the stated rubric. Concrete fix: run benchmarks/R/generate_bacon_golden.R, commit benchmarks/data/r_bacondecomp_golden.json, and let tests/test_methodology_bacon.py execute in CI. TODO.md

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: the new always-treated warning tells users to “pass first_treat=0 or first_treat=np.inf,” but first_treat is the column-name argument, not a value slot, so the remediation text is misleading when the warning fires. Concrete fix: rewrite the warning in diff_diff/bacon.py (lines 573-580) to tell users to recode the affected rows’ first_treat values to 0 or np.inf before fitting.
  • Severity: P3. Impact: the test prose still mixes the paper-review shorthand t_i < 1 with the merged implementation/registry contract first_treat <= min(time), which keeps the equality-at-first-observed-period case slightly ambiguous for future reviewers. Concrete fix: align the class docstring in tests/test_methodology_bacon.py with the merged registry wording in docs/methodology/REGISTRY.md, or explicitly label the == min(time) case as a library/R-alignment convention.

R3 verdict was ✅ Looks good with two P3 informational items:

1. Warning string accuracy: the always-treated remap warning told users
   to "pass first_treat=0 or first_treat=np.inf explicitly to silence
   this warning," but `first_treat` is the column-name argument to
   `fit()`, not a value slot. Rewrote to "recode the affected rows'
   first_treat values to 0 or np.inf in your input data before fitting"
   to point users at the actual remediation surface.

2. Test class docstring: clarified that the implementation contract
   `first_treat <= min(time)` includes the `== min(time)` boundary case
   that the paper's strict `t_i < 1` shorthand excludes, and explained
   the pragmatic rationale (units treated at the first observable
   period have no untreated cell and cannot contribute to any 2x2 DD
   as a treated cohort). Aligns with merged registry note wording and
   removes the shorthand-vs-implementation ambiguity that the R3
   reviewer flagged.

Tests: 61 pass in test_methodology_bacon.py + test_bacon.py, 3 skipped
(R parity).
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: d35fe915678bed2377a901d544fc516c91cbc251


Overall Assessment

Looks good. I found no unmitigated P0/P1 issues in the changed Bacon methodology, weighting, remap, or default-propagation paths.

Executive Summary

  • The exact-weight rewrite in diff_diff/bacon.py aligns with the Goodman-Bacon Theorem 1 formulas captured in the paper review and registry, and I did not find an undocumented methodology deviation in the changed estimator code.
  • The prior re-review P3 about the misleading always-treated warning is resolved: the warning now correctly tells users to recode affected first_treat values, not the first_treat argument name, in diff_diff/bacon.py.
  • Default propagation is consistent across BaconDecomposition, bacon_decompose, TwoWayFixedEffects.decompose, and the diagnostic-report replay path.
  • Remaining items are P3 only: one partial wrapper fix in DiagnosticReport._check_bacon(), and one public-docstring drift where wrapper docs still omit the new remap/sentinel contract.
  • R parity goldens are still pending, but that follow-up is explicitly tracked in TODO.md and the new parity tests skip cleanly until the JSON fixture exists.
  • Static review only: I could not run local pandas/R validation in this environment.

Methodology

Code Quality

  • Severity: P3. Impact: DiagnosticReport._check_bacon() now downgrades the within-unit-varying-survey failure to status="skipped", but replicate-weight survey designs still fall through the generic exception path and surface as status="error", even though Bacon explicitly declares that design unsupported. This is not a correctness bug, but it leaves the new skip-vs-error fix partial in the same wrapper path. Concrete fix: add an except NotImplementedError as exc branch before the generic handler in diff_diff/diagnostic_report.py:L1749-L1780, and return a structured skip keyed to the rejection raised in diff_diff/bacon.py:L515-L520.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3. Impact: direct R parity against bacondecomp::bacon() is still deferred, so the new exact path is substantiated by hand-calculation and TWFE-identity tests rather than committed cross-language goldens. This is already tracked and therefore mitigated. Concrete fix: generate and commit benchmarks/data/r_bacondecomp_golden.json from benchmarks/R/generate_bacon_golden.R, then let tests/test_methodology_bacon.py::TestBaconParityR run in CI. See TODO.md:L77.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: the public wrapper docstrings in bacon_decompose() and TwoWayFixedEffects.decompose() still describe only 0/np.inf as the never-treated contract, while the changed implementation now also remaps first_treat <= min(time) into U and reserves first_treat == 0 as a sentinel. Users reading wrapper help can miss the behavior change even though BaconDecomposition.fit() and the registry document it. Concrete fix: mirror the expanded first_treat contract from diff_diff/bacon.py:L467-L487 into diff_diff/bacon.py:L1252-L1277 and diff_diff/twfe.py:L703-L714, or explicitly point those docstrings at docs/methodology/REGISTRY.md:L2674-L2676.

R4 verdict was Looks good with two P3 polish items (plus the standing
R-parity-goldens follow-up, which is tracked and non-blocking). Both
addressed:

1. Replicate-weight skip path (P3 Code Quality): the R1 skip-vs-error
   fix on `DiagnosticReport._check_bacon` only handled the within-unit-
   varying-survey ValueError case. Replicate-weight survey designs
   raise NotImplementedError from `BaconDecomposition.fit()` and were
   still falling through to the generic exception handler as
   `status="error"`. Added an `except NotImplementedError` branch that
   returns a structured skip with a reason naming the supported
   alternative (TSL-based design via SurveyDesign + precomputed=
   escape hatch). New regression at
   `TestBaconSurveyDesignNarrowing::test_diagnostic_report_skips_with_structured_reason_on_replicate_weights`.

2. Wrapper docstring mirror (P3 Documentation/Tests): the public
   `bacon_decompose()` and `TwoWayFixedEffects.decompose()` wrappers
   had short `first_treat` parameter descriptions that mentioned only
   the 0/np.inf never-treated convention, omitting the new
   always-treated remap + sentinel reservation contract that lives in
   `BaconDecomposition.fit()`. Mirrored the expanded contract into
   both wrappers (with a pointer to the full docstring on the class)
   so users reading wrapper help see the behavior change.

Tests: 62 pass in test_methodology_bacon.py + test_bacon.py (one new
regression), 3 skipped (R parity); 98 pass across the broader
bacon/decompose surface.
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 5a508343da40a18d3c3f3867571d3f87997e894c


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • The exact-weight rewrite in diff_diff/bacon.py:L813-L971 matches the Goodman-Bacon Theorem 1 formulas documented in docs/methodology/papers/goodman-bacon-2021-review.md:L50-L79 and docs/methodology/REGISTRY.md:L2613-L2648; I did not find an undocumented estimator, weighting, or variance mismatch in the changed path.
  • Default propagation is consistent across BaconDecomposition, bacon_decompose, TwoWayFixedEffects.decompose, and diagnostic-report replay: diff_diff/bacon.py:L426-L438, diff_diff/bacon.py:L1226-L1343, diff_diff/twfe.py:L677-L770, diff_diff/diagnostic_report.py:L1737-L1807.
  • The prior re-review P3s are resolved: wrapper docstrings now carry the remap/sentinel/default contract, and DiagnosticReport._check_bacon() now converts replicate-weight unsupported designs into a structured skip instead of a generic error: diff_diff/bacon.py:L1252-L1288, diff_diff/twfe.py:L703-L726, diff_diff/diagnostic_report.py:L1777-L1801.
  • The always-treated remap, post-remap U-loop gating, and negative-time handling are covered by targeted regressions in tests/test_methodology_bacon.py:L489-L722.
  • Remaining follow-up is P3 only: R parity goldens are still pending, but that is explicitly tracked in TODO.md:L77-L77 and the parity tests skip cleanly until the fixture exists.
  • Static review only: I could not execute the test suite in this environment because pytest is not installed.

Methodology

  • No findings. The changed estimator math and default behavior are consistent with the current Methodology Registry contract and the Goodman-Bacon paper review.

Code Quality

  • No findings. The prior partial wrapper-handling issue is resolved in diff_diff/diagnostic_report.py:L1749-L1801.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3 (tracked in TODO.md). Impact: direct bacondecomp::bacon() parity is still deferred because benchmarks/data/r_bacondecomp_golden.json has not been committed yet, so CI currently validates the new path via hand-calculation and TWFE-identity tests rather than cross-language goldens. Concrete fix: install bacondecomp and jsonlite, run benchmarks/R/generate_bacon_golden.R, commit benchmarks/data/r_bacondecomp_golden.json, and let tests/test_methodology_bacon.py::TestBaconParityR run normally. Refs: TODO.md:L77-L77, benchmarks/R/generate_bacon_golden.R:L1-L234, tests/test_methodology_bacon.py:L301-L445.

Security

  • No findings.

Documentation/Tests

  • No findings. The previous docstring drift is resolved, and the new regression coverage is appropriately targeted at the methodology-risky paths, including remap behavior and diagnostic-report survey skips: diff_diff/bacon.py:L1252-L1288, diff_diff/twfe.py:L703-L726, tests/test_methodology_bacon.py:L1020-L1141.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 16, 2026
@igerber igerber merged commit 294b57c into main May 16, 2026
33 of 34 checks passed
@igerber igerber deleted the feature/bacon-methodology-audit branch May 16, 2026 17:01
igerber added a commit that referenced this pull request May 16, 2026
Closes the PR #454 deferred R parity follow-up (TODO.md row removed).

Generated `benchmarks/data/r_bacondecomp_golden.json` from the committed
`benchmarks/R/generate_bacon_golden.R` script against `bacondecomp 0.1.1`
on R 4.5.2. Three DGP fixtures: `uniform_3groups_with_never_treated`,
`two_groups_no_never_treated`, `always_treated_remapped`.

Parity results at atol=1e-6 via `tests/test_methodology_bacon.py::TestBaconParityR`:
- TWFE coefficient: ✅ matches across all 3 fixtures
- Weights-sum:      ✅ matches across all 3 fixtures
- Per-component:    ✅ on the 2 non-remap fixtures; **structural convention
  divergence** on `always_treated_remapped` (skipped per-component, kept
  aggregate). R keeps `first_treat=1` as a distinct timing cohort and
  emits `Later vs Always Treated` comparisons; Python's paper-footnote-11
  convention remaps those units to `U` and folds them into a single
  `treated_vs_never` cell per treated cohort. The aggregate is invariant
  per Theorem 1 — the U bucket's weight is re-allocated across nested
  2x2 cells but the total weight on {cohort_k vs U} is identical. Only
  the per-component breakdown differs structurally between conventions.

Tracker promotions:
- METHODOLOGY_REVIEW.md: BaconDecomposition status row → **Complete**
  (was `**Complete** (R parity pending)`); removed from In Progress
  prose mention; removed from Priority Order substantive-review list;
  Test Coverage count refreshed (24 → 33); R Comparison Results block
  rewritten as **Validated**.
- docs/methodology/REGISTRY.md: Reference Implementations bullet + Verified
  Components checklist + Note (weight modes) updated; new Note (R parity
  convention divergence on always-treated) documents the convention.
- TODO.md: BaconDecomposition R parity goldens row removed.
- CHANGELOG.md: new `[Unreleased]` Added bullet for the close-out;
  PR-B Changed entry tightened ("intended to match" → "matching ... at
  atol=1e-6").
- diff_diff/bacon.py: `bacon_decompose` docstring example wording
  tightened from "intended to match" to "matches" with TestBaconParityR
  pointer.

Tests: 33/33 pass in test_methodology_bacon.py (no skips; was 30+3
skipped); 32 pass in test_bacon.py; 101 pass across the broader
bacon/decompose surface (was 98+3 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request May 16, 2026
…e-out

R6 verdict was Looks good with 1 P3 informational item: the older PR-B
audit bullet at CHANGELOG.md:13 (added in PR #454) still described
the pre-goldens deferral state ("JSON goldens deferred", "TestBaconParityR
skips with a pointer when goldens missing", "status flipped to
**Complete (R parity goldens pending)**"). That contradicts the new
PR-457 bullet at CHANGELOG.md:11 (committed goldens + 4 active parity
tests) within the same [Unreleased] section, so the release notes
read as internally inconsistent.

Updated 3 strings in the PR-B bullet to reflect the within-release
close-out:
- Status flip wording: now says the (R parity pending) caveat was
  closed by the parity-goldens bullet above in this same release.
- TestBaconParityR description: 4 tests, all active post-release;
  skips only in partial-checkout scenarios.
- (4) outcome: parity goldens deferral was closed within this release.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant