igerber · May 16, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md
diff --git a/TODO.md b/TODO.md
@@ -74,7 +74,6 @@ Deferred items from PR reviews that were not addressed before merge.
 
 | Issue | Location | PR | Priority |
 |-------|----------|----|----------|
-| BaconDecomposition R parity goldens: `bacondecomp` R package not installed in the local R 4.5.2 library at PR-B authoring time (2026-05-16). R generator script committed at `benchmarks/R/generate_bacon_golden.R`; running it requires `install.packages("bacondecomp")` + `install.packages("jsonlite")` then `cd benchmarks/R && Rscript generate_bacon_golden.R`, writing `benchmarks/data/r_bacondecomp_golden.json`. `tests/test_methodology_bacon.py::TestBaconParityR` (3 tests) skips with a pointer until the JSON lands. The PR-B audit substantiates Theorem 1 (Eqs. 7-9 + 10e-g) via hand-calculable + machine-precision identity tests; R parity is desirable as a cross-language anchor but not the only substantiation. Mirrors StaggeredTripleDifference precedent (PR #245). | `benchmarks/R/generate_bacon_golden.R`, `benchmarks/data/r_bacondecomp_golden.json` (TBD), `tests/test_methodology_bacon.py::TestBaconParityR` | follow-up | Medium |
 | dCDH: Phase 1 per-period placebo DID_M^pl has NaN SE (no IF derivation for the per-period aggregation path). Multi-horizon placebos (L_max >= 1) have valid SE. | `chaisemartin_dhaultfoeuille.py` | #294 | Low |
 | dCDH: Survey cell-period allocator's post-period attribution is a library convention, not derived from the observation-level survey linearization. MC coverage is empirically close to nominal on the test DGP; a formal derivation (or a covariance-aware two-cell alternative) is deferred. Documented in REGISTRY.md survey IF expansion Note. | `chaisemartin_dhaultfoeuille.py`, `docs/methodology/REGISTRY.md` | #408 | Medium |
 | dCDH: Parity test SE/CI assertions only cover pure-direction scenarios; mixed-direction SE comparison is structurally apples-to-oranges (cell-count vs obs-count weighting). | `test_chaisemartin_dhaultfoeuille_parity.py` | #294 | Low |

diff --git a/benchmarks/R/generate_bacon_golden.R b/benchmarks/R/generate_bacon_golden.R
@@ -7,9 +7,21 @@
 #
 # The diff-diff BaconDecomposition implementation (`diff_diff/bacon.py`) with
 # the default ``weights="exact"`` is expected to match the values in this JSON
-# to atol=1e-6 on the per-component (treated, control, type) tuples, and to
-# match the TWFE coefficient to the same tolerance. The ``weights="approximate"``
-# path is a library-only optimization and is NOT covered by this parity harness.
+# at atol=1e-6 along a three-tier contract:
+#   (1) aggregate TWFE coefficient + weights-sum on all 3 fixtures;
+#   (2) direct per-component (treated, control, type) parity on the 2
+#       non-remap fixtures AND on the 6 timing-vs-timing rows of
+#       `always_treated_remapped`;
+#   (3) cohort-level fold-back parity for the U bucket on
+#       `always_treated_remapped` — Python's paper-footnote-11 remap folds
+#       R's separate `Later vs Always Treated` + `Treated vs Untreated`
+#       rows into a single `treated_vs_never` cell per cohort, so the
+#       aggregate is invariant per Theorem 1 but the per-component
+#       breakdown differs by convention. See REGISTRY notes:
+#       `**Note (R parity convention divergence on always-treated)**` and
+#       `**Deviation (first-period boundary extension on always-treated remap)**`.
+# The ``weights="approximate"`` path is a library-only optimization and is
+# NOT covered by this parity harness.
 #
 # Three fixtures:
 #   1. uniform_3groups_with_never_treated — 3 timing groups + never-treated U;
@@ -18,8 +30,8 @@
 #   2. two_groups_no_never_treated — 2 timing groups only; tests the
 #      timing-only decomposition where the s_{kU} terms drop.
 #   3. always_treated_remapped — 3 timing groups + 1 always-treated cohort
-#      (first_treat = 1). Validates that Python's warn+remap of t_i < 1 into
-#      U matches R bacondecomp's native behavior.
+#      (first_treat = 1). Validates the convention-divergent U-bucket
+#      fold-back on Python's warn+remap of always-treated units into U.
 #
 # Run:
 #   cd benchmarks/R && Rscript generate_bacon_golden.R
@@ -193,11 +205,21 @@ df2 <- build_panel(
 fixture_2 <- extract_bacon(df2, "two_groups_no_never_treated")
 
 cat("Building fixture 3: always_treated_remapped...\n")
-# 3 timing-cohorts + 5 always-treated units (first_treat = 1, i.e., treated
-# in every observable period) + 30 never-treated. R's bacondecomp natively
-# groups the first_treat=1 cohort with U (since they are treated throughout
-# every observable period and never serve as a within-window control), which
-# matches what diff-diff's warn+remap does in Python.
+# 3 timing-cohorts (3, 4, 5) + 5 always-treated units (first_treat = 1, i.e.,
+# treated in every observable period) + 25 never-treated. R's bacondecomp
+# keeps the first_treat=1 cohort as a *separate* timing cohort (not in U) and
+# emits a `Later vs Always Treated` comparison row for each later cohort
+# alongside the standard `Treated vs Untreated` row. Python's paper-footnote-11
+# convention remaps these units into the U bucket and folds R's two columns
+# of components into a single `treated_vs_never` cell per treated cohort.
+# The aggregate (TWFE coefficient + weights-sum) is invariant per Theorem 1,
+# but the per-component breakdown differs by convention — see REGISTRY
+# `**Note (R parity convention divergence on always-treated)**` and
+# `**Deviation (first-period boundary extension on always-treated remap)**`.
+# `tests/test_methodology_bacon.py::TestBaconParityR` carves out the U-bucket
+# rows for direct per-component parity (keeping the 6 timing-vs-timing rows
+# under direct parity) and asserts the U-bucket fold-back separately via
+# `test_always_treated_remapped_fold_back_matches_r` at atol=1e-6.
 df3 <- build_panel(
   n_units_per_cohort   = 25L,
   n_periods            = 6L,
@@ -220,8 +242,18 @@ out <- list(
     r_version            = R.version.string,
     description          = paste(
       "Goodman-Bacon (2021) decomposition parity goldens for diff-diff",
-      "BaconDecomposition. Parity target: atol=1e-6 on per-component",
-      "(treated, control, type) tuples plus the TWFE coefficient."
+      "BaconDecomposition. Parity target at atol=1e-6:",
+      "(1) aggregate TWFE coefficient + weights-sum across all 3 fixtures;",
+      "(2) direct per-component (treated, control, type) parity on the 2",
+      "non-remap fixtures AND on the 6 timing-vs-timing rows of",
+      "always_treated_remapped;",
+      "(3) cohort-level fold-back parity for the U bucket on",
+      "always_treated_remapped (Python's paper-footnote-11 remap folds",
+      "R's separate Later-vs-Always-Treated + Treated-vs-Untreated rows",
+      "into a single treated_vs_never cell per cohort; aggregate is",
+      "invariant per Theorem 1, breakdown differs by convention).",
+      "See REGISTRY Note (R parity convention divergence on always-treated)",
+      "+ Deviation (first-period boundary extension)."
     )
   ),
   uniform_3groups_with_never_treated = fixture_1,

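Reviewer sketch of the tier-(3) fold-back described in the generator comments above: R's split `Later vs Always Treated` + `Treated vs Untreated` rows are collapsed into one cell per treated cohort, with the combined weight and the weight-averaged estimate. This is a minimal illustration, not the code in `tests/test_methodology_bacon.py`; the function name `fold_back_u_bucket`, the tuple layout, and the example numbers are all hypothetical, and it assumes strictly positive weights.

```python
from collections import defaultdict

# R comparison types that Python's remap folds into a single treated_vs_never cell.
U_BUCKET_TYPES = {"Later vs Always Treated", "Treated vs Untreated"}


def fold_back_u_bucket(r_rows):
    """Collapse R's U-bucket rows into one (estimate, weight) cell per treated cohort.

    r_rows: iterable of (treated_cohort, comparison_type, estimate, weight).
    This row layout is an assumption for illustration, not the golden JSON schema.
    """
    weight_sum = defaultdict(float)
    weighted_estimate_sum = defaultdict(float)
    for cohort, comparison_type, estimate, weight in r_rows:
        if comparison_type not in U_BUCKET_TYPES:
            continue  # timing-vs-timing rows stay under direct per-component parity
        weight_sum[cohort] += weight
        weighted_estimate_sum[cohort] += weight * estimate
    # Combined weight plus weight-averaged estimate (assumes positive weights).
    return {
        cohort: (weighted_estimate_sum[cohort] / weight_sum[cohort], weight_sum[cohort])
        for cohort in weight_sum
    }


# Made-up rows, purely illustrative.
example_rows = [
    (3, "Later vs Always Treated", 2.0, 0.05),
    (3, "Treated vs Untreated", 1.0, 0.15),
    (3, "Earlier vs Later Treated", 1.5, 0.10),
]
folded = fold_back_u_bucket(example_rows)
```

Because the folded cell carries the same total `weight * estimate` as the rows it replaces, the weighted sum over all components does not change under the fold, which is why only the per-component breakdown needs a separate assertion.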
diff --git a/benchmarks/data/r_bacondecomp_golden.json b/benchmarks/data/r_bacondecomp_golden.json
diff --git a/diff_diff/bacon.py b/diff_diff/bacon.py
@@ -475,7 +475,15 @@ def fit(
             excluding the never-treated sentinels ``0`` and ``np.inf``)
             are automatically remapped to the ``U`` (untreated) bucket
             per Goodman-Bacon (2021) footnote 11, with a
-            ``UserWarning``. Detection uses ordered-time logic on the
+            ``UserWarning``. **Library boundary extension:** the paper
+            uses the strict inequality ``t_i < 1`` (units treated
+            *before* the first observable period); the library uses the
+            **inclusive** ``first_treat <= min(time)`` rule, additionally
+            folding units treated *at* the first observable period
+            (``first_treat == min(time)``) into ``U`` because such units
+            have no untreated cell in-panel. See REGISTRY's
+            ``**Deviation (first-period boundary extension on
+            always-treated remap)**`` block for the full contract. Detection uses ordered-time logic on the
             **time axis** so panels whose ``time`` column contains
             negative or zero-crossing labels (e.g. event-time
             ``time ∈ [-2,..,3]``) are handled correctly; the ``0``
@@ -1302,9 +1310,16 @@ def bacon_decompose(
     >>> from diff_diff import bacon_decompose
     >>>
     >>> # Default: paper-faithful Goodman-Bacon (2021) Theorem 1 weights
-    >>> # (weights="exact"); intended to match R bacondecomp::bacon() at
-    >>> # atol=1e-6 (R parity goldens pending — see TODO.md "R parity
-    >>> # goldens generation" for the deferred validation step).
+    >>> # (weights="exact"); matches R bacondecomp::bacon() at atol=1e-6 on
+    >>> # the aggregate (TWFE coefficient + weights-sum) across all panels,
+    >>> # and on the per-component breakdown when there are no
+    >>> # always-treated / first-period-treated cohorts (i.e. all
+    >>> # non-sentinel first_treat values are strictly greater than
+    >>> # min(time)). For panels with always-treated units, the
+    >>> # per-component breakdown diverges by convention (Python remaps
+    >>> # to U per paper footnote 11; R emits `Later vs Always Treated`);
+    >>> # see REGISTRY note on R parity convention divergence. Validated
+    >>> # via tests/test_methodology_bacon.py::TestBaconParityR.
     >>> results = bacon_decompose(
     ...     data=panel_df,
     ...     outcome='earnings',

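Reviewer sketch of the boundary rule described in the `fit` docstring hunk above, contrasting the paper's strict rule with the library's inclusive `first_treat <= min(time)` rule. It only mirrors the documented contract; the helper names `is_never_treated` and `remapped_to_u` are hypothetical and are not part of `diff_diff/bacon.py`.

```python
import math


def is_never_treated(first_treat):
    """True for the reserved never-treated sentinels 0 and np.inf (math.inf here)."""
    return first_treat == 0 or first_treat == math.inf


def remapped_to_u(first_treat, min_time, inclusive=True):
    """Would a cohort be remapped into U as always-treated?

    inclusive=True  mirrors the documented library rule: first_treat <= min(time)
    inclusive=False mirrors the paper's strict rule:     first_treat <  min(time)
    """
    if is_never_treated(first_treat):
        return False  # already in U via the sentinel, so nothing to remap
    if inclusive:
        return first_treat <= min_time
    return first_treat < min_time


# Event-time panel with time in [-2, ..., 3], as in the docstring example.
min_time = -2
valid_timing_group = remapped_to_u(-1, min_time)                  # False
before_panel = remapped_to_u(-3, min_time)                        # True
boundary_inclusive = remapped_to_u(-2, min_time)                  # True
boundary_strict = remapped_to_u(-2, min_time, inclusive=False)    # False
```

The two rules disagree only for a cohort with `first_treat == min(time)`, which is the case the REGISTRY deviation block calls out.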
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
@@ -2616,7 +2616,7 @@ Shipped in `diff_diff/had_pretests.py` as `stute_joint_pretest()` (residuals-in
 
 *Assumption checks / warnings:*
 - Requires variation in treatment timing (staggered adoption)
-- Always-treated units (`first_treat <= min(time)`, excluding the never-treated sentinels `0` and `np.inf`; paper footnote 11) are automatically remapped to the `U` (untreated) bucket with a `UserWarning`; see the `**Note (always-treated remap)**` below for the full ordered-time / sentinel contract
+- Always-treated units (`first_treat <= min(time)`, excluding the never-treated sentinels `0` and `np.inf`; per paper footnote 11 with a library-convention extension on the first-period boundary case, see `**Deviation (first-period boundary extension)**` below) are automatically remapped to the `U` (untreated) bucket with a `UserWarning`; see the `**Note (always-treated remap)**` below for the full ordered-time / sentinel contract
 - Unbalanced panels are accepted with a `UserWarning`; the paper's Appendix A proof assumes balanced panels
 - Falls back to timing-only comparisons when no never-treated units are present (no untreated group → `s_{kU}` terms drop, weights rescale to sum to 1; **VWCT and ΔATT can still bias the result** — see paper Eqs. 14-15)
 
@@ -2668,7 +2668,7 @@ Where `n_k` is the sample share of timing group `k`, `n_{kℓ} = n_k / (n_k + n_
 - Always-treated units: see `**Note (always-treated remap)**` below
 
 **Reference implementation(s):**
-- R: `bacondecomp::bacon()` (CRAN). Parity script at `benchmarks/R/generate_bacon_golden.R`; goldens pending follow-up R install (see TODO.md).
+- R: `bacondecomp::bacon()` (CRAN). Parity script at `benchmarks/R/generate_bacon_golden.R`; goldens committed at `benchmarks/data/r_bacondecomp_golden.json` (generated against `bacondecomp` 0.1.1 + R 4.5.2). Parity validated at `atol=1e-6` via `tests/test_methodology_bacon.py::TestBaconParityR` (4 tests: TWFE coefficient + weights-sum match across 3 fixtures; per-component estimate + weight parity locked on the 2 non-remap fixtures and on the 6 timing-vs-timing rows of `always_treated_remapped`; the U-bucket convention divergence on `always_treated_remapped` is pinned by a dedicated fold-back test).
 - Stata: `bacondecomp` (SSC). Authors: Goodman-Bacon, Goldring, Nichols (2019).
 
 **Requirements checklist:**
@@ -2678,11 +2678,13 @@ Where `n_k` is the sample share of timing group `k`, `n_{kℓ} = n_k / (n_k + n_
 - [x] Visualization shows weight vs. estimate by comparison type
 - [x] Always-treated remap to U per Goodman-Bacon (2021) footnote 11 (PR-B audit)
 - [x] Hand-calculable Theorem 1 verification: `tests/test_methodology_bacon.py::TestBaconHandCalculation` (7 tests, atol=1e-10)
-- [ ] R `bacondecomp::bacon()` parity at atol=1e-6 (R generator script committed; JSON goldens pending follow-up R install — `tests/test_methodology_bacon.py::TestBaconParityR` skips when missing)
+- [x] R `bacondecomp::bacon()` parity at atol=1e-6 (3 fixtures; TWFE coefficient + weights-sum match across all 3; per-component parity locked on the 2 non-remap fixtures and on the 6 timing-vs-timing rows of `always_treated_remapped`; the U-bucket fold-back is asserted by a dedicated `test_always_treated_remapped_fold_back_matches_r` — see `**Note (R parity convention divergence)**` below)
 - [x] Survey design support (Phase 3): weighted cell means, weighted within-transform, weighted group shares
-- **Note (weight modes):** `weights="exact"` (default, paper-faithful Eqs. 7-9 + 10e-g) vs `weights="approximate"` (simplified variance, opt-in for speed-sensitive diagnostic loops). The PR-A paper review (#451) and PR-B audit established `"exact"` as the default with the **intent** to match R `bacondecomp::bacon()` and the paper's Theorem 1 contract; R parity is validated by hand-calculation (atol=1e-10) and TWFE-vs-weighted-sum identity (atol=1e-10) but the direct R bit-by-bit parity at atol=1e-6 is still pending the R `bacondecomp` install — see Test Coverage checklist above. The approximate path is retained for backward compatibility; numerical output may differ from R.
-- **Note (always-treated remap):** Units whose `first_treat` is at or before the first observable period (`first_treat <= min(time)`, excluding the never-treated sentinels `0` and `np.inf`) are automatically remapped to the `U` bucket via an internal column (`__bacon_first_treat_internal__`) with a `UserWarning` — per paper footnote 11. Detection uses ordered-time logic on the **time axis**, so panels whose `time` column has negative or zero-crossing labels (e.g. event-time `time ∈ [-2,..,3]`) are handled correctly: a cohort at `first_treat=-1` on such a panel is a valid timing group; a cohort at `first_treat=-3` is remapped to U. The user's original `first_treat` column on the input `data` frame is preserved unchanged. The count of remapped units is surfaced via `BaconDecompositionResults.n_always_treated_remapped`. **Sentinel restriction:** `first_treat ∈ {0, np.inf}` is reserved as the never-treated marker and is not configurable today; a real treatment cohort with `first_treat == 0` would be folded into `U` and should be re-labeled to a non-sentinel value before fitting. The `0` reservation applies to `first_treat` only, not to `time`.
+- **Note (weight modes):** `weights="exact"` (default, paper-faithful Eqs. 7-9 + 10e-g) vs `weights="approximate"` (simplified variance, opt-in for speed-sensitive diagnostic loops). The PR-A paper review (#451) and PR-B audit established `"exact"` as the default to match R `bacondecomp::bacon()` and the paper's Theorem 1 contract; R parity is validated at `atol=1e-6` (see `**Note (R parity convention divergence)**` below for the one structural convention difference). Hand-calculation + TWFE-vs-weighted-sum identity hold at `atol=1e-10`. The approximate path is retained for backward compatibility; numerical output may differ from R.
+- **Note (always-treated remap):** Units whose `first_treat` is at or before the first observable period (`first_treat <= min(time)`, excluding the never-treated sentinels `0` and `np.inf`) are automatically remapped to the `U` bucket via an internal column (`__bacon_first_treat_internal__`) with a `UserWarning` — per paper footnote 11 (with a library boundary extension on `first_treat == min(time)`; see `**Deviation (first-period boundary extension)**` below). Detection uses ordered-time logic on the **time axis**, so panels whose `time` column has negative or zero-crossing labels (e.g. event-time `time ∈ [-2,..,3]`) are handled correctly: a cohort at `first_treat=-1` on such a panel is a valid timing group; a cohort at `first_treat=-3` is remapped to U. The user's original `first_treat` column on the input `data` frame is preserved unchanged. The count of remapped units is surfaced via `BaconDecompositionResults.n_always_treated_remapped`. **Sentinel restriction:** `first_treat ∈ {0, np.inf}` is reserved as the never-treated marker and is not configurable today; a real treatment cohort with `first_treat == 0` would be folded into `U` and should be re-labeled to a non-sentinel value before fitting. The `0` reservation applies to `first_treat` only, not to `time`.
 - **Note (Bacon survey diagnostic):** Bacon decomposition with survey weights is diagnostic; exact-sum guarantee holds at machine precision under `weights="exact"` **on balanced panels**. `weights="exact"` requires within-unit-constant survey columns (approximate path accepts time-varying weights).
+- **Note (R parity convention divergence on always-treated):** R `bacondecomp::bacon()` keeps `first_treat=1` (the always-treated cohort) as a separate timing cohort and emits an additional comparison type `Later vs Always Treated` (cohort k vs the always-treated cell) alongside the standard `Treated vs Untreated` row. Python's footnote-11 convention remaps these units to the `U` bucket and folds those R-side rows into a single `treated_vs_never` cell per treated cohort. The aggregate (TWFE coefficient + sum of weights) is invariant to this re-bucketing: Theorem 1's identity still holds because the re-bucketing only re-allocates the U bucket's weight across nested 2x2 cells, while the total weight on `{cohort_k vs U}` is unchanged. The per-component breakdown, however, differs structurally between the two conventions. The R parity test (`tests/test_methodology_bacon.py::TestBaconParityR::test_component_estimates_match_r`) asserts per-component parity at `atol=1e-6` on the 2 fixtures without always-treated units (`uniform_3groups_with_never_treated`, `two_groups_no_never_treated`) AND on the 6 timing-vs-timing rows of `always_treated_remapped`; the carve-out is narrowed to U-bucket rows only (R's `Later vs Always Treated` rows canonicalize to `treated_vs_never` and are dropped alongside the matching Python rows). The R→Python U-bucket fold-back is pinned separately by `test_always_treated_remapped_fold_back_matches_r`, which aggregates R's split `Later vs Always Treated` + `Treated vs Untreated` rows per treated cohort and asserts the combined weight + weight-averaged estimate match Python's single `treated_vs_never` cell at `atol=1e-6`. Aggregate parity (`test_twfe_coef_matches_r`, `test_weights_sum_matches_r`) is locked across all 3 fixtures.
+- **Deviation (first-period boundary extension on always-treated remap):** Paper footnote 11 (Goodman-Bacon 2021) uses the strict inequality `t_i < 1` (units treated *before* the first observable period) for the always-treated bucket. The library applies the **inclusive** `first_treat <= min(time)` rule, which additionally folds units treated *at* the first observable period (`first_treat == min(time)`) into `U`. This is a library boundary convention, not a paper-faithful rule: such units have no untreated cell in the observed panel and therefore cannot contribute to any 2x2 DD as a treated cohort; folding them into the U bucket mirrors the always-treated handling rather than dropping them silently. R `bacondecomp::bacon()` does not apply this boundary fold-back: it keeps `first_treat == min(time)` cohorts in their own bucket and emits `Later vs Always Treated` comparisons (see the **Note (R parity convention divergence on always-treated)** above for how the parity tests handle the resulting structural breakdown difference; aggregate Theorem 1 identity remains invariant). When no cohort has `first_treat == min(time)` (no first-period-treated cohorts), the library rule reduces to the paper's strict rule and the two conventions coincide.
 - **Deviation (unbalanced-panel library extension):** Unbalanced panels are accepted with a `UserWarning` ("Unbalanced panel detected. Bacon decomposition assumes balanced panels. Results may be inaccurate."). Goodman-Bacon (2021) Appendix A's proof assumes a balanced panel; under unbalance, the Theorem 1 identity holds only approximately. The decomposition still returns finite, well-defined outputs but `weights="exact"` does NOT achieve the machine-precision algebraic identity that the balanced-panel claims above describe.
 
 ---
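Reviewer sketch of the aggregate checks that the REGISTRY notes above describe (weights sum to 1, and the weighted sum of component estimates reproduces the TWFE coefficient within an absolute tolerance). This is only an illustration of the identity; `check_theorem1_identity` and the example numbers are hypothetical, and the real assertions live in `tests/test_methodology_bacon.py`.

```python
import math


def check_theorem1_identity(components, twfe_coef, atol=1e-6):
    """Return True if weights sum to 1 and sum(weight * estimate) matches twfe_coef.

    components: iterable of (estimate, weight) pairs, one per 2x2 comparison.
    Both checks use a purely absolute tolerance, as in an atol-style comparison.
    """
    components = list(components)
    total_weight = sum(weight for _, weight in components)
    weighted_sum = sum(estimate * weight for estimate, weight in components)
    weights_ok = math.isclose(total_weight, 1.0, rel_tol=0.0, abs_tol=atol)
    coef_ok = math.isclose(weighted_sum, twfe_coef, rel_tol=0.0, abs_tol=atol)
    return weights_ok and coef_ok


# Made-up components, purely illustrative: 1.25*0.2 + 1.5*0.3 + 2.0*0.5 = 1.7
example_components = [(1.25, 0.2), (1.5, 0.3), (2.0, 0.5)]
identity_holds = check_theorem1_identity(example_components, twfe_coef=1.7)
identity_fails = check_theorem1_identity(example_components, twfe_coef=1.8)
```

Passing `atol=1e-10` instead would correspond to the tighter hand-calculation tolerance quoted in the checklist, while `1e-6` corresponds to the cross-language R parity tolerance.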