
Port R did package tests, fix not_yet_treated bugs #207

Merged
igerber merged 2 commits into main from cs-edid-tests
Mar 18, 2026

Conversation

@igerber (Owner) commented Mar 18, 2026

Summary

  • Port ~21 tests from R's did package (bcallaway11/did) to cross-validate CallawaySantAnna estimator
  • Tests organized as Tier 1 (Python DGP, loose tolerance) and Tier 2 (R golden values, strict tolerance)
  • Fix: not_yet_treated control group now works without never-treated units (requires ≥2 cohorts)
  • Fix: not_yet_treated control mask uses max(t, base_period) to prevent contamination when base_period="universal" exceeds evaluation period
  • Update REGISTRY.md to document corrected control mask formula and relaxed never-treated requirement
  • Add feature gaps section to TODO.md (repeated cross-sections, sampling weights, calendar time aggregation)
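The corrected control mask described above can be sketched as follows. This is a minimal illustration, not the actual code in diff_diff/staggered.py; the helper name and signature are hypothetical, with never-treated units encoded as first_treat = inf.

```python
import numpy as np

def not_yet_treated_mask(first_treat, g, t, base_period, anticipation=0):
    """Hypothetical sketch of the corrected not_yet_treated control mask.

    A unit qualifies as a control for ATT(g, t) if it is never treated
    (first_treat = inf) or first treated strictly after
    max(t, base_period) + anticipation; cohort g is always excluded.
    Taking max(t, base_period) prevents contamination when
    base_period="universal" places the base period after t.
    """
    first_treat = np.asarray(first_treat, dtype=float)
    cutoff = max(t, base_period) + anticipation
    return (first_treat > cutoff) & (first_treat != g)

# Cohorts first treated at 4, 5, 8, and never (inf); evaluate
# ATT(g=4, t=3) with a universal base period of 6.
mask = not_yet_treated_mask(
    np.array([4.0, 5.0, 8.0, np.inf]), g=4, t=3, base_period=6
)
```

With a naive cutoff of t = 3, the cohort first treated at period 5 would be admitted as a control even though it is already treated by the universal base period 6; the max(t, base_period) cutoff excludes it.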

Methodology references (required if estimator / math changes)

  • Method name(s): Callaway & Sant'Anna (2021) — not-yet-treated control group construction
  • Paper / source link(s): Callaway & Sant'Anna (2021, JoE); R reference implementation did::att_gt()
  • Any intentional deviations from the source: None — fixes align diff-diff with R's did package behavior

Validation

  • Tests added/updated:
    • tests/test_csdid_ported.py — 30 tests (20 Tier 1 + 10 Tier 2), all passing
    • tests/helpers/csdid_dgp.py — R DGP translation for Tier 1 tests
    • benchmarks/R/generate_csdid_test_values.R — generates golden values for Tier 2
    • benchmarks/data/csdid_golden_values.json — committed golden values (786KB)
  • Existing tests unaffected: 184/184 pass in test_staggered.py + test_methodology_callaway.py
  • Pure Python mode verified: DIFF_DIFF_BACKEND=python pytest tests/test_csdid_ported.py — all pass

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

Port ~21 tests from R's `did` package (bcallaway11/did) to validate
CallawaySantAnna estimator. Tests organized as Tier 1 (Python DGP, loose
tolerance, always-run) and Tier 2 (R golden values, strict tolerance).

Found and fixed two bugs via the ported tests:
- not_yet_treated control group now works without never-treated units
  (requires ≥2 treatment cohorts)
- not_yet_treated control mask now uses max(t, base_period) to prevent
  contamination when base_period="universal" exceeds evaluation period

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the estimator change. The methodology-sensitive not_yet_treated fix appears aligned with the project’s registry and the in-code Callaway & Sant’Anna reference. This review is static only; I could not run the new tests in this environment because the review image is missing test/runtime dependencies.

Executive Summary

  • The affected method is Callaway & Sant’Anna (2021) ATT(g,t) construction for control_group="not_yet_treated", especially the base_period/anticipation interaction. I did not find an undocumented methodology deviation in the estimator paths.
  • The updated control-mask formula is applied consistently in all ATT(g,t) implementations, and the relaxed “no never-treated units” rule for not_yet_treated is documented in the Methodology Registry.
  • Newly supported all-treated not_yet_treated runs can now return results while still reporting n_control_units as the count of never-treated units only, so summaries may say Control units: 0 even when later-treated cohorts are serving as controls.
  • Two new exact-match R goldens are generated with panel=FALSE but compared against the Python panel estimator anyway, which overstates current source-material validation for those scenarios.
  • The new TODO.md entries properly track deferred R-compatibility gaps and are non-blocking by project policy.

Methodology

Code Quality

  • Severity: P2. Impact: fit() still computes n_control_units from the never-treated count only, then stores that in a results object whose summary labels it as Control units. With this PR’s new support for all-treated not_yet_treated samples, successful runs can now report Control units: 0 even though later-treated cohorts are actively used as controls. That is misleading user-facing metadata in a newly supported configuration. Concrete fix: Make the summary label conditional for not_yet_treated or split the metadata into n_never_treated_units plus a documented note that the actual control pool is dynamic by (g,t). References: diff_diff/staggered.py#L1134, diff_diff/staggered.py#L1141, diff_diff/staggered.py#L1377, diff_diff/staggered_results.py#L90, diff_diff/staggered_results.py#L155
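One way to sketch the suggested metadata split. The class and field names here are illustrative, not the actual diff_diff/staggered_results.py API:

```python
from dataclasses import dataclass

@dataclass
class ControlMetadata:
    """Illustrative split of the ambiguous n_control_units field."""
    n_never_treated_units: int
    control_group: str

    def summary_label(self) -> str:
        # For not_yet_treated, the control pool is dynamic by (g, t):
        # later-treated cohorts also serve as controls, so labeling the
        # never-treated count as "Control units" would mislead.
        if self.control_group == "not_yet_treated":
            return f"Never-treated units: {self.n_never_treated_units}"
        return f"Control units: {self.n_never_treated_units}"

# An all-treated not_yet_treated run no longer prints "Control units: 0".
meta = ControlMetadata(n_never_treated_units=0, control_group="not_yet_treated")
label = meta.summary_label()
```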

Performance

No findings.

Maintainability

No findings beyond the metadata issue above.

Tech Debt

  • Severity: P3 informational. Impact: The added TODO entries correctly track deferred gaps relative to R did for repeated cross-sections, weights, and calendar-time aggregation, so those items are non-blocking under the project’s review policy. Concrete fix: None. Reference: TODO.md#L157

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: Two new “golden value” validations are not apples-to-apples with the supported estimator. The R fixture generator explicitly uses panel = FALSE for dynamic_effects and non_consecutive_periods, but the Python tests compare those fixtures against CallawaySantAnna(...) with no repeated-cross-section mode. Since TODO.md simultaneously tracks repeated cross-sections as a blocked feature, these tests overstate current source-material validation rather than actually proving parity. Concrete fix: Regenerate those fixtures with panel=TRUE if the goal is to validate the supported panel estimator, or skip/xfail these two goldens until repeated cross-sections are implemented. At minimum, have the Python harness read scenario["params"] and reject unsupported exact comparisons such as panel=False. References: benchmarks/R/generate_csdid_test_values.R#L170, benchmarks/R/generate_csdid_test_values.R#L199, tests/test_csdid_ported.py#L994, tests/test_csdid_ported.py#L1022, TODO.md#L157
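The suggested harness guard could look like the following sketch. The scenario schema (a `params` dict per scenario with an optional `panel` key) is assumed from the finding above, not taken from the actual test code:

```python
def reject_unsupported(scenarios):
    """Partition golden scenarios by whether the Python panel estimator
    can validly reproduce them.

    Scenarios generated with panel=FALSE (repeated cross-sections) are
    routed to `skip` with a reason, so exact-match comparisons against
    the panel-only estimator are never silently run.
    """
    keep, skip = {}, {}
    for name, scenario in scenarios.items():
        params = scenario.get("params", {})
        if params.get("panel", True) is False:
            skip[name] = "panel=False: repeated cross-sections unsupported"
        else:
            keep[name] = scenario
    return keep, skip

# Example: one panel=FALSE fixture and one default (panel) fixture.
keep, skip = reject_unsupported({
    "dynamic_effects": {"params": {"panel": False}},
    "two_group": {"params": {}},
})
```

In a pytest harness, entries in `skip` could be surfaced via `pytest.skip` or an `xfail` marker rather than dropped silently.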

…d panel=FALSE golden values

- Rename summary label "Control units:" → "Never-treated units:" in CallawaySantAnnaResults
  to avoid confusion when not_yet_treated control group has zero never-treated units
- Clarify n_control_units docstring to note it excludes not-yet-treated dynamic controls
- Remove panel=FALSE from scenarios 4 and 5 in R golden value generator so golden
  values match the Python panel estimator (previously overstated validation)
- Regenerate csdid_golden_values.json

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: c8cd414a64bffcb15b610d10edf702a16b1f9ced


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings. The methodology-sensitive estimator change is aligned with the local Methodology Registry, the prior re-review issues appear addressed, and the remaining concern is a minor test-coverage gap rather than a correctness defect.

Executive Summary

  • The affected method is Callaway & Sant'Anna (2021) ATT(g,t) construction for control_group="not_yet_treated" under base_period and anticipation. The updated control rule first_treat > max(t, base_period) + anticipation, with cohort g excluded from its own control pool, is implemented consistently across the general, vectorized, and covariate-regression paths: diff_diff/staggered.py:L449-L459, diff_diff/staggered.py:L631-L638, diff_diff/staggered.py:L789-L797, diff_diff/staggered.py:L856-L862.
  • The relaxed all-treated eligibility rule is documented and enforced correctly: not_yet_treated no longer requires never-treated units, but it does require at least two treatment cohorts when none exist. I did not find an undocumented methodology deviation or a missing assumption check in that new path: diff_diff/staggered.py:L1136-L1148, docs/methodology/REGISTRY.md:L409-L419, docs/methodology/REGISTRY.md:L424-L425.
  • The prior metadata finding is resolved. Results now document n_control_units as never-treated units only, and the printed summary label matches that meaning: diff_diff/staggered_results.py:L90-L92, diff_diff/staggered_results.py:L153-L159.
  • The prior panel-mismatch concern also appears resolved. The current R golden generator does not create exact-match panel=FALSE scenarios; the only explicit panel override is panel = TRUE in the two-group case: benchmarks/R/generate_csdid_test_values.R:L170-L202, benchmarks/R/generate_csdid_test_values.R:L312-L316.
  • Minor remaining gap: the new R golden fixtures export SEs, but the Python exact-match tests only assert ATT parity, so inference parity is not yet pinned down by the added fixtures: benchmarks/R/generate_csdid_test_values.R:L10-L13, benchmarks/R/generate_csdid_test_values.R:L22-L39, tests/test_csdid_ported.py:L946-L954, tests/test_csdid_ported.py:L1011-L1020, tests/test_csdid_ported.py:L1099-L1107, tests/test_csdid_ported.py:L1196-L1203.
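The relaxed all-treated eligibility rule described above could be sketched like this. The function is illustrative, not the actual guard in diff_diff/staggered.py, and never-treated units are again encoded as first_treat = inf:

```python
import numpy as np

def check_not_yet_treated_eligible(first_treat):
    """Illustrative validation for control_group="not_yet_treated".

    Never-treated units are no longer required; but when none exist,
    there must be at least two distinct treatment cohorts so that
    later-treated cohorts can serve as controls for earlier ones.
    """
    first_treat = np.asarray(first_treat, dtype=float)
    has_never_treated = bool(np.isinf(first_treat).any())
    n_cohorts = np.unique(first_treat[np.isfinite(first_treat)]).size
    if not has_never_treated and n_cohorts < 2:
        raise ValueError(
            "not_yet_treated requires never-treated units or "
            "at least two treatment cohorts"
        )

check_not_yet_treated_eligible([4.0, 6.0, 8.0])  # OK: three cohorts
check_not_yet_treated_eligible([4.0, np.inf])    # OK: never-treated present
# check_not_yet_treated_eligible([4.0, 4.0])     # would raise ValueError
```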

Methodology

  • Severity: P3 informational. Impact: The PR changes Callaway & Sant'Anna ATT(g,t) control construction for not_yet_treated, especially when base_period="universal" makes the base period later than the evaluation period. The implementation now uses max(t, base_period) and excludes cohort g in all estimator paths, and the registry documents that same rule plus the all-treated >= 2-cohort guard. I did not find an undocumented methodology mismatch, incorrect control-group composition, incorrect variance/SE logic, or a missing assumption check in the changed code. Concrete fix: None. References: diff_diff/staggered.py:L449-L459, diff_diff/staggered.py:L631-L638, diff_diff/staggered.py:L789-L797, diff_diff/staggered.py:L856-L862, diff_diff/staggered.py:L1136-L1148, docs/methodology/REGISTRY.md:L354-L357, docs/methodology/REGISTRY.md:L409-L419.

Code Quality

  • No findings. The previously misleading Control units summary is now corrected to Never-treated units, which matches the stored field semantics: diff_diff/staggered_results.py:L90-L92, diff_diff/staggered_results.py:L153-L159.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3 informational. Impact: The new TODO entries cleanly track remaining R did feature gaps for repeated cross-sections, sampling weights, and calendar-time aggregation, so those deferred items are non-blocking under the stated review policy. Concrete fix: None. References: TODO.md:L157-L165.

Security

  • No findings.

Documentation/Tests

  • Severity: P3 informational. Impact: The new golden generator exports both ATT and SE outputs for group-time and aggregation results, but the Python exact-match tests currently compare ATT values only. That means the new fixtures validate point-estimate parity, but not analytical SE/inference parity, even on the methodology-sensitive control-group change in this PR. Concrete fix: Add exact-match assertions for exported se and overall_se fields in representative group-time and aggregation scenarios, or document explicitly that the new goldens are ATT-only. References: benchmarks/R/generate_csdid_test_values.R:L10-L13, benchmarks/R/generate_csdid_test_values.R:L22-L39, tests/test_csdid_ported.py:L946-L954, tests/test_csdid_ported.py:L1011-L1020, tests/test_csdid_ported.py:L1099-L1107, tests/test_csdid_ported.py:L1196-L1203.
  • Validation note: pytest is not installed in this environment, so I could not execute the new suite here.
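A hedged sketch of what the suggested SE assertions might look like. The field names `se` and `overall_se` follow the fixture description in the finding above and are otherwise assumed:

```python
import numpy as np

def assert_se_parity(py_result, golden, rtol=1e-8):
    """Compare analytical SEs against R golden values, not just ATTs.

    Both arguments are assumed to expose parallel arrays of group-time
    SEs plus a scalar aggregate SE (illustrative field names).
    """
    np.testing.assert_allclose(py_result["se"], golden["se"], rtol=rtol)
    np.testing.assert_allclose(
        py_result["overall_se"], golden["overall_se"], rtol=rtol
    )

# Matching values pass silently; any divergence beyond rtol raises.
golden = {"se": [0.021, 0.034], "overall_se": 0.018}
assert_se_parity({"se": [0.021, 0.034], "overall_se": 0.018}, golden)
```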

@igerber igerber merged commit 2acf245 into main Mar 18, 2026
10 checks passed
@igerber igerber deleted the cs-edid-tests branch March 18, 2026 12:35