
Port R did package tests, fix not_yet_treated bugs #207

Merged
igerber merged 2 commits into main from cs-edid-tests
Mar 18, 2026

Conversation

@igerber (Owner) commented Mar 18, 2026

Summary

  • Port ~21 tests from R's did package (bcallaway11/did) to cross-validate CallawaySantAnna estimator
  • Tests organized as Tier 1 (Python DGP, loose tolerance) and Tier 2 (R golden values, strict tolerance)
  • Fix: not_yet_treated control group now works without never-treated units (requires ≥2 cohorts)
  • Fix: not_yet_treated control mask uses max(t, base_period) to prevent contamination when base_period="universal" exceeds evaluation period
  • Update REGISTRY.md to document corrected control mask formula and relaxed never-treated requirement
  • Add feature gaps section to TODO.md (repeated cross-sections, sampling weights, calendar time aggregation)
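The corrected control mask described above can be sketched as follows. This is a minimal illustration, not the actual code in diff_diff/staggered.py; the helper name and signature are hypothetical, with never-treated units encoded as first_treat = inf.

```python
import numpy as np

def not_yet_treated_mask(first_treat, g, t, base_period, anticipation=0):
    """Hypothetical sketch of the corrected not_yet_treated control mask.

    A unit qualifies as a control for ATT(g, t) if it is never treated
    (first_treat = inf) or first treated strictly after
    max(t, base_period) + anticipation; cohort g is always excluded.
    Taking max(t, base_period) prevents contamination when
    base_period="universal" places the base period after t.
    """
    first_treat = np.asarray(first_treat, dtype=float)
    cutoff = max(t, base_period) + anticipation
    return (first_treat > cutoff) & (first_treat != g)

# Cohorts first treated at 4, 5, 8, and never (inf); evaluate
# ATT(g=4, t=3) with a universal base period of 6.
mask = not_yet_treated_mask(
    np.array([4.0, 5.0, 8.0, np.inf]), g=4, t=3, base_period=6
)
```

With a naive cutoff of t = 3, the cohort first treated at period 5 would be admitted as a control even though it is already treated by the universal base period 6; the max(t, base_period) cutoff excludes it.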

Methodology references (required if estimator / math changes)

  • Method name(s): Callaway & Sant'Anna (2021) — not-yet-treated control group construction
  • Paper / source link(s): Callaway & Sant'Anna (2021, JoE); R reference implementation did::att_gt()
  • Any intentional deviations from the source: None — fixes align diff-diff with R's did package behavior

Validation

  • Tests added/updated:
    • tests/test_csdid_ported.py — 30 tests (20 Tier 1 + 10 Tier 2), all passing
    • tests/helpers/csdid_dgp.py — R DGP translation for Tier 1 tests
    • benchmarks/R/generate_csdid_test_values.R — generates golden values for Tier 2
    • benchmarks/data/csdid_golden_values.json — committed golden values (786KB)
  • Existing tests unaffected: 184/184 pass in test_staggered.py + test_methodology_callaway.py
  • Pure Python mode verified: DIFF_DIFF_BACKEND=python pytest tests/test_csdid_ported.py — all pass

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

Port ~21 tests from R's `did` package (bcallaway11/did) to validate
CallawaySantAnna estimator. Tests organized as Tier 1 (Python DGP, loose
tolerance, always-run) and Tier 2 (R golden values, strict tolerance).

Found and fixed two bugs via the ported tests:
- not_yet_treated control group now works without never-treated units
  (requires ≥2 treatment cohorts)
- not_yet_treated control mask now uses max(t, base_period) to prevent
  contamination when base_period="universal" exceeds evaluation period

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the estimator change. The methodology-sensitive not_yet_treated fix appears aligned with the project’s registry and the in-code Callaway & Sant’Anna reference. This review is static only; I could not run the new tests in this environment because the review image is missing test/runtime dependencies.

Executive Summary

  • The affected method is Callaway & Sant’Anna (2021) ATT(g,t) construction for control_group="not_yet_treated", especially the base_period/anticipation interaction. I did not find an undocumented methodology deviation in the estimator paths.
  • The updated control-mask formula is applied consistently in all ATT(g,t) implementations, and the relaxed “no never-treated units” rule for not_yet_treated is documented in the Methodology Registry.
  • Newly supported all-treated not_yet_treated runs can now return results while still reporting n_control_units as the count of never-treated units only, so summaries may say Control units: 0 even when later-treated cohorts are serving as controls.
  • Two new exact-match R goldens are generated with panel=FALSE but compared against the Python panel estimator anyway, which overstates current source-material validation for those scenarios.
  • The new TODO.md entries properly track deferred R-compatibility gaps and are non-blocking by project policy.

Methodology

Code Quality

  • Severity: P2. Impact: fit() still computes n_control_units from the never-treated count only, then stores that in a results object whose summary labels it as Control units. With this PR’s new support for all-treated not_yet_treated samples, successful runs can now report Control units: 0 even though later-treated cohorts are actively used as controls. That is misleading user-facing metadata in a newly supported configuration. Concrete fix: Make the summary label conditional for not_yet_treated or split the metadata into n_never_treated_units plus a documented note that the actual control pool is dynamic by (g,t). References: diff_diff/staggered.py#L1134, diff_diff/staggered.py#L1141, diff_diff/staggered.py#L1377, diff_diff/staggered_results.py#L90, diff_diff/staggered_results.py#L155
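One way to sketch the suggested metadata split. The class and field names here are illustrative, not the actual diff_diff/staggered_results.py API:

```python
from dataclasses import dataclass

@dataclass
class ControlMetadata:
    """Illustrative split of the ambiguous n_control_units field."""
    n_never_treated_units: int
    control_group: str

    def summary_label(self) -> str:
        # For not_yet_treated, the control pool is dynamic by (g, t):
        # later-treated cohorts also serve as controls, so labeling the
        # never-treated count as "Control units" would mislead.
        if self.control_group == "not_yet_treated":
            return f"Never-treated units: {self.n_never_treated_units}"
        return f"Control units: {self.n_never_treated_units}"

# An all-treated not_yet_treated run no longer prints "Control units: 0".
meta = ControlMetadata(n_never_treated_units=0, control_group="not_yet_treated")
label = meta.summary_label()
```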

Performance

No findings.

Maintainability

No findings beyond the metadata issue above.

Tech Debt

  • Severity: P3 informational. Impact: The added TODO entries correctly track deferred gaps relative to R did for repeated cross-sections, weights, and calendar-time aggregation, so those items are non-blocking under the project’s review policy. Concrete fix: None. Reference: TODO.md#L157

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: Two new “golden value” validations are not apples-to-apples with the supported estimator. The R fixture generator explicitly uses panel = FALSE for dynamic_effects and non_consecutive_periods, but the Python tests compare those fixtures against CallawaySantAnna(...) with no repeated-cross-section mode. Since TODO.md simultaneously tracks repeated cross-sections as a blocked feature, these tests overstate current source-material validation rather than actually proving parity. Concrete fix: Regenerate those fixtures with panel=TRUE if the goal is to validate the supported panel estimator, or skip/xfail these two goldens until repeated cross-sections are implemented. At minimum, have the Python harness read scenario["params"] and reject unsupported exact comparisons such as panel=False. References: benchmarks/R/generate_csdid_test_values.R#L170, benchmarks/R/generate_csdid_test_values.R#L199, tests/test_csdid_ported.py#L994, tests/test_csdid_ported.py#L1022, TODO.md#L157
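The suggested harness guard could look like the following sketch. The scenario schema (a `params` dict per scenario with an optional `panel` key) is assumed from the finding above, not taken from the actual test code:

```python
def reject_unsupported(scenarios):
    """Partition golden scenarios by whether the Python panel estimator
    can validly reproduce them.

    Scenarios generated with panel=FALSE (repeated cross-sections) are
    routed to `skip` with a reason, so exact-match comparisons against
    the panel-only estimator are never silently run.
    """
    keep, skip = {}, {}
    for name, scenario in scenarios.items():
        params = scenario.get("params", {})
        if params.get("panel", True) is False:
            skip[name] = "panel=False: repeated cross-sections unsupported"
        else:
            keep[name] = scenario
    return keep, skip

# Example: one panel=FALSE fixture and one default (panel) fixture.
keep, skip = reject_unsupported({
    "dynamic_effects": {"params": {"panel": False}},
    "two_group": {"params": {}},
})
```

In a pytest harness, entries in `skip` could be surfaced via `pytest.skip` or an `xfail` marker rather than dropped silently.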

…d panel=FALSE golden values

- Rename summary label "Control units:" → "Never-treated units:" in CallawaySantAnnaResults
  to avoid confusion when not_yet_treated control group has zero never-treated units
- Clarify n_control_units docstring to note it excludes not-yet-treated dynamic controls
- Remove panel=FALSE from scenarios 4 and 5 in R golden value generator so golden
  values match the Python panel estimator (previously overstated validation)
- Regenerate csdid_golden_values.json

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: c8cd414a64bffcb15b610d10edf702a16b1f9ced


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings. The methodology-sensitive estimator change is aligned with the local Methodology Registry, the prior re-review issues appear addressed, and the remaining concern is a minor test-coverage gap rather than a correctness defect.

Executive Summary

  • The affected method is Callaway & Sant'Anna (2021) ATT(g,t) construction for control_group="not_yet_treated" under base_period and anticipation. The updated control rule first_treat > max(t, base_period) + anticipation, with cohort g excluded from its own control pool, is implemented consistently across the general, vectorized, and covariate-regression paths: diff_diff/staggered.py:L449-L459, diff_diff/staggered.py:L631-L638, diff_diff/staggered.py:L789-L797, diff_diff/staggered.py:L856-L862.
  • The relaxed all-treated eligibility rule is documented and enforced correctly: not_yet_treated no longer requires never-treated units, but it does require at least two treatment cohorts when none exist. I did not find an undocumented methodology deviation or a missing assumption check in that new path: diff_diff/staggered.py:L1136-L1148, docs/methodology/REGISTRY.md:L409-L419, docs/methodology/REGISTRY.md:L424-L425.
  • The prior metadata finding is resolved. Results now document n_control_units as never-treated units only, and the printed summary label matches that meaning: diff_diff/staggered_results.py:L90-L92, diff_diff/staggered_results.py:L153-L159.
  • The prior panel-mismatch concern also appears resolved. The current R golden generator does not create exact-match panel=FALSE scenarios; the only explicit panel override is panel = TRUE in the two-group case: benchmarks/R/generate_csdid_test_values.R:L170-L202, benchmarks/R/generate_csdid_test_values.R:L312-L316.
  • Minor remaining gap: the new R golden fixtures export SEs, but the Python exact-match tests only assert ATT parity, so inference parity is not yet pinned down by the added fixtures: benchmarks/R/generate_csdid_test_values.R:L10-L13, benchmarks/R/generate_csdid_test_values.R:L22-L39, tests/test_csdid_ported.py:L946-L954, tests/test_csdid_ported.py:L1011-L1020, tests/test_csdid_ported.py:L1099-L1107, tests/test_csdid_ported.py:L1196-L1203.
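The relaxed all-treated eligibility rule described above could be sketched like this. The function is illustrative, not the actual guard in diff_diff/staggered.py, and never-treated units are again encoded as first_treat = inf:

```python
import numpy as np

def check_not_yet_treated_eligible(first_treat):
    """Illustrative validation for control_group="not_yet_treated".

    Never-treated units are no longer required; but when none exist,
    there must be at least two distinct treatment cohorts so that
    later-treated cohorts can serve as controls for earlier ones.
    """
    first_treat = np.asarray(first_treat, dtype=float)
    has_never_treated = bool(np.isinf(first_treat).any())
    n_cohorts = np.unique(first_treat[np.isfinite(first_treat)]).size
    if not has_never_treated and n_cohorts < 2:
        raise ValueError(
            "not_yet_treated requires never-treated units or "
            "at least two treatment cohorts"
        )

check_not_yet_treated_eligible([4.0, 6.0, 8.0])  # OK: three cohorts
check_not_yet_treated_eligible([4.0, np.inf])    # OK: never-treated present
# check_not_yet_treated_eligible([4.0, 4.0])     # would raise ValueError
```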

Methodology

  • Severity: P3 informational. Impact: The PR changes Callaway & Sant'Anna ATT(g,t) control construction for not_yet_treated, especially when base_period="universal" makes the base period later than the evaluation period. The implementation now uses max(t, base_period) and excludes cohort g in all estimator paths, and the registry documents that same rule plus the all-treated >= 2-cohort guard. I did not find an undocumented methodology mismatch, incorrect control-group composition, incorrect variance/SE logic, or a missing assumption check in the changed code. Concrete fix: None. References: diff_diff/staggered.py:L449-L459, diff_diff/staggered.py:L631-L638, diff_diff/staggered.py:L789-L797, diff_diff/staggered.py:L856-L862, diff_diff/staggered.py:L1136-L1148, docs/methodology/REGISTRY.md:L354-L357, docs/methodology/REGISTRY.md:L409-L419.

Code Quality

  • No findings. The previously misleading Control units summary is now corrected to Never-treated units, which matches the stored field semantics: diff_diff/staggered_results.py:L90-L92, diff_diff/staggered_results.py:L153-L159.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3 informational. Impact: The new TODO entries cleanly track remaining R did feature gaps for repeated cross-sections, sampling weights, and calendar-time aggregation, so those deferred items are non-blocking under the stated review policy. Concrete fix: None. References: TODO.md:L157-L165.

Security

  • No findings.

Documentation/Tests

  • Severity: P3 informational. Impact: The new golden generator exports both ATT and SE outputs for group-time and aggregation results, but the Python exact-match tests currently compare ATT values only. That means the new fixtures validate point-estimate parity, but not analytical SE/inference parity, even on the methodology-sensitive control-group change in this PR. Concrete fix: Add exact-match assertions for exported se and overall_se fields in representative group-time and aggregation scenarios, or document explicitly that the new goldens are ATT-only. References: benchmarks/R/generate_csdid_test_values.R:L10-L13, benchmarks/R/generate_csdid_test_values.R:L22-L39, tests/test_csdid_ported.py:L946-L954, tests/test_csdid_ported.py:L1011-L1020, tests/test_csdid_ported.py:L1099-L1107, tests/test_csdid_ported.py:L1196-L1203.
  • Validation note: pytest is not installed in this environment, so I could not execute the new suite here.
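A hedged sketch of what the suggested SE assertions might look like. The field names `se` and `overall_se` follow the fixture description in the finding above and are otherwise assumed:

```python
import numpy as np

def assert_se_parity(py_result, golden, rtol=1e-8):
    """Compare analytical SEs against R golden values, not just ATTs.

    Both arguments are assumed to expose parallel arrays of group-time
    SEs plus a scalar aggregate SE (illustrative field names).
    """
    np.testing.assert_allclose(py_result["se"], golden["se"], rtol=rtol)
    np.testing.assert_allclose(
        py_result["overall_se"], golden["overall_se"], rtol=rtol
    )

# Matching values pass silently; any divergence beyond rtol raises.
golden = {"se": [0.021, 0.034], "overall_se": 0.018}
assert_se_parity({"se": [0.021, 0.034], "overall_se": 0.018}, golden)
```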

@igerber igerber merged commit 2acf245 into main Mar 18, 2026
10 checks passed
@igerber igerber deleted the cs-edid-tests branch March 18, 2026 12:35