
Add EDID validation tests against paper results #221

Merged
igerber merged 5 commits into main from feature/edid-paper-validation
Mar 21, 2026

Conversation

igerber (Owner) commented Mar 21, 2026

Summary

  • Add HRS empirical replication tests validating EDiD ATT(g,t), ES(e), and ES_avg against Table 6 of Chen, Sant'Anna & Xie (2025)
  • Add Compustat MC simulation tests validating unbiasedness, RMSE dominance, efficiency gains, coverage, and SE calibration against Tables 4/5 patterns
  • Commit processed HRS fixture (656-individual subset of Dobkin et al. 2018 replication kit) as tests/data/hrs_edid_validation.csv
  • Add replication_data/ to .gitignore

HRS Replication Results (vs Table 6)

Target       Paper   Ours     Diff    Diff/SE
ATT(8,8)      3072   3076.9    +4.9    0.006
ATT(8,9)      1112   1098.1   -13.9   -0.022
ATT(8,10)     1038   1047.2    +9.2    0.011
ATT(9,9)      3063   3045.3   -17.7   -0.026
ATT(9,10)       90     96.0    +6.0    0.009
ATT(10,10)    2908   2925.7   +17.7    0.020
ES(0)         3024   3025.8    +1.8    0.004
ES(1)          692    686.0    -6.0   -0.013
ES(2)         1038   1047.2    +9.2    0.011
ES_avg        1585   1586.3    +1.3    0.003

All differences are < 0.03 standard errors. The CS cross-validation matches to < $1 on all ATT(g,t).
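The "Diff/SE" diagnostic in the table can be sketched as a small helper. This is illustrative only; the ATT(9,10) point estimate (90) and its SE (~641) are quoted from this PR's own description, and the helper name is hypothetical:

```python
# Hypothetical helper illustrating the Diff/SE column above; the
# published values and SE are taken from this PR's description.
def diff_in_se_units(paper: float, ours: float, paper_se: float) -> float:
    """Gap between our estimate and the published value, in SE units."""
    return (ours - paper) / paper_se

# ATT(9,10): paper 90 (SE ~641), ours 96.0 -> gap of about 0.009 SE.
gap = diff_in_se_units(90.0, 96.0, 641.0)
assert abs(gap) < 0.03
```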

Methodology references

  • N/A — no methodology changes, validation only

Validation

  • Tests added: tests/test_efficient_did_validation.py (7 HRS replication + 5 MC simulation tests)
  • Data fixture: tests/data/hrs_edid_validation.csv (656 individuals × 4 waves from Dobkin et al. 2018)
  • HRS tests run in all CI builds (fast, not slow-marked)
  • MC tests run in Rust CI builds only (slow-marked, included via -m '')

Security / privacy

  • Confirm no secrets/PII in this PR: Yes
  • HRS fixture contains anonymized survey data from a publicly available replication kit

🤖 Generated with Claude Code

Validate the EfficientDiD estimator against published results from the
paper's Table 6 (HRS empirical application) and Tables 4/5 patterns
(Compustat MC simulations).

HRS replication: point estimates match within 0.1-1.3% for all targets
except ATT(9,10) which is a near-zero noisy estimate (90 ± 641). The
CS cross-validation matches to <$1 on all ATT(g,t), confirming data
loading correctness. Fixture is a 656-individual subset of the Dobkin
et al. (2018) publicly available replication kit.

MC simulations validate unbiasedness, RMSE dominance over CS, efficiency
gains increasing with serial correlation, ~95% coverage, and SE
calibration across rho values.
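The ~95% coverage check mentioned above can be sketched as follows. The MC loop, DGP, and estimator calls are omitted, and all names here are illustrative rather than the suite's actual helpers:

```python
# Hedged sketch of a normal-CI coverage check; not the suite's code.
def coverage_rate(estimates, ses, truth, z=1.96):
    """Fraction of CIs [est +/- z*se] that contain the true value."""
    hits = sum(abs(est - truth) <= z * se for est, se in zip(estimates, ses))
    return hits / len(estimates)
```

In the real test this would be computed over many simulated datasets and asserted to be close to 0.95.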

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • The PR changes only test assets: .gitignore, tests/data/hrs_edid_validation.csv, and tests/test_efficient_did_validation.py. No estimator, weighting, variance, or default-behavior code paths are modified.
  • Methodology alignment is good: the new tests use EfficientDiD(pt_assumption="all"), compute paper ES_avg from event-study output rather than library overall_att, and treat Table 6 SEs as diagnostic-only, which matches the EfficientDiD registry notes.
  • No P0/P1 issues were introduced in identification, control-group logic, variance/SE handling, or NaN inference patterns.
  • P2: the sample-selection regression check is too loose for a committed deterministic fixture, so moderate fixture drift could still pass.
  • P3: the Compustat DGP/truth logic is duplicated and partly hard-coded, creating a second source of truth.
  • I could not execute the new tests in this sandbox because the available interpreter lacks numpy, pandas, and pytest.
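The ES_avg convention noted above (computed from event-study output rather than the library's overall_att) can be sketched like this. The simple unweighted average is an assumption, which is why the result here (1584.7) only approximates the published 1585; the paper's aggregation may weight event times differently:

```python
# Illustrative sketch of a paper-style ES_avg from event-study output;
# the unweighted average is an assumption, not the library's method.
def paper_es_avg(es_estimates: dict) -> float:
    """Average the post-treatment event-study effects ES(e), e >= 0."""
    post = [v for e, v in es_estimates.items() if e >= 0]
    return sum(post) / len(post)

# Table 6 values quoted in this PR: ES(0)=3024, ES(1)=692, ES(2)=1038.
assert round(paper_es_avg({0: 3024, 1: 692, 2: 1038}), 1) == 1584.7
```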

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

Tech Debt

  • No findings.

Security

Documentation/Tests

  • P2 Impact: the sample-selection guard in tests/test_efficient_did_validation.py uses broad approximate tolerances even though the committed fixture in tests/data/hrs_edid_validation.csv is deterministic. The checked-in file currently has exact cohort counts of 656 units total, with 252/176/163/65 in cohorts 8/9/10/inf. As written, a nontrivial fixture drift could still pass the “expected counts” gate. Concrete fix: assert the exact checked-in counts and exact period support, or verify against a checked-in metadata manifest.
  • P3 Impact: the committed HRS fixture has no in-repo provenance or regeneration note, while .gitignore suggests raw replication inputs live outside version control. That makes future audit/rebuild of the paper-validation subset harder than necessary. Concrete fix: add a short README or regeneration script next to the fixture documenting the upstream source, selection steps, and expected checksum/counts.

P2: Assert exact cohort counts (656/252/176/163/65) and wave support
since the CSV fixture is deterministic — approximate tolerances could
mask fixture drift.
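A minimal stdlib-only sketch of the exact-count guard suggested above. The counts (656 units; 252/176/163/65 in cohorts 8/9/10/never-treated) come from this review; the function name and input format are hypothetical:

```python
from collections import Counter

# Sketch of the exact-count fixture guard; input format is hypothetical,
# counts are the checked-in values quoted in this review.
EXPECTED_COHORT_COUNTS = {"8": 252, "9": 176, "10": 163, "inf": 65}

def check_fixture_counts(unit_cohorts):
    """unit_cohorts maps unit id -> cohort label, one entry per unit."""
    assert len(unit_cohorts) == 656, "unit count drifted from fixture"
    observed = dict(Counter(unit_cohorts.values()))
    assert observed == EXPECTED_COHORT_COUNTS, f"cohort drift: {observed}"
```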

P3: Derive _TRUE_ES_AVG_COMPUSTAT programmatically from DGP parameters
instead of hard-coding, so changes to the DGP definition propagate
automatically.
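The programmatic-truth idea can be illustrated with a toy derivation. The linear-in-event-time effect is an assumed stand-in, not the actual Compustat DGP:

```python
# Toy illustration of deriving a truth value from DGP parameters rather
# than hard-coding it; the linear effect shape is an assumption.
def true_es_avg(base_effect, slope, n_post_periods):
    """ES_avg implied by effects base + slope * e for e = 0..n-1."""
    effects = [base_effect + slope * e for e in range(n_post_periods)]
    return sum(effects) / len(effects)
```

If a DGP parameter ever changes, a derived constant like this moves with it instead of silently going stale.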

P3: Add tests/data/README.md documenting the HRS fixture source,
sample selection steps, and expected counts for future audit/rebuild.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber (Owner, Author) commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a4b388178e6f6d863213a6391dc4353da1e57ca6


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • P3 Impact: _make_compustat_dgp() is still duplicated in the new validation file instead of shared with the existing EfficientDiD test suite. The hard-coded derived constant was improved, but the generator itself remains a second source of truth, so future DGP edits can silently desynchronize the suites. Concrete fix: move the Compustat DGP and related truth helpers into a shared module under tests/helpers/ and import them in both suites. tests/test_efficient_did_validation.py:L84-L130 tests/test_efficient_did.py:L115-L164

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • P2 Impact: the primary HRS replication assertions use _assert_close() with max(10% of expected, 200) as the acceptance band. On this fixed committed fixture, that is broad enough to let moderate estimator drift pass while the test still reports a successful paper replication; e.g. the ATT(9,10)=90 target would accept anything from -110 to 290. Concrete fix: tighten the Table 6 / ES / ES_avg checks to a materially smaller threshold, ideally a small fraction of the published SEs or a checked-in expected-results manifest with narrow absolute tolerances; keep broader tolerances only for smoke-test cross-checks if needed. tests/test_efficient_did_validation.py:L26-L39 tests/test_efficient_did_validation.py:L69-L76 tests/test_efficient_did_validation.py:L268-L286

P2: Replace loose max(10%*estimate, 200) tolerance with 0.1*SE
(10% of one published standard error). Our actual diffs are all
< 0.03 SE, so this catches real drift while absorbing minor sample
differences. ATT(9,10)=90 now accepts [26, 154] instead of [-110, 290].
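The SE-based band can be sketched as below. The helper name echoes the suite's _assert_close, but this version is illustrative; the ATT(9,10) value (90) and SE (~641) are quoted from this thread:

```python
# Sketch of the 0.1*SE acceptance band described above; illustrative,
# not the suite's actual _assert_close implementation.
def assert_close_se(actual, paper_value, paper_se, frac_of_se=0.1):
    """Accept estimates within frac_of_se published SEs of the paper value."""
    tol = frac_of_se * paper_se
    assert abs(actual - paper_value) <= tol, (
        f"{actual} outside [{paper_value - tol:.1f}, {paper_value + tol:.1f}]"
    )

# ATT(9,10): paper 90, SE 641 -> band [25.9, 154.1]; ours (96.0) passes,
# while the old absolute band would also have accepted e.g. -50.
assert_close_se(96.0, 90.0, 641.0)
```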

P3: Extract Compustat DGP into tests/edid_dgp.py as the single source
of truth. Both test_efficient_did.py and test_efficient_did_validation.py
import from it. Truth values (ES_avg, overall_att) are derived
programmatically from the shared DGP parameters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber (Owner, Author) commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: eec1fe838e7f156fadbec29b724d8464179f5d8b


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

This re-review addresses the prior review’s substantive items: the paper-match tolerances are now SE-based, and the Compustat DGP is shared instead of duplicated. I only have one remaining P3 maintainability note.

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • P3 Impact: adding pythonpath = ["tests"] makes every module under tests/ importable as a top-level module for the entire suite, which is broader than needed and increases future name-collision risk. Concrete fix: switch the new imports to from tests.edid_dgp import ... (or move the helper under tests/helpers/) and remove the global pytest pythonpath entry. pyproject.toml:L92 tests/test_efficient_did_validation.py:L18 tests/test_efficient_did.py:L115

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

P3: Move edid_dgp.py into tests/helpers/ and add the helpers
directory to sys.path via conftest.py instead of the global
pythonpath = ["tests"] setting. This avoids making every module
under tests/ importable as a top-level module.
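The conftest-based approach proposed above might look like the following fragment; the paths are from this thread, but the exact snippet is illustrative:

```python
# tests/conftest.py -- illustrative sketch, not the committed file.
import sys
from pathlib import Path

# Expose only tests/helpers (not the whole tests/ tree) for imports.
HELPERS_DIR = Path(__file__).resolve().parent / "helpers"
sys.path.insert(0, str(HELPERS_DIR))
```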

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber (Owner, Author) commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 23402929c1504cddcc7ffc93d2d5b8a2b94671b6


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • Re-review status: the previous broad import-path concern is resolved. The PR now exposes only tests/helpers to pytest imports instead of all of tests. tests/conftest.py:13
  • Methodology alignment is good for the affected validations (EfficientDiD ATT(g,t), ES(e), ES_avg, plus the Section 5.2 Compustat DGP): the new suite computes paper ES_avg from event-study output rather than using library overall_att, which matches the Methodology Registry’s documented convention. tests/test_efficient_did_validation.py:88 tests/test_efficient_did_validation.py:204 docs/methodology/REGISTRY.md:576 docs/methodology/REGISTRY.md:627
  • The shared Compustat DGP extraction is a straight deduplication of the existing helper into tests/helpers/edid_dgp.py; I did not find a behavioral change in the existing EfficientDiD tests that consume it. tests/test_efficient_did.py:115 tests/helpers/edid_dgp.py:16
  • Slow MC validations are excluded by default in local pytest config, but Rust CI explicitly runs the full suite with -m '', so the new slow tests are still exercised in CI. pyproject.toml:93 .github/workflows/rust-test.yml:148
  • P3: the new fixture README says regeneration is documented in “the plan file,” but no such file exists in-tree, so fixture regeneration is not fully reproducible from repo docs alone. tests/data/README.md:34
  • I could not run pytest here because pytest, numpy, and pandas are unavailable in this environment; I did verify via stdlib CSV parsing that the fixture has 2,624 rows, 656 balanced units, and the documented cohort counts.

Methodology

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings. The prior import-path issue has been addressed by narrowing the pytest-time path mutation to tests/helpers. tests/conftest.py:13

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

  • P3 Impact: the fixture provenance README does not fully support regeneration from the repository, because it references “the plan file” without identifying any committed file or script. Concrete fix: either commit the referenced plan/script or replace that sentence with the exact in-repo path or explicit command sequence used to build hrs_edid_validation.csv. tests/data/README.md:33 tests/data/README.md:34

P3: Replace reference to non-committed plan file with the exact
Python command sequence used to build hrs_edid_validation.csv.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber (Owner, Author) commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 8c688557e81e70f30c4f19c7eebf4fe5e800467c


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

Methodology

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings. The helper extraction is clean, and the pytest import-path change stays narrowly scoped to tests/helpers, avoiding the earlier broader sys.path mutation concern in tests/conftest.py:13, tests/helpers/edid_dgp.py:1.

Tech Debt

No findings. This diff does not introduce any new deferred-correctness item that should be tracked in TODO.md, and the prior README reproducibility gap has been addressed directly in-tree in tests/data/README.md:33.

Security

No findings.

Documentation/Tests

No findings. The previous documentation gap is resolved: the fixture README now includes provenance, sample-selection rules, expected cohort counts, and a concrete regeneration recipe in tests/data/README.md:5, tests/data/README.md:19, tests/data/README.md:35.

igerber merged commit 8a72a07 into main on Mar 21, 2026
13 checks passed
igerber deleted the feature/edid-paper-validation branch on March 21, 2026 at 17:57