
Add EDID validation tests against paper results #221

Merged
igerber merged 5 commits into main from feature/edid-paper-validation
Mar 21, 2026

Conversation

igerber (Owner) commented Mar 21, 2026

Summary

  • Add HRS empirical replication tests validating EDiD ATT(g,t), ES(e), and ES_avg against Table 6 of Chen, Sant'Anna & Xie (2025)
  • Add Compustat MC simulation tests validating unbiasedness, RMSE dominance, efficiency gains, coverage, and SE calibration against Tables 4/5 patterns
  • Commit processed HRS fixture (656-individual subset of Dobkin et al. 2018 replication kit) as tests/data/hrs_edid_validation.csv
  • Add replication_data/ to .gitignore

HRS Replication Results (vs Table 6)

Target       Paper   Ours     Diff    Diff/SE
ATT(8,8)      3072   3076.9    +4.9    0.006
ATT(8,9)      1112   1098.1   -13.9   -0.022
ATT(8,10)     1038   1047.2    +9.2    0.011
ATT(9,9)      3063   3045.3   -17.7   -0.026
ATT(9,10)       90     96.0    +6.0    0.009
ATT(10,10)    2908   2925.7   +17.7    0.020
ES(0)         3024   3025.8    +1.8    0.004
ES(1)          692    686.0    -6.0   -0.013
ES(2)         1038   1047.2    +9.2    0.011
ES_avg        1585   1586.3    +1.3    0.003

All differences are < 0.03 standard errors. The CS cross-validation matches to < $1 on all ATT(g,t).
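The "Diff/SE" diagnostic in the table can be sketched as a small helper. This is illustrative only; the ATT(9,10) point estimate (90) and its SE (~641) are quoted from this PR's own description, and the helper name is hypothetical:

```python
# Hypothetical helper illustrating the Diff/SE column above; the
# published values and SE are taken from this PR's description.
def diff_in_se_units(paper: float, ours: float, paper_se: float) -> float:
    """Gap between our estimate and the published value, in SE units."""
    return (ours - paper) / paper_se

# ATT(9,10): paper 90 (SE ~641), ours 96.0 -> gap of about 0.009 SE.
gap = diff_in_se_units(90.0, 96.0, 641.0)
assert abs(gap) < 0.03
```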

Methodology references

  • N/A — no methodology changes, validation only

Validation

  • Tests added: tests/test_efficient_did_validation.py (7 HRS replication + 5 MC simulation tests)
  • Data fixture: tests/data/hrs_edid_validation.csv (656 individuals × 4 waves from Dobkin et al. 2018)
  • HRS tests run in all CI builds (fast, not slow-marked)
  • MC tests run in Rust CI builds only (slow-marked, included via -m '')

Security / privacy

  • Confirm no secrets/PII in this PR: Yes
  • HRS fixture contains anonymized survey data from a publicly available replication kit

🤖 Generated with Claude Code

Validate the EfficientDiD estimator against published results from the
paper's Table 6 (HRS empirical application) and Tables 4/5 patterns
(Compustat MC simulations).

HRS replication: point estimates match within 0.1-1.3% for all targets
except ATT(9,10) which is a near-zero noisy estimate (90 ± 641). The
CS cross-validation matches to <$1 on all ATT(g,t), confirming data
loading correctness. Fixture is a 656-individual subset of the Dobkin
et al. (2018) publicly available replication kit.

MC simulations validate unbiasedness, RMSE dominance over CS, efficiency
gains increasing with serial correlation, ~95% coverage, and SE
calibration across rho values.
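The ~95% coverage check mentioned above can be sketched as follows. The MC loop, DGP, and estimator calls are omitted, and all names here are illustrative rather than the suite's actual helpers:

```python
# Hedged sketch of a normal-CI coverage check; not the suite's code.
def coverage_rate(estimates, ses, truth, z=1.96):
    """Fraction of CIs [est +/- z*se] that contain the true value."""
    hits = sum(abs(est - truth) <= z * se for est, se in zip(estimates, ses))
    return hits / len(estimates)
```

In the real test this would be computed over many simulated datasets and asserted to be close to 0.95.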

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • The PR changes only test assets: .gitignore, tests/data/hrs_edid_validation.csv, and tests/test_efficient_did_validation.py. No estimator, weighting, variance, or default-behavior code paths are modified.
  • Methodology alignment is good: the new tests use EfficientDiD(pt_assumption="all"), compute paper ES_avg from event-study output rather than library overall_att, and treat Table 6 SEs as diagnostic-only, which matches the EfficientDiD registry notes.
  • No P0/P1 issues were introduced in identification, control-group logic, variance/SE handling, or NaN inference patterns.
  • P2: the sample-selection regression check is too loose for a committed deterministic fixture, so moderate fixture drift could still pass.
  • P3: the Compustat DGP/truth logic is duplicated and partly hard-coded, creating a second source of truth.
  • I could not execute the new tests in this sandbox because the available interpreter lacks numpy, pandas, and pytest.
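The ES_avg convention noted above (computed from event-study output rather than the library's overall_att) can be sketched like this. The simple unweighted average is an assumption, which is why the result here (1584.7) only approximates the published 1585; the paper's aggregation may weight event times differently:

```python
# Illustrative sketch of a paper-style ES_avg from event-study output;
# the unweighted average is an assumption, not the library's method.
def paper_es_avg(es_estimates: dict) -> float:
    """Average the post-treatment event-study effects ES(e), e >= 0."""
    post = [v for e, v in es_estimates.items() if e >= 0]
    return sum(post) / len(post)

# Table 6 values quoted in this PR: ES(0)=3024, ES(1)=692, ES(2)=1038.
assert round(paper_es_avg({0: 3024, 1: 692, 2: 1038}), 1) == 1584.7
```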

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

Tech Debt

  • No findings.

Security

Documentation/Tests

  • P2 Impact: the sample-selection guard in tests/test_efficient_did_validation.py uses broad approximate tolerances even though the committed fixture in tests/data/hrs_edid_validation.csv is deterministic. The checked-in file currently has exact cohort counts of 656 units total, with 252/176/163/65 in cohorts 8/9/10/inf. As written, a nontrivial fixture drift could still pass the “expected counts” gate. Concrete fix: assert the exact checked-in counts and exact period support, or verify against a checked-in metadata manifest.
  • P3 Impact: the committed HRS fixture has no in-repo provenance or regeneration note, while .gitignore suggests raw replication inputs live outside version control. That makes future audit/rebuild of the paper-validation subset harder than necessary. Concrete fix: add a short README or regeneration script next to the fixture documenting the upstream source, selection steps, and expected checksum/counts.

P2: Assert exact cohort counts (656/252/176/163/65) and wave support
since the CSV fixture is deterministic — approximate tolerances could
mask fixture drift.
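A minimal stdlib-only sketch of the exact-count guard suggested above. The counts (656 units; 252/176/163/65 in cohorts 8/9/10/never-treated) come from this review; the function name and input format are hypothetical:

```python
from collections import Counter

# Sketch of the exact-count fixture guard; input format is hypothetical,
# counts are the checked-in values quoted in this review.
EXPECTED_COHORT_COUNTS = {"8": 252, "9": 176, "10": 163, "inf": 65}

def check_fixture_counts(unit_cohorts):
    """unit_cohorts maps unit id -> cohort label, one entry per unit."""
    assert len(unit_cohorts) == 656, "unit count drifted from fixture"
    observed = dict(Counter(unit_cohorts.values()))
    assert observed == EXPECTED_COHORT_COUNTS, f"cohort drift: {observed}"
```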

P3: Derive _TRUE_ES_AVG_COMPUSTAT programmatically from DGP parameters
instead of hard-coding, so changes to the DGP definition propagate
automatically.
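The programmatic-truth idea can be illustrated with a toy derivation. The linear-in-event-time effect is an assumed stand-in, not the actual Compustat DGP:

```python
# Toy illustration of deriving a truth value from DGP parameters rather
# than hard-coding it; the linear effect shape is an assumption.
def true_es_avg(base_effect, slope, n_post_periods):
    """ES_avg implied by effects base + slope * e for e = 0..n-1."""
    effects = [base_effect + slope * e for e in range(n_post_periods)]
    return sum(effects) / len(effects)
```

If a DGP parameter ever changes, a derived constant like this moves with it instead of silently going stale.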

P3: Add tests/data/README.md documenting the HRS fixture source,
sample selection steps, and expected counts for future audit/rebuild.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber (Owner, Author) commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a4b388178e6f6d863213a6391dc4353da1e57ca6


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • P3 Impact: _make_compustat_dgp() is still duplicated in the new validation file instead of shared with the existing EfficientDiD test suite. The hard-coded derived constant was improved, but the generator itself remains a second source of truth, so future DGP edits can silently desynchronize the suites. Concrete fix: move the Compustat DGP and related truth helpers into a shared module under tests/helpers/ and import them in both suites. tests/test_efficient_did_validation.py:L84-L130 tests/test_efficient_did.py:L115-L164

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • P2 Impact: the primary HRS replication assertions use _assert_close() with max(10% of expected, 200) as the acceptance band. On this fixed committed fixture, that is broad enough to let moderate estimator drift pass while the test still reports a successful paper replication; e.g. the ATT(9,10)=90 target would accept anything from -110 to 290. Concrete fix: tighten the Table 6 / ES / ES_avg checks to a materially smaller threshold, ideally a small fraction of the published SEs or a checked-in expected-results manifest with narrow absolute tolerances; keep broader tolerances only for smoke-test cross-checks if needed. tests/test_efficient_did_validation.py:L26-L39 tests/test_efficient_did_validation.py:L69-L76 tests/test_efficient_did_validation.py:L268-L286

P2: Replace loose max(10%*estimate, 200) tolerance with 0.1*SE
(10% of one published standard error). Our actual diffs are all
< 0.03 SE, so this catches real drift while absorbing minor sample
differences. ATT(9,10)=90 now accepts [26, 154] instead of [-110, 290].
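The SE-based band can be sketched as below. The helper name echoes the suite's _assert_close, but this version is illustrative; the ATT(9,10) value (90) and SE (~641) are quoted from this thread:

```python
# Sketch of the 0.1*SE acceptance band described above; illustrative,
# not the suite's actual _assert_close implementation.
def assert_close_se(actual, paper_value, paper_se, frac_of_se=0.1):
    """Accept estimates within frac_of_se published SEs of the paper value."""
    tol = frac_of_se * paper_se
    assert abs(actual - paper_value) <= tol, (
        f"{actual} outside [{paper_value - tol:.1f}, {paper_value + tol:.1f}]"
    )

# ATT(9,10): paper 90, SE 641 -> band [25.9, 154.1]; ours (96.0) passes,
# while the old absolute band would also have accepted e.g. -50.
assert_close_se(96.0, 90.0, 641.0)
```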

P3: Extract Compustat DGP into tests/edid_dgp.py as the single source
of truth. Both test_efficient_did.py and test_efficient_did_validation.py
import from it. Truth values (ES_avg, overall_att) are derived
programmatically from the shared DGP parameters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber (Owner, Author) commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: eec1fe838e7f156fadbec29b724d8464179f5d8b


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

This re-review addresses the prior review’s substantive items: the paper-match tolerances are now SE-based, and the Compustat DGP is shared instead of duplicated. I only have one remaining P3 maintainability note.

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • P3 Impact: adding pythonpath = ["tests"] makes every module under tests/ importable as a top-level module for the entire suite, which is broader than needed and increases future name-collision risk. Concrete fix: switch the new imports to from tests.edid_dgp import ... (or move the helper under tests/helpers/) and remove the global pytest pythonpath entry. pyproject.toml:L92 tests/test_efficient_did_validation.py:L18 tests/test_efficient_did.py:L115

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

P3: Move edid_dgp.py into tests/helpers/ and add the helpers
directory to sys.path via conftest.py instead of the global
pythonpath = ["tests"] setting. This avoids making every module
under tests/ importable as a top-level module.
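The conftest-based approach proposed above might look like the following fragment; the paths are from this thread, but the exact snippet is illustrative:

```python
# tests/conftest.py -- illustrative sketch, not the committed file.
import sys
from pathlib import Path

# Expose only tests/helpers (not the whole tests/ tree) for imports.
HELPERS_DIR = Path(__file__).resolve().parent / "helpers"
sys.path.insert(0, str(HELPERS_DIR))
```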

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber (Owner, Author) commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 23402929c1504cddcc7ffc93d2d5b8a2b94671b6


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • Re-review status: the previous broad import-path concern is resolved. The PR now exposes only tests/helpers to pytest imports instead of all of tests. tests/conftest.py:13
  • Methodology alignment is good for the affected validations (EfficientDiD ATT(g,t), ES(e), ES_avg, plus the Section 5.2 Compustat DGP): the new suite computes paper ES_avg from event-study output rather than using library overall_att, which matches the Methodology Registry’s documented convention. tests/test_efficient_did_validation.py:88 tests/test_efficient_did_validation.py:204 docs/methodology/REGISTRY.md:576 docs/methodology/REGISTRY.md:627
  • The shared Compustat DGP extraction is a straight deduplication of the existing helper into tests/helpers/edid_dgp.py; I did not find a behavioral change in the existing EfficientDiD tests that consume it. tests/test_efficient_did.py:115 tests/helpers/edid_dgp.py:16
  • Slow MC validations are excluded by default in local pytest config, but Rust CI explicitly runs the full suite with -m '', so the new slow tests are still exercised in CI. pyproject.toml:93 .github/workflows/rust-test.yml:148
  • P3: the new fixture README says regeneration is documented in “the plan file,” but no such file exists in-tree, so fixture regeneration is not fully reproducible from repo docs alone. tests/data/README.md:34
  • I could not run pytest here because pytest, numpy, and pandas are unavailable in this environment; I did verify via stdlib CSV parsing that the fixture has 2,624 rows, 656 balanced units, and the documented cohort counts.

Methodology

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings. The prior import-path issue has been addressed by narrowing the pytest-time path mutation to tests/helpers. tests/conftest.py:13

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

  • P3 Impact: the fixture provenance README does not fully support regeneration from the repository, because it references “the plan file” without identifying any committed file or script. Concrete fix: either commit the referenced plan/script or replace that sentence with the exact in-repo path or explicit command sequence used to build hrs_edid_validation.csv. tests/data/README.md:33 tests/data/README.md:34

P3: Replace reference to non-committed plan file with the exact
Python command sequence used to build hrs_edid_validation.csv.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber (Owner, Author) commented Mar 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 8c688557e81e70f30c4f19c7eebf4fe5e800467c


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

Methodology

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings. The helper extraction is clean, and the pytest import-path change stays narrowly scoped to tests/helpers, avoiding the earlier broader sys.path mutation concern in tests/conftest.py:13, tests/helpers/edid_dgp.py:1.

Tech Debt

No findings. This diff does not introduce any new deferred-correctness item that should be tracked in TODO.md, and the prior README reproducibility gap has been addressed directly in-tree in tests/data/README.md:33.

Security

No findings.

Documentation/Tests

No findings. The previous documentation gap is resolved: the fixture README now includes provenance, sample-selection rules, expected cohort counts, and a concrete regeneration recipe in tests/data/README.md:5, tests/data/README.md:19, tests/data/README.md:35.

igerber merged commit 8a72a07 into main on Mar 21, 2026
13 checks passed
igerber deleted the feature/edid-paper-validation branch on March 21, 2026 at 17:57