HAD Phase 5 wave 1: agent-facing surfaces (_handle_had + llms-full.txt) #402
Conversation
Add _handle_had + _handle_had_event_study to practitioner.py, routing both HeterogeneousAdoptionDiD result classes through HAD-specific Baker et al. (2025) step guidance: did_had_pretest_workflow (step 3), ContinuousDiD/CallawaySantAnna routing nudge (step 4), bandwidth_diagnostics + simultaneous bands (step 6), per-horizon WAS event-study disaggregation (step 7), design-auto-detection + last-cohort-only-WAS framing (step 8).

Symmetric pair: _handle_continuous gains Step-4 routing to HAD on no-untreated panels, so the HAD <-> ContinuousDiD routing loop is now bidirectional.

Extend _check_nan_att with an ndarray branch (lazy numpy import + np.all(np.isnan(arr)) semantics so partial-NaN arrays don't over-fire the warning; sketched below). Scalar path bit-exact preserved across all 12 untouched handlers.

Add a full HAD section + result-class blocks + "## HAD Pretests" index covering all 7 pretest entry points + a Choosing-an-Estimator row to diff_diff/guides/llms-full.txt (the bundled-in-wheel agent reference). Tighten the existing "Continuous treatment intensity" Choosing row with "(some units untreated)" so the HAD vs ContinuousDiD contrast is explicit.

Framing: "no untreated unit" / dose variation, never "no comparison group"; locked by negative-assertion tests on both the handler text and the llms-full.txt section.

docs/doc-deps.yaml: remove the llms-full.txt deferral note on had.py and add llms-full.txt entries to the had.py, had_pretests.py, and practitioner.py blocks.

21 new tests (14 in tests/test_practitioner.py::TestHADDispatch + 6 in tests/test_guides.py::TestLLMsFullHADCoverage + 1 fixture-minimality regression locking the "handlers are STRING-ONLY at runtime" stability invariant).

Closes the Phase 5 "agent surfaces" gap; the T21 pretest tutorial and T22 weighted/survey tutorial remain queued as separate notebook PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
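For orientation, a minimal sketch of the _check_nan_att ndarray branch described above. This is illustrative only: the real helper lives in diff_diff/practitioner.py, and its exact signature, scalar handling, and warning text are not shown in this thread.

```python
import math
from typing import Any


def _check_nan_att(att: Any) -> bool:
    """Return True when the ATT payload is entirely NaN (handler should warn)."""
    # Array branch is checked by name so numpy is only imported lazily, when an
    # ndarray actually reaches the check.
    if type(att).__name__ == "ndarray":
        import numpy as np

        arr = np.asarray(att, dtype=float)
        # Fire only when *every* horizon is NaN, so a partially estimated
        # event-study vector does not over-trigger the warning.
        return bool(np.all(np.isnan(arr)))
    # Scalar branch: unchanged semantics relative to the pre-existing handlers
    # (assumed here to be a plain NaN-float check).
    return isinstance(att, float) and math.isnan(att)
```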
Overall Assessment
Executive Summary
Methodology
Code Quality
Performance: No findings.
Maintainability: No additional findings beyond the doc/test issues below.
Tech Debt: No findings.
Security: No findings.
Documentation/Tests
Path to Approval
P1 methodology fix: Step-4 routing in _handle_had + _handle_had_event_study no longer says "switch away from HAD if untreated units exist" - that contradicts REGISTRY § HeterogeneousAdoptionDiD edge cases (line 2403: "Authors do NOT require untreated units to be dropped"; line 2408 + had.py:1325: never-treated units RETAINED on staggered event-study). Reframed as the actual estimand differentiator: HAD targets WAS at the dose support boundary; ContinuousDiD targets per-dose ATT(d) / ACRT(d) and requires never-treated controls. Routing fires only when the user wants the ATT(d) estimand AND has never-treated controls, not on "untreated units exist". Tightens the corresponding Choosing-an-Estimator table row to surface WAS vs ATT(d) as the differentiator.

P2 (a) signatures: the llms-full.txt HAD constructor + fit() blocks now match the actual HeterogeneousAdoptionDiD.__init__ / .fit signatures exactly. Drops invented kwargs (h, b, rcond) and adds the real ones (d_lower, kernel, vcov_type, robust, cluster). The aggregate default is corrected from None to "overall". fit() now lists survey, weights, cband (positional-or-keyword) and survey_design + trends_lin (keyword-only).

P2 (b) snippet bugs: result.bandwidth_diagnostics -> results.bandwidth_diagnostics (matching the plural convention of other handlers); the sup-t snippet now imports SurveyDesign and constructs sd before passing survey_design=sd (it previously passed survey_design=design with no design defined).

P2 (c) tests: new TestLLMsFullHADCoverage tests use inspect.signature(HeterogeneousAdoptionDiD.__init__) and .fit() to regress the documented signatures against the real API (sketched below). New test_llms_full_had_section_methodology_compatible_with_untreated locks the negative assertion that the section does NOT carry framing contradicting the registry. Practitioner tests gain test_had_step_4_does_not_misframe_untreated_unit_routing, test_had_handler_snippets_are_valid_python_syntax (catches snippet syntax errors via ast.parse), and test_handle_continuous_step_4_snippet_is_valid_python.

83 tests pass (47 in test_practitioner including 5 new + 36 in test_guides including 9 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
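A minimal sketch of the inspect.signature regression idea mentioned above. The import path, the exact parameter kinds, and the test name are assumptions drawn from this thread, not the test's real body.

```python
import inspect

from diff_diff import HeterogeneousAdoptionDiD  # top-level export assumed


def test_fit_signature_matches_documented_kwargs():
    """Lock the fit() kwargs documented in llms-full.txt to the real API."""
    params = inspect.signature(HeterogeneousAdoptionDiD.fit).parameters
    # Names the guide documents after this round of fixes.
    for name in ("survey", "weights", "cband", "survey_design", "trends_lin"):
        assert name in params, f"fit() no longer accepts documented kwarg {name!r}"
    # survey_design and trends_lin are documented as keyword-only.
    for name in ("survey_design", "trends_lin"):
        assert params[name].kind is inspect.Parameter.KEYWORD_ONLY
```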
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment
Executive Summary
Methodology
Code Quality: No findings.
Performance: No findings.
Maintainability: No findings.
Tech Debt
Security: No findings.
Documentation/Tests
Path to Approval
P1 pretest assumption labels: _handle_had step 3 + the llms-full.txt HAD Pretests section + the qug_test/stute_test bullets misstated which paper assumption each shipped test actually tests:
- qug_test was labeled "Assumption 5 support condition", but QUG tests H_0: d_lower = 0 (paper Theorem 4 / step 1 of the workflow). Assumption 5 is the Design 1 sign-identification condition and is NOT testable via pre-trends per REGISTRY.md:2270.
- stute_test was labeled "Assumption 7 mean-independence", but stute_test is the Assumption 8 linearity test (paper Section 4.2 step 3 / Appendix D). Assumption 7 is pre-trends (step 2).
- did_had_pretest_workflow(aggregate="overall") was implied to cover step 2, but the workflow runs steps 1 + 3 only; step 2 is explicitly not covered on the overall path (had_pretests.py:4434-4441 + the workflow's verdict flags the gap).

Rewrote both surfaces to match the actual contracts: QUG = paper Theorem 4 support-infimum test (step 1, decides Design 1' vs Design 1); Stute / Yatchew-HR = Assumption 8 linearity tests (step 3); Assumption 7 step 2 closure requires aggregate="event_study" (joint Stute pre-trends). The Assumption 7 / step 2 gap is explicitly flagged on the overall path so agents do not assume coverage where there is none.

P2 result-class field tables incomplete: the HeterogeneousAdoptionDiDResults table was missing n_mass_point, n_above_d_lower, cluster_name, bias_corrected_fit, variance_formula, effective_dose_mean. The HeterogeneousAdoptionDiDEventStudyResults table was missing vcov_type, cluster_name, bandwidth_diagnostics, bias_corrected_fit, filter_info. Added all missing fields with correct types and descriptions.

Tests added (3 new, 86 total):
- test_llms_full_had_results_class_field_lists_match_real_dataclass: uses dataclasses.fields() to enumerate every public field on both result classes and assert each appears in the documented table (sketched below). Catches future drift where new fields land but the guide is not updated.
- test_llms_full_had_pretests_assumption_labels_correct: scans the qug_test and stute_test bullets in the HAD Pretests section and enforces positive labels (support-infimum / Theorem 4 / linearity) + forbids positive Assumption-5 / Assumption-7 misclaims (negative disclaimers like "QUG does NOT test Assumption 5" remain allowed).
- test_had_step_3_pretest_assumption_labels_correct: same checks on the practitioner.py _handle_had step-3 why-text; also requires positive acknowledgment of the Assumption 7 / step 2 gap on the overall workflow path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
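A rough sketch of the dataclasses.fields() drift lock referenced above. The module path, the guide-accessor argument, and the private-field convention are assumptions, not the shipped test body.

```python
import dataclasses

from diff_diff.guides import get_llm_guide            # accessor named later in this thread
from diff_diff.had import HeterogeneousAdoptionDiDResults  # module path assumed


def test_documented_table_lists_every_public_field():
    """Every public result field must appear somewhere in the bundled guide text."""
    guide = get_llm_guide("full")  # assumption: "full" keys the bundled llms-full.txt
    for field in dataclasses.fields(HeterogeneousAdoptionDiDResults):
        if field.name.startswith("_"):
            continue  # skip private bookkeeping fields
        assert field.name in guide, f"llms-full.txt is missing result field {field.name!r}"
```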
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Execution note: this was a static review only; I could not run …
P2 (a) variance_formula / effective_dose_mean field descriptions in
the single-period results table understated weighted mass-point fits.
The single-period table previously said both fields were
continuous-path-only / None on mass-point fits, but per had.py:3585-3629
weighted mass-point fits populate variance_formula in {"pweight_2sls",
"survey_binder_tsl_2sls"} and effective_dose_mean as the weighted
Wald-IV dose gap. Both rows now enumerate all four variance_formula
labels and describe the mass-point Wald-IV dose-gap denominator
explicitly.
P2 (b) llms-practitioner.txt Step 4 decision tree was stale: it routed
ALL continuous-intensity designs to ContinuousDiD, omitting HAD. That
contradicted the new runtime estimator-selection guidance and would
steer agents to the wrong estimator on universal-rollout panels.
Updated the continuous branch to distinguish ContinuousDiD (per-dose
ATT(d) / ACRT(d), requires never-treated controls) from
HeterogeneousAdoptionDiD (WAS at dose support boundary, compatible with
universal rollout or small never-treated share).
Tests added (2 new, 88 total):
- test_llms_full_had_variance_formula_describes_all_designs: scans the
variance_formula and effective_dose_mean rows in the single-period
HAD results table; asserts all four variance_formula labels (pweight,
survey_binder_tsl, pweight_2sls, survey_binder_tsl_2sls) and the
mass-point Wald-IV dose-gap semantics are present.
- test_llms_practitioner_step_4_distinguishes_had_from_continuous:
parses the Step 4 decision tree from get_llm_guide("practitioner");
asserts both ContinuousDiD and HeterogeneousAdoptionDiD appear in the
continuous branch and that never-treated / universal-rollout framing
is the routing distinguisher.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
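For illustration, a rough sketch of the Step-4 decision-tree lock described above. The guide's internal "Step 4"/"Step 5" delimiters and the parsing strategy are assumptions about the practitioner guide's layout; only the get_llm_guide("practitioner") accessor is taken from this thread.

```python
from diff_diff.guides import get_llm_guide  # module path assumed


def test_step_4_distinguishes_had_from_continuous():
    """The continuous-dose branch must route to both estimators, split on never-treated availability."""
    guide = get_llm_guide("practitioner")
    # Isolate the Step 4 block; the delimiters below are assumed layout, not verified.
    step_4 = guide.split("Step 4", 1)[1].split("Step 5", 1)[0]
    assert "ContinuousDiD" in step_4
    assert "HeterogeneousAdoptionDiD" in step_4
    # Never-treated / universal-rollout availability is the routing distinguisher.
    assert "never-treated" in step_4.lower() or "universal rollout" in step_4.lower()
```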
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment
Executive Summary
Methodology
Code Quality: No findings.
Performance: No findings.
Maintainability: No findings.
Tech Debt: No findings. The remaining tutorial follow-up is appropriately tracked in TODO.md.
Security: No findings.
Documentation/Tests
Path to Approval
P1 HAD Step-3 overstated pretest coverage on weighted/survey fits: practitioner_next_steps() said did_had_pretest_workflow runs QUG on both the overall and event-study paths without noting that the workflow explicitly skips QUG whenever survey_design= / survey= / weights= is supplied (Phase 4.5 C0 deferral, had_pretests.py:4488-4495 + REGISTRY § "QUG Null Test" Note (Phase 4.5 C0)). On weighted fits the workflow emits a UserWarning and returns a linearity-conditional verdict only.

Both _handle_had and _handle_had_event_study Step-3 why-text + code snippets now explicitly state that survey-weighted fits skip QUG and yield a linearity-conditional verdict (the weighted verdict is conditional on QUG holding by assumption). The event-study text also notes that joint Stute pre-trends and joint homogeneity-linearity tests themselves remain available under survey weighting via the PSU-level Mammen multiplier bootstrap.

P3 REGISTRY § HeterogeneousAdoptionDiD requirements checklist was stale: it still marked "Phase 5: practitioner_next_steps() integration" and "Phase 5 (remaining): llms-full.txt section" as pending. Updated to reflect this PR landing wave 1 of Phase 5; only T21 (HAD pretest workflow tutorial) and T22 (weighted/survey HAD tutorial) remain queued, both tracked in TODO.md.

Tests added (1 new, 89 total):
- test_had_step_3_flags_qug_under_survey_deferral: asserts both HAD handler variants surface the QUG-under-survey skip and the linearity-conditional-verdict caveat (sketched below). Without this caveat agents may assume step 1 / Design 1' vs Design 1 was checked on weighted fits when the library deliberately does not check it there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
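A minimal sketch of the caveat lock described above. It assumes practitioner_next_steps() returns the guidance as a plain string (consistent with the "handlers are STRING-ONLY at runtime" invariant quoted in this thread) and uses placeholder fixture names, not the repository's real fixtures.

```python
import pytest

from diff_diff.practitioner import practitioner_next_steps  # import path assumed


@pytest.mark.parametrize("fixture_name", ["had_overall_result", "had_event_study_result"])
def test_had_step_3_flags_qug_under_survey_deferral(fixture_name, request):
    """Both HAD handlers must surface the QUG-under-survey skip and the conditional verdict."""
    result = request.getfixturevalue(fixture_name)  # placeholder fixtures
    text = practitioner_next_steps(result)          # assumed to return guidance text
    assert "QUG" in text and "skip" in text.lower()
    assert "linearity-conditional" in text.lower()
```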
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
P3 dataclass-docstring drift: PR #402 R3 fixed the llms-full.txt field descriptions to acknowledge that weighted mass-point HAD fits populate variance_formula in {"pweight_2sls", "survey_binder_tsl_2sls"} and effective_dose_mean as the weighted Wald-IV dose gap (per had.py:3585-3660), but the HeterogeneousAdoptionDiDResults dataclass field docstrings in had.py:347-366 still said those fields were continuous-only / None on mass-point fits - leaving two source-of-truth surfaces disagreeing about the same public result object.

Updated both field docstrings to enumerate all four variance_formula labels (continuous + mass-point variants under both the `weights=` shortcut and `survey_design=` paths) and to describe the mass-point weighted Wald-IV dose-gap denominator semantics, `mean(D | Z=1, w) - mean(D | Z=0, w)` where Z = 1{D > d_lower} (worked numerically below).

Tests added (1 new, 90 total):
- test_had_results_dataclass_docstrings_match_weighted_mass_point_contract: uses inspect.getsource(HeterogeneousAdoptionDiDResults) to scan the class source and assert the variance_formula docstring mentions both pweight_2sls and survey_binder_tsl_2sls labels, and the effective_dose_mean docstring mentions mass-point Wald-IV semantics. Locks both field docstrings against drift back to the continuous-only framing now that the llms-full.txt guide and the actual fit() code populate these on mass-point fits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
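To make the dose-gap denominator concrete, a small standalone computation of the weighted Wald-IV dose gap stated above. Illustrative only; the variable names are placeholders, not the library's internals.

```python
import numpy as np


def weighted_wald_iv_dose_gap(dose: np.ndarray, weights: np.ndarray, d_lower: float) -> float:
    """mean(D | Z=1, w) - mean(D | Z=0, w) with Z = 1{D > d_lower}."""
    z = dose > d_lower
    treated_mean = np.average(dose[z], weights=weights[z])
    control_mean = np.average(dose[~z], weights=weights[~z])
    return treated_mean - control_mean


# Example: a mass-point panel where the low group sits exactly at d_lower = 0.
dose = np.array([0.0, 0.0, 2.0, 2.0, 2.0])
weights = np.array([1.0, 2.0, 1.0, 1.0, 3.0])
print(weighted_wald_iv_dose_gap(dose, weights, d_lower=0.0))  # -> 2.0
```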
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
Executive Summary
Methodology: No findings. The changed HAD guidance matches the Methodology Registry on pretest scope, untreated-unit compatibility, and last-cohort-only event-study framing.
Code Quality: No findings.
Performance: No findings. The added ndarray handling is lazy and only imports numpy on the array path.
Maintainability: No findings. The Phase 5 bookkeeping is internally aligned across registry, TODO, and doc dependency mapping.
Tech Debt: No findings. Remaining HAD tutorial work is explicitly tracked in TODO.md.
Security: No findings.
Documentation/Tests
P3 mass-point + survey vcov requirement: per had.py:3495-3507 the mass-point design rejects the default classical vcov family on the survey_design= path with NotImplementedError (the survey path composes Binder-TSL on the HC1-scale influence function, which targets V_HC1 rather than the classical sandwich). The Step-6 sup-t / cband snippet in _handle_had_event_study and the HAD section in llms-full.txt presented weighted event-study fits as a generic survey_design= path without surfacing this constraint, so the example as written would fail at fit time on a mass-point panel.

Both surfaces now make the requirement explicit:
- The Step-6 snippet uses HeterogeneousAdoptionDiD(vcov_type='hc1', ...) with an inline comment explaining that hc1 is required on mass-point + survey and is a no-op on the continuous designs (which use the CCT-2014 robust SE regardless), making it a safe default for the survey-aware example.
- A new "Mass-point + survey constraint" paragraph in the HAD section of llms-full.txt documents the same requirement and routing.

Tests added (2 new, 92 total):
- test_had_event_study_sup_t_snippet_uses_hc1_for_mass_point_survey_compatibility: asserts the sup-t / cband snippet either uses vcov_type='hc1' / robust=True or surfaces the mass-point + survey vcov requirement inline so agents adapting the snippet on a mass-point panel know to add it.
- test_llms_full_had_section_documents_mass_point_survey_vcov_requirement: asserts the HAD section documents the mass-point + survey vcov requirement, i.e. a vcov_type mention paired with mass-point context (sketched below).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
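A compact sketch of the guide-side lock described above. The accessor argument and the section-scoping strategy are assumptions about the bundled guide's layout, not the shipped test.

```python
from diff_diff.guides import get_llm_guide  # accessor named earlier in this thread


def test_llms_full_documents_mass_point_survey_vcov_requirement():
    """The HAD section must pair a vcov_type mention with mass-point + survey context."""
    guide = get_llm_guide("full")  # assumption: "full" keys the bundled llms-full.txt
    # Everything after the first HeterogeneousAdoptionDiD mention; the real test likely scopes tighter.
    had_section = guide.split("HeterogeneousAdoptionDiD", 1)[1]
    assert "vcov_type" in had_section
    assert "mass-point" in had_section.lower() and "survey" in had_section.lower()
```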
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: The prior P3 about weighted HAD event-study guidance is resolved: the new guide and practitioner Step 6 now surface the mass-point + survey vcov requirement.
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
P1 step-2 / Assumption 7 closure precondition: per docs/methodology/REGISTRY.md HeterogeneousAdoptionDiD § "Assumption 7 / step 2 closure" + had_pretests.py:4738-4756 + 2769, aggregate="event_study" closes paper Section 4.2 step 2 ONLY IF the panel has at least one earlier placebo pre-period beyond the base F-1. With only the base F-1 pre-period available (a minimal 3-period event-study, or 4-period under trends_lin=True where the consumed F-2 placebo is auto-dropped), the workflow sets pretrends_joint=None, all_pass=False, and appends 'joint pre-trends skipped (no earlier pre-period)' to the verdict - step 2 stays uncovered.

The previous Step-3 wording in both _handle_had and _handle_had_event_study + the HAD Pretests intro in llms-full.txt said generically that aggregate="event_study" closes the step-2 gap, which is overbroad and could mislead agents on minimal valid event-study panels. All three surfaces now make the precondition explicit AND document the pretrends_joint=None / 'joint pre-trends skipped' fallback verdict (sketched below) so agents know what to expect when the precondition fails.

P2 missing regression coverage: the prior tests locked assumption labels and the QUG-under-survey caveat but did not lock the earlier-pre-period precondition - that is why the overstatement landed in the new agent-facing surfaces without tripping the existing guide / practitioner tests.

Tests added (2 new, 94 total):
- test_had_step_3_documents_earlier_pre_period_precondition_for_step_2: asserts both HAD handler variants surface the 'earlier pre-period' / placebo precondition AND the pretrends_joint=None / 'joint pre-trends skipped' fallback.
- test_llms_full_had_pretests_documents_earlier_pre_period_precondition: same lock on the HAD Pretests section in llms-full.txt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
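To make the fallback concrete, a small sketch of the check an agent might run before treating step 2 as closed. The field names (pretrends_joint, all_pass, verdict) are taken from the discussion above; the workflow's actual return type is not shown in this thread.

```python
def step_2_closed(workflow_result) -> bool:
    """True only when the joint pre-trends test actually ran and passed."""
    if workflow_result.pretrends_joint is None:
        # Minimal panels (base F-1 only, or trends_lin=True consuming the F-2 placebo)
        # skip the joint pre-trends test entirely, so step 2 stays uncovered.
        return False
    if "joint pre-trends skipped" in str(workflow_result.verdict):
        return False
    return bool(workflow_result.all_pass)
```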
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good. The prior P1 from the re-review is resolved. I did not find any unmitigated P0/P1 issues in the changed HAD guidance, registry updates, or practitioner dispatch surfaces.
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
P3 doc drift: PR #402 R3 fixed llms-full.txt and R5 fixed the dataclass field docstrings, but HeterogeneousAdoptionDiDResults.to_dict() still described variance_formula as continuous-only ("pweight" / "survey_binder_tsl") and omitted the mass-point Wald-IV effective_dose_mean semantics. Three internal source-of-truth surfaces were now disagreeing about the same public result object's to_dict() output shape.

Updated the to_dict() docstring to enumerate all four variance_formula labels (pweight, survey_binder_tsl, pweight_2sls, survey_binder_tsl_2sls) and to describe the per-design effective_dose_mean semantics (continuous mean of D / D - d_lower vs the mass-point weighted Wald-IV dose gap mean(D | Z=1, w) - mean(D | Z=0, w)). Mirrors the field-docstring contract from R5.

Tests added (1 new, 95 total):
- test_had_results_to_dict_docstring_matches_weighted_mass_point_contract: reads HeterogeneousAdoptionDiDResults.to_dict.__doc__ and asserts it enumerates all four variance_formula labels and the mass-point Wald-IV effective_dose_mean semantics (sketched below). Mirrors the existing dataclass-field-docstring lock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
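For illustration, a compact sketch of the to_dict() docstring lock described above; the module path is an assumption.

```python
from diff_diff.had import HeterogeneousAdoptionDiDResults  # module path assumed


def test_to_dict_docstring_covers_weighted_mass_point_contract():
    """to_dict() must document all four variance_formula labels and the Wald-IV semantics."""
    doc = HeterogeneousAdoptionDiDResults.to_dict.__doc__ or ""
    for label in ("pweight", "survey_binder_tsl", "pweight_2sls", "survey_binder_tsl_2sls"):
        assert label in doc, f"to_dict() docstring is missing variance_formula label {label!r}"
    assert "Wald-IV" in doc  # mass-point effective_dose_mean semantics
```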
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Summary
- `_handle_had` and `_handle_had_event_study` added to `practitioner.py::_HANDLERS`, routing both HAD result classes through HAD-specific Baker et al. (2025) step guidance (5 steps: pretest workflow, sibling-estimator routing, bandwidth/sup-t bands, event-study disaggregation, design-auto-detection + last-cohort-only-WAS framing).
- `_handle_continuous` gains a Step-4 nudge to `HeterogeneousAdoptionDiD` for ContinuousDiD users on no-untreated panels - bidirectional routing closure.
- `_check_nan_att` extended with an ndarray branch (lazy `numpy` import, `np.all(np.isnan(arr))` semantics so partial-NaN arrays don't over-fire). Scalar path bit-exact preserved across all 12 untouched handlers.
- `## HAD Pretests` index (7 pretest entry points) + Choosing-an-Estimator row added to `diff_diff/guides/llms-full.txt`. Tightens the existing `Continuous treatment intensity` Choosing row with `(some units untreated)` for explicit contrast with the new HAD row.
- 21 new tests (14 in `TestHADDispatch` + 6 in `TestLLMsFullHADCoverage` + 1 fixture-minimality regression locking the "handlers are STRING-ONLY at runtime" stability invariant).
- Closes the Phase 5 "agent surfaces" gap; T21 pretest tutorial and T22 weighted/survey tutorial remain queued as separate notebook PRs.
Test plan
- `pytest tests/test_practitioner.py -v` (47 passed; 16 new HAD tests)
- `pytest tests/test_guides.py -v` (31 passed; 6 new HAD coverage tests)
- `black diff_diff/practitioner.py tests/test_practitioner.py tests/test_guides.py`
- `ruff check diff_diff/practitioner.py tests/test_practitioner.py tests/test_guides.py`
- `mypy diff_diff/practitioner.py` (clean; pre-existing `prep_dgp.py` errors unrelated)
- `ready-for-ci` label

🤖 Generated with Claude Code