HAD Phase 5 wave 1: agent-facing surfaces (_handle_had + llms-full.txt) #402
Conversation
Add _handle_had + _handle_had_event_study to practitioner.py, routing both HeterogeneousAdoptionDiD result classes through HAD-specific Baker et al. (2025) step guidance: did_had_pretest_workflow (step 3), ContinuousDiD/CallawaySantAnna routing nudge (step 4), bandwidth_diagnostics + simultaneous bands (step 6), per-horizon WAS event-study disaggregation (step 7), design-auto-detection + last-cohort-only-WAS framing (step 8).

Symmetric pair: _handle_continuous gains Step-4 routing to HAD on no-untreated panels, so the HAD <-> ContinuousDiD routing loop is now bidirectional.

Extend _check_nan_att with an ndarray branch (lazy numpy import + np.all(np.isnan(arr)) semantics so partial-NaN arrays don't over-fire the warning; sketched below). Scalar path bit-exact preserved across all 12 untouched handlers.

Add a full HAD section + result-class blocks + "## HAD Pretests" index covering all 7 pretest entry points + a Choosing-an-Estimator row to diff_diff/guides/llms-full.txt (the bundled-in-wheel agent reference). Tighten the existing "Continuous treatment intensity" Choosing row with "(some units untreated)" so the HAD vs ContinuousDiD contrast is explicit.

Framing: "no untreated unit" / dose variation, never "no comparison group"; locked by negative-assertion tests on both the handler text and the llms-full.txt section.

docs/doc-deps.yaml: remove the llms-full.txt deferral note on had.py and add llms-full.txt entries to the had.py, had_pretests.py, and practitioner.py blocks.

21 new tests (14 in tests/test_practitioner.py::TestHADDispatch + 6 in tests/test_guides.py::TestLLMsFullHADCoverage + 1 fixture-minimality regression locking the "handlers are STRING-ONLY at runtime" stability invariant).

Closes the Phase 5 "agent surfaces" gap; the T21 pretest tutorial and T22 weighted/survey tutorial remain queued as separate notebook PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
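For orientation, a minimal sketch of the _check_nan_att ndarray branch described above. This is illustrative only: the real helper lives in diff_diff/practitioner.py, and its exact signature, scalar handling, and warning text are not shown in this thread.

```python
import math
from typing import Any


def _check_nan_att(att: Any) -> bool:
    """Return True when the ATT payload is entirely NaN (handler should warn)."""
    # Array branch is checked by name so numpy is only imported lazily, when an
    # ndarray actually reaches the check.
    if type(att).__name__ == "ndarray":
        import numpy as np

        arr = np.asarray(att, dtype=float)
        # Fire only when *every* horizon is NaN, so a partially estimated
        # event-study vector does not over-trigger the warning.
        return bool(np.all(np.isnan(arr)))
    # Scalar branch: unchanged semantics relative to the pre-existing handlers
    # (assumed here to be a plain NaN-float check).
    return isinstance(att, float) and math.isnan(att)
```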
Overall Assessment
Executive Summary
Methodology
Code Quality
Performance: No findings.
Maintainability: No additional findings beyond the doc/test issues below.
Tech Debt: No findings.
Security: No findings.
Documentation/Tests
Path to Approval
P1 methodology fix: Step-4 routing in _handle_had + _handle_had_event_study no longer says "switch away from HAD if untreated units exist" - that contradicts REGISTRY § HeterogeneousAdoptionDiD edge cases (line 2403: "Authors do NOT require untreated units to be dropped"; line 2408 + had.py:1325: never-treated units RETAINED on staggered event-study). Reframed as the actual estimand differentiator: HAD targets WAS at the dose support boundary; ContinuousDiD targets per-dose ATT(d) / ACRT(d) and requires never-treated controls. Routing fires only when the user wants the ATT(d) estimand AND has never-treated controls, not on "untreated units exist". Tightens the corresponding Choosing-an-Estimator table row to surface WAS vs ATT(d) as the differentiator.

P2 (a) signatures: the llms-full.txt HAD constructor + fit() blocks now match the actual HeterogeneousAdoptionDiD.__init__ / .fit signatures exactly. Drops invented kwargs (h, b, rcond) and adds the real ones (d_lower, kernel, vcov_type, robust, cluster). The aggregate default is corrected from None to "overall". fit() now lists survey, weights, cband (positional-or-keyword) and survey_design + trends_lin (keyword-only).

P2 (b) snippet bugs: result.bandwidth_diagnostics -> results.bandwidth_diagnostics (matching the plural convention of other handlers); the sup-t snippet now imports SurveyDesign and constructs sd before passing survey_design=sd (it previously passed survey_design=design with no design defined).

P2 (c) tests: new TestLLMsFullHADCoverage tests use inspect.signature(HeterogeneousAdoptionDiD.__init__) and .fit() to regress the documented signatures against the real API (sketched below). New test_llms_full_had_section_methodology_compatible_with_untreated locks the negative assertion that the section does NOT carry framing contradicting the registry. Practitioner tests gain test_had_step_4_does_not_misframe_untreated_unit_routing, test_had_handler_snippets_are_valid_python_syntax (catches snippet syntax errors via ast.parse), and test_handle_continuous_step_4_snippet_is_valid_python.

83 tests pass (47 in test_practitioner including 5 new + 36 in test_guides including 9 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
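A minimal sketch of the inspect.signature regression idea mentioned above. The import path, the exact parameter kinds, and the test name are assumptions drawn from this thread, not the test's real body.

```python
import inspect

from diff_diff import HeterogeneousAdoptionDiD  # top-level export assumed


def test_fit_signature_matches_documented_kwargs():
    """Lock the fit() kwargs documented in llms-full.txt to the real API."""
    params = inspect.signature(HeterogeneousAdoptionDiD.fit).parameters
    # Names the guide documents after this round of fixes.
    for name in ("survey", "weights", "cband", "survey_design", "trends_lin"):
        assert name in params, f"fit() no longer accepts documented kwarg {name!r}"
    # survey_design and trends_lin are documented as keyword-only.
    for name in ("survey_design", "trends_lin"):
        assert params[name].kind is inspect.Parameter.KEYWORD_ONLY
```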
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment
Executive Summary
Methodology
Code Quality: No findings.
Performance: No findings.
Maintainability: No findings.
Tech Debt
Security: No findings.
Documentation/Tests
Path to Approval
P1 pretest assumption labels: _handle_had step 3 + the llms-full.txt HAD Pretests section + the qug_test/stute_test bullets misstated which paper assumption each shipped test actually tests:
- qug_test was labeled "Assumption 5 support condition", but QUG tests H_0: d_lower = 0 (paper Theorem 4 / step 1 of the workflow). Assumption 5 is the Design 1 sign-identification condition and is NOT testable via pre-trends per REGISTRY.md:2270.
- stute_test was labeled "Assumption 7 mean-independence", but stute_test is the Assumption 8 linearity test (paper Section 4.2 step 3 / Appendix D). Assumption 7 is pre-trends (step 2).
- did_had_pretest_workflow(aggregate="overall") was implied to cover step 2, but the workflow runs steps 1 + 3 only; step 2 is explicitly not covered on the overall path (had_pretests.py:4434-4441 + the workflow's verdict flags the gap).

Rewrote both surfaces to match the actual contracts: QUG = paper Theorem 4 support-infimum test (step 1, decides Design 1' vs Design 1); Stute / Yatchew-HR = Assumption 8 linearity tests (step 3); Assumption 7 step 2 closure requires aggregate="event_study" (joint Stute pre-trends). The Assumption 7 / step 2 gap is explicitly flagged on the overall path so agents do not assume coverage where there is none.

P2 result-class field tables incomplete: the HeterogeneousAdoptionDiDResults table was missing n_mass_point, n_above_d_lower, cluster_name, bias_corrected_fit, variance_formula, effective_dose_mean. The HeterogeneousAdoptionDiDEventStudyResults table was missing vcov_type, cluster_name, bandwidth_diagnostics, bias_corrected_fit, filter_info. Added all missing fields with correct types and descriptions.

Tests added (3 new, 86 total):
- test_llms_full_had_results_class_field_lists_match_real_dataclass: uses dataclasses.fields() to enumerate every public field on both result classes and assert each appears in the documented table (sketched below). Catches future drift where new fields land but the guide is not updated.
- test_llms_full_had_pretests_assumption_labels_correct: scans the qug_test and stute_test bullets in the HAD Pretests section and enforces positive labels (support-infimum / Theorem 4 / linearity) + forbids positive Assumption-5 / Assumption-7 misclaims (negative disclaimers like "QUG does NOT test Assumption 5" remain allowed).
- test_had_step_3_pretest_assumption_labels_correct: same checks on the practitioner.py _handle_had step-3 why-text; also requires positive acknowledgment of the Assumption 7 / step 2 gap on the overall workflow path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
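A rough sketch of the dataclasses.fields() drift lock referenced above. The module path, the guide-accessor argument, and the private-field convention are assumptions, not the shipped test body.

```python
import dataclasses

from diff_diff.guides import get_llm_guide            # accessor named later in this thread
from diff_diff.had import HeterogeneousAdoptionDiDResults  # module path assumed


def test_documented_table_lists_every_public_field():
    """Every public result field must appear somewhere in the bundled guide text."""
    guide = get_llm_guide("full")  # assumption: "full" keys the bundled llms-full.txt
    for field in dataclasses.fields(HeterogeneousAdoptionDiDResults):
        if field.name.startswith("_"):
            continue  # skip private bookkeeping fields
        assert field.name in guide, f"llms-full.txt is missing result field {field.name!r}"
```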
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Execution note: this was a static review only; I could not run …
P2 (a) variance_formula / effective_dose_mean field descriptions in
the single-period results table understated weighted mass-point fits.
The single-period table previously said both fields were
continuous-path-only / None on mass-point fits, but per had.py:3585-3629
weighted mass-point fits populate variance_formula in {"pweight_2sls",
"survey_binder_tsl_2sls"} and effective_dose_mean as the weighted
Wald-IV dose gap. Both rows now enumerate all four variance_formula
labels and describe the mass-point Wald-IV dose-gap denominator
explicitly.
P2 (b) llms-practitioner.txt Step 4 decision tree was stale: it routed
ALL continuous-intensity designs to ContinuousDiD, omitting HAD. That
contradicted the new runtime estimator-selection guidance and would
steer agents to the wrong estimator on universal-rollout panels.
Updated the continuous branch to distinguish ContinuousDiD (per-dose
ATT(d) / ACRT(d), requires never-treated controls) from
HeterogeneousAdoptionDiD (WAS at dose support boundary, compatible with
universal rollout or small never-treated share).
Tests added (2 new, 88 total):
- test_llms_full_had_variance_formula_describes_all_designs: scans the
variance_formula and effective_dose_mean rows in the single-period
HAD results table; asserts all four variance_formula labels (pweight,
survey_binder_tsl, pweight_2sls, survey_binder_tsl_2sls) and the
mass-point Wald-IV dose-gap semantics are present.
- test_llms_practitioner_step_4_distinguishes_had_from_continuous:
parses the Step 4 decision tree from get_llm_guide("practitioner");
asserts both ContinuousDiD and HeterogeneousAdoptionDiD appear in the
continuous branch and that never-treated / universal-rollout framing
is the routing distinguisher.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
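For illustration, a rough sketch of the Step-4 decision-tree lock described above. The guide's internal "Step 4"/"Step 5" delimiters and the parsing strategy are assumptions about the practitioner guide's layout; only the get_llm_guide("practitioner") accessor is taken from this thread.

```python
from diff_diff.guides import get_llm_guide  # module path assumed


def test_step_4_distinguishes_had_from_continuous():
    """The continuous-dose branch must route to both estimators, split on never-treated availability."""
    guide = get_llm_guide("practitioner")
    # Isolate the Step 4 block; the delimiters below are assumed layout, not verified.
    step_4 = guide.split("Step 4", 1)[1].split("Step 5", 1)[0]
    assert "ContinuousDiD" in step_4
    assert "HeterogeneousAdoptionDiD" in step_4
    # Never-treated / universal-rollout availability is the routing distinguisher.
    assert "never-treated" in step_4.lower() or "universal rollout" in step_4.lower()
```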
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment
Executive Summary
Methodology
Code Quality: No findings.
Performance: No findings.
Maintainability: No findings.
Tech Debt: No findings. The remaining tutorial follow-up is appropriately tracked in TODO.md.
Security: No findings.
Documentation/Tests
Path to Approval
P1 HAD Step-3 overstated pretest coverage on weighted/survey fits: practitioner_next_steps() said did_had_pretest_workflow runs QUG on both the overall and event-study paths without noting that the workflow explicitly skips QUG whenever survey_design= / survey= / weights= is supplied (Phase 4.5 C0 deferral, had_pretests.py:4488-4495 + REGISTRY § "QUG Null Test" Note (Phase 4.5 C0)). On weighted fits the workflow emits a UserWarning and returns a linearity-conditional verdict only.

Both _handle_had and _handle_had_event_study Step-3 why-text + code snippets now explicitly state that survey-weighted fits skip QUG and yield a linearity-conditional verdict (the weighted verdict is conditional on QUG holding by assumption). The event-study text also notes that joint Stute pre-trends and joint homogeneity-linearity tests themselves remain available under survey weighting via the PSU-level Mammen multiplier bootstrap.

P3 REGISTRY § HeterogeneousAdoptionDiD requirements checklist was stale: it still marked "Phase 5: practitioner_next_steps() integration" and "Phase 5 (remaining): llms-full.txt section" as pending. Updated to reflect this PR landing wave 1 of Phase 5; only T21 (HAD pretest workflow tutorial) and T22 (weighted/survey HAD tutorial) remain queued, both tracked in TODO.md.

Tests added (1 new, 89 total):
- test_had_step_3_flags_qug_under_survey_deferral: asserts both HAD handler variants surface the QUG-under-survey skip and the linearity-conditional-verdict caveat (sketched below). Without this caveat agents may assume step 1 / Design 1' vs Design 1 was checked on weighted fits when the library deliberately does not check it there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
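A minimal sketch of the caveat lock described above. It assumes practitioner_next_steps() returns the guidance as a plain string (consistent with the "handlers are STRING-ONLY at runtime" invariant quoted in this thread) and uses placeholder fixture names, not the repository's real fixtures.

```python
import pytest

from diff_diff.practitioner import practitioner_next_steps  # import path assumed


@pytest.mark.parametrize("fixture_name", ["had_overall_result", "had_event_study_result"])
def test_had_step_3_flags_qug_under_survey_deferral(fixture_name, request):
    """Both HAD handlers must surface the QUG-under-survey skip and the conditional verdict."""
    result = request.getfixturevalue(fixture_name)  # placeholder fixtures
    text = practitioner_next_steps(result)          # assumed to return guidance text
    assert "QUG" in text and "skip" in text.lower()
    assert "linearity-conditional" in text.lower()
```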
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
P3 dataclass-docstring drift: PR #402 R3 fixed the llms-full.txt field descriptions to acknowledge that weighted mass-point HAD fits populate variance_formula in {"pweight_2sls", "survey_binder_tsl_2sls"} and effective_dose_mean as the weighted Wald-IV dose gap (per had.py:3585-3660), but the HeterogeneousAdoptionDiDResults dataclass field docstrings in had.py:347-366 still said those fields were continuous-only / None on mass-point fits - leaving two source-of-truth surfaces disagreeing about the same public result object.

Updated both field docstrings to enumerate all four variance_formula labels (continuous + mass-point variants under both the `weights=` shortcut and `survey_design=` paths) and to describe the mass-point weighted Wald-IV dose-gap denominator semantics, `mean(D | Z=1, w) - mean(D | Z=0, w)` where Z = 1{D > d_lower} (worked numerically below).

Tests added (1 new, 90 total):
- test_had_results_dataclass_docstrings_match_weighted_mass_point_contract: uses inspect.getsource(HeterogeneousAdoptionDiDResults) to scan the class source and assert the variance_formula docstring mentions both pweight_2sls and survey_binder_tsl_2sls labels, and the effective_dose_mean docstring mentions mass-point Wald-IV semantics. Locks both field docstrings against drift back to the continuous-only framing now that the llms-full.txt guide and the actual fit() code populate these on mass-point fits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
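To make the dose-gap denominator concrete, a small standalone computation of the weighted Wald-IV dose gap stated above. Illustrative only; the variable names are placeholders, not the library's internals.

```python
import numpy as np


def weighted_wald_iv_dose_gap(dose: np.ndarray, weights: np.ndarray, d_lower: float) -> float:
    """mean(D | Z=1, w) - mean(D | Z=0, w) with Z = 1{D > d_lower}."""
    z = dose > d_lower
    treated_mean = np.average(dose[z], weights=weights[z])
    control_mean = np.average(dose[~z], weights=weights[~z])
    return treated_mean - control_mean


# Example: a mass-point panel where the low group sits exactly at d_lower = 0.
dose = np.array([0.0, 0.0, 2.0, 2.0, 2.0])
weights = np.array([1.0, 2.0, 1.0, 1.0, 3.0])
print(weighted_wald_iv_dose_gap(dose, weights, d_lower=0.0))  # -> 2.0
```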
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
Executive Summary
Methodology: No findings. The changed HAD guidance matches the Methodology Registry on pretest scope, untreated-unit compatibility, and last-cohort-only event-study framing.
Code Quality: No findings.
Performance: No findings. The added ndarray handling is lazy and only imports numpy on the array path.
Maintainability: No findings. The Phase 5 bookkeeping is internally aligned across registry, TODO, and doc dependency mapping.
Tech Debt: No findings. Remaining HAD tutorial work is explicitly tracked in TODO.md.
Security: No findings.
Documentation/Tests
P3 mass-point + survey vcov requirement: per had.py:3495-3507 the mass-point design rejects the default classical vcov family on the survey_design= path with NotImplementedError (the survey path composes Binder-TSL on the HC1-scale influence function, which targets V_HC1 rather than the classical sandwich). The Step-6 sup-t / cband snippet in _handle_had_event_study and the HAD section in llms-full.txt presented weighted event-study fits as a generic survey_design= path without surfacing this constraint, so the example as written would fail at fit time on a mass-point panel.

Both surfaces now make the requirement explicit:
- The Step-6 snippet uses HeterogeneousAdoptionDiD(vcov_type='hc1', ...) with an inline comment explaining that hc1 is required on mass-point + survey and is a no-op on the continuous designs (which use the CCT-2014 robust SE regardless), making it a safe default for the survey-aware example.
- A new "Mass-point + survey constraint" paragraph in the HAD section of llms-full.txt documents the same requirement and routing.

Tests added (2 new, 92 total):
- test_had_event_study_sup_t_snippet_uses_hc1_for_mass_point_survey_compatibility: asserts the sup-t / cband snippet either uses vcov_type='hc1' / robust=True or surfaces the mass-point + survey vcov requirement inline so agents adapting the snippet on a mass-point panel know to add it.
- test_llms_full_had_section_documents_mass_point_survey_vcov_requirement: asserts the HAD section documents the mass-point + survey vcov requirement, i.e. a vcov_type mention paired with mass-point context (sketched below).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
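A compact sketch of the guide-side lock described above. The accessor argument and the section-scoping strategy are assumptions about the bundled guide's layout, not the shipped test.

```python
from diff_diff.guides import get_llm_guide  # accessor named earlier in this thread


def test_llms_full_documents_mass_point_survey_vcov_requirement():
    """The HAD section must pair a vcov_type mention with mass-point + survey context."""
    guide = get_llm_guide("full")  # assumption: "full" keys the bundled llms-full.txt
    # Everything after the first HeterogeneousAdoptionDiD mention; the real test likely scopes tighter.
    had_section = guide.split("HeterogeneousAdoptionDiD", 1)[1]
    assert "vcov_type" in had_section
    assert "mass-point" in had_section.lower() and "survey" in had_section.lower()
```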
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: The prior P3 about weighted HAD event-study guidance is resolved: the new guide and practitioner Step 6 now surface the mass-point + survey vcov requirement.
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
P1 step-2 / Assumption 7 closure precondition: per docs/methodology/REGISTRY.md HeterogeneousAdoptionDiD § "Assumption 7 / step 2 closure" + had_pretests.py:4738-4756 + 2769, aggregate="event_study" closes paper Section 4.2 step 2 ONLY IF the panel has at least one earlier placebo pre-period beyond the base F-1. With only the base F-1 pre-period available (a minimal 3-period event-study, or 4-period under trends_lin=True where the consumed F-2 placebo is auto-dropped), the workflow sets pretrends_joint=None, all_pass=False, and appends 'joint pre-trends skipped (no earlier pre-period)' to the verdict - step 2 stays uncovered.

The previous Step-3 wording in both _handle_had and _handle_had_event_study + the HAD Pretests intro in llms-full.txt said generically that aggregate="event_study" closes the step-2 gap, which is overbroad and could mislead agents on minimal valid event-study panels. All three surfaces now make the precondition explicit AND document the pretrends_joint=None / 'joint pre-trends skipped' fallback verdict (sketched below) so agents know what to expect when the precondition fails.

P2 missing regression coverage: the prior tests locked assumption labels and the QUG-under-survey caveat but did not lock the earlier-pre-period precondition - that is why the overstatement landed in the new agent-facing surfaces without tripping the existing guide / practitioner tests.

Tests added (2 new, 94 total):
- test_had_step_3_documents_earlier_pre_period_precondition_for_step_2: asserts both HAD handler variants surface the 'earlier pre-period' / placebo precondition AND the pretrends_joint=None / 'joint pre-trends skipped' fallback.
- test_llms_full_had_pretests_documents_earlier_pre_period_precondition: same lock on the HAD Pretests section in llms-full.txt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
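To make the fallback concrete, a small sketch of the check an agent might run before treating step 2 as closed. The field names (pretrends_joint, all_pass, verdict) are taken from the discussion above; the workflow's actual return type is not shown in this thread.

```python
def step_2_closed(workflow_result) -> bool:
    """True only when the joint pre-trends test actually ran and passed."""
    if workflow_result.pretrends_joint is None:
        # Minimal panels (base F-1 only, or trends_lin=True consuming the F-2 placebo)
        # skip the joint pre-trends test entirely, so step 2 stays uncovered.
        return False
    if "joint pre-trends skipped" in str(workflow_result.verdict):
        return False
    return bool(workflow_result.all_pass)
```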
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good. The prior P1 from the re-review is resolved. I did not find any unmitigated P0/P1 issues in the changed HAD guidance, registry updates, or practitioner dispatch surfaces.
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
P3 doc drift: PR #402 R3 fixed llms-full.txt and R5 fixed the dataclass field docstrings, but HeterogeneousAdoptionDiDResults.to_dict() still described variance_formula as continuous-only ("pweight" / "survey_binder_tsl") and omitted the mass-point Wald-IV effective_dose_mean semantics. Three internal source-of-truth surfaces were now disagreeing about the same public result object's to_dict() output shape.

Updated the to_dict() docstring to enumerate all four variance_formula labels (pweight, survey_binder_tsl, pweight_2sls, survey_binder_tsl_2sls) and to describe the per-design effective_dose_mean semantics (continuous mean of D / D - d_lower vs the mass-point weighted Wald-IV dose gap mean(D | Z=1, w) - mean(D | Z=0, w)). Mirrors the field-docstring contract from R5.

Tests added (1 new, 95 total):
- test_had_results_to_dict_docstring_matches_weighted_mass_point_contract: reads HeterogeneousAdoptionDiDResults.to_dict.__doc__ and asserts it enumerates all four variance_formula labels and the mass-point Wald-IV effective_dose_mean semantics (sketched below). Mirrors the existing dataclass-field-docstring lock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
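For illustration, a compact sketch of the to_dict() docstring lock described above; the module path is an assumption.

```python
from diff_diff.had import HeterogeneousAdoptionDiDResults  # module path assumed


def test_to_dict_docstring_covers_weighted_mass_point_contract():
    """to_dict() must document all four variance_formula labels and the Wald-IV semantics."""
    doc = HeterogeneousAdoptionDiDResults.to_dict.__doc__ or ""
    for label in ("pweight", "survey_binder_tsl", "pweight_2sls", "survey_binder_tsl_2sls"):
        assert label in doc, f"to_dict() docstring is missing variance_formula label {label!r}"
    assert "Wald-IV" in doc  # mass-point effective_dose_mean semantics
```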
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Summary
- `_handle_had` and `_handle_had_event_study` added to `practitioner.py::_HANDLERS`, routing both HAD result classes through HAD-specific Baker et al. (2025) step guidance (5 steps: pretest workflow, sibling-estimator routing, bandwidth/sup-t bands, event-study disaggregation, design-auto-detection + last-cohort-only-WAS framing).
- `_handle_continuous` gains a Step-4 nudge to `HeterogeneousAdoptionDiD` for ContinuousDiD users on no-untreated panels - bidirectional routing closure.
- `_check_nan_att` extended with an ndarray branch (lazy `numpy` import, `np.all(np.isnan(arr))` semantics so partial-NaN arrays don't over-fire). Scalar path bit-exact preserved across all 12 untouched handlers.
- `## HAD Pretests` index (7 pretest entry points) + Choosing-an-Estimator row added to `diff_diff/guides/llms-full.txt`. Tightens the existing `Continuous treatment intensity` Choosing row with `(some units untreated)` for explicit contrast with the new HAD row.
- 21 new tests (14 in `TestHADDispatch` + 6 in `TestLLMsFullHADCoverage` + 1 fixture-minimality regression locking the "handlers are STRING-ONLY at runtime" stability invariant).
- Closes the Phase 5 "agent surfaces" gap; T21 pretest tutorial and T22 weighted/survey tutorial remain queued as separate notebook PRs.
Test plan
- `pytest tests/test_practitioner.py -v` (47 passed; 16 new HAD tests)
- `pytest tests/test_guides.py -v` (31 passed; 6 new HAD coverage tests)
- `black diff_diff/practitioner.py tests/test_practitioner.py tests/test_guides.py`
- `ruff check diff_diff/practitioner.py tests/test_practitioner.py tests/test_guides.py`
- `mypy diff_diff/practitioner.py` (clean; pre-existing `prep_dgp.py` errors unrelated)
- `ready-for-ci` label

🤖 Generated with Claude Code