# Tutorial 21: HAD pre-test workflow (composite QUG + Stute + Yatchew) #409
## PR Review Report (R1)

**Overall Assessment:** […]

**Executive Summary:** […]
### Methodology

**Finding 1 — P1**
- Severity: P1
- Impact: The new docs/test prose says the QUG fail-to-reject outcome lets HAD's `design="auto"` select the `continuous_at_zero` identification path, conflating two independent rules.
- Concrete fix: Reword the changed prose to separate the two facts: QUG fails to reject `H0: d_lower = 0`, and `design="auto"` independently lands on `continuous_at_zero` via its own heuristic.

**Finding 2 — P3 Informational**
- Severity: P3
- Impact: The two-period `aggregate="overall"` path […]
- Concrete fix: None required.

### Code Quality
No P0/P1/P2 findings. The new test file parses successfully. A pattern-wide grep for inline inference anti-patterns found no new changed-source occurrence; there are no modified estimator/inference paths.

### Performance
No findings. The new drift tests use bootstrap-heavy checks, but that is appropriate for tutorial drift coverage and not a runtime library path.

### Maintainability
No additional P0/P1/P2 findings beyond the registry/doc status issue listed under Documentation/Tests.

### Tech Debt
No blocking tech-debt issue. The TODO row now tracks T22 as remaining, which is consistent with the PR's stated deferred work.

### Security
No findings. No secrets or security-sensitive code paths were introduced in the reviewed non-notebook diff.

### Documentation/Tests

**Finding 3 — P2**
- Severity: P2
- Impact: The methodology registry still says the T21 tutorial is queued/remaining, while this PR marks T21 as landed in TODO, README, changelog, and doc-deps. Since the registry is the project's methodology source of truth for reviews, stale phase status creates avoidable confusion.
- Concrete fix: Update `REGISTRY.md` to mark T21 as shipped.

### Audit Confirmation
### Path to Approval
**P1 — Decouple QUG from HAD's `design="auto"` selector across all surfaces.** The two are independent rules: QUG is a statistical pre-test on `H0: d_lower = 0`; `_detect_design()` is a min/median heuristic (`continuous_at_zero` fires when `d.min() < 0.01 * median(|d|)`). On T21's panel both checks point to the same identification path, but the mechanisms are independent — `_detect_design()` does not consume the QUG p-value. Reword tutorial Section 2 + Section 3, the CHANGELOG entry, and the drift-test docstrings to reflect this. Add `test_had_design_auto_lands_on_continuous_at_zero`: explicitly fits `HAD(design="auto")` on the two-period panel and asserts `design == "continuous_at_zero"` and `target_parameter == "WAS"`, locking the prose claim independently of the QUG-test pins.

**P2 — Update REGISTRY.md to mark T21 shipped (PR #409); leave the T22 row queued.**

All 16 drift tests pass on both backends; notebook executes cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
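The min/median rule is easy to see in isolation. A minimal sketch of the heuristic as the commit describes it, using plain NumPy — `detect_design_sketch` is an illustrative standalone function, not the library's internal `_detect_design()`:

```python
import numpy as np

def detect_design_sketch(d: np.ndarray) -> str:
    # Mirrors the documented rule: continuous_at_zero fires when the
    # minimum dose is tiny relative to the typical dose magnitude.
    # Note this never looks at a QUG p-value.
    if d.min() < 0.01 * np.median(np.abs(d)):
        return "continuous_at_zero"
    return "continuous_near_d_lower"

rng = np.random.default_rng(0)
d = rng.uniform(0.01, 50.0, size=5000)  # T21-style doses: Uniform[$0.01K, $50K]
# median ~ 25, so the threshold is ~0.25; the sample minimum sits near 0.01.
print(detect_design_sketch(d))          # "continuous_at_zero" with overwhelming probability
```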
/ai-review |
🔁 AI review rerun (requested by @igerber). Head SHA: […]

## PR Review Report (R2)

**Overall Assessment:** ✅ Looks good — no unmitigated P0/P1 findings. One non-blocking P2 documentation cleanup remains.

**Executive Summary:** […]
### Methodology

**Finding M1 — P3 Informational**
- Severity: P3
- Impact: The previous P1 is addressed: the prose now separates QUG fail-to-reject from HAD's independent `design="auto"` / `_detect_design()` heuristic.
- Concrete fix: None required.

**Finding M2 — P3 Informational**
- Severity: P3
- Impact: The two-period `aggregate="overall"` path […]
- Concrete fix: None required.

### Code Quality
No P0/P1/P2 findings. The new drift test is focused and parses successfully.

### Performance
No findings. The added bootstrap-heavy checks are test-only drift coverage, not runtime library code.

### Maintainability
No blocking findings. The remaining issue is documentation consistency, listed below.

### Tech Debt
No blocking tech-debt issue. T22 remains tracked in `TODO.md`.

### Security
No findings. A secret-pattern grep across the changed non-notebook files returned no hits.

### Documentation/Tests

**Finding D1 — P2**
- Severity: P2
- Impact: Documentation status is still partially stale: the registry's older Phase 3 follow-up paragraph still says T21 is queued.
- Concrete fix: Update the stale status lines in `REGISTRY.md` and the retroactive CHANGELOG entries.

### Audit Confirmation
- `REGISTRY.md` L2509: `practitioner_next_steps` + the T21 tutorial were marked "queued for Phase 5"; both have now landed (PR #402 + PR #409). Update to reflect actual status; T22 remains queued.
- `CHANGELOG.md` L11 (T21 entry): the drift-test count was "15 tests"; now 16 (after the new `test_had_design_auto_lands_on_continuous_at_zero` added in R1).
- `CHANGELOG.md` L15 (PR #402 entry, retroactive): said "T21 pretest tutorial and T22 weighted/survey tutorial remain queued"; T21 has since landed in PR #409. Update to reflect that.

No methodology change; no test surface changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review |
🔁 AI review rerun (requested by @igerber). Head SHA: […]

## PR Review Report (R3)

**Overall Assessment:** ✅ Looks good — no unmitigated P0/P1 findings. One P2 maintainability/doc-impact mapping issue remains, but it does not block approval.

**Executive Summary:** […]
### Methodology

**Finding M1 — P3 Informational**
- Severity: P3
- Impact: The prior methodology concern is addressed. The tutorial/test surfaces distinguish QUG from the `design="auto"` / `_detect_design()` heuristic.
- Concrete fix: None required.

### Code Quality
No findings. The new test file is focused and parses successfully.

### Performance
No findings. The added bootstrap-heavy checks are test-only drift coverage.

### Maintainability

**Finding MT1 — P2**
- Severity: P2
- Impact: T21 is mapped under `had_pretests.py` in `docs/doc-deps.yaml`, but the new drift test also locks `HAD(design="auto")` / `_detect_design()` behavior from `had.py`.
- Concrete fix: Also add a T21 entry to the `had.py` docs block so a future `had.py` design-selection change does not miss T21 in the manual docs-impact map.

### Tech Debt
No blocking findings. T22 remains tracked in `TODO.md`.

### Security
No findings. A secret-pattern grep over the changed non-notebook files found no secrets.

### Documentation/Tests
No blocking findings. The prior documentation/count issue is resolved in the supplied diff. Verification was limited by missing local dependencies: […]

### Audit Confirmation
/ai-review |
🔁 AI review rerun (requested by @igerber). Head SHA: […]

## PR Review Report (R4)

**Overall Assessment:** […]

**Executive Summary:** […]
### Methodology

**Finding M1 — P1 [Newly identified]**
- Severity: P1
- Impact: The tutorial reverses the HAD design labels: it calls the `continuous_at_zero` path "Design 1" and the `continuous_near_d_lower` path "Design 1'", the reverse of the registry convention.
- Concrete fix: Regenerate the notebook and review extract so every occurrence says Design 1' for `continuous_at_zero` (T21) and Design 1 for `continuous_near_d_lower` (T20).

Documented deviations checked: […]

### Code Quality
No findings. The new test file is focused and AST-parses successfully.

### Performance
No findings. Added bootstrap-heavy checks are test-only drift coverage.

### Maintainability

**Finding MT1 — P2**
- Severity: P2
- Impact: Prior finding remains. T21 is mapped under `had_pretests.py` in `docs/doc-deps.yaml` only.
- Concrete fix: Also add a T21 entry to the `had.py` docs block.

### Tech Debt
No blocking findings. T22 remains tracked in `TODO.md`.

### Security
No findings. A secret-pattern grep over changed non-notebook surfaces found no secrets.

### Documentation/Tests
No additional findings beyond M1 and MT1. I did not load […]

### Path to Approval
### Audit Confirmation
P1 — HAD design label convention was reversed across T21. Per
REGISTRY:2267 + had.py:7-33, the convention is:
- Design 1' = continuous_at_zero (d_lower = 0, QUG case) — that's T21
- Design 1 = continuous_near_d_lower (d_lower > 0) — that's T20
T21 had Design 1 / Design 1' swapped throughout. Fixed in the build
script (Section 1 paper-step taxonomy, Section 2 panel framing,
Section 3 reading-the-verdict, Section 7 Extensions). Notebook
re-executed and review extract regenerated.
Two residual "QUG selects/picks the identification path" leakages from
the original prose also surfaced (Section 7 + Summary checklist). Both
contradicted the explicit QUG-vs-_detect_design separation locked by
test_had_design_auto_lands_on_continuous_at_zero. Reworded to keep the
two rules independent ("QUG fail-to-reject and `design="auto"`
heuristic both pointed independently"; "QUG is a statistical test on
H0; `design="auto"` calls _detect_design() which uses a min/median
heuristic — both pointed to continuous_at_zero on this panel").
P2 (MT1) — T21 was mapped under had_pretests.py in doc-deps.yaml but
the drift test now also locks HAD(design="auto") / _detect_design()
behavior from had.py via test_had_design_auto_lands_on_continuous_at_zero.
Add T21 entry to the had.py docs block with a note on the
_detect_design() drift coverage so a future had.py design-selection
change does not miss T21 in the manual docs-impact map.
All 16 drift tests still pass on Rust; nbmake clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
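For reference, the corrected convention as a literal mapping — a sketch only; `DESIGN_LABELS` is an illustrative name, not an identifier from the library:

```python
# The registry's design-label convention (per REGISTRY:2267 + had.py:7-33).
DESIGN_LABELS = {
    "continuous_at_zero": "Design 1'",      # d_lower = 0 (QUG case) -- T21
    "continuous_near_d_lower": "Design 1",  # d_lower > 0 -- T20
}
```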
/ai-review |
🔁 AI review rerun (requested by @igerber). Head SHA: […]

## PR Review Report (R5)

**Overall Assessment:** […]

**Executive Summary:** […]
### Methodology

**Finding M1 — P1 [Newly identified]**
- Severity: P1
- Impact: The tutorial describes "Step 4" as "Boundary continuity (paper Assumptions 5, 6)" and repeatedly says Assumptions 5/6 remain the non-testable Step 4 caveat on this T21 panel.
- Concrete fix: Reword the tutorial taxonomy so the Section 4 workflow steps are: QUG, pre-trends, linearity/homogeneity, then the "use TWFE if none rejects" decision. Move the non-testable caveat into a separate paragraph and make it design-specific: for T21's Design 1' (`continuous_at_zero`) path the relevant caveat is Assumption 3, not Assumptions 5/6.

**Finding M2 — P3 Informational**
- Severity: P3
- Impact: […]
- Concrete fix: None required.

### Code Quality
No findings. The added test file is focused and AST-parses successfully.

### Performance
No findings. Bootstrap-heavy checks are test-only drift coverage.

### Maintainability
No blocking findings. The supplied diff fixes the previous docs-impact mirror by adding T21 under the `had.py` docs block in `docs/doc-deps.yaml`.

### Tech Debt
No blocking findings. T22 remains tracked in `TODO.md`.

### Security
No findings. A secret-pattern grep over changed non-notebook text/test surfaces found no new secret-like material.

### Documentation/Tests
No additional findings beyond M1. I did not load […]

### Path to Approval
### Audit Confirmation
Two methodology framing errors conflated in the original tutorial:

- "Paper Step 4" was described as "Boundary continuity (Assumptions 5/6)" in the workflow taxonomy. Per REGISTRY's pretest workflow (lines 2482-2487 surrounding the four-step enumeration), Step 4 is actually the DECISION RULE: "if Steps 1-3 don't reject, TWFE may be used." Boundary-continuity assumptions are a separate concern.
- Assumptions 5/6 are Design 1 (`continuous_near_d_lower` / `mass_point`) identification caveats — the library emits a UserWarning citing them on Design 1 fits and stays silent on Design 1' (`continuous_at_zero`) fits per REGISTRY:2532 and had.py. T21's panel resolves to Design 1' via QUG fail-to-reject + the `_detect_design()` heuristic, so the relevant non-testable caveat is **Assumption 3** (uniform continuity of d -> Y_2(d) at zero, REGISTRY:2270), NOT Assumptions 5/6. T21 inherited the 5/6 framing from T20 (which IS Design 1) inappropriately.

Reframed across 7 surfaces in the build script:

- Section 1 four-step enumeration: Step 4 is now the decision rule.
- Section 1: added a separate paragraph for the non-testable identification caveat that is design-path-specific (Assumption 3 for Design 1', Assumptions 5/6 for Design 1) and explicitly notes the library's UserWarning behavior matches this split.
- Section 4 event-study verdict reading: separated Step 4 (decision rule) from the Design 1' caveat.
- Section 4 horizon-detail closing: same split.
- Section 6 leadership template: replaced the "Step 4 / Assumptions 5/6" caveat with the correct Design 1' caveat (Assumption 3); explicit parenthetical noting T20's caveat was different because T20 was Design 1.
- Section 6 bottom line: same split (decision rule vs caveat).
- Section 8 summary checklist: replaced the single Step-4-as-caveat bullet with a two-part bullet on the workflow vs caveat distinction.

Notebook re-executed, review extract regenerated. All 16 drift tests still pass; nbmake clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
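As a sanity check on the corrected taxonomy, Step 4 reduces to a pure decision rule over the Step 1-3 outcomes. A minimal sketch — the function and argument names are illustrative; the workflow deliberately has no dedicated code path for this step, which is exactly the point:

```python
# Step 4 per the registry: a decision rule, not a test. Illustrative only.
def step4_decision(qug_rejects: bool, pretrends_reject: bool, step3_rejects: bool) -> str:
    if not (qug_rejects or pretrends_reject or step3_rejects):
        # Non-rejection supports but does not prove the TWFE assumptions;
        # the non-testable design caveat (Assumption 3 on the Design 1'
        # path, Assumptions 5/6 on Design 1) still applies.
        return "TWFE may be used"
    return "TWFE not supported; rely on the heterogeneity-robust estimate"
```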
/ai-review |
🔁 AI review rerun (requested by @igerber). Head SHA: […]

## PR Review Report (R6)

**Overall Assessment:** […]

**Executive Summary:** […]
### Methodology

**M1 — P1 [Newly identified]**
- Severity: P1
- Impact: The public `did_had_pretest_workflow()` docstring mislabels the Yatchew-HR alternative as "Step 4".
- Concrete fix: Reword the function docstring to say the event-study path uses joint Stute for Step 3; users needing Yatchew robustness can call `yatchew_hr_test` directly.

**M2 — P3 Informational**
- Severity: P3
- Impact: […]
- Concrete fix: None required.

### Code Quality
No findings. The new drift test is focused and AST-parses successfully.

### Performance
No findings. Bootstrap-heavy work is confined to drift tests/tutorial validation.

### Maintainability
No additional findings. The temporary `docs/_review/t21_notebook_extract.md` review extract is tracked for removal in a follow-on PR.

### Tech Debt
No blocking findings. T22 remains tracked in `TODO.md`.

### Security
No findings. A secret-pattern grep over changed text/test surfaces found no hits.

### Documentation/Tests
No additional findings beyond M1. The new drift file defines 16 tests and covers overall vs event-study paths, QUG/design-auto separation, joint horizon counts, and both Yatchew null modes. I could not execute tests because […]

### Path to Approval
### Audit Confirmation
…ion rule

Per REGISTRY's pretest workflow (lines 2482-2487 surrounding the four-step enumeration) and the same module's two correctly-framed docstrings (module-level at line 54, `_compose_verdict_event_study` at line 2735), paper Step 4 is the decision rule "use TWFE if none of the tests rejects" — NOT a test or assumption. The Yatchew-HR test is a Step 3 alternative (a single-horizon swap-in for Stute), and there is no joint Yatchew variant in the paper. The `did_had_pretest_workflow()` docstring at L4445-4446 was the only place in the file that mislabeled the Yatchew alternative as "Step 4".

Reword to:

- correctly tag Yatchew as a Step 3 alternative
- explicitly state Step 4 is the decision rule with no code path
- cross-reference the two existing correctly-framed docstrings in the same file so future readers can confirm the convention is consistent

Surfaced by the PR #409 R6 review (the T21 tutorial pre-test taxonomy work made the inconsistency visible). All 42 workflow-specific tests in `tests/test_had_pretests.py` still pass; the T21 drift test (16 tests) still passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review |
🔁 AI review rerun (requested by @igerber). Head SHA: […]

## PR Review Report (R7)

**Overall Assessment:** ✅ Looks good — no unmitigated P0/P1 findings. The prior P1 methodology-doc mismatch is resolved in the supplied diff.

**Executive Summary:** […]
### Methodology

**M1 — P3 Informational**
- Severity: P3
- Impact: The prior P1 is addressed. The workflow docstring now describes event-study Step 3 as joint Stute, Yatchew-HR as a single-horizon alternative, and paper Step 4 as the decision rule rather than a separate Yatchew test path.
- Concrete fix: None required.

**M2 — P3 Informational**
- Severity: P3
- Impact: The tutorial uses […]
- Concrete fix: None required.

### Code Quality
No findings.

### Performance
No findings. The new bootstrap-heavy work is confined to tutorial drift validation.

### Maintainability
No findings. The supplied diff excludes the temporary review extract.

### Tech Debt
No blocking findings. The T22 weighted/survey HAD tutorial remains tracked in `TODO.md`.

### Security
No findings. A secret-pattern grep over changed text/test surfaces found no actionable hits.

### Documentation/Tests

**D1 — P2**
- Severity: P2
- Impact: The drift file says it pins quoted bootstrap p-values with tolerance bands, but two of the assertions are lower-bound-only (`p > 0.50`), so upward drift would pass silently.
- Concrete fix: Use bounded tolerance bands around the quoted values, preferably backend-aware if needed. For example, assert roughly `0.53 <= p <= 0.84` around the quoted ~0.686.

### Audit Confirmation
Two bootstrap p-value drift tests had lower-bound-only assertions:

- `test_overall_stute_fails_to_reject`: was `p > 0.50`, tutorial quotes ~0.686 → would silently pass if p drifted to 0.99
- `test_event_study_homogeneity_fails_to_reject`: was `p > 0.50`, tutorial quotes ~0.763 → same silent-stale risk

The third bootstrap test (`test_event_study_pretrends_fails_to_reject`) already used a bounded band `0.0 <= p <= 0.25`. Mirror that pattern on the other two with bounded bands per `feedback_bootstrap_drift_tests_need_backend_tolerance` (>= 0.15 width):

- Stute: `0.53 <= p <= 0.84` (band ~0.31 around 0.686)
- Homogeneity: `0.61 <= p <= 0.92` (band ~0.31 around 0.763)

Both bands are wide enough for Rust ↔ pure-Python RNG path differences; both are narrow enough that drift in either direction (toward rejection or toward an even cleaner pass) flags the prose as stale. All 16 drift tests pass on both backends within the new bands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
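A minimal sketch of the bounded-band pattern described above — the fixture and attribute names (`overall_result.stute_p_value`, etc.) are hypothetical; only the band endpoints come from the commit:

```python
# Bounded tolerance bands per feedback_bootstrap_drift_tests_need_backend_tolerance:
# each band is >= 0.15 wide to absorb Rust vs pure-Python RNG differences,
# yet bounded on both sides so drift in either direction fails the test.
def test_overall_stute_fails_to_reject(overall_result):
    p = overall_result.stute_p_value  # hypothetical attribute name
    assert 0.53 <= p <= 0.84          # tutorial quotes ~0.686

def test_event_study_homogeneity_fails_to_reject(event_study_result):
    p = event_study_result.homogeneity_p_value  # hypothetical attribute name
    assert 0.61 <= p <= 0.92                    # tutorial quotes ~0.763
```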
/ai-review |
🔁 AI review rerun (requested by @igerber). Head SHA: […]

## PR Review Report (R8)

**Overall Assessment:** ✅ Looks good — no unmitigated P0/P1 findings in the supplied diff.

**Executive Summary:** […]
### Methodology

**M1 — P3 Informational**
- Severity: P3
- Impact: The prior methodology-doc mismatch is resolved. The docstring now frames event-study Step 3 as joint Stute, Yatchew-HR as a single-horizon alternative, and paper Step 4 as the decision rule rather than another diagnostic.
- Concrete fix: None required.

**M2 — P3 Informational**
- Severity: P3
- Impact: The tutorial's […]
- Concrete fix: None required.

### Code Quality
No findings.

### Performance
No findings. The bootstrap-heavy checks are confined to the tutorial drift test and use module-scoped fixtures.

### Maintainability
No findings. The temporary review extract is slated for removal in a follow-on PR.

### Tech Debt
No blocking findings.

### Security
No findings. A secret-pattern scan over changed text/test surfaces found no actionable hits.

### Documentation/Tests

**D1 — P3 Informational**
- Severity: P3
- Impact: The prior P2 is addressed in the supplied diff: the overall Stute and event-study homogeneity bootstrap p-values now use bounded tolerance bands around the tutorial's quoted values rather than only lower-bounding at `p > 0.50`.
- Concrete fix: None required.

### Audit Confirmation
End-to-end practitioner walkthrough for `did_had_pretest_workflow`, building on T20's brand-campaign framing. Uses a Design 1 (`continuous_at_zero`) panel variant (Uniform[$0.01K, $50K] vs T20's [$5K, $50K]) so the QUG step fails to reject and the verdict text fires the load-bearing "Assumption 7 deferred" pivot for the upgrade-arc narrative.

Three sections:

- Overall workflow on a two-period collapse: Step 1 + Step 3 only; the verdict explicitly flags Step 2 as deferred (single pre-period).
- Upgrade to the event_study workflow: closes all three testable steps via QUG + joint pre-trends Stute (3 horizons) + joint homogeneity Stute (4 horizons); the verdict reads "TWFE admissible under Section 4 assumptions".
- Yatchew side panel comparing `null="linearity"` (default, paper Theorem 7) vs `null="mean_independence"` (Phase 4 R-parity with R `YatchewTest::yatchew_test(order=0)`) on the within-pre-period first-difference paired with post-period dose.

Companion drift-test file with 15 tests pinning panel composition, both verdict pivots, structural anchors on both paths, deterministic stats, and bootstrap p-value tolerance bands per backend.

Updates T20 Section 6 Extensions with a forward-pointer to T21, `docs/tutorials/README.md` with a T21 entry, the `docs/doc-deps.yaml` `had_pretests.py` block, CHANGELOG `[Unreleased]`, and the T21/T22 TODO row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
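A hedged sketch of the side panel's comparison, assuming `yatchew_hr_test` takes dose and outcome arrays — only the `null=` modes are confirmed by the PR text (PR #397); the import path, call shape, and the synthetic `post_dose` / `dy` inputs here are assumptions mirroring the drift-test fixture mentioned later:

```python
import numpy as np
from diff_diff import yatchew_hr_test  # import path assumed

rng = np.random.default_rng(0)
post_dose = rng.uniform(0.01, 50.0, size=500)  # post-period dose (in $K)
dy = rng.normal(0.0, 2.5, size=500)            # within-pre-period first-difference (no real signal)

res_linearity = yatchew_hr_test(post_dose, dy, null="linearity")           # default, paper Theorem 7
res_mean_indep = yatchew_hr_test(post_dose, dy, null="mean_independence")  # R-parity mode
# On the tutorial's panel, the stricter null yields a larger residual
# variance (sigma2_lin 7.01 vs 6.53) and a smaller p-value (0.29 vs 0.49).
```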
…ection vs proof

Two methodology framing issues in T21:

1. The DGP Uniform[$0.01K, $50K] has support strictly above zero. The tutorial / README / CHANGELOG / drift-test docstrings called it a "true Design 1 (`continuous_at_zero`)" panel, conflating "QUG fails to reject d_lower = 0 in this finite sample" with "the true DGP support is at zero". Reframe across all surfaces: the DGP has a strictly-positive but very-near-zero lower bound chosen so QUG fails to reject; HAD's `design="auto"` then selects the `continuous_at_zero` identification path on that QUG outcome (a workflow decision following the test, not a property of the true DGP).
2. The notebook over-described fail-to-reject pre-tests as "formal validation", "conclusive", "closes assumptions", "TWFE admissible without methodological caveat". Soften to "diagnostics fail to reject", "supports but does not prove", "non-rejection evidence under finite-sample power and test specification". Pre-test tutorials should teach the limits of pre-tests, not paper over them.

Also extracts a `yatchew_side_panel_inputs` fixture in the drift test to deduplicate post_dose / dy construction across the two side-panel tests. Numerical pins unchanged; all 15 drift tests still pass on both backends; the notebook executes cleanly; T20 drift unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed tutorial

Two stale shorthand phrasings inconsistent with the revised methodology framing:

- Section 7 Extensions: "single Design 1 panel" → "single panel where QUG led the workflow to select the continuous_at_zero (Design 1) identification path" (matches the corrected Section 2 wording).
- The `test_event_study_pretrends_fails_to_reject` docstring quoted "close to alpha = 0.05 but conclusive"; the user-facing text now says "warrants scrutiny", so update the internal docstring to match.

No methodology change, no new pins; all 15 drift tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P2 — CELL_07's first bullet had a conceptual error in describing the QUG mechanic: "D_(1) is small relative to the gap D_(2)-D_(1)" — actually D_(1) ≈ 0.181 and the gap ≈ 0.047, so D_(1) is 3.86x LARGER than the gap. The reason QUG fails to reject is that T = D_(1)/(D_(2)-D_(1)) = 3.86 lands below the critical value 19, NOT because of any "small relative to the gap" relationship. Rewrote to state the test statistic and critical value directly.

P3 polish:

- CELL_03: "approximately 0.007" → "below 0.01" (avoids numerical drift on a stat that scales with seed; the heuristic threshold itself is what matters).
- CELL_07: added a one-line aside reconciling `all_pass=True` with Step 2 deferral on the overall path: `all_pass` aggregates only the steps that ran on each dispatch, so True here means "of the two steps run, neither rejected" — not that Assumption 7 has been cleared.
- CELL_09: explained the very-large-negative `T_hr` ≈ -35,000 as a scale artifact (sigma2_diff scales with the squared dose-step gap; on Uniform[0.01, 50] doses with a true slope of 100, adjacent-by-dose units have dy gaps that swamp sigma2_lin). Adds an explicit forward reference to the side panel, where a different input gives T_hr ≈ 0 as a sanity check.
- CELL_17: tightened the mean_independence vs linearity framing to "linear fit absorbs any apparent slope (real or sample noise)" — the pre-period has no real signal, so the original "absorbs the dose-response signal" wording was off-target on this panel.

No methodology change; all 16 drift tests still pass; nbmake clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
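The corrected arithmetic in one place, using the values quoted in the commit:

```python
# QUG mechanic on T21's panel, per the corrected CELL_07 text.
d_1 = 0.181            # smallest observed dose, D_(1)
gap = 0.047            # D_(2) - D_(1), gap to the second-smallest dose
T = d_1 / gap          # ~3.85 with these rounded inputs (commit quotes 3.86
                       # from unrounded values); D_(1) is ~3.9x LARGER than the gap
critical_value = 19
fails_to_reject = T < critical_value  # True: QUG fails to reject H0: d_lower = 0
print(round(T, 2), fails_to_reject)   # 3.85 True
```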
The CI AI reviewer's diff-build excludes `docs/tutorials/*.ipynb` (`.github/workflows/ai_pr_review.yml:151-156` + the reviewer prompt's DO-NOT list at `.github/codex/prompts/pr_review.md:87-91`), so the actual T21 notebook prose has not been visible to the CI reviewer through three review rounds. The notebook content was reviewed once via a standalone notebook-aware agent (which caught a P2 conceptual error in CELL_07 + 4 P3 polish items, all addressed in `d9ea86a`), but the CI reviewer itself has only seen the adjacent surfaces (CHANGELOG, drift test, README, REGISTRY).

This commit lands a one-shot markdown extract at `docs/_review/t21_notebook_extract.md` that mirrors the notebook's full narrative (markdown cells + code cells + executed outputs) so the CI reviewer can audit the prose directly on this PR. Regenerate via `python _scratch/t21_pretests/70_extract_for_review.py` from the notebook source-of-truth at `_scratch/t21_pretests/60_build_notebook.py`. Adds `_review` to the Sphinx `exclude_patterns` in `docs/conf.py` so the docs build doesn't pick the file up.

A follow-on PR will (a) remove this extract file + the Sphinx `exclude_patterns` entry and (b) replace the blanket `.ipynb` exclusion in the CI workflow with a markdown-only extraction (a jq one-liner) wired into the diff-build itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
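The notebook JSON layout makes the planned markdown-only extraction a few lines. A sketch in Python of what the follow-on jq one-liner would do (the actual extraction script is not part of this PR; only standard nbformat JSON is assumed):

```python
# Sketch: emit only the markdown cells of a notebook, which is what the
# follow-on PR's extraction would feed to the CI diff-build.
import json

with open("docs/tutorials/21_had_pretest_workflow.ipynb") as f:
    nb = json.load(f)

markdown = "\n\n".join(
    "".join(cell["source"])            # "source" may be a list of lines or a string
    for cell in nb["cells"]
    if cell["cell_type"] == "markdown"
)
print(markdown)
```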
Force-pushed from `2f2c7ca` to `3ab7a86` (Compare).
## Summary
- A new tutorial notebook exercising `did_had_pretest_workflow` that walks through the composite pre-test battery on a panel close in shape to T20's brand campaign, surfaces the structural gap on the two-period (`aggregate="overall"`) path (no Step 2 / parallel pre-trends), and upgrades to the multi-period (`aggregate="event_study"`) path that adds the joint pre-trends Stute and joint homogeneity Stute diagnostics.
- `Uniform[$0.01K, $50K]` for regional spend (vs T20's `[$5K, $50K]`) — true support strictly positive but very near zero, chosen so QUG fails to reject `H0: d_lower = 0` in this finite sample. HAD's `design="auto"` then selects the `continuous_at_zero` identification path on the QUG outcome (a workflow decision following the test, not a property of the true DGP support — explicitly distinguished in the tutorial prose).
- `yatchew_hr_test` null modes side-by-side: `null="linearity"` (default, paper Theorem 7) vs `null="mean_independence"` (PR #397, "Add yatchew_hr_test(null='mean_independence') mode"; R-parity with R `YatchewTest::yatchew_test(order=0)`) on the within-pre-period first-difference paired with post-period dose. Illustrates the stricter null's larger residual variance (sigma2_lin 7.01 vs 6.53) and smaller p-value (0.29 vs 0.49).
- Drift-test file (`tests/test_t21_had_pretest_workflow_drift.py`, 15 tests) pinning panel composition, both verdict pivots, structural anchors on both paths, deterministic QUG / Yatchew statistics, and bootstrap p-value tolerance bands per `feedback_bootstrap_drift_tests_need_backend_tolerance`.
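A hedged sketch of the tutorial's two dispatch calls. Only the function name and the `aggregate=` modes are confirmed by this PR; the `panel` argument and the `.verdict` attribute are assumptions about the API shape:

```python
# Illustrative only: `panel` is a prepared long-format panel (construction
# omitted); the aggregate= modes are from the PR text.
from diff_diff import did_had_pretest_workflow

overall = did_had_pretest_workflow(panel, aggregate="overall")
# Two-period path: runs Step 1 (QUG) + Step 3 only; the verdict flags
# Step 2 (parallel pre-trends) as deferred -- there is a single pre-period.
print(overall.verdict)

event_study = did_had_pretest_workflow(panel, aggregate="event_study")
# Multi-period path: adds joint pre-trends Stute and joint homogeneity
# Stute, closing all three testable steps.
print(event_study.verdict)
```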
### Surfaces touched

- `docs/tutorials/21_had_pretest_workflow.ipynb` (new, 20 cells: 6 code + 14 markdown)
- `tests/test_t21_had_pretest_workflow_drift.py` (new, 15 tests)
- `docs/tutorials/20_had_brand_campaign.ipynb` Section 6 Extensions (forward-pointer to T21)
- `docs/tutorials/README.md` (T21 catalog entry)
- `CHANGELOG.md` `[Unreleased]` Added entry
- `TODO.md` row 112 (T21 marked done; T22 row remains queued)
- `docs/doc-deps.yaml` `had_pretests.py` block (T21 tutorial entry)
No source code changes in `diff_diff/`. The T22 weighted/survey HAD tutorial remains queued as a separate notebook PR per `project_had_followups.md`.

### Test plan
- `pytest tests/test_t21_had_pretest_workflow_drift.py -v` (Rust backend, 15/15 expected)
- `DIFF_DIFF_BACKEND=python pytest tests/test_t21_had_pretest_workflow_drift.py -v` (pure-Python backend, 15/15 expected)
- `pytest --nbmake docs/tutorials/21_had_pretest_workflow.ipynb` (notebook executes cleanly)
- `pytest tests/test_t20_had_brand_campaign_drift.py -v` (T20 drift unaffected by the Section 6 forward-pointer edit, 13/13 expected)

🤖 Generated with Claude Code