Add Tutorial 18: geo-experiment SyntheticDiD walkthrough (B2b)#289
Adds the practitioner-facing geo-experiment tutorial from ROADMAP B2b. Targeted at marketing analytics teams arriving from GeoLift or CausalImpact, the tutorial leads with SyntheticDiD on a synthetic 80-market panel and closes with a cross-validation against GeoLift's published Simulated Retail dataset.

Notebook structure (40 cells, 8 sections):
- Sections 1-2: Problem framing, synthetic DGP (80 markets, 18 treated, 12 weeks)
- Sections 3-4: SDiD fit + diagnostics (unit/time weights, pre-fit RMSE, treated-vs-synthetic visualization)
- Section 5: Placebo vs bootstrap SE comparison and practitioner_next_steps
- Section 6: Cross-validation against GeoLift_Test.rda (chicago/portland), with honest framing of why diff-diff SDiD and GeoLift's ASCM disagree on a small treated group with loose pre-fit
- Section 7: Stakeholder communication template (Tutorial 17 Section 9 pattern)
- Section 8: Positioning - what diff-diff's SDiD adds for the GeoLift/CausalImpact audience, scoped strictly to block-treatment SDiD

All SDiD fits use seed=42 and n_bootstrap=100 for determinism. The notebook executes end-to-end via nbmake in 219s, well under the 600s CI budget.

Cross-links wired:
- docs/index.rst: tutorial added to the "Business Applications" toctree
- docs/practitioner_decision_tree.rst: .. tip:: in Branch 4 (Few Test Markets)
- docs/practitioner_getting_started.rst: Next Steps entry
- docs/tutorials/README.md: tutorial 18 entry (README otherwise stale, out of scope for this PR)
- ROADMAP.md: B2b status -> In progress

New data fixture: docs/tutorials/data/geolift_test.csv (128 KB), extracted from facebookincubator/GeoLift's GeoLift_Test.rda (MIT licensed) via the rdata Python package and augmented with a day index and a treated indicator. Sourced and verified against the GeoLift Walkthrough vignette.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
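The panel described above (80 markets, 18 treated, 12 weeks, block treatment) can be sketched as a long-format DataFrame. This is an illustrative shape only: the function name, column names (market, week, conversions, treat, treated), and the post-period start are assumptions, not the notebook's actual schema.

```python
import numpy as np
import pandas as pd

def toy_geo_panel(n_markets=80, n_treated=18, n_weeks=12, post_start=8, seed=42):
    """Toy long-format geo panel: one row per market-week, with both a
    unit-level ever-treated flag and a time-varying treatment indicator
    that switches on for treated markets in the post period."""
    rng = np.random.default_rng(seed)
    rows = []
    for m in range(n_markets):
        for w in range(n_weeks):
            rows.append({
                "market": f"m{m:02d}",
                "week": w,
                "conversions": rng.normal(1000, 50),
                "treat": int(m < n_treated),                       # ever treated
                "treated": int(m < n_treated and w >= post_start), # block treatment
            })
    return pd.DataFrame(rows)
```

The distinction between the unit-level `treat` flag and the time-varying `treated` indicator matters later in the thread, since a block-treatment SDiD API expects the former.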
Address the four high-value P3 findings from local AI review:

1. Soften the "canonical implementation" / "matches R" wording (Sections 1 and 3 of the notebook). The exact equivalence to R synthdid lives in REGISTRY.md; the tutorial now points readers there instead of overstating equivalence.
2. Fix the "SDiD relaxes parallel-trends by construction" overclaim in Section 5. Identification still rests on assumptions (parallel trends in the *weighted* averages, no anticipation, design plausibility); what changes is the diagnostic, not the assumption. The new wording explains that the relevant test shifts from "are raw control units parallel?" to "is the synthetic match tight enough?" (which is the pre-fit RMSE we already inspected).
3. Rename Section 6 from "Cross-validation on GeoLift's Public Data" to "Cross-Method Comparison on GeoLift's Public Data". Cross-validation implies the same method on different data; we are running different methods (SDiD vs ASCM) on the same data, which is a comparison, not a validation. Updated the matching wording in practitioner_decision_tree.rst, practitioner_getting_started.rst, and docs/tutorials/README.md.
4. Add docs/tutorials/data/README.md as a sidecar provenance/license file for the bundled GeoLift CSV. It documents the source URL, MIT license, extraction date, extraction method (rdata Python package, no R dependency), schema, and a regeneration snippet.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
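The diagnostic shift in point 2 ("is the synthetic match tight enough?") boils down to the pre-fit RMSE. A minimal sketch of that diagnostic, assuming the treated-group average and its synthetic counterpart are available as plain arrays; the function name and signature are hypothetical, not the library's API.

```python
import numpy as np

def pre_fit_rmse(treated_avg, synthetic, pre_period_mask):
    """RMSE between the treated-group average and its synthetic
    counterpart, computed over pre-treatment periods only."""
    treated_avg = np.asarray(treated_avg, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    pre_period_mask = np.asarray(pre_period_mask, dtype=bool)
    gap = treated_avg[pre_period_mask] - synthetic[pre_period_mask]
    return float(np.sqrt(np.mean(gap ** 2)))
```

A small pre-fit RMSE means the synthetic control tracked the treated group closely before treatment, which is the evidence the reworded Section 5 asks readers to inspect.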
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Address PR #289 second-round review feedback (b412101) plus a scope pivot.

Pivot: let SDiD live on its own as the solution to the geo-experiment business problem, rather than defining the tutorial in opposition to GeoLift/CausalImpact. Removed Section 6 (Cross-Method Comparison) and Section 8 (positioning) entirely. Section 1 keeps a single neutral acknowledgment of the GeoLift/CausalImpact audience, but no comparison section, no positioning section, no head-to-head numbers. The tutorial is now 33 cells across 6 sections (was 40 across 8), ending at the "Communicating Results to Leadership" stakeholder template.

P1 fix - bootstrap variance description (Section 5): the previous wording said bootstrap "re-estimates the unit weights", which contradicted both REGISTRY.md:1272-1278 and the _bootstrap_se docstring at synthetic_did.py:729-733. The actual implementation renormalizes the *original* unit weights over the resampled controls and keeps time weights *fixed*. Rewrote the Section 5 SE-methods cell to make this explicit and to contrast it with placebo inference (which DOES re-estimate both weight types per replication).

P3 fix - hard-coded narrative numbers: added a tolerance-based drift-guard cell at the end of Section 4 that asserts the ATT, CI bounds, pre-fit RMSE, and pre-period gap range are all within the tolerances quoted in the Section 3 narrative and the Section 6 stakeholder template. nbmake will fail if generate_did_data() or SyntheticDiD output drifts outside those ranges, forcing the markdown to be updated before the notebook can pass CI.

P3 N/A - "Simulated Retail" vs "real-world" inconsistency: disappeared with the Section 6 deletion above; no longer applies.
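The bootstrap detail in the P1 fix (reuse the original unit weights over the resampled controls and renormalize them, rather than re-solving the weight problem) can be sketched as below. This illustrates the described behavior only; it is not the library's _bootstrap_se implementation.

```python
import numpy as np

def renormalize_weights(unit_weights, sampled_idx):
    """Select the ORIGINAL unit weights for a bootstrap resample of
    control units and renormalize them to sum to 1. No re-estimation:
    the weights themselves are unchanged up to rescaling, and (as the
    commit notes) time weights are kept fixed entirely."""
    w = np.asarray(unit_weights, dtype=float)[np.asarray(sampled_idx)]
    return w / w.sum()
```

Placebo inference, by contrast, would re-run the full weight estimation for each replication, which is why the two standard errors can legitimately differ.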
Files removed:
- docs/tutorials/data/geolift_test.csv (128 KB GeoLift fixture)
- docs/tutorials/data/README.md (sidecar provenance file)
- docs/tutorials/data/ directory (auto-removed when empty)

Cross-link wording in decision_tree.rst, getting_started.rst, and docs/tutorials/README.md updated to drop the "cross-method comparison" phrasing; the tutorial no longer makes that comparison.

Verified locally: nbmake runs in 22.58s (down from 219s, without the slow GeoLift fit), and all 5 drift guards pass on the actual numbers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
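The tolerance-based drift-guard pattern described in the P3 fix can be sketched as a plain assertion helper that nbmake would execute as a notebook cell. The tolerance values below are illustrative placeholders, not the notebook's actual guards.

```python
def check_drift(att, ci_low, ci_high, pre_rmse,
                att_range=(305.0, 320.0), true_effect=300.0, rmse_max=60.0):
    """Fail the notebook run (and hence CI) if estimator outputs drift
    outside the tolerances quoted in the narrative markdown."""
    assert att_range[0] <= att <= att_range[1], (
        f"ATT {att} drifted outside narrative range {att_range}")
    assert ci_low <= true_effect <= ci_high, (
        f"95% CI ({ci_low}, {ci_high}) no longer covers {true_effect}")
    assert pre_rmse < rmse_max, (
        f"pre-fit RMSE {pre_rmse} exceeds {rmse_max}")
    return True
```

Because nbmake surfaces any raised AssertionError as a cell failure, stale narrative numbers block CI until the markdown is updated.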
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Code Quality: No findings.
Performance: No findings.
Maintainability: No findings.
Tech Debt: No findings.
Security: No findings.
Documentation/Tests: No additional findings. The new drift guards in docs/tutorials/18_geo_experiments.ipynb:L366-L376 are a good improvement. I could not rerun
Address PR #289 third-round review (96d2eea).

The reviewer caught a methodology mismatch: my narrative claimed non-parallel pre-trends, but generate_did_data is a parallel-trends DGP. Cross-checking against external practitioner sources (Cunningham's Mixtape, Abadie's framing, the Arkhangelsky et al. (2021) paper itself, and our own practitioner_decision_tree.rst Branch 4) surfaced the deeper issue: with 18 treated markets the example was outside SDiD's documented sweet spot in either direction. SDiD is for FEW treated markets in a LARGE donor pool, where basic DiD's averaging doesn't help and you need a counterfactual built specifically for those few treated units. The Arkhangelsky paper itself warns that SDiD's localization can WORSEN precision relative to basic DiD when there is little systematic heterogeneity to localize on. generate_did_data has no such heterogeneity (just unit FEs and a shared linear trend), so the seed search to "fix" the bias was fighting an uphill battle; SDiD on that DGP is exactly the regime the authors caution against.

Switched to generate_factor_data, which produces interactive fixed effects: different markets have different latent trajectories, which practitioners experience as markets responding to different macro forces (local economics, demographics, competitor activity). This is the regime SDiD is designed for, and in this regime SDiD's localization genuinely helps.

Changes:
- DGP: generate_did_data -> generate_factor_data, with params tuned via parameter search to give realistic conversion magnitudes (~600-2300), CI coverage of the true effect, and small bias (factor_strength=150, treated_loading_shift=0.5, unit_fe_sd=80, noise_sd=25, seed=42)
- Treated count: 18 -> 5 (matches the library decision tree's "Few Test Markets" branch and the canonical synthetic control / SDiD use case)
- Donor pool: 62 -> 75 (more donors for the SDiD weighting to draw on)
- Use raw["treat"] (unit-level ever-treated), not raw["treated"] (time-varying), as required by SDiD's block-treatment API
- Refreshed all narrative numbers in Section 3 (ATT 312, 95% CI (298, 326), 21% lift, recovers the true effect of 300 within 4%)
- Refreshed Section 6 stakeholder template numbers and reframed them for the 5-market pilot context
- Refreshed drift guards: ATT in [305, 320], CI bounds, an explicit "CI must cover 300" assertion, RMSE < 60. Removed the per-week pre_gap assertion: with factor data, individual week gaps after intercept correction can be larger even with a clean overall fit, and the library's pre_treatment_fit RMSE is the right diagnostic
- Section 1: rewrote "Why this is hard" to lean on the few-treated rationale (the small-cluster clustered-SE problem, market heterogeneity, and the fact that parallel trends rarely holds)
- Section 1: added a pointer to practitioner_decision_tree.html#few-test-markets
- Section 2: rewrote the DGP description to explain the factor structure in plain English ("markets respond to different macro forces"), with no jargon about factor loadings
- Section 5 parallel-trends callout: added the Arkhangelsky framing about SDiD "automating the kind of pre-period adjustment practitioners already do informally"

Methodologically careful language locked in:
- No claim that SDiD recovers an effect basic DiD can't (that would require running both)
- No claim that SDiD has tighter CIs than basic DiD (the win is honest inference, not narrower bounds)
- No claim that SDiD is uniquely correct on parallel-trends data (per the Arkhangelsky caveat, it can be worse)
- DOES claim that with this DGP and few treated units, SDiD is in its documented sweet spot per Arkhangelsky et al. (2021), the library decision tree, and Cunningham's Mixtape

Verified: nbmake passes in 53s (versus 219s on the original 18-treated notebook with the GeoLift section, and 22s on the previous 18-treated parallel-trends version), all 4 drift guards are green, and the CI cleanly covers the true treatment effect of 300.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
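The interactive-fixed-effects idea behind generate_factor_data can be illustrated with a toy generator: each market follows a shared latent factor with its own loading, so raw trends are not parallel across markets. Everything below (parameter meanings, scaling, the shape of the latent factor, the week count) is an assumption for illustration, not the library's implementation.

```python
import numpy as np

def make_factor_panel(n_units=80, n_treated=5, n_weeks=20, seed=42,
                      factor_strength=150.0, treated_loading_shift=0.5,
                      unit_fe_sd=80.0, noise_sd=25.0):
    """Toy interactive-fixed-effects panel (units x weeks).

    A single latent factor (a random walk standing in for shared macro
    conditions) enters each market's outcome through a market-specific
    loading; treated markets get shifted loadings, so their trajectories
    systematically differ from the average control."""
    rng = np.random.default_rng(seed)
    factor = np.cumsum(rng.normal(0.0, 1.0, n_weeks))   # latent macro trajectory
    loadings = rng.uniform(0.5, 1.5, n_units)
    loadings[:n_treated] += treated_loading_shift       # treated markets load differently
    unit_fe = rng.normal(0.0, unit_fe_sd, n_units)
    noise = rng.normal(0.0, noise_sd, (n_units, n_weeks))
    y = unit_fe[:, None] + factor_strength * loadings[:, None] * factor[None, :] + noise
    treat = np.zeros(n_units, dtype=bool)
    treat[:n_treated] = True
    return y, treat
```

Because the loadings differ across markets, a two-way fixed-effects model cannot absorb the factor term; this heterogeneity is exactly what SDiD's unit weighting localizes on.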
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Address P3 from PR #289 latest review (f1456fa).

The pre-trends plot legend was hardcoded as 'Treated markets (n=18)' / 'Control markets (n=62)' from the original 18-treated build, but the notebook now generates 5 treated and 75 control markets. The estimates were unaffected, but the core explanatory plot looked internally inconsistent.

The fix derives n_tr and n_co from the data via len(set(...)) before constructing the legend labels with f-strings, so the labels cannot go stale if the DGP is ever retuned again.

Verified locally: nbmake passes in 55.85s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
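The legend fix can be sketched as follows, assuming a long-format panel with unit and treat columns (the column names and helper are assumptions, not the notebook's exact code).

```python
import pandas as pd

def legend_labels(df):
    """Derive legend labels from the data so they track the DGP:
    count distinct ever-treated and control units, then build the
    labels with f-strings instead of hardcoded counts."""
    n_tr = len(set(df.loc[df["treat"] == 1, "unit"]))
    n_co = len(set(df.loc[df["treat"] == 0, "unit"]))
    return (f"Treated markets (n={n_tr})", f"Control markets (n={n_co})")
```

Retuning the DGP then changes the plot labels automatically, which is the "can't go stale" property the commit describes.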
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
PR #289 shipped Tutorial 18 (18_geo_experiments.ipynb), but ROADMAP.md still listed B2b as "In progress" with a description that mentioned the GeoLift/CausalImpact comparison. That comparison was explicitly dropped from the tutorial in commit 96d2eea and remains scoped under the separate B2c row.

- Flip B2b status to "Done (Tutorial 18)"
- Drop the "comparison with GeoLift/CausalImpact" mention (B2c still tracks that work independently)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- docs/tutorials/18_geo_experiments.ipynb - a 40-cell practitioner walkthrough that leads with SyntheticDiD on a synthetic 80-market panel and closes with a cross-method comparison against GeoLift's published Simulated Retail dataset. Targets marketing analytics teams arriving from GeoLift or CausalImpact (ROADMAP B2b).
- docs/tutorials/data/geolift_test.csv (128 KB), extracted from facebookincubator/GeoLift's GeoLift_Test.rda (MIT licensed), with day and treated columns added. Provenance, license, schema, and a regeneration snippet are documented in docs/tutorials/data/README.md.
- Cross-links wired: docs/index.rst toctree, a .. tip:: in docs/practitioner_decision_tree.rst Branch 4, a docs/practitioner_getting_started.rst Next Steps entry, docs/tutorials/README.md, and ROADMAP.md (B2b status -> In progress).

Tutorial structure (8 sections, 40 cells)

- Sections 1-2: Problem framing, synthetic DGP
- Sections 3-4: SDiD fit + diagnostics
- Section 5: Placebo vs bootstrap SE comparison and practitioner_next_steps
- Section 6: Cross-method comparison on GeoLift's public data
- Section 7: Stakeholder communication template
- Section 8: Positioning for the GeoLift/CausalImpact audience

All SDiD fits use seed=42 and n_bootstrap=100 for determinism.

Methodology references (required if estimator / math changes)

See docs/methodology/REGISTRY.md for implementation details.

Validation

- Existing SDiD methodology tests (pytest tests/test_methodology_sdid.py -k "smoke or basic") still pass; the existing test suite is not affected.
- The notebook executes end-to-end (pytest --nbmake --nbmake-timeout=600 docs/tutorials/18_geo_experiments.ipynb) in 219s, well under the 600s CI budget.
- All fits use seed=42, so re-runs produce identical numerical output.

Security / privacy
Generated with Claude Code