
Add Tutorial 18: geo-experiment SyntheticDiD walkthrough (B2b)#289

Merged
igerber merged 5 commits into main from b2b-geo-experiment
Apr 11, 2026

Conversation

@igerber
Owner

@igerber igerber commented Apr 11, 2026

Summary

  • New tutorial: docs/tutorials/18_geo_experiments.ipynb - 40-cell practitioner walkthrough that leads with SyntheticDiD on a synthetic 80-market panel and closes with a cross-method comparison against GeoLift's published Simulated Retail dataset. Targets marketing analytics teams arriving from GeoLift or CausalImpact (ROADMAP B2b).
  • New data fixture: docs/tutorials/data/geolift_test.csv (128 KB) extracted from facebookincubator/GeoLift's GeoLift_Test.rda (MIT licensed), with day and treated columns added. Provenance, license, schema, and a regeneration snippet documented in docs/tutorials/data/README.md.
  • Cross-links: docs/index.rst toctree, docs/practitioner_decision_tree.rst Branch 4 .. tip::, docs/practitioner_getting_started.rst Next Steps, docs/tutorials/README.md, and ROADMAP.md (B2b status -> In progress).

Tutorial structure (8 sections, 40 cells)

  1. The geo-experiment problem (framing for the GeoLift/CausalImpact audience)
  2. Synthetic DGP (80 markets, 18 treated, 12 weeks) with pre-trends visualization
  3. SyntheticDiD fit (true effect 300, recovered 296.64, within 1%)
  4. Diagnostics: unit weights, time weights, pre-fit RMSE, treated-vs-synthetic plot
  5. Inference: placebo (default) vs bootstrap SE comparison + practitioner_next_steps
  6. Cross-method comparison on the GeoLift_Test dataset (chicago/portland test markets, days 91-105 post). diff-diff SDiD: ATT 238 (95% CI: -300 to 777; wide because of loose pre-fit on a 2-treated/38-control panel). GeoLift's published ASCM: ATT 155.56 (5.4% lift). Same direction, different magnitudes - documented honestly as a feature of SDiD's uncertainty propagation, not a failure.
  7. Communicating Results to Leadership (mirrors Tutorial 17 Section 9 pattern)
  8. What diff-diff's SyntheticDiD adds for the GeoLift/CausalImpact audience (scoped strictly to block-treatment SDiD; no out-of-scope claims about staggered designs)

All SDiD fits use seed=42 and n_bootstrap=100 for determinism.

Methodology references (required if estimator / math changes)

  • N/A - no methodology changes. This PR is docs/tutorial-only and adds no new estimator code or REGISTRY.md entries.
  • Tutorial cites Arkhangelsky, Athey, Hirshberg, Imbens, & Wager (2021), AER 111(12), 4088-4118 (the SDiD paper) and points readers to docs/methodology/REGISTRY.md for implementation details.
  • GeoLift comparison numbers in Section 6 are sourced to the GeoLift Walkthrough vignette (Augmented Synthetic Control method, Ben-Michael, Feller & Rothstein 2021).

Validation

  • Tests added/updated: No test changes. Targeted SDiD smoke tests (pytest tests/test_methodology_sdid.py -k "smoke or basic") still pass; the existing test suite is not affected.
  • nbmake: Notebook executes end-to-end via pytest --nbmake --nbmake-timeout=600 docs/tutorials/18_geo_experiments.ipynb in 219s, well under the 600s CI budget.
  • Determinism: All SDiD fits pin seed=42, so re-runs produce identical numerical output.
  • Synthetic walkthrough verification: True treatment effect of 300 in the DGP; SDiD estimates 296.64 (95% CI 263-330, p < 0.01) - 1% error, true effect inside CI.
  • GeoLift cross-method comparison: Real-world dataset where SDiD honestly reports a wide CI (CI crosses zero) due to loose pre-fit (RMSE 753 > treated pre SD of 549) on a 2-treated/38-control panel. Used as a teaching moment for SDiD's uncertainty propagation rather than as a clean replication target.
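The loose-pre-fit diagnostic described above boils down to a simple comparison. A minimal sketch using the numbers quoted in this PR; the rule-of-thumb framing (compare pre-fit RMSE to the treated series' own pre-period variation) is an illustration, not the library's API:

```python
# Numbers quoted in this PR's validation notes (hypothetical variable names).
pre_fit_rmse = 753.0     # synthetic control's pre-period tracking error
treated_pre_sd = 549.0   # treated units' own pre-period standard deviation

# If the synthetic control tracks the treated series no better than the
# series' natural variation, expect a wide CI and read the point estimate
# with caution.
loose_fit = pre_fit_rmse > treated_pre_sd
```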

Security / privacy

  • Confirm no secrets/PII in this PR: Yes
  • Bundled CSV is the public, MIT-licensed GeoLift Simulated Retail dataset (40 US cities, daily counts of an unspecified retail KPI). No personal data, no real customer data.

Generated with Claude Code

igerber and others added 2 commits April 11, 2026 14:33
Adds the practitioner-facing geo-experiment tutorial from ROADMAP B2b. Targeted
at marketing analytics teams arriving from GeoLift or CausalImpact, the tutorial
leads with SyntheticDiD on a synthetic 80-market panel and closes with a
cross-validation against GeoLift's published Simulated Retail dataset.

Notebook structure (40 cells, 8 sections):
- Sections 1-2: Problem framing, synthetic DGP (80 markets, 18 treated, 12 weeks)
- Sections 3-4: SDiD fit + diagnostics (unit/time weights, pre-fit RMSE,
  treated-vs-synthetic visualization)
- Section 5: Placebo vs bootstrap SE comparison and practitioner_next_steps
- Section 6: Cross-validation against GeoLift_Test.rda (chicago/portland), with
  honest framing of why diff-diff SDiD and GeoLift's ASCM disagree on a small
  treated group with loose pre-fit
- Section 7: Stakeholder communication template (Tutorial 17 Section 9 pattern)
- Section 8: Positioning - what diff-diff's SDiD adds for the GeoLift/
  CausalImpact audience, scoped strictly to block-treatment SDiD

All SDiD fits use seed=42 and n_bootstrap=100 for determinism. Notebook
executes end-to-end via nbmake in 219s, well under the 600s CI budget.

Cross-links wired:
- docs/index.rst: tutorial added to "Business Applications" toctree
- docs/practitioner_decision_tree.rst: .. tip:: in Branch 4 (Few Test Markets)
- docs/practitioner_getting_started.rst: Next Steps entry
- docs/tutorials/README.md: tutorial 18 entry (README otherwise stale,
  out of scope for this PR)
- ROADMAP.md: B2b status -> In progress

New data fixture: docs/tutorials/data/geolift_test.csv (128 KB), extracted
from facebookincubator/GeoLift's GeoLift_Test.rda (MIT licensed) via the
rdata Python package and augmented with day index and treated indicator.
Sourced and verified against the GeoLift Walkthrough vignette.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address the four high-value P3 findings from local AI review:

1. Soften "canonical implementation" / "matches R" wording (Sections 1, 3 of
   the notebook). The exact equivalence to R synthdid lives in REGISTRY.md;
   the tutorial now points readers there instead of overstating equivalence.

2. Fix the "SDiD relaxes parallel-trends by construction" overclaim in
   Section 5. Identification still rests on assumptions (parallel trends in
   the *weighted* averages, no anticipation, design plausibility); what
   changes is the diagnostic, not the assumption. The new wording explains
   that the relevant test shifts from "are raw control units parallel?" to
   "is the synthetic match tight enough?" (which is the pre-fit RMSE we
   already inspected).

3. Rename Section 6 from "Cross-validation on GeoLift's Public Data" to
   "Cross-Method Comparison on GeoLift's Public Data". Cross-validation
   implies same method on different data; we are running different methods
   (SDiD vs ASCM) on the same data, which is a comparison, not a validation.
   Updated the matching wording in practitioner_decision_tree.rst,
   practitioner_getting_started.rst, and docs/tutorials/README.md.

4. Add docs/tutorials/data/README.md as a sidecar provenance/license file
   for the bundled GeoLift CSV. Documents source URL, MIT license,
   extraction date, extraction method (rdata Python package, no R
   dependency), schema, and a regeneration snippet.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 11, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b412101cef3990cc01049093a99eaaf5ce28a632


Overall Assessment
⚠️ Needs changes — one unmitigated P1 methodology defect.

Executive Summary

  • No estimator code changed, and the tutorial’s default-placebo SyntheticDiD setup is otherwise consistent with the Methodology Registry.
  • The blocking issue is that the new tutorial misdescribes SyntheticDiD bootstrap variance: it says bootstrap re-estimates unit weights, but the documented and implemented method keeps weights fixed and renormalizes original control weights after resampling.
  • Section 6 is correctly framed as a cross-method comparison rather than a replication, but it inconsistently calls the GeoLift fixture both “Simulated Retail” and “real-world.”
  • Several result claims are hard-coded in markdown with no assertion cells, so notebook execution can pass while the prose silently drifts from actual output.

Methodology

Code Quality

  • No findings in the changed files beyond the methodology issue above.

Performance

  • No findings. This PR is docs/data only, and the added fixture is small.

Maintainability

  • No findings in the changed runtime code paths.

Tech Debt

  • No findings. TODO.md does not currently track the P1 methodology mismatch, so it is unmitigated.

Security

  • No findings. I did not see secrets or PII in the added CSV or docs changes.

Documentation/Tests

  • Severity: P3. Impact: The notebook calls the GeoLift benchmark a “Simulated Retail dataset” at docs/tutorials/18_geo_experiments.ipynb:442 but later calls it “a real-world dataset” at docs/tutorials/18_geo_experiments.ipynb:446 and docs/tutorials/18_geo_experiments.ipynb:517, while the fixture README describes it as the packaged GeoLift example dataset at docs/tutorials/data/README.md:8. This overstates the empirical provenance of the comparison. Concrete fix: Change “real-world dataset” to “published simulated dataset” or similar throughout Section 6.
  • Severity: P3. Impact: The tutorial hard-codes result claims in markdown, e.g. docs/tutorials/18_geo_experiments.ipynb:236 and docs/tutorials/18_geo_experiments.ipynb:541, with no assertion cells nearby. nbmake can catch execution failures, but not stale prose if generate_did_data() or SyntheticDiD output changes. Concrete fix: Add small tolerance-based assertion cells for the narrated ATT/CI/RMSE values, or generate the narrative from live results objects instead of hard-coding numbers.
  • I did not independently rerun nbmake here because the local review environment is missing numpy; the findings above are from static diff/registry/code inspection.

Path to Approval

  1. Update the Section 5 bootstrap explanation in docs/tutorials/18_geo_experiments.ipynb:375 so it matches the registry and implementation. After that, the remaining items are P3 and compatible with ✅ Looks good.

Address PR #289 second-round review feedback (b412101) plus a scope pivot.

Pivot: let SDiD live on its own as the solution to the geo-experiment
business problem, rather than defining the tutorial in opposition to
GeoLift/CausalImpact. Removed Section 6 (Cross-Method Comparison) and
Section 8 (positioning) entirely. Section 1 keeps a single neutral
acknowledgment of the GeoLift/CausalImpact audience but no comparison
section, no positioning section, no head-to-head numbers. Tutorial is
now 33 cells across 6 sections (was 40 across 8), ending at the
"Communicating Results to Leadership" stakeholder template.

P1 fix - Bootstrap variance description (Section 5):
The previous wording said bootstrap "re-estimates the unit weights"
which contradicted both REGISTRY.md:1272-1278 and the _bootstrap_se
docstring at synthetic_did.py:729-733. The actual implementation
renormalizes the *original* unit weights for the resampled controls
and keeps time weights *fixed*. Rewrote the Section 5 SE-methods cell
to make this explicit and contrast with placebo (which DOES re-estimate
both weight types per replication).
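The contrast described above can be illustrated with a minimal numpy sketch of one bootstrap replicate (a hypothetical illustration of the renormalization idea, not the library's `_bootstrap_se` code):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical SDiD unit weights over 5 control markets (they sum to 1).
omega = np.array([0.40, 0.25, 0.15, 0.12, 0.08])

def bootstrap_weights(omega, rng):
    """One bootstrap replicate: resample controls with replacement, keep
    each resampled control's ORIGINAL weight, then renormalize so the
    weights sum to 1. Time weights stay fixed and are not touched.
    (Placebo inference would instead re-estimate both weight sets from
    scratch on each replication.)"""
    idx = rng.integers(0, len(omega), size=len(omega))
    w = omega[idx]
    return idx, w / w.sum()

idx, w = bootstrap_weights(omega, rng)
```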

P3 fix - hard-coded narrative numbers:
Added a tolerance-based drift-guard cell at the end of Section 4 that
asserts the ATT, CI bounds, pre-fit RMSE, and pre-period gap range are
all within the tolerances quoted in the Section 3 narrative and the
Section 6 stakeholder template. nbmake will fail if generate_did_data()
or SyntheticDiD output drifts outside those ranges, forcing the
markdown to be updated before the notebook can pass CI.
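A drift-guard cell of the kind described might look like the following sketch (hypothetical values and tolerances standing in for the notebook's live results object):

```python
# Hypothetical values standing in for a fitted SyntheticDiD results object;
# in the notebook these would be read from the live fit, not hard-coded.
att, ci_lower, ci_upper = 296.64, 263.0, 330.0
true_effect = 300.0

# Tolerance-based guards mirroring the narrated numbers: nbmake fails the
# notebook if the estimates drift outside the ranges quoted in the prose.
assert 290.0 <= att <= 305.0, f"ATT drifted outside narrated range: {att}"
assert ci_lower <= true_effect <= ci_upper, "CI no longer covers the true effect"
assert ci_upper - ci_lower < 100.0, "CI much wider than the narrated interval"
```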

P3 N/A - "Simulated Retail" vs "real-world" inconsistency:
Disappeared with the Section 6 deletion above; no longer applies.

Files removed:
- docs/tutorials/data/geolift_test.csv (128 KB GeoLift fixture)
- docs/tutorials/data/README.md (sidecar provenance file)
- docs/tutorials/data/ directory (auto-removed when empty)

Cross-link wording in decision_tree.rst, getting_started.rst, and
docs/tutorials/README.md updated to drop the "cross-method comparison"
phrasing - the tutorial no longer makes that comparison.

Verified locally: nbmake runs in 22.58s (down from 219s without the
slow GeoLift fit), all 5 drift guards pass on the actual numbers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 11, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 96d2eeae7d4ac00bae212b7e26ffd66cf33a36b3


Overall Assessment

⚠️ Needs changes — one unmitigated P1 methodology issue.

Executive Summary

Methodology

  • Severity: P1 [Newly identified]
  • Impact: Affected method: SyntheticDiD. The tutorial says the observed pre-period plot is “exactly the situation SDiD is designed for” because “we can’t pretend pre-trends are perfectly parallel” (docs/tutorials/18_geo_experiments.ipynb:L175) and frames the motivating problem as treated markets being on a different trajectory (docs/tutorials/18_geo_experiments.ipynb:L36). But the notebook explicitly builds the example with generate_did_data() (docs/tutorials/18_geo_experiments.ipynb:L91-L111), whose docstring and implementation define a basic DiD panel with unit fixed effects plus one shared linear time trend for all units (diff_diff/prep_dgp.py:L27-L31, diff_diff/prep_dgp.py:L99-L102). The project’s own docs separately identify factor-model data as the appropriate synthetic DGP family for SyntheticDiD (docs/tutorials/06_power_analysis.ipynb:L465, diff_diff/prep_dgp.py:L341-L347, diff_diff/prep_dgp.py:L403-L405). As written, the tutorial uses level/noise differences in a parallel-trends DGP to teach a nonparallel-trends identification story.
  • Concrete fix: Make the example and the narrative consistent. Either switch the notebook to a factor-model DGP (for example generate_factor_data(..., n_pre=6, n_post=6, n_treated=18) and pass treatment="treat" to SyntheticDiD.fit()), or keep generate_did_data() but rewrite the Section 1/2 markdown so it describes a simple block-treatment walkthrough with level differences / imperfect pre-fit rather than a failure of parallel trends or a case where SDiD is uniquely motivated.
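The distinction between the two DGP families can be sketched in plain numpy (hypothetical parameter values; this is not the library's `generate_did_data`/`generate_factor_data` implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
n_units, n_periods = 80, 12

# Shared-trend DiD DGP: unit fixed effects plus ONE linear time trend
# common to every unit, so pre-trends are parallel by construction.
alpha = rng.normal(0.0, 5.0, n_units)            # unit fixed effects
t = np.arange(n_periods)
y_parallel = alpha[:, None] + 2.0 * t[None, :] \
    + rng.normal(0.0, 1.0, (n_units, n_periods))

# Factor (interactive fixed effects) DGP: each unit loads differently on
# a latent common factor, so units follow genuinely different trajectories.
lam = rng.normal(1.0, 0.5, n_units)              # unit-specific loadings
f = np.cumsum(rng.normal(0.0, 1.0, n_periods))   # latent factor path
y_factor = alpha[:, None] + lam[:, None] * f[None, :] \
    + rng.normal(0.0, 1.0, (n_units, n_periods))
```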

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. TODO.md does not track the P1 above, so it remains unmitigated.

Security

No findings.

Documentation/Tests

No additional findings. The new drift guards in docs/tutorials/18_geo_experiments.ipynb:L366-L376 are a good improvement. I could not rerun nbmake locally because this review environment is missing numpy (ModuleNotFoundError), so the review above is based on static diff / registry / source inspection.

Path to Approval

  1. Resolve the tutorial/DGP mismatch by either changing the example to a factor-model DGP or rewriting the narrative at docs/tutorials/18_geo_experiments.ipynb:L36, docs/tutorials/18_geo_experiments.ipynb:L91-L111, and docs/tutorials/18_geo_experiments.ipynb:L175 so it no longer claims the current generate_did_data() example demonstrates nonparallel pre-trends.
  2. If you change the DGP, refresh the quoted ATT/CI/RMSE/business-summary numbers and the drift guards at docs/tutorials/18_geo_experiments.ipynb:L366-L376 so the prose and checks stay synchronized.

Address PR #289 third-round review (96d2eea). The reviewer caught a
methodology mismatch: my narrative claimed non-parallel pre-trends but
generate_did_data is a parallel-trends DGP. Cross-checking against
external practitioner sources (Cunningham's Mixtape, Abadie's framing,
the Arkhangelsky et al. (2021) paper itself, and our own
practitioner_decision_tree.rst Branch 4) showed the deeper issue: with
18 treated markets the example was outside SDiD's documented sweet spot
in either direction. SDiD is for FEW treated markets in a LARGE donor
pool, where basic DiD's averaging doesn't help and you need a
counterfactual built specifically for those few treated units.

The Arkhangelsky paper itself warns that SDiD's localization can WORSEN
precision relative to basic DiD when there is little systematic
heterogeneity to localize on. generate_did_data has no such
heterogeneity (just unit FEs and a shared linear trend), so the
seed-search to "fix" the bias was fighting an uphill battle - SDiD on
that DGP is exactly the regime the authors caution against.

Switched to generate_factor_data, which produces interactive fixed
effects (different markets have different latent trajectories - what
practitioners experience as markets responding to different macro
forces: local economics, demographics, competitor activity). This is
the regime SDiD is designed for, and in this regime SDiD's localization
genuinely helps.

Changes:
- DGP: generate_did_data -> generate_factor_data, with params tuned via
  parameter search to give realistic conversion magnitudes (~600-2300),
  CI coverage of the true effect, and small bias
  (factor_strength=150, treated_loading_shift=0.5, unit_fe_sd=80,
  noise_sd=25, seed=42)
- Treated count: 18 -> 5 (matches the library decision tree's "Few Test
  Markets" branch and the canonical synthetic control / SDiD use case)
- Donor pool: 62 -> 75 (more donors for the SDiD weighting to draw on)
- Use raw["treat"] (unit-level ever-treated) not raw["treated"]
  (time-varying), as required by SDiD's block-treatment API
- Refreshed all narrative numbers in Section 3 (ATT 312, 95% CI
  (298, 326), 21% lift, recovers true effect 300 within 4%)
- Refreshed Section 6 stakeholder template numbers and reframed for the
  5-market pilot context
- Refreshed drift guards: ATT in [305, 320], CI bounds, explicit
  "CI must cover 300" assertion, RMSE < 60. Removed the per-week
  pre_gap assertion (with factor data, individual week gaps after
  intercept correction can be larger even with a clean overall fit -
  the library's pre_treatment_fit RMSE is the right diagnostic).
- Section 1: rewrote "Why this is hard" to lean on the few-treated
  rationale (small-cluster cluster-SE problem, market heterogeneity,
  parallel trends rarely holds)
- Section 1: added pointer to practitioner_decision_tree.html#few-test-markets
- Section 2: rewrote DGP description to explain factor structure in
  plain English ("markets respond to different macro forces"), no jargon
  about factor loadings
- Section 5 parallel-trends callout: added the Arkhangelsky framing
  about SDiD "automating the kind of pre-period adjustment practitioners
  already do informally"

Methodologically careful language locked in:
- No claim that SDiD recovers an effect basic DiD can't (would require
  running both)
- No claim that SDiD has tighter CIs than basic DiD (the win is honest
  inference, not narrower bounds)
- No claim that SDiD is uniquely correct on parallel-trends data (per
  the Arkhangelsky caveat, it can be worse)
- DOES claim that with this DGP and few treated, SDiD is in its
  documented sweet spot per Arkhangelsky et al. (2021), the library
  decision tree, and Cunningham's Mixtape

Verified: nbmake passes in 53s (down from 219s on the original 18-treated
notebook with the GeoLift section, and 22s on the previous 18-treated
parallel-trends version), all 4 drift guards green, CI cleanly covers
the true treatment effect of 300.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 11, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: f1456fa8849f6fbe2a6d8b11cfd54ad4fb9c042f


Overall Assessment

✅ Looks good — the prior re-review P1 methodology issue is resolved. I did not find any unmitigated P0/P1 findings in the changed files. One minor P3 docs issue remains.

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. TODO.md does not track any issue that changes this PR’s assessment. TODO.md:L51

Security

  • No findings.

Documentation/Tests

Address P3 from PR #289 latest review (f1456fa). The pre-trends plot
legend was hardcoded as 'Treated markets (n=18)' / 'Control markets
(n=62)' from the original 18-treated build, but the notebook now
generates 5 treated and 75 control. The estimates were unaffected, but
the core explanatory plot looked internally inconsistent.

Fix derives n_tr and n_co from the data via len(set(...)) before
constructing the legend labels with f-strings, so the labels can't go
stale if the DGP is ever retuned again. Verified locally: nbmake passes
in 55.85s.
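The fix described above amounts to deriving the counts from the data before building the legend labels. A minimal sketch (the column names `unit` and `treat` are assumptions, not necessarily the notebook's):

```python
import pandas as pd

# Toy panel standing in for the notebook's dataframe.
df = pd.DataFrame({
    "unit":  ["a", "a", "b", "b", "c", "c"],
    "treat": [1,   1,   0,   0,   0,   0],
})

# Derive the counts from the data instead of hard-coding them, so the
# legend cannot go stale if the DGP is ever retuned again.
n_tr = len(set(df.loc[df["treat"] == 1, "unit"]))
n_co = len(set(df.loc[df["treat"] == 0, "unit"]))
labels = (f"Treated markets (n={n_tr})", f"Control markets (n={n_co})")
```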

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 11, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 897cab9d670bb2e1665340942a1655db1257f8f9


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings in the changed scope.

Executive Summary

  • Re-review focus item resolved: the pre-trends legend now derives treated/control counts from the dataframe, so the prior stale-label docs issue is fixed in docs/tutorials/18_geo_experiments.ipynb:L166-L172.
  • No methodology-affecting code changed. The tutorial’s SDiD setup uses the factor-model generator and the ever-treated indicator expected by SyntheticDiD, consistent with the estimator contract and registry requirements in docs/tutorials/18_geo_experiments.ipynb:L93-L125, diff_diff/prep_dgp.py:L341-L383, diff_diff/synthetic_did.py:L315-L355, and docs/methodology/REGISTRY.md:L1196-L1200.
  • The new inference explanation is aligned with the documented implementation: placebo re-estimates both weight sets, while bootstrap renormalizes original unit weights and keeps time weights fixed, as described in docs/tutorials/18_geo_experiments.ipynb:L409-L416, docs/methodology/REGISTRY.md:L1261-L1278, diff_diff/synthetic_did.py:L717-L741, and diff_diff/synthetic_did.py:L938-L1063.
  • The new tutorial is linked consistently from the main docs entry points in docs/index.rst:L79-L80, docs/practitioner_decision_tree.rst:L219-L223, and docs/practitioner_getting_started.rst:L340-L344.
  • Residual risk only: I could not rerun the notebook locally because this review environment is missing pandas, so numeric claims were checked by static inspection rather than execution.

Methodology

  • No findings. The tutorial’s use of generate_factor_data, raw["treat"], and its placebo/bootstrap descriptions are consistent with the registry and estimator implementation in docs/tutorials/18_geo_experiments.ipynb:L93-L125, docs/tutorials/18_geo_experiments.ipynb:L409-L416, diff_diff/prep_dgp.py:L341-L383, diff_diff/synthetic_did.py:L315-L355, and docs/methodology/REGISTRY.md:L1261-L1312.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. I did not see any new deferrable issue in the changed scope that needs tracking in TODO.md.

Security

  • No findings.

Documentation/Tests

  • No findings. The prior stale-legend issue is resolved in docs/tutorials/18_geo_experiments.ipynb:L166-L172.
  • Residual risk: I was not able to execute docs/tutorials/18_geo_experiments.ipynb in this environment because pandas is unavailable, so I did not independently reproduce the quoted ATT/CI values.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 11, 2026
@igerber igerber merged commit 527a211 into main Apr 11, 2026
3 of 4 checks passed
@igerber igerber deleted the b2b-geo-experiment branch April 11, 2026 21:33
igerber added a commit that referenced this pull request Apr 11, 2026
PR #289 shipped Tutorial 18 (18_geo_experiments.ipynb) but ROADMAP.md
still listed B2b as "In progress" with a description that mentioned
the GeoLift/CausalImpact comparison. That comparison was explicitly
dropped from the tutorial in commit 96d2eea and remains scoped under
the separate B2c row.

- Flip B2b status to "Done (Tutorial 18)"
- Drop the "comparison with GeoLift/CausalImpact" mention (B2c still
  tracks that work independently)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>