92 changes: 92 additions & 0 deletions benchmarks/speed_review/README.md
@@ -0,0 +1,92 @@
# Speed Review - Practitioner Workflow Benchmarks

Scenario-driven performance measurement for end-to-end practitioner chains,
as distinct from `benchmarks/run_benchmarks.py`, which measures R-parity on
isolated `fit()` calls.

## Why these exist

See [`docs/performance-scenarios.md`](../../docs/performance-scenarios.md) for
the full methodology. Short version: the existing benchmarks measure
`fit()` in isolation on 200 x 8 synthetic panels, which does not reflect what
a practitioner running the 8-step Baker et al. (2025) workflow on a real
BRFSS or geo-experiment panel actually sees. These scripts measure the full
chain (Bacon -> fit -> HonestDiD -> cross-estimator robustness -> reporting)
at data shapes anchored to applied-econ conventions.

## Layout

```
benchmarks/speed_review/
├── README.md # this file
├── bench_shared.py # timing + pyinstrument + RSS harness
├── run_all.py # orchestrator (both backends)
├── bench_campaign_staggered.py # Scenario 1: CS + 8-step chain
├── bench_brand_awareness_survey.py # Scenario 2: DiD + SurveyDesign
├── bench_brfss_panel.py # Scenario 3: aggregate_survey -> CS
├── bench_geo_few_markets.py # Scenario 4: SDiD + jackknife
├── bench_reversible_dcdh.py # Scenario 5: dCDH L_max + TSL
├── bench_dose_response.py # Scenario 6: ContinuousDiD splines
├── mem_profile_brfss.py # tracemalloc allocator attribution
│ # for BRFSS-1M (standalone)
├── bench_callaway.py # pre-existing CS scaling sweep
├── baseline_results.json # pre-existing CS baseline
└── baselines/ # this effort's output
├── <scenario>_<backend>.json # phase-level wall-clock + peak RSS
├── mem_profile_brfss_large_<backend>.txt # tracemalloc top-N sites
└── profiles/ # flame HTMLs (gitignored)
└── <scenario>_<backend>.html # pyinstrument flame output
```

Each JSON baseline records both timing (per-phase wall-clock) and memory
(start/peak/growth from a psutil background sampler polling every 10 ms).
The `mem_profile_brfss.py` script does a separate tracemalloc pass on the
BRFSS-1M scenario; it is kept out of the main timing harness because
tracemalloc adds 2-5x overhead and would contaminate the wall-clock baselines.
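
The sampler side of those JSON records can be sketched roughly as follows.
`sample_rss` is an illustrative name, not the actual `bench_shared.py` API;
psutil is assumed to be installed, and the sketch degrades gracefully if not:

```python
# Sketch of a background RSS sampler like the one described above.
# Names here are illustrative, not the real harness API.
import threading
import time

try:
    import psutil
    HAVE_PSUTIL = True
except ImportError:
    HAVE_PSUTIL = False


def sample_rss(interval_s=0.01):
    """Poll this process's RSS every `interval_s` seconds on a daemon thread.

    Returns (stop_fn, samples); `samples` accumulates RSS readings in MB,
    from which start/peak/growth can be derived after the run.
    """
    proc = psutil.Process()
    samples = []
    stop = threading.Event()

    def loop():
        while not stop.is_set():
            samples.append(proc.memory_info().rss / 1e6)
            time.sleep(interval_s)

    t = threading.Thread(target=loop, daemon=True)
    t.start()

    def stop_fn():
        stop.set()
        t.join()

    return stop_fn, samples
```

A 10 ms interval is coarse enough to stay out of the hot path while still
catching transient allocation peaks between phases.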

**Note on profile HTMLs.** pyinstrument flames are ~500KB-1.2MB each and are
regenerated on every run; they live under `baselines/profiles/` which is
gitignored. The key hotspots identified from them are already captured in
the findings doc (top-5 hot phases per scenario); run a scenario locally
to regenerate the full flame when needed.
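
Regenerating a flame locally uses pyinstrument's standard `Profiler` API.
This is a standalone sketch, not the harness's actual wiring; the workload
below is a placeholder for a real scenario phase:

```python
# Illustrative sketch: profile a stand-in workload and render the flame HTML.
try:
    from pyinstrument import Profiler
except ImportError:  # pyinstrument is a one-time pip install (see Running)
    Profiler = None

if Profiler is not None:
    profiler = Profiler(interval=0.001)  # 1 ms sampling interval
    profiler.start()
    total = sum(i * i for i in range(200_000))  # placeholder workload
    profiler.stop()
    # The harness would write this to baselines/profiles/<scenario>_<backend>.html
    html = profiler.output_html()
else:
    html = ""
```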

## Running

```bash
# One-time install
pip install pyinstrument

# All scenarios, both backends, all scales
python benchmarks/speed_review/run_all.py

# One scenario, one backend (the script runs its full scale sweep internally)
DIFF_DIFF_BACKEND=rust python benchmarks/speed_review/bench_campaign_staggered.py

# Subset
python benchmarks/speed_review/run_all.py --scenarios brfss_panel geo_few_markets
```

Multi-scale scenarios write per-scale outputs
(e.g. `campaign_staggered_small_rust.json`, `..._medium_rust.json`,
`..._large_rust.json`). Single-scale scenarios write the scale-free form
(e.g. `dose_response_rust.json`). Full runtime for all scales × both
backends is ~90 seconds on Apple Silicon M4.
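
The naming rule can be expressed as a small helper; `baseline_path` is
hypothetical and only mirrors the convention described above:

```python
def baseline_path(scenario, backend, scale=None):
    """Build a baseline filename: multi-scale scenarios embed the scale,
    single-scale scenarios use the scale-free form."""
    stem = f"{scenario}_{scale}_{backend}" if scale else f"{scenario}_{backend}"
    return f"benchmarks/speed_review/baselines/{stem}.json"
```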

## Where to look for findings

[`docs/performance-plan.md`](../../docs/performance-plan.md) - "Practitioner
Workflow Baseline (v3.1.3)" section holds per-scenario hot-phase rankings
and action recommendations. The scenarios here are the measurement surface;
the findings doc is the decision output.

## Adding a scenario

1. Add the scenario definition to `docs/performance-scenarios.md`
(persona, data shape, operation chain, source anchor).
2. Add `bench_<name>.py` following the existing scripts: build data, define
`phases` as a list of `(label, callable)` tuples, call `run_scenario`.
3. Register it in `run_all.py`'s `SCRIPTS` dict.
4. Run under both backends and commit the refreshed `baselines/*.json`.
   The `baselines/profiles/*.html` flame HTMLs are gitignored and
   regenerated per run; do not commit them.
5. Add a per-scenario finding paragraph to `docs/performance-plan.md`.
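
A minimal skeleton for step 2 might look like the following. `build_data`
and the phase bodies are placeholders, and the local timing loop stands in
for `bench_shared.run_scenario`, whose real signature may differ:

```python
# Hypothetical skeleton for a new bench_<name>.py following the
# (label, callable) phase convention described above.
import time


def build_data():
    # Stand-in for real panel construction.
    return list(range(1_000))


def main():
    data = build_data()

    # Phases as (label, callable) tuples, mirroring the existing scripts.
    phases = [
        ("1_fit", lambda: sum(data)),
        ("2_robustness", lambda: max(data)),
    ]

    # Minimal local timing loop standing in for bench_shared.run_scenario.
    results = {}
    for label, fn in phases:
        t0 = time.perf_counter()
        fn()
        results[label] = {"seconds": time.perf_counter() - t0, "ok": True}
    return results


results = main()
```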
1 change: 1 addition & 0 deletions benchmarks/speed_review/baselines/.gitignore
@@ -0,0 +1 @@
profiles/
@@ -0,0 +1,66 @@
{
"scenario": "brand_awareness_survey_large",
"backend": "python",
"has_rust_backend": false,
"total_seconds": 1.0910496250000001,
"memory": {
"available": true,
"start_mb": 188.45,
"peak_mb": 327.44,
"growth_mb": 138.98,
"sampler_interval_s": 0.01
},
"phases": {
"1_naive_fit_no_survey_design": {
"seconds": 0.009826500000000182,
"ok": true,
"error": null
},
"2_tsl_strata_psu_fpc": {
"seconds": 0.030280333999999964,
"ok": true,
"error": null
},
"3_replicate_weights_jk1": {
"seconds": 0.6243122919999999,
"ok": true,
"error": null
},
"4_multi_outcome_loop_3_metrics": {
"seconds": 0.24174716599999968,
"ok": true,
"error": null
},
"5_check_parallel_trends": {
"seconds": 0.025623749999999834,
"ok": true,
"error": null
},
"6_placebo_refit_pre_period": {
"seconds": 0.01191299999999984,
"ok": true,
"error": null
},
"7_event_study_plus_honest_did": {
"seconds": 0.147335875,
"ok": true,
"error": null
}
},
"metadata": {
"scale": "large",
"n_units": 1000,
"n_periods": 12,
"n_obs": 12000,
"n_strata": 20,
"n_psu_per_stratum": 8,
"n_replicate_weights": 160,
"outcomes": [
"outcome",
"consideration",
"purchase_intent"
]
},
"diff_diff_version": "3.1.3",
"numpy_version": "2.0.2"
}
@@ -0,0 +1,66 @@
{
"scenario": "brand_awareness_survey_large",
"backend": "rust",
"has_rust_backend": true,
"total_seconds": 1.0000031249999999,
"memory": {
"available": true,
"start_mb": 194.03,
"peak_mb": 336.08,
"growth_mb": 142.05,
"sampler_interval_s": 0.01
},
"phases": {
"1_naive_fit_no_survey_design": {
"seconds": 0.013511041000000112,
"ok": true,
"error": null
},
"2_tsl_strata_psu_fpc": {
"seconds": 0.03037650000000003,
"ok": true,
"error": null
},
"3_replicate_weights_jk1": {
"seconds": 0.5431151669999998,
"ok": true,
"error": null
},
"4_multi_outcome_loop_3_metrics": {
"seconds": 0.21752962499999962,
"ok": true,
"error": null
},
"5_check_parallel_trends": {
"seconds": 0.04399687500000038,
"ok": true,
"error": null
},
"6_placebo_refit_pre_period": {
"seconds": 0.016433082999999904,
"ok": true,
"error": null
},
"7_event_study_plus_honest_did": {
"seconds": 0.13501837500000002,
"ok": true,
"error": null
}
},
"metadata": {
"scale": "large",
"n_units": 1000,
"n_periods": 12,
"n_obs": 12000,
"n_strata": 20,
"n_psu_per_stratum": 8,
"n_replicate_weights": 160,
"outcomes": [
"outcome",
"consideration",
"purchase_intent"
]
},
"diff_diff_version": "3.1.3",
"numpy_version": "2.0.2"
}
@@ -0,0 +1,66 @@
{
"scenario": "brand_awareness_survey_medium",
"backend": "python",
"has_rust_backend": false,
"total_seconds": 0.563283334,
"memory": {
"available": true,
"start_mb": 133.69,
"peak_mb": 187.7,
"growth_mb": 54.02,
"sampler_interval_s": 0.01
},
"phases": {
"1_naive_fit_no_survey_design": {
"seconds": 0.010921792000000097,
"ok": true,
"error": null
},
"2_tsl_strata_psu_fpc": {
"seconds": 0.03732066599999995,
"ok": true,
"error": null
},
"3_replicate_weights_jk1": {
"seconds": 0.20805304199999997,
"ok": true,
"error": null
},
"4_multi_outcome_loop_3_metrics": {
"seconds": 0.12622899999999992,
"ok": true,
"error": null
},
"5_check_parallel_trends": {
"seconds": 0.01834783299999998,
"ok": true,
"error": null
},
"6_placebo_refit_pre_period": {
"seconds": 0.054030583000000076,
"ok": true,
"error": null
},
"7_event_study_plus_honest_did": {
"seconds": 0.10836029199999997,
"ok": true,
"error": null
}
},
"metadata": {
"scale": "medium",
"n_units": 500,
"n_periods": 12,
"n_obs": 6000,
"n_strata": 15,
"n_psu_per_stratum": 6,
"n_replicate_weights": 90,
"outcomes": [
"outcome",
"consideration",
"purchase_intent"
]
},
"diff_diff_version": "3.1.3",
"numpy_version": "2.0.2"
}
@@ -0,0 +1,66 @@
{
"scenario": "brand_awareness_survey_medium",
"backend": "rust",
"has_rust_backend": true,
"total_seconds": 0.5500554579999999,
"memory": {
"available": true,
"start_mb": 135.36,
"peak_mb": 184.86,
"growth_mb": 49.5,
"sampler_interval_s": 0.01
},
"phases": {
"1_naive_fit_no_survey_design": {
"seconds": 0.011186999999999947,
"ok": true,
"error": null
},
"2_tsl_strata_psu_fpc": {
"seconds": 0.03363270800000007,
"ok": true,
"error": null
},
"3_replicate_weights_jk1": {
"seconds": 0.18678066699999996,
"ok": true,
"error": null
},
"4_multi_outcome_loop_3_metrics": {
"seconds": 0.16038787500000007,
"ok": true,
"error": null
},
"5_check_parallel_trends": {
"seconds": 0.022171542000000155,
"ok": true,
"error": null
},
"6_placebo_refit_pre_period": {
"seconds": 0.0532650830000001,
"ok": true,
"error": null
},
"7_event_study_plus_honest_did": {
"seconds": 0.08262075000000002,
"ok": true,
"error": null
}
},
"metadata": {
"scale": "medium",
"n_units": 500,
"n_periods": 12,
"n_obs": 6000,
"n_strata": 15,
"n_psu_per_stratum": 6,
"n_replicate_weights": 90,
"outcomes": [
"outcome",
"consideration",
"purchase_intent"
]
},
"diff_diff_version": "3.1.3",
"numpy_version": "2.0.2"
}