92 changes: 92 additions & 0 deletions benchmarks/speed_review/README.md
@@ -0,0 +1,92 @@
# Speed Review - Practitioner Workflow Benchmarks

Scenario-driven performance measurement for end-to-end practitioner chains,
as distinct from `benchmarks/run_benchmarks.py`, which measures R-parity on
isolated `fit()` calls.

## Why these exist

See [`docs/performance-scenarios.md`](../../docs/performance-scenarios.md) for
the full methodology. Short version: the existing benchmarks measure
`fit()` in isolation on 200 x 8 synthetic panels, which does not reflect what
a practitioner running the 8-step Baker et al. (2025) workflow on a real
BRFSS or geo-experiment panel actually sees. These scripts measure the full
chain (Bacon -> fit -> HonestDiD -> cross-estimator robustness -> reporting)
at data shapes anchored to applied-econ conventions.

## Layout

```
benchmarks/speed_review/
├── README.md # this file
├── bench_shared.py # timing + pyinstrument + RSS harness
├── run_all.py # orchestrator (both backends)
├── bench_campaign_staggered.py # Scenario 1: CS + 8-step chain
├── bench_brand_awareness_survey.py # Scenario 2: DiD + SurveyDesign
├── bench_brfss_panel.py # Scenario 3: aggregate_survey -> CS
├── bench_geo_few_markets.py # Scenario 4: SDiD + jackknife
├── bench_reversible_dcdh.py # Scenario 5: dCDH L_max + TSL
├── bench_dose_response.py # Scenario 6: ContinuousDiD splines
├── mem_profile_brfss.py # tracemalloc allocator attribution
│ # for BRFSS-1M (standalone)
├── bench_callaway.py # pre-existing CS scaling sweep
├── baseline_results.json # pre-existing CS baseline
└── baselines/ # this effort's output
├── <scenario>_<backend>.json # phase-level wall-clock + peak RSS
├── mem_profile_brfss_large_<backend>.txt # tracemalloc top-N sites
└── profiles/ # flame HTMLs (gitignored)
└── <scenario>_<backend>.html # pyinstrument flame output
```

Each JSON baseline records both timing (per-phase wall-clock) and memory
(start/peak/growth from a psutil background sampler polling every 10 ms).
The `mem_profile_brfss.py` script does a separate tracemalloc pass on the
BRFSS-1M scenario; it is kept out of the main timing harness because
tracemalloc adds 2-5x overhead and would contaminate the wall-clock baselines.
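
The sampler side of those JSON records can be sketched roughly as follows.
`sample_rss` is an illustrative name, not the actual `bench_shared.py` API;
psutil is assumed to be installed, and the sketch degrades gracefully if not:

```python
# Sketch of a background RSS sampler like the one described above.
# Names here are illustrative, not the real harness API.
import threading
import time

try:
    import psutil
    HAVE_PSUTIL = True
except ImportError:
    HAVE_PSUTIL = False


def sample_rss(interval_s=0.01):
    """Poll this process's RSS every `interval_s` seconds on a daemon thread.

    Returns (stop_fn, samples); `samples` accumulates RSS readings in MB,
    from which start/peak/growth can be derived after the run.
    """
    proc = psutil.Process()
    samples = []
    stop = threading.Event()

    def loop():
        while not stop.is_set():
            samples.append(proc.memory_info().rss / 1e6)
            time.sleep(interval_s)

    t = threading.Thread(target=loop, daemon=True)
    t.start()

    def stop_fn():
        stop.set()
        t.join()

    return stop_fn, samples
```

A 10 ms interval is coarse enough to stay out of the hot path while still
catching transient allocation peaks between phases.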

**Note on profile HTMLs.** pyinstrument flames are ~500KB-1.2MB each and are
regenerated on every run; they live under `baselines/profiles/` which is
gitignored. The key hotspots identified from them are already captured in
the findings doc (top-5 hot phases per scenario); run a scenario locally
to regenerate the full flame when needed.
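
Regenerating a flame locally uses pyinstrument's standard `Profiler` API.
This is a standalone sketch, not the harness's actual wiring; the workload
below is a placeholder for a real scenario phase:

```python
# Illustrative sketch: profile a stand-in workload and render the flame HTML.
try:
    from pyinstrument import Profiler
except ImportError:  # pyinstrument is a one-time pip install (see Running)
    Profiler = None

if Profiler is not None:
    profiler = Profiler(interval=0.001)  # 1 ms sampling interval
    profiler.start()
    total = sum(i * i for i in range(200_000))  # placeholder workload
    profiler.stop()
    # The harness would write this to baselines/profiles/<scenario>_<backend>.html
    html = profiler.output_html()
else:
    html = ""
```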

## Running

```bash
# One-time install
pip install pyinstrument

# All scenarios, both backends, all scales
python benchmarks/speed_review/run_all.py

# One scenario, one backend (the script runs its full scale sweep internally)
DIFF_DIFF_BACKEND=rust python benchmarks/speed_review/bench_campaign_staggered.py

# Subset
python benchmarks/speed_review/run_all.py --scenarios brfss_panel geo_few_markets
```

Multi-scale scenarios write per-scale outputs
(e.g. `campaign_staggered_small_rust.json`, `..._medium_rust.json`,
`..._large_rust.json`). Single-scale scenarios write the scale-free form
(e.g. `dose_response_rust.json`). Full runtime for all scales × both
backends is ~90 seconds on Apple Silicon M4.
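
The naming rule can be expressed as a small helper; `baseline_path` is
hypothetical and only mirrors the convention described above:

```python
def baseline_path(scenario, backend, scale=None):
    """Build a baseline filename: multi-scale scenarios embed the scale,
    single-scale scenarios use the scale-free form."""
    stem = f"{scenario}_{scale}_{backend}" if scale else f"{scenario}_{backend}"
    return f"benchmarks/speed_review/baselines/{stem}.json"
```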

## Where to look for findings

[`docs/performance-plan.md`](../../docs/performance-plan.md) - "Practitioner
Workflow Baseline (v3.1.3)" section holds per-scenario hot-phase rankings
and action recommendations. The scenarios here are the measurement surface;
the findings doc is the decision output.

## Adding a scenario

1. Add the scenario definition to `docs/performance-scenarios.md`
(persona, data shape, operation chain, source anchor).
2. Add `bench_<name>.py` following the existing scripts: build data, define
`phases` as a list of `(label, callable)` tuples, call `run_scenario`.
3. Register it in `run_all.py`'s `SCRIPTS` dict.
4. Run under both backends and commit the refreshed `baselines/*.json`.
   The `baselines/profiles/*.html` flame HTMLs are gitignored and
   regenerated per run; do not commit them.
5. Add a per-scenario finding paragraph to `docs/performance-plan.md`.
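
A minimal skeleton for step 2 might look like the following. `build_data`
and the phase bodies are placeholders, and the local timing loop stands in
for `bench_shared.run_scenario`, whose real signature may differ:

```python
# Hypothetical skeleton for a new bench_<name>.py following the
# (label, callable) phase convention described above.
import time


def build_data():
    # Stand-in for real panel construction.
    return list(range(1_000))


def main():
    data = build_data()

    # Phases as (label, callable) tuples, mirroring the existing scripts.
    phases = [
        ("1_fit", lambda: sum(data)),
        ("2_robustness", lambda: max(data)),
    ]

    # Minimal local timing loop standing in for bench_shared.run_scenario.
    results = {}
    for label, fn in phases:
        t0 = time.perf_counter()
        fn()
        results[label] = {"seconds": time.perf_counter() - t0, "ok": True}
    return results


results = main()
```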
1 change: 1 addition & 0 deletions benchmarks/speed_review/baselines/.gitignore
@@ -0,0 +1 @@
profiles/
@@ -0,0 +1,66 @@
{
"scenario": "brand_awareness_survey_large",
"backend": "python",
"has_rust_backend": false,
"total_seconds": 1.0910496250000001,
"memory": {
"available": true,
"start_mb": 188.45,
"peak_mb": 327.44,
"growth_mb": 138.98,
"sampler_interval_s": 0.01
},
"phases": {
"1_naive_fit_no_survey_design": {
"seconds": 0.009826500000000182,
"ok": true,
"error": null
},
"2_tsl_strata_psu_fpc": {
"seconds": 0.030280333999999964,
"ok": true,
"error": null
},
"3_replicate_weights_jk1": {
"seconds": 0.6243122919999999,
"ok": true,
"error": null
},
"4_multi_outcome_loop_3_metrics": {
"seconds": 0.24174716599999968,
"ok": true,
"error": null
},
"5_check_parallel_trends": {
"seconds": 0.025623749999999834,
"ok": true,
"error": null
},
"6_placebo_refit_pre_period": {
"seconds": 0.01191299999999984,
"ok": true,
"error": null
},
"7_event_study_plus_honest_did": {
"seconds": 0.147335875,
"ok": true,
"error": null
}
},
"metadata": {
"scale": "large",
"n_units": 1000,
"n_periods": 12,
"n_obs": 12000,
"n_strata": 20,
"n_psu_per_stratum": 8,
"n_replicate_weights": 160,
"outcomes": [
"outcome",
"consideration",
"purchase_intent"
]
},
"diff_diff_version": "3.1.3",
"numpy_version": "2.0.2"
}
@@ -0,0 +1,66 @@
{
"scenario": "brand_awareness_survey_large",
"backend": "rust",
"has_rust_backend": true,
"total_seconds": 1.0000031249999999,
"memory": {
"available": true,
"start_mb": 194.03,
"peak_mb": 336.08,
"growth_mb": 142.05,
"sampler_interval_s": 0.01
},
"phases": {
"1_naive_fit_no_survey_design": {
"seconds": 0.013511041000000112,
"ok": true,
"error": null
},
"2_tsl_strata_psu_fpc": {
"seconds": 0.03037650000000003,
"ok": true,
"error": null
},
"3_replicate_weights_jk1": {
"seconds": 0.5431151669999998,
"ok": true,
"error": null
},
"4_multi_outcome_loop_3_metrics": {
"seconds": 0.21752962499999962,
"ok": true,
"error": null
},
"5_check_parallel_trends": {
"seconds": 0.04399687500000038,
"ok": true,
"error": null
},
"6_placebo_refit_pre_period": {
"seconds": 0.016433082999999904,
"ok": true,
"error": null
},
"7_event_study_plus_honest_did": {
"seconds": 0.13501837500000002,
"ok": true,
"error": null
}
},
"metadata": {
"scale": "large",
"n_units": 1000,
"n_periods": 12,
"n_obs": 12000,
"n_strata": 20,
"n_psu_per_stratum": 8,
"n_replicate_weights": 160,
"outcomes": [
"outcome",
"consideration",
"purchase_intent"
]
},
"diff_diff_version": "3.1.3",
"numpy_version": "2.0.2"
}
@@ -0,0 +1,66 @@
{
"scenario": "brand_awareness_survey_medium",
"backend": "python",
"has_rust_backend": false,
"total_seconds": 0.563283334,
"memory": {
"available": true,
"start_mb": 133.69,
"peak_mb": 187.7,
"growth_mb": 54.02,
"sampler_interval_s": 0.01
},
"phases": {
"1_naive_fit_no_survey_design": {
"seconds": 0.010921792000000097,
"ok": true,
"error": null
},
"2_tsl_strata_psu_fpc": {
"seconds": 0.03732066599999995,
"ok": true,
"error": null
},
"3_replicate_weights_jk1": {
"seconds": 0.20805304199999997,
"ok": true,
"error": null
},
"4_multi_outcome_loop_3_metrics": {
"seconds": 0.12622899999999992,
"ok": true,
"error": null
},
"5_check_parallel_trends": {
"seconds": 0.01834783299999998,
"ok": true,
"error": null
},
"6_placebo_refit_pre_period": {
"seconds": 0.054030583000000076,
"ok": true,
"error": null
},
"7_event_study_plus_honest_did": {
"seconds": 0.10836029199999997,
"ok": true,
"error": null
}
},
"metadata": {
"scale": "medium",
"n_units": 500,
"n_periods": 12,
"n_obs": 6000,
"n_strata": 15,
"n_psu_per_stratum": 6,
"n_replicate_weights": 90,
"outcomes": [
"outcome",
"consideration",
"purchase_intent"
]
},
"diff_diff_version": "3.1.3",
"numpy_version": "2.0.2"
}
@@ -0,0 +1,66 @@
{
"scenario": "brand_awareness_survey_medium",
"backend": "rust",
"has_rust_backend": true,
"total_seconds": 0.5500554579999999,
"memory": {
"available": true,
"start_mb": 135.36,
"peak_mb": 184.86,
"growth_mb": 49.5,
"sampler_interval_s": 0.01
},
"phases": {
"1_naive_fit_no_survey_design": {
"seconds": 0.011186999999999947,
"ok": true,
"error": null
},
"2_tsl_strata_psu_fpc": {
"seconds": 0.03363270800000007,
"ok": true,
"error": null
},
"3_replicate_weights_jk1": {
"seconds": 0.18678066699999996,
"ok": true,
"error": null
},
"4_multi_outcome_loop_3_metrics": {
"seconds": 0.16038787500000007,
"ok": true,
"error": null
},
"5_check_parallel_trends": {
"seconds": 0.022171542000000155,
"ok": true,
"error": null
},
"6_placebo_refit_pre_period": {
"seconds": 0.0532650830000001,
"ok": true,
"error": null
},
"7_event_study_plus_honest_did": {
"seconds": 0.08262075000000002,
"ok": true,
"error": null
}
},
"metadata": {
"scale": "medium",
"n_units": 500,
"n_periods": 12,
"n_obs": 6000,
"n_strata": 15,
"n_psu_per_stratum": 6,
"n_replicate_weights": 90,
"outcomes": [
"outcome",
"consideration",
"purchase_intent"
]
},
"diff_diff_version": "3.1.3",
"numpy_version": "2.0.2"
}