Add practitioner-workflow performance baseline#333
Conversation
Six end-to-end scenarios covering CS + 8-step chain, survey DiD, BRFSS microdata -> CS panel, SDiD few-markets, reversible dCDH, and continuous dose-response -- anchored to applied-econ papers and industry conventions rather than the 200 x 8 cookie cutter. Each chain is timed per phase and profiled with pyinstrument under both backends; findings and recommended actions are in docs/performance-plan.md. Measurement only -- no changes under diff_diff/ or rust/. The decision doc identifies aggregate_survey per-cell scaffolding, the ImputationDiD fit loop, and the dCDH heterogeneity refit as candidates for follow-up PRs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
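The per-phase timing pattern described above can be sketched as a small harness. This is a minimal sketch, not the actual bench_shared.py code; the `run_phases` name and the JSON record layout are illustrative assumptions.

```python
import json
import time

def run_phases(phases):
    """Time each (name, fn) phase of a chain and collect JSON-ready records.

    `phases` is a list of (name, callable) pairs; each callable runs one
    step of the workflow (e.g. Bacon decomposition, fit, HonestDiD,
    reporting). A real harness would also wrap each phase in a profiler.
    """
    records = []
    for name, fn in phases:
        start = time.perf_counter()
        ok, error = True, None
        try:
            fn()
        except Exception as exc:  # a real harness would also set a failure flag
            ok, error = False, repr(exc)
        records.append({
            "phase": name,
            "seconds": round(time.perf_counter() - start, 4),
            "ok": ok,
            "error": error,
        })
    return records

# Toy two-phase chain standing in for fit -> reporting:
baseline = run_phases([
    ("fit", lambda: sum(i * i for i in range(100_000))),
    ("report", lambda: json.dumps({"att": 0.12})),
])
```

The per-phase `ok`/`error` fields are what lets an orchestrator detect partial failures instead of trusting the chain's exit status alone.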
Overall Assessment
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
Four scenarios (campaign_staggered, brand_awareness_survey, brfss_panel, geo_few_markets) now run at small/medium/large data scales rather than a single tutorial-scale point. The large scales reflect practitioner realism: 1M-row BRFSS pooled panels, 1,500-unit county-level staggered studies, 1,000-unit multi-region brand surveys, 500-unit zip-level geo-experiments. Key finding from the sweep: aggregate_survey at 1M microdata rows takes ~24 seconds (100% of BRFSS chain runtime), with 97% of that in _compute_stratified_psu_meat self-time. The tutorial-scale pass had flagged this as a 1.5s finding; at practitioner scale it is 15-20x larger and becomes the single highest-value optimization target identified. The other four findings hold across scales: CS chain scales well to 1,500 units, brand-survey chain scales sub-linearly, SDiD Rust gap is stable, ImputationDiD remains the top phase of the staggered chain at all scales. Measurement only. docs/performance-plan.md and docs/performance-scenarios.md updated with scale-sweep tables and scaling-finding narrative. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
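A multi-scale sweep like the one described above can be driven by a simple scale table. A hypothetical sketch; the harness's real config names are not shown in this thread, and the unit/replicate counts are the brand-awareness 200/500/1000-unit sweep with 40/90/160 replicate columns reported later in this conversation.

```python
# Hypothetical scale table for one scenario (brand-awareness survey);
# the sizes come from the review thread, the structure is illustrative.
SCALES = {
    "small":  {"n_units": 200,  "n_replicates": 40},
    "medium": {"n_units": 500,  "n_replicates": 90},
    "large":  {"n_units": 1000, "n_replicates": 160},
}

def sweep():
    """Yield (scale, n_units, n_replicates) in increasing order."""
    for scale in ("small", "medium", "large"):
        cfg = SCALES[scale]
        yield scale, cfg["n_units"], cfg["n_replicates"]

runs = list(sweep())
```

Keeping the sizes in one table means the scripts, the scenario doc, and the committed JSONs can all be cross-checked against a single source of truth, which is exactly the drift the review flagged.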
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Performance: No findings. The prior Scenario 2 workload-size mismatch appears resolved: the script, scenario doc, and committed JSONs now align on a three-scale 200/500/1000-unit sweep with 40/90/160 replicate columns.
Tech Debt: No findings. Security: No findings.
Addresses the four CI review findings:
- BRR -> JK1 rename. generate_survey_did_data(include_replicate_weights=True) emits JK1 delete-one-PSU weights per prep.py:1248; Scenario 2 was labeling them as BRR, which uses a different variance formula. Fixed the script, phase label, scenario-doc data-shape text, and example code snippet.
- Exit-code propagation. run_scenario now records a module-level failure flag; an atexit handler calls os._exit(1) if any phase recorded ok=False. run_all.py's subprocess return-code check now reliably surfaces phase failures. Verified with a forced-failure harness test.
- Path references. bench_shared.py and run_all.py docstrings plus performance-plan.md prose normalized to benchmarks/speed_review/baselines/.
- Contributor README. "Commit HTMLs" instruction removed; flame HTMLs are gitignored and regenerated per run.

Adds memory measurement:
- psutil background RSS sampler (10ms) in run_scenario writes a memory field to every scenario JSON: start, peak, growth-during-run. Zero timing impact (background thread, single-syscall samples).
- mem_profile_brfss.py - standalone tracemalloc allocator attribution for the BRFSS-1M scenario. Separate from the timing harness so its 2-5x overhead does not contaminate wall-clock baselines.

Memory findings extend the optimization priority list without changing the #1 recommendation. Headline insight: BRFSS aggregate_survey at 1M rows grows only 23 MB of working memory (vs 46 MB input), and tracemalloc's net-retained allocation is 0.6 MB. The 24-second cost is pure CPU - this confirms the precompute-scaffolding fix is low-risk and fits in any deployment target, including 512 MB Lambda. Secondary finding: the staggered CS chain allocates 252-322 MB at 1,500 units (peak RSS 486-589 MB). Fine for workstations, tight for Lambda-tier deployments. Flagged as a lower-priority follow-up.

Still measurement only. No changes under diff_diff/ or rust/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
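The background-sampler pattern can be sketched as follows. A minimal stdlib sketch using `resource.getrusage` (Unix-only, peak RSS); the actual harness uses psutil's per-sample RSS on a 10ms interval, and the `MemorySampler` name and summary field names here are illustrative.

```python
import resource
import threading
import time

def rss_kb():
    # Peak RSS of this process in KB (Linux reports ru_maxrss in KB).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

class MemorySampler:
    """Background thread that samples memory while a workload runs."""

    def __init__(self, interval=0.01):
        self.interval = interval
        self.start_kb = None
        self.peak_kb = None
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _loop(self):
        while not self._stop.is_set():
            self.peak_kb = max(self.peak_kb, rss_kb())
            time.sleep(self.interval)

    def __enter__(self):
        self.start_kb = self.peak_kb = rss_kb()
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        self.peak_kb = max(self.peak_kb, rss_kb())

with MemorySampler() as mem:
    blob = [bytes(1024) for _ in range(50_000)]  # ~50 MB toy workload

summary = {"start_kb": mem.start_kb, "peak_kb": mem.peak_kb,
           "growth_kb": mem.peak_kb - mem.start_kb}
```

Because the sampler only reads a counter in a daemon thread, it adds no work to the timed phases themselves, which is the property the commit message is claiming.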
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Methodology: The previous Scenario 2 BRR/JK1 finding appears resolved.
Performance: No findings. Maintainability: No findings. Tech Debt: No findings.
Addresses the second-round CI review findings:
- P1 false-pass (remaining): removed five phase-local try/except blocks that swallowed sub-step exceptions (HonestDiD M-grids in brand-awareness and BRFSS, dCDH HonestDiD and heterogeneity refit, dose-response dataframe extraction). Exceptions now escape, the phase is marked ok=false, and run_scenario's atexit handler exits nonzero. The fix caught a real API-usage bug on its first rerun: the dose_response extract phase tried to pull the event_study level on a result fit with aggregate="dose"; the event_study fit lives in a dedicated phase, so that level is removed from the extraction loop.
- P2 scenario-spec drift: BRFSS scenario text now says pweight TSL stage-2 (matching the aggregate_survey-returned design), not "Full replicate-weight path"; dCDH reversible scenario text now says heterogeneity="group" (matching the script), not "cohort".
- P3 path leakage: tracemalloc output now scrubs $HOME, repo root, and site-packages before writing the committed txt.

Drift-prevention layer:
- gen_findings_tables.py reads every JSON baseline and rewrites the numerical tables in performance-plan.md between <!-- TABLE:start <id> --> / <!-- TABLE:end <id> --> markers. Tables now re-derive from data on every rerun, eliminating the hand-edit drift the prior review flagged. Narrative prose stays hand-written by design, forcing a human re-read of findings when numbers shift.

Findings refresh (the numbers moved slightly; three narrative claims needed updating):
- "Rust marginally slower than Python on JK1 at large scale" -> removed; fresh data has Rust and Python within noise on brand awareness at large (JK1 phase 0.577s Py / 0.562s Rust, totals 1.03 / 1.04).
- "ImputationDiD consistently dominant phase at all scales" -> narrowed to "dominant under Python; tied with SunAbraham under Rust at large".
- "Nine figures of MB" in memory finding #3 was a phrasing error (literally 100+ TB); corrected to "mid-100s of MB".

Priority of optimization opportunities refreshed against the new data:
- #1 aggregate_survey precompute stratum scaffolding: High (unchanged, now strongly supported - 24.75s Python / 25.41s Rust at 1M rows, 100% of chain runtime, growth only +31 MB).
- #2 Staggered CS working-memory audit: Low with an explicit bump trigger (Rust large crosses the 512 MB Lambda line).
- #5 Rust-port JK1 replicate fit loop: demoted from Medium to Low - the "Rust regression to fix" leg of the rationale is gone because Rust is no longer slower.

Net: one clear priority (the aggregate_survey fix), four optional follow-ups. Still measurement only. No changes under diff_diff/ or rust/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
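The marker-based table regeneration can be sketched with a stdlib regex. A minimal sketch assuming the `<!-- TABLE:start <id> -->` / `<!-- TABLE:end <id> -->` markers described above; gen_findings_tables.py's real logic is not shown in this thread.

```python
import re

def replace_table(doc: str, table_id: str, new_table: str) -> str:
    """Rewrite the block between TABLE:start/end markers for one id.

    Everything outside the markers (the hand-written narrative) is
    left untouched, so only generator-owned numbers can change.
    """
    pattern = re.compile(
        rf"(<!-- TABLE:start {re.escape(table_id)} -->\n).*?"
        rf"(<!-- TABLE:end {re.escape(table_id)} -->)",
        re.DOTALL,
    )
    if not pattern.search(doc):
        raise ValueError(f"markers for table {table_id!r} not found")
    return pattern.sub(rf"\g<1>{new_table}\n\g<2>", doc)

doc = (
    "Narrative stays hand-written.\n"
    "<!-- TABLE:start hotspots -->\n"
    "| old | numbers |\n"
    "<!-- TABLE:end hotspots -->\n"
)
updated = replace_table(doc, "hotspots", "| fresh | numbers |")
```

Raising on missing markers is important: a silently skipped table would reintroduce exactly the drift the generator exists to prevent.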
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
CI review P2: performance-scenarios.md had four drift points where the documented operation chain did not match what the scripts actually time. Fixed each to be a faithful spec the reviewer can cross-check against:
- BRFSS small scale: "single year" -> "narrow analytic slice on a state-year grid" (all scales use n_years=10).
- Scenario 4 (SDiD): removed the seventh plot_synth_weights step the script never times; the chain is now 6 steps, matching the script.
- Scenario 5 (dCDH): replaced "results.print_summary()" with the actual attribute snapshot the script performs (placebo_effect, overall_att, joiners_att, leavers_att); the chain is now 4 steps.
- Scenario 6 (dose-response): the event-study step is no longer described as to_dataframe(level="event_study") on a dose-only fit (that API path raises because aggregate="dose" does not populate event_study); it is now described as a second CDiD fit with aggregate="eventstudy", matching the separate phase the script times.

The within-estimator API-spelling inconsistency that surfaced during this cleanup (ContinuousDiD uses "eventstudy" on fit(aggregate=...) but "event_study" on to_dataframe(level=...)) is captured in the correctness-adjacent observations in performance-plan.md. No changes under diff_diff/, rust/, scripts, or baselines. Docs only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
…skip CI re-review P2: remaining stale prose lines that didn't reflect the three-scale sweep and the intentional SDiD-Python-large skip. All straightforward text edits:
- The performance-scenarios.md output-path description now uses <scenario>[_<scale>]_<backend> notation and explicitly calls out that single-scale scenarios omit the scale segment.
- The performance-scenarios.md "Runs under both backends" line now acknowledges the SDiD large-scale Python skip by design.
- The performance-plan.md environment paragraph now mentions the SDiD skip alongside the three-scale sweep.
- The performance-plan.md "What this baseline does not answer" section no longer claims each scenario runs at a single data shape (no longer true); replaced with an OOM-behaviour bullet that reflects what actually is and isn't covered.
- The pointers block at the end of performance-scenarios.md is updated to the multi-scale filename pattern.

No changes under diff_diff/, rust/, scripts, or baselines. Docs only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good — no unmitigated P0/P1 findings. The prior rerun blockers remain fixed, the methodology-sensitive benchmark calls are either source-consistent or explicitly documented deviations in the registry, and the remaining issues are documentation/reporting drift at P2.
Code Quality: No findings. The prior silent-pass issue remains fixed in the changed harness/orchestrator code. benchmarks/speed_review/bench_shared.py:L111-L124, benchmarks/speed_review/bench_shared.py:L176-L248, benchmarks/speed_review/run_all.py:L66-L79
Performance: No findings. This PR adds measurement and reporting surface only; I did not find a new library-side performance regression in the changed files.
Maintainability: No findings. Tech Debt: No findings.
Security: No findings. The prior absolute-path leak remains fixed. benchmarks/speed_review/mem_profile_brfss.py:L107-L129, benchmarks/speed_review/baselines/mem_profile_brfss_large_rust.txt:L1-L29
… narrative Addresses the two remaining P2s from CI review:
- gen_findings_tables.py hard-coded scale="large" for the top-phases table, which silently dropped the geo_few_markets Python row (Python intentionally skips the large scale). The generator now iterates reversed(SCALE_ORDER) and picks the largest record actually present per (scenario, backend). The regenerated table now shows SDiD Python at medium and Rust at large side by side, which is the Python-vs-Rust comparison the table is supposed to surface.
- The brand-awareness medium-scale narrative said the multi-outcome loop and the JK1 replicate path are "comparable" at medium. The committed baselines contradict this: JK1 is 2-3x the multi-outcome loop on Python and still the top phase on Rust. Rewrote the bullet to say JK1 is the clear top phase from medium onwards and consolidates at large, matching the data.

Docs + generator only. No baseline regeneration needed (the top-phases table regeneration is cosmetic - the JSONs didn't change). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
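The pick-the-largest-scale-present fix can be sketched like this. A minimal sketch; `SCALE_ORDER` and the record shape are assumptions based on the commit message.

```python
# Assumed ordering, smallest to largest, per the commit message.
SCALE_ORDER = ["small", "medium", "large"]

def largest_present(records, scenario, backend):
    """Return the record at the largest scale actually present for one
    (scenario, backend) pair, instead of hard-coding "large"."""
    by_scale = {
        r["scale"]: r
        for r in records
        if r["scenario"] == scenario and r["backend"] == backend
    }
    for scale in reversed(SCALE_ORDER):
        if scale in by_scale:
            return by_scale[scale]
    return None

records = [
    {"scenario": "geo_few_markets", "backend": "python",
     "scale": "medium", "total_s": 0.9},
    {"scenario": "geo_few_markets", "backend": "rust",
     "scale": "large", "total_s": 0.4},
]
# Python intentionally skips large, so medium is picked for it.
py_row = largest_present(records, "geo_few_markets", "python")
```

Falling back per (scenario, backend) rather than globally is what keeps an intentionally skipped cell from dropping a whole row out of the table.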
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
…-aware heterogeneity CI re-review surfaced two P1 methodology defects where the benchmark was not actually measuring what the scenario/findings claimed:
1. Staggered CS with/without-covariates comparison. Phase 7 was configured with estimation_method="reg", n_bootstrap=199 vs phase 2's "dr" + 999. That confounded two axes at once (method + inference workload), so the Baker-mandated comparison was not clean. Phase 7 now matches phase 2 exactly except for the covariates argument. CS with estimation_method="dr" and no covariates is a supported path (verified by spot-check); the resulting 5x bootstrap-workload increase in phase 7 raises the staggered-large chain total slightly, which is already reflected in the regenerated tables.
2. dCDH heterogeneity refit without survey_design. The scenario framing and the performance-plan TSL-sharing optimization recommendation both assume the refit runs under the same survey design as the main fit. The refit was passing no survey_design, which meant the measured timing did not support the documented conclusion. The refit now uses the same SurveyDesign(weights="pw", strata="stratum", psu="psu") as the main fit. Confirmed supported (not NotImplementedError-gated on this shape). The refit is now as expensive as the main fit (was ~40% of the chain, now ~50%), and the TSL-sharing optimization recommendation is strictly stronger.

Narrative updated against the freshly regenerated tables:
- Staggered campaign: removed the "Rust at large is tied with SunAbraham" claim - ImputationDiD still leads under both backends.
- Reversible dCDH: updated the ~60/40 split claim to the new ~50/50 split and called out the TSL-sharing opportunity more directly.
- Top-hotspots table row 4 strengthened to reflect the now-equal phase costs.

All other narrative claims were cross-checked against the new data and hold (BRFSS ~24s at 1M rows, staggered single-digit scale multiplier, SDiD Rust gap stable, peak RSS under 600 MB, etc.). Still measurement only. No changes under diff_diff/ or rust/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
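Keeping phase 7 identical to phase 2 except for the covariates axis is easy to enforce with a shared kwargs dict. A hypothetical sketch; the argument names mirror those quoted in the commit message, and `phase_kwargs` is an illustrative helper, not the script's actual code.

```python
# Shared fit configuration; the estimation_method / n_bootstrap values
# are the ones quoted in the commit message.
BASE_FIT_KWARGS = dict(estimation_method="dr", n_bootstrap=999)

def phase_kwargs(covariates=None):
    """Derive a phase's fit kwargs from the shared base so the
    with/without-covariates comparison varies exactly one axis."""
    kwargs = dict(BASE_FIT_KWARGS)
    if covariates is not None:
        kwargs["covariates"] = covariates
    return kwargs

phase2 = phase_kwargs()                         # no covariates
phase7 = phase_kwargs(covariates=["x1", "x2"])  # covariates only change

# The only difference between the two configs is the covariates key.
diff_keys = {k for k in phase7 if phase2.get(k) != phase7[k]}
```

Deriving both configs from one base makes the "exactly one axis varies" property checkable in a test rather than a matter of code review.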
/ai-review
Three narrative corrections that need to match the freshly regenerated tables after the P1 methodology fixes:
- Staggered campaign narrative: under Rust at large scale, ImputationDiD still leads SunAbraham (43% vs 33%), not "tied". Removed the tied language.
- Reversible dCDH narrative: the main-fit / heterogeneity-refit split is now ~50/50 (was ~60/40 before the heterogeneity refit got a survey_design). Under Python the heterogeneity refit slightly edges out the main fit. Updated the narrative and strengthened the TSL-sharing opportunity wording.
- Top-hotspots table rows 2 and 4 updated to match.

Prior commit 6b1715a intended to include these edits, but they raced with a linter refresh and dropped silently. Caught and fixed now. Docs only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
- P2 - brand-awareness medium-scale prose: the narrative said JK1 is "2-3x the multi-outcome loop on Python at medium" and "still the top phase on Rust though by a smaller margin there." Committed baselines contradict both: on Python/medium JK1 is about 1.9x the multi-outcome loop (not 2-3x), with HonestDiD close behind; on Rust/medium the multi-outcome loop is actually the top phase, with JK1 second. Only at large does JK1 become the clearly dominant phase under both backends. Prose rewritten to match.
- P3 - mem_profile_brfss.py headline: the output labeled stats[0].size_diff (the largest single allocation site) as "net allocated (end - start)", which sounds like the total retained delta. Relabeled to "top single-site size diff" and added a "total net size diff across all sites" line alongside it. Regenerated the committed text artifact with the corrected labels.

Docs-and-script only. No baseline timing regeneration needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
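The relabeled tracemalloc headline can be sketched as follows. A minimal stdlib sketch; mem_profile_brfss.py's actual code is not shown in this thread, and the workload here is a toy stand-in.

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

leak = [list(range(1000)) for _ in range(200)]  # retained toy workload

after = tracemalloc.take_snapshot()
stats = after.compare_to(before, "lineno")

# stats[0] is only the single largest allocation site -- labeling it
# "net allocated (end - start)" would overstate what it measures.
top_site_diff = stats[0].size_diff
# The total retained delta sums size_diff across all sites.
total_net_diff = sum(s.size_diff for s in stats)

print(f"top single-site size diff:            {top_site_diff} B")
print(f"total net size diff across all sites: {total_net_diff} B")
tracemalloc.stop()
```

The distinction matters for the BRFSS finding: a small top-site diff with a small total diff is what supports the "pure CPU, not memory" conclusion.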
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: One unmitigated
CI re-review P1: `bench_brand_awareness_survey.py` declared the analytical TSL path with `SurveyDesign(weights, strata, psu, fpc, nest)` only in phase 2; phases 4 (multi-outcome), 6 (placebo), and 7 (event study + HonestDiD) built their own SurveyDesigns without `fpc`. That means a material share of the committed brand-awareness baselines timed a different variance path than the scenario doc declares. Fix:
- One analytical `sd_tsl` SurveyDesign (strata + PSU + FPC + nest=True) is now constructed once at the top of `make_phases` and reused across phases 2, 4, 6, and 7. Phase 3 (replicate weights, JK1) is a different variance surface and correctly keeps its own design.
- Regenerated baselines for both backends.
- Regenerated findings tables via gen_findings_tables.py.

Narrative refreshed against the new tables:
- Brand-awareness medium: on Python JK1 now leads by ~2.2x (was 1.9x in the previous rerun); on Rust the multi-outcome loop and JK1 come in essentially tied. Medium is also where Python is slowest relative to Rust (~1.6x) - the full analytical TSL path with FPC exposes vectorization differences at that shape. Totals re-converge at large scale.
- Reversible dCDH: ~48-52% split under both backends (previously the Python heterogeneity refit edged out the main fit slightly).
- Scaling finding #5 retuned: the Rust-only uplift is still the SDiD story; brand-awareness medium now surfaces as a secondary, modest ~1.6x case rather than "within noise".

Still measurement only. No changes under diff_diff/ or rust/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
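The construct-once-reuse fix can be illustrated with a stub. `SurveyDesign` here is a stand-in dataclass, not the library class; the field names follow the commit message, and the phase mapping is illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SurveyDesign:
    """Stand-in for the library's SurveyDesign; fields per the commit."""
    weights: str
    strata: str
    psu: str
    fpc: str
    nest: bool

def make_phases():
    # One analytical TSL design built once and shared, so every
    # analytical phase is guaranteed to time the same variance path.
    sd_tsl = SurveyDesign(weights="wt", strata="stratum", psu="psu",
                          fpc="fpc", nest=True)
    return {
        "fit_tsl": sd_tsl,        # phase 2
        "multi_outcome": sd_tsl,  # phase 4
        "placebo": sd_tsl,        # phase 6
        "event_study": sd_tsl,    # phase 7
        # Phase 3 (JK1 replicate weights) is a different variance
        # surface and keeps its own design in the real script.
    }

phases = make_phases()
shared_designs = set(map(id, phases.values()))
```

Sharing one object instead of re-declaring per phase turns "all analytical phases use the same design" from a convention into a structural guarantee.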
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good. The prior P1 blockers from the last review are resolved, and I did not find any unmitigated P0/P1 issues in the changed code. I found two low-severity P3 drifts: one benchmark phase is still slightly less faithful to the cited brand-awareness workflow than its label suggests, and parts of the new findings prose no longer exactly match the committed JSON baselines.
Code Quality: No findings. Performance: No findings. Maintainability: No findings.
Tech Debt: No new findings. The non-canonical HonestDiD behaviors above are already documented in the Methodology Registry, so they are not untracked debt for this PR. REGISTRY.md:L2453-L2455, REGISTRY.md:L632-L632
Security: No findings.
CI re-review P3:
- bench_brand_awareness_survey.py "naive" phase was using
cluster="psu", which is already a partial sampling-design correction
- the SE-inflation comparison is more faithful to Tutorial 17 when
the first phase is genuinely untreated-for-design. Removed the
cluster argument.
- performance-plan.md narrative overreaches corrected:
- Staggered campaign: at Rust medium SunAbraham is now the clearly
leading phase (~1.7x ImputationDiD there), not "slightly edges
out". Reworded to say ImputationDiD / SunAbraham are the top two
at every scale but their order is not stable across backend and
scale.
- Reversible dCDH: split is not "~evenly under both backends" -
Python is closer to 58/41 with the main fit leading, Rust is
51/49 with the heterogeneity refit leading. Reworded to reflect
the split per backend.
Regenerated the affected brand-awareness and campaign-staggered
baselines (the naive-fit change slightly reduces brand-awareness
chain totals and shifts phase-percentage shares). Tables in
performance-plan.md re-derived via gen_findings_tables.py.
Still measurement only. No changes under diff_diff/ or rust/.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good. The prior re-review methodology-fidelity issues are resolved, and I did not find any unmitigated P0/P1 issues in the changed code. The remaining issues are P3-only documentation/prose drift.
Code Quality: No findings.
Maintainability: No findings. Tech Debt: No new blocker-level debt. Security: No findings. Documentation/Tests: No findings beyond the documentation drift already called out above.
CI re-review P3 items, all documentation-only:
- Scenario 3 operation chain: said "analytical TSL via strata + PSU", but aggregate_survey()'s returned second-stage design is pweight with geographic PSU clustering and no stage-2 strata. Reworded to match the actual second-stage design surface being benchmarked.
- The ImputationDiD "consistently dominant" claim in scaling finding #2 and hotspot table row #2: at Rust medium, SunAbraham clearly leads (0.353s vs 0.214s). Both claims narrowed to "Python all scales + Rust small/large" with the Rust-medium SunAbraham exception called out explicitly; the "together ~70-80% of the chain" framing preserves the optimization recommendation.
- The SDiD narrative said sensitivity_to_zeta_omega and in_time_placebo are the two largest at every scale/backend, but at Rust small, bootstrap_variance slightly edges both (at sub-50ms totals, per-phase fixed overhead dominates the ranking). Qualified to Python all scales + Rust medium/large.

Docs only. No script or baseline changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: One unmitigated P1 remains: scenario 6's implemented treatment timing does not match its documented methodology, and the binarized comparison phase is keyed to a different treatment start than the ContinuousDiD fits.
Code Quality: No findings. Performance: No findings. Maintainability: No findings. Tech Debt: No findings.
Security: No findings. Documentation/Tests: No additional findings beyond the scenario-6 methodology mismatch. The earlier documentation-drift issues from the previous re-review appear fixed.
…race shifts CI re-review P1: bench_dose_response.py inherited the CDiD generator's default cohort [2], not the documented period 3. The fallback that would have set first_treat=3 never ran (the generator already populates first_treat), so the committed baselines measured a different cohort onset than the scenario doc. The binarized DiD phase also hardcoded post >= 3, which further desynced it from the actual CDiD treatment start under the default DGP. Fix:
- Pin the generator to cohort_periods=[3] so the DGP matches the docs.
- Assert exactly one positive first_treat after generation; future DGP changes that break the single-cohort contract will fail loudly instead of drifting silently.
- The binarized phase now derives its post cutoff from the actual first_treat in the data, not a hardcoded period number. No opportunity to desync from the CDiD fits above.
- Regenerated dose-response baselines for both backends.

Structural narrative hardening: prior CI rounds have repeatedly re-flagged the same drift pattern: the staggered campaign and reversible dCDH narratives make phase-order claims at close-race cells (staggered Rust medium, dCDH at this shape) that can flip on rerun because the two contenders are within a few percentage points of each other. The underlying ranking is not the right level of abstraction for narrative; the phase-share table is. This commit rewrites both narratives to describe the aggregate share pattern and defer per-cell ordering to the generator-produced table. Scaling finding #2 and hotspot table row #2 get the same treatment. Net effect: narrative claims are now robust to rerun noise at close-race cells.

Still measurement only. No changes under diff_diff/ or rust/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
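The derive-don't-hardcode pattern from this fix can be sketched as follows. A minimal sketch; the column names follow the commit message, and the data-frame machinery is simplified to plain dicts.

```python
def single_cohort_start(first_treat_values):
    """Assert the single-cohort contract and return the shared
    treatment start, instead of hardcoding a period number."""
    starts = {t for t in first_treat_values if t > 0}
    if len(starts) != 1:
        raise ValueError(
            f"expected exactly one positive first_treat, got {starts}")
    return starts.pop()

# Toy panel: never-treated units carry first_treat = 0.
rows = [
    {"unit": 1, "period": 2, "first_treat": 3},
    {"unit": 1, "period": 4, "first_treat": 3},
    {"unit": 2, "period": 2, "first_treat": 0},
    {"unit": 2, "period": 4, "first_treat": 0},
]
start = single_cohort_start(r["first_treat"] for r in rows)

# The binarized phase derives its post indicator from the actual onset,
# so it cannot desync from the generator's cohort choice.
for r in rows:
    r["post"] = int(r["period"] >= start)
```

The loud assertion is the important half: a silently wrong cutoff produced plausible-looking baselines for several review rounds before being caught.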
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good. No unmitigated P0/P1 findings remain in this re-review.
…ht variance"
CI re-review P2 + P3, both docs/label only:
- docs/performance-plan.md had two remaining specific-magnitude
claims about brand-awareness medium ("~1.6x under Python",
"Python and Rust separate the most at medium", "~1.6x at worst on
brand-awareness medium"). Those were true on one earlier rerun but
the current committed baselines show medium at 0.56 / 0.55
(essentially tied) and the widest non-SDiD gap is now ~1.1x at
brand-large. Reworded per-scenario paragraph and scaling finding #5
to describe the stable aggregate pattern and defer exact ratios to
the scale-sweep table. Same treatment as the earlier staggered/dCDH
pass: narrative stops claiming magnitudes that can shift on rerun;
the generator-owned table carries the specifics.
- bench_brand_awareness_survey.py module docstring labeled JK1 as
"replicate-weight bootstrap". Per REGISTRY.md, JK1 is replicate-
weight variance (jackknife-style), not bootstrap inference - they
are distinct methodology surfaces. Renamed to "replicate-weight
variance (JK1 delete-one-PSU)" with an inline note pointing to the
registry.
Docs + docstring only. No script behaviour change; no baseline
regeneration needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good. No unmitigated P0/P1 findings remain in this re-review.
…-failures audit Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per project SemVer convention, minor bumps are reserved for new estimators or new module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults (PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner-ready output layer. Plain-English narrative summaries across all 16 result types, with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A contract), heterogeneity + within-group-varying PSU under Binder TSL, and PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review - docs/performance-scenarios.md documents 5-7 realistic practitioner workflows; the benchmark harness is extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339) continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J. CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default CI after false-positive flakes; the perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml, diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released: 2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the comparison-link footer. CITATION.cff retains the v3.1.3 versioned DOI in identifiers; the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Six end-to-end scenario specs (docs/performance-scenarios.md) anchored to applied-econ papers, industry writeups, and the bundled tutorials — covering CS + 8-step chain, survey 2x2, BRFSS microdata → CS panel, SDiD few-markets, reversible dCDH, and continuous dose-response.
- Extends the speed_review/bench_callaway.py pattern into six scenario scripts that time each phase of the full chain (Bacon → fit → HonestDiD → robustness refits → reporting), profile it with pyinstrument, and write a JSON baseline + flame HTML per (scenario, backend).
- docs/performance-plan.md with per-scenario top-5 hot phases, Python-vs-Rust gap, and a recommended action category (Rust / algorithmic / cache / leave alone).

Findings at a glance
aggregate_survey (93%) is the headline hotspot. Three concrete follow-up candidates surfaced (see findings doc for details):
The _compute_stratified_psu_meat per-cell loop in aggregate_survey, ImputationDiD's unexpected 4x slowdown vs CS with n_bootstrap=999, and the shared-precomputation opportunity between the dCDH main fit and heterogeneity refit. No source changes are in this PR — optimizations become separate PRs citing specific findings.

Scope discipline
- No changes under diff_diff/ or rust/.
- benchmarks/run_benchmarks.py (R-parity accuracy) is untouched; these scenarios complement it rather than replace it.
- Profiles land under baselines/profiles/; only the JSONs are committed.

Test plan
- Ran python benchmarks/speed_review/run_all.py; baselines committed under benchmarks/speed_review/baselines/; HTML profiles regenerate on re-run.
- Nothing under diff_diff/ or rust/ modified — git diff --name-only confirms docs + benchmarks only.
- benchmarks/speed_review/run_all.py --backend python and --backend rust both succeed as single-backend commands.