Add practitioner-workflow performance baseline#333
Conversation
Six end-to-end scenarios covering CS + 8-step chain, survey DiD, BRFSS microdata -> CS panel, SDiD few-markets, reversible dCDH, and continuous dose-response -- anchored to applied-econ papers and industry conventions rather than the 200 x 8 cookie cutter. Each chain is timed per phase and profiled with pyinstrument under both backends; findings and recommended actions are in docs/performance-plan.md. Measurement only -- no changes under diff_diff/ or rust/. The decision doc identifies aggregate_survey per-cell scaffolding, the ImputationDiD fit loop, and the dCDH heterogeneity refit as candidates for follow-up PRs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
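The per-phase timing pattern described above can be sketched as a small harness. This is a minimal sketch, not the actual bench_shared.py code; the `run_phases` name and the JSON record layout are illustrative assumptions.

```python
import json
import time

def run_phases(phases):
    """Time each (name, fn) phase of a chain and collect JSON-ready records.

    `phases` is a list of (name, callable) pairs; each callable runs one
    step of the workflow (e.g. Bacon decomposition, fit, HonestDiD,
    reporting). A real harness would also wrap each phase in a profiler.
    """
    records = []
    for name, fn in phases:
        start = time.perf_counter()
        ok, error = True, None
        try:
            fn()
        except Exception as exc:  # a real harness would also set a failure flag
            ok, error = False, repr(exc)
        records.append({
            "phase": name,
            "seconds": round(time.perf_counter() - start, 4),
            "ok": ok,
            "error": error,
        })
    return records

# Toy two-phase chain standing in for fit -> reporting:
baseline = run_phases([
    ("fit", lambda: sum(i * i for i in range(100_000))),
    ("report", lambda: json.dumps({"att": 0.12})),
])
```

The per-phase `ok`/`error` fields are what lets an orchestrator detect partial failures instead of trusting the chain's exit status alone.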
Overall Assessment
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
Four scenarios (campaign_staggered, brand_awareness_survey, brfss_panel, geo_few_markets) now run at small/medium/large data scales rather than a single tutorial-scale point. The large scales reflect practitioner realism: 1M-row BRFSS pooled panels, 1,500-unit county-level staggered studies, 1,000-unit multi-region brand surveys, 500-unit zip-level geo-experiments. Key finding from the sweep: aggregate_survey at 1M microdata rows takes ~24 seconds (100% of BRFSS chain runtime), with 97% of that in _compute_stratified_psu_meat self-time. The tutorial-scale pass had flagged this as a 1.5s finding; at practitioner scale it is 15-20x larger and becomes the single highest-value optimization target identified. The other four findings hold across scales: CS chain scales well to 1,500 units, brand-survey chain scales sub-linearly, SDiD Rust gap is stable, ImputationDiD remains the top phase of the staggered chain at all scales. Measurement only. docs/performance-plan.md and docs/performance-scenarios.md updated with scale-sweep tables and scaling-finding narrative. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
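A multi-scale sweep like the one described above can be driven by a simple scale table. A hypothetical sketch; the harness's real config names are not shown in this thread, and the unit/replicate counts are the brand-awareness 200/500/1000-unit sweep with 40/90/160 replicate columns reported later in this conversation.

```python
# Hypothetical scale table for one scenario (brand-awareness survey);
# the sizes come from the review thread, the structure is illustrative.
SCALES = {
    "small":  {"n_units": 200,  "n_replicates": 40},
    "medium": {"n_units": 500,  "n_replicates": 90},
    "large":  {"n_units": 1000, "n_replicates": 160},
}

def sweep():
    """Yield (scale, n_units, n_replicates) in increasing order."""
    for scale in ("small", "medium", "large"):
        cfg = SCALES[scale]
        yield scale, cfg["n_units"], cfg["n_replicates"]

runs = list(sweep())
```

Keeping the sizes in one table means the scripts, the scenario doc, and the committed JSONs can all be cross-checked against a single source of truth, which is exactly the drift the review flagged.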
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Performance: No findings. The prior Scenario 2 workload-size mismatch appears resolved: the script, scenario doc, and committed JSONs now align on a three-scale 200/500/1000-unit sweep with 40/90/160 replicate columns.
Tech Debt: No findings. Security: No findings.
Addresses the four CI review findings:
- BRR -> JK1 rename. generate_survey_did_data(include_replicate_weights=True) emits JK1 delete-one-PSU weights per prep.py:1248; Scenario 2 was labeling them as BRR, which uses a different variance formula. Fixed the script, phase label, scenario-doc data-shape text, and example code snippet.
- Exit-code propagation. run_scenario now records a module-level failure flag; an atexit handler calls os._exit(1) if any phase recorded ok=False. run_all.py's subprocess return-code check now reliably surfaces phase failures. Verified with a forced-failure harness test.
- Path references. bench_shared.py and run_all.py docstrings plus performance-plan.md prose normalized to benchmarks/speed_review/baselines/.
- Contributor README. "Commit HTMLs" instruction removed; flame HTMLs are gitignored and regenerated per run.

Adds memory measurement:
- psutil background RSS sampler (10ms) in run_scenario writes a memory field to every scenario JSON: start, peak, growth-during-run. Zero timing impact (background thread, single-syscall samples).
- mem_profile_brfss.py - standalone tracemalloc allocator attribution for the BRFSS-1M scenario. Separate from the timing harness so its 2-5x overhead does not contaminate wall-clock baselines.

Memory findings extend the optimization priority list without changing the #1 recommendation. Headline insight: BRFSS aggregate_survey at 1M rows grows only 23 MB of working memory (vs 46 MB input), and tracemalloc's net-retained allocation is 0.6 MB. The 24-second cost is pure CPU - this confirms the precompute-scaffolding fix is low-risk and fits in any deployment target, including 512 MB Lambda. Secondary finding: the staggered CS chain allocates 252-322 MB at 1,500 units (peak RSS 486-589 MB). Fine for workstations, tight for Lambda-tier deployments. Flagged as a lower-priority follow-up.

Still measurement only. No changes under diff_diff/ or rust/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
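The background-sampler pattern can be sketched as follows. A minimal stdlib sketch using `resource.getrusage` (Unix-only, peak RSS); the actual harness uses psutil's per-sample RSS on a 10ms interval, and the `MemorySampler` name and summary field names here are illustrative.

```python
import resource
import threading
import time

def rss_kb():
    # Peak RSS of this process in KB (Linux reports ru_maxrss in KB).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

class MemorySampler:
    """Background thread that samples memory while a workload runs."""

    def __init__(self, interval=0.01):
        self.interval = interval
        self.start_kb = None
        self.peak_kb = None
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _loop(self):
        while not self._stop.is_set():
            self.peak_kb = max(self.peak_kb, rss_kb())
            time.sleep(self.interval)

    def __enter__(self):
        self.start_kb = self.peak_kb = rss_kb()
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        self.peak_kb = max(self.peak_kb, rss_kb())

with MemorySampler() as mem:
    blob = [bytes(1024) for _ in range(50_000)]  # ~50 MB toy workload

summary = {"start_kb": mem.start_kb, "peak_kb": mem.peak_kb,
           "growth_kb": mem.peak_kb - mem.start_kb}
```

Because the sampler only reads a counter in a daemon thread, it adds no work to the timed phases themselves, which is the property the commit message is claiming.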
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Methodology: The previous Scenario 2 BRR/JK1 finding appears resolved.
Performance: No findings. Maintainability: No findings. Tech Debt: No findings.
Addresses the second-round CI review findings:
- P1 false-pass (remaining): removed five phase-local try/except blocks that swallowed sub-step exceptions (HonestDiD M-grids in brand-awareness and BRFSS, dCDH HonestDiD and heterogeneity refit, dose-response dataframe extraction). Exceptions now escape, the phase is marked ok=false, and run_scenario's atexit handler exits nonzero. The fix caught a real API-usage bug on its first rerun: the dose_response extract phase tried to pull the event_study level on a result fit with aggregate="dose"; the event_study fit lives in a dedicated phase, so that level is removed from the extraction loop.
- P2 scenario-spec drift: BRFSS scenario text now says pweight TSL stage-2 (matching the aggregate_survey-returned design), not "Full replicate-weight path"; dCDH reversible scenario text now says heterogeneity="group" (matching the script), not "cohort".
- P3 path leakage: tracemalloc output now scrubs $HOME, repo root, and site-packages before writing the committed txt.

Drift-prevention layer:
- gen_findings_tables.py reads every JSON baseline and rewrites the numerical tables in performance-plan.md between <!-- TABLE:start <id> --> / <!-- TABLE:end <id> --> markers. Tables now re-derive from data on every rerun, eliminating the hand-edit drift the prior review flagged. Narrative prose stays hand-written by design, forcing a human re-read of findings when numbers shift.

Findings refresh (the numbers moved slightly; three narrative claims needed updating):
- "Rust marginally slower than Python on JK1 at large scale" -> removed; fresh data has Rust and Python within noise on brand awareness at large (JK1 phase 0.577s Py / 0.562s Rust, totals 1.03 / 1.04).
- "ImputationDiD consistently dominant phase at all scales" -> narrowed to "dominant under Python; tied with SunAbraham under Rust at large".
- "Nine figures of MB" in memory finding #3 was a phrasing error (literally 100+ TB); corrected to "mid-100s of MB".

Priority of optimization opportunities refreshed against the new data:
- #1 aggregate_survey precompute stratum scaffolding: High (unchanged, now strongly supported - 24.75s Python / 25.41s Rust at 1M rows, 100% of chain runtime, growth only +31 MB).
- #2 Staggered CS working-memory audit: Low with an explicit bump trigger (Rust large crosses the 512 MB Lambda line).
- #5 Rust-port JK1 replicate fit loop: demoted from Medium to Low - the "Rust regression to fix" leg of the rationale is gone because Rust is no longer slower.

Net: one clear priority (the aggregate_survey fix), four optional follow-ups. Still measurement only. No changes under diff_diff/ or rust/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
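The marker-based table regeneration can be sketched with a stdlib regex. A minimal sketch assuming the `<!-- TABLE:start <id> -->` / `<!-- TABLE:end <id> -->` markers described above; gen_findings_tables.py's real logic is not shown in this thread.

```python
import re

def replace_table(doc: str, table_id: str, new_table: str) -> str:
    """Rewrite the block between TABLE:start/end markers for one id.

    Everything outside the markers (the hand-written narrative) is
    left untouched, so only generator-owned numbers can change.
    """
    pattern = re.compile(
        rf"(<!-- TABLE:start {re.escape(table_id)} -->\n).*?"
        rf"(<!-- TABLE:end {re.escape(table_id)} -->)",
        re.DOTALL,
    )
    if not pattern.search(doc):
        raise ValueError(f"markers for table {table_id!r} not found")
    return pattern.sub(rf"\g<1>{new_table}\n\g<2>", doc)

doc = (
    "Narrative stays hand-written.\n"
    "<!-- TABLE:start hotspots -->\n"
    "| old | numbers |\n"
    "<!-- TABLE:end hotspots -->\n"
)
updated = replace_table(doc, "hotspots", "| fresh | numbers |")
```

Raising on missing markers is important: a silently skipped table would reintroduce exactly the drift the generator exists to prevent.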
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
CI review P2: performance-scenarios.md had four drift points where the documented operation chain did not match what the scripts actually time. Fixed each to be a faithful spec the reviewer can cross-check against:
- BRFSS small scale: "single year" -> "narrow analytic slice on a state-year grid" (all scales use n_years=10).
- Scenario 4 (SDiD): removed the seventh plot_synth_weights step the script never times; the chain is now 6 steps, matching the script.
- Scenario 5 (dCDH): replaced "results.print_summary()" with the actual attribute snapshot the script performs (placebo_effect, overall_att, joiners_att, leavers_att); the chain is now 4 steps.
- Scenario 6 (dose-response): the event-study step is no longer described as to_dataframe(level="event_study") on a dose-only fit (that API path raises because aggregate="dose" does not populate event_study); it is now described as a second CDiD fit with aggregate="eventstudy", matching the separate phase the script times.

The within-estimator API-spelling inconsistency that surfaced during this cleanup (ContinuousDiD uses "eventstudy" on fit(aggregate=...) but "event_study" on to_dataframe(level=...)) is captured in the correctness-adjacent observations in performance-plan.md. No changes under diff_diff/, rust/, scripts, or baselines. Docs only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
…skip CI re-review P2: remaining stale prose lines that didn't reflect the three-scale sweep and the intentional SDiD-Python-large skip. All straightforward text edits:
- The performance-scenarios.md output-path description now uses <scenario>[_<scale>]_<backend> notation and explicitly calls out that single-scale scenarios omit the scale segment.
- The performance-scenarios.md "Runs under both backends" line now acknowledges the SDiD large-scale Python skip by design.
- The performance-plan.md environment paragraph now mentions the SDiD skip alongside the three-scale sweep.
- The performance-plan.md "What this baseline does not answer" section no longer claims each scenario runs at a single data shape (no longer true); replaced with an OOM-behaviour bullet that reflects what actually is and isn't covered.
- The pointers block at the end of performance-scenarios.md is updated to the multi-scale filename pattern.

No changes under diff_diff/, rust/, scripts, or baselines. Docs only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good — no unmitigated P0/P1 findings. The prior rerun blockers remain fixed, the methodology-sensitive benchmark calls are either source-consistent or explicitly documented deviations in the registry, and the remaining issues are documentation/reporting drift at P2.
Code Quality: No findings. The prior silent-pass issue remains fixed in the changed harness/orchestrator code. benchmarks/speed_review/bench_shared.py:L111-L124, benchmarks/speed_review/bench_shared.py:L176-L248, benchmarks/speed_review/run_all.py:L66-L79
Performance: No findings. This PR adds measurement and reporting surface only; I did not find a new library-side performance regression in the changed files.
Maintainability: No findings. Tech Debt: No findings.
Security: No findings. The prior absolute-path leak remains fixed. benchmarks/speed_review/mem_profile_brfss.py:L107-L129, benchmarks/speed_review/baselines/mem_profile_brfss_large_rust.txt:L1-L29
… narrative Addresses the two remaining P2s from CI review:
- gen_findings_tables.py hard-coded scale="large" for the top-phases table, which silently dropped the geo_few_markets Python row (Python intentionally skips the large scale). The generator now iterates reversed(SCALE_ORDER) and picks the largest record actually present per (scenario, backend). The regenerated table now shows SDiD Python at medium and Rust at large side by side, which is the Python-vs-Rust comparison the table is supposed to surface.
- The brand-awareness medium-scale narrative said the multi-outcome loop and the JK1 replicate path are "comparable" at medium. The committed baselines contradict this: JK1 is 2-3x the multi-outcome loop on Python and still the top phase on Rust. Rewrote the bullet to say JK1 is the clear top phase from medium onwards and consolidates at large, matching the data.

Docs + generator only. No baseline regeneration needed (the top-phases table regeneration is cosmetic - the JSONs didn't change). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
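The pick-the-largest-scale-present fix can be sketched like this. A minimal sketch; `SCALE_ORDER` and the record shape are assumptions based on the commit message.

```python
# Assumed ordering, smallest to largest, per the commit message.
SCALE_ORDER = ["small", "medium", "large"]

def largest_present(records, scenario, backend):
    """Return the record at the largest scale actually present for one
    (scenario, backend) pair, instead of hard-coding "large"."""
    by_scale = {
        r["scale"]: r
        for r in records
        if r["scenario"] == scenario and r["backend"] == backend
    }
    for scale in reversed(SCALE_ORDER):
        if scale in by_scale:
            return by_scale[scale]
    return None

records = [
    {"scenario": "geo_few_markets", "backend": "python",
     "scale": "medium", "total_s": 0.9},
    {"scenario": "geo_few_markets", "backend": "rust",
     "scale": "large", "total_s": 0.4},
]
# Python intentionally skips large, so medium is picked for it.
py_row = largest_present(records, "geo_few_markets", "python")
```

Falling back per (scenario, backend) rather than globally is what keeps an intentionally skipped cell from dropping a whole row out of the table.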
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
…-aware heterogeneity CI re-review surfaced two P1 methodology defects where the benchmark was not actually measuring what the scenario/findings claimed:
1. Staggered CS with/without-covariates comparison. Phase 7 was configured with estimation_method="reg", n_bootstrap=199 vs phase 2's "dr" + 999. That confounded two axes at once (method + inference workload), so the Baker-mandated comparison was not clean. Phase 7 now matches phase 2 exactly except for the covariates argument. CS with estimation_method="dr" and no covariates is a supported path (verified by spot-check); the resulting 5x bootstrap-workload increase in phase 7 raises the staggered-large chain total slightly, which is already reflected in the regenerated tables.
2. dCDH heterogeneity refit without survey_design. The scenario framing and the performance-plan TSL-sharing optimization recommendation both assume the refit runs under the same survey design as the main fit. The refit was passing no survey_design, which meant the measured timing did not support the documented conclusion. The refit now uses the same SurveyDesign(weights="pw", strata="stratum", psu="psu") as the main fit. Confirmed supported (not NotImplementedError-gated on this shape). The refit is now as expensive as the main fit (was ~40% of the chain, now ~50%), and the TSL-sharing optimization recommendation is strictly stronger.

Narrative updated against the freshly regenerated tables:
- Staggered campaign: removed the "Rust at large is tied with SunAbraham" claim - ImputationDiD still leads under both backends.
- Reversible dCDH: updated the ~60/40 split claim to the new ~50/50 split and called out the TSL-sharing opportunity more directly.
- Top-hotspots table row 4 strengthened to reflect the now-equal phase costs.

All other narrative claims were cross-checked against the new data and hold (BRFSS ~24s at 1M rows, staggered single-digit scale multiplier, SDiD Rust gap stable, peak RSS under 600 MB, etc.). Still measurement only. No changes under diff_diff/ or rust/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
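Keeping phase 7 identical to phase 2 except for the covariates axis is easy to enforce with a shared kwargs dict. A hypothetical sketch; the argument names mirror those quoted in the commit message, and `phase_kwargs` is an illustrative helper, not the script's actual code.

```python
# Shared fit configuration; the estimation_method / n_bootstrap values
# are the ones quoted in the commit message.
BASE_FIT_KWARGS = dict(estimation_method="dr", n_bootstrap=999)

def phase_kwargs(covariates=None):
    """Derive a phase's fit kwargs from the shared base so the
    with/without-covariates comparison varies exactly one axis."""
    kwargs = dict(BASE_FIT_KWARGS)
    if covariates is not None:
        kwargs["covariates"] = covariates
    return kwargs

phase2 = phase_kwargs()                         # no covariates
phase7 = phase_kwargs(covariates=["x1", "x2"])  # covariates only change

# The only difference between the two configs is the covariates key.
diff_keys = {k for k in phase7 if phase2.get(k) != phase7[k]}
```

Deriving both configs from one base makes the "exactly one axis varies" property checkable in a test rather than a matter of code review.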
/ai-review
Three narrative corrections that need to match the freshly regenerated tables after the P1 methodology fixes:
- Staggered campaign narrative: under Rust at large scale, ImputationDiD still leads SunAbraham (43% vs 33%), not "tied". Removed the tied language.
- Reversible dCDH narrative: the main-fit / heterogeneity-refit split is now ~50/50 (was ~60/40 before the heterogeneity refit got a survey_design). Under Python the heterogeneity refit slightly edges out the main fit. Updated the narrative and strengthened the TSL-sharing opportunity wording.
- Top-hotspots table rows 2 and 4 updated to match.

Prior commit 6b1715a intended to include these edits, but they raced with a linter refresh and dropped silently. Caught and fixed now. Docs only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good
- P2 - brand-awareness medium-scale prose: the narrative said JK1 is "2-3x the multi-outcome loop on Python at medium" and "still the top phase on Rust though by a smaller margin there." Committed baselines contradict both: on Python/medium JK1 is about 1.9x the multi-outcome loop (not 2-3x), with HonestDiD close behind; on Rust/medium the multi-outcome loop is actually the top phase, with JK1 second. Only at large does JK1 become the clearly dominant phase under both backends. Prose rewritten to match.
- P3 - mem_profile_brfss.py headline: the output labeled stats[0].size_diff (the largest single allocation site) as "net allocated (end - start)", which sounds like the total retained delta. Relabeled to "top single-site size diff" and added a "total net size diff across all sites" line alongside it. Regenerated the committed text artifact with the corrected labels.

Docs-and-script only. No baseline timing regeneration needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
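The relabeled tracemalloc headline can be sketched as follows. A minimal stdlib sketch; mem_profile_brfss.py's actual code is not shown in this thread, and the workload here is a toy stand-in.

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

leak = [list(range(1000)) for _ in range(200)]  # retained toy workload

after = tracemalloc.take_snapshot()
stats = after.compare_to(before, "lineno")

# stats[0] is only the single largest allocation site -- labeling it
# "net allocated (end - start)" would overstate what it measures.
top_site_diff = stats[0].size_diff
# The total retained delta sums size_diff across all sites.
total_net_diff = sum(s.size_diff for s in stats)

print(f"top single-site size diff:            {top_site_diff} B")
print(f"total net size diff across all sites: {total_net_diff} B")
tracemalloc.stop()
```

The distinction matters for the BRFSS finding: a small top-site diff with a small total diff is what supports the "pure CPU, not memory" conclusion.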
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: One unmitigated
CI re-review P1: `bench_brand_awareness_survey.py` declared the analytical TSL path with `SurveyDesign(weights, strata, psu, fpc, nest)` only in phase 2; phases 4 (multi-outcome), 6 (placebo), and 7 (event study + HonestDiD) built their own SurveyDesigns without `fpc`. That means a material share of the committed brand-awareness baselines timed a different variance path than the scenario doc declares. Fix:
- One analytical `sd_tsl` SurveyDesign (strata + PSU + FPC + nest=True) is now constructed once at the top of `make_phases` and reused across phases 2, 4, 6, and 7. Phase 3 (replicate weights, JK1) is a different variance surface and correctly keeps its own design.
- Regenerated baselines for both backends.
- Regenerated findings tables via gen_findings_tables.py.

Narrative refreshed against the new tables:
- Brand-awareness medium: on Python JK1 now leads by ~2.2x (was 1.9x in the previous rerun); on Rust the multi-outcome loop and JK1 come in essentially tied. Medium is also where Python is slowest relative to Rust (~1.6x) - the full analytical TSL path with FPC exposes vectorization differences at that shape. Totals re-converge at large scale.
- Reversible dCDH: ~48-52% split under both backends (previously the Python heterogeneity refit edged out the main fit slightly).
- Scaling finding #5 retuned: the Rust-only uplift is still the SDiD story; brand-awareness medium now surfaces as a secondary, modest ~1.6x case rather than "within noise".

Still measurement only. No changes under diff_diff/ or rust/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
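The construct-once-reuse fix can be illustrated with a stub. `SurveyDesign` here is a stand-in dataclass, not the library class; the field names follow the commit message, and the phase mapping is illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SurveyDesign:
    """Stand-in for the library's SurveyDesign; fields per the commit."""
    weights: str
    strata: str
    psu: str
    fpc: str
    nest: bool

def make_phases():
    # One analytical TSL design built once and shared, so every
    # analytical phase is guaranteed to time the same variance path.
    sd_tsl = SurveyDesign(weights="wt", strata="stratum", psu="psu",
                          fpc="fpc", nest=True)
    return {
        "fit_tsl": sd_tsl,        # phase 2
        "multi_outcome": sd_tsl,  # phase 4
        "placebo": sd_tsl,        # phase 6
        "event_study": sd_tsl,    # phase 7
        # Phase 3 (JK1 replicate weights) is a different variance
        # surface and keeps its own design in the real script.
    }

phases = make_phases()
shared_designs = set(map(id, phases.values()))
```

Sharing one object instead of re-declaring per phase turns "all analytical phases use the same design" from a convention into a structural guarantee.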
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good. The prior P1 blockers from the last review are resolved, and I did not find any unmitigated P0/P1 issues in the changed code. I found two low-severity P3 drifts: one benchmark phase is still slightly less faithful to the cited brand-awareness workflow than its label suggests, and parts of the new findings prose no longer exactly match the committed JSON baselines.
Code Quality: No findings. Performance: No findings. Maintainability: No findings.
Tech Debt: No new findings. The non-canonical HonestDiD behaviors above are already documented in the Methodology Registry, so they are not untracked debt for this PR. REGISTRY.md:L2453-L2455, REGISTRY.md:L632-L632
Security: No findings.
CI re-review P3:
- bench_brand_awareness_survey.py "naive" phase was using
cluster="psu", which is already a partial sampling-design correction
- the SE-inflation comparison is more faithful to Tutorial 17 when
the first phase is genuinely untreated-for-design. Removed the
cluster argument.
- performance-plan.md narrative overreaches corrected:
- Staggered campaign: at Rust medium SunAbraham is now the clearly
leading phase (~1.7x ImputationDiD there), not "slightly edges
out". Reworded to say ImputationDiD / SunAbraham are the top two
at every scale but their order is not stable across backend and
scale.
- Reversible dCDH: split is not "~evenly under both backends" -
Python is closer to 58/41 with the main fit leading, Rust is
51/49 with the heterogeneity refit leading. Reworded to reflect
the split per backend.
Regenerated the affected brand-awareness and campaign-staggered
baselines (the naive-fit change slightly reduces brand-awareness
chain totals and shifts phase-percentage shares). Tables in
performance-plan.md re-derived via gen_findings_tables.py.
Still measurement only. No changes under diff_diff/ or rust/.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good. The prior re-review methodology-fidelity issues are resolved, and I did not find any unmitigated P0/P1 issues in the changed code. The remaining issues are P3-only documentation/prose drift.
Code Quality: No findings.
Maintainability: No findings. Tech Debt: No new blocker-level debt. Security: No findings. Documentation/Tests: No findings beyond the documentation drift already called out above.
CI re-review P3 items, all documentation-only:
- Scenario 3 operation chain: said "analytical TSL via strata + PSU", but aggregate_survey()'s returned second-stage design is pweight with geographic PSU clustering and no stage-2 strata. Reworded to match the actual second-stage design surface being benchmarked.
- The ImputationDiD "consistently dominant" claim in scaling finding #2 and hotspot table row #2: at Rust medium, SunAbraham clearly leads (0.353s vs 0.214s). Both claims narrowed to "Python all scales + Rust small/large" with the Rust-medium SunAbraham exception called out explicitly; the "together ~70-80% of the chain" framing preserves the optimization recommendation.
- The SDiD narrative said sensitivity_to_zeta_omega and in_time_placebo are the two largest at every scale/backend, but at Rust small, bootstrap_variance slightly edges both (at sub-50ms totals, per-phase fixed overhead dominates the ranking). Qualified to Python all scales + Rust medium/large.

Docs only. No script or baseline changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: One unmitigated P1 remains: scenario 6's implemented treatment timing does not match its documented methodology, and the binarized comparison phase is keyed to a different treatment start than the ContinuousDiD fits.
Code Quality: No findings. Performance: No findings. Maintainability: No findings. Tech Debt: No findings.
Security: No findings. Documentation/Tests: No additional findings beyond the scenario-6 methodology mismatch. The earlier documentation-drift issues from the previous re-review appear fixed.
…race shifts CI re-review P1: bench_dose_response.py inherited the CDiD generator's default cohort [2], not the documented period 3. The fallback that would have set first_treat=3 never ran (the generator already populates first_treat), so the committed baselines measured a different cohort onset than the scenario doc. The binarized DiD phase also hardcoded post >= 3, which further desynced it from the actual CDiD treatment start under the default DGP. Fix:
- Pin the generator to cohort_periods=[3] so the DGP matches the docs.
- Assert exactly one positive first_treat after generation; future DGP changes that break the single-cohort contract will fail loudly instead of drifting silently.
- The binarized phase now derives its post cutoff from the actual first_treat in the data, not a hardcoded period number. No opportunity to desync from the CDiD fits above.
- Regenerated dose-response baselines for both backends.

Structural narrative hardening: prior CI rounds have repeatedly re-flagged the same drift pattern: the staggered campaign and reversible dCDH narratives make phase-order claims at close-race cells (staggered Rust medium, dCDH at this shape) that can flip on rerun because the two contenders are within a few percentage points of each other. The underlying ranking is not the right level of abstraction for narrative; the phase-share table is. This commit rewrites both narratives to describe the aggregate share pattern and defer per-cell ordering to the generator-produced table. Scaling finding #2 and hotspot table row #2 get the same treatment. Net effect: narrative claims are now robust to rerun noise at close-race cells.

Still measurement only. No changes under diff_diff/ or rust/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
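The derive-don't-hardcode pattern from this fix can be sketched as follows. A minimal sketch; the column names follow the commit message, and the data-frame machinery is simplified to plain dicts.

```python
def single_cohort_start(first_treat_values):
    """Assert the single-cohort contract and return the shared
    treatment start, instead of hardcoding a period number."""
    starts = {t for t in first_treat_values if t > 0}
    if len(starts) != 1:
        raise ValueError(
            f"expected exactly one positive first_treat, got {starts}")
    return starts.pop()

# Toy panel: never-treated units carry first_treat = 0.
rows = [
    {"unit": 1, "period": 2, "first_treat": 3},
    {"unit": 1, "period": 4, "first_treat": 3},
    {"unit": 2, "period": 2, "first_treat": 0},
    {"unit": 2, "period": 4, "first_treat": 0},
]
start = single_cohort_start(r["first_treat"] for r in rows)

# The binarized phase derives its post indicator from the actual onset,
# so it cannot desync from the generator's cohort choice.
for r in rows:
    r["post"] = int(r["period"] >= start)
```

The loud assertion is the important half: a silently wrong cutoff produced plausible-looking baselines for several review rounds before being caught.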
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good. No unmitigated P0/P1 findings remain in this re-review.
…ht variance"
CI re-review P2 + P3, both docs/label only:
- docs/performance-plan.md had two remaining specific-magnitude
claims about brand-awareness medium ("~1.6x under Python",
"Python and Rust separate the most at medium", "~1.6x at worst on
brand-awareness medium"). Those were true on one earlier rerun but
the current committed baselines show medium at 0.56 / 0.55
(essentially tied) and the widest non-SDiD gap is now ~1.1x at
brand-large. Reworded per-scenario paragraph and scaling finding #5
to describe the stable aggregate pattern and defer exact ratios to
the scale-sweep table. Same treatment as the earlier staggered/dCDH
pass: narrative stops claiming magnitudes that can shift on rerun;
the generator-owned table carries the specifics.
- bench_brand_awareness_survey.py module docstring labeled JK1 as
"replicate-weight bootstrap". Per REGISTRY.md, JK1 is replicate-
weight variance (jackknife-style), not bootstrap inference - they
are distinct methodology surfaces. Renamed to "replicate-weight
variance (JK1 delete-one-PSU)" with an inline note pointing to the
registry.
Docs + docstring only. No script behaviour change; no baseline
regeneration needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good. No unmitigated P0/P1 findings remain in this re-review.
…-failures audit Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per project SemVer convention, minor bumps are reserved for new estimators or new module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults (PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner-ready output layer. Plain-English narrative summaries across all 16 result types, with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A contract), heterogeneity + within-group-varying PSU under Binder TSL, and PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review - docs/performance-scenarios.md documents 5-7 realistic practitioner workflows; the benchmark harness is extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339) continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J. CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default CI after false-positive flakes; the perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml, diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released: 2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the comparison-link footer. CITATION.cff retains the v3.1.3 versioned DOI in identifiers; the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Six end-to-end scenario specs (docs/performance-scenarios.md) anchored to applied-econ papers, industry writeups, and the bundled tutorials — covering CS + 8-step chain, survey 2x2, BRFSS microdata → CS panel, SDiD few-markets, reversible dCDH, and continuous dose-response.
- Extends the speed_review/bench_callaway.py pattern into six scenario scripts that time each phase of the full chain (Bacon → fit → HonestDiD → robustness refits → reporting), profile it with pyinstrument, and write a JSON baseline + flame HTML per (scenario, backend).
- docs/performance-plan.md with per-scenario top-5 hot phases, Python-vs-Rust gap, and a recommended action category (Rust / algorithmic / cache / leave alone).

Findings at a glance
aggregate_survey (93%) is the headline hotspot. Three concrete follow-up candidates surfaced (see findings doc for details):
The _compute_stratified_psu_meat per-cell loop in aggregate_survey, ImputationDiD's unexpected 4x slowdown vs CS with n_bootstrap=999, and the shared-precomputation opportunity between the dCDH main fit and heterogeneity refit. No source changes are in this PR — optimizations become separate PRs citing specific findings.

Scope discipline
- No changes under diff_diff/ or rust/.
- benchmarks/run_benchmarks.py (R-parity accuracy) is untouched; these scenarios complement it rather than replace it.
- Profiles land under baselines/profiles/; only the JSONs are committed.

Test plan
- Ran python benchmarks/speed_review/run_all.py; baselines committed under benchmarks/speed_review/baselines/; HTML profiles regenerate on re-run.
- Nothing under diff_diff/ or rust/ modified — git diff --name-only confirms docs + benchmarks only.
- benchmarks/speed_review/run_all.py --backend python and --backend rust both succeed as single-backend commands.