
Add practitioner-workflow performance baseline#333

Merged
igerber merged 15 commits into main from perf-review on Apr 19, 2026

Conversation

@igerber (Owner) commented Apr 19, 2026

Summary

  • Defines 6 canonical practitioner scenarios (docs/performance-scenarios.md) anchored to applied-econ papers, industry writeups, and the bundled tutorials — covering CS + 8-step chain, survey 2x2, BRFSS microdata → CS panel, SDiD few-markets, reversible dCDH, and continuous dose-response.
  • Ports the existing speed_review/bench_callaway.py pattern into six scenario scripts that time each phase of the full chain (Bacon → fit → HonestDiD → robustness refits → reporting), profile it with pyinstrument, and write a JSON baseline + flame HTML per (scenario, backend).
  • Adds a "Practitioner Workflow Baseline (v3.1.3)" section to docs/performance-plan.md with per-scenario top-5 hot phases, Python-vs-Rust gap, and recommended action category (Rust / algorithmic / cache / leave alone).

Findings at a glance

| Scenario | Python | Rust | Gap | Dominant phase |
|---|---|---|---|---|
| Staggered campaign (CS + chain) | 0.48s | 0.48s | 1.0x | ImputationDiD robustness (49%) |
| Brand awareness survey | 0.18s | 0.20s | 0.9x | multi-outcome + HonestDiD (~70%) |
| BRFSS microdata → CS panel | 1.58s | 1.58s | 1.0x | aggregate_survey (93%) |
| Geo-experiment few markets (SDiD) | 2.96s | 0.04s | 76x | Frank-Wolfe weight solver |
| Reversible treatment (dCDH L_max=3 + TSL) | 0.49s | 0.55s | 0.9x | dCDH fit + heterogeneity refit |
| Pricing dose-response | 0.57s | 0.58s | 1.0x | four spline variants, ~25% each |

Three concrete follow-up candidates surfaced (see findings doc for details): _compute_stratified_psu_meat per-cell loop in aggregate_survey, ImputationDiD's unexpected 4x slowdown vs CS with n_bootstrap=999, and shared-precomputation opportunity between dCDH main fit and heterogeneity refit. No source changes are in this PR — optimizations become separate PRs citing specific findings.

Scope discipline

  • No changes under diff_diff/ or rust/.
  • Existing benchmarks/run_benchmarks.py (R-parity accuracy) is untouched; these scenarios complement it rather than replace it.
  • Flame HTMLs are regenerated locally and live under a gitignored baselines/profiles/; only the JSONs are committed.

Test plan

  • All 12 scenario × backend combinations run green via python benchmarks/speed_review/run_all.py.
  • Baseline JSONs written to benchmarks/speed_review/baselines/; HTML profiles regenerate on re-run.
  • No source code under diff_diff/ or rust/ modified — git diff --name-only confirms docs + benchmarks only.
  • benchmarks/speed_review/run_all.py --backend python and --backend rust both succeed as single-backend commands.

Six end-to-end scenarios covering CS + 8-step chain, survey DiD, BRFSS
microdata -> CS panel, SDiD few-markets, reversible dCDH, and continuous
dose-response -- anchored to applied-econ papers and industry conventions
rather than the 200 x 8 cookie cutter. Each chain is timed per-phase and
profiled with pyinstrument under both backends; findings and recommended
actions are in docs/performance-plan.md.

Measurement only -- no changes under diff_diff/ or rust/. The decision
doc identifies aggregate_survey per-cell scaffolding, ImputationDiD fit
loop, and dCDH heterogeneity refit as candidates for follow-up PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment
Needs changes. Two unmitigated P1 issues block approval: Scenario 2 is not benchmarking the replicate-variance method or workload it claims to benchmark, and the new harness can report success even when benchmark phases fail.

Executive Summary

  • Scenario 2’s replicate-weight path is methodologically wrong as written: the generator produces JK1 delete-one-PSU replicate weights, but the benchmark registers them as BRR, which changes the variance formula.
  • The same Scenario 2 is documented as a ~32K-row, ~160-replicate practitioner survey, but the committed code and JSON are a 2,400-row, 40-replicate panel. The reported timings therefore do not measure the stated scenario.
  • run_scenario() records per-phase failures but never exits nonzero, and run_all.py checks only subprocess return codes. A run can therefore end with “All scenarios passed” after failed phases.
  • Output and reproduction docs are inconsistent: code writes under benchmarks/speed_review/baselines/, while several new docs still point to benchmarks/results/, and the README tells contributors to commit gitignored HTML files.

Methodology

Code Quality

  • Severity: P1 At benchmarks/speed_review/bench_shared.py:88, per-phase exceptions are caught and only recorded as "ok": false; at benchmarks/speed_review/run_all.py:35, success is inferred solely from the subprocess exit code. Impact: run_all.py can print “All scenarios passed” even when one or more benchmark phases failed, so the PR’s test plan is not trustworthy. Concrete fix: make any failed phase force a non-zero exit, for example by tracking had_failure in run_scenario() and raising SystemExit(1) after writing artifacts.

Performance

  • Severity: P1 Scenario 2’s stated workload does not match the measured workload. The doc/spec side says ~32,000 respondent rows and ~160 replicate columns at docs/performance-scenarios.md:130, while the code uses n_units=200, n_periods=12 at benchmarks/speed_review/bench_brand_awareness_survey.py:30, and the committed baseline records n_obs=2400, n_replicate_weights=40 at benchmarks/speed_review/baselines/brand_awareness_survey_python.json:43. docs/performance-plan.md then treats that run as the practitioner baseline at docs/performance-plan.md:66. Impact: the “~200 ms end-to-end” conclusion is not about the scenario the docs claim to measure, so it is not a reliable input to optimization decisions. Concrete fix: either scale the benchmark up to the documented workload and recommit the JSON/plan text, or update the scenario docs, script docstring, and findings text to the smaller 2,400-row, 40-replicate panel that is actually being timed.

Maintainability

Tech Debt
No findings. None of the P1 issues above are currently mitigated by TODO.md.

Security
No findings.

Documentation/Tests

  • Severity: P2 The new scenario doc includes an unrunnable code snippet at docs/performance-scenarios.md:143: it uses replicate_method="brr", but diff_diff/survey.py:86 accepts only uppercase BRR, Fay, JK1, JKn, or SDR. Impact: the documentation is not copy-paste correct and repeats the same replicate-method confusion as Scenario 2. Concrete fix: update the snippet to the actual supported method and casing once the Scenario 2 replicate design is corrected.
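The casing fix suggested above could be made robust with a small case-insensitive lookup. This is only an illustrative sketch: the helper name is hypothetical, and the supported-method set is the one quoted by the review from diff_diff/survey.py, not independently verified against the library.

```python
VALID_METHODS = {"BRR", "Fay", "JK1", "JKn", "SDR"}

def normalize_replicate_method(method):
    """Hypothetical helper: map any casing of a supported replicate
    method onto its canonical spelling, or fail loudly."""
    by_upper = {m.upper(): m for m in VALID_METHODS}
    try:
        return by_upper[method.upper()]
    except KeyError:
        raise ValueError(
            f"replicate_method must be one of {sorted(VALID_METHODS)}"
        ) from None
```

With a normalizer like this, the doc snippet's `"brr"` would resolve to `"BRR"` instead of raising; alternatively, the docs can simply be corrected to the canonical casing.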

Path to Approval

  1. Fix Scenario 2’s replicate design so method, weights, docs, and baselines all agree. Either use JK1 everywhere, or generate true BRR weights and keep the BRR labeling.
  2. Make Scenario 2’s workload consistent across bench_brand_awareness_survey.py, docs/performance-scenarios.md, and the committed baseline JSON, then regenerate the Scenario 2 numbers in docs/performance-plan.md.
  3. Propagate per-phase failures to process exit status so python benchmarks/speed_review/run_all.py cannot false-pass.
  4. Sync the output-path and artifact docs to the actual benchmarks/speed_review/baselines/ layout, and fix the invalid replicate_method="brr" snippet.

Four scenarios (campaign_staggered, brand_awareness_survey, brfss_panel,
geo_few_markets) now run at small/medium/large data scales rather than a
single tutorial-scale point. The large scales reflect practitioner realism:
1M-row BRFSS pooled panels, 1,500-unit county-level staggered studies,
1,000-unit multi-region brand surveys, 500-unit zip-level geo-experiments.

Key finding from the sweep: aggregate_survey at 1M microdata rows takes
~24 seconds (100% of BRFSS chain runtime), with 97% of that in
_compute_stratified_psu_meat self-time. The tutorial-scale pass had
flagged this as a 1.5s finding; at practitioner scale it is 15-20x larger
and becomes the single highest-value optimization target identified. The
other four findings hold across scales: CS chain scales well to 1,500
units, brand-survey chain scales sub-linearly, SDiD Rust gap is stable,
ImputationDiD remains the top phase of the staggered chain at all scales.

Measurement only. docs/performance-plan.md and
docs/performance-scenarios.md updated with scale-sweep tables and
scaling-finding narrative.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 33d55fbfa721e5b7d21bd85495aad421cf0cce9f


Overall Assessment

⚠️ Needs changes — two unmitigated P1 findings remain. The prior Scenario 2 workload-size mismatch appears resolved, but the replicate-method mismatch and benchmark false-pass behavior are still open.

Executive Summary

  • Scenario 2 still labels and times JK1 delete-one-PSU replicate weights as BRR, so the replicate-fit phase is methodologically misidentified and would use the wrong variance factor if its SEs were inspected.
  • The benchmark harness still false-passes failures: run_scenario() records phase errors without exiting nonzero, run_all.py only checks subprocess return codes, and some scenario scripts swallow substep exceptions before the harness can see them.
  • The earlier Scenario 2 size/workload mismatch is largely fixed: the PR now uses a consistent small/medium/large sweep across script, docs, and committed JSONs.
  • Documentation consistency improved but is still incomplete: some files still point to benchmarks/results/, README still tells contributors to commit gitignored profile HTMLs, and the Scenario 2 snippet still uses invalid replicate_method="brr".

Methodology

  • Severity: P1 Method affected: survey replicate-weight variance in the Scenario 2 DifferenceInDifferences + SurveyDesign workflow. generate_survey_did_data(..., include_replicate_weights=True) is documented and implemented as JK1 delete-one-PSU replicate weights in diff_diff/prep_dgp.py:L1247-L1249 and diff_diff/prep_dgp.py:L1653-L1669, while the benchmark constructs SurveyDesign(..., replicate_method="BRR") in benchmarks/speed_review/bench_brand_awareness_survey.py:L73-L84 and labels the phase/docs/baselines as BRR in benchmarks/speed_review/bench_brand_awareness_survey.py:L139-L146, docs/performance-scenarios.md:L132-L150, docs/performance-plan.md:L100-L106, and benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json:L17-L18. The Methodology Registry and variance helper make clear that BRR and JK1 use different factors (1/R vs (R-1)/R) in docs/methodology/REGISTRY.md:L2876-L2889 and diff_diff/survey.py:L1510-L1524. Impact: the benchmark is not measuring the method it claims to measure, and the replicate-fit result object would carry the wrong SE formula if examined. Concrete fix: either switch Scenario 2 end-to-end to JK1 (constructor, phase names, docs, plan text, committed JSON labels) or replace the generator output with true BRR/Fay-style replicate weights before continuing to call the path BRR.

  • Severity: P3 Method affected: HonestDiD on Callaway-Sant’Anna event-study results in the campaign and BRFSS scenarios. Both scenarios benchmark compute_honest_did() on CS results built with the default base_period="varying" in benchmarks/speed_review/bench_campaign_staggered.py:L62-L83 and benchmarks/speed_review/bench_brfss_panel.py:L88-L114. compute_honest_did() explicitly warns that methodologically valid R-style interpretation requires base_period="universal" in diff_diff/honest_did.py:L675-L686, and the Registry documents this difference in docs/methodology/REGISTRY.md:L393-L412. Impact: these timings are for the library’s documented convenience path, not the canonical HonestDiD setup. Concrete fix: none required unless you want a separate canonical benchmark variant; this deviation is documented.
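The variance-factor distinction behind the P1 finding (1/R for BRR vs (R-1)/R for JK1, as the review cites from docs/methodology/REGISTRY.md) can be illustrated with a minimal sketch; this is not diff_diff's implementation, just the textbook replicate-variance formulas the finding refers to.

```python
def replicate_variance(theta_full, theta_reps, method):
    """Replicate-weight variance of a point estimate.

    theta_full:  full-sample estimate
    theta_reps:  sequence of R replicate estimates
    method:      "BRR" (factor 1/R) or "JK1" (factor (R-1)/R)
    """
    R = len(theta_reps)
    sq = sum((t - theta_full) ** 2 for t in theta_reps)
    if method == "BRR":
        return sq / R
    if method == "JK1":
        return sq * (R - 1) / R
    raise ValueError(f"unsupported replicate method: {method}")
```

Labeling JK1 replicate weights as BRR therefore scales every SE by a factor of sqrt(R-1), which is why the mislabeling is a correctness issue and not just a naming nit.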

Code Quality

  • Severity: P1 run_scenario() still cannot make the overall run fail when benchmark phases fail. It catches phase exceptions and only records "ok": false in benchmarks/speed_review/bench_shared.py:L88-L105, while run_all.py treats success purely as subprocess exit code in benchmarks/speed_review/run_all.py:L35-L75. Several scenario functions also swallow substep exceptions and therefore never even reach the harness failure path, e.g. benchmarks/speed_review/bench_brand_awareness_survey.py:L129-L137, benchmarks/speed_review/bench_brfss_panel.py:L105-L114, and benchmarks/speed_review/bench_reversible_dcdh.py:L77-L95. Impact: python benchmarks/speed_review/run_all.py can still print “All scenarios passed” after incomplete practitioner chains, so the committed baselines and the PR’s test plan are not trustworthy. Concrete fix: track any failed phase/substep in run_scenario(), write artifacts, then exit nonzero; also stop converting failed substeps into successful phase timings unless the JSON explicitly records them as failures.

Performance

No findings. The prior Scenario 2 workload-size mismatch appears resolved: the script, scenario doc, and committed JSONs now align on a three-scale 200/500/1000-unit sweep with 40/90/160 replicate columns in benchmarks/speed_review/bench_brand_awareness_survey.py:L31-L35, docs/performance-scenarios.md:L132-L139, benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json:L43-L50, benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json:L43-L50, and benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json:L43-L50.

Maintainability

  • Severity: P2 Output locations and artifact instructions remain internally inconsistent. The harness docstring still says outputs go to benchmarks/results/ in benchmarks/speed_review/bench_shared.py:L8-L9, run_all.py repeats the same path in benchmarks/speed_review/run_all.py:L5-L6, and the plan doc still says results are written under benchmarks/results/ in docs/performance-plan.md:L175-L176, even though the code writes to benchmarks/speed_review/baselines/ in benchmarks/speed_review/bench_shared.py:L41-L42. README also says flame HTMLs are gitignored in benchmarks/speed_review/README.md:L38-L42, but then tells contributors to commit them in benchmarks/speed_review/README.md:L73-L81, despite benchmarks/speed_review/baselines/.gitignore:L1. Impact: reproduction and contribution instructions are still not self-consistent. Concrete fix: normalize every reference to benchmarks/speed_review/baselines/ and either remove the “commit HTML” instruction or stop ignoring baselines/profiles/.

Tech Debt

No findings. TODO.md does not currently track the open P1 issues, so they are not mitigated.

Security

No findings.

Documentation/Tests

  • Severity: P2 Scenario 2 documentation still contains an unrunnable example. docs/performance-scenarios.md:L148-L150 uses replicate_method="brr", but SurveyDesign only accepts uppercase methods in diff_diff/survey.py:L86-L91; the same PR’s plan text already notes that "brr" fails while "BRR" works in docs/performance-plan.md:L143-L146. Impact: the benchmark docs are not copy-paste correct and still echo the BRR/JK1 confusion. Concrete fix: update the snippet to the actually supported method and casing once the Scenario 2 replicate design is corrected.

Path to Approval

  1. Align Scenario 2’s replicate benchmark end-to-end: either rename/reconfigure it as JK1 everywhere, or generate true BRR/Fay replicate weights and keep the BRR labeling.
  2. Make benchmark failures fatal: propagate any failed phase or substep to a non-zero exit from benchmarks/speed_review/bench_shared.py / benchmarks/speed_review/run_all.py, and stop swallowing failed substeps as successful timings.

Addresses the four CI review findings:

- BRR -> JK1 rename. generate_survey_did_data(include_replicate_weights=
  True) emits JK1 delete-one-PSU weights per prep.py:1248; Scenario 2 was
  labeling them as BRR, which uses a different variance formula. Fixed
  script, phase label, scenario doc data-shape text, and example code
  snippet.
- Exit-code propagation. run_scenario now records a module-level
  failure flag; an atexit handler os._exit(1)s if any phase recorded
  ok=False. run_all.py's subprocess return-code check now reliably
  surfaces phase failures. Verified with a forced-failure harness test.
- Path references. bench_shared.py and run_all.py docstrings plus
  performance-plan.md prose normalized to
  benchmarks/speed_review/baselines/.
- Contributor README. "Commit HTMLs" instruction removed; flame HTMLs
  are gitignored and regenerated per run.

Adds memory measurement:

- psutil background RSS sampler (10ms) in run_scenario writes a memory
  field to every scenario JSON: start, peak, growth-during-run. Zero
  timing impact (background thread, single-syscall samples).
- mem_profile_brfss.py - standalone tracemalloc allocator attribution
  for the BRFSS-1M scenario. Separate from the timing harness so its
  2-5x overhead does not contaminate wall-clock baselines.
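The background-sampler design can be sketched without the psutil dependency by injecting the RSS reader; in the harness the reader would presumably be `lambda: psutil.Process().memory_info().rss`. The class name and the `memory` field names below are illustrative, not the actual run_scenario API.

```python
import threading

class RSSSampler:
    """Background thread that records RSS via rss_fn every `interval`
    seconds; summarizes start/peak/growth on exit."""

    def __init__(self, rss_fn, interval=0.01):
        self.rss_fn = rss_fn
        self.interval = interval
        self.samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.samples.append(self.rss_fn())
            self._stop.wait(self.interval)  # sleep, but wake early on stop

    def __enter__(self):
        self.start_rss = self.rss_fn()
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        peak = max(self.samples, default=self.start_rss)
        self.memory = {
            "start_bytes": self.start_rss,
            "peak_bytes": peak,
            "growth_bytes": peak - self.start_rss,
        }
```

Because sampling happens on a separate thread with one syscall per tick, the timed workload on the main thread is essentially undisturbed, which is the "zero timing impact" claim above.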

Memory findings extend the optimization priority list without changing
the #1 recommendation. Headline insight: BRFSS aggregate_survey at 1M
rows grows only 23 MB of working memory (vs 46 MB input), and
tracemalloc's net-retained allocation is 0.6 MB. The 24-second cost is
pure CPU - confirms the precompute-scaffolding fix is low-risk and fits
in any deployment target including 512 MB Lambda.

Secondary finding: staggered CS chain allocates 252-322 MB at 1,500
units (peak RSS 486-589 MB). Fine for workstations, tight for Lambda-
tier deployments. Flagged as a lower-priority follow-up.

Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 8d5eae8fec6be57f98983576bb33de6e378b2135


Overall Assessment

⚠️ Needs changes — one prior P1 remains unmitigated. The earlier Scenario 2 JK1/BRR mismatch looks fixed, but benchmark failure propagation is still incomplete because several required substeps can fail inside a “successful” phase.

Executive Summary

  • Scenario 2’s earlier replicate-method mismatch appears resolved end-to-end: the benchmark, docs, and committed baselines now consistently describe JK1.
  • The new harness fix is only partial: uncaught phase failures now exit nonzero, but several scenario phases still swallow substep exceptions and return ok=true, so run_all.py can still report success for incomplete practitioner chains.
  • The HonestDiD usage on Callaway-Sant’Anna results remains a documented REGISTRY deviation, so it is P3 informational rather than a blocker.
  • The benchmark docs are not synchronized to the committed scripts/artifacts: several headline timing numbers and some scenario descriptions no longer match the checked-in JSON baselines or scripts.
  • mem_profile_brfss_large_rust.txt commits workstation-specific absolute paths, which is a minor security/privacy leak.

Methodology

The previous Scenario 2 BRR/JK1 finding appears resolved.

Code Quality

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. I did not find matching TODO.md entries that would mitigate the P1/P2 items above.

Security

  • Severity: P3 Impact: the committed mem-profile artifact leaks workstation-specific absolute paths, including a user home directory, in benchmarks/speed_review/baselines/mem_profile_brfss_large_rust.txt:L15. This is minor, but unnecessary repository leakage. Concrete fix: scrub or relativize traced file paths before writing committed artifacts, or keep these profiles uncommitted like the HTML flames.

Documentation/Tests

Path to Approval

  1. Remove the phase-local exception swallowing, or teach run_scenario() to record substep failures and exit nonzero when any required substep inside a phase fails. Once incomplete chains can no longer report success, the remaining items are P2/P3 and this review becomes ✅.

Addresses the second-round CI review findings:

- P1 false-pass (remaining): removed five phase-local try/except blocks
  that swallowed sub-step exceptions (HonestDiD M-grids in brand-awareness
  and BRFSS, dCDH HonestDiD and heterogeneity refit, dose-response
  dataframe extraction). Exceptions now escape, the phase is marked
  ok=false, and run_scenario's atexit handler exits nonzero. The fix
  caught a real API-usage bug on its first rerun: dose_response extract
  phase tried to pull event_study level on a result fit with
  aggregate="dose"; the event_study fit lives in a dedicated phase, so
  that level is removed from the extraction loop.
- P2 scenario-spec drift: BRFSS scenario text now says pweight TSL
  stage-2 (matching the aggregate_survey-returned design), not "Full
  replicate-weight path"; dCDH reversible scenario text now says
  heterogeneity="group" (matching the script), not "cohort".
- P3 path leakage: tracemalloc output now scrubs $HOME, repo root, and
  site-packages before writing the committed txt.
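The path-scrubbing step in the third bullet can be sketched like this; the token names are illustrative, and nested prefixes (repo root, site-packages) are replaced before `$HOME` so a venv living under the home directory is not half-scrubbed:

```python
import sysconfig
from pathlib import Path

def scrub_paths(text, repo_root):
    """Replace machine-specific path prefixes with portable tokens so
    committed profile artifacts do not leak workstation layout."""
    replacements = [
        (str(Path(repo_root).resolve()), "<repo>"),
        (sysconfig.get_paths()["purelib"], "<site-packages>"),
        (str(Path.home()), "<home>"),
    ]
    for prefix, token in replacements:
        text = text.replace(prefix, token)
    return text
```

Keeping the scrub in the artifact writer (rather than a post-hoc cleanup) means a regenerated mem-profile can never reintroduce the leak.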

Drift-prevention layer:

- gen_findings_tables.py reads every JSON baseline and rewrites the
  numerical tables in performance-plan.md between
  <!-- TABLE:start <id> --> / <!-- TABLE:end <id> --> markers. Tables
  now re-derive from data on every rerun, eliminating the hand-edit
  drift the prior review flagged. Narrative prose stays hand-written
  by design, forcing a human re-read of findings when numbers shift.
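The marker-based rewriting can be sketched with a single regex substitution; the marker strings are the ones quoted in the bullet above, while the function name is hypothetical:

```python
import re

def replace_table(markdown, table_id, new_table):
    """Rewrite the block between TABLE:start/TABLE:end markers for
    `table_id`, leaving all surrounding prose untouched."""
    pattern = re.compile(
        rf"(<!-- TABLE:start {re.escape(table_id)} -->\n)"
        rf".*?"
        rf"(<!-- TABLE:end {re.escape(table_id)} -->)",
        re.DOTALL,
    )
    # A function replacement avoids backslash-escape surprises when the
    # generated table itself contains regex-special characters.
    return pattern.sub(lambda m: m.group(1) + new_table + "\n" + m.group(2), markdown)
```

Because only the span between the markers is regenerated, hand-written narrative stays hand-written, which is exactly the human-re-read property the bullet argues for.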

Findings refresh (the numbers moved slightly; three narrative claims
needed updating):

- "Rust marginally slower than Python on JK1 at large scale" -> removed;
  fresh data has Rust and Python within noise on brand awareness at
  large (JK1 phase 0.577s Py / 0.562s Rust, totals 1.03 / 1.04).
- "ImputationDiD consistently dominant phase at all scales" -> narrowed
  to "dominant under Python; tied with SunAbraham under Rust at large".
- "Nine-figures of MB" in memory finding #3 was a phrasing error
  (literally 100+ TB); corrected to "mid-100s of MB".

Priority of optimization opportunities refreshed against new data:

- #1 aggregate_survey precompute stratum scaffolding: High (unchanged,
  now strongly supported - 24.75s Python / 25.41s Rust at 1M rows, 100%
  of chain runtime, growth only +31 MB).
- #2 Staggered CS working-memory audit: Low with explicit bump-trigger
  (Rust large crosses 512 MB Lambda line).
- #5 Rust-port JK1 replicate fit loop: demoted from Medium to Low -
  the "Rust regression to fix" leg of the rationale is gone because
  Rust is no longer slower.

Net: one clear priority (aggregate_survey fix), four optional follow-ups.
Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 98a1f3a925944eade94483c4c516f7ea1e747248


Overall Assessment
✅ Looks good — this re-review resolves the prior P1. I did not find any unmitigated P0/P1 issues in the changed code; the remaining items are P2/P3 documentation/spec drift only.

Executive Summary

Methodology

Code Quality

Performance

  • No findings. This PR adds benchmark harness/scripts/docs only; I did not find a new library-side performance regression in the changed files.

Maintainability

  • No separate findings beyond the documentation/spec drift noted below.

Tech Debt

  • No findings. I did not find a matching TODO.md entry that tracks the remaining doc/spec drift, but these are P2 documentation issues rather than correctness blockers. TODO.md:51

Security

Documentation/Tests

CI review P2: performance-scenarios.md had four drift points where the
documented operation chain did not match what the scripts actually time.
Fixed each to be a faithful spec the reviewer can cross-check against:

- BRFSS small scale: "single year" -> "narrow analytic slice on a
  state-year grid" (all scales use n_years=10).
- Scenario 4 (SDiD): removed the seventh plot_synth_weights step the
  script never times; chain is now 6 steps, matching the script.
- Scenario 5 (dCDH): replaced "results.print_summary()" with the
  actual attribute snapshot the script performs (placebo_effect,
  overall_att, joiners_att, leavers_att); chain is now 4 steps.
- Scenario 6 (dose-response): event-study step is no longer described
  as to_dataframe(level="event_study") on a dose-only fit (that API
  path raises because aggregate="dose" does not populate event_study);
  it is now described as a second CDiD fit with aggregate="eventstudy",
  matching the separate phase the script times.

The within-estimator API-spelling inconsistency that surfaced during
this cleanup (ContinuousDiD uses "eventstudy" on fit(aggregate=...) but
"event_study" on to_dataframe(level=...)) is captured in the
correctness-adjacent observations in performance-plan.md.

No changes under diff_diff/, rust/, scripts, or baselines. Docs only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: bf9b9d7a0f9789fc12d27c180fa4ee33e13dcce5


Overall Assessment
✅ Looks good — no unmitigated P0/P1 findings. The prior phase-failure propagation and absolute-path leak issues remain fixed. One residual P2 documentation inconsistency remains.

Executive Summary

Methodology

Code Quality

Performance

  • No findings. This PR adds benchmark/docs surface only; I did not find a new library-side performance regression in the changed files.

Maintainability

  • No findings.

Tech Debt

  • No findings. I did not find a matching TODO.md entry for the remaining doc drift, but it is a P2 documentation issue rather than a correctness blocker. TODO.md:L51-L109

Security

Documentation/Tests

  • Severity: P2. Impact: the docs are much closer to the scripts now, but they are still not a fully faithful spec of the committed measurement surface. docs/performance-scenarios.md still says scenario baselines are written as "<scenario>_<backend>.json" even though four scenarios emit "<scenario>_<scale>_<backend>.json", and docs/performance-plan.md still contains stale prose saying each scenario runs at a single data shape and that multi-scale scenarios run under both backends, despite the committed three-scale sweep and the intentional geo_few_markets large/python skip. Concrete fix: update those stray prose lines, or better, drive scenario scale/output metadata from one structured source so run_all.py, README.md, and the markdown docs cannot drift independently. Refs: docs/performance-scenarios.md:L58-L60, benchmarks/speed_review/README.md:L69-L73, docs/performance-plan.md:L20-L25, docs/performance-plan.md:L37-L53, docs/performance-plan.md:L242-L247, benchmarks/speed_review/bench_geo_few_markets.py:L97-L102
  • Verification note: gen_findings_tables.py reproduces the committed tables exactly, and the new benchmarks/speed_review/*.py files parse cleanly in a read-only AST check. I did not execute the benchmark writers because the sandbox is read-only and those scripts write artifacts under the repo.

…skip

CI re-review P2: remaining stale prose lines that didn't reflect the
three-scale sweep and the intentional SDiD-Python-large skip. All
straightforward text edits:

- performance-scenarios.md output-path description now uses
  <scenario>[_<scale>]_<backend> notation and explicitly calls out
  that single-scale scenarios omit the scale segment.
- performance-scenarios.md "Runs under both backends" line now
  acknowledges the SDiD large-scale Python skip by design.
- performance-plan.md environment paragraph now mentions the SDiD skip
  alongside the three-scale sweep.
- performance-plan.md "What this baseline does not answer" section no
  longer claims each scenario runs at a single data shape (which is
  no longer true); replaced with an OOM-behaviour bullet that reflects
  what actually is and isn't covered.
- Pointers block at the end of performance-scenarios.md updated to
  the multi-scale filename pattern.

No changes under diff_diff/, rust/, scripts, or baselines. Docs only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 922317846c441ec736c5fcfa25f972dc65a7bf83


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings. The prior rerun blockers remain fixed, the methodology-sensitive benchmark calls are either source-consistent or explicitly documented deviations in the registry, and the remaining issues are documentation/reporting drift at P2.

Executive Summary

Methodology

Code Quality

No findings. The prior silent-pass issue remains fixed in the changed harness/orchestrator code. benchmarks/speed_review/bench_shared.py:L111-L124 benchmarks/speed_review/bench_shared.py:L176-L248 benchmarks/speed_review/run_all.py:L66-L79

Performance

No findings. This PR adds measurement and reporting surface only; I did not find a new library-side performance regression in the changed files.

Maintainability

No findings.

Tech Debt

No findings. I did not find a matching TODO.md entry for the remaining reporting drift, so it remains untracked, but it is still non-blocking at P2. TODO.md:L51-L109

Security

No findings. The prior absolute-path leak remains fixed. benchmarks/speed_review/mem_profile_brfss.py:L107-L129 benchmarks/speed_review/baselines/mem_profile_brfss_large_rust.txt:L1-L29

Documentation/Tests

… narrative

Addresses the two remaining P2s from CI review:

- gen_findings_tables.py hard-coded scale="large" for the top-phases
  table, which silently dropped the geo_few_markets Python row (Python
  intentionally skips the large scale). The generator now iterates
  reversed(SCALE_ORDER) and picks the largest record actually present
  per (scenario, backend). The regenerated table now shows SDiD Python
  at medium and Rust at large side-by-side, which is the Python-vs-Rust
  comparison the table is supposed to surface.
- Brand-awareness medium-scale narrative said the multi-outcome loop
  and the JK1 replicate path are "comparable" at medium. The committed
  baselines contradict this: JK1 is 2-3x the multi-outcome loop on
  Python and still the top phase on Rust. Rewrote the bullet to say
  JK1 is the clear top phase from medium onwards and consolidates at
  large, matching the data.
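The largest-available-scale selection described in the first bullet can be sketched as below; the record schema (`scenario`/`backend`/`scale` keys) and `SCALE_ORDER` name are assumptions based on the commit text, not the actual gen_findings_tables.py code:

```python
SCALE_ORDER = ["small", "medium", "large"]  # assumed ordering

def pick_largest_scale(records, scenario, backend):
    """Return the baseline record at the largest scale actually present
    for (scenario, backend), instead of hard-coding scale='large'.

    This keeps a backend that intentionally skips the large scale (the
    geo_few_markets Python case) from silently dropping out of the table.
    """
    by_scale = {
        r["scale"]: r
        for r in records
        if r["scenario"] == scenario and r["backend"] == backend
    }
    for scale in reversed(SCALE_ORDER):
        if scale in by_scale:
            return by_scale[scale]
    return None
```

The per-backend fallback is what lets the regenerated table show SDiD Python at medium next to Rust at large.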

Docs + generator only. No baseline regeneration needed (the top-phases
table regeneration is cosmetic - the JSONs didn't change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: ce05d01539c697ee3dee4e8e65d7636bd96fc9e8


Overall Assessment
⚠️ Needs changes

Executive Summary

Methodology

Code Quality
No findings. The earlier partial-failure propagation issue remains fixed in benchmarks/speed_review/bench_shared.py:L111-L248 and benchmarks/speed_review/run_all.py:L39-L79.

Performance
No new performance-regression findings beyond the methodology issues above.

Maintainability
No findings.

Tech Debt
No findings. The two P1 methodology issues above are not mitigated by the current tracker in TODO.md:L51-L109.

Security
No findings. The mem-profile writer still scrubs repo/home/site-package paths before writing the committed artifact. Refs: benchmarks/speed_review/mem_profile_brfss.py:L107-L129, benchmarks/speed_review/baselines/mem_profile_brfss_large_rust.txt:L1-L20

Documentation/Tests
No findings. The prior re-review documentation/reporting issues are addressed: benchmarks/speed_review/gen_findings_tables.py:L154-L214 now falls back to the largest available scale per backend, and the brand-awareness narrative in docs/performance-plan.md:L115-L122 matches the committed baselines.

Path to Approval

  1. In campaign_staggered, make phase 7 a true with/without-covariates comparison by holding the CS method and bootstrap settings constant and changing only covariates; then refresh the affected baselines and generated findings.
  2. In reversible_dcdh, benchmark the heterogeneity refit with the same survey design as the main fit, or explicitly redefine the scenario/docs as unweighted and remove the TSL-sharing conclusion; then refresh the affected baselines and findings text.

…-aware heterogeneity

CI re-review surfaced two P1 methodology defects where the benchmark
was not actually measuring what the scenario/findings claimed:

1. Staggered CS with/without-covariates comparison. Phase 7 was
   configured with estimation_method="reg", n_bootstrap=199 vs phase 2's
   "dr" + 999. That confounded two axes at once (method + inference
   workload) so the Baker-mandated comparison was not clean. Phase 7
   now matches phase 2 exactly except for the covariates argument.
   CS with estimation_method="dr" and no covariates is a supported
   path (verified by spot-check); the resulting 5x bootstrap workload
   increase in phase 7 raises the staggered-large chain total slightly,
   which is already reflected in the regenerated tables.
2. dCDH heterogeneity refit without survey_design. The scenario
   framing and the performance-plan TSL-sharing optimization
   recommendation both assume the refit runs under the same survey
   design as the main fit. The refit was passing no survey_design,
   which meant the measured timing did not support the documented
   conclusion. The refit now uses the same SurveyDesign(weights="pw",
   strata="stratum", psu="psu") as the main fit. Confirmed supported
   (not NotImplementedError-gated on this shape). The refit is now as
   expensive as the main fit (was ~40% of chain, now ~50%), and the
   TSL-sharing optimization recommendation is strictly stronger.
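The single-axis discipline in fix 1 can be enforced mechanically by building both phase configurations from one shared kwargs dict. A minimal sketch under stated assumptions: only `estimation_method="dr"` and `n_bootstrap=999` come from the commit text; the config-dict structure and names are hypothetical, not the real bench script.

```python
# Illustrative sketch, not the actual bench script: both phases derive
# from one shared kwargs dict so the only axis that can vary is
# `covariates`. Only estimation_method="dr" and n_bootstrap=999 come
# from the commit text; everything else here is hypothetical.

BASE_CS_KWARGS = {
    "estimation_method": "dr",
    "n_bootstrap": 999,
}

phase2_kwargs = {**BASE_CS_KWARGS, "covariates": ["x1", "x2"]}
phase7_kwargs = {**BASE_CS_KWARGS, "covariates": None}

def differing_keys(a, b):
    """Return the keys whose values differ between two phase configs."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))

# Guard: if a future edit re-confounds the comparison, fail loudly.
assert differing_keys(phase2_kwargs, phase7_kwargs) == ["covariates"]
```

The assertion is the point: it turns "phase 7 matches phase 2 exactly except for covariates" from a prose claim into a check that fails on the next drift.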

Narrative updated against the freshly regenerated tables:

- Staggered campaign: removed the "Rust at large is tied with
  SunAbraham" claim - ImputationDiD still leads under both backends.
- Reversible dCDH: updated the ~60/40 split claim to the new
  ~50/50 split and called out the TSL-sharing opportunity more
  directly.
- Top-hotspots table row 4 strengthened to reflect the now-equal
  phase costs.

All other narrative claims cross-checked against the new data and
hold (BRFSS ~24s at 1M rows, staggered single-digit scale multiplier,
SDiD Rust gap stable, peak RSS under 600 MB, etc.).

Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 19, 2026

/ai-review

Three narrative corrections that need to match the freshly regenerated
tables after the P1 methodology fixes:

- Staggered campaign narrative: under Rust at large scale ImputationDiD
  still leads SunAbraham (43% vs 33%), not "tied". Removed the tied
  language.
- Reversible dCDH narrative: main fit / heterogeneity refit split is
  now ~50/50 (was ~60/40 before the heterogeneity refit got a
  survey_design). Under Python the heterogeneity refit slightly edges
  out the main fit. Updated the narrative and strengthened the
  TSL-sharing opportunity wording.
- Top-hotspots table rows 2 and 4 updated to match.

Prior commit 6b1715a intended to include these edits, but they raced
with a linter refresh and were dropped silently. Caught and fixed here.

Docs only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 09f24a80b1e0646b569faaf4464c975aef43fe89


Overall Assessment

✅ Looks good

Executive Summary

Methodology

  • No P0/P1 findings.
  • Severity P3. Impact: CallawaySantAnna + HonestDiD in the campaign and BRFSS scenarios still use two non-canonical combinations: bootstrap-fitted CS results and default non-universal base period. Both are explicitly documented registry deviations, so they are informational only. Concrete fix: none required unless you want a separate canonical timing variant with analytical CS inference and base_period="universal". Refs: bench_campaign_staggered.py:77, bench_brfss_panel.py:105, REGISTRY.md:2454, REGISTRY.md:2455
  • Severity P3. Impact: ChaisemartinDHaultfoeuille + HonestDiD on the placebo/event-study surface is a documented library extension, not a methodology defect. Concrete fix: none. Refs: bench_reversible_dcdh.py:77, REGISTRY.md:632

Code Quality

  • No findings. No new inline inference anti-patterns or NaN/inference inconsistencies were introduced in the changed Python files.

Performance

Maintainability

  • No findings.

Tech Debt

  • No findings. The remaining methodology deviations above are already documented in REGISTRY.md and, where relevant, tracked in TODO.md.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: mem_profile_brfss.py emits a misleading headline metric by printing stats[0].size_diff as “net allocated (end - start)”; the committed text file therefore reports the largest single site delta, not the total retained delta. Concrete fix: either compute sum(s.size_diff for s in stats) for a true net total, or relabel the line as top site size diff. Refs: mem_profile_brfss.py:97, mem_profile_brfss_large_rust.txt:8

P2 - brand-awareness medium-scale prose: the narrative said JK1 is
"2-3x the multi-outcome loop on Python at medium" and "still the top
phase on Rust though by a smaller margin there." Committed baselines
contradict both: on Python/medium JK1 is about 1.9x the multi-outcome
loop (not 2-3x) with HonestDiD close behind; on Rust/medium the
multi-outcome loop is actually the top phase, with JK1 second. Only at
large does JK1 become the clearly dominant phase under both backends.
Prose rewritten to match.

P3 - mem_profile_brfss.py headline: the output labeled
stats[0].size_diff (largest single allocation site) as "net allocated
(end - start)", which sounds like the total retained delta. Relabeled
to "top single-site size diff" and added a "total net size diff across
all sites" line alongside it. Regenerated the committed text artifact
with the corrected labels.
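The two quantities the relabel distinguishes are easy to conflate. A minimal sketch using the stdlib `tracemalloc` API (the helper name is hypothetical; only the `size_diff` field and `compare_to` call are the real API):

```python
import tracemalloc

def summarize_diff(stats):
    """Separate the two numbers the original label conflated:
    the largest single-site delta (stats[0].size_diff, since
    compare_to sorts biggest first) vs. the net delta summed
    across all allocation sites."""
    top_site = stats[0].size_diff if stats else 0
    net_total = sum(s.size_diff for s in stats)
    return top_site, net_total

# Minimal usage against real snapshots:
tracemalloc.start()
before = tracemalloc.take_snapshot()
data = [bytes(1024) for _ in range(100)]   # retained allocation
after = tracemalloc.take_snapshot()
top, net = summarize_diff(after.compare_to(before, "lineno"))
print(f"top single-site size diff: {top}; total net size diff: {net}")
tracemalloc.stop()
```

When one site dominates and other sites shrink, the two numbers can differ substantially, which is why labeling `stats[0].size_diff` as "net allocated" misleads.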

Docs-and-script-only. No baseline timing regeneration needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 3d8c5ebe72bf4749ba1d5a60db0b6b9c6b1c6b87


Overall Assessment
Needs changes

One unmitigated P1 remains. The prior P1 blockers from the last re-review are fixed in bench_campaign_staggered.py:93 and bench_reversible_dcdh.py:85, but there is a newly identified methodology drift in the brand-awareness benchmark that means the committed baselines do not fully measure the survey workflow described in the scenario spec.

Executive Summary

Methodology

Code Quality

  • No new findings. I did not find new inline inference computations or partial NaN-guard patterns in the changed Python files.

Performance

Maintainability

  • No findings.

Tech Debt

  • No new findings. The remaining non-canonical HonestDiD surfaces discussed above are already documented in REGISTRY.md, so they are not blockers for this PR.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. In bench_brand_awareness_survey.py, define one analytical SurveyDesign(weights="weight", strata="stratum", psu="psu", fpc="fpc", nest=True) and reuse it in the TSL, multi-outcome, placebo, and MultiPeriodDiD/HonestDiD phases.
  2. Regenerate the affected brand-awareness baselines for both backends, then rerun benchmarks/speed_review/gen_findings_tables.py so the tables in docs/performance-plan.md reflect the corrected workflow.
  3. Re-read the brand-awareness narrative in docs/performance-plan.md and update any wording that shifts after the regenerated timings land.

CI re-review P1: `bench_brand_awareness_survey.py` declared the
analytical TSL path with `SurveyDesign(weights, strata, psu, fpc, nest)`
only in phase 2; phases 4 (multi-outcome), 6 (placebo), and 7 (event
study + HonestDiD) built their own SurveyDesigns without `fpc`. That
means a material share of the committed brand-awareness baselines
timed a different variance path than the scenario doc declares.

Fix:
- One analytical `sd_tsl` SurveyDesign (strata + PSU + FPC + nest=True)
  is now constructed once at the top of `make_phases` and reused across
  phases 2, 4, 6, and 7. Phase 3 (replicate weights, JK1) is a
  different variance surface and correctly keeps its own design.
- Regenerated baselines for both backends.
- Regenerated findings tables via gen_findings_tables.py.
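The construct-once, reuse-everywhere shape of the fix can be sketched with a hypothetical stand-in class — the real library's `SurveyDesign` is not used here; only the field names follow the call quoted in the review, and the phase names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for the library's SurveyDesign, used only to
# show the "construct once, reuse everywhere" shape of the fix; the
# field names follow the call quoted in the review.
@dataclass(frozen=True)
class SurveyDesignSpec:
    weights: str
    strata: str
    psu: str
    fpc: Optional[str] = None
    nest: bool = False

def make_phase_designs():
    # One analytical design built once, so the phases sharing it cannot
    # silently drift onto different variance paths.
    sd_tsl = SurveyDesignSpec("weight", "stratum", "psu", fpc="fpc", nest=True)
    return {
        "tsl": sd_tsl,
        "multi_outcome": sd_tsl,
        "placebo": sd_tsl,
        "event_study_honest": sd_tsl,
        # JK1 replicate weights are a different variance surface and
        # deliberately keep their own design (not shown).
    }
```

Sharing one object (rather than re-typing the arguments per phase) is what made the original drift — three phases silently missing `fpc` — structurally impossible to reintroduce.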

Narrative refreshed against the new tables:

- Brand-aware medium: on Python JK1 now leads by ~2.2x (was 1.9x in
  the previous rerun); on Rust the multi-outcome loop and JK1 come in
  essentially tied. Medium is also where Python is slowest relative
  to Rust (~1.6x) - the full analytical TSL path with FPC exposes
  vectorization differences at that shape. Totals re-converge at
  large scale.
- Reversible dCDH: ~48-52% split under both backends (previously the
  Python heterogeneity refit edged out the main fit slightly).
- Scaling finding #5 retuned: Rust-only uplift is still the SDiD
  story; brand-aware medium now surfaces as a secondary, modest
  ~1.6x case rather than "within noise."

Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 539c7d78bddcedf7c51fec4ea89aa3a15b618319


Overall Assessment

✅ Looks good

The prior P1 blockers from the last review are resolved, and I did not find any unmitigated P0/P1 issues in the changed code. I found two low-severity P3 drifts: one benchmark phase is still slightly less faithful to the cited brand-awareness workflow than its label suggests, and parts of the new findings prose no longer exactly match the committed JSON baselines.

Executive Summary

Methodology

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No new findings. The non-canonical HonestDiD behaviors above are already documented in the Methodology Registry, so they are not untracked debt for this PR. REGISTRY.md:L2453-L2455 REGISTRY.md:L632-L632

Security

No findings.

Documentation/Tests

CI re-review P3:

- bench_brand_awareness_survey.py "naive" phase was using
  cluster="psu", which is already a partial sampling-design correction
  - the SE-inflation comparison is more faithful to Tutorial 17 when
  the first phase is genuinely untreated-for-design. Removed the
  cluster argument.
- Corrected narrative overreach in performance-plan.md:
    leading phase (~1.7x ImputationDiD there), not "slightly edges
    out". Reworded to say ImputationDiD / SunAbraham are the top two
    at every scale but their order is not stable across backend and
    scale.
  - Reversible dCDH: split is not "~evenly under both backends" -
    Python is closer to 58/41 with the main fit leading, Rust is
    51/49 with the heterogeneity refit leading. Reworded to reflect
    the split per backend.

Regenerated the affected brand-awareness and campaign-staggered
baselines (the naive-fit change slightly reduces brand-awareness
chain totals and shifts phase-percentage shares). Tables in
performance-plan.md re-derived via gen_findings_tables.py.

Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 4bf991c6568a53db5612f6a78f71d2a88ea93a4e


Overall Assessment

✅ Looks good

The prior re-review methodology-fidelity issues are resolved, and I did not find any unmitigated P0/P1 issues in the changed code. The remaining issues are P3-only documentation/prose drift.

Executive Summary

Methodology

Code Quality

No findings.

Performance

Maintainability

No findings.

Tech Debt

No new blocker-level debt. TODO.md does not currently track the minor prose/methodology-description drifts above, but they do not rise above P3.

Security

No findings.

Documentation/Tests

No findings beyond the documentation drift already called out above.

CI re-review P3 items, all documentation-only:

- Scenario 3 operation chain: said "analytical TSL via strata + PSU",
  but aggregate_survey()'s returned second-stage design is pweight
  with geographic PSU clustering and no stage-2 strata. Reworded to
  match the actual second-stage design surface being benchmarked.
- ImputationDiD "consistently dominant" claim in scaling finding #2
  and hotspot table row #2: at Rust medium SunAbraham clearly leads
  (0.353s vs 0.214s). Both claims narrowed to "Python all scales +
  Rust small/large" with the Rust-medium SunAbraham exception called
  out explicitly; the "together ~70-80% of the chain" framing
  preserves the optimization recommendation.
- SDiD narrative said sensitivity_to_zeta_omega and in_time_placebo
  are the two largest at every scale/backend, but at Rust small
  bootstrap_variance slightly edges both (at sub-50ms totals, per-
  phase fixed overhead dominates ranking). Qualified to Python all
  scales + Rust medium/large.

Docs-only. No script or baseline changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a0aafc5f85d4280fc0fa30daf88f38acd21431d6


Overall Assessment

⚠️ Needs changes

One unmitigated P1 remains: scenario 6’s implemented treatment timing does not match its documented methodology, and the binarized comparison phase is keyed to a different treatment start than the ContinuousDiD fits.

Executive Summary

Methodology

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The P1 above is not tracked in TODO.md, so it is not mitigated by existing tech-debt tracking. TODO.md:L51-L80

Security

No findings.

Documentation/Tests

No additional findings beyond the scenario-6 methodology mismatch. The earlier documentation-drift issues from the previous re-review appear fixed.

Path to Approval

  1. In benchmarks/speed_review/bench_dose_response.py, make the DGP match the documented scenario by passing cohort_periods=[3], or intentionally revise the docs/comments to period 2 if that is the real target.
  2. Derive the binarized comparison’s post indicator from the same treatment start used by the CDiD fits, and add a simple assertion that only one positive first_treat value is present.
  3. Regenerate dose_response_python.json and dose_response_rust.json, then refresh any affected prose in docs/performance-plan.md.

…race shifts

CI re-review P1: bench_dose_response.py inherited the CDiD generator's
default cohort [2], not the documented period 3. The fallback that
would have set first_treat=3 never ran (generator already populates
first_treat), so the committed baselines measured a different cohort
onset than the scenario doc. The binarized DiD phase also hardcoded
post >= 3, which further desynced it from the actual CDiD treatment
start under the default DGP.

Fix:
- Pin the generator to cohort_periods=[3] so the DGP matches the docs.
- Assert exactly one positive first_treat after generation; future
  DGP changes that break the single-cohort contract will fail loudly
  instead of drifting silently.
- Binarized phase now derives its post cutoff from the actual
  first_treat in the data, not a hardcoded period number. No
  opportunity to desync from the CDiD fits above.
- Regenerated dose-response baselines for both backends.
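The single-cohort guard and derived cutoff can be sketched in a few lines of plain Python. Names here are hypothetical; the sketch assumes the `first_treat` convention where 0 encodes never-treated, matching the "exactly one positive first_treat" contract above.

```python
# Plain-Python sketch of the guard (function name hypothetical;
# assumes first_treat == 0 encodes never-treated units).

def derive_post_cutoff(first_treat_values):
    """Assert the single-cohort contract, then derive the binarized
    phase's post cutoff from the data instead of hardcoding it."""
    cohorts = sorted({v for v in first_treat_values if v > 0})
    assert len(cohorts) == 1, f"expected one treated cohort, got {cohorts}"
    return cohorts[0]

# One treated cohort starting at period 3, plus never-treated units:
cutoff = derive_post_cutoff([3, 3, 0, 0, 3])
post = [int(period >= cutoff) for period in [1, 2, 3, 4]]
```

Deriving `post` from the observed treatment start means a future DGP change either passes through consistently or trips the assertion — it cannot desync the binarized phase from the CDiD fits silently.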

Structural narrative hardening:

Prior CI rounds have repeatedly re-flagged the same drift pattern:
the staggered campaign and reversible dCDH narratives make phase-
order claims at close-race cells (staggered Rust medium, dCDH at
this shape) that can flip on rerun because the two contenders are
within a few percentage points of each other. The underlying ranking
is not the right level of abstraction for narrative; the phase-share
table is. This commit rewrites both narratives to describe the
aggregate share pattern and defer per-cell ordering to the
generator-produced table. Scaling finding #2 and hotspot table row
#2 get the same treatment. Net effect: narrative claims are now
robust to rerun noise at close-race cells.

Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 030d5f573a57a49444a14e8e441b81575c0749c2


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings remain in this re-review.

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings in the benchmark scripts or committed baselines themselves beyond the documentation drift noted below.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

…ht variance"

CI re-review P2 + P3, both docs/label only:

- docs/performance-plan.md had two remaining specific-magnitude
  claims about brand-awareness medium ("~1.6x under Python",
  "Python and Rust separate the most at medium", "~1.6x at worst on
  brand-awareness medium"). Those were true on one earlier rerun but
  the current committed baselines show medium at 0.56 / 0.55
  (essentially tied) and the widest non-SDiD gap is now ~1.1x at
  brand-large. Reworded per-scenario paragraph and scaling finding #5
  to describe the stable aggregate pattern and defer exact ratios to
  the scale-sweep table. Same treatment as the earlier staggered/dCDH
  pass: narrative stops claiming magnitudes that can shift on rerun;
  the generator-owned table carries the specifics.
- bench_brand_awareness_survey.py module docstring labeled JK1 as
  "replicate-weight bootstrap". Per REGISTRY.md, JK1 is replicate-
  weight variance (jackknife-style), not bootstrap inference - they
  are distinct methodology surfaces. Renamed to "replicate-weight
  variance (JK1 delete-one-PSU)" with an inline note pointing to the
  registry.

Docs + docstring only. No script behaviour change; no baseline
regeneration needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 307868c9367132d616d4e0eea92b77196d15001c


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings remain in this re-review.

Executive Summary

Methodology

Code Quality

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 19, 2026
@igerber igerber merged commit 3a1fb12 into main Apr 19, 2026
1 of 2 checks passed
@igerber igerber deleted the perf-review branch April 19, 2026 20:18
igerber added a commit that referenced this pull request Apr 20, 2026
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per
project SemVer convention, minor bumps are reserved for new estimators or new
module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults
(PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner-
  ready output layer. Plain-English narrative summaries across all 16 result types,
  with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear
  regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the
  upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A
  contract), heterogeneity + within-group-varying PSU under Binder TSL, and
  PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review - docs/performance-scenarios.md documents six
  realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339)
continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default
CI after false-positive flakes; perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml,
diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released:
2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the
comparison-link footer. CITATION.cff retains v3.1.3 versioned DOI in identifiers;
the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in
a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>