
Comprehensive documentation review and update#206

Merged
igerber merged 11 commits into main from documentation-review
Mar 18, 2026

Conversation

igerber (Owner) commented Mar 16, 2026

Summary

  • Fix incorrect parameter names (treated=/post= → treatment=/time=) across 12 documentation pages, including complete rewrites of the diagnostics and utils examples to match the actual function signatures
  • Add 3 new API documentation pages: TwoStageDiD (Gardner 2022), BaconDecomposition (Goodman-Bacon 2021), and built-in Datasets
  • Restructure API reference to single entry point via api/index, eliminating confusing duplicate sidebar navigation
  • Expand Choosing an Estimator page with 6 new estimators across all sections (flowchart, quick reference table, detailed guidance, SE methods table)
  • Add 9 new troubleshooting sections: Rust backend, TROP tuning, ContinuousDiD, Imputation/TwoStage data issues, Bacon panel requirements, deprecation warnings
  • Update front page features (5 → 7 bullets covering all 13+ estimators), comparison pages (fix inaccurate feature flags, add 7-9 new rows), and data preparation docs (add 6 missing generation functions)

Methodology references (required if estimator / math changes)

  • Method name(s): N/A - no methodology changes
  • Paper / source link(s): N/A
  • Any intentional deviations from the source (and why): None

Validation

  • Tests added/updated: No test changes (documentation only)
  • Backtest / simulation / notebook evidence (if applicable): Sphinx build succeeds with no new warnings from modified files
  • Verified all treated=/post= parameter name errors eliminated via grep
  • Cross-checked api/index.rst autosummary entries against __all__ in __init__.py
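The grep verification above can also be scripted so it is repeatable. A minimal, stdlib-only sketch (the helper name and regex are illustrative, not part of the repo; the word boundary keeps it from flagging the corrected treatment=/time= names):

```python
import re
from pathlib import Path

# Stale kwargs the PR eliminated; \b prevents matching treatment= or post_periods=.
STALE_KWARGS = re.compile(r"\b(?:treated|post)\s*=")

def find_stale_kwargs(doc_root):
    """Return (file, line number, line) for every stale kwarg left in the docs."""
    hits = []
    for rst in sorted(Path(doc_root).rglob("*.rst")):
        for lineno, line in enumerate(rst.read_text().splitlines(), start=1):
            if STALE_KWARGS.search(line):
                hits.append((str(rst), lineno, line.strip()))
    return hits
```

Running this over the docs tree and asserting an empty result would turn the one-off grep into a regression check.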

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Fix incorrect parameter names across 12 documentation pages (treated=/post=
should be treatment=/time=), including complete rewrites of diagnostics
examples to match actual function signatures.

Add 3 new API pages: TwoStageDiD (Gardner 2022), BaconDecomposition
(Goodman-Bacon 2021), and built-in Datasets (Card & Krueger, Castle
Doctrine, etc.).

Restructure API reference to single entry point via api/index, eliminating
confusing duplicate navigation. Add all missing estimators and functions
to autosummary index.

Expand Choosing an Estimator with 6 new estimators in flowchart, quick
reference table, detailed guidance sections, and SE methods table.

Add 9 new troubleshooting sections covering Rust backend, TROP tuning
failures, ContinuousDiD discrete dose warnings, Imputation/TwoStage data
issues, Bacon panel requirements, and deprecation warnings.

Update front page features (5 → 7 bullets, all 13+ estimators), comparison
pages (fix inaccurate feature flags, add 7-9 new feature rows), and
data preparation docs (add 6 missing generation functions).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

⚠️ Needs changes

Executive Summary

  • The blocking issue is methodology-doc drift in the new inference guidance for CallawaySantAnna: the PR now describes its default SEs incorrectly and points readers to a nonexistent bootstrap() method.
  • Several newly added examples will fail if copied verbatim because they use kwargs the implementation does not accept (cohort=, treatment= for BaconDecomposition.fit, treatment= for ImputationDiD.fit, and lambda_L_grid= for TROP).
  • The new TROP troubleshooting text also overstates the hard minimum pre-treatment window as 4 periods; the code and registry enforce 2.
  • I did not find implementation-level methodology regressions in the new Bacon/TwoStage narrative pages themselves; the main problems are in user-facing docs/examples.

Methodology

No other methodology defects stood out in the changed estimator overview pages; the documented TwoStageDiD deviations already recorded in the registry are not blockers.

Code Quality

No findings in the underlying implementation from this docs-only diff.

Performance

No findings from the changed files.

Maintainability

No standalone finding beyond the documentation drift already called out above.

Tech Debt

No new tracked-tech-debt issue was added in TODO.md.

Security

No findings.

Documentation/Tests

I did not run the docs build in this environment because project dependencies are missing here (numpy import fails), so this review is source-only.

Path to Approval

  1. Fix the CallawaySantAnna SE descriptions in docs/choosing_estimator.rst and docs/api/two_stage.rst to match the registry and implementation: analytical IF/WIF by default, optional multiplier bootstrap via n_bootstrap/bootstrap_weights.
  2. Correct the broken kwargs in the new examples: cohort → first_treat for CallawaySantAnna, treatment → first_treat for BaconDecomposition.fit, remove treatment= from ImputationDiD.fit, and replace lambda_L_grid with the intended TROP grid parameter.
  3. Reword the TROP troubleshooting minimum-window guidance so it matches the actual hard requirement (>=2) and, if desired, separately labels >=4 as a recommendation rather than a requirement.
  4. Add lightweight doc-example execution coverage for the touched snippets so future signature drift is caught automatically.

Fix 27 documentation bugs where code examples used wrong kwargs,
attributes, or calling conventions vs. the actual API:

- CallawaySantAnna: SE description (influence function, not simple
  difference), .overall_att not .att, aggregate via fit() not method
- BaconDecomposition/ImputationDiD: first_treat= not treatment=
- MultiPeriodDiD: reference_period is fit() param, not __init__()
- SyntheticDiD: remove nonexistent treatment_start param
- DifferenceInDifferences: cluster= in __init__(), not cluster_col= in fit()
- TROP: lambda_time_grid/lambda_unit_grid, min 2 periods not 4
- balance_panel: unit_column=/time_column= not unit=/time=
- make_treatment_indicator: treated_values= (plural), returns DataFrame
- create_event_time: time_column=/treatment_time_column=
- aggregate_to_cohorts: unit_column=/time_column=/treatment_column=
- generate_did_data: treatment_period= not treatment_start=
- PlaceboTestResults: .placebo_effect not .effect
- HonestDiD: method=/M= not delta=DeltaRM()
- DeltaRM: Mbar= not M_bar=
- datasets.rst: first_treat= not cohort=

Add tests/test_doc_snippets.py that extracts and executes all
.. code-block:: python snippets from 11 RST files, catching
TypeError/AttributeError to prevent kwarg regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
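The snippet extraction this commit describes can be sketched with a small stdlib-only helper. This is a simplification, not the repo's actual test code: the regex assumes a bare `.. code-block:: python` directive with no options and a fixed 4-space body indent, and the function name is hypothetical:

```python
import re
import textwrap

# Matches a `.. code-block:: python` directive followed by its indented body.
# Simplification: assumes no directive options and a 4-space indent.
BLOCK_RE = re.compile(
    r"^\.\. code-block:: python\n\n((?:(?:    .*)?\n)+)",
    re.MULTILINE,
)

def extract_snippets(rst_text):
    """Return each Python code block in an RST page as dedented source."""
    return [textwrap.dedent(body) for body in BLOCK_RE.findall(rst_text)]
```

Each extracted string can then be compiled and executed to catch kwarg drift like the 27 bugs listed above.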
igerber (Owner) commented Mar 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 5da7d86fdfc2408408307a53e08aa52656cdc04e


Overall Assessment

⚠️ Needs changes

Highest unmitigated finding: a P1 methodology mismatch in the new Bacon guidance.

Executive Summary

  • The prior blocking items on Callaway-Sant’Anna SE wording and the TROP minimum pre-treatment window appear resolved in this revision.
  • The remaining blocker is in the new Bacon guidance: the estimator-selection page now tells users to use BaconDecomposition to check for “negative weights,” which is not how the Goodman-Bacon decomposition is defined.
  • Two Bacon examples still pass a treatment-status column into first_treat instead of a cohort/timing column, so they are still not safe copy-paste examples.
  • The new tests/test_doc_snippets.py coverage is materially narrower than it looks: it skips the entire datasets page and ignores every exception except TypeError/AttributeError, so broken snippets can still pass CI.

Methodology

  • Severity: P1. Impact: The new Bacon selection guidance in docs/choosing_estimator.rst:L394-L412 says users should use BaconDecomposition to “check for negative weights from forbidden comparisons.” That is not source-faithful to Goodman-Bacon (2021): the decomposition expresses TWFE as a weighted average of 2x2 DiD comparisons, and the diagnostic issue is the weight placed on later-vs-earlier / already-treated-as-control comparisons plus heterogeneity across those 2x2 estimates, not negative Bacon weights. This PR does not add a corresponding REGISTRY.md note documenting that wording as an intentional deviation. Concrete fix: reword this to “check whether later-vs-earlier / already-treated-as-control comparisons carry substantial weight” and keep the discussion in terms of comparison types and their weight shares.

Code Quality

No findings.

Performance

No findings.

Maintainability

No separate findings beyond the documentation/test issues below.

Tech Debt

No separate finding, but the documentation-test blind spots below are not tracked in TODO.md.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: Two Bacon snippets still use the wrong input column for first_treat: docs/choosing_estimator.rst:L394-L412 uses first_treat='treated', and docs/troubleshooting.rst:L570-L578 uses first_treat='treatment'. first_treat needs the treatment-timing/cohort column, not a treatment-status indicator. Depending on the data, these examples will either error or drive the decomposition off the wrong variable. Concrete fix: change both to the actual cohort-timing column, e.g. first_treat='first_treat', and make sure the example data is explicitly staggered-adoption data.
  • Severity: P2. Impact: The new smoke test is too permissive to catch several real doc regressions. api/datasets.rst is listed in RST_FILES at tests/test_doc_snippets.py:L19-L31, but every code block on docs/api/datasets.rst:L25-L157 matches the blanket skip regexes in tests/test_doc_snippets.py:L58-L66. On top of that, the test swallows every exception except TypeError and AttributeError at tests/test_doc_snippets.py:L185-L197, so ValueError, NameError, SyntaxError, and other execution failures still pass CI. Concrete fix: fail on all unexpected exceptions, and replace the blanket dataset skips with fixtures/stubs or narrower per-block skips so at least one dataset snippet is actually exercised.

Source-only note: I did not run the new snippet test in this environment because shell execution is failing here.

Path to Approval

  1. Fix the Bacon methodology wording in docs/choosing_estimator.rst so it describes weight on comparison types, not “negative weights.”
  2. Correct the two remaining Bacon examples to pass the cohort/timing column to first_treat and use an explicitly staggered dataset in those snippets.
  3. Harden tests/test_doc_snippets.py so unexpected exceptions fail the test, and stop skipping the entire datasets page via broad regexes.
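Item 3's "fail on unexpected exceptions" policy amounts to executing each snippet with no blanket except clause at all. A minimal sketch, assuming snippets run in a plain dict namespace (run_snippet is a hypothetical name, not the repo's actual helper):

```python
def run_snippet(code, namespace=None):
    """Execute a documentation snippet and let EVERY exception propagate.

    No blanket except clause: a snippet that raises ValueError, NameError,
    or SyntaxError fails the test instead of silently passing CI.
    """
    ns = dict(namespace or {})
    exec(compile(code, "<doc-snippet>", "exec"), ns)
    return ns
```

Any per-block skip then has to be an explicit marker on that block, not a broad regex over whole pages.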

… test

- Fix Bacon wording: "negative weights" → accurate Goodman-Bacon (2021) language
- Fix first_treat column name in Bacon examples (2 files)
- Fix 15 doc snippet bugs: wrong reference_period, column names, return types,
  and estimator-data mismatches (SyntheticDiD, wild bootstrap, diagnostics)
- Harden test_doc_snippets: fail on all exceptions except NameError
- Add mock dataset loaders so dataset page snippets execute without network
- Add dose/exposure columns to test namespace for troubleshooting snippets

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber (Owner) commented Mar 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 264af125936439ef17d8ca5512f9cf91e517bac7


Overall Assessment

⚠️ Needs changes

Highest unmitigated finding: a P1 inference-documentation mismatch in docs/choosing_estimator.rst for CallawaySantAnna and SunAbraham.

Executive Summary

  • The prior Bacon blocker looks resolved: the guidance now talks about weight on later-vs-earlier / already-treated comparisons rather than “negative weights,” and the Bacon examples now pass first_treat='first_treat'.
  • The new snippet smoke test is materially better than the last revision: it now includes api/datasets.rst, fails most unexpected exceptions, and provides dataset mocks.
  • New blocker: the SE table now recommends bootstrap_weights='bayes' for CallawaySantAnna and describes SunAbraham bootstrap as pairs bootstrap, but the library’s published docs expose multiplier-style bootstrap weights (rademacher / mammen / webb) for both estimators.
  • The modified HonestDiD / PreTrends examples still use an older MultiPeriodDiD calling convention; the published MultiPeriodDiD docs put reference_period / post_periods on fit(), not reference_period on the constructor or treatment_start on fit().
  • Those stale pages are still outside the new snippet-test coverage, and some changed plotting snippets are blanket-xfailed, so these regressions can still merge without CI signal.

Execution note: shell access is failing in this environment, so this is a source-only re-review.

Methodology

  • Severity: P1. Impact: docs/choosing_estimator.rst:L493-L512 now says to use bootstrap_weights='bayes' for CallawaySantAnna and says SunAbraham uses pairs bootstrap inference. In the library’s published docs, both estimators expose multiplier-style bootstrap weights rademacher, mammen, and webb; there is no corresponding REGISTRY.md note in this PR diff documenting a deliberate deviation. This is a methodology-sensitive inference mismatch in the estimator-selection guide. Concrete fix: rewrite those two rows to describe multiplier bootstrap consistently and use the documented weight names, or avoid naming a specific weight if the goal is only to point readers to n_bootstrap.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings beyond the documentation/test coverage issue below.

Tech Debt

  • No separate findings. The remaining doc-test blind spots are concrete enough to fix in this PR rather than defer.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: docs/api/honest_did.rst:L46-L53, docs/api/honest_did.rst:L144-L151, docs/api/pretrends.rst:L47-L54, and docs/api/pretrends.rst:L129-L136 still show MultiPeriodDiD(reference_period=-1) with fit(..., treatment_start=5). The published MultiPeriodDiD docs instead show reference_period and post_periods on fit(), and the published HonestDiD examples use that newer pattern. Concrete fix: update these snippets to the current MultiPeriodDiD().fit(..., post_periods=[...], reference_period=...) style and align the surrounding text accordingly.
  • Severity: P2. Impact: tests/test_doc_snippets.py:L22-L33 omits both api/honest_did.rst and api/pretrends.rst, even though this PR modifies them, and tests/test_doc_snippets.py:L101-L109 unconditionally xfails changed plotting snippets such as choosing_estimator:block2 and api_visualization:block0. That means the stale examples above, and any API break inside those xfailed blocks, can still land without test signal. Concrete fix: add the modified files to RST_FILES, and replace unconditional plotting xfails with conditional skips only when plotting deps are unavailable, or split setup/API calls from the actual display call so non-plotting regressions still fail.

Path to Approval

  1. Fix the CallawaySantAnna and SunAbraham rows in docs/choosing_estimator.rst:L493-L512 so the bootstrap/SE descriptions match the documented estimator interfaces.
  2. Update the modified HonestDiD and PreTrends examples to the current MultiPeriodDiD calling convention in docs/api/honest_did.rst:L46-L53, docs/api/honest_did.rst:L144-L151, docs/api/pretrends.rst:L47-L54, and docs/api/pretrends.rst:L129-L136.
  3. Extend tests/test_doc_snippets.py:L22-L33 to cover those modified pages, and stop blanket-xfailing changed plotting snippets in tests/test_doc_snippets.py:L101-L109 so API regressions inside them are actually exercised.

…coverage

- Fix CallawaySantAnna SE table: replace invalid 'bayes' bootstrap weight
  with list of valid types (rademacher, mammen, webb)
- Fix HonestDiD examples: correct constructor (method/M instead of
  delta=DeltaRM), fix result attributes (ci_lb/ci_ub not robust_ci),
  fix plot return types (ax not fig)
- Fix PreTrends examples: correct MultiPeriodDiD calling convention
  (post_periods/reference_period on fit()), fix compute_mdv and
  plot_pretrends_power kwargs
- Fix visualization example: plot_event_study returns Axes not Figure
- Add honest_did.rst and pretrends.rst to doc snippet test coverage
- Add matplotlib>=3.5 to dev deps; remove _KNOWN_FAILURES xfail set

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber (Owner) commented Mar 17, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: cb86289575f75e723845bb0d2c819381e500b910


Overall Assessment

⚠️ Needs changes

Highest unmitigated finding: a P1 methodology-sensitive inference mismatch in docs/api/two_stage.rst:L162-L165, where the new comparison table labels CallawaySantAnna bootstrap as “Wild bootstrap” even though the Methodology Registry and estimator docs define it as multiplier bootstrap.

Execution note: this was a source-only re-review. I could not run the doc tests locally because pytest is not installed in this environment, and importing diff_diff fails here because numpy is unavailable.

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No separate findings. The issues below are not tracked in TODO.md and are concrete changed-doc regressions rather than deferrable follow-ups.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Update the CallawaySantAnna bootstrap entry in docs/api/two_stage.rst:L162-L165 so it describes multiplier bootstrap, consistent with docs/methodology/REGISTRY.md:L311-L319 and diff_diff/staggered.py:L126-L142. Once that P1 is fixed, the remaining items are P2 documentation/test follow-ups and the PR can move to ✅.

…overage

- Fix CallawaySantAnna bootstrap label: "Wild bootstrap" → "Multiplier bootstrap (IF/WIF)"
- Fix cluster_col → cluster on constructor in troubleshooting and choosing_estimator docs
- Fix plot_honest_event_study to single-arg call, plot_bacon return type (ax not fig)
- Fix dataset docstrings: cohort= → first_treat=
- Fix HonestDiD/SyntheticDiD API calls in comparison pages
- Add :: shorthand block extractor to test_doc_snippets.py
- Add python_comparison.rst and r_comparison.rst to tested pages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber (Owner) commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 2cab8b5fe94c6d2bda27c443a90915a0414bb6fc


Overall Assessment

✅ Looks good

Highest unmitigated finding: P2 documentation/test validation gap in tests/test_doc_snippets.py:150-154 and tests/test_doc_snippets.py:333-344; the new smoke test still lets several edited broken snippets through.

Execution note: this was a source-only re-review. I could not run tests/test_doc_snippets.py here because pytest is unavailable in this environment, and importing diff_diff fails here because numpy is not installed.

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity: P2. Impact: _build_namespace() overwrites sys.modules["diff_diff.datasets"] and diff_diff.datasets with a mock module in tests/test_doc_snippets.py:297-308 and never restores either one. That makes later tests order-dependent and can hide real dataset-module regressions by leaving the mock installed for the rest of the pytest session. Concrete fix: scope this replacement with monkeypatch or a fixture that restores the original module after each test, or avoid sys.modules mutation entirely and inject mocked loader callables only into the snippet exec namespace.
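The suggested scoped replacement can be done with a small context manager that always restores sys.modules, whether or not the snippet body raises. A stdlib-only sketch (swap_module is a hypothetical name, standing in for the monkeypatch/fixture approach the finding recommends):

```python
import contextlib
import sys

@contextlib.contextmanager
def swap_module(name, replacement):
    """Temporarily install `replacement` as sys.modules[name], then restore it.

    The finally block runs even if the body raises, so the mock can never
    leak into later tests the way an unscoped sys.modules assignment would.
    """
    missing = object()
    original = sys.modules.get(name, missing)
    sys.modules[name] = replacement
    try:
        yield replacement
    finally:
        if original is missing:
            sys.modules.pop(name, None)
        else:
            sys.modules[name] = original
```

pytest's monkeypatch.setitem(sys.modules, name, mock) achieves the same per-test scoping without a hand-rolled helper.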

Tech Debt

  • No separate findings. TODO.md does not currently track the remaining doc-snippet validation gaps in TODO.md:38-64, so they are unmitigated P2 items rather than accepted deferred work.

Security

  • No findings.

Documentation/Tests

- Add autouse fixture to restore diff_diff.datasets after each doc
  snippet test, preventing mock module from leaking into other tests
- Add missing DifferenceInDifferences import in choosing_estimator.rst
  clustering example
- Track context-dependent snippet validation gap in TODO.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber (Owner) commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a755e8b45f38e90406c8f985b92f746ad6ae51f9


Overall Assessment

⚠️ Needs changes

Highest unmitigated finding: P1 undocumented methodology/default-behavior mismatch for SyntheticDiD in docs/choosing_estimator.rst:L497-L499. The row says bootstrap is the default SE path and recommends n_bootstrap=0 for placebo inference, but the in-code docstring/constructor say variance_method="placebo" by default and require n_bootstrap >= 2 (diff_diff/synthetic_did.py:L50-L59, diff_diff/synthetic_did.py:L136-L178), with the fit path using placebo unless variance_method == "bootstrap" (diff_diff/synthetic_did.py:L399-L421). The Methodology Registry also records the default placebo variance (docs/methodology/REGISTRY.md:L1078-L1097, docs/methodology/REGISTRY.md:L1769-L1778).

Execution note: this was a source-only re-review. I could not run tests/test_doc_snippets.py here because importing diff_diff fails in this sandbox without numpy.

Executive Summary

Methodology

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings. The prior sys.modules["diff_diff.datasets"] leakage concern appears resolved by the restore fixture at tests/test_doc_snippets.py:L324-L337.

Tech Debt

  • Severity: P3. Impact: the remaining limitation that context-dependent doc snippets can pass via blanket NameError is now explicitly tracked in TODO.md:L65, so it is mitigated for approval purposes. The behavior remains in tests/test_doc_snippets.py:L352-L356. Concrete fix: no action required for approval; when revisited, execute snippets in a minimal namespace and fail on NameError unless a block is explicitly marked partial/context-dependent.

Security

No findings.

Documentation/Tests

Path to Approval

  1. Correct the SyntheticDiD standard-error row in docs/choosing_estimator.rst:L497-L499 so it matches the implementation and Methodology Registry: default placebo variance, bootstrap via variance_method="bootstrap", and no n_bootstrap=0 guidance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber (Owner) commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 1fc290e7409047b87af0d943472780e410a1849b


Overall Assessment

⚠️ Needs changes

Highest unmitigated finding: P1 methodology/documentation mismatch in the quickstart MultiPeriodDiD example.

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3. Impact: the new doc-snippet smoke test still is not true standalone validation because it relies on a synthetic convenience namespace in tests/test_doc_snippets.py:L161-L191 and tolerates context-dependent NameError in tests/test_doc_snippets.py:L349-L356. This limitation is explicitly tracked in TODO.md:L65, so it is mitigated. Concrete fix: no action required for approval; later, move toward page-specific minimal fixtures and explicit allowlists for intentionally partial snippets.
  • The earlier diff_diff.datasets mock-leak concern appears fixed by the restore fixture in tests/test_doc_snippets.py:L324-L337.

Security

  • No findings.

Documentation/Tests

  • No additional unmitigated findings beyond the quickstart methodology issue above and the tracked doc-snippet test limitation.
  • Review limitation: I could not execute the new smoke test in this environment because pytest is unavailable, and direct diff_diff import fails here without numpy.

Path to Approval

  1. Fix the event-study quickstart example in docs/quickstart.rst:L128-L136 so its post_periods matches the page’s synthetic data definition in docs/quickstart.rst:L33-L40; the direct repair is to restore [5, 6, 7, 8, 9] or regenerate the example data with only three post-treatment periods.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber (Owner) commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 0ef224b0c73fd95a830dc7479517b9adb923f446


Overall Assessment

⛔ Blocker

Highest unmitigated finding: P0 control-group corruption in the new “always-treated units” troubleshooting snippet.

Executive Summary

  • The previous quickstart P1 is fixed: the MultiPeriodDiD example now uses the full post-treatment window, consistent with the page’s synthetic data setup (docs/quickstart.rst:L33, docs/quickstart.rst:L128).
  • No estimator implementation or registry entries changed in the merge result, so this re-review focused on whether the modified docs match the current implementations and docs/methodology/REGISTRY.md.
  • P0: the new troubleshooting code for “Units treated in all observed periods” drops never-treated units under the library’s standard first_treat=0 convention, corrupting the comparison group.
  • P1: the updated synthdid translation still does not match SyntheticDiD’s documented interface; it omits the required ever-treated conversion and treats T0 as three literal post-period labels.
  • P1: several changed CallawaySantAnna examples still use stale aggregation/bootstrap patterns or read aggregation outputs without requesting aggregate=....
  • I did not find new P1+ issues outside the modified docs/test harness. This was a source-only review; I could not run the new snippet tests here because pytest and numpy are unavailable in the sandbox.

Methodology

  • Severity: P0. Impact: the new “Units treated in all observed periods” snippet in docs/troubleshooting.rst:L513 identifies always-treated units with lambda g: (g['period'] >= g['first_treat']).all(). Under the library’s documented never-treated encoding (docs/methodology/REGISTRY.md:L817, docs/methodology/REGISTRY.md:L1010, diff_diff/staggered.py:L186), every unit with first_treat == 0 satisfies that predicate, so the snippet drops never-treated controls instead of only always-treated units. The estimators themselves exclude never-treated before checking first_treat <= min_time (diff_diff/imputation.py:L254, diff_diff/two_stage.py:L241). Copying the docs silently changes the control group and can bias or invalidate ImputationDiD / TwoStageDiD results. Concrete fix: compute unit-level first_treat, keep never-treated units, and only drop units with 0 < first_treat <= data['period'].min().
  • Severity: P1. Impact: the updated synthdid translation in docs/r_comparison.rst:L191 still does not match this implementation. SyntheticDiD.fit() requires a time-invariant ever-treated indicator (diff_diff/synthetic_did.py:L204, docs/methodology/REGISTRY.md:L1022, docs/methodology/REGISTRY.md:L1120) and treats every period omitted from post_periods as pre-treatment (diff_diff/synthetic_did.py:L255). As written, the example neither shows the required ever-treated conversion nor supplies the full post-treatment set, so a copied example either raises on within-unit treatment variation or silently misclassifies later post periods as pre. Concrete fix: add an ever_treated column, and derive post_periods from the sorted time axis using T0 as the split index rather than hard-coding [T0, T0+1, T0+2].
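The concrete fix for the P0 can be illustrated with a stdlib-only sketch of the corrected filter, using the first_treat == 0 never-treated convention the finding describes (the function name and list-of-dicts row layout are illustrative; the real docs operate on a pandas DataFrame):

```python
def drop_always_treated(rows):
    """Drop only truly always-treated units from a long panel.

    `rows` is a list of dicts with 'unit', 'period', and 'first_treat',
    where first_treat == 0 encodes never-treated. Never-treated units are
    kept as controls; a unit is dropped only when 0 < first_treat <= the
    first observed period, i.e. it is treated in every observed period.
    """
    min_period = min(r["period"] for r in rows)
    first_treat = {r["unit"]: r["first_treat"] for r in rows}
    always_treated = {
        unit for unit, ft in first_treat.items() if 0 < ft <= min_period
    }
    return [r for r in rows if r["unit"] not in always_treated]
```

The broken snippet's predicate, (period >= first_treat).all(), is also true for every first_treat == 0 unit, which is exactly how the never-treated controls were being discarded.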

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity: P3. Impact: the new snippet smoke-test harness still treats context-dependent NameError as pass-through in tests/test_doc_snippets.py:L352, and that limitation is explicitly tracked in TODO.md:L65. This is mitigated under the review rules, but it also means partial snippets such as cs.fit(data, ...) are not actually exercised. Concrete fix: no approval blocker while tracked; later, make the high-traffic troubleshooting/comparison snippets standalone or give them dedicated fixtures.

Security

No findings.

Documentation/Tests

  • Severity: P1. Impact: the changed CallawaySantAnna examples still mix the old post-fit API with the current results contract. In docs/troubleshooting.rst:L208, the page calls results.bootstrap(), results.aggregate('event_time'), and results.att; in docs/python_comparison.rst:L404 and docs/r_comparison.rst:L118, it reads event_study_effects / group_effects off default fit() calls. The actual API requires fit-time aggregate= selection (diff_diff/staggered.py:L1049), only populates event_study_effects / group_effects when requested (diff_diff/staggered.py:L1237), and exposes overall_att, event_study_effects, group_effects, and bootstrap_results on the results object (diff_diff/staggered_results.py:L100). plot_event_study() already handles CallawaySantAnnaResults directly once event_study_effects exists (diff_diff/visualization.py:L444). Concrete fix: update these snippets to fit with aggregate='event_study' or 'all', use results.overall_att, results.event_study_effects, results.group_effects, and results.bootstrap_results, and pass results directly to plot_event_study.
  • Review limitation: I could not execute tests/test_doc_snippets.py here because pytest is not installed and import diff_diff fails without numpy in this sandbox.

Path to Approval

  1. Fix the “Units treated in all observed periods” troubleshooting snippet so it mirrors the estimator logic: identify unit-level first_treat, keep never-treated units, and only drop units with first_treat > 0 and first_treat <= min(period).
  2. Rewrite the synthdid translation in docs/r_comparison.rst to show the required ever-treated indicator and derive the full post-treatment set from the sorted time axis using T0 as a split index.
  3. Update the changed CallawaySantAnna examples in docs/troubleshooting.rst, docs/python_comparison.rst, and docs/r_comparison.rst to request aggregation at fit time and use the current results interface (overall_att, event_study_effects, group_effects, bootstrap_results).

…ation

- Fix always-treated snippet that silently dropped never-treated controls
  by matching actual estimator logic (first_treat > 0 && <= min_period)
- Replace stale results.bootstrap()/aggregate()/att with correct CS API
  (n_bootstrap= at constructor, aggregate= at fit time, overall_att)
- Fix SyntheticDiD R comparison to derive ever-treated indicator and
  post_periods from data instead of passing time-varying treatment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber (Owner) commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 70e09bd9a924e7e350674c54cd60e203dd11591d


Overall Assessment

✅ Looks good

No unmitigated P0 or P1 findings remain in the changed PR scope.

Executive Summary

Methodology

No unmitigated P0/P1 methodology issues found in the changed docs.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3. Impact: the prior NameError limitation in the new snippet harness is now properly tracked in TODO.md:L60-L65, so that specific gap is mitigated under the project rules. Concrete fix: none required for approval.

Security

No findings.

Documentation/Tests

  • Severity P2. Impact: test_doc_snippet() now treats both NameError and ImportError as pass-through, which means the new smoke test no longer guarantees that changed snippets can successfully import the APIs they document; this is especially relevant for pages whose main purpose is API translation or plotting. Concrete fix: keep the existing NameError deferral if desired, but narrow ImportError suppression to explicit third-party comparison snippets or skip patterns, and let in-package/stdlib import failures fail the test. Update the header comment in tests/test_doc_snippets.py:L1-L6 to match the actual policy.
  • Severity P3. Impact: the visualization examples still use fig = ... for functions that return Axes, so the rewritten docs are not fully aligned with the current plotting API. Concrete fix: change those examples to ax = plot_group_effects(...) / ax = plot_sensitivity(...), and when demonstrating save behavior, use ax.figure.savefig(...). See docs/api/visualization.rst:L43-L50, docs/api/visualization.rst:L64-L72.
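The narrowed ImportError handling suggested in the first finding could be sketched roughly as follows. All names here (the allowlist, `run_snippet`) are illustrative assumptions, not the actual `tests/test_doc_snippets.py` code:

```python
# Illustrative sketch only: suppress ImportError solely for known
# third-party comparison packages, so a broken in-package or stdlib
# import still fails the test. Names are hypothetical.

THIRD_PARTY_OK = {"pyfixest", "linearmodels", "differences"}

def run_snippet(code: str, third_party_ok=THIRD_PARTY_OK) -> str:
    """Execute a doc snippet; return 'ok' or 'skipped', else re-raise."""
    try:
        exec(compile(code, "<snippet>", "exec"), {})
    except ImportError as exc:
        # exc.name is set for genuine module-not-found failures.
        root = (exc.name or "").split(".")[0]
        if root in third_party_ok:
            return "skipped"  # optional comparison dependency missing
        raise                 # in-package/stdlib failure should fail CI
    except NameError:
        return "skipped"      # snippet relies on prior doc context
    return "ok"
```

Under this policy, a snippet that imports a missing comparison package is skipped, while a typo in the package's own import path surfaces as a hard failure.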

Review limitation: I could not execute tests/test_doc_snippets.py in this sandbox because numpy, pytest, and matplotlib are not installed here.

…ypes

Restrict test harness ImportError handling to known third-party modules
(pyfixest, linearmodels, differences) so broken diff_diff imports fail CI.
Fix REGISTRY.md Bacon never-treated assumption to match implementation.
Fix fig= to ax= across visualization docs (plot_* returns Axes, not Figure).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 21fd61ad8bc2990faa9ab9daebcad9f983cb3f46


Overall Assessment

✅ Looks good

Executive Summary

Methodology

No findings. I cross-checked the changed methodology-facing docs against the Methodology Registry and in-code references; the PR does not change estimator math, weighting, SEs, assumptions, or defaults in code.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3. Impact: test_doc_snippet() still treats any NameError as non-failing, so snippets that rely on undeclared prior context are not yet standalone-validated. This is already tracked in-project and therefore mitigated. Concrete fix: none required for approval; later, replace blanket NameError suppression with explicit per-snippet fixtures or skip markers. Refs: tests/test_doc_snippets.py:L350-L362, TODO.md:L58-L65.

Security

No findings.

Documentation/Tests

  • Severity P2. Impact: the new note in docs/r_comparison.rst says Triple Difference “requires manual implementation in R” and gives Stacked DiD “manual implementation or the stackedev package,” but the repository’s own methodology sources document TripleDifference against triplediff::ddd() and StackedDiD against stacked-did-weights. That leaves the public migration note internally inconsistent with the registry and code comments. Concrete fix: align the note in docs/r_comparison.rst with the mappings already documented in docs/methodology/REGISTRY.md, and for DDD specifically match the in-code triplediff::ddd() reference. Refs: docs/r_comparison.rst:L372-L381, docs/methodology/REGISTRY.md:L1798-L1812, docs/methodology/REGISTRY.md:L1213-L1214, diff_diff/triple_diff.py:L16-L17.
  • Severity P3. Impact: the new smoke-test file list omits api/power.rst, so the PR’s updated plot_power_curve() example is not exercised even though it changed in this PR. Concrete fix: add api/power.rst to RST_FILES in tests/test_doc_snippets.py so the changed snippet is covered. Refs: tests/test_doc_snippets.py:L24-L40, docs/api/power.rst:L122-L161.

Review limitation: I could not execute pytest or import the package in this sandbox because pytest and numpy are not installed here, so this re-review is based on static inspection of the merge diff and surrounding source.

The visualization module raises ImportError with a message (not exc.name)
when matplotlib is missing. Also match against the error message string
so optional-dependency guards are correctly suppressed in Pure Python CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
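The guard described in the commit above could look roughly like this sketch (the function name and the exact message text are assumptions for illustration, not the project's actual code):

```python
# Illustrative guard: recognize an optional-dependency ImportError
# either via exc.name (a genuine import failure) or via the message
# text (a hand-raised "matplotlib is required ..." guard, which does
# not set exc.name).

def is_optional_dep_error(exc, optional=("matplotlib",)):
    root = (getattr(exc, "name", None) or "").split(".")[0]
    if root in optional:
        return True
    msg = str(exc).lower()
    return any(dep in msg for dep in optional)
```

Matching on the message string is what lets a hand-raised guard inside the package be treated the same as a missing-module error in a Pure Python CI job.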
@igerber igerber merged commit 405de88 into main Mar 18, 2026
10 checks passed
@igerber igerber deleted the documentation-review branch March 18, 2026 17:59
