Wave 2: PanelProfile outcome/dose shape extensions + autonomous-guide worked examples by igerber · Pull Request #366 · igerber/diff-diff

igerber · 2026-04-24T23:27:10Z

Summary

Extend PanelProfile with outcome_shape: Optional[OutcomeShape] (numeric outcomes) and treatment_dose: Optional[TreatmentDoseShape] (continuous treatments). New fields populate distributional facts that gate WooldridgeDiD (QMLE) and ContinuousDiD prerequisites pre-fit.
Add three end-to-end worked examples to llms-autonomous.txt (binary-staggered with never-treated, continuous-dose with zero baseline, count-shaped outcome). New §5 between the existing §4 reasoning and §5 post-fit validation; existing §5–§8 renumbered to §6–§9.
Top-level exports: OutcomeShape, TreatmentDoseShape from diff_diff. Both new dataclasses are frozen and JSON-serializable via PanelProfile.to_dict().

Methodology references (required if estimator / math changes)

Method name(s): profile_panel is descriptive (not an estimator); the worked examples reference CallawaySantAnna, WooldridgeDiD, ContinuousDiD. No estimator math is changed in this PR.
Paper / source link(s): no new methodology claims beyond what is already in docs/methodology/REGISTRY.md.
Any intentional deviations from the source: None. The is_time_invariant field uses exact distinct-count on observed non-zero doses (matching the documented contract and ContinuousDiD.fit()'s exact equality check). The is_count_like and is_integer_valued fields are explicitly labeled heuristics in the docstring + guide.

Validation

Tests added/updated: 17 new unit tests in tests/test_profile_panel.py covering each shape heuristic (count-like Poisson, binary-as-not-count-like, continuous normal, bounded unit, categorical returning None, skewness/kurtosis gating, JSON roundtrip, time-invariant dose, time-varying dose, no-zero-dose, binary/categorical treatment returning None, frozen invariants on both new dataclasses, sub-1e-8 dose-precision regression). 2 new content-stability tests in tests/test_guides.py guard the §5 worked-examples block and the new field references.
Backtest / simulation / notebook evidence: N/A (descriptive utility + documentation).

Security / privacy

Confirm no secrets/PII in this PR: Yes

github-actions · 2026-04-24T23:32:59Z

Overall assessment

⚠️ Needs changes — one unmitigated P1 methodology/contract mismatch in the new ContinuousDiD pre-fit gating surface.

Executive summary

P1: The new TreatmentDoseShape fields are documented as the ContinuousDiD prerequisites, but they do not match what ContinuousDiD.fit() actually enforces. That can mark ineligible panels as eligible pre-fit.
The mismatch is internal as well as methodological: the existing autonomous guide still says ContinuousDiD requires has_never_treated and treatment_varies_within_unit == False, while the new section shifts the gate to row-level has_zero_dose and nonzero-only is_time_invariant.
P2: The new worked example uses WooldridgeDiD(family="poisson"), but the public API is WooldridgeDiD(method="poisson"). The new guide test hard-codes the wrong call.
The new tests do not cover the two cases that would have exposed the P1: 0,0,d,d within-unit dose paths, and row-level zeros without any never-treated units.
I did not find a mitigating entry for the P1 in TODO.md. This was a static source review; I could not execute the code in this environment.

Methodology

Severity: P1
Impact: ContinuousDiD is still implemented as requiring unit-level never-treated controls (first_treat == 0) and full per-unit dose constancy, but the new profile surface defines has_zero_dose as “any zero row” and is_time_invariant on nonzero doses only, then states those new fields are the prerequisites. That is an undocumented mismatch between the shipped pre-fit guidance and the estimator contract. A panel with pre-period zeros and no never-treated units, or a 0,0,d,d dose path, can satisfy the new guide’s gate while ContinuousDiD.fit() still raises.
Concrete fix: Either align _compute_treatment_dose() to the actual estimator contract, or keep the fields descriptive and stop claiming they alone gate ContinuousDiD. In either case, keep has_never_treated and full-path time-invariance as the authoritative prerequisites unless you also change ContinuousDiD.fit() and document that deviation in REGISTRY.md.
Locations: diff_diff/profile.py:L656-L710, diff_diff/continuous_did.py:L222-L227, diff_diff/continuous_did.py:L348-L360, diff_diff/guides/llms-autonomous.txt:L115-L140, diff_diff/guides/llms-autonomous.txt:L198-L211, diff_diff/guides/llms-autonomous.txt:L690-L723, docs/methodology/REGISTRY.md:L719-L745

Code Quality

No findings in the changed Python implementation beyond the methodology contract issue above.

Performance

No findings.

Maintainability

No separate findings.

Tech Debt

No separate findings.
The P1 above is not mitigated by TODO.md; I did not find an existing tracked entry for it.

Security

No findings.

Documentation/Tests

Severity: P2
Impact: The new autonomous guide tells users/agents to call WooldridgeDiD(family="poisson"), but the estimator constructor accepts method, not family. Anyone following the example will get a runtime error, and the new guide test now locks that bad example in place.
Concrete fix: Change the guide and the stability test to WooldridgeDiD(method="poisson").
Locations: diff_diff/guides/llms-autonomous.txt:L767-L771, diff_diff/wooldridge.py:L273-L275, tests/test_guides.py:L62-L64
Severity: P2
Impact: The new tests only cover “constant dose in every row” and “changing nonzero dose,” so they miss the exact contradictory cases introduced by the new guidance: 0,0,d,d within-unit paths and zero rows without any never-treated units. That leaves the P1 above unguarded.
Concrete fix: Add regression tests asserting consistency between profile_panel guidance and ContinuousDiD.fit() for:
1. balanced panel with 0,0,d,d dose paths;
2. panel with row-level zeros but no never-treated units.
  Locations: tests/test_profile_panel.py:L1028-L1105

Path to Approval

Make the new TreatmentDoseShape contract consistent with ContinuousDiD.fit() and the Methodology Registry, or explicitly scope it as descriptive-only and restore the existing prerequisites (has_never_treated, full-path time invariance) in the guide.
Add regression tests for the two contradictory continuous-dose scenarios: 0,0,d,d paths and row-level zeros without never-treated controls.
Fix the autonomous guide and its stability test to use WooldridgeDiD(method="poisson").

…ptive-only; fix WooldridgeDiD method kwarg P1 (TreatmentDoseShape vs ContinuousDiD contract): - Reviewer correctly flagged that the new `is_time_invariant` field (per-unit non-zero distinct-count) does NOT match the actual `ContinuousDiD.fit()` gate at `continuous_did.py:222-228`, which uses `df.groupby(unit)[dose].nunique() > 1` over the FULL dose column (including pre-treatment zeros). My nonzero-only check silently classified `0,0,d,d` paths as time-invariant while ContinuousDiD would reject them. - Removed `is_time_invariant` field from `TreatmentDoseShape` entirely. The pre-existing `PanelProfile.treatment_varies_within_unit` field already encodes the correct ContinuousDiD prerequisite (matches the estimator's nunique check at line 224) and is correctly documented in §2 of the autonomous guide. Adding a second, narrower, mismatched gate was confusing - the reviewer's "scope as descriptive-only" path is the cleaner fix. - Reframed `TreatmentDoseShape` docstring + autonomous guide §2 field reference: explicitly NOT a ContinuousDiD prerequisite. `n_distinct_doses`, `has_zero_dose`, `dose_min/max/mean` provide descriptive distributional context; `has_never_treated` (unit-level) + `treatment_varies_within_unit == False` (full-path constancy) + `is_balanced` are the authoritative gates. - Rewrote §5.2 worked example reasoning chain to use the existing correct gates and added a counter-example showing `has_zero_dose=True` does NOT imply `has_never_treated=True` (the row-level vs unit-level distinction). - Added `test_treatment_dose_does_not_gate_continuous_did` covering the two contradictory cases the reviewer named: (1) `0,0,d,d` within-unit dose path, asserting `treatment_varies_within_unit=True` (the actual ContinuousDiD gate fires correctly); (2) row-level zeros without never-treated units, asserting `has_zero_dose=True` BUT `has_never_treated=False` (the two facts are distinct). - Removed `test_treatment_dose_continuous_time_varying_within_unit` and `test_treatment_dose_distinguishes_doses_at_high_precision` - both tested the dropped `is_time_invariant` field. P2 (WooldridgeDiD constructor kwarg): - The autonomous guide §5.3 worked example used `WooldridgeDiD(family="poisson")` but the actual constructor at `wooldridge.py:264` takes `method=`. Following the example would raise `TypeError: __init__() got an unexpected keyword argument 'family'`. Fixed in two places (the prose and the code snippet) and added a negative assertion in `test_guides.py` to prevent regression: `assert 'WooldridgeDiD(family="poisson")' not in text`. CHANGELOG updated to reflect the revised TreatmentDoseShape scope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-24T23:48:34Z

/ai-review

github-actions · 2026-04-24T23:53:11Z

🔁 AI review rerun (requested by @igerber)

Head SHA: cc91a7dad85dbb36ec713b2deedeae7d0938e3bc

Overall Assessment
✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

The prior re-review P1 is resolved: the new TreatmentDoseShape surface is now explicitly descriptive-only, and the guide/tests defer to the same ContinuousDiD gates the estimator and Methodology Registry enforce: has_never_treated, treatment_varies_within_unit, and is_balanced. diff_diff/profile.py:L64-L83, diff_diff/profile.py:L658-L697, diff_diff/continuous_did.py:L222-L360, docs/methodology/REGISTRY.md:L717-L745, diff_diff/guides/llms-autonomous.txt:L198-L220, diff_diff/guides/llms-autonomous.txt:L499-L507, diff_diff/guides/llms-autonomous.txt:L732-L757, tests/test_profile_panel.py:L1057-L1115
The prior re-review P2 is resolved: the worked example now uses WooldridgeDiD(method="poisson"), matching the public API and the Wooldridge documentation. diff_diff/guides/llms-autonomous.txt:L794-L805, diff_diff/wooldridge.py:L271-L276, docs/methodology/REGISTRY.md:L1333-L1349, tests/test_guides.py:L50-L71
I did not find estimator-math, weighting, variance/SE, identification-assumption, or default-behavior changes in the PR; the changed surface is descriptive profiling, exports, guide content, and tests. diff_diff/profile.py:L43-L200, diff_diff/__init__.py:L253-L259, diff_diff/guides/llms-autonomous.txt:L637-L811
P3: ROADMAP.md still overstates the new fields as if they themselves gate Wooldridge/ContinuousDiD routing, which conflicts with the corrected guide/docstrings. ROADMAP.md:L141-L141
P3: the new top-level exports lack a matching regression in the existing top-level import-surface test. diff_diff/__init__.py:L253-L259, diff_diff/__init__.py:L503-L508, tests/test_profile_panel.py:L404-L413
Static source review only: I could not execute tests here because this environment is missing pytest and even numpy.

Methodology

No findings. The previous ContinuousDiD contract mismatch is fixed and now matches both the estimator implementation and REGISTRY.md. diff_diff/profile.py:L64-L83, diff_diff/continuous_did.py:L222-L360, docs/methodology/REGISTRY.md:L717-L745
No findings. The previous Wooldridge guide/API mismatch is fixed. diff_diff/guides/llms-autonomous.txt:L794-L805, diff_diff/wooldridge.py:L271-L276

Code Quality

No findings.

Performance

No findings.

Maintainability

P3 — ROADMAP.md still says the new extensions expose facts that “gate WooldridgeDiD QMLE / ContinuousDiD prerequisites pre-fit,” which reintroduces the exact conceptual confusion the guide/docstrings just corrected; reword it to say the new fields add descriptive outcome/dose-shape context, while the authoritative ContinuousDiD gates remain has_never_treated, treatment_varies_within_unit, and is_balanced, and outcome_shape.is_count_like only informs Poisson-QMLE consideration. ROADMAP.md:L141-L141, compare diff_diff/profile.py:L64-L83, diff_diff/guides/llms-autonomous.txt:L198-L220, diff_diff/guides/llms-autonomous.txt:L499-L507

Tech Debt

No findings. I do not see a new correctness issue that should be deferred into TODO.md.

Security

No findings.

Documentation/Tests

P3 — The PR adds OutcomeShape and TreatmentDoseShape to the public top-level namespace, but test_top_level_import_surface() still only covers profile_panel, PanelProfile, and Alert; extend that test to assert diff_diff.OutcomeShape, diff_diff.TreatmentDoseShape, and their presence in diff_diff.__all__. diff_diff/__init__.py:L253-L259, diff_diff/__init__.py:L503-L508, tests/test_profile_panel.py:L404-L413
Review note: I could not run the added tests in this environment because pytest is unavailable and importing the package runtime also fails for lack of numpy.

…corrected scope; cover new exports in import-surface test P3 #1 (ROADMAP wording drift): ROADMAP.md still said the new fields "gate WooldridgeDiD QMLE / ContinuousDiD prerequisites pre-fit" and mentioned "time-invariance", which contradicted the round-1 corrections to TreatmentDoseShape's docstring + autonomous guide §2 + §5.2. Reworded to match: the new fields add descriptive distributional context only; `outcome_shape.is_count_like` informs (not gates) the WooldridgeDiD QMLE judgment, and the authoritative ContinuousDiD pre-fit gates remain `has_never_treated`, `treatment_varies_within_unit`, and `is_balanced`. "Time-invariance" wording removed (the field was dropped in round 1). P3 #2 (import-surface test coverage): `test_top_level_import_surface()` previously only verified `profile_panel`, `PanelProfile`, `Alert`. Extended to also cover the two new public exports `OutcomeShape` and `TreatmentDoseShape`, asserting both their importability and their presence in `diff_diff.__all__`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-24T23:56:26Z

/ai-review

github-actions · 2026-04-25T00:01:51Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 370915efd4056efc0cfe882a96041818a5d62239

Overall assessment

✅ Looks good — no unmitigated P0/P1 findings. The prior substantive re-review issues I checked are resolved.

Executive summary

The earlier ContinuousDiD contract concern is fixed: the new treatment_dose surface is now explicitly descriptive-only, and the guide points back to the same field-based gates enforced in ContinuousDiD.fit() and documented in the registry. diff_diff/profile.py:L64-L89, diff_diff/guides/llms-autonomous.txt:L198-L220, diff_diff/guides/llms-autonomous.txt:L499-L507, diff_diff/continuous_did.py:L222-L360, docs/methodology/REGISTRY.md:L717-L745
The earlier Wooldridge worked-example/API mismatch is fixed: the guide now uses WooldridgeDiD(method="poisson"), matching the public API and the Wooldridge registry entry. diff_diff/guides/llms-autonomous.txt:L794-L805, diff_diff/wooldridge.py:L271-L275, docs/methodology/REGISTRY.md:L1333-L1349
The earlier top-level export/test gap is fixed: OutcomeShape and TreatmentDoseShape are exported from diff_diff and covered by the import-surface regression. diff_diff/__init__.py:L253-L259, diff_diff/__init__.py:L503-L510, tests/test_profile_panel.py:L404-L422
P3: the new “authoritative ContinuousDiD pre-fit gates” wording now sounds exhaustive but omits the separate duplicate-cell hard stop already documented elsewhere; _precompute_structures() still resolves duplicate (unit, time) rows by last-row-wins overwrite. diff_diff/guides/llms-autonomous.txt:L330-L343, diff_diff/guides/llms-autonomous.txt:L499-L507, diff_diff/continuous_did.py:L818-L823, diff_diff/profile.py:L69-L75, CHANGELOG.md:L11-L11, ROADMAP.md:L141-L141
Static review only: I could not run tests here because this environment is missing pytest, numpy, and pandas.

Methodology

No P0/P1 findings. The changed guide/docstring text now matches the current ContinuousDiD and WooldridgeDiD contracts and does not introduce an undocumented estimator-methodology deviation.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings beyond the documentation item below.

Tech Debt

No findings. I did not see a related mitigation entry needed in TODO.md.

Security

No findings.

Documentation/Tests

P3 — Impact: the new ContinuousDiD prerequisite summary omits the already-documented duplicate-row hard gate, so a reader treating the new treatment_dose summary as exhaustive can still route duplicate-containing panels into a code path that silently overwrites earlier (unit, time) rows. Concrete fix: in the new TreatmentDoseShape docs and the related guide/ROADMAP/CHANGELOG wording, change “authoritative gates are ...” to “core field-based gates are ... plus no duplicate_unit_time_rows alert,” and add a regression in tests/test_guides.py that the ContinuousDiD prerequisite summary still mentions duplicates. diff_diff/guides/llms-autonomous.txt:L330-L343, diff_diff/guides/llms-autonomous.txt:L499-L507, diff_diff/continuous_did.py:L818-L823, diff_diff/profile.py:L69-L75, CHANGELOG.md:L11-L11, ROADMAP.md:L141-L141
Verification note: I could not execute the new tests in this environment.

…ontinuousDiD prerequisite summaries Reviewer correctly noted that the round-2 wording lists `has_never_treated` + `treatment_varies_within_unit == False` + `is_balanced` as the "authoritative" ContinuousDiD pre-fit gates but omits the duplicate-cell hard stop. Verified `continuous_did.py:_precompute_structures` (line 818-823) builds `outcome_matrix` cell-by-cell with last-row-wins on duplicate `(unit, time)` keys - so absence of the `duplicate_unit_time_rows` alert is also a real prerequisite, not just a style preference. Updated wording in five places to add "+ absence of the `duplicate_unit_time_rows` alert" alongside the other gates and explain the silent-overwrite behavior: - `diff_diff/profile.py` `TreatmentDoseShape` docstring - `diff_diff/guides/llms-autonomous.txt` §2 field reference - `diff_diff/guides/llms-autonomous.txt` §4.7 (continuous design feature) - `diff_diff/guides/llms-autonomous.txt` §5.2 worked example reasoning chain (now lists four gates instead of three) - `CHANGELOG.md` Unreleased entry - `ROADMAP.md` AI-Agent Track building-block Also softened "authoritative" -> "core field-based" since the non-field-based duplicate-row gate makes the original phrasing slightly misleading. Added a test_guides.py regression asserting the autonomous guide mentions `duplicate_unit_time_rows` so future wording changes can't silently drop the gate from the summary. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-25T00:12:18Z

/ai-review

github-actions · 2026-04-25T00:17:05Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 22ff5dc161534d087d0aa191b74f04efd99eaa43

Overall Assessment

⚠️ Needs changes — I found two unmitigated P1 methodology issues in the new count-outcome guidance/routing.

Executive Summary

The prior duplicate-row documentation issue appears resolved: the ContinuousDiD prerequisite summary now keeps duplicate_unit_time_rows in scope.
P1: The new Wooldridge Poisson guidance says the estimator returns a multiplicative/proportional effect, but the implementation and reporting contract expose an ASF-based ATT difference on the outcome scale.
P1: The new OutcomeShape.is_count_like heuristic does not exclude negative support, so it can steer negative integer outcomes toward WooldridgeDiD(method="poisson"), which hard-rejects negative outcomes.
P2: The added tests pin section presence and API spelling, but they do not guard the nonlinear Wooldridge target-parameter wording or the negative-count edge case.
Static review only: I could not run the test suite because pytest is not installed in this environment.

Methodology

Severity: P1. Affected method: WooldridgeDiD(method="poisson"). Impact: The new guide tells agents that Poisson ETWFE “estimates the multiplicative effect directly” and that they should “report the multiplicative effect (proportional change),” but the library’s own methodology contract is ASF-based ATT, and the Poisson path computes E[exp(η_1)] - E[exp(η_0)], i.e. an outcome-scale difference, not a ratio. An agent following the new worked example can therefore misreport overall_att on the wrong scale. Concrete fix: Rewrite the §4.11 / §5.3 wording so Poisson ETWFE is described as an ASF-based ATT on the natural outcome scale; if you want to mention proportional effects, label them as a separate derived interpretation, not the estimator’s overall_att target. Refs: diff_diff/guides/llms-autonomous.txt:L621-L629, diff_diff/guides/llms-autonomous.txt:L801-L811, docs/methodology/REGISTRY.md:L1335-L1357, diff_diff/wooldridge.py:L1191-L1225, diff_diff/_reporting_helpers.py:L262-L281.
Severity: P1. Affected surface: profile_panel outcome-shape routing for WooldridgeDiD(method="poisson"). Impact: is_count_like currently requires integer-valued data, zeros, positive skew, and >2 distinct values, but it does not require non-negative support. That means a right-skewed integer outcome with zeros and some negative values can still set is_count_like=True, after which the new guide treats it as a Poisson candidate even though the Poisson fitter raises on any negative outcome. This is a missing edge-case guard on the new routing signal. Concrete fix: Add a non-negativity requirement to is_count_like (value_min >= 0 or equivalent), document it, and add a regression test with negative integer outcomes plus zeros expecting is_count_like=False. Refs: diff_diff/profile.py:L621-L645, diff_diff/guides/llms-autonomous.txt:L621-L629, diff_diff/wooldridge.py:L1105-L1108.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. I did not see these P1 items mitigated in TODO.md, and they are not deferrable under the review policy.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: The new tests verify the worked-examples section exists and that the Poisson example uses method= rather than family=, but they do not verify the nonlinear Wooldridge estimand wording or the negative-outcome exclusion. Both P1 issues above can therefore pass the current suite. Concrete fix: Add guide-content assertions that §4.11/§5.3 do not describe overall_att as “multiplicative” / “proportional change,” and add a profile_panel regression test for a right-skewed integer outcome with zeros and some negative values. Refs: tests/test_guides.py:L62-L83, tests/test_profile_panel.py:L887-L907.

Path to Approval

Fix the Poisson ETWFE wording in diff_diff/guides/llms-autonomous.txt so it matches the ASF-based ATT contract already documented in REGISTRY.md and _reporting_helpers.py.
Tighten OutcomeShape.is_count_like to require non-negative support before it can steer users toward Poisson.
Add regression tests for both the estimand wording and the negative-integer edge case.

…n estimand wording + is_count_like non-negativity guard P1 #1 (Wooldridge Poisson estimand wording): The guide §4.11 and §5.3 worked example described `WooldridgeDiD(method="poisson")`'s `overall_att` as a "multiplicative effect" / "log-link effect" / "proportional change" to be reported. Verified against `wooldridge.py:1225` (`att = _avg(mu_1 - mu_0, cell_mask)`) and `_reporting_helpers.py:262-281` (registered estimand: "ASF-based average from Wooldridge ETWFE ... average-structural-function (ASF) contrast between treated and counterfactual untreated outcomes ... on the natural outcome scale"): the actual quantity is `E[exp(η_1)] - E[exp(η_0)]`, an outcome-scale DIFFERENCE, not a multiplicative ratio. An agent following the previous wording would misreport the headline scalar. Rewrote both surfaces to: - Describe the estimand as an ASF-based outcome-scale difference, citing `wooldridge.py:1225` and Wooldridge (2023) + REGISTRY.md §WooldridgeDiD nonlinear / ASF path. - Explicitly note the headline `overall_att` is a difference on the natural outcome scale, NOT a multiplicative ratio. - Mention that a proportional / percent-change interpretation can be derived post-hoc as `overall_att / E[Y_0]` but is not the estimator's reported scalar. Added `test_autonomous_count_outcome_uses_asf_outcome_scale_estimand` in `tests/test_guides.py`: extracts §4.11 and §5.3 blocks, asserts forbidden phrases ("multiplicative effect under qmle", "estimates the multiplicative effect", "multiplicative (log-link) effect", "report the multiplicative effect", "report the multiplicative") do NOT appear, and asserts §5.3 explicitly contains "ASF" and "outcome scale" so future edits cannot silently weaken the description. P1 #2 (`is_count_like` non-negativity guard): The `is_count_like` heuristic gated on integer-valued + has-zeros + right-skewed + > 2 distinct values, but did NOT exclude negative support. Verified against `wooldridge.py:1105-1109`: Poisson method hard-rejects `y < 0` with `ValueError`. Without a value_min >= 0 guard, a right-skewed integer outcome with zeros and some negatives would set `is_count_like=True` and steer an agent toward an estimator that then refuses to fit. Added `value_min >= 0.0` to the heuristic and explained the non-negativity gate in the docstring + autonomous guide §2 field reference (now reads "is_integer_valued AND pct_zeros > 0 AND skewness > 0.5 AND n_distinct_values > 2 AND value_min >= 0"). The guide also notes that the gate exists specifically to align the routing signal with WooldridgeDiD Poisson's hard non-negativity requirement. Added `test_outcome_shape_count_like_excludes_negative_support` in `tests/test_profile_panel.py` covering a Poisson-distributed outcome with a small share of negative integers spliced in: asserts `is_count_like=False` despite the other four conditions firing. P2 (test coverage for both P1s): Both regressions above guard the new contracts. The guide test guards the wording surface; the profile test guards the heuristic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-25T00:24:46Z

/ai-review

github-actions · 2026-04-25T00:31:19Z

🔁 AI review rerun (requested by @igerber)

Head SHA: b444300244662048617cd21fc6357d5cae0b1e32

Overall Assessment

⚠️ Needs changes — the two prior Wooldridge-related P1s look resolved in the provided diff, but there is one P1 [Newly identified] in the new ContinuousDiD pre-fit routing/docs.

Executive Summary

The prior Poisson issues appear fixed in the provided diff: WooldridgeDiD(method="poisson") is now described as ASF/outcome-scale, and outcome_shape.is_count_like now excludes negative outcomes.
P1 [Newly identified]: the new treatment_dose / ContinuousDiD prerequisite summaries omit the estimator’s strictly-positive-treated-dose restriction. A panel can satisfy every documented gate and still hard-fail in ContinuousDiD.fit() on negative dose support.
The added tests cover the two prior Poisson regressions, but there is still no regression around negative-dose ContinuousDiD routing.
No findings in code quality, performance, security, or tracked-tech-debt handling.
Static review only: pytest is not installed in this environment, so I could not run the suite.

Methodology

Re-review note: the previous Poisson estimand-wording and negative-count-routing findings appear resolved in the provided diff.

Severity: P1 [Newly identified]. Impact: the PR now presents has_never_treated, treatment_varies_within_unit == False, is_balanced, and no duplicate_unit_time_rows alert as the authoritative pre-fit gates for ContinuousDiD, and §5.2 says that if those pass the estimator is “in scope.” That omits the implementation’s hard support restriction that treated doses must be strictly positive; ContinuousDiD.fit() raises on negative treated dose support, and the methodology note is written on D > 0 support. An autonomous agent can therefore route a balanced, never-treated, time-invariant panel with treatment_dose.dose_min < 0 into a deterministic fit-time error. Concrete fix: extend the new routing contract so ContinuousDiD is only in scope when nonzero doses are strictly positive as well; at minimum, document treatment_dose.dose_min > 0 as a hard gate, and preferably expose an explicit negative-dose alert/boolean. Refs: diff_diff/profile.py:L65-L85, diff_diff/guides/llms-autonomous.txt:L737-L751, diff_diff/continuous_did.py:L287-L294, docs/methodology/continuous-did.md:L180-L184.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. I did not find the ContinuousDiD gate omission documented as a registry deviation or tracked in TODO.md, so it is unmitigated.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the new tests now guard the prior Poisson regressions, but they still do not cover the expanded ContinuousDiD routing surface for negative-dose panels. The treatment-dose tests only cover zero-baseline/no-zero/binary/categorical cases, and the guide tests do not assert any negative-dose exclusion. Concrete fix: add a regression with a balanced, never-treated, time-invariant continuous panel containing a negative nonzero dose and assert the routing surface rejects ContinuousDiD (via explicit alert/boolean or guide wording keyed to dose_min < 0). Refs: tests/test_profile_panel.py:L1037-L1187, tests/test_guides.py:L62-L105.

Path to Approval

Update the new TreatmentDoseShape / autonomous-guide prerequisite summaries so ContinuousDiD is only “in scope” when nonzero doses are strictly positive in addition to being zero-supported, time-invariant, balanced, and duplicate-free.
Add a targeted regression for a negative-dose continuous panel that otherwise passes the existing documented gates, so this support restriction is covered going forward.

…ptive-only; fix WooldridgeDiD method kwarg P1 (TreatmentDoseShape vs ContinuousDiD contract): - Reviewer correctly flagged that the new `is_time_invariant` field (per-unit non-zero distinct-count) does NOT match the actual `ContinuousDiD.fit()` gate at `continuous_did.py:222-228`, which uses `df.groupby(unit)[dose].nunique() > 1` over the FULL dose column (including pre-treatment zeros). My nonzero-only check silently classified `0,0,d,d` paths as time-invariant while ContinuousDiD would reject them. - Removed `is_time_invariant` field from `TreatmentDoseShape` entirely. The pre-existing `PanelProfile.treatment_varies_within_unit` field already encodes the correct ContinuousDiD prerequisite (matches the estimator's nunique check at line 224) and is correctly documented in §2 of the autonomous guide. Adding a second, narrower, mismatched gate was confusing - the reviewer's "scope as descriptive-only" path is the cleaner fix. - Reframed `TreatmentDoseShape` docstring + autonomous guide §2 field reference: explicitly NOT a ContinuousDiD prerequisite. `n_distinct_doses`, `has_zero_dose`, `dose_min/max/mean` provide descriptive distributional context; `has_never_treated` (unit-level) + `treatment_varies_within_unit == False` (full-path constancy) + `is_balanced` are the authoritative gates. - Rewrote §5.2 worked example reasoning chain to use the existing correct gates and added a counter-example showing `has_zero_dose=True` does NOT imply `has_never_treated=True` (the row-level vs unit-level distinction). - Added `test_treatment_dose_does_not_gate_continuous_did` covering the two contradictory cases the reviewer named: (1) `0,0,d,d` within-unit dose path, asserting `treatment_varies_within_unit=True` (the actual ContinuousDiD gate fires correctly); (2) row-level zeros without never-treated units, asserting `has_zero_dose=True` BUT `has_never_treated=False` (the two facts are distinct). - Removed `test_treatment_dose_continuous_time_varying_within_unit` and `test_treatment_dose_distinguishes_doses_at_high_precision` - both tested the dropped `is_time_invariant` field. P2 (WooldridgeDiD constructor kwarg): - The autonomous guide §5.3 worked example used `WooldridgeDiD(family="poisson")` but the actual constructor at `wooldridge.py:264` takes `method=`. Following the example would raise `TypeError: __init__() got an unexpected keyword argument 'family'`. Fixed in two places (the prose and the code snippet) and added a negative assertion in `test_guides.py` to prevent regression: `assert 'WooldridgeDiD(family="poisson")' not in text`. CHANGELOG updated to reflect the revised TreatmentDoseShape scope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…corrected scope; cover new exports in import-surface test P3 #1 (ROADMAP wording drift): ROADMAP.md still said the new fields "gate WooldridgeDiD QMLE / ContinuousDiD prerequisites pre-fit" and mentioned "time-invariance", which contradicted the round-1 corrections to TreatmentDoseShape's docstring + autonomous guide §2 + §5.2. Reworded to match: the new fields add descriptive distributional context only; `outcome_shape.is_count_like` informs (not gates) the WooldridgeDiD QMLE judgment, and the authoritative ContinuousDiD pre-fit gates remain `has_never_treated`, `treatment_varies_within_unit`, and `is_balanced`. "Time-invariance" wording removed (the field was dropped in round 1). P3 #2 (import-surface test coverage): `test_top_level_import_surface()` previously only verified `profile_panel`, `PanelProfile`, `Alert`. Extended to also cover the two new public exports `OutcomeShape` and `TreatmentDoseShape`, asserting both their importability and their presence in `diff_diff.__all__`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ontinuousDiD prerequisite summaries Reviewer correctly noted that the round-2 wording lists `has_never_treated` + `treatment_varies_within_unit == False` + `is_balanced` as the "authoritative" ContinuousDiD pre-fit gates but omits the duplicate-cell hard stop. Verified `continuous_did.py:_precompute_structures` (line 818-823) builds `outcome_matrix` cell-by-cell with last-row-wins on duplicate `(unit, time)` keys - so absence of the `duplicate_unit_time_rows` alert is also a real prerequisite, not just a style preference. Updated wording in five places to add "+ absence of the `duplicate_unit_time_rows` alert" alongside the other gates and explain the silent-overwrite behavior: - `diff_diff/profile.py` `TreatmentDoseShape` docstring - `diff_diff/guides/llms-autonomous.txt` §2 field reference - `diff_diff/guides/llms-autonomous.txt` §4.7 (continuous design feature) - `diff_diff/guides/llms-autonomous.txt` §5.2 worked example reasoning chain (now lists four gates instead of three) - `CHANGELOG.md` Unreleased entry - `ROADMAP.md` AI-Agent Track building-block Also softened "authoritative" -> "core field-based" since the non-field-based duplicate-row gate makes the original phrasing slightly misleading. Added a test_guides.py regression asserting the autonomous guide mentions `duplicate_unit_time_rows` so future wording changes can't silently drop the gate from the summary. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…n estimand wording + is_count_like non-negativity guard P1 #1 (Wooldridge Poisson estimand wording): The guide §4.11 and §5.3 worked example described `WooldridgeDiD(method="poisson")`'s `overall_att` as a "multiplicative effect" / "log-link effect" / "proportional change" to be reported. Verified against `wooldridge.py:1225` (`att = _avg(mu_1 - mu_0, cell_mask)`) and `_reporting_helpers.py:262-281` (registered estimand: "ASF-based average from Wooldridge ETWFE ... average-structural-function (ASF) contrast between treated and counterfactual untreated outcomes ... on the natural outcome scale"): the actual quantity is `E[exp(η_1)] - E[exp(η_0)]`, an outcome-scale DIFFERENCE, not a multiplicative ratio. An agent following the previous wording would misreport the headline scalar. Rewrote both surfaces to: - Describe the estimand as an ASF-based outcome-scale difference, citing `wooldridge.py:1225` and Wooldridge (2023) + REGISTRY.md §WooldridgeDiD nonlinear / ASF path. - Explicitly note the headline `overall_att` is a difference on the natural outcome scale, NOT a multiplicative ratio. - Mention that a proportional / percent-change interpretation can be derived post-hoc as `overall_att / E[Y_0]` but is not the estimator's reported scalar. Added `test_autonomous_count_outcome_uses_asf_outcome_scale_estimand` in `tests/test_guides.py`: extracts §4.11 and §5.3 blocks, asserts forbidden phrases ("multiplicative effect under qmle", "estimates the multiplicative effect", "multiplicative (log-link) effect", "report the multiplicative effect", "report the multiplicative") do NOT appear, and asserts §5.3 explicitly contains "ASF" and "outcome scale" so future edits cannot silently weaken the description. P1 #2 (`is_count_like` non-negativity guard): The `is_count_like` heuristic gated on integer-valued + has-zeros + right-skewed + > 2 distinct values, but did NOT exclude negative support. Verified against `wooldridge.py:1105-1109`: Poisson method hard-rejects `y < 0` with `ValueError`. Without a value_min >= 0 guard, a right-skewed integer outcome with zeros and some negatives would set `is_count_like=True` and steer an agent toward an estimator that then refuses to fit. Added `value_min >= 0.0` to the heuristic and explained the non-negativity gate in the docstring + autonomous guide §2 field reference (now reads "is_integer_valued AND pct_zeros > 0 AND skewness > 0.5 AND n_distinct_values > 2 AND value_min >= 0"). The guide also notes that the gate exists specifically to align the routing signal with WooldridgeDiD Poisson's hard non-negativity requirement. Added `test_outcome_shape_count_like_excludes_negative_support` in `tests/test_profile_panel.py` covering a Poisson-distributed outcome with a small share of negative integers spliced in: asserts `is_count_like=False` despite the other four conditions firing. P2 (test coverage for both P1s): Both regressions above guard the new contracts. The guide test guards the wording surface; the profile test guards the heuristic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e-dose gate to ContinuousDiD prerequisite summaries P1 (newly identified — `dose_min > 0` ContinuousDiD gate omitted): Reviewer correctly noted that the round-3 prerequisite summary (`has_never_treated`, `treatment_varies_within_unit == False`, `is_balanced`, no `duplicate_unit_time_rows` alert) omits the estimator's strictly-positive-treated-dose restriction. Verified `continuous_did.py:287-294` raises `ValueError` ("Dose must be strictly positive for treated units (D > 0)") on negative treated dose support — a panel can satisfy every documented gate above and still hard-fail at fit time when `treatment_dose.dose_min < 0`. Updated wording in five surfaces to add `treatment_dose.dose_min > 0` as the fifth pre-fit gate: - `diff_diff/profile.py` `TreatmentDoseShape` docstring (now spells out all five gates with numbered list referencing the `continuous_did.py:287-294` line range). - `diff_diff/guides/llms-autonomous.txt` §2 field reference (notes `dose_min > 0` is itself a gate; the other treatment_dose sub-fields remain descriptive). - `diff_diff/guides/llms-autonomous.txt` §4.7 (continuous design feature paragraph). - `diff_diff/guides/llms-autonomous.txt` §5.2 worked example reasoning chain (now five gates checked; added a new counter-example covering the negative-dose path so an agent reading the example sees the contradictory case explicitly). - `CHANGELOG.md` Unreleased entry. - `ROADMAP.md` AI-Agent Track building-block. P2 (regression test for negative-dose path): Added `test_treatment_dose_min_flags_negative_dose_continuous_panels` in `tests/test_profile_panel.py` covering a balanced, never-treated, time-invariant continuous panel where every dose level is negative. The test asserts that all four other gates pass cleanly and that `dose.dose_min < 0` is correctly observed — the fixture an agent would see when reasoning about whether ContinuousDiD applies. Added `dose_min > 0` content-stability assertion in `tests/test_guides.py` so future wording changes can't silently drop the gate from the autonomous-guide summary. Rebased onto origin/main; resolved CHANGELOG conflicts (kept HAD Phase 4.5 B + dCDH by_path Wave 2 + my Wave 2 entries side by side). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-25T00:41:13Z

/ai-review

github-actions · 2026-04-25T00:46:42Z

🔁 AI review rerun (requested by @igerber)

Head SHA: ed57c7cffa81eec1685695f3ad11894729e41372

Overall Assessment

⚠️ Needs changes — the prior negative-dose / Wooldridge Poisson wording issues look resolved, but there is one unmitigated P1 in the new ContinuousDiD routing contract.

Executive Summary

The earlier dose_min > 0 omission appears fixed: the guide now reflects ContinuousDiD’s negative-dose rejection, and the Wooldridge Poisson text now matches the ASF outcome-scale estimand.
P1 [Newly identified]: the PR repeatedly calls a five-item PanelProfile checklist the full ContinuousDiD pre-fit gate set, but ContinuousDiD.fit() still has separate first_treat validity/consistency gates and still requires P(D=0) > 0 regardless of control_group.
The new OutcomeShape / is_count_like heuristic is aligned with the Poisson fitter’s non-negativity requirement and does not look methodologically off.
Static review only: pytest is not installed in this environment, so I could not execute the added tests.

Methodology

Severity: P1 [Newly identified]. Impact: the new docs present has_never_treated, treatment_varies_within_unit == False, is_balanced, no duplicate_unit_time_rows, and treatment_dose.dose_min > 0 as the full ContinuousDiD gate set. That is not the estimator’s full contract. ContinuousDiD.fit() also validates the separate first_treat column, rejects NaN/negative values, drops first_treat > 0 units with dose == 0, zeroes first_treat == 0 rows with nonzero dose, and enforces P(D=0) > 0 even under control_group="not_yet_treated". An agent can therefore mark a panel as “in scope” from PanelProfile alone even though the actual first_treat input will deterministically warn/error or change the sample. Concrete fix: stop calling the five profile-derived facts the “full” gate set; describe them as the profile-side screening surface and add an explicit caveat that ContinuousDiD.fit() still requires a valid/consistent first_treat column. Refs: diff_diff/guides/llms-autonomous.txt:L203-L235, diff_diff/guides/llms-autonomous.txt:L514-L526, diff_diff/guides/llms-autonomous.txt:L758-L788, diff_diff/profile.py:L65-L99, diff_diff/continuous_did.py:L230-L360, docs/methodology/REGISTRY.md:L719-L745, docs/methodology/continuous-did.md:L182-L184, ROADMAP.md:L141, CHANGELOG.md:L12.

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity: P3. Impact: the new internal commentary is inconsistent. Public docs now say treatment_dose.dose_min > 0 is part of the profile-side ContinuousDiD screen, but _compute_treatment_dose() and test_treatment_dose_does_not_gate_continuous_did() still say treatment-dose fields “do not gate” ContinuousDiD. That inconsistency will make future regressions more likely. Concrete fix: align the helper/test docstrings with the public contract (“most fields descriptive; dose_min > 0 is the extra screen”). Refs: diff_diff/profile.py:L65-L99, diff_diff/profile.py:L689-L694, tests/test_profile_panel.py:L1110-L1128.

Tech Debt

No findings. The P1 above is not mitigated by TODO.md tracking or a registry deviation note.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the new tests pin the negative-dose wording and ASF outcome-scale wording, but they do not protect the missing first_treat caveat that causes the P1 above. Concrete fix: add a guide regression asserting that the ContinuousDiD section describes the five PanelProfile facts as screening checks only and explicitly mentions separate first_treat validation/consistency. Refs: tests/test_guides.py:L37-L69, diff_diff/guides/llms-autonomous.txt:L758-L788, diff_diff/continuous_did.py:L230-L360.

Path to Approval

Replace every “full ContinuousDiD pre-fit gate set” claim in CHANGELOG.md, ROADMAP.md, diff_diff/profile.py, and diff_diff/guides/llms-autonomous.txt with “profile-side screening checks,” and document the separate first_treat contract from ContinuousDiD.fit().
In llms-autonomous.txt §5.2, rewrite the has_never_treated == False counterexample so it states that the current implementation requires P(D=0) > 0 under both control_group options, not just the default.
Add a regression in tests/test_guides.py that requires the ContinuousDiD section to mention separate first_treat validation/consistency; optionally pair it with a small tests/test_continuous_did.py scenario where the five profile-side checks pass but inconsistent first_treat still warns/errors.

…nuousDiD prerequisite list as profile-side screening + add first_treat caveat P1 (the five profile-derived facts are not the "full" gate set): Reviewer correctly noted that calling `{has_never_treated, treatment_varies_within_unit==False, is_balanced, no duplicate_unit_time_rows alert, dose_min > 0}` the "full ContinuousDiD pre-fit gate set" overreaches. `profile_panel` only sees the four columns it accepts and CANNOT see the separate `first_treat` column that `ContinuousDiD.fit()` consumes. Verified against `continuous_did.py:230-360`: `fit()` additionally rejects NaN/inf/negative `first_treat`, drops units with `first_treat > 0` AND `dose == 0`, and force-zeroes `first_treat == 0` rows whose `dose != 0` with a `UserWarning`. A panel that passes all five profile-side checks can still surface warnings, drop rows, or raise at fit time depending on the `first_treat` column the caller supplies. Reframed the wording in five surfaces from "full gate set" to "profile-side screening checks" with an explicit caveat that the checks are necessary-but-not-sufficient and that `ContinuousDiD.fit()` applies separate `first_treat` validation: - `diff_diff/profile.py` `TreatmentDoseShape` docstring (now spells out the screening framing explicitly + lists the `first_treat` validations that fit() applies). - `diff_diff/profile.py` `_compute_treatment_dose` helper docstring (aligned with public contract: most fields descriptive, `dose_min > 0` is one of the screening checks). - `diff_diff/guides/llms-autonomous.txt` §2 field reference (rewrote the multi-paragraph block to describe screening + first_treat caveat). - `diff_diff/guides/llms-autonomous.txt` §4.7 (continuous design feature paragraph: screening checks + necessary-not-sufficient language + pointer to §2). - `diff_diff/guides/llms-autonomous.txt` §5.2 worked example reasoning chain (rewrote step 2 to call out screening + first_treat caveat; clarified counter-example #4 that `P(D=0) > 0` is required under BOTH `control_group="never_treated"` and `"not_yet_treated"`, not just default). - `CHANGELOG.md` Unreleased entry. - `ROADMAP.md` AI-Agent Track. P2 (test coverage for the missing `first_treat` caveat): Added a content-stability assertion in `tests/test_guides.py`: `assert "first_treat" in text` so the autonomous guide cannot silently drop the explicit `first_treat` validation caveat. P3 (helper / test-name inconsistency with public contract): Renamed `test_treatment_dose_does_not_gate_continuous_did` to `test_treatment_dose_descriptive_fields_supplement_existing_gates` and rewrote its docstring to match the now-honest public contract ("most fields descriptive distributional context that supplements the existing top-level screening checks"). The test body still asserts the same two things — `treatment_varies_within_unit` fires True on `0,0,d,d` paths and `has_never_treated` is independent of `has_zero_dose` — both of which remain accurate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-25T00:50:59Z

/ai-review

github-actions · 2026-04-25T00:56:53Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 8601b611fa2c21684b91ecb58bf4b5df314557d6

Overall Assessment

Looks good — the prior P1 is resolved, and I did not find any new unmitigated P0/P1 issues in the changed code/docs.

Executive Summary

Prior re-review P1 is resolved: the ContinuousDiD language now consistently frames the five PanelProfile facts as a profile-side pre-flight screen and separately documents first_treat validation/coercion/drop behavior (diff_diff/profile.py:L65-L115, diff_diff/guides/llms-autonomous.txt:L203-L250, diff_diff/guides/llms-autonomous.txt:L530-L548, diff_diff/guides/llms-autonomous.txt:L780-L812, CHANGELOG.md:L12, ROADMAP.md:L141).
The new OutcomeShape.is_count_like routing is aligned with Wooldridge Poisson fit behavior by excluding negative-support outcomes, and the guide now correctly describes Poisson overall_att as an ASF-based outcome-scale difference (diff_diff/profile.py:L669-L683, docs/methodology/REGISTRY.md:L1333-L1348, diff_diff/wooldridge.py:L1102-L1109, diff_diff/wooldridge.py:L1191-L1245).
No estimator math, weighting, variance/SE, identification assumptions, or defaults changed in the implementation; this PR is limited to descriptive profiling fields, exports, guide content, and tests.
One minor documentation drift remains: the changelog summary omits the non-negativity clause from the is_count_like heuristic (CHANGELOG.md:L12).
Added tests cover the new profile fields and the prior review regression surface (tests/test_guides.py:L37-L185, tests/test_profile_panel.py:L889-L1313), but I could not execute them here because pytest and runtime deps such as numpy are unavailable.

Methodology

No findings. The changed docs now match the ContinuousDiD contract in docs/methodology/REGISTRY.md:L717-L745 and diff_diff/continuous_did.py:L222-L327,L348-L360, and the Wooldridge Poisson wording matches docs/methodology/REGISTRY.md:L1333-L1348 and diff_diff/wooldridge.py:L1191-L1245.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

Severity: P3
Impact: CHANGELOG.md:L12 describes is_count_like without the value_min >= 0 / non-negative-support gate. Readers relying on the release notes alone may expect negative-valued outcomes to route as count-like even though the implementation intentionally suppresses that to stay compatible with WooldridgeDiD(method="poisson").
Concrete fix: Update the changelog bullet to include the non-negativity clause, matching diff_diff/profile.py:L669-L683, diff_diff/guides/llms-autonomous.txt:L188-L198, and tests/test_profile_panel.py:L930-L971.

…se in CHANGELOG is_count_like description Reviewer correctly noted that the CHANGELOG bullet describing `is_count_like` listed only four of the five conditions (integer-valued + has zeros + right-skewed + > 2 distinct values) but omitted the `value_min >= 0` non-negativity clause added in round 4. Readers of the release notes alone would have expected negative-valued count-like outcomes to route as `is_count_like=True` even though the implementation intentionally suppresses that to stay compatible with `WooldridgeDiD(method="poisson")`'s hard non-negative requirement (`wooldridge.py:1105-1109`). Updated the bullet to include the non-negativity clause and explicitly cite the wooldridge.py line range so the release-notes description matches `diff_diff/profile.py`, `diff_diff/guides/llms-autonomous.txt`, and the `test_outcome_shape_count_like_excludes_negative_support` regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-25T00:58:26Z

/ai-review

github-actions · 2026-04-25T01:04:02Z

🔁 AI review rerun (requested by @igerber)

Head SHA: bc13abb50011b5a5495d6448d458cff7dfa157e4

Overall Assessment

✅ Looks good

Executive Summary

Re-review status: the prior changelog drift is resolved; I did not find any new unmitigated P0/P1 issues in the changed files.
Methodology check passed for the affected guidance surfaces: the new ContinuousDiD profile-side screening language matches the estimator contract on zero-dose controls, full-path dose constancy, balanced-panel requirement, duplicate-row caveat, negative-dose rejection, and separate first_treat validation (diff_diff/guides/llms-autonomous.txt:L203-L251, diff_diff/guides/llms-autonomous.txt:L507-L548, diff_diff/continuous_did.py:L222-L360, docs/methodology/REGISTRY.md:L717-L745).
Methodology check passed for the Wooldridge Poisson path: OutcomeShape.is_count_like now gates on non-negative support, and the guide correctly describes overall_att as an ASF-based outcome-scale difference rather than a multiplicative ratio (diff_diff/profile.py:L631-L695, diff_diff/wooldridge.py:L1102-L1278, docs/methodology/REGISTRY.md:L1333-L1349).
No estimator math, weighting, variance/SE, identification assumptions, or defaults changed; the implementation is confined to descriptive profile fields, public exports, guide content, and tests (diff_diff/profile.py:L43-L442, diff_diff/init.py:L253-L259, diff_diff/init.py:L503-L508).
One minor docs inconsistency remains: §4.7 still opens with “Three eligibility prerequisites” for ContinuousDiD, then later in the same section correctly expands the profile-side screen to five checks by adding duplicate-row absence and treatment_dose.dose_min > 0 (diff_diff/guides/llms-autonomous.txt:L507-L548).
I could not execute the added tests here because this environment lacks pytest and runtime dependencies such as pandas and numpy.

Methodology

No findings. Cross-checking the changed profile/guide language against docs/methodology/REGISTRY.md:L717-L745 and docs/methodology/REGISTRY.md:L1333-L1349 found no undocumented deviation in estimator behavior.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

Severity: P3
Impact: §4.7 contains two different summaries of the ContinuousDiD pre-fit contract. The opening bullet says there are only three eligibility prerequisites, while the later paragraph correctly adds the duplicate-row hard stop and the negative-dose screen. A reader who stops at the first summary can miss two load-bearing gates and route a duplicate-containing or negative-dose panel into a failed or silent-overwrite fit path (diff_diff/guides/llms-autonomous.txt:L507-L548, diff_diff/continuous_did.py:L222-L338, diff_diff/continuous_did.py:L800-L823).
Concrete fix: Update the opening ContinuousDiD bullet in §4.7 to enumerate all five profile-side checks, or relabel the first three as “core estimator prerequisites” and explicitly cross-reference the fuller five-check screen immediately below.

Static review only: the added coverage in tests/test_profile_panel.py:L887-L1313 and tests/test_guides.py:L37-L182 looks appropriate, but I could not run it in this environment.

…D opening bullet with the five-check screen below it §4.7 had two summaries of the ContinuousDiD pre-fit contract that disagreed: the opening bullet (Wave 1 era) said "Three eligibility prerequisites", while the paragraph immediately below correctly listed all five profile-side screening checks (Wave 2 added the duplicate-row hard stop and the negative-dose `dose_min > 0` screen). A reader who stopped at the opening bullet could miss those two load-bearing gates and route a duplicate-containing or negative-dose panel into a failed/silent-overwrite fit path. Updated the opening bullet to enumerate all five profile-side checks: (a) zero-dose controls / `P(D=0) > 0`, (b) per-unit time-invariant dose, (c) balanced panel, (d) no `duplicate_unit_time_rows` alert, (e) strictly positive treated doses. Also clarified that the duplicate-row case (d) coerces silently rather than raising — distinct from (a)/(b)/(c)/(e) which all raise ValueError. Pointer to the paragraph below + §2 for the `first_treat` validation surface kept intact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-25T01:06:51Z

/ai-review

github-actions · 2026-04-25T01:12:01Z

🔁 AI review rerun (requested by @igerber)

Head SHA: a8880b46a47e39fb3c89a60718cc9b80a7bc2994

Overall Assessment

✅ Looks good

Executive Summary

Prior re-review issue is resolved: the autonomous guide now consistently names all five ContinuousDiD profile-side checks and the separate first_treat validation surface.
No estimator implementation, weighting, variance/SE computation, identification assumption, or default behavior changed in this PR; the code changes are confined to descriptive PanelProfile metadata, exports, guide prose, and tests.
One methodology-facing docs issue remains: the new count-outcome guidance justifies Poisson QMLE by saying linear DiD asymptotic SEs assume “normal-shaped errors,” which does not match the library’s documented cluster-robust OLS inference contract.
One minor docs clarity issue remains: the new ContinuousDiD worked example is written like a zero-pre/post-dose adoption path even though the estimator, the profile gate, and the same section later reject 0,0,d,d dose paths.
Static review only: I could not run the added tests here because this environment does not have pytest, numpy, or pandas.

Methodology

Severity: P2. Impact: The new count-outcome routing text tells agents that linear DiD SEs are questionable because they “assume normal-shaped errors,” then uses that as the rationale for preferring WooldridgeDiD(method="poisson"). That does not match the library’s documented inference surface, which is cluster-robust sandwich OLS for linear ETWFE and QMLE sandwich for nonlinear ETWFE; the current wording can steer estimator choice for the wrong reason even though the Poisson path itself is valid. Location: diff_diff/guides/llms-autonomous.txt:L188-L197, diff_diff/guides/llms-autonomous.txt:L666-L680, diff_diff/guides/llms-autonomous.txt:L882-L907, docs/methodology/REGISTRY.md:L1343-L1349. Concrete fix: rewrite the rationale in support/functional-form terms (or efficiency/model-fit terms) rather than claiming robust linear DiD inference depends on normal-shaped errors.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: The new ContinuousDiD worked example is titled “zero baseline” and describes “block-style adoption in year 3,” but the shown profile has treatment_varies_within_unit=False, and the same section later correctly says a 0,0,d,d path is out of scope. That mixed framing can confuse an agent about the estimator’s full-path dose-invariance contract. Location: diff_diff/guides/llms-autonomous.txt:L759-L781, diff_diff/guides/llms-autonomous.txt:L822-L828, diff_diff/continuous_did.py:L222-L228. Concrete fix: rename the example to something like “continuous-dose panel with zero-dose controls,” and state explicitly that positive-dose units keep the same dose in every observed period while first_treat carries the timing information separately.

No additional test-coverage findings in the diff. Testing note: I could not execute tests/test_profile_panel.py or tests/test_guides.py in this environment because the required Python packages are not installed.

…ationale in functional-form terms; rename §5.2 + clarify dose constancy P2 (linear-OLS SE rationale wording): The §4.11 and §5.3 prose justified `WooldridgeDiD(method="poisson")` over linear-OLS DiD by claiming linear-OLS asymptotic SEs "assume normal-shaped errors" that a right-skewed count distribution violates. That misrepresents the library's documented inference contract: linear ETWFE uses cluster-robust sandwich SEs (`OLS path` in REGISTRY.md) which are valid for any error distribution with finite moments — not just Gaussian. Rewrote both surfaces in functional-form / efficiency terms: - Linear-OLS DiD imposes an additive functional form on a non-negative count outcome. SEs are calibrated (cluster-robust) but the linear model can be inefficient and may produce counterfactual predictions outside the non-negative support. - `WooldridgeDiD(method="poisson")` imposes a multiplicative (log-link) functional form that respects non-negativity and matches the typical generative process for count data; QMLE sandwich SEs are robust to distributional misspecification. - The choice is about WHICH functional form best summarizes the treatment effect (additive vs multiplicative), not about whether SEs are calibrated. §5.3 reasoning chain step 2 + step 3 rewritten to reflect this: both estimators are now described as having calibrated inference; the trade-off is parameterization (linear ATT vs. ASF outcome-scale ATT). The Poisson non-negativity gate explanation is preserved. P3 (§5.2 example self-contradiction): §5.2 was titled "Continuous-dose panel with zero baseline" with prose "block-style adoption in year 3", which suggested a `0,0,d,d` within-unit dose path — but the shown profile has `treatment_varies_within_unit=False` (per-unit constant dose) and the same section's later counter-example correctly says `0,0,d,d` is out of scope for ContinuousDiD. Self-contradictory framing. Renamed to "Continuous-dose panel with zero-dose controls" (per reviewer suggestion) and clarified the prose: each dose group holds its assigned dose value (including 0 for the never-treated controls) in every observed period; adoption timing is carried separately via the `first_treat` column passed to `ContinuousDiD.fit()`, not via within-unit dose variation. Updated the matching `test_autonomous_contains_worked_examples_section` assertion in `tests/test_guides.py` to track the new title. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-25T01:14:16Z

/ai-review

github-actions · 2026-04-25T01:19:36Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 0d711f1c334cb3fb6cccad02ab6eecfbcacac0ad

Overall Assessment

✅ Looks good

Executive Summary

Re-review scope: the prior continuous-dose example issue appears resolved. §5.2 now keeps positive-dose units time-invariant across the full observed path and carries timing separately in first_treat, which matches ContinuousDiD.fit()’s contract in diff_diff/guides/llms-autonomous.txt:L769-L858 and diff_diff/continuous_did.py:L222-L360.
No estimator implementation, weighting, variance/SE computation, identification assumption, or default behavior changed in this PR. The code changes are limited to descriptive PanelProfile metadata, exports, guide prose, and targeted tests.
The ContinuousDiD screening narrative is now materially better aligned with the implementation: it explicitly names the duplicate-row hard stop, the dose_min > 0 gate, and the separate first_treat validation surface in diff_diff/guides/llms-autonomous.txt:L203-L251, diff_diff/guides/llms-autonomous.txt:L539-L556, and diff_diff/continuous_did.py:L287-L360.
One methodology-facing docs issue remains from the prior review: the new outcome_shape.is_count_like field reference and the §4.11 preamble still imply count outcomes make linear DiD asymptotic SEs suspect, which conflicts with the registry’s cluster-robust OLS inference contract.
Static review only: python -m pytest tests/test_profile_panel.py tests/test_guides.py -q could not be run here because pytest is not installed.

Methodology

Severity: P2. Impact: The previous count-outcome docs finding is only partially resolved. The detailed §4.11/§5.3 discussion now correctly reframes Poisson-vs-linear as a functional-form/support/efficiency choice, but the §2 outcome_shape.is_count_like field reference still says raw-outcome OLS has “questionable asymptotic SEs,” and the §4.11 preamble still says asymptotic SEs “may mislead.” That conflicts with the documented OLS cluster-robust sandwich inference surface and can still steer estimator choice for the wrong reason. Location: diff_diff/guides/llms-autonomous.txt:L188-L193, diff_diff/guides/llms-autonomous.txt:L663-L671, docs/methodology/REGISTRY.md:L1343-L1349. Concrete fix: replace the remaining SE-based rationale with the same functional-form/support language already used later in §4.11/§5.3.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

No findings in the diff beyond the methodology item above.
Static review note: the added guide/profile tests look well targeted on inspection, but I could not execute them here because pytest is unavailable.

…tcome rationale to functional-form framing Round 9 rewrote §4.11 detailed bullet and §5.3 reasoning chain to describe the Poisson-vs-linear choice in functional-form / efficiency terms, but two surfaces still carried the old SE-based framing: - §2 `is_count_like` field reference: "questionable asymptotic SEs" - §4.11 preamble paragraph: "asymptotic SEs may mislead" Both phrases conflict with the library's documented inference contract (REGISTRY.md §WooldridgeDiD: linear ETWFE uses cluster-robust sandwich SEs which are valid for any error distribution with finite moments — not just Gaussian). Reviewer correctly flagged this as the remaining methodology P2. Updated both surfaces to match the round-9 framing: - Linear-OLS DiD imposes an additive functional form on a non-negative count outcome; cluster-robust SEs remain calibrated but the linear model can be inefficient and may produce counterfactual predictions outside the non-negative support. - `WooldridgeDiD(method="poisson")` is the multiplicative (log-link) ETWFE alternative that respects the non-negative support and matches the typical count-process generative model, with QMLE sandwich SEs robust to distributional misspecification. - The choice is functional-form / efficiency / support, NOT SE calibration. Verified no remaining "asymptotic SEs" / "questionable" / "mislead" phrases in the guide via grep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-25T01:24:09Z

/ai-review

github-actions · 2026-04-25T01:39:15Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 31e887523fb6cd1f31a5c3acc4be8b07330dfa42

Overall Assessment

⚠️ Needs changes

Executive Summary

P1 The new agent-facing ContinuousDiD guidance overstates PanelProfile.has_never_treated and treatment_dose.dose_min > 0 as necessary/authoritative pre-fit gates, but ContinuousDiD.fit() actually keys off first_treat, zeroes nonzero dose on never-treated rows, and only rejects negative doses among treated units.
That mismatch is now repeated across the public guide, docstrings, changelog/roadmap prose, and regression tests, so an autonomous agent can wrongly route valid panels away from ContinuousDiD.
The prior count-outcome wording issue appears resolved: the updated Wooldridge Poisson guidance now correctly frames overall_att as an ASF-based outcome-scale difference rather than a multiplicative ratio.
No estimator math, weighting, variance, or SE implementation changed in this diff.
Static review only: python -m pytest tests/test_profile_panel.py tests/test_guides.py -q could not be run here because pytest is not installed.

Methodology

P1 Impact: the PR documents the five profile-side ContinuousDiD checks as “necessary” and treats has_never_treated as the authoritative gate, with dose_min > 0 presented as a hard pre-fit screen in diff_diff/profile.py:L64-L114, diff_diff/guides/llms-autonomous.txt:L206-L254, diff_diff/guides/llms-autonomous.txt:L510-L558, diff_diff/guides/llms-autonomous.txt:L807-L860, CHANGELOG.md:L12-L12, ROADMAP.md:L141-L141, tests/test_guides.py:L45-L82, and tests/test_profile_panel.py:L1110-L1212. But the actual estimator contract is different: ContinuousDiD.fit() defines controls from first_treat == 0, zeroes nonzero dose on those rows, and only rejects negative dose among treated units first_treat > 0 in diff_diff/continuous_did.py:L276-L327 and diff_diff/continuous_did.py:L348-L360; that coercion is explicitly documented in docs/methodology/REGISTRY.md:L743-L745. So a panel can fail the new profile-side gate while still warning-and-fitting. Concrete fix: reword these fields as conservative “clean-data / no-coercion preflight checks” rather than necessary eligibility conditions, or add a profiling helper that also accepts first_treat so the pre-fit gate matches the estimator’s real contract; update the new tests accordingly.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings beyond the methodology contract issue above.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

No separate findings beyond the methodology item above.
Static review note: tests were not executed here because pytest is unavailable in the environment.

Path to Approval

Remove “necessary” / “authoritative gate” wording for has_never_treated and treatment_dose.dose_min > 0 anywhere this PR added it, and replace it with contract-accurate wording tied to first_treat-based coercion/validation.
Add a regression that covers first_treat == 0 rows carrying nonzero dose (including negative dose on never-treated rows), asserting the documented warning/coercion behavior in ContinuousDiD.fit(), then align the guide/profile tests with that behavior.

…mous-guide worked examples (Wave 2) Wave 2 of the AI-agent enablement track. Extends profile_panel() with two new optional sub-dataclasses: - OutcomeShape (numeric outcomes only): n_distinct_values, pct_zeros, value_min/max, NaN-safe skewness + excess_kurtosis (gated on n_distinct >= 3 and std > 0), is_integer_valued, is_count_like (heuristic: integer-valued AND has zeros AND right-skewed AND > 2 distinct values), is_bounded_unit ([0, 1] support). - TreatmentDoseShape (treatment_type == "continuous" only): n_distinct_doses, has_zero_dose, dose_min/max/mean over non-zero doses, is_time_invariant (per-unit non-zero doses have at most one distinct value). Both fields are None when their classification gate is not met. to_dict() serializes the nested dataclasses as JSON-compatible nested dicts. llms-autonomous.txt gains a new §5 "Worked examples" with three end-to-end PanelProfile -> reasoning -> validation walkthroughs (binary staggered with never-treated controls, continuous dose with zero baseline, count-shaped outcome) plus §2 field-reference subsections, §3 footnote cross-ref, §4.7 cross-ref, and a new §4.11 outcome-shape considerations section. Existing §5-§8 renumbered to §6-§9. Descriptive only - no recommender language inside the worked examples. Tests: 16 new unit tests in tests/test_profile_panel.py covering each heuristic (count-like Poisson, binary-as-not-count-like, continuous normal, bounded unit, categorical returning None, skewness gating, JSON roundtrip, time-invariant dose, time-varying dose, no-zero-dose, binary-treatment returning None, categorical-treatment returning None, JSON roundtrip, frozen invariants on both new dataclasses). Two new content-stability tests in tests/test_guides.py guard the §5 worked examples and the new field references. CHANGELOG and ROADMAP updated; ROADMAP marks Wave 2 shipped, promotes sanity_checks block to top of "Next blocks toward the vision," and documents why the originally-proposed post-hoc mismatch detection was rescoped (largely overlaps existing fit-time validators and caveats). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…are integer detection P1 from local review (`treatment_dose.is_time_invariant`): - Removed `np.round(..., 8)` tolerance from `_compute_treatment_dose`'s per-unit non-zero distinct-count check. The documented contract is "per-unit non-zero doses have at most one distinct value" (exact), but the implementation was rounding to 8 decimals before comparing, silently classifying tiny-but-real dose variation as time-invariant and contradicting the docstring + CHANGELOG + autonomous guide §2. Now uses exact `np.unique(unit_nonzero).size > 1`. Added a regression test (`test_treatment_dose_distinguishes_doses_at_high_precision`) for a unit with two non-equal doses separated by 1e-9 (sub the previous rounding window) — asserts `is_time_invariant=False`. Related dead-code removal: - Removed the `len(nonzero) == 0` defensive branch in `_compute_treatment_dose`. `treatment_type == "continuous"` is reached only when the treatment column has more than two distinct values OR a 2-valued numeric outside `{0, 1}`; an all-zero numeric column is classified as `binary_absorbing` and never reaches this branch, so `nonzero` is guaranteed non-empty. Removing the branch eliminates the NaN-vs-Optional[float] inconsistency the reviewer flagged on `dose_min/max/mean`. P2 from local review (`is_integer_valued` brittleness): - Switched from `np.equal(np.mod(arr, 1.0), 0.0)` to `np.isclose(arr, np.round(arr), rtol=0.0, atol=1e-12)`. The treatment / outcome column is user input (system boundary), and CSV-roundtripped count columns commonly carry float64 representation noise (e.g., `1.0` stored as `1.0000000000000002`). Tolerance-aware integer detection is the right discipline at the boundary; downstream the `is_count_like` heuristic remains gated on this AND `pct_zeros > 0` AND `skewness > 0.5` AND `n_distinct > 2`, so isolated noise can't flip the classification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>