
Wave 2: PanelProfile outcome/dose shape extensions + autonomous-guide worked examples #366

Open
igerber wants to merge 15 commits into main from agent-profile-shape-extensions

Conversation

@igerber
Owner

igerber commented Apr 24, 2026

Summary

  • Extend PanelProfile with outcome_shape: Optional[OutcomeShape] (numeric outcomes) and treatment_dose: Optional[TreatmentDoseShape] (continuous treatments). New fields populate distributional facts that gate WooldridgeDiD (QMLE) and ContinuousDiD prerequisites pre-fit.
  • Add three end-to-end worked examples to llms-autonomous.txt (binary-staggered with never-treated units, continuous-dose with zero baseline, count-shaped outcome). The new §5 sits between the existing §4 reasoning section and the existing §5 post-fit validation; existing §5–§8 are renumbered §6–§9.
  • Top-level exports: OutcomeShape, TreatmentDoseShape from diff_diff. Both new dataclasses are frozen and JSON-serializable via PanelProfile.to_dict().

Methodology references (required if estimator / math changes)

  • Method name(s): profile_panel is descriptive (not an estimator); the worked examples reference CallawaySantAnna, WooldridgeDiD, ContinuousDiD. No estimator math is changed in this PR.
  • Paper / source link(s): no new methodology claims beyond what is already in docs/methodology/REGISTRY.md.
  • Any intentional deviations from the source: None. The is_time_invariant field uses exact distinct-count on observed non-zero doses (matching the documented contract and ContinuousDiD.fit()'s exact equality check). The is_count_like and is_integer_valued fields are explicitly labeled heuristics in the docstring + guide.

Validation

  • Tests added/updated: 17 new unit tests in tests/test_profile_panel.py covering each shape heuristic (count-like Poisson, binary-as-not-count-like, continuous normal, bounded unit, categorical returning None, skewness/kurtosis gating, JSON roundtrip, time-invariant dose, time-varying dose, no-zero-dose, binary/categorical treatment returning None, frozen invariants on both new dataclasses, sub-1e-8 dose-precision regression). 2 new content-stability tests in tests/test_guides.py guard the §5 worked-examples block and the new field references.
  • Backtest / simulation / notebook evidence: N/A (descriptive utility + documentation).

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

@github-actions

Overall assessment

⚠️ Needs changes — one unmitigated P1 methodology/contract mismatch in the new ContinuousDiD pre-fit gating surface.

Executive summary

  • P1: The new TreatmentDoseShape fields are documented as the ContinuousDiD prerequisites, but they do not match what ContinuousDiD.fit() actually enforces. That can mark ineligible panels as eligible pre-fit.
  • The mismatch is internal as well as methodological: the existing autonomous guide still says ContinuousDiD requires has_never_treated and treatment_varies_within_unit == False, while the new section shifts the gate to row-level has_zero_dose and nonzero-only is_time_invariant.
  • P2: The new worked example uses WooldridgeDiD(family="poisson"), but the public API is WooldridgeDiD(method="poisson"). The new guide test hard-codes the wrong call.
  • The new tests do not cover the two cases that would have exposed the P1: 0,0,d,d within-unit dose paths, and row-level zeros without any never-treated units.
  • I did not find a mitigating entry for the P1 in TODO.md. This was a static source review; I could not execute the code in this environment.

Methodology

  • Severity: P1
    Impact: ContinuousDiD is still implemented as requiring unit-level never-treated controls (first_treat == 0) and full per-unit dose constancy, but the new profile surface defines has_zero_dose as “any zero row” and is_time_invariant on nonzero doses only, then states those new fields are the prerequisites. That is an undocumented mismatch between the shipped pre-fit guidance and the estimator contract. A panel with pre-period zeros and no never-treated units, or a 0,0,d,d dose path, can satisfy the new guide’s gate while ContinuousDiD.fit() still raises.
    Concrete fix: Either align _compute_treatment_dose() to the actual estimator contract, or keep the fields descriptive and stop claiming they alone gate ContinuousDiD. In either case, keep has_never_treated and full-path time-invariance as the authoritative prerequisites unless you also change ContinuousDiD.fit() and document that deviation in REGISTRY.md.
    Locations: diff_diff/profile.py:L656-L710, diff_diff/continuous_did.py:L222-L227, diff_diff/continuous_did.py:L348-L360, diff_diff/guides/llms-autonomous.txt:L115-L140, diff_diff/guides/llms-autonomous.txt:L198-L211, diff_diff/guides/llms-autonomous.txt:L690-L723, docs/methodology/REGISTRY.md:L719-L745
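The divergence between the two checks can be sketched in a few lines of plain Python (illustrative only; these are not the actual diff_diff internals, just the two counting rules the finding describes applied to a 0,0,d,d dose path):

```python
# A single unit's dose path with pre-treatment zeros: 0, 0, d, d.
dose_path = [0.0, 0.0, 2.5, 2.5]

# Profile-side check as shipped: distinct NON-ZERO doses only.
nonzero_distinct = len({d for d in dose_path if d != 0.0})

# Estimator-side check, mirroring ContinuousDiD.fit()'s
# groupby(unit)[dose].nunique() > 1 gate over the FULL dose column.
full_path_distinct = len(set(dose_path))

print(nonzero_distinct)   # 1 -> profile would call the path "time-invariant"
print(full_path_distinct) # 2 -> estimator sees within-unit variation and rejects
```

The same path therefore passes the new guide's gate while ContinuousDiD.fit() raises, which is exactly the contradiction flagged above.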

Code Quality

  • No findings in the changed Python implementation beyond the methodology contract issue above.

Performance

  • No findings.

Maintainability

  • No separate findings.

Tech Debt

  • No separate findings.
  • The P1 above is not mitigated by TODO.md; I did not find an existing tracked entry for it.

Security

  • No findings.

Documentation/Tests

  • Severity: P2
    Impact: The new autonomous guide tells users/agents to call WooldridgeDiD(family="poisson"), but the estimator constructor accepts method, not family. Anyone following the example will get a runtime error, and the new guide test now locks that bad example in place.
    Concrete fix: Change the guide and the stability test to WooldridgeDiD(method="poisson").
    Locations: diff_diff/guides/llms-autonomous.txt:L767-L771, diff_diff/wooldridge.py:L273-L275, tests/test_guides.py:L62-L64

  • Severity: P2
    Impact: The new tests only cover “constant dose in every row” and “changing nonzero dose,” so they miss the exact contradictory cases introduced by the new guidance: 0,0,d,d within-unit paths and zero rows without any never-treated units. That leaves the P1 above unguarded.
    Concrete fix: Add regression tests asserting consistency between profile_panel guidance and ContinuousDiD.fit() for:

    1. balanced panel with 0,0,d,d dose paths;
    2. panel with row-level zeros but no never-treated units.
      Locations: tests/test_profile_panel.py:L1028-L1105
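The second missing case, row-level zeros without never-treated units, comes down to a row-level versus unit-level distinction that a tiny sketch makes concrete (field names mirror the PR's profile surface; the panel itself is invented for illustration):

```python
# Every unit here is eventually treated, but each has zero-dose pre-periods.
panel = {
    "u1": [0.0, 0.0, 1.5, 1.5],  # dose by period
    "u2": [0.0, 2.0, 2.0, 2.0],
}

# Row-level fact: at least one observation has a zero dose.
has_zero_dose = any(d == 0.0 for path in panel.values() for d in path)

# Unit-level fact: some unit is never treated (an all-zero dose path).
has_never_treated = any(all(d == 0.0 for d in path) for path in panel.values())

print(has_zero_dose)     # True
print(has_never_treated) # False -> the two facts are distinct
```

A test asserting both facts on a panel like this would have caught the gate mismatch before merge.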

Path to Approval

  1. Make the new TreatmentDoseShape contract consistent with ContinuousDiD.fit() and the Methodology Registry, or explicitly scope it as descriptive-only and restore the existing prerequisites (has_never_treated, full-path time invariance) in the guide.
  2. Add regression tests for the two contradictory continuous-dose scenarios: 0,0,d,d paths and row-level zeros without never-treated controls.
  3. Fix the autonomous guide and its stability test to use WooldridgeDiD(method="poisson").

igerber added a commit that referenced this pull request Apr 24, 2026
…ptive-only; fix WooldridgeDiD method kwarg

P1 (TreatmentDoseShape vs ContinuousDiD contract):
- Reviewer correctly flagged that the new `is_time_invariant` field
  (per-unit non-zero distinct-count) does NOT match the actual
  `ContinuousDiD.fit()` gate at `continuous_did.py:222-228`, which
  uses `df.groupby(unit)[dose].nunique() > 1` over the FULL dose
  column (including pre-treatment zeros). My nonzero-only check
  silently classified `0,0,d,d` paths as time-invariant while
  ContinuousDiD would reject them.
- Removed `is_time_invariant` field from `TreatmentDoseShape`
  entirely. The pre-existing `PanelProfile.treatment_varies_within_unit`
  field already encodes the correct ContinuousDiD prerequisite (matches
  the estimator's nunique check at line 224) and is correctly documented
  in §2 of the autonomous guide. Adding a second, narrower, mismatched
  gate was confusing - the reviewer's "scope as descriptive-only" path
  is the cleaner fix.
- Reframed `TreatmentDoseShape` docstring + autonomous guide §2
  field reference: explicitly NOT a ContinuousDiD prerequisite.
  `n_distinct_doses`, `has_zero_dose`, `dose_min/max/mean` provide
  descriptive distributional context; `has_never_treated` (unit-level)
  + `treatment_varies_within_unit == False` (full-path constancy) +
  `is_balanced` are the authoritative gates.
- Rewrote §5.2 worked example reasoning chain to use the existing
  correct gates and added a counter-example showing
  `has_zero_dose=True` does NOT imply `has_never_treated=True` (the
  row-level vs unit-level distinction).
- Added `test_treatment_dose_does_not_gate_continuous_did` covering
  the two contradictory cases the reviewer named: (1) `0,0,d,d`
  within-unit dose path, asserting `treatment_varies_within_unit=True`
  (the actual ContinuousDiD gate fires correctly); (2) row-level zeros
  without never-treated units, asserting `has_zero_dose=True` BUT
  `has_never_treated=False` (the two facts are distinct).
- Removed `test_treatment_dose_continuous_time_varying_within_unit`
  and `test_treatment_dose_distinguishes_doses_at_high_precision` -
  both tested the dropped `is_time_invariant` field.

P2 (WooldridgeDiD constructor kwarg):
- The autonomous guide §5.3 worked example used
  `WooldridgeDiD(family="poisson")` but the actual constructor at
  `wooldridge.py:264` takes `method=`. Following the example would
  raise `TypeError: __init__() got an unexpected keyword argument
  'family'`. Fixed in two places (the prose and the code snippet)
  and added a negative assertion in `test_guides.py` to prevent
  regression: `assert 'WooldridgeDiD(family="poisson")' not in text`.

CHANGELOG updated to reflect the revised TreatmentDoseShape scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: cc91a7dad85dbb36ec713b2deedeae7d0938e3bc


Overall Assessment
Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • The prior re-review P1 is resolved: the new TreatmentDoseShape surface is now explicitly descriptive-only, and the guide/tests defer to the same ContinuousDiD gates the estimator and Methodology Registry enforce: has_never_treated, treatment_varies_within_unit, and is_balanced. diff_diff/profile.py:L64-L83, diff_diff/profile.py:L658-L697, diff_diff/continuous_did.py:L222-L360, docs/methodology/REGISTRY.md:L717-L745, diff_diff/guides/llms-autonomous.txt:L198-L220, diff_diff/guides/llms-autonomous.txt:L499-L507, diff_diff/guides/llms-autonomous.txt:L732-L757, tests/test_profile_panel.py:L1057-L1115
  • The prior re-review P2 is resolved: the worked example now uses WooldridgeDiD(method="poisson"), matching the public API and the Wooldridge documentation. diff_diff/guides/llms-autonomous.txt:L794-L805, diff_diff/wooldridge.py:L271-L276, docs/methodology/REGISTRY.md:L1333-L1349, tests/test_guides.py:L50-L71
  • I did not find estimator-math, weighting, variance/SE, identification-assumption, or default-behavior changes in the PR; the changed surface is descriptive profiling, exports, guide content, and tests. diff_diff/profile.py:L43-L200, diff_diff/__init__.py:L253-L259, diff_diff/guides/llms-autonomous.txt:L637-L811
  • P3: ROADMAP.md still overstates the new fields as if they themselves gate Wooldridge/ContinuousDiD routing, which conflicts with the corrected guide/docstrings. ROADMAP.md:L141-L141
  • P3: the new top-level exports lack a matching regression in the existing top-level import-surface test. diff_diff/__init__.py:L253-L259, diff_diff/__init__.py:L503-L508, tests/test_profile_panel.py:L404-L413
  • Static source review only: I could not execute tests here because this environment is missing pytest and even numpy.

Methodology

  • No findings. The previous ContinuousDiD contract mismatch is fixed and now matches both the estimator implementation and REGISTRY.md. diff_diff/profile.py:L64-L83, diff_diff/continuous_did.py:L222-L360, docs/methodology/REGISTRY.md:L717-L745
  • No findings. The previous Wooldridge guide/API mismatch is fixed. diff_diff/guides/llms-autonomous.txt:L794-L805, diff_diff/wooldridge.py:L271-L276

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • P3 — ROADMAP.md still says the new extensions expose facts that “gate WooldridgeDiD QMLE / ContinuousDiD prerequisites pre-fit,” which reintroduces the exact conceptual confusion the guide/docstrings just corrected. Reword it to say the new fields add descriptive outcome/dose-shape context, while the authoritative ContinuousDiD gates remain has_never_treated, treatment_varies_within_unit, and is_balanced, and outcome_shape.is_count_like only informs Poisson-QMLE consideration. ROADMAP.md:L141-L141, compare diff_diff/profile.py:L64-L83, diff_diff/guides/llms-autonomous.txt:L198-L220, diff_diff/guides/llms-autonomous.txt:L499-L507

Tech Debt

  • No findings. I do not see a new correctness issue that should be deferred into TODO.md.

Security

  • No findings.

Documentation/Tests

  • P3 — The PR adds OutcomeShape and TreatmentDoseShape to the public top-level namespace, but test_top_level_import_surface() still only covers profile_panel, PanelProfile, and Alert; extend that test to assert diff_diff.OutcomeShape, diff_diff.TreatmentDoseShape, and their presence in diff_diff.__all__. diff_diff/__init__.py:L253-L259, diff_diff/__init__.py:L503-L508, tests/test_profile_panel.py:L404-L413
  • Review note: I could not run the added tests in this environment because pytest is unavailable and importing the package runtime also fails for lack of numpy.

igerber added a commit that referenced this pull request Apr 24, 2026
…corrected scope; cover new exports in import-surface test

P3 #1 (ROADMAP wording drift):
ROADMAP.md still said the new fields "gate WooldridgeDiD QMLE /
ContinuousDiD prerequisites pre-fit" and mentioned "time-invariance",
which contradicted the round-1 corrections to TreatmentDoseShape's
docstring + autonomous guide §2 + §5.2. Reworded to match: the new
fields add descriptive distributional context only;
`outcome_shape.is_count_like` informs (not gates) the WooldridgeDiD
QMLE judgment, and the authoritative ContinuousDiD pre-fit gates
remain `has_never_treated`, `treatment_varies_within_unit`, and
`is_balanced`. "Time-invariance" wording removed (the field was
dropped in round 1).

P3 #2 (import-surface test coverage):
`test_top_level_import_surface()` previously only verified
`profile_panel`, `PanelProfile`, `Alert`. Extended to also cover the
two new public exports `OutcomeShape` and `TreatmentDoseShape`,
asserting both their importability and their presence in
`diff_diff.__all__`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 370915efd4056efc0cfe882a96041818a5d62239


Overall assessment

Looks good — no unmitigated P0/P1 findings. The prior substantive re-review issues I checked are resolved.

Executive summary

  • The earlier ContinuousDiD contract concern is fixed: the new treatment_dose surface is now explicitly descriptive-only, and the guide points back to the same field-based gates enforced in ContinuousDiD.fit() and documented in the registry. diff_diff/profile.py:L64-L89, diff_diff/guides/llms-autonomous.txt:L198-L220, diff_diff/guides/llms-autonomous.txt:L499-L507, diff_diff/continuous_did.py:L222-L360, docs/methodology/REGISTRY.md:L717-L745
  • The earlier Wooldridge worked-example/API mismatch is fixed: the guide now uses WooldridgeDiD(method="poisson"), matching the public API and the Wooldridge registry entry. diff_diff/guides/llms-autonomous.txt:L794-L805, diff_diff/wooldridge.py:L271-L275, docs/methodology/REGISTRY.md:L1333-L1349
  • The earlier top-level export/test gap is fixed: OutcomeShape and TreatmentDoseShape are exported from diff_diff and covered by the import-surface regression. diff_diff/__init__.py:L253-L259, diff_diff/__init__.py:L503-L510, tests/test_profile_panel.py:L404-L422
  • P3: the new “authoritative ContinuousDiD pre-fit gates” wording now sounds exhaustive but omits the separate duplicate-cell hard stop already documented elsewhere; _precompute_structures() still resolves duplicate (unit, time) rows by last-row-wins overwrite. diff_diff/guides/llms-autonomous.txt:L330-L343, diff_diff/guides/llms-autonomous.txt:L499-L507, diff_diff/continuous_did.py:L818-L823, diff_diff/profile.py:L69-L75, CHANGELOG.md:L11-L11, ROADMAP.md:L141-L141
  • Static review only: I could not run tests here because this environment is missing pytest, numpy, and pandas.

Methodology

  • No P0/P1 findings. The changed guide/docstring text now matches the current ContinuousDiD and WooldridgeDiD contracts and does not introduce an undocumented estimator-methodology deviation.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings beyond the documentation item below.

Tech Debt

  • No findings. I did not see a related mitigation entry needed in TODO.md.

Security

  • No findings.

Documentation/Tests

  • P3 — Impact: the new ContinuousDiD prerequisite summary omits the already-documented duplicate-row hard gate, so a reader treating the new treatment_dose summary as exhaustive can still route duplicate-containing panels into a code path that silently overwrites earlier (unit, time) rows. Concrete fix: in the new TreatmentDoseShape docs and the related guide/ROADMAP/CHANGELOG wording, change “authoritative gates are ...” to “core field-based gates are ... plus no duplicate_unit_time_rows alert,” and add a regression in tests/test_guides.py that the ContinuousDiD prerequisite summary still mentions duplicates. diff_diff/guides/llms-autonomous.txt:L330-L343, diff_diff/guides/llms-autonomous.txt:L499-L507, diff_diff/continuous_did.py:L818-L823, diff_diff/profile.py:L69-L75, CHANGELOG.md:L11-L11, ROADMAP.md:L141-L141
  • Verification note: I could not execute the new tests in this environment.
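The last-row-wins behavior the finding describes is easy to see with a plain dict standing in for the cell-by-cell matrix build (illustrative; not the actual _precompute_structures() code):

```python
# Rows keyed on (unit, time); the third row duplicates an existing cell.
rows = [
    ("u1", 1, 10.0),
    ("u1", 2, 12.0),
    ("u1", 2, 99.0),  # duplicate (unit, time) key
]

outcome_matrix = {}
for unit, time, y in rows:
    outcome_matrix[(unit, time)] = y  # later rows silently overwrite earlier ones

print(outcome_matrix[("u1", 2)])  # 99.0 -- the earlier value 12.0 is lost
```

This is why the absence of the duplicate_unit_time_rows alert is a real prerequisite rather than a style preference: nothing in the loop itself signals the overwrite.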

igerber added a commit that referenced this pull request Apr 25, 2026
…ontinuousDiD prerequisite summaries

Reviewer correctly noted that the round-2 wording lists
`has_never_treated` + `treatment_varies_within_unit == False` +
`is_balanced` as the "authoritative" ContinuousDiD pre-fit gates but
omits the duplicate-cell hard stop. Verified
`continuous_did.py:_precompute_structures` (line 818-823) builds
`outcome_matrix` cell-by-cell with last-row-wins on duplicate
`(unit, time)` keys - so absence of the `duplicate_unit_time_rows`
alert is also a real prerequisite, not just a style preference.

Updated wording in five places to add "+ absence of the
`duplicate_unit_time_rows` alert" alongside the other gates and
explain the silent-overwrite behavior:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring
- `diff_diff/guides/llms-autonomous.txt` §2 field reference
- `diff_diff/guides/llms-autonomous.txt` §4.7 (continuous design feature)
- `diff_diff/guides/llms-autonomous.txt` §5.2 worked example reasoning
  chain (now lists four gates instead of three)
- `CHANGELOG.md` Unreleased entry
- `ROADMAP.md` AI-Agent Track building-block

Also softened "authoritative" -> "core field-based" since the
non-field-based duplicate-row gate makes the original phrasing
slightly misleading.

Added a test_guides.py regression asserting the autonomous guide
mentions `duplicate_unit_time_rows` so future wording changes can't
silently drop the gate from the summary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 22ff5dc161534d087d0aa191b74f04efd99eaa43


Overall Assessment

⚠️ Needs changes — I found two unmitigated P1 methodology issues in the new count-outcome guidance/routing.

Executive Summary

  • The prior duplicate-row documentation issue appears resolved: the ContinuousDiD prerequisite summary now keeps duplicate_unit_time_rows in scope.
  • P1: The new Wooldridge Poisson guidance says the estimator returns a multiplicative/proportional effect, but the implementation and reporting contract expose an ASF-based ATT difference on the outcome scale.
  • P1: The new OutcomeShape.is_count_like heuristic does not exclude negative support, so it can steer negative integer outcomes toward WooldridgeDiD(method="poisson"), which hard-rejects negative outcomes.
  • P2: The added tests pin section presence and API spelling, but they do not guard the nonlinear Wooldridge target-parameter wording or the negative-count edge case.
  • Static review only: I could not run the test suite because pytest is not installed in this environment.

Methodology

  • Severity: P1. Affected method: WooldridgeDiD(method="poisson"). Impact: The new guide tells agents that Poisson ETWFE “estimates the multiplicative effect directly” and that they should “report the multiplicative effect (proportional change),” but the library’s own methodology contract is ASF-based ATT, and the Poisson path computes E[exp(η_1)] - E[exp(η_0)], i.e. an outcome-scale difference, not a ratio. An agent following the new worked example can therefore misreport overall_att on the wrong scale. Concrete fix: Rewrite the §4.11 / §5.3 wording so Poisson ETWFE is described as an ASF-based ATT on the natural outcome scale; if you want to mention proportional effects, label them as a separate derived interpretation, not the estimator’s overall_att target. Refs: diff_diff/guides/llms-autonomous.txt:L621-L629, diff_diff/guides/llms-autonomous.txt:L801-L811, docs/methodology/REGISTRY.md:L1335-L1357, diff_diff/wooldridge.py:L1191-L1225, diff_diff/_reporting_helpers.py:L262-L281.
  • Severity: P1. Affected surface: profile_panel outcome-shape routing for WooldridgeDiD(method="poisson"). Impact: is_count_like currently requires integer-valued data, zeros, positive skew, and >2 distinct values, but it does not require non-negative support. That means a right-skewed integer outcome with zeros and some negative values can still set is_count_like=True, after which the new guide treats it as a Poisson candidate even though the Poisson fitter raises on any negative outcome. This is a missing edge-case guard on the new routing signal. Concrete fix: Add a non-negativity requirement to is_count_like (value_min >= 0 or equivalent), document it, and add a regression test with negative integer outcomes plus zeros expecting is_count_like=False. Refs: diff_diff/profile.py:L621-L645, diff_diff/guides/llms-autonomous.txt:L621-L629, diff_diff/wooldridge.py:L1105-L1108.
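The second finding can be reproduced with a hedged sketch of the heuristic as the review describes it (conditions paraphrased from the finding; this is not the actual profile.py code, and the sample data is invented):

```python
import statistics

# Right-skewed integer outcomes with zeros AND some negative values.
values = [0, 0, 1, 1, 2, 3, 5, 8, -1, -2]

is_integer_valued = all(float(v).is_integer() for v in values)
pct_zeros = values.count(0) / len(values)
n_distinct = len(set(values))
mean, sd = statistics.mean(values), statistics.pstdev(values)
skewness = sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# All four documented conditions fire without a non-negativity guard...
fires = is_integer_valued and pct_zeros > 0 and skewness > 0.5 and n_distinct > 2
# ...yet min(values) < 0 means WooldridgeDiD(method="poisson") would raise.
needs_guard = fires and min(values) < 0
print(fires, needs_guard)  # True True
```

Adding `min(values) >= 0` to the conjunction is the one-line fix the review proposes.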

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. I did not see these P1 items mitigated in TODO.md, and they are not deferrable under the review policy.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: The new tests verify the worked-examples section exists and that the Poisson example uses method= rather than family=, but they do not verify the nonlinear Wooldridge estimand wording or the negative-outcome exclusion. Both P1 issues above can therefore pass the current suite. Concrete fix: Add guide-content assertions that §4.11/§5.3 do not describe overall_att as “multiplicative” / “proportional change,” and add a profile_panel regression test for a right-skewed integer outcome with zeros and some negative values. Refs: tests/test_guides.py:L62-L83, tests/test_profile_panel.py:L887-L907.

Path to Approval

  1. Fix the Poisson ETWFE wording in diff_diff/guides/llms-autonomous.txt so it matches the ASF-based ATT contract already documented in REGISTRY.md and _reporting_helpers.py.
  2. Tighten OutcomeShape.is_count_like to require non-negative support before it can steer users toward Poisson.
  3. Add regression tests for both the estimand wording and the negative-integer edge case.

igerber added a commit that referenced this pull request Apr 25, 2026
…n estimand wording + is_count_like non-negativity guard

P1 #1 (Wooldridge Poisson estimand wording):
The guide §4.11 and §5.3 worked example described
`WooldridgeDiD(method="poisson")`'s `overall_att` as a
"multiplicative effect" / "log-link effect" / "proportional change"
to be reported. Verified against `wooldridge.py:1225`
(`att = _avg(mu_1 - mu_0, cell_mask)`) and
`_reporting_helpers.py:262-281` (registered estimand: "ASF-based
average from Wooldridge ETWFE ... average-structural-function (ASF)
contrast between treated and counterfactual untreated outcomes ...
on the natural outcome scale"): the actual quantity is
`E[exp(η_1)] - E[exp(η_0)]`, an outcome-scale DIFFERENCE, not a
multiplicative ratio. An agent following the previous wording would
misreport the headline scalar.

Rewrote both surfaces to:
- Describe the estimand as an ASF-based outcome-scale difference,
  citing `wooldridge.py:1225` and Wooldridge (2023) +
  REGISTRY.md §WooldridgeDiD nonlinear / ASF path.
- Explicitly note the headline `overall_att` is a difference on the
  natural outcome scale, NOT a multiplicative ratio.
- Mention that a proportional / percent-change interpretation can
  be derived post-hoc as `overall_att / E[Y_0]` but is not the
  estimator's reported scalar.

Added `test_autonomous_count_outcome_uses_asf_outcome_scale_estimand`
in `tests/test_guides.py`: extracts §4.11 and §5.3 blocks, asserts
forbidden phrases ("multiplicative effect under qmle", "estimates
the multiplicative effect", "multiplicative (log-link) effect",
"report the multiplicative effect", "report the multiplicative")
do NOT appear, and asserts §5.3 explicitly contains "ASF" and
"outcome scale" so future edits cannot silently weaken the
description.

P1 #2 (`is_count_like` non-negativity guard):
The `is_count_like` heuristic gated on integer-valued + has-zeros +
right-skewed + > 2 distinct values, but did NOT exclude negative
support. Verified against `wooldridge.py:1105-1109`: Poisson method
hard-rejects `y < 0` with `ValueError`. Without a value_min >= 0
guard, a right-skewed integer outcome with zeros and some negatives
would set `is_count_like=True` and steer an agent toward an
estimator that then refuses to fit.

Added `value_min >= 0.0` to the heuristic and explained the
non-negativity gate in the docstring + autonomous guide §2 field
reference (now reads
"is_integer_valued AND pct_zeros > 0 AND skewness > 0.5 AND
n_distinct_values > 2 AND value_min >= 0"). The guide also notes
that the gate exists specifically to align the routing signal with
WooldridgeDiD Poisson's hard non-negativity requirement.

Added `test_outcome_shape_count_like_excludes_negative_support` in
`tests/test_profile_panel.py` covering a Poisson-distributed outcome
with a small share of negative integers spliced in: asserts
`is_count_like=False` despite the other four conditions firing.

P2 (test coverage for both P1s):
Both regressions above guard the new contracts. The guide test
guards the wording surface; the profile test guards the heuristic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
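The estimand distinction fixed above can be illustrated numerically: under a log link the ASF contrast is E[exp(η_1)] - E[exp(η_0)], a difference on the outcome scale, not a ratio. A minimal sketch, with made-up linear predictors (nothing here is diff_diff code):

```python
import math

eta_1 = [1.2, 0.9, 1.5]  # treated linear predictors (hypothetical)
eta_0 = [0.8, 0.6, 1.1]  # counterfactual untreated linear predictors

mu_1 = sum(math.exp(e) for e in eta_1) / len(eta_1)
mu_0 = sum(math.exp(e) for e in eta_0) / len(eta_0)

# The outcome-scale DIFFERENCE the commit says the estimator reports.
att_difference = mu_1 - mu_0
# A post-hoc proportional interpretation, derived, not the reported scalar.
derived_ratio = att_difference / mu_0

print(round(att_difference, 3), round(derived_ratio, 3))
```

The two numbers are on entirely different scales, which is why labeling overall_att as a "multiplicative effect" would have misled an agent reporting the headline scalar.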
@igerber
Owner Author

igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b444300244662048617cd21fc6357d5cae0b1e32


Overall Assessment

⚠️ Needs changes — the two prior Wooldridge-related P1s look resolved in the provided diff, but there is one P1 [Newly identified] in the new ContinuousDiD pre-fit routing/docs.

Executive Summary

  • The prior Poisson issues appear fixed in the provided diff: WooldridgeDiD(method="poisson") is now described as ASF/outcome-scale, and outcome_shape.is_count_like now excludes negative outcomes.
  • P1 [Newly identified]: the new treatment_dose / ContinuousDiD prerequisite summaries omit the estimator’s strictly-positive-treated-dose restriction. A panel can satisfy every documented gate and still hard-fail in ContinuousDiD.fit() on negative dose support.
  • The added tests cover the two prior Poisson regressions, but there is still no regression around negative-dose ContinuousDiD routing.
  • No findings in code quality, performance, security, or tracked-tech-debt handling.
  • Static review only: pytest is not installed in this environment, so I could not run the suite.

Methodology

Re-review note: the previous Poisson estimand-wording and negative-count-routing findings appear resolved in the provided diff.

  • Severity: P1 [Newly identified]. Impact: the PR now presents has_never_treated, treatment_varies_within_unit == False, is_balanced, and no duplicate_unit_time_rows alert as the authoritative pre-fit gates for ContinuousDiD, and §5.2 says that if those pass the estimator is “in scope.” That omits the implementation’s hard support restriction that treated doses must be strictly positive; ContinuousDiD.fit() raises on negative treated dose support, and the methodology note is written on D > 0 support. An autonomous agent can therefore route a balanced, never-treated, time-invariant panel with treatment_dose.dose_min < 0 into a deterministic fit-time error. Concrete fix: extend the new routing contract so ContinuousDiD is only in scope when nonzero doses are strictly positive as well; at minimum, document treatment_dose.dose_min > 0 as a hard gate, and preferably expose an explicit negative-dose alert/boolean. Refs: diff_diff/profile.py:L65-L85, diff_diff/guides/llms-autonomous.txt:L737-L751, diff_diff/continuous_did.py:L287-L294, docs/methodology/continuous-did.md:L180-L184.
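A sketch of the gap: a panel that passes every documented field-based gate yet carries a negative dose (field names mirror the profile surface; the panel and gate computations are invented for illustration):

```python
panel = {
    "u1": [0.0, 0.0, 0.0, 0.0],     # never-treated control unit
    "u2": [-1.5, -1.5, -1.5, -1.5], # constant but NEGATIVE dose
}

has_never_treated = any(all(d == 0.0 for p in [path] for d in p)
                        for path in panel.values())
treatment_varies_within_unit = any(len(set(path)) > 1 for path in panel.values())
is_balanced = len({len(path) for path in panel.values()}) == 1

documented_gates_pass = (
    has_never_treated and not treatment_varies_within_unit and is_balanced
)
nonzero_doses = [d for path in panel.values() for d in path if d != 0.0]
dose_min = min(nonzero_doses)

print(documented_gates_pass)  # True -- yet fit would still hard-fail
print(dose_min > 0)           # False -- the missing strictly-positive gate
```

Documenting `dose_min > 0` as a hard gate, as the fix proposes, closes exactly this hole.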

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. I did not find the ContinuousDiD gate omission documented as a registry deviation or tracked in TODO.md, so it is unmitigated.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests now guard the prior Poisson regressions, but they still do not cover the expanded ContinuousDiD routing surface for negative-dose panels. The treatment-dose tests only cover zero-baseline/no-zero/binary/categorical cases, and the guide tests do not assert any negative-dose exclusion. Concrete fix: add a regression with a balanced, never-treated, time-invariant continuous panel containing a negative nonzero dose and assert the routing surface rejects ContinuousDiD (via explicit alert/boolean or guide wording keyed to dose_min < 0). Refs: tests/test_profile_panel.py:L1037-L1187, tests/test_guides.py:L62-L105.

Path to Approval

  1. Update the new TreatmentDoseShape / autonomous-guide prerequisite summaries so ContinuousDiD is only “in scope” when nonzero doses are strictly positive in addition to being zero-supported, time-invariant, balanced, and duplicate-free.
  2. Add a targeted regression for a negative-dose continuous panel that otherwise passes the existing documented gates, so this support restriction is covered going forward.

igerber force-pushed the agent-profile-shape-extensions branch from b444300 to ed57c7c on April 25, 2026 at 00:41
igerber added a commit that referenced this pull request Apr 25, 2026
…ptive-only; fix WooldridgeDiD method kwarg

P1 (TreatmentDoseShape vs ContinuousDiD contract):
- Reviewer correctly flagged that the new `is_time_invariant` field
  (per-unit non-zero distinct-count) does NOT match the actual
  `ContinuousDiD.fit()` gate at `continuous_did.py:222-228`, which
  uses `df.groupby(unit)[dose].nunique() > 1` over the FULL dose
  column (including pre-treatment zeros). My nonzero-only check
  silently classified `0,0,d,d` paths as time-invariant while
  ContinuousDiD would reject them.
- Removed `is_time_invariant` field from `TreatmentDoseShape`
  entirely. The pre-existing `PanelProfile.treatment_varies_within_unit`
  field already encodes the correct ContinuousDiD prerequisite (matches
  the estimator's nunique check at line 224) and is correctly documented
  in §2 of the autonomous guide. Adding a second, narrower, mismatched
  gate was confusing - the reviewer's "scope as descriptive-only" path
  is the cleaner fix.
- Reframed `TreatmentDoseShape` docstring + autonomous guide §2
  field reference: explicitly NOT a ContinuousDiD prerequisite.
  `n_distinct_doses`, `has_zero_dose`, `dose_min/max/mean` provide
  descriptive distributional context; `has_never_treated` (unit-level)
  + `treatment_varies_within_unit == False` (full-path constancy) +
  `is_balanced` are the authoritative gates.
- Rewrote §5.2 worked example reasoning chain to use the existing
  correct gates and added a counter-example showing
  `has_zero_dose=True` does NOT imply `has_never_treated=True` (the
  row-level vs unit-level distinction).
- Added `test_treatment_dose_does_not_gate_continuous_did` covering
  the two contradictory cases the reviewer named: (1) `0,0,d,d`
  within-unit dose path, asserting `treatment_varies_within_unit=True`
  (the actual ContinuousDiD gate fires correctly); (2) row-level zeros
  without never-treated units, asserting `has_zero_dose=True` BUT
  `has_never_treated=False` (the two facts are distinct).
- Removed `test_treatment_dose_continuous_time_varying_within_unit`
  and `test_treatment_dose_distinguishes_doses_at_high_precision` -
  both tested the dropped `is_time_invariant` field.
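The two contradictory cases named above can be checked on a minimal panel (hypothetical data; the groupby checks mirror the gates quoted from `continuous_did.py:222-228`, not the library code itself):

```python
import pandas as pd

# Hypothetical panel mirroring the two cases: unit 1 follows a 0,0,d,d
# dose path; unit 2 is always dosed; no unit stays at zero for its
# whole observed path.
df = pd.DataFrame({"unit": [1, 1, 1, 1, 2, 2, 2, 2],
                   "time": [1, 2, 3, 4, 1, 2, 3, 4],
                   "dose": [0.0, 0.0, 2.5, 2.5, 1.0, 1.0, 1.0, 1.0]})

# (1) The actual ContinuousDiD gate: nunique over the FULL dose column,
# so the 0,0,d,d path correctly counts as within-unit variation
treatment_varies_within_unit = bool(
    (df.groupby("unit")["dose"].nunique() > 1).any())        # fires True

# (2) Row-level zeros are not unit-level never-treated status
has_zero_dose = bool((df["dose"] == 0).any())                # True
unit_max_dose = df.groupby("unit")["dose"].max()
has_never_treated = bool((unit_max_dose == 0).any())         # False
```

The full-column nunique check is what makes `0,0,d,d` paths visible, while the row-level/unit-level split is why `has_zero_dose` and `has_never_treated` stay distinct facts.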

P2 (WooldridgeDiD constructor kwarg):
- The autonomous guide §5.3 worked example used
  `WooldridgeDiD(family="poisson")` but the actual constructor at
  `wooldridge.py:264` takes `method=`. Following the example would
  raise `TypeError: __init__() got an unexpected keyword argument
  'family'`. Fixed in two places (the prose and the code snippet)
  and added a negative assertion in `test_guides.py` to prevent
  regression: `assert 'WooldridgeDiD(family="poisson")' not in text`.
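A minimal stand-in class (not the real estimator; it mirrors only the `method=` keyword described above, per `wooldridge.py:264`) shows the failure an agent following the old example would hit:

```python
class WooldridgeDiDStub:
    """Stand-in mirroring only the method= keyword; not WooldridgeDiD."""
    def __init__(self, method="ols"):
        self.method = method

error_message = ""
try:
    WooldridgeDiDStub(family="poisson")    # the old guide wording
except TypeError as exc:
    error_message = str(exc)               # "...unexpected keyword argument 'family'"

est = WooldridgeDiDStub(method="poisson")  # the corrected kwarg
```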

CHANGELOG updated to reflect the revised TreatmentDoseShape scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…corrected scope; cover new exports in import-surface test

P3 #1 (ROADMAP wording drift):
ROADMAP.md still said the new fields "gate WooldridgeDiD QMLE /
ContinuousDiD prerequisites pre-fit" and mentioned "time-invariance",
which contradicted the round-1 corrections to TreatmentDoseShape's
docstring + autonomous guide §2 + §5.2. Reworded to match: the new
fields add descriptive distributional context only;
`outcome_shape.is_count_like` informs (not gates) the WooldridgeDiD
QMLE judgment, and the authoritative ContinuousDiD pre-fit gates
remain `has_never_treated`, `treatment_varies_within_unit`, and
`is_balanced`. "Time-invariance" wording removed (the field was
dropped in round 1).

P3 #2 (import-surface test coverage):
`test_top_level_import_surface()` previously only verified
`profile_panel`, `PanelProfile`, `Alert`. Extended to also cover the
two new public exports `OutcomeShape` and `TreatmentDoseShape`,
asserting both their importability and their presence in
`diff_diff.__all__`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…ontinuousDiD prerequisite summaries

Reviewer correctly noted that the round-2 wording lists
`has_never_treated` + `treatment_varies_within_unit == False` +
`is_balanced` as the "authoritative" ContinuousDiD pre-fit gates but
omits the duplicate-cell hard stop. Verified
`continuous_did.py:_precompute_structures` (line 818-823) builds
`outcome_matrix` cell-by-cell with last-row-wins on duplicate
`(unit, time)` keys - so absence of the `duplicate_unit_time_rows`
alert is also a real prerequisite, not just a style preference.
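The silent-overwrite behavior described above can be sketched directly (hypothetical data; a dict fill standing in for the cell-by-cell matrix build, not the `_precompute_structures` code):

```python
import pandas as pd

# Hypothetical panel with a duplicated (unit, time) cell
df = pd.DataFrame({"unit": [1, 1, 2],
                   "time": [2000, 2000, 2000],
                   "y":    [5.0, 9.0, 3.0]})

# Cell-by-cell fill with last-row-wins, mirroring the described behavior
outcome_matrix = {}
for row in df.itertuples(index=False):
    outcome_matrix[(row.unit, row.time)] = row.y  # duplicate silently overwrites

# The first observation for (1, 2000) — y = 5.0 — is gone without warning
```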

Updated wording in five places to add "+ absence of the
`duplicate_unit_time_rows` alert" alongside the other gates and
explain the silent-overwrite behavior:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring
- `diff_diff/guides/llms-autonomous.txt` §2 field reference
- `diff_diff/guides/llms-autonomous.txt` §4.7 (continuous design feature)
- `diff_diff/guides/llms-autonomous.txt` §5.2 worked example reasoning
  chain (now lists four gates instead of three)
- `CHANGELOG.md` Unreleased entry
- `ROADMAP.md` AI-Agent Track building-block

Also softened "authoritative" -> "core field-based" since the
non-field-based duplicate-row gate makes the original phrasing
slightly misleading.

Added a test_guides.py regression asserting the autonomous guide
mentions `duplicate_unit_time_rows` so future wording changes can't
silently drop the gate from the summary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…n estimand wording + is_count_like non-negativity guard

P1 #1 (Wooldridge Poisson estimand wording):
The guide §4.11 and §5.3 worked example described
`WooldridgeDiD(method="poisson")`'s `overall_att` as a
"multiplicative effect" / "log-link effect" / "proportional change"
to be reported. Verified against `wooldridge.py:1225`
(`att = _avg(mu_1 - mu_0, cell_mask)`) and
`_reporting_helpers.py:262-281` (registered estimand: "ASF-based
average from Wooldridge ETWFE ... average-structural-function (ASF)
contrast between treated and counterfactual untreated outcomes ...
on the natural outcome scale"): the actual quantity is
`E[exp(η_1)] - E[exp(η_0)]`, an outcome-scale DIFFERENCE, not a
multiplicative ratio. An agent following the previous wording would
misreport the headline scalar.

Rewrote both surfaces to:
- Describe the estimand as an ASF-based outcome-scale difference,
  citing `wooldridge.py:1225` and Wooldridge (2023) +
  REGISTRY.md §WooldridgeDiD nonlinear / ASF path.
- Explicitly note the headline `overall_att` is a difference on the
  natural outcome scale, NOT a multiplicative ratio.
- Mention that a proportional / percent-change interpretation can
  be derived post-hoc as `overall_att / E[Y_0]` but is not the
  estimator's reported scalar.
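On toy numbers (hypothetical log-link linear predictors, not model output), the distinction between the reported ASF difference and a post-hoc proportional reading looks like:

```python
import numpy as np

# Hypothetical linear predictors under a log link for treated cells
eta_1 = np.array([1.2, 0.8, 1.0])   # treated potential-outcome index
eta_0 = np.array([0.9, 0.6, 0.7])   # counterfactual untreated index

mu_1, mu_0 = np.exp(eta_1), np.exp(eta_0)

# The reported scalar per the commit: an outcome-scale DIFFERENCE of ASFs
att = float((mu_1 - mu_0).mean())

# A proportional reading can be derived post-hoc, but it is a
# separate quantity, not the estimator's headline scalar
pct_change = att / float(mu_0.mean())
```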

Added `test_autonomous_count_outcome_uses_asf_outcome_scale_estimand`
in `tests/test_guides.py`: extracts §4.11 and §5.3 blocks, asserts
forbidden phrases ("multiplicative effect under qmle", "estimates
the multiplicative effect", "multiplicative (log-link) effect",
"report the multiplicative effect", "report the multiplicative")
do NOT appear, and asserts §5.3 explicitly contains "ASF" and
"outcome scale" so future edits cannot silently weaken the
description.

P1 #2 (`is_count_like` non-negativity guard):
The `is_count_like` heuristic gated on integer-valued + has-zeros +
right-skewed + > 2 distinct values, but did NOT exclude negative
support. Verified against `wooldridge.py:1105-1109`: Poisson method
hard-rejects `y < 0` with `ValueError`. Without a value_min >= 0
guard, a right-skewed integer outcome with zeros and some negatives
would set `is_count_like=True` and steer an agent toward an
estimator that then refuses to fit.

Added `value_min >= 0.0` to the heuristic and explained the
non-negativity gate in the docstring + autonomous guide §2 field
reference (now reads
"is_integer_valued AND pct_zeros > 0 AND skewness > 0.5 AND
n_distinct_values > 2 AND value_min >= 0"). The guide also notes
that the gate exists specifically to align the routing signal with
WooldridgeDiD Poisson's hard non-negativity requirement.
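The five-condition heuristic as quoted can be sketched and exercised on exactly the spliced-negative fixture described below (a sketch of the documented conditions, not the library's implementation):

```python
import numpy as np

def count_like(y, skew_threshold=0.5):
    # Sketch of the five-condition heuristic quoted above; not the
    # library's implementation
    y = np.asarray(y, dtype=float)
    s = y.std()
    skew = float(np.mean(((y - y.mean()) / s) ** 3)) if s > 0 else 0.0
    return bool(np.allclose(y, np.round(y))   # is_integer_valued
                and (y == 0).any()            # pct_zeros > 0
                and skew > skew_threshold     # skewness > 0.5
                and np.unique(y).size > 2     # n_distinct_values > 2
                and y.min() >= 0.0)           # value_min >= 0 (round-4 guard)

rng = np.random.default_rng(0)
y = rng.poisson(1.0, size=500).astype(float)  # count-shaped outcome
y_neg = y.copy()
y_neg[:5] = -1.0                              # splice in negative integers
```

Without the final `y.min() >= 0.0` clause, `y_neg` would satisfy the other four conditions and still route toward an estimator that rejects it.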

Added `test_outcome_shape_count_like_excludes_negative_support` in
`tests/test_profile_panel.py` covering a Poisson-distributed outcome
with a small share of negative integers spliced in: asserts
`is_count_like=False` despite the other four conditions firing.

P2 (test coverage for both P1s):
Both regressions above guard the new contracts. The guide test
guards the wording surface; the profile test guards the heuristic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…e-dose gate to ContinuousDiD prerequisite summaries

P1 (newly identified — `dose_min > 0` ContinuousDiD gate omitted):
Reviewer correctly noted that the round-3 prerequisite summary
(`has_never_treated`, `treatment_varies_within_unit == False`,
`is_balanced`, no `duplicate_unit_time_rows` alert) omits the
estimator's strictly-positive-treated-dose restriction. Verified
`continuous_did.py:287-294` raises `ValueError` ("Dose must be
strictly positive for treated units (D > 0)") on negative treated
dose support — a panel can satisfy every documented gate above and
still hard-fail at fit time when `treatment_dose.dose_min < 0`.
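The gap can be seen on a minimal fixture (hypothetical data; the boolean check mirrors the screen described, not the estimator's code):

```python
import pandas as pd

# Hypothetical balanced two-unit panel: the treated unit carries a
# constant NEGATIVE dose, the control stays at zero — it passes the
# earlier documented gates but fails this one
df = pd.DataFrame({"unit": [1, 1, 2, 2],
                   "time": [1, 2, 1, 2],
                   "dose": [-0.5, -0.5, 0.0, 0.0]})

treated_doses = df.loc[df["dose"] != 0, "dose"]
dose_min = float(df["dose"].min())

# The fifth screen described above: nonzero doses must be strictly positive
passes_dose_screen = bool((treated_doses > 0).all())   # False here
```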

Updated wording in five surfaces to add `treatment_dose.dose_min > 0`
as the fifth pre-fit gate:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring (now spells
  out all five gates with numbered list referencing the
  `continuous_did.py:287-294` line range).
- `diff_diff/guides/llms-autonomous.txt` §2 field reference (notes
  `dose_min > 0` is itself a gate; the other treatment_dose
  sub-fields remain descriptive).
- `diff_diff/guides/llms-autonomous.txt` §4.7 (continuous design
  feature paragraph).
- `diff_diff/guides/llms-autonomous.txt` §5.2 worked example
  reasoning chain (now five gates checked; added a new
  counter-example covering the negative-dose path so an agent
  reading the example sees the contradictory case explicitly).
- `CHANGELOG.md` Unreleased entry.
- `ROADMAP.md` AI-Agent Track building-block.

P2 (regression test for negative-dose path):
Added `test_treatment_dose_min_flags_negative_dose_continuous_panels`
in `tests/test_profile_panel.py` covering a balanced, never-treated,
time-invariant continuous panel where every dose level is negative.
The test asserts that all four other gates pass cleanly and that
`dose.dose_min < 0` is correctly observed — the fixture an agent
would see when reasoning about whether ContinuousDiD applies.

Added `dose_min > 0` content-stability assertion in
`tests/test_guides.py` so future wording changes can't silently drop
the gate from the autonomous-guide summary.

Rebased onto origin/main; resolved CHANGELOG conflicts (kept HAD
Phase 4.5 B + dCDH by_path Wave 2 + my Wave 2 entries side by side).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: ed57c7cffa81eec1685695f3ad11894729e41372


Overall Assessment

⚠️ Needs changes — the prior negative-dose / Wooldridge Poisson wording issues look resolved, but there is one unmitigated P1 in the new ContinuousDiD routing contract.

Executive Summary

  • The earlier dose_min > 0 omission appears fixed: the guide now reflects ContinuousDiD’s negative-dose rejection, and the Wooldridge Poisson text now matches the ASF outcome-scale estimand.
  • P1 [Newly identified]: the PR repeatedly calls a five-item PanelProfile checklist the full ContinuousDiD pre-fit gate set, but ContinuousDiD.fit() still has separate first_treat validity/consistency gates and still requires P(D=0) > 0 regardless of control_group.
  • The new OutcomeShape / is_count_like heuristic is aligned with the Poisson fitter’s non-negativity requirement and does not look methodologically off.
  • Static review only: pytest is not installed in this environment, so I could not execute the added tests.

Methodology

  • Severity: P1 [Newly identified]. Impact: the new docs present has_never_treated, treatment_varies_within_unit == False, is_balanced, no duplicate_unit_time_rows, and treatment_dose.dose_min > 0 as the full ContinuousDiD gate set. That is not the estimator’s full contract. ContinuousDiD.fit() also validates the separate first_treat column, rejects NaN/negative values, drops first_treat > 0 units with dose == 0, zeroes first_treat == 0 rows with nonzero dose, and enforces P(D=0) > 0 even under control_group="not_yet_treated". An agent can therefore mark a panel as “in scope” from PanelProfile alone even though the actual first_treat input will deterministically warn/error or change the sample. Concrete fix: stop calling the five profile-derived facts the “full” gate set; describe them as the profile-side screening surface and add an explicit caveat that ContinuousDiD.fit() still requires a valid/consistent first_treat column. Refs: diff_diff/guides/llms-autonomous.txt:L203-L235, diff_diff/guides/llms-autonomous.txt:L514-L526, diff_diff/guides/llms-autonomous.txt:L758-L788, diff_diff/profile.py:L65-L99, diff_diff/continuous_did.py:L230-L360, docs/methodology/REGISTRY.md:L719-L745, docs/methodology/continuous-did.md:L182-L184, ROADMAP.md:L141, CHANGELOG.md:L12.

Code Quality

No findings.

Performance

No findings.

Maintainability

  • Severity: P3. Impact: the new internal commentary is inconsistent. Public docs now say treatment_dose.dose_min > 0 is part of the profile-side ContinuousDiD screen, but _compute_treatment_dose() and test_treatment_dose_does_not_gate_continuous_did() still say treatment-dose fields “do not gate” ContinuousDiD. That inconsistency will make future regressions more likely. Concrete fix: align the helper/test docstrings with the public contract (“most fields descriptive; dose_min > 0 is the extra screen”). Refs: diff_diff/profile.py:L65-L99, diff_diff/profile.py:L689-L694, tests/test_profile_panel.py:L1110-L1128.

Tech Debt

No findings. The P1 above is not mitigated by TODO.md tracking or a registry deviation note.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests pin the negative-dose wording and ASF outcome-scale wording, but they do not protect the missing first_treat caveat that causes the P1 above. Concrete fix: add a guide regression asserting that the ContinuousDiD section describes the five PanelProfile facts as screening checks only and explicitly mentions separate first_treat validation/consistency. Refs: tests/test_guides.py:L37-L69, diff_diff/guides/llms-autonomous.txt:L758-L788, diff_diff/continuous_did.py:L230-L360.

Path to Approval

  1. Replace every “full ContinuousDiD pre-fit gate set” claim in CHANGELOG.md, ROADMAP.md, diff_diff/profile.py, and diff_diff/guides/llms-autonomous.txt with “profile-side screening checks,” and document the separate first_treat contract from ContinuousDiD.fit().
  2. In llms-autonomous.txt §5.2, rewrite the has_never_treated == False counterexample so it states that the current implementation requires P(D=0) > 0 under both control_group options, not just the default.
  3. Add a regression in tests/test_guides.py that requires the ContinuousDiD section to mention separate first_treat validation/consistency; optionally pair it with a small tests/test_continuous_did.py scenario where the five profile-side checks pass but inconsistent first_treat still warns/errors.

igerber added a commit that referenced this pull request Apr 25, 2026
…nuousDiD prerequisite list as profile-side screening + add first_treat caveat

P1 (the five profile-derived facts are not the "full" gate set):
Reviewer correctly noted that calling
`{has_never_treated, treatment_varies_within_unit==False,
is_balanced, no duplicate_unit_time_rows alert, dose_min > 0}` the
"full ContinuousDiD pre-fit gate set" overreaches. `profile_panel`
only sees the four columns it accepts and CANNOT see the separate
`first_treat` column that `ContinuousDiD.fit()` consumes. Verified
against `continuous_did.py:230-360`: `fit()` additionally rejects
NaN/inf/negative `first_treat`, drops units with `first_treat > 0`
AND `dose == 0`, and force-zeroes `first_treat == 0` rows whose
`dose != 0` with a `UserWarning`. A panel that passes all five
profile-side checks can still surface warnings, drop rows, or raise
at fit time depending on the `first_treat` column the caller
supplies.
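The coercion path described above can be sketched on a fixture that passes every profile-side check (hypothetical data; the mask mirrors the documented `fit()` behavior at `continuous_did.py:230-360`, not the library code):

```python
import pandas as pd

# Hypothetical panel: unit 3 is flagged never-treated (first_treat == 0)
# but carries a nonzero dose — invisible to profile_panel, which never
# sees the first_treat column
df = pd.DataFrame({
    "unit":        [1, 1, 2, 2, 3, 3],
    "time":        [1, 2, 1, 2, 1, 2],
    "dose":        [2.0, 2.0, 0.0, 0.0, 1.5, 1.5],
    "first_treat": [2,   2,   0,   0,   0,   0],
})

# Mirror of the described coercion: first_treat == 0 rows with a
# nonzero dose are force-zeroed (with a UserWarning in the real fit())
mask = (df["first_treat"] == 0) & (df["dose"] != 0)
df.loc[mask, "dose"] = 0.0
n_coerced = int(mask.sum())   # rows silently changed despite the clean screen
```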

Reframed the wording in five surfaces from "full gate set" to
"profile-side screening checks" with an explicit caveat that the
checks are necessary-but-not-sufficient and that `ContinuousDiD.fit()`
applies separate `first_treat` validation:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring (now spells
  out the screening framing explicitly + lists the `first_treat`
  validations that fit() applies).
- `diff_diff/profile.py` `_compute_treatment_dose` helper docstring
  (aligned with public contract: most fields descriptive,
  `dose_min > 0` is one of the screening checks).
- `diff_diff/guides/llms-autonomous.txt` §2 field reference (rewrote
  the multi-paragraph block to describe screening + first_treat
  caveat).
- `diff_diff/guides/llms-autonomous.txt` §4.7 (continuous design
  feature paragraph: screening checks + necessary-not-sufficient
  language + pointer to §2).
- `diff_diff/guides/llms-autonomous.txt` §5.2 worked example
  reasoning chain (rewrote step 2 to call out screening +
  first_treat caveat; clarified counter-example #4 that
  `P(D=0) > 0` is required under BOTH `control_group="never_treated"`
  and `"not_yet_treated"`, not just default).
- `CHANGELOG.md` Unreleased entry.
- `ROADMAP.md` AI-Agent Track.

P2 (test coverage for the missing `first_treat` caveat):
Added a content-stability assertion in `tests/test_guides.py`:
`assert "first_treat" in text` so the autonomous guide cannot
silently drop the explicit `first_treat` validation caveat.

P3 (helper / test-name inconsistency with public contract):
Renamed `test_treatment_dose_does_not_gate_continuous_did` to
`test_treatment_dose_descriptive_fields_supplement_existing_gates`
and rewrote its docstring to match the now-honest public contract
("most fields descriptive distributional context that supplements
the existing top-level screening checks"). The test body still
asserts the same two things — `treatment_varies_within_unit` fires
True on `0,0,d,d` paths and `has_never_treated` is independent of
`has_zero_dose` — both of which remain accurate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 8601b611fa2c21684b91ecb58bf4b5df314557d6


Overall Assessment

Looks good — the prior P1 is resolved, and I did not find any new unmitigated P0/P1 issues in the changed code/docs.

Executive Summary

  • Prior re-review P1 is resolved: the ContinuousDiD language now consistently frames the five PanelProfile facts as a profile-side pre-flight screen and separately documents first_treat validation/coercion/drop behavior (diff_diff/profile.py:L65-L115, diff_diff/guides/llms-autonomous.txt:L203-L250, diff_diff/guides/llms-autonomous.txt:L530-L548, diff_diff/guides/llms-autonomous.txt:L780-L812, CHANGELOG.md:L12, ROADMAP.md:L141).
  • The new OutcomeShape.is_count_like routing is aligned with Wooldridge Poisson fit behavior by excluding negative-support outcomes, and the guide now correctly describes Poisson overall_att as an ASF-based outcome-scale difference (diff_diff/profile.py:L669-L683, docs/methodology/REGISTRY.md:L1333-L1348, diff_diff/wooldridge.py:L1102-L1109, diff_diff/wooldridge.py:L1191-L1245).
  • No estimator math, weighting, variance/SE, identification assumptions, or defaults changed in the implementation; this PR is limited to descriptive profiling fields, exports, guide content, and tests.
  • One minor documentation drift remains: the changelog summary omits the non-negativity clause from the is_count_like heuristic (CHANGELOG.md:L12).
  • Added tests cover the new profile fields and the prior review regression surface (tests/test_guides.py:L37-L185, tests/test_profile_panel.py:L889-L1313), but I could not execute them here because pytest and runtime deps such as numpy are unavailable.

Methodology

No findings. The changed docs now match the ContinuousDiD contract in docs/methodology/REGISTRY.md:L717-L745 and diff_diff/continuous_did.py:L222-L327,L348-L360, and the Wooldridge Poisson wording matches docs/methodology/REGISTRY.md:L1333-L1348 and diff_diff/wooldridge.py:L1191-L1245.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

  • Severity: P3
    Impact: CHANGELOG.md:L12 describes is_count_like without the value_min >= 0 / non-negative-support gate. Readers relying on the release notes alone may expect negative-valued outcomes to route as count-like even though the implementation intentionally suppresses that to stay compatible with WooldridgeDiD(method="poisson").
    Concrete fix: Update the changelog bullet to include the non-negativity clause, matching diff_diff/profile.py:L669-L683, diff_diff/guides/llms-autonomous.txt:L188-L198, and tests/test_profile_panel.py:L930-L971.

igerber added a commit that referenced this pull request Apr 25, 2026
…se in CHANGELOG is_count_like description

Reviewer correctly noted that the CHANGELOG bullet describing
`is_count_like` listed only four of the five conditions
(integer-valued + has zeros + right-skewed + > 2 distinct values)
but omitted the `value_min >= 0` non-negativity clause added in
round 4. Readers of the release notes alone would have expected
negative-valued count-like outcomes to route as `is_count_like=True`
even though the implementation intentionally suppresses that to stay
compatible with `WooldridgeDiD(method="poisson")`'s hard non-negative
requirement (`wooldridge.py:1105-1109`).

Updated the bullet to include the non-negativity clause and
explicitly cite the wooldridge.py line range so the release-notes
description matches `diff_diff/profile.py`,
`diff_diff/guides/llms-autonomous.txt`, and the
`test_outcome_shape_count_like_excludes_negative_support`
regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: bc13abb50011b5a5495d6448d458cff7dfa157e4


Overall Assessment

✅ Looks good

Executive Summary

Methodology

No findings. Cross-checking the changed profile/guide language against docs/methodology/REGISTRY.md:L717-L745 and docs/methodology/REGISTRY.md:L1333-L1349 found no undocumented deviation in estimator behavior.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

  • Severity: P3
    Impact: §4.7 contains two different summaries of the ContinuousDiD pre-fit contract. The opening bullet says there are only three eligibility prerequisites, while the later paragraph correctly adds the duplicate-row hard stop and the negative-dose screen. A reader who stops at the first summary can miss two load-bearing gates and route a duplicate-containing or negative-dose panel into a failed or silent-overwrite fit path (diff_diff/guides/llms-autonomous.txt:L507-L548, diff_diff/continuous_did.py:L222-L338, diff_diff/continuous_did.py:L800-L823).
    Concrete fix: Update the opening ContinuousDiD bullet in §4.7 to enumerate all five profile-side checks, or relabel the first three as “core estimator prerequisites” and explicitly cross-reference the fuller five-check screen immediately below.

Static review only: the added coverage in tests/test_profile_panel.py:L887-L1313 and tests/test_guides.py:L37-L182 looks appropriate, but I could not run it in this environment.

igerber added a commit that referenced this pull request Apr 25, 2026
…D opening bullet with the five-check screen below it

§4.7 had two summaries of the ContinuousDiD pre-fit contract that
disagreed: the opening bullet (Wave 1 era) said "Three eligibility
prerequisites", while the paragraph immediately below correctly
listed all five profile-side screening checks (Wave 2 added the
duplicate-row hard stop and the negative-dose `dose_min > 0`
screen). A reader who stopped at the opening bullet could miss
those two load-bearing gates and route a duplicate-containing or
negative-dose panel into a failed/silent-overwrite fit path.

Updated the opening bullet to enumerate all five profile-side
checks: (a) zero-dose controls / `P(D=0) > 0`, (b) per-unit
time-invariant dose, (c) balanced panel, (d) no
`duplicate_unit_time_rows` alert, (e) strictly positive treated
doses. Also clarified that the duplicate-row case (d) coerces
silently rather than raising — distinct from (a)/(b)/(c)/(e) which
all raise ValueError. Pointer to the paragraph below + §2 for the
`first_treat` validation surface kept intact.
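The five checks can be read as a single profile-side screen (a sketch over a hypothetical profile dict with the fields named above, not the `PanelProfile` API):

```python
def passes_continuous_did_screen(profile):
    # Sketch of the five profile-side checks enumerated above;
    # necessary but not sufficient — first_treat is validated separately
    return bool(profile["has_never_treated"]                       # (a) P(D=0) > 0
                and not profile["treatment_varies_within_unit"]    # (b) constant dose
                and profile["is_balanced"]                         # (c) balanced
                and "duplicate_unit_time_rows" not in profile["alerts"]  # (d)
                and profile["dose_min"] > 0)                       # (e) positive doses

ok = passes_continuous_did_screen({
    "has_never_treated": True,
    "treatment_varies_within_unit": False,
    "is_balanced": True,
    "alerts": [],
    "dose_min": 1.5,
})
```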

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a8880b46a47e39fb3c89a60718cc9b80a7bc2994


Overall Assessment

✅ Looks good

Executive Summary

  • Prior re-review issue is resolved: the autonomous guide now consistently names all five ContinuousDiD profile-side checks and the separate first_treat validation surface.
  • No estimator implementation, weighting, variance/SE computation, identification assumption, or default behavior changed in this PR; the code changes are confined to descriptive PanelProfile metadata, exports, guide prose, and tests.
  • One methodology-facing docs issue remains: the new count-outcome guidance justifies Poisson QMLE by saying linear DiD asymptotic SEs assume “normal-shaped errors,” which does not match the library’s documented cluster-robust OLS inference contract.
  • One minor docs clarity issue remains: the new ContinuousDiD worked example is written like a zero-pre/post-dose adoption path even though the estimator, the profile gate, and the same section later reject 0,0,d,d dose paths.
  • Static review only: I could not run the added tests here because this environment does not have pytest, numpy, or pandas.

Methodology

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

  • Severity: P3. Impact: The new ContinuousDiD worked example is titled “zero baseline” and describes “block-style adoption in year 3,” but the shown profile has treatment_varies_within_unit=False, and the same section later correctly says a 0,0,d,d path is out of scope. That mixed framing can confuse an agent about the estimator’s full-path dose-invariance contract. Location: diff_diff/guides/llms-autonomous.txt:L759-L781, diff_diff/guides/llms-autonomous.txt:L822-L828, diff_diff/continuous_did.py:L222-L228. Concrete fix: rename the example to something like “continuous-dose panel with zero-dose controls,” and state explicitly that positive-dose units keep the same dose in every observed period while first_treat carries the timing information separately.

No additional test-coverage findings in the diff. Testing note: I could not execute tests/test_profile_panel.py or tests/test_guides.py in this environment because the required Python packages are not installed.

igerber added a commit that referenced this pull request Apr 25, 2026
…ationale in functional-form terms; rename §5.2 + clarify dose constancy

P2 (linear-OLS SE rationale wording):
The §4.11 and §5.3 prose justified `WooldridgeDiD(method="poisson")`
over linear-OLS DiD by claiming linear-OLS asymptotic SEs "assume
normal-shaped errors" that a right-skewed count distribution
violates. That misrepresents the library's documented inference
contract: linear ETWFE uses cluster-robust sandwich SEs (`OLS path`
in REGISTRY.md) which are valid for any error distribution with
finite moments — not just Gaussian.

Rewrote both surfaces in functional-form / efficiency terms:
- Linear-OLS DiD imposes an additive functional form on a
  non-negative count outcome. SEs are calibrated (cluster-robust)
  but the linear model can be inefficient and may produce
  counterfactual predictions outside the non-negative support.
- `WooldridgeDiD(method="poisson")` imposes a multiplicative
  (log-link) functional form that respects non-negativity and
  matches the typical generative process for count data; QMLE
  sandwich SEs are robust to distributional misspecification.
- The choice is about WHICH functional form best summarizes the
  treatment effect (additive vs multiplicative), not about
  whether SEs are calibrated.

§5.3 reasoning chain step 2 + step 3 rewritten to reflect this:
both estimators are now described as having calibrated inference;
the trade-off is parameterization (linear ATT vs. ASF outcome-scale
ATT). The Poisson non-negativity gate explanation is preserved.
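The support argument can be illustrated on toy numbers (hypothetical counts and effect sizes, not estimates): an additive counterfactual can leave the non-negative count support, a multiplicative (log-link) one cannot.

```python
import numpy as np

# Hypothetical pre-period counts
baseline = np.array([0.0, 1.0, 2.0, 5.0])

additive_effect = -1.5
linear_cf = baseline + additive_effect         # dips below zero for small counts

multiplicative_effect = np.exp(-0.4)           # log-link scale factor
loglink_cf = baseline * multiplicative_effect  # stays within [0, inf)
```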

P3 (§5.2 example self-contradiction):
§5.2 was titled "Continuous-dose panel with zero baseline" with
prose "block-style adoption in year 3", which suggested a
`0,0,d,d` within-unit dose path — but the shown profile has
`treatment_varies_within_unit=False` (per-unit constant dose) and
the same section's later counter-example correctly says `0,0,d,d`
is out of scope for ContinuousDiD. Self-contradictory framing.

Renamed to "Continuous-dose panel with zero-dose controls" (per
reviewer suggestion) and clarified the prose: each dose group holds
its assigned dose value (including 0 for the never-treated
controls) in every observed period; adoption timing is carried
separately via the `first_treat` column passed to
`ContinuousDiD.fit()`, not via within-unit dose variation. Updated
the matching `test_autonomous_contains_worked_examples_section`
assertion in `tests/test_guides.py` to track the new title.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 0d711f1c334cb3fb6cccad02ab6eecfbcacac0ad


Overall Assessment

✅ Looks good

Executive Summary

  • Re-review scope: the prior continuous-dose example issue appears resolved. §5.2 now keeps positive-dose units time-invariant across the full observed path and carries timing separately in first_treat, which matches ContinuousDiD.fit()’s contract in diff_diff/guides/llms-autonomous.txt:L769-L858 and diff_diff/continuous_did.py:L222-L360.
  • No estimator implementation, weighting, variance/SE computation, identification assumption, or default behavior changed in this PR. The code changes are limited to descriptive PanelProfile metadata, exports, guide prose, and targeted tests.
  • The ContinuousDiD screening narrative is now materially better aligned with the implementation: it explicitly names the duplicate-row hard stop, the dose_min > 0 gate, and the separate first_treat validation surface in diff_diff/guides/llms-autonomous.txt:L203-L251, diff_diff/guides/llms-autonomous.txt:L539-L556, and diff_diff/continuous_did.py:L287-L360.
  • One methodology-facing docs issue remains from the prior review: the new outcome_shape.is_count_like field reference and the §4.11 preamble still imply count outcomes make linear DiD asymptotic SEs suspect, which conflicts with the registry’s cluster-robust OLS inference contract.
  • Static review only: python -m pytest tests/test_profile_panel.py tests/test_guides.py -q could not be run here because pytest is not installed.

Methodology

  • Severity: P2. Impact: The previous count-outcome docs finding is only partially resolved. The detailed §4.11/§5.3 discussion now correctly reframes Poisson-vs-linear as a functional-form/support/efficiency choice, but the §2 outcome_shape.is_count_like field reference still says raw-outcome OLS has “questionable asymptotic SEs,” and the §4.11 preamble still says asymptotic SEs “may mislead.” That conflicts with the documented OLS cluster-robust sandwich inference surface and can still steer estimator choice for the wrong reason. Location: diff_diff/guides/llms-autonomous.txt:L188-L193, diff_diff/guides/llms-autonomous.txt:L663-L671, docs/methodology/REGISTRY.md:L1343-L1349. Concrete fix: replace the remaining SE-based rationale with the same functional-form/support language already used later in §4.11/§5.3.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • No findings in the diff beyond the methodology item above.
  • Static review note: the added guide/profile tests look well targeted on inspection, but I could not execute them here because pytest is unavailable.

igerber added a commit that referenced this pull request Apr 25, 2026
…tcome rationale to functional-form framing

Round 9 rewrote §4.11 detailed bullet and §5.3 reasoning chain to
describe the Poisson-vs-linear choice in functional-form / efficiency
terms, but two surfaces still carried the old SE-based framing:

- §2 `is_count_like` field reference: "questionable asymptotic SEs"
- §4.11 preamble paragraph: "asymptotic SEs may mislead"

Both phrases conflict with the library's documented inference
contract (REGISTRY.md §WooldridgeDiD: linear ETWFE uses
cluster-robust sandwich SEs which are valid for any error
distribution with finite moments — not just Gaussian). Reviewer
correctly flagged this as the remaining methodology P2.

Updated both surfaces to match the round-9 framing:
- Linear-OLS DiD imposes an additive functional form on a
  non-negative count outcome; cluster-robust SEs remain
  calibrated but the linear model can be inefficient and may
  produce counterfactual predictions outside the non-negative
  support.
- `WooldridgeDiD(method="poisson")` is the multiplicative
  (log-link) ETWFE alternative that respects the non-negative
  support and matches the typical count-process generative model,
  with QMLE sandwich SEs robust to distributional misspecification.
- The choice is functional-form / efficiency / support, NOT SE
  calibration.

Verified no remaining "asymptotic SEs" / "questionable" / "mislead"
phrases in the guide via grep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 31e887523fb6cd1f31a5c3acc4be8b07330dfa42


Overall Assessment

⚠️ Needs changes

Executive Summary

  • P1 The new agent-facing ContinuousDiD guidance overstates PanelProfile.has_never_treated and treatment_dose.dose_min > 0 as necessary/authoritative pre-fit gates, but ContinuousDiD.fit() actually keys off first_treat, zeroes nonzero dose on never-treated rows, and only rejects negative doses among treated units.
  • That mismatch is now repeated across the public guide, docstrings, changelog/roadmap prose, and regression tests, so an autonomous agent can wrongly route valid panels away from ContinuousDiD.
  • The prior count-outcome wording issue appears resolved: the updated Wooldridge Poisson guidance now correctly frames overall_att as an ASF-based outcome-scale difference rather than a multiplicative ratio.
  • No estimator math, weighting, variance, or SE implementation changed in this diff.
  • Static review only: python -m pytest tests/test_profile_panel.py tests/test_guides.py -q could not be run here because pytest is not installed.

Methodology

  • P1 Impact: the PR documents the five profile-side ContinuousDiD checks as “necessary” and treats has_never_treated as the authoritative gate, with dose_min > 0 presented as a hard pre-fit screen in diff_diff/profile.py:L64-L114, diff_diff/guides/llms-autonomous.txt:L206-L254, diff_diff/guides/llms-autonomous.txt:L510-L558, diff_diff/guides/llms-autonomous.txt:L807-L860, CHANGELOG.md:L12-L12, ROADMAP.md:L141-L141, tests/test_guides.py:L45-L82, and tests/test_profile_panel.py:L1110-L1212. But the actual estimator contract is different: ContinuousDiD.fit() defines controls from first_treat == 0, zeroes nonzero dose on those rows, and only rejects negative dose among treated units first_treat > 0 in diff_diff/continuous_did.py:L276-L327 and diff_diff/continuous_did.py:L348-L360; that coercion is explicitly documented in docs/methodology/REGISTRY.md:L743-L745. So a panel can fail the new profile-side gate while still warning-and-fitting. Concrete fix: reword these fields as conservative “clean-data / no-coercion preflight checks” rather than necessary eligibility conditions, or add a profiling helper that also accepts first_treat so the pre-fit gate matches the estimator’s real contract; update the new tests accordingly.
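
The coercion/rejection asymmetry this finding describes is easy to mirror in a standalone sketch. Everything below (helper name, column layout, messages) is invented for illustration and is not the diff_diff API:

```python
import warnings

import pandas as pd


def preflight_dose(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical mirror of the described fit-time contract (not diff_diff).

    Controls are rows with first_treat == 0; nonzero dose on those rows is
    force-zeroed with a UserWarning; negative dose is rejected only among
    treated units (first_treat > 0).
    """
    out = df.copy()
    never = out["first_treat"] == 0
    if (out.loc[never, "dose"] != 0).any():
        warnings.warn("nonzero dose on never-treated rows coerced to 0",
                      UserWarning)
        out.loc[never, "dose"] = 0.0
    if (out.loc[~never, "dose"] < 0).any():
        raise ValueError("Dose must be strictly positive for treated units")
    return out


df = pd.DataFrame({
    "unit":        [1, 1, 2, 2],
    "first_treat": [0, 0, 2, 2],
    "dose":        [-0.3, -0.3, 1.5, 1.5],  # negative dose only on never-treated
})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cleaned = preflight_dose(df)
print(cleaned["dose"].tolist())  # never-treated rows coerced to 0.0
```

Under this contract a panel with negative dose confined to never-treated rows warns and fits, so a profile-side `dose_min > 0` screen would wrongly reject it.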

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings beyond the methodology contract issue above.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • No separate findings beyond the methodology item above.
  • Static review note: tests were not executed here because pytest is unavailable in the environment.

Path to Approval

  1. Remove “necessary” / “authoritative gate” wording for has_never_treated and treatment_dose.dose_min > 0 anywhere this PR added it, and replace it with contract-accurate wording tied to first_treat-based coercion/validation.
  2. Add a regression that covers first_treat == 0 rows carrying nonzero dose (including negative dose on never-treated rows), asserting the documented warning/coercion behavior in ContinuousDiD.fit(), then align the guide/profile tests with that behavior.

igerber and others added 13 commits April 25, 2026 06:45
…mous-guide worked examples (Wave 2)

Wave 2 of the AI-agent enablement track. Extends profile_panel() with two
new optional sub-dataclasses:

- OutcomeShape (numeric outcomes only): n_distinct_values, pct_zeros,
  value_min/max, NaN-safe skewness + excess_kurtosis (gated on
  n_distinct >= 3 and std > 0), is_integer_valued, is_count_like (heuristic:
  integer-valued AND has zeros AND right-skewed AND > 2 distinct values),
  is_bounded_unit ([0, 1] support).
- TreatmentDoseShape (treatment_type == "continuous" only):
  n_distinct_doses, has_zero_dose, dose_min/max/mean over non-zero doses,
  is_time_invariant (per-unit non-zero doses have at most one distinct
  value).

Both fields are None when their classification gate is not met. to_dict()
serializes the nested dataclasses as JSON-compatible nested dicts.
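
A minimal standalone sketch of a heuristic of this shape, using the skewness > 0.5 threshold quoted elsewhere in this thread; this is illustrative only, not the diff_diff implementation:

```python
import numpy as np


def is_count_like(y) -> bool:
    # Heuristic restated from the commit text: integer-valued AND has zeros
    # AND right-skewed AND more than 2 distinct values (not the library code).
    y = np.asarray(y, dtype=float)
    integer_valued = bool(np.isclose(y, np.round(y), rtol=0.0, atol=1e-12).all())
    pct_zeros = float(np.mean(y == 0.0))
    std = float(y.std())
    skewness = float(np.mean((y - y.mean()) ** 3) / std**3) if std > 0 else 0.0
    n_distinct = int(np.unique(y).size)
    return integer_valued and pct_zeros > 0 and skewness > 0.5 and n_distinct > 2


rng = np.random.default_rng(0)
poisson = rng.poisson(lam=1.5, size=500).astype(float)  # count-shaped outcome
binary = rng.integers(0, 2, size=500).astype(float)     # fails n_distinct > 2
print(is_count_like(poisson), is_count_like(binary))
```

The binary column fails only the distinct-count clause, which is what keeps 0/1 outcomes from being classified as counts.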

llms-autonomous.txt gains a new §5 "Worked examples" with three end-to-end
PanelProfile -> reasoning -> validation walkthroughs (binary staggered
with never-treated controls, continuous dose with zero baseline,
count-shaped outcome) plus §2 field-reference subsections, §3 footnote
cross-ref, §4.7 cross-ref, and a new §4.11 outcome-shape considerations
section. Existing §5-§8 renumbered to §6-§9. Descriptive only - no
recommender language inside the worked examples.

Tests: 16 new unit tests in tests/test_profile_panel.py covering each
heuristic (count-like Poisson, binary-as-not-count-like, continuous
normal, bounded unit, categorical returning None, skewness gating,
JSON roundtrip, time-invariant dose, time-varying dose, no-zero-dose,
binary-treatment returning None, categorical-treatment returning None,
frozen invariants on both new dataclasses). Two new
content-stability tests in tests/test_guides.py guard the §5 worked
examples and the new field references.

CHANGELOG and ROADMAP updated; ROADMAP marks Wave 2 shipped, promotes
sanity_checks block to top of "Next blocks toward the vision," and
documents why the originally-proposed post-hoc mismatch detection was
rescoped (largely overlaps existing fit-time validators and caveats).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…are integer detection

P1 from local review (`treatment_dose.is_time_invariant`):
- Removed `np.round(..., 8)` tolerance from `_compute_treatment_dose`'s
  per-unit non-zero distinct-count check. The documented contract is
  "per-unit non-zero doses have at most one distinct value" (exact),
  but the implementation was rounding to 8 decimals before comparing,
  silently classifying tiny-but-real dose variation as time-invariant
  and contradicting the docstring + CHANGELOG + autonomous guide §2.
  Now uses exact `np.unique(unit_nonzero).size > 1`. Added a regression
  test (`test_treatment_dose_distinguishes_doses_at_high_precision`) for
a unit with two non-equal doses separated by 1e-9 (below the previous
rounding window) — asserts `is_time_invariant=False`.
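
The regression is reproducible with numpy alone; the 1e-9 gap matches the fixture described above, and the constants are otherwise illustrative:

```python
import numpy as np

# Two real, non-equal doses separated by 1e-9 — inside the old 8-decimal window.
doses = np.array([0.123456789, 0.123456790])

rounded_distinct = np.unique(np.round(doses, 8)).size  # old check: collapses both
exact_distinct = np.unique(doses).size                 # exact check: keeps both

print(rounded_distinct, exact_distinct)  # 1 2
```

The rounded path reports one distinct dose (time-invariant), while the exact path correctly sees two.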

Related dead-code removal:
- Removed the `len(nonzero) == 0` defensive branch in
  `_compute_treatment_dose`. `treatment_type == "continuous"` is reached
  only when the treatment column has more than two distinct values OR
  a 2-valued numeric outside `{0, 1}`; an all-zero numeric column is
  classified as `binary_absorbing` and never reaches this branch, so
  `nonzero` is guaranteed non-empty. Removing the branch eliminates
  the NaN-vs-Optional[float] inconsistency the reviewer flagged on
  `dose_min/max/mean`.

P2 from local review (`is_integer_valued` brittleness):
- Switched from `np.equal(np.mod(arr, 1.0), 0.0)` to
  `np.isclose(arr, np.round(arr), rtol=0.0, atol=1e-12)`. The treatment
  / outcome column is user input (system boundary), and CSV-roundtripped
  count columns commonly carry float64 representation noise (e.g.,
  `1.0` stored as `1.0000000000000002`). Tolerance-aware integer
  detection is the right discipline at the boundary; downstream the
  `is_count_like` heuristic remains gated on this AND `pct_zeros > 0`
  AND `skewness > 0.5` AND `n_distinct > 2`, so isolated noise can't
  flip the classification.
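
For example, with a stand-in noisy value (the long literal below simulates CSV float64 roundtrip noise on a stored count of 1):

```python
import numpy as np

counts = np.array([0.0, 1.0000000000000002, 2.0, 5.0])  # noisy CSV roundtrip

strict = bool(np.equal(np.mod(counts, 1.0), 0.0).all())                # old check
tolerant = bool(np.isclose(counts, np.round(counts),
                           rtol=0.0, atol=1e-12).all())                # new check

print(strict, tolerant)  # False True
```

The strict modulo check rejects the column outright on one 2e-16 perturbation; the tolerance-aware check absorbs it.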

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ptive-only; fix WooldridgeDiD method kwarg

P1 (TreatmentDoseShape vs ContinuousDiD contract):
- Reviewer correctly flagged that the new `is_time_invariant` field
  (per-unit non-zero distinct-count) does NOT match the actual
  `ContinuousDiD.fit()` gate at `continuous_did.py:222-228`, which
  uses `df.groupby(unit)[dose].nunique() > 1` over the FULL dose
  column (including pre-treatment zeros). My nonzero-only check
  silently classified `0,0,d,d` paths as time-invariant while
  ContinuousDiD would reject them.
- Removed `is_time_invariant` field from `TreatmentDoseShape`
  entirely. The pre-existing `PanelProfile.treatment_varies_within_unit`
  field already encodes the correct ContinuousDiD prerequisite (matches
  the estimator's nunique check at line 224) and is correctly documented
  in §2 of the autonomous guide. Adding a second, narrower, mismatched
  gate was confusing - the reviewer's "scope as descriptive-only" path
  is the cleaner fix.
- Reframed `TreatmentDoseShape` docstring + autonomous guide §2
  field reference: explicitly NOT a ContinuousDiD prerequisite.
  `n_distinct_doses`, `has_zero_dose`, `dose_min/max/mean` provide
  descriptive distributional context; `has_never_treated` (unit-level)
  + `treatment_varies_within_unit == False` (full-path constancy) +
  `is_balanced` are the authoritative gates.
- Rewrote §5.2 worked example reasoning chain to use the existing
  correct gates and added a counter-example showing
  `has_zero_dose=True` does NOT imply `has_never_treated=True` (the
  row-level vs unit-level distinction).
- Added `test_treatment_dose_does_not_gate_continuous_did` covering
  the two contradictory cases the reviewer named: (1) `0,0,d,d`
  within-unit dose path, asserting `treatment_varies_within_unit=True`
  (the actual ContinuousDiD gate fires correctly); (2) row-level zeros
  without never-treated units, asserting `has_zero_dose=True` BUT
  `has_never_treated=False` (the two facts are distinct).
- Removed `test_treatment_dose_continuous_time_varying_within_unit`
  and `test_treatment_dose_distinguishes_doses_at_high_precision` -
  both tested the dropped `is_time_invariant` field.

P2 (WooldridgeDiD constructor kwarg):
- The autonomous guide §5.3 worked example used
  `WooldridgeDiD(family="poisson")` but the actual constructor at
  `wooldridge.py:264` takes `method=`. Following the example would
  raise `TypeError: __init__() got an unexpected keyword argument
  'family'`. Fixed in two places (the prose and the code snippet)
  and added a negative assertion in `test_guides.py` to prevent
  regression: `assert 'WooldridgeDiD(family="poisson")' not in text`.

CHANGELOG updated to reflect the revised TreatmentDoseShape scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…corrected scope; cover new exports in import-surface test

P3 #1 (ROADMAP wording drift):
ROADMAP.md still said the new fields "gate WooldridgeDiD QMLE /
ContinuousDiD prerequisites pre-fit" and mentioned "time-invariance",
which contradicted the round-1 corrections to TreatmentDoseShape's
docstring + autonomous guide §2 + §5.2. Reworded to match: the new
fields add descriptive distributional context only;
`outcome_shape.is_count_like` informs (not gates) the WooldridgeDiD
QMLE judgment, and the authoritative ContinuousDiD pre-fit gates
remain `has_never_treated`, `treatment_varies_within_unit`, and
`is_balanced`. "Time-invariance" wording removed (the field was
dropped in round 1).

P3 #2 (import-surface test coverage):
`test_top_level_import_surface()` previously only verified
`profile_panel`, `PanelProfile`, `Alert`. Extended to also cover the
two new public exports `OutcomeShape` and `TreatmentDoseShape`,
asserting both their importability and their presence in
`diff_diff.__all__`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ontinuousDiD prerequisite summaries

Reviewer correctly noted that the round-2 wording lists
`has_never_treated` + `treatment_varies_within_unit == False` +
`is_balanced` as the "authoritative" ContinuousDiD pre-fit gates but
omits the duplicate-cell hard stop. Verified
`continuous_did.py:_precompute_structures` (line 818-823) builds
`outcome_matrix` cell-by-cell with last-row-wins on duplicate
`(unit, time)` keys - so absence of the `duplicate_unit_time_rows`
alert is also a real prerequisite, not just a style preference.
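
A toy dict-based stand-in shows the silent last-row-wins behavior; this is not the library's outcome_matrix code, just the cell-by-cell assignment pattern described above:

```python
import pandas as pd

df = pd.DataFrame({
    "unit": [1, 1, 1],
    "time": [1, 1, 2],        # duplicate (unit=1, time=1) cell
    "y":    [10.0, 99.0, 12.0],
})

# Cell-by-cell assignment: the later duplicate row silently overwrites the
# earlier one (last-row-wins) — no error, no warning.
matrix = {}
for row in df.itertuples(index=False):
    matrix[(row.unit, row.time)] = row.y

print(matrix[(1, 1)])  # 99.0 — the first row's 10.0 is gone
```

Nothing in the loop signals that data was discarded, which is why absence of the duplicate-row alert is a real prerequisite rather than a style preference.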

Updated wording in five places to add "+ absence of the
`duplicate_unit_time_rows` alert" alongside the other gates and
explain the silent-overwrite behavior:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring
- `diff_diff/guides/llms-autonomous.txt` §2 field reference
- `diff_diff/guides/llms-autonomous.txt` §4.7 (continuous design feature)
- `diff_diff/guides/llms-autonomous.txt` §5.2 worked example reasoning
  chain (now lists four gates instead of three)
- `CHANGELOG.md` Unreleased entry
- `ROADMAP.md` AI-Agent Track building-block

Also softened "authoritative" -> "core field-based" since the
non-field-based duplicate-row gate makes the original phrasing
slightly misleading.

Added a test_guides.py regression asserting the autonomous guide
mentions `duplicate_unit_time_rows` so future wording changes can't
silently drop the gate from the summary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n estimand wording + is_count_like non-negativity guard

P1 #1 (Wooldridge Poisson estimand wording):
The guide §4.11 and §5.3 worked example described
`WooldridgeDiD(method="poisson")`'s `overall_att` as a
"multiplicative effect" / "log-link effect" / "proportional change"
to be reported. Verified against `wooldridge.py:1225`
(`att = _avg(mu_1 - mu_0, cell_mask)`) and
`_reporting_helpers.py:262-281` (registered estimand: "ASF-based
average from Wooldridge ETWFE ... average-structural-function (ASF)
contrast between treated and counterfactual untreated outcomes ...
on the natural outcome scale"): the actual quantity is
`E[exp(η_1)] - E[exp(η_0)]`, an outcome-scale DIFFERENCE, not a
multiplicative ratio. An agent following the previous wording would
misreport the headline scalar.

Rewrote both surfaces to:
- Describe the estimand as an ASF-based outcome-scale difference,
  citing `wooldridge.py:1225` and Wooldridge (2023) +
  REGISTRY.md §WooldridgeDiD nonlinear / ASF path.
- Explicitly note the headline `overall_att` is a difference on the
  natural outcome scale, NOT a multiplicative ratio.
- Mention that a proportional / percent-change interpretation can
  be derived post-hoc as `overall_att / E[Y_0]` but is not the
  estimator's reported scalar.
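
A toy numeric check of the difference-vs-ratio distinction, with invented linear predictors on a log link; only the `E[exp(η_1)] - E[exp(η_0)]` algebra is taken from the text above:

```python
import numpy as np

# Invented treated-cell linear predictors on a log link.
eta_0 = np.array([0.0, 0.5, 1.0])   # counterfactual untreated
eta_1 = eta_0 + 0.3                 # treated: +0.3 on the log scale

mu_0, mu_1 = np.exp(eta_0), np.exp(eta_1)

overall_att = float((mu_1 - mu_0).mean())      # outcome-scale DIFFERENCE
pct_change = overall_att / float(mu_0.mean())  # derivable post-hoc, not the scalar

print(round(overall_att, 3), round(pct_change, 3))
```

Because η shifts by a constant here, the post-hoc percent form reduces to exp(0.3) - 1 ≈ 0.35, while the reported scalar is the level difference on the outcome scale.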

Added `test_autonomous_count_outcome_uses_asf_outcome_scale_estimand`
in `tests/test_guides.py`: extracts §4.11 and §5.3 blocks, asserts
forbidden phrases ("multiplicative effect under qmle", "estimates
the multiplicative effect", "multiplicative (log-link) effect",
"report the multiplicative effect", "report the multiplicative")
do NOT appear, and asserts §5.3 explicitly contains "ASF" and
"outcome scale" so future edits cannot silently weaken the
description.

P1 #2 (`is_count_like` non-negativity guard):
The `is_count_like` heuristic gated on integer-valued + has-zeros +
right-skewed + > 2 distinct values, but did NOT exclude negative
support. Verified against `wooldridge.py:1105-1109`: Poisson method
hard-rejects `y < 0` with `ValueError`. Without a value_min >= 0
guard, a right-skewed integer outcome with zeros and some negatives
would set `is_count_like=True` and steer an agent toward an
estimator that then refuses to fit.

Added `value_min >= 0.0` to the heuristic and explained the
non-negativity gate in the docstring + autonomous guide §2 field
reference (now reads
"is_integer_valued AND pct_zeros > 0 AND skewness > 0.5 AND
n_distinct_values > 2 AND value_min >= 0"). The guide also notes
that the gate exists specifically to align the routing signal with
WooldridgeDiD Poisson's hard non-negativity requirement.
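
A standalone sketch of the guard; the negative-splice fixture mirrors the regression test described below, but the draws and thresholds are illustrative, not the library code:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.poisson(lam=0.8, size=500).astype(float)
y[:10] = -1.0                                   # small negative-integer share

integer_valued = bool(np.isclose(y, np.round(y), rtol=0.0, atol=1e-12).all())
has_zeros = bool((y == 0.0).any())
skewness = float(np.mean((y - y.mean()) ** 3) / y.std() ** 3)
n_distinct = int(np.unique(y).size)

legacy_four = integer_valued and has_zeros and skewness > 0.5 and n_distinct > 2
count_like = legacy_four and float(y.min()) >= 0.0   # round-4 non-negativity clause

print(legacy_four, count_like)  # the guard alone flips the signal
```

All four legacy conditions fire on this outcome; only the `value_min >= 0` clause suppresses the count-like signal, matching the Poisson path's hard rejection of negative outcomes.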

Added `test_outcome_shape_count_like_excludes_negative_support` in
`tests/test_profile_panel.py` covering a Poisson-distributed outcome
with a small share of negative integers spliced in: asserts
`is_count_like=False` despite the other four conditions firing.

P2 (test coverage for both P1s):
Both regressions above guard the new contracts. The guide test
guards the wording surface; the profile test guards the heuristic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e-dose gate to ContinuousDiD prerequisite summaries

P1 (newly identified — `dose_min > 0` ContinuousDiD gate omitted):
Reviewer correctly noted that the round-3 prerequisite summary
(`has_never_treated`, `treatment_varies_within_unit == False`,
`is_balanced`, no `duplicate_unit_time_rows` alert) omits the
estimator's strictly-positive-treated-dose restriction. Verified
`continuous_did.py:287-294` raises `ValueError` ("Dose must be
strictly positive for treated units (D > 0)") on negative treated
dose support — a panel can satisfy every documented gate above and
still hard-fail at fit time when `treatment_dose.dose_min < 0`.

Updated wording in five surfaces to add `treatment_dose.dose_min > 0`
as the fifth pre-fit gate:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring (now spells
  out all five gates with numbered list referencing the
  `continuous_did.py:287-294` line range).
- `diff_diff/guides/llms-autonomous.txt` §2 field reference (notes
  `dose_min > 0` is itself a gate; the other treatment_dose
  sub-fields remain descriptive).
- `diff_diff/guides/llms-autonomous.txt` §4.7 (continuous design
  feature paragraph).
- `diff_diff/guides/llms-autonomous.txt` §5.2 worked example
  reasoning chain (now five gates checked; added a new
  counter-example covering the negative-dose path so an agent
  reading the example sees the contradictory case explicitly).
- `CHANGELOG.md` Unreleased entry.
- `ROADMAP.md` AI-Agent Track building-block.

P2 (regression test for negative-dose path):
Added `test_treatment_dose_min_flags_negative_dose_continuous_panels`
in `tests/test_profile_panel.py` covering a balanced, never-treated,
time-invariant continuous panel where every dose level is negative.
The test asserts that all four other gates pass cleanly and that
`dose.dose_min < 0` is correctly observed — the fixture an agent
would see when reasoning about whether ContinuousDiD applies.

Added `dose_min > 0` content-stability assertion in
`tests/test_guides.py` so future wording changes can't silently drop
the gate from the autonomous-guide summary.

Rebased onto origin/main; resolved CHANGELOG conflicts (kept HAD
Phase 4.5 B + dCDH by_path Wave 2 + my Wave 2 entries side by side).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nuousDiD prerequisite list as profile-side screening + add first_treat caveat

P1 (the five profile-derived facts are not the "full" gate set):
Reviewer correctly noted that calling
`{has_never_treated, treatment_varies_within_unit==False,
is_balanced, no duplicate_unit_time_rows alert, dose_min > 0}` the
"full ContinuousDiD pre-fit gate set" overreaches. `profile_panel`
only sees the four columns it accepts and CANNOT see the separate
`first_treat` column that `ContinuousDiD.fit()` consumes. Verified
against `continuous_did.py:230-360`: `fit()` additionally rejects
NaN/inf/negative `first_treat`, drops units with `first_treat > 0`
AND `dose == 0`, and force-zeroes `first_treat == 0` rows whose
`dose != 0` with a `UserWarning`. A panel that passes all five
profile-side checks can still surface warnings, drop rows, or raise
at fit time depending on the `first_treat` column the caller
supplies.
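
A hypothetical sketch of the first_treat-side checks; the helper and error strings are invented, not the diff_diff validation code:

```python
import numpy as np
import pandas as pd


def validate_first_treat(ft: pd.Series) -> None:
    # Hypothetical mirror of the fit-time checks described above (not diff_diff):
    # first_treat must be finite and non-negative.
    arr = ft.to_numpy(dtype=float)
    if not np.isfinite(arr).all():
        raise ValueError("first_treat contains NaN/inf")
    if (arr < 0).any():
        raise ValueError("first_treat contains negative values")


try:
    validate_first_treat(pd.Series([0.0, 2.0, np.nan]))
except ValueError as exc:
    print(exc)  # profile_panel never sees this column, so it cannot catch this
```

This is the validation surface that lives entirely outside the four columns `profile_panel` accepts, which is why the profile-side checks are necessary but not sufficient.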

Reframed the wording in five surfaces from "full gate set" to
"profile-side screening checks" with an explicit caveat that the
checks are necessary-but-not-sufficient and that `ContinuousDiD.fit()`
applies separate `first_treat` validation:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring (now spells
  out the screening framing explicitly + lists the `first_treat`
  validations that fit() applies).
- `diff_diff/profile.py` `_compute_treatment_dose` helper docstring
  (aligned with public contract: most fields descriptive,
  `dose_min > 0` is one of the screening checks).
- `diff_diff/guides/llms-autonomous.txt` §2 field reference (rewrote
  the multi-paragraph block to describe screening + first_treat
  caveat).
- `diff_diff/guides/llms-autonomous.txt` §4.7 (continuous design
  feature paragraph: screening checks + necessary-not-sufficient
  language + pointer to §2).
- `diff_diff/guides/llms-autonomous.txt` §5.2 worked example
  reasoning chain (rewrote step 2 to call out screening +
  first_treat caveat; clarified counter-example #4 that
  `P(D=0) > 0` is required under BOTH `control_group="never_treated"`
  and `"not_yet_treated"`, not just default).
- `CHANGELOG.md` Unreleased entry.
- `ROADMAP.md` AI-Agent Track.

P2 (test coverage for the missing `first_treat` caveat):
Added a content-stability assertion in `tests/test_guides.py`:
`assert "first_treat" in text` so the autonomous guide cannot
silently drop the explicit `first_treat` validation caveat.

P3 (helper / test-name inconsistency with public contract):
Renamed `test_treatment_dose_does_not_gate_continuous_did` to
`test_treatment_dose_descriptive_fields_supplement_existing_gates`
and rewrote its docstring to match the now-honest public contract
("most fields descriptive distributional context that supplements
the existing top-level screening checks"). The test body still
asserts the same two things — `treatment_varies_within_unit` fires
True on `0,0,d,d` paths and `has_never_treated` is independent of
`has_zero_dose` — both of which remain accurate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…se in CHANGELOG is_count_like description

Reviewer correctly noted that the CHANGELOG bullet describing
`is_count_like` listed only four of the five conditions
(integer-valued + has zeros + right-skewed + > 2 distinct values)
but omitted the `value_min >= 0` non-negativity clause added in
round 4. Readers of the release notes alone would have expected
negative-valued count-like outcomes to route as `is_count_like=True`
even though the implementation intentionally suppresses that to stay
compatible with `WooldridgeDiD(method="poisson")`'s hard non-negative
requirement (`wooldridge.py:1105-1109`).

Updated the bullet to include the non-negativity clause and
explicitly cite the wooldridge.py line range so the release-notes
description matches `diff_diff/profile.py`,
`diff_diff/guides/llms-autonomous.txt`, and the
`test_outcome_shape_count_like_excludes_negative_support`
regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…D opening bullet with the five-check screen below it

§4.7 had two summaries of the ContinuousDiD pre-fit contract that
disagreed: the opening bullet (Wave 1 era) said "Three eligibility
prerequisites", while the paragraph immediately below correctly
listed all five profile-side screening checks (Wave 2 added the
duplicate-row hard stop and the negative-dose `dose_min > 0`
screen). A reader who stopped at the opening bullet could miss
those two load-bearing gates and route a duplicate-containing or
negative-dose panel into a failed/silent-overwrite fit path.

Updated the opening bullet to enumerate all five profile-side
checks: (a) zero-dose controls / `P(D=0) > 0`, (b) per-unit
time-invariant dose, (c) balanced panel, (d) no
`duplicate_unit_time_rows` alert, (e) strictly positive treated
doses. Also clarified that the duplicate-row case (d) coerces
silently rather than raising — distinct from (a)/(b)/(c)/(e) which
all raise ValueError. Pointer to the paragraph below + §2 for the
`first_treat` validation surface kept intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ationale in functional-form terms; rename §5.2 + clarify dose constancy

P2 (linear-OLS SE rationale wording):
The §4.11 and §5.3 prose justified `WooldridgeDiD(method="poisson")`
over linear-OLS DiD by claiming linear-OLS asymptotic SEs "assume
normal-shaped errors" that a right-skewed count distribution
violates. That misrepresents the library's documented inference
contract: linear ETWFE uses cluster-robust sandwich SEs (`OLS path`
in REGISTRY.md) which are valid for any error distribution with
finite moments — not just Gaussian.

Rewrote both surfaces in functional-form / efficiency terms:
- Linear-OLS DiD imposes an additive functional form on a
  non-negative count outcome. SEs are calibrated (cluster-robust)
  but the linear model can be inefficient and may produce
  counterfactual predictions outside the non-negative support.
- `WooldridgeDiD(method="poisson")` imposes a multiplicative
  (log-link) functional form that respects non-negativity and
  matches the typical generative process for count data; QMLE
  sandwich SEs are robust to distributional misspecification.
- The choice is about WHICH functional form best summarizes the
  treatment effect (additive vs multiplicative), not about
  whether SEs are calibrated.
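
A toy numeric illustration of the support point, with invented data and np.polyfit standing in for the linear fit:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 1.0, 5.0])       # non-negative, right-skewed counts

slope, intercept = np.polyfit(x, y, 1)   # additive (linear) fit
linear_pred = float(intercept + slope * 0.0)  # prediction at x = 0

# The additive form can predict below the non-negative support...
print(round(linear_pred, 2))             # -0.9

# ...while any exp(log-link) prediction is positive by construction.
print(float(np.exp(-5.0)) > 0.0)         # True
```

The point is functional form, not inference: the linear fit's SEs can be perfectly calibrated while its counterfactual predictions leave the outcome's support.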

§5.3 reasoning chain step 2 + step 3 rewritten to reflect this:
both estimators are now described as having calibrated inference;
the trade-off is parameterization (linear ATT vs. ASF outcome-scale
ATT). The Poisson non-negativity gate explanation is preserved.

P3 (§5.2 example self-contradiction):
§5.2 was titled "Continuous-dose panel with zero baseline" with
prose "block-style adoption in year 3", which suggested a
`0,0,d,d` within-unit dose path — but the shown profile has
`treatment_varies_within_unit=False` (per-unit constant dose) and
the same section's later counter-example correctly says `0,0,d,d`
is out of scope for ContinuousDiD. Self-contradictory framing.

Renamed to "Continuous-dose panel with zero-dose controls" (per
reviewer suggestion) and clarified the prose: each dose group holds
its assigned dose value (including 0 for the never-treated
controls) in every observed period; adoption timing is carried
separately via the `first_treat` column passed to
`ContinuousDiD.fit()`, not via within-unit dose variation. Updated
the matching `test_autonomous_contains_worked_examples_section`
assertion in `tests/test_guides.py` to track the new title.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tcome rationale to functional-form framing

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…light checks as standard-workflow predictions, not estimator gates

Reviewer correctly noted that calling
{has_never_treated, treatment_varies_within_unit==False,
is_balanced, no duplicate_unit_time_rows alert, dose_min > 0}
the "screening checks" / "necessary" gates of `ContinuousDiD`
overstates the contract. `ContinuousDiD.fit()` keys off the
separate `first_treat` column (which `profile_panel` does not see),
defines never-treated controls as `first_treat == 0` rows,
force-zeroes nonzero `dose` on those rows with a `UserWarning`,
and rejects negative dose only among treated units `first_treat > 0`
(see `continuous_did.py:276-327` and `:348-360`).

Two of the five checks (`has_never_treated`, `dose_min > 0`) are
first_treat-dependent: agents who relabel positive- or negative-dose
units as `first_treat == 0` trigger the force-zero coercion path
with a `UserWarning` and may still fit panels that fail those
preflights, with the methodology shifting. The other three
(`treatment_varies_within_unit`, `is_balanced`, duplicate-row
absence) are real fit-time gates that hold regardless of how
`first_treat` is constructed.

Reframed every wording site to call these "standard-workflow
preflight checks" — predictive when the agent derives `first_treat`
from the same dose column passed to `profile_panel`, but not the
estimator's literal contract:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring (rewrote
  the multi-paragraph block; explicit standard-workflow definition
  + per-check first_treat dependency map + force-zero coercion
  caveat).
- `diff_diff/profile.py` `_compute_treatment_dose` helper docstring
  (already brief; stays consistent).
- `diff_diff/guides/llms-autonomous.txt` §2 field reference (long
  rewrite covering the standard-workflow framing + override paths).
- `diff_diff/guides/llms-autonomous.txt` §4.7 opening bullet +
  trailing paragraph (both updated; opening bullet now spells out
  which of the five checks are first_treat-dependent vs. hard
  fit-time stops; trailing paragraph promotes the standard-
  workflow framing).
- `diff_diff/guides/llms-autonomous.txt` §5.2 reasoning chain step
  2 (rewrote the gate-checking paragraph; counter-example #4
  expanded to enumerate (a) supply matching first_treat and accept
  rejection, (b) deliberate relabel + coercion, (c) different
  estimator; counter-example #5 distinguishes negative-dose
  treated-unit rejection from never-treated coercion).
- `CHANGELOG.md` Wave 2 entry (matches the new framing).
- `ROADMAP.md` AI-Agent Track building block (matches).

Test coverage:
- Renamed assertion messages in
  `test_treatment_dose_descriptive_fields_supplement_existing_gates`
  and `test_treatment_dose_min_flags_negative_dose_continuous_panels`
  to remove "authoritative gate" phrasing; reframed as "standard-
  workflow preflight" assertions consistent with the corrected docs.
- Added `test_negative_dose_on_never_treated_coerces_not_rejects`
  in `tests/test_continuous_did.py::TestEdgeCases` covering the
  reviewer's specific request: never-treated rows with NEGATIVE
  nonzero dose must coerce (with `UserWarning`) rather than raise
  the treated-unit negative-dose error. Sister to the existing
  `test_nonzero_dose_on_never_treated_warns` which covers the
  positive-dose case.
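The asymmetry that test locks in (never-treated coercion vs. treated-unit rejection) can be sketched without pytest, using a hypothetical stand-in for the coercion contract rather than the real `ContinuousDiD` code path:

```python
import warnings

def coerce_never_treated_dose(first_treat, dose):
    # Hypothetical stand-in for the coercion contract described above,
    # NOT the actual ContinuousDiD implementation.
    if first_treat == 0 and dose != 0:
        warnings.warn(
            "nonzero dose on never-treated row coerced to 0", UserWarning
        )
        return 0.0
    if first_treat > 0 and dose < 0:
        raise ValueError("treated units require strictly positive dose")
    return dose

# negative dose on a never-treated row: coerces with a UserWarning, no raise
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    coerced = coerce_never_treated_dose(first_treat=0, dose=-0.3)
assert coerced == 0.0
assert any(issubclass(w.category, UserWarning) for w in caught)

# the same negative dose on a TREATED unit is rejected, not coerced
try:
    coerce_never_treated_dose(first_treat=2001, dose=-0.3)
    treated_negative_raises = False
except ValueError:
    treated_negative_raises = True
print(coerced, treated_negative_raises)
```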

Rebased onto origin/main during this round (no conflicts beyond
prior CHANGELOG resolutions; main advanced 19 commits).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the agent-profile-shape-extensions branch from 31e8875 to 25a1197 Compare April 25, 2026 10:53
igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 25a1197a1d7f7f73fff10107bb5e49366c3cceb9


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Re-review focus: the earlier ContinuousDiD “authoritative gate” problem is mostly fixed. The new text now correctly treats has_never_treated and dose_min > 0 as standard-workflow, first_treat-dependent preflight checks rather than universal fit-time gates.
  • No estimator math, weighting, variance/SE, or default-fit behavior changed in this diff.
  • P1 The new agent-facing docs now go beyond the documented coercion note and present deliberate first_treat == 0 relabeling of nonzero-dose units as an agent option to “manufacture controls” / still fit ContinuousDiD. That is not supported by the Methodology Registry or the estimator’s own requirement that actual P(D=0) > 0 exist.
  • P2 Some new wording also says duplicate-row failures are “hard fit-time stops,” but ContinuousDiD still silently applies last-row-wins on duplicate (unit, time) cells.
  • Static review only: python -m pytest tests/test_profile_panel.py tests/test_guides.py tests/test_continuous_did.py -q could not be run because pytest is not installed.

Methodology

  • P1 Impact: the prior mismatch is not fully closed because the new docs treat intentional relabeling of positive-dose / nonzero-dose units to first_treat == 0 as an available fallback for ContinuousDiD (diff_diff/guides/llms-autonomous.txt:L241-L247, diff_diff/guides/llms-autonomous.txt:L893-L900, diff_diff/profile.py:L88-L97, ROADMAP.md:L141-L141, CHANGELOG.md:L13-L13). The registry and implementation only document this as inconsistent-input coercion with a warning, while still requiring actual P(D=0) > 0 because Remark 3.1 is unimplemented (docs/methodology/REGISTRY.md:L719-L745, diff_diff/continuous_did.py:L319-L360). An autonomous agent following the worked example could therefore alter the data to create synthetic controls without any documented methodological basis. Concrete fix: remove/rewrite the “relabel to first_treat == 0” fallback everywhere; keep the coercion note as implementation behavior for bad/inconsistent inputs, not as a valid routing option. When has_never_treated == False or dose_min <= 0, direct users to re-encode treatment or choose a different estimator unless REGISTRY is updated with a documented deviation.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings beyond the documentation inconsistency noted below.

Tech Debt

  • No findings; nothing related is tracked in TODO.md.

Security

  • No findings.

Documentation/Tests

  • P2 Impact: some new text now says the duplicate-row check is a “hard fit-time stop” (diff_diff/profile.py:L117-L122, diff_diff/guides/llms-autonomous.txt:L864-L866), but the estimator still silently overwrites duplicate (unit, time) rows in _precompute_structures() (diff_diff/continuous_did.py:L820-L823). That contradiction can make readers rely on fit-time protection that does not exist. Concrete fix: change this wording to “hard preflight veto / must deduplicate before fit because ContinuousDiD otherwise uses last-row-wins,” matching the actual implementation and the other accurate duplicate-row warnings already present in the guide.
  • Static note: tests were not executed here because pytest is unavailable in the environment.
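The last-row-wins hazard flagged in that P2 is easy to demonstrate with a stdlib-only sketch (illustrative, not the `_precompute_structures()` code itself): building a `{(unit, time): value}` map from duplicated rows silently keeps only the final duplicate, so the duplicate must be caught before fit.

```python
rows = [
    ("u1", 2000, 1.0),
    ("u1", 2001, 2.0),
    ("u1", 2001, 9.0),  # duplicate (unit, time) cell
]

cells = {}
for unit, time, y in rows:
    cells[(unit, time)] = y  # last row wins; no exception is raised

# preflight veto: detect duplicates before they are silently swallowed
seen, duplicates = set(), []
for unit, time, _ in rows:
    key = (unit, time)
    if key in seen:
        duplicates.append(key)
    seen.add(key)

print(cells[("u1", 2001)], duplicates)
```

Here the earlier value 2.0 is gone without a warning, which is why the wording should promise a preflight veto rather than fit-time protection.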

Path to Approval

  1. Remove the new “manufacture controls / relabel to first_treat == 0” fallback from the guide, profile docstring, changelog, and roadmap, and replace it with wording that matches REGISTRY.md and ContinuousDiD.fit() exactly.
  2. Update the §5.2 continuous-dose worked example and related prose so the no-never-treated / negative-dose counterexamples route agents to re-encoding or alternative estimators, not deliberate relabeling.
  3. Add or update guide/profile regression tests so they lock in the corrected wording and do not normalize relabeling as an admissible methodology.

…s-fallback wording; correct duplicate-row "fit-time stop" claim

P1 (relabel-to-manufacture-controls misframing):
Round 11 introduced wording across the guide, profile docstring,
CHANGELOG, ROADMAP, and test docstrings that presented intentional
`first_treat == 0` relabeling of nonzero-dose units as an
"option" / "fallback" for fitting `ContinuousDiD` when the
profile-side preflights (`has_never_treated`, `dose_min > 0`)
fail. REGISTRY does not document this as a routing option, and the
estimator still requires actual `P(D=0) > 0` because Remark 3.1
lowest-dose-as-control is not yet implemented. The force-zero
coercion at `continuous_did.py:311-327` is implementation behavior
for INCONSISTENT inputs (e.g., user accidentally passes nonzero
dose on a never-treated row), not a methodology fallback.

Reworded every site to remove the relabeling-as-option framing and
replace it with the registry-documented fixes when (1) or (5)
fails: re-encode the treatment column to a non-negative scale that
contains a true never-treated group, or route to a different
estimator (`HeterogeneousAdoptionDiD` for graded-adoption panels;
linear DiD with the treatment as a continuous covariate). Every
remaining "manufacture controls" mention in the guide, profile,
and tests is now an explicit anti-recommendation ("do not relabel
... to manufacture controls"). Updated:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring (item (1):
  "not an opportunity to relabel ..."; item (5): coercion is
  "implementation behavior for inconsistent inputs, not a
  methodological fallback").
- `diff_diff/guides/llms-autonomous.txt` §2 field reference (the
  When-(1)-or-(5)-fails paragraph names re-encode + alternative
  estimator only; explicit anti-relabel warning).
- `diff_diff/guides/llms-autonomous.txt` §4.7 opening bullet +
  trailing paragraph (consolidated; opening bullet drops the
  relabel-as-fallback framing; trailing paragraph trimmed to a
  pointer to §2).
- `diff_diff/guides/llms-autonomous.txt` §5.2 step 2 + counter-
  example #4 + counter-example #5 (relabel-as-option language
  removed; explicit "do not relabel" callouts; counter-example #4
  options trimmed to (a) re-encode and (b) different estimator).
- `CHANGELOG.md` (relabel-as-option clause removed; replaced with
  re-encode / different-estimator framing).
- `ROADMAP.md` (same).
- `tests/test_profile_panel.py` two test docstrings (relabel-as-
  workflow language removed).

P2 (duplicate-row "hard fit-time stop" misclaim):
Round 11 wording said "duplicate-row failures are hard fit-time
stops" — incorrect. `_precompute_structures` at
`continuous_did.py:818-823` silently overwrites with last-row-wins,
no exception raised. Reworded as "hard preflight veto: the agent
must deduplicate before fit because `ContinuousDiD` otherwise uses
last-row-wins, no fit-time exception" in profile.py docstring,
guide §4.7 opening bullet, and §5.2 step 2 (now defers to §2 for
the breakdown). The previously-correct §2 description of the
silent-coerce path is preserved.

Length housekeeping:
The round-11-to-round-12 expansion pushed `llms-autonomous.txt`
above `llms-full.txt`, breaking `test_full_is_largest`. Trimmed
~2.7KB by consolidating the §4.7 trailing paragraph + §5.2 step 2
trailing block to point at §2's full breakdown rather than
duplicating the per-check semantics. autonomous: 65364 chars,
full: 66058 chars.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 307f424c9580d2739ab4b05313faaec4bb3b6f2e


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Re-review: the earlier user-facing “manufacture controls” guidance appears fixed in the guide/profile prose.
  • P1 The replacement ContinuousDiD preflight story is still wrong: it tells agents to derive first_treat from the same dose column that profile_panel() inspects, but the method/library contract uses a time-invariant unit dose plus a separate timing variable.
  • The contradiction is internal to the new §5.2 example itself: it first describes constant per-unit dose with separate first_treat, then bases the screening logic on inferring first_treat from that same dose column.
  • The same assumption is now duplicated in diff_diff/profile.py and new tests/test_profile_panel.py regression prose, so the mismatch will persist unless those surfaces are corrected together.
  • No estimator math, weighting, variance, or inference code changed in this diff.
  • Static review only: this environment lacks numpy/pytest, so I could not execute the targeted test files.

Methodology

  • P1 Impact: the new “standard workflow” in diff_diff/guides/llms-autonomous.txt:L219-L239, diff_diff/guides/llms-autonomous.txt:L530-L563, diff_diff/guides/llms-autonomous.txt:L801-L852, diff_diff/profile.py:L72-L130, and tests/test_profile_panel.py:L1111-L1199 says agents can derive first_treat from the same dose column that profile_panel() sees. That conflicts with the method contract: multi-period continuous DiD uses a time-invariant unit dose D_i and a separate timing group G_i/first_treat (docs/methodology/continuous-did.md:L65-L73), ContinuousDiD.fit() hard-rejects within-unit dose variation (diff_diff/continuous_did.py:L222-L228), and the library’s own generator emits constant dose with separate first_treat (diff_diff/prep_dgp.py:L970-L993). REGISTRY documents the untreated-group and balanced-panel requirements, but not any “derive timing from dose” workflow (docs/methodology/REGISTRY.md:L717-L723). An agent following the new guidance must either build a 0,0,d,d path to recover timing, which fit() rejects, or keep a valid constant-dose panel, in which case the dose column cannot identify first_treat at all. Concrete fix: remove the “derive first_treat from the same dose column” workflow from the guide/profile text and replace it with the actual contract: profile_panel() on the dose column alone can only provide descriptive dose-support checks; timing-dependent ContinuousDiD eligibility requires a separate first_treat column already supplied by the caller, or a new helper that validates dose and first_treat together.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • P2 Impact: PanelProfile is a public top-level type (diff_diff/__init__.py:L503-L506), but the PR adds two new required dataclass fields without defaults (diff_diff/profile.py:L179-L188). Any direct caller that instantiates PanelProfile(...) manually will now fail with TypeError, even though both additions are semantically optional. Concrete fix: make outcome_shape and treatment_dose backward-compatible additions, e.g. keyword-only fields with defaults, and add a regression test for direct construction.
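The compatible shape suggested in that P2 looks like this minimal sketch (an illustrative stand-in, not the real `PanelProfile` definition): new optional fields default to `None` at the end of the field list, so pre-existing positional constructions keep working.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Profile:  # illustrative stand-in, not diff_diff.PanelProfile
    n_units: int
    n_periods: int
    outcome_shape: Optional[object] = None    # Wave-2-style addition
    treatment_dose: Optional[object] = None   # Wave-2-style addition

old_style = Profile(10, 4)  # pre-extension positional call still works
new_style = Profile(10, 4, treatment_dose="dose-shape")
print(old_style.outcome_shape, new_style.treatment_dose)
```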

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • P3 Impact: the new docstring in tests/test_continuous_did.py:L903-L911 still says the coercion path “lets agents legally relabel negative-dose units as first_treat == 0,” which contradicts the corrected guide/profile/registry position. The behavior under test is fine; the comment is not. Concrete fix: rewrite the docstring to say coercion is observed implementation behavior for inconsistent inputs, not a sanctioned routing option.
  • Residual risk: static review only; I could not run tests/test_profile_panel.py, tests/test_guides.py, or tests/test_continuous_did.py because the environment is missing numpy/pytest.

Path to Approval

  1. Remove/rewrite the “derive first_treat from the same dose column” workflow in diff_diff/guides/llms-autonomous.txt, diff_diff/profile.py, CHANGELOG.md, and ROADMAP.md.
  2. Rewrite §5.2 and the new tests/test_profile_panel.py regression prose around the actual ContinuousDiD contract: time-invariant unit dose D_i, separate first_treat/G_i.
  3. If predictive pre-fit screening is still desired, add a helper or optional first_treat= input that validates dose and first_treat together; otherwise explicitly narrow the docs to descriptive dose-support checks only.

… first_treat from dose" framing; add PanelProfile backward-compat defaults; fix test_continuous_did docstring

P1 (canonical ContinuousDiD setup vs. derive-from-dose framing):
Round 12 introduced a "standard workflow" description across the
guide, profile docstring, CHANGELOG, ROADMAP, and test docstrings
that said agents derive `first_treat` from the same dose column
passed to `profile_panel`. Reviewer correctly noted this conflicts
with the actual ContinuousDiD contract (`continuous_did.py:222-228`,
`prep_dgp.py:970-993`, `docs/methodology/continuous-did.md:65-73`):
the canonical setup uses a **time-invariant per-unit dose** `D_i`
and a **separate `first_treat` column** the caller supplies — the
dose column has no within-unit time variation in this setup, so it
cannot encode timing. An agent following the rejected framing would
either build a `0,0,d,d` path (which `fit()` rejects) or keep a
valid constant-dose panel (in which case the dose column carries no
timing information).
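The canonical input shape that replaces the rejected framing can be sketched as follows (column layout and names are illustrative, not the library API): a time-invariant per-unit dose plus a separate caller-supplied `first_treat`, with the dose column screened descriptively.

```python
panel = [
    # (unit, time, dose, first_treat)
    ("a", 2000, 0.0, 0),     # never-treated: first_treat == 0 AND D_i == 0
    ("a", 2001, 0.0, 0),
    ("b", 2000, 0.7, 2001),  # treated: constant dose in ALL periods,
    ("b", 2001, 0.7, 2001),  # including pre-treatment ones
]

# the dose column alone carries no timing information...
doses_by_unit = {}
for u, _, d, _ in panel:
    doses_by_unit.setdefault(u, set()).add(d)
assert all(len(s) == 1 for s in doses_by_unit.values())  # time-invariant

# ...so profile-side screening on it is descriptive only:
has_never_treated = any(g == 0 for _, _, _, g in panel)  # proxies P(D=0) > 0
dose_min = min(d for _, _, d, g in panel if g != 0)      # must be > 0
print(has_never_treated, dose_min)
```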

Reworded every site to drop the derive-from-dose framing and
replace with the canonical setup. The five facts on the dose column
remain predictive of `fit()` outcomes BECAUSE the canonical
convention ties `first_treat == 0` to `D_i == 0` and treated units
carry their constant dose across all periods — so `has_never_treated`
proxies `P(D=0) > 0` and `dose_min > 0` predicts the strictly-
positive-treated-dose requirement, without any "derivation" of
`first_treat` from the dose column. Updated:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring (rewrote
  the multi-paragraph block to use the canonical-setup framing
  and added an explicit "agent must validate `first_treat`
  independently" note).
- `diff_diff/guides/llms-autonomous.txt` §2 field reference.
- `diff_diff/guides/llms-autonomous.txt` §4.7 opening bullet.
- `diff_diff/guides/llms-autonomous.txt` §5.2 reasoning chain
  step 2 + counter-examples #4 and #5 (now describe the
  canonical setup rather than a derive-from-dose workflow).
- `CHANGELOG.md` Wave 2 entry.
- `ROADMAP.md` AI-Agent Track building block.
- `tests/test_profile_panel.py`
  `test_treatment_dose_min_flags_negative_dose_continuous_panels`
  docstring/comments.

P2 (PanelProfile direct-construction backward compat):
Wave 2 added `outcome_shape` and `treatment_dose` to PanelProfile
without defaults, breaking direct `PanelProfile(...)` calls that
predate Wave 2. Made both fields default to `None` (moved them to
the end of the field list; both are `Optional[...]`). Added
`test_panel_profile_direct_construction_without_wave2_fields`
asserting that direct construction without the new fields succeeds
and yields `None` defaults that serialize correctly through
`to_dict()`.

P3 (test_continuous_did.py docstring overstating sanction):
The new `test_negative_dose_on_never_treated_coerces_not_rejects`
docstring said the contract "lets agents legally relabel
negative-dose units as `first_treat == 0` to coerce them away."
Reworded as observed implementation behavior for inconsistent
inputs, NOT a sanctioned routing option — the test locks in the
coercion contract while the autonomous guide §5.2 explicitly tells
agents not to use this path methodologically.

Length invariant maintained: autonomous (65748 chars) < full
(66031 chars); `test_full_is_largest` still passes (compares
character count, not byte count, so on-disk size with UTF-8
multi-byte characters differs from the assertion target).
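The character-count vs. byte-count distinction behind that note is easy to demonstrate (the string below is arbitrary): `len()` on a `str` counts Unicode code points, while the on-disk UTF-8 size counts bytes.

```python
s = "§5.2 ✓"
# "§" encodes to 2 UTF-8 bytes and "✓" to 3, so the byte count exceeds
# the character count that the test asserts on.
print(len(s), len(s.encode("utf-8")))  # → 6 9
```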

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 60014f5c25cc54e67380a1d0c59cf93b62dc0f72


Overall Assessment

✅ Looks good

Executive Summary

  • Re-review: the prior P1 about telling agents to derive first_treat from the dose column is resolved on the public-facing surfaces. The updated ContinuousDiD guidance now matches the canonical contract: time-invariant unit dose D_i, separate caller-supplied first_treat, and separate fit-time validation of first_treat (diff_diff/profile.py:L65-L141, diff_diff/guides/llms-autonomous.txt:L206-L283, diff_diff/guides/llms-autonomous.txt:L810-L919, diff_diff/continuous_did.py:L222-L360).
  • The new OutcomeShape.is_count_like guidance is consistent with the actual Poisson path: WooldridgeDiD(method="poisson") rejects negative outcomes, and the guide now correctly describes the reported overall_att as an ASF-based outcome-scale difference rather than a multiplicative reported estimand (diff_diff/profile.py:L700-L725, diff_diff/wooldridge.py:L1102-L1245, docs/methodology/REGISTRY.md:L1333-L1349).
  • The public API propagation looks complete for this PR: the new dataclasses are exported from diff_diff, and PanelProfile keeps backward compatibility by defaulting the new fields (diff_diff/__init__.py:L252-L258, diff_diff/__init__.py:L503-L509, diff_diff/profile.py:L189-L193, tests/test_profile_panel.py:L1325-L1362).
  • P3: two new regression docstrings still restate the old “derive first_treat from dose” workflow. This is test-only prose, not a runtime or methodology bug, but it leaves stale language in the tree (tests/test_profile_panel.py:L1110-L1137, tests/test_profile_panel.py:L1184-L1199).
  • Static review only: I could not run the touched tests in this environment because pytest is unavailable (python -m pytest ... failed with No module named pytest).

Methodology

  • No unmitigated findings. The affected methods are ContinuousDiD and WooldridgeDiD, but only through pre-fit descriptive guidance, not estimator math or inference. The updated text aligns with the ContinuousDiD contract in docs/methodology/continuous-did.md:L65-L76, docs/methodology/continuous-did.md:L265-L269, docs/methodology/REGISTRY.md:L717-L745, and diff_diff/continuous_did.py:L222-L360, and with the Poisson ASF path documented for WooldridgeDiD in docs/methodology/REGISTRY.md:L1333-L1349 and implemented at diff_diff/wooldridge.py:L1102-L1245.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • Severity: P3
    Impact: tests/test_profile_panel.py:L1110-L1137 and tests/test_profile_panel.py:L1184-L1199 still describe a “standard ContinuousDiD workflow” where first_treat is derived from the same dose column. That contradicts the now-corrected contract in diff_diff/profile.py:L65-L141 and diff_diff/guides/llms-autonomous.txt:L206-L283, diff_diff/guides/llms-autonomous.txt:L810-L919. It will not change runtime behavior, but it can mislead future editors and make this methodology regression easier to reintroduce.
    Concrete fix: rewrite those test docstrings/comments to the canonical formulation already used elsewhere: time-invariant per-unit dose D_i, separate caller-supplied first_treat, and profile-side screening only on the dose column.

  • Residual risk: static review only; targeted pytest execution was not possible in this environment because pytest is not installed.
