Surface agent_workflow() + curated dir() for LLM discoverability (#460) by igerber · Pull Request #464 · igerber/diff-diff

igerber · 2026-05-18T02:38:14Z

Summary

Add diff_diff.agent_workflow(df, *, unit, time, treatment, outcome, first_treat=None) — stateless orchestrator that prints a copy-pasteable 5-step workflow (profile_panel → get_llm_guide("autonomous") → <Estimator>().fit(df, ...) → practitioner_next_steps(result) → BusinessReport(result).full_report()) with the caller's column names templated in via repr() (source-safe under hostile labels). Calls nothing internally; does not inspect df. Step 3 branches on first_treat so the emitted call always matches an actual fit() signature (CallawaySantAnna with first_treat= when supplied; DifferenceInDifferences with treatment= for the simple 2x2 case).
Rewrite top-level __doc__ so the first non-blank paragraph names agent_workflow(df, ...) as the recommended starting call. get_llm_guide("full") and get_llm_guide("practitioner") pointers preserved (the existing tests/test_guides.py::test_module_docstring_mentions_helper guard continues to hold).
Add module-level __dir__() override that surfaces agent-facing names (agent_workflow, profile_panel, get_llm_guide, practitioner_next_steps, BusinessReport, DiagnosticReport) at the head of dir(diff_diff). Implementation uses a small _OrderedName(str) subclass with custom __lt__ — CPython's dir() always sorts the result of __dir__(), but PyList_Sort respects custom comparison operators, so the priority head wins. __all__ membership and from diff_diff import * semantics are unchanged. inspect.getmembers(diff_diff) still returns 274 members including __doc__/__name__/__file__.
Update diff_diff/guides/llms.txt and diff_diff/guides/llms-autonomous.txt with one-line agent_workflow(...) pointers.
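
The `__dir__()` ordering trick described above can be sketched roughly as follows. This is a minimal illustration and not the PR's actual code: `OrderedName`, `HEAD`, `RANK`, and `ordered_dir` are hypothetical names, the priority list is shortened, and the demo runs on a throwaway `types.ModuleType` object rather than on `diff_diff` itself. It assumes CPython, where `dir()` sorts whatever `__dir__()` returns and the sort consults each element's `__lt__`.

```python
import types

# Hypothetical, shortened priority list; the PR's real head names live in the package.
HEAD = ("agent_workflow", "profile_panel", "get_llm_guide")
RANK = {name: i for i, name in enumerate(HEAD)}


class OrderedName(str):
    """str subclass whose '<' ranks HEAD names first, then falls back to alphabetic order."""

    def _sort_key(self):
        plain = str(self)
        # Names outside HEAD all share the same rank, so they stay alphabetic.
        return (RANK.get(plain, len(HEAD)), plain)

    def __lt__(self, other):
        if isinstance(other, OrderedName):
            return self._sort_key() < other._sort_key()
        return str.__lt__(self, other)


def ordered_dir(names):
    """Wrap every name so that all sort comparisons go through OrderedName.__lt__."""
    return [OrderedName(n) for n in names]


# Throwaway module object standing in for a real package namespace.
demo = types.ModuleType("demo")
demo.zeta = 1
demo.alpha = 2
demo.profile_panel = 3
demo.agent_workflow = 4
# A module-level __dir__ is looked up in the module's own __dict__.
demo.__dict__["__dir__"] = lambda: ordered_dir(
    k for k in demo.__dict__ if k != "__dir__"
)

listing = dir(demo)  # dir() sorts the returned list using OrderedName.__lt__
```

Wrapping every returned name (not just the head names) keeps all comparisons inside the subclass, which avoids relying on how mixed `str` / subclass comparisons are dispatched.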

Closes #460.
Companion PR for #461 (snapshot/contract test) will follow after this merges.

Methodology references (required if estimator / math changes)

Method name(s): N/A — no estimator, math, variance, or default inference behavior changes.
Paper / source link(s): N/A
Any intentional deviations from the source (and why): None

Validation

Tests added/updated: tests/test_agent_workflow.py (new, 18 tests) covering canonical output keys, workflow primitive references, column-name templating, first_treat conditional, first_treat-driven Step 3 branching, no-overclaim contract on the no-first_treat path, AST-parseability of every emitted call line, adversarial column labels (6 parametrized cases: embedded quotes, backslashes, injection attempts, whitespace), fit_candidates importability, verbose toggle, no-df-inspection contract.
Existing test guards continue to hold: tests/test_guides.py (41 tests, including test_module_docstring_mentions_helper).
Backtest / simulation / notebook evidence (if applicable): N/A — orchestrator only emits string templates; no fits run.

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…lity

Closes #460 items 1, 2, 3:

- `__doc__` rewritten to lead with `agent_workflow(df, ...)` as the recommended starting call. Existing `get_llm_guide` mention preserved for the `test_module_docstring_mentions_helper` guard.
- New `diff_diff.agent_workflow(df, *, unit, time, treatment, outcome, first_treat=None, verbose=True)`: stateless orchestrator that prints a copy-pasteable 5-step workflow (profile_panel → get_llm_guide → <Estimator>.fit → practitioner_next_steps → BusinessReport) with the caller's column names templated in. Calls nothing internally; does not inspect df.
- New module-level `__dir__()` override paired with a `_OrderedName(str)` subclass that subverts CPython's unconditional alphabetic sort (PyList_Sort respects __lt__ on elements). Agent-facing names surface at the head of `dir(diff_diff)`; tail stays alphabetic via the `str.__lt__` fallback. `__all__` membership unchanged; `from diff_diff import *` semantics unaffected.

Tests: tests/test_agent_workflow.py (9 tests covering canonical keys, script content, column templating, first_treat conditional, fit candidate importability, verbose toggle, no-df-inspection contract).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…+ tests)

Addresses local /ai-review-local findings (4 of 4):

P0 (Security): agent_workflow.py emitted column names via `f'{k}="{v}"'`, which broke on labels containing `"` and could inject Python statements into the "copy-pasteable" script. Replaced with `_safe_kwarg` / `_join_kwargs` that use Python's `repr()`, which produces source-safe string literals for any str input (quotes, backslashes, newlines, etc.). Added 9 regression tests (test_emitted_calls_are_valid_python plus a parametrize over 6 adversarial labels) that ast.parse() the emitted script lines.

P1 (Contract break): Step 3 example previously emitted `CallawaySantAnna(...).fit(df, ..., treatment=...)` but CS21's `.fit()` signature is `(data, outcome, unit, time, first_treat, ...)`, so `treatment=` is rejected. The snippet would TypeError if anyone copy-pasted. Step 3 now branches on `first_treat` presence:

- first_treat=None -> DifferenceInDifferences with `treatment=`
- first_treat=<col> -> CallawaySantAnna with `first_treat=` (no treatment)

Each branch's call signature matches the actual public API. New test `test_first_treat_switches_step3_estimator` locks the contract. Also dropped PreTrendsPower / HonestDiD from `_WORKFLOW_PATTERNS` (their `.fit()` takes `results`, not `df`+columns, so the pattern label "substitute the candidate" was misleading for them); they now appear in a separate "diagnostics" block in the templated script under Step 4.

P2 (Introspection regression): __dir__() returned only `__all__` entries, which dropped module dunders (`__doc__`, `__name__`, `__file__`) from `inspect.getmembers(diff_diff)` and contradicted the CHANGELOG "compatible with inspect.getmembers" claim. __dir__() now returns `[_OrderedName(n) for n in globals()]`, so the full module namespace is enumerated while head-first ordering is preserved.

P2 (Test coverage): Added `test_emitted_calls_are_valid_python` (ast.parse on emitted profile_call + fit example for both first_treat branches) and `test_adversarial_column_labels_produce_valid_python` (parametrized over quotes, backslashes, injection attempts, whitespace). Tests would have caught both P0 and P1 if present.

Total tests: 9 → 17 in test_agent_workflow.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
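
For illustration, the P0 fix can be sketched as below. The helper names mirror the `_safe_kwarg` / `_join_kwargs` names mentioned in the commit message, but the bodies here are assumptions, not the PR's code; `render_call` and `roundtrip_columns` are hypothetical helpers added only for the demo. The sketch contrasts naive double-quote interpolation with `repr()`-based rendering and checks both with `ast.parse()`.

```python
import ast


def safe_kwarg(name: str, value: str) -> str:
    """Render name=<literal>; repr() escapes quotes, backslashes and newlines."""
    return f"{name}={value!r}"


def join_kwargs(**columns: str) -> str:
    """Join several column kwargs into one source-safe argument string."""
    return ", ".join(safe_kwarg(k, v) for k, v in columns.items())


def render_call(func_name: str, **columns: str) -> str:
    """Build one emitted call line, e.g. profile_panel(df, unit='id', ...)."""
    return f"{func_name}(df, {join_kwargs(**columns)})"


def roundtrip_columns(line: str) -> dict:
    """Parse an emitted line and recover each keyword's string value."""
    call = ast.parse(line).body[0].value
    return {kw.arg: kw.value.value for kw in call.keywords}


# A hostile column label that tries to close the string and append statements.
hostile = 'y"); import os; os.system("echo pwned") #'

# Naive interpolation: the label escapes the literal and adds extra statements.
naive_line = f'profile_panel(df, outcome="{hostile}")'
naive_statement_count = len(ast.parse(naive_line).body)

# repr()-based rendering: the label stays inside a single string literal.
safe_line = render_call("profile_panel", unit="id", outcome=hostile)
safe_statement_count = len(ast.parse(safe_line).body)
recovered = roundtrip_columns(safe_line)
```

Round-tripping the label through `ast.parse()` is a slightly stronger check than parseability alone, since it confirms the emitted literal evaluates back to the original column name.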

…eat path

Codex R2 narrowed P1: the no-`first_treat` branch unconditionally emitted `DifferenceInDifferences().fit(...)` and labeled it "matched to your data shape". That overclaim doesn't hold for continuous-dose or heterogeneous-adoption designs without first_treat: DiD validates and rejects non-binary treatment / time at fit-time (`diff_diff/estimators.py:307-312`).

Fix in agent_workflow.py:

- With `first_treat`: keep CallawaySantAnna example, relabel as "your data has first_treat -> staggered structure; CS is canonical".
- Without `first_treat`: keep DiD example as the simple 2x2 case but reframe the label to explicitly condition on that shape, name the alternative candidates (ContinuousDiD, HeterogeneousAdoptionDiD), and reference DiD's fit-time validation so the agent knows when to switch. No claim of universal match.

New regression test `test_no_first_treat_step3_does_not_overclaim_match`:

- asserts "matched to your data shape" is NOT in the no-first_treat script (negative)
- asserts "2x2" and "substitute" appear (positive: substitution hint)
- asserts ContinuousDiD and HeterogeneousAdoptionDiD remain enumerated in Step 2's routing patterns

Total tests: 17 → 18 in test_agent_workflow.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
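
A rough sketch of the Step 3 branching described in these commits, for illustration only. `step3_lines` is a hypothetical function, the comment wording is invented, and the keyword names for `DifferenceInDifferences().fit()` other than `treatment=` are assumptions; the `CallawaySantAnna` keywords follow the `(data, outcome, unit, time, first_treat, ...)` signature quoted above. The function only builds strings and never imports or calls the library.

```python
import ast
from typing import List, Optional


def step3_lines(
    unit: str,
    time: str,
    treatment: str,
    outcome: str,
    first_treat: Optional[str] = None,
) -> List[str]:
    """Return comment lines plus one example fit() call for the chosen branch."""
    if first_treat is None:
        # No first_treat: show the simple 2x2 example and hedge the label.
        return [
            "# Simple 2x2 case only; substitute ContinuousDiD or",
            "# HeterogeneousAdoptionDiD if treatment or time is not binary.",
            "result = diff_diff.DifferenceInDifferences().fit("
            f"df, outcome={outcome!r}, treatment={treatment!r}, time={time!r})",
        ]
    # first_treat supplied: emit first_treat= and drop treatment= entirely.
    return [
        "# first_treat present: CallawaySantAnna is one candidate, not the only one.",
        "result = diff_diff.CallawaySantAnna().fit("
        f"df, outcome={outcome!r}, unit={unit!r}, time={time!r}, "
        f"first_treat={first_treat!r})",
    ]


cols = dict(unit="id", time="period", treatment="treated", outcome="y")
no_ft = step3_lines(**cols)
with_ft = step3_lines(**cols, first_treat="first_period")

# Both branches should be valid Python when joined into one block.
ast.parse("\n".join(no_ft))
ast.parse("\n".join(with_ft))
```

Keeping the branch decision in one small function makes the "each emitted call matches a real signature" contract easy to pin with a test per branch.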

github-actions · 2026-05-18T02:46:26Z

Overall Assessment

⚠️ Needs changes — no estimator math/SE/default-inference code changed in this diff, but the new agent_workflow() entrypoint introduces one unmitigated P1 methodology/routing issue.

Executive Summary

No changes in the diff modify estimator formulas, weighting, variance/SE logic, or documented defaults.
P1: agent_workflow() treats first_treat as if it implies the binary staggered CallawaySantAnna path, which conflicts with the library’s own documented ContinuousDiD and dose-based HAD contracts.
P2: the emitted "script" is not actually a copy-pasteable Python script as advertised; it contains prose/footer lines and the final report call does not print anything in script execution.
P3: the new __dir__() discoverability hack is not contract-tested in this PR.
Security surface looks fine; the repr()-based templating is the right shape for hostile column labels.

Runtime note: I could not execute the added tests in this sandbox because numpy is unavailable here, so this review is from static inspection.

Methodology

Severity: P1. Impact: first_treat is not equivalent to “binary staggered adoption, so start with Callaway-Sant’Anna.” The new Step 3 branch says exactly that and emits a CallawaySantAnna().fit(...) example whenever first_treat is present, but the library’s own contracts treat first_treat as a required input for ContinuousDiD as well, and dose-based HAD event-study also accepts first_treat_col. Because agent_workflow() is now advertised as the recommended starting call, an agent on a continuous-dose or graded-adoption panel can be steered into the wrong estimator with no warning. I found no REGISTRY.md note documenting this routing shortcut. Concrete fix: keep Step 3 conditional on treatment shape even when first_treat is supplied; if you keep a single example, label it explicitly as the binary staggered case and name ContinuousDiD / HAD alternatives for dose-based panels. Add regression tests for first_treat + continuous-dose and first_treat + HAD event-study scenarios. References: diff_diff/agent_workflow.py (around L148-L170), diff_diff/continuous_did.py, diff_diff/had.py, docs/methodology/REGISTRY.md, diff_diff/__init__.py.

Code Quality

Severity: P2. Impact: the returned "script" is described and tested as “copy-pasteable,” but the emitted block is not executable Python as printed: it contains prose/footer lines (Full reference: ..., Practitioner recipe: ...) and Step 5 evaluates BusinessReport(...).full_report() without printing or assigning the returned string. In a real .py execution the block will syntax-error, and even line-by-line execution discards the final report output. Concrete fix: either emit a real executable Python block (comment-prefix prose lines and print(diff_diff.BusinessReport(result).full_report())) or rename/document the output as workflow text rather than a script. Add a whole-output smoke test aligned to the intended contract. References: diff_diff/agent_workflow.py (around L189-L233), diff_diff/business_report.py, tests/test_agent_workflow.py.

Performance

No findings.

Maintainability

No findings beyond the test-gap note below.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the new public __dir__() discoverability surface relies on a custom _OrderedName(str) ordering trick, but this PR adds no contract test for head order, membership, or inspect.getmembers() compatibility. That is deferrable, but it leaves the most compatibility-sensitive new surface unpinned. Concrete fix: add a small discoverability contract test in this PR, or explicitly track the deferral in TODO.md before merge. References: diff_diff/__init__.py, tests/test_agent_workflow.py.

Path to Approval

Remove the universal first_treat -> CallawaySantAnna implication in agent_workflow() and add regression tests covering first_treat on continuous-dose / dose-based panels. That resolves the blocking P1 and would move the assessment to ✅.
Recommended before merge: make the emitted "script" contract honest by either generating executable Python or relabeling it as prose workflow text.

… script executability + P3 deferral note)

CI codex review on PR #464 (initial push) flagged 3 items:

P1 (Methodology): The first_treat-branch Step 3 example implied binary staggered -> CallawaySantAnna without naming alternatives. But ContinuousDiD.fit() and HeterogeneousAdoptionDiD event-study BOTH take a `first_treat` column (via `first_treat=` and `first_treat_col=` respectively); the same overclaim pattern I already caught and fixed for the no-first_treat path in R3 was present mirror-image on the with-first_treat path. The Step 3 comment block now enumerates the three estimator options for first_treat panels (CS21 / ContinuousDiD / HAD event-study) and names the HAD signature distinction (first_treat_col vs first_treat) so an agent isn't silently steered to CS21 on a continuous-dose panel. New test `test_first_treat_step3_names_non_binary_alternatives` locks this.

P2 (Code Quality): The emitted "script" was tested for parseability only at the line-of-call granularity, not at the whole-file level. Section headers like "Step 1 - Describe the panel:" and footer lines "Full reference: ..." would SyntaxError if the agent dumped the script to a .py file and ran it. Step 5 also evaluated `BusinessReport(...).full_report()` without printing, so the report str would be discarded on execution. Fix: every prose line in the template now starts with `#` (valid Python comment); code lines stand at column 0 and run as-is; Step 5 wraps the report call in `print()`. The full script `ast.parse()`s as a Python module under both first_treat branches. New tests `test_emitted_script_parses_as_python_module` and `test_emitted_script_prints_report` lock these contracts.

P3 (Tests): The new `__dir__()` discoverability surface (`_OrderedName` ordering trick) has no contract test in this PR. Added a TODO.md entry under "Tech Debt from Code Reviews" / "Testing/Docs" deferring it to PR B (planned, addresses #461 with the full snapshot/contract surface). The deferral is the documented pattern the reviewer's "Concrete fix" suggested.

Tests: 18 → 21 in test_agent_workflow.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
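
The whole-script contract from the P2 fix could be checked along these lines. This is a sketch under assumptions: `build_script` and `parses` are hypothetical helpers, the template is cut down to two steps, and the `profile_panel` keyword names are assumed from the `agent_workflow` signature rather than confirmed. Nothing from `diff_diff` is imported or executed; the script is only parsed.

```python
import ast


def build_script() -> str:
    """Assemble a tiny workflow script: prose as '#' comments, code at column 0."""
    lines = [
        "# Step 1 - Describe the panel:",
        "profile = diff_diff.profile_panel("
        "df, unit='id', time='t', treatment='d', outcome='y')",
        "# Step 5 - Stakeholder report:",
        "print(diff_diff.BusinessReport(result).full_report())",
        "# Full reference: ...",
    ]
    return "\n".join(lines)


def parses(text: str) -> bool:
    """True if the text is syntactically valid as a Python module."""
    try:
        ast.parse(text)
        return True
    except SyntaxError:
        return False


script = build_script()
module = ast.parse(script)

# Comments produce no AST nodes, so the last statement is the Step 5 call.
last_stmt = module.body[-1]
report_is_printed = (
    isinstance(last_stmt, ast.Expr)
    and isinstance(last_stmt.value, ast.Call)
    and getattr(last_stmt.value.func, "id", None) == "print"
)

# Dropping the first '# ' prefix turns a header back into bare prose.
unprefixed = script.replace("# ", "", 1)
```

Parsing the full text catches problems that per-line checks miss, such as an unprefixed header or footer line between otherwise valid calls.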

github-actions · 2026-05-18T10:03:48Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 27bb822f4d9a0adfb5185471f65c6b0edb7433bd

Overall Assessment

✅ Looks good — no unmitigated P0 or P1 findings in the changed surface.

Executive Summary

Re-review: the prior P1 is resolved. agent_workflow() no longer treats first_treat as uniquely implying CallawaySantAnna; Step 3 now explicitly says first_treat does not identify a single estimator and names ContinuousDiD / HeterogeneousAdoptionDiD alternatives, which matches the estimator contracts and registry notes at diff_diff/agent_workflow.py:L148-L176, diff_diff/continuous_did.py:L159-L169, diff_diff/had.py:L2782-L2841, docs/methodology/REGISTRY.md:L2423-L2424, docs/methodology/REGISTRY.md:L2572-L2574.
The prior script-contract issue is also resolved: the emitted workflow is now valid Python text and Step 5 prints the report, with regression tests at diff_diff/agent_workflow.py:L213-L242 and tests/test_agent_workflow.py:L203-L242.
No estimator formulas, weighting, variance/SE logic, identification checks, or default inference behavior change in this diff.
The new surface is hardened against hostile column labels via repr() templating and parser-safety tests at diff_diff/agent_workflow.py:L43-L61 and tests/test_agent_workflow.py:L291-L326.
The prior __dir__() test-gap is now explicitly tracked in TODO.md:L164-L164, so it is mitigated P3 rather than a blocker.

Methodology

No findings. The only methodology-relevant prior issue was the first_treat routing shortcut, and that is addressed by the revised Step 3 wording plus targeted regression tests at tests/test_agent_workflow.py:L145-L200.

Code Quality

No findings. The new orchestrator is simple string templating, does not inspect df, and the copy-pasteable-script contract is materially improved and tested at diff_diff/agent_workflow.py:L140-L145, diff_diff/agent_workflow.py:L213-L242, and tests/test_agent_workflow.py:L203-L242.

Performance

No findings. The added code is negligible string assembly plus a dir() ordering hook.

Maintainability

No unmitigated findings. The __dir__() ordering trick is unusual, but it is documented inline at diff_diff/__init__.py:L525-L589.

Tech Debt

Severity: P3-informational (mitigated). Impact: __dir__() head-order / inspect.getmembers contract coverage is still deferred, so regressions on that discoverability surface would not be caught by this PR’s tests alone. Concrete fix: land the planned tests/test_agent_discoverability.py follow-up already tracked in TODO.md:L164-L164. References: diff_diff/__init__.py:L525-L589, TODO.md:L164-L164.

Security

No findings. repr()-based kwarg rendering prevents code injection through column labels, and the adversarial-label tests cover the relevant parser-safety cases at diff_diff/agent_workflow.py:L43-L61 and tests/test_agent_workflow.py:L291-L326.

Documentation/Tests

Severity: P3-informational (mitigated). Impact: this PR changes diff_diff/guides/llms.txt and diff_diff/guides/llms-autonomous.txt, but .txt AI guides still sit outside dedicated snippet smoke validation in CI. That limitation predates this PR and is already tracked. Concrete fix: implement the existing .txt guide-validation follow-up tracked in TODO.md:L160-L160. References: diff_diff/guides/llms.txt:L13-L15, diff_diff/guides/llms-autonomous.txt:L13-L17, TODO.md:L160-L160.

Runtime note: I could not execute the package tests in this sandbox because importing diff_diff fails without numpy; this review is based on static inspection.

igerber and others added 3 commits May 17, 2026 22:37

igerber added the ready-for-ci Triggers CI test workflows label May 18, 2026


igerber wants to merge 4 commits into main from agent-workflow-discoverability
