Skip to content

Surface agent_workflow() + curated dir() for LLM discoverability (#460)#464

Open
igerber wants to merge 4 commits into
mainfrom
agent-workflow-discoverability
Open

Surface agent_workflow() + curated dir() for LLM discoverability (#460)#464
igerber wants to merge 4 commits into
mainfrom
agent-workflow-discoverability

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 18, 2026

Summary

  • Add diff_diff.agent_workflow(df, *, unit, time, treatment, outcome, first_treat=None) — stateless orchestrator that prints a copy-pasteable 5-step workflow (profile_panelget_llm_guide("autonomous")<Estimator>().fit(df, ...)practitioner_next_steps(result)BusinessReport(result).full_report()) with the caller's column names templated in via repr() (source-safe under hostile labels). Calls nothing internally; does not inspect df. Step 3 branches on first_treat so the emitted call always matches an actual fit() signature (CallawaySantAnna with first_treat= when supplied; DifferenceInDifferences with treatment= for the simple 2x2 case).
  • Rewrite top-level __doc__ so the first non-blank paragraph names agent_workflow(df, ...) as the recommended starting call. get_llm_guide("full") and get_llm_guide("practitioner") pointers preserved (the existing tests/test_guides.py::test_module_docstring_mentions_helper guard continues to hold).
  • Add module-level __dir__() override that surfaces agent-facing names (agent_workflow, profile_panel, get_llm_guide, practitioner_next_steps, BusinessReport, DiagnosticReport) at the head of dir(diff_diff). Implementation uses a small _OrderedName(str) subclass with custom __lt__ — CPython's dir() always sorts the result of __dir__(), but PyList_Sort respects custom comparison operators, so the priority head wins. __all__ membership and from diff_diff import * semantics are unchanged. inspect.getmembers(diff_diff) still returns 274 members including __doc__/__name__/__file__.
  • Update diff_diff/guides/llms.txt and diff_diff/guides/llms-autonomous.txt with one-line agent_workflow(...) pointers.

Closes #460.
Companion PR for #461 (snapshot/contract test) will follow after this merges.

Methodology references (required if estimator / math changes)

  • Method name(s): N/A — no estimator, math, variance, or default inference behavior changes.
  • Paper / source link(s): N/A
  • Any intentional deviations from the source (and why): None

Validation

  • Tests added/updated: tests/test_agent_workflow.py (new, 18 tests) covering canonical output keys, workflow primitive references, column-name templating, first_treat conditional, first_treat-driven Step 3 branching, no-overclaim contract on the no-first_treat path, AST-parseability of every emitted call line, adversarial column labels (6 parametrized cases: embedded quotes, backslashes, injection attempts, whitespace), fit_candidates importability, verbose toggle, no-df-inspection contract.
  • Existing test guards continue to hold: tests/test_guides.py (41 tests, including test_module_docstring_mentions_helper).
  • Backtest / simulation / notebook evidence (if applicable): N/A — orchestrator only emits string templates; no fits run.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

igerber and others added 3 commits May 17, 2026 22:37
…lity

Closes #460 items 1, 2, 3:
- `__doc__` rewritten to lead with `agent_workflow(df, ...)` as the
  recommended starting call. Existing `get_llm_guide` mention preserved
  for the `test_module_docstring_mentions_helper` guard.
- New `diff_diff.agent_workflow(df, *, unit, time, treatment, outcome,
  first_treat=None, verbose=True)` — stateless orchestrator that prints
  a copy-pasteable 5-step workflow (profile_panel → get_llm_guide →
  <Estimator>.fit → practitioner_next_steps → BusinessReport) with the
  caller's column names templated in. Calls nothing internally; does not
  inspect df.
- New module-level `__dir__()` override paired with a `_OrderedName(str)`
  subclass that subverts CPython's unconditional alphabetic sort
  (PyList_Sort respects __lt__ on elements). Agent-facing names surface
  at the head of `dir(diff_diff)`; tail stays alphabetic via the
  `str.__lt__` fallback. `__all__` membership unchanged; `from diff_diff
  import *` semantics unaffected.

Tests: tests/test_agent_workflow.py (9 tests covering canonical keys,
script content, column templating, first_treat conditional, fit
candidate importability, verbose toggle, no-df-inspection contract).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…+ tests)

Addresses local /ai-review-local findings (4 of 4):

P0 (Security): agent_workflow.py emitted column names via `f'{k}="{v}"'`,
which broke on labels containing `"` and could inject Python statements
into the "copy-pasteable" script. Replaced with `_safe_kwarg` /
`_join_kwargs` that use Python's `repr()` — produces source-safe string
literals for any str input (quotes, backslashes, newlines, etc.). Added
9 regression tests (test_emitted_calls_are_valid_python plus a
parametrize over 6 adversarial labels) that ast.parse() the emitted
script lines.

P1 (Contract break): Step 3 example previously emitted
`CallawaySantAnna(...).fit(df, ..., treatment=...)` but CS21's
`.fit()` signature is `(data, outcome, unit, time, first_treat, ...)`
— `treatment=` is rejected. The snippet would TypeError if anyone
copy-pasted. Step 3 now branches on `first_treat` presence:
- first_treat=None  -> DifferenceInDifferences with `treatment=`
- first_treat=<col> -> CallawaySantAnna with `first_treat=` (no treatment)
Each branch's call signature matches the actual public API. New test
`test_first_treat_switches_step3_estimator` locks the contract.

Also dropped PreTrendsPower / HonestDiD from `_WORKFLOW_PATTERNS` (their
`.fit()` takes `results`, not `df`+columns — pattern label "substitute
the candidate" was misleading for them); they now appear in a separate
"diagnostics" block in the templated script under Step 4.

P2 (Introspection regression): __dir__() returned only `__all__`
entries, which dropped module dunders (`__doc__`, `__name__`, `__file__`)
from `inspect.getmembers(diff_diff)` — contradicted the CHANGELOG
"compatible with inspect.getmembers" claim. __dir__() now returns
`[_OrderedName(n) for n in globals()]`, so the full module namespace
is enumerated while head-first ordering is preserved.

P2 (Test coverage): Added `test_emitted_calls_are_valid_python` (ast.parse
on emitted profile_call + fit example for both first_treat branches) and
`test_adversarial_column_labels_produce_valid_python` (parametrized over
quotes, backslashes, injection attempts, whitespace). Tests would have
caught both P0 and P1 if present.

Total tests: 9 → 17 in test_agent_workflow.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eat path

Codex R2 narrowed P1: the no-`first_treat` branch unconditionally emitted
`DifferenceInDifferences().fit(...)` and labeled it "matched to your data
shape". That overclaim doesn't hold for continuous-dose or heterogeneous-
adoption designs without first_treat — DiD validates and rejects non-
binary treatment / time at fit-time (`diff_diff/estimators.py:307-312`).

Fix in agent_workflow.py:
- With `first_treat`: keep CallawaySantAnna example, relabel as
  "your data has first_treat -> staggered structure; CS is canonical".
- Without `first_treat`: keep DiD example as the simple 2x2 case but
  reframe the label to explicitly condition on that shape, name the
  alternative candidates (ContinuousDiD, HeterogeneousAdoptionDiD),
  and reference DiD's fit-time validation so the agent knows when to
  switch. No claim of universal match.

New regression test `test_no_first_treat_step3_does_not_overclaim_match`:
- asserts "matched to your data shape" is NOT in the no-first_treat
  script (negative)
- asserts "2x2" and "substitute" appear (positive — substitution hint)
- asserts ContinuousDiD and HeterogeneousAdoptionDiD remain enumerated
  in Step 2's routing patterns

Total tests: 17 → 18 in test_agent_workflow.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Overall Assessment

⚠️ Needs changes — no estimator math/SE/default-inference code changed in this diff, but the new agent_workflow() entrypoint introduces one unmitigated P1 methodology/routing issue.

Executive Summary

  • No changes in the diff modify estimator formulas, weighting, variance/SE logic, or documented defaults.
  • P1: agent_workflow() treats first_treat as if it implies the binary staggered CallawaySantAnna path, which conflicts with the library’s own documented ContinuousDiD and dose-based HAD contracts.
  • P2: the emitted "script" is not actually a copy-pasteable Python script as advertised; it contains prose/footer lines and the final report call does not print anything in script execution.
  • P3: the new __dir__() discoverability hack is not contract-tested in this PR.
  • Security surface looks fine; the repr()-based templating is the right shape for hostile column labels.

Runtime note: I could not execute the added tests in this sandbox because numpy is unavailable here, so this review is from static inspection.

Methodology

  • Severity: P1. Impact: first_treat is not equivalent to “binary staggered adoption, so start with Callaway-Sant’Anna.” The new Step 3 branch says exactly that and emits a CallawaySantAnna().fit(...) example whenever first_treat is present, but the library’s own contracts treat first_treat as a required input for ContinuousDiD as well, and dose-based HAD event-study also accepts first_treat_col. Because agent_workflow() is now advertised as the recommended starting call, an agent on a continuous-dose or graded-adoption panel can be steered into the wrong estimator with no warning. I found no REGISTRY.md note documenting this routing shortcut. Concrete fix: keep Step 3 conditional on treatment shape even when first_treat is supplied; if you keep a single example, label it explicitly as the binary staggered case and name ContinuousDiD / HAD alternatives for dose-based panels. Add regression tests for first_treat + continuous-dose and first_treat + HAD event-study scenarios. References: diff_diff/agent_workflow.py (around L148-L170), diff_diff/continuous_did.py, diff_diff/had.py, docs/methodology/REGISTRY.md, docs/methodology/REGISTRY.md, diff_diff/init.py.

Code Quality

  • Severity: P2. Impact: the returned "script" is described and tested as “copy-pasteable,” but the emitted block is not executable Python as printed: it contains prose/footer lines (Full reference: ..., Practitioner recipe: ...) and Step 5 evaluates BusinessReport(...).full_report() without printing or assigning the returned string. In a real .py execution the block will syntax-error, and even line-by-line execution discards the final report output. Concrete fix: either emit a real executable Python block (comment-prefix prose lines and print(diff_diff.BusinessReport(result).full_report())) or rename/document the output as workflow text rather than a script. Add a whole-output smoke test aligned to the intended contract. References: diff_diff/agent_workflow.py (around L189-L233), diff_diff/business_report.py, tests/test_agent_workflow.py.

Performance

No findings.

Maintainability

No findings beyond the test-gap note below.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

  • Severity: P3. Impact: the new public __dir__() discoverability surface relies on a custom _OrderedName(str) ordering trick, but this PR adds no contract test for head order, membership, or inspect.getmembers() compatibility. That is deferrable, but it leaves the most compatibility-sensitive new surface unpinned. Concrete fix: add a small discoverability contract test in this PR, or explicitly track the deferral in TODO.md before merge. References: diff_diff/init.py, tests/test_agent_workflow.py.

Path to Approval

  1. Remove the universal first_treat -> CallawaySantAnna implication in agent_workflow() and add regression tests covering first_treat on continuous-dose / dose-based panels. That resolves the blocking P1 and would move the assessment to .
  2. Recommended before merge: make the emitted "script" contract honest by either generating executable Python or relabeling it as prose workflow text.

… script executability + P3 deferral note)

CI codex review on PR #464 (initial push) flagged 3 items:

P1 (Methodology): The first_treat-branch Step 3 example implied
binary staggered -> CallawaySantAnna without naming alternatives.
But ContinuousDiD.fit() and HeterogeneousAdoptionDiD event-study
BOTH take a `first_treat` column (via `first_treat=` and
`first_treat_col=` respectively); the same overclaim pattern I
already caught and fixed for the no-first_treat path in R3 was
present mirror-image on the with-first_treat path. The Step 3
comment block now enumerates the three estimator options for
first_treat panels (CS21 / ContinuousDiD / HAD event-study) and
names the HAD signature distinction (first_treat_col vs first_treat)
so an agent isn't silently steered to CS21 on a continuous-dose
panel. New test `test_first_treat_step3_names_non_binary_alternatives`
locks this.

P2 (Code Quality): The emitted "script" was tested for parseability
only at the line-of-call granularity, not at the whole-file level.
Section headers like "Step 1 - Describe the panel:" and footer
lines "Full reference: ..." would SyntaxError if the agent dumped
the script to a .py file and ran it. Step 5 also evaluated
`BusinessReport(...).full_report()` without printing — the report
str would be discarded on execution. Fix: every prose line in the
template now starts with `#` (valid Python comment); code lines
stand at column 0 and run as-is; Step 5 wraps the report call in
`print()`. The full script `ast.parse()`s as a Python module under
both first_treat branches. New tests
`test_emitted_script_parses_as_python_module` and
`test_emitted_script_prints_report` lock these contracts.

P3 (Tests): The new `__dir__()` discoverability surface
(`_OrderedName` ordering trick) has no contract test in this PR.
Added a TODO.md entry under "Tech Debt from Code Reviews" /
"Testing/Docs" deferring it to PR B (planned, addresses #461 with
the full snapshot/contract surface). The deferral is the documented
pattern the reviewer's "Concrete fix" suggested.

Tests: 18 → 21 in test_agent_workflow.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 27bb822f4d9a0adfb5185471f65c6b0edb7433bd


Overall Assessment

✅ Looks good — no unmitigated P0 or P1 findings in the changed surface.

Executive Summary

  • Re-review: the prior P1 is resolved. agent_workflow() no longer treats first_treat as uniquely implying CallawaySantAnna; Step 3 now explicitly says first_treat does not identify a single estimator and names ContinuousDiD / HeterogeneousAdoptionDiD alternatives, which matches the estimator contracts and registry notes at diff_diff/agent_workflow.py:L148-L176, diff_diff/continuous_did.py:L159-L169, diff_diff/had.py:L2782-L2841, docs/methodology/REGISTRY.md:L2423-L2424, docs/methodology/REGISTRY.md:L2572-L2574.
  • The prior script-contract issue is also resolved: the emitted workflow is now valid Python text and Step 5 prints the report, with regression tests at diff_diff/agent_workflow.py:L213-L242 and tests/test_agent_workflow.py:L203-L242.
  • No estimator formulas, weighting, variance/SE logic, identification checks, or default inference behavior change in this diff.
  • The new surface is hardened against hostile column labels via repr() templating and parser-safety tests at diff_diff/agent_workflow.py:L43-L61 and tests/test_agent_workflow.py:L291-L326.
  • The prior __dir__() test-gap is now explicitly tracked in TODO.md:L164-L164, so it is mitigated P3 rather than a blocker.

Methodology

  • No findings. The only methodology-relevant prior issue was the first_treat routing shortcut, and that is addressed by the revised Step 3 wording plus targeted regression tests at tests/test_agent_workflow.py:L145-L200.

Code Quality

  • No findings. The new orchestrator is simple string templating, does not inspect df, and the copy-pasteable-script contract is materially improved and tested at diff_diff/agent_workflow.py:L140-L145, diff_diff/agent_workflow.py:L213-L242, and tests/test_agent_workflow.py:L203-L242.

Performance

  • No findings. The added code is negligible string assembly plus a dir() ordering hook.

Maintainability

  • No unmitigated findings. The __dir__() ordering trick is unusual, but it is documented inline at diff_diff/__init__.py:L525-L589.

Tech Debt

  • Severity: P3-informational (mitigated). Impact: __dir__() head-order / inspect.getmembers contract coverage is still deferred, so regressions on that discoverability surface would not be caught by this PR’s tests alone. Concrete fix: land the planned tests/test_agent_discoverability.py follow-up already tracked in TODO.md:L164-L164. References: diff_diff/__init__.py:L525-L589, TODO.md:L164-L164.

Security

  • No findings. repr()-based kwarg rendering prevents code injection through column labels, and the adversarial-label tests cover the relevant parser-safety cases at diff_diff/agent_workflow.py:L43-L61 and tests/test_agent_workflow.py:L291-L326.

Documentation/Tests

  • Severity: P3-informational (mitigated). Impact: this PR changes diff_diff/guides/llms.txt and diff_diff/guides/llms-autonomous.txt, but .txt AI guides still sit outside dedicated snippet smoke validation in CI. That limitation predates this PR and is already tracked. Concrete fix: implement the existing .txt guide-validation follow-up tracked in TODO.md:L160-L160. References: diff_diff/guides/llms.txt:L13-L15, diff_diff/guides/llms-autonomous.txt:L13-L17, TODO.md:L160-L160.

Runtime note: I could not execute the package tests in this sandbox because importing diff_diff fails without numpy; this review is based on static inspection.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Surface get_llm_guide / profile_panel / practitioner_next_steps more prominently for agent discovery

1 participant