Surface agent_workflow() + curated dir() for LLM discoverability (#460)#464
Surface agent_workflow() + curated dir() for LLM discoverability (#460)#464igerber wants to merge 4 commits into
Conversation
…lity Closes #460 items 1, 2, 3: - `__doc__` rewritten to lead with `agent_workflow(df, ...)` as the recommended starting call. Existing `get_llm_guide` mention preserved for the `test_module_docstring_mentions_helper` guard. - New `diff_diff.agent_workflow(df, *, unit, time, treatment, outcome, first_treat=None, verbose=True)` — stateless orchestrator that prints a copy-pasteable 5-step workflow (profile_panel → get_llm_guide → <Estimator>.fit → practitioner_next_steps → BusinessReport) with the caller's column names templated in. Calls nothing internally; does not inspect df. - New module-level `__dir__()` override paired with a `_OrderedName(str)` subclass that subverts CPython's unconditional alphabetic sort (PyList_Sort respects __lt__ on elements). Agent-facing names surface at the head of `dir(diff_diff)`; tail stays alphabetic via the `str.__lt__` fallback. `__all__` membership unchanged; `from diff_diff import *` semantics unaffected. Tests: tests/test_agent_workflow.py (9 tests covering canonical keys, script content, column templating, first_treat conditional, fit candidate importability, verbose toggle, no-df-inspection contract). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…+ tests)
Addresses local /ai-review-local findings (4 of 4):
P0 (Security): agent_workflow.py emitted column names via `f'{k}="{v}"'`,
which broke on labels containing `"` and could inject Python statements
into the "copy-pasteable" script. Replaced with `_safe_kwarg` /
`_join_kwargs` that use Python's `repr()` — produces source-safe string
literals for any str input (quotes, backslashes, newlines, etc.). Added
9 regression tests (test_emitted_calls_are_valid_python plus a
parametrize over 6 adversarial labels) that ast.parse() the emitted
script lines.
P1 (Contract break): Step 3 example previously emitted
`CallawaySantAnna(...).fit(df, ..., treatment=...)` but CS21's
`.fit()` signature is `(data, outcome, unit, time, first_treat, ...)`
— `treatment=` is rejected. The snippet would TypeError if anyone
copy-pasted. Step 3 now branches on `first_treat` presence:
- first_treat=None -> DifferenceInDifferences with `treatment=`
- first_treat=<col> -> CallawaySantAnna with `first_treat=` (no treatment)
Each branch's call signature matches the actual public API. New test
`test_first_treat_switches_step3_estimator` locks the contract.
Also dropped PreTrendsPower / HonestDiD from `_WORKFLOW_PATTERNS` (their
`.fit()` takes `results`, not `df`+columns — pattern label "substitute
the candidate" was misleading for them); they now appear in a separate
"diagnostics" block in the templated script under Step 4.
P2 (Introspection regression): __dir__() returned only `__all__`
entries, which dropped module dunders (`__doc__`, `__name__`, `__file__`)
from `inspect.getmembers(diff_diff)` — contradicted the CHANGELOG
"compatible with inspect.getmembers" claim. __dir__() now returns
`[_OrderedName(n) for n in globals()]`, so the full module namespace
is enumerated while head-first ordering is preserved.
P2 (Test coverage): Added `test_emitted_calls_are_valid_python` (ast.parse
on emitted profile_call + fit example for both first_treat branches) and
`test_adversarial_column_labels_produce_valid_python` (parametrized over
quotes, backslashes, injection attempts, whitespace). Tests would have
caught both P0 and P1 if present.
Total tests: 9 → 17 in test_agent_workflow.py.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eat path Codex R2 narrowed P1: the no-`first_treat` branch unconditionally emitted `DifferenceInDifferences().fit(...)` and labeled it "matched to your data shape". That overclaim doesn't hold for continuous-dose or heterogeneous- adoption designs without first_treat — DiD validates and rejects non- binary treatment / time at fit-time (`diff_diff/estimators.py:307-312`). Fix in agent_workflow.py: - With `first_treat`: keep CallawaySantAnna example, relabel as "your data has first_treat -> staggered structure; CS is canonical". - Without `first_treat`: keep DiD example as the simple 2x2 case but reframe the label to explicitly condition on that shape, name the alternative candidates (ContinuousDiD, HeterogeneousAdoptionDiD), and reference DiD's fit-time validation so the agent knows when to switch. No claim of universal match. New regression test `test_no_first_treat_step3_does_not_overclaim_match`: - asserts "matched to your data shape" is NOT in the no-first_treat script (negative) - asserts "2x2" and "substitute" appear (positive — substitution hint) - asserts ContinuousDiD and HeterogeneousAdoptionDiD remain enumerated in Step 2's routing patterns Total tests: 17 → 18 in test_agent_workflow.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Overall Assessment
Executive Summary
Runtime note: I could not execute the added tests in this sandbox because Methodology
Code Quality
Performance No findings. Maintainability No findings beyond the test-gap note below. Tech Debt No findings. Security No findings. Documentation/Tests
Path to Approval
|
… script executability + P3 deferral note) CI codex review on PR #464 (initial push) flagged 3 items: P1 (Methodology): The first_treat-branch Step 3 example implied binary staggered -> CallawaySantAnna without naming alternatives. But ContinuousDiD.fit() and HeterogeneousAdoptionDiD event-study BOTH take a `first_treat` column (via `first_treat=` and `first_treat_col=` respectively); the same overclaim pattern I already caught and fixed for the no-first_treat path in R3 was present mirror-image on the with-first_treat path. The Step 3 comment block now enumerates the three estimator options for first_treat panels (CS21 / ContinuousDiD / HAD event-study) and names the HAD signature distinction (first_treat_col vs first_treat) so an agent isn't silently steered to CS21 on a continuous-dose panel. New test `test_first_treat_step3_names_non_binary_alternatives` locks this. P2 (Code Quality): The emitted "script" was tested for parseability only at the line-of-call granularity, not at the whole-file level. Section headers like "Step 1 - Describe the panel:" and footer lines "Full reference: ..." would SyntaxError if the agent dumped the script to a .py file and ran it. Step 5 also evaluated `BusinessReport(...).full_report()` without printing — the report str would be discarded on execution. Fix: every prose line in the template now starts with `#` (valid Python comment); code lines stand at column 0 and run as-is; Step 5 wraps the report call in `print()`. The full script `ast.parse()`s as a Python module under both first_treat branches. New tests `test_emitted_script_parses_as_python_module` and `test_emitted_script_prints_report` lock these contracts. P3 (Tests): The new `__dir__()` discoverability surface (`_OrderedName` ordering trick) has no contract test in this PR. Added a TODO.md entry under "Tech Debt from Code Reviews" / "Testing/Docs" deferring it to PR B (planned, addresses #461 with the full snapshot/contract surface). The deferral is the documented pattern the reviewer's "Concrete fix" suggested. Tests: 18 → 21 in test_agent_workflow.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Runtime note: I could not execute the package tests in this sandbox because importing |
Summary
diff_diff.agent_workflow(df, *, unit, time, treatment, outcome, first_treat=None)— stateless orchestrator that prints a copy-pasteable 5-step workflow (profile_panel→get_llm_guide("autonomous")→<Estimator>().fit(df, ...)→practitioner_next_steps(result)→BusinessReport(result).full_report()) with the caller's column names templated in viarepr()(source-safe under hostile labels). Calls nothing internally; does not inspectdf. Step 3 branches onfirst_treatso the emitted call always matches an actualfit()signature (CallawaySantAnna withfirst_treat=when supplied; DifferenceInDifferences withtreatment=for the simple 2x2 case).__doc__so the first non-blank paragraph namesagent_workflow(df, ...)as the recommended starting call.get_llm_guide("full")andget_llm_guide("practitioner")pointers preserved (the existingtests/test_guides.py::test_module_docstring_mentions_helperguard continues to hold).__dir__()override that surfaces agent-facing names (agent_workflow,profile_panel,get_llm_guide,practitioner_next_steps,BusinessReport,DiagnosticReport) at the head ofdir(diff_diff). Implementation uses a small_OrderedName(str)subclass with custom__lt__— CPython'sdir()always sorts the result of__dir__(), but PyList_Sort respects custom comparison operators, so the priority head wins.__all__membership andfrom diff_diff import *semantics are unchanged.inspect.getmembers(diff_diff)still returns 274 members including__doc__/__name__/__file__.diff_diff/guides/llms.txtanddiff_diff/guides/llms-autonomous.txtwith one-lineagent_workflow(...)pointers.Closes #460.
Companion PR for #461 (snapshot/contract test) will follow after this merges.
Methodology references (required if estimator / math changes)
Validation
tests/test_agent_workflow.py(new, 18 tests) covering canonical output keys, workflow primitive references, column-name templating,first_treatconditional,first_treat-driven Step 3 branching, no-overclaim contract on the no-first_treatpath, AST-parseability of every emitted call line, adversarial column labels (6 parametrized cases: embedded quotes, backslashes, injection attempts, whitespace),fit_candidatesimportability, verbose toggle, no-df-inspection contract.tests/test_guides.py(41 tests, includingtest_module_docstring_mentions_helper).Security / privacy
Generated with Claude Code