feat!: harden for first public PyPI release (0.1.0)#3
Merged
Conversation
Address the multi-agent PyPI-readiness review end-to-end so `pip install agent2model` works and the package is publish-ready. Blockers / correctness: - cli: make langgraph imports lazy and add an `agent2model[langgraph]` extra, so the whole CLI (compile YAML, --help) works on a clean core install instead of ModuleNotFoundError. - packaging: add the Apache-2.0 LICENSE file; single-source the version via hatch `[tool.hatch.version]` + `dynamic` (now 0.1.0); add `py.typed`; enrich classifiers + project URLs; ship examples in the wheel (force-include) and the deepspeed/runpod JSON; commitizen tracks __init__.py. - eval: fix the paired-test scenario misalignment — workers appended results in completion order, so Wilcoxon zipped mismatched scenarios across conditions and every p-value was wrong. Return from the worker so gather preserves scenario order; add a reversed-latency regression test. - training(8b): actually launch DeepSpeed ZeRO-3 — `launch_training` routes 8B through `accelerate launch -m agent2model.training._entry`; guard `train` to refuse single-process 8B; main-process-only checkpoint copy. (3B unchanged.) High / UX / security: - cli: add `--version`; preflight ANTHROPIC_API_KEY in generate/eval (fail fast, no raw traceback); refuse `generate` while TODO prompts remain; default `serve` host to 127.0.0.1; validate recipe `--name` as a slug (path-traversal guard). - langgraph: warn on zero user-role nodes + TODO prompts at compile time; uuid-suffixed module name; trust-boundary warning before executing a .py; document the supported subset in docs/adapters.md. - docs/README: lead Install with `pip install agent2model` + extras; absolutize repo-relative links (PyPI-renderable); attribute the 128–462x / 87–98% figures to the paper as not-yet-reproduced; annotate cost/GPU per pipeline step. - traversal: bias by BFS distance-to-terminal so multi-hop escapes converge instead of raising a spurious "trap cycles" error. - release.yml: twine check + wheel install smoke; fix stale PyPI URL. - rename Claude.md -> CLAUDE.md (case-correct for Linux/GitHub). ruff/black/mypy-strict clean; 318 unit tests pass; wheel builds + twine check OK.
Re-review of the hardening branch surfaced one real new blocker and several high/medium items; fixed here. - langgraph(blocker): keep the imported module in sys.modules across the whole discovery (factory call + var scan), popping only in finally. The previous pop-before-factory broke the dominant pattern — a build_graph() whose state is Annotated[list, add_messages] under `from __future__ import annotations` — with a spurious "NameError: Annotated is not defined". Regression test added. - eval(high): fix holm_bonferroni docstring example output ([True,False,False]; the code was already correct — the example was written as Benjamini-Hochberg). - serve(security): default the library serve() host to 127.0.0.1 (was 0.0.0.0, an unauthenticated all-interfaces bind for Python-API callers); the Modal container already passes 0.0.0.0 explicitly. - cloud(high): install the published wheel in Modal images by default (`pip install agent2model[extra]==<version>`); AGENT2MODEL_MODAL_LOCAL_SRC=1 keeps the local-source path for pre-publish/dev. Fixes the wheel-install demo. - cli(ux): add a typer.confirm cost gate + --yes to generate/eval; default --baselines to in_context (the langgraph baseline needs the [langgraph] extra). - packaging: add an aggregate [eval] extra (matplotlib+openai+langgraph). - langgraph: warn when escalation/abandonment-looking sinks collapse to terminal: success. README: correct test-count claim + serve status footnote; gitignore the local review artifact. ruff/black/mypy-strict clean; 319 unit tests pass; wheel builds + twine check OK.
- build: pin `hatchling>=1.27` — required for PEP 639 license metadata (Metadata 2.4); older backends emit metadata PyPI rejects, breaking the build non-deterministically. - runpod: wire `--size` (AGENT2MODEL_SIZE, default 3b) through setup.sh and set `8b` in train_8b.json — the 8B pod was training on the single-GPU 3B preset and OOMing; drop the dead AGENT2MODEL_NUM_GPUS env. - docs honesty: README ships-table marks vLLM serving ⏳* (container-verified, HTTP pending) to match the status section; travel example README reframes the "within 5% of Table 1" parity claim as a goal not-yet-reproduced (empty benchmarks), per CLAUDE.md. - packaging: real maintainer name + email (trust contact); LangGraph-focused PyPI summary (CrewAI adapter is still a stub — kept as roadmap, not advertised as working); plain hyphen instead of em-dash in the summary. ruff/black/mypy-strict clean; 319 unit tests pass; wheel builds + twine check OK.
… cases) Round 4 confirmed no finding survives at blocker severity. Fixing the verified high/correctness items so the first release is top-tier: - langgraph: reject a node that has BOTH a plain edge and conditional edges (would silently become a decision node with an unconditional branch). Extend the escalation/abandonment scan to predecessor node names (the shipped demo's `escalate -> END` collapses onto a success terminal) and warn loudly. - eval: warn when run without --served-url that the compiled model is NOT scored (baselines only) — the README/FAQ sell eval as the no-GPU win. - report: relabel the cost chart "API cost per conversation" + note it excludes self-hosted GPU inference cost, so it is not mistaken for the paper's cost-reduction figure. - packaging: exclude __pycache__/*.pyc from wheel+sdist and remove a stray .pyc that was shipping via the examples force-include. - runpod: self-bootstrap setup.sh via curl in every spec's dockerArgs (the recipe previously assumed setup.sh was already on the pod and hard-failed); document the no-egress fallback and mark the 8B spec an unvalidated template. - README: note the [langgraph] extra above the compile snippet + the trust-boundary caution for executing a .py graph. ruff/black/mypy-strict clean; 319 unit tests pass; wheel builds (no .pyc) + twine OK.
Round 5 again confirmed no install/runtime blockers; fixing the verified high correctness + credibility items. - langgraph: reject graphs with multiple `add_edge(START, …)` (previously kept only the last, silently producing a wrong, nondeterministic IR for a legal graph). Makes "the conversion is deterministic and validated" true. - langgraph: warn on parallel fan-out (a node with >1 unconditional edge) — it silently becomes a single random branch in sequential traversal; every other lossy case already warned, this one didn't. - honesty(rubric): stop claiming the behavioral anchors are "verbatim from the paper §3" — they are agent2model's own 1/3/5 approximations of the paper's scale. Corrected in rubric.py, docs/evaluation.md, and the README. - honesty(benchmarks): relabel the "Paper" target numbers as reported-by-the- paper and NOT independently verified by this project (the source numbers can't be confirmed beyond the preprint); "Current" columns remain empty until our own reproductions land. ruff/black/mypy-strict clean; 319 unit tests pass.
…ipped) The CrewAI adapter raises NotImplementedError in v1, so the README/docs/CLI/ mkdocs headlines no longer advertise a CrewAI on-ramp. CrewAI stays as a roadmap mention and a discovery keyword, not a headline capability.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Addresses the multi-agent PyPI-readiness review end-to-end. After this,
pip install agent2modelworks on a clean machine and the package is publish-ready. ruff/black/mypy-strict clean; 318 unit tests pass; wheel builds andtwine checkpasses.Blockers fixed
langgraphimports are now lazy + anagent2model[langgraph]extra; the CLI (compile YAML,--help) works with only the core install.LICENSE+license-files.0.1.0.dev0in 3 places → single-sourced via hatchdynamic/[tool.hatch.version], now0.1.0; commitizen tracks__init__.py.launch_trainingruns 8B underaccelerate launch -m agent2model.training._entrywith DeepSpeed;trainrefuses single-process 8B; main-process-only checkpoint copy. 3B still trains in-process.add_local_dir→ examples now ship in the wheel (force-include) so packaged installs resolve them; the Modal image stopgap is documented to flip topip_installpost-publish.High / UX / security
--version; ANTHROPIC_API_KEY preflight ingenerate/eval;generaterefuses placeholderTODO:prompts;servedefaults to127.0.0.1;--nameslug validation (path-traversal guard)..py, newdocs/adapters.md.pip install agent2model; absolutize links; attribute 128–462× / 87–98% to the paper as not-yet-reproduced; per-step cost/GPU annotations.release.ymladdstwine check+ wheel smoke and fixes the stale URL;Claude.md→CLAUDE.md.Not in this PR (follow-ups)
pip_install("agent2model[...]==0.1.0")— only possible after the first PyPI publish.