feat!: harden for first public PyPI release (0.1.0) by kamaalg · Pull Request #3 · kamaalg/agent2model

kamaalg · 2026-05-30T09:19:16Z

Addresses the multi-agent PyPI-readiness review end-to-end. After this, pip install agent2model works on a clean machine and the package is publish-ready. ruff/black/mypy-strict clean; 318 unit tests pass; wheel builds and twine check passes.

Blockers fixed

CLI dies on clean install → langgraph imports are now lazy + an agent2model[langgraph] extra; the CLI (compile YAML, --help) works with only the core install.
No LICENSE → added Apache-2.0 LICENSE + license-files.
Version 0.1.0.dev0 in 3 places → single-sourced via hatch dynamic/[tool.hatch.version], now 0.1.0; commitizen tracks __init__.py.
8B ZeRO-3 never launched → launch_training runs 8B under accelerate launch -m agent2model.training._entry with DeepSpeed; train refuses single-process 8B; main-process-only checkpoint copy. 3B still trains in-process.
Modal add_local_dir → examples now ship in the wheel (force-include) so packaged installs resolve them; the Modal image stopgap is documented to flip to pip_install post-publish.
Eval p-values silently wrong → paired-test scenario misalignment fixed (workers no longer append in completion order); reversed-latency regression test added.

High / UX / security

--version; ANTHROPIC_API_KEY preflight in generate/eval; generate refuses placeholder TODO: prompts; serve defaults to 127.0.0.1; --name slug validation (path-traversal guard).
LangGraph: compile-time warnings (no user nodes / TODO prompts), uuid module name, trust-boundary warning before executing a .py, new docs/adapters.md.
README/docs: lead Install with pip install agent2model; absolutize links; attribute 128–462× / 87–98% to the paper as not-yet-reproduced; per-step cost/GPU annotations.
Traversal biases by BFS distance-to-terminal; release.yml adds twine check + wheel smoke and fixes the stale URL; Claude.md→CLAUDE.md.

Not in this PR (follow-ups)

Real 8×A100 verification of the 8B path (needs GPUs / Modal billing).
Flipping the Modal image from local-source to pip_install("agent2model[...]==0.1.0") — only possible after the first PyPI publish.

Address the multi-agent PyPI-readiness review end-to-end so `pip install agent2model` works and the package is publish-ready. Blockers / correctness: - cli: make langgraph imports lazy and add an `agent2model[langgraph]` extra, so the whole CLI (compile YAML, --help) works on a clean core install instead of ModuleNotFoundError. - packaging: add the Apache-2.0 LICENSE file; single-source the version via hatch `[tool.hatch.version]` + `dynamic` (now 0.1.0); add `py.typed`; enrich classifiers + project URLs; ship examples in the wheel (force-include) and the deepspeed/runpod JSON; commitizen tracks __init__.py. - eval: fix the paired-test scenario misalignment — workers appended results in completion order, so Wilcoxon zipped mismatched scenarios across conditions and every p-value was wrong. Return from the worker so gather preserves scenario order; add a reversed-latency regression test. - training(8b): actually launch DeepSpeed ZeRO-3 — `launch_training` routes 8B through `accelerate launch -m agent2model.training._entry`; guard `train` to refuse single-process 8B; main-process-only checkpoint copy. (3B unchanged.) High / UX / security: - cli: add `--version`; preflight ANTHROPIC_API_KEY in generate/eval (fail fast, no raw traceback); refuse `generate` while TODO prompts remain; default `serve` host to 127.0.0.1; validate recipe `--name` as a slug (path-traversal guard). - langgraph: warn on zero user-role nodes + TODO prompts at compile time; uuid-suffixed module name; trust-boundary warning before executing a .py; document the supported subset in docs/adapters.md. - docs/README: lead Install with `pip install agent2model` + extras; absolutize repo-relative links (PyPI-renderable); attribute the 128–462x / 87–98% figures to the paper as not-yet-reproduced; annotate cost/GPU per pipeline step. - traversal: bias by BFS distance-to-terminal so multi-hop escapes converge instead of raising a spurious "trap cycles" error. - release.yml: twine check + wheel install smoke; fix stale PyPI URL. - rename Claude.md -> CLAUDE.md (case-correct for Linux/GitHub). ruff/black/mypy-strict clean; 318 unit tests pass; wheel builds + twine check OK.

Re-review of the hardening branch surfaced one real new blocker and several high/medium items; fixed here. - langgraph(blocker): keep the imported module in sys.modules across the whole discovery (factory call + var scan), popping only in finally. The previous pop-before-factory broke the dominant pattern — a build_graph() whose state is Annotated[list, add_messages] under `from __future__ import annotations` — with a spurious "NameError: Annotated is not defined". Regression test added. - eval(high): fix holm_bonferroni docstring example output ([True,False,False]; the code was already correct — the example was written as Benjamini-Hochberg). - serve(security): default the library serve() host to 127.0.0.1 (was 0.0.0.0, an unauthenticated all-interfaces bind for Python-API callers); the Modal container already passes 0.0.0.0 explicitly. - cloud(high): install the published wheel in Modal images by default (`pip install agent2model[extra]==<version>`); AGENT2MODEL_MODAL_LOCAL_SRC=1 keeps the local-source path for pre-publish/dev. Fixes the wheel-install demo. - cli(ux): add a typer.confirm cost gate + --yes to generate/eval; default --baselines to in_context (the langgraph baseline needs the [langgraph] extra). - packaging: add an aggregate [eval] extra (matplotlib+openai+langgraph). - langgraph: warn when escalation/abandonment-looking sinks collapse to terminal: success. README: correct test-count claim + serve status footnote; gitignore the local review artifact. ruff/black/mypy-strict clean; 319 unit tests pass; wheel builds + twine check OK.

- build: pin `hatchling>=1.27` — required for PEP 639 license metadata (Metadata 2.4); older backends emit metadata PyPI rejects, breaking the build non-deterministically. - runpod: wire `--size` (AGENT2MODEL_SIZE, default 3b) through setup.sh and set `8b` in train_8b.json — the 8B pod was training on the single-GPU 3B preset and OOMing; drop the dead AGENT2MODEL_NUM_GPUS env. - docs honesty: README ships-table marks vLLM serving ⏳* (container-verified, HTTP pending) to match the status section; travel example README reframes the "within 5% of Table 1" parity claim as a goal not-yet-reproduced (empty benchmarks), per CLAUDE.md. - packaging: real maintainer name + email (trust contact); LangGraph-focused PyPI summary (CrewAI adapter is still a stub — kept as roadmap, not advertised as working); plain hyphen instead of em-dash in the summary. ruff/black/mypy-strict clean; 319 unit tests pass; wheel builds + twine check OK.

… cases) Round 4 confirmed no finding survives at blocker severity. Fixing the verified high/correctness items so the first release is top-tier: - langgraph: reject a node that has BOTH a plain edge and conditional edges (would silently become a decision node with an unconditional branch). Extend the escalation/abandonment scan to predecessor node names (the shipped demo's `escalate -> END` collapses onto a success terminal) and warn loudly. - eval: warn when run without --served-url that the compiled model is NOT scored (baselines only) — the README/FAQ sell eval as the no-GPU win. - report: relabel the cost chart "API cost per conversation" + note it excludes self-hosted GPU inference cost, so it is not mistaken for the paper's cost-reduction figure. - packaging: exclude __pycache__/*.pyc from wheel+sdist and remove a stray .pyc that was shipping via the examples force-include. - runpod: self-bootstrap setup.sh via curl in every spec's dockerArgs (the recipe previously assumed setup.sh was already on the pod and hard-failed); document the no-egress fallback and mark the 8B spec an unvalidated template. - README: note the [langgraph] extra above the compile snippet + the trust-boundary caution for executing a .py graph. ruff/black/mypy-strict clean; 319 unit tests pass; wheel builds (no .pyc) + twine OK.

Round 5 again confirmed no install/runtime blockers; fixing the verified high correctness + credibility items. - langgraph: reject graphs with multiple `add_edge(START, …)` (previously kept only the last, silently producing a wrong, nondeterministic IR for a legal graph). Makes "the conversion is deterministic and validated" true. - langgraph: warn on parallel fan-out (a node with >1 unconditional edge) — it silently becomes a single random branch in sequential traversal; every other lossy case already warned, this one didn't. - honesty(rubric): stop claiming the behavioral anchors are "verbatim from the paper §3" — they are agent2model's own 1/3/5 approximations of the paper's scale. Corrected in rubric.py, docs/evaluation.md, and the README. - honesty(benchmarks): relabel the "Paper" target numbers as reported-by-the- paper and NOT independently verified by this project (the source numbers can't be confirmed beyond the preprint); "Current" columns remain empty until our own reproductions land. ruff/black/mypy-strict clean; 319 unit tests pass.

…ipped) The CrewAI adapter raises NotImplementedError in v1, so the README/docs/CLI/ mkdocs headlines no longer advertise a CrewAI on-ramp. CrewAI stays as a roadmap mention and a discovery keyword, not a headline capability.

kamaalg added 6 commits May 30, 2026 13:18

kamaalg merged commit 9641160 into main May 30, 2026
2 checks passed

kamaalg deleted the release-hardening branch May 30, 2026 13:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat!: harden for first public PyPI release (0.1.0)#3

feat!: harden for first public PyPI release (0.1.0)#3
kamaalg merged 6 commits into
mainfrom
release-hardening

kamaalg commented May 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kamaalg commented May 30, 2026

Blockers fixed

High / UX / security

Not in this PR (follow-ups)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant