Skip to content

feat!: harden for first public PyPI release (0.1.0)#3

Merged
kamaalg merged 6 commits into
mainfrom
release-hardening
May 30, 2026
Merged

feat!: harden for first public PyPI release (0.1.0)#3
kamaalg merged 6 commits into
mainfrom
release-hardening

Conversation

@kamaalg
Copy link
Copy Markdown
Owner

@kamaalg kamaalg commented May 30, 2026

Addresses the multi-agent PyPI-readiness review end-to-end. After this, pip install agent2model works on a clean machine and the package is publish-ready. ruff/black/mypy-strict clean; 318 unit tests pass; wheel builds and twine check passes.

Blockers fixed

  • CLI dies on clean installlanggraph imports are now lazy + an agent2model[langgraph] extra; the CLI (compile YAML, --help) works with only the core install.
  • No LICENSE → added Apache-2.0 LICENSE + license-files.
  • Version 0.1.0.dev0 in 3 places → single-sourced via hatch dynamic/[tool.hatch.version], now 0.1.0; commitizen tracks __init__.py.
  • 8B ZeRO-3 never launchedlaunch_training runs 8B under accelerate launch -m agent2model.training._entry with DeepSpeed; train refuses single-process 8B; main-process-only checkpoint copy. 3B still trains in-process.
  • Modal add_local_dir → examples now ship in the wheel (force-include) so packaged installs resolve them; the Modal image stopgap is documented to flip to pip_install post-publish.
  • Eval p-values silently wrong → paired-test scenario misalignment fixed (workers no longer append in completion order); reversed-latency regression test added.

High / UX / security

  • --version; ANTHROPIC_API_KEY preflight in generate/eval; generate refuses placeholder TODO: prompts; serve defaults to 127.0.0.1; --name slug validation (path-traversal guard).
  • LangGraph: compile-time warnings (no user nodes / TODO prompts), uuid module name, trust-boundary warning before executing a .py, new docs/adapters.md.
  • README/docs: lead Install with pip install agent2model; absolutize links; attribute 128–462× / 87–98% to the paper as not-yet-reproduced; per-step cost/GPU annotations.
  • Traversal biases by BFS distance-to-terminal; release.yml adds twine check + wheel smoke and fixes the stale URL; Claude.mdCLAUDE.md.

Not in this PR (follow-ups)

  • Real 8×A100 verification of the 8B path (needs GPUs / Modal billing).
  • Flipping the Modal image from local-source to pip_install("agent2model[...]==0.1.0") — only possible after the first PyPI publish.

kamaalg added 6 commits May 30, 2026 13:18
Address the multi-agent PyPI-readiness review end-to-end so `pip install
agent2model` works and the package is publish-ready.

Blockers / correctness:
- cli: make langgraph imports lazy and add an `agent2model[langgraph]` extra,
  so the whole CLI (compile YAML, --help) works on a clean core install instead
  of ModuleNotFoundError.
- packaging: add the Apache-2.0 LICENSE file; single-source the version via
  hatch `[tool.hatch.version]` + `dynamic` (now 0.1.0); add `py.typed`; enrich
  classifiers + project URLs; ship examples in the wheel (force-include) and the
  deepspeed/runpod JSON; commitizen tracks __init__.py.
- eval: fix the paired-test scenario misalignment — workers appended results in
  completion order, so Wilcoxon zipped mismatched scenarios across conditions and
  every p-value was wrong. Return from the worker so gather preserves scenario
  order; add a reversed-latency regression test.
- training(8b): actually launch DeepSpeed ZeRO-3 — `launch_training` routes 8B
  through `accelerate launch -m agent2model.training._entry`; guard `train` to
  refuse single-process 8B; main-process-only checkpoint copy. (3B unchanged.)

High / UX / security:
- cli: add `--version`; preflight ANTHROPIC_API_KEY in generate/eval (fail fast,
  no raw traceback); refuse `generate` while TODO prompts remain; default `serve`
  host to 127.0.0.1; validate recipe `--name` as a slug (path-traversal guard).
- langgraph: warn on zero user-role nodes + TODO prompts at compile time;
  uuid-suffixed module name; trust-boundary warning before executing a .py;
  document the supported subset in docs/adapters.md.
- docs/README: lead Install with `pip install agent2model` + extras; absolutize
  repo-relative links (PyPI-renderable); attribute the 128–462x / 87–98% figures
  to the paper as not-yet-reproduced; annotate cost/GPU per pipeline step.
- traversal: bias by BFS distance-to-terminal so multi-hop escapes converge
  instead of raising a spurious "trap cycles" error.
- release.yml: twine check + wheel install smoke; fix stale PyPI URL.
- rename Claude.md -> CLAUDE.md (case-correct for Linux/GitHub).

ruff/black/mypy-strict clean; 318 unit tests pass; wheel builds + twine check OK.
Re-review of the hardening branch surfaced one real new blocker and several
high/medium items; fixed here.

- langgraph(blocker): keep the imported module in sys.modules across the whole
  discovery (factory call + var scan), popping only in finally. The previous
  pop-before-factory broke the dominant pattern — a build_graph() whose state is
  Annotated[list, add_messages] under `from __future__ import annotations` —
  with a spurious "NameError: Annotated is not defined". Regression test added.
- eval(high): fix holm_bonferroni docstring example output ([True,False,False];
  the code was already correct — the example was written as Benjamini-Hochberg).
- serve(security): default the library serve() host to 127.0.0.1 (was 0.0.0.0,
  an unauthenticated all-interfaces bind for Python-API callers); the Modal
  container already passes 0.0.0.0 explicitly.
- cloud(high): install the published wheel in Modal images by default
  (`pip install agent2model[extra]==<version>`); AGENT2MODEL_MODAL_LOCAL_SRC=1
  keeps the local-source path for pre-publish/dev. Fixes the wheel-install demo.
- cli(ux): add a typer.confirm cost gate + --yes to generate/eval; default
  --baselines to in_context (the langgraph baseline needs the [langgraph] extra).
- packaging: add an aggregate [eval] extra (matplotlib+openai+langgraph).
- langgraph: warn when escalation/abandonment-looking sinks collapse to
  terminal: success. README: correct test-count claim + serve status footnote;
  gitignore the local review artifact.

ruff/black/mypy-strict clean; 319 unit tests pass; wheel builds + twine check OK.
- build: pin `hatchling>=1.27` — required for PEP 639 license metadata
  (Metadata 2.4); older backends emit metadata PyPI rejects, breaking the build
  non-deterministically.
- runpod: wire `--size` (AGENT2MODEL_SIZE, default 3b) through setup.sh and set
  `8b` in train_8b.json — the 8B pod was training on the single-GPU 3B preset and
  OOMing; drop the dead AGENT2MODEL_NUM_GPUS env.
- docs honesty: README ships-table marks vLLM serving ⏳* (container-verified,
  HTTP pending) to match the status section; travel example README reframes the
  "within 5% of Table 1" parity claim as a goal not-yet-reproduced (empty
  benchmarks), per CLAUDE.md.
- packaging: real maintainer name + email (trust contact); LangGraph-focused PyPI
  summary (CrewAI adapter is still a stub — kept as roadmap, not advertised as
  working); plain hyphen instead of em-dash in the summary.

ruff/black/mypy-strict clean; 319 unit tests pass; wheel builds + twine check OK.
… cases)

Round 4 confirmed no finding survives at blocker severity. Fixing the verified
high/correctness items so the first release is top-tier:

- langgraph: reject a node that has BOTH a plain edge and conditional edges
  (would silently become a decision node with an unconditional branch). Extend
  the escalation/abandonment scan to predecessor node names (the shipped demo's
  `escalate -> END` collapses onto a success terminal) and warn loudly.
- eval: warn when run without --served-url that the compiled model is NOT scored
  (baselines only) — the README/FAQ sell eval as the no-GPU win.
- report: relabel the cost chart "API cost per conversation" + note it excludes
  self-hosted GPU inference cost, so it is not mistaken for the paper's
  cost-reduction figure.
- packaging: exclude __pycache__/*.pyc from wheel+sdist and remove a stray .pyc
  that was shipping via the examples force-include.
- runpod: self-bootstrap setup.sh via curl in every spec's dockerArgs (the
  recipe previously assumed setup.sh was already on the pod and hard-failed);
  document the no-egress fallback and mark the 8B spec an unvalidated template.
- README: note the [langgraph] extra above the compile snippet + the
  trust-boundary caution for executing a .py graph.

ruff/black/mypy-strict clean; 319 unit tests pass; wheel builds (no .pyc) + twine OK.
Round 5 again confirmed no install/runtime blockers; fixing the verified high
correctness + credibility items.

- langgraph: reject graphs with multiple `add_edge(START, …)` (previously kept
  only the last, silently producing a wrong, nondeterministic IR for a legal
  graph). Makes "the conversion is deterministic and validated" true.
- langgraph: warn on parallel fan-out (a node with >1 unconditional edge) — it
  silently becomes a single random branch in sequential traversal; every other
  lossy case already warned, this one didn't.
- honesty(rubric): stop claiming the behavioral anchors are "verbatim from the
  paper §3" — they are agent2model's own 1/3/5 approximations of the paper's
  scale. Corrected in rubric.py, docs/evaluation.md, and the README.
- honesty(benchmarks): relabel the "Paper" target numbers as reported-by-the-
  paper and NOT independently verified by this project (the source numbers can't
  be confirmed beyond the preprint); "Current" columns remain empty until our
  own reproductions land.

ruff/black/mypy-strict clean; 319 unit tests pass.
…ipped)

The CrewAI adapter raises NotImplementedError in v1, so the README/docs/CLI/
mkdocs headlines no longer advertise a CrewAI on-ramp. CrewAI stays as a
roadmap mention and a discovery keyword, not a headline capability.
@kamaalg kamaalg merged commit 9641160 into main May 30, 2026
2 checks passed
@kamaalg kamaalg deleted the release-hardening branch May 30, 2026 13:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant