feat: add ignore_unreachable to NodeClient.destroy_all_workers by timzsu · Pull Request #2 · mlsys-io/FlowMesh

timzsu · 2026-04-29T09:57:21Z

Purpose

Teardown flows currently wrap destroy_all_workers() in their own try/except FlowMeshConnectionError so they don't crash when a stack up failed before the server became reachable. Lift that pattern into the SDK as an opt-in keyword.

Changes

sdk/stack/src/flowmesh_stack/node_client.py — NodeClient.destroy_all_workers accepts ignore_unreachable: bool = False and returns bool. With True, FlowMeshConnectionError is swallowed and the call returns False. Other errors propagate either way.
tests/sdk/test_node_client.py — covers both branches: default re-raises FlowMeshConnectionError; ignore_unreachable=True returns False.

Design

Keyword-only flag follows pathlib.Path.unlink(missing_ok=True). Bool return is the canonical "did it reach the server?" signal; the SDK doesn't log itself so callers control messaging.
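The standard-library precedent cited above can be seen in a few lines. This sketch only exercises `pathlib.Path.unlink` (the `missing_ok` parameter requires Python 3.8 or later); the file name is arbitrary. One difference worth noting: `unlink` returns `None` in both cases, whereas this PR adds a bool return on top of the opt-in flag.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "absent.txt"  # never created

    # Opt-in tolerance: a missing file is silently ignored.
    target.unlink(missing_ok=True)

    # Default behavior: a missing file is an error.
    try:
        target.unlink()
    except FileNotFoundError:
        default_raised = True
    else:
        default_raised = False
```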

Test Plan

uv run pre-commit run --all-files
uv run pytest tests/ -q

Test Result

pre-commit — clean (isort / black / ruff / codespell / mypy / sync-requirements)
pytest — 531 passed in 73.4s
env-examples + sync-requirements + check-pr-title — all pass

Pre-submission Checklist

I have read CONTRIBUTING.md (or AGENTS.md if no CONTRIBUTING.md).
I have run uv run pre-commit run --all-files and fixed any issues.
I have added or updated tests covering my changes (if applicable).
I have verified that the test suite passes locally.
If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.

Teardown flows that need to tolerate a never-reachable FlowMesh server (e.g. a previous ``stack up`` that failed before the server was healthy) currently have to wrap the destroy call in try/except. Lift that into an ``ignore_unreachable`` keyword on ``destroy_all_workers`` itself: when True, ``FlowMeshConnectionError`` is swallowed and the method returns cleanly. There are no workers to destroy when the server isn't there anyway. Other errors (auth, 5xx) still propagate so genuine misconfiguration stays loud.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

…ten _drain_workers

``NodeClient.destroy_worker`` and ``NodeClient.stop_worker`` gain the same ``ignore_unreachable`` flag as ``destroy_all_workers``: when ``True``, ``FlowMeshConnectionError`` is swallowed and the method returns cleanly; auth, 5xx, and other errors still propagate. Two new tests per method cover both branches.

``flowmesh_cli_stack.stack._drain_workers`` switches from a broad ``except Exception`` (which silently swallowed auth / 5xx / programming bugs) to ``destroy_all_workers(ignore_unreachable=True)``. Connection-down is still tolerated during ``stack down`` / ``clean`` / ``reset``, but real misconfiguration now surfaces instead of being logged and ignored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

…unreachable

Brings ``destroy_all_workers``'s docstring back to the same shape as ``destroy_worker`` / ``stop_worker``; the rationale belongs in the PR description, not three places in the source.

When ``ignore_unreachable=True`` swallows a ``FlowMeshConnectionError``, each method now logs a WARNING via ``flowmesh_stack.node_client``'s module logger so teardown flows that previously emitted a "skipping" warning at the call site keep that visibility. Tests assert the warning is emitted on each swallow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

….core.logging

``stop_worker`` / ``destroy_worker`` / ``destroy_all_workers`` now return ``True`` on success and ``False`` when ``ignore_unreachable=True`` and the FlowMesh server was unreachable. The internal stdlib WARNING stays for SDK consumers.

``flowmesh_cli_stack.stack._drain_workers`` checks the return value and, on a skipped destroy, emits a yellow ``logging.warning`` via the CLI's own ``flowmesh_cli.core.logging`` (typer-echo) so users running ``stack down`` / ``clean`` / ``reset`` against a half-broken stack see why teardown skipped workers; the SDK's stdlib warning doesn't otherwise surface in CLI output. Tests assert the return value alongside the warning on each swallow path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

…ss``

The bool return from ``destroy_all_workers`` / ``destroy_worker`` / ``stop_worker`` is the canonical signal; log emission is the consumer's job. Removes the SDK's stdlib ``logger.warning`` (and the unused ``import logging`` / ``logger = ...`` setup) so the SDK stays purely a transport layer.

``_drain_workers`` binds the result to ``success`` before checking, both to make the intent explicit and to give the variable a place where a future caller could thread additional reactions. Tests no longer assert log emission; they verify the return value alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

Per review discussion: ``destroy_all_workers`` is the only call that fits the "best-effort teardown against an absent server" use case (it runs from ``_drain_workers`` during ``stack down`` / ``clean`` / ``reset``). Per-worker stop / destroy is invoked from user-facing CLI commands where unreachable should be a hard error; extending the flag there bloats the API for no real consumer.

``stop_worker`` and ``destroy_worker`` go back to their original signatures (no flag, no bool return). The flag and the corresponding tests stay on ``destroy_all_workers``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

kaiitunnz

One comment.

kaiitunnz · 2026-04-29T12:32:49Z

+    client = stack_node_client(env_file, base_url=None, token=None)
+    success = client.destroy_all_workers(ignore_unreachable=True)
+    if not success:
+        logging.warning("Server unreachable; skipping worker destruction.")


I suggest keeping the old behavior: ignore the exception, log a warning, and fall through. Otherwise, you need to catch other exceptions, log properly, and exit.
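The "old behavior" referred to here is the broad ``except Exception`` in ``_drain_workers`` that an earlier commit in this PR had replaced. A hedged sketch of that shape follows; the function body, logger name, and `FailingClient` are illustrative assumptions, not the repository's actual code.

```python
import logging

logger = logging.getLogger("drain_sketch")  # hypothetical logger name


class FailingClient:
    """Hypothetical client whose destroy call fails for a non-connection reason."""

    def destroy_all_workers(self) -> None:
        raise RuntimeError("500 Internal Server Error")


def drain_workers(client) -> None:
    """Best-effort drain: log any failure and let the rest of teardown continue."""
    try:
        client.destroy_all_workers()
    except Exception as exc:  # deliberately broad, matching the old behavior
        logger.warning("Skipping worker destruction: %s", exc)
    # Fall through: the remaining teardown steps would run here.
```

The trade-off discussed in the thread is visible here: teardown never aborts, but auth failures and server errors are reduced to a warning rather than surfacing as errors.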

Per review: keep this PR scoped strictly to the SDK addition. The ``ignore_unreachable=True`` flag is available for downstream callers (lumilake.optimizer's deploy library) to use; rolling out the new shape inside FlowMesh's own CLI can land separately if/when wanted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

kaiitunnz

LGTM.

timzsu and others added 2 commits April 29, 2026 09:56

timzsu changed the title ~~feat: add ignore_unreachable flag to NodeClient.destroy_all_workers~~ feat: add ignore_unreachable to NodeClient destroy/stop methods Apr 29, 2026

timzsu and others added 3 commits April 29, 2026 10:59

timzsu requested a review from kaiitunnz April 29, 2026 11:40

timzsu marked this pull request as ready for review April 29, 2026 11:41

timzsu changed the title ~~feat: add ignore_unreachable to NodeClient destroy/stop methods~~ feat: add ignore_unreachable to NodeClient.destroy_all_workers Apr 29, 2026

kaiitunnz requested changes Apr 29, 2026

timzsu requested a review from kaiitunnz April 29, 2026 12:39

kaiitunnz approved these changes Apr 29, 2026

timzsu merged commit 762f0e2 into main Apr 29, 2026
0 of 9 checks passed

timzsu deleted the zsu/destroy-workers-ignore-unreachable branch April 29, 2026 12:42

timzsu merged 7 commits into main from zsu/destroy-workers-ignore-unreachable
