Skip to content

fix: send api key on worker node registration#43

Merged
kaiitunnz merged 9 commits into
mainfrom
kaiitunnz/fix/node-registration
May 13, 2026
Merged

fix: send api key on worker node registration#43
kaiitunnz merged 9 commits into
mainfrom
kaiitunnz/fix/node-registration

Conversation

@kaiitunnz
Copy link
Copy Markdown
Collaborator

@kaiitunnz kaiitunnz commented May 13, 2026

Purpose

A worker node's supervisor couldn't register with a root server when the root had any IdentityProvider plugin chain installed. Lifecycle._register_http posted to /api/v1/nodes/register without an Authorization header, so the server's auth chain rejected the unauthenticated POST. Every other supervisor / runtime path already attaches FLOWMESH_API_KEY as a bearer token — registration was the odd one out.

This PR ships the fix plus the refactor it landed on top of: the auth_headers / add_auth_headers helpers (previously buried in worker/executors/utils/artifacts.py) move to shared.utils.http so the supervisor can reuse them without depending on the executor stack. While there, drop dead duplicates from server/utils/helpers.py.

Changes

  • Fix: Lifecycle._register_http now sends Authorization: Bearer ${FLOWMESH_API_KEY} on POST /api/v1/nodes/register; tests/server/test_supervisor_lifecycle.py pins the header contract for both the set / unset key cases.
  • New shared.utils.http holds auth_headers and add_auth_headers. Worker call sites re-import from here; the supervisor reuses the same helpers. Module stays os-only (no aiohttp / requests) so the worker SSH executor import chain stays clean under runtime-worker-core.
  • server/utils/helpers.py cleanup: drop the JSON helpers already duplicated in shared.utils.json, fold iter_pubsub_messages into clients/redis.py (keeping the safer error-tolerant variant), and drop get_logger in favor of logging.getLogger("supervisor") at the two adapter call sites.

Design

The bug fix on its own is a two-line change. The refactor isn't strictly required, but auth_headers is the only obviously-shared piece between _register_http and the rest of the runtime — leaving it under worker.executors.utils.artifacts would have left the supervisor depending on the executor module just to format a bearer header. Moving it to shared.utils.http keeps the fix readable and makes the helper available to anything else that talks to the server. The helpers.py cleanups are opportunistic — same file, same review, no extra surface.

Test Plan

uv run pre-commit run --all-files
uv run pytest tests --ignore=tests/worker/test_mp_executor_cleanup_gpu.py

No live multi-node smoke run on this branch — the fix is mechanical (header attach) and covered by the new lifecycle tests.

Test Result

$ uv run pre-commit run --files <touched files across the branch>
# All passed

$ uv run pytest tests --ignore=tests/worker/test_mp_executor_cleanup_gpu.py
# 864 passed, 18 warnings in 32.00s

Pre-submission Checklist
  • I have read the contribution guidelines.
  • I have run pre-commit run --all-files and fixed any issues.
  • I have added or updated tests covering my changes (if applicable).
  • I have verified that uv run pytest tests/ passes locally.
  • If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker.
  • If I changed the SDK or CLI, I have verified the affected packages work (uv sync --all-packages --group ci --frozen).
  • If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.
  • I have updated documentation or config examples if user-facing behavior changed.

@kaiitunnz kaiitunnz requested a review from timzsu as a code owner May 13, 2026 06:17
Copy link
Copy Markdown
Collaborator

@timzsu timzsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. One potential risk posed by the broad try-except.

As this is a critical bugfix, I suggest we publish a patch release 0.1.1 after this PR.

Comment thread src/server/clients/redis.py
@kaiitunnz kaiitunnz requested a review from timzsu May 13, 2026 07:55
@kaiitunnz kaiitunnz force-pushed the kaiitunnz/fix/node-registration branch from a54e713 to 2981011 Compare May 13, 2026 08:00
Copy link
Copy Markdown
Collaborator

@timzsu timzsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@kaiitunnz kaiitunnz force-pushed the kaiitunnz/fix/node-registration branch 2 times, most recently from dede814 to e5b464e Compare May 13, 2026 08:22
kaiitunnz added 9 commits May 13, 2026 08:32
`auth_headers` / `add_auth_headers` were defined in
`worker/executors/utils/artifacts.py` but are useful anywhere that
talks to the FlowMesh server — supervisor self-registration, future
SDK callers, etc. Hoist them to `shared.utils.http` so non-worker
modules can import them without pulling in the executor stack.

Both functions gain an optional `token` arg; the default still reads
`FLOWMESH_API_KEY` from the env so the existing zero-arg call sites
keep working.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Sits next to the auth-headers helpers it already uses. No call sites
update — `HttpSession` has none today; the move just puts it where
future supervisor / shared callers can pick it up without depending
on `server.utils.helpers`.

`__init__` now goes through `auth_headers(token)` so the bearer
formatting stays in one place.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
`dedup_json`, `restore_json`, `lookup_deduped_json`, `normalize_numbers`,
and the private `_restore_deduped_node` were verbatim copies of the
canonical versions in `shared.utils.json` (already re-exported via
`shared.utils.__init__`). No code imported the server copies.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Two copies existed: `clients/redis.py` (bare loop) and
`utils/helpers.py` (loop wrapped in `try/except (ConnectionError,
ValueError, OSError)`). Keep the safer wrapping version on
`clients/redis.py` — pubsub iteration belongs with the Redis client
— and drop the helpers copy.

Pre-existing server-side callers (`registries/node.py`,
`services/monitoring.py`) gain the same shutdown-tolerance the
supervisor callers already had; both wrappers either had their own
outer `try/except` or didn't need the connection-error surface.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
The helpers-side `get_logger` was a singleton wrapping
`logging.getLogger("server")` with a `RotatingFileHandler` side
effect on first call. Only the supervisor Docker / vastai adapters
called it, both at module-import time — so the first import would
silently spin up a `server.log` handler before `supervisor.py`
configured the real `supervisor` logger.

Replace both call sites with `logging.getLogger("supervisor")`, which
returns the same instance the supervisor entrypoint later attaches
handlers to (no pre-configuration side effects). Drop the helpers
function and its `_logger` global; the canonical logger factory in
`utils/logging.py` is now the only `get_logger` in the package.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
`_register_http` posted to `/api/v1/nodes/register` without an
`Authorization` header, so a worker node's supervisor couldn't
register against a root server with an `IdentityProvider` plugin
chain installed — the server's auth chain rejected the unauthenticated
POST. Pass `auth_headers()` so the same `FLOWMESH_API_KEY` the rest
of the runtime already uses rides on the request.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
HttpSession had no live callers anywhere in the tree, and its
`aiohttp` / `requests` imports leaked into the worker `ssh_executor`
import chain — breaking `runtime-worker-core` installs that don't
ship aiohttp (the registry would silently drop SSHExecutor via
`_safe_import`). Drop it; `shared.utils.http` is back to the
auth-header helpers only.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Pins the contract that worker node registration sends
`Authorization: Bearer <FLOWMESH_API_KEY>` when the env var is
set, and omits the header otherwise. Guards the regression this
branch fixed.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
@kaiitunnz kaiitunnz force-pushed the kaiitunnz/fix/node-registration branch from e5b464e to 1fa17a4 Compare May 13, 2026 08:32
@kaiitunnz kaiitunnz merged commit c436859 into main May 13, 2026
11 checks passed
@kaiitunnz kaiitunnz deleted the kaiitunnz/fix/node-registration branch May 13, 2026 08:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants