fix: spawn supervisor child to avoid OpenSSL fork deadlock by kaiitunnz · Pull Request #55 · mlsys-io/FlowMesh

kaiitunnz · 2026-05-22T11:39:40Z

Purpose

Fixes an intermittent server-startup hang where the FastAPI lifespan times out with "Supervisor child did not register a node within 30s (child still alive)". The hang reproduces in roughly 1 in 5 to 1 in 10 stack boots and is more likely after a stack restart that overlaps with other docker activity.

Also addresses two newly disclosed CVEs that started failing pip-audit on this PR — starlette PYSEC-2026-161 is resolved on the server by bumping fastapi; diffusers GHSA-7wx4-6vff-v64p and the worker-GPU starlette exposure are silenced because both upgrades are blocked by existing pins.

Changes

src/server/supervisor/supervisor.py — spawn the supervisor child via mp.get_context("spawn") instead of the default fork. Widen the process annotation to BaseProcess | None so spawn's SpawnProcess type-checks.
src/server/utils/concurrent.py — create_task_channel now creates its IPC queues from the same spawn context, so the SemLocks inside are spawn-compatible (mixing fork and spawn primitives raises RuntimeError at spawn time).
pyproject.toml / uv.lock / src/server/requirements.txt — bump fastapi>=0.135.0 so the server pip-audit picks up a starlette past the PYSEC-2026-161 fix.
.github/workflows/security.yml + docs/CODE_STYLE.md — add --ignore-vuln entries for GHSA-7wx4-6vff-v64p (diffusers, all worker steps) and PYSEC-2026-161 (starlette, worker-GPU only), with matching rows in the advisory table. Both upgrades are blocked there by existing transitive pins.

Design

The parent server opens Redis-over-TLS connections during lifespan startup, which initialises OpenSSL state in the parent process. OpenSSL is not fork-safe — random-pool state and internal locks inherited via fork() can deadlock the child the first time it calls ssl.SSLContext.__new__. We caught the child stuck there with a faulthandler thread dump, inside redis-py's SSL connect path on SyncRedisClient.ping().
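The PR does not say how the thread dump was triggered. As a rough sketch of one way to capture such a dump with the standard-library faulthandler module (the helper names `arm_hang_watchdog` and `dump_all_threads` are hypothetical and not taken from the FlowMesh code):

```python
import faulthandler
import tempfile


def arm_hang_watchdog(timeout_s: float, log_file) -> None:
    """Dump every thread's stack to log_file if still running after timeout_s.

    log_file must be a real file object with a file descriptor and must stay
    open until the dump fires or the watchdog is cancelled.
    """
    # repeat=False: dump once; exit=False: keep the process alive afterwards.
    faulthandler.dump_traceback_later(
        timeout_s, repeat=False, file=log_file, exit=False
    )


def dump_all_threads() -> str:
    """Take an immediate dump of all threads and return it as text."""
    with tempfile.TemporaryFile() as fh:
        faulthandler.dump_traceback(file=fh, all_threads=True)
        fh.seek(0)
        return fh.read().decode()
```

Calling something like `arm_hang_watchdog(25, open("child-hang.log", "wb"))` at the top of the child's entry point would write the stacks shortly before a 30 s handshake deadline expires; `faulthandler.cancel_dump_traceback_later()` disarms it once the child has registered. The file name and the 25 s value are illustrative assumptions.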

Switching to spawn execs a fresh Python interpreter for the child, so OpenSSL initialises cleanly. The spawn overhead is ~0.5–1 s and is paid only once per stack boot.
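A minimal sketch of the pattern, assuming a toy child that doubles one number. `SPAWN_CTX`, `child_main`, `make_channel`, `make_child`, and `run_demo` are hypothetical names for illustration; the real supervisor and `create_task_channel` are not shown on this page.

```python
import multiprocessing as mp
from multiprocessing.process import BaseProcess

# Hypothetical module-level name for the shared context.
SPAWN_CTX = mp.get_context("spawn")


def child_main(task_queue, result_queue) -> None:
    """Child entry point; must live at module level so spawn can import it."""
    result_queue.put(task_queue.get() * 2)


def make_channel():
    """Create both IPC queues from the spawn context, not the default one."""
    return SPAWN_CTX.Queue(), SPAWN_CTX.Queue()


def make_child(task_queue, result_queue) -> BaseProcess:
    """Build (but do not start) a child that will run in a fresh interpreter."""
    return SPAWN_CTX.Process(target=child_main, args=(task_queue, result_queue))


def run_demo() -> int:
    """Round-trip one value through the child; call under a __main__ guard."""
    tasks, results = make_channel()
    proc = make_child(tasks, results)
    proc.start()
    tasks.put(21)
    value = results.get(timeout=30)
    proc.join(timeout=30)
    return value
```

Because spawn re-imports the main module in the child, `run_demo()` should only be called from inside an `if __name__ == "__main__":` block in a script. Annotating the process as `BaseProcess` rather than `multiprocessing.Process` is what lets the `SpawnProcess` instance returned by the context type-check.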

Test Plan

uv run pre-commit run --all-files
uv run pytest tests/ --ignore=tests/worker/test_mp_executor_cleanup_gpu.py
uvx pip-audit==2.9.0 --strict $(<ignore-list>) -r src/server/requirements.txt
uvx pip-audit==2.9.0 --strict $(<ignore-list>) -r src/worker/requirements/requirements.txt
uvx pip-audit==2.9.0 $(<ignore-list>) -r src/worker/requirements/requirements.gpu.txt

E2E (local stack, no plugin scenarios):

flowmesh stack up × 25 consecutive cycles with worker up cpu 1 mixed in — 25/25 healthy, zero handshake timeouts. Pre-fix, the same loop reproduced the timeout at ~1 in 5–10 cycles.
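The loop script itself is not included in the PR. A generic harness along these lines could drive such a loop, assuming (hypothetically) that a healthy boot is signalled by a zero exit code within a deadline; how the actual run judged health, and any teardown between cycles, is not stated here.

```python
import subprocess
from collections.abc import Sequence


def stress(boot_cmd: Sequence[str], cycles: int, timeout_s: float) -> tuple[int, int]:
    """Run boot_cmd `cycles` times and return (passed, failed).

    A cycle counts as passed only if the command exits 0 before timeout_s.
    """
    passed = failed = 0
    for _ in range(cycles):
        try:
            done = subprocess.run(boot_cmd, timeout=timeout_s, check=False)
            ok = done.returncode == 0
        except subprocess.TimeoutExpired:
            # A hung boot is recorded as a failure instead of blocking the loop.
            ok = False
        if ok:
            passed += 1
        else:
            failed += 1
    return passed, failed
```

For example, `stress(["flowmesh", "stack", "up"], cycles=25, timeout_s=120.0)` would produce the PASS/FAIL counts for a 25-cycle run; the 120 s deadline is an illustrative value.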

Test Result

$ uv run pre-commit run --all-files
# all hooks pass on the touched files; pre-existing mypy errors elsewhere unchanged

$ uv run pytest tests/ --ignore=tests/worker/test_mp_executor_cleanup_gpu.py
977 passed in 44.18s

$ uvx pip-audit==2.9.0 --strict ... -r src/server/requirements.txt
$ uvx pip-audit==2.9.0 --strict ... -r src/worker/requirements/requirements.txt
$ uvx pip-audit==2.9.0          ... -r src/worker/requirements/requirements.gpu.txt
No known vulnerabilities found (0 ignored server, 13 ignored CPU, 21 ignored GPU)

25-cycle stress loop: PASS=25 FAIL=0.

Pre-submission Checklist

I have read the contribution guidelines.
I have run pre-commit run --all-files and fixed any issues.
I have added or updated tests covering my changes (if applicable).
I have verified that uv run pytest tests/ passes locally.
If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker.
If I changed the SDK or CLI, I have verified the affected packages work (uv sync --all-packages --group ci --frozen).
If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.
I have updated documentation or config examples if user-facing behavior changed.

The parent server opens Redis-over-TLS connections during lifespan startup, which initialises OpenSSL state. OpenSSL is not fork-safe: inheriting random-pool state and internal locks via `fork()` can deadlock the child the first time it calls `ssl.SSLContext.__new__` (observed as an intermittent "Supervisor child did not register a node within 30s" handshake timeout, with the child stuck in `ssl.create_default_context` under `redis-py`'s SSL connect path).

Use `mp.get_context("spawn")` for the supervisor `Process` and for the IPC queues in `create_task_channel`. Spawn execs a fresh interpreter so the child gets clean OpenSSL state. The IPC primitives must be created from the same context: mixing fork-context `SemLock`s with a spawn-context process raises a RuntimeError.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

Newly disclosed CVE in diffusers 0.36.0, fixed in 0.38.0. Bumping is blocked by the same `safetensors>=0.8.0rc0` pre-release requirement that already gates GHSA-j7w6-vpvq-j3gm and GHSA-98h9-4798-4q5v; adding to the existing diffusers row block in the workflow and the advisory table.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

`mp.get_context("spawn")` returns a singleton instantiated at `multiprocessing` import time, so `@functools.cache` on a getter was redundant and the "lazily constructed" framing was misleading. A module-level constant is simpler, honest about what it is, and reads naturally at call sites.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
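The two shapes described in this commit message might look roughly as follows. `get_spawn_context` and `SPAWN_CTX` are hypothetical names, since the actual identifiers are not shown on this page.

```python
import functools
import multiprocessing as mp


# Before (as described): a cached getter. The cache adds nothing, because
# get_context("spawn") already returns the same context object on every call.
@functools.cache
def get_spawn_context():
    return mp.get_context("spawn")


# After: a plain module-level constant bound once at import time.
SPAWN_CTX = mp.get_context("spawn")
```

Both names end up bound to the same object, which is why the constant is a drop-in replacement at call sites such as `SPAWN_CTX.Process(...)` or `SPAWN_CTX.Queue()`.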

Newly disclosed CVE in starlette 0.52.1, fixed in 1.0.1. Bumping is blocked by `gradio==5.50` (transitive via `vllm-omni==0.18`), which caps `starlette<1.0`; this is the same chain that gates the existing gradio / vllm-omni CVE ignores. Add to the worker-GPU pip-audit invocation (where the failure surfaced) and document the row in the advisory table.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

The previous commit added the ignore only to the worker-GPU step because that's where the CVE first surfaced. The lock then resolved starlette to a different (still <1.0.1) version on the server side, exposing the same advisory in the server pip-audit step. Same blocker (gradio 5.50 caps starlette<1.0 via vllm-omni 0.18, already in the docs table).

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

timzsu

One minor comment. PTAL.

timzsu · 2026-05-22T15:04:17Z

        run: |
          grep -v '@ git+' src/server/requirements.txt > /tmp/requirements-server-audit.txt
          uvx pip-audit==2.9.0 --strict \
+            --ignore-vuln PYSEC-2026-161 \


Do we need this line? I experimented locally and found that bumping fastapi to >=0.135.0 resolves the advisory cleanly, so the ignore is unnecessary. (The doc addition says the ignore is due to vllm-omni, which is not installed in the server environment.)

Fixed by removing the line and bumping the FastAPI lower bound.

The previous ignore for PYSEC-2026-161 on the server pip-audit step cited the gradio 5.50 / vllm-omni 0.18 cap on `starlette<1.0`, but neither lives in the server requirements; only the worker-GPU layer brings them in. Bumping fastapi's floor lets `pip-audit` install a starlette past the 1.0.1 fix on the server, so the server step audits clean without an ignore. The worker-GPU ignore stays, since that chain still caps starlette there.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

timzsu

LGTM.

kaiitunnz changed the title ~~fix(supervisor): spawn child instead of forking to avoid OpenSSL deadlock~~ fix: spawn supervisor child to avoid OpenSSL fork deadlock May 22, 2026

kaiitunnz force-pushed the kaiitunnz/fix/supervisor-spawn-context branch 2 times, most recently from 4fe56a0 to 4bda435 Compare May 22, 2026 12:11

kaiitunnz added 2 commits May 22, 2026 21:17

kaiitunnz force-pushed the kaiitunnz/fix/supervisor-spawn-context branch from 4bda435 to dd00769 Compare May 22, 2026 13:17

kaiitunnz added 3 commits May 22, 2026 21:28

kaiitunnz requested a review from timzsu May 22, 2026 14:21

timzsu requested changes May 22, 2026

kaiitunnz requested a review from timzsu May 22, 2026 16:46

timzsu approved these changes May 23, 2026

kaiitunnz merged commit 98a535a into main May 23, 2026
12 checks passed

kaiitunnz deleted the kaiitunnz/fix/supervisor-spawn-context branch May 23, 2026 02:54

kaiitunnz merged 6 commits into main from kaiitunnz/fix/supervisor-spawn-context