
fix: spawn supervisor child to avoid OpenSSL fork deadlock #55

Merged: kaiitunnz merged 6 commits into main from kaiitunnz/fix/supervisor-spawn-context on May 23, 2026
Conversation

@kaiitunnz (Collaborator) commented May 22, 2026

Purpose

Fixes an intermittent server-startup hang where the FastAPI lifespan times out with "Supervisor child did not register a node within 30s (child still alive)". The hang reproduces in roughly 1 in 5 to 1 in 10 stack boots and is more likely after a stack restart that overlaps with other Docker activity.

Also addresses two newly disclosed CVEs that started failing pip-audit on this PR — starlette PYSEC-2026-161 is resolved on the server by bumping fastapi; diffusers GHSA-7wx4-6vff-v64p and the worker-GPU starlette exposure are silenced because both upgrades are blocked by existing pins.

Changes

  • src/server/supervisor/supervisor.py: spawn the supervisor child via mp.get_context("spawn") instead of the default fork. Widen the process annotation to BaseProcess | None so spawn's SpawnProcess type-checks.
  • src/server/utils/concurrent.py: create_task_channel now creates its IPC queues from the same spawn context, so the SemLocks inside are spawn-compatible (mixing fork and spawn primitives raises RuntimeError at spawn time).
  • pyproject.toml / uv.lock / src/server/requirements.txt: bump fastapi>=0.135.0 so the server pip-audit picks up a starlette past the PYSEC-2026-161 fix.
  • .github/workflows/security.yml + docs/CODE_STYLE.md: add --ignore-vuln entries for GHSA-7wx4-6vff-v64p (diffusers, all worker steps) and PYSEC-2026-161 (starlette, worker-GPU only), with matching rows in the advisory table. Both upgrades are blocked there by existing transitive pins.
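
The first two changes can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `SPAWN_CTX`, `supervisor_main`, `start_supervisor`, and the two-queue shape of `create_task_channel` are assumptions made for the example.

```python
import multiprocessing as mp
from multiprocessing.process import BaseProcess

# Hypothetical module-level constant holding the spawn context.
SPAWN_CTX = mp.get_context("spawn")


def create_task_channel():
    """Hypothetical stand-in: build both IPC queues from the spawn context,
    so the locks inside them can be shared with a spawned child."""
    return SPAWN_CTX.Queue(), SPAWN_CTX.Queue()


def supervisor_main(requests, responses):
    """Hypothetical child entry point. It must live at module top level so
    the freshly started interpreter can import it."""
    responses.put("registered")


def start_supervisor(requests, responses) -> BaseProcess:
    # SPAWN_CTX.Process is SpawnProcess, a BaseProcess subclass, which is
    # why the annotation is BaseProcess rather than multiprocessing.Process.
    process = SPAWN_CTX.Process(
        target=supervisor_main, args=(requests, responses), daemon=True
    )
    process.start()
    return process
```

Because a spawned child re-imports the main module, `start_supervisor` should only be called from code that does not run at import time, for example under an `if __name__ == "__main__":` guard or from a server startup hook.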

Design

The parent server opens Redis-over-TLS connections during lifespan startup, which initialises OpenSSL state in the parent process. OpenSSL is not fork-safe — random-pool state and internal locks inherited via fork() can deadlock the child the first time it calls ssl.SSLContext.__new__. We caught the child stuck there with a faulthandler thread dump, inside redis-py's SSL connect path on SyncRedisClient.ping().

Switching to spawn execs a fresh Python interpreter for the child, so OpenSSL initialises cleanly. The spawn overhead is ~0.5–1 s and is paid only once per stack boot.
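
For reference, one way a thread dump like the one mentioned above could be captured is with the standard-library `faulthandler` watchdog. The PR does not say how the dump was actually taken, so the helper names and the choice of `dump_traceback_later` here are assumptions.

```python
import faulthandler
import sys

# Assumption: reuse the 30 s handshake deadline from the error message.
HANDSHAKE_TIMEOUT_S = 30.0


def arm_hang_watchdog(timeout_s=HANDSHAKE_TIMEOUT_S, file=sys.stderr):
    """Dump the stack of every thread to `file` if the process is still
    running after `timeout_s` seconds. The process is not terminated."""
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=file)


def disarm_hang_watchdog():
    """Cancel the pending dump once startup has completed normally."""
    faulthandler.cancel_dump_traceback_later()
```

A child entry point would call `arm_hang_watchdog()` first and `disarm_hang_watchdog()` after registering. If the child hangs, the dump shows which frame each thread is blocked in.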

Test Plan

uv run pre-commit run --all-files
uv run pytest tests/ --ignore=tests/worker/test_mp_executor_cleanup_gpu.py
uvx pip-audit==2.9.0 --strict $(<ignore-list>) -r src/server/requirements.txt
uvx pip-audit==2.9.0 --strict $(<ignore-list>) -r src/worker/requirements/requirements.txt
uvx pip-audit==2.9.0 $(<ignore-list>) -r src/worker/requirements/requirements.gpu.txt

E2E (local stack, no plugin scenarios):

  • flowmesh stack up × 25 consecutive cycles with worker up cpu 1 mixed in — 25/25 healthy, zero handshake timeouts. Pre-fix, the same loop reproduced the timeout at ~1 in 5–10 cycles.
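
A stress loop of this kind could be driven by a small harness like the sketch below. It is not the script used for this PR: `run_cycle` and `stress_loop` are hypothetical helpers, and the command and health check for each cycle are left to the caller.

```python
import subprocess
from typing import Callable, Sequence


def run_cycle(command: Sequence[str], timeout_s: float) -> bool:
    """Run one command and report success. A timeout or a failure to
    launch counts as a failed cycle."""
    try:
        result = subprocess.run(command, timeout=timeout_s, check=False)
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def stress_loop(cycle: Callable[[], bool], cycles: int = 25) -> tuple[int, int]:
    """Call `cycle` the given number of times; return (passed, failed)."""
    passed = sum(1 for _ in range(cycles) if cycle())
    return passed, cycles - passed
```

The caller supplies `cycle` as a function that brings the stack up, checks health, and tears it down, then prints the two counts in the `PASS=… FAIL=…` form used in the results.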

Test Result

$ uv run pre-commit run --all-files
# all hooks pass on the touched files; pre-existing mypy errors elsewhere unchanged

$ uv run pytest tests/ --ignore=tests/worker/test_mp_executor_cleanup_gpu.py
977 passed in 44.18s

$ uvx pip-audit==2.9.0 --strict ... -r src/server/requirements.txt
$ uvx pip-audit==2.9.0 --strict ... -r src/worker/requirements/requirements.txt
$ uvx pip-audit==2.9.0          ... -r src/worker/requirements/requirements.gpu.txt
No known vulnerabilities found (0 ignored server, 13 ignored CPU, 21 ignored GPU)

25-cycle stress loop: PASS=25 FAIL=0.


Pre-submission Checklist
  • I have read the contribution guidelines.
  • I have run pre-commit run --all-files and fixed any issues.
  • I have added or updated tests covering my changes (if applicable).
  • I have verified that uv run pytest tests/ passes locally.
  • If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker.
  • If I changed the SDK or CLI, I have verified the affected packages work (uv sync --all-packages --group ci --frozen).
  • If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.
  • I have updated documentation or config examples if user-facing behavior changed.

@kaiitunnz kaiitunnz changed the title from "fix(supervisor): spawn child instead of forking to avoid OpenSSL deadlock" to "fix: spawn supervisor child to avoid OpenSSL fork deadlock" May 22, 2026
@kaiitunnz kaiitunnz force-pushed the kaiitunnz/fix/supervisor-spawn-context branch 2 times, most recently from 4fe56a0 to 4bda435 May 22, 2026 12:11
kaiitunnz added 2 commits May 22, 2026 21:17
The parent server opens Redis-over-TLS connections during lifespan
startup, which initialises OpenSSL state. OpenSSL is not fork-safe —
inheriting random-pool state and internal locks via `fork()` can
deadlock the child the first time it calls `ssl.SSLContext.__new__`
(observed as an intermittent "Supervisor child did not register a
node within 30s" handshake timeout, with the child stuck in
`ssl.create_default_context` under `redis-py`'s SSL connect path).

Use `mp.get_context("spawn")` for the supervisor `Process` and for
the IPC queues in `create_task_channel`. Spawn execs a fresh
interpreter so the child gets clean OpenSSL state. The IPC primitives
must be created from the same context — mixing fork-context
`SemLock`s with a spawn-context process raises a RuntimeError.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Newly disclosed CVE in diffusers 0.36.0, fixed in 0.38.0. Bumping is
blocked by the same `safetensors>=0.8.0rc0` pre-release requirement
that already gates GHSA-j7w6-vpvq-j3gm and GHSA-98h9-4798-4q5v;
adding to the existing diffusers row block in the workflow and the
advisory table.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
@kaiitunnz kaiitunnz force-pushed the kaiitunnz/fix/supervisor-spawn-context branch from 4bda435 to dd00769 May 22, 2026 13:17
kaiitunnz added 3 commits May 22, 2026 21:28
`mp.get_context("spawn")` returns a singleton instantiated at
`multiprocessing` import time, so `@functools.cache` on a getter was
redundant and the "lazily constructed" framing was misleading. A
module-level constant is simpler, honest about what it is, and reads
naturally at call sites.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
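
The point of the commit above can be checked with a short comparison. The names `get_spawn_context` and `SPAWN_CTX` are illustrative and may not match the identifiers used in the repository.

```python
import functools
import multiprocessing as mp


# Earlier shape (illustrative): a cached getter around get_context.
@functools.cache
def get_spawn_context():
    return mp.get_context("spawn")


# Simplified shape (illustrative): a plain module-level constant.
SPAWN_CTX = mp.get_context("spawn")

# get_context hands back the same context object on every call, so the
# cache adds nothing on top of the constant.
assert get_spawn_context() is SPAWN_CTX
assert mp.get_context("spawn") is SPAWN_CTX
```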
Newly disclosed CVE in starlette 0.52.1, fixed in 1.0.1. Bumping is
blocked by `gradio==5.50` (transitive via `vllm-omni==0.18`), which
caps `starlette<1.0` — same chain that gates the existing gradio /
vllm-omni CVE ignores. Add to the worker-GPU pip-audit invocation
(where the failure surfaced) and document the row in the advisory
table.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
The previous commit added the ignore only to the worker-GPU step
because that's where the CVE first surfaced. The lock then resolved
starlette to a different (still <1.0.1) version on the server side,
exposing the same advisory in the server pip-audit step. Same blocker
(gradio 5.50 caps starlette<1.0 via vllm-omni 0.18 — already in the
docs table).

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
@kaiitunnz kaiitunnz requested a review from timzsu May 22, 2026 14:21
@timzsu (Collaborator) left a comment
One minor comment. PTAL.

Comment thread .github/workflows/security.yml Outdated
run: |
grep -v '@ git+' src/server/requirements.txt > /tmp/requirements-server-audit.txt
uvx pip-audit==2.9.0 --strict \
--ignore-vuln PYSEC-2026-161 \
Collaborator

Do we need this line? I experimented locally and found that bumping fastapi to >=0.135.0 removes the need for this ignore. (The doc addition says the ignore is due to vllm-omni, which is not installed in the server environment.)

Collaborator Author

Fixed by removing the line and bumping the FastAPI lower bound.

The previous ignore for PYSEC-2026-161 on the server pip-audit step
cited the gradio 5.50 / vllm-omni 0.18 cap on `starlette<1.0`, but
neither lives in the server requirements — only the worker-GPU layer
brings them in. Bumping fastapi's floor lets `pip-audit` install a
starlette past the 1.0.1 fix on the server, so the server step
audits clean without an ignore. The worker-GPU ignore stays — that
chain still caps starlette there.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
@kaiitunnz kaiitunnz requested a review from timzsu May 22, 2026 16:46
@timzsu (Collaborator) left a comment
LGTM.

@kaiitunnz kaiitunnz merged commit 98a535a into main May 23, 2026
12 checks passed
@kaiitunnz kaiitunnz deleted the kaiitunnz/fix/supervisor-spawn-context branch May 23, 2026 02:54