ffi: fix sandlock_spawn failure under multi-threaded callers with restricted seccomp (#47) by congwang-mk · Pull Request #49 · multikernel/sandlock

congwang-mk · 2026-05-16T22:05:32Z

Summary

Fixes #47. Sandbox(policy).run([...]) was failing with sandlock_spawn failed when called from a multi-threaded Python host (uvicorn/asyncio) on a Kubernetes pod with the RuntimeDefault seccomp profile.

Root cause: every FFI entry point built a fresh tokio::runtime::Runtime::new(), which is new_multi_thread() and spawns worker threads eagerly via pthread_create. Kubernetes' RuntimeDefault blocks clone3 with ENOSYS, and the multi-thread builder's eager-spawn path returned Err before glibc's fallback could help, surfacing to the caller as a NULL handle.

Fix: switch FFI entry points to a per-thread cached current_thread Tokio runtime, which spawns no threads at construction. Three runtime shapes:

Call site	Runtime	Why
`sandlock_run` and the rest of the one-shot entry points	thread-local `current_thread`	One-shot block_on, no tasks need to outlive the call
`sandlock_create` (live handle)	per-handle multi_thread, 1 worker	Supervisor must keep ticking between `start` and `wait`; one persistent worker is unavoidable here
`sandlock_create_for_run` (new)	per-handle `current_thread`	Python's `Sandbox.run()` is `start` then `wait` back-to-back, so suspension across the gap is fine and avoids the one worker thread `sandlock_create` would have spawned

Sandbox.run() is wired to sandlock_create_for_run. The seccomp supervisor's blocking SECCOMP_IOCTL_NOTIF_RECV thread is left as-is: it spawns through pthread_create, which the reporter's diagnostic confirms works in their environment ("new_thread": "ok").

Test plan

cargo build --release clean on dev
pytest python/tests/ 249/249 pass, including the new TestSandlockRunCAbiMultiThreaded and TestSandboxRunMultiThreaded regression tests
Reporter retests against this branch on their k8s pod with RuntimeDefault seccomp (the only environment where the original failure was empirically observable)

Note: the new tests are not red-on-pristine on an unrestricted dev box because glibc's clone3 → clone(2) fallback masks the failure mode locally. The docstring on TestSandlockRunCAbiMultiThreaded documents this honestly.

Signed-off-by: Cong Wang <cwang@multikernel.io>

mrsimpson · 2026-05-18T05:24:32Z

Verified on the target environment.

Setup: k3s pod, RuntimeDefault seccomp, Landlock ABI v7, Python 3.12, uvicorn (2 threads: event loop + thread pool).

Tested by running Sandbox.run() directly from within a multi-threaded Python process inside the pod:

threads active: 2
landlock abi: 7
success=True exit=0 stdout=b'4\n' error=None

The fix works. Sandbox.run() succeeds from the uvicorn process without any subprocess workaround. The subprocess-per-call workaround has been removed from our MCP server.

Note: the Policy class is gone in the dev branch — Sandbox is now the single dataclass for both policy and runtime. net_allow_hosts is now net_allow. We updated our code accordingly without issue.

mrsimpson · 2026-05-18T06:50:26Z

EDIT: There are other side effects that prevent this from working e2e in my seecomp k8s. I'll continue investigation

mrsimpson · 2026-05-18T09:25:29Z

Follow-up: fix does not resolve the issue in our environment.

After further testing, the dev branch does not fix the problem on our k8s pod. Here is what we found:

What works:

Sandbox.run() called from a fresh kubectl exec process (any PID ≠ 1) — succeeds
Sandbox.run() called from a subprocess spawned by the uvicorn server — succeeds
Our previous subprocess-per-call workaround — still works with the new API

What still fails:

Sandbox.run() called directly from within the uvicorn server process (PID 1)
Both sandlock_create_for_run and sandlock_run (one-shot, thread-local runtime) return NULL
Rust stderr is empty — no log_build_error output, no child error message
This means do_create() returns Err(...) silently (swallowed by block_on_runtime's match)

Environment when failing:

pid=1, active_threads=2, Seccomp: 2, Seccomp_filters: 1, NoNewPrivs: 1

The failure is specific to the uvicorn server process itself — not its thread count, not its seccomp state (same as exec processes), not its fd count (14 open fds, limit 1M). A subprocess spawned by that same process works fine.

The actual error from do_create() is not surfaced anywhere — block_on_runtime catches the Err variant silently and returns None → NULL handle. It would help to log the error string from do_create (e.g. to stderr) so we can see exactly which step inside do_create fails.

We've restored the subprocess workaround for now. Happy to test further if you have ideas on what to instrument.

mrsimpson · 2026-05-18T09:53:21Z

Root cause found.

By building from a fork with eprintln! added to sandlock_create_with_runtime and sandlock_run, we can now see the actual error:

sandlock: sandlock_create_with_runtime: do_create failed:
  Runtime(Child("read notif fd from child: pipe closed before 4 bytes read"))

The child process crashes before writing the notif fd number back to the parent via pipes.notif_w. The parent's read_u32_fd(pipes.notif_r) then fails because the pipe closes.

The child's fail!() output goes to its stderr — which in our case is a pipe (uvicorn's stderr is pipe:[...], not a tty), so it's silently swallowed before we can read it.

This happens at startup (threads=1, pid=1) — before uvicorn has even started its thread pool. So thread count is not the trigger. The child is dying in confine_child before it completes the seccomp/Landlock setup and calls write_u32_fd.

The most likely culprit: the child inherits uvicorn's pipe:[...] stderr fd (fd 2), and when close_fds_above(2, &keep_fds) runs it may close fds the child needs, or one of the earlier steps (Landlock landlock_restrict_self, seccomp prctl, or pipe2) is failing under the specific process state of PID 1.

Next diagnostic step: capture the child's stderr before close_fds_above runs, or add eprintln! inside confine_child's fail!() macro so child errors reach the parent.

mrsimpson · 2026-05-18T10:58:30Z

Updated summary (replaces earlier debugging comments)

Verdict: PR#49 is correct and necessary, but insufficient on its own for our environment. We found a second, unrelated container-specific bug.

PR#49 works as intended

The current_thread Tokio runtime fix correctly avoids the clone3 seccomp issue. After fixing the second bug, Sandbox.run() works directly from uvicorn PID 1 with RuntimeDefault seccomp — no subprocess workarounds needed.

Second bug: `getppid() == 1` in containers

Location: context.rs line 788 (confine_child)
Problem: if unsafe { libc::getppid() } == 1 { fail!("parent died before confinement"); } — in containers the entrypoint commonly runs as PID 1, so forked children legitimately have getppid() == 1, triggering a false positive
Effect: child exits silently before writing to the notif pipe → parent sees "read notif fd from child: pipe closed before 4 bytes read"
Fix: capture actual parent PID before fork, pass through ChildSpawnArgs, compare against that → fix: compare getppid() against actual parent pid, not hardcoded 1 #53

Both fixes needed

In containerized environments (k8s, PID 1 entrypoint), both PRs are required: PR#49 for the Tokio runtime + PR#53 for the getppid() container check. Neither alone suffices.

congwang-mk added 3 commits May 16, 2026 14:47

ffi: build current_thread Tokio runtime per FFI thread (#47)

c4455dc

Signed-off-by: Cong Wang <cwang@multikernel.io>

python: run sandboxes on local FFI runtime (#47)

22a3d53

Signed-off-by: Cong Wang <cwang@multikernel.io>

test: regression for threaded Sandbox.run (#47)

20a1fe5

Signed-off-by: Cong Wang <cwang@multikernel.io>

This was referenced May 16, 2026

sandlock_spawn fails with ENOSYS (clone3) when called from a multi-threaded Python process (uvicorn/asyncio + Kubernetes RuntimeDefault seccomp) #47

Open

core: drain seccomp notifications via AsyncFd, remove per-sandbox blocking thread #50

Merged

mrsimpson mentioned this pull request May 18, 2026

chore(sandbox-mcp): revert sandlock to pip release after PR#49 merge mrsimpson/homelab-apps#2

Open

mrsimpson mentioned this pull request May 18, 2026

fix: compare getppid() against actual parent pid, not hardcoded 1 #53

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ffi: fix sandlock_spawn failure under multi-threaded callers with restricted seccomp (#47)#49

ffi: fix sandlock_spawn failure under multi-threaded callers with restricted seccomp (#47)#49
congwang-mk wants to merge 3 commits into
mainfrom
dev

congwang-mk commented May 16, 2026

Uh oh!

mrsimpson commented May 18, 2026

Uh oh!

mrsimpson commented May 18, 2026

Uh oh!

mrsimpson commented May 18, 2026

Uh oh!

mrsimpson commented May 18, 2026

Uh oh!

mrsimpson commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

congwang-mk commented May 16, 2026

Summary

Test plan

Uh oh!

mrsimpson commented May 18, 2026

Uh oh!

mrsimpson commented May 18, 2026

Uh oh!

mrsimpson commented May 18, 2026

Uh oh!

mrsimpson commented May 18, 2026

Uh oh!

mrsimpson commented May 18, 2026

PR#49 works as intended

Second bug: getppid() == 1 in containers

Both fixes needed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Second bug: `getppid() == 1` in containers