Skip to content

ffi: fix sandlock_spawn failure under multi-threaded callers with restricted seccomp (#47)#49

Open
congwang-mk wants to merge 3 commits into
mainfrom
dev
Open

ffi: fix sandlock_spawn failure under multi-threaded callers with restricted seccomp (#47)#49
congwang-mk wants to merge 3 commits into
mainfrom
dev

Conversation

@congwang-mk
Copy link
Copy Markdown
Contributor

Summary

Fixes #47. Sandbox(policy).run([...]) was failing with sandlock_spawn failed when called from a multi-threaded Python host (uvicorn/asyncio) on a Kubernetes pod with the RuntimeDefault seccomp profile.

Root cause: every FFI entry point built a fresh tokio::runtime::Runtime::new(), which is new_multi_thread() and spawns worker threads eagerly via pthread_create. Kubernetes' RuntimeDefault blocks clone3 with ENOSYS, and the multi-thread builder's eager-spawn path returned Err before glibc's fallback could help, surfacing to the caller as a NULL handle.

Fix: switch FFI entry points to a per-thread cached current_thread Tokio runtime, which spawns no threads at construction. Three runtime shapes:

Call site Runtime Why
sandlock_run and the rest of the one-shot entry points thread-local current_thread One-shot block_on, no tasks need to outlive the call
sandlock_create (live handle) per-handle multi_thread, 1 worker Supervisor must keep ticking between start and wait; one persistent worker is unavoidable here
sandlock_create_for_run (new) per-handle current_thread Python's Sandbox.run() is start then wait back-to-back, so suspension across the gap is fine and avoids the one worker thread sandlock_create would have spawned

Sandbox.run() is wired to sandlock_create_for_run. The seccomp supervisor's blocking SECCOMP_IOCTL_NOTIF_RECV thread is left as-is: it spawns through pthread_create, which the reporter's diagnostic confirms works in their environment ("new_thread": "ok").

Test plan

  • cargo build --release clean on dev
  • pytest python/tests/ 249/249 pass, including the new TestSandlockRunCAbiMultiThreaded and TestSandboxRunMultiThreaded regression tests
  • Reporter retests against this branch on their k8s pod with RuntimeDefault seccomp (the only environment where the original failure was empirically observable)

Note: the new tests are not red-on-pristine on an unrestricted dev box because glibc's clone3 → clone(2) fallback masks the failure mode locally. The docstring on TestSandlockRunCAbiMultiThreaded documents this honestly.

Signed-off-by: Cong Wang <cwang@multikernel.io>
Signed-off-by: Cong Wang <cwang@multikernel.io>
Signed-off-by: Cong Wang <cwang@multikernel.io>
@mrsimpson
Copy link
Copy Markdown

Verified on the target environment.

Setup: k3s pod, RuntimeDefault seccomp, Landlock ABI v7, Python 3.12, uvicorn (2 threads: event loop + thread pool).

Tested by running Sandbox.run() directly from within a multi-threaded Python process inside the pod:

threads active: 2
landlock abi: 7
success=True exit=0 stdout=b'4\n' error=None

The fix works. Sandbox.run() succeeds from the uvicorn process without any subprocess workaround. The subprocess-per-call workaround has been removed from our MCP server.

Note: the Policy class is gone in the dev branch — Sandbox is now the single dataclass for both policy and runtime. net_allow_hosts is now net_allow. We updated our code accordingly without issue.

@mrsimpson
Copy link
Copy Markdown

EDIT: There are other side effects that prevent this from working e2e in my seecomp k8s. I'll continue investigation

@mrsimpson
Copy link
Copy Markdown

Follow-up: fix does not resolve the issue in our environment.

After further testing, the dev branch does not fix the problem on our k8s pod. Here is what we found:

What works:

  • Sandbox.run() called from a fresh kubectl exec process (any PID ≠ 1) — succeeds
  • Sandbox.run() called from a subprocess spawned by the uvicorn server — succeeds
  • Our previous subprocess-per-call workaround — still works with the new API

What still fails:

  • Sandbox.run() called directly from within the uvicorn server process (PID 1)
  • Both sandlock_create_for_run and sandlock_run (one-shot, thread-local runtime) return NULL
  • Rust stderr is empty — no log_build_error output, no child error message
  • This means do_create() returns Err(...) silently (swallowed by block_on_runtime's match)

Environment when failing:

pid=1, active_threads=2, Seccomp: 2, Seccomp_filters: 1, NoNewPrivs: 1

The failure is specific to the uvicorn server process itself — not its thread count, not its seccomp state (same as exec processes), not its fd count (14 open fds, limit 1M). A subprocess spawned by that same process works fine.

The actual error from do_create() is not surfaced anywhere — block_on_runtime catches the Err variant silently and returns None → NULL handle. It would help to log the error string from do_create (e.g. to stderr) so we can see exactly which step inside do_create fails.

We've restored the subprocess workaround for now. Happy to test further if you have ideas on what to instrument.

@mrsimpson
Copy link
Copy Markdown

Root cause found.

By building from a fork with eprintln! added to sandlock_create_with_runtime and sandlock_run, we can now see the actual error:

sandlock: sandlock_create_with_runtime: do_create failed:
  Runtime(Child("read notif fd from child: pipe closed before 4 bytes read"))

The child process crashes before writing the notif fd number back to the parent via pipes.notif_w. The parent's read_u32_fd(pipes.notif_r) then fails because the pipe closes.

The child's fail!() output goes to its stderr — which in our case is a pipe (uvicorn's stderr is pipe:[...], not a tty), so it's silently swallowed before we can read it.

This happens at startup (threads=1, pid=1) — before uvicorn has even started its thread pool. So thread count is not the trigger. The child is dying in confine_child before it completes the seccomp/Landlock setup and calls write_u32_fd.

The most likely culprit: the child inherits uvicorn's pipe:[...] stderr fd (fd 2), and when close_fds_above(2, &keep_fds) runs it may close fds the child needs, or one of the earlier steps (Landlock landlock_restrict_self, seccomp prctl, or pipe2) is failing under the specific process state of PID 1.

Next diagnostic step: capture the child's stderr before close_fds_above runs, or add eprintln! inside confine_child's fail!() macro so child errors reach the parent.

@mrsimpson
Copy link
Copy Markdown

Updated summary (replaces earlier debugging comments)

Verdict: PR#49 is correct and necessary, but insufficient on its own for our environment. We found a second, unrelated container-specific bug.


PR#49 works as intended

The current_thread Tokio runtime fix correctly avoids the clone3 seccomp issue. After fixing the second bug, Sandbox.run() works directly from uvicorn PID 1 with RuntimeDefault seccomp — no subprocess workarounds needed.

Second bug: getppid() == 1 in containers

  • Location: context.rs line 788 (confine_child)
  • Problem: if unsafe { libc::getppid() } == 1 { fail!("parent died before confinement"); } — in containers the entrypoint commonly runs as PID 1, so forked children legitimately have getppid() == 1, triggering a false positive
  • Effect: child exits silently before writing to the notif pipe → parent sees "read notif fd from child: pipe closed before 4 bytes read"
  • Fix: capture actual parent PID before fork, pass through ChildSpawnArgs, compare against that → fix: compare getppid() against actual parent pid, not hardcoded 1 #53

Both fixes needed

In containerized environments (k8s, PID 1 entrypoint), both PRs are required: PR#49 for the Tokio runtime + PR#53 for the getppid() container check. Neither alone suffices.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sandlock_spawn fails with ENOSYS (clone3) when called from a multi-threaded Python process (uvicorn/asyncio + Kubernetes RuntimeDefault seccomp)

2 participants