ffi: fix sandlock_spawn failure under multi-threaded callers with restricted seccomp (#47) #49
congwang-mk wants to merge 3 commits into
Signed-off-by: Cong Wang <cwang@multikernel.io>
|
Verified on the target environment. Setup: k3s pod, Tested by running The fix works. Note: the |
|
EDIT: There are other side effects that prevent this from working end to end in my seccomp-restricted k8s environment. I'll continue investigating.
|
Follow-up: fix does not resolve the issue in our environment.

After further testing, the dev branch does not fix the problem on our k8s pod. Here is what we found:

What works:
What still fails:
Environment when failing:

The failure is specific to the uvicorn server process itself: not its thread count, not its seccomp state (same as exec processes), not its fd count (14 open fds, limit 1M). A subprocess spawned by that same process works fine.

The actual error from

We've restored the subprocess workaround for now. Happy to test further if you have ideas on what to instrument.
|
Root cause found.

By building from a fork with

The child process crashes before writing the notif fd number back to the parent via

The child's

This happens at startup (

The most likely culprit: the child inherits uvicorn's

Next diagnostic step: capture the child's stderr before
Verdict: PR#49 is correct and necessary, but insufficient on its own for our environment. We found a second, unrelated container-specific bug.

PR#49 works as intended

The

Second bug:
|
Summary
Fixes #47.
`Sandbox(policy).run([...])` was failing with `sandlock_spawn failed` when called from a multi-threaded Python host (uvicorn/asyncio) on a Kubernetes pod with the `RuntimeDefault` seccomp profile.

Root cause: every FFI entry point built a fresh `tokio::runtime::Runtime::new()`, which is `new_multi_thread()` and spawns worker threads eagerly via `pthread_create`. Kubernetes' `RuntimeDefault` blocks `clone3` with `ENOSYS`, and the multi-thread builder's eager-spawn path returned `Err` before glibc's fallback could help, surfacing to the caller as a NULL handle.

Fix: switch FFI entry points to a per-thread cached `current_thread` Tokio runtime, which spawns no threads at construction. Three runtime shapes:

- `sandlock_run` and the rest of the one-shot entry points: `current_thread`
- `sandlock_create` (live handle): `start` and `wait`; one persistent worker is unavoidable here
- `sandlock_create_for_run` (new): `current_thread`. `Sandbox.run()` is `start` then `wait` back-to-back, so suspension across the gap is fine and avoids the one worker thread `sandlock_create` would have spawned

`Sandbox.run()` is wired to `sandlock_create_for_run`. The seccomp supervisor's blocking `SECCOMP_IOCTL_NOTIF_RECV` thread is left as-is: it spawns through `pthread_create`, which the reporter's diagnostic confirms works in their environment (`"new_thread": "ok"`).

Test plan

- `cargo build --release` clean on dev
- `pytest python/tests/` 249/249 pass, including the new `TestSandlockRunCAbiMultiThreaded` and `TestSandboxRunMultiThreaded` regression tests
- `RuntimeDefault` seccomp (the only environment where the original failure was empirically observable)

Note: the new tests are not red-on-pristine on an unrestricted dev box because glibc's `clone3` → `clone(2)` fallback masks the failure mode locally. The docstring on `TestSandlockRunCAbiMultiThreaded` documents this honestly.
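
The description above cites a reporter diagnostic of the form `"new_thread": "ok"`, but the diagnostic script itself is not shown in this thread. For anyone who wants to run a comparable check inside a restricted pod, here is a minimal sketch of what such a probe might look like. The function name `probe_new_thread` and the JSON shape are assumptions made for illustration; only the `"new_thread"` key is taken from the PR text. On Linux, CPython starts threads through `pthread_create`, so a passing probe suggests that path is usable from the Python host process.

```python
import json
import threading


def probe_new_thread(timeout: float = 5.0) -> str:
    """Hypothetical probe: return "ok" if a new thread can be started and run."""
    ran = threading.Event()
    try:
        worker = threading.Thread(target=ran.set, name="probe")
        # CPython raises RuntimeError ("can't start new thread") when the
        # underlying OS thread cannot be created.
        worker.start()
    except RuntimeError as exc:
        return f"error: {exc}"
    worker.join(timeout)
    # The thread body sets the event; if it is unset, the thread never ran.
    return "ok" if ran.is_set() else "error: thread did not run"


if __name__ == "__main__":
    # Emit a one-line JSON report, mirroring the key quoted in the PR text.
    print(json.dumps({"new_thread": probe_new_thread()}))
```

Running this from inside the affected server process, rather than from a separate `exec` shell, matters here: the follow-up comments report that behaviour differs between the uvicorn process and subprocesses spawned from it. Note that this probe only exercises thread creation; it says nothing about whether Tokio's multi-thread builder or the sandboxed child would succeed.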