Problem
Current sandbox has no syscall filter (Seccomp: 0 in /proc/self/status inside the sandbox). Every syscall is reachable, including ones with known escape surface or limited legitimate use for a shell/dev workload.
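A quick way to confirm the current state from inside the sandbox is to read the `Seccomp:` field directly. The sketch below is only illustrative; the helper name `seccomp_mode` is made up for this issue. The field is 0 when seccomp is disabled, 1 in strict mode, and 2 when a filter is installed.

```python
def seccomp_mode(status_text: str) -> int:
    """Return the Seccomp mode found in the text of /proc/<pid>/status.

    0 = disabled, 1 = strict mode, 2 = filter mode.
    """
    for line in status_text.splitlines():
        # Match "Seccomp:" exactly; newer kernels also print "Seccomp_filters:".
        if line.startswith("Seccomp:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no Seccomp field found")


def current_seccomp_mode() -> int:
    """Read the mode of the calling process (Linux only)."""
    with open("/proc/self/status") as f:
        return seccomp_mode(f.read())
```

Run inside the sandbox, `current_seccomp_mode()` should return 0 today and 2 once a filter is loaded.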
Proposal
Opt-in [sandbox] seccomp = true (or a path to a custom BPF filter) that compiles a built-in blocklist (allow by default, deny the listed syscalls) and passes it via bwrap's --seccomp <fd>.
Starter blocklist:
- ptrace: sandbox inspection / cross-process tampering
- keyctl, add_key, request_key: kernel keyring access
- bpf: arbitrary eBPF loads
- perf_event_open: side-channel surface
- userfaultfd: exploitation primitive
- mount, umount2, pivot_root, unshare, setns, clone (with the CLONE_NEWUSER flag): re-nesting / namespace games
- create_module, delete_module, init_module, finit_module: kernel module ops (moot for unprivileged processes, but belt-and-suspenders)
- kexec_load, kexec_file_load: loading a replacement kernel
- reboot
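To make the blocklist concrete, here is a minimal sketch of the "raw BPF" route for x86_64 only, with no libseccomp dependency. It is one possible encoding, not a settled design. The syscall numbers are the x86_64 ones and are worth re-checking against the target kernel headers. Returning EPERM (rather than killing the process) and rejecting x32-numbered syscalls outright are assumptions made for this sketch.

```python
import errno
import struct

# Classic BPF opcodes (<linux/filter.h>) and seccomp return values (<linux/seccomp.h>).
LD_W_ABS = 0x20   # BPF_LD | BPF_W | BPF_ABS
JEQ_K = 0x15      # BPF_JMP | BPF_JEQ | BPF_K
JGE_K = 0x35      # BPF_JMP | BPF_JGE | BPF_K
JSET_K = 0x45     # BPF_JMP | BPF_JSET | BPF_K
RET_K = 0x06      # BPF_RET | BPF_K
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_ALLOW = 0x7FFF0000
AUDIT_ARCH_X86_64 = 0xC000003E
X32_SYSCALL_BIT = 0x40000000
NR_CLONE = 56
CLONE_NEWUSER = 0x10000000

# Starter blocklist, x86_64 syscall numbers.
BLOCKLIST_X86_64 = {
    "ptrace": 101, "keyctl": 250, "add_key": 248, "request_key": 249,
    "bpf": 321, "perf_event_open": 298, "userfaultfd": 323,
    "mount": 165, "umount2": 166, "pivot_root": 155, "unshare": 272, "setns": 308,
    "create_module": 174, "delete_module": 176, "init_module": 175, "finit_module": 313,
    "kexec_load": 246, "kexec_file_load": 320, "reboot": 169,
}


def build_filter(blocked=BLOCKLIST_X86_64, deny_errno=errno.EPERM) -> bytes:
    """Return a cBPF program as packed struct sock_filter entries (8 bytes each)."""
    DENY = "deny"  # symbolic jump target, patched to a relative offset below
    prog = [
        (LD_W_ABS, 0, 0, 4),                      # A = seccomp_data.arch
        (JEQ_K, 1, 0, AUDIT_ARCH_X86_64),         # native ABI: skip the kill
        (RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),  # foreign ABI, e.g. i386 via int 0x80
        (LD_W_ABS, 0, 0, 0),                      # A = seccomp_data.nr
        (JGE_K, DENY, 0, X32_SYSCALL_BIT),        # x32 numbers would bypass the table
    ]
    prog += [(JEQ_K, DENY, 0, nr) for nr in sorted(blocked.values())]
    prog += [
        (JEQ_K, 0, 2, NR_CLONE),                  # not clone: jump ahead to ALLOW
        (LD_W_ABS, 0, 0, 16),                     # A = low 32 bits of args[0] (clone flags)
        (JSET_K, DENY, 0, CLONE_NEWUSER),         # CLONE_NEWUSER set: deny
        (RET_K, 0, 0, SECCOMP_RET_ALLOW),
    ]
    deny_index = len(prog)
    prog.append((RET_K, 0, 0, SECCOMP_RET_ERRNO | deny_errno))

    out = b""
    for i, (code, jt, jf, k) in enumerate(prog):
        if jt == DENY:
            jt = deny_index - i - 1               # jumps are relative to the next insn
        out += struct.pack("<HBBI", code, jt, jf, k)
    return out
```

clone3 is not handled here: its flags live in a userspace struct that seccomp-bpf cannot read, so it would need its own policy decision.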
Implementation sketch
- Dependency: libseccomp bindings. Either python3-seccomp (Debian/Ubuntu) or emit a raw BPF program (more portable, more code).
- Simplest path: shell out to seccomp-tools or precompile a filter at install time; ship a .bpf file; pass via --seccomp with an fd.
- Validate the filter isn't overly aggressive — run the existing test suite inside the sandbox before/after.
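The hand-off to bwrap could look roughly like this. `--seccomp` takes a file descriptor holding the compiled cBPF program, so the filter bytes are written to an anonymous temp file and the fd is inherited by the child. The other bwrap options shown are placeholders for whatever the sandbox already passes, and the helper names are hypothetical.

```python
import subprocess
import tempfile


def bwrap_argv(seccomp_fd: int, command: list[str]) -> list[str]:
    """Build a bwrap command line that loads a seccomp filter from an inherited fd."""
    return [
        "bwrap",
        # Placeholder isolation options; the real sandbox has its own set.
        "--unshare-all",
        "--ro-bind", "/", "/",
        "--dev", "/dev",
        "--proc", "/proc",
        "--seccomp", str(seccomp_fd),
        *command,
    ]


def run_sandboxed(filter_bytes: bytes, command: list[str]) -> subprocess.CompletedProcess:
    """Run `command` under bwrap with the given compiled filter."""
    with tempfile.TemporaryFile() as f:
        f.write(filter_bytes)
        f.flush()
        f.seek(0)  # bwrap reads the program from the fd's current offset
        fd = f.fileno()
        # pass_fds keeps the fd open, under the same number, in the child.
        return subprocess.run(bwrap_argv(fd, command), pass_fds=(fd,))
```

As a smoke test before running the full suite, `run_sandboxed(filter_bytes, ["grep", "Seccomp", "/proc/self/status"])` should report `Seccomp: 2` once the filter is applied.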
Open questions
- Custom profiles? TOML list of syscalls to block? A named preset (minimal, strict)?
- Architecture handling — filter has to account for x86_64 + arm64 syscall number differences.
- What breaks? Blocking ptrace breaks strace and gdb; some debuggers may also use perf_event_open. Document the tradeoffs.
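On the architecture question, one option is to keep the blocklist as syscall names and resolve them per architecture when the filter is generated. The sketch below shows the idea for a few of the listed syscalls. The table layout and the `resolve` helper are assumptions, and the numbers should be checked against kernel headers. Note that create_module has no arm64 number, so a resolver has to tolerate gaps.

```python
import platform

# AUDIT_ARCH_* values the filter compares against seccomp_data.arch.
AUDIT_ARCH = {"x86_64": 0xC000003E, "aarch64": 0xC00000B7}

# Per-architecture syscall numbers for a subset of the starter blocklist.
SYSCALL_NR = {
    "ptrace": {"x86_64": 101, "aarch64": 117},
    "bpf": {"x86_64": 321, "aarch64": 280},
    "mount": {"x86_64": 165, "aarch64": 40},
    "create_module": {"x86_64": 174},  # no entry in the arm64 syscall table
}


def resolve(names, arch=None):
    """Return (audit_arch, sorted syscall numbers) for the given architecture.

    Syscalls that do not exist on `arch` are skipped; unknown names raise KeyError.
    """
    arch = arch or platform.machine()
    if arch not in AUDIT_ARCH:
        raise ValueError(f"unsupported architecture: {arch}")
    numbers = sorted(
        SYSCALL_NR[name][arch] for name in names if arch in SYSCALL_NR[name]
    )
    return AUDIT_ARCH[arch], numbers
```

A precompiled .bpf file would need one variant per architecture, whereas resolving at generation time only needs the table.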
Why
Defence in depth. Even with user-namespace + mount-namespace isolation, a seccomp filter meaningfully reduces the kernel attack surface reachable from inside.