Merged
23 changes: 12 additions & 11 deletions README.md
@@ -295,22 +295,23 @@ positive int = deny with errno, `"audit"`/`-2` = allow + flag.
**Event fields:** `syscall`, `category` (file/network/process/memory),
`pid`, `parent_pid`, `host`, `port`, `argv`, `denied`.

> **TOCTOU NOTE ** Per `seccomp_unotify(2)`, the kernel
> **TOCTOU NOTE** Per `seccomp_unotify(2)`, the kernel
> re-reads user-memory pointers after `Continue`. Sandlock handles this
> in two places:
>
> - **Path strings are not exposed on events.** Path-based access control
> belongs in static Landlock rules (`fs_readable` / `fs_writable` /
> `fs_denied`) — kernel-enforced and TOCTOU-immune. Use
> `ctx.deny_path()` for runtime additions.
> - **`event.argv` is exposed and TOCTOU-safe.** Before returning
> `Continue` for an `execve`, the supervisor `PTRACE_SEIZE` +
> `PTRACE_INTERRUPT`s every sibling thread of the calling tid so the
> kernel's re-read happens with no other writer running. The pause
> has no observable cost: `execve`'s `de_thread` step kills sibling
> threads anyway. If the freeze cannot be established (e.g., YAMA
> blocks ptrace), the execve is denied with `EPERM` — the safety
> invariant is never silently relaxed.
> - **`event.argv` is exposed and TOCTOU-safe.** Before exposing
> `argv` to `policy_fn` or returning `Continue` for an
> `execve`, the supervisor freezes every task in `ProcessIndex`,
> including peer processes that may alias argv through shared memory.
> With `policy_fn` active, fork-like syscalls are traced for one
> ptrace creation event, so children are registered in `ProcessIndex`
> before they can run user code. If the freeze or creation tracking
> cannot be established (e.g., YAMA blocks ptrace), the syscall is
> denied with `EPERM`; the safety invariant is never silently relaxed.
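
The fail-closed ordering described in the note above (freeze first, read `argv` second, deny with `EPERM` if the freeze fails) can be sketched as follows. This is a minimal illustration under stated assumptions, not Sandlock's implementation: `Response`, `decide_exec`, and the three closures are hypothetical stand-ins, and the use of `EACCES` for a policy denial is an arbitrary choice for the example.

```rust
// Illustrative stand-ins only; none of these names are Sandlock's real API.

/// What a supervisor tells the kernel to do with a trapped syscall.
#[derive(Debug, PartialEq)]
enum Response {
    /// Let the kernel run the syscall (it re-reads argv from child memory).
    Continue,
    /// Fail the syscall with the given errno.
    Deny(i32),
}

const EPERM: i32 = 1; // Linux errno: operation not permitted
const EACCES: i32 = 13; // Linux errno: permission denied (assumed here for policy denials)

/// Decide an `execve` notification.
///
/// `freeze` models pausing every task that could write argv,
/// `read_argv` models copying argv out of child memory, and
/// `policy` models the user's policy callback.
fn decide_exec<F, R, P>(freeze: F, read_argv: R, policy: P) -> Response
where
    F: FnOnce() -> Result<(), String>,
    R: FnOnce() -> Vec<String>,
    P: FnOnce(&[String]) -> bool,
{
    // Fail closed: without the freeze, argv could change after it is read.
    if freeze().is_err() {
        return Response::Deny(EPERM);
    }
    // Only now is argv read and shown to the policy.
    let argv = read_argv();
    if policy(&argv) {
        Response::Continue
    } else {
        Response::Deny(EACCES)
    }
}
```

In this sketch `read_argv` is never called when the freeze fails, which mirrors the stated invariant that `argv` is not exposed unless every potential writer is stopped.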

**Context methods:**
- `ctx.restrict_network(ips)` / `ctx.grant_network(ips)` — network control
@@ -455,8 +456,8 @@ what a command would do before committing.
### COW Fork & Map-Reduce

Initialize expensive state once, then fork COW clones that share memory.
Each fork uses raw `fork(2)` (bypasses seccomp notification) for minimal
overhead. 1000 clones in ~530ms, ~1,900 forks/sec.
Each clone uses raw `fork(2)` with shared copy-on-write pages. 1000
clones in ~530ms, ~1,900 forks/sec.

Each clone's stdout is captured via its own pipe. `reduce()` reads all
pipes and feeds combined output to a reducer's stdin — fully pipe-based
18 changes: 18 additions & 0 deletions crates/sandlock-core/src/arch.rs
@@ -29,6 +29,17 @@ mod imp {
pub const SYS_IOPERM: Option<i64> = Some(libc::SYS_ioperm);
pub const SYS_IOPL: Option<i64> = Some(libc::SYS_iopl);
pub const SYS_TIME: Option<i64> = Some(libc::SYS_time);

/// Every syscall the kernel will dispatch through `handle_fork`.
/// Single source of truth for callers that enumerate fork-class
/// syscalls (BPF notif registration in `seccomp::dispatch`,
/// classification in `resource::is_process_creation_notif`).
pub const FORK_LIKE_SYSCALLS: &[i64] = &[
libc::SYS_clone,
libc::SYS_clone3,
libc::SYS_vfork,
libc::SYS_fork,
];
}

#[cfg(target_arch = "aarch64")]
@@ -60,6 +71,13 @@ mod imp {
pub const SYS_IOPERM: Option<i64> = None;
pub const SYS_IOPL: Option<i64> = None;
pub const SYS_TIME: Option<i64> = None;

/// Every syscall the kernel will dispatch through `handle_fork`.
/// aarch64 has no `fork`/`vfork` (glibc emulates via `clone`).
pub const FORK_LIKE_SYSCALLS: &[i64] = &[
libc::SYS_clone,
libc::SYS_clone3,
];
}

pub use imp::*;
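
A minimal sketch of how callers might consume a single `FORK_LIKE_SYSCALLS` list, so that classification and filter registration cannot drift apart. The syscall numbers are the x86_64 values, hardcoded so the sketch needs no `libc` crate; `is_fork_like` and `fork_like_notif_numbers` are hypothetical helpers written for illustration, not functions from this PR.

```rust
// x86_64 syscall numbers, hardcoded so the sketch is self-contained.
const SYS_CLONE: i64 = 56;
const SYS_FORK: i64 = 57;
const SYS_VFORK: i64 = 58;
const SYS_CLONE3: i64 = 435;

/// Same shape as the x86_64 constant in the diff above.
const FORK_LIKE_SYSCALLS: &[i64] = &[SYS_CLONE, SYS_CLONE3, SYS_VFORK, SYS_FORK];

/// Hypothetical classifier: does this syscall number create a process?
fn is_fork_like(nr: i64) -> bool {
    FORK_LIKE_SYSCALLS.contains(&nr)
}

/// Hypothetical registration helper: the same list, converted to the
/// `u32` form that a notification filter list would hold.
fn fork_like_notif_numbers() -> Vec<u32> {
    FORK_LIKE_SYSCALLS.iter().map(|&nr| nr as u32).collect()
}
```

Because both helpers read the same slice, adding a syscall to the list updates classification and registration together.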
43 changes: 41 additions & 2 deletions crates/sandlock-core/src/context.rs
@@ -240,6 +240,15 @@ pub fn notif_syscalls(policy: &Policy) -> Vec<u32> {
libc::SYS_waitid as u32,
];
arch::push_optional_syscall(&mut nrs, arch::SYS_VFORK);
// Bare fork(2) carries none of the namespace/process-limit risk of
// clone/clone3 and was historically left out of the BPF filter so
// hot fork-loops (COW map-reduce) bypass the supervisor entirely.
// It only needs interception when policy_fn is active, so the
// supervisor can register the new child via ptrace fork events
// before it can run user code (argv-safety invariant).
if policy.policy_fn.is_some() {
arch::push_optional_syscall(&mut nrs, arch::SYS_FORK);
}

if policy.max_memory.is_some() {
nrs.push(libc::SYS_mmap as u32);
@@ -949,9 +958,21 @@ pub(crate) fn confine_child(args: ChildSpawnArgs<'_>) -> ! {
let mut notif = notif_syscalls(policy);
if !extra_syscalls.is_empty() {
notif.extend_from_slice(extra_syscalls);
notif.sort_unstable();
notif.dedup();
}
// Argv-safety gate (companion to the policy_fn case in
// notif_syscalls): an extra handler bound to execve/execveat
// can call `read_child_mem` to inspect argv, so the supervisor
// must register newly forked children before they can run user
// code — same invariant policy_fn relies on. Bare fork(2)
// therefore needs to be intercepted here too.
let exec_extra = extra_syscalls.iter().any(|&n| {
n == libc::SYS_execve as u32 || n == libc::SYS_execveat as u32
});
if exec_extra {
arch::push_optional_syscall(&mut notif, arch::SYS_FORK);
}
notif.sort_unstable();
notif.dedup();
let filter = match bpf::assemble_filter(&notif, &deny, &args) {
Ok(f) => f,
Err(e) => fail!(format!("seccomp assemble: {}", e)),
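
The two gates in this file (fork is intercepted when `policy_fn` is set, or when an extra handler is bound to `execve`/`execveat`) and the reason `sort_unstable` + `dedup` now run after the final push can be sketched in isolation. This is an illustrative reduction, not the PR's code: `build_notif_list` is a hypothetical function, and the syscall numbers are hardcoded x86_64 values.

```rust
// x86_64 syscall numbers, hardcoded so the sketch is self-contained.
const SYS_FORK: u32 = 57;
const SYS_EXECVE: u32 = 59;
const SYS_EXECVEAT: u32 = 322;

/// Hypothetical reduction of the notification-list assembly.
/// `base` stands for the always-intercepted syscalls, `extra` for
/// syscalls bound to additional handlers.
fn build_notif_list(base: &[u32], extra: &[u32], policy_fn_active: bool) -> Vec<u32> {
    let mut notif = base.to_vec();

    // Gate 1: a policy callback can see argv, so new children must be
    // registered before they run user code.
    if policy_fn_active {
        notif.push(SYS_FORK);
    }

    notif.extend_from_slice(extra);

    // Gate 2: an extra handler on execve/execveat can inspect argv too.
    let exec_extra = extra
        .iter()
        .any(|&n| n == SYS_EXECVE || n == SYS_EXECVEAT);
    if exec_extra {
        notif.push(SYS_FORK);
    }

    // Sorting and deduplicating last covers every push above, including
    // the case where both gates (or `extra` itself) add fork.
    notif.sort_unstable();
    notif.dedup();
    notif
}
```

With neither gate active, fork stays out of the list, which matches the stated goal of keeping hot fork loops off the supervisor path by default.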
@@ -1093,6 +1114,24 @@ mod tests {
if let Some(vfork) = arch::SYS_VFORK {
assert!(nrs.contains(&(vfork as u32)));
}
// Bare fork(2) is intercepted only when policy_fn is active —
// see notif_syscalls. The default policy has no policy_fn, so
// fork stays out of the BPF filter and hot fork-loops keep
// bypassing the supervisor.
if let Some(fork) = arch::SYS_FORK {
assert!(!nrs.contains(&(fork as u32)));
}
}

#[test]
fn test_notif_syscalls_fork_gated_on_policy_fn() {
let Some(fork) = arch::SYS_FORK else { return };
let policy = Policy::builder()
.policy_fn(|_event, _ctx| crate::policy_fn::Verdict::Allow)
.build()
.unwrap();
let nrs = notif_syscalls(&policy);
assert!(nrs.contains(&(fork as u32)));
}

#[test]
8 changes: 4 additions & 4 deletions crates/sandlock-core/src/fork.rs
@@ -5,17 +5,17 @@
//! COW clones that share memory pages with the template. Each clone
//! receives `CLONE_ID=0..N-1` and execs `work_cmd`.
//!
//! Uses raw `fork()` syscall (NR 57 on x86_64) to bypass seccomp
//! notification — the BPF filter only intercepts `clone`/`clone3`.
//! Uses raw `fork()` syscall (NR 57 on x86_64). The supervisor
//! intercepts fork-like syscalls for process accounting and, when
//! `policy_fn` is active, child registration before user code runs.

use std::os::unix::io::RawFd;

// ============================================================
// Raw fork (bypasses seccomp clone interception)
// Raw fork
// ============================================================

/// Raw fork() syscall — NR 57 on x86_64.
/// Unlike clone/clone3, this is NOT intercepted by the seccomp notif filter.
fn raw_fork() -> std::io::Result<i32> {
#[cfg(target_arch = "x86_64")]
const NR_FORK: i64 = 57;
16 changes: 11 additions & 5 deletions crates/sandlock-core/src/policy_fn.rs
@@ -63,9 +63,12 @@ pub enum SyscallCategory {
/// (`fs_read` / `fs_write` / `fs_deny`); see issue #27.
///
/// `argv` *is* exposed for `execve`/`execveat` and is TOCTOU-safe by
/// construction: before the supervisor returns `Continue` for an
/// execve, it `PTRACE_SEIZE`+`PTRACE_INTERRUPT`s every task in the
/// sandbox — both sibling threads of the calling tid (same TGID, share
/// construction: with `policy_fn` active, fork-like syscalls are traced
/// for one ptrace creation event, so children are registered in
/// `ProcessIndex` before they can run user code. Before the supervisor
/// exposes `argv` to `policy_fn` or returns `Continue` for an execve, it
/// then `PTRACE_SEIZE`+`PTRACE_INTERRUPT`s every task that could write
/// the memory — both sibling threads of the calling tid (same TGID, share
/// `mm_struct`) and peer threads in other TGIDs that may alias argv
/// pages via `MAP_SHARED` mappings or share `mm_struct` via
/// `clone(CLONE_VM)`. The kernel's post-Continue re-read therefore
@@ -94,8 +97,11 @@ pub struct SyscallEvent {
/// Size argument (for mmap, brk).
pub size: Option<u64>,
/// Command arguments for execve/execveat. TOCTOU-safe: every task
/// in the sandbox (caller's siblings and peer processes) is frozen
/// before the kernel re-reads argv from child memory.
/// in `ProcessIndex` (caller's siblings and peer processes) is
/// frozen before argv is read for this event and before the kernel
/// re-reads argv from child memory; fork-like syscalls register
/// children before they can run user code while `policy_fn` is
/// active.
pub argv: Option<Vec<String>>,
/// Whether the supervisor denied this syscall.
pub denied: bool,