setns: support user namespaces #13323

Closed
shayonj wants to merge 1 commit into google:master from shayonj:issue-13314-userns-setns

shayonj (Contributor) commented May 30, 2026

User namespace entries under /proc/[pid]/ns currently render as fake
namespace symlinks. They look like the other namespace files, but opening
them does not produce an nsfs file that setns(2) can use. Rootless
container tools such as buildah and podman rely on that file when they
re-enter the pause process user namespace, so the second lifecycle command
fails with EINVAL.

Make UserNamespace implement vfs.Namespace and give each user namespace
an nsfs inode when it is created. /proc/[pid]/ns/user now uses the
regular namespace symlink path, so opening it returns a joinable namespace
file instead of a fake link target.

Setns now accepts CLONE_NEWUSER from both nsfds and pidfds. It
follows the Linux restrictions for user namespace joins by rejecting the
caller's current user namespace, requiring CAP_SYS_ADMIN in the target
user namespace, rejecting multithreaded callers, and rejecting callers with
fs state shared outside the thread group. The capability checks for any
other namespaces in the same setns call use the credentials the caller
would have after joining the user namespace.
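
The join restrictions listed above can be sketched as a small validation function. This is only an illustration, not the code from this patch: the `userNS` and `caller` types, their field names, and the order of the checks are all hypothetical, and the error values simply mirror the `EINVAL`/`EPERM` results that the setns(2) man page documents for these cases.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the errno values returned to the caller.
var (
	errEINVAL = errors.New("EINVAL")
	errEPERM  = errors.New("EPERM")
)

// userNS is a hypothetical placeholder for a user namespace.
type userNS struct{ name string }

// caller is a hypothetical, simplified view of the task calling setns.
type caller struct {
	current         *userNS          // user namespace the caller is in now
	adminIn         map[*userNS]bool // namespaces where it holds CAP_SYS_ADMIN
	threads         int              // tasks in the caller's thread group
	fsSharedOutside bool             // fs state shared outside the thread group
}

// checkUserNSJoin applies the four restrictions from the description.
// The order of the checks is an assumption made for this sketch.
func checkUserNSJoin(c caller, target *userNS) error {
	switch {
	case target == c.current:
		return errEINVAL // already a member of the target namespace
	case c.threads > 1:
		return errEINVAL // multithreaded caller
	case c.fsSharedOutside:
		return errEINVAL // fs state shared outside the thread group
	case !c.adminIn[target]:
		return errEPERM // no CAP_SYS_ADMIN in the target namespace
	}
	return nil
}

func main() {
	parent, child := &userNS{"parent"}, &userNS{"child"}
	c := caller{
		current: parent,
		adminIn: map[*userNS]bool{child: true},
		threads: 1,
	}
	fmt.Println(checkUserNSJoin(c, child))  // <nil>
	fmt.Println(checkUserNSJoin(c, parent)) // EINVAL
}
```

Modeling the capability as a per-namespace lookup keeps the sketch short; a real kernel derives it from the caller's credentials and the namespace hierarchy.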

Add a syscall regression test that creates a child user namespace, opens
/proc/<pid>/ns/user, and verifies that setns(CLONE_NEWUSER) succeeds.

Fixes #13314

anthops commented Jun 2, 2026

You're a legend! Just wanted to say that I tested this out and it works perfectly :)

milantracy (Contributor) commented

thanks for the patch, LGTM

shayonj (Contributor Author) commented Jun 2, 2026

Thanks for the quick review @milantracy, appreciate it.

Comment thread test/syscalls/linux/setns.cc Outdated
}

TEST(SetnsTest, ChangeUserNamespace) {
  SKIP_IF(!ASSERT_NO_ERRNO_AND_VALUE(HaveCapability(CAP_SYS_ADMIN)));
Contributor

Why do we need CAP_SYS_ADMIN for this test?

Contributor Author

Good point. I was thinking of the rootless userns model (something I am running internally), where the caller does not need host CAP_SYS_ADMIN and instead uses capabilities scoped to a namespace it owns, and I copied the skip shape from the existing non-userns tests.

For CLONE_NEWUSER, Linux requires CAP_SYS_ADMIN in the target user namespace. Since this test’s parent creates the child user namespace, I think it makes sense that the parent is capable in that target namespace under the userns owner rule. I’ll switch this to CanCreateUserNamespace so the test matches that model.
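
The owner rule referred to in this reply can be illustrated with a toy model. The sketch below is an assumption-laden simplification of the rules in user_namespaces(7), not gVisor or Linux code: `userNS`, `creds`, and `hasSysAdminIn` are hypothetical names, and only CAP_SYS_ADMIN is modeled.

```go
package main

import "fmt"

// userNS is a hypothetical user namespace: a parent link plus the
// effective UID of the task that created it (its owner).
type userNS struct {
	parent *userNS
	owner  uint32
}

// creds is a hypothetical, simplified credential set.
type creds struct {
	ns       *userNS // user namespace the task is a member of
	euid     uint32  // effective UID, as seen in ns
	sysAdmin bool    // CAP_SYS_ADMIN in the effective set, valid in ns
}

// hasSysAdminIn reports whether c holds CAP_SYS_ADMIN in target by
// walking from target up through its ancestors.
func hasSysAdminIn(c creds, target *userNS) bool {
	for ns := target; ns != nil; ns = ns.parent {
		if ns == c.ns {
			// The task is a member of target or of one of its
			// ancestors, so its own effective set decides.
			return c.sysAdmin
		}
		if ns.parent == c.ns && ns.owner == c.euid {
			// Owner rule: the task sits in the parent namespace and
			// its effective UID matches the namespace owner.
			return true
		}
	}
	return false
}

func main() {
	root := &userNS{}
	child := &userNS{parent: root, owner: 1000}

	creator := creds{ns: root, euid: 1000} // no capabilities in root
	other := creds{ns: root, euid: 2000}

	fmt.Println(hasSysAdminIn(creator, child)) // true
	fmt.Println(hasSysAdminIn(creator, root))  // false
	fmt.Println(hasSysAdminIn(other, child))   // false
}
```

Under this model the test's parent process needs no CAP_SYS_ADMIN in its own namespace: creating the child namespace is what makes it capable there.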

Comment thread test/syscalls/linux/setns.cc Outdated
    _exit(0);
  }
  Cleanup cleanup([child] {
    kill(child, SIGKILL);
Contributor

Why send SIGCONT and SIGKILL both? Does the SIGCONT after a SIGKILL matter in this case?

Contributor Author

No, I don’t think the SIGCONT matters here. I think I was worried about the child process being blocked in pause. I’ll remove the SIGCONT.

Comment thread pkg/sentry/kernel/task_clone.go Outdated
		return linuxerr.EINVAL
	}
	t.tg.signalHandlers.mu.Unlock()
	fsContext := t.FSContext()
Contributor

At this point we already know that this threadgroup is single threaded, so do we need to call checkAndPreventSharingOutsideTG()+allowSharing()? The lone task cannot cause the fsContext to become shared from under us: it is waiting for the setns() it already issued.

It looks like we only need to check if it's already shared, without toggling fs.preventSharing, maybe by way of writing a non-locking version of checkAndPreventSharingOutsideTG(): an isSharedOutsideTG().

Please let me know if there is a reason to lock down fs.preventSharing if I'm missing it.
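
A read-only check of the kind suggested here might look like the following sketch. It is not the gVisor implementation: `fsContext`, `task`, `threadGroup`, and the `refs` counter are hypothetical stand-ins, and the sketch assumes that every task sharing the fs state holds exactly one reference to it.

```go
package main

import "fmt"

// fsContext is a hypothetical stand-in for a task's filesystem state
// (root and working directory). refs counts every task that shares it.
type fsContext struct {
	refs int
}

// task is a hypothetical, simplified task that points at its fs state.
type task struct {
	fs *fsContext
}

// threadGroup is a hypothetical list of the tasks in one thread group.
type threadGroup struct {
	tasks []*task
}

// isSharedOutsideTG reports whether fs is referenced by any task that is
// not a member of tg. It only reads state; it does not set a
// prevent-sharing flag the way a check-and-lock variant would.
func isSharedOutsideTG(fs *fsContext, tg *threadGroup) bool {
	inGroup := 0
	for _, t := range tg.tasks {
		if t.fs == fs {
			inGroup++
		}
	}
	return fs.refs > inGroup
}

func main() {
	private := &fsContext{refs: 1}
	shared := &fsContext{refs: 2} // also held by a task in another thread group

	fmt.Println(isSharedOutsideTG(private, &threadGroup{tasks: []*task{{fs: private}}})) // false
	fmt.Println(isSharedOutsideTG(shared, &threadGroup{tasks: []*task{{fs: shared}}}))   // true
}
```

Because the caller is already known to be single threaded, nothing in its own thread group can add a sharer while the check runs, which is the argument for dropping the lock-down step.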

Contributor Author

Hm, yeah. I used checkAndPreventSharingOutsideTG because it already captured the rule I wanted, which is rejecting a user ns join when fs state is shared outside the thread group. But after the single-threaded check succeeds, I agree the prevent-sharing part looks unnecessary. I’ll replace this with a simpler shared-fs check.

I don't think you are missing anything, good callout

Comment thread pkg/sentry/kernel/task_clone.go Outdated

if flags&linux.CLONE_NEWUSER != 0 {
	if target.ExitState() >= TaskExitInitiated {
		return linuxerr.ESRCH
Contributor

Curious about this exit state check that other namespaces lack. What does it achieve?

Contributor Author

The reason for this check was that the user ns comes from credentials, which survive longer than the task’s other namespace fields. Without an explicit exit check, pidfd setns(CLONE_NEWUSER) could still see credentials on an exiting task where other namespace joins would fail with ESRCH. That said, I agree this should be part of the same target-task snapshot instead of a separate early read.

Let me know if I misunderstood the context.

shayonj force-pushed the issue-13314-userns-setns branch from 1c8a5b3 to dc80b71 on June 2, 2026 22:49
shayonj (Contributor Author) commented Jun 2, 2026

Thanks for the feedback, I addressed it.

shayonj force-pushed the issue-13314-userns-setns branch 2 times, most recently from 33f8d13 to ade5562 on June 3, 2026 16:30
shayonj force-pushed the issue-13314-userns-setns branch from ade5562 to 8060b5f on June 3, 2026 20:57
shailend-g (Contributor) commented

LGTM, thanks for implementing this.

copybara-service Bot pushed a commit that referenced this pull request Jun 3, 2026
User namespace entries under `/proc/[pid]/ns` currently render as fake
namespace symlinks. They look like the other namespace files, but opening
them does not produce an `nsfs` file that `setns(2)` can use. Rootless
container tools such as `buildah` and `podman` rely on that file when they
re-enter the pause process user namespace, so the second lifecycle command
fails with `EINVAL`.

Make `UserNamespace` implement `vfs.Namespace` and give each user namespace
an `nsfs` inode when it is created. `/proc/[pid]/ns/user` now uses the
regular namespace symlink path, so opening it returns a joinable namespace
file instead of a fake link target.

`Setns` now accepts `CLONE_NEWUSER` from both `nsfd`s and `pidfd`s. It
follows the Linux restrictions for user namespace joins by rejecting the
caller's current user namespace, requiring `CAP_SYS_ADMIN` in the target
user namespace, rejecting multithreaded callers, and rejecting callers with
`fs` state shared outside the thread group. The capability checks for any
other namespaces in the same `setns` call use the credentials the caller
would have after joining the user namespace.

Add a syscall regression test that creates a child user namespace, opens
`/proc/<pid>/ns/user`, and verifies that `setns(CLONE_NEWUSER)` succeeds.

Fixes #13314

FUTURE_COPYBARA_INTEGRATE_REVIEW=#13323 from shayonj:issue-13314-userns-setns 8060b5f
PiperOrigin-RevId: 925507008
copybara-service Bot pushed a commit that referenced this pull request Jun 4, 2026
FUTURE_COPYBARA_INTEGRATE_REVIEW=#13323 from shayonj:issue-13314-userns-setns 8060b5f
PiperOrigin-RevId: 925507008
copybara-service Bot pushed a commit that referenced this pull request Jun 4, 2026
FUTURE_COPYBARA_INTEGRATE_REVIEW=#13323 from shayonj:issue-13314-userns-setns 8060b5f
PiperOrigin-RevId: 926411968
shayonj force-pushed the issue-13314-userns-setns branch from 8060b5f to a130ef2 on June 4, 2026 20:25
copybara-service Bot pushed a commit that referenced this pull request Jun 4, 2026
COPYBARA_INTEGRATE_REVIEW=#13323 from shayonj:issue-13314-userns-setns 8060b5f
PiperOrigin-RevId: 926855389
ayushr2 (Collaborator) commented Jun 4, 2026

This was merged as 3f949a4

ayushr2 closed this Jun 4, 2026

Development

Successfully merging this pull request may close these issues.

/proc/[pid]/ns/user is not usable with setns

5 participants