setns: support user namespaces by shayonj · Pull Request #13323 · google/gvisor

shayonj · 2026-05-30T12:46:48Z

User namespace entries under /proc/[pid]/ns currently render as fake
namespace symlinks. They look like the other namespace files, but opening
them does not produce an nsfs file that setns(2) can use. Rootless
container tools such as buildah and podman rely on that file when they
re-enter the pause process user namespace, so the second lifecycle command
fails with EINVAL.

Make UserNamespace implement vfs.Namespace and give each user namespace
an nsfs inode when it is created. /proc/[pid]/ns/user now uses the
regular namespace symlink path, so opening it returns a joinable namespace
file instead of a fake link target.

Setns now accepts CLONE_NEWUSER from both nsfds and pidfds. It
follows the Linux restrictions for user namespace joins by rejecting the
caller's current user namespace, requiring CAP_SYS_ADMIN in the target
user namespace, rejecting multithreaded callers, and rejecting callers with
fs state shared outside the thread group. The capability checks for any
other namespaces in the same setns call use the credentials the caller
would have after joining the user namespace.
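The restrictions listed above can be sketched as a single pure check. This is a minimal model, not the gVisor implementation: `userNS`, `caller`, `checkUserNSJoin`, and the error values are hypothetical names introduced for illustration. The errno mapping follows setns(2), and the ordering of the checks is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the errno values a real setns would return.
var (
	errEINVAL = errors.New("EINVAL")
	errEPERM  = errors.New("EPERM")
)

// userNS is a hypothetical stand-in for a user namespace.
type userNS struct{ name string }

// caller is a hypothetical snapshot of the caller state the checks consult.
type caller struct {
	currentNS         *userNS          // user namespace the caller is in now
	threads           int              // number of tasks in the thread group
	fsSharedOutsideTG bool             // fs state shared outside the thread group
	sysAdminIn        map[*userNS]bool // namespaces where CAP_SYS_ADMIN is held
}

// checkUserNSJoin applies the four restrictions from the description above.
// The order of the cases is an assumption of this sketch.
func checkUserNSJoin(c caller, target *userNS) error {
	switch {
	case target == c.currentNS:
		return errEINVAL // cannot rejoin the current user namespace
	case c.threads > 1:
		return errEINVAL // multithreaded callers are rejected
	case c.fsSharedOutsideTG:
		return errEINVAL // fs state must not be shared outside the thread group
	case !c.sysAdminIn[target]:
		return errEPERM // CAP_SYS_ADMIN is required in the target namespace
	}
	return nil
}

func main() {
	root := &userNS{name: "root"}
	child := &userNS{name: "child"}
	c := caller{currentNS: root, threads: 1, sysAdminIn: map[*userNS]bool{child: true}}
	fmt.Println("join child:", checkUserNSJoin(c, child))
	fmt.Println("join current:", checkUserNSJoin(c, root))
}
```

Keeping the checks in one function with no side effects makes each rejection reason easy to test in isolation.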

Add a syscall regression test that creates a child user namespace, opens
/proc/<pid>/ns/user, and verifies that setns(CLONE_NEWUSER) succeeds.

Fixes #13314

anthops · 2026-06-02T02:37:17Z

You're a legend! Just wanted to say that I tested this out and it works perfectly :)

milantracy · 2026-06-02T19:09:53Z

thanks for the patch, LGTM

shayonj · 2026-06-02T19:13:40Z

Thanks for the quick review @milantracy, appreciate it.

shailend-g · 2026-06-02T18:49:40Z

 }

+TEST(SetnsTest, ChangeUserNamespace) {
+  SKIP_IF(!ASSERT_NO_ERRNO_AND_VALUE(HaveCapability(CAP_SYS_ADMIN)));


Why do we need CAP_SYS_ADMIN for this test?

Good point. I was thinking of the rootless userns model (something I am running internally), where the caller does not need host CAP_SYS_ADMIN and instead uses capabilities scoped to a namespace it owns. I also copied the skip shape from the existing non-userns tests.

For CLONE_NEWUSER, Linux requires CAP_SYS_ADMIN in the target user namespace. Since this test’s parent creates the child user namespace, the parent should already be capable in that target namespace under the userns owner rule. I’ll switch this to CanCreateUserNamespace so the test matches that model.
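The owner rule mentioned in the reply can be sketched as a walk up the namespace hierarchy, modeled on the rule described in user_namespaces(7). This is an illustrative model only: `userNS`, `creds`, and `hasCapIn` are hypothetical names, UIDs are simplified to plain integers, and none of this is gVisor's actual code.

```go
package main

import "fmt"

// userNS is a hypothetical user namespace node.
type userNS struct {
	parent   *userNS
	ownerUID int // effective UID of the creator, as seen in the parent namespace
}

// creds is a hypothetical, simplified credential set.
type creds struct {
	ns   *userNS         // user namespace the process resides in
	euid int             // effective UID in that namespace
	caps map[string]bool // capabilities held in c.ns
}

// hasCapIn reports whether c holds capName in target. A process holds a
// capability in target if it holds it in an ancestor of target (or target
// itself), or if it resides in the parent of a namespace on that path and
// its effective UID matches that namespace's owner.
func hasCapIn(c creds, capName string, target *userNS) bool {
	for ns := target; ns != nil; ns = ns.parent {
		if ns == c.ns {
			return c.caps[capName]
		}
		if ns.parent == c.ns && ns.ownerUID == c.euid {
			return true // owner rule: the creator is fully capable in the child
		}
	}
	return false
}

func main() {
	root := &userNS{}
	child := &userNS{parent: root, ownerUID: 1000}
	owner := creds{ns: root, euid: 1000} // no capabilities in root
	other := creds{ns: root, euid: 1001}
	fmt.Println("owner capable in child:", hasCapIn(owner, "CAP_SYS_ADMIN", child))
	fmt.Println("other capable in child:", hasCapIn(other, "CAP_SYS_ADMIN", child))
}
```

Under this model an unprivileged parent that owns the child namespace passes the CAP_SYS_ADMIN check without holding any capability in its own namespace, which is why a skip keyed on host CAP_SYS_ADMIN would be stricter than the test needs.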

shailend-g · 2026-06-02T18:50:39Z

+    _exit(0);
+  }
+  Cleanup cleanup([child] {
+    kill(child, SIGKILL);


Why send SIGCONT and SIGKILL both? Does the SIGCONT after a SIGKILL matter in this case?

No, I don’t think the SIGCONT matters here. I think I was worried about the child process being blocked in pause. I’ll remove the SIGCONT.

shailend-g · 2026-06-02T19:30:16Z

+			return linuxerr.EINVAL
+		}
+		t.tg.signalHandlers.mu.Unlock()
+		fsContext := t.FSContext()


At this point we already know that this threadgroup is single-threaded, so do we need to call checkAndPreventSharingOutsideTG()+allowSharing()? The lone task cannot cause the fsContext to become shared from under us: it is waiting for the setns() it already issued.

It looks like we only need to check if it's already shared, without toggling fs.preventSharing, maybe by writing a non-locking version of checkAndPreventSharingOutsideTG(): an isSharedOutsideTG().

Please let me know if there is a reason to lock down fs.preventSharing if I'm missing it.

Hm, yeah. I used checkAndPreventSharingOutsideTG because it already captured the rule I wanted, which is rejecting a user ns join when fs state is shared outside the thread group. But after the single-threaded check succeeds, I agree the prevent-sharing part looks unnecessary. I’ll replace this with a simpler shared-fs check.

I don't think you are missing anything, good callout
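The read-only check proposed in this thread could look roughly like the sketch below. The `fsContext` type and its fields are hypothetical and do not reflect gVisor's actual FSContext. The assumption is that the context tracks how many tasks share it, and that the caller passes in how many of those tasks belong to its own thread group.

```go
package main

import (
	"fmt"
	"sync"
)

// fsContext is a hypothetical model of shared filesystem state.
type fsContext struct {
	mu             sync.Mutex
	users          int  // number of tasks currently sharing this context
	preventSharing bool // set by the stricter check-and-prevent variant
}

// isSharedOutsideTG reports whether any task outside the caller's thread
// group shares this context. tgUsers is the number of tasks in the caller's
// thread group that use the context. Unlike a check-and-prevent variant, it
// only reads state and never toggles preventSharing.
func (f *fsContext) isSharedOutsideTG(tgUsers int) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.users > tgUsers
}

func main() {
	private := &fsContext{users: 1}
	shared := &fsContext{users: 2}
	fmt.Println("private shared outside TG:", private.isSharedOutsideTG(1))
	fmt.Println("shared shared outside TG:", shared.isSharedOutsideTG(1))
}
```

Because the caller is already known to be single-threaded and is blocked in its own setns call, a pure read like this is enough in the scenario discussed: nothing can start sharing the context between the check and the join.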

shailend-g · 2026-06-02T19:46:39Z


+	if flags&linux.CLONE_NEWUSER != 0 {
+		if target.ExitState() >= TaskExitInitiated {
+			return linuxerr.ESRCH


Curious about this exit state check that other namespaces lack. What does it achieve?

The reason for this check was that the user ns comes from credentials, which survive longer than the task’s other namespace fields. Without an explicit exit check, pidfd setns(CLONE_NEWUSER) could still see credentials on an exiting task where other namespace joins would fail with ESRCH. That said, I agree this should be part of the same target-task snapshot instead of a separate early read.

Let me know if I misunderstood the context.

shayonj · 2026-06-02T22:57:00Z

Thanks for the feedback, I addressed it.

shailend-g · 2026-06-03T22:31:52Z

LGTM, thanks for implementing this.

User namespace entries under `/proc/[pid]/ns` currently render as fake namespace symlinks. They look like the other namespace files, but opening them does not produce an `nsfs` file that `setns(2)` can use. Rootless container tools such as `buildah` and `podman` rely on that file when they re-enter the pause process user namespace, so the second lifecycle command fails with `EINVAL`. Make `UserNamespace` implement `vfs.Namespace` and give each user namespace an `nsfs` inode when it is created. `/proc/[pid]/ns/user` now uses the regular namespace symlink path, so opening it returns a joinable namespace file instead of a fake link target. `Setns` now accepts `CLONE_NEWUSER` from both `nsfd`s and `pidfd`s. It follows the Linux restrictions for user namespace joins by rejecting the caller's current user namespace, requiring `CAP_SYS_ADMIN` in the target user namespace, rejecting multithreaded callers, and rejecting callers with `fs` state shared outside the thread group. The capability checks for any other namespaces in the same `setns` call use the credentials the caller would have after joining the user namespace. Add a syscall regression test that creates a child user namespace, opens `/proc/<pid>/ns/user`, and verifies that `setns(CLONE_NEWUSER)` succeeds. Fixes #13314 COPYBARA_INTEGRATE_REVIEW=#13323 from shayonj:issue-13314-userns-setns 8060b5f PiperOrigin-RevId: 926855389

ayushr2 · 2026-06-04T20:52:53Z

This was merged as 3f949a4

shayonj mentioned this pull request May 30, 2026: /proc/[pid]/ns/user is not usable with setns #13314 (Closed)

ayushr2 requested a review from shailend-g May 31, 2026 00:11

milantracy approved these changes Jun 2, 2026

milantracy added the ready to pull label Jun 2, 2026

copybara-service Bot mentioned this pull request Jun 2, 2026: setns: support user namespaces #13346 (Open)

shailend-g reviewed Jun 2, 2026

shayonj force-pushed the issue-13314-userns-setns branch from 1c8a5b3 to dc80b71 June 2, 2026 22:49

shailend-g reviewed Jun 3, 2026
Comment thread test/syscalls/linux/setns.cc
Comment thread pkg/sentry/kernel/task_clone.go

shayonj force-pushed the issue-13314-userns-setns branch 2 times, most recently from 33f8d13 to ade5562 June 3, 2026 16:30

shailend-g reviewed Jun 3, 2026
Comment thread pkg/sentry/kernel/task_clone.go
Comment thread pkg/sentry/kernel/auth/user_namespace.go
Comment thread pkg/sentry/kernel/auth/user_namespace.go (Outdated)
Comment thread pkg/sentry/kernel/auth/user_namespace.go

shayonj force-pushed the issue-13314-userns-setns branch from ade5562 to 8060b5f June 3, 2026 20:57

ayushr2 reviewed Jun 4, 2026
Comment thread test/syscalls/linux/setns.cc

copybara-service Bot mentioned this pull request Jun 4, 2026: setns: support user namespaces #13359 (Merged)

shayonj force-pushed the issue-13314-userns-setns branch from 8060b5f to a130ef2 June 4, 2026 20:25

ayushr2 closed this Jun 4, 2026
