setns: support user namespaces #13323
You're a legend! Just wanted to say that I tested this out and it works perfectly :)
thanks for the patch, LGTM
Thanks for the quick review @milantracy, appreciate it
}

TEST(SetnsTest, ChangeUserNamespace) {
  SKIP_IF(!ASSERT_NO_ERRNO_AND_VALUE(HaveCapability(CAP_SYS_ADMIN)));
Why do we need CAP_SYS_ADMIN for this test?
Good point. I was thinking of the rootless userns model (something I am running internally), where the caller does not need host CAP_SYS_ADMIN and instead uses capabilities scoped to a namespace it owns, and I copied the skip shape from the existing non-userns tests.
For CLONE_NEWUSER, Linux requires CAP_SYS_ADMIN in the target user namespace. Since this test's parent creates the child user namespace, I think the parent is already capable in that target namespace by the userns owner rule. I'll switch this to CanCreateUserNamespace so the test matches that model.
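The owner rule mentioned in this reply can be sketched as a small model. This is a simplified illustration of the rule as described in user_namespaces(7), not gVisor's or Linux's actual code: `UserNS`, `Creds`, and `HasCapIn` are hypothetical names, UIDs are compared as plain numbers rather than mapped kernel UIDs, and capabilities are plain strings.

```go
package main

import "fmt"

// UserNS is a hypothetical, simplified user namespace. It records only its
// parent and the effective UID of the task that created it.
type UserNS struct {
	Parent   *UserNS
	OwnerUID uint32
}

// Creds is a hypothetical credential set: the namespace the task lives in,
// its effective UID, and the capabilities it holds in that namespace.
type Creds struct {
	NS        *UserNS
	EUID      uint32
	Effective map[string]bool
}

// HasCapIn reports whether c holds the named capability in target.
// It walks from target toward the root namespace and stops at the first
// rule that applies.
func HasCapIn(c Creds, name string, target *UserNS) bool {
	for ns := target; ns != nil; ns = ns.Parent {
		// Reached the task's own namespace: its effective set decides.
		if ns == c.NS {
			return c.Effective[name]
		}
		// Owner rule: a task in the parent namespace whose effective UID
		// created ns holds every capability in ns (and in its descendants).
		if ns.Parent == c.NS && ns.OwnerUID == c.EUID {
			return true
		}
	}
	// The task's namespace is not an ancestor of target.
	return false
}

func main() {
	root := &UserNS{}
	parent := Creds{NS: root, EUID: 1000} // unprivileged in root
	child := &UserNS{Parent: root, OwnerUID: parent.EUID}

	fmt.Println(HasCapIn(parent, "CAP_SYS_ADMIN", child)) // true
	fmt.Println(HasCapIn(parent, "CAP_SYS_ADMIN", root))  // false
}
```

Under this model, a test parent that creates the child user namespace passes the `CAP_SYS_ADMIN` check in that child without holding the capability in its own namespace, which is why a skip condition based on the ability to create a user namespace fits better than one based on host `CAP_SYS_ADMIN`.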
    _exit(0);
  }
  Cleanup cleanup([child] {
    kill(child, SIGKILL);
Why send SIGCONT and SIGKILL both? Does the SIGCONT after a SIGKILL matter in this case?
No, I don't think the SIGCONT matters here. I think I was concerned about the child process being blocked in pause. I'll remove the SIGCONT.
	return linuxerr.EINVAL
}
t.tg.signalHandlers.mu.Unlock()
fsContext := t.FSContext()
At this point we already know that this threadgroup is single threaded, so do we need to call checkAndPreventSharingOutsideTG()+allowSharing()? The lone task cannot cause the fsContext to become shared from under us: it is waiting for the setns() it already issued.
It looks like we only need to check if it's already shared, without toggling fs.preventSharing, maybe by way of writing a non-locking version of checkAndPreventSharingOutsideTG(): an isSharedOutsideTG().
Please let me know if there is a reason to lock down fs.preventSharing if I'm missing it.
hm yeah. I used checkAndPreventSharingOutsideTG because it already captured the rule I wanted, which is rejecting a user ns join when fs state is shared outside the thread group. But yeah, after the single-threaded check succeeds, I agree the prevent-sharing part looks unnecessary. I'll replace this with a simpler shared-fs check then.
I don't think you are missing anything, good callout
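A read-only shared-fs check of the kind discussed here might look like the sketch below. It is only a model under stated assumptions: `FSContext`, `Task`, `ThreadGroup`, and the `users` counter are hypothetical stand-ins rather than gVisor's real types, and locking is left out entirely, so real code would read these fields under whatever lock protects them.

```go
package main

import "fmt"

// FSContext is a hypothetical filesystem context. users counts every task
// that currently references it, inside or outside any one thread group.
type FSContext struct {
	users int
}

// Task and ThreadGroup are minimal stand-ins for the kernel structures.
type Task struct {
	fs *FSContext
}

type ThreadGroup struct {
	tasks []*Task
}

// isSharedOutsideTG reports whether fs is referenced by any task that is
// not a member of tg. It only reads state; it does not set or clear a
// prevent-sharing flag.
func isSharedOutsideTG(fs *FSContext, tg *ThreadGroup) bool {
	inside := 0
	for _, t := range tg.tasks {
		if t.fs == fs {
			inside++
		}
	}
	return fs.users > inside
}

func main() {
	private := &FSContext{users: 1}
	tg := &ThreadGroup{tasks: []*Task{{fs: private}}}
	fmt.Println(isSharedOutsideTG(private, tg)) // false

	// Two references but only one task in this thread group, as if a task
	// in another thread group had been cloned with CLONE_FS.
	shared := &FSContext{users: 2}
	tg2 := &ThreadGroup{tasks: []*Task{{fs: shared}}}
	fmt.Println(isSharedOutsideTG(shared, tg2)) // true
}
```

Because the caller has already been verified as single threaded and is blocked in its own `setns()` call, nothing in its thread group can add a new sharer in the meantime, so a plain read like this is enough for the rejection decision.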
if flags&linux.CLONE_NEWUSER != 0 {
	if target.ExitState() >= TaskExitInitiated {
		return linuxerr.ESRCH
Curious about this exit state check that other namespaces lack. What does it achieve?
The reason for this check was that the user ns comes from credentials, which survive longer than the task's other namespace fields. Without an explicit exit check, pidfd setns(CLONE_NEWUSER) could still see credentials on an exiting task where other namespace joins would fail with ESRCH. That said, I agree this should be part of the same target-task snapshot instead of a separate early read.
Lmk if I misunderstood the context
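The "same target-task snapshot" idea can be illustrated with a small model: read the exit state and every namespace reference under one lock, so the user namespace cannot be taken from an exiting task while the other joins would fail. All names here (`task`, `nsSnapshot`, `snapshotNamespaces`, the string-valued namespace fields) are hypothetical, and the single mutex is an assumption made for the sketch, not a description of gVisor's locking.

```go
package main

import (
	"fmt"
	"sync"
	"syscall"
)

type exitState int

const (
	taskRunning exitState = iota
	taskExitInitiated
	taskExitZombie
)

// errNoSuchProcess is the error returned for an exiting target.
var errNoSuchProcess error = syscall.ESRCH

// task is a hypothetical target task. userNS stands in for the namespace
// reached through credentials, which outlive the other namespace fields.
type task struct {
	mu     sync.Mutex
	exit   exitState
	userNS string
	netNS  string
}

// nsSnapshot holds the namespaces read from a target in one critical section.
type nsSnapshot struct {
	userNS string
	netNS  string
}

// snapshotNamespaces checks the exit state and copies the namespace
// references while holding the same lock, so all fields in the snapshot
// describe a task that had not started exiting.
func snapshotNamespaces(t *task) (nsSnapshot, error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.exit >= taskExitInitiated {
		return nsSnapshot{}, errNoSuchProcess
	}
	return nsSnapshot{userNS: t.userNS, netNS: t.netNS}, nil
}

func main() {
	t := &task{userNS: "user:[1]", netNS: "net:[2]"}
	snap, err := snapshotNamespaces(t)
	fmt.Println(snap, err)

	// Once exit has been initiated, every namespace join fails the same way.
	t.mu.Lock()
	t.exit = taskExitInitiated
	t.mu.Unlock()
	_, err = snapshotNamespaces(t)
	fmt.Println(err)
}
```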
1c8a5b3 to dc80b71
Thanks for the feedback, I addressed it.
33f8d13 to ade5562
ade5562 to 8060b5f
LGTM, thanks for implementing this.
User namespace entries under `/proc/[pid]/ns` currently render as fake namespace symlinks. They look like the other namespace files, but opening them does not produce an `nsfs` file that `setns(2)` can use. Rootless container tools such as `buildah` and `podman` rely on that file when they re-enter the pause process user namespace, so the second lifecycle command fails with `EINVAL`. Make `UserNamespace` implement `vfs.Namespace` and give each user namespace an `nsfs` inode when it is created. `/proc/[pid]/ns/user` now uses the regular namespace symlink path, so opening it returns a joinable namespace file instead of a fake link target. `Setns` now accepts `CLONE_NEWUSER` from both `nsfd`s and `pidfd`s. It follows the Linux restrictions for user namespace joins by rejecting the caller's current user namespace, requiring `CAP_SYS_ADMIN` in the target user namespace, rejecting multithreaded callers, and rejecting callers with `fs` state shared outside the thread group. The capability checks for any other namespaces in the same `setns` call use the credentials the caller would have after joining the user namespace. Add a syscall regression test that creates a child user namespace, opens `/proc/<pid>/ns/user`, and verifies that `setns(CLONE_NEWUSER)` succeeds. Fixes #13314 FUTURE_COPYBARA_INTEGRATE_REVIEW=#13323 from shayonj:issue-13314-userns-setns 8060b5f PiperOrigin-RevId: 925507008
User namespace entries under `/proc/[pid]/ns` currently render as fake namespace symlinks. They look like the other namespace files, but opening them does not produce an `nsfs` file that `setns(2)` can use. Rootless container tools such as `buildah` and `podman` rely on that file when they re-enter the pause process user namespace, so the second lifecycle command fails with `EINVAL`. Make `UserNamespace` implement `vfs.Namespace` and give each user namespace an `nsfs` inode when it is created. `/proc/[pid]/ns/user` now uses the regular namespace symlink path, so opening it returns a joinable namespace file instead of a fake link target. `Setns` now accepts `CLONE_NEWUSER` from both `nsfd`s and `pidfd`s. It follows the Linux restrictions for user namespace joins by rejecting the caller's current user namespace, requiring `CAP_SYS_ADMIN` in the target user namespace, rejecting multithreaded callers, and rejecting callers with `fs` state shared outside the thread group. The capability checks for any other namespaces in the same `setns` call use the credentials the caller would have after joining the user namespace. Add a syscall regression test that creates a child user namespace, opens `/proc/<pid>/ns/user`, and verifies that `setns(CLONE_NEWUSER)` succeeds. Fixes #13314 FUTURE_COPYBARA_INTEGRATE_REVIEW=#13323 from shayonj:issue-13314-userns-setns 8060b5f PiperOrigin-RevId: 926411968
8060b5f to a130ef2
User namespace entries under `/proc/[pid]/ns` currently render as fake namespace symlinks. They look like the other namespace files, but opening them does not produce an `nsfs` file that `setns(2)` can use. Rootless container tools such as `buildah` and `podman` rely on that file when they re-enter the pause process user namespace, so the second lifecycle command fails with `EINVAL`. Make `UserNamespace` implement `vfs.Namespace` and give each user namespace an `nsfs` inode when it is created. `/proc/[pid]/ns/user` now uses the regular namespace symlink path, so opening it returns a joinable namespace file instead of a fake link target. `Setns` now accepts `CLONE_NEWUSER` from both `nsfd`s and `pidfd`s. It follows the Linux restrictions for user namespace joins by rejecting the caller's current user namespace, requiring `CAP_SYS_ADMIN` in the target user namespace, rejecting multithreaded callers, and rejecting callers with `fs` state shared outside the thread group. The capability checks for any other namespaces in the same `setns` call use the credentials the caller would have after joining the user namespace. Add a syscall regression test that creates a child user namespace, opens `/proc/<pid>/ns/user`, and verifies that `setns(CLONE_NEWUSER)` succeeds. Fixes #13314 COPYBARA_INTEGRATE_REVIEW=#13323 from shayonj:issue-13314-userns-setns 8060b5f PiperOrigin-RevId: 926855389
This was merged as 3f949a4
User namespace entries under `/proc/[pid]/ns` currently render as fake namespace symlinks. They look like the other namespace files, but opening them does not produce an `nsfs` file that `setns(2)` can use. Rootless container tools such as `buildah` and `podman` rely on that file when they re-enter the pause process user namespace, so the second lifecycle command fails with `EINVAL`.

Make `UserNamespace` implement `vfs.Namespace` and give each user namespace an `nsfs` inode when it is created. `/proc/[pid]/ns/user` now uses the regular namespace symlink path, so opening it returns a joinable namespace file instead of a fake link target.

`Setns` now accepts `CLONE_NEWUSER` from both `nsfd`s and `pidfd`s. It follows the Linux restrictions for user namespace joins by rejecting the caller's current user namespace, requiring `CAP_SYS_ADMIN` in the target user namespace, rejecting multithreaded callers, and rejecting callers with `fs` state shared outside the thread group. The capability checks for any other namespaces in the same `setns` call use the credentials the caller would have after joining the user namespace.

Add a syscall regression test that creates a child user namespace, opens `/proc/<pid>/ns/user`, and verifies that `setns(CLONE_NEWUSER)` succeeds.

Fixes #13314
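The four restrictions listed in the description can be summarized as a short validation routine. The sketch below is an illustration rather than the code merged in this PR: `joinRequest` and `checkUserNSJoin` are hypothetical names, the inputs are precomputed booleans instead of real kernel state, and the order of the checks is an assumption. The errno values follow the setns(2) man page as I read it (`EINVAL` for the three structural rejections, `EPERM` for the missing capability).

```go
package main

import (
	"fmt"
	"syscall"
)

// Errors returned by the sketch, named so callers can compare against them.
var (
	errInvalid error = syscall.EINVAL
	errPerm    error = syscall.EPERM
)

// joinRequest is a hypothetical summary of the caller's state at the time
// it asks to join a user namespace.
type joinRequest struct {
	sameNamespace   bool // target is the caller's current user namespace
	threads         int  // number of tasks in the caller's thread group
	fsSharedOutside bool // fs state is shared outside the thread group
	capSysAdmin     bool // caller holds CAP_SYS_ADMIN in the target namespace
}

// checkUserNSJoin applies the restrictions for setns(CLONE_NEWUSER) and
// returns nil when the join would be allowed.
func checkUserNSJoin(r joinRequest) error {
	if r.sameNamespace {
		return errInvalid
	}
	if r.threads > 1 {
		return errInvalid
	}
	if r.fsSharedOutside {
		return errInvalid
	}
	if !r.capSysAdmin {
		return errPerm
	}
	return nil
}

func main() {
	ok := joinRequest{threads: 1, capSysAdmin: true}
	fmt.Println(checkUserNSJoin(ok) == nil) // true

	multithreaded := joinRequest{threads: 4, capSysAdmin: true}
	fmt.Println(checkUserNSJoin(multithreaded) == errInvalid) // true
}
```

Once such a check passes, the capability checks for any other namespace flags in the same call would be evaluated against the credentials the caller gains in the target user namespace, as the description states.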