user namespace bugfixes and features #6865

Blub · 2017-11-14T12:57:17Z

This series can be seen as 4 separate "chunks":

Chunk 1: setgid mode bugfix & regression test:

Patch 1 fixes the main issue.
Patch 2 adds a helper for running user namespace tests. Currently uses a fixed
user id range. (I saw no reason for anything more complex than that.)
Patch 3 adds a regression test for the issue fixed in patch 1.

Chunk 2: mounting from user namespaces (RFC):

Patch 4 is an RFC useful for when a user can have a mount namespace (usually
in combination with user namespaces), e.g. when using zfs allow to give
create+mount permissions to a container.
Patch 5 is necessary when including the third chunk but is otherwise there
since it made writing the test case of patch 6 more convenient.
Patch 6 tests create+mount permissions with user namespaces.

Chunk 3: mapping user ids when using zfs allow from within user namespaces.

Patch 7 causes ZFS_IOC_GET_FSACL and ZFS_IOC_SET_FSACL to perform user id
mapping (as well as checking!) on the sent/received data. Otherwise root in a
user namespace would not be able to run zfs allow with the user IDs as seen
from within its namespace, but would have to perform the mapping to real IDs.
This is also what easily enables users to create allow entries for user IDs
which do not exist in the host namespace's /etc/passwd and therefore would
show up empty and indistinguishable to the host (making patch 5 a
requirement).

Chunk 4: change the 'unallow' check:

Patch 8 allows users who have CAP_SYS_ADMIN in the current namespace (iow.
root in containers) to remove permissions of others if they're also allowed
to add the permission.

Checklist:

My code follows the ZFS on Linux code style requirements. (at least according to make checkstyle)
I have updated the documentation accordingly. (not yet)
I have read the CONTRIBUTING document.
I have added tests to cover my changes.
All new and existing tests passed.
All commit messages are properly formatted and contain Signed-off-by.
Change has been approved by a ZFS on Linux member.

Change file related checks to use user namespaces and make sure involved uids/gids are mappable in the current namespace. Note that checks without file ownership information will still not take user namespaces into account, as some of these should be handled via 'zfs allow' (otherwise root in a user namespace could issue commands such as `zpool export`). Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com> Closes openzfs#6800

Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>

An unprivileged container usually has its own user and group list, and zfs allow should be able to both view and modify them from the outside without having to add temporary entries with the mapped uids/gids. Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>

Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>

When executing 'zfs allow' from within a user namespace, the uids and gids must be mapped in accordance with the namespace. Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>

Regular users can only remove permissions from themselves in addition to requiring the allow permission to do so. It makes more sense for privileged users in a user namespace to be able to manage permissions of all users of that namespace. Thus, when the user has CAP_SYS_ADMIN in their current namespace, use the same check as for 'zfs allow' instead. Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>

behlendorf

The thrust of this looks good! I'm glad to see some user namespace test cases. Hopefully we'll be able to add more over time.

It's unfortunate that while we can delegate mount privileges to a user in a user namespace, the same user when running in the global namespace can't exercise those same privileges.

I think it would be worth investigating if the existing is_global_zone logic in the ZTS could be modified to mean is_global_namespace instead. This could potentially allow you to run the ZTS in a user namespace. Right now it's hardwired to always assume it's running in the global zone.

behlendorf · 2017-11-15T21:17:03Z

tests/zfs-tests/cmd/user_ns_exec/user_ns_exec.c

+
+	if (socketpair(AF_UNIX, SOCK_STREAM, 0, syncfd) != 0) {
+		perror("socketpair");
+		return (errno);


Returning errno here is dodgy since after a successful call to perror(3) errno is technically undefined. You should save errno in a local variable if you want to use it later.

I was wondering about that, since chg_usr_exec.c is doing the same.

Yup, that's not good either and should really be fixed at some point.
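
A minimal sketch of the pattern suggested above: copy errno into a local variable before calling perror(3), then return the saved value. The helper name make_sync_pair and its domain parameter are made up for illustration and are not part of user_ns_exec.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: create the sync socket pair and return 0 on
 * success, or the errno value captured right after the failing call.
 */
static int
make_sync_pair(int domain, int syncfd[2])
{
	if (socketpair(domain, SOCK_STREAM, 0, syncfd) != 0) {
		/* Capture errno before perror() gets a chance to change it. */
		int saved_errno = errno;

		perror("socketpair");
		return (saved_errno);
	}
	return (0);
}
```

Calling make_sync_pair(AF_UNIX, syncfd) mirrors the call in the reviewed hunk. Passing an unsupported domain is one way to exercise the failure path.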

behlendorf · 2017-11-15T21:25:14Z

tests/zfs-tests/cmd/user_ns_exec/user_ns_exec.c

+	return (exit_code);
+error_errno:
+	exit_code = errno;
+error:


As mentioned above you can't safely assume errno is still valid here. Given that, I think it would be simpler to drop the error* labels and do the needed error handling in each conditional.

Given that it's just a helper for tests most of the cases don't really need errno anyway and returning 1 would work just as well. A "failed" or "not failed" exit status should suffice after perror() printed the message.

behlendorf · 2017-11-15T21:26:22Z

tests/zfs-tests/cmd/user_ns_exec/user_ns_exec.c

+	close(syncfd[0]);
+done:
+	while (waitpid(child, &wstatus, 0) != child) {
+		/* Keep it simple. */


nit: I don't think the comment here is needed.

behlendorf · 2017-11-15T21:28:47Z

tests/runfiles/linux.run

@@ -695,6 +695,11 @@ tags = ['functional', 'truncate']
 tests = [ 'upgrade_userobj_001_pos' ]
 tags = ['functional', 'upgrade']

+# user_namespace_001 - https://github.com/zfsonlinux/zfs/issues/6800


We shouldn't need to link to the issue once this functionality works and has test coverage. So this can be dropped.

behlendorf · 2017-11-15T21:30:24Z

tests/zfs-tests/tests/functional/user_namespace/Makefile.am

+dist_pkgdata_SCRIPTS = \
+	setup.ksh \
+	cleanup.ksh \
+	user_namespace_001.ksh


user_namespace_common.kshlib and user_namespace_common.cfg need to be added to the Makefile.am. This is what caused the bots to fail. Please double check the permissions on the scripts too.

behlendorf · 2017-11-15T21:41:30Z

module/zfs/policy.c

 {
 	ASSERT3S(all, ==, B_FALSE);

 	if (cr != CRED() && (cr != kcred))
 		return (err);

-	if (!capable(capability))
+	if (!(ns ? ns_capable(ns, capability) : capable(capability)))


There's going to need to be some compatibility code added for older kernels which don't have ns_capable or a cr->user_ns. In that case this functionality needs to be automatically disabled.

http://build.zfsonlinux.org/builders/CentOS%206%20x86_64%20%28BUILD%29/builds/1581/steps/shell_3/logs/make
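
One possible shape for such compatibility code is sketched below. It is not the code from this PR: HAVE_NS_CAPABLE is a hypothetical configure-time macro, zfs_ns_capable is a made-up wrapper name, and the struct and the capable()/ns_capable() bodies are userspace stand-ins so the fallback logic can be compiled outside the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace stand-ins. In the module these come from the kernel
 * headers; only the wrapper at the bottom is the point of the sketch.
 */
struct user_namespace { int unused; };
#define	CAP_SYS_ADMIN	21

static bool
capable(int cap)
{
	(void) cap;
	return (true);
}

#ifdef HAVE_NS_CAPABLE	/* hypothetical configure check */
static bool
ns_capable(struct user_namespace *ns, int cap)
{
	(void) ns;
	(void) cap;
	return (true);
}
#endif

/*
 * Hypothetical wrapper: use the namespace-aware check when the kernel
 * provides it, otherwise fall back to the plain capable() check.
 */
static bool
zfs_ns_capable(struct user_namespace *ns, int cap)
{
#ifdef HAVE_NS_CAPABLE
	if (ns != NULL)
		return (ns_capable(ns, cap));
#else
	(void) ns;	/* older kernels: no per-namespace check */
#endif
	return (capable(cap));
}
```

With a wrapper like this the call site in policy.c would not need its own conditional, and on kernels without ns_capable the behavior stays the same as before the patch.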

behlendorf · 2017-11-15T21:44:53Z

module/zfs/zpl_super.c

@@ -339,4 +339,5 @@ struct file_system_type zpl_fs_type = {
 	.get_sb			= zpl_get_sb,
 #endif /* HAVE_MOUNT_NODEV */
 	.kill_sb		= zpl_kill_sb,
+	.fs_flags		= FS_USERNS_MOUNT,


This wasn't a valid flag until the 3.8 kernel, so this functionality will need to be gracefully disabled in older kernels.

behlendorf · 2017-11-15T21:48:59Z

module/zfs/zfs_vfsops.c

+	if (error == EACCES)
+		error = dsl_deleg_access(osname, "mount", cred);
+	if (error)
+		return (error);


nit: can you assign cred = CRED() in the declaration.

behlendorf · 2017-11-15T22:10:08Z

module/zfs/zfs_ioctl.c

+	if ((error = dsl_deleg_get(zc->zc_name, &nvp)) != 0)
+		return (error);
+#ifdef CONFIG_USER_NS
+	if ((error = deleg_map_user_ns(&nvp)) == 0)
 		error = put_nvlist(zc, nvp);


We need this put_nvlist in the !CONFIG_USER_NS case as well.
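
A sketch of the control flow being asked for, where only the id mapping step is conditional and the put always runs. The stub_* functions are stand-ins for dsl_deleg_get, deleg_map_user_ns and put_nvlist so that the shape compiles on its own; their arguments are omitted.

```c
#include <assert.h>

/* Stand-in stubs; the real functions live in the ZFS module. */
static int put_calls;

static int
stub_deleg_get(void)
{
	return (0);
}

static int
stub_put_nvlist(void)
{
	put_calls++;
	return (0);
}

#ifdef CONFIG_USER_NS
static int
stub_deleg_map_user_ns(void)
{
	return (0);
}
#endif

static int
get_fsacl_sketch(void)
{
	int error;

	if ((error = stub_deleg_get()) != 0)
		return (error);
#ifdef CONFIG_USER_NS
	/* Only the id mapping is specific to user namespaces. */
	error = stub_deleg_map_user_ns();
#endif
	/* Runs with or without CONFIG_USER_NS. */
	if (error == 0)
		error = stub_put_nvlist();
	return (error);
}
```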

behlendorf · 2017-11-15T22:16:37Z

module/zfs/zfs_ioctl.c

+		 * allow.
+		 */
+		if (zc->zc_perm_action == B_FALSE ||
+		    ns_capable(cr->user_ns, CAP_SYS_ADMIN)) {


Same story, compatibility code needed for ns_capable() on old kernels.

behlendorf · 2018-01-10T00:26:43Z

@Blub any updates on this?

Blub · 2018-01-10T12:01:58Z

It's on my todo list but unfortunately not at the top currently.
I'll try to address your requested change in the first patch soon (late this week or early next week), as it's the important one. The rest of the series may have to wait a little longer.

pstch · 2018-01-20T16:32:58Z

Hi, I've been testing this patch for some days now, and I didn't find a way to prohibit the host from mounting the datasets created by the containers. At each reboot, they get mounted on the host, and mounting them from the container results in an "already mounted" error.

Is this something that should be handled by the user (in some way that I haven't found yet) or is it a limitation of this patch ? Thanks a lot for your work.

DeHackEd · 2018-01-20T16:51:18Z

Just as a suggestion, how about only allowing datasets with zoned=on set to do mounting in user namespaces on top of the existing zfs allow system, and by default zfs mount -a would skip datasets with zoned=on set in the host.

This prevents container mounts from being available to the host by default to prevent these kinds of issues and allows the mount points to safely reflect where they should be in the container itself.

Downsides:

I'm not actually sure how to do that since it's intentionally not very obvious you're in a container.
It means container setup would require ZFS awareness to mount the specifically allowed datasets on startup

@pstch: the issue is that if a dataset is already mounted, you can't mount it a second time with the mount command, even in another namespace. It would be necessary to bind mount it from the host, which also requires shared/slave mount propagation in the container to receive it after the fact.

Blub · 2018-01-24T15:21:07Z

Regarding is_global_zone meaning is_global_namespace: The only way I currently see is to check for the 0->0 mapping in /proc/self/uid_map. This seems to work:

--- a/tests/zfs-tests/include/libtest.shlib
+++ b/tests/zfs-tests/include/libtest.shlib
@@ -1470,7 +1470,11 @@ function setup_nfs_server
 function is_global_zone
 {
        if is_linux; then
-               return 0
+               typeset uid_map=$(cat /proc/self/uid_map 2>/dev/null)
+               if [[ $uid_map == *( )0+( )0+( )4294967295 ]]; then
+                       return 0
+               fi
+               return 1
        else
                typeset cur_zone=$(zonename 2>/dev/null)
                if [[ $cur_zone != "global" ]]; then

I can push it with my next updates.
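
For comparison, the same check can be sketched in C. This is only an illustration of the 0 -> 0 mapping test described above, not something proposed for the test suite, and both function names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Return true if a uid_map line is the identity mapping that covers
 * the full id range, as seen in the initial user namespace.
 */
static bool
uid_map_line_is_initial(const char *line)
{
	unsigned long inside, outside, count;

	if (sscanf(line, "%lu %lu %lu", &inside, &outside, &count) != 3)
		return (false);
	return (inside == 0 && outside == 0 && count == 4294967295UL);
}

/* Read the first line of /proc/self/uid_map and apply the check. */
static bool
in_initial_user_ns(void)
{
	char line[128];
	bool result = false;
	FILE *fp = fopen("/proc/self/uid_map", "r");

	if (fp == NULL)
		return (false);
	if (fgets(line, sizeof (line), fp) != NULL)
		result = uid_map_line_is_initial(line);
	(void) fclose(fp);
	return (result);
}
```

As in the shell version, an unreadable or missing uid_map is treated as "not the global namespace".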

Blub · 2018-01-24T15:21:20Z

@pstch, thanks for testing. During development I'm mostly testing with unshare -m. The way we start-up containers is by using the already mounted datasets (bindmounting them to lxc's rootfs setup path). Given that no other storage we have currently allows adding datasets to an already running container I haven't tested this yet. I'll have to take a closer look.

behlendorf · 2018-01-24T18:47:24Z

Just as a suggestion, how about only allowing datasets with zoned=on set to do mounting in user namespaces on top of the existing zfs allow system, and by default zfs mount -a would skip datasets with zoned=on set in the host.

Yes, this is exactly what the zoned property was intended for so let's make use of it and update the man page accordingly.
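
The proposed rule can be written down as two small predicates. This is only a restatement of the suggestion quoted above under the assumption that the mount delegation is still required on top of zoned=on; the function names are hypothetical and nothing like this exists in the code yet.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical: may a process in a non-initial user namespace mount
 * the dataset? Requires zoned=on in addition to the 'zfs allow' mount
 * delegation.
 */
static bool
userns_may_mount(bool zoned, bool has_mount_delegation)
{
	return (zoned && has_mount_delegation);
}

/*
 * Hypothetical: does 'zfs mount -a' on the host pick up the dataset
 * by default? Zoned datasets are skipped.
 */
static bool
host_mount_all_includes(bool zoned)
{
	return (!zoned);
}
```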

behlendorf · 2018-03-07T23:45:29Z

@Blub with chunk 1 merged in 0e85048 why don't we close this and open new independent PRs for the remaining chunks.

codecov · 2018-03-08T05:55:33Z

Codecov Report

Merging #6865 into master will increase coverage by 1.22%.
The diff coverage is 75.36%.

@@            Coverage Diff             @@
##           master    #6865      +/-   ##
==========================================
+ Coverage   75.23%   76.45%   +1.22%     
==========================================
  Files         298      328      +30     
  Lines       94503   104006    +9503     
==========================================
+ Hits        71100    79521    +8421     
- Misses      23403    24485    +1082

Flag	Coverage Δ
#kernel	`76.09% <75.57%> (+1.33%)`	⬆️
#user	`65.75% <46.31%> (-1.7%)`	⬇️


Blub · 2018-03-08T06:59:14Z

Sure. Will open new ones once I rebased them and incorporated the remaining requested changes.

behlendorf · 2018-03-19T21:02:33Z

@Blub in the context of improving ZFS integration with user namespaces you might also be interested in #7294. To summarize, with a little work it's probably possible to support zfs mount/unmount delegations in a user namespace. We just need someone to investigate exactly what's required.

AkihiroSuda · 2018-06-03T08:00:09Z

Any news?

AkihiroSuda · 2018-08-22T07:02:38Z

ping^^ @Blub

Blub · 2018-08-23T09:54:25Z

Haven't gotten around to continue with this yet...

pstch · 2018-11-18T05:07:04Z

When rebasing this PR on current master, I can successfully mount, and the tests pass, but I get the following uid/gid after creating a file in a dataset mounted in the global namespace, and then creating another file in the same dataset mounted in a user namespace:

root@test:/mnt# ls -l
total 1
-rw-r--r-- 1 1000000 1000000 0 Nov 18 04:56 created_on_container
-rw-r--r-- 1 root    root    0 Nov 18 04:56 created_on_host
root@test:/mnt# echo >> created_on_container 
root@test:/mnt# echo >> created_on_host

I have some trouble understanding if this is the required behaviour (and getting worried that I failed something when rebasing), and if UID/GIDs should be mapped when mounted in user namespaces. I would have personally thought that yes, they would be mapped. This LWN article says that they should not:

When a process within a user namespace accesses a filesystem mounted outside that namespace, its user and group IDs will be mapped accordingly before any access decisions are made. If the filesystem has been mounted within the namespace, though, that mapping should not happen.

I don't really understand why that mapping should not happen, so I have some difficulty understanding the proper behaviour of user namespace mounts.
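
To make "mapped" concrete, here is a small sketch of how a uid_map extent translates ids in each direction. The extent 0 1000000 65536 is an assumption chosen to match the 1000000 owner in the listing above; the struct and function names are made up and this is not how the kernel or ZFS implements it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One extent: ids [inside, inside + count) map to [outside, outside + count). */
struct id_extent {
	uint32_t inside;
	uint32_t outside;
	uint32_t count;
};

#define	ID_UNMAPPED	UINT32_MAX

/* Translate an id as seen inside the namespace to the id the host sees. */
static uint32_t
ns_to_host(const struct id_extent *map, size_t n, uint32_t id)
{
	for (size_t i = 0; i < n; i++) {
		if (id >= map[i].inside && id - map[i].inside < map[i].count)
			return (map[i].outside + (id - map[i].inside));
	}
	return (ID_UNMAPPED);
}

/* Translate a host id to the id seen inside the namespace. */
static uint32_t
host_to_ns(const struct id_extent *map, size_t n, uint32_t id)
{
	for (size_t i = 0; i < n; i++) {
		if (id >= map[i].outside && id - map[i].outside < map[i].count)
			return (map[i].inside + (id - map[i].outside));
	}
	return (ID_UNMAPPED);
}
```

Under that assumed extent, root inside the container (0) corresponds to 1000000 on the host, and host root (0) has no counterpart inside the container.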

pstch · 2018-11-18T05:11:31Z

I would also like to say that I don't think #7294 is related to this issue. Delegating mount/unmount to user namespaces seems to work (zfs mount works in the namespace after allowing mount with zfs allow in the global namespace), although I don't know if ownership and permissions are handled properly.

Blub · 2018-11-23T08:54:25Z

Just an FYI for anyone following this: I pushed the rebase, plus another patch on top to make the super block always owned by init_user_ns for now. After re-reading @DeHackEd's comment I'm thinking that limiting userns mounts to the zoned property probably makes sense, too, and in this case it may even make sense to have the super block owned by the user namespace. It wouldn't really change much from the perspective of a container, but it would allow switching a container to a different user namespace (or turning it into a privileged container and back) without utilizing tools such as fuidshift. Not sure if it's of much use other than that? @stgraber might have some thoughts?
Both could generally be allowed in user namespaces, where zoned=on would have sb->s_user_ns = userns and zoned=off would have sb->s_user_ns = &init_user_ns.

sommcz · 2018-11-23T19:46:43Z

@Blub Thanks for the push, great work. Just for sure, have you solved the issue with pool structure of other user shown in userns too?

Blub · 2018-11-24T08:17:07Z

What do you mean exactly?

pstch · 2018-12-02T09:30:28Z

I have tested the last rebase of this PR (with the patch that makes the superblock owned by init_user_ns) on Debian, and I can say that it seems to work well. zfs allow for mount works, and UIDs/GIDs are properly mapped when mounting from unprivileged LXD containers.

As mentioned above, it's possible to integrate userns mounts with the zoned property. There are a few different possibilities:

Ignoring zones altogether (s_user_ns = &init_user_ns for all superblocks)
Using a "global" (init_user_ns) and a "user" zone (for all non-init namespaces)
Making sense of zone IDs on Linux (possibility: namespace's inode number)

If some concept of zone becomes available, it becomes possible to integrate it in a few different ways with userns mounts:

Restricting userns mounts to zoned=on datasets, that cannot be mounted in init_user_ns
Making the superblock of zoned=on datasets be owned by the userns rather than init_user_ns, possibly adding some other option to restrict userns mounts

The current code does not seem to handle the "superblock owned by userns" case (which is basically the situation before the patch that made it owned by init_user_ns) well, and leads to this weird situation when interacting with a dataset mounted in a user namespace:

$ id
uid=1000(test) gid=1000(test)
$ echo test > test
$ echo test > test
bash: test: Permission denied
$ ls -ln
-rw-r--r--  1 1001000 1001000           2 déc.   2 10:19 test
# chown 1000:1000 test
# ls -ln
-rw-r--r--  1 1000 1000                 2 déc.   2 10:19 test

The files appear with the same UID/GIDs when mounted in init_user_ns, so there is no UID/GID translation done when reading files and when calling chown, but there seems to be a problem with the way the owner of created files and directories is determined.

I am currently testing the global/user zone approach (using userns-owned superblocks for zoned datasets), while trying to find a cause/solution for the above issue.

Blub · 2018-12-02T09:48:51Z

So I think I'd start with a PR for the ZFS_IOC_GET/SET_FSACL parts (and the numeric ids in zfs allow, as without those managing user namespaces is rather annoying). That way the "view" seen from user namespaces makes sense.
After that, a PR for the general mount support where super blocks are owned by the init user ns only. Iow. zoned=off-only variant. ID translations would work as we're used to already and containers would be able to zfs mount.
Maybe by then we know more about how we want to proceed with the zoned property?
One nice thing it would enable is to have a container template simply cloned and used in different user namespaces without having to shift all the user ids inside first.
I know too little about solaris zones to properly map their behavior to linux. The variants @pstch mentioned above with the userns-inode or only global vs non-global would both enable the above use case. I'm not sure which way makes more sense on linux. The inode numbers could probably be dangerous if zfs expects zone ids to be in some way persistent.

pstch · 2018-12-03T19:35:29Z

After reading the ZoL code more thoroughly, I think that there is already some logic enabled for zones, even if it is never used at this time (because we are always in the global zone).

For example, if we use a non-global zone for user namespaces, user namespaces will ONLY be able to mount zoned datasets. This means that without changing the current logic, it is only possible to restrict userns mounts to zoned datasets (which would be the result of implementing crgetzoneid/INGLOBALZONE/zone_dataset_visible).

I'll create a PR for the global/user zones approach (which does not use inode numbers, just 0/1 for init/user namespaces), trying to implement the required things in zone.h without changing the current logic, once I have working code and if the required PRs have been merged, so that it's possible to discuss which approach would make more sense.

stevegilbert23 · 2021-01-30T21:50:26Z

Does anyone know what happened to this PR? It was closed in March 2018 but conversation continued through December 2018 (previous comment) and it appears to have stopped there.

Was anything done toward getting user namespaces working? I am interested regarding access to /dev/zfs inside containers: https://github.com/lxc/lxd/issues/4184

mklemm2 · 2021-11-17T10:18:52Z

I also wondered what happened here, for the same reasons as @stevegilbert23

behlendorf · 2021-11-17T14:35:07Z

This work was taken up in #12263.

mklemm2 · 2021-12-03T21:06:01Z

thanks!

Blub added 8 commits November 14, 2017 13:52

Add user_ns_exec helper for the test suite

d945593

Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>

Add regression test for user namespaces.

93d641c

Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>

Support mounting from user namespaces

35accc8

Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>

Add tests for mounting in user namespaces

f8e6711

Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>

User namespace support for ZFS_IOC_GET/SET_FSACL

6644317

When executing 'zfs allow' from within a user namespace, the uids and gids must be mapped in accordance with the namespace. Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>

behlendorf requested changes Nov 15, 2017

behlendorf added the Status: Work in Progress Not yet ready for general review label Nov 27, 2017

pstch mentioned this pull request Jan 20, 2018

Allow using the ZFS backend in containers canonical/lxd#4184

Closed

Blub mentioned this pull request Mar 5, 2018

user namespace fix & test for #6800 #7270

Merged

Blub closed this Mar 8, 2018

pstch mentioned this pull request Nov 18, 2018

Adding --no-canonicalize prevents user mounts #7294

Closed
