Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

user namespace bugfixes and features #6865

Closed
wants to merge 8 commits into from
Closed

Conversation

Blub
Copy link
Contributor

@Blub Blub commented Nov 14, 2017

This series can be seen as 4 separate "chunks":

Chunk 1: setgid mode bugfix & regression test:

  • Patch 1 fixes the main issue.
  • Patch 2 adds a helper for running user namespace tests. Currently uses a fixed
    user id range. (I saw no reason for anything more complex than that.)
  • Patch 3 adds a regression test for the issue fixed in patch 1.

Chunk 2: mounting from user namespaces (RFC):

  • Patch 4 is an RFC useful for when a user can have a mount namespace (usually
    in combination with user namespaces. Eg. giving zfs allowing create+mount
    permissions to a container.
  • Patch 5 is necessary when including the third chunk but is otherwise there
    since it made writing the test case of patch 6 more convenient.
  • Patch 6 tests create+mount permissions with user namespaces.

Chunk 3: mapping user ids when using zfs allow from within user namespaces.

  • Patch 7 causes ZFS_IOC_GET_FSACL and ZFS_IOC_SET_FSACL to perform user id
    mapping (as well as checking!) on the sent/received data. Otherwise root in a
    user namespace would not be able to run zfs allow with the user IDs as seen
    from within its namespace, but would have to perform the mapping to real IDs.
    This is also what easily enables users to create allow entries for user IDs
    which do not exist in the host namespace's /etc/passwd and therefore would
    show up empty and indistinguishable to the host (making patch 5 a
    requirement).

Chunk 4: change the 'unallow' check:

  • Patch 8 allows users who have CAP_SYS_ADMIN in the current namespace (iow.
    root in containers) to remove permissions of others if they're also allowed
    to add the permission.

Checklist:

  • My code follows the ZFS on Linux code style requirements. (at least according to make checkstyle)
  • I have updated the documentation accordingly. (not yet)
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

Change file related checks to use user namespaces and make
sure involved uids/gids are mappable in the current
namespace.

Note that checks without file ownership information will
still not take user namespaces into account, as some of
these should be handled via 'zfs allow' (otherwise root in a
user namespace could issue commands such as `zpool export`).

Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Closes openzfs#6800
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
An unprivileged container usually has its own user and group
list, and zfs allow should be able to both view and modify
them from the outside without having to add temporary
entries with the mapped uids/gids.

Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
When executing 'zfs allow' from within a user namespace, the
uids and gids must be mapped in accordance with the
namespace.

Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Regular users can only remove permissions from themselves in
addition to requiring the allow permission to do so. It
makes more sense for privileged users in a user namespace
to be able to manage permissions of all users of that
namespace.
Thus, when the user has CAP_SYS_ADMIN in their current
namespace, use the same check as for 'zfs allow' instead.

Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thrust of this looks good! I'm glad to see some user namespace test cases. Hopefully we'll be able to add more over time.

It's unfortunate that while we can delegate mount privileges to a user in a user namespace, the same user when running in the global namespace can't exercise those same privileges.

I think it would be worth investigating if the existing is_global_zone logic in the ZTS could be modified to mean is_global_namespace instead. This could potentially allow you to run the ZTS in a user namespace. Right now it's hardwired to always assume it's running in the global zone.


if (socketpair(AF_UNIX, SOCK_STREAM, 0, syncfd) != 0) {
perror("socketpair");
return (errno);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning errno here is dodgy since after a successful call to perror(3) errno is technically undefined. You should save errno in a local variable if you want to use it latter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering about that, since chg_usr_exec.c is doing the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that's not good either and should really be fixed at some point.

return (exit_code);
error_errno:
exit_code = errno;
error:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above you can't safely assume errno is still valid here. Given that, I think it would be simpler to drop the error* labels and do the needed error handling in each conditional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that it's just a helper for tests most of the cases don't really need errno anyway and returning 1 would work just as much. A "failed" or "not failed" exit status should suffice after perror() printed the message.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

close(syncfd[0]);
done:
while (waitpid(child, &wstatus, 0) != child) {
/* Keep it simple. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I don't think the comment here is needed.

@@ -695,6 +695,11 @@ tags = ['functional', 'truncate']
tests = [ 'upgrade_userobj_001_pos' ]
tags = ['functional', 'upgrade']

# user_namespace_001 - https://github.com/zfsonlinux/zfs/issues/6800
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't need to link to the issue once this functionality works and has test coverage. So this can be dropped.

dist_pkgdata_SCRIPTS = \
setup.ksh \
cleanup.ksh \
user_namespace_001.ksh
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

user_namespace_common.kshlib and user_namespace_common.cfg need to be added to the Makefile.am. This is what caused the bots to fails. Please double check the permissions on the scripts too.

{
ASSERT3S(all, ==, B_FALSE);

if (cr != CRED() && (cr != kcred))
return (err);

if (!capable(capability))
if (!(ns ? ns_capable(ns, capability) : capable(capability)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's going to need to be some compatibility code added for older kernels which don't have ns_capable or a cr->user_ns. In which case this functionally needs to be automatically disabled.

http://build.zfsonlinux.org/builders/CentOS%206%20x86_64%20%28BUILD%29/builds/1581/steps/shell_3/logs/make

@@ -339,4 +339,5 @@ struct file_system_type zpl_fs_type = {
.get_sb = zpl_get_sb,
#endif /* HAVE_MOUNT_NODEV */
.kill_sb = zpl_kill_sb,
.fs_flags = FS_USERNS_MOUNT,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wasn't a valid flag until the 3.8 kernel, this functionally will need to be gracefully disabled in older kernels.

if (error == EACCES)
error = dsl_deleg_access(osname, "mount", cred);
if (error)
return (error);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you assign cred = CRED() in the declaration.

if ((error = dsl_deleg_get(zc->zc_name, &nvp)) != 0)
return (error);
#ifdef CONFIG_USER_NS
if ((error = deleg_map_user_ns(&nvp)) == 0)
error = put_nvlist(zc, nvp);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this put_nvlist in the !CONFIG_USER_NS case as well.

* allow.
*/
if (zc->zc_perm_action == B_FALSE ||
ns_capable(cr->user_ns, CAP_SYS_ADMIN)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same story, compatibility code needed for ns_capable() on old kernels.

@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label Nov 27, 2017
@behlendorf
Copy link
Contributor

@Blub any updates on this?

@Blub
Copy link
Contributor Author

Blub commented Jan 10, 2018

It's on my todo list but unfortunately not at the top currently.
I'll try to address your requested change in the first patch soon (late this week or early next week), as it's the important one. The rest of the series may have to wait a little longer.

@pstch
Copy link

pstch commented Jan 20, 2018

Hi, I've been testing this patch for some days now, and I didn't find a way to prohibit the host from mounting the datasets created by the containers. At each reboot, they get mounted on the host, and mounting them from the container results in an "already mounted" error.

Is this something that should be handled by the user (in some way that I haven't found yet) or is it a limitation of this patch ? Thanks a lot for your work.

@DeHackEd
Copy link
Contributor

Just as a suggestion, how about only allowing datasets with zoned=on set to do mounting in user namespaces on top of the existing zfs allow system, and by default zfs mount -a would skip datasets with zoned=on set in the host.

This prevents container mounts from being available to the host by default to prevent these kinds of issues and allows the mount points to safely reflect where they should be in the container itself.

Downsides:

  • I'm not actually sure how to do that since it's intentionally not very obvious you're in a container.
  • It means container setup would require ZFS awareness to mount the specifically allowed datasets on startup

@pstch: the issue is that if a dataset is already mounted, you can't mount it a second time with the mount command even in another namespace. It would be necessary to use bind mounting from the host, which also shared/slave mount in the container to receive it after the fact.

@Blub
Copy link
Contributor Author

Blub commented Jan 24, 2018

Regarding is_global_zone meaning is_global_namespace: The only way I currently see is to check for the 0->0 mapping in /proc/self/uid_map. This seems to work:

--- a/tests/zfs-tests/include/libtest.shlib
+++ b/tests/zfs-tests/include/libtest.shlib
@@ -1470,7 +1470,11 @@ function setup_nfs_server
 function is_global_zone
 {
        if is_linux; then
-               return 0
+               typeset uid_map=$(cat /proc/self/uid_map 2>/dev/null)
+               if [[ $uid_map == *( )0+( )0+( )4294967295 ]]; then
+                       return 0
+               fi
+               return 1
        else
                typeset cur_zone=$(zonename 2>/dev/null)
                if [[ $cur_zone != "global" ]]; then

I can push it with my next updates.

@Blub
Copy link
Contributor Author

Blub commented Jan 24, 2018

@pstch, thanks for testing. During development I'm mostly testing with unshare -m. The way we start-up containers is by using the already mounted datasets (bindmounting them to lxc's rootfs setup path). Given that no other storage we have currently allows adding datasets to an already running container I haven't tested this yet. I'll have to take a closer look.

@behlendorf
Copy link
Contributor

Just as a suggestion, how about only allowing datasets with zoned=on set to do mounting in user namespaces on top of the existing zfs allow system, and by default zfs mount -a would skip datasets with zoned=on set in the host.

Yes, this is exactly what the zoned property was intended for so let's make use of it and update the man page accordingly.

@behlendorf
Copy link
Contributor

@Blub with chunk 1 merged in 0e85048 why don't we close this and open new independant PRs for the remaining chunks.

@codecov
Copy link

codecov bot commented Mar 8, 2018

Codecov Report

Merging #6865 into master will increase coverage by 1.22%.
The diff coverage is 75.36%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6865      +/-   ##
==========================================
+ Coverage   75.23%   76.45%   +1.22%     
==========================================
  Files         298      328      +30     
  Lines       94503   104006    +9503     
==========================================
+ Hits        71100    79521    +8421     
- Misses      23403    24485    +1082
Flag Coverage Δ
#kernel 76.09% <75.57%> (+1.33%) ⬆️
#user 65.75% <46.31%> (-1.7%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 71788d9...c82eda6. Read the comment docs.

@Blub
Copy link
Contributor Author

Blub commented Mar 8, 2018

Sure. Will open new ones once I rebased them and incorporated the remaining requested changes.

@Blub Blub closed this Mar 8, 2018
@behlendorf
Copy link
Contributor

@Blub in the context of improving ZFS intregration with user namespaces you might also be interested in #7294. To summerize, with a little work it's probably possible to support zfs mount/unmount delegations in an user namespace. We just need someone to investigate exactly what's required.

@AkihiroSuda
Copy link

Any news?

@AkihiroSuda
Copy link

ping^^ @Blub

@Blub
Copy link
Contributor Author

Blub commented Aug 23, 2018

Haven't gotten around to continue with this yet...

@pstch
Copy link

pstch commented Nov 18, 2018

When rebasing this PR on current master, I can successfully mount, and the tests pass, but I get the following uid/gid, after creating a file in a dataset mounted in the global namespace, and then creating another file in the same dataset mounted in an user namespace :

root@test:/mnt# ls -l
total 1
-rw-r--r-- 1 1000000 1000000 0 Nov 18 04:56 created_on_container
-rw-r--r-- 1 root    root    0 Nov 18 04:56 created_on_host
root@test:/mnt# echo >> created_on_container 
root@test:/mnt# echo >> created_on_host

I have some trouble understanding if this is the required behaviour (and getting worried that I failed something when rebasing), and if UID/GIDs should be mapped when mounted in user namespaces. I would have personally thought that yes, they would be mapped. This LWN article says that they should not:

When a process within a user namespace accesses a filesystem mounted outside that namespace, its user and group IDs will be mapped accordingly before any access decisions are made. If the filesystem has been mounted within the namespace, though, that mapping should not happen.

I don't really understand why should that mapping not happen, so I have some difficulties understanding the proper behaviour of user namespace mounts.

@pstch
Copy link

pstch commented Nov 18, 2018

I would also like to say that I don't think #7294 is related to this issue. Delegating mount/unmount to user namespaces seems to work (zfs mount works in the namespace after allowing mount with zfs allow in the global namespace), although I don't know if ownship and permissions are handled properly.

@Blub
Copy link
Contributor Author

Blub commented Nov 23, 2018

Just an FYI for who's following this: pushed the rebase, and another patch on top to make the super block always owned by inituser_ns for now. After re-reading @DeHackEd's comment I'm thinking limiting userns mounts to the zoned property probably makes sense, too, and in this case it may even make sense to have the super block owned by the user namespace. It really change much from the perspective of a container, but it would allow switching a container to a different user namespace (or turn it into a privileged container and back) without utilizing tools such as fuidshift. Not sure if it's of much use other than that? @stgraber might have some thoughts?
Both could generally be allowed in user namespaces, where zoned=on would have sb->s_user_ns = userns and zoned=off would have sb->s_user_ns = &inituser_ns.

@sommcz
Copy link

sommcz commented Nov 23, 2018

@Blub Thanks for the push, great work. Just for sure, have you solved the issue with pool structure of other user shown in userns too?

@Blub
Copy link
Contributor Author

Blub commented Nov 24, 2018

What do you mean exactly?

@pstch
Copy link

pstch commented Dec 2, 2018

I have tested the last rebase of this PR (with the patch that makes the superblock owned by init_user_ns) on Debian, and I can say that it seems to work well. zfs allow for mount works, and UIDs/GIDs are properly mapped when mounting from unprivileged LXD containers.

As mentioned above, it's possible to integrate userns mounts with the zoned property. There are a few different possibilities:

  • Ignoring zones altogether (s_user_ns = &init_user_ns for all superblocks)
  • Using a "global" (init_user_ns) and a "user" zone (for all non-init namespaces)
  • Making sense of zone IDs on Linux (possibility: namespace's inode number)

If some concept of zone becomes available, it becomes possible to integrate it in a few different ways with userns mounts:

  • Restricting userns mounts to zoned=on datasets, that cannot be mounted in init_user_ns
  • Making the superblock of zoned=on datasets be owned by the userns rather than init_user_ns, possibly adding some other option to restrict userns mounts

The current code does not seem to handle the "superblock owned by userns" case (which is basically the situation before the patch the made it owned by init_user_ns) well, and leads to this weird situation when interacting with a dataset mounted in an user namespace:

$ id
uid=1000(test) gid=1000(test)
$ echo test > test
$ echo test > test
bash: test: Permission denied
$ ls -ln
-rw-r--r--  1 1001000 1001000           2 déc.   2 10:19 test
# chown 1000:1000 test
# ls -ln
-rw-r--r--  1 1000 1000                 2 déc.   2 10:19 test

The files appear with the same UID/GIDs when mounted in init_user_ns, so there is no UID/GID translation done when reading files and when calling chown, but there seems to be a problem with the way the owner of created files and directories is determined.

I am currently testing the global/user zone approach (using userns-owned superblocks for zoned datasets), while trying to find a cause/solution for the above issue.

@Blub
Copy link
Contributor Author

Blub commented Dec 2, 2018

So I think I'd start with a PR for the ZFS_IOC_GET/SET_FSACL parts (and the numeric ids in zfs allow as without those managing user namespaces rather annoying). That way the "view" seen from user namespaces makes sense.
After that, a PR for the general mount support where super blocks are owned by the init user ns only. Iow. zoned=off-only variant. ID translations would work as we're used to already and containers would be able to zfs mount.
Maybe by then we know more about how we want to proceed with the zoned property?
One nice thing it would enable is to have a container template simply cloned and used in different user namespaces without having to shift all the user ids inside first.
I know too little about solaris zones to properly map their behavior to linux. The variants @pstch mentioned above with the userns-inode or only global vs non-global would both enable the above use case. I'm not sure which way makes more sense on linux. The inode numbers could probably be dangerous if zfs expects zone ids to be in some way persistent.

@pstch
Copy link

pstch commented Dec 3, 2018

After reading the ZoL code more thorougly, I think that there i already some logic enabled for zones, even if it is never used at thi time (because we are always in the global zone).

For example, if we use a non-global zone for user namespaces, user namespaces will ONLY be able to mount zoned dataset. This means that without changing the current logic, it is only possible to restrict userns mounts to zoned datasets (which would be the result of implementing crgetzoneid/INGLOBALZONE/zone_dataset_visible).

I'll create a PR for the global/user zones approach (which does not use inode numbers, just 0/1 for init/user namespaces), trying to implement the required things in zone.h without changing the current logic, once I have working code and if the required PRs have been merged, so that it's possible to discuss which approach would make more sense.

@stevegilbert23
Copy link

Does anyone know what happened to this PR? It was closed in March 2018 but conversation continued through December 2018 (previous comment) and it appears to have stopped there.

Was anything done toward getting user namespaces working? I am interested regarding access to /dev/zfs inside containers: https://github.com/lxc/lxd/issues/4184

@mklemm2
Copy link

mklemm2 commented Nov 17, 2021

I also wondered what happened here, for the same reasons as @stevegilbert23

@behlendorf
Copy link
Contributor

This work was taken up in #12263.

@mklemm2
Copy link

mklemm2 commented Dec 3, 2021

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Work in Progress Not yet ready for general review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants