Not able to run container with non root user #1513

Closed
erezo opened this issue Jul 10, 2017 · 24 comments

Comments

@erezo

erezo commented Jul 10, 2017

Hi,

running with:

CentOS Linux release 7.2.1511 (Core)

runc version 1.0.0-rc3
commit: 5c73abb
spec: 1.0.0-rc5-dev

Docker version 17.06.0-ce, build 02c1d87

I followed the Building section and installed runc in a custom directory, then created an OCI bundle and a rootless spec, and finally tried to run the rootless container, but I keep getting the following error:

../usr/local/sbin/runc --root /tmp/runc run mycontainerid

container_linux.go:265: starting container process caused "process_linux.go:250: running exec setns process for init caused "exit status 34""

I enabled user_namespace with grubby and rebooted, and the setting seems to be applied:

BOOT_IMAGE=/vmlinuz-3.10.0-327.el7.x86_64 root=UUID=4cb23129-4832-4b25-abfb-be8d63160eac ro crashkernel=auto rd.lvm.lv=vg00/lswap nomodeset rhgb quiet LANG=en_US.UTF-8 user_namespace.enable=1

Still, I cannot run a container as a non-root user. It works when I run it as root, but not as a non-root user.

Any idea?

@erezo erezo changed the title Getting error when trying to run mycontainer example Not able to run container with non root user Jul 10, 2017
@tamagokun

tamagokun commented Jul 21, 2017

Same here.

exit status 34 should be: https://github.com/opencontainers/runc/blob/master/libcontainer/nsenter/nsexec.c#L723 based on other issues where that exit status is discussed.

Using the latest master (1.0.0-rc6-dev)

@ChethanSuresh

ChethanSuresh commented Jul 28, 2017

I received a similar error with exit status 4. The first step that helped me was to run
$ strace -f runc --root /path/to/myroot run id
and see where it fails.
In my case, it was permission denied to write to /proc/self/oom_score_adj file.
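
A quick way to check for this particular failure without strace is to try writing the current value of oom_score_adj back to itself. This is only a sketch: the function name can_write_oom_score_adj is made up for illustration, and it assumes that rewriting the existing value is harmless on your kernel.

```shell
#!/bin/sh
# can_write_oom_score_adj [FILE]: hypothetical helper, not part of runc.
# Succeeds if the current value of FILE can be written back to it.
can_write_oom_score_adj() {
    file=${1:-/proc/self/oom_score_adj}
    [ -r "$file" ] || return 1
    value=$(cat "$file") || return 1
    # Writing the same value back leaves the OOM score adjustment unchanged.
    # The subshell keeps the redirection error message out of the output.
    ( printf '%s\n' "$value" > "$file" ) 2>/dev/null
}

if can_write_oom_score_adj; then
    echo "oom_score_adj: writable"
else
    echo "oom_score_adj: not writable"
fi
```

If this reports "not writable" for a non-root user, the failure is in the kernel's permissions on that file rather than in the namespace setup.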

@cyphar
Member

cyphar commented Aug 5, 2017

In my case, it was permission denied to write to /proc/self/oom_score_adj file.

I fixed this a while ago in 6bd4bd9.

@cyphar
Member

cyphar commented Aug 5, 2017

@erezo I believe @mrunalp may know more about this. In particular, I would not be surprised if it's caused by SELinux or some other interaction where the way we're creating the namespaces isn't the way SELinux wants it to be done.

I also believe he was complaining a few months ago that some of the nsexec changes broke on RHEL in certain cases, but I don't think we ever figured out how we can cleanly solve that issue.

@rhatdan
Contributor

rhatdan commented Aug 5, 2017

On RHEL and CentOS platforms, I believe you have to set a sysctl to allow unprivileged user namespaces.

@cyphar
Member

cyphar commented Aug 5, 2017

@rhatdan Oh, is user_namespace.enable=1 on the kernel cmdline not enough anymore? I thought you only need the sysctl if you didn't set the cmdline.

@TomSweeneyRedHat

'''
On RHEL 7.3 and later, starting in June 2017, it's now:
grubby --default-kernel

/boot/vmlinuz-3.10.0-514.16.1.el7.x86_64
** update namespace.unpriv_enable in kernel from above **
grubby --args=namespace.unpriv_enable=1 --update-kernel /boot/vmlinuz-3.10.0-682.el7.x86_64

The sysctl command that's been added is:
sysctl user.max_user_namespaces=15076

I don't know when (if?) this has been implemented in CentOS.
'''
(Trying with auto formatting off)
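
To see whether the user.max_user_namespaces sysctl mentioned above is in effect, one can read it back from /proc. This is a minimal sketch, assuming the sysctl is exposed at /proc/sys/user/max_user_namespaces (the usual mapping of sysctl names to /proc/sys paths); the function name userns_limit is hypothetical.

```shell
#!/bin/sh
# userns_limit [FILE]: hypothetical helper.
# Prints the limit stored in FILE, or "unavailable" if FILE cannot be read.
userns_limit() {
    file=${1:-/proc/sys/user/max_user_namespaces}
    if [ -r "$file" ]; then
        cat "$file"
    else
        echo "unavailable"
    fi
}

limit=$(userns_limit)
case "$limit" in
    unavailable) echo "kernel does not expose user.max_user_namespaces" ;;
    0)           echo "limit is 0: no user namespaces can be created" ;;
    *)           echo "up to $limit user namespaces allowed" ;;
esac
```

On kernels that predate this sysctl the file is simply absent, so "unavailable" does not by itself mean user namespaces are disabled.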

@cyphar
Member

cyphar commented Aug 5, 2017

Oh I just noticed that @erezo was trying to run inside a Docker container. Docker uses a seccomp policy to disable CLONE_NEWUSER. You'll want to write your own seccomp policy or run the container with --security-opt="seccomp=unconfined"

@erezo
Author

erezo commented Aug 9, 2017

@cyphar I wasn't trying to run inside Docker. I just mentioned the Docker version also installed on that machine.

@ChethanSuresh

I fixed this a while ago in 6bd4bd9.

Sorry for the interruption @cyphar (off the issue topic), but in android-3.18:

  • the proc_score_adj file seems to be created with only user read permissions(400)
  • even a simple echo 1 > /proc/self/oom_score_adj fails with permissions denied for non-root user.

Therefore, update_oom_score_adj failed.

I presume that the change must be on the kernel side. Is my understanding right?

@cyphar
Member

cyphar commented Aug 10, 2017

@chethanmaurian Yes, that's an ABI breakage from the Android kernel fork. I guess they think it makes it more secure for some reason? Here's the commit that made the change.

@cyphar
Member

cyphar commented Aug 10, 2017

@erezo Ah okay, I was a bit confused because the Docker version isn't really relevant unless you're running inside Docker (runc doesn't use Docker, it's the other way around). I'll boot up a CentOS 7.2 VM over the weekend to figure out why it's broken.

@cyphar
Member

cyphar commented Aug 10, 2017

This looks suspiciously like one of those really fun "only CentOS is broken" bugs. In particular, it looks like on CentOS 7 you can't do an unshare(CLONE_NEWUSER|CLONE_NEWNS):

% unshare -U true
% unshare -Uuinprf true
% unshare -Um true
unshare: unshare failed: Operation not permitted

@mrunalp @rhatdan Would this be an SELinux problem? I tried setenforce 0 but it looks like that doesn't change anything. I tried pulling the CentOS kernel sources, but soon realised that there's no patch information, so searching would be a huge pain. (Update: I found the patch, see below.)
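
The unshare checks above can be wrapped in a small script to compare machines. This is a sketch only: the function name probe_unshare is made up, and it relies on nothing beyond the -U and -m flags of unshare(1) already used above.

```shell
#!/bin/sh
# probe_unshare FLAGS: hypothetical helper.
# Reports whether `unshare FLAGS true` succeeds for the current user.
probe_unshare() {
    if ! command -v unshare >/dev/null 2>&1; then
        printf '%s: unshare not installed\n' "$1"
    elif unshare "$1" true 2>/dev/null; then
        printf '%s: ok\n' "$1"
    else
        printf '%s: failed\n' "$1"
    fi
}

probe_unshare -U     # user namespace only
probe_unshare -Um    # user + mount namespace, the combination that failed here
```

If -U succeeds but -Um fails for a non-root user, the machine shows the same behaviour as the CentOS 7 box described in this comment.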

@cyphar
Member

cyphar commented Aug 10, 2017

Nope, it's actually just a very simple patch to fs/namespace.c:

struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns,
		struct user_namespace *user_ns, struct fs_struct *new_fs)
{
	struct mnt_namespace *new_ns;

	BUG_ON(!ns);
	get_mnt_ns(ns);

	if (!(flags & CLONE_NEWNS))
		return ns;

	/* Unprivileged creation currently disabled in RHEL7 */
	if (!capable(CAP_SYS_ADMIN)) {
		put_mnt_ns(ns);
		return ERR_PTR(-EPERM);
	}

	new_ns = dup_mnt_ns(ns, user_ns, new_fs);

	put_mnt_ns(ns);
	return new_ns;
}

Where (obviously) this is the patched part:

	/* Unprivileged creation currently disabled in RHEL7 */
	if (!capable(CAP_SYS_ADMIN)) {
		put_mnt_ns(ns);
		return ERR_PTR(-EPERM);
	}

On paper it should be possible to use runc without mount namespaces. But this is quite disappointing IMO.

@rhatdan
Contributor

rhatdan commented Aug 14, 2017

If you setenforce 0 and something is still denied, 98.9% chance not SELinux.

User namespaces and mounting do not match. The only things you are allowed to mount in a user namespace are tmpfs and bind mounts, I believe. All mounting has to be completed by the container runtime before the user namespace is created. I know that the user namespace guys are working hard to fix this, but it is a very difficult problem.

@cyphar
Member

cyphar commented Aug 14, 2017

@rhatdan If you see this comment, I identified that it's a patch that RHEL applies to completely disable the ability to create a new mount namespace inside a user namespace. You're right it has nothing to do with SELinux, this is the patched part:

	/* Unprivileged creation currently disabled in RHEL7 */
	if (!capable(CAP_SYS_ADMIN)) {
		put_mnt_ns(ns);
		return ERR_PTR(-EPERM);
	}

I'm not exactly sure how this works with Docker, since I was under the impression that --userns-remap wouldn't work if that feature was disabled (maybe you know more about that than me). Note that in other kernels (openSUSE, Fedora, Ubuntu, Debian, etc) this works (the upstream kernel does not impose this restriction). Is there a reason for this decision? Is it possible to make this a sysctl or cmdline knob?

@Omnifarious

@cyphar - Has RH given an explanation of why they apply this patch, or a way to disable it? I know some people who really want to ship a product that relies on creating a mount namespace inside a user namespace.

@cyphar
Member

cyphar commented Feb 28, 2018

I spoke to Eric Biederman about this patch when I last saw him. In short the reason why this patch is applied is because the RHEL kernel doesn't have a lot of the patches required to make mountns-inside-userns secure, so they just disable it. Newer versions of RHEL have this patch removed I believe -- though this won't help you if you're stuck on 7.4.

@rhatdan
Contributor

rhatdan commented Feb 28, 2018

I believe this will be allowed in RHEL7.5 which should be coming in the next couple of months

@Omnifarious

Omnifarious commented Feb 28, 2018

@rhatdan - The company I care about this on behalf of is pretty large, and the feature that depends on this is a pretty major and important feature. We should probably establish contact with Red Hat about this and make sure.

Right now we're considering a lot of fairly hacky workarounds, and all of them involve elevated privilege levels for things we were hoping wouldn't need them. This company telling its customers that they have to upgrade for this feature will not be that unreasonable. They won't like it, but it won't be a disaster.

@FelikZ

FelikZ commented Mar 13, 2018

See moby/moby#35806

@cyphar
Member

cyphar commented Mar 13, 2018

Or more importantly, refer to https://discuss.linuxcontainers.org/t/centos-7-kernel-514-693-cannot-start-any-nodes-after-update/641/17. It looks like they added a boot parameter after I looked at the source above, since the checks allowing it weren't present in the sources I downloaded.

The TL;DR is that on newer RHEL 7.4 releases you need to use both user_namespace.enable=1 and namespace.unpriv_enable=1.
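
To confirm that both parameters actually made it onto the running kernel's command line, something like the following can be used. It is a sketch under stated assumptions: the helper names has_param and check_userns_cmdline are made up, and the check only looks at /proc/cmdline, not at whether the kernel honours the parameters.

```shell
#!/bin/sh
# has_param CMDLINE PARAM: hypothetical helper.
# Succeeds if PARAM appears as a whole space-separated word in CMDLINE.
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# check_userns_cmdline CMDLINE: print one status line per required parameter.
check_userns_cmdline() {
    for param in user_namespace.enable=1 namespace.unpriv_enable=1; do
        if has_param "$1" "$param"; then
            echo "$param: present"
        else
            echo "$param: missing"
        fi
    done
}

# Inspect the running kernel when /proc/cmdline is readable.
if [ -r /proc/cmdline ]; then
    check_userns_cmdline "$(cat /proc/cmdline)"
fi
```

Run against the command line posted at the top of this issue, it would report user_namespace.enable=1 as present and namespace.unpriv_enable=1 as missing.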

@MichaelOVertolli

I had a similar issue on CentOS 7.5 which I localized to sysctl failing to assign user.max_user_namespaces=15076 on boot. It turned out to be an SELinux issue. Calling ausearch -m avc showed a rather weird error. The key part is:

scontext=system_u:system_r:systemd_sysctl_t:s0
tcontext=system_u:system_r:systemd_sysctl_t:s0

Anyway, apparently this is a known bug in kernel 3.10.0. The easiest fix is probably to use audit2allow. My type enforcement file looked like this:

module systemd_sysctl_t 1.0;

require {
  type systemd_sysctl_t;
  class capability sys_resource;
}

allow systemd_sysctl_t self:capability sys_resource;

Few other details:
I modified /etc/docker/daemon.json rather than /etc/sysconfig/docker. The documentation recommends the latter, but I believe the sysconfig approach is deprecated. Consequently, systemctl status docker | grep userns didn't show anything (i.e., that command is not a valid check anymore).

I also used both user_namespace.enable=1 and namespace.unpriv_enable=1. I haven't checked if they are required, though.

@cyphar
Member

cyphar commented Nov 14, 2018

Given all of the above discussion, I'm pretty sure that this can be closed. It's possible that in future RHEL or CentOS releases there will be more out-of-tree knobs. Please open a new issue if that is the case.

@cyphar cyphar closed this as completed Nov 14, 2018