Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/exec: better support for user namespaces #50098

Open
outofforest opened this issue Dec 10, 2021 · 11 comments
Open

os/exec: better support for user namespaces #50098

outofforest opened this issue Dec 10, 2021 · 11 comments
Labels
Proposal Proposal-Hold WaitingForInfo

Comments

@outofforest
Copy link

@outofforest outofforest commented Dec 10, 2021

I have this scenario:

  1. I want to start new process inside new user namespace (aka rootless container).
  2. unshare syscall allows me to map root user only.
  3. podman is able to map subids too by calling newuidmap and newgidmap (it is implemented here: https://github.com/containers/storage/tree/main/pkg/unshare/unshare_linux.go)
  4. To make it work they implemented complex wrapper around exec.Cmd plus some C code to call unshare in the middle of the process, after setting full mapping set
  5. For some reason calling syscall.Unshare(syscall.CLONE_NEWUSER) returns invalid argument error even if called from goroutine pinned to thread.

It would be nice to have more "goish" way to do it, like this:

	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUSER,
		UidMappings: []syscall.SysProcIDMap{
			{
				HostID:      1000,
				ContainerID: 0,
				Size:        1,
			},
			{
				HostID:      1,
				ContainerID: 100000,
				Size:        65536,
			},
		},
		GidMappings: []syscall.SysProcIDMap{
			{
				HostID:      1000,
				ContainerID: 0,
				Size:        1,
			},
			{
				HostID:      1,
				ContainerID: 100000,
				Size:        65536,
			},
		},
	}
@seankhliao seankhliao changed the title Better support for user namespaces, affected package: exec os/exec: better support for user namespaces Dec 10, 2021
@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Dec 11, 2021

This is supposed to be done by using the Unshareflags field of syscall.SysProcAttr. Why does that not suffice? Thanks.

@ianlancetaylor ianlancetaylor added the WaitingForInfo label Dec 11, 2021
@outofforest
Copy link
Author

@outofforest outofforest commented Dec 11, 2021

This is supposed to be done by using the Unshareflags field of syscall.SysProcAttr. Why does that not suffice? Thanks.

Behavior is the same no matter if I use Cloneflags or Unshareflags. They both work if only root uid is mapped but fails if I try to map subids too. This is by design because that's the limitation of linux unshare for unprivileged user. If I try to map both scopes I get error: fork/exec /tmp/executor-468678108: operation not permitted.

Software like podman use alternative implementation of exec.Cmd which calls newuidmap, newgidmap, reexec and C wrapper to bypass this limitation. It would be nice to have it implemented in go directly.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Dec 11, 2021

What would we have to change to make this work? Thanks.

@outofforest
Copy link
Author

@outofforest outofforest commented Dec 11, 2021

What would we have to change to make this work? Thanks.

I believe ready to use implementation is here: https://github.com/containers/storage/tree/main/pkg/unshare/

@outofforest
Copy link
Author

@outofforest outofforest commented Dec 11, 2021

BTW: the best approach IMO would be to implement fork. Yes, I know it's dangerous and golang team refused to do it many time but actually this is the best way to switch to namespace:

  1. Fork the process
  2. Inside parent set id mappings
  3. Inside child call unshare

Whenever someone shares this idea golang team answers that the correct way of doing this is to use exec.Cmd. But:

  1. as I shown you above, it doesn't work anyway
  2. it makes us impossible to provide "container as a library".

Imagine this situation:
I want to create a library which allows to do some operations inside namespace.
To do it now I have to run separate binary using exec.Cmd. All existing software do that by reexecuting /proc/self/exe. But this is not a good way if you want to create a library, reexecuting current binary from a library is extremely weird idea. Now I do it by embedding binary code inside library, saving it later in tmp location and executing it in a namespace. This is sooooo weird solution but the only one available because fork is not present.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Dec 12, 2021

It's infeasible to implement fork in the Go standard library. After a program forks, there is only a single thread running in the child process. That means that the basic assumptions of the Go runtime fall apart. Almost any ordinary Go operation can potentially fail in that case. The syscall code that starts a child is very carefully written to avoid problems. We can't permit ordinary Go code to run between the fork and the execve.

Thanks for pointing to the package. There is a lot there, and I don't know what matters. It would help this proposal a great deal if you or somebody could write down exactly what would need to be added to syscall.ProcSysAttr to make things work. In your initial comment you mentioned Cloneflags, UidMappings, and GidMappings, but all of those already exist. What do we need that is new?

@outofforest
Copy link
Author

@outofforest outofforest commented Dec 13, 2021

@ianlancetaylor

I'm not a C expert but linked library works this way:

  1. New process is started using exec.Cmd. It reexecutes /proc/self/exe so the same binary is executed again in this case, but in general one it should allow to start any binary.
  2. Executed go binary is prepared in some tricky way. As far as I understand, it runs C function before go runtime takes control:
    https://github.com/containers/storage/blob/main/pkg/unshare/unshare_gccgo.go#L5
// #cgo CFLAGS: -Wall -Wextra
// extern void _containers_unshare(void);
// void __attribute__((constructor)) init(void) {
//   _containers_unshare();
// }
import "C"
  1. _containers_unshare C function does the real stuff related to creating a namespace: https://github.com/containers/storage/blob/main/pkg/unshare/unshare.c#L299

And this is the flow:

  1. Binary is reexecuted so we have a parent and child.
  2. Child starts the C code, sends its PID to parent (https://github.com/containers/storage/blob/main/pkg/unshare/unshare.c#L319) and waits (https://github.com/containers/storage/blob/main/pkg/unshare/unshare.c#L328).
  3. Parent takes child process ID and calls newuidmap and newgidmap linux binaries to set full mappings (both root user and subids if defined in /etc/subuid and /etc/subgid): https://github.com/containers/storage/blob/main/pkg/unshare/unshare_linux.go#L266
    https://github.com/containers/storage/blob/main/pkg/unshare/unshare_linux.go#L217
    It may be done by unprivileged user because newuidmap and newgidmap have appropriate capability or sticky bit set, depending on linux distro.
  4. Now child gets a signal (using pipe) to continue.
  5. Child reexecutes itself again (this time in C code): https://github.com/containers/storage/blob/main/pkg/unshare/unshare.c#L372
  6. Finally after 2 reexecs we have go binary running inside user namespce with both root and subids being mapped.

In this setup flags for unshare are passed using _Containers-unshare env variable:
https://github.com/containers/storage/blob/main/pkg/unshare/unshare_linux.go#L86
https://github.com/containers/storage/blob/main/pkg/unshare/unshare.c#L304

Entire process is sooooo complicated...
It would be much easier if go executable could call unshare before real go runtime starts.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Dec 15, 2021

We are not going to permit calling ordinary Go code between fork and execve.

I don't know what the newuidmap and newgidmap binaries are.

For this proposal to move forward we're going to need more precise details as to exactly what needs to be implemented in the syscall package. I'm going to put this on hold for now.

@outofforest
Copy link
Author

@outofforest outofforest commented Dec 15, 2021

I don't know what the newuidmap and newgidmap binaries are.

https://man7.org/linux/man-pages/man1/newuidmap.1.html
https://man7.org/linux/man-pages/man1/newgidmap.1.html

For this proposal to move forward we're going to need more precise details as to exactly what needs to be implemented in the syscall package. I'm going to put this on hold for now.

I think I provided all the details on how the flow should look like. But once again:

  1. fork
  2. call newuidmap and newgidmap in parent process
  3. call unshare in child process before go runtime takes control over execution

go is frequently used to develop containerization engines so it would be nice to implement this once for all.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Dec 15, 2021

Why is it necessary to call an external binary to make this work? Why can't we do this entirely with system calls?

@outofforest
Copy link
Author

@outofforest outofforest commented Dec 16, 2021

Why is it necessary to call an external binary to make this work? Why can't we do this entirely with system calls?

Unprivileged user may create only a single mapping: id in container -> current uid/gid on host. Providing more mappings causes "permission denied" error, even if mapped ids are specified in /etc/subuid or /etc/subgid. This is by design in linux.
Calling newuidmap / newgidmap bypasses this limitation because those binaries have sticky bit or capability set (depending on distro) so they always run with privileges required to set extended mappings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Proposal Proposal-Hold WaitingForInfo
Projects
None yet
Development

No branches or pull requests

2 participants