Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set libnetwork sandbox key w/o OCI hooks #44385

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

corhere
Copy link
Contributor

@corhere corhere commented Nov 1, 2022

- What I did
Made another reexec go away while at the same time potentially improving compatibility with some OCI runtimes.

- How I did it
I made it possible for consumers of libcontainerd to create tasks without immediately starting the user process so that they can run arbitrary code before the process starts. Anything that could be done with a prestart or createProcess OCI hook can now also be implemented without one. I then replaced the libnetwork-setkey OCI hook with an in-daemon equivalent.

- How to verify it
CI

- Description for the changelog

  • OCI lifecycle hooks are no longer used to configure container networking. This may improve the compatibility with some alternative runtimes.

- A picture of a cute animal (not mandatory but encouraged)

@corhere corhere marked this pull request as ready for review November 1, 2022 22:52
@corhere corhere marked this pull request as draft November 1, 2022 22:56
@corhere
Copy link
Contributor Author

corhere commented Nov 2, 2022

I realized that builder-next can utilize user namespaces via the IdentityMapping option passed into runcexecutor, therefore my change to builder-next breaks buildkit builds when the daemon is started with the --userns-remap option. (There must be a gap in integration-test coverage as CI didn't catch that breakage.)

#5 0.422 runc run failed: unable to start container process: error during container init: error mounting "sysfs" to rootfs at "/sys": mount sysfs:/sys (via /proc/self/fd/8), flags: 0xf: operation not permitted

That leaves us with a few options:

  1. Buildkit's runcexecutor grows a callback function which is called between the create and start operations: an in-process analogue to the createProcess OCI hook
  2. We keep the libnetwork-setkey reexec hook around in some form or another, just for builder-next
  3. Park this PR until more of the containerd snapshotter integration has been merged, as the builder-next executor is being switched over to use the buildkit's containerdexecutor as part of that effort. That executor would still need to grow an analogous callback to the createProcess OCI hook, but it would be a less-invasive change than on runcexecutor due to the containerd client forcing the separation between NewTask (i.e. create) and Start operations.

@tonistiigi
Copy link
Member

@corhere Can you explain more about why the getNetworkSandbox needs to be between create and start and how does it cause the userns error otherwise? Vanilla buildkit doesn't use any hooks and reuses ns from a pool for performance (it also does not run userns containers).

@thaJeztah How does one trigger userns CI for this PR?

@corhere
Copy link
Contributor Author

corhere commented Nov 2, 2022

Can you explain more about why the getNetworkSandbox needs to be between create and start and how does it cause the userns error otherwise?

For reasons I do not fully understand, runC fails to create a container with a spec that both sets uidMappings/gidMappings and a network namespace with a path. "When a nonuser namespace is created, it is owned by the user namespace in which the creating process was a member at the time of the creation of the namespace." It likely has something to do with the libnetwork-created network namespace being owned by the parent user namespace of the container's user namespace, although the failure mode is not what I would have expected. The original solution (#15187) was to make the runtime create the network namespace so that it is owned by the container's user namespace, and use the libnetwork-setkey hook to configure the network namespace before the user binary is started. Really, all libnetwork-setkey does is write the container PID's /proc/[pid]/ns/net path to a UNIX domain socket the daemon's libnetwork controller is listening on, and wait until libnetwork has finished configuring the namespace. (@cpuguy83 recalls that this was implemented before create and start were distinct runtime operations.) The code between create and start in (*Daemon).containerStart is a 1:1 replacement for the hook, without using the hook. The critical line is sb.SetKey; that's the blocking call where libnetwork configures the network namespace. The preceding getNetworkSandbox call is just plumbing.

@tonistiigi
Copy link
Member

I wonder if we should have a separate codepath for userns and non-userns in this case. Eventually I'd like to have the same netns pooling in buildkit in dockerd that is in upstream as it is the most performant but based on your description that does not seem possible. I assume that dockerd making the user namespace fd itself as well and passing it to runc(then I would assume it can create netns associated with same userns) isn't an option either?

@corhere
Copy link
Contributor Author

corhere commented Nov 2, 2022

Multithreaded processes cannot change their user namespace. The kernel will refuse to setns or unshare the user namespace of any task which shares its virtual memory space with any other task. So while dockerd technically could make the user namespace fd itself, it would have to fork a whole new process to do so. I don't see any advantages over deferring user-namespace creation to the runtime. We could pool and reuse runtime-created user and network namespaces if we wanted to; the daemon merely has to persist them through the usual mechanisms (holding open an fd, bind-mounting) before the container is stopped.

@corhere
Copy link
Contributor Author

corhere commented Mar 1, 2023

I made progress towards #44690, only to discover that containerdexecutor does not have user-namespace support plumbed in. More work is needed on the buildkit side.

Signed-off-by: Cory Snider <csnider@mirantis.com>
The options required by the executor depend on the platform, and soon
will also depend on the values of other options. Give the executor
constructor the flexibility to pull whatever options it needs out of the
Opts struct.

Signed-off-by: Cory Snider <csnider@mirantis.com>
Have libnetwork create the network namespace and pass the path to the
namespace to runC. Switch to buildkit's containerd executor when
userns-remapping is enabled so the in-process OnCreateRuntime callback
can be used in place of the OCI hook.

Signed-off-by: Cory Snider <csnider@mirantis.com>
Signed-off-by: Cory Snider <csnider@mirantis.com>
@corhere
Copy link
Contributor Author

corhere commented Mar 26, 2024

Unfortunately, runc invokes the prestart OCI hooks before it applies the sysctls in the container spec, contrary to what the OCI runtime spec says MUST be done. Moving setting the libnetwork sandbox key to after the create operation completes is therefore a breaking change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants