Set libnetwork sandbox key w/o OCI hooks #44385

corhere · 2022-11-01T20:42:43Z

- What I did
Made another reexec go away while at the same time potentially improving compatibility with some OCI runtimes.

- How I did it
I made it possible for consumers of libcontainerd to create tasks without immediately starting the user process so that they can run arbitrary code before the process starts. Anything that could be done with a prestart or createProcess OCI hook can now also be implemented without one. I then replaced the libnetwork-setkey OCI hook with an in-daemon equivalent.
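
As a rough illustration of the create/start split described above, here is a minimal Go sketch. `Task`, `Create`, `Start` and `Run` are hypothetical names invented for this example, not the libcontainerd API; the point is only that arbitrary in-process code can run between the two steps, where an OCI hook would otherwise be needed.

```go
package main

import (
	"errors"
	"fmt"
)

// Task is a hypothetical stand-in for a created-but-not-started container task.
type Task struct {
	PID     int
	started bool
	log     []string // records the order of lifecycle steps
}

// Create sets up the task without running the user process.
func Create(pid int) *Task {
	return &Task{PID: pid, log: []string{"create"}}
}

// Start runs the user process. It fails if the task was already started.
func (t *Task) Start() error {
	if t.started {
		return errors.New("task already started")
	}
	t.started = true
	t.log = append(t.log, "start")
	return nil
}

// Run creates a task, calls beforeStart in-process, and only then starts
// the user process. If beforeStart fails, the task is never started.
func Run(pid int, beforeStart func(*Task) error) (*Task, error) {
	t := Create(pid)
	if err := beforeStart(t); err != nil {
		return t, fmt.Errorf("pre-start step failed: %w", err)
	}
	return t, t.Start()
}

func main() {
	t, err := Run(4242, func(t *Task) error {
		// This is where an in-daemon replacement for a hook would do its work.
		t.log = append(t.log, fmt.Sprintf("configure /proc/%d/ns/net", t.PID))
		return nil
	})
	fmt.Println(t.log, err)
}
```

Because the callback runs in the caller's process, it can return an error directly and abort the start, with no reexec and no hook binary involved.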

- How to verify it
CI

- Description for the changelog

OCI lifecycle hooks are no longer used to configure container networking. This may improve compatibility with some alternative runtimes.

- A picture of a cute animal (not mandatory but encouraged)

corhere · 2022-11-02T00:15:42Z

I realized that builder-next can utilize user namespaces via the IdentityMapping option passed into runcexecutor, so my change to builder-next breaks buildkit builds when the daemon is started with the --userns-remap option. (There must be a gap in integration-test coverage, as CI didn't catch that breakage.)

#5 0.422 runc run failed: unable to start container process: error during container init: error mounting "sysfs" to rootfs at "/sys": mount sysfs:/sys (via /proc/self/fd/8), flags: 0xf: operation not permitted

That leaves us with a few options:

- Buildkit's runcexecutor grows a callback function which is called between the create and start operations: an in-process analogue to the createProcess OCI hook
- We keep the libnetwork-setkey reexec hook around in some form or another, just for builder-next
- Park this PR until more of the containerd snapshotter integration has been merged, as the builder-next executor is being switched over to use buildkit's containerdexecutor as part of that effort. That executor would still need to grow an analogous callback to the createProcess OCI hook, but it would be a less-invasive change than on runcexecutor due to the containerd client forcing the separation between NewTask (i.e. create) and Start operations.

tonistiigi · 2022-11-02T01:37:52Z

@corhere Can you explain more about why the getNetworkSandbox needs to be between create and start and how does it cause the userns error otherwise? Vanilla buildkit doesn't use any hooks and reuses ns from a pool for performance (it also does not run userns containers).

@thaJeztah How does one trigger userns CI for this PR?

corhere · 2022-11-02T15:04:32Z

> Can you explain more about why the getNetworkSandbox needs to be between create and start and how does it cause the userns error otherwise?

For reasons I do not fully understand, runC fails to create a container with a spec that sets both uidMappings/gidMappings and a network namespace with a path. "When a nonuser namespace is created, it is owned by the user namespace in which the creating process was a member at the time of the creation of the namespace." It likely has something to do with the libnetwork-created network namespace being owned by the parent user namespace of the container's user namespace, although the failure mode is not what I would have expected.

The original solution (#15187) was to make the runtime create the network namespace so that it is owned by the container's user namespace, and use the libnetwork-setkey hook to configure the network namespace before the user binary is started. Really, all libnetwork-setkey does is write the container process's /proc/[pid]/ns/net path to a UNIX domain socket the daemon's libnetwork controller is listening on, and wait until libnetwork has finished configuring the namespace. (@cpuguy83 recalls that this was implemented before create and start were distinct runtime operations.)

The code between create and start in (*Daemon).containerStart is a 1:1 replacement for the hook, without using the hook. The critical line is sb.SetKey; that's the blocking call where libnetwork configures the network namespace. The preceding getNetworkSandbox call is just plumbing.
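
To make the hook's behaviour concrete, here is a small self-contained Go sketch of the exchange described above: one side writes a /proc/[pid]/ns/net path to a UNIX domain socket and blocks until the other side acknowledges. The JSON request shape, the "ok" reply and every function name are assumptions made for illustration; this is not the actual libnetwork-setkey protocol or code.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"net"
	"os"
	"path/filepath"
	"strings"
)

// setKeyRequest is a hypothetical wire format, not libnetwork's actual one.
type setKeyRequest struct {
	ContainerID string
	Key         string // path to the container's network namespace
}

// serveOne plays the daemon side: accept one request, run configure
// (standing in for the blocking sb.SetKey work), then acknowledge.
func serveOne(l net.Listener, configure func(setKeyRequest) error) error {
	conn, err := l.Accept()
	if err != nil {
		return err
	}
	defer conn.Close()
	var req setKeyRequest
	if err := json.NewDecoder(conn).Decode(&req); err != nil {
		return err
	}
	if err := configure(req); err != nil {
		fmt.Fprintln(conn, "error:", err)
		return err
	}
	_, err = fmt.Fprintln(conn, "ok")
	return err
}

// setKey plays the hook side: send the netns path for pid, then block
// until the daemon replies.
func setKey(socket, containerID string, pid int) (string, error) {
	conn, err := net.Dial("unix", socket)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	req := setKeyRequest{ContainerID: containerID, Key: fmt.Sprintf("/proc/%d/ns/net", pid)}
	if err := json.NewEncoder(conn).Encode(req); err != nil {
		return "", err
	}
	reply, err := bufio.NewReader(conn).ReadString('\n')
	return strings.TrimSpace(reply), err
}

// runExample wires both sides together over a socket in a temp directory.
func runExample() (string, setKeyRequest, error) {
	dir, err := os.MkdirTemp("", "setkey")
	if err != nil {
		return "", setKeyRequest{}, err
	}
	defer os.RemoveAll(dir)
	socket := filepath.Join(dir, "setkey.sock")
	l, err := net.Listen("unix", socket)
	if err != nil {
		return "", setKeyRequest{}, err
	}
	defer l.Close()
	seen := make(chan setKeyRequest, 1)
	go serveOne(l, func(r setKeyRequest) error { seen <- r; return nil })
	reply, err := setKey(socket, "abc123", 4242)
	if err != nil {
		return "", setKeyRequest{}, err
	}
	return reply, <-seen, nil
}

func main() {
	reply, req, err := runExample()
	fmt.Println(reply, req.Key, err)
}
```

The in-daemon replacement needs none of this plumbing: the daemon already knows the task's PID after create, so it can call into libnetwork directly before start.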

tonistiigi · 2022-11-02T17:55:13Z

I wonder if we should have a separate codepath for userns and non-userns in this case. Eventually I'd like buildkit in dockerd to have the same netns pooling that upstream has, as it is the most performant, but based on your description that does not seem possible. I assume that having dockerd create the user namespace fd itself and pass it to runc (so that it could then create a netns associated with the same userns) isn't an option either?

corhere · 2022-11-02T18:15:33Z

Multithreaded processes cannot change their user namespace. The kernel will refuse to setns or unshare the user namespace of any task which shares its virtual memory space with any other task. So while dockerd technically could make the user namespace fd itself, it would have to fork a whole new process to do so. I don't see any advantages over deferring user-namespace creation to the runtime. We could pool and reuse runtime-created user and network namespaces if we wanted to; the daemon merely has to persist them through the usual mechanisms (holding open an fd, bind-mounting) before the container is stopped.
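
For the "holding open an fd" mechanism mentioned above, a minimal Go sketch could look like the following. `nsPath` and `holdNamespace` are hypothetical helpers written for this example, and the sketch only covers keeping a namespace alive; pooling and reuse would need more bookkeeping on top.

```go
package main

import (
	"fmt"
	"os"
)

// nsPath returns the procfs handle for one of a process's namespaces,
// e.g. kind "net" or "user".
func nsPath(pid int, kind string) string {
	return fmt.Sprintf("/proc/%d/ns/%s", pid, kind)
}

// holdNamespace opens the namespace handle. While the returned file stays
// open, the kernel keeps the namespace alive even if every process in it exits.
func holdNamespace(pid int, kind string) (*os.File, error) {
	return os.Open(nsPath(pid, kind))
}

func main() {
	// Uses the current process only so the sketch runs anywhere on Linux;
	// a daemon would pass the container's init PID instead.
	f, err := holdNamespace(os.Getpid(), "net")
	if err != nil {
		fmt.Println("could not open namespace handle:", err)
		return
	}
	defer f.Close()
	fmt.Println("holding", f.Name())
}
```

The handle has to be opened while the container's init process is still running, which matches the "before the container is stopped" constraint above.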

corhere · 2023-03-01T21:47:34Z

I made progress towards #44690, only to discover that containerdexecutor does not have user-namespace support plumbed in. More work is needed on the buildkit side.

Signed-off-by: Cory Snider <csnider@mirantis.com>

The options required by the executor depend on the platform, and soon will also depend on the values of other options. Give the executor constructor the flexibility to pull whatever options it needs out of the Opts struct.

Signed-off-by: Cory Snider <csnider@mirantis.com>

Have libnetwork create the network namespace and pass the path to the namespace to runC. Switch to buildkit's containerd executor when userns-remapping is enabled so the in-process OnCreateRuntime callback can be used in place of the OCI hook.

Signed-off-by: Cory Snider <csnider@mirantis.com>

Signed-off-by: Cory Snider <csnider@mirantis.com>

corhere · 2024-03-26T23:08:27Z

Unfortunately, runc invokes the prestart OCI hooks before it applies the sysctls in the container spec, contrary to what the OCI runtime spec says MUST be done. Setting the libnetwork sandbox key only after the create operation completes would change that ordering, so it is a breaking change.

corhere force-pushed the drop-oci-lifecycle-hooks branch from 84bf8df to 18222cb Compare November 1, 2022 21:44

corhere marked this pull request as ready for review November 1, 2022 22:52

corhere requested review from cpuguy83 and tonistiigi as code owners November 1, 2022 22:52

corhere marked this pull request as draft November 1, 2022 22:56

corhere mentioned this pull request Nov 2, 2022

[RFC] containerdexecutor: add network namespace callback moby/buildkit#3254

Merged

corhere mentioned this pull request Nov 17, 2022

libnetwork: eliminate almost all reexecs #44491

Merged

corhere mentioned this pull request Dec 22, 2022

containerdexecutor: add network namespace callback follow-ups #44690

Open

corhere force-pushed the drop-oci-lifecycle-hooks branch 2 times, most recently from 4c4b641 to 96e4118 Compare March 1, 2023 21:44

corhere mentioned this pull request Apr 27, 2023

libnetwork: processSetKeyReexec() remove defer(), and some refactoring #43506

Merged

This was referenced Jan 9, 2024

libcontainerd: create unstarted tasks #47052

Merged

Detect IPv6 support in containers, generate '/etc/hosts' accordingly. #47062

Merged

corhere force-pushed the drop-oci-lifecycle-hooks branch from 96e4118 to ead1361 Compare January 19, 2024 18:03

corhere added 4 commits January 19, 2024 15:00

daemon: set libnetwork sandbox key w/o OCI hook

4b0cf3d

Signed-off-by: Cory Snider <csnider@mirantis.com>

libnetwork: remove unused libnetwork-setkey reexec

6700cc1

Signed-off-by: Cory Snider <csnider@mirantis.com>

corhere force-pushed the drop-oci-lifecycle-hooks branch from ead1361 to 6700cc1 Compare January 19, 2024 20:00
