[RFC] containerdexecutor: add network namespace callback #3254

corhere · 2022-11-02T23:33:51Z

In order to support identity mapping and user namespaces, dockerd needs to defer the creation of a container's network namespace to the runtime and hook into the container lifecycle to configure the network namespace before the user binary is started. The standard way to do so is by configuring a createRuntime OCI lifecycle hook, in which the OCI runtime executes a specified process in the runtime environment after the container has been created and before it is started. In the case of dockerd the network namespace needs to be configured from the daemon process, which necessitates that the hook process communicate with the daemon process. This is complicated and slow. All the hook process does is inform the daemon of the container's PID and wait until the daemon has finished applying the network namespace configuration, but this requires IPC and synchronization.

There is an alternative to the createRuntime OCI hook which containerd clients can take advantage of. The container.NewTask method is directly analogous to the OCI create operation, and the task.Start method is directly analogous to the OCI start operation. Any operations performed between the NewTask and Start calls are therefore directly analogous to createRuntime OCI hooks, without needing to execute any external processes! Provide a mechanism for network.Namespace instances to register a callback function which can be used to configure a container's network namespace instead of, or in addition to, createRuntime OCI hooks.

See "Set libnetwork sandbox key w/o OCI hooks" (moby#44385) for more details and motivation.

(RFC because containerdexecutor does not have ID mapping wired up, though AIUI that will need to change as dockerd needs to be migrated over to containerdexecutor as part of the containerd snapshotter integration project and ID mapping is supported with the runcexecutor currently integrated into dockerd.)

corhere · 2022-11-02T23:37:42Z

PTAL @crazy-max @tonistiigi @jedevc @thaJeztah

corhere · 2022-11-02T23:44:30Z

To preempt the obvious question: runcexecutor could also implement the same behaviour, albeit with a larger diff and increased overhead, by switching from runc.Run() to runc.Create() + runc.Start() and either a PID file or runc.State().

thaJeztah · 2022-11-10T20:24:42Z

executor/containerdexecutor/executor.go

+// [network.Namespace] which also implements this interface, the containerd
+// executor will run the callback at the appropriate point in the container
+// lifecycle.
+type NetworkNamespaceHooker interface {


If we're not sure yet if we need this; perhaps un-export it for now?

corhere · 2022-12-09

I would prefer that it be exported so that implementations can assert that they implement the interface. That way, changing or removing the interface results in a compile error.

var _ containerdexecutor.OnCreateRuntimer = (*MyNamespaceImpl)(nil)
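A self-contained sketch of why that assertion matters: an optional interface is normally detected with a runtime type assertion, which silently reports `false` if the method set drifts, whereas the blank-identifier assignment fails at compile time. The `OnCreateRuntimer` declared here is a local stand-in with an assumed method signature, and `MyNamespaceImpl` is the placeholder type from the line above.

```go
package main

import "fmt"

// OnCreateRuntimer is a local stand-in for the exported interface; the
// method signature is an assumption made for this example.
type OnCreateRuntimer interface {
	OnCreateRuntime(pid uint32) error
}

// MyNamespaceImpl is a placeholder network namespace implementation.
type MyNamespaceImpl struct {
	lastPid uint32
}

func (n *MyNamespaceImpl) OnCreateRuntime(pid uint32) error {
	n.lastPid = pid
	return nil
}

// Compile-time assertion: if OnCreateRuntimer is renamed, removed, or its
// method set changes, this line stops compiling.
var _ OnCreateRuntimer = (*MyNamespaceImpl)(nil)

func main() {
	// Runtime detection, as an executor would do for an optional interface.
	// Without the assertion above, a mismatch would only show up here as
	// ok == false and the callback would silently never run.
	var ns any = &MyNamespaceImpl{}
	if h, ok := ns.(OnCreateRuntimer); ok {
		_ = h.OnCreateRuntime(1234)
	}
	fmt.Println(ns.(*MyNamespaceImpl).lastPid) // prints 1234
}
```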

tonistiigi · 2022-12-03T04:32:07Z

As discussed in the maintainers meeting:

- Update the name
- Make sure this works with both executors
- Optional: I'm ok with just updating the existing interface. Adding this functionality in hidden functions just increases the risk that future maintainers/implementations will not know about these and things will break unexpectedly. We shouldn't be afraid of updating existing interfaces if it makes sense.

Note that we are about to close the v0.11 milestone, so please update and remove the draft status ASAP if you need this for it.

tonistiigi · 2022-12-03T04:44:58Z

Update: I missed the comment about runc.Run and the overhead (and other possible issues) that implementing this for the runc executor causes. This is quite problematic, as changing Moby to use the containerd executor would also have overhead (and maybe some cleanup issues by default). If you want it, you can just update the name and we can get it in (maybe unexported, like Seb suggested). If not all executors support it, then we shouldn't change the existing interface.

I think the best-performing solution for the 95% case is to use runc.Run() and set the pid path in the spec directly in Moby. This works in all non-userns cases afaik and is the only (practical) way for Moby code to do netns pooling similar to what the CNI implementation does. Netns pooling can save ~150-200ms per container, which is very significant.

corhere · 2022-12-09T00:43:20Z

It's going to be more invasive than I'd thought to make this work with the runc executor. It currently uses foreground mode, but would have to use detached mode in order to separate create from run, which means dealing with subreapers and IO no longer being proxied through a foreground runc process. In other words, the runc executor would have to become more like containerd.

In order to support identity mapping and user namespaces, the Moby project needs to defer the creation of a container's network namespace to the runtime and hook into the container lifecycle to configure the network namespace before the user binary is started. The standard way to do so is by configuring a `createRuntime` OCI lifecycle hook, in which the OCI runtime executes a specified process in the runtime environment after the container has been created and before it is started. In the case of Moby the network namespace needs to be configured from the daemon process, which necessitates that the hook process communicate with the daemon process. This is complicated and slow. All the hook process does is inform the daemon of the container's PID and wait until the daemon has finished applying the network namespace configuration.

There is an alternative to the `createRuntime` OCI hook which containerd clients can take advantage of. The `container.NewTask` method is directly analogous to the OCI create operation, and the `task.Start` method is directly analogous to the OCI start operation. Any operations performed between the `NewTask` and `Start` calls are therefore directly analogous to `createRuntime` OCI hooks, without needing to execute any external processes!

Provide a mechanism for network.Namespace instances to register a callback function which can be used to configure a container's network namespace instead of, or in addition to, `createRuntime` OCI hooks.

Signed-off-by: Cory Snider <csnider@mirantis.com>

tonistiigi

As discussed in the maintainers meeting, on the Moby side this should be followed up with:

- if NOT userns: keep runcexecutor, remove the libnetwork hook, and write the netns directly to the OCI spec. In the future the netns can be pooled as well, like the CNI bridge does today.
- if userns: switch to containerdexecutor. Remove the libnetwork hook and replace it with detection of this new interface, which gets the netns path between create and start.
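As a rough sketch of the non-userns branch, "writing the netns directly to the OCI spec" amounts to giving the spec's `network` namespace entry an explicit `path`. The struct below is a local stand-in that mirrors only the `type` and `path` fields of a namespace entry in an OCI runtime config; `setNetNS` and the example path are hypothetical and not taken from Moby or BuildKit.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// linuxNamespace is a local stand-in for one entry of linux.namespaces in
// an OCI runtime config. Only the fields used here are modeled.
type linuxNamespace struct {
	Type string `json:"type"`
	Path string `json:"path,omitempty"`
}

// setNetNS (hypothetical helper) returns a copy of namespaces in which the
// network entry joins the namespace at path instead of creating a new one.
func setNetNS(namespaces []linuxNamespace, path string) []linuxNamespace {
	out := make([]linuxNamespace, 0, len(namespaces)+1)
	for _, ns := range namespaces {
		if ns.Type != "network" {
			out = append(out, ns)
		}
	}
	return append(out, linuxNamespace{Type: "network", Path: path})
}

func main() {
	nss := []linuxNamespace{{Type: "pid"}, {Type: "network"}, {Type: "mount"}}
	// The path is an example value; a real caller would use the namespace
	// it created (or took from a pool) ahead of time.
	nss = setNetNS(nss, "/var/run/netns/example")

	b, err := json.MarshalIndent(nss, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

With the path set up front, the runtime joins an existing namespace, so nothing has to run between create and start in this case.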

thaJeztah reviewed Nov 10, 2022


corhere force-pushed the c8dexecutor-inprocess-lifecycle-hook branch from bf1c3d1 to b5fdf90 Compare December 9, 2022 00:46

corhere marked this pull request as ready for review December 9, 2022 17:21

tonistiigi approved these changes Dec 22, 2022


tonistiigi merged commit e0220af into moby:master Dec 22, 2022

thaJeztah mentioned this pull request Dec 22, 2022

containerdexecutor: add network namespace callback follow-ups moby/moby#44690

Open

tonistiigi mentioned this pull request Jan 6, 2023

[v0.11] cherry-picks #3463

Merged
