Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] containerdexecutor: add network namespace callback #3254

Merged

Conversation

corhere
Copy link
Contributor

@corhere corhere commented Nov 2, 2022

In order to support identity mapping and user namespaces, dockerd needs to defer the creation of a container's network namespace to the runtime and hook into the container lifecycle to configure the network namespace before the user binary is started. The standard way to do so is by configuring a createRuntime OCI lifecycle hook, in which the OCI runtime executes a specified process in the runtime environment after the container has been created and before it is started. In the case of dockerd the network namespace needs to be configured from the daemon process, which necessitates that the hook process communicate with the daemon process. This is complicated and slow. All the hook process does is inform the daemon of the container's PID and wait until the daemon has finished applying the network namespace configuration, but this requires IPC and synchronization.

There is an alternative to the createRuntime OCI hook which containerd clients can take advantage of. The container.NewTask method is directly analogous to the OCI create operation, and the task.Start method is directly analogous to the OCI start operation. Any operations performed between the NewTask and Start calls are therefore directly analogous to createRuntime OCI hooks, without needing to execute any external processes! Provide a mechanism for network.Namespace instances to register a callback function which can be used to configure a container's network namespace instead of, or in addition to, createRuntime OCI hooks.

(RFC because containerdexecutor does not have ID mapping wired up, though AIUI that will need to change as dockerd needs to be migrated over to containerdexecutor as part of the containerd snapshotter integration project and ID mapping is supported with the runcexecutor currently integrated into dockerd.)

@corhere
Copy link
Contributor Author

corhere commented Nov 2, 2022

@corhere
Copy link
Contributor Author

corhere commented Nov 2, 2022

To preempt the obvious question: runcexecutor could also implement the same behaviour, albeit with a larger diff and increased overhead, by switching from runc.Run() to runc.Create() + runc.Start() and either a PID file or runc.State().

// [network.Namespace] which also implements this interface, the containerd
// executor will run the callback at the appropriate point in the container
// lifecycle.
type NetworkNamespaceHooker interface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're not sure yet if we need this; perhaps un-export it for now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer it be exported so that implementations can assert that they implement the interface such that the interface being changed/removed will result in a compile error.

var _ containerdexecutor.OnCreateRuntimer = (*MyNamespaceImpl)(nil)

@tonistiigi
Copy link
Member

As discussed in the maintainers meeting:

  • Update the name
  • Make sure this works with both executors
  • Optional: I'm ok with just updating the existing interface. Adding this functionality in hidden functions just increases the risk that future maintainers/implementations will not know about these and things will break unexpectedly. We shouldn't be afraid of updating existing interfaces if it makes sense.

Note that we are about to close v0.11 milestone, so please update and remove draft asap if you need this for it.

@tonistiigi
Copy link
Member

Update: I missed the comment about runc.Run and the overhead(and other possible issues) implementing this for runc executor causes. This is quite problematic as changing Moby to use containerd executor would also have overhead(and maybe some cleanup issues by default). If you want it you can just update the name and we can get it in(maybe unexported like Seb suggested). If all executors do not support it then we shouldn't change the existing interface.

I think best performing solution for the 95% case for this is to use runc.Run() and set the pid path in spec directly in Moby. This works in all non-userns cases afaik and is the only (practical) way for Moby code to do similar netns pooling as the CNI implementation does. Netns pooling can save ~150-200ms per container and is very significant.

@corhere
Copy link
Contributor Author

corhere commented Dec 9, 2022

It's going to be more invasive than I'd thought to make this work with the runc executor. It currently uses foreground mode, but would have to use detached mode in order to separate create from run, which means dealing with subreapers and IO no longer being proxied through a foreground runc process. In other words, the runc executor would have to become more like containerd.

In order to support identity mapping and user namespaces, the Moby
project needs to defer the creation of a container's network namespace
to the runtime and hook into the container lifecycle to configure the
network namespace before the user binary is started. The standard way to
do so is by configuring a `createRuntime` OCI lifecycle hook, in which
the OCI runtime executes a specified process in the runtime environment
after the container has been created and before it is started. In the
case of Moby the network namespace needs to be configured from the
daemon process, which necessitates that the hook process communicate
with the daemon process. This is complicated and slow. All the hook
process does is inform the daemon of the container's PID and wait until
the daemon has finished applying the network namespace configuration.

There is an alternative to the `createRuntime` OCI hook which containerd
clients can take advantage of. The `container.NewTask` method is
directly analogous to the OCI create operation, and the `task.Start`
method is directly analogous to the OCI start operation. Any operations
performed between the `NewTask` and `Start` calls are therefore directly
analogous to `createRuntime` OCI hooks, without needing to execute any
external processes! Provide a mechanism for network.Namespace instances
to register a callback function which can be used to configure a
container's network namespace instead of, or in addition to,
`createRuntime` OCI hooks.

Signed-off-by: Cory Snider <csnider@mirantis.com>
@corhere corhere force-pushed the c8dexecutor-inprocess-lifecycle-hook branch from bf1c3d1 to b5fdf90 Compare December 9, 2022 00:46
@corhere corhere marked this pull request as ready for review December 9, 2022 17:21
Copy link
Member

@tonistiigi tonistiigi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in maintainers meeting, on moby side this should be followed up with:

  • if NOT userns: keep runcexector, remove libnetwork hook and write netns directly to OCI spec. In the future netns can be pooled as well like the CNI bridge does today.
  • if userns: switch to containerdexecutor. Remove libnetwork hook and replace with the detection on this new interface that gets the netns path between create and start.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants