-
Notifications
You must be signed in to change notification settings - Fork 18.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Per-interface sysctls #47639
Comments
This is the part where things may become fuzzy. ISTR we either have a hard-coded list of sysctls that were known to be namespaced ( |
Ah - yes ... https://docs.docker.com/reference/cli/docker/container/run/#sysctl - so, I guess we're safe in terms of namespacing if we limit the new options to Beyond that, and checking the setting is interface-specific - at some later point we may decide to error on settings that definitely break things for a particular driver ... but that's about making the footgun hard to aim, we probably can't remove it completely because we won't know the user's intent. If we find we need to allow access to more than just So, I'd probably leave it at that - rather than try to build in setting-specific warnings which, as you say, will be a moving target? |
I think the best UX for such hypothetical driver would be a mechanism that allows to name each iface independently. Then, sysctls would have to be specified with the name component set. If we want to make sure we can support this in the future, then I think it's best to: 1. keep the name component in the sysctl (no discoverability issue) ; 2. require users to name their interfaces when they want to use One downside with this, we're moving the "multi-iface" problem from the |
Yes, it's the "moving target" that may be problematic. If these are things we can say for sure (guaranteed) to be non-functional, we could error, but if not, maybe a warning (at best?). And even there, there's a bit of a risk of it becoming a sliding-slope if it means we need to continue codifying too much knowledge about Linux and sysctls (based on assumptions we can currently make). That said; providing a useful warning to the user also has a place so "win some", "loose some" I guess. |
Being able to name interfaces in the container might be a good feature - but it'd be nice to keep it separate from this change. Tying the two together for the sake of hypothetical driver might be wasted effort. But, it's worth knowing we're not backing ourselves into a corner ... so, we'd need a way to refer to the hypothetical driver's interface roles. If the driver has interfaces with roles "ns" and "ew" (which could be instantiated as "eth0"/"eth1", "eth1"/"eth0", "ns0"/"ew0", or anything else if we add an 'ifname' option) - something like this would do the trick, and we'd be able extend the syntax to accommodate it later, if needed? ... |
Indeed - as a starting point, I suggest we just steer clear. If we find people are getting themselves into trouble with any particular options, or there's a driver that will always break if a setting is modified but people modify it anyway ... we can add an error/warning at that point. The sysctls will be applied by libnetwork when it moves the interface into the network namespace but, if we find we need a way for drivers to object to settings - we can pass the list of sysctls to the driver (as a generic param / label or whatever it's called) ... for the driver to inspect if it's interested, and raise an error if there's a problem. |
Maybe But - something that describes its function a bit more accurately, makes it less surprising that the sysctl name is mangled (and to give us a bit of room for manoeuvre if we ever think of a reason to allow more than 'net....*', unlikely as that seems). |
Description
Background
In 26.0, we removed use of the OCI prestart hook that was used by the daemon to get a callback from the runtime, during container task creation, to
SetKey
.With the hook in place, the initial set of network interfaces were created in the container's namespace before task creation. Once it was removed, interfaces were only moved into the network namespace after task creation (before the task was started).
The prestart hook runs before
--sysctl
options have been applied. We reverted that change, because it meant--sysctl
for an interface-specific setting (like--sysctl net.ipv6.conf.eth0.accept_ra=2
), prevented the container from starting because the interface did not exist when the runtime tried to apply the sysctl setting.The immediate driver for the change was to be able to inspect the container to determine whether it had IPv6 enabled, after sysctls had been applied. Then, use that to set up
/etc/hosts
with or without IPv6 hosts - and, in follow-up PRs, to decide whether to allocate IPv6 addresses etc. But, eliminating the SetKey "reexec" was desirable in any case.(Discussed in the networking maintainers sync call on 26th March - @corhere, @akerouanton, @cpuguy83.)
Objectives
eth0
is the name of the first interface that gets set up, but the order isn't predictable on restart for a container with multiple interfaces.network connect
.)Proposal
sysctl
option to the "advanced syntax" of--network
. For example:--network name=mynet,sysctl=ipv6.conf.forwarding=1,sysctl=ipv6.conf.accept_ra=2
[]string
inEndpointSettings
.--sysctl
options are passed as a map, but ordering might be important and should be preserved.net.
and the interface's generated name...ipv6.conf.forwarding
->net.ipv6.conf.eth0.forwarding
.net.ipv6.conf.all.disable_ipv6
would be difficult to deal with).ifname
option, but we'd also need to deal with conflicting names, and unspecified names.net.something
, and if the interface name isall
ordefault
. But, the real interface name is more tricky - could either ignore the given name as a placeholder and use the generated name (but that might be surprising), or require the interface name to beIFNAME
or similar (but then, back to the problem that the option's usage isn't clear)?net.*.*.eth0
settings from--sysctl
to the new field inEndpointSettings
, for the first--network
.--sysctl
, add a warning to the create response, disable the migration in some future API version.--sysctl
option todocker network connect
.ipv6.conf.accept_ra=2
.Matching changes will be needed in 'compose'. (I don't think 'build' has provision for sysctl settings.)
Other notes ...
sysctl
options.The text was updated successfully, but these errors were encountered: