Write CNI conf to network-plugin-dir #24672
Why does it need to get written out when it'll just get passed by Kubernetes to CNI via stdin? I guess I'm a bit fuzzy on how rkt works here; in the rkt case is kubenet not spawning the CNI plugins itself but somehow delegating? If so, why is that?
(note that I specifically made kubenet not write out a file because that entirely avoids file management issues, and the configuration may change later too)
With the docker runtime we start pause, call CNI, and join everything else to pause via its network namespace.
So how does rkt+kubenet actually work then? Does that mean that rkt wouldn't use large parts of the existing kubenet code for setup/teardown?
@dcbw It works if rkt can find the CNI config files, but not exactly the same way as the kubenet code does; e.g., plugin.Shaper is not called.
I guess what I'm wondering is if rkt doesn't actually use much kubenet code at all, should it be using kubenet? Or does it just care about the CNI config that kubenet writes, and maybe that should be generalized?
@dcbw To be more clear, rkt embeds CNI, so when rkt starts a pod it will try to add the pod to the network with the given name (passed by …). Yes, it only cares about the CNI config files that kubenet writes for now.
@dcbw One reason I can't use …
The goal would be consistent CNI config so other runtimes don't have to come up with their own, none of the other bells and whistles. |
Even though I probably don't understand the details of your concerns, I think that rkt should not use any kubernetes code, ever. The other way round, kubernetes should also not depend on rkt's CNI code. I would expect kubernetes network setup to work exactly the same for every runtime that has the option to run in the invoker's netns. In the case of rkt, it can be asked to not touch the netns using …
@bprashanth @dcbw @steveej FYI I just created a PR #24688 which symlinks rkt's net.d directory to …
@steveej yes, I see what you're saying here. Currently, the rkt runtime will pass --net=io.kubernetes.rkt as the network and an existing CNI config for that is required, and there is no facility currently for the rkt runtime to do pause/network container setup at all. So either we do a larger restructuring of the rkt kubelet runtime to more closely match the flow of the docker runtime (e.g., create the pause container, then create the other containers and move them into the pause container's namespace before starting them), or we're going to need special-casing of each runtime in the network plugins. I don't think anyone really likes needing the pause container just to keep the network namespace alive. But how does rkt handle pods? I can't see anything in rkt.go that would allow containers to share the same network namespace like the docker runtime does...
@bprashanth looking at the rkt implementation, it seems like the only pieces it would really use are the POD_CIDR_CHANGED event and the CNI config generation. That's not a lot of kubenet that's actually used; even the shaper isn't used because that's handled through SetUpPod()/TearDownPod(). I'm not sure it's worth rkt really using kubenet given we'd have to special case stuff in kubenet to make that happen, and that is icky. But docker and rkt have fundamentally different ideas of how networking should work (with docker, we pass the pre-created network namespace that pod containers should use; but rkt cannot handle that so every container, even in the same pod, gets a separate namespace and IP address?). The existing network plugin API expects the docker flow and not the rkt flow. Making them work with the rkt flow would require every plugin writer to handle both runtimes in their plugin, which isn't a great approach. I wonder if we could get rkt to enhance its API to allow passing in an existing network namespace that the container should use, which it then just passes off to CNI? That would be a huge step towards consolidating this from the Kubernetes side. |
@bprashanth might be easier than I thought to get rkt to put a pod into a different namespace; see rkt/rkt#2525 |
@dcbw I didn't get it; the containers within a rkt pod share the same network namespace today.
@dcbw Briefly, rkt calls … once per pod, and each app chroots to its own rootfs. So all the apps (kubernetes' "container" notion) share the same namespace.
@yifan-gu are you sure? I could be wrong, but I don't see that being the case when --net is passed... kubernetes' pkg/kubelet/rkt/rkt.go will pass --net=rkt.kubernetes.io to rkt when launching a networked container. That ends up in rkt's rkt/run.go, which passes the list to the stage0 RunConfig. That ends up in stage0/run.go, which runs stage1 and passes --net=rkt.kubernetes.io. Stage1 (in stage1/init.go) then calls networking.Setup(), which creates a new network namespace every time via basicNetNS(). If --net isn't passed (which only happens when the kubernetes container is supposed to use host networking) then yes, it uses the host's namespace. But many (most?) Kubernetes pods will run containers not in the host net namespace... Again, I could be wrong; I was trying to trace --net through the rkt code and came up with the above.
@dcbw The stage1 is invoked once per pod, each pod has one stage1. What did you mean by every time? Ref https://github.com/coreos/rkt/blob/master/Documentation/devel/architecture.md#stage-1 |
Please see https://github.com/coreos/rkt/blob/master/Documentation/subcommands/run.md#host-networking, summarized: … Looking at rkt/rkt#2525, I understand that you didn't pursue my suggestion of directly invoking CNI from kubernetes. I strongly advise against using rkt's CNI functionality like that, as it will make it impossible for k8s to intercept any results that CNI passes back. Creating the pod using the rkt invocation is an atomic operation.
@steveej If I understand you correctly, I think I came to the same conclusion earlier today, but not calling rkt's CNI directly would still require that Kubernetes have special-cased code for rkt to create the pod namespace, though that's not a ton of code and would probably be fairly easy to implement in kubelet. |
@steveej @yifan-gu Ok, how about this for a plan. The kubelet rkt code locks the OS thread, creates a new permanent namespace for the pod (by mounting it to /var/run/netns) using the CNI code in containernetworking/cni#167, switches back to the host namespace, and then calls the network plugin's SetUpPod method. After the network plugin has run, it locks the OS thread again, switches back to the pod namespace, and does "rkt run-prepared --net=host", then after that's done switches back to the host namespace and unlocks the OS thread. |
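The sequence proposed above can be sketched as follows. The real namespace switches (unsharing, bind-mounting under /var/run/netns, setns back and forth) all require root, so they are stubbed out here as recording functions; the sketch only shows the thread-locking and the ordering of steps, with hypothetical helper names.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// steps records the order of operations so the sketch is checkable
// without privileges.
var steps []string

// Stubs for the privileged operations described in the plan:
func createPodNetns(pod string) { steps = append(steps, "create netns "+pod) } // would unshare + bind-mount /var/run/netns/<pod>
func enterHostNetns()           { steps = append(steps, "enter host netns") }  // would setns back to the host namespace
func setUpPod(pod string)       { steps = append(steps, "SetUpPod "+pod) }     // network plugin configures the pod netns
func runPrepared(pod string)    { steps = append(steps, "rkt run-prepared --net=host") }

func launch(pod string) {
	// Namespace membership is per OS thread, so the goroutine must be
	// pinned for the whole sequence.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	createPodNetns(pod) // thread is now in the new pod netns
	enterHostNetns()    // switch back before invoking the plugin
	setUpPod(pod)       // plugin's SetUpPod runs from the host side
	// ...switch into the pod netns again, then:
	runPrepared(pod) // rkt inherits the namespace; --net=host keeps it
	enterHostNetns() // and finally return to the host namespace
}

func main() {
	launch("pod1")
	fmt.Println(strings.Join(steps, " -> "))
}
```

The point of the bracketed LockOSThread/UnlockOSThread pair is exactly the concern raised in the next comments: without it, the Go scheduler may move the goroutine to a thread that is in the wrong namespace mid-sequence.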
@dcbw We should be very careful when using …
@yifan-gu Yeah, well aware of the dangers of threading and namespaces... been dealing with that in CNI recently. Other than spawning a helper process that does all of this, there's not a great way to ensure that the rkt process will be spawned in the right namespace. That's why rkt/rkt#2525 would be useful. |
@yifan-gu actually it may not be quite as simple as I said above, since Kubernetes doesn't run rkt directly, but instead writes a systemd unit file for the pod and starts that unit. So we'd need to instead write a systemd unit file with: ExecStart=ip netns exec rkt run-prepared --net=host ... |
@dcbw Where is the NETNS parameter? Does kubelet create network namespaces today?
@yifan-gu kubelet does not create the net namespaces today, but that was my suggestion. pkg/kubelet/rkt/rkt.go would create the namespace with 'ip netns add …' and then exec rkt run-prepared with that namespace name. Then when the pod was torn down it would 'ip netns del …'.
From #24688 (comment)
@yifan-gu I'll have a PR up tonight for Kubernetes that implements the idea... |
@dcbw I've got a kube-managed netns sorta working, though it needs a little more work before it's PR-worthy. What I've got is the rkt code creating a network namespace and calling the existing … What's the scope of what you're implementing? If there's overlap here, I'm more than willing to let you have at it; I just want to avoid duplicating effort.
@dcbw Are you in the kubernetes slack channel btw? Can't really figure out your handle :( It would be nice if we can slack you 😉 |
@yifan-gu not on slack yet, but I probably should be... will figure that out. |
Closing in favor of #25062 |
Writing this https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/network/kubenet/kubenet_linux.go#L158 to the filesystem allows us to share a single kubelet-generated CNI config across all runtimes (e.g., rkt). The potential downside is that we'd end up with two type=bridge CNI confs that conflict.
@freehan @dcbw @yifan-gu