
Write CNI conf to network-plugin-dir #24672

Closed
bprashanth opened this issue Apr 22, 2016 · 34 comments
Labels
sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@bprashanth
Contributor

Writing this https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/network/kubenet/kubenet_linux.go#L158 to the filesystem would allow us to share a single kubelet-generated CNI config across all runtimes (e.g. rkt). A potential downside is that we'd end up with two type=bridge CNI confs that conflict.
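For context, a kubenet-style bridge config written to disk might look roughly like this (a sketch only; the field values are illustrative, not the actual kubenet output):

```json
{
  "name": "kubenet",
  "type": "bridge",
  "bridge": "cbr0",
  "isGateway": true,
  "ipMasq": false,
  "ipam": {
    "type": "host-local",
    "subnet": "10.244.1.0/24",
    "routes": [{ "dst": "0.0.0.0/0" }]
  }
}
```

Any runtime that understands CNI configs could, in principle, read a file like this from a shared directory instead of kubelet passing it over stdin.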

@freehan @dcbw @yifan-gu

@bprashanth bprashanth added sig/network Categorizes an issue or PR as relevant to SIG Network. team/cluster labels Apr 22, 2016
@dcbw
Member

dcbw commented Apr 22, 2016

Why does it need to get written out when it'll just get passed by Kubernetes to CNI via stdin? I guess I'm a bit fuzzy on how rkt works here; in the rkt case is kubenet not spawning the CNI plugins itself but somehow delegating? If so, why is that?

@dcbw
Member

dcbw commented Apr 22, 2016

(note that I specifically made kubenet not write out a file because that entirely avoids file management issues, and the configuration may change later too)

@bprashanth
Contributor Author

I guess I'm a bit fuzzy on how rkt works here; in the rkt case is kubenet not spawning the CNI plugins itself but somehow delegating? If so, why is that?

With the docker runtime we start pause, call cni, join everything else to pause via --net=.
With rkt you tell it to create a pod, it uses CNI.

@dcbw
Member

dcbw commented Apr 22, 2016

So how does rkt+kubenet actually work then? Does that mean that rkt wouldn't use large parts of the existing kubenet code for setup/teardown?

@yifan-gu
Contributor

@dcbw It works if rkt can find the CNI config files. But not exactly the way as the kubenet code. E.g. plugin.Shaper is not called.

@dcbw
Member

dcbw commented Apr 22, 2016

I guess what I'm wondering is if rkt doesn't actually use much kubenet code at all, should it be using kubenet? Or does it just care about the CNI config that kubenet writes, and maybe that should be generalized?

@yifan-gu
Contributor

@dcbw To be clearer, rkt embeds CNI, so when rkt starts a pod it will try to add the pod to the network with the given name (passed via the --net= flag to the rkt CLI).

Yes, it only cares about the CNI config files that kubenet writes for now.

@yifan-gu
Contributor

@dcbw One reason I can't use SetUpPod / TearDownPod is that there is no pod infra container in rkt, so there is no stage that "when the network namespace of the pod is available, but the containers are not running".
In rkt, when the pod's network namespace is available, the pod is already running.

cc @steveej @jonboulle @sjpotter @euank

@bprashanth
Contributor Author

The goal would be consistent CNI config so other runtimes don't have to come up with their own, none of the other bells and whistles.
@kubernetes/sig-network

@steveej

steveej commented Apr 22, 2016

@dcbw

I guess what I'm wondering is if rkt doesn't actually use much kubenet code at all, should it be using kubenet? Or does it just care about the CNI config that kubenet writes, and maybe that should be generalized?

Even though I probably don't understand the details of your concerns I think that rkt should not use any kubernetes code, ever. The other way round, kubernetes should also not depend on rkt's CNI code.

I would expect kubernetes network setup to work exactly the same for every runtime that has the option to run in the invoker's netns. In the case of rkt, it can be asked not to touch the netns using --net=host. Of course the invocation can only happen after the network has already been set up by kubernetes and CNI.

@yifan-gu
Contributor

@bprashanth @dcbw @steveej FYI I just created PR #24688, which symlinks rkt's net.d directory to --network-plugin-dir so that rkt can read the CNI configs when the plugin is cni or kubenet.

@dcbw
Member

dcbw commented Apr 27, 2016

@steveej yes, I see what you're saying here. Currently, the rkt runtime will pass --net=io.kubernetes.rkt as the network, an existing CNI config for that is required, and there is currently no facility for the rkt runtime to do pause/network container setup at all. So either we do a larger restructuring of the rkt kubelet runtime to more closely match the flow of the docker runtime (eg, create the pause container, then create the other containers and move them to the pause container's namespace before starting them), or we're going to need special-casing of each runtime in the network plugins.

I don't think anyone really likes needing the pause container just to keep the network namespace alive. But how does rkt handle pods? I can't see anything in rkt.go that would allow containers to share the same network namespace like the docker runtime does...

@dcbw
Member

dcbw commented Apr 27, 2016

@bprashanth looking at the rkt implementation, it seems like the only pieces it would really use are the POD_CIDR_CHANGED event and the CNI config generation. That's not a lot of kubenet that's actually used; even the shaper isn't used because that's handled through SetUpPod()/TearDownPod(). I'm not sure it's worth rkt really using kubenet given we'd have to special case stuff in kubenet to make that happen, and that is icky.

But docker and rkt have fundamentally different ideas of how networking should work (with docker, we pass the pre-created network namespace that pod containers should use; but rkt cannot handle that so every container, even in the same pod, gets a separate namespace and IP address?). The existing network plugin API expects the docker flow and not the rkt flow. Making them work with the rkt flow would require every plugin writer to handle both runtimes in their plugin, which isn't a great approach.

I wonder if we could get rkt to enhance its API to allow passing in an existing network namespace that the container should use, which it then just passes off to CNI? That would be a huge step towards consolidating this from the Kubernetes side.

@dcbw
Member

dcbw commented Apr 27, 2016

@bprashanth might be easier than I thought to get rkt to put a pod into a different namespace; see rkt/rkt#2525

@yifan-gu
Contributor

I can't see anything in rkt.go that would allow containers to share the same network namespace like the docker runtime does...

@dcbw I didn't get it; the containers within a rkt pod share the same network namespace today.

@yifan-gu
Contributor

@dcbw Briefly, rkt calls nspawn --boot to create a so-called "rkt pod", which has different namespaces from the host.
Then each "rkt app" (which corresponds to kubernetes' "container" notion) is started by the PID1 in that pod created by nspawn. The PID1 is a systemd process in this case.

And each app chroots into its own rootfs.

So all the apps (kubernetes' "container" notion) share the same namespace.

@dcbw
Member

dcbw commented Apr 27, 2016

@yifan-gu are you sure? I could be wrong, but I don't see that being the case when --net is passed...

kubernetes' pkg/kubelet/rkt/rkt.go will pass --net=rkt.kubernetes.io to rkt when launching a networked container. That ends up in rkt's rkt/run.go, which passes the list to the stage0 RunConfig. That ends up in stage0/run.go, which runs stage1 and passes --net=rkt.kubernetes.io. Stage1 (in stage1/init.go) then calls networking.Setup(), which creates a new network namespace every time via basicNetNS().

If --net isn't passed (which only happens when the kubernetes container is supposed to use host networking) then yes, it uses the host's namespace. But many (most?) Kubernetes pods will run containers not in the host net namespace...

Again, I could be wrong, I was trying to trace --net through the rkt code and came up with the above.

@yifan-gu
Contributor

yifan-gu commented Apr 27, 2016

which creates a new network namespace every time via basicNetNS().

@dcbw The stage1 is invoked once per pod, each pod has one stage1. What did you mean by every time?

Ref https://github.com/coreos/rkt/blob/master/Documentation/devel/architecture.md#stage-1

@steveej

steveej commented Apr 27, 2016

@dcbw

Please see https://github.com/coreos/rkt/blob/master/Documentation/subcommands/run.md#host-networking; in summary, --net=host tells rkt to leave the networking alone, which is what I suggested in my comment above.

Looking at rkt/rkt#2525 I understand that you didn't pursue my suggestion of directly invoking CNI from kubernetes. I strongly advise against using rkt's CNI functionality like that as it will make it impossible for k8s to intercept any results that CNI passes back. Creating the Pod using the rkt invocation is an atomic operation.
I'm not saying that the code that invokes the CNI plugins must live within the k8s repository, but I suggest again that we extend libcni for this purpose.

@dcbw
Member

dcbw commented Apr 27, 2016

@steveej @yifan-gu yes, thanks for the pointers, I see what's going on now and I missed it first time around.

@dcbw
Member

dcbw commented Apr 27, 2016

@steveej If I understand you correctly, I think I came to the same conclusion earlier today. Not calling rkt's CNI directly would still require Kubernetes to have special-cased code for rkt to create the pod namespace, though that's not a ton of code and would probably be fairly easy to implement in kubelet.

@dcbw
Member

dcbw commented Apr 28, 2016

@steveej @yifan-gu Ok, how about this for a plan. The kubelet rkt code locks the OS thread, creates a new permanent namespace for the pod (by mounting it to /var/run/netns) using the CNI code in containernetworking/cni#167, switches back to the host namespace, and then calls the network plugin's SetUpPod method. After the network plugin has run, it locks the OS thread again, switches back to the pod namespace, and does "rkt run-prepared --net=host", then after that's done switches back to the host namespace and unlocks the OS thread.

@yifan-gu
Contributor

yifan-gu commented Apr 28, 2016

@dcbw We should be very careful when using LockOSThread in such a case, because it does not prevent new threads from being cloned from that locked thread when there are not enough OS threads available for goroutines. Those threads will then be in the undesired network namespace, and there is no way to remove them.

@dcbw
Member

dcbw commented Apr 28, 2016

@yifan-gu Yeah, well aware of the dangers of threading and namespaces... been dealing with that in CNI recently. Other than spawning a helper process that does all of this, there's not a great way to ensure that the rkt process will be spawned in the right namespace. That's why rkt/rkt#2525 would be useful.

@dcbw
Member

dcbw commented Apr 28, 2016

@yifan-gu actually it may not be quite as simple as I said above, since Kubernetes doesn't run rkt directly, but instead writes a systemd unit file for the pod and starts that unit. So we'd need to instead write a systemd unit file with:

ExecStart=ip netns exec rkt run-prepared --net=host ...
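A fuller unit might look something like the fragment below (purely illustrative: the namespace name `k8s_<pod-uid>` and the prepared-pod placeholder `<uuid>` are hypothetical, not what kubelet actually generates):

```ini
[Unit]
Description=rkt pod in a kubelet-managed network namespace

[Service]
# "k8s_<pod-uid>" is a hypothetical namespace name created beforehand
# with "ip netns add"; rkt runs with --net=host so it leaves the
# already-configured namespace alone.
ExecStart=/usr/bin/ip netns exec k8s_<pod-uid> /usr/bin/rkt run-prepared --net=host <uuid>
```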

@yifan-gu
Contributor

ExecStart=ip netns exec rkt run-prepared --net=host ...

@dcbw Where is the NETNS parameter? Does kubelet create the network namespace today?

@dcbw
Member

dcbw commented Apr 29, 2016

@yifan-gu kubelet does not create the net namespaces today, but that was my suggestion. pkg/kubelet/rkt/rkt.go would create the namespace with 'ip netns add ' and then exec the rkt run-prepared with that namespace name. Then when the pod was torn down it would 'ip netns del '.

@yifan-gu
Contributor

yifan-gu commented May 2, 2016

From #24688 (comment)

@dcbw @bprashanth
Updates:
@euank did some POC work last week with the idea of running rkt with --net=host in a given network namespace created by rktnetes. He is now working on making that change to the rktnetes code to see if it works.

@dcbw
Member

dcbw commented May 2, 2016

@yifan-gu I'll have a PR up tonight for Kubernetes that implements the idea...

@euank
Contributor

euank commented May 2, 2016

@dcbw I've got a kube-managed netns sort of working, though it needs a little more work before it's PR-worthy. The rkt code creates a network namespace, calls the existing kubenet.SetUpPod code, and then execs in that namespace. What I've got is here (happily, it didn't really take that much for it to work at all).

What's the scope of what you're implementing? If there's overlap here, I'm more than willing to let you have at it; I just want to avoid duplicating effort.

@yifan-gu
Contributor

yifan-gu commented May 3, 2016

@dcbw Are you in the kubernetes slack channel btw? Can't really figure out your handle :( It would be nice if we can slack you 😉

@dcbw
Member

dcbw commented May 3, 2016

@euank I think we've got mostly the same approach. Mine is #25062. Note that if we're using "/sbin/ip netns", there's no point in using nsenter since 'ip netns exec' does the same thing.

@dcbw
Member

dcbw commented May 3, 2016

@yifan-gu not on slack yet, but I probably should be... will figure that out.

@yifan-gu
Contributor

yifan-gu commented May 3, 2016

Closing in favor of #25062

@yifan-gu yifan-gu closed this as completed May 3, 2016