
Write CNI conf to network-plugin-dir #24672

Closed
bprashanth opened this issue Apr 22, 2016 · 34 comments
Labels
sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@bprashanth
Contributor

Writing this https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/network/kubenet/kubenet_linux.go#L158 to the filesystem would allow us to share a single kubelet-generated CNI config across all runtimes (e.g. rkt). A potential downside is that we'd end up with two type=bridge CNI confs that conflict.
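For context, a kubenet-style bridge config written to disk might look roughly like this (a sketch only; the field values are illustrative, not the actual kubenet output):

```json
{
  "name": "kubenet",
  "type": "bridge",
  "bridge": "cbr0",
  "isGateway": true,
  "ipMasq": false,
  "ipam": {
    "type": "host-local",
    "subnet": "10.244.1.0/24",
    "routes": [{ "dst": "0.0.0.0/0" }]
  }
}
```

Any runtime that understands CNI configs could, in principle, read a file like this from a shared directory instead of kubelet passing it over stdin.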

@freehan @dcbw @yifan-gu

@bprashanth bprashanth added sig/network Categorizes an issue or PR as relevant to SIG Network. team/cluster labels Apr 22, 2016
@dcbw
Member

dcbw commented Apr 22, 2016

Why does it need to get written out when it'll just get passed by Kubernetes to CNI via stdin? I guess I'm a bit fuzzy on how rkt works here; in the rkt case is kubenet not spawning the CNI plugins itself but somehow delegating? If so, why is that?

@dcbw
Member

dcbw commented Apr 22, 2016

(note that I specifically made kubenet not write out a file because that entirely avoids file management issues, and the configuration may change later too)

@bprashanth
Contributor Author

I guess I'm a bit fuzzy on how rkt works here; in the rkt case is kubenet not spawning the CNI plugins itself but somehow delegating? If so, why is that?

With the docker runtime we start pause, call cni, join everything else to pause via --net=.
With rkt you tell it to create a pod, it uses CNI.

@dcbw
Member

dcbw commented Apr 22, 2016

So how does rkt+kubenet actually work then? Does that mean that rkt wouldn't use large parts of the existing kubenet code for setup/teardown?

@yifan-gu
Contributor

@dcbw It works if rkt can find the CNI config files. But not exactly the way as the kubenet code. E.g. plugin.Shaper is not called.

@dcbw
Member

dcbw commented Apr 22, 2016

I guess what I'm wondering is if rkt doesn't actually use much kubenet code at all, should it be using kubenet? Or does it just care about the CNI config that kubenet writes, and maybe that should be generalized?

@yifan-gu
Contributor

@dcbw To be clearer, rkt embeds CNI, so when rkt starts a pod it will try to add the pod to the network with the given name (passed via the --net= flag to the rkt CLI).

Yes, it only cares about the CNI config files that kubenet writes for now.

@yifan-gu
Contributor

@dcbw One reason I can't use SetUpPod / TearDownPod is that there is no pod infra container in rkt, so there is no stage that "when the network namespace of the pod is available, but the containers are not running".
In rkt, when the pod's network namespace is available, the pod is already running.

cc @steveej @jonboulle @sjpotter @euank

@bprashanth
Contributor Author

The goal would be consistent CNI config so other runtimes don't have to come up with their own, none of the other bells and whistles.
@kubernetes/sig-network

@steveej

steveej commented Apr 22, 2016

@dcbw

I guess what I'm wondering is if rkt doesn't actually use much kubenet code at all, should it be using kubenet? Or does it just care about the CNI config that kubenet writes, and maybe that should be generalized?

Even though I probably don't understand the details of your concerns I think that rkt should not use any kubernetes code, ever. The other way round, kubernetes should also not depend on rkt's CNI code.

I would expect kubernetes network setup to work exactly the same for every runtime that has the option to run in the invoker's netns. In the case of rkt, it can be asked not to touch the netns using --net=host. Of course the invocation can only happen after the network has already been set up by kubernetes and CNI.

@yifan-gu
Contributor

@bprashanth @dcbw @steveej FYI I just created PR #24688, which symlinks rkt's net.d directory to --network-plugin-dir so that rkt can read the CNI configs when the plugin is cni or kubenet.

@dcbw
Member

dcbw commented Apr 27, 2016

@steveej yes, I see what you're saying here. Currently, the rkt runtime will pass --net=io.kubernetes.rkt as the network, an existing CNI config for that is required, and there is currently no facility for the rkt runtime to do pause/network container setup at all. So either we do a larger restructuring of the rkt kubelet runtime to more closely match the flow of the docker runtime (eg, create the pause container, then create the other containers and move them to the pause container's namespace before starting them), or we're going to need special-casing of each runtime in the network plugins.

I don't think anyone really likes needing the pause container just to keep the network namespace alive. But how does rkt handle pods? I can't see anything in rkt.go that would allow containers to share the same network namespace like the docker runtime does...

@dcbw
Member

dcbw commented Apr 27, 2016

@bprashanth looking at the rkt implementation, it seems like the only pieces it would really use are the POD_CIDR_CHANGED event and the CNI config generation. That's not a lot of kubenet that's actually used; even the shaper isn't used because that's handled through SetUpPod()/TearDownPod(). I'm not sure it's worth rkt really using kubenet given we'd have to special case stuff in kubenet to make that happen, and that is icky.

But docker and rkt have fundamentally different ideas of how networking should work (with docker, we pass the pre-created network namespace that pod containers should use; but rkt cannot handle that so every container, even in the same pod, gets a separate namespace and IP address?). The existing network plugin API expects the docker flow and not the rkt flow. Making them work with the rkt flow would require every plugin writer to handle both runtimes in their plugin, which isn't a great approach.

I wonder if we could get rkt to enhance its API to allow passing in an existing network namespace that the container should use, which it then just passes off to CNI? That would be a huge step towards consolidating this from the Kubernetes side.

@dcbw
Member

dcbw commented Apr 27, 2016

@bprashanth might be easier than I thought to get rkt to put a pod into a different namespace; see rkt/rkt#2525

@yifan-gu
Contributor

I can't see anything in rkt.go that would allow containers to share the same network namespace like the docker runtime does...

@dcbw I didn't get it; the containers within a rkt pod share the same network namespace today.

@yifan-gu
Contributor

@dcbw Briefly, rkt calls nspawn --boot to create a so-called "rkt pod", which has different namespaces from the host.
Then each "rkt app" (which corresponds to kubernetes' "container" notion) is started by the PID1 in that pod created by nspawn. The PID1 is a systemd process in this case.

And each app chroots into its own rootfs.

So all the apps (kubernetes' "container" notion) share the same namespace.

@dcbw
Member

dcbw commented Apr 27, 2016

@yifan-gu are you sure? I could be wrong, but I don't see that being the case when --net is passed...

kubernetes' pkg/kubelet/rkt/rkt.go will pass --net=rkt.kubernetes.io to rkt when launching a networked container. That ends up in rkt's rkt/run.go, which passes the list to the stage0 RunConfig. That ends up in stage0/run.go, which runs stage1 and passes --net=rkt.kubernetes.io. Stage1 (in stage1/init.go) then calls networking.Setup(), which creates a new network namespace every time via basicNetNS().

If --net isn't passed (which only happens when the kubernetes container is supposed to use host networking) then yes, it uses the host's namespace. But many (most?) Kubernetes pods will run containers not in the host net namespace...

Again, I could be wrong, I was trying to trace --net through the rkt code and came up with the above.

@yifan-gu
Contributor

yifan-gu commented Apr 27, 2016

which creates a new network namespace every time via basicNetNS().

@dcbw The stage1 is invoked once per pod, each pod has one stage1. What did you mean by every time?

Ref https://github.com/coreos/rkt/blob/master/Documentation/devel/architecture.md#stage-1

@steveej

steveej commented Apr 27, 2016

@dcbw

Please see https://github.com/coreos/rkt/blob/master/Documentation/subcommands/run.md#host-networking; in summary, --net=host tells rkt to leave the networking alone, which is what I suggested in my comment above.

Looking at rkt/rkt#2525 I understand that you didn't pursue my suggestion of directly invoking CNI from kubernetes. I strongly advise against using rkt's CNI functionality like that as it will make it impossible for k8s to intercept any results that CNI passes back. Creating the Pod using the rkt invocation is an atomic operation.
I'm not saying that the code that invokes the CNI plugins must live within the k8s repository, but I suggest again that we extend libcni for this purpose.

@dcbw
Member

dcbw commented Apr 27, 2016

@steveej @yifan-gu yes, thanks for the pointers, I see what's going on now and I missed it first time around.

@dcbw
Member

dcbw commented Apr 27, 2016

@steveej If I understand you correctly, I think I came to the same conclusion earlier today. Not calling rkt's CNI directly would still require Kubernetes to have special-cased code for rkt to create the pod namespace, though that's not a ton of code and would probably be fairly easy to implement in kubelet.

@dcbw
Member

dcbw commented Apr 28, 2016

@steveej @yifan-gu Ok, how about this for a plan. The kubelet rkt code locks the OS thread, creates a new permanent namespace for the pod (by mounting it to /var/run/netns) using the CNI code in containernetworking/cni#167, switches back to the host namespace, and then calls the network plugin's SetUpPod method. After the network plugin has run, it locks the OS thread again, switches back to the pod namespace, and does "rkt run-prepared --net=host", then after that's done switches back to the host namespace and unlocks the OS thread.

@yifan-gu
Contributor

yifan-gu commented Apr 28, 2016

@dcbw We should be very careful when using LockOSThread in such a case, because it does not prevent new threads from being cloned from that locked thread when there are not enough OS threads available for goroutines. Those threads will then be in the undesired network namespace, and there is no way to remove them.

@dcbw
Member

dcbw commented Apr 28, 2016

@yifan-gu Yeah, well aware of the dangers of threading and namespaces... been dealing with that in CNI recently. Other than spawning a helper process that does all of this, there's not a great way to ensure that the rkt process will be spawned in the right namespace. That's why rkt/rkt#2525 would be useful.

@dcbw
Member

dcbw commented Apr 28, 2016

@yifan-gu actually it may not be quite as simple as I said above, since Kubernetes doesn't run rkt directly, but instead writes a systemd unit file for the pod and starts that unit. So we'd need to instead write a systemd unit file with:

ExecStart=ip netns exec rkt run-prepared --net=host ...
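A fuller unit might look something like the fragment below (purely illustrative: the namespace name `k8s_<pod-uid>` and the prepared-pod placeholder `<uuid>` are hypothetical, not what kubelet actually generates):

```ini
[Unit]
Description=rkt pod in a kubelet-managed network namespace

[Service]
# "k8s_<pod-uid>" is a hypothetical namespace name created beforehand
# with "ip netns add"; rkt runs with --net=host so it leaves the
# already-configured namespace alone.
ExecStart=/usr/bin/ip netns exec k8s_<pod-uid> /usr/bin/rkt run-prepared --net=host <uuid>
```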

@yifan-gu
Contributor

ExecStart=ip netns exec rkt run-prepared --net=host ...

@dcbw Where is the NETNS parameter? Does kubelet create the network namespace today?

@dcbw
Member

dcbw commented Apr 29, 2016

@yifan-gu kubelet does not create the net namespaces today, but that was my suggestion. pkg/kubelet/rkt/rkt.go would create the namespace with 'ip netns add ' and then exec the rkt run-prepared with that namespace name. Then when the pod was torn down it would 'ip netns del '.

@yifan-gu
Contributor

yifan-gu commented May 2, 2016

From #24688 (comment)

@dcbw @bprashanth
Updates:
@euank did some POC work last week with the idea of running rkt with --net=host in a given network namespace created by rktnetes. He is now working on making that change to the rktnetes code to see if it works.

@dcbw
Member

dcbw commented May 2, 2016

@yifan-gu I'll have a PR up tonight for Kubernetes that implements the idea...

@euank
Contributor

euank commented May 2, 2016

@dcbw I've got a kube-managed netns sort of working, though it needs a little more work before it's PR-worthy. The rkt code creates a network namespace, calls the existing kubenet.SetUpPod code, and then execs in that namespace. What I've got is here (happily, it didn't really take that much for it to work at all).

What's the scope of what you're implementing? If there's overlap here, I'm more than willing to let you have at it; I just want to avoid duplicating effort.

@yifan-gu
Contributor

yifan-gu commented May 3, 2016

@dcbw Are you in the kubernetes slack channel btw? Can't really figure out your handle :( It would be nice if we can slack you 😉

@dcbw
Member

dcbw commented May 3, 2016

@euank I think we've got mostly the same approach. Mine is #25062. Note that if we're using "/sbin/ip netns", there's no point in using nsenter since 'ip netns exec' does the same thing.

@dcbw
Member

dcbw commented May 3, 2016

@yifan-gu not on slack yet, but I probably should be... will figure that out.

@yifan-gu
Contributor

yifan-gu commented May 3, 2016

Closing in favor of #25062

@yifan-gu yifan-gu closed this as completed May 3, 2016