Kubelet network plugin for client/server container runtimes #28667

Closed
feiskyer opened this issue Jul 8, 2016 · 58 comments
Labels: area/extensibility, area/kubelet-api, sig/network, sig/node

@feiskyer (Member) commented Jul 8, 2016

Proposal #17048 introduces a client/server container runtime into kubelet and assumes that container networking is handled by kubelet network plugins.

As suggested by @thockin, we should have more discussion about the role of network plugins in all of this. @thockin offers two options here:

  • Option 1) we expect all RemoteRuntimes to NOT set up networking at all and let us do it. This means even more esoteric runtimes have to defer to us.
  • Option 2) we expect all RemoteRuntimes to handle networking, in which case the very notion of a network plugin should be removed from the RuntimeManager, and the API may have to grow some network functions.

Concerns on Option 1):

  • Is the kubelet network plugin general enough for all runtimes?
  • All current network plugins assume containers are netns-based, which is not true for hypervisor-based runtimes, e.g. HyperContainer. How do we deal with those runtimes?

Concerns on Option 2):

  • What changes should be made to the runtime API?
  • How do RemoteRuntimes cooperate with kubelet network plugins?
  • Will this impact the functionality of kube-proxy?

CC @thockin @freehan @dchen1107 @kubernetes/sig-node @kubernetes/sig-network

@thockin (Member) commented Jul 8, 2016

Is the kubelet network plugin general enough for all runtimes?

I don't think so.

All current network plugins assume containers are netns-based, which is not true for hypervisors

Precisely.

How do RemoteRuntimes cooperate with kubelet network plugins?

I think the only model is that we define what we expect from a runtime wrt network, which WILL eventually include a notion of multiple-networks, and then we expect the runtime to do the work. This means that our "docker" runtime is really "docker + CNI".

Will this impact the functionality of kube-proxy?

I think we also need to spec this: what is required for the out-of-the-box kube-proxy to function?

@matchstick

@bprashanth (Contributor) commented:

Have we discussed having both? The runtime declares upfront whether it wants Kubernetes to handle networking or to do it privately (in which case the kubelet just uses the no-op plugin).

@thockin (Member) commented Jul 8, 2016

"let's do both" is the trap we always fall into. It's part of what makes
this stuff complicated. Let's not do both, if we can possibly avoid it.

@bprashanth (Contributor) commented:

I mean kubelet --runtime=esoteric-runtime --network-plugin=noop is always an option, as long as esoteric-runtime understands how to find the network config

@yujuhong added area/kubelet-api, sig/node, team/cluster and sig/network labels on Jul 11, 2016
@yujuhong (Contributor) commented:

If most runtimes work with net namespaces, and only a few hypervisor-based runtimes are exceptions, it might be worth handling the plugins in kubelet for the majority of the runtimes. This saves those runtimes a lot of integration pain.

@vishh (Contributor) commented Jul 11, 2016

Given that we intend to let this API handle other OSes in the future, and that, as @thockin noted, some users may want to just use Docker networking, I'd prefer integrating network config into the new runtime API.
Runtime authors can always reuse the plugins developed as part of the Kubernetes core for the Docker integration.
To be specific, I recommend Option 2, even though it will require additional work in the short term.

@pskrzyns commented:

cc @kubernetes/sig-rktnetes

@lukasredynk commented Jul 13, 2016

If most runtimes work with net namespaces

NS-based:

  • Docker
  • rkt coreos flavor

Incompatible with NS:

  • rkt kvm flavor
  • HyperContainers (WIP)

It doesn't look like "most" work with netns; it's more like 50/50.

@yujuhong (Contributor) commented:

There are proposals/discussions to integrate the runc (OCI) runtime, kurma, etc. It's a question of whether we want to make integration of non-hypervisor-based runtimes easier. If they can easily reuse/import what we maintain for the docker integration, that might be enough. On the other hand, it's probably not that easy to reuse our network-related code as it is today.

@feiskyer (Member, author) commented Aug 1, 2016

Proposal: let kubelet handle networking via CNI plugins

Benefits of this proposal

  • Centralized management of networking inside Kubernetes.
  • Easy reuse of existing CNI network plugins (requires upgrading CNI to v0.5.0).
  • Easy cooperation with kube-proxy, which can stay as is.
  • No assumptions about container runtimes, e.g. a runtime is not required to support CNI network plugins because the CNI plugin is invoked inside kubelet.

The proposed method involves changes to CNI, the kubelet and the Container Runtime API.

CNI part

CNI needs to support hypervisor-based container runtimes first, which is in progress:

type RuntimeConf struct {
    // If set, a tap device instead of veth pair should be created
    UsesTapDevice bool
    ....
}

Kubelet part

Kubelet needs to update its CNI version and support hypervisor-based container runtimes:

  • Update the CNI network plugin to v0.5.0.
  • Kubelet gets the container runtime type (hypervisor or netns) by calling the Version() API (a sketch of this dispatch follows the workflow diagrams below).
    • If the runtime is netns-based, get the netns from PodSandboxStatus() and call the CNI network plugin.
    • Otherwise, the runtime is hypervisor-based:
      • Call the CNI network plugin with UsesTapDevice set, and get network information (interface name, MAC address, IP address and so on) back.
      • Pass the network information to the runtime via SetupInterface().
  • The kubelet network plugin interface should be updated to the new runtime API: the old podInfraContainerID should be replaced with the sandbox ID.

NetworkPlugin.SetUpPod() workflow for netns-based container runtimes (e.g. Docker, runc):

   Kubelet              CNI network plugin              Container Runtime
      +                          +                               +
      |                          |                               |
      +-------------PodSandboxStatus() for getting netns--------->
      |                          |                               |
      <----------------------Success-----------------------------+
      |                          |                               |
      |                          |                               |
      +----------Setup()--------->                               |
      |                          |                               |
      |                          |                               |
      <---------Success----------+                               |
      |                          |                               |
      +                          +                               +

NetworkPlugin.SetUpPod() workflow for hypervisor-based container runtimes (e.g. HyperContainer, rkt with lkvm):

  Kubelet               CNI Network Plugin                Container Runtime
     +                             +                                +
     |                             |                                |
     +-Setup() with UsesTapDevice-->                                |
     |                             |                                |
     <----network info for nic-----+                                |
     |                             |                                |
     +---------------------SetupInterface()------------------------->
     |                             |                                |
     <--------------------------Success-----------------------------+
     |                             |                                |
     +                             +                                +
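
The two workflows above can be condensed into a short Go sketch. This is only an illustration of the proposed dispatch, not actual kubelet code: the RuntimeService and CNIPlugin interfaces, the CNIResult type and the method names are hypothetical stand-ins for the proposed Version()/is_hypervisor check, PodSandboxStatus(), the CNI UsesTapDevice mode and the new SetupInterface() RPC.

package network

// CNIResult is a hypothetical summary of what the CNI plugin would return in
// the hypervisor case (interface name, MAC address, IP address and so on).
type CNIResult struct {
    IfName string
    MAC    string
    IP     string
}

// RuntimeService is a pared-down stand-in for the runtime API calls used here.
type RuntimeService interface {
    IsHypervisor() (bool, error)                            // from Version()/is_hypervisor
    SandboxNetNS(sandboxID string) (string, error)          // from PodSandboxStatus()
    SetupInterface(sandboxID string, iface CNIResult) error // the proposed new RPC
}

// CNIPlugin is a hypothetical wrapper around the CNI invocation.
type CNIPlugin interface {
    // SetupNetNS configures an existing network namespace (netns-based runtimes).
    SetupNetNS(podNamespace, podName, netnsPath string) error
    // SetupTap is the proposed UsesTapDevice mode: create a host-side tap
    // device and report its details instead of configuring a netns.
    SetupTap(podNamespace, podName string) (CNIResult, error)
}

// SetUpPod sketches the proposed dispatch between the two workflows.
func SetUpPod(rt RuntimeService, cni CNIPlugin, podNamespace, podName, sandboxID string) error {
    hyper, err := rt.IsHypervisor()
    if err != nil {
        return err
    }
    if !hyper {
        // netns-based runtime: read the netns from PodSandboxStatus and let
        // the CNI plugin configure it directly.
        netns, err := rt.SandboxNetNS(sandboxID)
        if err != nil {
            return err
        }
        return cni.SetupNetNS(podNamespace, podName, netns)
    }
    // Hypervisor-based runtime: CNI prepares the host-side device and the
    // resulting interface details are handed back via SetupInterface().
    iface, err := cni.SetupTap(podNamespace, podName)
    if err != nil {
        return err
    }
    return rt.SetupInterface(sandboxID, iface)
}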

Container Runtime API part

Since the container runtime type (hypervisor or netns) is required by the CNI network plugin, the type should be obtained from the runtime API. A new field is_hypervisor could be added to VersionResponse:

message VersionResponse {
    ....
    // If set, the container runtime is hypervisor-based
    optional bool is_hypervisor = 5;
}

Since the CNI network plugin does not set up interfaces inside the hypervisor, kubelet should pass the network information (interface name, MAC address, IP address and so on) obtained from the CNI network plugin back to the runtime. New RPCs SetupInterface() and TeardownInterface() could be added to the runtime API:

service RuntimeService {
    ....
    // SetupInterface sets up an interface in a sandbox
    rpc SetupInterface(SetupInterfaceRequest) returns (SetupInterfaceResponse) {}

    // TeardownInterface tears down an interface of a sandbox
    rpc TeardownInterface(TeardownInterfaceRequest) returns (TeardownInterfaceResponse) {}
}

// NetworkInfo contains the basic network configuration for an interface
message NetworkInfo {
  optional string ip_address = 1;
  optional string gateway = 2;
  repeated string routes = 3;
}

// NetworkInterface contains all information for setting a network interface
message NetworkInterface {
  optional string host_interface_name = 1;
  optional string container_interface_name = 2;
  optional NetworkInfo network_info = 3;
}

message SetupInterfaceRequest {
  optional string sandbox_id = 1;
  optional NetworkInterface interface = 2;
}

message SetupInterfaceResponse {}

message TeardownInterfaceRequest {
  optional string sandbox_id = 1;
  optional NetworkInterface interface = 2;
}

message TeardownInterfaceResponse {}

Disadvantages compared to letting runtimes handle networking

  • Depends on external CNI version upgrades and CNI plugin upgrades.
  • Less flexible: since networking is all handled by kubelet, runtimes cannot choose other network plugins outside kubelet.

CC @thockin @freehan @dchen1107 @yujuhong

@thockin (Member) commented Aug 1, 2016

I commented on the CNI proposal, but this just feels backwards to me. It
feels like we are taking information that should be encapsulated and
turning it inside out.

Do we really expect that every CNI plugin will work for every runtime? I
admit that I am not current on VM network configuration options, so I am
more than willing to be schooled.

@yujuhong (Contributor) commented Aug 1, 2016

Do we really expect that every CNI plugin will work for every runtime? I
admit that I am not current on VM network configuration options, so I am
more than willing to be schooled.

@feiskyer, why can't we just have one CNI plugin that understands VM networking, and pass the plugin name to kubelet if the user wants a hypervisor-based runtime? That way, we don't have to force every CNI plugin to understand the new semantics.

@feiskyer (Member, author) commented Aug 2, 2016

Do we really expect that every CNI plugin will work for every runtime? I
admit that I am not current on VM network configuration options, so I am
more than willing to be schooled.

@thockin Yes, I'm expecting all core plugins inside CNI to support this. I think @lukasredynk and @jellonek also expect this, since they are working on the implementation inside CNI. Other external plugins may opt in to this.

@feiskyer, why can't we just have one CNI plugin that understands VM networking, and pass the plugin name to kubelet if the user wants a hypervisor-based runtime? That way, we don't have to force every CNI plugin to understand the new semantics.

@yujuhong Same as above: a general CNI that works for all runtimes is expected, so there is no need to handle networking differently for different runtimes. This also gives users more network plugin options.

@thockin (Member) commented Aug 2, 2016

I think you misunderstood my point. I don't think that is a reasonable
assumption.

@resouer (Contributor) commented Aug 2, 2016

@thockin Yes, we can definitely implement a specific plugin to handle HyperContainer's network, but then rkt and other hypervisor-container users (like Mirantis) would have to do the same. Since hypervisor containers' networking has a lot in common, it would be great if we could (in fact, we are eager to) collaborate on one project instead of reinventing the wheel on each side.

Obviously, this proposal is not asking CNI to support all kinds of vendors. In fact, it does not have (and shouldn't have) any hyper/rkt-specific refactoring; it is just asking for support for another kind of container runtime (we already have one, netns-based containers, and this would make it two), and the cost of doing this, as you can see, is negligible.

Otherwise, hyper & rkt & others may have to build another interface under CNI, well, a standard under a standard ...

@jellonek (Contributor) commented Aug 2, 2016

It feels like we are taking information that should be encapsulated and turning it inside out.

Using host networking we are already gathering this information, but because we do not pass it to the execution driver, we cut the execution drivers off from it, removing the possibility of tuning the config for a specific driver or of the driver exposing the info itself. Also, host networking is incompatible with VM-based runtimes.

Do we really expect that every CNI plugin will work for every runtime?

IMO no. If there is no interest in some capability (e.g. ipvlan), there is no need to pressure anyone to fully support that particular network type in both scopes, netns-based and VM-based.

@lukasredynk: to the list of netns-incompatible runtimes you can add https://github.com/fabiand/vocker, because @fabiand has also shown interest in this area, so vocker is a possible next runtime that could be introduced as an execution driver for k8s.

@thockin (Member) commented Aug 3, 2016

instead of re-inventing wheel on each side

Nothing says that the individual runtimes can't all use CNI or even use a
common CNI library. I just don't think we should be telling people "the
runtime doesn't get to choose".

I really think that having a docker-native driver mode will appeal to some
people.

just asking for support another kind of container runtime (since we
already have two)

If you mean docker and rocket as two, those should only count as one. They
operate largely the same way. Adding all this machinery around tap devices
smells of a slippery slope. On what grounds do we say no to the next
request for a third special case?

@resouer (Contributor) commented Aug 3, 2016

already have two)

Sorry, I meant we had one (netns containers) and this will make it two; I have updated the comment.

I can see one of your concerns: if you are objecting to Setup() with UsesTapDevice, we can definitely change it to something like SetupNonNSNetwork() so it won't confuse CNI users like Calico, Weave, etc. Kubelet is responsible for choosing the right workflow according to its runtime.

I don't see this proposal implying that netns containers should know or be aware of tap devices. Is there anything misleading? The core idea is simply that kubelet chooses a different workflow (the chain of methods to call) according to its runtime option.

@thockin (Member) commented Aug 4, 2016

There are two issues here.

  1. If/how CNI should handle tap devices

  2. Which side of the CRI boundary networking should go on

I think #1 can still be debated.

I think I have a strong preference that #2 be "on the plugin side", unless someone can show me why this is really REALLY bad.

@feiskyer (Member, author) commented:

So we're in agreement then?

Yes. In the long run, it is essential for network plugins to cooperate with networking inside Kubernetes, e.g. defining a network object inside k8s and letting plugins do the real connection work. But for now, we can keep network plugins inside the runtimes and pass in network-related information in the future.

@thockin (Member) commented Aug 10, 2016

Right now there is a single implied Network, but eventually I believe there
will be N explicit networks

@yujuhong (Contributor) commented:

The only thing kubelet should care about is retrieving information (e.g. pod IP and network stats) via CRI.

We still need to pass some information to the runtime via CRI, e.g. the pod CIDR, so that the runtime can set everything up properly.
We also need to document any implicit assumptions (if there are any) about networking in the runtime that cannot be conveyed in CRI directly.
Finally, we need to copy/adapt all the kubenet-related code into the dockershim (a misnomer, since it's a kubelet shim for docker) package. The cbr0 code lives in kubelet directly, so it will need to be isolated properly so that we can switch it off for testing.

@freehan I assume you will be working on augmenting CRI for this. Let me know if that's not the case, thanks!

@freehan (Contributor) commented Aug 10, 2016

@yujuhong I will pick that up. Thanks for summarizing.

@thockin (Member) commented Aug 11, 2016

Note that PodCIDR is sort of a legacy artifact. We don't have a way to get
rid of it yet, but some network drivers don't use it.

@squeed (Contributor) commented Oct 7, 2016

Hi there, rkt developer here.
Has any progress been made? The current CRI spec doesn't seem to have much in PodSandboxConfig (i.e. the "create" datatype) w.r.t. networking.

Will the CRI support both kubelet and runtime invocation of CNI?

On the flip side, the PodSandboxStatus - the "read" datatype - has a rich set of information about networking.

@yifan-gu (Contributor) commented Oct 11, 2016

@squeed, @euank and I discussed last week whether kubelet or rkt should set up the CNI networks; so far we think it's better to have kubelet do it:

  • Kubelet also manages the /etc/hosts file
  • Kubelet also injects the pod IP into env vars for the pod (downward API)
  • Kubelet has dedicated logic to manage and GC networks

The information returned by CRI's PodSandboxStatus is not necessarily that of a CNI network, since the docker runtime also supports the no-op CNI plugin, which uses docker's own network module.

@squeed (Contributor) commented Oct 11, 2016

@yifan-gu I agree with you - lots of focus seems to be on the kubelet CRI driver. It's not clear to me whether direction should be shifted from that.

@euank (Contributor) commented Oct 11, 2016

@yifan-gu I disagree with the first two bullet points mattering here. With the no-op driver, PodSandboxStatus will have the correct IP for the downward API etc. in happy cases.

I think the IP returned in PodSandboxStatus is, by API contract, the IP you'd want present in the downward API and /etc/hosts, so it doesn't matter whether it's related to CNI or not.

The only thing kubelet should care about is retrieving information (e.g. pod IP and network stats) via CRI.

I think an important point the kubelet could care about is resource isolation. The way to manage that reliably is for kubelet to manage a network namespace (similar to the pod cgroup parent for cpu/memory). This namespace can be used as a point for bandwidth QoS and network stats.

Of course, the runtime could still in theory do different network setup within that namespace (make tap devices or more bridges or whatever), but it could also just reuse it if the kubelet set it up appropriately.
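
A tiny sketch of that idea, purely illustrative and not taken from kubelet: the kubelet could own a named network namespace per pod (here created by shelling out to iproute2's "ip netns"; real code would use netlink) and hand its path to the runtime, which is then free to build taps, bridges or anything else inside it.

package podns

import (
    "fmt"
    "os/exec"
)

// CreatePodNetNS creates a kubelet-owned network namespace for a pod and
// returns its path. The runtime (or a network plugin) can be pointed at this
// path and do its own setup inside the namespace; the kubelet keeps it as the
// attachment point for bandwidth QoS and network stats.
func CreatePodNetNS(podUID string) (string, error) {
    name := "pod-" + podUID
    if out, err := exec.Command("ip", "netns", "add", name).CombinedOutput(); err != nil {
        return "", fmt.Errorf("creating netns %s: %v (%s)", name, err, out)
    }
    return "/var/run/netns/" + name, nil
}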

I'm not totally convinced on either whether the kubelet or the runtime should actually be calling network plugins.

If kubenet is moved into dockershim, I'm pretty confident rktshim will have an almost totally unmodified copy of that code for setup / gc / etc, so in practice some will be shared.

@squeed (Contributor) commented Oct 11, 2016

I'm not totally convinced on either whether the kubelet or the runtime should actually be calling network plugins.

My personal preference (especially given #31307) is to have Kubelet call CNI, but I don't really care either way. I just want to see whatever the decision is captured in the CRI spec.

@dcbw (Member) commented Oct 11, 2016

@euank @yifan-gu at the last net-sig meeting @thockin was suggesting that the network setup would all live within the runtime code, so it would be under the control of docker/rkt/etc instead of in kubelet. If that's still the planned direction, then I would say that the CNI driver code should be in some kind of utility package in kubernetes git, or under the kubernetes project, that the runtimes could then import and use instead of duplicating all of that themselves.

WRT the stuff that kubelet currently passes into the plugins, all of that can be handled in the runtime or the plugin itself by watching the apiserver; you don't need kubelet to pass that stuff down. For example, podCIDR can be grabbed from the node object, bandwidth annotations can be grabbed from the pod's annotations, etc. I'm not sure whether that's desirable, or whether the CRI runtimes are currently expected to watch the apiserver or just use info passed from kubelet.

@thockin what's the plan of record here?

@euank (Contributor) commented Oct 11, 2016

@dcbw based on @freehan's PR over here #34276 it seems like that indeed is the plan of action.

I'm not sure that we wanted CRI runtimes to talk to the api-server (out of band) vs it being explicitly expressed in kubelet (in-band). I'm in favor of the latter even though it might mean fairly bulky CRI interfaces.

@jellonek (Contributor) commented:

IMO kubelet should be a proxy for passing this data, and CRI runtimes should not access the API server themselves.
This simplifies CRI runtimes, which would then not require any knowledge of the API server.

@feiskyer (Member, author) commented:

@euank @yifan-gu at the last net-sig meeting @thockin was suggesting that the network setup would all live within the runtime code, so it would be under the control of docker/rkt/etc instead of in kubelet. If that's still the planned direction, then I would say that the CNI driver code should be in some kind of utility package in kubernetes git, or under the kubernetes project, that the runtimes could then import and use instead of duplicating all of that themselves.

+1 for moving the network plugins to a separate project for reuse.

@thockin (Member) commented Oct 12, 2016

I agree that moving the CRI->CNI logic into a lib of its own probably makes
sense, but let's not prematurely factor it, please.

@squeed (Contributor) commented Oct 12, 2016

Right now, the node team is exploring how to include network plugin invocation in the runtime (or shim) via the CRI API. The conclusion is that it's hard to predict whether any nasty complexities (or needless leaky abstractions) will crop up without actually trying it.

In designing this API, it's also worth looking over any node-affecting proposals to the networking spec. The first ones that come to mind are NetworkPolicy (#22469) and NetworkEgressPolicy (#26720). Will implementing these require a change to CRI and the corresponding implementations? If so, that results in a nasty support matrix scenario (where different runtimes support different sets of features) - something to be avoided!

@euank (Contributor) commented Oct 12, 2016

I think sharing kubenet and cni-related code might not be premature, since for the sake of backwards compatibility I assume at least docker and rkt will want to have implementations that are functionally identical to the existing code.

@squeed Wouldn't that matrix exist in either scenario? The runtimes have a natural desire to be compatible so that switching is painless, but they might be forced to be incompatible because of e.g. the limitations of VMs. Regardless of where the logic lives, you'll have the same pair of (networkPluginConfig, runtime) that fails, and handling that failure just ends up in different places.

@freehan (Contributor) commented Oct 14, 2016

To summarize the discussion:

With the Container Runtime Interface, the runtime/shim is responsible for pod networking. This includes set_up, tear_down, get_network_status and other background mechanisms for maintaining network state (e.g. keeping track of allocated IPs and host ports). Kubelet is responsible for supplying information to the shim/runtime: from kubelet's point of view, it pushes necessary config information (e.g. podCIDR), triggers create/delete of the podSandbox, and expects the shim/runtime to implement and maintain pod network state.

Why let runtime/shim handle networking?

  • Letting kubelet handle networking implies that network workflows need to be compatible with multiple runtimes. Given that runtimes are very different and network providers can also be very different, it seems unlikely that there can be a single, simple, uniform network workflow.
  • Kubelet does not have enough information to maintain network state. For example, the runtime may restart or crash and need proper cleanup.
  • The runtime/shim has more flexibility to do things differently.
  • In the current implementation, kubelet usually acts as a middleman between network plugins and runtimes. It simply passes information back and forth, which does not add much value. On the other hand, if some runtimes want to share the common networking code, we can consider extracting that code into a library.

Some Additional points:

  • Kubelet expects pods to be set up with an IP that is routable within the cluster, so that pod-to-pod, pod-to-node and pod-to-internet communication is possible. (We should put this in the spec.)
  • Kubelet makes no assumptions about implementation details. It has no expectation of whether the shim/runtime works with CNI, kubenet, docker networks or anything else.
  • Existing network configs (e.g. kubenet, the CNI network plugin) should still work without CRI until they are deprecated.
  • Pod traffic shaping currently goes through a pod annotation. If it becomes a field, it may move into PodSandboxConfig. The same applies to other future features; CRI needs to adjust over time.

Plan to move forward:

  • Migrate network-related parameters from kubelet to the runtime/shim, such as hairpin-mode and network-plugin.
  • Information that is not available on initialization goes through the UpdateRuntimeConfig interface. Currently this is only PodCIDR; we may encounter more cases as we move on, such as the coming k8s Network object (see the sketch after this list).
  • Keep the current network plugin codebase and refactor it so that it can be reused in different shims.
  • Try to reuse the current network plugin interface. At least for dockershim, it still needs a plugin interface to extend network capabilities.
  • Trigger network plugins in the shims.
  • Clean up network plugin hooks in kubeGenericRuntimeManager.
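
As a rough illustration of the UpdateRuntimeConfig item above (the sketch referenced in that list), a shim might cache the pod CIDR it receives and hand it to whatever network plugin it drives. The types below are simplified stand-ins, not the generated CRI structs.

package shim

import "sync"

// UpdateRuntimeConfigRequest is a simplified stand-in for the CRI message; the
// real request wraps the pod CIDR in a RuntimeConfig/NetworkConfig.
type UpdateRuntimeConfigRequest struct {
    PodCIDR string
}

// NetworkPlugin is whatever plugin the shim drives (CNI, kubenet, ...).
type NetworkPlugin interface {
    SetPodCIDR(cidr string) error
}

// RuntimeShim holds the networking state that kubelet no longer tracks.
type RuntimeShim struct {
    mu      sync.Mutex
    podCIDR string
    net     NetworkPlugin
}

// UpdateRuntimeConfig is the shim-side handler: kubelet pushes configuration
// that is not available at initialization (currently only the pod CIDR).
func (s *RuntimeShim) UpdateRuntimeConfig(req *UpdateRuntimeConfigRequest) error {
    s.mu.Lock()
    defer s.mu.Unlock()
    if req.PodCIDR == "" || req.PodCIDR == s.podCIDR {
        return nil // nothing new to apply
    }
    s.podCIDR = req.PodCIDR
    // Plugins that do not use the pod CIDR can simply ignore this value.
    return s.net.SetPodCIDR(req.PodCIDR)
}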

@yifan-gu (Contributor) commented:

Letting kubelet handle networking implies that network workflows need to be compatible with multiple runtimes. Given that runtimes are very different and network providers can also be very different, it seems unlikely that there can be a single, simple, uniform network workflow.

@freehan Could you elaborate on what's included in the network workflow? Do you have any examples?
My thought is that we should keep the common logic in the kubelet's network package, and the special parts as CNI plugins. Isn't that what we need plugins for?

From what I understand, today's runtimes can be categorized into two types:

  • namespace based (docker, rkt with nspawn stage1, cri-o)
  • vm based (hyper, rkt with kvm stage1)

We already have a solution for the namespace-based runtimes, with rkt and docker sharing the same kubelet network logic.
For the VM runtimes, there are efforts to add support for them in CNI (containernetworking/cni#251), but they don't seem active recently; I'm not sure about the status.

What I want to point out is that I am not expecting one single uniform solution; that's where plugins come in.
Besides, the runtimes are not as different IMO as they might seem, especially now that they expose a very low-level interface through CRI.

Kubelet does not have enough information to maintain network state. For example, the runtime may restart or crash and need proper cleanup.

I think this is solvable by making sure all resources are released before finally removing the container (e.g. if teardown() fails, don't remove the container this time; try again next time).

IMHO we cannot avoid such logic; it either lives in kubelet and is shared by all runtimes, or it is implemented in all runtimes (in which case the runtime has to clean up resources first in RemovePodSandbox(), and abort if it fails to do that).
Hence I am not seeing additional value in pushing it down to the runtime.

In the current implementation, kubelet usually acts as a middleman between network plugins and runtimes. It simply passes information back and forth, which does not add much value. On the other hand, if some runtimes want to share the common networking code, we can consider extracting that code into a library.

I think today kubelet only passes information once, when it creates the sandbox; after that, it gets the network information from the network code and does the cleanup itself, so it's not really touching the runtime much for networking.

Pod traffic shaping currently goes through a pod annotation. If it becomes a field, it may move into PodSandboxConfig. The same applies to other future features; CRI needs to adjust over time.

This is another example of duplicating shareable logic. Besides, isn't it contradicting what you said above:

Kubelet makes no assumptions about implementation details. It has no expectation of whether the shim/runtime works with CNI, kubenet, docker networks or anything else.

Since kubelet is making an assumption about the shaper.

To sum up, my two cents are:

  • I am not convinced that pushing networking down to the runtime is the only solution for proper resource cleanup.
  • Giving the runtime more control over the network is, IMO, in the opposite direction from CRI, which aims to bring the UX of different runtimes closer together by giving the runtime less freedom. And as @squeed said above, I am concerned this will result in a nasty support-matrix scenario.

@philips (Contributor) commented Oct 19, 2016

I want to echo what @thockin has said:

I agree that moving the CRI->CNI logic into a lib of its own probably makes
sense, but let's not prematurely factor it, please.

Reading through all of this, I see the argument for doing this to fix a bug and give more power to the runtimes. But I don't understand why we need to do it in this release, with everything else we are trying to accomplish in proving out the CRI model. We know the CRI will need to evolve over time, as @freehan said above:

Pod traffic shaping currently goes through a pod annotation. If it becomes a field, it may move into PodSandboxConfig. The same applies to other future features; CRI needs to adjust over time.

Also, are there other things the kubelet allocates that might leak? For example, how do we ensure that kubelet-allocated cgroups don't leak?

It would be great to support everything with CRI in v1.5, but I question whether we can get everything 100% right with netns-based runtimes in this release. There are already plenty of challenges on the plate (tty, port-forward, etc).

Is networking the only potential place that the CRI and kubelet might get out of sync on a kubelet restart? It does feel like we need to reflect on @thockin's comment about checkpointing:

Is this just another case of "if we had a checkpoint we would never lose
state" ?

What is the urgency for #34780?

@yujuhong (Contributor) commented:

@philips, could you clarify a little bit whether you are disagreeing with the general direction of delegating the networking to the runtime, or just questioning the timeline?

Is networking the only potential place that the CRI and kubelet might get out of sync on a kubelet restart? It does feel like we need to reflect on @thockin's comment about checkpointing:

kubelet checks the current state of all containers and sandboxes, and creates/removes them if necessary. If the runtime can faithfully report that there are still resources allocated to a container/sandbox, kubelet will remove them. For example, if an infra container has been removed but the IP address allocated to it remains, the runtime can report that the pod sandbox is not ready, so that kubelet calls RemovePodSandbox to reclaim the resources.
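
A minimal sketch of that reconciliation, with hypothetical helper names rather than the real kubelet/CRI types:

package kubelet

// runtimeAPI is a pared-down view of the CRI calls used in this sketch.
type runtimeAPI interface {
    PodSandboxReady(sandboxID string) (bool, error) // e.g. derived from PodSandboxStatus
    RemovePodSandbox(sandboxID string) error
}

// reclaimStaleSandboxes removes sandboxes the runtime reports as not ready,
// e.g. an infra container that is gone but whose allocated IP remains.
// RemovePodSandbox is expected to release any remaining resources.
func reclaimStaleSandboxes(rt runtimeAPI, sandboxIDs []string) error {
    for _, id := range sandboxIDs {
        ready, err := rt.PodSandboxReady(id)
        if err != nil {
            return err
        }
        if ready {
            continue
        }
        if err := rt.RemovePodSandbox(id); err != nil {
            return err
        }
    }
    return nil
}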

For resources managed by kubelet directly, kubelet is responsible for ensuring that there is no leakage.

@yujuhong (Contributor) commented:

Reading through all of this, I see the argument for doing this to fix a bug and give more power to the runtimes. But I don't understand why we need to do it in this release, with everything else we are trying to accomplish in proving out the CRI model. We know the CRI will need to evolve over time, as @freehan said above:

I think there is some misunderstanding here. We are NOT doing #34780 to fix a bug. This issue was created back in July so that we could reach a consensus on how to handle networking in CRI. PR #34780 is simply implementing that decision.

I want to echo what @thockin has said:

I agree that moving the CRI->CNI logic into a lib of its own probably makes
sense, but let's not prematurely factor it, please.

I think @thockin was saying that each runtime can implement its own networking logic, and if there is ever a need to factor out a common library, we can do it (but not now). I don't see how this contradicts the decision and implementation.

@yujuhong (Contributor) commented:

@yifan-gu I think kubelet is not going to dictate what network plugin you use, whether it be CNI or not. For example, @thockin also mentioned in #28667 (comment) that a docker-native driver may appeal to some users.

@yujuhong (Contributor) commented:

To summarize, pod/container networking is delegated to the container runtime completely in CRI as of v1alpha1. Closing this issue.

@jonboulle (Contributor) commented:

To summarize, pod/container networking is delegated to the container runtime completely in CRI as of v1alpha1.

@yujuhong could you expand on this a bit? I don't really understand, since the CRI still only contains minimal networking related information (only pod CIDR at this stage).

@bprashanth (Contributor) commented:

Right now, CRI doesn't even care about CNI or !CNI. Our implementation of dockershim probes for plugins and translates pod sandbox setup/teardown calls to CNI ADD/DEL calls. The kubelet just passes a struct of plugin settings down to the shim (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L509) and then, sometime later, passes down the podCIDR through an rpc (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_network.go#L221).

The 6 kubelet command line args that are passed through are:

--cni-bin-dir string                                      <Warning: Alpha feature> The full path of the directory in which to search for CNI plugin binaries. Default: /opt/cni/bin
--cni-conf-dir string                                     <Warning: Alpha feature> The full path of the directory in which to search for CNI config files. Default: /etc/cni/net.d
--hairpin-mode string                                     How should the kubelet setup hairpin NAT. This allows endpoints of a Service to loadbalance back to themselves if they should try to access their own Service. Valid values are "promiscuous-bridge", "hairpin-veth" and "none". (default "promiscuous-bridge")
--network-plugin string                                   <Warning: Alpha feature> The name of the network plugin to be invoked for various events in kubelet/pod lifecycle
--network-plugin-mtu int32                                <Warning: Alpha feature> The MTU to be passed to the network plugin, to override the default. Set to 0 to use the default 1460 MTU.
--non-masquerade-cidr string                              Traffic to IPs outside this range will use IP masquerade. (default "10.0.0.0/8")

We'll probably rip these out of the kubelet altogether when we default to a standalone shim.
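
For illustration only, those six flags map roughly onto a settings bundle that the kubelet hands to the shim at construction time; the type and field names below are made up for this sketch rather than copied from the actual dockershim definition.

package shim

// pluginSettings is an illustrative bundle of the kubelet network flags that
// get passed down to the shim (not the real dockershim type).
type pluginSettings struct {
    pluginName        string // --network-plugin
    pluginBinDir      string // --cni-bin-dir
    pluginConfDir     string // --cni-conf-dir
    mtu               int32  // --network-plugin-mtu
    hairpinMode       string // --hairpin-mode
    nonMasqueradeCIDR string // --non-masquerade-cidr
}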

@yujuhong (Contributor) commented:

@yujuhong could you expand on this a bit? I don't really understand, since the CRI still only contains minimal networking related information (only pod CIDR at this stage).

In short, kubelet doesn't care how pod networking is set up. It assumes the runtime (shim) handles all the necessary steps to set up networking for a pod. The only thing it passes to the runtime via CRI as of now is the pod CIDR, and that's only because pod CIDR is written to the node spec by a controller. Not all network plugins use the pod CIDR, so this extra bit of information may be ignored by the runtime/plugin completely.

AFAIK, @freehan is working on documentation for CRI networking.
