
KEP-2593: update risks and mitigation, rename object and target releases #4174

Closed
wants to merge 3 commits

Conversation

aojea
Member

@aojea aojea commented Sep 4, 2023

During the development of KEP-1880: Multiple Service CIDRs, the intersection with KEP-2593: Enhanced NodeIPAM to support Discontiguous Cluster CIDR led SIG Network to revisit the state of the Kubernetes networking configuration, especially the configuration of the Pod and Service networks.
This triggered several discussions; different alternatives were evaluated and discussed in depth, concluding that not all possible problems of the network intersection can be mitigated, and that it would be very difficult (if not impossible) at this stage of the project to make any significant changes while keeping backwards compatibility and without disrupting the whole ecosystem that depends on the current state.

During these conversations some concerns were raised by the community about this KEP-2593:

One part that may cause friction is the name of the object used, ClusterCIDR, but as previously mentioned there are several points that make keeping the same object name preferable; the existing kube-controller-manager flag has the same name:

--cluster-cidr string
        CIDR Range for Pods in cluster. Requires --allocate-node-cidrs to be true

The proposal is to move this KEP forward to beta in 1.29 in its current state.
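For context, this is roughly the shape of the object under discussion, as exposed by the alpha networking.k8s.io/v1alpha1 API (field names taken from the alpha implementation; the values here are illustrative, not from the KEP text):

```yaml
# A ClusterCIDR that hands out /24s (32 - perNodeHostBits) from 10.0.0.0/16
# to nodes matching the selector. Note it only reconfigures the kcm NodeIPAM
# controller; it does not define "the pod network" for other components.
apiVersion: networking.k8s.io/v1alpha1
kind: ClusterCIDR
metadata:
  name: example-cidr
spec:
  perNodeHostBits: 8        # each node gets 2^8 host addresses, i.e. a /24
  ipv4: 10.0.0.0/16
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/os
        operator: In
        values: ["linux"]
```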

/sig network
/assign @thockin @danwinship

@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Sep 4, 2023
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 4, 2023
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 4, 2023
@danwinship
Contributor

/hold

The existing kube-controller-manager flag has the same name, --cluster-cidr

Yes, and the ClusterCIDR object is intended to override that flag only (e.g., it is explicitly not intended to also override kube-proxy's --cluster-cidr flag). If it were called KubeControllerManagerClusterCIDR, that would be great, but calling it just ClusterCIDR makes it sound like it's not just about kcm.

It should be generic enough to not be tied to a single company/provider/vendor. Google's interest, as promoter of the KEP, is to use it with Cilium, a CNCF and OSS project; SUSE is already interested in this functionality and has implemented it in flannel (flannel-io/flannel@243cf7e), giving enough proof that this cannot generate vendor lock-in.

I'm not worried about "vendor lock-in". I'm worried about... exactly what they did in that commit. The KEP clearly states that "This KEP assumes that the only consumer of the --cluster-cidr value is the NodeIPAM controller." And yet, flannel was modified so that it now also consumes the ClusterCIDR objects, for its own cluster-CIDR-determining purposes, meaning it is treating ClusterCIDR as the canonical definition of the pod network, which is exactly what we agreed it is not supposed to be.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 4, 2023
@danwinship
Contributor

danwinship commented Sep 4, 2023

I'm worried about... exactly what they did in that commit.

So, clarifying that some more: what everyone wants is "pod network autoscaling", because if you don't have that, then node autoscaling stops being useful at some point.

In GKE, it happens to be the case that the only additional feature you need to have in order to be able to do pod network autoscaling is the ability to reconfigure the kcm Node IPAM controller; nothing else in GKE cares about the size/shape of the cluster network.

In most other cluster configurations, this is not the case; for example, when using flannel as your network plugin, flannel itself needs to know the extent of the cluster network, so just having the ability to reconfigure the Node IPAM controller does not give you pod network autoscaling if you are using flannel.

The PR linked above "fixes" this by having flannel sniff the config objects that KEP-2593 added in order to configure NodeIPAM, which KEP-2593 explicitly says is not what those objects are intended to be used for. There is not actually any way for them to implement the feature that they want with the current KEP. And, IMO, that suggests that the KEP is not actually solving the right problem; having the ability to reconfigure NodeIPAM is one of the things you need in order to have pod network autoscaling, but it's not the only thing (or even, for most clusters, the primary thing).

@danwinship
Contributor

Or, OTOH, if we are only going to solve the problem for GKE, then there is no need for a Kubernetes-level feature; GKE could just implement its own alternate NodeIPAM controller and use that rather than the kcm one (as several other network plugins already do). Then external components like flannel would not be tricked into misusing the API.

@aojea
Member Author

aojea commented Sep 5, 2023

Then external components like flannel would not be tricked into misusing the API.

SUSE Rancher wanted this feature as described in the KEP and is not misusing it; a consequence of using this feature is that the objects used to configure the IPAM match the pod CIDRs on the nodes, not the other way around.

Cilium also has a mode of operation that depends on NodeIPAM and will benefit from this feature; there are multiple projects that depend on NodeIPAM and benefit from this feature. It is not a GKE-only problem.

@aojea
Member Author

aojea commented Sep 5, 2023

@danwinship I think there is a misunderstanding about what this feature targets: ALL clusters that depend on NodeIPAM (some modes of Cilium, flannel, kubeadm, ...) CANNOT scale once the number of nodes reaches 2^(node-pod-mask-size - ClusterCIDR mask).
The only alternative today is to restart the controller manager with new flags; this feature removes the need to use flags and restart the controller manager.
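That limit is simple arithmetic: the cluster CIDR is carved into fixed-size per-node blocks, so capacity is the number of blocks that fit. A minimal sketch (a hypothetical helper, not code from the KEP or kubernetes/kubernetes):

```python
def max_nodes(cluster_cidr_mask: int, node_pod_mask_size: int) -> int:
    """Maximum number of nodes NodeIPAM can allocate pod CIDRs for.

    Each node receives one /node_pod_mask_size subnet carved out of the
    /cluster_cidr_mask cluster CIDR, so capacity is
    2^(node_pod_mask_size - cluster_cidr_mask).
    """
    return 2 ** (node_pod_mask_size - cluster_cidr_mask)

# A /16 cluster CIDR with the default /24 per-node mask caps out at 256 nodes:
print(max_nodes(16, 24))  # -> 256
```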

GKE has bigger demand and customers reach this limit more often than in other places, but I think we can drop the GKE argument: there are at least 3 OSS projects that use NodeIPAM, so the problem is demonstrated.

I don't find it legitimate to push back on the argument that people may use the API for something it is not supposed to represent; people can sneak into the object and assume it is the ClusterCIDR for the whole cluster, but they can perfectly well list all nodes and read node.Spec.PodCIDRs too.

I don't disagree that the name can be misleading, but I completely disagree that the feature is not useful outside of GKE.

@aojea
Member Author

aojea commented Sep 5, 2023

I'm going to rename the object to keep people from assuming these objects configure the Pod CIDR of the cluster; I think we never disagreed on that. But we both reviewed the original KEP back in 2021 (#2594), and I don't think it is legitimate at this point to revisit the intent of the KEP ...

The name ClusterCIDR for the object may cause confusion for users,
since it will only represent the ClusterCIDR IF the cluster is using
the specific MultiNodeIPAM controller that uses these objects.
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aojea
Once this PR has been reviewed and has the lgtm label, please ask for approval from thockin. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 5, 2023
@aojea
Member Author

aojea commented Sep 5, 2023

@caseydavenport I can see Calico has an IPAMConfiguration CRD (https://docs.tigera.io/calico/latest/reference/resources/ipamconfig); I'm suggesting using NodeIPAMConfig here. Can you ack that this is OK from the Calico side?

@danwinship
Contributor

we both reviewed the original KEP back in 2021 (#2594), and I don't think it is legitimate at this point to revisit the intent of the KEP ...

(FWIW, I didn't like this approach then either: #2594 (comment). I just got busy with other stuff and stopped arguing against it.)

@aojea
Member Author

aojea commented Sep 5, 2023

(FWIW, I didn't like this approach then either: #2594 (comment). I just got busy with other stuff and stopped arguing against it.)

Then what do we do? Do we revert the KEP?

@danwinship
Contributor

danwinship commented Sep 5, 2023

SUSE rancher wanted this feature as described in the KEP

No, SUSE wanted "pod network scaling". That is not the feature described in the KEP. The KEP describes "extending NodeIPAM to support pod network scaling", which is part of the "pod network scaling" feature, but not all of it. SUSE wanted all of it, and so they also made changes to flannel, and those changes reuse parts of the NodeIPAM scaling feature in a way which the KEP does not anticipate and probably intends to forbid. (The KEP says that "This KEP assumes that the only consumer of the `--cluster-cidr` value is the NodeIPAM controller", and even rejects the possibility that kube-proxy could look at the ClusterCIDR objects when it is running in --detect-local-mode ClusterCIDR. The PRR answer about skew says that "there is no coordination between multiple components required for this feature", which is certainly not true in a cluster where kcm and flannel are both looking at ClusterCIDR objects. E.g. if flannel is configured with --use-multi-cluster-cidr and kcm is not configured with --cidr-allocator-type=MultiCIDRRangeAllocator, or vice versa, then the network will malfunction. This is of course a solvable problem, but the KEP explicitly does not try to solve it.)

ALL clusters that depends on nodeIPAM, some modes of cilium, flannel, kubeadm, ... CAN NOT scale if the number of nodes is equal to the 2^"ClusterCIDR mask - node-pod-mask-size".

Yes, I understand that. My point is that most clusters, whether they use NodeIPAM or not, also require other changes in order to support pod network scaling. And this KEP makes no effort to solve those other problems.

I would love for Kubernetes to support pod network scaling, but if we want it to support that, then we need to actually write a pod network scaling KEP, which will include dynamic NodeIPAM reconfiguration as one of its pieces, but will also have to include other things.

@aojea
Member Author

aojea commented Sep 5, 2023

I would love for Kubernetes to support pod network scaling, but if we want it to support that, then we need to actually write a pod network scaling KEP, which will include dynamic NodeIPAM reconfiguration as one of its pieces, but will also have to include other things.

I'm confused now, why is the other KEP blocking this KEP?
node.Spec.PodCIDR is immutable and cannot be changed; NodeIPAM will not be a tool to solve the pod network scaling problem.

@aojea aojea changed the title KEP-2593: update target releases KEP-2593: update risks and mitigation, rename object and target releases Sep 5, 2023
@aojea
Member Author

aojea commented Sep 5, 2023

This is of course a solvable problem, but the KEP explicitly does not try to solve it

Addressed in the last commit in the "Risks and Mitigations" section
58477c5

@danwinship
Contributor

This is of course a solvable problem, but the KEP explicitly does not try to solve it

Addressed in the last commit in the "Risks and Mitigations" section 58477c5

I don't think that addresses the problem. The "flannel config vs kcm config" point was just one example.

My point is that the KEP process exists to make sure that we answer various hard questions when adding a new feature, and in the case of KEP-2593, we explicitly decided to not answer any of the hard questions about pod network scaling in the general case (despite the fact that the KEP's first Goal is "Support multiple discontiguous IP CIDR blocks for Cluster CIDR" and its first User Story is "Add more pod IPs to the cluster"). But now people are trying to use KEP-2593 to implement pod network scaling in other contexts anyway, and potentially shooting themselves (and other people) in the foot because we didn't think about any of the possible difficulties with doing this... The KEP has become an attractive nuisance.

(e.g., Another problem with the flannel patch: if the admin deletes a ClusterCIDR object, flannel will consider that CIDR to be no longer part of the pod network right away, even though it may actually still be in use by some nodes. The defined semantics of deleting a ClusterCIDR object according to KEP-2593 is that the NodeIPAM controller should stop using that CIDR for new nodes, but nothing indicates when it is no longer used for old nodes, which is what flannel actually needs to know.)

@danwinship
Contributor

because we didn't think about any of the possible difficulties with doing this

an attempt: https://github.com/danwinship/enhancements/tree/pod-network-scaling/keps/sig-network/2593bis-pod-network-scaling#summary

@aojea
Member Author

aojea commented Sep 22, 2023

The KEP has become an attractive nuisance.

Agree, it is a very bespoke solution; it does not solve the real problem and, foremost, it is going to be confusing for users ... I think we should also remove it from the codebase @thockin

an attempt: https://github.com/danwinship/enhancements/tree/pod-network-scaling/keps/sig-network/2593bis-pod-network-scaling#summary

@danwinship that has a lot of overlap with the multinetwork KEP https://docs.google.com/document/d/17LhyXsEgjNQ0NWtvqvtgJwVqdJWreizsgAZHWflgP-A/edit
