
KEP-2593: update risks and mitigation, rename object and target releases #4174

Closed
wants to merge 3 commits

Conversation

aojea
Member

@aojea aojea commented Sep 4, 2023

During the development of KEP-1880: Multiple Service CIDRs, the intersection with KEP-2593: Enhanced NodeIPAM to support Discontiguous Cluster CIDR led SIG Network to revisit the state of the Kubernetes networking configuration, especially the configuration of the Pod and Service networks.
This triggered several discussions; different alternatives were evaluated and discussed in depth, concluding that not all possible problems of the network intersection can be mitigated, and that it would be very difficult (if not impossible) at this stage of the project to make any significant changes while keeping backwards compatibility and without disrupting the whole ecosystem that depends on the current state.

During these conversations some concerns were raised by the community about this KEP-2593:

One part that may cause friction is the name of the object used, ClusterCIDR, but as previously mentioned there are several points that make keeping the same object name preferable; the existing kube-controller-manager flag has the same name:

--cluster-cidr string
        CIDR Range for Pods in cluster. Requires --allocate-node-cidrs to be true

The proposal is to move this KEP forward to beta in 1.29 in its current state.
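For context, this is roughly the shape of the object under discussion, as exposed by the alpha networking.k8s.io/v1alpha1 API (field names taken from the alpha implementation; the values here are illustrative, not from the KEP text):

```yaml
# A ClusterCIDR that hands out /24s (32 - perNodeHostBits) from 10.0.0.0/16
# to nodes matching the selector. Note it only reconfigures the kcm NodeIPAM
# controller; it does not define "the pod network" for other components.
apiVersion: networking.k8s.io/v1alpha1
kind: ClusterCIDR
metadata:
  name: example-cidr
spec:
  perNodeHostBits: 8        # each node gets 2^8 host addresses, i.e. a /24
  ipv4: 10.0.0.0/16
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/os
        operator: In
        values: ["linux"]
```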

/sig network
/assign @thockin @danwinship

@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Sep 4, 2023
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 4, 2023
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 4, 2023
@danwinship
Contributor

/hold

The existing kube-controller-manager flag has the same name, --cluster-cidr

Yes, and the ClusterCIDR object is intended to override that flag only (e.g., it is explicitly not intended to also override kube-proxy's --cluster-cidr flag). If it were called KubeControllerManagerClusterCIDR, that would be great, but calling it just ClusterCIDR makes it sound like it's not just about kcm.

It should be generic enough to not be tied to a single company/provider/vendor. Google's interest, as promoter of the KEP, is to use it with Cilium, a CNCF and OSS project; SUSE is already interested in this functionality and has implemented it in flannel (flannel-io/flannel@243cf7e), giving enough proof that this cannot generate vendor lock-in.

I'm not worried about "vendor lock-in". I'm worried about... exactly what they did in that commit. The KEP clearly states that "This KEP assumes that the only consumer of the --cluster-cidr value is the NodeIPAM controller." And yet, flannel was modified so that it now also consumes the ClusterCIDR objects, for its own cluster-CIDR-determining purposes, meaning it is treating ClusterCIDR as the canonical definition of the pod network, which is exactly what we agreed it is not supposed to be.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 4, 2023
@danwinship
Contributor

danwinship commented Sep 4, 2023

I'm worried about... exactly what they did in that commit.

So, clarifying that some more: what everyone wants is "pod network autoscaling", because if you don't have that, then node autoscaling stops being useful at some point.

In GKE, it happens to be the case that the only additional feature you need to have in order to be able to do pod network autoscaling is the ability to reconfigure the kcm Node IPAM controller; nothing else in GKE cares about the size/shape of the cluster network.

In most other cluster configurations, this is not the case; for example, when using flannel as your network plugin, flannel itself needs to know the extent of the cluster network, so just having the ability to reconfigure the Node IPAM controller does not give you pod network autoscaling if you are using flannel.

The PR linked above "fixes" this by having flannel sniff the config objects that KEP-2593 added in order to configure NodeIPAM, which KEP-2593 explicitly says is not what those objects are intended to be used for. There is not actually any way for them to implement the feature that they want with the current KEP. And, IMO, that suggests that the KEP is not actually solving the right problem; having the ability to reconfigure NodeIPAM is one of the things you need in order to have pod network autoscaling, but it's not the only thing (or even, for most clusters, the primary thing).

@danwinship
Contributor

Or, OTOH, if we are only going to solve the problem for GKE, then there is no need for a Kubernetes-level feature; GKE could just implement its own alternate NodeIPAM controller and use that rather than the kcm one (as several other network plugins already do). Then external components like flannel would not be tricked into misusing the API.

@aojea
Member Author

aojea commented Sep 5, 2023

Then external components like flannel would not be tricked into misusing the API.

SUSE Rancher wanted this feature as described in the KEP and is not misusing it; a consequence of using this feature is that the objects used to configure the IPAM match the pod CIDRs on the nodes, not the other way around.

Cilium also has a mode of operation that depends on NodeIPAM and will benefit from this feature; there are multiple projects that depend on NodeIPAM and benefit from this feature. It is not a GKE-only problem.

@aojea
Member Author

aojea commented Sep 5, 2023

@danwinship I think there is a misunderstanding about what this feature targets: ALL clusters that depend on NodeIPAM (some modes of Cilium, flannel, kubeadm, ...) CANNOT scale once the number of nodes reaches 2^(node-pod-mask-size - ClusterCIDR mask).
The only alternative today is to restart the controller manager with new flags; this feature removes the need to use flags and restart the controller manager.
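That limit is simple arithmetic: the cluster CIDR is carved into fixed-size per-node blocks, so capacity is the number of blocks that fit. A minimal sketch (a hypothetical helper, not code from the KEP or kubernetes/kubernetes):

```python
def max_nodes(cluster_cidr_mask: int, node_pod_mask_size: int) -> int:
    """Maximum number of nodes NodeIPAM can allocate pod CIDRs for.

    Each node receives one /node_pod_mask_size subnet carved out of the
    /cluster_cidr_mask cluster CIDR, so capacity is
    2^(node_pod_mask_size - cluster_cidr_mask).
    """
    return 2 ** (node_pod_mask_size - cluster_cidr_mask)

# A /16 cluster CIDR with the default /24 per-node mask caps out at 256 nodes:
print(max_nodes(16, 24))  # -> 256
```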

GKE has bigger demand and customers reach this limit more often than in other places, but I think we can drop the GKE argument: there are at least 3 OSS projects that use NodeIPAM, so the problem is demonstrated.

I don't find it legitimate to push back on the argument that people may use the API for something it is not supposed to represent; people can sneak into the object and assume it is the ClusterCIDR for the whole cluster, but they can perfectly well list all nodes and read node.Spec.PodCIDRs too.

I don't disagree that the name can be misleading, but I completely disagree that the feature is not useful outside of GKE.

@aojea
Member Author

aojea commented Sep 5, 2023

I'm going to rename the object to keep people from assuming these objects configure the Pod CIDR of the cluster; I think we never disagreed on that. But we both reviewed the original KEP back in 2021 (#2594), and I don't think it is legitimate at this point to revisit the intent of the KEP ...

The name ClusterCIDR for the object may cause confusion for users,
since it will only represent the ClusterCIDR IF the cluster is using
the specific MultiNodeIPAM controller that uses these objects.
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aojea
Once this PR has been reviewed and has the lgtm label, please ask for approval from thockin. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 5, 2023
@aojea
Member Author

aojea commented Sep 5, 2023

@caseydavenport I can see Calico has an IPAMConfiguration CRD (https://docs.tigera.io/calico/latest/reference/resources/ipamconfig); I'm suggesting using NodeIPAMConfig here. Can you ack that this is OK from the Calico side?

@danwinship
Contributor

we both reviewed the original KEP back in 2021 (#2594), and I don't think it is legitimate at this point to revisit the intent of the KEP ...

(FWIW, I didn't like this approach then either: #2594 (comment). I just got busy with other stuff and stopped arguing against it.)

@aojea
Member Author

aojea commented Sep 5, 2023

(FWIW, I didn't like this approach then either: #2594 (comment). I just got busy with other stuff and stopped arguing against it.)

Then what do we do? Do we revert the KEP?

@danwinship
Contributor

danwinship commented Sep 5, 2023

SUSE rancher wanted this feature as described in the KEP

No, SUSE wanted "pod network scaling". That is not the feature described in the KEP. The KEP describes "extending NodeIPAM to support pod network scaling", which is part of the "pod network scaling" feature, but not all of it. SUSE wanted all of it, and so they also made changes to flannel, and those changes reuse parts of the NodeIPAM scaling feature in a way which the KEP does not anticipate and probably intends to forbid. (The KEP says that "This KEP assumes that the only consumer of the `--cluster-cidr` value is the NodeIPAM controller", and even rejects the possibility that kube-proxy could look at the ClusterCIDR objects when it is running in --detect-local-mode ClusterCIDR. The PRR answer about skew says that "there is no coordination between multiple components required for this feature", which is certainly not true in a cluster where kcm and flannel are both looking at ClusterCIDR objects. E.g. if flannel is configured with --use-multi-cluster-cidr and kcm is not configured with --cidr-allocator-type=MultiCIDRRangeAllocator, or vice versa, then the network will malfunction. This is of course a solvable problem, but the KEP explicitly does not try to solve it.)

ALL clusters that depends on nodeIPAM, some modes of cilium, flannel, kubeadm, ... CAN NOT scale if the number of nodes is equal to the 2^"ClusterCIDR mask - node-pod-mask-size".

Yes, I understand that. My point is that most clusters, whether they use NodeIPAM or not, also require other changes in order to support pod network scaling. And this KEP makes no effort to solve those other problems.

I would love for Kubernetes to support pod network scaling, but if we want it to support that, then we need to actually write a pod network scaling KEP, which will include dynamic NodeIPAM reconfiguration as one of its pieces, but will also have to include other things.

@aojea
Member Author

aojea commented Sep 5, 2023

I would love for Kubernetes to support pod network scaling, but if we want it to support that, then we need to actually write a pod network scaling KEP, which will include dynamic NodeIPAM reconfiguration as one of its pieces, but will also have to include other things.

I'm confused now, why is the other KEP blocking this KEP?
node.Spec.PodCIDR is immutable and cannot be changed; NodeIPAM will not be a tool to solve the pod network scaling problem.

@aojea aojea changed the title KEP-2593: update target releases KEP-2593: update risks and mitigation, rename object and target releases Sep 5, 2023
@aojea
Member Author

aojea commented Sep 5, 2023

This is of course a solvable problem, but the KEP explicitly does not try to solve it

Addressed in the last commit in the "Risks and Mitigations" section
58477c5

@danwinship
Contributor

This is of course a solvable problem, but the KEP explicitly does not try to solve it

Addressed in the last commit in the "Risks and Mitigations" section 58477c5

I don't think that addresses the problem. The "flannel config vs kcm config" point was just one example.

My point is that the KEP process exists to make sure that we answer various hard questions when adding a new feature, and in the case of KEP-2593, we explicitly decided to not answer any of the hard questions about pod network scaling in the general case (despite the fact that the KEP's first Goal is "Support multiple discontiguous IP CIDR blocks for Cluster CIDR" and its first User Story is "Add more pod IPs to the cluster"). But now people are trying to use KEP-2593 to implement pod network scaling in other contexts anyway, and potentially shooting themselves (and other people) in the foot because we didn't think about any of the possible difficulties with doing this... The KEP has become an attractive nuisance.

(e.g., Another problem with the flannel patch: if the admin deletes a ClusterCIDR object, flannel will consider that CIDR to be no longer part of the pod network right away, even though it may actually still be in use by some nodes. The defined semantics of deleting a ClusterCIDR object according to KEP-2593 is that the NodeIPAM controller should stop using that CIDR for new nodes, but nothing indicates when it is no longer used for old nodes, which is what flannel actually needs to know.)

@danwinship
Contributor

because we didn't think about any of the possible difficulties with doing this

an attempt: https://github.com/danwinship/enhancements/tree/pod-network-scaling/keps/sig-network/2593bis-pod-network-scaling#summary

@aojea
Member Author

aojea commented Sep 22, 2023

The KEP has become an attractive nuisance.

Agree, it is a very bespoke solution; it does not solve the real problem and, foremost, it is going to be confusing for users ... I think we should also remove it from the codebase @thockin

an attempt: https://github.com/danwinship/enhancements/tree/pod-network-scaling/keps/sig-network/2593bis-pod-network-scaling#summary

@danwinship that has a lot of overlap with the multinetwork KEP https://docs.google.com/document/d/17LhyXsEgjNQ0NWtvqvtgJwVqdJWreizsgAZHWflgP-A/edit
