[scale] node state changes causes excessive LB re-configurations #111539
Comments
/sig network
I am doubtful whether I should open a new issue for this (seeing as it's closely correlated with this one), but there is another problem exacerbated by re-configuring the LB when it isn't really needed: updating a service's `externalTrafficPolicy`. Specifically, when such an update is issued on the service object, kube-proxy will update all rules on nodes not hosting any endpoint and stop traffic from being forwarded there; this is "the fast path". Inefficiently, however, the CCM will also trigger a re-configuration of the LB's HC, which is "the slow path". Any client attempting to connect to the application backed by this service will be unable to until the LB is fully re-configured, since the LB will have either no HC or an incomplete one that cannot fully indicate where traffic should be sent. It feels ridiculous that such a small change can lead to such long windows of observed downtime; this is particularly impactful for low-latency applications with strict SLAs. This can't be fixed today without that enhancement proposal, but I think that in an ideal world (hopefully once the KEP is in) the HC, for what concerns the LB, should remain the same. The only real change on such an update should be the service proxy's response to that HC. That would severely reduce the observed "downtime". That obviously means impacting the service proxies.
/reopen

This was only partially addressed by #109706.
@alexanderConstantinescu: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@alexanderConstantinescu do you think we need a KEP for the rest of this change? (It sounds like there is already a KEP in progress, do you want to link it here?)
Sure, here's the KEP I posted, which will hopefully be reviewed during the 1.26 KEP cycle: kubernetes/enhancements#3460
Marking this as accepted since there is a KEP in flight.

/triage accepted
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
This issue has not been updated in over 1 year, and should be re-triaged. You can:
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
What happened?
There is a scale bug related to the cloud-controller-manager: when a cluster reaches a state where the number of LBs approaches the number of cluster nodes (i.e. a many-to-many relation between LBs and nodes), the number of cloud API calls performed by the CCM to sync LBs can become overwhelming for the cloud provider. This can (among other things) lead to:
An example of problem 3: imagine we have 200 LBs and 500 nodes. On clusters at such scale, the number of events processed by the CCM for transitioning node state can become significant. The current implementation of the CCM will resync all 200 LB services for any node in a transitioning state, so as to keep the LBs' configured backends in sync with the state of the cluster. It has been observed that on clusters at that scale, the cloud provider can take up to hours to process a full resync of all services. A scenario such as the one below is hence not improbable (and actually quite likely):
[All LB services are synced - time taken for this might be long]
[Eventually - once the CCM enters its second reconciliation loop caused by 3.]
The sync function triggered when a node transitions between Ready <-> NotReady is currently correct, but it can be skipped for services with `externalTrafficPolicy: Local` when none of the transitioning nodes host an endpoint of the service. Such services are not designed to forward traffic to any other node, so the update is useless to them: the transitioning node is already configured for that LB, and its readiness change is a moot point for what concerns these services.
The longer-term vision I have for this: remove the sync function for transitioning node readiness state from the CCM altogether, and instead have all services of type LoadBalancer evaluate the node's readiness state via an LB-configured health check probe against the kubelet's read-only port. This would allow transitioning node readiness state to be handled much more dynamically (similar to what kube-proxy already implements for HealthCheckNodePort) for all services of type LoadBalancer, irrespective of whether they define `externalTrafficPolicy: Local` or not... but that will require an enhancement proposal, since cloud providers have divergent implementations of how they configure health check probes for LB services.
I am putting this here for completeness' sake: this is what the major cloud providers currently do to health check services that are not `externalTrafficPolicy: Local`:
- GCE: probes port 10256
- AWS: if ELB, probes the first NodePort defined in the service spec
- Azure: probes all NodePorts defined in the service spec
All of these can be improved, but an enhancement proposal would be needed to align them. For now, this PR will help fix some cases.
What did you expect to happen?
Node readiness changes should not re-configure load balancers associated with services that have `externalTrafficPolicy: Local`.
Such services define a `healthCheckNodePort`, and the LB can rely on that much more dynamically without needing to be re-configured. In the long term: no load balancer should be re-configured because of node state changes. That will however need an enhancement proposal (which I am currently drafting) to change service proxies and align cloud provider HCs.
How can we reproduce it (as minimally and precisely as possible)?
"As minimally and precisely as possible" - not sure this hits that definition, but: we need a big cluster with a lot of entropy (for example, by performing an upgrade and restarting all nodes one by one). This should end up re-configuring every single LB N*M times (where N is the number of nodes and M is the number of state transitions per node).
Anything else we need to know?
Kubernetes version
All versions.
Cloud provider
All cloud providers.
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)