
[scale] node state changes causes excessive LB re-configurations #111539

Open
alexanderConstantinescu opened this issue Jul 29, 2022 · 13 comments · Fixed by #109706
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@alexanderConstantinescu
Member

alexanderConstantinescu commented Jul 29, 2022

What happened?

There is a scale bug related to the cloud-controller-manager (CCM): when a cluster reaches a state where the number of LBs and the number of cluster nodes are both large and roughly comparable (i.e. a many-to-many relation between LBs and nodes), the number of cloud API calls performed by the CCM to sync LBs can become overwhelming for the cloud provider. This can (among other things) lead to:

  1. an excessive number of cloud API calls for the LB re-configurations
  2. rate limiting by the cloud provider as a consequence of 1.
  3. or worse: a non-trivial amount of time to sync all services in the cluster, leading to desynchronized LBs.

An example of problem 3: imagine we have 200 LBs and 500 nodes. On clusters at that scale, the number of events processed by the CCM for transitioning node state can become significant. The current implementation of the CCM resyncs all 200 LB services for any node with a transitioning state, so as to keep the LBs' configured backends in sync with the state of the cluster. It has been observed that on clusters at that scale, the cloud provider can take up to hours to process a full resync of all services. A scenario such as the one below is hence not improbable (and actually quite likely):

  1. Node25 transitions from Ready -> NotReady
    [All LB services are synced - the time taken for this might be long]
  2. Service1's LB, which only has an endpoint on Node25, gets Node25 removed from its list of backends
  3. Node25 transitions from NotReady -> Ready again, cancelling the scheduled eviction of the endpoint hosted on Node25 for Service1
  4. The CCM is still processing the update caused by 1., and clients connecting to the LB of Service1 continue experiencing connectivity issues, even though Node25 is now in a Ready state.
    [Eventually - once the CCM enters its second reconciliation loop caused by 3.]
  5. The LB for Service1 gets Node25 added back to its list of backends and clients regain connectivity.

The sync function which is triggered when a node transitions state between Ready <-> NotReady is currently correct, but it can be skipped for services which have ExternalTrafficPolicyLocal if the service does not have an endpoint on any of the nodes transitioning between Ready <-> NotReady. Such services are not designed to forward traffic to any other node, so the update is useless to them: the node experiencing the transition is already configured for that LB, and the fact that it transitions state is a moot point for what concerns these services.
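
To illustrate that predicate, here is a minimal sketch in Go. The function and its inputs are hypothetical (this is not the actual service controller code), and the set of nodes hosting endpoints would in practice be derived from the service's EndpointSlices:

```go
// Hypothetical illustration of the skip-logic described above; not the real
// cloud-controller-manager code.
package lbsync

import v1 "k8s.io/api/core/v1"

// shouldResyncLoadBalancer decides whether a node readiness transition needs
// to trigger an LB resync for the given service. nodesHostingEndpoints maps
// node name -> true for every node that hosts at least one endpoint of svc.
func shouldResyncLoadBalancer(svc *v1.Service, transitioningNodes []*v1.Node, nodesHostingEndpoints map[string]bool) bool {
	// Services without ExternalTrafficPolicy: Local keep today's behaviour:
	// any readiness transition triggers a resync.
	if svc.Spec.ExternalTrafficPolicy != v1.ServiceExternalTrafficPolicyTypeLocal {
		return true
	}
	// For ETP=Local, a resync is only useful if one of the transitioning
	// nodes actually hosts an endpoint for this service; other nodes never
	// receive its traffic anyway.
	for _, node := range transitioningNodes {
		if nodesHostingEndpoints[node.Name] {
			return true
		}
	}
	return false
}
```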

The longer-term vision I have for this: remove the sync function for transitioning node readiness state from the CCM altogether, and instead have all services of type LoadBalancer evaluate the node's readiness state via an LB-configured health check probe against the kubelet's read-only port. This would allow transitioning node readiness state to be handled much more dynamically (similar to what has already been implemented for HealthCheckNodePort by kube-proxy) for all services of type LoadBalancer, irrespective of whether they define ExternalTrafficPolicyLocal or not... but that will require an enhancement proposal, since cloud providers have divergent implementations of how they configure health check probes for LB services.

I am putting this here for completeness' sake: this is what the major cloud providers currently do to health check services which are not ExternalTrafficPolicyLocal:

GCE: probes port 10256
AWS: if ELB: probes the first NodePort defined on the service spec.
Azure: probes all NodePorts defined on the service spec.
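
For reference, the GCE-style check is essentially an HTTP probe of the node's port 10256 (the port on which kube-proxy serves its healthz endpoint by default). Roughly, an LB health checker does something like the sketch below; the timeout and helper name are illustrative, not any provider's actual code:

```go
// Rough illustration of an LB health check probing a node on port 10256.
// Timeout and function name are illustrative.
package healthprobe

import (
	"fmt"
	"net/http"
	"time"
)

// nodeAcceptsLBTraffic returns true if the node's health endpoint on port
// 10256 answers 200, i.e. the LB may keep sending traffic to this node.
func nodeAcceptsLBTraffic(nodeIP string) bool {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://%s:10256/healthz", nodeIP))
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	// A non-200 answer (e.g. 503) tells the LB to drain this node.
	return resp.StatusCode == http.StatusOK
}
```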

All of these can be improved, but an enhancement proposal would be needed to align them all. For now: this PR will help fix some cases.

What did you expect to happen?

Node readiness changes should not re-configure load balancers associated with services with ETP=Local.

  • Nodes not hosting any endpoints associated with these services do not forward traffic to nodes that do, so re-configuring the LB because said node goes NotReady is useless.
  • Nodes hosting endpoints associated with these services are already health checked using the HC probe configured for the healthCheckNodePort and can rely on that much more dynamically without needing to re-configure the LB.

In the long term, no load balancers should be re-configured because of node state changes. That will, however, need an enhancement proposal (which I am currently drafting) to change service proxies and align cloud provider HCs.

How can we reproduce it (as minimally and precisely as possible)?

"as minimally and precisely as possible" not sure if this hits the definition, but: we need a big cluster with a lot of entropy (for example by performing an upgrade and restarting all nodes one-by-one) - this should end up re-configuring every single LB N*M (where N is the amount of nodes and M is the amount of state transitions per node).

Anything else we need to know?

Kubernetes version

All versions.

Cloud provider

All cloud providers.

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@alexanderConstantinescu alexanderConstantinescu added the kind/bug Categorizes issue or PR as related to a bug. label Jul 29, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 29, 2022
@alexanderConstantinescu
Member Author

/sig network
/sig cloud-provider

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 29, 2022
@alexanderConstantinescu
Member Author

alexanderConstantinescu commented Aug 3, 2022

I am doubtful I should open a new issue for this (seeing as it's very much correlated with this issue), but there's another problem exacerbated by re-configuring the LB when it's not really needed:

Updating a service from ExternalTrafficPolicy = Cluster to ExternalTrafficPolicy = Local can/will cause minutes-long blackouts for the service, hurting its SLAs.

Specifically, when such an update is issued on the service object: kube-proxy will update all rules on nodes not hosting any endpoint and stop traffic from being forwarded; this is "the fast path". Inefficiently, however, the CCM will also trigger a re-configuration of the LB's HC, which is the "slow path". Any client attempting to connect to the application backed by this service will not be able to until the LB is fully re-configured, since the LB won't have an HC, or will have an incomplete one that cannot fully indicate where traffic should be sent. It feels ridiculous that such a small change can lead to such large windows of observed downtime; this is particularly impactful for low-latency applications with strict SLAs.

This can't be fixed today without that enhancement proposal, but I think that in an ideal world (hopefully once the KEP is in), the HC for what concerns the LB should remain the same. The only real change on such an update should be the service proxy's response to that HC. That would severely reduce the observed "downtime". That obviously means impacting the healthCheckNodePort, since that port will need to be aligned with the default HC port for ETP!=Local services. At the same time, users can specify what that port should be, so maybe the default port should change if the service already has a healthCheckNodePort defined? To be discussed on the KEP...
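
To make the idea a bit more concrete, here is a purely hypothetical sketch (not kube-proxy code; the port and types are made up) of a per-node, per-service health endpoint whose answer depends on the traffic policy and on whether the node hosts a local endpoint. With something like this, flipping ETP from Cluster to Local would only change the proxy's response; the LB's HC configuration would stay fixed:

```go
// Hypothetical sketch of "keep the LB HC fixed, change only the proxy's
// answer"; not actual kube-proxy code.
package main

import (
	"log"
	"net/http"
)

type serviceState struct {
	externalTrafficPolicyLocal bool // current ETP of the service
	localEndpoints             int  // endpoints of the service on this node
	proxyHealthy               bool // node-level proxy health (ETP=Cluster case)
}

// healthHandler answers the LB's fixed health check for one service. An ETP
// flip only changes this answer, never the LB's HC configuration.
func healthHandler(state func() serviceState) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		s := state()
		healthy := s.proxyHealthy
		if s.externalTrafficPolicyLocal {
			// ETP=Local: only accept LB traffic if this node hosts an endpoint.
			healthy = s.localEndpoints > 0
		}
		if !healthy {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	http.Handle("/healthz", healthHandler(func() serviceState {
		// In reality this would come from the proxy's endpoint tracking.
		return serviceState{externalTrafficPolicyLocal: true, localEndpoints: 1, proxyHealthy: true}
	}))
	log.Fatal(http.ListenAndServe(":10256", nil)) // port is illustrative
}
```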

@nckturner
Contributor

/cc @cheftako @nckturner @jprzychodzen

@alexanderConstantinescu
Member Author

/reopen

This has not been fully completed by #109706, only partially.

@k8s-ci-robot
Contributor

@alexanderConstantinescu: Reopened this issue.

In response to this:

/reopen

This has not been fully completed by #109706, only partially.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nckturner
Contributor

nckturner commented Aug 17, 2022

@alexanderConstantinescu do you think we need a KEP for the rest of this change? (It sounds like there is already a KEP in progress, do you want to link it here?)

@alexanderConstantinescu
Member Author

@alexanderConstantinescu do you think we need a KEP for the rest of this change? (It sounds like there is already a KEP in progress, do you want to link it here?)

Sure, here's the KEP I posted and which hopefully will get reviewed during the 1.26 KEP cycle: kubernetes/enhancements#3460

@elmiko
Contributor

elmiko commented Aug 31, 2022

marking this as accepted since there is a KEP in flight

/triage accepted

@bridgetkromhout
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 31, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 29, 2022
@vaibhav2107
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 30, 2022
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jan 19, 2024
@k8s-ci-robot k8s-ci-robot removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jan 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 18, 2024