
[scale] node state changes causes excessive LB re-configurations #111539

Open
alexanderConstantinescu opened this issue Jul 29, 2022 · 13 comments · Fixed by #109706
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@alexanderConstantinescu
Member

alexanderConstantinescu commented Jul 29, 2022

What happened?

There is a scale bug related to the cloud-controller-manager (CCM): when a cluster reaches a state where the number of LBs and the number of cluster nodes are both large and roughly comparable (i.e. a many-to-many relation between LBs and nodes), the number of cloud API calls performed by the CCM to sync LBs can become overwhelming for the cloud provider. This can (among other things) lead to:

  1. an excessive number of cloud API calls for the LB re-configurations
  2. rate limiting by the cloud provider as a consequence of 1.
  3. or worse: a non-trivial amount of time to sync all services in the cluster, leading to desynchronized LBs.

An example of problem 3: imagine we have 200 LBs and 500 nodes. On clusters at that scale, the number of events processed by the CCM for transitioning node state can become significant. The current implementation of the CCM resyncs all 200 LB services for any node with a transitioning state, so as to keep the LBs' configured backends in sync with the state of the cluster. It has been observed that on clusters at that scale, the cloud provider can take up to hours to process a full resync of all services. A scenario such as the one below is hence not improbable (and actually quite likely):

  1. Node25 transitions from Ready -> NotReady
    [All LB services are synced - the time taken for this might be long]
  2. Service1's LB, which only has an endpoint on Node25, gets Node25 removed from its list of backends
  3. Node25 transitions from NotReady -> Ready again, cancelling the scheduled eviction of the endpoint hosted on Node25 for Service1
  4. The CCM is still processing the update caused by 1., and clients connecting to the LB of Service1 continue experiencing connectivity issues, even though Node25 is now in a Ready state.
    [Eventually - once the CCM enters its second reconciliation loop caused by 3.]
  5. The LB for Service1 gets Node25 added back to its list of backends and clients regain connectivity.

The sync function which is triggered when a node transitions state between Ready <-> NotReady is currently correct, but it can be skipped for services which have ExternalTrafficPolicyLocal if the service does not have an endpoint on any of the nodes transitioning between Ready <-> NotReady. Such services are not designed to forward traffic to any other node, so the update is useless to them: the node experiencing the transition is already configured for that LB, and the fact that it transitions state is a moot point for what concerns these services.
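
To illustrate that predicate, here is a minimal sketch in Go. The function and its inputs are hypothetical (this is not the actual service controller code), and the set of nodes hosting endpoints would in practice be derived from the service's EndpointSlices:

```go
// Hypothetical illustration of the skip-logic described above; not the real
// cloud-controller-manager code.
package lbsync

import v1 "k8s.io/api/core/v1"

// shouldResyncLoadBalancer decides whether a node readiness transition needs
// to trigger an LB resync for the given service. nodesHostingEndpoints maps
// node name -> true for every node that hosts at least one endpoint of svc.
func shouldResyncLoadBalancer(svc *v1.Service, transitioningNodes []*v1.Node, nodesHostingEndpoints map[string]bool) bool {
	// Services without ExternalTrafficPolicy: Local keep today's behaviour:
	// any readiness transition triggers a resync.
	if svc.Spec.ExternalTrafficPolicy != v1.ServiceExternalTrafficPolicyTypeLocal {
		return true
	}
	// For ETP=Local, a resync is only useful if one of the transitioning
	// nodes actually hosts an endpoint for this service; other nodes never
	// receive its traffic anyway.
	for _, node := range transitioningNodes {
		if nodesHostingEndpoints[node.Name] {
			return true
		}
	}
	return false
}
```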

The longer-term vision I have for this: remove the sync function for transitioning node readiness state from the CCM altogether, and instead have all services of type LoadBalancer evaluate the node's readiness state via an LB-configured health check probe against the kubelet's read-only port. This would allow transitioning node readiness state to be handled much more dynamically (similar to what has already been implemented for HealthCheckNodePort by kube-proxy) for all services of type LoadBalancer, irrespective of whether they define ExternalTrafficPolicyLocal or not... but that will require an enhancement proposal, since cloud providers have divergent implementations of how they configure health check probes for LB services.

I am putting this here for completeness' sake: this is what the major cloud providers currently do to health check services which are not ExternalTrafficPolicyLocal:

GCE: probes port 10256
AWS: if ELB: probes the first NodePort defined on the service spec.
Azure: probes all NodePorts defined on the service spec.
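
For reference, the GCE-style check is essentially an HTTP probe of the node's port 10256 (the port on which kube-proxy serves its healthz endpoint by default). Roughly, an LB health checker does something like the sketch below; the timeout and helper name are illustrative, not any provider's actual code:

```go
// Rough illustration of an LB health check probing a node on port 10256.
// Timeout and function name are illustrative.
package healthprobe

import (
	"fmt"
	"net/http"
	"time"
)

// nodeAcceptsLBTraffic returns true if the node's health endpoint on port
// 10256 answers 200, i.e. the LB may keep sending traffic to this node.
func nodeAcceptsLBTraffic(nodeIP string) bool {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://%s:10256/healthz", nodeIP))
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	// A non-200 answer (e.g. 503) tells the LB to drain this node.
	return resp.StatusCode == http.StatusOK
}
```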

All of these can be improved, but an enhancement proposal would be needed to align them all. For now: this PR will help fix some cases.

What did you expect to happen?

Node readiness changes should not re-configure load balancers associated with services with ETP=Local.

  • Nodes not hosting any endpoints associated with these services do not forward traffic to nodes that do, so re-configuring the LB because said node goes NotReady is useless.
  • Nodes hosting endpoints associated with these services are already health checked using the HC probe configured for the healthCheckNodePort and can rely on that much more dynamically without needing to re-configure the LB.

In the long term, no load balancers should be re-configured because of node state changes. That will, however, need an enhancement proposal (which I am currently drafting) to change service proxies and align cloud provider HCs.

How can we reproduce it (as minimally and precisely as possible)?

"as minimally and precisely as possible" not sure if this hits the definition, but: we need a big cluster with a lot of entropy (for example by performing an upgrade and restarting all nodes one-by-one) - this should end up re-configuring every single LB N*M (where N is the amount of nodes and M is the amount of state transitions per node).

Anything else we need to know?

Kubernetes version

All versions.

Cloud provider

All cloud providers.

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@alexanderConstantinescu alexanderConstantinescu added the kind/bug Categorizes issue or PR as related to a bug. label Jul 29, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 29, 2022
@alexanderConstantinescu
Member Author

/sig network
/sig cloud-provider

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 29, 2022
@alexanderConstantinescu
Member Author

alexanderConstantinescu commented Aug 3, 2022

I am doubtful I should open a new issue for this (seeing as it's very much correlated with this issue), but there's another problem exacerbated by re-configuring the LB when it's not really needed:

Updating a service from ExternalTrafficPolicy = Cluster to ExternalTrafficPolicy = Local can/will cause minutes-long blackouts for the service, hurting its SLAs.

Specifically, when such an update is issued on the service object: kube-proxy will update all rules on nodes not hosting any endpoint and stop traffic from being forwarded; this is "the fast path". Inefficiently, however, the CCM will also trigger a re-configuration of the LB's HC, which is the "slow path". Any client attempting to connect to the application backed by this service will not be able to until the LB is fully re-configured, since the LB won't have an HC, or will have an incomplete one that cannot fully indicate where traffic should be sent. It feels ridiculous that such a small change can lead to such large windows of observed downtime; this is particularly impactful for low-latency applications with strict SLAs.

This can't be fixed today without that enhancement proposal, but I think that in an ideal world (hopefully once the KEP is in), the HC for what concerns the LB should remain the same. The only real change on such an update should be the service proxy's response to that HC. That would severely reduce the observed "downtime". That obviously means impacting the healthCheckNodePort, since that port will need to be aligned with the default HC port for ETP!=Local services. At the same time, users can specify what that port should be, so maybe the default port should change if the service already has a healthCheckNodePort defined? To be discussed on the KEP...
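
To make the idea a bit more concrete, here is a purely hypothetical sketch (not kube-proxy code; the port and types are made up) of a per-node, per-service health endpoint whose answer depends on the traffic policy and on whether the node hosts a local endpoint. With something like this, flipping ETP from Cluster to Local would only change the proxy's response; the LB's HC configuration would stay fixed:

```go
// Hypothetical sketch of "keep the LB HC fixed, change only the proxy's
// answer"; not actual kube-proxy code.
package main

import (
	"log"
	"net/http"
)

type serviceState struct {
	externalTrafficPolicyLocal bool // current ETP of the service
	localEndpoints             int  // endpoints of the service on this node
	proxyHealthy               bool // node-level proxy health (ETP=Cluster case)
}

// healthHandler answers the LB's fixed health check for one service. An ETP
// flip only changes this answer, never the LB's HC configuration.
func healthHandler(state func() serviceState) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		s := state()
		healthy := s.proxyHealthy
		if s.externalTrafficPolicyLocal {
			// ETP=Local: only accept LB traffic if this node hosts an endpoint.
			healthy = s.localEndpoints > 0
		}
		if !healthy {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	http.Handle("/healthz", healthHandler(func() serviceState {
		// In reality this would come from the proxy's endpoint tracking.
		return serviceState{externalTrafficPolicyLocal: true, localEndpoints: 1, proxyHealthy: true}
	}))
	log.Fatal(http.ListenAndServe(":10256", nil)) // port is illustrative
}
```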

@nckturner
Contributor

/cc @cheftako @nckturner @jprzychodzen

@alexanderConstantinescu
Member Author

/reopen

This has not been fully completed by #109706, only partially.

@k8s-ci-robot
Contributor

@alexanderConstantinescu: Reopened this issue.

In response to this:

/reopen

This has not been fully completed by #109706, only partially.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nckturner
Contributor

nckturner commented Aug 17, 2022

@alexanderConstantinescu do you think we need a KEP for the rest of this change? (It sounds like there is already a KEP in progress, do you want to link it here?)

@alexanderConstantinescu
Member Author

@alexanderConstantinescu do you think we need a KEP for the rest of this change? (It sounds like there is already a KEP in progress, do you want to link it here?)

Sure, here's the KEP I posted and which hopefully will get reviewed during the 1.26 KEP cycle: kubernetes/enhancements#3460

@elmiko
Contributor

elmiko commented Aug 31, 2022

marking this as accepted since there is a KEP in flight

/triage accepted

@bridgetkromhout
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 31, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 29, 2022
@vaibhav2107
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 30, 2022
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jan 19, 2024
@k8s-ci-robot k8s-ci-robot removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jan 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 18, 2024