
Avoid re-syncing LBs for ETP=local services #109706

Merged

Conversation

alexanderConstantinescu
Member

@alexanderConstantinescu alexanderConstantinescu commented Apr 28, 2022

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR reduces the number of LB re-configurations for ETP=local services by completely removing the sync for node readiness changes from the CCM's sync loop. The referenced issue explains the problem in greater detail, so readers are directed there for the genesis of the problem. A lot has been discussed on this PR and a plan forward has been drafted. The following summarizes the discussion:

  • Remove syncs w.r.t. node state changes in the CCM in 1.25 (a follow-up PR for the remaining node conditions is expected once this merges)
  • Maybe/possibly/hopefully backport that to the oldest supported Kube version that also supports ETP=local services
  • Have the reported status of the LB HC configured for ETP=local services reflect the node state in 1.25 (also in the form of a follow-up PR)
  • Write a KEP for dealing with transitioning node state for all service LBs.
  • Kube versions after the KEP: remove the sync logic for node state changes for all services and move it to the HC

Which issue(s) this PR fixes:

Partially fixes #111539 for ETP=local services, specifically: reduces the number of cloud API calls and the service downtime caused by excessive re-configurations of all cluster LBs as an effect of transitioning node readiness state.

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Reduce the number of cloud API calls and service downtime caused by excessive re-configurations of cluster LBs with externalTrafficPolicy=Local when node readiness changes (https://github.com/kubernetes/kubernetes/issues/111539). The service controller (in cloud-controller-manager) will avoid re-syncing nodes which are transitioning between `Ready` / `NotReady` (only for ETP=Local services). The LBs used for these services will rely solely on the health check probe defined by the `healthCheckNodePort` to determine whether a particular node should receive load-balanced traffic.
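To illustrate the intent, here is a minimal Go sketch (not the actual controller code; the function names are hypothetical) of the node-update filter this change implies: a flip of the NodeReady condition alone no longer triggers an LB re-sync for ETP=Local services, because the `healthCheckNodePort` probe already conveys node health.

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// nodeReadyStatus returns the status of the NodeReady condition, or
// ConditionUnknown when the condition is missing.
func nodeReadyStatus(node *v1.Node) v1.ConditionStatus {
	for _, c := range node.Status.Conditions {
		if c.Type == v1.NodeReady {
			return c.Status
		}
	}
	return v1.ConditionUnknown
}

// shouldSyncOnNodeUpdate is a hypothetical predicate: for a Service with
// externalTrafficPolicy=Local, an update that only flips the NodeReady
// condition does not trigger an LB re-sync, since the healthCheckNodePort
// probe already reflects node health.
func shouldSyncOnNodeUpdate(svc *v1.Service, oldNode, newNode *v1.Node) bool {
	readinessChanged := nodeReadyStatus(oldNode) != nodeReadyStatus(newNode)
	if svc.Spec.ExternalTrafficPolicy == v1.ServiceExternalTrafficPolicyTypeLocal && readinessChanged {
		// Readiness flapping is handled by the LB health check, not by
		// re-configuring the LB's node set.
		return false
	}
	// Other node changes (exclusion label, deletion, ...) still sync.
	return true
}
```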

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/sig network
/sig cloud-provider

@k8s-ci-robot
Contributor

k8s-ci-robot commented Apr 28, 2022

Please note that we're already in Test Freeze for the release-1.24 branch. This means every merged PR has to be cherry-picked into the release branch to be part of the upcoming v1.24.0 release.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 28, 2022
@k8s-ci-robot
Contributor

k8s-ci-robot commented Apr 28, 2022

Hi @alexanderConstantinescu. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/cloudprovider labels Apr 28, 2022
@k8s-ci-robot k8s-ci-robot added area/test sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Apr 28, 2022
@leilajal
Contributor

leilajal commented Apr 28, 2022

/remove-sig api-machinery
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 28, 2022
@bowei
Member

bowei commented Apr 28, 2022

cc: @bowei

@bowei
Member

bowei commented Apr 28, 2022

/assign @bowei -- I'm very interested in poking more at the lifecycle of the LB node

@thockin thockin self-assigned this Apr 28, 2022
@alexanderConstantinescu
Member Author

alexanderConstantinescu commented Apr 28, 2022

Some additional notes below from the sig-network discussion that just ended. You will also be able to find the slides I presented here: https://docs.google.com/presentation/d/1EsK5FLS4s6E3Ok42XhRz643YtB9uZHn3wohWvcMyHag/edit#slide=id.p

My third slide presents some ideas around the possibility of removing everything the CCM does w.r.t. syncing nodes transitioning between NotReady <-> Ready.

  1. Is this PR - which eliminates the mentioned sync for ETP=local services from the CCM's node sync loop.
  2. Is removing that sync completely for all services, irrespective of whether they are ETP=local or not.
    a. Is removing that sync completely only for a selected set of services, indicated by the user through an annotation on the service object.
  3. Is removing that sync completely for ETP=local services from the CCM, but replacing it by having kube-proxy dial the kubelet's read-only port when answering the probe executed by the LB against the HealthCheckNodePort. I.e. kube-proxy would keep the current behavior (validating that the node in question actually hosts endpoints for the service) and AND that with the kubelet's reported state. That means the LB would continue executing one probe against the HealthCheckNodePort, but kube-proxy would open a local connection to the kubelet's read-only port on the node for every LB probe. There was some confusion surrounding this during the meeting, hence why I've chosen to detail it here; see also the sketch right after this list. This would result in the same behavior as today, but be much more dynamic and efficient at handling transitioning node state: instead of re-configuring the LB with a new set of nodes, the LB simply probes the node on the HealthCheckNodePort as it currently does, with the result also reflecting the kubelet/node readiness state.
  4. Is removing that sync completely for all services from the CCM. Instead we define a standardised behavior for health checking nodes which are listed as backends for LoadBalancer-type services, which all cloud provider implementations must adhere to. All cloud providers today implement this interface: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/cloud-provider/cloud.go#L133-L162 ; however, when the service is ETP != local they are free to specify custom probes on their LBs. What they do today is described at the bottom of the description of this PR. If we align the behavior across all cloud implementations, we could have them include a probe against the kubelet's read-only port - or do as in the ETP=local case and have the LB query kube-proxy, which performs multiple checks. In the end, LBs only really care to know whether the node can serve traffic and whether the service NodePort on the node is reachable. All three clouds have chosen one of these two checks, not both. GCE's LB check is actually superfluous, since the CCM also checks the same condition (i.e. the kubelet state/node readiness).
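To make option 3 more concrete, here is a minimal Go sketch of what such a kube-proxy health check handler could look like. It is not kube-proxy's actual implementation; the kubelet read-only port (10255), the /healthz path and the response shape are assumptions for illustration only.

```go
package healthcheck

import (
	"fmt"
	"net/http"
	"time"
)

// localEndpointCount stands in for the state kube-proxy maintains from
// EndpointSlice updates: how many endpoints of the service are local to
// this node.
var localEndpointCount = func() int { return 1 }

// kubeletReady probes the kubelet on the local node; the read-only port
// (10255) and the /healthz path are assumptions for illustration.
func kubeletReady() bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://127.0.0.1:10255/healthz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// serveHealthCheck answers the cloud LB's probe against the service's
// healthCheckNodePort: today's check ("does this node host local endpoints
// for the service?") AND-ed with the kubelet's reported readiness.
func serveHealthCheck(w http.ResponseWriter, r *http.Request) {
	n := localEndpointCount()
	if n > 0 && kubeletReady() {
		w.WriteHeader(http.StatusOK)
		fmt.Fprintf(w, `{"localEndpoints": %d}`, n)
		return
	}
	w.WriteHeader(http.StatusServiceUnavailable)
}
```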

Options 3 and 4 will require an enhancement proposal and are beyond the scope of this PR, but we can discuss them here (or I can open a KEP for them and we can discuss them there...?)

This PR only concerns itself with finding a solution along the lines of 1 and/or 2, simply because this is currently a bug for clusters at scale and, I believe, we should backport the fix to older releases.

I am actually in favour of doing 2.a, meaning: modifying this PR slightly, removing any relation to ETP=local services and the check on where endpoints are hosted. I would instead prefer we define an annotation (ex: service.beta.kubernetes.io/lb-ignore-node-not-ready) which a user can set on the service object to tell the CCM to ignore this service for node readiness changes (a minimal sketch of this is included below). The implementation effort would be minimal and it would allow the user to solve both problems presented in my first slides. I, as a user of Kubernetes, should be allowed to disable the CCM's sync for node readiness state if I don't want my LBs to be re-configured because of it. I own my nodes and LBs and should thus be allowed to decide what's best.

If we can't agree on 2.a, then 1. (the current state of this PR) will at least help reduce the time taken to sync all LBs by the CCM on large clusters with a lot of ETP=local services, which is already an improvement over today.
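A minimal sketch of what 2.a could look like in the service controller; the annotation is the one proposed above (it does not exist in Kubernetes today) and the helper name is hypothetical.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// lbIgnoreNodeNotReadyAnnotation is the annotation proposed above; it does
// not exist in Kubernetes today.
const lbIgnoreNodeNotReadyAnnotation = "service.beta.kubernetes.io/lb-ignore-node-not-ready"

// syncsOnNodeReadiness is a hypothetical helper the service controller could
// use: services that opt out via the annotation are skipped when the only
// node change is a readiness transition.
func syncsOnNodeReadiness(svc *v1.Service) bool {
	return svc.Annotations[lbIgnoreNodeNotReadyAnnotation] != "true"
}
```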

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 4, 2022
@alexanderConstantinescu
Member Author

alexanderConstantinescu commented May 4, 2022

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 4, 2022
@danwinship
Contributor

danwinship commented May 31, 2022

/ok-to-test

retSuccess := false
retNeedRetry := true
Member

@BenTheElder BenTheElder Aug 2, 2022


it looks like these are intended to be named constants?

Suggested change
-retSuccess := false
-retNeedRetry := true
+const retSuccess = false
+const retNeedRetry = true

Member

@BenTheElder BenTheElder Aug 2, 2022


Though I would say the more usual approach would be to let the method documentation make it clear what true and false mean.

Member Author

@alexanderConstantinescu alexanderConstantinescu Aug 3, 2022


Would you say the current state of that doc block does a good job of making it clear what true and false mean? In that case, I could just remove this in a follow-up PR.

Member

@thockin thockin Aug 3, 2022


It's a weird signature (false usually means fail) so I like having the const values.
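For reference, a small hypothetical example combining the named constants with a doc comment on the returning function; neither this function nor its signature is the actual controller code.

```go
package sketch

const (
	retSuccess   = false // sync succeeded, do not re-queue
	retNeedRetry = true  // sync failed, caller should re-queue the key
)

// syncService reconciles one service against the cloud load balancer. It
// returns retNeedRetry (true) when the key should be re-queued and
// retSuccess (false) otherwise; with either the constants or this doc
// comment, the bare bool stays readable at the call site.
func syncService(key string) bool {
	if err := reconcile(key); err != nil {
		return retNeedRetry
	}
	return retSuccess
}

// reconcile is a placeholder for the actual cloud-provider call.
func reconcile(key string) error { return nil }
```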

@@ -559,6 +575,193 @@ func TestUpdateNodesInExternalLoadBalancer(t *testing.T) {
}
}

func TestNodeChangesForExternalTrafficPolicyLocalServices(t *testing.T) {
node1 := &v1.Node{ObjectMeta: metav1.ObjectMeta{Name: "node0"}, Status: v1.NodeStatus{Conditions: []v1.NodeCondition{{Type: v1.NodeReady, Status: v1.ConditionTrue}}}}
Member

@thockin thockin Aug 2, 2022


minor but...the var says node1, the name is node0, which is super confusing with the next one - the name is node1 but the var is node2. Can we just use the same number in both places? I don't care if it is not 0-based :)

Member Author

@alexanderConstantinescu alexanderConstantinescu Aug 3, 2022


OK, I will make sure to fix this in the PR coming for the 3rd commit. This is recurrent across this test file, so maybe I'll add a dedicated PR for cleaning that up.

Member

@thockin thockin Aug 3, 2022


SGTM

@thockin
Member

thockin commented Aug 2, 2022

Followup needed:

  • PR for 3rd commit
  • KEP to change HC behavior and reduce all syncs - once in the set, only the exclusion label matters

Anything else?

I'm really counting on your follow-through on these - I'd hate it to get stuck half way.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 2, 2022
@k8s-ci-robot
Contributor

k8s-ci-robot commented Aug 2, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexanderConstantinescu, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 2, 2022
@k8s-triage-robot

k8s-triage-robot commented Aug 3, 2022

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does not have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@alexanderConstantinescu
Member Author

alexanderConstantinescu commented Aug 3, 2022

Followup needed:

  • PR for 3rd commit
  • KEP to change HC behavior and reduce all syncs - once in the set, only the exclusion label matters

Anything else?

Yes, maybe I drew the wrong conclusions from our discussions on this PR, but I thought we also wanted to start moving the readiness logic to kube-proxy's HC for ETP=local, so as to get a head start on the version skew?

Specifically you mentioned here: #109706 (comment):

At the same time, we could change the endpoints to be more robust, and get that rolling out. Since that is just moving equivalent logic from the service controller to the HC, I don't think we need a KEP. The service logic remains in place, except ETP=Local.

1.25 could include the improved HCs and ETP=Local change. During 1.26 we could KEP and discuss the equivalent for everyone else.

For what concerns:

I'm really counting on your follow-through on these - I'd hate it to get stuck half way.

Will do, I am ready to submit PRs for the 3rd commit and kube-proxy's ETP=local HC once this gets in. I will have the KEP up in a week.

@thockin
Member

thockin commented Aug 3, 2022

With the code freeze in effect, I am not sure we can get the HC changes done. I wanted to break it out so that it didn't drag this main PR down with it :)

This PR was approved before freeze and is not part of a KEP (limited impact) - do we need an exception for it @cici37 ?

@cici37
Contributor

cici37 commented Aug 3, 2022

Thanks for checking. If the PR is a bug fix which does not contain a major feature change and has been approved before code freeze, an exception is not needed and SIG leads can merge it by tagging the milestone label. :)


@thockin thockin added this to the v1.25 milestone Aug 3, 2022
@k8s-ci-robot k8s-ci-robot merged commit aee13fc into kubernetes:master Aug 3, 2022
14 checks passed
@thockin
Member

thockin commented Aug 3, 2022

@kubernetes/sig-scalability
@wojtek-t
@mborsz

FYI re scale - this should be an improvement for real clusters (minor because etp=Local, but eventually for all services)

@alexanderConstantinescu
Member Author

alexanderConstantinescu commented Aug 4, 2022

@thockin : I am concerned the first commit might be wrong, specifically around removing nodes from the LB set based on the schedulability predicate.

It seems it was removed with #90823, which listed a production outage caused by the fact that we were syncing LBs when the schedulability constraint changed on a set of nodes.

This caused a production outage when nodes in an autoscaling group were marked unschedulable so the autoscaler would gracefully terminate them - instead the ingress controllers on those nodes were immediately removed from the LB pool and no new ingress requests could be accepted.

That makes me wonder why we even decided to sync updates to nodes which are experiencing a change to this attribute at all. That was added with PR: #90769 - which doesn't list any good reasons for adding it...

So:

Now it seems we have done the inverse. Sorry for realizing this only now...
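To make the concern concrete, a tiny hypothetical Go sketch of the kind of predicate being questioned; this only illustrates the discussion and is not the controller's actual code.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// includeNodeInLB illustrates the behavior questioned above: if LB
// membership takes spec.unschedulable into account, cordoning a node (for
// example ahead of an autoscaler scale-down) immediately drops it from every
// LB's node set, which is the outage #90823 reported.
func includeNodeInLB(node *v1.Node) bool {
	return !node.Spec.Unschedulable // the contested part of the predicate
}
```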

alexanderConstantinescu added a commit to alexanderConstantinescu/kubernetes that referenced this pull request Aug 5, 2022
alexanderConstantinescu added a commit to alexanderConstantinescu/kubernetes that referenced this pull request Aug 11, 2022
Specifically:

- kubernetes#109706 (comment)
- kubernetes#109706 (comment)

Also, now that the taint predicate is a bit better understood, improve the doc block.
alexanderConstantinescu added a commit to alexanderConstantinescu/kubernetes that referenced this pull request Sep 2, 2022
alexanderConstantinescu added a commit to alexanderConstantinescu/kubernetes that referenced this pull request Sep 2, 2022
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cloudprovider area/ipvs area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Development

Successfully merging this pull request may close these issues.

[scale] node state changes causes excessive LB re-configurations
9 participants