Skip non-update endpoint updates #50934

joelsmith · 2017-08-18T22:50:30Z

What this PR does / why we need it:

On large clusters, a large percentage of endpoint updates are actually non-updates that occur as a result of a change in an associated pod. This results in endpoint updates where the only field that has changed is the TargetRef.ResourceVersion in the endpoint address associated with the changed pod. Given enough of these non-updates, the endpoint controller's queue rate limit can be overwhelmed and legitimate updates can be delayed, resulting in (temporarily) broken services. We have clusters where we've seen endpoint updates take 9 minutes.
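The non-update case described above can be illustrated with a minimal Go sketch. The types are simplified stand-ins rather than the real Kubernetes API structs, and `sameIgnoringResourceVersion` is a hypothetical helper name. The only point is that two addresses differing solely in `TargetRef.ResourceVersion` should compare as equal.

```go
package main

import "reflect"

// ObjectReference and EndpointAddress are simplified stand-ins for the
// Kubernetes API types; only the fields needed for the illustration exist.
type ObjectReference struct {
	Kind            string
	Namespace       string
	Name            string
	ResourceVersion string
}

type EndpointAddress struct {
	IP        string
	TargetRef *ObjectReference
}

// sameIgnoringResourceVersion (hypothetical name) reports whether two
// addresses are equal once TargetRef.ResourceVersion is disregarded.
// It works on copies, so the callers' values are left untouched.
func sameIgnoringResourceVersion(a, b EndpointAddress) bool {
	a.TargetRef = withoutResourceVersion(a.TargetRef)
	b.TargetRef = withoutResourceVersion(b.TargetRef)
	return reflect.DeepEqual(a, b)
}

// withoutResourceVersion returns a copy of ref with ResourceVersion cleared.
func withoutResourceVersion(ref *ObjectReference) *ObjectReference {
	if ref == nil {
		return nil
	}
	c := *ref
	c.ResourceVersion = ""
	return &c
}
```

With this helper, a pod update that only bumps the resource version compares as equal, while a changed IP or a different target name still counts as a real update.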

Which issue this PR fixes: fixes #50936

Special notes for your reviewer:
N/A

Release note:

Prevent unneeded endpoint updates

k8s-ci-robot · 2017-08-18T22:50:38Z

Hi @joelsmith. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

joelsmith · 2017-08-18T23:11:24Z

@sjenning @DirectXMan12 PTAL. Also, could somebody please add the ok to test label?

sjenning · 2017-08-19T02:54:26Z

/ok-to-test

sjenning · 2017-08-19T03:53:47Z

Darn it Joel! You made me pull out pen and paper for a Venn diagram 😛

This looks functionally correct to me, but it is hard to follow in places and does some things unnecessarily. For example, if we know that pod and labels didn't change, we can immediately return as we know there is no update required, avoiding one or both getPodServiceMemberships() calls which are pretty expensive.

The changes were harder to describe in words than in code, so I just adapted your code (untested):
sjenning@7bcbe38

I think that does the minimal amount of work required and might be easier to understand. What do you think?

also cc @derekwaynecarr @eparis @smarterclayton

Looks like all the tests are passing except for one flake so that's nice.

Great work running this one down! 👍
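The short-circuit suggested in this comment might look roughly like the following. Everything in it is hypothetical: `serviceLookup` stands in for the expensive `getPodServiceMemberships()` call, and the function name and signature are invented for the illustration rather than taken from the linked commit.

```go
package main

// serviceLookup stands in for the expensive membership query
// (getPodServiceMemberships in the controller); the signature is assumed.
type serviceLookup func() []string

// membershipsForUpdate (hypothetical) performs only the lookups that a pod
// update actually requires:
//   - nothing relevant changed: no lookups, no sync needed
//   - only the pod changed:     one lookup (current memberships)
//   - labels changed:           two lookups (old and current memberships)
func membershipsForUpdate(podChanged, labelsChanged bool, oldLookup, newLookup serviceLookup) (oldSvcs, newSvcs []string, needSync bool) {
	if !podChanged && !labelsChanged {
		// Nothing that affects endpoints changed, so return immediately.
		return nil, nil, false
	}
	newSvcs = newLookup()
	if labelsChanged {
		// Memberships may differ only when the labels differ.
		oldSvcs = oldLookup()
	}
	return oldSvcs, newSvcs, true
}
```

The caller can then decide which of the returned services to enqueue. The early return is what keeps no-op pod updates from consuming the controller's rate-limited queue.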

joelsmith · 2017-08-19T04:09:44Z

@sjenning I like your updated version. I'll test it out then pull it in, squash and push to this PR's branch. Hopefully the flake will work the second time. Thanks for your help!

thockin

This looks great overall, but it really needs a test. At the very least, we can break out the comparison logic to a function and test against various inputs, and we can break out the service-change logic into a function, and test against various scenarios to prove that your set logic stays correct.

thockin · 2017-08-19T23:50:04Z

pkg/controller/endpoint/endpoints_controller.go

+	newEndpointAddress.NodeName = nil
+	oldEndpointAddress.NodeName = nil
+	if !reflect.DeepEqual(newEndpointAddress, oldEndpointAddress) {
+		// The pod has not changed in any way that impacts the endpoints


s/has not/has ??

Good catch, we let an outdated comment slip through

joelsmith · 2017-08-20T01:16:37Z

@thockin Thanks for your review. I have added the unit tests you recommended. I was planning on adding some tests (the reason for the "WIP" in the title) but I was struggling to come up with anything good. I appreciate your suggestions and I hope the newly-added tests are adequate.

thockin

thanks. Almost there. Any time there is something non-obvious, err on the side of more comments :)

thockin · 2017-08-20T04:28:42Z

pkg/controller/endpoint/endpoints_controller.go

+	oldEndpointAddress := podToEndpointAddress(oldPod)
+	newEndpointAddress.TargetRef.ResourceVersion = ""
+	oldEndpointAddress.TargetRef.ResourceVersion = ""
+	newEndpointAddress.NodeName = nil


Comment why we don't care about NodeName but we do care about TargetRef?

thockin · 2017-08-20T04:29:22Z

pkg/controller/endpoint/endpoints_controller_test.go

+		t.Errorf("Expected address to be unchanged for copied pod")
+	}
+
+	newPod.ObjectMeta.ResourceVersion = "changed"


repeat this for NodeName, if we really don't care about that (not convinced we don't).

thockin · 2017-08-20T04:32:13Z

pkg/controller/endpoint/endpoints_controller_test.go

+	bcd := sets.NewString("b", "c", "d")
+	abcd := sets.NewString("a", "b", "c", "d")
+	ad := sets.NewString("a", "d")
+	retval := determineNeededServiceUpdates(abc, bcd, false)


Can you do this as a table-driven test and exercise more cases? One or the other being empty. Both empty. Totally disjoint sets, identical sets, etc.

Let me know if you need examples of such tests.
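A table-driven version of this test might be shaped as below. The sketch substitutes a plain `map[string]bool` for `sets.String` to stay dependency-free, and its `neededUpdates` is an assumed reading of `determineNeededServiceUpdates`: the union of both sets when the pod changed, and otherwise only the services present in exactly one of the sets. The `abcd` and `ad` fixtures in the snippet above are consistent with that reading, but the function body itself is not shown in this thread.

```go
package main

import "fmt"

// set is a minimal string set used in place of sets.String.
type set map[string]bool

func newSet(items ...string) set {
	s := set{}
	for _, it := range items {
		s[it] = true
	}
	return s
}

// equal reports whether both sets contain exactly the same members.
func (s set) equal(o set) bool {
	if len(s) != len(o) {
		return false
	}
	for k := range s {
		if !o[k] {
			return false
		}
	}
	return true
}

// neededUpdates is an assumed stand-in for determineNeededServiceUpdates:
// union if the pod changed, symmetric difference if only labels changed.
func neededUpdates(oldSvcs, newSvcs set, podChanged bool) set {
	out := set{}
	for k := range oldSvcs {
		if podChanged || !newSvcs[k] {
			out[k] = true
		}
	}
	for k := range newSvcs {
		if podChanged || !oldSvcs[k] {
			out[k] = true
		}
	}
	return out
}

// runTable evaluates every case and returns one message per failure.
func runTable() []string {
	cases := []struct {
		name       string
		old, cur   set
		podChanged bool
		want       set
	}{
		{"overlap, pod unchanged", newSet("a", "b", "c"), newSet("b", "c", "d"), false, newSet("a", "d")},
		{"overlap, pod changed", newSet("a", "b", "c"), newSet("b", "c", "d"), true, newSet("a", "b", "c", "d")},
		{"old empty", newSet(), newSet("a", "b", "c"), false, newSet("a", "b", "c")},
		{"new empty", newSet("a", "b", "c"), newSet(), false, newSet("a", "b", "c")},
		{"both empty", newSet(), newSet(), true, newSet()},
		{"disjoint", newSet("a", "b"), newSet("c", "d"), false, newSet("a", "b", "c", "d")},
		{"identical, pod unchanged", newSet("a", "b", "c"), newSet("a", "b", "c"), false, newSet()},
		{"identical, pod changed", newSet("a", "b", "c"), newSet("a", "b", "c"), true, newSet("a", "b", "c")},
	}
	var failures []string
	for _, tc := range cases {
		if got := neededUpdates(tc.old, tc.cur, tc.podChanged); !got.equal(tc.want) {
			failures = append(failures, fmt.Sprintf("%s: got %v, want %v", tc.name, got, tc.want))
		}
	}
	return failures
}
```

In a real `_test.go` file the loop body would typically call `t.Errorf` (or run each case under `t.Run`) instead of collecting strings. The table shape is what lets new scenarios be added as single lines.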

joelsmith · 2017-08-20T06:09:58Z

@sjenning Were you ignoring the NodeName because of what I said about its value being different? If so, I think I led us down a bad path. My earlier iteration of an equality function to use in place of DeepEqual didn't handle NodeName, but DeepEqual does, and in my testing, it doesn't cause unnecessary endpoint updates. It looks like that was all due to the TargetRef.ResourceVersion. Sorry for the confusion. I've made the latest version stop ignoring NodeName and everything I've tested is working. Please let me know if there is something we're missing w.r.t. NodeName, but I suspect that you just added it to the ignore list due to my comments about my earlier mistaken implementation.

@thockin I've switched to the table of test cases and added a few more test cases. Please let me know if you think of any others that I missed.
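On the `NodeName` point in this comment, one way a hand-written equality check can misreport a pointer field is by comparing it by identity. The sketch below assumes `NodeName` is a `*string` (the earlier diff sets it to `nil`) and uses a simplified stand-in struct. The thread does not say what the earlier implementation actually did, so this only illustrates why `reflect.DeepEqual` handles such a field correctly.

```go
package main

import "reflect"

// address is a simplified stand-in for an endpoint address with an
// optional node name, modeled as a pointer so it can be nil.
type address struct {
	IP       string
	NodeName *string
}

// shallowEqual compares the structs with ==, which compares the NodeName
// pointers by identity rather than by the strings they point to.
func shallowEqual(a, b address) bool {
	return a == b
}

// deepEqual follows pointers and compares the pointed-to strings.
func deepEqual(a, b address) bool {
	return reflect.DeepEqual(a, b)
}

// strPtr returns a pointer to a fresh copy of s.
func strPtr(s string) *string {
	return &s
}
```

Two addresses built from separate pod objects carry distinct `NodeName` pointers even when the node is the same, so `shallowEqual` would flag a spurious change while `deepEqual` would not. A genuinely different node name is still detected by `deepEqual`.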

thockin · 2017-08-20T06:18:25Z

This LGTM. Can you please squash commits?

joelsmith · 2017-08-20T06:21:57Z

@thockin squashed, and thanks again for the review!

thockin · 2017-08-20T06:30:10Z

/lgtm
/approve

A pod status change of unready -> ready moves the pod's address from the endpoint's unready addresses to its ready addresses, so if a pod update contains an unready -> ready status change, the endpoint needs to be updated.

Fix unready endpoints bug introduced in #50934

@joelsmith
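The follow-up fix referenced above amounts to treating a readiness flip as a real change even when the endpoint address itself is identical. A hedged sketch, using an invented `podView` struct in place of the real pod object and a hypothetical `podChanged` helper:

```go
package main

// podView (hypothetical) holds the only pod fields this sketch considers.
type podView struct {
	IP    string
	Ready bool
}

// podChanged (hypothetical) reports whether a pod update affects endpoints.
// A readiness transition counts as a change because the address must move
// between the endpoint's unready and ready address lists.
func podChanged(oldPod, newPod podView) bool {
	if oldPod.Ready != newPod.Ready {
		return true
	}
	// Otherwise fall back to comparing the address-relevant fields.
	return oldPod.IP != newPod.IP
}
```

Checking readiness first keeps the address comparison from silently skipping an update in which nothing but the pod's ready status changed.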

Automatic merge from submit-queue (batch tested with PRs 15870, 15888, 15788, 15907, 15936)
UPSTREAM: 50934: Skip non-update endpoint updates
Node performance impact fix for endpoints controller. Skips no-op service updates on pods that have not changed in a way that impacts endpoints.
xref kubernetes/kubernetes#50934 and kubernetes/kubernetes#51144
@joelsmith @derekwaynecarr @eparis @smarterclayton

…0934 :100644 100644 0f17c4a510... 0efd748a6e... M pkg/controller/endpoint/endpoints_controller.go :100644 100644 b4c51a2f2a... 7af7c41c28... M pkg/controller/endpoint/endpoints_controller_test.go

@joelsmith

Automatic merge from submit-queue
UPSTREAM: 50934: Skip non-update endpoint updates
xref kubernetes/kubernetes#50934 and kubernetes/kubernetes#51144
master PR: #15888
xref https://bugzilla.redhat.com/show_bug.cgi?id=1481603
@joelsmith @derekwaynecarr @eparis @smarterclayton

…0934 :100644 100644 0f17c4a510... 0efd748a6e... M pkg/controller/endpoint/endpoints_controller.go :100644 100644 b4c51a2f2a... 7af7c41c28... M pkg/controller/endpoint/endpoints_controller_test.go

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 18, 2017

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 18, 2017

k8s-github-robot assigned thockin and bowei Aug 18, 2017

k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-label-needed labels Aug 18, 2017

joelsmith changed the title ~~WIP skip endpoint updates on non-updates~~ WIP skip non-update endpoint updates Aug 18, 2017

joelsmith mentioned this pull request Aug 18, 2017

No-op endpoint updates clog up endpoint controller's queue #50936

Closed

k8s-github-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-label-needed labels Aug 18, 2017

k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 19, 2017

joelsmith force-pushed the skip-endpoints-update branch from d84f9f1 to a8dc0d4 Compare August 19, 2017 04:33

thockin reviewed Aug 19, 2017


joelsmith changed the title ~~WIP skip non-update endpoint updates~~ Skip non-update endpoint updates Aug 20, 2017

k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 20, 2017

thockin reviewed Aug 20, 2017


joelsmith force-pushed the skip-endpoints-update branch from ce438a7 to 80458a1 Compare August 20, 2017 06:15

joelsmith force-pushed the skip-endpoints-update branch from 80458a1 to 87d9551 Compare August 20, 2017 06:20

joelsmith mentioned this pull request Aug 22, 2017

[e2e test failure][sig-network] Proxy version v1 should proxy through a service and a pod [Conformance] #51128

Closed

joelsmith mentioned this pull request Aug 22, 2017

Fix unready endpoints bug introduced in #50934 #51144

Merged

eparis added a commit that referenced this pull request Aug 22, 2017

Merge pull request #51144 from joelsmith/skip-endpoints-update

2b08d1e

Fix unready endpoints bug introduced in #50934

freehan mentioned this pull request Oct 31, 2017

Pod in graceful termination should not be on the ready address list of related Endpoints objects #54723

Closed

aojea mentioned this pull request Feb 11, 2022

Skip updating Endpoints if no relevant fields change #108078

Merged

tnqn mentioned this pull request Mar 2, 2022

Stop publishing Pod ResourceVersion in Endpoints and EndpointSlice API #108450

Merged
