-
Notifications
You must be signed in to change notification settings - Fork 39.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Skip non-update endpoint updates #50934
Skip non-update endpoint updates #50934
Conversation
Hi @joelsmith. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
@sjenning @DirectXMan12 PTAL. Also, could somebody please add the ok to test label? |
/ok-to-test |
Darn it Joel! You made me pull out pen and paper for a Venn diagram 😛 This looks functionally correct to me, but it is hard to follow in places and does some things unnecessarily. For example, if we know that pod and labels didn't change, we can immediately return as we know there is no update required, avoiding one or both The changes were harder to describe in words than in code, so just adpated your code (untested): I think that does the minimal amount of work required and might be easier to understand. What do you think? also cc @derekwaynecarr @eparis @smarterclayton Looks like all the tests are passing except for one flake so that's nice. Great work running this one down! 👍 |
@sjenning I like your updated version. I'll test it out then pull it in, squash and push to this PR's branch. Hopefully the flake will work the second time. Thanks for your help! |
d84f9f1
to
a8dc0d4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great overall, but it really needs a test. At the very least, we can break out the comparison logic to a function and test against various inputs, and we can break out the service-change logic into a function, and test against various scenarios to prove that your set logic stays correct.
newEndpointAddress.NodeName = nil | ||
oldEndpointAddress.NodeName = nil | ||
if !reflect.DeepEqual(newEndpointAddress, oldEndpointAddress) { | ||
// The pod has not changed in any way that impacts the endpoints |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/has not/has ??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, we let an outdated comment slip through
@thockin Thanks for your review. I have added the unit tests you recommended. I was planning on adding some tests (the reason for the "WIP" in the title) but I was struggling to come up with anything good. I appreciate your suggestions and I hope the newly-added tests are adequate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks. Almost there. Any time there is something non-obvious, err on the side of more comments :)
oldEndpointAddress := podToEndpointAddress(oldPod) | ||
newEndpointAddress.TargetRef.ResourceVersion = "" | ||
oldEndpointAddress.TargetRef.ResourceVersion = "" | ||
newEndpointAddress.NodeName = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment why we don't care about NodeName but we do care about TargetRef?
t.Errorf("Expected address to be unchanged for copied pod") | ||
} | ||
|
||
newPod.ObjectMeta.ResourceVersion = "changed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
repeat this for NodeName, if we really don't care about that (not convinced we don't).
bcd := sets.NewString("b", "c", "d") | ||
abcd := sets.NewString("a", "b", "c", "d") | ||
ad := sets.NewString("a", "d") | ||
retval := determineNeededServiceUpdates(abc, bcd, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do this as a table-driven test and exercise more cases. One or the other being empty. Both empty. Totally disjoint sets, identical sets, etc.
Let me know if you need examples of such tests.
@sjenning Were you ignoring the @thockin I've switched to the table of test cases and added a few more test cases. Please let me know if you think of any others that I missed. |
ce438a7
to
80458a1
Compare
This LGTM. Can you please squash commits? |
80458a1
to
87d9551
Compare
@thockin squashed, and thanks again for the review! |
/lgtm |
A pod status change of unready -> ready results in a move from the endpoint's unready endpoint addresses to its ready addresses so if a pod update contains an unready -> ready status change, the endpoint needs to be updated.
Fix unready endpoints bug introduced in #50934
Automatic merge from submit-queue (batch tested with PRs 15870, 15888, 15788, 15907, 15936) UPSTREAM: 50934: Skip non-update endpoint updates Node performance impact fix for endpoints controller. Skips no-op service updates on pod that have not changed in a way that impacts endpoints. xref kubernetes/kubernetes#50934 and kubernetes/kubernetes#51144 @joelsmith @derekwaynecarr @eparis @smarterclayton
…0934 :100644 100644 0f17c4a510... 0efd748a6e... M pkg/controller/endpoint/endpoints_controller.go :100644 100644 b4c51a2f2a... 7af7c41c28... M pkg/controller/endpoint/endpoints_controller_test.go
…0934 :100644 100644 0f17c4a510... 0efd748a6e... M pkg/controller/endpoint/endpoints_controller.go :100644 100644 b4c51a2f2a... 7af7c41c28... M pkg/controller/endpoint/endpoints_controller_test.go
Automatic merge from submit-queue UPSTREAM: 50934: Skip non-update endpoint updates xref kubernetes/kubernetes#50934 and kubernetes/kubernetes#51144 master PR: #15888 xref https://bugzilla.redhat.com/show_bug.cgi?id=1481603 @joelsmith @derekwaynecarr @eparis @smarterclayton
…0934 :100644 100644 0f17c4a510... 0efd748a6e... M pkg/controller/endpoint/endpoints_controller.go :100644 100644 b4c51a2f2a... 7af7c41c28... M pkg/controller/endpoint/endpoints_controller_test.go
…0934 :100644 100644 0f17c4a510... 0efd748a6e... M pkg/controller/endpoint/endpoints_controller.go :100644 100644 b4c51a2f2a... 7af7c41c28... M pkg/controller/endpoint/endpoints_controller_test.go
…0934 :100644 100644 0f17c4a510... 0efd748a6e... M pkg/controller/endpoint/endpoints_controller.go :100644 100644 b4c51a2f2a... 7af7c41c28... M pkg/controller/endpoint/endpoints_controller_test.go
…0934 :100644 100644 0f17c4a510... 0efd748a6e... M pkg/controller/endpoint/endpoints_controller.go :100644 100644 b4c51a2f2a... 7af7c41c28... M pkg/controller/endpoint/endpoints_controller_test.go
…0934 :100644 100644 0f17c4a510... 0efd748a6e... M pkg/controller/endpoint/endpoints_controller.go :100644 100644 b4c51a2f2a... 7af7c41c28... M pkg/controller/endpoint/endpoints_controller_test.go
…0934 :100644 100644 0f17c4a510... 0efd748a6e... M pkg/controller/endpoint/endpoints_controller.go :100644 100644 b4c51a2f2a... 7af7c41c28... M pkg/controller/endpoint/endpoints_controller_test.go
What this PR does / why we need it:
On large clusters, a large percentage of endpoint updates are actually non-updates that occur as a result of a change in an associated pod. This results in endpoint updates where the only field that has changed is the
TargetRef.ResourceVersion
in the endpoint address associated with the changed pod. Given enough of these non-updates, the endpoint controller's queue rate limit can be overwhelmed and legitimate updates can be delayed, resulting in (temporarily) broken services. We have clusters where we've seen endpoint updates take 9 minutes.Which issue this PR fixes : fixes #50936
Special notes for your reviewer:
N/A
Release note: