Skip non-update endpoint updates #50934

Merged
merged 1 commit into from
Aug 22, 2017

Conversation

joelsmith
Contributor

@joelsmith joelsmith commented Aug 18, 2017

What this PR does / why we need it:

On large clusters, a large percentage of endpoint updates are actually non-updates that occur as a result of a change in an associated pod. This results in endpoint updates where the only field that has changed is the TargetRef.ResourceVersion in the endpoint address associated with the changed pod. Given enough of these non-updates, the endpoint controller's queue rate limit can be overwhelmed and legitimate updates can be delayed, resulting in (temporarily) broken services. We have clusters where we've seen endpoint updates take 9 minutes.
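
As an illustration of the "non-update" described above, the following is a minimal sketch of a comparison that treats two endpoint addresses as equal when only TargetRef.ResourceVersion differs. The types are simplified stand-ins for the Kubernetes API types, and the helper names (withoutResourceVersion, endpointChanged) are hypothetical; this is not the code from the PR.

```go
package main

import (
	"fmt"
	"reflect"
)

// ObjectReference and EndpointAddress are simplified stand-ins for the
// Kubernetes API types; only the fields relevant to this sketch are modeled.
type ObjectReference struct {
	Kind            string
	Namespace       string
	Name            string
	ResourceVersion string
}

type EndpointAddress struct {
	IP        string
	NodeName  *string
	TargetRef *ObjectReference
}

// withoutResourceVersion (hypothetical helper) returns a copy of addr whose
// TargetRef has an empty ResourceVersion. The caller's TargetRef is not mutated.
func withoutResourceVersion(addr EndpointAddress) EndpointAddress {
	if addr.TargetRef != nil {
		ref := *addr.TargetRef
		ref.ResourceVersion = ""
		addr.TargetRef = &ref
	}
	return addr
}

// endpointChanged (hypothetical name) reports whether two addresses differ in
// anything other than TargetRef.ResourceVersion.
func endpointChanged(oldAddr, newAddr EndpointAddress) bool {
	return !reflect.DeepEqual(withoutResourceVersion(oldAddr), withoutResourceVersion(newAddr))
}

func main() {
	node := "node-1"
	oldAddr := EndpointAddress{
		IP:        "10.0.0.5",
		NodeName:  &node,
		TargetRef: &ObjectReference{Kind: "Pod", Namespace: "default", Name: "web-0", ResourceVersion: "100"},
	}
	newAddr := oldAddr
	newAddr.TargetRef = &ObjectReference{Kind: "Pod", Namespace: "default", Name: "web-0", ResourceVersion: "101"}

	fmt.Println(endpointChanged(oldAddr, newAddr)) // false: only ResourceVersion differs

	newAddr.IP = "10.0.0.6"
	fmt.Println(endpointChanged(oldAddr, newAddr)) // true: the IP differs as well
}
```

Comparing copies rather than clearing the field in place keeps the original objects intact, which matters if they come from a shared cache.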

Which issue this PR fixes: fixes #50936

Special notes for your reviewer:
N/A

Release note:

Prevent unneeded endpoint updates

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 18, 2017
@k8s-ci-robot
Contributor

Hi @joelsmith. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 18, 2017
@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-label-needed labels Aug 18, 2017
@joelsmith joelsmith changed the title <!-- Thanks for sending a pull request! Here are some tips for you: 1. If this is your first time, read our contributor guidelines https://git.k8s.io/community/contributors/devel/pull-requests.md#the-pr-submit-process and developer guide https://git.k8s.io/community/contributors/devel/development.md#development-guide 2. If you want *faster* PR reviews, read how: https://git.k8s.io/community/contributors/devel/pull-requests.md#best-practices-for-faster-reviews 3. Follow the instructions for writing a release note: https://git.k8s.io/community/contributors/devel/pull-requests.md#write-release-notes-if-needed --> WIP skip endpoint updates on non-updates Aug 18, 2017
@joelsmith joelsmith changed the title WIP skip endpoint updates on non-updates WIP skip non-update endpoint updates Aug 18, 2017
@k8s-github-robot k8s-github-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-label-needed labels Aug 18, 2017
@joelsmith
Contributor Author

@sjenning @DirectXMan12 PTAL. Also, could somebody please add the ok to test label?

@sjenning
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 19, 2017
@sjenning
Contributor

Darn it Joel! You made me pull out pen and paper for a Venn diagram 😛

This looks functionally correct to me, but it is hard to follow in places and does some things unnecessarily. For example, if we know that pod and labels didn't change, we can immediately return as we know there is no update required, avoiding one or both getPodServiceMemberships() calls which are pretty expensive.

The changes were harder to describe in words than in code, so I just adapted your code (untested):
sjenning@7bcbe38

I think that does the minimal amount of work required and might be easier to understand. What do you think?
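
For readers who cannot follow the linked commit, the early-return idea from this comment might look roughly like the sketch below. It is not the contents of that commit. The Pod type, the toy getPodServiceMemberships, the lookups counter, and the name servicesToSync are all assumptions made for illustration, and the pod comparison is reduced to an IP check.

```go
package main

import (
	"fmt"
	"reflect"
)

// Pod is a simplified stand-in for the Kubernetes Pod type.
type Pod struct {
	Name            string
	ResourceVersion string
	IP              string
	Labels          map[string]string
}

// lookups counts calls to the membership lookup, which the comment above
// describes as expensive.
var lookups int

// getPodServiceMemberships is a toy stand-in: a pod belongs to the service
// named by its "app" label, if it has one.
func getPodServiceMemberships(p Pod) []string {
	lookups++
	if app, ok := p.Labels["app"]; ok {
		return []string{app}
	}
	return nil
}

// servicesToSync (hypothetical name) returns the services whose endpoints
// should be re-synced after a pod update.
func servicesToSync(oldPod, newPod Pod) map[string]bool {
	podChanged := oldPod.IP != newPod.IP // placeholder for the real comparison
	labelsChanged := !reflect.DeepEqual(oldPod.Labels, newPod.Labels)
	if !podChanged && !labelsChanged {
		// Nothing endpoint-relevant changed: return before either lookup.
		return nil
	}
	services := map[string]bool{}
	for _, s := range getPodServiceMemberships(newPod) {
		services[s] = true
	}
	if labelsChanged {
		// Old memberships only matter when the labels changed. This sketch
		// takes the plain union of old and new memberships.
		for _, s := range getPodServiceMemberships(oldPod) {
			services[s] = true
		}
	}
	return services
}

func main() {
	oldPod := Pod{Name: "web-0", ResourceVersion: "100", IP: "10.0.0.5", Labels: map[string]string{"app": "web"}}
	newPod := oldPod
	newPod.ResourceVersion = "101" // e.g. a status-only update

	n := len(servicesToSync(oldPod, newPod))
	fmt.Println(n, lookups) // 0 0: skipped without any lookup

	newPod.IP = "10.0.0.6"
	n = len(servicesToSync(oldPod, newPod))
	fmt.Println(n, lookups) // 1 1: one lookup, for the new pod only
}
```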

also cc @derekwaynecarr @eparis @smarterclayton

Looks like all the tests are passing except for one flake so that's nice.

Great work running this one down! 👍

@joelsmith
Contributor Author

@sjenning I like your updated version. I'll test it out then pull it in, squash and push to this PR's branch. Hopefully the flake will work the second time. Thanks for your help!

Member

@thockin thockin left a comment

This looks great overall, but it really needs a test. At the very least, we can break out the comparison logic to a function and test against various inputs, and we can break out the service-change logic into a function, and test against various scenarios to prove that your set logic stays correct.

newEndpointAddress.NodeName = nil
oldEndpointAddress.NodeName = nil
if !reflect.DeepEqual(newEndpointAddress, oldEndpointAddress) {
// The pod has not changed in any way that impacts the endpoints
Member

s/has not/has ??

Contributor Author

Good catch, we let an outdated comment slip through

@joelsmith joelsmith changed the title WIP skip non-update endpoint updates Skip non-update endpoint updates Aug 20, 2017
@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 20, 2017
@joelsmith
Contributor Author

@thockin Thanks for your review. I have added the unit tests you recommended. I was planning on adding some tests (the reason for the "WIP" in the title) but I was struggling to come up with anything good. I appreciate your suggestions and I hope the newly-added tests are adequate.

Member

@thockin thockin left a comment

thanks. Almost there. Any time there is something non-obvious, err on the side of more comments :)

oldEndpointAddress := podToEndpointAddress(oldPod)
newEndpointAddress.TargetRef.ResourceVersion = ""
oldEndpointAddress.TargetRef.ResourceVersion = ""
newEndpointAddress.NodeName = nil
Member

Comment why we don't care about NodeName but we do care about TargetRef?

t.Errorf("Expected address to be unchanged for copied pod")
}

newPod.ObjectMeta.ResourceVersion = "changed"
Member

repeat this for NodeName, if we really don't care about that (not convinced we don't).

bcd := sets.NewString("b", "c", "d")
abcd := sets.NewString("a", "b", "c", "d")
ad := sets.NewString("a", "d")
retval := determineNeededServiceUpdates(abc, bcd, false)
Member

Can you do this as a table-driven test and exercise more cases. One or the other being empty. Both empty. Totally disjoint sets, identical sets, etc.

Let me know if you need examples of such tests.

@joelsmith
Contributor Author

@sjenning Were you ignoring the NodeName because of what I said about its value being different? If so, I think I led us down a bad path. My earlier iteration of an equality function to use in place of DeepEqual didn't handle NodeName, but DeepEqual does, and in my testing, it doesn't cause unnecessary endpoint updates. It looks like that was all due to the TargetRef.ResourceVersion. Sorry for the confusion. I've made the latest version stop ignoring NodeName and everything I've tested is working. Please let me know if there is something we're missing w.r.t. NodeName, but I suspect that you just added it to the ignore list due to my comments about my earlier mistaken implementation.

@thockin I've switched to the table of test cases and added a few more test cases. Please let me know if you think of any others that I missed.

@thockin
Member

thockin commented Aug 20, 2017

This LGTM. Can you please squash commits?

@joelsmith
Contributor Author

@thockin squashed, and thanks again for the review!

@thockin
Member

thockin commented Aug 20, 2017

/lgtm
/approve

joelsmith added a commit to joelsmith/kubernetes that referenced this pull request Aug 22, 2017
A pod status change of unready -> ready results in a move from
the endpoint's unready endpoint addresses to its ready addresses
so if a pod update contains an unready -> ready status change,
the endpoint needs to be updated.
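
The reasoning in this commit message can be sketched as follows: because readiness decides which address list a pod lands in, a change check has to treat a readiness flip as a real change even when the address itself is identical. The Pod and Subset types, the Ready boolean, and the function names are simplified assumptions for illustration, not the code from the referenced commit.

```go
package main

import "fmt"

// Pod is a simplified stand-in; Ready collapses the pod's readiness
// condition into a single boolean for illustration.
type Pod struct {
	IP    string
	Ready bool
}

// Subset models the two address lists named in the commit message.
type Subset struct {
	Addresses         []string
	NotReadyAddresses []string
}

// place (hypothetical helper) puts the pod's IP into the list that matches
// its readiness.
func place(p Pod) Subset {
	if p.Ready {
		return Subset{Addresses: []string{p.IP}}
	}
	return Subset{NotReadyAddresses: []string{p.IP}}
}

// podChanged (hypothetical, simplified) treats a readiness flip as a change
// in addition to any difference in the address.
func podChanged(oldPod, newPod Pod) bool {
	return oldPod.Ready != newPod.Ready || oldPod.IP != newPod.IP
}

func main() {
	before := Pod{IP: "10.0.0.5", Ready: false}
	after := Pod{IP: "10.0.0.5", Ready: true}

	fmt.Println(podChanged(before, after)) // true: same IP, readiness flipped
	fmt.Println(len(place(before).Addresses), len(place(after).Addresses)) // 0 1
}
```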
eparis added a commit that referenced this pull request Aug 22, 2017
openshift-merge-robot added a commit to openshift/origin that referenced this pull request Aug 24, 2017
Automatic merge from submit-queue (batch tested with PRs 15870, 15888, 15788, 15907, 15936)

UPSTREAM: 50934: Skip non-update endpoint updates

Node performance impact fix for endpoints controller. Skips no-op service updates on pods that have not changed in a way that impacts endpoints.

xref kubernetes/kubernetes#50934 and kubernetes/kubernetes#51144

@joelsmith @derekwaynecarr @eparis @smarterclayton
deads2k pushed a commit to deads2k/kubernetes that referenced this pull request Sep 1, 2017
…0934

:100644 100644 0f17c4a510... 0efd748a6e... M	pkg/controller/endpoint/endpoints_controller.go
:100644 100644 b4c51a2f2a... 7af7c41c28... M	pkg/controller/endpoint/endpoints_controller_test.go
deads2k pushed a commit to openshift/kubernetes that referenced this pull request Sep 1, 2017
…0934

:100644 100644 0f17c4a510... 0efd748a6e... M	pkg/controller/endpoint/endpoints_controller.go
:100644 100644 b4c51a2f2a... 7af7c41c28... M	pkg/controller/endpoint/endpoints_controller_test.go
openshift-merge-robot added a commit to openshift/origin that referenced this pull request Sep 15, 2017
deads2k pushed a commit to openshift/kubernetes that referenced this pull request Sep 15, 2017
…0934

:100644 100644 0f17c4a510... 0efd748a6e... M	pkg/controller/endpoint/endpoints_controller.go
:100644 100644 b4c51a2f2a... 7af7c41c28... M	pkg/controller/endpoint/endpoints_controller_test.go
ironcladlou pushed a commit to ironcladlou/kubernetes that referenced this pull request Sep 22, 2017
…0934

:100644 100644 0f17c4a510... 0efd748a6e... M	pkg/controller/endpoint/endpoints_controller.go
:100644 100644 b4c51a2f2a... 7af7c41c28... M	pkg/controller/endpoint/endpoints_controller_test.go
ironcladlou pushed a commit to ironcladlou/kubernetes that referenced this pull request Sep 25, 2017
…0934

:100644 100644 0f17c4a510... 0efd748a6e... M	pkg/controller/endpoint/endpoints_controller.go
:100644 100644 b4c51a2f2a... 7af7c41c28... M	pkg/controller/endpoint/endpoints_controller_test.go
ironcladlou pushed a commit to ironcladlou/kubernetes that referenced this pull request Sep 26, 2017
…0934

:100644 100644 0f17c4a510... 0efd748a6e... M	pkg/controller/endpoint/endpoints_controller.go
:100644 100644 b4c51a2f2a... 7af7c41c28... M	pkg/controller/endpoint/endpoints_controller_test.go
ironcladlou pushed a commit to ironcladlou/kubernetes that referenced this pull request Oct 3, 2017
…0934

:100644 100644 0f17c4a510... 0efd748a6e... M	pkg/controller/endpoint/endpoints_controller.go
:100644 100644 b4c51a2f2a... 7af7c41c28... M	pkg/controller/endpoint/endpoints_controller_test.go
soltysh pushed a commit to openshift/kubernetes that referenced this pull request Oct 4, 2017
…0934

:100644 100644 0f17c4a510... 0efd748a6e... M	pkg/controller/endpoint/endpoints_controller.go
:100644 100644 b4c51a2f2a... 7af7c41c28... M	pkg/controller/endpoint/endpoints_controller_test.go
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

No-op endpoint updates clog up endpoint controller's queue
9 participants