
Add simple pod updates batching to endpoint controller. #80509

Merged
merged 2 commits into kubernetes:master from mborsz:endpoints on Jul 27, 2019

Conversation

@mborsz
Member

commented Jul 24, 2019

What type of PR is this?
/kind feature

What this PR does / why we need it:

This PR adds a mechanism to reduce the number of generated Endpoints updates at the cost of increased network programming latency.
The default behavior matches the state before this PR; batching can be enabled and tuned using the --endpoint-updates-batch-period flag.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

New flag --endpoint-updates-batch-period in kube-controller-manager can be used to reduce the number of Endpoints updates generated by pod changes.
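
For example (an illustrative invocation; the 500ms value is taken from the discussion below, and leaving the flag unset keeps the previous behavior, as noted above):

    kube-controller-manager --endpoint-updates-batch-period=500ms

Pod updates for the same Service that are observed within that window are coalesced into a single Endpoints sync.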

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/assign @wojtek-t

@fejta-bot


commented Jul 24, 2019

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@wojtek-t

Member

commented Jul 24, 2019

@mborsz - can you please move autogenerated stuff to a separate commit? It would be much easier to review it ;)

@mborsz mborsz force-pushed the mborsz:endpoints branch from 3ba7b75 to ad5a4ee Jul 24, 2019

@mborsz

Member Author

commented Jul 24, 2019

Done

mborsz added some commits Jul 24, 2019

@mborsz mborsz force-pushed the mborsz:endpoints branch from ad5a4ee to d96f242 Jul 24, 2019

@@ -210,7 +214,7 @@ func (e *EndpointController) addPod(obj interface{}) {
 		return
 	}
 	for key := range services {
-		e.queue.Add(key)
+		e.queue.AddAfter(key, e.endpointUpdatesBatchPeriod)

@freehan

freehan Jul 24, 2019

Member

Is this the standard way to implement batching? If there is only one endpoint change (probably the most common case), this will delay the endpoint programming by the batch period.

Is it possible to implement batching on the dequeue end instead? For example:

key, quit := e.queue.Get()
if quit {
	return
}
// BatchPeriodAfterLastSync is a hypothetical helper returning
// lastSyncTimestamp[key] plus the batch period.
timestamp := BatchPeriodAfterLastSync(key)
if now := time.Now(); now.Before(timestamp) {
	e.queue.AddAfter(key, timestamp.Sub(now))
	return
}
sync(key)

We would probably need to maintain a map from key to lastSyncTimestamp.

@wojtek-t

wojtek-t Jul 24, 2019

Member

Is this the standard way to implement batching?

It's probably not. But it's certainly the easiest way - it's two lines of code and it works.

Regarding your idea - we've been thinking about that, and my personal take is that I'm not convinced it makes sense to complicate this code (though I don't have a strong opinion).

However, let me add a few more data points here:

  • I strongly believe we don't need any batching for clusters with fewer than ~1000 nodes - it should just work once we finish the things that are already under way (EndpointSlice & serializing the object once)
  • For larger clusters, the batching period I have in mind is probably ~500ms
  • In clusters with 1000+ nodes, I would generally expect a non-negligible number of services, so programming iptables or programming NEGs or something like that should absorb that small batching-period overhead
  • For the in-cluster network programming latency SLO, it's pretty clear that the threshold will be at least 15s (because that's what we observe even in 100-node clusters in scalability tests)

So this PR exposes an ability for the operator to tune, with close to zero engineering work and no risk of bugs. The approach you're suggesting doesn't have these properties.

@mborsz

mborsz Jul 25, 2019

Author Member

This change is the simplest possible way to implement the batching. I agree that there are a number of better ways to determine the right delay (e.g., as you mentioned, we could pass the first pod update through without delay and delay only subsequent ones, or use a different batch period depending on factors like the observed pod update rate, the current size of the Endpoints object, etc.), but I think this is a good starting point.

Please note that we are not changing any default behavior; the change is purely opt-in -- the user needs to explicitly set the flag to enable this and can tune the value that makes sense for the given use case.

I think that for large clusters, reasonable values are around 0.5 seconds, as wojtek-t mentioned. While testing this with a 1s value, I observed a significant (5-10x) reduction in the number of Endpoints updates generated in scalability tests. Given that Endpoints objects are usually large and watched by all nodes, they generate most of the traffic (in terms of bandwidth and CPU) sent from the master.

I think it is a fair tradeoff to increase network programming latency by e.g. 0.5s (out of the 15s mentioned above) in order to significantly reduce the overall traffic generated by the master.
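
For illustration, a minimal standalone sketch (not the controller code itself; the key name and batch period are made up) of how AddAfter plus the work queue's per-key deduplication batches several pod updates into one sync:

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The delaying work queue keeps at most one pending entry per key, so
	// several pod updates for the same Service enqueued within the batch
	// period collapse into a single work item once the delay elapses.
	queue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "endpoint")
	batchPeriod := 500 * time.Millisecond // illustrative value

	// Three pod events for the same Service arrive in quick succession.
	for i := 0; i < 3; i++ {
		queue.AddAfter("default/my-service", batchPeriod)
	}

	// After the batch period there is a single item to sync, not three.
	time.Sleep(batchPeriod + 100*time.Millisecond)
	fmt.Println("pending syncs:", queue.Len()) // pending syncs: 1
}

Without the delay, the first event would typically be synced immediately and later events would trigger further syncs, each potentially writing a new Endpoints object; the delay lets changes accumulate so one sync (and one update sent to every watching kube-proxy) covers them all.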

@roycaihw

Member

commented Jul 25, 2019

/remove-sig api-machinery

@freehan

Member

commented Jul 26, 2019

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm label Jul 26, 2019

@thockin
Member

left a comment

This is a simple change. It doesn't seem to reduce the amount of work that the controller does (e.g. by noticing that an update was already queued) but it will result in more no-op results, I guess.

If the intent is really O(100ms) windows it is probably fine. My mind immediately went to something more complicated, e.g. wait 1/2 window, if any updates arrived, wait 1/2 of that, etc. That would give a more dynamic window such that busy systems tend to batch more. But it's a lot more complicated.

/lgtm
/approve
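
For illustration only, a rough sketch of the adaptive half-window idea described above (nothing like this is in the PR; names and values are made up):

package main

import (
	"fmt"
	"time"
)

// adaptiveBatch sketches the "halving window" idea: after the first update,
// wait half the window; if more updates arrived during that wait, wait half
// of that again, and so on. Busy periods therefore batch more, while a single
// isolated update is flushed after roughly half a window.
func adaptiveBatch(updates <-chan string, window time.Duration, flush func([]string)) {
	for first := range updates {
		batch := []string{first}
		for wait := window / 2; wait > 0; wait /= 2 {
			timer := time.NewTimer(wait)
			arrived := false
		drain:
			for {
				select {
				case u, ok := <-updates:
					if !ok {
						break drain
					}
					batch = append(batch, u)
					arrived = true
				case <-timer.C:
					break drain
				}
			}
			if !arrived {
				break // a quiet sub-window: stop waiting and flush
			}
		}
		flush(batch)
	}
}

func main() {
	ch := make(chan string, 16)
	go func() {
		for i := 0; i < 5; i++ {
			ch <- fmt.Sprintf("pod-update-%d", i)
			time.Sleep(20 * time.Millisecond)
		}
		close(ch)
	}()
	// With a 500ms window the five updates above land in one batch.
	adaptiveBatch(ch, 500*time.Millisecond, func(batch []string) {
		fmt.Println("flushing batch of", len(batch), "updates")
	})
}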

@k8s-ci-robot

Contributor

commented Jul 26, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: freehan, mborsz, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 44de947 into kubernetes:master Jul 27, 2019

23 checks passed

cla/linuxfoundation: mborsz authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-conformance-image-test: Skipped.
pull-kubernetes-cross: Skipped.
pull-kubernetes-dependencies: Job succeeded.
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-csi-serial: Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gce-iscsi: Skipped.
pull-kubernetes-e2e-gce-iscsi-serial: Skipped.
pull-kubernetes-e2e-gce-storage-slow: Skipped.
pull-kubernetes-godeps: Skipped.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-local-e2e: Skipped.
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-node-e2e-containerd: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
pull-publishing-bot-validate: Skipped.
tide: In merge pool.
@wojtek-t

Member

commented Jul 29, 2019

This is a simple change. It doesn't seem to reduce the amount of work that the controller does (e.g. by noticing that an update was already queued) but it will result in more no-op results, I guess.

Exactly - the goal was not to reduce the amount of work done by the endpoints controller, but the number of Endpoints object changes that then have to be sent to all kube-proxies.

If the intent is really O(100ms) windows it is probably fine. My mind immediately went to something more complicated, e.g. wait 1/2 window, if any updates arrived, wait 1/2 of that, etc. That would give a more dynamic window such that busy systems tend to batch more. But it's a lot more complicated.

My motivation for proceeding with this approach is that we can always change the implementation if it appears that we really need something more sophisticated. But having something is better than nothing and this one is simple enough that in my opinion it's worth proceeding with it.

Thanks @thockin !
