
Make periodic NodeStatus updates cheaper #14733

Closed
gmarek opened this issue Sep 29, 2015 · 64 comments

Comments

@gmarek (Member) commented Sep 29, 2015

We're currently sending whole NodeStatuses just to update a Node's heartbeat. This is a big source of traffic in the cluster, and one of the possible causes of 1000-node cluster failures. We need to extract the heartbeat into a new 'Heartbeat' API object containing only a timestamp and an object reference, and make the Kubelet/NodeController use this object instead of NodeStatus to determine the health of a Node.
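
A minimal sketch of what such a Heartbeat object could look like, using present-day Go API conventions for readability; the type name and fields below are illustrative assumptions, not an existing Kubernetes type:

```go
package heartbeat

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Heartbeat is a hypothetical object holding only the data needed to refresh
// a Node's liveness: a reference to the Node plus a timestamp, instead of the
// full NodeStatus.
type Heartbeat struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// NodeRef identifies the Node this heartbeat belongs to.
	NodeRef v1.ObjectReference `json:"nodeRef"`

	// Timestamp is the last time the Kubelet reported the Node healthy.
	Timestamp metav1.Time `json:"timestamp"`
}
```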

cc: @wojtek-t @lavalamp @bgrant0607 @brendandburns @smarterclayton @timothysc @davidopp @fgrzadkowski

@smarterclayton (Contributor) commented Sep 29, 2015

Why not just add a heartbeat subresource to the node that performs the operation?

EDIT: that is, to do the update, but continue to retrieve nodes as-is. I don't see the need to retrieve a Heartbeat, just to set it. We don't have a separate pod binding object; we just have the "bind" verb on the pod that mutates nodeName.

@gmarek (Member, author) commented Sep 29, 2015

That's the plan for the implementation, but we still need to store it somewhere, hence the need for an API object.

@smarterclayton (Contributor) commented Sep 29, 2015

Why wouldn't the store continue to be on the node? Virtual resources like scale / bind / etc already handle this. Maybe that's what you're suggesting, just want to be sure.

@smarterclayton (Contributor) commented Sep 29, 2015

I.e. PUT /nodes/foo/heartbeat simply sets lastProbeTime and accepts the "Heartbeat" virtual subresource you described.

@wojtek-t (Member) commented Sep 29, 2015

But this would require changing how subresources are implemented. The problem is that we currently send the whole object in that case (e.g. the whole pod in the case of binding). We would need an efficient PATCH operation that sends only the data that is changing (not the whole object).

@smarterclayton (Contributor) commented Sep 29, 2015

Are you looking to reduce UPDATE cost, or WATCH of updated node cost?

Binding is a minimal resource: it's just the object reference for the target node and the name of the pod. For the former, you would POST/PUT a heartbeat "lastProbeTime" to the apiserver, which would run a guaranteed update on the node that only alters the lastProbeTime. The kubelets would only send a minimal object, but anyone watching would still get the update.

For the latter, if the resources are truly split, wouldn't that require the node controller to fetch and watch two objects in order to stitch that data together? I didn't think watch on node was the problem, which is why I assume you meant the former.
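
A rough sketch of the write path described above (a guaranteed update that only touches the heartbeat field), written against a present-day client-go interface for readability; the function name is hypothetical, and the lastProbeTime mentioned above corresponds to the Ready condition's LastHeartbeatTime field here:

```go
package main

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// touchNodeHeartbeat re-reads the Node and updates only the Ready condition's
// heartbeat time, retrying on resourceVersion conflicts - the "guaranteed
// update that only alters the lastProbeTime" idea from the comment above.
func touchNodeHeartbeat(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		now := metav1.NewTime(time.Now())
		for i := range node.Status.Conditions {
			if node.Status.Conditions[i].Type == v1.NodeReady {
				node.Status.Conditions[i].LastHeartbeatTime = now
			}
		}
		// Write via the status subresource; watchers still see a full Node update.
		_, err = cs.CoreV1().Nodes().UpdateStatus(ctx, node, metav1.UpdateOptions{})
		return err
	})
}
```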

@wojtek-t (Member) commented Sep 29, 2015

Yes - I mostly meant the former.

For the watch: with the new watch implementation in the apiserver, we will read the data from etcd only once and then send it only to the interested watchers (and there should be a constant number of those).

So yes - I'm mostly worried about write path (not read path).

@smarterclayton (Contributor) commented Sep 29, 2015

Binding is modeled today as:

POST /v1/namespaces/<namespace>/pods/<pod>/binding {"kind":"Binding", "metadata": {"name": "<pod>"}, "to": {"name": "node1"}}

The result of that is persisted into the pod in etcd via the GuaranteedUpdate loop.

PUT /v1/namespaces/<namespace>/nodes/<node>/heartbeat {"kind":"Heartbeat", "metadata": {"name": "<node>"}, "timestamp": "2015-10-09 12:03:00Z"}

Could be very similar.

@bgrant0607 (Member) commented Sep 29, 2015

Have you tried PATCH?

@bgrant0607 (Member) commented Sep 29, 2015

I assume that this is not for 1.1.

@gmarek (Member, author) commented Sep 29, 2015

Probably not, as we're already close to meeting our performance goals for v1.1. It'll probably be necessary when we start to aim at more than 250 Nodes.

@bgrant0607 (Member) commented Sep 29, 2015

If we were going to add a new subresource, it should match the conventions of other subresources (and API conventions generally) and we shouldn't call it Heartbeat -- it should be a condition-reporting API.

However, we shouldn't need to add a subresource just for efficiency reasons. We should make it efficient to patch and get subsets of resources.

Is PATCH not implemented for node status? Strategic merge patch looks like it should work fine, though we might want to fix resourceVersion preconditions.

The ability to GET a subset of fields is covered by #1459.
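
For illustration, a strategic merge patch against the node's status could carry just the changed condition rather than the whole object. A hedged sketch using a present-day client-go signature (the function name and patch body are illustrative, not the actual Kubelet code):

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// patchNodeHeartbeat sends only the changed Ready condition as a strategic
// merge patch to the node's status subresource; the merge key on conditions
// is "type", so other conditions and status fields are left untouched.
func patchNodeHeartbeat(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	now := time.Now().UTC().Format(time.RFC3339)
	patch := []byte(fmt.Sprintf(
		`{"status":{"conditions":[{"type":"Ready","status":"True","lastHeartbeatTime":%q}]}}`, now))
	_, err := cs.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType,
		patch, metav1.PatchOptions{}, "status")
	return err
}
```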

bgrant0607 added this to the v1.2-candidate milestone Sep 29, 2015

@bgrant0607 (Member) commented Sep 29, 2015

Also, do you have performance data showing how significant a problem this is?

@dchen1107 (Member) commented Sep 29, 2015

I disagree with introducing a Heartbeat object for this purpose.

  • From the issues reported so far, I don't think NodeStatus is the problem. Events and PodStatus should have more impact here.
  • This is actually a generic issue we should solve. For example, we also need to update lastProbeTime and lastTransitionTime for PodStatus. Extracting a Heartbeat object doesn't solve that case, which has the same root cause.

Can we use PATCH to do delta updates? Or introduce bulk updates?

@lavalamp (Member) commented Sep 29, 2015

-100 on heartbeat object.

Switch to PATCH. Even if we have to implement it.

@smarterclayton (Contributor) commented Sep 29, 2015

Oh yeah, patch is also way better.


@lavalamp (Member) commented Sep 29, 2015

I also want to add that while switching to PATCH would be nice, I don't believe this is actually a problem. Sending the whole object costs extra bandwidth, but we are not bandwidth limited; we are write-QPS limited (right?). And PATCH, the current PUT, and a hypothetical heartbeat object ALL cause the same number of etcd writes (and in fact write exactly the same data, since PATCH and a heartbeat object would still have to do a read/write cycle of the whole object).

So I expect this change to make basically no difference to a cluster.

@dchen1107 (Member) commented Sep 29, 2015

Bulk updates could somewhat address today's QPS limit (whether the limit is valid is a separate issue), but whether we use PATCH, PUT, a heartbeat object, or bulk updates, we end up with the same amount of etcd writes. :-)

@lavalamp (Member) commented Sep 29, 2015

Yes, exactly.


@shubheksha (Contributor) commented Oct 30, 2018

ping @wojtek-t / @wangzhen127 - is this on track for 1.13?

@wangzhen127 (Member) commented Oct 30, 2018

Yes

@AishSundar (Contributor) commented Nov 1, 2018

@wangzhen127 can you provide a list of pending PRs for this issue and an ETA for when we can expect all the related PRs to be merged? Can you please also give this issue an appropriate kind/ label (e.g. feature, bug, cleanup)?

We are nearing code slush for 1.13 next Friday (11/9) and code freeze in two weeks. If you think this issue isn't critical for 1.13 and work might extend beyond code slush, please work with @nikopen to move it out. Thanks!

@wangzhen127 (Member) commented Nov 1, 2018

/kind feature
This feature is critical for 1.13. I think it is making good progress.

Code

  • #69753 (expect to merge this week)
  • #70034 (expect to merge next week)

Doc

Test Infra

@AishSundar (Contributor) commented Nov 1, 2018

Thanks so much @wangzhen127 for the detailed PR list; it helps us track this much better.

@tfogo for the doc-related PR

@wojtek-t (Member) commented Nov 22, 2018

Let's keep this open until the feature is enabled by default (currently it is an Alpha feature, and thus disabled by default).

We will do a bit more testing, and we are planning to enable it by default in 1.14.
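
Assuming the alpha feature referred to here is the lease-based node heartbeat (small Lease objects in the coordination.k8s.io group that the kubelet renews instead of rewriting the full NodeStatus every period), a renewal might look roughly like the sketch below; the namespace and function name are assumptions, not confirmed by this thread:

```go
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// renewNodeLease sketches a lease-based heartbeat: instead of sending the
// whole NodeStatus, the kubelet refreshes a tiny per-node Lease object
// (assumed here to live in the kube-node-lease namespace).
func renewNodeLease(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	leases := cs.CoordinationV1beta1().Leases("kube-node-lease")
	lease, err := leases.Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	now := metav1.NewMicroTime(time.Now())
	lease.Spec.RenewTime = &now
	_, err = leases.Update(ctx, lease, metav1.UpdateOptions{})
	return err
}
```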

wojtek-t reopened this Nov 22, 2018

wojtek-t self-assigned this Nov 22, 2018

@liggitt (Member) commented Nov 22, 2018

/milestone v1.14

@liggitt (Member) commented Dec 19, 2018

Can this be closed, now that #72096 is merged?

@wojtek-t (Member) commented Dec 19, 2018

It's a matter of promoting the API and the feature to GA, but I guess we don't need this bug to stay open for that; we can open another one. So I think it's fine to close this.

@wojtek-t (Member) commented Dec 19, 2018

On a related note, I just started working on promoting the API to v1; I will send out a PR later today or tomorrow.

@nikopen (Member) commented Feb 27, 2019

@wojtek-t as #72239 has been merged, can this be closed?

@wojtek-t (Member) commented Feb 27, 2019

The feature is not GA yet (it will be Beta in 1.14), but I think we can close this; there is a corresponding issue in the enhancements repo.
