
API Server Costs Too Much CPU in runtime.findrunnable; We Need to Reduce API Server Goroutines #84001

Open
answer1991 opened this issue Oct 16, 2019 · 10 comments

@answer1991 answer1991 commented Oct 16, 2019

What would you like to be added:

We need to reduce the number of API Server goroutines. I think there are several ways to do this, such as:

  1. Make API Server request processing context-aware through the full stack, then try to remove the timeout filter (see the sketch below).
  2. Let kubelet list/watch all the ConfigMaps/Secrets that a Node's Pods need through a single watch request, instead of one watch request per ConfigMap/Secret.
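
To make point 1 concrete, here is a minimal sketch (not the actual apiserver timeout filter; the handler names are made up) of the difference between enforcing a timeout by spawning an extra goroutine per request and passing a deadline down through the request context:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

// timeoutByGoroutine enforces a deadline by running the inner handler in a
// separate goroutine, so every in-flight request costs at least one extra
// goroutine. This is roughly what a timeout filter has to do when the
// handler stack is not context-aware.
// NOTE: a real filter must also guard w against late writes from the
// handler goroutine; that is omitted here for brevity.
func timeoutByGoroutine(h http.Handler, d time.Duration) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		done := make(chan struct{})
		go func() {
			h.ServeHTTP(w, r)
			close(done)
		}()
		select {
		case <-done:
		case <-time.After(d):
			http.Error(w, "request timed out", http.StatusGatewayTimeout)
		}
	})
}

// contextAwareTimeout only attaches a deadline to the request context; no
// extra goroutine is needed, but every handler in the stack must honor
// ctx.Done() for the timeout to take effect.
func contextAwareTimeout(h http.Handler, d time.Duration) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), d)
		defer cancel()
		h.ServeHTTP(w, r.WithContext(ctx))
	})
}

func main() {
	// slow simulates a handler that takes longer than the allowed timeout.
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(2 * time.Second):
			w.Write([]byte("done\n"))
		case <-r.Context().Done():
			// A context-aware handler gives up once the deadline passes
			// (a real filter would also write a 504 here).
			return
		}
	})
	http.Handle("/goroutine-timeout", timeoutByGoroutine(slow, time.Second))
	http.Handle("/context-timeout", contextAwareTimeout(slow, time.Second))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```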

Why is this needed:

In our environment, once the API Server has more than 300k goroutines, it spends too much CPU in runtime.findrunnable. Please see the following graph from our environment:

(graph: API Server CPU profile showing time spent in runtime.findrunnable)

@dims dims commented Oct 16, 2019

/sig scalability
/sig api-machinery

@fedebongio fedebongio commented Oct 17, 2019

/cc @lavalamp @wojtek-t
Thought you might be interested since you've been digging into a related issue.

@lavalamp lavalamp commented Oct 17, 2019

at 300k goroutines you probably actually have a leak.

See #83333 for a recently fixed leak.

@lavalamp lavalamp commented Oct 17, 2019

For reference, I think a reasonable number of goroutines for a big, heavily loaded cluster is ~50k. Above that something is wrong.

@lavalamp lavalamp commented Oct 17, 2019

And #80465 is one possible thing that could trigger leaking timeouts.

@answer1991 answer1991 commented Oct 18, 2019

@lavalamp Thanks for your reply.

I have already cherry-picked #83333 and #80465 into our environment.

I do not think the API Server is still leaking goroutines, because our API Server processes about 200k+ requests per minute and has more than 300k watches (summed across API Server instances, not per instance) even when the cluster is NOT under heavy load. We have also set automountServiceAccountToken to false for most pods; otherwise even more watches would be open between the API Server and kubelet (see the sketch after the graphs below). Please see more detailed graphs below:

  • response codes per minute (graph)

  • watches by resource name (graph)

  • API Server goroutines (graph); we have scaled the API Server to 5 replicas to avoid the goroutine scheduling performance issue
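
For reference, here is a minimal sketch (using client-go types; the pod name and image are placeholders) of the per-pod automountServiceAccountToken switch mentioned above, printed as the equivalent manifest:

```go
package main

import (
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	off := false

	// Illustrative pod; the only field that matters here is
	// AutomountServiceAccountToken. With it set to false, the service
	// account token Secret is not mounted, so the kubelet does not have
	// to open a watch for it against the API Server.
	pod := corev1.Pod{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
		ObjectMeta: metav1.ObjectMeta{Name: "example"},
		Spec: corev1.PodSpec{
			AutomountServiceAccountToken: &off,
			Containers: []corev1.Container{
				{Name: "app", Image: "nginx"}, // placeholder image
			},
		},
	}

	out, err := yaml.Marshal(pod)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(string(out))
}
```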

@wojtek-t wojtek-t commented Oct 18, 2019

I agree that in large enough clusters, hundreds of thousands of goroutines are WAI.
And this is mostly coming from the fact that there are, IIRC, 3 goroutines per watch request:

Let kubelet list/watch all the ConfigMaps/Secrets that a Node's Pods need through a single watch request, instead of one watch request per ConfigMap/Secret.

I used to have a design proposal for bulk watch:
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/api-machinery/bulk_watch.md
but I no longer think that is what we want. It would introduce a lot of complications.

I was thinking about this in the past, and I actually believe the solution should be different. I have wanted to write a KEP about it for quite some time, but never got to it (I may try to do that in the upcoming weeks). At a high level, I think what we should do is:

  • IIUC, we basically recommend not updating secrets/configmaps in place; instead, the recommendation is to create a new one and do a rolling update to it
  • with that approach, I think we should allow switching off auto-updates of secrets/configmaps for running pods
  • with that feature, users will be able to switch auto-updates off (which will result in not watching secrets/configmaps from the apiserver; just the initial GET will be sent)

I will try to write down the KEP for it next week (unless someone objects in the meantime).

@lavalamp lavalamp commented Nov 8, 2019

If you have any websocket watchers, I think it's worth experimenting with picking this (or search your logs for the message it prints when it leaks): #84693

@wojtek-t wojtek-t commented Dec 10, 2019

And regarding watches on Secrets/ConfigMaps - this KEP is supposed to help with it: https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20191117-immutable-ephemeral-volumes.md
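
For illustration, a minimal sketch (using client-go types; the Immutable field is the one proposed in that KEP and later added to ConfigMap/Secret, and the name/data are placeholders) of marking a ConfigMap immutable so that the kubelet only needs the initial GET and no longer has to keep a watch open:

```go
package main

import (
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	immutable := true

	// Illustrative ConfigMap. Once Immutable is set, the object can no
	// longer be updated in place, so the kubelet can fetch it once and
	// skip the per-object watch that otherwise stays open on the apiserver.
	cm := corev1.ConfigMap{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "ConfigMap"},
		ObjectMeta: metav1.ObjectMeta{Name: "app-config-v2"},
		Immutable:  &immutable,
		Data:       map[string]string{"config.yaml": "key: value"},
	}

	out, err := yaml.Marshal(cm)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(string(out))
}
```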

@answer1991 answer1991 commented Dec 10, 2019

And regarding watches on Secrets/ConfigMaps - this KEP is supposed to help with it: https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20191117-immutable-ephemeral-volumes.md

@wojtek-t Many thanks for your excellent work. Cannot wait to cherry-pick this new feature into our environment.
