Errors for large numbers of objects #1431

Closed
ianlewis opened this issue Nov 17, 2016 · 26 comments · Fixed by kubernetes/kubernetes#44712
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@ianlewis
Contributor

I've seen reports of the dashboard throwing errors and otherwise not working for large numbers of objects.

Anecdotally, the reports I have seen involve large numbers of pods and large numbers of events. It would behoove us to test with large numbers of objects, see which are problematic, and start filing and fixing issues.

@bryk
Contributor

bryk commented Nov 17, 2016

Do you have a specific stack trace or anything else to debug with? The UI should work fine.

@rf232
Contributor

rf232 commented Nov 17, 2016

Somewhat similar: on my test cluster with a deployed 1.4.0 dashboard and 10k pods, the dashboard container crashes every time I request a page. I haven't looked into it much further, though.

@ianlewis
Contributor Author

@bryk I haven't tested it myself, but I've heard it a few times (once on Slack and a few times at KubeCon), so I thought it worth investigating.

What @rf232 said is exactly what I'm talking about, but we need more info.

@rf232 rf232 self-assigned this Nov 18, 2016
@rf232
Contributor

rf232 commented Nov 18, 2016

I can reproduce this in my environment; I'll take a look at what's really happening.

@bryk
Contributor

bryk commented Nov 18, 2016

Interesting. Please reply here with what was/is wrong.

@rf232
Contributor

rf232 commented Nov 18, 2016

I did a short investigation; the container is being killed with an OOM (Out of Memory) error:

2016-11-18T21:29:35.151956377Z container oom bfb8c28f5b70273f475a397bd8be3408f9dc256f33fbe3165940ba187b7a1253 (image=eu.gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.0, io.kubernetes.container.hash=bcabcc47, io.kubernetes.container.name=kubernetes-dashboard, io.kubernetes.container.ports=[{"containerPort":9090,"protocol":"TCP"}], io.kubernetes.container.restartCount=20, io.kubernetes.container.terminationMessagePath=/dev/termination-log, io.kubernetes.pod.name=kubernetes-dashboard-v1.4.0-z9pnm, io.kubernetes.pod.namespace=kube-system, io.kubernetes.pod.terminationGracePeriod=30, io.kubernetes.pod.uid=06b2dd99-a591-11e6-8b7d-42010a840052, name=k8s_kubernetes-dashboard.bcabcc47_kubernetes-dashboard-v1.4.0-z9pnm_kube-system_06b2dd99-a591-11e6-8b7d-42010a840052_b829007e)
2016-11-18T21:29:37.251943478Z container die bfb8c28f5b70273f475a397bd8be3408f9dc256f33fbe3165940ba187b7a1253 (exitCode=137, image=eu.gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.0, io.kubernetes.container.hash=bcabcc47, io.kubernetes.container.name=kubernetes-dashboard, io.kubernetes.container.ports=[{"containerPort":9090,"protocol":"TCP"}], io.kubernetes.container.restartCount=20, io.kubernetes.container.terminationMessagePath=/dev/termination-log, io.kubernetes.pod.name=kubernetes-dashboard-v1.4.0-z9pnm, io.kubernetes.pod.namespace=kube-system, io.kubernetes.pod.terminationGracePeriod=30, io.kubernetes.pod.uid=06b2dd99-a591-11e6-8b7d-42010a840052, name=k8s_kubernetes-dashboard.bcabcc47_kubernetes-dashboard-v1.4.0-z9pnm_kube-system_06b2dd99-a591-11e6-8b7d-42010a840052_b829007e)
2016-11-18T21:29:37.329794337Z container destroy f013e7a99b7b9513c97b7bb7e5cd9f15d297155e242754432c7c52687f8b7375 (image=eu.gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.0, io.kubernetes.container.hash=bcabcc47, io.kubernetes.container.name=kubernetes-dashboard, io.kubernetes.container.ports=[{"containerPort":9090,"protocol":"TCP"}], io.kubernetes.container.restartCount=19, io.kubernetes.container.terminationMessagePath=/dev/termination-log, io.kubernetes.pod.name=kubernetes-dashboard-v1.4.0-z9pnm, io.kubernetes.pod.namespace=kube-system, io.kubernetes.pod.terminationGracePeriod=30, io.kubernetes.pod.uid=06b2dd99-a591-11e6-8b7d-42010a840052, name=k8s_kubernetes-dashboard.bcabcc47_kubernetes-dashboard-v1.4.0-z9pnm_kube-system_06b2dd99-a591-11e6-8b7d-42010a840052_d7d1a95d)
2016-11-18T21:29:50.718773431Z container create f0149ffeab479c2620fe941e7cfa625bd8e6fe9bb12dd92cef5bdcc849310a26 (image=eu.gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.0, io.kubernetes.container.hash=bcabcc47, io.kubernetes.container.name=kubernetes-dashboard, io.kubernetes.container.ports=[{"containerPort":9090,"protocol":"TCP"}], io.kubernetes.container.restartCount=21, io.kubernetes.container.terminationMessagePath=/dev/termination-log, io.kubernetes.pod.name=kubernetes-dashboard-v1.4.0-z9pnm, io.kubernetes.pod.namespace=kube-system, io.kubernetes.pod.terminationGracePeriod=30, io.kubernetes.pod.uid=06b2dd99-a591-11e6-8b7d-42010a840052, name=k8s_kubernetes-dashboard.bcabcc47_kubernetes-dashboard-v1.4.0-z9pnm_kube-system_06b2dd99-a591-11e6-8b7d-42010a840052_4156d221)
2016-11-18T21:29:50.779910144Z container start f0149ffeab479c2620fe941e7cfa625bd8e6fe9bb12dd92cef5bdcc849310a26 (image=eu.gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.0, io.kubernetes.container.hash=bcabcc47, io.kubernetes.container.name=kubernetes-dashboard, io.kubernetes.container.ports=[{"containerPort":9090,"protocol":"TCP"}], io.kubernetes.container.restartCount=21, io.kubernetes.container.terminationMessagePath=/dev/termination-log, io.kubernetes.pod.name=kubernetes-dashboard-v1.4.0-z9pnm, io.kubernetes.pod.namespace=kube-system, io.kubernetes.pod.terminationGracePeriod=30, io.kubernetes.pod.uid=06b2dd99-a591-11e6-8b7d-42010a840052, name=k8s_kubernetes-dashboard.bcabcc47_kubernetes-dashboard-v1.4.0-z9pnm_kube-system_06b2dd99-a591-11e6-8b7d-42010a840052_4156d221)
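
For anyone reproducing this: the exitCode=137 above is the SIGKILL delivered on an OOM kill. One quick way to confirm it from the Kubernetes side (a sketch; the pod name is taken from the events above and will differ per cluster) is to read the container's last termination state:

```sh
# Print the reason for the container's last termination; expect "OOMKilled".
kubectl -n kube-system get pod kubernetes-dashboard-v1.4.0-z9pnm \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```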

@rf232
Contributor

rf232 commented Nov 18, 2016

OK, apparently I had some limits set on my pod. After setting the limits a bit higher, I found that for 10k pods the dashboard needs ~200MB of memory.

@ianlewis
Contributor Author

When I met with some folks yesterday, they said they had problems even with 1000 pods. There are probably a number of issues related to memory, API timeouts, etc., for various API objects. I would keep this issue fairly generic and solve each specific issue separately as we find them.

@floreks
Member

floreks commented Nov 19, 2016

We're not enforcing any CPU/memory limits on the dashboard by default. They have to be applied externally, either by adjusting the YAML or by creating a LimitRange. We should add a note to the documentation saying that, for a high number of resources in the cluster, the memory limit should be raised (if any is applied at all). 200Mi-2Gi should be enough.
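
For illustration, a minimal sketch of applying such a limit externally with a LimitRange (the object name and the 200Mi/1Gi values are placeholders within the range mentioned above, not project defaults):

```yaml
# Sketch: default memory request/limit for containers in kube-system
# that don't declare their own. Values are illustrative only.
apiVersion: v1
kind: LimitRange
metadata:
  name: dashboard-memory-defaults
  namespace: kube-system
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: 200Mi   # default request applied when none is set
      default:
        memory: 1Gi     # default limit applied when none is set
```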

@bryk
Contributor

bryk commented Nov 21, 2016

When I met with some folks yesterday, they said they had problems even with 1000 pods. There are probably a number of issues related to memory, API timeouts, etc., for various API objects. I would keep this issue fairly generic and solve each specific issue separately as we find them.

Can you get us a stack trace or a screenshot of some form? Or guide the folks to report bugs here? It'd help a lot.

@ianlewis
Contributor Author

I instructed them to file a bug, but they were just evaluating Kubernetes, so they may or may not post an issue.

I'll see if I can repro the issue at some point.

This issue about paging in the API may be worth following:
kubernetes/kubernetes#2349

@rf232
Contributor

rf232 commented Nov 24, 2016

Even if we don't apply CPU/memory limits ourselves, the hardware will effectively impose them in the end. Perhaps we should find a way to handle that more gracefully.

But even when we don't crash, with a high number of objects we get really slow. Some findings so far:

  • JSON serialization takes a significant amount of the time.
    Since k8s supports the protobuf serialization format we could switch to that, but the straightforward approach of always requesting protobufs breaks the view/edit YAML workflow.
  • We request far too large lists from the API server.
    If I have an RC with 10k pods and go to the job list or a job detail page, we still request the list of all pods in the same namespace, which makes every page slow (see the sketch after this comment).

Given that this is larger than a small fix, I'll remove this issue from the 1.5 project but keep it open.
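
A minimal client-go sketch of both ideas (not the dashboard's actual code): asking the API server for protobuf instead of JSON, and scoping the pod list with a label selector. The "app=my-rc" selector and the "default" namespace are placeholders, and the exact List signature depends on the client-go version:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster config, as a backend running inside the cluster would use.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	// Request protobuf from the API server instead of JSON; this addresses
	// the serialization cost noted above, but is not suitable for the
	// view/edit YAML workflow, which needs the JSON/YAML form.
	config.ContentType = "application/vnd.kubernetes.protobuf"

	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List only the pods matching a label selector instead of every pod in
	// the namespace, so a 10k-pod namespace doesn't slow down every page.
	pods, err := clientset.CoreV1().Pods("default").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "app=my-rc"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("matched %d pods\n", len(pods.Items))
}
```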

@dgreene1

Just wanted to confirm that my team and I saw this issue when running as few as 315 pods.

@bryk
Contributor

bryk commented Nov 30, 2016

That's sad, @dgreene1.

Do you have any logs to confirm that this is an OOM?

A short-term fix for this problem could be to increase the memory reservation for the UI. Can you try that?

@bryk bryk added the kind/bug and priority/P0 labels and removed the priority/P1 label Nov 30, 2016
@ianlewis
Contributor Author

@rf232 Yeah. gRPC might be a long-term stretch goal, but we'll still need paging to get around the memory issues, and there doesn't seem to be a way to request pages of data from the API server at the moment. kubernetes/kubernetes#2349 in the kubernetes repo addresses this, but it doesn't look like it's been seriously considered for implementation yet.

@ianlewis
Contributor Author

@dgreene1 That seems in line with what I heard from folks I talked to.

As @bryk said, please do provide as much info as you can so we can try to address the issue. What actually happens? Does the dashboard app crash with an OOM? Is there some other kind of error happening?

@ianlewis
Contributor Author

BTW, I think setting ~200Mi as the limit for the dashboard is reasonable. If more is required, then users can increase it, but the current 50Mi may be too low.

For instance, fluentd gets 200Mi per node on GKE.

@bryk
Contributor

bryk commented Nov 30, 2016

BTW, I think setting ~200Mi as the limit for the dashboard is reasonable. If more is required, then users can increase it, but the current 50Mi may be too low.
For instance, fluentd gets 200Mi per node on GKE.

Do we need memory limits at all? Can we do only a memory reservation and no limit? Or make the limit something like 500 megs?

@ianlewis
Contributor Author

Do we need memory limits at all? Can we do only a memory reservation and no limit? Or make the limit something like 500 megs?

It's a balance between giving the dashboard enough memory (when the API calls get too big it will time out anyway) and requesting too many resources from the cluster.

The best thing may be to give it a lowish ~200Mi request and a high 1Gi limit (or no limit), but then we risk being unfriendly to other pods on the same node.

@bryk
Contributor

bryk commented Apr 20, 2017

@maciaszczykm Can we fix this by increasing the memory limits to O(hundreds) of megs? If you open the Dashboard on any large cluster, it crashes.

@maciaszczykm
Member

@bryk Sure, we can do it. Do you have any specific limit in mind?

@maciaszczykm maciaszczykm assigned maciaszczykm and rf232 and unassigned rf232 Apr 20, 2017
@bryk
Contributor

bryk commented Apr 20, 2017

100Mi requests and 300Mi limits to start with?

And update this in https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/dashboard/dashboard-controller.yaml
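
Roughly, that would mean a resources stanza along these lines in the dashboard container spec (a sketch of the values proposed above, not the merged change; CPU settings are omitted since only memory was discussed):

```yaml
resources:
  requests:
    memory: 100Mi   # proposed request
  limits:
    memory: 300Mi   # proposed limit
```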

@maciaszczykm
Member

The pull request is open in the core repo. Still, we have to fix the issues mentioned by @rf232 in #1431 (comment).

@rf232
Contributor

rf232 commented Apr 24, 2017 via email

@maciaszczykm
Member

The switch to protobuf is done (for the relevant pages; only the YAML editor uses JSON, but that is for single resources, so it's not worth the effort).

@rf232 Oh, I see now. I did not check that before.

Smaller lists would require us to refactor how we build up the pages, since right now we make all API requests in parallel; we would have to fetch the resource first, read its label selector, and then query the backend with that selector. This would require quite some work, I think.

Yes, I am aware of that. It is a good enhancement for the future, but right now we should focus on higher-priority issues, as this is a non-blocker IMO.

@maciaszczykm
Member

Let's track kubernetes/kubernetes#44712 from here.

@maciaszczykm maciaszczykm removed their assignment Apr 25, 2017
k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this issue May 1, 2017
Automatic merge from submit-queue (batch tested with PRs 43884, 44712, 45124, 43883)

Increase Dashboard memory limits

**What this PR does / why we need it**: Increases memory requests and limits for Dashboard.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes kubernetes/dashboard#1431

**Special notes for your reviewer**: Dashboard crashes on large clusters, this change should fix that problem.

**Release note**:

```release-note
Increase Dashboard's memory requests and limits
```