Errors for large numbers of objects #1431

Closed
ianlewis opened this issue Nov 17, 2016 · 26 comments · Fixed by kubernetes/kubernetes#44712
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@ianlewis
Contributor

I've seen reports of the dashboard throwing errors and otherwise not working for large numbers of objects.

Anecdotally, the reports I have seen involve large numbers of pods and large numbers of events. It would behoove us to test with large numbers of objects, see which are problematic, and start filing and fixing issues.

@bryk
Contributor

bryk commented Nov 17, 2016

Do you have a specific stack trace or anything else to debug with? The UI should work fine.

@rf232
Contributor

rf232 commented Nov 17, 2016

Somewhat similar: on my test cluster with a deployed 1.4.0 dashboard and 10k pods, the dashboard container crashes every time I request a page. I haven't looked into it much further, though.

@ianlewis
Contributor Author

@bryk I haven't tested it myself, but I've heard it a few times (once on Slack and a few times at KubeCon), so I thought it worth investigating.

What @rf232 said is exactly what I'm talking about, but we need more info.

@rf232 rf232 self-assigned this Nov 18, 2016
@rf232
Contributor

rf232 commented Nov 18, 2016

I can reproduce this in my environment; I'll take a look at what's really happening.

@bryk
Contributor

bryk commented Nov 18, 2016

Interesting. Please reply here with what was/is wrong.

@rf232
Contributor

rf232 commented Nov 18, 2016

I did a short investigation; the container is being killed with an OOM (Out of Memory) error:

2016-11-18T21:29:35.151956377Z container oom bfb8c28f5b70273f475a397bd8be3408f9dc256f33fbe3165940ba187b7a1253 (image=eu.gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.0, io.kubernetes.container.hash=bcabcc47, io.kubernetes.container.name=kubernetes-dashboard, io.kubernetes.container.ports=[{"containerPort":9090,"protocol":"TCP"}], io.kubernetes.container.restartCount=20, io.kubernetes.container.terminationMessagePath=/dev/termination-log, io.kubernetes.pod.name=kubernetes-dashboard-v1.4.0-z9pnm, io.kubernetes.pod.namespace=kube-system, io.kubernetes.pod.terminationGracePeriod=30, io.kubernetes.pod.uid=06b2dd99-a591-11e6-8b7d-42010a840052, name=k8s_kubernetes-dashboard.bcabcc47_kubernetes-dashboard-v1.4.0-z9pnm_kube-system_06b2dd99-a591-11e6-8b7d-42010a840052_b829007e)
2016-11-18T21:29:37.251943478Z container die bfb8c28f5b70273f475a397bd8be3408f9dc256f33fbe3165940ba187b7a1253 (exitCode=137, image=eu.gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.0, io.kubernetes.container.hash=bcabcc47, io.kubernetes.container.name=kubernetes-dashboard, io.kubernetes.container.ports=[{"containerPort":9090,"protocol":"TCP"}], io.kubernetes.container.restartCount=20, io.kubernetes.container.terminationMessagePath=/dev/termination-log, io.kubernetes.pod.name=kubernetes-dashboard-v1.4.0-z9pnm, io.kubernetes.pod.namespace=kube-system, io.kubernetes.pod.terminationGracePeriod=30, io.kubernetes.pod.uid=06b2dd99-a591-11e6-8b7d-42010a840052, name=k8s_kubernetes-dashboard.bcabcc47_kubernetes-dashboard-v1.4.0-z9pnm_kube-system_06b2dd99-a591-11e6-8b7d-42010a840052_b829007e)
2016-11-18T21:29:37.329794337Z container destroy f013e7a99b7b9513c97b7bb7e5cd9f15d297155e242754432c7c52687f8b7375 (image=eu.gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.0, io.kubernetes.container.hash=bcabcc47, io.kubernetes.container.name=kubernetes-dashboard, io.kubernetes.container.ports=[{"containerPort":9090,"protocol":"TCP"}], io.kubernetes.container.restartCount=19, io.kubernetes.container.terminationMessagePath=/dev/termination-log, io.kubernetes.pod.name=kubernetes-dashboard-v1.4.0-z9pnm, io.kubernetes.pod.namespace=kube-system, io.kubernetes.pod.terminationGracePeriod=30, io.kubernetes.pod.uid=06b2dd99-a591-11e6-8b7d-42010a840052, name=k8s_kubernetes-dashboard.bcabcc47_kubernetes-dashboard-v1.4.0-z9pnm_kube-system_06b2dd99-a591-11e6-8b7d-42010a840052_d7d1a95d)
2016-11-18T21:29:50.718773431Z container create f0149ffeab479c2620fe941e7cfa625bd8e6fe9bb12dd92cef5bdcc849310a26 (image=eu.gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.0, io.kubernetes.container.hash=bcabcc47, io.kubernetes.container.name=kubernetes-dashboard, io.kubernetes.container.ports=[{"containerPort":9090,"protocol":"TCP"}], io.kubernetes.container.restartCount=21, io.kubernetes.container.terminationMessagePath=/dev/termination-log, io.kubernetes.pod.name=kubernetes-dashboard-v1.4.0-z9pnm, io.kubernetes.pod.namespace=kube-system, io.kubernetes.pod.terminationGracePeriod=30, io.kubernetes.pod.uid=06b2dd99-a591-11e6-8b7d-42010a840052, name=k8s_kubernetes-dashboard.bcabcc47_kubernetes-dashboard-v1.4.0-z9pnm_kube-system_06b2dd99-a591-11e6-8b7d-42010a840052_4156d221)
2016-11-18T21:29:50.779910144Z container start f0149ffeab479c2620fe941e7cfa625bd8e6fe9bb12dd92cef5bdcc849310a26 (image=eu.gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.0, io.kubernetes.container.hash=bcabcc47, io.kubernetes.container.name=kubernetes-dashboard, io.kubernetes.container.ports=[{"containerPort":9090,"protocol":"TCP"}], io.kubernetes.container.restartCount=21, io.kubernetes.container.terminationMessagePath=/dev/termination-log, io.kubernetes.pod.name=kubernetes-dashboard-v1.4.0-z9pnm, io.kubernetes.pod.namespace=kube-system, io.kubernetes.pod.terminationGracePeriod=30, io.kubernetes.pod.uid=06b2dd99-a591-11e6-8b7d-42010a840052, name=k8s_kubernetes-dashboard.bcabcc47_kubernetes-dashboard-v1.4.0-z9pnm_kube-system_06b2dd99-a591-11e6-8b7d-42010a840052_4156d221)
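
For anyone reproducing this: the exitCode=137 above is the SIGKILL delivered on an OOM kill. One quick way to confirm it from the Kubernetes side (a sketch; the pod name is taken from the events above and will differ per cluster) is to read the container's last termination state:

```sh
# Print the reason for the container's last termination; expect "OOMKilled".
kubectl -n kube-system get pod kubernetes-dashboard-v1.4.0-z9pnm \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```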

@rf232
Contributor

rf232 commented Nov 18, 2016

OK, apparently I had some limits set on my pod. After setting the limits a bit higher, I found that for 10k pods the dashboard needs ~200MB of memory.

@ianlewis
Contributor Author

When I met with some folks yesterday, they said they had problems even with 1000 pods. There are probably a number of issues related to memory, API timeouts, etc., for various API objects. I would keep this issue fairly generic and solve each specific issue separately as we find them.

@floreks
Member

floreks commented Nov 19, 2016

We're not enforcing any CPU/memory limits on the dashboard by default. They have to be applied externally, either by adjusting the YAML or by creating a LimitRange. We should add a note to the documentation saying that, for a high number of resources in the cluster, the memory limit should be raised (if any is applied at all). 200Mi-2Gi should be enough.
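
For illustration, a minimal sketch of applying such a limit externally with a LimitRange (the object name and the 200Mi/1Gi values are placeholders within the range mentioned above, not project defaults):

```yaml
# Sketch: default memory request/limit for containers in kube-system
# that don't declare their own. Values are illustrative only.
apiVersion: v1
kind: LimitRange
metadata:
  name: dashboard-memory-defaults
  namespace: kube-system
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: 200Mi   # default request applied when none is set
      default:
        memory: 1Gi     # default limit applied when none is set
```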

@bryk
Contributor

bryk commented Nov 21, 2016

When I met with some folks yesterday, they said they had problems even with 1000 pods. There are probably a number of issues related to memory, API timeouts, etc., for various API objects. I would keep this issue fairly generic and solve each specific issue separately as we find them.

Can you get us a stack trace or a screenshot of some form? Or guide the folks to report bugs here? It'd help a lot.

@ianlewis
Contributor Author

I instructed them to file a bug, but they were just evaluating Kubernetes, so they may or may not post an issue.

I'll see if I can repro the issue at some point.

This issue about paging in the API may be worth following:
kubernetes/kubernetes#2349

@rf232
Contributor

rf232 commented Nov 24, 2016

Even if we don't apply CPU/memory limits ourselves, the hardware will effectively impose them in the end. Perhaps we should find a way to handle that more gracefully.

But even when we don't crash, with a high number of objects we get really slow. Some findings so far:

  • JSON serialization takes a significant amount of the time.
    Since k8s supports the protobuf serialization format we could switch to that, but the straightforward approach of always requesting protobufs breaks the view/edit YAML workflow.
  • We request far too large lists from the API server.
    If I have an RC with 10k pods and go to the job list or a job detail page, we still request the list of all pods in the same namespace, which makes every page slow (see the sketch after this comment).

Given that this is larger than a small fix, I'll remove this issue from the 1.5 project but keep it open.
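
A minimal client-go sketch of both ideas (not the dashboard's actual code): asking the API server for protobuf instead of JSON, and scoping the pod list with a label selector. The "app=my-rc" selector and the "default" namespace are placeholders, and the exact List signature depends on the client-go version:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster config, as a backend running inside the cluster would use.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	// Request protobuf from the API server instead of JSON; this addresses
	// the serialization cost noted above, but is not suitable for the
	// view/edit YAML workflow, which needs the JSON/YAML form.
	config.ContentType = "application/vnd.kubernetes.protobuf"

	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List only the pods matching a label selector instead of every pod in
	// the namespace, so a 10k-pod namespace doesn't slow down every page.
	pods, err := clientset.CoreV1().Pods("default").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "app=my-rc"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("matched %d pods\n", len(pods.Items))
}
```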

@dgreene1

Just wanted to confirm that my team and I saw this issue when running as few as 315 pods.

@bryk
Contributor

bryk commented Nov 30, 2016

That's sad, @dgreene1.

Do you have any logs to confirm that this is an OOM?

A short-term fix for this problem could be to increase the memory reservation for the UI. Can you try that?

@bryk bryk added the kind/bug and priority/P0 labels and removed the priority/P1 label Nov 30, 2016
@ianlewis
Contributor Author

@rf232 Yeah. gRPC might be a long-term stretch goal, but we'll still need paging to get around the memory issues, and there doesn't seem to be a way to request pages of data from the API server at the moment. kubernetes/kubernetes#2349 in the kubernetes repo addresses this, but it doesn't look like it's been seriously considered for implementation yet.

@ianlewis
Contributor Author

@dgreene1 That seems in line with what I heard from folks I talked to.

As @bryk said, please do provide as much info as you can so we can try to address the issue. What actually happens? Does the dashboard app crash with an OOM? Is there some other kind of error happening?

@ianlewis
Contributor Author

BTW, I think setting ~200Mi as the limit for the dashboard is reasonable. If more is required, then users can increase it, but the current 50Mi may be too low.

For instance, fluentd gets 200Mi per node on GKE.

@bryk
Contributor

bryk commented Nov 30, 2016

BTW, I think setting ~200Mi as the limit for the dashboard is reasonable. If more is required, then users can increase it, but the current 50Mi may be too low.
For instance, fluentd gets 200Mi per node on GKE.

Do we need memory limits at all? Can we do only a memory reservation and no limit? Or make the limit something like 500 megs?

@ianlewis
Contributor Author

Do we need memory limits at all? Can we do only a memory reservation and no limit? Or make the limit something like 500 megs?

It's a balance between giving the dashboard enough memory (when the API calls get too big it will time out anyway) and requesting too many resources from the cluster.

The best thing may be to give it a lowish ~200Mi request and a high 1Gi limit (or no limit), but then we risk being unfriendly to other pods on the same node.

@bryk
Contributor

bryk commented Apr 20, 2017

@maciaszczykm Can we fix this by increasing the memory limits to O(hundreds) of megs? If you open the Dashboard on any large cluster, it crashes.

@maciaszczykm
Member

@bryk Sure, we can do it. Do you have any specific limit in mind?

@maciaszczykm maciaszczykm assigned maciaszczykm and rf232 and unassigned rf232 Apr 20, 2017
@bryk
Contributor

bryk commented Apr 20, 2017

100Mi requests and 300Mi limits to start with?

And update this in https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/dashboard/dashboard-controller.yaml
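
Roughly, that would mean a resources stanza along these lines in the dashboard container spec (a sketch of the values proposed above, not the merged change; CPU settings are omitted since only memory was discussed):

```yaml
resources:
  requests:
    memory: 100Mi   # proposed request
  limits:
    memory: 300Mi   # proposed limit
```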

@maciaszczykm
Member

The pull request is open in the core repo. Still, we have to fix the issues mentioned by @rf232 in #1431 (comment).

@rf232
Contributor

rf232 commented Apr 24, 2017 via email

@maciaszczykm
Member

The switch to protobuf is done (for the relevant pages; only the YAML editor uses JSON, but that is for single resources, so it's not worth the effort).

@rf232 Oh, I see now. I did not check that before.

Smaller lists would require us to refactor how we build up the pages, since right now we make all API requests in parallel; we would have to fetch the resource first, read its label selector, and then query the backend with that selector. This would require quite some work, I think.

Yes, I am aware of that. It is a good enhancement for the future, but right now we should focus on higher-priority issues, as this is a non-blocker IMO.

@maciaszczykm
Member

Let's track kubernetes/kubernetes#44712 from here.

@maciaszczykm maciaszczykm removed their assignment Apr 25, 2017
k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this issue May 1, 2017
Automatic merge from submit-queue (batch tested with PRs 43884, 44712, 45124, 43883)

Increase Dashboard memory limits

**What this PR does / why we need it**: Increases memory requests and limits for Dashboard.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes kubernetes/dashboard#1431

**Special notes for your reviewer**: Dashboard crashes on large clusters, this change should fix that problem.

**Release note**:

```release-note
Increase Dashboard's memory requests and limits
```