Errors for large numbers of objects #1431
Any specific stacktrace or something to debug? The UI should work fine.
Somewhat similar: on my test cluster with a deployed 1.4.0 dashboard and 10k pods, the dashboard container crashes every time I request a page. Haven't looked too much more into it, though.
I can reproduce on my env, will take a look at what really happens.
Interesting. Please reply here with what was/is wrong.
I did a short investigation, and the container crashes with an OOM (Out Of Memory): 2016-11-18T21:29:35.151956377Z container oom bfb8c28f5b70273f475a397bd8be3408f9dc256f33fbe3165940ba187b7a1253 (image=eu.gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.0, io.kubernetes.container.hash=bcabcc47, io.kubernetes.container.name=kubernetes-dashboard, io.kubernetes.container.ports=[{"containerPort":9090,"protocol":"TCP"}], io.kubernetes.container.restartCount=20, io.kubernetes.container.terminationMessagePath=/dev/termination-log, io.kubernetes.pod.name=kubernetes-dashboard-v1.4.0-z9pnm, io.kubernetes.pod.namespace=kube-system, io.kubernetes.pod.terminationGracePeriod=30, io.kubernetes.pod.uid=06b2dd99-a591-11e6-8b7d-42010a840052, name=k8s_kubernetes-dashboard.bcabcc47_kubernetes-dashboard-v1.4.0-z9pnm_kube-system_06b2dd99-a591-11e6-8b7d-42010a840052_b829007e)
Okay, apparently I had some limits set on my pod; after setting the limits a bit higher, I found out that for 10k pods the dashboard needs ~200MB of memory.
When meeting with some folks yesterday, they said they had problems even with 1000 pods. There are probably a number of issues related to memory, API timeouts, etc. for various API objects. I would leave this issue as pretty generic and solve each specific issue separately as we find them.
We're not enforcing any CPU/memory limits on the dashboard by default. They have to be applied externally, either by adjusting the yaml or creating …
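As an illustration of one way to apply memory constraints externally at the namespace level, a `LimitRange` could look like the sketch below. The object name and the values here are assumptions for illustration, not project defaults, and this is not necessarily what the comment above was referring to:

```yaml
# Illustrative LimitRange for kube-system; name and values are
# assumptions, not project defaults. Containers in the namespace that
# do not declare their own memory request/limit get these defaults.
apiVersion: v1
kind: LimitRange
metadata:
  name: dashboard-limits
  namespace: kube-system
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: 50Mi
      default:
        memory: 200Mi
```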
Can you get us a stacktrace or screenshot of some form? Or guide the folks to report bugs here? It'd help a lot.
I instructed them to create a bug, but they were just evaluating Kubernetes. They may or may not post an issue. I'll see if I can repro the issue at some point. This issue about paging in the API may be worth following:
Even if we don't apply CPU/memory limits ourselves, the hardware will impose them in the end. Perhaps we should find a way to be a bit more graceful there. But even if we don't crash with a high number of objects, we get really slow. Some findings so far about that:
Given that this is a bit larger than a small fix, I'll remove this issue from the 1.5 project but keep it open.
Just wanted to confirm that my team and I are seeing this issue while running as few as 315 pods.
That's sad, @dgreene1. Do you have any logs to confirm that these are OOMs? A short-term fix for this problem could be increasing the memory reservation for the UI. Can you try this?
@rf232 Yah. gRPC might be a long-term stretch goal, but we'll still require paging to get around the memory issues, and there doesn't seem to be a way to request pages of data from the API server at the moment. The issue kubernetes/kubernetes#2349 from the kubernetes repo addresses this, but it doesn't look like it's been seriously considered for implementation yet.
BTW, I think setting ~200Mi as the limit for the dashboard is reasonable. If more is required, then users can raise it, but the current 50Mi may be too low. For instance, fluentd gets 200Mi per node on GKE.
Do we need memory limits anyway? Can we do only a memory reservation and no limit? Or make the limit, like, 500 megs.
It's a balance between giving the dashboard enough memory (when the API responses get too big it will time out anyway) and requesting too many resources from the cluster. The best thing may be to give it a lowish ~200Mi request and a high 1Gi limit (or no limit), but we risk being unfriendly to other pods on the same node.
@maciaszczykm Can we fix this by increasing memory limits to O(hundreds) of megs? If you open Dashboard on any large cluster, it crashes.
@bryk Sure, we can do it. Do you have any specific limit in mind?
100Mi requests and 300Mi limits to start with? And update this in https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/dashboard/dashboard-controller.yaml
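The values proposed above would translate to a resources stanza along these lines in the dashboard container spec. This is a sketch showing only the memory fields; everything else in the manifest is assumed unchanged:

```yaml
# Sketch of the proposed values; only the resources stanza of the
# dashboard container is shown, the rest of the manifest is omitted.
resources:
  requests:
    memory: 100Mi
  limits:
    memory: 300Mi
```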
Pull request is on core. Still, we have to fix issues mentioned by @rf232 in #1431 (comment).
> Pull request is on core. Still, we have to fix issues mentioned by @rf232 in #1431 (comment).

- Switch to using protobuf is done (for the relevant pages; only the yaml editor uses json, but that is for single resources, so not worth the effort).
- Smaller lists would require us to refactor how we build up pages, since right now we do all requests to the API in parallel; we should get the resource first, then find the label selector, and do a get to the backend with that label selector. This would require quite some work, I think.
@rf232 Oh, I see now. I did not check it before.
Yes, I am aware of it. It is a good enhancement for the future, but right now we should focus on higher-priority issues, as this is a non-blocker IMO.
Let's track kubernetes/kubernetes#44712 from here.
Automatic merge from submit-queue (batch tested with PRs 43884, 44712, 45124, 43883)

Increase Dashboard memory limits

**What this PR does / why we need it**: Increases memory requests and limits for Dashboard.

**Which issue this PR fixes**: fixes kubernetes/dashboard#1431

**Special notes for your reviewer**: Dashboard crashes on large clusters; this change should fix that problem.

**Release note**:
```release-note
Increase Dashboard's memory requests and limits
```
I've seen reports of the dashboard throwing errors and otherwise not working for large numbers of objects.
Anecdotally, I have seen reports for large numbers of pods and large numbers of events. It may behoove us to test with large numbers of objects, see which are problematic, and start creating issues and fixing them.