/service-discovery returns excessive amount of HTML for large K8s clusters #4134
Comments
beorn7 added the kind/bug, priority/P1, and component/ui labels on May 2, 2018
Limiting it to, say, 1k dropped targets per scrape config sounds reasonable to me. This is similar to #2119. We already have the HTTP API for getting this information, so for such large users we can probably point them there rather than adding a UI component.
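As a pointer for anyone hitting this with a large cluster: a minimal sketch of pulling the dropped targets from the HTTP API instead of rendering the status page, assuming a Prometheus server on localhost:9090 whose /api/v1/targets response includes a droppedTargets list with per-target discoveredLabels (only the fields needed here are decoded; adjust the URL and fields for your setup):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// targetsResponse decodes only the part of the /api/v1/targets payload
// that this sketch needs.
type targetsResponse struct {
	Data struct {
		DroppedTargets []struct {
			DiscoveredLabels map[string]string `json:"discoveredLabels"`
		} `json:"droppedTargets"`
	} `json:"data"`
}

func main() {
	resp, err := http.Get("http://localhost:9090/api/v1/targets")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var tr targetsResponse
	if err := json.NewDecoder(resp.Body).Decode(&tr); err != nil {
		log.Fatal(err)
	}

	// Report a count and the scrape job of each dropped target instead of
	// rendering every label set as HTML.
	fmt.Printf("%d dropped targets\n", len(tr.Data.DroppedTargets))
	for _, t := range tr.Data.DroppedTargets {
		fmt.Println(t.DiscoveredLabels["job"])
	}
}
```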
I was thinking of some pagination and a search box.
I'd prefer not to add the complication of pagination to the APIs, nor search functions.
On Kubernetes you often drop most targets, and the use case for visiting the page is usually that you dropped something you didn't want to drop, so cutting off after 1k dropped targets doesn't help in this use case at all. I think it needs pagination, or we should drop this feature altogether and rather provide a text endpoint to get this.
There's always the HTTP API to get the full answer. I don't think we should remove a feature just because it doesn't work 100% for large users.
I'd also like to avoid pagination, as that's a can of worms both implementation- and performance-wise. I think the goal here should be to avoid killing browsers, and we can link them to the HTTP API, which I believe is along the lines of what @beorn7 was suggesting.
1k + a search box sounds like the most user-friendly solution to me, but we can try just showing the first 1k and revisit if there are user requests.
I also got the impression it's quite a drain for the Prometheus server to shovel out gigabytes of data to an increasingly unresponsive browser. Limiting it would be a quick fix we should do anyway, ideally in 2.2.2 (still waiting for a sanely usable Prom2; I'm afraid that if we release 2.3.0 instead of 2.2.2, we might hit yet another bunch of interesting bugs with interesting new features). Once the bleeding has stopped, we can calmly contemplate whether we want something fancier like a search filter.
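To make the proposal above concrete, here is a rough, illustrative sketch of capping how many dropped targets get handed to the status-page template while telling the user how many were omitted. The names (droppedTargetLimit, Target, capDropped) are invented for this example and are not the change that was later merged in #4212:

```go
package main

import "fmt"

// droppedTargetLimit caps how many dropped targets the status page renders;
// the 1k figure is the value floated in this thread.
const droppedTargetLimit = 1000

// Target stands in for the scrape target type the web handler would render.
type Target struct {
	DiscoveredLabels map[string]string
}

// capDropped returns at most droppedTargetLimit targets plus the number of
// omitted ones, so the template can say "N more dropped targets not shown".
func capDropped(dropped []*Target) (shown []*Target, omitted int) {
	if len(dropped) <= droppedTargetLimit {
		return dropped, 0
	}
	return dropped[:droppedTargetLimit], len(dropped) - droppedTargetLimit
}

func main() {
	dropped := make([]*Target, 2500)
	shown, omitted := capDropped(dropped)
	fmt.Printf("showing %d dropped targets, %d omitted\n", len(shown), omitted)
}
```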
simonpasquier referenced this issue on Jun 4, 2018: web: limit the number of dropped targets #4212 (merged)
This issue should be closed now?
krasi-georgiev closed this on Jun 12, 2018
yep
lock bot commented on Mar 22, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
beorn7 commented on May 2, 2018 (edited)
The handy /service-discovery endpoint also shows all dropped labels. However, SD often retrieves essentially all the metadata and then drops most of it for each job. In the case of K8s, imagine a decently sized cluster of 10k pods, each with 10 labels or annotations, each needing ~1 KiB to display in HTML. A Prometheus server with 10 configured jobs would display all 100k labels/annotations in the K8s cluster for each of the 10 jobs, i.e. 1M labels/annotations, resulting in about 1 GiB of data. That's just a broad estimate; I have just done a practical measurement and retrieved 1.4 GiB of HTML code from the /service-discovery endpoint. This is a drain on the Prometheus server and the network, and it usually kills the browser tab. Most K8s clusters might be smaller, but the ~10k-pod clusters at SoundCloud are not meant to be beyond the scalability limit of Prometheus, and a select few organizations run much larger clusters, which Prometheus should still be able to monitor.

In short, the /service-discovery endpoint needs to be capped for such large clusters. The details need to be figured out, but it could be something like a message "displaying only 100 dropped labels. There are 999,900 more labels." combined with a pagination mechanism or a download link for the raw label data (which is then explicitly requested by the user rather than accidentally triggered by visiting the Service Discovery status page).

IIRC @Conorbro implemented the endpoint and might have the best ideas how to proceed here.
@krasi-georgiev: as discussed.
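For reference, a quick back-of-the-envelope check of the estimate in the description, using its example figures (10k pods, 10 labels/annotations per pod, 10 jobs, ~1 KiB of HTML per rendered entry); these are illustrative numbers, not measurements:

```go
package main

import "fmt"

func main() {
	// Example figures from the issue description, not measurements.
	pods := 10000         // pods in the cluster
	labelsPerPod := 10    // labels/annotations per pod
	jobs := 10            // scrape configs that each rediscover every pod
	bytesPerEntry := 1024 // ~1 KiB of HTML per rendered label/annotation

	entries := pods * labelsPerPod * jobs // 1,000,000 rendered entries
	bytes := entries * bytesPerEntry
	fmt.Printf("%d entries, ~%.2f GiB of HTML\n", entries, float64(bytes)/(1<<30))
	// Prints: 1000000 entries, ~0.95 GiB of HTML
}
```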