
/service-discovery returns excessive amount of HTML for large K8s clusters #4134

Closed
beorn7 opened this Issue May 2, 2018 · 11 comments

beorn7 (Member) commented May 2, 2018

The handy /service-discovery endpoint also shows all dropped labels. However, SD often retrieves essentially all the metadata and then drops most of it for each job. In the case of K8s, imagine a decently sized cluster of 10k pods, each with 10 labels or annotations, each of which needs ~1KiB to display in HTML with its markup. A Prometheus server with 10 configured jobs would display all 100k labels/annotations in the K8s cluster for each of the 10 jobs, i.e. 1M labels/annotations, resulting in about 1GiB of data. That's just a broad estimate; I have just done a practical measurement and retrieved 1.4GiB of HTML from the /service-discovery endpoint. This is a drain on the Prometheus server and the network, and it usually kills the browser tab. Most K8s clusters might be smaller, but the ~10k-pod clusters at SoundCloud are not meant to be beyond the scalability limit of Prometheus, and a select few organizations run even larger clusters, which Prometheus should still be able to monitor.
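
Spelled out, the back-of-the-envelope arithmetic behind that estimate (same numbers as above, nothing measured):

```
   10,000 pods    × 10 labels/annotations   =   100,000 labels in the cluster
  100,000 labels  × 10 jobs                 = 1,000,000 label entries rendered
1,000,000 entries × ~1 KiB of HTML each     ≈ 1 GiB of HTML
```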

In short, the /service-discovery endpoint needs to be capped for such large clusters. Details need to be figured out, but it could be something like a message "displaying only 100 dropped labels. There are 999,900 more labels." with a pagination mechanism, or a download link for the raw label data (which is then explicitly requested by the user rather than accidentally triggered by visiting the Service Discovery status page).
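
To make the idea concrete, here is a minimal sketch of such a cap; all names, types, and the limit value are hypothetical, not Prometheus's actual internals:

```go
// Hypothetical sketch of capping the dropped targets shown per scrape config.
package main

import "fmt"

const maxDroppedTargets = 100 // illustrative limit, not an agreed value

// droppedTarget stands in for the real dropped-target structure.
type droppedTarget struct {
	DiscoveredLabels map[string]string
}

// capDropped returns the targets to render and how many were left out.
func capDropped(dropped []droppedTarget) (shown []droppedTarget, omitted int) {
	if len(dropped) <= maxDroppedTargets {
		return dropped, 0
	}
	return dropped[:maxDroppedTargets], len(dropped) - maxDroppedTargets
}

func main() {
	dropped := make([]droppedTarget, 1000000)
	shown, omitted := capDropped(dropped)
	fmt.Printf("displaying only %d dropped targets; %d more not shown\n",
		len(shown), omitted)
}
```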

IIRC @Conorbro implemented the endpoint and might have the best ideas how to proceed here.

@krasi-georgiev: as discussed.

brian-brazil (Member) commented May 3, 2018

Limiting it to, say, 1k dropped targets per scrape config sounds reasonable to me. This is similar to #2119.

We already have the HTTP API for getting this information, so for such large users we can probably point them there rather than adding a UI component.
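
As a concrete illustration, a minimal sketch of pulling the dropped-target data from that API instead of the HTML page; it assumes a server at localhost:9090 and only models the response fields it actually reads:

```go
// Minimal sketch: count dropped targets via GET /api/v1/targets
// instead of loading the /service-discovery HTML page.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// targetsResponse models only the part of the response this example uses.
type targetsResponse struct {
	Data struct {
		DroppedTargets []struct {
			DiscoveredLabels map[string]string `json:"discoveredLabels"`
		} `json:"droppedTargets"`
	} `json:"data"`
}

func main() {
	resp, err := http.Get("http://localhost:9090/api/v1/targets")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var tr targetsResponse
	if err := json.NewDecoder(resp.Body).Decode(&tr); err != nil {
		panic(err)
	}
	fmt.Printf("dropped targets: %d\n", len(tr.Data.DroppedTargets))
}
```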

krasi-georgiev (Member) commented May 3, 2018

I was thinking of some pagination and a search box.

brian-brazil (Member) commented May 3, 2018

I'd prefer not to add the complication of pagination or search functions to the APIs.

discordianfish (Member) commented May 31, 2018

On Kubernetes you often drop most targets, and the use case for visiting the page is usually that you dropped something you didn't want to drop, so cutting off after 1k dropped targets doesn't help in this use case at all.

I think it needs pagination, or we should drop this feature altogether and instead provide a text endpoint to get this information.

brian-brazil (Member) commented May 31, 2018

There's always the HTTP API to get the full answer. I don't think we should remove a feature just because it doesn't work 100% for large users.

brian-brazil (Member) commented May 31, 2018

I'd also like to avoid pagination, as that's a can of worms both implementation- and performance-wise. I think the goal here should be to avoid killing browsers, and we can then link to the HTTP API, which I believe is along the lines of what @beorn7 was suggesting.

krasi-georgiev (Member) commented May 31, 2018

1k + a search box sounds like the most user-friendly solution to me, but we can try just showing the first 1k and revisit if there are user requests.

beorn7 (Member, Author) commented May 31, 2018

I also got the impression it's quite a drain for the Prometheus server to shovel out gigabytes of data to an increasingly unresponsive browser.

Limiting it would be a quick fix we should do anyway, ideally in 2.2.2 (still waiting for a sanely usable Prom2; I'm afraid that if we release 2.3.0 instead of 2.2.2, we might hit yet another bunch of interesting bugs with interesting new features). Once the bleeding has stopped, we can calmly contemplate whether we want something fancier like a search filter.

simonpasquier (Member) commented Jun 12, 2018

Should this issue be closed now?

krasi-georgiev (Member) commented Jun 12, 2018

yep

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
