Our prometheus data stream has become really slow #761

Open
choldgraf opened this issue Oct 5, 2018 · 7 comments

Comments

@choldgraf
Member

I've noticed lately that our grafana boards have gotten slower and slower. I just tried to load the launch percentages over the last 12 hours as a single page (https://grafana.mybinder.org/d/3SpLQinmk/1-overview?refresh=1m&panelId=28&fullscreen&orgId=1) and it's been at least 2 minutes and the graph still hasn't shown up.

I think this is actually not grafana's fault, but prometheus'. I tried to run a single query for one percentile of launch times:

histogram_quantile(0.1, sum(rate(binderhub_launch_time_seconds_bucket[5m])) without (instance, repo, provider)) > 0

and it took 14 seconds to return. AFAICT Grafana is basically just running this query a bunch of times to build a dashboard, so that'd explain the slowness.

Anybody have an idea why this would be?
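
One standard Prometheus remedy for a dashboard that re-runs an expensive aggregation like this is a recording rule that precomputes the inner sum on the server's evaluation interval, so the panels only have to query the cheap, precomputed series. A minimal sketch, assuming a Prometheus 2.x rules file loaded via rule_files (the rule name below is made up for illustration):

groups:
  - name: binderhub_launch_time
    rules:
      # Precompute the aggregated per-bucket rate once per evaluation interval;
      # dashboards then run histogram_quantile() over this series instead of
      # re-aggregating the raw, high-cardinality buckets on every refresh.
      - record: binderhub:launch_time_seconds_bucket:sum_rate5m
        expr: sum(rate(binderhub_launch_time_seconds_bucket[5m])) without (instance, repo, provider)

With that in place, the panel query above would become histogram_quantile(0.1, binderhub:launch_time_seconds_bucket:sum_rate5m) > 0.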

@betatim
Member

betatim commented Oct 6, 2018

Same observation here: the node-related dashboard seems to load reasonably quickly, but the pod dashboard takes a very long time.

https://grafana.mybinder.org/d/GYEYQm7ik/components-resource-metrics?refresh=1m&orgId=1 is super quick.

I think some of our queries are "slow" (maybe they scale as N^2 or some such) and we only started noticing now because our dataset is big enough. The best idea I have for finding the offending query/queries is to remove half the graphs from a dashboard, see if that changes anything, and keep iterating till we find them :-/

@choldgraf
Member Author

oooh, scaling inelegantly is a good point and one I hadn't considered before... I bet that could be it. So what's a way to resolve that? I guess we could look back over a shallower time window, or use a coarser resolution in time.

@choldgraf
Member Author

I just tried to get the "launch time percentiles" plot to load by itself, scaled to only the last 3 hours instead of 12, and it still hadn't loaded after something like 10 minutes. I'm wondering if this is something more than just a scaling issue?

@betatim
Member

betatim commented Oct 8, 2018

I was thinking of splitting the panels up more and more to identify which graphs are slow, and then thinking very hard about why they are slow.

My hunch is that the slowdown-from-scaling would scale with the number of pods "in the system", not with the look-back window. But this is just a guess.

@yuvipanda
Contributor

This is most likely from an explosion of the total number of metrics we are tracking. A metric is a combination of a metric name and a unique combination of labels. Given the large number of pods we have, we have a ton of label combinations :)
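
One way to sanity-check that is to ask Prometheus itself which metric names carry the most time series. These are stock PromQL queries (nothing mybinder-specific) that can be run in the Prometheus expression browser:

# ten metric names with the most time series
topk(10, count by (__name__)({__name__=~".+"}))

# series count for the launch-time histogram alone
count(binderhub_launch_time_seconds_bucket)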

I think step 1 is to figure out which metrics we are collecting that we just don't use, and turn them off. This should buy us some time.

Prometheus recommends scaling by running multiple instances. For us I'd say we can run two: one for short-term metrics that we keep for a week or so, and another for longer-term metrics. We can also cut our current retention period down to one month, explicitly archiving the metrics we want to keep long term.
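
For the "turn them off" step, Prometheus can drop unwanted series at scrape time with metric_relabel_configs, and the retention window is controlled by the --storage.tsdb.retention flag in Prometheus 2.x. A sketch against a hypothetical scrape job, with a made-up regex standing in for whatever metrics we decide we don't use:

scrape_configs:
  - job_name: kubernetes-pods   # hypothetical job name
    metric_relabel_configs:
      # Applied after the scrape but before storage: anything matching the
      # regex is dropped and never becomes a time series.
      - source_labels: [__name__]
        regex: 'container_network_.*|some_unused_metric_total'
        action: drop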

@minrk
Member

minrk commented Nov 2, 2018

jupyterhub/binderhub#708 ought to slim down our metrics so that the ones we use most of the time won't have per-repo entries (since we don't use the repo label there), while a separate metric tracks repos. We would need to update the "most popular" charts to use the new counters, but the others should hopefully just get way faster.

@minrk
Member

minrk commented Nov 6, 2018

I just deployed jupyterhub/binderhub#708, which might help with this. It's unclear to me whether this will improve as soon as the inefficient metrics fall off the selected window, or if we need to purge the prometheus data in order for it to happen.

But if it works, the build/launch status charts should be much quicker in a couple of days. If it's still super slow, then we might want to consider exporting and resetting our prometheus data.
