Our prometheus data stream has become really slow #761
Same observation here: the node-related dashboard seems to load reasonably quickly, but the pod dashboard takes very long. https://grafana.mybinder.org/d/GYEYQm7ik/components-resource-metrics?refresh=1m&orgId=1 is super quick. I think some of our queries are "slow" (maybe scale as N^2 or some such) and we only started noticing now because our dataset is big enough. The best idea I have for finding the offending query/queries is to remove half the graphs from a dashboard, see if that changes anything, and keep iterating till we find them :-/
oooh scaling inelegantly is a good point and one that I hadn't considered before... I bet that could be it. So what's a way to resolve that issue? I guess we could look back in time more shallowly, or have coarser resolution in time.
I just tried to get the "launch time percentiles" plot to load by itself, scaled to only the last 3 hours instead of 12, and it hadn't loaded after like 10 minutes. I'm wondering if this is something more than just scaling?
I was thinking of splitting the panels up more and more to identify which graphs are slow, and then thinking very hard about why they are slow. My hunch is that the slowdown-from-scaling would scale with the number of pods "in the system", not with the look-back window. But this is just a guess.
This is most likely from an explosion in the total number of time series we are tracking. A series is the combination of a metric name and a unique combination of labels. Given the large number of pods we have, we have a ton of label combinations :) I think step 1 is to figure out which metrics we are collecting but just don't use, and turn them off. This should buy us some time. Prometheus recommends scaling by running multiple instances. I'd say for us we can run two: one for short-term metrics that we keep for a week or so, and another for longer-term metrics. We can also shorten our current retention period to one month, explicitly archiving the metrics we wanna keep long term.
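To see which metrics contribute the most series, a standard PromQL cardinality query (a sketch, not something we currently run; it's expensive itself, so best done off-peak) is:

```promql
# Top 10 metric names by number of time series (cardinality).
topk(10, count by (__name__) ({__name__=~".+"}))
```

Recent Prometheus versions also export the total head-series count as `prometheus_tsdb_head_series`, which is a cheap way to watch cardinality over time.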
jupyterhub/binderhub#708 ought to slim down our metrics so that the ones we use most of the time won't have per-repo entries (since we don't use the repo there), and a separate metric tracks repos. We would need to update "most popular" charts to use the new counters, but the others should hopefully just get way faster. |
I just deployed jupyterhub/binderhub#708, which might help with this. It's unclear to me whether this will improve as soon as the inefficient metrics fall off the selected window, or if we need to purge the Prometheus data for it to happen. But if it works, the build/launch status charts should be much quicker in a couple of days. If it's still super slow, then we might want to consider exporting and resetting our Prometheus data.
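If we do decide to purge the stale high-cardinality series rather than wait for them to age out, Prometheus 2.x has an admin API for it. A sketch (the host is a placeholder, the matcher is just an example, and the admin API flag must be enabled first):

```shell
# Only works if Prometheus was started with --web.enable-admin-api.
# Delete all series for one (example) metric:
curl -X POST \
  'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=binderhub_launch_time_seconds_bucket'

# Deletions only create tombstones; reclaim the disk space afterwards:
curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'
```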
I've noticed lately that our grafana boards have gotten slower and slower. I just tried to load the launch percentages over the last 12 hours as a single page (https://grafana.mybinder.org/d/3SpLQinmk/1-overview?refresh=1m&panelId=28&fullscreen&orgId=1) and it's been at least 2 minutes and the graph still hasn't shown up.
I think this is actually not grafana's fault, but prometheus'. I tried to run a single query for one percentile of launch times:
```promql
histogram_quantile(0.1, sum(rate(binderhub_launch_time_seconds_bucket[5m])) without (instance, repo, provider)) > 0
```
and it took 14 seconds to return. AFAICT Grafana is basically just running this query a bunch of times to build a dashboard, so that'd explain the slowness.
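If the expensive part is the inner `sum(rate(...))` over many series, a Prometheus recording rule could precompute it on the rule-evaluation interval, so each dashboard refresh only evaluates the cheap outer `histogram_quantile` over a handful of precomputed series. A sketch (the group and rule names here are made up):

```yaml
groups:
  - name: binderhub_launch_time
    rules:
      # Precompute the per-bucket rate sum once per evaluation interval,
      # instead of on every dashboard refresh:
      - record: binderhub_launch_time_seconds_bucket:rate5m
        expr: sum(rate(binderhub_launch_time_seconds_bucket[5m])) without (instance, repo, provider)
```

The dashboard query would then become `histogram_quantile(0.1, binderhub_launch_time_seconds_bucket:rate5m) > 0`.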
Anybody have an idea why this would be?