Query performance for large time series. #4391

Closed
mrsiano opened this Issue Jul 16, 2018 · 6 comments

mrsiano commented Jul 16, 2018

Following up on https://groups.google.com/forum/#!search/promethues/prometheus-users/kZ2DvZnYHUA/YDd81RhtBAAJ

We found that the default resolution (the step size, in seconds) significantly affects query execution time, especially on large-scale setups.

We can change the default graph resolution to 15 or even 30 seconds so that large result sets do not hit the gateway timeout.

The larger the step size, the faster the results come back. Most of the impact is in innerEvalTime, as the following timings show:

step 15:
2018-07-16 17:44:50,240 - INFO - duration: 1.862589-":{"timings":{"evalTotalTime":1.778064077,"resultSortTime":0.007469064,"queryPreparationTime":0.349853693,"innerEvalTime":1.420712212,"execQueueTime":0.000001542,"execTotalTime":1.778074411}}}}
2018-07-16 17:44:52,327 - INFO - duration: 1.882672-":{"timings":{"evalTotalTime":1.805533145,"resultSortTime":0.007118621,"queryPreparationTime":0.335963486,"innerEvalTime":1.462436464,"execQueueTime":0.000001463,"execTotalTime":1.805540575}}}} 


step 5:
2018-07-16 17:47:00,519 - INFO - duration: 4.037879-":{"timings":{"evalTotalTime":3.8893069000000002,"resultSortTime":0.008419253,"queryPreparationTime":0.361821925,"innerEvalTime":3.519047432,"execQueueTime":0.000001234,"execTotalTime":3.889314616}}}}
2018-07-16 17:47:05,371 - INFO - duration: 4.309454-":{"timings":{"evalTotalTime":4.141177569,"resultSortTime":0.008191162,"queryPreparationTime":0.34855919,"innerEvalTime":3.784409001,"execQueueTime":0.000001366,"execTotalTime":4.141185931}}}}

With this issue we want to get reliable query performance at high (few-second) resolution, especially at large scale.
Currently, changing the resolution is a workaround rather than an optimal solution.
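For readers who want to reproduce the comparison, the step can be set explicitly through the `step` parameter of the `/api/v1/query_range` HTTP endpoint. The following is a minimal sketch, not the script used for the timings above: the base URL, the PromQL expression, and the timestamps are hypothetical placeholders.

```python
from urllib.parse import urlencode


def range_query_url(base_url, query, start, end, step_seconds):
    """Build a /api/v1/query_range URL with an explicit step (in seconds)."""
    params = urlencode({
        "query": query,
        "start": start,
        "end": end,
        "step": step_seconds,
    })
    return f"{base_url}/api/v1/query_range?{params}"


# Hypothetical values, for illustration only.
BASE = "http://localhost:9090"
QUERY = "sum(rate(http_requests_total[5m]))"
START, END = 1531756800, 1531760400  # a one-hour window, Unix seconds

url_5s = range_query_url(BASE, QUERY, START, END, 5)
url_15s = range_query_url(BASE, QUERY, START, END, 15)
print(url_5s)
print(url_15s)
```

Issuing both URLs against the same server and comparing response times would give the kind of step 5 versus step 15 comparison shown in the logs.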

  • Environment:
    Running Prometheus in Docker on a fairly large host with 40 cores and no resource limits

  • System information:
    Linux 3.10.0-907.el7.x86_64 x86_64

  • Prometheus version:
    v2.3.2

Member

brian-brazil commented Jul 16, 2018

Can you clarify exactly what you are reporting here?
You requested 3 times the amount of work to be done, which took less than 3 times longer. I don't see anything wrong here.
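The arithmetic behind this can be checked directly against the innerEvalTime values posted in the issue. A short sketch (the helper names are illustrative, and two samples per step size are too few for a rigorous benchmark):

```python
# innerEvalTime values in seconds, copied from the log lines in the issue.
inner_eval = {
    15: [1.420712212, 1.462436464],  # step 15
    5: [3.519047432, 3.784409001],   # step 5
}


def mean(values):
    return sum(values) / len(values)


def work_ratio(coarse_step, fine_step):
    """How many times more evaluation steps the finer step requests."""
    return coarse_step / fine_step


def time_ratio(timings, coarse_step, fine_step):
    """How many times longer the finer step took, on average."""
    return mean(timings[fine_step]) / mean(timings[coarse_step])


steps_ratio = work_ratio(15, 5)
eval_ratio = time_ratio(inner_eval, 15, 5)
print(f"work requested: {steps_ratio:.1f}x, innerEvalTime: {eval_ratio:.2f}x")
```

On these numbers the step 5 queries request 3x the steps while innerEvalTime grows by roughly 2.5x, which is consistent with the comment above.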

Author

mrsiano commented Jul 17, 2018

My point is that users with a large-scale TSDB who work with the Prometheus UI graph are likely to hit a timeout because of the default resolution, which is 3 seconds.

  1. We can make the default 15s instead of 3s.
  2. We can investigate further whether there is more room to improve the performance of range-vector functions at large scale.

Member

brian-brazil commented Jul 17, 2018

There is no default setting for this, it's based on the resolution so will generally have roughly the same number of steps.
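To illustrate what "based on the resolution" can mean, here is a sketch of a step derived from the graph's time range so that the number of points stays roughly constant. The target of 250 points is an assumption made for this example, not a value confirmed anywhere in this thread.

```python
def ui_step_seconds(range_seconds, target_points=250):
    """Derive a step from the graph range, never going below 1 second.

    target_points=250 is an assumed value, used only for illustration.
    """
    return max(range_seconds // target_points, 1)


for label, range_seconds in [("15m", 900), ("1h", 3600), ("6h", 21600)]:
    step = ui_step_seconds(range_seconds)
    approx_points = range_seconds // step
    print(f"range {label}: step {step}s, about {approx_points} points")
```

Under this assumption a wider range gets a coarser step, so a graph costs about the same number of evaluations regardless of the range selected.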
If you're working with lots of data then things will be slow, and that's always going to be the case no matter how many optimisations we do. It sounds like you need to optimise how you are using PromQL, which is something best covered on the mailing list.

Contributor

free commented Jul 17, 2018

As (snarkily) explained in the prometheus-users thread, the time it takes to calculate a range (as opposed to instant) query is directly proportional to the resolution/number of steps. If your step is 3 seconds, then that's going to take ~5x the amount of computation as for a 15 second step.

While optimizing the query engine would help improve things overall, your query is still going to take 5x more at 3 second resolution than at 15 second resolution. I have proposed a handful of PRs that attempt to improve query performance, but they will only do so incrementally (e.g. even if a query turns out to be twice as fast with these changes, all it takes to undo that is twice the number of series; or sum(x) - sum(y) instead of sum(x)).
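The proportionality argument can be written down as a toy cost model. This is a simplification assumed for illustration (cost proportional to steps times series, divided by an engine speedup factor); it is not how Prometheus accounts for query cost internally.

```python
def num_steps(range_seconds, step_seconds):
    """Evaluation timestamps in a range query: start, start+step, ..., up to end."""
    return range_seconds // step_seconds + 1


def relative_cost(range_seconds, step_seconds, series, engine_speedup=1.0):
    """Toy model: cost grows with steps and series, shrinks with engine speedup."""
    return num_steps(range_seconds, step_seconds) * series / engine_speedup


ONE_HOUR = 3600

# 3s versus 15s step over the same hour: close to 5x the evaluations.
step_ratio = num_steps(ONE_HOUR, 3) / num_steps(ONE_HOUR, 15)

# An engine that is twice as fast is cancelled out by twice the series.
before = relative_cost(ONE_HOUR, 15, series=1000)
after = relative_cost(ONE_HOUR, 15, series=2000, engine_speedup=2.0)

print(f"3s vs 15s: {step_ratio:.2f}x the steps")
print(f"cost before: {before:.0f}, after 2x engine and 2x series: {after:.0f}")
```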

As Brian pointed out, you need to optimize your queries. What I'm doing for the equivalent queries on my dashboards (which would otherwise take on the order of 30 seconds to load) is that I have recording rules to precompute e.g. rates and pre-aggregate e.g. by handler or status code. It takes away a bit of the flexibility, so for a couple of important dashboards I have the fast version (which displays the output of said recording rules) for general consumption and the debug version (which does everything on the fly and thus allows for a few more filters and knobs) for actual debugging. But I know that I can only use the latter for ranges of up to 1 hour and wait for 30 seconds after every change to the filters.
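As a rough idea of what such a rule could look like, here is a sketch of a Prometheus 2.x rule file that precomputes a rate and pre-aggregates it. The group name, metric name, and label names are hypothetical; they are not the rules used on the dashboards described above.

```yaml
groups:
  - name: http_precomputed            # hypothetical group name
    rules:
      # Precompute the per-second request rate, keeping only two labels.
      - record: handler_status:http_requests:rate5m
        expr: sum by (handler, status) (rate(http_requests_total[5m]))
```

A dashboard panel can then graph `handler_status:http_requests:rate5m` directly, so each step of the range query reads a few precomputed series instead of recomputing the rate across every raw series.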

Member

brian-brazil commented Dec 7, 2018

This is stale, and there doesn't appear to be an issue here.

mrsiano changed the title from "Query execution time performance" to "Query performance for large time series." on Dec 17, 2018

valyala commented Mar 6, 2019

@mrsiano, Prometheus can write data to remote storage, so you can offload heavy queries to remote storage instead of Prometheus. While many remote storage solutions do not understand PromQL (so you need to use other query languages such as SQL, InfluxQL, Flux, etc.), there are a few remote storage solutions with native PromQL support (so you can just update the Prometheus datasource URL in Grafana to query the remote storage):
