Performance issue on Prometheus 2.7.x #5173
Comments
That query is going to touch every series in the index lookup, which is not going to be fast. Try it without the name matcher. Also, this is not the sort of query you should be using often.
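For illustration only, since the query in question was not preserved above: a regex matcher on the metric name forces the index lookup to consider every metric name, while an explicit metric name only touches that metric's postings. The metric names, label values, and port below are hypothetical.
# Regex on __name__: the index lookup has to consider every metric name (slow).
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query={__name__=~"node_.*", job="node"}'
# Explicit metric name: only the postings for that one metric are touched (fast).
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=node_cpu_seconds_total{job="node"}'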
I gave this query as an example; I'm not sure whether it being 10 times slower is acceptable. So do you think it's not an issue, or do you mean I need to test another query?
This is a query where it's not important to be fast. If you have a query with this issue that isn't doing a wildcard on the name, I'm interested.
I ran some tests with production data loaded onto a dev server, and here is the new result: it's still slower, not by 10 times, but by at least 2-5 times. I ran the following query:
We use this query in a graph on a Grafana dashboard, and run directly on Prometheus it gives the following results compared to the 2.6.x version:
I think this is a pretty serious speed drop, with no changes to configuration or sources, and only 1 day of historical data. If you need any additional tests from our side as proof, please write to me and I will provide any information you need, because for now this is blocking us from upgrading to the new version. I also ran these tests at least 3 times to be sure the result is consistent.
Do you know which subpart of that query is slower? That query is also broken in two ways: you're aggregating away the labels you're trying to add in, and you can't average averages.
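As an illustration of the "average of averages" point (the metric names and port below are hypothetical, since the original query is not shown): averaging per-series averages weights every series equally regardless of how many requests each one served, so the usual pattern is to sum the numerator and denominator separately and divide once.
# Anti-pattern: averaging per-instance average latencies (an average of averages).
#   avg(rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]))
# Better: sum the numerator and denominator across series first, then divide once.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))'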
I think you're missing the point: the same query is much slower in the new Prometheus. It's not a matter of optimizing the query; the problem is that we cannot run the new Prometheus because it's slower than the previous version. If you think this is because of the second query, I can run another test without it, and the result is pretty much the same with a query like this:
If you want to discuss how the query can be optimized, I think that's a good idea, but the problem is not about optimization for now.
allengeorge commented Feb 6, 2019
Unfortunately, we're able to replicate similar behavior on an internal 2.7.1 Prometheus install. We're trying to run the following query:
We were getting cases where our queries and our dashboards were timing out. A simple, naive test was running that query in a loop and recording the runtimes on a server that was not servicing any other queries. This gave us results that were (1) slower and (2) wildly variable:
Downgrading to 2.6.1 and running the same query under the same conditions resulted in the following runtimes:
Can you give me an idea of the cardinality of each of those labels, and of the Prometheus overall?
For our Prometheus the cardinality is usually not more than 10, with the average somewhere around 5-6; the total number of time series is near 5 million and the storage retention is 3 weeks.
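For anyone who wants to measure this themselves, a rough sketch of checking label cardinality and total series over the HTTP API, assuming the label of interest is customer_id and the server listens on :9090:
# Count the distinct values Prometheus has seen for one label.
curl -s 'http://localhost:9090/api/v1/label/customer_id/values' | jq '.data | length'
# Total number of series currently in the head block.
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=prometheus_tsdb_head_series'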
You have no more than 10 customer_ids?
allengeorge commented Feb 7, 2019
Yes, we do have more than 10 customer ids, but due to the nature of our market it's not massive. As in, we're not talking tens of thousands of unique customer ids here. And, while not ideal (we read the warnings about label cardinality), we had to make some internal tradeoffs and opted for this approach. FWIW, this metric structure has been working well since the 1.x releases until 2.7, at which point we experienced significant slowdown and variability.
hoffie commented Feb 13, 2019
I am seeing this as well and I think I have come up with a self-contained reproducer. I think this boils down to a change in performance characteristics regarding label matchers.
I haven't done any further tests yet. Bisecting may help (if no one has an upfront idea of which commit could have introduced this). I'll try to continue debugging as time permits.
# Generate some fake metrics
$ for i in 1 2; do
    for m in {1..200}; do
      for x in {1..1000}; do
        echo 'example_metric'$m'{a="'"${x}"'"} 1'
      done
    done > metrics$i
  done
# Set up a web server which exposes those fake metrics
$ python2 -m SimpleHTTPServer
# Prepare and start prometheus 2.6.1 and 2.7.1
$ cat > prometheus.yml
global:
scrape_interval: 5s
evaluation_interval: 5s
scrape_configs:
- job_name: 'example1'
metrics_path: '/metrics1'
static_configs:
- targets: ['127.0.0.1:8000']
- job_name: 'example2'
metrics_path: '/metrics2'
static_configs:
- targets: ['127.0.0.1:8000']
^D
$ ./prometheus-2.6.1 --web.listen-address 127.0.0.1:9096 --storage.tsdb.path data-2.6.1 &
$ ./prometheus-2.7.1 --web.listen-address 127.0.0.1:9097 --storage.tsdb.path data-2.7.1 &
# Benchmark
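# For readability: the URL-encoded query used below decodes to
#   example_metric1{a=~".+",job="example1"}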
$ time for i in {1..100}; do curl -s 'http://localhost:9096/api/v1/query?query=example_metric1%7Ba%3D~%22.%2B%22%2Cjob%3D%22example1%22%7D' -o /dev/null; done
real 0m4,766s
user 0m0,942s
sys 0m0,550s
$ time for i in {1..100}; do curl -s 'http://localhost:9097/api/v1/query?query=example_metric1%7Ba%3D~%22.%2B%22%2Cjob%3D%22example1%22%7D' -o /dev/null; done
real 0m22,171s
user 0m1,334s
sys 0m0,697s
It's almost certainly prometheus/tsdb@296f943; however, I can't reproduce it with a micro-benchmark, so something bigger is going on.
Okay, I have it in the micro-benchmark now; I had missed a for loop.
Have a try of this branch: https://github.com/prometheus/tsdb/tree/index-perf. Only postings.go has changed.
hoffie commented Feb 14, 2019
Thanks for your work, highly appreciated! I have built three prometheus binaries with go1.11.5:
I can confirm that only unpatched master has the above-mentioned performance problem. Both patched 2.7.1 and master with your tsdb branch show highly improved performance, equal to or even better than 2.6.1. So, at least for this case, this seems to be a huge improvement. I have not done any other tests, nor am I able to run a test on our production environment with a patched version. However, based on my tests (documented below), I'm quite confident that the suggested change will help. Looking forward to getting this merged and released! :)
$ make build
$ cp prometheus prometheus-b7594f650f348b9ac108ea631b63e99a88b9413d
$ GO111MODULE=on go mod edit -require github.com/prometheus/tsdb@index-perf
$ GO111MODULE=on go mod vendor
$ make build
$ cp prometheus prometheus-b7594f650f348b9ac108ea631b63e99a88b9413d-tsdb-v0.4.1-0.20190214171337-fd372973ae24
$ git checkout -f .
$ git checkout v2.7.1
$ wget https://raw.githubusercontent.com/prometheus/tsdb/index-perf/index/postings.go -O vendor/github.com/prometheus/tsdb/index/postings.go
$ make build
$ cp prometheus ~/tmp/prom-2.7-performance-regression/prometheus-2.7.1-postings.go-patched
$ ./prometheus-2.7.1-postings.go-patched --web.listen-address 127.0.0.1:9097 --storage.tsdb.path data-2.7.1-postings.go-patched &
$ ./prometheus-b7594f650f348b9ac108ea631b63e99a88b9413d --web.listen-address 127.0.0.1:9098 --storage.tsdb.path data-b7594f650f348b9ac108ea631b63e99a88b9413d &
$ ./prometheus-b7594f650f348b9ac108ea631b63e99a88b9413d-tsdb-v0.4.1-0.20190214171337-fd372973ae24 --web.listen-address 127.0.0.1:9099 --storage.tsdb.path data-b7594f650f348b9ac108ea631b63e99a88b9413d-tsdb-v0.4.1-0.20190214171337-fd372973ae24 &
$ time for i in {1..100}; do curl -s 'http://localhost:9097/api/v1/query?query=example_metric1%7Ba%3D~%22.%2B%22%2Cjob%3D%22example1%22%7D' -o /dev/null; done
real 0m3,608s
user 0m0,914s
sys 0m0,360s
$ time for i in {1..100}; do curl -s 'http://localhost:9098/api/v1/query?query=example_metric1%7Ba%3D~%22.%2B%22%2Cjob%3D%22example1%22%7D' -o /dev/null; done
real 0m21,882s
user 0m1,322s
sys 0m0,694s
$ time for i in {1..100}; do curl -s 'http://localhost:9099/api/v1/query?query=example_metric1%7Ba%3D~%22.%2B%22%2Cjob%3D%22example1%22%7D' -o /dev/null; done
real 0m3,935s
user 0m0,848s
sys 0m0,420s
Great, thanks for all the feedback. I'll look at getting in proper benchmarks for this tomorrow and raising a PR.
krasi-georgiev added the component/local storage and kind/bug labels on Feb 15, 2019
eahydra commented Feb 19, 2019
Hey, I have an analogous performance issue after updating from Prometheus v2.5.0 to Prometheus v2.7.1.
The difference seems to be because the 2.5.0 version first checks the metric and then checks the label, but in the 2.7.1 version the label is checked first and then the metric is checked.
The intersection is done left to right; that hasn't changed.
eahydra commented Feb 19, 2019
If it didn't change, why do we get different results? After Prometheus v2.7.1 has been running for a while, say 1 hour, the query becomes slow.
And the target metric does not even exist.
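A minimal sketch of how one might time the kind of query described here: a selector for a metric that does not exist, combined with a broad label matcher (the metric name, label, and port are hypothetical).
time curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=nonexistent_metric{instance=~".+"}' -o /dev/null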
I've put that out as a PR now with some additional changes, if anyone wants to test: prometheus/tsdb#531
hoffie commented Feb 21, 2019
Looks good, thanks!
$ ./prometheus-2.6.1 --web.listen-address 127.0.0.1:9096 --storage.tsdb.path data-2.6.1
$ ./prometheus-2.8.0_pre-5fbda4c9d72b4519adbc9447f0a0023565d03b14 --web.listen-address 127.0.0.1:9098 --storage.tsdb.path data-2.8.0_pre-5fbda4c9d72b4519adbc9447f0a0023565d03b14
$ ./prometheus-2.8.0_pre-5fbda4c9d72b4519adbc9447f0a0023565d03b14-tsdb-v0.4.1-0.20190221160417-10f171afa6d5 --web.listen-address 127.0.0.1:9099 --storage.tsdb.path data-2.8.0_pre-5fbda4c9d72b4519adbc9447f0a0023565d03b14-tsdb-v0.4.1-0.20190221160417-10f171afa6d5
$ time for i in {1..100}; do curl -s 'http://localhost:9096/api/v1/query?query=example_metric1%7Ba%3D~%22.%2B%22%2Cjob%3D%22example1%22%7D' -o /dev/null; done
real 0m3,574s
user 0m0,982s
sys 0m0,507s
$ time for i in {1..100}; do curl -s 'http://localhost:9098/api/v1/query?query=example_metric1%7Ba%3D~%22.%2B%22%2Cjob%3D%22example1%22%7D' -o /dev/null; done
real 0m22,113s
user 0m1,370s
sys 0m0,669s
$ time for i in {1..100}; do curl -s 'http://localhost:9099/api/v1/query?query=example_metric1%7Ba%3D~%22.%2B%22%2Cjob%3D%22example1%22%7D' -o /dev/null; done
real 0m3,498s
user 0m1,021s
sys 0m0,409s
eahydra commented Feb 26, 2019
It works fine for my case. With Prometheus v2.7.1 the query was slow, but using the latest code from the Prometheus master branch it costs just 0.001 seconds.
dmitriy-lukyanchikov commented Feb 1, 2019
Bug Report
What did you do?
Updated to the latest Prometheus, 2.7.1, from 2.6.0.
What did you expect to see?
I expected the new Prometheus to have the same query speed, plus the new features.
What did you see instead? Under which circumstances?
I see that some queries run more than 10 times slower. Screenshots are attached.
Environment
System information: Linux 4.4.0-141-generic x86_64
Prometheus version: 2.7.1
Prometheus 2.7.1

Prometheus 2.6.0
