Worse query performance of 2.1 than 1.8.2 observed #3771
Comments
TheyDroppedMe commented Jan 31, 2018

I am also seeing decreased performance with 2.1.0 compared to 2.0.0.
Could you check whether this is still the case when you build Prometheus from the head of master? We recently fixed a performance issue that was introduced in v2.1.0 (#3715). A patch release should be coming soon.
Thank you for your comment. I've conducted the same test with the master version of Prometheus. Its version is as follows:

Unfortunately, the result is more or less the same. Query TPS and system metrics are all very similar to the previous 2.1.0 case. From the CPU profiling I suspect this is because Prometheus 2 always has to decode data that has been encoded with delta and XOR compression. I've read in articles that Prometheus 2 does not use memory chunks but relies on the page cache, and that it uses delta/XOR techniques. As I understand it, this means the ingested metrics are always stored directly to the (mmapped) file in encoded form, so when Prometheus 2 needs to read them out, it has to decode them first. Admittedly this is a wild guess, but if it's true, I would happily trade more storage for lower CPU consumption.
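To make that hypothesis concrete, here is a rough, self-contained sketch of the read path being described. This is not Prometheus's actual chunk format (the real one is a Gorilla-style bit-level encoding); it only illustrates that delta-of-delta timestamps and XORed values can only be read back by sequentially re-decoding every sample, which costs CPU on every query.

```go
// Simplified stand-in for a delta/XOR-encoded chunk (NOT the real Prometheus
// format): timestamps are stored as varint delta-of-deltas and values as the
// XOR of consecutive float64 bit patterns. Reading N samples always means
// decoding all N sequentially; there is no random access into the chunk.
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

type sample struct {
	t int64
	v float64
}

// encode packs samples into a compact byte slice.
func encode(samples []sample) []byte {
	var buf []byte
	var prevT, prevDelta int64
	var prevBits uint64
	for i, s := range samples {
		switch i {
		case 0:
			buf = binary.AppendVarint(buf, s.t) // first timestamp verbatim
		case 1:
			prevDelta = s.t - prevT
			buf = binary.AppendVarint(buf, prevDelta) // first delta
		default:
			delta := s.t - prevT
			buf = binary.AppendVarint(buf, delta-prevDelta) // delta-of-delta
			prevDelta = delta
		}
		bits := math.Float64bits(s.v)
		buf = binary.AppendUvarint(buf, bits^prevBits) // XOR against previous value
		prevBits = bits
		prevT = s.t
	}
	return buf
}

// decode is the read path: every query touching this chunk replays the whole
// encoding to reconstruct the samples, paying CPU per sample.
func decode(buf []byte, n int) []sample {
	out := make([]sample, 0, n)
	var prevT, prevDelta int64
	var prevBits uint64
	for i := 0; i < n; i++ {
		d, k := binary.Varint(buf)
		buf = buf[k:]
		switch i {
		case 0:
			prevT = d
		case 1:
			prevDelta = d
			prevT += prevDelta
		default:
			prevDelta += d
			prevT += prevDelta
		}
		x, m := binary.Uvarint(buf)
		buf = buf[m:]
		prevBits ^= x
		out = append(out, sample{t: prevT, v: math.Float64frombits(prevBits)})
	}
	return out
}

func main() {
	in := []sample{{1000, 1.5}, {1040, 1.5}, {1080, 2.25}, {1120, 2.0}}
	fmt.Println(decode(encode(in), len(in)))
}
```

The real storage is far more sophisticated, but the shape of the cost is the same: reads trade CPU for the storage savings of the encoding.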
300 relatively heavy PromQL queries per second is quite a bit; in practice that is the load of hundreds of dashboards being refreshed every 10 seconds. In this sort of case recording rules are recommended to reduce load, as you're hitting 1k metrics per query. I'm not sure this is a case we should be trading off RAM for. What sort of CPU is it?
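For reference, a minimal sketch of what such a recording rule could look like for the query used in this test (the group and rule names below are made up for illustration; the file would be referenced from prometheus.yml via rule_files):

```yaml
# rules.yml -- hypothetical recording rule precomputing the test query once
# per evaluation interval, so each dashboard request reads a single
# precomputed series instead of aggregating ~1k series on the fly.
groups:
  - name: query_perf_test
    rules:
      - record: label1_v1:metric2:count
        expr: count(metric2{label1="v1"})
```

Dashboards would then query label1_v1:metric2:count directly, which is a single series lookup.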
I'm testing in a private cloud environment, so it's hard to tell how much CPU power I actually use in terms of real hardware. At least I can tell that the Prometheus container uses 30 vCPUs and 64GB RAM on a 56-core, 256GB RAM machine. If you need more detail, please let me know.

I kept on testing and got even stranger results when I tested a metadata query. I attach some data about the above test.

pprof CPU profile: https://drive.google.com/open?id=1KKaYbbsDP9nbK0YID9Dy-zQK2WzX9JHJ

As @brian-brazil said, the workload I applied may not be very realistic, and we may never see such a workload on our real system. But what concerns me is the performance gap between 1.8 and 2.1 on the same container spec, because it will eventually determine whether to upgrade. I'd like to know whether this is expected of Prometheus 2 or just a bug.
ddewar2k commented Feb 9, 2018
I've compared query performance of 1.8.2 and 2.1, and the results are as follows:

Both Prometheus instances were running on m4.4xlarge.
xThomo commented Feb 9, 2018 (edited)

@roengram @TheyDroppedMe above mentioned there was also a drop in performance from 2.0 to 2.1. Do you have the numbers for 2.0 as well? Maybe there are other issues beyond #3715.
ddewar2k commented Feb 13, 2018

Note: the dashboard we are using is "Kubernetes cluster monitoring (via Prometheus)". This query is used in one panel, and these queries are used in another panel:

sum (rate (container_cpu_usage_seconds_total{image!="",name!

sum (rate (container_cpu_usage_seconds_total{rkt_container_name!="",kubernetes_io_hostname=~"^$Node$"}[$interval])) by (kubernetes_io_hostname, rkt_container_name)

Do these queries look onerous, or is it just the number of samples that is driving the high CPU usage? I am going to guess many folks doing container work will use this dashboard.
Prometheus 2.3.0 includes a significant rewrite of PromQL which gives good performance improvements. There are unlikely to be many significant performance improvements beyond that. If you have ideas to optimise specific queries that seem slow, please let us know.
brian-brazil closed this Jun 12, 2018


roengram commented Jan 31, 2018 (edited)
What did you do?
Compare query performance between 1.8.2 and 2.1.0.
Query:
count(metric2{label1="v1"})

What did you expect to see?
Similar or better performance than 1.8.2.
What did you see instead? Under which circumstances?
2.1.0 shows about half the TPS of 1.8.2 when 300 concurrent users are querying.
Environment
System information:
Linux 3.13.0 x86_64 (Centos7-based docker running on a Joyent Triton container)
Container has 64GB RAM with 128GB swap memory, 800GB SSD
Prometheus version:
1.8.2: prometheus, version 1.8.2 (branch: HEAD, revision: 5211b96)
2.1.0: prometheus, version 2.1.0 (branch: HEAD, revision: 85f23d8)
Alertmanager version:
N/A
Prometheus configuration file:
Basically, each Prometheus is scraping 1000 targets every 40 sec; each target exports 1000 time series per scrape, and each time series has 15 labels. The sample values are randomly generated between -2^32 and 2^32.
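For reference, a stdlib-only Go sketch of a target serving a comparable synthetic workload (illustrative only, not the actual target code used in this test; metric and label names mirror the description above, the port and everything else are made up):

```go
// A stdlib-only sketch of a synthetic target comparable to the one described
// above: 1000 gauge series per scrape, 15 labels each, values drawn uniformly
// from [-2^32, 2^32], exposed in the Prometheus text format.
package main

import (
	"fmt"
	"log"
	"math/rand"
	"net/http"
)

func metricsHandler(w http.ResponseWriter, r *http.Request) {
	// Build the 15 static labels shared by every series.
	labels := ""
	for l := 1; l <= 15; l++ {
		if l > 1 {
			labels += ","
		}
		labels += fmt.Sprintf(`label%d="v%d"`, l, l)
	}
	const span = float64(1 << 32)
	for i := 1; i <= 1000; i++ {
		value := rand.Float64()*2*span - span // random gauge in [-2^32, 2^32]
		fmt.Fprintf(w, "metric%d{%s} %g\n", i, labels, value)
	}
}

func main() {
	http.HandleFunc("/metrics", metricsHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```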
Alertmanager configuration file:
Logs:
Background
Before upgrading our current 1.8.2 to 2.1.0, we are doing performance tests. 2.1.0 is much better than 1.8.2 in terms of scraping performance, but query performance is much worse.
Both Prometheus instances are scraping 1K targets every 40 sec, each exporting 1K samples with 15 labels. Samples are gauges whose values are dynamically generated between -2^32 and 2^32. This amounts to 25K samples/sec. With the given container spec (64GB RAM, 800GB SSD), both show less than 1% CPU utilization, and 11% (2.1.0) and 21% (1.8.2) memory utilization when not queried.
The samples are simply metric1{label1="v1", label2="v2", ..., label15="v15"}, metric2{...}, ..., metric1000{...}.

Performance Test
I used nGrinder to apply stress to Prometheus. nGrinder creates 300 vusers, each querying Prometheus. The timeout is 6 sec, but no test case below hit it.
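For anyone without nGrinder handy, the load pattern can be approximated with a small stand-alone Go driver (a sketch only; the target address and run duration are placeholders, and the real test used nGrinder with the 6 sec timeout mentioned above):

```go
// Rough stand-alone approximation of the load pattern described above:
// 300 concurrent clients issuing the same instant query against
// /api/v1/query and reporting completed requests per second.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"sync/atomic"
	"time"
)

func main() {
	const (
		target   = "http://localhost:9090/api/v1/query" // placeholder address
		query    = `count(metric2{label1="v1"})`
		workers  = 300
		duration = 60 * time.Second
	)

	var completed int64
	client := &http.Client{Timeout: 6 * time.Second} // same timeout as the test
	deadline := time.Now().Add(duration)

	for w := 0; w < workers; w++ {
		go func() {
			for time.Now().Before(deadline) {
				resp, err := client.Get(target + "?query=" + url.QueryEscape(query))
				if err != nil {
					continue
				}
				io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
				resp.Body.Close()
				atomic.AddInt64(&completed, 1)
			}
		}()
	}

	time.Sleep(duration)
	fmt.Printf("~%.0f requests/sec\n",
		float64(atomic.LoadInt64(&completed))/duration.Seconds())
}
```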
metric1: 2.1.0 shows ~330 TPS while 1.8.2 shows ~270 TPS. So far so good.

count(metric2{label1="v1"}): 1.8.2 shows ~1.3K TPS while 2.1.0 shows ~470 TPS.
2.1.0 shows much higher CPU usage with similar memory utilization.
The attached charts (images) compare the two versions for: CPU, MEM (RSS), MEM (virtual), goroutines and gothreads, HTTP query duration (endpoint: /query), FS read/write bandwidth, and network connections.
CPU pprof:
https://drive.google.com/open?id=18qiiciQYLrT3TGpz9fcCl-q_1oYjPU96
HEAP pprof:
https://drive.google.com/open?id=1_zkIaGWknWEpaRnIG-w19wgCjpB8IFbX
Goroutine pprof:
https://drive.google.com/open?id=12dK4KesLeedTFWV4MwMqmf6gIGo6vuPb
Questions
CPU profiling indicates that 2.1.0 spends a lot of time reading data. I'm not sure, but this may be related to the variable-bit encoding that 2.1.0 introduced to reduce its storage footprint. In any case, the query filters and aggregates only the current values, which should be resident in memory for both versions (from chunks or mmapped files), so it's very hard for me to understand these results.
2.1.0 shows many more active/established network connections and about twice as many goroutines. Could this be a cause?
Any help would be welcome.
Edit: fixed several typos and uploaded images with more system metrics.