CPU Usage triples after upgrade to v2.1.0 #3715
Comments
I see the same thing on Kubernetes Engine 1.8.6-gke.0 going from Prometheus 2.0.0 to 2.1.0. The logs don't indicate anything special is going on.
Running a test setup where I am federating 380K series, there is indeed a CPU and allocation increase of about 20% and a slight slowdown (~10%). While nowhere near the numbers reported, this certainly warrants a look. This is one of the changes in the federation code path: #3569
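A rough way to generate comparable federation load on a test server, for anyone who wants to reproduce this kind of comparison (the host, port, and matcher below are illustrative placeholders, not taken from this issue):

```sh
# Repeatedly scrape the /federate endpoint to generate load;
# adjust the matcher to select a realistic number of series.
for i in $(seq 1 100); do
  curl -s -o /dev/null -G 'http://localhost:9090/federate' \
    --data-urlencode 'match[]={__name__=~".+"}'
done
```

Running the same loop against a 2.0.0 and a 2.1.0 instance holding identical data makes the CPU difference easy to compare.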
goddog1118 commented Jan 23, 2018
May I ask what the minimum time interval of this tool is? Can I set scrape_interval to sub-second?
@goddog1118 Please don't ask unrelated questions on bug reports; use the prometheus-users mailing list.
Can someone share their configuration and a pprof?
Here's our configuration. I've had to heavily redact it; I hope it's still useful. The codeshare will expire in 24 hours. I'm not sure how to take a pprof but I'm investigating that now.
@paddynewman thanks, I will try to look into this in the next few days and will let you know if I find anything.
@FUSAKLA the graph you showed on memory allocations also looks interesting. In general, as @krasi-georgiev mentioned, a CPU profile as well as memory allocation profiles would be great. If you have a Go toolchain installed you can collect a memory profile and a 30s CPU profile (note that you may need to make sure the Go toolchain running these commands is the same version as the one Prometheus was compiled with). These commands write the profiles to disk; you can upload them for us to analyze.
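For reference, the usual approach against Prometheus's built-in /debug/pprof endpoints looks roughly like this (host and port are placeholders; on older Go versions you may also need to pass the Prometheus binary as the first argument):

```sh
# Heap (memory allocation) profile; go tool pprof saves the fetched
# profile to disk (typically under $HOME/pprof/) and prints the path.
go tool pprof http://localhost:9090/debug/pprof/heap

# 30-second CPU profile.
go tool pprof 'http://localhost:9090/debug/pprof/profile?seconds=30'
```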
I have the profiling output, at least for 2.1, but my $employer will not allow me to share it. I can generate the output for 2.0 as well but again, I won't be allowed to share it. Is there anything else I can do to help while I figure out how to use pprof myself?
Profiling output will not contain anything sensitive; it's mainly function names.
I can create those profiles and share them, but I have no experience using pprof. When I run it, it puts me into an interactive mode. Are there any flags I can pass so it collects what you need in a non-interactive way?
You can leave the interactive mode right away; the first line or so should say where the full profile has been written.
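If you prefer to skip the interactive pprof shell entirely, fetching the raw profiles with curl and sharing those files also works (host and port are placeholders):

```sh
# Download the raw profiles directly; they contain only function
# names and sample counts, nothing from the scraped data itself.
curl -s -o heap.pprof 'http://localhost:9090/debug/pprof/heap'
curl -s -o cpu.pprof  'http://localhost:9090/debug/pprof/profile?seconds=30'
```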
Prometheus 2.1 profiles: https://drive.google.com/folderview?id=13_Scu9wS2u_odOyXvbhJMAZsgXk9yOIr
I've created profiles for both versions, although they're from different servers. None of the servers uses federation. https://drive.google.com/file/d/1fXE64oJ8uA86a47nqD_23BQ2nuaKB6XR/view?usp=sharing
Sorry for the delay. I managed to get pprof data from the federated instance, and I also added the Prometheus metrics and configurations (for both the federated and the federating instances). Sadly, I tried to downgrade back to the previous version, and judging by what the meta file contains it looks like I have a problem with the downgrade. This would mean more problems and minor data loss for me, which won't be so critical, but I would prefer to solve this some better way. Then I'll gladly provide the pprof data from the older version.
I suspect this is a side effect of the switch from dedupeSeriesSet to mergeSeriesSet: the former would just drop equivalent series, whilst the latter merges them. This should be easy to test, but can someone share the federation match params they use? If they have multiple matchers returning the same series, that might explain it.
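For context, "multiple matchers returning the same series" refers to a federation scrape passing more than one match[] selector where the selectors overlap; a hypothetical example (the job name and metric prefix are made up for illustration):

```sh
# Both selectors match the node exporter's series, so every node_*
# series from job="node" is selected by two matchers at once.
curl -sG 'http://localhost:9090/federate' \
  --data-urlencode 'match[]={job="node"}' \
  --data-urlencode 'match[]={__name__=~"node_.*"}'
```

With the old deduplicating behaviour such overlaps were simply dropped; with the merging behaviour they have to be merged, which is extra work.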
tomwilkie referenced this issue on Jan 25, 2018: Don't allocate a mergeSeries if there is only one series to merge. #3736 (Merged)
If someone experiencing the issue could test the above PR, I think that might fix this.
My federation uses a single match[] parameter. I can try the PR version, but I need a Docker image. Is it built automatically, or do I have to build it myself?
@FUSAKLA that's a single match, so it shouldn't cause the original issue I suspected, although my PR might speed things up. I'll build you a Docker image.
It's OK, I can do it myself; I just didn't know whether you have a CI Docker repository with branch builds.
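For anyone else who wants to test an unmerged PR locally, a rough sketch (this assumes a local Go toolchain and the repository's standard Makefile; the branch name is made up):

```sh
git clone https://github.com/prometheus/prometheus.git
cd prometheus
# Fetch the PR head into a local branch (using #3736 as the example).
git fetch origin pull/3736/head:pr-3736
git checkout pr-3736
make build
./prometheus --config.file=prometheus.yml
```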
Just pushed.
Okay, I think that probably rules out the federation changes as the cause; with that fix they should be strictly better than 2.0. Thanks for your help.
No problem, I'll be glad to help sort this out. I was waiting for those bugfixes introduced in 2.1.0.
tinytub commented Jan 25, 2018
The gzip code hasn't changed in a long time.
tinytub commented Jan 25, 2018
Sorry, I was wrong. Hope it helps.
I just opened #3740, which I believe fixes the primary issue (and elaborates on another one, which shouldn't be a problem except for higher allocations). If someone could test it, that would be great!
brancz referenced this issue on Jan 25, 2018: discovery: Throttle updating discovery per provider #3740 (Closed)
It seems that this was the main cause: when I did the refactoring I removed the throttling, which caused the increased memory and CPU usage.
I'm testing it and it looks really promising.
Are you looking at memory allocated and exposed by the Go runtime, or at the memory from the view of the operating system? If it's the latter, then this looks like pretty normal GC behavior (albeit still optimizable).
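One way to tell the two views apart is to compare Prometheus's own metric for Go heap usage against its resident set size (this assumes the server scrapes itself under job="prometheus"; host and port are placeholders):

```sh
# Memory as seen by the Go runtime (heap currently in use).
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=go_memstats_heap_inuse_bytes{job="prometheus"}'

# Memory as seen by the operating system (resident set size).
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=process_resident_memory_bytes{job="prometheus"}'
```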
This figure shows: (graph omitted)
I'm fairly certain there's a memory leak. In my previous comment/graph I am showing the OS's view of memory, but the consumption never goes down. I have a midnight restart enabled via cron, which is why you see the saw-tooth pattern, and this was only required after upgrading from 2.0 to 2.1. Before that the consumption was very stable at ~17.5 GiB. We're now reaching ~65 GiB in a 24-hour period before our cron job restarts our instances.
After another day of benchmarking we've come up with this: #3747. Please try it out and give us feedback!
Thanks for the effort! Unfortunately I left the previous PR version deployed, so this is not compared to the official release; tomorrow I'll switch to the official release. But again, it looks good: memory usage dropped and hopefully won't spike anymore.
Should be closed by #3747.
gouthamve closed this on Feb 1, 2018
brancz referenced this issue on Feb 1, 2018: Worse query performance of 2.1 than 1.8.2 observed #3771 (Closed)
Guys! I dropped this bomb and went on vacation. One of the main reasons I love Prometheus so much is the incredible core team and community! Can't wait for the next release with the fixes.
lock bot commented Mar 22, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
dannyk81 commented Jan 20, 2018
What did you do?
Upgraded Prometheus from v2.0.0 to v2.1.0 (using the official Docker image)
What did you expect to see?
Similar or better performance figures.
What did you see instead? Under which circumstances?
CPU usage almost tripled and rule evaluation times increased. We observed the same on two different deployments (two separate K8s clusters); after falling back to v2.0.0 the figures returned to normal.
Cluster 1 [CPU]: (graph omitted)
Cluster 2 [CPU]: (graph omitted)
Cluster 1 [Rule evaluation duration]: (graph omitted)
Cluster 2 [Rule evaluation duration]: graph looks about the same, but this cluster is smaller.
Apart from changing the Docker image version, everything remained unchanged.
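For reference, the CPU comparison in the graphs above can be reproduced from the server's own metrics; the query below is a sketch, and the job label is an assumption about how the server scrapes itself (the rule-evaluation graphs come from the server's rule-evaluation duration summary, whose exact metric name varies between versions):

```sh
# Per-instance CPU usage of the Prometheus server itself.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(process_cpu_seconds_total{job="prometheus"}[5m])'
```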
Environment
Linux 4.12.10-coreos x86_64 (runs in a K8s 1.5.3 cluster)
Alertmanager version: N/A
Prometheus configuration file: