CPU Usage triples after upgrade to v2.1.0 #3715

Closed
dannyk81 opened this Issue Jan 20, 2018 · 45 comments

dannyk81 commented Jan 20, 2018

What did you do?

Upgraded Prometheus from v2.0.0 to v2.1.0 (using the official Docker image)

What did you expect to see?

Similar or better performance figures.

What did you see instead? Under which circumstances?

CPU usage almost tripled and rule evaluation times increased. We observed the same on two different deployments (two separate K8s clusters); after rolling back to v2.0.0, the figures returned to normal.

Cluster 1 [CPU]: [screenshot]

Cluster 2 [CPU]: [screenshot]

Cluster 1 [Rule evaluation duration]: [screenshot]

Cluster 2 [Rule evaluation duration]: the graph looks about the same, but this cluster is smaller. [screenshot]

Apart from changing the Docker image version, everything remained unchanged.
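
(For reference, CPU graphs like the ones above are typically derived from the standard process metrics each Prometheus server exports about itself; a minimal example query, assuming the server scrapes itself under job="prometheus":)

rate(process_cpu_seconds_total{job="prometheus"}[5m])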

Environment

  • System information:

Linux 4.12.10-coreos x86_64 (Runs in K8s 1.5.3 Cluster)

  • Prometheus version:
prometheus, version 2.1.0 (branch: HEAD, revision: 85f23d82a045d103ea7f3c89a91fba4a93e6367a)
  build user:       root@6e784304d3ff
  build date:       20180119-12:01:23
  go version:       go1.9.2
  • Alertmanager version:

    N/A

  • Prometheus configuration file:

    Can provide if needed.

  • Alertmanager configuration file:

    N/A

  • Logs:

    Can provide if needed.

JorritSalverda commented Jan 20, 2018

I see the same thing on Kubernetes Engine 1.8.6-gke.0 going from Prometheus 2.0.0 to 2.1.0:

[screenshot: CPU increase after upgrading to Prometheus 2.1.0]

The logs don't indicate there's anything special going on:

level=info ts=2018-01-20T13:09:22.585447311Z caller=main.go:225 msg="Starting Prometheus" version="(version=2.1.0, branch=HEAD, revision=85f23d82a045d103ea7f3c89a91fba4a93e6367a)"
level=info ts=2018-01-20T13:09:22.585523718Z caller=main.go:226 build_context="(go=go1.9.2, user=root@6e784304d3ff, date=20180119-12:01:23)"
level=info ts=2018-01-20T13:09:22.585546075Z caller=main.go:227 host_details="(Linux 4.4.86+ #1 SMP Thu Dec 7 20:11:11 PST 2017 x86_64 prometheus-1 (none))"
level=info ts=2018-01-20T13:09:22.585564529Z caller=main.go:228 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-01-20T13:09:22.588658792Z caller=main.go:499 msg="Starting TSDB ..."
level=info ts=2018-01-20T13:09:22.588686556Z caller=web.go:383 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-01-20T13:09:50.424774629Z caller=main.go:509 msg="TSDB started"
level=info ts=2018-01-20T13:09:50.42494659Z caller=main.go:585 msg="Loading configuration file" filename=/prometheus-config/prometheus.yml
level=info ts=2018-01-20T13:09:50.432711846Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-01-20T13:09:50.433969004Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-01-20T13:09:50.434900141Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-01-20T13:09:50.435884965Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-01-20T13:09:50.437116695Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-01-20T13:09:50.438063243Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-01-20T13:09:50.438998843Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-01-20T13:09:50.459854142Z caller=main.go:486 msg="Server is ready to receive web requests."
level=info ts=2018-01-20T13:09:50.459854122Z caller=manager.go:59 component="scrape manager" msg="Starting scrape manager..."

FUSAKLA commented Jan 21, 2018

I see the same thing in our Kubernetes cluster, running the official Docker image and changing only the image tag.
(The first annotation is the upgrade from 2.0.0 to 2.1.0; the second one is a restart due to an OOM kill by Kubernetes.)

[screenshots: allocations, compaction, CPU, federation]

This is interesting, because according to the release notes federation should be faster:

[ENHANCEMENT] Federation is now much faster for large numbers of series.

To provide HA I federate all time series (I know it's not right, it's just a temporary solution), so I think this is exactly the case stated above, and yet federation is even slower.

[screenshots: memory, reloaded blocks, WAL fsync]

Recording rules evaluation time did not change in my case.


gouthamve commented Jan 22, 2018

Running a test setup where I am federating 380K series, there is indeed a CPU and allocations increase of about 20% and a slight slowdown (~10%). While nowhere close to the numbers reported, this certainly warrants a look.

This is one of the changes in the federation code-path: #3569


goddog1118 commented Jan 23, 2018

May I ask what the minimum scrape interval of this tool is? Can I set scrape_interval to sub-second values?


brian-brazil commented Jan 23, 2018

@goddog1118 Please don't ask unrelated questions on bugs; use the prometheus-users mailing list.


paddynewman commented Jan 24, 2018

I'm also seeing similar problems, in case it helps with prioritisation.

Here's a graph showing the difference in resource utilisation at our site. The only change was to upgrade from 2.0 to 2.1:

[screenshot: Prometheus 2.0 to 2.1 resource utilisation]


brian-brazil commented Jan 24, 2018

Can someone share their configuration and a pprof?


paddynewman commented Jan 24, 2018

Here's our configuration. I've had to heavily redact it; I hope it's still useful. The codeshare will expire in 24 hours.

https://codeshare.io/2B8zmd

I'm not sure how to take a pprof but I'm investigating that now.


krasi-georgiev commented Jan 24, 2018

@paddynewman thanks, I will try to look into this in the next few days and will let you know if I find anything.
Memory and CPU pprof profiles would really help narrow it down.
If it is not too much to ask, please generate memory and CPU pprof profiles with both 2.0 (normal performance) and 2.1 (higher CPU and memory usage) so we can compare the results.


brancz commented Jan 24, 2018

@FUSAKLA the graph you showed on memory allocations also looks interesting.

In general, as @krasi-georgiev mentioned, a CPU profile as well as memory allocation profiles would be great. If you have a Go runtime installed you can collect a memory profile (note that you may need to make sure the Go runtime running these commands is the same version as the one the Prometheus server was compiled with):

go tool pprof http://<your prometheus server>/debug/pprof/heap

And a 30s CPU profile with:

go tool pprof http://<your prometheus server>/debug/pprof/profile

These commands write the profiles to disk; you can upload those files for us to analyze.
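
(If you want to skip the interactive mode entirely, the raw profiles can also be fetched directly from the standard Go pprof HTTP endpoints and attached here; a sketch, with the host and output file names as placeholders:)

curl -o heap.pprof 'http://<your prometheus server>/debug/pprof/heap'
curl -o cpu.pprof 'http://<your prometheus server>/debug/pprof/profile?seconds=30'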


paddynewman commented Jan 24, 2018

I have the profiling output, at least for 2.1, but my $employer will not allow me to share it. I can generate the output for 2.0 as well but again, I won't be allowed to share it.

Is there anything else I can do to help while I figure out how to use pprof myself?


brian-brazil commented Jan 24, 2018

Profiling output will not contain anything sensitive; it's mainly function names.


JorritSalverda commented Jan 24, 2018

I can create those profiles and share them, but I have no experience using pprof. When I run it, it puts me in an interactive mode. Are there any flags I can pass so it collects what you need in a non-interactive way?


brancz commented Jan 24, 2018

You can leave the interactive mode right away; the first line or so should say that the full profile has been written to ~/pprof/<profile-name>. Simply share that file.

JorritSalverda commented Jan 24, 2018

I've created profiles for both versions although they're from different servers. None of the servers uses federation.

https://drive.google.com/file/d/1fXE64oJ8uA86a47nqD_23BQ2nuaKB6XR/view?usp=sharing


FUSAKLA commented Jan 24, 2018

Sorry for the delay.

I managed to get pprof data from the federated 2.1.0 Prometheus instance.
Unfortunately the instance got OOM-killed recently, so the memory usage is not that high.
I'll get more critical data tomorrow if needed.
Link here (I hope I don't break any rules in my company :) )

I also added Prometheus metrics and configurations (for both the federated and the federating instances).

Sadly, I tried to downgrade back to 2.0.0 (using the official upstream Docker images) but ended up with:

level=info ts=2018-01-24T17:32:57.106618248Z caller=main.go:314 msg="Starting TSDB"
level=info ts=2018-01-24T17:32:57.106646142Z caller=targetmanager.go:71 component="target manager" msg="Starting target manager..."
level=error ts=2018-01-24T17:32:57.187507378Z caller=main.go:323 msg="Opening storage failed" err="read meta information /prometheus/01C4MHVPYRN7THY4KS3KAZP1A5: unexpected meta file version 2"

The meta.json file contains:

{
	"ulid": "01C4MHVPYRN7THY4KS3KAZP1A5",
	"minTime": 1516802400000,
	"maxTime": 1516809600000,
	"stats": {
		"numSamples": 13784198,
		"numSeries": 42773,
		"numChunks": 117088
	},
	"compaction": {
		"level": 1,
		"sources": [
			"01C4MHVPYRN7THY4KS3KAZP1A5"
		]
	},
	"version": 2
}
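
(A quick way to check which blocks on disk already carry the newer meta version; the data directory path below is taken from the log line above and may differ on other setups:)

grep '"version"' /prometheus/*/meta.json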

So it looks like I have a problem downgrading from 2.1.0 to 2.0.0, but the changelog made no mention of a backward incompatibility in storage?

This would mean more problems and minor data loss for me, which won't be that critical, but I would prefer to solve this in some better way. Then I'll gladly provide the pprof data from the 2.0.0 version.


tomwilkie commented Jan 25, 2018

I suspect this is a side effect of the switch to mergeSeriesSet over dedupeSeriesSet: the latter would just drop equivalent series, whilst the former merges them. This should be easy to test, but can someone share the federation match params they use? If they have multiple matchers returning the same series, that might explain it.
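
(For illustration only, here is a minimal, self-contained sketch of the behavioural difference described above; the types and names are simplified assumptions, not the actual Prometheus storage code. A dedupe-style series set simply drops series whose label sets are identical, while a merge-style one has to combine their samples, which is extra work per overlapping series:)

package main

import "fmt"

// series is a simplified stand-in for a labelled time series.
type series struct {
	labels  string            // flattened label set, e.g. `{job="node"}`
	samples map[int64]float64 // timestamp -> value
}

// dedupe keeps only the first series seen for each label set and drops the rest (cheap).
func dedupe(in []series) []series {
	seen := map[string]bool{}
	var out []series
	for _, s := range in {
		if seen[s.labels] {
			continue
		}
		seen[s.labels] = true
		out = append(out, s)
	}
	return out
}

// merge combines the samples of series sharing a label set, touching every sample of every duplicate.
func merge(in []series) []series {
	byLabels := map[string]*series{}
	var out []*series
	for i := range in {
		s := in[i]
		if existing, ok := byLabels[s.labels]; ok {
			for ts, v := range s.samples {
				existing.samples[ts] = v
			}
			continue
		}
		copied := s
		byLabels[s.labels] = &copied
		out = append(out, &copied)
	}
	result := make([]series, 0, len(out))
	for _, s := range out {
		result = append(result, *s)
	}
	return result
}

func main() {
	in := []series{
		{labels: `{job="node"}`, samples: map[int64]float64{1000: 1}},
		{labels: `{job="node"}`, samples: map[int64]float64{2000: 2}},
	}
	fmt.Println(len(dedupe(in)), "series after dedupe")            // 1: duplicate dropped
	fmt.Println(len(merge(in)[0].samples), "samples after merge") // 2: samples combined
}

In the federation path, that extra merge work only comes into play when more than one matcher selects the same series, which is why the match params matter here.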


tomwilkie commented Jan 25, 2018

If someone experiencing the issue could test the above PR, I think that might fix this.


FUSAKLA commented Jan 25, 2018

My federation uses '{job=~"(.+)"}'; the configuration is in the file I posted.
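
(For readers following along, a federation job with a single catch-all matcher like that typically looks roughly like this in prometheus.yml; the job name and target below are placeholders, not the actual configuration from the posted file:)

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"(.+)"}'
    static_configs:
      - targets:
          - 'source-prometheus:9090'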

I can try the PR version, but I need a Docker image. Is it built automatically, or do I have to build it myself?


tomwilkie commented Jan 25, 2018

@FUSAKLA that's a single match, so it shouldn't cause the original issue I suspected, although my PR might speed things up.

I'll build you a docker image.


FUSAKLA commented Jan 25, 2018

It's OK, I can do it myself; I just didn't know if you have some CI Docker repository with branch builds.


tomwilkie commented Jan 25, 2018

Just pushed tomwilkie/prometheus:3715-2.1-cpu-749781ed, let me know if it helps!


FUSAKLA commented Jan 25, 2018

There is some improvement in memory usage, but I'm not sure it's significant enough to say it was caused by the fix. The CPU and allocations seem to be untouched by the fix.
[screenshot]


tomwilkie commented Jan 25, 2018

Okay, I think that probably rules out the federation changes as the cause - with that fix they should be strictly better than 2.0. Thanks for your help.


FUSAKLA commented Jan 25, 2018

No problem, I'll be glad to help sort this out. I was waiting for those bugfixes introduced in 2.1.0.


tinytub commented Jan 25, 2018

prometheus/client_golang#366

Maybe the gzip causes the high CPU usage?
Federation and client_golang's promhttp use the same writeSample functions.


brian-brazil commented Jan 25, 2018

The gzip code hasn't changed in a long time.


tinytub commented Jan 25, 2018

Sorry, I was wrong.
I took a pprof while running curl 'http://mdb73:9090/federate?match[]={__name__=~"disk.*"}' -H 'Accept-Encoding: gzip'
That's about 637342 series.
https://drive.google.com/file/d/1QLAcmxbbkovc7h9EStlG37_AwvNer4hR/view?usp=sharing

Hope it helps


brancz commented Jan 25, 2018

I just opened #3740, which I believe fixes the primary issue (and elaborates on another one, which shouldn't be a problem except for higher allocations). If someone could test it, that would be great! 🙂


FUSAKLA commented Jan 25, 2018

Actually, SD also went pretty crazy after the update, but I didn't consider it that important.
Here is a graph of the upgrade's influence on SD:
[screenshot]

I'll test the PR ASAP.


krasi-georgiev commented Jan 25, 2018

It seems that this was the main cause: when I did the refactoring I removed the throttling, which caused the increased memory and CPU usage.
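
(As an illustration of the kind of throttling described here, a simplified, self-contained sketch; the names, types and interval below are assumptions, not the actual discovery-manager code. Coalescing target-group updates and forwarding them at a fixed interval avoids triggering downstream work for every single SD update:)

package main

import (
	"fmt"
	"time"
)

// targetGroup is a simplified stand-in for a service-discovery update.
type targetGroup struct {
	source  string
	targets []string
}

// throttleUpdates collects incoming updates and forwards the coalesced state
// downstream at most once per interval, instead of once per update.
func throttleUpdates(in <-chan targetGroup, out chan<- map[string]targetGroup, interval time.Duration) {
	pending := map[string]targetGroup{}
	dirty := false
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case tg, ok := <-in:
			if !ok {
				return
			}
			pending[tg.source] = tg // coalesce: keep only the latest update per source
			dirty = true
		case <-ticker.C:
			if !dirty {
				continue
			}
			// Copy so the receiver owns its snapshot.
			snapshot := make(map[string]targetGroup, len(pending))
			for k, v := range pending {
				snapshot[k] = v
			}
			out <- snapshot
			dirty = false
		}
	}
}

func main() {
	in := make(chan targetGroup)
	out := make(chan map[string]targetGroup)
	go throttleUpdates(in, out, 100*time.Millisecond)

	// A burst of updates from the same source results in a single downstream send
	// once the ticker fires, rather than 100 separate ones.
	go func() {
		for i := 0; i < 100; i++ {
			in <- targetGroup{source: "kubernetes/pods", targets: []string{fmt.Sprintf("10.0.0.%d:9090", i)}}
		}
	}()

	fmt.Println("sources in first downstream batch:", len(<-out))
}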


FUSAKLA commented Jan 25, 2018

I'm testing it and it looks really promising.
In a few minutes I'll post the results. Sorry for not posting the SD figures the first time.


FUSAKLA commented Jan 25, 2018

I think this pretty much confirms it.
All values are back to normal 2.0.0 levels.

prometheus, version 2.1.0 (branch: HEAD, revision: 3a8f87304adca830dfba46ae3ea0d7ce72e0d47c)
  build user:       fusakla@FUSAKLA
  build date:       20180125-23:09:18
  go version:       go1.9.2

[screenshots]
Even the federation is back to the previous values:
[screenshot]

Do you want pprof or any other data to verify this?

Thanks for the PR, it looks promising!


paddynewman commented Jan 26, 2018

Do you think this PR will resolve (what I'm assuming is) a memory leak that appeared in 2.1 as well?

[screenshot: memory usage]

We've configured a midnight restart, hence the saw-tooth pattern.


FUSAKLA commented Jan 26, 2018

As far as I can tell from the metrics, probably not. This figure shows the impact on memory since the 2.1.0 upgrade; the last annotation is the upgrade to the mentioned PR, and memory is still rising, although CPU and so on seem to be OK.
[screenshot]


brancz commented Jan 26, 2018

Are you looking at memory allocated and exposed by the Go runtime, or from the view of the operating system? If it's the latter, then this looks like pretty normal GC behavior (albeit still optimizable).


FUSAKLA commented Jan 26, 2018

This figure shows process_resident_memory_bytes and go_memstats_alloc_bytes. The data from cAdvisor looks the same.


paddynewman commented Jan 26, 2018

I'm fairly certain there's a memory leak. In my previous comment/graph I am showing the OS's view of memory, but the consumption never goes down.

I have a midnight restart enabled via cron, which is why you see the saw-tooth pattern; this was only required after upgrading from 2.0 to 2.1. Before that, the consumption was very stable at ~17.5GiB.

We're now reaching ~65GiB in a 24-hour period before our cron job restarts our instances.


brancz commented Jan 26, 2018

After another day of benchmarking we've come up with this: #3747

Please try it out and give us feedback! 🙂


FUSAKLA commented Jan 26, 2018

Thanks for the effort!

Unfortunately I left the previous PR version deployed, so this is not compared to the official 2.1.0 release but to the proof-of-concept version in #3740.

Tomorrow I'll switch to the official 2.1.0 and send those results.

But again, it looks good. Memory usage fell and hopefully won't spike anymore;
the reloading of blocks from disk mentioned in #3725 also fell to a minimum.
The rest, like SD, federation and so on, looks the same as with the previous fix.
[screenshots]


FUSAKLA commented Jan 27, 2018

So I have the results. There are 3 versions in those figures:

(from left to right, separated by annotations)

  1. The first PR (already deleted)
  2. This PR, now requested for benchmarking, from branch sched-update-throttle
  3. The official 2.1.0 release

[screenshots]


gouthamve commented Feb 1, 2018

Should be closed in #3747


dannyk81 commented Feb 1, 2018

Guys! I dropped this bomb and went on vacation 😛 but was amazed to follow remotely how this was tackled and finally resolved.

One of the main reasons I love Prometheus so much is thanks to the incredible core team and community!!! ❤️

Can't wait for the next release with the fixes 👏 👏


lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
