2.3.0 significant memory usage increase #4254
Comments
Can you share your configuration, a snapshot of the benchmark dashboard, and whether you've made any other changes?
@brian-brazil where do I find the "benchmark dashboard"? The config for this is rather large; are there any specific areas of interest?
Found the dashboard.
The entire config please; we don't know what might be relevant.
Obfuscated by hand, hopefully didn't introduce any additional problems.
Summary of config: 15s interval, using 13 gce_sd_configs and I think 36 kubernetes_sd_configs.
How often is the config file being reloaded? Can you try running it without the rules to eliminate those?
Sounds about right.
Config reload only occurs on recording rule publishing (on the order of < 1/day). Here's a before/after (the blue annotation shows when I reverted to 2.2.1). FWIW, queries should be relatively consistent across the timespan of that plot.
My primary suspicion would be kubernetes_sd, as that changed a good bit. Query memory should only have gone down, which your graphs show.
My original graph may be incredibly misleading. I thought the spikes were the cause of the OOM; now I realise it shows the total allocated CPU, so the spikes are actually two instances running, not one. I think the instance is already over its memory budget.
@tcolgate could you provide a heap profile SVG when the usage is very high? And then …
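(For reference, Prometheus exposes the standard Go pprof handlers, so such a profile can be pulled over plain HTTP. Below is a minimal sketch, assuming the server is reachable at prometheus:9090; the hostname and output filename are assumptions, and running `go tool pprof` directly against the URL works just as well.)

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Fetch the heap profile from the pprof endpoint Prometheus exposes.
	// The heap profile carries both in-use and allocation samples;
	// `go tool pprof -inuse_space` or `-alloc_space` selects the view.
	resp, err := http.Get("http://prometheus:9090/debug/pprof/heap")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	// Render an SVG afterwards with something like:
	//   go tool pprof -svg heap.pprof > heap.svg
}
```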
@fabxc I've upped the memory request to 26GB, which I think is enough now that at least it doesn't get OOM'd (it's pretty much on a dedicated node now).
Internal prom perf dashboard capture during that profile: https://snapshot.raintank.io/dashboard/snapshot/si3JoG47UshfAlbwdLg6GgCiXvFep4JQ
Thanks for the quick response. That appears to be a 30s CPU profile, which looks fairly normal.
Doh.
This profile does not show any service discovery at all, i.e. its usage is probably very minor. The graph you first posted also indicates that usage didn't increase continuously but rather spikes a lot. The baseline actually seems to be a bit lower than before. Can you get another profile with the …
This is an alloc_space one from during the climbing memory. This one seems to suggest that the labels built up during scraping are the problem? What's changed around the scrapes? More concurrency maybe?
Do you have a normal heap_inuse and one during a spike?
@brian-brazil I think the previous one was at the start of one. Timing is tricky as things become unresponsive. I've had to revert now as I've tested these on a prod instance (thanks to the kind patience of some devs).
The allocs profile only shows allocations, recording rule evaluations, and serving Prometheus's own … My best guess for now is thus PromQL. The CPU profile doesn't show much GC work, which indicates that generally the allocation improvements in 2.3 are working. But possibly the changed evaluation model pins too much memory for a single query at once? I've no hard reasoning for this, though. @tcolgate any chance you can share (possibly privately) the recording rules that server is running, so we can get an understanding of the queries that are running?
Could you also spin up a test server without the recording rules?
@brian-brazil that's going to take me a while. I can try and find the time tomorrow.
I can only imagine that happening with very high churn in the underlying data, which the graphs don't show.
@tcolgate did you make any other changes apart from just upgrading the prom version?
@krasi-georgiev nope, just upgraded. I've updated and reverted a few times; the behaviour is consistent: 2.3.0 crashes, 2.2.1 is stable.
rajatjindal commented Jun 13, 2018
@tcolgate, the graphs that you shared are very interesting. We are also using Prometheus and running into performance issues when we enable remote storage (it might be a completely different issue). Is there a place where we can import these dashboards from? It will be interesting to see these metrics.
@rajatjindal The Prometheus Benchmark dashboard is available on grafana.net (make sure you get the 2.0 version). The other dashboard is our internal prom dashboard; better dashboards exist on grafana.net (ours is adapted from one of the earlier v1 Prometheus perf dashboards).
From the … #4248 tries to optimize away the latter 4 GB (and I've seen it use 10x less memory for …). Neither is a particularly difficult technical challenge, and it might be worth pursuing even if the root cause turns out to be totally different (which I actually doubt).
Oh, and on a related note: while testing my humongous aggregated rate, I haven't managed to pinpoint the cause, but I imagine it's the TSDB loading everything.
brian-brazil added the kind/bug and component/local storage labels and removed the kind/more-info-needed label on Jun 14, 2018
I've managed to trigger it again and got an allocs SVG, but I don't think that is terribly useful (it seems to cover the period since the process started?). There seems to be some kind of time/event element to this, but it doesn't obviously align with some other event having occurred. Basically, if I leave the thing alone for an hour or so, I can trigger the OOM by using the federate query. At some time before this, I can hammer federate mercilessly without issues (tried hitting it with hey). I need to head off soon. Tomorrow I'll try and capture a CPU profile during a crash.
This is smelling like a TSDB issue and it doesn't align with blocks, so it's probably chunks.
I've looked through the code changes on the federation codepath between 2.3.0 and 2.2.1; there are a few changes on the path, but none of them seem plausible.
I've not been able to trigger a crash this morning as yet. I'll give it a try every 10 mins or so and see if I can get a trace during the crash.
I'm currently suspecting it's #4185. If you can reproduce again, try rolling that one back.
Also, if you could try the federate call without the {job="prometheus"} matcher, to see if it's overlapping selectors that's the issue.
Removing the …
I would expect removing the …
Any suggestions for how I can update the /federate call to make it more likely to trigger your guess? I'd rather be able to reliably reproduce the problem.
Adding some duplicate matchers might do it, but it's a bit of a shot in the dark.
I repeated one of the matchers a couple of times and boom, that did it. I appear to be able to crash it on demand. Impressive in-the-dark shooting there!
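(For anyone wanting to reproduce this, the trigger is simply a /federate request where the same selector is passed more than once. A minimal sketch follows, assuming a Prometheus at localhost:9090; the {job="prometheus"} selector is the one mentioned earlier in the thread, everything else is an assumption. On an affected 2.3.0 server the request may never return and memory climbs instead.)

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
)

func main() {
	// Pass the same match[] selector twice, so more than one selector
	// returns the same series; federating NaN samples then triggers the bug.
	params := url.Values{}
	params.Add("match[]", `{job="prometheus"}`)
	params.Add("match[]", `{job="prometheus"}`)

	resp, err := http.Get("http://localhost:9090/federate?" + params.Encode())
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("status=%s, body=%d bytes\n", resp.Status, len(body))
}
```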
Okay, I can reproduce locally now on a Prometheus scraping only itself.
Okay, the issue is if you try to federate a NaN value from a time series that more than one selector returns. NaNs never equal themselves, so it looks like a different value even though it's the same. So we end up in an infinite loop, which is also buffering up all these duplicate points in RAM.
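(The float semantics behind this are easy to demonstrate. The sketch below is only an illustration of why a plain value-equality dedup check never fires for NaN samples, not the actual Prometheus federation code.)

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	prev, next := math.NaN(), math.NaN()

	// IEEE 754: NaN never compares equal to anything, including itself,
	// so a plain equality check sees two identical NaN samples as distinct.
	fmt.Println("next == prev:", next == prev) // false

	// A dedup rule written as "drop the sample if it equals the previous
	// one" therefore never drops the duplicate, and the points pile up.
	dropPlain := next == prev
	fmt.Println("dropped (plain equality):", dropPlain) // false

	// A NaN-aware comparison treats the two samples as the same value.
	dropNaNAware := next == prev || (math.IsNaN(next) && math.IsNaN(prev))
	fmt.Println("dropped (NaN-aware):", dropNaNAware) // true
}
```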
brian-brazil referenced this issue on Jun 15, 2018: Avoid infinite loop on duplicate NaN values. #4275 (Merged)
brian-brazil added a commit that referenced this issue on Jun 15, 2018
@brian-brazil great catch, cheers. I'll try a build from release-2.3 once you've merged.
brian-brazil added a commit that referenced this issue on Jun 18, 2018
brian-brazil closed this on Jun 18, 2018
brian-brazil referenced this issue on Jun 22, 2018: prometheus 2.3.0 become OOM when consul is unavailable #4253 (Closed)
mknapphrt added a commit to mknapphrt/prometheus that referenced this issue on Jul 26, 2018
gouthamve added a commit to gouthamve/prometheus that referenced this issue on Aug 1, 2018
lock bot commented Mar 22, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
tcolgate commented Jun 12, 2018 (edited)
Bug Report
What did you do?
Upgraded to 2.3.0
What did you expect to see?
General improvements.
What did you see instead? Under which circumstances?
Memory usage, possibly driven by queries, has considerably increased. The upgrade was at 09:27; the memory usage drops on the graph after that are from container restarts due to OOM.
(graph of container_memory_usage_bytes)
Environment
Prometheus in Kubernetes 1.9
System information:
Standard Docker containers, on a Docker kubelet on Linux.
Prometheus version:
2.3.0