TSDB compaction failures lead to OOM shutdown #4757
Comments
simonpasquier added the component/local storage label on Oct 18, 2018
This seems related to #4110. Getting pprof heap profiles during the compaction would help to debug the issue.
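For reference, a heap profile can be grabbed while a compaction is running. A minimal sketch, assuming Prometheus listens on localhost:9090 with its default debug endpoints and that the Go toolchain is installed:

```bash
# Capture a heap profile from the running Prometheus (default pprof endpoint).
curl -s http://localhost:9090/debug/pprof/heap > heap.pprof
# Summarize the largest in-use allocations.
go tool pprof -top heap.pprof
```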
@SuperQ Hello, thanks for your answer. We generally see compaction failures at around 16 GB of RAM consumption (50% of total). Do you think the compaction process might really need that much memory?
Yes, right now compactions do need quite a lot of memory. One thing you're doing that's making it worse is the block duration settings. Also, if you set min-block and max-block to the same number, you will spend no time doing additional compactions.
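As an illustration of the block-duration discussion above, here is a hedged sketch of the relevant hidden TSDB flags in Prometheus 2.x; the values are only an example of setting min and max block duration to the same number, not a recommendation for this particular setup:

```bash
# Hypothetical launch command; min-block-duration and max-block-duration are
# hidden flags, so they do not appear in --help output but are accepted by 2.x.
prometheus \
  --config.file=prometheus.yml \
  --storage.tsdb.retention=1d \
  --storage.tsdb.min-block-duration=2h \
  --storage.tsdb.max-block-duration=2h
```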
metri commented on Oct 22, 2018
@SuperQ, hi! I am a colleague of @ashepelev. How can I give you the archive with the pprof debug data?
The easiest way is to attach it as a file to this issue.
metri commented on Oct 23, 2018
prom_debug.tar.gz
@SuperQ I think the problem here is different. The error occurs when trying to persist the head into a block, not when trying to join two or more blocks together.
I think the problem here is some strange character in your series labels.
If that is the case, please let me know the series label that caused this so we can add some sanitization somewhere.
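One hedged way to sanity-check a target's exposition for malformed metric or label names is to pipe a scrape through promtool (shipped with Prometheus); the target address is a placeholder, and whether this flags the exact offending character depends on the parser and lint rules:

```bash
# <target-host>:<port> stands in for the suspect scrape target.
# promtool parses the text exposition format and reports parse/lint problems.
curl -s http://<target-host>:<port>/metrics | promtool check metrics
```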
metri commented on Oct 30, 2018
@krasi-georgiev hi! The thing is that this happens with different labels. E.g., yesterday:
This was due to a different label.
Hm, that looks like something very specific to your setup.
When did this start happening, and did you change anything in your setup around that time?
@krasi-georgiev, hello. I suspect the "compaction failed" errors are not the root cause but a consequence of the problem. Theoretically, a 32 GB RAM installation should be fine for our amount of metrics (30m min/max block, 1d retention, 60s global scrape interval -> 23k samples/s), but in reality we cannot put any more pressure on Prometheus: changing any of these settings causes an OOM. After a restart it starts reading the WAL, starts compaction again, and OOMs again. I've tried to find Prometheus metrics that would help me resolve this issue, but they all look fine: I see the head flushing every 30m, and the head time window size behaves normally. I don't see any exceptions in Prometheus debug mode. I've looked across other Prometheus RAM issues and found several things to check:
The problem I see is completely different. The head is the in-memory holder of the ingested metrics, and these are flushed to disk periodically to free up the memory. We need to figure out why you are getting that error in the first place.
Back to my original question: did you change anything recently? A new exporter, a new app, or anything new on the metrics side, relabelling, etc.?
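One way to cross-check the head behaviour described above is to watch the TSDB's own metrics; a hedged sketch, assuming Prometheus is reachable on localhost:9090:

```bash
# Number of series currently held in the in-memory head block.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=prometheus_tsdb_head_series'
# Ingestion rate: samples appended per second over the last 5 minutes.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(prometheus_tsdb_head_samples_appended_total[5m])'
# Counter of compactions that have failed so far.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=prometheus_tsdb_compactions_failed_total'
```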
Adding an additional series label/name pair to an existing application could be another culprit.
Another thing that can help is if you could upload the
@fabxc, @gouthamve any other suggestions on what might cause this?
@krasi-georgiev
I guess we figured out the problem.
Increased memory usage is expected, but not the original error you posted, so I still think that some sanitization is missing somewhere. Can you maybe post a sample of a scrape in the text format (removing all confidential info)? That would give me a starting point to dig deeper into this.
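For anyone following along, one hedged way to capture such a sample is to curl the target directly and redact sensitive label values before attaching it; the target address and the sed pattern below are placeholders to adapt:

```bash
# Dump a single scrape in the text exposition format, redacting one
# hypothetical label (instance) as an example of stripping confidential info.
curl -s http://<target-host>:<port>/metrics \
  | sed 's/instance="[^"]*"/instance="REDACTED"/g' \
  > scrape-sample.txt
```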
ping @ashepelev
@krasi-georgiev
As you can see, the label
We had lots of such samples in the scrape, which led to enormous growth of the block index file.
Essentially every scrape produced new label values. I guess Prometheus wasn't ready for that.
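A hedged way to confirm this kind of unbounded cardinality is to ask the query API which metric names carry the most series (assuming Prometheus on localhost:9090; note that this query can itself be expensive on a large instance):

```bash
# Top 10 metric names by series count; labels whose values change on every
# scrape will push their metric towards the top of this list.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, count by (__name__)({__name__=~".+"}))'
```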
Closing as it was due to unbounded label cardinality.
simonpasquier closed this on Nov 9, 2018
Still curious about those error log messages. Could it be that the system didn't have any more memory to take additional symbol entries, so it failed to add one, causing the subsequent read of it to fail as well?

ashepelev commented on Oct 18, 2018 • edited
Bug Report
What did you do?
Using Prometheus with 2000+ targets, 41k+ samples/sec, 10 Mbit/s downlink
What did you expect to see?
Stable work
What did you see instead? Under which circumstances?
Compaction failures, increasing RAM consumption, WAL growth, OOM shutdown
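As a side note, the WAL growth mentioned here can be tracked on disk over time; a minimal sketch, with the data directory path being a placeholder for the actual --storage.tsdb.path:

```bash
# Print the WAL directory size every 60 seconds (path is hypothetical).
watch -n 60 'du -sh /path/to/prometheus-data/wal'
```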
Environment
We use two Prometheus servers with the same configuration (targets, consul_sd_config, alertmanagers, etc.). Luckily, compaction breaks on only one of them at a time, so both are never affected simultaneously.
For the period covered by the following graph, we didn't detect any "compaction failed" errors in the logs:


We lived with 16 GB of RAM per node for a very long time, and then it went out of control.
If any additional metrics are needed, we can collect them.