'compaction failed' - prometheus suddenly ate up entire disk #3487
Comments
Thanks for the report, looking into it. @fabxc My best guess here is that we are trying to compact, but because of the out-of-order append we are abandoning it mid-way without cleaning up, filling the disk space. A stopgap would be to clean up the compaction directory if compaction fails. I will look into it now. Why there is an out-of-order append in the compaction path at all is another question altogether.
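A minimal sketch of the stopgap described above, not the actual prometheus/tsdb code: wrap the compaction step and remove the half-written temporary block directory when it fails, so repeated failures cannot fill the disk. The function and directory names here are hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// compactBlocks stands in for the real compaction routine, which writes the
// new block into a temporary "<ulid>.tmp" directory before renaming it into
// place. Here it only simulates the failure reported in this issue.
func compactBlocks(tmpDir string) error {
	// ... write chunks, index and meta.json into tmpDir ...
	return fmt.Errorf("write compaction: add series: out-of-order series added")
}

// compactWithCleanup is the stopgap: if compaction fails, remove the
// temporary directory instead of leaving it behind on disk.
func compactWithCleanup(tmpDir string) error {
	if err := compactBlocks(tmpDir); err != nil {
		if rmErr := os.RemoveAll(tmpDir); rmErr != nil {
			return fmt.Errorf("compaction failed: %v (cleanup failed: %v)", err, rmErr)
		}
		return fmt.Errorf("compaction failed: %v", err)
	}
	return nil
}

func main() {
	if err := compactWithCleanup("data/01ABCDEF.tmp"); err != nil {
		fmt.Println(err)
	}
}
```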
TimSimmons commented Nov 20, 2017
This happened to me today; the data directory went from ~3GB to ~300GB. These are the first lines related to compaction:
I see that log 586 times, and there were 588
gouthamve self-assigned this Nov 20, 2017
gouthamve added component/local storage, kind/bug, priority/P1 labels Nov 20, 2017
vsakhart commented Nov 21, 2017
Just wanted to comment that I'm also facing this issue.
gouthamve added a commit to gouthamve/tsdb that referenced this issue Nov 21, 2017
gouthamve added a commit to gouthamve/tsdb that referenced this issue Nov 21, 2017
hectorag commented Nov 22, 2017
Hi, I'm also having this issue. Here is the error log:
msg="compaction failed" err="persist head block: write compaction: add series: out-of-order series added with label set "{name=\"container_fs_inodes_free
yinchuan commented Nov 23, 2017
Thanks for reporting.
Hi, thanks. I have been looking into this, but I am not able to reproduce it, which is making things hard. Would any of you be willing to ship your WAL directory to us? That would help reproduce this and make things much easier. Also, is this happening after a restart or crash of Prometheus? @hectorag @vsakhart @TimSimmons
gouthamve added a commit to gouthamve/tsdb that referenced this issue Nov 23, 2017
alxmk referenced this issue Nov 23, 2017: Retention cutoff failed leading to compaction failed/spiralling disk usage #3506 (Closed)
@hectorag Thanks! Downloaded, will let you know how this goes. Also, feel free to hop onto the Prometheus IRC channel next time you want to send an ephemeral message :)
Is there any progress on this issue?
homelessnessbo commented Nov 28, 2017
Same here. I was about to upgrade to 2.0 when I saw it. Any news?
frafranck commented Nov 28, 2017
I got the same problem, hope it can be resolved soon. The problem is not just disk space: it also increases the head chunk count, the block duration, and memory usage. I was at around 3 GB of RAM used, and Prometheus grew linearly to 7 GB (in a few hours) and kept going. For information, I did the following to get back to a good running Prometheus without errors:
After that, Prometheus got better, with no error logs.
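For reference, a rough sketch of the kind of manual cleanup discussed in this and the following comments: with Prometheus stopped, remove the leftover <hash>.tmp directories from the data directory, leaving the wal directory, the lock file, and completed blocks untouched. The dataDir path below is an assumption; adjust it to your deployment.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	dataDir := "/prometheus/data" // hypothetical path to the TSDB data directory

	entries, err := os.ReadDir(dataDir)
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		// Only touch leftover temporary block directories; never the wal
		// directory, the lock file, or completed block directories.
		if e.IsDir() && strings.HasSuffix(e.Name(), ".tmp") {
			p := filepath.Join(dataDir, e.Name())
			fmt.Println("removing", p)
			if err := os.RemoveAll(p); err != nil {
				log.Printf("failed to remove %s: %v", p, err)
			}
		}
	}
}
```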
hectorag commented Nov 28, 2017
I also have plenty of these .tmp directories. Is there any reason why Prometheus is generating all of them? Is this normal? Is it safe to remove them, as @frafranck proposed?
TimSimmons commented Nov 28, 2017
I removed all the .tmp directories.
Sorry for taking so long on this. While I could see what was going wrong, I was not able to figure out why, maybe because it was already fixed upstream. Full description below.

Here's what was happening: two series were ending up with the same seriesID in the WAL, which was causing issues. Two series should never have the same seriesID, because we atomically increment the seriesID. When reading the WAL, we store the highest seriesID we saw in it and then increment from there for any new series we see. This ensures the seriesID can never be the same for two different series.

When I looked at the data (thanks @hectorag), I could see that some seriesIDs at the end of a segment near a restart were also occurring at the beginning of the next segment. Essentially, the Prometheus server somehow never saw the series at the end of the WAL, otherwise the next seriesIDs would have been higher. But on a later restart it did see them, causing two series to have the same ID.

This was the culprit (rather, the fix): prometheus/tsdb#204. That change was not included in 2.0, which meant that sometimes when we restart, there is still some data in the Linux page cache that was not flushed to disk until later. So on the immediate restart, we never read that data! I have a test that reproduces this behaviour if that change is reverted. While this fixes the case of a clean shutdown, it might still be an issue during crashes. Will have a PR with the fix out early tomorrow. Thanks for your patience!

Also, yes, you can happily delete the .tmp directories.
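A small sketch of the invariant described above (not the actual tsdb code): the next seriesID is derived from the highest ID seen while replaying the WAL, so if the tail of the WAL is invisible at replay time (for example, still sitting unflushed in the page cache), the counter starts too low and a later series can be handed an ID that already exists once that tail becomes visible.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// head models only the seriesID bookkeeping relevant to this bug.
type head struct {
	lastSeriesID uint64 // highest seriesID observed so far
}

// replayWAL records the highest seriesID found in the visible part of the WAL.
func (h *head) replayWAL(seriesIDs []uint64) {
	for _, id := range seriesIDs {
		if id > h.lastSeriesID {
			h.lastSeriesID = id
		}
	}
}

// newSeriesID atomically hands out the next ID; it is unique only if
// replayWAL really saw every ID that was already written.
func (h *head) newSeriesID() uint64 {
	return atomic.AddUint64(&h.lastSeriesID, 1)
}

func main() {
	h := &head{}
	// Suppose series 1..100 were written before the restart, but the records
	// for 99 and 100 were still in the page cache and not yet on disk.
	h.replayWAL([]uint64{1, 2, 98}) // 99 and 100 never seen
	fmt.Println(h.newSeriesID())    // prints 99: collides with an ID already in the WAL
}
```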
gouthamve referenced this issue Nov 30, 2017: Fdatasync on read to flush any unflushed data. #218 (Closed)
gouthamve closed this in prometheus/tsdb#207 Nov 30, 2017
gouthamve added a commit to gouthamve/tsdb that referenced this issue Nov 30, 2017
gouthamve added a commit to gouthamve/tsdb that referenced this issue Nov 30, 2017
zegl commented Jan 12, 2018
I just got to experience this issue. Are there any plans for a bugfix release to 2.0 any time soon?
@zegl I think 2.1 is coming in the next 1-2 weeks.
zegl commented Jan 12, 2018
@krasi-georgiev Thanks for the reply. Deleting all […] Soon after Prometheus has started up again, new tmp folders are created. This is the full log from Prometheus since the cleanup of all […]
One line from the boot procedure is especially concerning:
Is this error the root cause? Is it safe to delete the […]? This is what the fs looked like before the start of Prometheus:
And this is what it looks like half an hour later:
Currently the […] Is there anything we can do to more permanently resolve this issue? Update: Setting […]
Looks like you're hitting #3190, which depends on prometheus/tsdb#238 and isn't solved yet.
hackmad pushed a commit to LoyaltyOne/kafka-infra that referenced this issue Apr 5, 2018
var23rav commented Apr 19, 2018
I am facing the same issue while reloading Prometheus (force-closed and restarted from the command prompt on Windows, or using an HTTP POST request to http://localhost:9090/-/reload). Note: during the Prometheus restart, meta.json gets deleted automatically from the ./data/*/ folders. Any fix for this?
Same with the latest release?
Is it writing to local disk or NFS?
var23rav commented Apr 27, 2018
@krasi-georgiev Sorry for the late reply. I was using the old version (2.0.0).
strowi commented May 4, 2018
Hi, it seems I am experiencing the same issue with 2.2.1 writing to NFS.
I can't remember what the issue was, but I remember NFS misbehaving on all versions.
strowi commented May 8, 2018
Yes, here too. Moved to 'emptyDir' and everything seems to be working fine for now. If needed, I can provide more info.
ntindall referenced this issue Jul 2, 2018: [Intermittent] compaction failed after upgrade from 2.2.1 to 2.3.0 #4292 (Closed)
lock bot commented Mar 22, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
gregorycerna commented Nov 17, 2017
What did you do?
Four or five days ago, I upgraded to Prometheus v2, running in a 4-node Docker swarm.
What did you expect to see?
Prometheus metrics data to grow fairly slowly, at roughly the same rate as with v1.8 (~1 GB/month).
What did you see instead? Under which circumstances?
In the past 24 hours, the size of my Prometheus data suddenly and inexplicably increased more than 1500x, from ~500 MB to 771 GB, completely filling up my disk.
Environment
I'm not sure what caused this, as I haven't modified any of Prometheus's configs since I got v2 up and running smoothly. I'm running Prometheus in a Docker container in swarm mode, so my best guess is that something got corrupted when its container was killed and subsequently restarted on another host. Prometheus's data is stored on an NFS share available to all hosts, which is then mounted into the container. When checking the data folder, the vast majority of folders in it are <randomhash>.tmp folders; the only other contents besides the tmp folders are two folders with hashes for names (but no .tmp), along with a wal folder and a lock file.
System information:
Linux 4.13.0-1-amd64 x86_64
Prometheus version:
(all logs retrieved using docker service logs monitor_prom)
Logs on Prometheus startup
An example of the countless compaction failed errors from before the disk was filled
compaction failed errors from after the disk was filled