WAL truncation fails in weird ways under load #4578
Comments
beorn7 added the kind/bug and component/local storage labels on Sep 5, 2018
With regard to the metrics about this:
It should be fixed by #4692 (once merged), which includes prometheus/tsdb#396 and adds the following metrics:
@beorn7 the new WAL metrics have been added.
I'm happy to report if I run into the issue again. Once I get to it, I'll add alerts for the new metrics. (That's yet another "ASAP" on my growing list...)
Shall we close and revisit if needed?
Sure. Whatever helps most with bookkeeping...
beorn7 commented Sep 5, 2018
Bug Report
What did you do?
Ran a beefy Prometheus server (128GiB RAM) under high load: 15M head series, and in particular two rules that aggregated 5.5M time series into a still quite large number of new series.
What did you expect to see?
Overload/degradation is probably expected in that case, but I would like to see some metrics suggesting it, and I would not like to see data corruption and the like.
What did you see instead? Under which circumstances?
WAL truncation failed with error messages like this:
I have never seen this error message reported in any other issue.
Blocks were still written, but the WAL grew forever until the disk filled.
Restarting the server cleared the WAL (leaving a gap in the time series) and presumably also got rid of the corruption.
I could not find any metric that gets incremented on a failure to truncate the WAL. Such a metric would at least make it possible to alert on a non-truncating WAL before the disk runs full.
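Once a failure counter exists, an alerting rule along the following lines could catch a non-truncating WAL early. This is a sketch only: it assumes a counter named prometheus_tsdb_wal_truncations_failed_total (as eventually added via prometheus/tsdb#396), and the window and severity are illustrative, not recommendations from this thread:

```yaml
groups:
  - name: wal.rules
    rules:
      # Fires if any WAL truncation failed within the last hour.
      - alert: PrometheusWALTruncationsFailing
        expr: increase(prometheus_tsdb_wal_truncations_failed_total[1h]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'WAL truncations failing on {{ $labels.instance }}'
          description: 'The TSDB WAL is failing to truncate; the WAL directory will grow until the disk fills.'
```

Pairing this with an ordinary disk-usage alert gives a second line of defense if the counter itself is never incremented.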
Environment
System information:
Linux 4.15.3+soundcloud2 x86_64
Prometheus version:
build user: root@5258e0bd9cc1
build date: 20180712-14:02:52
go version: go1.10.3
Logs: