WAL remote_write starts falling behind after config reload #5299
Comments
From our logs, we can see the remote write code thinks the WAL is corrupt:
The second change here might help: #5300
We grabbed a copy of the WAL and tried to replay it; multiple segments were corrupt. I struggle to see how this has anything to do with the WAL code. When I restarted the Prometheus server, it also found corruption and repaired the WAL:
Even after this, remote_write was still broken:
I think #5300 fixed this.
tomwilkie closed this Mar 5, 2019
andridzi commented Mar 12, 2019
Hi @tomwilkie
@andridzi when you upgraded, did Prometheus repair the WAL when it started on v2.8? If you copy your current WAL dir somewhere else and start a new Prometheus using that copied WAL, does it attempt to repair the WAL? Those messages indicate corruption in the WAL.
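The copy-and-restart check suggested above could be scripted roughly as follows. This is only a sketch under assumptions: the helper names `stage_wal_copy` and `prometheus_command`, the scratch directory prefix, and the `127.0.0.1:9091` listen address are made up for illustration, and it assumes the WAL lives in the `wal` subdirectory of the directory passed to `--storage.tsdb.path`. Copying from a live server may also catch a segment mid-write, so a stopped server or a snapshot is a safer source.

```python
import shutil
import tempfile
from pathlib import Path


def stage_wal_copy(data_dir, scratch_root=None):
    """Copy <data_dir>/wal into a fresh scratch data directory and return its path.

    Hypothetical helper: the original WAL is only read, never modified.
    """
    src = Path(data_dir) / "wal"
    if not src.is_dir():
        raise FileNotFoundError(f"no WAL directory at {src}")
    # mkdtemp creates a new, empty directory, so nothing can be overwritten.
    scratch = Path(tempfile.mkdtemp(prefix="prom-wal-copy-", dir=scratch_root))
    shutil.copytree(src, scratch / "wal")
    return scratch


def prometheus_command(scratch, config_file, listen="127.0.0.1:9091"):
    """Build the argv for a throwaway Prometheus that reads the copied WAL.

    The listen address is an assumed value, chosen only to avoid clashing
    with a server already running on the default port.
    """
    return [
        "prometheus",
        f"--config.file={config_file}",
        f"--storage.tsdb.path={scratch}",
        f"--web.listen-address={listen}",
    ]
```

Running the returned command and watching the startup log for repair messages would show whether the copied WAL is treated as corrupt, without touching the production data directory.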
andridzi commented Mar 13, 2019
I've also tried deleting the WAL records and starting Prometheus, and even after that I still get the same errors. P.S.: after copying the current WAL dir and starting a new instance, there were no attempts to correct the WAL records.
tomwilkie commented Mar 4, 2019
This is running master plus #5286