error tailing WAL remote_write #5347
Comments
simonpasquier added the kind/bug and component/remote storage labels on Mar 13, 2019
I've also tried deleting the WAL records and restarting Prometheus, but even after that I still get the same errors.
Sorry you hit an issue, @andridzi. Can you post the logs from your Prometheus at startup, please? Also, it might be worth grabbing a copy of the WAL if it's not too late.
logs from startup
andridzi closed this on Mar 13, 2019
and then error messages
andridzi reopened this on Mar 13, 2019
Is this still happening? When we hit a corrupt WAL in the remote_write code, we back off and try to replay the WAL from the last checkpoint. Hopefully a new checkpoint has been created and the WAL has been able to tail successfully. It's strange that the startup didn't detect any corruption (although not sure about
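For readers following along, the back-off-and-replay behaviour described in this comment can be pictured roughly as below. This is only an illustrative Python sketch, not the actual Prometheus code (which is written in Go). The names `CorruptRecordError`, `tail_with_retry`, `read_from` and `last_checkpoint` are hypothetical, and the retry limit and delays are assumed values chosen for the example.

```python
import time


class CorruptRecordError(Exception):
    """Hypothetical error raised when a WAL record cannot be decoded."""


def tail_with_retry(read_from, last_checkpoint, max_attempts=5,
                    base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Tail the WAL, restarting from the newest checkpoint after corruption.

    read_from(start) and last_checkpoint() are caller-supplied callables
    (assumptions for this sketch, not a real Prometheus API).
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        # Ask again on every attempt: a newer checkpoint may exist by now,
        # which would let the replay skip past the corrupt segment.
        start = last_checkpoint()
        try:
            return read_from(start)
        except CorruptRecordError:
            if attempt == max_attempts:
                raise  # give up and surface the corruption to the caller
            sleep(delay)  # back off before replaying
            delay = min(delay * 2, max_delay)  # exponential, capped
```

Each retry re-reads everything after the checkpoint, so repeated corruption shows up as repeated bursts of records read rather than steady progress.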
Unfortunately, that is not possible for security reasons.
@andridzi I understand. Can you post a screenshot of
Yeah, that doesn't look very happy. The large spikes in records read show the retries replaying the WAL each time it hits corruption. Can you tell me a bit about the system you're running on? What OS and disk are you using?
It is running in a k8s cluster.
Oh, that's interesting. While I doubt we're seeing corruption from that, I could imagine a possibility where we don't play well with gluster from an fsync point of view. Unfortunately, I can't think of a way to test this. Perhaps you could spin up another Prometheus using local disks? If you restart this Prometheus, does the issue persist?


andridzi commented Mar 12, 2019
Bug Report
What did you do?
After upgrading from 2.7.0 to 2.8.0, I started getting WAL tailing errors.