WAL Corruption - recover.go does not recover #5369
Comments
Can you reproduce with a newer version? It should delete everything after that point automatically.
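(For context, the recovery behaviour described here is roughly: read records sequentially and, on the first one that fails to decode, truncate the log at that offset so everything after the bad record is dropped. The sketch below only illustrates that idea; `readRecord` and `recoverWAL` are hypothetical helpers, not the actual repair.go code.)

```go
package walrepair

import (
	"io"
	"os"
)

// readRecord is a stand-in for a real WAL record decoder: it returns the next
// record, the number of bytes it occupied, or an error if the record is corrupt.
func readRecord(f *os.File) (rec []byte, n int, err error) {
	// ... hypothetical decoding ...
	return nil, 0, io.EOF
}

// recoverWAL reads records until it hits one it cannot decode and truncates
// the file there, dropping the bad record and everything after it.
func recoverWAL(f *os.File) error {
	var offset int64
	for {
		_, n, err := readRecord(f)
		if err == io.EOF {
			return nil // clean end of log, nothing to repair
		}
		if err != nil {
			return f.Truncate(offset) // drop from the first corrupt record onward
		}
		offset += int64(n)
	}
}
```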
I don't know what caused the error, really. Any idea where this line:
That's unrelated; it's the shutdown stopping running queries.
Any ideas about the SIGKILL, then? That was not a manual action.
That would have come from outside Prometheus.
OK. I'll try to reproduce with both the new and old versions of Prom and see what I come up with. I suspect it has to do with an overload of the query engine.
@kiefersmith we've had a few fixes to the WAL in recent versions. Did you manage to test the latest version?
I was able to replicate the crash scenario: it seems sufficiently large queries coming from Grafana will hit the rate limit and overwhelm the query engine. We use a custom version of Prometheus, so I have not yet been able to test the latest version (I still need to put together the required PR). I may be able to get to it next week.
OK, I'd appreciate an update, and if the bug is still there I'll try to replicate it and work on a fix.
@kiefersmith did you have the chance to test the latest 2.8 Prometheus release?
I did test out 2.8, but could not replicate the exact issue. I tried writing garbage to the WAL and deleting lines, but that ended up being a bit too destructive.
I'll try sending large queries to the query API later today and see if that turns up anything interesting.
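(For reference, a large range query can be sent straight at the HTTP API without going through Grafana. The endpoint and parameters below are the standard /api/v1/query_range ones; the specific query, time range, and step are only placeholders chosen to make the request expensive.)

```go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"net/url"
	"time"
)

func main() {
	// A deliberately heavy range query: broad aggregation, long range, small step.
	params := url.Values{}
	params.Set("query", `sum by (instance) (rate(node_cpu_seconds_total[5m]))`) // placeholder query
	params.Set("start", fmt.Sprint(time.Now().Add(-7*24*time.Hour).Unix()))
	params.Set("end", fmt.Sprint(time.Now().Unix()))
	params.Set("step", "60") // 60s resolution over 7 days -> ~10k points per series

	resp, err := http.Get("http://localhost:9090/api/v1/query_range?" + params.Encode())
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := ioutil.ReadAll(resp.Body)
	fmt.Println(resp.Status, len(body), "bytes in response")
}
```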
It would seem that the latest version is indeed better at recovering from failure! Feel free to call this one closed. I think it's supervisord that sends the SIGKILL when I lock up Prometheus. Let me know if there's any more information you guys would like.
Thanks for the update. Feel free to reopen if you can replicate it, and we'll continue the troubleshooting.
kiefersmith commented Mar 15, 2019 • edited
Bug Report
What did you do?
I noticed Grafana lost connection to Prom. Looking at the Prom logs (below), there seems to be corruption in the WAL. This error made Prometheus's TSDB completely unavailable.
What did you expect to see?
No errors.
What did you see instead? Under which circumstances?
I saw an error in Grafana:
tsdb.HandleRequest() error ... dial tcp <prom_ip_addr>:9090: connect: connection refused
and errors in the Prom logs.
Environment
System information:
Linux 4.19.12-1.el7.elrepo.x86_64 x86_64
Prometheus version:
prometheus, version 2.3.3 (branch: release-2.3.3, revision: c94ba49)
build user: timo@umbra
build date: 20180814-13:05:03
go version: go1.10.3
Prometheus configuration file:
NULL
Logs:
The repair.go loop continued until I deleted the contents of the WAL directory (as suggested in other issues). This happened on our dev platform, so the impact was not huge, but for our production-grade platforms deleting the WAL is less acceptable.
How can I go about establishing the root cause? What is the best course of action to take in this situation (ideally avoiding deleting the whole WAL dir)?
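(One way to narrow down where the corruption starts, rather than wiping the whole WAL dir, is to walk it with the tsdb WAL reader and note where reading first fails; only the segments from that point on would then be candidates for removal. A rough sketch, assuming the prometheus/tsdb wal package's NewSegmentsReader/NewReader API; illustrative only, not verified against this setup.)

```go
package main

import (
	"fmt"
	"os"

	"github.com/prometheus/tsdb/wal"
)

// Reads the WAL from the given directory and reports how many records decode
// cleanly before the first error, giving a rough idea of where corruption begins.
func main() {
	dir := os.Args[1] // e.g. /path/to/data/wal

	sr, err := wal.NewSegmentsReader(dir)
	if err != nil {
		fmt.Println("open segments:", err)
		os.Exit(1)
	}
	defer sr.Close()

	r := wal.NewReader(sr)
	records := 0
	for r.Next() {
		records++
	}
	fmt.Printf("read %d records cleanly\n", records)
	if err := r.Err(); err != nil {
		fmt.Println("first read error:", err)
	}
}
```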
Thank you!
#4705 #4603