Migrated to 2.4.0 - Crash during startup corrupted WAL "opening storage failed: read WAL: repair corrupted WAL: cannot handle error" #4603
Comments
Eric-Fontana-Bose
changed the title
Migrated to 2.4.0 - Crash during startup corrupted WAL
Migrated to 2.4.0 - Crash during startup corrupted WAL "opening storage failed: read WAL: repair corrupted WAL: cannot handle error"
Sep 13, 2018
|
What was the version before migrating to 2.4.0? |
gouthamve
referenced this issue
Sep 14, 2018
Closed
Migration might not be handling WAL corruptions correctly #379
|
@Eric-Fontana-Bose thanks for the report. Can you send me the corrupted WAL privately at kgeorgi at redhat.com so I can reproduce locally and find the culprit? |
krasi-georgiev
added
kind/bug
component/local storage
labels
Sep 14, 2018
|
Is this the contents of the /data dir? Where is it |
|
yes in |
|
Did you find it? It should be some files named in a format like 0000001, 0000002 (the numbering is not important). |
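For readers trying to locate those files, here is a minimal Python sketch that lists the numerically named files in a WAL directory. The function name `list_wal_segments` is hypothetical, and the only assumption about the layout is the one stated in the comment above: segment files have all-digit names. The directory path is whatever your own data directory uses.

```python
import re
from pathlib import Path

# Assumption: WAL segment files have names made only of ASCII digits,
# as described in the comment above. Anything else is ignored.
SEGMENT_NAME_RE = re.compile(r"[0-9]+")


def list_wal_segments(wal_dir):
    """Return (segment_number, path, size_in_bytes) tuples sorted by number."""
    segments = []
    for entry in Path(wal_dir).iterdir():
        # Skip subdirectories and files whose names are not purely numeric.
        if entry.is_file() and SEGMENT_NAME_RE.fullmatch(entry.name):
            segments.append((int(entry.name), entry, entry.stat().st_size))
    return sorted(segments, key=lambda item: item[0])
```

Calling `list_wal_segments("data/wal")` (the path is an example) gives the segments in numeric order, which makes it easier to see which files to archive and send.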
lostick
commented
Sep 18, 2018
•
|
@krasi-georgiev Seeing a similar issue since we bumped from
I've sent you the corrupted |
|
@lostick I haven't received it. Maybe it is too big to send by email. Maybe upload it somewhere and send me an invitation. BTW, why do you think the issue is the same? |
lostick
commented
Sep 18, 2018
•
|
@krasi-georgiev yes i'm unable to send it, redhat rejects the file (whether it's attached or as a wetransfer link).
We had no memory issues when previously running on |
That is weird, we don't have any restrictions AFAIK. You can send it to krasi.root at gmail.com anyway. Although the issue is with the WAL, it is different from the problem in this issue. Could you please open a new issue with steps to reproduce so we don't mix them up and I can look into it? 2.4 had a new WAL implementation, so it is probably using more memory, causing this problem on your system. You can increase the memory and that should solve the problem, but once you open the issue I will also check if there is anything we can do about it. What is strange is that I ran a benchmark before adding the new WAL and it actually showed reduced memory usage. |
VR6Pete
commented
Sep 18, 2018
|
I am also seeing this issue and as a result have lost one of my servers in my environment. Is there anything I can do to assist resolution of the issue? |
|
@VR6Pete open an issue with steps to reproduce and ping me. |
|
or if you are sure it is the same as the original one reported here please add steps to reproduce and send me the WAL |
VR6Pete
commented
Sep 18, 2018
|
How do I determine which is the corrupted WAL? Nothing in the logs states which one is at fault.
caller=head.go:415 component=tsdb msg="encountered WAL error, attempting repair" err="read records: corruption in segment 573 at 64205313: unexpected checksum 5327a8, expected 970642dd"
caller=main.go:617 err="opening storage failed: read WAL: repair corrupted WAL: cannot handle error"
Happy to open another issue, or send you the WAL file directly, as I feel it is related to this issue. |
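The first log line in the comment above does name a segment number and a byte offset ("corruption in segment 573 at 64205313"). As a rough sketch, a few lines of Python can pull those two numbers out of a log line. The function name `find_corruption` is hypothetical, and the pattern is based only on the message format quoted in this thread; other versions may word the message differently.

```python
import re

# Pattern taken from the log message quoted in this thread; it is an
# assumption that other Prometheus versions use the same wording.
CORRUPTION_RE = re.compile(
    r"corruption in segment (?P<segment>\d+) at (?P<offset>\d+)"
)


def find_corruption(log_line):
    """Return (segment_number, byte_offset), or None if the line has no match."""
    match = CORRUPTION_RE.search(log_line)
    if match is None:
        return None
    return int(match.group("segment")), int(match.group("offset"))


line = (
    'caller=head.go:415 component=tsdb msg="encountered WAL error, '
    'attempting repair" err="read records: corruption in segment 573 at '
    '64205313: unexpected checksum 5327a8, expected 970642dd"'
)
print(find_corruption(line))  # (573, 64205313)
```

The segment number can then be compared with the numeric file names in the WAL directory to find the file the error refers to.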
|
I'm trying to get the WAL, but the issue is I can't get into the pod long enough to get it, prometheus |
lostick
commented
Sep 18, 2018
Thanks, we have already doubled the memory, with no luck so far as it keeps growing. |
|
@VR6Pete send me everything from the data/wal folder.
@Eric-Fontana-Bose don't you have the data mounted on a persistent volume? If not, then this is what you should do to keep it across restarts. |
stefancrain
commented
Sep 18, 2018
|
We've also run into a similar issue. This morning we went to restart the Prometheus container and were faced with the issue.
Prometheus version:
Previous version:
Prometheus configuration file:
Logs:
Directory:
|
|
@stefancrain please send me the wal files. Ping to @fabxc who can probably fix this a lot quicker with his magic ;) |
krasi-georgiev
referenced this issue
Sep 18, 2018
Open
Add random restarts to check how it recovers #154
stefancrain
commented
Sep 18, 2018
|
@krasi-georgiev sent. please reach out if there is anything else I can do to help. |
sc250024
commented
Sep 18, 2018
•
|
@krasi-georgiev Same issue here. Using Previous Prometheus version: 2.3.2
Config file
|
stefancrain
commented
Sep 18, 2018
|
This is causing a production monitoring outage for us. Could you suggest a fix to get back online? Would deleting the content of the wal folder and restarting resolve this? |
|
@stefancrain Sorry that is the case. Deleting the WAL and restarting would fix it. Now that we have an offending WAL, we'll have a fix quite soon, hopefully by tomorrow. It would also help if you could forward the mail you sent to krasi to me as well, at gouthamve [at] gmail.com. |
stefancrain
commented
Sep 18, 2018
|
@gouthamve Deleting the wal files allowed prometheus to come back online. Thanks for that! |
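A variation on the workaround above, sketched in Python: instead of deleting the WAL files outright, move them to a backup directory so they can still be sent to the maintainers. This is not what was done in the thread, only a hedged alternative. The function name `set_wal_aside` and both paths are hypothetical, and the sketch assumes Prometheus is stopped while it runs. As with deleting, any samples that existed only in the WAL will not be loaded after the restart.

```python
import shutil
from pathlib import Path


def set_wal_aside(wal_dir, backup_dir):
    """Move every entry out of wal_dir into a new backup_dir.

    Leaves wal_dir in place but empty, which mirrors "deleting the
    content of the wal folder" while keeping the files for analysis.
    Returns the names of the moved entries.
    """
    wal_dir = Path(wal_dir)
    backup_dir = Path(backup_dir)
    if not wal_dir.is_dir():
        raise FileNotFoundError(f"WAL directory not found: {wal_dir}")
    # Refuse to reuse an existing backup directory so that nothing
    # from an earlier backup is overwritten.
    backup_dir.mkdir(parents=True, exist_ok=False)
    moved = []
    for entry in sorted(wal_dir.iterdir()):
        shutil.move(str(entry), str(backup_dir / entry.name))
        moved.append(entry.name)
    return moved
```

For example, `set_wal_aside("data/wal", "wal-backup-2018-09-18")` (both paths are placeholders) empties the WAL directory and keeps the old segments next to it.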
krasi-georgiev
referenced this issue
Sep 18, 2018
Open
Add tsdb.Scan() to unblock from a corrupted db. #320
gouthamve
referenced this issue
Sep 19, 2018
Merged
Make sure WAL Repair can handle wrapped errors #389
|
Hi, thanks for sending the WALs everyone, it made finding and fixing the issue very easy. The fix is out here: prometheus/tsdb#389 and will make it into the |
fabxc
closed this
in
prometheus/tsdb#389
Sep 19, 2018
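The title of the merged PR ("Make sure WAL Repair can handle wrapped errors") suggests that the repair step did not recognise a corruption error once it had been wrapped with extra context such as "read records: ...". The following Python sketch is only an analogy for that class of bug and its remedy. It is not the tsdb code, and every name in it (`CorruptionError`, `root_cause`, `repair`, `read_records`) is hypothetical.

```python
class CorruptionError(Exception):
    """Hypothetical error carrying the location of a corrupted record."""

    def __init__(self, segment, offset):
        super().__init__(f"corruption in segment {segment} at {offset}")
        self.segment = segment
        self.offset = offset


def root_cause(err):
    """Follow the __cause__ chain down to the innermost exception."""
    seen = set()
    while err.__cause__ is not None and id(err) not in seen:
        seen.add(id(err))
        err = err.__cause__
    return err


def repair(err):
    """Return (segment, offset) to repair from, unwrapping the error first."""
    # Checking isinstance(err, CorruptionError) directly would fail for a
    # wrapped error; looking at the root cause is what makes it repairable.
    cause = root_cause(err)
    if not isinstance(cause, CorruptionError):
        raise RuntimeError("cannot handle error") from err
    return cause.segment, cause.offset


def read_records():
    """Simulate a reader that wraps the corruption error with context."""
    try:
        raise CorruptionError(573, 64205313)
    except CorruptionError as exc:
        raise RuntimeError("read records") from exc


try:
    read_records()
except RuntimeError as err:
    print(repair(err))  # (573, 64205313)
```

Without the `root_cause` step, `repair` would see only the outer `RuntimeError` and give up, which matches the shape of the "cannot handle error" message reported in this issue.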
|
This did NOT make it into 2.4.1! When will it get fixed? We're busted. |
lostick
commented
Sep 20, 2018
•
|
They will probably get a fix for the fix in |
|
Hi, sorry, due to an error on my side, it did not make it in. We're in the process of releasing 2.4.2 but due to other issues on Windows we've discovered, we had to delay it until they are fixed. The PR is out and reviewed and we hope the release of Sorry for the experience you had. See: prometheus/tsdb#392 which is the fix for: #4635 |
stefancrain
commented
Sep 21, 2018
|
2.4.2 looks to have been released ~7 hours ago with the fix for this issue. Our team has upgraded and will report back if we run into this issue again. |
sumkincpp
commented
Sep 21, 2018
|
Hit the same problem on 2.4.0; 2.4.2 repaired the DB and fixed the issue by repairing the corrupted block:
|
Place1
commented
Oct 7, 2018
|
I've hit this as well on
|
gouthamve
referenced this issue
Oct 7, 2018
Closed
WAL requires more robust corruption handling #4705
lxkaka
commented
Nov 26, 2018
|
same issue with version 2.5.0 |
|
@lxkaka you need to open a new issue with some details, the error message, and how to replicate it. |
lxkaka
commented
Nov 26, 2018
|
@krasi-georgiev OK, for now I have deleted the WAL directory to resolve this issue. |
anisimovyuriy
commented
Mar 11, 2019
|
We have the same issue in our dev environment after we added multiple consul integrations and many of the services failed to scrape because of service-side issues. Deleting the WAL directory won't help in our case, because when we do this the same issue is reproduced within a day... |
|
Locking this issue now as all these will need a new issue with more details for some proper troubleshooting. |
Eric-Fontana-Bose commented Sep 13, 2018
Crashing on startup after moving to 2.4.0
Bug Report
What did you do?
Prometheus version:
2.4.0
Prometheus configuration file: