crash on startup: open /data/*/chunks no such file or directory #5138
Comments
Did the VM suddenly stop or restart?
Yes, it's a preemptible instance in GCE.
krasi-georgiev added the component/local storage label Feb 26, 2019
I think we have discussed this, and the usual decision is to hard fail on all sorts of unrepairable data corruption. Maybe @brian-brazil or @brancz can give some more info on why it is better to hard fail than to auto-delete the corrupted blocks.
It seems reasonable to do a "repair" type thing on non-WAL files as well, at least in known-safe/recoverable scenarios (I'm not necessarily saying this case is one, but generally). @haraldschilly do you still have the "corrupted" storage files, so we could examine them and judge the situation better?
It sounds like an entire directory went missing.
The main question is: what should the behaviour be in such cases, and why?
I would first like to understand what "such cases" means. I'm not sure there is a general answer.
This would be hard. So far "such cases" are irreproducible. @haraldschilly was this a one-off, or can you help with steps to reproduce?
Agreed, we can only work with these things if we can either reproduce a case or a storage snapshot is provided that has the respective problem. Otherwise I agree, there is little we can do.
I just double-checked the code and can't see any way to produce an empty chunks dir, as all writes to that dir are f.Sync()-ed, so even a host crash shouldn't leave it empty. Maybe the bug is fixed in a more recent Prometheus version. I see that you are running 2.6; would you mind testing with the latest release and reopening if you still experience the same problem?
I followed the ticket; congratulations on actually figuring out the missing detail. If this happens with a newer release that includes this fix, I'll open a new ticket!
Thanks. It was solved thanks to @pborzenkov's pointers. Closing this one, but feel free to reopen if you can replicate with a more recent version, and we will continue the troubleshooting. Please try to include details of how it happened, with steps to replicate.
haraldschilly commented Jan 26, 2019
Proposal
Prometheus crashes each time it tries to start up. This is a follow-up to #4058 (a similar case I reported, but this time the setup is different).
Bug Report
System information:
Linux 4.15.0-1026-gcp x86_64
Setup: it runs in a Docker container on a VM in GCE. The filesystem is btrfs (in #4058 it was ext4). With btrfs I hoped to eliminate partially written files or other inconsistencies. With Prometheus stopped and the volume unmounted, it is a healthy filesystem:
Actions
What I did was to just delete /data/01D1YDSW3AKX2X4MA8WT2VXA7F and Prometheus started up fine! I would hope it would do this on its own :-)