Prometheus does not start up after corrupted meta.json file #4058
Comments
I did a PR that should address this issue. @fabxc and @gouthamve will advise soon.
TimSimmons commented Apr 16, 2018
I ran into this today; I had a single Prometheus server with the exact same behavior as above.
I moved the directory out of the data directory and Prometheus restarted happily.
gouthamve added the component/local storage label May 9, 2018
krasi-georgiev referenced this issue May 10, 2018: Add tsdb.Scan() to unblock from a corrupted db. #320 (open)
bmihaescu commented Oct 2, 2018
I also faced this problem today.
Do you by any chance use NFS? There was some discussion about this in the past: NFS sometimes behaves oddly, and there wasn't anything we could do to prevent it, so NFS is considered unsupported.
Vlaaaaaaad commented Oct 8, 2018
@krasi-georgiev: I'm working with @bmihaescu, so I can comment on this. The error is from a kops-deployed cluster running on AWS using EBS.
That would be hard to troubleshoot. I would need some specific steps to replicate it. How often does it happen, and can you reproduce it with the latest release?
@Vlaaaaaaad, @bmihaescu: are you sure you have enough free space? (Suggested by Brian on IRC, so worth checking.)
Vlaaaaaaad commented Oct 9, 2018
@krasi-georgiev oh, that is likely to be the issue. We do have some free space, but not much. Is there any documentation on how much space should be free (a certain value, a percentage)? This is happening on two older clusters (k8s 1.10.3 and 1.9.6), with prometheus-operator v0.17.0 and Prometheus v2.2.1, so the issue might already be fixed in newer versions. Tagging @markmunozoz too.
I am not 100% sure, but logically I would say at least 5 times the size of your biggest block. By the way, there are plans to add storage-based retention, which should help use cases where storage is limited: prometheus/tsdb#343
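To make that rule of thumb concrete, here is a rough sketch that finds the biggest block under a data directory and prints the suggested headroom. The data path and the 5x multiplier are taken from this thread's discussion; treat both as assumptions, not official guidance.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// dirSize returns the total size of all files under path.
func dirSize(path string) (int64, error) {
	var total int64
	err := filepath.Walk(path, func(_ string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() {
			total += info.Size()
		}
		return nil
	})
	return total, err
}

func main() {
	dataDir := "/prometheus/data" // hypothetical data directory
	entries, err := os.ReadDir(dataDir)
	if err != nil {
		panic(err)
	}
	var biggest int64
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		size, err := dirSize(filepath.Join(dataDir, e.Name()))
		if err != nil {
			panic(err)
		}
		if size > biggest {
			biggest = size
		}
	}
	fmt.Printf("biggest block: %d bytes; suggested free headroom (5x): %d bytes\n",
		biggest, 5*biggest)
}
```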
Does anyone want to add anything else before we mark this as resolved? @haraldschilly, did you find the cause in your case?
This is implemented as part of the tsdb CLI scan tool, which is still in review.
krasi-georgiev closed this Nov 12, 2018
slomo commented Dec 10, 2018
I am running into the same issue on Prometheus 2.4.3 with Vagrant. When I suspend my machine, VirtualBox seems to crash; after the crash I reboot the machine, and usually one, but sometimes up to 90%, of my blocks end up with a corrupted meta.json.
I am not seeing this in production yet, I guess simply because my machines rarely ever crash.
I double-checked the code again, and the only way I could see this happening is when using NFS or another non-POSIX filesystem. @slomo, can you replicate this every time?
Are you maybe mounting a directory from the host to use as the data dir?
@krasi-georgiev I'm the one who originally reported this. In case it helps, this did happen on a GCE PD disk, mounted via Kubernetes.
Yeah, a GCE PD disk is fine.
krasi-georgiev reopened this Dec 10, 2018
Well, I don't remember seeing any logs about this with useful info. It usually happens when there is an OOM event and the kernel kills the Prometheus process, or when the whole VM is shut down. I think the main underlying reason is that ext4 isn't 100% atomic. This makes me think I should try ZFS or Btrfs.
slomo commented Dec 10, 2018
It is ext4 inside a VirtualBox VM. I would say it happens on every VirtualBox crash; I'll try to reproduce it.
Steps to reproduce would really help, so I can try to replicate it as well. Thanks!
@slomo any luck with steps to replicate this?
slomo commented Dec 12, 2018
Well, in my setup (which contains a lot of Consul SD hosts) I can reproduce it by resetting the VirtualBox VM. I tried to create a smaller setup with just a few static node_exporters being scraped, and I can't trigger the corruption anymore.
So you think it is related to the SD being used?
slomo commented Dec 13, 2018
@krasi-georgiev I think it would be jumping to conclusions a bit fast to say that SD is at fault, but it definitely requires a certain complexity to occur. I have 6 jobs with roughly 400 targets in total; all targets are added using service discovery with Consul. @haraldschilly, could you roughly describe your setup? Do you use service discovery, and how many hosts/applications do you monitor?
@slomo thanks for the update. Any chance you could ping me on IRC to speed this up?
haraldschilly referenced this issue Jan 26, 2019: crash on startup: open /data/*/chunks no such file or directory #5138 (closed)
@slomo maybe the main difference is the load. Try it with a higher load, e.g. 400 static targets.
mmencner commented Feb 28, 2019
We are hitting the same issue on a single Prometheus instance (version 2.6.0 with local storage) running inside a Docker container. So far it has happened twice out of ~50 deployed instances.
It's not directly connected to container restarts, as in the majority of cases it starts without any issues. It's also not a matter of insufficient disk space. As discussed with @gouthamve, we are planning to mitigate this by introducing a check for an empty meta.json file before startup.
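For illustration, a minimal sketch of such a pre-start check, assuming blocks live as top-level ULID directories in the data dir and that a missing or zero-sized meta.json marks a block as broken. The data path and the .broken suffix are assumptions (the suffix echoes the rename idea from the original report):

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	dataDir := "/prometheus/data" // hypothetical data directory
	entries, err := os.ReadDir(dataDir)
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		// Skip files and the WAL directory, which has no meta.json.
		if !e.IsDir() || e.Name() == "wal" {
			continue
		}
		blockDir := filepath.Join(dataDir, e.Name())
		fi, err := os.Stat(filepath.Join(blockDir, "meta.json"))
		// A missing or zero-sized meta.json means Prometheus cannot load the block.
		if os.IsNotExist(err) || (err == nil && fi.Size() == 0) {
			log.Printf("moving broken block %s aside", blockDir)
			if err := os.Rename(blockDir, blockDir+".broken"); err != nil {
				log.Fatal(err)
			}
		}
	}
}
```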
Are we not creating the meta.json atomically?
@mmencner would you mind trying it with the latest release? There have been a lot of changes to fix such issues since 2.6. We would also need the full logs to start any useful troubleshooting. @brian-brazil I just had a quick look, and it does indeed happen atomically. My guess is that something goes wrong during compaction when creating a new block.
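For context, a minimal sketch of the write-to-temp-then-rename pattern under discussion (illustrative, not the actual tsdb code). On a POSIX filesystem the rename is atomic, so a reader sees either the old or the new meta.json, never a partial one:

```go
package metafile

import (
	"io/ioutil"
	"os"
	"path/filepath"
)

// writeMetaAtomic writes meta.json so that it becomes visible atomically.
// Note: without an fsync, atomic visibility does not imply durability
// across a crash, which is exactly the gap discussed later in this thread.
func writeMetaAtomic(dir string, data []byte) error {
	tmp := filepath.Join(dir, "meta.json.tmp")
	if err := ioutil.WriteFile(tmp, data, 0666); err != nil {
		return err
	}
	// Atomic on POSIX: replaces any existing meta.json in one step.
	return os.Rename(tmp, filepath.Join(dir, "meta.json"))
}
```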
krasi-georgiev added the kind/more-info-needed label Feb 28, 2019
I can see no way that code can produce a zero-length file. It'd have to be the kernel saying it's successfully written and closed, but then not having space for it.
Yes, I suspect something similar, especially in the case of resetting the VM.
pborzenkov commented Mar 26, 2019
@brian-brazil @krasi-georgiev We are facing the same issue: sometimes lots of meta.json files are zero-sized. We run Prometheus on a local ext4 FS. Looking at the code, I don't see the file being fsynced before the rename.
pborzenkov commented Mar 26, 2019
Strangely enough, all the meta.json files have the same modification time and zero size. There are no errors in the Prometheus log...
Hm, I would expect the kernel to handle the file sync, so I don't think this is the culprit. How long does it take to replicate? Can you ping me on #prometheus-dev (@krasi-georgiev)? I will try to replicate and find the culprit, as this has been a pending issue for a while now.
pborzenkov commented Mar 26, 2019
Close definitely doesn't guarantee sync. Also, if a node crashes before the kernel flushes its write-back cache, we can end up with a file with no contents despite a successful write/close/rename.
Not sure; it happens sporadically. What I can say is that we've only seen it after a node crash. Everything is fine during normal operation.
@pborzenkov yeah, maybe you are right. I just checked the code, and Fsync is called for the other block write operations. I will open a PR with the fsync for meta.json.
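A sketch of what the fsync-hardened variant might look like (illustrative names, not the actual PR): flush the file contents before the rename, then sync the parent directory so the rename itself survives a crash:

```go
package metafile

import (
	"os"
	"path/filepath"
)

// writeMetaDurable is writeMetaAtomic plus the two syncs needed for
// crash safety: one for the file contents, one for the directory entry.
func writeMetaDurable(dir string, data []byte) error {
	tmp := filepath.Join(dir, "meta.json.tmp")
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	// Force file contents to stable storage before making them visible.
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	if err := os.Rename(tmp, filepath.Join(dir, "meta.json")); err != nil {
		return err
	}
	// Sync the parent directory so the rename is durable too.
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}
```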
pborzenkov commented Mar 26, 2019
@krasi-georgiev I'll be happy to test (though definitely not in production :)), but crash-related bugs are notoriously hard to reproduce. I tried to check the bug using ALICE (http://research.cs.wisc.edu/adsl/Software/alice/doc/adsl-doc.html), which has greatly helped me in the past, and here is what I got. This is the write part of the test (tsdb.WriteMetaFile just calls tsdb.writeMetaFile):
And here is the checker:
This is what I got with the unmodified tsdb:
And this is what I got after adding the fsync:
While this is definitely not proof that the bug is indeed fixed, the tool has a great track record and usually finds real problems.
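The test scripts themselves aren't reproduced above. As a hypothetical reconstruction, the write part might look like the sketch below, assuming tsdb.WriteMetaFile is the exported wrapper around writeMetaFile mentioned in the comment; the checker would then assert that <dir>/meta.json is either absent or parses as complete, valid JSON:

```go
package main

import (
	"log"
	"os"

	"github.com/prometheus/tsdb"
)

func main() {
	// ALICE replays this workload, simulates a crash at every
	// intermediate on-disk state, and runs the checker on each state.
	dir := os.Args[1]
	meta := &tsdb.BlockMeta{Version: 1, MinTime: 0, MaxTime: 1000}
	// Hypothetical exported wrapper around tsdb.writeMetaFile.
	if err := tsdb.WriteMetaFile(dir, meta); err != nil {
		log.Fatal(err)
	}
}
```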
Wow, that is amazing. Thanks for spending the time. I will open a PR soon.
This was referenced Mar 27, 2019
I just opened a PR that should close this issue; it will go into the next release. Feel free to reopen if you still experience the same issue after that.
haraldschilly commented Apr 7, 2018 (edited)
This ticket is a follow-up of #2805 (there are similar comments at the bottom, after it was closed).
What did you do?
Ran Prometheus in a Kubernetes cluster, on a GCE PD disk.
What did you see instead? Under which circumstances?
It crashed upon start; logfile:
The point here is that the meta.json file has a size of zero.
Manual resolution
I've deleted the directory 01CAF1K5SQZT4HBQE9P6W7J56E containing the problematic meta.json file, and now it starts up fine again.
Environment
System information:
Linux 4.10.0-40-generic x86_64
Prometheus version:
("official" docker build)
Expected behavior
What I would wish is that Prometheus starts up and doesn't CrashLoop. It should either skip the broken block directory or move it out of the way, e.g. rename it to [directoryname].broken/?