does not start up after corrupted meta.json file #4058

Closed
haraldschilly opened this issue Apr 7, 2018 · 37 comments · Fixed by prometheus/tsdb#573 · May be fixed by prometheus/tsdb#320

Comments

@haraldschilly commented Apr 7, 2018

This ticket is a follow-up to #2805 (there are similar comments at the bottom, posted after it was closed).

What did you do?

Ran Prometheus in a Kubernetes cluster, on a GCE PD disk.

What did you see instead? Under which circumstances?

It crashed upon start, logfile:

level=error ts=2018-04-07T04:28:53.784390578Z caller=main.go:582
err="Opening storage failed unexpected end of JSON input"
level=info ts=2018-04-07T04:28:53.784418708Z caller=main.go:584
msg="See you next time!"

The point here is that one meta.json file has a size of zero:

> ls -l /data/*/meta.json
[...]
-rw-rw-r--    1 1000     2000           283 Apr  7 03:05 01CAF10VV5FNDJ6PG84E6RSEV3/meta.json
-rw-rw-r--    1 1000     2000             0 Apr  7 03:15 01CAF1K5SQZT4HBQE9P6W7J56E/meta.json

Manual resolution

I've deleted the directory 01CAF1K5SQZT4HBQE9P6W7J56E with the problematic meta.json file in it, and now it starts up fine again.

Environment

  • System information:

    Linux 4.10.0-40-generic x86_64

  • Prometheus version:

prometheus, version 2.2.1 (branch: HEAD, revision: bc6058c81272a8d938c05e75607371284236aadc)
  build user:       root@149e5b3f0829
  build date:       20180314-14:15:45
  go version:       go1.10

("official" docker build)

  • Logs:
level=error ts=2018-04-07T08:51:28.789897822Z caller=main.go:582 err="Opening storage failed unexpected end of JSON input"

Expected behavior

What I would wish for is that Prometheus starts up and doesn't crash-loop. It should either

  • ignore that directory, noting in the log that its meta.json is faulty
  • maybe move it to [directoryname].broken/? (see the sketch after this list)
  • reconstruct the meta.json file from the data
  • delete the problematic directory (a bit harsh; ignoring might be better)
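
To make the "move it aside" option concrete, here is a minimal sketch of what such a startup check could look like. It is an illustration only, not actual Prometheus/tsdb code: the quarantineEmptyMetaBlocks name, the .broken suffix, and the hard-coded /data path are assumptions (and os.ReadDir needs Go 1.16+).

package main

import (
        "fmt"
        "os"
        "path/filepath"
)

// quarantineEmptyMetaBlocks is a hypothetical helper: it scans a TSDB data
// directory for block directories whose meta.json exists but is empty and
// renames them to <block>.broken so Prometheus can start without them.
func quarantineEmptyMetaBlocks(dataDir string) error {
        entries, err := os.ReadDir(dataDir)
        if err != nil {
                return err
        }
        for _, e := range entries {
                if !e.IsDir() {
                        continue
                }
                blockDir := filepath.Join(dataDir, e.Name())
                fi, err := os.Stat(filepath.Join(blockDir, "meta.json"))
                if err != nil || fi.Size() > 0 {
                        // No meta.json at all, or it looks fine: leave the block alone.
                        continue
                }
                broken := blockDir + ".broken"
                fmt.Printf("meta.json is empty, moving %s to %s\n", blockDir, broken)
                if err := os.Rename(blockDir, broken); err != nil {
                        return err
                }
        }
        return nil
}

func main() {
        if err := quarantineEmptyMetaBlocks("/data"); err != nil {
                fmt.Fprintln(os.Stderr, err)
                os.Exit(1)
        }
}
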
@krasi-georgiev (Member) commented Apr 7, 2018

I opened a PR that should address this issue; @fabxc and @gouthamve will advise soon.
prometheus/tsdb#283

@TimSimmons commented Apr 16, 2018

I ran into this today: I had a single Prometheus server with the exact same behavior as above.

$ ls -l 01CB7X7ZX2E1C6R1SQHVTVVX9J
total 180008
drwxr-xr-x 2 prometheus prometheus      4096 Apr 16 19:00 chunks
-rw-r--r-- 1 prometheus prometheus 184323907 Apr 16 19:00 index
-rw-r--r-- 1 prometheus prometheus         0 Apr 16 19:00 meta.json
-rw-r--r-- 1 prometheus prometheus         0 Apr 16 19:00 tombstones

I moved the dir out of the data directory and Prometheus restarted happily.

@bmihaescu commented Oct 2, 2018

I also faced this problem today.
Finding and deleting the folder with the empty meta.json solved the issue.

root@ip-xxx-xxx-xxx-xxx:/var/lib/kubelet/pods/ce0fba7f-c649-11e8-128370c697fc/volumes/kubernetes.io~aws-ebs/pvc-a7cf0ff9-11e8-acf2-020ae50968fe/prometheus-db# find ./ -name meta.json -size 0
./01CRRG0QYZS8GJQGS96QZYR5TJ/meta.json

@krasi-georgiev (Member) commented Oct 2, 2018

Do you by any chance use NFS?

There was some discussion about this in the past: NFS sometimes behaves strangely, and I don't think there was anything we could do to prevent it, so NFS is considered unsupported.

@Vlaaaaaaad commented Oct 8, 2018

@krasi-georgiev: I'm working with @bmihaescu, so I can comment on this. The error is from a kops-deployed cluster running on AWS using EBS.

@krasi-georgiev (Member) commented Oct 8, 2018

That would be hard to troubleshoot. I would need some specific steps to replicate it.

How often does it happen, and can you replicate it with the latest release?

@krasi-georgiev (Member) commented Oct 8, 2018

@Vlaaaaaaad, @bmihaescu, are you sure you have enough free space? (Suggested by Brian on IRC, so worth checking.)
During compaction it needs some extra space to hold temporary data while joining blocks.

@Vlaaaaaaad commented Oct 9, 2018

@krasi-georgiev oh, that is likely to be the issue. We do have some free space, but not much. Is there some documentation on how much space should be free (a certain value, a percentage)?
@bmihaescu will have more details, as he was the one who did the investigation and fix.

This is happening on two older clusters (k8s 1.10.3 and 1.9.6), with prometheus-operator v0.17.0 and Prometheus v2.2.1, so the issue might be fixed in newer versions.

Tagging @markmunozoz too.

@krasi-georgiev (Member) commented Oct 9, 2018

I am not 100% sure, but logically I would say at least 5 times your biggest block.
Maybe a bit less will do, but storage is not expensive these days, so better to be on the safe side.

By the way, there are plans to add storage-based retention, which should help use cases where storage is limited: prometheus/tsdb#343

@krasi-georgiev (Member) commented Nov 9, 2018

Does anyone want to add anything else before we mark this as resolved?

@haraldschilly did you find the cause for your case?

delete the problematic directory (a bit harsh; ignoring might be better)

This is implemented as part of the tsdb CLI scan tool, which is still in review.

@slomo commented Dec 10, 2018

I am running into the same issue on Prometheus 2.4.3 with Vagrant. When I suspend my machine, VirtualBox seems to crash; after the crash I reboot the machine, and usually one, but sometimes up to 90%, of my meta.json files are empty.

[vagrant@prometheus ~]$ sudo find /var/lib/prometheus  -iname meta.json -size 0b | wc
    345     345   19665
[vagrant@prometheus ~]$ sudo find /var/lib/prometheus  -iname meta.json | wc
    357     357   20349

I am not seeing this in production yet, I guess simply because my machines rarely ever crash.
I will try to upgrade to the latest version, but I am a bit scared now that I might also lose data in production on a power outage.

@krasi-georgiev (Member) commented Dec 10, 2018

I double-checked the code again, and the only way I could see this happening is when using NFS or another non-POSIX filesystem.

@slomo can you replicate this every time?

@krasi-georgiev (Member) commented Dec 10, 2018

Are you maybe mounting a dir from the host to use as the data dir?

@haraldschilly (Author) commented Dec 10, 2018

@krasi-georgiev I'm the one who originally reported this. In case it helps, this did happen on a GCE PD disk, mounted via /dev/sda1 on /prometheus type ext4 (rw,relatime,data=ordered) and /dev/sdd on /data type ext4 (rw,relatime,discard,data=ordered), in a Docker container.

@krasi-georgiev (Member) commented Dec 10, 2018

Yeah, a GCE PD disk is OK.
It would help to check the logs for any errors or warnings before the crash.

@haraldschilly (Author) commented Dec 10, 2018

Well, I don't remember seeing any logs specific to that with useful info. It usually happens when there is an OOM event and the kernel kills the Prometheus process, or when the whole VM is shut down. I think the main underlying reason is that ext4 isn't 100% atomic. This makes me think I should try using ZFS or Btrfs.

@slomo commented Dec 10, 2018

It is ext4 inside a VirtualBox VM. I would say it happens on every VirtualBox crash; I'll try to reproduce it.

@krasi-georgiev (Member) commented Dec 10, 2018

Steps to reproduce would really help, so I can try to replicate it as well. Thanks!

@krasi-georgiev (Member) commented Dec 12, 2018

@slomo any luck with steps to replicate this?
If possible, please ping me on IRC (@krasi-georgiev on #prometheus-dev) to speed up the troubleshooting.

@slomo commented Dec 12, 2018

Well, in my setup (which contains a lot of Consul SD hosts) I can reproduce it by resetting the VirtualBox VM. I tried to create a smaller setup with just a few static node_exporters being scraped, and I can't trigger the corruption anymore.

@krasi-georgiev (Member) commented Dec 12, 2018

So you think it is related to the SD being used?

@slomo commented Dec 13, 2018

@krasi-georgiev I think it would be jumping to conclusions a bit fast to say that SD is at fault, but it definitely requires a certain complexity to occur. I have 6 jobs with a total of ca. 400 targets; all targets are added using service discovery with Consul.

@haraldschilly Could you roughly describe your setup? Do you use service discovery, and how many hosts/applications do you monitor?

@krasi-georgiev (Member) commented Dec 13, 2018

@slomo thanks for the update; any chance you could ping me on IRC to speed this up?

@krasi-georgiev (Member) commented Feb 26, 2019

@slomo maybe the main difference is the load. Try it with a higher load, e.g. 400 static targets.

@mmencner commented Feb 28, 2019

We are hitting the same issue on a single Prometheus instance (version 2.6.0 with local storage), running inside a Docker container. So far it has happened twice, out of ~50 deployed instances:

  • The first time, only the last block had an empty meta.json
  • The second time, the last two blocks had an empty meta.json

It's not directly connected to container restarts, as in the majority of cases it starts without any issues. It's also not a matter of insufficient disk space.

As discussed with @gouthamve, we are planning to mitigate this by introducing a check for an empty meta.json and deleting directories where this condition is true, but it's definitely not a perfect solution.

@brian-brazil (Member) commented Feb 28, 2019

Are we not creating the meta.json atomically?

@krasi-georgiev (Member) commented Feb 28, 2019

@mmencner would you mind trying it with the latest release, as there have been a lot of changes to fix such issues since 2.6? We would also need the full logs to start some useful troubleshooting.

@brian-brazil I just had a quick look, and it does indeed happen atomically. My guess is that something goes wrong during compaction when creating a new block.

@brian-brazil (Member) commented Feb 28, 2019

I can see no way that code can produce a zero-length file. It'd have to be the kernel saying it's successfully written and closed, but then not having space for it.

@krasi-georgiev (Member) commented Feb 28, 2019

I can see no way that code can produce a zero-length file. It'd have to be the kernel saying it's successfully written and closed, but then not having space for it.

Yes, I suspect something similar, especially in the case of resetting the VM.

@pborzenkov (Contributor) commented Mar 26, 2019

@brian-brazil @krasi-georgiev We are facing the same issue. Sometimes lots of meta.json files are zero-sized. We run Prometheus on a local ext4 FS.

Looking at writeMetaFile, shouldn't the file contents themselves be synced before close?

@pborzenkov (Contributor) commented Mar 26, 2019

Strangely enough, all the meta.json files have the same modification time and zero size:

[root@xxx]# ls -la /var/lib/prometheus/data/*/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5NZP8MVJXC9P2TQ9QJB1FEV/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5PM9ED3MV7HJSVG5ES6W6KR/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5Q8WM6TR1AH1STJXNRWA55R/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5QXFSYARB5E8M7T4K8MWX8E/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5RJ2ZMZZENCT48KD2KJ0B85/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5S6P5D1WV8G6FKE5RHHEETA/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5SV9B4H5A5BPZ6AYMVD2KX3/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5TFWGXREGZJ4EJ51EREXH9D/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5V4FPQCHJ2VTMZECGNBCM0T/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5VS2WEDDKTHV2WEBG1BT5MB/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5WDP25JN7T2QX2AM3WPK3V3/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5X29FTXJ6C0Z9P1R2GN9PY3/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5XPWDRF57JT697BZN5KDY14/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5YBF6R8F6ZFG17TT2JQ91H9/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5Z02CF9RG1WYPAHMARNHJWG/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D5ZMNJ7CG1XZTT295HH68NCB/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D6098R1BQQ0N04N9FARC4ZWJ/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D60XW6GTZCDAYWDC2PXYBMXP/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D61JF921X3HHNESWTM1A5HF7/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D6272H50VCEDRHJ53NBXXTAQ/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D62VNYAA6TA5MX11GE3NQPWJ/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D63G8XJ907DQK9MC5PAE8CJP/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D644W0QY5AYHW9Z943Y4PFEQ/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D64SF6PQ5WAPMGN4V7BRV4BQ/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D65E2CP1N2C6WTPBPCXW0N70/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D662NSQ94HJZ05XHHZ75E683/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D66Q8T3C8YV3PSKZWW6572B4/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D67BVVRJXG5KJY01RBFBZ1W2/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D67BVWYTJRQ4KVNB5GSECZ43/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D67JQK0EEPFJ50EAF9900P1Z/meta.json
-rw-r--r-- 1 prometheus prometheus 0 Mar 18 08:54 /var/lib/prometheus/data/01D67SKA8EGHH9FAHVRWQVP1Y4/meta.json

There are no errors in the Prometheus log...

@krasi-georgiev (Member) commented Mar 26, 2019

Hm, I would expect the kernel to handle the file sync, so I don't think this is the culprit.

How long does it take to replicate?

Can you ping me on #prometheus-dev (@krasi-georgiev)? I will try to replicate and find the culprit, as this has been a pending issue for a while now.

@pborzenkov (Contributor) commented Mar 26, 2019

Hm, I would expect the kernel to handle the file sync, so I don't think this is the culprit.

Close definitely doesn't guarantee sync. Also, if a node crashes before the kernel flushes its write-back cache, then we can end up with a file with no contents despite a successful write/close/rename.

How long does it take to replicate?

Not sure. Happens sporadically. What I can say is that we've seen it only after a node crashed. Everything is fine during normal operation.

@krasi-georgiev (Member) commented Mar 26, 2019

@pborzenkov yeah, maybe you are right; I just checked the code, and Fsync is called for the other block write operations.
https://github.com/prometheus/tsdb/blob/7757fe6f21d4d9f21bf7a1da92fe573b7d4318c3/compact.go#L629

I will open a PR with the fsync for writeMetaFile and will attach a binary to it if you want to test it.
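
For reference, here is a minimal, self-contained sketch of that write-tmp / fsync / rename / fsync-dir pattern. It is illustrative only and is not the actual tsdb writeMetaFile; the writeFileAtomic name and signature are made up for the example.

package main

import (
        "os"
        "path/filepath"
)

// writeFileAtomic shows the pattern: write to a temporary file, fsync it,
// rename it over the target, then fsync the parent directory so neither the
// contents nor the rename can be lost on a crash.
func writeFileAtomic(path string, data []byte) error {
        tmp := path + ".tmp"

        f, err := os.Create(tmp)
        if err != nil {
                return err
        }
        if _, err := f.Write(data); err != nil {
                f.Close()
                return err
        }
        // Persist the contents before the rename; without this, a crash can
        // leave a zero-length file behind the final name.
        if err := f.Sync(); err != nil {
                f.Close()
                return err
        }
        if err := f.Close(); err != nil {
                return err
        }
        // Atomically replace the target.
        if err := os.Rename(tmp, path); err != nil {
                return err
        }
        // Persist the rename itself by syncing the parent directory.
        dir, err := os.Open(filepath.Dir(path))
        if err != nil {
                return err
        }
        defer dir.Close()
        return dir.Sync()
}

func main() {
        if err := writeFileAtomic("meta.json", []byte("{\"version\":1}\n")); err != nil {
                os.Exit(1)
        }
}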

@pborzenkov (Contributor) commented Mar 26, 2019

@krasi-georgiev I'll be happy to test (though definitely not in production :)), but crash-related bugs are notoriously hard to reproduce. I tried to check the bug using ALICE (http://research.cs.wisc.edu/adsl/Software/alice/doc/adsl-doc.html), which has greatly helped me in the past, and this is what I got:

Here is the write part of the test (tsdb.WriteMetaFile just calls tsdb.writeMetaFile):

package main

import (
        "math/rand"
        "os"
        "time"

        "github.com/oklog/ulid"
        "github.com/prometheus/tsdb"
)

func main() {
        now := time.Now()

        // Write a minimal, valid BlockMeta to ./meta.json via tsdb's write path.
        err := tsdb.WriteMetaFile(".", &tsdb.BlockMeta{
                ULID:    ulid.MustNew(ulid.Now(), rand.New(rand.NewSource(123))),
                MinTime: now.Add(-2*time.Hour).Unix() * 1000,
                MaxTime: now.Unix() * 1000,
                Version: 1,
        })
        if err != nil {
                os.Exit(1)
        }
        os.Exit(0)
}

And here the checker:

package main

import (
        "os"

        "github.com/prometheus/tsdb"
)

func main() {
        // The check passes if meta.json either parses cleanly or does not exist at all.
        _, err := tsdb.ReadMetaFile(os.Args[1])
        if err != nil && !os.IsNotExist(err) {
                os.Exit(1)
        }
        os.Exit(0)
}

This is what I got with unmodified tsdb:

❯ alice-check --traces_dir=../traces_dir --checker=../../read-test/read-test --debug_level 1
-------------------------------------------------------------------------------
ALICE tool version 0.0.1. Please go through the documentation, particularly the
listed caveats and limitations, before deriving any inferences from this tool.
-------------------------------------------------------------------------------
Parsing traces to determine logical operations ...
Logical operations:
0       creat("meta.json.tmp", parent=107377493, mode='0666', inode=107377690)
1       append("meta.json.tmp", offset=0, count=159, inode=107377690)
2       rename(dest="meta.json", source="meta.json.tmp", source_hardlinks=1, source_parent=107377493, dest_size=0, dest_hardlinks=0, source_size=159, dest_parent=107377493, source_inode=107377690, dest_inode=False)
3       fsync(".", size=60, inode=107377493)
-------------------------------------
Finding vulnerabilities...
(Dynamic vulnerability) Ordering: Operation 1 needs to be persisted before 2
(Static vulnerability) Ordering: Operation /usr/lib/golang/src/syscall/asm_linux_amd64.s:27[syscall.Syscall] needed before /usr/lib/golang/src/syscall/asm_linux_amd64.s:53[syscall.Syscall6]
Done finding vulnerabilities.

And this is what I got after adding an f.Sync() call before f.Close in writeMetaFile:

❯ alice-check --traces_dir=../traces_dir --checker=../../read-test/read-test --debug_level 1
-------------------------------------------------------------------------------
ALICE tool version 0.0.1. Please go through the documentation, particularly the
listed caveats and limitations, before deriving any inferences from this tool.
-------------------------------------------------------------------------------
Parsing traces to determine logical operations ...
Logical operations:
0       creat("meta.json.tmp", parent=107379907, mode='0666', inode=107378562)
1       append("meta.json.tmp", offset=0, count=159, inode=107378562)
2       fsync("meta.json.tmp", size=159, inode=107378562)
3       rename(dest="meta.json", source="meta.json.tmp", source_hardlinks=1, source_parent=107379907, dest_size=0, dest_hardlinks=0, source_size=159, dest_parent=107379907, source_inode=107378562, dest_inode=False)
4       fsync(".", size=60, inode=107379907)
-------------------------------------
Finding vulnerabilities...
Done finding vulnerabilities.

While this is definitely not proof that the bug is indeed fixed, the tool has a great track record and usually finds real problems.

@krasi-georgiev (Member) commented Mar 27, 2019

Wow, that is amazing. Thanks for spending the time on this. I will open a PR soon.

@krasi-georgiev (Member) commented Apr 1, 2019

I just opened a PR that should close this issue; it will go into the next release. Feel free to reopen if you still experience the same issue after that.
