
crash on startup: open /data/*/chunks no such file or directory #5138

Closed
haraldschilly opened this Issue Jan 26, 2019 · 12 comments

haraldschilly commented Jan 26, 2019

Proposal

Prometheus crashes each time it tries to start up. This is a follow-up to #4058 (a similar case I reported, but this time the setup is different).

Bug Report

  • System information: Linux 4.15.0-1026-gcp x86_64

  • Setup: it runs in a Docker container on a VM in GCE. The filesystem is btrfs (in #4058 it was ext4). That should hopefully rule out partially written files or other inconsistencies. With Prometheus stopped and the volume unmounted, the filesystem checks out as healthy:

root@...# btrfs check /dev/sdb
Checking filesystem on /dev/sdb
UUID: c1738961-beeb-42e8-b140-6ac3bbbf4b0c
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
found 55551893504 bytes used, no error found
total csum bytes: 19560816
total tree bytes: 50315264
total fs tree bytes: 11862016
total extent tree bytes: 12763136
btree space waste bytes: 11906010
file data blocks allocated: 56575320064
 referenced 28899885056
  • Prometheus version:
prometheus, version 2.6.0 (branch: HEAD, revision: dbd1d58c894775c0788470944b818cc724f550fb)
  build user:       root@bf5760470f13
  build date:       20181217-15:14:46
  go version:       go1.11.3
  • Logs:
level=info ts=2019-01-26T08:23:14.75214264Z caller=main.go:243 msg="Starting Prometheus" version="(version=2.6.0, branch=HEAD, revision=dbd1d58c894775c0788470944b818cc724f550fb)"
level=info ts=2019-01-26T08:23:14.752224681Z caller=main.go:244 build_context="(go=go1.11.3, user=root@bf5760470f13, date=20181217-15:14:46)"
level=info ts=2019-01-26T08:23:14.752254667Z caller=main.go:245 host_details="(Linux 4.15.0-1026-gcp #27-Ubuntu SMP Thu Dec 6 18:27:01 UTC 2018 x86_64 prom-main-64b5dc4c48-24c5t (none))"
level=info ts=2019-01-26T08:23:14.75227611Z caller=main.go:246 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-01-26T08:23:14.752292803Z caller=main.go:247 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-01-26T08:23:14.753829615Z caller=main.go:561 msg="Starting TSDB ..."
level=info ts=2019-01-26T08:23:14.753968701Z caller=web.go:429 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-01-26T08:23:14.754481385Z caller=web.go:472 component=web msg="router prefix" prefix=/prometheus
level=info ts=2019-01-26T08:23:14.755001687Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548281700000 maxt=1548282600000 ulid=01D1YDSW3AKX2X4MA8WT2VXA7F
level=info ts=2019-01-26T08:23:14.755060939Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548282600000 maxt=1548283500000 ulid=01D1YEN8QREN3V8RZH50TAGQHH
level=info ts=2019-01-26T08:23:14.755119575Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548283500000 maxt=1548284400000 ulid=01D1YFGMCC3V8FYWFGN4YY5Y01
level=info ts=2019-01-26T08:23:14.75517606Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548284400000 maxt=1548285300000 ulid=01D1YGC3J1Y9RJAZQR4AGW51Y7

[......]

level=info ts=2019-01-26T08:23:14.769685494Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548458400000 maxt=1548458700000 ulid=01D23NDXRFW216G7DVT68Q6K6T
level=info ts=2019-01-26T08:23:14.769718913Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548458700000 maxt=1548459000000 ulid=01D23NQ2PT7E5KR6FQQQNPP0FA
level=info ts=2019-01-26T08:23:14.769758414Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548459000000 maxt=1548459300000 ulid=01D23P07NRH8GQNVW939YN0A1A
level=info ts=2019-01-26T08:23:14.769800451Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548458100000 maxt=1548459000000 ulid=01D23P11C3HX2RDMETD6A4FGGW
level=info ts=2019-01-26T08:23:14.781990264Z caller=main.go:430 msg="Stopping scrape discovery manager..."
level=info ts=2019-01-26T08:23:14.782050817Z caller=main.go:444 msg="Stopping notify discovery manager..."
level=info ts=2019-01-26T08:23:14.782063994Z caller=main.go:466 msg="Stopping scrape manager..."
level=info ts=2019-01-26T08:23:14.782077676Z caller=main.go:440 msg="Notify discovery manager stopped"
level=info ts=2019-01-26T08:23:14.782101018Z caller=main.go:426 msg="Scrape discovery manager stopped"
level=info ts=2019-01-26T08:23:14.782120084Z caller=main.go:460 msg="Scrape manager stopped"
level=info ts=2019-01-26T08:23:14.782115449Z caller=manager.go:664 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-01-26T08:23:14.782159435Z caller=manager.go:670 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-01-26T08:23:14.782175101Z caller=notifier.go:521 component=notifier msg="Stopping notification manager..."
level=info ts=2019-01-26T08:23:14.782201249Z caller=main.go:615 msg="Notifier manager stopped"
level=error ts=2019-01-26T08:23:14.782326031Z caller=main.go:624 err="opening storage failed: open block /data/01D1YDSW3AKX2X4MA8WT2VXA7F: open /data/01D1YDSW3AKX2X4MA8WT2VXA7F/chunks: no such file or directory"

Actions

What I did was simply delete /data/01D1YDSW3AKX2X4MA8WT2VXA7F, and Prometheus started up fine! I would hope it would do that on its own :-)
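
For anyone hitting the same crash, here is a minimal sketch of the manual workaround above as a pre-start check: scan the data directory for block directories whose chunks/ subdirectory is missing and move them aside instead of deleting them, so they can still be inspected later. This is not something Prometheus does itself; the /data path matches the logs above, while the broken/ quarantine directory name is made up for illustration.

// quarantine_blocks.go: illustrative sketch only, not part of Prometheus.
// It moves TSDB block directories that lack a chunks/ subdirectory into a
// quarantine directory so the server can start and the data can be examined.
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

func main() {
	dataDir := "/data"                             // storage path from the logs above
	quarantine := filepath.Join(dataDir, "broken") // made-up quarantine location

	if err := os.MkdirAll(quarantine, 0o755); err != nil {
		log.Fatal(err)
	}
	entries, err := os.ReadDir(dataDir)
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		// TSDB block directories are named by a 26-character ULID.
		if !e.IsDir() || len(e.Name()) != 26 {
			continue
		}
		chunks := filepath.Join(dataDir, e.Name(), "chunks")
		if _, err := os.Stat(chunks); os.IsNotExist(err) {
			src := filepath.Join(dataDir, e.Name())
			dst := filepath.Join(quarantine, e.Name())
			fmt.Printf("block %s has no chunks dir, moving it to %s\n", e.Name(), dst)
			if err := os.Rename(src, dst); err != nil {
				log.Fatal(err)
			}
		}
	}
}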

aixeshunter (Contributor) commented Jan 28, 2019

Did the VM suddenly stop or restart?

haraldschilly (Author) commented Jan 28, 2019

Yes, it's a preemptible instance in GCE.

krasi-georgiev (Member) commented Feb 26, 2019

I think we have discussed this before, and the usual decision is to hard fail on all sorts of unrepairable data corruption.

Maybe @brian-brazil or @brancz can give some more info on why it is better to hard fail than to auto-delete the corrupted blocks.

brancz (Member) commented Feb 26, 2019

It seems reasonable to do a "repair"-type thing on non-WAL files as well, at least in known-safe/recoverable scenarios (I'm not necessarily saying this case is one, but generally).

@haraldschilly do you still have the "corrupted" storage files, so we could examine them and judge the situation better?

brian-brazil (Member) commented Feb 26, 2019

It sounds like an entire directory went missing.

krasi-georgiev (Member) commented Feb 26, 2019

The main question is: what should the behaviour be in such cases, and why?
Hard fail,
or
delete the corrupted block with a warning log and continue as normal?

brancz (Member) commented Feb 26, 2019

I would first like to understand what "such cases" means. I'm not sure there is a general answer.

krasi-georgiev (Member) commented Mar 8, 2019

This would be hard. So far, "such cases" have been irreproducible.

@haraldschilly was this a one-off, or can you help with steps to reproduce?

brancz (Member) commented Mar 11, 2019

Agreed, we can only work on these things if we can either reproduce the case or are provided with a storage snapshot that exhibits the problem. Otherwise there is little we can do.

krasi-georgiev (Member) commented Apr 3, 2019

I just double-checked the code and can't see any way to produce an empty chunks dir: all writes to that dir are f.Sync()-ed, so even a host crash shouldn't leave it empty.

Maybe that bug is fixed in a more recent Prometheus version. I see that you are running 2.6; would you mind testing with the latest release and reopening if you still experience the same problem?
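
For context, the durability reasoning above relies on the standard write-then-fsync pattern; below is a minimal, generic sketch of it in Go, syncing the written file and, additionally, its parent directory so the directory entry itself survives a crash. This is not the actual tsdb code; the chunk file name and the temp-dir demo are made up.

package main

import (
	"log"
	"os"
	"path/filepath"
)

// writeDurably writes data to dir/name, fsyncs the file, then fsyncs the
// parent directory so the new directory entry is also on stable storage.
func writeDurably(dir, name string, data []byte) error {
	f, err := os.Create(filepath.Join(dir, name))
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // flush file contents to disk
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	d, err := os.Open(dir) // sync the parent directory as well
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}

func main() {
	dir, err := os.MkdirTemp("", "chunks-demo") // made-up demo location
	if err != nil {
		log.Fatal(err)
	}
	if err := writeDurably(dir, "000001", []byte("example chunk bytes")); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote durably to", dir)
}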

haraldschilly (Author) commented Apr 3, 2019

I followed the ticket, and congratulations on actually figuring out the missing detail. If this happens again with a newer release that includes the fix, I'll open a new ticket!

krasi-georgiev (Member) commented Apr 3, 2019

Thanks. It was solved thanks to @pborzenkov's pointers.

Closing this one, but feel free to reopen if you can replicate it with a more recent version, and we will continue the troubleshooting. Please try to include details of how it happened, with steps to replicate.
