Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected fault address on 2.0.0-rc2 #3375

Closed
WeiBanjo opened this Issue Oct 30, 2017 · 8 comments

Comments

Projects
None yet
6 participants
@WeiBanjo
Copy link

WeiBanjo commented Oct 30, 2017

What did you do?
Upgrade from 2.0.0-rc0 to 2.0.0-rc2

What did you see instead? Under which circumstances?
unexpected fault address

Environment

$ docker info
Containers: 8
 Running: 7
 Paused: 0
 Stopped: 1
Images: 7
Server Version: 1.12.6
Storage Driver: overlay2
 Backing Filesystem: xfs
  • System information:
$ uname -srm
Linux 4.8.6-1.el7.elrepo.x86_64 x86_64
  • Prometheus version:
    Starting Prometheus" version="(version=2.0.0-rc.2, branch=HEAD, revision=ce63a5a8557bb33e2030a7756c58fd773736b592)

  • Logs:

level=info ts=2017-10-30T21:54:20.53038616Z caller=main.go:215 msg="Starting Prometheus" version="(version=2.0.0-rc.2, branch=HEAD, revision=ce63a5a8557bb33e2030a7756c58fd773736b592)"
level=info ts=2017-10-30T21:54:20.53044143Z caller=main.go:216 build_context="(go=go1.9.1, user=root@a6d2e4a7b8da, date=20171025-18:42:54)"
level=info ts=2017-10-30T21:54:20.530459519Z caller=main.go:217 host_details="(Linux 4.8.6-1.el7.elrepo.x86_64 #1 SMP Mon Oct 31 12:56:11 EDT 2016 x86_64 kube-prometheus-fh5ks (none))"
level=info ts=2017-10-30T21:54:20.532910997Z caller=web.go:380 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2017-10-30T21:54:20.532944511Z caller=main.go:314 msg="Starting TSDB"
level=info ts=2017-10-30T21:54:20.532936038Z caller=targetmanager.go:69 component="target manager" msg="Starting target manager..."
level=info ts=2017-10-30T21:54:37.08099138Z caller=main.go:326 msg="TSDB started"
level=info ts=2017-10-30T21:54:37.081069506Z caller=main.go:394 msg="Loading configuration file" filename=/prometheus/prometheus.yaml
level=info ts=2017-10-30T21:54:37.095743435Z caller=main.go:371 msg="Server is ready to receive requests."
level=info ts=2017-10-30T21:55:37.161796416Z caller=compact.go:361 component=tsdb msg="compact blocks" count=3 mint=1508932800000 maxt=1508997600000
unexpected fault address 0x7f5cd2de0014
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7f5cd2de0014 pc=0x52e74b]

goroutine 237 [running]:
runtime.throw(0x1ba3cbc, 0x5)
	/usr/local/go/src/runtime/panic.go:605 +0x95 fp=0xc569822f90 sp=0xc569822f70 pc=0x42bc55
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:364 +0x29d fp=0xc569822fe0 sp=0xc569822f90 pc=0x44273d
encoding/binary.Uvarint(0x7f5cd2de0014, 0xf090c7f, 0xf090c7f, 0x58, 0xf090cdb)
	/usr/local/go/src/encoding/binary/varint.go:63 +0x3b fp=0xc569822fe8 sp=0xc569822fe0 pc=0x52e74b
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*chunkReader).Chunk(0xc56ec19900, 0xc60d014, 0x2891fe0, 0xc5324fda98, 0x0, 0x0)
	/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/chunks.go:354 +0xb1 fp=0xc5698230a0 sp=0xc569822fe8 pc=0x157b381
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*blockChunkReader).Chunk(0xc5731ca4c0, 0xc60d014, 0x2891fe0, 0xc5324fda98, 0x0, 0x0)
	<autogenerated>:1 +0x46 fp=0xc5698230e0 sp=0xc5698230a0 pc=0x15b0906
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*compactionSeriesSet).Next(0xc57346c2c0, 0xc56229f8d8)
	/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:646 +0x1aa fp=0xc569823220 sp=0xc5698230e0 pc=0x1580aea
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*compactionMerger).Next(0xc56c996690, 0xc56229fa40)
	/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:740 +0x46c fp=0xc569823350 sp=0xc569823220 pc=0x158182c
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*compactionMerger).Next(0xc56c996700, 0xb59b2)
	/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:728 +0x83a fp=0xc569823480 sp=0xc569823350 pc=0x1581bfa
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*LeveledCompactor).populateBlock(0xc420079360, 0xc572cbcd80, 0x3, 0x4, 0xc572c8b3e0, 0x2891f20, 0xc572c72900, 0x287e740, 0xc572cd4690, 0x0, ...)
	/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:508 +0xbd4 fp=0xc5698239a8 sp=0xc569823480 pc=0x157f944
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*LeveledCompactor).write(0xc420079360, 0x7ffcf9f0e2e1, 0x5, 0xc572c8b3e0, 0xc572cbcd80, 0x3, 0x4, 0x0, 0x0)
	/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:405 +0x57a fp=0xc569823be0 sp=0xc5698239a8 pc=0x157e51a
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*LeveledCompactor).Compact(0xc420079360, 0x7ffcf9f0e2e1, 0x5, 0xc56ec198c0, 0x3, 0x4, 0x0, 0x0)
	/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:321 +0x5b9 fp=0xc569823d08 sp=0xc569823be0 pc=0x157d9f9
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*DB).compact(0xc420130a90, 0x0, 0x0, 0x0)
	/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/db.go:377 +0x507 fp=0xc569823df0 sp=0xc569823d08 pc=0x1584527
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*DB).run(0xc420130a90)
	/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/db.go:258 +0x30a fp=0xc569823fd8 sp=0xc569823df0 pc=0x15837ba
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:2337 +0x1 fp=0xc569823fe0 sp=0xc569823fd8 pc=0x45ca71
created by github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.Open
	/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/db.go:222 +0x56e

Full crash log

@fabxc

This comment has been minimized.

Copy link
Member

fabxc commented Oct 31, 2017

Thanks for reporting. Is this reproducible, i.e. does it crash again if you restart it?

Crash happening here on memory access against an mmaped chunk file: https://github.com/prometheus/tsdb/blob/master/chunks.go#L354

From a bit of research, SIGBUS occurs in combination with mmap if the mapped region exceeds the actual file size. However, we call mmap exactly with the size of the file: https://github.com/prometheus/tsdb/blob/master/block.go#L406

The source of the crash is also not at the potential end of the file, which generally rules out an off-by-one error here. Also, every byte of chunk files are traversed on compaction and a general error would've caused this panic for virtually everyone.

Assuming the instantiation is correct, the only potential cause seems to be modification/truncating of the file after loading it. There should be no code to hypothetically do this in Prometheus itself.
Is there any chance someone or something fiddled with the files while Prometheus was running or that any corruption happened?

AFAICS, there are no other potential causes for SIGBUS.

@WeiBanjo

This comment has been minimized.

Copy link
Author

WeiBanjo commented Oct 31, 2017

I am not able to reproduce it consistently. But it happened couple times already. I have to wipe the data directory to recover it.

@gouthamve

This comment has been minimized.

Copy link
Member

gouthamve commented Nov 2, 2017

I have to wipe the data directory to recover it.

@WeiBanjo You mean Prometheus crashes on start with the existing data? Is it the same error on restart? If yes, then that would be helpful in pinpointing whats wrong.

@WeiBanjo

This comment has been minimized.

Copy link
Author

WeiBanjo commented Nov 2, 2017

Yes. Prometheus crashes on start with the existing data.

When Prometheus in that state, it's a crash loop.

@WeiBanjo

This comment has been minimized.

Copy link
Author

WeiBanjo commented Nov 3, 2017

Both my Prometheus containers crashed again today. Here is the log from first occurrence. Hopefully it's usefully to identify this issue.

  | November 3rd 2017, 06:02:51.581 | fatal error: fault
  | November 3rd 2017, 06:02:51.581 | unexpected fault address 0x7f0af4854053
  | November 3rd 2017, 06:02:51.586 | /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/wal.go:242 +0x786
  | November 3rd 2017, 06:02:51.586 | /usr/local/go/src/runtime/signal_unix.go:511 +0x220
  | November 3rd 2017, 06:02:51.586 | github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*DB).run(0xc4201641a0)
  | November 3rd 2017, 06:02:51.586 | created by github.com/prometheus/prometheus/retrieval.(*TargetManager).reload
  | November 3rd 2017, 06:02:51.586 | net.(*conn).Read(0xc42014eaa0, 0xc4209ba000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
  | November 3rd 2017, 06:02:51.586 | github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*compactionSeriesSet).Next(0xc51be1cdc0, 0x1)
  | November 3rd 2017, 06:02:51.586 | internal/poll.(*pollDesc).wait(0xc420286218, 0x72, 0xc42015fa00, 0x0, 0x0)
  | November 3rd 2017, 06:02:51.586 | github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*blockChunkReader).Chunk(0xc548b067c0, 0x19705053, 0x2891fe0, 0xc4d781cd90, 0x0, 0x0)
  | November 3rd 2017, 06:02:51.586 | runtime.throw(0x1ba3cbc, 0x5)
  | November 3rd 2017, 06:02:51.586 | os/signal.signal_recv(0x0)
  | November 3rd 2017, 06:02:51.586 | runtime.gopark(0x1c31938, 0x0, 0x1ba55e8, 0x6, 0x18, 0x1)
  | November 3rd 2017, 06:02:51.586 | github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*compactionMerger).Next(0xc4da7b0e00, 0x82d40)
  | November 3rd 2017, 06:02:51.586 | internal/poll.(*FD).Accept(0xc420286200, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
  | November 3rd 2017, 06:02:51.586 | github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*SegmentWAL).run(0xc4203be280, 0x2540be400)
  | November 3rd 2017, 06:02:51.586 | github.com/prometheus/prometheus/vendor/google.golang.org/grpc.(*ClientConn).resetAddrConn.func1(0xc4203f0680)
  | November 3rd 2017, 06:02:51.586 | created by github.com/prometheus/prometheus/prompb.RegisterAdminHandlerFromEndpoint.func1
  | November 3rd 2017, 06:02:51.586 | main.main()
  | November 3rd 2017, 06:02:51.586 | runtime.ensureSigM.func1()
  | November 3rd 2017, 06:02:51.586 | created by main.main
  | November 3rd 2017, 06:02:51.586 | os/signal.loop()
  | November 3rd 2017, 06:02:51.586 | internal/poll.(*pollDesc).wait(0xc420286618, 0x72, 0xffffffffffffff00, 0x287cbc0, 0x28705b8)
  | November 3rd 2017, 06:02:51.586 | github.com/prometheus/prometheus/vendor/google.golang.org/grpc/transport.(*http2Client).reader(0xc420461380)
  | November 3rd 2017, 06:02:51.586 | /usr/local/go/src/io/io.go:309 +0x86
  | November 3rd 2017, 06:02:51.586 | /go/src/github.com/prometheus/prometheus/web/web.go:463 +0xdbc
  | November 3rd 2017, 06:02:51.586 | goroutine 189 [running]:
  | November 3rd 2017, 06:02:51.586 | /usr/local/go/src/runtime/asm_amd64.s:2337 +0x1
  | November 3rd 2017, 06:02:51.586 | /go/src/github.com/prometheus/prometheus/vendor/golang.org/x/net/http2/frame.go:492 +0xa4
  | November 3rd 2017, 06:02:51.586 | goroutine 185 [select]:
  | November 3rd 2017, 06:02:51.586 | net.(*netFD).accept(0xc420286200, 0x1, 0x28932a0, 0xc4201ba7d8)
  | November 3rd 2017, 06:02:51.586 | /usr/local/go/src/net/tcpsock.go:247 +0x49
  | November 3rd 2017, 06:02:51.586 | created by github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.OpenSegmentWAL
  | November 3rd 2017, 06:02:51.586 | github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*compactionMerger).Next(0xc4da7b0af0, 0x0)
  | November 3rd 2017, 06:02:51.586 | /usr/local/go/src/runtime/asm_amd64.s:2337 +0x1 fp=0xc4a15bbfe0 sp=0xc4a15bbfd8 pc=0x45ca71
  | November 3rd 2017, 06:02:51.586 | <autogenerated>:1 +0x46 fp=0xc4a15bb0e0 sp=0xc4a15bb0a0 pc=0x15b0906
  | November 3rd 2017, 06:02:51.586 | /go/src/github.com/prometheus/prometheus/vendor/google.golang.org/grpc/transport/http2_client.go:1168 +0x142
  | November 3rd 2017, 06:02:51.586 | github.com/prometheus/prometheus/vendor/google.golang.org/grpc.(*addrConn).transportMonitor(0xc4203f0680)
  | November 3rd 2017, 06:02:51.586 | /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/db.go:258 +0x30a fp=0xc4a15bbfd8 sp=0xc4a15bbdf0 pc=0x15837ba
  | November 3rd 2017, 06:02:51.586 | /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/db.go:377 +0x507 fp=0xc4a15bbdf0 sp=0xc4a15bbd08 pc=0x1584527
  | November 3rd 2017, 06:02:51.586 | /usr/local/go/src/encoding/binary/varint.go:63 +0x3b fp=0xc4a15bafe8 sp=0xc4a15bafe0 pc=0x52e74b
  | November 3rd 2017, 06:02:51.586 | /go/src/github.com/prometheus/prometheus/cmd/prometheus/main.go:288 +0x3dfa
  | November 3rd 2017, 06:02:51.586 | goroutine 9 [chan receive]:
  | November 3rd 2017, 06:02:51.586 | goroutine 168 [select, 3004 minutes]:
  | November 3rd 2017, 06:02:51.586 | /go/src/github.com/prometheus/prometheus/vendor/github.com/mwitkow/go-conntrack/listener_wrapper.go:86 +0x37
  | November 3rd 2017, 06:02:51.586 | internal/poll.(*pollDesc).waitRead(0xc420286618, 0xc4209ba000, 0x8000, 0x8000)
  | November 3rd 2017, 06:02:51.586 | /go/src/github.com/prometheus/prometheus/vendor/golang.org/x/net/netutil/listen.go:30 +0x53
  | November 3rd 2017, 06:02:51.586 | /usr/local/go/src/runtime/sigqueue.go:131 +0xa6
@beorn7

This comment has been minimized.

Copy link
Member

beorn7 commented Nov 6, 2017

How to proceed here? Is this a bug to block the final 2.0.0 release on? I'm missing the deeper insight to make that call.

@krasi-georgiev

This comment has been minimized.

Copy link
Member

krasi-georgiev commented May 1, 2018

now that we are on the 2.2.1 this looks obsolete.
Please feel free to reopen of you still experience this.

@lock

This comment has been minimized.

Copy link

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 22, 2019

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.