Poller crash - "morestack on g0" #4650

Closed
sevagh opened this Issue Sep 24, 2018 · 12 comments

sevagh commented Sep 24, 2018

Bug Report

What did you do?

I'm running a Prometheus poller, which crashed overnight. It seems to crash regularly (this poller in particular has more scrape targets than my other pollers, since it's in a higher-traffic DC).

Some Grafana graphs of the performance characteristics:

[screenshot: prometheus-bug]

Environment

  • System information:
$ uname -srm
Linux 4.9.0-6-amd64 x86_64
  • Prometheus version:
$ /opt/prometheus2/prometheus --version
prometheus, version 2.3.2 (branch: HEAD, revision: 71af5e29e815795e9dd14742ee7725682fa14b7b)
  build user:       root@5258e0bd9cc1
  build date:       20180712-14:02:52
  go version:       go1.10.3
  • Logs:
Sep 22 14:03:16 clip prometheus[31431]: level=info ts=2018-09-22T21:03:16.426247985Z caller=compact.go:398 component=tsdb msg="write block" mint=1537639200000 maxt=1537646400000 ulid=01CR1HA0
Sep 22 14:03:40 clip prometheus[31431]: level=info ts=2018-09-22T21:03:40.247023327Z caller=head.go:348 component=tsdb msg="head GC completed" duration=13.160222661s
Sep 22 14:04:11 clip prometheus[31431]: level=info ts=2018-09-22T21:04:11.584463744Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=31.337302457s
Sep 22 14:06:02 clip prometheus[31431]: fatal: morestack on g0
Sep 22 14:06:02 clip prometheus[31431]: SIGTRAP: trace trap
Sep 22 14:06:02 clip prometheus[31431]: PC=0x4573e2 m=1 sigcode=128
Sep 22 14:06:02 clip prometheus[31431]: goroutine 0 [idle]:
Sep 22 14:06:02 clip prometheus[31431]: runtime.morestack()
Sep 22 14:06:02 clip prometheus[31431]:         /usr/local/go/src/runtime/asm_amd64.s:451 +0x22
Sep 22 14:06:02 clip prometheus[31431]: goroutine 1 [chan receive, 44779 minutes]:
Sep 22 14:06:02 clip prometheus[31431]: github.com/prometheus/prometheus/vendor/github.com/oklog/oklog/pkg/group.(*Group).Run(0xc422399c90, 0xc42061f420, 0x8)
Sep 22 14:06:02 clip prometheus[31431]:         /go/src/github.com/prometheus/prometheus/vendor/github.com/oklog/oklog/pkg/group/group.go:43 +0xec
Sep 22 14:06:02 clip prometheus[31431]: main.main()
Sep 22 14:06:02 clip prometheus[31431]:         /go/src/github.com/prometheus/prometheus/cmd/prometheus/main.go:595 +0x5459
Sep 22 14:06:02 clip prometheus[31431]: goroutine 17 [syscall, 1645 minutes]:
Sep 22 14:06:02 clip prometheus[31431]: os/signal.signal_recv(0x1ddf480)
Sep 22 14:06:02 clip prometheus[31431]:         /usr/local/go/src/runtime/sigqueue.go:139 +0xa6
Sep 22 14:06:02 clip prometheus[31431]: os/signal.loop()
Sep 22 14:06:02 clip prometheus[31431]:         /usr/local/go/src/os/signal/signal_unix.go:22 +0x22
Sep 22 14:06:02 clip prometheus[31431]: created by os/signal.init.0
Sep 22 14:06:02 clip prometheus[31431]:         /usr/local/go/src/os/signal/signal_unix.go:28 +0x41
Sep 22 14:06:02 clip prometheus[31431]: goroutine 33 [chan receive]:
Sep 22 14:06:02 clip prometheus[31431]: github.com/prometheus/prometheus/vendor/github.com/golang/glog.(*loggingT).flushDaemon(0x2afeb00)
Sep 22 14:06:02 clip prometheus[31431]:         /go/src/github.com/prometheus/prometheus/vendor/github.com/golang/glog/glog.go:879 +0x8b
Sep 22 14:06:02 clip prometheus[31431]: created by github.com/prometheus/prometheus/vendor/github.com/golang/glog.init.0
Sep 22 14:06:02 clip prometheus[31431]:         /go/src/github.com/prometheus/prometheus/vendor/github.com/golang/glog/glog.go:410 +0x203
Sep 22 14:06:02 clip prometheus[31431]: goroutine 212 [select]:
Sep 22 14:06:02 clip prometheus[31431]: github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*SegmentWAL).run(0xc420212000, 0x2540be400)
Sep 22 14:06:02 clip prometheus[31431]:         /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/wal.go:704 +0x36e
Sep 22 14:06:02 clip prometheus[31431]: created by github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.OpenSegmentWAL
Sep 22 14:06:02 clip prometheus[31431]:         /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/wal.go:244 +0x776
Sep 22 14:06:02 clip prometheus[31431]: goroutine 177 [chan receive (nil chan), 44779 minutes]:
Sep 22 14:06:02 clip prometheus[31431]: github.com/prometheus/prometheus/prompb.RegisterAdminHandlerFromEndpoint.func1.1(0x1deec80, 0xc42003c028, 0xc420207380, 0x1c77f03, 0xc)
Sep 22 14:06:02 clip prometheus[31431]:         /go/src/github.com/prometheus/prometheus/prompb/rpc.pb.gw.go:85 +0x4c
Sep 22 14:06:02 clip prometheus[31431]: created by github.com/prometheus/prometheus/prompb.RegisterAdminHandlerFromEndpoint.func1
Sep 22 14:06:02 clip prometheus[31431]:         /go/src/github.com/prometheus/prometheus/prompb/rpc.pb.gw.go:84 +0x19b
Sep 22 14:06:02 clip prometheus[31431]: goroutine 160 [select, 44779 minutes, locked to thread]:
Sep 22 14:06:02 clip prometheus[31431]: runtime.gopark(0x1d03f48, 0x0, 0x1c71671, 0x6, 0x18, 0x1)
Sep 22 14:06:02 clip prometheus[31431]:         /usr/local/go/src/runtime/proc.go:291 +0x11a
Sep 22 14:06:02 clip prometheus[31431]: runtime.selectgo(0xc42049b750, 0xc420202720)
Sep 22 14:06:02 clip prometheus[31431]:         /usr/local/go/src/runtime/select.go:392 +0xe50
Sep 22 14:06:02 clip prometheus[31431]: runtime.ensureSigM.func1()
Sep 22 14:06:02 clip prometheus[31431]:         /usr/local/go/src/runtime/signal_unix.go:549 +0x1f4

simonpasquier (Member) commented Sep 24, 2018

Hmm, this is weird... One possibility is a data race in the code, so it would be good if you could run a binary compiled with -race. Would you be able to?
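
(For reference, a minimal sketch of building with the race detector directly via go build, assuming a checkout of the repository under $GOPATH; the project's normal build path goes through promu, as discussed below:)

$ cd $GOPATH/src/github.com/prometheus/prometheus
$ CGO_ENABLED=1 go build -race ./cmd/prometheus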

sevagh (Author) commented Sep 24, 2018

Sure - would that solve anything, or create more informative logs for the next crash? I can run a -race binary on this poller.

simonpasquier (Member) commented Sep 24, 2018

Additional information: it would log the data race if one is detected. The downside is that the program will be slower and use more memory.
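
(As a rough illustration, unrelated to the Prometheus code itself: the following minimal Go program has two goroutines writing the same variable without synchronization, and go run -race prints a "WARNING: DATA RACE" report with both stacks:)

package main

import "fmt"

func main() {
    counter := 0
    done := make(chan struct{})
    go func() {
        counter++ // unsynchronized write from a second goroutine
        close(done)
    }()
    counter++ // concurrent write from main; no lock or channel ordering between the two writes
    <-done
    fmt.Println(counter)
}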

sevagh (Author) commented Sep 24, 2018

I'm having some trouble with promu build --cgo and -race. In .promu.yml:

build:
    binaries:
        - name: prometheus
          path: ./cmd/prometheus
        - name: promtool
          path: ./cmd/promtool
    flags: -a -tags netgo -race

In Makefile.common build target:

build: promu
        @echo ">> building binaries"
        $(PROMU) build --cgo --prefix $(PREFIX)

This doesn't seem to propagate CGO_ENABLED=1 to the go build:

shanssian:prometheus $ make build
GOOS= GOARCH= go get -u github.com/prometheus/promu
>> building binaries
/home/shanssian/go/bin/promu build --cgo --prefix /home/shanssian/go/src/github.com/prometheus/prometheus
 >   prometheus
go build: -race requires cgo; enable cgo by setting CGO_ENABLED=1

simonpasquier (Member) commented Sep 24, 2018

You need to add this to .promu.yml too:

go:
  cgo: true
...
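
(Putting the two snippets together, the relevant .promu.yml fragment would look roughly like this; a sketch only, the real file contains additional settings omitted here:)

go:
  cgo: true
build:
    binaries:
        - name: prometheus
          path: ./cmd/prometheus
        - name: promtool
          path: ./cmd/promtool
    flags: -a -tags netgo -race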

sevagh (Author) commented Sep 24, 2018

Thanks, that worked. I'll report back with the new logs from the race binary (when it happens again).

sevagh (Author) commented Sep 24, 2018

Unfortunately, the performance hit was rather big. I'll need a better way to run this than on my main two HA pollers; maybe a third poller.

sevagh (Author) commented Sep 27, 2018

The poller crashes after a long startup when -race is enabled:

Sep 27 09:36:16 clip prometheus[49735]: ThreadSanitizer: DenseSlabAllocator overflow. Dying.
Sep 27 09:36:23 clip systemd[1]: prometheus2.service: Main process exited, code=exited, status=66/n/a
Sep 27 09:36:23 clip systemd[1]: prometheus2.service: Unit entered failed state.
Sep 27 09:36:23 clip systemd[1]: prometheus2.service: Failed with result 'exit-code'.

simonpasquier (Member) commented Sep 28, 2018

It looks like running with race detection requires too much memory on your system, so no luck there. Can you run promtool debug all http://<prometheus address:port> and share the resulting file?
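
(For example, assuming the server listens on the default localhost:9090; the command gathers debug information such as pprof profiles and metrics from the running server into an archive:)

$ promtool debug all http://localhost:9090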

simonpasquier (Member) commented Oct 18, 2018

@sevagh ping

sevagh (Author) commented Oct 18, 2018

Oops. What happened is that I was running poller version 2.3.2 (where promtool doesn't have the debug command). I upgraded to 2.4.2.

At the same time, somebody reduced the number of metrics by around 12x (there was an application bug where many scrape targets were exposing duplicate metrics).

I haven't seen any crashes since.

simonpasquier (Member) commented Oct 18, 2018

Thanks for the follow-up! I'm closing it now; feel free to reopen if it crashes again, but the system was probably unstable because of the excessive number of metrics.

lock bot locked and limited conversation to collaborators Apr 16, 2019