Prometheus crash: goroutine stack exceeds #3827

Closed
iobestar opened this Issue Feb 12, 2018 · 7 comments

iobestar commented Feb 12, 2018

What did you do?
Nothing special. This happens from time to time; it looks like some sort of leak.

What did you expect to see?
Prometheus not crashing.

What did you see instead? Under which circumstances?
Prometheus crashed and the Docker container was restarted afterwards. Before the crash there were a lot of errors in the log (see Logs).

  • Prometheus version:
    prometheus, version 2.1.0 (branch: HEAD, revision: 85f23d8)
    build user: root@6e784304d3ff
    build date: 20180119-12:07:34
    go version: go1.9.2

  • Logs:

goroutine 56659867 [select]:
net/http.(*persistConn).writeLoop(0xca53a26a20)
	/usr/local/go/src/net/http/transport.go:1759 +0x165
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1187 +0xa53

goroutine 56543182 [IO wait]:
internal/poll.runtime_pollWait(0x7f93ac90c160, 0x72, 0x0)
	/usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc7fefe3898, 0x72, 0xffffffffffffff00, 0x28f8a80, 0x28ec600)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitRead(0xc7fefe3898, 0xc90c066000, 0x1000, 0x1000)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc7fefe3880, 0xc90c066000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:126 +0x18a
net.(*netFD).Read(0xc7fefe3880, 0xc90c066000, 0x1000, 0x1000, 0x0, 0xc786d48fc0, 0xc662e8b320)
	/usr/local/go/src/net/fd_unix.go:202 +0x52
net.(*conn).Read(0xc5c27083d8, 0xc90c066000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:176 +0x6d
github.com/prometheus/prometheus/vendor/github.com/mwitkow/go-conntrack.(*clientConnTracker).Read(0xc82ec18200, 0xc90c066000, 0x1000, 0x1000, 0x404108, 0xc94e5a3320, 0xc70b53ea80)

simonpasquier commented Feb 12, 2018

Can you upload the full logs somewhere?


iobestar commented Feb 12, 2018


simonpasquier commented Feb 12, 2018

It looks like some infinite recursion... Can you share your Prometheus config too?


iobestar commented Feb 12, 2018

My configuration is full of sensitive data, so sharing it is a last resort. Do you have any suggestions on what to look for? This started happening after upgrading to 2.1.0, and it is correlated with heavy query execution and extensive disk reads.

There are also a lot of errors like:
level=error ts=2018-02-12T15:06:09.824626834Z caller=notifier.go:454 component=notifier alertmanager=/api/v1/alerts count=1 msg="Error sending alert" err="bad response status 400 Bad Request"

but I think they are not related.


simonpasquier commented Feb 12, 2018

You can look at the prometheus_tsdb_... metrics and see if there's anything unusual. Can you provide at least a rough number of time series and samples scraped, as well as an example of the "heavy" queries you're mentioning?
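For example, queries along these lines (assuming the standard self-monitoring metrics exposed by Prometheus 2.x itself) give a rough idea of the load:

prometheus_tsdb_head_series                              # active series in the head block
rate(prometheus_tsdb_head_samples_appended_total[5m])    # samples ingested per second
sum(scrape_samples_scraped)                              # samples collected in the latest scrape round, across all targets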

> There are also a lot of errors like:
> level=error ts=2018-02-12T15:06:09.824626834Z caller=notifier.go:454 component=notifier alertmanager=/api/v1/alerts count=1 msg="Error sending alert" err="bad response status 400 Bad Request"
> but I think they are not related.

Probably not.

iobestar changed the title from "Prometheus crash after IO wait errors" to "Prometheus crash: goroutine stack exceeds" on Feb 13, 2018


fabxc commented Feb 14, 2018

We could pin this down to an implementation detail in prometheus/tsdb. It should be possible to fix this for 2.2.
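For context, the crash signature in the title is Go's hard limit on per-goroutine stack growth: unbounded recursion keeps adding stack frames until the runtime aborts the whole process. The following is not the actual tsdb code, just a minimal Go sketch of that failure mode:

package main

// A goroutine's stack grows on demand up to a fixed limit (1 GB by
// default on 64-bit platforms). Recursion with no base case eventually
// hits that limit and the runtime kills the process with
//   runtime: goroutine stack exceeds 1000000000-byte limit
//   fatal error: stack overflow
func recurse(n int) int {
	return recurse(n+1) + 1 // no base case, so the stack grows without bound
}

func main() {
	recurse(0)
}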


lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators on Mar 22, 2019
