Prometheus v2.2.0 deadlocked loading targets page #3940

Closed
tomwilkie opened this Issue Mar 9, 2018 · 7 comments

tomwilkie commented Mar 9, 2018

fabxc commented Mar 13, 2018

@krasi-georgiev that has good odds of being a regression from the scrape refactoring. Over the years we've had more deadlocks caused by loading the targets page than I can count, and they all came from somewhere in there :)

tomwilkie commented Mar 13, 2018

Sorry, I didn't add more background:

  • I think I hit the targets page very quickly after the instance came up.
  • I was debugging something else, so I grabbed the stacks and then threw the instance away; it didn't recur.
  • This was with the v2.2.0 Docker release image.
  • The instance was up for about 5 minutes and other endpoints worked, but the targets page wouldn't load for the entire time.
  • It didn't look like it had managed to find any targets, though, as there was no data on it.

krasi-georgiev commented Mar 13, 2018

I will have a look when I get bored poking with tsdb 👍

tomwilkie commented Mar 13, 2018

There are a bunch of goroutines all blocked at roughly the same point: goroutines 1114, 1293, and 1118. Goroutine 1118 is holding the m.mtx write lock and is blocked on acquiring the scrape pool read lock. The scrape pool write lock is held by scrapePool.Sync (goroutine 183), which is waiting for the scrapers to stop. The scrapers in turn are blocked on remote write, which is blocked on resharding, which is blocked on the remote write bug I was trying to fix (#3809).
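
For readers less familiar with the scrape manager internals, here is a minimal, hypothetical Go sketch of that wait chain. The `targetManager`, `scrapePool`, and `loop` names and methods are illustrative only, not the actual Prometheus v2.2.0 code: one goroutine holds the manager mutex and waits for the pool read lock, Sync holds the pool write lock and waits for a scrape loop to stop, and the loop is stuck on a remote-write send that never completes. Running it should trip Go's run-time deadlock detector.

```go
// Hypothetical sketch of the reported lock chain; not the real Prometheus code.
package main

import (
	"sync"
	"time"
)

type loop struct {
	stopped chan struct{} // closed when the scrape loop exits
	samples chan int      // stand-in for the remote-write queue
}

// run simulates a scrape loop wedged on a stuck remote-write queue.
func (l *loop) run() {
	l.samples <- 1 // blocks forever: nothing drains the queue (the resharding bug)
	close(l.stopped)
}

type scrapePool struct {
	mtx  sync.RWMutex
	loop *loop
}

// Sync holds the pool write lock while waiting for the old loop to stop.
func (sp *scrapePool) Sync() {
	sp.mtx.Lock()
	defer sp.mtx.Unlock()
	<-sp.loop.stopped // never closed, because run() is stuck on its send
}

type targetManager struct {
	mtx  sync.Mutex
	pool *scrapePool
}

// Targets is what a /targets page handler would call: it takes m.mtx and then
// the pool read lock, which Sync already holds for writing.
func (m *targetManager) Targets() {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	m.pool.mtx.RLock()
	defer m.pool.mtx.RUnlock()
}

func main() {
	l := &loop{stopped: make(chan struct{}), samples: make(chan int)}
	sp := &scrapePool{loop: l}
	m := &targetManager{pool: sp}

	go l.run()   // "scraper" blocks on remote write
	go sp.Sync() // takes the pool write lock, waits for the scraper to stop

	time.Sleep(100 * time.Millisecond) // let Sync grab the write lock first
	m.Targets()                        // targets page blocks on the read lock
}
```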

tomwilkie commented Mar 13, 2018

As #3809 is fixed, it looks like there is nothing to fix here.

tomwilkie closed this Mar 13, 2018

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
