2.6 deadlocked loading targets page #5082

Closed
tomwilkie opened this Issue Jan 8, 2019 · 11 comments

tomwilkie commented Jan 8, 2019

goroutine 5873214 [semacquire, 2 minutes]:
sync.runtime_SemacquireMutex(0xc0004e434c, 0xc077f91300)
	/usr/local/Cellar/go/1.11.2/libexec/src/runtime/sema.go:71 +0x3d
sync.(*Mutex).Lock(0xc0004e4348)
	/usr/local/Cellar/go/1.11.2/libexec/src/sync/mutex.go:134 +0xff
github.com/prometheus/prometheus/scrape.(*Manager).TargetsActive(0xc0004e4320, 0x0)
	/Users/twilkie/Documents/src/github.com/prometheus/prometheus/scrape/manager.go:189 +0x57
github.com/prometheus/prometheus/web.(*Handler).targets(0xc000468d00, 0x7fafc97e78a8, 0xc0d7861d60, 0xc0017bc300)
	/Users/twilkie/Documents/src/github.com/prometheus/prometheus/web/web.go:708 +0x51
github.com/prometheus/prometheus/web.(*Handler).targets-fm(0x7fafc97e78a8, 0xc0d7861d60, 0xc0017bc300)
	/Users/twilkie/Documents/src/github.com/prometheus/prometheus/web/web.go:286 +0x48
github.com/prometheus/prometheus/web.(*Handler).testReady.func1(0x7fafc97e78a8, 0xc0d7861d60, 0xc0017bc300)
	/Users/twilkie/Documents/src/github.com/prometheus/prometheus/web/web.go:404 +0x55

tomwilkie commented Jan 8, 2019

Lock: https://github.com/prometheus/prometheus/blob/v2.6.0/scrape/manager.go#L189

Looks like it's ApplyConfig that's holding the lock, and it is itself blocked in scrapePool.reload on a wg.Wait, which I suspect is blocked on an old scrape loop stopping (a minimal sketch of this lock chain follows the trace below).

goroutine 119 [semacquire, 3 minutes]:
sync.runtime_Semacquire(0xc0e389e638)
	/usr/local/Cellar/go/1.11.2/libexec/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc0e389e630)
	/usr/local/Cellar/go/1.11.2/libexec/src/sync/waitgroup.go:130 +0x64
github.com/prometheus/prometheus/scrape.(*scrapePool).reload(0xc0038a4c80, 0xc0fb300240)
	/Users/twilkie/Documents/src/github.com/prometheus/prometheus/scrape/scrape.go:271 +0x4c8
github.com/prometheus/prometheus/scrape.(*Manager).ApplyConfig(0xc0004e4320, 0xc0909c2d80, 0x0, 0x0)
	/Users/twilkie/Documents/src/github.com/prometheus/prometheus/scrape/manager.go:167 +0x259
github.com/prometheus/prometheus/scrape.(*Manager).ApplyConfig-fm(0xc0909c2d80, 0x0, 0x0)
	/Users/twilkie/Documents/src/github.com/prometheus/prometheus/cmd/prometheus/main.go:336 +0x34
main.reloadConfig(0x7ffe28fa96e3, 0x1e, 0x1e99b60, 0xc000690a50, 0xc0006949c0, 0x7, 0x7, 0x0, 0x0)
	/Users/twilkie/Documents/src/github.com/prometheus/prometheus/cmd/prometheus/main.go:649 +0x228
main.main.func13(0x0, 0x0)
	/Users/twilkie/Documents/src/github.com/prometheus/prometheus/cmd/prometheus/main.go:490 +0x1e6
github.com/oklog/oklog/pkg/group.(*Group).Run.func1(0xc0002b6540, 0xc0002b6480, 0xc0001d4140)
	/Users/twilkie/Documents/src/github.com/prometheus/prometheus/vendor/github.com/oklog/oklog/pkg/group/group.go:38 +0x27
created by github.com/oklog/oklog/pkg/group.(*Group).Run
	/Users/twilkie/Documents/src/github.com/prometheus/prometheus/vendor/github.com/oklog/oklog/pkg/group/group.go:37 +0xbe
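
For illustration, here is a minimal Go sketch of the lock chain described above, using hypothetical names rather than the actual Prometheus types: one goroutine holds the manager mutex across a wg.Wait that only completes once every old scrape loop has exited, while another goroutine (standing in for the /targets handler) blocks trying to take the same mutex. If any loop cannot stop, both hang.

// Hypothetical sketch of the deadlock chain, not the actual Prometheus code:
// applyConfig stands in for Manager.ApplyConfig -> scrapePool.reload, and
// targetsActive stands in for Manager.TargetsActive behind the targets page.
package main

import (
	"fmt"
	"sync"
	"time"
)

type manager struct {
	mtx   sync.Mutex
	stops []chan struct{} // one stop signal per running scrape loop
}

// applyConfig holds the manager mutex while waiting for every old loop to
// exit, mirroring how the mutex stays locked across reload's wg.Wait in the
// trace above.
func (m *manager) applyConfig() {
	m.mtx.Lock()
	defer m.mtx.Unlock()

	var wg sync.WaitGroup
	for _, stop := range m.stops {
		wg.Add(1)
		go func(stop chan struct{}) {
			defer wg.Done()
			<-stop // a loop that never stops, e.g. wedged on remote write
		}(stop)
	}
	wg.Wait() // never returns while any loop is stuck
}

// targetsActive needs the same mutex, so the targets page hangs behind
// applyConfig.
func (m *manager) targetsActive() int {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	return len(m.stops)
}

func main() {
	m := &manager{stops: []chan struct{}{make(chan struct{})}}
	go m.applyConfig()
	time.Sleep(100 * time.Millisecond) // let applyConfig grab the mutex first
	fmt.Println(m.targetsActive())     // blocks behind applyConfig
}

Running this hangs and the Go runtime aborts with "all goroutines are asleep - deadlock!", which has the same shape as the semacquire goroutines in the traces.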

tomwilkie commented Jan 8, 2019

There are 162 scrape loops; anecdotally, a bunch are blocked trying to append to the remote write queues.

I don't see who's holding that lock yet.
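
To make the missing link concrete, here is a hypothetical sketch (not the real scrapeLoop code) of why a loop that is stuck appending cannot stop: the loop only checks its stop signal between iterations, so stop() — and therefore reload's wg.Wait — waits forever on an iteration that never finishes.

// Hypothetical sketch: a loop whose stop() cannot return while an iteration
// is blocked inside appendSample (e.g. on a full remote-write queue).
package main

import "time"

type loop struct {
	stopCh chan struct{}
	done   chan struct{}
}

func (l *loop) run(appendSample func()) {
	defer close(l.done)
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-l.stopCh:
			return
		case <-ticker.C:
			// If this call blocks, the select above is never reached again
			// and the stop signal goes unnoticed.
			appendSample()
		}
	}
}

// stop waits for run to finish, just as reload's wg.Wait waits for old loops.
func (l *loop) stop() {
	close(l.stopCh)
	<-l.done
}

func main() {
	l := &loop{stopCh: make(chan struct{}), done: make(chan struct{})}
	go l.run(func() { select {} })      // an append that never returns
	time.Sleep(1500 * time.Millisecond) // let one iteration enter the append
	l.stop()                            // hangs: the loop never observes stopCh
}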


simonpasquier commented Jan 15, 2019

Have you been able to troubleshoot the problem?


mmerrill3 commented Jan 18, 2019

I'm being hit by this issue as well. +1


tomwilkie commented Jan 18, 2019


mmerrill3 commented Jan 18, 2019

I've also seen it when the remote write queues are full. When I turn off remote writes, it's all good, no issues.

This bit looks suspicious to me, especially if enqueue fails.

t.shardsMtx.RLock()
enqueued := t.shards.enqueue(&snew)
t.shardsMtx.RUnlock()
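
To illustrate the concern — a rough, hypothetical sketch rather than the actual 2.6 queue-manager code — suppose the append path keeps retrying whenever enqueue fails because the shard queues are full. While the remote endpoint is stalled, that call never returns, which is enough to keep a scrape loop from stopping and to wedge reload's wg.Wait as seen in the traces above.

// Hypothetical illustration of the suspected failure mode; the types, names
// and retry logic are simplified stand-ins, not the real remote-write code.
package main

import (
	"sync"
	"time"
)

type sample struct{ value float64 }

type shards struct {
	queue chan sample // bounded queue; real code shards across several queues
}

// enqueue fails immediately when the queue is full.
func (s *shards) enqueue(smpl sample) bool {
	select {
	case s.queue <- smpl:
		return true
	default:
		return false
	}
}

type queueManager struct {
	shardsMtx sync.RWMutex
	shards    *shards
}

// appendSample retries a failed enqueue until it succeeds. If nothing ever
// drains the queue, it never returns.
func (t *queueManager) appendSample(smpl sample) {
	for {
		t.shardsMtx.RLock()
		ok := t.shards.enqueue(smpl)
		t.shardsMtx.RUnlock()
		if ok {
			return
		}
		time.Sleep(50 * time.Millisecond) // back off and retry
	}
}

func main() {
	qm := &queueManager{shards: &shards{queue: make(chan sample, 1)}}
	qm.appendSample(sample{1}) // fills the only slot
	qm.appendSample(sample{2}) // retries forever: nothing consumes the queue
}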


michael-doubez commented Jan 22, 2019

I have also had this issue, in particular when changing a target from one job to another.


simonpasquier commented Apr 4, 2019

@tomwilkie is this issue still relevant considering that the remote write code has changed a lot in 2.8?


michael-doubez commented Apr 10, 2019

Since I switched to 2.8.1, I no longer have slowness or freezes when reloading.


simonpasquier commented Apr 19, 2019

Closing, feel free to re-open if it still occurs with 2.9.
