runtime: lock order violation between gscan and profInsert after CL 544195 #64706
That said, I am currently confused about the entire
Found new dashboard test flakes for:
2023-12-01 19:20 linux-amd64-staticlockranking go@5a2161ce runtime (log)
2023-12-06 17:59 linux-amd64-staticlockranking go@3f2bf706 runtime (log)
2023-12-13 00:22 linux-amd64-staticlockranking go@400e24a8 runtime (log)
I'd thought this was an issue with instrumentation / reporting, but it's not: although the (new in Go 1.22) code for adding runtime-internal lock contention events to the mprof hash table is guarded by a check to run only when the last lock is released (`gp.m.locks == 1`, because its caller has yet to decrement the count down to zero), the `gscan` lockrank appears to not be an actual "lock/unlock" mutex. The lock rank is used directly (
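That guard can be sketched as a toy model (this is not the runtime's actual code; the `m` struct, `profile` slice, and the `lock`/`unlock` helpers here are simplified stand-ins for illustration):

```go
package main

import "fmt"

// m models the one part of a runtime M that matters here: a count of
// locks currently held. (A hypothetical simplification of runtime.m.)
type m struct {
	locks int32
}

// profile is a stand-in for the mprof hash table.
var profile []string

func lock(mp *m) { mp.locks++ }

// unlock models the pattern described above: contention is recorded
// only when the last held lock is being released. At that point the
// caller has not yet decremented the count, so the check is
// locks == 1, not locks == 0.
func unlock(mp *m, name string) {
	if mp.locks == 1 {
		// After this release no counted runtime locks remain held, so
		// touching the profile table should be safe -- unless something
		// like the Gscan bit is held but not counted, which is exactly
		// this issue.
		profile = append(profile, name)
	}
	mp.locks--
}

func main() {
	mp := &m{}
	lock(mp)
	lock(mp)
	unlock(mp, "inner") // locks == 2: not recorded
	unlock(mp, "outer") // locks == 1: recorded
	fmt.Println(profile) // prints [outer]
}
```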
I'm not up to speed on `gscan` / `casgstatus` / `castogscanstatus`.
Does that look like an accurate description, @prattmic ? If so, then it seems the lock rank builder is trying to warn us of a real problem.
Come to think of it, I think I may have been the one who added the Gscan lock rank. We'd encountered a few Gscan-related deadlocks that were hard to debug, and we wanted to avoid that in the future. Gscan acts like a kind of spinlock on ownership of the G, but indeed, it does not increment
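That "spinlock on ownership of the G" behavior can be modeled with an atomic CAS on a status word. This is a loose sketch: the constants and helper names below are made up, only loosely mirroring `_Grunning`, the `_Gscan` bit, and `castogscanstatus` in the runtime:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Simplified goroutine status values (invented constants, loosely
// mirroring _Grunning and the _Gscan bit in the real runtime).
const (
	gRunning uint32 = 2
	gScan    uint32 = 0x1000
)

// castogscan tries to set the scan bit via compare-and-swap, the way
// a spinlock acquire would. Note that, unlike runtime lock/unlock,
// nothing here increments a per-M lock count, so by default the lock
// ranker cannot see the bit as a held lock.
func castogscan(status *uint32, old uint32) bool {
	return atomic.CompareAndSwapUint32(status, old, old|gScan)
}

// casfromgscan releases ownership by clearing the scan bit.
func casfromgscan(status *uint32, cur uint32) {
	atomic.StoreUint32(status, cur&^gScan)
}

func main() {
	status := gRunning
	ok := castogscan(&status, gRunning)
	fmt.Println(ok, status == gRunning|gScan) // prints true true: acquired
	// A second acquirer would spin, retrying until the bit is released:
	fmt.Println(castogscan(&status, gRunning)) // prints false: already held
	casfromgscan(&status, status)
	fmt.Println(status == gRunning) // prints true: released
}
```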
Oof. I think we narrowly missed a real deadlock issue here; it turns out to be just a limitation of our lock modeling.
TL;DR: This lock modeling issue is only present when
Here's an almost problematic ordering:
(1) G1 is running on T1 and calls into
(2) can't actually happen, but the reasons are very subtle. There are only three cases where the GC tries to acquire a goroutine's Gscan bit.
One is after asynchronously preempting, but asynchronous preemption will back out early if any locks are held by the target goroutine, so that can't happen.
Another case is where the Gscan bit is acquired and it's on a running goroutine in anticipation of a preemption, here: https://cs.opensource.google/go/go/+/master:src/runtime/preempt.go;l=199?q=preempt.go&ss=go%2Fgo. Luckily nothing happens that could cause a lock to be acquired between the acquire and release of the Gscan bit.
Finally, the Gscan bit can be acquired for a blocked goroutine, but if that were the case for G1 above, then (1) couldn't have happened, since a goroutine needs to be running to acquire its own scan bit.
One other possibly problematic ordering is when two GC threads compete for the same Gscan bit: one acquires and releases a lock while that bit is held, causing
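For context, the discipline the static lock ranking enforces can be sketched with a tiny rank checker. This is illustrative only: the real checker lives in the runtime's lockrank machinery and is far more involved, and the ranks below are invented so that acquiring `profInsert` while `gscan` is held gets flagged, matching the violation in the title:

```go
package main

import "fmt"

// rankedLock pairs a lock with its position in the allowed
// acquisition order.
type rankedLock struct {
	name string
	rank int
}

// held tracks locks held by the current thread, in acquisition order.
var held []*rankedLock

// acquire flags an ordering violation when the new lock's rank is not
// strictly greater than that of the most recently acquired lock
// (if the discipline is followed, that is also the highest held rank).
func acquire(l *rankedLock) error {
	if n := len(held); n > 0 && l.rank <= held[n-1].rank {
		return fmt.Errorf("lock order violation: %s (rank %d) after %s (rank %d)",
			l.name, l.rank, held[n-1].name, held[n-1].rank)
	}
	held = append(held, l)
	return nil
}

// release drops the most recently acquired lock.
func release() { held = held[:len(held)-1] }

func main() {
	// Invented ranks for illustration only.
	profInsert := &rankedLock{"profInsert", 10}
	gscan := &rankedLock{"gscan", 20}

	fmt.Println(acquire(profInsert)) // prints <nil>
	fmt.Println(acquire(gscan))      // prints <nil>: consistent order
	release()
	release()

	fmt.Println(acquire(gscan))      // prints <nil>
	fmt.Println(acquire(profInsert)) // violation reported, as in the title
}
```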
Modeling this properly in the lock ranking is going to be a pain. And I'm not a fan of incrementing
OK, that's all the bad-ish news. Our saving grace for the release is that this can only happen with
Moving out of release-blocker and into Go 1.23.
Like #64253, I think this can still be worked on as a test-only fix for Go 1.22. The lock ranking mode is entirely for our own testing, and this issue only pops up with
But there's no reason this has to block the Go 1.22 release, which is why I moved it to Go 1.23.
I'm also seeing a similar problem in