runtime: lock order violation between gscan and profInsert after CL 544195 #64706
Found new dashboard test flakes for:
2023-12-01 19:20 linux-amd64-staticlockranking go@5a2161ce runtime (log)
2023-12-06 17:59 linux-amd64-staticlockranking go@3f2bf706 runtime (log)
2023-12-13 00:22 linux-amd64-staticlockranking go@400e24a8 runtime (log)
Should this be a release blocker? Lock inversion in the runtime seems like a pretty severe bug. 😅 (Or is this an existing issue that was only recently exposed by testing?)
It should be a release blocker until we understand better. It is a new issue, not just uncovered.
I'd thought this was an issue with instrumentation / reporting, but it's not. Although the (new in Go 1.22) code for adding runtime-internal lock contention events to the mprof hash table is guarded by a check to run only when the last lock is released (Line 673 in 400e24a; the check is against gp.m.locks==1 because its caller has yet to decrement the count down to zero, Line 134 in 400e24a), the gscan lockrank appears to not be an actual "lock/unlock" mutex. The lock rank is used directly (Line 1082 in 400e24a) without a matching change to gp.m.locks.

I'm not up to speed on gscan / casgstatus / castogscanstatus. Does that look like an accurate description, @prattmic? If so, then it seems the lock rank builder is trying to warn us of a real problem.
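To make the mismatch concrete, here is a minimal, self-contained model of the two paths described above. The names (`mModel`, `unlockLast`, `acquireRankOnly`) are illustrative, not the runtime's; only `gp.m.locks`, `acquireLockRank`, and the gscan/profInsert roles come from the discussion:

```go
package main

import "fmt"

// mModel stands in for the runtime M: `locks` mirrors gp.m.locks,
// `ranks` mirrors the M's list of active lock ranks.
type mModel struct {
	locks int
	ranks []string
}

// unlockLast models the Go 1.22 contention-profiling guard: record the
// pending event only when the last counted lock is being released (the
// count is still 1 because the caller decrements it afterwards).
func (m *mModel) unlockLast() {
	if m.locks == 1 {
		fmt.Println("recording contention event: acquires profInsert")
	}
	m.locks--
}

// acquireRankOnly models direct use of acquireLockRank, as for the
// Gscan bit: the rank list changes, but m.locks does not.
func (m *mModel) acquireRankOnly(rank string) {
	m.ranks = append(m.ranks, rank)
}

func main() {
	m := &mModel{}
	m.acquireRankOnly("gscan") // gscan rank "held", but locks == 0
	m.locks = 1                // one real mutex held...
	m.unlockLast()             // ...releasing it takes profInsert after gscan
	fmt.Println("ranks still held:", m.ranks)
}
```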
Come to think of it, I think I may have been the one that added the Gscan lock rank. It was because we'd encountered a few Gscan-related deadlocks and they were hard to debug, so we wanted to avoid that in the future. Gscan acts like a kind of spinlock on ownership of the G, but indeed, it does not increment gp.m.locks.
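For readers unfamiliar with the Gscan bit, here is a rough, self-contained sketch of the spinlock-like behavior described above. The constants echo the runtime's _Gscan/_Grunning status encoding and the CAS pattern of castogscanstatus / casfrom_Gscanstatus, but this is a model, not the runtime's code:

```go
package main

import "sync/atomic"

// Status encoding modeled on the runtime's: the scan bit is OR'd onto a
// base status, and taking it via CAS confers temporary ownership of the G.
const (
	gRunning uint32 = 2      // stand-in for _Grunning
	gScan    uint32 = 0x1000 // stand-in for _Gscan
)

type gModel struct{ status atomic.Uint32 }

// Like castogscanstatus: succeeds only if the status is exactly `from`,
// so at most one thread holds the scan bit at a time.
func (g *gModel) tryAcquireScan(from uint32) bool {
	return g.status.CompareAndSwap(from, from|gScan)
}

// Like casfrom_Gscanstatus: drop the bit, releasing ownership.
func (g *gModel) releaseScan(to uint32) {
	g.status.Store(to)
}

func main() {
	g := &gModel{}
	g.status.Store(gRunning)
	for !g.tryAcquireScan(gRunning) {
		// spin, like a contended spinlock
	}
	// ...scan or manipulate the G here, without gp.m.locks changing...
	g.releaseScan(gRunning)
}
```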
I haven't yet been able to come up with a way that this could cause an actual deadlock, but I agree that having gscan increment gp.m.locks seems reasonable.
Oof. I think we narrowly miss a real deadlock issue here and it turns out to just be a limitation of our lock modeling.

TL;DR: This lock modeling issue is only present when staticlockranking is enabled.

Here's an almost problematic ordering: (1) G1 is running on T1 and calls into […]. (2) […].

(2) can't actually happen, but the reasons are very subtle. There are only three cases where the GC tries to acquire a goroutine's Gscan bit. One is after asynchronously preempting, but asynchronous preemption will back out early if any locks are held by the target goroutine, so that can't happen. Another case is where the Gscan bit is acquired on a running goroutine in anticipation of a preemption, here: https://cs.opensource.google/go/go/+/master:src/runtime/preempt.go;l=199?q=preempt.go&ss=go%2Fgo. Luckily nothing happens that could cause a lock to be acquired between the acquire and release of the Gscan bit. Finally, the Gscan bit can be acquired for a blocked goroutine, but if that were the case for G1 above, then (1) couldn't have been happening, since the goroutine needs to be running to acquire its own scan bit.

One other possibly problematic ordering is where two GC threads compete for the same Gscan bit. One acquires and releases a lock while that bit is held, causing profInsert to be acquired while the gscan rank is held.

Modeling this properly in the lock ranking is going to be a pain. And I'm not a fan of incrementing gp.m.locks for the Gscan bit.

OK, that's all the bad-ish news. Our saving grace for the release is that this can only happen with staticlockranking enabled, which is a testing-only configuration.

Moving out of release-blocker and into Go 1.23.
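The back-out in the first case above is worth seeing concretely. The runtime's asynchronous preemption path consults a canPreemptM-style check before suspending a thread; below is a simplified, self-contained stand-in for that check (field names echo the runtime's, but the code is a model):

```go
package main

import "fmt"

// mState models the fields consulted before asynchronously preempting a
// thread: if the target M holds any locks, preemption backs out early,
// which is why the first case above cannot produce the bad interleaving.
type mState struct {
	locks      int
	mallocing  int
	preemptoff string
}

// canPreempt is a simplified stand-in for the runtime's canPreemptM check.
func canPreempt(mp *mState) bool {
	return mp.locks == 0 && mp.mallocing == 0 && mp.preemptoff == ""
}

func main() {
	mp := &mState{locks: 1}
	fmt.Println("preempt while holding a lock?", canPreempt(mp)) // false
}
```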
Like #64253, I think this can be worked on as a test-only fix for Go 1.22 still. The lock ranking mode is entirely for our own testing, and this issue only pops up with staticlockranking enabled.

But there's no reason this has to block the Go 1.22 release at all, hence why I moved it to Go 1.23.
I'm also seeing a similar problem in […].
Yes, I think so.
Profiling of runtime-internal locks checks gp.m.locks to see if it's safe to add a new record to the profile, but direct use of acquireLockRank can change the list of the M's active lock ranks without updating gp.m.locks to match. The runtime's internal rwmutex implementation makes a point of calling acquirem/releasem when manipulating the lock rank list, but the other user of acquireLockRank (the GC's Gscan bit) relied on the GC's invariants to avoid deadlocks.

Codify the rwmutex approach by having acquireLockRank (now called acquireLockRankAndM) include a call to acquirem. Do the same for release.

For golang#64706
For golang#66004
Change https://go.dev/cl/571056 mentions this issue: […]
Profiling of runtime-internal locks checks gp.m.locks to see if it's safe to add a new record to the profile, but direct use of acquireLockRank can change the list of the M's active lock ranks without updating gp.m.locks to match. The runtime's internal rwmutex implementation makes a point of calling acquirem/releasem when manipulating the lock rank list, but the other user of acquireLockRank (the GC's Gscan bit) relied on the GC's invariants to avoid deadlocks.

Codify the rwmutex approach by adding a variant of acquireLockRank, acquireLockRankAndM, which includes a call to acquirem. Do the same for release. Leave runtime/time.go's use of the old variants intact for the moment.

For golang#64706
For golang#66004

Change-Id: Id18e4d8de1036de743d2937fad002c6feebe2faf
Profiling of runtime-internal locks checks gp.m.locks to see if it's safe to add a new record to the profile, but direct use of acquireLockRank can change the list of the M's active lock ranks without updating gp.m.locks to match. The runtime's internal rwmutex implementation makes a point of calling acquirem/releasem when manipulating the lock rank list, but the other user of acquireLockRank (the GC's Gscan bit) relied on the GC's invariants to avoid deadlocks.

Codify the rwmutex approach by renaming acquireLockRank to acquireLockRankAndM and having it include a call to acquirem. Do the same for release.

For golang#64706
For golang#66004

Change-Id: Id18e4d8de1036de743d2937fad002c6feebe2faf
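Based on the commit messages above, the shape of the fix is to route raw rank acquisition through the same counter the profiler checks. A minimal, self-contained model of that idea follows; only the name acquireLockRankAndM comes from the CL, everything else is illustrative:

```go
package main

import "fmt"

// mModel stands in for the runtime M with its lock count and rank list.
type mModel struct {
	locks int
	ranks []string
}

// acquirem/releasem stand-ins: the counter the profiler consults.
func (m *mModel) acquirem() { m.locks++ }
func (m *mModel) releasem() { m.locks-- }

// acquireLockRankAndM, per the commit message: bump m.locks first, then
// record the rank. While a raw rank like gscan is held, m.locks > 0, so
// the "last lock released" guard (locks == 1 at an inner unlock) defers
// profile insertion instead of taking profInsert after gscan.
func (m *mModel) acquireLockRankAndM(rank string) {
	m.acquirem()
	m.ranks = append(m.ranks, rank)
}

func (m *mModel) releaseLockRankAndM(rank string) {
	m.ranks = m.ranks[:len(m.ranks)-1]
	m.releasem()
}

func main() {
	m := &mModel{}
	m.acquireLockRankAndM("gscan")
	fmt.Println("locks:", m.locks, "ranks:", m.ranks) // locks: 1
	m.releaseLockRankAndM("gscan")
}
```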
https://build.golang.org/log/9ef94ae07bd07605e17fcd241ef0d38d5064b41f
https://build.golang.org/log/3f80d959a6cc683334f8521b8a3b002dda1bd6b3
https://build.golang.org/log/a5bfdd06fa69808fe6378b412610e65b06ddd8ff
Taking profInsert after gscan is indeed an explicit violation based on our defined lock order: https://cs.opensource.google/go/go/+/master:src/runtime/mklockrank.go;l=125-138;drc=23c0d30244d5e38d928e68661bf39458eb85d2b2.

That said, I am currently confused about the entire STACKGROW category (long day, perhaps). As far as I can tell, the gscan "lock" is only ever held from the system stack, so I don't understand how any of these other locks could "grow the stack", or what exactly we are protecting against.

cc @rhysh @mknyszek
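Concretely, the checker's complaint can be reproduced with a toy version of the partial-order check. The ranks and the acquire logic below are heavily simplified (the real order lives in mklockrank.go and the real check in lockWithRank and friends); the only grounded facts are that profInsert ranks below gscan and that acquiring a lower rank while a higher one is held is flagged:

```go
package main

import "fmt"

// Simplified ranks: profInsert must be acquired before gscan per the
// linked mklockrank.go section, so taking profInsert while the gscan
// rank is held is a violation.
const (
	rankProfInsert = 1
	rankGscan      = 2
)

var rankName = map[int]string{rankProfInsert: "profInsert", rankGscan: "gscan"}

var held []int // the M's currently held ranks, innermost last

// acquire enforces that each new lock ranks above everything already
// held: a toy version of the runtime's static lock-rank check.
func acquire(r int) {
	for _, h := range held {
		if h >= r {
			fmt.Printf("lock ordering problem: acquiring %s while holding %s\n",
				rankName[r], rankName[h])
		}
	}
	held = append(held, r)
}

func main() {
	acquire(rankGscan)      // rank raised for the Gscan bit
	acquire(rankProfInsert) // contention recording: flagged, as in the logs
}
```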