The benchmark runs on GOOS=linux, GOARCH=amd64 with GOMAXPROCS=8. It creates 10 workers which consume 1000 trivial work units from a shared channel. I've reproduced the regression on linux/amd64 and confirmed that it's introduced by https://go.dev/cl/544195 (and its predecessor, https://go.dev/cl/528657).
The magnitude of the regression is a ~5% slowdown on contended channel operations. On the hardware I used to test, they go from 174ns each for 10 workers on 8 threads (174µs per b.N iteration, covering 1000 items) to 184ns.
CPU profiles show that the extra 10ns goes to:
nanotime (to mark the start and end of contended lock acquisitions, sampled at gTrackingPeriod)
function call overhead
the bookkeeping that's now part of all runtime.unlock2 calls regardless of contention (to see if the M is storing contention info that needs to be moved to the mprof map)
Note that this code is active even with the default of GODEBUG=profileruntimelocks=0, because https://go.dev/cl/544195 reports the magnitude of runtime-internal lock contention as part of the /sync/mutex/wait/total:seconds metric.
I've been able to reduce the overhead through sound changes (reducing the sample rate, and structuring the function calls to convince the compiler to inline the fast paths into runtime.lock2 and runtime.unlock2). I've been able to eliminate the overhead through an additional unsound change: having runtime.unlock2 interact with the lock profile only when it's waking another thread. That change isn't sound because promoting data into the mprof structure is only safe when we're unlocking the M's last lock, which might not be one that another thread is waiting on. But it's similar to the Go 1.23 plans that @prattmic and I discussed, wherein we'd capture the stack of the unlocking thread and give it weight equivalent to how long it had delayed the other threads.
I'm not sure how much of a problem 10ns in this microbenchmark is in the big picture. But the above is what I learned when looking into it.
Thanks for taking a look. Whatever improvements we make, right now I'm of the opinion that a 5% regression in a heavily contended microbenchmark isn't worth fixing during the freeze and can wait until the next release. My opinion may change if real-world applications are affected.