Description
From the performance dashboard, https://perf.golang.org/dashboard/?benchmark=regressions, I found a performance regression on a microbenchmark of contended channel operations (which involve a contended runtime.mutex), https://perf.golang.org/dashboard/?benchmark=ParallelizeUntil/pieces:1000,workers:10,chunkSize:1-8&unit=sec/op .
https://github.com/kubernetes/client-go/blob/v0.22.2/util/workqueue/parallelizer_test.go
https://github.com/kubernetes/client-go/blob/v0.22.2/util/workqueue/parallelizer.go
The benchmark runs on GOOS=linux, GOARCH=amd64 with GOMAXPROCS=8. It creates 10 workers which consume 1000 trivial work units from a shared channel. I've reproduced the regression on linux/amd64 and confirmed that it's introduced by https://go.dev/cl/544195 (and its predecessor, https://go.dev/cl/528657).
The magnitude of the regression is a ~5% slowdown on contended channel operations. On the hardware I used to test, they move from 174ns each for 10 workers on 8 threads (174µs for each b.N iteration, representing 1000 items) to 184ns.
CPU profiles show that the extra 10ns goes to:
- nanotime (to mark the start and end of contended lock acquisitions, sampled at gTrackingPeriod)
- function call overhead
- the bookkeeping that's now part of all runtime.unlock2 calls regardless of contention (to see if the M is storing contention info that needs to be moved to the mprof map)
Note that this code is active even with the default of GODEBUG=profileruntimelocks=0, because https://go.dev/cl/544195 reports the magnitude of runtime-internal lock contention as part of the /sync/mutex/wait/total:seconds metric.
I've been able to reduce the overhead through sound changes: reducing the sample rate, and structuring the function calls to convince the compiler to inline the fast paths into runtime.lock2 and runtime.unlock2. I've been able to eliminate the overhead through an additional unsound change: having runtime.unlock2 interact with the lock profile only when it's waking another thread. That change isn't sound because promoting data into the mprof structure is only safe when we're unlocking the M's last lock (which might not be one for which another thread is waiting). But it's similar to the Go 1.23 plans that @prattmic and I discussed, wherein we'd capture the stack of the unlocking thread and give it weight equivalent to how long it had delayed the other threads.
I'm not sure how much of a problem 10ns in this microbenchmark is in the big picture. But the above is what I learned when looking into it.