runtime: ~5% regression in ParallelizeUntil/pieces:1000,workers:10,chunkSize:1-8 sec/op at 450ecbe #64455
Labels: compiler/runtime, NeedsInvestigation, Performance
From the performance dashboard, https://perf.golang.org/dashboard/?benchmark=regressions, I found a performance regression on a microbenchmark of contended channel operations (which involve a contended `runtime.mutex`), https://perf.golang.org/dashboard/?benchmark=ParallelizeUntil/pieces:1000,workers:10,chunkSize:1-8&unit=sec/op .

https://github.com/kubernetes/client-go/blob/v0.22.2/util/workqueue/parallelizer_test.go
https://github.com/kubernetes/client-go/blob/v0.22.2/util/workqueue/parallelizer.go
The benchmark runs on GOOS=linux, GOARCH=amd64 with GOMAXPROCS=8. It creates 10 workers which consume 1000 trivial work units from a shared channel. I've reproduced the regression on linux/amd64 and confirmed that it's introduced by https://go.dev/cl/544195 (and its predecessor, https://go.dev/cl/528657).
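For context, here is a minimal sketch (in a `_test.go` file) of the shape of that workload: 10 workers draining 1000 trivial pieces from one shared channel, so the receives contend on the channel's internal `runtime.mutex`. This is an approximation for illustration, not the kubernetes `ParallelizeUntil` code itself:

```go
package chanbench

import (
	"sync"
	"testing"
)

// BenchmarkContendedChannel approximates the ParallelizeUntil shape:
// 10 workers drain 1000 trivial work units from one shared channel,
// so the receives contend on the channel's internal runtime.mutex.
func BenchmarkContendedChannel(b *testing.B) {
	const pieces, workers = 1000, 10
	for i := 0; i < b.N; i++ {
		ch := make(chan int, pieces)
		for p := 0; p < pieces; p++ {
			ch <- p
		}
		close(ch)

		var wg sync.WaitGroup
		wg.Add(workers)
		for w := 0; w < workers; w++ {
			go func() {
				defer wg.Done()
				for range ch {
					// Trivial work unit: the cost is dominated by the
					// contended channel receive itself.
				}
			}()
		}
		wg.Wait()
	}
}
```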
The magnitude of the regression is a ~5% slowdown on contended channel operations. On the hardware I used to test, they move from 174ns each for 10 workers on 8 threads (174µs for each `b.N` iteration, representing 1000 items) to 184ns.

CPU profiles show that the extra 10ns goes to:

- `nanotime` (to mark the start and end of contended lock acquisitions, sampled at `gTrackingPeriod`; sketched below)
- `runtime.unlock2` calls regardless of contention (to see if the M is storing contention info that needs to be moved to the `mprof` map)
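The `gTrackingPeriod` sampling amortizes the timestamp cost by only timing a fraction of contended acquisitions. A rough illustration of that idea in ordinary Go; the names, the period, and the structure are made up and are not the runtime's actual `lock2` code:

```go
package lockcontention

import (
	"sync"
	"time"
)

// Illustration only: time roughly 1 in samplePeriod contended
// acquisitions so the two timestamps are not paid on every slow path.
// The period here is made up; gTrackingPeriod and nanotime are the
// runtime's own internal equivalents.
const samplePeriod = 8

var acquisitions uint64 // in the real design this is per-thread state

func lockContended(mu *sync.Mutex) {
	acquisitions++
	sampled := acquisitions%samplePeriod == 0

	var start time.Time
	if sampled {
		start = time.Now() // stands in for runtime.nanotime
	}
	mu.Lock() // the contended wait happens here
	if sampled {
		recordContention(time.Since(start))
	}
}

// recordContention is a stub for "feed the sample into mprof/metrics".
func recordContention(d time.Duration) { _ = d }
```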
Note that this code is active even with the default of `GODEBUG=profileruntimelocks=0`, because https://go.dev/cl/544195 reports the magnitude of runtime-internal lock contention as part of the `/sync/mutex/wait/total:seconds` metric.
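That metric is readable from user code via the `runtime/metrics` package, so the new contention accounting is visible even without the GODEBUG setting. A minimal sketch of reading it:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// /sync/mutex/wait/total:seconds is the metric that CL 544195
	// extends to cover runtime-internal lock contention.
	samples := []metrics.Sample{{Name: "/sync/mutex/wait/total:seconds"}}
	metrics.Read(samples)
	if samples[0].Value.Kind() == metrics.KindFloat64 {
		fmt.Printf("total mutex wait: %.6fs\n", samples[0].Value.Float64())
	}
}
```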
I've been able to reduce the overhead through sound changes (reducing the sample rate, structuring the function calls to convince the compiler to inline the fast paths into `runtime.lock2` and `runtime.unlock2`). I've been able to eliminate the overhead through an additional unsound change (to have `runtime.unlock2` only interact with the lock profile when it's waking another thread). That change isn't sound because promoting data into the `mprof` structure is only safe when we're unlocking the M's last lock (which might not be one for which another thread is waiting), but it's similar to the Go 1.23 plans that @prattmic and I discussed, wherein we'd capture the stack of the unlocking thread and give it weight equivalent to how long it had delayed the other threads.

I'm not sure how much of a problem 10ns in this microbenchmark is in the big picture. But the above is what I learned when looking into it.
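For reference, the "inline the fast paths" change follows the usual fast-path/slow-path split: keep the hot function tiny so the compiler inlines it at call sites, and outline the rare profiling/wakeup work. A generic sketch of that pattern with hypothetical names, not the actual `lock2`/`unlock2` code:

```go
package lockdemo

import "sync/atomic"

// fastMutex illustrates the fast-path/slow-path split: Unlock stays
// small enough for the compiler to inline, and the rarely needed work
// is outlined into unlockSlow.
type fastMutex struct {
	state int32 // 0 = unlocked, 1 = locked, 2 = locked with waiters
}

func (m *fastMutex) Unlock() {
	// Fast path: uncontended release, no waiter to wake and nothing
	// to profile.
	if atomic.CompareAndSwapInt32(&m.state, 1, 0) {
		return
	}
	m.unlockSlow()
}

//go:noinline
func (m *fastMutex) unlockSlow() {
	// Slow path: in the runtime this is where a waiter would be woken
	// and any pending contention sample flushed into the profile.
	atomic.StoreInt32(&m.state, 0)
}
```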
CC @prattmic @mknyszek
CC @golang/runtime