This is the second in what I think will be a series of three issue reports about contention I've seen on various runtime-internal singleton locks in an app that typically runs on dual-socket linux/amd64 machines with 96 (or sometimes 64) hardware threads.
This issue is about contention in runtime.(*mheap).freeSpan.func1 on mheap_.lock.
I hope it'll be of interest to @mknyszek . Also CC @golang/runtime
Sorry about the somewhat outdated version. I haven't seen related issues go by, so I hope the info in this one is still of some use.
What version of Go are you using (go version)?
$ go version
go version go1.19.6 linux/amd64
Does this issue reproduce with the latest release?
I don't know yet if the issue is present in Go 1.20 series or in the release candidates of the Go 1.21 series.
What operating system and processor architecture are you using (go env)?
The app runs on Linux / x86_64, typically on machines with two processor sockets (NUMA). I don't have the output of go env from the app's runtime environment.
What did you do?
A data-processing app received a higher volume of data than usual. Here are some more of its dimensions:
The average allocation size is about 128 bytes. Across the whole app, it averages around 4 GiB per second of allocations, or 16 kiB of allocations every 4 µs. Split across the 96 threads the runtime uses for the app (GOMAXPROCS == num hardware threads), that's 40 MiB per second per thread, an average of 3 µs between allocations, or 16 kiB of allocations every 400 µs.
What did you expect to see?
I expected to see very little contention on runtime-internal singleton locks.
What did you see instead?
Contention in runtime.(*mheap).freeSpan.func1 makes it hard for callers of runtime.mallocgc to do their required sweep assist work. The contention is on these two lines:
https://github.com/golang/go/blob/go1.19.6/src/runtime/mheap.go#L1486
https://github.com/golang/go/blob/go1.19.6/src/runtime/mheap.go#L1500
The code in runtime.(*sweepLocked).sweep closer to the root of the call stack looks interesting. It's the same in the development tip now as it was in Go 1.15. From my read of it, when a swept span is either completely or partially full, it gets pushed onto a lock-free stack. But if the swept span is completely empty, the resulting call to mheap_.freeSpan requires taking the global mheap_.lock lock.
@mknyszek , would it be possible for mheap_.freeSpan to use a lock-free structure (maybe lfstack) too?
|
if !preserve { |
|
// The caller may not have removed this span from whatever |
|
// unswept set its on but taken ownership of the span for |
|
// sweeping by updating sweepgen. If this span still is in |
|
// an unswept set, then the mcentral will pop it off the |
|
// set, check its sweepgen, and ignore it. |
|
if nalloc == 0 { |
|
// Free totally free span directly back to the heap. |
|
mheap_.freeSpan(s) |
|
return true |
|
} |
|
// Return span back to the right mcentral list. |
|
if uintptr(nalloc) == s.nelems { |
|
mheap_.central[spc].mcentral.fullSwept(sweepgen).push(s) |
|
} else { |
|
mheap_.central[spc].mcentral.partialSwept(sweepgen).push(s) |
|
} |
|
} |
Data set 39f2fa9279db66ec9e1b0a19b63bb1f479d827b39abfc67dbd715a4e18f7680b is from a instance of the app running on a 96-thread machine. The CPU profile ran for 5.16 seconds, during which it collected samples corresponding to a total 392 seconds of on-CPU time (an average of 76 on-CPU threads) and 98.16+73.70 = 171.86 seconds of time (an average of 33 on-CPU threads) in calls from runtime.(*mheap).freeSpan.func1 to lock and unlock mheap_.lock.
$ ln -s /usr/local/go/bin/go /tmp/redacted
$ (cd 39f2fa9279db66ec9e1b0a19b63bb1f479d827b39abfc67dbd715a4e18f7680b && go tool pprof -focus=futex '-show_from=^runtime\.' '-prune_from=runtime.(lock|unlock)' '-peek=runtime\.(lock|unlock)$' /tmp/redacted pprof/profile)
File: redacted
Type: cpu
Time: Jul 1, 2023 at 1:23pm (PDT)
Duration: 5.16s, Total samples = 391.71s (7591.56%)
Active filters:
focus=futex
show_from=^runtime\.
Showing nodes accounting for 181.79s, 46.41% of 391.71s total
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
98.16s 94.27% | runtime.(*mheap).freeSpan.func1 (inline)
2.81s 2.70% | runtime.newMarkBits (inline)
1.64s 1.57% | runtime.(*mheap).allocSpan (inline)
0.66s 0.63% | runtime.chansend (inline)
0.31s 0.3% | runtime.(*spanSet).push (inline)
0.27s 0.26% | runtime.sellock (inline)
0.10s 0.096% | runtime.findRunnable (inline)
0.06s 0.058% | runtime.goschedImpl (inline)
0.04s 0.038% | runtime.stopm (inline)
0.02s 0.019% | runtime.checkRunqsNoP (inline)
0.02s 0.019% | runtime.newArenaMayUnlock (inline)
0.02s 0.019% | runtime.runOneTimer (inline)
0.01s 0.0096% | runtime.checkTimers (inline)
0.01s 0.0096% | runtime.setprofilebucket (inline)
0.01s 0.0026% 0.0026% 104.13s 26.58% | runtime.lock
104.12s 100% | runtime.lockWithRank (inline)
----------------------------------------------------------+-------------
73.70s 95.05% | runtime.(*mheap).freeSpan.func1 (inline)
1.57s 2.02% | runtime.(*mheap).allocSpan (inline)
1.41s 1.82% | runtime.newMarkBits (inline)
0.32s 0.41% | runtime.chansend (inline)
0.22s 0.28% | runtime.(*spanSet).push (inline)
0.11s 0.14% | runtime.selunlock (inline)
0.08s 0.1% | runtime.findRunnable (inline)
0.06s 0.077% | runtime.goschedImpl (inline)
0.02s 0.026% | runtime.chansend.func1 (inline)
0.01s 0.013% | runtime.freeSpecial (inline)
0.01s 0.013% | runtime.mProf_Free (inline)
0.01s 0.013% | runtime.runOneTimer (inline)
0.01s 0.013% | runtime.stopm (inline)
0 0% 0.0026% 77.54s 19.80% | runtime.unlock
77.54s 100% | runtime.unlockWithRank (inline)
----------------------------------------------------------+-------------

This is the second in what I think will be a series of three issue reports about contention I've seen on various runtime-internal singleton locks in an app that typically runs on dual-socket linux/amd64 machines with 96 (or sometimes 64) hardware threads.
This issue is about contention in
runtime.(*mheap).freeSpan.func1onmheap_.lock.I hope it'll be of interest to @mknyszek . Also CC @golang/runtime
Sorry about the somewhat outdated version. I haven't seen related issues go by, so I hope the info in this one is still of some use.
What version of Go are you using (
go version)?Does this issue reproduce with the latest release?
I don't know yet if the issue is present in Go 1.20 series or in the release candidates of the Go 1.21 series.
What operating system and processor architecture are you using (
go env)?The app runs on Linux / x86_64, typically on machines with two processor sockets (NUMA). I don't have the output of
go envfrom the app's runtime environment.What did you do?
A data-processing app received a higher volume of data than usual. Here are some more of its dimensions:
The average allocation size is about 128 bytes. Across the whole app, it averages around 4 GiB per second of allocations, or 16 kiB of allocations every 4 µs. Split across the 96 threads the runtime uses for the app (GOMAXPROCS == num hardware threads), that's 40 MiB per second per thread, an average of 3 µs between allocations, or 16 kiB of allocations every 400 µs.
What did you expect to see?
I expected to see very little contention on runtime-internal singleton locks.
What did you see instead?
Contention in
runtime.(*mheap).freeSpan.func1makes it hard for callers ofruntime.mallocgcto do their required sweep assist work. The contention is on these two lines:https://github.com/golang/go/blob/go1.19.6/src/runtime/mheap.go#L1486
https://github.com/golang/go/blob/go1.19.6/src/runtime/mheap.go#L1500
The code in
runtime.(*sweepLocked).sweepcloser to the root of the call stack looks interesting. It's the same in the development tip now as it was in Go 1.15. From my read of it, when a swept span is either completely or partially full, it gets pushed onto a lock-free stack. But if the swept span is completely empty, the resulting call tomheap_.freeSpanrequires taking the globalmheap_.locklock.@mknyszek , would it be possible for
mheap_.freeSpanto use a lock-free structure (maybelfstack) too?go/src/runtime/mgcsweep.go
Lines 751 to 768 in 5d481ab
Data set 39f2fa9279db66ec9e1b0a19b63bb1f479d827b39abfc67dbd715a4e18f7680b is from a instance of the app running on a 96-thread machine. The CPU profile ran for 5.16 seconds, during which it collected samples corresponding to a total 392 seconds of on-CPU time (an average of 76 on-CPU threads) and 98.16+73.70 = 171.86 seconds of time (an average of 33 on-CPU threads) in calls from
runtime.(*mheap).freeSpan.func1to lock and unlockmheap_.lock.