Description
This is the third in what I think will be a series of three issue reports about contention I've seen on various runtime-internal singleton locks in an app that typically runs on dual-socket linux/amd64 machines with 96 (or sometimes 64) hardware threads.
This issue is about contention in `runtime.newMarkBits` on `gcBitsArenas.lock`.
I hope it'll be of interest to @mknyszek. Also CC @golang/runtime.
Sorry about the somewhat outdated version. I haven't seen related issues go by, so I hope the info in this one is still of some use.
What version of Go are you using (`go version`)?
$ go version
go version go1.19.6 linux/amd64
Does this issue reproduce with the latest release?
I don't know yet if the issue is present in Go 1.20 series or in the release candidates of the Go 1.21 series.
What operating system and processor architecture are you using (`go env`)?
The app runs on Linux / x86_64, typically on machines with two processor sockets (NUMA). I don't have the output of `go env` from the app's runtime environment.
What did you do?
A data-processing app received a higher volume of data than usual. Here are some more of its dimensions:
The average allocation size is about 128 bytes. Across the whole app, it averages around 4 GiB per second of allocations, or 16 kiB of allocations every 4 µs. Split across the 96 threads the runtime uses for the app (GOMAXPROCS == num hardware threads), that's 40 MiB per second per thread, an average of 3 µs between allocations, or 16 kiB of allocations every 400 µs.
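As a sanity check, the per-thread figures above follow from the totals. A quick sketch of the arithmetic (the 4 GiB/s, 96-thread, and 128-byte inputs are the measurements quoted above; the exact results round to the ~40 MiB/s and ~3 µs figures):

```go
package main

import "fmt"

func main() {
	const (
		totalBytesPerSec = 4 << 30 // ≈4 GiB/s of allocation across the whole app
		threads          = 96      // GOMAXPROCS == number of hardware threads
		avgAllocBytes    = 128     // average allocation size
	)
	// Per-thread byte rate, and the implied average gap between allocations.
	perThread := float64(totalBytesPerSec) / threads
	allocsPerSec := perThread / avgAllocBytes
	fmt.Printf("per-thread allocation rate: %.1f MiB/s\n", perThread/(1<<20))
	fmt.Printf("average gap between allocations: %.2f µs\n", 1e6/allocsPerSec)
}
```

This prints about 42.7 MiB/s per thread and a 2.86 µs gap between 128-byte allocations.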
What did you expect to see?
I expected to see very little contention on runtime-internal singleton locks.
What did you see instead?
Contention on a lock in `runtime.newMarkBits`, and on the lock-free data structure in `runtime.(*spanSet).pop`.
Although the runtime uses more than one `spanSet`, there appears to be only one per size class. That makes them singletons in effect, especially when the app's memory usage is concentrated in a small number of size classes.
Looking at the application-level callers of `runtime.mallocgc`, the on-CPU time is mostly from `runtime.newobject` as called from seven allocation sites. From the allocs profile, two of those sites allocate 64-byte objects, one allocates 80-byte objects, two allocate 112-byte objects, one allocates 128-byte objects, and one allocates slices of up to about 1 MiB (though most of the memory there is in 10 kiB allocations).
Looking at the size of memory allocations over the lifetime of the process (command below, though I suppose it's also available via `runtime.ReadMemStats`) confirms that the bulk of the bytes allocated are in the 64-, 80-, 112-, and 128-byte classes.
$ go tool pprof -raw pprof/allocs | sed -n '/^Samples/,/^Locations/ p' | grep ': ' | awk '{if (int($1)>0) {print int($2/int($1)),$0}}' | column -t | sort -n -k 3 -r
Data set 6d039fa93e6d0b7bb2446028f322b7ac9a0d3b3b47a2ea0ed201779a6d81ed78 is from an instance of the app running on a 96-thread machine. The CPU profile ran for 5.18 seconds, during which it collected samples corresponding to a total of 444 seconds of on-CPU time (an average of 86 on-CPU threads); 36.90 + 15.37 = 52.27 seconds of that time (an average of 10 on-CPU threads) was in calls from `runtime.newMarkBits` to lock and unlock `gcBitsArenas.lock`.
https://github.com/golang/go/blob/go1.19.6/src/runtime/mheap.go#L2059
https://github.com/golang/go/blob/go1.19.6/src/runtime/mheap.go#L2064
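For readers unfamiliar with this code path, the shape of the problem is that `gcBitsArenas` is a process-wide singleton, so every call to `runtime.newMarkBits` serializes on the same lock no matter which thread it runs on. A toy sketch of that shape (an illustration only, not the runtime's actual code — the `arena` type, counter, and refill size here are invented):

```go
package main

import (
	"fmt"
	"sync"
)

// arena is a stand-in for the singleton gcBitsArenas: one mutex guarding
// a shared pool that every allocation path must touch.
type arena struct {
	mu   sync.Mutex
	free int // remaining slots in the current "arena"
}

// newMarkBits mimics the hot path: take the global lock, carve out a slot,
// and refill (still under the lock) when the pool is empty.
func (a *arena) newMarkBits() {
	a.mu.Lock()
	if a.free == 0 {
		a.free = 1024 // refill while holding the singleton lock
	}
	a.free--
	a.mu.Unlock()
}

func main() {
	var (
		a  arena
		wg sync.WaitGroup
	)
	const workers, perWorker = 8, 100000
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWorker; j++ {
				a.newMarkBits() // all workers funnel through a.mu
			}
		}()
	}
	wg.Wait()
	fmt.Println("done:", workers*perWorker, "acquisitions of one lock")
}
```

At 96 hardware threads allocating every few microseconds, a structure of this shape becomes a bottleneck even though each critical section is short.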
It shows 100.46 seconds of time (an average of 19 on-CPU threads) spent manipulating the lock-free data structure in `runtime.(*spanSet).pop`. The breakdown of where that time goes includes the following:
54% of samples in CAS https://github.com/golang/go/blob/go1.19.6/src/runtime/mspanset.go#L168
16% of samples in Xadd https://github.com/golang/go/blob/go1.19.6/src/runtime/mspanset.go#L214
11% of samples here, with no atomics but within claimloop https://github.com/golang/go/blob/go1.19.6/src/runtime/mspanset.go#L146
...
0.17% of samples in this atomic load following claimloop, indicating that the cheap instructions within claimloop run many more times than the cheap instructions after it https://github.com/golang/go/blob/go1.19.6/src/runtime/mspanset.go#L190
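That distribution is consistent with a heavily contended claim loop: many goroutines read the same head, only one CAS succeeds, and the rest retry, so samples pile up on the CAS line itself. A minimal sketch of that loop shape (an illustration of the pattern, not the runtime's `spanSet` code — `indexStack` and its fields are invented):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// indexStack hands out slots of a fixed slice by advancing an atomic
// head index, in the same claim-loop shape as spanSet.pop.
type indexStack struct {
	head  atomic.Uint64
	items []int
}

// pop claims the next slot. Under contention, most iterations lose the
// CAS race and retry, which is where CPU samples concentrate.
func (s *indexStack) pop() (int, bool) {
	for {
		h := s.head.Load()
		if h >= uint64(len(s.items)) {
			return 0, false // empty
		}
		if s.head.CompareAndSwap(h, h+1) {
			return s.items[h], true
		}
		// CAS failed: another goroutine claimed index h; retry.
	}
}

func main() {
	s := &indexStack{items: make([]int, 10000)}
	for i := range s.items {
		s.items[i] = 1
	}
	var total atomic.Int64
	var wg sync.WaitGroup
	for g := 0; g < 8; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				v, ok := s.pop()
				if !ok {
					return
				}
				total.Add(int64(v))
			}
		}()
	}
	wg.Wait()
	fmt.Println(total.Load()) // 10000: every slot claimed exactly once
}
```

The structure stays correct under contention, but throughput is bounded by the cache line holding the head index bouncing between sockets, which matches the NUMA machines described above.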
$ ln -s /usr/local/go/bin/go /tmp/redacted
$ (cd 6d039fa93e6d0b7bb2446028f322b7ac9a0d3b3b47a2ea0ed201779a6d81ed78 && go tool pprof -focus=futex '-show_from=^runtime\.' '-prune_from=runtime.(lock|unlock)' '-peek=runtime\.(lock|unlock)$' /tmp/redacted pprof/profile)
File: redacted
Type: cpu
Time: Jul 1, 2023 at 1:16pm (PDT)
Duration: 5.18s, Total samples = 443.84s (8562.03%)
Active filters:
focus=futex
show_from=^runtime\.
Showing nodes accounting for 54.11s, 12.19% of 443.84s total
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
36.90s 97.16% | runtime.newMarkBits (inline)
0.30s 0.79% | runtime.newArenaMayUnlock (inline)
0.28s 0.74% | runtime.(*spanSet).push (inline)
0.25s 0.66% | runtime.chansend (inline)
0.08s 0.21% | runtime.(*mheap).allocSpan (inline)
0.06s 0.16% | runtime.goschedImpl (inline)
0.06s 0.16% | runtime.sellock (inline)
0.02s 0.053% | runtime.findRunnable (inline)
0.01s 0.026% | runtime.entersyscall_gcwait (inline)
0.01s 0.026% | runtime.mProf_Free (inline)
0.01s 0.026% | runtime.stackcacherefill (inline)
0 0% 0% 37.98s 8.56% | runtime.lock
37.98s 100% | runtime.lockWithRank (inline)
----------------------------------------------------------+-------------
15.37s 95.35% | runtime.newMarkBits (inline)
0.27s 1.67% | runtime.chansend (inline)
0.18s 1.12% | runtime.(*spanSet).push (inline)
0.10s 0.62% | runtime.selunlock (inline)
0.07s 0.43% | runtime.(*mheap).allocSpan (inline)
0.07s 0.43% | runtime.newArenaMayUnlock (inline)
0.03s 0.19% | runtime.goschedImpl (inline)
0.01s 0.062% | runtime.(*mheap).freeSpan.func1 (inline)
0.01s 0.062% | runtime.addspecial (inline)
0.01s 0.062% | runtime.runOneTimer (inline)
0 0% 0% 16.12s 3.63% | runtime.unlock
16.12s 100% | runtime.unlockWithRank (inline)
----------------------------------------------------------+-------------
$ (cd 6d039fa93e6d0b7bb2446028f322b7ac9a0d3b3b47a2ea0ed201779a6d81ed78 && go tool pprof "-trim_path=$trim_path" '-show_from=^runtime\.mallocgc$' '-peek=cacheSpan .*:137$' -lines /tmp/redacted pprof/profile)
File: redacted
Type: cpu
Time: Jul 1, 2023 at 1:16pm (PDT)
Duration: 5.18s, Total samples = 443.84s (8562.03%)
Active filters:
show_from=^runtime\.mallocgc$
Showing nodes accounting for 392.93s, 88.53% of 443.84s total
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
100.46s 100% | runtime.(*mcache).refill runtime/mcache.go:181
0 0% 0% 100.46s 22.63% | runtime.(*mcentral).cacheSpan runtime/mcentral.go:137
54.62s 54.37% | runtime.(*spanSet).pop runtime/mspanset.go:168
15.82s 15.75% | runtime.(*spanSet).pop runtime/mspanset.go:214
11.48s 11.43% | runtime.(*spanSet).pop runtime/mspanset.go:146
9.11s 9.07% | runtime.(*spanSet).pop runtime/mspanset.go:172
5.28s 5.26% | runtime.(*spanSet).pop runtime/mspanset.go:154
2.91s 2.90% | runtime.(*spanSet).pop runtime/mspanset.go:178
0.88s 0.88% | runtime.(*spanSet).pop runtime/mspanset.go:191
0.17s 0.17% | runtime.(*spanSet).pop runtime/mspanset.go:190
0.05s 0.05% | runtime.(*spanSet).pop runtime/mspanset.go:183
0.04s 0.04% | runtime.(*spanSet).pop runtime/mspanset.go:171
0.03s 0.03% | runtime.(*mcentral).fullUnswept runtime/mcentral.go:69 (inline)
0.02s 0.02% | runtime.(*spanSet).pop runtime/mspanset.go:141
0.01s 0.01% | runtime.(*spanSet).pop runtime/mspanset.go:167
0.01s 0.01% | runtime.(*spanSet).pop runtime/mspanset.go:200
0.01s 0.01% | runtime.(*spanSet).pop runtime/mspanset.go:216
0.01s 0.01% | runtime.(*spanSet).pop runtime/mspanset.go:219
0.01s 0.01% | runtime.(*spanSet).pop runtime/mspanset.go:221
----------------------------------------------------------+-------------