Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: contention in runtime.newMarkBits on gcBitsArenas.lock #61428

Open
rhysh opened this issue Jul 18, 2023 · 1 comment
Open

runtime: contention in runtime.newMarkBits on gcBitsArenas.lock #61428

rhysh opened this issue Jul 18, 2023 · 1 comment
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@rhysh
Copy link
Contributor

rhysh commented Jul 18, 2023

This is the third in what I think will be a series of three issue reports about contention I've seen on various runtime-internal singleton locks in an app that typically runs on dual-socket linux/amd64 machines with 96 (or sometimes 64) hardware threads.

This issue is about contention in runtime.newMarkBits on gcBitsArenas.lock.

I hope it'll be of interest to @mknyszek . Also CC @golang/runtime

Sorry about the somewhat outdated version. I haven't seen related issues go by, so I hope the info in this one is still of some use.

What version of Go are you using (go version)?

$ go version
go version go1.19.6 linux/amd64

Does this issue reproduce with the latest release?

I don't know yet if the issue is present in Go 1.20 series or in the release candidates of the Go 1.21 series.

What operating system and processor architecture are you using (go env)?

The app runs on Linux / x86_64, typically on machines with two processor sockets (NUMA). I don't have the output of go env from the app's runtime environment.

What did you do?

A data-processing app received a higher volume of data than usual. Here are some more of its dimensions:

The average allocation size is about 128 bytes. Across the whole app, it averages around 4 GiB per second of allocations, or 16 kiB of allocations every 4 µs. Split across the 96 threads the runtime uses for the app (GOMAXPROCS == num hardware threads), that's 40 MiB per second per thread, an average of 3 µs between allocations, or 16 kiB of allocations every 400 µs.

What did you expect to see?

I expected to see very little contention on runtime-internal singleton locks.

What did you see instead?

Contention on a lock in runtime.newMarkBits, and on the lock-free datastructure in runtime.(*spanSet).pop.

Although the runtime uses more than one spanSet, it appears to have only one of those for each size class. That makes them singletons in effect, especially if the app's memory usage is concentrated on a small number of size classes.

Looking at the application-level callers of runtime.mallocgc, the on-CPU time is mostly from runtime.newobject as called from seven allocation sites. From the allocs profile there are two that allocate 64-byte objects, one that allocates 80-byte objects, two of 112, one of 128, and one that allocates slices up to about 1 MiB in size (though most of the memory here is in 10 kiB allocations).

Looking at the size of memory allocations over the lifetime of the process (command below, though I suppose it's also available via runtime.ReadMemStats) confirms that the bulk of the bytes allocated are in the 64, 80, 112, and 128-byte classes.

$ go tool pprof -raw pprof/allocs | sed -n '/^Samples/,/^Locations/ p' | grep ': ' | awk '{if (int($1)>0) {print int($2/int($1)),$0}}' | column -t | sort -n -k 3 -r

Data set 6d039fa93e6d0b7bb2446028f322b7ac9a0d3b3b47a2ea0ed201779a6d81ed78 is from a instance of the app running on a 96-thread machine. The CPU profile ran for 5.18 seconds, during which it collected samples corresponding to a total 444 seconds of on-CPU time (an average of 86 on-CPU threads) and 36.90+15.37 = 52.27 seconds of time (an average of 10 on-CPU threads) in calls from runtime.newMarkBits to lock and unlock gcBitsArenas.lock.

https://github.com/golang/go/blob/go1.19.6/src/runtime/mheap.go#L2059
https://github.com/golang/go/blob/go1.19.6/src/runtime/mheap.go#L2064

It shows 100.46 seconds of time (an average of 19 on-CPU threads) spent manipulating the lock-free data structure in runtime.(*spanSet).pop. The breakdown of where that time goes includes the following:

54% of samples in CAS https://github.com/golang/go/blob/go1.19.6/src/runtime/mspanset.go#L168
16% of samples in Xadd https://github.com/golang/go/blob/go1.19.6/src/runtime/mspanset.go#L214
11% of samples here, with no atomics but within claimloop https://github.com/golang/go/blob/go1.19.6/src/runtime/mspanset.go#L146
...
0.17% of samples in this atomic load following claimloop, indicating that the easy instructions within claimloop run many more times than easy instructions after it https://github.com/golang/go/blob/go1.19.6/src/runtime/mspanset.go#L190

$ ln -s /usr/local/go/bin/go /tmp/redacted
$ (cd 6d039fa93e6d0b7bb2446028f322b7ac9a0d3b3b47a2ea0ed201779a6d81ed78 && go tool pprof -focus=futex '-show_from=^runtime\.' '-prune_from=runtime.(lock|unlock)' '-peek=runtime\.(lock|unlock)$' /tmp/redacted pprof/profile)
File: redacted
Type: cpu
Time: Jul 1, 2023 at 1:16pm (PDT)
Duration: 5.18s, Total samples = 443.84s (8562.03%)
Active filters:
   focus=futex
   show_from=^runtime\.
Showing nodes accounting for 54.11s, 12.19% of 443.84s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                            36.90s 97.16% |   runtime.newMarkBits (inline)
                                             0.30s  0.79% |   runtime.newArenaMayUnlock (inline)
                                             0.28s  0.74% |   runtime.(*spanSet).push (inline)
                                             0.25s  0.66% |   runtime.chansend (inline)
                                             0.08s  0.21% |   runtime.(*mheap).allocSpan (inline)
                                             0.06s  0.16% |   runtime.goschedImpl (inline)
                                             0.06s  0.16% |   runtime.sellock (inline)
                                             0.02s 0.053% |   runtime.findRunnable (inline)
                                             0.01s 0.026% |   runtime.entersyscall_gcwait (inline)
                                             0.01s 0.026% |   runtime.mProf_Free (inline)
                                             0.01s 0.026% |   runtime.stackcacherefill (inline)
         0     0%     0%     37.98s  8.56%                | runtime.lock
                                            37.98s   100% |   runtime.lockWithRank (inline)
----------------------------------------------------------+-------------
                                            15.37s 95.35% |   runtime.newMarkBits (inline)
                                             0.27s  1.67% |   runtime.chansend (inline)
                                             0.18s  1.12% |   runtime.(*spanSet).push (inline)
                                             0.10s  0.62% |   runtime.selunlock (inline)
                                             0.07s  0.43% |   runtime.(*mheap).allocSpan (inline)
                                             0.07s  0.43% |   runtime.newArenaMayUnlock (inline)
                                             0.03s  0.19% |   runtime.goschedImpl (inline)
                                             0.01s 0.062% |   runtime.(*mheap).freeSpan.func1 (inline)
                                             0.01s 0.062% |   runtime.addspecial (inline)
                                             0.01s 0.062% |   runtime.runOneTimer (inline)
         0     0%     0%     16.12s  3.63%                | runtime.unlock
                                            16.12s   100% |   runtime.unlockWithRank (inline)
----------------------------------------------------------+-------------
$ (cd 6d039fa93e6d0b7bb2446028f322b7ac9a0d3b3b47a2ea0ed201779a6d81ed78 && go tool pprof "-trim_path=$trim_path" '-show_from=^runtime\.mallocgc$' '-peek=cacheSpan .*:137$' -lines /tmp/redacted pprof/profile)
File: redacted
Type: cpu
Time: Jul 1, 2023 at 1:16pm (PDT)
Duration: 5.18s, Total samples = 443.84s (8562.03%)
Active filters:
   show_from=^runtime\.mallocgc$
Showing nodes accounting for 392.93s, 88.53% of 443.84s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                           100.46s   100% |   runtime.(*mcache).refill runtime/mcache.go:181
         0     0%     0%    100.46s 22.63%                | runtime.(*mcentral).cacheSpan runtime/mcentral.go:137
                                            54.62s 54.37% |   runtime.(*spanSet).pop runtime/mspanset.go:168
                                            15.82s 15.75% |   runtime.(*spanSet).pop runtime/mspanset.go:214
                                            11.48s 11.43% |   runtime.(*spanSet).pop runtime/mspanset.go:146
                                             9.11s  9.07% |   runtime.(*spanSet).pop runtime/mspanset.go:172
                                             5.28s  5.26% |   runtime.(*spanSet).pop runtime/mspanset.go:154
                                             2.91s  2.90% |   runtime.(*spanSet).pop runtime/mspanset.go:178
                                             0.88s  0.88% |   runtime.(*spanSet).pop runtime/mspanset.go:191
                                             0.17s  0.17% |   runtime.(*spanSet).pop runtime/mspanset.go:190
                                             0.05s  0.05% |   runtime.(*spanSet).pop runtime/mspanset.go:183
                                             0.04s  0.04% |   runtime.(*spanSet).pop runtime/mspanset.go:171
                                             0.03s  0.03% |   runtime.(*mcentral).fullUnswept runtime/mcentral.go:69 (inline)
                                             0.02s  0.02% |   runtime.(*spanSet).pop runtime/mspanset.go:141
                                             0.01s  0.01% |   runtime.(*spanSet).pop runtime/mspanset.go:167
                                             0.01s  0.01% |   runtime.(*spanSet).pop runtime/mspanset.go:200
                                             0.01s  0.01% |   runtime.(*spanSet).pop runtime/mspanset.go:216
                                             0.01s  0.01% |   runtime.(*spanSet).pop runtime/mspanset.go:219
                                             0.01s  0.01% |   runtime.(*spanSet).pop runtime/mspanset.go:221
----------------------------------------------------------+-------------

newMarkBits-pprof

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 18, 2023
@heschi heschi added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 19, 2023
@heschi
Copy link
Contributor

heschi commented Jul 19, 2023

cc @golang/runtime

@mknyszek mknyszek added this to the Backlog milestone Jul 25, 2023
@mknyszek mknyszek self-assigned this Jul 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
Status: Todo
Development

No branches or pull requests

5 participants