Skip to content

runtime: contention in runtime.(*mheap).freeSpan.func1 on mheap_.lock #61427

@rhysh

Description

@rhysh

This is the second in what I think will be a series of three issue reports about contention I've seen on various runtime-internal singleton locks in an app that typically runs on dual-socket linux/amd64 machines with 96 (or sometimes 64) hardware threads.

This issue is about contention in runtime.(*mheap).freeSpan.func1 on mheap_.lock.

I hope it'll be of interest to @mknyszek . Also CC @golang/runtime

Sorry about the somewhat outdated version. I haven't seen related issues go by, so I hope the info in this one is still of some use.

What version of Go are you using (go version)?

$ go version
go version go1.19.6 linux/amd64

Does this issue reproduce with the latest release?

I don't know yet if the issue is present in Go 1.20 series or in the release candidates of the Go 1.21 series.

What operating system and processor architecture are you using (go env)?

The app runs on Linux / x86_64, typically on machines with two processor sockets (NUMA). I don't have the output of go env from the app's runtime environment.

What did you do?

A data-processing app received a higher volume of data than usual. Here are some more of its dimensions:

The average allocation size is about 128 bytes. Across the whole app, it averages around 4 GiB per second of allocations, or 16 kiB of allocations every 4 µs. Split across the 96 threads the runtime uses for the app (GOMAXPROCS == num hardware threads), that's 40 MiB per second per thread, an average of 3 µs between allocations, or 16 kiB of allocations every 400 µs.

What did you expect to see?

I expected to see very little contention on runtime-internal singleton locks.

What did you see instead?

Contention in runtime.(*mheap).freeSpan.func1 makes it hard for callers of runtime.mallocgc to do their required sweep assist work. The contention is on these two lines:

https://github.com/golang/go/blob/go1.19.6/src/runtime/mheap.go#L1486
https://github.com/golang/go/blob/go1.19.6/src/runtime/mheap.go#L1500

The code in runtime.(*sweepLocked).sweep closer to the root of the call stack looks interesting. It's the same in the development tip now as it was in Go 1.15. From my read of it, when a swept span is either completely or partially full, it gets pushed onto a lock-free stack. But if the swept span is completely empty, the resulting call to mheap_.freeSpan requires taking the global mheap_.lock lock.

@mknyszek , would it be possible for mheap_.freeSpan to use a lock-free structure (maybe lfstack) too?

go/src/runtime/mgcsweep.go

Lines 751 to 768 in 5d481ab

if !preserve {
// The caller may not have removed this span from whatever
// unswept set its on but taken ownership of the span for
// sweeping by updating sweepgen. If this span still is in
// an unswept set, then the mcentral will pop it off the
// set, check its sweepgen, and ignore it.
if nalloc == 0 {
// Free totally free span directly back to the heap.
mheap_.freeSpan(s)
return true
}
// Return span back to the right mcentral list.
if uintptr(nalloc) == s.nelems {
mheap_.central[spc].mcentral.fullSwept(sweepgen).push(s)
} else {
mheap_.central[spc].mcentral.partialSwept(sweepgen).push(s)
}
}


Data set 39f2fa9279db66ec9e1b0a19b63bb1f479d827b39abfc67dbd715a4e18f7680b is from a instance of the app running on a 96-thread machine. The CPU profile ran for 5.16 seconds, during which it collected samples corresponding to a total 392 seconds of on-CPU time (an average of 76 on-CPU threads) and 98.16+73.70 = 171.86 seconds of time (an average of 33 on-CPU threads) in calls from runtime.(*mheap).freeSpan.func1 to lock and unlock mheap_.lock.

$ ln -s /usr/local/go/bin/go /tmp/redacted
$ (cd 39f2fa9279db66ec9e1b0a19b63bb1f479d827b39abfc67dbd715a4e18f7680b && go tool pprof -focus=futex '-show_from=^runtime\.' '-prune_from=runtime.(lock|unlock)' '-peek=runtime\.(lock|unlock)$' /tmp/redacted pprof/profile)                  
File: redacted
Type: cpu
Time: Jul 1, 2023 at 1:23pm (PDT)
Duration: 5.16s, Total samples = 391.71s (7591.56%)
Active filters:
   focus=futex
   show_from=^runtime\.
Showing nodes accounting for 181.79s, 46.41% of 391.71s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                            98.16s 94.27% |   runtime.(*mheap).freeSpan.func1 (inline)
                                             2.81s  2.70% |   runtime.newMarkBits (inline)
                                             1.64s  1.57% |   runtime.(*mheap).allocSpan (inline)
                                             0.66s  0.63% |   runtime.chansend (inline)
                                             0.31s   0.3% |   runtime.(*spanSet).push (inline)
                                             0.27s  0.26% |   runtime.sellock (inline)
                                             0.10s 0.096% |   runtime.findRunnable (inline)
                                             0.06s 0.058% |   runtime.goschedImpl (inline)
                                             0.04s 0.038% |   runtime.stopm (inline)
                                             0.02s 0.019% |   runtime.checkRunqsNoP (inline)
                                             0.02s 0.019% |   runtime.newArenaMayUnlock (inline)
                                             0.02s 0.019% |   runtime.runOneTimer (inline)
                                             0.01s 0.0096% |   runtime.checkTimers (inline)
                                             0.01s 0.0096% |   runtime.setprofilebucket (inline)
     0.01s 0.0026% 0.0026%    104.13s 26.58%                | runtime.lock
                                           104.12s   100% |   runtime.lockWithRank (inline)
----------------------------------------------------------+-------------
                                            73.70s 95.05% |   runtime.(*mheap).freeSpan.func1 (inline)
                                             1.57s  2.02% |   runtime.(*mheap).allocSpan (inline)
                                             1.41s  1.82% |   runtime.newMarkBits (inline)
                                             0.32s  0.41% |   runtime.chansend (inline)
                                             0.22s  0.28% |   runtime.(*spanSet).push (inline)
                                             0.11s  0.14% |   runtime.selunlock (inline)
                                             0.08s   0.1% |   runtime.findRunnable (inline)
                                             0.06s 0.077% |   runtime.goschedImpl (inline)
                                             0.02s 0.026% |   runtime.chansend.func1 (inline)
                                             0.01s 0.013% |   runtime.freeSpecial (inline)
                                             0.01s 0.013% |   runtime.mProf_Free (inline)
                                             0.01s 0.013% |   runtime.runOneTimer (inline)
                                             0.01s 0.013% |   runtime.stopm (inline)
         0     0% 0.0026%     77.54s 19.80%                | runtime.unlock
                                            77.54s   100% |   runtime.unlockWithRank (inline)
----------------------------------------------------------+-------------

freeSpan-pprof

Metadata

Metadata

Assignees

Labels

NeedsInvestigationSomeone must examine and confirm this is a valid issue and not a duplicate of an existing one.Performancecompiler/runtimeIssues related to the Go compiler and/or runtime.

Type

No type

Projects

Status

Todo

Relationships

None yet

Development

No branches or pull requests

Issue actions