Description
Proposal Details
Currently, `mallocgc()` contains a `publicationBarrier` (see malloc.go#L1201), which is a store/store barrier on weakly ordered machines (e.g., the `DMB ST` data memory barrier store-store instruction on ARM64). This barrier is critical for correctness, as it prevents the garbage collector from seeing "uninitialized memory or stale heap bits". However, it can heavily impact performance, as any store after it must wait for the completion of all preceding stores.

One way to mitigate this cost is software prefetching, as shown in the following patch:
```diff
diff --git a/src/runtime/malloc.go b/src/runtime/malloc.go
index b24ebec27d..1d227e5ab7 100644
--- a/src/runtime/malloc.go
+++ b/src/runtime/malloc.go
@@ -1188,6 +1188,9 @@ func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
 			header = &span.largeType
 		}
 	}
+	if goarch.IsArm64 != 0 {
+		sys.Prefetch(uintptr(x))
+	}
 	if !noscan && !delayedZeroing {
 		c.scanAlloc += heapSetType(uintptr(x), dataSize, typ, header, span)
 	}
 	// Ensure that the stores above that initialize x to
 	// type-safe memory and set the heap bits occur before
 	// the caller can make x observable to the garbage
 	// collector. Otherwise, on weakly ordered machines,
 	// the garbage collector could follow a pointer to x,
 	// but see uninitialized memory or stale heap bits.
 	publicationBarrier()
```
The impacts of using `prefetch(x)` are as follows:

- `prefetch(x)` not only fetches the currently allocated object but can also speculatively fetch future allocated objects based on the access pattern. This speculative prefetching significantly mitigates the negative effect of the barrier, which is the primary benefit.
- Another benefit may come from allocated objects that do not need zeroing, in which case `mallocgc` never touches address `x`; subsequent accesses to `x` in user code may then miss in the cache. Calling `prefetch(x)` explicitly could reduce such cache misses.
- If the prefetched cache lines are never used, the penalty should be small, as the main performance issue is the barrier itself.
I tested the performance with the latest master branch (version `a9e6a96ac0`) on the following ARM64 Linux servers (other ARM64 machines were not available for testing) and an x86-64 Linux server (config: 4K page size, THP enabled):
- AmpereOne (ARM v8.6+) from Ampere Computing.
- Ampere Altra (ARM Neoverse N1) from Ampere Computing.
- Graviton2 (ARM Neoverse N1) from AWS.
- EPYC 9754 (Zen4c) from AMD.
Results show that the pkg runtime Malloc* benchmarks and Sweet bleve-index have obvious improvements. Since the barrier and prefetch are implementation-defined, the performance varies significantly (see benchmark results):

- AmpereOne is heavily affected by the barrier and shows the greatest improvement with prefetching.
- Altra and Graviton2 also show obvious improvements.
- x86-64 shows slight regressions on the Malloc* benchmarks (though no regression on bleve-index), so the conditional check `if goarch.IsArm64 != 0` is added to disable the prefetch on x86-64 and other untested architectures. This is reasonable, as `publicationBarrier()` is a no-op on architectures with strong memory models such as x86.
About the prefetch distance and location:

- Why use `prefetch(x)` instead of `prefetch(x+offset)`? We only know the address of the currently allocated object, not the address of the next one, so a proper `offset` cannot be chosen. The experiments show that `x` is a suitable argument for the prefetch, and the hardware prefetcher can then do the speculative work.
- I also experimented with different insertion points in `mallocgc`, and found that placing the prefetch just before the `heapSetType()` call gives the best performance.
Currently, there are existing `prefetch` usages in `scanobject` and `greyobject` to improve GC performance. What do you think about this change?
cc @aclements
Sweet bleve-index benchmark results
The overhead introduced by the barrier can significantly impact the performance of bleve-index, which frequently creates small objects to build a treap (a randomized binary search tree):
```
                     │ ampere-one.base │ ampere-one.new │
                     │     sec/op      │ sec/op    vs base
BleveIndexBatch100-8   7.168 ± 2%        5.928 ± 3%  -17.30%  (p=0.000 n=10)

                     │ ampere-altra.base │ ampere-altra.new │
                     │     sec/op        │ sec/op    vs base
BleveIndexBatch100-8   5.388 ± 2%          4.952 ± 1%  -8.09%  (p=0.000 n=10)

                     │ aws-graviton2.base │ aws-graviton2.new │
                     │     sec/op         │ sec/op    vs base
BleveIndexBatch100-8   5.768 ± 2%           5.368 ± 3%  -6.93%  (p=0.000 n=10)
```
BTW, other Sweet benchmarks were also tested; they showed no obvious effect from this change.
Pkg runtime Malloc* benchmark results
The Malloc* benchmarks test the performance of `mallocgc` by allocating objects of various sizes:
```
goos: linux
goarch: arm64
pkg: runtime
                    │ ampere-one.base │ ampere-one.new │
                    │     sec/op      │ sec/op    vs base
Malloc8-8             22.23n ± 0%       19.30n ± 1%  -13.16%  (p=0.000 n=30)
Malloc16-8            35.58n ± 0%       30.48n ± 1%  -14.35%  (p=0.000 n=30)
MallocTypeInfo8-8     36.78n ± 0%       34.66n ± 0%   -5.79%  (p=0.000 n=30)
MallocTypeInfo16-8    42.69n ± 0%       38.36n ± 0%  -10.13%  (p=0.000 n=30)
MallocLargeStruct-8   246.0n ± 0%       251.9n ± 0%   +2.38%  (p=0.000 n=30)
geomean               49.78n            45.60n        -8.40%

                    │ ampere-altra.base │ ampere-altra.new │
                    │     sec/op        │ sec/op    vs base
Malloc8-8             20.73n ± 0%         19.67n ± 1%  -5.14%  (p=0.000 n=30)
Malloc16-8            32.36n ± 0%         31.38n ± 0%  -3.04%  (p=0.000 n=30)
MallocTypeInfo8-8     39.06n ± 0%         38.58n ± 0%  -1.23%  (p=0.000 n=30)
MallocTypeInfo16-8    40.86n ± 0%         40.69n ± 0%  -0.42%  (p=0.000 n=30)
MallocLargeStruct-8   234.6n ± 1%         233.0n ± 1%  -0.70%  (p=0.016 n=30)
geomean               47.86n              46.85n       -2.12%

                    │ aws-graviton2.base │ aws-graviton2.new │
                    │     sec/op         │ sec/op    vs base
Malloc8-8             23.03n ± 0%          22.38n ± 0%  -2.82%  (p=0.000 n=30)
Malloc16-8            35.38n ± 0%          35.14n ± 0%  -0.66%  (p=0.000 n=30)
MallocTypeInfo8-8     45.81n ± 0%          45.65n ± 0%  -0.35%  (p=0.000 n=30)
MallocTypeInfo16-8    46.05n ± 0%          46.35n ± 0%  +0.65%  (p=0.000 n=30)
MallocLargeStruct-8   261.8n ± 0%          264.6n ± 0%  +1.05%  (p=0.000 n=30)
geomean               53.78n               53.55n       -0.44%
```