
runtime: enhance mallocgc on ARM64 via prefetch #69224

Closed as not planned
@haoliu-ampere

Description


Proposal Details

Currently, mallocgc() contains a publicationBarrier (see malloc.go#L1201), which is a store/store barrier on weakly ordered machines (e.g., the data memory barrier instruction DMB ST on ARM64). This barrier is critical for correctness: it prevents the garbage collector from seeing "uninitialized memory or stale heap bits". However, it can heavily impact performance, as any store below it must wait for all preceding stores to complete.
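The ordering requirement the barrier enforces can be illustrated with a small, hedged user-level sketch (the names node, published, and allocAndPublish are illustrative, not runtime code): the stores that initialize an object must become visible before the store that publishes the pointer. In portable Go this is what an atomic release store provides:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type node struct {
	payload int
}

// published plays the role of a heap pointer that another party
// (here, hypothetically, the GC) may follow once it is visible.
var published atomic.Pointer[node]

func allocAndPublish(v int) {
	n := &node{}
	n.payload = v // initializing store ...
	// The atomic store plays the role of publicationBarrier(): it
	// orders the initializing stores above before the pointer
	// becomes observable to other goroutines.
	published.Store(n)
}

func main() {
	allocAndPublish(42)
	if n := published.Load(); n != nil {
		fmt.Println(n.payload) // prints 42
	}
}
```

On a weakly ordered machine, omitting the ordering between the two stores is exactly what would let an observer see the pointer but stale contents.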

One way to mitigate the negative effect is through software prefetching, as shown in the following patch:

diff --git a/src/runtime/malloc.go b/src/runtime/malloc.go
index b24ebec27d..1d227e5ab7 100644
--- a/src/runtime/malloc.go
+++ b/src/runtime/malloc.go
@@ -1188,6 +1188,9 @@ func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
			header = &span.largeType
		}
	}
+	if goarch.IsArm64 != 0 {
+		sys.Prefetch(uintptr(x))
+	}
	if !noscan && !delayedZeroing {
		c.scanAlloc += heapSetType(uintptr(x), dataSize, typ, header, span)
	}

	// Ensure that the stores above that initialize x to
	// type-safe memory and set the heap bits occur before
	// the caller can make x observable to the garbage
	// collector. Otherwise, on weakly ordered machines,
	// the garbage collector could follow a pointer to x,
	// but see uninitialized memory or stale heap bits.
	publicationBarrier()

The impacts of using prefetch(x) are as follows:

  1. prefetch(x) not only fetches the currently allocated object but can also trigger speculative fetches of future allocations based on the access pattern. This speculative prefetching significantly mitigates the negative effect of the barrier, which is the primary benefit.
  2. Another benefit concerns allocated objects that do not need zeroing: mallocgc never touches address x for them, so subsequent accesses to x in user code may miss the cache. Calling prefetch(x) explicitly can reduce such cache misses.
  3. If the prefetched cache lines are never used, the penalty should be small, as the main performance issue is the barrier.
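The lookahead idea behind point 1 can be sketched portably (sys.Prefetch is a runtime-internal intrinsic and cannot be imported by user code, so this stand-in simply touches memory ahead of use; sumWithLookahead and dist are illustrative names, not anything from the patch):

```go
package main

import "fmt"

// sumWithLookahead processes data[i] while also touching data[i+dist],
// so that element's cache line is (hopefully) resident by the time the
// loop reaches it. Real software prefetching would use a dedicated
// prefetch instruction (PRFM on ARM64) instead of a plain load.
func sumWithLookahead(data []int64, dist int) int64 {
	var sum int64
	for i := range data {
		if j := i + dist; j < len(data) {
			_ = data[j] // read-only touch; value deliberately unused
		}
		sum += data[i]
	}
	return sum
}

func main() {
	data := make([]int64, 1024)
	for i := range data {
		data[i] = int64(i)
	}
	fmt.Println(sumWithLookahead(data, 8)) // prints 523776
}
```

The compiler is free to elide the unused load in this sketch; it only illustrates the access pattern, not a measurable optimization.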

I tested the performance with the latest master branch (version a9e6a96ac0) on the following ARM64 Linux servers (other ARM64 machines were not available and were not tested) and an x86-64 Linux server (config: 4K page size, THP enabled):

  1. AmpereOne (ARM v8.6+) from Ampere Computing.
  2. Ampere Altra (ARM Neoverse N1) from Ampere Computing.
  3. Graviton2 (ARM Neoverse N1) from AWS.
  4. EPYC 9754 (Zen4c) from AMD.

Results show that the pkg runtime Malloc* benchmarks and Sweet bleve-index improve clearly. Since barrier and prefetch behavior are implementation-defined, the improvement varies significantly across machines (see the benchmark results below):

  1. AmpereOne is heavily affected by the barrier and shows the largest improvement with prefetching.
  2. Altra and Graviton2 also show clear improvements.
  3. x86-64 machines show slight regressions on the Malloc* benchmarks (though no regression on bleve-index), so the conditional check if goarch.IsArm64 != 0 is added to disable the prefetch on x86-64 and other architectures that were not tested. This is reasonable, as publicationBarrier() is a no-op on strongly ordered architectures such as x86-64.

About the prefetch distance and location:

  • Why prefetch(x) instead of prefetch(x+offset)? We only know the address of the currently allocated object, not the address of the next one, so there is no proper offset to apply. The experiments show that x itself is a suitable argument for the prefetch; the hardware prefetcher then does the speculative work.
  • I also experimented with different insertion points in mallocgc and found that placing the prefetch before the heapSetType() call gives the best performance.

Currently, there are existing prefetch usages in scanobject and greyobject to improve GC performance. What do you think about this change?

cc @aclements

Sweet bleve-index benchmark results

The overhead introduced by the barrier can significantly impact the performance of bleve-index, which frequently allocates small objects to build a treap (a randomized binary search tree):

                     │ ampere-one.base │           ampere-one.new           │
                     │     sec/op      │   sec/op    vs base                │
BleveIndexBatch100-8        7.168 ± 2%   5.928 ± 3%  -17.30% (p=0.000 n=10)

                     │ ampere-altra.base │         ampere-altra.new          │
                     │      sec/op       │   sec/op    vs base               │
BleveIndexBatch100-8          5.388 ± 2%   4.952 ± 1%  -8.09% (p=0.000 n=10)

                     │ aws-graviton2.base │         aws-graviton2.new         │
                     │       sec/op       │   sec/op    vs base               │
BleveIndexBatch100-8           5.768 ± 2%   5.368 ± 3%  -6.93% (p=0.000 n=10)

BTW, other Sweet benchmarks were also tested, but they are not noticeably affected by this change.

Pkg runtime Malloc* benchmark results

The Malloc* benchmarks measure the performance of mallocgc by repeatedly allocating objects:
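For readers unfamiliar with these benchmarks, the following is a simplified reconstruction of their shape (the real ones live in the runtime package's test files; benchmarkMalloc8 and sink here are illustrative names): each iteration performs one small heap allocation, which goes through mallocgc.

```go
package main

import (
	"fmt"
	"testing"
)

// A package-level sink keeps the allocation observable, so the
// compiler cannot optimize it away or place it on the stack.
var sink *[8]byte

// benchmarkMalloc8 mirrors the shape of runtime's Malloc8 benchmark:
// one 8-byte heap allocation per iteration.
func benchmarkMalloc8(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = new([8]byte)
	}
}

func main() {
	// testing.Benchmark runs a benchmark function outside "go test".
	r := testing.Benchmark(benchmarkMalloc8)
	fmt.Println(r.N > 0, sink != nil) // prints: true true
}
```

With this shape, any per-allocation stall introduced by the publication barrier shows up directly in ns/op, which is why these benchmarks are sensitive to the prefetch change.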

goos: linux
goarch: arm64
pkg: runtime
                    │ ampere-one.base │           ampere-one.new            │
                    │     sec/op      │   sec/op     vs base                │
Malloc8-8                 22.23n ± 0%   19.30n ± 1%  -13.16% (p=0.000 n=30)
Malloc16-8                35.58n ± 0%   30.48n ± 1%  -14.35% (p=0.000 n=30)
MallocTypeInfo8-8         36.78n ± 0%   34.66n ± 0%   -5.79% (p=0.000 n=30)
MallocTypeInfo16-8        42.69n ± 0%   38.36n ± 0%  -10.13% (p=0.000 n=30)
MallocLargeStruct-8       246.0n ± 0%   251.9n ± 0%   +2.38% (p=0.000 n=30)
geomean                   49.78n        45.60n        -8.40%

                    │ ampere-altra.base │          ampere-altra.new          │
                    │      sec/op       │   sec/op     vs base               │
Malloc8-8                   20.73n ± 0%   19.67n ± 1%  -5.14% (p=0.000 n=30)
Malloc16-8                  32.36n ± 0%   31.38n ± 0%  -3.04% (p=0.000 n=30)
MallocTypeInfo8-8           39.06n ± 0%   38.58n ± 0%  -1.23% (p=0.000 n=30)
MallocTypeInfo16-8          40.86n ± 0%   40.69n ± 0%  -0.42% (p=0.000 n=30)
MallocLargeStruct-8         234.6n ± 1%   233.0n ± 1%  -0.70% (p=0.016 n=30)
geomean                     47.86n        46.85n       -2.12%

                    │ aws-graviton2.base │         aws-graviton2.new          │
                    │       sec/op       │   sec/op     vs base               │
Malloc8-8                    23.03n ± 0%   22.38n ± 0%  -2.82% (p=0.000 n=30)
Malloc16-8                   35.38n ± 0%   35.14n ± 0%  -0.66% (p=0.000 n=30)
MallocTypeInfo8-8            45.81n ± 0%   45.65n ± 0%  -0.35% (p=0.000 n=30)
MallocTypeInfo16-8           46.05n ± 0%   46.35n ± 0%  +0.65% (p=0.000 n=30)
MallocLargeStruct-8          261.8n ± 0%   264.6n ± 0%  +1.05% (p=0.000 n=30)
geomean                      53.78n        53.55n       -0.44%
