
runtime: enhance mallocgc on ARM64 via prefetch #69224

Open
haoliu-ampere opened this issue Sep 3, 2024 · 13 comments
Labels
compiler/runtime · NeedsInvestigation · Performance
Milestone

Comments

@haoliu-ampere

Proposal Details

Currently, mallocgc() contains a publicationBarrier (see malloc.go#L1201), which is a store/store barrier on weakly ordered machines (e.g., a data memory barrier store-store instruction on ARM64: DMB ST). This barrier is critical for correctness, as it prevents the garbage collector from seeing "uninitialized memory or stale heap bits". However, it can heavily impact performance, because any stores issued after it must wait for all preceding stores to complete.
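For intuition, here is a minimal, self-contained sketch of the ordering that the barrier enforces (illustrative names only; allocatorPublish, globalPtr and the stubbed publicationBarrier are not the actual runtime code):

package main

// obj stands in for a freshly allocated, pointer-containing heap object.
type obj struct{ next *obj }

// globalPtr stands in for any location the garbage collector can scan.
var globalPtr *obj

// publicationBarrier is stubbed here; in the runtime it is a store/store
// barrier on weakly ordered machines (DMB ST on ARM64) and a no-op on x86.
func publicationBarrier() {}

func allocatorPublish() {
	p := &obj{}          // initializing stores to the new object's memory
	p.next = nil         // (zeroing and heap/type bits in the real code)
	publicationBarrier() // the stores above must become visible first ...
	globalPtr = p        // ... before p becomes observable to the GC
}

func main() { allocatorPublish() }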

One way to mitigate the negative effect is through software prefetching, as shown in the following patch:

diff --git a/src/runtime/malloc.go b/src/runtime/malloc.go
index b24ebec27d..1d227e5ab7 100644
--- a/src/runtime/malloc.go
+++ b/src/runtime/malloc.go
@@ -1188,6 +1188,9 @@ func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
			header = &span.largeType
		}
	}
+	if goarch.IsArm64 != 0 {
+		sys.Prefetch(uintptr(x))
+	}
	if !noscan && !delayedZeroing {
		c.scanAlloc += heapSetType(uintptr(x), dataSize, typ, header, span)
	}

	// Ensure that the stores above that initialize x to
	// type-safe memory and set the heap bits occur before
	// the caller can make x observable to the garbage
	// collector. Otherwise, on weakly ordered machines,
	// the garbage collector could follow a pointer to x,
	// but see uninitialized memory or stale heap bits.
	publicationBarrier()

The impacts of using prefetch(x) are as follows:

  1. prefetch(x) not only fetches the currently allocated object but can also trigger speculative fetches of future allocations based on the access pattern. This speculative prefetching significantly mitigates the negative effect of the barrier, which is the primary benefit.
  2. Another benefit comes from allocated objects that do not need zeroing: mallocgc never touches address x, so subsequent accesses to x in user code may miss the cache. Calling prefetch(x) explicitly can reduce such cache misses.
  3. If the prefetched cache lines are never used, the penalty should be small, as the main performance issue is the barrier.

I tested the performance with the latest master branch (version a9e6a96ac0) on the following ARM64 Linux servers (other ARM64 machines were not available to test) and an x86-64 Linux server (config: 4K page size and THP enabled):

  1. AmpereOne (ARM v8.6+) from Ampere Computing.
  2. Ampere Altra (ARM Neoverse N1) from Ampere Computing.
  3. Graviton2 (ARM Neoverse N1) from AWS.
  4. EPYC 9754 (Zen4c) from AMD.

Results show that the pkg runtime Malloc* benchmarks and Sweet bleve-index improve noticeably. Since barrier and prefetch behavior are implementation defined, the performance varies significantly across machines (see benchmark results):

  1. AmpereOne is heavily affected by the barrier and shows the greatest improvement with prefetching.
  2. Altra and Graviton2 also show clear improvements.
  3. x86 machines show slight regressions on the Malloc* benchmarks (though no regression on bleve-index), so the conditional check if goarch.IsArm64 != 0 is added to disable the prefetch on x86-64 and other architectures that were not tested. This is reasonable, as publicationBarrier() is a no-op on strongly ordered architectures like x86.

About the prefetch distance and location:

  • Why use prefetch(x) instead of prefetch(x+offset)? We only know the address of the currently allocated object, not the address of the next one, so there is no obviously correct offset to apply. Experiments show that x is a suitable argument for the prefetch; the hardware prefetcher is left to do the speculative work.
  • I also experimented with different locations in mallocgc for inserting the prefetch, and found that placing it before heapSetType() gives the best performance.

Currently, there are existing prefetch usages in scanobject and greyobject to improve GC performance. What do you think about this change?

cc @aclements

Sweet bleve-index benchmark results

The overhead introduced by the barrier can significantly impact the performance of bleve-index, which frequently creates new small objects to build a treap (a binary search tree):

                     │ ampere-one.base │           ampere-one.new           │
                     │     sec/op      │   sec/op    vs base                │
BleveIndexBatch100-8        7.168 ± 2%   5.928 ± 3%  -17.30% (p=0.000 n=10)

                     │ ampere-altra.base │         ampere-altra.new          │
                     │      sec/op       │   sec/op    vs base               │
BleveIndexBatch100-8          5.388 ± 2%   4.952 ± 1%  -8.09% (p=0.000 n=10)

                     │ aws-graviton2.base │         aws-graviton2.new         │
                     │       sec/op       │   sec/op    vs base               │
BleveIndexBatch100-8           5.768 ± 2%   5.368 ± 3%  -6.93% (p=0.000 n=10)

BTW, other Sweet benchmarks were also tested, but they are not obviously affected by this change.

Pkg runtime Malloc* benchmark results

Malloc* benchmarks test the performance of mallocgc by allocating objects:

goos: linux
goarch: arm64
pkg: runtime
                    │ ampere-one.base │           ampere-one.new            │
                    │     sec/op      │   sec/op     vs base                │
Malloc8-8                 22.23n ± 0%   19.30n ± 1%  -13.16% (p=0.000 n=30)
Malloc16-8                35.58n ± 0%   30.48n ± 1%  -14.35% (p=0.000 n=30)
MallocTypeInfo8-8         36.78n ± 0%   34.66n ± 0%   -5.79% (p=0.000 n=30)
MallocTypeInfo16-8        42.69n ± 0%   38.36n ± 0%  -10.13% (p=0.000 n=30)
MallocLargeStruct-8       246.0n ± 0%   251.9n ± 0%   +2.38% (p=0.000 n=30)
geomean                   49.78n        45.60n        -8.40%

                    │ ampere-altra.base │          ampere-altra.new          │
                    │      sec/op       │   sec/op     vs base               │
Malloc8-8                   20.73n ± 0%   19.67n ± 1%  -5.14% (p=0.000 n=30)
Malloc16-8                  32.36n ± 0%   31.38n ± 0%  -3.04% (p=0.000 n=30)
MallocTypeInfo8-8           39.06n ± 0%   38.58n ± 0%  -1.23% (p=0.000 n=30)
MallocTypeInfo16-8          40.86n ± 0%   40.69n ± 0%  -0.42% (p=0.000 n=30)
MallocLargeStruct-8         234.6n ± 1%   233.0n ± 1%  -0.70% (p=0.016 n=30)
geomean                     47.86n        46.85n       -2.12%

                    │ aws-graviton2.base │         aws-graviton2.new          │
                    │       sec/op       │   sec/op     vs base               │
Malloc8-8                    23.03n ± 0%   22.38n ± 0%  -2.82% (p=0.000 n=30)
Malloc16-8                   35.38n ± 0%   35.14n ± 0%  -0.66% (p=0.000 n=30)
MallocTypeInfo8-8            45.81n ± 0%   45.65n ± 0%  -0.35% (p=0.000 n=30)
MallocTypeInfo16-8           46.05n ± 0%   46.35n ± 0%  +0.65% (p=0.000 n=30)
MallocLargeStruct-8          261.8n ± 0%   264.6n ± 0%  +1.05% (p=0.000 n=30)
geomean                      53.78n        53.55n       -0.44%
@gopherbot gopherbot added this to the Proposal milestone Sep 3, 2024
@randall77
Contributor

This does not need to be a proposal, as there's no new API here. Taking out of the proposal process.

@randall77 randall77 modified the milestones: Proposal, Unplanned Sep 3, 2024
@randall77
Contributor

I find this odd. Why would a prefetch, which is intended to make reads faster, make barriers faster?
We just cleared the memory (in most cases), so it should already be writeable in L1 cache, so the prefetch should be a no-op.

That said, benchmarks seem to show an effect. I'm not against this, but I would really like to understand what is going on first.

@haoliu-ampere
Author

We just cleared the memory (in most cases), so it should already be writeable in L1 cache, so the prefetch should be a no-op.

Yes. That's odd at first sight.

I find this odd. Why would a prefetch, which is intended to make reads faster, make barriers faster?

The prefetch(x) here is a hint to the hardware prefetcher: it not only fetches the cache line of x (i.e., for the current mallocgc), but can also speculatively fetch the cache line of the next x (i.e., for the next or a following mallocgc), which may not be in the cache yet.

When the next mallocgc tries to clear the memory of the next x, the store-store barrier blocks until the zeroing stores complete. The barrier is expensive because it has to wait for: 1) the cache line of the next x to be ready in L1; 2) the stores to complete.

E.g., if x is 0x00 and the next x is 0x40 (assuming a 64-byte cache line), the two objects are on different cache lines. When allocating x in the current mallocgc, prefetch(x) can also speculatively pull the next x into the L1 cache.
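As a reference point, this pattern can be exercised with a tiny allocation microbenchmark in the spirit of the runtime Malloc16 benchmark (a standalone sketch, not the exact benchmark behind the numbers above):

package prefetch_test

import "testing"

// obj16 is a 16-byte object, so four consecutive allocations from the same
// span share one 64-byte cache line and the fifth starts a new line.
type obj16 struct{ a, b int64 }

var sink *obj16

func BenchmarkMalloc16Sequential(b *testing.B) {
	for i := 0; i < b.N; i++ {
		// Each iteration calls mallocgc; every few iterations the zeroing
		// stores (and the publication barrier waiting on them) touch a new
		// cache line, which is where a prefetch issued by an earlier call
		// can help on weakly ordered machines.
		sink = &obj16{}
	}
}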

@haoliu-ampere haoliu-ampere changed the title proposal: runtime: enhance mallocgc on ARM64 via prefetch runtime: enhance mallocgc on ARM64 via prefetch Sep 4, 2024
@randall77
Contributor

So this isn't really intended to modify the publication barrier behavior, it is to help the next call to mallocgc (which may include its publication barrier, I guess).
If that's the case, then I think prefetch(x+size) at the very end of mallocgc would be just as good?

I'm surprised the hardware sequential access prefetcher wouldn't handle this case.

Maybe we should distinguish read prefetch and write prefetch in our codebase. It would help with reading the code, and I think on arm we could use different instructions (different options PST instead of PLD, anyway).
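To illustrate the distinction at the call site, the allocator-side usage could look like the fragment below, in the style of the patch above; sys.Prefetch is the existing read-intent hint (it lowers to PRFM PLDL1KEEP on arm64), while sys.PrefetchWrite is a hypothetical write-intent intrinsic (PRFM PSTL1KEEP) that does not exist in the tree today:

	// Hypothetical sketch: sys.PrefetchWrite does not exist; it stands in for
	// a write-intent prefetch. The allocator is about to store to x (zeroing,
	// heap bits), so a write hint would match the actual access pattern
	// better than the read hint, and would document the intent in the code.
	if goarch.IsArm64 != 0 {
		sys.PrefetchWrite(uintptr(x))
	}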

@dmitshur dmitshur added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. compiler/runtime Issues related to the Go compiler and/or runtime. labels Sep 4, 2024
@wingrez
Contributor

wingrez commented Sep 5, 2024

On my arm64 machine (Kunpeng 920 processor), unfortunately, it didn't work very well.

name                 old time/op  new time/op  delta
Malloc8-4            14.2ns ± 0%  14.5ns ± 0%  +1.76%  (p=0.008 n=5+5)
Malloc16-4           23.7ns ± 0%  23.9ns ± 0%  +0.83%  (p=0.008 n=5+5)
MallocTypeInfo8-4    30.2ns ± 0%  30.6ns ± 0%  +1.08%  (p=0.008 n=5+5)
MallocTypeInfo16-4   31.7ns ± 0%  31.7ns ± 0%    ~     (p=0.532 n=5+5)
MallocLargeStruct-4   247ns ± 1%   247ns ± 1%    ~     (p=0.730 n=5+5)

Based on commit fc9f02c.

@haoliu-ampere
Author

On my arm64 machine (Kunpeng 920 processor), unfortunately, it didn't work very well.

name                 old time/op  new time/op  delta
Malloc8-4            14.2ns ± 0%  14.5ns ± 0%  +1.76%  (p=0.008 n=5+5)
Malloc16-4           23.7ns ± 0%  23.9ns ± 0%  +0.83%  (p=0.008 n=5+5)
MallocTypeInfo8-4    30.2ns ± 0%  30.6ns ± 0%  +1.08%  (p=0.008 n=5+5)
MallocTypeInfo16-4   31.7ns ± 0%  31.7ns ± 0%    ~     (p=0.532 n=5+5)
MallocLargeStruct-4   247ns ± 1%   247ns ± 1%    ~     (p=0.730 n=5+5)

Based on commit fc9f02c.

Hi @wingrez,

Could you also help test Sweet bleve-index?

@haoliu-ampere
Author

I'm surprised the hardware sequential access prefetcher wouldn't handle this case.

Normally, the hardware sequential-access prefetcher handles the load case better. This is a pure store case, which it may not handle as well as the load case.

Maybe we should distinguish read prefetch and write prefetch in our codebase. It would help with reading the code, and I think on arm we could use different instructions (different options PST instead of PLD, anyway).

Currently, it generates prfm pldl1keep, [x10]. I tried prfm pstl1keep and prfm pldl1strm, which perform similarly. Anyway, I think distinguishing read prefetch and write prefetch is a good idea.

So this isn't really intended to modify the publication barrier behavior, it is to help the next call to mallocgc (which may include its publication barrier, I guess).
If that's the case, then I think prefetch(x+size) at the very end of mallocgc would be just as good?

I ran more experiments: it seems x+size helps a bit, while putting the prefetch at the very end of mallocgc is worse. Let's call the previous patch v1. I tried the following versions:

v2

which adds size:

@@ -1188,6 +1188,9 @@ func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
			header = &span.largeType
		}
	}
+	if goarch.IsArm64 != 0 {
+		sys.Prefetch(uintptr(unsafe.Add(x, size)))
+	}
	if !noscan && !delayedZeroing {
		c.scanAlloc += heapSetType(uintptr(x), dataSize, typ, header, span)
	}

	// Ensure that the stores above that initialize x to
	// type-safe memory and set the heap bits occur before
	// the caller can make x observable to the garbage
	// collector. Otherwise, on weakly ordered machines,
	// the garbage collector could follow a pointer to x,
	// but see uninitialized memory or stale heap bits.
	publicationBarrier()

it is a bit better than v1 for bleve-index:

                     │ ampere-one.base │         ampere-one.v2.new          │
                     │     sec/op      │   sec/op    vs base                │
BleveIndexBatch100-8        7.168 ± 2%   5.790 ± 4%  -19.23% (p=0.000 n=10)

                     │ ampere-altra.base │        ampere-altra.v2.new        │
                     │      sec/op       │   sec/op    vs base               │
BleveIndexBatch100-8          5.388 ± 2%   4.893 ± 1%  -9.18% (p=0.000 n=10)

                     │ aws-graviton2.base │       aws-graviton2.v2.new        │
                     │       sec/op       │   sec/op    vs base               │
BleveIndexBatch100-8           5.768 ± 2%   5.350 ± 1%  -7.26% (p=0.000 n=10)

                    │ ampere-one.base │          ampere-one.v2.new          │
                    │     sec/op      │   sec/op     vs base                │
Malloc8-8                 22.23n ± 0%   19.19n ± 1%  -13.70% (p=0.000 n=30)
Malloc16-8                35.58n ± 0%   30.19n ± 1%  -15.15% (p=0.000 n=30)
MallocTypeInfo8-8         36.78n ± 0%   34.75n ± 0%   -5.53% (p=0.000 n=30)
MallocTypeInfo16-8        42.69n ± 0%   38.42n ± 1%   -9.99% (p=0.000 n=30)
MallocLargeStruct-8       246.0n ± 0%   252.5n ± 0%   +2.62% (p=0.000 n=30)
geomean                   49.78n        45.51n        -8.57%

                    │ ampere-altra.base │        ampere-altra.v2.new         │
                    │      sec/op       │   sec/op     vs base               │
Malloc8-8                   20.73n ± 0%   19.90n ± 1%  -3.98% (p=0.000 n=30)
Malloc16-8                  32.36n ± 0%   31.69n ± 1%  -2.09% (p=0.000 n=30)
MallocTypeInfo8-8           39.06n ± 0%   38.95n ± 0%  -0.29% (p=0.000 n=30)
MallocTypeInfo16-8          40.86n ± 0%   41.23n ± 0%  +0.91% (p=0.000 n=30)
MallocLargeStruct-8         234.6n ± 1%   235.0n ± 1%       ~ (p=0.228 n=30)
geomean                     47.86n        47.35n       -1.07%

                    │ aws-graviton2.base │        aws-graviton2.v2.new        │
                    │       sec/op       │   sec/op     vs base               │
Malloc8-8                    23.03n ± 0%   22.68n ± 0%  -1.50% (p=0.000 n=30)
Malloc16-8                   35.38n ± 0%   35.63n ± 1%  +0.72% (p=0.000 n=30)
MallocTypeInfo8-8            45.81n ± 0%   45.96n ± 0%  +0.34% (p=0.000 n=30)
MallocTypeInfo16-8           46.05n ± 0%   47.16n ± 0%  +2.40% (p=0.000 n=30)
MallocLargeStruct-8          261.8n ± 0%   264.6n ± 0%  +1.07% (p=0.000 n=30)
geomean                      53.78n        54.11n       +0.60%

v3

which just puts prefetch(x) at the end of mallocgc:

@@ -1321,6 +1321,10 @@ func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
		x = add(x, size-dataSize)
	}
 
+	if goarch.IsArm64 != 0 {
+		sys.Prefetch(uintptr(x))
+	}
	return x

bleve-index is worse than v1 (the Malloc* benchmarks are similarly worse):

                     │ ampere-one.base │         ampere-one.v3.new         │
                     │     sec/op      │   sec/op    vs base               │
BleveIndexBatch100-8        7.168 ± 2%   6.462 ± 1%  -9.85% (p=0.000 n=10)

                     │ ampere-altra.base │        ampere-altra.v3.new        │
                     │      sec/op       │   sec/op    vs base               │
BleveIndexBatch100-8          5.388 ± 2%   5.269 ± 3%  -2.20% (p=0.015 n=10)

                     │ aws-graviton2.base │       aws-graviton2.v3.new        │
                     │       sec/op       │   sec/op    vs base               │
BleveIndexBatch100-8           5.768 ± 2%   5.674 ± 1%  -1.64% (p=0.002 n=10)

v4

which is prefetch(x+size) at the end of mallocgc:

@@ -1321,6 +1321,10 @@ func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
		x = add(x, size-dataSize)
	}
 
+	if goarch.IsArm64 != 0 {
+		sys.Prefetch(uintptr(unsafe.Add(x, size)))
+	}
	return x

it is better than v3, but worse than v1:

                     │ ampere-one.base │         ampere-one.v4.new          │
                     │     sec/op      │   sec/op    vs base                │
BleveIndexBatch100-8        7.168 ± 2%   6.312 ± 3%  -11.94% (p=0.000 n=10)

                     │ ampere-altra.base │        ampere-altra.v4.new        │
                     │      sec/op       │   sec/op    vs base               │
BleveIndexBatch100-8          5.388 ± 2%   5.109 ± 2%  -5.18% (p=0.000 n=10)

                     │ aws-graviton2.base │       aws-graviton2.v4.new        │
                     │       sec/op       │   sec/op    vs base               │
BleveIndexBatch100-8           5.768 ± 2%   5.440 ± 2%  -5.68% (p=0.000 n=10)

@wingrez
Contributor

wingrez commented Sep 9, 2024

On my arm64 machine (Kunpeng 920 processor), unfortunately, it didn't work very well.

name                 old time/op  new time/op  delta
Malloc8-4            14.2ns ± 0%  14.5ns ± 0%  +1.76%  (p=0.008 n=5+5)
Malloc16-4           23.7ns ± 0%  23.9ns ± 0%  +0.83%  (p=0.008 n=5+5)
MallocTypeInfo8-4    30.2ns ± 0%  30.6ns ± 0%  +1.08%  (p=0.008 n=5+5)
MallocTypeInfo16-4   31.7ns ± 0%  31.7ns ± 0%    ~     (p=0.532 n=5+5)
MallocLargeStruct-4   247ns ± 1%   247ns ± 1%    ~     (p=0.730 n=5+5)

Based on commit fc9f02c.

Hi @wingrez,

Could you also help test Sweet bleve-index?

name                  old time/op            new time/op            delta
BleveIndexBatch100-4             6.45s ± 2%             6.62s ± 2%  +2.58%  (p=0.001 n=10+10)

name                  old average-RSS-bytes  new average-RSS-bytes  delta
BleveIndexBatch100-4             195MB ± 1%             196MB ± 1%  +0.59%  (p=0.022 n=9+10)

name                  old peak-RSS-bytes     new peak-RSS-bytes     delta
BleveIndexBatch100-4             282MB ± 2%             284MB ± 3%    ~     (p=0.143 n=10+10)

name                  old peak-VM-bytes      new peak-VM-bytes      delta
BleveIndexBatch100-4            2.03GB ± 6%            2.06GB ± 4%    ~     (p=0.147 n=10+10)

@randall77
Contributor

I see no performance changes for any of these options on an M2 Ultra.

The prefetch(x) here is a hint to the hardware prefetcher: it not only fetches the cache line of x (i.e., for the current mallocgc), but can also speculatively fetch the cache line of the next x (i.e., for the next or a following mallocgc), which may not be in the cache yet.

Do you have a reference for this? I'd like to understand why this helps on some chips and not others. It would be ideal if we could determine (at binary startup time) whether we're on a chip on which this would help. I'm hesitant to prefetch always, as prefetches can consume memory bus bandwidth. Memory bus bandwidth in real programs is often a limited resource, but that effect seldom shows up in microbenchmarks.
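For what it's worth, on Linux/arm64 user space can identify the chip from MIDR_EL1, which the kernel exposes through sysfs; the standalone sketch below only illustrates the idea (the sysfs path and the decoding are my assumptions about the kernel interface, and the runtime itself would need a different, file-free mechanism during startup):

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// midr reads MIDR_EL1 for cpu0 via sysfs on Linux/arm64. Bits [31:24] hold
// the implementer ID (e.g. 0x41 is Arm Ltd.) and bits [15:4] the part number.
func midr() (uint64, error) {
	b, err := os.ReadFile("/sys/devices/system/cpu/cpu0/regs/identification/midr_el1")
	if err != nil {
		return 0, err
	}
	s := strings.TrimPrefix(strings.TrimSpace(string(b)), "0x")
	return strconv.ParseUint(s, 16, 64)
}

func main() {
	v, err := midr()
	if err != nil {
		fmt.Println("midr:", err)
		return
	}
	fmt.Printf("MIDR_EL1=%#x implementer=%#x part=%#x\n", v, (v>>24)&0xff, (v>>4)&0xfff)
}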

@randall77
Contributor

I will note https://go-review.googlesource.com/c/go/+/572398 which also shows some benefits of prefetching on arm64, albeit in a kind of strange configuration. The machine used in that CL's benchmarking is I think the same as @wingrez used here. So, ... I have no idea what is going on.

@haoliu-ampere
Author

haoliu-ampere commented Sep 10, 2024

Thanks for updating the results on Kunpeng and M2.

Do you have a reference for this? I'd like to understand why this helps on some chips and not others.

No; the implementation of the barrier and of prefetching depends on the micro-architecture. As the prefetching works speculatively, different benchmarks also vary a lot. My main concern is that this may not help other ARM64 chips, or may even cause regressions (unfortunately, that is the case for Kunpeng).

It would be ideal if we could determine (at binary startup time) whether we're on a chip on which this would help.

Yes, that would be helpful. Does Go have any plan to implement such a feature? Or does Go plan to support compile options like GCC's -mcpu=neoverse-n1/-mcpu=ampere1, so that the binary is optimized for a specific CPU?

I searched the existing code and could only find the goarch.IsArm64 option, which is too broad, as it affects all ARM64 CPUs.

I'm hesitant to prefetch always, as prefetches can consume memory bus bandwidth. Memory bus bandwidth in real programs is often a limited resource, but that effect seldom shows up in microbenchmarks.

Yes, that's true. Could we add a new variable (e.g., aggressive-prefetch) to GODEBUG? It would introduce some extra overhead to check whether the variable is set.

I'm a newbie to Go and still learning a lot. To my understanding, Go cares more about portability and compilation speed than raw performance. If the binary were compiled for a specific CPU, that could hurt portability.

@randall77
Contributor

Does Go have any plan to implement such a feature? Or does Go plan to support compile options like GCC's -mcpu=neoverse-n1/-mcpu=ampere1, so that the binary is optimized for a specific CPU?

I know of no plans. We added GOARM64 recently (#60905) and I don't think we're going to go any farther than that.

If we were to do something chip-dependent here, we would have to detect the chip at binary startup.

Yes, that's true. Could we add a new variable (e.g., aggressive-prefetch) to GODEBUG? It would introduce some extra overhead to check whether the variable is set.

I don't think we'd want to add a GODEBUG for this, except possibly for experimentation. That environment variable is not intended for permanent performance tweaks.

The overhead is not a huge deal: we check the GODEBUG (or do some chip detection) once at startup, and from then on it is a load/compare/highly predictable branch to decide whether to do the prefetch. If the prefetch is sure to be worth it, that overhead is pretty small.
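A minimal sketch of that pattern (prefetchOnAlloc, initPrefetchPolicy and chipBenefitsFromPrefetch are illustrative names, not existing runtime identifiers):

package runtimesketch

// prefetchOnAlloc is decided once at startup and only read afterwards, so
// checking it in the allocation path costs a load plus a highly predictable
// branch.
var prefetchOnAlloc bool

// chipBenefitsFromPrefetch is a placeholder for GODEBUG parsing or chip
// detection; it would run once during runtime initialization.
func chipBenefitsFromPrefetch() bool { return false }

func initPrefetchPolicy() {
	prefetchOnAlloc = chipBenefitsFromPrefetch()
}

// In mallocgc, the guarded prefetch would then look like:
//
//	if prefetchOnAlloc {
//		sys.Prefetch(uintptr(x))
//	}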

@haoliu-ampere
Author

Hi Keith,

Thanks for your explanations.

Then I think it's not reasonable to continue with this patch. I'll close it as not planned.
