runtime: enhance mallocgc on ARM64 via prefetch #69224
Comments
This does not need to be a proposal, as there's no new API here. Taking it out of the proposal process.

I find this odd. Why does a prefetch, which is intended to make reads faster, make barriers faster? That said, benchmarks seem to show an effect. I'm not against this, but I would really like to understand what is going on first.
Yes. That's odd at first sight.
The `publicationBarrier` makes later stores wait until all preceding stores complete. When the next `mallocgc`'s stores land in cache lines that the prefetch has already pulled in, those stores finish sooner, so the barrier has less outstanding work to drain. E.g., if the line containing the next object is already in cache, initializing it and passing the `DMB ST` is much cheaper.
So this isn't really intended to modify the publication barrier behavior; it is to help the next call to mallocgc (which may include its publication barrier, I guess). I'm surprised the hardware sequential access prefetcher wouldn't handle this case. Maybe we should distinguish read prefetch and write prefetch in our codebase. It would help with reading the code, and I think on arm we could use different instructions (different options to the `PRFM` instruction).
On my arm64 machine (Kunpeng 920 processor), unfortunately, it didn't work very well.
Based on commit fc9f02c.
Hi @wingrez, could you also help to test sweet bleve-index?
Normally, the hardware sequential-access prefetcher would handle the load case better. This situation is a pure store case, which may not work as well as the load case.
Currently, `sys.Prefetch` generates `PRFM PLDL1KEEP`, i.e., a read prefetch.
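To make the read/write distinction concrete: on arm64 the two flavors are different `PRFM` operations. Here is a sketch in Go assembly (illustrative stubs with made-up names; the runtime's `sys.Prefetch` is actually a compiler intrinsic, not an assembly function):

```asm
#include "textflag.h"

// func prefetchRead(addr uintptr)
TEXT ·prefetchRead(SB), NOSPLIT, $0-8
	MOVD	addr+0(FP), R0
	PRFM	(R0), PLDL1KEEP	// prefetch for load, into L1, temporal
	RET

// func prefetchWrite(addr uintptr)
TEXT ·prefetchWrite(SB), NOSPLIT, $0-8
	MOVD	addr+0(FP), R0
	PRFM	(R0), PSTL1KEEP	// prefetch for store, into L1, temporal
	RET
```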
I tried more experiments, and it seems the exact position of the prefetch matters a lot.

**v2**, which adds `sys.Prefetch(uintptr(unsafe.Add(x, size)))` before `heapSetType`:

```diff
@@ -1188,6 +1188,9 @@ func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
 			header = &span.largeType
 		}
 	}
+	if goarch.IsArm64 != 0 {
+		sys.Prefetch(uintptr(unsafe.Add(x, size)))
+	}
 	if !noscan && !delayedZeroing {
 		c.scanAlloc += heapSetType(uintptr(x), dataSize, typ, header, span)
 	}
 	// Ensure that the stores above that initialize x to
 	// type-safe memory and set the heap bits occur before
 	// the caller can make x observable to the garbage
 	// collector. Otherwise, on weakly ordered machines,
 	// the garbage collector could follow a pointer to x,
 	// but see uninitialized memory or stale heap bits.
 	publicationBarrier()
```

It is a bit better than v1 for bleve-index:
**v3**, which just puts `sys.Prefetch(uintptr(x))` right before the return:

```diff
@@ -1321,6 +1321,10 @@ func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
 		x = add(x, size-dataSize)
 	}
+	if goarch.IsArm64 != 0 {
+		sys.Prefetch(uintptr(x))
+	}
 	return x
```

bleve-index is worse than v1 (Malloc benchmarks are also similarly worse):
**v4**, which is `sys.Prefetch(uintptr(unsafe.Add(x, size)))` right before the return:

```diff
@@ -1321,6 +1321,10 @@ func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
 		x = add(x, size-dataSize)
 	}
+	if goarch.IsArm64 != 0 {
+		sys.Prefetch(uintptr(unsafe.Add(x, size)))
+	}
 	return x
```

It is better than v3, but worse than v1:
I see no performance changes for any of these options on an M2 Ultra.
Do you have a reference for this? I'd like to understand why this helps on some chips and not others. It would be ideal if we could determine (at binary startup time) whether we're on a chip on which this would help. I'm hesitant to prefetch always, as prefetches can consume memory bus bandwidth. Memory bus bandwidth in real programs is often a limited resource, but that effect seldom shows up in microbenchmarks.
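For what it's worth, here is a minimal standalone sketch of startup chip identification on Linux/arm64, reading the implementer/part fields from /proc/cpuinfo (the Kunpeng 920 mapping is from ARM MIDR conventions and should be double-checked; this is not runtime code):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// detectARM64Chip returns the "CPU implementer" and "CPU part" fields
// from /proc/cpuinfo. For example, implementer 0x48 with part 0xd01
// is HiSilicon's TaiShan v110 core, used by Kunpeng 920.
func detectARM64Chip() (implementer, part string) {
	f, err := os.Open("/proc/cpuinfo")
	if err != nil {
		return "", ""
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		k, v, ok := strings.Cut(sc.Text(), ":")
		if !ok {
			continue
		}
		switch strings.TrimSpace(k) {
		case "CPU implementer":
			implementer = strings.TrimSpace(v)
		case "CPU part":
			part = strings.TrimSpace(v)
		}
	}
	return implementer, part
}

func main() {
	impl, part := detectARM64Chip()
	fmt.Printf("implementer=%s part=%s\n", impl, part)
}
```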
I will note https://go-review.googlesource.com/c/go/+/572398 which also shows some benefits of prefetching on arm64, albeit in a kind of strange configuration. The machine used in that CL's benchmarking is I think the same as @wingrez used here. So, ... I have no idea what is going on.
Thanks for updating the results on Kunpeng and M2.
No, the implementation of the barrier and prefetching depends on different micro-architectures. As the prefetching works speculatively, different benchmarks also vary a lot. My main concern is that this may not help other ARM64 chips, or may even cause regressions (unfortunately, that is true for Kunpeng).
Yes, that would be helpful. Does Go have any plan to implement such a feature? Or does Go have any plan to support compile options like GCC's `-mcpu`? I tried to search the existing code and could only find the `GOARM64` environment variable.
Yes, that's true. Could we add a new variable to control this? I'm a newbie to Go and still learning a lot of things. To my understanding, Go cares more about portability and compilation speed than peak performance. If the binary is compiled for a specific CPU, it may affect portability.
I know of no plans. We added `GOARM64` and `GOAMD64`, but those select a minimum architecture level at build time, not a specific chip. If we were to do something chip-dependent here, we would have to detect the chip at binary startup.
I don't think we'd want to add a new knob for this. The overhead is not a huge deal; we already check CPU features once at startup, and a chip check could be handled the same way.
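Here is a sketch of the kind of gating being discussed (all names and chip IDs are hypothetical placeholders; nothing like this exists in the runtime today):

```go
package main

// allocPrefetchEnabled would be set once at startup, before any
// allocation-heavy work, so the hot path only pays for one branch.
var allocPrefetchEnabled bool

// initAllocPrefetch enables the prefetch only on chips where benchmarks
// showed a win. The implementer/part IDs here are placeholders, not a
// vetted allowlist.
func initAllocPrefetch(implementer, part string) {
	allocPrefetchEnabled = implementer == "0x41" && part == "0xd0c"
}

// maybePrefetch shows the hot-path shape: a single predictable branch
// guarding the prefetch, cheap relative to the cost of an allocation.
func maybePrefetch(addr uintptr) {
	if allocPrefetchEnabled {
		// inside the runtime this would be sys.Prefetch(addr)
		_ = addr
	}
}

func main() {
	initAllocPrefetch("0x41", "0xd0c")
	maybePrefetch(0)
}
```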
Hi Keith, thanks for your explanations. Then I think it's not reasonable to continue this patch. I'll just close it as not planned.
Proposal Details
Currently, `mallocgc()` contains a `publicationBarrier` (see malloc.go#L1201), which is a store/store barrier on weakly ordered machines (e.g., a data memory barrier store-store instruction on ARM64: `DMB ST`). This barrier is critical for correctness, as it prevents the garbage collector from seeing "uninitialized memory or stale heap bits". However, it may heavily impact performance, as any store operations below it must wait for the completion of all preceding stores.

One way to mitigate the negative effect is through software prefetching, as shown in the following patch:
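The patch body itself was lost in extraction; based on the v3 diff quoted in the comments above (which uses `sys.Prefetch(uintptr(x))`) and the note below that the best position is before `heapSetType()`, v1 presumably looked like this (a reconstruction, not the verbatim patch):

```go
// Reconstructed sketch of v1, inside mallocgc in runtime/malloc.go:
// prefetch the newly allocated object at x just before its heap bits
// are written and the publication barrier runs.
if goarch.IsArm64 != 0 {
	sys.Prefetch(uintptr(x))
}
if !noscan && !delayedZeroing {
	c.scanAlloc += heapSetType(uintptr(x), dataSize, typ, header, span)
}
```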
The impacts of using `prefetch(x)` are as follows:

- `prefetch(x)` not only fetches the currently allocated object but can also speculatively fetch future allocated objects based on the access pattern. This speculative prefetching significantly mitigates the negative effects of the barrier, which is the primary benefit.
- If `mallocgc` won't touch address `x`, there may be cache misses when user code subsequently accesses `x`. Calling `prefetch(x)` explicitly could reduce such cache misses.

I tested the performance with the latest master branch (version `a9e6a96ac0`) on the following ARM64 Linux servers (other ARM64 machines are not available and not tested) and an x86-64 Linux server (config: 4K page size and THP enabled). Results show the pkg runtime Malloc* benchmarks and Sweet bleve-index have obvious improvements. Since the barrier and prefetch are implementation-defined, the performance varies significantly (see benchmark results):

- `if goarch.IsArm64 != 0` is added to disable the prefetch on x86-64 and other architectures that were not tested. This is reasonable, as `publicationBarrier()` is a no-op on strong memory models like x86.

About the `prefetch` distance and location:

- Why `prefetch(x)` instead of `prefetch(x+offset)`? The reason is that we only know the currently allocated object's address and don't know the address of the next object, so I can't apply a proper `offset`. The experiments show `x` is a proper argument for the prefetch; we just let the hardware prefetcher do the speculative work.
- I tried different locations in `mallocgc` for inserting the `prefetch`, and found that placing it before `heapSetType()` gets the best performance.

Currently, there are existing `prefetch` usages in `scanobject` and `greyobject` to improve GC performance. What do you think about this change?

cc @aclements
Sweet bleve-index benchmark results
The overhead introduced by the barrier can significantly impact the performance of bleve-index, which frequently creates new small objects to build a treap (a binary search tree):
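For a sense of the allocation pattern (a sketch, not bleve's actual code): each treap insert allocates one small node, putting `mallocgc`, and hence the publication barrier, on the hot path.

```go
package treap

import "math/rand"

// node is a small heap object; every insert allocates one, which is
// why bleve-index stresses mallocgc.
type node struct {
	key, priority int
	left, right   *node
}

func rotateRight(n *node) *node {
	l := n.left
	n.left, l.right = l.right, n
	return l
}

func rotateLeft(n *node) *node {
	r := n.right
	n.right, r.left = r.left, n
	return r
}

// insert adds key with a random heap priority, keeping BST order by
// key and max-heap order by priority.
func insert(n *node, key int) *node {
	if n == nil {
		return &node{key: key, priority: rand.Int()} // one small allocation per insert
	}
	if key < n.key {
		n.left = insert(n.left, key)
		if n.left.priority > n.priority {
			n = rotateRight(n)
		}
	} else {
		n.right = insert(n.right, key)
		if n.right.priority > n.priority {
			n = rotateLeft(n)
		}
	}
	return n
}
```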
BTW, other Sweet benchmarks were also tested, but they are not obviously affected by this change.
Pkg runtime Malloc* benchmark results
Malloc* benchmarks test the performance of `mallocgc` by allocating objects: