8355094: Performance drop in auto-vectorized kernel due to split store #25065
Conversation
Thank you for the deep investigation, the excellent report, and most of all the colorful plots!
I found a typo, but otherwise the hotspot changes look good to me. I cannot review the benchmarks, unfortunately.
Co-authored-by: Manuel Hässig <manuel@haessig.org>
iwanowww
left a comment
Impressive analysis, Emanuel! Very deep, thorough, and insightful.
Looks good.
Speaking of Vector API, we experimented with getting access alignment under control. Unfortunately, when it comes to on-heap accesses it boils down to hyper-aligned objects support which is not there yet.
PS: yay, you found a way to turn PRs into blog posts! :-)
@iwanowww Thanks for your kind words 😊 Indeed: on-heap access would profit from hyper-aligned objects. Are there any ideas on how to do that? I wonder if it is worth it, or if it is good enough to just use off-heap (native) MemorySegments to guarantee alignment for very performance-critical cases?
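To make the off-heap option mentioned above concrete: the Foreign Function & Memory API (final in JDK 22) lets you request an explicit byte alignment for a native allocation, which on-heap arrays cannot guarantee. A minimal sketch (the method name and sizes here are illustrative, not from the PR):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class AlignedSegment {
    // Allocate 'count' ints off-heap with the requested byte alignment.
    // On-heap arrays are only guaranteed ObjectAlignmentInBytes (default 8)
    // alignment, so off-heap segments are one way to pin down alignment
    // for very performance-critical code.
    static MemorySegment allocateAlignedInts(Arena arena, int count, long byteAlignment) {
        MemorySegment seg = arena.allocate((long) count * Integer.BYTES, byteAlignment);
        seg.setAtIndex(ValueLayout.JAVA_INT, 0, 42); // normal typed access works as usual
        return seg;
    }
}
```

`Arena.allocate(byteSize, byteAlignment)` guarantees `seg.address()` is a multiple of the requested alignment, so a 64-byte-aligned segment never starts mid-cacheline.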
TobiHartmann
left a comment
> Impressive analysis, Emanuel! Very deep, thorough, and insightful.

+1 to this. Great work, Emanuel! The fix looks good to me.
@TobiHartmann Thank you for the review :) @theRealAph @XiaohongGong Do you have any idea about the somewhat confusing behavior of aarch64 in these benchmarks?
@TobiHartmann @iwanowww @mhaessig Thanks for reviewing! I'll integrate now, but we can still continue the conversation @theRealAph @XiaohongGong @jatin-bhateja.
/integrate
Going to push as commit 277bb20.
Your commit was automatically rebased without conflicts.
Hi @eme64, to be honest, I'm not quite sure about the unaligned memory access behavior on AArch64. I tried to clarify it by reading some Arm docs, but unfortunately the main message I got is that it is HW-implementation-defined behavior. Some AArch64 micro-architectures prefer aligning memory for loads instead of stores to obtain better performance, but for others it may be the contrary. That's the reality. My colleague provided me several patches in the go project which also use an option to prefer load or store alignment for a memory-move library optimization [1][2][3] on AArch64. A given AArch64 micro-architecture can choose the optimal alignment solution based on the performance results, and it chooses to align loads for Neoverse CPUs by default. Hope this could help you. I think the basic idea aligns with what you did in this PR. Thanks!
[1] https://go-review.googlesource.com/c/go/+/243357
@XiaohongGong Thanks a lot for taking the time to respond! That is very fascinating, and reassuring. Seems I'm not the only one seeing these kinds of results :) I suppose we could add a similar flag, to target the
Summary
Before JDK-8325155 / #18822, we used to prefer aligning to stores. But in that change, I removed that preference, and since then we have been aligning to loads instead (there is no preference, but since loads usually come before stores in the loop body, the load gets picked). This led to a performance regression, especially on `x64`.

Especially on `x64`, it is more important to align stores than to align loads. This is because memory operations that cross a cacheline boundary are split, and `x64` CPUs generally have more throughput for loads than for stores, so splitting a store is worse than splitting a load.

On `aarch64`, the results are less clear. On two machines, the differences were marginal, but surprisingly aligning to loads was marginally faster. On another machine, aligning to stores was significantly faster. I suspect performance depends on the exact `aarch64` implementation. I'm not an `aarch64` specialist, and only have access to a limited number of machines.

Fix: make automatic alignment configurable with `SuperWordAutomaticAlignment` (no alignment, align to store, align to load). The default is align to store.

For now, I will just align to stores on all platforms. If someone has various `aarch64` machines, they are welcome to do deeper investigations. The same goes for other platforms. We could always turn the flag into a platform-dependent one, and set different defaults depending on the exact CPU.

If you are interested, you can read my investigations/benchmark results below. There are a lot of colorful plots 📈 😊
FYI about Vector API: if you are working with the Vector API, you may also want to worry about alignment, because there can be a significant performance impact (30%+ in some cases). You may also want to know about 4k aliasing, discussed below.
Shoutout:
Introduction
I had long lived with the theory that on modern CPUs, misalignment has no consequence, especially no performance impact. When you google, many sources say that misalignment used to be an issue on older CPUs, but not any more.
That may technically be true:
So there is a connection: alignment means the load or store cannot cross a cacheline boundary, assuming a cacheline is at least as long as the load / store (e.g. 64 byte cacheline and 64 byte load / store or smaller). Conversely, a misaligned load has a good chance to cross a cacheline boundary. Especially when we are auto vectorizing, we are accessing a contiguous block of memory, and so if our accesses are misaligned, we must cross the cacheline boundary at some point. Hence, alignment has a performance impact in vectorization.
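The boundary-crossing condition can be made concrete with a few lines of address arithmetic (an illustrative helper, not benchmark code; 64-byte cachelines assumed):

```java
public class CacheLine {
    static final int CACHE_LINE = 64;

    // An access crosses a cacheline boundary iff its first and last byte
    // fall into different 64-byte lines. An access aligned to its own size
    // (up to 64 bytes) never crosses; a misaligned one of the same size may.
    static boolean crossesBoundary(long address, int accessBytes) {
        long firstLine = address / CACHE_LINE;
        long lastLine = (address + accessBytes - 1) / CACHE_LINE;
        return firstLine != lastLine;
    }
}
```

For a contiguous sweep of 64-byte vector accesses, either every access is aligned (never crosses) or every access is misaligned (always crosses and gets split), which is why alignment matters so much for vectorized loops.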
If we have a load and a store, but because of relative misalignment we can only align one: is it better to align the load or the store? Generally, x64 CPUs have more throughput for loads than for stores. Splitting loads means we have more loads, which is not as bad as splitting stores and pushing more stores through the CPU. Hence, in most cases, it is better to align the store, and accept that the load is split.
The above holds for `x64`, but on `aarch64` things are a little different / more complicated. For example, I found JEP 315, which mentions:
For the aarch64 machine I use, a Neoverse N1, the N1 Optimization Guide says:
Checking a few other manuals, it is mostly about the 64-byte cacheline boundary for loads, and the 16-byte boundary for stores. These chips have the `neon` vector instructions, which are at most 16 bytes (128 bits) wide.
From this I would personally conclude that with full alignment to vector length, there should be maximum performance. But the results below make me question that, and it seems I don't have the full picture yet.
Initial investigation using the Vector API
With the Vector API, we can produce code where we have direct control over what vector instructions are generated, including their alignment. This means we can start with some experiments independent of the auto vectorizer.
I wrote a stand-alone `Benchmark.java`; you can find it at the end of this PR. I did not integrate it, because it is not very well suited for regression testing, rather for visualization only. For regression testing, I am integrating the benchmark `VectorAutoAlignment.java`. Still, I am also integrating `VectorAutoAlignmentVisualization.java`, which can be used to visualize the effect of alignment for the auto vectorizer only.

Consider the following method, where we can vary the alignment of the load and store with `offset_load` and `offset_store`, respectively.

Let's start with a simple experiment, using `test1L1SVector`, with `SIZE = 2560` (produces clean results because there are not too many other effects) and `oneArray` (load at the beginning of the array, store into the same array but `SIZE` elements later).
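The method itself is not reproduced in this rendering; its shape is roughly the following (a hypothetical scalar sketch of the one-load-one-store pattern — the actual `test1L1SVector` issues explicit Vector API loads and stores over the same index pattern):

```java
public class KernelSketch {
    static final int SIZE = 2560;

    // One load and one store per iteration. offsetLoad and offsetStore let us
    // vary the (mis)alignment of the load and the store independently:
    // the load region starts at offsetLoad, the store region SIZE elements later.
    static void test1L1S(int[] a, int offsetLoad, int offsetStore) {
        for (int i = 0; i < SIZE; i++) {
            a[offsetStore + SIZE + i] = a[offsetLoad + i];
        }
    }
}
```

Sweeping `offsetLoad` and `offsetStore` over 0..63 and timing each combination produces the heatmaps shown below.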

Time in `ms`, red is slow, green is fast.
x-axis ➡️: `offset_load`
y-axis ⬆️: `offset_store`
We can see how there is a very clear grid for every size, and that the grid repeats with the vector size, i.e. the number of elements per vector. We see that store alignment has a larger effect on performance than load alignment. With 16-element vectors, we can even see a faint diagonal effect of relative alignment between the loads and stores, though I don't know the cause of that effect.
Further: the smaller the vectors, the less extreme the relative differences appear. For 16-element vectors, runtime varies from `7.5 ms` to `11.5 ms`, but for 2-element vectors it only varies from `24.8 ms` to `29.5 ms`. My theory is that for 64-byte vectors, every unaligned vector access is split, leading to roughly a doubling of operations. But for 8-byte vectors, only every 8th access crosses a cacheline boundary, and the effect of splitting is thus much smaller.
Something else that is also visible in these results: arrays are only 8-byte aligned. Every time I run the benchmark, e.g. for different vector lengths, the alignment of the array base is different. Thus, the "lines" of the "grid" do not always align between different runs of these benchmarks. This has a quite significant implication for Vector API benchmarks: if one does not control the alignment of the arrays, the measurements can be drastically unstable, and the results can vary very significantly.
The `neon`/`asimd` N1 aarch64 machine provides vectors up to 128 bits, so we can only display the results for 4- and 2-element vectors of ints:

Time in `ms`, red is slow, green is fast.
x-axis ➡️: `offset_load`
y-axis ⬆️: `offset_store`
Strangely, it seems only load alignment has a significant effect. That is quite surprising.
Investigating performance loss when crossing cacheline boundary, using Vector API
While the relative performance differences for different vector lengths already match the theory that only memory accesses that cross a cacheline boundary are split, we now show this effect with a special "skip" benchmark. I ran `Benchmark.java test1L1SVectorSkip 4 2560 oneArray`, i.e. a "skip" benchmark where every int vector has 4 elements, and we skip every 4th vector: `[0 1 2 3 ][4 5 6 7 ][8 9 10 11][ skip ]`. If the cacheline boundary lies where we skip a vector, then we should have no performance loss compared to when we have perfect alignment. At least in theory 😉
Left: the result for my AVX512 laptop; right: the result for the N1 aarch64 machine:

For comparison, the results from above, for the 4-element vectors without skipping:

Generally, we see similar 4-element wide "bands" in both directions.
The results for the AVX512 machine are quite understandable, and very crisp: We have the same repeating grid as without skip, except that every 4th band in x and y direction is "skipped", i.e. has the same performance as when aligned. It seems the theory perfectly applies for my AVX512 machine 😀
But the aarch64 results are stranger: In the x direction, i.e. for loads, every 4th band has better performance. That seems to correspond to the cacheline boundary of 64 bytes, i.e. when we skip, there is no load splitting. In the y direction, i.e. for stores, every 2nd band has better performance. This is surprising, because the non-skipping benchmark did not show any effect in this direction. And it seems to indicate some 32-byte effect, which corresponds neither to the 64-byte cacheline (otherwise we would see an effect only every 4th band), nor to the 16-byte store boundaries mentioned in the N1 Optimization Guide. This is really confusing. We could further investigate the behavior with different element sizes, vector sizes, and skip methods.
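For reference, the skip benchmark's access pattern can be sketched as follows (a hypothetical scalar reconstruction — the real `test1L1SVectorSkip` issues 4-element Vector API accesses; here each inner loop stands for one vector):

```java
public class SkipKernel {
    // Process three 4-element chunks, then leave a 4-element gap.
    // With perfect alignment, the 64-byte cacheline boundary lands inside
    // the gap, so no access is ever split.
    static void test1L1SSkip(int[] a, int offsetLoad, int offsetStore, int size) {
        for (int i = 0; i + 3 < size; i += 4) {
            if ((i / 4) % 4 == 3) continue; // skip every 4th chunk of 4 ints
            for (int j = 0; j < 4; j++) {   // stands in for one 4-element vector op
                a[offsetStore + size + i + j] = a[offsetLoad + i + j];
            }
        }
    }
}
```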
Discovering 4k aliasing artifacts in benchmark, using the Vector API
On my AVX512 machine, I found an effect that happens around `4k byte` boundaries, i.e. every `1024 ints`. For 64-, 32- and 16-byte vectors, i.e. 16, 8, and 4 elements, and `SIZE = 2048`, so 8k bytes:

In the lower half triangles, we see the normal grid pattern. Modulo 4k bytes, the loads are ahead of the stores, which may explain why there is no effect. But the upper half triangles have drastically worse performance. The grid is now diagonal, probably dominated by relative alignment rather than absolute alignment. Modulo 4k bytes, the loads are behind the stores; my theory is that this conflicts with the loads having to happen first.
I ran it on a larger grid (offsets from 0-127), and one can see that the effect slowly wears off (from red to orange). Ignore the noise; I had to lower the accuracy to complete this one in reasonable time:

But it seems on the aarch64 machine, I cannot find this `4k byte` boundary effect.
@iwanowww pointed me to this article about 4k aliasing. The reason is that store-to-load-forwarding at first only compares the lowest 12 bits of the address, and when it later detects that the rest of the address does not match, this incurs a penalty of a few cycles. Note: I only just learned about the effects of store-to-load-forwarding recently, see #21521 / JDK-8334431.
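The low-12-bit comparison behind 4k aliasing can be illustrated directly (illustrative arithmetic only, not HotSpot code; 4 KiB = 2^12 bytes, hence the 12-bit mask):

```java
public class FourKAliasing {
    // Store-to-load-forwarding hardware initially compares only the low
    // 12 bits of load and store addresses. Two addresses a multiple of
    // 4 KiB apart look identical to this first check; the false match is
    // only disproved later, costing a few cycles.
    static boolean mayFalselyAlias(long loadAddr, long storeAddr) {
        boolean low12BitsMatch = (loadAddr & 0xFFF) == (storeAddr & 0xFFF);
        boolean actuallySame = loadAddr == storeAddr;
        return low12BitsMatch && !actuallySame;
    }
}
```

This is why the artifacts in the plots repeat every 1024 ints: that is exactly 4096 bytes, so the load and store addresses collide in their low 12 bits.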
Investigation for automatic alignment in the Auto Vectorizer
To be able to investigate the performance of the Auto-Vectorizer (SuperWord), I made the automatic alignment configurable with `SuperWordAutomaticAlignment`. We can disable it, align with the store, or align with the load.
The attached JMH benchmark `VectorAutoAlignmentVisualization.bench1L1S`, with automatic alignment disabled, looks like this:

This JMH benchmark is really slow, so we can also use the `Benchmark.java` from below. I ran it on my AVX512 laptop with `Benchmark.java test1L1SScalar 4 2560 oneArray`:

We can see that with no alignment, we have a grid with 90° angles. If the stores are aligned, we get about `3.35 ms` runtime; if only loads are aligned, we get about `4.4 ms`; and if neither is aligned, `4.9 ms`. If both are aligned, we get only `3.2 ms`.
With automatic alignment on stores, we get overall better performance. But we also see the pattern is now diagonal. In most cases, we only have the store aligned, and we get about `3.4-3.5 ms`. But when the load and store are relatively aligned, i.e. on the thin diagonals, we get only `3.3 ms`. These performance numbers are comparable with the numbers we see on the "no alignment" plot on the horizontal lines, where the stores are aligned.
With automatic alignment on loads, we get an average performance that is better than without alignment, but worse than aligning with stores. In most cases, only the load is aligned, and we get `4.4 ms`. But on the rare occasion where the loads and stores are relatively aligned, i.e. the thin diagonals, we get `3.2-3.3 ms`. These numbers are comparable with the numbers we see on the "no alignment" plot on the vertical lines, where the loads are aligned.
But which one of these options should we now choose? I.e. what should be the default for `SuperWordAutomaticAlignment`? In general, we do not know the alignment of the load and store, so we should assume that we land on one of the cells at random. Thus, the relevant performance metric is the average over all cells.
The benchmark below does exactly this: it runs the loop for every `offset_load` and `offset_store` combination, essentially computing the average over all combinations.
Automatic Alignment in SuperWord (Auto Vectorization)
Results with aliasing runtime checks #24278, on `VectorAutoAlignment`, on my AVX512 laptop:

Note: before #24278, this benchmark never vectorizes, because we cannot prove that the load and store do not alias.
There are clearly some artifacts around the 4k byte boundaries. See the discussion further up about 4k aliasing.
Other than those artifacts, it is very clear that aligning with stores is the best on my AVX512 CPU. Aligning the loads is significantly worse, and not aligning at all slightly worse than that. But in any case: vectorization is always very clearly profitable, no matter the alignment.
Running it on an `aarch64` neon OCI machine:

The results look quite a bit different. Vectorization is still always profitable, no matter the alignment. But now, it seems aligning loads is fastest, and there is no difference between aligning stores or no alignment at all.
I also ran it on our benchmark servers:
- linux aarch64 (neon):
- linux x64:
- macosx aarch64 (neon):
- macosx x64:
- windows x64:

The `x64` results are fairly consistent: in most cases aligning to stores is best, except for the 4k artifacts.
The `aarch64` results are less clear. On two machines we see that aligning loads is marginally faster, but on one machine aligning to stores is faster. I suspect it may depend on the exact `aarch64` implementation.
Standalone Benchmark.java
I did not integrate `Benchmark.java`, because it is not very well suited for regression testing, rather for visualization only. For regression testing, I am integrating the benchmark `VectorAutoAlignment.java`. Still, I am also integrating `VectorAutoAlignmentVisualization.java`, which can be used to visualize the effect of alignment for the auto vectorizer only.
I usually run the benchmark with command-lines like this:
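The exact command lines are not preserved in this rendering; the following is an illustrative sketch only (the flag value meanings — 0 = no alignment, 1 = align to store, 2 = align to load — and the argument order are assumptions, not the author's literal invocation):

```bash
# Sketch: run the standalone benchmark with store-alignment (assumed value 1)
java -XX:SuperWordAutomaticAlignment=1 Benchmark.java test1L1SVector 4 2560 oneArray

# Sketch: force smaller vectors and load-alignment (assumed value 2)
java -XX:SuperWordAutomaticAlignment=2 -XX:MaxVectorSize=32 Benchmark.java test1L1SScalar 4 2560 oneArray
```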
Here are some relevant flags to play with:
- `ObjectAlignmentInBytes`: alignment of objects, i.e. the arrays in these benchmarks. The default is `8` bytes, which means two arrays only have a relative alignment of `8` bytes. Hence, it may not always be possible to align both references to two arrays to more than `8` bytes, i.e. we can only guarantee 64-byte alignment of at most one of them.
- `TraceAutoVectorization`: with the tag `ALIGN_VECTOR` we can see which memory reference we auto-align.
- `LoopUnrollLimit`: some benchmarks have a rather large loop, and only auto-vectorize if this limit is artificially increased.
- `MaxVectorSize`: we can artificially lower the maximum vector length, possibly breaking larger vectors into multiple smaller ones.
- `SuperWordAutomaticAlignment`: controls if and how we auto-align.