Conversation

@eme64
Contributor

@eme64 eme64 commented May 6, 2025

Summary

Before JDK-8325155 / #18822, we used to prefer aligning to stores. In that change, I removed that preference, and since then we have been aligning to loads instead (there is no explicit preference, but since loads usually come before stores in the loop body, the load gets picked). This led to a performance regression, especially on x64.

Especially on x64, it is more important to align stores than loads. This is because memory operations that cross a cacheline boundary are split, and x64 CPUs generally have more throughput for loads than for stores, so splitting a store is worse than splitting a load.

On aarch64, the results are less clear. On two machines, the differences were marginal, but surprisingly aligning to loads was marginally faster. On another machine, aligning to stores was significantly faster. I suspect performance depends on the exact aarch64 implementation. I'm not an aarch64 specialist, and only have access to a limited number of machines.

Fix: make automatic alignment configurable with SuperWordAutomaticAlignment (no alignment, align to store, align to load). Default is align to store.

For now, I will just align to stores on all platforms. If someone has access to various aarch64 machines, they are welcome to do deeper investigations. The same goes for other platforms. We could always turn the flag into a platform-dependent one, and set different defaults depending on the exact CPU.

If you are interested, you can read my investigations/benchmark results below. There are a lot of colorful plots 📈 😊

FYI about Vector API: if you are working with the Vector API, you may also want to worry about alignment, because there can be a significant performance impact (30%+ in some cases). You may also want to know about 4k aliasing, discussed below.

Shoutout:

  • @jatin-bhateja filed the regression, and explained that it was about split stores.
  • @mhaessig helped me talk through some of the early benchmarks.
  • @iwanowww pointed me to the 4k aliasing explanation.

Introduction

I had long lived with the theory that on modern CPUs, misalignment has no consequences, in particular no performance impact. When you search, many sources say that misalignment used to be an issue on older CPUs, but not anymore.

That may technically be true:

  • A misaligned load or store that does not cross a cacheline boundary performs the same as an aligned load or store.
  • But: a misaligned load or store that crosses a cacheline boundary is slower than one that does not. The reason is that a load or store crossing a cacheline boundary is split, which means we now have two memory accesses instead of one.

So there is a connection: alignment means the load or store cannot cross a cacheline boundary, assuming a cacheline is at least as long as the load / store (e.g. 64 byte cacheline and 64 byte load / store or smaller). Conversely, a misaligned load has a good chance to cross a cacheline boundary. Especially when we are auto vectorizing, we are accessing a contiguous block of memory, and so if our accesses are misaligned, we must cross the cacheline boundary at some point. Hence, alignment has a performance impact in vectorization.
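To make the crossing condition concrete, it is simple integer arithmetic; a minimal sketch (the class and method names are mine, not part of the benchmark):

```java
public class CachelineCheck {
    static final int CACHELINE = 64; // bytes, typical for x64 and Neoverse N1

    // An access of `size` bytes starting at byte address `addr` crosses a
    // cacheline boundary iff its first and last byte land in different lines.
    static boolean crossesCacheline(long addr, int size) {
        return (addr / CACHELINE) != ((addr + size - 1) / CACHELINE);
    }

    public static void main(String[] args) {
        // A 64-byte access can only avoid a crossing when fully aligned:
        System.out.println(crossesCacheline(0, 64));  // false
        System.out.println(crossesCacheline(8, 64));  // true
        // A 16-byte access crosses only when it straddles a multiple of 64:
        System.out.println(crossesCacheline(56, 16)); // true
        System.out.println(crossesCacheline(48, 16)); // false
    }
}
```

This also shows why, for contiguous vector accesses, misalignment forces periodic crossings: the addresses march through every residue modulo 64.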

If we have a load and a store, but because of relative misalignment we can only align one of them: is it better to align the load or the store? Generally, x64 CPUs have more throughput for loads than for stores, so the extra loads produced by splitting are cheaper than the extra stores would be. Hence, in most cases it is better to align the store, and accept that the load is split.

The above holds for x64, but on aarch64 things are a little different / more complicated. For example, I found JEP 315, which mentions:

Avoid unaligned memory access if needed. Some CPU implementations impose penalties when issuing load/store instructions across a 16-byte boundary, a dcache-line boundary, or have different optimal alignment for different load/store instructions (see, for example, the Cortex A53 guide). If the aligned versions of intrinsics do not slow down code execution on alignment-independent CPUs, it may be beneficial to improve address alignment to help those CPUs that do have some penalties, provided it does not significantly increase code complexity.

For the aarch64 machine I use, a Neoverse N1, the N1 Optimization Guide says:

4.5 Load/Store alignment
The Armv8.2-A architecture allows many types of load and store accesses to be arbitrarily
aligned. The Neoverse N1 handles most unaligned accesses without performance penalties.
However, there are cases which reduce bandwidth or incur additional latency, as described
below.

  • Load operations that cross a cache-line (64-byte) boundary.
  • Quad-word load operations that are not 4B aligned.
  • Store operations that cross a 16B boundary.

Checking a few other manuals, it is mostly about the 64-byte cacheline boundary for loads, and the 16-byte boundary for stores. These chips have the neon vector instructions, which are at most 16 bytes (128 bits) wide.

From this I would personally conclude that with full alignment to vector length, there should be maximum performance. But the results below make me question that, and it seems I don't have the full picture yet.
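The three N1 penalty cases quoted above can be written down directly; here is a small sketch (my own helper, not part of the benchmark) that flags which case a given access falls into:

```java
public class N1AlignmentRules {
    // Penalty cases from the Neoverse N1 Optimization Guide, section 4.5:

    // Load operations that cross a cache-line (64-byte) boundary.
    static boolean loadCrosses64B(long addr, int size) {
        return (addr / 64) != ((addr + size - 1) / 64);
    }

    // Quad-word (16-byte) load operations that are not 4-byte aligned.
    static boolean quadLoadNot4BAligned(long addr, int size) {
        return size == 16 && (addr % 4) != 0;
    }

    // Store operations that cross a 16-byte boundary.
    static boolean storeCrosses16B(long addr, int size) {
        return (addr / 16) != ((addr + size - 1) / 16);
    }

    public static void main(String[] args) {
        // A 16-byte store at offset 8 crosses a 16-byte boundary...
        System.out.println(storeCrosses16B(8, 16));      // true
        // ...while a 16-byte load at offset 8 stays within one cacheline.
        System.out.println(loadCrosses64B(8, 16));       // false
        System.out.println(quadLoadNot4BAligned(2, 16)); // true
    }
}
```

Note how the store rule fires much more often than the load rule for 16-byte vectors: every misaligned 16-byte store crosses a 16-byte boundary, but only one in four misaligned 16-byte loads crosses a cacheline.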


Initial investigation using the Vector API

With the Vector API, we can produce code where we have direct control over which vector instructions are generated, including their alignment. This means we can start with some experiments independent of the auto vectorizer.

I wrote a stand-alone Benchmark.java, you can find it at the end of this PR. I did not integrate it, because it is not well suited for regression testing, but rather for visualization only. For regression testing, I am integrating the benchmark VectorAutoAlignment.java. Still, I am also integrating VectorAutoAlignmentVisualization.java, which can be used to visualize the effect of alignment for the auto vectorizer only.

Consider the following method, where we can vary the alignment of the load and store with offset_load and offset_store, respectively:

    public static void test1L1SVector(int offset_load, int offset_store) {
        for (int i = 0; i < SIZE - 64 - GRID; i += SPECIES.length()) {
            var v = IntVector.fromArray(SPECIES, arr1, base1 + i + offset_load);
            v.intoArray(arr0, base0 + i + offset_store);
        }
    }

Let's start with a simple experiment, using test1L1SVector, with SIZE = 2560 (produces clean results because not too many other effects) and oneArray (store at the beginning of the array, load from the same array but SIZE elements later).

Below are the results for my AVX512 machine, which supports up to 64-byte vectors. I show the results for 64, 32, 16 and 8 byte vectors, i.e. 16, 8, 4, and 2 ints per vector.
image
Time in ms, red is slow, green is fast.
x-axis ➡️: offset_load
y-axis ⬆️: offset_store
We can see that there is a very clear grid for every size, and that the grid repeats with the vector size, i.e. the number of elements per vector. We see that store alignment has a larger effect on performance than load alignment. With 16-element vectors, we can even see a faint diagonal effect of relative alignment between the loads and stores, though I don't know the cause of that effect.
Further: we can see that the smaller the vectors, the less extreme the relative differences appear. For 16-element vectors, runtime varies from 7.5 ms to 11.5 ms, but for 2-element vectors it only varies from 24.8 to 29.5 ms. My theory is that for 64-byte vectors, every unaligned vector is split, leading to roughly a doubling of operations. But for 8-byte vectors, only every 8th crosses a cacheline boundary, and the effect of splitting is thus much smaller.
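This theory can be checked with a quick count: over a run of contiguous vector accesses at a fixed misalignment, how many cross a 64-byte line? (A throwaway sketch of mine, not from the benchmark.)

```java
public class SplitFraction {
    // Count how many of `n` contiguous vector accesses of `size` bytes,
    // starting at byte offset `off`, cross a 64-byte cacheline boundary.
    static int countSplits(int off, int size, int n) {
        int splits = 0;
        for (int k = 0; k < n; k++) {
            long a = off + (long) k * size;
            if ((a / 64) != ((a + size - 1) / 64)) splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Misaligned 64-byte vectors: every single access is split.
        System.out.println(countSplits(8, 64, 64)); // 64
        // Misaligned 8-byte vectors: only every 8th access is split.
        System.out.println(countSplits(4, 8, 64));  // 8
    }
}
```

So for 64-byte vectors, misalignment roughly doubles the number of memory operations, while for 8-byte vectors it adds only about 12.5%, matching the smaller spread in the measurements.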

Something else that is also visible in these results: arrays are only 8-byte aligned. Every time I run the benchmark, e.g. for different vector lengths, the alignment of the base is different. Thus, the "lines" of the "grid" do not always align between different runs of these benchmarks. This has a quite significant implication for Vector API benchmarks: if one does not control the alignment of the arrays, the measurements can be drastically unstable.

The neon / asimd N1 aarch64 machine provides vectors up to 128 bits, so we can only display the results for 4 and 2 element vectors of ints:
image
Time in ms, red is slow, green is fast.
x-axis ➡️: offset_load
y-axis ⬆️: offset_store
Strangely, it seems only load alignment has a significant effect here. That is quite surprising.


Investigating performance loss when crossing cacheline boundary, using Vector API

While the relative performance differences for different vector lengths already match the theory that only memory accesses that cross a cacheline boundary are split, we can now show this effect directly with a special "skip" benchmark. I ran Benchmark.java test1L1SVectorSkip 4 2560 oneArray, i.e. a "skip" benchmark where every int vector has 4 elements, and we skip every 4th vector: [0 1 2 3 ][4 5 6 7 ][8 9 10 11][ skip ]. If the cacheline boundary lies where we skip a vector, then we should have no performance loss compared to perfect alignment. At least in theory 😉

On the left the result for my AVX512 laptop, on the right the N1 aarch64 machine:
image

For comparison, the results from above, for the 4-element vectors without skipping:
image

Generally, we see similar 4-element wide "bands" in both directions.

The results for the AVX512 machine are quite understandable, and very crisp: We have the same repeating grid as without skip, except that every 4th band in x and y direction is "skipped", i.e. has the same performance as when aligned. It seems the theory perfectly applies for my AVX512 machine 😀

But the aarch64 results are stranger: in the x direction, i.e. for loads, every 4th band has better performance. That seems to correspond to the cacheline boundary of 64 bytes, i.e. when we skip, there is no load splitting. In the y direction, i.e. for stores, every 2nd band has better performance. This is surprising, because the non-skipping benchmark did not show any effect in this direction. And it seems to indicate some 32-byte effect, which corresponds neither to the 64-byte cacheline (otherwise we would see an effect only every 4th band), nor to the 16-byte store boundaries mentioned in the N1 Optimization Guide. This is really confusing. We could further investigate the behavior with different element sizes, vector sizes, and skip methods.


Discovering 4k aliasing artifacts in benchmark, using the Vector API

On my AVX512 machine, I found an effect that happens around 4k byte boundaries, i.e. every 1024 ints. For 64, 32 and 16 byte vectors, i.e. 16, 8, and 4 elements, and SIZE = 2048, so 8k bytes:
image
In the lower half triangles, we see the normal grid pattern. Modulo 4k bytes, the loads are ahead of the stores, which may explain why there is no effect. But the upper half triangles have drastically worse performance. The grid is now diagonal, probably dominated by relative alignment rather than absolute alignment. Modulo 4k bytes, the loads are behind the stores - my theory is that this conflicts with the loads having to happen first.

I ran it on a larger grid (offsets from 0-127), and one can see that the effect slowly wears off (from red to orange) - ignore the noise, I had to lower the accuracy to complete this one in reasonable time:
image

But it seems that on the aarch64 machine, I cannot find this 4k byte boundary effect.

@iwanowww pointed me to this article about 4k aliasing. The reason is that store-to-load forwarding at first only compares the lowest 12 bits of the address, and when it later detects that the rest of the address does not match, this incurs a penalty of a few cycles. Note: I only learned about the effects of store-to-load forwarding recently, see #21521 / JDK-8334431.
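The low-12-bit condition is easy to illustrate; a hypothetical sketch of the check the hardware effectively performs (class and method names are mine):

```java
public class FourKAliasing {
    // Store-to-load forwarding initially compares only the low 12 address
    // bits (4096 = 2^12 bytes). Two accesses "4k-alias" when those bits
    // match but the full addresses differ - the CPU speculatively forwards,
    // then has to back out when the full comparison fails.
    static boolean mayFalselyForward(long storeAddr, long loadAddr) {
        return storeAddr != loadAddr
            && (storeAddr & 0xFFF) == (loadAddr & 0xFFF);
    }

    public static void main(String[] args) {
        // Addresses exactly 4096 bytes apart collide in the low 12 bits:
        System.out.println(mayFalselyForward(0x1010, 0x2010)); // true
        System.out.println(mayFalselyForward(0x1010, 0x2014)); // false
    }
}
```

This is why the artifacts appear at 4k byte boundaries, i.e. every 1024 ints: whenever the load address trails the store address by a multiple of 4096, the low bits collide on every iteration.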


Investigation for automatic alignment in the Auto Vectorizer

To be able to investigate the performance of the Auto-Vectorizer (SuperWord), I made the automatic alignment configurable with SuperWordAutomaticAlignment. We can disable it, align with the store, or align with the load.

The attached JMH benchmark VectorAutoAlignmentVisualization.bench1L1S, with automatic alignment disabled, looks like this:
image

This JMH benchmark is really slow, so we can also use the Benchmark.java below.
I ran it on my AVX512 laptop with Benchmark.java test1L1SScalar 4 2560 oneArray:
image

  • Top left: no alignment.
  • Top right: align with store.
  • Bottom right: align with load.

We can see that with no alignment, we have a grid with 90° angles. If the stores are aligned, we get about 3.35 ms runtime; if only loads are aligned, we get about 4.4 ms; and if neither is aligned, 4.9 ms - if both are aligned, we get only 3.2 ms.

With automatic alignment on stores, we get overall better performance. But we also see the pattern is now diagonal. In most cases, we only have the store aligned, and we get about 3.4-3.5 ms. But when the load and store are relatively aligned, i.e. on the thin diagonals, we get only 3.3 ms. These performance numbers are comparable with the numbers we see on the "no alignment" plot on the horizontal lines where the stores are aligned.

With automatic alignment on loads, we get an average performance that is better than without alignment, but worse than aligning with stores. In most cases only the load is aligned, and we get 4.4 ms. But on the rare occasion where the loads and stores are relatively aligned, i.e. the thin diagonals, we get 3.2-3.3 ms. These numbers are comparable with the numbers we see on the "no alignment" plot on the vertical lines, where the loads are aligned.

But which one of these options should we now choose? I.e., what should be the default for SuperWordAutomaticAlignment? In general, we do not know the alignment of the load and store, so we should assume that we land on one of the cells at random. Thus, the relevant performance metric is the average over all cells.

The benchmark below does exactly this: it runs the loop for every offset_load and offset_store combination, essentially computing the average over all combinations.


Automatic Alignment in SuperWord (Auto Vectorization)

Results with aliasing runtime checks #24278, on VectorAutoAlignment, on my AVX512 laptop:
image
Note: before #24278, this benchmark never vectorizes, because we cannot prove that the load and store do not alias.

There are clearly some artifacts around the 4k byte boundaries. See the discussion further up about 4k aliasing.

Other than those artifacts, it is very clear that aligning with stores is the best on my AVX512 CPU. Aligning the loads is significantly worse, and not aligning at all is slightly worse than that. But in any case: vectorization is always very clearly profitable, no matter the alignment.

Running it on an aarch64 neon OCI machine:
image
The results look quite a bit different. Vectorization is still always profitable, no matter the alignment. But now, it seems aligning loads is fastest, and there is no difference between aligning stores or no alignment at all.

I also ran it on our benchmark servers:

linux aarch64, (neon):
image

linux x64:
image

macosx aarch64 (neon):
image

macosx x64:
image

windows x64:
image

The x64 results are fairly consistent: in most cases aligning to stores is best, except for the 4k artifacts.
The aarch64 results are less clear. On two machines we see that aligning loads is marginally faster, but on one machine aligning to stores is faster. I suspect it may depend on the exact aarch64 implementation.


Standalone Benchmark.java

I did not integrate Benchmark.java, because it is not well suited for regression testing, but rather for visualization only. For regression testing, I am integrating the benchmark VectorAutoAlignment.java. Still, I am also integrating VectorAutoAlignmentVisualization.java, which can be used to visualize the effect of alignment for the auto vectorizer only.

I usually run the benchmark with command-lines like this:

./java -XX:CompileCommand=compileonly,Benchmark*::test* -XX:CompileCommand=printcompilation,Benchmark*::* -Xbatch -XX:+PrintIdeal -XX:CompileCommand=printassembly,Benchmark*::test* -XX:ObjectAlignmentInBytes=8 -XX:CompileCommand=TraceAutoVectorization,Benchmark*::test*,SW_INFO,ALIGN_VECTOR -XX:+TraceLoopOpts -XX:LoopUnrollLimit=60 -XX:MaxVectorSize=64 -XX:SuperWordAutomaticAlignment=0 Benchmark.java test1L1SVector 4 2432 separateArrays

Here some relevant flags to play with:

  • ObjectAlignmentInBytes: alignment of objects, i.e. the arrays in these benchmarks. Default is 8 bytes, which means two arrays only have a guaranteed relative alignment of 8 bytes. Hence, it may not always be possible to align references into two arrays to more than 8 bytes, i.e. we can only guarantee 64-byte alignment for at most one of them.
  • TraceAutoVectorization: with the tag ALIGN_VECTOR we can see which memory reference we auto-align.
  • LoopUnrollLimit: some benchmarks have a rather large loop, and only auto-vectorize if this limit is artificially increased.
  • MaxVectorSize: we can artificially lower the maximum vector length, possibly breaking larger vectors into multiple smaller ones.
  • SuperWordAutomaticAlignment: controls if and how we auto-align.

import jdk.incubator.vector.*;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.Set;

public class Benchmark {
    public static int SIZE;
    public static VectorSpecies<Integer> SPECIES;

    public static int[] arr0;
    public static int[] arr1;
    public static int[] arr2;
    public static int[] arr3;
    public static int base0;
    public static int base1;
    public static int base2;
    public static int base3;

    public static void main(String[] args) {
	if (args.length != 4) {
	    System.out.println("Error: need 4 arguments, got " + args.length);
	    printUsage();
	}

	String benchmarkName = args[0];

	int vectorElements = Integer.parseInt(args[1]);
	if (!Set.of(2, 4, 8, 16).contains(vectorElements)) {
	    System.out.println("Error: vectorElements must be 2, 4, 8, or 16, got " + vectorElements);
	    printUsage();
	}
	SPECIES = VectorSpecies.of(int.class, VectorShape.forBitSize(vectorElements * 4 * 8));

	SIZE = Integer.parseInt(args[2]);
	if (SIZE < 2000 || SIZE > 100_000) {
	    System.out.println("Error: dataSize out of range [2000, 100_000], got " + SIZE);
	    printUsage();
	}

	String scenario = args[3];
	switch (scenario) {
            // Load / Store from different arrays. Relative alignment is not known.
	    case "separateArrays" -> {
                arr0 = new int[SIZE];
                arr1 = new int[SIZE];
                arr2 = new int[SIZE];
                arr3 = new int[SIZE];
		base0 = 0;
		base1 = 0;
		base2 = 0;
		base3 = 0;
	    }
            // Load / Store on same array -> base have a known relative alignment.
	    // Use the whole array, every access has its own "region".
	    case "oneArray" -> {
                int[] arr = new int[4 * SIZE];
                arr0 = arr;
                arr1 = arr;
                arr2 = arr;
                arr3 = arr;
		base0 = 0 * SIZE;
		base1 = 1 * SIZE;
		base2 = 2 * SIZE;
		base3 = 3 * SIZE;
	    }
            // Load / Store on same array -> base have a known relative alignment.
	    // Small offset -> the memory accesses use the same memory "region".
	    case "oneArraySmallOffset" -> {
                int[] arr = new int[4 * SIZE];
                arr0 = arr;
                arr1 = arr;
                arr2 = arr;
                arr3 = arr;
		base0 = 0 * (1024 + 256);
		base1 = 1 * (1024 + 256);
		base2 = 2 * (1024 + 256);
		base3 = 3 * (1024 + 256);
	    }
	    default -> {
		System.out.println("Error: scenario does not exist: " + scenario);
		printUsage();
	    }
	}
	
	BenchmarkRunner.run(benchmarkName);
    }

    public static void printUsage() {
	System.out.println("Usage: java <jvm flags> Benchmark.java <benchmark> <vectorElements> <dataSize> <scenario>");
	System.out.println("  benchmark:");
	System.out.println("    test1L1SVector test1L1SVectorSkip test1L1SScalar");
	System.out.println("    test2L1SVector test2L1SVectorSkip test2L1SScalar test2L1SScalarRearranged");
	System.out.println("    test3L1SVector test3L1SVectorSkip test3L1SScalar");
	System.out.println("  vectorElements: 2, 4, 8, 16");
	System.out.println("  dataSize: 2000 ... 100_000. Recommended: 2048.");
	System.out.println("  scenario: separateArrays oneArray oneArraySmallOffset");
        System.exit(0);
    }
}

public class BenchmarkRunner {
    // Make sure the runner has all these fields final, so we get a better chance at optimisation.
    public static final int SIZE = Benchmark.SIZE;
    public static final VectorSpecies<Integer> SPECIES = Benchmark.SPECIES;

    public static final int REPS = 50_000; // Repeat REPS times for a benchmark measurement.
    public static final int RUNS = 5; // Each benchmark measurement is repeated RUNS times, and MIN runtime is chosen.
    public static final int GRID = 32;

    public static int[] arr0 = Benchmark.arr0; // store
    public static int[] arr1 = Benchmark.arr1; // load
    public static int[] arr2 = Benchmark.arr2; // load
    public static int[] arr3 = Benchmark.arr3; // load
    public static final int base0 = Benchmark.base0;
    public static final int base1 = Benchmark.base1;
    public static final int base2 = Benchmark.base2;
    public static final int base3 = Benchmark.base3;

    interface GridBenchmark {
        void run(int offset_load, int offset_store);
    }

    public static void run(String benchmarkName) {
	switch (benchmarkName) {
	    case "test1L1SVector" -> benchmarkGrid(BenchmarkRunner::test1L1SVector);
	    case "test2L1SVector" -> benchmarkGrid(BenchmarkRunner::test2L1SVector);
	    case "test3L1SVector" -> benchmarkGrid(BenchmarkRunner::test3L1SVector);
	    case "test1L1SVectorSkip" -> benchmarkGrid(BenchmarkRunner::test1L1SVectorSkip);
	    case "test2L1SVectorSkip" -> benchmarkGrid(BenchmarkRunner::test2L1SVectorSkip);
	    case "test3L1SVectorSkip" -> benchmarkGrid(BenchmarkRunner::test3L1SVectorSkip);
	    case "test1L1SScalar" -> benchmarkGrid(BenchmarkRunner::test1L1SScalar);
	    case "test2L1SScalar" -> benchmarkGrid(BenchmarkRunner::test2L1SScalar);
	    case "test3L1SScalar" -> benchmarkGrid(BenchmarkRunner::test3L1SScalar);
	    case "test2L1SScalarRearranged" -> benchmarkGrid(BenchmarkRunner::test2L1SScalarRearranged);
	    default -> {
		System.out.println("Error: benchmark does not exist: " + benchmarkName);
		Benchmark.printUsage();
	    }
	}
	System.out.println("Done: " + benchmarkName);
        System.out.println("x-axis  (->)  LOAD_OFFSET");
        System.out.println("y-axis  (up)  STORE_OFFSET");
	System.out.println("offset_load: load alignment shift");
	System.out.println("offset_store: store alignment shift");
    }

    public static void benchmarkGrid(GridBenchmark gt) {
	System.out.println("Initial Warmup");
        for (int i = 0; i < 10 * REPS; i++) {
	    gt.run(0, 0);
	}

	ArrayList<String> list = new ArrayList<>();
	float total = 0;
        for (int offset_store = 0; offset_store < GRID; offset_store++) {
	    String line = "";
            for (int offset_load = 0; offset_load < GRID; offset_load++) {
		float t = Float.POSITIVE_INFINITY;
		for (int i = 0; i < RUNS; i++) {
		    t = Math.min(t, benchmark(offset_load, offset_store, gt));
		}
		total += t;
                line += String.format("%.5f ", t);
	    }
	    System.out.println(line);
	    list.add(line);
	}
	System.out.println("Results [ms]:");
	// reverse the list, so the 0/0 point is at the bottom left.
	for (var line : list.reversed()) {
	    System.out.println(line);
	}
	System.out.println("total [ms]: " + total);
    }

    public static float benchmark(int offset_load, int offset_store, GridBenchmark gt) {
        for (int i = 0; i < REPS; i++) {
	    gt.run(offset_load, offset_store);
	}
	long t0 = System.nanoTime();
        for (int i = 0; i < REPS; i++) {
	    gt.run(offset_load, offset_store);
	}
	long t1 = System.nanoTime();
	float t = (t1 - t0) * 1e-6f;
	return t;
    }

    public static void vector1L1S(int offset_load, int offset_store, int i) {
        var v = IntVector.fromArray(SPECIES, arr1, base1 + i + offset_load);
        v.intoArray(arr0, base0 + i + offset_store);
    }

    public static void vector2L1S(int offset_load, int offset_store, int i) {
        var v0 = IntVector.fromArray(SPECIES, arr1, base1 + i + offset_load);
        var v1 = IntVector.fromArray(SPECIES, arr2, base2 + i + offset_load);
	var v = v0.add(v1);
        v.intoArray(arr0, base0 + i + offset_store);
    }

    public static void vector3L1S(int offset_load, int offset_store, int i) {
        var v0 = IntVector.fromArray(SPECIES, arr1, base1 + i + offset_load);
        var v1 = IntVector.fromArray(SPECIES, arr2, base2 + i + offset_load);
        var v2 = IntVector.fromArray(SPECIES, arr3, base3 + i + offset_load);
	var v = v0.add(v1).add(v2);
        v.intoArray(arr0, base0 + i + offset_store);
    }

    public static void test1L1SVector(int offset_load, int offset_store) {
        for (int i = 0; i < SIZE - 64 - GRID; i += SPECIES.length()) {
	    vector1L1S(offset_load, offset_store, i);
        }
    }

    public static void test2L1SVector(int offset_load, int offset_store) {
        for (int i = 0; i < SIZE - 64 - GRID; i += SPECIES.length()) {
	    vector2L1S(offset_load, offset_store, i);
        }
    }

    public static void test3L1SVector(int offset_load, int offset_store) {
        for (int i = 0; i < SIZE - 64 - GRID; i += SPECIES.length()) {
	    vector3L1S(offset_load, offset_store, i);
        }
    }

    // This one is to prove that the split happens on the cache line.
    //
    // Note: this is written for vectorElements = 4
    public static void test1L1SVectorSkip(int offset_load, int offset_store) {
        for (int i = 0; i < SIZE - 64 - GRID; i += 16) {
	    vector1L1S(offset_load, offset_store, i + 0);
	    vector1L1S(offset_load, offset_store, i + 4);
	    vector1L1S(offset_load, offset_store, i + 8);
	    // Skip the "i + 12" step, so we do not always go over the cache line.
        }
    }

    // This one is to prove that the split happens on the cache line.
    //
    // Note: this is written for vectorElements = 4
    public static void test2L1SVectorSkip(int offset_load, int offset_store) {
        for (int i = 0; i < SIZE - 64 - GRID; i += 16) {
	    vector2L1S(offset_load, offset_store, i + 0);
	    vector2L1S(offset_load, offset_store, i + 4);
	    vector2L1S(offset_load, offset_store, i + 8);
	    // Skip the "i + 12" step, so we do not always go over the cache line.
        }
    }

    // This one is to prove that the split happens on the cache line.
    //
    // Note: this is written for vectorElements = 4
    public static void test3L1SVectorSkip(int offset_load, int offset_store) {
        for (int i = 0; i < SIZE - 64 - GRID; i += 16) {
	    vector3L1S(offset_load, offset_store, i + 0);
	    vector3L1S(offset_load, offset_store, i + 4);
	    vector3L1S(offset_load, offset_store, i + 8);
	    // Skip the "i + 12" step, so we do not always go over the cache line.
        }
    }

    // Requires aliasing analysis runtime check JDK-8324751 to vectorize.
    public static void test1L1SScalar(int offset_load, int offset_store) {
        for (int i = 0; i < SIZE - GRID; i++) {
	    int v = arr1[base1 + i + offset_load];
	    arr0[base0 + i + offset_store] = v;
        }
    }

    // Requires aliasing analysis runtime check JDK-8324751 to vectorize.
    public static void test2L1SScalar(int offset_load, int offset_store) {
        for (int i = 0; i < SIZE - GRID; i++) {
	    int v0 = arr1[base1 + i + offset_load];
	    int v1 = arr2[base2 + i + offset_load];
	    var v = v0 + v1;
	    arr0[base0 + i + offset_store] = v;
        }
    }

    // Requires aliasing analysis runtime check JDK-8324751 to vectorize.
    public static void test3L1SScalar(int offset_load, int offset_store) {
        for (int i = 0; i < SIZE - GRID; i++) {
	    int v0 = arr1[base1 + i + offset_load];
	    int v1 = arr2[base2 + i + offset_load];
	    int v2 = arr3[base3 + i + offset_load];
	    var v = v0 + v1 + v2;
	    arr0[base0 + i + offset_store] = v;
        }
    }

    // Vectorizes even without JDK-8324751, but requires -XX:LoopUnrollLimit=10000 because loop body is large.
    // Automatic alignment is ineffective here, because of the hand-unrolling -> pre-loop cannot change alignment.
    // Gets us some funky patterns, as automatic alignment sometimes seems to actually make things slightly worse.
    //
    // Note: this test does not react to vectorElements.
    public static void test2L1SScalarRearranged(int offset_load, int offset_store) {
        for (int i = 0; i < SIZE - 4 - GRID; i+=4) {
	    int v00 = arr1[base1 + i + offset_load + 0];
	    int v10 = arr1[base1 + i + offset_load + 1];
	    int v20 = arr1[base1 + i + offset_load + 2];
	    int v30 = arr1[base1 + i + offset_load + 3];
	    int v01 = arr2[base2 + i + offset_load + 0];
	    int v11 = arr2[base2 + i + offset_load + 1];
	    int v21 = arr2[base2 + i + offset_load + 2];
	    int v31 = arr2[base2 + i + offset_load + 3];
	    var v0 = v00 + v01;
	    var v1 = v10 + v11;
	    var v2 = v20 + v21;
	    var v3 = v30 + v31;
	    arr0[base0 + i + offset_store + 0] = v0;
	    arr0[base0 + i + offset_store + 1] = v1;
	    arr0[base0 + i + offset_store + 2] = v2;
	    arr0[base0 + i + offset_store + 3] = v3;
        }
    }
}

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8355094: Performance drop in auto-vectorized kernel due to split store (Bug - P3)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/25065/head:pull/25065
$ git checkout pull/25065

Update a local copy of the PR:
$ git checkout pull/25065
$ git pull https://git.openjdk.org/jdk.git pull/25065/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 25065

View PR using the GUI difftool:
$ git pr show -t 25065

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/25065.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented May 6, 2025

👋 Welcome back epeter! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented May 6, 2025

@eme64 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8355094: Performance drop in auto-vectorized kernel due to split store

Reviewed-by: vlivanov, thartmann

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 50 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot changed the title JDK-8355094 8355094: Performance drop in auto-vectorized kernel due to split store May 6, 2025
@openjdk

openjdk bot commented May 6, 2025

@eme64 The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label May 6, 2025
@eme64 eme64 marked this pull request as ready for review May 15, 2025 07:37
@openjdk openjdk bot added the rfr Pull request is ready for review label May 15, 2025
@mlbridge

mlbridge bot commented May 15, 2025

Webrevs

Contributor

@mhaessig mhaessig left a comment


Thank you for the deep investigation, the excellent report, and most of all the colorful plots!

I found a typo, but otherwise the hotspot changes look good to me. I cannot review the benchmarks, unfortunately.

Co-authored-by: Manuel Hässig <manuel@haessig.org>
Contributor

@iwanowww iwanowww left a comment


Impressive analysis, Emanuel! Very deep, thorough, and insightful.

Looks good.

Speaking of Vector API, we experimented with getting access alignment under control. Unfortunately, when it comes to on-heap accesses it boils down to hyper-aligned objects support which is not there yet.

PS: yay, you found a way to turn PRs into blog posts! :-)

@openjdk openjdk bot added the ready Pull request is ready to be integrated label May 15, 2025
@eme64
Contributor Author

eme64 commented May 16, 2025

@iwanowww Thanks for your kind words 😊

Indeed: on-heap access would benefit from hyper-aligned objects. Are there any ideas on how to do that? I wonder if it is worth it, or if it is good enough to just use off-heap (native) MemorySegments to guarantee alignment for very performance-critical cases?
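For the off-heap route mentioned above: with the Foreign Function & Memory API (Java 22+), an Arena can allocate native memory with an explicit alignment guarantee, so accesses can be kept cacheline-aligned by construction. A minimal sketch (the class name `AlignedOffHeap` and the sizes are made up for illustration):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class AlignedOffHeap {
    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            // Request 1 KiB of native memory with 64-byte (cacheline) alignment.
            MemorySegment seg = arena.allocate(1024, 64);
            // The FFM API guarantees the base address honors the alignment.
            System.out.println("aligned: " + (seg.address() % 64 == 0));
            // Accesses at int-indexed offsets now have a known alignment
            // relative to cacheline boundaries.
            for (long i = 0; i < 1024 / 4; i++) {
                seg.setAtIndex(ValueLayout.JAVA_INT, i, (int) i);
            }
            System.out.println(seg.getAtIndex(ValueLayout.JAVA_INT, 7));
        }
    }
}
```

Unlike on-heap arrays, whose base address the GC may place (and move) at only 8- or 16-byte granularity, this gives a fixed, caller-chosen alignment for the performance-critical buffer.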

Member

@TobiHartmann TobiHartmann left a comment


Impressive analysis, Emanuel! Very deep, thorough, and insightful.

+1 to this. Great work, Emanuel! The fix looks good to me.

@eme64
Contributor Author

eme64 commented May 19, 2025

@TobiHartmann Thank you for the review :)

@theRealAph @XiaohongGong Do you have any idea about the somewhat confusing behavior of aarch64 in these benchmarks?

@eme64
Contributor Author

eme64 commented May 20, 2025

@TobiHartmann @iwanowww @mhaessig Thanks for reviewing!

I'll integrate now, but we can still continue the conversation @theRealAph @XiaohongGong @jatin-bhateja .

/integrate

@openjdk

openjdk bot commented May 20, 2025

Going to push as commit 277bb20.
Since your change was applied there have been 77 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label May 20, 2025
@openjdk openjdk bot closed this May 20, 2025
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels May 20, 2025
@openjdk

openjdk bot commented May 20, 2025

@eme64 Pushed as commit 277bb20.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@XiaohongGong

@TobiHartmann Thank you for the review :)

@theRealAph @XiaohongGong Do you have any idea about the somewhat confusing behavior of aarch64 in these benchmarks?

Hi @eme64, to be honest, I'm not quite sure about the unaligned memory access behavior on AArch64 either. I tried to clarify it by reading some Arm docs, but unfortunately the main message I took away is that it is implementation-defined hardware behavior: some AArch64 micro-architectures get better performance from aligning loads rather than stores, while others prefer the opposite. That's the reality.

My colleague pointed me to several patches in the Go project that likewise use an option to prefer either load or store alignment for a memory-move library optimization on AArch64 [1][2][3]. Each AArch64 micro-architecture can then choose the optimal alignment strategy based on its performance results, and Go defaults to aligning loads on Neoverse CPUs. Hope this helps; I think the basic idea aligns with what you did in this PR. Thanks!

[1] https://go-review.googlesource.com/c/go/+/243357
[2] https://github.com/golang/go/blob/7f806c1052aa919c1c195a5b2223626beab2495c/src/runtime/cpuflags_arm64.go#L11
[3] https://go-review.googlesource.com/c/go/+/664038
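To illustrate the idea these approaches share: a scalar pre-loop peels just enough iterations that one chosen access lands on a cacheline boundary, and "align to store" vs. "align to load" differs only in whose base address is fed into that computation. A hypothetical sketch of the arithmetic (this is not what C2 or Go actually emit; `peelCount` and the example addresses are made up):

```java
public class AlignPeel {
    // Number of scalar pre-loop iterations so that the access at
    // (baseAddr + peel * elemSize) lands on a 64-byte cacheline boundary.
    // Assumes baseAddr is elemSize-aligned and elemSize divides 64.
    static long peelCount(long baseAddr, long elemSize) {
        long mis = baseAddr & 63; // misalignment within the cacheline
        return mis == 0 ? 0 : (64 - mis) / elemSize;
    }

    public static void main(String[] args) {
        // For a loop over longs (8 bytes): aligning to the store or to the
        // load just means plugging in the store's or the load's address.
        long storeBase = 0x1008L; //  8 bytes past a cacheline boundary
        long loadBase  = 0x1020L; // 32 bytes past a cacheline boundary
        System.out.println(peelCount(storeBase, 8)); // peel 7 iterations
        System.out.println(peelCount(loadBase, 8));  // peel 4 iterations
    }
}
```

Since the store and load bases generally have different misalignments, at most one of the two accesses can be made aligned this way, which is exactly the trade-off the benchmarks in this PR explore.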

@eme64
Contributor Author

eme64 commented May 22, 2025

@XiaohongGong Thanks a lot for taking the time to respond! That is very fascinating, and reassuring. Seems I'm not the only one seeing these kinds of results :)

I suppose we could add a similar flag, to target the aarch64 machines where load alignment is preferred. But from what I see the wins would be marginal, and I don't know aarch64 well enough to figure out which implementations would benefit. But if anyone wants to take this on, I'd be happy to review the PR ;)


Labels

hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated

5 participants