
8284050: [vectorapi] Optimize masked store for non-predicated architectures #8544

Closed
wants to merge 1 commit into pr/8035 from XiaohongGong:JDK-8284050

Conversation

XiaohongGong
Contributor

@XiaohongGong XiaohongGong commented May 5, 2022

Currently, masked vector stores are vectorized with a masked store instruction only on architectures that support the predicate feature. On architectures without predicate support, such as ARM NEON, the compiler falls back to the Java scalar code. However, on these systems the masked store can still be vectorized with a non-masked vector "load + blend + store" sequence. For example, storing a vector "v" into memory at address "addr" under the control of a mask "m" (i.e. "store(addr, v, m)") can be implemented with:

 1) mem_v = load(addr)     ; non-masked load from the same memory
 2) v = blend(mem_v, v, m) ; blend with the src vector with the mask
 3) store(addr, v)         ; non-masked store into the memory

Since the initial full-width load requires the array offset to be inside the valid array bounds, the compiler applies this vectorization only when the offset is within the array boundary, and still falls back to the Java scalar code when not all offsets are valid. Besides, the original per-lane offset check for masked lanes is now applied only when the offset is not guaranteed to be inside the array range. This also improves the performance of masked stores whose offsets are always valid. The whole process is similar to the masked load API.
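The three steps above can be modeled lane by lane in plain Java (an illustrative sketch, not the actual C2 IR; the real transform emits vector nodes, and the class and method names here are made up for the example):

```java
import java.util.Arrays;

public class MaskedStoreBlend {
    // Models store(addr, v, m) as load + blend + store, assuming
    // [offset, offset + v.length) is fully inside the array bounds.
    static void maskedStore(int[] a, int offset, int[] v, boolean[] m) {
        int[] memV = Arrays.copyOfRange(a, offset, offset + v.length); // 1) non-masked load
        int[] blended = new int[v.length];
        for (int i = 0; i < v.length; i++) {                           // 2) blend with the mask
            blended[i] = m[i] ? v[i] : memV[i];
        }
        System.arraycopy(blended, 0, a, offset, v.length);             // 3) non-masked store
    }

    public static void main(String[] args) {
        int[] a = {10, 20, 30, 40};
        maskedStore(a, 0, new int[]{1, 2, 3, 4}, new boolean[]{true, false, true, false});
        System.out.println(Arrays.toString(a)); // [1, 20, 3, 40]
    }
}
```

Note that step 3 rewrites every lane, including the unselected ones; this is the detail the review discussion below turns on.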

Here is the performance data for the masked vector store benchmarks on an x86 system without AVX-512, showing improvements of roughly 20x to 50x:

Benchmark                                  before    after   Units
StoreMaskedBenchmark.byteStoreArrayMask   221.733  11094.126 ops/ms
StoreMaskedBenchmark.doubleStoreArrayMask  41.086   1034.408 ops/ms
StoreMaskedBenchmark.floatStoreArrayMask   73.820   1985.015 ops/ms
StoreMaskedBenchmark.intStoreArrayMask     75.028   2027.557 ops/ms
StoreMaskedBenchmark.longStoreArrayMask    40.929   1032.928 ops/ms
StoreMaskedBenchmark.shortStoreArrayMask  135.794   5307.567 ops/ms

Similar performance gain can also be observed on ARM NEON system.

And here is the performance data on an x86 AVX-512 system, showing improvements of roughly 1.88x to 2.81x:

Benchmark                                  before     after   Units
StoreMaskedBenchmark.byteStoreArrayMask   11185.956 21012.824 ops/ms
StoreMaskedBenchmark.doubleStoreArrayMask  1480.644  3911.720 ops/ms
StoreMaskedBenchmark.floatStoreArrayMask   2738.352  7708.365 ops/ms
StoreMaskedBenchmark.intStoreArrayMask     4191.904  9300.428 ops/ms
StoreMaskedBenchmark.longStoreArrayMask    2025.031  4604.504 ops/ms
StoreMaskedBenchmark.shortStoreArrayMask   8339.389 17817.128 ops/ms

Similar performance gain can also be observed on ARM SVE system.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)

Integration blocker

 ⚠️ Dependency #8035 must be integrated first

Issue

  • JDK-8284050: [vectorapi] Optimize masked store for non-predicated architectures

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/8544/head:pull/8544
$ git checkout pull/8544

Update a local copy of the PR:
$ git checkout pull/8544
$ git pull https://git.openjdk.java.net/jdk pull/8544/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 8544

View PR using the GUI difftool:
$ git pr show -t 8544

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/8544.diff

@bridgekeeper

bridgekeeper bot commented May 5, 2022

👋 Welcome back xgong! A progress list of the required criteria for merging this PR into pr/8035 will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr label May 5, 2022
@openjdk

openjdk bot commented May 5, 2022

@XiaohongGong The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added hotspot core-libs labels May 5, 2022
@mlbridge

mlbridge bot commented May 5, 2022

Webrevs

@XiaohongGong
Contributor Author

XiaohongGong commented May 5, 2022

/label add hotspot-compiler

@openjdk openjdk bot added the hotspot-compiler label May 5, 2022
@openjdk

openjdk bot commented May 5, 2022

@XiaohongGong
The hotspot-compiler label was successfully added.

Contributor

@rose00 rose00 left a comment

The JIT (in all other circumstances AFAIK) never produces "phantom stores", stores into Java variables which are not specified as the target of a JVM store instruction (putfield, dastore, etc.). The fact that a previously-read value is used by the phantom store does not make it any better.

Yes, the memory states may be correct after the blend and store is done, but the effect on the Java Memory Model is to issue the extra phantom stores of the unselected array elements. Under certain circumstances, this will create race conditions after the optimization where there were no race conditions before the optimization. Other threads could (under Java Memory Model rules) witness the effects of the phantom stores. If the Java program is properly synchronized, the introduction of an illegitimate race condition can cause another thread, now in an illegal race, to see an old value in a variable (the recopied unselected array element) which the JMM says is impossible.

Yes, this only shows up in multi-threaded programs, and ones where two threads step on one array, but Java is a multi-threaded language, and it must conform to its own specification as such.
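The hazard can be replayed deterministically in a single thread by interleaving the second thread's write between the load and the store (a sketch of the interleaving described above; the thread labels and values are hypothetical):

```java
public class PhantomStoreRace {
    // Replays, in program order, the problematic interleaving:
    // thread A emulates a masked store with load + blend + store, while
    // thread B writes an element that A's mask does NOT select.
    public static void main(String[] args) {
        int[] a = {10, 20, 30, 40};
        boolean[] m = {true, false, true, false};
        int[] v = {1, 0, 3, 0};

        int[] memV = a.clone();                // thread A, step 1: full load (sees a[1] == 20)

        a[1] = 99;                             // thread B: legal write to an unselected lane

        for (int i = 0; i < a.length; i++) {   // thread A, steps 2-3: blend + full store
            a[i] = m[i] ? v[i] : memV[i];
        }

        System.out.println(a[1]);              // 20: thread B's write was silently undone
    }
}
```

A true per-lane masked store would never touch `a[1]`, so thread B's value 99 would survive; the phantom store writes back the stale 20.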

This blend technique would be very reasonable if there is no race condition. (Except at the very beginning or end of arrays.) And the JMM leaves room for many optimizations. And yet I think this is a step too far. I'd like to be wrong about this, but I don't think I am.

So, I think you need to use a different technique, other than blend-and-unpredicated-store, for masked stores on non-predicated architectures.

For example, you could simulate a masked store instruction on an architecture that supports scatter (scattering values of the array type). Do this by setting up two vectors of machine pointers. One vector points to each potentially-affected element of the array (some kind of index computation plus a scaled iota vector). The other vector is set up similarly, but points into a fixed-sized, thread-local buffer, what I would call the "bit bucket". Blend the addresses, and then scatter, so that selected array lanes are updated, and unselected values are sent to the "bit bucket".

This is complex enough (and platform-dependent enough) that you probably need to write a hand-coded assembly language subroutine, to call from the JIT code. Sort of like the arraycopy stubs.

It's even more work than the proposed patch here, but it's the right thing, I'm afraid.
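The bit-bucket scatter idea can be modeled lane by lane in Java (an illustrative sketch only: the real stub would blend vectors of machine pointers and issue a hardware scatter, and the names `BIT_BUCKET` and `maskedStoreScatter` are made up here):

```java
public class ScatterBitBucket {
    static final int VLEN = 4; // illustrative vector length

    // Fixed-size, thread-local scratch buffer that absorbs unselected lanes.
    static final ThreadLocal<int[]> BIT_BUCKET =
            ThreadLocal.withInitial(() -> new int[VLEN]);

    static void maskedStoreScatter(int[] a, int offset, int[] v, boolean[] m) {
        int[] bucket = BIT_BUCKET.get();
        for (int lane = 0; lane < VLEN; lane++) {
            // "Blend the addresses": selected lanes target the array,
            // unselected lanes target the bit bucket.
            int[] dst = m[lane] ? a : bucket;
            int idx = m[lane] ? offset + lane : lane;
            dst[idx] = v[lane]; // unconditional scatter; no phantom store reaches a
        }
    }

    public static void main(String[] args) {
        int[] a = {10, 20, 30, 40};
        maskedStoreScatter(a, 0, new int[]{1, 2, 3, 4}, new boolean[]{true, false, true, false});
        System.out.println(java.util.Arrays.toString(a)); // [1, 20, 3, 40]
    }
}
```

Because every write to the Java array happens only on a selected lane, the JMM objection above does not apply: unselected lanes are never stored to.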

} else {
// Use the vector blend to implement the masked store. The biased elements are the original
// values in the memory.
Node* mem_val = gvn().transform(LoadVectorNode::make(0, control(), memory(addr), addr, addr_type, mem_num_elem, mem_elem_bt));
Contributor

@rose00 rose00 May 5, 2022

I'm sorry to say it, but I am pretty sure this is an invalid optimization.
See top-level comment for more details.

Contributor Author

@XiaohongGong XiaohongGong May 5, 2022

Thanks for your comments! Yes, this does interfere with the Java Memory Model rules, which I failed to consider fully. I will try the scatter approach instead. Thanks so much!

Member

@jatin-bhateja jatin-bhateja May 5, 2022

Yes, a phantom store can write back a stale, unintended value and may create problems in multithreaded applications, since blending is done with an older loaded value.

@mlbridge

mlbridge bot commented May 5, 2022

Mailing list message from Hans Boehm on hotspot-dev:

Naive question: What happens if one of the vector elements that should not
have been updated is concurrently being written by another thread? Aren't
you generating writes to vector elements that should not have been written?

Hans

On Wed, May 4, 2022 at 7:08 PM Xiaohong Gong <xgong at openjdk.java.net> wrote:

@XiaohongGong
Contributor Author

XiaohongGong commented May 5, 2022

Mailing list message from Hans Boehm on hotspot-dev:

Naive question: What happens if one of the vector elements that should not have been updated is concurrently being written by another thread? Aren't you generating writes to vector elements that should not have been written?

Hans

On Wed, May 4, 2022 at 7:08 PM Xiaohong Gong wrote:

Yes, this is the same concern @rose00 raised above. The current solution does not work correctly for multi-threaded programs. I will look into better solutions. Thanks for the comments!

if (offset >= 0 && offset <= (a.length - vsp.length())) {
intoBooleanArray0(a, offset, m, /* offsetInRange */ true);
Contributor Author

@XiaohongGong XiaohongGong May 5, 2022

The offset check saves the checkMaskFromIndexSize call for cases where the offset is within the valid array bounds, which also improves performance. @rose00, do you think this part of the change is OK at least? Thanks!
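The fast/slow-path split this check enables can be sketched as follows (a standalone model: the real implementation calls JDK-internal helpers such as `checkMaskFromIndexSize` and an intrinsified store, and the names here are stand-ins):

```java
public class BoundsFastPath {
    static void intoArray(int[] a, int offset, int[] v, boolean[] m) {
        if (offset >= 0 && offset <= a.length - v.length) {
            // Fast path: every lane index is in bounds whether masked or not,
            // so no per-lane index validation is needed (offsetInRange == true).
            storeMasked(a, offset, v, m);
        } else {
            // Slow path: first validate that every *set* lane stays in bounds.
            for (int i = 0; i < v.length; i++) {
                if (m[i] && (offset + i < 0 || offset + i >= a.length)) {
                    throw new IndexOutOfBoundsException("lane " + i);
                }
            }
            storeMasked(a, offset, v, m);
        }
    }

    // Scalar stand-in for the intrinsified masked store.
    static void storeMasked(int[] a, int offset, int[] v, boolean[] m) {
        for (int i = 0; i < v.length; i++) {
            if (m[i]) a[offset + i] = v[i];
        }
    }
}
```

On the fast path the per-lane validation disappears entirely, which is where the performance gain for always-in-bounds offsets comes from.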

@mlbridge

mlbridge bot commented May 5, 2022

Mailing list message from John Rose on hotspot-dev:

On May 4, 2022, at 8:29 PM, Xiaohong Gong <xgong at openjdk.java.net> wrote:

The offset check saves the `checkMaskFromIndexSize` call for cases where the offset is within the valid array bounds, which also improves performance. @rose00 , do you think this part of the change is OK at least?

That part is ok, yes. I wish we could get the same effect with loop optimizations but I don't know an easy way. The explicit check in the source code gives the JIT a crutch but I hope we can figure out a way in the future to integrate mask logic into range check elimination logic, making the crutches unnecessary. For now it's fine.

@XiaohongGong
Contributor Author

XiaohongGong commented May 5, 2022

Mailing list message from John Rose on hotspot-dev:

On May 4, 2022, at 8:29 PM, Xiaohong Gong wrote:
The offset check saves the checkMaskFromIndexSize call for cases where the offset is within the valid array bounds, which also improves performance. @rose00 , do you think this part of the change is OK at least?

That part is ok, yes. I wish we could get the same effect with loop optimizations but I don't know an easy way. The explicit check in the source code gives the JIT a crutch but I hope we can figure out a way in the future to integrate mask logic into range check elimination logic, making the crutches unnecessary. For now it's fine.

Thanks! So I will separate this part out and fix it in another PR first. As for vectorizing the masked store with scatter or other approaches, I'm not sure they will always be beneficial across architectures, so more investigation is needed. I prefer to close this PR for now. Thanks for all your comments!

@XiaohongGong XiaohongGong deleted the JDK-8284050 branch May 5, 2022