8276083: Incremental patch to further optimize new compress/expand APIs over X86 #157

jatin-bhateja · 2021-10-28T10:23:31Z

Summary of changes:

Added both scalar and vector variants of JMH performance tests for Vector.compress/expand and VectorMask.compress APIs.
Improved performance of operations where the mask length is less than 4. Mask loading is a two-stage process: first the boolean array is loaded into a vector, and then it is transferred either to a predicate register or to a vector whose size matches that of the underlying SPECIES. A mask whose length is less than 4 results in a vector load of less than 32 bits. Operations that depend on such small masks are now handled in the Java-side implementation of these and some other APIs. Because the condition that selects between the special handling and the fallback logic leading to the C2 intrinsic call is a constant expression, one of the control paths is optimized out. This should also prevent the performance penalty from failed lazy inline expansion, which most often occurs because of unsupported vector sizes. If lazy inline expansion fails, C2 emits a direct call instruction to the callee method, and we also lose any opportunity for procedure inlining at that point; a separate issue has been created to address this problem.
Improved performance of VectorMask.compress over legacy non-AVX512 targets; added the missing checks in the C2Compiler::is_intrinsics_supported routine to enable procedure inlining early during parsing if the target does not support direct compress/expand instructions.
Inline expand VectorMask.intoArray operation to trigger boxing-unboxing optimization. This significantly improved the performance of VectorMask.compress in newly added JMH micros.
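
For readers unfamiliar with the operations being benchmarked, the following is a minimal scalar sketch of the lane semantics of Vector.compress, Vector.expand and VectorMask.compress. It is written against plain arrays so that it runs without the incubator module; the class name `CompressExpandSketch` and its method names are hypothetical and are not taken from the patch or from its JMH micros.

```java
import java.util.Arrays;

// Scalar reference sketch of compress/expand lane semantics (illustrative only).
final class CompressExpandSketch {

    // Packs the lanes of src selected by mask into the lowest lanes of the
    // result, preserving order; the remaining lanes are zero.
    static int[] compress(int[] src, boolean[] mask) {
        int[] dst = new int[src.length];
        int j = 0;
        for (int i = 0; i < src.length; i++) {
            if (mask[i]) {
                dst[j++] = src[i];
            }
        }
        return dst;
    }

    // Inverse direction: consecutive low lanes of src are placed into the
    // lanes where mask is set; unselected lanes are zero.
    static int[] expand(int[] src, boolean[] mask) {
        int[] dst = new int[src.length];
        int j = 0;
        for (int i = 0; i < src.length; i++) {
            if (mask[i]) {
                dst[i] = src[j++];
            }
        }
        return dst;
    }

    // Mask compress: the result has as many set lanes as the input mask,
    // packed contiguously at the low end.
    static boolean[] compressMask(boolean[] mask) {
        boolean[] dst = new boolean[mask.length];
        int j = 0;
        for (boolean set : mask) {
            if (set) {
                dst[j++] = true;
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        int[] v = {10, 20, 30, 40};
        boolean[] m = {false, true, false, true};
        System.out.println(Arrays.toString(compress(v, m)));
        System.out.println(Arrays.toString(expand(compress(v, m), m)));
        System.out.println(Arrays.toString(compressMask(m)));
    }
}
```

Expanding a compressed vector under the same mask restores the selected lanes to their original positions and zeroes the rest, which is a convenient property to check in a scalar baseline.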

Following is the performance data for included JMH micros:
System Configuration: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server 40C 2S)

A) VectorMask.compress:

B) Vector.compress:

C) Vector.expand:

The patch has been regression-tested with tier3 at various AVX levels (0/1/2/3/KNL).

Kindly review and share your feedback.

Best Regards,
Jatin

Progress

Change must not contain extraneous whitespace
Change must be properly reviewed

Issue

JDK-8276083: Incremental patch to further optimize new compress/expand APIs over X86

Reviewers

Paul Sandoz (@PaulSandoz - Committer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/panama-vector pull/157/head:pull/157
$ git checkout pull/157

Update a local copy of the PR:
$ git checkout pull/157
$ git pull https://git.openjdk.java.net/panama-vector pull/157/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 157

View PR using the GUI difftool:
$ git pr show -t 157

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/panama-vector/pull/157.diff

bridgekeeper · 2021-10-28T10:24:24Z

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into vectorIntrinsics+compress will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

mlbridge · 2021-10-28T11:31:41Z

Webrevs

PaulSandoz · 2021-10-28T15:53:35Z

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/X-VectorBits.java.template

@@ -933,10 +933,19 @@ final class $vectortype$ extends $abstractvectortype$ {
        @Override
        @ForceInline
        public $masktype$ compress() {
+            if (VLENGTH < 4) {


This impacts the code of every vector and across every architecture. I realize it's dead code for many cases, but we need to find a better way to express and manage this. IMHO this is not a maintainable solution.

The fallback would be an obvious place for this logic.

We can push this into the fallback path, but in that case C2 will not be able to inline the fallback logic after a failed lazy intrinsification attempt. I collected perf data with and without this and could clearly see a benefit (around 1.5x) from keeping it outside the fallback. Since the conditions are guarded by constant expressions, only one of the paths will be JIT-compiled.

I think we do support 2-byte vector loads for AARCH64, but currently the minimum vector size over X86 is 4 bytes.
I agree with you: since this is an optimization on a slow path, it's OK to give up some gain to keep consistency across architectures.

Yes, I think that is reasonable for now. We might be able to revisit this later. Generally, improving the fallback path is an area we have yet to focus on.

I think the most important aspects to focus on at the moment are the common cases where I am presuming the vector length is likely to be >= 4.
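
The constant-guard pattern debated in this thread can be sketched in plain Java as follows. This is only an illustration of the originally proposed `if (VLENGTH < 4)` shape, not the code that was integrated; `SmallMaskGuardSketch`, `compressSmall` and `compressGeneral` are hypothetical names, and `compressGeneral` is merely a stand-in for the intrinsic-backed path.

```java
// Illustrative sketch of a branch guarded by a compile-time constant.
final class SmallMaskGuardSketch {

    // Hypothetical lane count; a constant, like VLENGTH in the template.
    static final int VLENGTH = 2;

    static boolean[] compress(boolean[] bits) {
        // The condition is a constant expression, so a JIT compiler can
        // drop whichever branch is not taken for this lane count.
        if (VLENGTH < 4) {
            return compressSmall(bits);
        }
        return compressGeneral(bits);
    }

    // Plain Java handling for very small masks: count the set lanes and
    // set that many lanes at the low end of the result.
    private static boolean[] compressSmall(boolean[] bits) {
        int trueCount = 0;
        for (boolean b : bits) {
            if (b) {
                trueCount++;
            }
        }
        boolean[] res = new boolean[bits.length];
        for (int i = 0; i < trueCount; i++) {
            res[i] = true;
        }
        return res;
    }

    // Stand-in for the general path (an intrinsic call in the real code).
    private static boolean[] compressGeneral(boolean[] bits) {
        boolean[] res = new boolean[bits.length];
        int j = 0;
        for (boolean b : bits) {
            if (b) {
                res[j++] = true;
            }
        }
        return res;
    }
}
```

The maintainability concern raised above is visible even in this sketch: the guard sits in the shared entry point and therefore appears in the generated code for every vector shape, whether or not the small-mask branch is ever live.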

PaulSandoz

How did the perf tests get generated? I don't see any related changes to see how they would be generated.

jatin-bhateja · 2021-10-29T06:08:02Z

> How did the perf tests get generated? I don't see any related changes to see how they would be generated.

Templates added. The generated files still needed some manual editing, since after the scatter/gather test overhaul the scatter/gather perf tests and templates no longer exist. This can be fixed in a subsequent patch.

jatin-bhateja · 2021-11-01T16:58:29Z

@PaulSandoz , your comments are addressed. Please let me know if there are further comments.

PaulSandoz · 2021-11-01T22:19:46Z

> @PaulSandoz , your comments are addressed. Please let me know if there are further comments.

Did you run gen-tests.sh?

I tried running that on a clone of your branch and it generates more code, some of which does not compile:

1. maskCompress iterates over a non-existent array a.
2. Benchmarks for ROR and ROL are generated and the scalar benchmarks refer to non-existent methods ROR_scalar and ROL_scalar.
3. The gather/scatter benchmarks are deleted. (They got removed when I switched the scatter/gather unit tests over to the load/store files. See 8266518: Refactor and expand scatter/gather tests jdk17#48)

So it looks like we got out of sync, likely via merges from mainline. We should at least fix 2 and possibly reconsider how to do 3.

jatin-bhateja · 2021-11-02T04:33:34Z

> > @PaulSandoz , your comments are addressed. Please let me know if there are further comments.
>
> Did you run gen-tests.sh?
>
> I tried running that on a clone of your branch and it generates more code, some of which does not compile:
>
> 1. maskCompress iterates over a non-existent array a.
> 2. Benchmarks for ROR and ROL are generated and the scalar benchmarks refer to non-existent methods ROR_scalar and ROL_scalar.
> 3. The gather/scatter benchmarks are deleted. (They got removed when I switched the scatter/gather unit tests over to the load/store files. See 8266518: Refactor and expand scatter/gather tests jdk17#48)
>
> So it looks like we got out of sync, likely via merges from mainline. We should at least fix 2 and possibly reconsider how to do 3.

Hi @PaulSandoz ,

I have fixed item 1; it was a typo. As already mentioned, I did some manual editing because the existing generation is broken.

Would it be OK to post the fixes for items 2 and 3 in an immediate follow-up patch once this gets integrated?

The changes in this patch are limited to compress/expand and its performance improvements.

PaulSandoz · 2021-11-02T15:43:50Z

Yes, we can fix 2/3 separately independently of this branch. In fact i think for 3 we should consider a separate template like with the unit tests.

PaulSandoz

Java changes are good.

openjdk · 2021-11-02T15:53:09Z

@jatin-bhateja This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8276083: Incremental patch to further optimize new compress/expand APIs over X86

Reviewed-by: psandoz

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the vectorIntrinsics+compress branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the vectorIntrinsics+compress branch, type /integrate in a new comment.

jatin-bhateja · 2021-11-04T16:39:14Z

Thanks @PaulSandoz. Integrating this patch.

jatin-bhateja · 2021-11-04T16:40:00Z

/integrate

openjdk · 2021-11-04T16:41:33Z

Going to push as commit 821635c.

openjdk · 2021-11-04T16:41:46Z

@jatin-bhateja Pushed as commit 821635c.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

sviswa7 · 2021-11-04T20:45:15Z

src/hotspot/share/opto/c2compiler.cpp

+  case vmIntrinsics::_VectorComExp:
+    if (!Matcher::match_rule_supported(Op_CompressM)) return false;
+    if (!Matcher::match_rule_supported(Op_CompressV)) return false;
+    break;


Instead of break, this should return EnableVectorSupport.

Jatin Bhateja added 2 commits October 27, 2021 23:24

8276083: Incremental patch to further optimize new compress/expand AP…

dd7a418

…Is over X86

8276083: jcheck violation clearance

6381943

openjdk bot added the rfr label Oct 28, 2021

PaulSandoz reviewed Oct 28, 2021


Jatin Bhateja added 2 commits October 29, 2021 00:31

8276083: Review comments resolved.

ea6f008

8276083: Adding performance test generation templates.

9417717

PaulSandoz reviewed Oct 28, 2021


8276083: Correcting a typo in CompressExpand template file.

f0edaeb

PaulSandoz approved these changes Nov 2, 2021


openjdk bot added the ready label Nov 2, 2021

openjdk bot closed this Nov 4, 2021

openjdk bot added integrated and removed ready rfr labels Nov 4, 2021

jatin-bhateja mentioned this pull request Nov 4, 2021

VectorMask.intoArray intrinsics #160

Closed

2 tasks

sviswa7 reviewed Nov 4, 2021


jatin-bhateja mentioned this pull request Nov 9, 2021

Improve mask reduction operations on AVX #158

Closed

2 tasks
