Skip to content
This repository has been archived by the owner. It is now read-only.

8278267: ARM32: several vector test failures for ASHR #41

Closed
wants to merge 4 commits into from

Conversation

shqking
Copy link

@shqking shqking commented Dec 17, 2021

In ARM32, "VSHL (register)" instruction [1] is shared by vector left
shift and vector right shift, and the condition to distinguish them is
whether the shift count value is positve or negative. Hence, negation
operation is needed before conducting vector right shift.

For vector right shift, the shift count can be a RShiftCntV or a normal
vector node. Take test case Byte64VectorTests.java [2][3] as an example.
Note that RShiftCntV is already negated via rules "vsrcntD" and
"vsrcntX" whereas the normal vector node is NOT, since we don't know
whether a normal vector node is used as a vector shift count or not.
This is the root cause for these vector test failures.

The fix is simple, moving the negation from "vsrcntD|X" to the
corresponding vector right shift rules.

Affected rules are vsrlBB_reg and vsraBB_reg. Note that vector shift
related rules are in form of "vsAABB_CC", where

  1. AA can be l (left shift), rl (logical right shift) and ra (arithmetic
    right shift).
  2. BB can be 8B/16B (byte type), 4S/8S (short type), 2I/4I (int type)
    and 2L (long type).
  3. CC can be reg (register case) and immI (immediate case).

Minor updates:

  1. Merge "vslcntD" and "vsrcntD" into rule "vscntD", as these two rules
    conduct the same duplication operation now.
  2. Update the "match" primitive for vsraBB_immI rules.
  3. Style issue: remove the surrounding space for "ins_pipe" primitive.

Tests:
We ran tier 1~3 tests on ARM32 platform. With this patch, previously
failed vector test cases can pass now without introducing test
regression.

[1] https://developer.arm.com/documentation/ddi0406/c/Application-Level-Architecture/Instruction-Details/Alphabetical-list-of-instructions/VSHL--register-?lang=en
[2] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2237
[3] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2425


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8278267: ARM32: several vector test failures for ASHR

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk18 pull/41/head:pull/41
$ git checkout pull/41

Update a local copy of the PR:
$ git checkout pull/41
$ git pull https://git.openjdk.java.net/jdk18 pull/41/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 41

View PR using the GUI difftool:
$ git pr show -t 41

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk18/pull/41.diff

In ARM32, "VSHL (register)" instruction [1] is shared by vector left
shift and vector right shift, and the condition to distinguish them is
whether the shift count value is positve or negative. Hence, negation
operation is needed before conducting vector right shift.

For vector right shift, the shift count can be a RShiftCntV or a normal
vector node. Take test case Byte64VectorTests.java [2][3] as an example.
Note that RShiftCntV is already negated via rules "vsrcntD" and
"vsrcntX" whereas the normal vector node is NOT, since we don't know
whether a normal vector node is used as a vector shift count or not.
This is the root cause for these vector test failures.

The fix is simple, moving the negation from "vsrcntD|X" to the
corresponding vector right shift rules.

Affected rules are vsrlBB_reg and vsraBB_reg. Note that vector shift
related rules are in form of "vsAABB_CC", where
1) AA can be l (left shift), rl (logical right shift) and ra (arithmetic
  right shift).
2) BB can be 8B/16B (byte type), 4S/8S (short type), 2I/4I (int type)
  and 2L (long type).
3) CC can be reg (register case) and immI (immediate case).

Minor updates:
1) Merge "vslcntD" and "vsrcntD" into rule "vscntD", as these two rules
conduct the same duplication operation now.
2) Update the "match" primitive for vsraBB_immI rules.
3) Style issue: remove the surrounding space for "ins_pipe" primitive.

Tests:
We ran tier 1~3 tests on ARM32 platform. With this patch, previously
failed vector test cases can pass now without introducing test
regression.

[1] https://developer.arm.com/documentation/ddi0406/c/Application-Level-Architecture/Instruction-Details/Alphabetical-list-of-instructions/VSHL--register-?lang=en
[2] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2237
[3] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2425
@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Dec 17, 2021

👋 Welcome back haosun! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr label Dec 17, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Dec 17, 2021

@shqking The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler label Dec 17, 2021
@mlbridge
Copy link

@mlbridge mlbridge bot commented Dec 17, 2021

Webrevs

@chhagedorn
Copy link
Member

@chhagedorn chhagedorn commented Dec 20, 2021

@nick-arm could you or someone else at ARM help to review this? That would be great.

nsjian
nsjian approved these changes Dec 21, 2021
Copy link

@nsjian nsjian left a comment

The fix looks good to me (not a Reviewer).

@dean-long
Copy link
Member

@dean-long dean-long commented Dec 21, 2021

There seems to be an interesting history here.

Method 1: negate in RShiftCntV rule
Method 2: negate in RShiftV* rules

For aarch64, the negate was moved into the shift instruction in JDK-8213134 (Method 1 --> Method 2). Then JDK-8262916 proposed to move it back out of the shift instruction again. In that PR, the opinion that Method 1 (arm32) generated better code than Method 2 (aarch64) was expressed. Now it looks like this PR proposes for arm32 to move to Method 2 like aarch64, so I suspect that there will be a performance impact.

I think there is a simpler fix that doesn't require moving where the negate happens. JDK-8277239 fixed a similar problem by adding a flag on vector shift nodes to indicate variable shift, then checking the flag in the predicate. Perhaps arm32 could do the same?

@nsjian
Copy link

@nsjian nsjian commented Dec 21, 2021

I think there is a simpler fix that doesn't require moving where the negate happens. JDK-8277239 fixed a similar problem by adding a flag on vector shift nodes to indicate variable shift, then checking the flag in the predicate. Perhaps arm32 could do the same?

Hmm, I didn't notice JDK-8277239 before. That looks like a good solution. If it works fine on arm32, we could also do it on aarch64.

…ht shifts

Method is_var_shift() denotes that vector shift count is a variable
shift:
1) for this case, vector shift count should be negated before conducting
   right shifts. E.g., vsrl4S_reg_var rule.
2) for the opposite case, vector shift count is generated via RShiftCntV
   rules and is already negated there. Hence, no negation is needed.
   E.g., vsrl4S_reg rule.

Besides, it's safe to add "hash()" and "cmp()" methods for ShiftV node.
@shqking
Copy link
Author

@shqking shqking commented Dec 24, 2021

There seems to be an interesting history here.

Method 1: negate in RShiftCntV rule Method 2: negate in RShiftV* rules

For aarch64, the negate was moved into the shift instruction in JDK-8213134 (Method 1 --> Method 2). Then JDK-8262916 proposed to move it back out of the shift instruction again. In that PR, the opinion that Method 1 (arm32) generated better code than Method 2 (aarch64) was expressed. Now it looks like this PR proposes for arm32 to move to Method 2 like aarch64, so I suspect that there will be a performance impact.

I think there is a simpler fix that doesn't require moving where the negate happens. JDK-8277239 fixed a similar problem by adding a flag on vector shift nodes to indicate variable shift, then checking the flag in the predicate. Perhaps arm32 could do the same?

Thanks a lot for your explanation. I do agree with your solution.

Updated the code: 1) using is_var_shift() to determine to put "negation" in RShiftCntV or RShiftV* rules, and 2) adding cmp and hash for ShiftV node.

Could you please take a look at the latest code when you have a chance? Maybe after the new year :)
Thanks.

@shqking
Copy link
Author

@shqking shqking commented Dec 24, 2021

Tests on the latest code:
We ran tier 1~3 tests on linux+ARM32 platform. With this patch, previously failed vector test cases can pass now without introducing test regression.

@bulasevich
Copy link

@bulasevich bulasevich commented Dec 24, 2021

The change is Ok for me (I am not a reviewer).
Though I would avoid changing the style. If we want to change a style, I would suggest to change all the entries (there is a lot of surrounding spaces remaining in a file), and I think FIXME for ins_cost/ins_pipe are outdated as well.

1. logical left shift rules
a). add is_var_shift check for vslAA_immI rules.
b). for vslAA_reg rules, remove the matching for URShiftV cases as we
have the separate logical right shift rules now.

2. logical right shift rules
a). add vsrlAA_reg and vsrlAA_reg_var rules.
b). add is_var_shift check for vsrlAA_immI rules.

3. arithmetic right shift rules
a). add is_var_shift check for vsraAA_reg rules.
b). add vsraAA_reg_var rules
c). for vsraAA_immI rules, add is_var_shift check and update the match
primitive.

Code style issues(FIXME and the surrounding space in ins_pipe):
1. for modified rules, keep it as it was
2. for newly added rules, update the style
@shqking
Copy link
Author

@shqking shqking commented Dec 28, 2021

The change is Ok for me (I am not a reviewer). Though I would avoid changing the style. If we want to change a style, I would suggest to change all the entries (there is a lot of surrounding spaces remaining in a file), and I think FIXME for ins_cost/ins_pipe are outdated as well.

Thanks for your review.
Yes. We may want to use another RFE to change the style issue for the whole file.
Hence, in the latest commit I made minimal updates to existing rules, and only updated the style in the newly added rules.

@@ -10790,7 +10792,7 @@ instruct vsl16B_reg(vecX dst, vecX src, vecX shift) %{
%}

instruct vsl8B_immI(vecD dst, vecD src, immI shift) %{
predicate(n->as_Vector()->length() == 8);
predicate(n->as_Vector()->length() == 8 && !n->as_ShiftV()->is_var_shift());
Copy link
Member

@dean-long dean-long Jan 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is checking for is_var_shift() really necessary for these immI rules? It seems like it should never happen, so it could be an assert.

Copy link
Author

@shqking shqking Jan 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An assertion should be better, but I didn't find a proper way to add one assertion for instruct according to src/hotspot/share/adlc/Doc/Syntax.doc.

Hence I'd like to put this check to predicate, and "bad AD file` assertion would be issued for unexpected cases.

Copy link
Member

@dean-long dean-long Jan 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can put the assert into a helper function. You can look in aarch64.ad for examples.

bool assert_not_var_shift(const Node *n) {
assert(!n->as_ShiftV()->is_var_shift(), "illegal var shift");
return true;
}

[...]

predicate(n->as_Vector()->length() == 8 && assert_not_var_shift(n));

Copy link
Author

@shqking shqking Jan 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for you suggestion. Will update soon.

Copy link
Author

@shqking shqking Jan 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dean-long I believe your comments are addressed in the latest commit. Could you please help to take a review when you got a chance? Thanks.

Copy link
Member

@dean-long dean-long Jan 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, your changes look good. Approved.

virtual uint hash() const { return VectorNode::hash() + _is_var_shift; }
virtual bool cmp(const Node& n) const {
return VectorNode::cmp(n) && _is_var_shift == ((ShiftVNode&)n)._is_var_shift;
}
Copy link
Member

@dean-long dean-long Jan 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it important to add hash() and cmp()?

Copy link
Author

@shqking shqking Jan 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's safe to add them to avoid mis-optimizations at IR level.

Per my understanding, two ShiftV nodes with the same vector elements but with different _is_var_shift field values, are two different nodes, and they cannot be merged/treated as the same node.

Copy link
Member

@dean-long dean-long Jan 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Is it possible to write a test for that?

Copy link
Author

@shqking shqking Jan 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally speaking hash() and cmp() need be overrided if there are new field members. E.g., LoadLNode, VectorMaskCmpNode, VectorTestNode, etc.

It could happen theoretically. But I'm afraid it's not easy to design a test case as it might involve some internal optimizations in a crafty way to construct such two kinds of ShiftV nodes.

format %{
"VNEG.S8 $tmp.D,$shift.D\n\t! neg packed8B"
"VSHL.U16 $dst.D,$src.D,$tmp.D\t! logical right shift packed4S"
%}
Copy link
Member

@dean-long dean-long Jan 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This matches what aarch64 is doing, right? I was hoping there was a way for multiple shift instructions to reuse the same negated value, like we do for vsrcntX. But it would probably require introducing a new operand type, like vecNegatedRShiftX, or doing the negate in the IR like we do for the mask.

Copy link
Author

@shqking shqking Jan 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. AArch64 works in the same way.

Agree with you. Performance benefit would be obtained if we can reuse the same negated value.

We may want to prepare another patch to enhance right shifts for aarch64(using is_var_shift() and considering your proposed two solutions).

Copy link
Member

@dean-long dean-long Jan 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A separate RFE for that sounds good to me. Please file one.

Copy link
Author

@shqking shqking Jan 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to reuse this open JBS issue. See https://bugs.openjdk.java.net/browse/JDK-8265263 . And I will work on it soon.

Copy link
Member

@dean-long dean-long Jan 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect. I forgot about that RFE.

@openjdk
Copy link

@openjdk openjdk bot commented Jan 3, 2022

@shqking This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8278267: ARM32: several vector test failures for ASHR

Reviewed-by: njian, dlong

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 51 new commits pushed to the master branch:

  • 8f969a1: 8278930: javac tries to compile a file twice via PackageElement.getEnclosedElements
  • 642ab34: 8278373: JavacTrees.searchMethod finds incorrect match
  • d65c665: 8279527: Dereferencing segments backed by different scopes leads to pollution
  • 967ef0c: 8278020: ~13% variation in Renaissance-Scrabble
  • 7c792f2: 8279333: Some JFR tests do not accept 'GCLocker Initiated GC' as a valid GC Cause
  • 564c8c6: 8279529: ProblemList java/nio/channels/DatagramChannel/ManySourcesAndTargets.java on macosx-aarch64
  • 590fa9d: 8278612: [macos] test/jdk/java/awt/dnd/RemoveDropTargetCrashTest crashes with VoiceOver on macOS
  • 5cd9515: 8279525: ProblemList java/awt/GraphicsDevice/CheckDisplayModes.java on macosx-aarch64
  • 9d43d25: 8278897: Alignment of heap segments is not enforced correctly
  • 0f4807e: 8279222: Incorrect legacyMap.get in java.security.Provider after JDK-8276660
  • ... and 41 more: https://git.openjdk.java.net/jdk18/compare/e82310f3605b37519129ea34b2a6f99ee3ccc63a...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@nsjian, @dean-long) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready label Jan 3, 2022
Define helper assert_not_var_shift() and use it for immI cases.

Besides, update the copyright year to 2022.
@shqking
Copy link
Author

@shqking shqking commented Jan 10, 2022

Thanks for your reviews! @dean-long @nsjian
I suppose this patch is ready to be integrated.

/integrate

@openjdk openjdk bot added the sponsor label Jan 10, 2022
@openjdk
Copy link

@openjdk openjdk bot commented Jan 10, 2022

@shqking
Your change (at version 6b09742) is now ready to be sponsored by a Committer.

@dean-long
Copy link
Member

@dean-long dean-long commented Jan 10, 2022

/sponsor

@openjdk
Copy link

@openjdk openjdk bot commented Jan 10, 2022

Going to push as commit bbc1ddb.
Since your change was applied there have been 56 commits pushed to the master branch:

  • 40df5df: 8279398: jdk/jfr/api/recording/time/TestTimeMultiple.java failed with "RuntimeException: getStopTime() > afterStop"
  • ad34f03: 8279515: C1: No inlining through invokedynamic and invokestatic call sites when resolved class is not linked
  • 928e347: 8279032: compiler/loopopts/TestSkeletonPredicateNegation.java times out with -XX:TieredStopAtLevel < 4
  • 354c904: 8273452: DocTrees.getDocCommentTree should be specified as idempotent
  • 06b4d49: 8278344: sun/security/pkcs12/KeytoolOpensslInteropTest.java test fails because of different openssl output
  • 8f969a1: 8278930: javac tries to compile a file twice via PackageElement.getEnclosedElements
  • 642ab34: 8278373: JavacTrees.searchMethod finds incorrect match
  • d65c665: 8279527: Dereferencing segments backed by different scopes leads to pollution
  • 967ef0c: 8278020: ~13% variation in Renaissance-Scrabble
  • 7c792f2: 8279333: Some JFR tests do not accept 'GCLocker Initiated GC' as a valid GC Cause
  • ... and 46 more: https://git.openjdk.java.net/jdk18/compare/e82310f3605b37519129ea34b2a6f99ee3ccc63a...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated label Jan 10, 2022
@openjdk openjdk bot closed this Jan 10, 2022
@openjdk openjdk bot removed ready rfr sponsor labels Jan 10, 2022
@openjdk
Copy link

@openjdk openjdk bot commented Jan 10, 2022

@dean-long @shqking Pushed as commit bbc1ddb.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@shqking shqking deleted the 8278267 branch Jan 11, 2022
shqking added a commit to shqking/jdk that referenced this issue Mar 7, 2022
*** Implementation

In AArch64 NEON, vector shift right is implemented by vector shift left
instructions (SSHL[1] and USHL[2]) with negative shift count value. In
C2 backend, we generate a `neg` to given shift value followed by `sshl`
or `ushl` instruction.

For vector shift right, the vector shift count has two origins:
1) it can be duplicated from scalar variable/immediate(case-1),
2) it can be loaded directly from one vector(case-2).

This patch aims to optimize case-1. Specifically, we move the negate
from RShiftV* rules to RShiftCntV rule. As a result, the negate can be
hoisted outside of the loop if it's a loop invariant.

In this patch,
1) we split vshiftcnt* rules into vslcnt* and vsrcnt* rules to handle
shift left and shift right respectively. Compared to vslcnt* rules, the
negate is conducted in vsrcnt*.
2) for each vsra* and vsrl* rules, we create one variant, i.e. vsra*_var
and vsrl*_var. We use vsra* and vsrl* rules to handle case-1, and use
vsra*_var and vsrl*_var rules to handle case-2. Note that
ShiftVNode::is_var_shift() can be used to distinguish case-1 from
case-2.
3) we add one assertion for the vs*_imm rules as we have done on
ARM32[3].
4) several style issues are resolved.

*** Example

Take function `rShiftInt()` in the newly added micro benchmark
VectorShiftRight.java as an example.

```
public void rShiftInt() {
    for (int i = 0; i < SIZE; i++) {
        intsB[i] = intsA[i] >> count;
    }
}
```

Arithmetic shift right is conducted inside a big loop. The following
code snippet shows the disassembly code generated by auto-vectorization
before we apply current patch. We can see that `neg` is conducted in the
loop body.

```
0x0000ffff89057a64:   dup     v16.16b, w13              <-- dup
0x0000ffff89057a68:   mov     w12, #0x7d00                    // #32000
0x0000ffff89057a6c:   sub     w13, w2, w10
0x0000ffff89057a70:   cmp     w2, w10
0x0000ffff89057a74:   csel    w13, wzr, w13, lt
0x0000ffff89057a78:   mov     w8, #0x7d00                     // #32000
0x0000ffff89057a7c:   cmp     w13, w8
0x0000ffff89057a80:   csel    w13, w12, w13, hi
0x0000ffff89057a84:   add     w14, w13, w10
0x0000ffff89057a88:   nop
0x0000ffff89057a8c:   nop
0x0000ffff89057a90:   sbfiz   x13, x10, openjdk#2, openjdk#32         <-- loop entry
0x0000ffff89057a94:   add     x15, x17, x13
0x0000ffff89057a98:   ldr     q17, [x15,openjdk#16]
0x0000ffff89057a9c:   add     x13, x0, x13
0x0000ffff89057aa0:   neg     v18.16b, v16.16b          <-- neg
0x0000ffff89057aa4:   sshl    v17.4s, v17.4s, v18.4s    <-- shift right
0x0000ffff89057aa8:   str     q17, [x13,openjdk#16]
0x0000ffff89057aac:   ...
0x0000ffff89057b1c:   add     w10, w10, #0x20
0x0000ffff89057b20:   cmp     w10, w14
0x0000ffff89057b24:   b.lt    0x0000ffff89057a90        <-- loop end
```

Here is the disassembly code after we apply current patch. We can see
that the negate is no longer conducted inside the loop, and it is
hoisted to the outside.

```
0x0000ffff8d053a68:   neg     w14, w13                  <---- neg
0x0000ffff8d053a6c:   dup     v16.16b, w14              <---- dup
0x0000ffff8d053a70:   sub     w14, w2, w10
0x0000ffff8d053a74:   cmp     w2, w10
0x0000ffff8d053a78:   csel    w14, wzr, w14, lt
0x0000ffff8d053a7c:   mov     w8, #0x7d00                     // #32000
0x0000ffff8d053a80:   cmp     w14, w8
0x0000ffff8d053a84:   csel    w14, w12, w14, hi
0x0000ffff8d053a88:   add     w13, w14, w10
0x0000ffff8d053a8c:   nop
0x0000ffff8d053a90:   sbfiz   x14, x10, openjdk#2, openjdk#32         <-- loop entry
0x0000ffff8d053a94:   add     x15, x17, x14
0x0000ffff8d053a98:   ldr     q17, [x15,openjdk#16]
0x0000ffff8d053a9c:   sshl    v17.4s, v17.4s, v16.4s    <-- shift right
0x0000ffff8d053aa0:   add     x14, x0, x14
0x0000ffff8d053aa4:   str     q17, [x14,openjdk#16]
0x0000ffff8d053aa8:   ...
0x0000ffff8d053afc:   add     w10, w10, #0x20
0x0000ffff8d053b00:   cmp     w10, w13
0x0000ffff8d053b04:   b.lt    0x0000ffff8d053a90        <-- loop end
```

*** Testing

Tier1~3 tests passed on Linux/AArch64 platform.

*** Performance Evaluation

- Auto-vectorization

One micro benchmark, i.e. VectorShiftRight.java, is added by this patch
in order to evaluate the optimization on vector shift right.

The following table shows the result. Column `Score-1` shows the score
before we apply current patch, and column `Score-2` shows the score when
we apply current patch.

We witness about 30% ~ 53% improvement on microbenchmarks.

```
Benchmark                      Units    Score-1    Score-2
VectorShiftRight.rShiftByte   ops/ms  10601.980  13816.353
VectorShiftRight.rShiftInt    ops/ms   3592.831   5502.941
VectorShiftRight.rShiftLong   ops/ms   1584.012   2425.247
VectorShiftRight.rShiftShort  ops/ms   6643.414   9728.762
VectorShiftRight.urShiftByte  ops/ms   2066.965   2048.336 (*)
VectorShiftRight.urShiftChar  ops/ms   6660.805   9728.478
VectorShiftRight.urShiftInt   ops/ms   3592.909   5514.928
VectorShiftRight.urShiftLong  ops/ms   1583.995   2422.991

*: Logical shift right for Byte type(urShiftByte) is not vectorized, as
disscussed in [4].
```

- VectorAPI

Furthermore, we also evaluate the impact of this patch on VectorAPI
benchmarks, e.g., [5]. Details can be found in the table below. Columns
`Score-1` and `Score-2` show the scores before and after applying
current patch.

```
Benchmark                  Units    Score-1    Score-2
Byte128Vector.LSHL        ops/ms  10867.666  10873.993
Byte128Vector.LSHLShift   ops/ms  10945.729  10945.741
Byte128Vector.LSHR        ops/ms   8629.305   8629.343
Byte128Vector.LSHRShift   ops/ms   8245.864  10303.521   <--
Byte128Vector.ASHR        ops/ms   8619.691   8629.438
Byte128Vector.ASHRShift   ops/ms   8245.860  10305.027   <--
Int128Vector.LSHL         ops/ms   3104.213   3103.702
Int128Vector.LSHLShift    ops/ms   3114.354   3114.371
Int128Vector.LSHR         ops/ms   2380.717   2380.693
Int128Vector.LSHRShift    ops/ms   2312.871   2992.377   <--
Int128Vector.ASHR         ops/ms   2380.668   2380.647
Int128Vector.ASHRShift    ops/ms   2312.894   2992.332   <--
Long128Vector.LSHL        ops/ms   1586.907   1587.591
Long128Vector.LSHLShift   ops/ms   1589.469   1589.540
Long128Vector.LSHR        ops/ms   1209.754   1209.687
Long128Vector.LSHRShift   ops/ms   1174.718   1527.502   <--
Long128Vector.ASHR        ops/ms   1209.713   1209.669
Long128Vector.ASHRShift   ops/ms   1174.712   1527.174   <--
Short128Vector.LSHL       ops/ms   5945.542   5943.770
Short128Vector.LSHLShift  ops/ms   5984.743   5984.640
Short128Vector.LSHR       ops/ms   4613.378   4613.577
Short128Vector.LSHRShift  ops/ms   4486.023   5746.466   <--
Short128Vector.ASHR       ops/ms   4613.389   4613.478
Short128Vector.ASHRShift  ops/ms   4486.019   5746.368   <--
```

1) For logical shift left(LSHL and LSHLShift), and shift right with
variable vector shift count(LSHR and ASHR) cases, we didn't find much
changes, which is expected.

2) For shift right with scalar shift count(LSHRShift and ASHRShift)
case, about 25% ~ 30% improvement can be observed, and this benefit is
introduced by current patch.

[1] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/SSHL--Signed-Shift-Left--register--
[2] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/USHL--Unsigned-Shift-Left--register--
[3] openjdk/jdk18#41
[4] openjdk#1087
[5] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L509
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
hotspot-compiler integrated
5 participants