8278267: ARM32: several vector test failures for ASHR #41
Conversation
In ARM32, "VSHL (register)" instruction [1] is shared by vector left shift and vector right shift, and the condition to distinguish them is whether the shift count value is positve or negative. Hence, negation operation is needed before conducting vector right shift. For vector right shift, the shift count can be a RShiftCntV or a normal vector node. Take test case Byte64VectorTests.java [2][3] as an example. Note that RShiftCntV is already negated via rules "vsrcntD" and "vsrcntX" whereas the normal vector node is NOT, since we don't know whether a normal vector node is used as a vector shift count or not. This is the root cause for these vector test failures. The fix is simple, moving the negation from "vsrcntD|X" to the corresponding vector right shift rules. Affected rules are vsrlBB_reg and vsraBB_reg. Note that vector shift related rules are in form of "vsAABB_CC", where 1) AA can be l (left shift), rl (logical right shift) and ra (arithmetic right shift). 2) BB can be 8B/16B (byte type), 4S/8S (short type), 2I/4I (int type) and 2L (long type). 3) CC can be reg (register case) and immI (immediate case). Minor updates: 1) Merge "vslcntD" and "vsrcntD" into rule "vscntD", as these two rules conduct the same duplication operation now. 2) Update the "match" primitive for vsraBB_immI rules. 3) Style issue: remove the surrounding space for "ins_pipe" primitive. Tests: We ran tier 1~3 tests on ARM32 platform. With this patch, previously failed vector test cases can pass now without introducing test regression. [1] https://developer.arm.com/documentation/ddi0406/c/Application-Level-Architecture/Instruction-Details/Alphabetical-list-of-instructions/VSHL--register-?lang=en [2] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2237 [3] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2425
@nick-arm could you or someone else at ARM help to review this? That would be great.
The fix looks good to me (not a Reviewer).
There seems to be an interesting history here:

Method 1: negate in the RShiftCntV rule.
Method 2: negate in the shift instruction rule.

For aarch64, the negate was moved into the shift instruction in JDK-8213134 (Method 1 --> Method 2). Then JDK-8262916 proposed to move it back out of the shift instruction again. In that PR, the opinion was expressed that Method 1 (arm32) generated better code than Method 2 (aarch64). Now it looks like this PR proposes that arm32 move to Method 2 like aarch64, so I suspect there will be a performance impact.

I think there is a simpler fix that doesn't require moving where the negate happens. JDK-8277239 fixed a similar problem by adding a flag on vector shift nodes to indicate a variable shift, then checking the flag in the predicate. Perhaps arm32 could do the same?
Hmm, I didn't notice JDK-8277239 before. That looks like a good solution. If it works fine on arm32, we could also do it on aarch64.
…ht shifts

The is_var_shift() method denotes whether the vector shift count is a variable shift: 1) if it is, the vector shift count should be negated before conducting the right shift, e.g., the vsrl4S_reg_var rule; 2) if it is not, the vector shift count is generated via the RShiftCntV rules and is already negated there, hence no extra negation is needed, e.g., the vsrl4S_reg rule. Besides, it's safe to add "hash()" and "cmp()" methods for the ShiftV node.
Thanks a lot for your explanation. I do agree with your solution. Updated the code: 1) using is_var_shift() to determine whether to put the negation in the RShiftCntV or RShiftV* rules, and 2) adding hash() and cmp() for the ShiftV node. Could you please take a look at the latest code when you have a chance? Maybe after the new year :)
Tests on the latest code:
The change is OK for me (I am not a Reviewer).
1. Logical left shift rules:
a) add an is_var_shift check for the vslAA_immI rules.
b) for the vslAA_reg rules, remove the matching for URShiftV cases, as we have separate logical right shift rules now.
2. Logical right shift rules:
a) add vsrlAA_reg and vsrlAA_reg_var rules.
b) add an is_var_shift check for the vsrlAA_immI rules.
3. Arithmetic right shift rules:
a) add an is_var_shift check for the vsraAA_reg rules.
b) add vsraAA_reg_var rules.
c) for the vsraAA_immI rules, add an is_var_shift check and update the match primitive.

Code style issues (FIXME and the surrounding space in ins_pipe):
1. for modified rules, keep the style as it was.
2. for newly added rules, update the style.
Thanks for your review.
src/hotspot/cpu/arm/arm.ad (outdated diff):

```
@@ -10790,7 +10792,7 @@ instruct vsl16B_reg(vecX dst, vecX src, vecX shift) %{
 %}

 instruct vsl8B_immI(vecD dst, vecD src, immI shift) %{
-  predicate(n->as_Vector()->length() == 8);
+  predicate(n->as_Vector()->length() == 8 && !n->as_ShiftV()->is_var_shift());
```
Is checking for is_var_shift() really necessary for these immI rules? It seems like it should never happen, so it could be an assert.
An assertion would be better, but I didn't find a proper way to add an assertion inside an `instruct` according to src/hotspot/share/adlc/Doc/Syntax.doc. Hence I'd like to put this check in the `predicate`, so a "bad AD file" assertion would be issued for unexpected cases.
You can put the assert into a helper function. You can look in aarch64.ad for examples.
```
bool assert_not_var_shift(const Node *n) {
  assert(!n->as_ShiftV()->is_var_shift(), "illegal var shift");
  return true;
}

[...]

predicate(n->as_Vector()->length() == 8 && assert_not_var_shift(n));
```
Thanks for your suggestion. Will update soon.
@dean-long I believe your comments are addressed in the latest commit. Could you please take another look when you get a chance? Thanks.
Yes, your changes look good. Approved.
```
virtual uint hash() const { return VectorNode::hash() + _is_var_shift; }
virtual bool cmp(const Node& n) const {
  return VectorNode::cmp(n) && _is_var_shift == ((ShiftVNode&)n)._is_var_shift;
}
```
Is it important to add hash() and cmp()?
I think it's safe to add them to avoid mis-optimizations at the IR level. Per my understanding, two ShiftV nodes with the same vector elements but with different _is_var_shift field values are two different nodes, and they cannot be merged/treated as the same node.
That makes sense. Is it possible to write a test for that?
Generally speaking, hash() and cmp() need to be overridden if there are new field members, e.g., LoadLNode, VectorMaskCmpNode, VectorTestNode, etc.

It could happen theoretically. But I'm afraid it's not easy to design a test case, as it might involve some internal optimizations in a crafty way to construct two such kinds of ShiftV nodes.
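To make the value-numbering concern above concrete, here is a hedged Java analogy (a hypothetical class, not HotSpot code): like hashCode()/equals(), hash()/cmp() must incorporate every identity-relevant field, or a hash-consing table such as GVN could merge two semantically different nodes.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a shift node's identity, for illustration only.
final class ShiftNodeKey {
    final int opcode;          // e.g., an RShiftVI-like opcode id
    final boolean isVarShift;  // mirrors ShiftVNode::_is_var_shift

    ShiftNodeKey(int opcode, boolean isVarShift) {
        this.opcode = opcode;
        this.isVarShift = isVarShift;
    }

    // Both methods fold in isVarShift; dropping it from either would let a
    // set (or GVN's value-numbering table) treat the two variants as equal.
    @Override public int hashCode() { return opcode * 31 + (isVarShift ? 1 : 0); }
    @Override public boolean equals(Object o) {
        return o instanceof ShiftNodeKey k
                && opcode == k.opcode && isVarShift == k.isVarShift;
    }

    public static void main(String[] args) {
        Set<ShiftNodeKey> gvnTable = new HashSet<>();
        gvnTable.add(new ShiftNodeKey(42, false));
        // Stays distinct only because hashCode()/equals() see isVarShift.
        System.out.println(gvnTable.add(new ShiftNodeKey(42, true))); // true
    }
}
```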
```
format %{
  "VNEG.S8 $tmp.D,$shift.D\n\t! neg packed8B"
  "VSHL.U16 $dst.D,$src.D,$tmp.D\t! logical right shift packed4S"
%}
```
This matches what aarch64 is doing, right? I was hoping there was a way for multiple shift instructions to reuse the same negated value, like we do for vsrcntX. But it would probably require introducing a new operand type, like vecNegatedRShiftX, or doing the negate in the IR like we do for the mask.
Yes, AArch64 works in the same way.

I agree: a performance benefit would be obtained if we could reuse the same negated value. We may want to prepare another patch to enhance right shifts for aarch64 (using is_var_shift() and considering your two proposed solutions).
A separate RFE for that sounds good to me. Please file one.
I'd like to reuse this open JBS issue: https://bugs.openjdk.java.net/browse/JDK-8265263. I will work on it soon.
Perfect. I forgot about that RFE.
@shqking This change now passes all automated pre-integration checks. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 51 new commits pushed to the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@nsjian, @dean-long) but any other Committer may sponsor as well.
Define helper assert_not_var_shift() and use it for immI cases. Besides, update the copyright year to 2022.
Thanks for your reviews! @dean-long @nsjian

/integrate
/sponsor
Going to push as commit bbc1ddb.
Your commit was automatically rebased without conflicts.
@dean-long @shqking Pushed as commit bbc1ddb.
*** Implementation

In AArch64 NEON, vector shift right is implemented by vector shift left instructions (SSHL [1] and USHL [2]) with a negative shift count value. In the C2 backend, we generate a `neg` for the given shift value, followed by an `sshl` or `ushl` instruction.

For vector shift right, the vector shift count has two origins:
1) it can be duplicated from a scalar variable/immediate (case-1),
2) it can be loaded directly from one vector (case-2).

This patch aims to optimize case-1. Specifically, we move the negate from the RShiftV* rules to the RShiftCntV rule. As a result, the negate can be hoisted outside of the loop if it's a loop invariant.

In this patch,
1) we split the vshiftcnt* rules into vslcnt* and vsrcnt* rules to handle shift left and shift right respectively. Compared to the vslcnt* rules, the negate is conducted in vsrcnt*.
2) for each vsra* and vsrl* rule, we create one variant, i.e. vsra*_var and vsrl*_var. We use the vsra* and vsrl* rules to handle case-1, and the vsra*_var and vsrl*_var rules to handle case-2. Note that ShiftVNode::is_var_shift() can be used to distinguish case-1 from case-2.
3) we add one assertion for the vs*_imm rules as we have done on ARM32 [3].
4) several style issues are resolved.

*** Example

Take function `rShiftInt()` in the newly added micro benchmark VectorShiftRight.java as an example:

```
public void rShiftInt() {
  for (int i = 0; i < SIZE; i++) {
    intsB[i] = intsA[i] >> count;
  }
}
```

Arithmetic shift right is conducted inside a big loop. The following snippet shows the disassembly generated by auto-vectorization before applying the current patch. We can see that `neg` is conducted in the loop body.

```
0x0000ffff89057a64: dup   v16.16b, w13            <-- dup
0x0000ffff89057a68: mov   w12, #0x7d00            // #32000
0x0000ffff89057a6c: sub   w13, w2, w10
0x0000ffff89057a70: cmp   w2, w10
0x0000ffff89057a74: csel  w13, wzr, w13, lt
0x0000ffff89057a78: mov   w8, #0x7d00             // #32000
0x0000ffff89057a7c: cmp   w13, w8
0x0000ffff89057a80: csel  w13, w12, w13, hi
0x0000ffff89057a84: add   w14, w13, w10
0x0000ffff89057a88: nop
0x0000ffff89057a8c: nop
0x0000ffff89057a90: sbfiz x13, x10, #2, #32       <-- loop entry
0x0000ffff89057a94: add   x15, x17, x13
0x0000ffff89057a98: ldr   q17, [x15,#16]
0x0000ffff89057a9c: add   x13, x0, x13
0x0000ffff89057aa0: neg   v18.16b, v16.16b        <-- neg
0x0000ffff89057aa4: sshl  v17.4s, v17.4s, v18.4s  <-- shift right
0x0000ffff89057aa8: str   q17, [x13,#16]
0x0000ffff89057aac: ...
0x0000ffff89057b1c: add   w10, w10, #0x20
0x0000ffff89057b20: cmp   w10, w14
0x0000ffff89057b24: b.lt  0x0000ffff89057a90      <-- loop end
```

Here is the disassembly after applying the current patch. The negate is no longer conducted inside the loop; it has been hoisted outside.

```
0x0000ffff8d053a68: neg   w14, w13                <-- neg
0x0000ffff8d053a6c: dup   v16.16b, w14            <-- dup
0x0000ffff8d053a70: sub   w14, w2, w10
0x0000ffff8d053a74: cmp   w2, w10
0x0000ffff8d053a78: csel  w14, wzr, w14, lt
0x0000ffff8d053a7c: mov   w8, #0x7d00             // #32000
0x0000ffff8d053a80: cmp   w14, w8
0x0000ffff8d053a84: csel  w14, w12, w14, hi
0x0000ffff8d053a88: add   w13, w14, w10
0x0000ffff8d053a8c: nop
0x0000ffff8d053a90: sbfiz x14, x10, #2, #32       <-- loop entry
0x0000ffff8d053a94: add   x15, x17, x14
0x0000ffff8d053a98: ldr   q17, [x15,#16]
0x0000ffff8d053a9c: sshl  v17.4s, v17.4s, v16.4s  <-- shift right
0x0000ffff8d053aa0: add   x14, x0, x14
0x0000ffff8d053aa4: str   q17, [x14,#16]
0x0000ffff8d053aa8: ...
0x0000ffff8d053afc: add   w10, w10, #0x20
0x0000ffff8d053b00: cmp   w10, w13
0x0000ffff8d053b04: b.lt  0x0000ffff8d053a90      <-- loop end
```

*** Testing

Tier 1~3 tests passed on the Linux/AArch64 platform.

*** Performance Evaluation

- Auto-vectorization

One micro benchmark, VectorShiftRight.java, is added by this patch in order to evaluate the optimization on vector shift right. The following table shows the result. Column `Score-1` shows the score before applying the current patch, and column `Score-2` shows the score after. We witness about 30% ~ 53% improvement on these microbenchmarks.

```
Benchmark                     Units    Score-1    Score-2
VectorShiftRight.rShiftByte   ops/ms  10601.980  13816.353
VectorShiftRight.rShiftInt    ops/ms   3592.831   5502.941
VectorShiftRight.rShiftLong   ops/ms   1584.012   2425.247
VectorShiftRight.rShiftShort  ops/ms   6643.414   9728.762
VectorShiftRight.urShiftByte  ops/ms   2066.965   2048.336  (*)
VectorShiftRight.urShiftChar  ops/ms   6660.805   9728.478
VectorShiftRight.urShiftInt   ops/ms   3592.909   5514.928
VectorShiftRight.urShiftLong  ops/ms   1583.995   2422.991

*: Logical shift right for the Byte type (urShiftByte) is not
   vectorized, as discussed in [4].
```

- VectorAPI

Furthermore, we also evaluate the impact of this patch on VectorAPI benchmarks, e.g., [5]. Details can be found in the table below. Columns `Score-1` and `Score-2` show the scores before and after applying the current patch.

```
Benchmark                 Units    Score-1    Score-2
Byte128Vector.LSHL        ops/ms  10867.666  10873.993
Byte128Vector.LSHLShift   ops/ms  10945.729  10945.741
Byte128Vector.LSHR        ops/ms   8629.305   8629.343
Byte128Vector.LSHRShift   ops/ms   8245.864  10303.521  <--
Byte128Vector.ASHR        ops/ms   8619.691   8629.438
Byte128Vector.ASHRShift   ops/ms   8245.860  10305.027  <--
Int128Vector.LSHL         ops/ms   3104.213   3103.702
Int128Vector.LSHLShift    ops/ms   3114.354   3114.371
Int128Vector.LSHR         ops/ms   2380.717   2380.693
Int128Vector.LSHRShift    ops/ms   2312.871   2992.377  <--
Int128Vector.ASHR         ops/ms   2380.668   2380.647
Int128Vector.ASHRShift    ops/ms   2312.894   2992.332  <--
Long128Vector.LSHL        ops/ms   1586.907   1587.591
Long128Vector.LSHLShift   ops/ms   1589.469   1589.540
Long128Vector.LSHR        ops/ms   1209.754   1209.687
Long128Vector.LSHRShift   ops/ms   1174.718   1527.502  <--
Long128Vector.ASHR        ops/ms   1209.713   1209.669
Long128Vector.ASHRShift   ops/ms   1174.712   1527.174  <--
Short128Vector.LSHL       ops/ms   5945.542   5943.770
Short128Vector.LSHLShift  ops/ms   5984.743   5984.640
Short128Vector.LSHR       ops/ms   4613.378   4613.577
Short128Vector.LSHRShift  ops/ms   4486.023   5746.466  <--
Short128Vector.ASHR       ops/ms   4613.389   4613.478
Short128Vector.ASHRShift  ops/ms   4486.019   5746.368  <--
```

1) For logical shift left (LSHL and LSHLShift) and shift right with a variable vector shift count (LSHR and ASHR), we didn't find much change, which is expected.
2) For shift right with a scalar shift count (LSHRShift and ASHRShift), about 25% ~ 30% improvement can be observed, and this benefit is introduced by the current patch.

[1] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/SSHL--Signed-Shift-Left--register--
[2] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/USHL--Unsigned-Shift-Left--register--
[3] openjdk/jdk18#41
[4] openjdk#1087
[5] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L509
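For context, here is a minimal JMH sketch of the kind of kernel the VectorShiftRight microbenchmark above measures. The loop body is taken from the commit message; the surrounding harness (class name, state fields, SIZE value, setup data) is an assumption for illustration, not the actual benchmark source.

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class RShiftIntBench {
    static final int SIZE = 32_000;  // assumed; matches the #32000 bound in the disassembly
    int[] intsA = new int[SIZE];
    int[] intsB = new int[SIZE];
    int count = 3;                   // scalar shift count (case-1)

    @Setup
    public void setup() {
        for (int i = 0; i < SIZE; i++) {
            intsA[i] = i - SIZE / 2;  // mix of positive and negative values
        }
    }

    @Benchmark
    public void rShiftInt() {
        // With the patch, the broadcast count is negated once via RShiftCntV
        // and the neg is hoisted out of this auto-vectorized loop.
        for (int i = 0; i < SIZE; i++) {
            intsB[i] = intsA[i] >> count;
        }
    }
}
```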
In ARM32, "VSHL (register)" instruction [1] is shared by vector left
shift and vector right shift, and the condition to distinguish them is
whether the shift count value is positive or negative. Hence, negation
operation is needed before conducting vector right shift.
For vector right shift, the shift count can be a RShiftCntV or a normal
vector node. Take test case Byte64VectorTests.java [2][3] as an example.
Note that RShiftCntV is already negated via rules "vsrcntD" and
"vsrcntX" whereas the normal vector node is NOT, since we don't know
whether a normal vector node is used as a vector shift count or not.
This is the root cause for these vector test failures.
The fix is simple, moving the negation from "vsrcntD|X" to the
corresponding vector right shift rules.
Affected rules are vsrlBB_reg and vsraBB_reg. Note that vector shift
related rules are in the form of "vsAABB_CC", where
1) AA can be l (left shift), rl (logical right shift) and ra (arithmetic
right shift).
2) BB can be 8B/16B (byte type), 4S/8S (short type), 2I/4I (int type)
and 2L (long type).
3) CC can be reg (register case) and immI (immediate case).

Minor updates:
1) Merge "vslcntD" and "vsrcntD" into rule "vscntD", as these two rules
conduct the same duplication operation now.
2) Update the "match" primitive for vsraBB_immI rules.
3) Style issue: remove the surrounding space for "ins_pipe" primitive.
Tests:
We ran tier 1~3 tests on ARM32 platform. With this patch, previously
failed vector test cases can pass now without introducing test
regression.
[1] https://developer.arm.com/documentation/ddi0406/c/Application-Level-Architecture/Instruction-Details/Alphabetical-list-of-instructions/VSHL--register-?lang=en
[2] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2237
[3] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2425
Reviewing

Using git

Checkout this PR locally:

```
$ git fetch https://git.openjdk.java.net/jdk18 pull/41/head:pull/41
$ git checkout pull/41
```

Update a local copy of the PR:

```
$ git checkout pull/41
$ git pull https://git.openjdk.java.net/jdk18 pull/41/head
```

Using Skara CLI tools

Checkout this PR locally:

```
$ git pr checkout 41
```

View PR using the GUI difftool:

```
$ git pr show -t 41
```

Using diff file

Download this PR as a diff file:

```
https://git.openjdk.java.net/jdk18/pull/41.diff
```