8252522: nsk/share/test/StressOptions should multiply stressTime by jtreg's timeout-factor #32
Conversation
👋 Welcome back iignatyev! A progress list of the required criteria for merging this PR into …

@iignatev To determine the appropriate audience for reviewing this pull request, one or more labels corresponding to different subsystems will normally be applied automatically. However, no automatic labelling rule matches the changes in this pull request. In order to have an RFR email automatically sent to the correct mailing list, you will need to add one or more labels manually using the …

/label add hotspot

@iignatev
/reviewer add dholmes

@iignatev Syntax:
/reviewer add dholmes

@iignatev

@iignatev This change now passes all automated pre-integration checks. When the change also fulfills all project specific requirements, type …

Since the source branch of this PR was last updated there have been 2 commits pushed to the … As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid automatic rebasing, please merge … ➡️ To integrate this PR with the above commit message to the …

/integrate

@iignatev Since your change was applied there have been 2 commits pushed to the … Your commit was automatically rebased without conflicts. Pushed as commit 5f76deb. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
r18 should not be used as it is reserved as the platform register. Linux is fine with userspace using it, but Windows and recently also macOS ( openjdk/jdk11u-dev#301 (comment) ) actually use it on the kernel side. The macro assembler uses the bit pattern `0x7fffffff` (== `r0-r30`) to specify which registers to spill; fortunately this helper is only used here: https://github.com/openjdk/jdk/blob/c05dc268acaf87236f30cf700ea3ac778e3b20e5/src/hotspot/cpu/aarch64/templateInterpreterGenerator_aarch64.cpp#L1400-L1404

I haven't seen this particular instance causing any issues in practice _yet_, presumably because it is hard to align the stars in order to trigger a problem (between the stp and ldp of r18 a transition to kernel space must happen *and* the kernel needs to do something with r18). But jdk11u-dev has more usages of the `::pusha`/`::popa` macro, and that causes trouble as explained in the link above.

Output of `-XX:+PrintInterpreter` before this change:

```
----------------------------------------------------------------------
method entry point (kind = native)  [0x0000000138809b00, 0x000000013880a280]  1920 bytes
--------------------------------------------------------------------------------
0x0000000138809b00: ldr   x2, [x12, #16]
0x0000000138809b04: ldrh  w2, [x2, #44]
0x0000000138809b08: add   x24, x20, x2, uxtx #3
0x0000000138809b0c: sub   x24, x24, #0x8
[...]
0x0000000138809fa4: stp   x16, x17, [sp, #128]
0x0000000138809fa8: stp   x18, x19, [sp, #144]
0x0000000138809fac: stp   x20, x21, [sp, #160]
[...]
0x0000000138809fc0: stp   x30, xzr, [sp, #240]
0x0000000138809fc4: mov   x0, x28
                    ;; 0x10864ACCC
0x0000000138809fc8: mov   x9, #0xaccc     // #44236
0x0000000138809fcc: movk  x9, #0x864, lsl #16
0x0000000138809fd0: movk  x9, #0x1, lsl #32
0x0000000138809fd4: blr   x9
0x0000000138809fd8: ldp   x2, x3, [sp, #16]
[...]
0x0000000138809ff4: ldp   x16, x17, [sp, #128]
0x0000000138809ff8: ldp   x18, x19, [sp, #144]
0x0000000138809ffc: ldp   x20, x21, [sp, #160]
```

After:

```
----------------------------------------------------------------------
method entry point (kind = native)  [0x0000000108e4db00, 0x0000000108e4e280]  1920 bytes
--------------------------------------------------------------------------------
0x0000000108e4db00: ldr   x2, [x12, #16]
0x0000000108e4db04: ldrh  w2, [x2, #44]
0x0000000108e4db08: add   x24, x20, x2, uxtx #3
0x0000000108e4db0c: sub   x24, x24, #0x8
[...]
0x0000000108e4dfa4: stp   x16, x17, [sp, #128]
0x0000000108e4dfa8: stp   x19, x20, [sp, #144]
0x0000000108e4dfac: stp   x21, x22, [sp, #160]
[...]
0x0000000108e4dfbc: stp   x29, x30, [sp, #224]
0x0000000108e4dfc0: mov   x0, x28
                    ;; 0x107E4A06C
0x0000000108e4dfc4: mov   x9, #0xa06c     // #41068
0x0000000108e4dfc8: movk  x9, #0x7e4, lsl #16
0x0000000108e4dfcc: movk  x9, #0x1, lsl #32
0x0000000108e4dfd0: blr   x9
0x0000000108e4dfd4: ldp   x2, x3, [sp, #16]
[...]
0x0000000108e4dff0: ldp   x16, x17, [sp, #128]
0x0000000108e4dff4: ldp   x19, x20, [sp, #144]
0x0000000108e4dff8: ldp   x21, x22, [sp, #160]
[...]
```
Restore looks like this now:

```
0x0000000106e4dfcc: movk  x9, #0x5e4, lsl #16
0x0000000106e4dfd0: movk  x9, #0x1, lsl #32
0x0000000106e4dfd4: blr   x9
0x0000000106e4dfd8: ldp   x2, x3, [sp, #16]
0x0000000106e4dfdc: ldp   x4, x5, [sp, #32]
0x0000000106e4dfe0: ldp   x6, x7, [sp, #48]
0x0000000106e4dfe4: ldp   x8, x9, [sp, #64]
0x0000000106e4dfe8: ldp   x10, x11, [sp, #80]
0x0000000106e4dfec: ldp   x12, x13, [sp, #96]
0x0000000106e4dff0: ldp   x14, x15, [sp, #112]
0x0000000106e4dff4: ldp   x16, x17, [sp, #128]
0x0000000106e4dff8: ldp   x0, x1, [sp], #144
0x0000000106e4dffc: ldp   xzr, x19, [sp], #16
0x0000000106e4e000: ldp   x22, x23, [sp, #16]
0x0000000106e4e004: ldp   x24, x25, [sp, #32]
0x0000000106e4e008: ldp   x26, x27, [sp, #48]
0x0000000106e4e00c: ldp   x28, x29, [sp, #64]
0x0000000106e4e010: ldp   x30, xzr, [sp, #80]
0x0000000106e4e014: ldp   x20, x21, [sp], #96
0x0000000106e4e018: ldur  x12, [x29, #-24]
0x0000000106e4e01c: ldr   x22, [x12, #16]
0x0000000106e4e020: add   x22, x22, #0x30
0x0000000106e4e024: ldr   x8, [x28, #8]
```
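As a quick sanity check on the spill-mask arithmetic above (plain Java; the class name is mine): excluding the platform register r18 from the `0x7fffffff` (`r0-r30`) pattern clears bit 18 and drops the spilled register count from 31 to 30.

```java
public class SpillMaskCheck {
    public static void main(String[] args) {
        // 0x7fffffff sets bits 0..30, i.e. registers r0-r30 inclusive.
        int allR0toR30 = 0x7fffffff;
        // Excluding the platform register r18 clears bit 18.
        int withoutR18 = allR0toR30 & ~(1 << 18);
        System.out.printf("with r18:    0x%08x%n", allR0toR30);   // 0x7fffffff
        System.out.printf("without r18: 0x%08x%n", withoutR18);   // 0x7ffbffff
        System.out.println(Integer.bitCount(allR0toR30) + " -> "
                + Integer.bitCount(withoutR18));                  // 31 -> 30
    }
}
```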
The patch aims to help optimize Math.abs() mainly in these three parts:
1) Remove redundant instructions for abs with constant values
2) Remove redundant instructions for abs with char type
3) Convert some common abs operations to ideal forms

1. Remove redundant instructions for abs with constant values

If we can determine the value of the input node of Math.abs() at compile time, we can substitute the Abs node with the absolute value of the constant and don't have to calculate it at runtime. For example:

```
int[] a;
for (int i = 0; i < SIZE; i++) {
    a[i] = Math.abs(-38);
}
```

Before the patch, the generated code for the test case above is:

```
...
mov  w10, #0xffffffda
cmp  w10, wzr
cneg w17, w10, lt
dup  v16.8h, w17
...
```

After the patch, the generated code for the test case above is:

```
...
movi v16.4s, #0x26
...
```

2. Remove redundant instructions for abs with char type

In Java semantics, as the char type is always non-negative, we can actually remove the AbsI node in the C2 middle end. As for the vectorization part, in current SLP the vectorization of Math.abs() with char type is intentionally disabled after JDK-8261022 because it generated incorrect results before. After removing the AbsI node in the middle end, Math.abs(char) can be vectorized naturally. For example:

```
char[] a;
char[] b;
for (int i = 0; i < SIZE; i++) {
    b[i] = (char) Math.abs(a[i]);
}
```

Before the patch, the generated assembly code for the test case above is:

```
B15:
  add  x13, x21, w20, sxtw #1
  ldrh w11, [x13, #16]
  cmp  w11, wzr
  cneg w10, w11, lt
  strh w10, [x13, #16]
  ldrh w10, [x13, #18]
  cmp  w10, wzr
  cneg w10, w10, lt
  strh w10, [x13, #18]
  ...
  add  w20, w20, #0x1
  cmp  w20, w17
  b.lt B15
```

After the patch, the generated assembly code is:

```
B15:
  sbfiz x18, x19, #1, #32
  add   x0, x14, x18
  ldr   q16, [x0, #16]
  add   x18, x21, x18
  str   q16, [x18, #16]
  ldr   q16, [x0, #32]
  str   q16, [x18, #32]
  ...
  add   w19, w19, #0x40
  cmp   w19, w17
  b.lt  B15
```

3. Convert some common abs operations to ideal forms

The patch overrides some virtual support functions of AbsNode so that GVN optimization can work on it. Here are the optimizable forms:

a) abs(0 - x) => abs(x)

Before the patch:

```
...
ldr  w13, [x13, #16]
neg  w13, w13
cmp  w13, wzr
cneg w14, w13, lt
...
```

After the patch:

```
...
ldr  w13, [x13, #16]
cmp  w13, wzr
cneg w13, w13, lt
...
```

b) abs(abs(x)) => abs(x)

Before the patch:

```
...
ldr  w12, [x12, #16]
cmp  w12, wzr
cneg w12, w12, lt
cmp  w12, wzr
cneg w12, w12, lt
...
```

After the patch:

```
...
ldr  w13, [x13, #16]
cmp  w13, wzr
cneg w13, w13, lt
...
```

Change-Id: I5434c01a225796caaf07ffbb19983f4fe2e206bd
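The ideal-form rewrites in part 3 and the char-type identity in part 2 can be checked directly against Java semantics; a small self-contained sketch (class name is mine):

```java
public class AbsIdentities {
    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 42, -38, Integer.MAX_VALUE};
        for (int x : samples) {
            // a) abs(0 - x) => abs(x)
            if (Math.abs(0 - x) != Math.abs(x)) throw new AssertionError("neg " + x);
            // b) abs(abs(x)) => abs(x)
            if (Math.abs(Math.abs(x)) != Math.abs(x)) throw new AssertionError("abs " + x);
        }
        // char is an unsigned 16-bit type, so Math.abs on a char is always the identity.
        for (char c = 0; c < Character.MAX_VALUE; c++) {
            if (Math.abs(c) != c) throw new AssertionError("char " + (int) c);
        }
        System.out.println("all identities hold");
    }
}
```

(Integer.MIN_VALUE is deliberately excluded: Math.abs(Integer.MIN_VALUE) overflows to a negative value, which is the usual two's-complement caveat rather than anything this patch changes.)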
*** Implementation

In AArch64 NEON, vector shift right is implemented by the vector shift left instructions (SSHL[1] and USHL[2]) with a negative shift count value. In the C2 backend, we generate a `neg` for the given shift value followed by an `sshl` or `ushl` instruction.

For vector shift right, the vector shift count has two origins:
1) it can be duplicated from a scalar variable/immediate (case-1),
2) it can be loaded directly from one vector (case-2).

This patch aims to optimize case-1. Specifically, we move the negate from the RShiftV* rules to the RShiftCntV rule. As a result, the negate can be hoisted outside of the loop if it's a loop invariant. In this patch,
1) we split the vshiftcnt* rules into vslcnt* and vsrcnt* rules to handle shift left and shift right respectively. Compared to the vslcnt* rules, the negate is conducted in vsrcnt*.
2) for each vsra* and vsrl* rule, we create one variant, i.e. vsra*_var and vsrl*_var. We use the vsra* and vsrl* rules to handle case-1, and the vsra*_var and vsrl*_var rules to handle case-2. Note that ShiftVNode::is_var_shift() can be used to distinguish case-1 from case-2.
3) we add one assertion for the vs*_imm rules as we have done on ARM32[3].
4) several style issues are resolved.

*** Example

Take function `rShiftInt()` in the newly added micro benchmark VectorShiftRight.java as an example:

```
public void rShiftInt() {
    for (int i = 0; i < SIZE; i++) {
        intsB[i] = intsA[i] >> count;
    }
}
```

Arithmetic shift right is conducted inside a big loop. The following snippet shows the disassembly generated by auto-vectorization before we apply the current patch. We can see that the `neg` is conducted in the loop body.

```
0x0000ffff89057a64: dup   v16.16b, w13            <-- dup
0x0000ffff89057a68: mov   w12, #0x7d00            // #32000
0x0000ffff89057a6c: sub   w13, w2, w10
0x0000ffff89057a70: cmp   w2, w10
0x0000ffff89057a74: csel  w13, wzr, w13, lt
0x0000ffff89057a78: mov   w8, #0x7d00             // #32000
0x0000ffff89057a7c: cmp   w13, w8
0x0000ffff89057a80: csel  w13, w12, w13, hi
0x0000ffff89057a84: add   w14, w13, w10
0x0000ffff89057a88: nop
0x0000ffff89057a8c: nop
0x0000ffff89057a90: sbfiz x13, x10, #2, #32       <-- loop entry
0x0000ffff89057a94: add   x15, x17, x13
0x0000ffff89057a98: ldr   q17, [x15, #16]
0x0000ffff89057a9c: add   x13, x0, x13
0x0000ffff89057aa0: neg   v18.16b, v16.16b        <-- neg
0x0000ffff89057aa4: sshl  v17.4s, v17.4s, v18.4s  <-- shift right
0x0000ffff89057aa8: str   q17, [x13, #16]
0x0000ffff89057aac: ...
0x0000ffff89057b1c: add   w10, w10, #0x20
0x0000ffff89057b20: cmp   w10, w14
0x0000ffff89057b24: b.lt  0x0000ffff89057a90      <-- loop end
```

Here is the disassembly after we apply the current patch. We can see that the negate is no longer conducted inside the loop; it is hoisted to the outside.

```
0x0000ffff8d053a68: neg   w14, w13                <-- neg
0x0000ffff8d053a6c: dup   v16.16b, w14            <-- dup
0x0000ffff8d053a70: sub   w14, w2, w10
0x0000ffff8d053a74: cmp   w2, w10
0x0000ffff8d053a78: csel  w14, wzr, w14, lt
0x0000ffff8d053a7c: mov   w8, #0x7d00             // #32000
0x0000ffff8d053a80: cmp   w14, w8
0x0000ffff8d053a84: csel  w14, w12, w14, hi
0x0000ffff8d053a88: add   w13, w14, w10
0x0000ffff8d053a8c: nop
0x0000ffff8d053a90: sbfiz x14, x10, #2, #32       <-- loop entry
0x0000ffff8d053a94: add   x15, x17, x14
0x0000ffff8d053a98: ldr   q17, [x15, #16]
0x0000ffff8d053a9c: sshl  v17.4s, v17.4s, v16.4s  <-- shift right
0x0000ffff8d053aa0: add   x14, x0, x14
0x0000ffff8d053aa4: str   q17, [x14, #16]
0x0000ffff8d053aa8: ...
0x0000ffff8d053afc: add   w10, w10, #0x20
0x0000ffff8d053b00: cmp   w10, w13
0x0000ffff8d053b04: b.lt  0x0000ffff8d053a90      <-- loop end
```

*** Testing

Tier1~3 tests passed on the Linux/AArch64 platform.
*** Performance Evaluation

- Auto-vectorization

One micro benchmark, VectorShiftRight.java, is added by this patch in order to evaluate the optimization on vector shift right. The following table shows the result. Column `Score-1` shows the score before we apply the current patch, and column `Score-2` shows the score after we apply it. We witness about 30% ~ 53% improvement on these microbenchmarks.

```
Benchmark                     Units     Score-1    Score-2
VectorShiftRight.rShiftByte   ops/ms  10601.980  13816.353
VectorShiftRight.rShiftInt    ops/ms   3592.831   5502.941
VectorShiftRight.rShiftLong   ops/ms   1584.012   2425.247
VectorShiftRight.rShiftShort  ops/ms   6643.414   9728.762
VectorShiftRight.urShiftByte  ops/ms   2066.965   2048.336 (*)
VectorShiftRight.urShiftChar  ops/ms   6660.805   9728.478
VectorShiftRight.urShiftInt   ops/ms   3592.909   5514.928
VectorShiftRight.urShiftLong  ops/ms   1583.995   2422.991

*: Logical shift right for the byte type (urShiftByte) is not vectorized, as discussed in [4].
```

- VectorAPI

Furthermore, we also evaluate the impact of this patch on VectorAPI benchmarks, e.g., [5]. Details can be found in the table below. Columns `Score-1` and `Score-2` show the scores before and after applying the current patch.

```
Benchmark                 Units    Score-1    Score-2
Byte128Vector.LSHL        ops/ms  10867.666  10873.993
Byte128Vector.LSHLShift   ops/ms  10945.729  10945.741
Byte128Vector.LSHR        ops/ms   8629.305   8629.343
Byte128Vector.LSHRShift   ops/ms   8245.864  10303.521  <--
Byte128Vector.ASHR        ops/ms   8619.691   8629.438
Byte128Vector.ASHRShift   ops/ms   8245.860  10305.027  <--
Int128Vector.LSHL         ops/ms   3104.213   3103.702
Int128Vector.LSHLShift    ops/ms   3114.354   3114.371
Int128Vector.LSHR         ops/ms   2380.717   2380.693
Int128Vector.LSHRShift    ops/ms   2312.871   2992.377  <--
Int128Vector.ASHR         ops/ms   2380.668   2380.647
Int128Vector.ASHRShift    ops/ms   2312.894   2992.332  <--
Long128Vector.LSHL        ops/ms   1586.907   1587.591
Long128Vector.LSHLShift   ops/ms   1589.469   1589.540
Long128Vector.LSHR        ops/ms   1209.754   1209.687
Long128Vector.LSHRShift   ops/ms   1174.718   1527.502  <--
Long128Vector.ASHR        ops/ms   1209.713   1209.669
Long128Vector.ASHRShift   ops/ms   1174.712   1527.174  <--
Short128Vector.LSHL       ops/ms   5945.542   5943.770
Short128Vector.LSHLShift  ops/ms   5984.743   5984.640
Short128Vector.LSHR       ops/ms   4613.378   4613.577
Short128Vector.LSHRShift  ops/ms   4486.023   5746.466  <--
Short128Vector.ASHR       ops/ms   4613.389   4613.478
Short128Vector.ASHRShift  ops/ms   4486.019   5746.368  <--
```

1) For logical shift left (LSHL and LSHLShift) and shift right with a variable vector shift count (LSHR and ASHR), we didn't find much change, which is expected.
2) For shift right with a scalar shift count (LSHRShift and ASHRShift), about 25% ~ 30% improvement can be observed, and this benefit is introduced by the current patch.

[1] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/SSHL--Signed-Shift-Left--register--
[2] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/USHL--Unsigned-Shift-Left--register--
[3] openjdk/jdk18#41
[4] openjdk#1087
[5] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L509
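A minimal plain-Java sketch of the two shift-count origins the patch distinguishes (case-1 vs. case-2). The class and method names are mine, and whether either loop actually vectorizes depends on the JIT:

```java
public class ShiftCountOrigins {
    static final int SIZE = 16;

    // Case-1: the shift count is a loop-invariant scalar. This maps to a
    // RShiftCntV node, so the negate feeding sshl can be hoisted out of the loop.
    static void shiftByScalar(int[] src, int[] dst, int count) {
        for (int i = 0; i < SIZE; i++) {
            dst[i] = src[i] >> count;
        }
    }

    // Case-2: the shift count varies per element (Vector API-style variable
    // shift). This is what the new vsra*_var / vsrl*_var rules handle.
    static void shiftByVector(int[] src, int[] cnt, int[] dst) {
        for (int i = 0; i < SIZE; i++) {
            dst[i] = src[i] >> cnt[i];
        }
    }

    public static void main(String[] args) {
        int[] src = new int[SIZE], cnt = new int[SIZE], dst = new int[SIZE];
        for (int i = 0; i < SIZE; i++) { src[i] = i << 8; cnt[i] = i % 5; }
        shiftByScalar(src, dst, 4);
        System.out.println(dst[3]);  // (3 << 8) >> 4 = 48
        shiftByVector(src, cnt, dst);
        System.out.println(dst[3]);  // (3 << 8) >> 3 = 96
    }
}
```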
This patch fixes the wrong matching rule of replicate2L_zero. It was matched to "ReplicateI" by mistake, so long immediates (not only zero) had to be moved to a register first and finally matched to replicate2L. To fix this trivial bug, this patch fixes the typo and extends the rule from replicate2L_zero to replicate2L_imm, which now supports all possible long immediate values. The final code changes are shown below:

replicate2L_imm:

```
mov  x13, #0xff
movk x13, #0xff, lsl #16
movk x13, #0xff, lsl #32
dup  v16.2d, x13
=>
movi v16.2d, #0xff00ff00ff
```

[Test] test/jdk/jdk/incubator/vector and test/hotspot/jtreg/compiler/vectorapi passed without failure.

Change-Id: Ieac92820dea560239a968de3d7430003f01726bd
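The mov/movk sequence and the single movi immediate encode the same constant; a quick plain-Java check of that arithmetic (class name is mine):

```java
public class LongImmCheck {
    public static void main(String[] args) {
        // The old path materialized the immediate in 16-bit chunks:
        long viaMovk = 0xffL;    // mov  x13, #0xff
        viaMovk |= 0xffL << 16;  // movk x13, #0xff, lsl #16
        viaMovk |= 0xffL << 32;  // movk x13, #0xff, lsl #32
        // The new replicate2L_imm rule emits the value directly:
        // movi v16.2d, #0xff00ff00ff
        long direct = 0xff00ff00ffL;
        System.out.println(viaMovk == direct);        // true
        System.out.println(Long.toHexString(direct)); // ff00ff00ff
    }
}
```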
… operations

Some loop cases on subword types, including byte and short, can't be vectorized by C2's SLP. Here is an example:

```
short[] addShort(short[] a, short[] b, short[] c) {
    for (int i = 0; i < SIZE; i++) {
        b[i] = (short) (a[i] + 8);        // *line A*
        sres[i] = (short) (b[i] + c[i]);  // *line B*
    }
}
```

However, similar cases on int/float/double/long/char types can be vectorized successfully. The reason SLP can't vectorize the short case above is that, as illustrated here[1], the result of the scalar add operation on *line A* has been promoted to int type. It needs to be narrowed to short type before it can serve as one of the source operands of the addition on *line B*. The demotion is done by left-shifting 16 bits and then right-shifting 16 bits. The ideal graph for the process looks like this:

```
LoadS a[i]    8
     \       /
      AddI (line A)
     /      \
StoreC b[i]  LShiftI 16 bits
               \
              RShiftI 16 bits   LoadS c[i]
                      \        /
                    AddI (line B)
                          \
                       StoreC sres[i]
```

In SLP, for most short-type cases we can determine the precise type of the scalar int-type operation and finally execute it with short-type vector operations[2], except for the rshift opcode and abs in some situations[3]. But in this case, the source operand of RShiftI comes from LShiftI rather than from any LoadS[4], so we can't determine its real type and conservatively assign it int type rather than the real short type. The int-type operation RShiftI here can't be vectorized together with other short-type operations like the AddI on *line B*. The reasoning for byte loop cases is the same. Similar loop cases on char type can be vectorized because the demotion from int to char is done by an `and` with a mask rather than by an lshift/rshift pair. Therefore, we try to remove patterns like `RShiftI _ (LShiftI _ valIn1 conIL) conIR` in the byte/short cases, to vectorize more scenarios. Optimizing this in the mid-end by IGVN is more reasonable.

What we do in the mid-end is eliminate the sign extension before some subword integer operations, turning:

```
int x, y;
short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16
```

into

```
short s = (short) (x OP y);
```

In the patch, assuming that `x` can be any int number, we need to guarantee that the optimization doesn't have any impact on the result. Not all arithmetic/logic OPs meet the requirements. For example, assuming that `Imm` equals 16, `x` equals 131068, `y` equals 50 and `OP` is division `/`, `short s = (short) (((131068 << 16) >> 16) / 50)` is not equal to `short s = (short) (131068 / 50)`. When OP is division, we may get a different result with or without the demotion before OP, because the upper 16 bits of the division's operands can influence the lower 16 bits of the result, so division can't be optimized. All optimizable opcodes are listed in StoreNode::no_need_sign_extension(): for them, the upper 16 bits of the src operands don't influence the lower 16 bits of the result for short type, and the upper 24 bits of the src operand don't influence the lower 8 bits of the dst operand for byte.

After the patch, the short loop case above can be vectorized as:

```
movi v18.8h, #0x8
...
ldr  q16, [x14, #32]         // vector load a[i]
add  v17.8h, v16.8h, v18.8h  // vector add a[i] + 8, no promotion or demotion
str  q17, [x6, #32]          // vector store a[i] + 8 into b[i]
ldr  q17, [x0, #32]          // vector load c[i]
add  v16.8h, v17.8h, v16.8h  // vector add a[i] + c[i], no promotion or demotion
add  v16.8h, v16.8h, v18.8h  // vector add a[i] + c[i] + 8, no promotion or demotion
str  q16, [x11, #32]         // vector store sres[i]
...
```

The patch works for byte cases as well. Here is the performance data for the micro-benchmark before and after this patch on both AArch64 and x64 machines. We observe about an ~83% improvement with this patch.
on AArch64:

Before the patch:
```
Benchmark  (length)  Mode  Cnt    Score   Error  Units
addB            523  avgt   15  401.521 ± 0.033  ns/op
addS            523  avgt   15  401.512 ± 0.021  ns/op
```
After the patch:
```
Benchmark  (length)  Mode  Cnt    Score   Error  Units
addB            523  avgt   15   68.444 ± 0.318  ns/op
addS            523  avgt   15   69.847 ± 0.043  ns/op
```

on x86:

Before the patch:
```
Benchmark  (length)  Mode  Cnt    Score    Error  Units
addB            523  avgt   15  454.102 ± 36.180  ns/op
addS            523  avgt   15  432.245 ± 22.640  ns/op
```
After the patch:
```
Benchmark  (length)  Mode  Cnt    Score    Error  Units
addB            523  avgt   15   75.812 ±  5.063  ns/op
addS            523  avgt   15   72.839 ± 10.109  ns/op
```

[1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241
[2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206
[3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249
[4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251

Change-Id: I92ce42b550ef057964a3b58716436735275d8d31
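The safety condition can be spot-checked in plain Java: for an optimizable OP like addition, dropping the `<<16 >>16` demotion never changes the stored short, while for division it can. A small sketch (names are mine):

```java
public class SubwordDemotion {
    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 131068, 0x12345678, Integer.MIN_VALUE};
        for (int x : samples) {
            for (int y : samples) {
                // Addition: ((x << 16) >> 16) == x (mod 2^16), so the low
                // 16 bits of the sum, which is all (short) keeps, agree.
                short withDemotion = (short) (((x << 16) >> 16) + y);
                short withoutDemotion = (short) (x + y);
                if (withDemotion != withoutDemotion) throw new AssertionError("add");
            }
        }
        // Division: the upper bits of the dividend matter, so it differs.
        short a = (short) (((131068 << 16) >> 16) / 50);  // ((short) -4) / 50 = 0
        short b = (short) (131068 / 50);                  // 2621
        System.out.println(a + " vs " + b);               // 0 vs 2621
    }
}
```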
```
public short[] vectorUnsignedShiftRight(short[] shorts) {
    short[] res = new short[SIZE];
    for (int i = 0; i < SIZE; i++) {
        res[i] = (short) (shorts[i] >>> 3);
    }
    return res;
}
```

In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short), like the case above, is intentionally disabled[1], because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthwhile to vectorize more such cases at quite a low cost. Also, unsigned shift right on signed subwords is not uncommon, and we can find similar cases in the Lucene benchmark[2].

Taking unsigned right shift on short type as an example:

```
Short:
| <- 16 bits -> | <- 16 bits -> |
| 1 1 1 ... 1 1 |      data     |
```

When the shift amount is a constant not greater than the number of sign-extended bits (the 16 higher bits for the short type, shown above), the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation:

```
For T_SHORT (shift <= 16):
 src  RShiftCntV shift       src  RShiftCntV shift
   \      /            ==>     \      /
  URShiftVS                   RShiftVS
```

This patch does the transformation in SuperWord::implemented() and SuperWord::output(). It helps vectorize the short case above. We can handle unsigned right shift on the byte type in a similar way. The generated assembly code for one iteration on AArch64 looks like:

```
...
sbfiz x13, x10, #1, #32
add   x15, x11, x13
ldr   q16, [x15, #16]
sshr  v16.8h, v16.8h, #3
add   x13, x17, x13
str   q16, [x13, #16]
...
```

Here is the performance data for the micro-benchmark before and after this patch on both AArch64 and x64 machines. We observe about an ~80% improvement with this patch.
The perf data on AArch64:

Before the patch:
```
Benchmark        (SIZE)  (shiftCount)  Mode  Cnt    Score   Error  Units
urShiftImmByte     1024             3  avgt    5  295.711 ± 0.117  ns/op
urShiftImmShort    1024             3  avgt    5  284.559 ± 0.148  ns/op
```
After the patch:
```
Benchmark        (SIZE)  (shiftCount)  Mode  Cnt    Score   Error  Units
urShiftImmByte     1024             3  avgt    5   45.111 ± 0.047  ns/op
urShiftImmShort    1024             3  avgt    5   55.294 ± 0.072  ns/op
```

The perf data on x86:

Before the patch:
```
Benchmark        (SIZE)  (shiftCount)  Mode  Cnt    Score   Error  Units
urShiftImmByte     1024             3  avgt    5  361.374 ± 4.621  ns/op
urShiftImmShort    1024             3  avgt    5  365.390 ± 3.595  ns/op
```
After the patch:
```
Benchmark        (SIZE)  (shiftCount)  Mode  Cnt    Score   Error  Units
urShiftImmByte     1024             3  avgt    5  105.489 ± 0.488  ns/op
urShiftImmShort    1024             3  avgt    5   43.400 ± 0.394  ns/op
```

[1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190
[2] https://github.com/jpountz/decode-128-ints-benchmark/

Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161
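The legality of the URShiftVS => RShiftVS rewrite can be spot-checked in plain Java: for a sign-extended short and a constant shift of at most 16, `>>>` and `>>` agree on the low 16 bits of the result. A small sketch (class name is mine):

```java
public class UnsignedShiftTransform {
    public static void main(String[] args) {
        short[] samples = {0, 1, -1, 123, -123, Short.MIN_VALUE, Short.MAX_VALUE};
        for (short s : samples) {
            for (int shift = 1; shift <= 16; shift++) {
                // For shift <= 16, the zero bits that >>> shifts in only ever
                // land in the upper 16 bits, which the (short) cast discards;
                // the low 16 bits come from bits shift..shift+15 either way.
                short viaUnsigned = (short) (s >>> shift);
                short viaSigned = (short) (s >> shift);
                if (viaUnsigned != viaSigned)
                    throw new AssertionError(s + " >>> " + shift);
            }
        }
        System.out.println("URShiftVS == RShiftVS for shift <= 16");
    }
}
```

For shifts greater than 16 the two differ on negative inputs, which is why the transformation is guarded by the constant-shift bound.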
This patch optimizes the backend implementation of VectorMaskToLong for AArch64, giving a more efficient approach to moving mask value bits from a predicate register to a general purpose register, as x86 PMOVMSK[1] does, by using BEXT[2], which is available in SVE2. With this patch, the final code (input mask is byte type with SPECIES_512, generated on an SVE 512-bit vector register QEMU emulator) changes as below.

Before:

```
mov  z16.b, p0/z, #1
fmov x0, d16
orr  x0, x0, x0, lsr #7
orr  x0, x0, x0, lsr #14
orr  x0, x0, x0, lsr #28
and  x0, x0, #0xff
fmov x8, v16.d[1]
orr  x8, x8, x8, lsr #7
orr  x8, x8, x8, lsr #14
orr  x8, x8, x8, lsr #28
and  x8, x8, #0xff
orr  x0, x0, x8, lsl #8
orr  x8, xzr, #0x2
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr  x8, x8, x8, lsr #7
orr  x8, x8, x8, lsr #14
orr  x8, x8, x8, lsr #28
and  x8, x8, #0xff
orr  x0, x0, x8, lsl #16
orr  x8, xzr, #0x3
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr  x8, x8, x8, lsr #7
orr  x8, x8, x8, lsr #14
orr  x8, x8, x8, lsr #28
and  x8, x8, #0xff
orr  x0, x0, x8, lsl #24
orr  x8, xzr, #0x4
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr  x8, x8, x8, lsr #7
orr  x8, x8, x8, lsr #14
orr  x8, x8, x8, lsr #28
and  x8, x8, #0xff
orr  x0, x0, x8, lsl #32
mov  x8, #0x5
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr  x8, x8, x8, lsr #7
orr  x8, x8, x8, lsr #14
orr  x8, x8, x8, lsr #28
and  x8, x8, #0xff
orr  x0, x0, x8, lsl #40
orr  x8, xzr, #0x6
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr  x8, x8, x8, lsr #7
orr  x8, x8, x8, lsr #14
orr  x8, x8, x8, lsr #28
and  x8, x8, #0xff
orr  x0, x0, x8, lsl #48
orr  x8, xzr, #0x7
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr  x8, x8, x8, lsr #7
orr  x8, x8, x8, lsr #14
orr  x8, x8, x8, lsr #28
and  x8, x8, #0xff
orr  x0, x0, x8, lsl #56
```

After:

```
mov  z16.b, p0/z, #1
mov  z17.b, #1
bext z16.d, z16.d, z17.d
mov  z17.d, #0
uzp1 z16.s, z16.s, z17.s
uzp1 z16.h, z16.h, z17.h
uzp1 z16.b, z16.b, z17.b
mov  x0, v16.d[0]
```

[1] https://www.felixcloutier.com/x86/pmovmskb
[2] https://developer.arm.com/documentation/ddi0602/2020-12/SVE-Instructions/BEXT--Gather-lower-bits-from-positions-selected-by-bitmask-

Change-Id: Ia983a20c89f76403e557ac21328f2f2e05dd08e0
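For reference, a scalar plain-Java model of what VectorMask.toLong() computes (lane i contributes bit i of the result); the class and helper names are mine:

```java
public class MaskToLongScalar {
    // Scalar reference for VectorMask.toLong(): pack lane i into bit i.
    static long maskToLong(boolean[] lanes) {
        long res = 0L;
        for (int i = 0; i < lanes.length && i < 64; i++) {
            if (lanes[i]) res |= 1L << i;
        }
        return res;
    }

    public static void main(String[] args) {
        // A 64-lane mask (byte type with SPECIES_512) with lanes 0, 2, 63 set.
        boolean[] lanes = new boolean[64];
        lanes[0] = lanes[2] = lanes[63] = true;
        System.out.println(Long.toHexString(maskToLong(lanes)));  // 8000000000000005
    }
}
```

The SVE2 sequence above computes the same packing in a handful of instructions: BEXT gathers one bit per lane, and the uzp1 steps compress those bits into the low 64 bits before the final move to x0.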
* removed FrameKind from StackMapTableAttribute; offsets replaced with labels in StackMapFrames
* simplification of StackMapFrame; removal of compressed sub-types
* implemented UnboundStackMapTableAttribute and added related factory methods; dropped VerificationType; SimpleVerificationTypeInfo refactored to enum
* implemented StackMapTableAttribute compression and writing
* StackMapTableAttribute extends CodeElement
* added Classfile.Option.processStackMaps; enabled stack maps processing in the RebuildingTransformation test helper; fixed UninitializedVerificationTypeInfo labels resolution
* fixed Opcode
* enabled unordered StackMapTableAttribute entries
* removed the PROCESS_STACK_MAPS option; generated stack maps override user content; adjusted labels inflation from stack maps
* minor patch in the RebuildingTransformation test helper
JVM-1916: a Virtual State becomes top after merge
pre-Skara era RFR threads: Aug, Sep
Download
$ git fetch https://git.openjdk.java.net/jdk pull/32/head:pull/32
$ git checkout pull/32