Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8252522: nsk/share/test/StressOptions should multiple stressTime by jtreg's timeout-factor #32

Closed
wants to merge 1 commit into from

Conversation

iignatev
Copy link
Member

@iignatev iignatev commented Sep 6, 2020

pre-skara ear RFR threads: Aug, Sep

Hi all,

could you please review this small patch which updates StressOptions to adjust allocated time according to TIMEOUT_FACTOR?

from JBS:

nsk/share/test/StressOptions and Stresser aren't aware of jtreg's timeout-factor and hence don't provide enough stress time for testing in slow/stress configurations, e.g. Xcomp.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8252522: nsk/share/test/StressOptions should multiple stressTime by jtreg's timeout-factor

Reviewers

  • dholmes - Reviewer ⚠️ Added manually

Download

$ git fetch https://git.openjdk.java.net/jdk pull/32/head:pull/32
$ git checkout pull/32

@iignatev iignatev marked this pull request as ready for review September 6, 2020 16:09
@bridgekeeper
Copy link

bridgekeeper bot commented Sep 6, 2020

👋 Welcome back iignatyev! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 6, 2020
@openjdk
Copy link

openjdk bot commented Sep 6, 2020

@iignatev To determine the appropriate audience for reviewing this pull request, one or more labels corresponding to different subsystems will normally be applied automatically. However, no automatic labelling rule matches the changes in this pull request.

In order to have an RFR email automatically sent to the correct mailing list, you will need to add one or more labels manually using the /label add "label" command. The following labels are valid: 2d awt beans build compiler core-libs hotspot hotspot-compiler hotspot-gc hotspot-jfr hotspot-runtime i18n javadoc jdk jmx kulla net nio security serviceability shenandoah sound swing.

@iignatev
Copy link
Member Author

iignatev commented Sep 6, 2020

/label add hotspot

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Sep 6, 2020
@openjdk
Copy link

openjdk bot commented Sep 6, 2020

@iignatev
The hotspot label was successfully added.

@iignatev
Copy link
Member Author

iignatev commented Sep 6, 2020

/reviewer add dholmes

@openjdk
Copy link

openjdk bot commented Sep 6, 2020

@iignatev Syntax: /reviewer (add|remove) [@user | openjdk-user]+. For example:

  • /reviewer add @openjdk-bot
  • /reviewer add duke
  • /reviewer add @user1 @user2

@iignatev
Copy link
Member Author

iignatev commented Sep 6, 2020

/reviewer add dholmes

@openjdk
Copy link

openjdk bot commented Sep 6, 2020

@iignatev
Reviewer dholmes successfully added.

@openjdk
Copy link

openjdk bot commented Sep 6, 2020

@iignatev This change now passes all automated pre-integration checks. When the change also fulfills all project specific requirements, type /integrate in a new comment to proceed. After integration, the commit message will be:

8252522: nsk/share/test/StressOptions should multiple stressTime by jtreg's timeout-factor

Reviewed-by: dholmes
  • If you would like to add a summary, use the /summary command.
  • To credit additional contributors, use the /contributor command.
  • To add additional solved issues, use the /issue command.

Since the source branch of this PR was last updated there have been 2 commits pushed to the master branch:

  • 1262ae3: 8252679: Two windows specific FileDIalog tests may fail on some Windows_Server_2016_Standard
  • d0f4366: 8252715: Problem list java/awt/event/KeyEvent/KeyTyped/CtrlASCII.java on Linux

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid automatic rebasing, please merge master into your branch, and then specify the current head hash when integrating, like this: /integrate 1262ae36af0c9a41609765966123751b5ad9098b.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Sep 6, 2020
@iignatev
Copy link
Member Author

iignatev commented Sep 6, 2020

/integrate

@openjdk openjdk bot closed this Sep 6, 2020
@openjdk openjdk bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated labels Sep 6, 2020
@openjdk
Copy link

openjdk bot commented Sep 6, 2020

@iignatev Since your change was applied there have been 2 commits pushed to the master branch:

  • 1262ae3: 8252679: Two windows specific FileDIalog tests may fail on some Windows_Server_2016_Standard
  • d0f4366: 8252715: Problem list java/awt/event/KeyEvent/KeyTyped/CtrlASCII.java on Linux

Your commit was automatically rebased without conflicts.

Pushed as commit 5f76deb.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@openjdk openjdk bot removed the rfr Pull request is ready for review label Sep 6, 2020
@iignatev iignatev deleted the 8252522 branch September 6, 2020 16:18
@mlbridge
Copy link

mlbridge bot commented Sep 6, 2020

Webrevs

openjdk-notifier bot pushed a commit that referenced this pull request Oct 5, 2021
r18 should not be used as it is reserved as platform register. Linux is
fine with userspace using it, but Windows and also recently macOS (
openjdk/jdk11u-dev#301 (comment) )
are actually using it on the kernel side.

The macro assembler uses the bit pattern `0x7fffffff` (== `r0-r30`) to
specify which registers to spill; fortunately this helper is only used
here:
https://github.com/openjdk/jdk/blob/c05dc268acaf87236f30cf700ea3ac778e3b20e5/src/hotspot/cpu/aarch64/templateInterpreterGenerator_aarch64.cpp#L1400-L1404

I haven't seen causing this particular instance any issues in practice
_yet_, presumably because it looks hard to align the stars in order to
trigger a problem (between stp and ldp of r18 a transition to kernel
space must happen *and* the kernel needs to do something with r18). But
jdk11u-dev has more usages of the `::pusha`/`::popa` macro and that
causes troubles as explained in the link above.

Output of `-XX:+PrintInterpreter` before this change:
```
----------------------------------------------------------------------
method entry point (kind = native)  [0x0000000138809b00, 0x000000013880a280]  1920 bytes
--------------------------------------------------------------------------------
  0x0000000138809b00:   ldr x2, [x12, #16]
  0x0000000138809b04:   ldrh    w2, [x2, #44]
  0x0000000138809b08:   add x24, x20, x2, uxtx #3
  0x0000000138809b0c:   sub x24, x24, #0x8
[...]
  0x0000000138809fa4:   stp x16, x17, [sp, #128]
  0x0000000138809fa8:   stp x18, x19, [sp, #144]
  0x0000000138809fac:   stp x20, x21, [sp, #160]
[...]
  0x0000000138809fc0:   stp x30, xzr, [sp, #240]
  0x0000000138809fc4:   mov x0, x28
 ;; 0x10864ACCC
  0x0000000138809fc8:   mov x9, #0xaccc                 // #44236
  0x0000000138809fcc:   movk    x9, #0x864, lsl #16
  0x0000000138809fd0:   movk    x9, #0x1, lsl #32
  0x0000000138809fd4:   blr x9
  0x0000000138809fd8:   ldp x2, x3, [sp, #16]
[...]
  0x0000000138809ff4:   ldp x16, x17, [sp, #128]
  0x0000000138809ff8:   ldp x18, x19, [sp, #144]
  0x0000000138809ffc:   ldp x20, x21, [sp, #160]
```

After:
```
----------------------------------------------------------------------
method entry point (kind = native)  [0x0000000108e4db00, 0x0000000108e4e280]  1920 bytes

--------------------------------------------------------------------------------
  0x0000000108e4db00:   ldr x2, [x12, #16]
  0x0000000108e4db04:   ldrh    w2, [x2, #44]
  0x0000000108e4db08:   add x24, x20, x2, uxtx #3
  0x0000000108e4db0c:   sub x24, x24, #0x8
[...]
  0x0000000108e4dfa4:   stp x16, x17, [sp, #128]
  0x0000000108e4dfa8:   stp x19, x20, [sp, #144]
  0x0000000108e4dfac:   stp x21, x22, [sp, #160]
[...]
  0x0000000108e4dfbc:   stp x29, x30, [sp, #224]
  0x0000000108e4dfc0:   mov x0, x28
 ;; 0x107E4A06C
  0x0000000108e4dfc4:   mov x9, #0xa06c                 // #41068
  0x0000000108e4dfc8:   movk    x9, #0x7e4, lsl #16
  0x0000000108e4dfcc:   movk    x9, #0x1, lsl #32
  0x0000000108e4dfd0:   blr x9
  0x0000000108e4dfd4:   ldp x2, x3, [sp, #16]
[...]
  0x0000000108e4dff0:   ldp x16, x17, [sp, #128]
  0x0000000108e4dff4:   ldp x19, x20, [sp, #144]
  0x0000000108e4dff8:   ldp x21, x22, [sp, #160]
[...]
```
lewurm added a commit to lewurm/openjdk that referenced this pull request Oct 6, 2021
Restore looks like this now:
```
  0x0000000106e4dfcc:   movk    x9, #0x5e4, lsl openjdk#16
  0x0000000106e4dfd0:   movk    x9, #0x1, lsl openjdk#32
  0x0000000106e4dfd4:   blr x9
  0x0000000106e4dfd8:   ldp x2, x3, [sp, openjdk#16]
  0x0000000106e4dfdc:   ldp x4, x5, [sp, openjdk#32]
  0x0000000106e4dfe0:   ldp x6, x7, [sp, openjdk#48]
  0x0000000106e4dfe4:   ldp x8, x9, [sp, openjdk#64]
  0x0000000106e4dfe8:   ldp x10, x11, [sp, openjdk#80]
  0x0000000106e4dfec:   ldp x12, x13, [sp, openjdk#96]
  0x0000000106e4dff0:   ldp x14, x15, [sp, openjdk#112]
  0x0000000106e4dff4:   ldp x16, x17, [sp, openjdk#128]
  0x0000000106e4dff8:   ldp x0, x1, [sp], openjdk#144
  0x0000000106e4dffc:   ldp xzr, x19, [sp], openjdk#16
  0x0000000106e4e000:   ldp x22, x23, [sp, openjdk#16]
  0x0000000106e4e004:   ldp x24, x25, [sp, openjdk#32]
  0x0000000106e4e008:   ldp x26, x27, [sp, openjdk#48]
  0x0000000106e4e00c:   ldp x28, x29, [sp, openjdk#64]
  0x0000000106e4e010:   ldp x30, xzr, [sp, openjdk#80]
  0x0000000106e4e014:   ldp x20, x21, [sp], openjdk#96
  0x0000000106e4e018:   ldur    x12, [x29, #-24]
  0x0000000106e4e01c:   ldr x22, [x12, openjdk#16]
  0x0000000106e4e020:   add x22, x22, #0x30
  0x0000000106e4e024:   ldr x8, [x28, openjdk#8]
```
fg1417 pushed a commit to fg1417/jdk that referenced this pull request Dec 8, 2021
The patch aims to help optimize Math.abs() mainly from these three parts:
1) Remove redundant instructions for abs with constant values
2) Remove redundant instructions for abs with char type
3) Convert some common abs operations to ideal forms

1. Remove redundant instructions for abs with constant values

If we can decide the value of the input node for function Math.abs()
at compile-time, we can substitute the Abs node with the absolute
value of the constant and don't have to calculate it at runtime.

For example,
  int[] a
  for (int i = 0; i < SIZE; i++) {
    a[i] = Math.abs(-38);
  }

Before the patch, the generated code for the testcase above is:
...
  mov   w10, #0xffffffda
  cmp   w10, wzr
  cneg  w17, w10, lt
  dup   v16.8h, w17
...
After the patch, the generated code for the testcase above is :
...
  movi  v16.4s, #0x26
...

2. Remove redundant instructions for abs with char type

In Java semantics, as the char type is always non-negative, we
could actually remove the absI node in the C2 middle end.

As for vectorization part, in current SLP, the vectorization of
Math.abs() with char type is intentionally disabled after
JDK-8261022 because it generates incorrect result before. After
removing the AbsI node in the middle end, Math.abs(char) can be
vectorized naturally.

For example,

  char[] a;
  char[] b;
  for (int i = 0; i < SIZE; i++) {
    b[i] = (char) Math.abs(a[i]);
  }

Before the patch, the generated assembly code for the testcase
above is:

B15:
  add   x13, x21, w20, sxtw openjdk#1
  ldrh  w11, [x13, openjdk#16]
  cmp   w11, wzr
  cneg  w10, w11, lt
  strh  w10, [x13, openjdk#16]
  ldrh  w10, [x13, openjdk#18]
  cmp   w10, wzr
  cneg  w10, w10, lt
  strh  w10, [x13, openjdk#18]
  ...
  add   w20, w20, #0x1
  cmp   w20, w17
  b.lt  B15

After the patch, the generated assembly code is:
B15:
  sbfiz x18, x19, openjdk#1, openjdk#32
  add   x0, x14, x18
  ldr   q16, [x0, openjdk#16]
  add   x18, x21, x18
  str   q16, [x18, openjdk#16]
  ldr   q16, [x0, openjdk#32]
  str   q16, [x18, openjdk#32]
  ...
  add   w19, w19, #0x40
  cmp   w19, w17
  b.lt  B15

3. Convert some common abs operations to ideal forms

The patch overrides some virtual support functions for AbsNode
so that optimization of gvn can work on it. Here are the optimizable
forms:

a) abs(0 - x) => abs(x)

Before the patch:
  ...
  ldr   w13, [x13, openjdk#16]
  neg   w13, w13
  cmp   w13, wzr
  cneg  w14, w13, lt
  ...
After the patch:
  ...
  ldr   w13, [x13, openjdk#16]
  cmp   w13, wzr
  cneg  w13, w13, lt
  ...

b) abs(abs(x))  => abs(x)

Before the patch:
  ...
  ldr   w12, [x12, openjdk#16]
  cmp   w12, wzr
  cneg  w12, w12, lt
  cmp   w12, wzr
  cneg  w12, w12, lt
  ...
After the patch:
  ...
  ldr   w13, [x13, openjdk#16]
  cmp   w13, wzr
  cneg  w13, w13, lt
  ...

Change-Id: I5434c01a225796caaf07ffbb19983f4fe2e206bd
fg1417 pushed a commit to fg1417/jdk that referenced this pull request Dec 8, 2021
The patch aims to help optimize Math.abs() mainly from these three parts:
1) Remove redundant instructions for abs with constant values
2) Remove redundant instructions for abs with char type
3) Convert some common abs operations to ideal forms

1. Remove redundant instructions for abs with constant values

If we can decide the value of the input node for function Math.abs()
at compile-time, we can substitute the Abs node with the absolute
value of the constant and don't have to calculate it at runtime.

For example,
  int[] a
  for (int i = 0; i < SIZE; i++) {
    a[i] = Math.abs(-38);
  }

Before the patch, the generated code for the testcase above is:
...
  mov   w10, #0xffffffda
  cmp   w10, wzr
  cneg  w17, w10, lt
  dup   v16.8h, w17
...
After the patch, the generated code for the testcase above is :
...
  movi  v16.4s, #0x26
...

2. Remove redundant instructions for abs with char type

In Java semantics, as the char type is always non-negative, we
could actually remove the absI node in the C2 middle end.

As for vectorization part, in current SLP, the vectorization of
Math.abs() with char type is intentionally disabled after
JDK-8261022 because it generates incorrect result before. After
removing the AbsI node in the middle end, Math.abs(char) can be
vectorized naturally.

For example,

  char[] a;
  char[] b;
  for (int i = 0; i < SIZE; i++) {
    b[i] = (char) Math.abs(a[i]);
  }

Before the patch, the generated assembly code for the testcase
above is:

B15:
  add   x13, x21, w20, sxtw openjdk#1
  ldrh  w11, [x13, openjdk#16]
  cmp   w11, wzr
  cneg  w10, w11, lt
  strh  w10, [x13, openjdk#16]
  ldrh  w10, [x13, openjdk#18]
  cmp   w10, wzr
  cneg  w10, w10, lt
  strh  w10, [x13, openjdk#18]
  ...
  add   w20, w20, #0x1
  cmp   w20, w17
  b.lt  B15

After the patch, the generated assembly code is:
B15:
  sbfiz x18, x19, openjdk#1, openjdk#32
  add   x0, x14, x18
  ldr   q16, [x0, openjdk#16]
  add   x18, x21, x18
  str   q16, [x18, openjdk#16]
  ldr   q16, [x0, openjdk#32]
  str   q16, [x18, openjdk#32]
  ...
  add   w19, w19, #0x40
  cmp   w19, w17
  b.lt  B15

3. Convert some common abs operations to ideal forms

The patch overrides some virtual support functions for AbsNode
so that optimization of gvn can work on it. Here are the optimizable
forms:

a) abs(0 - x) => abs(x)

Before the patch:
  ...
  ldr   w13, [x13, openjdk#16]
  neg   w13, w13
  cmp   w13, wzr
  cneg  w14, w13, lt
  ...
After the patch:
  ...
  ldr   w13, [x13, openjdk#16]
  cmp   w13, wzr
  cneg  w13, w13, lt
  ...

b) abs(abs(x))  => abs(x)

Before the patch:
  ...
  ldr   w12, [x12, openjdk#16]
  cmp   w12, wzr
  cneg  w12, w12, lt
  cmp   w12, wzr
  cneg  w12, w12, lt
  ...
After the patch:
  ...
  ldr   w13, [x13, openjdk#16]
  cmp   w13, wzr
  cneg  w13, w13, lt
  ...

Change-Id: I5434c01a225796caaf07ffbb19983f4fe2e206bd
shqking added a commit to shqking/jdk that referenced this pull request Mar 7, 2022
*** Implementation

In AArch64 NEON, vector shift right is implemented by vector shift left
instructions (SSHL[1] and USHL[2]) with negative shift count value. In
C2 backend, we generate a `neg` to given shift value followed by `sshl`
or `ushl` instruction.

For vector shift right, the vector shift count has two origins:
1) it can be duplicated from scalar variable/immediate(case-1),
2) it can be loaded directly from one vector(case-2).

This patch aims to optimize case-1. Specifically, we move the negate
from RShiftV* rules to RShiftCntV rule. As a result, the negate can be
hoisted outside of the loop if it's a loop invariant.

In this patch,
1) we split vshiftcnt* rules into vslcnt* and vsrcnt* rules to handle
shift left and shift right respectively. Compared to vslcnt* rules, the
negate is conducted in vsrcnt*.
2) for each vsra* and vsrl* rules, we create one variant, i.e. vsra*_var
and vsrl*_var. We use vsra* and vsrl* rules to handle case-1, and use
vsra*_var and vsrl*_var rules to handle case-2. Note that
ShiftVNode::is_var_shift() can be used to distinguish case-1 from
case-2.
3) we add one assertion for the vs*_imm rules as we have done on
ARM32[3].
4) several style issues are resolved.

*** Example

Take function `rShiftInt()` in the newly added micro benchmark
VectorShiftRight.java as an example.

```
public void rShiftInt() {
    for (int i = 0; i < SIZE; i++) {
        intsB[i] = intsA[i] >> count;
    }
}
```

Arithmetic shift right is conducted inside a big loop. The following
code snippet shows the disassembly code generated by auto-vectorization
before we apply current patch. We can see that `neg` is conducted in the
loop body.

```
0x0000ffff89057a64:   dup     v16.16b, w13              <-- dup
0x0000ffff89057a68:   mov     w12, #0x7d00                    // #32000
0x0000ffff89057a6c:   sub     w13, w2, w10
0x0000ffff89057a70:   cmp     w2, w10
0x0000ffff89057a74:   csel    w13, wzr, w13, lt
0x0000ffff89057a78:   mov     w8, #0x7d00                     // #32000
0x0000ffff89057a7c:   cmp     w13, w8
0x0000ffff89057a80:   csel    w13, w12, w13, hi
0x0000ffff89057a84:   add     w14, w13, w10
0x0000ffff89057a88:   nop
0x0000ffff89057a8c:   nop
0x0000ffff89057a90:   sbfiz   x13, x10, openjdk#2, openjdk#32         <-- loop entry
0x0000ffff89057a94:   add     x15, x17, x13
0x0000ffff89057a98:   ldr     q17, [x15,openjdk#16]
0x0000ffff89057a9c:   add     x13, x0, x13
0x0000ffff89057aa0:   neg     v18.16b, v16.16b          <-- neg
0x0000ffff89057aa4:   sshl    v17.4s, v17.4s, v18.4s    <-- shift right
0x0000ffff89057aa8:   str     q17, [x13,openjdk#16]
0x0000ffff89057aac:   ...
0x0000ffff89057b1c:   add     w10, w10, #0x20
0x0000ffff89057b20:   cmp     w10, w14
0x0000ffff89057b24:   b.lt    0x0000ffff89057a90        <-- loop end
```

Here is the disassembly code after we apply current patch. We can see
that the negate is no longer conducted inside the loop, and it is
hoisted to the outside.

```
0x0000ffff8d053a68:   neg     w14, w13                  <---- neg
0x0000ffff8d053a6c:   dup     v16.16b, w14              <---- dup
0x0000ffff8d053a70:   sub     w14, w2, w10
0x0000ffff8d053a74:   cmp     w2, w10
0x0000ffff8d053a78:   csel    w14, wzr, w14, lt
0x0000ffff8d053a7c:   mov     w8, #0x7d00                     // #32000
0x0000ffff8d053a80:   cmp     w14, w8
0x0000ffff8d053a84:   csel    w14, w12, w14, hi
0x0000ffff8d053a88:   add     w13, w14, w10
0x0000ffff8d053a8c:   nop
0x0000ffff8d053a90:   sbfiz   x14, x10, openjdk#2, openjdk#32         <-- loop entry
0x0000ffff8d053a94:   add     x15, x17, x14
0x0000ffff8d053a98:   ldr     q17, [x15,openjdk#16]
0x0000ffff8d053a9c:   sshl    v17.4s, v17.4s, v16.4s    <-- shift right
0x0000ffff8d053aa0:   add     x14, x0, x14
0x0000ffff8d053aa4:   str     q17, [x14,openjdk#16]
0x0000ffff8d053aa8:   ...
0x0000ffff8d053afc:   add     w10, w10, #0x20
0x0000ffff8d053b00:   cmp     w10, w13
0x0000ffff8d053b04:   b.lt    0x0000ffff8d053a90        <-- loop end
```

*** Testing

Tier1~3 tests passed on Linux/AArch64 platform.

*** Performance Evaluation

- Auto-vectorization

One micro benchmark, i.e. VectorShiftRight.java, is added by this patch
in order to evaluate the optimization on vector shift right.

The following table shows the result. Column `Score-1` shows the score
before we apply current patch, and column `Score-2` shows the score when
we apply current patch.

We witness about 30% ~ 53% improvement on microbenchmarks.

```
Benchmark                      Units    Score-1    Score-2
VectorShiftRight.rShiftByte   ops/ms  10601.980  13816.353
VectorShiftRight.rShiftInt    ops/ms   3592.831   5502.941
VectorShiftRight.rShiftLong   ops/ms   1584.012   2425.247
VectorShiftRight.rShiftShort  ops/ms   6643.414   9728.762
VectorShiftRight.urShiftByte  ops/ms   2066.965   2048.336 (*)
VectorShiftRight.urShiftChar  ops/ms   6660.805   9728.478
VectorShiftRight.urShiftInt   ops/ms   3592.909   5514.928
VectorShiftRight.urShiftLong  ops/ms   1583.995   2422.991

*: Logical shift right for Byte type(urShiftByte) is not vectorized, as
disscussed in [4].
```

- VectorAPI

Furthermore, we also evaluate the impact of this patch on VectorAPI
benchmarks, e.g., [5]. Details can be found in the table below. Columns
`Score-1` and `Score-2` show the scores before and after applying
current patch.

```
Benchmark                  Units    Score-1    Score-2
Byte128Vector.LSHL        ops/ms  10867.666  10873.993
Byte128Vector.LSHLShift   ops/ms  10945.729  10945.741
Byte128Vector.LSHR        ops/ms   8629.305   8629.343
Byte128Vector.LSHRShift   ops/ms   8245.864  10303.521   <--
Byte128Vector.ASHR        ops/ms   8619.691   8629.438
Byte128Vector.ASHRShift   ops/ms   8245.860  10305.027   <--
Int128Vector.LSHL         ops/ms   3104.213   3103.702
Int128Vector.LSHLShift    ops/ms   3114.354   3114.371
Int128Vector.LSHR         ops/ms   2380.717   2380.693
Int128Vector.LSHRShift    ops/ms   2312.871   2992.377   <--
Int128Vector.ASHR         ops/ms   2380.668   2380.647
Int128Vector.ASHRShift    ops/ms   2312.894   2992.332   <--
Long128Vector.LSHL        ops/ms   1586.907   1587.591
Long128Vector.LSHLShift   ops/ms   1589.469   1589.540
Long128Vector.LSHR        ops/ms   1209.754   1209.687
Long128Vector.LSHRShift   ops/ms   1174.718   1527.502   <--
Long128Vector.ASHR        ops/ms   1209.713   1209.669
Long128Vector.ASHRShift   ops/ms   1174.712   1527.174   <--
Short128Vector.LSHL       ops/ms   5945.542   5943.770
Short128Vector.LSHLShift  ops/ms   5984.743   5984.640
Short128Vector.LSHR       ops/ms   4613.378   4613.577
Short128Vector.LSHRShift  ops/ms   4486.023   5746.466   <--
Short128Vector.ASHR       ops/ms   4613.389   4613.478
Short128Vector.ASHRShift  ops/ms   4486.019   5746.368   <--
```

1) For logical shift left(LSHL and LSHLShift), and shift right with
variable vector shift count(LSHR and ASHR) cases, we didn't find much
changes, which is expected.

2) For shift right with scalar shift count(LSHRShift and ASHRShift)
case, about 25% ~ 30% improvement can be observed, and this benefit is
introduced by current patch.

[1] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/SSHL--Signed-Shift-Left--register--
[2] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/USHL--Unsigned-Shift-Left--register--
[3] openjdk/jdk18#41
[4] openjdk#1087
[5] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L509
e1iu added a commit to e1iu/jdk that referenced this pull request Mar 24, 2022
This patch fixes the wrong matching rule of replicate2L_zero. It was
matched "ReplicateI" by mistake so that long immediates(not only zero)
had to be moved to register first and matched to replicate2L finally. To
fix this trivial bug, this patch fixes the typo and extends the rule of
replicate2L_zero to replicate2L_imm, which now supports all possible
long immediate values.

The final code changes are shown as below:

replicate2L_imm:

        mov   x13, #0xff
        movk  x13, #0xff, lsl openjdk#16
        movk  x13, #0xff, lsl openjdk#32
        dup   v16.2d, x13

        =>

        movi  v16.2d, #0xff00ff00ff

[Test]
test/jdk/jdk/incubator/vector, test/hotspot/jtreg/compiler/vectorapi
passed without failure.

Change-Id: Ieac92820dea560239a968de3d7430003f01726bd
fg1417 pushed a commit to fg1417/jdk that referenced this pull request Mar 25, 2022
… operations

Some loop cases of subword types, including byte and
short, can't be vectorized by C2's SLP. Here is an example:
```
short[] addShort(short[] a, short[] b, short[] c) {
  for (int i = 0; i < SIZE; i++) {
    b[i] = (short) (a[i] + 8); //  *line A*
    sres[i] = (short) (b[i] + c[i]); // *line B*
  }
}
```
However, similar cases of int/float/double/long/char type can
be vectorized successfully.

The reason why SLP can't vectorize the short case above is
that, as illustrated here[1], the result of the scalar add
operation on *line A* has been promoted to int type. It needs
to be narrowed to short type first before it can work as one
of source operands of addition on *line B*. The demotion is
done by left-shifting 16 bits then right-shifting 16 bits.
The ideal graph for the process is showed like below.

 LoadS a[i]  8
       \   /
        AddI (line A)
       /    \
StoreC b[i]  Lshift 16bits
                 \
               RShiftI 16 bits    LoadS c[i]
                       \           /
                       AddI (line B)
                           \
                         StoreC sres[i]

In SLP, for most short-type cases, we can determine the precise
type of the scalar int-type operation and finally execute it
with short-type vector operations[2], except rshift opcode and
abs in some situations[3]. But in this case, the source operand
of RShiftI is from LShiftI rather than from any LoadS[4], so we
can't determine its real type and conservatively assign it with
int type rather than real short type. The int-type opearation
RShiftI here can't be vectorized together with other short-type
operations, like AddI(line B). The reason for byte loop cases
is the same. Similar loop cases of char type could be
vectorized because its demotion from int to char is done by
`and` with mask rather than `lshift_rshift`.

Therefore, we try to remove the patterns like
`RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short
cases, to vectorize more scenarios. Optimizing it in the
mid-end by i-GVN is more reasonable.

What we do in the mid-end is eliminating the sign extension
before some subword integer operations like:

```
int x, y;
short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16
```
to
```
short s = (short) (x OP y);
```

In the patch, assuming that `x` can be any int number, we need
guarantee that the optimization doesn't have any impact on
result. Not all arithmetic logic OPs meet the requirements. For
example, assuming that `Imm` equals `16`, `x` equals `131068`,
`y` equals `50` and `OP` is division`/`,
`short s = (short) (((131068 << 16) >> 16) / 50)` is not
equal to `short s = (short) (131068 / 50)`. When OP is division,
we may get different result with or without demotion
before OP, because the upper 16 bits of division may have
influence on the lower 16 bits of result, which can't be
optimized. All optimizable opcodes are listed in
StoreNode::no_need_sign_extension(), whose upper 16 bits of src
operands don't influence the lower 16 bits of result for short
type and upper 24 bits of src operand don't influence the lower
8 bits of dst operand for byte.

After the patch, the short loop case above can be vectorized as:
```
movi    v18.8h, #0x8
...
ldr     q16, [x14, openjdk#32] // vector load a[i]
// vector add, a[i] + 8, no promotion or demotion
add     v17.8h, v16.8h, v18.8h
str     q17, [x6, openjdk#32]  // vector store a[i] + 8, b[i]
ldr     q17, [x0, openjdk#32]  // vector load c[i]
// vector add, a[i] + c[i], no promotion or demotion
add     v16.8h, v17.8h, v16.8h
// vector add, a[i] + c[i] + 8, no promotion or demotion
add     v16.8h, v16.8h, v18.8h
str     q16, [x11, openjdk#32]  //vector store sres[i]
...
```

The patch works for byte cases as well.

Here is the performance data for micro-benchmark before
and after this patch on both AArch64 and x64 machines.
We can observe about ~83% improvement with this patch.

on AArch64:
Before the patch:
Benchmark (length)  Mode  Cnt    Score    Error  Units
addB       523      avgt   15  401.521 ±  0.033  ns/op
addS       523      avgt   15  401.512 ±  0.021  ns/op

After the patch:
Benchmark (length)  Mode  Cnt    Score   Error  Units
addB       523      avgt   15   68.444 ± 0.318  ns/op
addS       523      avgt   15   69.847 ± 0.043  ns/op

on x86:
Before the patch:
Benchmark (length)  Mode  Cnt    Score    Error  Units
addB       523      avgt   15  454.102 ± 36.180  ns/op
addS       523      avgt   15  432.245 ± 22.640  ns/op

After the patch:
Benchmark (length)  Mode  Cnt    Score    Error  Units
addB       523      avgt   15   75.812 ±  5.063  ns/op
addS       523      avgt   15   72.839 ± 10.109  ns/op

[1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241
[2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206
[3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249
[4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251

Change-Id: I92ce42b550ef057964a3b58716436735275d8d31
fg1417 pushed a commit to fg1417/jdk that referenced this pull request Mar 28, 2022
```
public short[] vectorUnsignedShiftRight(short[] shorts) {
    short[] res = new short[SIZE];
    for (int i = 0; i < SIZE; i++) {
        res[i] = (short) (shorts[i] >>> 3);
    }
    return res;
}
```
In C2's SLP, vectorization of unsigned shift right on signed
subword types (byte/short) like the case above is intentionally
disabled[1]. Because the vector unsigned shift on signed
subword types behaves differently from the Java spec. It's
worthy to vectorize more cases in quite low cost. Also,
unsigned shift right on signed subword is not uncommon and we
may find similar cases in Lucene benchmark[2].

Taking unsigned right shift on short type as an example,

Short:
    | <- 16 bits  -> |  <- 16 bits ->  |
    | 1 1 1 ... 1  1 |      data       |

when the shift amount is a constant not greater than the number
of sign extended bits, 16 higher bits for short type shown like
above, the unsigned shift on signed subword types can be
transformed into a signed shift and hence becomes vectorizable.
Here is the transformation:

For T_SHORT (shift <= 16):
  src    RShiftCntV shift          src    RShiftCntV shift
   \      /                  ==>    \       /
   URShiftVS                         RShiftVS

This patch does the transformation in SuperWord::implemented() and
SuperWord::output(). It helps vectorize the short cases above. We
can handle unsigned right shift on byte type in a similar way. The
generated assembly code for one iteration on aarch64 is like:
```
...
sbfiz   x13, x10, openjdk#1, openjdk#32
add     x15, x11, x13
ldr     q16, [x15, openjdk#16]
sshr    v16.8h, v16.8h, openjdk#3
add     x13, x17, x13
str     q16, [x13, openjdk#16]
...
```

Here is the performance data for micro-benchmark before and after
this patch on both AArch64 and x64 machines. We can observe about
~80% improvement with this patch.

The perf data on AArch64:
Before the patch:
Benchmark        (SIZE)  (shiftCount)  Mode  Cnt    Score   Error  Units
urShiftImmByte    1024         3       avgt    5  295.711 ± 0.117  ns/op
urShiftImmShort   1024         3       avgt    5  284.559 ± 0.148  ns/op

after the patch:
Benchmark         (SIZE) (shiftCount)  Mode  Cnt    Score   Error  Units
urShiftImmByte     1024        3       avgt    5   45.111 ± 0.047  ns/op
urShiftImmShort    1024        3       avgt    5   55.294 ± 0.072  ns/op

The perf data on X86:
Before the patch:
Benchmark        (SIZE) (shiftCount)  Mode  Cnt    Score    Error  Units
urShiftImmByte    1024        3       avgt    5  361.374 ±  4.621  ns/op
urShiftImmShort   1024        3       avgt    5  365.390 ±  3.595  ns/op

After the patch:
Benchmark        (SIZE) (shiftCount)  Mode  Cnt    Score    Error  Units
urShiftImmByte    1024        3       avgt    5  105.489 ±  0.488  ns/op
urShiftImmShort   1024        3       avgt    5   43.400 ±  0.394  ns/op

[1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190
[2] https://github.com/jpountz/decode-128-ints-benchmark/

Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161
e1iu added a commit to e1iu/jdk that referenced this pull request Mar 29, 2022
This patch optimizes the backend implementation of VectorMaskToLong for
AArch64, given a more efficient approach to mov value bits from
predicate register to general purpose register as x86 PMOVMSK[1] does,
by using BEXT[2] which is available in SVE2.

With this patch, the final code (input mask is byte type with
SPECIESE_512, generated on an SVE vector reg size of 512-bit QEMU
emulator) changes as below:

Before:

        mov     z16.b, p0/z, #1
        fmov    x0, d16
        orr     x0, x0, x0, lsr openjdk#7
        orr     x0, x0, x0, lsr openjdk#14
        orr     x0, x0, x0, lsr openjdk#28
        and     x0, x0, #0xff
        fmov    x8, v16.d[1]
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#8

        orr     x8, xzr, #0x2
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#16

        orr     x8, xzr, #0x3
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#24

        orr     x8, xzr, #0x4
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#32

        mov     x8, #0x5
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#40

        orr     x8, xzr, #0x6
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#48

        orr     x8, xzr, #0x7
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#56

After:

        mov     z16.b, p0/z, #1
        mov     z17.b, #1
        bext    z16.d, z16.d, z17.d
        mov     z17.d, #0
        uzp1    z16.s, z16.s, z17.s
        uzp1    z16.h, z16.h, z17.h
        uzp1    z16.b, z16.b, z17.b
        mov     x0, v16.d[0]

[1] https://www.felixcloutier.com/x86/pmovmskb
[2] https://developer.arm.com/documentation/ddi0602/2020-12/SVE-Instructions/BEXT--Gather-lower-bits-from-positions-selected-by-bitmask-

Change-Id: Ia983a20c89f76403e557ac21328f2f2e05dd08e0
e1iu added a commit to e1iu/jdk that referenced this pull request Apr 21, 2022
This patch optimizes the backend implementation of VectorMaskToLong for
AArch64, given a more efficient approach to mov value bits from
predicate register to general purpose register as x86 PMOVMSK[1] does,
by using BEXT[2] which is available in SVE2.

With this patch, the final code (input mask is byte type with
SPECIESE_512, generated on an SVE vector reg size of 512-bit QEMU
emulator) changes as below:

Before:

        mov     z16.b, p0/z, #1
        fmov    x0, d16
        orr     x0, x0, x0, lsr openjdk#7
        orr     x0, x0, x0, lsr openjdk#14
        orr     x0, x0, x0, lsr openjdk#28
        and     x0, x0, #0xff
        fmov    x8, v16.d[1]
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#8

        orr     x8, xzr, #0x2
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#16

        orr     x8, xzr, #0x3
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#24

        orr     x8, xzr, #0x4
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#32

        mov     x8, #0x5
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#40

        orr     x8, xzr, #0x6
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#48

        orr     x8, xzr, #0x7
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr openjdk#7
        orr     x8, x8, x8, lsr openjdk#14
        orr     x8, x8, x8, lsr openjdk#28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl openjdk#56

After:

        mov     z16.b, p0/z, #1
        mov     z17.b, #1
        bext    z16.d, z16.d, z17.d
        mov     z17.d, #0
        uzp1    z16.s, z16.s, z17.s
        uzp1    z16.h, z16.h, z17.h
        uzp1    z16.b, z16.b, z17.b
        mov     x0, v16.d[0]

[1] https://www.felixcloutier.com/x86/pmovmskb
[2] https://developer.arm.com/documentation/ddi0602/2020-12/SVE-Instructions/BEXT--Gather-lower-bits-from-positions-selected-by-bitmask-

Change-Id: Ia983a20c89f76403e557ac21328f2f2e05dd08e0
asotona added a commit to asotona/jdk that referenced this pull request Oct 7, 2022
* removed FrameKind from StackMapTableAttribute and offsets replaced with labels in StackMapFrames

* simplification of StackMapFrame, removal of compressed sub-types

* Implemented UnboundStackMapTableAttribute and added related factory methods
dropped VerificationType
SimpleVerificationTypeInfo refactored to enum

* implemented StackMapTableAttribute compression and writing

* StackMapTableAttribute extends CodeElement

* added Classfile.Option.processStackMaps
enable stack maps processing in RebuildingTransformation test helper
fixed UninitializedVerificationTypeInfo labels resolution

* fixed Opcode

* enable unordered StackMapTableAttribute entries

* removal of PROCESS_STACK_MAPS option
generated stack maps override user content
adjusted labels inflation from stack maps

* minor patch in RebuildingTransformation test helper
asotona added a commit to asotona/jdk that referenced this pull request Feb 6, 2023
* removed FrameKind from StackMapTableAttribute and offsets replaced with labels in StackMapFrames

* simplification of StackMapFrame, removal of compressed sub-types

* Implemented UnboundStackMapTableAttribute and added related factory methods
dropped VerificationType
SimpleVerificationTypeInfo refactored to enum

* implemented StackMapTableAttribute compression and writing

* StackMapTableAttribute extends CodeElement

* added Classfile.Option.processStackMaps
enable stack maps processing in RebuildingTransformation test helper
fixed UninitializedVerificationTypeInfo labels resolution

* fixed Opcode

* enable unordered StackMapTableAttribute entries

* removal of PROCESS_STACK_MAPS option
generated stack maps override user content
adjusted labels inflation from stack maps

* minor patch in RebuildingTransformation test helper
caojoshua pushed a commit to caojoshua/jdk that referenced this pull request May 8, 2023
JVM-1916: a Virtual State becomes top after merge
robehn pushed a commit to robehn/jdk that referenced this pull request Aug 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated
1 participant