Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8262916: Merge LShiftCntV and RShiftCntV into a single node #3371

Closed
wants to merge 2 commits into from

Conversation

@theRealELiu
Copy link
Contributor

@theRealELiu theRealELiu commented Apr 7, 2021

The vector shift count was defined by two separate nodes(LShiftCntV and
RShiftCntV), which would prevent them from being shared when the shift
counts are the same.

public static void test_shiftv(int sh) {
    for (int i = 0; i < N; i+=1) {
        a0[i] = a1[i] << sh;
        b0[i] = b1[i] >> sh;
    }
}

Given the example above, by merging the same shift counts into one
node, they could be shared by shift nodes(RShiftV or LShiftV) like
below:

Before:
1184  LShiftCntV  === _  1189  [[ 1185  ... ]]
1190  RShiftCntV  === _  1189  [[ 1191  ... ]]
1185  LShiftVI  === _  1181  1184  [[ 1186 ]]
1191  RShiftVI  === _  1187  1190  [[ 1192 ]]

After:
1190  ShiftCntV  === _  1189  [[ 1191 1204  ... ]]
1204  LShiftVI  === _  1211  1190  [[ 1203 ]]
1191  RShiftVI  === _  1187  1190  [[ 1192 ]]

The final code could remove one redundant “dup”(scalar->vector),
with one register saved.

Before:
        dup     v16.16b, w12
        dup     v17.16b, w12
        ...
        ldr     q18, [x13, #16]
        sshl    v18.4s, v18.4s, v16.4s
        add     x18, x16, x12           ; iaload

        add     x4, x15, x12
        str     q18, [x4, #16]          ; iastore

        ldr     q18, [x18, #16]
        add     x12, x14, x12
        neg     v19.16b, v17.16b
        sshl    v18.4s, v18.4s, v19.4s
        str     q18, [x12, #16]         ; iastore

After:
        dup	v16.16b, w11
        ...
        ldr	q17, [x13, #16]
        sshl	v17.4s, v17.4s, v16.4s
        add	x2, x22, x11            ; iaload

        add	x4, x16, x11
        str	q17, [x4, #16]          ; iastore

        ldr	q17, [x2, #16]
        add	x11, x21, x11
        neg	v18.16b, v16.16b
        sshl	v17.4s, v17.4s, v18.4s
        str	q17, [x11, #16]         ; iastore


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8262916: Merge LShiftCntV and RShiftCntV into a single node

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/3371/head:pull/3371
$ git checkout pull/3371

Update a local copy of the PR:
$ git checkout pull/3371
$ git pull https://git.openjdk.java.net/jdk pull/3371/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 3371

View PR using the GUI difftool:
$ git pr show -t 3371

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/3371.diff

The vector shift count was defined by two separate nodes(LShiftCntV and
RShiftCntV), which would prevent them from being shared when the shift
counts are the same.

```
public static void test_shiftv(int sh) {
    for (int i = 0; i < N; i+=1) {
        a0[i] = a1[i] << sh;
        b0[i] = b1[i] >> sh;
    }
}
```

Given the example above, by merging the same shift counts into one
node, they could be shared by shift nodes(RShiftV or LShiftV) like
below:

```
Before:
1184  LShiftCntV  === _  1189  [[ 1185  ... ]]
1190  RShiftCntV  === _  1189  [[ 1191  ... ]]
1185  LShiftVI  === _  1181  1184  [[ 1186 ]]
1191  RShiftVI  === _  1187  1190  [[ 1192 ]]

After:
1190  ShiftCntV  === _  1189  [[ 1191 1204  ... ]]
1204  LShiftVI  === _  1211  1190  [[ 1203 ]]
1191  RShiftVI  === _  1187  1190  [[ 1192 ]]
```

The final code could remove one redundant “dup”(scalar->vector),
with one register saved.

```
Before:
        dup     v16.16b, w12
        dup     v17.16b, w12
        ...
        ldr     q18, [x13, openjdk#16]
        sshl    v18.4s, v18.4s, v16.4s
        add     x18, x16, x12           ; iaload

        add     x4, x15, x12
        str     q18, [x4, openjdk#16]          ; iastore

        ldr     q18, [x18, openjdk#16]
        add     x12, x14, x12
        neg     v19.16b, v17.16b
        sshl    v18.4s, v18.4s, v19.4s
        str     q18, [x12, openjdk#16]         ; iastore

After:
        dup	v16.16b, w11
        ...
        ldr	q17, [x13, openjdk#16]
        sshl	v17.4s, v17.4s, v16.4s
        add	x2, x22, x11            ; iaload

        add	x4, x16, x11
        str	q17, [x4, openjdk#16]          ; iastore

        ldr	q17, [x2, openjdk#16]
        add	x11, x21, x11
        neg	v18.16b, v16.16b
        sshl	v17.4s, v17.4s, v18.4s
        str	q17, [x11, openjdk#16]         ; iastore

```

Change-Id: I047f3f32df9535d706a9920857d212610e8ce315
@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Apr 7, 2021

👋 Welcome back eliu! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

Loading

@openjdk openjdk bot added the rfr label Apr 7, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Apr 7, 2021

@theRealELiu The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

Loading

@mlbridge
Copy link

@mlbridge mlbridge bot commented Apr 7, 2021

Webrevs

Loading

@dean-long
Copy link
Member

@dean-long dean-long commented Apr 8, 2021

You should be able to do this without introducing a new node type. You could change the shift rules to match a vector register like x86.ad and aarch64_sve.ad already do.

Loading

@theRealELiu
Copy link
Contributor Author

@theRealELiu theRealELiu commented Apr 8, 2021

You should be able to do this without introducing a new node type.

Do you mean just matching them with ReplicateBNode? Indeed they do generate the same instruction. I was wondering about the same but I'm not sure if ReplicateBNode has the same semantic with ShiftCntV in middle-end.

Loading

@iwanowww
Copy link

@iwanowww iwanowww commented Apr 8, 2021

You should be able to do this without introducing a new node type. You could change the shift rules to match a vector register like x86.ad and aarch64_sve.ad already do.

Not sure what you refer to in x86.ad: vector shifts with variable scalar count require the scalar to be placed in XMM register. ShiftCntV handles register-to-register move between different register classes (RegI and Vec*).

Do you suggest reusing some existing vector node (like Replicate) to covert the scalar index to vector first? It would have slightly different behavior on x86.

Loading

@iwanowww
Copy link

@iwanowww iwanowww commented Apr 8, 2021

Regarding the proposed change itself (LShiftCntV/RShiftCntV => ShiftCntV).

Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now.

As a consequence, in an auto-vectorized loop it will lead to:

  • 1 instruction per loop iteration (multiplied by unrolling factor);
  • no way to hoist the negation of loop invariant index.

Loading

@theRealELiu
Copy link
Contributor Author

@theRealELiu theRealELiu commented Apr 8, 2021

Regarding the proposed change itself (LShiftCntV/RShiftCntV => ShiftCntV).

Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now.

As a consequence, in an auto-vectorized loop it will lead to:

  • 1 instruction per loop iteration (multiplied by unrolling factor);
  • no way to hoist the negation of loop invariant index.

Thanks for your feedback!

I have checked the generated code on AArch32 and it's a shame that 'vneg' is at every use point.

Before:
0xf46c8338:   add fp, r7, fp
0xf46c833c:   vshl.u16  d9, d9, d8
0xf46c8340:   vstr  d9, [fp, #12]
0xf46c8344:   vshl.u16  d9, d10, d8
0xf46c8348:   vstr  d9, [fp, #20]
0xf46c834c:   vshl.u16  d9, d11, d8
0xf46c8350:   vstr  d9, [fp, #28]

After:
0xf4aa1328:   add fp, r7, fp
0xf4aa132c:   vneg.s8 d13, d8
0xf4aa1330:   vshl.u16  d9, d9, d13
0xf4aa1334:   vstr  d9, [fp, #12]
0xf4aa1338:   vneg.s8 d9, d8
0xf4aa133c:   vshl.u16  d9, d10, d9
0xf4aa1340:   vstr  d9, [fp, #20]
0xf4aa1344:   vneg.s8 d9, d8
0xf4aa1348:   vshl.u16  d9, d11, d9
0xf4aa134c:   vstr  d9, [fp, #28]  

I suppose it's more like a trade off that either remaining those two R/LShiftCntV nodes and change AArch64 and X86 to what AArch32 dose, or merging them and accept this defect on AArch32.

Loading

@iwanowww
Copy link

@iwanowww iwanowww commented Apr 8, 2021

LShiftCntV/RShiftCntV were added specifically for AArch32 and other platforms don't need/benefit from such separation.

Ultimately, I'd like to hear what AArch32 port maintainers think about it, but if there are concerns about performance impact expressed, as an alternative a platform-specific logic can be introduced that adjusts Ideal IR shape for URShiftV/RShiftV nodes accordingly.

(Not sure whether it is a good trade-off w.r.t. code complexity though.)

Loading

@dean-long
Copy link
Member

@dean-long dean-long commented Apr 9, 2021

@theRealELiu @iwanowww Yes, but I was thinking of the vshiftcnt rules and didn't realize the replicate rules are virtually identical. Now that I look again, it appears that aarch64 is already doing the same as x86 (converting the scalar shift count to a vector register, and matching the vector register in the rule). So why is it that x86 can share the shift register, but aarch64 cannot? Is it because x86 has combined left- and right-shift into the same rule?

Loading

@theRealELiu
Copy link
Contributor Author

@theRealELiu theRealELiu commented Apr 9, 2021

@theRealELiu @iwanowww Yes, but I was thinking of the vshiftcnt rules and didn't realize the replicate rules are virtually identical. Now that I look again, it appears that aarch64 is already doing the same as x86 (converting the scalar shift count to a vector register, and matching the vector register in the rule). So why is it that x86 can share the shift register, but aarch64 cannot? Is it because x86 has combined left- and right-shift into the same rule?

I suppose x86 has the different kinds of vector shift instructions[1], but AArch64 and AArch32 only have the left one. For this reason, arm should use a "neg-leftshift" pair to implement right shift.

[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp#L1234

Loading

@theRealELiu
Copy link
Contributor Author

@theRealELiu theRealELiu commented Apr 9, 2021

LShiftCntV/RShiftCntV were added specifically for AArch32 and other platforms don't need/benefit from such separation.

I think AArch64 shared the same problem with AArch32 since it only has left shift instruction as well. I checked the generated code on AArch64 with latest master. Test case is TestCharVect2::test_srav[1].

  0x0000ffffa9106c68:   ldr     q17, [x15, #16]
  0x0000ffffa9106c6c:   add     x14, x10, x14
  0x0000ffffa9106c70:   neg     v18.16b, v16.16b
  0x0000ffffa9106c74:   ushl    v17.8h, v17.8h, v18.8h
  0x0000ffffa9106c78:   str     q17, [x14, #16]
  0x0000ffffa9106c7c:   ldr     q17, [x15, #32]
  0x0000ffffa9106c80:   neg     v18.16b, v16.16b
  0x0000ffffa9106c84:   ushl    v17.8h, v17.8h, v18.8h
  0x0000ffffa9106c88:   str     q17, [x14, #32]
  0x0000ffffa9106c8c:   ldr     q17, [x15, #48]
  0x0000ffffa9106c90:   neg     v18.16b, v16.16b
  0x0000ffffa9106c94:   ushl    v17.8h, v17.8h, v18.8h
  0x0000ffffa9106c98:   str     q17, [x14, #48]
  0x0000ffffa9106c9c:   ldr     q17, [x15, #64]
  0x0000ffffa9106ca0:   neg     v18.16b, v16.16b
  0x0000ffffa9106ca4:   ushl    v17.8h, v17.8h, v18.8h
  0x0000ffffa9106ca8:   str     q17, [x14, #64]
  0x0000ffffa9106cac:   ldr     q17, [x15, #80]
  0x0000ffffa9106cb0:   neg     v18.16b, v16.16b
  0x0000ffffa9106cb4:   ushl    v17.8h, v17.8h, v18.8h

It seems that keeping those two RShiftCntV and LShiftCntV is friendly to AArch32/64 in this case, but AArch64 should change to what AArch32 dose. @theRealAph

[1] https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/codegen/TestCharVect2.java#L1215

Loading

@theRealAph
Copy link
Contributor

@theRealAph theRealAph commented Apr 9, 2021

It seems that keeping those two RShiftCntV and LShiftCntV is friendly to AArch32/64 in this case, but AArch64 should changed to what AArch32 dose. @theRealAph

Thanks, but it's been a while since I looked at the vector code. Can you point me to the AArch32 patterns in question, to show me the AArch64 changes needed? Thanks.

Loading

@theRealELiu
Copy link
Contributor Author

@theRealELiu theRealELiu commented Apr 9, 2021

It seems that keeping those two RShiftCntV and LShiftCntV is friendly to AArch32/64 in this case, but AArch64 should changed to what AArch32 dose. @theRealAph

Thanks, but it's been a while since I looked at the vector code. Can you point me to the AArch32 patterns in question, to show me the AArch64 changes needed? Thanks.

AArch32 combinates 'neg' with 'dup' in RShiftCntV[1], which AArch64 has a single 'dup' only[2] and generates 'neg' before every use sites, e.g. RShiftVB[3].

[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/arm/arm.ad#L10451
[2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_neon.ad#L5117
[3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_neon.ad#L5188

Loading

@theRealELiu
Copy link
Contributor Author

@theRealELiu theRealELiu commented Apr 14, 2021

@theRealAph Could you please take a look at this?

Loading

Change-Id: Ie9046b1d7e8f5e2669767756b6b074b564523039
@theRealELiu
Copy link
Contributor Author

@theRealELiu theRealELiu commented Apr 19, 2021

VectorAPI would not profit from this two nodes' separation as the input of RShiftVNode may not be a RShiftCntVNode[1]. Inserting a special 'VNEG' only for AArch64 in mid-end maybe work but seems too ugly and merging those two nodes would harm AArch32. It's quite hard to compromise the benefits between AArch64 and other architectures.

[1] https://github.com/openjdk/jdk/blob/jdk-17%2B18/src/hotspot/cpu/aarch64/aarch64_neon.ad#L5179

Loading

@theRealELiu theRealELiu deleted the merge-vector-shiftcnt branch Jun 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
4 participants