Skip to content

8280510: AArch64: Vectorize operations with loop induction variable #7491

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

pfustc
Copy link
Member

@pfustc pfustc commented Feb 16, 2022

AArch64 has SVE instruction of populating incrementing indices into an
SVE vector register. With this we can vectorize some operations in loop
with the induction variable operand, such as below.

  for (int i = 0; i < count; i++) {
    b[i] = a[i] * i;
  }

This patch enables the vectorization of operations with loop induction
variable by extending current scope of C2 superword vectorizable packs.
Before this patch, any scalar input node in a vectorizable pack must be
an out-of-loop invariant. This patch takes the induction variable input
as consideration. It allows the input to be the iv phi node or phi plus
its index offset, and creates a PopulateIndexNode to generate a vector
filled with incrementing indices. On AArch64 SVE, final generated code
for above loop expression is like below.

  add     x12, x16, x10
  add     x12, x12, #0x10
  ld1w    {z16.s}, p7/z, [x12]
  index   z17.s, w1, #1
  mul     z17.s, p7/m, z17.s, z16.s
  add     x10, x17, x10
  add     x10, x10, #0x10
  st1w    {z17.s}, p7, [x10]

As there is no populating index instruction on AArch64 NEON or other
platforms like x86, a function named is_populate_index_supported() is
created in the VectorNode class for the backend support check.

Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1
are tested and no issue is found. Hotspot jtreg has existing tests in
compiler/c2/cr7192963/Test*Vect.java covering this kind of use cases so
no new jtreg is created within this patch. A new JMH is created in this
patch and tested on a 512-bit SVE machine. Below test result shows the
performance can be significantly improved in some cases.

  Benchmark                       Performance
  IndexVector.exprWithIndex1            ~7.7x
  IndexVector.exprWithIndex2           ~13.3x
  IndexVector.indexArrayFill            ~5.7x

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8280510: AArch64: Vectorize operations with loop induction variable

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/7491/head:pull/7491
$ git checkout pull/7491

Update a local copy of the PR:
$ git checkout pull/7491
$ git pull https://git.openjdk.java.net/jdk pull/7491/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 7491

View PR using the GUI difftool:
$ git pr show -t 7491

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/7491.diff

AArch64 has SVE instruction of populating incrementing indices into an
SVE vector register. With this we can vectorize some operations in loop
with the induction variable operand, such as below.

  for (int i = 0; i < count; i++) {
    b[i] = a[i] * i;
  }

This patch enables the vectorization of operations with loop induction
variable by extending current scope of C2 superword vectorizable packs.
Before this patch, any scalar input node in a vectorizable pack must be
an out-of-loop invariant. This patch takes the induction variable input
as consideration. It allows the input to be the iv phi node or phi plus
its index offset, and creates a PopulateIndexNode to generate a vector
filled with incrementing indices. On AArch64 SVE, final generated code
for above loop expression is like below.

  add     x12, x16, x10
  add     x12, x12, #0x10
  ld1w    {z16.s}, p7/z, [x12]
  index   z17.s, w1, openjdk#1
  mul     z17.s, p7/m, z17.s, z16.s
  add     x10, x17, x10
  add     x10, x10, #0x10
  st1w    {z17.s}, p7, [x10]

As there is no populating index instruction on AArch64 NEON or other
platforms like x86, a function named is_populate_index_supported() is
created in the VectorNode class for the backend support check.

Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1
are tested and no issue is found. Hotspot jtreg has existing tests in
compiler/c2/cr7192963/Test*Vect.java covering this kind of use cases so
no new jtreg is created within this patch. A new JMH is created in this
patch and tested on a 512-bit SVE machine. Below test result shows the
performance can be significantly improved in some cases.

  Benchmark                       Performance
  IndexVector.exprWithIndex1            ~7.7x
  IndexVector.exprWithIndex2           ~13.3x
  IndexVector.indexArrayFill            ~5.7x
@bridgekeeper
Copy link

bridgekeeper bot commented Feb 16, 2022

👋 Welcome back pli! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Feb 16, 2022
@openjdk
Copy link

openjdk bot commented Feb 16, 2022

@pfustc The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Feb 16, 2022
@mlbridge
Copy link

mlbridge bot commented Feb 16, 2022

Webrevs

@pfustc
Copy link
Member Author

pfustc commented Mar 14, 2022

Can anyone help review this? This optimization is currently for AArch64 but changes are mainly in C2 mid-end.

Copy link
Contributor

@adinn adinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks ok to me modulo the cut-and-paste error I highlighted.

Have you any way of ensuring that the generated code is correct (other than the fact that it runs quicker).

Also have you checked this does not invalidate any other SVE tests. I would expect some failures thanks to the cut-and-paste error.

@adinn
Copy link
Contributor

adinn commented Mar 15, 2022

It would be a good idea if someone who knows the superword code better than me could comment on this extension to the range of cases covered by superword transformation. This looks like a very useful extension to the range of cases that can be optimized using superword analysis. However, I am not familiar enough with the code to know if this is the best way to proceed or if some other alternative has been considered and might provide a better alternative, especially wrt architectures other than aarch64.

@pfustc
Copy link
Member Author

pfustc commented Mar 16, 2022

The code looks ok to me modulo the cut-and-paste error I highlighted.

Thanks for your careful review. The cut-and-paste error has been fixed in my updated patch.

Have you any way of ensuring that the generated code is correct (other than the fact that it runs quicker).

Also have you checked this does not invalidate any other SVE tests. I would expect some failures thanks to the cut-and-paste error.

I've run almost full jtreg on an SVE featured machine and no issue is found. I noticed that existing hotspot jtregs cases in compiler/c2/cr7192963/Test*Vect.java have this kind of loop patterns. (Unfortunately, there is no related failure found even with that cut-and-paste error.)

It would be a good idea if someone who knows the superword code better than me could comment on this extension to the range of cases covered by superword transformation. This looks like a very useful extension to the range of cases that can be optimized using superword analysis. However, I am not familiar enough with the code to know if this is the best way to proceed or if some other alternative has been considered and might provide a better alternative, especially wrt architectures other than aarch64.

I agree. I think people who know superword well are mostly in Oracle so I'd like to see if anyone in Oracle could help look at this. I've considered enabling this for x86. But as I'm not familiar with x86 instruction set enough, I haven't found any instruction(s) to implement such index vector. I would be very glad if someone from Intel can give me some suggestions.

@pfustc
Copy link
Member Author

pfustc commented Apr 12, 2022

@vnkozlov @TobiHartmann Can you also help look at this?

@TobiHartmann
Copy link
Member

Please resolve the merge conflicts.

@pfustc
Copy link
Member Author

pfustc commented Apr 20, 2022

Please resolve the merge conflicts.

Done. After merge, this is also tested with recent new jtreg cases in compiler/vectorization/runner/ArrayIndexFillTest.java

Copy link
Member

@TobiHartmann TobiHartmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I executed some testing, all passed.

Do we need to add the new node to MatchRule::is_expensive()?

Do we need to add a declaration to vmStructs.cpp?

@pfustc
Copy link
Member Author

pfustc commented Apr 26, 2022

Hi @TobiHartmann ,

Do we need to add the new node to MatchRule::is_expensive()?

Do we need to add a declaration to vmStructs.cpp?

I've fixed these in my latest commit. In that commit, I also align matching rule code with other rules in AArch64 ad file.

Copy link
Contributor

@adinn adinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looks good to me

@openjdk
Copy link

openjdk bot commented Apr 26, 2022

@pfustc This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8280510: AArch64: Vectorize operations with loop induction variable

Reviewed-by: adinn, thartmann

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 143 new commits pushed to the master branch:

  • d7514b0: 8285595: Assert frame anchor doesn't change in safepoints/handshakes
  • 5629c75: 8284848: C2: Compiler blackhole arguments should be treated as globally escaping
  • 85f8d14: 8283994: Make Xerces DatatypeException stackless
  • 4f2e4c7: 8178969: [TESTBUG] Wrong reporting of gc/g1/humongousObjects/TestHeapCounters test.
  • 4795165: 8285696: AlgorithmConstraints:permits not throwing IllegalArgumentException when 'alg' is null
  • c1173c2: 8285493: ECC calculation error
  • 89fd6d3: 8284910: Buffer clean in PasswordCallback
  • cf1b00a: 8285683: Missing @ since 11 in java.security.spec.MGF1ParameterSpec fields
  • 3312d8c: 8285756: clean up use of bad arguments for @clean in langtools tests
  • 5c09349: 8285032: vmTestbase/nsk/jdi/EventSet/suspendPolicy/suspendpolicy008/ fails with "eventSet.suspendPolicy() != policyExpected"
  • ... and 133 more: https://git.openjdk.java.net/jdk/compare/b9f513c62406b9b58a8e860f7cb5b4d53226e7e9...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Apr 26, 2022
Copy link
Member

@TobiHartmann TobiHartmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me too. Have you considered adding a IR verification test?

@pfustc
Copy link
Member Author

pfustc commented Apr 28, 2022

Hi @TobiHartmann ,

Looks good to me too. Have you considered adding a IR verification test?

Thanks for your review. And good question!

You might have remembered that we contributed a vectorization test framework together with our post loop fix (#6828). Currently we are enabling IR tests in that framework. So far we have made some progress. So IR tests will be added for all vectorizable loops there (in hotspot:compiler/vectorization/runner/) in the near future.

@pfustc
Copy link
Member Author

pfustc commented Apr 28, 2022

/integrate

@openjdk
Copy link

openjdk bot commented Apr 28, 2022

Going to push as commit ea83b44.
Since your change was applied there have been 146 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Apr 28, 2022
@openjdk openjdk bot closed this Apr 28, 2022
@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Apr 28, 2022
@openjdk openjdk bot removed the rfr Pull request is ready for review label Apr 28, 2022
@openjdk
Copy link

openjdk bot commented Apr 28, 2022

@pfustc Pushed as commit ea83b44.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@TobiHartmann
Copy link
Member

Okay, thanks for the heads up!

@pfustc pfustc deleted the indexvector branch April 28, 2022 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

3 participants