@jatin-bhateja
Member

@jatin-bhateja jatin-bhateja commented Aug 8, 2024

Hi All,

As per the discussion on the panama-dev mailing list [1], this patch adds support for the following new two-vector permutation API.

Declaration:-
    Vector<E>.selectFrom(Vector<E> v1, Vector<E> v2)

Semantics:-
Using index values stored in the lanes of "this" vector, assemble the values stored in the first (v1) and second (v2) vector arguments. Thus, the first and second vectors serve as a table whose elements are selected based on the index vector. The API is applicable to all integral and floating-point types. The result of this operation is semantically equivalent to the expression v1.rearrange(this.toShuffle(), v2). Values held in the index vector lanes must lie within the valid two-vector index range [0, 2*VLEN); otherwise an IndexOutOfBoundsException is thrown.
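
A scalar reference model may make the semantics above more concrete. The sketch below is illustrative only: `SelectFromModel` and its `selectFrom` method are hypothetical names, plain `int[]` arrays stand in for vectors, and the throwing behavior follows this description rather than any final specification.

```java
import java.util.Arrays;

// Hypothetical scalar model; int[] arrays stand in for vector lanes.
final class SelectFromModel {
    // index[n] selects from the 2*VLEN-entry table formed by v1 followed by v2.
    static int[] selectFrom(int[] index, int[] v1, int[] v2) {
        int vlen = index.length;
        if (v1.length != vlen || v2.length != vlen) {
            throw new IllegalArgumentException("all vectors must have the same length");
        }
        int[] result = new int[vlen];
        for (int n = 0; n < vlen; n++) {
            int i = index[n];
            if (i < 0 || i >= 2 * vlen) {
                throw new IndexOutOfBoundsException("lane " + n + ": index " + i);
            }
            // Indexes in [0, VLEN) read v1; indexes in [VLEN, 2*VLEN) read v2.
            result[n] = (i < vlen) ? v1[i] : v2[i - vlen];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] v1 = {10, 11, 12, 13};
        int[] v2 = {20, 21, 22, 23};
        int[] index = {0, 5, 3, 7};
        // prints [10, 21, 13, 23]
        System.out.println(Arrays.toString(selectFrom(index, v1, v2)));
    }
}
```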

Summary of changes:

  • Java-side implementation of the new selectFrom API.
  • C2 compiler IR and inline expander changes.
  • In the absence of a direct two-vector permutation instruction in the target ISA, a lowering transformation decomposes the new IR into constituent IR supported by the target platform.
  • Optimized x86 backend implementation for AVX512 and legacy targets.
  • Functional tests covering the new API.
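
The lowering mentioned in the third bullet can be pictured with a scalar model. This is only a sketch of one plausible decomposition (two single-vector permutes followed by a blend), not the patch's actual C2 IR. `LoweringModel`, `permute`, and `selectFromLowered` are hypothetical names, VLEN is assumed to be a power of two, and the indexes are assumed to be already validated against [0, 2*VLEN).

```java
// Hypothetical scalar model of lowering a two-vector select.
final class LoweringModel {
    // Single-vector permute: every index must already lie in [0, VLEN).
    static int[] permute(int[] index, int[] src) {
        int[] out = new int[index.length];
        for (int n = 0; n < index.length; n++) {
            out[n] = src[index[n]];
        }
        return out;
    }

    static int[] selectFromLowered(int[] index, int[] v1, int[] v2) {
        int vlen = index.length;
        int[] wrapped = new int[vlen];
        boolean[] fromSecond = new boolean[vlen];
        for (int n = 0; n < vlen; n++) {
            wrapped[n] = index[n] & (vlen - 1);  // lane position inside either table half
            fromSecond[n] = index[n] >= vlen;    // blend mask: true selects v2
        }
        int[] p1 = permute(wrapped, v1);
        int[] p2 = permute(wrapped, v2);
        int[] result = new int[vlen];
        for (int n = 0; n < vlen; n++) {
            result[n] = fromSecond[n] ? p2[n] : p1[n];
        }
        return result;
    }
}
```

On a target with a native two-table lookup instruction, none of this decomposition would be needed, which is the case the optimized backend path is meant to cover.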

The JMH microbenchmark included with this patch shows around 10-15x gain over the existing rearrange API:
Test System: Intel(R) Xeon(R) Platinum 8480+ [Sapphire Rapids Server]

  Benchmark                                     (size)   Mode  Cnt      Score   Error   Units
SelectFromBenchmark.rearrangeFromByteVector     1024  thrpt    2   2041.762          ops/ms
SelectFromBenchmark.rearrangeFromByteVector     2048  thrpt    2   1028.550          ops/ms
SelectFromBenchmark.rearrangeFromIntVector      1024  thrpt    2    962.605          ops/ms
SelectFromBenchmark.rearrangeFromIntVector      2048  thrpt    2    479.004          ops/ms
SelectFromBenchmark.rearrangeFromLongVector     1024  thrpt    2    359.758          ops/ms
SelectFromBenchmark.rearrangeFromLongVector     2048  thrpt    2    178.192          ops/ms
SelectFromBenchmark.rearrangeFromShortVector    1024  thrpt    2   1463.459          ops/ms
SelectFromBenchmark.rearrangeFromShortVector    2048  thrpt    2    727.556          ops/ms
SelectFromBenchmark.selectFromByteVector        1024  thrpt    2  33254.830          ops/ms
SelectFromBenchmark.selectFromByteVector        2048  thrpt    2  17313.174          ops/ms
SelectFromBenchmark.selectFromIntVector         1024  thrpt    2  10756.804          ops/ms
SelectFromBenchmark.selectFromIntVector         2048  thrpt    2   5398.244          ops/ms
SelectFromBenchmark.selectFromLongVector        1024  thrpt    2   5856.859          ops/ms
SelectFromBenchmark.selectFromLongVector        2048  thrpt    2   1513.378          ops/ms
SelectFromBenchmark.selectFromShortVector       1024  thrpt    2  17888.617          ops/ms
SelectFromBenchmark.selectFromShortVector       2048  thrpt    2   9079.565          ops/ms
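
For readers who want the per-case ratios behind the summary figure, the arithmetic can be reproduced with a few lines of Java. The scores are copied from the table above; `SpeedupTable` and its members are hypothetical names used only for this illustration.

```java
// Hypothetical helper: divides each selectFrom score by the matching rearrange score.
final class SpeedupTable {
    static final String[] CASES = {
        "Byte/1024", "Byte/2048", "Int/1024", "Int/2048",
        "Long/1024", "Long/2048", "Short/1024", "Short/2048"
    };
    // Throughput in ops/ms, in the same order as CASES.
    static final double[] REARRANGE = {
        2041.762, 1028.550, 962.605, 479.004, 359.758, 178.192, 1463.459, 727.556
    };
    static final double[] SELECT_FROM = {
        33254.830, 17313.174, 10756.804, 5398.244, 5856.859, 1513.378, 17888.617, 9079.565
    };

    static double[] speedups() {
        double[] ratios = new double[CASES.length];
        for (int k = 0; k < CASES.length; k++) {
            ratios[k] = SELECT_FROM[k] / REARRANGE[k];
        }
        return ratios;
    }

    public static void main(String[] args) {
        double[] ratios = speedups();
        for (int k = 0; k < CASES.length; k++) {
            System.out.printf("%-11s %.1fx%n", CASES[k], ratios[k]);
        }
    }
}
```

With these inputs the ratios fall between roughly 8.5x (Long, size 2048) and roughly 16.8x (Byte, size 2048).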

Kindly review and share your feedback.

Best Regards,
Jatin

[1] https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change requires CSR request JDK-8340338 to be approved
  • Change must be properly reviewed (3 reviews required, with at least 1 Reviewer, 2 Authors)

Issues

  • JDK-8338023: Support two vector selectFrom API (Enhancement - P4)
  • JDK-8340338: Support two vector selectFrom API (CSR)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20508/head:pull/20508
$ git checkout pull/20508

Update a local copy of the PR:
$ git checkout pull/20508
$ git pull https://git.openjdk.org/jdk.git pull/20508/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 20508

View PR using the GUI difftool:
$ git pr show -t 20508

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20508.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Aug 8, 2024

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Aug 8, 2024

@jatin-bhateja This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8338023: Support two vector selectFrom API

Reviewed-by: psandoz, epeter, sviswanathan

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 199 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk

openjdk bot commented Aug 8, 2024

@jatin-bhateja The following labels will be automatically applied to this pull request:

  • core-libs
  • graal
  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added graal graal-dev@openjdk.org hotspot hotspot-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels Aug 8, 2024
@jatin-bhateja
Member Author

/label add hotspot-compiler-dev

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Aug 8, 2024
@openjdk

openjdk bot commented Aug 8, 2024

@jatin-bhateja
The hotspot-compiler label was successfully added.

@jatin-bhateja jatin-bhateja marked this pull request as ready for review August 8, 2024 16:56
@openjdk openjdk bot added the rfr Pull request is ready for review label Aug 8, 2024
@mlbridge

mlbridge bot commented Aug 8, 2024

Member

@PaulSandoz PaulSandoz left a comment

The results look promising. I can provide guidance on the specification e.g., we can specify the behavior in terms of rearrange, with the addition of throwing on out of bounds indexes.

Regarding the throwing of exceptions, some wider context will help to know where we are heading before we finalize the specification. I believe we are considering changing the default throwing behavior for index out of bounds to wrapping, thereby we can avoid bounds checks. If that is the case we should wait until that is done then update rather than submitting a CSR just yet?

I see you created a specific intrinsic, which will avoid the cost of shuffle creation. Should we apply the same approach (in a subsequent PR) to the single-argument shuffle? Or perhaps, if we manage to optimize shuffles and change the default to wrapping, we don't require a specific intrinsic and can just defer to rearrange?

@jatin-bhateja
Member Author

jatin-bhateja commented Aug 14, 2024

Hi @PaulSandoz ,
Thanks for your comments. With this new API we intend to enforce a stricter specification with respect to index values, so that we can emit a lean instruction sequence without spending cycles on massaging inputs into a consumable form, thus avoiding redundant wrapping and unwrapping operations.

The existing two-vector rearrange API has a flexible specification which allows wrapping out-of-bounds shuffle indexes into exceptional indexes with a negative value.

Even if we optimize the existing two-vector rearrange implementation, we will still need to emit additional instructions to generate indexes which lie within the two-vector range [0, 2*VLEN). I see this as a specialized API, like vector compress/expand, which caters to targets like x86-AVX512+ and aarch64-SVE that offer a direct instruction for two-vector lookups.
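
The index massaging described above can be sketched in scalar form. The model below assumes the partial-wrapping convention mentioned in this comment (out-of-range indexes become negative "exceptional" indexes in [-VLEN, -1], which the two-vector rearrange then redirects to the second vector). `RearrangeModel` and its methods are hypothetical names, and `int[]` arrays stand in for vectors and shuffles.

```java
// Hypothetical scalar model of partial wrapping followed by a two-vector rearrange.
final class RearrangeModel {
    // Valid indexes are kept; out-of-range indexes become exceptional values in [-VLEN, -1].
    static int partiallyWrap(int i, int vlen) {
        return (i >= 0 && i < vlen) ? i : Math.floorMod(i, vlen) - vlen;
    }

    // Non-negative sources read v1; exceptional sources read v2 at (source + VLEN).
    static int[] rearrange(int[] shuffle, int[] v1, int[] v2) {
        int vlen = shuffle.length;
        int[] result = new int[vlen];
        for (int n = 0; n < vlen; n++) {
            int i = shuffle[n];
            result[n] = (i >= 0) ? v1[i] : v2[i + vlen];
        }
        return result;
    }

    // Index vector -> shuffle (extra wrapping work) -> rearrange (unwrapping work).
    static int[] selectViaRearrange(int[] index, int[] v1, int[] v2) {
        int vlen = index.length;
        int[] shuffle = new int[vlen];
        for (int n = 0; n < vlen; n++) {
            shuffle[n] = partiallyWrap(index[n], vlen);
        }
        return rearrange(shuffle, v1, v2);
    }
}
```

In this model the wrap and the later unwrap are exactly the extra work that a direct lookup over the range [0, 2*VLEN) would not need.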

Maybe the API nomenclature can be refined to better reflect its semantics, i.e. from selectFrom to twoVectorLookup?

@mlbridge

mlbridge bot commented Aug 17, 2024

Mailing list message from John Rose on hotspot-compiler-dev:

(Better late than never, although I wish I'd been more explicit
about this on panama-dev.)

I think we should be moving away from throwing exceptions on all
reorder/shuffle/permute vector ops, and moving toward wrapping.
These ops all operate on vectors (small arrays) of vector lane
indexes (small array indexes in a fixed domain, always a power
of two). The throwing behavior checks an input for bad indexes
and throws a (scalar) exception if there are any at all. The
wrapping behavior reduces bad indexes to good ones by an unsigned
modulo operation (which is at worst a mask for powers of two).
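
The equivalence between an unsigned modulo and a mask for power-of-two lengths can be checked directly. This is a small illustrative sketch; `WrapIndex` and its method names are hypothetical and are not part of the Vector API.

```java
// Hypothetical helper showing two ways to wrap a lane index for a power-of-two VLEN.
final class WrapIndex {
    // Mask form: a single bitwise AND.
    static int wrapByMask(int i, int vlen) {
        return i & (vlen - 1);
    }

    // Unsigned modulo form: treats i as an unsigned 32-bit value.
    static int wrapByUnsignedModulo(int i, int vlen) {
        return Integer.remainderUnsigned(i, vlen);
    }

    public static void main(String[] args) {
        int vlen = 8;
        for (int i = -16; i < 32; i++) {
            if (wrapByMask(i, vlen) != wrapByUnsignedModulo(i, vlen)) {
                throw new AssertionError("mismatch at index " + i);
            }
        }
        System.out.println("mask and unsigned modulo agree for VLEN = " + vlen);
    }
}
```

The two forms agree because 2^32 is a multiple of any power-of-two VLEN up to 2^31, so negative indexes land on the same lane either way.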

If I'm right, then new API points should start out with wrap
semantics, not throw semantics. And old API points should be
migrated ASAP.

There's no loss of functionality in such a move. Instead the
defaults are moved around. Before, throwing was the default
and wrapping was an explicit operation. After, wrapping would
be the default and throwing would be explicit. Both wrapping
and throwing checks are available through explicit calls to
VectorShuffle methods checkIndexes and wrapIndexes.

OK, so why is wrapping better than throwing? And first, why
did we start with throwing as the default? Well, we chose
throwing as the default to make the vector operations
more Java-like. Java scalar operations don't try to reduce
bad array indexes into the array domain; they throw. Since
a shuffle op is like an array reference, it makes sense to
emulate the checks built into Java array references.

Or it did make sense. I think there is a technical debt
here which is turning out to be hard to pay off. The tech
debt is to suppress or hoist or strength-reduce the vector
instructions that perform the check for invalid indexes
(in parallel), then ask "did any of those checks fail?"
(a mask reduction), then do a conditional branch to
failure code. I think I was over-confident that our
scalar tactics for reducing array range checks would
apply to vectors as well. On second thought, vectorizing
our key optimization, of loop range splitting (pre/main/post
loops) is kind of a nightmare.

Instead, consider the alternative of wrapping. First,
you use vpand or the like to mask the indexes down to
the valid range. Then you run the shuffle/permute
instruction. That's it. There is no scalar query
or branch. And, there are probably some circumstances
where you can omit the vpand operation: Perhaps the
hardware already masks the inputs (as with shift
instructions). Or, perhaps C2 can do bitwise inference
of the vectors and figure out that the vpand is a nop.
(I am agitating for bitwise types in C2; this is a use
case for them.) In the worst case, the vpand op is
fast and pipelines well.
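
The contrast between the two strategies can be illustrated with a scalar model of a single-vector permute. This is a sketch under stated assumptions, not generated code: `BoundsPolicyModel` and its methods are hypothetical names, `int[]` arrays stand in for vectors, and VLEN is assumed to be a power of two.

```java
// Hypothetical scalar model contrasting throw-on-bad-index with wrap-on-bad-index.
final class BoundsPolicyModel {
    // Throwing variant: lane-wise compare, reduce to "any lane bad?", then a scalar branch.
    static int[] permuteChecked(int[] index, int[] src) {
        int vlen = src.length;
        boolean anyBad = false;
        for (int i : index) {
            anyBad |= (i < 0 || i >= vlen);  // models the vector compare plus mask reduction
        }
        if (anyBad) {                        // models the conditional branch to failure code
            throw new IndexOutOfBoundsException("index outside [0, " + vlen + ")");
        }
        int[] out = new int[index.length];
        for (int n = 0; n < index.length; n++) {
            out[n] = src[index[n]];
        }
        return out;
    }

    // Wrapping variant: mask each index, then permute; no scalar query or branch.
    static int[] permuteWrapped(int[] index, int[] src) {
        int vlen = src.length;
        int[] out = new int[index.length];
        for (int n = 0; n < index.length; n++) {
            out[n] = src[index[n] & (vlen - 1)];  // models the vpand before the permute
        }
        return out;
    }
}
```

For the index vector {0, 5, -1, 2} over a four-lane source, the checked variant throws, while the wrapped variant reads lanes 0, 1, 3, and 2.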

This is why I think we should switch, ASAP, to masking
instead of throwing, on bad indexes.

I think some of our reports from customers have shown
that the extra checks necessary for throwing on bad
indexes are giving their code surprising slowdowns,
relative to C-based vector code.

Did I miss a point?

-- John

On 14 Aug 2024, at 3:43, Jatin Bhateja wrote:

@jatin-bhateja
Member Author

jatin-bhateja commented Aug 19, 2024

Hi @rose00,

I agree that wrapping should be the default behaviour if indices are passed through shuffles. The idea was to pick exception-throwing semantics for out-of-bounds indexes only for the selectFrom flavour of APIs, which accept indexes through the vector interface. This will save redundant partial wrapping and unwrapping for the cross-vector permutation API, which has direct mappings in the x86 and AArch64 ISAs.

As @PaulSandoz suggested, we can also tune the existing single-argument 'selectFrom' API to adopt default exception-throwing semantics if any of the indices lie beyond the valid index range.

While we will continue to keep default wrapping semantics for APIs accepting shuffles, this small deviation in semantics for the selectFrom family of APIs will enable generating efficient code and will allow users to choose between the rearrange and selectFrom APIs based on a convenience vs. efficient-code trade-off.

Since the API interfaces were crafted with long-term flexibility in view, having multiple permutation interfaces (selectFrom / rearrange) accepting indexes through a vector or a shuffle enables the compiler to emit efficient code.

Best Regards,
Jatin


@sviswa7 sviswa7 left a comment

Looks good to me.

@sviswa7

sviswa7 commented Oct 3, 2024

/reviewers 2

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Oct 3, 2024
@openjdk

openjdk bot commented Oct 3, 2024

@sviswa7
The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Oct 3, 2024
@PaulSandoz
Member

/reviewers 3

@openjdk

openjdk bot commented Oct 8, 2024

@PaulSandoz
The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 3 (with at least 1 Reviewer, 2 Authors).

Contributor

@eme64 eme64 left a comment

I gave it a quick scan, and I have no further comments. LGTM.

@PaulSandoz
Member

I gave it a quick scan, and I have no further comments. LGTM.

Thank you, I will kick off an internal test.

@PaulSandoz
Member

Tier 1 to 3 tests pass.

@jatin-bhateja
Member Author

/integrate

@jatin-bhateja
Member Author

Thanks @PaulSandoz, @sviswa7 and @eme64 for the review suggestions.

@openjdk

openjdk bot commented Oct 16, 2024

@jatin-bhateja This pull request has not yet been marked as ready for integration.


@sviswa7 sviswa7 left a comment

Changes look good to me.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Oct 16, 2024
@jatin-bhateja
Member Author

/integrate

@openjdk

openjdk bot commented Oct 16, 2024

Going to push as commit 709914f.
Since your change was applied there have been 199 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Oct 16, 2024
@openjdk openjdk bot closed this Oct 16, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Oct 16, 2024
@openjdk

openjdk bot commented Oct 16, 2024

@jatin-bhateja Pushed as commit 709914f.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@liach
Member

liach commented Oct 16, 2024

This patch failed on the latest master. Another reason the OpenJDK guide asks to merge master despite all this commit churn...

@jatin-bhateja jatin-bhateja restored the JDK-8338023 branch October 16, 2024 17:57

Labels

core-libs core-libs-dev@openjdk.org graal graal-dev@openjdk.org hotspot hotspot-dev@openjdk.org hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated

6 participants