8370863: VectorAPI: Optimize the VectorMaskCast chain in specific patterns #28313
`VectorMaskCastNode` casts a vector mask from one type to another. The
cast may be generated by a call to the Vector API `cast` method, or
inserted by the compiler. For example, some vector mask operations
like `trueCount` require the input mask to be of an integer type, so
for floating-point masks the compiler automatically casts the mask to
the corresponding integer-type mask before performing the mask
operation. This kind of cast is very common.
If the vector element size is unchanged, the `VectorMaskCastNode`
generates no code; otherwise, code is generated to extend or narrow
the mask. This IR node is not free whether or not it generates code,
because it may block some optimizations. For example:
1. `(VectorStoreMask (VectorMaskCast (VectorLoadMask x)))`
The middle `VectorMaskCast` prevents the following optimization:
`(VectorStoreMask (VectorLoadMask x)) => (x)`
2. `(VectorMaskToLong (VectorMaskCast (VectorLongToMask x)))`, which
blocks the optimization `(VectorMaskToLong (VectorLongToMask x)) => (x)`.
In these IR patterns, the value of the input `x` is not changed, so we
can safely do the optimization. But if the input value is changed, we
can't eliminate the cast.
The general idea of this PR is to introduce an `uncast_mask` helper
function, which can be used to uncast a chain of `VectorMaskCastNode`s,
like the existing `Node::uncast(bool)` function. The function returns
the first non-`VectorMaskCastNode`.
The intended use case is when the IR pattern to be optimized may
contain one or more consecutive `VectorMaskCastNode` and this does not
affect the correctness of the optimization. Then this function can be
called to eliminate the `VectorMaskCastNode` chain.
Current optimizations related to `VectorMaskCastNode` include:
1. `(VectorMaskCast (VectorMaskCast x)) => (x)`, see JDK-8356760.
2. `(XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1))
=> (VectorMaskCast (VectorMaskCmp src1 src2 ncond))`, see JDK-8354242.
This PR does the following optimizations:
1. Extends the optimization pattern `(VectorMaskCast (VectorMaskCast x)) => (x)`
to `(VectorMaskCast (VectorMaskCast ... (VectorMaskCast x))) => (x)`,
because the optimization is correct as long as the types of the head
and tail `VectorMaskCastNode` are consistent.
2. Supports a new optimization pattern
`(VectorStoreMask (VectorMaskCast ... (VectorLoadMask x))) => (x)`.
Since the value before and after the pattern is a boolean vector, it
remains unchanged as long as the vector length remains the same, and
this is guaranteed at the API level.
I conducted some simple research on different mask generation methods
and mask operations, and obtained the following table, which includes
some potential optimization opportunities that may use this `uncast_mask`
function.
```
mask_gen\op toLong anyTrue allTrue trueCount firstTrue lastTrue
compare N/A N/A N/A N/A N/A N/A
maskAll TBI TBI TBI TBI TBI TBI
fromLong TBI TBI N/A TBI TBI TBI
mask_gen\op and or xor andNot not laneIsSet
compare N/A N/A N/A N/A TBI N/A
maskAll TBI TBI TBI TBI TBI TBI
fromLong N/A N/A N/A N/A TBI TBI
```
`TBI` indicates that there may be a potential optimization here that
requires further investigation.
Benchmarks:
On a Nvidia Grace machine with 128-bit SVE2:
```
Benchmark Unit Before Error After Error Uplift
microMaskLoadCastStoreByte64 ops/us 59.23 0.21 148.12 0.07 2.50
microMaskLoadCastStoreDouble128 ops/us 2.43 0.00 38.31 0.01 15.73
microMaskLoadCastStoreFloat128 ops/us 6.19 0.00 75.67 0.11 12.22
microMaskLoadCastStoreInt128 ops/us 6.19 0.00 75.67 0.03 12.22
microMaskLoadCastStoreLong128 ops/us 2.43 0.00 38.32 0.01 15.74
microMaskLoadCastStoreShort64 ops/us 28.89 0.02 75.60 0.09 2.62
```
On a Nvidia Grace machine with 128-bit NEON:
```
Benchmark Unit Before Error After Error Uplift
microMaskLoadCastStoreByte64 ops/us 75.75 0.19 149.74 0.08 1.98
microMaskLoadCastStoreDouble128 ops/us 8.71 0.03 38.71 0.05 4.44
microMaskLoadCastStoreFloat128 ops/us 24.05 0.03 76.49 0.05 3.18
microMaskLoadCastStoreInt128 ops/us 24.06 0.02 76.51 0.05 3.18
microMaskLoadCastStoreLong128 ops/us 8.72 0.01 38.71 0.02 4.44
microMaskLoadCastStoreShort64 ops/us 24.64 0.01 76.43 0.06 3.10
```
On an AMD EPYC 9124 16-Core Processor with AVX3:
```
Benchmark Unit Before Error After Error Uplift
microMaskLoadCastStoreByte64 ops/us 82.13 0.31 115.14 0.08 1.40
microMaskLoadCastStoreDouble128 ops/us 0.32 0.00 0.32 0.00 1.01
microMaskLoadCastStoreFloat128 ops/us 42.18 0.05 57.56 0.07 1.36
microMaskLoadCastStoreInt128 ops/us 42.19 0.01 57.53 0.08 1.36
microMaskLoadCastStoreLong128 ops/us 0.30 0.01 0.32 0.00 1.05
microMaskLoadCastStoreShort64 ops/us 42.18 0.05 57.59 0.01 1.37
```
On an AMD EPYC 9124 16-Core Processor with AVX2:
```
Benchmark Unit Before Error After Error Uplift
microMaskLoadCastStoreByte64 ops/us 73.53 0.20 114.98 0.03 1.56
microMaskLoadCastStoreDouble128 ops/us 0.29 0.01 0.30 0.00 1.00
microMaskLoadCastStoreFloat128 ops/us 30.78 0.14 57.50 0.01 1.87
microMaskLoadCastStoreInt128 ops/us 30.65 0.26 57.50 0.01 1.88
microMaskLoadCastStoreLong128 ops/us 0.30 0.00 0.30 0.00 0.99
microMaskLoadCastStoreShort64 ops/us 24.92 0.00 57.49 0.01 2.31
```
On an AMD EPYC 9124 16-Core Processor with AVX1:
```
Benchmark Unit Before Error After Error Uplift
microMaskLoadCastStoreByte64 ops/us 79.68 0.01 248.49 0.91 3.12
microMaskLoadCastStoreDouble128 ops/us 0.28 0.00 0.28 0.00 1.00
microMaskLoadCastStoreFloat128 ops/us 31.11 0.04 95.48 2.27 3.07
microMaskLoadCastStoreInt128 ops/us 31.10 0.03 99.94 1.87 3.21
microMaskLoadCastStoreLong128 ops/us 0.28 0.00 0.28 0.00 0.99
microMaskLoadCastStoreShort64 ops/us 31.11 0.02 94.97 2.30 3.05
```
This PR was tested on 128-bit, 256-bit, and 512-bit (QEMU) aarch64
environments and on two 512-bit x64 machines, covering various
configurations including SVE2, SVE1, NEON, AVX3, AVX2, AVX1, SSE4, and
SSE3; all tests passed.
erifan: Updated the JMH benchmarks and the new test results.
galderz left a comment:
Nice improvement @erifan, just some small comments from me
```
// (VectorStoreMask (VectorMaskCast ... (VectorLoadMask x))) => (x)
// x remains to be a bool vector with no changes.
// This function can be used to eliminate the VectorMaskCast in such patterns.
Node* VectorNode::uncast_mask(Node* n) {
```
galderz: Could this be a static method instead?
erifan: Yeah, it's already a static method. See https://github.com/openjdk/jdk/pull/28313/files#diff-ba9e2d10a50a01316946660ec9f68321eb864fd9c815616c10abbec39360efe5R141
Or do you mean a static method limited to this file? If so, I'd prefer not to; it may be used in other places. Thanks~
```
@IR(counts = { IRNode.VECTOR_MASK_CAST, "= 0" },
    applyIfCPUFeatureAnd = {"avx2", "true", "avx512", "false"})
public static int testTwoCastToDifferentType2() {
    // The types before and after the two casts are not the same, so the cast cannot be eliminated.
```
galderz: Could you expand the documentation on the IR assertions? It's not immediately clear why with AVX-512 the cast remains but with AVX2 it's removed. Also, this comment is outdated.
erifan: This is because the following optimization on AVX2 affects this one:
`(VectorStoreMask (VectorMaskCast ... (VectorLoadMask x))) => x`
On AVX2, `trueCount()` requires converting the mask to a boolean vector first via `VectorStoreMask`, so `VectorStoreMask` can apply the above optimization, which eliminates all `VectorMaskCast` nodes as a side effect.
On AVX-512, masks use dedicated mask registers (`k` registers), so `VectorStoreMask` is not generated for `trueCount()` and the `VectorMaskCast` nodes remain.
I reorganised this file, please take another look, thanks~
```
@IR(counts = { IRNode.VECTOR_MASK_CAST, "= 0" },
    applyIfCPUFeatureAnd = {"asimd", "true", "sve", "false"})
public static int testTwoCastToDifferentType() {
    // The types before and after the two casts are not the same, so the cast cannot be eliminated.
```
galderz: Outdated comment. Also, please expand the assertion comments.
erifan: Done, thanks!
```
}

@Test
@IR(counts = { IRNode.VECTOR_LONG_TO_MASK, "= 0",
```
galderz: Could you add some assertion comments here as well, to explain what causes the differences between architectures?
erifan: Done
```
}

@Test
@IR(counts = { IRNode.VECTOR_LONG_TO_MASK, "= 0",
```
galderz: Same here.
erifan: Done
erifan left a comment:
Thanks for your review! @galderz