8380424: C2: Fix missing identity optimization for vector nodes #30529
erifan wants to merge 24 commits into
…terns
`VectorMaskCastNode` is used to cast a vector mask from one type to
another type. The cast may be generated by calling the vector API `cast`
or generated by the compiler. For example, some vector mask operations
like `trueCount` require the input mask to be integer types, so for
floating point type masks, the compiler will cast the mask to the
corresponding integer type mask automatically before doing the mask
operation. This kind of cast is very common.
If the vector element size is not changed, the `VectorMaskCastNode`
doesn't generate code; otherwise code is generated to extend or narrow
the mask. This IR node is not free whether or not it generates code,
because it may block some optimizations. For example:
1. `(VectorStoreMask (VectorMaskCast (VectorLoadMask x)))`
The middle `VectorMaskCast` prevents the following optimization:
`(VectorStoreMask (VectorLoadMask x)) => (x)`
2. `(VectorMaskToLong (VectorMaskCast (VectorLongToMask x)))`, which
blocks the optimization `(VectorMaskToLong (VectorLongToMask x)) => (x)`.
In these IR patterns, the value of the input `x` is not changed, so we
can safely do the optimization. But if the input value is changed, we
can't eliminate the cast.
The general idea of this PR is to introduce an `uncast_mask` helper
function, which can be used to uncast a chain of `VectorMaskCastNode`,
like the existing `Node::uncast(bool)` function. The function returns
the first node that is not a `VectorMaskCastNode`.
The intended use case is when the IR pattern to be optimized may
contain one or more consecutive `VectorMaskCastNode` and this does not
affect the correctness of the optimization. Then this function can be
called to eliminate the `VectorMaskCastNode` chain.
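The chain-skipping idea described above can be sketched with a toy model. This is plain Java, not HotSpot code: the `Node` record, the `Op` enum, and the `uncastMask` method are hypothetical stand-ins for the C2 IR and the `uncast_mask` helper.

```java
// Toy stand-in for C2 IR nodes; NOT HotSpot code. All names are hypothetical.
final class UncastMaskSketch {
    enum Op { VECTOR_MASK_CAST, VECTOR_LOAD_MASK, OTHER }

    // A node with an opcode and a single input (enough for a cast chain).
    record Node(Op op, Node in) {}

    // Walk up through consecutive VECTOR_MASK_CAST nodes and return the
    // first node that is not a cast.
    static Node uncastMask(Node n) {
        while (n != null && n.op() == Op.VECTOR_MASK_CAST) {
            n = n.in();
        }
        return n;
    }

    public static void main(String[] args) {
        Node x = new Node(Op.VECTOR_LOAD_MASK, new Node(Op.OTHER, null));
        Node chain = new Node(Op.VECTOR_MASK_CAST,
                new Node(Op.VECTOR_MASK_CAST, x));
        // The two casts are skipped and the load-mask node is returned.
        System.out.println(uncastMask(chain) == x);
    }
}
```

A caller would only use such a helper when, as stated above, the casts do not affect the correctness of the optimization being matched.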
Current optimizations related to `VectorMaskCastNode` include:
1. `(VectorMaskCast (VectorMaskCast x)) => (x)`, see JDK-8356760.
2. `(XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1))
=> (VectorMaskCast (VectorMaskCmp src1 src2 ncond))`, see JDK-8354242.
This PR does the following optimizations:
1. Extends the optimization pattern `(VectorMaskCast (VectorMaskCast x)) => (x)`
to `(VectorMaskCast (VectorMaskCast ... (VectorMaskCast x))) => (x)`.
As long as the types of the head and tail `VectorMaskCastNode` are
consistent, the optimization is correct.
2. Supports a new optimization pattern
`(VectorStoreMask (VectorMaskCast ... (VectorLoadMask x))) => (x)`.
Since the value before and after the pattern is a boolean vector, it
remains unchanged as long as the vector length remains the same, and
this is guaranteed at the API level.
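The second pattern can be illustrated with a small model. It is a sketch under assumptions: the `Node` record, its `lanes` field, and the `identity` method are hypothetical and only mimic the shape of the rule, not the real `VectorStoreMaskNode::Identity()`.

```java
// Toy model of (VectorStoreMask (VectorMaskCast ... (VectorLoadMask x))) => (x).
// NOT HotSpot code; all names here are hypothetical.
final class StoreMaskIdentitySketch {
    enum Op { BOOL_VECTOR, VECTOR_LOAD_MASK, VECTOR_MASK_CAST, VECTOR_STORE_MASK }

    // lanes models the vector length; in is the single input (null for leaves).
    record Node(Op op, int lanes, Node in) {}

    static Node identity(Node n) {
        if (n.op() != Op.VECTOR_STORE_MASK) {
            return n;
        }
        // Skip any number of consecutive mask casts.
        Node in = n.in();
        while (in.op() == Op.VECTOR_MASK_CAST) {
            in = in.in();
        }
        // The boolean vector is unchanged only if the lane count is preserved.
        if (in.op() == Op.VECTOR_LOAD_MASK && in.lanes() == n.lanes()) {
            return in.in();
        }
        return n;
    }

    public static void main(String[] args) {
        Node x = new Node(Op.BOOL_VECTOR, 4, null);
        Node load = new Node(Op.VECTOR_LOAD_MASK, 4, x);
        Node cast = new Node(Op.VECTOR_MASK_CAST, 4,
                new Node(Op.VECTOR_MASK_CAST, 4, load));
        Node store = new Node(Op.VECTOR_STORE_MASK, 4, cast);
        System.out.println(identity(store) == x);
    }
}
```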
I conducted some simple research on different mask generation methods
and mask operations, and obtained the following table, which includes
some potential optimization opportunities that may use this `uncast_mask`
function.
```
mask_gen\op  toLong  anyTrue  allTrue  trueCount  firstTrue  lastTrue
compare      N/A     N/A      N/A      N/A        N/A        N/A
maskAll      TBI     TBI      TBI      TBI        TBI        TBI
fromLong     TBI     TBI      N/A      TBI        TBI        TBI

mask_gen\op  and     or       xor      andNot     not        laneIsSet
compare      N/A     N/A      N/A      N/A        TBI        N/A
maskAll      TBI     TBI      TBI      TBI        TBI        TBI
fromLong     N/A     N/A      N/A      N/A        TBI        TBI
```
`TBI` indicates that there may be potential optimizations here that
require further investigation.
Benchmarks:
On an NVIDIA Grace machine with 128-bit SVE2:
```
Benchmark                        Unit    Before  Error  After   Error  Uplift
microMaskLoadCastStoreByte64     ops/us  59.23   0.21   148.12  0.07   2.50
microMaskLoadCastStoreDouble128  ops/us  2.43    0.00   38.31   0.01   15.73
microMaskLoadCastStoreFloat128   ops/us  6.19    0.00   75.67   0.11   12.22
microMaskLoadCastStoreInt128     ops/us  6.19    0.00   75.67   0.03   12.22
microMaskLoadCastStoreLong128    ops/us  2.43    0.00   38.32   0.01   15.74
microMaskLoadCastStoreShort64    ops/us  28.89   0.02   75.60   0.09   2.62
```
On an NVIDIA Grace machine with 128-bit NEON:
```
Benchmark                        Unit    Before  Error  After   Error  Uplift
microMaskLoadCastStoreByte64     ops/us  75.75   0.19   149.74  0.08   1.98
microMaskLoadCastStoreDouble128  ops/us  8.71    0.03   38.71   0.05   4.44
microMaskLoadCastStoreFloat128   ops/us  24.05   0.03   76.49   0.05   3.18
microMaskLoadCastStoreInt128     ops/us  24.06   0.02   76.51   0.05   3.18
microMaskLoadCastStoreLong128    ops/us  8.72    0.01   38.71   0.02   4.44
microMaskLoadCastStoreShort64    ops/us  24.64   0.01   76.43   0.06   3.10
```
On an AMD EPYC 9124 16-Core Processor with AVX3:
```
Benchmark                        Unit    Before  Error  After   Error  Uplift
microMaskLoadCastStoreByte64     ops/us  82.13   0.31   115.14  0.08   1.40
microMaskLoadCastStoreDouble128  ops/us  0.32    0.00   0.32    0.00   1.01
microMaskLoadCastStoreFloat128   ops/us  42.18   0.05   57.56   0.07   1.36
microMaskLoadCastStoreInt128     ops/us  42.19   0.01   57.53   0.08   1.36
microMaskLoadCastStoreLong128    ops/us  0.30    0.01   0.32    0.00   1.05
microMaskLoadCastStoreShort64    ops/us  42.18   0.05   57.59   0.01   1.37
```
On an AMD EPYC 9124 16-Core Processor with AVX2:
```
Benchmark                        Unit    Before  Error  After   Error  Uplift
microMaskLoadCastStoreByte64     ops/us  73.53   0.20   114.98  0.03   1.56
microMaskLoadCastStoreDouble128  ops/us  0.29    0.01   0.30    0.00   1.00
microMaskLoadCastStoreFloat128   ops/us  30.78   0.14   57.50   0.01   1.87
microMaskLoadCastStoreInt128     ops/us  30.65   0.26   57.50   0.01   1.88
microMaskLoadCastStoreLong128    ops/us  0.30    0.00   0.30    0.00   0.99
microMaskLoadCastStoreShort64    ops/us  24.92   0.00   57.49   0.01   2.31
```
On an AMD EPYC 9124 16-Core Processor with AVX1:
```
Benchmark                        Unit    Before  Error  After   Error  Uplift
microMaskLoadCastStoreByte64     ops/us  79.68   0.01   248.49  0.91   3.12
microMaskLoadCastStoreDouble128  ops/us  0.28    0.00   0.28    0.00   1.00
microMaskLoadCastStoreFloat128   ops/us  31.11   0.04   95.48   2.27   3.07
microMaskLoadCastStoreInt128     ops/us  31.10   0.03   99.94   1.87   3.21
microMaskLoadCastStoreLong128    ops/us  0.28    0.00   0.28    0.00   0.99
microMaskLoadCastStoreShort64    ops/us  31.11   0.02   94.97   2.30   3.05
```
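The `Uplift` column in the tables above appears to be the ratio of the `After` throughput to the `Before` throughput; that reading is an assumption, since the text does not define the column. A minimal sketch of the calculation, using the Byte64 row of the SVE2 table (the class and method names are hypothetical):

```java
// Assumed definition: uplift = throughput after / throughput before.
final class UpliftSketch {
    static double uplift(double before, double after) {
        return after / before;
    }

    public static void main(String[] args) {
        // Byte64 row of the SVE2 table: Before 59.23 ops/us, After 148.12 ops/us.
        System.out.printf("%.2f%n", uplift(59.23, 148.12));
    }
}
```

Recomputing from the rounded table values does not always reproduce the last digit. For example, 38.31 / 2.43 is about 15.77 while the table reports 15.73, presumably because the reported ratios were computed from unrounded measurements.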
This PR was tested on 128-bit, 256-bit, and 512-bit (QEMU) aarch64
environments, and two 512-bit x64 machines with various configurations,
including sve2, sve1, neon, avx3, avx2, avx1, sse4 and sse3. All tests
passed.
Also refined the tests.
Ideal and Identity optimizations require all input nodes of the IR pattern to be ready for the optimization to take effect. However, node generation in the incremental inlining phase is unordered, so sometimes downstream nodes in the IR pattern are generated before upstream nodes, causing Ideal or Identity optimizations to be missed. If no subsequent process triggers the optimization again, it is missed forever.

Vector nodes (especially those generated by the Vector API) are often wrapped in `VectorBoxNode` during generation, and the existence of these box and unbox nodes further hinders the matching of IR optimization patterns. The `-XX:VerifyIterativeGVN` option allows us to check which IGVN optimizations are missed; however, the verification for vector nodes is currently skipped. Enabling the Identity optimization check for vector nodes shows that many tests fail, as shown below.

```
jdk/incubator/vector/ByteVector128LoadStoreTests.java
jdk/incubator/vector/ByteVector256LoadStoreTests.java
jdk/incubator/vector/ByteVector512LoadStoreTests.java
jdk/incubator/vector/ByteVector64LoadStoreTests.java
jdk/incubator/vector/ByteVectorMaxLoadStoreTests.java
jdk/incubator/vector/ShortVector128LoadStoreTests.java
jdk/incubator/vector/ShortVector256LoadStoreTests.java
jdk/incubator/vector/ShortVector512LoadStoreTests.java
jdk/incubator/vector/ShortVector64LoadStoreTests.java
jdk/incubator/vector/ShortVectorMaxLoadStoreTests.java
jdk/incubator/vector/Vector512ConversionTests.java
jdk/incubator/vector/Vector64ConversionTests.java#id0
jdk/incubator/vector/VectorMaxConversionTests.java#id0
```

They are caused by missed optimizations in `AndVNode::Identity()` and `ShiftVNode::Identity()`. From JDK-8370863, we know that `VectorStoreMaskNode::Identity()` may be missed as well.

To recover these potentially missed optimizations, we need to trigger them again at appropriate points.
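The ordering problem described above can be sketched with a toy graph. This is a simplified model rather than C2: the `Node` class, the `(AndV x AllOnes) => x` rule, and the `eliminateBoxes` step are hypothetical stand-ins chosen only to show how a rule can fail at node creation and never be retried.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of a missed Identity; NOT HotSpot code, all names hypothetical.
final class MissedIdentitySketch {
    enum Op { VALUE, ALL_ONES, BOX, UNBOX, AND_V }

    static final class Node {
        final Op op;
        final List<Node> in = new ArrayList<>();
        Node(Op op, Node... inputs) {
            this.op = op;
            Collections.addAll(this.in, inputs);
        }
    }

    // Identity rule: (AndV x AllOnes) => x. It only fires if input 1 is
    // literally an ALL_ONES node at the moment the rule is evaluated.
    static Node identity(Node n) {
        if (n.op == Op.AND_V && n.in.get(1).op == Op.ALL_ONES) {
            return n.in.get(0);
        }
        return n;
    }

    // Box elimination: rewire (Unbox (Box v)) inputs to v. The user node is
    // changed in place and nobody re-runs its Identity afterwards.
    static void eliminateBoxes(Node user) {
        for (int i = 0; i < user.in.size(); i++) {
            Node u = user.in.get(i);
            if (u.op == Op.UNBOX && u.in.get(0).op == Op.BOX) {
                user.in.set(i, u.in.get(0).in.get(0));
            }
        }
    }

    public static void main(String[] args) {
        Node x = new Node(Op.VALUE);
        Node boxedOnes = new Node(Op.UNBOX, new Node(Op.BOX, new Node(Op.ALL_ONES)));
        Node and = new Node(Op.AND_V, x, boxedOnes);

        // At creation time the box hides the all-ones input: the rule misses.
        System.out.println(identity(and) == x);   // false
        eliminateBoxes(and);
        // The rule would match now, but only if something evaluates it again.
        System.out.println(identity(and) == x);   // true
    }
}
```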
Currently, a GVN optimization runs once during node generation, and if no subsequent changes are made, the node will not be added to the IGVN worklist to trigger IGVN optimization again. Therefore, the corresponding nodes need to be added to the IGVN worklist at appropriate points.

Many phases affect the shape of the node tree, but inlining and boxing have a particularly significant impact on vector nodes. After `PhaseVector`, inlining is complete, and vector boxing/unboxing has been eliminated. At this point, the node tree is fully materialized, with no additional interfering nodes. Therefore, this PR adds all nodes to the IGVN worklist at this point to recover potentially missed GVN optimizations.

However, this modification still cannot handle the situation after `PhaseVector`, so this PR also enhances the notification of multi-hop IR optimization patterns in `add_users_of_use_to_worklist`.

With this PR, the failing tests above passed in 100 runs, so this PR enables identity optimization verification for vector nodes. We expect that with this PR, there will be very few cases of vector identity optimization misses; if they do occur, we should fix them rather than skip them.

This PR does not enable `Ideal` optimization verification for vector nodes because the inputs of some commutative nodes may be swapped in `Ideal`, causing changes in the hash value, which could lead to verification failure.

We also found many test failures caused by missed `ShenandoahLoadReferenceBarrierNode::Identity()` optimizations. This PR skips the identity verification of `ShenandoahLoadReferenceBarrierNode` because it was not investigated in this PR.

This PR tested all tier1 to tier3 jtreg tests on aarch64 (sve, neon) and x64 (avx3, avx2) platforms using options `-ea -esa -XX:-TieredCompilation -XX:CompileThreshold=100 -XX:VerifyIterativeGVN=1110`, and repeated the test 100 times for the aforementioned error cases. All tests passed.
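The worklist re-population described above can be sketched as follows. This is a toy model under assumptions, not the PR's implementation: `usefulNodes`, `runWorklist`, and the node types are hypothetical, the loop applies identities to a node's inputs rather than replacing the node in its users as real IGVN does, and the multi-hop notification in `add_users_of_use_to_worklist` is not modeled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Toy model of re-queueing every live node; NOT HotSpot code.
final class RequeueAllSketch {
    enum Op { ROOT, VALUE, ALL_ONES, AND_V }

    static final class Node {
        final Op op;
        final List<Node> in = new ArrayList<>();
        Node(Op op, Node... inputs) {
            this.op = op;
            Collections.addAll(this.in, inputs);
        }
    }

    // Identity rule: (AndV x AllOnes) => x.
    static Node identity(Node n) {
        if (n.op == Op.AND_V && n.in.get(1).op == Op.ALL_ONES) {
            return n.in.get(0);
        }
        return n;
    }

    // Collect every node reachable from root through input edges.
    static List<Node> usefulNodes(Node root) {
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Node> stack = new ArrayDeque<>();
        List<Node> out = new ArrayList<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            if (!seen.add(n)) continue;
            out.add(n);
            for (Node i : n.in) stack.push(i);
        }
        return out;
    }

    // Drain the worklist: replace each input by its identity and re-queue
    // the node when one of its inputs changed. Returns the replacement count.
    static int runWorklist(Deque<Node> worklist) {
        int replaced = 0;
        while (!worklist.isEmpty()) {
            Node n = worklist.poll();
            for (int i = 0; i < n.in.size(); i++) {
                Node id = identity(n.in.get(i));
                if (id != n.in.get(i)) {
                    n.in.set(i, id);
                    replaced++;
                    worklist.add(n);
                }
            }
        }
        return replaced;
    }

    public static void main(String[] args) {
        Node x = new Node(Op.VALUE);
        Node and = new Node(Op.AND_V, x, new Node(Op.ALL_ONES));
        Node root = new Node(Op.ROOT, and);
        // Nothing is queued, so the pending identity is never applied.
        System.out.println(runWorklist(new ArrayDeque<>()));
        // Queue every live node: the AndV is folded away.
        System.out.println(runWorklist(new ArrayDeque<>(usefulNodes(root))));
        System.out.println(root.in.get(0) == x);
    }
}
```

Queueing every live node is the broadest possible trigger; it trades some extra worklist processing for not having to predict which nodes have a pending rule.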
    void PhaseVector::add_all_nodes_into_igvn_worklist() {
      ResourceMark rm;
      Unique_Node_List useful;
      C->identify_useful_nodes(useful);
Is there any risk that this produces a lot of nodes that are unrelated to what you are trying to achieve here? From what I can see, identify_useful_nodes seems to add the entire graph of live nodes, and it seems to me that this could be a lot. Maybe this works well when the graph has a lot of vector nodes, but if the graph has a mix of scalar and vector nodes, maybe this could get out of hand?
Another thing that is noticeable is that PhaseVector::do_cleanup calls PhaseRemoveUseless just before add_all_nodes_into_igvn_worklist is called, and PhaseRemoveUseless also invokes identify_useful_nodes and stores them into _useful. This is accessible via PhaseRemoveUseless::get_useful. Could you piggyback on that?

Thanks, good point.
I did a quick compile-time check on a Neoverse-V2 machine with the same case:
    // mul add int vd = va * vb + vc
    public static void testMulAddInt() {
        for (int i = 0; i < LENGTH; i += I_SPECIES.length()) {
            IntVector va = IntVector.fromArray(I_SPECIES, ia, i);
            IntVector vb = IntVector.fromArray(I_SPECIES, ib, i);
            IntVector vc = IntVector.fromArray(I_SPECIES, ic, i);
            va.mul(vb).add(vc).intoArray(ir, i);
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10001; i++) {
            testMulAddInt();
        }
    }

    java -Xbatch -XX:-UseOnStackReplacement -XX:CompileCommand="compileonly,Test::test*" -XX:-TieredCompilation -XX:CompileCommand="print,Test::test*" -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation Test.java > assembly.s

I ran both jdk-master and this PR build 10 times and measured the C2 compile time for Test::testMulAddInt from hotspot_pid*.log. The averages were 38.3 ms on jdk-master and 37.6 ms with this PR, so I do not see a noticeable compile-time regression from this change.
| build | avg | min | max | stdev |
|---|---|---|---|---|
| master | 38.3 ms | 37.0 ms | 40.0 ms | 0.82 ms |
| PR | 37.6 ms | 37.0 ms | 39.0 ms | 0.70 ms |
Also, as Peter suggested earlier in #28313: "Adding all nodes might be the most reliable in the long-run. We may at some point have idealization rules that start at a scalar node, and traverse up to find vector nodes."
I also took your suggestion and now reuse PhaseRemoveUseless::get_useful() instead of recomputing the useful-node set.
    import compiler.lib.ir_framework.*;
    import jdk.incubator.vector.*;

    public class VectorStoreMaskIdentityStressTest {
Not in this class, but should there be additional IR tests for AndV and ShiftV?

After we enable the identity check for vector nodes, with options -ea -esa -XX:-TieredCompilation -XX:CompileThreshold=100 -XX:VerifyIterativeGVN=1110, test files like jdk/incubator/vector/ByteVector128LoadStoreTests.java consistently fail; therefore I think there's no need to add any additional tests for AndV and ShiftV.
For this test, it fails intermittently, so I feel it is necessary to add a stress test.
The parent pull request that this pull request depends on has now been integrated and the target branch of this pull request has been updated. This means that changes from the dependent pull request can start to show up as belonging to this pull request, which may be confusing for reviewers. To remedy this situation, simply merge the latest changes from the new target branch into this pull request by running commands similar to these in the local repository for your personal fork:

    git checkout JDK-8380424-miss-identity-opt
    git fetch https://git.openjdk.org/jdk.git master
    git merge FETCH_HEAD
    # if there are conflicts, follow the instructions given by git merge
    git commit -m "Merge master"
    git push

@erifan this pull request can not be integrated into

    git checkout JDK-8380424-miss-identity-opt
    git fetch https://git.openjdk.org/jdk.git master
    git merge FETCH_HEAD
    # resolve conflicts and follow the instructions given by git merge
    git commit -m "Merge master"
    git push

The total number of required reviews for this PR has been set to 2 based on the presence of this label:

/template append

@erifan The pull request template has been appended to the pull request body

erifan left a comment
Thanks for your review!
Ok, that's fine.

Hello, could someone please help review this PR? It fixes some missing vector identity optimizations.
Reviewing
Using git
Checkout this PR locally:
    $ git fetch https://git.openjdk.org/jdk.git pull/30529/head:pull/30529
    $ git checkout pull/30529
Update a local copy of the PR:
    $ git checkout pull/30529
    $ git pull https://git.openjdk.org/jdk.git pull/30529/head
Using Skara CLI tools
Checkout this PR locally:
    $ git pr checkout 30529
View PR using the GUI difftool:
    $ git pr show -t 30529
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/30529.diff