@eme64 eme64 commented Nov 11, 2024

Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below.

Background

With -XX:+AlignVector, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address base is already aligned. For arrays, we know that this always holds, because they are ObjectAlignmentInBytes aligned. But with native memory, the base is just some arbitrarily aligned pointer.

Problem

So far, we have just naively assumed that the base is always ObjectAlignmentInBytes aligned. But that does not hold for native memory segments: the base can also be unaligned. I had constructed such an example, and with -XX:+AlignVector -XX:+VerifyAlignVector this example hits the verification code.

    MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1);
    MemorySegment nativeUnaligned = nativeAligned.asSlice(1);
    test3(nativeUnaligned);

When compiling the test method, we assume that the nativeUnaligned.address() is aligned - but it is not!

    static void test3(MemorySegment ms) {
        for (int i = 0; i < RANGE; i++) {
            long adr = i * 4L;
            int v = ms.get(ELEMENT_LAYOUT, adr);
            ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1));
        }
    }
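
To make the failure concrete, here is a minimal sketch of the alignment arithmetic, independent of HotSpot. It assumes the default ObjectAlignmentInBytes of 8 and uses a made-up base address; `AlignmentSketch` and `isAligned` are hypothetical names for illustration only, not part of the JDK or of this change.

```java
// Hypothetical illustration of the alignment arithmetic; not HotSpot code.
final class AlignmentSketch {
    // Assumption: the default value of -XX:ObjectAlignmentInBytes.
    static final int OBJECT_ALIGNMENT_IN_BYTES = 8;

    // True if address is a multiple of alignment.
    // The alignment is assumed to be a power of two.
    static boolean isAligned(long address, int alignment) {
        return (address & (alignment - 1)) == 0;
    }

    public static void main(String[] args) {
        long alignedBase = 0x7f00_0000_1000L; // made-up native address
        long slicedBase = alignedBase + 1;    // base after a 1-byte slice offset
        System.out.println(isAligned(alignedBase, OBJECT_ALIGNMENT_IN_BYTES));
        System.out.println(isAligned(slicedBase, OBJECT_ALIGNMENT_IN_BYTES));
    }
}
```

Slicing by one byte moves the base off every alignment boundary larger than one byte, which is why the static assumption cannot hold for nativeUnaligned.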

Solution: Runtime Checks - Predicate and Multiversioning

Of course we could just forbid vectorization whenever we have a native base. But that would currently lead to regressions: in most cases we do get aligned bases, and we currently vectorize those. Since we cannot statically determine whether the base is aligned, we need a runtime check.

I came up with 2 options where to place the runtime checks:

  • A new "auto vectorization" Parse Predicate:
    • This only works when predicates are available.
    • If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop.
  • Multiversion the loop:
    • Create 2 copies of the loop (fast and slow loops).
    • The fast_loop can make speculative alignment assumptions, and add the corresponding check to the multiversion_if, which decides which loop we take.
    • In the slow_loop, we make no assumptions, which means we cannot vectorize, but we still compile - so even unaligned bases would end up with reasonably fast code.
    • We "stall" the slow_loop from optimizing until we have fully vectorized the fast_loop, and know that we actually are adding runtime checks to the multiversion_if, and we really need the slow_loop.
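
As a rough source-level picture of the multiversioning option, the sketch below writes out by hand what the two loop copies and the multiversion_if amount to. This is only an analogy under stated assumptions: C2 does this on its IR, not on Java source, and `MultiversionSketch`, `baseIsAligned`, `incrementAll` and the 8-byte requirement are hypothetical.

```java
// Hypothetical source-level model of loop multiversioning; not HotSpot code.
final class MultiversionSketch {
    // Assumption: the vectorized loop needs 8-byte aligned addresses.
    static final int REQUIRED_ALIGNMENT = 8;

    // Stand-in for the runtime check placed at the multiversion_if.
    static boolean baseIsAligned(long baseAddress) {
        return (baseAddress & (REQUIRED_ALIGNMENT - 1)) == 0;
    }

    // Increments every element and reports which loop copy was taken.
    // baseAddress is a simulated address; the data lives in a Java array.
    static String incrementAll(int[] data, long baseAddress) {
        if (baseIsAligned(baseAddress)) {
            // fast_loop: may rely on the alignment assumption (vectorizable).
            for (int i = 0; i < data.length; i++) {
                data[i] += 1;
            }
            return "fast_loop";
        } else {
            // slow_loop: same semantics, no assumption, no vectorization.
            for (int i = 0; i < data.length; i++) {
                data[i] += 1;
            }
            return "slow_loop";
        }
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        System.out.println(incrementAll(data, 64L));
        System.out.println(incrementAll(data, 65L));
    }
}
```

Both copies compute the same result; only the assumptions the compiler may exploit differ.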

Hence, the goal is that we compile like this:

  • First with predicate: if we are lucky we never see an unaligned base.
  • If we fail the check at the predicate: deopt, next time do not use the predicate for that loop.
  • When we recompile, we find no predicate, and instead multiversion the loop, so that we can compile both for aligned (vectorize) and unaligned (not vectorize) base.
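
The three steps above can be modeled as a tiny state machine. This is a simplified sketch, not how C2 tracks compilation state: `CompileStrategySketch` and its members are hypothetical, and a real deoptimization continues execution in the interpreter rather than returning a label.

```java
// Hypothetical model of "predicate first, then multiversioning"; not HotSpot code.
final class CompileStrategySketch {
    enum Strategy { PREDICATE, MULTIVERSION }

    private Strategy strategy = Strategy.PREDICATE;
    private int deoptCount = 0;

    // Returns a label for the code path that handles one loop execution.
    String run(boolean baseAligned) {
        if (strategy == Strategy.PREDICATE) {
            if (baseAligned) {
                return "vectorized"; // predicate holds, only one loop exists
            }
            // Predicate failed: deoptimize and stop using the predicate.
            deoptCount++;
            strategy = Strategy.MULTIVERSION;
            return "deopt";
        }
        // After recompilation the check selects between two loop copies.
        return baseAligned ? "fast_loop" : "slow_loop";
    }

    int deoptCount() {
        return deoptCount;
    }

    public static void main(String[] args) {
        CompileStrategySketch loop = new CompileStrategySketch();
        System.out.println(loop.run(true));
        System.out.println(loop.run(false));
        System.out.println(loop.run(false));
        System.out.println(loop.run(true));
    }
}
```

The point of the ordering is that the common aligned case pays for only one loop copy, and the second copy is compiled only after an unaligned base has actually been observed.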

Future Work: Runtime Check for Aliasing Analysis

See: JDK-8324751: C2 SuperWord: Aliasing Analysis runtime check
This whole infrastructure with "auto vectorization" Parse Predicate and Multiversioning can be used when we implement Runtime Checks for Aliasing Analysis: We speculate that there is no aliasing. If the runtime check fails, we deopt at the predicate, or take the slow_loop for Multiversioning.
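
For intuition, a runtime no-aliasing check between two memory ranges could look like the sketch below. The concrete check for JDK-8324751 is not specified in this PR, so this is only an assumed shape: `AliasingCheckSketch` and `noOverlap` are hypothetical, ranges are half-open byte ranges, and arithmetic overflow is ignored.

```java
// Hypothetical sketch of a runtime no-aliasing check; not HotSpot code.
final class AliasingCheckSketch {
    // True if [aStart, aStart + aLen) and [bStart, bStart + bLen) share no byte.
    // Assumption: lengths are non-negative and the sums do not overflow.
    static boolean noOverlap(long aStart, long aLen, long bStart, long bLen) {
        return aStart + aLen <= bStart || bStart + bLen <= aStart;
    }

    public static void main(String[] args) {
        System.out.println(noOverlap(0, 16, 16, 16)); // adjacent ranges
        System.out.println(noOverlap(0, 16, 8, 16));  // overlapping ranges
    }
}
```

If such a check passes, the speculative (vectorized) version is safe to run; if it fails, we deopt at the predicate or take the slow_loop, exactly as with the alignment check.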

Testing

Testing is passing, performance testing shows no significant change.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory (Bug - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/22016/head:pull/22016
$ git checkout pull/22016

Update a local copy of the PR:
$ git checkout pull/22016
$ git pull https://git.openjdk.org/jdk.git pull/22016/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 22016

View PR using the GUI difftool:
$ git pr show -t 22016

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/22016.diff


@bridgekeeper

bridgekeeper bot commented Nov 11, 2024

👋 Welcome back epeter! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Nov 11, 2024

@eme64 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory

Reviewed-by: roland, kvn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 37 new commits pushed to the master branch:

  • 9ec4696: 8350313: Include timings for leaving safepoint in safepoint logging
  • ec6624b: 8350649: Class unloading accesses/resurrects dead Java mirror after JDK-8346567
  • 9477c70: 8024695: new File("").exists() returns false whereas it is the current working directory
  • 3e46480: 8350770: [BACKOUT] Protection zone for easier detection of accidental zero-nKlass use
  • bd112c4: 8350443: GHA: Split static-libs-bundles into a separate job
  • 2731712: 8287749: Re-enable javadoc -serialwarn option
  • 0f82268: 8345598: Upgrade NSS binaries for interop tests
  • ea2c923: 8323807: Async UL: Add a stalling mode to async UL
  • e7d4b36: 8350667: Remove startThread_lock() and _startThread_lock on AIX
  • 1e18fff: 8328473: StringTable and SymbolTable statistics delay time to safepoint
  • ... and 27 more: https://git.openjdk.org/jdk/compare/d551dacaef938cea0cad10047b79a0a7a26dcacb...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot changed the title from "JDK-8323582" to "8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory" Nov 11, 2024
@openjdk

openjdk bot commented Nov 11, 2024

@eme64 The following labels will be automatically applied to this pull request:

  • graal
  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added graal graal-dev@openjdk.org hotspot hotspot-dev@openjdk.org labels Nov 11, 2024
@eme64
Contributor Author

eme64 commented Feb 24, 2025

@vnkozlov I'll think about the "stall" vs "delay" suggestion.

How profitable (performance-wise) is it to optimize the slow path loop? Can we skip any optimizations for it - treat it as not-Counted?

I suppose that depends on if the slow path loop will be taken. Imagine we are working on some unaligned MemorySegment (or with aliasing runtime-checks failing). In these cases without optimizing we would for example not unroll. But unrolling can give quite the speedup, of course at the cost of more compile time and code size. Also some RangeCheck eliminations only happen if you have a pre-main-post loop structure. There are probably other optimizations as well. So yes, if the slow path loop is taken often, then optimizing is probably worth it. What do you think?
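
As a reminder of what is at stake, here is a hand-written sketch of unrolling by 4 with a post loop for the leftover iterations. It only illustrates the transformation at source level: C2 performs it on its IR, and `UnrollSketch`, `sumSimple` and `sumUnrolled` are hypothetical names.

```java
// Hypothetical source-level illustration of loop unrolling; not HotSpot code.
final class UnrollSketch {
    // Reference version: one element per iteration.
    static long sumSimple(int[] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Main loop handles 4 elements per iteration, post loop handles the rest.
    static long sumUnrolled(int[] a) {
        long sum = 0;
        int i = 0;
        int mainLimit = a.length - 3; // last start index with 4 elements left
        for (; i < mainLimit; i += 4) {
            sum += a[i];
            sum += a[i + 1];
            sum += a[i + 2];
            sum += a[i + 3];
        }
        for (; i < a.length; i++) { // post loop
            sum += a[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(sumSimple(data) == sumUnrolled(data));
    }
}
```

A slow path loop that is never optimized would stay in the first form, which is the regression risk discussed here.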

@eme64
Contributor Author

eme64 commented Feb 24, 2025

@vnkozlov I mean the issue is this: once I implement aliasing-analysis runtime-checks with this multiversion approach, then we'd get regressions if we do not optimize the slow path loop. Currently, we would not vectorize (because we have to be ready for aliasing cases), but we at least unroll, and do whatever else we can except vectorization. But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more. I think we need to avoid that - would you agree?

@rwestrel
Contributor

@rwestrel Do you want me to find examples for the pre-loop disappearing? I suppose I can find some easily by adding an assert in SuperWord, where we bail out, as I showed above.

Yes, if not too much work.

@eme64
Contributor Author

eme64 commented Feb 24, 2025

@rwestrel Do you want me to find examples for the pre-loop disappearing? I suppose I can find some easily by adding an assert in SuperWord, where we bail out, as I showed above.

Yes, if not too much work.

Ok, let's add this:

diff --git a/src/hotspot/share/opto/vectorization.cpp b/src/hotspot/share/opto/vectorization.cpp
index e607a1065dd..290ee249a42 100644
--- a/src/hotspot/share/opto/vectorization.cpp
+++ b/src/hotspot/share/opto/vectorization.cpp
@@ -98,6 +98,7 @@ VStatus VLoop::check_preconditions_helper() {
     // the pre-loop limit.
     CountedLoopEndNode* pre_end = _cl->find_pre_loop_end();
     if (pre_end == nullptr) {
+      assert(false, "found no pre-loop");
       return VStatus::make_failure(VLoop::FAILURE_PRE_LOOP_LIMIT);
     }
     Node* pre_opaq1 = pre_end->limit();

And run that:

rr /oracle-work/jdk-fork7/build/linux-x64-slowdebug/jdk/bin/java -Xcomp -XX:+TraceLoopOpts -XX:CompileCommand=compileonly,jdk.internal.classfile.impl.StackMapGenerator::processBlock --version

....

PreMainPost      Loop: N7127/N4014  limit_check profile_predicated predicated counted [0,int),+1 (2147483648 iters)  rc  has_sfpt strip_mined
Unroll 2         Loop: N7127/N4014  counted [int,int),+1 (2147483648 iters)  main rc  has_sfpt strip_mined
Loop: N0/N0  has_call has_sfpt
  Loop: N7453/N7460  limit_check profile_predicated predicated counted [0,int),+1 (4 iters)  pre rc  has_sfpt
  Loop: N7126/N7125  sfpts={ 7128 }
    Loop: N7508/N4014  counted [int,int),+2 (2147483648 iters)  main rc  has_sfpt strip_mined
  Loop: N7409/N7416  counted [int,int),+1 (4 iters)  post rc  has_sfpt
Parallel IV: 7728   Loop: N7453/N7460  limit_check profile_predicated predicated counted [0,int),+1 (4 iters)  pre has_sfpt
Parallel IV: 7725     Loop: N7508/N4014  counted [int,int),+2 (2147483648 iters)  main has_sfpt strip_mined
Parallel IV: 7718   Loop: N7409/N7416  counted [int,int),+1 (4 iters)  post has_sfpt
Loop: N0/N0  has_call has_sfpt
  Loop: N7453/N7460  limit_check profile_predicated predicated counted [0,int),+1 (4 iters)  pre has_sfpt
  Loop: N7126/N7125  sfpts={ 7128 }
    Loop: N7508/N4014  counted [int,int),+2 (2147483648 iters)  main has_sfpt strip_mined
  Loop: N7409/N7416  counted [int,int),+1 (4 iters)  post has_sfpt
RangeCheck       Loop: N7508/N4014  counted [int,int),+2 (2147483648 iters)  main has_sfpt rce strip_mined
Unroll 4         Loop: N7508/N4014  limit_check counted [int,int),+2 (2147483648 iters)  main has_sfpt rce strip_mined
Loop: N0/N0  has_call has_sfpt
  Loop: N7453/N7460  limit_check profile_predicated predicated counted [0,int),+1 (4 iters)  pre rc  has_sfpt
  Loop: N7126/N7125  limit_check sfpts={ 7128 }
    Loop: N8146/N4014  limit_check counted [int,int),+4 (2147483648 iters)  main has_sfpt strip_mined
  Loop: N7409/N7416  counted [int,int),+1 (4 iters)  post rc  has_sfpt

...
#  Internal Error (/oracle-work/jdk-fork7/open/src/hotspot/share/opto/vectorization.cpp:101), pid=1381339, tid=1381348
#  assert(false) failed: found no pre-loop

The pre-loop node is not dead actually. The issue is with the main-loop in CountedLoopNode::is_canonical_loop_entry.

We skip through some predicates, but then we cannot find the ZeroTripGuard, rather I'm seeing this:

(rr) p ctrl->dump_bfs(2,0,"#cd")
dist dump
---------------------------------------------
   2   974  ConI  === 0  [[ ... ]]  #int:1
   2  8060  IfTrue  === 8056  [[ 8073 ]] #1
   1  8073  If  === 8060 974  [[ 8074 8077 ]] #Last Value Assertion Predicate  P=0.999999, C=-1.000000
   0  8077  IfTrue  === 8073  [[ 8103 ]] #1

The pre-loop is further up though:

(rr) p this->dump_bfs(26,0,"#c")
dist dump
---------------------------------------------
  26  7453  CountedLoop  === 7453 4015 7460  [[ 7452 7453 7454 7455 ]] inner stride: 1 pre of N7127 !orig=[7127],[7118],[2645] !jvms: StackMapGenerator::processBlock @ bci:2677 (line 671)
  25  7455  If  === 7453 7441  [[ 7456 7464 ]] P=0.000001, C=-1.000000 !orig=[2686] !jvms: StackMapGenerator$Frame::popStack @ bci:5 (line 1001) StackMapGenerator::processBlock @ bci:2681 (line 671)
  24  7456  IfFalse  === 7455  [[ 7448 7457 ]] #0 !orig=[2631],[2628] !jvms: StackMapGenerator$Frame::popStack @ bci:5 (line 1001) StackMapGenerator::processBlock @ bci:2681 (line 671)
  23  7457  RangeCheck  === 7456 7446  [[ 7458 7467 ]] P=0.999999, C=-1.000000 !orig=[1189] !jvms: StackMapGenerator$Frame::popStack @ bci:33 (line 1002) StackMapGenerator::processBlock @ bci:2681 (line 671)
  22  7458  IfTrue  === 7457  [[ 7459 ]] #1 !orig=[777],385 !jvms: StackMapGenerator$Frame::popStack @ bci:33 (line 1002) StackMapGenerator::processBlock @ bci:2681 (line 671)
  21  7459  CountedLoopEnd  === 7458 7443  [[ 7460 7482 ]] [lt] P=0.900000, C=-1.000000 !orig=7122,[5398] !jvms: StackMapGenerator::processBlock @ bci:2674 (line 670)
  20  7482  IfFalse  === 7459  [[ 7486 ]] #0
  19  7486  If  === 7482 7485  [[ 7461 7487 ]] P=0.999999, C=-1.000000
  18  7487  IfTrue  === 7486  [[ 7977 ]] #1
  17  7977  If  === 7487 974  [[ 7978 7981 ]] #Init Value Assertion Predicate  P=0.999999, C=-1.000000
  16  7981  IfTrue  === 7977  [[ 7994 ]] #1
  15  7994  If  === 7981 974  [[ 7995 7998 ]] #Last Value Assertion Predicate  P=0.999999, C=-1.000000
  14  7998  IfTrue  === 7994  [[ 8118 ]] #1
  13  8118  If  === 7998 8117  [[ 8119 8122 ]] #Last Value Assertion Predicate  P=0.999999, C=-1.000000
  12  8122  IfTrue  === 8118  [[ 8007 ]] #1
  11  8007  If  === 8122 8006  [[ 8008 8011 ]] #Init Value Assertion Predicate  P=0.999999, C=-1.000000
  10  8011  IfTrue  === 8007  [[ 8056 ]] #1
   9  8056  If  === 8011 974  [[ 8057 8060 ]] #Init Value Assertion Predicate  P=0.999999, C=-1.000000
   8  8060  IfTrue  === 8056  [[ 8073 ]] #1
   7  8073  If  === 8060 974  [[ 8074 8077 ]] #Last Value Assertion Predicate  P=0.999999, C=-1.000000
   6  8077  IfTrue  === 8073  [[ 8103 ]] #1
   5  8173  IfFalse  === 7122  [[ 7128 7129 ]] #0 !orig=[7524],[7123],[5442] !jvms: StackMapGenerator::processBlock @ bci:2674 (line 670)
   5  8103  If  === 8077 8102  [[ 8104 8107 ]] #Last Value Assertion Predicate  P=0.999999, C=-1.000000
   4  7128  SafePoint  === 8173 1 778 1 1 7129 780 1 1 781 781 782 783 784 1 1 1 785 786  [[ 7124 ]]  SafePoint  !orig=385 !jvms: StackMapGenerator::processBlock @ bci:2688 (line 670)
   4  8107  IfTrue  === 8103  [[ 8086 ]] #1
   3  7124  OuterStripMinedLoopEnd  === 7128 781  [[ 7125 7471 ]] P=0.900000, C=-1.000000
   3  8086  If  === 8107 8085  [[ 8087 8090 ]] #Init Value Assertion Predicate  P=0.999999, C=-1.000000
   2  7122  CountedLoopEnd  === 8146 7121  [[ 8173 4014 ]] [lt] P=0.900000, C=-1.000000 !orig=[5398] !jvms: StackMapGenerator::processBlock @ bci:2674 (line 670)
   2  7125  IfTrue  === 7124  [[ 7126 ]] #1
   2  8090  IfTrue  === 8086  [[ 7126 ]] #1
   1  4014  IfTrue  === 7122  [[ 8146 ]] #1 !jvms: StackMapGenerator::processBlock @ bci:2674 (line 670)
   1  7126  OuterStripMinedLoop  === 7126 8090 7125  [[ 7126 8146 ]] 
   0  8146  CountedLoop  === 8146 7126 4014  [[ 8146 1191 8157 8158 7122 7503 ]] inner stride: 4 main of N8146 strip mined !orig=[7508],[7127],[7118],[2645] !jvms: StackMapGenerator::processBlock @ bci:2677 (line 671)

It looks like we are skipping some predicates, but not enough of them maybe?
In AssertionPredicates::find_entry we see:

  • 8090 IfTrue === 8086 [[ 7126 ]] #1: is_predicate returns true.
  • 8107 IfTrue === 8103 [[ 8086 ]] #1: is_predicate returns true.
  • 8077 IfTrue === 8073 [[ 8103 ]] #1: is_predicate returns false. The reason is that the assertion predicate Opaque nodes have already disappeared.

I talked with @chhagedorn, and he says that some "dying" initialized assertion predicates left over from unrolling can be in the way. IGVN would clean them out later, and then we could see through. But at this point they block the traversal, so we cannot find the ZeroTripGuard: the predicate iterator is not good enough yet. @chhagedorn is working on that: https://bugs.openjdk.org/browse/JDK-8350579
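
A toy model of the blocked traversal, under heavy simplification: the control chain above the main loop is a plain list, and the walk skips only nodes it recognizes as predicates. `PredicateWalkSketch`, `Kind` and `findsZeroTripGuard` are hypothetical and do not mirror the real predicate iterator.

```java
import java.util.List;

// Hypothetical toy model of skipping predicates above a loop; not HotSpot code.
final class PredicateWalkSketch {
    enum Kind { ASSERTION_PREDICATE, DYING_PREDICATE, ZERO_TRIP_GUARD }

    // Walks upward from the loop entry (index 0), skipping recognized
    // predicates, and reports whether the next node is the ZeroTripGuard.
    static boolean findsZeroTripGuard(List<Kind> controlChain) {
        int i = 0;
        while (i < controlChain.size()
                && controlChain.get(i) == Kind.ASSERTION_PREDICATE) {
            i++;
        }
        return i < controlChain.size()
                && controlChain.get(i) == Kind.ZERO_TRIP_GUARD;
    }

    public static void main(String[] args) {
        List<Kind> clean = List.of(
                Kind.ASSERTION_PREDICATE, Kind.ASSERTION_PREDICATE,
                Kind.ZERO_TRIP_GUARD);
        List<Kind> blocked = List.of(
                Kind.ASSERTION_PREDICATE, Kind.ASSERTION_PREDICATE,
                Kind.DYING_PREDICATE, Kind.ZERO_TRIP_GUARD);
        System.out.println(findsZeroTripGuard(clean));
        System.out.println(findsZeroTripGuard(blocked));
    }
}
```

In the dump above, 8090 and 8107 correspond to the recognized entries and 8077 to the unrecognized one where the walk stops.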

The implication is that the ZeroTripGuard can temporarily not be found, and so we cannot even find the pre-loop, and also not the multiversion-if. So I cannot really add an assert now. And who knows, there may be other blocking reasons on top of that.

@rwestrel Does that make sense? What do you think we should do?

@eme64
Contributor Author

eme64 commented Feb 24, 2025

@rwestrel I think we should just file an RFE to keep track of these assertions we would like to add once those issues are fixed.

@rwestrel
Contributor

@rwestrel I think we should just file an RFE to keep track of these assertions we would like to add once those issues are fixed.

That sounds reasonable to me.

@vnkozlov
Contributor

But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more.

Okay, we are back to our previous conversation - we will wait for your aliasing-analysis runtime-checks implementation and do performance runs to see if the "slow" path affects performance.

Okay.

PS: "slow" path implies that it is not taken frequently and that it should not affect the general performance of the application.

@eme64
Contributor Author

eme64 commented Feb 25, 2025

But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more.

Okay, we are back to our previous conversation - we will wait for your aliasing-analysis runtime-checks implementation and do performance runs to see if the "slow" path affects performance.

Okay.

Sounds good, we will revisit and write more benchmarks there.

PS: "slow" path implies that it is not taken frequently and that it should not affect the general performance of the application.

For me "slow" just means less optimized, because some assumption does not hold. The "fast" path is faster, because it has more assumptions and can optimize more (i.e. vectorize in this case, or vectorize more instructions). Do you have a better name than "fast/slow"?

@eme64
Contributor Author

eme64 commented Feb 25, 2025

@vnkozlov @rwestrel Let me summarize the tasks left to do here:

  • Rename stalled -> delayed, and unstall -> resume_optimizations or similar. Improve some comments.
  • File follow-up RFE for more verification (must find multiversion-if from multiversioned loop) - currently blocked by predicate traversal issue. Maybe we can also assert that we can always find the pre-loop from the main-loop, at least during loop-opts.
  • When working on aliasing-analysis runtime-check, we have to do more performance analysis, and show the need of both the fast and slow path loops.

Let me know if there is more ;)

@eme64
Contributor Author

eme64 commented Feb 25, 2025

@vnkozlov @rwestrel

  • I did the stall -> delay renaming, and added some more comments in places you asked for it. Let me know if that looks better.
  • Filed: JDK-8350637: C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if
  • I added a comment to JDK-8324751 C2 SuperWord: Aliasing Analysis runtime check, to check performance around slow_loop.

Let me know what more I can do ;)

@vnkozlov
Contributor

PS: "slow" path implies that it is not taken frequently and that it should not affect the general performance of the application.

For me "slow" just means less optimized, because some assumption does not hold. The "fast" path is faster, because it has more assumptions and can optimize more (i.e. vectorize in this case, or vectorize more instructions). Do you have a better name than "fast/slow"?

I think I nit-picked here. I see your good comments in loopTransform.cpp and loopnode.hpp explaining multiversioning fast_loop/slow_loop. I think it is fine to keep "slow/fast". We can use "uncommon" to indicate an infrequent path.

Contributor

@vnkozlov vnkozlov left a comment


This looks good for me.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Feb 25, 2025
@rwestrel
Contributor

Would it be possible and make sense to remove useless slow path loops the way it's done for predicates or zero trip guards? In PhaseIdealLoop::build_loop_late_post_work(), collect all OpaqueMultiversioningNode in a list. Then iterate over all loops the way it's done in PhaseIdealLoop::eliminate_useless_zero_trip_guard(), find loops marked as multi version, check we can get from the loop to the OpaqueMultiversioningNode and mark that one as useful. Eliminate all OpaqueMultiversioningNode not marked as useful. That way if some transformation such as peeling makes the loop non multi version or if the expected shape breaks for some reason, the slow loop is eliminated on next loop opts pass.
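
The proposal above is essentially a mark-and-sweep over the OpaqueMultiversioningNode list. The sketch below shows that shape with plain Java collections. It is an assumed simplification, not HotSpot code: `UselessSlowLoopSketch`, `Loop` and `findUseless` are hypothetical, nodes are represented by integer ids, and "reachable from the loop" is precomputed.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical mark-and-sweep model of the proposal; not HotSpot code.
final class UselessSlowLoopSketch {
    // Stand-in for a loop. reachableOpaque is the id of the opaque node
    // found when walking up from the loop, or -1 if none is found.
    static final class Loop {
        final boolean multiversion;
        final int reachableOpaque;

        Loop(boolean multiversion, int reachableOpaque) {
            this.multiversion = multiversion;
            this.reachableOpaque = reachableOpaque;
        }
    }

    // Returns the ids of all opaque nodes that no multiversion loop reaches.
    static List<Integer> findUseless(List<Integer> allOpaques, List<Loop> loops) {
        Set<Integer> useful = new HashSet<>();
        for (Loop loop : loops) { // mark phase
            if (loop.multiversion && loop.reachableOpaque >= 0) {
                useful.add(loop.reachableOpaque);
            }
        }
        List<Integer> useless = new ArrayList<>();
        for (Integer id : allOpaques) { // sweep phase
            if (!useful.contains(id)) {
                useless.add(id);
            }
        }
        return useless;
    }

    public static void main(String[] args) {
        List<Integer> opaques = List.of(1, 2, 3);
        List<Loop> loops = List.of(
                new Loop(true, 1),   // still multiversioned, reaches node 1
                new Loop(false, 2),  // no longer marked as multiversion
                new Loop(true, -1)); // expected shape broke, nothing found
        System.out.println(findUseless(opaques, loops));
    }
}
```

Anything returned by the sweep would be eliminated, which folds away the corresponding slow loop on the next loop opts pass.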

@eme64
Contributor Author

eme64 commented Feb 26, 2025

Would it be possible and make sense to remove useless slow path loops the way it's done for predicates or zero trip guards? In PhaseIdealLoop::build_loop_late_post_work(), collect all OpaqueMultiversioningNode in a list. Then iterate over all loops the way it's done in PhaseIdealLoop::eliminate_useless_zero_trip_guard(), find loops marked as multi version, check we can get from the loop to the OpaqueMultiversioningNode and mark that one as useful. Eliminate all OpaqueMultiversioningNode not marked as useful. That way if some transformation such as peeling makes the loop non multi version or if the expected shape breaks for some reason, the slow loop is eliminated on next loop opts pass.

I suppose we could try that. Is it ok to do that in a separate RFE, so we are keeping this here to a more manageable size?

I don't see it as super critical personally, as the slow_path is delayed, so no loop-opts are performed on it. The overhead is minimal if we keep it until after loop-opts, I think. But I'm not against trying. It would take a bit of effort to construct test cases where we have the loop fold away after multiversion_if is added, but that is probably possible.

And would we not have similar issues with traversing from the loops to their OpaqueMultiversioningNode? What if some are not reachable in the meantime? Then we would just lose the multiversion_if early, and could not use it any more. So maybe we'd have to do that after the verification:
JDK-8350637: C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if

I wonder if we do not have similar issues with PhaseIdealLoop::eliminate_useless_zero_trip_guard() currently. Maybe it's rare enough we don't notice.

@rwestrel What do you think?

@rwestrel
Contributor

I suppose we could try that. Is it ok to do that in a separate RFE, so we are keeping this here to a more manageable size?

Ok

And would we not have similar issues with traversing from the loops to their OpaqueMultiversioningNode? What if some are not reachable in the meantime? Then we would just lose the multiversion_if early, and could not use it any more. So maybe we'd have to do that after the verification: JDK-8350637: C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if

I wonder if we do not have similar issues with PhaseIdealLoop::eliminate_useless_zero_trip_guard() currently. Maybe it's rare enough we don't notice.

I don't think that's a problem. When that code runs the graph is in a stable shape. There's no dead condition that needs to go through igvn to be cleaned up. We've just run igvn and haven't made any change to the graph yet.

@eme64
Contributor Author

eme64 commented Feb 26, 2025

And would we not have similar issues with traversing from the loops to their OpaqueMultiversioningNode? What if some are not reachable in the meantime? Then we would just lose the multiversion_if early, and could not use it any more. So maybe we'd have to do that after the verification: JDK-8350637: C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if
I wonder if we do not have similar issues with PhaseIdealLoop::eliminate_useless_zero_trip_guard() currently. Maybe it's rare enough we don't notice.

I don't think that's a problem. When that code runs the graph is in a stable shape. There's no dead condition that needs to go through igvn to be cleaned up. We've just run igvn and haven't made any change to the graph yet.

Ah ok, I'll have to look into it myself then. But if we know that it happens at the beginning of a loop-opts phase just after igvn, and no predicates were hacked yet, then that should work fine.

@eme64
Contributor Author

eme64 commented Feb 26, 2025

@rwestrel I filed this follow-up RFE:
JDK-8350756: C2 SuperWord Multiversioning: remove useless slow loop when the fast loop disappears

We'll have to be careful to only fold the slow_loop away if it is not used, i.e. if we did not in the meantime use the multiversion_if. Maybe the fast_loop structure is only disintegrating because of some speculative assumption, for example because of more unrolling that only happens with vectorization. It would be good to have a test-case for that. I'm writing that here so I will remember it later ;)

@rwestrel Do you have any other ideas / suggestions?

Contributor

@rwestrel rwestrel left a comment


Looks good to me.

@eme64
Contributor Author

eme64 commented Feb 27, 2025

@rwestrel @vnkozlov Thank you for the reviews, and all the good questions, and ideas for follow-up RFE's 😊

/integrate

@openjdk

openjdk bot commented Feb 27, 2025

Going to push as commit 885338b.
Since your change was applied there have been 41 commits pushed to the master branch:

  • bb48b73: 8350723: RISC-V: debug.cpp help() is missing riscv line for pns
  • b29f8b0: 8350665: SIZE_FORMAT_HEX macro undefined in gtest
  • 78c18cf: 8349399: GHA: Add static-jdk build on linux-x64
  • e43960a: 8350616: Skip ValidateHazardPtrsClosure in non-debug builds
  • 9ec4696: 8350313: Include timings for leaving safepoint in safepoint logging
  • ec6624b: 8350649: Class unloading accesses/resurrects dead Java mirror after JDK-8346567
  • 9477c70: 8024695: new File("").exists() returns false whereas it is the current working directory
  • 3e46480: 8350770: [BACKOUT] Protection zone for easier detection of accidental zero-nKlass use
  • bd112c4: 8350443: GHA: Split static-libs-bundles into a separate job
  • 2731712: 8287749: Re-enable javadoc -serialwarn option
  • ... and 31 more: https://git.openjdk.org/jdk/compare/d551dacaef938cea0cad10047b79a0a7a26dcacb...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Feb 27, 2025
@openjdk openjdk bot closed this Feb 27, 2025
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Feb 27, 2025
@openjdk

openjdk bot commented Feb 27, 2025

@eme64 Pushed as commit 885338b.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Labels

graal graal-dev@openjdk.org hotspot hotspot-dev@openjdk.org hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated
