-
Notifications
You must be signed in to change notification settings - Fork 6.1k
8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory #22016
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
👋 Welcome back epeter! A progress list of the required criteria for merging this PR into |
|
@eme64 This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 37 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
|
@vnkozlov I'll think about the "stall" vs "delay" suggestion.
I suppose that depends on if the slow path loop will be taken. Imagine we are working on some unaligned MemorySegment (or with aliasing runtime-checks failing). In these cases without optimizing we would for example not unroll. But unrolling can give quite the speedup, of course at the cost of more compile time and code size. Also some RangeCheck eliminations only happen if you have a pre-main-post loop structure. There are probably other optimizations as well. So yes, if the slow path loop is taken often, then optimizing is probably worth it. What do you think? |
|
@vnkozlov I mean the issue this: once I implement aliasing-analysis runtime-checks with this multiversion approach, then we'd get regressions if we do not optimize the slow path loop. Currently, we would not vectorize (because we have to be ready for aliasing cases), but we at least unroll, and whatever else we can except vectorization. But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more. I think we need to avoid that - would you agree? |
Yes, if not too much work. |
Ok, let's add this: And run that: The pre-loop node is not dead actually. The issue is with the main-loop in We skip through some predicates, but then we cannot find the ZeroTripGuard, rather I'm seeing this: The pre-loop is further up though: It looks like we are skipping some predicates, but not enough of them maybe?
I talked with @chhagedorn and he says that there are some "dying" initialized assertion predicates from unrolling that can be in the way. They would be cleaned out by IGVN later, and then we can see through. But at this point they are in the way and we cannot see through and find the ZeroTripGuard, the predicate iterator is not good enough yet. But @chhagedorn is working on that. https://bugs.openjdk.org/browse/JDK-8350579 The implication is that the ZeroTripGuard can be temporarily not be found, and so we cannot even find the pre-loop, and also not the multiversion-if. So I cannot really add an assert now. And who knows, there may be other blocking reasons on top of that. @rwestrel Does that make sense? What do you think we should do? |
|
@rwestrel I think we should just file an RFE to keep track of these assertions we would like to add once those issues are fixed. |
That sounds reasonable to me. |
Okay, we are back to our previous conversation - we will wait your aliasing-analysis runtime-checks implementation and do performance runs to see if "slow" path affects performance. Okay. PS: "slow" path implies that it is not taking frequently and it should not affect general performance of application. |
Sounds good, we will revisit and write more benchmarks there.
For me "slow" just means less optimized, because some assumption does not hold. The "fast" path is faster, because it has more assumptions and can optimize more (i.e. vectorize in this case, or vectorize more instructions). Do you have a better name than "fast/slow"? |
|
@vnkozlov @rwestrel Let me summarize the tasks left to do here:
Let me know if there is more ;) |
Let me know what more I can do ;) |
I think I nit-picked here. I see your good comments in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good for me.
|
Would it be possible and make sense to remove useless slow path loops the way it's done for predicates or zero trip guards? In |
I suppose we could try that. Is it ok to do that in a separate RFE, so we are keeping this here to a more manageable size? I don't see it as super critical personally, as the slow_path is And would we not have similar issues with traversing from the loops to their I wonder if we do not have similar issues with @rwestrel What do you think? |
Ok
I don't think that's a problem. When that code runs the graph is in a stable shape. There's no dead condition that needs to go through igvn to be cleaned up. We've just run igvn and haven't made any change to the graph yet. |
Ah ok, I'll have to look into it myself then. But if we know that it happens at the beginning of a loop-opts phase just after igvn, and no predicates were hacked yet, then that should work fine. |
|
@rwestrel I filed this follow-up RFE: We'll have to be careful to only fold the @rwestrel Do you have any other ideas / suggestions? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
|
Going to push as commit 885338b.
Your commit was automatically rebased without conflicts. |
Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below.
Background
With
-XX:+AlignVector, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the addressbaseis already aligned. For arrays, we know that this always holds, because they areObjectAlignmentInBytesaligned. But with native memory, thebaseis just some arbitrarily aligned pointer.Problem
So far, we have just naively assumed that the
baseis alwaysObjectAlignmentInBytesaligned. But that does not hold fornativememory segments: thebasecan also be unaligned. I had constructed such an example, and with-XX:+AlignVector -XX:+VerifyAlignVectorthis example hits the verification code.When compiling the test method, we assume that the
nativeUnaligned.address()is aligned - but it is not!Solution: Runtime Checks - Predicate and Multiversioning
Of course we could just forbid cases where we have a
nativebase from vectorizing. But that would lead to regressions currently - in most cases we do get alignedbases, and we currently vectorize those. We cannot statically determine if thebaseis aligned, we need a runtime check.I came up with 2 options where to place the runtime checks:
fast_loopcan make speculative alignment assumptions, and add the corresponding check to themultiversion_ifwhich decides which loop we takeslow_loop, we make no assumption which means we can not vectorize, but we still compile - so even unalignedbases would end up with reasonably fast code.slow_loopfrom optimizing until we have fully vectorized thefast_loop, and know that we actually are adding runtime checks to themultiversion_if, and we really need theslow_loop.Hence, the goal is that we compile like this:
base.base.Future Work: Runtime Check for Aliasing Analysis
See: JDK-8324751: C2 SuperWord: Aliasing Analysis runtime check
This whole infrastructure with "auto vectorization" Parse Predicate and Multiversioning can be used when we implement Runtime Checks for Aliasing Analysis: We speculate that there is no aliasing. If the runtime check fails, we deopt at the predicate, or take the
slow_loopfor Multiversioning.Testing
Testing is passing, performance testing shows no significant change.
Progress
Issue
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/22016/head:pull/22016$ git checkout pull/22016Update a local copy of the PR:
$ git checkout pull/22016$ git pull https://git.openjdk.org/jdk.git pull/22016/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 22016View PR using the GUI difftool:
$ git pr show -t 22016Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/22016.diff
Using Webrev
Link to Webrev Comment