8326962: C2 SuperWord: cache VPointer #18577
src/hotspot/share/opto/superword.cpp
Outdated
for (uint j = i+1; j < memops.size(); j++) {
  MemNode* s2 = memops.at(j)->as_Mem();
  if (isomorphic(s1, s2)) {
-   VPointer p2(s2, _vloop);
+   const VPointer& p2 = get_pointer(s2);
Note: a classic example of a quadratic loop, where we compare "all-to-all" memops, thus parse the pointer subgraph repeatedly.
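To make the cost concrete, here is a minimal standalone sketch, not HotSpot code. `ParsedPointer`, `parse_pointer` and `g_parse_count` are hypothetical stand-ins for `VPointer` and its pointer-subgraph parsing; the sketch only counts how often the parse runs in an all-to-all loop with and without a cache.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a parsed VPointer: just a base and an offset.
struct ParsedPointer {
  int base;
  int offset;
};

// Counts how often the (assumed expensive) parse runs.
static int g_parse_count = 0;

// Hypothetical stand-in for parsing the pointer subgraph of a memop.
ParsedPointer parse_pointer(int memop_id) {
  g_parse_count++;
  return ParsedPointer{memop_id / 4, memop_id % 4};
}

// Uncached: the pointer is re-parsed inside the all-to-all loop.
int count_same_base_uncached(const std::vector<int>& memops) {
  int pairs = 0;
  for (std::size_t i = 0; i < memops.size(); i++) {
    ParsedPointer p1 = parse_pointer(memops[i]);
    for (std::size_t j = i + 1; j < memops.size(); j++) {
      ParsedPointer p2 = parse_pointer(memops[j]);  // parsed again for every pair
      if (p1.base == p2.base) { pairs++; }
    }
  }
  return pairs;
}

// Cached: parse each memop once up front, then only look up.
int count_same_base_cached(const std::vector<int>& memops) {
  std::vector<ParsedPointer> cache;
  cache.reserve(memops.size());
  for (int m : memops) { cache.push_back(parse_pointer(m)); }
  assert(cache.size() == memops.size());

  int pairs = 0;
  for (std::size_t i = 0; i < cache.size(); i++) {
    for (std::size_t j = i + 1; j < cache.size(); j++) {
      if (cache[i].base == cache[j].base) { pairs++; }
    }
  }
  return pairs;
}
```

With n memops, the uncached variant parses n + n(n-1)/2 times, while the cached variant parses exactly n times and produces the same result.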
}

uint bytes = number_of_pointers * sizeof(VPointer);
_pointers = (VPointer*)_arena->Amalloc(bytes);
Note: I wish I could use `GrowableArray` here. But I have a `StackObj` that is `NONCOPYABLE`. I thus have to directly construct the `VPointer` into the array, and cannot construct it outside and pass it in. Someday, I hope that `GrowableArray` allows appending with the move-constructor, or something similar.
For now: I simply allocate my own memory, and use the placement-new to construct the `VPointer`s directly into that memory.
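The placement-new approach can be sketched in standalone C++ as follows. This is an illustration under assumptions, not the PR's code: `Pointer` and `PointerTable` are hypothetical names, and `std::malloc` stands in for the arena's `Amalloc`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical stand-in for a non-copyable type such as VPointer.
class Pointer {
 public:
  explicit Pointer(int node_idx) : _node_idx(node_idx) {}
  Pointer(const Pointer&) = delete;
  Pointer& operator=(const Pointer&) = delete;
  int node_idx() const { return _node_idx; }
 private:
  int _node_idx;
};

// Owns a raw buffer and constructs Pointer objects directly inside it.
// std::malloc is used here in place of an arena allocation.
class PointerTable {
 public:
  explicit PointerTable(std::size_t n)
      : _size(n),
        _data(static_cast<Pointer*>(std::malloc(n * sizeof(Pointer)))) {
    assert(n > 0 && _data != nullptr);
    for (std::size_t i = 0; i < n; i++) {
      // Placement-new: construct in place, no copy or move is needed.
      ::new (static_cast<void*>(_data + i)) Pointer(static_cast<int>(i) + 100);
    }
  }
  ~PointerTable() {
    // With malloc we must destroy and free manually.
    for (std::size_t i = 0; i < _size; i++) { _data[i].~Pointer(); }
    std::free(_data);
  }
  PointerTable(const PointerTable&) = delete;
  PointerTable& operator=(const PointerTable&) = delete;

  const Pointer& get(std::size_t i) const {
    assert(i < _size);
    return _data[i];
  }
  std::size_t size() const { return _size; }
 private:
  std::size_t _size;
  Pointer* _data;
};
```

A caller would write `PointerTable t(4);` and then read elements through `t.get(i)`, which returns a `const Pointer&`. The manual destructor loop is only needed because this sketch uses `malloc`; with arena-backed memory the storage is released together with the arena.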
// For all memory nodes before it, check if we need to add a memory edge.
for (int k = slice_nodes.length() - 1; k > j; k--) {
  MemNode* n2 = slice_nodes.at(k);

  // Ignore Load-Load dependencies:
  if (n1->is_Load() && n2->is_Load()) { continue; }

- VPointer p2(n2, _vloop);
+ const VPointer& p2 = _pointers.get(n2);
Note: another quadratic loop where we repeatedly parse the pointers.
}
#endif
Note: improve printing a bit for the `POINTERS` tag of `TraceAutoVectorization`.
    tty->print("[%d]", n->_idx);
  }
}
Note: moved it up so we can use it anywhere in `vectorization.cpp`.
@@ -678,15 +723,15 @@ class VPointer : public ArenaObj {
  int invar_factor() const;

  // Comparable?
- bool invar_equals(VPointer& q) {
+ bool invar_equals(const VPointer& q) const {
Note: had to make some things `const` here, so that I can pass around `const VPointer&`, which I get from `_pointers.get(n)` / `get_pointer(n)`.
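The need for `const` can be shown with a small standalone sketch. `Ptr` and `PtrCache` are hypothetical, simplified stand-ins and not the real `VPointer` or `VLoopPointers` classes.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for VPointer.
class Ptr {
 public:
  Ptr(int invar, int offset) : _invar(invar), _offset(offset) {}
  // Both the parameter and the method are const, so this can be called
  // on, and with, objects that are only reachable via const references.
  bool invar_equals(const Ptr& q) const { return _invar == q._invar; }
  int offset() const { return _offset; }
 private:
  int _invar;
  int _offset;
};

// Hypothetical cache accessor that only hands out const references.
class PtrCache {
 public:
  PtrCache() : _a(7, 0), _b(7, 4) {}
  const Ptr& get(int i) const {
    assert(i == 0 || i == 1);
    return i == 0 ? _a : _b;
  }
 private:
  Ptr _a;
  Ptr _b;
};

bool same_invar(const PtrCache& cache) {
  const Ptr& p1 = cache.get(0);
  const Ptr& p2 = cache.get(1);
  // This call would not compile with `bool invar_equals(Ptr& q)`:
  // a const reference cannot bind to a non-const parameter, and a
  // non-const member function cannot be called on a const object.
  return p1.invar_equals(p2);
}
```

Because the cache hands out only `const` references, every query method used on a cached pointer has to be `const`-qualified as well.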
Hi Emanuel,
I've some general questions regarding naming and Arena usage, I hope you don't mind some runtime team input.
// We compute and cache the VPointer for every load and store.
class VLoopPointers : public StackObj {
private:
  Arena* _arena;
Will the pointer ever change? Could potentially change this to a reference.
Is it important for this to be Arena-allocated? Seems to me like `compute_and_cache` will only be computed once per `VLoopPointers` instance, right?
We can discuss whether Arena allocation is the right thing to do. But for now it is what I did with all other submodules of `VLoopAnalyzer`, so if we were to change this, I can do that in a separate RFE.
What alternative would you prefer, and why?
I like Arena allocation because I have a clear location and lifetime for my allocations. I can close the arena after all AutoVectorization, and I know that the data is valid up to that point, and then it gets deallocated.
CHeap allocation would require me to be smarter and more careful about deallocation.
Resource allocation, in my experience, is often problematic if you have different lifetimes for things. I like Resource allocation only for temporary data structures, not for data that is used across a large algorithm with dozens of subalgorithms.
Let me know what you think ;)
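The lifetime argument can be illustrated with a toy bump allocator. `MiniArena` and `sum_offsets` are hypothetical names for this sketch; it is not HotSpot's `Arena` class and makes no claim about how that class is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal bump arena (not HotSpot's Arena class).
class MiniArena {
 public:
  explicit MiniArena(std::size_t capacity) : _buffer(capacity), _used(0) {}

  // Hands out `bytes` from the buffer, rounded up to the maximum alignment.
  void* alloc(std::size_t bytes) {
    const std::size_t align = alignof(std::max_align_t);
    const std::size_t rounded = (bytes + align - 1) / align * align;
    assert(_used + rounded <= _buffer.size());
    void* p = _buffer.data() + _used;
    _used += rounded;
    return p;
  }
  std::size_t used() const { return _used; }
 private:
  std::vector<unsigned char> _buffer;
  std::size_t _used;
};

// All data of one "analysis" lives in a single arena.
int sum_offsets(std::size_t num_memops) {
  MiniArena arena(1024);
  int* offsets = static_cast<int*>(arena.alloc(num_memops * sizeof(int)));
  for (std::size_t i = 0; i < num_memops; i++) {
    offsets[i] = static_cast<int>(i) * 4;
  }
  int sum = 0;
  for (std::size_t i = 0; i < num_memops; i++) { sum += offsets[i]; }
  return sum;
}  // arena goes out of scope here: every allocation is released at once
```

There is no per-allocation `free` to forget: the data stays valid for as long as the arena lives and is released in one step when the arena is destroyed.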
> Will the pointer ever change? Could potentially change this to a reference.

I could make it a reference. But data structures like `GrowableArray` take an `Arena*`. So then I have to use `*` and `&` all the time. I don't like that, it makes the code much more "noisy".
That's a nice improvement and it makes sense to just compute them once and re-use them. I only have a few comments but generally looks good!
const VLoop& _vloop;
const VLoopBody& _body;

// Array of cached pointers
Maybe make a note that we allocate/cache them lazily upon request.
It is not lazy, they are allocated and cached in `compute_and_cache`. Like all other `VLoopAnalyzer` submodules. Maybe I missed your point 😅
I meant that it's not allocated in the constructor, as you initialize it with `nullptr`. It's only initialized once you call `compute_and_cache()`, which may not happen if we bail out earlier. That's what I meant by "lazy", but that was probably not clear enough :-)
Aha, I see. I mean all other submodules are handled the same. They also cannot really be used until `VLoopAnalyzer::setup_submodules` returns with success. I guess this here is the first instance where the data structure itself is only allocated after the constructor. But I feel like if anybody has a question about where it is allocated, they can just search the reference. If I start putting down such detailed comments, then I need to put them everywhere. And that will clutter the code.
That's true. Here I think I've only commented it since it's allocated specially for the first time in the sub modules. But it does not really add much information per se. It's fine to leave it like that.
Ok, thanks for the suggestion anyway 😊
I will leave it without a comment then.
@jdksjolen @chhagedorn Thanks for your suggestions!
One question: will VLoopAnalyzer default destructor clean up all memory used?
@vnkozlov there is no need, since it is all allocated over the
It is that arena that I pass into all submodules, such as |
Thanks for making the suggested changes. Looks good to me!
Good.
Thanks @vnkozlov @chhagedorn @jdksjolen for the reviews and suggestions!
/integrate
Going to push as commit f762637.
Your commit was automatically rebased without conflicts.
This is a subtask of JDK-8315361.

Parsing `VPointer` currently happens all over SuperWord. And often in quadratic loops, where we compare all-with-all loads/stores.

I propose to cache the `VPointer`s, then we can do a constant-time cache lookup rather than parsing the pointer subgraph every time.

There are now only a few cases where we cannot use the cached `VPointer`:
- `SuperWord::unrolling_analysis`: we have no `VLoopAnalyzer`, and so no submodules like `VLoopPointers`. We don't need to cache, since we only iterate over the loop body once, and create only a single `VPointer` per memop.
- `SuperWord::output`: when we have a `Load`, and try to bypass `StoreVector` nodes. The `StoreVector` nodes are new, and so we have no cached `VPointer` for them. This could be fixed somehow, but I don't want to deal with it now. I intend to refactor `SuperWord::output` soon, and can look into options at that point (either I bypass before we insert the vector nodes, or I remember what scalar memop the vector was created from, and then get the cached pointer this way).

This changeset is also a preparation step for JDK-8325155. I will have a list of pointers, and sort them such that creating adjacent refs is much more efficient.
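A constant-time lookup of this kind could look roughly like the sketch below. It is a guess at the general shape, not the PR's data structure: `CachedPointer`, `PointerCache`, `has` and the index-to-slot table are all hypothetical, and the offset formula is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a cached VPointer.
struct CachedPointer {
  int node_idx;
  int offset;
};

// Hypothetical cache: maps a node index to a slot in a dense array.
class PointerCache {
 public:
  // `memop_idxs` lists the node indices of all loads/stores in the loop body;
  // `max_node_idx` bounds every node index that may be queried.
  PointerCache(const std::vector<int>& memop_idxs, int max_node_idx)
      : _slot_of(static_cast<std::size_t>(max_node_idx) + 1, -1) {
    for (int idx : memop_idxs) {
      assert(idx >= 0 && idx <= max_node_idx);
      _slot_of[static_cast<std::size_t>(idx)] = static_cast<int>(_pointers.size());
      // "Parse" once; the offset formula is invented for the example.
      _pointers.push_back(CachedPointer{idx, idx * 8});
    }
  }

  // False for nodes that were not memops when the cache was built,
  // for example nodes created later.
  bool has(int node_idx) const {
    return node_idx >= 0 &&
           static_cast<std::size_t>(node_idx) < _slot_of.size() &&
           _slot_of[static_cast<std::size_t>(node_idx)] >= 0;
  }

  // Constant-time lookup: two array accesses, no re-parsing.
  const CachedPointer& get(int node_idx) const {
    assert(has(node_idx));
    const int slot = _slot_of[static_cast<std::size_t>(node_idx)];
    return _pointers[static_cast<std::size_t>(slot)];
  }
 private:
  std::vector<int> _slot_of;             // node index -> slot, or -1
  std::vector<CachedPointer> _pointers;  // dense array of cached pointers
};
```

In this sketch a node that did not exist when the cache was built simply has no entry, which mirrors the `StoreVector` case described above.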
Benchmarking SuperWord Compile Time
I use the same benchmark from #18532.
On master:
With this patch:
This saves us about `7 sec`, which is significant. I will have to see what effect it has once we also apply #18532, but I think the combined effect will be very significant.
Reviewing
Using `git`
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18577/head:pull/18577
$ git checkout pull/18577
Update a local copy of the PR:
$ git checkout pull/18577
$ git pull https://git.openjdk.org/jdk.git pull/18577/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 18577
View PR using the GUI difftool:
$ git pr show -t 18577
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18577.diff