8324890: C2 SuperWord: refactor out VLoop, make unrolling_analysis static, remove init/reset mechanism #17624
Conversation
👋 Welcome back epeter! A progress list of the required criteria for merging this PR into
    return false; // failure
  }

const char* VLoop::check_preconditions_helper() {
Note: replaces most code from the old SuperWord::transform_loop
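As a rough illustration of the pattern in this hunk, here is a minimal standalone sketch: the helper returns nullptr on success or a human-readable failure reason, and the caller turns that into a bool. The field and the single check are hypothetical placeholders, not the actual HotSpot preconditions.

```cpp
#include <cstdio>

class VLoopSketch {
public:
  bool check_preconditions() {
    const char* failure = check_preconditions_helper();
    if (failure != nullptr) {
      printf("VLoop::check_preconditions: failed: %s\n", failure);
      return false; // failure
    }
    return true;
  }

private:
  bool _is_counted_loop = false; // hypothetical placeholder state

  const char* check_preconditions_helper() {
    // Each precondition reports a reason string on failure, nullptr on success.
    if (!_is_counted_loop) { return "loop is not a counted loop"; }
    return nullptr;
  }
};
```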
  _num_work_vecs(0), // amount of vector work we have
  _num_reductions(0) // amount of reduction work we have
{
}

//------------------------------transform_loop---------------------------
bool SuperWord::transform_loop(IdealLoopTree* lpt, bool do_optimization) {
Note: code moved to VLoop::check_preconditions_helper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: all the do_optimization parts are not part of the preconditions, and hence they are kept in the new transform_loop.
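A plausible shape of the new split, reconstructed from these notes (hypothetical signatures, not the verbatim patch):

```cpp
// Preconditions run first via VLoop; only the do_optimization parts remain here.
bool transform_loop(VLoop& vloop, bool do_optimization) {
  if (!vloop.check_preconditions()) {
    return false; // reason already reported inside check_preconditions
  }
  if (do_optimization) {
    // ... the do_optimization parts, unchanged by the precondition refactor ...
  }
  return true;
}
```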
    cl->set_pre_loop_end(pre_end);
  }

  init(); // initialize data structures
Note: this is the end of the "preconditions", and we used to set _early_exit = false inside init().
  if (SuperWordReductions) {
    mark_reductions();
  }
Note: I now would like to move reduction marking to after precondition checking. Hence, I moved it to SLP_extract.
  assert(pre_loop_end, "must be valid");
  _pre_loop_end = pre_loop_end;
}
Note: This should never have been cached in the node itself, but only during autovectorization. I moved it now into VLoop, which I also pass into VPointer, which has to access the pre-loop for independence checks.
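A minimal sketch of the new ownership, with simplified stand-in types (the real classes are HotSpot's CountedLoopEndNode, VLoop and VPointer):

```cpp
class CountedLoopEndNode; // stand-in for the HotSpot node class

class VLoop {
  CountedLoopEndNode* _pre_loop_end = nullptr; // cached for one autovectorization attempt
public:
  void set_pre_loop_end(CountedLoopEndNode* end) { _pre_loop_end = end; }
  CountedLoopEndNode* pre_loop_end() const { return _pre_loop_end; }
};

class VPointer {
  const VLoop& _vloop; // loop context is passed in, not cached on a node
public:
  explicit VPointer(const VLoop& vloop) : _vloop(vloop) {}
  // independence checks can reach _vloop.pre_loop_end() when needed
};
```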
@@ -231,14 +231,11 @@ class CountedLoopNode : public BaseCountedLoopNode {
   // vector mapped unroll factor here
   int _slp_maximum_unroll_factor;

-  // Cached CountedLoopEndNode of pre loop for main loops
-  CountedLoopEndNode* _pre_loop_end;
Note: this makes the node smaller, and does not cache something that may be invalid later. It was used only during SuperWord. Looks like a bad pattern.
   _race_possible(false), // cases where SDMU is true
   _early_return(true), // analysis evaluations routine
-  _do_vector_loop(phase->C->do_vector_loop()), // whether to do vectorization/simd style
+  _do_vector_loop(phase()->C->do_vector_loop()), // whether to do vectorization/simd style
   _num_work_vecs(0), // amount of vector work we have
   _num_reductions(0) // amount of reduction work we have
Note: Before this change, we used to create SuperWord only once, and use it for all loops in the compilation. Now that I ripped out the "init" method and avoid reusing SuperWord this way, we want to make sure we do not re-allocate too much. For some data structures I now pre-allocate memory for the maximum size they may ever reach, to avoid re-allocation when they grow.
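A hedged sketch of the pre-allocation idea, assuming an Arena* arena and an upper bound max_size are in scope (for example, the loop-body node count): constructing the GrowableArray with its final capacity means later appends never trigger a re-allocation.

```cpp
// Illustrative only; 'arena' and 'max_size' are assumed, not from the patch.
GrowableArray<int> body_idx(arena, max_size, /* initial_len */ 0, /* filler */ 0);
for (int i = 0; i < max_size; i++) {
  body_idx.append(i); // stays within the reserved capacity; no grow needed
}
```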
  IdealLoopTree* lpt = vloop.lpt();
  CountedLoopNode* cl = vloop.cl();
  Node* cl_exit = vloop.cl_exit();
  PhaseIdealLoop* phase = vloop.phase();
Note: Made it static, and instead of the SuperWord object, we now only have access to the VLoop object.
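Consistent with the hunk above, the static method presumably begins by unpacking what it needs from the VLoop; roughly (simplified, not the verbatim signature):

```cpp
void SuperWord::unrolling_analysis(const VLoop& vloop, int& local_loop_unroll_factor) {
  // Everything is read off the VLoop instead of SuperWord member fields:
  IdealLoopTree* lpt    = vloop.lpt();
  CountedLoopNode* cl   = vloop.cl();
  Node* cl_exit         = vloop.cl_exit();
  PhaseIdealLoop* phase = vloop.phase();
  // ... the analysis proceeds with these locals ...
}
```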
  if (SuperWordReductions) {
    mark_reductions();
  }
Note: now we don't need to mark reductions for unrolling_analysis any more, and only for SLP_extract. We win!
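Where the marking presumably lives now, as a sketch (not the exact code):

```cpp
// Reduction marking happens only on the vectorization path, after the
// preconditions have already passed.
bool SuperWord::SLP_extract() {
  if (SuperWordReductions) {
    mark_reductions();
  }
  // ... body analysis, pack construction, etc. ...
  return true; // simplified
}
```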
   VectorSet _loop_reductions; // Reduction nodes in the current loop
-  Node* _bb; // Current basic block
Note: always the same as cl
  CountedLoopNode* head = pre_loop_end()->loopnode();
  assert(head != nullptr, "must find head");
  return head;
};
Note: before this patch, these two cache-accessors were in the CountedLoopEndNode.
-  sw.unrolling_analysis(_local_loop_unroll_factor);
+  VLoop vloop(this, true);
+  if (vloop.check_preconditions()) {
+    SuperWord::unrolling_analysis(vloop, _local_loop_unroll_factor);
Note: I made unrolling_analysis static, and only pass in vloop, not all the info in the SuperWord object.
Advantage: we don't need to allocate any SuperWord data structures any more for unrolling_analysis.
src/hotspot/share/opto/loopnode.cpp (Outdated)
 if (C->do_superword() && C->has_loops() && !C->major_progress()) {
   // SuperWord transform
-  SuperWord sw(this);
+  ResourceArea autovectorization_arena;
Note: this allows us to free up all the space used by SuperWord's internal data structures between every processed loop. Previously, all internal data structures were on the phase->C->comp_arena().
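The pattern described here, as a sketch (assuming HotSpot's ResourceArea, whose destructor releases the chunks it accumulated):

```cpp
{
  ResourceArea autovectorization_arena; // one arena per build_and_optimize() call
  // ... all autovectorization data structures allocate from this arena,
  //     instead of accumulating in phase->C->comp_arena() ...
} // arena destructor runs here: the autovectorization memory is returned
```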
Would be nice to see the effect of these changes on C2 compilation time. You have to create the SW object each time now, instead of only once.
@vnkozlov I created #17683 to add time measurement. Let me try it on a simple benchmark here. It seems to me that there is no significant difference; the variance is higher than the difference. Benchmark idea:

Disabled turbo-boost for the benchmark.

Interpretation / Implication / Speculation: In SuperWord, there are lots of nested loops. Certainly 2 and 3 deep, and even 4 deep, I think. Hence, a small overhead on memory can probably not really be measured. I'll look into lowering the algorithmic complexity in the future. This is especially important if we want to increase the loop body size.

[Before/after timings for test001,002,003, test101, test201 and test301, and the benchmark source, were attached as collapsed details.]
Thank you for running the timing testing. What about memory fragmentation? Will this code use the default chunks in Arena (they can be reused), or allocate a new chunk (malloc) each time, which may lead to fragmentation?
@vnkozlov is there a way to measure memory fragmentation? I don't know how to answer that question. Of course this is only helpful if the malloc/free'd chunks can properly be reused. I did some

I wonder if we could not round up the chunk-size to the next bigger size for which we have a pool. Of course this would mean we have some padding in the chunks, but if they are short-lived chunks then at least the whole memory can be reclaimed.

@jdksjolen you have done some work on Arenas. Do you have any wisdom to offer here?

An alternative: we can put the
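A standalone sketch of the rounding idea floated above; the pooled sizes here are illustrative, not HotSpot's actual arena.hpp values:

```cpp
#include <cstddef>

// Round a requested chunk size up to the next pooled size, so a freed chunk
// can later be handed out from the pool instead of being malloc'd anew.
static size_t round_up_to_pooled_size(size_t requested) {
  static const size_t pooled_sizes[] = { 256, 1024, 8192, 32768 };
  for (size_t s : pooled_sizes) {
    if (requested <= s) {
      return s; // some padding, but the whole chunk stays reusable
    }
  }
  return requested; // larger than any pool: exact allocation, as before
}
```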
Yes, that was my concern. There are chunks with different sizes: arena.hpp#L66. Are your allocation sizes > 32*K? I think this should not stop you from doing this refactoring. Yes, it will allow returning memory back sooner, and it is up to the OS how it optimizes it. I read your offline discussion with Johan. He has an interesting suggestion for growable arrays (use C heap).
I think the only really big array is

Does using CHeap directly really improve the situation? That just means that we directly malloc, instead of going through the Arena, right? This would not make a difference for memory fragmentation, or am I wrong?

I think it might be a good idea to have at least the large
   bool is_slp = true;
-  size_t ignored_size = lpt()->_body.size();
+  size_t ignored_size = lpt->_body.size();
   int *ignored_loop_nodes = NEW_RESOURCE_ARRAY(int, ignored_size);
TODO: make local, add ResourceMark.
Note: I will do this in a future RFE.
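A sketch of what that TODO could look like, assuming HotSpot's usual ResourceMark idiom (surrounding function elided):

```cpp
{
  ResourceMark rm; // marks the current resource-area high-water line
  int* ignored_loop_nodes = NEW_RESOURCE_ARRAY(int, ignored_size);
  // ... use ignored_loop_nodes locally ...
} // allocations made since 'rm' are released when the mark goes out of scope
```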
@vnkozlov After extensive discussions with @jdksjolen, I now decided to create a
Looks reasonable to me.
@eme64 This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 57 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the
A few comments.
  // Shared data structures for all AutoVectorizations, to reduce allocations
  // of large arrays.
  VSharedData vshared;
So it is local for each build_and_optimize() call, and the space will be freed by the destructor.
Yes, exactly. Before my change, we had a SuperWord object per build_and_optimize() that allocated all the data structures. So the scope is now still the same, except that before it all went to the comp_arena. So before, we might have allocated those data structures multiple times (once per build_and_optimize), and grown comp_arena each time. Now we put it into a dedicated arena, which is freed in the destructor. So the memory usage should be a little lower that way.
  GrowableArray<int>& node_idx_to_loop_body_idx() {
    // Since this is a shared resource, we clear before every individual use.
    _node_idx_to_loop_body_idx.clear();
I think there should be an explicit VSharedData::clear() method called in auto_vectorize(). Otherwise, much later, someone will have a hard time finding the place where the space is cleared.
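A sketch of the suggested shape, with hypothetical members beyond the one visible in the diff: a single explicit clear() on VSharedData, called once per loop from auto_vectorize().

```cpp
class VSharedData {
  GrowableArray<int> _node_idx_to_loop_body_idx;
  // ... other shared structures ...
public:
  void clear() {
    _node_idx_to_loop_body_idx.clear();
    // ... clear the other shared structures here, in one obvious place ...
  }
  GrowableArray<int>& node_idx_to_loop_body_idx() {
    return _node_idx_to_loop_body_idx; // no hidden clearing in the accessor
  }
};
```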
@vnkozlov done.
This looks good.
Going to push as commit 232d136.
Your commit was automatically rebased without conflicts.
Subtask of #16620

(The basic goal is to break SuperWord into different modules. This makes the code more maintainable and extensible. And eventually this allows some modules to be reused by other/new vectorizers.)

- Refactor out the code shared between SuperWord::SLP_extract (where we do vectorization) and SuperWord::unrolling_analysis, and move it to a new class VLoop. This allows us to decouple unrolling_analysis from the SuperWord object, and we can make it static.
- Remove the init/reset mechanism: previously we created a single SuperWord object and reset it (SuperWord::init) for every loop. This is a bit of a nasty pattern. I now make a new VLoop and a new SuperWord object per loop.
- Since we now create more SuperWord objects, we allocate the internal data structures more often. Therefore, I now pre-allocate/reserve sufficient space on initialization.

Side-note about #17604 (integrated, no need to read any more):
I would like to remove the use of SuperWord::is_marked_reduction from SuperWord::unrolling_analysis. For starters: it is not clear what it was ever good for. Second: it requires us to do reduction marking/analysis before unrolling_analysis, and hence makes the reduction marking shared between unrolling_analysis and vectorization. I could move the reduction marking to VLoop now. But the _loop_reductions set would have to be put on an arena, and I would like to avoid creating an arena for the unrolling_analysis. Plus, it would just be nicer code, to have reduction analysis together with body analysis, type analysis, etc., and all of them only in SLP_extract.

Progress
Issue
Reviewers
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/17624/head:pull/17624
$ git checkout pull/17624
Update a local copy of the PR:
$ git checkout pull/17624
$ git pull https://git.openjdk.org/jdk.git pull/17624/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 17624
View PR using the GUI difftool:
$ git pr show -t 17624
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/17624.diff
Webrev
Link to Webrev Comment