8327978: C2 SuperWord: Fix compilation time regression in dependency graph traversal after JDK-8325651 #18532
Conversation
👋 Welcome back epeter! A progress list of the required criteria for merging this PR has been added to the body of your pull request.
@eme64 This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 7 new commits pushed to the target branch. Please see this link for an up-to-date comparison between the source branch of this pull request and the target branch. ➡️ To integrate this PR with the above commit message, issue the /integrate pull request command.
Webrevs
src/hotspot/share/opto/superword.cpp (outdated):

```cpp
Node* mem = n->in(MemNode::Memory);
for (DUIterator_Fast imax, i = mem->fast_outs(imax); i < imax; i++) {
  Node* mem_use = mem->fast_out(i);
  if (_vloop.in_bb(mem_use) && !visited.test(bb_idx(mem_use)) && mem_use->is_Store()) {
```
The `mem_use->is_Store()` check is cheap and should be first. It will also help to skip the other checks for `Load` nodes.
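A minimal model of why this reordering helps (the counters and predicates below are stand-ins, not the real HotSpot checks): with `&&` short-circuit evaluation, running the cheap test first means the more expensive checks never execute for loads.

```cpp
#include <cassert>

// Count how often the "expensive" checks run. These are hypothetical
// stand-ins for _vloop.in_bb() and the visited-bitset lookup.
static int expensive_calls = 0;

static bool in_bb(bool v)       { expensive_calls++; return v; }
static bool not_visited(bool v) { expensive_calls++; return v; }

// Reordered condition: the cheap is_store flag test runs first, so &&
// short-circuits and skips both expensive checks whenever the node is a Load.
static bool should_visit(bool is_store, bool in_block, bool unvisited) {
  return is_store && in_bb(in_block) && not_visited(unvisited);
}
```

For a loop body dominated by loads, this turns two lookups per node into zero.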
Sure, I can do that :)
Good catch! Could such a long-running compilation also be added as a jtreg test, which fails due to a timeout without this patch and passes with the patch?
@chhagedorn I added a regression test as you requested. It results in a timeout before the patch, and passes with plenty of time to spare with the patch.
You have a whitespace error in the test file.
```
 * @bug 8327978
 * @summary Test compile time for large compilation, where SuperWord takes especially much time.
 * @requires vm.compiler2.enabled
 * @run main/othervm -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=5 -XX:LoopUnrollLimit=1000 -Xbatch
```
You can also use `main/othervm/timeout=30` or an even lower timeout. Then you might be able to get rid of `RepeatCompilation`.
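The suggested variant might look like the following jtreg header (a sketch: `timeout=30` and the surviving flags are assumptions based on the original `@run` line, with `RepeatCompilation` dropped; the trailing test class name is omitted):

```
 * @bug 8327978
 * @summary Test compile time for large compilation, where SuperWord takes especially much time.
 * @requires vm.compiler2.enabled
 * @run main/othervm/timeout=30 -XX:+UnlockDiagnosticVMOptions -XX:LoopUnrollLimit=1000 -Xbatch
```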
```cpp
@@ -296,19 +296,36 @@ void VLoopDependencyGraph::add_node(MemNode* n, GrowableArray<int>& memory_pred_
// before use. With a single pass, we can compute the depth of every node, since we can
// assume that the depth of all preds is already computed when we compute the depth of use.
void VLoopDependencyGraph::compute_depth() {
  for (int i = 0; i < _body.body().length(); i++) {
    Node* n = _body.body().at(i);
    auto find_max_pred_depth = [&] (const Node* n) {
```
I would move this code out to a separate method. Having a lambda here makes it hard to read `compute_depth()`, and you don't really need to capture anything.
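As an illustration of the suggestion (the name and signature below are hypothetical, not the actual HotSpot change), a capturing lambda can often become a named, file-local helper that receives everything it needs as parameters:

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper extracted from compute_depth(): given the (already
// final) depths of a node's predecessors, return the maximum. No captures
// are needed, so a static free function reads more clearly than a lambda.
static int find_max_pred_depth(const std::vector<int>& pred_depths) {
  int max_depth = 0;
  for (int d : pred_depths) {
    if (d > max_depth) {
      max_depth = d;  // track the deepest predecessor seen so far
    }
  }
  return max_depth;
}
```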
Looks good, thanks for the update! And nice that you've been able to extract and add a test for it.
Update is good.
@vnkozlov @chhagedorn thanks for the reviews!
Going to push as commit 9da5170.
Your commit was automatically rebased without conflicts.
In JDK-8325651 / #17812 I refactored the dependency graph. It seems I made a typo, and missed a single `!`, which broke `VLoopDependencyGraph::compute_depth` (formerly `SuperWord::compute_max_depth`). The consequence was that all nodes in the dependency graph had the same depth `1`. A node is supposed to have a higher depth than all its inputs, except for Phi nodes, which have depth `0`, as they are at the beginning of the loop's basic block, i.e. they are at the beginning of the DAG.

Details
Well, it is a bit more complicated. I had not just forgotten about the `!`. Before the change, we used to iterate over the body multiple times, until the depth computation was stable. When I saw this, I assumed this was not necessary, since the `body` is already ordered such that `def` comes before `use`. So I reduced it to a single pass over the `body`.

But this assumption was wrong: I added some assertion code, which detected that something was wrong with the ordering in the `body`. In the failing example, I saw that we had a `Load` and a `Store` with the same memory state. Given the edges, our ordering algorithm for the `body` could schedule the `Load` before the `Store`, or the `Store` before the `Load`. But that is incorrect: our assumption is that in such cases `Load`s always happen before `Store`s.

Therefore, I had to change the traversal order in `VLoopBody::construct`, so that we visit `Load`s before `Store`s. With this, I now know that the `body` order is correct for both the data dependency and the memory dependency. Therefore, I only need to iterate over the `body` once in `VLoopDependencyGraph::compute_depth`.
More Background / Details
This bug was reported because there were timeouts with `TestAlignVectorFuzzer.java`. This fix seems to improve the compile time drastically for the example below. It seems to be an example with a large dependency graph, where we still attempt to create some packs. This means there is a large number of `independence` checks on the dependency graph. If those are not pruned well, then they visit many more nodes than necessary.

Why did I not catch this earlier? I had a compile-time benchmark for JDK-8325651 / #17812, but it seems it was not sensitive enough. It has a dense graph, but never actually created any packs. My new benchmark creates packs, which unlocks more checks during `filter_packs_for_mutual_independence`, which seem to stress the dependency graph traversals much more.

If such large dense dependency graphs turn out to be very common, we could take more drastic steps in the future:
e.g. cache the result for every pair `(n1, n2)` corresponding to the `independence(n1, n2)` query. This would make independence checks a constant-time lookup, rather than a graph traversal.

I extracted a simple compile-time benchmark from `TestAlignVectorFuzzer.java`:

```
/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=printcompilation,TestGraph2::* -XX:CompileCommand=compileonly,TestGraph2::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=0 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph2.java
```
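The constant-time lookup idea mentioned above could be sketched as a precomputed transitive closure over the dependency edges (a hypothetical model, not HotSpot code; the O(n^2) memory and O(n^3) setup cost is exactly the trade-off such a drastic step would have to justify):

```cpp
#include <cassert>
#include <vector>

// Hypothetical cache: precompute reachability between all node pairs once,
// then answer independence(n1, n2) with a table lookup instead of a fresh
// graph traversal per query.
struct IndependenceCache {
  int n;
  std::vector<bool> reach;  // reach[i * n + j]: node j depends on node i

  explicit IndependenceCache(const std::vector<std::vector<int>>& succs)
      : n((int)succs.size()), reach(succs.size() * succs.size(), false) {
    // Seed with the direct dependency edges.
    for (int i = 0; i < n; i++) {
      for (int j : succs[i]) reach[i * n + j] = true;
    }
    // Floyd-Warshall-style transitive closure: pay O(n^3) once up front.
    for (int k = 0; k < n; k++)
      for (int i = 0; i < n; i++)
        if (reach[i * n + k])
          for (int j = 0; j < n; j++)
            if (reach[k * n + j]) reach[i * n + j] = true;
  }

  // Two nodes are independent if neither (transitively) depends on the other.
  bool independent(int a, int b) const {
    return !reach[a * n + b] && !reach[b * n + a];
  }
};
```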
I took the above numbers before the integration of #18577. It had about a 7 sec speedup on this same benchmark.

The numbers after the integration are below:
This makes a massive difference. And it shows that #18577 was a very strong compile-time improvement, especially for large loop bodies, and with partial vectorization. Even in such an extreme stress example, we still only spend 72.5% of compile time in AutoVectorization.
Reviewing
Using git
Checkout this PR locally:

```
$ git fetch https://git.openjdk.org/jdk.git pull/18532/head:pull/18532
$ git checkout pull/18532
```

Update a local copy of the PR:

```
$ git checkout pull/18532
$ git pull https://git.openjdk.org/jdk.git pull/18532/head
```
Using Skara CLI tools

Checkout this PR locally:

```
$ git pr checkout 18532
```

View PR using the GUI difftool:

```
$ git pr show -t 18532
```
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18532.diff
Webrev
Link to Webrev Comment