8287087: C2: perform SLP reduction analysis on-demand #13120
Conversation
👋 Welcome back rcastanedalo! A progress list of the required criteria for merging this PR into
@robcasloz The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
test/hotspot/jtreg/compiler/loopopts/superword/TestGeneralizedReductions.java
I filed this RFE; it is related to this work here: JDK-8305707 "SuperWord should vectorize reverse-order reduction loops"
I have resolved conflicts caused by the integration of JDK-8304042; added minimal, debug-only code for emitting
Hi @jatin-bhateja, I have written a qualitative comparison between this PR and the generic search approach proposed by @eme64 and you (see Alternative approaches section in the updated PR description). I hope the comparison clarifies and motivates the plan outlined in #13120 (comment). Please let me know whether you agree with that plan so that we can move forward with this RFE, JDK-8302673, and also JDK-8302652.
src/hotspot/share/opto/superword.cpp
const Node* current = first;
const Node* pred = phi; // current's predecessor in the reduction cycle.
bool used_in_loop = false;
for (int i = 0; i < path_nodes; i++) {
  for (DUIterator_Fast jmax, j = current->fast_outs(jmax); j < jmax; j++) {
    Node* u = current->fast_out(j);
    if (!in_bb(u)) {
      continue;
    }
    if (u == pred) {
      continue;
    }
    used_in_loop = true;
    break;
  }
  if (used_in_loop) {
    break;
  }
  pred = current;
  current = original_input(current, reduction_input);
}
I tried out your suggestion but unfortunately, the bookkeeping code (marking/storing candidate nodes and their predecessors in the tentative reduction chain) became more complex than the simplifications it enabled.
Hi @robcasloz, OK, my concern was that after path detection we have two occurrences of original_input; this can be optimized if we bookkeep the nodes encountered during path detection. Kindly consider the attached rough patch, which records the nodes during path detection.
reduction_patch.txt
Hi @robcasloz, the problem occurs due to Min/Max canonicalizing transformations, which result in the creation of new nodes but do not propagate the has_swapped_edges flags. A forward traversal starting from the output of the phi node can avoid edge-swapping issues and can give up discovering a path if any node feeds more than one user. I want to stress that even if mark_reductions detects a set of nodes as part of a reduction chain, SLP may still not vectorize it, e.g. an AddI reduction chain with different constant inputs. Your approach looks good to me: path finding is strict and follows the same edge for path discovery, and it fixes several missed reduction scenarios.
Hi @robcasloz, apart from some earlier shared concerns on the path detection traversal, which are not blocking issues, the patch looks good to me.
Best Regards,
Jatin
@robcasloz This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 52 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the
Thanks for reviewing, Jatin!
Still looks good.
Thanks for looking at this again, Emanuel!
Great, thorough analysis. The fix looks good to me.
Thanks for reviewing, Tobias!
/integrate
Going to push as commit 1be80a4.
Your commit was automatically rebased without conflicts.
@robcasloz Pushed as commit 1be80a4. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
Filed now: JDK-8306989.
Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point Math.min()/max() implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see JDK-8261147 and JDK-8279622).

This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops:
The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks.
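For reference, here is a minimal, hypothetical example (not from the patch) of the kind of loop this analysis targets. The accumulator becomes a phi node in C2's IR, and unrolling clones the add into a chain of AddI nodes forming the reduction cycle:

```java
// Hypothetical example of a simple add-reduction loop. In C2's IR the
// accumulator 'acc' becomes a phi node, and each (unrolled) iteration
// contributes an AddI node to the reduction cycle the analysis must find.
public class SumReduction {
    static int sum(int[] a) {
        int acc = 0;
        for (int i = 0; i < a.length; i++) {
            acc += a[i];
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[] {1, 2, 3, 4})); // prints 10
    }
}
```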
Performance Benefits
As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point Math.min()/max() in multiple scenarios.
Increased Auto-Vectorization Scope
There are two main scenarios in which the proposed changeset enables further auto-vectorization:
Reductions Using Global Accumulators
Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis:
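A hypothetical sketch of this scenario (names are illustrative, not taken from the patch): the accumulator is a static field, so each iteration initially compiles to a load, an add, and a store, hiding the reduction cycle from the early analysis until the load/store pair is optimized away.

```java
// Hypothetical sketch of a reduction through a global (static field)
// accumulator. Before optimization each iteration is a
// LoadI -> AddI -> StoreI sequence, which defeats the early analysis.
public class GlobalAccumulator {
    static int acc = 0;

    static void accumulate(int[] a) {
        for (int i = 0; i < a.length; i++) {
            acc += a[i]; // load of 'acc', add, store back
        }
    }

    public static void main(String[] args) {
        accumulate(new int[] {5, 6, 7});
        System.out.println(acc); // prints 18
    }
}
```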
Reductions of partially unrolled loops
These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically.
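A hypothetical sketch of such a manually unrolled reduction (illustrative code, not from the test suite): the chain of add operations exists before any automatic unrolling, so the early analysis misses it, while the on-demand analysis sees it like any unrolled loop.

```java
// Hypothetical sketch of a manually (partially) unrolled reduction:
// two accumulation steps per iteration, plus a tail loop for odd lengths.
public class PartiallyUnrolled {
    static int sumUnrolled(int[] a) {
        int acc = 0;
        int i = 0;
        for (; i + 1 < a.length; i += 2) {
            acc += a[i];
            acc += a[i + 1]; // second link in the reduction chain
        }
        for (; i < a.length; i++) {
            acc += a[i]; // tail for odd-length arrays
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(sumUnrolled(new int[] {1, 2, 3, 4, 5})); // prints 15
    }
}
```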
Increased Performance of x64 Floating-Point Math.min()/max()
Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point Math.min()/max() implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in FpMinMaxIntrinsics.java for more details).
Implementation details
The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case testReductionOnPartiallyUnrolledLoopWithSwappedInputs from TestGeneralizedReductions.java) results in missing vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops).

The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of [Min|Max][F|D] nodes.
Alternative approaches
A complication results from edge swapping in the nodes cloned by loop unrolling (see here and here), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64 and @jatin-bhateja) is to replace this changeset's linear chain-finding approach with some form of general path-finding algorithm. This alternative would preclude the need for tracking edge swapping, at a potentially higher computational cost. The following table summarizes the pros and cons of the current mainline approach, this changeset, and the proposed alternative:
Since the efficiency-conceptual complexity trade-off between this changeset and the general search approach is not obvious, I propose to integrate this changeset (which strikes a balance between the two) and investigate the latter one in a follow-up RFE.
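The linear chain search this changeset uses can be sketched as follows. This is a schematic toy model, not the actual C2 implementation; the Node class and all names are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the same-edge-index chain search: starting from the node
// feeding the phi's loop-carried input, repeatedly follow the SAME input
// index through nodes of one operator type and check whether the chain
// closes back at the phi. All names here are illustrative.
public class ReductionSketch {
    static class Node {
        final String opcode;
        final List<Node> in = new ArrayList<>();
        Node(String opcode) { this.opcode = opcode; }
    }

    // True if following input index 'idx' for 'len' steps from 'first'
    // visits only nodes of first's opcode and ends at 'phi'.
    static boolean isReductionChain(Node phi, Node first, int idx, int len) {
        Node cur = first;
        for (int i = 0; i < len; i++) {
            if (!cur.opcode.equals(first.opcode) || idx >= cur.in.size()) {
                return false; // chain broken: wrong operator or no such edge
            }
            cur = cur.in.get(idx);
        }
        return cur == phi; // reduction cycle found
    }

    public static void main(String[] args) {
        Node phi = new Node("Phi");
        Node add1 = new Node("AddI");
        Node add2 = new Node("AddI");
        add1.in.add(phi);  // add1 consumes the phi
        add2.in.add(add1); // add2 consumes add1; phi would consume add2
        // Only a fixed number of candidate input indices is tested per phi.
        System.out.println(isReductionChain(phi, add2, 0, 2)); // prints true
    }
}
```

Because the edge index is fixed along the whole walk, each phi requires only a constant number of linear traversals, which is what keeps the generalized analysis cheap.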
Testing
Functionality
TestGeneralizedReductions.java
Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bit platforms, since I do not have access to 32-bit ones. testReductionOnPartiallyUnrolledLoop has been observed to fail on linux-x86 due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bit restriction.
TestFpMinMaxReductions.java
Tests the matching of floating-point max/min implementations in x64.
TestSuperwordFailsUnrolling.java
This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test.
Performance
General Benchmarks
The changeset does not cause any performance regression on the DaCapo, SPECjvm2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64.
Micro-benchmarks
The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis (full results).
VectorReduction.java
These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running VectorReduction.java on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for andRedIOnGlobalAccumulator and andRedIPartiallyUnrolled, where the changeset improves performance by 2.4x in both cases.
MaxIntrinsics.java
This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and moderately to substantially improves the performance of the new ones (because it allows the x64 matcher to select a floating-point Math.min() implementation that is specialized for reduction min operations):
fMinReduceInOuterLoop
fMinReduceNonCounted
fMinReduceGlobalAccumulator
fMinReducePartiallyUnrolled
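For illustration, a float min reduction of the kind these micro-benchmarks exercise might look as follows (a hypothetical sketch, not the actual benchmark code): the loop-carried Math.min chain is what the relaxed analysis lets the x64 matcher specialize.

```java
// Hypothetical sketch of a float min reduction: the loop-carried
// Math.min chain forms the reduction that x64 matching can specialize.
public class FMinReduce {
    static float fMin(float[] a) {
        float min = Float.POSITIVE_INFINITY;
        for (int i = 0; i < a.length; i++) {
            min = Math.min(min, a[i]);
        }
        return min;
    }

    public static void main(String[] args) {
        System.out.println(fMin(new float[] {3.5f, -1.0f, 2.0f})); // prints -1.0
    }
}
```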
Acknowledgments
Thanks to @danielogh for making it possible to test this improvement with confidence (JDK-8294715) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback.
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/13120/head:pull/13120
$ git checkout pull/13120
Update a local copy of the PR:
$ git checkout pull/13120
$ git pull https://git.openjdk.org/jdk.git pull/13120/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 13120
View PR using the GUI difftool:
$ git pr show -t 13120
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/13120.diff