8287087: C2: perform SLP reduction analysis on-demand #13120

Closed
wants to merge 33 commits into from


Contributor

@robcasloz robcasloz commented Mar 21, 2023

Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point Math.min()/max() implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see JDK-8261147 and JDK-8279622).

This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops:

[Figure: reduction-before-after-unrolling]

The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks.

Performance Benefits

As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point Math.min()/max() in multiple scenarios.

Increased Auto-Vectorization Scope

There are two main scenarios in which the proposed changeset enables further auto-vectorization:

Reductions Using Global Accumulators

public class Foo {
  int acc = 0;
  (..)
  void reduce(int[] array) {
    for (int i = 0; i < array.length; i++) {
      acc += array[i];
    }
  }
}

Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis:

[Figure: global-reduction-before-after-unrolling]
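
For illustration only, the following sketch contrasts the global-accumulator form with an equivalent version that keeps the accumulator in a local variable. The class and method names are hypothetical and not part of this changeset. The sketch only shows that both forms compute the same value; it makes no claim about which one a given JDK build vectorizes.

```java
// Hypothetical names; only the field-accumulator pattern comes from the PR description.
class GlobalAccumulatorReduction {
    int acc = 0;

    // Accumulates directly into a field: each iteration conceptually
    // loads acc, adds an array element, and stores the result back.
    void reduceIntoField(int[] array) {
        for (int i = 0; i < array.length; i++) {
            acc += array[i];
        }
    }

    // Same arithmetic with a local accumulator: the field is read once
    // before the loop and written once after it.
    void reduceIntoLocal(int[] array) {
        int local = acc;
        for (int i = 0; i < array.length; i++) {
            local += array[i];
        }
        acc = local;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5};
        GlobalAccumulatorReduction viaField = new GlobalAccumulatorReduction();
        GlobalAccumulatorReduction viaLocal = new GlobalAccumulatorReduction();
        viaField.reduceIntoField(data);
        viaLocal.reduceIntoLocal(data);
        System.out.println(viaField.acc + " " + viaLocal.acc); // prints "15 15"
    }
}
```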

Reductions of partially unrolled loops

    (..)
    for (int i = 0; i < array.length / 2; i++) {
      acc += array[2*i];
      acc += array[2*i + 1];
    }
    (..)

These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically.
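
As a small, self-contained illustration (hypothetical class and method names), the manually unrolled loop computes the same sum as the rolled one whenever the array length is even. With an odd length, the unrolled form as written above skips the last element, so a real program would need a tail loop.

```java
// Hypothetical names; the loop shapes mirror the snippets in the PR description.
class PartiallyUnrolledReduction {
    // Plain (rolled) reduction loop.
    static int rolled(int[] array) {
        int acc = 0;
        for (int i = 0; i < array.length; i++) {
            acc += array[i];
        }
        return acc;
    }

    // Reduction unrolled by hand with a factor of two.
    // Note: ignores the last element when array.length is odd.
    static int unrolledByTwo(int[] array) {
        int acc = 0;
        for (int i = 0; i < array.length / 2; i++) {
            acc += array[2 * i];
            acc += array[2 * i + 1];
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] even = {1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println(rolled(even) + " " + unrolledByTwo(even)); // prints "36 36"
    }
}
```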

Increased Performance of x64 Floating-Point Math.min()/max()

Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point Math.min()/max() implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in FpMinMaxIntrinsics.java for more details).
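
The loop shapes meant here can be sketched roughly as follows. This is not the code of the new micro-benchmarks: the names are hypothetical, and it is my assumption that a data-dependent stride is enough for C2 to treat the first loop as non-counted.

```java
// Hypothetical names; not taken from FpMinMaxIntrinsics.java.
class FpMinReductions {
    static int sink; // keeps the inner-loop work observable

    // Min reduction in a loop whose stride is read from data, so the
    // trip count is not known when the loop is entered (assumed non-counted).
    static float minWithDataDependentStride(float[] values, int[] strides) {
        float result = Float.POSITIVE_INFINITY;
        int i = 0;
        while (i < values.length) {
            result = Math.min(result, values[i]);
            i += Math.max(1, strides[i]); // always advance by at least one
        }
        return result;
    }

    // Min reduction carried by the outer loop of a loop nest.
    static float minInOuterLoop(float[] values, int[] counts) {
        float result = Float.POSITIVE_INFINITY;
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            result = Math.min(result, values[i]);
            for (int j = 0; j < counts[i]; j++) {
                total += j;
            }
        }
        sink = total;
        return result;
    }

    public static void main(String[] args) {
        float[] values = {3f, 1f, 2f, 0.5f};
        System.out.println(minWithDataDependentStride(values, new int[] {1, 1, 1, 1}));
        System.out.println(minInOuterLoop(values, new int[] {0, 2, 1, 3}));
    }
}
```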

Implementation details

The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case testReductionOnPartiallyUnrolledLoopWithSwappedInputs from TestGeneralizedReductions.java) results in missing vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops).
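
The same-index chain search can be modeled on a toy graph as follows. This is a simplified sketch in Java, not the C2 implementation: the Node class, the input-index convention, and findChain are all hypothetical, and the sketch leaves out the additional checks of the real analysis (such as whether chain nodes are used elsewhere in the loop, and the swapped-edge tracking).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the same-edge-index chain search; all names are hypothetical.
class ReductionChainSketch {
    static final class Node {
        final String op;
        // in[0] is unused in this model; in[1] and in[2] are the data inputs.
        final Node[] in = new Node[3];
        Node(String op) { this.op = op; }
    }

    static final int LOOP_BACK = 2; // assumed index of the phi's backedge input

    static Node binary(String op, Node left, Node right) {
        Node n = new Node(op);
        n.in[1] = left;
        n.in[2] = right;
        return n;
    }

    // Starting at the phi's backedge input, follows the same input index through
    // nodes with the same operator until the phi is reached again. Only two
    // paths (index 1 and index 2) are tried per phi.
    static List<Node> findChain(Node phi, int maxLength) {
        Node first = phi.in[LOOP_BACK];
        if (first == null) {
            return List.of();
        }
        for (int idx = 1; idx <= 2; idx++) {
            List<Node> chain = new ArrayList<>();
            Node current = first;
            while (current != null && current != phi
                    && current.op.equals(first.op) && chain.size() < maxLength) {
                chain.add(current);
                current = current.in[idx];
            }
            if (current == phi && !chain.isEmpty()) {
                return chain; // closed a cycle phi -> ... -> phi
            }
        }
        return List.of(); // no same-index chain found
    }

    public static void main(String[] args) {
        // Chain as produced by unrolling twice: phi -> AddI -> AddI -> phi.
        Node phi = new Node("Phi");
        Node add1 = binary("AddI", phi, new Node("LoadI"));
        Node add2 = binary("AddI", add1, new Node("LoadI"));
        phi.in[LOOP_BACK] = add2;
        System.out.println(findChain(phi, 8).size()); // prints "2"

        // Same chain, but the first AddI has its inputs swapped.
        Node phi2 = new Node("Phi");
        Node swapped = binary("AddI", new Node("LoadI"), phi2);
        Node last = binary("AddI", swapped, new Node("LoadI"));
        phi2.in[LOOP_BACK] = last;
        System.out.println(findChain(phi2, 8).size()); // prints "0"
    }
}
```

In this model, the second case is simply missed rather than misclassified, which mirrors the point above that a failure of the same-index assumption costs vectorization but not correctness.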

The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of [Min|Max][F|D] nodes.

Alternative approaches

A complication results from edge swapping in the nodes cloned by loop unrolling (see here and here), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64 and @jatin-bhateja ) is to replace this changeset's linear chain finding approach with some form of general path-finding algorithm. This alternative would preclude the need for tracking edge swapping at a potentially higher computational cost. The following table summarizes the pros and cons of the current mainline approach, this changeset, and the proposed alternative:

| approach | correctness | efficiency | effectiveness | conceptual complexity |
|---|---|---|---|---|
| mainline (current) | hard to establish due to need of maintaining reduction flags through arbitrary graph transformations (has led to miscompilations, see JDK-8261147 and JDK-8279622) | high | low (misses substantial reduction vectorization opportunities) | high (requires maintaining non-local reduction node state) |
| this changeset | easy to establish since client transformations operate on the same graph that is analyzed | medium (limited search for chains of nodes) | high (finds all reduction cycles except for partially unrolled loops with manually-swapped inputs) | medium (requires maintaining local swapped-edge node state) |
| general search | easy to establish (same as above) | low (general search), particularly for x64 matching where the analysis runs once for every node in a chain | high (similar to above but also covering manually-swapped inputs) | low (no node state required, use of well-known graph search algorithms) |

Since the efficiency-conceptual complexity trade-off between this changeset and the general search approach is not obvious, I propose to integrate this changeset (which strikes a balance between the two) and investigate the latter one in a follow-up RFE.

Testing

Functionality

  • tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64).
  • fuzzing (12 h. on linux-x64 and linux-aarch64).
TestGeneralizedReductions.java

Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bit platforms, since I do not have access to 32-bit ones. testReductionOnPartiallyUnrolledLoop has been observed to fail on linux-x86 due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bit restriction.

TestFpMinMaxReductions.java

Tests the matching of floating-point max/min implementations in x64.

TestSuperwordFailsUnrolling.java

This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test.

Performance

General Benchmarks

The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64.

Micro-benchmarks

The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis (full results).

VectorReduction.java

These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running VectorReduction.java on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for andRedIOnGlobalAccumulator and andRedIPartiallyUnrolled, where the changeset improves performance by 2.4x in both cases.

FpMinMaxIntrinsics.java

This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point Math.min() implementation that is specialized for reduction min operations):

| micro-benchmark | speedup compared to mainline |
|---|---|
| fMinReduceInOuterLoop | 1.1x |
| fMinReduceNonCounted | 2.3x |
| fMinReduceGlobalAccumulator | 2.4x |
| fMinReducePartiallyUnrolled | 3.9x |

Acknowledgments

Thanks to @danielogh for making it possible to test this improvement with confidence (JDK-8294715) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8287087: C2: perform SLP reduction analysis on-demand

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/13120/head:pull/13120
$ git checkout pull/13120

Update a local copy of the PR:
$ git checkout pull/13120
$ git pull https://git.openjdk.org/jdk.git pull/13120/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 13120

View PR using the GUI difftool:
$ git pr show -t 13120

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/13120.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Mar 21, 2023

👋 Welcome back rcastanedalo! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Mar 21, 2023

@robcasloz The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Mar 21, 2023
@robcasloz robcasloz marked this pull request as ready for review March 22, 2023 11:06
@openjdk openjdk bot added the rfr Pull request is ready for review label Mar 22, 2023
@mlbridge

mlbridge bot commented Mar 22, 2023

Webrevs

@openjdk openjdk bot removed the rfr Pull request is ready for review label Apr 5, 2023
@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Apr 5, 2023
@eme64
Contributor

eme64 commented Apr 6, 2023

I filed this RFE, it is related to this work here: JDK-8305707 "SuperWord should vectorize reverse-order reduction loops"

@robcasloz robcasloz marked this pull request as ready for review April 14, 2023 12:43
@openjdk openjdk bot added the rfr Pull request is ready for review label Apr 14, 2023
@robcasloz
Contributor Author

I have resolved conflicts caused by the integration of JDK-8304042; added minimal, debug-only code for emitting Node::Flag_has_swapped_edges for IGV nodes; and addressed @jatin-bhateja's comments (including analyzing the interaction with JDK-8302673 and determining it is orthogonal to this RFE). Please review.

@robcasloz
Contributor Author

Hi @jatin-bhateja, I have written a qualitative comparison between this PR and the generic search approach proposed by @eme64 and you (see Alternative approaches section in the updated PR description). I hope the comparison clarifies and motivates the plan outlined in #13120 (comment). Please let me know whether you agree with that plan so that we can move forward with this RFE, JDK-8302673, and also JDK-8302652.

src/hotspot/share/opto/superword.cpp
Comment on lines 519 to 539
const Node* current = first;
const Node* pred = phi; // current's predecessor in the reduction cycle.
bool used_in_loop = false;
for (int i = 0; i < path_nodes; i++) {
  for (DUIterator_Fast jmax, j = current->fast_outs(jmax); j < jmax; j++) {
    Node* u = current->fast_out(j);
    if (!in_bb(u)) {
      continue;
    }
    if (u == pred) {
      continue;
    }
    used_in_loop = true;
    break;
  }
  if (used_in_loop) {
    break;
  }
  pred = current;
  current = original_input(current, reduction_input);
}
Member

@jatin-bhateja jatin-bhateja Apr 24, 2023


> I tried out your suggestion but unfortunately, the bookkeeping code (marking/storing candidate nodes and their predecessors in the tentative reduction chain) became more complex than the simplifications it enabled.

Hi @robcasloz, OK, my concern was that after path detection we have two occurrences of original_input; this can be optimized if we bookkeep the nodes encountered during path detection. Kindly consider the attached rough patch, which records the nodes during path detection.
reduction_patch.txt

@jatin-bhateja
Member

jatin-bhateja commented Apr 24, 2023

> Hi @jatin-bhateja, I have written a qualitative comparison between this PR and the generic search approach proposed by @eme64 and you (see Alternative approaches section in the updated PR description). I hope the comparison clarifies and motivates the plan outlined in #13120 (comment). Please let me know whether you agree with that plan so that we can move forward with this RFE, JDK-8302673, and also JDK-8302652.

Hi @robcasloz, the problem occurs due to Min/Max canonicalizing transformations, which result in the creation of new nodes but do not propagate the has_swapped_edges flag. A forward traversal starting from the output of the phi node can avoid edge-swapping-related issues and can give up discovering a path if any node feeds more than one user. I want to stress that even if mark_reductions detects a set of nodes as part of a reduction chain, SLP may still not vectorize it, e.g. an AddI reduction chain with different constant inputs.

Your approach looks good to me: path finding is strict, follows the same edge for path discovery, and fixes several missed reduction scenarios.
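
One possible reading of the forward traversal mentioned above, modeled on a toy graph with explicit user lists. This is only a sketch of the idea: all names are hypothetical, the model tracks in-loop users only, and it is not code from this PR or from the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a forward (def-to-use) chain search; all names are hypothetical.
class ForwardChainSketch {
    static final class Node {
        final String op;
        final List<Node> users = new ArrayList<>(); // in-loop users only
        Node(String op) { this.op = op; }
    }

    static void use(Node def, Node user) {
        def.users.add(user);
    }

    // Walks from the phi through its single user, then that node's single
    // user, and so on, until the phi is reached again. Gives up (returns -1)
    // as soon as a node has more than one user, the operator changes, or
    // the chain grows beyond maxLength.
    static int forwardChainLength(Node phi, int maxLength) {
        if (phi.users.size() != 1 || phi.users.get(0) == phi) {
            return -1;
        }
        Node current = phi.users.get(0);
        String op = current.op;
        int length = 0;
        while (current != phi) {
            if (!current.op.equals(op) || current.users.size() != 1 || length == maxLength) {
                return -1;
            }
            length++;
            current = current.users.get(0);
        }
        return length;
    }

    public static void main(String[] args) {
        Node phi = new Node("Phi");
        Node add1 = new Node("AddI");
        Node add2 = new Node("AddI");
        use(phi, add1);
        use(add1, add2);
        use(add2, phi);
        System.out.println(forwardChainLength(phi, 8)); // prints "2"

        // An extra in-loop user of an intermediate node ends the search.
        use(add1, new Node("StoreI"));
        System.out.println(forwardChainLength(phi, 8)); // prints "-1"
    }
}
```

Because the walk follows uses instead of inputs, it does not depend on which input index connects consecutive nodes in the chain.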

Member

@jatin-bhateja jatin-bhateja left a comment


Hi @robcasloz , Apart from some earlier shared concerns on path detection traversal which are not blocking issues, patch looks good to me.
Best Regards,
Jatin

@openjdk

openjdk bot commented Apr 25, 2023

@robcasloz This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8287087: C2: perform SLP reduction analysis on-demand

Reviewed-by: epeter, jbhateja, thartmann

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 52 new commits pushed to the master branch:

  • 86f41a4: 8306735: G1: G1FullGCScope remove unnecessary member _explicit_gc
  • d747698: 8306823: Native memory leak in SharedRuntime::notify_jvmti_unmount/mount.
  • 8d89992: 8298189: Regression in SPECjvm2008-MonteCarlo for pre-Cascade Lake Intel processors
  • 44d9f55: 8306072: Open source several AWT MouseInfo related tests
  • cc894d8: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL
  • ed1ebd2: 8306652: Open source AWT MenuItem related tests
  • f3e8bd1: 8306755: Open source few Swing JComponent and AbstractButton tests
  • 1c1a73f: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API
  • adf62fe: 8304918: Remove unused decl field from AnnotatedType implementations
  • 00b1eac: 8306031: Update IANA Language Subtag Registry to Version 2023-04-13
  • ... and 42 more: https://git.openjdk.org/jdk/compare/7400aff3b8a0294dcbb6e89e9d8aad984f29fe92...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Apr 25, 2023
@robcasloz
Contributor Author

> Hi @robcasloz , Apart from some earlier shared concerns on path detection traversal which are not blocking issues, patch looks good to me. Best Regards, Jatin

Thanks for reviewing, Jatin!

Contributor

@eme64 eme64 left a comment


Still looks good.

@robcasloz
Contributor Author

> Still looks good.

Thanks for looking at this again, Emanuel!

Member

@TobiHartmann TobiHartmann left a comment


Great, thorough analysis. The fix looks good to me.

@robcasloz
Contributor Author

> Great, thorough analysis. The fix looks good to me.

Thanks for reviewing, Tobias!

@robcasloz
Contributor Author

/integrate

@openjdk

openjdk bot commented Apr 27, 2023

Going to push as commit 1be80a4.
Since your change was applied there have been 76 commits pushed to the master branch:

  • ba43649: 8306976: UTIL_REQUIRE_SPECIAL warning on grep
  • cbccc4c: 8304265: Implementation of Foreign Function and Memory API (Third Preview)
  • 41d5853: 8306940: test/jdk/java/net/httpclient/XxxxInURI.java should call HttpClient::close
  • d94ce65: 8306858: Remove some remnants of CMS from SA agent
  • a83c02f: 8306654: Disable NMT location_printing_cheap_dead_xx tests again
  • de0c05d: 6995195: Static initialization deadlock in sun.java2d.loops.Blit and GraphicsPrimitiveMgr
  • 748476f: 8306732: TruncatedSeq::predict_next() attempts linear regression with only one data point
  • 27c5c10: 8306883: Thread stacksize is reported with wrong units in os::create_thread logging
  • 9ebcda2: 8229147: Linux os::create_thread() overcounts guardpage size with newer glibc (>=2.27)
  • 1e4eafb: 8071693: Introspector ignores default interface methods
  • ... and 66 more: https://git.openjdk.org/jdk/compare/7400aff3b8a0294dcbb6e89e9d8aad984f29fe92...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Apr 27, 2023
@openjdk openjdk bot closed this Apr 27, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Apr 27, 2023
@openjdk

openjdk bot commented Apr 27, 2023

@robcasloz Pushed as commit 1be80a4.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@robcasloz
Contributor Author

> Since the efficiency-conceptual complexity trade-off between this changeset and the general search approach is not obvious, I propose to integrate this changeset (which strikes a balance between the two) and investigate the latter one in a follow-up RFE.

Filed now: JDK-8306989.
