
8327978: C2 SuperWord: Fix compilation time regression in dependency graph traversal after JDK-8325651 #18532

Conversation

@eme64 (Contributor) commented Mar 28, 2024

In JDK-8325651 / #17812 I refactored the dependency graph. It seems I made a typo and missed a single !, which broke VLoopDependencyGraph::compute_depth (formerly SuperWord::compute_max_depth).

The consequence was that all nodes in the dependency graph ended up with the same depth of 1. A node is supposed to have a greater depth than all of its inputs, except for Phi nodes, which have depth 0, since they sit at the beginning of the loop's basic block, i.e. at the start of the DAG.
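The intended invariant can be sketched as follows. This is a simplified model for illustration, not the actual HotSpot code; the class and method names are made up:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the dependency-graph depth invariant (not HotSpot code).
class DepthModel {
    static class Node {
        final boolean isPhi;
        final List<Node> preds = new ArrayList<>();
        int depth;
        Node(boolean isPhi) { this.isPhi = isPhi; }
    }

    // Single pass over a body ordered def-before-use:
    // Phi nodes get depth 0; every other node gets 1 + max depth of its preds.
    static void computeDepth(List<Node> body) {
        for (Node n : body) {
            if (n.isPhi) {
                n.depth = 0;
            } else {
                int maxPred = 0;
                for (Node p : n.preds) {
                    maxPred = Math.max(maxPred, p.depth);
                }
                n.depth = maxPred + 1;
            }
        }
    }
}
```

The single pass is only correct if every pred really appears before its use in the body, which is exactly the assumption the bug violated.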

Details

Well, it is a bit more complicated: I had not just forgotten the !. Before the change, we used to iterate over the body multiple times, until the depth computation was stable. When I saw this, I assumed the repeated iteration was unnecessary, since the body is already ordered such that every def comes before its uses. So I reduced it to a single pass over the body.

But this assumption was wrong: I added some assertion code, which detected that the ordering in the body was broken. In the failing example, a Load and a Store shared the same memory state. Given only the edges, our ordering algorithm for the body could schedule the Load before the Store or the Store before the Load. The latter is incorrect: our assumption is that in such cases Loads always happen before Stores.

Therefore, I had to change the traversal order in VLoopBody::construct so that we visit Loads before Stores. With this, I know that the body order is correct for both the data dependencies and the memory dependencies, and hence a single pass over the body suffices in VLoopDependencyGraph::compute_depth.
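To see why the order matters, consider what the Java source semantics require when a Load and a Store share the same memory state: the load must observe the value from before the store. A hypothetical minimal example:

```java
// A Load and a Store on the same memory state: the load must be
// scheduled before the store, or it would observe the wrong value.
class LoadStoreOrder {
    static int loadThenStore(int[] a) {
        int v = a[0];   // Load: must see the value from before the store below
        a[0] = 42;      // Store: overwrites the same location
        return v;
    }
}
```

If a scheduler swapped the two accesses, the method would return 42 instead of the original element, which is why the body ordering must always place such a Load before the Store.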

More Background / Details

This bug was reported because there were timeouts with TestAlignVectorFuzzer.java. This fix seems to improve compile time drastically for the example below: it has a large dependency graph, and we still attempt to create some packs, which means a large number of independence checks on the dependency graph. If those checks are not pruned well, they visit many more nodes than necessary.

Why did I not catch this earlier? I had a compile time benchmark for JDK-8325651 / #17812, but it seems it was not sensitive enough. It has a dense graph, but never actually created any packs. My new benchmark creates packs, which unlocks more checks during filter_packs_for_mutual_independence; these stress the dependency graph traversals much more.

If such large dense dependency graphs turn out to be very common, we could take more drastic steps in the future:

  • Bail out of SuperWord if the graph gets too large.
  • Implement a data structure that is better for dense graphs, such as a matrix, where we mark the cell for (n1, n2) corresponding to the independence(n1, n2) query. This would make independence checks a constant time lookup, rather than a graph traversal.
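The matrix idea from the last bullet could look roughly like this. This is only a sketch under simplified assumptions (integer node ids in topological order), not a proposed HotSpot implementation: precompute reachability once, then answer each independence(n1, n2) query with a constant-time lookup instead of a graph traversal.

```java
// Sketch: dense reachability matrix for constant-time independence queries.
// reaches[i][j] == true iff node i reaches node j in the dependency graph.
class IndependenceMatrix {
    final boolean[][] reaches;

    // adj[i] lists the successors of node i; nodes are assumed to be
    // numbered in topological (def-before-use) order.
    IndependenceMatrix(int[][] adj) {
        int n = adj.length;
        reaches = new boolean[n][n];
        // Process nodes in reverse topological order, so every successor
        // already has its reachability row filled in.
        for (int i = n - 1; i >= 0; i--) {
            for (int j : adj[i]) {
                reaches[i][j] = true;
                for (int k = 0; k < n; k++) {
                    if (reaches[j][k]) { reaches[i][k] = true; }
                }
            }
        }
    }

    // Two nodes are independent iff neither reaches the other.
    boolean independent(int n1, int n2) {
        return !reaches[n1][n2] && !reaches[n2][n1];
    }
}
```

The trade-off is O(n^2) memory and precomputation time, which only pays off when the graph is dense and queried many times, as in the benchmark above.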

I extracted a simple compile-time benchmark from TestAlignVectorFuzzer.java:

/oracle-work/jdk-fork2/build/linux-x64/jdk/bin/java -XX:CompileCommand=printcompilation,TestGraph2::* -XX:CompileCommand=compileonly,TestGraph2::test* -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=0 -XX:LoopUnrollLimit=1000 -Xbatch TestGraph2.java

With patch:
    C2 Compile Time: 8.234 s
         IdealLoop: 8.170 s
           AutoVectorize: 7.789 s

master:
    C2 Compile Time: 56.223 s
         IdealLoop: 56.017 s
           AutoVectorize: 55.576 s

I took the above numbers before the integration of #18577, which by itself gave about a 7 second speedup on this same benchmark.
The numbers after the integration are below:

With patch:
    C2 Compile Time:        1.572 s
         IdealLoop:             1.511 s
           AutoVectorize:       1.141 s

Master:
    C2 Compile Time:       53.639 s
         IdealLoop:            53.433 s
           AutoVectorize:      53.002 s

This makes a massive difference. It also shows that #18577 was a very strong compile-time improvement, especially for large loop bodies and with partial vectorization. Even in such an extreme stress example, we now spend only about 72.5% of compile time in AutoVectorization.

import java.util.Random;

class TestGraph2 {

    private static final Random random = new Random();
    static final int RANGE_CON = 1024 * 8;

    static int init = 593436;
    static int limit = 599554;
    static int offset1 = -592394;
    static int offset2 = -592386;
    static final int offset3 = -592394;
    static final int stride =  4;
    static final int scale =   1;
    static final int hand_unrolling1 = 2;
    static final int hand_unrolling2 = 8;
    static final int hand_unrolling3 = 15;

    public static void main(String[] args) {
        byte[] aB = generateB();
        byte[] bB = generateB();
        byte[] cB = generateB();
 
        for (int i = 1; i < 100; i++) {
            testUUBBBH(aB, bB, cB);
        }
    }

    static byte[] generateB() {
        byte[] a = new byte[RANGE_CON];
        for (int i = 0; i < a.length; i++) {
            a[i] = (byte)random.nextInt();
        }
        return a;
    }

    static Object[] testUUBBBH(byte[] a, byte[] b, byte[] c) {
        int h1 = hand_unrolling1;
        int h2 = hand_unrolling2;
        int h3 = hand_unrolling3;

        for (int i = init; i < limit; i += stride) {
            if (h1 >=  1) { a[offset1 + i * scale +  0]++; }
            if (h1 >=  2) { a[offset1 + i * scale +  1]++; }
            if (h1 >=  3) { a[offset1 + i * scale +  2]++; }
            if (h1 >=  4) { a[offset1 + i * scale +  3]++; }
            if (h1 >=  5) { a[offset1 + i * scale +  4]++; }
            if (h1 >=  6) { a[offset1 + i * scale +  5]++; }
            if (h1 >=  7) { a[offset1 + i * scale +  6]++; }
            if (h1 >=  8) { a[offset1 + i * scale +  7]++; }
            if (h1 >=  9) { a[offset1 + i * scale +  8]++; }
            if (h1 >= 10) { a[offset1 + i * scale +  9]++; }
            if (h1 >= 11) { a[offset1 + i * scale + 10]++; }
            if (h1 >= 12) { a[offset1 + i * scale + 11]++; }
            if (h1 >= 13) { a[offset1 + i * scale + 12]++; }
            if (h1 >= 14) { a[offset1 + i * scale + 13]++; }
            if (h1 >= 15) { a[offset1 + i * scale + 14]++; }
            if (h1 >= 16) { a[offset1 + i * scale + 15]++; }

            if (h2 >=  1) { b[offset2 + i * scale +  0]++; }
            if (h2 >=  2) { b[offset2 + i * scale +  1]++; }
            if (h2 >=  3) { b[offset2 + i * scale +  2]++; }
            if (h2 >=  4) { b[offset2 + i * scale +  3]++; }
            if (h2 >=  5) { b[offset2 + i * scale +  4]++; }
            if (h2 >=  6) { b[offset2 + i * scale +  5]++; }
            if (h2 >=  7) { b[offset2 + i * scale +  6]++; }
            if (h2 >=  8) { b[offset2 + i * scale +  7]++; }
            if (h2 >=  9) { b[offset2 + i * scale +  8]++; }
            if (h2 >= 10) { b[offset2 + i * scale +  9]++; }
            if (h2 >= 11) { b[offset2 + i * scale + 10]++; }
            if (h2 >= 12) { b[offset2 + i * scale + 11]++; }
            if (h2 >= 13) { b[offset2 + i * scale + 12]++; }
            if (h2 >= 14) { b[offset2 + i * scale + 13]++; }
            if (h2 >= 15) { b[offset2 + i * scale + 14]++; }
            if (h2 >= 16) { b[offset2 + i * scale + 15]++; }

            if (h3 >=  1) { c[offset3 + i * scale +  0]++; }
            if (h3 >=  2) { c[offset3 + i * scale +  1]++; }
            if (h3 >=  3) { c[offset3 + i * scale +  2]++; }
            if (h3 >=  4) { c[offset3 + i * scale +  3]++; }
            if (h3 >=  5) { c[offset3 + i * scale +  4]++; }
            if (h3 >=  6) { c[offset3 + i * scale +  5]++; }
            if (h3 >=  7) { c[offset3 + i * scale +  6]++; }
            if (h3 >=  8) { c[offset3 + i * scale +  7]++; }
            if (h3 >=  9) { c[offset3 + i * scale +  8]++; }
            if (h3 >= 10) { c[offset3 + i * scale +  9]++; }
            if (h3 >= 11) { c[offset3 + i * scale + 10]++; }
            if (h3 >= 12) { c[offset3 + i * scale + 11]++; }
            if (h3 >= 13) { c[offset3 + i * scale + 12]++; }
            if (h3 >= 14) { c[offset3 + i * scale + 13]++; }
            if (h3 >= 15) { c[offset3 + i * scale + 14]++; }
            if (h3 >= 16) { c[offset3 + i * scale + 15]++; }
        }
        return new Object[]{ a, b, c };
    }
}

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8327978: C2 SuperWord: Fix compilation time regression in dependency graph traversal after JDK-8325651 (Bug - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18532/head:pull/18532
$ git checkout pull/18532

Update a local copy of the PR:
$ git checkout pull/18532
$ git pull https://git.openjdk.org/jdk.git pull/18532/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 18532

View PR using the GUI difftool:
$ git pr show -t 18532

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18532.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Mar 28, 2024

👋 Welcome back epeter! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Mar 28, 2024

@eme64 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8327978: C2 SuperWord: Fix compilation time regression in dependency graph traversal after JDK-8325651

Reviewed-by: chagedorn, kvn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 7 new commits pushed to the master branch:

  • f3db279: 8327410: Add hostname option for UL file names
  • 21867c9: 8313332: Simplify lazy jmethodID cache in InstanceKlass
  • b9da140: 8329594: G1: Consistent Titles to Thread Work Items.
  • a169c06: 8329580: Parallel: Remove VerifyObjectStartArray
  • 8efd7aa: 8328786: [AIX] move some important warnings/errors from trcVerbose to UL
  • f26e430: 8327110: Refactor create_bool_from_template_assertion_predicate() to separate class and fix identical cloning cases used for Loop Unswitching and Split If
  • e5e21a8: 8328702: C2: Crash during parsing because sub type check is not folded

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot changed the title 8327978 8327978: C2 SuperWord: Fix compilation time regression in dependency graph traversal after JDK-8325651 Mar 28, 2024
@openjdk

openjdk bot commented Mar 28, 2024

@eme64 The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Mar 28, 2024
@eme64 eme64 mentioned this pull request Apr 2, 2024
3 tasks
@eme64 eme64 marked this pull request as ready for review April 3, 2024 13:52
@openjdk openjdk bot added the rfr Pull request is ready for review label Apr 3, 2024
@mlbridge

mlbridge bot commented Apr 3, 2024

Node* mem = n->in(MemNode::Memory);
for (DUIterator_Fast imax, i = mem->fast_outs(imax); i < imax; i++) {
  Node* mem_use = mem->fast_out(i);
  if (_vloop.in_bb(mem_use) && !visited.test(bb_idx(mem_use)) && mem_use->is_Store()) {
Contributor:
The mem_use->is_Store() check is cheap and should come first. It will also help to skip the other checks for Load nodes.
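The suggestion is the usual cheapest-check-first ordering for short-circuit conditions; a generic illustration with hypothetical checks, not the HotSpot API:

```java
// Generic illustration of ordering short-circuit conditions cheapest-first
// (hypothetical checks, not the HotSpot API).
class CheckOrder {
    static int expensiveCalls = 0;

    // Stands in for a cheap type test like is_Store().
    static boolean cheapCheck(int x) { return x % 2 == 0; }

    // Stands in for a more expensive membership/visited test.
    static boolean expensiveCheck(int x) { expensiveCalls++; return x > 10; }

    // With the cheap check first, the expensive check runs only for
    // inputs that pass the cheap one.
    static boolean combined(int x) {
        return cheapCheck(x) && expensiveCheck(x);
    }
}
```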

Contributor Author:

Sure, I can do that :)

@chhagedorn (Member) left a comment:

Good catch! Could such a long-running/compiling test also be added as jtreg test which fails due to a timeout without this patch and passes with the patch?

src/hotspot/share/opto/vectorization.cpp (comment outdated, resolved)
@openjdk openjdk bot removed the rfr Pull request is ready for review label Apr 4, 2024
@eme64
Contributor Author

eme64 commented Apr 4, 2024

@chhagedorn I added a regression test as you requested. It results in timeout before the patch, and passes with plenty of time to spare with the patch.
I also did the code change you requested.

@chhagedorn (Member) left a comment:

You have a whitespace error in the test file.

* @bug 8327978
* @summary Test compile time for large compilation, where SuperWord takes especially much time.
* @requires vm.compiler2.enabled
* @run main/othervm -XX:+UnlockDiagnosticVMOptions -XX:RepeatCompilation=5 -XX:LoopUnrollLimit=1000 -Xbatch
Member:

You can also use main/othervm/timeout=30 or an even lower timeout. Then you might be able to get rid of RepeatCompilation.

@@ -296,19 +296,36 @@ void VLoopDependencyGraph::add_node(MemNode* n, GrowableArray<int>& memory_pred_
// before use. With a single pass, we can compute the depth of every node, since we can
// assume that the depth of all preds is already computed when we compute the depth of use.
void VLoopDependencyGraph::compute_depth() {
for (int i = 0; i < _body.body().length(); i++) {
Node* n = _body.body().at(i);
auto find_max_pred_depth = [&] (const Node* n) {
Member:
I would move this code out to a separate method. Having a lambda here makes it hard to read compute_depth() and you don't really need to capture anything.

@openjdk openjdk bot added the rfr Pull request is ready for review label Apr 4, 2024
@chhagedorn (Member) left a comment:

Looks good, thanks for the update! And nice that you've been able to extract and add a test for it.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Apr 4, 2024
@vnkozlov (Contributor) left a comment:

Update is good.

@eme64
Contributor Author

eme64 commented Apr 5, 2024

@vnkozlov @chhagedorn thanks for the reviews!
/integrate

@openjdk

openjdk bot commented Apr 5, 2024

Going to push as commit 9da5170.
Since your change was applied there have been 20 commits pushed to the master branch:

  • c1cfb43: 8329109: Threads::print_on() tries to print CPU time for terminated GC threads
  • 5860a48: 8329624: Add visitors for preview language features
  • 0b01144: 8329720: Gtest failure printing markword after JDK-8325303
  • 34f7974: 8325303: Replace markWord.is_neutral() with markWord.is_unlocked()
  • 27cfcef: 8329651: TestLibGraal.java crashes with assert(_stack_base != nullptr)
  • e1183ac: 8329703: Remove unused apple.jpeg file from SwingSet2 demo
  • 12ad09a: 8322042: HeapDumper should perform merge on the current thread instead of VMThread
  • d80d478: 8328649: Disallow enclosing instances for local classes in constructor prologues
  • 83eba86: 8329332: Remove CompiledMethod and CodeBlobLayout classes
  • 28216aa: 8328366: Thread.setContextClassloader from thread in FJP commonPool task no longer works after JDK-8327501
  • ... and 10 more: https://git.openjdk.org/jdk/compare/f762637be2568f898db25aa6a57c180f1feac3a3...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Apr 5, 2024
@openjdk openjdk bot closed this Apr 5, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Apr 5, 2024
@openjdk

openjdk bot commented Apr 5, 2024

@eme64 Pushed as commit 9da5170.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Labels
hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated