8326962: C2 SuperWord: cache VPointer #18577

Closed
wants to merge 7 commits into from

Conversation

eme64
Contributor

@eme64 eme64 commented Apr 2, 2024

This is a subtask of JDK-8315361.

Parsing VPointer currently happens all over SuperWord, often in quadratic loops where we compare all loads/stores with each other.

I propose to cache the VPointers, then we can do a constant-time cache lookup rather than parsing the pointer subgraph every time.
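To illustrate the idea outside of HotSpot, here is a minimal sketch of the difference between parsing inside an all-with-all loop and parsing once into a cache. All names (`ToyMemNode`, `ToyPointer`, `parse_pointer`, `parse_calls`) are hypothetical stand-ins and not part of the patch; the real VPointer parsing walks the pointer subgraph and is far more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a load/store node; not HotSpot's MemNode.
struct ToyMemNode {
  int base;    // id of the base array
  int offset;  // constant byte offset
};

// Hypothetical stand-in for VPointer: the result of "parsing" one memop.
struct ToyPointer {
  int base;
  int offset;
};

// Counts how often the (assumed expensive) parse runs.
static int parse_calls = 0;

ToyPointer parse_pointer(const ToyMemNode& n) {
  ++parse_calls;
  return ToyPointer{n.base, n.offset};
}

// Uncached: the inner loop re-parses memops on every comparison.
int count_same_base_uncached(const std::vector<ToyMemNode>& memops) {
  int pairs = 0;
  for (std::size_t i = 0; i < memops.size(); i++) {
    ToyPointer p1 = parse_pointer(memops[i]);
    for (std::size_t j = i + 1; j < memops.size(); j++) {
      ToyPointer p2 = parse_pointer(memops[j]);
      if (p1.base == p2.base) { pairs++; }
    }
  }
  return pairs;
}

// Cached: parse every memop exactly once, then look up by position.
int count_same_base_cached(const std::vector<ToyMemNode>& memops) {
  std::vector<ToyPointer> cache;
  cache.reserve(memops.size());
  for (const ToyMemNode& n : memops) { cache.push_back(parse_pointer(n)); }

  int pairs = 0;
  for (std::size_t i = 0; i < memops.size(); i++) {
    const ToyPointer& p1 = cache[i];
    for (std::size_t j = i + 1; j < memops.size(); j++) {
      const ToyPointer& p2 = cache[j];  // constant-time lookup, no parsing
      if (p1.base == p2.base) { pairs++; }
    }
  }
  return pairs;
}
```

The comparison loop stays quadratic in both variants; only the number of parse calls drops from quadratic to linear in the number of memops.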

There are now only a few cases where we cannot use the cached VPointer:

  • SuperWord::unrolling_analysis: we have no VLoopAnalyzer, and so no submodules like VLoopPointers. We don't need to cache, since we only iterate over the loop body once, and create only a single VPointer per memop.
  • SuperWord::output: when we have a Load, and try to bypass StoreVector nodes. The StoreVector nodes are new, and so we have no cached VPointer for them. This could be fixed somehow, but I don't want to deal with it now. I intend to refactor SuperWord::output soon, and can look into options at that point (either I bypass before we insert the vector nodes, or I remember what scalar memop the vector was created from, and then get the cached pointer this way).

This changeset is also a preparation step for JDK-8325155. I will have a list of pointers, and sort them such that creating adjacent refs is much more efficient.

Benchmarking SuperWord Compile Time

I use the same benchmark from #18532.

On master:

    C2 Compile Time:       56.816 s
         IdealLoop:            56.604 s
           AutoVectorize:      56.192 s

With this patch:

    C2 Compile Time:       49.719 s
         IdealLoop:            49.509 s
           AutoVectorize:      49.106 s

This saves us about 7 sec (56.816 s down to 49.719 s, roughly 12% of C2 compile time), which is significant. I will have to see what effect it has once we also apply #18532, but I think the combined effect will be very significant.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8326962: C2 SuperWord: cache VPointer (Sub-task - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18577/head:pull/18577
$ git checkout pull/18577

Update a local copy of the PR:
$ git checkout pull/18577
$ git pull https://git.openjdk.org/jdk.git pull/18577/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 18577

View PR using the GUI difftool:
$ git pr show -t 18577

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18577.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Apr 2, 2024

👋 Welcome back epeter! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Apr 2, 2024

@eme64 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8326962: C2 SuperWord: cache VPointer

Reviewed-by: chagedorn, kvn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 33 new commits pushed to the master branch:

  • 8020183: 8329470: Remove obsolete CDS SharedStrings tests
  • 8267d65: 8329564: [JVMCI] TranslatedException::debugPrintStackTrace does not work in the libjvmci compiler.
  • 16576b8: 8328957: Update PKCS11Test.java to not use hardcoded path
  • 375bfac: 8327474: Review use of java.io.tmpdir in jdk tests
  • 233619b: 8329557: Fix statement around MathContext.DECIMAL128 rounding
  • 023f7f1: 8320799: Bump minimum boot jdk to JDK 22
  • 8dc43aa: 8325217: MethodSymbol.getModifiers() returns SEALED for restricted methods
  • 1c69193: 8328383: Method is not used: com.sun.tools.javac.comp.Attr::thisSym
  • ee09801: 8328352: Serial: Inline SerialBlockOffsetSharedArray
  • bea493b: 8236736: Change notproduct JVM flags to develop flags
  • ... and 23 more: https://git.openjdk.org/jdk/compare/5cddc2de493d9d8712e4bee3aed4f1a0d4c228c3...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot changed the title from "8326962" to "8326962: C2 SuperWord: cache VPointer" Apr 2, 2024
@openjdk

openjdk bot commented Apr 2, 2024

@eme64 The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Apr 2, 2024
for (uint j = i+1; j < memops.size(); j++) {
MemNode* s2 = memops.at(j)->as_Mem();
if (isomorphic(s1, s2)) {
- VPointer p2(s2, _vloop);
+ const VPointer& p2 = get_pointer(s2);
Contributor Author

Note: a classic example of a quadratic loop, where we compare "all-to-all" memops, thus parse the pointer subgraph repeatedly.

}

uint bytes = number_of_pointers * sizeof(VPointer);
_pointers = (VPointer*)_arena->Amalloc(bytes);
Contributor Author

Note: I wish I could use GrowableArray here. But I have a StackObj that is NONCOPYABLE. I thus have to directly construct the VPointer into the array, and cannot construct it outside and pass it in. Someday, I hope that GrowableArray allows appending with the move-constructor, or something similar.

For now: I simply allocate my own memory, and use the placement-new to construct the VPointers directly into that memory.
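As a rough illustration of that technique, the sketch below constructs objects of a non-copyable type directly into raw memory with placement-new. It is not the code from the patch: `ToyPointer` and `ToyPointerArray` are hypothetical names, and `std::malloc` stands in for the arena allocation, so this version has to destroy and free explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical non-copyable type standing in for VPointer.
class ToyPointer {
public:
  explicit ToyPointer(int offset) : _offset(offset) {}
  ToyPointer(const ToyPointer&) = delete;
  ToyPointer& operator=(const ToyPointer&) = delete;
  int offset() const { return _offset; }
private:
  int _offset;
};

// Owns a raw buffer and constructs the elements directly into it.
// Assumption: std::malloc replaces the arena used in the PR.
class ToyPointerArray {
public:
  explicit ToyPointerArray(std::size_t n) : _size(n), _data(nullptr) {
    std::size_t bytes = n * sizeof(ToyPointer);
    _data = static_cast<ToyPointer*>(std::malloc(bytes == 0 ? 1 : bytes));
    if (_data == nullptr) { throw std::bad_alloc(); }
    for (std::size_t i = 0; i < n; i++) {
      // Placement-new: construct in place, so no copy or move is needed.
      ::new (static_cast<void*>(_data + i)) ToyPointer(static_cast<int>(i) * 4);
    }
  }
  ~ToyPointerArray() {
    // With malloc we must run destructors and free the buffer ourselves.
    for (std::size_t i = 0; i < _size; i++) { _data[i].~ToyPointer(); }
    std::free(_data);
  }
  ToyPointerArray(const ToyPointerArray&) = delete;
  ToyPointerArray& operator=(const ToyPointerArray&) = delete;

  const ToyPointer& at(std::size_t i) const { return _data[i]; }
  std::size_t size() const { return _size; }
private:
  std::size_t _size;
  ToyPointer* _data;
};
```

A container that requires copyable elements cannot hold `ToyPointer`, which is why the elements are built in place here and only handed out as `const ToyPointer&`.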

// For all memory nodes before it, check if we need to add a memory edge.
for (int k = slice_nodes.length() - 1; k > j; k--) {
MemNode* n2 = slice_nodes.at(k);

// Ignore Load-Load dependencies:
if (n1->is_Load() && n2->is_Load()) { continue; }

- VPointer p2(n2, _vloop);
+ const VPointer& p2 = _pointers.get(n2);
Contributor Author

Note: another quadratic loop where we repeatedly parse the pointers.

}
#endif
Contributor Author

Note: improve printing a bit for POINTERS tag of TraceAutoVectorization.

tty->print("[%d]", n->_idx);
}
}

Contributor Author

Note: moved it up so we can use it anywhere in vectorization.cpp.

@@ -678,15 +723,15 @@ class VPointer : public ArenaObj {
int invar_factor() const;

// Comparable?
- bool invar_equals(VPointer& q) {
+ bool invar_equals(const VPointer& q) const {
Contributor Author

Note: had to make some things const here, so that I can pass around const VPointer&, which I get from _pointers.get(n) / get_pointer(n).
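A small sketch of why the const qualifiers are needed, using a hypothetical `ToyPointer` rather than the real VPointer: a caller that only holds `const` references can call `invar_equals` only if both the parameter and the member function itself are const.

```cpp
#include <cassert>

// Hypothetical stand-in for VPointer with a single "invariant" field.
class ToyPointer {
public:
  explicit ToyPointer(int invar) : _invar(invar) {}

  // const parameter: accepts a const reference from a read-only cache.
  // const member function: callable on a const reference as well.
  bool invar_equals(const ToyPointer& q) const { return _invar == q._invar; }

private:
  int _invar;
};

// Hypothetical caller that only ever sees const references,
// as it would when reading from a cache.
bool same_invar(const ToyPointer& a, const ToyPointer& b) {
  return a.invar_equals(b);
}
```

With the old signature `bool invar_equals(ToyPointer& q)`, the call inside `same_invar` would not compile, because neither `a` nor `b` may be used as a non-const object.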

@eme64 eme64 marked this pull request as ready for review April 2, 2024 13:07
@openjdk openjdk bot added the rfr Pull request is ready for review label Apr 2, 2024
@mlbridge

mlbridge bot commented Apr 2, 2024

Webrevs

Contributor

@jdksjolen jdksjolen left a comment

Hi Emanuel,

I've some general questions regarding naming and Arena usage, I hope you don't mind some runtime team input.

// We compute and cache the VPointer for every load and store.
class VLoopPointers : public StackObj {
private:
Arena* _arena;
Contributor

Will the pointer ever change? Could potentially change this to a reference.

Contributor

Is it important for this to be Arena-allocated? Seems to me like compute_and_cache will only be computed once per VLoopPointers instance, right?

Contributor Author

We can discuss if Arena-allocated is the right thing to do. But for now it is what I did with all other submodules of VLoopAnalyzer, so if we were to change this, then I can do that in a separate RFE.

What alternative would you prefer, and why?

I like Arena-allocation, because I have a clear location and life-time for my allocations. I can close the arena after all AutoVectorization, and I know that the data is valid up to that point, and then it gets deallocated.

CHeap allocation would require me to be more smart and careful about deallocation.

Resource allocation is, in my experience, often problematic if you have different life-times for things. I like Resource-allocation only for temporary data structures, not data that is used across a large algorithm with dozens of subalgorithms.

Let me know what you think ;)

Contributor Author

@eme64 eme64 Apr 2, 2024

Will the pointer ever change? Could potentially change this to a reference.

I could make it a reference. But data structures like GrowableArray take an Arena*. So then I have to use * and & all the time. I don't like that, it makes the code much more "noisy".

Member

@chhagedorn chhagedorn left a comment

That's a nice improvement and it makes sense to just compute them once and re-use them. I only have a few comments but generally looks good!

const VLoop& _vloop;
const VLoopBody& _body;

// Array of cached pointers
Member

Maybe make a note that we allocate/cache them lazily upon request.

Contributor Author

It is not lazy, they are allocated and cached in compute_and_cache. Like all other VLoopAnalyzer submodules. Maybe I missed your point 😅

Member

I meant that it's not allocated in the constructor, as you initialize it with nullptr. It's only initialized once you call compute_and_cache(), which may not happen if we bail out earlier. That's what I meant by "lazy", but that was probably not clear enough :-)

Contributor Author

Aha, I see. I mean all other submodules are handled the same. They also cannot really be used until VLoopAnalyzer::setup_submodules returns with success. I guess this here is the first instance where the data structure itself is only allocated after the constructor. But I feel like if anybody has a question about where it is allocated, they can just search the reference. If I start putting down such detailed comments, then I need to put them everywhere. And that will clutter the code.

Member

That's true. Here I think I've only commented it since it's allocated specially for the first time in the sub modules. But it does not really add much information per se. It's fine to leave it like that.

Contributor Author

Ok, thanks for the suggestion anyway 😊
I will leave it without a comment then.

@eme64
Contributor Author

eme64 commented Apr 2, 2024

@jdksjolen @chhagedorn Thanks for your suggestions!
I think I addressed / commented on all your review comments.

Contributor

@vnkozlov vnkozlov left a comment

One question: will VLoopAnalyzer default destructor clean up all memory used?

@eme64
Contributor Author

eme64 commented Apr 3, 2024

One question: will VLoopAnalyzer default destructor clean up all memory used?

@vnkozlov there is no need, since it is all allocated over the Arena in VLoopAnalyzer:

// Arena for all submodules
Arena                _arena;

It is that arena that I pass into all submodules, such as VLoopVPointer. VLoopAnalyzer is stack allocated, so once the destructor removes its _arena, all submodules are also automatically deallocated.
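The ownership pattern described here can be sketched roughly as follows. This is a toy model and not HotSpot code: `ToyArena`, `ToySubmodule`, `ToyAnalyzer` and the `outstanding_chunks` counter are all hypothetical, and the arena simply remembers `std::malloc` chunks and frees them in its destructor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Chunks currently held by any ToyArena (only here to observe the cleanup).
static int outstanding_chunks = 0;

// Toy arena: hands out memory and frees all of it at once when destroyed.
class ToyArena {
public:
  ToyArena() {}
  ToyArena(const ToyArena&) = delete;
  ToyArena& operator=(const ToyArena&) = delete;
  ~ToyArena() {
    for (void* p : _chunks) { std::free(p); outstanding_chunks--; }
  }
  void* alloc(std::size_t bytes) {
    void* p = std::malloc(bytes == 0 ? 1 : bytes);
    if (p == nullptr) { throw std::bad_alloc(); }
    _chunks.push_back(p);
    outstanding_chunks++;
    return p;
  }
private:
  std::vector<void*> _chunks;
};

// Toy submodule: borrows the arena and never frees anything itself.
class ToySubmodule {
public:
  explicit ToySubmodule(ToyArena* arena)
    : _arena(arena), _offsets(nullptr), _length(0) {}

  // The data is allocated here, not in the constructor.
  void compute_and_cache(std::size_t n) {
    _offsets = static_cast<int*>(_arena->alloc(n * sizeof(int)));
    for (std::size_t i = 0; i < n; i++) { _offsets[i] = static_cast<int>(i) * 4; }
    _length = n;
  }
  bool is_computed() const { return _offsets != nullptr; }
  int at(std::size_t i) const { assert(i < _length); return _offsets[i]; }
private:
  ToyArena*   _arena;
  int*        _offsets;
  std::size_t _length;
};

// Toy analyzer: owns the arena by value and passes it to its submodule.
class ToyAnalyzer {
public:
  ToyAnalyzer() : _arena(), _pointers(&_arena) {}
  ToySubmodule& pointers() { return _pointers; }
private:
  ToyArena     _arena;     // declared first: constructed first, destroyed last
  ToySubmodule _pointers;
};
```

When a stack-allocated `ToyAnalyzer` goes out of scope, the implicit destructor destroys `_arena`, which releases everything the submodule allocated. No destructor has to be written for the analyzer or the submodule.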

Member

@chhagedorn chhagedorn left a comment

Thanks for making the suggested changes. Looks good to me!

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Apr 3, 2024
@vnkozlov
Contributor

vnkozlov commented Apr 3, 2024

It is that arena that I pass into all submodules, such as VLoopVPointer. VLoopAnalyzer is stack allocated, so once the destructor removes its _arena, all submodules are also automatically deallocated.

Good.

@eme64
Contributor Author

eme64 commented Apr 4, 2024

Thanks @vnkozlov @chhagedorn @jdksjolen for the reviews and suggestions!
@jdksjolen feel free to give me your ideas about Arena-allocation, I can still improve in a follow-up RFE ;)

/integrate

@openjdk

openjdk bot commented Apr 4, 2024

Going to push as commit f762637.
Since your change was applied there have been 35 commits pushed to the master branch:

  • 2931458: 8328938: C2 SuperWord: disable vectorization for large stride and scale
  • 4196688: 8329494: Serial: Merge GenMarkSweep into MarkSweep
  • 8020183: 8329470: Remove obsolete CDS SharedStrings tests
  • 8267d65: 8329564: [JVMCI] TranslatedException::debugPrintStackTrace does not work in the libjvmci compiler.
  • 16576b8: 8328957: Update PKCS11Test.java to not use hardcoded path
  • 375bfac: 8327474: Review use of java.io.tmpdir in jdk tests
  • 233619b: 8329557: Fix statement around MathContext.DECIMAL128 rounding
  • 023f7f1: 8320799: Bump minimum boot jdk to JDK 22
  • 8dc43aa: 8325217: MethodSymbol.getModifiers() returns SEALED for restricted methods
  • 1c69193: 8328383: Method is not used: com.sun.tools.javac.comp.Attr::thisSym
  • ... and 25 more: https://git.openjdk.org/jdk/compare/5cddc2de493d9d8712e4bee3aed4f1a0d4c228c3...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Apr 4, 2024
@openjdk openjdk bot closed this Apr 4, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Apr 4, 2024
@openjdk

openjdk bot commented Apr 4, 2024

@eme64 Pushed as commit f762637.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Labels
hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated

4 participants