
Conversation


@asgibbons asgibbons commented Mar 29, 2024

This code makes an intrinsic stub for Unsafe::setMemory for x86_64. See this PR for discussion around this change.

Making this an intrinsic improves the performance of Unsafe::setMemory by up to 4x across all buffer sizes.

Tested with tier-1 (and full CI). I've added a table of the before-and-after numbers for the JMH benchmark I ran (MemorySegmentZeroUnsafe).

setMemoryBM.txt


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

  • JDK-8329331: Intrinsify Unsafe::setMemory (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18555/head:pull/18555
$ git checkout pull/18555

Update a local copy of the PR:
$ git checkout pull/18555
$ git pull https://git.openjdk.org/jdk.git pull/18555/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 18555

View PR using the GUI difftool:
$ git pr show -t 18555

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18555.diff

Webrev

Link to Webrev Comment


bridgekeeper bot commented Mar 29, 2024

👋 Welcome back sgibbons! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.


openjdk bot commented Mar 29, 2024

@asgibbons This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8329331: Intrinsify Unsafe::setMemory

Reviewed-by: sviswanathan, jbhateja, kvn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 1 new commit pushed to the master branch:

  • 185e711: 8318650: Optimized subword gather for x86 targets.

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@dholmes-ora, @dean-long, @sviswa7, @jatin-bhateja, @JornVernee, @vnkozlov) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the rfr Pull request is ready for review label Mar 29, 2024

openjdk bot commented Mar 29, 2024

@asgibbons The following labels will be automatically applied to this pull request:

  • core-libs
  • graal
  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added graal graal-dev@openjdk.org hotspot hotspot-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels Mar 29, 2024

mlbridge bot commented Mar 29, 2024


@dholmes-ora dholmes-ora left a comment


This looks like it is still a draft/work-in-progress. There is only code for x64 and it doesn't appear it will build on other platforms. Also, there are still a bunch of if 0 blocks in the code that should not be there.


dougxc commented Apr 2, 2024

Wouldn't it be better to do this intrinsification directly in the JIT without calling out to a stub?

@asgibbons
Contributor Author

Wouldn't it be better to do this intrinsification directly in the JIT without calling out to a stub?

I believe the code size is too large for a direct JIT intrinsic. A lot of registers are also used, which may be an issue.

@asgibbons
Contributor Author

@dholmes-ora Sorry for the dead code left in. It is gone now. Plus, this was only requested for x86, thus no implementation for other platforms.


@dean-long dean-long left a comment


I think the right approach is to turn it into a loop in the IR, which I think is what Doug was implying. That way C2 can do all its usual optimizations, like unrolling, vectorization, and redundant store elimination (if it is an on-heap primitive array that was just allocated, then there is no need to zero the parts that are being "set").

@dholmes-ora
Member

@dholmes-ora Sorry for the dead code left in. It is gone now. Plus, this was only requested for x86, thus no implementation for other platforms.

Only requested by whom? The JBS issue says nothing about that.

I'm not even sure how this avoids the CheckIntrinsics check for missing intrinsics... I guess it must only look for some kind of declaration, not an actual implementation.

@dean-long
Member

As an experiment, couldn't you have the C2 intrinsic redirect to a Java helper that calls putByte() in a loop?

@asgibbons
Contributor Author

@vnkozlov I un-did the name change and will submit a separate request for re-naming. Thanks.


@vnkozlov vnkozlov left a comment


This looks good. I only have a question about long vs. short jumps in the stub's code.

__ andq(size, 0x7);

// If zero, then we're done
__ jccb(Assembler::zero, L_exit);
Contributor


Code in generate_unsafe_setmemory() uses long jumps to L_exit, but here you use a short one. Why?

Contributor Author


Ah - the original code (3 iterations ago) was about 10 bytes too long for a short jump. It's short enough now. Changed.

do_setmemory_atomic_loop(USM_SHORT, dest, size, wide_value, rScratch1,
L_exit, _masm);
}
__ jmp(L_exit);
Contributor


Here there is a long jump to L_exit after the do_setmemory_atomic_loop() call. Should this also be a short jump?

Contributor


Do we have additional code in the debug VM which increases the distance and requires a long jump? I don't see it. Usually it is something which calls __ STOP().

Contributor Author


The old code required a long jump due to the size of do_setmemory_atomic_loop but has since been refactored. The jmp(Label) code will generate a short jump provided the label has been defined and is in range. Otherwise a long jump is generated.

Changed to jmpb

int argp = TypeFunc::Parms;
fields[argp++] = TypePtr::NOTNULL; // dest
fields[argp++] = TypeX_X; // size
LP64_ONLY(fields[argp++] = Type::HALF); // size
Contributor


Nit: align the // comments

Contributor Author


Done

@@ -1,5 +1,5 @@
/*
* Copyright (c) 2003, 2022, Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2003, 2024, Oracle and/or its affiliates. All rights reserved.

@vnkozlov vnkozlov Apr 19, 2024


You forgot to undo the year change in this file.

Contributor Author


Yup. Done.

@vnkozlov
Contributor

Good. I will submit our testing.

@vnkozlov
Contributor

The runtime/Unsafe/InternalErrorTest.java test hits a SIGBUS when run with -Xcomp (and the other flags in the test's @run command):

#  SIGBUS (0xa) at pc=0x0000000119514760, pid=63021, tid=28163
#
# JRE version: Java(TM) SE Runtime Environment (23.0) (fastdebug build 23-internal-2024-04-19-2326152.vladimir.kozlov.jdkgit2)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 23-internal-2024-04-19-2326152.vladimir.kozlov.jdkgit2, compiled mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, bsd-amd64)
# Problematic frame:
# v  ~StubRoutines::jbyte_fill 0x0000000119514760

for (int i = 0; i < 8; i++) {
switch (type) {
case USM_SHORT:
__ movw(Address(dest, (2 * i)), wide_value);

@jatin-bhateja jatin-bhateja Apr 20, 2024


MOVW emits an extra operand-size override prefix byte compared to 32- and 64-bit stores. Any specific reason for keeping the same unroll factor for all the stores?

Contributor Author


My understanding is that the spec requires the appropriate-sized write based on alignment and size. This is why there are no 128-bit or 256-bit store loops.

Comment on lines +2527 to +2539
for (int i = 0; i < 8; i++) {
switch (type) {
case USM_SHORT:
__ movw(Address(dest, (2 * i)), wide_value);
break;
case USM_DWORD:
__ movl(Address(dest, (4 * i)), wide_value);
break;
case USM_QUADWORD:
__ movq(Address(dest, (8 * i)), wide_value);
break;
}
}

@jatin-bhateja jatin-bhateja Apr 20, 2024


I understand we want to be as accurate as possible in filling the tail in the event of a SIGBUS, but we are in any case creating a wide value for 8 packed bytes when the destination segment is quadword-aligned, and aligned quadword stores are implicitly atomic on x86 targets. What are your thoughts on using a vector-instruction-based loop?

Contributor Author


I believe the spec is specific about the size of the store required given alignment and size. I want to honor that spec even though wider stores could be done in many cases.

@asgibbons
Contributor Author

asgibbons commented Apr 20, 2024

The SIGBUS was due to improper scoping of the UnsafeCopyMemoryMark. The change is:

{
// Add set memory mark to protect against unsafe accesses faulting
- UnsafeCopyMemoryMark(this, ((t == T_BYTE) && !aligned), true);
+ UnsafeCopyMemoryMark usmm(this, ((t == T_BYTE) && !aligned), true);
__ generate_fill(t, aligned, to, value, r11, rax, xmm0);
}

@asgibbons
Contributor Author

@vnkozlov Thanks for the feedback. Can you please start the testing again? I'd appreciate it.

@vnkozlov
Contributor

Before I do testing, please sync with mainline.

@asgibbons
Contributor Author

Merge done.


@vnkozlov vnkozlov left a comment


My testing passed. Good.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Apr 21, 2024
@asgibbons
Contributor Author

/integrate

Thank you all for the reviews.

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Apr 21, 2024

openjdk bot commented Apr 21, 2024

@asgibbons
Your change (at version 1122b50) is now ready to be sponsored by a Committer.

@jatin-bhateja
Member

/sponsor


openjdk bot commented Apr 21, 2024

Going to push as commit bd67ac6.
Since your change was applied there has been 1 commit pushed to the master branch:

  • 185e711: 8318650: Optimized subword gather for x86 targets.

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Apr 21, 2024
@openjdk openjdk bot closed this Apr 21, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Apr 21, 2024

openjdk bot commented Apr 21, 2024

@jatin-bhateja @asgibbons Pushed as commit bd67ac6.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@asgibbons asgibbons deleted the setMemory branch April 22, 2024 13:10
@TobiHartmann
Member

This introduced a regression, see JDK-8331033.

