8329331: Intrinsify Unsafe::setMemory #18555
Conversation
👋 Welcome back sgibbons! A progress list of the required criteria for merging this PR into
@asgibbons This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 1 new commit pushed to the
Please see this link for an up-to-date comparison between the source branch of this pull request and the As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@dholmes-ora, @dean-long, @sviswa7, @jatin-bhateja, @JornVernee, @vnkozlov) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type
@asgibbons The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.
Webrevs
This looks like it is still a draft/work-in-progress. There is only code for x64 and it doesn't appear it will build on other platforms. Also, there are still a bunch of if 0 blocks in the code that should not be there.
Wouldn't it be better to do this intrinsification directly in the JIT without calling out to a stub?
I believe the code size is too large for a direct JIT intrinsic. A lot of registers are also used, which may be an issue.
@dholmes-ora Sorry for the dead code left in. It is gone now. Plus, this was only requested for x86, thus no implementation for other platforms. |
I think the right approach is to turn it into a loop in the IR, which I think is what Doug was implying. That way C2 can do all its usual optimizations, like unrolling, vectorization, and redundant store elimination (if it is an on-heap primitive array that was just allocated, then there is no need to zero the parts that are being "set").
Only requested by whom? The JBS issue says nothing about that. I'm not even sure how this avoids the
As an experiment, couldn't you have the C2 intrinsic redirect to a Java helper that calls putByte() in a loop?
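Such a helper might look roughly like the sketch below. `ByteStore` is a hypothetical stand-in for Unsafe's `putByte(address, value)`, backed here by a plain byte array so the example is self-contained; the real helper would of course write through `Unsafe`:

```java
import java.util.Arrays;

public class SetMemoryHelper {
    // Hypothetical stand-in for Unsafe.putByte(address, value), backed by a
    // byte[] so this sketch is self-contained; the real helper would write
    // through Unsafe.
    interface ByteStore {
        void putByte(long offset, byte value);
    }

    // A Java fallback of the kind suggested above: fill the region one byte
    // at a time with putByte(); C2 could then unroll/vectorize this loop.
    static void setMemory(ByteStore store, long offset, long bytes, byte value) {
        for (long i = 0; i < bytes; i++) {
            store.putByte(offset + i, value);
        }
    }

    public static void main(String[] args) {
        byte[] buf = new byte[16];
        setMemory((off, v) -> buf[(int) off] = v, 4, 8, (byte) 0xAB);
        System.out.println(Arrays.toString(buf));
    }
}
```

The appeal of this shape is that C2 could apply its usual loop optimizations (unrolling, vectorization, redundant-store elimination) to the byte loop instead of relying on a hand-written stub.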
@vnkozlov I undid the name change and will submit a separate request for renaming. Thanks.
This looks good. I only have a question about long vs. short jumps in the stub's code.
__ andq(size, 0x7);

// If zero, then we're done
__ jccb(Assembler::zero, L_exit);
Code in generate_unsafe_setmemory() uses long jumps to L_exit but here you use short. Why?
Ah - the original code (3 iterations ago) was about 10 bytes too long for a short jump. It's short enough now. Changed.
  do_setmemory_atomic_loop(USM_SHORT, dest, size, wide_value, rScratch1,
                           L_exit, _masm);
}
__ jmp(L_exit);
Here there is a long jump to L_exit after the do_setmemory_atomic_loop() call. Should this also be a short jump?
Do we have additional code in the debug VM which increases the distance and requires a long jump? I don't see it. Usually it is something which calls __ STOP().
The old code required a long jump due to the size of do_setmemory_atomic_loop but has since been refactored. The jmp(Label) code will generate a short jump provided the label has been defined and is in range. Otherwise a long jump is generated.
Changed to jmpb.
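For reference on why the in-range check matters: an x86 short jump (the jmpb/jccb forms) is encoded with a signed 8-bit displacement (e.g. opcode 0xEB rel8, 2 bytes total), while the near form uses a 32-bit displacement (0xE9 rel32, 5 bytes), so the short form only reaches targets within -128..+127 bytes of the end of the jump instruction. A minimal sketch of that range check (hypothetical helper, not HotSpot code):

```java
public class ShortJump {
    // A rel8 displacement is a signed byte: a short jump reaches targets
    // within -128..+127 bytes of the end of the jump instruction itself.
    static boolean fitsShortJump(long jumpEnd, long target) {
        long disp = target - jumpEnd;
        return disp >= Byte.MIN_VALUE && disp <= Byte.MAX_VALUE;
    }

    public static void main(String[] args) {
        // +127 fits, +128 does not; -128 fits, -129 does not.
        System.out.println(fitsShortJump(100, 227)); // true
        System.out.println(fitsShortJump(100, 228)); // false
        System.out.println(fitsShortJump(200, 72));  // true
        System.out.println(fitsShortJump(200, 71));  // false
    }
}
```

This is why the assembler can only pick the short encoding for a backward jump (label already bound and in range); a forward jmpb is a promise by the programmer that the target will be close enough.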
src/hotspot/share/opto/runtime.cpp
Outdated
int argp = TypeFunc::Parms;
fields[argp++] = TypePtr::NOTNULL;      // dest
fields[argp++] = TypeX_X;               // size
LP64_ONLY(fields[argp++] = Type::HALF); // size
Nit: align the // comments.
Done
src/hotspot/share/utilities/copy.hpp
Outdated
@@ -1,5 +1,5 @@
  /*
- * Copyright (c) 2003, 2022, Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2003, 2024, Oracle and/or its affiliates. All rights reserved.
You forgot to undo year change in this file.
Yup. Done.
Good. I will submit our testing.
for (int i = 0; i < 8; i++) {
  switch (type) {
    case USM_SHORT:
      __ movw(Address(dest, (2 * i)), wide_value);
MOVW emits an extra operand-size override prefix byte compared to 32- and 64-bit stores. Is there any specific reason for keeping the same unroll factor for all the stores?
My understanding is that the spec requires the appropriate-sized write based on alignment and size. This is why there are no 128-bit or 256-bit store loops.
for (int i = 0; i < 8; i++) {
  switch (type) {
    case USM_SHORT:
      __ movw(Address(dest, (2 * i)), wide_value);
      break;
    case USM_DWORD:
      __ movl(Address(dest, (4 * i)), wide_value);
      break;
    case USM_QUADWORD:
      __ movq(Address(dest, (8 * i)), wide_value);
      break;
  }
}
I understand we want to be as accurate as possible in filling the tail in the event of a SIGBUS, but we are creating a wide value for 8 packed bytes anyway. If the destination segment is quadword-aligned, aligned quadword stores are implicitly atomic on x86 targets. What are your thoughts on using a vector-instruction-based loop?
I believe the spec is specific about the size of the store required given alignment and size. I want to honor that spec even though wider stores could be done in many cases.
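To make the "appropriate-sized store given alignment and size" constraint concrete, here is a hypothetical decomposition (an illustration, not the stub's actual code): at each step, emit the widest naturally aligned store that fits, since naturally aligned stores of up to 8 bytes are individually atomic on x86_64.

```java
import java.util.ArrayList;
import java.util.List;

public class StoreWidths {
    // Hypothetical decomposition (not the stub's actual code): at each step
    // emit the widest store (8, 4, 2, or 1 bytes) that is naturally aligned
    // at the current address and does not overrun the remaining size.
    static List<Integer> storeWidths(long addr, long size) {
        List<Integer> widths = new ArrayList<>();
        while (size > 0) {
            int w = 8;
            while (w > 1 && (addr % w != 0 || size < w)) {
                w >>= 1;
            }
            widths.add(w);
            addr += w;
            size -= w;
        }
        return widths;
    }

    public static void main(String[] args) {
        System.out.println(storeWidths(0, 16)); // [8, 8]
        System.out.println(storeWidths(1, 7));  // [1, 2, 4]
        System.out.println(storeWidths(2, 5));  // [2, 2, 1]
    }
}
```

Under this policy a misaligned or short tail is always written with element-sized stores, which is what rules out the 128/256-bit vector loops discussed above.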
The SIGBUS was due to improper scoping of the UnsafeCopyMemoryMark. The change is:
@vnkozlov Thanks for the feedback. Can you please start the testing again? I'd appreciate it.
Before I do testing, please sync with mainline.
Merge done.
My testing passed. Good.
/integrate Thank you all for the reviews.
@asgibbons
/sponsor
@jatin-bhateja @asgibbons Pushed as commit bd67ac6. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
This introduced a regression, see JDK-8331033.
This code makes an intrinsic stub for Unsafe::setMemory for x86_64. See this PR for discussion around this change. Overall, making this an intrinsic improves performance of Unsafe::setMemory by up to 4x for all buffer sizes. Tested with tier-1 (and full CI). I've added a table of the before and after numbers for the JMH benchmark I ran (MemorySegmentZeroUnsafe): setMemoryBM.txt
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18555/head:pull/18555
$ git checkout pull/18555
Update a local copy of the PR:
$ git checkout pull/18555
$ git pull https://git.openjdk.org/jdk.git pull/18555/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 18555
View PR using the GUI difftool:
$ git pr show -t 18555
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18555.diff
Webrev
Link to Webrev Comment