8322770: Implement C2 VectorizedHashCode on AArch64 #18487
Conversation
Hi @mikabl-arm, welcome to this OpenJDK project and thanks for contributing! We do not recognize you as a Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing /covered in a comment. If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user mikabl-arm" as summary for the issue. If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing /covered in a comment.

@mikabl-arm This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: 8322770: Implement C2 VectorizedHashCode on AArch64. You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 165 new commits pushed to the target branch. As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@theRealAph, @adinn) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment.

@mikabl-arm The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

/covered

Thank you! Please allow for a few business days to verify that your employer has signed the OCA. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated!
Just a trivial note: this change also improves the calculation of String.hashCode(). For instance, on Neoverse V1, SVE can give a notable additional speedup only for very long strings (>1k characters).
```c++
const size_t loop_factor = eltype == T_BOOLEAN || eltype == T_BYTE ? 8
                         : eltype == T_CHAR || eltype == T_SHORT || eltype == T_INT ? 4
                         : 0;
guarantee(loop_factor, "unsopported eltype");
```
typo: unsupported
```c++
 *
 * Pseudocode:
 *
 * cnt -= unroll_facotor + 1 - loop_factor;
```
typo: factor
Webrevs
```c++
default: ShouldNotReachHere();
}
```

```c++
/**
```
// comments here.
Why are you adding across lanes every time around the loop? You could maintain all of the lanes and then merge the lanes in the tail.
@theRealAph, thank you for the suggestion. That's because the current result (hash sum) has to be multiplied by 31^4 between iterations, where 4 is the number of elements handled per iteration. It's possible to multiply all lanes of the vector instead; the current fold between iterations is:

```c++
addv(vmultiplication, Assembler::T4S, vmultiplication);
umov(addend, vmultiplication, Assembler::S, 0); // Sign-extension isn't necessary
maddw(result, result, pow4, addend);
```

I can re-check and post the performance numbers here on request.
Please do. Please also post the code.
@theRealAph, you may find the performance numbers and the code in mikabl-arm@f844b11
OK, thanks. I think I see the problem. Unfortunately I've come to the end of my working day, but I'll try to get back to you as soon as possible next week.
In addition, doing only one vector per iteration is very wasteful. A high-performance AArch64 implementation can issue four multiply-accumulate vector instructions per cycle, with a 3-clock latency. By only issuing a single multiply-accumulate per iteration you're leaving a lot of performance on the table. I'd try to make the bulk width 16, and measure from there.
You only need one load, add, and multiply per iteration. This is an example of how to do it. The full thing is at https://gist.github.com/theRealAph/cbc85299d6cd24101d46a06c12a97ce6.
@theRealAph, looks reasonable, thank you for providing the listing! I'll get back on this once I have updated performance numbers.
@theRealAph, hmph, could you elaborate on what spec you refer to here?
That's not so much a spec, more Dougall's measured Apple M1 performance: https://dougallj.github.io/applecpu/measurements/firestorm/UMLAL_v_4S.html.
@theRealAph, I've tried to follow the suggested approach; please find the patch and results in mikabl-arm@e352f30 . So far I haven't been able to see any performance benefit compared to the implementation from this PR.
Yeah, true. I can see why that's happening from prof perfnorm: this is 1.4 insns/clk on a machine that can run 8 insns/clk. Because we're doing one load, then the MAC, then another load after the MAC, then a MAC that depends on the load, we stall the whole core waiting for the next load. Everything is serialized. Neoverse looks the same as Apple M1 here.

I guess the real question here is what we want. x86's engineers get this: (And that's on my desktop box from 2018, an inferior piece of hardware.) This is how they do it:

So, do we want to try to beat them on Arm, or not? They surely want to beat Arm.
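The wide, dependency-free variant being discussed can be modeled in plain Java (names are hypothetical; L stands in for the total vector width). Each of the L lane accumulators does exactly one multiply-accumulate per iteration with step 31^L, with no cross-lane dependency to serialize on, and the powers of 31 are applied only once in the tail merge. Since int arithmetic wraps identically in both forms, the result matches the serial loop exactly:

```java
// Sketch of the lane-parallel decomposition of the 31-polynomial hash.
public class LaneParallelHash {
    // Scalar reference, identical to String.hashCode() with a zero seed
    static int scalarHash(int[] a) {
        int h = 0;
        for (int v : a) {
            h = 31 * h + v;
        }
        return h;
    }

    static int pow31(int e) {
        int p = 1;
        for (int i = 0; i < e; i++) {
            p *= 31;
        }
        return p;
    }

    // L independent accumulators, one MAC per lane per iteration,
    // merged only once in the tail (assumes a.length % L == 0).
    static int laneParallelHash(int[] a, int L) {
        int[] lanes = new int[L];
        int step = pow31(L); // each lane advances by 31^L per iteration
        for (int i = 0; i < a.length; i += L) {
            for (int j = 0; j < L; j++) {
                lanes[j] = lanes[j] * step + a[i + j]; // no cross-lane dependency
            }
        }
        // tail merge: h = sum of lanes[j] * 31^(L-1-j)
        int h = 0;
        for (int j = 0; j < L; j++) {
            h += lanes[j] * pow31(L - 1 - j);
        }
        return h;
    }
}
```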
Hi, is this one stuck? What you have today is definitely an improvement, even though it's not as good as what we have for x86. I guess we could commit this and leave widening the arithmetic for a later enhancement if you have no time to work on it.
adinn left a comment
Thanks for the final cleanups. All looking very nice!
Co-authored-by: Andrew Haley <aph-open@littlepinkcloud.com>
```c++
} else {
  __ uxtl(vhalf0, Assembler::T4S, vdata0, Assembler::T4H);
}
__ addv(vmul0, Assembler::T4S, vmul0, vhalf0);
```
I was advised to use a single SADDW/UADDW instruction instead of the current pair of SXTL/UXTL followed by ADD. It seems this was likely overlooked because the Assembler class is missing the corresponding instructions. I am adding these instructions and updating the implementation accordingly.
```c++
if (is_signed_subword_type(eltype)) {
  __ saddwv2(vmul0, vmul0, Assembler::T4S, vdata0, Assembler::T8H);
} else {
  __ uaddwv2(vmul0, vmul0, Assembler::T4S, vdata0, Assembler::T8H);
}
```
Maybe define addwv2 and addwv in MacroAssembler.
```diff
- if (is_signed_subword_type(eltype)) {
-   __ saddwv2(vmul0, vmul0, Assembler::T4S, vdata0, Assembler::T8H);
- } else {
-   __ uaddwv2(vmul0, vmul0, Assembler::T4S, vdata0, Assembler::T8H);
- }
+ __ addwv2(is_signed_subword_type(eltype), vmul0, vmul0, Assembler::T4S, vdata0, Assembler::T8H);
```
I believe that an interface should be explicit and map 1:1 to real instructions when it comes to assembly, wherever possible.
Anyway, regardless of my preferences, as far as I can see, Assembler currently provides all other signed/unsigned versions of arithmetic instructions separately. Adding a single combined method like this would make the whole API inconsistent. Therefore, I suggest leaving it as is.
I have no problem at all with what class Assembler provides. However, when the result looks like this, even a "normal" assembler programmer would suggest macros rather than copy-and-paste:
```c++
assert(is_subword_type(eltype), "subword type expected");
if (is_signed_subword_type(eltype)) {
  __ saddwv(vmul3, vmul3, Assembler::T4S, vdata3, Assembler::T4H);
  __ saddwv(vmul2, vmul2, Assembler::T4S, vdata2, Assembler::T4H);
  __ saddwv(vmul1, vmul1, Assembler::T4S, vdata1, Assembler::T4H);
  __ saddwv(vmul0, vmul0, Assembler::T4S, vdata0, Assembler::T4H);
} else {
  __ uaddwv(vmul3, vmul3, Assembler::T4S, vdata3, Assembler::T4H);
  __ uaddwv(vmul2, vmul2, Assembler::T4S, vdata2, Assembler::T4H);
  __ uaddwv(vmul1, vmul1, Assembler::T4S, vdata1, Assembler::T4H);
  __ uaddwv(vmul0, vmul0, Assembler::T4S, vdata0, Assembler::T4H);
}
```
Would it be possible to integrate this as is? The code was approved twice before, even though it had the same constructs.
OK. It's a fairly minor style thing, something to remember for the future.
Co-authored-by: Andrew Haley <aph-open@littlepinkcloud.com>
/integrate

@mikabl-arm

/sponsor

Going to push as commit 475b894.
Your commit was automatically rebased without conflicts.

@theRealAph @mikabl-arm Pushed as commit 475b894. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.


Hello,
Please review the following PR for JDK-8322770 Implement C2 VectorizedHashCode on AArch64. It follows previous work done in #16629 and #10847 for RISC-V and x86 respectively.
The code to calculate a hash code consists of two parts: a vectorized loop of Neon instructions that processes 4 or 8 elements per iteration, depending on the data type, and a fully unrolled scalar "loop" that processes up to 7 tail elements.
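A rough Java model of that two-part shape (hypothetical names, not the generated code): the main loop consumes vf elements per iteration by folding their 31-polynomial into the accumulator, and the scalar tail finishes the remaining up-to-(vf-1) elements:

```java
// Sketch of the main-loop + scalar-tail structure of the intrinsic.
public class TailSplitHash {
    // Scalar reference, identical to String.hashCode() with a zero seed
    static int scalarHash(int[] a) {
        int h = 0;
        for (int v : a) {
            h = 31 * h + v;
        }
        return h;
    }

    static int vectorizedHashCode(int[] a) {
        final int vf = 4;                  // model: 4 int elements per iteration
        final int pow = 31 * 31 * 31 * 31; // 31^vf
        int h = 0;
        int i = 0;
        // vectorized main loop: vf elements per iteration
        for (; i + vf <= a.length; i += vf) {
            h = h * pow
              + a[i]     * (31 * 31 * 31)
              + a[i + 1] * (31 * 31)
              + a[i + 2] * 31
              + a[i + 3];
        }
        // scalar tail: up to vf - 1 leftover elements
        for (; i < a.length; i++) {
            h = 31 * h + a[i];
        }
        return h;
    }
}
```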
At the time of writing I don't see potential benefits from providing an SVE/SVE2 implementation, but one could be added as a follow-up, or independently later if required.
Performance
Neoverse N1
Neoverse N2, Neoverse V1
Performance metrics have been collected for these cores as well. They are similar to the results above and can be posted upon request.
Test
Full jtreg passed on AArch64 and x86.
Using git

Checkout this PR locally:

```
$ git fetch https://git.openjdk.org/jdk.git pull/18487/head:pull/18487
$ git checkout pull/18487
```

Update a local copy of the PR:

```
$ git checkout pull/18487
$ git pull https://git.openjdk.org/jdk.git pull/18487/head
```

Using Skara CLI tools

Checkout this PR locally:

```
$ git pr checkout 18487
```

View PR using the GUI difftool:

```
$ git pr show -t 18487
```

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18487.diff