-
Notifications
You must be signed in to change notification settings - Fork 6.2k
8261553: Efficient mask generation using BMI2 BZHI instruction #2522
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into |
|
@jatin-bhateja The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
|
/label hotspot-compiler-dev |
|
@jatin-bhateja |
Webrevs
|
cl4es
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Rather than removing the old code, I believe the code calling bzhiq needs to be in a branch checking
VM_Version::supports_bmi2. Otherwise you'll hit asserts on older hardware without this extension - Some demonstration of the performance benefit would be nice - either a new microbenchmark or a statistically significant result running some existing ones, e.g.
make test TEST=micro:ArrayCopy
|
Hi Claes, Here is the JMH performance data over CLX for arraycopy benchmarks: Regards, |
Hi Claes, added missing safely check for BMI2, its in general rare that a target supporting AVX-512 does not support BMI2
|
Thanks! Eyeballing the results it looks like a mixed bag. There even seems to be a few regressions such as this: |
Hi Claes, This could be a run to run variation, in general we are now having fewer number of instructions (one shift operation saved per mask computation) compared to previous masked generation sequence and thus it will always offer better execution latencies. |
Run-to-run variation would be easy to rule out by running more forks and more iterations to attain statistically significant results. While the instruction manuals suggest latency should be better for this instruction on all CPUs where it's supported, it would be good if there was some clear proof - such as a significant benchmark win - to motivate the added complexity. |
JMH perf data for ArrayCopyUnalignedSrc.testLong with copy length of 1200 shows degradation in LID accesses, it seems the benchmask got displaced from its sweet spot. But, there is a significant reduction in instruction count and cycles are almost comparable. We are saving one shift per mask computation. |
|
Further analysis of perf degradation revealed that with new optimized instruction pattern, code alignment got disturbed. This led to increase in LSD misses, also it reduced the UOPs cashing in DSB. While computing the mask for partial in-lining of small copy calls ( currently enabled for sub-word types with copy length less than 32/64 bytes), new optimized sequence should always offer lower instruction count and latency path. Following are the links to updated JMH perf data: In general gains are not significant in case of copy stubs, but new sequence offers a optimal latency path for mask computation sequence. |
|
Thanks for getting to the bottom of that regression. |
neliasso
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
|
@jatin-bhateja This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 126 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
Thanks, since there is not a significant impact on performance, but having an optimum instruction sequence will still reduce the complexity. Should it be ok to check this in ? We have one reviewer consent. |
|
/integrate |
|
@jatin-bhateja Since your change was applied there have been 129 commits pushed to the
Your commit was automatically rebased without conflicts. Pushed as commit cb84539. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
BMI2 BHZI instruction can be used to optimize the instruction sequence
used for mask generation at various place in array copy stubs and partial in-lining for small copy operations.
Progress
Issue
Reviewers
Download
$ git fetch https://git.openjdk.java.net/jdk pull/2522/head:pull/2522$ git checkout pull/2522