New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8277617: Adjust AVX3Threshold for copy/fill stubs #6512
Conversation
|
Webrevs
|
This isn't my area but I'm a bit perplexed by the changes. AFAICS this patch does 2 things:
but I am at a loss to understand what Thanks, |
Plus some performance numbers would be useful. Thanks |
@dholmes-ora We see about 25% gain on a micro on our latest platform. There is no cpuid bit for this, so the closest was to check for the new serialize ISA supported on this platform. |
But what exactly is it that you are checking for? What is the connection between the ISA version and the decision to effectively zero out AVX3Threshold? |
@dholmes-ora The Intel platforms that supports this ISA has improved implementation of 64-byte load/stores. I could not find any other better way to check in the absence of cupid bit. |
@sviswa7 that further restriction and an explanatory comment would be appreciated. Thanks. |
@dholmes-ora I have implemented your review comments. |
It would be better to add a jmh test for this opt. |
Sorry @sviswa7 but could you explain in the comment why/how |
static int avx3_threshold() { return ((is_intel_family_core() && | ||
supports_serialize()) ? 0: AVX3Threshold); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @sviswa7 , Should we not return a zero threshold only if user does not explicitly set AVX3Threshold i.e. in default case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jatin-bhateja On these platforms it is beneficial to set the threshold to zero for copy and clear operations and hence the override. I have described that in the comment in detail as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @vitalyd, thanks for making a comment in an OpenJDK project!
All comments and discussions in the OpenJDK Community must be made available under the OpenJDK Terms of Use. If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please Use "Add GitHub user vitalyd for the summary.
If you are not an OpenJDK Author, Committer or Reviewer, simply check the box below to accept the OpenJDK Terms of Use for your comments.
- I agree to the OpenJDK Terms of Use for all comments I make in a project in the OpenJDK GitHub organization.
Your comment will be automatically restored once you have accepted the OpenJDK Terms of Use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sviswa7 I tend to agree with @jatin-bhateja . AVX3Threshold is a diagnostic flag so if someone has deliberately modified it so they can measure something, your change will make that impossible on newer systems. You may want to define a static field to store the actual value for avx3_threshold()
to return and initialize it during VM initialization. Or lazy initialize it on first use.
@DamonFool There are jmh tests for Arraycopy in test/micro/org/openjdk/bench/java/lang/Arraycopy.java. |
@dholmes-ora @jatin-bhateja I have added a check for FLAG_IS_DEFAULT before overriding the threshold. Let me know if this looks ok to you. |
Thanks @sviswa7 , changes looks good to me.
Best Regards
General change looks okay but I have a query below about startup overhead.
Also what testing has been done for this aside from the benchmarking? AFAICS there is only a single test that currently sets AVX3Threshold to zero so we have very little test coverage for that. With this change it will be zero all the time on some systems and so will now be exercising code paths that do not normally get executed.
Thanks,
David
int VM_Version::avx3_threshold() { | ||
if (FLAG_IS_DEFAULT(AVX3Threshold)) { | ||
return ((is_intel_family_core() && | ||
supports_serialize()) ? 0: AVX3Threshold); | ||
} else { | ||
return AVX3Threshold; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am somewhat concerned about the overhead of evaluating this each time it is used. I realize these will only be startup costs while generating the stubs, not part of the stubs themselves, but it still may be a startup impact. Can you run a startup benchmark to see if there is any problem?
I was also thinking the more direct formulation would just be:
return (is_intel_family_core() && supports_serialize() && FLAG_IS_DEFAULT(AVX3Threshold)) ? 0 : AVX3Threshold;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @sviswa7 agree with @dholmes-ora , instead of calling multiple times in a stub can we not call it only once per stub? Since stubs are assembled once and not relocated hence it should be okay to call this method only once for stubs which are going to benefit form this change.
/label hotspot-compiler |
@neliasso |
As I understand it - old AVX512 platforms will continue to work as before. The new case is that new platforms (that have avx_threshold set to 0) will use 64 byte instructions. But it would be nice with some benchmarks that verify that there are no regression on old avx512 hardware. |
I am happy with the change but would like to see some benchmarks that verify that there are no regressions - [before/after]*[avx2/old avx512/new avx512]. You have already posted some of them - please complete with the missing ones.
According to @sviswa7 's comments (no cupid bit for the latest ISA), @sviswa7 , can you further explain what's the difference of the 64-byte instructions between Intel's old and latest AVX512 platforms? |
I do not see such comments. From my previous questions on this it was indicated that any CPU that supports |
If so, CPUs that don't support |
Yes, the patch doesn't change behavior on AVX2 and older AVX512 systems. The additional performance numbers with the patch requested by Nils are as below: Old AVX512
After:
AVX2
After:
|
Thanks for your clarification. I will test the 64-byte instructions on older AVX512 systems today and feedback here. |
Mailing list message from David Holmes on hotspot-dev: On 1/12/2021 5:11 pm, Jie Fu wrote:
The old platforms, for which serialize() is not true, will just use David |
Here is the performance data on our older AVX512 platform which doesn't support Even without So it seems unfair only enable 64-byte instructions for the latest Intel AVX512 platforms. Still, I would like to know why we don't use 64-byte instructions on platforms without Results with 32-byte instructions.
Results with 64-byte instructions.
|
Mailing list message from David Holmes on hotspot-dev: On 1/12/2021 11:54 pm, Jie Fu wrote:
Yes, which is exactly why we have been saying this should not affect David |
Mailing list message from David Holmes on hotspot-dev: On 2/12/2021 12:32 pm, Jie Fu wrote:
Because, as previously stated, there is no actual way to identify those David |
OK, make sense! |
@sviswa7 This change now passes all automated pre-integration checks. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 1 new commit pushed to the
Please see this link for an up-to-date comparison between the source branch of this pull request and the
|
@dholmes-ora @DamonFool @jatin-bhateja @neliasso Thanks a lot for the review. If no further objections, I plan to integrate this PR tomorrow (Friday 12/3). |
Please check the GHA test results before integrating - there is a failing compiler test. |
@dholmes-ora I looked at the results. The failure in compiler/c2/irTests/TestUnsignedComparison.java for x86_32 is unrelated to this patch ( https://bugs.openjdk.java.net/browse/JDK-8277324) and was fixed recently. A merge with master should fix that. |
/integrate |
Going to push as commit 24e16ac.
Your commit was automatically rebased without conflicts. |
Currently 32-byte instructions are used for small array copy and clear.
This can be optimized by using 64-byte instructions.
Please review.
Best Regards,
Sandhya
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/6512/head:pull/6512
$ git checkout pull/6512
Update a local copy of the PR:
$ git checkout pull/6512
$ git pull https://git.openjdk.java.net/jdk pull/6512/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 6512
View PR using the GUI difftool:
$ git pr show -t 6512
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/6512.diff