8266332: Adler32 intrinsic for x86 64-bit platforms #3806
@xbzhang99 The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
/check |
@xbzhang99 Unknown command |
Currently, only 64-bit Linux is supported |
Webrevs
|
@@ -3244,6 +3244,18 @@ void MacroAssembler::vpmullw(XMMRegister dst, XMMRegister nds, Address src, int
Assembler::vpmullw(dst, nds, src, vector_len);
}

void MacroAssembler::vpmulld(XMMRegister dst, XMMRegister nds, AddressLiteral src, int vector_len) {
// Used in sign-bit flipping with aligned address.
bool aligned_adr = (((intptr_t)src.target() & 15) == 0);
This is an AVX instruction, so alignment is not required. The assert only needs to check UseAVX > 0.
__ align(CodeEntryAlignment);
StubCodeMark mark(this, "StubRoutines", "updateBytesAdler32");

address start = __ pc();
The algorithm part can go into macroAssembler_x86_adler.cpp with Intel copyright (see macroAssembler_x86_sha.cpp).
macroAssembler_x86_adler.cpp added
__ cmpl(s, size);
__ cmovl(Assembler::above, s, size); // s = min(size, LIMIT)
__ lea(end, Address(s, data, Address::times_1, -CHUNKSIZE_M1));
__ cmpq(data, end);
This should be cmpptr here.

// reduce
__ vpslld(yb, yb, 3, Assembler::AVX_256bit); //b is scaled by 8
__ vpmulld(ysa, ya, ExternalAddress((address) StubRoutines::x86::_adler32_ascale_table), Assembler::AVX_256bit); //need scratch register??
All the instructions that take an ExternalAddress can modify rscratch1, which is r10. It is good to pass an explicit scratch register to these as the last argument.
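Some background on why a scratch register can be needed at all: on x86-64 a memory operand cannot carry a full 64-bit absolute address, so a literal address is either reached through a signed 32-bit RIP-relative displacement or first loaded into a register. The check below is a minimal sketch of that reachability test; fits_rip_relative is a hypothetical helper, not a HotSpot function.

```cpp
#include <cstdint>

// Hypothetical helper: true when `target` can be addressed RIP-relative,
// i.e. its signed distance from the end of the instruction fits in 32 bits.
// When it does not fit, the address has to be materialized in a scratch
// register before the vector instruction can use it as a memory operand.
bool fits_rip_relative(uint64_t next_insn_addr, uint64_t target) {
    int64_t disp = static_cast<int64_t>(target - next_insn_addr);
    return disp >= INT32_MIN && disp <= INT32_MAX;
}
```

Passing the scratch register explicitly makes the possible clobber visible at the call site instead of leaving it implicit in the wrapper.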
….cpp; added a scratch reg to vpmulld, and some other issues
@@ -3244,6 +3244,17 @@ void MacroAssembler::vpmullw(XMMRegister dst, XMMRegister nds, Address src, int
Assembler::vpmullw(dst, nds, src, vector_len);
}

void MacroAssembler::vpmulld(XMMRegister dst, XMMRegister nds, AddressLiteral src, int vector_len, Register scratch_reg) {
// Used in sign-bit flipping with aligned address.
You could remove the spurious comment here.
removed
BLOCK_COMMENT("Entry:");
__ enter(); // required for proper stackwalking of RuntimeStub frame

__ vmovdqu(yshuf0, ExternalAddress((address) StubRoutines::x86::_adler32_shuf0_table));
For vmovdqu, too, it is good to pass the scratch register explicitly.
added scratch register to vmovdqu
#include "runtime/stubRoutines.hpp"
#include "macroAssembler_x86.hpp"
The updateBytesAdler32 should be under #ifdef _LP64, #endif.
#ifdef _LP64 ... #endif added
@@ -898,6 +898,16 @@ void VM_Version::get_processor_features() {
FLAG_SET_DEFAULT(UseCRC32Intrinsics, false);
}

if (supports_avx2() && UseAdler32Intrinsics) {
This should be under #ifdef _LP64.
For 32-bit, UseAdler32Intrinsics should be set to false.
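The gating asked for here could be modelled in plain C++ roughly as below. This is a sketch only: Adler32Decision and decide_adler32 are hypothetical names, and the real code would use the _LP64 preprocessor guard and HotSpot's flag macros rather than boolean parameters.

```cpp
// Hypothetical model of the flag decision; not HotSpot code.
struct Adler32Decision {
    bool enabled;   // final value of UseAdler32Intrinsics
    bool warn;      // whether to tell the user the request was overridden
};

Adler32Decision decide_adler32(bool is_64bit, bool has_avx2,
                               bool flag_value, bool flag_is_default) {
    if (is_64bit && has_avx2) {
        // Supported: enable by default, but honor an explicit user setting.
        return { flag_is_default ? true : flag_value, false };
    }
    // 32-bit build or no AVX2: force the flag off, and warn only when the
    // user explicitly asked for the intrinsic.
    return { false, flag_value && !flag_is_default };
}
```

With this shape, a 32-bit build never enables the intrinsic regardless of CPU features.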
@@ -0,0 +1,209 @@
/*
* Copyright (c) 2016, Intel Corporation.
Copyright year should be 2021.
/contributor add @xbzhang99 |
@xbzhang99 Could not parse
|
@xbzhang99 |
/contributor remove Lauren Greg B Tucker greg.b.tucker@intel.com |
@xbzhang99 Could not parse
|
@xbzhang99 Could not parse
|
/contributor add @gbtucker |
@xbzhang99 Could not parse
|
/contributor add xbzhang99 |
@xbzhang99 Could not parse
|
/contributor remove Lauren Greg B Tucker greg.b.tucker@intel.com |
@xbzhang99 |
/contributor add Xubo Zhang xubo.zhang@intel.com |
@xbzhang99 |
@vnkozlov I implemented your review comments. Could you please take a look?
@theRealAph The microbenchmark has been available for a long time: https://cr.openjdk.java.net/~pli/rfr/8216259/TestAdler32.java
I have 2 comments.
UseAdler32Intrinsics = true;
}
} else if (UseAdler32Intrinsics) {
if (!FLAG_IS_DEFAULT(UseAdler32Intrinsics))
Add {}.
void vpmulld(XMMRegister dst, XMMRegister nds, Address src, int vector_len) {
Assembler::vpmulld(dst, nds, src, vector_len);
};
void vpmulld(XMMRegister dst, XMMRegister nds, XMMRegister src, int vector_len) {
Assembler::vpmulld(dst, nds, src, vector_len);
}
void vpmulld(XMMRegister dst, XMMRegister nds, AddressLiteral src, int vector_len, Register scratch_reg = rscratch1);
It looks like my comment was lost.
I see that only the last version of the method is used in the stub. Why do you need the two additional wrapper methods?
Also, the code always passes scratch_reg, so you don't need to set a default value.
I think the first two were introduced by other patches.
I will remove the scratch_reg default value.
Sorry, I added the first two.
vpmulld is overloaded in the base Assembler class. If I declare one overload in the MacroAssembler class, the C++ compiler no longer finds the other overloads from the base class; they become hidden.
So I need to redeclare those in MacroAssembler as well; otherwise I get the following error:
./src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp: In member function 'void C2_MacroAssembler::reduce_operation_256(BasicType, int, XMMRegister, XMMRegister, XMMRegister)':
./src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp:1573:64: error: no matching function for call to 'C2_MacroAssembler::vpmulld(XMMRegisterImpl*&, XMMRegisterImpl*&, XMMRegisterImpl*&, int&)'
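The behaviour described above is standard C++ name hiding: a member function declared in a derived class hides every base-class overload of the same name. The sketch below reproduces it with hypothetical stand-in classes (BaseAsm, ForwardingAsm, UsingAsm) and simplified parameter types; it shows the forwarding wrappers used in this PR next to a using-declaration, which is another standard way to make the base overloads visible again.

```cpp
#include <string>

// Hypothetical stand-in for the base Assembler with two overloads.
struct BaseAsm {
    std::string mul(int)    { return "base(int)"; }
    std::string mul(double) { return "base(double)"; }
};

// Declaring any mul() here hides all BaseAsm::mul overloads for calls made
// on a HidingAsm object: HidingAsm{}.mul(1) would fail to compile with
// "no matching function for call".
struct HidingAsm : BaseAsm {
    std::string mul(const char*) { return "derived(const char*)"; }
};

// Option 1: forward each base overload explicitly, as the PR does.
struct ForwardingAsm : BaseAsm {
    std::string mul(int x)       { return BaseAsm::mul(x); }
    std::string mul(double x)    { return BaseAsm::mul(x); }
    std::string mul(const char*) { return "derived(const char*)"; }
};

// Option 2: a using-declaration re-exposes every base overload at once.
struct UsingAsm : BaseAsm {
    using BaseAsm::mul;
    std::string mul(const char*) { return "derived(const char*)"; }
};
```

Either option lets callers such as the one in the error message resolve the base-class overloads again; which one fits HotSpot's conventions is a call for the reviewers.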
I’ve copied @xbzhang99's modified test case to http://cr.openjdk.java.net/~pli/rfr/8216259/TestAdler32.java. The original one, without the copyright header, is backed up at http://cr.openjdk.java.net/~pli/rfr/8216259/TestAdler32.java.old. Please let me know if I should do anything else.
@@ -7854,6 +7854,18 @@ void Assembler::vbroadcastsd(XMMRegister dst, Address src, int vector_len) {
emit_operand(dst, src);
}

void Assembler::vbroadcastf128(XMMRegister dst, Address src, int vector_len) {
assert(VM_Version::supports_avx(), "");
assert(vector_len == AVX_256bit, "");
Looks like "vector_len" can only be AVX_256bit. Do we really need a parameter then?
You are right, for now it can only be AVX_256bit. But I think other lengths will be used in the future, so we should have a more generic signature.
Good.
@xbzhang99 This change now passes all automated pre-integration checks. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 273 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@sviswa7, @jatin-bhateja, @vnkozlov, @neliasso) but any other Committer may sponsor as well.
|
Looks good.
/integrate |
@xbzhang99 |
/test tier1 |
@vnkozlov Looks like automated tests are not run. Could you please help? |
I started our internal testing. |
tier1-3 testing is clean. It ran compiler/intrinsics/zip/TestAdler32.java test. |
/sponsor |
@vnkozlov @xbzhang99 Since your change was applied there have been 280 commits pushed to the
Your commit was automatically rebased without conflicts. Pushed as commit 8e3549f. |
/backport jdk11u-dev |
@xbzhang99 Unknown command |
Implement Adler32 intrinsic for x86 64-bit platform using vector instructions.
The benchmark test/micro/org/openjdk/bench/java/util/TestAdler32.java is contributed by Pengfei Li (pli, Pengfei.Li@arm.com).
For this benchmark, the optimization shows a ~5x improvement (from about 4.2x at count 64 to about 5.6x at count 65536).
Base:
Benchmark                          (count)  Mode  Cnt   Score   Error  Units
TestAdler32Perf.testAdler32Update       64  avgt   25   0.084 ± 0.001  us/op
TestAdler32Perf.testAdler32Update      128  avgt   25   0.104 ± 0.001  us/op
TestAdler32Perf.testAdler32Update      256  avgt   25   0.146 ± 0.002  us/op
TestAdler32Perf.testAdler32Update      512  avgt   25   0.226 ± 0.002  us/op
TestAdler32Perf.testAdler32Update     1024  avgt   25   0.390 ± 0.005  us/op
TestAdler32Perf.testAdler32Update     2048  avgt   25   0.714 ± 0.007  us/op
TestAdler32Perf.testAdler32Update     4096  avgt   25   1.359 ± 0.014  us/op
TestAdler32Perf.testAdler32Update     8192  avgt   25   2.751 ± 0.023  us/op
TestAdler32Perf.testAdler32Update    16384  avgt   25   5.494 ± 0.077  us/op
TestAdler32Perf.testAdler32Update    32768  avgt   25  11.058 ± 0.160  us/op
TestAdler32Perf.testAdler32Update    65536  avgt   25  22.198 ± 0.319  us/op
With patch:
Benchmark                          (count)  Mode  Cnt   Score   Error  Units
TestAdler32Perf.testAdler32Update       64  avgt   25   0.020 ± 0.001  us/op
TestAdler32Perf.testAdler32Update      128  avgt   25   0.025 ± 0.001  us/op
TestAdler32Perf.testAdler32Update      256  avgt   25   0.031 ± 0.001  us/op
TestAdler32Perf.testAdler32Update      512  avgt   25   0.048 ± 0.001  us/op
TestAdler32Perf.testAdler32Update     1024  avgt   25   0.078 ± 0.001  us/op
TestAdler32Perf.testAdler32Update     2048  avgt   25   0.139 ± 0.002  us/op
TestAdler32Perf.testAdler32Update     4096  avgt   25   0.262 ± 0.004  us/op
TestAdler32Perf.testAdler32Update     8192  avgt   25   0.524 ± 0.010  us/op
TestAdler32Perf.testAdler32Update    16384  avgt   25   1.017 ± 0.022  us/op
TestAdler32Perf.testAdler32Update    32768  avgt   25   2.058 ± 0.052  us/op
TestAdler32Perf.testAdler32Update    65536  avgt   25   3.994 ± 0.013  us/op
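For reference, the checksum that the intrinsic accelerates can be written as a byte-at-a-time scalar loop. The sketch below follows the Adler-32 definition in RFC 1950; adler32_update is a hypothetical name and this is not the code from the PR, which computes the same sums with AVX2 vector instructions.

```cpp
#include <cstddef>
#include <cstdint>

// Largest prime below 2^16, as specified for Adler-32 in RFC 1950.
constexpr uint32_t BASE = 65521;

// Hypothetical scalar reference: `adler` is the running checksum
// (1 for a fresh stream); the low half holds a, the high half holds b.
uint32_t adler32_update(uint32_t adler, const uint8_t* data, size_t len) {
    uint32_t a = adler & 0xffff;
    uint32_t b = (adler >> 16) & 0xffff;
    for (size_t i = 0; i < len; ++i) {
        a = (a + data[i]) % BASE;   // running sum of bytes
        b = (b + a) % BASE;         // running sum of the a values
    }
    return (b << 16) | a;
}
```

For the three bytes "abc" and an initial value of 1, this yields 0x024D0127; a loop of this shape is also a convenient oracle when checking an optimized implementation.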
Progress
Issue
Reviewers
Contributors
<xubo.zhang@intel.com>
<greg.b.tucker@intel.com>
<pli@openjdk.org>
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/3806/head:pull/3806
$ git checkout pull/3806
Update a local copy of the PR:
$ git checkout pull/3806
$ git pull https://git.openjdk.java.net/jdk pull/3806/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 3806
View PR using the GUI difftool:
$ git pr show -t 3806
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/3806.diff