Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8252847: Optimize primitive arrayCopy stubs using AVX-512 masked instructions #61

Closed
wants to merge 6 commits into from

Conversation

jatin-bhateja
Copy link
Member

@jatin-bhateja jatin-bhateja commented Sep 7, 2020

Summary:

  1. New AVX3 optimized stubs for both conjoint and disjoint arraycopy.
  2. Special instruction sequence blocks for copy sizes b/w 32-192 bytes.
  3. Block copy operation above 192 bytes is performed using destination address aligned PRE-MAIN-POST loop. Main loop copies 192 byte in one iteration and tail part fall over special instruction sequence blocks.
  4. Both small copy block and aligned loop use 32 byte vector register to prevent and frequency penalty for copy sizes less than AVX3Threshold.
  5. For block size above AVX3Theshold both special blocks and loop operate using 64 byte register.
  6. In case user sets the maximum vector size to 32 bytes, forward copy (disjoint) operations are done using efficient REP MOVS for copy sizes above 4096 bytes.

JMH Results:
System : CascadeLake Server, Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
Micros : test/micro/org/openjdk/bench/java/lang/ArrayCopy*.java
Baseline : http://cr.openjdk.java.net/~jbhateja/8252847/JMH_results/ArrayCopy_AVX3_Stubs_Baseline.txt
WithOpt : http://cr.openjdk.java.net/~jbhateja/8252847/JMH_results/ArrayCopy_AVX3_Stubs_WithOpts.txt


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8252847: Optimize primitive arrayCopy stubs using AVX-512 masked instructions

Reviewers

Download

$ git fetch https://git.openjdk.java.net/jdk pull/61/head:pull/61
$ git checkout pull/61

@bridgekeeper
Copy link

bridgekeeper bot commented Sep 7, 2020

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 7, 2020
@openjdk
Copy link

openjdk bot commented Sep 7, 2020

@jatin-bhateja The following label will be automatically applied to this pull request: hotspot.

When this pull request is ready to be reviewed, an RFR email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label (add|remove) "label" command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Sep 7, 2020
@mlbridge
Copy link

mlbridge bot commented Sep 7, 2020

Webrevs

@jatin-bhateja
Copy link
Member Author

jatin-bhateja commented Sep 10, 2020

/label add hotspot-compiler-dev

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Sep 10, 2020
@openjdk
Copy link

openjdk bot commented Sep 10, 2020

@jatin-bhateja
The hotspot-compiler label was successfully added.

BasicType type, int offset, bool use64byteVector) {
assert(MaxVectorSize >= 32, "vector length < 32");
use64byteVector |= MaxVectorSize > 32 && AVX3Threshold == 0;
if (use64byteVector == false) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to "!use64byteVector"

void MacroAssembler::copy64_avx(Register dst, Register src, XMMRegister xmm, int offset, bool use64byteVector) {
assert(MaxVectorSize == 64 || MaxVectorSize == 32, "vector length mismatch");
use64byteVector |= MaxVectorSize > 32 && AVX3Threshold == 0;
if (use64byteVector == false) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to "!use64byteVector"

void MacroAssembler::copy64_conjoint_avx(Register dst, Register src, XMMRegister xmm, int offset, bool use64byteVector) {
assert(MaxVectorSize == 64 || MaxVectorSize == 32, "vector length mismatch");
use64byteVector |= MaxVectorSize > 32 && AVX3Threshold == 0;
if (use64byteVector == false) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to "!use64byteVector"

@@ -1162,7 +1164,7 @@ void VM_Version::get_processor_features() {
#endif // COMPILER2 && ASSERT

if (!FLAG_IS_DEFAULT(AVX3Threshold)) {
if (!is_power_of_2(AVX3Threshold)) {
if (AVX3Threshold !=0 && !is_power_of_2(AVX3Threshold)) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space before '0'

void MacroAssembler::copy64_masked_avx(Register dst, Register src, XMMRegister xmm,
KRegister mask, Register length, Register temp,
BasicType type, int offset, bool use64byteVector) {
assert(MaxVectorSize >= 32, "vector length < 32");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does "MaxVectorSize >= 32" imply that "vector length < 32"?

This assert appears in multiple locations.

KRegister mask, Register length, Register temp,
BasicType type, int offset, bool use64byteVector) {
assert(MaxVectorSize >= 32, "vector length < 32");
use64byteVector |= MaxVectorSize > 32 && AVX3Threshold == 0;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do you expect AVX3Threshold to be 0?

Copy link
Member Author

@jatin-bhateja jatin-bhateja Sep 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of now when user explicitly pass -XX:AVX3Threshold=0 , default value of AVX3Threshold is 4096.

Copy link
Contributor

@vnkozlov vnkozlov Sep 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like that you put special meaning on AVX3Threshold=0 and then have to add additional checks for it in places where you check its power of 2. And you don't check such setting in new tests.

Actually checking for 0 and power of 2 should be done by flag's constraint. See CodeEntryAlignmentConstraintFunc as example.

There is also this strange relation with MaxVectorSize. Also we should consider power level switch for 64 bytes AVX3 vectors. Does it make sense to use it if array length is small (< 4096 default)?

Copy link

@neliasso neliasso left a comment

I like that you have extracted the avx512 stub code from the rest - that makes it a lot more readable! Overall the new code feels easy to understand and read.

I found one more minor issue (appears in four places).

My only concern is that it's getting hard to follow under what circumstances avx3 instructions are used:
Could it be the case that different thresholds are needed for when you are using avx3 instructions with 32 or 64 byte vectors? Are we sure all variants are tested?

Also - have you thought about supporting oop-copies? You only have to call the BarrierSetAssembler::arraycopy_prologue/epilogue like in the old versions. It's not a requirement for me to approve this - but an encouragement for a future patch.

@@ -1884,6 +2387,10 @@ class StubGenerator: public StubCodeGenerator {
address generate_conjoint_int_oop_copy(bool aligned, bool is_oop, address nooverlap_target,
address *entry, const char *name,
bool dest_uninitialized = false) {
if (VM_Version::supports_avx512vlbw() && false == is_oop && MaxVectorSize >= 32) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"false == is_oop" => !is_oop

@@ -1991,6 +2498,9 @@ class StubGenerator: public StubCodeGenerator {
//
address generate_disjoint_long_oop_copy(bool aligned, bool is_oop, address *entry,
const char *name, bool dest_uninitialized = false) {
if (VM_Version::supports_avx512vlbw() && false == is_oop && MaxVectorSize >= 32) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

false == is_oop => !is_oop

@@ -2095,6 +2605,10 @@ class StubGenerator: public StubCodeGenerator {
address generate_conjoint_long_oop_copy(bool aligned, bool is_oop,
address nooverlap_target, address *entry,
const char *name, bool dest_uninitialized = false) {
if (VM_Version::supports_avx512vlbw() && false == is_oop && MaxVectorSize >= 32) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

false == is_oop => !is_oop

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp Outdated Show resolved Hide resolved
@jatin-bhateja
Copy link
Member Author

jatin-bhateja commented Sep 18, 2020

My only concern is that it's getting hard to follow under what circumstances avx3 instructions are used:
Could it be the case that different thresholds are needed for when you are using avx3 instructions with 32 or 64 byte vectors? Are we sure all variants are tested?

Following 2 runtime flags influence the implementation :-

  • MaxVectorSize: Determined during VM initialization based on the CPUID of the target.
  • AVX3Theshold: Set to a default value of 4096 bytes based on prior performance analysis.

Following general rules were followed during implementation:

  1. If target support AVX3 features (BW+VL+F) then copy will use 32 byte vectors (YMMs) for both special cases and aligned copy loop. This is default configuration.
  2. If copy length is above AVX3Threshold, then we can safely use 64 byte vectors (ZMMs) for main copy loop (and tail) since bulk of the cycles will be consumed in it.
  3. Leaf level Macro Assembly routines can dynamically choose b/w YMM or ZMM register based on the AVX3Threshold value.
  4. If user forces MaxVectorSize=32 then above 4096 bytes its seen that REP MOVs shows a better performance for disjoint copies. For conjoint/backward copy vector based copy performs better.

Thus, for 32 byte vector we do not need any threshold since they execute at max frequency level.

tier1, tier2 and tier3 did not show any new issues with the changes.

Also - have you thought about supporting oop-copies? You only have to call the

We may not see significant performance improvement considering prologue and epilogue barriers does considerable processing over object arrays.

@jatin-bhateja
Copy link
Member Author

jatin-bhateja commented Sep 18, 2020

BarrierSetAssembler::arraycopy_prologue/epilogue like in the old versions. It's not a requirement for me to approve this - but an encouragement for a future patch.

Yes that's a good pointer, I can explore extending existing GC barriers for array copy as a separate next step.

Copy link

@neliasso neliasso left a comment

Thanks for the clarification och update!

Reviewed.

@openjdk
Copy link

openjdk bot commented Sep 18, 2020

@jatin-bhateja This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8252847: Optimize primitive arrayCopy stubs using AVX-512 masked instructions

Reviewed-by: neliasso, kvn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 415 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Sep 18, 2020
@vnkozlov
Copy link
Contributor

vnkozlov commented Sep 22, 2020

@jatin-bhateja Can you put summary of performance improvement into JBS?

@openjdk openjdk bot added core-libs core-libs-dev@openjdk.org and removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Sep 22, 2020
@openjdk openjdk bot added ready Pull request is ready to be integrated rfr Pull request is ready for review labels Sep 22, 2020
@jatin-bhateja jatin-bhateja changed the title 8252847: New AVX512 optimized stubs for both conjoint and disjoint arraycopy 8252847: Optimize primitive arrayCopy stubs using AVX-512 masked instructions Sep 22, 2020
@jatin-bhateja
Copy link
Member Author

jatin-bhateja commented Sep 22, 2020

@jatin-bhateja Can you put summary of performance improvement into JBS?

Yes, I have added the summary to JBS

@jatin-bhateja
Copy link
Member Author

jatin-bhateja commented Sep 25, 2020

@jatin-bhateja Can you put summary of performance improvement into JBS?

Hi @vnkozlov , @neliasso
Kindly let me know your feedback, If there are no more comments is it ok to integrate this patch.

Copy link

@neliasso neliasso left a comment

Looks good to me!

Copy link
Contributor

@vnkozlov vnkozlov left a comment

The main concern which is not clear in these changes is ZMM usage which will lower frequency and case performance regression for small arrays.
That is why AVX3Threshold is set to 4096 bytes by default.
Allowing and checking for 0 AVX3Threshold value contradicts that. Would be nice to have clear comment/explanation about that.

I also propose to use Flag constraint() functionality for checking AVX3Threshold value instead of runtime checks everywhere. Separate RFE, please.

@@ -2589,6 +2589,38 @@ void Assembler::evmovdqub(XMMRegister dst, KRegister mask, Address src, int vect
emit_operand(dst, src);
}

void Assembler::evmovdqu(XMMRegister dst, KRegister mask, Address src, int vector_len, int type) {
assert(VM_Version::supports_avx512vlbw(), "");
Copy link
Contributor

@vnkozlov vnkozlov Sep 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to add assert to these 2 new instruction to check 'type' value to make sure only expected types are passed.

assert(VM_Version::supports_avx512vlbw(), "");
InstructionMark im(this);
bool wide = type == T_SHORT || type == T_LONG || type == T_CHAR;
bool bwinstr = type == T_BYTE || type == T_SHORT || type == T_CHAR;
Copy link
Contributor

@vnkozlov vnkozlov Sep 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'bwinstr' is used only once. You may as well directly set 'prefix' here. (Same in second instruction).

void Assembler::evmovdqu(XMMRegister dst, KRegister mask, Address src, int vector_len, int type) {
assert(VM_Version::supports_avx512vlbw(), "");
InstructionMark im(this);
bool wide = type == T_SHORT || type == T_LONG || type == T_CHAR;
Copy link
Contributor

@vnkozlov vnkozlov Sep 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks strange but it is correct (I looked on existing evmovdqu* instructions). May be reorder - T_LONG last.

Do you consider replacing existing evmovdqu* instructions with these two?

KRegister mask, Register length, Register temp,
BasicType type, int offset, bool use64byteVector) {
assert(MaxVectorSize >= 32, "vector length < 32");
use64byteVector |= MaxVectorSize > 32 && AVX3Threshold == 0;
Copy link
Contributor

@vnkozlov vnkozlov Sep 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like that you put special meaning on AVX3Threshold=0 and then have to add additional checks for it in places where you check its power of 2. And you don't check such setting in new tests.

Actually checking for 0 and power of 2 should be done by flag's constraint. See CodeEntryAlignmentConstraintFunc as example.

There is also this strange relation with MaxVectorSize. Also we should consider power level switch for 64 bytes AVX3 vectors. Does it make sense to use it if array length is small (< 4096 default)?

@@ -1323,6 +1261,572 @@ class StubGenerator: public StubCodeGenerator {
__ jcc(Assembler::greater, L_copy_8_bytes); // Copy trailing qwords
}

#ifndef PRODUCT
Copy link
Contributor

@vnkozlov vnkozlov Sep 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

macroAssembler_x86.hpp become big. May be we should start thing about splitting arraycopy stubs into separate file.

Copy link

@neliasso neliasso Oct 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But lets do that in a another change. It is good that the AVX3 case is separated out in this change - makes it easy to follow.

Copy link
Contributor

@vnkozlov vnkozlov Oct 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree

@@ -33,7 +33,7 @@ static bool returns_to_call_stub(address return_pc) { return return_pc == _call_

enum platform_dependent_constants {
code_size1 = 20000 LP64_ONLY(+10000), // simply increase if too small (assembler will crash if too small)
code_size2 = 35300 LP64_ONLY(+11400) // simply increase if too small (assembler will crash if too small)
code_size2 = 35300 LP64_ONLY(+21400) // simply increase if too small (assembler will crash if too small)
Copy link
Contributor

@vnkozlov vnkozlov Sep 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is big increase in size!

@@ -1162,7 +1164,7 @@ void VM_Version::get_processor_features() {
#endif // COMPILER2 && ASSERT

if (!FLAG_IS_DEFAULT(AVX3Threshold)) {
if (!is_power_of_2(AVX3Threshold)) {
if (AVX3Threshold != 0 && !is_power_of_2(AVX3Threshold)) {
Copy link
Contributor

@vnkozlov vnkozlov Sep 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider flag's constraint() instead of runtime these checks. Separate RFE, please.

@@ -84,7 +84,7 @@
public AMD64ArrayCompareToOp(LIRGeneratorTool tool, int useAVX3Threshold, JavaKind kind1, JavaKind kind2, Value result, Value array1, Value array2, Value length1, Value length2) {
super(TYPE);

assert CodeUtil.isPowerOf2(useAVX3Threshold) : "AVX3Threshold must be power of 2";
assert useAVX3Threshold == 0 || CodeUtil.isPowerOf2(useAVX3Threshold) : "AVX3Threshold must be power of 2";
Copy link
Contributor

@vnkozlov vnkozlov Sep 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You would need to upstream Graal changes.

* @summary Optimize arrayCopy using AVX-512 masked instructions.
*
* @run main/othervm/timeout=600 -XX:-TieredCompilation -Xbatch -XX:+IgnoreUnrecognizedVMOptions
* -XX:UseAVX=3 -XX:+UnlockDiagnosticVMOptions -XX:ArrayCopyPartialInlineSize=0 -XX:MaxVectorSize=32 -XX:+UnlockDiagnosticVMOptions
Copy link
Contributor

@vnkozlov vnkozlov Sep 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrayCopyPartialInlineSize flag is not defiled in these changes.
It seems they need to be included in 8252848 changes.

Copy link
Member Author

@jatin-bhateja jatin-bhateja Sep 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @vnkozlov, I have updated the pull request to cover your comments. Kindly review.
New RFE JDK-8253721 has been created for AVX3Threshold flag related changes (PR-394).

Copy link
Contributor

@vnkozlov vnkozlov left a comment

Yes, this looks better. Reviewed. Before pushing let me test it. I will let you know results.

@@ -1323,6 +1261,572 @@ class StubGenerator: public StubCodeGenerator {
__ jcc(Assembler::greater, L_copy_8_bytes); // Copy trailing qwords
}

#ifndef PRODUCT
Copy link
Contributor

@vnkozlov vnkozlov Oct 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree

@vnkozlov
Copy link
Contributor

vnkozlov commented Oct 9, 2020

hs-tier1-3 testing passed on x86 (all OSs).

@jatin-bhateja
Copy link
Member Author

jatin-bhateja commented Oct 10, 2020

/integrate

@openjdk openjdk bot closed this Oct 10, 2020
@openjdk openjdk bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Oct 10, 2020
@openjdk
Copy link

openjdk bot commented Oct 10, 2020

@jatin-bhateja Since your change was applied there have been 420 commits pushed to the master branch:

  • ec41046: 8254348: Build fails when cds is disabled after JDK-8247536
  • e4469d2: 8247536: Support for pre-generated java.lang.invoke classes in CDS static archive
  • 7ec9c8e: 8233214: Remove runtime code not needed with CMS removed
  • 536b35b: 8254319: Shenandoah: Interpreter native-LRB needs to activate during HAS_FORWARDED
  • be26972: 8253379: [windows] Several jpackage tests failed with error code 1638
  • 52e45a3: 8229186: Improve error messages for TestStringIntrinsics failures
  • 6d2c1a6: 8254292: Update JMH devkit to 1.26
  • 2bbf8a2: 8245543: Cgroups: Incorrect detection logic on some systems (still reproducible)
  • aaa0a2a: 8254297: Zero and Minimal VMs are broken with undeclared identifier 'DerivedPointerTable' after JDK-8253180
  • 7e80c98: 8254261: fix javadocs in jdk.test.lib.Utils
  • ... and 410 more: https://git.openjdk.java.net/jdk/compare/e0c8d4420c8e1a84581927cf77314498b8e5aa52...master

Your commit was automatically rebased without conflicts.

Pushed as commit 4b5ac3a.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-libs core-libs-dev@openjdk.org hotspot hotspot-dev@openjdk.org hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated
3 participants