8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy #16575
…rate_disjoint_copy_avx3_masked - add src address prefetches - switch to non-temporal writes - added modified jmh benchmark based on xor benchmark from Maurizio Cimadamore
Hi @steveatgh, welcome to this OpenJDK project and thanks for contributing! We do not recognize you as Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user steveatgh" as summary for the issue. If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing |
@steveatgh The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.
/covered
I'm part of the Intel Java team
Thank you! Please allow for a few business days to verify that your employer has signed the OCA. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated!
/contributor add @mcimadamore
- fix xor test foreign impl constructor signature
I submitted some quick testing and I'm seeing the following failure with multiple tests:
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/workspace/open/src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp:1201), pid=24136, tid=24139
# assert(MaxVectorSize == 64) failed: vector length != 64
#
# JRE version: (22.0) (fastdebug build )
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 22-internal-2023-11-13-0750559.tobias.hartmann.jdk2, mixed mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0x16c00e6] StubGenerator::copy64_masked_avx(Register, Register, XMMRegister, KRegister, Register, Register, Register, int, int, bool)+0x366
Stack: [0x00007f0b5e919000,0x00007f0b5ea1a000], sp=0x00007f0b5ea17150, free space=1016k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x16c00e6] StubGenerator::copy64_masked_avx(Register, Register, XMMRegister, KRegister, Register, Register, Register, int, int, bool)+0x366 (stubGenerator_x86_64_arraycopy.cpp:1201)
V [libjvm.so+0x16c0ecd] StubGenerator::arraycopy_avx3_special_cases_256(XMMRegister, KRegister, Register, Register, Register, int, Register, Register, Label&, Label&)+0x19d (stubGenerator_x86_64_arraycopy.cpp:1055)
V [libjvm.so+0x16c16c1] StubGenerator::arraycopy_avx3_large(Register, Register, Register, Register, Register, Register, Register, XMMRegister, XMMRegister, XMMRegister, XMMRegister, int)+0x3f1 (stubGenerator_x86_64_arraycopy.cpp:790)
V [libjvm.so+0x16c22f0] StubGenerator::generate_disjoint_copy_avx3_masked(unsigned char**, char const*, int, bool, bool, bool)+0xa90 (stubGenerator_x86_64_arraycopy.cpp:728)
V [libjvm.so+0x16c4b85] StubGenerator::generate_disjoint_byte_copy(bool, unsigned char**, char const*)+0x965 (stubGenerator_x86_64_arraycopy.cpp:1277)
V [libjvm.so+0x16cb309] StubGenerator::generate_arraycopy_stubs()+0x29 (stubGenerator_x86_64_arraycopy.cpp:88)
V [libjvm.so+0x16a1089] StubGenerator::generate_final_stubs()+0xb9 (stubGenerator_x86_64.cpp:4051)
V [libjvm.so+0x16a22a5] StubGenerator_generate(CodeBuffer*, StubCodeGenerator::StubsKind)+0x105 (stubGenerator_x86_64.cpp:4296)
V [libjvm.so+0x16f349e] initialize_stubs(StubCodeGenerator::StubsKind, int, int, char const*, char const*, char const*)+0x13e (stubRoutines.cpp:241)
V [libjvm.so+0x16f500d] final_stubs_init()+0x3d (stubRoutines.cpp:288)
V [libjvm.so+0xe30c59] init_globals2()+0x69 (init.cpp:180)
V [libjvm.so+0x17b9151] Threads::create_vm(JavaVMInitArgs*, bool*)+0x311 (threads.cpp:569)
V [libjvm.so+0xf937e4] JNI_CreateJavaVM+0x54 (jni.cpp:3576)
C [libjli.so+0x419f] JavaMain+0x8f (java.c:1522)
C [libjli.so+0x7c39] ThreadJavaMain+0x9 (java_md.c:650)
For example, with compiler/arraycopy/TestArrayCopyConjoint.java and -XX:-UseTLAB.
Thank you @TobiHartmann for the feedback. I'm working on the issue. Can you tell me what kind of machine you tested with?
__ movq(temp2, temp1);
__ shlq(temp2, shift);
__ cmpq(temp2, large_threshold);
__ jcc(Assembler::greaterEqual, L_copy_large);
Hi @steveatgh, can you please share the performance numbers for the other array copy JMH micros in the following directory: https://github.com/openjdk/jdk/tree/master/test/micro/org/openjdk/bench/java/lang
I would still ask you to run the benchmarks in the path above; we may see performance dips for sizes just past the special cases because of the additional comparisons.
Here are the results on my Ubuntu laptop running at 3 GHz
// Baseline
Benchmark (length) (size) Mode Cnt Score Error Units
ArrayCopyObject.conjoint_micro N/A 31 thrpt 15 77157.933 ? 1977.467 ops/ms
ArrayCopyObject.conjoint_micro N/A 63 thrpt 15 58329.157 ? 1667.574 ops/ms
ArrayCopyObject.conjoint_micro N/A 127 thrpt 15 49322.065 ? 2332.342 ops/ms
ArrayCopyObject.conjoint_micro N/A 2047 thrpt 15 13895.531 ? 239.300 ops/ms
ArrayCopyObject.conjoint_micro N/A 4095 thrpt 15 7926.854 ? 201.238 ops/ms
ArrayCopyObject.conjoint_micro N/A 8191 thrpt 15 4289.582 ? 31.734 ops/ms
ArrayCopyObject.disjoint_micro N/A 31 thrpt 15 74711.699 ? 2463.378 ops/ms
ArrayCopyObject.disjoint_micro N/A 63 thrpt 15 65229.586 ? 1329.809 ops/ms
ArrayCopyObject.disjoint_micro N/A 127 thrpt 15 54330.794 ? 2372.868 ops/ms
ArrayCopyObject.disjoint_micro N/A 2047 thrpt 15 9338.340 ? 132.987 ops/ms
ArrayCopyObject.disjoint_micro N/A 4095 thrpt 15 5035.553 ? 109.679 ops/ms
ArrayCopyObject.disjoint_micro N/A 8191 thrpt 15 1192.069 ? 10.765 ops/ms
ArrayCopy.arrayCopy N/A N/A avgt 15 1.356 ? 0.029 ns/op
ArrayCopy.arrayCopyChar N/A N/A avgt 15 4.368 ? 0.038 ns/op
ArrayCopy.arrayCopyCharNonConst N/A N/A avgt 15 4.749 ? 0.113 ns/op
ArrayCopy.arrayCopyLocalArray N/A N/A avgt 15 0.503 ? 0.001 ns/op
ArrayCopy.arrayCopyNonConst N/A N/A avgt 15 1.955 ? 0.108 ns/op
ArrayCopy.arrayCopyObject N/A N/A avgt 15 22.403 ? 0.563 ns/op
ArrayCopy.arrayCopyObjectNonConst N/A N/A avgt 15 25.188 ? 0.484 ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward N/A N/A avgt 15 17.785 ? 0.781 ns/op
ArrayCopy.arrayCopyObjectSameArraysForward N/A N/A avgt 15 17.347 ? 0.126 ns/op
ArrayCopy.copyLoop N/A N/A avgt 15 5.189 ? 0.100 ns/op
ArrayCopy.copyLoopLocalArray N/A N/A avgt 15 3.685 ? 0.085 ns/op
ArrayCopy.copyLoopNonConst N/A N/A avgt 15 5.436 ? 0.040 ns/op
ArrayCopyAligned.testByte 1 N/A avgt 15 2.366 ? 0.028 ns/op
ArrayCopyAligned.testByte 3 N/A avgt 15 2.381 ? 0.063 ns/op
ArrayCopyAligned.testByte 5 N/A avgt 15 2.362 ? 0.035 ns/op
ArrayCopyAligned.testByte 10 N/A avgt 15 2.364 ? 0.048 ns/op
ArrayCopyAligned.testByte 20 N/A avgt 15 2.353 ? 0.026 ns/op
ArrayCopyAligned.testByte 70 N/A avgt 15 5.214 ? 0.082 ns/op
ArrayCopyAligned.testByte 150 N/A avgt 15 6.081 ? 0.140 ns/op
ArrayCopyAligned.testByte 300 N/A avgt 15 9.399 ? 0.262 ns/op
ArrayCopyAligned.testByte 600 N/A avgt 15 12.710 ? 0.149 ns/op
ArrayCopyAligned.testByte 1200 N/A avgt 15 21.873 ? 0.237 ns/op
ArrayCopyAligned.testChar 1 N/A avgt 15 2.349 ? 0.014 ns/op
ArrayCopyAligned.testChar 3 N/A avgt 15 2.360 ? 0.041 ns/op
ArrayCopyAligned.testChar 5 N/A avgt 15 2.359 ? 0.021 ns/op
ArrayCopyAligned.testChar 10 N/A avgt 15 2.369 ? 0.042 ns/op
ArrayCopyAligned.testChar 20 N/A avgt 15 5.101 ? 0.080 ns/op
ArrayCopyAligned.testChar 70 N/A avgt 15 5.961 ? 0.096 ns/op
ArrayCopyAligned.testChar 150 N/A avgt 15 9.321 ? 0.221 ns/op
ArrayCopyAligned.testChar 300 N/A avgt 15 13.473 ? 0.282 ns/op
ArrayCopyAligned.testChar 600 N/A avgt 15 20.941 ? 0.211 ns/op
ArrayCopyAligned.testChar 1200 N/A avgt 15 33.840 ? 0.490 ns/op
ArrayCopyAligned.testInt 1 N/A avgt 15 4.391 ? 0.042 ns/op
ArrayCopyAligned.testInt 3 N/A avgt 15 4.417 ? 0.063 ns/op
ArrayCopyAligned.testInt 5 N/A avgt 15 4.425 ? 0.047 ns/op
ArrayCopyAligned.testInt 10 N/A avgt 15 5.058 ? 0.084 ns/op
ArrayCopyAligned.testInt 20 N/A avgt 15 5.083 ? 0.062 ns/op
ArrayCopyAligned.testInt 70 N/A avgt 15 8.773 ? 0.200 ns/op
ArrayCopyAligned.testInt 150 N/A avgt 15 12.221 ? 0.212 ns/op
ArrayCopyAligned.testInt 300 N/A avgt 15 21.785 ? 0.160 ns/op
ArrayCopyAligned.testInt 600 N/A avgt 15 37.937 ? 0.178 ns/op
ArrayCopyAligned.testInt 1200 N/A avgt 15 54.911 ? 0.943 ns/op
ArrayCopyAligned.testLong 1 N/A avgt 15 4.420 ? 0.075 ns/op
ArrayCopyAligned.testLong 3 N/A avgt 15 4.362 ? 0.010 ns/op
ArrayCopyAligned.testLong 5 N/A avgt 15 5.030 ? 0.018 ns/op
ArrayCopyAligned.testLong 10 N/A avgt 15 5.112 ? 0.074 ns/op
ArrayCopyAligned.testLong 20 N/A avgt 15 5.847 ? 0.151 ns/op
ArrayCopyAligned.testLong 70 N/A avgt 15 11.349 ? 0.411 ns/op
ArrayCopyAligned.testLong 150 N/A avgt 15 17.721 ? 0.360 ns/op
ArrayCopyAligned.testLong 300 N/A avgt 15 27.205 ? 0.427 ns/op
ArrayCopyAligned.testLong 600 N/A avgt 15 44.129 ? 0.555 ns/op
ArrayCopyAligned.testLong 1200 N/A avgt 15 75.388 ? 0.774 ns/op
ArrayCopyUnalignedBoth.testByte 1 N/A avgt 15 2.355 ? 0.026 ns/op
ArrayCopyUnalignedBoth.testByte 3 N/A avgt 15 2.361 ? 0.046 ns/op
ArrayCopyUnalignedBoth.testByte 5 N/A avgt 15 2.357 ? 0.032 ns/op
ArrayCopyUnalignedBoth.testByte 10 N/A avgt 15 2.385 ? 0.047 ns/op
ArrayCopyUnalignedBoth.testByte 20 N/A avgt 15 2.355 ? 0.028 ns/op
ArrayCopyUnalignedBoth.testByte 70 N/A avgt 15 5.218 ? 0.095 ns/op
ArrayCopyUnalignedBoth.testByte 150 N/A avgt 15 6.038 ? 0.112 ns/op
ArrayCopyUnalignedBoth.testByte 300 N/A avgt 15 9.848 ? 0.218 ns/op
ArrayCopyUnalignedBoth.testByte 600 N/A avgt 15 13.090 ? 0.170 ns/op
ArrayCopyUnalignedBoth.testByte 1200 N/A avgt 15 20.538 ? 0.270 ns/op
ArrayCopyUnalignedBoth.testChar 1 N/A avgt 15 2.374 ? 0.043 ns/op
ArrayCopyUnalignedBoth.testChar 3 N/A avgt 15 2.351 ? 0.011 ns/op
ArrayCopyUnalignedBoth.testChar 5 N/A avgt 15 2.352 ? 0.017 ns/op
ArrayCopyUnalignedBoth.testChar 10 N/A avgt 15 2.349 ? 0.008 ns/op
ArrayCopyUnalignedBoth.testChar 20 N/A avgt 15 5.070 ? 0.041 ns/op
ArrayCopyUnalignedBoth.testChar 70 N/A avgt 15 6.052 ? 0.197 ns/op
ArrayCopyUnalignedBoth.testChar 150 N/A avgt 15 9.861 ? 0.226 ns/op
ArrayCopyUnalignedBoth.testChar 300 N/A avgt 15 13.635 ? 0.136 ns/op
ArrayCopyUnalignedBoth.testChar 600 N/A avgt 15 20.967 ? 0.164 ns/op
ArrayCopyUnalignedBoth.testChar 1200 N/A avgt 15 36.465 ? 0.140 ns/op
ArrayCopyUnalignedBoth.testInt 1 N/A avgt 15 4.440 ? 0.064 ns/op
ArrayCopyUnalignedBoth.testInt 3 N/A avgt 15 4.446 ? 0.089 ns/op
ArrayCopyUnalignedBoth.testInt 5 N/A avgt 15 4.417 ? 0.058 ns/op
ArrayCopyUnalignedBoth.testInt 10 N/A avgt 15 5.044 ? 0.054 ns/op
ArrayCopyUnalignedBoth.testInt 20 N/A avgt 15 5.127 ? 0.100 ns/op
ArrayCopyUnalignedBoth.testInt 70 N/A avgt 15 8.399 ? 0.077 ns/op
ArrayCopyUnalignedBoth.testInt 150 N/A avgt 15 12.252 ? 0.203 ns/op
ArrayCopyUnalignedBoth.testInt 300 N/A avgt 15 23.253 ? 0.252 ns/op
ArrayCopyUnalignedBoth.testInt 600 N/A avgt 15 37.990 ? 0.456 ns/op
ArrayCopyUnalignedBoth.testInt 1200 N/A avgt 15 57.030 ? 0.146 ns/op
ArrayCopyUnalignedBoth.testLong 1 N/A avgt 15 4.360 ? 0.014 ns/op
ArrayCopyUnalignedBoth.testLong 3 N/A avgt 15 4.391 ? 0.080 ns/op
ArrayCopyUnalignedBoth.testLong 5 N/A avgt 15 5.060 ? 0.071 ns/op
ArrayCopyUnalignedBoth.testLong 10 N/A avgt 15 5.117 ? 0.109 ns/op
ArrayCopyUnalignedBoth.testLong 20 N/A avgt 15 5.841 ? 0.115 ns/op
ArrayCopyUnalignedBoth.testLong 70 N/A avgt 15 11.700 ? 0.655 ns/op
ArrayCopyUnalignedBoth.testLong 150 N/A avgt 15 22.002 ? 0.408 ns/op
ArrayCopyUnalignedBoth.testLong 300 N/A avgt 15 36.020 ? 0.356 ns/op
ArrayCopyUnalignedBoth.testLong 600 N/A avgt 15 45.212 ? 0.194 ns/op
ArrayCopyUnalignedBoth.testLong 1200 N/A avgt 15 75.720 ? 0.607 ns/op
ArrayCopyUnalignedDst.testByte 1 N/A avgt 15 2.361 ? 0.037 ns/op
ArrayCopyUnalignedDst.testByte 10 N/A avgt 15 2.353 ? 0.025 ns/op
ArrayCopyUnalignedDst.testByte 150 N/A avgt 15 6.145 ? 0.170 ns/op
ArrayCopyUnalignedDst.testByte 1200 N/A avgt 15 19.825 ? 0.231 ns/op
ArrayCopyUnalignedDst.testChar 1 N/A avgt 15 2.366 ? 0.053 ns/op
ArrayCopyUnalignedDst.testChar 10 N/A avgt 15 2.375 ? 0.058 ns/op
ArrayCopyUnalignedDst.testChar 150 N/A avgt 15 9.274 ? 0.237 ns/op
ArrayCopyUnalignedDst.testChar 1200 N/A avgt 15 36.327 ? 0.086 ns/op
ArrayCopyUnalignedDst.testInt 1 N/A avgt 15 4.400 ? 0.023 ns/op
ArrayCopyUnalignedDst.testInt 10 N/A avgt 15 5.071 ? 0.073 ns/op
ArrayCopyUnalignedDst.testInt 150 N/A avgt 15 13.229 ? 0.172 ns/op
ArrayCopyUnalignedDst.testInt 1200 N/A avgt 15 56.467 ? 0.384 ns/op
ArrayCopyUnalignedDst.testLong 1 N/A avgt 15 4.421 ? 0.107 ns/op
ArrayCopyUnalignedDst.testLong 10 N/A avgt 15 5.074 ? 0.063 ns/op
ArrayCopyUnalignedDst.testLong 150 N/A avgt 15 20.605 ? 0.602 ns/op
ArrayCopyUnalignedDst.testLong 1200 N/A avgt 15 74.206 ? 0.294 ns/op
ArrayCopyUnalignedSrc.testByte 1 N/A avgt 15 2.352 ? 0.024 ns/op
ArrayCopyUnalignedSrc.testByte 10 N/A avgt 15 2.352 ? 0.028 ns/op
ArrayCopyUnalignedSrc.testByte 150 N/A avgt 15 6.156 ? 0.118 ns/op
ArrayCopyUnalignedSrc.testByte 1200 N/A avgt 15 16.755 ? 0.046 ns/op
ArrayCopyUnalignedSrc.testChar 1 N/A avgt 15 2.363 ? 0.031 ns/op
ArrayCopyUnalignedSrc.testChar 10 N/A avgt 15 2.367 ? 0.045 ns/op
ArrayCopyUnalignedSrc.testChar 150 N/A avgt 15 9.318 ? 0.157 ns/op
ArrayCopyUnalignedSrc.testChar 1200 N/A avgt 15 31.355 ? 0.276 ns/op
ArrayCopyUnalignedSrc.testInt 1 N/A avgt 15 4.428 ? 0.063 ns/op
ArrayCopyUnalignedSrc.testInt 10 N/A avgt 15 5.072 ? 0.089 ns/op
ArrayCopyUnalignedSrc.testInt 150 N/A avgt 15 12.163 ? 0.116 ns/op
ArrayCopyUnalignedSrc.testInt 1200 N/A avgt 15 54.206 ? 0.374 ns/op
ArrayCopyUnalignedSrc.testLong 1 N/A avgt 15 4.401 ? 0.052 ns/op
ArrayCopyUnalignedSrc.testLong 10 N/A avgt 15 5.058 ? 0.034 ns/op
ArrayCopyUnalignedSrc.testLong 150 N/A avgt 15 20.391 ? 0.417 ns/op
ArrayCopyUnalignedSrc.testLong 1200 N/A avgt 15 74.467 ? 0.809 ns/op
// PR with assert fix
Benchmark (length) (size) Mode Cnt Score Error Units
ArrayCopyObject.conjoint_micro N/A 31 thrpt 15 79910.859 ? 869.372 ops/ms
ArrayCopyObject.conjoint_micro N/A 63 thrpt 15 62631.951 ? 1440.065 ops/ms
ArrayCopyObject.conjoint_micro N/A 127 thrpt 15 51043.300 ? 761.226 ops/ms
ArrayCopyObject.conjoint_micro N/A 2047 thrpt 15 14141.790 ? 164.714 ops/ms
ArrayCopyObject.conjoint_micro N/A 4095 thrpt 15 8024.056 ? 53.310 ops/ms
ArrayCopyObject.conjoint_micro N/A 8191 thrpt 15 4318.074 ? 6.441 ops/ms
ArrayCopyObject.disjoint_micro N/A 31 thrpt 15 78245.690 ? 1697.277 ops/ms
ArrayCopyObject.disjoint_micro N/A 63 thrpt 15 61873.747 ? 806.972 ops/ms
ArrayCopyObject.disjoint_micro N/A 127 thrpt 15 55457.908 ? 2091.739 ops/ms
ArrayCopyObject.disjoint_micro N/A 2047 thrpt 15 9407.159 ? 102.308 ops/ms
ArrayCopyObject.disjoint_micro N/A 4095 thrpt 15 5107.999 ? 49.856 ops/ms
ArrayCopyObject.disjoint_micro N/A 8191 thrpt 15 1195.313 ? 7.580 ops/ms
ArrayCopy.arrayCopy N/A N/A avgt 15 1.354 ? 0.026 ns/op
ArrayCopy.arrayCopyChar N/A N/A avgt 15 4.388 ? 0.101 ns/op
ArrayCopy.arrayCopyCharNonConst N/A N/A avgt 15 4.715 ? 0.077 ns/op
ArrayCopy.arrayCopyLocalArray N/A N/A avgt 15 0.505 ? 0.007 ns/op
ArrayCopy.arrayCopyNonConst N/A N/A avgt 15 1.900 ? 0.042 ns/op
ArrayCopy.arrayCopyObject N/A N/A avgt 15 23.395 ? 0.252 ns/op
ArrayCopy.arrayCopyObjectNonConst N/A N/A avgt 15 25.409 ? 0.355 ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward N/A N/A avgt 15 17.352 ? 0.297 ns/op
ArrayCopy.arrayCopyObjectSameArraysForward N/A N/A avgt 15 17.804 ? 0.198 ns/op
ArrayCopy.copyLoop N/A N/A avgt 15 5.114 ? 0.117 ns/op
ArrayCopy.copyLoopLocalArray N/A N/A avgt 15 3.728 ? 0.086 ns/op
ArrayCopy.copyLoopNonConst N/A N/A avgt 15 5.413 ? 0.022 ns/op
ArrayCopyAligned.testByte 1 N/A avgt 15 2.367 ? 0.041 ns/op
ArrayCopyAligned.testByte 3 N/A avgt 15 2.368 ? 0.048 ns/op
ArrayCopyAligned.testByte 5 N/A avgt 15 2.360 ? 0.050 ns/op
ArrayCopyAligned.testByte 10 N/A avgt 15 2.362 ? 0.030 ns/op
ArrayCopyAligned.testByte 20 N/A avgt 15 2.363 ? 0.038 ns/op
ArrayCopyAligned.testByte 70 N/A avgt 15 5.185 ? 0.092 ns/op
ArrayCopyAligned.testByte 150 N/A avgt 15 5.905 ? 0.073 ns/op
ArrayCopyAligned.testByte 300 N/A avgt 15 9.720 ? 0.215 ns/op
ArrayCopyAligned.testByte 600 N/A avgt 15 13.076 ? 0.142 ns/op
ArrayCopyAligned.testByte 1200 N/A avgt 15 22.189 ? 0.143 ns/op
ArrayCopyAligned.testChar 1 N/A avgt 15 2.351 ? 0.008 ns/op
ArrayCopyAligned.testChar 3 N/A avgt 15 2.370 ? 0.046 ns/op
ArrayCopyAligned.testChar 5 N/A avgt 15 2.355 ? 0.037 ns/op
ArrayCopyAligned.testChar 10 N/A avgt 15 2.351 ? 0.020 ns/op
ArrayCopyAligned.testChar 20 N/A avgt 15 5.077 ? 0.059 ns/op
ArrayCopyAligned.testChar 70 N/A avgt 15 5.932 ? 0.101 ns/op
ArrayCopyAligned.testChar 150 N/A avgt 15 9.815 ? 0.159 ns/op
ArrayCopyAligned.testChar 300 N/A avgt 15 13.759 ? 0.197 ns/op
ArrayCopyAligned.testChar 600 N/A avgt 15 20.505 ? 0.161 ns/op
ArrayCopyAligned.testChar 1200 N/A avgt 15 33.720 ? 0.493 ns/op
ArrayCopyAligned.testInt 1 N/A avgt 15 4.417 ? 0.096 ns/op
ArrayCopyAligned.testInt 3 N/A avgt 15 4.363 ? 0.029 ns/op
ArrayCopyAligned.testInt 5 N/A avgt 15 4.365 ? 0.022 ns/op
ArrayCopyAligned.testInt 10 N/A avgt 15 5.122 ? 0.170 ns/op
ArrayCopyAligned.testInt 20 N/A avgt 15 5.074 ? 0.076 ns/op
ArrayCopyAligned.testInt 70 N/A avgt 15 9.048 ? 0.201 ns/op
ArrayCopyAligned.testInt 150 N/A avgt 15 12.559 ? 0.159 ns/op
ArrayCopyAligned.testInt 300 N/A avgt 15 21.518 ? 0.276 ns/op
ArrayCopyAligned.testInt 600 N/A avgt 15 38.209 ? 0.349 ns/op
ArrayCopyAligned.testInt 1200 N/A avgt 15 54.638 ? 0.706 ns/op
ArrayCopyAligned.testLong 1 N/A avgt 15 4.407 ? 0.041 ns/op
ArrayCopyAligned.testLong 3 N/A avgt 15 4.415 ? 0.077 ns/op
ArrayCopyAligned.testLong 5 N/A avgt 15 5.087 ? 0.092 ns/op
ArrayCopyAligned.testLong 10 N/A avgt 15 5.072 ? 0.078 ns/op
ArrayCopyAligned.testLong 20 N/A avgt 15 5.802 ? 0.023 ns/op
ArrayCopyAligned.testLong 70 N/A avgt 15 11.284 ? 0.171 ns/op
ArrayCopyAligned.testLong 150 N/A avgt 15 17.501 ? 0.185 ns/op
ArrayCopyAligned.testLong 300 N/A avgt 15 27.477 ? 0.391 ns/op
ArrayCopyAligned.testLong 600 N/A avgt 15 44.711 ? 0.209 ns/op
ArrayCopyAligned.testLong 1200 N/A avgt 15 77.157 ? 1.437 ns/op
ArrayCopyUnalignedBoth.testByte 1 N/A avgt 15 2.360 ? 0.040 ns/op
ArrayCopyUnalignedBoth.testByte 3 N/A avgt 15 2.351 ? 0.028 ns/op
ArrayCopyUnalignedBoth.testByte 5 N/A avgt 15 2.352 ? 0.017 ns/op
ArrayCopyUnalignedBoth.testByte 10 N/A avgt 15 2.347 ? 0.011 ns/op
ArrayCopyUnalignedBoth.testByte 20 N/A avgt 15 2.363 ? 0.039 ns/op
ArrayCopyUnalignedBoth.testByte 70 N/A avgt 15 5.182 ? 0.083 ns/op
ArrayCopyUnalignedBoth.testByte 150 N/A avgt 15 5.920 ? 0.157 ns/op
ArrayCopyUnalignedBoth.testByte 300 N/A avgt 15 10.374 ? 0.314 ns/op
ArrayCopyUnalignedBoth.testByte 600 N/A avgt 15 13.511 ? 0.182 ns/op
ArrayCopyUnalignedBoth.testByte 1200 N/A avgt 15 21.302 ? 0.194 ns/op
ArrayCopyUnalignedBoth.testChar 1 N/A avgt 15 2.359 ? 0.035 ns/op
ArrayCopyUnalignedBoth.testChar 3 N/A avgt 15 2.342 ? 0.002 ns/op
ArrayCopyUnalignedBoth.testChar 5 N/A avgt 15 2.348 ? 0.019 ns/op
ArrayCopyUnalignedBoth.testChar 10 N/A avgt 15 2.362 ? 0.059 ns/op
ArrayCopyUnalignedBoth.testChar 20 N/A avgt 15 5.079 ? 0.046 ns/op
ArrayCopyUnalignedBoth.testChar 70 N/A avgt 15 5.974 ? 0.165 ns/op
ArrayCopyUnalignedBoth.testChar 150 N/A avgt 15 10.201 ? 0.260 ns/op
ArrayCopyUnalignedBoth.testChar 300 N/A avgt 15 13.862 ? 0.064 ns/op
ArrayCopyUnalignedBoth.testChar 600 N/A avgt 15 20.752 ? 0.240 ns/op
ArrayCopyUnalignedBoth.testChar 1200 N/A avgt 15 36.883 ? 0.390 ns/op
ArrayCopyUnalignedBoth.testInt 1 N/A avgt 15 4.372 ? 0.054 ns/op
ArrayCopyUnalignedBoth.testInt 3 N/A avgt 15 4.376 ? 0.051 ns/op
ArrayCopyUnalignedBoth.testInt 5 N/A avgt 15 4.385 ? 0.081 ns/op
ArrayCopyUnalignedBoth.testInt 10 N/A avgt 15 5.059 ? 0.082 ns/op
ArrayCopyUnalignedBoth.testInt 20 N/A avgt 15 5.099 ? 0.154 ns/op
ArrayCopyUnalignedBoth.testInt 70 N/A avgt 15 8.983 ? 0.079 ns/op
ArrayCopyUnalignedBoth.testInt 150 N/A avgt 15 12.481 ? 0.169 ns/op
ArrayCopyUnalignedBoth.testInt 300 N/A avgt 15 23.265 ? 0.319 ns/op
ArrayCopyUnalignedBoth.testInt 600 N/A avgt 15 38.328 ? 0.259 ns/op
ArrayCopyUnalignedBoth.testInt 1200 N/A avgt 15 57.320 ? 0.476 ns/op
ArrayCopyUnalignedBoth.testLong 1 N/A avgt 15 4.413 ? 0.055 ns/op
ArrayCopyUnalignedBoth.testLong 3 N/A avgt 15 4.409 ? 0.024 ns/op
ArrayCopyUnalignedBoth.testLong 5 N/A avgt 15 5.086 ? 0.134 ns/op
ArrayCopyUnalignedBoth.testLong 10 N/A avgt 15 5.069 ? 0.022 ns/op
ArrayCopyUnalignedBoth.testLong 20 N/A avgt 15 5.788 ? 0.087 ns/op
ArrayCopyUnalignedBoth.testLong 70 N/A avgt 15 11.149 ? 0.182 ns/op
ArrayCopyUnalignedBoth.testLong 150 N/A avgt 15 22.461 ? 0.284 ns/op
ArrayCopyUnalignedBoth.testLong 300 N/A avgt 15 36.353 ? 0.272 ns/op
ArrayCopyUnalignedBoth.testLong 600 N/A avgt 15 47.568 ? 2.050 ns/op
ArrayCopyUnalignedBoth.testLong 1200 N/A avgt 15 80.643 ? 3.747 ns/op
ArrayCopyUnalignedDst.testByte 1 N/A avgt 15 2.344 ? 0.004 ns/op
ArrayCopyUnalignedDst.testByte 10 N/A avgt 15 2.362 ? 0.043 ns/op
ArrayCopyUnalignedDst.testByte 150 N/A avgt 15 5.922 ? 0.066 ns/op
ArrayCopyUnalignedDst.testByte 1200 N/A avgt 15 19.768 ? 0.177 ns/op
ArrayCopyUnalignedDst.testChar 1 N/A avgt 15 2.358 ? 0.032 ns/op
ArrayCopyUnalignedDst.testChar 10 N/A avgt 15 2.379 ? 0.056 ns/op
ArrayCopyUnalignedDst.testChar 150 N/A avgt 15 9.497 ? 0.181 ns/op
ArrayCopyUnalignedDst.testChar 1200 N/A avgt 15 36.580 ? 0.067 ns/op
ArrayCopyUnalignedDst.testInt 1 N/A avgt 15 4.412 ? 0.106 ns/op
ArrayCopyUnalignedDst.testInt 10 N/A avgt 15 5.082 ? 0.130 ns/op
ArrayCopyUnalignedDst.testInt 150 N/A avgt 15 13.638 ? 0.262 ns/op
ArrayCopyUnalignedDst.testInt 1200 N/A avgt 15 56.724 ? 0.247 ns/op
ArrayCopyUnalignedDst.testLong 1 N/A avgt 15 4.435 ? 0.113 ns/op
ArrayCopyUnalignedDst.testLong 10 N/A avgt 15 5.102 ? 0.095 ns/op
ArrayCopyUnalignedDst.testLong 150 N/A avgt 15 20.762 ? 0.388 ns/op
ArrayCopyUnalignedDst.testLong 1200 N/A avgt 15 77.408 ? 2.771 ns/op
ArrayCopyUnalignedSrc.testByte 1 N/A avgt 15 2.346 ? 0.009 ns/op
ArrayCopyUnalignedSrc.testByte 10 N/A avgt 15 2.367 ? 0.053 ns/op
ArrayCopyUnalignedSrc.testByte 150 N/A avgt 15 5.953 ? 0.120 ns/op
ArrayCopyUnalignedSrc.testByte 1200 N/A avgt 15 16.899 ? 0.277 ns/op
ArrayCopyUnalignedSrc.testChar 1 N/A avgt 15 2.375 ? 0.054 ns/op
ArrayCopyUnalignedSrc.testChar 10 N/A avgt 15 2.348 ? 0.005 ns/op
ArrayCopyUnalignedSrc.testChar 150 N/A avgt 15 9.559 ? 0.217 ns/op
ArrayCopyUnalignedSrc.testChar 1200 N/A avgt 15 31.406 ? 0.389 ns/op
ArrayCopyUnalignedSrc.testInt 1 N/A avgt 15 4.372 ? 0.023 ns/op
ArrayCopyUnalignedSrc.testInt 10 N/A avgt 15 5.071 ? 0.110 ns/op
ArrayCopyUnalignedSrc.testInt 150 N/A avgt 15 12.627 ? 0.379 ns/op
ArrayCopyUnalignedSrc.testInt 1200 N/A avgt 15 54.595 ? 0.281 ns/op
ArrayCopyUnalignedSrc.testLong 1 N/A avgt 15 4.415 ? 0.043 ns/op
ArrayCopyUnalignedSrc.testLong 10 N/A avgt 15 5.058 ? 0.065 ns/op
ArrayCopyUnalignedSrc.testLong 150 N/A avgt 15 20.759 ? 0.256 ns/op
ArrayCopyUnalignedSrc.testLong 1200 N/A avgt 15 78.106 ? 2.320 ns/op
Most of the differences are within the error range; a few are a little bigger.
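One rough way to make "within the error range" concrete is to treat each JMH row as the interval Score ± Error and check whether the baseline and PR intervals overlap. The sketch below is only an illustration: `JmhInterval` and `overlaps` are hypothetical names, and interval overlap is a coarse screen, not a significance test.

```java
/** Hypothetical helper: a JMH row viewed as the interval [score - error, score + error]. */
final class JmhInterval {
    final double lo;
    final double hi;

    JmhInterval(double score, double error) {
        this.lo = score - error;
        this.hi = score + error;
    }

    /** True when the two intervals share at least one point. */
    boolean overlaps(JmhInterval other) {
        return lo <= other.hi && other.lo <= hi;
    }
}
```

With numbers copied from the tables above, ArrayCopyAligned.testByte at length 150 (6.081 ± 0.140 vs 5.905 ± 0.073) overlaps, while ArrayCopyUnalignedBoth.testLong at length 1200 (75.720 ± 0.607 vs 80.643 ± 3.747) does not, which is the kind of row that is "a little bigger".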
__ movq(temp2, temp1);
__ shlq(temp2, shift);
__ cmpq(temp2, large_threshold);
__ jcc(Assembler::greaterEqual, L_copy_large);
I suspect the additional checks for the 2.5 MB array size may hurt performance for other, more common sizes.
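For readers following along, the four quoted instructions amount to a byte-count comparison. The Java model below is a sketch under assumptions: `CopySizeDispatch` and `takesLargePath` are hypothetical names, and the threshold constant is the "2.5MB" from the comment above read as 2.5 MiB, not a value taken from the stub source.

```java
/** Hypothetical Java model of the size check quoted above. */
final class CopySizeDispatch {
    // Assumption: "2.5MB" interpreted as 2.5 * 1024 * 1024 bytes.
    static final long LARGE_THRESHOLD_BYTES = 2_621_440L;

    /** shift is log2 of the element size: 0 for byte, 1 for char/short, 2 for int, 3 for long. */
    static boolean takesLargePath(long elementCount, int shift) {
        long byteCount = elementCount << shift;      // movq + shlq
        return byteCount >= LARGE_THRESHOLD_BYTES;   // cmpq + jcc(greaterEqual)
    }
}
```

Every copy that reaches this check pays for the shift and compare, even when it falls through to the existing path, which is the overhead being questioned here.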
Comparing several runs of the XorTest.copy SMALL (100K) benchmark, baseline against PR, I see an average slowdown of 1.7% (7.566 ms/op vs 7.696 ms/op).
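The 1.7% figure can be reproduced from the two averages quoted above. A minimal sketch, where `RelativeChange` and `percentChange` are hypothetical names introduced only for this arithmetic:

```java
/** Hypothetical helper for comparing two time-per-operation scores. */
final class RelativeChange {
    /** Percentage change from baseline to candidate; positive means the candidate takes longer. */
    static double percentChange(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }
}
```

Calling `RelativeChange.percentChange(7.566, 7.696)` gives roughly 1.7, matching the slowdown stated above.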
__ evmovntdquq(Address(dst, index, scale, offset), xmm1, Assembler::AVX_512bit);
__ evmovntdquq(Address(dst, index, scale, offset + 0x40), xmm2, Assembler::AVX_512bit);
__ evmovntdquq(Address(dst, index, scale, offset + 0x80), xmm3, Assembler::AVX_512bit);
__ evmovntdquq(Address(dst, index, scale, offset + 0xC0), xmm4, Assembler::AVX_512bit);
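The four stores at offsets 0x00, 0x40, 0x80 and 0xC0 cover one 256-byte block with four 64-byte vectors. The Java model below sketches only that blocking structure; Java has no non-temporal store, so `System.arraycopy` stands in for each 64-byte load/store pair, and `BlockCopy256` is a hypothetical name, not code from the PR.

```java
/** Hypothetical model of a copy loop that moves 256 bytes per iteration as four 64-byte chunks. */
final class BlockCopy256 {
    static final int VECTOR_BYTES = 64;               // one 512-bit register
    static final int BLOCK_BYTES = 4 * VECTOR_BYTES;  // offsets 0x00, 0x40, 0x80, 0xC0

    /** Copies len bytes and returns how many full 256-byte blocks the main loop handled. */
    static int copy(byte[] src, byte[] dst, int len) {
        int index = 0;
        int blocks = 0;
        while (len - index >= BLOCK_BYTES) {
            for (int offset = 0; offset < BLOCK_BYTES; offset += VECTOR_BYTES) {
                // Stands in for one 64-byte load plus one 64-byte store.
                System.arraycopy(src, index + offset, dst, index + offset, VECTOR_BYTES);
            }
            index += BLOCK_BYTES;
            blocks++;
        }
        // Remainder shorter than one block; handled separately from the main loop.
        System.arraycopy(src, index, dst, index, len - index);
        return blocks;
    }
}
```

For a 1000-byte copy this model runs three full blocks (768 bytes) and leaves a 232-byte remainder for the tail step.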
These are non-temporal memory moves. To force eviction from the write-combining buffers we may need to emit additional fences; otherwise a subsequent read from the destination memory may see incorrect values.
@jatin-bhateja There is an sfence at line 781.
Do you see any concerns in the multithreaded case, where the writer is busy copying 256-byte blocks in a loop and a reader tries to access a location that has not yet been flushed out of the write-combining buffer?
The results a concurrent reader sees could be different if the copy is using nt writes, but if the read of the destination is not synced with the copy operation, I think the reader would not see consistent state in either case. Is it worse with nt writes?
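At the Java level, the point can be sketched as follows: a reader is only entitled to a consistent view of the destination after some happens-before edge with the copy, for example `Thread.join()`. This is a minimal illustration with a hypothetical class name; it says nothing about the hardware side (the sfence after the NT stores), which is covered by the comments above.

```java
import java.util.Arrays;

/** Hypothetical demo: read the destination only after synchronizing with the copying thread. */
final class PublishAfterCopy {
    static boolean copyThenRead(int length) {
        byte[] src = new byte[length];
        Arrays.fill(src, (byte) 0x5A);
        byte[] dst = new byte[length];

        Thread writer = new Thread(() -> System.arraycopy(src, 0, dst, 0, length));
        writer.start();
        // Reading dst at this point would be a data race: the reader could observe any
        // mix of old and new bytes, regardless of which store instructions the copy uses.
        try {
            writer.join();   // establishes happens-before between the copy and the read below
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the copy", e);
        }
        return Arrays.equals(src, dst);
    }
}
```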
Thanks for the clarification. I agree the behavior is similar to the non-NT case; in fact, using NT stores for huge copy operations will avoid polluting the caches with destination cache line fills.
But won't it also cause performance regressions in the common case where the caller needs to use the destination array?
One component of the included XorTest.xor benchmark is to read the bytes from two copied arrays. See line 155 in libjnitest.c.
The NT stores are only used in the FOREIGN LARGE case, and it shows a net speedup, ~123 ms -> 104 ms.
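@steveatgh this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:
git checkout memcpy
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
Thanks, I re-submitted testing.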
<<<<<<< HEAD
void copy(int count, byte[] src, int sOff, byte[] dst, int dOff, int len);
=======
>>>>>>> 9727f4bdddc071e6f59806087339f345405ab004
You have multiple merge conflicts in the micro benchmark files.
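Leftover markers like the ones quoted above are easy to miss by eye. As a hedged illustration, a scan along these lines would flag them; `ConflictMarkerScan` is a hypothetical helper, not part of the JDK build or of this PR.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds lines that look like unresolved Git conflict markers. */
final class ConflictMarkerScan {
    /** Returns the 1-based numbers of lines that are conflict markers. */
    static List<Integer> markerLines(String text) {
        List<Integer> hits = new ArrayList<>();
        String[] lines = text.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            boolean marker = line.startsWith("<<<<<<< ")
                    || line.startsWith(">>>>>>> ")
                    || line.equals("=======");
            if (marker) {
                hits.add(i + 1);
            }
        }
        return hits;
    }
}
```

On the four-line snippet above this reports lines 1, 3 and 4. In practice `git diff --check` performs a similar check on changes before they are pushed.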
Sorry, not sure how I missed the conflicts. They should be resolved now. Thanks!
Hi @steveatgh, the x86 code changes look good to me.
@steveatgh This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 54 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@TobiHartmann, @jatin-bhateja, @sviswa7) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type
- remove ::copy test from XorTest
Previous commit (fcbbc0d) added org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark
The micros:java.lang.foreign.xor.XorTest::xor benchmark results shown in the introductory comment above used XorTest code from PR commit 7cc272e, which was based on Maurizio Cimadamore's commit a788f06. The XorTest has since been updated, and XorTest::copy is no longer needed and has been removed from this pull request.
Performance can be evaluated using both the new XorTest and a new org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark added to this PR. Results from these two benchmarks are shown below:
In the ArrayCopyAlignedLarge.testByte benchmark below, the PR code is active in sizes 5MB and 10MB.
In the XorTest::xor benchmark below, the PR code is active in 3 of the LARGE case runs: FOREIGN_NO_INIT, FOREIGN_INIT, and UNSAFE.
void StubGenerator::copy256_avx3(Register dst, Register src, Register index, XMMRegister xmm1,
                                 XMMRegister xmm2, XMMRegister xmm3, XMMRegister xmm4,
                                 bool conjoint, int shift, int offset) {
The conjoint parameter is not used so could be removed from this function.
Thanks, done.
__ jcc(Assembler::less, L_tail_large);

__ BIND(L_main_pre_loop_large);
__ subq(temp1, loop_size[shift]); // whay is this here
Spurious comment.
Thanks, done
Label L_main_pre_loop_large;
Label L_pre_main_post_large;

if (MaxVectorSize == 64) {
This should be an assert instead of an if check, as this method shouldn't be called if MaxVectorSize is < 64.
Thanks, done.
/* T_LONG */ { 8, 16 , 24 , 32}
};

if (MaxVectorSize == 64) {
This should be an assert instead of an if check, as this method shouldn't be called if MaxVectorSize is < 64.
Thanks, done.
__ shrq(temp2, shift);
}
__ movq(temp3, temp2);
copy64_masked_avx(to, from, xmm1, k2, temp3, temp4, temp1, shift, 0);
The last argument should be "true" or "1" instead of "0" or "false", because temp3 (length) could also be less than 32. That case is only handled when the use64byteVector argument is true.
Thanks, done.
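For readers following this thread, here is a scalar model of why a single 64-lane masked move can cope with lengths below 32. It is only a sketch: `tail_mask` and `masked_copy64` are hypothetical names, and the assumption that the 64-byte path builds its mask by keeping the low `length` bits of an all-ones value (what a BZHI on -1 would produce) is mine, not something stated in this review.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns a mask with the low `count` bits set,
// mirroring what BZHI applied to an all-ones value would yield.
// Counts of 64 or more keep every bit (and avoid an undefined shift).
inline std::uint64_t tail_mask(unsigned count) {
  return count >= 64 ? ~std::uint64_t{0}
                     : (std::uint64_t{1} << count) - 1;
}

// Hypothetical scalar model of a 64-lane masked byte move:
// only lanes whose mask bit is set are read and written.
inline void masked_copy64(unsigned char* dst, const unsigned char* src,
                          unsigned count) {
  const std::uint64_t mask = tail_mask(count);
  for (unsigned lane = 0; lane < 64; ++lane) {
    if ((mask >> lane) & 1) {
      dst[lane] = src[lane];
    }
  }
}
```

With one mask covering all 64 lanes, any length from 0 to 64 goes through the same path, which is consistent with the reviewer's point that short lengths need the 64-byte variant.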
arraycopy_avx3_special_cases_256(xmm1, k2, from, to, temp1, shift,
                                 temp4, temp3, L_entry_large, L_exit_large);
When we get here to arraycopy_avx3_special_cases_256, at most 256 bytes remain to be copied, so we don't need to go back to L_entry_large.
Thanks, done
- use asserts instead of conditionals in two logically unreachable blocks - remove unused function parameters - use 64-byte vector path in pre-loop masked write
Thanks a lot for taking care of all the review comments. The PR looks good to me now.
Correctness and performance testing passed.
/integrate |
@steveatgh |
/sponsor |
Going to push as commit 82967f4.
Your commit was automatically rebased without conflicts. |
@sviswa7 @steveatgh Pushed as commit 82967f4. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
Update: the XorTest::xor results shown in this message used test code from PR commit 7cc272e which was based on Maurizio Cimadamore's commit a788f06. The XorTest has since been updated and XorTest::copy is no longer needed and has been removed from this pull request. See comment here for updated performance data.
Below is baseline data collected using a modified version of the java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug report. I collected data on an Ubuntu 22.04 laptop with a Tiger Lake i7-1185G7, which supports AVX-512.
The 'copy' benchmark was added to measure the memory copy components of the 'xor' benchmark, separate from the memory allocation and xor data update components.
Profile data for the baseline REGION LARGE case shows two hotspots covering about 90% of cycles:
The baseline FOREIGN LARGE case shows three hotspots covering about 90% of cycles:
This PR optimizes the jlong_disjoint_arraycopy_avx3 code. The Copy::fill_to_memory_atomic hotspot (which I believe is associated with the benchmark's per-op off-heap buffer allocation) is not optimized here. The avx3 array copy code is optimized by increasing the loop granularity from 192 to 256 bytes, adding source address prefetches, and using non-temporal writes with a store fence. The optimized code is only used for copies larger than a set threshold, currently 2.5MB. This is the size at which the optimized code was observed to be faster than the original code. The profile data with optimization is:
The optimization brings the cycles for the mem copy work roughly to parity with the REGION LARGE case. Benchmark data for the optimized case:
I am very much looking forward to contributing to OpenJDK! Please review this PR and let me know how it can be improved.
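For readers who want a feel for the technique described above outside the stub generator, here is a minimal C++ sketch of the same idea: below a size threshold fall back to the ordinary copy, above it run a 256-byte loop with source prefetches and non-temporal stores, then issue a store fence. This is not the HotSpot code. The function name, the constant names, the 1024-byte prefetch distance, and the reading of 2.5MB as 2.5 * 2^20 bytes are all assumptions for illustration, and the sketch uses 16-byte SSE2 streaming stores (with a plain `memcpy` fallback on other targets) rather than AVX-512.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_loadu_si128, _mm_stream_si128, _mm_prefetch, _mm_sfence
#endif

// Assumed values: threshold and loop size follow the PR description,
// the prefetch distance is a guess made for this sketch.
constexpr std::size_t kLargeCopyThreshold = 2621440;  // 2.5 * 1024 * 1024
constexpr std::size_t kLoopBytes = 256;
constexpr std::size_t kPrefetchAhead = 1024;

// Hypothetical helper: copies n bytes between non-overlapping regions.
void copy_large_disjoint(void* dst_v, const void* src_v, std::size_t n) {
  auto* dst = static_cast<unsigned char*>(dst_v);
  const auto* src = static_cast<const unsigned char*>(src_v);
  const auto d = reinterpret_cast<std::uintptr_t>(dst);
  const auto s = reinterpret_cast<std::uintptr_t>(src);
  assert((d + n <= s || s + n <= d) && "regions must be disjoint");
  (void)s;

  // Small copies keep the regular, cache-friendly path.
  if (n < kLargeCopyThreshold) {
    std::memcpy(dst, src, n);
    return;
  }
#if defined(__SSE2__)
  // Pre-loop: align the destination to 16 bytes for streaming stores.
  const std::size_t head = (16 - (d & 15)) & 15;
  std::memcpy(dst, src, head);
  dst += head; src += head; n -= head;

  while (n >= kLoopBytes) {
    // Prefetch the source ahead of the loads, staying inside the buffer.
    if (n >= kPrefetchAhead + kLoopBytes) {
      for (std::size_t off = 0; off < kLoopBytes; off += 64) {
        _mm_prefetch(reinterpret_cast<const char*>(src + kPrefetchAhead + off),
                     _MM_HINT_T0);
      }
    }
    // 256 bytes per iteration, written with non-temporal stores.
    for (std::size_t off = 0; off < kLoopBytes; off += 16) {
      const __m128i v =
          _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + off));
      _mm_stream_si128(reinterpret_cast<__m128i*>(dst + off), v);
    }
    dst += kLoopBytes; src += kLoopBytes; n -= kLoopBytes;
  }
  _mm_sfence();  // order the streaming stores before later stores
#endif
  std::memcpy(dst, src, n);  // tail (or the whole copy without SSE2)
}
```

The threshold exists because non-temporal stores bypass the cache: that helps when the destination is too large to be reused from cache anyway, and tends to hurt for small copies whose data is read again soon after.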
Progress
Issue
Reviewers
Contributors
<mcimadamore@openjdk.org>
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16575/head:pull/16575
$ git checkout pull/16575
Update a local copy of the PR:
$ git checkout pull/16575
$ git pull https://git.openjdk.org/jdk.git pull/16575/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 16575
View PR using the GUI difftool:
$ git pr show -t 16575
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16575.diff
Webrev
Link to Webrev Comment