8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy #16575

Closed
wants to merge 11 commits into from

steveatgh
Contributor

@steveatgh steveatgh commented Nov 8, 2023

Update: the XorTest::xor results shown in this message used test code from PR commit 7cc272e which was based on Maurizio Cimadamore's commit a788f06. The XorTest has since been updated and XorTest::copy is no longer needed and has been removed from this pull request. See comment here for updated performance data.

Below is baseline data collected using a modified version of the java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug report. I collected data on an Ubuntu 22.04 laptop with a Tiger Lake i7-1185G7, which supports AVX-512.

Baseline data
Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score          Error  Units
--------------------------------------------------------------------------------------
XorTest.copy     ELEMENTS       SMALL  avgt   30   584737355.767 ± 60414308.540  ns/op
XorTest.copy     ELEMENTS      MEDIUM  avgt   30   272248995.683 ±  2924954.498  ns/op
XorTest.copy     ELEMENTS       LARGE  avgt   30  1019200210.900 ± 28334453.652  ns/op
XorTest.copy       REGION       SMALL  avgt   30     7399944.164 ±   216821.819  ns/op
XorTest.copy       REGION      MEDIUM  avgt   30    20591454.558 ±   147398.572  ns/op
XorTest.copy       REGION       LARGE  avgt   30    21649266.051 ±   179263.875  ns/op
XorTest.copy     CRITICAL       SMALL  avgt   30       51079.357 ±      542.482  ns/op
XorTest.copy     CRITICAL      MEDIUM  avgt   30        2496.961 ±       11.375  ns/op
XorTest.copy     CRITICAL       LARGE  avgt   30         515.454 ±        5.831  ns/op
XorTest.copy      FOREIGN       SMALL  avgt   30     7558432.075 ±    79489.276  ns/op
XorTest.copy      FOREIGN      MEDIUM  avgt   30    19730666.341 ±   500505.099  ns/op
XorTest.copy      FOREIGN       LARGE  avgt   30    34616758.085 ±   340300.726  ns/op
XorTest.xor      ELEMENTS       SMALL  avgt   30   219832692.489 ±  2329417.319  ns/op
XorTest.xor      ELEMENTS      MEDIUM  avgt   30   505138197.167 ±  3818334.424  ns/op
XorTest.xor      ELEMENTS       LARGE  avgt   30  1189608474.667 ±  5877981.900  ns/op
XorTest.xor        REGION       SMALL  avgt   30    64093872.804 ±   599704.491  ns/op
XorTest.xor        REGION      MEDIUM  avgt   30    81544576.454 ±  1406342.118  ns/op
XorTest.xor        REGION       LARGE  avgt   30    90091424.883 ±   775577.613  ns/op
XorTest.xor      CRITICAL       SMALL  avgt   30    57231375.744 ±   438223.342  ns/op
XorTest.xor      CRITICAL      MEDIUM  avgt   30    58583884.930 ±   375355.215  ns/op
XorTest.xor      CRITICAL       LARGE  avgt   30    60644832.949 ±   588120.738  ns/op
XorTest.xor       FOREIGN       SMALL  avgt   30    73868679.405 ±   819965.524  ns/op
XorTest.xor       FOREIGN      MEDIUM  avgt   30    88156275.944 ±  1051257.152  ns/op
XorTest.xor       FOREIGN       LARGE  avgt   30   123115513.182 ±  1287935.621  ns/op

The 'copy' benchmark was added to measure the memory copy components of the 'xor' benchmark, separate from the memory allocation and xor data update components.

Profile data for the baseline REGION LARGE case shows two hotspots covering about 90% of cycles:

Baseline REGION LARGE (r231)
Function                        CPU Time    Clockticks      Instructions Retired    CPI Rate
--------------------------------------------------------------------------------------------
xor_op                          63.7%       18,189,000,000  52,464,000,000          0.347   
__memcpy_evex_unaligned_erms    28.5%        7,608,000,000   3,459,000,000          2.199  

The baseline FOREIGN LARGE case shows three hotspots covering about 90% of cycles:

Baseline FOREIGN LARGE (r226)
Function                        CPU Time    Clockticks      Instructions Retired    CPI Rate
--------------------------------------------------------------------------------------------
xor_op                          46.4%       18,345,000,000  52,476,000,000          0.350   
jlong_disjoint_arraycopy_avx3   29.3%       11,124,000,000   1,404,000,000          7.923   
Copy::fill_to_memory_atomic     15.3%        5,016,000,000   8,010,000,000          0.626   

This PR optimizes the jlong_disjoint_arraycopy_avx3 code. The Copy::fill_to_memory_atomic hotspot (which I believe is associated with the benchmark's per-op off-heap buffer allocation) is not optimized here. The avx3 array copy code is optimized by increasing the loop granularity from 192 to 256 bytes, adding source address prefetches, and using non-temporal writes with a store fence. The optimized code is only used for copies larger than a set threshold, currently 2.5MB, which is the size at which the optimized code was observed to be faster than the original code. The profile data with optimization is:

Optimized FOREIGN LARGE (r277)
Function                        CPU Time    Clockticks      Instructions Retired    CPI Rate
--------------------------------------------------------------------------------------------
xor_op                          51.2%       18,153,000,000  52,404,000,000          0.346   
jlong_disjoint_arraycopy_avx3   22.4%        7,581,000,000   2,364,000,000          3.207   
Copy::fill_to_memory_atomic     16.3%        5,316,000,000   7,917,000,000          0.671   
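
As a quick cross-check of the profile tables, the CPI Rate column can be recomputed as Clockticks divided by Instructions Retired. The sketch below is illustrative only: the class and method names are hypothetical, and the inputs are copied from the tables in this description.

```java
// Sketch: recompute the "CPI Rate" column from the Clockticks and
// Instructions Retired columns of the profile tables in this description.
class CpiCheck {
    // CPI = clockticks / instructions retired.
    static double cpi(long clockticks, long instructionsRetired) {
        return (double) clockticks / instructionsRetired;
    }

    public static void main(String[] args) {
        // jlong_disjoint_arraycopy_avx3, FOREIGN LARGE, baseline and optimized.
        double baseline = cpi(11_124_000_000L, 1_404_000_000L);
        double optimized = cpi(7_581_000_000L, 2_364_000_000L);
        // __memcpy_evex_unaligned_erms, REGION LARGE, baseline.
        double memcpy = cpi(7_608_000_000L, 3_459_000_000L);

        System.out.printf("avx3 baseline  CPI: %.3f%n", baseline);
        System.out.printf("avx3 optimized CPI: %.3f%n", optimized);
        System.out.printf("memcpy         CPI: %.3f%n", memcpy);

        // Clockticks in the optimized stub relative to the memcpy hotspot.
        System.out.printf("avx3/memcpy clockticks: %.3f%n",
                7_581_000_000.0 / 7_608_000_000.0);
    }
}
```

The optimized stub retires more instructions than the baseline stub but takes fewer clockticks, which is why its CPI drops.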

The optimization brings the cycles for the memory copy work roughly to parity with the REGION LARGE case. Benchmark data for the optimized case:

Optimized data
Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score         Error  Units
XorTest.copy     ELEMENTS       SMALL  avgt   30   551072938.467 ± 4287149.108  ns/op
XorTest.copy     ELEMENTS      MEDIUM  avgt   30   272304419.633 ± 2993793.130  ns/op
XorTest.copy     ELEMENTS       LARGE  avgt   30  1013925081.233 ± 8590245.238  ns/op
XorTest.copy       REGION       SMALL  avgt   30     7472329.003 ±   77394.114  ns/op
XorTest.copy       REGION      MEDIUM  avgt   30    19882540.205 ±  349544.602  ns/op
XorTest.copy       REGION       LARGE  avgt   30    21185593.636 ±  404369.655  ns/op
XorTest.copy     CRITICAL       SMALL  avgt   30       52358.715 ±    1382.355  ns/op
XorTest.copy     CRITICAL      MEDIUM  avgt   30        2525.108 ±      22.396  ns/op
XorTest.copy     CRITICAL       LARGE  avgt   30         528.865 ±      11.747  ns/op
XorTest.copy      FOREIGN       SMALL  avgt   30     7748587.890 ±   67352.844  ns/op
XorTest.copy      FOREIGN      MEDIUM  avgt   30    19401977.378 ±  256247.071  ns/op
XorTest.copy      FOREIGN       LARGE  avgt   30    21519594.325 ±  124712.980  ns/op
XorTest.xor      ELEMENTS       SMALL  avgt   30   221049328.389 ± 2629557.148  ns/op
XorTest.xor      ELEMENTS      MEDIUM  avgt   30   503362446.150 ± 3759664.343  ns/op
XorTest.xor      ELEMENTS       LARGE  avgt   30  1186563496.067 ± 5135607.671  ns/op
XorTest.xor        REGION       SMALL  avgt   30    88402928.083 ±  790941.309  ns/op
XorTest.xor        REGION      MEDIUM  avgt   30    80041519.052 ±  597221.491  ns/op
XorTest.xor        REGION       LARGE  avgt   30    87706448.917 ±  751350.609  ns/op
XorTest.xor      CRITICAL       SMALL  avgt   30    56869387.315 ±  408618.338  ns/op
XorTest.xor      CRITICAL      MEDIUM  avgt   30    59041245.745 ±  820141.039  ns/op
XorTest.xor      CRITICAL       LARGE  avgt   30    60433672.443 ±  500954.831  ns/op
XorTest.xor       FOREIGN       SMALL  avgt   30    72838421.976 ±  410147.170  ns/op
XorTest.xor       FOREIGN      MEDIUM  avgt   30    87970109.478 ± 1058857.783  ns/op
XorTest.xor       FOREIGN       LARGE  avgt   30   103970690.407 ± 1033001.637  ns/op
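
To put the FOREIGN LARGE rows of the two tables side by side, the relative change in ns/op can be computed directly from the reported scores. This is only a sketch: the class and method names are hypothetical, and it ignores the reported error margins.

```java
// Sketch: relative reduction in average time per op between the
// baseline and optimized tables, using the FOREIGN LARGE scores.
class ScoreDelta {
    // Fraction by which the optimized score is lower than the baseline
    // (positive means the optimized run is faster, since Mode is avgt).
    static double reduction(double baselineNsPerOp, double optimizedNsPerOp) {
        return (baselineNsPerOp - optimizedNsPerOp) / baselineNsPerOp;
    }

    public static void main(String[] args) {
        double copy = reduction(34_616_758.085, 21_519_594.325); // XorTest.copy FOREIGN LARGE
        double xor = reduction(123_115_513.182, 103_970_690.407); // XorTest.xor FOREIGN LARGE

        System.out.printf("copy FOREIGN LARGE: %.1f%% lower ns/op%n", copy * 100);
        System.out.printf("xor  FOREIGN LARGE: %.1f%% lower ns/op%n", xor * 100);
    }
}
```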

I am very much looking forward to contributing to OpenJDK! Please review this PR and let me know how it can be improved.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy (Enhancement - P3)

Reviewers

Contributors

  • Maurizio Cimadamore <mcimadamore@openjdk.org>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16575/head:pull/16575
$ git checkout pull/16575

Update a local copy of the PR:
$ git checkout pull/16575
$ git pull https://git.openjdk.org/jdk.git pull/16575/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 16575

View PR using the GUI difftool:
$ git pr show -t 16575

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16575.diff

Webrev

Link to Webrev Comment

…rate_disjoint_copy_avx3_masked

  - add src address prefetches
  - switch to non-temporal writes
  - added modified jmh benchmark based on xor benchmark from Maurizio Cimadamore
@bridgekeeper bridgekeeper bot added the oca Needs verification of OCA signatory status label Nov 8, 2023
@bridgekeeper

bridgekeeper bot commented Nov 8, 2023

Hi @steveatgh, welcome to this OpenJDK project and thanks for contributing!

We do not recognize you as Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing /signed in a comment in this pull request.

If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user steveatgh" as summary for the issue.

If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing /covered in a comment in this pull request.

@openjdk

openjdk bot commented Nov 8, 2023

@steveatgh The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added hotspot-compiler hotspot-compiler-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels Nov 8, 2023
@steveatgh steveatgh changed the title Re: JDK-8310159 - optimize StubGenerator::generate_disjoint_copy_avx3_masked 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy Nov 8, 2023
@openjdk openjdk bot changed the title 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy Nov 8, 2023
@steveatgh
Contributor Author

/covered

@steveatgh
Contributor Author

I'm part of the Intel Java team

@bridgekeeper bridgekeeper bot added the oca-verify Needs verification of OCA signatory status label Nov 8, 2023
@bridgekeeper

bridgekeeper bot commented Nov 8, 2023

Thank you! Please allow for a few business days to verify that your employer has signed the OCA. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated!

@steveatgh
Contributor Author

/contributor add @mcimadamore

@openjdk

openjdk bot commented Nov 8, 2023

@steveatgh
Contributor Maurizio Cimadamore <mcimadamore@openjdk.org> successfully added.

@bridgekeeper bridgekeeper bot removed oca Needs verification of OCA signatory status oca-verify Needs verification of OCA signatory status labels Nov 9, 2023
- fix xor test foreign impl constructor signature
@steveatgh steveatgh marked this pull request as ready for review November 10, 2023 00:35
@openjdk openjdk bot added the rfr Pull request is ready for review label Nov 10, 2023
@mlbridge

mlbridge bot commented Nov 10, 2023

Webrevs

Member

@TobiHartmann TobiHartmann left a comment

I submitted some quick testing and I'm seeing the following failure with multiple tests:

# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/workspace/open/src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp:1201), pid=24136, tid=24139
#  assert(MaxVectorSize == 64) failed: vector length != 64
#
# JRE version:  (22.0) (fastdebug build )
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 22-internal-2023-11-13-0750559.tobias.hartmann.jdk2, mixed mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x16c00e6]  StubGenerator::copy64_masked_avx(Register, Register, XMMRegister, KRegister, Register, Register, Register, int, int, bool)+0x366

Stack: [0x00007f0b5e919000,0x00007f0b5ea1a000],  sp=0x00007f0b5ea17150,  free space=1016k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x16c00e6]  StubGenerator::copy64_masked_avx(Register, Register, XMMRegister, KRegister, Register, Register, Register, int, int, bool)+0x366  (stubGenerator_x86_64_arraycopy.cpp:1201)
V  [libjvm.so+0x16c0ecd]  StubGenerator::arraycopy_avx3_special_cases_256(XMMRegister, KRegister, Register, Register, Register, int, Register, Register, Label&, Label&)+0x19d  (stubGenerator_x86_64_arraycopy.cpp:1055)
V  [libjvm.so+0x16c16c1]  StubGenerator::arraycopy_avx3_large(Register, Register, Register, Register, Register, Register, Register, XMMRegister, XMMRegister, XMMRegister, XMMRegister, int)+0x3f1  (stubGenerator_x86_64_arraycopy.cpp:790)
V  [libjvm.so+0x16c22f0]  StubGenerator::generate_disjoint_copy_avx3_masked(unsigned char**, char const*, int, bool, bool, bool)+0xa90  (stubGenerator_x86_64_arraycopy.cpp:728)
V  [libjvm.so+0x16c4b85]  StubGenerator::generate_disjoint_byte_copy(bool, unsigned char**, char const*)+0x965  (stubGenerator_x86_64_arraycopy.cpp:1277)
V  [libjvm.so+0x16cb309]  StubGenerator::generate_arraycopy_stubs()+0x29  (stubGenerator_x86_64_arraycopy.cpp:88)
V  [libjvm.so+0x16a1089]  StubGenerator::generate_final_stubs()+0xb9  (stubGenerator_x86_64.cpp:4051)
V  [libjvm.so+0x16a22a5]  StubGenerator_generate(CodeBuffer*, StubCodeGenerator::StubsKind)+0x105  (stubGenerator_x86_64.cpp:4296)
V  [libjvm.so+0x16f349e]  initialize_stubs(StubCodeGenerator::StubsKind, int, int, char const*, char const*, char const*)+0x13e  (stubRoutines.cpp:241)
V  [libjvm.so+0x16f500d]  final_stubs_init()+0x3d  (stubRoutines.cpp:288)
V  [libjvm.so+0xe30c59]  init_globals2()+0x69  (init.cpp:180)
V  [libjvm.so+0x17b9151]  Threads::create_vm(JavaVMInitArgs*, bool*)+0x311  (threads.cpp:569)
V  [libjvm.so+0xf937e4]  JNI_CreateJavaVM+0x54  (jni.cpp:3576)
C  [libjli.so+0x419f]  JavaMain+0x8f  (java.c:1522)
C  [libjli.so+0x7c39]  ThreadJavaMain+0x9  (java_md.c:650)

For example, with compiler/arraycopy/TestArrayCopyConjoint.java and -XX:-UseTLAB.

@steveatgh
Contributor Author

steveatgh commented Nov 14, 2023

Thank you @TobiHartmann for the feedback. I'm working on the issue.

Can you tell me what kind of machine you tested with?

Comment on lines 582 to 585
__ movq(temp2, temp1);
__ shlq(temp2, shift);
__ cmpq(temp2, large_threshold);
__ jcc(Assembler::greaterEqual, L_copy_large);
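
For readers less familiar with the stub generator macros, the four lines above appear to scale a count by the element size and branch to the large-copy path when the result reaches the threshold. A rough Java rendering follows, under stated assumptions: that temp1 holds the element count, that shift is log2 of the element size, and that the "2.5MB" threshold means 2.5 * 1024 * 1024 bytes. The class, method, and constant names are hypothetical and are not taken from the PR.

```java
// Sketch of the size check, not the PR's actual code.
class LargeCopyDispatch {
    // Assumption: "2.5MB" read as 2.5 * 1024 * 1024 bytes; the constant
    // used in the PR may differ.
    static final long LARGE_THRESHOLD_BYTES = 5L * 1024 * 1024 / 2;

    // Assumption: elementCount corresponds to temp1 and shift to
    // log2(element size), e.g. 0 for byte, 3 for long.
    static boolean takesLargePath(long elementCount, int shift) {
        long byteCount = elementCount << shift;      // movq + shlq
        return byteCount >= LARGE_THRESHOLD_BYTES;   // cmpq + jcc(greaterEqual), signed compare
    }

    public static void main(String[] args) {
        // 327,680 longs is exactly 2,621,440 bytes, so it meets the threshold.
        System.out.println(takesLargePath(327_680, 3));
        // One element fewer stays on the existing path.
        System.out.println(takesLargePath(327_679, 3));
    }
}
```

Every copy that reaches this point pays for the extra compare and branch, whether or not it takes the large path.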
Member

Hi @steveatgh, can you please share the performance numbers for the other array copy JMH micros in the following directory: https://github.com/openjdk/jdk/tree/master/test/micro/org/openjdk/bench/java/lang

Member

I would still ask you to run the benchmarks in the above path; we may see performance dips for sizes just past the special cases due to the additional comparisons.

Contributor Author

Here are the results on my Ubuntu laptop running at 3 GHz:

// Baseline
Benchmark                                    (length)  (size)   Mode  Cnt      Score      Error   Units
ArrayCopyObject.conjoint_micro                    N/A      31  thrpt   15  77157.933 ? 1977.467  ops/ms
ArrayCopyObject.conjoint_micro                    N/A      63  thrpt   15  58329.157 ? 1667.574  ops/ms
ArrayCopyObject.conjoint_micro                    N/A     127  thrpt   15  49322.065 ? 2332.342  ops/ms
ArrayCopyObject.conjoint_micro                    N/A    2047  thrpt   15  13895.531 ?  239.300  ops/ms
ArrayCopyObject.conjoint_micro                    N/A    4095  thrpt   15   7926.854 ?  201.238  ops/ms
ArrayCopyObject.conjoint_micro                    N/A    8191  thrpt   15   4289.582 ?   31.734  ops/ms
ArrayCopyObject.disjoint_micro                    N/A      31  thrpt   15  74711.699 ? 2463.378  ops/ms
ArrayCopyObject.disjoint_micro                    N/A      63  thrpt   15  65229.586 ? 1329.809  ops/ms
ArrayCopyObject.disjoint_micro                    N/A     127  thrpt   15  54330.794 ? 2372.868  ops/ms
ArrayCopyObject.disjoint_micro                    N/A    2047  thrpt   15   9338.340 ?  132.987  ops/ms
ArrayCopyObject.disjoint_micro                    N/A    4095  thrpt   15   5035.553 ?  109.679  ops/ms
ArrayCopyObject.disjoint_micro                    N/A    8191  thrpt   15   1192.069 ?   10.765  ops/ms
ArrayCopy.arrayCopy                               N/A     N/A   avgt   15      1.356 ?    0.029   ns/op
ArrayCopy.arrayCopyChar                           N/A     N/A   avgt   15      4.368 ?    0.038   ns/op
ArrayCopy.arrayCopyCharNonConst                   N/A     N/A   avgt   15      4.749 ?    0.113   ns/op
ArrayCopy.arrayCopyLocalArray                     N/A     N/A   avgt   15      0.503 ?    0.001   ns/op
ArrayCopy.arrayCopyNonConst                       N/A     N/A   avgt   15      1.955 ?    0.108   ns/op
ArrayCopy.arrayCopyObject                         N/A     N/A   avgt   15     22.403 ?    0.563   ns/op
ArrayCopy.arrayCopyObjectNonConst                 N/A     N/A   avgt   15     25.188 ?    0.484   ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward       N/A     N/A   avgt   15     17.785 ?    0.781   ns/op
ArrayCopy.arrayCopyObjectSameArraysForward        N/A     N/A   avgt   15     17.347 ?    0.126   ns/op
ArrayCopy.copyLoop                                N/A     N/A   avgt   15      5.189 ?    0.100   ns/op
ArrayCopy.copyLoopLocalArray                      N/A     N/A   avgt   15      3.685 ?    0.085   ns/op
ArrayCopy.copyLoopNonConst                        N/A     N/A   avgt   15      5.436 ?    0.040   ns/op
ArrayCopyAligned.testByte                           1     N/A   avgt   15      2.366 ?    0.028   ns/op
ArrayCopyAligned.testByte                           3     N/A   avgt   15      2.381 ?    0.063   ns/op
ArrayCopyAligned.testByte                           5     N/A   avgt   15      2.362 ?    0.035   ns/op
ArrayCopyAligned.testByte                          10     N/A   avgt   15      2.364 ?    0.048   ns/op
ArrayCopyAligned.testByte                          20     N/A   avgt   15      2.353 ?    0.026   ns/op
ArrayCopyAligned.testByte                          70     N/A   avgt   15      5.214 ?    0.082   ns/op
ArrayCopyAligned.testByte                         150     N/A   avgt   15      6.081 ?    0.140   ns/op
ArrayCopyAligned.testByte                         300     N/A   avgt   15      9.399 ?    0.262   ns/op
ArrayCopyAligned.testByte                         600     N/A   avgt   15     12.710 ?    0.149   ns/op
ArrayCopyAligned.testByte                        1200     N/A   avgt   15     21.873 ?    0.237   ns/op
ArrayCopyAligned.testChar                           1     N/A   avgt   15      2.349 ?    0.014   ns/op
ArrayCopyAligned.testChar                           3     N/A   avgt   15      2.360 ?    0.041   ns/op
ArrayCopyAligned.testChar                           5     N/A   avgt   15      2.359 ?    0.021   ns/op
ArrayCopyAligned.testChar                          10     N/A   avgt   15      2.369 ?    0.042   ns/op
ArrayCopyAligned.testChar                          20     N/A   avgt   15      5.101 ?    0.080   ns/op
ArrayCopyAligned.testChar                          70     N/A   avgt   15      5.961 ?    0.096   ns/op
ArrayCopyAligned.testChar                         150     N/A   avgt   15      9.321 ?    0.221   ns/op
ArrayCopyAligned.testChar                         300     N/A   avgt   15     13.473 ?    0.282   ns/op
ArrayCopyAligned.testChar                         600     N/A   avgt   15     20.941 ?    0.211   ns/op
ArrayCopyAligned.testChar                        1200     N/A   avgt   15     33.840 ?    0.490   ns/op
ArrayCopyAligned.testInt                            1     N/A   avgt   15      4.391 ?    0.042   ns/op
ArrayCopyAligned.testInt                            3     N/A   avgt   15      4.417 ?    0.063   ns/op
ArrayCopyAligned.testInt                            5     N/A   avgt   15      4.425 ?    0.047   ns/op
ArrayCopyAligned.testInt                           10     N/A   avgt   15      5.058 ?    0.084   ns/op
ArrayCopyAligned.testInt                           20     N/A   avgt   15      5.083 ?    0.062   ns/op
ArrayCopyAligned.testInt                           70     N/A   avgt   15      8.773 ?    0.200   ns/op
ArrayCopyAligned.testInt                          150     N/A   avgt   15     12.221 ?    0.212   ns/op
ArrayCopyAligned.testInt                          300     N/A   avgt   15     21.785 ?    0.160   ns/op
ArrayCopyAligned.testInt                          600     N/A   avgt   15     37.937 ?    0.178   ns/op
ArrayCopyAligned.testInt                         1200     N/A   avgt   15     54.911 ?    0.943   ns/op
ArrayCopyAligned.testLong                           1     N/A   avgt   15      4.420 ?    0.075   ns/op
ArrayCopyAligned.testLong                           3     N/A   avgt   15      4.362 ?    0.010   ns/op
ArrayCopyAligned.testLong                           5     N/A   avgt   15      5.030 ?    0.018   ns/op
ArrayCopyAligned.testLong                          10     N/A   avgt   15      5.112 ?    0.074   ns/op
ArrayCopyAligned.testLong                          20     N/A   avgt   15      5.847 ?    0.151   ns/op
ArrayCopyAligned.testLong                          70     N/A   avgt   15     11.349 ?    0.411   ns/op
ArrayCopyAligned.testLong                         150     N/A   avgt   15     17.721 ?    0.360   ns/op
ArrayCopyAligned.testLong                         300     N/A   avgt   15     27.205 ?    0.427   ns/op
ArrayCopyAligned.testLong                         600     N/A   avgt   15     44.129 ?    0.555   ns/op
ArrayCopyAligned.testLong                        1200     N/A   avgt   15     75.388 ?    0.774   ns/op
ArrayCopyUnalignedBoth.testByte                     1     N/A   avgt   15      2.355 ?    0.026   ns/op
ArrayCopyUnalignedBoth.testByte                     3     N/A   avgt   15      2.361 ?    0.046   ns/op
ArrayCopyUnalignedBoth.testByte                     5     N/A   avgt   15      2.357 ?    0.032   ns/op
ArrayCopyUnalignedBoth.testByte                    10     N/A   avgt   15      2.385 ?    0.047   ns/op
ArrayCopyUnalignedBoth.testByte                    20     N/A   avgt   15      2.355 ?    0.028   ns/op
ArrayCopyUnalignedBoth.testByte                    70     N/A   avgt   15      5.218 ?    0.095   ns/op
ArrayCopyUnalignedBoth.testByte                   150     N/A   avgt   15      6.038 ?    0.112   ns/op
ArrayCopyUnalignedBoth.testByte                   300     N/A   avgt   15      9.848 ?    0.218   ns/op
ArrayCopyUnalignedBoth.testByte                   600     N/A   avgt   15     13.090 ?    0.170   ns/op
ArrayCopyUnalignedBoth.testByte                  1200     N/A   avgt   15     20.538 ?    0.270   ns/op
ArrayCopyUnalignedBoth.testChar                     1     N/A   avgt   15      2.374 ?    0.043   ns/op
ArrayCopyUnalignedBoth.testChar                     3     N/A   avgt   15      2.351 ?    0.011   ns/op
ArrayCopyUnalignedBoth.testChar                     5     N/A   avgt   15      2.352 ?    0.017   ns/op
ArrayCopyUnalignedBoth.testChar                    10     N/A   avgt   15      2.349 ?    0.008   ns/op
ArrayCopyUnalignedBoth.testChar                    20     N/A   avgt   15      5.070 ?    0.041   ns/op
ArrayCopyUnalignedBoth.testChar                    70     N/A   avgt   15      6.052 ?    0.197   ns/op
ArrayCopyUnalignedBoth.testChar                   150     N/A   avgt   15      9.861 ?    0.226   ns/op
ArrayCopyUnalignedBoth.testChar                   300     N/A   avgt   15     13.635 ?    0.136   ns/op
ArrayCopyUnalignedBoth.testChar                   600     N/A   avgt   15     20.967 ?    0.164   ns/op
ArrayCopyUnalignedBoth.testChar                  1200     N/A   avgt   15     36.465 ?    0.140   ns/op
ArrayCopyUnalignedBoth.testInt                      1     N/A   avgt   15      4.440 ?    0.064   ns/op
ArrayCopyUnalignedBoth.testInt                      3     N/A   avgt   15      4.446 ?    0.089   ns/op
ArrayCopyUnalignedBoth.testInt                      5     N/A   avgt   15      4.417 ?    0.058   ns/op
ArrayCopyUnalignedBoth.testInt                     10     N/A   avgt   15      5.044 ?    0.054   ns/op
ArrayCopyUnalignedBoth.testInt                     20     N/A   avgt   15      5.127 ?    0.100   ns/op
ArrayCopyUnalignedBoth.testInt                     70     N/A   avgt   15      8.399 ?    0.077   ns/op
ArrayCopyUnalignedBoth.testInt                    150     N/A   avgt   15     12.252 ?    0.203   ns/op
ArrayCopyUnalignedBoth.testInt                    300     N/A   avgt   15     23.253 ?    0.252   ns/op
ArrayCopyUnalignedBoth.testInt                    600     N/A   avgt   15     37.990 ?    0.456   ns/op
ArrayCopyUnalignedBoth.testInt                   1200     N/A   avgt   15     57.030 ?    0.146   ns/op
ArrayCopyUnalignedBoth.testLong                     1     N/A   avgt   15      4.360 ?    0.014   ns/op
ArrayCopyUnalignedBoth.testLong                     3     N/A   avgt   15      4.391 ?    0.080   ns/op
ArrayCopyUnalignedBoth.testLong                     5     N/A   avgt   15      5.060 ?    0.071   ns/op
ArrayCopyUnalignedBoth.testLong                    10     N/A   avgt   15      5.117 ?    0.109   ns/op
ArrayCopyUnalignedBoth.testLong                    20     N/A   avgt   15      5.841 ?    0.115   ns/op
ArrayCopyUnalignedBoth.testLong                    70     N/A   avgt   15     11.700 ?    0.655   ns/op
ArrayCopyUnalignedBoth.testLong                   150     N/A   avgt   15     22.002 ?    0.408   ns/op
ArrayCopyUnalignedBoth.testLong                   300     N/A   avgt   15     36.020 ?    0.356   ns/op
ArrayCopyUnalignedBoth.testLong                   600     N/A   avgt   15     45.212 ?    0.194   ns/op
ArrayCopyUnalignedBoth.testLong                  1200     N/A   avgt   15     75.720 ?    0.607   ns/op
ArrayCopyUnalignedDst.testByte                      1     N/A   avgt   15      2.361 ?    0.037   ns/op
ArrayCopyUnalignedDst.testByte                     10     N/A   avgt   15      2.353 ?    0.025   ns/op
ArrayCopyUnalignedDst.testByte                    150     N/A   avgt   15      6.145 ?    0.170   ns/op
ArrayCopyUnalignedDst.testByte                   1200     N/A   avgt   15     19.825 ?    0.231   ns/op
ArrayCopyUnalignedDst.testChar                      1     N/A   avgt   15      2.366 ?    0.053   ns/op
ArrayCopyUnalignedDst.testChar                     10     N/A   avgt   15      2.375 ?    0.058   ns/op
ArrayCopyUnalignedDst.testChar                    150     N/A   avgt   15      9.274 ?    0.237   ns/op
ArrayCopyUnalignedDst.testChar                   1200     N/A   avgt   15     36.327 ?    0.086   ns/op
ArrayCopyUnalignedDst.testInt                       1     N/A   avgt   15      4.400 ?    0.023   ns/op
ArrayCopyUnalignedDst.testInt                      10     N/A   avgt   15      5.071 ?    0.073   ns/op
ArrayCopyUnalignedDst.testInt                     150     N/A   avgt   15     13.229 ?    0.172   ns/op
ArrayCopyUnalignedDst.testInt                    1200     N/A   avgt   15     56.467 ?    0.384   ns/op
ArrayCopyUnalignedDst.testLong                      1     N/A   avgt   15      4.421 ?    0.107   ns/op
ArrayCopyUnalignedDst.testLong                     10     N/A   avgt   15      5.074 ?    0.063   ns/op
ArrayCopyUnalignedDst.testLong                    150     N/A   avgt   15     20.605 ?    0.602   ns/op
ArrayCopyUnalignedDst.testLong                   1200     N/A   avgt   15     74.206 ?    0.294   ns/op
ArrayCopyUnalignedSrc.testByte                      1     N/A   avgt   15      2.352 ?    0.024   ns/op
ArrayCopyUnalignedSrc.testByte                     10     N/A   avgt   15      2.352 ?    0.028   ns/op
ArrayCopyUnalignedSrc.testByte                    150     N/A   avgt   15      6.156 ?    0.118   ns/op
ArrayCopyUnalignedSrc.testByte                   1200     N/A   avgt   15     16.755 ?    0.046   ns/op
ArrayCopyUnalignedSrc.testChar                      1     N/A   avgt   15      2.363 ?    0.031   ns/op
ArrayCopyUnalignedSrc.testChar                     10     N/A   avgt   15      2.367 ?    0.045   ns/op
ArrayCopyUnalignedSrc.testChar                    150     N/A   avgt   15      9.318 ?    0.157   ns/op
ArrayCopyUnalignedSrc.testChar                   1200     N/A   avgt   15     31.355 ?    0.276   ns/op
ArrayCopyUnalignedSrc.testInt                       1     N/A   avgt   15      4.428 ?    0.063   ns/op
ArrayCopyUnalignedSrc.testInt                      10     N/A   avgt   15      5.072 ?    0.089   ns/op
ArrayCopyUnalignedSrc.testInt                     150     N/A   avgt   15     12.163 ?    0.116   ns/op
ArrayCopyUnalignedSrc.testInt                    1200     N/A   avgt   15     54.206 ?    0.374   ns/op
ArrayCopyUnalignedSrc.testLong                      1     N/A   avgt   15      4.401 ?    0.052   ns/op
ArrayCopyUnalignedSrc.testLong                     10     N/A   avgt   15      5.058 ?    0.034   ns/op
ArrayCopyUnalignedSrc.testLong                    150     N/A   avgt   15     20.391 ?    0.417   ns/op
ArrayCopyUnalignedSrc.testLong                   1200     N/A   avgt   15     74.467 ?    0.809   ns/op

// PR with assert fix
Benchmark                                    (length)  (size)   Mode  Cnt      Score      Error   Units
ArrayCopyObject.conjoint_micro                    N/A      31  thrpt   15  79910.859 ±  869.372  ops/ms
ArrayCopyObject.conjoint_micro                    N/A      63  thrpt   15  62631.951 ± 1440.065  ops/ms
ArrayCopyObject.conjoint_micro                    N/A     127  thrpt   15  51043.300 ±  761.226  ops/ms
ArrayCopyObject.conjoint_micro                    N/A    2047  thrpt   15  14141.790 ±  164.714  ops/ms
ArrayCopyObject.conjoint_micro                    N/A    4095  thrpt   15   8024.056 ±   53.310  ops/ms
ArrayCopyObject.conjoint_micro                    N/A    8191  thrpt   15   4318.074 ±    6.441  ops/ms
ArrayCopyObject.disjoint_micro                    N/A      31  thrpt   15  78245.690 ± 1697.277  ops/ms
ArrayCopyObject.disjoint_micro                    N/A      63  thrpt   15  61873.747 ±  806.972  ops/ms
ArrayCopyObject.disjoint_micro                    N/A     127  thrpt   15  55457.908 ± 2091.739  ops/ms
ArrayCopyObject.disjoint_micro                    N/A    2047  thrpt   15   9407.159 ±  102.308  ops/ms
ArrayCopyObject.disjoint_micro                    N/A    4095  thrpt   15   5107.999 ±   49.856  ops/ms
ArrayCopyObject.disjoint_micro                    N/A    8191  thrpt   15   1195.313 ±    7.580  ops/ms
ArrayCopy.arrayCopy                               N/A     N/A   avgt   15      1.354 ±    0.026   ns/op
ArrayCopy.arrayCopyChar                           N/A     N/A   avgt   15      4.388 ±    0.101   ns/op
ArrayCopy.arrayCopyCharNonConst                   N/A     N/A   avgt   15      4.715 ±    0.077   ns/op
ArrayCopy.arrayCopyLocalArray                     N/A     N/A   avgt   15      0.505 ±    0.007   ns/op
ArrayCopy.arrayCopyNonConst                       N/A     N/A   avgt   15      1.900 ±    0.042   ns/op
ArrayCopy.arrayCopyObject                         N/A     N/A   avgt   15     23.395 ±    0.252   ns/op
ArrayCopy.arrayCopyObjectNonConst                 N/A     N/A   avgt   15     25.409 ±    0.355   ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward       N/A     N/A   avgt   15     17.352 ±    0.297   ns/op
ArrayCopy.arrayCopyObjectSameArraysForward        N/A     N/A   avgt   15     17.804 ±    0.198   ns/op
ArrayCopy.copyLoop                                N/A     N/A   avgt   15      5.114 ±    0.117   ns/op
ArrayCopy.copyLoopLocalArray                      N/A     N/A   avgt   15      3.728 ±    0.086   ns/op
ArrayCopy.copyLoopNonConst                        N/A     N/A   avgt   15      5.413 ±    0.022   ns/op
ArrayCopyAligned.testByte                           1     N/A   avgt   15      2.367 ±    0.041   ns/op
ArrayCopyAligned.testByte                           3     N/A   avgt   15      2.368 ±    0.048   ns/op
ArrayCopyAligned.testByte                           5     N/A   avgt   15      2.360 ±    0.050   ns/op
ArrayCopyAligned.testByte                          10     N/A   avgt   15      2.362 ±    0.030   ns/op
ArrayCopyAligned.testByte                          20     N/A   avgt   15      2.363 ±    0.038   ns/op
ArrayCopyAligned.testByte                          70     N/A   avgt   15      5.185 ±    0.092   ns/op
ArrayCopyAligned.testByte                         150     N/A   avgt   15      5.905 ±    0.073   ns/op
ArrayCopyAligned.testByte                         300     N/A   avgt   15      9.720 ±    0.215   ns/op
ArrayCopyAligned.testByte                         600     N/A   avgt   15     13.076 ±    0.142   ns/op
ArrayCopyAligned.testByte                        1200     N/A   avgt   15     22.189 ±    0.143   ns/op
ArrayCopyAligned.testChar                           1     N/A   avgt   15      2.351 ±    0.008   ns/op
ArrayCopyAligned.testChar                           3     N/A   avgt   15      2.370 ±    0.046   ns/op
ArrayCopyAligned.testChar                           5     N/A   avgt   15      2.355 ±    0.037   ns/op
ArrayCopyAligned.testChar                          10     N/A   avgt   15      2.351 ±    0.020   ns/op
ArrayCopyAligned.testChar                          20     N/A   avgt   15      5.077 ±    0.059   ns/op
ArrayCopyAligned.testChar                          70     N/A   avgt   15      5.932 ±    0.101   ns/op
ArrayCopyAligned.testChar                         150     N/A   avgt   15      9.815 ±    0.159   ns/op
ArrayCopyAligned.testChar                         300     N/A   avgt   15     13.759 ±    0.197   ns/op
ArrayCopyAligned.testChar                         600     N/A   avgt   15     20.505 ±    0.161   ns/op
ArrayCopyAligned.testChar                        1200     N/A   avgt   15     33.720 ±    0.493   ns/op
ArrayCopyAligned.testInt                            1     N/A   avgt   15      4.417 ±    0.096   ns/op
ArrayCopyAligned.testInt                            3     N/A   avgt   15      4.363 ±    0.029   ns/op
ArrayCopyAligned.testInt                            5     N/A   avgt   15      4.365 ±    0.022   ns/op
ArrayCopyAligned.testInt                           10     N/A   avgt   15      5.122 ±    0.170   ns/op
ArrayCopyAligned.testInt                           20     N/A   avgt   15      5.074 ±    0.076   ns/op
ArrayCopyAligned.testInt                           70     N/A   avgt   15      9.048 ±    0.201   ns/op
ArrayCopyAligned.testInt                          150     N/A   avgt   15     12.559 ±    0.159   ns/op
ArrayCopyAligned.testInt                          300     N/A   avgt   15     21.518 ±    0.276   ns/op
ArrayCopyAligned.testInt                          600     N/A   avgt   15     38.209 ±    0.349   ns/op
ArrayCopyAligned.testInt                         1200     N/A   avgt   15     54.638 ±    0.706   ns/op
ArrayCopyAligned.testLong                           1     N/A   avgt   15      4.407 ±    0.041   ns/op
ArrayCopyAligned.testLong                           3     N/A   avgt   15      4.415 ±    0.077   ns/op
ArrayCopyAligned.testLong                           5     N/A   avgt   15      5.087 ±    0.092   ns/op
ArrayCopyAligned.testLong                          10     N/A   avgt   15      5.072 ±    0.078   ns/op
ArrayCopyAligned.testLong                          20     N/A   avgt   15      5.802 ±    0.023   ns/op
ArrayCopyAligned.testLong                          70     N/A   avgt   15     11.284 ±    0.171   ns/op
ArrayCopyAligned.testLong                         150     N/A   avgt   15     17.501 ±    0.185   ns/op
ArrayCopyAligned.testLong                         300     N/A   avgt   15     27.477 ±    0.391   ns/op
ArrayCopyAligned.testLong                         600     N/A   avgt   15     44.711 ±    0.209   ns/op
ArrayCopyAligned.testLong                        1200     N/A   avgt   15     77.157 ±    1.437   ns/op
ArrayCopyUnalignedBoth.testByte                     1     N/A   avgt   15      2.360 ±    0.040   ns/op
ArrayCopyUnalignedBoth.testByte                     3     N/A   avgt   15      2.351 ±    0.028   ns/op
ArrayCopyUnalignedBoth.testByte                     5     N/A   avgt   15      2.352 ±    0.017   ns/op
ArrayCopyUnalignedBoth.testByte                    10     N/A   avgt   15      2.347 ±    0.011   ns/op
ArrayCopyUnalignedBoth.testByte                    20     N/A   avgt   15      2.363 ±    0.039   ns/op
ArrayCopyUnalignedBoth.testByte                    70     N/A   avgt   15      5.182 ±    0.083   ns/op
ArrayCopyUnalignedBoth.testByte                   150     N/A   avgt   15      5.920 ±    0.157   ns/op
ArrayCopyUnalignedBoth.testByte                   300     N/A   avgt   15     10.374 ±    0.314   ns/op
ArrayCopyUnalignedBoth.testByte                   600     N/A   avgt   15     13.511 ±    0.182   ns/op
ArrayCopyUnalignedBoth.testByte                  1200     N/A   avgt   15     21.302 ±    0.194   ns/op
ArrayCopyUnalignedBoth.testChar                     1     N/A   avgt   15      2.359 ±    0.035   ns/op
ArrayCopyUnalignedBoth.testChar                     3     N/A   avgt   15      2.342 ±    0.002   ns/op
ArrayCopyUnalignedBoth.testChar                     5     N/A   avgt   15      2.348 ±    0.019   ns/op
ArrayCopyUnalignedBoth.testChar                    10     N/A   avgt   15      2.362 ±    0.059   ns/op
ArrayCopyUnalignedBoth.testChar                    20     N/A   avgt   15      5.079 ±    0.046   ns/op
ArrayCopyUnalignedBoth.testChar                    70     N/A   avgt   15      5.974 ±    0.165   ns/op
ArrayCopyUnalignedBoth.testChar                   150     N/A   avgt   15     10.201 ±    0.260   ns/op
ArrayCopyUnalignedBoth.testChar                   300     N/A   avgt   15     13.862 ±    0.064   ns/op
ArrayCopyUnalignedBoth.testChar                   600     N/A   avgt   15     20.752 ±    0.240   ns/op
ArrayCopyUnalignedBoth.testChar                  1200     N/A   avgt   15     36.883 ±    0.390   ns/op
ArrayCopyUnalignedBoth.testInt                      1     N/A   avgt   15      4.372 ±    0.054   ns/op
ArrayCopyUnalignedBoth.testInt                      3     N/A   avgt   15      4.376 ±    0.051   ns/op
ArrayCopyUnalignedBoth.testInt                      5     N/A   avgt   15      4.385 ±    0.081   ns/op
ArrayCopyUnalignedBoth.testInt                     10     N/A   avgt   15      5.059 ±    0.082   ns/op
ArrayCopyUnalignedBoth.testInt                     20     N/A   avgt   15      5.099 ±    0.154   ns/op
ArrayCopyUnalignedBoth.testInt                     70     N/A   avgt   15      8.983 ±    0.079   ns/op
ArrayCopyUnalignedBoth.testInt                    150     N/A   avgt   15     12.481 ±    0.169   ns/op
ArrayCopyUnalignedBoth.testInt                    300     N/A   avgt   15     23.265 ±    0.319   ns/op
ArrayCopyUnalignedBoth.testInt                    600     N/A   avgt   15     38.328 ±    0.259   ns/op
ArrayCopyUnalignedBoth.testInt                   1200     N/A   avgt   15     57.320 ±    0.476   ns/op
ArrayCopyUnalignedBoth.testLong                     1     N/A   avgt   15      4.413 ±    0.055   ns/op
ArrayCopyUnalignedBoth.testLong                     3     N/A   avgt   15      4.409 ±    0.024   ns/op
ArrayCopyUnalignedBoth.testLong                     5     N/A   avgt   15      5.086 ±    0.134   ns/op
ArrayCopyUnalignedBoth.testLong                    10     N/A   avgt   15      5.069 ±    0.022   ns/op
ArrayCopyUnalignedBoth.testLong                    20     N/A   avgt   15      5.788 ±    0.087   ns/op
ArrayCopyUnalignedBoth.testLong                    70     N/A   avgt   15     11.149 ±    0.182   ns/op
ArrayCopyUnalignedBoth.testLong                   150     N/A   avgt   15     22.461 ±    0.284   ns/op
ArrayCopyUnalignedBoth.testLong                   300     N/A   avgt   15     36.353 ±    0.272   ns/op
ArrayCopyUnalignedBoth.testLong                   600     N/A   avgt   15     47.568 ±    2.050   ns/op
ArrayCopyUnalignedBoth.testLong                  1200     N/A   avgt   15     80.643 ±    3.747   ns/op
ArrayCopyUnalignedDst.testByte                      1     N/A   avgt   15      2.344 ±    0.004   ns/op
ArrayCopyUnalignedDst.testByte                     10     N/A   avgt   15      2.362 ±    0.043   ns/op
ArrayCopyUnalignedDst.testByte                    150     N/A   avgt   15      5.922 ±    0.066   ns/op
ArrayCopyUnalignedDst.testByte                   1200     N/A   avgt   15     19.768 ±    0.177   ns/op
ArrayCopyUnalignedDst.testChar                      1     N/A   avgt   15      2.358 ±    0.032   ns/op
ArrayCopyUnalignedDst.testChar                     10     N/A   avgt   15      2.379 ±    0.056   ns/op
ArrayCopyUnalignedDst.testChar                    150     N/A   avgt   15      9.497 ±    0.181   ns/op
ArrayCopyUnalignedDst.testChar                   1200     N/A   avgt   15     36.580 ±    0.067   ns/op
ArrayCopyUnalignedDst.testInt                       1     N/A   avgt   15      4.412 ±    0.106   ns/op
ArrayCopyUnalignedDst.testInt                      10     N/A   avgt   15      5.082 ±    0.130   ns/op
ArrayCopyUnalignedDst.testInt                     150     N/A   avgt   15     13.638 ±    0.262   ns/op
ArrayCopyUnalignedDst.testInt                    1200     N/A   avgt   15     56.724 ±    0.247   ns/op
ArrayCopyUnalignedDst.testLong                      1     N/A   avgt   15      4.435 ±    0.113   ns/op
ArrayCopyUnalignedDst.testLong                     10     N/A   avgt   15      5.102 ±    0.095   ns/op
ArrayCopyUnalignedDst.testLong                    150     N/A   avgt   15     20.762 ±    0.388   ns/op
ArrayCopyUnalignedDst.testLong                   1200     N/A   avgt   15     77.408 ±    2.771   ns/op
ArrayCopyUnalignedSrc.testByte                      1     N/A   avgt   15      2.346 ±    0.009   ns/op
ArrayCopyUnalignedSrc.testByte                     10     N/A   avgt   15      2.367 ±    0.053   ns/op
ArrayCopyUnalignedSrc.testByte                    150     N/A   avgt   15      5.953 ±    0.120   ns/op
ArrayCopyUnalignedSrc.testByte                   1200     N/A   avgt   15     16.899 ±    0.277   ns/op
ArrayCopyUnalignedSrc.testChar                      1     N/A   avgt   15      2.375 ±    0.054   ns/op
ArrayCopyUnalignedSrc.testChar                     10     N/A   avgt   15      2.348 ±    0.005   ns/op
ArrayCopyUnalignedSrc.testChar                    150     N/A   avgt   15      9.559 ±    0.217   ns/op
ArrayCopyUnalignedSrc.testChar                   1200     N/A   avgt   15     31.406 ±    0.389   ns/op
ArrayCopyUnalignedSrc.testInt                       1     N/A   avgt   15      4.372 ±    0.023   ns/op
ArrayCopyUnalignedSrc.testInt                      10     N/A   avgt   15      5.071 ±    0.110   ns/op
ArrayCopyUnalignedSrc.testInt                     150     N/A   avgt   15     12.627 ±    0.379   ns/op
ArrayCopyUnalignedSrc.testInt                    1200     N/A   avgt   15     54.595 ±    0.281   ns/op
ArrayCopyUnalignedSrc.testLong                      1     N/A   avgt   15      4.415 ±    0.043   ns/op
ArrayCopyUnalignedSrc.testLong                     10     N/A   avgt   15      5.058 ±    0.065   ns/op
ArrayCopyUnalignedSrc.testLong                    150     N/A   avgt   15     20.759 ±    0.256   ns/op
ArrayCopyUnalignedSrc.testLong                   1200     N/A   avgt   15     78.106 ±    2.320   ns/op

Most of the differences are within the error range; a few are slightly larger.
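
One rough way to read "in the error range" is to check whether two JMH `score ± error` intervals overlap. The sketch below does that, assuming the table preceding `// PR with assert fix` is the baseline run. `Measurement` and `within_error` are hypothetical names, and interval overlap is only a heuristic, not a significance test.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper type: one JMH result row (score and its reported error).
struct Measurement {
  double score;
  double error;
};

// True when the intervals [score - error, score + error] of a and b overlap.
bool within_error(const Measurement& a, const Measurement& b) {
  assert(a.error >= 0.0 && b.error >= 0.0);
  return std::fabs(a.score - b.score) <= a.error + b.error;
}
```

With the values from the tables, `ArrayCopyUnalignedSrc.testInt` at length 1200 (54.206 ± 0.374 vs 54.595 ± 0.281) overlaps, while `ArrayCopyUnalignedSrc.testLong` at length 1200 (74.467 ± 0.809 vs 78.106 ± 2.320) does not.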

Comment on lines 582 to 585
__ movq(temp2, temp1);
__ shlq(temp2, shift);
__ cmpq(temp2, large_threshold);
__ jcc(Assembler::greaterEqual, L_copy_large);
Member

I suspect additional checks for 2.5MB array size may hit the performance of other general sizes.

Contributor Author

Comparing several runs of the XorTest.copy SMALL (100K) benchmark, baseline against PR, I see an average slowdown of 1.7% (7.566 ms/op vs 7.696 ms/op).
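
The check under discussion can be sketched in plain C++ as follows. This is an illustration, not the stub itself: it assumes `temp1` holds the element count and `shift` is log2 of the element size. `takes_large_path` and `kAssumedLargeThreshold` are hypothetical names, and the threshold value is a guess based on the "2.5MB" mentioned above.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: "2.5MB" is read as 2.5 * 1024 * 1024 bytes; the PR may define it differently.
constexpr std::int64_t kAssumedLargeThreshold = 5 * 1024 * 1024 / 2;

// Mirrors the movq/shlq/cmpq/jcc(greaterEqual) sequence: scale the element
// count to a byte count, then do a signed compare against the threshold.
bool takes_large_path(std::int64_t element_count, int shift,
                      std::int64_t large_threshold = kAssumedLargeThreshold) {
  assert(element_count >= 0 && shift >= 0 && shift <= 3);
  const std::int64_t byte_count = element_count << shift;
  return byte_count >= large_threshold;
}
```

Every copy below the threshold pays for these extra instructions, which is the cost the reviewer is asking about.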

Comment on lines +1183 to +1186
__ evmovntdquq(Address(dst, index, scale, offset), xmm1, Assembler::AVX_512bit);
__ evmovntdquq(Address(dst, index, scale, offset + 0x40), xmm2, Assembler::AVX_512bit);
__ evmovntdquq(Address(dst, index, scale, offset + 0x80), xmm3, Assembler::AVX_512bit);
__ evmovntdquq(Address(dst, index, scale, offset + 0xC0), xmm4, Assembler::AVX_512bit);
Member

@jatin-bhateja jatin-bhateja Nov 14, 2023

These are non-temporal memory moves. To force eviction from the write-combining buffers we may need to emit additional fences; otherwise a subsequent read from destination memory may see incorrect values.

@jatin-bhateja There is a sfence at line 781.

Contributor Author

Thanks, there is a store fence upon completion of the main loop for the large size code:

image

Member

@jatin-bhateja jatin-bhateja Nov 15, 2023

Do you see any concerns in the multithreaded case, where the writer is busy copying 256-byte blocks in a loop and a reader tries to access a location that has not yet been flushed out of the write-combining buffer?

Contributor Author

The results a concurrent reader sees could be different if the copy is using NT writes, but if the read of the destination is not synchronized with the copy operation, I think the reader would not see a consistent state in either case. Is it worse with NT writes?

Member

@jatin-bhateja jatin-bhateja Nov 16, 2023

Thanks for the clarification. I agree the behavior is similar to the non-NT case; in fact, using NT stores for huge copy operations avoids polluting the caches with destination cache line fills.

Contributor

But won't it also cause performance regressions in the common case where the caller needs to use the destination array?

Contributor Author

One component of the included XorTest.xor benchmark is to read the bytes from the two copied arrays; see line 155 in libjnitest.c.
The NT stores are only used in the FOREIGN LARGE case, and it shows a net speedup (~123 ms -> 104 ms).
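
For readers unfamiliar with the pattern discussed in this thread, here is a minimal sketch of a copy loop that uses non-temporal stores followed by a single store fence. It is not the PR's stub: it swaps in the 128-bit SSE2 intrinsic `_mm_stream_si128` for the AVX-512 `evmovntdquq` instruction, falls back to `memcpy` on targets without SSE2, and `copy_streaming` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Copies `bytes` bytes from src to dst (non-overlapping), streaming the bulk
// of the data past the cache when SSE2 is available.
void copy_streaming(void* dst, const void* src, std::size_t bytes) {
  assert(bytes == 0 || (dst != nullptr && src != nullptr));
#if defined(__SSE2__)
  auto* d = static_cast<unsigned char*>(dst);
  const auto* s = static_cast<const unsigned char*>(src);

  // _mm_stream_si128 needs a 16-byte aligned destination, so copy a short
  // head with regular stores until dst is aligned.
  std::size_t head = (16 - (reinterpret_cast<std::uintptr_t>(d) & 15)) & 15;
  if (head > bytes) head = bytes;
  std::memcpy(d, s, head);
  d += head;
  s += head;
  bytes -= head;

  const std::size_t blocks = bytes / 16;
  for (std::size_t i = 0; i < blocks; ++i) {
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + 16 * i));
    _mm_stream_si128(reinterpret_cast<__m128i*>(d + 16 * i), v);  // non-temporal store
  }
  // One fence after the loop orders the streamed stores before later stores.
  _mm_sfence();

  // Remaining tail (fewer than 16 bytes) uses regular stores.
  std::memcpy(d + 16 * blocks, s + 16 * blocks, bytes - 16 * blocks);
#else
  std::memcpy(dst, src, bytes);
#endif
}
```

As noted above, the fence does not make an unsynchronized concurrent reader safe; it only ensures the streamed data is globally visible before stores issued after the copy.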

@openjdk

openjdk bot commented Nov 16, 2023

@steveatgh this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout memcpy
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added merge-conflict Pull request has merge conflict with target branch and removed rfr Pull request is ready for review labels Nov 16, 2023
@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Nov 16, 2023
@TobiHartmann
Member

Thanks, I re-submitted testing.

<<<<<<< HEAD
void copy(int count, byte[] src, int sOff, byte[] dst, int dOff, int len);
=======
>>>>>>> 9727f4bdddc071e6f59806087339f345405ab004
Member

You have multiple merge conflicts in the micro benchmark files.

Contributor Author

Sorry, not sure how I missed the conflicts. They should be resolved now. Thanks!

Member

@jatin-bhateja jatin-bhateja left a comment

Hi @steveatgh, the x86 code changes look good to me.

@openjdk

openjdk bot commented Nov 20, 2023

⚠️ @steveatgh the full name on your profile does not match the author name in this pull requests' HEAD commit. If this pull request gets integrated then the author name from this pull requests' HEAD commit will be used for the resulting commit. If you wish to push a new commit with a different author name, then please run the following commands in a local repository of your personal fork:

$ git checkout memcpy
$ git commit --author='Preferred Full Name <you@example.com>' --allow-empty -m 'Update full name'
$ git push

@openjdk

openjdk bot commented Nov 20, 2023

@steveatgh This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

Co-authored-by: Maurizio Cimadamore <mcimadamore@openjdk.org>
Reviewed-by: thartmann, jbhateja, sviswanathan

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 54 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@TobiHartmann, @jatin-bhateja, @sviswa7) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Nov 20, 2023
- remove ::copy test from XorTest
Previous commit (fcbbc0d) added org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark
@steveatgh
Contributor Author

steveatgh commented Nov 21, 2023

The micros:java.lang.foreign.xor.XorTest::xor benchmark results shown in the introductory comment above used XorTest code from PR commit 7cc272e which was based on Maurizio Cimadamore's commit a788f06. The XorTest has since been updated, and XorTest::copy is no longer needed and has been removed from this pull request. Performance can be evaluated using both the new XorTest and a new org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark added to this PR. Results from these two benchmarks are shown below:

In the ArrayCopyAlignedLarge.testByte benchmark below, the PR code is active in sizes 5MB and 10MB.

// Baseline
Benchmark                       (length)  Mode  Cnt       Score       Error  Units
ArrayCopyAlignedLarge.testByte    100000  avgt   15    2434.515 ±    11.526  ns/op
ArrayCopyAlignedLarge.testByte   1000000  avgt   15   51211.235 ±   539.355  ns/op
ArrayCopyAlignedLarge.testByte   2000000  avgt   15  104837.012 ±  1338.823  ns/op
ArrayCopyAlignedLarge.testByte   5000000  avgt   15  293357.745 ±  3233.745  ns/op
ArrayCopyAlignedLarge.testByte  10000000  avgt   15  957068.292 ± 15509.983  ns/op

// PR
Benchmark                       (length)  Mode  Cnt       Score      Error  Units
ArrayCopyAlignedLarge.testByte    100000  avgt   15    2443.354 ±   17.996  ns/op
ArrayCopyAlignedLarge.testByte   1000000  avgt   15   50854.800 ± 1253.863  ns/op
ArrayCopyAlignedLarge.testByte   2000000  avgt   15  105105.124 ± 1286.606  ns/op
ArrayCopyAlignedLarge.testByte   5000000  avgt   15  207298.875 ± 1260.289  ns/op
ArrayCopyAlignedLarge.testByte  10000000  avgt   15  457461.404 ± 8628.867  ns/op
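
To put the two sizes where the PR code is active in relative terms, the scores can be turned into a speedup ratio. The helper below is a small sketch for that arithmetic; `speedup` is a hypothetical name, and since these are `avgt` (time per operation) results, a ratio above 1 means the PR is faster.

```cpp
#include <cassert>

// Speedup of the PR over the baseline for a time-per-operation benchmark.
double speedup(double baseline_ns_per_op, double pr_ns_per_op) {
  assert(baseline_ns_per_op > 0.0 && pr_ns_per_op > 0.0);
  return baseline_ns_per_op / pr_ns_per_op;
}
```

Using the scores above, `speedup(293357.745, 207298.875)` is roughly 1.42 for the 5,000,000-byte case and `speedup(957068.292, 457461.404)` is roughly 2.09 for the 10,000,000-byte case.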

In the XorTest::xor benchmark below, the PR code is active in 3 of the LARGE case runs: FOREIGN_NO_INIT, FOREIGN_INIT, and UNSAFE.

// New xor test - Baseline
Benchmark         (arrayKind)  (sizeKind)  Mode  Cnt    Score    Error  Units
XorTest.xor      JNI_ELEMENTS       SMALL  avgt   30    0.220 ±  0.002  ms/op
XorTest.xor      JNI_ELEMENTS      MEDIUM  avgt   30    2.859 ±  0.034  ms/op
XorTest.xor      JNI_ELEMENTS       LARGE  avgt   30  117.436 ±  1.708  ms/op
XorTest.xor        JNI_REGION       SMALL  avgt   30    0.066 ±  0.001  ms/op
XorTest.xor        JNI_REGION      MEDIUM  avgt   30    1.623 ±  0.013  ms/op
XorTest.xor        JNI_REGION       LARGE  avgt   30    8.923 ±  0.095  ms/op
XorTest.xor      JNI_CRITICAL       SMALL  avgt   30    0.058 ±  0.001  ms/op
XorTest.xor      JNI_CRITICAL      MEDIUM  avgt   30    1.215 ±  0.012  ms/op
XorTest.xor      JNI_CRITICAL       LARGE  avgt   30    6.246 ±  0.048  ms/op
XorTest.xor   FOREIGN_NO_INIT       SMALL  avgt   30    0.066 ±  0.001  ms/op
XorTest.xor   FOREIGN_NO_INIT      MEDIUM  avgt   30    1.572 ±  0.018  ms/op
XorTest.xor   FOREIGN_NO_INIT       LARGE  avgt   30   10.204 ±  0.116  ms/op
XorTest.xor      FOREIGN_INIT       SMALL  avgt   30    0.071 ±  0.001  ms/op
XorTest.xor      FOREIGN_INIT      MEDIUM  avgt   30    1.697 ±  0.008  ms/op
XorTest.xor      FOREIGN_INIT       LARGE  avgt   30   12.056 ±  0.152  ms/op
XorTest.xor  FOREIGN_CRITICAL       SMALL  avgt   30    0.059 ±  0.001  ms/op
XorTest.xor  FOREIGN_CRITICAL      MEDIUM  avgt   30    1.215 ±  0.012  ms/op
XorTest.xor  FOREIGN_CRITICAL       LARGE  avgt   30    6.301 ±  0.110  ms/op
XorTest.xor            UNSAFE       SMALL  avgt   30    0.066 ±  0.001  ms/op
XorTest.xor            UNSAFE      MEDIUM  avgt   30    1.589 ±  0.029  ms/op
XorTest.xor            UNSAFE       LARGE  avgt   30   10.177 ±  0.108  ms/op

// New xor test - PR
Benchmark         (arrayKind)  (sizeKind)  Mode  Cnt    Score    Error  Units
XorTest.xor      JNI_ELEMENTS       SMALL  avgt   30    0.224 ±  0.003  ms/op
XorTest.xor      JNI_ELEMENTS      MEDIUM  avgt   30    2.873 ±  0.025  ms/op
XorTest.xor      JNI_ELEMENTS       LARGE  avgt   30  118.523 ±  0.951  ms/op
XorTest.xor        JNI_REGION       SMALL  avgt   30    0.066 ±  0.001  ms/op
XorTest.xor        JNI_REGION      MEDIUM  avgt   30    1.639 ±  0.019  ms/op
XorTest.xor        JNI_REGION       LARGE  avgt   30    8.890 ±  0.124  ms/op
XorTest.xor      JNI_CRITICAL       SMALL  avgt   30    0.059 ±  0.001  ms/op
XorTest.xor      JNI_CRITICAL      MEDIUM  avgt   30    1.213 ±  0.013  ms/op
XorTest.xor      JNI_CRITICAL       LARGE  avgt   30    6.241 ±  0.099  ms/op
XorTest.xor   FOREIGN_NO_INIT       SMALL  avgt   30    0.066 ±  0.001  ms/op
XorTest.xor   FOREIGN_NO_INIT      MEDIUM  avgt   30    1.580 ±  0.015  ms/op
XorTest.xor   FOREIGN_NO_INIT       LARGE  avgt   30    8.936 ±  0.059  ms/op
XorTest.xor      FOREIGN_INIT       SMALL  avgt   30    0.071 ±  0.001  ms/op
XorTest.xor      FOREIGN_INIT      MEDIUM  avgt   30    1.727 ±  0.028  ms/op
XorTest.xor      FOREIGN_INIT       LARGE  avgt   30   10.544 ±  0.114  ms/op
XorTest.xor  FOREIGN_CRITICAL       SMALL  avgt   30    0.059 ±  0.001  ms/op
XorTest.xor  FOREIGN_CRITICAL      MEDIUM  avgt   30    1.215 ±  0.014  ms/op
XorTest.xor  FOREIGN_CRITICAL       LARGE  avgt   30    6.230 ±  0.029  ms/op
XorTest.xor            UNSAFE       SMALL  avgt   30    0.066 ±  0.001  ms/op
XorTest.xor            UNSAFE      MEDIUM  avgt   30    1.578 ±  0.020  ms/op
XorTest.xor            UNSAFE       LARGE  avgt   30    8.910 ±  0.100  ms/op

Comment on lines 1173 to 1175
void StubGenerator::copy256_avx3(Register dst, Register src, Register index, XMMRegister xmm1,
XMMRegister xmm2, XMMRegister xmm3, XMMRegister xmm4,
bool conjoint, int shift, int offset) {

The conjoint parameter is not used so could be removed from this function.

Contributor Author

Thanks, done.

__ jcc(Assembler::less, L_tail_large);

__ BIND(L_main_pre_loop_large);
__ subq(temp1, loop_size[shift]); // whay is this here

Spurious comment.

Contributor Author

Thanks, done

Label L_main_pre_loop_large;
Label L_pre_main_post_large;

if (MaxVectorSize == 64) {

This should be an assert instead of an if check, as this method shouldn't be called when MaxVectorSize is < 64.

Contributor Author

Thanks, done.

/* T_LONG */ { 8, 16 , 24 , 32}
};

if (MaxVectorSize == 64) {

This should be an assert instead of an if check, as this method shouldn't be called when MaxVectorSize is < 64.

Contributor Author

Thanks, done.

__ shrq(temp2, shift);
}
__ movq(temp3, temp2);
copy64_masked_avx(to, from, xmm1, k2, temp3, temp4, temp1, shift, 0);

The last argument should be "true" or "1" instead of "0" or "false". This is because temp3 (the length) could also be less than 32, and that case is only handled when the use64byteVector argument is true.

Contributor Author

Thanks, done.
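
The concern above is about short lengths in a masked copy. As a scalar sketch of what a single masked 64-byte copy does (an assumption about the intent, not the stub's actual code), the mask carries one bit per byte and only the low `len` bits are set, so any length from 0 to 64 is handled without touching bytes past `len`. `low_bits_mask` and `masked_copy64` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Returns a 64-bit mask with the low `len` bits set (len in [0, 64]).
std::uint64_t low_bits_mask(unsigned len) {
  assert(len <= 64);
  // Shifting a 64-bit value by 64 is undefined, so handle len == 64 separately.
  return len == 64 ? ~std::uint64_t{0} : ((std::uint64_t{1} << len) - 1);
}

// Scalar emulation of a byte-masked 64-byte move: only bytes whose mask bit
// is set are read from src and written to dst.
void masked_copy64(unsigned char* dst, const unsigned char* src, std::uint64_t mask) {
  for (unsigned i = 0; i < 64; ++i) {
    if (mask & (std::uint64_t{1} << i)) {
      dst[i] = src[i];
    }
  }
}
```

For example, `masked_copy64(dst, src, low_bits_mask(17))` copies exactly 17 bytes, which is the kind of sub-32-byte length the review comment says must go through the 64-byte vector path.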

Comment on lines 796 to 797
arraycopy_avx3_special_cases_256(xmm1, k2, from, to, temp1, shift,
temp4, temp3, L_entry_large, L_exit_large);

When we come here to arraycopy_avx3_special_cases_256 only up to 256 bytes need to be copied so we don't need to go back to L_entry_large.

Contributor Author

Thanks, done

- use asserts instead of conditionals in two logically unreachable blocks
- remove unused function parameters
- use 64-byte vector path in pre-loop masked write

@sviswa7 sviswa7 left a comment

Thanks a lot for taking care of all the review comments. The PR looks good to me now.

@steveatgh steveatgh requested a review from theRealAph November 22, 2023 16:42
Member

@TobiHartmann TobiHartmann left a comment

Correctness and performance testing passed.

@steveatgh
Contributor Author

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Nov 27, 2023
@openjdk

openjdk bot commented Nov 27, 2023

@steveatgh
Your change (at version 02ad27f) is now ready to be sponsored by a Committer.

@sviswa7

sviswa7 commented Nov 27, 2023

/sponsor

@openjdk

openjdk bot commented Nov 27, 2023

Going to push as commit 82967f4.
Since your change was applied there have been 56 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Nov 27, 2023
@openjdk openjdk bot closed this Nov 27, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Nov 27, 2023
@openjdk

openjdk bot commented Nov 27, 2023

@sviswa7 @steveatgh Pushed as commit 82967f4.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Labels
core-libs core-libs-dev@openjdk.org hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated