
8261553: Efficient mask generation using BMI2 BZHI instruction #2522

Closed
wants to merge 3 commits into master from JDK-8261553

Conversation

jatin-bhateja
Member

@jatin-bhateja jatin-bhateja commented Feb 11, 2021

The BMI2 BZHI instruction can be used to optimize the instruction sequence
used for mask generation at various places in the array copy stubs and in the partial in-lining of small copy operations.
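For illustration, a minimal scalar sketch (not code from this change) of what the two mask-generation sequences compute; _bzhi_u64 is the compiler intrinsic for the BZHI instruction and requires a BMI2-capable target:

    #include <cstdint>
    #include <immintrin.h>                   // _bzhi_u64 (BMI2, e.g. compile with -mbmi2)

    // Build a mask with the low n bits set, for n in [0, 63].
    uint64_t mask_old(unsigned n) {          // corresponds to the old stub sequence: movabs $1; shlx; dec
        return (uint64_t{1} << n) - 1;
    }
    uint64_t mask_new(unsigned n) {          // corresponds to the new stub sequence: movabs $-1; bzhi
        return _bzhi_u64(~uint64_t{0}, n);   // clear all bits at positions >= n
    }

Both functions produce the same mask; the BZHI form needs one instruction fewer, which is what the stub-level change exploits.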


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8261553: Efficient mask generation using BMI2 BZHI instruction

Reviewers

Download

$ git fetch https://git.openjdk.java.net/jdk pull/2522/head:pull/2522
$ git checkout pull/2522

@bridgekeeper

bridgekeeper bot commented Feb 11, 2021

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Feb 11, 2021
@openjdk

openjdk bot commented Feb 11, 2021

@jatin-bhateja The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Feb 11, 2021
@jatin-bhateja
Member Author

/label hotspot-compiler-dev

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Feb 11, 2021
@openjdk

openjdk bot commented Feb 11, 2021

@jatin-bhateja
The hotspot-compiler label was successfully added.

@jatin-bhateja jatin-bhateja changed the title 8261553: Efficient mask generation using BMI2 BZHI instruction. 8261553: Efficient mask generation using BMI2 BZHI instruction Feb 11, 2021
@mlbridge

mlbridge bot commented Feb 11, 2021

Webrevs

Member

@cl4es cl4es left a comment


  • Rather than removing the old code, I believe the code calling bzhiq needs to be in a branch checking VM_Version::supports_bmi2. Otherwise you'll hit asserts on older hardware without this extension
  • Some demonstration of the performance benefit would be nice - either a new microbenchmark or a statistically significant result running some existing ones, e.g. make test TEST=micro:ArrayCopy

@jatin-bhateja
Member Author

Hi Claes,

Here is the JMH performance data over CLX for arraycopy benchmarks:
http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_BASELINE.txt
http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_WITH_OPTS.txt

Regards,
Jatin

@jatin-bhateja
Member Author

  • Rather than removing the old code, I believe the code calling bzhiq needs to be in a branch checking VM_Version::supports_bmi2. Otherwise you'll hit asserts on older hardware without this extension

Hi Claes, I have added the missing safety check for BMI2; in general it is rare for a target that supports AVX-512 not to support BMI2.
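For reference, a rough sketch of the shape of that check (illustrative only, not the literal diff; the register and opmask names rax, r8 and k2 follow the disassembly posted later in this thread, and the exact placement of the guard in the stub generator may differ):

    // Hypothetical guard: the bzhiq-based mask generation is only emitted when
    // the CPU advertises both the required AVX-512 subset and BMI2.
    if (VM_Version::supports_avx512vlbw() && VM_Version::supports_bmi2()) {
      __ mov64(rax, -1L);
      __ bzhiq(rax, rax, r8);   // rax = all-ones with bits at positions >= r8 cleared
      __ kmovql(k2, rax);       // move the mask into opmask register k2
      // ... masked vector copy using k2 ...
    }
    // otherwise fall back to the pre-existing non-masked copy code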

  • Some demonstration of the performance benefit would be nice - either a new microbenchmark or a statistically significant result running some existing ones, e.g. make test TEST=micro:ArrayCopy

@cl4es
Member

cl4es commented Feb 11, 2021

Hi Claes,

Here is the JMH performance data over CLX for arraycopy benchmarks:
http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_BASELINE.txt
http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_WITH_OPTS.txt

Regards,
Jatin

Thanks! Eyeballing the results, it looks like a mixed bag. There even seem to be a few regressions, such as this one:

o.o.b.java.lang.ArrayCopyUnalignedSrc.testLong                   1200     N/A   avgt    2     61.663           ns/op
--> 
o.o.b.java.lang.ArrayCopyUnalignedSrc.testLong                   1200     N/A   avgt    2     74.160           ns/op

@jatin-bhateja
Member Author

Hi Claes,
Here is the JMH performance data over CLX for arraycopy benchmarks:
http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_BASELINE.txt
http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_WITH_OPTS.txt
Regards,
Jatin

Thanks! Eyeballing the results, it looks like a mixed bag. There even seem to be a few regressions, such as this one:

o.o.b.java.lang.ArrayCopyUnalignedSrc.testLong                   1200     N/A   avgt    2     61.663           ns/op
--> 
o.o.b.java.lang.ArrayCopyUnalignedSrc.testLong                   1200     N/A   avgt    2     74.160           ns/op

Hi Claes, this could be run-to-run variation. In general we now have fewer instructions (one shift operation saved per mask computation) compared to the previous mask generation sequence, and thus it will always offer better execution latency.

@cl4es
Member

cl4es commented Feb 11, 2021

Hi Claes, this could be run-to-run variation. In general we now have fewer instructions (one shift operation saved per mask computation) compared to the previous mask generation sequence, and thus it will always offer better execution latency.

Run-to-run variation would be easy to rule out by running more forks and more iterations to attain statistically significant results. While the instruction manuals suggest latency should be better for this instruction on all CPUs where it's supported, it would be good if there was some clear proof - such as a significant benchmark win - to motivate the added complexity.

@jatin-bhateja
Member Author

Hi Claes, this could be run-to-run variation. In general we now have fewer instructions (one shift operation saved per mask computation) compared to the previous mask generation sequence, and thus it will always offer better execution latency.

Run-to-run variation would be easy to rule out by running more forks and more iterations to attain statistically significant results. While the instruction manuals suggest latency should be better for this instruction on all CPUs where it's supported, it would be good if there was some clear proof - such as a significant benchmark win - to motivate the added complexity.

BASELINE:
Result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong":
  61.037 ns/op

Secondary result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong:·perf":
Perf stats:
--------------------------------------------------

         19,739.21 msec task-clock                #    0.389 CPUs utilized          
               646      context-switches          #    0.033 K/sec                  
                12      cpu-migrations            #    0.001 K/sec                  
               150      page-faults               #    0.008 K/sec                  
   74,59,83,59,139      cycles                    #    3.779 GHz                      (30.73%)
 1,78,78,79,19,117      instructions              #    2.40  insn per cycle           (38.48%)
   24,79,81,63,651      branches                  # 1256.289 M/sec                    (38.55%)
      32,24,89,924      branch-misses             #    1.30% of all branches          (38.62%)
   52,56,88,28,472      L1-dcache-loads           # 2663.167 M/sec                    (38.65%)
         39,00,969      L1-dcache-load-misses     #    0.01% of all L1-dcache hits    (38.57%)
          3,74,131      LLC-loads                 #    0.019 M/sec                    (30.77%)
            22,315      LLC-load-misses           #    5.96% of all LL-cache hits     (30.72%)
   <not supported>      L1-icache-loads                                             
         17,49,997      L1-icache-load-misses                                         (30.72%)
   52,91,41,70,636      dTLB-loads                # 2680.663 M/sec                    (30.69%)
             3,315      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.67%)
             4,674      iTLB-loads                #    0.237 K/sec                    (30.65%)
            33,746      iTLB-load-misses          #  721.99% of all iTLB cache hits   (30.63%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

      50.723759146 seconds time elapsed

      51.447054000 seconds user
       0.189949000 seconds sys


WITH OPT:
Result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong":
  74.356 ns/op

Secondary result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong:·perf":
Perf stats:
--------------------------------------------------

         19,741.09 msec task-clock                #    0.389 CPUs utilized          
               641      context-switches          #    0.032 K/sec                  
                17      cpu-migrations            #    0.001 K/sec                  
               164      page-faults               #    0.008 K/sec                  
   74,40,40,48,513      cycles                    #    3.769 GHz                      (30.81%)
 1,45,66,22,06,797      instructions              #    1.96  insn per cycle           (38.56%)
   20,31,28,43,577      branches                  # 1028.963 M/sec                    (38.65%)
         14,11,419      branch-misses             #    0.01% of all branches          (38.69%)
   43,07,86,33,662      L1-dcache-loads           # 2182.182 M/sec                    (38.72%)
         37,06,744      L1-dcache-load-misses     #    0.01% of all L1-dcache hits    (38.56%)
          1,34,292      LLC-loads                 #    0.007 M/sec                    (30.72%)
            30,627      LLC-load-misses           #   22.81% of all LL-cache hits     (30.68%)
   <not supported>      L1-icache-loads                                             
         14,49,145      L1-icache-load-misses                                         (30.65%)
   43,44,86,27,516      dTLB-loads                # 2200.924 M/sec                    (30.63%)
               218      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.63%)
             2,445      iTLB-loads                #    0.124 K/sec                    (30.63%)
            28,624      iTLB-load-misses          # 1170.72% of all iTLB cache hits   (30.63%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

      50.716083931 seconds time elapsed

      51.467300000 seconds user
       0.200390000 seconds sys

The JMH perf data for ArrayCopyUnalignedSrc.testLong with a copy length of 1200 shows a degradation in LSD usage; it seems the benchmark got displaced from its sweet spot.

However, there is a significant reduction in the instruction count and the cycle counts are almost comparable. We are saving one shift per mask computation.

          OLD Sequence:
              0x00007f7fc1030ead:   movabs $0x1,%rax
              0x00007f7fc1030eb7:   shlx   %r8,%rax,%rax
              0x00007f7fc1030ebc:   dec    %rax
              0x00007f7fc1030ebf:   kmovq  %rax,%k2
          NEW Sequence:
              0x00007f775d030d51:   movabs $0xffffffffffffffff,%rax
              0x00007f775d030d5b:   bzhi   %r8,%rax,%rax
              0x00007f775d030d60:   kmovq  %rax,%k2

@jatin-bhateja
Member Author

Further analysis of the perf degradation revealed that the new optimized instruction pattern disturbed the code alignment. This led to an increase in LSD misses and also reduced uop caching in the DSB.
Aligning the copy loops at a 32-byte boundary prevents any adverse impact on uop caching.
The NOPs used for padding add to the instruction count and may thus overshadow the code-size gains from the new mask generation sequence in the copy stubs.
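For illustration, a minimal sketch of what the 32-byte loop alignment looks like at the stub-generator level (hypothetical label and register names; not the literal diff):

    // Pad with NOPs to the next 32-byte boundary so the hot copy loop keeps a
    // placement that is friendly to the uop cache (DSB) and the LSD.
    Label copy_loop;
    __ align(32);                          // emits NOPs up to the next 32-byte boundary
    __ bind(copy_loop);
    // ... vector load/store of one chunk ...
    __ subptr(count, step);                // 'count' and 'step' are placeholder registers
    __ jcc(Assembler::greater, copy_loop);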

Baseline:
 ArrayCopyAligned.testLong   Length : 1200    61 ns/op (approx)
 1,93,44,43,11,622      cycles                                                      
 4,59,57,99,78,727      instructions              #    2.38  insn per cycle         
 1,83,68,75,68,255      idq.dsb_uops                                                
 2,08,32,43,71,906      lsd.uops                                                    
   37,12,54,60,211      idq.mite_uops                                               

With Opt:
 ArrayCopyAligned.testLong   Length : 1200   74 ns/op (approx)
 1,93,51,25,94,766      cycles                                                      
 3,75,11,57,91,917      instructions              #    1.94  insn per cycle         
   48,67,58,25,566      idq.dsb_uops                                                
      19,46,13,236      lsd.uops                                                    
 2,87,42,95,74,280      idq.mite_uops                                               

With Opt + main loop alignment (NOP padding):
 ArrayCopyAligned.testLong   Length : 1200    61 ns/op (approx)
 1,93,52,15,90,080      cycles                                                      
 4,60,89,14,06,528      instructions              #    2.38  insn per cycle         
 1,78,76,10,34,991      idq.dsb_uops                                                
 2,09,16,15,84,313      lsd.uops                                                    
   46,25,31,92,101      idq.mite_uops            

While computing the mask for the partial in-lining of small copy calls (currently enabled for sub-word types with copy lengths below 32/64 bytes), the new optimized sequence should always offer a lower instruction count and a shorter latency path.

Baseline:
ArrayCopyAligned.testByte        Length : 20  avgt    2  2.635          ns/op
 1,97,76,75,18,052      cycles                                                      
 8,96,00,37,11,803      instructions              #    4.53  insn per cycle         
    2,71,83,79,035      idq.dsb_uops                                                
 7,54,82,43,63,409      lsd.uops                                                    
    3,92,55,74,395      idq.mite_uops                                               

ArrayCopyAligned.testByte        Length : 31  avgt    2  2.635          ns/op
 1,97,79,16,56,787      cycles                                                      
 8,96,13,15,69,780      instructions              #    4.53  insn per cycle         
    2,69,07,11,691      idq.dsb_uops                                                
 7,54,95,63,77,683      lsd.uops                                                    
    3,90,19,10,747      idq.mite_uops                                               

WithOpt:
ArrayCopyAligned.testByte        Length : 20  avgt    2  2.635          ns/op
 1,97,66,64,62,541      cycles                                                      
 8,92,03,95,00,236      instructions              #    4.51  insn per cycle         
    2,72,38,56,205      idq.dsb_uops                                                
 7,50,87,50,60,591      lsd.uops                                                    
    3,89,15,02,954      idq.mite_uops                                               

ArrayCopyAligned.testByte        Length : 31  avgt    2  2.635          ns/op
 1,97,54,21,61,110      cycles                                                      
 8,91,46,64,23,754      instructions              #    4.51  insn per cycle         
    2,78,12,19,544      idq.dsb_uops                                                
 7,50,35,88,95,843      lsd.uops                                                    
    3,90,41,97,276      idq.mite_uops                               

Following are the links to updated JMH perf data:
http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_BASELINE.txt
http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_WITH_OPTS_LOOP_ALIGN.txt

In general the gains are not significant for the copy stubs, but the new sequence offers an optimal latency path for the mask computation.

@cl4es
Member

cl4es commented Feb 16, 2021

Thanks for getting to the bottom of that regression.


@neliasso neliasso left a comment


Looks good.

@openjdk

openjdk bot commented Feb 16, 2021

@jatin-bhateja This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8261553: Efficient mask generation using BMI2 BZHI instruction

Reviewed-by: redestad, neliasso

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 126 new commits pushed to the master branch:

  • 2e18b52: 8261752: Multiple GC test are missing memory requirements
  • c7885eb: 8261758: [TESTBUG] gc/g1/TestGCLogMessages.java fails if ergonomics detect too small InitialHeapSize
  • 05d5955: 8261522: [PPC64] AES intrinsics write beyond the destination array
  • 03b586b: 8261750: Remove internal class sun.net.www.MimeLauncher
  • 8418285: 8261235: C1 compilation fails with assert(res->vreg_number() == index) failed: conversion check
  • a930870: 8261309: Remove remaining StoreLoad barrier with UseCondCardMark for Serial/Parallel GC
  • b955f85: 8261075: Create stubRoutines.inline.hpp with SafeFetch implementation
  • d195033: 8261842: Shenandoah: cleanup ShenandoahHeapRegionSet
  • fc1d032: 8261125: Move VM_Operation to vmOperation.hpp
  • d547e1a: 8261123: Augment discussion of equivalence classes in Object.equals and comparison methods
  • ... and 116 more: https://git.openjdk.java.net/jdk/compare/d0a8f2f737cdfc1ae742d47f2dd4f2bbc03f4398...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Feb 16, 2021
@jatin-bhateja
Member Author

Thanks for getting to the bottom of that regression.

Thanks. There is not a significant impact on performance, but having an optimal instruction sequence still reduces the complexity. Would it be OK to check this in? We have one reviewer's consent.

@jatin-bhateja
Member Author

/integrate

@openjdk openjdk bot closed this Feb 17, 2021
@openjdk openjdk bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated labels Feb 17, 2021
@openjdk openjdk bot removed the rfr Pull request is ready for review label Feb 17, 2021
@openjdk

openjdk bot commented Feb 17, 2021

@jatin-bhateja Since your change was applied there have been 129 commits pushed to the master branch:

  • a065879: 8261791: (sctp) handleSendFailed in SctpChannelImpl.c potential leaks
  • 9ba2b71: 8261657: [PPC64] Cleanup StoreCM nodes after CMS removal
  • f639df4: 8261401: Add sanity check for UseSHM large pages similar to the one used with hugetlb large pages
  • 2e18b52: 8261752: Multiple GC test are missing memory requirements
  • c7885eb: 8261758: [TESTBUG] gc/g1/TestGCLogMessages.java fails if ergonomics detect too small InitialHeapSize
  • 05d5955: 8261522: [PPC64] AES intrinsics write beyond the destination array
  • 03b586b: 8261750: Remove internal class sun.net.www.MimeLauncher
  • 8418285: 8261235: C1 compilation fails with assert(res->vreg_number() == index) failed: conversion check
  • a930870: 8261309: Remove remaining StoreLoad barrier with UseCondCardMark for Serial/Parallel GC
  • b955f85: 8261075: Create stubRoutines.inline.hpp with SafeFetch implementation
  • ... and 119 more: https://git.openjdk.java.net/jdk/compare/d0a8f2f737cdfc1ae742d47f2dd4f2bbc03f4398...master

Your commit was automatically rebased without conflicts.

Pushed as commit cb84539.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@jatin-bhateja jatin-bhateja deleted the JDK-8261553 branch April 9, 2021 07:30
Labels: hotspot (hotspot-dev@openjdk.org), hotspot-compiler (hotspot-compiler-dev@openjdk.org), integrated (Pull request has been integrated)
3 participants