Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8248188: Add IntrinsicCandidate and API for Base64 decoding #293

Closed

Conversation

@CoreyAshford
Copy link

@CoreyAshford CoreyAshford commented Sep 22, 2020

This patch set encompasses the following commits:

  • Adds a new HotSpot intrinsic candidate to the java.lang.Base64 class - decodeBlock(), and provides a flexible API for the intrinsic. The API is similar to the existing encodeBlock intrinsic.
  • Adds the code in HotSpot to check and martial the new intrinsic's arguments to the arch-specific intrinsic implementation
  • Adds a Power64LE-specific implementation of the decodeBlock intrinsic.
  • Adds a JMH microbenchmark for both Base64 encoding and encoding.
  • Enhances the JTReg hotspot intrinsic "TestBase64.java" regression test to more fully test both decoding and encoding.

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Testing

Linux x32 Linux x64 Windows x64 macOS x64
Build (1/1 failed) ✔️ (5/5 passed) ✔️ (2/2 passed) ✔️ (2/2 passed)
Test (tier1) ✔️ (9/9 passed) (1/9 failed) ✔️ (9/9 passed)

Failed test tasks

Issue

  • JDK-8248188: Add IntrinsicCandidate and API for Base64 decoding

Reviewers

Download

$ git fetch https://git.openjdk.java.net/jdk pull/293/head:pull/293
$ git checkout pull/293

@bridgekeeper bridgekeeper bot added the oca label Sep 22, 2020
@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Sep 22, 2020

Hi @CoreyAshford, welcome to this OpenJDK project and thanks for contributing!

We do not recognize you as Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing /signed in a comment in this pull request.

If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user CoreyAshford" as summary for the issue.

If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing /covered in a comment in this pull request.

@openjdk
Copy link

@openjdk openjdk bot commented Sep 22, 2020

@CoreyAshford The following labels will be automatically applied to this pull request: core-libs hotspot shenandoah.

When this pull request is ready to be reviewed, an RFR email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label (add|remove) "label" command.

@CoreyAshford CoreyAshford force-pushed the base64_decode_intrinsic branch 5 times, most recently from ba997f4 to 8b6d979 Sep 23, 2020
@CoreyAshford CoreyAshford changed the title Base64 decode intrinsic 8248188: Add HotSpotIntrinsicCandidate and API for Base64 decoding Sep 23, 2020
@CoreyAshford
Copy link
Author

@CoreyAshford CoreyAshford commented Sep 23, 2020

This work is covered by an existing IBM <-> Oracle agreement
/covered

@CoreyAshford CoreyAshford marked this pull request as ready for review Sep 23, 2020
@bridgekeeper bridgekeeper bot removed the oca label Sep 24, 2020
@openjdk openjdk bot added the rfr label Sep 24, 2020
@mlbridge
Copy link

@mlbridge mlbridge bot commented Sep 24, 2020

Webrevs

@TheRealMDoerr
Copy link
Contributor

@TheRealMDoerr TheRealMDoerr commented Sep 25, 2020

Hi Corey,

thanks for refactoring the Java code such that it matches the intrinsic implementation. That's a better design.

I'm now looking at the PPC64 platform code. The algorithm looks fine.

I can see you're using clrldi to clear the upper bits of the parameters. But seems like it clears one bit too few.
You can also use cmpwi for the boolean one.

I wonder about the loop unrolling. It doesn't look beneficial because the loop body is large.
Did you measure performance gain by this unrolling?
I think for agressive tuning we'd have to apply techniques like modulo scheduling, but that's much more work.
So please only use unrolling as far as a benefit is measurable.

But you may want to align the loop start to help instruction fetch.

We'll test it, but we don't have Power 10. You guys need to cover that.

Best regards,
Martin

Copy link
Contributor

@TheRealMDoerr TheRealMDoerr left a comment

AOT support needs an update:

Internal Error (jdk/src/hotspot/share/aot/aotCodeHeap.cpp:557), pid=345656, tid=364316

guarantee(adr != NULL) failed: AOT Symbol not found _aot_stub_routines_base64_decodeBlock

V [jvm.dll+0x1dbc6e] AOTCodeHeap::link_stub_routines_symbols+0xf7e (aotcodeheap.cpp:557)
V [jvm.dll+0x1d95e8] AOTCodeHeap::link_global_lib_symbols+0x2f8 (aotcodeheap.cpp:603)
V [jvm.dll+0x1dc616] AOTCodeHeap::load_klass_data+0x476 (aotcodeheap.cpp:840)
V [jvm.dll+0x1e1021] AOTLoader::load_for_klass+0x161 (aotloader.cpp:55)
V [jvm.dll+0x5fec96] InstanceKlass::initialize_impl+0x4e6 (instanceklass.cpp:1159)
V [jvm.dll+0x5fead6] InstanceKlass::initialize_impl+0x326 (instanceklass.cpp:1133)
V [jvm.dll+0xc5a633] Threads::initialize_java_lang_classes+0x93 (thread.cpp:3766)
V [jvm.dll+0xc57c32] Threads::create_vm+0xa12 (thread.cpp:4037)

Can be reproduced by running JTREG tests:
compiler/aot/calls/fromAot

@CoreyAshford
Copy link
Author

@CoreyAshford CoreyAshford commented Sep 28, 2020

AOT support needs an update:

Internal Error (jdk/src/hotspot/share/aot/aotCodeHeap.cpp:557), pid=345656, tid=364316

guarantee(adr != NULL) failed: AOT Symbol not found _aot_stub_routines_base64_decodeBlock

V [jvm.dll+0x1dbc6e] AOTCodeHeap::link_stub_routines_symbols+0xf7e (aotcodeheap.cpp:557)
V [jvm.dll+0x1d95e8] AOTCodeHeap::link_global_lib_symbols+0x2f8 (aotcodeheap.cpp:603)
V [jvm.dll+0x1dc616] AOTCodeHeap::load_klass_data+0x476 (aotcodeheap.cpp:840)
V [jvm.dll+0x1e1021] AOTLoader::load_for_klass+0x161 (aotloader.cpp:55)
V [jvm.dll+0x5fec96] InstanceKlass::initialize_impl+0x4e6 (instanceklass.cpp:1159)
V [jvm.dll+0x5fead6] InstanceKlass::initialize_impl+0x326 (instanceklass.cpp:1133)
V [jvm.dll+0xc5a633] Threads::initialize_java_lang_classes+0x93 (thread.cpp:3766)
V [jvm.dll+0xc57c32] Threads::create_vm+0xa12 (thread.cpp:4037)

Can be reproduced by running JTREG tests:
compiler/aot/calls/fromAot

Thanks for catching that! Will fix on next round.

@CoreyAshford
Copy link
Author

@CoreyAshford CoreyAshford commented Sep 28, 2020

Martin Doerr wrote:
...

I can see you're using clrldi to clear the upper bits of the parameters. But seems like it clears one bit too few.

You're right. I misread the instruction as clear from bit 0 to bit N, but it's actually create a mask with bits N to 63 with one's, zeroes elsewhere, then AND it with the src register.

Will fix.

You can also use cmpwi for the boolean one.

Ah, good! Thanks. Will change.

I wonder about the loop unrolling. It doesn't look beneficial because the loop body is large.
Did you measure performance gain by this unrolling?
I think for agressive tuning we'd have to apply techniques like modulo scheduling, but that's much more work.
So please only use unrolling as far as a benefit is measurable.

I did test on a prototype written in C using vector intrinsics, and 8 was the sweet spot, however the structure of that code was a bit different and I should have verified that the same amount of loop unrolling makes sense for the Java intrinsic. I will perform those experiments.

But you may want to align the loop start to help instruction fetch.

Interesting. I did add an align, but in my patch clean up I must have lost it again somehow. I will add it back again. Sorry for that mistake.

We'll test it, but we don't have Power 10. You guys need to cover that.

I did test on Power10, but I wasn't able to do performance testing because I ran on an instruction-level simulator. Real hardware will be available in the coming months.

Thanks for your careful look at the code, and the regression testing you've done.

@CoreyAshford
Copy link
Author

@CoreyAshford CoreyAshford commented Sep 29, 2020

Can be reproduced by running JTREG tests:
compiler/aot/calls/fromAot

I have tried reproducing this, but haven't yet succeeded. Here's how I'm running it from the jdk/test directory:

jtreg -jdk:<path_to_jdk> ./hotspot/jtreg/compiler/aot/calls/fromAot

The response is this:

Test results: no tests selected
Report written to /home/cjashfor/git-trees/jdk/test/JTreport/html/report.html
Results written to /home/cjashfor/git-trees/jdk/test/JTwork

The report's Results sections shows "Total 0"

Any ideas? I'm new to running JTReg tests, so don't assume I know anything :)

@TheRealMDoerr
Copy link
Contributor

@TheRealMDoerr TheRealMDoerr commented Sep 29, 2020

@CoreyAshford
Copy link
Author

@CoreyAshford CoreyAshford commented Sep 29, 2020

Did you try on x86? AOT is not supported on PPC64.

I didn't. No wonder. Thank you!

@CoreyAshford
Copy link
Author

@CoreyAshford CoreyAshford commented Sep 30, 2020

Did you try on x86? AOT is not supported on PPC64.

After looking at this a bit, I find that there seems to be an assumption in the code that if there is an intrinsic symbol defined in aotCodeHeap.cpp using the SET_AOT_GLOBAL_SYMBOL_VALUE macro, it is required that the intrinsic is implemented for every arch that implements AOT. In this case, there isn't an implementation for x86_64 (yet), so that's why the failure is occurring.

I was tempted to put in an arch-specific #if for ppc arch only, but I don't see any arch-specific code in this area, and it doesn't make sense either because AOT isn't supported on ppc at all. Another alternative is to remove the SET_AOT_GLOBAL_SYMBOL_VALUE for decodeBlock, since the implementation is not defined (yet) for any arch which supports AOT.

A third alternative would be to leave the macro call in, but comment it out, saying to uncomment it when it's supported on all AOT-capable arches.

Any thoughts?

@CoreyAshford CoreyAshford changed the title 8248188: Add HotSpotIntrinsicCandidate and API for Base64 decoding 8248188: Add IntrinsicCandidate and API for Base64 decoding Sep 30, 2020
@vnkozlov
Copy link
Contributor

@vnkozlov vnkozlov commented Oct 21, 2020

Note, tier1 and tier2 passed clean. But I have to rebuild it with updated test and run tier3 again.
CheckGraalIntrinsics.java passed.

Copy link
Contributor

@vnkozlov vnkozlov left a comment

Tier3 testing clean with updated test.

@CoreyAshford
Copy link
Author

@CoreyAshford CoreyAshford commented Oct 22, 2020

Tier3 testing clean with updated test.

Thank you for identifying the problem, the fix, then rebuilding and rerunning the tests!

Copy link

@pmur pmur left a comment

Hi @pmur, thanks for making a comment in an OpenJDK project!

All comments and discussions in the OpenJDK Community must be made available under the OpenJDK Terms of Use. If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please Use "Add GitHub user pmur for the summary.

If you are not an OpenJDK Author, Committer or Reviewer, simply check the box below to accept the OpenJDK Terms of Use for your comments.

Your comment will be automatically restored once you have accepted the OpenJDK Terms of Use.

src/hotspot/cpu/ppc/stubGenerator_ppc.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/ppc/stubGenerator_ppc.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/ppc/stubGenerator_ppc.cpp Show resolved Hide resolved
src/hotspot/cpu/ppc/stubGenerator_ppc.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/ppc/stubGenerator_ppc.cpp Outdated Show resolved Hide resolved
@mlbridge
Copy link

@mlbridge mlbridge bot commented Nov 2, 2020

Mailing list message from Corey Ashford on hotspot-dev:

On 10/26/20 12:47 PM, Paul Murphy wrote:

On Thu, 22 Oct 2020 22:06:11 GMT, CoreyAshford <github.com+51754783+CoreyAshford at openjdk.org> wrote:

src/hotspot/cpu/ppc/stubGenerator_ppc.cpp line 3878:

3876: // | Element | | | | | | | | |
3877: // +===============+=============+======================+======================+=============+=============+======================+======================+=============+
3878: // | after vaddubm | 00||b0:0..5 | 00||b0:6..7||b1:0..3 | 00||b1:4..7||b2:0..1 | 00||b2:2..7 | 00||b3:0..5 | 00||b3:6..7||b4:0..3 | 00||b4:4..7||b5:0..1 | 00||b5:2..7 |

An extra line here showing how the 8 6-bit values above get mapping into 6 bytes greatly help my brain out. (likewise for the <P10 case below too)

Just to make sure I understand, you're not asking for a change here, is that right?

I think the first line should also express the initial layout of the 6 bit values similar to the linked algo. I think changing this comment add an extra line which describes the bits as they leave `vaddubm` would be helpful to understand the demangling here. (e.g the `00aaaaaa 00bbbbbb 00ccccc 00dddddd` comments in the linked paper)

Ok, got it. I will change it as you suggest to create a better mental
link between the terminology used in the paper and the bit numbering I
chose to use in the code comments.

Got it. I will try that out and see how it looks compared to the
byte-swapped version. Also I will add a comment about vpextd operating
on doublewords.

As a side note, on github, it's waiting for you to check a box: "I agree
to the OpenJDK Terms of use for all comments I make in a project in the
OpenJDK GitHub organization.". Until you tick that box, your comment
can't be seen there.

https://github.com//pull/293#discussion_r512214354

Thanks,

- Corey

Corey Ashford added 2 commits Nov 3, 2020
…orithm

 * Change the order of the bytes as listed in the tables, which makes the
   use of vpextd easier to understand.

 * Because the byte order of the constants used in the tables is reversed from
   the original documentation, change the constant declarations to match the order
   in the table, by using the ARRAY_TO_LXV_ORDER macro.  This makes the constant
   declarations more consistent as well.
…nstruction to improves performance by about 9%

This conditional branch around the xxsel seemed like a good idea at the
time, because I thought the branch would be less costly than the xxsel
instruction, but it turns out not to be the case; executing the xxsel every
time without a conditional branch increases performance by about 9%.
Removing that branch also removed the need for the declaration and usage of
an array of Label's for the branch destinations inside the unrolled code.
@openjdk openjdk bot removed ready rfr labels Nov 3, 2020
Copy link
Contributor

@TheRealMDoerr TheRealMDoerr left a comment

Thanks for removing the branch from the loop. (Maybe this affects unrolling decision.) Looks good.

@CoreyAshford
Copy link
Author

@CoreyAshford CoreyAshford commented Nov 4, 2020

Thanks for removing the branch from the loop. (Maybe this affects unrolling decision.) Looks good.

Yeah, it does, and oddly enough the best loop unroll value is now 1. I will re-run the benchmarks again to confirm, but that's what it's looking like now.

Copy link
Contributor

@TheRealMDoerr TheRealMDoerr left a comment

I already had the feeling that unrolling the large loop was not beneficial. Thanks and thumps up from my side!

@CoreyAshford
Copy link
Author

@CoreyAshford CoreyAshford commented Nov 11, 2020

/integrate

@openjdk openjdk bot added the sponsor label Nov 11, 2020
@openjdk
Copy link

@openjdk openjdk bot commented Nov 11, 2020

@CoreyAshford
Your change (at version 9e303da) is now ready to be sponsored by a Committer.

@TheRealMDoerr
Copy link
Contributor

@TheRealMDoerr TheRealMDoerr commented Nov 11, 2020

/sponsor

@openjdk
Copy link

@openjdk openjdk bot commented Nov 11, 2020

@TheRealMDoerr @CoreyAshford Since your change was applied there have been 357 commits pushed to the master branch:

  • 59965c1: 8256237: Zero: non-PCH build fails after JDK-8253064
  • bfa060f: 8256051: nmethod_entry_barrier stub miscalculates xmm spill size on x86_32
  • 96e0261: 8256106: Bypass intrinsic/barrier when calling Reference.get() from Finalizer
  • 3c3469b: 8256020: Shenandoah: Don't resurrect objects during evacuation on AS_NO_KEEPALIVE
  • 2e19026: 8253064: monitor list simplifications and getting rid of TSM
  • 421a7c3: 8256182: Update qemu-debootstrap cross-compilation recipe
  • 6247736: 8256018: Adler32/CRC32/CRC32C missing reachabilityFence
  • 436019b: 8256166: [C2] Registers get confused on Big Endian after 8221404
  • ed615e3: 4907798: MEMORY LEAK: javax.swing.plaf.basic.BasicPopupMenuUI$MenuKeyboardHelper
  • 362feaa: 8254661: arm32: additional cleanup after fixing SIGSEGV
  • ... and 347 more: https://git.openjdk.java.net/jdk/compare/3ccf4877a8efb62f8e08d59cebea8c89148d056a...master

Your commit was automatically rebased without conflicts.

Pushed as commit ccb48b7.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

openjdk-notifier bot referenced this issue Nov 11, 2020
8248188: Add IntrinsicCandidate and API for Base64 decoding, add Power64LE intrinsic implementation.

This patch set encompasses the following commits:

Adds a new intrinsic candidate to the java.lang.Base64 class - decodeBlock(), and provides a flexible API for the intrinsic. The API is similar to the existing encodeBlock intrinsic.

Adds the code in HotSpot to check and martial the new intrinsic's arguments to the arch-specific intrinsic implementation.

Adds a Power64LE-specific implementation of the decodeBlock intrinsic.

Adds a JMH microbenchmark for both Base64 encoding and encoding.

Enhances the JTReg hotspot intrinsic "TestBase64.java" regression test to more fully test both decoding and encoding.

Reviewed-by: rriggs, mdoerr, kvn
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment