8248188: Add IntrinsicCandidate and API for Base64 decoding #293
Hi @CoreyAshford, welcome to this OpenJDK project and thanks for contributing!
We do not recognize you as Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing
If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing
@CoreyAshford The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an RFR email will be sent to the corresponding mailing lists. If you would like to change these labels, use the
Thanks for refactoring the Java code so that it matches the intrinsic implementation. That's a better design.
I'm now looking at the PPC64 platform code. The algorithm looks fine.
I can see you're using clrldi to clear the upper bits of the parameters, but it seems to clear one bit too few.
I wonder about the loop unrolling. It doesn't look beneficial because the loop body is large.
But you may want to align the loop start to help instruction fetch.
We'll test it, but we don't have Power 10. You guys need to cover that.
Thanks for catching that! Will fix on next round.
Martin Doerr wrote:
You're right. I misread the instruction as clearing bits 0 through N, but it actually creates a mask with ones in bits N to 63 and zeroes elsewhere, then ANDs it with the source register.
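To make the mask semantics concrete, here is a minimal Java sketch of what `clrldi` computes, assuming Power ISA bit numbering (bit 0 is the most significant bit). The class and method names are hypothetical and exist only for illustration; this is not code from the patch, and it does not claim to show the exact operand used there. It only shows how an operand that is off by one leaves an extra high bit in place.

```java
final class ClrldiSketch {
    // Models "clrldi ra, rs, n": clear the n high-order bits of a 64-bit value.
    // With bit 0 as the most significant bit, the mask has ones in bits n..63
    // and zeroes in bits 0..n-1, and the result is rs AND mask.
    static long clrldi(long rs, int n) {
        if (n < 0 || n > 63) {
            throw new IllegalArgumentException("n must be in 0..63");
        }
        long mask = -1L >>> n; // n leading zeroes, then (64 - n) ones
        return rs & mask;
    }

    public static void main(String[] args) {
        // Low 32 bits hold the value 16; the upper half is leftover garbage.
        long dirty = 0xDEADBEEF_00000010L;
        // Clearing 32 bits zero-extends the 32-bit value.
        System.out.println(Long.toHexString(clrldi(dirty, 32))); // 10
        // Clearing only 31 bits leaves one garbage bit above the low word.
        System.out.println(Long.toHexString(clrldi(dirty, 31))); // 100000010
    }
}
```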
Ah, good! Thanks. Will change.
I did test on a prototype written in C using vector intrinsics, and an unroll factor of 8 was the sweet spot. However, the structure of that code was a bit different, and I should have verified that the same amount of loop unrolling makes sense for the Java intrinsic. I will perform those experiments.
Interesting. I did add an align, but in my patch clean up I must have lost it again somehow. I will add it back again. Sorry for that mistake.
I did test on Power10, but I wasn't able to do performance testing because I ran on an instruction-level simulator. Real hardware will be available in the coming months.
Thanks for your careful look at the code, and the regression testing you've done.
I have tried reproducing this, but haven't yet succeeded. Here's how I'm running it from the jdk/test directory:
The response is this:
The report's Results section shows "Total 0"
Any ideas? I'm new to running JTReg tests, so don't assume I know anything :)
After looking at this a bit, I find that there seems to be an assumption in the code that if there is an intrinsic symbol defined in aotCodeHeap.cpp using the SET_AOT_GLOBAL_SYMBOL_VALUE macro, it is required that the intrinsic is implemented for every arch that implements AOT. In this case, there isn't an implementation for x86_64 (yet), so that's why the failure is occurring.
I was tempted to put in an arch-specific #if for ppc arch only, but I don't see any arch-specific code in this area, and it doesn't make sense either because AOT isn't supported on ppc at all. Another alternative is to remove the SET_AOT_GLOBAL_SYMBOL_VALUE for decodeBlock, since the implementation is not defined (yet) for any arch which supports AOT.
A third alternative would be to leave the macro call in, but comment it out, saying to uncomment it when it's supported on all AOT-capable arches.
On 10/26/20 12:47 PM, Paul Murphy wrote:
Ok, got it. I will change it as you suggest to create a better mental
Got it. I will try that out and see how it looks compared to the
As a side note, on github, it's waiting for you to check a box: "I agree
…orithm
* Change the order of the bytes as listed in the tables, which makes the use of vpextd easier to understand.
* Because the byte order of the constants used in the tables is reversed from the original documentation, change the constant declarations to match the order in the table, by using the ARRAY_TO_LXV_ORDER macro. This makes the constant declarations more consistent as well.
…nstruction to improve performance by about 9%
This conditional branch around the xxsel seemed like a good idea at the time, because I thought the branch would be less costly than the xxsel instruction, but that turns out not to be the case: executing the xxsel every time, without a conditional branch, increases performance by about 9%. Removing that branch also removed the need to declare and use an array of Labels for the branch destinations inside the unrolled code.
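The trade-off described in this commit message can be sketched in scalar Java. This is only an illustration of the general idea of a bitwise select versus a branch: the real `xxsel` operates on vector registers, and the class and method names below are hypothetical, not taken from the patch.

```java
final class BranchlessSelectSketch {
    // Bitwise select: each result bit comes from b where mask is 1, else from a.
    // It always executes the same instructions, whatever the mask holds.
    static long select(long a, long b, long mask) {
        return (a & ~mask) | (b & mask);
    }

    // Branching equivalent for an all-or-nothing choice. It skips the select
    // when nothing needs replacing, at the cost of a conditional branch.
    static long selectWithBranch(long a, long b, boolean useB) {
        if (useB) {
            return b;
        }
        return a;
    }

    public static void main(String[] args) {
        long a = 0x1111111111111111L;
        long b = 0x2222222222222222L;
        // Take the low 32 bits from b and the high 32 bits from a.
        System.out.println(Long.toHexString(select(a, b, 0x00000000FFFFFFFFL)));
        // An all-ones mask gives the same answer as the taken branch.
        System.out.println(select(a, b, -1L) == selectWithBranch(a, b, true));
    }
}
```

Which form is faster depends on the hardware and the surrounding code, so the 9% figure above should be read as a measurement of this specific intrinsic.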
Your commit was automatically rebased without conflicts.
Pushed as commit ccb48b7.
8248188: Add IntrinsicCandidate and API for Base64 decoding, add Power64LE intrinsic implementation.
This patch set encompasses the following commits:
* Adds a new intrinsic candidate to the java.util.Base64 class, decodeBlock(), and provides a flexible API for the intrinsic. The API is similar to the existing encodeBlock intrinsic.
* Adds the code in HotSpot to check and marshal the new intrinsic's arguments to the arch-specific intrinsic implementation.
* Adds a Power64LE-specific implementation of the decodeBlock intrinsic.
* Adds a JMH microbenchmark for both Base64 encoding and decoding.
* Enhances the JTReg hotspot intrinsic "TestBase64.java" regression test to more fully test both decoding and encoding.
Reviewed-by: rriggs, mdoerr, kvn
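For readers who want to exercise the decoding path from ordinary Java code, here is a small sketch using only the public `java.util.Base64` API. The class and helper names are hypothetical. Whether a given call actually reaches an intrinsified `decodeBlock` depends on the platform, the JVM build, and JIT compilation, none of which this example controls or verifies.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Base64DecodeSketch {
    // Decodes a standard (RFC 4648) Base64 string into UTF-8 text.
    static String decodeToText(String encoded) {
        byte[] decoded = Base64.getDecoder().decode(encoded);
        return new String(decoded, StandardCharsets.UTF_8);
    }

    // Encodes text and decodes it again; the result should equal the input.
    static String roundTrip(String text) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        String encoded = Base64.getEncoder().encodeToString(raw);
        return decodeToText(encoded);
    }

    public static void main(String[] args) {
        System.out.println(decodeToText("aGVsbG8=")); // hello
        System.out.println(roundTrip("Base64 decoding intrinsic"));
    }
}
```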