Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops #10847

Closed
wants to merge 76 commits into from

Conversation

cl4es
Copy link
Member

@cl4es cl4es commented Oct 25, 2022

Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops.

I've rewired the library changes to route via a single @IntrinsicCandidate method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for Arrays.hashCode cases.

Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for Arrays or zero for String). It seems somewhat premature to parameterize this up front.

The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads.

With the most recent fixes the x64 intrinsic results on my workstation look like this:

Benchmark                               (size)  Mode  Cnt     Score    Error  Units
StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.199 ±  0.017  ns/op
StringHashCode.Algorithm.defaultLatin1      10  avgt    5     6.933 ±  0.049  ns/op
StringHashCode.Algorithm.defaultLatin1     100  avgt    5    29.935 ±  0.221  ns/op
StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  1596.982 ±  7.020  ns/op

Baseline:

Benchmark                               (size)  Mode  Cnt     Score    Error  Units
StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.200 ±  0.013  ns/op
StringHashCode.Algorithm.defaultLatin1      10  avgt    5     9.424 ±  0.122  ns/op
StringHashCode.Algorithm.defaultLatin1     100  avgt    5    90.541 ±  0.512  ns/op
StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  9425.321 ± 67.630  ns/op

I.e. no measurable overhead compared to baseline even for size == 1.

The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good.

Benchmark for Arrays.hashCode:

Benchmark              (size)  Mode  Cnt     Score    Error  Units
ArraysHashCode.bytes        1  avgt    5     1.884 ±  0.013  ns/op
ArraysHashCode.bytes       10  avgt    5     6.955 ±  0.040  ns/op
ArraysHashCode.bytes      100  avgt    5    87.218 ±  0.595  ns/op
ArraysHashCode.bytes    10000  avgt    5  9419.591 ± 38.308  ns/op
ArraysHashCode.chars        1  avgt    5     2.200 ±  0.010  ns/op
ArraysHashCode.chars       10  avgt    5     6.935 ±  0.034  ns/op
ArraysHashCode.chars      100  avgt    5    30.216 ±  0.134  ns/op
ArraysHashCode.chars    10000  avgt    5  1601.629 ±  6.418  ns/op
ArraysHashCode.ints         1  avgt    5     2.200 ±  0.007  ns/op
ArraysHashCode.ints        10  avgt    5     6.936 ±  0.034  ns/op
ArraysHashCode.ints       100  avgt    5    29.412 ±  0.268  ns/op
ArraysHashCode.ints     10000  avgt    5  1610.578 ±  7.785  ns/op
ArraysHashCode.shorts       1  avgt    5     1.885 ±  0.012  ns/op
ArraysHashCode.shorts      10  avgt    5     6.961 ±  0.034  ns/op
ArraysHashCode.shorts     100  avgt    5    87.095 ±  0.417  ns/op
ArraysHashCode.shorts   10000  avgt    5  9420.617 ± 50.089  ns/op

Baseline:

Benchmark              (size)  Mode  Cnt     Score    Error  Units
ArraysHashCode.bytes        1  avgt    5     3.213 ±  0.207  ns/op
ArraysHashCode.bytes       10  avgt    5     8.483 ±  0.040  ns/op
ArraysHashCode.bytes      100  avgt    5    90.315 ±  0.655  ns/op
ArraysHashCode.bytes    10000  avgt    5  9422.094 ± 62.402  ns/op
ArraysHashCode.chars        1  avgt    5     3.040 ±  0.066  ns/op
ArraysHashCode.chars       10  avgt    5     8.497 ±  0.074  ns/op
ArraysHashCode.chars      100  avgt    5    90.074 ±  0.387  ns/op
ArraysHashCode.chars    10000  avgt    5  9420.474 ± 41.619  ns/op
ArraysHashCode.ints         1  avgt    5     2.827 ±  0.019  ns/op
ArraysHashCode.ints        10  avgt    5     7.727 ±  0.043  ns/op
ArraysHashCode.ints       100  avgt    5    89.405 ±  0.593  ns/op
ArraysHashCode.ints     10000  avgt    5  9426.539 ± 51.308  ns/op
ArraysHashCode.shorts       1  avgt    5     3.071 ±  0.062  ns/op
ArraysHashCode.shorts      10  avgt    5     8.168 ±  0.049  ns/op
ArraysHashCode.shorts     100  avgt    5    90.399 ±  0.292  ns/op
ArraysHashCode.shorts   10000  avgt    5  9420.171 ± 44.474  ns/op

As we can see the Arrays intrinsics are faster for small inputs, and faster on large inputs for char and int (the ones currently vectorized). I aim to fix byte and short cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops

Reviewers

Contributors

  • Sandhya Viswanathan <sviswanathan@openjdk.org>
  • Ludovic Henry <luhenry@openjdk.org>
  • Claes Redestad <redestad@openjdk.org>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10847/head:pull/10847
$ git checkout pull/10847

Update a local copy of the PR:
$ git checkout pull/10847
$ git pull https://git.openjdk.org/jdk pull/10847/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 10847

View PR using the GUI difftool:
$ git pr show -t 10847

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10847.diff

luhenry and others added 27 commits March 4, 2022 16:22
…loops

Despite the hash value being cached for Strings, computing the hash still represents a significant CPU usage for applications handling lots of text.

Even though it would be generally better to do it through an enhancement to the autovectorizer, the complexity of doing it by hand is trivial and the gain is sizable (2x speedup) even without the Vector API. The algorithm has been proposed by Richard Startin and Paul Sandoz [1].

At Datadog, we handle a great amount of text (through logs management for example), and hashing String represents a large part of our CPU usage. It's very unlikely that we are the only one as String.hashCode is such a core feature of the JVM-based languages with its use in HashMap for example. Having even only a 2x speedup would allow us to save thousands of CPU cores per month and improve correspondingly the energy/carbon impact.

[1] https://static.rainfocus.com/oracle/oow18/sess/1525822677955001tLqU/PF/codeone18-vector-API-DEV5081_1540354883936001Q3Sv.pdf
Next is to generalize it for Arrays.hashCode and StringUTF16.hashCode and make it cheap on shorter strings
@bridgekeeper
Copy link

bridgekeeper bot commented Oct 25, 2022

👋 Welcome back redestad! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Oct 25, 2022

⚠️ @cl4es This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request in integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).

@openjdk
Copy link

openjdk bot commented Oct 25, 2022

@cl4es The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot
  • security
  • shenandoah

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@cl4es
Copy link
Member Author

cl4es commented Jan 9, 2023

Explicitly loading the address to a register seems to do the trick, avoiding the pitfalls of as_Address(AddressLiteral) - which apparently only works (portably) when we know for certain the address is in some allowed range. There's no measurable difference on microbenchmarks (there might be a couple of extra lea instructions on the vectorized paths, but that disappears in the noise). Thanks @fisk for the suggestion!

@sviswa7
Copy link

sviswa7 commented Jan 10, 2023

Thanks @cl4es for fixing this issue. Changes look good to me.

@cl4es
Copy link
Member Author

cl4es commented Jan 11, 2023

I'll do another round of internal testing (tier1-4). Unless I hear any objections I plan to integrate this once all testing looks satisfactory.

Copy link

@iwanowww iwanowww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the patch goes in, I'd like to see a plan how the code will be refactored later.

At the very least, I expect is_string_hashcode-related logic to go away and the intrinsic logic to be guided solely by a basic type of elements. If not in the initial version, then shortly after as a follow-up enhancement. Another thing I want to see is VectorizedHashCode node to go away and replaced with a stub call.

@cl4es
Copy link
Member Author

cl4es commented Jan 11, 2023

I'm not convinced using basic types - with T_BOOLEAN for unsigned byte - improves this much. It'd of course be nice with something more canonical than a set of adhoc constants strewn in here to steer this, but maybe we should pass element size and signedness. I think this would be a reasonable cleanup.

I'd also be willing to spend time rewriting as a stub call if someone could give me some pointers on how to best do that. This might be straightforward and simplify the implementation, but making a stub call could have noticeable overheads for small strings. A this is the common case the stub call overhead could be prohibitively expensive. An alternative is to keep the new node but extract the vectorized path as a stub routine and call it from inside the inlined intrinsic - similarly to what's done in C2_MacroAssembler::string_compare on aarch64.

…m arrays_hashcode, add initial value and range to intrinsic
@cl4es
Copy link
Member Author

cl4es commented Jan 15, 2023

FWIW I prototyped a follow-up to use basic types and extracted the String special-casing from the code. To do so a few things unraveled, such as needing to pass the initial value, but arguably it all ended up a bit neater. I've put this experiment in another branch for now (https://github.com/openjdk/jdk/compare/pr/10847...cl4es:jdk:8282664-type-cleanup?expand=1) since I need to test it through thoroughly, but functionally and to ensure there's no obvious performance impact (did some quick sanity testing on micros that look perfectly neutral)

@iwanowww does this make you a bit happier? I think of it as an immediate follow-up - but if there's strong preference I can merge it into this PR.

@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jan 16, 2023
@openjdk openjdk bot added ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jan 16, 2023
@cl4es
Copy link
Member Author

cl4es commented Jan 16, 2023

I've opted to include the changes spurred by @iwanowww's comments since it led to a number of revisions to the intrinsified method API, and it would be strange to introduce an intrinsified method just to change the API drastically in a follow-up. Basically ArraysSupport.vectorizedHashCode has been changed to take an offset + length, an initial value and the logical basic type of the array elements. Which means any necessary scaling of index and length needs to be taken care of before calling the intrinsic. This makes the implementation more flexible at no measurable performance cost. Overall the refactoring might have reduced complexity a bit.

Reviewers might observe that nothing is currently passing anything but 0 and length to vectorizedHashCode outside of the simple sanity test I've added, but I've verified this feature can be used to some effect elsewhere in the JDK, e.g: https://github.com/openjdk/jdk/compare/pr/10847...cl4es:jdk:zipcoder-hashcode?expand=1 (which improves speed of opening ZipFile by a small percentage in microbenchmarks).

Copy link

@iwanowww iwanowww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Claes. Looks good.

Please, file an RFE for the follow-up work.

src/hotspot/share/opto/machnode.cpp Outdated Show resolved Hide resolved
// Possible values for the type operand of the NEWARRAY instruction.
// See https://docs.oracle.com/javase/specs/jvms/se9/html/jvms-6.html#jvms-6.5.newarray.

public static final int T_BOOLEAN = 4;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an idea for a follow-up enhancement, unless there are plans to implement runtime dispatching between different stubs, the basic type can be coded as a Class and on compiler side the corresponding basic type extracted with java_lang_Class::as_BasicType().

@cl4es
Copy link
Member Author

cl4es commented Jan 17, 2023

Thanks for your patience and reviews.

/integrate

@openjdk
Copy link

openjdk bot commented Jan 17, 2023

Going to push as commit e37078f.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jan 17, 2023
@openjdk openjdk bot closed this Jan 17, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jan 17, 2023
@openjdk
Copy link

openjdk bot commented Jan 17, 2023

@cl4es Pushed as commit e37078f.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@luhenry
Copy link
Member

luhenry commented Jan 18, 2023

Thanks for pushing it all the way!

@cl4es
Copy link
Member Author

cl4es commented Jan 18, 2023

Filed https://bugs.openjdk.org/browse/JDK-8300448 to follow-up and rewrite part of or all of the inlined code as a stub call.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-libs core-libs-dev@openjdk.org hotspot hotspot-dev@openjdk.org hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated shenandoah shenandoah-dev@openjdk.org