8282664: Unroll by hand StringUTF16, StringLatin1, and Arrays polynomial hash loops #7700

Closed


luhenry
Member

@luhenry luhenry commented Mar 4, 2022

Although the hash value is cached for Strings, computing it still accounts for significant CPU usage in applications that handle lots of text.

Even though it would generally be better to achieve this through an enhancement to the autovectorizer, doing it by hand is trivial and the gain is sizable (roughly a 2x speedup on longer strings) even without the Vector API. The algorithm was proposed by Richard Startin and Paul Sandoz [1].
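
As a rough illustration of the hand-unrolling described above, the sketch below folds eight Latin-1 bytes per iteration into a single accumulator. The class and method names (`UnrolledHashSketch`, `hashUnrolled8`, `pow31`) are assumptions made for this example and are not taken from the patch; the powers of 31 are computed at class initialisation rather than hard-coded.

```java
import java.nio.charset.StandardCharsets;

final class UnrolledHashSketch {
    // Powers of 31 with int wrap-around, held in static finals.
    private static final int P2 = pow31(2), P3 = pow31(3), P4 = pow31(4),
                             P5 = pow31(5), P6 = pow31(6), P7 = pow31(7),
                             P8 = pow31(8);

    private static int pow31(int k) {
        int p = 1;
        for (int j = 0; j < k; j++) {
            p *= 31;
        }
        return p;
    }

    // Eight elements per iteration: h' = 31^8 * h + 31^7 * v[i] + ... + v[i + 7].
    static int hashUnrolled8(byte[] value) {
        int h = 0;
        int i = 0, len = value.length;
        for (; i + 7 < len; i += 8) {
            h = P8 * h
              + P7 * (value[i] & 0xff)
              + P6 * (value[i + 1] & 0xff)
              + P5 * (value[i + 2] & 0xff)
              + P4 * (value[i + 3] & 0xff)
              + P3 * (value[i + 4] & 0xff)
              + P2 * (value[i + 5] & 0xff)
              + 31 * (value[i + 6] & 0xff)
              + (value[i + 7] & 0xff);
        }
        // Scalar tail for the remaining 0..7 elements.
        for (; i < len; i++) {
            h = 31 * h + (value[i] & 0xff);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "polynomial hash loops";
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        // Compares the unrolled result with the JDK's own String.hashCode().
        System.out.println(hashUnrolled8(bytes) == s.hashCode());
    }
}
```

The unrolled body and the scalar tail compute the same polynomial, so the result should match `String.hashCode()` for any Latin-1 input; only the shape of the loop changes.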

Speedups are as follows on an Intel(R) Xeon(R) E-2276G CPU @ 3.80GHz:

Benchmark                                        (size)  Mode  Cnt     Score    Error  Units
StringHashCode.Algorithm.scalarLatin1                 0  avgt   25     2.111 ±  0.210  ns/op
StringHashCode.Algorithm.scalarLatin1                 1  avgt   25     3.500 ±  0.127  ns/op
StringHashCode.Algorithm.scalarLatin1                10  avgt   25     7.001 ±  0.099  ns/op
StringHashCode.Algorithm.scalarLatin1               100  avgt   25    61.285 ±  0.444  ns/op
StringHashCode.Algorithm.scalarLatin1              1000  avgt   25   628.995 ±  0.846  ns/op
StringHashCode.Algorithm.scalarLatin1             10000  avgt   25  6307.990 ±  4.071  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16       0  avgt   25     2.358 ±  0.092  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16       1  avgt   25     3.631 ±  0.159  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16      10  avgt   25     7.049 ±  0.019  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16     100  avgt   25    33.626 ±  1.218  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16    1000  avgt   25   317.811 ±  1.225  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16   10000  avgt   25  3212.333 ± 14.621  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8        0  avgt   25     2.356 ±  0.097  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8        1  avgt   25     3.630 ±  0.158  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8       10  avgt   25     8.724 ±  0.065  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8      100  avgt   25    32.402 ±  0.019  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8     1000  avgt   25   321.949 ±  0.251  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8    10000  avgt   25  3202.083 ±  1.667  ns/op
StringHashCode.Algorithm.scalarUTF16                  0  avgt   25     2.135 ±  0.191  ns/op
StringHashCode.Algorithm.scalarUTF16                  1  avgt   25     5.202 ±  0.362  ns/op
StringHashCode.Algorithm.scalarUTF16                 10  avgt   25    11.105 ±  0.112  ns/op
StringHashCode.Algorithm.scalarUTF16                100  avgt   25    75.974 ±  0.702  ns/op
StringHashCode.Algorithm.scalarUTF16               1000  avgt   25   716.429 ±  3.290  ns/op
StringHashCode.Algorithm.scalarUTF16              10000  avgt   25  7095.459 ± 43.847  ns/op
StringHashCode.Algorithm.scalarUTF16Unrolled16        0  avgt   25     2.381 ±  0.038  ns/op
StringHashCode.Algorithm.scalarUTF16Unrolled16        1  avgt   25     5.268 ±  0.422  ns/op
StringHashCode.Algorithm.scalarUTF16Unrolled16       10  avgt   25    11.248 ±  0.178  ns/op
StringHashCode.Algorithm.scalarUTF16Unrolled16      100  avgt   25    52.966 ±  0.089  ns/op
StringHashCode.Algorithm.scalarUTF16Unrolled16     1000  avgt   25   450.912 ±  1.834  ns/op
StringHashCode.Algorithm.scalarUTF16Unrolled16    10000  avgt   25  4403.988 ±  2.927  ns/op
StringHashCode.Algorithm.scalarUTF16Unrolled8         0  avgt   25     2.401 ±  0.032  ns/op
StringHashCode.Algorithm.scalarUTF16Unrolled8         1  avgt   25     5.091 ±  0.396  ns/op
StringHashCode.Algorithm.scalarUTF16Unrolled8        10  avgt   25    12.801 ±  0.189  ns/op
StringHashCode.Algorithm.scalarUTF16Unrolled8       100  avgt   25    52.068 ±  0.032  ns/op
StringHashCode.Algorithm.scalarUTF16Unrolled8      1000  avgt   25   453.270 ±  0.340  ns/op
StringHashCode.Algorithm.scalarUTF16Unrolled8     10000  avgt   25  4433.112 ±  2.699  ns/op

At Datadog, we handle a large amount of text (through log management, for example), and hashing Strings represents a large part of our CPU usage. It is very unlikely that we are the only ones, as String.hashCode is such a core feature of JVM-based languages, with its use in HashMap for example. Even a 2x speedup would allow us to save thousands of CPU cores per month and correspondingly reduce the energy/carbon impact.

[1] https://static.rainfocus.com/oracle/oow18/sess/1525822677955001tLqU/PF/codeone18-vector-API-DEV5081_1540354883936001Q3Sv.pdf


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Integration blocker

 ⚠️ Title mismatch between PR and JBS for issue JDK-8282664

Issue

  • JDK-8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops ⚠️ Title mismatch between PR and JBS.

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/7700/head:pull/7700
$ git checkout pull/7700

Update a local copy of the PR:
$ git checkout pull/7700
$ git pull https://git.openjdk.org/jdk pull/7700/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 7700

View PR using the GUI difftool:
$ git pr show -t 7700

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/7700.diff

@bridgekeeper

bridgekeeper bot commented Mar 4, 2022

👋 Welcome back luhenry! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Mar 4, 2022

@luhenry The following label will be automatically applied to this pull request:

  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the core-libs core-libs-dev@openjdk.org label Mar 4, 2022
@luhenry luhenry changed the title from "[JDK-8282664] Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops" to "8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops" Mar 4, 2022
@openjdk openjdk bot added the rfr Pull request is ready for review label Mar 4, 2022
@luhenry luhenry force-pushed the vectorized-stringlatin1-hashcode branch from b62e677 to f7dda1d Compare March 4, 2022 16:22
Contributor

@RogerRiggs RogerRiggs left a comment

This is straightforward enough. But I wonder if this should be a HotSpot intrinsic so that it can take advantage of vector machine instructions.
I'd also expect there is an existing test that checks the value of String.hashCode.

@cl4es
Member

cl4es commented Mar 4, 2022

My only comment is that I'd like to see some benchmark verifying also the UTF-16 code path. It should see a similar speed-up, but adding plenty of calls might mess up compilation and inlining.

@luhenry
Member Author

luhenry commented Mar 4, 2022

@RogerRiggs I would be happy to do such work. As it would be a bigger change, are you suggesting that it could be done or that it should be done as an intrinsic?

@cl4es adding that right now.

@luhenry
Member Author

luhenry commented Mar 4, 2022

@cl4es I've added the UTF-16 benchmarks. I'm running them on my machine and should have the results in ~5 hours. I'll update the PR description once I have these numbers.

@merykitty
Member

Could you print out the generated code? My guess is that the updated version is still scalar and that the improvement comes from breaking up the dependency chain. I suggest hoisting the accumulation out of the main loop to achieve maximal scalar throughput. A small experiment on my machine shows an improvement over your approach.

int c0 = 0, c1 = 0, c2 = 0, c3 = 0;
int i = 0, len = value.length;
for (; i < (len & (~3)); i += 4) {
    // 923521 == 31^4: each accumulator advances four positions per iteration
    c3 = c3 * 923521 + (value[i + 3] & 0xff);
    c2 = c2 * 923521 + (value[i + 2] & 0xff);
    c1 = c1 * 923521 + (value[i + 1] & 0xff);
    c0 = c0 * 923521 + (value[i] & 0xff);
}
// c0 holds the earliest element of each group, so it takes the highest power of 31
int h = c0 * 29791 + c1 * 961 + c2 * 31 + c3;
for (; i < len; i++) {
    h = h * 31 + (value[i] & 0xff);
}
return h;

Thanks a lot.
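
A minimal sketch of the independent-accumulator idea, checked against `String.hashCode()`. It assumes a stride of four, so each accumulator is multiplied by 31^4 = 923521 per iteration, and the final fold gives the accumulator holding the earliest element of each group the highest power of 31. The class name `FourLaneHashSketch` is hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class FourLaneHashSketch {
    static int hash(byte[] value) {
        int c0 = 0, c1 = 0, c2 = 0, c3 = 0;
        int i = 0, len = value.length;
        int bound = len & ~3; // largest multiple of 4 that is <= len
        for (; i < bound; i += 4) {
            // The four updates do not depend on each other,
            // so the CPU can overlap their multiplications.
            c0 = c0 * 923521 + (value[i] & 0xff);
            c1 = c1 * 923521 + (value[i + 1] & 0xff);
            c2 = c2 * 923521 + (value[i + 2] & 0xff);
            c3 = c3 * 923521 + (value[i + 3] & 0xff);
        }
        // Fold the lanes: 29791 == 31^3, 961 == 31^2.
        int h = c0 * 29791 + c1 * 961 + c2 * 31 + c3;
        // Scalar tail for the remaining 0..3 elements.
        for (; i < len; i++) {
            h = h * 31 + (value[i] & 0xff);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "sdjhfklashdfklashdflkashdflkasdhf";
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        // Compares the four-lane result with the JDK's own String.hashCode().
        System.out.println(hash(bytes) == s.hashCode());
    }
}
```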

@luhenry
Member Author

luhenry commented Mar 5, 2022

@cl4es I've updated the description with the results. StringUTF16 isn't adversely impacted by additional method calls.

@merykitty The accumulator-based approach is described in https://richardstartin.github.io/posts/vectorised-polynomial-hash-codes. I'll implement the hashCodeVectorAPINoDependencies algorithm as an intrinsic next and check the performance.

@PaulSandoz
Member

PaulSandoz commented Mar 7, 2022

I think we should explore an intrinsic that also covers Arrays.hashCode, as that will have a broader impact. It may be possible to express the intrinsic in, say, ArraysSupport as:

@IntrinsicCandidate
public static long hashCode(Object o, int offset, int length, Class<?> type, boolean unsigned) {
  return 0;
}

then used as:

    public static int hashCode(byte[] a) {
        if (a == null)
            return 0;

        // low 32 bits: partial hash; high 32 bits: index of the first unprocessed element
        long v = ArraysSupport.hashCode(a, BYTE_OFFSET, a.length, byte.class, false);
        int result = (int) v;
        int i = (int) (v >> 32);
        for (; i < a.length; i++) {
            result = 31 * result + a[i];
        }

        return result;
    }

Allowing for the intrinsic to return a "remainder" may simplify the implementations (avoid having to do the tail).
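
To make the "remainder" idea concrete, here is a small pure-Java sketch. `hashBlocks` is a hypothetical stand-in for the intrinsic, not an existing JDK method: it hashes only whole blocks of eight elements and packs the partial hash (low 32 bits) together with the index of the first unprocessed element (high 32 bits). The `initial` parameter is an assumption added here so the result lines up with `Arrays.hashCode(byte[])`, which starts from 1.

```java
import java.util.Arrays;

final class HashWithRemainderSketch {
    // Hypothetical stand-in for the intrinsic: processes a.length rounded
    // down to a multiple of 8 and leaves the tail to the caller.
    static long hashBlocks(byte[] a, int initial) {
        int result = initial;
        int bound = a.length & ~7;
        int i = 0;
        for (; i < bound; i++) {
            result = 31 * result + a[i];
        }
        // High 32 bits: next index to process. Low 32 bits: partial hash.
        return ((long) i << 32) | (result & 0xFFFF_FFFFL);
    }

    static int hashCodeOf(byte[] a) {
        if (a == null) {
            return 0;
        }
        long v = hashBlocks(a, 1);
        int result = (int) v;       // unpack the partial hash
        int i = (int) (v >>> 32);   // unpack the resume index
        for (; i < a.length; i++) { // caller finishes the tail
            result = 31 * result + a[i];
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] data = {3, -1, 4, 1, -5, 9, 2, 6, 5, 3, 5};
        // Compares the split computation with java.util.Arrays.hashCode.
        System.out.println(hashCodeOf(data) == Arrays.hashCode(data));
    }
}
```

Packing both values into one `long` keeps the call to a single return value, at the cost of a mask and a shift on the caller's side.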

@richardstartin
Contributor

Great to see this taken up. As it’s implemented here, it’s still scalar, but the unroll prevents a strength reduction of the multiplication in the loop from

result = 31 * result + element;

to:

result = (result << 5) - result + element

which creates a data dependency and slows the loop down.

This was first reported by Peter Levart here: http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-September/028898.html
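
For reference, the two forms are interchangeable for every `int` value, negative ones included, because Java `int` arithmetic wraps modulo 2^32: `h << 5` is `32 * h`, and subtracting `h` leaves `31 * h`. The sketch below checks this on a few edge cases; the class and method names are made up for the example.

```java
final class StrengthReductionSketch {
    static int viaMultiply(int h) {
        return 31 * h;
    }

    static int viaShiftSubtract(int h) {
        // The subtraction can only start once the shift has finished,
        // which lengthens the dependency chain per element.
        return (h << 5) - h;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 31, 0x1234_5678, -987_654_321,
                         Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int h : samples) {
            if (viaMultiply(h) != viaShiftSubtract(h)) {
                throw new AssertionError("forms differ for h = " + h);
            }
        }
        System.out.println("both forms agree on all samples");
    }
}
```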

@cl4es
Member

cl4es commented Mar 8, 2022

An awkward effect of this implementation is that it perturbs results on small Strings a bit: it adds a branch in the trivial case, and it also regresses on certain lengths (try size=7). The added complexity seems to be no issue for JITs in these microbenchmarks, but I worry that the increased code size might play tricks with inlining heuristics in real applications.

After chatting a bit with @richardstartin about the observation that preventing a strength reduction on the constant 31 is part of the improvement, I devised an experiment which simply makes the 31 non-constant so as to disable the strength reduction:

        private static int base = 31;
        @Benchmark
        public int scalarLatin1_NoStrengthReduction() {
            int h = 0;
            int i = 0, len = latin1.length;
            for (; i < len; i++) {
                h = base * h + (latin1[i] & 0xff);
            }
            return h;
        }

Interestingly, the results of that land between the baseline and the manually unrolled versions on large inputs, while avoiding most of the irregularities that the manually unrolled versions show on small inputs:

Benchmark                                                  (size)  Mode  Cnt   Score   Error  Units
StringHashCode.Algorithm.scalarLatin1                           1  avgt    3   2.910 ± 0.068  ns/op
StringHashCode.Algorithm.scalarLatin1                           7  avgt    3   6.530 ± 0.065  ns/op
StringHashCode.Algorithm.scalarLatin1                           8  avgt    3   7.472 ± 0.034  ns/op
StringHashCode.Algorithm.scalarLatin1                          15  avgt    3  12.750 ± 0.028  ns/op
StringHashCode.Algorithm.scalarLatin1                         100  avgt    3  99.190 ± 0.618  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16                 1  avgt    3   3.119 ± 0.015  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16                 7  avgt    3  11.556 ± 4.690  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16                 8  avgt    3   7.740 ± 0.005  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16                15  avgt    3  13.030 ± 0.124  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16               100  avgt    3  46.470 ± 0.496  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                  1  avgt    3   3.123 ± 0.057  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                  7  avgt    3  11.380 ± 0.085  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                  8  avgt    3   5.849 ± 0.583  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                 15  avgt    3  12.312 ± 0.025  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                100  avgt    3  45.751 ± 0.146  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction       1  avgt    3   3.173 ± 0.015  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction       7  avgt    3   5.229 ± 0.455  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction       8  avgt    3   5.679 ± 0.012  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction      15  avgt    3   8.731 ± 0.103  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction     100  avgt    3  70.954 ± 3.386  ns/op

I wonder if this might be a safer play while we investigate intrinsification and other possible enhancements?

@merykitty
Member

Can we change the optimizer so that the strength reduction happens only after all transformations have settled? Carelessly changing a multiplication to a shift as today may hurt a lot of potential optimisations.
Thanks.

@PaulSandoz
Member

@cl4es Yes, we would need to carefully measure the impact for small array sizes (similar to what we had to do when the array mismatch intrinsic was implemented and applied to array equals). My sense is to focus on the intrinsic and also look for potential opportunities like @merykitty points out, as that is where the larger impact is, although it is more work!

@cl4es
Member

cl4es commented Mar 8, 2022

> Can we change the optimizer so that the strength reduction happens only after all transformations have settled? Carelessly changing a multiplication to a shift as today may hurt a lot of potential optimisations. Thanks.

Yes, it's troubling that making a constant non-foldable can lead the JIT down a path that ultimately pessimizes the end result (as observed here). If we could train the JIT to avoid this pitfall and get to the improvement observed in my experiment here without any changes to String.java then that'd be great.

@cl4es
Member

cl4es commented Mar 8, 2022

> @cl4es Yes, we would need to carefully measure the impact for small array sizes (similar to what we had to do when the array mismatch intrinsic was implemented and applied to array equals). My sense is to focus on the intrinsic and also look for potential opportunities like @merykitty points out, as that is where the larger impact is, although it is more work!

Right, I'm not too thrilled about the prospect of moving ahead with the de-constantification as an alternative patch here. It's such a crutch, but it's also simple and has no obvious downsides as of right now. I think it was a useful experiment to see where some of the gain observed in the unroll might be coming from. The degradation on many smaller Strings in the unrolled versions is a concern that I think might be a blocker, though. Short Strings are excessively common as keys in hash maps, etc.

Feels like none of the alternatives seen here so far is really it.

@prdoyle

prdoyle commented Mar 8, 2022

@richardstartin - does that strength reduction actually happen? The bit-shift transformation is valid only if the original result is known to be non-negative.

@richardstartin
Contributor

> @richardstartin - does that strength reduction actually happen? The bit-shift transformation valid only if the original result is known to be non-negative.

Yes.

@State(Scope.Benchmark)
public class StringHashCode {

  @Param({"sdjhfklashdfklashdflkashdflkasdhf", "締国件街徹条覧野武鮮覧横営績難比兵州催色"})
  String string;

  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
  @Benchmark
  public int stringHashCode() {
    return new String(string).hashCode();
  }
}
....[Hottest Region 1]..............................................................................
c2, level 4, StringHashCode::stringHashCode, version 507 (384 bytes) 

                0x00007f2df0142da4: shl    $0x3,%r10
                0x00007f2df0142da8: movabs $0x800000000,%r12
                0x00007f2df0142db2: add    %r12,%r10
                0x00007f2df0142db5: xor    %r12,%r12
                0x00007f2df0142db8: cmp    %r10,%rax
                0x00007f2df0142dbb: jne    0x00007f2de8696080  ;   {runtime_call ic_miss_stub}
                0x00007f2df0142dc1: data16 xchg %ax,%ax
                0x00007f2df0142dc4: nopl   0x0(%rax,%rax,1)
                0x00007f2df0142dcc: data16 data16 xchg %ax,%ax
              [Verified Entry Point]
  0.12%         0x00007f2df0142dd0: mov    %eax,-0x14000(%rsp)
  0.84%         0x00007f2df0142dd7: push   %rbp
  0.22%         0x00007f2df0142dd8: sub    $0x30,%rsp         ;*synchronization entry
                                                              ; - StringHashCode::stringHashCode@-1 (line 14)
                0x00007f2df0142ddc: mov    0xc(%rsi),%r8d     ;*getfield string {reexecute=0 rethrow=0 return_oop=0}
                                                              ; - StringHashCode::stringHashCode@5 (line 14)
  0.73%         0x00007f2df0142de0: mov    0x10(%r12,%r8,8),%eax  ; implicit exception: dispatches to 0x00007f2df0142fc4
  0.10%         0x00007f2df0142de5: test   %eax,%eax
         ╭      0x00007f2df0142de7: je     0x00007f2df0142df9  ;*synchronization entry
         │                                                    ; - StringHashCode::stringHashCode@-1 (line 14)
  0.16%  │      0x00007f2df0142de9: add    $0x30,%rsp
         │      0x00007f2df0142ded: pop    %rbp
         │      0x00007f2df0142dee: mov    0x108(%r15),%r10
  0.88%  │      0x00007f2df0142df5: test   %eax,(%r10)        ;   {poll_return}
  0.18%  │      0x00007f2df0142df8: retq   
         ↘      0x00007f2df0142df9: mov    0xc(%r12,%r8,8),%ecx  ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
                                                              ; - java.lang.String::<init>@6 (line 236)
                                                              ; - StringHashCode::stringHashCode@8 (line 14)
                0x00007f2df0142dfe: mov    0xc(%r12,%rcx,8),%r10d  ;*arraylength {reexecute=0 rethrow=0 return_oop=0}
                                                              ; - java.lang.String::hashCode@13 (line 1503)
                                                              ; - StringHashCode::stringHashCode@11 (line 14)
                                                              ; implicit exception: dispatches to 0x00007f2df0142fd0
  0.83%         0x00007f2df0142e03: test   %r10d,%r10d
                0x00007f2df0142e06: jbe    0x00007f2df0142f86  ;*ifle {reexecute=0 rethrow=0 return_oop=0}
                                                              ; - java.lang.String::hashCode@14 (line 1503)
                                                              ; - StringHashCode::stringHashCode@11 (line 14)
  0.14%         0x00007f2df0142e0c: movsbl 0x14(%r12,%r8,8),%r8d  ;*getfield coder {reexecute=0 rethrow=0 return_oop=0}
                                                              ; - java.lang.String::<init>@14 (line 237)
                                                              ; - StringHashCode::stringHashCode@8 (line 14)
  0.02%         0x00007f2df0142e12: test   %r8d,%r8d
                0x00007f2df0142e15: jne    0x00007f2df0142fac  ;*ifne {reexecute=0 rethrow=0 return_oop=0}
                                                              ; - java.lang.String::isLatin1@10 (line 3266)
                                                              ; - java.lang.String::hashCode@19 (line 1504)
                                                              ; - StringHashCode::stringHashCode@11 (line 14)
                0x00007f2df0142e1b: mov    %r10d,%edi
  1.14%         0x00007f2df0142e1e: dec    %edi
  0.10%         0x00007f2df0142e20: cmp    %r10d,%edi
                0x00007f2df0142e23: jae    0x00007f2df0142f8d
                0x00007f2df0142e29: movzbl 0x10(%r12,%rcx,8),%r9d  ;*iand {reexecute=0 rethrow=0 return_oop=0}
                                                              ; - java.lang.StringLatin1::hashCode@31 (line 196)
                                                              ; - java.lang.String::hashCode@29 (line 1504)
                                                              ; - StringHashCode::stringHashCode@11 (line 14)
                0x00007f2df0142e2f: lea    (%r12,%rcx,8),%rbx  ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
                                                              ; - java.lang.String::<init>@6 (line 236)
                                                              ; - StringHashCode::stringHashCode@8 (line 14)
  0.77%         0x00007f2df0142e33: mov    %r10d,%edx
  0.22%         0x00007f2df0142e36: add    $0xfffffff9,%edx
                0x00007f2df0142e39: mov    $0x80000000,%r11d
                0x00007f2df0142e3f: cmp    %edx,%edi
  0.84%         0x00007f2df0142e41: cmovl  %r11d,%edx
  0.10%         0x00007f2df0142e45: mov    $0x1,%ebp
                0x00007f2df0142e4a: cmp    $0x1,%edx
          ╭     0x00007f2df0142e4d: jle    0x00007f2df0142f55
          │     0x00007f2df0142e53: mov    %r9d,%r11d
  1.08%   │     0x00007f2df0142e56: shl    $0x5,%r11d
  0.08%   │     0x00007f2df0142e5a: sub    %r9d,%r11d         ;*putfield value {reexecute=0 rethrow=0 return_oop=0}
          │                                                   ; - java.lang.String::<init>@9 (line 236)
          │                                                   ; - StringHashCode::stringHashCode@8 (line 14)
  0.02%   │╭    0x00007f2df0142e5d: jmp    0x00007f2df0142e6d
          ││ ↗  0x00007f2df0142e5f: vmovd  %xmm0,%ecx
          ││ │  0x00007f2df0142e63: vmovd  %xmm2,%r10d
          ││ │  0x00007f2df0142e68: vmovd  %xmm1,%r8d
          │↘ │  0x00007f2df0142e6d: mov    %edx,%esi
  0.92%   │  │  0x00007f2df0142e6f: sub    %ebp,%esi
  0.16%   │  │  0x00007f2df0142e71: mov    $0x1f40,%r9d
  0.02%   │  │  0x00007f2df0142e77: cmp    %r9d,%esi
          │  │  0x00007f2df0142e7a: mov    $0x1f40,%edi
  0.94%   │  │  0x00007f2df0142e7f: cmovg  %edi,%esi
  0.12%   │  │  0x00007f2df0142e82: add    %ebp,%esi
          │  │  0x00007f2df0142e84: vmovd  %ecx,%xmm0
          │  │  0x00007f2df0142e88: vmovd  %r10d,%xmm2
  0.83%   │  │  0x00007f2df0142e8d: vmovd  %r8d,%xmm1
  0.10%   │  │  0x00007f2df0142e92: data16 nopw 0x0(%rax,%rax,1)
          │  │  0x00007f2df0142e9c: data16 data16 xchg %ax,%ax  ;*imul {reexecute=0 rethrow=0 return_oop=0}
          │  │                                                ; - java.lang.StringLatin1::hashCode@25 (line 196)
          │  │                                                ; - java.lang.String::hashCode@29 (line 1504)
          │  │                                                ; - StringHashCode::stringHashCode@11 (line 14)
  0.16%   │ ↗│  0x00007f2df0142ea0: movslq %ebp,%r13          ;*baload {reexecute=0 rethrow=0 return_oop=0}
          │ ││                                                ; - java.lang.StringLatin1::hashCode@19 (line 195)
          │ ││                                                ; - java.lang.String::hashCode@29 (line 1504)
          │ ││                                                ; - StringHashCode::stringHashCode@11 (line 14)
  1.08%   │ ││  0x00007f2df0142ea3: movzbl 0x10(%rbx,%r13,1),%r9d
  2.08%   │ ││  0x00007f2df0142ea9: movzbl 0x17(%rbx,%r13,1),%r10d
  1.39%   │ ││  0x00007f2df0142eaf: movzbl 0x11(%rbx,%r13,1),%ecx
  0.20%   │ ││  0x00007f2df0142eb5: add    %r9d,%r11d
  1.04%   │ ││  0x00007f2df0142eb8: movzbl 0x15(%rbx,%r13,1),%r8d
  1.59%   │ ││  0x00007f2df0142ebe: mov    %r11d,%edi
  1.26%   │ ││  0x00007f2df0142ec1: shl    $0x5,%edi
  0.12%   │ ││  0x00007f2df0142ec4: sub    %r11d,%edi
  1.81%   │ ││  0x00007f2df0142ec7: add    %ecx,%edi
  2.77%   │ ││  0x00007f2df0142ec9: movzbl 0x14(%rbx,%r13,1),%r11d
  0.84%   │ ││  0x00007f2df0142ecf: mov    %edi,%ecx
  0.16%   │ ││  0x00007f2df0142ed1: shl    $0x5,%ecx
  1.77%   │ ││  0x00007f2df0142ed4: sub    %edi,%ecx
  2.28%   │ ││  0x00007f2df0142ed6: movzbl 0x13(%rbx,%r13,1),%r9d
  0.67%   │ ││  0x00007f2df0142edc: movzbl 0x12(%rbx,%r13,1),%edi
  0.02%   │ ││  0x00007f2df0142ee2: add    %edi,%ecx
  2.51%   │ ││  0x00007f2df0142ee4: movzbl 0x16(%rbx,%r13,1),%edi
  1.00%   │ ││  0x00007f2df0142eea: mov    %ecx,%r14d
  0.79%   │ ││  0x00007f2df0142eed: shl    $0x5,%r14d
  1.61%   │ ││  0x00007f2df0142ef1: sub    %ecx,%r14d
  6.01%   │ ││  0x00007f2df0142ef4: add    %r9d,%r14d
  1.73%   │ ││  0x00007f2df0142ef7: mov    %r14d,%r9d
  0.29%   │ ││  0x00007f2df0142efa: shl    $0x5,%r9d
  0.24%   │ ││  0x00007f2df0142efe: sub    %r14d,%r9d
  6.09%   │ ││  0x00007f2df0142f01: add    %r11d,%r9d
  2.28%   │ ││  0x00007f2df0142f04: mov    %r9d,%r11d
  0.29%   │ ││  0x00007f2df0142f07: shl    $0x5,%r11d
  0.28%   │ ││  0x00007f2df0142f0b: sub    %r9d,%r11d
  5.30%   │ ││  0x00007f2df0142f0e: add    %r8d,%r11d
  2.50%   │ ││  0x00007f2df0142f11: mov    %r11d,%ecx
  0.24%   │ ││  0x00007f2df0142f14: shl    $0x5,%ecx
  0.37%   │ ││  0x00007f2df0142f17: sub    %r11d,%ecx
  6.50%   │ ││  0x00007f2df0142f1a: add    %edi,%ecx
  2.71%   │ ││  0x00007f2df0142f1c: mov    %ecx,%r9d
  0.26%   │ ││  0x00007f2df0142f1f: shl    $0x5,%r9d
  0.18%   │ ││  0x00007f2df0142f23: sub    %ecx,%r9d
  5.93%   │ ││  0x00007f2df0142f26: add    %r10d,%r9d         ;*iadd {reexecute=0 rethrow=0 return_oop=0}
          │ ││                                                ; - java.lang.StringLatin1::hashCode@32 (line 196)
          │ ││                                                ; - java.lang.String::hashCode@29 (line 1504)
          │ ││                                                ; - StringHashCode::stringHashCode@11 (line 14)
  2.85%   │ ││  0x00007f2df0142f29: mov    %r9d,%r11d
  0.10%   │ ││  0x00007f2df0142f2c: shl    $0x5,%r11d
  0.20%   │ ││  0x00007f2df0142f30: sub    %r9d,%r11d         ;*imul {reexecute=0 rethrow=0 return_oop=0}
          │ ││                                                ; - java.lang.StringLatin1::hashCode@25 (line 196)
          │ ││                                                ; - java.lang.String::hashCode@29 (line 1504)
          │ ││                                                ; - StringHashCode::stringHashCode@11 (line 14)
  2.57%   │ ││  0x00007f2df0142f33: add    $0x8,%ebp          ;*iinc {reexecute=0 rethrow=0 return_oop=0}
          │ ││                                                ; - java.lang.StringLatin1::hashCode@34 (line 195)
          │ ││                                                ; - java.lang.String::hashCode@29 (line 1504)
          │ ││                                                ; - StringHashCode::stringHashCode@11 (line 14)
  1.36%   │ ││  0x00007f2df0142f36: cmp    %esi,%ebp
          │ ╰│  0x00007f2df0142f38: jl     0x00007f2df0142ea0  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
          │  │                                                ; - java.lang.StringLatin1::hashCode@13 (line 195)
          │  │                                                ; - java.lang.String::hashCode@29 (line 1504)
          │  │                                                ; - StringHashCode::stringHashCode@11 (line 14)
          │  │  0x00007f2df0142f3e: mov    0x108(%r15),%r10   ; ImmutableOopMap{rbx=Oop xmm0=NarrowOop }
          │  │                                                ;*goto {reexecute=1 rethrow=0 return_oop=0}
          │  │                                                ; - java.lang.StringLatin1::hashCode@37 (line 195)
          │  │                                                ; - java.lang.String::hashCode@29 (line 1504)
          │  │                                                ; - StringHashCode::stringHashCode@11 (line 14)
          │  │  0x00007f2df0142f45: test   %eax,(%r10)        ;*goto {reexecute=0 rethrow=0 return_oop=0}
          │  │                                                ; - java.lang.StringLatin1::hashCode@37 (line 195)
          │  │                                                ; - java.lang.String::hashCode@29 (line 1504)
          │  │                                                ; - StringHashCode::stringHashCode@11 (line 14)
          │  │                                                ;   {poll}
  1.00%   │  │  0x00007f2df0142f48: cmp    %edx,%ebp
          │  ╰  0x00007f2df0142f4a: jl     0x00007f2df0142e5f
  0.16%   │     0x00007f2df0142f50: vmovd  %xmm2,%r10d
          ↘     0x00007f2df0142f55: cmp    %r10d,%ebp
                0x00007f2df0142f58: jge    0x00007f2df0142f7e
                0x00007f2df0142f5a: xchg   %ax,%ax            ;*aload_2 {reexecute=0 rethrow=0 return_oop=0}
                                                              ; - java.lang.StringLatin1::hashCode@16 (line 195)
                                                              ; - java.lang.String::hashCode@29 (line 1504)
                                                              ; - StringHashCode::stringHashCode@11 (line 14)
                0x00007f2df0142f5c: movzbl 0x10(%rbx,%rbp,1),%r8d
                0x00007f2df0142f62: mov    %r9d,%eax
                0x00007f2df0142f65: shl    $0x5,%eax
                0x00007f2df0142f68: sub    %r9d,%eax

c2, level 4, StringHashCode::stringHashCode, version 505 (435 bytes) 

                      0x00007fd05f2c0ba4: shl    $0x3,%r10
                      0x00007fd05f2c0ba8: movabs $0x800000000,%r12
                      0x00007fd05f2c0bb2: add    %r12,%r10
                      0x00007fd05f2c0bb5: xor    %r12,%r12
                      0x00007fd05f2c0bb8: cmp    %r10,%rax
                      0x00007fd05f2c0bbb: jne    0x00007fd057814080  ;   {runtime_call ic_miss_stub}
                      0x00007fd05f2c0bc1: data16 xchg %ax,%ax
                      0x00007fd05f2c0bc4: nopl   0x0(%rax,%rax,1)
                      0x00007fd05f2c0bcc: data16 data16 xchg %ax,%ax
                    [Verified Entry Point]
  1.14%               0x00007fd05f2c0bd0: mov    %eax,-0x14000(%rsp)
  0.50%               0x00007fd05f2c0bd7: push   %rbp
  0.22%               0x00007fd05f2c0bd8: sub    $0x30,%rsp         ;*synchronization entry
                                                                    ; - StringHashCode::stringHashCode@-1 (line 14)
  1.58%               0x00007fd05f2c0bdc: mov    0xc(%rsi),%r11d    ;*getfield string {reexecute=0 rethrow=0 return_oop=0}
                                                                    ; - StringHashCode::stringHashCode@5 (line 14)
                      0x00007fd05f2c0be0: mov    0x10(%r12,%r11,8),%ecx  ;*synchronization entry
                                                                    ; - StringHashCode::stringHashCode@-1 (line 14)
                                                                    ; implicit exception: dispatches to 0x00007fd05f2c0efc
  0.34%               0x00007fd05f2c0be5: test   %ecx,%ecx
         ╭            0x00007fd05f2c0be7: jne    0x00007fd05f2c0d84  ;*ifne {reexecute=0 rethrow=0 return_oop=0}
         │                                                          ; - java.lang.String::hashCode@6 (line 1503)
         │                                                          ; - StringHashCode::stringHashCode@11 (line 14)
  1.04%  │            0x00007fd05f2c0bed: mov    0xc(%r12,%r11,8),%edx  ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
         │                                                          ; - java.lang.String::<init>@6 (line 236)
         │                                                          ; - StringHashCode::stringHashCode@8 (line 14)
  0.50%  │            0x00007fd05f2c0bf2: mov    0xc(%r12,%rdx,8),%r14d  ;*arraylength {reexecute=0 rethrow=0 return_oop=0}
         │                                                          ; - java.lang.String::hashCode@13 (line 1503)
         │                                                          ; - StringHashCode::stringHashCode@11 (line 14)
         │                                                          ; implicit exception: dispatches to 0x00007fd05f2c0f08
         │            0x00007fd05f2c0bf7: xor    %eax,%eax
  0.36%  │            0x00007fd05f2c0bf9: test   %r14d,%r14d
         │╭           0x00007fd05f2c0bfc: jbe    0x00007fd05f2c0d74  ;*ifle {reexecute=0 rethrow=0 return_oop=0}
         ││                                                         ; - java.lang.String::hashCode@14 (line 1503)
         ││                                                         ; - StringHashCode::stringHashCode@11 (line 14)
  1.08%  ││           0x00007fd05f2c0c02: movsbl 0x14(%r12,%r11,8),%ebp  ;*getfield coder {reexecute=0 rethrow=0 return_oop=0}
         ││                                                         ; - java.lang.String::<init>@14 (line 237)
         ││                                                         ; - StringHashCode::stringHashCode@8 (line 14)
  0.50%  ││           0x00007fd05f2c0c08: lea    (%r12,%rdx,8),%rdi  ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
         ││                                                         ; - java.lang.String::<init>@6 (line 236)
         ││                                                         ; - StringHashCode::stringHashCode@8 (line 14)
         ││           0x00007fd05f2c0c0c: mov    $0x1,%r10d
  0.18%  ││           0x00007fd05f2c0c12: mov    $0x1f40,%esi
  1.20%  ││           0x00007fd05f2c0c17: mov    $0x80000000,%r11d  ;*putfield value {reexecute=0 rethrow=0 return_oop=0}
         ││                                                         ; - java.lang.String::&lt;init&gt;@9 (line 236)
         ││                                                         ; - StringHashCode::stringHashCode@8 (line 14)
  0.50%  ││           0x00007fd05f2c0c1d: test   %ebp,%ebp
         ││╭          0x00007fd05f2c0c1f: je     0x00007fd05f2c0d88  ;*ifeq {reexecute=0 rethrow=0 return_oop=0}
         │││                                                        ; - java.lang.String::hashCode@22 (line 1504)
         │││                                                        ; - StringHashCode::stringHashCode@11 (line 14)
         │││          0x00007fd05f2c0c25: sar    %r14d              ;*ishr {reexecute=0 rethrow=0 return_oop=0}
         │││                                                        ; - java.lang.StringUTF16::hashCode@5 (line 348)
         │││                                                        ; - java.lang.String::hashCode@39 (line 1505)
         │││                                                        ; - StringHashCode::stringHashCode@11 (line 14)
  0.20%  │││          0x00007fd05f2c0c28: test   %r14d,%r14d
         │││╭         0x00007fd05f2c0c2b: jle    0x00007fd05f2c0d74  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
         ││││                                                       ; - java.lang.StringUTF16::hashCode@11 (line 349)
         ││││                                                       ; - java.lang.String::hashCode@39 (line 1505)
         ││││                                                       ; - StringHashCode::stringHashCode@11 (line 14)
  1.14%  ││││         0x00007fd05f2c0c31: movzwl 0x10(%r12,%rdx,8),%r9d  ;*invokestatic getChar {reexecute=0 rethrow=0 return_oop=0}
         ││││                                                       ; - java.lang.StringUTF16::hashCode@20 (line 350)
         ││││                                                       ; - java.lang.String::hashCode@39 (line 1505)
         ││││                                                       ; - StringHashCode::stringHashCode@11 (line 14)
  0.40%  ││││         0x00007fd05f2c0c37: mov    %r14d,%r13d
         ││││         0x00007fd05f2c0c3a: dec    %r13d
  0.16%  ││││         0x00007fd05f2c0c3d: mov    %r9d,%r8d
  0.86%  ││││         0x00007fd05f2c0c40: shl    $0x5,%r8d
  0.46%  ││││         0x00007fd05f2c0c44: mov    %r14d,%ebx
         ││││         0x00007fd05f2c0c47: add    $0xfffffff9,%ebx
  0.16%  ││││         0x00007fd05f2c0c4a: cmp    %ebx,%r13d
  0.98%  ││││         0x00007fd05f2c0c4d: cmovl  %r11d,%ebx
  0.46%  ││││         0x00007fd05f2c0c51: cmp    $0x1,%ebx
         ││││         0x00007fd05f2c0c54: jle    0x00007fd05f2c0edb
         ││││         0x00007fd05f2c0c5a: sub    %r9d,%r8d          ;*imul {reexecute=0 rethrow=0 return_oop=0}
         ││││                                                       ; - java.lang.StringUTF16::hashCode@17 (line 350)
         ││││                                                       ; - java.lang.String::hashCode@39 (line 1505)
         ││││                                                       ; - StringHashCode::stringHashCode@11 (line 14)
  0.28%  ││││╭        0x00007fd05f2c0c5d: jmp    0x00007fd05f2c0c8e  ;*bipush {reexecute=0 rethrow=0 return_oop=0}
         │││││                                                      ; - java.lang.StringUTF16::hashCode@14 (line 350)
         │││││                                                      ; - java.lang.String::hashCode@39 (line 1505)
         │││││                                                      ; - StringHashCode::stringHashCode@11 (line 14)
  1.22%  │││││ ↗  ↗   0x00007fd05f2c0c5f: movzwl 0x10(%rdi,%r10,2),%r11d
  1.54%  │││││ │  │   0x00007fd05f2c0c65: sub    %r9d,%eax
  1.58%  │││││ │  │   0x00007fd05f2c0c68: add    %r11d,%eax         ;*iadd {reexecute=0 rethrow=0 return_oop=0}
         │││││ │  │                                                 ; - java.lang.StringUTF16::hashCode@23 (line 350)
         │││││ │  │                                                 ; - java.lang.String::hashCode@39 (line 1505)
         │││││ │  │                                                 ; - StringHashCode::stringHashCode@11 (line 14)
  1.93%  │││││ │  │   0x00007fd05f2c0c6b: inc    %r10d              ;*iinc {reexecute=0 rethrow=0 return_oop=0}
         │││││ │  │                                                 ; - java.lang.StringUTF16::hashCode@25 (line 349)
         │││││ │  │                                                 ; - java.lang.String::hashCode@39 (line 1505)
         │││││ │  │                                                 ; - StringHashCode::stringHashCode@11 (line 14)
  0.78%  │││││ │  │   0x00007fd05f2c0c6e: cmp    %r14d,%r10d
         │││││╭│  │   0x00007fd05f2c0c71: jge    0x00007fd05f2c0d74  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
         │││││││  │                                                 ; - java.lang.StringUTF16::hashCode@11 (line 349)
         │││││││  │                                                 ; - java.lang.String::hashCode@39 (line 1505)
         │││││││  │                                                 ; - StringHashCode::stringHashCode@11 (line 14)
  0.80%  │││││││  │   0x00007fd05f2c0c77: mov    %eax,%r8d
  0.72%  │││││││  │   0x00007fd05f2c0c7a: shl    $0x5,%r8d
  1.56%  │││││││  │   0x00007fd05f2c0c7e: mov    %eax,%r9d
  0.68%  │││││││  │   0x00007fd05f2c0c81: mov    %r8d,%eax
  0.62%  ││││││╰  │   0x00007fd05f2c0c84: jmp    0x00007fd05f2c0c5f
         ││││││  ↗│   0x00007fd05f2c0c86: vmovd  %xmm1,%ecx
         ││││││  ││   0x00007fd05f2c0c8a: vmovd  %xmm2,%edx
  1.12%  ││││↘│  ││   0x00007fd05f2c0c8e: mov    %ebx,%r13d
  0.46%  ││││ │  ││   0x00007fd05f2c0c91: sub    %r10d,%r13d
         ││││ │  ││   0x00007fd05f2c0c94: cmp    %esi,%r13d
  0.18%  ││││ │  ││   0x00007fd05f2c0c97: cmovg  %esi,%r13d
  1.14%  ││││ │  ││   0x00007fd05f2c0c9b: add    %r10d,%r13d
  0.46%  ││││ │  ││   0x00007fd05f2c0c9e: vmovd  %ecx,%xmm1
         ││││ │  ││   0x00007fd05f2c0ca2: vmovd  %edx,%xmm2
  0.30%  ││││ │  ││   0x00007fd05f2c0ca6: data16 nopw 0x0(%rax,%rax,1)  ;*imul {reexecute=0 rethrow=0 return_oop=0}
         ││││ │  ││                                                 ; - java.lang.StringUTF16::hashCode@17 (line 350)
         ││││ │  ││                                                 ; - java.lang.String::hashCode@39 (line 1505)
         ││││ │  ││                                                 ; - StringHashCode::stringHashCode@11 (line 14)
  1.22%  ││││ │ ↗││   0x00007fd05f2c0cb0: movzwl 0x1e(%rdi,%r10,2),%eax
  1.91%  ││││ │ │││   0x00007fd05f2c0cb6: movzwl 0x1c(%rdi,%r10,2),%ecx
  0.42%  ││││ │ │││   0x00007fd05f2c0cbc: movzwl 0x10(%rdi,%r10,2),%r9d
  0.16%  ││││ │ │││   0x00007fd05f2c0cc2: movzwl 0x12(%rdi,%r10,2),%r11d
  1.16%  ││││ │ │││   0x00007fd05f2c0cc8: add    %r9d,%r8d
  1.72%  ││││ │ │││   0x00007fd05f2c0ccb: movzwl 0x14(%rdi,%r10,2),%r9d
  0.50%  ││││ │ │││   0x00007fd05f2c0cd1: mov    %r8d,%edx
  0.26%  ││││ │ │││   0x00007fd05f2c0cd4: shl    $0x5,%edx
  1.54%  ││││ │ │││   0x00007fd05f2c0cd7: sub    %r8d,%edx
  1.68%  ││││ │ │││   0x00007fd05f2c0cda: add    %r11d,%edx
  0.44%  ││││ │ │││   0x00007fd05f2c0cdd: movzwl 0x16(%rdi,%r10,2),%r8d
  0.26%  ││││ │ │││   0x00007fd05f2c0ce3: mov    %edx,%r11d
  1.10%  ││││ │ │││   0x00007fd05f2c0ce6: shl    $0x5,%r11d
  1.38%  ││││ │ │││   0x00007fd05f2c0cea: sub    %edx,%r11d
  0.46%  ││││ │ │││   0x00007fd05f2c0ced: add    %r9d,%r11d
  0.38%  ││││ │ │││   0x00007fd05f2c0cf0: movzwl 0x18(%rdi,%r10,2),%edx
  1.10%  ││││ │ │││   0x00007fd05f2c0cf6: mov    %r11d,%r9d
  1.44%  ││││ │ │││   0x00007fd05f2c0cf9: shl    $0x5,%r9d
  0.54%  ││││ │ │││   0x00007fd05f2c0cfd: sub    %r11d,%r9d
  0.38%  ││││ │ │││   0x00007fd05f2c0d00: add    %r8d,%r9d
  1.64%  ││││ │ │││   0x00007fd05f2c0d03: movzwl 0x1a(%rdi,%r10,2),%r8d
  1.40%  ││││ │ │││   0x00007fd05f2c0d09: mov    %r9d,%r11d
  0.44%  ││││ │ │││   0x00007fd05f2c0d0c: shl    $0x5,%r11d
  0.56%  ││││ │ │││   0x00007fd05f2c0d10: sub    %r9d,%r11d
  1.58%  ││││ │ │││   0x00007fd05f2c0d13: add    %edx,%r11d
  1.97%  ││││ │ │││   0x00007fd05f2c0d16: mov    %r11d,%edx
  0.22%  ││││ │ │││   0x00007fd05f2c0d19: shl    $0x5,%edx
  1.02%  ││││ │ │││   0x00007fd05f2c0d1c: sub    %r11d,%edx
  3.41%  ││││ │ │││   0x00007fd05f2c0d1f: add    %r8d,%edx
  2.03%  ││││ │ │││   0x00007fd05f2c0d22: mov    %edx,%r11d
  0.12%  ││││ │ │││   0x00007fd05f2c0d25: shl    $0x5,%r11d
  1.24%  ││││ │ │││   0x00007fd05f2c0d29: sub    %edx,%r11d
  2.97%  ││││ │ │││   0x00007fd05f2c0d2c: add    %ecx,%r11d
  1.83%  ││││ │ │││   0x00007fd05f2c0d2f: mov    %r11d,%r9d
  0.06%  ││││ │ │││   0x00007fd05f2c0d32: shl    $0x5,%r9d
  1.16%  ││││ │ │││   0x00007fd05f2c0d36: sub    %r11d,%r9d
  3.89%  ││││ │ │││   0x00007fd05f2c0d39: add    %eax,%r9d          ;*iadd {reexecute=0 rethrow=0 return_oop=0}
         ││││ │ │││                                                 ; - java.lang.StringUTF16::hashCode@23 (line 350)
         ││││ │ │││                                                 ; - java.lang.String::hashCode@39 (line 1505)
         ││││ │ │││                                                 ; - StringHashCode::stringHashCode@11 (line 14)
  1.44%  ││││ │ │││   0x00007fd05f2c0d3c: mov    %r9d,%eax
         ││││ │ │││   0x00007fd05f2c0d3f: shl    $0x5,%eax
  1.16%  ││││ │ │││   0x00007fd05f2c0d42: mov    %eax,%r8d
  1.83%  ││││ │ │││   0x00007fd05f2c0d45: sub    %r9d,%r8d          ;*imul {reexecute=0 rethrow=0 return_oop=0}
         ││││ │ │││                                                 ; - java.lang.StringUTF16::hashCode@17 (line 350)
         ││││ │ │││                                                 ; - java.lang.String::hashCode@39 (line 1505)
         ││││ │ │││                                                 ; - StringHashCode::stringHashCode@11 (line 14)
  1.76%  ││││ │ │││   0x00007fd05f2c0d48: add    $0x8,%r10d         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
         ││││ │ │││                                                 ; - java.lang.StringUTF16::hashCode@25 (line 349)
         ││││ │ │││                                                 ; - java.lang.String::hashCode@39 (line 1505)
         ││││ │ │││                                                 ; - StringHashCode::stringHashCode@11 (line 14)
         ││││ │ │││   0x00007fd05f2c0d4c: cmp    %r13d,%r10d
         ││││ │ ╰││   0x00007fd05f2c0d4f: jl     0x00007fd05f2c0cb0  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
         ││││ │  ││                                                 ; - java.lang.StringUTF16::hashCode@11 (line 349)
         ││││ │  ││                                                 ; - java.lang.String::hashCode@39 (line 1505)
         ││││ │  ││                                                 ; - StringHashCode::stringHashCode@11 (line 14)
         ││││ │  ││   0x00007fd05f2c0d55: mov    0x108(%r15),%r11   ; ImmutableOopMap{rdi=Oop xmm2=NarrowOop }
         ││││ │  ││                                                 ;*goto {reexecute=1 rethrow=0 return_oop=0}
         ││││ │  ││                                                 ; - java.lang.StringUTF16::hashCode@28 (line 349)
         ││││ │  ││                                                 ; - java.lang.String::hashCode@39 (line 1505)
         ││││ │  ││                                                 ; - StringHashCode::stringHashCode@11 (line 14)
  0.68%  ││││ │  ││   0x00007fd05f2c0d5c: test   %eax,(%r11)        ;*goto {reexecute=0 rethrow=0 return_oop=0}
         ││││ │  ││                                                 ; - java.lang.StringUTF16::hashCode@28 (line 349)
         ││││ │  ││                                                 ; - java.lang.String::hashCode@39 (line 1505)
         ││││ │  ││                                                 ; - StringHashCode::stringHashCode@11 (line 14)
         ││││ │  ││                                                 ;   {poll}
  0.84%  ││││ │  ││   0x00007fd05f2c0d5f: cmp    %ebx,%r10d
         ││││ │  ╰│   0x00007fd05f2c0d62: jl     0x00007fd05f2c0c86
         ││││ │   │   0x00007fd05f2c0d68: cmp    %r14d,%r10d
         ││││ │   ╰   0x00007fd05f2c0d6b: jl     0x00007fd05f2c0c5f
         ││││ │       0x00007fd05f2c0d71: mov    %r9d,%eax          ;*synchronization entry
         ││││ │                                                     ; - StringHashCode::stringHashCode@-1 (line 14)
  0.38%  │↘│↘ ↘    ↗  0x00007fd05f2c0d74: add    $0x30,%rsp
  0.88%  │ │       │  0x00007fd05f2c0d78: pop    %rbp
  0.76%  │ │       │  0x00007fd05f2c0d79: mov    0x108(%r15),%r10
         │ │       │  0x00007fd05f2c0d80: test   %eax,(%r10)        ;   {poll_return}
  0.28%  │ │       │  0x00007fd05f2c0d83: retq   
         ↘ │       │  0x00007fd05f2c0d84: mov    %ecx,%eax
           │       ╰  0x00007fd05f2c0d86: jmp    0x00007fd05f2c0d74
           ↘          0x00007fd05f2c0d88: mov    %r14d,%ebx
                      0x00007fd05f2c0d8b: dec    %ebx
                      0x00007fd05f2c0d8d: cmp    %r14d,%ebx
                      0x00007fd05f2c0d90: jae    0x00007fd05f2c0ee3
                      0x00007fd05f2c0d96: movzbl 0x10(%r12,%rdx,8),%r9d  ;*iand {reexecute=0 rethrow=0 return_oop=0}
                                                                    ; - java.lang.StringLatin1::hashCode@31 (line 196)
                                                                    ; - java.lang.String::hashCode@29 (line 1504)
                                                                    ; - StringHashCode::stringHashCode@11 (line 14)

@jddarcy
Member

jddarcy commented Mar 26, 2022

Independent of performance improvements, the proposed changes may be tolerable from a code maintenance point of view, but I think VM intrinsics would be a better fit here. If the Java-level changes are kept, I think a short comment to explain the intent of the loop would be appropriate; e.g. something like
// unroll (31 * h + (v & 0xff)) recurrence; constants are 31^k in int arithmetic

@ExE-Boss ExE-Boss left a comment

The (length & ~(8 - 1)) computation should probably be moved outside of the loop.
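
For illustration, here is a minimal Java sketch of the unrolled recurrence with the bound hoisted out of the loop, as suggested. It is not the code from this PR: the class name `UnrolledHashSketch`, the method `hashLatin1`, and the constant names `P1`..`P8` are made up for the example, and it only assumes the usual `h = 31 * h + (v & 0xff)` definition.

```java
import java.nio.charset.StandardCharsets;

final class UnrolledHashSketch {
    // 31^k for k = 1..8, computed in wrapping 32-bit int arithmetic.
    private static final int P1 = 31;
    private static final int P2 = P1 * 31;
    private static final int P3 = P2 * 31;
    private static final int P4 = P3 * 31;
    private static final int P5 = P4 * 31;
    private static final int P6 = P5 * 31;
    private static final int P7 = P6 * 31;
    private static final int P8 = P7 * 31;

    static int hashLatin1(byte[] v) {
        int h = 0;
        int i = 0;
        // Largest multiple of 8 not exceeding the length, computed once.
        final int bound = v.length & ~(8 - 1);
        for (; i < bound; i += 8) {
            // Eight steps of h = 31 * h + c folded into one expression.
            h = P8 * h
              + P7 * (v[i]     & 0xff)
              + P6 * (v[i + 1] & 0xff)
              + P5 * (v[i + 2] & 0xff)
              + P4 * (v[i + 3] & 0xff)
              + P3 * (v[i + 4] & 0xff)
              + P2 * (v[i + 5] & 0xff)
              + P1 * (v[i + 6] & 0xff)
              +      (v[i + 7] & 0xff);
        }
        // Scalar tail for the remaining 0..7 bytes.
        for (; i < v.length; i++) {
            h = 31 * h + (v[i] & 0xff);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "polynomial hash, unrolled by eight";
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        // Compares the sketch against the JDK's own String.hashCode.
        System.out.println(hashLatin1(latin1) == s.hashCode());
    }
}
```

The multiplications inside one block no longer depend on each other, which is what breaks the serial `shl`/`sub`/`add` chain visible in the C2 output above.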

src/java.base/share/classes/java/lang/StringLatin1.java (outdated)
src/java.base/share/classes/java/lang/StringUTF16.java (outdated)
@luhenry
Member Author

luhenry commented Mar 28, 2022

I'm currently working on the vectorized intrinsic. It's taking more time due to end of quarter activities but I'm getting around to it :)

Next is to generalize it for Arrays.hashCode and StringUTF16.hashCode and make it cheap on shorter strings
@luhenry
Member Author

luhenry commented Apr 4, 2022

Some early results:

Benchmark                                       (size)  Mode  Cnt    Score   Error  Units
StringHashCode.Algorithm.defaultLatin1            1024  avgt    5   90.235 ± 0.513  ns/op
StringHashCode.Algorithm.scalarLatin1             1024  avgt    5  632.166 ± 2.197  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8    1024  avgt    5  323.742 ± 2.782  ns/op

The defaultLatin1 invokes StringLatin1.hashCode via a MethodHandle, so there is a ~5ns overhead (measured with size=0 compared to scalarLatin1).

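The results are very encouraging: at size 1024, defaultLatin1 is about 7x faster than the scalar loop (90 ns vs. 632 ns) and about 3.6x faster than the hand-unrolled loop (324 ns).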

Next steps are to:

  1. make the intrinsic "free" on short strings
  2. generalize to other types (char[], short[], int[], long[]) and apply to Arrays.hashCode as well
  3. support ARM64 and Intel without AVX2

@openjdk openjdk bot removed the rfr Pull request is ready for review label Apr 6, 2022
@openjdk openjdk bot added rfr Pull request is ready for review and removed merge-conflict Pull request has merge conflict with target branch labels Apr 6, 2022
@luhenry
Member Author

luhenry commented Apr 6, 2022

EOD update: I'm trying to generalise the approach to Arrays.hashCode for some of the types (int, short, char, byte, float). However, I'm running into the following assertion and I haven't figured it out yet.

#  Internal Error (/home/ludovic/git/jdk/src/hotspot/share/opto/machnode.cpp:210), pid=1071828, tid=1071841
#  assert(opcnt < numopnds) failed: Accessing non-existent operand

Any pointers would be greatly appreciated, I'll keep digging in the meantime.

I've also explored using a jump table for the cnt < 8 case. However, I couldn't successfully express the addresses that would be automatically relocated. I might keep exploring that further once I've fixed the issue above. The code I have is the following.

// int i = 0;
movl(i, 0);

jmp(JMPTABLE);

address jmptabletarget = pc();
int32_t iterations[8];

for (int idx = 0; idx < 8-1; idx++) {
  if (idx != 0) {
    addl(i, 1);
  }
  iterations[idx] = (int32_t)(pc() - jmptabletarget);
  // h = (h << 5) - h;  i.e. h = 31 * h
  movl(tmp1, result);
  shll(result, 5);
  subl(result, tmp1);
  // h += ary1[i];
  arrays_hashcode_elload(tmp1, Address(ary1, i, Address::times(elsize)), eltype);
  addl(result, tmp1);
}
iterations[8-1] = (int32_t)(pc() - jmptabletarget);

jmp(END);

address jmptable = pc();
for (int idx = 8-1; idx >= 0; idx--) {
  emit_int32(iterations[idx]);
}

bind(JMPTABLE);

// goto jmptabletarget+jmptable[cnt1]
mov_literal64(tmp2, jmptabletarget, relocInfo::internal_word_type);
movzwq(tmp1, Address(as_Address(InternalAddress(jmptable)), cnt1, Address::times(sizeof(int32_t))));
addq(tmp1, tmp2);
jmp(tmp1);
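
To make the intent of the jump table easier to follow, here is a rough Java analogue of the `cnt < 8` case using a fall-through `switch`. It is only a sketch of the idea (enter the straight-line code at the point that leaves exactly `cnt` steps to run), not a translation of the assembler above; `TailSwitchSketch` and `hashTail` are hypothetical names, and the byte element type is an assumption.

```java
final class TailSwitchSketch {
    // Hashes the first cnt elements of a, for 0 <= cnt <= 7, without a loop.
    // Entering the switch at "case cnt" executes exactly cnt update steps.
    static int hashTail(byte[] a, int cnt) {
        int h = 0;
        int i = 0;
        switch (cnt) {
            case 7: h = 31 * h + (a[i++] & 0xff); // fall through
            case 6: h = 31 * h + (a[i++] & 0xff); // fall through
            case 5: h = 31 * h + (a[i++] & 0xff); // fall through
            case 4: h = 31 * h + (a[i++] & 0xff); // fall through
            case 3: h = 31 * h + (a[i++] & 0xff); // fall through
            case 2: h = 31 * h + (a[i++] & 0xff); // fall through
            case 1: h = 31 * h + (a[i++] & 0xff); // fall through
            case 0: break;
            default: throw new IllegalArgumentException("cnt must be in 0..7");
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] data = { 'h', 'a', 's', 'h' };
        // Should agree with "hash".hashCode(), since all bytes are ASCII.
        System.out.println(hashTail(data, 4) == "hash".hashCode());
    }
}
```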

Member

@cl4es cl4es left a comment

Can't really see anything that I think is the direct cause of the error you're seeing, but there are a couple of places where Op_AryHashCode appears to be missing.

src/hotspot/share/adlc/formssel.cpp
src/hotspot/share/opto/loopnode.cpp
@luhenry luhenry changed the title 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops 8282664: Unroll by hand StringUTF16, StringLatin1, and Arrays polynomial hash loops May 10, 2022
@luhenry
Member Author

luhenry commented May 10, 2022

@cl4es that was indeed the issue leading to the crash. Thanks!

strcmp(_matrule->_rChild->_opType,"StrInflatedCopy" )==0 ||
strcmp(_matrule->_rChild->_opType,"StrCompressedCopy" )==0 ||
strcmp(_matrule->_rChild->_opType,"StrIndexOf")==0 ||
strcmp(_matrule->_rChild->_opType,"StrIndexOfChar")==0 ||
strcmp(_matrule->_rChild->_opType,"CountPositives")==0 ||
strcmp(_matrule->_rChild->_opType,"EncodeISOArray")==0)) {
- // String.(compareTo/equals/indexOf) and Arrays.equals
+ // String.(compareTo/equals/indexOf/hashCode) and Arrays.equals

Suggested change
- // String.(compareTo/equals/indexOf/hashCode) and Arrays.equals
+ // String.(compareTo/equals/indexOf/hashCode) and Arrays.(equals/hashCode)

@rose00
Contributor

rose00 commented May 11, 2022

I will say, for the record, although it looks like Richard Startin scooped me by half a year (for which, kudos!), that an explicitly vectorized algorithm was independently derived as a seed challenge for the Panama Vector API. I coded the explicitly vectorized Horner's rule loops seen in http://cr.openjdk.java.net/~jrose/vectors/vloop0.cpp in mid-2015, when we first thought there was an opportunity to do something with vector units and Java. (Thank you Intel for believing in this crazy idea!)

I'm glad to see the current work moving forward. I agree that an intrinsic form, rather than a magically hand-crafted Java loop, is the right way to give the JVM its call to action.

I wish we could generalize this to other instances of vectorized polynomial evaluators, rather than simply the wretchedly hardwired radix-31 one that so much of Java relies on. Maybe we will eventually...

@PaulSandoz
Member

PaulSandoz commented May 11, 2022

Looks like you are making great progress.

Have you thought about ways the intrinsic implementation might be simplified if some code is retained in Java and passed as constant arguments? e.g. table of constants, scalar loop, bounds checks etc, such that the intrinsic primarily focuses on the vectorized code. To some extent that's related to John's point on generalization, and through simplification there may be some generalization.

For example, if there was a general intrinsic that returned a long value (e.g. first 32 bits are the offset in the array to continue processing, the second 32 bits are the current hashcode value), then we could call that from the Java implementations, which then proceed with the scalar loop up to the array length. The Java implementation of the intrinsic would return (0L | 1L << 32).
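
A small Java sketch of that calling convention may help. It assumes the offset lives in the low 32 bits and the running hash in the high 32 bits, which is one reading of the `(0L | 1L << 32)` value above. The names `PackedResultSketch`, `pack`, `offset`, `hash`, and `vectorizedPrefix` are invented for the example; `vectorizedPrefix` stands in for the intrinsic and, like the proposed Java fallback, does no work.

```java
import java.util.Arrays;

final class PackedResultSketch {
    // Low 32 bits: index to resume from. High 32 bits: hash computed so far.
    static long pack(int offset, int hash) {
        return (offset & 0xFFFFFFFFL) | ((long) hash << 32);
    }

    static int offset(long packed) {
        return (int) packed;
    }

    static int hash(long packed) {
        return (int) (packed >>> 32);
    }

    // Hypothetical stand-in for the intrinsic. The pure-Java fallback processes
    // nothing: resume at index 0 with the initial Arrays.hashCode value of 1.
    static long vectorizedPrefix(int[] a) {
        return pack(0, 1);
    }

    static int hashCode(int[] a) {
        long r = vectorizedPrefix(a);
        int i = offset(r);
        int h = hash(r);
        // Scalar loop finishes whatever the intrinsic did not cover.
        for (; i < a.length; i++) {
            h = 31 * h + a[i];
        }
        return h;
    }

    public static void main(String[] args) {
        int[] data = { 3, 1, 4, 1, 5, 9, 2, 6 };
        System.out.println(hashCode(data) == Arrays.hashCode(data));
    }
}
```

With this shape the scalar loop and bounds checks stay in Java, and a compiled intrinsic only has to report how far it got.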

Separately it would be nice to consider computing the hash code from the contents of a memory segment, similar to how we added mismatch support, but the trick of returning a value that merges the offset and hash code would not work, unless we return the remaining elements to process and that is guaranteed to be less than a certain value (or alternatively a relative offset value given an upper bound argument, and performing the intrinsic call in a loop until no progress can be made, which works better for safepoints).

The long[] hashcode is annoying given (element ^ (element >>> 32)), but if we simplify the intrinsic maybe we can add back that complexity?
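
For reference, the extra per-element step for `long[]` looks like this in plain Java. This is a sketch of the documented `Arrays.hashCode(long[])` behaviour for a non-null array; `LongArrayHashSketch` is a made-up class name.

```java
import java.util.Arrays;

final class LongArrayHashSketch {
    static int hashCode(long[] a) {
        int h = 1;
        for (long element : a) {
            // Fold the 64-bit element to 32 bits before the usual 31 * h + x step.
            int elementHash = (int) (element ^ (element >>> 32));
            h = 31 * h + elementHash;
        }
        return h;
    }

    public static void main(String[] args) {
        long[] data = { 1L, -1L, 1L << 40, Long.MIN_VALUE };
        System.out.println(hashCode(data) == Arrays.hashCode(data));
    }
}
```

The fold is an extra xor and shift per element, so it sits in front of the polynomial step rather than changing the recurrence itself.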

@luhenry
Member Author

luhenry commented May 12, 2022

@PaulSandoz yes, keeping the "short" string part in pure Java and switching to an intrinsified/vectorized version for "long" strings is the next avenue of exploration. I would also implement the intrinsic as a runtime stub to avoid unnecessarily increasing the size of the calling method. The overhead of the call would be amortised because it would only be called for longer strings anyway.

I haven't given much thought to how we could split up the different elements of the algorithm to generalise the approach just yet. I'll give it a try, see how far I can get with it, and keep you updated on my findings.

@PaulSandoz
Member

@luhenry ok, we took a similar approach to the mismatch intrinsic, carefully analyzing the threshold by which the intrinsic would be called.

My suggestion would be to follow that approach further and head towards an internal intrinsic perhaps with this signature:

@IntrinsicCandidate
static <T> long hashCode(Class<T> eType, Object base, long offset, int length /* in bytes */) {
  return 0 | (1L << 32); // or perhaps just return 0
}

Then on a further iteration try and pass the polynomial constant and table of powers (stable array) as arguments.
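
As a rough sketch of what passing the polynomial constant and a table of powers could look like at the Java level, consider the following. It is an assumption-laden illustration, not a proposed API: `PolyHashSketch`, `powers`, and `hash` are hypothetical names, and the table simply holds `radix^k` for `k = 0..width`.

```java
import java.util.Arrays;

final class PolyHashSketch {
    // Returns { radix^0, radix^1, ..., radix^width } in wrapping int arithmetic.
    static int[] powers(int radix, int width) {
        int[] pow = new int[width + 1];
        pow[0] = 1;
        for (int k = 1; k <= width; k++) {
            pow[k] = pow[k - 1] * radix;
        }
        return pow;
    }

    // Evaluates h = radix * h + a[i] over the whole array, processing
    // (pow.length - 1) elements per block using the precomputed powers.
    static int hash(int[] a, int initial, int radix, int[] pow) {
        final int width = pow.length - 1;
        final int bound = a.length - (a.length % width);
        int h = initial;
        int i = 0;
        for (; i < bound; i += width) {
            int acc = pow[width] * h;
            for (int j = 0; j < width; j++) {
                acc += pow[width - 1 - j] * a[i + j];
            }
            h = acc;
        }
        // Scalar tail with the plain recurrence.
        for (; i < a.length; i++) {
            h = radix * h + a[i];
        }
        return h;
    }

    public static void main(String[] args) {
        int[] data = { 3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5 };
        // Radix 31 and initial value 1 should reproduce Arrays.hashCode(int[]).
        System.out.println(hash(data, 1, 31, powers(31, 8)) == Arrays.hashCode(data));
    }
}
```

Keeping the radix and the table as arguments means the same block loop serves any polynomial hash, with radix 31 being just one choice of inputs.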

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Jun 7, 2022
@bridgekeeper

bridgekeeper bot commented Jun 9, 2022

@luhenry This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@luhenry
Member Author

luhenry commented Jun 14, 2022

Still working on it, other work priorities have popped up. I'm taking the approach of outlining the longer string approach in a dedicated runtime stub. This makes the code easier and it doesn't have a performance impact given the stub is only called on longer strings (the cost of the call is thus amortised by the faster execution).

@bridgekeeper

bridgekeeper bot commented Jul 12, 2022

@luhenry This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@bridgekeeper

bridgekeeper bot commented Aug 10, 2022

@luhenry This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

@bridgekeeper bridgekeeper bot closed this Aug 10, 2022
@cl4es
Member

cl4es commented Oct 6, 2022

@luhenry notified me that he won't be able to continue working on this for now. I've started looking at this and am scoping out what's needed to finish the work. To be continued in a new PR (soon!)

Labels
core-libs core-libs-dev@openjdk.org merge-conflict Pull request has merge conflict with target branch rfr Pull request is ready for review
10 participants