8268229: Aarch64: Use Neon in intrinsics for String.equals #4423

Closed
wants to merge 3 commits into from

Conversation

@Wanghuang-Huawei Wanghuang-Huawei commented Jun 9, 2021

Dear all,
Could you do me a favor and review this patch? It improves the performance of the String.equals intrinsic on the Neon backend of AArch64.
We measured the performance with this JMH case:

 package com.huawei.string;
 import java.util.*;
 import java.util.concurrent.TimeUnit;
 
 import org.openjdk.jmh.annotations.CompilerControl;
 import org.openjdk.jmh.annotations.Benchmark;
 import org.openjdk.jmh.annotations.Level;
 import org.openjdk.jmh.annotations.OutputTimeUnit;
 import org.openjdk.jmh.annotations.Param;
 import org.openjdk.jmh.annotations.Scope;
 import org.openjdk.jmh.annotations.Setup;
 import org.openjdk.jmh.annotations.State;
 import org.openjdk.jmh.annotations.Fork;
 import org.openjdk.jmh.infra.Blackhole;
 
 @State(Scope.Thread)
 @OutputTimeUnit(TimeUnit.MILLISECONDS)
 public class StringEqual {
     @Param({"8", "64", "4096"})
     int size;
 
     String str1;
     String str2;
 
     @Setup(Level.Trial)
     public void init() {
         str1 = newString(size, 'c', '1');
         str2 = newString(size, 'c', '2');
     }
 
     public String newString(int length, char charToFill, char lastChar) {
         if (length > 0) {
             char[] array = new char[length];
             Arrays.fill(array, charToFill);
             array[length - 1] = lastChar;
             return new String(array);
         }
         return "";
     }
 
     @Benchmark
     @CompilerControl(CompilerControl.Mode.DONT_INLINE)
     public boolean EqualString() {
         return str1.equals(str2);
     }
 }

The results are listed below (Linux aarch64 with 128 cores):

Benchmark (size) Mode Cnt Score Error Units
StringEqual.EqualString 8 thrpt 10 123971.994 ± 1462.131 ops/ms
StringEqual.EqualString 64 thrpt 10 56009.960 ± 999.734 ops/ms
StringEqual.EqualString 4096 thrpt 10 1943.852 ± 8.159 ops/ms
StringEqual.EqualStringWithNEON 8 thrpt 10 120319.271 ± 1392.185 ops/ms
StringEqual.EqualStringWithNEON 64 thrpt 10 72914.767 ± 1814.173 ops/ms
StringEqual.EqualStringWithNEON 4096 thrpt 10 2579.155 ± 15.589 ops/ms

Yours,
WANG Huang


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8268229: Aarch64: Use Neon in intrinsics for String.equals

Contributors

  • Wang Huang <whuang@openjdk.org>
  • Miao Zhuojun <mouzhuojun@huawei.com>
  • Ai Jiaming <aijiaming1@huawei.com>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/4423/head:pull/4423
$ git checkout pull/4423

Update a local copy of the PR:
$ git checkout pull/4423
$ git pull https://git.openjdk.java.net/jdk pull/4423/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 4423

View PR using the GUI difftool:
$ git pr show -t 4423

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/4423.diff

@bridgekeeper

bridgekeeper bot commented Jun 9, 2021

👋 Welcome back whuang! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jun 9, 2021
@openjdk

openjdk bot commented Jun 9, 2021

@Wanghuang-Huawei The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Jun 9, 2021
@Wanghuang-Huawei
Author

/contributor add Wang Huang whuang@openjdk.org
/contributor add Miao Zhuojun mouzhuojun@huawei.com
/contributor add Ai Jiaming aijiaming1@huawei.com

@openjdk

openjdk bot commented Jun 9, 2021

@Wanghuang-Huawei
Contributor Wang Huang <whuang@openjdk.org> successfully added.

@openjdk

openjdk bot commented Jun 9, 2021

@Wanghuang-Huawei
Contributor Miao Zhuojun <mouzhuojun@huawei.com> successfully added.

@openjdk

openjdk bot commented Jun 9, 2021

@Wanghuang-Huawei
Contributor Ai Jiaming <aijiaming1@huawei.com> successfully added.

@mlbridge

mlbridge bot commented Jun 9, 2021

Webrevs

cbnz(tmp1, DONE);
mov(tmp2, v0, T2D, 1);
cbnz(tmp2, DONE);
b(SAME);
Contributor

Shouldn't this be

mov(tmp1, v0, T2D, 0);
mov(tmp2, v0, T2D, 1);
orr(tmp1, tmp1, tmp2);
cbnz(tmp1, DONE);

... which would use up fewer branch prediction resources.
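The single-branch form suggested above can be illustrated in plain Java. This is only a sketch, not the HotSpot code: it assumes the vector register holds the XOR of two 16-byte chunks, and the names `OrCombineSketch` and `differs16` are hypothetical.

```java
import java.nio.ByteBuffer;

final class OrCombineSketch {
    /** Returns true if the 16 bytes starting at offset differ between a and b. */
    static boolean differs16(byte[] a, byte[] b, int offset) {
        ByteBuffer x = ByteBuffer.wrap(a);
        ByteBuffer y = ByteBuffer.wrap(b);
        // XOR each 8-byte half; a non-zero result means that half differs.
        long lo = x.getLong(offset) ^ y.getLong(offset);         // plays the role of lane 0
        long hi = x.getLong(offset + 8) ^ y.getLong(offset + 8); // plays the role of lane 1
        // OR the halves together so a single test (one branch) covers both.
        return (lo | hi) != 0;
    }

    public static void main(String[] args) {
        byte[] a = new byte[16];
        byte[] b = new byte[16];
        System.out.println("identical chunks differ: " + differs16(a, b, 0));
        b[15] = 1;
        System.out.println("after changing last byte: " + differs16(a, b, 0));
    }
}
```

The trade-off is one extra ALU operation in exchange for one fewer conditional branch.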

Contributor

... or maybe do the OR in the vector unit?

Member

I guess it can be done with:

umaxv(v1, T4S, v0);
mov(tmp1, v1, T4S, 0);
cbnz(tmp1, DONE0);

Contributor

Sure, great idea.

Author

I have tested @dgbo's suggestion and found that using umaxv degrades performance.

Contributor

I guess I'm not surprised it's slower: even Firestorm has a 3-cycle latency for UMAX, and its output is used immediately.

Contributor

@theRealAph theRealAph left a comment

So, this is a 30% gain for bulk comparisons. It's not a complete waste of time, but we should concentrate on shortish strings because that's the common case. We must not do anything to compromise performance in this case.

The JMH test must be part of your patch. It should be in test/micro/org/openjdk/bench/java/lang.

We also need to look at performance around lengths of 32 characters, which is very common. Let's see 8,16,32,64.

Did you try comparing long strings that differ in, say the 31st character?
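Inputs for the cases requested above could be generated along these lines. This is a hedged sketch in plain Java rather than the benchmark from the patch; `DiffPositionSketch`, `filled`, and `withCharAt` are hypothetical names, and a real JMH benchmark would build these strings in its `@Setup` method.

```java
import java.util.Arrays;

final class DiffPositionSketch {
    /** Builds a string of the given length filled with one character. */
    static String filled(int length, char fill) {
        char[] chars = new char[length];
        Arrays.fill(chars, fill);
        return new String(chars);
    }

    /** Returns a copy of s whose character at index is replaced by c. */
    static String withCharAt(String s, int index, char c) {
        char[] chars = s.toCharArray();
        chars[index] = c;
        return new String(chars);
    }

    public static void main(String[] args) {
        int[] sizes = {8, 16, 32, 64};
        for (int size : sizes) {
            String a = filled(size, 'c');
            // Differ at index 30 (the 31st character) when the string is long
            // enough, otherwise at the last character.
            int diffIndex = Math.min(30, size - 1);
            String b = withCharAt(a, diffIndex, 'x');
            System.out.println(size + " chars, diff at index " + diffIndex
                    + ": equals=" + a.equals(b));
        }
    }
}
```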

Member

@nick-arm nick-arm left a comment

With this change the size of the string_equals intrinsic increases by ~60% from 120 bytes to 196 bytes and this gets expanded at every String.equals call site. It looks good on a micro-benchmark but I wonder if on a larger program this improvement is outweighed by the negative effects of methods taking up more space in the icache.

@@ -16673,7 +16673,7 @@ instruct string_equalsL(iRegP_R1 str1, iRegP_R3 str2, iRegI_R4 cnt,

format %{ "String Equals $str1,$str2,$cnt -> $result" %}
ins_encode %{
-// Count is in 8-bit bytes; non-Compact chars are 16 bits.
+// Count is in 8-bit bytes; non-Compact chars are 8 bits.
Member

This change is a bit confusing: non-compact chars are still 16 bits, it's just at this point we know the string contains only 8-bit Latin characters. I think it's better to instead delete everything after the ";" (or leave it as it is).

Author

I have fixed this comment. Thank you for your suggestion.

@theRealAph
Contributor

theRealAph commented Jun 15, 2021

With this change the size of the string_equals intrinsic increases by ~60% from 120 bytes to 196 bytes and this gets expanded at every String.equals call site. It looks good on a micro-benchmark but I wonder if on a larger program this improvement is outweighed by the negative effects of methods taking up more space in the icache.

That's an excellent point. There's no need at all for the Neon part to be expanded inline: it could be a subroutine. We'd have to use fixed Neon registers at the call site.

@theRealAph
Contributor

Thinking some more, we could use this opportunity to move as much of the bulk comparison code as we can out of line, hopefully achieving a reduction in footprint as well as an improvement in performance.

@Wanghuang-Huawei
Author

Wanghuang-Huawei commented Jun 24, 2021

Dear @theRealAph @dgbo @nick-arm @mdinacci
I have pushed my latest patch. In this commit:

  • I tested some cases as @theRealAph suggested and found the following:
    1. We varied the position of the difference in the strings and collected data with Neon used in all cases.
      [image]
      Based on this result, we use the old implementation when the string is short.
    2. The result for 8:64 in this figure looks like a bug, and I fixed it by unrolling the loop:
    bind(LOOP); {
    ldr(tmp1, Address(post(a1, wordSize)));
    ldr(tmp2, Address(post(a2, wordSize)));
    subs(cnt1, cnt1, wordSize);
    eor(tmp1, tmp1, tmp2);
    cbnz(tmp1, DONE);
    br(LT, SHORT);
    
    ldr(tmp1, Address(post(a1, wordSize)));
    ldr(tmp2, Address(post(a2, wordSize)));
    subs(cnt1, cnt1, wordSize);
    eor(tmp1, tmp1, tmp2);
    cbnz(tmp1, DONE);
    } br(GE, LOOP);
    3. UseSimpleStringEquals is added in this patch. If the option is true, we use the old implementation.
  • The results of my JMH run are listed here:

The differing position is in the LAST 2/3 of the string

Benchmark (size) Mode Cnt Score Error Units
StringEquals.equalsLenT 8 avgt 10 7.869 ± 0.063 ns/op
StringEquals.equalsLenT 16 avgt 10 8.651 ± 0.201 ns/op
StringEquals.equalsLenT 32 avgt 10 9.869 ± 0.049 ns/op
StringEquals.equalsLenT 64 avgt 10 11.379 ± 0.134 ns/op
StringEquals.equalsLenT 128 avgt 10 17.312 ± 0.274 ns/op
StringEquals.equalsLenT_simple 8 avgt 10 7.912 ± 0.439 ns/op
StringEquals.equalsLenT_simple 16 avgt 10 8.764 ± 0.061 ns/op
StringEquals.equalsLenT_simple 32 avgt 10 30.452 ± 0.065 ns/op
StringEquals.equalsLenT_simple 64 avgt 10 14.550 ± 0.199 ns/op
StringEquals.equalsLenT_simple 128 avgt 10 20.071 ± 2.465 ns/op

The differing position is in the FIRST 1/3 of the string

Benchmark (size) Mode Cnt Score Error Units
StringEquals.equalsLenH 8 avgt 10 7.822 ± 0.148 ns/op
StringEquals.equalsLenH 16 avgt 10 7.631 ± 0.179 ns/op
StringEquals.equalsLenH 32 avgt 10 8.553 ± 0.064 ns/op
StringEquals.equalsLenH 64 avgt 10 11.944 ± 0.554 ns/op
StringEquals.equalsLenH 128 avgt 10 12.691 ± 0.091 ns/op
StringEquals.equalsLenH_simple 8 avgt 10 7.873 ± 0.141 ns/op
StringEquals.equalsLenH_simple 16 avgt 10 7.972 ± 0.556 ns/op
StringEquals.equalsLenH_simple 32 avgt 10 8.383 ± 0.100 ns/op
StringEquals.equalsLenH_simple 64 avgt 10 29.364 ± 0.344 ns/op
StringEquals.equalsLenH_simple 128 avgt 10 14.748 ± 0.354 ns/op

@mlbridge

mlbridge bot commented Jun 29, 2021

Mailing list message from Andrew Haley on hotspot-dev:

I had to make some changes to the benchmark to get accurate timing, because
it is swamped by JMH overhead for very small strings.

It should be clear from my patch what I did. The most important part is
to run the test code in a loop, or you won't see small effects. We're
trying to measure something that only takes a few nanoseconds.
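The idea of running the call in a loop can be sketched in plain Java as follows. This is not the benchmark from the attached patch, which is not shown here; `LoopedEqualsSketch` and `countEqual` are hypothetical names, and the timing in `main` is unwarmed and purely illustrative. Each iteration uses a different pair of String objects so that the comparison cannot trivially be hoisted out of the loop.

```java
final class LoopedEqualsSketch {
    /** Compares left[i] with right[i] for every i and counts the equal pairs. */
    static int countEqual(String[] left, String[] right) {
        int equal = 0;
        for (int i = 0; i < left.length; i++) {
            if (left[i].equals(right[i])) {
                equal++;
            }
        }
        return equal;
    }

    public static void main(String[] args) {
        int reps = 1000;
        String[] left = new String[reps];
        String[] right = new String[reps];
        for (int i = 0; i < reps; i++) {
            // new String(...) yields distinct objects, so equals() cannot
            // return early on reference identity.
            left[i] = new String("0123456789abcdef");
            right[i] = new String("0123456789abcdef");
        }
        long start = System.nanoTime();
        int equal = countEqual(left, right);
        long elapsed = System.nanoTime() - start;
        // Dividing by reps spreads the fixed per-measurement overhead over many calls.
        System.out.println(equal + " equal pairs, about "
                + (elapsed / (double) reps) + " ns per call");
    }
}
```

In a JMH benchmark the loop body would sit inside the `@Benchmark` method and the reported score would cover all iterations, which matches the us/op figures in the tables that follow.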

This is what I see, Apple M1, two equal strings:

Old:

StringEquals.equal 8 avgt 5 0.948 ± 0.001 us/op
StringEquals.equal 11 avgt 5 0.948 ± 0.004 us/op
StringEquals.equal 16 avgt 5 0.948 ± 0.001 us/op
StringEquals.equal 22 avgt 5 1.260 ± 0.002 us/op
StringEquals.equal 32 avgt 5 1.886 ± 0.001 us/op
StringEquals.equal 45 avgt 5 2.514 ± 0.001 us/op
StringEquals.equal 64 avgt 5 3.141 ± 0.003 us/op
StringEquals.equal 91 avgt 5 4.395 ± 0.002 us/op
StringEquals.equal 121 avgt 5 5.653 ± 0.014 us/op
StringEquals.equal 181 avgt 5 8.011 ± 0.010 us/op
StringEquals.equal 256 avgt 5 11.433 ± 0.014 us/op
StringEquals.equal 512 avgt 5 23.005 ± 0.124 us/op
StringEquals.equal 1024 avgt 5 49.185 ± 0.032 us/op

Your patch:

Benchmark (size) Mode Cnt Score Error Units
StringEquals.equal 8 avgt 5 1.574 ± 0.001 us/op
StringEquals.equal 11 avgt 5 1.734 ± 0.004 us/op
StringEquals.equal 16 avgt 5 1.888 ± 0.002 us/op
StringEquals.equal 22 avgt 5 1.892 ± 0.003 us/op
StringEquals.equal 32 avgt 5 2.517 ± 0.003 us/op
StringEquals.equal 45 avgt 5 2.988 ± 0.002 us/op
StringEquals.equal 64 avgt 5 2.517 ± 0.003 us/op
StringEquals.equal 91 avgt 5 8.659 ± 0.007 us/op
StringEquals.equal 121 avgt 5 5.649 ± 0.007 us/op
StringEquals.equal 181 avgt 5 6.050 ± 0.009 us/op
StringEquals.equal 256 avgt 5 7.088 ± 0.016 us/op
StringEquals.equal 512 avgt 5 14.163 ± 0.018 us/op
StringEquals.equal 1024 avgt 5 29.998 ± 0.052 us/op

As you can see, we're looking at regressions all the way up to size=45,
with something very odd happening at size=91. Finally the vectorized
code starts to pull ahead at size=181.

A few things:

You should never be executing the TAIL unless the string is really
short. Just do one pair of unaligned loads at the end to finish.
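The "one pair of unaligned loads at the end" approach can be sketched in plain Java. This is an illustration under stated assumptions, not the stub code: `OverlappingTailSketch` and `equalBytes` are hypothetical names, and `ByteBuffer.getLong` at an arbitrary index stands in for an unaligned 8-byte load.

```java
import java.nio.ByteBuffer;

final class OverlappingTailSketch {
    /** Compares two byte arrays 8 bytes at a time, finishing with one overlapping load. */
    static boolean equalBytes(byte[] a, byte[] b) {
        if (a.length != b.length) {
            return false;
        }
        int n = a.length;
        if (n < 8) {
            // Only really short inputs take the byte-by-byte path.
            for (int i = 0; i < n; i++) {
                if (a[i] != b[i]) {
                    return false;
                }
            }
            return true;
        }
        ByteBuffer x = ByteBuffer.wrap(a);
        ByteBuffer y = ByteBuffer.wrap(b);
        // Main loop: whole 8-byte words.
        for (int i = 0; i + 8 <= n; i += 8) {
            if (x.getLong(i) != y.getLong(i)) {
                return false;
            }
        }
        // Finish with one pair of loads ending at the last byte. It may overlap
        // bytes that were already compared, which is harmless.
        return x.getLong(n - 8) == y.getLong(n - 8);
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13};
        byte[] b = a.clone();
        System.out.println("equal arrays: " + equalBytes(a, b));
        b[12] = 99;
        System.out.println("after changing the last byte: " + equalBytes(a, b));
    }
}
```

Re-reading a few already-compared bytes replaces a data-dependent tail loop with a single fixed-cost comparison.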

Please don't use aliases for rscratch1 and rscratch2. Calling them tmp1
and tmp2 doesn't help the reader.

So: please make sure the smaller strings are at least as good as
they are now. Remember strings are usually short, so we can tolerate
no regressions with the smaller sizes.

I don't think that Neon does any good here. This is what I get by rewriting
(just) the stub with scalar registers, in the attached patch:

Benchmark (size) Mode Cnt Score Error Units
StringEquals.equal 8 avgt 5 1.574 ± 0.004 us/op
StringEquals.equal 11 avgt 5 1.734 ± 0.003 us/op
StringEquals.equal 16 avgt 5 1.888 ± 0.002 us/op
StringEquals.equal 22 avgt 5 1.891 ± 0.003 us/op
StringEquals.equal 32 avgt 5 2.517 ± 0.001 us/op
StringEquals.equal 45 avgt 5 2.988 ± 0.002 us/op
StringEquals.equal 64 avgt 5 2.595 ± 0.004 us/op
StringEquals.equal 91 avgt 5 4.083 ± 0.006 us/op
StringEquals.equal 121 avgt 5 5.432 ± 0.006 us/op
StringEquals.equal 181 avgt 5 6.292 ± 0.009 us/op
StringEquals.equal 256 avgt 5 7.232 ± 0.008 us/op
StringEquals.equal 512 avgt 5 13.304 ± 0.012 us/op
StringEquals.equal 1024 avgt 5 25.537 ± 0.012 us/op

I use an editor with automatic indentation, as do many people, so
I inserted brackets in the right places in the assembly code.

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 8268229.patch
Type: text/x-patch
Size: 12464 bytes
Desc: not available
URL: <https://mail.openjdk.java.net/pipermail/hotspot-dev/attachments/20210629/61ccd20c/8268229-0001.patch>

@Wanghuang-Huawei
Author

@theRealAph Thank you for your suggestion. It was my mistake that the JMH benchmark I used was not accurate. I changed my code and re-tested with your JMH benchmark:

Before opt:

Benchmark (size) Mode Cnt Score Error Units
StringEquals.equal 8 avgt 5 2.334 ± 0.012 us/op
StringEquals.equal 11 avgt 5 2.335 ± 0.012 us/op
StringEquals.equal 16 avgt 5 2.334 ± 0.011 us/op
StringEquals.equal 22 avgt 5 3.414 ± 0.422 us/op
StringEquals.equal 32 avgt 5 3.890 ± 0.004 us/op
StringEquals.equal 45 avgt 5 5.610 ± 0.023 us/op
StringEquals.equal 64 avgt 5 7.215 ± 0.009 us/op
StringEquals.equal 91 avgt 5 12.305 ± 1.716 us/op
StringEquals.equal 121 avgt 5 14.891 ± 0.085 us/op
StringEquals.equal 181 avgt 5 21.502 ± 0.050 us/op
StringEquals.equal 256 avgt 5 29.968 ± 0.155 us/op
StringEquals.equal 512 avgt 5 59.414 ± 2.341 us/op
StringEquals.equal 1024 avgt 5 118.365 ± 20.794 us/op

After opt:

Benchmark (size) Mode Cnt Score Error Units
StringEquals.equal 8 avgt 5 2.333 ± 0.003 us/op
StringEquals.equal 11 avgt 5 2.333 ± 0.001 us/op
StringEquals.equal 16 avgt 5 2.332 ± 0.002 us/op
StringEquals.equal 22 avgt 5 3.265 ± 0.404 us/op
StringEquals.equal 32 avgt 5 3.875 ± 0.002 us/op
StringEquals.equal 45 avgt 5 5.793 ± 0.331 us/op
StringEquals.equal 64 avgt 5 6.730 ± 0.054 us/op
StringEquals.equal 91 avgt 5 8.611 ± 0.075 us/op
StringEquals.equal 121 avgt 5 10.041 ± 0.042 us/op
StringEquals.equal 181 avgt 5 13.968 ± 0.653 us/op
StringEquals.equal 256 avgt 5 19.199 ± 1.227 us/op
StringEquals.equal 512 avgt 5 39.508 ± 1.784 us/op
StringEquals.equal 1024 avgt 5 77.883 ± 1.290 us/op

Comment on lines +4794 to +4816
ldr(rscratch1, Address(post(a1, wordSize)));
ldr(rscratch2, Address(post(a2, wordSize)));
eor(rscratch1, rscratch1, rscratch2);
cbnz(rscratch1, DONE);

bind(B24);
ldr(rscratch1, Address(post(a1, wordSize)));
ldr(rscratch2, Address(post(a2, wordSize)));
eor(rscratch2, rscratch1, rscratch2);
cbnz(rscratch2, DONE);

bind(B16);
ldr(rscratch1, Address(post(a1, wordSize)));
ldr(rscratch2, Address(post(a2, wordSize)));
eor(rscratch1, rscratch1, rscratch2);
cbnz(rscratch1, DONE);

ldr(rscratch1, Address(a1, cnt1));
ldr(rscratch2, Address(a2, cnt1));
eor(rscratch2, rscratch1, rscratch2);
cbnz(rscratch2, DONE);
b(SAME);

Contributor

No, we are not going to do all this unrolling at the call site.

Author

  • Why do we unroll the loop?
  • We found that this code degrades performance:
   br(LE, B16);
   subs(cnt1, cnt1, wordSize);
   br(LE, B24);
   subs(cnt1, cnt1, wordSize);

We unroll the loop to remove these comparisons.

Member

But we need to balance performance against code size, given that this is expanded frequently.

Contributor

Nick is right. This unrolling at the call site is not going to be accepted.

__ orr(rscratch1, rscratch1, rscratch2);
__ cbnz(rscratch1, NOT_EQUAL);
__ br(__ GE, LOOP);

Contributor

As I said before, we gain nothing by using Neon here.

Contributor

Much better:

+	__ ldp(r5, r6, Address(__ post(a1, wordSize * 2)));
+	__ ldp(rscratch1, rscratch2, Address(__ post(a2, wordSize * 2)));
+	__ cmp(r5, rscratch1);
+	__ ccmp(r6, rscratch2, 0, Assembler::EQ);
+	__ br(__ NE, NOT_EQUAL);

Author

We changed ld1 to ldp and got the following results:

simple:

Benchmark (size) Mode Cnt Score Error Units
StringEquals.equal 45 avgt 5 6.105 ± 0.635 us/op
StringEquals.equal 64 avgt 5 7.226 ± 0.056 us/op
StringEquals.equal 91 avgt 5 12.010 ± 0.375 us/op
StringEquals.equal 121 avgt 5 14.772 ± 0.114 us/op
StringEquals.equal 181 avgt 5 21.468 ± 0.676 us/op
StringEquals.equal 256 avgt 5 28.942 ± 4.806 us/op
StringEquals.equal 512 avgt 5 58.479 ± 5.918 us/op
StringEquals.equal 1024 avgt 5 119.313 ± 16.661 us/op

ldp:

Benchmark (size) Mode Cnt Score Error Units
StringEquals.equal 45 avgt 5 6.449 ± 0.202 us/op
StringEquals.equal 64 avgt 5 7.367 ± 0.055 us/op
StringEquals.equal 91 avgt 5 9.984 ± 0.065 us/op
StringEquals.equal 121 avgt 5 12.540 ± 0.545 us/op
StringEquals.equal 181 avgt 5 15.614 ± 0.280 us/op
StringEquals.equal 256 avgt 5 19.346 ± 0.243 us/op
StringEquals.equal 512 avgt 5 35.718 ± 0.599 us/op
StringEquals.equal 1024 avgt 5 67.846 ± 0.439 us/op

neon:

Benchmark (size) Mode Cnt Score Error Units
StringEquals.equal 45 avgt 5 5.883 ± 0.173 us/op
StringEquals.equal 64 avgt 5 6.737 ± 0.035 us/op
StringEquals.equal 91 avgt 5 8.997 ± 0.215 us/op
StringEquals.equal 121 avgt 5 10.789 ± 0.386 us/op
StringEquals.equal 181 avgt 5 14.063 ± 0.253 us/op
StringEquals.equal 256 avgt 5 19.679 ± 1.419 us/op
StringEquals.equal 512 avgt 5 38.813 ± 1.378 us/op
StringEquals.equal 1024 avgt 5 77.769 ± 3.082 us/op

From the results, we can see that:

  • for small sizes (45~181), the ldp version does not perform as well as the neon/ld1 version
  • for large sizes, the ldp version is better than the neon/ld1 version
  • all versions (both ldp and ld1) are better than the old simple version.
  • I agree with you that the ldp version was better than the Neon version in the previous patch, because there I used
__ ldr(v0, __ Q, Address(__ post(a1, wordSize * 2)));
__ ldr(v1, __ Q, Address(__ post(a2, wordSize * 2)));

However, I use

__ ld1(v0, v1, __ T2D, Address(__ post(a1, loopThreshold)));
__ ld1(v2, v3, __ T2D, Address(__ post(a2, loopThreshold)));

in the latest patch. I think this change has fixed the problem here.

@theRealAph
Contributor

theRealAph commented Jul 2, 2021

Please have a very good look at the stubGenerator changes in

theRealAph@db6d620

@@ -93,6 +93,8 @@ define_pd_global(intx, InlineSmallCode, 1000);
"Use SIMD instructions in generated array equals code") \
product(bool, UseSimpleArrayEquals, false, \
"Use simpliest and shortest implementation for array equals") \
product(bool, UseSimpleStringEquals, true, \
Member

Do we really need a user-facing toggle, especially a product one? Under what situations do we expect the user to change this? It's useful for comparison, but if the new implementation is demonstrably better then we should just delete the old one.

Contributor

@theRealAph theRealAph Jul 5, 2021

True. But we've a way to go; taking this out should be the last step.


// Main 32 byte comparison loop.
__ bind(LOOP);
__ ld1(v0, v1, __ T2D, Address(__ post(a1, loopThreshold)));
Member

You need to reserve v0-3 as temporaries in the .ad file string_equals patterns otherwise we might be overwriting live values here.

Comment on lines +4794 to +4816
ldr(rscratch1, Address(post(a1, wordSize)));
ldr(rscratch2, Address(post(a2, wordSize)));
eor(rscratch1, rscratch1, rscratch2);
cbnz(rscratch1, DONE);

bind(B24);
ldr(rscratch1, Address(post(a1, wordSize)));
ldr(rscratch2, Address(post(a2, wordSize)));
eor(rscratch2, rscratch1, rscratch2);
cbnz(rscratch2, DONE);

bind(B16);
ldr(rscratch1, Address(post(a1, wordSize)));
ldr(rscratch2, Address(post(a2, wordSize)));
eor(rscratch1, rscratch1, rscratch2);
cbnz(rscratch1, DONE);

ldr(rscratch1, Address(a1, cnt1));
ldr(rscratch2, Address(a2, cnt1));
eor(rscratch2, rscratch1, rscratch2);
cbnz(rscratch2, DONE);
b(SAME);
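
The last pair of loads in the snippet above uses `cnt1` as an offset rather than post-incrementing, which reads like the usual overlapping-tail trick: instead of branching on the number of leftover bytes, the final word is loaded so that it ends exactly at the end of the data, re-comparing a few bytes that were already checked. A minimal Java sketch of that idea, assuming this reading of the snippet is right; `OverlappingTailSketch` and `equalsByWords` are hypothetical names and are not part of the patch:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class OverlappingTailSketch {
    // Reads 8 bytes from a byte[] at an arbitrary (possibly unaligned) index.
    private static final VarHandle LONG_VIEW =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    static boolean equalsByWords(byte[] a, byte[] b) {
        if (a.length != b.length) {
            return false;
        }
        int n = a.length;
        if (n < Long.BYTES) {
            // Too short for a full word: fall back to a byte loop.
            for (int i = 0; i < n; i++) {
                if (a[i] != b[i]) {
                    return false;
                }
            }
            return true;
        }
        int last = n - Long.BYTES; // start index of the final word
        // Compare whole words that start strictly before the final word.
        for (int i = 0; i < last; i += Long.BYTES) {
            long x = (long) LONG_VIEW.get(a, i);
            long y = (long) LONG_VIEW.get(b, i);
            if ((x ^ y) != 0) { // same test as eor + cbnz
                return false;
            }
        }
        // Final word ends exactly at the end of the array; it may overlap
        // bytes already compared, which is harmless for an equality test.
        long x = (long) LONG_VIEW.get(a, last);
        long y = (long) LONG_VIEW.get(b, last);
        return (x ^ y) == 0;
    }
}
```

The trade-off under review is visible even in this form: unrolling the word comparisons removes loop overhead, but every unrolled step is more code at each expansion site.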

Member

But we need to balance performance against code size, given that this is expanded frequently.

@theRealAph
Contributor

Please bear in mind that String.equals() is typically used for short strings: identifiers, names, etc. The mean string length in most cases I've tried is around 18 characters. The use of String.equals() for long strings is unusual, and we should not burden typical usages with higher overheads for the sake of rare usages.

The current String.equals() is a compromise: it performs fairly well on the String instances we expect to see, but it is not highly optimized for very long strings. Whatever you do, any replacement must not be worse for small strings. That is to say, it must not use many more registers or significantly more (expanded inline) code space.

@theRealAph
Contributor

theRealAph commented Jul 5, 2021

There is one other thing I should mention, of which you may not be aware.
Whenever you expand a macro inline, you reduce the opportunities for methods to be inlined. That's because if a method is bigger than (default) 2500 bytes, we do not inline it into other methods. Inlining is the most powerful optimization we have, but we need to prevent code size explosion.
So not only does inlining code add pressure on the machine's icache, the HotSpot code cache, and so on, but it also prevents other optimizations. That's why we are very wary of increasing the size of String.equals to benefit unusual cases.
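
To see which limit applies on a given JVM, the value can be read at run time. The sketch below assumes that the limit mentioned above corresponds to HotSpot's `InlineSmallCode` flag (the flag that appears in the diff context earlier in this thread); that mapping is my reading, not something the comment states. `InlineThresholdSketch` and `inlineSmallCode` are hypothetical names.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.OptionalLong;

final class InlineThresholdSketch {
    // Returns the current value of the InlineSmallCode VM option,
    // or empty if this JVM does not expose it.
    static OptionalLong inlineSmallCode() {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return OptionalLong.empty();
            }
            String value = bean.getVMOption("InlineSmallCode").getValue();
            return OptionalLong.of(Long.parseLong(value));
        } catch (RuntimeException e) {
            // Unknown option or unparsable value: report "not available".
            return OptionalLong.empty();
        }
    }

    public static void main(String[] args) {
        OptionalLong limit = inlineSmallCode();
        System.out.println(limit.isPresent()
            ? "InlineSmallCode = " + limit.getAsLong()
            : "InlineSmallCode not available on this JVM");
    }
}
```

Running the same program with `-XX:InlineSmallCode=<n>` on the command line should show the overridden value, which gives a rough way to experiment with how the limit interacts with a larger intrinsic.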

@theRealAph
Contributor

Here are some Graviton 2 timings for four versions:

+UseSimpleStringEquals:
Benchmark           (size)  Mode  Cnt   Score    Error  Units
StringEquals.equal       1  avgt    5   2.813 ±  0.001  us/op
StringEquals.equal       3  avgt    5   2.821 ±  0.001  us/op
StringEquals.equal       4  avgt    5   2.812 ±  0.001  us/op
StringEquals.equal       6  avgt    5   2.821 ±  0.002  us/op
StringEquals.equal       8  avgt    5   2.420 ±  0.001  us/op
StringEquals.equal      11  avgt    5   2.420 ±  0.001  us/op
StringEquals.equal      16  avgt    5   2.421 ±  0.002  us/op
StringEquals.equal      21  avgt    5   3.291 ±  0.003  us/op
StringEquals.equal      32  avgt    5   4.412 ±  0.001  us/op
StringEquals.equal      45  avgt    5   5.623 ±  0.001  us/op
StringEquals.equal      64  avgt    5   7.225 ±  0.010  us/op
StringEquals.equal      91  avgt    5  10.426 ±  0.002  us/op
StringEquals.equal     128  avgt    5  13.628 ±  0.001  us/op
StringEquals.equal     181  avgt    5  19.231 ±  0.002  us/op
StringEquals.equal     256  avgt    5  26.436 ±  0.009  us/op

Your commit 4f02c00:

Benchmark           (size)  Mode  Cnt   Score    Error  Units
StringEquals.equal       1  avgt    5   2.812 ±  0.001  us/op
StringEquals.equal       3  avgt    5   3.212 ±  0.001  us/op
StringEquals.equal       4  avgt    5   2.812 ±  0.001  us/op
StringEquals.equal       6  avgt    5   3.212 ±  0.001  us/op
StringEquals.equal       8  avgt    5   3.612 ±  0.001  us/op
StringEquals.equal      11  avgt    5   4.413 ±  0.001  us/op
StringEquals.equal      16  avgt    5   4.813 ±  0.001  us/op
StringEquals.equal      21  avgt    5   5.613 ±  0.001  us/op
StringEquals.equal      32  avgt    5   6.418 ±  0.001  us/op
StringEquals.equal      45  avgt    5   7.614 ±  0.001  us/op
StringEquals.equal      64  avgt    5   6.929 ±  0.081  us/op
StringEquals.equal      91  avgt    5   9.617 ±  0.001  us/op
StringEquals.equal     128  avgt    5  11.880 ±  0.152  us/op
StringEquals.equal     181  avgt    5  16.576 ±  0.002  us/op
StringEquals.equal     256  avgt    5  21.869 ±  0.108  us/op

My hack using ldp:

Benchmark           (size)  Mode  Cnt   Score    Error  Units
StringEquals.equal       1  avgt    5   2.414 ±  0.001  us/op
StringEquals.equal       3  avgt    5   2.814 ±  0.001  us/op
StringEquals.equal       4  avgt    5   2.414 ±  0.001  us/op
StringEquals.equal       6  avgt    5   2.814 ±  0.001  us/op
StringEquals.equal       8  avgt    5   3.214 ±  0.001  us/op
StringEquals.equal      11  avgt    5   4.015 ±  0.001  us/op
StringEquals.equal      16  avgt    5   4.419 ±  0.001  us/op
StringEquals.equal      21  avgt    5   5.216 ±  0.001  us/op
StringEquals.equal      32  avgt    5   6.017 ±  0.001  us/op
StringEquals.equal      45  avgt    5   7.218 ±  0.001  us/op
StringEquals.equal      64  avgt    5   6.015 ±  0.001  us/op
StringEquals.equal      91  avgt    5   8.967 ±  0.015  us/op
StringEquals.equal     128  avgt    5   9.217 ±  0.001  us/op
StringEquals.equal     181  avgt    5  14.096 ±  0.011  us/op
StringEquals.equal     256  avgt    5  15.462 ±  0.259  us/op

Today's -UseSimpleStringEquals:

Benchmark           (size)  Mode  Cnt   Score    Error  Units
StringEquals.equal       1  avgt    5   2.812 ±  0.001  us/op
StringEquals.equal       3  avgt    5   3.212 ±  0.001  us/op
StringEquals.equal       4  avgt    5   2.812 ±  0.001  us/op
StringEquals.equal       6  avgt    5   3.212 ±  0.001  us/op
StringEquals.equal       8  avgt    5   2.813 ±  0.002  us/op
StringEquals.equal      11  avgt    5   2.813 ±  0.001  us/op
StringEquals.equal      16  avgt    5   2.813 ±  0.001  us/op
StringEquals.equal      21  avgt    5   3.615 ±  0.001  us/op
StringEquals.equal      32  avgt    5   4.414 ±  0.001  us/op
StringEquals.equal      45  avgt    5   7.080 ±  0.027  us/op
StringEquals.equal      64  avgt    5   7.613 ±  0.001  us/op
StringEquals.equal      91  avgt    5  10.037 ±  0.005  us/op
StringEquals.equal     128  avgt    5  10.419 ±  0.001  us/op
StringEquals.equal     181  avgt    5  14.896 ±  0.004  us/op
StringEquals.equal     256  avgt    5  16.823 ±  0.001  us/op
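
One way to read these tables is as per-size ratios. The sketch below copies the Score columns of the `+UseSimpleStringEquals` run and of commit 4f02c00 from the tables above and divides one by the other; a ratio above 1 means the commit is slower at that size. `TimingRatioSketch` and its member names are hypothetical, and the error columns are ignored.

```java
final class TimingRatioSketch {
    static final int[] SIZES =
        {1, 3, 4, 6, 8, 11, 16, 21, 32, 45, 64, 91, 128, 181, 256};

    // Scores in us/op, copied from the +UseSimpleStringEquals table.
    static final double[] SIMPLE = {
        2.813, 2.821, 2.812, 2.821, 2.420, 2.420, 2.421, 3.291,
        4.412, 5.623, 7.225, 10.426, 13.628, 19.231, 26.436};

    // Scores in us/op, copied from the commit 4f02c00 table.
    static final double[] COMMIT = {
        2.812, 3.212, 2.812, 3.212, 3.612, 4.413, 4.813, 5.613,
        6.418, 7.614, 6.929, 9.617, 11.880, 16.576, 21.869};

    // Ratio of commit time to simple time for the i-th size.
    static double ratio(int i) {
        return COMMIT[i] / SIMPLE[i];
    }

    // Number of sizes at which the commit is strictly slower.
    static int slowerCount() {
        int count = 0;
        for (int i = 0; i < SIZES.length; i++) {
            if (COMMIT[i] > SIMPLE[i]) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        for (int i = 0; i < SIZES.length; i++) {
            System.out.printf("%4d  %.2fx%n", SIZES[i], ratio(i));
        }
        System.out.println("slower at " + slowerCount() + " of " + SIZES.length + " sizes");
    }
}
```

On these numbers the commit is slower at eight of the fifteen sizes (3, 6, 8, 11, 16, 21, 32 and 45), by close to a factor of two at size 16, and faster only from size 64 upward.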

@theRealAph
Contributor

I'm still seeing a slight advantage for ldp on Graviton 2:

Benchmark           (size)  Mode  Cnt   Score   Error  Units
StringEquals.equal     256  avgt    5  15.592 ± 0.080  us/op
StringEquals.equal     512  avgt    5  28.467 ± 0.245  us/op
StringEquals.equal    1024  avgt    5  53.883 ± 0.272  us/op

Versus the latest Neon version:

Benchmark           (size)  Mode  Cnt   Score   Error  Units
StringEquals.equal     256  avgt    5  16.848 ± 0.158  us/op
StringEquals.equal     512  avgt    5  29.640 ± 0.024  us/op
StringEquals.equal    1024  avgt    5  55.257 ± 0.050  us/op

@bridgekeeper

bridgekeeper bot commented Aug 2, 2021

@Wanghuang-Huawei This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@bridgekeeper

bridgekeeper bot commented Aug 30, 2021

@Wanghuang-Huawei This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

@bridgekeeper bridgekeeper bot closed this Aug 30, 2021
@TobiHartmann
Member

@Wanghuang-Huawei any plans to re-open and fix this?

@theRealAph
Contributor

> @Wanghuang-Huawei any plans to re-open and fix this?

I hope not: it looks like a regression for common cases.

@RealFYang
Member

@TobiHartmann @theRealAph :
Not sure whether the original author of this PR will notice this message, since he moved to a new company last year.
But I was told by one of the co-authors that this won't benefit common cases. So I agree that we keep this PR closed.

@TobiHartmann
Member

Okay, thanks for clarifying!

Labels
hotspot hotspot-dev@openjdk.org rfr Pull request is ready for review
6 participants