New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8268229: Aarch64: Use Neon in intrinsics for String.equals #4423
Conversation
👋 Welcome back whuang! A progress list of the required criteria for merging this PR into |
@Wanghuang-Huawei The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
/contributor add Wang Huang whuang@openjdk.org |
@Wanghuang-Huawei |
@Wanghuang-Huawei |
@Wanghuang-Huawei |
Webrevs
|
cbnz(tmp1, DONE); | ||
mov(tmp2, v0, T2D, 1); | ||
cbnz(tmp2, DONE); | ||
b(SAME); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be
mov(tmp1, v0, T2D, 0);
mov(tmp2, v0, T2D, 1);
orr(tmp1, tmp1, tmp2);
cbnz(tmp1, DONE);
... which would use up fewer branch prediction resources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... or maybe do the OR in the vector unit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it can be done with:
umaxv(v1, T4S, v0);
mov(tmp1, v1, T4S, 0);
cbnz(tmp1, DONE0);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, great idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have tested @dgbo 's suggestion and found that the performance degradation happened by using umaxv
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I'm not surprised it's slower: even Firestorm has a 3-cycle latency for UMAX, and its output is used immediately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this is a 30% gain for bulk comparisons. It's not a complete waste of time, but we should concentrate on shortish strings because that's the common case. Me must not do anything to compromise performance in this case
The JMH test must be part of your patch. It should be in test/micro/org/openjdk/bench/java/lang.
We also need to look at performance around lengths of 32 characters, which is very common. Let's see 8,16,32,64.
Did you try comparing long strings that differ in, say the 31st character?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change the size of thestring_equals
intrinsic increases by ~60% from 120 bytes to 196 bytes and this gets expanded at every String.equals
call site. It looks good on a micro-benchmark but I wonder if on a larger program this improvement is outweighed by the negative effects of methods taking up more space in the icache.
src/hotspot/cpu/aarch64/aarch64.ad
Outdated
@@ -16673,7 +16673,7 @@ instruct string_equalsL(iRegP_R1 str1, iRegP_R3 str2, iRegI_R4 cnt, | |||
|
|||
format %{ "String Equals $str1,$str2,$cnt -> $result" %} | |||
ins_encode %{ | |||
// Count is in 8-bit bytes; non-Compact chars are 16 bits. | |||
// Count is in 8-bit bytes; non-Compact chars are 8 bits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is a bit confusing: non-compact chars are still 16 bits, it's just at this point we know the string contains only 8-bit Latin characters. I think it's better to instead delete everything after the ";" (or leave it as it is).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have fixed this comment. Thank you for your suggestion.
That's an excellent point. There's no need at all for the Neon part to be expanded inline: it could be a subroutine. We'd have to use fixed Neon registers at the call site. |
Thinking some more,we could use this opportunity to move as much of the bulk comparison code as we can out of line, hopefully achieving a reduction in footprint as well as an improvement in performance. |
Dear @theRealAph @dgbo @nick-arm @mdinacci,
Diff postion is in the LAST 2/3 of whole string
Diff postion is in the FIRST 1/3 of whole string
|
Mailing list message from Andrew Haley on hotspot-dev: I had to make some changes to the benchmark to get accurate timing, because It should be clear from my patch what I did. The most important part is This is what I see, Apple M1, two equal strings: Old: StringEquals.equal 8 avgt 5 0.948 ? 0.001 us/op Your patch: Benchmark (size) Mode Cnt Score Error Units As you can see, we're looking at regressions all the way up to size=45, A few things: You should never be executing the TAIL unless the string is really Please don't use aliases for rscratch1 and rscratch2. Calling them tmp1 So: please make sure the smaller strings are at least as good as I don't think that Neon does any good here. This is what I get by rewriting Benchmark (size) Mode Cnt Score Error Units I use an editor with automatic indentation, as do many people, so -- |
1 similar comment
Mailing list message from Andrew Haley on hotspot-dev: I had to make some changes to the benchmark to get accurate timing, because It should be clear from my patch what I did. The most important part is This is what I see, Apple M1, two equal strings: Old: StringEquals.equal 8 avgt 5 0.948 ? 0.001 us/op Your patch: Benchmark (size) Mode Cnt Score Error Units As you can see, we're looking at regressions all the way up to size=45, A few things: You should never be executing the TAIL unless the string is really Please don't use aliases for rscratch1 and rscratch2. Calling them tmp1 So: please make sure the smaller strings are at least as good as I don't think that Neon does any good here. This is what I get by rewriting Benchmark (size) Mode Cnt Score Error Units I use an editor with automatic indentation, as do many people, so -- |
@theRealAph Thank you for your suggestion. It's my fault that the JMH I used is not accurate. I changed my codes and re-tested under your JMH: Before opt:
After opt:
|
ldr(rscratch1, Address(post(a1, wordSize))); | ||
ldr(rscratch2, Address(post(a2, wordSize))); | ||
eor(rscratch1, rscratch1, rscratch2); | ||
cbnz(rscratch1, DONE); | ||
|
||
bind(B24); | ||
ldr(rscratch1, Address(post(a1, wordSize))); | ||
ldr(rscratch2, Address(post(a2, wordSize))); | ||
eor(rscratch2, rscratch1, rscratch2); | ||
cbnz(rscratch2, DONE); | ||
|
||
bind(B16); | ||
ldr(rscratch1, Address(post(a1, wordSize))); | ||
ldr(rscratch2, Address(post(a2, wordSize))); | ||
eor(rscratch1, rscratch1, rscratch2); | ||
cbnz(rscratch1, DONE); | ||
|
||
ldr(rscratch1, Address(a1, cnt1)); | ||
ldr(rscratch2, Address(a2, cnt1)); | ||
eor(rscratch2, rscratch1, rscratch2); | ||
cbnz(rscratch2, DONE); | ||
b(SAME); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we are not going to do all this unrolling at the call site.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Why we unrolling the loop ?
- It is found that these codes degrades performance
br(LE, B16);
subs(cnt1, cnt1, wordSize);
br(LE, B24);
subs(cnt1, cnt1, wordSize);
We unrolls the loop to remove these comparsion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Why we unroll the loop ?
- It is found that these codes degrades performance
br(LE, B16);
subs(cnt1, cnt1, wordSize);
br(LE, B24);
subs(cnt1, cnt1, wordSize);
We unroll the loop to remove these comparsion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Why we unroll the loop ?
- It is found that these codes degrades performance
br(LE, B16);
subs(cnt1, cnt1, wordSize);
br(LE, B24);
subs(cnt1, cnt1, wordSize);
We unroll the loop to remove these comparsion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we need to balance performance against code size, given that this is expanded frequently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nick is right. This unrolling at the call site is not going to be accepted.
__ orr(rscratch1, rscratch1, rscratch2); | ||
__ cbnz(rscratch1, NOT_EQUAL); | ||
__ br(__ GE, LOOP); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said before, we gain nothing by using Neon here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much better:
+ __ ldp(r5, r6, Address(__ post(a1, wordSize * 2)));
+ __ ldp(rscratch1, rscratch2, Address(__ post(a2, wordSize * 2)));
+ __ cmp(r5, rscratch1);
+ __ ccmp(r6, rscratch2, 0, Assembler::EQ);
+ __ br(__ NE, NOT_EQUAL);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We changed ld1
into ldp
and get the result as following,
simple:
Benchmark | (size) | Mode | Cnt | Score | Error | Units |
---|---|---|---|---|---|---|
StringEquals.equal | 45 | avgt | 5 | 6.105 | ? 0.635 | us/op |
StringEquals.equal | 64 | avgt | 5 | 7.226 | ? 0.056 | us/op |
StringEquals.equal | 91 | avgt | 5 | 12.010 | ? 0.375 | us/op |
StringEquals.equal | 121 | avgt | 5 | 14.772 | ? 0.114 | us/op |
StringEquals.equal | 181 | avgt | 5 | 21.468 | ? 0.676 | us/op |
StringEquals.equal | 256 | avgt | 5 | 28.942 | ? 4.806 | us/op |
StringEquals.equal | 512 | avgt | 5 | 58.479 | ? 5.918 | us/op |
StringEquals.equal | 1024 | avgt | 5 | 119.313 | ? 16.661 | us/op |
ldp:
Benchmark | (size) | Mode | Cnt | Score | Error | Units |
---|---|---|---|---|---|---|
StringEquals.equal | 45 | avgt | 5 | 6.449 | ? 0.202 | us/op |
StringEquals.equal | 64 | avgt | 5 | 7.367 | ? 0.055 | us/op |
StringEquals.equal | 91 | avgt | 5 | 9.984 | ? 0.065 | us/op |
StringEquals.equal | 121 | avgt | 5 | 12.540 | ? 0.545 | us/op |
StringEquals.equal | 181 | avgt | 5 | 15.614 | ? 0.280 | us/op |
StringEquals.equal | 256 | avgt | 5 | 19.346 | ? 0.243 | us/op |
StringEquals.equal | 512 | avgt | 5 | 35.718 | ? 0.599 | us/op |
StringEquals.equal | 1024 | avgt | 5 | 67.846 | ? 0.439 | us/op |
neon:
Benchmark | (size) | Mode | Cnt | Score | Error | Units |
---|---|---|---|---|---|---|
StringEquals.equal | 45 | avgt | 5 | 5.883 | ? 0.173 | us/op |
StringEquals.equal | 64 | avgt | 5 | 6.737 | ? 0.035 | us/op |
StringEquals.equal | 91 | avgt | 5 | 8.997 | ? 0.215 | us/op |
StringEquals.equal | 121 | avgt | 5 | 10.789 | ? 0.386 | us/op |
StringEquals.equal | 181 | avgt | 5 | 14.063 | ? 0.253 | us/op |
StringEquals.equal | 256 | avgt | 5 | 19.679 | ? 1.419 | us/op |
StringEquals.equal | 512 | avgt | 5 | 38.813 | ? 1.378 | us/op |
StringEquals.equal | 1024 | avgt | 5 | 77.769 | ? 3.082 | us/op |
From the results, we can see that,
- for small size (45~181), the performance of
ldp
version is not as good asneon/ ld1
version - for big size,
ldp
version is better thatneon/ld1
version - all versions (both
ldp
andld1
) are better that oldsimple
version . - I agree with you
ldp
version is better thanld1
version at last patch because I used
__ ldr(v0, __ Q, Address(__ post(a1, wordSize * 2)));
__ ldr(v1, __ Q, Address(__ post(a2, wordSize * 2)));
at last patch. However, I use
__ ld1(v0, v1, __ T2D, Address(__ post(a1, loopThreshold)));
__ ld1(v2, v3, __ T2D, Address(__ post(a2, loopThreshold)));
in recent patch. I think this change has fixed the problem here.
Please have a very good look at the stubGenerator changes in |
@@ -93,6 +93,8 @@ define_pd_global(intx, InlineSmallCode, 1000); | |||
"Use SIMD instructions in generated array equals code") \ | |||
product(bool, UseSimpleArrayEquals, false, \ | |||
"Use simpliest and shortest implementation for array equals") \ | |||
product(bool, UseSimpleStringEquals, true, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need a user-facing toggle, especially a product one? Under what situations do we expect the user to change this? It's useful for comparison but if the new implementation if demonstrably better then we should just delete the old one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. But we've a way to go; taking this out should be the last step.
|
||
// Main 32 byte comparison loop. | ||
__ bind(LOOP); | ||
__ ld1(v0, v1, __ T2D, Address(__ post(a1, loopThreshold))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to reserve v0-3 as temporaries in the .ad file string_equals patterns otherwise we might be overwriting live values here.
ldr(rscratch1, Address(post(a1, wordSize))); | ||
ldr(rscratch2, Address(post(a2, wordSize))); | ||
eor(rscratch1, rscratch1, rscratch2); | ||
cbnz(rscratch1, DONE); | ||
|
||
bind(B24); | ||
ldr(rscratch1, Address(post(a1, wordSize))); | ||
ldr(rscratch2, Address(post(a2, wordSize))); | ||
eor(rscratch2, rscratch1, rscratch2); | ||
cbnz(rscratch2, DONE); | ||
|
||
bind(B16); | ||
ldr(rscratch1, Address(post(a1, wordSize))); | ||
ldr(rscratch2, Address(post(a2, wordSize))); | ||
eor(rscratch1, rscratch1, rscratch2); | ||
cbnz(rscratch1, DONE); | ||
|
||
ldr(rscratch1, Address(a1, cnt1)); | ||
ldr(rscratch2, Address(a2, cnt1)); | ||
eor(rscratch2, rscratch1, rscratch2); | ||
cbnz(rscratch2, DONE); | ||
b(SAME); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we need to balance performance against code size, given that this is expanded frequently.
Please bear in mind that String.equals() is typically used for short strings: identifiers, names, etc. The mean string length in most cases I've tried is around 18 characters. The use of String.equals() for long strings is unusual, and we should not burden typical usages with higher overheads for the sake of rare usages. The current String.equals() is a compromise: it performs fairly well on the String instances we expect to see, but it is not highly optimized for very long strings. Whatever you do, any replacement must not be worse for small strings. That is to say, it must not use many more registers or significantly more (expanded inline) code space. |
There is one other thing I should mention, of which you may not be aware. |
Here are some Graviton 2 timings, five versions:
Your commit 4f02c00:
My hack using ldp:
Today's -UseSimpleStringEquals:
|
I'm still seeing a slight advantage for
Versus the latest Neon version:
|
@Wanghuang-Huawei This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration! |
@Wanghuang-Huawei This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the |
@Wanghuang-Huawei any plans to re-open and fix this? |
I hope not: it looks like a regression for common cases. |
@TobiHartmann @theRealAph : |
Okay, thanks for clarifying! |
Dear all,
Could you give me a favor to review this patch? It improves the performance of the intrinsic of
String.equals
on Neon backend of Aarch64.We profile the performance by using this JMH case:
The result is list as following:(Linux aarch64 with 128cores)
Yours,
WANG Huang
Progress
Issue
Contributors
<whuang@openjdk.org>
<mouzhuojun@huawei.com>
<aijiaming1@huawei.com>
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/4423/head:pull/4423
$ git checkout pull/4423
Update a local copy of the PR:
$ git checkout pull/4423
$ git pull https://git.openjdk.java.net/jdk pull/4423/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 4423
View PR using the GUI difftool:
$ git pr show -t 4423
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/4423.diff