8310929: Optimization for Integer.toString #14699
Conversation
👋 Welcome back wenshao! A progress list of the required criteria for merging this PR into …
Webrevs
Although this method is referenced by the C2 string-concatenation optimization (jdk/src/hotspot/share/opto/stringopts.cpp, lines 597 to 613 in 7fffdb5) …
Can we cache results for small integers?
Caching the result strings for tiny int values is a good idea and can improve performance, but it is expensive; I am not sure whether it should be added.
The benchmark data is updated; it's ready for review @rgiulietti
charPos -= 2;
UNSAFE.putShortUnaligned(buf, Unsafe.ARRAY_BYTE_BASE_OFFSET + charPos, PACKED_DIGITS[r], false);
When switching to use Unsafe, getChars should do an array bounds check of the store index in the loop.
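For illustration, a minimal sketch of what an explicit in-loop bounds check in front of the Unsafe store could look like; the helper name putPairChecked and the use of Objects.checkFromIndexSize are assumptions, not the PR's code.

import java.util.Objects;
import jdk.internal.misc.Unsafe;

final class CheckedStoreSketch {
    private static final Unsafe UNSAFE = Unsafe.getUnsafe();

    // Hypothetical helper: explicitly range-checks the two-byte store before the raw write,
    // throwing IndexOutOfBoundsException instead of silently writing out of bounds.
    static void putPairChecked(byte[] buf, int charPos, short packedDigits) {
        Objects.checkFromIndexSize(charPos, 2, buf.length);
        UNSAFE.putShortUnaligned(buf, Unsafe.ARRAY_BYTE_BASE_OFFSET + charPos, packedDigits, false);
    }
}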
The value range of the r variable is 0-99 and the length of PACKED_DIGITS is 100, so there is no need to check the array bound here.
I think he means to check charPos to ensure it is not out of bounds.
Oh, I see, charPos needs a bounds check. I added an assert following the implementation of StringUTF16#putChar; is this safe enough?
I think we might need to look into the surrounding code to ensure user input cannot cause charPos to go out of bounds. If charPos is not touched by user code at all, I think an assert suffices (asserts are enabled via -esa and are thus enabled in the jtreg test suite).
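A sketch of the assert-based variant being discussed, in the spirit of StringUTF16#putChar; class and method names are illustrative, not the exact PR code.

import jdk.internal.misc.Unsafe;

final class AssertedStoreSketch {
    private static final Unsafe UNSAFE = Unsafe.getUnsafe();

    // Hypothetical helper: the assert documents the invariant for trusted callers.
    // It is only checked when assertions are enabled (-ea, or -esa for JDK classes),
    // which is the case in the jtreg test suite.
    static void putPair(byte[] buf, int charPos, short packedDigits) {
        assert charPos >= 0 && charPos + 1 < buf.length : "trusted caller missed bounds check";
        UNSAFE.putShortUnaligned(buf, Unsafe.ARRAY_BYTE_BASE_OFFSET + charPos, packedDigits, false);
    }
}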
UNSAFE.putShortUnaligned(
        buf,
        Unsafe.ARRAY_BYTE_BASE_OFFSET + charPos,
        Integer.PACKED_DIGITS[(int)((q * 100) - i)],
        false);
Add an array bounds check for the store of the characters.
I added an assert following the implementation of StringUTF16#putChar; is this safe enough?
Can you use a VarHandle here?
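A sketch of the VarHandle alternative being suggested: a byte-array view VarHandle gives a little-endian short store with ordinary array bounds checks. Class and method names are illustrative assumptions.

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class VarHandleStoreSketch {
    // Views a byte[] as little-endian shorts; set(...) bounds-checks the index.
    private static final VarHandle SHORT =
            MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.LITTLE_ENDIAN);

    // Same effect as the Unsafe.putShortUnaligned call, without raw memory access.
    static void putPair(byte[] buf, int charPos, short packedDigits) {
        SHORT.set(buf, charPos, packedDigits);
    }
}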
charPos -= 2;
UNSAFE.putShortUnaligned(
        buf,
        Unsafe.ARRAY_BYTE_BASE_OFFSET + charPos,
        Integer.PACKED_DIGITS[(q2 * 100) - i2],
        false);
i2 = q2;
buf[--charPos] = Integer.DigitOnes[r];
buf[--charPos] = Integer.DigitTens[r];
}
// We know there are at most two digits left at this point.
buf[--charPos] = Integer.DigitOnes[-i2];
if (i2 < -9) {
    buf[--charPos] = Integer.DigitTens[-i2];
    charPos -= 2;
    UNSAFE.putShortUnaligned(
            buf,
            Unsafe.ARRAY_BYTE_BASE_OFFSET + charPos,
            Integer.PACKED_DIGITS[-i2],
            false);
Replace the implicit array bounds check with an explicit array index check if using Unsafe.
I added an assert following the implementation of StringUTF16#putChar; is this safe enough?
charPos -= 2;
UNSAFE.putIntUnaligned(
        buf,
        Unsafe.ARRAY_BYTE_BASE_OFFSET + (charPos << 1),
        PACKED_DIGITS_UTF16[-i]);
} else {
    putChar(buf, --charPos, '0' - i);
Ditto: add an explicit array bounds check when using Unsafe, especially since the method is used outside of this source file. Here and in the uses of Unsafe below.
I added an assert following the implementation of StringUTF16#putChar; is this safe enough?
It could be worth it to have a cache for small integers to skip the calculations altogether.
This PR is only about the calculation optimization; caching small values should be a separate PR.
@wenshao This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@AlanBateman can you help me to review this PR?

@cl4es can you help me to review this PR?
I'm wondering if a micro benchmark like this is very realistic. It may score well because eventually these helper tables end up in the innermost cache level, but it may be a net negative for larger functions that only do integer conversion occasionally and almost always take a cache miss when doing a conversion. The tables may also be displacing other, more useful cache lines. In my opinion, cache is a limited resource, and having a low-level function use a big chunk of it may be optimal for a micro benchmark but seems unlikely to be optimal overall. If you are only doing mass integer-to-string conversion, I'm sure it will do better. If you are writing out JSON that sometimes contains integers, the integer conversions may now sometimes take cache misses (in which case a conversion that does not rely on lookup tables would perform better), or the conversion is displacing other, more useful cache lines.
Exactly so! This is almost the canonical example of the "JMH considered harmful" talk I gave recently.

The subject is a joke! Yes, I love JMH, but be very careful how you use it.
@wenshao How about the approach used in James Anhalt's algorithm? It reduces the number of multiplications (and store operations, in the case of writing by 4-8 byte words) but increases the total number of instructions for the routine.
While I share the wariness against lookup tables (and the microbenchmarks that lend them too much credence), I think the approach in this PR is mostly an incremental improvement on an existing design and should be evaluated as such. By packing two lookup tables into one we reduce the opportunity for cache misses. The new lookup table is no larger than the two existing ones, so we're not wasting memory or sacrificing density. Having it in a single table makes it less likely that the arrays will end up in different locations on the heap or in compiled code. I see mostly positives here.
The main drawback is the proliferation of Unsafe usage.
Switching out the lookup table-based algorithm for something clever without a lookup table is laudable, but comes with a new set of challenges and should be done as a follow up. Since these tables can be aggressively constant-folded into the code section by our JITs it might even turn out to be a wash.
If @RogerRiggs is happy with asserts in lieu of explicit bounds checking then I move to approve this.
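To make the "packing two lookup tables into one" point concrete, here is an illustrative sketch of how the DigitTens/DigitOnes pair can be merged into one short-valued table; the exact byte layout and field name in the PR may differ.

final class PackedDigitsSketch {
    // One entry per value 0..99: low byte = tens digit, high byte = ones digit,
    // so a single little-endian two-byte store writes both ASCII digits in order.
    static final short[] PACKED_DIGITS = new short[100];
    static {
        for (int i = 0; i < 100; i++) {
            PACKED_DIGITS[i] = (short) (('0' + i / 10) | (('0' + i % 10) << 8));
        }
    }
}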
If the compiled code size is greater than 325 (FreqInlineSize), the method will not be inlined and performance slows down, and that algorithm is obviously larger than 325. Different algorithms perform differently on tiny/small/big values; the first commit of this PR performs better on big values. But I still use the current algorithm, with few changes, and performance can be improved in all scenarios.
I also agree that using VarHandle is better, but using a VarHandle in StringLatin1 causes an exception, as follows:

final class StringLatin1 {
    private static final VarHandle SHORT = MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.LITTLE_ENDIAN);
}
@wenshao You can use a holder class that will get initialized only when it is first used.
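A sketch of the holder-class idea, assuming the problem is that StringLatin1 is initialized too early to create a VarHandle field directly; the nested holder defers creating the handle until first use. Names and placement are illustrative, not the PR's final code.

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class HolderSketch {
    // The nested class is not initialized until Holder.SHORT is first referenced,
    // so the VarHandle is created lazily rather than during class initialization.
    private static final class Holder {
        static final VarHandle SHORT =
                MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.LITTLE_ENDIAN);
    }

    static void putPair(byte[] buf, int charPos, short packedDigits) {
        Holder.SHORT.set(buf, charPos, packedDigits);
    }
}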
In the test of Integers.toStringBig, using ByteArrayLittleEndian is about 5% slower than using Unsafe.
static int getChars(int i, int index, byte[] buf) {
// Used by trusted callers. Assumes all necessary bounds checks have been done by the caller.
int q, r;
int charPos = index;
boolean negative = i < 0;
if (!negative) {
i = -i;
}
// Generate two digits per iteration
while (i <= -100) {
q = i / 100;
r = (q * 100) - i;
i = q;
charPos -= 2;
assert charPos >= 0 && charPos < buf.length : "Trusted caller missed bounds check";
ByteArrayLittleEndian.setShort(buf, charPos, PACKED_DIGITS[r]);
}
// We know there are at most two digits left at this point.
if (i < -9) {
charPos -= 2;
assert charPos >= 0 && charPos < buf.length : "Trusted caller missed bounds check";
ByteArrayLittleEndian.setShort(buf, charPos, PACKED_DIGITS[-i]);
} else {
buf[--charPos] = (byte)('0' - i);
}
if (negative) {
buf[--charPos] = (byte)'-';
}
return charPos;
}
make test TEST="micro:java.lang.Integers.toStringBig"
-Benchmark (size) Mode Cnt Score Error Units (use Unsafe)
-Integers.toStringBig 500 avgt 15 5.152 ? 0.052 us/op
+Benchmark (size) Mode Cnt Score Error Units (use ByteArrayLittleEndian)
+Integers.toStringBig 500 avgt 15 5.386 ? 0.020 us/op (slower 4.5%)

If you must not use Unsafe, I will replace it with ByteArrayLittleEndian.
The bulk of the improvement comes from the algorithmic change; using ByteArrayLittleEndian is more robust.
Assuming these are the M1 numbers, we're still looking at a ~120% improvement. And if there's some systemic overhead to using …
The performance test results of the latest version (PR Update 20 c0f42a7c) are as follows:

1. aliyun_ecs_c8i.xlarge
-Benchmark (size) Mode Cnt Score Error Units (baseline)
-Integers.toStringBig 500 avgt 15 6.800 ? 0.022 us/op
-Integers.toStringSmall 500 avgt 15 4.792 ? 0.021 us/op
-Integers.toStringTiny 500 avgt 15 3.757 ? 0.081 us/op
+Benchmark (size) Mode Cnt Score Error Units (PR Update 20 c0f42a7c)
+Integers.toStringBig 500 avgt 15 5.894 ? 0.046 us/op (+15.37%)
+Integers.toStringSmall 500 avgt 15 4.027 ? 0.012 us/op (+18.99%)
+Integers.toStringTiny 500 avgt 15 3.491 ? 0.090 us/op (+7.61%)
-Benchmark (size) Mode Cnt Score Error Units (baseline)
-Longs.toStringBig 500 avgt 15 9.213 ? 0.019 us/op
-Longs.toStringSmall 500 avgt 15 4.550 ? 0.016 us/op
+Benchmark (size) Mode Cnt Score Error Units (PR Update 20 c0f42a7c)
+Longs.toStringBig 500 avgt 15 7.507 ? 0.011 us/op (+22.72%)
+Longs.toStringSmall 500 avgt 15 3.967 ? 0.021 us/op (+14.69%)
-Benchmark Mode Cnt Score Error Units (baseline)
-StringBuilders.toStringCharWithInt8 avgt 15 89.187 ? 0.236 ns/op
+Benchmark Mode Cnt Score Error Units (PR Update 20 c0f42a7c)
+StringBuilders.toStringCharWithInt8 avgt 15 36.125 ? 0.309 ns/op (+146.88%)

2. aliyun_ecs_c8y.xlarge
-Benchmark (size) Mode Cnt Score Error Units (baseline)
-Integers.toStringBig 500 avgt 15 11.649 ? 0.011 us/op
-Integers.toStringSmall 500 avgt 15 6.985 ? 0.018 us/op
-Integers.toStringTiny 500 avgt 15 5.972 ? 0.013 us/op
+Benchmark (size) Mode Cnt Score Error Units (PR Update 20 c0f42a7c)
+Integers.toStringBig 500 avgt 15 8.957 ? 0.026 us/op (+30.05%)
+Integers.toStringSmall 500 avgt 15 6.136 ? 0.018 us/op (+13.83%)
+Integers.toStringTiny 500 avgt 15 5.753 ? 0.026 us/op (+3.80%)
-Benchmark (size) Mode Cnt Score Error Units (baseline)
-Longs.toStringBig 500 avgt 15 14.568 ? 0.021 us/op
-Longs.toStringSmall 500 avgt 15 7.250 ? 0.023 us/op
+Benchmark (size) Mode Cnt Score Error Units (PR Update 20 c0f42a7c)
+Longs.toStringBig 500 avgt 15 13.401 ? 0.012 us/op (+8.70%)
+Longs.toStringSmall 500 avgt 15 6.031 ? 0.018 us/op (+20.21%)
-Benchmark Mode Cnt Score Error Units (baseline)
-StringBuilders.toStringCharWithInt8 avgt 15 52.484 ? 0.534 ns/op
+Benchmark Mode Cnt Score Error Units (PR Update 20 c0f42a7c)
+StringBuilders.toStringCharWithInt8 avgt 15 40.410 ? 0.348 ns/op (+29.87%)

3. MacBookPro M1 Pro

-Benchmark (size) Mode Cnt Score Error Units (baseline)
-Integers.toStringBig 500 avgt 15 18.483 ± 2.771 us/op
-Integers.toStringSmall 500 avgt 15 4.435 ± 0.067 us/op
-Integers.toStringTiny 500 avgt 15 2.382 ± 0.063 us/op
+Benchmark (size) Mode Cnt Score Error Units (PR Update 20 c0f42a7c)
+Integers.toStringBig 500 avgt 15 5.392 ? 0.016 us/op (+242.78%)
+Integers.toStringSmall 500 avgt 15 3.201 ? 0.024 us/op (+38.55%)
+Integers.toStringTiny 500 avgt 15 2.141 ? 0.021 us/op (+11.25%)
-Benchmark (size) Mode Cnt Score Error Units (baseline)
-Longs.toStringBig 500 avgt 15 8.336 ± 0.025 us/op
-Longs.toStringSmall 500 avgt 15 4.389 ± 0.018 us/op
+Benchmark (size) Mode Cnt Score Error Units (PR Update 20 c0f42a7c)
+Longs.toStringBig 500 avgt 15 7.706 ? 0.015 us/op (+8.17%)
+Longs.toStringSmall 500 avgt 15 3.094 ? 0.021 us/op (+41.85%)
-Benchmark Mode Cnt Score Error Units (baseline)
-StringBuilders.toStringCharWithInt8 avgt 15 124.316 ± 61.017 ns/op
+Benchmark Mode Cnt Score Error Units (PR Update 20 c0f42a7c)
+StringBuilders.toStringCharWithInt8 avgt 15 44.497 ? 29.741 ns/op (+179.38%)
/integrate
Using ByteArrayLittleEndian cleaned up the code nicely. The overhead compared to Unsafe doesn't seem too concerning (maybe even some wins?).
Can it be merged now?
Looks good, thank you for your patience.
For future performance PRs, please put the detailed performance data in a separate comment, not the description. The description is included in every email and the perf data gets out of date quickly.
/sponsor

@wenshao Only Committers are allowed to sponsor changes.

/sponsor
Going to push as commit 4b43c25.
Your commit was automatically rebased without conflicts.
/backport jdk21u

@wenshao To use the …

fyi, the backport command is issued on the final commit (not the PR).
Optimization for:
Integer.toString
Long.toString
StringBuilder#append(int)
Benchmark Result
1. aliyun_ecs_c8i.xlarge
2. aliyun_ecs_c8a.xlarge
3. aliyun_ecs_c8y.xlarge
4. MacBookPro M1 Pro
5. Orange Pi 5 Plus
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14699/head:pull/14699
$ git checkout pull/14699
Update a local copy of the PR:
$ git checkout pull/14699
$ git pull https://git.openjdk.org/jdk.git pull/14699/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 14699
View PR using the GUI difftool:
$ git pr show -t 14699
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14699.diff
Webrev
Link to Webrev Comment