8316426: Optimization for HexFormat.formatHex #15768

wenshao · 2023-09-15T18:04:29Z

In the improvement of @cl4es PR #15591, the advantages of non-lookup-table were discussed.

But if the input is byte[], using lookup table can improve performance.

For HexFormat#formatHex(Appendable, byte[]) and HexFormat#formatHex(byte[]), If the length of byte[] is larger, the performance of table lookup will be improved more obviously.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8316426: Optimization for HexFormat.formatHex (Enhancement - P4)

Reviewers

Chen Liang (@liach - Author) ⚠️ Review applies to 7ffdba50
Roger Riggs (@RogerRiggs - Reviewer) ⚠️ Review applies to 8eff0ee4

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/15768/head:pull/15768
$ git checkout pull/15768

Update a local copy of the PR:
$ git checkout pull/15768
$ git pull https://git.openjdk.org/jdk.git pull/15768/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 15768

View PR using the GUI difftool:
$ git pr show -t 15768

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/15768.diff

Webrev

Link to Webrev Comment

bridgekeeper · 2023-09-15T18:05:43Z

👋 Welcome back wenshao! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2023-09-15T18:07:58Z

@wenshao The following labels will be automatically applied to this pull request:

core-libs
security

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

cl4es · 2023-09-15T18:20:32Z

What numbers do you get with this?

In my experiments the improvement was either negligible or small enough to hardly matter. I also tried flipping bits similar to what you're doing for uppercasing but saw a net regression from that.

I opted for simplicity in #15991 - a simple cleanup that removed a few tiny lookup-tables and still led to a decent improvement. Going the other direction seems like the wrong move.

wenshao · 2023-09-15T19:08:35Z

The performance test results are as follows:

0. sciprt

bash configure
make images
sh make/devkit/createJMHBundle.sh
bash configure --with-jmh=build/jmh/jars
make test TEST="micro:java.util.HexFormatBench.*"

1. aliyun_ecs_c8i.xlarge

cpu : intel xeon sapphire rapids (x64)
os debian linux

-Benchmark                           (size)  Mode  Cnt  Score   Error  Units (baselinie)
-HexFormatBench.appenderLower           512  avgt   15  2.768 ? 0.034  us/op
-HexFormatBench.appenderLowerCached     512  avgt   15  2.796 ? 0.042  us/op
-HexFormatBench.appenderUpper           512  avgt   15  2.800 ? 0.032  us/op
-HexFormatBench.appenderUpperCached     512  avgt   15  2.781 ? 0.018  us/op
-HexFormatBench.formatLower             512  avgt   15  0.544 ? 0.002  us/op
-HexFormatBench.formatLowerCached       512  avgt   15  0.548 ? 0.004  us/op
-HexFormatBench.formatUpper             512  avgt   15  0.546 ? 0.007  us/op
-HexFormatBench.formatUpperCached       512  avgt   15  0.550 ? 0.005  us/op
-HexFormatBench.toHexDigitsByte         512  avgt   15  3.364 ? 0.015  us/op
-HexFormatBench.toHexDigitsInt          512  avgt   15  3.770 ? 0.017  us/op
-HexFormatBench.toHexDigitsLong         512  avgt   15  4.990 ? 0.018  us/op
-HexFormatBench.toHexDigitsShort        512  avgt   15  3.466 ? 0.017  us/op
-HexFormatBench.toHexLower              512  avgt   15  0.415 ? 0.005  us/op
-HexFormatBench.toHexLowerCached        512  avgt   15  0.422 ? 0.005  us/op
-HexFormatBench.toHexUpper              512  avgt   15  0.413 ? 0.005  us/op
-HexFormatBench.toHexUpperCached        512  avgt   15  0.423 ? 0.004  us/op

+Benchmark                           (size)  Mode  Cnt  Score    Error  Units (optimized)
+HexFormatBench.appenderLower           512  avgt   15  0.163 ?  0.001  us/op (+1598.16)
+HexFormatBench.appenderLowerCached     512  avgt   15  0.161 ?  0.001  us/op (+1636.65)
+HexFormatBench.appenderUpper           512  avgt   15  0.251 ?  0.023  us/op (+1015.54)
+HexFormatBench.appenderUpperCached     512  avgt   15  0.266 ?  0.001  us/op (+945.49)
+HexFormatBench.formatLower             512  avgt   15  0.275 ?  0.001  us/op (+97.82)
+HexFormatBench.formatLowerCached       512  avgt   15  0.277 ?  0.001  us/op (+97.84)
+HexFormatBench.formatUpper             512  avgt   15  0.285 ?  0.001  us/op (+91.58)
+HexFormatBench.formatUpperCached       512  avgt   15  0.285 ?  0.001  us/op (+92.99)
+HexFormatBench.toHexDigitsByte         512  avgt   15  3.554 ?  0.028  us/op (-5.35)
+HexFormatBench.toHexDigitsInt          512  avgt   15  3.910 ?  0.015  us/op (-3.59)
+HexFormatBench.toHexDigitsLong         512  avgt   15  5.288 ?  0.018  us/op (-5.64)
+HexFormatBench.toHexDigitsShort        512  avgt   15  3.637 ?  0.012  us/op (-4.71)
+HexFormatBench.toHexLower              512  avgt   15  0.445 ?  0.001  us/op (-6.75)
+HexFormatBench.toHexLowerCached        512  avgt   15  0.442 ?  0.001  us/op (-4.53)
+HexFormatBench.toHexUpper              512  avgt   15  0.445 ?  0.001  us/op (-7.20)
+HexFormatBench.toHexUpperCached        512  avgt   15  0.441 ?  0.001  us/op (-4.09)

2. aliyun_ecs_c8y.xlarge

cpu : aliyun yitian 710 (aarch64)
os debian linux

-Benchmark                           (size)  Mode  Cnt  Score   Error  Units (baseline)
-HexFormatBench.appenderLower           512  avgt   15  2.857 ? 0.791  us/op
-HexFormatBench.appenderLowerCached     512  avgt   15  2.832 ? 0.758  us/op
-HexFormatBench.appenderUpper           512  avgt   15  2.360 ? 0.010  us/op
-HexFormatBench.appenderUpperCached     512  avgt   15  2.361 ? 0.013  us/op
-HexFormatBench.formatLower             512  avgt   15  0.947 ? 0.406  us/op
-HexFormatBench.formatLowerCached       512  avgt   15  0.616 ? 0.002  us/op
-HexFormatBench.formatUpper             512  avgt   15  1.212 ? 0.411  us/op
-HexFormatBench.formatUpperCached       512  avgt   15  0.616 ? 0.001  us/op
-HexFormatBench.toHexDigitsByte         512  avgt   15  5.844 ? 0.264  us/op
-HexFormatBench.toHexDigitsInt          512  avgt   15  7.392 ? 0.207  us/op
-HexFormatBench.toHexDigitsLong         512  avgt   15  8.068 ? 0.303  us/op
-HexFormatBench.toHexDigitsShort        512  avgt   15  6.214 ? 0.266  us/op
-HexFormatBench.toHexLower              512  avgt   15  0.926 ? 0.003  us/op
-HexFormatBench.toHexLowerCached        512  avgt   15  1.000 ? 0.005  us/op
-HexFormatBench.toHexUpper              512  avgt   15  0.927 ? 0.002  us/op
-HexFormatBench.toHexUpperCached        512  avgt   15  0.999 ? 0.020  us/op

+Benchmark                           (size)  Mode  Cnt  Score    Error  Units (optimized)
+HexFormatBench.appenderLower           512  avgt   15  0.356 ?  0.001  us/op (+702.53)
+HexFormatBench.appenderLowerCached     512  avgt   15  0.356 ?  0.001  us/op (+695.51)
+HexFormatBench.appenderUpper           512  avgt   15  0.304 ?  0.001  us/op (+676.32)
+HexFormatBench.appenderUpperCached     512  avgt   15  0.304 ?  0.001  us/op (+676.65)
+HexFormatBench.formatLower             512  avgt   15  0.461 ?  0.001  us/op (+105.43)
+HexFormatBench.formatLowerCached       512  avgt   15  0.485 ?  0.001  us/op (+27.02)
+HexFormatBench.formatUpper             512  avgt   15  0.644 ?  0.003  us/op (+88.20)
+HexFormatBench.formatUpperCached       512  avgt   15  0.595 ?  0.003  us/op (+3.53)
+HexFormatBench.toHexDigitsByte         512  avgt   15  5.804 ?  0.237  us/op (+0.69)
+HexFormatBench.toHexDigitsInt          512  avgt   15  7.209 ?  0.212  us/op (+2.54)
+HexFormatBench.toHexDigitsLong         512  avgt   15  8.301 ?  0.422  us/op (-2.81)
+HexFormatBench.toHexDigitsShort        512  avgt   15  5.908 ?  0.255  us/op (+5.18)
+HexFormatBench.toHexLower              512  avgt   15  0.494 ?  0.001  us/op (+87.45)
+HexFormatBench.toHexLowerCached        512  avgt   15  0.494 ?  0.001  us/op (+102.43)
+HexFormatBench.toHexUpper              512  avgt   15  0.494 ?  0.001  us/op (+87.66)
+HexFormatBench.toHexUpperCached        512  avgt   15  0.493 ?  0.001  us/op (+102.64)

3. Mac Book Pro M1 Pro

-Benchmark                           (size)  Mode  Cnt  Score    Error  Units (baseline)
-HexFormatBench.appenderLower           512  avgt   15  2.867 ?  0.035  us/op
-HexFormatBench.appenderLowerCached     512  avgt   15  1.656 ?  0.875  us/op
-HexFormatBench.appenderUpper           512  avgt   15  2.813 ?  0.085  us/op
-HexFormatBench.appenderUpperCached     512  avgt   15  1.575 ?  0.901  us/op
-HexFormatBench.formatLower             512  avgt   15  0.385 ?  0.001  us/op
-HexFormatBench.formatLowerCached       512  avgt   15  0.385 ?  0.002  us/op
-HexFormatBench.formatUpper             512  avgt   15  0.385 ?  0.001  us/op
-HexFormatBench.formatUpperCached       512  avgt   15  0.384 ?  0.001  us/op
-HexFormatBench.toHexDigitsByte         512  avgt   15  1.688 ?  0.009  us/op
-HexFormatBench.toHexDigitsInt          512  avgt   15  2.991 ?  0.015  us/op
-HexFormatBench.toHexDigitsLong         512  avgt   15  3.719 ?  0.081  us/op
-HexFormatBench.toHexDigitsShort        512  avgt   15  1.868 ?  0.010  us/op
-HexFormatBench.toHexLower              512  avgt   15  0.321 ?  0.001  us/op
-HexFormatBench.toHexLowerCached        512  avgt   15  0.322 ?  0.001  us/op
-HexFormatBench.toHexUpper              512  avgt   15  0.324 ?  0.001  us/op
-HexFormatBench.toHexUpperCached        512  avgt   15  0.325 ?  0.001  us/op

+Benchmark                           (size)  Mode  Cnt  Score    Error  Units (optimized)
+HexFormatBench.appenderLower           512  avgt   15  0.212 ?  0.003  us/op (+1252.36)
+HexFormatBench.appenderLowerCached     512  avgt   15  0.211 ?  0.001  us/op (+684.84)
+HexFormatBench.appenderUpper           512  avgt   15  0.199 ?  0.002  us/op (+1313.57)
+HexFormatBench.appenderUpperCached     512  avgt   15  0.198 ?  0.001  us/op (+695.46)
+HexFormatBench.formatLower             512  avgt   15  0.221 ?  0.001  us/op (+74.21)
+HexFormatBench.formatLowerCached       512  avgt   15  0.192 ?  0.001  us/op (+100.53)
+HexFormatBench.formatUpper             512  avgt   15  0.317 ?  0.002  us/op (+21.46)
+HexFormatBench.formatUpperCached       512  avgt   15  0.348 ?  0.003  us/op (+10.35)
+HexFormatBench.toHexDigitsByte         512  avgt   15  1.715 ?  0.011  us/op (-1.58)
+HexFormatBench.toHexDigitsInt          512  avgt   15  2.261 ?  0.012  us/op (+32.29)
+HexFormatBench.toHexDigitsLong         512  avgt   15  3.776 ?  0.023  us/op (-1.51)
+HexFormatBench.toHexDigitsShort        512  avgt   15  1.862 ?  0.011  us/op (+0.33)
+HexFormatBench.toHexLower              512  avgt   15  0.289 ?  0.004  us/op (+11.08)
+HexFormatBench.toHexLowerCached        512  avgt   15  0.294 ?  0.002  us/op (+9.53)
+HexFormatBench.toHexUpper              512  avgt   15  0.288 ?  0.001  us/op (+12.50)
+HexFormatBench.toHexUpperCached        512  avgt   15  0.295 ?  0.001  us/op (+10.17)

wenshao · 2023-09-15T19:11:48Z

What numbers do you get with this?

In my experiments the improvement was either negligible or small enough to hardly matter. I also tried flipping bits similar to what you're doing for uppercasing but saw a net regression from that.

I opted for simplicity in #15991 - a simple cleanup that removed a few tiny lookup-tables and still led to a decent improvement. Going the other direction seems like the wrong move.

The byte[] length used by HexFormatBench is 512. In this scenario, the performance improvement of using table lookup is significant.

cl4es · 2023-09-15T19:59:16Z

The byte[] length used by HexFormatBench is 512. In this scenario, the performance improvement of using table lookup is significant.

Is this a common use-case? I could see an argument that formatting small chunks is much more common, where users will experience relatively more cache misses from the table lookup.

The gain from specializing for StringBuilder is impressive. My guess is that most of the gain is from enabling better loop optimizations by turning a complicated loop with interface calls into a simple counted loop over a byte[]. This could be split out into a separate PR, or perhaps we could consider more generalizable approaches to make append and format perform at the same level

cl4es · 2023-09-15T20:50:59Z

Results of an experiment to remove the lookup table use from StringBuilder::appendHex:

-p size=512
Benchmark                           (size)  Mode  Cnt  Score   Error  Units
HexFormatBench.appenderLowerCached     512  avgt   15  1,629 ± 0,378  us/op # baseline
HexFormatBench.appenderLowerCached     512  avgt   15  0,222 ± 0,009  us/op # 15768, 7.3x
HexFormatBench.appenderLowerCached     512  avgt   15  0,366 ± 0,002  us/op # no_lookup_tables.diff, 4.5x

-p size=4
Benchmark                           (size)  Mode  Cnt   Score   Error  Units
HexFormatBench.appenderLowerCached       4  avgt    5  26,723 ± 1,042  ns/op
HexFormatBench.appenderLowerCached       4  avgt    5   7,492 ± 0,140  ns/op # 15768, 3.6x
HexFormatBench.appenderLowerCached       4  avgt    5  10,205 ± 0,055  ns/op # no_lookup_tables.diff, 2.6x

Most of the win is from having a loop that is easier for the JIT to optimize; the lookup-table impl is faster than no lookup table in these micros, by a factor that is in the same ballpark as your format micro numbers. This on a M1. You might get different numbers.

I'm not convinced 30-100% wins on these micro is worth the increase in cache traffic, and would like to see a deeper analysis of the pros and cons of lookup tables before we go ahead with any optimizations that depend on their use.

The appendHex method in StringBuilder is interesting, but has other maintainability issues. It's a slippery slope to start adding specialized internal routines to StringBuilder, and perhaps there are better ways to model this at a higher level.

no_lookup_table.diff:

diff --git a/src/java.base/share/classes/java/lang/AbstractStringBuilder.java b/src/java.base/share/classes/java/lang/AbstractStringBuilder.java
index 57a16bb769a..70727375c0b 100644
--- a/src/java.base/share/classes/java/lang/AbstractStringBuilder.java
+++ b/src/java.base/share/classes/java/lang/AbstractStringBuilder.java
@@ -1825,6 +1825,15 @@ private final void appendChars(CharSequence s, int off, int end) {
         count += end - off;
     }

+    private static char toHighHexDigit(boolean ucase, int value) {
+        value = (value >> 4) & 0xf;
+        return (char) ((value < 10 ? '0' : ucase ? 'A' : 'a') + value);
+    }
+    private static char toLowHexDigit(boolean ucase, int value) {
+        value = (value) & 0xf;
+        return (char) ((value < 10 ? '0' : ucase ? 'A' : 'a') + value);
+    }
+
     final void appendHex(boolean ucase, byte[] bytes, int fromIndex, int toIndex) {
         Preconditions.checkFromToIndex(fromIndex, toIndex, bytes.length, Preconditions.IOOBE_FORMATTER);

@@ -1832,17 +1841,14 @@ final void appendHex(boolean ucase, byte[] bytes, int fromIndex, int toIndex) {

         int charPos = count;
         for (int i = fromIndex; i < toIndex; ++i) {
-            short packed = HexDigits.digitPair(bytes[i], ucase);
+            byte b = bytes[i];
             if (isLatin1()) {
-                ByteArrayLittleEndian.setShort(
-                        value,
-                        charPos,
-                        packed);
+                StringLatin1.putChar(value, charPos++, toHighHexDigit(ucase, b));
+                StringLatin1.putChar(value, charPos++, toLowHexDigit(ucase, b));
             } else {
-                StringUTF16.putChar(value, charPos, packed & 0xff);
-                StringUTF16.putChar(value, charPos + 1, packed >> 8);
+                StringUTF16.putChar(value, charPos++, toHighHexDigit(ucase, b));
+                StringUTF16.putChar(value, charPos++, toLowHexDigit(ucase, b));
             }
-            charPos += 2;
         }

         count = charPos;

wenshao · 2023-09-15T22:19:53Z

Add internal methods to StringBuilder for performance optimization, I saw that the implementation of JEP 403 String Template does similar things.

class AbstractStringBuilder {
    long mix(long lengthCoder) { }
    long prepend(long lengthCoder, byte[] buffer) {}
    // ...
}

However, the StringBuilder.appendHex method can have more usage scenarios and can be considered as a public method. Is it necessary to submit a new PR to add these methods?

class AbstractStringBuilder {
    public void appendHex(byte[] bytes) {}
    public void appendHex(byte[] bytes, boolean ucase) {}
    public void appendHex(byte[] bytes, int fromIndex, int toIndex) {}
    public void appendHex(byte[] bytes, int fromIndex, int toIndex, boolean ucase) {}
}

Regarding the performance of using lookup table, I think it makes sense when the length of byte[] is greater than 8. I think that when the length of byte[] is actually used, there is a high probability that it will be greater than 8. Of course, I just said the number 8 casually, it could be 12, or 16.

HexDecimal#DIGITS is a table with a size of 512 bytes. I think that in such a table, when it needs to be used continuously, it is worthwhile to perform table lookup operations.

cl4es · 2023-09-15T23:48:28Z

I don't think appendHex would make the cut as a public API. I think we would need something with much more general utility for that.

A more high-level idea would be to improve said String templates JEP to be able to format more generally into Appendable/StringBuilder, so that hex, octal - any custom format - can be efficiently appended to a StringBuilder.

BTW, here's a patch on top of #15768 that simplifies formatHex(Appendable..) to simply append the result of formatHex(byte[]...):

Benchmark                                              (size)  Mode  Cnt     Score     Error   Units
HexFormatBench.appenderLowerCached                        512  avgt   15     0,202 ±   0,001   us/op
HexFormatBench.formatLowerCached                          512  avgt   15     0,189 ±   0,016   us/op

Same speed. Drawback is that the appender becomes allocating but short-lived allocations is one of my least concerns:

diff --git a/src/java.base/share/classes/java/util/HexFormat.java b/src/java.base/share/classes/java/util/HexFormat.java
index e1d88e3ceb6..68a15307fe0 100644
--- a/src/java.base/share/classes/java/util/HexFormat.java
+++ b/src/java.base/share/classes/java/util/HexFormat.java
@@ -407,38 +407,7 @@ public <A extends Appendable> A formatHex(A out, byte[] bytes, int fromIndex, in
         int length = toIndex - fromIndex;
         if (length > 0) {
             try {
-                boolean prefixEmpty = prefix.isEmpty();
-                boolean suffixEmpty = suffix.isEmpty();
-                boolean delimiterEmpty = delimiter.isEmpty();
-                if (!prefixEmpty) {
-                    out.append(prefix);
-                }
-                if (suffixEmpty && delimiterEmpty && prefixEmpty) {
-                    if (out instanceof StringBuilder sb) {
-                        jla.appendHex(sb, digitCase == Case.UPPERCASE, bytes, fromIndex, toIndex);
-                    } else {
-                        for (int i = 0; i < length; i++) {
-                            toHexDigits(out, bytes[fromIndex + i]);
-                        }
-                    }
-                } else {
-                    toHexDigits(out, bytes[fromIndex]);
-                    for (int i = 1; i < length; i++) {
-                        if (!suffixEmpty) {
-                            out.append(suffix);
-                        }
-                        if (!delimiterEmpty) {
-                            out.append(delimiter);
-                        }
-                        if (!prefixEmpty) {
-                            out.append(prefix);
-                        }
-                        toHexDigits(out, bytes[fromIndex + i]);
-                    }
-                }
-                if (!suffixEmpty) {
-                    out.append(suffix);
-                }
+                out.append(formatHex(bytes, fromIndex, toIndex));
             } catch (IOException ioe) {
                 throw new UncheckedIOException(ioe.getMessage(), ioe);
             }

wenshao · 2023-09-16T08:52:10Z

I deleted the newly added AbstractBuilder.appendHex method, Such changes are reduced and performance improvements are similar.

The new performance test results are as follows:

1. aliyun_ecs_c8i.xlarge

cpu : intel xeon sapphire rapids (x64)
os debian linux

-Benchmark                           (size)  Mode  Cnt  Score   Error  Units (baselinie)
-HexFormatBench.appenderLower           512  avgt   15  2.768 ? 0.034  us/op
-HexFormatBench.appenderLowerCached     512  avgt   15  2.796 ? 0.042  us/op
-HexFormatBench.appenderUpper           512  avgt   15  2.800 ? 0.032  us/op
-HexFormatBench.appenderUpperCached     512  avgt   15  2.781 ? 0.018  us/op
-HexFormatBench.formatLower             512  avgt   15  0.544 ? 0.002  us/op
-HexFormatBench.formatLowerCached       512  avgt   15  0.548 ? 0.004  us/op
-HexFormatBench.formatUpper             512  avgt   15  0.546 ? 0.007  us/op
-HexFormatBench.formatUpperCached       512  avgt   15  0.550 ? 0.005  us/op
-HexFormatBench.toHexDigitsByte         512  avgt   15  3.364 ? 0.015  us/op
-HexFormatBench.toHexDigitsInt          512  avgt   15  3.770 ? 0.017  us/op
-HexFormatBench.toHexDigitsLong         512  avgt   15  4.990 ? 0.018  us/op
-HexFormatBench.toHexDigitsShort        512  avgt   15  3.466 ? 0.017  us/op
-HexFormatBench.toHexLower              512  avgt   15  0.415 ? 0.005  us/op
-HexFormatBench.toHexLowerCached        512  avgt   15  0.422 ? 0.005  us/op
-HexFormatBench.toHexUpper              512  avgt   15  0.413 ? 0.005  us/op
-HexFormatBench.toHexUpperCached        512  avgt   15  0.423 ? 0.004  us/op

+Benchmark                           (size)  Mode  Cnt  Score    Error  Units (optimized)
+HexFormatBench.appenderLower           512  avgt   15  0.211 ?  0.002  us/op (+1211.85)
+HexFormatBench.appenderLowerCached     512  avgt   15  0.210 ?  0.004  us/op (+1231.43)
+HexFormatBench.appenderUpper           512  avgt   15  0.289 ?  0.002  us/op (+868.86)
+HexFormatBench.appenderUpperCached     512  avgt   15  0.296 ?  0.019  us/op (+839.53)
+HexFormatBench.formatLower             512  avgt   15  0.265 ?  0.001  us/op (+105.29)
+HexFormatBench.formatLowerCached       512  avgt   15  0.267 ?  0.002  us/op (+105.25)
+HexFormatBench.formatUpper             512  avgt   15  0.274 ?  0.002  us/op (+99.28)
+HexFormatBench.formatUpperCached       512  avgt   15  0.286 ?  0.019  us/op (+92.31)
+HexFormatBench.toHexDigitsByte         512  avgt   15  3.351 ?  0.011  us/op (+0.39)
+HexFormatBench.toHexDigitsInt          512  avgt   15  3.708 ?  0.011  us/op (+1.68)
+HexFormatBench.toHexDigitsLong         512  avgt   15  5.051 ?  0.014  us/op (-1.21)
+HexFormatBench.toHexDigitsShort        512  avgt   15  3.456 ?  0.012  us/op (+0.29)
+HexFormatBench.toHexLower              512  avgt   15  0.445 ?  0.001  us/op (-6.75)
+HexFormatBench.toHexLowerCached        512  avgt   15  0.441 ?  0.001  us/op (-4.31)
+HexFormatBench.toHexUpper              512  avgt   15  0.444 ?  0.001  us/op (-6.99)
+HexFormatBench.toHexUpperCached        512  avgt   15  0.441 ?  0.001  us/op (-4.09)

2. aliyun_ecs_c8y.xlarge

cpu : aliyun yitian 710 (aarch64)
os debian linux

-Benchmark                           (size)  Mode  Cnt  Score   Error  Units (baseline)
-HexFormatBench.appenderLower           512  avgt   15  2.857 ? 0.791  us/op
-HexFormatBench.appenderLowerCached     512  avgt   15  2.832 ? 0.758  us/op
-HexFormatBench.appenderUpper           512  avgt   15  2.360 ? 0.010  us/op
-HexFormatBench.appenderUpperCached     512  avgt   15  2.361 ? 0.013  us/op
-HexFormatBench.formatLower             512  avgt   15  0.947 ? 0.406  us/op
-HexFormatBench.formatLowerCached       512  avgt   15  0.616 ? 0.002  us/op
-HexFormatBench.formatUpper             512  avgt   15  1.212 ? 0.411  us/op
-HexFormatBench.formatUpperCached       512  avgt   15  0.616 ? 0.001  us/op
-HexFormatBench.toHexDigitsByte         512  avgt   15  5.844 ? 0.264  us/op
-HexFormatBench.toHexDigitsInt          512  avgt   15  7.392 ? 0.207  us/op
-HexFormatBench.toHexDigitsLong         512  avgt   15  8.068 ? 0.303  us/op
-HexFormatBench.toHexDigitsShort        512  avgt   15  6.214 ? 0.266  us/op
-HexFormatBench.toHexLower              512  avgt   15  0.926 ? 0.003  us/op
-HexFormatBench.toHexLowerCached        512  avgt   15  1.000 ? 0.005  us/op
-HexFormatBench.toHexUpper              512  avgt   15  0.927 ? 0.002  us/op
-HexFormatBench.toHexUpperCached        512  avgt   15  0.999 ? 0.020  us/op

+Benchmark                           (size)  Mode  Cnt  Score    Error  Units (optimized)
+HexFormatBench.appenderLower           512  avgt   15  0.343 ?  0.001  us/op (+732.95)
+HexFormatBench.appenderLowerCached     512  avgt   15  0.345 ?  0.001  us/op (+720.87)
+HexFormatBench.appenderUpper           512  avgt   15  0.352 ?  0.002  us/op (+570.46)
+HexFormatBench.appenderUpperCached     512  avgt   15  0.349 ?  0.001  us/op (+576.51)
+HexFormatBench.formatLower             512  avgt   15  0.464 ?  0.001  us/op (+104.10)
+HexFormatBench.formatLowerCached       512  avgt   15  0.484 ?  0.002  us/op (+27.28)
+HexFormatBench.formatUpper             512  avgt   15  0.650 ?  0.001  us/op (+86.47)
+HexFormatBench.formatUpperCached       512  avgt   15  0.598 ?  0.001  us/op (+3.02)
+HexFormatBench.toHexDigitsByte         512  avgt   15  5.591 ?  0.058  us/op (+4.53)
+HexFormatBench.toHexDigitsInt          512  avgt   15  7.080 ?  0.114  us/op (+4.41)
+HexFormatBench.toHexDigitsLong         512  avgt   15  7.754 ?  0.040  us/op (+4.05)
+HexFormatBench.toHexDigitsShort        512  avgt   15  5.779 ?  0.076  us/op (+7.53)
+HexFormatBench.toHexLower              512  avgt   15  0.494 ?  0.001  us/op (+87.45)
+HexFormatBench.toHexLowerCached        512  avgt   15  0.493 ?  0.001  us/op (+102.84)
+HexFormatBench.toHexUpper              512  avgt   15  0.494 ?  0.001  us/op (+87.66)
+HexFormatBench.toHexUpperCached        512  avgt   15  0.493 ?  0.001  us/op (+102.64)

3. Mac Book Pro M1 Pro

-Benchmark                           (size)  Mode  Cnt  Score    Error  Units (baseline)
-HexFormatBench.appenderLower           512  avgt   15  2.867 ?  0.035  us/op
-HexFormatBench.appenderLowerCached     512  avgt   15  1.656 ?  0.875  us/op
-HexFormatBench.appenderUpper           512  avgt   15  2.813 ?  0.085  us/op
-HexFormatBench.appenderUpperCached     512  avgt   15  1.575 ?  0.901  us/op
-HexFormatBench.formatLower             512  avgt   15  0.385 ?  0.001  us/op
-HexFormatBench.formatLowerCached       512  avgt   15  0.385 ?  0.002  us/op
-HexFormatBench.formatUpper             512  avgt   15  0.385 ?  0.001  us/op
-HexFormatBench.formatUpperCached       512  avgt   15  0.384 ?  0.001  us/op
-HexFormatBench.toHexDigitsByte         512  avgt   15  1.688 ?  0.009  us/op
-HexFormatBench.toHexDigitsInt          512  avgt   15  2.991 ?  0.015  us/op
-HexFormatBench.toHexDigitsLong         512  avgt   15  3.719 ?  0.081  us/op
-HexFormatBench.toHexDigitsShort        512  avgt   15  1.868 ?  0.010  us/op
-HexFormatBench.toHexLower              512  avgt   15  0.321 ?  0.001  us/op
-HexFormatBench.toHexLowerCached        512  avgt   15  0.322 ?  0.001  us/op
-HexFormatBench.toHexUpper              512  avgt   15  0.324 ?  0.001  us/op
-HexFormatBench.toHexUpperCached        512  avgt   15  0.325 ?  0.001  us/op

+Benchmark                           (size)  Mode  Cnt  Score    Error  Units (optimized)
+HexFormatBench.appenderLower           512  avgt   15  0.207 ?  0.001  us/op (+1285.03)
+HexFormatBench.appenderLowerCached     512  avgt   15  0.206 ?  0.001  us/op (+703.89)
+HexFormatBench.appenderUpper           512  avgt   15  0.225 ?  0.001  us/op (+1150.23)
+HexFormatBench.appenderUpperCached     512  avgt   15  0.225 ?  0.001  us/op (+600.00)
+HexFormatBench.formatLower             512  avgt   15  0.211 ?  0.003  us/op (+82.47)
+HexFormatBench.formatLowerCached       512  avgt   15  0.186 ?  0.001  us/op (+106.99)
+HexFormatBench.formatUpper             512  avgt   15  0.312 ?  0.001  us/op (+23.40)
+HexFormatBench.formatUpperCached       512  avgt   15  0.344 ?  0.001  us/op (+11.63)
+HexFormatBench.toHexDigitsByte         512  avgt   15  1.718 ?  0.054  us/op (-1.75)
+HexFormatBench.toHexDigitsInt          512  avgt   15  2.255 ?  0.010  us/op (+32.64)
+HexFormatBench.toHexDigitsLong         512  avgt   15  3.764 ?  0.005  us/op (-1.20)
+HexFormatBench.toHexDigitsShort        512  avgt   15  1.858 ?  0.008  us/op (+0.54)
+HexFormatBench.toHexLower              512  avgt   15  0.289 ?  0.004  us/op (+11.08)
+HexFormatBench.toHexLowerCached        512  avgt   15  0.295 ?  0.001  us/op (+9.16)
+HexFormatBench.toHexUpper              512  avgt   15  0.288 ?  0.001  us/op (+12.50)
+HexFormatBench.toHexUpperCached        512  avgt   15  0.297 ?  0.005  us/op (+9.43)

wenshao · 2023-09-16T17:13:34Z

/label remove security

openjdk · 2023-09-16T17:15:22Z

@wenshao
The security label was successfully removed.

wenshao · 2023-09-17T00:51:57Z

@cl4es Can you help me create an issue for this PR?

liach · 2023-09-18T13:40:48Z

src/java.base/share/classes/java/util/HexFormat.java

                    for (int i = 1; i < length; i++) {
                        out.append(suffix);
                        out.append(delimiter);
                        out.append(prefix);
                        toHexDigits(out, bytes[fromIndex + i]);
                    }
+                    out.append(suffix);


Maybe change this whole else block to

for (int i = 0; i < length; i++) { if (i > 0) out.append(delimiter); out.append(prefix); toHexDigits(out, bytes[fromIndex + i]); out.append(suffix); }

for clarity?

The original (and current) is coded to avoid a condition inside the loop.

I also think that the way of writing for_0 combined with if > 0 is easier to understand, The operation overhead of if > 0 is very small, and it will not affect performance when used in a loop.

The coding pattern to handle first and last special cases outside the loop is also common and may help the JIT with loop unrolling and other optimizations. (And yes JITs can recognize all kinds of code and optimize it anyway).

src/java.base/share/classes/java/util/HexFormat.java

liach · 2023-09-18T13:51:06Z

src/java.base/share/classes/java/util/HexFormat.java

-            return (char)('a' - 10 + value);
-        }
-        return (char)('A' - 10 + value);
+        return (char) ((value < 10 ? '0' : hexBase) + value);


These changes seem to be the reason toHexDigitsLong consistently sees a slight performance drop. Can any professional engineer explain why the old version, which would be like (char) (value < 10 ? '0' + value : hexBase + value), is slightly faster?

Since the effect of this change hexBase is not clear, I recommend dropping this for now (remove the field and changes to the 2 methods).

Yes, please drop this. Possibly re-attempting it in a future PR, though I'm a bit skeptical of the potential win here.

I did test something like hexBase + value in #15591 to a similar effect, so opted for just storing digitCase. My best guess as to why are that the JIT is simply better at (or more eager to) eliminating untaken paths than it is at constant fold the field load. Or the branch isn't eliminated and we're seeing the effect of branch prediction. I'd recommend analyzing the asm generated (via e.g. -prof perfasm) to understand this in depth

liach · 2023-09-18T14:14:49Z

I have created a JBS issue including the 2 confirmed performance improvements in this patch: https://bugs.openjdk.org/browse/JDK-8316426

mlbridge · 2023-09-18T15:44:56Z

Webrevs

RogerRiggs · 2023-09-18T17:18:29Z

HexDecimal#DIGITS is a table with a size of 512 bytes. I think that in such a table, when it needs to be used continuously, it is worthwhile to perform table lookup operations.

HexFormat was designed for presenting bytes to humans, typical line lengths for the useful output are 80-130 characters, so input sizes of 40-65 bytes, probably less when other interpretations (such as latin1 or ascii) are appended.
And the output is not typically performance sensitive. Please keep the code small and easy to maintain.

RogerRiggs · 2023-09-18T17:28:39Z

src/java.base/share/classes/jdk/internal/util/HexDigits.java

+    public static short digitPair(int i, boolean ucase) {
+        short v = DIGITS[i & 0xff];
+        return ucase
+                ? (short) (v & ~((v & 0b0100_0000_0100_0000) >> 1)) // really: to uppper


Clever but perhaps ('a' - 'A') to make it clearer where the constant came from

RogerRiggs · 2023-09-18T19:08:14Z

src/java.base/share/classes/java/util/HexFormat.java

                    for (int i = 1; i < length; i++) {
                        out.append(suffix);
                        out.append(delimiter);
                        out.append(prefix);
                        toHexDigits(out, bytes[fromIndex + i]);
                    }
+                    out.append(suffix);


The original (and current) is coded to avoid a condition inside the loop.

src/java.base/share/classes/jdk/internal/util/HexDigits.java

RogerRiggs · 2023-09-27T04:16:11Z

ok, but perhaps you can shrink it further, if it does not hurt performance, by subtracting 0x20 before indexing and cut the table to 64 bytes.

wenshao · 2023-09-27T09:04:50Z

ok, but perhaps you can shrink it further, if it does not hurt performance, by subtracting 0x20 before indexing and cut the table to 64 bytes.

If you use DIGITS of size 54, performance will be 10% slower, The code is written like this:

public final class HexFormat {
    private static final byte[] DIGITS = {
             0,  1,  2,  3,  4,  5,  6,  7,  8,  9, -1, -1, -1, -1, -1, -1,
            -1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1,
            -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
            -1, 10, 11, 12, 13, 14, 15
           // remove 128 - 255
    };

    public static boolean isHexDigit(int ch) {
        // unsigned right shift 8 change to 7
        return (ch >= '0' && ch <= 'f' && DIGITS[ch - '0'] >= 0);
    }

    public static int fromHexDigit(int ch) {
        int value;
        // unsigned right shift 8 change to 7
        if (ch >= '0' && ch <= 'f' && (value = DIGITS[ch - '0']) >= 0) 
        ...
    }
}

wenshao · 2023-09-27T13:32:40Z

/integrate

openjdk · 2023-09-27T13:34:37Z

@wenshao
Your change (at version 2ed9ac3) is now ready to be sponsored by a Committer.

RogerRiggs · 2023-09-27T13:35:55Z

It is conventional to get re-reviews of any change and give the reviewers that taken the time to review the PR a chance to review any subsequent changes.

wenshao · 2023-09-29T19:55:39Z

/integrate

openjdk · 2023-09-29T19:56:59Z

@wenshao
Your change (at version 1463863) is now ready to be sponsored by a Committer.

wenshao · 2023-10-06T12:30:42Z

@RogerRiggs Can this PR be added to /sponsor now?

cl4es · 2023-10-06T14:20:13Z

FWIW I'll not review or sponsor any PRs that use ByteArrayLittleEndian for trivial byte[] writes until there's been a thorough analysis of why this helps and shown that JITs can't be expected to generate code that is as optimal.

wenshao · 2023-10-06T17:16:32Z

FWIW I'll not review or sponsor any PRs that use ByteArrayLittleEndian for trivial byte[] writes until there's been a thorough analysis of why this helps and shown that JITs can't be expected to generate code that is as optimal.

Thanks, I received your opinion and I have removed ByteArrayLittleEndian.

Such optimization should be done by JIT. It seems that this optimization is called SuperWord Level Parallelism (SLP). Currently, HotSpot C2 has only completed the loop version and not the linear version. Improvements require JIT experts to submit patches.

wenshao · 2023-10-09T20:13:15Z

/integrate

openjdk · 2023-10-09T20:14:10Z

@wenshao
Your change (at version f7d832f) is now ready to be sponsored by a Committer.

wenshao · 2023-10-12T06:17:35Z

@cl4es Can it be integrated?

cl4es · 2023-10-12T13:45:46Z

Yes, I think this looks fine. Thank you for your patience.

/sponsor

openjdk · 2023-10-12T13:54:21Z

Going to push as commit 9355431.
Since your change was applied there have been 360 commits pushed to the master branch:

32ccf01: 8317772: NMT: Make peak values available in release builds
4c79e7d: 8170817: G1: Returning MinTLABSize from unsafe_max_tlab_alloc causes TLAB flapping
7633a76: 8317998: Temporarily disable malformed control flow assert to reduce noise in testing
00ef9f9: 8316947: Write a test to check textArea triggers MouseEntered/MouseExited events properly
77dc891: 8317963: Serial: Remove unused GenerationIsInReservedClosure
d95b548: 8315850: Improve AbstractMap anonymous Iterator classes
424de29: 8317866: replace NET_SocketAvailable
6d6c900: 8038244: (fs) Check return value of malloc in Java_sun_nio_fs_AixNativeDispatcher_getmntctl()
eca6ea4: 8317873: Add @sealedGraph to IllegalFormatException
2edf9c3: 8317763: Follow-up to AVX512 intrinsics for Arrays.sort() PR
... and 350 more: https://git.openjdk.org/jdk/compare/3abe7982bfbc5787962863f8604ddecadf770b74...master

Your commit was automatically rebased without conflicts.

openjdk · 2023-10-12T13:54:33Z

@cl4es @wenshao Pushed as commit 9355431.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

optimization for HexFormat.formatHex

951cee8

openjdk bot added security security-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels Sep 15, 2023

Remove redundant code in the benchmark

c9ef43f

wenshao added 2 commits September 16, 2023 16:15

remove AbstractStringBuilder internal method 'appendHex'

be96603

remove unused import

7ffdba5

openjdk bot removed the security security-dev@openjdk.org label Sep 16, 2023

liach approved these changes Sep 18, 2023

View reviewed changes

liach reviewed Sep 18, 2023

View reviewed changes

restore toLowHexDigit & toHighHexDigit

530ca01

wenshao changed the title ~~optimization for HexFormat.formatHex~~ 8316426: optimization for HexFormat.formatHex Sep 18, 2023

wenshao changed the title ~~8316426: optimization for HexFormat.formatHex~~ 8316426: Optimization for HexFormat.formatHex Sep 18, 2023

openjdk bot added the rfr Pull request is ready for review label Sep 18, 2023

RogerRiggs reviewed Sep 18, 2023

View reviewed changes

reduce the size of HexFormat##DIGITS, from 256 to 128

2ed9ac3

openjdk bot removed the sponsor Pull request is ready to be sponsored label Sep 27, 2023

openjdk bot added the sponsor Pull request is ready to be sponsored label Sep 27, 2023

Update full name

1463863

openjdk bot removed the sponsor Pull request is ready to be sponsored label Sep 29, 2023

openjdk bot added the sponsor Pull request is ready to be sponsored label Sep 29, 2023

Not ByteArrayLittleEndian

f7d832f

openjdk bot removed the sponsor Pull request is ready to be sponsored label Oct 6, 2023

openjdk bot added the sponsor Pull request is ready to be sponsored label Oct 9, 2023

openjdk bot added the integrated Pull request has been integrated label Oct 12, 2023

openjdk bot closed this Oct 12, 2023

openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Oct 12, 2023

8316426: Optimization for HexFormat.formatHex #15768

8316426: Optimization for HexFormat.formatHex #15768

Conversation

wenshao commented Sep 15, 2023 • edited by openjdk bot Loading

Progress

Issue

Reviewers

Reviewing

Webrev

bridgekeeper bot commented Sep 15, 2023

openjdk bot commented Sep 15, 2023

cl4es commented Sep 15, 2023

wenshao commented Sep 15, 2023 • edited Loading

0. sciprt

1. aliyun_ecs_c8i.xlarge

2. aliyun_ecs_c8y.xlarge

3. Mac Book Pro M1 Pro

wenshao commented Sep 15, 2023

cl4es commented Sep 15, 2023

cl4es commented Sep 15, 2023

wenshao commented Sep 15, 2023

cl4es commented Sep 15, 2023

wenshao commented Sep 16, 2023

1. aliyun_ecs_c8i.xlarge

2. aliyun_ecs_c8y.xlarge

3. Mac Book Pro M1 Pro

wenshao commented Sep 16, 2023

openjdk bot commented Sep 16, 2023

wenshao commented Sep 17, 2023

liach Sep 18, 2023

Choose a reason for hiding this comment

RogerRiggs Sep 18, 2023

Choose a reason for hiding this comment

wenshao Sep 19, 2023

Choose a reason for hiding this comment

RogerRiggs Sep 19, 2023

Choose a reason for hiding this comment

liach Sep 18, 2023 • edited Loading

Choose a reason for hiding this comment

cl4es Sep 18, 2023

Choose a reason for hiding this comment

liach commented Sep 18, 2023

mlbridge bot commented Sep 18, 2023 • edited Loading

Webrevs

RogerRiggs commented Sep 18, 2023

RogerRiggs Sep 18, 2023

Choose a reason for hiding this comment

RogerRiggs Sep 18, 2023

Choose a reason for hiding this comment

RogerRiggs commented Sep 27, 2023

wenshao commented Sep 27, 2023

wenshao commented Sep 27, 2023

openjdk bot commented Sep 27, 2023

RogerRiggs commented Sep 27, 2023

wenshao commented Sep 29, 2023

openjdk bot commented Sep 29, 2023

wenshao commented Oct 6, 2023

cl4es commented Oct 6, 2023 • edited Loading

wenshao commented Oct 6, 2023

wenshao commented Oct 9, 2023

openjdk bot commented Oct 9, 2023

wenshao commented Oct 12, 2023

cl4es commented Oct 12, 2023

openjdk bot commented Oct 12, 2023

openjdk bot commented Oct 12, 2023

wenshao commented Sep 15, 2023 •

edited by openjdk bot

Loading

wenshao commented Sep 15, 2023 •

edited

Loading

liach Sep 18, 2023 •

edited

Loading

mlbridge bot commented Sep 18, 2023 •

edited

Loading

cl4es commented Oct 6, 2023 •

edited

Loading