Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8316426: Optimization for HexFormat.formatHex #15768

Closed
wants to merge 16 commits into from

Conversation

wenshao
Copy link
Contributor

@wenshao wenshao commented Sep 15, 2023

In the improvement of @cl4es PR #15591, the advantages of non-lookup-table were discussed.

But if the input is byte[], using lookup table can improve performance.

For HexFormat#formatHex(Appendable, byte[]) and HexFormat#formatHex(byte[]), If the length of byte[] is larger, the performance of table lookup will be improved more obviously.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8316426: Optimization for HexFormat.formatHex (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/15768/head:pull/15768
$ git checkout pull/15768

Update a local copy of the PR:
$ git checkout pull/15768
$ git pull https://git.openjdk.org/jdk.git pull/15768/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 15768

View PR using the GUI difftool:
$ git pr show -t 15768

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/15768.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Sep 15, 2023

👋 Welcome back wenshao! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Sep 15, 2023

@wenshao The following labels will be automatically applied to this pull request:

  • core-libs
  • security

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added security security-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels Sep 15, 2023
@cl4es
Copy link
Member

cl4es commented Sep 15, 2023

What numbers do you get with this?

In my experiments the improvement was either negligible or small enough to hardly matter. I also tried flipping bits similar to what you're doing for uppercasing but saw a net regression from that.

I opted for simplicity in #15991 - a simple cleanup that removed a few tiny lookup-tables and still led to a decent improvement. Going the other direction seems like the wrong move.

@wenshao
Copy link
Contributor Author

wenshao commented Sep 15, 2023

The performance test results are as follows:

0. sciprt

bash configure
make images
sh make/devkit/createJMHBundle.sh
bash configure --with-jmh=build/jmh/jars
make test TEST="micro:java.util.HexFormatBench.*"

1. aliyun_ecs_c8i.xlarge

  • cpu : intel xeon sapphire rapids (x64)
  • os debian linux
-Benchmark                           (size)  Mode  Cnt  Score   Error  Units (baselinie)
-HexFormatBench.appenderLower           512  avgt   15  2.768 ? 0.034  us/op
-HexFormatBench.appenderLowerCached     512  avgt   15  2.796 ? 0.042  us/op
-HexFormatBench.appenderUpper           512  avgt   15  2.800 ? 0.032  us/op
-HexFormatBench.appenderUpperCached     512  avgt   15  2.781 ? 0.018  us/op
-HexFormatBench.formatLower             512  avgt   15  0.544 ? 0.002  us/op
-HexFormatBench.formatLowerCached       512  avgt   15  0.548 ? 0.004  us/op
-HexFormatBench.formatUpper             512  avgt   15  0.546 ? 0.007  us/op
-HexFormatBench.formatUpperCached       512  avgt   15  0.550 ? 0.005  us/op
-HexFormatBench.toHexDigitsByte         512  avgt   15  3.364 ? 0.015  us/op
-HexFormatBench.toHexDigitsInt          512  avgt   15  3.770 ? 0.017  us/op
-HexFormatBench.toHexDigitsLong         512  avgt   15  4.990 ? 0.018  us/op
-HexFormatBench.toHexDigitsShort        512  avgt   15  3.466 ? 0.017  us/op
-HexFormatBench.toHexLower              512  avgt   15  0.415 ? 0.005  us/op
-HexFormatBench.toHexLowerCached        512  avgt   15  0.422 ? 0.005  us/op
-HexFormatBench.toHexUpper              512  avgt   15  0.413 ? 0.005  us/op
-HexFormatBench.toHexUpperCached        512  avgt   15  0.423 ? 0.004  us/op

+Benchmark                           (size)  Mode  Cnt  Score    Error  Units (optimized)
+HexFormatBench.appenderLower           512  avgt   15  0.163 ?  0.001  us/op (+1598.16)
+HexFormatBench.appenderLowerCached     512  avgt   15  0.161 ?  0.001  us/op (+1636.65)
+HexFormatBench.appenderUpper           512  avgt   15  0.251 ?  0.023  us/op (+1015.54)
+HexFormatBench.appenderUpperCached     512  avgt   15  0.266 ?  0.001  us/op (+945.49)
+HexFormatBench.formatLower             512  avgt   15  0.275 ?  0.001  us/op (+97.82)
+HexFormatBench.formatLowerCached       512  avgt   15  0.277 ?  0.001  us/op (+97.84)
+HexFormatBench.formatUpper             512  avgt   15  0.285 ?  0.001  us/op (+91.58)
+HexFormatBench.formatUpperCached       512  avgt   15  0.285 ?  0.001  us/op (+92.99)
+HexFormatBench.toHexDigitsByte         512  avgt   15  3.554 ?  0.028  us/op (-5.35)
+HexFormatBench.toHexDigitsInt          512  avgt   15  3.910 ?  0.015  us/op (-3.59)
+HexFormatBench.toHexDigitsLong         512  avgt   15  5.288 ?  0.018  us/op (-5.64)
+HexFormatBench.toHexDigitsShort        512  avgt   15  3.637 ?  0.012  us/op (-4.71)
+HexFormatBench.toHexLower              512  avgt   15  0.445 ?  0.001  us/op (-6.75)
+HexFormatBench.toHexLowerCached        512  avgt   15  0.442 ?  0.001  us/op (-4.53)
+HexFormatBench.toHexUpper              512  avgt   15  0.445 ?  0.001  us/op (-7.20)
+HexFormatBench.toHexUpperCached        512  avgt   15  0.441 ?  0.001  us/op (-4.09)

2. aliyun_ecs_c8y.xlarge

  • cpu : aliyun yitian 710 (aarch64)
  • os debian linux
-Benchmark                           (size)  Mode  Cnt  Score   Error  Units (baseline)
-HexFormatBench.appenderLower           512  avgt   15  2.857 ? 0.791  us/op
-HexFormatBench.appenderLowerCached     512  avgt   15  2.832 ? 0.758  us/op
-HexFormatBench.appenderUpper           512  avgt   15  2.360 ? 0.010  us/op
-HexFormatBench.appenderUpperCached     512  avgt   15  2.361 ? 0.013  us/op
-HexFormatBench.formatLower             512  avgt   15  0.947 ? 0.406  us/op
-HexFormatBench.formatLowerCached       512  avgt   15  0.616 ? 0.002  us/op
-HexFormatBench.formatUpper             512  avgt   15  1.212 ? 0.411  us/op
-HexFormatBench.formatUpperCached       512  avgt   15  0.616 ? 0.001  us/op
-HexFormatBench.toHexDigitsByte         512  avgt   15  5.844 ? 0.264  us/op
-HexFormatBench.toHexDigitsInt          512  avgt   15  7.392 ? 0.207  us/op
-HexFormatBench.toHexDigitsLong         512  avgt   15  8.068 ? 0.303  us/op
-HexFormatBench.toHexDigitsShort        512  avgt   15  6.214 ? 0.266  us/op
-HexFormatBench.toHexLower              512  avgt   15  0.926 ? 0.003  us/op
-HexFormatBench.toHexLowerCached        512  avgt   15  1.000 ? 0.005  us/op
-HexFormatBench.toHexUpper              512  avgt   15  0.927 ? 0.002  us/op
-HexFormatBench.toHexUpperCached        512  avgt   15  0.999 ? 0.020  us/op

+Benchmark                           (size)  Mode  Cnt  Score    Error  Units (optimized)
+HexFormatBench.appenderLower           512  avgt   15  0.356 ?  0.001  us/op (+702.53)
+HexFormatBench.appenderLowerCached     512  avgt   15  0.356 ?  0.001  us/op (+695.51)
+HexFormatBench.appenderUpper           512  avgt   15  0.304 ?  0.001  us/op (+676.32)
+HexFormatBench.appenderUpperCached     512  avgt   15  0.304 ?  0.001  us/op (+676.65)
+HexFormatBench.formatLower             512  avgt   15  0.461 ?  0.001  us/op (+105.43)
+HexFormatBench.formatLowerCached       512  avgt   15  0.485 ?  0.001  us/op (+27.02)
+HexFormatBench.formatUpper             512  avgt   15  0.644 ?  0.003  us/op (+88.20)
+HexFormatBench.formatUpperCached       512  avgt   15  0.595 ?  0.003  us/op (+3.53)
+HexFormatBench.toHexDigitsByte         512  avgt   15  5.804 ?  0.237  us/op (+0.69)
+HexFormatBench.toHexDigitsInt          512  avgt   15  7.209 ?  0.212  us/op (+2.54)
+HexFormatBench.toHexDigitsLong         512  avgt   15  8.301 ?  0.422  us/op (-2.81)
+HexFormatBench.toHexDigitsShort        512  avgt   15  5.908 ?  0.255  us/op (+5.18)
+HexFormatBench.toHexLower              512  avgt   15  0.494 ?  0.001  us/op (+87.45)
+HexFormatBench.toHexLowerCached        512  avgt   15  0.494 ?  0.001  us/op (+102.43)
+HexFormatBench.toHexUpper              512  avgt   15  0.494 ?  0.001  us/op (+87.66)
+HexFormatBench.toHexUpperCached        512  avgt   15  0.493 ?  0.001  us/op (+102.64)

3. Mac Book Pro M1 Pro

-Benchmark                           (size)  Mode  Cnt  Score    Error  Units (baseline)
-HexFormatBench.appenderLower           512  avgt   15  2.867 ?  0.035  us/op
-HexFormatBench.appenderLowerCached     512  avgt   15  1.656 ?  0.875  us/op
-HexFormatBench.appenderUpper           512  avgt   15  2.813 ?  0.085  us/op
-HexFormatBench.appenderUpperCached     512  avgt   15  1.575 ?  0.901  us/op
-HexFormatBench.formatLower             512  avgt   15  0.385 ?  0.001  us/op
-HexFormatBench.formatLowerCached       512  avgt   15  0.385 ?  0.002  us/op
-HexFormatBench.formatUpper             512  avgt   15  0.385 ?  0.001  us/op
-HexFormatBench.formatUpperCached       512  avgt   15  0.384 ?  0.001  us/op
-HexFormatBench.toHexDigitsByte         512  avgt   15  1.688 ?  0.009  us/op
-HexFormatBench.toHexDigitsInt          512  avgt   15  2.991 ?  0.015  us/op
-HexFormatBench.toHexDigitsLong         512  avgt   15  3.719 ?  0.081  us/op
-HexFormatBench.toHexDigitsShort        512  avgt   15  1.868 ?  0.010  us/op
-HexFormatBench.toHexLower              512  avgt   15  0.321 ?  0.001  us/op
-HexFormatBench.toHexLowerCached        512  avgt   15  0.322 ?  0.001  us/op
-HexFormatBench.toHexUpper              512  avgt   15  0.324 ?  0.001  us/op
-HexFormatBench.toHexUpperCached        512  avgt   15  0.325 ?  0.001  us/op

+Benchmark                           (size)  Mode  Cnt  Score    Error  Units (optimized)
+HexFormatBench.appenderLower           512  avgt   15  0.212 ?  0.003  us/op (+1252.36)
+HexFormatBench.appenderLowerCached     512  avgt   15  0.211 ?  0.001  us/op (+684.84)
+HexFormatBench.appenderUpper           512  avgt   15  0.199 ?  0.002  us/op (+1313.57)
+HexFormatBench.appenderUpperCached     512  avgt   15  0.198 ?  0.001  us/op (+695.46)
+HexFormatBench.formatLower             512  avgt   15  0.221 ?  0.001  us/op (+74.21)
+HexFormatBench.formatLowerCached       512  avgt   15  0.192 ?  0.001  us/op (+100.53)
+HexFormatBench.formatUpper             512  avgt   15  0.317 ?  0.002  us/op (+21.46)
+HexFormatBench.formatUpperCached       512  avgt   15  0.348 ?  0.003  us/op (+10.35)
+HexFormatBench.toHexDigitsByte         512  avgt   15  1.715 ?  0.011  us/op (-1.58)
+HexFormatBench.toHexDigitsInt          512  avgt   15  2.261 ?  0.012  us/op (+32.29)
+HexFormatBench.toHexDigitsLong         512  avgt   15  3.776 ?  0.023  us/op (-1.51)
+HexFormatBench.toHexDigitsShort        512  avgt   15  1.862 ?  0.011  us/op (+0.33)
+HexFormatBench.toHexLower              512  avgt   15  0.289 ?  0.004  us/op (+11.08)
+HexFormatBench.toHexLowerCached        512  avgt   15  0.294 ?  0.002  us/op (+9.53)
+HexFormatBench.toHexUpper              512  avgt   15  0.288 ?  0.001  us/op (+12.50)
+HexFormatBench.toHexUpperCached        512  avgt   15  0.295 ?  0.001  us/op (+10.17)

@wenshao
Copy link
Contributor Author

wenshao commented Sep 15, 2023

What numbers do you get with this?

In my experiments the improvement was either negligible or small enough to hardly matter. I also tried flipping bits similar to what you're doing for uppercasing but saw a net regression from that.

I opted for simplicity in #15991 - a simple cleanup that removed a few tiny lookup-tables and still led to a decent improvement. Going the other direction seems like the wrong move.

The byte[] length used by HexFormatBench is 512. In this scenario, the performance improvement of using table lookup is significant.

@cl4es
Copy link
Member

cl4es commented Sep 15, 2023

The byte[] length used by HexFormatBench is 512. In this scenario, the performance improvement of using table lookup is significant.

Is this a common use-case? I could see an argument that formatting small chunks is much more common, where users will experience relatively more cache misses from the table lookup.

The gain from specializing for StringBuilder is impressive. My guess is that most of the gain is from enabling better loop optimizations by turning a complicated loop with interface calls into a simple counted loop over a byte[]. This could be split out into a separate PR, or perhaps we could consider more generalizable approaches to make append and format perform at the same level

@cl4es
Copy link
Member

cl4es commented Sep 15, 2023

Results of an experiment to remove the lookup table use from StringBuilder::appendHex:

-p size=512
Benchmark                           (size)  Mode  Cnt  Score   Error  Units
HexFormatBench.appenderLowerCached     512  avgt   15  1,629 ± 0,378  us/op # baseline
HexFormatBench.appenderLowerCached     512  avgt   15  0,222 ± 0,009  us/op # 15768, 7.3x
HexFormatBench.appenderLowerCached     512  avgt   15  0,366 ± 0,002  us/op # no_lookup_tables.diff, 4.5x

-p size=4
Benchmark                           (size)  Mode  Cnt   Score   Error  Units
HexFormatBench.appenderLowerCached       4  avgt    5  26,723 ± 1,042  ns/op
HexFormatBench.appenderLowerCached       4  avgt    5   7,492 ± 0,140  ns/op # 15768, 3.6x
HexFormatBench.appenderLowerCached       4  avgt    5  10,205 ± 0,055  ns/op # no_lookup_tables.diff, 2.6x

Most of the win is from having a loop that is easier for the JIT to optimize; the lookup-table impl is faster than no lookup table in these micros, by a factor that is in the same ballpark as your format micro numbers. This on a M1. You might get different numbers.

I'm not convinced 30-100% wins on these micro is worth the increase in cache traffic, and would like to see a deeper analysis of the pros and cons of lookup tables before we go ahead with any optimizations that depend on their use.

The appendHex method in StringBuilder is interesting, but has other maintainability issues. It's a slippery slope to start adding specialized internal routines to StringBuilder, and perhaps there are better ways to model this at a higher level.

no_lookup_table.diff:

diff --git a/src/java.base/share/classes/java/lang/AbstractStringBuilder.java b/src/java.base/share/classes/java/lang/AbstractStringBuilder.java
index 57a16bb769a..70727375c0b 100644
--- a/src/java.base/share/classes/java/lang/AbstractStringBuilder.java
+++ b/src/java.base/share/classes/java/lang/AbstractStringBuilder.java
@@ -1825,6 +1825,15 @@ private final void appendChars(CharSequence s, int off, int end) {
         count += end - off;
     }

+    private static char toHighHexDigit(boolean ucase, int value) {
+        value = (value >> 4) & 0xf;
+        return (char) ((value < 10 ? '0' : ucase ? 'A' : 'a') + value);
+    }
+    private static char toLowHexDigit(boolean ucase, int value) {
+        value = (value) & 0xf;
+        return (char) ((value < 10 ? '0' : ucase ? 'A' : 'a') + value);
+    }
+
     final void appendHex(boolean ucase, byte[] bytes, int fromIndex, int toIndex) {
         Preconditions.checkFromToIndex(fromIndex, toIndex, bytes.length, Preconditions.IOOBE_FORMATTER);

@@ -1832,17 +1841,14 @@ final void appendHex(boolean ucase, byte[] bytes, int fromIndex, int toIndex) {

         int charPos = count;
         for (int i = fromIndex; i < toIndex; ++i) {
-            short packed = HexDigits.digitPair(bytes[i], ucase);
+            byte b = bytes[i];
             if (isLatin1()) {
-                ByteArrayLittleEndian.setShort(
-                        value,
-                        charPos,
-                        packed);
+                StringLatin1.putChar(value, charPos++, toHighHexDigit(ucase, b));
+                StringLatin1.putChar(value, charPos++, toLowHexDigit(ucase, b));
             } else {
-                StringUTF16.putChar(value, charPos, packed & 0xff);
-                StringUTF16.putChar(value, charPos + 1, packed >> 8);
+                StringUTF16.putChar(value, charPos++, toHighHexDigit(ucase, b));
+                StringUTF16.putChar(value, charPos++, toLowHexDigit(ucase, b));
             }
-            charPos += 2;
         }

         count = charPos;

@wenshao
Copy link
Contributor Author

wenshao commented Sep 15, 2023

Add internal methods to StringBuilder for performance optimization, I saw that the implementation of JEP 403 String Template does similar things.

class AbstractStringBuilder {
    long mix(long lengthCoder) { }
    long prepend(long lengthCoder, byte[] buffer) {}
    // ...
}

However, the StringBuilder.appendHex method can have more usage scenarios and can be considered as a public method. Is it necessary to submit a new PR to add these methods?

class AbstractStringBuilder {
    public void appendHex(byte[] bytes) {}
    public void appendHex(byte[] bytes, boolean ucase) {}
    public void appendHex(byte[] bytes, int fromIndex, int toIndex) {}
    public void appendHex(byte[] bytes, int fromIndex, int toIndex, boolean ucase) {}
}

Regarding the performance of using lookup table, I think it makes sense when the length of byte[] is greater than 8. I think that when the length of byte[] is actually used, there is a high probability that it will be greater than 8. Of course, I just said the number 8 casually, it could be 12, or 16.

HexDecimal#DIGITS is a table with a size of 512 bytes. I think that in such a table, when it needs to be used continuously, it is worthwhile to perform table lookup operations.

@cl4es
Copy link
Member

cl4es commented Sep 15, 2023

I don't think appendHex would make the cut as a public API. I think we would need something with much more general utility for that.

A more high-level idea would be to improve said String templates JEP to be able to format more generally into Appendable/StringBuilder, so that hex, octal - any custom format - can be efficiently appended to a StringBuilder.

BTW, here's a patch on top of #15768 that simplifies formatHex(Appendable..) to simply append the result of formatHex(byte[]...):

Benchmark                                              (size)  Mode  Cnt     Score     Error   Units
HexFormatBench.appenderLowerCached                        512  avgt   15     0,202 ±   0,001   us/op
HexFormatBench.formatLowerCached                          512  avgt   15     0,189 ±   0,016   us/op

Same speed. Drawback is that the appender becomes allocating but short-lived allocations is one of my least concerns:

diff --git a/src/java.base/share/classes/java/util/HexFormat.java b/src/java.base/share/classes/java/util/HexFormat.java
index e1d88e3ceb6..68a15307fe0 100644
--- a/src/java.base/share/classes/java/util/HexFormat.java
+++ b/src/java.base/share/classes/java/util/HexFormat.java
@@ -407,38 +407,7 @@ public <A extends Appendable> A formatHex(A out, byte[] bytes, int fromIndex, in
         int length = toIndex - fromIndex;
         if (length > 0) {
             try {
-                boolean prefixEmpty = prefix.isEmpty();
-                boolean suffixEmpty = suffix.isEmpty();
-                boolean delimiterEmpty = delimiter.isEmpty();
-                if (!prefixEmpty) {
-                    out.append(prefix);
-                }
-                if (suffixEmpty && delimiterEmpty && prefixEmpty) {
-                    if (out instanceof StringBuilder sb) {
-                        jla.appendHex(sb, digitCase == Case.UPPERCASE, bytes, fromIndex, toIndex);
-                    } else {
-                        for (int i = 0; i < length; i++) {
-                            toHexDigits(out, bytes[fromIndex + i]);
-                        }
-                    }
-                } else {
-                    toHexDigits(out, bytes[fromIndex]);
-                    for (int i = 1; i < length; i++) {
-                        if (!suffixEmpty) {
-                            out.append(suffix);
-                        }
-                        if (!delimiterEmpty) {
-                            out.append(delimiter);
-                        }
-                        if (!prefixEmpty) {
-                            out.append(prefix);
-                        }
-                        toHexDigits(out, bytes[fromIndex + i]);
-                    }
-                }
-                if (!suffixEmpty) {
-                    out.append(suffix);
-                }
+                out.append(formatHex(bytes, fromIndex, toIndex));
             } catch (IOException ioe) {
                 throw new UncheckedIOException(ioe.getMessage(), ioe);
             }

@wenshao
Copy link
Contributor Author

wenshao commented Sep 16, 2023

I deleted the newly added AbstractBuilder.appendHex method, Such changes are reduced and performance improvements are similar.

The new performance test results are as follows:

1. aliyun_ecs_c8i.xlarge

  • cpu : intel xeon sapphire rapids (x64)
  • os debian linux
-Benchmark                           (size)  Mode  Cnt  Score   Error  Units (baselinie)
-HexFormatBench.appenderLower           512  avgt   15  2.768 ? 0.034  us/op
-HexFormatBench.appenderLowerCached     512  avgt   15  2.796 ? 0.042  us/op
-HexFormatBench.appenderUpper           512  avgt   15  2.800 ? 0.032  us/op
-HexFormatBench.appenderUpperCached     512  avgt   15  2.781 ? 0.018  us/op
-HexFormatBench.formatLower             512  avgt   15  0.544 ? 0.002  us/op
-HexFormatBench.formatLowerCached       512  avgt   15  0.548 ? 0.004  us/op
-HexFormatBench.formatUpper             512  avgt   15  0.546 ? 0.007  us/op
-HexFormatBench.formatUpperCached       512  avgt   15  0.550 ? 0.005  us/op
-HexFormatBench.toHexDigitsByte         512  avgt   15  3.364 ? 0.015  us/op
-HexFormatBench.toHexDigitsInt          512  avgt   15  3.770 ? 0.017  us/op
-HexFormatBench.toHexDigitsLong         512  avgt   15  4.990 ? 0.018  us/op
-HexFormatBench.toHexDigitsShort        512  avgt   15  3.466 ? 0.017  us/op
-HexFormatBench.toHexLower              512  avgt   15  0.415 ? 0.005  us/op
-HexFormatBench.toHexLowerCached        512  avgt   15  0.422 ? 0.005  us/op
-HexFormatBench.toHexUpper              512  avgt   15  0.413 ? 0.005  us/op
-HexFormatBench.toHexUpperCached        512  avgt   15  0.423 ? 0.004  us/op

+Benchmark                           (size)  Mode  Cnt  Score    Error  Units (optimized)
+HexFormatBench.appenderLower           512  avgt   15  0.211 ?  0.002  us/op (+1211.85)
+HexFormatBench.appenderLowerCached     512  avgt   15  0.210 ?  0.004  us/op (+1231.43)
+HexFormatBench.appenderUpper           512  avgt   15  0.289 ?  0.002  us/op (+868.86)
+HexFormatBench.appenderUpperCached     512  avgt   15  0.296 ?  0.019  us/op (+839.53)
+HexFormatBench.formatLower             512  avgt   15  0.265 ?  0.001  us/op (+105.29)
+HexFormatBench.formatLowerCached       512  avgt   15  0.267 ?  0.002  us/op (+105.25)
+HexFormatBench.formatUpper             512  avgt   15  0.274 ?  0.002  us/op (+99.28)
+HexFormatBench.formatUpperCached       512  avgt   15  0.286 ?  0.019  us/op (+92.31)
+HexFormatBench.toHexDigitsByte         512  avgt   15  3.351 ?  0.011  us/op (+0.39)
+HexFormatBench.toHexDigitsInt          512  avgt   15  3.708 ?  0.011  us/op (+1.68)
+HexFormatBench.toHexDigitsLong         512  avgt   15  5.051 ?  0.014  us/op (-1.21)
+HexFormatBench.toHexDigitsShort        512  avgt   15  3.456 ?  0.012  us/op (+0.29)
+HexFormatBench.toHexLower              512  avgt   15  0.445 ?  0.001  us/op (-6.75)
+HexFormatBench.toHexLowerCached        512  avgt   15  0.441 ?  0.001  us/op (-4.31)
+HexFormatBench.toHexUpper              512  avgt   15  0.444 ?  0.001  us/op (-6.99)
+HexFormatBench.toHexUpperCached        512  avgt   15  0.441 ?  0.001  us/op (-4.09)

2. aliyun_ecs_c8y.xlarge

  • cpu : aliyun yitian 710 (aarch64)
  • os debian linux
-Benchmark                           (size)  Mode  Cnt  Score   Error  Units (baseline)
-HexFormatBench.appenderLower           512  avgt   15  2.857 ? 0.791  us/op
-HexFormatBench.appenderLowerCached     512  avgt   15  2.832 ? 0.758  us/op
-HexFormatBench.appenderUpper           512  avgt   15  2.360 ? 0.010  us/op
-HexFormatBench.appenderUpperCached     512  avgt   15  2.361 ? 0.013  us/op
-HexFormatBench.formatLower             512  avgt   15  0.947 ? 0.406  us/op
-HexFormatBench.formatLowerCached       512  avgt   15  0.616 ? 0.002  us/op
-HexFormatBench.formatUpper             512  avgt   15  1.212 ? 0.411  us/op
-HexFormatBench.formatUpperCached       512  avgt   15  0.616 ? 0.001  us/op
-HexFormatBench.toHexDigitsByte         512  avgt   15  5.844 ? 0.264  us/op
-HexFormatBench.toHexDigitsInt          512  avgt   15  7.392 ? 0.207  us/op
-HexFormatBench.toHexDigitsLong         512  avgt   15  8.068 ? 0.303  us/op
-HexFormatBench.toHexDigitsShort        512  avgt   15  6.214 ? 0.266  us/op
-HexFormatBench.toHexLower              512  avgt   15  0.926 ? 0.003  us/op
-HexFormatBench.toHexLowerCached        512  avgt   15  1.000 ? 0.005  us/op
-HexFormatBench.toHexUpper              512  avgt   15  0.927 ? 0.002  us/op
-HexFormatBench.toHexUpperCached        512  avgt   15  0.999 ? 0.020  us/op

+Benchmark                           (size)  Mode  Cnt  Score    Error  Units (optimized)
+HexFormatBench.appenderLower           512  avgt   15  0.343 ?  0.001  us/op (+732.95)
+HexFormatBench.appenderLowerCached     512  avgt   15  0.345 ?  0.001  us/op (+720.87)
+HexFormatBench.appenderUpper           512  avgt   15  0.352 ?  0.002  us/op (+570.46)
+HexFormatBench.appenderUpperCached     512  avgt   15  0.349 ?  0.001  us/op (+576.51)
+HexFormatBench.formatLower             512  avgt   15  0.464 ?  0.001  us/op (+104.10)
+HexFormatBench.formatLowerCached       512  avgt   15  0.484 ?  0.002  us/op (+27.28)
+HexFormatBench.formatUpper             512  avgt   15  0.650 ?  0.001  us/op (+86.47)
+HexFormatBench.formatUpperCached       512  avgt   15  0.598 ?  0.001  us/op (+3.02)
+HexFormatBench.toHexDigitsByte         512  avgt   15  5.591 ?  0.058  us/op (+4.53)
+HexFormatBench.toHexDigitsInt          512  avgt   15  7.080 ?  0.114  us/op (+4.41)
+HexFormatBench.toHexDigitsLong         512  avgt   15  7.754 ?  0.040  us/op (+4.05)
+HexFormatBench.toHexDigitsShort        512  avgt   15  5.779 ?  0.076  us/op (+7.53)
+HexFormatBench.toHexLower              512  avgt   15  0.494 ?  0.001  us/op (+87.45)
+HexFormatBench.toHexLowerCached        512  avgt   15  0.493 ?  0.001  us/op (+102.84)
+HexFormatBench.toHexUpper              512  avgt   15  0.494 ?  0.001  us/op (+87.66)
+HexFormatBench.toHexUpperCached        512  avgt   15  0.493 ?  0.001  us/op (+102.64)

3. Mac Book Pro M1 Pro

-Benchmark                           (size)  Mode  Cnt  Score    Error  Units (baseline)
-HexFormatBench.appenderLower           512  avgt   15  2.867 ?  0.035  us/op
-HexFormatBench.appenderLowerCached     512  avgt   15  1.656 ?  0.875  us/op
-HexFormatBench.appenderUpper           512  avgt   15  2.813 ?  0.085  us/op
-HexFormatBench.appenderUpperCached     512  avgt   15  1.575 ?  0.901  us/op
-HexFormatBench.formatLower             512  avgt   15  0.385 ?  0.001  us/op
-HexFormatBench.formatLowerCached       512  avgt   15  0.385 ?  0.002  us/op
-HexFormatBench.formatUpper             512  avgt   15  0.385 ?  0.001  us/op
-HexFormatBench.formatUpperCached       512  avgt   15  0.384 ?  0.001  us/op
-HexFormatBench.toHexDigitsByte         512  avgt   15  1.688 ?  0.009  us/op
-HexFormatBench.toHexDigitsInt          512  avgt   15  2.991 ?  0.015  us/op
-HexFormatBench.toHexDigitsLong         512  avgt   15  3.719 ?  0.081  us/op
-HexFormatBench.toHexDigitsShort        512  avgt   15  1.868 ?  0.010  us/op
-HexFormatBench.toHexLower              512  avgt   15  0.321 ?  0.001  us/op
-HexFormatBench.toHexLowerCached        512  avgt   15  0.322 ?  0.001  us/op
-HexFormatBench.toHexUpper              512  avgt   15  0.324 ?  0.001  us/op
-HexFormatBench.toHexUpperCached        512  avgt   15  0.325 ?  0.001  us/op

+Benchmark                           (size)  Mode  Cnt  Score    Error  Units (optimized)
+HexFormatBench.appenderLower           512  avgt   15  0.207 ?  0.001  us/op (+1285.03)
+HexFormatBench.appenderLowerCached     512  avgt   15  0.206 ?  0.001  us/op (+703.89)
+HexFormatBench.appenderUpper           512  avgt   15  0.225 ?  0.001  us/op (+1150.23)
+HexFormatBench.appenderUpperCached     512  avgt   15  0.225 ?  0.001  us/op (+600.00)
+HexFormatBench.formatLower             512  avgt   15  0.211 ?  0.003  us/op (+82.47)
+HexFormatBench.formatLowerCached       512  avgt   15  0.186 ?  0.001  us/op (+106.99)
+HexFormatBench.formatUpper             512  avgt   15  0.312 ?  0.001  us/op (+23.40)
+HexFormatBench.formatUpperCached       512  avgt   15  0.344 ?  0.001  us/op (+11.63)
+HexFormatBench.toHexDigitsByte         512  avgt   15  1.718 ?  0.054  us/op (-1.75)
+HexFormatBench.toHexDigitsInt          512  avgt   15  2.255 ?  0.010  us/op (+32.64)
+HexFormatBench.toHexDigitsLong         512  avgt   15  3.764 ?  0.005  us/op (-1.20)
+HexFormatBench.toHexDigitsShort        512  avgt   15  1.858 ?  0.008  us/op (+0.54)
+HexFormatBench.toHexLower              512  avgt   15  0.289 ?  0.004  us/op (+11.08)
+HexFormatBench.toHexLowerCached        512  avgt   15  0.295 ?  0.001  us/op (+9.16)
+HexFormatBench.toHexUpper              512  avgt   15  0.288 ?  0.001  us/op (+12.50)
+HexFormatBench.toHexUpperCached        512  avgt   15  0.297 ?  0.005  us/op (+9.43)

@wenshao
Copy link
Contributor Author

wenshao commented Sep 16, 2023

/label remove security

@openjdk openjdk bot removed the security security-dev@openjdk.org label Sep 16, 2023
@openjdk
Copy link

openjdk bot commented Sep 16, 2023

@wenshao
The security label was successfully removed.

@wenshao
Copy link
Contributor Author

wenshao commented Sep 17, 2023

@cl4es Can you help me create an issue for this PR?

for (int i = 1; i < length; i++) {
out.append(suffix);
out.append(delimiter);
out.append(prefix);
toHexDigits(out, bytes[fromIndex + i]);
}
out.append(suffix);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe change this whole else block to

                    for (int i = 0; i < length; i++) {
                        if (i > 0)
                            out.append(delimiter);
                        out.append(prefix);
                        toHexDigits(out, bytes[fromIndex + i]);
                        out.append(suffix);
                    }

for clarity?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original (and current) is coded to avoid a condition inside the loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that the way of writing for_0 combined with if > 0 is easier to understand, The operation overhead of if > 0 is very small, and it will not affect performance when used in a loop.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The coding pattern to handle first and last special cases outside the loop is also common and may help the JIT with loop unrolling and other optimizations. (And yes JITs can recognize all kinds of code and optimize it anyway).

return (char)('a' - 10 + value);
}
return (char)('A' - 10 + value);
return (char) ((value < 10 ? '0' : hexBase) + value);
Copy link
Member

@liach liach Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes seem to be the reason toHexDigitsLong consistently sees a slight performance drop. Can any professional engineer explain why the old version, which would be like (char) (value < 10 ? '0' + value : hexBase + value), is slightly faster?

Since the effect of this change hexBase is not clear, I recommend dropping this for now (remove the field and changes to the 2 methods).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please drop this. Possibly re-attempting it in a future PR, though I'm a bit skeptical of the potential win here.

I did test something like hexBase + value in #15591 to a similar effect, so opted for just storing digitCase. My best guess as to why are that the JIT is simply better at (or more eager to) eliminating untaken paths than it is at constant fold the field load. Or the branch isn't eliminated and we're seeing the effect of branch prediction. I'd recommend analyzing the asm generated (via e.g. -prof perfasm) to understand this in depth

@liach
Copy link
Member

liach commented Sep 18, 2023

I have created a JBS issue including the 2 confirmed performance improvements in this patch: https://bugs.openjdk.org/browse/JDK-8316426

@wenshao wenshao changed the title optimization for HexFormat.formatHex 8316426: optimization for HexFormat.formatHex Sep 18, 2023
@wenshao wenshao changed the title 8316426: optimization for HexFormat.formatHex 8316426: Optimization for HexFormat.formatHex Sep 18, 2023
@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 18, 2023
@mlbridge
Copy link

mlbridge bot commented Sep 18, 2023

@RogerRiggs
Copy link
Contributor

HexDecimal#DIGITS is a table with a size of 512 bytes. I think that in such a table, when it needs to be used continuously, it is worthwhile to perform table lookup operations.

HexFormat was designed for presenting bytes to humans, typical line lengths for the useful output are 80-130 characters, so input sizes of 40-65 bytes, probably less when other interpretations (such as latin1 or ascii) are appended.
And the output is not typically performance sensitive. Please keep the code small and easy to maintain.

public static short digitPair(int i, boolean ucase) {
short v = DIGITS[i & 0xff];
return ucase
? (short) (v & ~((v & 0b0100_0000_0100_0000) >> 1)) // really: to uppper
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clever but perhaps ('a' - 'A') to make it clearer where the constant came from

for (int i = 1; i < length; i++) {
out.append(suffix);
out.append(delimiter);
out.append(prefix);
toHexDigits(out, bytes[fromIndex + i]);
}
out.append(suffix);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original (and current) is coded to avoid a condition inside the loop.

@RogerRiggs
Copy link
Contributor

ok, but perhaps you can shrink it further, if it does not hurt performance, by subtracting 0x20 before indexing and cut the table to 64 bytes.

@openjdk openjdk bot removed the sponsor Pull request is ready to be sponsored label Sep 27, 2023
@wenshao
Copy link
Contributor Author

wenshao commented Sep 27, 2023

ok, but perhaps you can shrink it further, if it does not hurt performance, by subtracting 0x20 before indexing and cut the table to 64 bytes.

If you use DIGITS of size 54, performance will be 10% slower, The code is written like this:

public final class HexFormat {
    private static final byte[] DIGITS = {
             0,  1,  2,  3,  4,  5,  6,  7,  8,  9, -1, -1, -1, -1, -1, -1,
            -1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1,
            -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
            -1, 10, 11, 12, 13, 14, 15
           // remove 128 - 255
    };

    public static boolean isHexDigit(int ch) {
        // unsigned right shift 8 change to 7
        return (ch >= '0' && ch <= 'f' && DIGITS[ch - '0'] >= 0);
    }

    public static int fromHexDigit(int ch) {
        int value;
        // unsigned right shift 8 change to 7
        if (ch >= '0' && ch <= 'f' && (value = DIGITS[ch - '0']) >= 0) 
        ...
    }
}

@wenshao
Copy link
Contributor Author

wenshao commented Sep 27, 2023

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Sep 27, 2023
@openjdk
Copy link

openjdk bot commented Sep 27, 2023

@wenshao
Your change (at version 2ed9ac3) is now ready to be sponsored by a Committer.

@RogerRiggs
Copy link
Contributor

It is conventional to get re-reviews of any change and give the reviewers that taken the time to review the PR a chance to review any subsequent changes.

@openjdk openjdk bot removed the sponsor Pull request is ready to be sponsored label Sep 29, 2023
@wenshao
Copy link
Contributor Author

wenshao commented Sep 29, 2023

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Sep 29, 2023
@openjdk
Copy link

openjdk bot commented Sep 29, 2023

@wenshao
Your change (at version 1463863) is now ready to be sponsored by a Committer.

@wenshao
Copy link
Contributor Author

wenshao commented Oct 6, 2023

@RogerRiggs Can this PR be added to /sponsor now?

@cl4es
Copy link
Member

cl4es commented Oct 6, 2023

FWIW I'll not review or sponsor any PRs that use ByteArrayLittleEndian for trivial byte[] writes until there's been a thorough analysis of why this helps and shown that JITs can't be expected to generate code that is as optimal.

@openjdk openjdk bot removed the sponsor Pull request is ready to be sponsored label Oct 6, 2023
@wenshao
Copy link
Contributor Author

wenshao commented Oct 6, 2023

FWIW I'll not review or sponsor any PRs that use ByteArrayLittleEndian for trivial byte[] writes until there's been a thorough analysis of why this helps and shown that JITs can't be expected to generate code that is as optimal.

Thanks, I received your opinion and I have removed ByteArrayLittleEndian.

Such optimization should be done by JIT. It seems that this optimization is called SuperWord Level Parallelism (SLP). Currently, HotSpot C2 has only completed the loop version and not the linear version. Improvements require JIT experts to submit patches.

@wenshao
Copy link
Contributor Author

wenshao commented Oct 9, 2023

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Oct 9, 2023
@openjdk
Copy link

openjdk bot commented Oct 9, 2023

@wenshao
Your change (at version f7d832f) is now ready to be sponsored by a Committer.

@wenshao
Copy link
Contributor Author

wenshao commented Oct 12, 2023

@cl4es Can it be integrated?

@cl4es
Copy link
Member

cl4es commented Oct 12, 2023

Yes, I think this looks fine. Thank you for your patience.

/sponsor

@openjdk
Copy link

openjdk bot commented Oct 12, 2023

Going to push as commit 9355431.
Since your change was applied there have been 360 commits pushed to the master branch:

  • 32ccf01: 8317772: NMT: Make peak values available in release builds
  • 4c79e7d: 8170817: G1: Returning MinTLABSize from unsafe_max_tlab_alloc causes TLAB flapping
  • 7633a76: 8317998: Temporarily disable malformed control flow assert to reduce noise in testing
  • 00ef9f9: 8316947: Write a test to check textArea triggers MouseEntered/MouseExited events properly
  • 77dc891: 8317963: Serial: Remove unused GenerationIsInReservedClosure
  • d95b548: 8315850: Improve AbstractMap anonymous Iterator classes
  • 424de29: 8317866: replace NET_SocketAvailable
  • 6d6c900: 8038244: (fs) Check return value of malloc in Java_sun_nio_fs_AixNativeDispatcher_getmntctl()
  • eca6ea4: 8317873: Add @sealedGraph to IllegalFormatException
  • 2edf9c3: 8317763: Follow-up to AVX512 intrinsics for Arrays.sort() PR
  • ... and 350 more: https://git.openjdk.org/jdk/compare/3abe7982bfbc5787962863f8604ddecadf770b74...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Oct 12, 2023
@openjdk openjdk bot closed this Oct 12, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Oct 12, 2023
@openjdk
Copy link

openjdk bot commented Oct 12, 2023

@cl4es @wenshao Pushed as commit 9355431.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-libs core-libs-dev@openjdk.org integrated Pull request has been integrated
4 participants