Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8265237: String.join and StringJoiner can be improved further #3501

Closed
wants to merge 5 commits into from

Conversation

@plevart
Copy link
Contributor

@plevart plevart commented Apr 14, 2021

While JDK-8148937 improved StringJoiner class by replacing internal use of getChars that copies out characters from String elements into a char[] array with StringBuilder which is somehow more optimal, the improvement was marginal in speed (0% ... 10%) and mainly for smaller strings, while GC was reduced by about 50% in average per operation.
Initial attempt to tackle that issue was more involved, but was later discarded because it was apparently using too much internal String details in code that lives outside String and outside java.lang package.
But there is another way to package such "intimate" code - we can put it into String itself and just call it from StringJoiner.
This PR is an attempt at doing just that. It introduces new package-private method in java.lang.String which is then used from both pubic static String.join methods as well as from java.util.StringJoiner (via SharedSecrets). The improvements can be seen by running the following JMH benchmark:

https://gist.github.com/plevart/86ac7fc6d4541dbc08256cde544019ce

The comparative results are here:

https://jmh.morethan.io/?gist=7eb421cf7982456a2962269137f71c15

The jmh-result.json files are here:

https://gist.github.com/plevart/7eb421cf7982456a2962269137f71c15

Improvement in speed ranges from 8% (for small strings) to 200% (for long strings), while creation of garbage has been further reduced to an almost garbage-free operation.

So WDYT?


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8265237: String.join and StringJoiner can be improved further

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/3501/head:pull/3501
$ git checkout pull/3501

Update a local copy of the PR:
$ git checkout pull/3501
$ git pull https://git.openjdk.java.net/jdk pull/3501/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 3501

View PR using the GUI difftool:
$ git pr show -t 3501

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/3501.diff

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Apr 14, 2021

👋 Welcome back plevart! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr label Apr 14, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Apr 14, 2021

@plevart The following labels will be automatically applied to this pull request:

  • core-libs
  • security

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@mlbridge
Copy link

@mlbridge mlbridge bot commented Apr 14, 2021

@plevart
Copy link
Contributor Author

@plevart plevart commented Apr 14, 2021

Some background: This change was motivated by Sergey Tsypanov's attempt to replace usage of StringBuilder with string concatenation in JDK-8265075 only to find out that string concatenation in java.base module is compiled down to inline usage of StringBuilder, so no improvement was possible.
StringJoiner API and String.join static utility methods lend itself to a better implementation, but in last incarnation they are implemented with StringBuilder internally: String.join -> StringJoiner -> StringBuilder. A lot of JDK internal usages of StringBuilder were already replaced with StringJoiner (JDK-8054714) but under the hood the StringBuilder is still used. There were also lots of incremental attempts to improve StringJoiner. I think this one is the last one for some time to come... :-)

}

/**
* Designated join routine.

This comment has been minimized.

@efge

efge Apr 14, 2021

Did you mean "dedicated"?

This comment has been minimized.

@plevart

plevart Apr 14, 2021
Author Contributor

No, I meant designated. It is the routine that all other public API entry points call at the end to do the job. Would some other word more accurately describe that? I definitely didn't mean "dedicated".

This comment has been minimized.

@efge

efge Apr 15, 2021

Oh then sorry, I thought it was a typo of some sort. I'd have said something like "Centralized join logic". But whatever works for you.

@cl4es
Copy link
Member

@cl4es cl4es commented Apr 14, 2021

I think this seems reasonable. The String encoding details doesn't leak outside of java.lang, and numbers do look convincing.

There's a StringJoinerBenchmark micro added by JDK-8148937 which could perhaps be expanded with the scenarios you've experimented with here?

@plevart
Copy link
Contributor Author

@plevart plevart commented Apr 15, 2021

There's a StringJoinerBenchmark micro added by JDK-8148937 which could perhaps be expanded with the scenarios you've experimented with here?

I modified that micro benchmark and added a method to also measure String.join static method along with StringJoiner for same parameters and extended the range of parameters to cover more diversity. The results are here:

https://jmh.morethan.io/?gist=c38cc13d63774ec505cc8d394c00d502

It is apparent that there is a huge speedup when strings are bigger. But even smaller strings get a substantial speedup. There's also substantial reduction of garbage per operation. Previously the garbage amounted to the internal array of String elements and the StringBuffer with its internal byte[] array of characters. Now only the array of elements is the garbage.

@AlanBateman
Copy link
Contributor

@AlanBateman AlanBateman commented Apr 16, 2021

/label remove security

@openjdk openjdk bot removed the security label Apr 16, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Apr 16, 2021

@AlanBateman
The security label was successfully removed.

Copy link
Contributor

@RogerRiggs RogerRiggs left a comment

Look very good.

byte[] value = StringConcatHelper.newArray(((long) icoder << 32) | llen);
int off = 0;
prefix.getBytes(value, off, coder); off += prefix.length();
for (int i = 0; i < size; i++) {

This comment has been minimized.

@RogerRiggs

RogerRiggs Apr 16, 2021
Contributor

Can you save a branch inside the loop by handling element 0 outside the loop and
then do the loop for the rest?

This comment has been minimized.

@plevart

plevart Apr 16, 2021
Author Contributor

Thanks, I'll do that and then re-test to see if there's any improvement.

return joiner.toString();

byte[] value = StringConcatHelper.newArray(((long) icoder << 32) | llen);
int off = 0;

This comment has been minimized.

@RogerRiggs

RogerRiggs Apr 16, 2021
Contributor

StringConcatHelper.newArray() can double the length (based on the coder) and it is then truncated to 32 bits when passed to UNSAFE.allocatlUnitializedArray.
The test of length above only ensures llen can be truncated to 32 bits without loss of data.

This comment has been minimized.

@plevart

plevart Apr 16, 2021
Author Contributor

I thought about that, yes. And I think we have to do the check for the doubled length before calling the newArray. I checked the StringJoinerTest and it only deals with ascii strings unfortunately. Will have to add a test for that too...

This comment has been minimized.

@plevart

plevart Apr 18, 2021
Author Contributor

I do the checks before calling StringConcatHelper.newArray() now and pass a long value to it that already holds the number of bytes needed and where the upper 32 bits (coder) is always 0.

@amaembo
Copy link
Contributor

@amaembo amaembo commented Apr 17, 2021

Great work, thanks! I tried to do something similar a couple of years ago but did not submit my patch. One thing that stopped me is that joining many one-character strings was slower with my patch, compared to the baseline. Something like this:

// setup
String[] strings = new String[100];
Arrays.fill(strings, "a");
// benchmark
return String.join(",", strings);

Could you add this case to your benchmark, for completeness?

@plevart
Copy link
Contributor Author

@plevart plevart commented Apr 18, 2021

So here's another iteration where I have taken into account all the suggestions:

  • avoid the branch in the append loop by appending the 1st element outside the loop so that delimiter can unconditionally be appended inside the loop for rest of the elements.
  • fixed overflow checking logic + added a test for it - previously it was possible to provoke wrong exception: IllegalArgumentException: size is negative. I added another test instead of extending existing one because it needs a 1GiB string and together with 2GiB string in existing test it would not fit in 4GiB heap any more when running with unpatched version. I wanted the tests to pass with unpatched version too.
  • extended the parameters of JMH benchmark to include 1-character strings + re-run the benchmarks

The results are not that different from previous version (that used a conditional in the loop) as JIT is probably smart enough to unroll the loop and divide it into at least two parts (the way I did it manually now). What brings the most improvement in this version is the annotation to force in-lining of the String.join package-private method. Without it, it would happen that for one combination of JMH parameters, the results were a little worse compared to unpatched baseline - not for much: about 7%, but with the annotation, all combinations of parameters bring consistent improvement in range from 12% to 102%:

https://jmh.morethan.io/?gist=babf00a655d21ff784ede887af53ec31

@cl4es
cl4es approved these changes Apr 19, 2021
Copy link
Member

@cl4es cl4es left a comment

Looks good!

@Fork(jvmArgsAppend = {"-Xms2g", "-Xmx2g"})
@Warmup(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Fork(value = 1, jvmArgsAppend = {"-Xms1g", "-Xmx1g"})

This comment has been minimized.

@cl4es

cl4es Apr 19, 2021
Member

In general I think it's prudent to default to at least 3 forks, to catch issues with run-to-run variance.

This comment has been minimized.

@plevart

plevart Apr 19, 2021
Author Contributor

Thanks, Claes. Will change that to 3 forks before pushing.
Any more comments? Will wait for a day, then push it.

@openjdk
Copy link

@openjdk openjdk bot commented Apr 19, 2021

@plevart This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8265237: String.join and StringJoiner can be improved further

Reviewed-by: rriggs, redestad

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 105 new commits pushed to the master branch:

  • f1d4ae6: 8263718: unused-result warning happens at os_linux.cpp
  • 787908c: 8264221: Rewrite confusing stream API chain in SnippetMaps
  • 142edd3: 8265152: jpackage cleanup fails on Windows with IOException deleting msi
  • ab22407: 8265486: ProblemList javax/sound/midi/Sequencer/Recording.java on macosx-aarch64
  • e0fd5fc: 8265028: JDWP debug agent thread lookup can be made faster
  • 713483c: 8265373: Change to GCC 10.3 for building on Linux at Oracle
  • 3990713: 8265463: ProblemList vmTestbase/vm/mlvm/mixed/stress/regression/b6969574/INDIFY_Test.java on Win-X64 -Xcomp
  • 5b43b39: 8263154: [macos] DMG builds have finder errors
  • 54cb388: 8252600: [JVMCI] remove mx configuration
  • b703e0a: 8264569: Remove obsolete error messages from CDSTestUtils.java
  • ... and 95 more: https://git.openjdk.java.net/jdk/compare/e80012ede3564a868abc1705f332bcee942baf48...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Apr 19, 2021
Copy link
Contributor

@RogerRiggs RogerRiggs left a comment

Looks good.

@plevart
Copy link
Contributor Author

@plevart plevart commented Apr 21, 2021

/integrate

@openjdk openjdk bot closed this Apr 21, 2021
@openjdk openjdk bot added integrated and removed ready rfr labels Apr 21, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Apr 21, 2021

@plevart Since your change was applied there have been 134 commits pushed to the master branch:

  • ed477da: 8264945: Optimize the code-gen for Math.pow(x, 0.5)
  • 7146104: 8265421: java/lang/String/StringRepeat.java test is missing a memory requirement
  • b5c92ca: 8265106: IGV: Enforce en-US locale while parsing ideal graph
  • 3de0dcb: 8265483: All-caps “JAVA” in the top navigation bar
  • 739769c: 8265411: Avoid unnecessary Method::init_intrinsic_id calls
  • a22ad03: 8264983: Add gtest for JDK-8264008
  • 91b08b7: 8261779: JCK test api/javax_crypto/EncryptedPrivateKeyInfo/Ctor4.html is failing with assertion error when assertions enabled
  • 2fcd920: 8261183: Follow on to Make lists of normal filenames
  • 40ef00c: 8258457: testlibrary_tests/ctw/JarDirTest.java fails with InvalidPathException on windows
  • 3f0da35: 8261392: Exclude testlibrary_tests/ctw/JarDirTest.java
  • ... and 124 more: https://git.openjdk.java.net/jdk/compare/e80012ede3564a868abc1705f332bcee942baf48...master

Your commit was automatically rebased without conflicts.

Pushed as commit 98cb81b.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
6 participants