Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8259699: Reduce char[] copying in URLEncoder.encode(String, Charset) #1598

Closed
wants to merge 5 commits into from

Conversation

@stsypanov
Copy link
Contributor

@stsypanov stsypanov commented Dec 3, 2020

Instead of allocating a copy of underlying array via CharArrayWriter.toCharArray() and passing it to constructor of String

String str = new String(charArrayWriter.toCharArray());

we could call toString() method

String str = charArrayWriter.toString();

decoding existing char[] without making a copy. This slightly speeds up the method reducing at the same time memory consumption for decoding URLs with non-latin symbols:

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(jvmArgsAppend = {"-Xms2g", "-Xmx2g"})
public class UrlEncoderBenchmark {
  private static final Charset charset = Charset.defaultCharset();
  private static final String utf8Url = "https://ru.wikipedia.org/wiki/Организация_Объединённых_Наций"; // UN

  @Benchmark
  public String encodeUtf8() {
    return URLEncoder.encode(utf8Url, charset);
  }
}

The benchmark on my maching give the following output:

before
Benchmark                                                        Mode  Cnt     Score    Error   Units
UrlEncoderBenchmark.encodeUtf8                                   avgt  100  1166.378 ±  8.411   ns/op
UrlEncoderBenchmark.encodeUtf8:·gc.alloc.rate                    avgt  100   932.944 ±  6.393  MB/sec
UrlEncoderBenchmark.encodeUtf8:·gc.alloc.rate.norm               avgt  100  1712.193 ±  0.005    B/op
UrlEncoderBenchmark.encodeUtf8:·gc.churn.G1_Eden_Space           avgt  100   929.221 ± 24.268  MB/sec
UrlEncoderBenchmark.encodeUtf8:·gc.churn.G1_Eden_Space.norm      avgt  100  1705.444 ± 43.235    B/op
UrlEncoderBenchmark.encodeUtf8:·gc.churn.G1_Survivor_Space       avgt  100     0.006 ±  0.001  MB/sec
UrlEncoderBenchmark.encodeUtf8:·gc.churn.G1_Survivor_Space.norm  avgt  100     0.011 ±  0.002    B/op
UrlEncoderBenchmark.encodeUtf8:·gc.count                         avgt  100   652.000           counts
UrlEncoderBenchmark.encodeUtf8:·gc.time                          avgt  100   334.000               ms

after
Benchmark                                                        Mode  Cnt     Score    Error   Units
UrlEncoderBenchmark.encodeUtf8                                   avgt  100  1058.851 ±  6.006   ns/op
UrlEncoderBenchmark.encodeUtf8:·gc.alloc.rate                    avgt  100   931.489 ±  5.182  MB/sec
UrlEncoderBenchmark.encodeUtf8:·gc.alloc.rate.norm               avgt  100  1552.176 ±  0.005    B/op
UrlEncoderBenchmark.encodeUtf8:·gc.churn.G1_Eden_Space           avgt  100   933.491 ± 24.164  MB/sec
UrlEncoderBenchmark.encodeUtf8:·gc.churn.G1_Eden_Space.norm      avgt  100  1555.488 ± 39.204    B/op
UrlEncoderBenchmark.encodeUtf8:·gc.churn.G1_Survivor_Space       avgt  100     0.006 ±  0.001  MB/sec
UrlEncoderBenchmark.encodeUtf8:·gc.churn.G1_Survivor_Space.norm  avgt  100     0.010 ±  0.002    B/op
UrlEncoderBenchmark.encodeUtf8:·gc.count                         avgt  100   655.000           counts
UrlEncoderBenchmark.encodeUtf8:·gc.time                          avgt  100   333.000               ms

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8259699: Reduce char[] copying in URLEncoder.encode(String, Charset)

Reviewers

Download

$ git fetch https://git.openjdk.java.net/jdk pull/1598/head:pull/1598
$ git checkout pull/1598

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Dec 3, 2020

👋 Welcome back stsypanov! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

Loading

@openjdk
Copy link

@openjdk openjdk bot commented Dec 3, 2020

@stsypanov The following label will be automatically applied to this pull request:

  • net

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

Loading

@openjdk openjdk bot added the net label Dec 3, 2020
@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Dec 31, 2020

@stsypanov This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

Loading

Copy link
Contributor

@szegedi szegedi left a comment

Looks good!

Loading

@stsypanov
Copy link
Contributor Author

@stsypanov stsypanov commented Jan 11, 2021

@szegedi could you please create a ticket for this change? Otherwise I cannot merge it

Loading

@szegedi
Copy link
Contributor

@szegedi szegedi commented Jan 13, 2021

Filed a ticket, you should rename this PR to "8259699: Reduce char[] copying in URLEncoder.encode(String, Charset)"

Loading

@stsypanov stsypanov changed the title Improve URLEncoder.encode(String, Charset) 8259699: Reduce char[] copying in URLEncoder.encode(String, Charset) Jan 13, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Jan 13, 2021

⚠️ @stsypanov the full name on your profile does not match the author name in this pull requests' HEAD commit. If this pull request gets integrated then the author name from this pull requests' HEAD commit will be used for the resulting commit. If you wish to push a new commit with a different author name, then please run the following commands in a local repository of your personal fork:

$ git checkout enc
$ git commit -c user.name='Preferred Full Name' --allow-empty -m 'Update full name'
$ git push

Loading

@openjdk
Copy link

@openjdk openjdk bot commented Jan 13, 2021

@stsypanov This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8259699: Reduce char[] copying in URLEncoder.encode(String, Charset)

Reviewed-by: attila, redestad, chegar

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 2 new commits pushed to the master branch:

  • ff3e6e4: 8259773: Incorrect encoding of AVX-512 kmovq instruction
  • b8ef2ba: 8259563: The CPU model name is printed multiple times when using -Xlog:os+cpu

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@szegedi, @cl4es, @ChrisHegarty) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

Loading

@mlbridge
Copy link

@mlbridge mlbridge bot commented Jan 13, 2021

Webrevs

Loading

@stsypanov
Copy link
Contributor Author

@stsypanov stsypanov commented Jan 13, 2021

/integrate

Loading

@openjdk openjdk bot added the sponsor label Jan 13, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Jan 13, 2021

@stsypanov
Your change (at version 2856c92) is now ready to be sponsored by a Committer.

Loading

cl4es
cl4es approved these changes Jan 13, 2021
Copy link
Member

@cl4es cl4es left a comment

Looks good.

I wonder... CharArrayWriter is an old and synchronized data structure, and since the instance used here isn't shared that synchronization seem useless. And since you're now bypassing the char[] and going straight for a String you might get better performance with a StringBuilder here? (setLength(0) instead of reset()...)

Loading

@stsypanov
Copy link
Contributor Author

@stsypanov stsypanov commented Jan 13, 2021

@cl4es hi, let me try StringBuilder

Loading

@stsypanov
Copy link
Contributor Author

@stsypanov stsypanov commented Jan 14, 2021

@cl4es SB brings pessimization both for time and memory, try org.openjdk.bench.java.net.URLEncodeDecode:

master
                                                 (count)  (maxLength)  (mySeed)  Mode  Cnt         Score       Error   Units
testEncodeUTF8                                      1024         1024         3  avgt   25         8.573 ?     0.023   ms/op
testEncodeUTF8:?gc.alloc.rate                       1024         1024         3  avgt   25      1202.896 ?     3.225  MB/sec
testEncodeUTF8:?gc.alloc.rate.norm                  1024         1024         3  avgt   25  11355727.904 ?   196.249    B/op
testEncodeUTF8:?gc.churn.G1_Eden_Space              1024         1024         3  avgt   25      1203.785 ?     6.240  MB/sec
testEncodeUTF8:?gc.churn.G1_Eden_Space.norm         1024         1024         3  avgt   25  11364143.637 ? 52830.222    B/op
testEncodeUTF8:?gc.churn.G1_Survivor_Space          1024         1024         3  avgt   25         0.008 ?     0.001  MB/sec
testEncodeUTF8:?gc.churn.G1_Survivor_Space.norm     1024         1024         3  avgt   25        77.088 ?     9.303    B/op
testEncodeUTF8:?gc.count                            1024         1024         3  avgt   25      1973.000              counts
testEncodeUTF8:?gc.time                             1024         1024         3  avgt   25       996.000                  ms

enc
                                                 (count)  (maxLength)  (mySeed)  Mode  Cnt        Score       Error   Units
testEncodeUTF8                                      1024         1024         3  avgt   25        7.931 ?     0.006   ms/op
testEncodeUTF8:?gc.alloc.rate                       1024         1024         3  avgt   25      965.347 ?     0.736  MB/sec
testEncodeUTF8:?gc.alloc.rate.norm                  1024         1024         3  avgt   25  8430590.163 ?     7.213    B/op
testEncodeUTF8:?gc.churn.G1_Eden_Space              1024         1024         3  avgt   25      966.373 ?     5.248  MB/sec
testEncodeUTF8:?gc.churn.G1_Eden_Space.norm         1024         1024         3  avgt   25  8439563.689 ? 47282.178    B/op
testEncodeUTF8:?gc.churn.G1_Survivor_Space          1024         1024         3  avgt   25        0.007 ?     0.001  MB/sec
testEncodeUTF8:?gc.churn.G1_Survivor_Space.norm     1024         1024         3  avgt   25       60.949 ?     8.405    B/op
testEncodeUTF8:?gc.count                            1024         1024         3  avgt   25     1715.000              counts
testEncodeUTF8:?gc.time                             1024         1024         3  avgt   25      888.000                  ms

stringBuilder
                                                 (count)  (maxLength)  (mySeed)  Mode  Cnt         Score       Error   Units
testEncodeUTF8                                      1024         1024         3  avgt   25         8.115 ?     0.110   ms/op
testEncodeUTF8:?gc.alloc.rate                       1024         1024         3  avgt   25      1259.267 ?    16.716  MB/sec
testEncodeUTF8:?gc.alloc.rate.norm                  1024         1024         3  avgt   25  11249391.875 ?     6.552    B/op
testEncodeUTF8:?gc.churn.G1_Eden_Space              1024         1024         3  avgt   25      1259.937 ?    17.232  MB/sec
testEncodeUTF8:?gc.churn.G1_Eden_Space.norm         1024         1024         3  avgt   25  11255413.875 ? 43636.143    B/op
testEncodeUTF8:?gc.churn.G1_Survivor_Space          1024         1024         3  avgt   25         0.007 ?     0.001  MB/sec
testEncodeUTF8:?gc.churn.G1_Survivor_Space.norm     1024         1024         3  avgt   25        59.461 ?     9.087    B/op
testEncodeUTF8:?gc.count                            1024         1024         3  avgt   25      2236.000              counts
testEncodeUTF8:?gc.time                             1024         1024         3  avgt   25      1089.000                  ms

The reason seems to be single char StringBuilder.append() that apart from range check does encoding check and stores char as two bytes in byte[] in ASB

Loading

@openjdk openjdk bot removed the sponsor label Jan 14, 2021
@cl4es
Copy link
Member

@cl4es cl4es commented Jan 14, 2021

Surprising, but thanks for checking!

Loading

@cl4es
Copy link
Member

@cl4es cl4es commented Jan 14, 2021

/sponsor

Loading

@openjdk
Copy link

@openjdk openjdk bot commented Jan 14, 2021

@cl4es The PR has been updated since the change author (@stsypanov) issued the integrate command - the author must perform this command again.

Loading

@cl4es
Copy link
Member

@cl4es cl4es commented Jan 14, 2021

No need to merge in changes in master unless there are conflicts or you want to do more changes of your own, they will be merged automatically on integration. But it seems the bots need you to do /integrate again

Loading

@stsypanov
Copy link
Contributor Author

@stsypanov stsypanov commented Jan 14, 2021

/integrate

Loading

@openjdk openjdk bot added the sponsor label Jan 14, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Jan 14, 2021

@stsypanov
Your change (at version 2183af4) is now ready to be sponsored by a Committer.

Loading

@stsypanov
Copy link
Contributor Author

@stsypanov stsypanov commented Jan 14, 2021

@cl4es done

Loading

@cl4es
Copy link
Member

@cl4es cl4es commented Jan 14, 2021

/sponsor

Loading

@openjdk
Copy link

@openjdk openjdk bot commented Jan 14, 2021

@cl4es @stsypanov Since your change was applied there have been 2 commits pushed to the master branch:

  • ff3e6e4: 8259773: Incorrect encoding of AVX-512 kmovq instruction
  • b8ef2ba: 8259563: The CPU model name is printed multiple times when using -Xlog:os+cpu

Your commit was automatically rebased without conflicts.

Pushed as commit c822eda.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Loading

@stsypanov stsypanov deleted the enc branch Jan 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
4 participants