8314774: Optimize URLEncoder #15354

Closed
wants to merge 23 commits into from

Glavo
Contributor

@Glavo Glavo commented Aug 19, 2023

I mainly made these optimizations:

  • Avoid allocating StringBuilder when there are no characters in the URL that need to be encoded;
  • Implement a fast path for UTF-8. (Has been removed from this PR)

In addition to improving performance, these optimizations also reduce temporary objects:

  • It no longer allocates any object when there are no characters in the URL that need to be encoded;
  • The initial size of StringBuilder is larger to avoid expansion as much as possible;
  • For UTF-8, the temporary CharArrayWriter, strings and byte arrays are no longer needed. (Has been removed from this PR)

I also updated the tests to add more test cases.
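The no-allocation fast path from the first bullet can be pictured with a minimal sketch. This is not the code from this PR: `FastPathSketch`, `isUnreserved`, and `firstIndexNeedingEncoding` are hypothetical names, the safe set is the one listed in the `URLEncoder` documentation (ASCII letters, digits, `-`, `_`, `.`, `*`), and the slow path simply delegates to the existing `URLEncoder.encode(String, Charset)`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the implementation in this PR.
final class FastPathSketch {
    // Characters that URLEncoder is documented to leave unchanged.
    private static boolean isUnreserved(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9')
            || c == '-' || c == '_' || c == '.' || c == '*';
    }

    // Index of the first char that must be rewritten (space included), or -1.
    static int firstIndexNeedingEncoding(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!isUnreserved(s.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    static String encode(String s) {
        if (firstIndexNeedingEncoding(s) < 0) {
            return s; // fast path: same instance, nothing allocated
        }
        // Slow path: delegate to the existing encoder (assumption of this sketch).
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}
```

With this shape, a call such as `FastPathSketch.encode("abc-123")` returns the argument itself, and only inputs containing a space or another character outside the safe set reach the allocating path.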


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/15354/head:pull/15354
$ git checkout pull/15354

Update a local copy of the PR:
$ git checkout pull/15354
$ git pull https://git.openjdk.org/jdk.git pull/15354/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 15354

View PR using the GUI difftool:
$ git pr show -t 15354

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/15354.diff

@bridgekeeper

bridgekeeper bot commented Aug 19, 2023

👋 Welcome back Glavo! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Aug 19, 2023

@Glavo The following label will be automatically applied to this pull request:

  • net

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the net net-dev@openjdk.org label Aug 19, 2023
@caizixian
Member

Please use https://bugs.openjdk.org/browse/JDK-8314774

@Glavo Glavo changed the title Optimize URLEncoder 8314774: Optimize URLEncoder Aug 22, 2023
@openjdk openjdk bot added the rfr Pull request is ready for review label Aug 22, 2023
@mlbridge

mlbridge bot commented Aug 22, 2023

@jaikiran
Member

Hello Glavo, for changes like these, I think it would be more productive and useful to start a mailing list discussion first, to provide some context on why this change is needed and to gather input from people familiar with this code on whether the change is necessary and worth it. Such discussions will then give the Reviewers some context and input on what needs to be considered in these changes and to what extent the changes should be done in the code.

With the proposed changes in this PR which touches the character encoding handling and such, I think this will need a very thorough review keeping aside the performance aspects. I don't have enough experience of this class to know if it's worth doing this amount of change for any kind of performance improvements which may not be visible outside of micro benchmarks.

@Glavo
Contributor Author

Glavo commented Aug 23, 2023

> Hello Glavo, for changes like these, I think it would be more productive and useful to start a mailing list discussion first, to provide some context on why this change is needed and to gather input from people familiar with this code on whether the change is necessary and worth it. Such discussions will then give the Reviewers some context and input on what needs to be considered in these changes and to what extent the changes should be done in the code.

I see. Thank you for your suggestion.

> I don't have enough experience of this class to know if it's worth doing this amount of change for any kind of performance improvements which may not be visible outside of micro benchmarks.

I know it's usually not a performance bottleneck, so the main goal of this PR is to reduce temporary object allocations.

I noticed that this method is called quite frequently in our code, and it is also used in popular frameworks such as Spring. I want to reduce GC pressure by eliminating unnecessary temporary objects.

Since that method almost always uses UTF-8, I think it's worth providing a fast path for UTF-8. If it is too difficult to review, then I will try to optimize it in other ways.

@dfuch
Member

dfuch commented Aug 23, 2023

The fast path that just returns the given string if it is ASCII-only and needs no encoding looks simple enough. I don't particularly like the idea of embedding the logic of encoding UTF-8 into that class though, that increases the complexity significantly, and Charset encoders are there for that. Also I don't understand the reason for changing BitSet into a boolean array - that seems gratuitous?

@Glavo
Contributor Author

Glavo commented Aug 23, 2023

> I don't particularly like the idea of embedding the logic of encoding UTF-8 into that class though, that increases the complexity significantly, and Charset encoders are there for that.

Unfortunately, CharsetEncoder is too generic. Because we know the target encoding is UTF-8, implementing the encoding inline eliminates unnecessary temporary objects. Some places already do this, such as String.

I'm thinking we might be able to extract this logic into a static helper class.

public class UTF8EncodeUtils {
    // True for chars that encode to a single UTF-8 byte (U+0000..U+007F).
    public static boolean isSingleByte(char c) { return c < 0x80; }
    // True for chars below U+0800; meant to be checked after isSingleByte(c) is false.
    public static boolean isDoubleBytes(char c) { return c < 0x800; }

    // Two-byte form: 110xxxxx 10xxxxxx.
    public static byte[] encodeDoubleBytes(char c) {
        byte b0 = (byte) (0xc0 | (c >> 6));
        byte b1 = (byte) (0x80 | (c & 0x3f));
        return new byte[]{b0, b1};
    }

    // Three-byte form: 1110xxxx 10xxxxxx 10xxxxxx (c must not be a surrogate).
    public static byte[] encodeThreeBytes(char c) {
        byte b0 = (byte) (0xe0 | (c >> 12));
        byte b1 = (byte) (0x80 | ((c >> 6) & 0x3f));
        byte b2 = (byte) (0x80 | (c & 0x3f));
        return new byte[]{b0, b1, b2};
    }

    // Four-byte form for supplementary code points (U+10000..U+10FFFF).
    public static byte[] encodeCodePoint(int uc) {
        byte b0 = (byte) (0xf0 | (uc >> 18));
        byte b1 = (byte) (0x80 | ((uc >> 12) & 0x3f));
        byte b2 = (byte) (0x80 | ((uc >> 6) & 0x3f));
        byte b3 = (byte) (0x80 | (uc & 0x3f));
        return new byte[]{b0, b1, b2, b3};
    }
}

We can use this helper class to reimplement String and the UTF-8 CharsetEncoder (after we make sure it has no overhead), then use it to implement more UTF-8 fast paths.

I've also been doing some work on OutputStreamWriter recently. By implementing a fast path for UTF-8, there are over 20x speedups in some cases. I think maybe we can get exciting improvements in more places.
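To make the idea of an inline UTF-8 path concrete, here is a minimal sketch that percent-encodes each code point straight into a `StringBuilder`, with no intermediate `byte[]`, `CharArrayWriter`, or `CharsetEncoder`. It is an illustration under assumptions, not code from this PR or from the JDK: `Utf8PercentSketch`, `appendByte`, and `percentEncodeAll` are hypothetical names, every character (including safe ASCII) is escaped to keep the example short, and an unpaired surrogate is replaced by `?` to mirror what `String.getBytes(UTF_8)` does.

```java
// Hypothetical illustration only; not the implementation in this PR.
final class Utf8PercentSketch {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Appends one byte value (0..255) as %XX.
    private static void appendByte(StringBuilder out, int b) {
        out.append('%').append(HEX[(b >> 4) & 0xF]).append(HEX[b & 0xF]);
    }

    // Percent-encodes every code point of s as UTF-8 bytes.
    static String percentEncodeAll(String s) {
        StringBuilder out = new StringBuilder(s.length() * 3);
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                cp = '?'; // unpaired surrogate: substitute, as getBytes(UTF_8) would
            }
            if (cp < 0x80) {                       // 1 byte
                appendByte(out, cp);
            } else if (cp < 0x800) {               // 2 bytes
                appendByte(out, 0xC0 | (cp >> 6));
                appendByte(out, 0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {             // 3 bytes
                appendByte(out, 0xE0 | (cp >> 12));
                appendByte(out, 0x80 | ((cp >> 6) & 0x3F));
                appendByte(out, 0x80 | (cp & 0x3F));
            } else {                               // 4 bytes
                appendByte(out, 0xF0 | (cp >> 18));
                appendByte(out, 0x80 | ((cp >> 12) & 0x3F));
                appendByte(out, 0x80 | ((cp >> 6) & 0x3F));
                appendByte(out, 0x80 | (cp & 0x3F));
            }
        }
        return out.toString();
    }
}
```

The only object allocated per call here is the output `StringBuilder` (plus the final `String`), which is the property the comment above is after. The sketch says nothing about the speedups mentioned for OutputStreamWriter.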

@Glavo
Contributor Author

Glavo commented Aug 24, 2023

> Also I don't understand the reason for changing BitSet into a boolean array - that seems gratuitous?

I observed a throughput improvement of 7% to 10% after switching from BitSet to boolean[].
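For readers comparing the two representations, the sketch below builds the same "safe character" set both ways. It only shows that the lookups are interchangeable; it does not reproduce the measurement above. `LookupTableSketch` and its members are hypothetical names, and the safe set (including the space, which the reviewed code kept in DONT_NEED_ENCODING) is an assumption based on the `URLEncoder` documentation.

```java
import java.util.BitSet;

// Hypothetical illustration only; not the implementation in this PR.
final class LookupTableSketch {
    static final BitSet SAFE_BITS = new BitSet(128);
    static final boolean[] SAFE_TABLE = new boolean[128];

    static {
        String safe = "abcdefghijklmnopqrstuvwxyz"
                    + "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                    + "0123456789-_.* ";
        for (int i = 0; i < safe.length(); i++) {
            char c = safe.charAt(i);
            SAFE_BITS.set(c);
            SAFE_TABLE[c] = true;
        }
    }

    // BitSet lookup: a method call that locates a word and tests one bit.
    static boolean isSafeBitSet(char c) {
        return SAFE_BITS.get(c);
    }

    // Array lookup: one range check plus a direct array load.
    static boolean isSafeTable(char c) {
        return c < SAFE_TABLE.length && SAFE_TABLE[c];
    }
}
```

One plausible reason for a difference in throughput is that the array form does less work per lookup, at the cost of 128 bytes instead of 16 bytes of table data; whether that matters in practice depends on the JIT and the workload.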

@openjdk openjdk bot removed the rfr Pull request is ready for review label Aug 24, 2023
@openjdk openjdk bot added the rfr Pull request is ready for review label Aug 24, 2023
@Glavo
Contributor Author

Glavo commented Aug 24, 2023

I extracted the UTF-8 encoding logic into UTF8EncodeUtils and reran the benchmark:

Baseline:
Benchmark                                         (count)  (maxLength)  (mySeed)  Mode  Cnt        Score   Error   Units
URLEncodeDecode.testEncodeUTF8                        1024         1024         3  avgt   15        5.582 ± 0.009   ms/op
URLEncodeDecode.testEncodeUTF8:gc.alloc.rate          1024         1024         3  avgt   15     1439.974 ± 2.386  MB/sec
URLEncodeDecode.testEncodeUTF8:gc.alloc.rate.norm     1024         1024         3  avgt   15  8429374.434 ± 0.239    B/op
URLEncodeDecode.testEncodeUTF8:gc.count               1024         1024         3  avgt   15        6.000          counts
URLEncodeDecode.testEncodeUTF8:gc.time                1024         1024         3  avgt   15        9.000              ms

Inline UTF-8 encoding:
Benchmark                                         (count)  (maxLength)  (mySeed)  Mode  Cnt        Score       Error   Units
URLEncodeDecode.testEncodeUTF8                        1024         1024         3  avgt   15        3.681 ±     0.156   ms/op
URLEncodeDecode.testEncodeUTF8:gc.alloc.rate          1024         1024         3  avgt   15      519.050 ±    23.530  MB/sec
URLEncodeDecode.testEncodeUTF8:gc.alloc.rate.norm     1024         1024         3  avgt   15  2000689.365 ± 12769.291    B/op
URLEncodeDecode.testEncodeUTF8:gc.count               1024         1024         3  avgt   15        3.000              counts
URLEncodeDecode.testEncodeUTF8:gc.time                1024         1024         3  avgt   15        3.000                  ms

Use UTF8EncodeUtils:
Benchmark                                         (count)  (maxLength)  (mySeed)  Mode  Cnt        Score       Error   Units
URLEncodeDecode.testEncodeUTF8                        1024         1024         3  avgt   15        3.753 ±     0.169   ms/op
URLEncodeDecode.testEncodeUTF8:gc.alloc.rate          1024         1024         3  avgt   15      507.190 ±    24.402  MB/sec
URLEncodeDecode.testEncodeUTF8:gc.alloc.rate.norm     1024         1024         3  avgt   15  1992529.825 ± 12769.347    B/op
URLEncodeDecode.testEncodeUTF8:gc.count               1024         1024         3  avgt   15        3.000              counts
URLEncodeDecode.testEncodeUTF8:gc.time                1024         1024         3  avgt   15        3.000                  ms

Using UTF8EncodeUtils is approximately 2% slower, which is acceptable as it does not increase the object allocation rate.

Compared to baseline, this PR reduces memory allocation by 76%.
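As a quick check of the two percentages, using the `gc.alloc.rate.norm` rows (baseline versus UTF8EncodeUtils) and the `ms/op` rows (inline versus UTF8EncodeUtils) from the tables above:

```latex
\[
\frac{8429374.434 - 1992529.825}{8429374.434} \approx 0.764
\qquad\text{and}\qquad
\frac{3.753 - 3.681}{3.681} \approx 0.020
\]
```

That is roughly 76% fewer bytes allocated per operation than the baseline, and roughly 2% more time per operation than the inline variant. The 2% gap is smaller than the reported error of either measurement (± 0.156 and ± 0.169 ms/op).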

@dfuch
Member

dfuch commented Aug 24, 2023

I am not sure the added complexity is worth the gain. It's fine for String to have special knowledge of UTF-8 but I don't think we want that to bleed all over the place.

@Glavo
Contributor Author

Glavo commented Sep 18, 2023

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Sep 18, 2023
@openjdk

openjdk bot commented Sep 18, 2023

@Glavo
Your change (at version a2cb7b3) is now ready to be sponsored by a Committer.

@AlanBateman
Contributor

I ran the tier1 test with no failures.

It's very important to run the tier2 tests as that is where the jdk_net test group runs.

@Glavo
Contributor Author

Glavo commented Sep 19, 2023

> I ran the tier1 test with no failures.
>
> It's very important to run the tier2 tests as that is where the jdk_net test group runs.

I see. I ran tier2 and the only failure seemed unrelated (runtime/Thread/ThreadCountLimit.java).

Co-authored-by: Claes Redestad <claes.redestad@oracle.com>
@openjdk openjdk bot removed the sponsor Pull request is ready to be sponsored label Sep 19, 2023
Member

@dfuch dfuch left a comment

Nice work @Glavo, @cl4es. I am happy with where this pull request eventually ended. Thanks for your patience and for taking on so much feedback!
Please make sure tier2 tests are still passing before integrating.

@cl4es
Member

cl4es commented Sep 19, 2023

You need to issue /integrate again since there have been changes since the last time.

@Glavo
Contributor Author

Glavo commented Sep 19, 2023

> You need to issue /integrate again since there have been changes since the last time.

I thought I had to wait for a re-review after the modification before integrating.

@Glavo
Contributor Author

Glavo commented Sep 19, 2023

/integrate

@Glavo
Contributor Author

Glavo commented Sep 19, 2023

> Nice work @Glavo, @cl4es. I am happy with where this pull request eventually ended. Thanks for your patience and for taking on so much feedback! Please make sure tier2 tests are still passing before integrating.

I'm re-running the tier2 tests. I'll reply here when it's done.

@cl4es
Member

cl4es commented Sep 19, 2023

Re-approval isn't actually required but perhaps it would be good form to pick up that habit.

@Glavo
Contributor Author

Glavo commented Sep 19, 2023

I ran the tier1 and tier2 tests and there were no new failures.

@Glavo
Contributor Author

Glavo commented Sep 19, 2023

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Sep 19, 2023
@openjdk

openjdk bot commented Sep 19, 2023

@Glavo
Your change (at version 9eb12c8) is now ready to be sponsored by a Committer.

@cl4es
Member

cl4es commented Sep 19, 2023

/sponsor

@openjdk

openjdk bot commented Sep 19, 2023

Going to push as commit f25c920.
Since your change was applied there have been 13 commits pushed to the master branch:

  • 7c5f2a2: 8315669: Open source several Swing PopupMenu related tests
  • cf74b8c: 8316337: (bf) Concurrency issue in DirectByteBuffer.Deallocator
  • 4461eeb: 8312498: Thread::getState and JVM TI GetThreadState should return TIMED_WAITING virtual thread is timed parked
  • 670b456: 8315038: Capstone disassembler stops when it sees a bad instruction
  • fab372d: 8316428: G1: Nmethod count statistics only count last code root set iterated
  • 283c360: 8314877: Make fields final in 'java.net' package
  • 86115c2: 8316420: Serial: Remove unused GenCollectedHeap::oop_iterate
  • d038571: 8030815: Code roots are not accounted for in region prediction
  • 138542d: 8316061: Open source several Swing RootPane and Slider related tests
  • f52e500: 8316104: Open source several Swing SplitPane and RadioButton related tests
  • ... and 3 more: https://git.openjdk.org/jdk/compare/373e37bf13df654ba40c0bd9fcf345215be4eafb...master

Your commit was automatically rebased without conflicts.

Comment on lines 250 to 251
if (c == ' ') {
c = '+';
Contributor

The extra test for space (run on every regular character) could be moved to a separate if at line 255, with space removed from DONT_NEED_ENCODING. The performance improvement might not be noticeable, but it would remove an anomaly from the algorithm.
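A sketch of the restructuring suggested in this review comment might look like the following. It is not the code under review: `SpaceBranchSketch` and `SAFE` are hypothetical names, the table deliberately omits the space, and the escaping branch falls back to `String.getBytes(UTF_8)` for brevity rather than doing anything allocation-free.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the implementation in this PR.
final class SpaceBranchSketch {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();
    private static final boolean[] SAFE = new boolean[128];

    static {
        // Safe set WITHOUT the space character.
        String safe = "abcdefghijklmnopqrstuvwxyz"
                    + "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                    + "0123456789-_.*";
        for (int i = 0; i < safe.length(); i++) {
            SAFE[safe.charAt(i)] = true;
        }
    }

    static String encode(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); ) {
            char c = s.charAt(i);
            if (c < 128 && SAFE[c]) {
                out.append(c);        // common case: no space test here
                i++;
            } else if (c == ' ') {
                out.append('+');      // space handled in its own branch
                i++;
            } else {
                // Escape one whole code point so surrogate pairs stay together.
                int n = Character.charCount(s.codePointAt(i));
                byte[] utf8 = s.substring(i, i + n).getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    out.append('%').append(HEX[(b >> 4) & 0xF]).append(HEX[b & 0xF]);
                }
                i += n;
            }
        }
        return out.toString();
    }
}
```

In this arrangement the `c == ' '` comparison is only reached for characters that are not in the safe table, which is the anomaly the comment proposes to remove.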

@openjdk openjdk bot added the integrated Pull request has been integrated label Sep 19, 2023
@openjdk openjdk bot closed this Sep 19, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Sep 19, 2023
@openjdk

openjdk bot commented Sep 19, 2023

@cl4es @Glavo Pushed as commit f25c920.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@Glavo Glavo deleted the url-encoder branch September 19, 2023 14:03
@Glavo
Contributor Author

Glavo commented Sep 19, 2023

@cl4es I also have a PR (#15353) to remove DEFAULT_ENCODING_NAME from URLEncoder and URLDecoder, can you take a look at it?

Labels
integrated Pull request has been integrated net net-dev@openjdk.org
8 participants