-
Notifications
You must be signed in to change notification settings - Fork 5.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8314774: Optimize URLEncoder #15354
8314774: Optimize URLEncoder #15354
Conversation
👋 Welcome back Glavo! A progress list of the required criteria for merging this PR into |
Please use https://bugs.openjdk.org/browse/JDK-8314774 |
Webrevs
|
Hello Glavo, for changes like these, I think it would be more productive and useful to create a mailing list discussion first to provide some context on why this change is needed and gathering inputs from people familiar with this code on whether this change is necessary and worth it. Such discussions will then give the Reviewers some context and inputs on what needs to be considered in these changes and to what extent the changes should be done in the code. With the proposed changes in this PR which touches the character encoding handling and such, I think this will need a very thorough review keeping aside the performance aspects. I don't have enough experience of this class to know if it's worth doing this amount of change for any kind of performance improvements which may not be visible outside of micro benchmarks. |
I see. Thank you for your suggestion.
I know it's usually not a performance bottleneck, so the main goal of this PR is to reduce temporary object allocations. I noticed that this method is called quite frequently in our code, and it is also used in popular frameworks such as spring. I want to minimize GC pressure by minimizing unnecessary temporary objects. Since that method almost always uses UTF-8, I think it's worth providing a fast path for UTF-8. If it is too difficult to review, then I will try to optimize it in other ways. |
The fast path that just returns the given string if ASCII-only and no encoding looks simple enough. I don't particularly like the idea of embedding the logic of encoding UTF-8 into that class though, that increases the complexity significantly, and Charset encoders are there for that. Also I don't understand the reason for changing BitSet into a boolean array - that seems gratuitous? |
Unfortunately, the I'm thinking we might be able to extract this logic into a static helper class. public class UTF8EncodeUtils {
public static boolean isSingleByte(char c) { return c < 0x80; }
public static boolean isDoubleBytes(char c) { return c < 0x800; }
public static byte[] encodeDoubleBytes(char c) {
byte b0 = (byte) (0xc0 | (c >> 6));
byte b1 = (byte) (0x80 | (c & 0x3f));
return new byte[]{b0, b1};
}
public static byte[] encodeThreeBytes(char c) {
byte b0 = (byte) (0xe0 | (c >> 12));
byte b1 = (byte) (0x80 | ((c >> 6) & 0x3f));
byte b2 = (byte) (0x80 | (c & 0x3f));
return new byte[]{b0, b1, b2};
}
public static byte[] encodeCodePoint(int uc) {
byte b0 = (byte) (0xf0 | ((uc >> 18)));
byte b1 = (byte) (0x80 | ((uc >> 12) & 0x3f));
byte b2 = (byte) (0x80 | ((uc >> 6) & 0x3f));
byte b3 = (byte) (0x80 | (uc & 0x3f));
return new byte[]{b0, b1, b2, b3};
}
} We can use this helper class to reimplement I've also been doing some work on |
I observed a throughput improvement of 7%~10% after switching from |
I will extract the logic of encoding UTF-8 to
Using Compared to baseline, this PR reduces memory allocation by 76%. |
I am not sure the added complexity is worth the gain. It's fine for String to have special knowledge of UTF-8 but I don't think we want that to bleed all over the place. |
/integrate |
It's very important to run the tier2 tests as that is where the jdk_net test group runs. |
I see. I ran tier2 and the only failure seemed unrelated ( |
Co-authored-by: Claes Redestad <claes.redestad@oracle.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to issue /integrate again since there's been changes since the last time. |
I thought I had to wait for a re review after the modification to integrate. |
/integrate |
Re-approval isn't actually required but perhaps it would be good form to pick up that habit. |
I ran tier1~2 tests and there were no new failures. |
/integrate |
/sponsor |
Going to push as commit f25c920.
Your commit was automatically rebased without conflicts. |
if (c == ' ') { | ||
c = '+'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The extra test (on every regular character) for space could be moved to a separate if at line 255 (and remove space from DONT_NEED_ENCODING). The performance improvement might not be noticable but it would remove an anomaly from the algorithm.
I mainly made these optimizations:
StringBuilder
when there are no characters in the URL that need to be encoded;Implement a fast path for UTF-8.(Has been removed from this PR)In addition to improving performance, these optimizations also reduce temporary objects:
For UTF-8, the temporary(Has been removed from this PR)CharArrayWriter
, strings and byte arrays are no longer needed.I also updated the tests to add more test cases.
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/15354/head:pull/15354
$ git checkout pull/15354
Update a local copy of the PR:
$ git checkout pull/15354
$ git pull https://git.openjdk.org/jdk.git pull/15354/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 15354
View PR using the GUI difftool:
$ git pr show -t 15354
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/15354.diff
Webrev
Link to Webrev Comment