-
Notifications
You must be signed in to change notification settings - Fork 5.8k
8301873: Avoid string decoding in ZipFile.Source.getEntryPos #12290
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…Pos for the common case that both the name and the entry name are encoded in LATIN1 compatible UTF-8
👋 Welcome back eirbjo! A progress list of the required criteria for merging this PR into |
…e and API with a ~3% regression on getEntryHit
As suggested by @cl4es, I've replaced the use of ArraySupport.mismatch with Arrays.mismatch. The added range checks seems to give a regression of ~3% on the getEntryHit micro. |
if (false) { | ||
// Arrays.mismatch without the range checks (~3% faster micro getEntryHit) | ||
int aLength = encoded.length; | ||
int bLength = toIndex - fromIndex; | ||
int length = Math.min(aLength, bLength); | ||
int i = ArraysSupport.mismatch(encoded, 0, b, fromIndex, length); | ||
return (i < 0 && aLength != bLength) ? length : i; | ||
} | ||
return Arrays.mismatch(encoded, 0, encoded.length, b, fromIndex, toIndex); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving the ArraySupport.mismatch code here for now if anyone wants to investigate the ~3% regression introduced by the range checks in Arrays.mismatch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The performance hit of using Arrays.mismatch instead of ArraysSupport might be more like 5% actually:
Benchmark (size) Mode Cnt Score Error Units
ZipFileGetEntry.getEntryHit 512 avgt 15 59.149 ± 0.820 ns/op
ZipFileGetEntry.getEntryHit 1024 avgt 15 73.250 ± 1.114 ns/op
ZipFileGetEntry.getEntryHitUncached 512 avgt 15 103.377 ± 1.118 ns/op
ZipFileGetEntry.getEntryHitUncached 1024 avgt 15 115.418 ± 2.767 ns/op
ZipFileGetEntry.getEntryMiss 512 avgt 15 22.200 ± 0.145 ns/op
ZipFileGetEntry.getEntryMiss 1024 avgt 15 22.528 ± 0.271 ns/op
ZipFileGetEntry.getEntryMissUncached 512 avgt 15 57.359 ± 0.428 ns/op
ZipFileGetEntry.getEntryMissUncached 1024 avgt 15 68.013 ± 2.070 ns/op
ZipFileGetEntry.getEntrySlash 512 avgt 15 72.407 ± 0.603 ns/op
ZipFileGetEntry.getEntrySlash 1024 avgt 15 82.875 ± 11.094 ns/op
… win for "name/" matches.
… checked opening time by initCEN
…pFile rejects opening a ZIP with invalid UTF-8 byte sequences in its name or comment fields
…mentation to decode to string for comparison instead of encoding to bytes, this seems safer. Revert some changes from previous commits to parameters in the hasTrailingSlash method.
I realized that encoding to bytes and then comparing to CEN bytes might not be safe for encodings were multiple representations is possible for the same code points. So I moved string/byte array comparison into ZipCoder, which can now decode from CEN and compare as in the current code. Micros indicate this has no performance impact. |
|
||
@Override | ||
public int mismatchUTF8(String str, byte[] b, int fromIndex, int toIndex) { | ||
byte[] encoded = str.isLatin1() ? str.value() : str.getBytes(UTF_8.INSTANCE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is incorrect: latin-1 characters above codepoint 127 (non-ascii) would be represented by 2 bytes in UTF-8. What you want here is probably str.isAscii() ? ...
. The ASCII check will have to look at the bytes, so will incur a minor penalty.
Good news is that you should already be able to do this with what's already exposed via JLA.getBytesNoRepl(str, StandardCharsets.UTF_8)
, so no need for more shared secrets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I have updated the PR such that the new shared secret is replaced with using getBytesNoRepl instead. If there is a performance difference, it seems to hide in the noise.
I had expected such a regression to be caught by existing tests, which seems not to be the case. I added TestZipFileEncodings.latin1NotAscii to adress this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getBytesNoRepl throws CharacterCodingException "for malformed input or unmappable characters".
This should never happen since initCEN should already reject it. If it should happen anyway, I return NO_MATCH which will ignore the match just like the catch in getEntryPos currently does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this should be fine.
byte[] encoded = str.isLatin1() ? str.value() : str.getBytes(UTF_8.INSTANCE); | ||
if (false) { | ||
// Arrays.mismatch without the range checks (~5% faster micro getEntryHit) | ||
int aLength = encoded.length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of the difference you're seeing is due to knowing that you'll be matching the entire length of the first array (encoded, 0, encoded.length
).
As an experiment I added Arrays.mismatch(byte[], byte[], int, int)
to mismatch the entire range of the first array argument vs the range of the second and can spot an improvement in affected micros:
Benchmark (size) Mode Cnt Score Error Units
ArraysMismatch.Char.differentSubrangeMatches 90 avgt 10 12.165 ± 0.074 ns/op # mismatch(a, aFrom, aTo, b, bFrom, bTo)
ArraysMismatch.Char.subrangeMatches 90 avgt 10 10.748 ± 0.006 ns/op # mismatch(a, b, bFrom, bTo)
This might be something we can solve in the JITs without having to add new methods to java.util.Arrays
to deal as efficiently as possible with the case when we're matching against the entirety of one of the arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. Would be nice to solve this in the JIT!
This disabled code got deleted in my last commit, but it seems like you have a good analysis so we can let it go now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I might have fumbled this experiment a bit, and perhaps your setup would inline and then eliminate some of the known-in-range checks already.
Though we have intrinsified some of the Preconditions.check*
methods in the past to help improve range checks, but the checkFromToIndex
method that would be applicable here has not been intrinsified. It might be a reasonable path forward to replace Arrays.rangeCheck
with Preconditions.checkFromToIndex
and then look at intrinsifying that method to help eliminating or optimizing some of the checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, I had a flaw in my experiment and seems the first range check in a call like Arrays.mismatch(encoded, 0, encoded.length, b, off, off+len);
should be eliminated. So perhaps you're seeing the cost of the second range check, which might be unavoidable to be safe (zip meta data could otherwise be doctored to try and perform out of bounds reads via intrinsified code)
…ompare. Add a test case for UTF-8 encoded entry name which is latin1, but not ASCII
Filed JDK-8301873 for this, update PR title when you're ready. |
/issue 8301873 |
Yes, I'll file a PR to see if we can make |
Thanks! I pushed the change to ZipCoder.compare.
..and if String had (an optimized) mismatch method, then I bet all or most of the comparison methods (equals, compareTo, endsWith, startsWith, regionMatches) could delegate to that :-) |
A private mismatch method might make sense. A public method would require making a stronger case, I think, e.g. showing use cases a mismatcher would solve (elegantly, performantly) that can't be expressed with existing methods. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you Eirik for your efforts here. I am running your latest changes through our internal Mach5 systems.
Additional comments below which are meant to add clarity for future maintainers
} | ||
} | ||
idx = getEntryNext(idx); | ||
} | ||
return -1; | ||
// No exact match found, will return either slashMatch or -1 | ||
return slashMatch; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This gets a bit confusing as we return pos
when we have an exact match so it would be helpful to had more clarity via additional comments(it might not have been clear with the previous comments but I think if we are going to add slashMatch
we should take the time to beef up the comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dual-modality of this method certainly allows for some head-scratch trying to find a succinct way to describe its logic. I have made an attempt to improve it, but I'm sure it could be even better.
The slashPos
name was probably ok as a local variable, but now that it is part of the contract of ZipCoder.compare, I think it helps to rename the enum value to DIRECTORY_NAME
and update slashPos
to dirPos
accordingly.
Do you have any suggestions on how to improve the comments in the last version?
SLASH_MATCH, | ||
NO_MATCH | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment indicating what the values mean
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added comments to the enum and each of the values.
* String for comparison, this can be avoided if the String coder | ||
* is known and matches the charset of this ZipCoder. | ||
*/ | ||
Comparison compare(String str, byte[] b, int off, int len, boolean addSlash) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you could add an @param
comments, that would be awesome 😎
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I improved the Javadoc of this method and added @param
comments.
@Test(expectedExceptions = ZipException.class, | ||
expectedExceptionsMessageRegExp = BAD_ENTRY_NAME_OR_COMMENT) | ||
public void shouldRejectInvalidName() throws IOException { | ||
try (ZipFile zf = new ZipFile(invalidName.toFile())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you could please convert to use expectThrows
to get to validate the message name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The message is already validated using expectedExceptionsMessageRegExp
in the @Test
annotation.
Would you prefer if I use expectThrows instead, or perhaps inline the BAD_ENTRY_NAME_OR_COMMENT
constant as a literal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if this was not clear, we have gone away from using the annotation element exepectedExceptions
for new and updated tests and have tried to standardize on assertThrows
and expectThrows
instead which is the basis for my suggestion.
Thank you for your other updates. I will go through them later today
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries, this makes sense. Updated to expectThrows with assertEquals.
@@ -115,6 +124,54 @@ public void testUnicodeManyEntries(String charsetName) throws Throwable { | |||
test(70000, 10, true, Charset.forName(charsetName)); | |||
} | |||
|
|||
@Test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment introducing the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I described the rationale of adding this test in a comment.
assertNotNull(z.getEntry(entryName)); | ||
} | ||
} | ||
@Test(dataProvider = "all-charsets") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment introducing the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I described the rationale of adding this test in a comment.
@@ -247,7 +304,7 @@ private void writeEntry(ZipOutputStream zos, CRC32 crc, | |||
crc.update(data); | |||
ze.setCrc(crc.getValue()); | |||
} | |||
ze.setTime(System.currentTimeMillis()); | |||
ze.setTime(1675862371399L); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment indicating what the time is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change was accidentaly introduced, I have reverted it. Good catch.
…ues. Describe rationale for the tests added to TestZipFileEncodings. Revert unintended change in setTime in TestZipFileEncodings. Rename Comparison.SLASH_MATCH to DIRECTORY_MATCH.
Thanks for your thorough and helpful review, Lance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you Eirik, it looks much better.
I will kick off another run tomorrow and make another pass as it has been a long day :-)
Thank you again for your work here to improve the Zip code base
// Should throw ZipException | ||
} | ||
ZipException ex = expectThrows(ZipException.class, () -> { | ||
openZipFile(invalidName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could just do
ZipException ex = expectThrows(ZipException.class, () -> {
new ZipFile(invalidName.toFile())
});
versus adding openZipFile()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The habit of opening resources in a TwR is hard to break, but I guess it's ok for a test like this. I have inlined the method and removed the TwR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The habit of opening resources in a TwR is hard to break, but I guess it's ok for a test like this. I have inlined the method and removed the TwR.
Agree, thanks for addressing the suggestion as it makes the test cleaner given expects/assertThrows should react to the Exception being thrown by new ZipFile(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My inlining accidentally made the shouldRejectInvalidComment test the invalid name file. Fixed with last commit.
…coded string" refers to those found in the CEN byte array and to use the term "the encoded string" instead of the more convoluted "the string encoded in the byte array".
…in the invalid comment test
As soon as my last set of Mach5 runs complete I will approve this change. The JCK tests in this area also look good. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you Eirik for your patience through the review and testing of your PR.
I think we are in good shape overall. Mach5 tiers1-6 look good as do the JCK tests for Zip
Best
Lance
/integrate |
/sponsor |
Going to push as commit 78f71b4.
Your commit was automatically rebased without conflicts. |
After finding a hash match, getEntryPos needs to compare the lookup name up to the encoded entry name in the CEN. This comparison is done by decoding the entry name into a String. The names can then be compared using the String API. This decoding step adds a significat cost to this method.
This PR suggest to update the string comparison such that in the common case where both the lookup name and the entry name are encoded in ASCII-compatible UTF-8, decoding can be avoided and the byte arrays can instead be compared direcly.
ZipCoder is updated with a new method to compare a string with an encoded byte array range. The default implementation decodes to string (like the current code), while the UTF-8 implementation uses JavaLangAccess.getBytesNoRepl to get the bytes. Both methods thes uses Arrays.mismatch for comparison with or without matching trailing slashes.
Additionally, this PR suggest to make the following updates to getEntryPos:
(My though is that including these additional updates in this PR might reduce reviewer overhead given that it touches the exact same code. I might be wrong on this, please advise :)
I'm seeing a ~17% saving on the micro ZipFileGetEntry.getEntryHit (modified to use xalan.jar):
Baseline:
PR:
To assess the impact on startup/warmup, I made a main method class which measures the total time of calling ZipFile.getEntry for all entries in the 109 JAR file dependenies of spring-petclinic. The shows a nice improvement (time in micros):
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/12290/head:pull/12290
$ git checkout pull/12290
Update a local copy of the PR:
$ git checkout pull/12290
$ git pull https://git.openjdk.org/jdk pull/12290/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 12290
View PR using the GUI difftool:
$ git pr show -t 12290
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/12290.diff