-
Notifications
You must be signed in to change notification settings - Fork 5.8k
8302871: Speed up StringLatin1.regionMatchesCI #12632
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…rick in the book'
👋 Welcome back eirbjo! A progress list of the required criteria for merging this PR into |
Benchmark results: Baseline:
Patch:
|
src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template
Outdated
Show resolved
Hide resolved
/issue 8302871 |
@eirbjo The primary solved issue for a PR is set through the PR title. Since the current title does not contain an issue reference, it will now be updated. |
Webrevs
|
…e latin1 verification EqualsIgnoreCase test
* @param b2 another byte representing a latin1 code point | ||
* @return true if the two bytes are considered equals ignoring case in latin1 | ||
*/ | ||
static boolean equalsIgnoreCase(byte b1, byte b2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it not already in CharacterDataLatin1?
Here is a comparison of relying on improvements in CharacterDataLatin1.toUpperCase/toLowerCase
only vs. using CharacterDataLatin1.equalsIgnoreCase
:
Character.toUpperCase/toLowerCase only:
Benchmark (codePoints) (size) Mode Cnt Score Error Units
RegionMatchesIC.Latin1.regionMatchesIC ascii-match 1024 avgt 15 1310.582 ± 84.777 ns/op
RegionMatchesIC.Latin1.regionMatchesIC ascii-mismatch 1024 avgt 15 4.547 ± 0.545 ns/op
RegionMatchesIC.Latin1.regionMatchesIC number-match 1024 avgt 15 686.947 ± 11.850 ns/op
RegionMatchesIC.Latin1.regionMatchesIC number-mismatch 1024 avgt 15 3.836 ± 0.634 ns/op
RegionMatchesIC.Latin1.regionMatchesIC lat1-match 1024 avgt 15 2107.219 ± 17.662 ns/op
RegionMatchesIC.Latin1.regionMatchesIC lat1-mismatch 1024 avgt 15 4.924 ± 0.829 ns/op
CharacterDataLatin1.equalsIgnoreCase:
Benchmark (codePoints) (size) Mode Cnt Score Error Units
RegionMatchesIC.Latin1.regionMatchesIC ascii-match 1024 avgt 15 742.467 ± 34.490 ns/op
RegionMatchesIC.Latin1.regionMatchesIC ascii-mismatch 1024 avgt 15 3.960 ± 0.046 ns/op
RegionMatchesIC.Latin1.regionMatchesIC number-match 1024 avgt 15 361.158 ± 37.096 ns/op
RegionMatchesIC.Latin1.regionMatchesIC number-mismatch 1024 avgt 15 4.039 ± 0.521 ns/op
RegionMatchesIC.Latin1.regionMatchesIC lat1-match 1024 avgt 15 1158.091 ± 41.617 ns/op
RegionMatchesIC.Latin1.regionMatchesIC lat1-mismatch 1024 avgt 15 4.358 ± 0.123 ns/op
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I lost context and thought this was in StringLatin1
.
Thanks for running the numbers with #12623. Looks like you're getting big enough of an improvement on top.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, seems equalsIgnoreCase
carries its weight.
@eirbjo This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 18 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@cl4es, @Martin-Buchholz, @AlanBateman) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type |
@@ -160,6 +160,26 @@ class CharacterDataLatin1 extends CharacterData { | |||
} | |||
return mapChar; | |||
} | |||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you should insert a blank line between the two methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been fixed now.
return true; | ||
} | ||
// uppercase b1 using 'the oldest ASCII trick in the book' | ||
int U = b1 & 0xDF; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sure some people reading this comment will wonder which book :-) It might be better to drop that bit and if possible, find a better name for "U" as normally variables start with a lower case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Alan,
I thought I was clever by encoding the 'uppercaseness' in the variable name, but yeah I'll find a better name :)
There is some precedent for using the 'ASCII trick' comment in the JDK. I found it in ZipFile.isMetaName, which is also where I first learned about this interesting relationship between ASCII (and also latin1) letters.
The comment was first added by Martin Buchholz back in 2016 as part of JDK-8157069, 'Assorted ZipFile improvements'. In 2020, Claes was updating this code and Lance had some input about clarifying the comment. Martin then chimed in to defend his comment:
I still like my ancient "ASCII trick" comment.
I think this 'trick', whatever we call it, is sufficiently intricate that it deserves to be called out somehow and that we should not just casually bitmask with these magic constants without any discussion at all.
An earlier iteration of this PR included a small essay in the javadoc of this method describing the layout and relationship of letters in latin1 and how we can apply that knowledge of the layout to implement the method.
How would you feel about adding that description back to the Javadocs? This would then live close to the similarly implemented toUpperCase and toLowerCase methods currently under review in #12623.
Here's the updated discussion included in the Javadoc:
/**
* Compares two latin1 code points, ignoring case considerations.
*
* Implementation note: In ISO/IEC 8859-1, the uppercase and lowercase
* letters are found in the following code point ranges:
*
* 0x41-0x5A: Uppercase ASCII letters: A-Z
* 0x61-0x7A: Lowercase ASCII letters: a-z
* 0xC0-0xD6: Uppercase latin1 letters: A-GRAVE - O with Diaeresis
* 0xD8-0xDE: Uppercase latin1 letters: O with slash - Thorn
* 0xE0-0xF6: Lowercase latin1 letters: a-grave - o with Diaeresis
* 0xF8-0xFE: Lowercase latin1 letters: o with slash - thorn
*
* While both ASCII letter ranges are contiguous, the latin1 ranges are not:
*
* The 'multiplication sign' 0xD7 splits the uppercase range in two.
* The 'division sign' 0xF7 splits the lowercase range in two.
*
* Lowercase letters are found 32 positions (0x20) after their corresponding uppercase letter.
* The 'division sign' and 'multiplication sign' have the same relative distance.
*
* Since 0x20 is a single bit, we can apply the 'oldest ASCII trick in the book' to
* lowercase any letter by setting the bit:
*
* ('C' | 0x20) == 'c'
*
* By removing the bit, we can perform the uppercase operation:
*
* ('c' & 0xDF) == 'C'
*
* Applying this knowledge of the latin1 layout, we can test for equality ignoring case by
* checking that the code points are either equal, or that one of the code points is a letter
* which uppercases is the same as the uppercase of the other code point.
*
* @param b1 byte representing a latin1 code point
* @param b2 another byte representing a latin1 code point
* @return true if the two bytes are considered equals ignoring case in latin1
*/
static boolean equalsIgnoreCase(byte b1, byte b2) {
if (b1 == b2) {
return true;
}
int upper = b1 & 0xDF;
if (upper < 'A') {
return false; // Low ASCII
}
return (upper <= 'Z' // In range A-Z
|| (upper >= 0xC0 && upper <= 0XDE && upper != 0xD7)) // ..or A-grave-Thorn, excl. multiplication
&& upper == (b2 & 0xDF); // b2 has same uppercase
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps @Martin-Buchholz could chime in and also tell us which book he found his ASCII trick in :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"oldest trick in the book" is a phrase that does not necessarily imply existence of an actual book!
Let this evoke an image of a personal book of tricks that programmers in the 1960s might have recorded such techniques in. And the tricks were passed down across generations of programmers!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More to the point, ASCII was obviously designed to allow you to uppercase a lower case letter with a single instruction, so "the book" might have been a draft standard before they scrubbed out the interesting history!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More history: IIRC I originally used 'ASCII trick' when I was truly only cared about ASCII, not Latin1 (e.g. ZipFile.isMetaName) and it's a slight misnomer to use "ASCII" here. But Latin1 followed the precedent of ASCII.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an opinion on the appropriate level of documentation / comments for this kind of 'tricky' code?
This code is not that tricky! And the proposed level of documentation is excessive! A couple of lines of explanation and perhaps a link to an external document would be good.
It often happens to me that I will write such exhaustive notes for myself when learning a new technology. A year later I pare it all back because much of it is "obvious in retrospect".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Martin, David, Alan. This was instructive (and fun!)
I suggest we condense the comment to something like this:
// Uppercase b1 by removing a single bit
int upper = b1 & 0xDF;
if (upper < 'A') {
return false; // Low ASCII
}
...
The similar methods toLowerCase
toUpperCase
just above have been updated to follow the same style. (I also updated local variable names there to align better with equalsIgnoreCase)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// ASCII and Latin-1 were designed to optimize case-twiddling operations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This expresses the higher-level benefit succinctly, without getting into the details. I like it!
…noreCase by using 'lower' and 'upper'
…ase with "by removing (setting) a single bit"
public void checkConsistencyWithCharacterUppercaseLowerCase() { | ||
for (int ab = 0; ab < 256; ab++) { | ||
for (int bb = 0; bb < 256; bb++) { | ||
char a = (char) ab, b = (char) bb; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
char is an unsigned numeric type, so cleaner is
for (char a = 0; a < 256; a++)
for (char b = 0; b < 256; b++)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed. Might have been copied over from processing of code points in the higher planes. Not needed here.
Thanks for reviews Claes and Martin! I'll let this linger a bit before integrating in case Alan has comments after the latest updates. |
I found this in Appendix A of the 1973
https://ia800606.us.archive.org/17/items/enf-ascii-1972-1975/Image070917152640_text.pdf |
/integrate |
I plan to look at it, been busy with other things. |
Thanks, Alan 👍 Sponsors, hold your horses! (Not sure how to 'undo' the integrate command) |
/sponsor |
Going to push as commit 17e3769.
Your commit was automatically rebased without conflicts. |
@AlanBateman @eirbjo Pushed as commit 17e3769. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
This PR suggests we can speed up
StringLatin1.regionMatchesCI
by applying 'the oldest ASCII trick in the book'.The new static method
CharacterDataLatin1.equalsIgnoreCase
compares two latin1 bytes for equality ignoring case.StringLatin1.regionMatchesCI
is updated to useequalsIgnoreCase
To verify the correctness of
equalsIgnoreCase
, a new test is added toEqualsIgnoreCase
with an exhaustive verification that all 256x256 latin1 code point pairs have anequalsIgnoreCase
consistent with Character.toUpperCase, Character.toLowerCase.Performance is tested for matching and mismatching cases of code point pairs picked from the ASCII letter, ASCII number and latin1 letter ranges. Results in the first comment below.
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632
$ git checkout pull/12632
Update a local copy of the PR:
$ git checkout pull/12632
$ git pull https://git.openjdk.org/jdk pull/12632/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 12632
View PR using the GUI difftool:
$ git pr show -t 12632
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/12632.diff