-
Notifications
You must be signed in to change notification settings - Fork 5.8k
8302877: Speed up latin1 case conversions #12623
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…the 'oldest ASCII trick in the book'
Benchmark results: Baseline:
PR:
|
👋 Welcome back eirbjo! A progress list of the required criteria for merging this PR into |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I'd rather not use "case folding", as to me it implies "normalizing" but this is simply lowercasing/uppercasing.
|
||
/** | ||
* @test | ||
* @summary Provides exchaustive verification of Character.toUpperCase and Character.toLowerCase |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: "exhaustive"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did an 'exchaustive' search for 'exchaustive' across the code base and found two comments in LocaleData
and LocaleData.cldr
in jdk/test/jdk/sun/text/resources
.
Would you like me to update these as well while we're here, or should we avoid getting out scope for this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd appreciate it. I don't mind fixing it with this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Naoto, I have fixed the spelling in these two unrelated files.
…ro Sign' and 'y with Diaeresis'
… as 'Micro Sign' ("Mu" is the Unicode code point it uppercases to)
I guess I was looking for a generic term for uppercase/lowercase. I picked "case conversion" instead. |
/issue 8302877 |
@eirbjo The primary solved issue for a PR is set through the PR title. Since the current title does not contain an issue reference, it will now be updated. |
Webrevs
|
A site note: Early and crude experiements using the Vector API indicate that the 'oldest ASCII trick in the book' vectorizes pretty well. Here's a benchmark comparing the strings "helloworld" and "HelloWorld" repeated 1024 times, followed by either 'A' or 'B' (to force a an expensive mismatch):
I have the feeling that most case-insensitive comparisons are pretty short, so not sure how useful this is IRL. |
There seems to be a win from strings of size 32 bytes upwards. (That's probably longer than most keys in TreeMaps using String.CASE_INSENSITIVE_ORDER, such as j.n.h.HttpHeaders)
This is outside scope for this PR, I just wanted to leave a trace of this observation here for future record. For reference, I suggested this benchmark for inclusion in the Vector API benchmark: https://mail.openjdk.org/pipermail/panama-dev/2023-February/018709.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Some nits inline
} | ||
return mapChar; | ||
int l = ch | 0x20; // Lowercase using 'oldest ASCII trick in the book' | ||
if ( l <= 'z' // In range a-z |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if ( l <= 'z' // In range a-z | |
if (l <= 'z' // In range a-z |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed! (My IDE does not highlight this code, making it a bit harder to spot mistakes like this)
@Warmup(iterations = 5, time = 1) | ||
@Measurement(iterations = 5, time = 1) | ||
@Fork(3) | ||
public static class Latin1CaseConversions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if qualifying this as "Latin1" is necessary, even though that's what you've focused on for this PR. We could easily add some codePoints outside of the latin1 range (now or later) without changing the test.
While having a switch with some readable names is a nice touch I think we should additionally allow integer codePoint as-is to keep it in line with the outer class, e.g. default -> Integer.parseInt(codePoint);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are probably right that Latin1 is a bit narrow here, removing the prefix.
I added Integer.parseInt as the default, good idea!
@eirbjo This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 216 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@naotoj, @cl4es) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type |
Thanks for your review and JBS juggling, Claes! I'll wait for a final word from @naotoj before integrating. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for the fix!
/integrate |
/sponsor |
Going to push as commit ef1f7bd.
Your commit was automatically rebased without conflicts. |
The testcase change is failing to compile:
|
Darn, this was rather embarrassing! Since this was just integrated, I guess I'll need to open a new PR for the fix? |
Yes please. |
I created #12701 for this. Could someone please file a JBS? |
Thanks, #12701 is ready for review. |
This PR suggests we speed up Character.toUpperCase and Character.toLowerCase for latin1 code points by applying the 'oldest ASCII trick in the book'.
This takes advantage of the fact that latin1 uppercase code points are always 0x20 lower than their lowercase (with the exception of two code points which uppercase out of latin1).
To verify the correctness of the new implementation, the test
Latin1CaseConversion
is added with an exhaustive verification of toUpperCase/toLowerCase for all latin1 code points.The implementation needs to balance the performance of the various ranges in latin1. An effort has been made to favour operations on ASCII code points, without causing excessive regression for higher code points.
Performance is benchmarked for 7 chosen sample code points, each representing a range or a special-case. Results in the first comment.
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/12623/head:pull/12623
$ git checkout pull/12623
Update a local copy of the PR:
$ git checkout pull/12623
$ git pull https://git.openjdk.org/jdk pull/12623/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 12623
View PR using the GUI difftool:
$ git pr show -t 12623
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/12623.diff