8302877: Speed up latin1 case conversions #12623

Closed
wants to merge 11 commits into from

eirbjo
Contributor

@eirbjo eirbjo commented Feb 17, 2023

This PR suggests we speed up Character.toUpperCase and Character.toLowerCase for latin1 code points by applying the 'oldest ASCII trick in the book'.

This takes advantage of the fact that latin1 uppercase code points are always 0x20 lower than their lowercase counterparts (with the exception of two code points whose uppercase forms fall outside latin1).
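
To make the bit arithmetic concrete, here is a minimal Java sketch of the trick as described above. It is not the code merged in this PR: the class name `Latin1CaseSketch` and its structure are illustrative assumptions. The two out-of-range uppercase mappings (U+00B5 to U+039C and U+00FF to U+0178) are the ones `Character.toUpperCase` applies to those code points.

```java
/** Illustrative sketch only; not the implementation merged in this PR. */
final class Latin1CaseSketch {

    /** Lowercases a latin1 code point (0x00-0xFF) by setting bit 0x20. */
    static int toLowerCase(int ch) {
        if (ch < 'A') {
            return ch;                       // Digits, punctuation, controls
        }
        int l = ch | 0x20;                   // Candidate lowercase form
        if (l <= 'z'                         // a-z (0x60 maps to itself)
                || (l >= 0xE0 && l <= 0xFE && l != 0xF7)) { // 0xE0-0xFE, minus division sign
            return l;
        }
        return ch;                           // Everything else is unchanged
    }

    /** Uppercases a latin1 code point (0x00-0xFF) by clearing bit 0x20. */
    static int toUpperCase(int ch) {
        if (ch < 'a') {
            return ch;                       // Already uppercase or caseless
        }
        int u = ch & 0xDF;                   // Candidate uppercase form
        if (u <= 'Z'                         // A-Z
                || (u >= 0xC0 && u <= 0xDE && u != 0xD7)) { // 0xC0-0xDE, minus multiplication sign
            return u;
        }
        if (ch == 0xB5) {
            return 0x39C;                    // Micro sign uppercases out of latin1
        }
        if (ch == 0xFF) {
            return 0x178;                    // y with diaeresis uppercases out of latin1
        }
        return ch;
    }

    public static void main(String[] args) {
        System.out.println((char) toLowerCase('A'));                // a
        System.out.println(Integer.toHexString(toUpperCase(0xE0))); // c0
        System.out.println(Integer.toHexString(toUpperCase(0xB5))); // 39c
    }
}
```

The early `ch < 'A'` and `ch < 'a'` returns keep the common ASCII cases on the shortest path, which is one plausible way to favour ASCII input.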

To verify the correctness of the new implementation, the test Latin1CaseConversion is added with an exhaustive verification of toUpperCase/toLowerCase for all latin1 code points.
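
As a rough illustration of what an exhaustive check over latin1 can look like, the sketch below walks all 256 code points and tests the 0x20 relationship against `Character` itself. It is not the `Latin1CaseConversion` test added by this PR; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the test added in this PR. */
final class Latin1OffsetCheck {

    /** Returns the latin1 code points whose uppercase form is not simply c - 0x20. */
    static List<Integer> uppercaseExceptions() {
        List<Integer> exceptions = new ArrayList<>();
        for (int c = 0; c <= 0xFF; c++) {
            int upper = Character.toUpperCase(c);
            if (upper != c && upper != c - 0x20) {
                exceptions.add(c);           // Case changes, but not by the 0x20 offset
            }
        }
        return exceptions;
    }

    /** Returns true if every latin1 code point that lowercases does so by adding 0x20. */
    static boolean lowercaseAlwaysAdds0x20() {
        for (int c = 0; c <= 0xFF; c++) {
            int lower = Character.toLowerCase(c);
            if (lower != c && lower != c + 0x20) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(uppercaseExceptions());     // [181, 255]
        System.out.println(lowercaseAlwaysAdds0x20()); // true
    }
}
```

The two reported exceptions, 181 (0xB5) and 255 (0xFF), are the code points that uppercase out of latin1.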

The implementation needs to balance the performance of the various ranges in latin1. An effort has been made to favour operations on ASCII code points, without causing excessive regression for higher code points.

Performance is benchmarked for 7 chosen sample code points, each representing a range or a special case. Results are in the first comment.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/12623/head:pull/12623
$ git checkout pull/12623

Update a local copy of the PR:
$ git checkout pull/12623
$ git pull https://git.openjdk.org/jdk pull/12623/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 12623

View PR using the GUI difftool:
$ git pr show -t 12623

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/12623.diff

@eirbjo
Contributor Author

eirbjo commented Feb 17, 2023

Benchmark results:

Baseline:

Benchmark                                 (codePoint)  Mode  Cnt  Score   Error  Units
Characters.Latin1CaseConversion.toLowerCase          low  avgt   15  1.267 ± 0.013  ns/op
Characters.Latin1CaseConversion.toLowerCase            A  avgt   15  1.657 ± 0.011  ns/op
Characters.Latin1CaseConversion.toLowerCase            a  avgt   15  1.258 ± 0.005  ns/op
Characters.Latin1CaseConversion.toLowerCase      A-grave  avgt   15  1.656 ± 0.011  ns/op
Characters.Latin1CaseConversion.toLowerCase      a-grave  avgt   15  1.270 ± 0.023  ns/op
Characters.Latin1CaseConversion.toLowerCase           mu  avgt   15  1.261 ± 0.006  ns/op
Characters.Latin1CaseConversion.toLowerCase           yD  avgt   15  1.260 ± 0.005  ns/op
Characters.Latin1CaseConversion.toUpperCase          low  avgt   15  1.284 ± 0.043  ns/op
Characters.Latin1CaseConversion.toUpperCase            A  avgt   15  1.264 ± 0.008  ns/op
Characters.Latin1CaseConversion.toUpperCase            a  avgt   15  1.818 ± 0.016  ns/op
Characters.Latin1CaseConversion.toUpperCase      A-grave  avgt   15  1.261 ± 0.015  ns/op
Characters.Latin1CaseConversion.toUpperCase      a-grave  avgt   15  1.822 ± 0.013  ns/op
Characters.Latin1CaseConversion.toUpperCase           mu  avgt   15  1.823 ± 0.006  ns/op
Characters.Latin1CaseConversion.toUpperCase           yD  avgt   15  1.822 ± 0.008  ns/op

PR:

Benchmark                                 (codePoint)  Mode  Cnt  Score   Error  Units
Characters.Latin1CaseConversion.toLowerCase          low  avgt   15  0.878 ± 0.005  ns/op
Characters.Latin1CaseConversion.toLowerCase            A  avgt   15  1.038 ± 0.009  ns/op
Characters.Latin1CaseConversion.toLowerCase            a  avgt   15  1.036 ± 0.007  ns/op
Characters.Latin1CaseConversion.toLowerCase      A-grave  avgt   15  1.357 ± 0.015  ns/op
Characters.Latin1CaseConversion.toLowerCase      a-grave  avgt   15  1.352 ± 0.003  ns/op
Characters.Latin1CaseConversion.toLowerCase           mu  avgt   15  1.273 ± 0.002  ns/op
Characters.Latin1CaseConversion.toLowerCase           yD  avgt   15  1.352 ± 0.004  ns/op
Characters.Latin1CaseConversion.toUpperCase          low  avgt   15  0.880 ± 0.013  ns/op
Characters.Latin1CaseConversion.toUpperCase            A  avgt   15  0.920 ± 0.071  ns/op
Characters.Latin1CaseConversion.toUpperCase            a  avgt   15  1.055 ± 0.013  ns/op
Characters.Latin1CaseConversion.toUpperCase      A-grave  avgt   15  1.394 ± 0.010  ns/op
Characters.Latin1CaseConversion.toUpperCase      a-grave  avgt   15  1.391 ± 0.009  ns/op
Characters.Latin1CaseConversion.toUpperCase           mu  avgt   15  1.597 ± 0.021  ns/op
Characters.Latin1CaseConversion.toUpperCase           yD  avgt   15  1.354 ± 0.003  ns/op

@bridgekeeper

bridgekeeper bot commented Feb 17, 2023

👋 Welcome back eirbjo! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Feb 17, 2023

@eirbjo The following labels will be automatically applied to this pull request:

  • core-libs
  • i18n

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added core-libs core-libs-dev@openjdk.org i18n i18n-dev@openjdk.org labels Feb 17, 2023
@eirbjo eirbjo marked this pull request as ready for review February 17, 2023 17:36
Member

@naotoj naotoj left a comment

Looks good to me. I'd rather not use "case folding", as to me it implies "normalizing" but this is simply lowercasing/uppercasing.


/**
* @test
* @summary Provides exchaustive verification of Character.toUpperCase and Character.toLowerCase
Member

typo: "exhaustive"?

Contributor Author

@eirbjo eirbjo Feb 18, 2023

I did an 'exchaustive' search for 'exchaustive' across the code base and found two comments in LocaleData and LocaleData.cldr in jdk/test/jdk/sun/text/resources.

Would you like me to update these as well while we're here, or should we avoid going out of scope for this PR?

Member

I'd appreciate it. I don't mind fixing it with this PR.

Contributor Author

Thanks Naoto, I have fixed the spelling in these two unrelated files.

@eirbjo
Contributor Author

eirbjo commented Feb 18, 2023

> I'd rather not use "case folding", as to me it implies "normalizing" but this is simply lowercasing/uppercasing.

I guess I was looking for a generic term for uppercase/lowercase. I picked "case conversion" instead.

@eirbjo eirbjo changed the title from "Speed up latin1 case folding" to "Speed up latin1 case transformation" Feb 18, 2023
@eirbjo eirbjo changed the title from "Speed up latin1 case transformation" to "Speed up latin1 case conversions" Feb 18, 2023
@eirbjo
Contributor Author

eirbjo commented Feb 20, 2023

/issue 8302877

@openjdk openjdk bot changed the title from "Speed up latin1 case conversions" to "8302877: Speed up latin1 case conversions" Feb 20, 2023
@openjdk

openjdk bot commented Feb 20, 2023

@eirbjo The primary solved issue for a PR is set through the PR title. Since the current title does not contain an issue reference, it will now be updated.

@openjdk openjdk bot added the rfr Pull request is ready for review label Feb 20, 2023
@mlbridge

mlbridge bot commented Feb 20, 2023

Webrevs

@eirbjo
Contributor Author

eirbjo commented Feb 21, 2023

A side note: early and crude experiments using the Vector API indicate that the 'oldest ASCII trick in the book' vectorizes pretty well.

Here's a benchmark comparing the strings "helloworld" and "HelloWorld" repeated 1024 times, followed by either 'A' or 'B' (to force an expensive mismatch):

Benchmark                    (size)  Mode  Cnt     Score    Error  Units
EqualsIgnoreCase.scalar        1024  avgt   15  6225.624 ± 89.182  ns/op
EqualsIgnoreCase.vectorized    1024  avgt   15  1246.110 ± 14.767  ns/op

I have the feeling that most case-insensitive comparisons are pretty short, so not sure how useful this is IRL.
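
For readers who want to see the comparison being measured, below is a scalar, ASCII-only sketch of a case-insensitive equality check built on the same 0x20 trick, together with inputs shaped like the ones described above. It is not the benchmark code and it does not use the Vector API; the class name, method names, and the restriction to ASCII letters are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative scalar sketch only; not the benchmarked code. */
final class EqualsIgnoreCaseSketch {

    /** Compares two byte arrays, ignoring case for ASCII letters only. */
    static boolean equalsIgnoreCaseAscii(byte[] a, byte[] b) {
        if (a.length != b.length) {
            return false;
        }
        for (int i = 0; i < a.length; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x == y) {
                continue;                    // Identical bytes always match
            }
            int lx = x | 0x20;               // Lowercase candidate for x
            // Bytes may differ only in bit 0x20, and only if they are ASCII letters
            if (lx != (y | 0x20) || lx < 'a' || lx > 'z') {
                return false;
            }
        }
        return true;
    }

    /** Builds an input: the unit repeated, then a single trailing character. */
    static byte[] input(String unit, int repeats, char tail) {
        return (unit.repeat(repeats) + tail).getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] left = input("helloworld", 1024, 'A');
        System.out.println(equalsIgnoreCaseAscii(left, input("HelloWorld", 1024, 'a'))); // true
        System.out.println(equalsIgnoreCaseAscii(left, input("HelloWorld", 1024, 'B'))); // false
    }
}
```

Placing the differing character at the very end means the loop has to scan the whole input before it can report a mismatch, which is what makes the mismatch case expensive.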

@eirbjo
Contributor Author

eirbjo commented Feb 21, 2023

> I have the feeling that most case-insensitive comparisons are pretty short, so not sure how useful this is IRL.

There seems to be a win from strings of size 32 bytes upwards. (That's probably longer than most keys in TreeMaps using String.CASE_INSENSITIVE_ORDER, such as j.n.h.HttpHeaders)

Benchmark                    (size)  Mode  Cnt   Score   Error  Units
EqualsIgnoreCase.scalar          16  avgt    2  20.608          ns/op
EqualsIgnoreCase.scalar          32  avgt    2  36.510          ns/op
EqualsIgnoreCase.vectorized      16  avgt    2  18.601          ns/op
EqualsIgnoreCase.vectorized      32  avgt    2  12.795          ns/op

This is outside the scope of this PR; I just wanted to leave a trace of this observation here for future reference.

For reference, I suggested this benchmark for inclusion in the Vector API benchmark:

https://mail.openjdk.org/pipermail/panama-dev/2023-February/018709.html

Member

@cl4es cl4es left a comment

Looks good. Some nits inline

}
return mapChar;
int l = ch | 0x20; // Lowercase using 'oldest ASCII trick in the book'
if ( l <= 'z' // In range a-z
Member

Suggested change
- if ( l <= 'z' // In range a-z
+ if (l <= 'z' // In range a-z

Contributor Author

Fixed! (My IDE does not highlight this code, making it a bit harder to spot mistakes like this)

@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(3)
public static class Latin1CaseConversions {
Member

Not sure if qualifying this as "Latin1" is necessary, even though that's what you've focused on for this PR. We could easily add some codePoints outside of the latin1 range (now or later) without changing the test.

While having a switch with some readable names is a nice touch I think we should additionally allow integer codePoint as-is to keep it in line with the outer class, e.g. default -> Integer.parseInt(codePoint);
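
A minimal sketch of that suggestion, assuming the benchmark parameter arrives as a `String`: readable names map to fixed code points, and anything else is parsed as an integer. The class and method names are hypothetical, and the numeric values for the named cases are assumptions inferred from the names in the benchmark output (the `low` entry is omitted because its value is not stated in this conversation).

```java
/** Illustrative sketch only; names and values are assumptions, not the benchmark source. */
final class CodePointParamSketch {

    /** Resolves a benchmark parameter to a code point. */
    static int parse(String codePoint) {
        return switch (codePoint) {
            case "A" -> 'A';
            case "a" -> 'a';
            case "A-grave" -> 0xC0;          // Assumed: capital A with grave
            case "a-grave" -> 0xE0;          // Assumed: small a with grave
            case "mu" -> 0xB5;               // Assumed: micro sign
            case "yD" -> 0xFF;               // Assumed: small y with diaeresis
            default -> Integer.parseInt(codePoint); // Any other value is a decimal code point
        };
    }

    public static void main(String[] args) {
        System.out.println(parse("mu"));     // 181
        System.out.println(parse("960"));    // 960
    }
}
```

With the integer fallback, code points outside latin1 can be benchmarked by passing a decimal value, with no change to the switch.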

Contributor Author

You are probably right that Latin1 is a bit narrow here, removing the prefix.

I added Integer.parseInt as the default, good idea!

@openjdk

openjdk bot commented Feb 21, 2023

@eirbjo This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8302877: Speed up latin1 case conversions

Reviewed-by: naoto, redestad

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 216 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@naotoj, @cl4es) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Feb 21, 2023
@eirbjo
Contributor Author

eirbjo commented Feb 21, 2023

Thanks for your review and JBS juggling, Claes!

I'll wait for a final word from @naotoj before integrating.

Member

@naotoj naotoj left a comment

LGTM. Thanks for the fix!

@eirbjo
Contributor Author

eirbjo commented Feb 21, 2023

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Feb 21, 2023
@openjdk

openjdk bot commented Feb 21, 2023

@eirbjo
Your change (at version bff999c) is now ready to be sponsored by a Committer.

@naotoj
Member

naotoj commented Feb 21, 2023

/sponsor

@openjdk

openjdk bot commented Feb 21, 2023

Going to push as commit ef1f7bd.
Since your change was applied there have been 216 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Feb 21, 2023
@openjdk openjdk bot closed this Feb 21, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Feb 21, 2023
@openjdk

openjdk bot commented Feb 21, 2023

@naotoj @eirbjo Pushed as commit ef1f7bd.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@dholmes-ora
Member

The testcase change is failing to compile:

open/test/micro/org/openjdk/bench/java/lang/Characters.java:115: error: case, default, or '}' expected

@eirbjo
Contributor Author

eirbjo commented Feb 21, 2023

> The testcase change is failing to compile:
>
> open/test/micro/org/openjdk/bench/java/lang/Characters.java:115: error: case, default, or '}' expected

Darn, this was rather embarrassing!

Since this was just integrated, I guess I'll need to open a new PR for the fix?

@naotoj
Member

naotoj commented Feb 21, 2023

Yes please.

@eirbjo
Contributor Author

eirbjo commented Feb 21, 2023

I created #12701 for this.

Could someone please file a JBS?

@naotoj
Member

naotoj commented Feb 21, 2023

@eirbjo
Contributor Author

eirbjo commented Feb 21, 2023

Thanks, #12701 is ready for review.

Labels
core-libs core-libs-dev@openjdk.org i18n i18n-dev@openjdk.org integrated Pull request has been integrated
4 participants