8365675: Add String Unicode Case-Folding Support #26892
We're going to need to find a more compact format for the data; individual 1- or 2-entry char arrays have a large overhead. Plus, the map entries take a lot of space for the data and indexing.
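One possible shape for a more compact layout, offered only as a sketch and not as what this PR implements: keep the code points in a single sorted int array, concatenate all folded sequences into one string, and index into it with an offsets array. The class and field names below are hypothetical, and the four entries are a small hand-picked sample of full (F) mappings from CaseFolding.txt.

```java
import java.util.Arrays;

// Hypothetical layout for illustration; not the data format used by the PR.
final class CompactFoldTable {
    // Sorted code points that have a 1:M full case folding (sample only).
    private static final int[] KEYS = {0x00DF, 0x0130, 0xFB00, 0xFB01};

    // All folded sequences concatenated into one string: no per-entry char[].
    private static final String FOLDS = "ss" + "i\u0307" + "ff" + "fi";

    // OFFSETS[i] .. OFFSETS[i + 1] delimits the folding of KEYS[i] in FOLDS.
    private static final int[] OFFSETS = {0, 2, 4, 6, 8};

    // Returns the 1:M folding for codePoint, or null if the sample has none.
    static String fold(int codePoint) {
        int i = Arrays.binarySearch(KEYS, codePoint);
        return i < 0 ? null : FOLDS.substring(OFFSETS[i], OFFSETS[i + 1]);
    }

    public static void main(String[] args) {
        System.out.println("ss".equals(fold(0x00DF)));
        System.out.println(fold('a') == null);
    }
}
```

Compared with a map of small char arrays, this trades hashing for a binary search and stores three objects in total regardless of the number of entries.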
Summary
Case folding is a key operation for case-insensitive matching (e.g., string equality, regex matching), where the goal is to eliminate case distinctions without applying locale or language specific conversions.
Currently, the JDK does not expose a direct API for Unicode-compliant case folding. Developers instead rely on methods such as:
String.equalsIgnoreCase(String)
Character.toLowerCase(int) / Character.toUpperCase(int)
String.toLowerCase(Locale.ROOT) / String.toUpperCase(Locale.ROOT)
1:M mapping example, U+00DF (ß)
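The original example for this heading did not survive on this page. As a stand-in, the following sketch shows how the existing methods listed above treat U+00DF; the class and method names are made up for illustration, and only standard JDK calls are used.

```java
import java.util.Locale;

// Illustrative only: how existing JDK methods handle the 1:M case of U+00DF.
final class SharpSExistingApis {
    static final String SHARP_S = "\u00DF"; // ß

    // equalsIgnoreCase compares char by char, so "ß" can never match "ss".
    static boolean ignoreCaseMatchesSs() {
        return SHARP_S.equalsIgnoreCase("ss");
    }

    // String.toUpperCase applies the full mapping and expands to two chars.
    static String upper() {
        return SHARP_S.toUpperCase(Locale.ROOT);
    }

    // String.toLowerCase leaves ß unchanged.
    static String lower() {
        return SHARP_S.toLowerCase(Locale.ROOT);
    }

    // Character.toUpperCase is a 1:1 mapping and cannot expand ß.
    static int simpleUpper() {
        return Character.toUpperCase(0x00DF);
    }

    public static void main(String[] args) {
        System.out.println(ignoreCaseMatchesSs());        // false
        System.out.println("SS".equals(upper()));         // true
        System.out.println(SHARP_S.equals(lower()));      // true
        System.out.println(simpleUpper() == 0x00DF);      // true
    }
}
```

So upper-casing both sides happens to equate "ß" with "ss", while lower-casing and `equalsIgnoreCase` do not, which is the kind of inconsistency a dedicated case-folding comparison is meant to remove.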
Motivation & Direction
Add Unicode standard-compliant case-less comparison methods to the String class, enabling reliable and efficient Unicode-compliant case-insensitive matching.
This PR proposes to introduce new case-folding comparison methods in the String class.
These methods are intended to be the preferred choice when Unicode-compliant case-less matching is required.
Note: An early draft also proposed a String.toCaseFold() method returning a new case-folded string. However, during review this was considered error-prone, as the resulting string could easily be mistaken for a general transformation like toLowerCase() and then passed into APIs where case-folding semantics are not appropriate.
The New API
Usage Examples
Sharp s (U+00DF) case-folds to "ss"
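The original usage snippet is missing here. The sketch below only emulates the idea with a hypothetical helper: it hard-codes the single full folding ß → "ss" and approximates every other code point with the simple upper-then-lower mapping. It is not the API proposed in this PR and does not cover the full CaseFolding.txt data.

```java
// Hypothetical emulation for illustration; not the PR's API or implementation.
final class FoldCaseSketch {
    // Folds a string using one hard-coded 1:M mapping (U+00DF -> "ss") and a
    // simple per-code-point approximation for everything else.
    static String foldSketch(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 4);
        s.codePoints().forEach(cp -> {
            if (cp == 0x00DF) {
                sb.append("ss"); // full case folding expands to two chars
            } else {
                sb.appendCodePoint(Character.toLowerCase(Character.toUpperCase(cp)));
            }
        });
        return sb.toString();
    }

    // Case-less equality defined as equality of the folded forms.
    static boolean equalsFoldSketch(String a, String b) {
        return foldSketch(a).equals(foldSketch(b));
    }

    public static void main(String[] args) {
        System.out.println(equalsFoldSketch("Stra\u00DFe", "STRASSE")); // true
        System.out.println("Stra\u00DFe".equalsIgnoreCase("STRASSE"));  // false
    }
}
```

The helper materializes a folded string purely to keep the sketch short; a real comparison would normally fold on the fly while walking both inputs.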
Performance
The JMH microbenchmark StringCompareToIgnoreCase has been updated to compare performance of compareToFoldCase with the existing compareToIgnoreCase().
Refs
Unicode Standard 5.18.4 Caseless Matching
Unicode® Standard Annex #44: 5.6 Case and Case Mapping
Unicode Technical Standard #18: Unicode Regular Expressions RL1.5: Simple Loose Matches
Unicode SpecialCasing.txt
Unicode CaseFolding.txt
Other Languages
Python str.casefold()
The str.casefold() method in Python returns a casefolded version of a string. Casefolding is a more aggressive form of lowercasing, designed to remove all case distinctions in a string, particularly for the purpose of caseless string comparisons.
Perl’s fc()
Returns the casefolded version of EXPR. This is the internal function implementing the \F escape in double-quoted strings.
Casefolding is the process of mapping strings to a form where case differences are erased; comparing two strings in their casefolded form is effectively a way of asking if two strings are equal, regardless of case.
Perl only implements the full form of casefolding, but you can access the simple folds using "casefold()" in Unicode::UCD and "prop_invmap()" in Unicode::UCD.
ICU4J UCharacter.foldCase (Java)
Purpose: Provides extensions to the standard Java Character class, including support for more Unicode properties and handling of supplementary characters (code points beyond U+FFFF).
Method Signature (String based): public static String foldCase(String str, int options)
Method Signature (CharSequence & Appendable based): public static <A extends Appendable> A foldCase(CharSequence src, A dest, int options, Edits edits)
Key Features:
Case Folding: Converts a string to its case-folded equivalent.
Locale Independent: Case folding in UCharacter.foldCase is generally not dependent on locale settings.
Context Insensitive: The mapping of a character is not affected by surrounding characters.
Turkic Option: An option exists to include or exclude special mappings for Turkish/Azerbaijani text.
Result Length: The resulting string can be longer or shorter than the original.
Edits Recording: Allows for recording of edits for index mapping, styled text, and getting only changes.
u_strFoldCase (C/C++)
A lower-level C API function for case folding a string.
Case Folding Options: Similar options as UCharacter.foldCase for controlling case folding behavior.
Availability: Found in the ustring.h and unistr.h headers in the ICU4C library.
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26892/head:pull/26892
$ git checkout pull/26892
Update a local copy of the PR:
$ git checkout pull/26892
$ git pull https://git.openjdk.org/jdk.git pull/26892/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 26892
View PR using the GUI difftool:
$ git pr show -t 26892
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26892.diff
Using Webrev
Link to Webrev Comment