@xuemingshen-oracle xuemingshen-oracle commented Aug 22, 2025

Summary

Case folding is a key operation for case-insensitive matching (e.g., string equality, regex matching), where the goal is to eliminate case distinctions without applying locale- or language-specific conversions.

Currently, the JDK does not expose a direct API for Unicode-compliant case folding. Developers instead rely on methods such as:

String.equalsIgnoreCase(String)

  • Unicode-aware, locale-independent.
  • Implementation uses Character.toLowerCase(Character.toUpperCase(int)) per code point.
  • Limited: does not support 1:M mapping defined in Unicode case folding.

Character.toLowerCase(int) / Character.toUpperCase(int)

  • Locale-independent, single code point only.
  • No support for 1:M mappings.

String.toLowerCase(Locale.ROOT) / String.toUpperCase(Locale.ROOT)

  • Based on Unicode SpecialCasing.txt, supports 1:M mappings.
  • Intended primarily for presentation/display, not structural case-insensitive matching.
  • Requires converting the full string before comparison, which is less efficient.

1:M mapping example, U+00DF (ß)

  • "ß".toUpperCase(Locale.ROOT) → "SS"
  • Case folding produces "ss", matching Unicode caseless comparison rules.
jshell> "\u00df".equalsIgnoreCase("ss")
$22 ==> false

jshell> "\u00df".toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT).equals("ss")
$24 ==> true
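
The behaviour of the existing methods on U+00DF can be checked with a small sketch that uses only current JDK APIs. The class name SharpSDemo and its helper methods are illustrative assumptions and are not part of this PR.

```java
import java.util.Locale;

// Illustrative only: exercises existing JDK methods on U+00DF (sharp s).
class SharpSDemo {
    // equalsIgnoreCase maps one char at a time, so "ß" can never match "ss".
    static boolean ignoreCaseMatches() {
        return "\u00df".equalsIgnoreCase("ss");
    }

    // The simple (1:1) uppercase mapping leaves U+00DF unchanged.
    static int simpleUpper() {
        return Character.toUpperCase(0x00DF);
    }

    // The full (SpecialCasing-based) uppercase mapping expands to two chars.
    static String fullUpper() {
        return "\u00df".toUpperCase(Locale.ROOT);
    }

    // Round-trip workaround: converts the whole string twice before comparing.
    static boolean roundTripMatches() {
        return "\u00df".toUpperCase(Locale.ROOT)
                       .toLowerCase(Locale.ROOT)
                       .equals("ss");
    }

    public static void main(String[] args) {
        System.out.println(ignoreCaseMatches());                 // false
        System.out.println(Integer.toHexString(simpleUpper()));  // df
        System.out.println(fullUpper());                         // SS
        System.out.println(roundTripMatches());                  // true
    }
}
```

The round trip produces the expected answer here, but it allocates two intermediate strings per comparison, which is the cost the proposed methods aim to avoid.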

Motivation & Direction

Add Unicode standard-compliant case-less comparison methods to the String class, enabling reliable and efficient Unicode-compliant case-insensitive matching.

  • Unicode-compliant full case folding.
  • Simpler, stable and more efficient case-less matching without workarounds.
  • Brings Java's string comparison handling in line with other programming languages/libraries.

This PR proposes to introduce the following comparison methods in the String class:

  • boolean equalsFoldCase(String anotherString)
  • int compareToFoldCase(String anotherString)
  • Comparator<String> UNICODE_CASEFOLD_ORDER

These methods are intended to be the preferred choice when Unicode-compliant case-less matching is required.

Note: An early draft also proposed a String.toCaseFold() method returning a new case-folded string.
During review, however, this was considered error-prone: the resulting string could easily be mistaken for the output of a general transformation like toLowerCase() and then passed into APIs where case-folding semantics are not appropriate.

The New API

    /**
     * Compares this {@code String} to another {@code String} for equality,
     * using <em>Unicode case folding</em>. Two strings are considered equal
     * by this method if their case-folded forms are identical.
     * <p>
     * Case folding is defined by the Unicode Standard in
     * <a href="https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt">CaseFolding.txt</a>,
     * including 1:M mappings. For example, {@code "Maße".equalsFoldCase("MASSE")}
     * returns {@code true}, since the character {@code U+00DF} (sharp s) folds
     * to {@code "ss"}.
     * <p>
     * Case folding is locale-independent and language-neutral, unlike
     * locale-sensitive transformations such as {@link #toLowerCase()} or
     * {@link #toUpperCase()}. It is intended for caseless matching,
     * searching, and indexing.
     *
     * @apiNote
     * This method is the Unicode-compliant alternative to
     * {@link #equalsIgnoreCase(String)}. It implements full case folding as
     * defined by the Unicode Standard, which may differ from the simpler
     * per-character mapping performed by {@code equalsIgnoreCase}.
     * For example:
     * <pre>{@snippet lang=java :
     * String a = "Maße";
     * String b = "MASSE";
     * boolean equalsFoldCase = a.equalsFoldCase(b);       // returns true
     * boolean equalsIgnoreCase = a.equalsIgnoreCase(b);   // returns false
     * }</pre>
     *
     * @param  anotherString
     *         The {@code String} to compare this {@code String} against
     *
     * @return  {@code true} if the argument is not {@code null} and represents
     *          the same sequence of characters as this string under Unicode case
     *          folding; {@code false} otherwise.
     *
     * @see     #compareToFoldCase(String)
     * @see     #equalsIgnoreCase(String)
     * @since   26
     */
    public boolean equalsFoldCase(String anotherString)

    /**
     * Compares two strings lexicographically using <em>Unicode case folding</em>.
     * This method returns an integer whose sign is that of calling {@code compareTo}
     * on the Unicode case-folded versions of the strings. Unicode case folding
     * eliminates differences in case according to the Unicode Standard, using the
     * mappings defined in
     * <a href="https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt">CaseFolding.txt</a>,
     * including 1:M mappings, such as {@code "ß"} → {@code "ss"}.
     * <p>
     * Case folding is a locale-independent, language-neutral form of case mapping,
     * primarily intended for caseless matching. Unlike {@link #compareToIgnoreCase(String)},
     * which applies a simpler locale-insensitive uppercase mapping, this method
     * follows the Unicode <em>full</em> case folding, providing stable and
     * consistent results across all environments.
     * <p>
     * Note that this method does <em>not</em> take locale into account, and may
     * produce results that differ from locale-sensitive ordering. Use
     * {@link java.text.Collator} for locale-sensitive comparison.
     *
     * @apiNote
     * This method is the Unicode-compliant alternative to
     * {@link #compareToIgnoreCase(String)}. It implements the <em>full</em> case folding
     * as defined by the Unicode Standard, which may differ from the simpler
     * per-character mapping performed by {@code compareToIgnoreCase}.
     * For example:
     * <pre>{@snippet lang=java :
     * String a = "Maße";
     * String b = "MASSE";
     * int cmpFoldCase = a.compareToFoldCase(b);     // returns 0
     * int cmpIgnoreCase = a.compareToIgnoreCase(b); // returns > 0
     * }</pre>
     *
     * @param   str   the {@code String} to be compared.
     * @return  a negative integer, zero, or a positive integer as the specified
     *          String is greater than, equal to, or less than this String,
     *          ignoring case considerations by case folding.
     * @see     #equalsFoldCase(String)
     * @see     #compareToIgnoreCase(String)
     * @see     java.text.Collator
     * @since   26
     */
    public int compareToFoldCase(String str) 

    /**
     * A Comparator that orders {@code String} objects as by
     * {@link #compareToFoldCase(String) compareToFoldCase()}.
     *
     * @see     #compareToFoldCase(String)
     * @since   26
     */
    public static final Comparator<String> UNICODE_CASEFOLD_ORDER;

Usage Examples

Sharp s (U+00DF) case-folds to "ss"

    "straße".equalsIgnoreCase("strasse");             // false
    "straße".compareToIgnoreCase("strasse");          // != 0
    "straße".equalsFoldCase("strasse");               // true
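
The proposed comparator is intended for use wherever a Comparator<String> is accepted, for example in a sorted set. Since the new methods are not available on current JDK releases, the sketch below substitutes a hypothetical stand-in, APPROX_FOLD_ORDER, built from the upper/lower round trip described in the Summary. It only approximates full case folding and is not the implementation in this PR; the class and method names are assumptions made for illustration.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Illustrative only: contrasts CASE_INSENSITIVE_ORDER with an approximate fold order.
class FoldOrderSketch {
    // Hypothetical stand-in for the proposed String.UNICODE_CASEFOLD_ORDER.
    // Assumption: the upper/lower round trip is close enough for this example.
    static final Comparator<String> APPROX_FOLD_ORDER = Comparator.comparing(
            (String s) -> s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT));

    // Counts how many entries the given ordering treats as distinct.
    static int distinctCount(Comparator<String> order, List<String> words) {
        TreeSet<String> set = new TreeSet<>(order);
        set.addAll(words);
        return set.size();
    }

    public static void main(String[] args) {
        List<String> words = List.of("stra\u00dfe", "STRASSE", "Strasse");
        // Per-char comparison keeps "straße" apart from the "ss" spellings.
        System.out.println(distinctCount(String.CASE_INSENSITIVE_ORDER, words)); // 2
        // With 1:M folding all three spellings collapse into one entry.
        System.out.println(distinctCount(APPROX_FOLD_ORDER, words));             // 1
    }
}
```

With the API proposed here, the stand-in would presumably be replaced by String.UNICODE_CASEFOLD_ORDER, which avoids building the intermediate strings.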

Performance

The JMH microbenchmark StringCompareToIgnoreCase has been updated to compare performance of compareToFoldCase with the existing compareToIgnoreCase().

Benchmark                                         Mode  Cnt   Score   Error  Units
StringCompareToIgnoreCase.asciiGreekLower         avgt   15  20.195 ± 0.300  ns/op
StringCompareToIgnoreCase.asciiGreekLowerCF       avgt   15  11.051 ± 0.254  ns/op
StringCompareToIgnoreCase.asciiGreekUpperLower    avgt   15   6.035 ± 0.047  ns/op
StringCompareToIgnoreCase.asciiGreekUpperLowerCF  avgt   15  14.786 ± 0.382  ns/op
StringCompareToIgnoreCase.asciiLower              avgt   15  17.688 ± 1.396  ns/op
StringCompareToIgnoreCase.asciiLowerCF            avgt   15  44.552 ± 0.155  ns/op
StringCompareToIgnoreCase.asciiUpperLower         avgt   15  13.069 ± 0.487  ns/op
StringCompareToIgnoreCase.asciiUpperLowerCF       avgt   15  58.684 ± 0.274  ns/op
StringCompareToIgnoreCase.greekLower              avgt   15  20.642 ± 0.082  ns/op
StringCompareToIgnoreCase.greekLowerCF            avgt   15   7.255 ± 0.271  ns/op
StringCompareToIgnoreCase.greekUpperLower         avgt   15   5.737 ± 0.013  ns/op
StringCompareToIgnoreCase.greekUpperLowerCF       avgt   15  11.100 ± 1.147  ns/op
StringCompareToIgnoreCase.lower                   avgt   15  20.192 ± 0.044  ns/op
StringCompareToIgnoreCase.lowerrCF                avgt   15  11.257 ± 0.259  ns/op
StringCompareToIgnoreCase.supLower                avgt   15  54.801 ± 0.415  ns/op
StringCompareToIgnoreCase.supLowerCF              avgt   15  15.207 ± 0.418  ns/op
StringCompareToIgnoreCase.supUpperLower           avgt   15  14.431 ± 0.188  ns/op
StringCompareToIgnoreCase.supUpperLowerCF         avgt   15  19.149 ± 0.985  ns/op
StringCompareToIgnoreCase.upperLower              avgt   15   5.650 ± 0.051  ns/op
StringCompareToIgnoreCase.upperLowerCF            avgt   15  14.338 ± 0.352  ns/op
StringCompareToIgnoreCase.utf16SubLower           avgt   15  14.774 ± 0.200  ns/op
StringCompareToIgnoreCase.utf16SubLowerCF         avgt   15   2.669 ± 0.041  ns/op
StringCompareToIgnoreCase.utf16SupUpperLower      avgt   15  16.250 ± 0.099  ns/op
StringCompareToIgnoreCase.utf16SupUpperLowerCF    avgt   15  11.524 ± 0.327  ns/op

Refs

Unicode Standard 5.18.4 Caseless Matching
Unicode® Standard Annex #44: 5.6 Case and Case Mapping
Unicode Technical Standard #18: Unicode Regular Expressions RL1.5: Simple Loose Matches
Unicode SpecialCasing.txt
Unicode CaseFolding.txt

Other Languages

Python str.casefold()

The str.casefold() method in Python returns a casefolded version of a string. Casefolding is a more aggressive form of lowercasing, designed to remove all case distinctions in a string, particularly for the purpose of caseless string comparisons.

Perl’s fc()

Returns the casefolded version of EXPR. This is the internal function implementing the \F escape in double-quoted strings.
Casefolding is the process of mapping strings to a form where case differences are erased; comparing two strings in their casefolded form is effectively a way of asking if two strings are equal, regardless of case.
Perl only implements the full form of casefolding, but you can access the simple folds using "casefold()" in Unicode::UCD and "prop_invmap()" in Unicode::UCD.

ICU4J UCharacter.foldCase (Java)

Purpose: Provides extensions to the standard Java Character class, including support for more Unicode properties and handling of supplementary characters (code points beyond U+FFFF).
Method Signature (String based): public static String foldCase(String str, int options)
Method Signature (CharSequence & Appendable based): public static A foldCase(CharSequence src, A dest, int options, Edits edits)
Key Features:

  • Case Folding: Converts a string to its case-folded equivalent.
  • Locale Independent: Case folding in UCharacter.foldCase is generally not dependent on locale settings.
  • Context Insensitive: The mapping of a character is not affected by surrounding characters.
  • Turkic Option: An option exists to include or exclude special mappings for Turkish/Azerbaijani text.
  • Result Length: The resulting string can be longer or shorter than the original.
  • Edits Recording: Allows for recording of edits for index mapping, styled text, and getting only changes.

u_strFoldCase (C/C++)

A lower-level C API function for case folding a string.
Case Folding Options: Similar options as UCharacter.foldCase for controlling case folding behavior.
Availability: Found in the ustring.h and unistr.h headers in the ICU4C library.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change requires CSR request JDK-8369017 to be approved

Issues

  • JDK-8365675: Add String Unicode Case-Folding Support (Enhancement - P3)
  • JDK-8369017: Add String Unicode Case-Folding Support (CSR)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26892/head:pull/26892
$ git checkout pull/26892

Update a local copy of the PR:
$ git checkout pull/26892
$ git pull https://git.openjdk.org/jdk.git pull/26892/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26892

View PR using the GUI difftool:
$ git pr show -t 26892

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26892.diff

Using Webrev

Link to Webrev Comment

bridgekeeper bot commented Aug 22, 2025

👋 Welcome back sherman! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk bot commented Aug 22, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot changed the title from "8365675: Add String.toCaseFold() to support Unicode case-folding" to "8365675: Add String.toCaseFold() to support Unicode Case-Folding" Aug 22, 2025

openjdk bot commented Aug 22, 2025

@xuemingshen-oracle The following labels will be automatically applied to this pull request:

  • build
  • core-libs
  • i18n

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the build (build-dev@openjdk.org), core-libs (core-libs-dev@openjdk.org), and i18n (i18n-dev@openjdk.org) labels Aug 22, 2025
@xuemingshen-oracle xuemingshen-oracle force-pushed the JDK-8365675 branch 2 times, most recently from fe40e23 to a46d3bf August 25, 2025 07:30
@xuemingshen-oracle xuemingshen-oracle changed the title from "8365675: Add String.toCaseFold() to support Unicode Case-Folding" to "8365675: Add String Unicode Case-Folding Support" Sep 10, 2025
@openjdk openjdk bot changed the title from "8365675: Add String Unicode Case-Folding Support" to "8365675: Add String Unicode Case-Folding Support #26892" Sep 10, 2025
@xuemingshen-oracle xuemingshen-oracle force-pushed the JDK-8365675 branch 2 times, most recently from 4c47ef2 to 2e698dc September 25, 2025 07:10
@openjdk openjdk bot added the graal (graal-dev@openjdk.org), serviceability (serviceability-dev@openjdk.org), hotspot (hotspot-dev@openjdk.org), ide-support (ide-support-dev@openjdk.org), shenandoah (shenandoah-dev@openjdk.org), javadoc (javadoc-dev@openjdk.org), security (security-dev@openjdk.org), jmx (jmx-dev@openjdk.org), nio (nio-dev@openjdk.org), client (client-libs-dev@openjdk.org), net (net-dev@openjdk.org), and compiler (compiler-dev@openjdk.org) labels Sep 25, 2025

openjdk bot commented Sep 25, 2025

@xuemingshen-oracle client, compiler, graal, hotspot, ide-support, javadoc, jmx, net, nio, security, serviceability, shenandoah have been added to this pull request based on files touched in new commit(s).

@openjdk openjdk bot added the csr (Pull request needs approved CSR before integration) label Oct 1, 2025
@xuemingshen-oracle xuemingshen-oracle changed the title from "8365675: Add String Unicode Case-Folding Support #26892" to "8365675: Add String Unicode Case-Folding Support" Oct 1, 2025
@RogerRiggs
Contributor

/label remove client
/label remove graal
/label remove net
/label remove io
/label remove compiler
/label remove hotspot
/label remove ide-support
/label remove javadoc
/label remove jmx
/label remove nio
/label remove security
/label remove serviceability
/label remove shenandoah

@openjdk openjdk bot removed the client client-libs-dev@openjdk.org label Oct 3, 2025

openjdk bot commented Oct 3, 2025

@RogerRiggs
The client label was successfully removed.

@openjdk openjdk bot removed the graal graal-dev@openjdk.org label Oct 3, 2025

openjdk bot commented Oct 3, 2025

@RogerRiggs
The graal label was successfully removed.

@openjdk openjdk bot removed the net net-dev@openjdk.org label Oct 3, 2025

openjdk bot commented Oct 3, 2025

@RogerRiggs
The net label was successfully removed.

openjdk bot commented Oct 3, 2025

@RogerRiggs
The label io is not a valid label.
These labels are valid:

  • graal
  • serviceability
  • hotspot
  • hotspot-compiler
  • ide-support
  • i18n
  • shenandoah
  • jdk
  • javadoc
  • security
  • hotspot-runtime
  • jmx
  • build
  • nio
  • client
  • core-libs
  • compiler
  • net
  • hotspot-gc
  • hotspot-jfr

@openjdk openjdk bot removed the compiler compiler-dev@openjdk.org label Oct 3, 2025

openjdk bot commented Oct 3, 2025

@RogerRiggs
The compiler label was successfully removed.

@openjdk openjdk bot removed the hotspot hotspot-dev@openjdk.org label Oct 3, 2025

openjdk bot commented Oct 3, 2025

@RogerRiggs
The hotspot label was successfully removed.

@openjdk openjdk bot removed the ide-support ide-support-dev@openjdk.org label Oct 3, 2025

openjdk bot commented Oct 3, 2025

@RogerRiggs
The ide-support label was successfully removed.

@openjdk openjdk bot removed the javadoc javadoc-dev@openjdk.org label Oct 3, 2025

openjdk bot commented Oct 3, 2025

@RogerRiggs
The javadoc label was successfully removed.

@openjdk openjdk bot removed the jmx jmx-dev@openjdk.org label Oct 3, 2025

openjdk bot commented Oct 3, 2025

@RogerRiggs
The jmx label was successfully removed.

@openjdk openjdk bot removed the nio nio-dev@openjdk.org label Oct 3, 2025

openjdk bot commented Oct 3, 2025

@RogerRiggs
The nio label was successfully removed.

@openjdk openjdk bot removed the merge-conflict (Pull request has merge conflict with target branch) and security (security-dev@openjdk.org) labels Oct 3, 2025

openjdk bot commented Oct 3, 2025

@RogerRiggs
The security label was successfully removed.

@openjdk openjdk bot removed the serviceability serviceability-dev@openjdk.org label Oct 3, 2025

openjdk bot commented Oct 3, 2025

@RogerRiggs
The serviceability label was successfully removed.

@openjdk openjdk bot removed the shenandoah shenandoah-dev@openjdk.org label Oct 3, 2025

openjdk bot commented Oct 3, 2025

@RogerRiggs
The shenandoah label was successfully removed.

Contributor

We're going to need to find a more compact format for the data, individual 1 or 2 entry char arrays have a large overhead. Plus the map entries take a lot of space for the data and indexing.
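
One possible shape for a more compact layout, offered only as a sketch of the concern raised above and not as the format this PR uses or will adopt: keep the code points in a sorted int array, concatenate all folded sequences into a single shared char array, and locate each sequence through an offsets array. This replaces per-entry char arrays and map entries with three flat arrays. The class name CompactFoldTable, the array names, and the three-entry subset of CaseFolding.txt are assumptions chosen for illustration.

```java
import java.util.Arrays;

// Hypothetical compact table for 1:M foldings; NOT the PR's actual data layout.
class CompactFoldTable {
    // Sorted code points with a multi-char folding (tiny illustrative subset).
    private static final int[] KEYS = { 0x00DF, 0x0130, 0xFB00 };
    // OFFSETS[i] is where the folding of KEYS[i] starts in DATA;
    // the extra final entry marks the end of the last sequence.
    private static final int[] OFFSETS = { 0, 2, 4, 6 };
    // All folded sequences stored back to back in one array.
    private static final char[] DATA = { 's', 's', 'i', '\u0307', 'f', 'f' };

    // Returns the folded sequence, or null if the code point is not in the table.
    static String fold(int codePoint) {
        int i = Arrays.binarySearch(KEYS, codePoint);
        if (i < 0) {
            return null;
        }
        return new String(DATA, OFFSETS[i], OFFSETS[i + 1] - OFFSETS[i]);
    }

    public static void main(String[] args) {
        System.out.println(fold(0x00DF)); // ss
        System.out.println(fold(0xFB00)); // ff
        System.out.println(fold('a'));    // null
    }
}
```

Lookup costs a binary search instead of a hash probe, so whether this trade-off suits the comparison fast path would need to be measured.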

Labels

  • build (build-dev@openjdk.org)
  • core-libs (core-libs-dev@openjdk.org)
  • csr (Pull request needs approved CSR before integration)
  • i18n (i18n-dev@openjdk.org)
  • rfr (Pull request is ready for review)

2 participants