fix fc (full case-folding) #332

Closed
rurban opened this issue Sep 16, 2017 · 1 comment

@rurban rurban commented Sep 16, 2017

perl claims to do full case-folding with fc, but omits the normalization step.
Not even Unicode::UCD, which implements all the non-default case-folding options (full, simple, mapping, status, turkic), offers full case-folding.

Casefolding is the process of mapping strings to a form where case differences are erased;
comparing two strings in their casefolded form is effectively a way of asking if two strings
are equal, regardless of case.

See http://perl11.org/blog/foldcase.html
Without normalization, comparing a composed string to a decomposed one fails even after case-folding.

Add it, starting with the fast NFD form. (We call it FCD, fast fold-cased decomposed.) We'll probably stay there to save CPU and memory.
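
The fold-then-decompose idea can be illustrated outside perl. The following is a minimal sketch in Python, assuming that `str.casefold()` (Unicode full case-folding) stands in for perl's fc and that `unicodedata.normalize` supplies the NFD step. `fcd` is a hypothetical helper name, not cperl's implementation.

```python
import unicodedata

def fcd(s: str) -> str:
    """Hypothetical helper: full case-fold, then canonically decompose (NFD)."""
    return unicodedata.normalize("NFD", s.casefold())

# The same letter, once precomposed and once decomposed:
composed = "\u1f0a"                # GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA
decomposed = "\u0391\u0313\u0300"  # CAPITAL ALPHA + COMBINING COMMA ABOVE + COMBINING GRAVE

# Case-folding alone leaves the two spellings different ...
folded_only_equal = composed.casefold() == decomposed.casefold()
# ... while folding followed by NFD makes them compare equal.
fcd_equal = fcd(composed) == fcd(decomposed)

print(folded_only_equal, fcd_equal)
```

Note that Unicode's definition of canonical caseless matching also decomposes the input before folding; this sketch only shows the fold-then-NFD step described above.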

Also use the new optimized -ind Normalize canon table, implemented for safeclib wcsfc_s(), which does full case-folding on wchar_t, not utf-8. This is also needed for better memory usage with unicode identifiers. It needs about 3x less memory than the current unfcan.h: 174772 => 55520. See #334 for this.

Note that most fc results are already NFD; only the Greek PROSGEGRAMMENI and YPOGEGRAMMENI characters U+1F80 .. U+1FF4 need to be decomposed as well.

TODO: on systems with sizeof(wchar_t)==2 (Windows, Cygwin, and 32-bit AIX/Solaris), create the normalization tables with surrogate pairs already in place. The fc tables are always below 0xffff.
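
For reference, the surrogate-pair encoding such tables would need for code points above 0xffff can be sketched as follows. This is an illustration in Python of the standard UTF-16 arithmetic, not the table generator itself; `to_surrogate_pair` is a hypothetical name.

```python
from typing import Tuple

def to_surrogate_pair(cp: int) -> Tuple[int, int]:
    """Split a supplementary code point into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

# U+1D15E MUSICAL SYMBOL HALF NOTE lies above 0xffff and has a canonical decomposition.
pair = to_surrogate_pair(0x1D15E)
print([hex(unit) for unit in pair])
```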

For example:

  • fc "\x{1F0A}\x{345}" => "\x{1F02}\x{3B9}", but it should be "\x{1F82}" if normalized to NFC.
  • Or, if decomposed to NFD: fc "\x{1F8A}" => U+1F82 => U+1F02 + U+0345, but perl gives "\x{1F02}\x{3B9}".
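
The two cases above can be reproduced with any implementation of the Unicode full case-folding table. Here is a small sketch in Python, assuming `str.casefold()` behaves like perl's fc for these inputs; the variable names are illustrative only.

```python
import unicodedata

# Both spellings of capital alpha with psili, varia and iota subscript:
folded_a = "\u1f0a\u0345".casefold()   # base letter + COMBINING GREEK YPOGEGRAMMENI
folded_b = "\u1f8a".casefold()         # precomposed, with PROSGEGRAMMENI

# Full case-folding maps the iota subscript to a full letter U+03B9,
# so NFC cannot recompose the result into U+1F82.
recomposed = unicodedata.normalize("NFC", folded_a)

# The canonical decomposition of U+1F82 uses U+0345, not U+03B9.
canon = unicodedata.decomposition("\u1f82")

print([hex(ord(c)) for c in folded_a], [hex(ord(c)) for c in folded_b], canon)
```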
@rurban rurban self-assigned this Sep 16, 2017
@rurban rurban added the ready label Oct 23, 2017
rurban added a commit that referenced this issue Nov 1, 2017
perl does NFKC composition for case-folding, not just NFD/FCD decomposition.
See the FC NFKC property.
See [cperl #332]
rurban added a commit that referenced this issue Nov 2, 2017
perl does NFKC composition for case-folding, not just NFD/FCD decomposition.
See the FC NFKC property.
But not consistently.
E.g. fc "\x{1F0A}\x{345}" => "\x{1F02}\x{3B9}", but should be "\x{1F82}"
if normalized to NFC.  But it denormalizes to NFD (aka FCD):
fc "\x{1F8A}" => U+1F82 => U+1F02 + U+03B9

See [cperl #332]

@rurban rurban commented Dec 5, 2017

Released with 5.27.2

@rurban rurban closed this Dec 5, 2017
@ghost ghost removed the ready label Dec 5, 2017