toLocaleUpperCase seems broken for greek on ICU 58 #9445

srl295 · 2016-11-03T16:02:46Z

ICU 58 (HEAD as of this writing)

> 'Πατάτα'.toLocaleUpperCase('el')
'ΠΑΤΆΤΑ'

should be ΠΑΤΑΤΑ (no accent)

investigate…

may be related to hard coded lists mentioned in http://bugs.icu-project.org/trac/ticket/12647

// cc @nodejs/intl @jshin

The text was updated successfully, but these errors were encountered:

srl295 · 2016-11-03T17:10:41Z

I would also expect 'I'.toLocaleLowerCase('tr') === 'ı'.toLocaleLowerCase('tr') but it is false.

srl295 · 2016-11-03T17:45:35Z

{ http_parser: '2.7.0', node: '8.0.0-pre', v8: '5.4.500.36', uv: '1.10.0', zlib: '1.2.8', ares: '1.10.1-DEV', modules: '51', openssl: '1.0.2j', icu: '58.1', unicode: '9.0', cldr: '30.0.2', tz: '2016g' }

I just noticed that https://codereview.chromium.org/1812673005 mentions a icu_case_mapping flag. So perhaps this needs to be turned on for node.

jungshik · 2016-11-03T18:32:24Z

Yes, that flag needs to be turned on. In addition, with ICU 58, I can just use ICU's case conversion API for Greek uppercasing instead of relying on ICU transliteration API.

srl295 · 2016-11-03T20:34:32Z

$ ./node --icu_case_mapping -p "'Πατάτα'.toLocaleUpperCase('el')"
ΠΑΤΑΤΑ
$ ./node --icu_case_mapping -p "'I'.toLocaleLowerCase('tr') === 'ı'.toLocaleLowerCase('tr')"
true

* toLocaleUpperCase() and toLocaleLowerCase() do not function properly without this flag. * basic test case. The test case would fail if `--no_icu_case_mapping` was set. Fixes: nodejs#9445 PR-URL: nodejs#9454 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>

jungshik · 2016-11-05T17:51:21Z

For the record, it is behind the flag because there's a performance issue for characters below u+0080 or u+0100.

* toLocaleUpperCase() and toLocaleLowerCase() do not function properly without this flag. * basic test case. The test case would fail if `--no_icu_case_mapping` was set. Fixes: #9445 PR-URL: #9454 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>

jungshik · 2016-12-21T18:02:37Z

FYI, a CL was landed last night in v8's ToT to enable icu_case_mapping by default. The performance for [U+0100, U+FFFF] is 60~80% of non-ICU approach, but the correctness trumped the performance.
We'll look into this performance issue on both the ICU-side and the v8-side.

HansBrende · 2022-01-23T05:15:04Z

@jungshik @srl295 I'm a little confused here... the Turkish I <-> ı mapping is defined as "Conditional", "Language-Sensitive" mapping by http://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt

However... all of the greek special case mappings appear to be defined as "Language-Insensitive" or "Unconditional" mappings!

So why then is toLocaleUpperCase('el') required to achieve correct results and not simply toUpperCase()?

Am I missing something? 🧐

srl295 · 2022-01-24T18:46:48Z

@markusicu can you comment on this last? ICU4C tests seems to also use el for these mappings. Should the el parameter be required for these mappings?

// ICU4C test case
#include <unicode/ustring.h>
#include <unicode/ustdio.h>
#include <unicode/errorcode.h>

int main(int argc, const char **argv) {
  const char16_t in[] = u"Πατάτα";
  const char *loc = "el"; // or ""
  char16_t str[256];
  icu::ErrorCode status;
  u_strToUpper(str, 256, in, -1, loc, status);
  u_printf_u(u"Status: %s, loc %s\n", status.errorName(), loc);
  if (status.isFailure()) {
    return 1;
  }
  u_printf_u(u"=> %S\n", str);
  return 0;
}

loc=""
Status: U_ZERO_ERROR, loc (root locale)
=> ΠΑΤΆΤΑ

loc="el"
Status: U_ZERO_ERROR, loc el
=> ΠΑΤΑΤΑ

markusicu · 2022-01-24T19:52:56Z

The UTC has given up defining language-specific case mappings in the Unicode Character Database and in the Unicode core spec. It defers to CLDR+ICU for new work in this area.

In particular, Greek uppercasing is quite complicated, and cannot be expressed with the limited machinery in UCD SpecialCasing.txt. It looks like Steven created this issue here because in ICU 58 I implemented proper Greek uppercasing. Look for "Greek" in https://icu.unicode.org/download/58

HansBrende · 2022-01-24T20:02:52Z

@markusicu gotcha, makes sense. I guess that still begs the question, though: why would locale "en" (for example) not follow the same rules for greek uppercasing as locale "el"? Shouldn't the greek language have the final say on what proper greek uppercasing is? It's not like the tr/az vs. en I situation where the two languages are using the same alphabet and actually disagree on what the rules should be... or am I wrong?

markusicu · 2022-01-24T20:37:41Z

Unicode is not going to change its default case mapping behavior; default case mapping should continue to follow SpecialCasing.txt.
The special Greek uppercasing is complex, and likely slower than the default.
I believe that the fancy behavior is modern Greek usage, and does not necessarily apply to older Greek text.
The fancy behavior is quite lossy and really changes the text.

HansBrende · 2022-01-24T20:40:29Z

@markusicu cool, all makes sense. Thanks for the explanations; this has been enlightening!

srl295 · 2022-01-24T20:50:01Z

@markusicu yes, thanks for replying here. Will leave this issue closed.

mscdex added the i18n-api Issues and PRs related to the i18n implementation. label Nov 3, 2016

srl295 self-assigned this Nov 3, 2016

srl295 added the semver-minor PRs that contain new features and should be released in the next minor version. label Nov 3, 2016

srl295 mentioned this issue Nov 3, 2016

src: default --icu_case_mapping on as a v8 option #9454

Closed

3 tasks

srl295 closed this as completed in 1a55e9a Nov 4, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

toLocaleUpperCase seems broken for greek on ICU 58 #9445

toLocaleUpperCase seems broken for greek on ICU 58 #9445

srl295 commented Nov 3, 2016

srl295 commented Nov 3, 2016

srl295 commented Nov 3, 2016

jungshik commented Nov 3, 2016

srl295 commented Nov 3, 2016 •

edited

Loading

jungshik commented Nov 5, 2016

jungshik commented Dec 21, 2016

HansBrende commented Jan 23, 2022

srl295 commented Jan 24, 2022

markusicu commented Jan 24, 2022

HansBrende commented Jan 24, 2022 •

edited

Loading

markusicu commented Jan 24, 2022

HansBrende commented Jan 24, 2022

srl295 commented Jan 24, 2022

toLocaleUpperCase seems broken for greek on ICU 58 #9445

toLocaleUpperCase seems broken for greek on ICU 58 #9445

Comments

srl295 commented Nov 3, 2016

srl295 commented Nov 3, 2016

srl295 commented Nov 3, 2016

jungshik commented Nov 3, 2016

srl295 commented Nov 3, 2016 • edited Loading

jungshik commented Nov 5, 2016

jungshik commented Dec 21, 2016

HansBrende commented Jan 23, 2022

srl295 commented Jan 24, 2022

markusicu commented Jan 24, 2022

HansBrende commented Jan 24, 2022 • edited Loading

markusicu commented Jan 24, 2022

HansBrende commented Jan 24, 2022

srl295 commented Jan 24, 2022

srl295 commented Nov 3, 2016 •

edited

Loading

HansBrende commented Jan 24, 2022 •

edited

Loading