Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

toLocaleUpperCase seems broken for greek on ICU 58 #9445

Closed
srl295 opened this issue Nov 3, 2016 · 13 comments
Closed

toLocaleUpperCase seems broken for greek on ICU 58 #9445

srl295 opened this issue Nov 3, 2016 · 13 comments
Assignees
Labels
i18n-api Issues and PRs related to the i18n implementation. semver-minor PRs that contain new features and should be released in the next minor version.

Comments

@srl295
Copy link
Member

srl295 commented Nov 3, 2016

ICU 58 (HEAD as of this writing)

> 'Πατάτα'.toLocaleUpperCase('el')
'ΠΑΤΆΤΑ'

should be ΠΑΤΑΤΑ (no accent)

investigate…

may be related to hard coded lists mentioned in http://bugs.icu-project.org/trac/ticket/12647

// cc @nodejs/intl @jshin

@srl295
Copy link
Member Author

srl295 commented Nov 3, 2016

I would also expect 'I'.toLocaleLowerCase('tr') === 'ı'.toLocaleLowerCase('tr') but it is false.

@srl295
Copy link
Member Author

srl295 commented Nov 3, 2016

{ http_parser: '2.7.0', node: '8.0.0-pre', v8: '5.4.500.36', uv: '1.10.0', zlib: '1.2.8', ares: '1.10.1-DEV', modules: '51', openssl: '1.0.2j', icu: '58.1', unicode: '9.0', cldr: '30.0.2', tz: '2016g' }

I just noticed that https://codereview.chromium.org/1812673005 mentions a icu_case_mapping flag. So perhaps this needs to be turned on for node.

@jungshik
Copy link

jungshik commented Nov 3, 2016

Yes, that flag needs to be turned on. In addition, with ICU 58, I can just use ICU's case conversion API for Greek uppercasing instead of relying on ICU transliteration API.

@mscdex mscdex added the i18n-api Issues and PRs related to the i18n implementation. label Nov 3, 2016
@srl295
Copy link
Member Author

srl295 commented Nov 3, 2016

$ ./node --icu_case_mapping -p "'Πατάτα'.toLocaleUpperCase('el')"
ΠΑΤΑΤΑ
$ ./node --icu_case_mapping -p "'I'.toLocaleLowerCase('tr') === 'ı'.toLocaleLowerCase('tr')"
true

@srl295 srl295 self-assigned this Nov 3, 2016
@srl295 srl295 added the semver-minor PRs that contain new features and should be released in the next minor version. label Nov 3, 2016
srl295 added a commit to srl295/node that referenced this issue Nov 4, 2016
* toLocaleUpperCase() and toLocaleLowerCase() do not function properly
without this flag.
* basic test case. The test case would fail if `--no_icu_case_mapping`
was set.

Fixes: nodejs#9445
PR-URL: nodejs#9454
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
@srl295 srl295 closed this as completed in 1a55e9a Nov 4, 2016
@jungshik
Copy link

jungshik commented Nov 5, 2016

For the record, it is behind the flag because there's a performance issue for characters below u+0080 or u+0100.

evanlucas pushed a commit that referenced this issue Nov 7, 2016
* toLocaleUpperCase() and toLocaleLowerCase() do not function properly
without this flag.
* basic test case. The test case would fail if `--no_icu_case_mapping`
was set.

Fixes: #9445
PR-URL: #9454
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
@jungshik
Copy link

FYI, a CL was landed last night in v8's ToT to enable icu_case_mapping by default. The performance for [U+0100, U+FFFF] is 60~80% of non-ICU approach, but the correctness trumped the performance.
We'll look into this performance issue on both the ICU-side and the v8-side.

@HansBrende
Copy link

@jungshik @srl295 I'm a little confused here... the Turkish I <-> ı mapping is defined as "Conditional", "Language-Sensitive" mapping by http://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt

However... all of the greek special case mappings appear to be defined as "Language-Insensitive" or "Unconditional" mappings!

So why then is toLocaleUpperCase('el') required to achieve correct results and not simply toUpperCase()?

Am I missing something? 🧐

@srl295
Copy link
Member Author

srl295 commented Jan 24, 2022

@markusicu can you comment on this last? ICU4C tests seems to also use el for these mappings. Should the el parameter be required for these mappings?

// ICU4C test case
#include <unicode/ustring.h>
#include <unicode/ustdio.h>
#include <unicode/errorcode.h>

int main(int argc, const char **argv) {
  const char16_t in[] = u"Πατάτα";
  const char *loc = "el"; // or ""
  char16_t str[256];
  icu::ErrorCode status;
  u_strToUpper(str, 256, in, -1, loc, status);
  u_printf_u(u"Status: %s, loc %s\n", status.errorName(), loc);
  if (status.isFailure()) {
    return 1;
  }
  u_printf_u(u"=> %S\n", str);
  return 0;
}
loc=""
Status: U_ZERO_ERROR, loc (root locale)
=> ΠΑΤΆΤΑ
loc="el"
Status: U_ZERO_ERROR, loc el
=> ΠΑΤΑΤΑ

@markusicu
Copy link

The UTC has given up defining language-specific case mappings in the Unicode Character Database and in the Unicode core spec. It defers to CLDR+ICU for new work in this area.

In particular, Greek uppercasing is quite complicated, and cannot be expressed with the limited machinery in UCD SpecialCasing.txt. It looks like Steven created this issue here because in ICU 58 I implemented proper Greek uppercasing. Look for "Greek" in https://icu.unicode.org/download/58

@HansBrende
Copy link

HansBrende commented Jan 24, 2022

@markusicu gotcha, makes sense. I guess that still begs the question, though: why would locale "en" (for example) not follow the same rules for greek uppercasing as locale "el"? Shouldn't the greek language have the final say on what proper greek uppercasing is? It's not like the tr/az vs. en I situation where the two languages are using the same alphabet and actually disagree on what the rules should be... or am I wrong?

@markusicu
Copy link

  • Unicode is not going to change its default case mapping behavior; default case mapping should continue to follow SpecialCasing.txt.
  • The special Greek uppercasing is complex, and likely slower than the default.
  • I believe that the fancy behavior is modern Greek usage, and does not necessarily apply to older Greek text.
  • The fancy behavior is quite lossy and really changes the text.

@HansBrende
Copy link

@markusicu cool, all makes sense. Thanks for the explanations; this has been enlightening!

@srl295
Copy link
Member Author

srl295 commented Jan 24, 2022

@markusicu yes, thanks for replying here. Will leave this issue closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
i18n-api Issues and PRs related to the i18n implementation. semver-minor PRs that contain new features and should be released in the next minor version.
Projects
None yet
Development

No branches or pull requests

5 participants