Unicode normalization of identifiers/names #228
Description
Preferably as in perl6 (NFG, https://design.perl6.org/S15.html: their invention with dynamic codepoints for unknown combinations, which is really only needed for strings, not identifiers), or just NFC (the standard, http://www.unicode.org/reports/tr15/).
Not as on the macOS filesystem, which decomposes to NFD, i.e. é 0xe9 => e◌́ 0x65 0x301
(see e.g. https://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames).
In cperl it's vice versa: e◌́ 0x65 0x301 => é 0xe9.
NFD is faster, but needs more space and is ugly.
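A minimal sketch of the two directions, using Python's standard `unicodedata` module only as a convenient way to show the code points (cperl does this in C; nothing here is cperl's implementation):

```python
import unicodedata

composed = "\u00e9"      # é as one code point, U+00E9
decomposed = "e\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

# NFD (what the macOS filesystem does): é => e + U+0301
nfd = unicodedata.normalize("NFD", composed)
# NFC (the direction proposed here): e + U+0301 => é
nfc = unicodedata.normalize("NFC", decomposed)

# Both spellings render the same, but compare unequal until normalized.
same_before = composed == decomposed
same_after = nfc == composed and nfd == decomposed

# The decomposed form needs more space: 2 code points / 3 UTF-8 bytes
# versus 1 code point / 2 UTF-8 bytes for the composed form.
sizes = (len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
print(same_before, same_after, sizes)
```

The byte counts illustrate the "needs more space" point for the decomposed form.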
Which means that, as in Python 3 and unlike perl6, Unicode identifiers (names for GVs and packages: variables, subs, packages, symbols) should be parsed and stored normalized, not as-is.
In perl5 and perl6 there's no difference between strings and identifiers; in cperl there should be.
In rakudo only the PERLIFY-STR method does this.
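To make the string/identifier distinction concrete, here is a rough sketch in Python. The `SymbolTable` class and its methods are hypothetical and only illustrate the idea; they are not how cperl stores its symbols. Names are normalized to NFC on the way in and on lookup, while string values are kept exactly as written:

```python
import unicodedata


class SymbolTable:
    """Hypothetical table: names are stored NFC-normalized, values as-is."""

    def __init__(self):
        self._symbols = {}

    @staticmethod
    def _key(name):
        # Identifiers are normalized; this is the only place it happens.
        return unicodedata.normalize("NFC", name)

    def define(self, name, value):
        self._symbols[self._key(name)] = value

    def lookup(self, name):
        return self._symbols[self._key(name)]

    def names(self):
        return list(self._symbols)


table = SymbolTable()
# Defined with the decomposed spelling; the value is a decomposed string.
table.define("cafe\u0301", "e\u0301")

# Found again with the composed spelling, because both map to the same key.
value = table.lookup("caf\u00e9")
stored_names = table.names()
print(stored_names, len(value))
```

The stored name is the composed `café`, while the string value still has two code points, i.e. it was not touched.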
Measured slowdown: <1% as of 55831a8a986f304f53bd0bc0530085a9a4c72e92; on darwin it is even 5.5% faster with non-utf8 identifiers. utf8 identifiers are mostly slower due to the previous check_script call to the slow Unicode::UCD. Normalization checks and calls are fast.
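One plausible reason the cost stays low is that most names never need the full normalization. The sketch below shows that fast-path idea in Python; `normalize_identifier` is a hypothetical helper, and whether cperl structures its checks this way is an assumption, not something stated above. It relies on `str.isascii()` (Python 3.7+) and `unicodedata.is_normalized()` (Python 3.8+):

```python
import unicodedata


def normalize_identifier(name: str) -> str:
    """Return the NFC form of name, skipping work where possible."""
    # Pure ASCII names are always already in NFC.
    if name.isascii():
        return name
    # Check first: names that are already NFC need no rewriting.
    if unicodedata.is_normalized("NFC", name):
        return name
    # Only the remaining names pay for the actual normalization.
    return unicodedata.normalize("NFC", name)


ascii_name = normalize_identifier("counter")
already_nfc = normalize_identifier("caf\u00e9")
needs_work = normalize_identifier("cafe\u0301")
print(ascii_name, already_nfc == needs_work)
```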
Related work:
- Python 3: normalizes to NFKC (2007) https://docs.python.org/3/reference/lexical_analysis.html#identifiers
- R: locale dependent letters. with utf8 even detects same-script spoofs. uses similar to perl6 dynamic normalization in the stringi package.
- Java: similar to perl5, not normalized.
- JavaScript: similar to perl5, but names stored as UCS-2, which disallows supplementary Unicode characters and allows escapes. not normalized. https://mathiasbynens.be/notes/javascript-identifiers
- C#: not normalized. source in NFC required, but diagnostics implementation dependent. https://msdn.microsoft.com/en-us/library/aa664670(v=vs.71).aspx
- Go: similar to perl5. anything goes. not normalized. https://golang.org/ref/spec#Identifiers
- Julia: similar to perl5. http://docs.julialang.org/en/latest/manual/variables/. Normalization was discussed in 2014 and rejected: "canonicalize unicode identifiers", JuliaLang/julia#5434
A bigger overview is https://rosettacode.org/wiki/Unicode_variable_names
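The Python 3 entry can be checked directly. The sketch below uses `exec` with an explicit namespace dict so the stored names are visible; the variable names are made up for the illustration. Note that Python uses NFKC, so compatibility characters such as the `ﬁ` ligature (U+FB01) are folded as well, which plain NFC would not do:

```python
namespace = {}

# Source text spells the name with e + U+0301 (decomposed).
exec("e\u0301 = 42", namespace)

# The parser stored the normalized (composed) name instead.
has_composed = "\u00e9" in namespace
has_decomposed = "e\u0301" in namespace

# Code using the composed spelling refers to the same variable.
exec("result = \u00e9 + 1", namespace)
result = namespace["result"]

# NFKC also folds compatibility characters: the U+FB01 ligature becomes "fi".
exec("\ufb01le = 1", namespace)
has_folded = "file" in namespace

print(has_composed, has_decomposed, result, has_folded)
```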