Unicode normalization of identifiers/names

Preferably as in perl6 (NFG, https://design.perl6.org/S15.html: their invention, with dynamic codepoints for unknown combinations, which is really only needed for strings, not identifiers) or just NFC (the standard, http://www.unicode.org/reports/tr15/).
Not as on the macOS filesystem, which decomposes to NFD (i.e. é 0xe9 => e◌́ 0x65 0x301; see e.g. https://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames).
In cperl it's vice versa: e◌́ 0x65 0x301 => é 0xe9.
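
The two directions can be illustrated with a minimal Python sketch. It only uses the standard `unicodedata` module; the variable names are made up for the illustration and have nothing to do with the cperl implementation.

```python
import unicodedata

decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT (0x65 0x301)
precomposed = "\u00e9"   # é as a single codepoint (0xe9)

# NFC composes: this is the direction proposed for cperl identifiers
nfc = unicodedata.normalize("NFC", decomposed)
# NFD decomposes: this is the direction of the macOS filesystem example
nfd = unicodedata.normalize("NFD", precomposed)

print([hex(ord(c)) for c in nfc])   # ['0xe9']
print([hex(ord(c)) for c in nfd])   # ['0x65', '0x301']
```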

NFD is faster, but needs more space and is ugly.
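
The space cost of the decomposed form can be checked with a small sketch, assuming names are stored as UTF-8. The sample name is arbitrary, and the sketch says nothing about the speed claim.

```python
import unicodedata

name = "caf\u00e9"  # arbitrary sample identifier containing é

# Encode both normalization forms as UTF-8 and compare their sizes
nfc_bytes = unicodedata.normalize("NFC", name).encode("utf-8")
nfd_bytes = unicodedata.normalize("NFD", name).encode("utf-8")

print(len(nfc_bytes))  # 5: c, a, f plus 2 bytes for é
print(len(nfd_bytes))  # 6: c, a, f, e plus 2 bytes for the combining accent
```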

This means that, as in Python 3 and unlike perl6, unicode identifiers (names for gv and packages: variables, subs, packages, symbols) should be parsed and stored normalized, not as is.
In perl5 and perl6 there's no difference between strings and identifiers; in cperl there should be.
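
A rough sketch of the idea, in Python rather than in the cperl C sources: a hypothetical `SymbolTable` normalizes every name to NFC when it is defined and when it is looked up, while ordinary string data is left untouched. The class and method names are assumptions made up for this illustration.

```python
import unicodedata

class SymbolTable:
    """Hypothetical symbol table that stores identifier names in NFC."""

    def __init__(self):
        self._symbols = {}

    @staticmethod
    def _key(name):
        # Normalize once, at the point where the parser hands over a name
        return unicodedata.normalize("NFC", name)

    def define(self, name, value):
        self._symbols[self._key(name)] = value

    def lookup(self, name):
        return self._symbols[self._key(name)]

table = SymbolTable()
table.define("caf\u00e9", 1)          # defined in precomposed spelling
found = table.lookup("cafe\u0301")    # found via the decomposed spelling

# String data is not normalized, so the two spellings stay distinct
strings_equal = ("caf\u00e9" == "cafe\u0301")
print(found, strings_equal)  # 1 False
```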

In rakudo only the PERLIFY-STR method does this.

Measured slowdown: <1% as of 55831a8a986f304f53bd0bc0530085a9a4c72e92; on darwin it is even 5.5% faster for code using non-utf8 identifiers. utf8 identifiers are mostly slower because of the preceding check_script call into the slow Unicode::UCD. The normalization checks and calls themselves are fast.

Related work:
* Python 3: normalizes to NFKC (2007). https://docs.python.org/3/reference/lexical_analysis.html#identifiers
* R: locale-dependent letters. With utf8 it even detects same-script spoofs. The stringi package uses dynamic normalization similar to perl6.
* Java: similar to perl5, not normalized.
* JavaScript: similar to perl5, but names are stored as UCS-2, which disallows supplementary Unicode characters and allows escapes. Not normalized. https://mathiasbynens.be/notes/javascript-identifiers
* C#: not normalized. Source is required to be in NFC, but diagnostics are implementation dependent. https://msdn.microsoft.com/en-us/library/aa664670(v=vs.71).aspx
* Go: similar to perl5. Anything goes. Not normalized. https://golang.org/ref/spec#Identifiers
* Julia: similar to perl5. http://docs.julialang.org/en/latest/manual/variables/ Normalization was discussed in 2014 and rejected: https://github.com/JuliaLang/julia/issues/5434
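
The Python 3 entry in the list above can be checked directly. The sketch below compiles two small assignments with `exec` and inspects the names that end up in the namespace; the identifiers themselves are arbitrary examples.

```python
namespace = {}

# Identifier written with a decomposed é (e + U+0301)
exec("cafe\u0301 = 42", namespace)
# Identifier written with the ligature U+FB01, a compatibility character
exec("\ufb01le = 7", namespace)

# Python stores the NFKC-normalized names, not the spellings from the source
has_composed = "caf\u00e9" in namespace
has_decomposed = "cafe\u0301" in namespace
has_plain_file = "file" in namespace
print(has_composed, has_decomposed, has_plain_file)  # True False True
```

NFKC goes further than the NFC proposed here: it also folds compatibility characters such as the ligature, which NFC would leave alone.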

A bigger overview is https://rosettacode.org/wiki/Unicode_variable_names
