This repository was archived by the owner on Jun 1, 2023. It is now read-only.

Unicode normalization of identifiers/names #228

@rurban

Preferably as in perl6 (NFG, https://design.perl6.org/S15.html, their own invention with dynamic codepoints for unknown combinations, which is really only needed for strings, not identifiers), or just NFC (the standard, http://www.unicode.org/reports/tr15/).
Not decomposing to NFD as the macOS filesystem does, i.e. é 0xe9 => e◌́ 0x65 0x301
(see e.g. https://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames).
In cperl it is the other way round: e◌́ 0x65 0x301 => é 0xe9.
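
To make the two directions concrete, here is a minimal sketch in Python using the standard `unicodedata` module. It is only an illustration of NFC versus NFD on the é example above, not the cperl implementation; the helper names `to_nfc`, `to_nfd` and `codepoints` are made up for this example.

```python
import unicodedata

COMPOSED = "\u00e9"      # é as a single code point, U+00E9
DECOMPOSED = "e\u0301"   # e followed by COMBINING ACUTE ACCENT, U+0065 U+0301


def to_nfc(name: str) -> str:
    """Compose: the direction proposed for cperl identifiers."""
    return unicodedata.normalize("NFC", name)


def to_nfd(name: str) -> str:
    """Decompose: the direction the macOS filesystem takes."""
    return unicodedata.normalize("NFD", name)


def codepoints(s: str) -> list:
    """List the code points of s as hex strings, for inspection."""
    return [hex(ord(c)) for c in s]


if __name__ == "__main__":
    print(codepoints(to_nfd(COMPOSED)))    # decomposed form of é
    print(codepoints(to_nfc(DECOMPOSED)))  # composed form of e + U+0301
```

Both inputs render identically, but they only compare equal after being brought to the same normalization form.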

NFD is faster, but needs more space and is ugly.
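
The space cost of the decomposed form can be checked with a small sketch, again in Python and again only as an illustration: it compares the UTF-8 byte length of the same name under NFC and NFD. The helper name `utf8_len` is hypothetical.

```python
import unicodedata


def utf8_len(form: str, name: str) -> int:
    """Return the UTF-8 byte length of name after normalizing to form."""
    return len(unicodedata.normalize(form, name).encode("utf-8"))


if __name__ == "__main__":
    name = "caf\u00e9"
    # The decomposed form stores the base letter and the combining mark
    # separately, so it is never shorter than the composed form.
    print("NFC bytes:", utf8_len("NFC", name))
    print("NFD bytes:", utf8_len("NFD", name))
```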

This means that, as in Python 3 and unlike in perl6, Unicode identifiers (names for GVs and packages: variables, subs, packages, symbols) should be parsed and stored normalized, not as-is.
In perl5 and perl6 there is no difference between strings and identifiers; in cperl there should be.
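
As a rough sketch of what "stored normalized" means, the toy symbol table below normalizes every name to NFC on definition and lookup, while ordinary strings are left untouched. The `SymbolTable` class and `normalize_identifier` helper are hypothetical and only model the idea; they are not cperl internals. The last helper shows Python 3's own behaviour: it normalizes identifiers at parse time (to NFKC, per PEP 3131, rather than NFC).

```python
import unicodedata


def normalize_identifier(name: str) -> str:
    """Return the NFC form of an identifier (hypothetical helper)."""
    return unicodedata.normalize("NFC", name)


class SymbolTable:
    """Toy symbol table whose keys are always stored NFC-normalized."""

    def __init__(self):
        self._symbols = {}

    def define(self, name, value):
        self._symbols[normalize_identifier(name)] = value

    def lookup(self, name):
        return self._symbols[normalize_identifier(name)]

    def names(self):
        return list(self._symbols)


def python_identifier_key(source_name: str) -> str:
    """Bind source_name as a Python variable and return the stored key."""
    ns = {}
    exec(f"{source_name} = 1", ns)
    # exec adds "__builtins__" to the namespace; skip it.
    return next(k for k in ns if k != "__builtins__")


if __name__ == "__main__":
    table = SymbolTable()
    table.define("e\u0301", 42)       # defined with the decomposed spelling
    print(table.lookup("\u00e9"))     # found with the composed spelling
    print(table.names() == ["\u00e9"])
```

With this split, two spellings of the same identifier always resolve to one symbol, whereas string data keeps whatever form it was written in.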

In rakudo only the PERLIFY-STR method does this.

Measured slowdown: <1% as of 55831a8a986f304f53bd0bc0530085a9a4c72e92, and on darwin even 5.5% faster when only non-utf8 identifiers are used. utf8 identifiers are mostly slower because of the previous check_script call into the slow Unicode::UCD. The normalization checks and calls themselves are fast.

Related work:

A broader overview is at https://rosettacode.org/wiki/Unicode_variable_names
