This repository was archived by the owner on Jun 1, 2023. It is now read-only.

Unicode TR39 parser bugs: (HALFWIDTH) HANGUL FILLER #166

@rurban

In current Unicode there are two characters that render like whitespace but are not classified as whitespace. Instead they are valid ID_Start and ID_Continue characters: U+3164 HANGUL FILLER and U+FFA0 HALFWIDTH HANGUL FILLER.

Fix them up, at least in cperl, by treating them as invalid identifier characters, while leaving the space and whitespace categories unchanged.
They are typical whitespace confusables that were wrongly assigned ID_Start and ID_Continue.
Their default PropList property is Other_Default_Ignorable_Code_Point.
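
A minimal Python sketch of the proposed fix, not the cperl implementation: it inspects the two fillers with the standard `unicodedata` module and wraps `str.isidentifier()` with an explicit rejection. The names `HANGUL_FILLERS`, `describe`, and `is_strict_identifier` are made up for this illustration.

```python
import unicodedata

# The two code points discussed in this issue.
HANGUL_FILLERS = ("\u3164", "\uffa0")


def describe(ch):
    """Return (code point, name, general category, is-whitespace) for ch."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        ch.isspace(),
    )


def is_strict_identifier(name):
    """Reject any identifier containing a (HALFWIDTH) HANGUL FILLER.

    Everything else is delegated to str.isidentifier(), so the result
    does not depend on how the running Unicode version classifies the
    fillers themselves.
    """
    if any(ch in HANGUL_FILLERS for ch in name):
        return False
    return name.isidentifier()


if __name__ == "__main__":
    for filler in HANGUL_FILLERS:
        print(describe(filler))
    print(is_strict_identifier("foo\u3164bar"))
    print(is_strict_identifier("foo_bar"))
```

Both fillers report general category `Lo` and are not whitespace according to `str.isspace()`, which is why an ordinary identifier check can let them through.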

See https://github.com/jagracey/Awesome-Unicode#user-content-variable-identifiers-can-effectively-include-whitespace.

In a more Korean-friendly environment, we could allow a Hangul filler in ID_Start position if the next character is a valid Hangul ID_Continue character. Likewise, we could allow a Hangul filler in ID_Continue position if the previous and next characters are valid Hangul ID_Start or ID_Continue characters.
But those fillers should really be treated as whitespace and ignored.
All valid-word checks would also need to change and would become much slower, because we currently test only single characters for ID_Start or ID_Continue.
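
The context-sensitive rule described above could be sketched roughly as follows. This is an illustration in Python, not cperl code. The helper names are hypothetical, and the definition of a "valid Hangul character" (category `Lo`, a Unicode name containing `HANGUL` but not `FILLER`) is an assumption made for this example.

```python
import unicodedata

# The two fillers this issue is about.
FILLERS = {"\u3164", "\uffa0"}


def is_hangul_letter(ch):
    """Assumed definition: an Lo character named HANGUL ..., excluding fillers."""
    name = unicodedata.name(ch, "")
    return (
        unicodedata.category(ch) == "Lo"
        and "HANGUL" in name
        and "FILLER" not in name
    )


def filler_allowed(ident, i):
    """Apply the neighbor rule to the filler at index i of ident."""
    has_next = i + 1 < len(ident) and is_hangul_letter(ident[i + 1])
    if i == 0:
        # ID_Start position: only the following character is checked.
        return has_next
    # ID_Continue position: both neighbors must be Hangul letters.
    return has_next and is_hangul_letter(ident[i - 1])


def is_valid_identifier(ident):
    """Accept fillers only between Hangul letters; reject them elsewhere."""
    for i, ch in enumerate(ident):
        if ch in FILLERS and not filler_allowed(ident, i):
            return False
    # Check the remaining characters with the fillers ignored.
    stripped = "".join(ch for ch in ident if ch not in FILLERS)
    return stripped.isidentifier()
```

Note that the check now has to look at neighboring characters for every filler, which illustrates the extra cost compared with a per-character ID_Start or ID_Continue lookup.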

http://www.unicode.org/L2/L2006/06310-hangul-decompose9.pdf explains:

The two other hangul fillers HANGUL CHOSEONG FILLER (Lf), i.e. lead filler, and HANGUL JUNGSEONG FILLER (Vf) are used as placeholders for missing letters, where there should be at least one letter.

... that leaves the (HALFWIDTH) HANGUL FILLERs useless. Indeed, they should not be rendered at all, despite that they have been given the property Lo. Note that these FILLERs are also given the property of Default_Ignorable_Codepoint.

Note that the standard normal forms NFKD and NFKC ... return (in all views) incorrect results for strings containing these characters.
