Add parser to generate names. by bramstein · Pull Request #41 · node-unicode/node-unicode-data

bramstein · 2016-07-01T07:41:31Z

I have a couple of use cases where I need to look up the official name for code points. Rather than writing yet another Unicode parser, I thought you might be interested in adding this.

Right now, it maps everything in the PUA to "Private Use Area XX", but perhaps we should drop that (it isn't really a name). Let me know and I will change that.

mathiasbynens · 2016-07-01T09:54:31Z

Let’s use Name instead of Names, as that seems to be the canonical property name per http://unicode.org/reports/tr18/#RL2.5.

These names are not correct:

    [0, '<control>'],
    [1, '<control>'],
    [2, '<control>'],
    [3, '<control>'],
    [4, '<control>'],
    [5, '<control>'],
    [6, '<control>'],
    [7, '<control>'],
    [8, '<control>'],
    [9, '<control>'],
    [10, '<control>'],
    [11, '<control>'],
    [12, '<control>'],
    [13, '<control>'],
    [14, '<control>'],
    [15, '<control>'],
    [16, '<control>'],
    [17, '<control>'],
    [18, '<control>'],
    [19, '<control>'],
    [20, '<control>'],
    [21, '<control>'],
    [22, '<control>'],
    [23, '<control>'],
    [24, '<control>'],
    [25, '<control>'],
    [26, '<control>'],
    [27, '<control>'],
    [28, '<control>'],
    [29, '<control>'],
    [30, '<control>'],
    [31, '<control>'],

http://unicode.org/Public/UNIDATA/NamesList.txt (info) might be a better source, although it’s not supposed to be used for automated parsing. Let’s find out where it gets those names and use that source directly.

bramstein · 2016-07-01T14:12:44Z

Good point. I renamed it to "Name".

It looks like those names for control characters were taken from the old Unicode 1.0 names. I've added a special case for when a code point's name is <control> and it has an old Unicode 1 name.
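The special case described above could look roughly like the following. This is a minimal sketch rather than the code in this patch: `parseNameRecord` is a hypothetical helper, and it assumes the input is a semicolon-separated `UnicodeData.txt` record in which field 1 holds the name and field 10 holds the Unicode 1.0 name.

```javascript
// Field positions in a semicolon-separated UnicodeData.txt record.
const NAME_FIELD = 1;
const UNICODE_1_NAME_FIELD = 10;

// Hypothetical helper: returns [codePoint, name] for one record.
function parseNameRecord(line) {
  const fields = line.split(';');
  const codePoint = parseInt(fields[0], 16);
  let name = fields[NAME_FIELD];
  const oldName = fields[UNICODE_1_NAME_FIELD];
  // Control characters carry the placeholder "<control>", so fall back
  // to the Unicode 1.0 name when one is present.
  if (name === '<control>' && oldName) {
    name = oldName;
  }
  return [codePoint, name];
}

console.log(parseNameRecord('0007;<control>;Cc;0;BN;;;;;N;BELL;;;;'));
console.log(parseNameRecord('0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;'));
```

A record with an empty field 10 keeps the `<control>` placeholder, so those code points still need separate handling.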

How do you feel about the private use area names?

mathiasbynens · 2016-07-01T14:36:20Z

It’s not just control characters and private use area symbols but also CJK ideograph extensions, e.g.

    [13312, 'CJK Ideograph Extension A'],
    [13313, 'CJK Ideograph Extension A'],
    [13314, 'CJK Ideograph Extension A'],
    [13315, 'CJK Ideograph Extension A'],
    [13316, 'CJK Ideograph Extension A'],
    [13317, 'CJK Ideograph Extension A'],
    [13318, 'CJK Ideograph Extension A'],
    [13319, 'CJK Ideograph Extension A'],
    [13320, 'CJK Ideograph Extension A'],
    [13321, 'CJK Ideograph Extension A'],
    [13322, 'CJK Ideograph Extension A'],

The result should be a set that maps code point to a unique canonical symbol name, i.e. no name should appear twice.

bramstein · 2016-07-01T16:00:14Z

I ran some tests and found one case where the old Unicode 1.0 name conflicts with a new name: 0x0007 (<control>, BELL) shares the same name as 0x1F514 (BELL).
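A check for this kind of collision can be sketched as below. It is an illustration only, not the test that was actually run: `findDuplicateNames` is a hypothetical name, and the input is assumed to use the same `[codePoint, name]` pairs as the generated output quoted earlier.

```javascript
// Returns a Map of name -> [codePoints] for every name used more than once.
function findDuplicateNames(entries) {
  const byName = new Map();
  for (const [codePoint, name] of entries) {
    if (!byName.has(name)) byName.set(name, []);
    byName.get(name).push(codePoint);
  }
  const duplicates = new Map();
  for (const [name, codePoints] of byName) {
    if (codePoints.length > 1) duplicates.set(name, codePoints);
  }
  return duplicates;
}

const sample = [
  [0x0007, 'BELL'], // Unicode 1.0 name substituted for <control>
  [0x0041, 'LATIN CAPITAL LETTER A'],
  [0x1F514, 'BELL'], // current name of the graphic symbol
];

for (const [name, codePoints] of findDuplicateNames(sample)) {
  const hex = codePoints.map(
    (cp) => 'U+' + cp.toString(16).toUpperCase().padStart(4, '0')
  );
  console.log(name + ' is shared by ' + hex.join(', '));
}
```

Running such a check over the full output would also flag the repeated range names, such as the CJK Ideograph Extension A entries.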

mathiasbynens · 2016-07-02T01:22:19Z

http://unicode.org/Public/UNIDATA/NameAliases.txt contains:

    # Note that no formal name alias for the ISO 6429 "BELL" is
    # provided for U+0007, because of the existing name collision
    # with U+1F514 BELL.

    0007;ALERT;control

http://unicode.org/reports/tr18/#RL2.5 lists these two examples:

    \p{name=BEL} = U+0007, the control character
    \p{name=BELL} = U+1F514, the graphic symbol 🔔
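For reference, reading that file could be sketched as follows. This is a rough sketch under assumptions: `parseNameAliases` is a hypothetical helper, each data line is assumed to have the form `code;alias;type`, and the `BEL` abbreviation line in the sample is my recollection of the file rather than something quoted in this thread.

```javascript
// Hypothetical helper: maps code point -> list of { alias, type }.
function parseNameAliases(text) {
  const aliases = new Map();
  for (const rawLine of text.split('\n')) {
    const line = rawLine.trim();
    // Skip blank lines and "#" comment lines.
    if (line === '' || line.startsWith('#')) continue;
    const [hex, alias, type] = line.split(';');
    const codePoint = parseInt(hex, 16);
    if (!aliases.has(codePoint)) aliases.set(codePoint, []);
    aliases.get(codePoint).push({ alias, type });
  }
  return aliases;
}

const sampleAliases = [
  '# Note that no formal name alias for the ISO 6429 "BELL" is',
  '# provided for U+0007, because of the existing name collision',
  '# with U+1F514 BELL.',
  '',
  '0007;ALERT;control',
  '0007;BEL;abbreviation', // assumed line, not quoted in this thread
].join('\n');

console.log(parseNameAliases(sampleAliases).get(0x7));
```

Keeping the alias type alongside the alias lets a caller decide which kind of alias to prefer for a control character.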

bramstein · 2016-07-02T07:08:50Z

Interesting. The NameAliases.txt file doesn't appear to be used for generating the NamesList.txt file (i.e. it lists U+0007 as BELL).

I find this quote from TR18 particularly interesting:

> Certain code points are not assigned names or name aliases in the standard. With the exception of "reserved", these should be given names based on Code Point Label Tags table in [UAX44]

Then the example lists \p{name=control-0007} [\u{7}]. The table you quoted appears to be for looking up a code point by name (whereas here we go from code point to name). If I'm reading this correctly, all control characters should then be named control-XXXX (and likewise, private-use-XXX, etc).

Do you agree, or am I misreading it?
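To make the reading above concrete, a label of that shape could be built like this. It is only a sketch of that interpretation: `codePointLabel` is a hypothetical name, and the tag strings are assumed from the `control-0007` example in TR18.

```javascript
// Builds a label such as "control-0007": the tag, a hyphen, and the code
// point as uppercase hex padded to at least four digits.
function codePointLabel(tag, codePoint) {
  const hex = codePoint.toString(16).toUpperCase().padStart(4, '0');
  return tag + '-' + hex;
}

console.log(codePointLabel('control', 0x0007));
console.log(codePointLabel('private-use', 0xE000));
console.log(codePointLabel('private-use', 0x10FFFD));
```

Because the hex value is part of the label, every label produced this way is unique per code point.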

mathiasbynens · 2016-07-05T09:00:40Z

> The table you quoted appears to be for looking up a code point by name (whereas here we go from code point to name).

Ideally we’d be able to do both. And ideally the mappings would be the same (except for aliases), just the other way around.

> Do you agree, or am I misreading it?

It’s not 100% clear to me at this time, to be honest.

Let’s go with Control XXXX and Private Use XXXX for now.

bramstein · 2016-07-14T12:20:10Z

Sorry for the delay. I've generated names for control characters and ranges in the form of <name> <hex>. Let me know what you think.

bramstein · 2016-07-29T11:39:40Z

Did you have a chance to look at this? Any changes I can make?

mathiasbynens · 2016-09-27T12:53:55Z

Sorry for the delay in getting back to you.

I finally had time to look into this further and found this (via http://unicode.org/charts/About.html), regarding the CJK compatibility ideographs:

> Character names are not provided for any CJK Compatibility Ideograph blocks because the name of a compatibility ideograph simply consists of its Unicode code point preceded by CJK COMPATIBILITY IDEOGRAPH-.

So, could you update the patch to use a hyphen instead of a space for such cases?
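The hyphenated form could be generated along these lines. This is a sketch, not a proposed implementation: `expandRange` is a hypothetical helper, and the three code points below are picked only for illustration, not as the full extent of any block.

```javascript
// Expands an inclusive code point range into [codePoint, name] pairs,
// where each name is the given prefix followed by the uppercase hex value.
function expandRange(prefix, first, last) {
  const entries = [];
  for (let codePoint = first; codePoint <= last; codePoint++) {
    const hex = codePoint.toString(16).toUpperCase().padStart(4, '0');
    entries.push([codePoint, prefix + hex]);
  }
  return entries;
}

console.log(expandRange('CJK COMPATIBILITY IDEOGRAPH-', 0xF900, 0xF902));
```

The prefix ends in the hyphen itself, so no separator logic is needed and each generated name is unique.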

It would be good to get a proper reference for the other special cases we’re unsure about, i.e.

- control characters
- private use characters
- (any others I’m missing?)

According to CodePoints.net, private use code points don’t have a Name, so maybe they should be excluded from the output? cc @Boldewyn

Boldewyn · 2016-09-27T13:49:51Z

I'd suggest excluding PU characters, yes. If you take a look at, e.g., http://www.unicode.org/Public/9.0.0/ucd/NamesList.txt (large text file!), there is no entry for them at all, only a comment at the appropriate place. At the end of the day, their Unicode semantics are that the consortium guarantees it will never put anything there.

You need to decide, though, whether it might be useful for authors using the lib to be able to determine by name that something is a PU character. Possible use case off the top of my head: a PU character is used in an icon font, and a dev wants to know whether its name starts with Private Use... Might be unnecessary, because the range of PU characters is already fixed.

mathiasbynens · 2016-09-27T14:10:23Z

> Possible use case off the top of my head: a PU character is used in an icon font, and a dev wants to know whether its name starts with Private Use... Might be unnecessary, because the range of PU characters is already fixed.

Yeah, they could and should use one of the regular expressions for that.
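As an illustration of the range-based check, something like the following would work without any name data. The regular expression here is handwritten from the three fixed private use ranges and is an assumption, not one of the expressions generated by this project; `isPrivateUse` is a hypothetical helper.

```javascript
// Private use ranges: U+E000..U+F8FF, U+F0000..U+FFFFD, U+100000..U+10FFFD.
// The "u" flag is required for the \u{...} escapes above U+FFFF.
const PRIVATE_USE = /^[\uE000-\uF8FF\u{F0000}-\u{FFFFD}\u{100000}-\u{10FFFD}]$/u;

function isPrivateUse(codePoint) {
  return PRIVATE_USE.test(String.fromCodePoint(codePoint));
}

console.log(isPrivateUse(0xE000)); // icon fonts commonly use this area
console.log(isPrivateUse(0x0041));
```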

@bramstein If you agree with the above, feel free to remove PUA entries from the generated output.

Add parser to generate names.

ccf6284

bramstein added 2 commits July 1, 2016 16:09

Rename to 'Name'.

5d4026a

Use old unicode 1 name if the code point is a control character (and an old name is available.)

66b8f9c

Exclude ranges from the names.

7f51c2c

Use <name> <hex> for ranges and control characters.

00aecdd

mathiasbynens closed this in 830b2fb Jan 9, 2017

bramstein wants to merge 5 commits into node-unicode:master from bramstein:bs-add-names
