Add parser to generate names. #41

Closed
bramstein wants to merge 5 commits into node-unicode:master from bramstein:bs-add-names

Conversation

@bramstein
Contributor

I have a couple of use cases where I need to look up the official name for a code point. Rather than write yet another Unicode parser myself, I thought you might be interested in adding this here.

Right now, it maps everything in the PUA to "Private Use Area XX", but perhaps we should drop that (it isn't really a name). Let me know and I will change that.

@mathiasbynens
Collaborator

Let’s use Name instead of Names, as that seems to be the canonical property name per http://unicode.org/reports/tr18/#RL2.5.

These names are not correct:

    [0, '<control>'],
    [1, '<control>'],
    [2, '<control>'],
    [3, '<control>'],
    [4, '<control>'],
    [5, '<control>'],
    [6, '<control>'],
    [7, '<control>'],
    [8, '<control>'],
    [9, '<control>'],
    [10, '<control>'],
    [11, '<control>'],
    [12, '<control>'],
    [13, '<control>'],
    [14, '<control>'],
    [15, '<control>'],
    [16, '<control>'],
    [17, '<control>'],
    [18, '<control>'],
    [19, '<control>'],
    [20, '<control>'],
    [21, '<control>'],
    [22, '<control>'],
    [23, '<control>'],
    [24, '<control>'],
    [25, '<control>'],
    [26, '<control>'],
    [27, '<control>'],
    [28, '<control>'],
    [29, '<control>'],
    [30, '<control>'],
    [31, '<control>'],

http://unicode.org/Public/UNIDATA/NamesList.txt (info) might be a better source, although it’s not supposed to be used for automated parsing. Let’s find out where it gets those names and use that source directly.

@bramstein
Contributor Author

Good point. I renamed it to "Name".

It looks like those names for control characters were taken from the old Unicode 1.0 names. I've added a special case for when a code point's name is <control> and it has an old Unicode 1 name.
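
A minimal sketch of that fallback, assuming the input is a semicolon-separated `UnicodeData.txt` record in which field 1 is the name and field 10 is the Unicode 1.0 name. The helper `nameFromRecord` is hypothetical and is not the code from this pull request.

```javascript
// Hypothetical helper, not the code from this pull request.
// Assumes a UnicodeData.txt record: field 1 is Name, field 10 is Unicode_1_Name.
function nameFromRecord(line) {
  const fields = line.split(';');
  const codePoint = parseInt(fields[0], 16);
  const name = fields[1];
  const unicode1Name = fields[10];
  // Special case: prefer the old Unicode 1.0 name when the current
  // name is only the <control> placeholder and an old name exists.
  if (name === '<control>' && unicode1Name) {
    return [codePoint, unicode1Name];
  }
  return [codePoint, name];
}

const bell = nameFromRecord('0007;<control>;Cc;0;BN;;;;;N;BELL;;;;');
const letterA = nameFromRecord('0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;');
```

With these two records, `bell` is `[7, 'BELL']` and `letterA` is `[65, 'LATIN CAPITAL LETTER A']`. Control characters without a Unicode 1.0 name keep the `<control>` placeholder.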

How do you feel about the private use area names?

@mathiasbynens
Collaborator

It’s not just control characters and private use area symbols but also CJK ideograph extensions, e.g.

    [13312, 'CJK Ideograph Extension A'],
    [13313, 'CJK Ideograph Extension A'],
    [13314, 'CJK Ideograph Extension A'],
    [13315, 'CJK Ideograph Extension A'],
    [13316, 'CJK Ideograph Extension A'],
    [13317, 'CJK Ideograph Extension A'],
    [13318, 'CJK Ideograph Extension A'],
    [13319, 'CJK Ideograph Extension A'],
    [13320, 'CJK Ideograph Extension A'],
    [13321, 'CJK Ideograph Extension A'],
    [13322, 'CJK Ideograph Extension A'],

The result should be a set that maps code point to a unique canonical symbol name, i.e. no name should appear twice.
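
One way to check that requirement is a single pass over the generated entries that records the first code point seen for each name. This is only a sketch: `findDuplicateNames` is a hypothetical helper, and it assumes the output is an iterable of `[codePoint, name]` pairs like the excerpts above.

```javascript
// Hypothetical check, assuming entries are [codePoint, name] pairs.
function findDuplicateNames(entries) {
  const firstSeen = new Map(); // name -> first code point that used it
  const duplicates = [];
  for (const [codePoint, name] of entries) {
    if (firstSeen.has(name)) {
      duplicates.push({ name, first: firstSeen.get(name), repeat: codePoint });
    } else {
      firstSeen.set(name, codePoint);
    }
  }
  return duplicates;
}

const duplicates = findDuplicateNames([
  [13312, 'CJK Ideograph Extension A'],
  [13313, 'CJK Ideograph Extension A'],
  [65, 'LATIN CAPITAL LETTER A'],
]);
```

Here `duplicates` contains one record, reporting that 13313 repeats the name first used by 13312. An empty result would mean every name is unique.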

@bramstein
Contributor Author

bramstein commented Jul 1, 2016

I ran some tests and found one case where the old Unicode 1.0 name conflicts with a new name: 0x0007 (<control>, BELL) shares the same name as 0x1F514 (BELL).

@mathiasbynens
Collaborator

mathiasbynens commented Jul 2, 2016

http://unicode.org/Public/UNIDATA/NameAliases.txt contains:

    # Note that no formal name alias for the ISO 6429 "BELL" is
    # provided for U+0007, because of the existing name collision
    # with U+1F514 BELL.

    0007;ALERT;control

http://unicode.org/reports/tr18/#RL2.5 lists these two examples:

    \p{name=BEL} = U+0007, the control character
    \p{name=BELL} = U+1F514, the graphic symbol 🔔
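
For reference, the `NameAliases.txt` excerpt above could be read with something like the following. It is a rough sketch that assumes each data line has the form `code point;alias;type` and that lines starting with `#` are comments. `parseNameAliases` is a hypothetical name and is not part of this pull request.

```javascript
// Hypothetical parser for NameAliases.txt-style records
// (code point;alias;type). Not the code from this pull request.
function parseNameAliases(text) {
  const aliases = [];
  for (const rawLine of text.split('\n')) {
    const line = rawLine.trim();
    // Skip blank lines and comment lines.
    if (line === '' || line.startsWith('#')) continue;
    const [hex, alias, type] = line.split(';');
    aliases.push({ codePoint: parseInt(hex, 16), alias, type });
  }
  return aliases;
}

const sample = [
  '# Note that no formal name alias for the ISO 6429 "BELL" is',
  '# provided for U+0007, because of the existing name collision',
  '# with U+1F514 BELL.',
  '',
  '0007;ALERT;control',
].join('\n');
const parsed = parseNameAliases(sample);
```

For this sample, `parsed` holds a single record mapping code point 7 to the alias `ALERT` with type `control`.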

@bramstein
Contributor Author

Interesting. The NameAliases.txt file doesn't appear to be used for generating the NamesList.txt file (i.e. it lists U+0007 as BELL).

I find this quote from TR18 particularly interesting:

> Certain code points are not assigned names or name aliases in the standard. With the exception of "reserved", these should be given names based on Code Point Label Tags table in [UAX44]

Then the example lists \p{name=control-0007} [\u{7}]. The table you quoted appears to be for looking up a code point by name (whereas here we go from code point to name). If I'm reading this correctly, all control characters should then be named control-XXXX (and likewise private-use-XXXX, etc.).

Do you agree, or am I misreading it?

@mathiasbynens
Collaborator

> The table you quoted appears to be for looking up a code point by name (whereas here we go from code point to name).

Ideally we’d be able to do both. And ideally the mappings would be the same (except for aliases), just the other way around.

> Do you agree, or am I misreading it?

It’s not 100% clear to me at this time, to be honest.

Let’s go with Control XXXX and Private Use XXXX for now.
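
A sketch of what that interim naming could look like, assuming a runtime that supports Unicode property escapes in regular expressions (`\p{Cc}` for control characters, `\p{Co}` for private use) and assuming the hex part is uppercase and zero-padded to at least four digits. `placeholderName` and `toHex` are hypothetical helpers.

```javascript
// Hypothetical helpers; the padding and casing are assumptions.
function toHex(codePoint) {
  return codePoint.toString(16).toUpperCase().padStart(4, '0');
}

function placeholderName(codePoint) {
  const symbol = String.fromCodePoint(codePoint);
  if (/\p{Cc}/u.test(symbol)) return `Control ${toHex(codePoint)}`;
  if (/\p{Co}/u.test(symbol)) return `Private Use ${toHex(codePoint)}`;
  return null; // not one of the two special cases handled here
}

const controlName = placeholderName(0x0007);
const privateUseName = placeholderName(0xE000);
```

This yields `Control 0007` for U+0007 and `Private Use E000` for U+E000, and `null` for code points such as U+0041 that already have a regular name.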

@bramstein
Contributor Author

Sorry for the delay. I've generated names for control characters and ranges in the form of <name> <hex>. Let me know what you think.

@bramstein
Contributor Author

Did you have a chance to look at this? Any changes I can make?

@mathiasbynens
Collaborator

Sorry for the delay in getting back to you.

I finally had time to look into this further and found this (via http://unicode.org/charts/About.html), regarding the CJK compatibility ideographs:

> Character names are not provided for any CJK Compatibility Ideograph blocks because the
> name of a compatibility ideograph simply consists of its Unicode code point preceded by
> CJK COMPATIBILITY IDEOGRAPH-.

So, could you update the patch to use a hyphen instead of a space for such cases?
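
The hyphenated form described in that quote could be derived as follows. This is an illustrative sketch only: `compatibilityIdeographName` is a hypothetical helper, and it leaves the decision about which code points count as CJK compatibility ideographs to the caller.

```javascript
// Hypothetical helper: builds the derived name from the code point alone.
// The caller is assumed to pass only CJK compatibility ideographs.
function compatibilityIdeographName(codePoint) {
  const hex = codePoint.toString(16).toUpperCase().padStart(4, '0');
  return `CJK COMPATIBILITY IDEOGRAPH-${hex}`;
}

const derivedName = compatibilityIdeographName(0xF900);
```

For U+F900 this gives `CJK COMPATIBILITY IDEOGRAPH-F900`, with a hyphen rather than a space before the code point.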

It would be good to get a proper reference for the other special cases we’re unsure about, i.e.

  1. control characters
  2. private use characters
  3. (any others I’m missing?)

According to CodePoints.net, private use code points don’t have a Name, so maybe they should be excluded from the output? cc @Boldewyn

@Boldewyn

I'd suggest excluding PU characters, yes. If you take a look at, e.g., http://www.unicode.org/Public/9.0.0/ucd/NamesList.txt (large text file!), there is no entry for them at all, only a comment at the appropriate place. At the end of the day, the Unicode semantics for them is that the Consortium guarantees it will never put anything there.

You need to decide, though, whether it might be useful for authors using the lib to be able to determine by name that something is a PU character. Possible use case off the top of my head: a PU character is used in an icon font, and the dev wants to know whether its name starts with Private Use... Might be unnecessary, because the range of PU characters is already fixed.

@mathiasbynens
Collaborator

> Possible use case off the top of my head: a PU character is used in an icon font, and the dev wants to know whether its name starts with Private Use... Might be unnecessary, because the range of PU characters is already fixed.

Yeah, they could and should use one of the regular expressions for that.

@bramstein If you agree with the above, feel free to remove PUA entries from the generated output.
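
Dropping those entries could be as simple as filtering on the three fixed private use ranges (U+E000 to U+F8FF, U+F0000 to U+FFFFD, and U+100000 to U+10FFFD). The sketch below assumes the generated output is an array of `[codePoint, name]` pairs. `isPrivateUse` and `withoutPrivateUse` are hypothetical names, not functions from this repository.

```javascript
// The three private use ranges in Unicode (inclusive bounds).
const PRIVATE_USE_RANGES = [
  [0xE000, 0xF8FF],
  [0xF0000, 0xFFFFD],
  [0x100000, 0x10FFFD],
];

// Hypothetical helpers, assuming entries are [codePoint, name] pairs.
function isPrivateUse(codePoint) {
  return PRIVATE_USE_RANGES.some(
    ([start, end]) => codePoint >= start && codePoint <= end
  );
}

function withoutPrivateUse(entries) {
  return entries.filter(([codePoint]) => !isPrivateUse(codePoint));
}

const filtered = withoutPrivateUse([
  [0x41, 'LATIN CAPITAL LETTER A'],
  [0xE000, 'Private Use E000'],
  [0x1F514, 'BELL'],
]);
```

After filtering, only the entries for U+0041 and U+1F514 remain.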

3 participants