Add parser to generate names. #41
|
Let’s use These names are not correct: http://unicode.org/Public/UNIDATA/NamesList.txt (info) might be a better source, although it’s not supposed to be used for automated parsing. Let’s find out where it gets those names and use that source directly. |
…an old name is available.)
|
Good point. I renamed it to "Name". It looks like those names for control characters were taken from the old Unicode 1.0 names. I've added a special case for when a code point's name is How do you feel about the private use area names? |
|
It’s not just control characters and private use area symbols but also CJK ideograph extensions, e.g. The result should be a set that maps code point to a unique canonical symbol name, i.e. no name should appear twice. |
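The uniqueness requirement can be checked mechanically. Below is a minimal sketch in Python, assuming the generated output is available as a plain dict from code point to name. `find_duplicate_names` and the sample data are hypothetical and not part of the patch.

```python
from collections import defaultdict


def find_duplicate_names(names):
    """Return {name: [code points]} for every name used more than once."""
    by_name = defaultdict(list)
    for code_point, name in names.items():
        by_name[name].append(code_point)
    return {
        name: sorted(code_points)
        for name, code_points in by_name.items()
        if len(code_points) > 1
    }


# Illustrative sample only, not the generated data set.
sample = {
    0x0041: "LATIN CAPITAL LETTER A",
    0x0007: "BELL",   # Unicode 1.0 name of the control character
    0x1F514: "BELL",  # current name of the bell symbol
}

duplicates = find_duplicate_names(sample)
for name, code_points in duplicates.items():
    print(name, [f"U+{cp:04X}" for cp in code_points])
```

An empty result would mean the mapping can be inverted without losing entries.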
|
I ran some tests and found one case where the old Unicode 1.0 name conflicts with a new name: |
|
http://unicode.org/Public/UNIDATA/NameAliases.txt contains: http://unicode.org/reports/tr18/#RL2.5 lists these two examples:
|
|
Interesting. The NameAliases.txt file doesn't appear to be used for generating the NamesList.txt file (i.e. it lists U+0007 as BELL). I find this quote from TR18 particularly interesting:
Then the example lists Do you agree, or am I misreading it? |
Ideally we’d be able to do both. And ideally the mappings would be the same (except for aliases), just the other way around.
It’s not 100% clear to me at this time, to be honest. Let’s go with |
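For reference, here is a minimal sketch of how alias data could feed a name-to-code-point lookup while keeping the code-point-to-name mapping canonical. It assumes the semicolon-separated `code point;alias;type` layout used by recent versions of NameAliases.txt. `parse_name_aliases` and `build_reverse_lookup` are hypothetical helpers, and the sample lines are for illustration only.

```python
def parse_name_aliases(lines):
    """Yield (code_point, alias, alias_type) from NameAliases.txt-style lines."""
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split(";")
        if len(fields) != 3:
            continue  # assumption: skip anything not in the three-field layout
        code_point, alias, alias_type = (field.strip() for field in fields)
        yield int(code_point, 16), alias, alias_type


def build_reverse_lookup(names, aliases):
    """Map every name and alias to a code point; canonical names take priority."""
    reverse = {alias: code_point for code_point, alias, _ in aliases}
    reverse.update({name: code_point for code_point, name in names.items()})
    return reverse


sample_lines = [
    "# illustrative excerpt in the assumed layout",
    "0007;ALERT;control",
    "0007;BEL;abbreviation",
]
names = {0x1F514: "BELL"}
reverse = build_reverse_lookup(names, parse_name_aliases(sample_lines))
```

With this split, the forward map stays one name per code point, and only the reverse map grows with aliases.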
|
Sorry for the delay. I've generated names for control characters and ranges in the form of |
|
Did you have a chance to look at this? Any changes I can make? |
|
Sorry for the delay in getting back to you. I finally had time to look into this further and found this (via http://unicode.org/charts/About.html), regarding the CJK compatibility ideographs:
So, could you update the patch to use a hyphen instead of a space for such cases? It would be good to get a proper reference for the other special cases we’re unsure about, i.e.
According to CodePoints.net, private use code points don’t have a |
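The hyphenated form can be generated from a range table. This is a sketch only: `DERIVED_NAME_RANGES` and `derived_name` are hypothetical, and the ranges are an illustrative subset that should be checked against the Unicode Character Database rather than taken as complete.

```python
# Assumption: illustrative subset of ranges whose names are derived from
# the code point; a real generator would read these from the UCD.
DERIVED_NAME_RANGES = [
    (0xF900, 0xFA6D, "CJK COMPATIBILITY IDEOGRAPH"),
    (0xFA70, 0xFAD9, "CJK COMPATIBILITY IDEOGRAPH"),
    (0x2F800, 0x2FA1D, "CJK COMPATIBILITY IDEOGRAPH"),
]


def derived_name(code_point):
    """Return '<PREFIX>-<HEX>' for code points in a listed range, else None."""
    for first, last, prefix in DERIVED_NAME_RANGES:
        if first <= code_point <= last:
            # Hyphen, not space, between the prefix and the hex code point.
            return f"{prefix}-{code_point:04X}"
    return None


print(derived_name(0xF900))
print(derived_name(0x2F800))
```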
|
I'd suggest excluding PU characters, yes. If you take a look at, e.g., http://www.unicode.org/Public/9.0.0/ucd/NamesList.txt (large text file!), there is no entry for them at all, only a comment at the appropriate place. The Unicode semantics of them are, at the end of the day, that the consortium guarantees it will never put anything there. You need to decide, though, whether it might be useful for authors using the lib to be able to determine by name that something is a PU character. Possible use case off the top of my head: PU character used in an icon font, dev wants to know if its name starts with |
Yeah, they could and should use one of the regular expressions for that. @bramstein If you agree with the above, feel free to remove PUA entries from the generated output. |
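To illustrate the range-based approach, here is a Python stand-in; it is not the library's own regular expression. It assumes the three private use ranges U+E000–U+F8FF, U+F0000–U+FFFFD and U+100000–U+10FFFD. `is_private_use` and `without_private_use` are hypothetical names.

```python
import re

# The three private use ranges (General_Category=Co).
PRIVATE_USE = re.compile(
    "[\uE000-\uF8FF\U000F0000-\U000FFFFD\U00100000-\U0010FFFD]"
)


def is_private_use(char):
    """True if the single character lies in a private use range."""
    return PRIVATE_USE.fullmatch(char) is not None


def without_private_use(names):
    """Drop private use code points from a code point -> name mapping."""
    return {
        code_point: name
        for code_point, name in names.items()
        if not is_private_use(chr(code_point))
    }


# Illustrative sample; the second entry uses a placeholder label.
sample = {0x0041: "LATIN CAPITAL LETTER A", 0xE000: "Private Use Area XX"}
filtered = without_private_use(sample)
```

Detection by range keeps working after the PUA entries are removed from the generated names.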
I have a couple of use cases where I need to look up the official name for code points. Rather than writing yet another Unicode parser, I thought you might be interested in adding this.
Right now, it maps everything in the PUA to "Private Use Area XX", but perhaps we should drop that (it isn't really a name). Let me know and I will change that.