Teach data generation to parse and publish NameAliases.txt #68

Merged
mathiasbynens merged 4 commits into node-unicode:main from jpassaro:add-name-aliases
Jul 29, 2022
Contributor

@jpassaro jpassaro commented Mar 4, 2022

Current modules allow users to query the official published Unicode "Name" for each code point. In some cases, more detailed or corrected information can be found using the NameAliases.txt file.

Implement a parser that reads the formal Unicode name aliases and publishes them as part of the generated modules, for Unicode versions in which aliases are available.
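For readers unfamiliar with the file, here is a minimal sketch of what parsing it could look like. It is not the parser added in this PR: the function name `parseNameAliases` and the grouping by alias type are assumptions made for illustration. It relies only on the NameAliases.txt row format, which is `code point;alias;type`, with `#` starting a comment.

```js
// Hypothetical sketch, not the parser from this PR.
// NameAliases.txt rows look like `0000;NULL;control`:
// hex code point; alias; alias type. `#` starts a comment.
function parseNameAliases(text) {
  // Map from alias type (e.g. 'control') to Map<codePoint, string[]>.
  const byType = new Map();
  for (const rawLine of text.split(/\r?\n/)) {
    // Drop comments and surrounding whitespace, then skip empty lines.
    const line = rawLine.replace(/#.*$/, '').trim();
    if (line === '') continue;
    const [hex, alias, type] = line.split(';').map((field) => field.trim());
    const codePoint = parseInt(hex, 16);
    if (!byType.has(type)) byType.set(type, new Map());
    const aliases = byType.get(type);
    // A code point may carry several aliases of the same type.
    if (!aliases.has(codePoint)) aliases.set(codePoint, []);
    aliases.get(codePoint).push(alias);
  }
  return byType;
}

// A few rows in the NameAliases.txt format, inlined so the sketch runs offline.
const sample = [
  '# sample rows',
  '0000;NULL;control',
  '0000;NUL;abbreviation',
  '000A;LINE FEED;control',
  '000A;NEW LINE;control',
  '01A2;LATIN CAPITAL LETTER GHA;correction',
].join('\n');

const parsed = parseNameAliases(sample);
console.log(parsed.get('control').get(0x0A)); // logs [ 'LINE FEED', 'NEW LINE' ]
```

Grouping by alias type mirrors the per-type split (abbreviation, alternate, control, correction, figment) that the file itself defines; how the published modules actually store the data may differ.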

Collaborator

@mathiasbynens mathiasbynens left a comment

Thanks for the patch! I left a few comments — PTAL.

'script-extensions': 'https://unicode.org/Public/7.0.0/ucd/ScriptExtensions.txt',
'blocks': 'https://unicode.org/Public/7.0.0/ucd/Blocks.txt',
'properties': 'https://unicode.org/Public/7.0.0/ucd/PropList.txt',
'aliases': 'https://unicode.org/Public/7.0.0/ucd/NameAliases.txt',
Collaborator

Let’s go with `name-aliases` instead of the less precise `aliases`.

index.js Outdated
parsers.parseEmoji = require('./scripts/parse-emoji.js');
parsers.parseEmojiSequences = require('./scripts/parse-emoji-sequences.js');
parsers.parseNames = require('./scripts/parse-names.js');
parsers.parseAliases = require('./scripts/parse-aliases.js');
Collaborator

Similarly, `parseNameAliases`.

index.js Outdated
extend(dirMap, utils.writeFiles({
'version': version,
'map': parsers.parseAliases(version),
'type': 'Aliases',
Collaborator

Should this go under the `Names` directory? Feels like it belongs there.

Contributor Author

Pushed an update. Here is the directory structure as reflected in the changed README:

diff --git a/README.md b/README.md
index 373e3dbadd85..c657677a7efe 100644
--- a/README.md
+++ b/README.md
@@ -57,6 +57,20 @@ const openingBrackets = require('@unicode/unicode-14.0.0/Bidi_Paired_Bracket_Typ
 Other than categories, data on Unicode properties, blocks, scripts, and script extensions is available too (for recent versions of the Unicode standard). Here’s the full list of the available data for v14.0.0:
 
 ```js
+// `Names`:
+
+require('@unicode/unicode-14.0.0/Names/index.js'); // array of canonical names
+
+require('@unicode/unicode-14.0.0/Names/Abbreviation/index.js'); // lookup map from codepoint to aliases
+
+require('@unicode/unicode-14.0.0/Names/Alternate/index.js'); // lookup map from codepoint to aliases
+
+require('@unicode/unicode-14.0.0/Names/Control/index.js'); // lookup map from codepoint to aliases
+
+require('@unicode/unicode-14.0.0/Names/Correction/index.js'); // lookup map from codepoint to aliases
+
+require('@unicode/unicode-14.0.0/Names/Figment/index.js'); // lookup map from codepoint to aliases
+
 // `Binary_Property`:
 
 require('@unicode/unicode-14.0.0/Binary_Property/ASCII/code-points.js');

Contributor Author

@jpassaro jpassaro Mar 7, 2022

If you want something different, let me know. I'm going with what's simple and hopefully least invasive. It's a bit hard to know exactly how this affects the "output" library: the output is compressed data, so I'm effectively diffing a huge base64 string that changes profoundly with each `npm test`. That said, no files appear to be lost and the new files are all as indicated in the above snippet, so I think it's probably okay. If there's a better way to validate the change, please let me know.

Thank you for the feedback and for this excellent library!

@jpassaro
Contributor Author

Hello @mathiasbynens, is there anything I can do to make this change more acceptable for merging?

@mathiasbynens
Collaborator

There’s still an unresolved comment here: #68 (comment)

@jpassaro
Contributor Author

@mathiasbynens Thanks for the reminder. I think it's addressed. Please note that #70 should be merged first; without it, the build is broken.

Let me know if this directory structure is okay, or if you prefer an intermediate `Aliases` or `NameAliases` directory:

diff --git a/README.md b/README.md
index 0efc52f2343b..480a7449a401 100644
--- a/README.md
+++ b/README.md
@@ -57,6 +57,16 @@ const openingBrackets = require('@unicode/unicode-15.0.0/Bidi_Paired_Bracket_Typ
 Other than categories, data on Unicode properties, blocks, scripts, and script extensions is available too (for recent versions of the Unicode standard). Here’s the full list of the available data for v15.0.0:
 
 ```js
+// `Names`:
+
+require('@unicode/unicode-15.0.0/Names/index.js'); // array of canonical names
+require('@unicode/unicode-15.0.0/Names/Abbreviation/index.js'); // lookup map from codepoint to aliases
+require('@unicode/unicode-15.0.0/Names/Alternate/index.js'); // lookup map from codepoint to aliases
+require('@unicode/unicode-15.0.0/Names/Control/index.js'); // lookup map from codepoint to aliases
+require('@unicode/unicode-15.0.0/Names/Correction/index.js'); // lookup map from codepoint to aliases
+require('@unicode/unicode-15.0.0/Names/Figment/index.js'); // lookup map from codepoint to aliases
+
+
 // `Binary_Property`:
 
 require('@unicode/unicode-15.0.0/Binary_Property/ASCII/code-points.js');

@mathiasbynens mathiasbynens merged commit 13ebc14 into node-unicode:main Jul 29, 2022
mathiasbynens pushed a commit that referenced this pull request Jul 29, 2022
2 participants