tools: add a powershell script to generate CPWD from the UCD #5946
This commit introduces Generate-CodepointWidthsFromUCD, a powershell (7+) script that will parse a UCD XML database in the UAX 42 format from https://www.unicode.org/Public/UCD/latest/ucdxml/ and generate CodepointWidthDetector's giant width array.

By default, it will emit one UnicodeRange for every range of non-narrow glyphs with a different Width + Emoji + Emoji Presentation class; however, it can be run in "packing" and "full" mode.

* Packing mode: ignore the width/emoji/pres class and combine adjacent runs that CPWD will treat the same.
  * This is for optimizing the number of individual ranges emitted into code.
* Full mode: include narrow codepoints (helpful for visualization)

The output (when packing) looks like this:

```c++
// Generated by Generate-CodepointWidthsFromUCD -Pack:True -Full:False
// on 05/17/2020 02:47:55 (UTC) from Unicode 13.0.0.
// 66182 (0x10286) codepoints covered.
static constexpr std::array<UnicodeRange, 23> s_wideAndAmbiguousTable{
    UnicodeRange{ 0xa1, 0xa1, CodepointWidth::Ambiguous },
    UnicodeRange{ 0xa4, 0xa4, CodepointWidth::Ambiguous },
    UnicodeRange{ 0xa7, 0xa8, CodepointWidth::Ambiguous },
    .
    .
    .
    UnicodeRange{ 0x1f210, 0x1f23b, CodepointWidth::Wide },
    UnicodeRange{ 0x1f37e, 0x1f393, CodepointWidth::Wide },
    UnicodeRange{ 0x100000, 0x10fffd, CodepointWidth::Ambiguous },
};
```
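For readers who want a feel for what packing does, here is a minimal C++ sketch of the idea (the script itself is PowerShell, and this is not its code). It assumes the input ranges are sorted and non-overlapping and that each has already been reduced to the single width CPWD cares about. The field names `lowerBound`, `upperBound` and `width`, and the helper `packRanges`, are hypothetical stand-ins chosen for illustration.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the types used by the generated table.
enum class CodepointWidth { Narrow, Wide, Ambiguous };

struct UnicodeRange
{
    std::uint32_t lowerBound;
    std::uint32_t upperBound;
    CodepointWidth width;
};

// Merge neighbouring ranges that touch and share the same width.
// Assumes `ranges` is sorted by lowerBound and contains no overlaps.
std::vector<UnicodeRange> packRanges(const std::vector<UnicodeRange>& ranges)
{
    std::vector<UnicodeRange> packed;
    for (const auto& r : ranges)
    {
        if (!packed.empty() &&
            packed.back().width == r.width &&
            packed.back().upperBound + 1 == r.lowerBound)
        {
            // Contiguous and treated identically: extend the previous range.
            packed.back().upperBound = r.upperBound;
        }
        else
        {
            packed.push_back(r);
        }
    }
    return packed;
}
```

Two adjacent runs that differ only in emoji class would arrive here with the same width and collapse into one entry, which is what keeps the emitted table short.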
This will let us move our overrides into a documented, maintainable format. They will be part of CodepointWidthDetector's table, and we can use the same tooling to generate it and the same code to read it at runtime. I will profile the speed difference between "getQuick" (51 comparisons) and "CPWD::GetWidth" (
I was exaggerating, but GetQuickCharWidth actually does do 28*2 comparisons, and CodepointWidthDetector is definitely log2(n).
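To make the log2(n) point concrete, here is a hedged sketch of a binary-search lookup over a sorted range table. It is not CodepointWidthDetector's actual implementation. The type layout, the five-row `s_sampleTable` (rows copied from the generated output in the PR description) and the function `lookupWidth` are assumptions made for illustration. With the 23 packed entries shown in the description, a search like this needs roughly five probes.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Hypothetical stand-ins for the real types.
enum class CodepointWidth { Narrow, Wide, Ambiguous };

struct UnicodeRange
{
    std::uint32_t lowerBound;
    std::uint32_t upperBound;
    CodepointWidth width;
};

// A few rows from the generated table, sorted by codepoint.
constexpr std::array<UnicodeRange, 5> s_sampleTable{
    UnicodeRange{ 0xa1, 0xa1, CodepointWidth::Ambiguous },
    UnicodeRange{ 0xa4, 0xa4, CodepointWidth::Ambiguous },
    UnicodeRange{ 0xa7, 0xa8, CodepointWidth::Ambiguous },
    UnicodeRange{ 0x1f210, 0x1f23b, CodepointWidth::Wide },
    UnicodeRange{ 0x1f37e, 0x1f393, CodepointWidth::Wide },
};

CodepointWidth lookupWidth(std::uint32_t codepoint)
{
    // Find the first range whose upper bound is not below the codepoint.
    const auto it = std::lower_bound(
        s_sampleTable.begin(), s_sampleTable.end(), codepoint,
        [](const UnicodeRange& range, std::uint32_t value) {
            return range.upperBound < value;
        });

    // The codepoint is covered only if it also clears that range's lower bound.
    if (it != s_sampleTable.end() && it->lowerBound <= codepoint)
    {
        return it->width;
    }

    // Anything not listed is treated as narrow in this sketch.
    return CodepointWidth::Narrow;
}
```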
This is ready for review. I've added support for an overrides file and moved a bunch of the logic into classes. The overrides file is of the same format as the UCD itself. This is the example I was building with:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
  <repertoire>
    <override first-cp="2500" last-cp="259F" ea="H" comment="box-drawing and block elements require cell alignment" />
    <override first-cp="4DC0" last-cp="4DFF" ea="H" comment="hexagrams are historically narrow" />
    <override first-cp="FE20" last-cp="FE2F" ea="H" comment="narrow combining ligatures (split into left/right halves, which take 2 columns together)" />
  </repertoire>
</ucd>
zadjii-msft
left a comment
- Definitely add an example of how to use it: where to get the input files, how to invoke it, and where to copy the output to.
- Should we include the input files that are relevant to the current content of the CPWD in the docs/reference directory, with timestamps?
I’d rather not include the input file; it’s 52MB. Fortunately, Unicode doesn’t go back and edit existing datasets, and the tool emits a version reference when it runs. The overrides file is going to be checked in in a subsequent commit, titled “regenerate all the Unicode tables again”. This is just the script that does it :)
holy what Okay yea we don't need that.
Ah yea, okay I guess I missed that. Considering the above, |
The conflicts shouldn't be hard to resolve
Thanks for the offer, @jsoref. Got this one on lock. 😄 |
This commit introduces Generate-CodepointWidthsFromUCD, a powershell
(7+) script that will parse a UCD XML database in the UAX 42 format from
https://www.unicode.org/Public/UCD/latest/ucdxml/ and generate
CodepointWidthDetector's giant width array.
By default, it will emit one UnicodeRange for every range of non-narrow
glyphs with a different Width + Emoji + Emoji Presentation class;
however, it can be run in "packing" and "full" mode.
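* Packing mode: ignore the width/emoji/pres class and combine adjacent
  runs that CPWD will treat the same.
  * This is for optimizing the number of individual ranges emitted
    into code.
* Full mode: include narrow codepoints (helpful for visualization)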
It also supports overrides, provided in an XML document of the same format
as the UCD itself. Entries in the overrides files are applied after the
entire UCD is read and will replace any impacted ranges.
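One way to picture "replace any impacted ranges" is the following C++ sketch. It is an illustration under stated assumptions, not the script's logic: the input is taken to be sorted and non-overlapping, a single override is applied at a time, and the names `UnicodeRange`, `lowerBound`, `upperBound` and `applyOverride` are hypothetical. Existing ranges that straddle the override are trimmed, ranges fully inside it are dropped, and the override takes their place. Whether a narrow override is then omitted from the emitted table (as in the default, non-full mode) is left out of the sketch.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the real types.
enum class CodepointWidth { Narrow, Wide, Ambiguous };

struct UnicodeRange
{
    std::uint32_t lowerBound;
    std::uint32_t upperBound;
    CodepointWidth width;
};

// Returns a copy of `ranges` in which everything covered by `ov` is
// replaced by `ov`. Assumes `ranges` is sorted and non-overlapping.
std::vector<UnicodeRange> applyOverride(const std::vector<UnicodeRange>& ranges,
                                        const UnicodeRange& ov)
{
    std::vector<UnicodeRange> result;
    bool inserted = false;

    for (const auto& r : ranges)
    {
        if (r.upperBound < ov.lowerBound)
        {
            result.push_back(r); // entirely before the override
            continue;
        }

        if (!inserted)
        {
            // Keep the part of this range that sticks out on the left.
            if (r.lowerBound < ov.lowerBound)
            {
                result.push_back(UnicodeRange{ r.lowerBound, ov.lowerBound - 1, r.width });
            }
            result.push_back(ov);
            inserted = true;
        }

        // Keep the part that sticks out on the right; for a range that lies
        // entirely after the override this is the whole range.
        if (r.upperBound > ov.upperBound)
        {
            const auto start = std::max<std::uint32_t>(r.lowerBound, ov.upperBound + 1);
            result.push_back(UnicodeRange{ start, r.upperBound, r.width });
        }
    }

    if (!inserted)
    {
        result.push_back(ov); // override lies past every existing range
    }
    return result;
}
```

Applying each override in turn to the result of the previous call gives the "applied after the entire UCD is read" behaviour described above.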
The output (when packing) looks like this:
The output (when overriding) looks like this: