tools: add a powershell script to generate CPWD from the UCD #5946

DHowett · 2020-05-17T02:50:02Z

This commit introduces Generate-CodepointWidthsFromUCD, a powershell
(7+) script that will parse a UCD XML database in the UAX 42 format from
https://www.unicode.org/Public/UCD/latest/ucdxml/ and generate
CodepointWidthDetector's giant width array.

By default, it will emit one UnicodeRange for every range of non-narrow
glyphs with a different Width + Emoji + Emoji Presentation class;
however, it can also be run in "packing" and "full" modes.

* Packing mode: ignore the width/emoji/presentation class and combine
  adjacent runs that CPWD will treat the same.
  * This minimizes the number of individual ranges emitted into code.
* Full mode: include narrow codepoints (helpful for visualization).
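
To make the packing idea concrete, here is a minimal C++ sketch of merging adjacent runs. It is not taken from the script: the `UnicodeRange` and `CodepointWidth` definitions are stand-ins modelled on the generated output, `PackRanges` is a hypothetical name, and the sketch assumes that "treated the same by CPWD" can be reduced to "same resulting width".

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in types modelled on the generated table (assumed, not the real headers).
enum class CodepointWidth { Narrow, Wide, Ambiguous };

struct UnicodeRange
{
    uint32_t first;
    uint32_t last; // inclusive
    CodepointWidth width;
};

// Hypothetical helper: merge neighbouring ranges that touch and share a width.
// The input must be sorted by `first` and must not contain overlapping ranges.
std::vector<UnicodeRange> PackRanges(const std::vector<UnicodeRange>& sorted)
{
    std::vector<UnicodeRange> packed;
    for (const auto& r : sorted)
    {
        assert(r.first <= r.last);
        if (!packed.empty() &&
            packed.back().width == r.width &&
            packed.back().last + 1 == r.first)
        {
            // Same width and directly adjacent: extend the previous run.
            packed.back().last = r.last;
        }
        else
        {
            packed.push_back(r);
        }
    }
    return packed;
}
```

With this sketch, two single-codepoint Ambiguous entries at 0xa7 and 0xa8 collapse into the one `{ 0xa7, 0xa8 }` entry, while a gap or a change of width starts a new range.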

It also supports overrides, provided in an XML document of the same format
as the UCD itself. Entries in the overrides files are applied after the
entire UCD is read and will replace any impacted ranges.

The output (when packing) looks like this:

// Generated by Generate-CodepointWidthsFromUCD -Pack:True -Full:False
// on 05/17/2020 02:47:55 (UTC) from Unicode 13.0.0.
// 66182 (0x10286) codepoints covered.
static constexpr std::array<UnicodeRange, 23> s_wideAndAmbiguousTable{
    UnicodeRange{ 0xa1, 0xa1, CodepointWidth::Ambiguous },
    UnicodeRange{ 0xa4, 0xa4, CodepointWidth::Ambiguous },
    UnicodeRange{ 0xa7, 0xa8, CodepointWidth::Ambiguous },
    .
    .
    .
    UnicodeRange{ 0x1f210, 0x1f23b, CodepointWidth::Wide },
    UnicodeRange{ 0x1f37e, 0x1f393, CodepointWidth::Wide },
    UnicodeRange{ 0x100000, 0x10fffd, CodepointWidth::Ambiguous },
};

The output (when overriding) looks like this:

// Generated by Generate-CodepointWidthsFromUCD.ps1 -Pack:True -Full:False -NoOverrides:False
// on 5/22/2020 11:17:39 PM (UTC) from Unicode 13.0.0.
// 321205 (0x4E6B5) codepoints covered.
// 240 (0xF0) codepoints overridden.
static constexpr std::array<UnicodeRange, 23> s_wideAndAmbiguousTable{
    UnicodeRange{ 0xa1, 0xa1, CodepointWidth::Ambiguous },
    ...
    UnicodeRange{ 0xfe20, 0xfe2f, CodepointWidth::Narrow }, // narrow combining ligatures (split into left/right halves, which take 2 columns together)
    ...
    UnicodeRange{ 0x100000, 0x10fffd, CodepointWidth::Ambiguous },
};

DHowett · 2020-05-18T01:55:40Z

This will let us move our overrides into a documented, maintainable format. They will be part of CodepointWidthDetector's table, and we can use the same tooling to generate it and the same code to read it at runtime.

I will profile the speed difference between "getQuick" (51 comparisons) and "CPWD::GetWidth" (log2(245) comparisons (~8)) (!!!!!!), and maybe once the overrides are documented in an XML file we can remove GetQuick completely.

DHowett · 2020-05-18T17:22:19Z

I was exaggerating, but GetQuickCharWidth actually does do 28*2 comparisons, and CodepointWidthDetector is definitely log2(n).
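
For readers unfamiliar with why the table lookup is logarithmic, the following is a rough C++ sketch of a binary search over a sorted range table. It is an illustration only, not the actual `CodepointWidthDetector` code: the type definitions, `s_sampleTable` (six entries copied from the sample output) and `LookupWidth` are assumed names. A binary search over n sorted ranges needs roughly log2(n) probes, which is where the "~8" for 245 entries comes from.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in types modelled on the generated table (assumed, not the real headers).
enum class CodepointWidth { Narrow, Wide, Ambiguous };

struct UnicodeRange
{
    uint32_t first;
    uint32_t last; // inclusive
    CodepointWidth width;
};

// A tiny excerpt of the sample output, sorted by codepoint.
static constexpr std::array<UnicodeRange, 6> s_sampleTable{
    UnicodeRange{ 0xa1, 0xa1, CodepointWidth::Ambiguous },
    UnicodeRange{ 0xa4, 0xa4, CodepointWidth::Ambiguous },
    UnicodeRange{ 0xa7, 0xa8, CodepointWidth::Ambiguous },
    UnicodeRange{ 0x1f210, 0x1f23b, CodepointWidth::Wide },
    UnicodeRange{ 0x1f37e, 0x1f393, CodepointWidth::Wide },
    UnicodeRange{ 0x100000, 0x10fffd, CodepointWidth::Ambiguous },
};

// Hypothetical lookup: binary-search for the first range whose `last` is not
// below the codepoint, then check that the range really contains it.
// Codepoints not covered by any range are assumed to be Narrow.
template<std::size_t N>
CodepointWidth LookupWidth(const std::array<UnicodeRange, N>& table, uint32_t cp)
{
    const auto it = std::lower_bound(
        table.begin(), table.end(), cp,
        [](const UnicodeRange& r, uint32_t value) { return r.last < value; });
    if (it != table.end() && it->first <= cp)
    {
        return it->width;
    }
    return CodepointWidth::Narrow;
}
```

A linear chain of range checks costs up to two comparisons per range, whereas the search above only grows with the logarithm of the table size, so adding more ranges to the table is cheap.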

DHowett · 2020-05-26T03:15:43Z

This is ready for review. I've added support for an overrides file and moved a bunch of the logic into classes.

The overrides file is of the same format as the UCD itself. This is the example I was building with:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
   <repertoire>
      <override first-cp="2500" last-cp="259F" ea="H" comment="box-drawing and block elements require cell alignment" />
      <override first-cp="4DC0" last-cp="4DFF" ea="H" comment="hexagrams are historically narrow" />
      <override first-cp="FE20" last-cp="FE2F" ea="H" comment="narrow combining ligatures (split into left/right halves, which take 2 columns together)" />
   </repertoire>
</ucd>
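
As a rough illustration of "replace any impacted ranges", the C++ sketch below applies a single override to a sorted range list, trimming or dropping whatever it overlaps. This is not the script's implementation (which is PowerShell): `ApplyOverride` and the type definitions are hypothetical, and mapping `ea="H"` to `CodepointWidth::Narrow` is inferred from the sample output above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in types modelled on the generated table (assumed, not the real headers).
enum class CodepointWidth { Narrow, Wide, Ambiguous };

struct UnicodeRange
{
    uint32_t first;
    uint32_t last; // inclusive
    CodepointWidth width;
};

// Hypothetical helper: return a copy of `ranges` (sorted, non-overlapping)
// in which every codepoint covered by `ov` takes the override's width.
std::vector<UnicodeRange> ApplyOverride(const std::vector<UnicodeRange>& ranges,
                                        const UnicodeRange& ov)
{
    assert(ov.first <= ov.last);
    std::vector<UnicodeRange> out;
    bool inserted = false;
    for (const auto& r : ranges)
    {
        if (r.last < ov.first)
        {
            out.push_back(r); // entirely before the override
            continue;
        }
        if (r.first > ov.last)
        {
            // Entirely after the override: emit the override first if needed.
            if (!inserted)
            {
                out.push_back(ov);
                inserted = true;
            }
            out.push_back(r);
            continue;
        }
        // Overlap: keep the part left of the override, if any.
        if (r.first < ov.first)
        {
            out.push_back({ r.first, ov.first - 1, r.width });
        }
        if (!inserted)
        {
            out.push_back(ov);
            inserted = true;
        }
        // Keep the part right of the override, if any.
        if (r.last > ov.last)
        {
            out.push_back({ ov.last + 1, r.last, r.width });
        }
    }
    if (!inserted)
    {
        out.push_back(ov); // override lies after every existing range
    }
    return out;
}
```

Applying each override in turn after the whole UCD has been read gives the "overrides win" behaviour described above; the override for 2500..259F, for example, would swallow any ranges that fall entirely inside that block.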

tools/Generate-CodepointWidthsFromUCD.ps1

zadjii-msft

Definitely add an example of how to use it - where to get the input files, how to invoke it, and where to copy the output to.
Should we include the input files that are relevant to the current content of the CPWD in the docs/reference directory, with timestamps?

DHowett · 2020-05-28T16:45:45Z

I’d rather not include the input file; it’s 52MB. Fortunately, Unicode doesn’t go back and edit existing datasets, and the tool emits a version reference when it runs.

The overrides file is going to be checked in in a subsequent commit, titled “regenerate all the Unicode tables again”. This is just the script that does it :)

zadjii-msft · 2020-05-28T16:49:34Z

> it’s 52MB

holy what

Okay yea we don't need that.

> The overrides file is going to be checked in in a subsequent commit, titled “regenerate all the Unicode tables again”. This is just the script that does it :)

Ah yea, okay I guess I missed that. Considering the above,

jsoref · 2020-05-28T22:26:55Z

The conflicts shouldn't be hard to resolve: whitelist was renamed to expect. If you need help, I can help on Sunday.

DHowett · 2020-05-29T06:49:37Z

Thanks for the offer, @jsoref. Got this one on lock. 😄

ghost · 2020-06-03T07:16:04Z

Hello @DHowett!

Because this pull request has the AutoMerge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (`@msftbot`) and give me an instruction to get started! Learn more here.

DHowett added 4 commits May 16, 2020 19:49

fix tabs

448d698

lint the script

3e5bf5f

spell, fix

e97fc6a

DHowett added 5 commits May 22, 2020 15:11

Add overrides support

c958298

rework to use UnicodeRange class

593439f

cleanup: rangelist is a class, remove $last

7a482b6

Require PS 7

d8470b3

slightly clarify

a481fd3

DHowett added 3 commits May 25, 2020 20:20

Merge remote-tracking branch 'origin/master' into ttt

8a804b0

IComparable to allowlist

2381587

also bnot isn't a word

5820518

DHowett requested a review from miniksa May 26, 2020 03:22

DHowett added 2 commits May 25, 2020 20:29

Align operators with PS documentation

05c71a7

allowlist more powershell operators (:P)

c4d1dd2

miniksa reviewed May 27, 2020

zadjii-msft requested changes May 28, 2020

ghost added the Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something label May 28, 2020

ghost removed the Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something label May 28, 2020

zadjii-msft approved these changes May 28, 2020

DHowett added 3 commits May 28, 2020 23:43

Add a comment at the top

8302aef

more comment

cd0dc2e

Merge remote-tracking branch 'origin/master' into ttt

da3e8ca

DHowett added 2 commits May 28, 2020 23:50

actually prefer https

a7192d9

and pass more spelling checks

57c867c

miniksa approved these changes Jun 2, 2020

DHowett added the AutoMerge Marked for automatic merge by the bot when requirements are met label Jun 3, 2020

ghost merged commit eccfb53 into master Jun 3, 2020

ghost deleted the dev/duhowett/cpwd_from_powershell branch June 3, 2020 07:16

This pull request was closed.
