@DHowett DHowett commented May 17, 2020

This commit introduces Generate-CodepointWidthsFromUCD, a PowerShell
(7+) script that parses a UCD XML database in the UAX 42 format from
https://www.unicode.org/Public/UCD/latest/ucdxml/ and generates
CodepointWidthDetector's giant width array.

By default, it will emit one UnicodeRange for every range of non-narrow
glyphs with a different Width + Emoji + Emoji Presentation class;
however, it can be run in "packing" and "full" mode.

  • Packing mode: ignore the width/emoji/pres class and combine adjacent
    runs that CPWD will treat the same.
    • This is for optimizing the number of individual ranges emitted
      into code.
  • Full mode: include narrow codepoints (helpful for visualization)
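As a rough illustration of what packing means, here is a minimal C++ sketch that merges adjacent runs sharing a width. It is not the script's code (the script is PowerShell); `PackRanges` and the field names `lowerBound`/`upperBound` are hypothetical, and it assumes the input is already sorted and non-overlapping.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins: only the names UnicodeRange and CodepointWidth
// appear in this PR; the field names are assumptions for illustration.
enum class CodepointWidth { Narrow, Wide, Ambiguous };

struct UnicodeRange
{
    uint32_t lowerBound;
    uint32_t upperBound;
    CodepointWidth width;
};

// Merge neighbouring ranges that touch (next.lowerBound == prev.upperBound + 1)
// and have the same width. Input must be sorted by lowerBound and non-overlapping.
std::vector<UnicodeRange> PackRanges(const std::vector<UnicodeRange>& sorted)
{
    std::vector<UnicodeRange> packed;
    for (const auto& r : sorted)
    {
        assert(r.lowerBound <= r.upperBound);
        if (!packed.empty() &&
            packed.back().width == r.width &&
            packed.back().upperBound + 1 == r.lowerBound)
        {
            // Same width and directly adjacent: extend the previous run.
            packed.back().upperBound = r.upperBound;
        }
        else
        {
            packed.push_back(r);
        }
    }
    return packed;
}
```

Two runs that differ only in emoji class would end up with the same width here, so they collapse into one table entry; runs separated by a gap of narrow codepoints stay separate.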

It also supports overrides, provided in an XML document of the same format
as the UCD itself. Entries in the overrides files are applied after the
entire UCD is read and will replace any impacted ranges.

The output (when packing) looks like this:

// Generated by Generate-CodepointWidthsFromUCD -Pack:True -Full:False
// on 05/17/2020 02:47:55 (UTC) from Unicode 13.0.0.
// 66182 (0x10286) codepoints covered.
static constexpr std::array<UnicodeRange, 23> s_wideAndAmbiguousTable{
    UnicodeRange{ 0xa1, 0xa1, CodepointWidth::Ambiguous },
    UnicodeRange{ 0xa4, 0xa4, CodepointWidth::Ambiguous },
    UnicodeRange{ 0xa7, 0xa8, CodepointWidth::Ambiguous },
    .
    .
    .
    UnicodeRange{ 0x1f210, 0x1f23b, CodepointWidth::Wide },
    UnicodeRange{ 0x1f37e, 0x1f393, CodepointWidth::Wide },
    UnicodeRange{ 0x100000, 0x10fffd, CodepointWidth::Ambiguous },
};

The output (when overriding) looks like this:

// Generated by Generate-CodepointWidthsFromUCD.ps1 -Pack:True -Full:False -NoOverrides:False
// on 5/22/2020 11:17:39 PM (UTC) from Unicode 13.0.0.
// 321205 (0x4E6B5) codepoints covered.
// 240 (0xF0) codepoints overridden.
static constexpr std::array<UnicodeRange, 23> s_wideAndAmbiguousTable{
    UnicodeRange{ 0xa1, 0xa1, CodepointWidth::Ambiguous },
    ...
    UnicodeRange{ 0xfe20, 0xfe2f, CodepointWidth::Narrow }, // narrow combining ligatures (split into left/right halves, which take 2 columns together)
    ...
    UnicodeRange{ 0x100000, 0x10fffd, CodepointWidth::Ambiguous },
};

DHowett added 4 commits May 16, 2020 19:49
DHowett commented May 18, 2020

This will let us move our overrides into a documented, maintainable format. They will be part of CodepointWidthDetector's table, and we can use the same tooling to generate it and the same code to read it at runtime.

I will profile the speed difference between "getQuick" (51 comparisons) and "CPWD::GetWidth" (log2(245) comparisons, ~8) (!!!!!!), and maybe once the overrides are documented in an XML file we can remove GetQuick completely.

DHowett commented May 18, 2020

I was exaggerating, but GetQuickCharWidth actually does do 28*2 comparisons, and CodepointWidthDetector is definitely log2(n).
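To make the log2(n) point concrete, here is a small C++ sketch of a binary search over a sorted range table. It is not CodepointWidthDetector's actual code: `LookupWidth`, `s_sampleTable` and the field names are hypothetical, the table holds only a few entries borrowed from the generated output, and treating anything missing from the table as Narrow is an assumption based on the table listing only non-narrow ranges.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins; the field names are assumptions for illustration.
enum class CodepointWidth { Narrow, Wide, Ambiguous };

struct UnicodeRange
{
    uint32_t lowerBound;
    uint32_t upperBound;
    CodepointWidth width;
};

// A tiny sample table, sorted by lowerBound, using a few entries from the output above.
static constexpr std::array<UnicodeRange, 4> s_sampleTable{
    UnicodeRange{ 0xa1, 0xa1, CodepointWidth::Ambiguous },
    UnicodeRange{ 0xa4, 0xa4, CodepointWidth::Ambiguous },
    UnicodeRange{ 0xa7, 0xa8, CodepointWidth::Ambiguous },
    UnicodeRange{ 0x1f210, 0x1f23b, CodepointWidth::Wide },
};

CodepointWidth LookupWidth(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    // Binary search for the first range whose upperBound is >= cp.
    const auto it = std::lower_bound(
        s_sampleTable.begin(), s_sampleTable.end(), cp,
        [](const UnicodeRange& r, uint32_t value) { return r.upperBound < value; });
    if (it != s_sampleTable.end() && it->lowerBound <= cp)
    {
        return it->width;
    }
    // Assumption: codepoints not covered by the table are narrow.
    return CodepointWidth::Narrow;
}
```

With 245 entries, a search like this needs about 8 probes, versus walking a fixed list of comparisons one by one.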

DHowett commented May 26, 2020

This is ready for review. I've added support for an overrides file and moved a bunch of the logic into classes.

The overrides file is of the same format as the UCD itself. This is the example I was building with:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
   <repertoire>
      <override first-cp="2500" last-cp="259F" ea="H" comment="box-drawing and block elements require cell alignment" />
      <override first-cp="4DC0" last-cp="4DFF" ea="H" comment="hexagrams are historically narrow" />
      <override first-cp="FE20" last-cp="FE2F" ea="H" comment="narrow combining ligatures (split into left/right halves, which take 2 columns together)" />
   </repertoire>
</ucd>
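One way to picture "applied after the entire UCD is read" is a flat per-codepoint array that the overrides overwrite last. The C++ sketch below is an illustration under that assumption, not how the PowerShell script is implemented; `WidthOverride`, `ApplyOverrides` and `kCodepointCount` are hypothetical names, and the mapping of `ea="H"` to Narrow follows the FE20..FE2F line in the generated output.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class CodepointWidth : uint8_t { Narrow, Wide, Ambiguous };

// Hypothetical in-memory form of one <override first-cp=... last-cp=... ea=...> entry.
struct WidthOverride
{
    uint32_t first;
    uint32_t last;
    CodepointWidth width;
};

// Number of Unicode codepoints (U+0000 through U+10FFFF).
constexpr uint32_t kCodepointCount = 0x110000;

// Overwrite the widths derived from the UCD with the override entries.
// Later overrides win over earlier ones if they overlap.
void ApplyOverrides(std::vector<CodepointWidth>& widths,
                    const std::vector<WidthOverride>& overrides)
{
    assert(widths.size() == kCodepointCount);
    for (const auto& o : overrides)
    {
        assert(o.first <= o.last && o.last < kCodepointCount);
        for (uint32_t cp = o.first; cp <= o.last; ++cp)
        {
            widths[cp] = o.width;
        }
    }
}
```

Because the overrides run last, they win regardless of what the UCD says for those codepoints, and the ranges can then be rebuilt (and packed) from the array.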

@DHowett DHowett requested a review from miniksa May 26, 2020 03:22
@zadjii-msft zadjii-msft left a comment

  1. Definitely add an example of how to use it: where to get the input files, how to invoke it, and where to copy the output to
  2. Should we include the input files that are relevant to the current content of the CPWD in the docs/reference directory, with timestamps?

@ghost ghost added the Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something label May 28, 2020
DHowett commented May 28, 2020

I’d rather not include the input file; it’s 52MB. Fortunately, Unicode doesn’t go back and edit existing datasets, and the tool emits a version reference when it runs.

The overrides file is going to be checked in in a subsequent commit, titled “regenerate all the Unicode tables again”. This is just the script that does it :)

@ghost ghost removed the Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something label May 28, 2020
@zadjii-msft
it’s 52MB

holy what

Okay yea we don't need that.

The overrides file is going to be checked in in a subsequent commit, titled “regenerate all the Unicode tables again”. This is just the script that does it :)

Ah yea, okay I guess I missed that. Considering the above,

jsoref commented May 28, 2020

The conflicts shouldn't be hard to resolve: whitelist was renamed to expect. If you need help, I can help on Sunday.

DHowett commented May 29, 2020

Thanks for the offer, @jsoref. Got this one on lock. 😄

@DHowett DHowett added the AutoMerge Marked for automatic merge by the bot when requirements are met label Jun 3, 2020
ghost commented Jun 3, 2020

Hello @DHowett!

Because this pull request has the AutoMerge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

@ghost ghost merged commit eccfb53 into master Jun 3, 2020
@ghost ghost deleted the dev/duhowett/cpwd_from_powershell branch June 3, 2020 07:16
This pull request was closed.