Viewing USE character subclasses #3758

Richard57 · 2022-07-27T07:32:30Z

If one could determine the release of HarfBuzz being used, one used to be able to easily look up the HarfBuzz Universal Shaping Engine subclasses of characters in hb-ot-shaper-use-table.hh. However, the useful comments in the definition of hb_use_u8[] have disappeared, and to find the place one has to recognise a sequence of values. What is now the recommended way for a human to quickly look up the subclassifications?

I refer to subclasses because some of the classes are split up in a manner that is largely related to their positioning.

behdad · 2022-07-27T20:01:54Z

Right. I just added a main() function to hb-ot-shaper-use-table.hh such that if you compile it like this:

$ g++ -x c++  -DHB_USE_TABLE_MAIN src/hb-ot-shaper-use-table.hh -o use-table

then you can do for example:

$ ./use-table 10A00
1

That means the category for U+10A00 is 1. Looking up number 1 in hb-ot-shaper-use-machine.rl shows:

export B = 1; # BASE

Hope this helps at least.
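
For readers who want to automate that lookup, the following is a minimal sketch, not part of HarfBuzz, that turns lines of the form shown above into a number-to-name table. It assumes each category is declared on a single line as `export NAME = NUMBER; # COMMENT`, which is inferred only from the one line quoted here; the function name `parse_use_categories` is made up for illustration.

```python
import re

# Assumed line shape, based on the single example quoted above:
#   export B = 1; # BASE
EXPORT_RE = re.compile(r"^\s*export\s+(\w+)\s*=\s*(\d+)\s*;\s*(?:#\s*(.*))?$")

def parse_use_categories(rl_text):
    """Map category number -> (short name, comment) from Ragel source text."""
    categories = {}
    for line in rl_text.splitlines():
        match = EXPORT_RE.match(line)
        if match:
            name, number, comment = match.groups()
            # A line without a trailing comment yields an empty string.
            categories[int(number)] = (name, (comment or "").strip())
    return categories

if __name__ == "__main__":
    # The only declaration confirmed in this thread.
    sample = "export B\t= 1; # BASE\n"
    print(parse_use_categories(sample))
```

Lines that do not match the assumed shape are skipped silently, so the rest of the Ragel file can be fed in unchanged.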

Richard57 · 2022-07-28T08:00:30Z

I feared it might be as bad as this. Is use-table going to be tidied up to appear in the utilities directory? I hope the format of the Ragel code stays stable enough for simple-minded parsing (no worse than awk) to complete the job of converting number to text. Outputting the whole line wouldn't be too bad, because some of the comments are quite useful.

behdad · 2022-07-28T15:35:54Z

I feared it might be as bad as this. Is use-table going to be tidied up to appear in the utilities directory?

Not in the util directory. But we can maybe build it in the src/ directory. @khaledhosny can you help with that?

I hope the format of the Ragel code stays stable enough for simple-minded parsing (no worse than awk) to complete the job of converting number to text. Outputting the whole line wouldn't be too bad, because some of the comments are quite useful.

It should stay stable, yes.

I also changed it such that if you don't pass any argument it prints the values for all of Unicode, in this format:

...
U+10A00 1
U+10A01 34
U+10A02 34
U+10A03 34
U+10A04 0
U+10A05 33
U+10A06 34
U+10A07 0
U+10A08 0
U+10A09 0
U+10A0A 0
U+10A0B 0
...
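
A dump in this format is straightforward to post-process. As a rough sketch (not part of HarfBuzz; `parse_dump` and `runs` are hypothetical helper names), the following reads `U+XXXX N` lines and collapses consecutive codepoints that share a value into ranges, which makes the full listing shorter to scan. It assumes one `U+<hex> <decimal>` pair per line, as in the excerpt above.

```python
def parse_dump(text):
    """Return {codepoint: category number} from 'U+XXXX N' lines."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that is not a 'U+<hex> <decimal>' pair, e.g. '...'.
        if len(parts) != 2 or not parts[0].startswith("U+"):
            continue
        table[int(parts[0][2:], 16)] = int(parts[1])
    return table

def runs(table):
    """Collapse consecutive codepoints with equal values into (first, last, value)."""
    result = []
    for cp in sorted(table):
        value = table[cp]
        if result and result[-1][1] == cp - 1 and result[-1][2] == value:
            # Extend the current run by one codepoint.
            result[-1] = (result[-1][0], cp, value)
        else:
            result.append((cp, cp, value))
    return result

if __name__ == "__main__":
    # Sample lines copied from the excerpt above.
    sample = """...
U+10A00 1
U+10A01 34
U+10A02 34
U+10A03 34
U+10A04 0
U+10A05 33
U+10A06 34
U+10A07 0
...
"""
    for first, last, value in runs(parse_dump(sample)):
        print(f"U+{first:04X}..U+{last:04X} {value}")
```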

khaledhosny · 2022-07-31T07:32:39Z

I feared it might be as bad as this. Is use-table going to be tidied up to appear in the utilities directory?

Not in the util directory. But we can maybe build it in the src/ directory. @khaledhosny can you help with that?

To be honest, I have no idea what is being discussed here.

Richard57 · 2022-07-31T09:59:07Z

@khaledhosny, the Universal Shaping Engine (USE) groups text in a supported script into short runs called 'clusters'. These are the domain of the GSUB commands first executed and of rearrangement. Any text that does not group into a cluster has a dotted circle inserted. Clusters are defined in terms of a partition of codepoints into USE categories, along with a further refinement by the usual position, yielding subclasses. The division into categories is mostly defined by the formally arbitrary (no published definition or guide) Unicode property IndicSyllabicCategory. To make it work, there is a list of overrides applied by Windows, and the more egregious errors are again fixed by HarfBuzz, e.g. to handle visible viramas being used as components of vowels. (One problem is that the categorisation is a partition, whereas a mark can function in several categories.) Looking up the subclass so as to understand a dotted circle is not easy and is error-prone.

HarfBuzz, in autogenerated code, records the subclasses in an array in hb-ot-shaper-use-table.hh. (The data has migrated around several similarly named files. I can't work out how to navigate to it from github.com.) The comments therein used to label each value by codepoint, which enabled a user to see what class a character was in, but at some point after the release of Unicode 14.0, those comments ceased to be generated. I therefore asked if there were a recommended alternative method to find out what subclasses HarfBuzz was using for the USE.

Behdad kindly added a programmatic interface to extract these values from a codepoint. With the recommended manual compilation (not part of the standard build process), this results in an executable image use-table in the src directory. Behdad has asked if you could tidy this up. That question or request is not completely clear to me; is that where your incomprehension lies? I presume the ideal outcome is that when HarfBuzz is built, the executable will occur in the src directory in the build directory structure.

For example, wrapping a simple script around the program, I can display a debatable set of subclass allocations by my command USE-table 1A79 1A7c and get

1A79 => VMAbv # VOWEL_MOD_ABOVE
1A7A => VAbv # VOWEL_ABOVE / VOWEL_ABOVE_BELOW / VOWEL_ABOVE_BELOW_POST / VOWEL_ABOVE_POST
1A7B => VMAbv # VOWEL_MOD_ABOVE
1A7C => VMAbv # VOWEL_MOD_ABOVE

This is all extracted from the HarfBuzz code (use-table and hb-ot-shaper-use-machine.rl) in one fashion or another, using nothing more sophisticated than grep, awk and sed.
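
The wrapper script itself is not shown in this thread. As a hedged illustration of the joining step only, the sketch below combines a codepoint-to-number table with a number-to-name table and produces lines in the format above. Both tables are filled with the one pair confirmed earlier in this thread (U+10A00 has category 1, which is `B # BASE`); the function name `describe` and the `?` fallback are assumptions for illustration, not anything HarfBuzz provides.

```python
def describe(codepoints, values, names):
    """Format 'CP => NAME # COMMENT' lines for the given codepoints.

    values: {codepoint: category number}, e.g. parsed from use-table output.
    names:  {category number: (short name, comment)}, e.g. parsed from the
            'export' lines of hb-ot-shaper-use-machine.rl.
    """
    lines = []
    for cp in codepoints:
        number = values.get(cp)
        # Unknown codepoints or numbers fall back to '?' rather than failing.
        name, comment = names.get(number, ("?", ""))
        line = f"{cp:04X} => {name}"
        if comment:
            line += f" # {comment}"
        lines.append(line)
    return lines

if __name__ == "__main__":
    values = {0x10A00: 1}        # from the use-table example in this thread
    names = {1: ("B", "BASE")}   # from the Ragel line quoted in this thread
    print("\n".join(describe([0x10A00], values, names)))
```

Keeping the two tables separate mirrors the split in the HarfBuzz sources: the numbers come from the compiled table, while the names and comments come from the Ragel file.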

khaledhosny · 2022-07-31T20:54:18Z

Thanks, this all makes sense now. I think building a binary in src should be doable, there is a bunch of them there already.

behdad · 2022-08-02T18:19:15Z

Now building src/test-use-table. That's the closest we can get I'm afraid. The grep, awk, sed has to remain in your scripts.

behdad closed this as completed in 826639f Aug 2, 2022
