Viewing USE character subclasses #3758

Closed
Richard57 opened this issue Jul 27, 2022 · 7 comments

Comments

@Richard57

If one could determine the release of HarfBuzz being used, one used to be able to easily look up the HarfBuzz Universal Shaping Engine subclasses of characters in hb-ot-shaper-use-table.hh. However, the useful comments in the definition of hb_use_u8[] have disappeared, and to find the place one has to recognise a sequence of values. What is now the recommended way for a human to quickly look up the subclassifications?

I refer to subclasses because some of the classes are split up in a manner that is largely related to their positioning.

@behdad
Member

behdad commented Jul 27, 2022

Right. I just added a main() function to hb-ot-shaper-use-table.hh such that if you compile it like this:

$ g++ -x c++  -DHB_USE_TABLE_MAIN src/hb-ot-shaper-use-table.hh -o use-table

then you can do for example:

$ ./use-table 10A00
1

That means the category for U+10A00 is 1. Looking up number 1 in hb-ot-shaper-use-machine.rl shows:

export B = 1; # BASE

Hope this helps at least.

@Richard57
Author

I feared it might be as bad as this. Is use-table going to be tidied up to appear in the utilities directory? I hope the format of the Ragel code stays stable enough for simple-minded parsing (no worse than awk) to complete the job of converting number to text. Outputting the whole line wouldn't be too bad, because some of the comments are quite useful.

@behdad
Member

behdad commented Jul 28, 2022

> I feared it might be as bad as this. Is use-table going to be tidied up to appear in the utilities directory?

Not in the util directory. But we can maybe build it in the src/ directory. @khaledhosny can you help with that?

> I hope the format of the Ragel code stays stable enough for simple-minded parsing (no worse than awk) to complete the job of converting number to text. Outputting the whole line wouldn't be too bad, because some of the comments are quite useful.

It should stay stable, yes.

I also changed it such that if you don't pass any argument it prints the values for all of Unicode, in this format:

...
U+10A00 1
U+10A01 34
U+10A02 34
U+10A03 34
U+10A04 0
U+10A05 33
U+10A06 34
U+10A07 0
U+10A08 0
U+10A09 0
U+10A0A 0
U+10A0B 0
...
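
For anyone scripting against this dump, here is a minimal Python sketch of reading it into a lookup table. It assumes only the line shape shown in the sample above (`U+XXXX`, whitespace, a decimal number); the names `parse_dump` and `group_by_category` are hypothetical helpers, not part of HarfBuzz.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the sample output above: "U+10A00 1".
DUMP_LINE = re.compile(r"^U\+([0-9A-Fa-f]{4,6})\s+(\d+)$")


def parse_dump(lines):
    """Map each codepoint to its numeric USE category.

    Lines that do not match the assumed shape (such as "...") are skipped.
    """
    table = {}
    for line in lines:
        match = DUMP_LINE.match(line.strip())
        if match:
            table[int(match.group(1), 16)] = int(match.group(2))
    return table


def group_by_category(table):
    """Invert the table: numeric category -> sorted list of codepoints."""
    groups = defaultdict(list)
    for codepoint, category in sorted(table.items()):
        groups[category].append(codepoint)
    return dict(groups)


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pipe the tool's full dump into this script.
    for category, codepoints in sorted(group_by_category(parse_dump(sys.stdin)).items()):
        print(category, len(codepoints))
```

Reading from standard input keeps the sketch independent of where the binary ends up being built.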

@khaledhosny
Collaborator

> > I feared it might be as bad as this. Is use-table going to be tidied up to appear in the utilities directory?

> Not in the util directory. But we can maybe build it in the src/ directory. @khaledhosny can you help with that?

To be honest, I have no idea what is being discussed here.

@Richard57
Author

@khaledhosny, the Universal Shaping Engine (USE) groups text in a supported script into short runs called 'clusters'. These are the domain of the GSUB commands first executed and of rearrangement. Any text that does not group into a cluster has a dotted circle inserted. Clusters are defined in terms of a partition of codepoints into USE categories, along with a further refinement by the usual position, yielding subclasses. The division into categories is mostly defined by the formally arbitrary (no published definition or guide) Unicode property IndicSyllabicCategory. To make it work, Windows applies a list of overrides, and the more egregious errors are again fixed by HarfBuzz, e.g. to handle visible viramas being used as components of vowels. (One problem is that the categorisation is a partition, whereas a mark can function in several categories.) Looking up the subclass so as to understand a dotted circle is not easy and is error-prone.

HarfBuzz, in autogenerated code, records the subclasses in an array in hb-ot-shaper-use-table.hh. (The data has migrated around several similarly named files. I can't work out how to navigate to it from github.com.) The comments therein used to label each value by codepoint, which enabled a user to see what class a codepoint had, but at some point after the release of Unicode 14.0, those comments ceased to be generated. I therefore asked if there were a recommended alternative method to find out what subclasses HarfBuzz was using for the USE.

Behdad kindly added a programmatic interface to extract these values for a codepoint. With the recommended manual compilation (not part of the standard build process), this results in an executable image use-table in the src directory. Behdad has asked if you could tidy this up. That question or request is not completely clear to me; is that where your incomprehension lies? I presume the ideal outcome is that when HarfBuzz is built, the executable will occur in the src directory in the build directory structure.

For example, wrapping a simple script around the program, I can display a debatable set of subclass allocations by my command USE-table 1A79 1A7c and get

1A79 => VMAbv # VOWEL_MOD_ABOVE
1A7A => VAbv # VOWEL_ABOVE / VOWEL_ABOVE_BELOW / VOWEL_ABOVE_BELOW_POST / VOWEL_ABOVE_POST
1A7B => VMAbv # VOWEL_MOD_ABOVE
1A7C => VMAbv # VOWEL_MOD_ABOVE

This is all extracted from the HarfBuzz code (use-table and hb-ot-shaper-use-machine.rl) in one fashion or another, using nothing more sophisticated than grep, awk and sed.
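
As a rough illustration of the number-to-text step, here is a Python sketch rather than the grep, awk and sed actually used. It assumes every category line in hb-ot-shaper-use-machine.rl has the shape of the one example quoted earlier in this thread (`export NAME = NUMBER; # COMMENT`); `parse_categories` and `describe` are hypothetical names, and the output format simply mimics the listing above.

```python
import re

# Assumed shape of a category line, based on the single example in this thread:
#   export B = 1; # BASE
# Whitespace (including tabs) around "=" is tolerated; the comment is optional.
EXPORT_LINE = re.compile(r"^\s*export\s+(\w+)\s*=\s*(\d+)\s*;\s*(?:#\s*(.*?))?\s*$")


def parse_categories(ragel_text):
    """Return {number: (name, comment)} for every line matching the assumed shape."""
    names = {}
    for line in ragel_text.splitlines():
        match = EXPORT_LINE.match(line)
        if match:
            names[int(match.group(2))] = (match.group(1), match.group(3) or "")
    return names


def describe(codepoint_hex, category, names):
    """Format one result line; unknown numbers are printed as-is."""
    name, comment = names.get(category, (str(category), ""))
    text = f"{codepoint_hex.upper()} => {name}"
    return f"{text} # {comment}" if comment else text


if __name__ == "__main__":
    # Only the one definition quoted in this thread is used as sample input.
    names = parse_categories("export B = 1; # BASE\n")
    print(describe("10A00", 1, names))
```

Keeping the whole trailing comment, as suggested earlier in the thread, preserves the useful annotations from the Ragel source.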

@khaledhosny
Collaborator

Thanks, this all makes sense now. I think building a binary in src should be doable; there is a bunch of them there already.

@behdad behdad closed this as completed in 826639f Aug 2, 2022
@behdad
Member

behdad commented Aug 2, 2022

Now building src/test-use-table. That's the closest we can get I'm afraid. The grep, awk, sed has to remain in your scripts.
