Viewing USE character subclasses #3758
Comments
Right. I just added a main() function to
then you can do for example:
That means the category for U+10A00 is 1. Looking up number 1 in
Hope this helps at least. |
I feared it might be as bad as this. Is use-table going to be tidied up to appear in the utilities directory? I hope the format of the Ragel code stays stable enough for simple-minded parsing (no worse than awk) to complete the job of converting number to text. Outputting the whole line wouldn't be too bad, because some of the comments are quite useful. |
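The "simple-minded parsing" mentioned above could look roughly like the following Python sketch. It assumes each category definition in the Ragel source sits on one line shaped like `export NAME = NUMBER; # COMMENT`. That shape, the function name `parse_categories`, and the sample lines are assumptions for illustration, not something this thread confirms, so check them against the real file before relying on it.

```python
import re

# Assumed line shape (not confirmed by the thread):
#   export NAME = NUMBER; # COMMENT
DEFINITION = re.compile(r"^\s*export\s+(\w+)\s*=\s*(\d+)\s*;\s*(?:#\s*(.*))?$")


def parse_categories(text):
    """Map each category number to (name, trailing comment)."""
    table = {}
    for line in text.splitlines():
        match = DEFINITION.match(line)
        if match:
            name, number, comment = match.groups()
            # Keep the comment, since it is often the most readable part.
            table[int(number)] = (name, (comment or "").strip())
    return table


# Illustrative sample only; the real names and numbers may differ.
SAMPLE = """\
export O = 0; # OTHER
export B = 1; # BASE
%% some unrelated Ragel line
"""

if __name__ == "__main__":
    for number, (name, comment) in sorted(parse_categories(SAMPLE).items()):
        print(number, name, comment)
```

Lines that do not match the assumed shape are skipped silently, so a format change would show up as missing entries rather than as an error.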
Not in the util directory. But we can maybe build it in the src/ directory. @khaledhosny can you help with that?
It should stay stable, yes. I also changed it such that if you don't pass any argument it prints the values for all of Unicode, in this format:
|
To be honest, I have no idea what is being discussed here. |
@khaledhosny, the Universal Shaping Engine (USE) groups text in a supported script into short runs called 'clusters'. These are the domain of the GSUB commands first executed and of rearrangement. Any text that does not group into a cluster has a dotted circle inserted. Clusters are defined in terms of a partition of codepoints into USE categories, along with a further refinement by the usual position, yielding subclasses. The division into categories is mostly defined by the formally arbitrary (no published definition or guide) Unicode property IndicSyllabicCategory. To make it work, Windows applies a list of overrides, and the more egregious errors are again fixed by HarfBuzz, e.g. to handle visible viramas being used as components of vowels. (One problem is that the categorisation is a partition, whereas a mark can function in several categories.) Looking up the subclass so as to understand a dotted circle is not easy and is error-prone. HarfBuzz, in autogenerated code, records the subclasses in an array in Behdad kindly added a programmatic interface to extract these values from a codepoint. With the recommended manual compilation (not part of the standard build process), this results in an executable image For example, wrapping a simple script around the program, I can display a debatable set of subclass allocations by my command
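A wrapper of the kind described might group code points by subclass so that questionable allocations stand out. The Python sketch below is only a guess at the idea. It takes (code point, number) pairs from whatever source is available and makes no assumption about the lookup program's output format. The function names and the label table are hypothetical. The one value taken from this thread is category 1 for U+10A00; the number 99 is a made-up placeholder.

```python
import unicodedata
from collections import defaultdict


def group_by_subclass(pairs, labels):
    """Group code points by subclass label.

    pairs  -- iterable of (code point, subclass number)
    labels -- dict from number to a readable name; numbers missing
              from it fall back to the number written as text
    """
    groups = defaultdict(list)
    for codepoint, number in pairs:
        groups[labels.get(number, str(number))].append(codepoint)
    return dict(groups)


def describe(codepoint):
    """Format a code point as 'U+XXXX NAME' using the Unicode name."""
    name = unicodedata.name(chr(codepoint), "<unnamed>")
    return f"U+{codepoint:04X} {name}"


if __name__ == "__main__":
    # Hypothetical label table; 99 is a placeholder, not a real value.
    labels = {1: "B"}
    pairs = [(0x10A00, 1), (0x10A01, 99)]
    for label, members in group_by_subclass(pairs, labels).items():
        print(label)
        for codepoint in members:
            print("   ", describe(codepoint))
```

Falling back to the bare number for unknown labels keeps the listing usable even when the label table lags behind the generated data.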
This is all extracted from the HarfBuzz code ( |
Thanks, this all makes sense now. I think building a binary in src should be doable, there is a bunch of them there already. |
Now building |
If one could determine the release of HarfBuzz being used, one used to be able to easily look up the HarfBuzz Universal Shaping Engine subclasses of characters in hb-ot-shaper-use-table.hh. However, the useful comments in the definition of hb_use_u8[] have disappeared, and to find the place one has to recognise a sequence of values. What now is the recommended way for a human to quickly look up the subclassifications?
I refer to subclasses because some of the classes are split up in a manner that is largely related to their positioning.