[USE] Merge the categories S and O #3249
Merged
This PR merges the USE categories SYM and OTHER. The only effect is that SYM_MOD characters are now allowed after OTHER as well as after SYM. This is the first part of a bigger set of changes I want to make, so I’ll explain here the circumstances under which I think HarfBuzz should and should not violate the USE spec.
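As a rough illustration of that one behavioral change, here is a toy validator. The category names come from the USE spec, but the function `cluster_ok` and its `merged` flag are hypothetical simplifications for this explanation. They are not HarfBuzz's actual state machine.

```python
# Toy model only: a hypothetical simplification, not HarfBuzz's code.
# A symbol-like cluster is a base (SYM or OTHER) followed by zero or
# more SYM_MOD marks.

def cluster_ok(categories, merged):
    """Return True if `categories` forms one valid cluster.

    `merged=False` models the old behavior (only SYM may carry SYM_MOD);
    `merged=True` models the behavior after this PR.
    """
    if not categories:
        return False
    base, *marks = categories
    if base not in ("SYM", "OTHER"):
        return False
    if not all(mark == "SYM_MOD" for mark in marks):
        return False
    # Before the merge, an OTHER base is valid only with no marks.
    return merged or base == "SYM" or not marks


print(cluster_ok(["OTHER", "SYM_MOD"], merged=False))  # old behavior
print(cluster_ok(["OTHER", "SYM_MOD"], merged=True))   # new behavior
```

In this model, a sequence the old rules rejected would have received a dotted circle before the mark. With the categories merged, it forms a single cluster.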
USE clusters determine where dotted circles are inserted. They should only be inserted when the text is misencoded, not merely unexpected or wrong. If there are multiple ways to represent some text in an encoding, and they aren’t canonically equivalent, and they don’t involve default ignorable code points, then one of them is chosen (more or less arbitrarily) to be right, and the rest are considered misencoded.
One example is the order of ccc=0 vowel signs. By convention, above-base marks precede below-base marks. This is the opposite of the order for ccc≠0 vowel signs. Neither is wrong: it’s completely arbitrary. Therefore, we pick one convention and insert dotted circles for the other.
Another example is #1273. The Khudawadi independent ai is represented as <U+112B7 KHUDAWADI LETTER AI>, but it could just as well have been <U+112B0 KHUDAWADI LETTER A, U+112E6 KHUDAWADI VOWEL SIGN AI>. There is no real difference between them: it is an artifact of the encoding. Therefore, we insert a dotted circle in the latter. (This isn’t related to USE clusters; I mention it because it follows the same principle.)
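The "not canonically equivalent" condition can be checked with Python's standard `unicodedata` module. This is only a sketch of the criterion: the helper `canonically_equivalent` is a name made up for this example, and the result depends on the Unicode data bundled with the interpreter.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The two candidate encodings of the Khudawadi independent ai.
single = "\U000112B7"            # KHUDAWADI LETTER AI
sequence = "\U000112B0\U000112E6"  # LETTER A + VOWEL SIGN AI

# Control pair that IS canonically equivalent: precomposed e-acute
# versus e followed by a combining acute accent.
precomposed = "\u00E9"
decomposed = "e\u0301"

print(canonically_equivalent(single, sequence))
print(canonically_equivalent(precomposed, decomposed))
```

Normalization cannot unify the two Khudawadi spellings, so a convention has to pick one of them as correct.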
Another example is #627. Batak closed syllables are encoded in pronunciation order, not in visual order: <consonant, vowel, consonant, killer> instead of <consonant, consonant, vowel, killer>. The visual order is not a priori wrong, but it’s not the Unicode way. Therefore, text in visual order is misencoded.
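The two orders can be told apart from character classes alone. The sketch below uses abstract class labels ("C" for consonant, "V" for vowel sign, "K" for killer) rather than real Batak code points. The function `closed_syllable_order` is hypothetical and does not reflect how HarfBuzz actually detects the problem.

```python
# Hypothetical sketch: classify a four-element closed syllable by the
# order of its character classes. "C" = consonant, "V" = vowel sign,
# "K" = killer.

PRONUNCIATION_ORDER = ["C", "V", "C", "K"]  # the order Unicode specifies
VISUAL_ORDER = ["C", "C", "V", "K"]         # treated as misencoded


def closed_syllable_order(classes):
    """Return "pronunciation", "visual", or "other" for a class list."""
    if classes == PRONUNCIATION_ORDER:
        return "pronunciation"
    if classes == VISUAL_ORDER:
        return "visual"
    return "other"


def is_misencoded(classes):
    """Under the convention described above, visual order is misencoded."""
    return closed_syllable_order(classes) == "visual"


print(closed_syllable_order(["C", "V", "C", "K"]))
print(is_misencoded(["C", "C", "V", "K"]))
```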
Another example is hieroglyph clusters, which Andrew Glass is still working on. Hieroglyphic format controls are an abstraction defined by Unicode, not really part of the Egyptian hieroglyphic script, so Unicode is the sole determiner of how to use these code points correctly. Using them incorrectly warrants a dotted circle.
On the other hand, a string like <U+0024 DOLLAR SIGN, U+1B6B BALINESE MUSICAL SYMBOL COMBINING TEGEH> is not misencoded. It is completely unambiguous what it represents and how to render it, and there is no other way to encode it. It is implausible, of course, but that is none of the shaper’s business. See #525, #609, #1035, #1399, #1631, and #1685 for cases where supposedly implausible strings turned out to be attested.
Clusters are helpful for Indic reordering and dotted circles, but otherwise their boundaries are arbitrary and don’t really matter. For compatibility, we might as well follow the spec as closely as possible. We should only extend it to influence dotted circle insertion. Even then, we should not extend it more than necessary, even when that requires more complex code, as in #1720.