[USE] Merge the categories S and O #3249

dscorbett · 2021-10-05T21:01:25Z

This PR conflates the USE categories SYM and OTHER. The only effect is that SYM_MOD characters are now allowed after OTHER, not just SYM. This is the first part of a bigger set of changes I want to make, so I’ll explain here the circumstances under which I think HarfBuzz should and should not violate the USE spec.
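To make the effect concrete, here is a minimal toy model, not HarfBuzz's actual Ragel grammar. The one-letter category codes, the two patterns, and the function name `is_single_cluster` are all assumptions made for illustration; the real symbol cluster has more parts than this.

```python
import re

# Hypothetical one-letter codes for this sketch only:
#   S = SYM, O = OTHER, M = SYM_MOD
SPEC_CLUSTER = re.compile(r"SM*")       # only SYM may carry SYM_MOD marks
MERGED_CLUSTER = re.compile(r"[SO]M*")  # SYM and OTHER treated as one category


def is_single_cluster(categories: str, merged: bool) -> bool:
    """Return True if the whole category string forms one cluster."""
    pattern = MERGED_CLUSTER if merged else SPEC_CLUSTER
    return pattern.fullmatch(categories) is not None


if __name__ == "__main__":
    for cats in ("SM", "OM", "OMM"):
        print(cats, is_single_cluster(cats, merged=False),
              is_single_cluster(cats, merged=True))
```

In this model, a sequence such as `OM` fails to form a single cluster before the merge (the case where a dotted circle would be inserted) and succeeds after it, while `SM` is accepted either way.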

USE clusters determine where dotted circles are inserted. They should only be inserted when the text is misencoded, not merely unexpected or wrong. If there are multiple ways to represent some text in an encoding, and they aren’t canonically equivalent, and they don’t involve default ignorable code points, then one of them is chosen (more or less arbitrarily) to be right, and the rest are considered misencoded.

One example is the order of ccc=0 vowel signs. By convention, above-base marks precede below-base marks. This is the opposite of ccc≠0 vowel signs. Neither is wrong: it’s completely arbitrary. Therefore, we pick one convention and insert dotted circles in the other.
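The contrast with ccc≠0 marks can be seen with Python's standard `unicodedata` module. This is only a sketch: the Latin and Devanagari characters below are my own choice of illustration, not taken from the PR.

```python
import unicodedata

# ccc != 0: canonical ordering sorts by combining class, so the below-base
# mark (ccc=220) ends up before the above-base mark (ccc=230).
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, above, ccc=230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, below, ccc=220

# ccc == 0: normalization never reorders these, so both orders survive
# as distinct strings and a shaper has to pick one by convention.
SIGN_E = "\u0947"     # DEVANAGARI VOWEL SIGN E, above-base, ccc=0
SIGN_U = "\u0941"     # DEVANAGARI VOWEL SIGN U, below-base, ccc=0
KA = "\u0915"         # DEVANAGARI LETTER KA


def nfd(text: str) -> str:
    return unicodedata.normalize("NFD", text)


if __name__ == "__main__":
    print([unicodedata.combining(c) for c in (ACUTE, DOT_BELOW, SIGN_E, SIGN_U)])
    print(nfd("a" + ACUTE + DOT_BELOW) == "a" + DOT_BELOW + ACUTE)
    print(nfd(KA + SIGN_E + SIGN_U) == nfd(KA + SIGN_U + SIGN_E))
```

Because the two ccc=0 orders are not canonically equivalent, normalization cannot reconcile them, which is why one order has to be declared right.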

Another example is #1273. The Khudawadi independent ai is represented as <U+112B7 KHUDAWADI LETTER AI>, but it could just as well have been <U+112B0 KHUDAWADI LETTER A, U+112E6 KHUDAWADI VOWEL SIGN AI>. There is no real difference between them: it is an artifact of the encoding. Therefore, we insert a dotted circle in the latter. (This isn’t related to USE clusters; I mention it because it follows the same principle.)
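The "not canonically equivalent" condition for the two Khudawadi spellings can be checked with the standard `unicodedata` module. This is a sketch; the helper name `canonically_equivalent` is mine, and it says nothing about how HarfBuzz itself detects the sequence.

```python
import unicodedata

LETTER_AI = "\U000112B7"              # KHUDAWADI LETTER AI
A_PLUS_SIGN_AI = "\U000112B0\U000112E6"  # LETTER A + VOWEL SIGN AI


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    # Control case: precomposed e-acute versus e + combining acute.
    print(canonically_equivalent("\u00e9", "e\u0301"))
    # The two Khudawadi spellings are distinct under normalization.
    print(canonically_equivalent(LETTER_AI, A_PLUS_SIGN_AI))
```

Since normalization does not map one spelling to the other, the choice between them is left to convention, and the shaper marks the disfavored one.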

Another example is #627. Batak closed syllables are encoded in pronunciation order, not in visual order: <consonant, vowel, consonant, killer> instead of <consonant, consonant, vowel, killer>. The visual order is not a priori wrong, but it’s not the Unicode way. Therefore, text in visual order is misencoded.

Another example is hieroglyph clusters, which Andrew Glass is still working on. Hieroglyphic format controls are an abstraction defined by Unicode, not really part of the Egyptian hieroglyphic script, so Unicode is the sole determiner of how to use these code points correctly. Using them incorrectly warrants a dotted circle.

On the other hand, a string like <U+0024 DOLLAR SIGN, U+1B6B BALINESE MUSICAL SYMBOL COMBINING TEGEH> is not misencoded. It is completely unambiguous what it represents and how to render it, and there is no other way to encode it. It is implausible, of course, but that is none of the shaper’s business. See #525, #609, #1035, #1399, #1631, and #1685 for cases where supposedly implausible strings turned out to be attested.

Clusters are helpful for Indic reordering and dotted circles, but otherwise, their boundaries are arbitrary and don’t really matter. For compatibility we might as well follow the spec as closely as possible. We should only extend it to influence dotted circle insertion. Even then, we should not extend it more than necessary, even when that requires more complex code, as in #1720.

behdad · 2021-10-07T23:45:13Z

How can we make the divergence from the spec clearer in the code?

dscorbett · 2021-10-08T14:39:02Z

I could add comments to is_OTHER etc. saying which USE categories HarfBuzz merges together. Would that help?

behdad · 2021-10-08T15:57:58Z

> I could add comments to is_OTHER etc. saying which USE categories HarfBuzz merges together. Would that help?

Yes please.

Richard57 · 2022-07-17T12:57:16Z

> One example is the order of ccc=0 vowel signs. By convention, above-base marks precede below-base marks. This is the opposite of ccc≠0 vowel signs. Neither is wrong: it’s completely arbitrary. Therefore, we pick one convention and insert dotted circles in the other.

It hits problems when the classification of non-base characters is not obvious. Some marks occur in more than one functional category, especially in Tai Tham. The process of manually converting an image to characters for a word one can read is a lot smoother when one uses the text (as opposed to character) encoding apparently accepted by the UTC. For a vertical stack, it then doesn't matter whether MAI KANG is a vowel or a vowel modifier like other anusvaras/niggahitas, or whether SIGN OA BELOW is a vowel mark or, as many seem to feel, a subscript consonant, often misencoded <SAKOT, LETTER A>. One might add that many don't see a difference between the vowel mark MAI SAT, which can also function as a final consonant (when it is known as 'mai kak') and as a vowel modifier (it shortens the vowel, which can change the phonemic tone), and the tone mark TONE-2, which is widely called 'mai sat'!

You allow both ways round for visually and semantically identical but canonically inequivalent Thai sequences.

Incidentally, the 'convention' is a USE diktat.

> Clusters are helpful for Indic reordering and dotted circles, but otherwise, their boundaries are arbitrary and don’t really matter.

Again, Tai Tham provides an exception in the form of U+1A58 MAI KANG LAI. In modern Tai Khuen, it's a final consonant and can have an intra-word line break after it. In Laos and NE Thailand, it behaves like rephas and Burmese kinzi. If the boundary came before it, fonts supporting the latter styles could happily use the rphf feature, instead of having to do manual reordering. (Both styles occur in Northern Thailand.)

Another exception is U+1A63 TAI THAM VOWEL SIGN AA. A lot of grief would have been avoided if it had been treated as a consonant and a cluster boundary inserted before it. U+1A61 and U+1A64 have similar behaviours, though words like ᨠᩡ᩠ᩃᩣ are very rare - I've only seen them in Tai Lue. The downside is that some of what could be done in features pstf and abvf would then have to be done in features psts and abvs, and would have been impossible in the Indic shaper.

dscorbett added 2 commits October 8, 2021 13:14

[USE] Merge the categories S and O

7287125

[USE] Document customizations of USE categories

bb50aae

dscorbett force-pushed the use-merge-s-o branch from 686f3fe to bb50aae Compare October 8, 2021 18:05

behdad merged commit cca42cd into main Oct 8, 2021

behdad deleted the use-merge-s-o branch October 8, 2021 19:10

khaledhosny added the USE Universal Shaping Engine label Oct 8, 2021

dscorbett mentioned this pull request Mar 5, 2022

[USE] Allow any non-numeric tail in a symbol cluster #3473

Merged

dscorbett mentioned this pull request Jul 17, 2022

Whitespace and USE dotted circles #3718

Open
