[USE] Merge the categories S and O #3249

Merged
merged 2 commits into from Oct 8, 2021

dscorbett (Collaborator) commented Oct 5, 2021

This PR conflates the USE categories SYM and OTHER. The only effect is that SYM_MOD characters are now allowed after OTHER, not just SYM. This is the first part of a bigger set of changes I want to make, so I’ll explain here the circumstances under which I think HarfBuzz should and should not violate the USE spec.
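To make the effect concrete, here is a minimal sketch in Python. It is not HarfBuzz's actual grammar: the one-letter codes `O`, `S`, and `M` (standing in for OTHER, SYM, and SYM_MOD) and both function names are hypothetical. It models a symbol cluster as a head followed by any number of SYM_MOD marks and shows how merging the head categories moves the cluster boundary.

```python
import re

# Hypothetical one-letter codes: O = OTHER, S = SYM, M = SYM_MOD.
# Before the merge, only SYM may head a cluster that carries modifiers.
BEFORE = re.compile(r"SM*|.")
# After the merge, OTHER is treated like SYM, so it may carry modifiers too.
AFTER = re.compile(r"[OS]M*|.")


def clusters(categories, merged):
    """Split a string of category codes into clusters."""
    pattern = AFTER if merged else BEFORE
    return pattern.findall(categories)


def stray_modifiers(categories, merged):
    """Count SYM_MOD marks left without a head in this simplified model."""
    return sum(1 for c in clusters(categories, merged) if c.startswith("M"))
```

In this model, `clusters("OM", merged=False)` splits the modifier away from OTHER, while `clusters("OM", merged=True)` keeps the two together, which is the only behavioral change described above.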

USE clusters determine where dotted circles are inserted. They should only be inserted when the text is misencoded, not merely unexpected or wrong. If there are multiple ways to represent some text in an encoding, and they aren’t canonically equivalent, and they don’t involve default ignorable code points, then one of them is chosen (more or less arbitrarily) to be right, and the rest are considered misencoded.

One example is the order of ccc=0 vowel signs. By convention, above-base marks precede below-base marks. This is the opposite of ccc≠0 vowel signs. Neither is wrong: it’s completely arbitrary. Therefore, we pick one convention and insert dotted circles in the other.
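The contrast between the two conventions can be sketched in Python. The first function uses the standard `unicodedata` module to show canonical ordering for ccc≠0 marks, using two Latin combining marks as stand-ins. The second is a hypothetical checker for ccc=0 vowel signs; the labels `"abv"` and `"blw"` are assumptions made for illustration, not HarfBuzz identifiers.

```python
import unicodedata

# ccc != 0: canonical ordering sorts marks by combining class, so a
# below-base mark (ccc 220) ends up before an above-base mark (ccc 230).
ACUTE, DOT_BELOW = "\u0301", "\u0323"  # ccc 230 and ccc 220


def canonical_order(text):
    """Return the text with combining marks in canonical (NFD) order."""
    return unicodedata.normalize("NFD", text)


# ccc == 0: normalization never reorders these marks, so a convention is
# needed. Hypothetical labels: "abv" = above-base, "blw" = below-base.
def follows_convention(positions):
    """True if no below-base vowel sign precedes an above-base one."""
    seen_below = False
    for p in positions:
        if p == "blw":
            seen_below = True
        elif p == "abv" and seen_below:
            return False
    return True
```

Here `canonical_order("a" + ACUTE + DOT_BELOW)` moves the below-base mark first, whereas `follows_convention(["blw", "abv"])` returns `False` because the ccc=0 convention runs the other way.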

Another example is #1273. The Khudawadi independent ai is represented as <U+112B7 KHUDAWADI LETTER AI>, but it could just as well have been <U+112B0 KHUDAWADI LETTER A, U+112E6 KHUDAWADI VOWEL SIGN AI>. There is no real difference between them: it is an artifact of the encoding. Therefore, we insert a dotted circle in the latter. (This isn’t related to USE clusters; I mention it because it follows the same principle.)
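As a rough illustration of that principle, the following Python sketch flags the decomposed Khudawadi sequence. The table holds only the one pair named above, the function name is hypothetical, and placing the dotted circle between the two characters is an assumption made for this sketch rather than a description of HarfBuzz's code.

```python
DOTTED_CIRCLE = "\u25CC"

# Only the pair discussed above; a real table would cover more sequences.
# U+112B0 KHUDAWADI LETTER A followed by U+112E6 KHUDAWADI VOWEL SIGN AI.
DISALLOWED = {("\U000112B0", "\U000112E6")}


def mark_misencoded(text):
    """Insert a dotted circle inside each disallowed pair (assumed placement)."""
    out = []
    for ch in text:
        if out and (out[-1], ch) in DISALLOWED:
            out.append(DOTTED_CIRCLE)
        out.append(ch)
    return "".join(out)
```

The atomic letter U+112B7 passes through unchanged, while the two-character spelling comes back with U+25CC between its parts.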

Another example is #627. Batak closed syllables are encoded in pronunciation order, not in visual order: <consonant, vowel, consonant, killer> instead of <consonant, consonant, vowel, killer>. The visual order is not a priori wrong, but it’s not the Unicode way. Therefore, text in visual order is misencoded.

Another example is hieroglyph clusters, which Andrew Glass is still working on. Hieroglyphic format controls are an abstraction defined by Unicode, not really part of the Egyptian hieroglyphic script, so Unicode is the sole determiner of how to use these code points correctly. Using them incorrectly warrants a dotted circle.

On the other hand, a string like <U+0024 DOLLAR SIGN, U+1B6B BALINESE MUSICAL SYMBOL COMBINING TEGEH> is not misencoded. It is completely unambiguous what it represents and how to render it, and there is no other way to encode it. It is implausible, of course, but that is none of the shaper’s business. See #525, #609, #1035, #1399, #1631, and #1685 for cases where supposedly implausible strings turned out to be attested.

Clusters are helpful for Indic reordering and dotted circles, but otherwise, their boundaries are arbitrary and don’t really matter. For compatibility we might as well follow the spec as closely as possible. We should only extend it to influence dotted circle insertion. Even then, we should not extend it more than necessary, even when that requires more complex code, as in #1720.

behdad (Member) commented Oct 7, 2021

How can we make the divergence from the spec clearer in the code?

dscorbett (Collaborator, Author) commented

I could add comments to is_OTHER etc. saying which USE categories HarfBuzz merges together. Would that help?

behdad (Member) commented Oct 8, 2021

> I could add comments to is_OTHER etc. saying which USE categories HarfBuzz merges together. Would that help?

Yes please.

behdad merged commit cca42cd into main Oct 8, 2021
behdad deleted the use-merge-s-o branch October 8, 2021 19:10
khaledhosny added the USE Universal Shaping Engine label Oct 8, 2021
Richard57 commented

> One example is the order of ccc=0 vowel signs. By convention, above-base marks precede below-base marks. This is the opposite of ccc≠0 vowel signs. Neither is wrong: it’s completely arbitrary. Therefore, we pick one convention and insert dotted circles in the other.

It hits problems when the classification of non-base characters is not obvious. Some marks occur in more than one functional category, especially in Tai Tham. Manually converting an image into characters for a word one can read is a lot smoother when one uses the text (as opposed to character) encoding apparently accepted by the UTC. For a vertical stack, it then doesn't matter whether MAI KANG is a vowel or a vowel modifier like other anusvaras/niggahitas, or whether SIGN OA BELOW is a vowel mark or, as many seem to feel, a subscript consonant, often misencoded as <SAKOT, LETTER A>. One might add that many don't see a difference between the vowel mark MAI SAT and the tone mark TONE-2, which is widely called 'mai sat'! MAI SAT can also function as a final consonant, when it is known as 'mai kak', and as a vowel modifier (it shortens the vowel, which can change the phonemic tone).

You allow both ways round for visually and semantically identical but canonically inequivalent Thai sequences.

Incidentally, the 'convention' is a USE diktat.

> Clusters are helpful for Indic reordering and dotted circles, but otherwise, their boundaries are arbitrary and don’t really matter.

Again, Tai Tham provides an exception in the form of U+1A58 MAI KANG LAI. In modern Tai Khuen, it's a final consonant and can have an intra-word line break after it. In Laos and NE Thailand, it behaves like rephas and Burmese kinzi. If the boundary came before it, fonts supporting the latter styles could happily use the rphf feature, instead of having to do manual reordering. (Both styles occur in Northern Thailand.)

Another exception is U+1A63 TAI THAM SIGN AA. A lot of grief would have been avoided if it had been treated as a consonant and a cluster boundary inserted before it. U+1A61 and U+1A64 have similar behaviours, though words like ᨠᩡ᩠ᩃᩣ are very rare - I've only seen it in Tai Lue. The downside is that some of what could be done in features pstf and abvf would then have to be done in features psts and abvs, and would have been impossible in the Indic shaper.
