Base Consonant Position #32

Closed
mikeday opened this issue Jul 21, 2018 · 10 comments

@mikeday

mikeday commented Jul 21, 2018

Can the BASE_POS_LAST algorithm be described for Indic in general, or does it actually differ for each script?

Also does the base consonant always have shaping class "CONSONANT" and not "CONSONANT_DEAD"? (There is ambiguity due to a reference to consonants having this shaping class in 2.7: Post-base consonants).

For Sinhala, can the first consonant be preceded by a ZWJ and still be the last consonant?

@mikeday
Author

mikeday commented Jul 25, 2018

Devanagari, Gujarati, Gurmukhi, Tamil, Malayalam, Kannada, and Telugu all have identical descriptions of the base consonant algorithm, although all but Malayalam have a note about lacking pre-base-reordering Ra and Kannada and Telugu have a note about all consonants having post-base form (does this mean we don't need to check the font for these?)

Bengali and Oriya are different: they write stand-alone instead of standalone and "Ra" instead of "Ra,Halant", I'm not sure if this difference is intentional.

Sinhala has its own algorithm.

@n8willis
Owner

n8willis commented Aug 8, 2018

> Can the BASE_POS_LAST algorithm be described for Indic in general, or does it actually differ for each script?

It's the same for all (in HarfBuzz). I left it out of the "general" document because there were (initially) other base-positioning rules, and it seemed wrong to describe one algorithm but not the others. Subsequently, HarfBuzz extracted Khmer into a separate shaper, and only BASE_POS_LAST and BASE_POS_LAST_SINHALA remain.

I think I would recommend leaving the BASE_POS_LAST algorithm description in each individual script page because, as you noted in the second comment, there are some minor differences at a practical level (like whether or not anything can actually take on a post-base form), so covering those all in one spot could get confusing. It would also add a lot of length, considering that it still needs to be in each script doc. But I'm open to persuasion.
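For readers comparing the per-script pages, the shared BASE_POS_LAST search can be sketched roughly as below. This is a minimal illustration and not HarfBuzz's code: the category labels, the function name `find_base_last`, and the `has_non_base_form` callback (standing in for "this consonant can take a post-base or below-base form") are all assumptions made for the example, and reph and pre-base-reordering Ra handling are left out.

```python
from typing import Callable, Optional, Sequence

# Hypothetical category labels; a real shaper uses its own classes.
CONSONANT_CLASSES = {"CONSONANT", "CONSONANT_DEAD"}


def find_base_last(
    categories: Sequence[str],
    has_non_base_form: Callable[[int], bool],
) -> Optional[int]:
    """Return the index of the base consonant, or None if there is no consonant.

    has_non_base_form(i) is an assumed callback that reports whether the
    consonant at index i can take a post-base or below-base form.
    """
    consonants = [i for i, cat in enumerate(categories) if cat in CONSONANT_CLASSES]
    if not consonants:
        return None
    # Walk backward from the end of the syllable, skipping the first consonant.
    for i in reversed(consonants[1:]):
        if not has_non_base_form(i):
            return i
    # Every later consonant can take a non-base form, so the first one is the base.
    return consonants[0]
```

The per-script differences discussed in this thread would then live entirely in what `has_non_base_form` reports for a given script and font, which is one argument for keeping the search itself described once.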

> Also does the base consonant always have shaping class "CONSONANT" and not "CONSONANT_DEAD"? (There is ambiguity due to a reference to consonants having this shaping class in 2.7: Post-base consonants).

So, based on my analysis of the Ragel machines, it is possible for a CONSONANT_DEAD to be identified as the base consonant. Whether or not real words do this is, naturally, a different matter. I think they don't.

But because CONSONANT_DEADs can occur in pre-base position, the classes are merged for the syllable-identification algorithm. So a shaper using the algorithm might identify a dead-consonant codepoint as the base in a nonsense syllable -- but then again, it's "buyer beware" on nonsense syllables already, I would think.

> For Sinhala, can the first consonant be preceded by a ZWJ and still be the last consonant?

My read is 'no'. The only possible beginnings for a valid consonant-based syllable are

  • repha
  • Consonant
  • consonantwithstacker

(And, for Sinhala, repha and consonantwithstacker don't exist in the script). The "broken syllable" expression can match potential-syllable-sequences starting with a ZWJ, but the shaper offers no guarantee of how they'll turn out.
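As a rough illustration of the two points above (the merged consonant classes and the three permitted syllable beginnings), here is a small sketch. The class labels, the `MERGED_CLASS` mapping, and the function name are assumptions for the example; the real logic lives in the Ragel state machine and is considerably more involved.

```python
from typing import Sequence

# Assumed labels: dead consonants are folded into the plain consonant class
# before syllable identification, as described above.
MERGED_CLASS = {"CONSONANT_DEAD": "CONSONANT"}

# The three beginnings listed above for a valid consonant-based syllable.
VALID_STARTS = {"REPHA", "CONSONANT", "CONSONANT_WITH_STACKER"}


def can_start_consonant_syllable(classes: Sequence[str]) -> bool:
    """Check only the first class of a candidate sequence."""
    if not classes:
        return False
    first = MERGED_CLASS.get(classes[0], classes[0])
    return first in VALID_STARTS
```

Under these assumptions a sequence such as `["ZWJ", "CONSONANT"]` is rejected here and would fall through to the "broken syllable" handling.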

> Devanagari, Gujarati, Gurmukhi, Tamil, Malayalam, Kannada, and Telugu all have identical descriptions of the base consonant algorithm, although all but Malayalam have a note about lacking pre-base-reordering Ra and Kannada and Telugu have a note about all consonants having post-base form (does this mean we don't need to check the font for these?)

Right. Malayalam has a pre-base-reordering Ra; the others don't. Kannada & Telugu both allow any consonant to be a post-base form. As to whether the shaper needs to check the font, I guess that's a "level of trust" issue. If the font doesn't provide any post-base-form glyph variants through its GSUB, the user is probably not going to be able to read the resulting text since the shaping will look terrible. So an expensive check might not be worth it.
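If a shaper did want some assurance without a per-glyph test, one cheap option might be to look once per font at the GSUB feature tags. The sketch below assumes the caller has already collected those tags by some means (not shown); the function name is hypothetical, and `pstf` is the OpenType Post-base Forms feature tag.

```python
from typing import Iterable

# OpenType feature tag for post-base forms.
POST_BASE_FEATURE_TAGS = frozenset({"pstf"})


def font_declares_post_base_forms(gsub_feature_tags: Iterable[str]) -> bool:
    """Return True if any post-base-form feature tag is present.

    This only shows that the font declares the feature; it says nothing
    about which consonants are actually covered by its lookups.
    """
    return not POST_BASE_FEATURE_TAGS.isdisjoint(gsub_feature_tags)
```

This trades precision for cost: it cannot tell whether a particular consonant has a post-base glyph, only whether the font attempts post-base forms at all.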

> Bengali and Oriya are different: they write stand-alone instead of standalone and "Ra" instead of "Ra,Halant", I'm not sure if this difference is intentional.
>
> Sinhala has its own algorithm.

Correct; because it has its own base-positioning rule, BASE_POS_LAST_SINHALA.

@n8willis
Owner

n8willis commented Aug 8, 2018

Whoops; forgot to add: the differences in Bengali and Oriya are not intentional; will update.

@mikeday
Author

mikeday commented Aug 15, 2018

> I think I would recommend leaving the BASE_POS_LAST algorithm description in each individual script page because, as you noted in the second comment, there are some minor differences at a practical level (like whether or not anything can actually take on a post-base form), so covering those all in one spot could get confusing. It would also add a lot of length, considering that it still needs to be in each script doc. But I'm open to persuasion.

Another possibility would be to state explicitly which scripts have the same algorithm, so that the reader doesn't need to carefully check them all and compare?

@adrianwong
Contributor

If we could have a table that summarises these similarities/differences, that would be great!

@n8willis
Owner

@adrianwong Do you mean the differences between the scripts, or the differences between the base-consonant algorithms?

@adrianwong
Contributor

@n8willis Making one for the base-consonant algorithms would be a nice start, but ultimately having a summary of differences between all six stages of processing Indic2 texts in a table would be really useful.

This is an alternate approach to @mikeday's, but the motivations are the same - it'll make it easier for the reader to compare.

@n8willis
Owner

I'm not sure that the full shaping processes or even the base-consonant-locating algorithms would fit into a table format. They're algorithms (stating the obvious); the steps are of different lengths & complexities ... not to mention the real-world problem that GitHub-rendered Markdown pages are of a fixed width. We already have problems with the latter issue in several of the character tables.

Putting in a more explicit listing like in @mikeday's comment seems more feasible.
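As one possible shape for such an explicit listing, the facts stated in this thread can be written down as a small data structure. The type and field names below are assumptions for illustration; `None` marks a characteristic that this thread does not state for that script, and nothing here is taken from the script pages themselves.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class ScriptInfo(NamedTuple):
    base_rule: str
    # None means "not stated in this thread" rather than "no".
    pre_base_reordering_ra: Optional[bool]
    all_consonants_post_base: Optional[bool]


SCRIPTS: Dict[str, ScriptInfo] = {
    "Devanagari": ScriptInfo("BASE_POS_LAST", False, None),
    "Gujarati": ScriptInfo("BASE_POS_LAST", False, None),
    "Gurmukhi": ScriptInfo("BASE_POS_LAST", False, None),
    "Tamil": ScriptInfo("BASE_POS_LAST", False, None),
    "Malayalam": ScriptInfo("BASE_POS_LAST", True, None),
    "Kannada": ScriptInfo("BASE_POS_LAST", False, True),
    "Telugu": ScriptInfo("BASE_POS_LAST", False, True),
    "Bengali": ScriptInfo("BASE_POS_LAST", None, None),
    "Oriya": ScriptInfo("BASE_POS_LAST", None, None),
    "Sinhala": ScriptInfo("BASE_POS_LAST_SINHALA", None, None),
}


def scripts_by_rule(scripts: Dict[str, ScriptInfo]) -> Dict[str, List[str]]:
    """Group script names by their base-positioning rule."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, info in scripts.items():
        groups[info.base_rule].append(name)
    return dict(groups)
```

Grouping by `base_rule` gives the "which scripts share the algorithm" answer directly, while the other fields keep the practical per-script notes next to it.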

@adrianwong
Contributor

Valid points! Thanks for giving it some thought.

n8willis self-assigned this Jan 3, 2019
@n8willis
Owner

Merged tables for all script-shaping characteristics in 711b4a7.
