Base Consonant Position #32

mikeday · 2018-07-21T02:08:19Z

Can the BASE_POS_LAST algorithm be described for Indic in general, or does it actually differ for each script?

Also does the base consonant always have shaping class "CONSONANT" and not "CONSONANT_DEAD"? (There is ambiguity due to a reference to consonants having this shaping class in 2.7: Post-base consonants).

For Sinhala, can the first consonant be preceded by a ZWJ and still be the last consonant?

mikeday · 2018-07-25T05:50:40Z

Devanagari, Gujarati, Gurmukhi, Tamil, Malayalam, Kannada, and Telugu all have identical descriptions of the base consonant algorithm, although all but Malayalam have a note about lacking pre-base-reordering Ra, and Kannada and Telugu have a note about all consonants having post-base form (does this mean we don't need to check the font for these?).

Bengali and Oriya are different: they write "stand-alone" instead of "standalone" and "Ra" instead of "Ra,Halant". I'm not sure if this difference is intentional.

Sinhala has its own algorithm.

n8willis · 2018-08-08T21:19:02Z

> Can the BASE_POS_LAST algorithm be described for Indic in general, or does it actually differ for each script?

It's the same for all (in HarfBuzz). I left it out of the "general" document because there were (initially) other base-positioning rules and it seemed wrong to describe one algorithm but not the others. Subsequently, HarfBuzz extracted Khmer into a separate shaper and only BASE_POS_LAST and BASE_POS_LAST_SINHALA remain.

I think I would recommend leaving the BASE_POS_LAST algorithm description in each individual script page because, as you noted in the second comment, there are some minor differences at a practical level (like whether or not anything can actually take on a post-base form), so covering those all in one spot could get confusing. It would also add a lot of length, considering that it still needs to be in each script doc. But I'm open to persuasion.

> Also does the base consonant always have shaping class "CONSONANT" and not "CONSONANT_DEAD"? (There is ambiguity due to a reference to consonants having this shaping class in 2.7: Post-base consonants).

So, based on my analysis of the Ragel machines, it is possible for a CONSONANT_DEAD to be identified as the base consonant. Whether or not real words do this is, naturally, a different matter. I think they don't.

But because CONSONANT_DEADs can occur in pre-base position, the classes are merged for the syllable-identification algorithm. So a shaper using the algorithm might identify a dead-consonant codepoint as base in a nonsense syllable -- but then again, it's "buyer beware" on nonsense syllables already, I would think.
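To make the merged-classes point concrete, here is a minimal sketch of a BASE_POS_LAST-style backward search. It is a simplification and not HarfBuzz's implementation: the class names `CONSONANT` and `CONSONANT_DEAD` come from this thread, but `find_base_last`, the `(character, shaping_class)` pairs, the placeholder character names, and the `has_non_base_form` callback are all hypothetical, and reph and joiner handling are left out entirely.

```python
from typing import Callable, List, Tuple

# Both classes are treated alike when looking for the base consonant.
CONSONANT_CLASSES = {"CONSONANT", "CONSONANT_DEAD"}


def find_base_last(
    syllable: List[Tuple[str, str]],
    has_non_base_form: Callable[[str], bool],
) -> int:
    """Return the index of the base consonant, or -1 if there is none.

    syllable: (character, shaping_class) pairs in logical order.
    has_non_base_form: hypothetical callback that returns True when the
    character can take a below-base or post-base form (a real shaper
    would consult the font here).
    """
    consonants = [
        i for i, (_, cls) in enumerate(syllable) if cls in CONSONANT_CLASSES
    ]
    if not consonants:
        return -1
    # Walk backwards over every consonant except the first and stop at
    # the first one that cannot take a below-base or post-base form.
    for i in reversed(consonants[1:]):
        if not has_non_base_form(syllable[i][0]):
            return i
    # Nothing stopped the search, so the first consonant is the base.
    return consonants[0]


# Placeholder names, not real codepoints.
syllable = [("KA", "CONSONANT"), ("HALANT", "VIRAMA"), ("RA", "CONSONANT")]
base_index = find_base_last(syllable, lambda ch: ch == "RA")
```

In this sketch a lone `CONSONANT_DEAD` character would be returned as the base simply because it is the only consonant present, which matches the "possible, but probably only in nonsense syllables" reading above.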

> For Sinhala, can the first consonant be preceded by a ZWJ and still be the last consonant?

My read is 'no'. The only possible beginnings for a valid consonant-based syllable are

repha
Consonant
consonantwithstacker

(And, for Sinhala, repha and consonantwithstacker don't exist in the script.) The "broken syllable" expression can match potential-syllable-sequences starting with a ZWJ, but the shaper offers no guarantee of how they'll turn out.
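As a rough illustration of that reading, the check below tests whether the first element of a sequence is one of the three beginnings listed above. The uppercase labels, the `can_start_consonant_syllable` name, and the script-name strings are hypothetical stand-ins, not identifiers from the Ragel machines.

```python
# The three beginnings listed above, as hypothetical labels.
VALID_STARTS = {"REPHA", "CONSONANT", "CONSONANT_WITH_STACKER"}

# Per the comment above, these two do not exist in Sinhala.
ABSENT_IN_SINHALA = {"REPHA", "CONSONANT_WITH_STACKER"}


def can_start_consonant_syllable(first_element: str, script: str) -> bool:
    """True if first_element may begin a valid consonant-based syllable."""
    allowed = VALID_STARTS
    if script == "Sinhala":
        allowed = VALID_STARTS - ABSENT_IN_SINHALA
    return first_element in allowed


# A leading ZWJ is not a valid beginning, so at best the sequence
# falls through to the "broken syllable" expression.
zwj_first = can_start_consonant_syllable("ZWJ", "Sinhala")
```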

> Devanagari, Gujarati, Gurmukhi, Tamil, Malayalam, Kannada, and Telugu all have identical descriptions of the base consonant algorithm, although all but Malayalam have a note about lacking pre-base-reordering Ra, and Kannada and Telugu have a note about all consonants having post-base form (does this mean we don't need to check the font for these?).

Right. Malayalam has a pre-base-reordering Ra, the others don't. Kannada & Telugu both allow any consonant to be a post-base form. As to whether the shaper needs to check the font, I guess that's a "level of trust" issue. If the font doesn't provide any post-base-form glyph variants through its GSUB, the user is probably not going to be able to read the resulting text since the shaping will look terrible. So an expensive check might not be worth it.
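One way to picture the "level of trust" trade-off is a helper that can skip the font lookup for scripts where every consonant is allowed a post-base form. This is only a sketch: `may_be_post_base`, the `trust_script` flag, and the `font_post_base` set (standing in for whatever a real shaper would learn by querying GSUB) are all assumptions for illustration.

```python
from typing import Optional, Set

# Per the comment above: any consonant may be a post-base form here.
ALL_CONSONANTS_POST_BASE = {"Kannada", "Telugu"}


def may_be_post_base(
    script: str,
    consonant: str,
    font_post_base: Optional[Set[str]] = None,
    trust_script: bool = True,
) -> bool:
    """Decide whether a consonant may take a post-base form.

    font_post_base: hypothetical set of consonants for which the font
    supplies a post-base glyph (the result of the expensive check).
    trust_script: when True, skip the font check for scripts where all
    consonants are allowed a post-base form.
    """
    if trust_script and script in ALL_CONSONANTS_POST_BASE:
        return True  # skip the expensive font check
    if font_post_base is None:
        raise ValueError("font data is required to answer for this script")
    return consonant in font_post_base


trusted = may_be_post_base("Telugu", "KA")
checked = may_be_post_base("Telugu", "KA", font_post_base=set(), trust_script=False)
```

With `trust_script=True` a font that lacks the glyphs still gets a "yes", which is the unreadable-text case described above; the flag just makes the cost of avoiding it explicit.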

> Bengali and Oriya are different: they write "stand-alone" instead of "standalone" and "Ra" instead of "Ra,Halant". I'm not sure if this difference is intentional.
>
> Sinhala has its own algorithm.

Correct; because it has its own base-positioning rule, BASE_POS_LAST_SINHALA.

n8willis · 2018-08-08T21:19:53Z

Whoops; forgot to add: the differences in Bengali and Oriya are not intentional; will update.

mikeday · 2018-08-15T05:28:33Z

> I think I would recommend leaving the BASE_POS_LAST algorithm description in each individual script page because, as you noted in the second comment, there are some minor differences at a practical level (like whether or not anything can actually take on a post-base form), so covering those all in one spot could get confusing. It would also add a lot of length, considering that it still needs to be in each script doc. But I'm open to persuasion.

Another possibility would be to state explicitly which scripts have the same algorithm, so that the reader doesn't need to carefully check them all and compare?
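For instance, such an explicit listing could be as small as a script-to-rule mapping that is then grouped by rule. The rule names and the script assignments below are taken from this thread (Bengali and Oriya are included under BASE_POS_LAST because their differing wording was described as unintentional); the `BASE_POSITION_RULE` and `scripts_by_rule` names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Script -> base-positioning rule, as stated in this thread.
BASE_POSITION_RULE: Dict[str, str] = {
    "Devanagari": "BASE_POS_LAST",
    "Gujarati": "BASE_POS_LAST",
    "Gurmukhi": "BASE_POS_LAST",
    "Tamil": "BASE_POS_LAST",
    "Malayalam": "BASE_POS_LAST",
    "Kannada": "BASE_POS_LAST",
    "Telugu": "BASE_POS_LAST",
    "Bengali": "BASE_POS_LAST",
    "Oriya": "BASE_POS_LAST",
    "Sinhala": "BASE_POS_LAST_SINHALA",
}


def scripts_by_rule(rules: Dict[str, str]) -> Dict[str, List[str]]:
    """Group script names by the rule they share, sorted for stable output."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for script, rule in rules.items():
        groups[rule].append(script)
    return {rule: sorted(scripts) for rule, scripts in groups.items()}


groups = scripts_by_rule(BASE_POSITION_RULE)
for rule, scripts in groups.items():
    print(f"{rule}: {', '.join(scripts)}")
```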

adrianwong · 2018-10-15T06:59:28Z

If we could have a table that summarises these similarities/differences, that would be great!

n8willis · 2018-10-16T15:54:02Z

@adrianwong Do you mean the differences between the scripts, or the differences between the base-consonant algorithms?

adrianwong · 2018-10-16T21:27:20Z

@n8willis Making one for the base-consonant algorithms would be a nice start, but ultimately having a summary of differences between all six stages of processing Indic2 texts in a table would be really useful.

This is an alternate approach to @mikeday's, but the motivations are the same - it'll make it easier for the reader to compare.

n8willis · 2018-11-12T15:59:54Z

I'm not sure that the full shaping processes or even the base-consonant-locating algorithms would fit into a table format. They're algorithms (stating the obvious); the steps are of different lengths & complexities ... not to mention the real-world problem that GitHub-rendered Markdown pages are of a fixed width. We already have problems with the latter issue in several of the character tables.

Putting in a more explicit listing like in @mikeday's comment seems more feasible.

adrianwong · 2018-11-12T21:21:43Z

Valid points! Thanks for giving it some thought.

n8willis · 2019-02-10T14:35:42Z

Merged tables for all script-shaping characteristics in 711b4a7.

n8willis self-assigned this Jan 3, 2019

n8willis closed this as completed Feb 10, 2019

adrianwong mentioned this issue Mar 18, 2019

Pre-base-reordering "Ra" in Telugu #55

Closed
