Khmer shaping does not match Uniscribe or CoreText #667

punchcutter opened this issue Dec 29, 2017 · 8 comments



@punchcutter punchcutter commented Dec 29, 2017

Brought up by @mcdurdin at
Text is

  1. ស៉ើុប
  2. ស៉េីុប
  3. ស៉ីេុប
  4. ស៉ើុុប
  5. ស៉េីុុប
  6. ស៉ីេុុប
  7. ស៊ើុប
  8. ស៊េីុប
  9. ស៊ីេុប
  10. ស៊ើុុប
  11. ស៊េីុុប
  12. ស៊ីេុុប
  13. ស៉ើីប
  14. ស៉ីើប
  15. ស៉ើីុប
  16. ស៉ីើុប
  17. ស៊ើីប
  18. ស៊ីើប
  19. ស៊ើីុប
  20. ស៊ីើុប
  21. សើីុប
  22. សើុីប
  23. សីើុប
  24. សីុើប

Windows 10 Edge and OS X 10.12.6 TextEdit agree that only one combination can be rendered correctly:
[Screenshot: khmer_seep_textedit_osx_10 12 6]

harfbuzz (here Firefox Nightly, but also tested with latest hb-view) does not agree and allows all of these combinations to render.

How a font handles consonant shifters varies from font to font, which is why the Android test with an older version of Noto Khmer showed everything rendering exactly the same. The new version of Noto Khmer doesn't allow that to happen (only 7-12 above look correct, even though they aren't), but either way the shaper shouldn't allow every one of these to shape the same.
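
For readers who cannot tell the sequences apart by eye, here is a minimal sketch that prints the code points behind items 1-3 of the list above. The escapes were transcribed by hand from that list, and `SEQUENCES` and `describe` are illustrative names, not part of any shaper.

```python
import unicodedata

# Items 1-3 from the list above, written as escapes so the order is explicit.
# Transcribed by hand; treat the exact code points as an assumption.
SEQUENCES = {
    1: "\u179F\u17C9\u17BE\u17BB\u1794",
    2: "\u179F\u17C9\u17C1\u17B8\u17BB\u1794",
    3: "\u179F\u17C9\u17B8\u17C1\u17BB\u1794",
}


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for number, text in SEQUENCES.items():
    print(number, text)
    for line in describe(text):
        print("   ", line)
```

Item 1 uses the single vowel sign OE, while items 2 and 3 spell a similar-looking shape with E and II in either order, so the three strings are different encodings even where a renderer draws them alike.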


@mcdurdin mcdurdin commented Dec 29, 2017

@MakaraSok is currently working on providing additional data for this issue. Due to some ambiguity in the Unicode and OpenType specifications for Khmer, it is not immediately clear which combinations should be permitted and which should be blocked, especially at the boundary between syllables. It is also important to consider how minority languages use the Khmer script, so that any fix does not block them.

There are also 12 more basic examples for this syllable (screenshot from presentation):


Copying @jahorton, @mhosken


@behdad behdad commented Jan 2, 2018

I believe the same happens with all Indic scripts as well. We don't enforce the one-matra-per-position rule, nor do we enforce a specific matra order. cc @jfkthame


@jfkthame jfkthame commented Jan 5, 2018

A strict one-matra-per-position rule would be problematic, because of things like हााााााा (yes, real-world examples of that kind of thing do happen, e.g. in comic-book text).

behdad added a commit that referenced this issue Jan 5, 2018
Towards fixing #667
The Khmer spec is different enough from other Indic ones to require
its own grammar.

No change in functionality.  Test numbers are:

BENGALI: 353725 out of 354188 tests passed. 463 failed (0.130722%)
DEVANAGARI: 707307 out of 707394 tests passed. 87 failed (0.0122987%)
GUJARATI: 366355 out of 366457 tests passed. 102 failed (0.0278341%)
GURMUKHI: 60729 out of 60747 tests passed. 18 failed (0.0296311%)
KANNADA: 951300 out of 951913 tests passed. 613 failed (0.0643966%)
KHMER: 299071 out of 299124 tests passed. 53 failed (0.0177184%)
MALAYALAM: 1048136 out of 1048334 tests passed. 198 failed (0.0188871%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271662 out of 271847 tests passed. 185 failed (0.068053%)
TAMIL: 1091754 out of 1091754 tests passed. 0 failed (0%)
TELUGU: 970555 out of 970573 tests passed. 18 failed (0.00185457%)

@mcdurdin mcdurdin commented Jan 6, 2018

@jfkthame Yes, I don't see that as an issue, given that it is visually different. The issue described here relates only to multiple visually indistinguishable sequences.


@behdad behdad commented Jan 7, 2018

Jonathan and I separated the Khmer shaper from the Indic one. The next step is to remove more code from the Khmer shaper, then code the matra ordering into the grammar.


@Richard57 Richard57 commented Jan 27, 2018

The Khmer encoding was designed to have only one matra per syllable, despite having commonality with Thai writing habits in that respect. There should be a looming issue with the Khom script (i.e. the variety of the Khmer script used in Thailand for Pali and some writing in the Tai vernaculars), as all three of the apparent sign combinations <E, [I, II, Y]> have been in use at some time since the middle of the 19th century, and it is likely that they have been used for a 2-way length contrast, as in the Lao Lao script writing system. There has also been variation in Khmer, though the temptation is to dismiss that as glyph variation.


@behdad behdad commented Sep 10, 2018

Can someone test this again? Our Khmer shaper is much closer to spec / uniscribe now.


@MakaraSok MakaraSok commented Sep 18, 2018

@behdad Thank you for your work on this. I'm a native Khmer speaker and I'm willing to help test the Khmer script. Please give me some guidance on how to test it.

behdad added a commit that referenced this issue Oct 1, 2018
Based on experimenting with Uniscribe to extract grammar and categories.

Failures down from 44 to 35:

KHMER: 299089 out of 299124 tests passed. 35 failed (0.0117008%)

We still don't enforce the one-matra rule pre-decomposition, but enforce
an order and one-matra-per-position post-decomposition.
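
The "one-matra-per-position post-decomposition" rule in the commit message above can be illustrated with a rough sketch. This is not HarfBuzz's code: the `POSITIONS` table is an assumption based on where each vowel sign is usually drawn, it covers only the four vowel signs used in the test list, and the ordering check is left out because the thread does not spell out the required order. A split vowel such as OE is treated as occupying both of its positions, which stands in for decomposition.

```python
from collections import Counter

# Assumed visual positions, for illustration only (not HarfBuzz's tables).
POSITIONS = {
    "\u17C1": ("pre",),          # KHMER VOWEL SIGN E
    "\u17B8": ("above",),        # KHMER VOWEL SIGN II
    "\u17BB": ("below",),        # KHMER VOWEL SIGN U
    "\u17BE": ("pre", "above"),  # KHMER VOWEL SIGN OE, a split vowel
}


def violates_one_per_position(text):
    """True if any assumed position is filled more than once in text.

    The whole string is treated as one syllable, which is a simplification.
    """
    counts = Counter(pos for ch in text for pos in POSITIONS.get(ch, ()))
    return any(n > 1 for n in counts.values())


# Hand-transcribed items from the test list in the opening comment.
ITEM_1 = "\u179F\u17C9\u17BE\u17BB\u1794"         # OE, U
ITEM_2 = "\u179F\u17C9\u17C1\u17B8\u17BB\u1794"   # E, II, U
ITEM_4 = "\u179F\u17C9\u17BE\u17BB\u17BB\u1794"   # OE, U, U
ITEM_13 = "\u179F\u17C9\u17BE\u17B8\u1794"        # OE, II

print(violates_one_per_position(ITEM_1))   # False
print(violates_one_per_position(ITEM_2))   # False
print(violates_one_per_position(ITEM_4))   # True, two signs below
print(violates_one_per_position(ITEM_13))  # True, two signs above
```

Item 2 passes this check as well, so a position-only rule does not by itself reduce the list to a single accepted sequence.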

@behdad behdad closed this in ab4c37f Oct 1, 2018