Kannada Ra-Virama-ZWJ gives different results from Windows 10 #435
The character sequence Ra-Virama-ZWJ in Kannada script renders differently in HarfBuzz than in Windows 10. Specifically, with HarfBuzz the virama is visibly rendered, just as if the ZWJ had been replaced by ZWNJ. In Windows 10, the virama causes a subjoined form to appear.
The file harfbuzz.png was generated from hb-view on Ubuntu Xenial, using the latest git sources. The file windows10.png was generated on Windows 10 using Notepad. The font in both cases was Noto Sans Kannada (Regular), version 1.04. The source text is the file renderdiff.txt. The word in the second line of the example comes from the Kannada Wikipedia wordlist that was used to test HarfBuzz.
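For anyone who wants to reproduce this without the attachment, here is a minimal sketch that builds the relevant code point sequences. It is not the content of renderdiff.txt: the following consonant KA (U+0C95) is an arbitrary choice for illustration, and the names `SEQUENCES` and `describe` are made up for this example.

```python
import unicodedata

# Code points involved in the report.
RA = "\u0CB0"      # KANNADA LETTER RA
VIRAMA = "\u0CCD"  # KANNADA SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
KA = "\u0C95"      # KANNADA LETTER KA (assumed consonant, illustration only)

# The orderings discussed in this issue, each followed by the same consonant.
SEQUENCES = {
    "ra_virama_zwj": RA + VIRAMA + ZWJ + KA,    # the sequence reported here
    "ra_virama_zwnj": RA + VIRAMA + ZWNJ + KA,  # what HarfBuzz output resembles
    "ra_zwj_virama": RA + ZWJ + VIRAMA + KA,    # the other ZWJ ordering
    "ra_virama": RA + VIRAMA + KA,              # no joiner at all
}


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


if __name__ == "__main__":
    for label, text in SEQUENCES.items():
        print(label, text)
        for line in describe(text):
            print("    " + line)
```

Each printed string can be saved to a UTF-8 text file and rendered with hb-view or opened in Notepad to compare the results side by side.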
I also tested with the fonts
with various combinations of Notepad, LibreOffice 5.1, LibreOffice 5.2, Windows 10, Windows 7, and Ubuntu Xenial with HarfBuzz both as packaged by Ubuntu and as compiled from source. The deciding factor seemed to be whether HarfBuzz or Microsoft DirectWrite was doing the OpenType rendering (I don't think Uniscribe or USE would have been involved).
Here's what Peter Constable wrote to me:
For Kannada character sequences < RA, VIRAMA, consonant >, there’s potential ambiguity as to whether the display should be [ gRA, gConsonant.subjoined ] or [ gConsonant, gReph ]. On pages 499-500 of Unicode 10.0 (section 12.8 — http://www.unicode.org/versions/Unicode10.0.0/ch12.pdf), it specifies that a sequence of < RA, ZWJ, VIRAMA, consonant > be used to represent text that needs to be rendered [gRA, gConsonant.subjoined ].
However, things were not always specified that way. If you look back in Unicode 4.0, section 9.8 (http://www.unicode.org/versions/Unicode4.0.0/ch09.pdf), it was actually specified the other way around: the sequence < RA, VIRAMA, ZWJ, Consonant > was specified to represent [ gRA, gConsonant.subjoined ]. This was changed in Unicode 5 after it came to light that there were inconsistent specifications for different Indic scripts.
This all arose as a result of various things involving Indic scripts all happening in 2004, one of which was publication of a draft for a Sri Lanka standard (http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/04-131). This came up for discussion at the UTC meeting in June 2004 (see http://www.unicode.org/cgi-bin/GetL2Ref.pl?99-C37). At the time, I was working on updates to the Indic shaping engine in Uniscribe, and I was aware of some of the inconsistencies (e.g., the draft for the Sri Lanka standard specifying the opposite of what Unicode had specified for Kannada RA and for Bangla ya-phalaa). So, I was given a UTC action item to prepare a doc regarding the general issue for Indic scripts. That eventually led to issuing Public Review Issue #37 (http://www.unicode.org/review/pr-37.pdf), which proposed having a consistent specification of ZWJ sequences across Indic scripts. (See the last page for specific changes for Kannada and Bangla.) That proposal was adopted at the UTC meeting in August 2004 (http://www.unicode.org/cgi-bin/GetL2Ref.pl?100-C22).
So, Kannada sequences < RA, VIRAMA, ZWJ, Consonant > aren’t currently recommended, but earlier on they were.
To accommodate previously-existing docs, Uniscribe does have a special-case behaviour: