Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verse numbers of "KFGQPC Uthmanic Script HAFS" are not shaped right #501

Closed
ebraminio opened this issue Jun 23, 2017 · 22 comments · Fixed by #3035
Closed

Verse numbers of "KFGQPC Uthmanic Script HAFS" are not shaped right #501

ebraminio opened this issue Jun 23, 2017 · 22 comments · Fixed by #3035

Comments

@ebraminio
Copy link
Collaborator

ebraminio commented Jun 23, 2017

Filed from this mailing list topic as a possible bug, steps to repro:

  1. Download the font from http://fonts.qurancomplex.gov.sa/?page_id=42
  2. ./hb-view UthmanicHafs1\ Ver09.otf --direction=ltr ١٢

Actual:
21

Expected:
12

(this works if we issue ./hb-view UthmanicHafs1\ Ver09.otf --shapers=coretext --direction=ltr ١٢ on macOS and perhaps uniscribe, and with other fonts)

@khaledhosny
Copy link
Collaborator

I think this is because HarfBuzz processes the characters in the “native” direction of the script, so Arabic-Indic digits are processed from right-to-left so gives the wrong ligature in the font. I think that is one of the major remaining differences between HarfBuzz and Uniscribe. I stumbled upon this issue years ago and after Behdad’s explanation I managed to come up with a workaround in my font, but I wasn’t using ligatures so it was doable unlike here.

@ebraminio
Copy link
Collaborator Author

Thanks for you insightful comment @khaledhosny,

@behdad, I remember you wanted to put your comment here also, perhaps it would be useful for the future developments of the project and other collaborators.

@behdad
Copy link
Member

behdad commented Oct 4, 2017

Here's full details of this issue:

In Unicode, one can force-change direction of text using RLO and LRO characters. HarfBuzz wants to support this correctly. Ie. if you have <RLO,e,f,f,i> you should get "i,f,f,e" rendered. If you try with Uniscribe, you get "ffi,e", ie. it wrongly forms the 'ffi' ligature. Ie. it processes lookups in whatever input text order was there. At least it used to work this way two years ago.

Now, Uniscribe does handle this correctly in their Arabic and I suppose Hebrew shapers. In HarfBuzz I want to support it correctly all the time. The way I do this is to have a per-script "native" horizontal direction, which reflects the natural order of the script. When hb_shape() and requested shaping direction is reverse of the natural horizontal direction of the script, we flip the buffer first such that text becomes in natural order before we apply the lookups, because the font designer designed lookups for the natural order.

Now, when I do the above, I do it just based on requested direction and script. Which means, I cannot distinguish between <LRO,Lam,Alef> ie. a forced non-natural direction, and <660,661> ie. natural, left-to-right, numbers in Arabic. As such, there's no way for HarfBuzz to apply lookups in the natural direction in both of these cases, short of looking at the input buffer text and if it starts with digits then assume a LTR natural direction...

While I would hate such a solution, I'm open to implementing it. We should go back to Unicode bidi spec and see what sequences of characters can naturally become LTR in otherwise RTL scripts. Note that even if we do this, kerning / ligature of digits and weak punctuation still will get confused. Eg, if you happen to have kerning between left-parenthesis and digit zero, it's not clear which direction the lookup should be applied...

Now, for years I was puzzled as to how Uniscribe gets this right. After some very heated exchanges with PeterCon, Sergey Malkin, and Dwayne Robinson, I finally found out. Sergey gave me the lead: they use a different script code for Arabic digits, such that they can differentiate later, and choose LTR natural order for Arabic digits and RTL natural order for rest of Arabic. This, obviously is out of scope for HarfBuzz as we don't do the itemization and I don't want to copy the different-script hack either.

My plan to fix this properly is to encoded intended-direction of the lookup in LookupFlags. I've talked about it for years. Will try to flesh it out and write the proposal before end of year.

@behdad
Copy link
Member

behdad commented Jan 5, 2018

CC @anthrotype

@khaledhosny
Copy link
Collaborator

I’m just wondering, can’t we solve this somehow by special casing numbers (based on the general category of the characters I suppose) and give them a LTR native direction always? Is the native direction a property of the whole buffer?

@behdad
Copy link
Member

behdad commented Jan 19, 2018

Is the native direction a property of the whole buffer?

Yes, and that's the problem. The best we can do though is indeed detect digits and use LTR native. That will screw up any punctuation stuff in the buffer, but that's a smaller issue.

What would be a nice exercise is to see what neutral characters can end up in a LTR run adjacent to other neutral characters... But then again, doesn't really matter.

@khaledhosny
Copy link
Collaborator

Firefox (cc @jfkthame) seems to be trying to workaround this in https://bugzilla.mozilla.org/show_bug.cgi?id=1716029, but I’m really concerned about it as it will mean HarfBuzz clients will be behaving differently.

Right now I’m working around this issue by using the private BUZZ feature to map the digits to an alternate version and handle them differently so that HarfBuzz and non-HarfBuzz engines work the same, but with the workaround in Firefox it will give the wrong output in Firefox without a way to handle it in the font.

@jfkthame
Copy link
Collaborator

Firefox (cc @jfkthame) seems to be trying to workaround this in https://bugzilla.mozilla.org/show_bug.cgi?id=1716029, but I’m really concerned about it as it will mean HarfBuzz clients will be behaving differently.

Yes, I'm experimenting with a possible approach to try and handle this; if it looks workable I was intending to bring it back here to discuss if we could incorporate it into HB more directly.

Right now I’m working around this issue by using the private BUZZ feature to map the digits to an alternate version and handle them differently so that HarfBuzz and non-HarfBuzz engines work the same, but with the workaround in Firefox it will give the wrong output in Firefox without a way to handle it in the font.

That's interesting, but doesn't seem like it helps with things like the existing KFGQPC font that wants to ligate digit sequences? Given that font works "as intended" in Safari, I was hoping we could achieve the same result in the HB-based engines.

@khaledhosny
Copy link
Collaborator

khaledhosny commented Jun 16, 2021

That's interesting, but doesn't seem like it helps with things like the existing KFGQPC font that wants to ligate digit sequences? Given that font works "as intended" in Safari, I was hoping we could achieve the same result in the HB-based engines.

Ideally it would be fixed in HarfBuzz for good, but if not I’d rather have HarfBuzz clients work in a consistent (thus predictable) way.

Now I think the solution Behdad suggested #501 (comment) might be that bad after all :

short of looking at the input buffer text and if it starts with digits then assume a LTR natural direction...

(though checking for the first digit only is not enough, as it will miss the case in the Firefox bug report). I’d say any buffer with LTR and RTL script and at least one BiDi AN or EN character should be treated as having natural LTR direction.

May be the long term fix is to implement script segmentation in HarfBuzz and make all clients use it (I was checking DirectWrite the other day and it appears that there is no way to manually set the script based on its code, one have to use its script analyzer to get opaque script number to pass to the shaper).

@jfkthame
Copy link
Collaborator

In https://bugzilla.mozilla.org/show_bug.cgi?id=1716029 we're working around this via a hack that basically checks for numeric runs in Arabic and shapes them with LTR directionality, but if we could move this (or something comparable) into harfbuzz itself for consistency, that would obviously be better.

As a strawman suggestion, here's a harfbuzz patch that would result in similar behavior to what we're trying in Gecko:

diff --git a/src/hb-ot-shape.cc b/src/hb-ot-shape.cc
--- a/src/hb-ot-shape.cc
+++ b/src/hb-ot-shape.cc
@@ -564,6 +565,42 @@ hb_ensure_native_direction (hb_buffer_t 
   hb_direction_t direction = buffer->props.direction;
   hb_direction_t horiz_dir = hb_script_get_horizontal_direction (buffer->props.script);
 
+  /* Numeric runs in natively-RTL scripts are actually native-LTR, so we reset the
+   * horiz_dir if the run contains at least one decimal-number char, and no chars with
+   * strong directionality or format-control chars (like the Arabic number/footnote/
+   * year signs) that may be trying to interact with digits.
+   *
+   * This allows digit sequences in Arabic etc to be shaped in "native" direction,
+   * so that features like ligatures will work as intended.
+   */
+  if (unlikely (horiz_dir == HB_DIRECTION_RTL && direction == HB_DIRECTION_LTR)) {
+    const hb_glyph_info_t* info = buffer->info;
+    bool found_number = false,
+         found_strong = false;
+    foreach_grapheme (buffer, start, end)
+    {
+      auto gc = _hb_glyph_info_get_general_category (&info[start]);
+      if (gc == HB_UNICODE_GENERAL_CATEGORY_DECIMAL_NUMBER)
+        found_number = true;
+      else if (unlikely (gc == HB_UNICODE_GENERAL_CATEGORY_FORMAT))
+      {
+        found_strong = true;
+        break;
+      }
+      else
+      {
+        auto bidi = mozilla::unicode::GetBidiCat(info[start].codepoint);
+        if (bidi == eCharType_RightToLeft || bidi == eCharType_RightToLeftArabic)
+        {
+          found_strong = true;
+          break;
+        }
+      }
+    }
+    if (found_number && !found_strong)
+      horiz_dir = HB_DIRECTION_LTR;
+  }
+
   /* TODO vertical:
    * The only BTT vertical script is Ogham, but it's not clear to me whether OpenType
    * Ogham fonts are supposed to be implemented BTT or not.  Need to research that

This works within Firefox, but depends on calling a mozilla API to look up bidi categories; I don't think we currently have that in harfbuzz, do we? So if we wanted to do something like that, we'd need to add bidi-category lookup to the unicode functions.

@behdad
Copy link
Member

behdad commented Jun 21, 2021

We don't have bidi categories currently. I'm open to adding those if we ever implement the bidi algo. In the mean time:

I think checking for GC=Letter* types is enough in this case. If not, can be combined with horiz-dir=RTL. But is that even necessary? I think checking for GC=Letter* should be enough.

My concern is with GC=Format. Doesn't that negate the point? Aren't the year-sign etc GC=Cf? Please adivse.

I'm excited about getting this in and closing the gap with Uniscribe.

@jfkthame
Copy link
Collaborator

My concern is with GC=Format. Doesn't that negate the point? Aren't the year-sign etc GC=Cf? Please adivse.

Yes, those signs are Cf.

Currently to get the desired rendering with e.g. Noto Nastaliq Urdu, the client needs to explicitly override directionality of the run. See for example https://searchfox.org/mozilla-central/source/layout/reftests/bugs/1599841-1.html, where unicode-bidi: bidi-override is used to force RTL directionality onto the list item numbers.

The tweak here breaks that rendering if we don't exclude cases where GC=Cf is present.

But there may be better ways to approach this... we should try to think it through more carefully.

@khaledhosny
Copy link
Collaborator

But if you are forcing RTL direction, the direction == HB_DIRECTION_LTR) will be false and this new code will not be applied?

@khaledhosny
Copy link
Collaborator

With the modified patch below, setting direction to LTR and RTL seems to produce visually identical output using Noto Nastaliq Urdu (for single digits of course, forcing RTL does not make sense for more than one digit), and it seems to match both Uniscribe and CoreText:

diff --git a/src/hb-ot-shape.cc b/src/hb-ot-shape.cc
index 4fe148d67..1b72527f6 100644
--- a/src/hb-ot-shape.cc
+++ b/src/hb-ot-shape.cc
@@ -574,6 +574,33 @@ hb_ensure_native_direction (hb_buffer_t *buffer)
   hb_direction_t direction = buffer->props.direction;
   hb_direction_t horiz_dir = hb_script_get_horizontal_direction (buffer->props.script);

+  /* Numeric runs in natively-RTL scripts are actually native-LTR, so we reset the
+   * horiz_dir if the run contains at least one decimal-number char, and no chars with
+   * strong directionality or format-control chars (like the Arabic number/footnote/
+   * year signs) that may be trying to interact with digits.
+   *
+   * This allows digit sequences in Arabic etc to be shaped in "native" direction,
+   * so that features like ligatures will work as intended.
+   */
+  if (unlikely (horiz_dir == HB_DIRECTION_RTL && direction == HB_DIRECTION_LTR)) {
+    const hb_glyph_info_t* info = buffer->info;
+    bool found_number = false,
+         found_strong = false;
+    foreach_grapheme (buffer, start, end)
+    {
+      auto gc = _hb_glyph_info_get_general_category (&info[start]);
+      if (gc == HB_UNICODE_GENERAL_CATEGORY_DECIMAL_NUMBER)
+        found_number = true;
+      else if (HB_UNICODE_GENERAL_CATEGORY_IS_LETTER (gc))
+      {
+        found_strong = true;
+        break;
+      }
+    }
+    if (found_number && !found_strong)
+      horiz_dir = HB_DIRECTION_LTR;
+  }
+
   /* TODO vertical:
    * The only BTT vertical script is Ogham, but it's not clear to me whether OpenType
    * Ogham fonts are supposed to be implemented BTT or not.  Need to research that
diff --git a/src/hb-unicode.hh b/src/hb-unicode.hh
index 5c0cb859d..0a7994410 100644
--- a/src/hb-unicode.hh
+++ b/src/hb-unicode.hh
@@ -359,6 +359,13 @@ DECLARE_NULL_INSTANCE (hb_unicode_funcs_t);
          FLAG (HB_UNICODE_GENERAL_CATEGORY_ENCLOSING_MARK) | \
          FLAG (HB_UNICODE_GENERAL_CATEGORY_NON_SPACING_MARK)))

+#define HB_UNICODE_GENERAL_CATEGORY_IS_LETTER(gen_cat) \
+       (FLAG_UNSAFE (gen_cat) & \
+        (FLAG (HB_UNICODE_GENERAL_CATEGORY_LOWERCASE_LETTER) | \
+         FLAG (HB_UNICODE_GENERAL_CATEGORY_MODIFIER_LETTER) | \
+         FLAG (HB_UNICODE_GENERAL_CATEGORY_OTHER_LETTER) | \
+         FLAG (HB_UNICODE_GENERAL_CATEGORY_TITLECASE_LETTER) | \
+         FLAG (HB_UNICODE_GENERAL_CATEGORY_UPPERCASE_LETTER)))

 /*
  * Ranges, used for bsearch tables.

@behdad
Copy link
Member

behdad commented Jun 23, 2021

@khaledhosny didn't you have to do special hacks in Amiri for this as well? Are they unnecessary now? I'm more interested in number-sign, ayah-sign etc working, than the kerning applied being of a particular direction.

@khaledhosny
Copy link
Collaborator

@khaledhosny didn't you have to do special hacks in Amiri for this as well? Are they unnecessary now? I'm more interested in number-sign, ayah-sign etc working, than the kerning applied being of a particular direction.

They are not needed anymore, I already built a test version of Amiri without the hacks and it works equally well with the three engines. I probably can simplify things further and use anchor attachment instead of the manually calculated positioning I currently use.

@behdad
Copy link
Member

behdad commented Jun 23, 2021

They are not needed anymore, I already built a test version of Amiri without the hacks and it works equally well with the three engines. I probably can simplify things further and use anchor attachment instead of the manually calculated positioning I currently use.

Great. Thanks. Send a PR :).

@behdad
Copy link
Member

behdad commented Jun 23, 2021

+  if (unlikely (horiz_dir == HB_DIRECTION_RTL && direction == HB_DIRECTION_LTR)) {

Bracket on new line please.

+    const hb_glyph_info_t* info = buffer->info;

We typically try to keep this closest to the for loop, for better auditing.

khaledhosny added a commit that referenced this issue Jun 23, 2021
See inline comments. Slightly modified version of the code from Jonathan
Kew on the linked issue.

Fixes #501
@khaledhosny
Copy link
Collaborator

Great. Thanks. Send a PR :).

#3035

khaledhosny added a commit that referenced this issue Jun 23, 2021
See inline comments. Slightly modified version of the code from Jonathan
Kew on the linked issue.

Fixes #501
khaledhosny added a commit that referenced this issue Jun 23, 2021
See inline comments. Slightly modified version of the code from Jonathan
Kew on the linked issue.

Fixes #501
behdad pushed a commit that referenced this issue Jun 23, 2021
See inline comments. Slightly modified version of the code from Jonathan
Kew on the linked issue.

Fixes #501
@anthrotype
Copy link
Member

nice to see this old issue finally fixed! thank you :)

@muhannad102
Copy link

I am still having the same issue in Chrome and Opera, but not with Microsoft Edge

@alerque
Copy link
Member

alerque commented Sep 27, 2021

Which version of Harfbuzz is baked into the version of Chrome & Opera you tested with? I assume this fix just hasn't landed in the branches of those projects you are using yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants