-
Notifications
You must be signed in to change notification settings - Fork 593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Verse numbers of "KFGQPC Uthmanic Script HAFS" are not shaped right #501
Comments
I think this is because HarfBuzz processes the characters in the “native” direction of the script, so Arabic-Indic digits are processed from right-to-left so gives the wrong ligature in the font. I think that is one of the major remaining differences between HarfBuzz and Uniscribe. I stumbled upon this issue years ago and after Behdad’s explanation I managed to come up with a workaround in my font, but I wasn’t using ligatures so it was doable unlike here. |
Thanks for you insightful comment @khaledhosny, @behdad, I remember you wanted to put your comment here also, perhaps it would be useful for the future developments of the project and other collaborators. |
Here's full details of this issue: In Unicode, one can force-change direction of text using RLO and LRO characters. HarfBuzz wants to support this correctly. Ie. if you have <RLO,e,f,f,i> you should get "i,f,f,e" rendered. If you try with Uniscribe, you get "ffi,e", ie. it wrongly forms the 'ffi' ligature. Ie. it processes lookups in whatever input text order was there. At least it used to work this way two years ago. Now, Uniscribe does handle this correctly in their Arabic and I suppose Hebrew shapers. In HarfBuzz I want to support it correctly all the time. The way I do this is to have a per-script "native" horizontal direction, which reflects the natural order of the script. When Now, when I do the above, I do it just based on requested direction and script. Which means, I cannot distinguish between <LRO,Lam,Alef> ie. a forced non-natural direction, and <660,661> ie. natural, left-to-right, numbers in Arabic. As such, there's no way for HarfBuzz to apply lookups in the natural direction in both of these cases, short of looking at the input buffer text and if it starts with digits then assume a LTR natural direction... While I would hate such a solution, I'm open to implementing it. We should go back to Unicode bidi spec and see what sequences of characters can naturally become LTR in otherwise RTL scripts. Note that even if we do this, kerning / ligature of digits and weak punctuation still will get confused. Eg, if you happen to have kerning between left-parenthesis and digit zero, it's not clear which direction the lookup should be applied... Now, for years I was puzzled as to how Uniscribe gets this right. After some very heated exchanges with PeterCon, Sergey Malkin, and Dwayne Robinson, I finally found out. Sergey gave me the lead: they use a different script code for Arabic digits, such that they can differentiate later, and choose LTR natural order for Arabic digits and RTL natural order for rest of Arabic. This, obviously is out of scope for HarfBuzz as we don't do the itemization and I don't want to copy the different-script hack either. My plan to fix this properly is to encoded intended-direction of the lookup in LookupFlags. I've talked about it for years. Will try to flesh it out and write the proposal before end of year. |
CC @anthrotype |
I’m just wondering, can’t we solve this somehow by special casing numbers (based on the general category of the characters I suppose) and give them a LTR native direction always? Is the native direction a property of the whole buffer? |
Yes, and that's the problem. The best we can do though is indeed detect digits and use LTR native. That will screw up any punctuation stuff in the buffer, but that's a smaller issue. What would be a nice exercise is to see what neutral characters can end up in a LTR run adjacent to other neutral characters... But then again, doesn't really matter. |
Firefox (cc @jfkthame) seems to be trying to workaround this in https://bugzilla.mozilla.org/show_bug.cgi?id=1716029, but I’m really concerned about it as it will mean HarfBuzz clients will be behaving differently. Right now I’m working around this issue by using the private |
Yes, I'm experimenting with a possible approach to try and handle this; if it looks workable I was intending to bring it back here to discuss if we could incorporate it into HB more directly.
That's interesting, but doesn't seem like it helps with things like the existing KFGQPC font that wants to ligate digit sequences? Given that font works "as intended" in Safari, I was hoping we could achieve the same result in the HB-based engines. |
Ideally it would be fixed in HarfBuzz for good, but if not I’d rather have HarfBuzz clients work in a consistent (thus predictable) way. Now I think the solution Behdad suggested #501 (comment) might be that bad after all :
(though checking for the first digit only is not enough, as it will miss the case in the Firefox bug report). I’d say any buffer with LTR and RTL script and at least one BiDi AN or EN character should be treated as having natural LTR direction. May be the long term fix is to implement script segmentation in HarfBuzz and make all clients use it (I was checking DirectWrite the other day and it appears that there is no way to manually set the script based on its code, one have to use its script analyzer to get opaque script number to pass to the shaper). |
In https://bugzilla.mozilla.org/show_bug.cgi?id=1716029 we're working around this via a hack that basically checks for numeric runs in Arabic and shapes them with LTR directionality, but if we could move this (or something comparable) into harfbuzz itself for consistency, that would obviously be better. As a strawman suggestion, here's a harfbuzz patch that would result in similar behavior to what we're trying in Gecko:
This works within Firefox, but depends on calling a mozilla API to look up bidi categories; I don't think we currently have that in harfbuzz, do we? So if we wanted to do something like that, we'd need to add bidi-category lookup to the unicode functions. |
We don't have bidi categories currently. I'm open to adding those if we ever implement the bidi algo. In the mean time: I think checking for GC=Letter* types is enough in this case. If not, can be combined with horiz-dir=RTL. But is that even necessary? I think checking for GC=Letter* should be enough. My concern is with GC=Format. Doesn't that negate the point? Aren't the year-sign etc GC=Cf? Please adivse. I'm excited about getting this in and closing the gap with Uniscribe. |
Yes, those signs are Cf. Currently to get the desired rendering with e.g. Noto Nastaliq Urdu, the client needs to explicitly override directionality of the run. See for example https://searchfox.org/mozilla-central/source/layout/reftests/bugs/1599841-1.html, where The tweak here breaks that rendering if we don't exclude cases where GC=Cf is present. But there may be better ways to approach this... we should try to think it through more carefully. |
But if you are forcing RTL direction, the |
With the modified patch below, setting direction to LTR and RTL seems to produce visually identical output using Noto Nastaliq Urdu (for single digits of course, forcing RTL does not make sense for more than one digit), and it seems to match both Uniscribe and CoreText: diff --git a/src/hb-ot-shape.cc b/src/hb-ot-shape.cc
index 4fe148d67..1b72527f6 100644
--- a/src/hb-ot-shape.cc
+++ b/src/hb-ot-shape.cc
@@ -574,6 +574,33 @@ hb_ensure_native_direction (hb_buffer_t *buffer)
hb_direction_t direction = buffer->props.direction;
hb_direction_t horiz_dir = hb_script_get_horizontal_direction (buffer->props.script);
+ /* Numeric runs in natively-RTL scripts are actually native-LTR, so we reset the
+ * horiz_dir if the run contains at least one decimal-number char, and no chars with
+ * strong directionality or format-control chars (like the Arabic number/footnote/
+ * year signs) that may be trying to interact with digits.
+ *
+ * This allows digit sequences in Arabic etc to be shaped in "native" direction,
+ * so that features like ligatures will work as intended.
+ */
+ if (unlikely (horiz_dir == HB_DIRECTION_RTL && direction == HB_DIRECTION_LTR)) {
+ const hb_glyph_info_t* info = buffer->info;
+ bool found_number = false,
+ found_strong = false;
+ foreach_grapheme (buffer, start, end)
+ {
+ auto gc = _hb_glyph_info_get_general_category (&info[start]);
+ if (gc == HB_UNICODE_GENERAL_CATEGORY_DECIMAL_NUMBER)
+ found_number = true;
+ else if (HB_UNICODE_GENERAL_CATEGORY_IS_LETTER (gc))
+ {
+ found_strong = true;
+ break;
+ }
+ }
+ if (found_number && !found_strong)
+ horiz_dir = HB_DIRECTION_LTR;
+ }
+
/* TODO vertical:
* The only BTT vertical script is Ogham, but it's not clear to me whether OpenType
* Ogham fonts are supposed to be implemented BTT or not. Need to research that
diff --git a/src/hb-unicode.hh b/src/hb-unicode.hh
index 5c0cb859d..0a7994410 100644
--- a/src/hb-unicode.hh
+++ b/src/hb-unicode.hh
@@ -359,6 +359,13 @@ DECLARE_NULL_INSTANCE (hb_unicode_funcs_t);
FLAG (HB_UNICODE_GENERAL_CATEGORY_ENCLOSING_MARK) | \
FLAG (HB_UNICODE_GENERAL_CATEGORY_NON_SPACING_MARK)))
+#define HB_UNICODE_GENERAL_CATEGORY_IS_LETTER(gen_cat) \
+ (FLAG_UNSAFE (gen_cat) & \
+ (FLAG (HB_UNICODE_GENERAL_CATEGORY_LOWERCASE_LETTER) | \
+ FLAG (HB_UNICODE_GENERAL_CATEGORY_MODIFIER_LETTER) | \
+ FLAG (HB_UNICODE_GENERAL_CATEGORY_OTHER_LETTER) | \
+ FLAG (HB_UNICODE_GENERAL_CATEGORY_TITLECASE_LETTER) | \
+ FLAG (HB_UNICODE_GENERAL_CATEGORY_UPPERCASE_LETTER)))
/*
* Ranges, used for bsearch tables. |
@khaledhosny didn't you have to do special hacks in Amiri for this as well? Are they unnecessary now? I'm more interested in number-sign, ayah-sign etc working, than the kerning applied being of a particular direction. |
They are not needed anymore, I already built a test version of Amiri without the hacks and it works equally well with the three engines. I probably can simplify things further and use anchor attachment instead of the manually calculated positioning I currently use. |
Great. Thanks. Send a PR :). |
Bracket on new line please.
We typically try to keep this closest to the for loop, for better auditing. |
See inline comments. Slightly modified version of the code from Jonathan Kew on the linked issue. Fixes #501
|
See inline comments. Slightly modified version of the code from Jonathan Kew on the linked issue. Fixes #501
See inline comments. Slightly modified version of the code from Jonathan Kew on the linked issue. Fixes #501
See inline comments. Slightly modified version of the code from Jonathan Kew on the linked issue. Fixes #501
nice to see this old issue finally fixed! thank you :) |
I am still having the same issue in Chrome and Opera, but not with Microsoft Edge |
Which version of Harfbuzz is baked into the version of Chrome & Opera you tested with? I assume this fix just hasn't landed in the branches of those projects you are using yet. |
Filed from this mailing list topic as a possible bug, steps to repro:
Actual:
21
Expected:
12
(this works if we issue
./hb-view UthmanicHafs1\ Ver09.otf --shapers=coretext --direction=ltr ١٢
on macOS and perhaps uniscribe, and with other fonts)The text was updated successfully, but these errors were encountered: