Right to left support (Arabic, Hebrew, ...) #5359

Closed
hgrain86 opened this issue Sep 11, 2019 · 232 comments · Fixed by #5667

Comments

@hgrain86

  • Device: kobo aura one

Issue

This is a duplicate topic, but I am raising it again because it is important.
Will we see support for Arabic soon?
In the keyboard
And in writing the names of files and folders
Finally in the menus and settings
Thank you very much for this wonderful support and this extraordinary effort

@pazos
Member

pazos commented Sep 11, 2019

From the README:

KOReader is developed and supported by volunteers [...] you can create a bounty for the specific bug or feature request you want and motivate others to do the work.

Duplicate of #5048, which was a duplicate of #1426.
Also related to #4709, #2944, #1767

@Frenzie
Member

Frenzie commented Sep 11, 2019

In the keyboard

A basic keyboard is actually a great place for just about anyone to contribute. You can copy the English layout (or one closer to your goals if applicable) and start changing characters.

https://github.com/koreader/koreader/tree/4da512ce4e61df40a2c07f79d23ed21c9748f68f/frontend/ui/data/keyboardlayouts

And in writing the names of files and folders

Unless someone's "secretly" working on it, all we've got right now are a few promising libraries (FriBiDi, raqm).

@poire-z
Contributor

poire-z commented Sep 11, 2019

Well, I'm currently playing with fribidi/harfbuzz on the crengine side (EPUB rendering). Technical notes in koreader/crengine#307.

I can't promise anything on the UI/filename side, but I'll probably have a look at it.
On the surface, it feels like it might not be super complicated: just using the libraries to reorder the chars and correctly shape Arabic, so displaying Arabic/bidirectional filenames should be achievable.

But deeper down, there are many more tedious things. A keyboard might be easy (but would need to be done by some Arabic users), but textbox editing/cursor/insertion/wrapping feels like a nightmare, and it would make the already complicated code we have for left-to-right more fragile...
So, we'll see.

Anyway, @hgrain86 (and others reading this who read Arabic or Hebrew), just a few questions so I get to understand the importance of each step. I don't read Arabic at all; I'm just playing with libraries that are supposed to do things right, and visually comparing with how Firefox displays the same Wikipedia page.

First, just to have an early confirmation, are these screenshots readable/correct/perfect (with various western fonts I have, might not have the best arabic glyphs - and discarding the fact that some titles and list bullets should be right aligned, that will be fixed later):

image
image
image
image
(source)
(edit: replaced that 4th screenshot, it didn't have harfbuzz enabled :) moved the original one below).

A) are these readable, and how would you rate the quality of that on a scale of 0 to 10 :)
B) is the hyphenation of english words (in red) in the middle of a line OK/expected? If not, what's the alternative? No hyphenation at all?
C) what are these little www ww in blue, that I see only on this word and not on any others? (They are on the Wikipedia page too, so probably not a bug.)
D) On the last screenshot (a Quran.epub I found), there are these little dots in green above all words, that I don't see in any Wikipedia page. Are these expected? I see that we would need some huge interline space (more than the 130% max we have already) to get them displayed and not overwritten by the next line. Would you need more interline space with Arabic?

That nice Arabic (I hope :) is achieved thanks to HarfBuzz, which shapes the individual characters into the correct cursive display.
When we are not using it, but still reordering the chars right-to-left, we would get this:
image
image
image

E) Even if it's probably not as nice as the first screenshots, are these still readable?

This is where it gets complicated: this is the current UI code, that doesn't re-order chars, so this is probably reversed and unreadable:
image

F) Is that totally unreadable? Or is your brain able to re-order it and decipher it ? :)
G) Would the reordering without shaping (which I asked about in E) be enough for this kind of text editing?

And about the UI (menu, filenames):
H) would having them displayed correctly ordered, but still left aligned, in our menu, file lists be ok? Would cursive/shaping be needed?
I) how important is it that menu items/filenames be right aligned, as you would expect - because that feels like another nightmare as each of our widgets would need to have some different options/code to align the subwidgets differently... so, alternatives to test each time we make some widget change... which would be painful for many of us.

J) would you be willing to handle the strings (english to arabic) translation work ? :)

In case some hebrew reader passes by, here's some rendering of a hebrew wikipedia epub page (source). same questions as above :) is it readable/perfect?
image
image
image

@hgrain86
Author

I will read your reply at leisure and answer the questions that I can. I am not a developer and have no programming experience at all; I am just a regular user of this special program. Thank you very much for all this interest. I would even support you financially, as far as I am able.

@poire-z
Contributor

poire-z commented Sep 14, 2019

Another question (still just thinking how much complexity this would add):

K) How should the following UI sentence be rendered when translated to arabic:

do you want to delete book title (2017).epub ?

In the following, assume the uppercase are the arabic letters making the translated words from their lowercase counterpart in english:

a) ? BOOK TITLE (2017).epub ETELED OT TNAW UOY OD
b) ? ELTIT KOOB (2017).epub ETELED OT TNAW UOY OD
c) ? (2017) ELTIT KOOB.epub ETELED OT TNAW UOY OD
d) ? epub.(2017) ELTIT KOOB ETELED OT TNAW UOY OD
e) ? bupe.(2017) ELTIT KOOB ETELED OT TNAW UOY OD

K1) when book title is english
K2) when book title is arabic

@Frenzie : related technical question: in Transifex, or other translation tools, are there tools to isolate/override the directionality, or would the translators have to insert the unicode chars mentioned at http://unicode.org/reports/tr9/#Directional_Formatting_Characters - or would we developers need to take care of that in our translatable strings (?!).
You can see where I'm going with the sample sentence above: a filename's direction might differ from that of the surrounding text (I guess Windows goes with b).

@Frenzie
Member

Frenzie commented Sep 14, 2019

Wouldn't that depend on the filename? It doesn't really strike me as something you can translate in advance.

@poire-z
Contributor

poire-z commented Sep 14, 2019

I think all the strings/translation/template substitutions would be done with the logical order of all strings.
So the arabic translated string would (in logical order) be DO YOU WANT TO DELETE %1 ?
T(_()) applied, we would get DO YOU WANT TO DELETE book title (2017).epub ? with some english book title and DO YOU WANT TO DELETE BOOK TITLE (2017).epub ? with some arabic book title - all that in logical order.

Give these to fribidi, and we would get the visual order:
with english book title: ? book title (2017).epub ETELED OT TNAW UOY OD
with arabic book title: ? epub.(2017) ELTIT KOOB ETELED OT TNAW UOY OD

I guess the .epub suffix would still need to be on the right of the filename.
So, either we would add to our code (as we know it's a filename):
T(_("do you want to delete <LEFT‑TO‑RIGHT ISOLATE>%1<POP DIRECTIONAL ISOLATE>?"), filename)
or:
T(_("do you want to delete %1?"), ltr_isolate(filename))

Or the translators would need to add them:
DO YOU WANT TO DELETE <LEFT‑TO‑RIGHT ISOLATE>%1<POP DIRECTIONAL ISOLATE> ?

(or LEFT-TO-RIGHT OVERRIDE; the distinction between isolate and override was clear to me for a few minutes last week, but I've already forgotten... :)
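The `ltr_isolate(filename)` option can be sketched as follows, in Python rather than KOReader's Lua for brevity. The helper name comes from the comment above; its body and the tiny `fill_template` stand-in for `T()` are assumptions. The helper wraps the filename in U+2066 LEFT-TO-RIGHT ISOLATE and U+2069 POP DIRECTIONAL ISOLATE, two of the controls listed in UAX #9. Roughly, an isolate keeps the wrapped text from influencing the ordering of its surroundings, while an override forces every character inside it to one direction regardless of its own type.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def ltr_isolate(text: str) -> str:
    """Wrap text so a bidi-aware renderer treats it as an isolated LTR run."""
    return f"{LRI}{text}{PDI}"


def fill_template(template: str, *args: str) -> str:
    """Hypothetical stand-in for T(): replace %1, %2, ... with the arguments."""
    # Substitute higher indices first so that "%1" does not clobber "%10".
    for index in range(len(args), 0, -1):
        template = template.replace(f"%{index}", args[index - 1])
    return template


message = fill_template("do you want to delete %1?",
                        ltr_isolate("book title (2017).epub"))
```

The isolate characters are invisible; they only matter once the string goes through bidi reordering.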

Really nothing specific/additional for arabic in translation tools?

@Frenzie
Member

Frenzie commented Sep 14, 2019

There is bidirectional isolation in https://projectfluent.org/ They list it as an advantage over gettext.

In our case I would think T(_("do you want to delete %1?"), ltr_isolate(filename)) looks like it makes the most sense, but maybe that should just be part of the template function?

Edit: interestingly, just the other day someone posted about a Lua implementation https://discourse.mozilla.org/t/work-started-on-a-lua-implementation/44963

Edit 2: whoa, Pontoon works quite well.

@yparitcher
Member

yparitcher commented Sep 18, 2019

@poire-z
i am a little late to the party, but would like to help
speaking for hebrew RTL here:

And about the UI (menu, filenames):
H) would having them displayed correctly ordered, but still left aligned, in our menu, file lists be ok? Would cursive/shaping be needed?

yes, right aligned is better but left aligned is ok.
Hebrew has (optional) diacritics, so if they are present in the text and not aligned/positioned on the right letter, it would look a little funny (this happens in crengine when not using best (harfbuzz) kerning). Most filenames probably won't have the diacritics. See pictures below.

I) how important is it that menu items/filenames be right aligned, as you would expect - because that feels like another nightmare as each of our widgets would need to have some different options/code to align the subwidgets differently... so, alternatives to test each time we make some widget change... which would be painful for many of us.

depends, for example TOC would be more expected to be right aligned in a totally hebrew book than the file browser with hebrew and english entries

J) would you be willing to handle the strings (english to arabic) translation work ? :)

i don't mind helping out with integrating hebrew support in strings/code, I just need someone to point me in the right direction as i am not familiar with the codebase/lua

In case some hebrew reader passes by, here's some rendering of a hebrew wikipedia epub page same questions as above :) is it readable/perfect?

the render is readable

for comparison i will show good screenshots courtesy of RTL in crengine
Reader_2019-Sep-18_174104
this one has no diacritics.
Reader_2019-Sep-18_180916
this one with harfbuzz/best has the diacritics( dots) in the right places.
Reader_2019-Sep-18_180930
this one with freetype/good has them going into the next letter/ missing.

do you need me to test any specific features, it is quite easy for me to generate a hebrew epub with specific features, lists bullets footnotes etc.

Another question (still just thinking how much complexity this would add):

K) How should the following UI sentence be rendered when translated to arabic:

do you want to delete book title (2017).epub ?

In the following, assume the uppercase are the arabic letters making the translated words from their lowercase counterpart in english:

a) ? BOOK TITLE (2017).epub ETELED OT TNAW UOY OD
b) ? ELTIT KOOB (2017).epub ETELED OT TNAW UOY OD
c) ? (2017) ELTIT KOOB.epub ETELED OT TNAW UOY OD
d) ? epub.(2017) ELTIT KOOB ETELED OT TNAW UOY OD
e) ? bupe.(2017) ELTIT KOOB ETELED OT TNAW UOY OD

K1) when book title is english
K2) when book title is arabic

K1) a
K2) c or d, regular BIDI (d), unless you make an exception that the file suffix is always on the right (c)

i use an english locale (english speaker), but read hebrew books also
so for me.
K1) do you want to delete BOOK TITLE (2017).epub ie. regular LTR
K2) do you want to delete ELTIT KOOB (2017).epub would be proper BIDI, the hebrew part gets reversed
exactly how you said in #5359 (comment)

please implement this, (you will make a great program much better)
if you have any more questions or need someone to validate screenshots please ping me.
also are there any specific things i can add in the code to help implement this. (where are the UI text layout functions located)

@poire-z
Contributor

poire-z commented Sep 19, 2019

Thanks for the nice feedback!

is it readable/perfect?

the render is readable

So, not perfect ? :) anything missing to make it perfect?

when not using best (harfbuzz) kerning

Well, you probably have to use "best" for anything a bit complex to show correctly. "good" is really just a hack that may work on some simple western text. "fast" is pure freetype, and similar to how our UI renders text (but without the RTL support).
Even for simple latin ê, Freetype is fine with the single unicode codepoint for ê, but when using e+^, the ^ is a bit miscentered and too near the top of the e. Only best (harfbuzz) is able to decide that this is bad and correct the offsets, or switch to use the glyph for ê if it exists in the font.
(Or maybe I messed things up with fast and good - but it looks like it wasn't better in previous crengine versions.)
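The two encodings of ê mentioned here can be inspected with Python's standard `unicodedata` module. This only illustrates precomposed versus decomposed text; it makes no claim about what crengine, FreeType or HarfBuzz do internally.

```python
import unicodedata

precomposed = "\u00ea"   # ê as a single code point
decomposed = "e\u0302"   # e followed by COMBINING CIRCUMFLEX ACCENT

# The two strings render alike but are different code point sequences.
same_raw = precomposed == decomposed

# NFC composes the pair into the single code point, NFD splits it again.
composed = unicodedata.normalize("NFC", decomposed)
split = unicodedata.normalize("NFD", precomposed)
```

A renderer that maps one code point to one glyph has to place the combining mark itself in the decomposed case, which is where a shaper helps.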

That's why we'd need to use Harfbuzz for the UI too... (Hebrew without diacritics might be fine with Freetype, but we might as well want cursive like arabic and scripts with heavy glyph substitution like indic to work).

do you need me to test any specific features, it is quite easy for me to generate a hebrew epub with specific features, lists bullets footnotes etc.

Thanks, but for now, I'm ok with EPUBs made out of HE or AR wikipedia articles.
List item bullets and block stuff may have to wait a bit. For now, you might want to use a specific style tweak to work around that, like:
body, p, li, h1, h2, h3, h4, h5, h6 { text-align: right !important; }
(Or without p if you prefer them justified:)
body, li, h1, h2, h3, h4, h5, h6 { text-align: right !important; }

i don't mind helping out with integrating hebrew support in strings/code, I just need someone to point me in the right direction as i am not familiar with the codebase/lua

Well, that's on Transifex, and @Frenzie knows more about that than I do. But best to wait before starting translations until we have at least an idea on how to do it and have started the work :)
Because for now, it looks quite tedious (see #3904 (comment)).

where are the UI text layout functions located

Pretty much all of it is contained in a few modules:

  • font.lua: font list and simple selection, and wrapper to Freetype
  • freetype.lua: interface with FreeType, simple wrappers to Freetype object
  • rendertext.lua: the real API to draw/size/truncate text, that we should keep to avoid changing all the other widgets - and glyph cache
  • textwidget.lua: single line text widget (used for menu items and single line file browser filenames)
  • textboxwidget.lua: multi-line text widget (used when filenames are put on 2 lines, and by almost all our widgets, like displaying an InfoMessage)

textboxwidget.lua would be the most complex to adapt to using something more complex than simple Freetype (that maps 1 char => 1 glyph and just stacks them on the line) - and it currently needs a lot of memory with long text (like long dict entries or Wikipedia articles shown from the UI).
So, still thinking :)

@Frenzie
Member

Frenzie commented Sep 19, 2019

Transifex is just a workflow aid; at the core we simply work with GetText PO/POT files. So you can either use the Transifex online interface or download the relevant translation file locally, work on it in your favorite editor, and upload it. You can do that manually through the web interface or assisted by the tx command line tool.

Nice ways to edit PO files:

  • Poedit
  • Virtaal
  • Lokalize
  • Qt Linguist
  • Gtranslator
  • Text editor (Geany, Kate, etc.)
  • OmegaT

Transifex automatically switches to an RTL interface when appropriate (which can be disabled); I don't know the details with regard to the programs I mentioned.

From here:

Does Transifex support Right-To-Left (RTL) languages?

Yes! When translating to a RTL language such as Arabic or Hebrew, the Editor will automatically switch the input box to RTL for you. You can override this and use LTR by clicking RTL in the translation box.

The primary advantage of working on Transifex is when you've got a lot of strings to localize and you're working on it with multiple people simultaneously. In a regular Git scenario, it'd be easy to accidentally duplicate efforts, leading to doubly wasted time because then there would also be merge conflicts to resolve.

@poire-z For implementation considerations, this post about wxWidgets might be of interest.

Wordpress has an is_rtl() function.

@NiLuJe
Member

NiLuJe commented Sep 19, 2019

Would something like raqm help for the UI side of things?

@poire-z
Contributor

poire-z commented Sep 19, 2019

I had a look at raqm again, and even if I like its simplicity, I don't think we'd be able to use it as is. Had thoughts (because of its simplicity) about how much we would need to patch/extend it (and get involved in upstream development) - but there are quite a few things that we would need to tie to how our frontend does things, that relate to:

  1. line breaking: libraqm does all its stuff on a single line, and has bidi runs all along - while we would need the 2 steps that crengine does: measuring in logical order, then line breaking, then bidi visual ordering and reshaping of each line
  2. font fallback: libraqm allows providing, for each input char, the font to use for that char - while we would prefer auto reshaping of notdef segments with a provided set of fallback fonts (like I added to crengine - which has only 2, but nothing prevents having a chain of fallback fonts, like our frontend has).
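The order of operations in item 1 (measure in logical order, break lines, then reorder each line) might look like the sketch below. Everything in it is a simplification: `char_width` is a made-up fixed-width measure, and `to_visual` simply reverses a line as a stand-in for real bidi reordering of a purely RTL line. Neither is how crengine or libraqm do it.

```python
from typing import Callable, List


def char_width(ch: str) -> int:
    """Hypothetical measure: every character is 10 units wide."""
    return 10


def break_lines(text: str, max_width: int,
                measure: Callable[[str], int] = char_width) -> List[str]:
    """Greedy word wrap, done on the logical-order string."""
    lines: List[str] = []
    current = ""
    for word in text.split(" "):
        candidate = word if not current else current + " " + word
        if current and sum(measure(c) for c in candidate) > max_width:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


def to_visual(line: str) -> str:
    """Stand-in for bidi reordering: assumes the whole line is RTL."""
    return line[::-1]


# Uppercase stands for RTL letters, as in the examples earlier in the thread.
logical = "DO YOU WANT TO DELETE"
logical_lines = break_lines(logical, max_width=100)
visual_lines = [to_visual(line) for line in logical_lines]
```

Breaking before reordering matters: reversing the whole paragraph first and then wrapping would put the end of the sentence on the first line.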

(One thing I noticed in libraqm and that I missed/forgot in the crengine bidi stuff is that we should split measureText() segments on "unicode script" change, and not only on direction change, so mixed hebrew and arabic in a single text node are split into different segments - while they are currently only one and harfbuzz uses the script of the first kind of text it meets.)
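Splitting on script change could be sketched like this. Python's standard library does not expose the Unicode Script property, so the `script_of` helper below approximates it from the first word of the character name. That is an assumption for illustration only; real code would query a proper script database, for example through HarfBuzz's `hb_unicode_script()`.

```python
import unicodedata
from typing import List, Tuple


def script_of(ch: str) -> str:
    """Rough script guess from the Unicode character name (approximation)."""
    if not ch.isalpha():
        return "COMMON"  # spaces, digits, punctuation, marks: no script here
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]


def split_by_script(text: str) -> List[Tuple[str, str]]:
    """Split text into (script, run) pairs; COMMON chars join the current run."""
    runs: List[Tuple[str, str]] = []
    for ch in text:
        script = script_of(ch)
        if runs and (script == "COMMON" or script == runs[-1][0]):
            runs[-1] = (runs[-1][0], runs[-1][1] + ch)
        else:
            runs.append((script, ch))
    return runs


# One Hebrew word, one Arabic word, then Latin letters.
sample = "\u05e9\u05dc\u05d5\u05dd \u0633\u0644\u0627\u0645 abc"
runs = split_by_script(sample)
```

Both Hebrew and Arabic are right-to-left, so a split on direction alone would keep the first two words in one segment; splitting on script gives the shaper one script per run.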

So, although there are wishes for these features on libraqm's github, and many closed and unmerged PRs that tried at them (and none that went with using libunibreak), it might be easier and quicker to just do something a bit similar to how it is/what it does, but really just targeted at what we need.
And what we need is all in textboxwidget :| which is huge, but contemplating it, it's quite interesting to find a way to have most of that delegated to a C library.

Mainly, the C code would be fed the Lua text string, and we would get back a ShapedText object, with enough methods to get the cumulative width, send current line wished width, get the slice that fit in, set slices as lines, shape a line, get the glyphs (face, glyph index and positions) for that line back, so our Lua side can cache and blit the glyphs from that.
All the data structures would be malloc'ed from C and kept as long as the ShapedText object is not gc'ed, so I expect we would gain a lot on the memory usage (and Lua gc) issues we have.

Anyway, could be fun: much of the tough harfbuzz (clusters/glyph/chars walking) and fribidi stuff is already done and could be copied&pasted from crengine - but I would miss the help of all the lvArray/lvString/lv*... helpers that crengine provides (*).

Not sure yet how to start on that... Lua library (like cre.cpp), or some plain C lib wrapped by ffi?
I'd like to avoid having to do too much (so no Freetype drawing, no Glyph caching, no blitting, no font management), mainly because I'd feel naked out of crengine LVworld :) All that would still be done nearly as it already is by our frontend code. So, we'd just have a text shaper/layout C lib (just like libraqm is...) helper that would depend on harfbuzz/fribidi/libunibreak.

(*) just wondering, if I would need some array/hash/collection facilities, what's the most easily available? Depending on GLib would suck I think - and I guess all the std:: stuff requires libstdc++.so, that it seems none of our thirdparty is using (!). So, what else?

@NiLuJe
Member

NiLuJe commented Sep 19, 2019

We do pull glib for sdcv already, so you could theoretically use it.

(I don't recall, we might currently be building it static, but that can easily be corrected).

The std is indeed C++, not C ;).

@Frenzie
Member

Frenzie commented Sep 19, 2019

@poire-z

libstdc++.so, that it seems none of our thirdparty is using

djvulibre and k2pdfopt are the most important users; there's some other stuff that uses it too.

https://github.com/koreader/koreader-base/blob/b9d95c73718fde01f02b4b9200fc36ea8ddc8e9c/Makefile.defs#L262-L272

koreader/koreader-base@dd4d31c

@yparitcher
Member

is it readable/perfect?

the render is readable

So, not perfect ? :) anything missing to make it perfect?

i just don't like that font, nothing on your end.

however after using hebrew for a few days i noticed harfbuzz can be aggressive on the kerning, bringing letters too close together; they usually don't touch but are too close for comfortable reading

Is there any way i can tweak how aggressively harfbuzz does kerning, in the setting or code and i can test what is the best value for hebrew kerning?

a screenshot with tight kerning highlighted. (they are the same, one with highlights)

Reader_2019-Sep-22_162128
Reader_2019-Sep-22_162128

@poire-z
Contributor

poire-z commented Sep 22, 2019

Is there any way i can tweak how aggressively harfbuzz does kerning, in the setting or code and i can test what is the best value for hebrew kerning?

I don't think there are any tweaks about that, except an on/off toggle by enabling kerning or not. You would need to recompile crengine by setting -kern here:
https://github.com/koreader/crengine/blob/fe6efab20a759df26e2823d55be4b56ec3ad879a/crengine/src/lvfntman.cpp#L1038-L1041

Because I think Harfbuzz makes no decision: it just follows the instructions the font creator has put in his font. We can just tell Harfbuzz to follow or not these instructions.

You could try to hack your preferred font with FontForge and tweak/kill the kerning table (I really know nothing about how that works...). Or try another font, to see if that aggressive kerning happens with it too.
(Might be some bug on our side though, dunno).

@yparitcher
Member

Is there any way i can tweak how aggressively harfbuzz does kerning, in the setting or code and i can test what is the best value for hebrew kerning?

I don't think there are any tweaks about that, except an on/off toggle by enabling kerning or not. You would need to recompile crengine by setting -kern here:
https://github.com/koreader/crengine/blob/fe6efab20a759df26e2823d55be4b56ec3ad879a/crengine/src/lvfntman.cpp#L1038-L1041

did not help.

Because I think Harfbuzz makes no decision: it just follows the instructions the font creator has put in his font. We can just tell Harfbuzz to follow or not these instructions.

You could try to hack your preferred font with FontForge and tweak/kill the kerning table (I really know nothing about how that works...). Or try another font, to see if that aggressive kerning happens with it too.
(Might be some bug on our side though, dunno).

i will try another font later, but i don't think it is only the font.

when using the stock kindle reader with the font hack (for details what it is doing ask @NiLuJe ) i get very nice hebrew:
screenshot_2019_09_22T17_08_28-0401

compared to koreader with auto hinting, i get the cramped letters:
Reader_2019-Sep-22_173130

and koreader with native hinting:
Reader_2019-Sep-22_173145

however the stock kindle reader without the font hack totally fails with the diacritics (i don't have a screenshot)

@NiLuJe
Member

NiLuJe commented Sep 23, 2019

I assume that's on a KF8, not a KFX, right?

Checking with the exact same font would be a helpful comparison, because, yeah, IIRC, the KF8 renderer uses pango/cairo, but I don't completely recall how well it actually honors kerning/ligatures.

(i.e., I don't read the language, but it almost looks unkerned...).

EDIT: Oh. Also, try with no hinting, instead of auto/native ;).

@yparitcher
Member

I assume that's on a KF8, not a KFX, right?

yes

Checking with the exact same font would be a helpful comparison, because, yeah, IIRC, the KF8 renderer uses pango/cairo, but I don't completely recall how well it actually honors kerning/ligatures.

(i.e., I don't read the language, but it almost looks unkerned...).

i am using the same font for both (SBL_hebrew), i agree the kindle one looks unkerned, however turning off kerning with harfbuzz does not solve the problem in koreader

EDIT: Oh. Also, try with no hinting, instead of auto/native ;).

no hinting does the same for the spacing

harfbuzz has this issue (both regular and light), freetype and no kerning have good spacing but the diacritics are off

@yparitcher
Member

@poire-z is there a way to tweak letter_spacing, where is it controlled?

@poire-z
Contributor

poire-z commented Sep 23, 2019

i will try another font later, but i don't think it is only the font.

So, what's the result with other fonts?

is there a way to tweak letter_spacing, where is it controlled?

Via CSS. Just create a style tweak containing body { letter-spacing: 1px; }

You would need to recompile crengine by setting -kern here

You can try to disable other opentype features, see some examples in the next section used for good/harfbuzz light at https://github.com/koreader/crengine/blob/fe6efab20a759df26e2823d55be4b56ec3ad879a/crengine/src/lvfntman.cpp#L1043-L1074
and if it changes something, try to pinpoint the exact feature that needs to be disabled.

Also, does it happen with the same letters when there are no diacritics around?
Can you share a HTML snippet with just a few words where this happens?
(With such a snippet, you can enable HB debugging here: https://github.com/koreader/crengine/blob/fe6efab20a759df26e2823d55be4b56ec3ad879a/crengine/src/lvfntman.cpp#L27-L29 and see some x/y that may show something, or not...)

Also, can you check how that snippet would do with other renderers that use harfbuzz? I think Chrome/Chromium use harfbuzz.

@poire-z
Contributor

poire-z commented Sep 23, 2019

@hgrain86 (and others reading this, that read arabic): could you comment about the arabic screenshots above? Are they ok, or do you notice some letter spacing issues too?

@yparitcher
Member

i will try another font later, but i don't think it is only the font.

So, what's the result with other fonts?

similar, some fonts more than others, some fonts naturally have more spacing so it is less of an issue.

is there a way to tweak letter_spacing, where is it controlled?

Via CSS. Just create a style tweak containing body { letter-spacing: 1px; }

thanks, it helps alleviate the symptoms but on the other hand some letters get too spaced out, so not a real solution

You would need to recompile crengine by setting -kern here

You can try to disable other opentype features, see some examples in the next section used for good/harfbuzz light at https://github.com/koreader/crengine/blob/fe6efab20a759df26e2823d55be4b56ec3ad879a/crengine/src/lvfntman.cpp#L1043-L1074
and if it changes something, try to pinpoint the exact feature that needs to be disabled.

i tried that, but it happens with harfbuzz light also.
it is not an issue with the features, rather with harfbuzz in general.

Also, does it happen with the same letters when there are no diacritics around?

yes

Can you share a HTML snippet with just a few words where this happens?

attached

<html>
<body>
<p>ני לששל יש אמ וז</p>
</body>
</html>

(With such a snippet, you can enable HB debugging here: https://github.com/koreader/crengine/blob/fe6efab20a759df26e2823d55be4b56ec3ad879a/crengine/src/lvfntman.cpp#L27-L29 and see some x/y that may show something, or not...)

i did that and did not see anything helpful, also that only works with harfbuzz so does not help me for comparing to freetype.

Also, can you check how that snippet would do with other renderers that use harfbuzz? I think Chrome/Chromium use harfbuzz.

firefox:
1

libreoffice:
Untitled 1

the block style fonts (noto) don't kern as much but are not very nice to read; they do not capture the full letter shapes.

the end result so far: harfbuzz has this issue (both regular and light), freetype and no kerning have better spacing but the diacritics are off

@Frenzie
Member

Frenzie commented Sep 24, 2019

thanks, it helps alleviate the symptoms but on the other hand some letters get too spaced out, so not a real solution

Not sure if crengine supports it properly, but on higher DPI displays you'll need a value like .5px or less for one physical pixel.
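As a rough calculation, assuming the renderer treats a CSS `px` as 1/96 inch and scales it by the screen DPI (an assumption here, not something confirmed above for crengine), the CSS value that corresponds to one physical pixel is 96 / DPI:

```python
CSS_REFERENCE_DPI = 96  # CSS defines 1px as 1/96 of an inch


def css_px_for_device_pixels(device_pixels: float, screen_dpi: float) -> float:
    """CSS px value that maps to the given number of physical pixels."""
    return device_pixels * CSS_REFERENCE_DPI / screen_dpi


one_pixel_at_192dpi = css_px_for_device_pixels(1, 192)
one_pixel_at_300dpi = css_px_for_device_pixels(1, 300)
```

Under that assumption a 192 dpi screen needs 0.5px for one physical pixel and a 300 dpi screen needs about 0.32px, which is in line with the ".5px or less" suggestion.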

@poire-z
Contributor

poire-z commented Jan 1, 2022

An issue noticed with Arabic (which probably also happens with other complex scripts like Indic ones) when our text widgets want to truncate a string so that it fits in a specific fixed width:

The way we do it is: if the text does not fit in width W, get the substring that fits in W - ellipsis_width, and append that ellipsis character to that substring, and reshape the substring. This seems like an easy way to handle the situation, and works well with latin text and hebrew text:
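The truncation approach described in the previous paragraph, as a minimal sketch. `text_width` is a hypothetical fixed-width measure standing in for real shaping, so the sketch only holds for scripts where a letter's width does not depend on its neighbours.

```python
ELLIPSIS = "\u2026"


def text_width(text: str) -> int:
    """Hypothetical measure: 10 units per character, no shaping."""
    return 10 * len(text)


def truncate(text: str, max_width: int) -> str:
    """Return text unchanged if it fits, else a fitting prefix plus an ellipsis."""
    if text_width(text) <= max_width:
        return text
    # Reserve room for the ellipsis, then keep as many characters as fit.
    budget = max_width - text_width(ELLIPSIS)
    kept = ""
    for ch in text:
        if text_width(kept + ch) > budget:
            break
        kept += ch
    return kept + ELLIPSIS


short_name = truncate("book.epub", 100)
long_name = truncate("a very long book title.epub", 100)
```

The width check is done on the prefix before the ellipsis is appended, so the result only stays within `max_width` if appending the ellipsis does not change how the prefix is shaped.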
image

But with Arabic, where letters have different shapes depending on whether they are at the start, at the end, in the middle of a word, or standalone, if we truncate a word, the letter that used to be "medial" (with usually a small shape) is now "final" (with usually a longer shape) - or an "initial" letter could become "isolated" - and we may overflow the width:

image

Here are some of the above truncated words, untruncated:
image

I have no obvious idea how to solve this issue. (Looping with one char less until it fits would not solve the medial/final incorrect form issue.)

But I have some questions for Arabic/Persian readers, pinging @WaseemAlkurdi @Zeyadas @Monirzadeh (I would also like some answers to these questions from Indic scripts readers):

A) is it ok to truncate an arabic word, like we can do in English? or would you prefer to not see any part of the word at all (so, possibly not seeing anything but "...") ?

B) when truncating such a word, do you use the generic ellipsis, or is there some other specific indicator to mark that there is a truncation?

C) if truncating (with or without an ellipsis), do you expect to see the last character (originally in the middle of the word) in its "final form", or in its original "medial form" (as a more obvious way to see it's not right, and so it is probably truncated). And if seeing it in its medial form, is that enough to indicate it is truncated, or is an ellipsis after it needed/better?

D) any other idea on how this kind of problem is generally solved with Arabic? Any special Unicode character we could insert before the ellipsis that would magically make it work ? :)
There is https://en.wikipedia.org/wiki/Zero-width_joiner, but I'm not sure how it could help solving this issue... Replacing the ellipsis with this ZWJ would give this (with some other strange miscentering issue):
image

@dov

dov commented Jan 1, 2022

I'm not at all familiar with Arabic, but it seems to me that the following logic should work:

  1. If a whole word is truncated, insert ellipses without ZWJ
  2. If part of a word is truncated, insert ZWJ+ellipses

The functioning of the ZWJ is to be a place holder to force the preceding character to be in medial form, which is exactly what you need. If truncating a word is ok for latin scripts, I see no reason why it wouldn't be reasonable for a script using arabic glyphs.

This strategy would also have the advantage that you will not get any overflow caused by a change in character form.
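
The two rules above can be sketched in a few lines. This is only an illustration, not KOReader's code: `truncate` is a hypothetical helper, it cuts by code point rather than by shaped cluster, and it assumes a "word" is simply a run of non-space characters.

```python
ZWJ = "\u200d"       # ZERO WIDTH JOINER
ELLIPSIS = "\u2026"  # HORIZONTAL ELLIPSIS


def truncate(text: str, keep: int) -> str:
    """Keep the first `keep` code points of `text` and mark the cut.

    Hypothetical helper: a "word" is assumed to be a run of
    non-space characters, which is a simplification.
    """
    if keep >= len(text):
        return text  # nothing to truncate
    kept = text[:max(keep, 0)]
    # The cut falls inside a word when the characters on both sides
    # of it are non-space characters.
    inside_word = (
        bool(kept)
        and not kept[-1].isspace()
        and not text[len(kept)].isspace()
    )
    if inside_word:
        # Rule 2: part of a word is cut, so keep the joining form.
        return kept + ZWJ + ELLIPSIS
    # Rule 1: the cut is at a word boundary, a plain ellipsis is enough.
    return kept.rstrip() + ELLIPSIS
```

Since the ZWJ has no visible width of its own, the only width that changes is whatever the font does to the last kept letter.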

@poire-z
Contributor

poire-z commented Jan 1, 2022

Hello! :)

If a whole word is truncated / If part of a word is truncated

This would mean having to do something I didn't need until now: detecting words/punctuations, categorizing chars (which feels quite painful and full of special cases :)
Currently, I just need to use FriBidi and Harfbuzz hb_unicode_script() to cut segments, that I throw at harfbuzz, and then work on HB clusters as they come - and so truncate at clusters boundaries.
But maybe HB's "unsafe to break" flag could just be enough? I guess that if it is set, it means the glyph after is probably a letter that composes with the previous letter, which might be enough to tell me we are "inside a word"?

HarfBuzz also flags glyphs as UNSAFE_TO_BREAK if breaking the string at that glyph (e.g., in a line-breaking or hyphenation process) would require re-shaping the text.

Also (I may be wrong), it feels that even medial-form glyphs can differ depending on their neighbours and the font's OpenType rules; some glyphs can be merged, and if I truncate into that, I won't get the same glyphs, hence different widths and a risk of overflow. Unless such morphed glyphs all end up in a single cluster, in which case, by working with clusters, I'm just fine hacking around the compound glyphs.

I would also need to take care, when substituting 2 chars, not to overflow my buffers if I'm just truncating the last char :/
And think about the bidi levels hacks I do at https://github.com/koreader/koreader-base/blob/master/xtext.cpp#L1197-L1234 to that ellipsis, that I would need to also do (or not) on the ZWJ (I have forgotten everything I learned about FriBidi 2 years ago :/)
Can you foresee any issue, as far as BiDi re-ordering is concerned, if I add a ZWJ + ellipsis after a cluster that is "unsafe to break", and give them the same BiDi level as the cluster we keep?

I'll give all this a try :)

@poire-z
Contributor

poire-z commented Jan 1, 2022

Ok, a quick try with the unsafe_to_break flag seems to work ok !
For the sake of testing and showing differences, in the screenshots below, I use the regular ellipsis with 3 dots when I break on an unsafe_to_break flag (so, expecting it marks that the chars we truncate in between are in the same word).
And I use an ellipsis with 2 dots only when I don't meet this flag.
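
A minimal sketch of that test setup, for illustration only. The `Cluster` type, the `truncation_marker` helper and the flag values in `SAMPLE` are assumptions written by hand (set inside each cursively linked part of فيزياء); in the real code the flag would come from HarfBuzz's glyph flags, not from a table like this.

```python
from dataclasses import dataclass

ZWJ = "\u200d"  # ZERO WIDTH JOINER


@dataclass
class Cluster:
    text: str
    # Stand-in for HarfBuzz's "unsafe to break" flag: True when
    # breaking *before* this cluster would require re-shaping.
    unsafe_to_break: bool


def truncation_marker(clusters, keep):
    """Marker to append after the first `keep` clusters."""
    if keep >= len(clusters):
        return ""  # nothing was dropped
    if clusters[keep].unsafe_to_break:
        # Cutting inside a shaped run: ZWJ + 3 dots.
        return ZWJ + "..."
    # Cutting at a safe place: 2 dots, only to tell the cases apart.
    return ".."


# Hand-assigned flags for the sample word (an assumption, not HarfBuzz output).
SAMPLE = [
    Cluster("\u0641", False),  # FEH
    Cluster("\u064a", True),   # YEH
    Cluster("\u0632", True),   # ZAIN
    Cluster("\u064a", False),  # YEH
    Cluster("\u0627", True),   # ALEF
    Cluster("\u0621", False),  # HAMZA
]
```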

Various truncations (random, by having various widths for the chapters rectangles):

In Arabic, a mix of 2-dot and 3-dot ellipses, but mostly 3-dot (still a feeling of mis-centering with the 3 dots, which should be centered correctly...):

image

image

image

Do these look ok/better than the screenshots in previous post ? Is this the right thing to do?

In Hebrew, only 2 dots ellipsis:
image

In English, mostly 2 dots, but we can get 3 dots ones (!):
image

The truncated words in green are "Foreword" and "Turkish", so maybe Harfbuzz marked these chars as "unsafe to break" because some kerning is involved. If that's what is really happening, we could maybe still overflow, as I guess "letter1 + kerning-negative-advance | + ellipsis" may have fit, but now "letter1 (+ 0 kerning I guess with ZWJ) + ZWJ + ellipsis" might be larger. But I guess this is a really minor overflow that we could ignore :)

@poire-z
Contributor

poire-z commented Jan 2, 2022

Although it feels this ZWJ is just a hack that would have some benefit with Arabic only - and could possibly be worse with other scripts.
An alternative that feels safer would be, if the truncation place is unsafe to break, to rewind until we find a place that is safe to break: it would make the text shorter (maybe a lot shorter with Arabic), but might be saner.
We could get (trying to get the same setup as the screenshots above, marking the differences in yellow):

image
(^ The 2.2. chapter has lost quite a few glyphs!).

image

image

image
(The "Cha..." is for "Character Entities", which appeared as "Char..." previously.)
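
The rewind alternative described above can be sketched like this. It is an illustration under assumptions: clusters are `(text, unsafe_before)` pairs with hand-written flags, and `truncate_at_safe_break` is a hypothetical name, not a function from xtext.cpp.

```python
def truncate_at_safe_break(clusters, keep, ellipsis="..."):
    """Keep at most `keep` clusters, rewinding to a safe break point.

    `clusters` is a list of (text, unsafe_before) pairs, where
    unsafe_before is True when breaking before that cluster would
    change the shaping of what precedes it.
    """
    k = max(0, min(keep, len(clusters)))
    if k == len(clusters):
        return "".join(text for text, _ in clusters)  # fits entirely
    # Walk back while the first dropped cluster is unsafe to break before.
    while k > 0 and clusters[k][1]:
        k -= 1
    return "".join(text for text, _ in clusters[:k]) + ellipsis


# Hand-assigned flags for فيزياء (an assumption, not HarfBuzz output).
SAMPLE = [
    ("\u0641", False), ("\u064a", True), ("\u0632", True),
    ("\u064a", False), ("\u0627", True), ("\u0621", False),
]
```

With these flags, asking for 4 clusters rewinds to 3, and asking for 1 rewinds to nothing but the ellipsis, which is the "a lot shorter with Arabic" cost mentioned above.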

I'll wait for some Arabic reader to say which set of screenshots looks better/saner.

@Zeyadas

Zeyadas commented Jan 2, 2022

A) is it ok to truncate an arabic word, like we can do in English? or would you prefer to not see any part of the word at all (so, possibly not seeing anything but "...") ?

I think that's OK to truncate Arabic words, we do that sometimes.

B) when truncating such a word, do you use the generic ellipsis, or are there some other specific indicator to mark that there is a truncation?

We use dots for that to indicate that the word is missing some letters.

C) if truncating (with or without an ellipsis), do you expect to see the last character (originally in the middle of the word) in its "final form", or in its original "medial form" (as a more obvious way to see it's not right, and so it is probably truncated). And if seeing it in its medial form, is that enough to indicate it is truncated, or is an ellipsis after it needed/better?

When truncating I would like to see the medial form but if you can add this (ـ) after the letters.
For example: I'll use physics in Arabic
The word is فيزياء
If you want to truncate it, ...فيزي
Or when using the (ـ) it's فيزيـ
And I think I answered the D) question here.

D) any other idea on how this kind of problem is generally solved with Arabic? Any special Unicode character we could insert before the ellipsis that would magically make it work ? :) There is https://en.wikipedia.org/wiki/Zero-width_joiner, but I'm not sure how it could help solving this issue... Replacing the ellipsis with this ZWJ would give this (with some other strange miscentering issue): image

@Zeyadas

Zeyadas commented Jan 2, 2022

And by the way, the (D) option, I mean the ZWJ, seems fine to me.

@poire-z
Contributor

poire-z commented Jan 2, 2022

Thanks for the feedback.

When truncating I would like to see the medial form but if you can add this (ـ) after the letters.

This char is:
https://www.compart.com/en/unicode/U+0640
https://en.wikipedia.org/wiki/Kashida or tatweel or taṭwīl.
Wikipedia mentions it's used for elongating and help with justification, nothing about marking truncation.
Anyway, I'm not sure I want to go that deep in typography, I would need in some generic low level code to detect the Arabic script and use it instead of the ellipsis - and the widths computations (done in parts in some higher level code) could vary a lot depending on how the font shapes it with the characters before :/ So, it might look nice, but it's a lot less straightforward. So...:

E) are you/Arabic readers still ok seeing a western ellipsis after an arabic word to indicate it is truncated ? :)

Anyway, I had a try by just hardcoding using this U+0640 instead of our ellipsis, and this would give:

image

image

image

I can even less appreciate how this looks :) Does adding this U+0640 somehow make sure the previous letter keeps its medial form?
Things get strange and shuffled when it's added after a number, or after a space following a number :/

Btw, playing with your word at removing chars at end:

The word is فيزياء (original)
The word is فيزيا (no change in shape)
The word is فيزي (bad, change, last char in its final form I guess)
The word is فيز (no change in shape)
The word is في (bad, change, last char in its final form I guess)

This word looks (to me :) like it's 3 words, 3 cursive parts not linked to each other - which I think makes 3 "clusters" for text shaping (or at least, 3 parts with "unsafe-to-break" inside each, but not between them).
From the experiment of removing chars at the end, it looks like the ending form of a cluster does not change when I shorten and remove the cluster after it. Does this mean that initial/medial/final/isolated forms apply to a cursive-linked subpart and not really to the word as a whole?
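
For illustration, the subparts can be derived from Unicode joining types. A minimal sketch, assuming only the letters of this sample word matter: the table below is hand-copied for those five letters (a real implementation would read ArabicShaping.txt from the Unicode Character Database), and marks and the other joining classes are ignored. `linked_subparts` is a hypothetical name.

```python
# Joining types for the letters of the sample word only.
JOINING_TYPE = {
    "\u0641": "D",  # FEH: dual-joining (joins on both sides)
    "\u064a": "D",  # YEH: dual-joining
    "\u0632": "R",  # ZAIN: right-joining (joins to the previous letter only)
    "\u0627": "R",  # ALEF: right-joining
    "\u0621": "U",  # HAMZA: non-joining
}


def joins_forward(ch):
    """True if `ch` can connect to the letter that follows it."""
    return JOINING_TYPE.get(ch, "U") == "D"


def joins_backward(ch):
    """True if `ch` can connect to the letter that precedes it."""
    return JOINING_TYPE.get(ch, "U") in ("D", "R")


def linked_subparts(word):
    """Split `word` into runs of letters that are cursively connected."""
    parts = []
    current = ""
    for ch in word:
        if current and joins_forward(current[-1]) and joins_backward(ch):
            current += ch
        else:
            if current:
                parts.append(current)
            current = ch
    if current:
        parts.append(current)
    return parts


WORD = "\u0641\u064a\u0632\u064a\u0627\u0621"  # فيزياء
```

On the sample word this gives the same three parts as the visual guess, and the positional form of a letter is decided by its neighbours inside one part, which would explain why removing a later part leaves the earlier ones unchanged.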

My experiment at #5359 (comment) was to truncate only at cursive-linked-subpart boundaries, and so it would simplify things in that the subparts form are assured to not change at all.
Another test, with this ^ and using U+0640 instead of the ellipsis only when we shortened it to a subpart boundary, so I guess the U+0640 is standalone with a fixed width and does not compose with the previous char? This would feel a bit safer as far as widths are concerned, but it would still need to detect that we are in some Arabic script parts - and I'm not sure it's any better :)
image

Btw, still curious about Arabic :) here is your word, and then with added spaces between the subparts:
فيزياء
فيز يا ء
Do you intuitively read the 2nd one as the same single word? Or does the spacing make you read it as 3 words, and so possibly meaning nothing? :)
Or does your mind work just like ours when reading:
misdirection
mis direct ion

@Zeyadas

Zeyadas commented Jan 2, 2022

This char is:
https://www.compart.com/en/unicode/U+0640
https://en.wikipedia.org/wiki/Kashida or tatweel or taṭwīl.

That's exactly what it's for, it's for making the words a little (longer) for text justification.

nothing about marking truncation.

Right, but if I/we (I mean Arabic reader) see it in an incomplete word we know that it's truncated.

E) are you/Arabic readers still ok seeing a western ellipsis after an arabic word to indicate it is truncated ? :)

Yes, it's totally understandable.

unsafe-to-break

Yes, an Arabic word can't be broken. I don't know how to describe it, but it's written connected: the letters (not all, but most of them) can't be written separately. You can't add spaces between letters.

Btw, still curious about Arabic :) here is your word, and then with added spaces between the subparts:
فيزياء
فيز يا ء
Do you intuitively read the 2nd one as the same single word? Or does the spacing make you read it as 3 words, and so possibly meaning nothing? :)
Or does your mind work just like ours when reading:
misdirection
mis direct ion

No, now it's not one word, it's three words, and some of them could be read as something else, but surprisingly we can read it. But it's not comfortable.

@poire-z
Contributor

poire-z commented Jan 13, 2022

Another thing I just noticed... In RTL text, italic (at least, fake italic because I don't have a NotoSomethingArabic-Italic, dunno if there exists any) is rendered oblique in the LTR direction...
Here, just having all links made italic:

image

With Hebrew:

image

A) is italic/oblique used with Arabic and Hebrew ? Or never ?
B) how does it feel reading it as in the screenshots above ?
C) should it be obliqued in the other direction ? Or is this just fine (!?)

Looks like Firefox doesn't do it differently:
image

image

@Frenzie
Member

Frenzie commented Jan 14, 2022

Firefox uses more or less the same combination of libraries I think (or in any case FriBiDi and HarfBuzz). I guess it can't be too wrong, because it'd be hard to miss.

@Zeyadas

Zeyadas commented Jan 15, 2022

OK. I'll try my best here so..

A) is italic/oblique used with Arabic and Hebrew ? Or never ?

Yes for Arabic, italic fonts are used in Arabic texts, but sometimes the italic is not clear, depending on the font type.
And I don't know about Hebrew language.

B) how does it feel reading it as in the screenshots above ?

In your screenshot above, the (underlined) words are in italic, and personally I hate the font, but I think it's a standard font for writing in Arabic websites and nobody cares about changing it.

C) should it be obliqued in the other direction ? Or is this just fine (!?)

No, it's just fine.

@WaseemAlkurdi
Contributor

@poire-z Hello, just noticed the activity on this issue!
I think @Zeyadas did a great job here, and I'll supplement this a bit:
A:
Italics are used, but only rarely - nowhere near as common as in Latin-script languages. A lot of people find oblique in Arabic hard to read. That being said though, there could be someone out there who has documents that use them, and so, obliques would be helpful.

@poire-z
Contributor

poire-z commented Jan 15, 2022

@WaseemAlkurdi : thanks for popping in.
Do you also agree that the italic/oblique slant goes in the same direction as for LTR? For us, the tops of the italic letters lean forward. For you RTL readers, if using the same italic slant as us (as in the screenshots above), the tops of the italic letters would lean backward! I'm a bit surprised, just want a 2nd confirmation that this is not an issue :)

I would also like a 2nd opinion on the things above (starting at #5359 (comment)) about truncation, ellipsis, and using https://en.wikipedia.org/wiki/Kashida (tatweel / taṭwīl) instead of an ellipsis with Arabic (and, if you know, is it valid for Persian/Urdu too?)

@dov

dov commented Jan 15, 2022 via email

@poire-z
Contributor

poire-z commented Jan 15, 2022

@dov: that would indeed be left to the font (didn't even think about that :), if we were shipping Arabic and Hebrew fonts with styles other than regular (Hebrew uses our FreeSerif, which ships only in regular). So, when there is no italic font, we do fake italic by applying some transformation with FreeType.
So, in our case, if we don't fake-italicize in the right direction, it's our rendering engine issue :)
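
To make the direction question concrete: a synthetic oblique is typically a horizontal shear of the glyph outline. The sketch below is plain arithmetic, not FreeType's API; `shear_point`, the `slant` value and the `mirror` switch are all hypothetical, and whether mirroring is desirable for RTL text is exactly the open question here.

```python
def shear_point(x, y, slant=0.25, mirror=False):
    """Apply a horizontal shear: x' = x + s * y, with y unchanged.

    Points on the baseline (y == 0) do not move; points above it
    shift right for a positive slant, or left when mirrored.
    """
    s = -slant if mirror else slant
    return (x + s * y, y)
```

With the usual slant, a point at the top of a letter moves to the right, so the letter leans in the LTR reading direction; flipping the sign would make it lean in the RTL reading direction instead.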

I assume Firefox left it to the font (dunno which it picks), but just asking: what's the most commonly expected direction (or, how do real Hebrew fonts do it) for content wrapped in a html <i> or <em>?

@dov

dov commented Jan 15, 2022 via email

@poire-z
Contributor

poire-z commented Apr 21, 2022

@uroybd : can you read through the last posts above, starting at #5359 (comment)

I still have this on my todo/tothinkabout list, the issue of truncating arabic cursive words - and I wonder how it is with Bengali (and any other Indic script you may know about) which is cursive too, and may have different forms/widths depending on how we truncate/stop drawing, and if an ellipsis is OK or if you have other ways to hint that text is truncated. Or if what we currently have is all just fine with Bengali :)
So, if you can answer my A B C D E questions (and maybe the ones about italic, although you go LTR as English, so the slant shouldn't bother you), that would be great.

@uroybd
Contributor

uroybd commented Apr 21, 2022

@poire-z

Bengali is a relatively easy-going language.

A. It is okay.
B. Yes, generic ellipsis.
C. Just truncating with ellipsis is fine.
D. No magic is needed.
E. Yes, we are okay. No post-colonial grudge regarding this. ;)

While normal truncating is just fine for Bengali we prefer some visual grouping. For example, if you truncate শুভ্র we will prefer শুভ্র even by letter count it should be শুভ. They take the same amount of space.

Question is, how you should define the boundary? Here's a regex to do just that in JS which you can port if you wish to:

var bengaliRegex = /(র্){0,1}([অ-হড়-য়](?:্[অ-মশ-হড়-য়])*)((‍){0,1}(্[য-ল])){0,1}([া-ৌ]){0,1}|[্ঁঃংৎ০-৯]/g;

The full match of this regex will be one such unit.

@poire-z
Contributor

poire-z commented Apr 21, 2022

Thanks. So, what we have is rather good.

We don't really do any kind of script specific check/processing ourselves. We trust harfbuzz for the brain put into it about scripts and doing the right things.
https://github.com/harfbuzz/harfbuzz/issues?q=bengali

So, we give it unicode chars, it gives us widths of clusters/grapheme/glyphs, putting combined chars into a single cluster. So, currently, we'll keep a whole cluster (with as many followup combining chars it has) if it fits, or drop it all and display only the previous cluster.
I guess in your sample শুভ্র , this looks like 2 clusters to me :) so either both complete, or only the first one complete.
Harfbuzz has another notion of "unsafe to break", if for example, the look/width of the first cluster could be different if the 2nd cluster was different or absent - that we could use instead to drop both clusters (or re-measure with the first only) if it would make things better/saner (but shorter).
(Or we could loop the measuring, each time with a single Unicode char removed, and see what HarfBuzz manages to do differently with truncated logical input.)
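
The "keep a whole cluster if it fits, or drop it all" behaviour described above can be sketched as follows. This is an illustration only: the cluster split of the sample word and all widths are made-up values, and `fit_clusters` is a hypothetical helper rather than the actual xtext.cpp logic.

```python
def fit_clusters(clusters, max_width, ellipsis_width):
    """Return (kept_text, truncated) for a list of (text, width) clusters.

    A cluster is either kept whole or dropped whole; when anything is
    dropped, room is reserved for the ellipsis.
    """
    total = sum(width for _, width in clusters)
    if total <= max_width:
        return "".join(text for text, _ in clusters), False
    budget = max_width - ellipsis_width
    kept = []
    used = 0
    for text, width in clusters:
        if used + width > budget:
            break  # this cluster and everything after it is dropped
        kept.append(text)
        used += width
    return "".join(kept), True


# Assumed split of the sample word into 2 clusters, with made-up widths.
SAMPLE = [("\u09b6\u09c1", 10), ("\u09ad\u09cd\u09b0", 12)]
```

Because the conjunct and its combining marks sit in one cluster, the truncated result never shows half of a visual group.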

@khaledhosny

khaledhosny commented May 8, 2022

But I have some questions for Arabic/Persian readers, pinging @WaseemAlkurdi @Zeyadas @Monirzadeh (I would also like some answers to these questions from Indic scripts readers):

A) is it ok to truncate an arabic word, like we can do in English? or would you prefer to not see any part of the word at all (so, possibly not seeing anything but "...") ?

It is equally bad in Arabic as it is in English (i.e. it is a last resort measure).

B) when truncating such a word, do you use the generic ellipsis, or are there some other specific indicator to mark that there is a truncation?

Generic ellipsis.

C) if truncating (with or without an ellipsis), do you expect to see the last character (originally in the middle of the word) in its "final form", or in its original "medial form" (as a more obvious way to see it's not right, and so it is probably truncated). And if seeing it in its medial form, is that enough to indicate it is truncated, or is an ellipsis after it needed/better?

For Arabic, the last letter should use the positional form it had before truncation (i.e. if it had an initial or medial form it should keep it, not become a final or isolated form, as this would give the misleading impression that the word is complete and not truncated). This also makes sure the truncated string is actually shorter, since changing the letter form can make the truncated string wider.

The simplest way to achieve this if you are going to reshape the string, is to add ZWJ before the ellipsis unless the string is truncated before a dual or right joining character (Unicode Character Database is to be consulted for this).
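
One way to read this rule is that the ZWJ is only needed when the last kept letter was actually joined to the first removed one. A minimal sketch under that reading, with assumptions spelled out: the joining types are hand-copied for the letters of فيزياء only (the real data lives in ArabicShaping.txt), marks are ignored, and `truncate_keeping_form` is a hypothetical name.

```python
ZWJ = "\u200d"       # ZERO WIDTH JOINER
ELLIPSIS = "\u2026"  # HORIZONTAL ELLIPSIS

# Joining types for the letters of the sample word only.
JOINING_TYPE = {
    "\u0641": "D",  # FEH
    "\u064a": "D",  # YEH
    "\u0632": "R",  # ZAIN
    "\u0627": "R",  # ALEF
    "\u0621": "U",  # HAMZA
}


def truncate_keeping_form(text, keep):
    """Cut after `keep` code points, adding ZWJ only where a join is cut."""
    if keep >= len(text):
        return text
    if keep <= 0:
        return ELLIPSIS
    last_kept = JOINING_TYPE.get(text[keep - 1], "U")
    first_dropped = JOINING_TYPE.get(text[keep], "U")
    # The two letters were connected only if the kept one joins forward
    # (dual-joining) and the dropped one joins backward (dual or right).
    was_joined = last_kept == "D" and first_dropped in ("D", "R")
    return text[:keep] + (ZWJ if was_joined else "") + ELLIPSIS


WORD = "\u0641\u064a\u0632\u064a\u0627\u0621"  # فيزياء
```

On the sample word this adds the ZWJ exactly at the two cut points that changed shape in the earlier experiment (after 2 and after 4 letters) and nowhere else.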

However, I think web browsers don’t reshape and simply truncate the glyph output and append ellipsis to it. This should be faster and also makes sure the glyph width remains the same, reshaping even when using ZWJ can give different width depending on the font (e.g. if it were using ligatures or different contextual alternates at the position of the truncation).

@khaledhosny

For you RTL, is using the same italic slant as us (as in the screenshots above), the top of the italic letters would run backward ! I'm a bit surprised, just want a 2nd confirmation that this is not an issue :)

See godotengine/godot#59029 (comment)

@poire-z
Contributor

poire-z commented Oct 22, 2022

Getting back to this issue of Arabic forms and truncation with ellipsis.

However, I think web browsers don’t reshape and simply truncate the glyph output and append ellipsis to it. This should be faster and also makes sure the glyph width remains the same

Thought about this, but it might be a tad more complicated when bidi is involved, and the truncation/ellipsis at end in logical order ends up being in the middle of the line (in visual order) after BiDi has reordered everything... It would not be as simple as "append", I would have to shove the right side to "insert/prepend" the ellipsis, so some more tricky dedicated code...

The simplest way to achieve this if you are going to reshape the string, is to add ZWJ before the ellipsis unless the string is truncated before a dual or right joining character (Unicode Character Database is to be consulted for this).

So, went with this.
Rather than having to do specific lookup of Arabic chars (or from other scripts, that I also know nothing about), I decided to "trust" Harfbuzz's "unsafe to break (before)" flag, assuming that if I see it set on the char I'm replacing, it's that the previous char (that I keep) was initial or medial form (and that if the flag is not set, it must be final or isolated).
I have no idea how this shortcut I'm taking can work or cause issues with other scripts :/

Showing Before | After of my unique test case with various truncation points:

image

image

image

Do the right parts feel less wrong than the left parts ?

reshaping even when using ZWJ can give different width depending on the font (e.g. if it were using ligatures or different contextual alternates at the position of the truncation).

I guess there's nothing complex in my sample text bits, but at least, on the screenshots above, there is no longer any overflow, so that's rather an improvement, if nothing else :)

Still feels a bit strange to get displayed ... 1.3 correctly, but 1... when 1.2 فيزياء gets truncated - I would expect the ellipsis also on the left (...1), but it's indeed the LTR 1.2 we are truncating.
How does that work for you Arabic readers? (Just asking, I probably won't do anything about it :/)

@Zeyadas

Zeyadas commented Oct 23, 2022

The right preview is better than the left one.


13 participants