Bidi isolates are reversed #837

Closed
mrphlip opened this issue Dec 19, 2018 · 29 comments

Comments

@mrphlip

mrphlip commented Dec 19, 2018

The Unicode BiDi isolate characters appear to be reversed in behaviour:
image
U+2066 LEFT-TO-RIGHT ISOLATE is making the text right-to-left
U+2067 RIGHT-TO-LEFT ISOLATE is making the text left-to-right

This is running on mintty 2.9.5 (x86_64-pc-cygwin)

@mrphlip
Author

mrphlip commented Dec 19, 2018

On further investigation, behaviour seems weird across the board...
image

My (admittedly limited) understanding is that since "DEF" is Latin characters it should be showing as left-to-right regardless in some of these configurations... I believe the right-to-left override is the only one that should actually make it display as right-to-left. But that's not particularly close to what's happening...

@BrianInglis

Assuming the charset is UTF-8 and the font(set) supports the bidi rendering, where does the behaviour differ from that expected per UAX#9 Unicode Standard Annex #9 Unicode Bidirectional Algorithm
http://unicode.org/reports/tr9/ with behaviour summarized in
https://www.w3.org/International/articles/inline-bidi-markup/uba-basics

@mrphlip
Author

mrphlip commented Dec 19, 2018

Like I said, I'm not an expert. But I'd think the ones where it's using letters from a left-to-right alphabet, inside a left-to-right embedding, and it's displaying right-to-left are probably different to how the algorithm is supposed to work?

If I paste the output into, say, a web browser, then only the "right-to-left override" one actually displays right-to-left, which is what I'd expect.

>>> print("ABC \u202A DEF \u202C GHI")  # left-to-right embedding
ABC ‪ DEF ‬ GHI
>>> print("ABC \u202B DEF \u202C GHI")  # right-to-left embedding
ABC ‫ DEF ‬ GHI
>>> print("ABC \u202D DEF \u202C GHI")  # left-to-right override
ABC ‭ DEF ‬ GHI
>>> print("ABC \u202E DEF \u202C GHI")  # right-to-left override
ABC ‮ DEF ‬ GHI
>>> print("ABC \u2066 DEF \u2069 GHI")  # left-to-right isolate
ABC ⁦ DEF ⁩ GHI
>>> print("ABC \u2067 DEF \u2069 GHI")  # right-to-left isolate
ABC ⁧ DEF ⁩ GHI

But no, I can't quote chapter and verse from the standard as to what exactly this is breaking.
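
For reference, Python's `unicodedata` module reports the bidi class that the Unicode Character Database assigns to each character in the test strings above. This is only a minimal sketch: the helper name `bidi_classes` is made up for illustration, and the isolate classes assume a Python build whose Unicode data is version 6.3 or later.

```python
import unicodedata

def bidi_classes(text):
    """Return (code point label, bidi class) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]

# A Latin letter for comparison, followed by the controls used in the test above.
for label, cls in bidi_classes("D\u202A\u202B\u202D\u202E\u202C\u2066\u2067\u2069"):
    print(label, cls)
```

The Latin letter comes back as `L` (strong left-to-right), while the controls come back as `LRE`, `RLE`, `LRO`, `RLO`, `PDF`, `LRI`, `RLI` and `PDI` respectively.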

@mintty
Owner

mintty commented Dec 19, 2018

I agree that indeed the effect of L vs R seems to be reversed; surprising that nobody else noticed since 2.7.6.
Note:

  • Latin characters should of course also be reversed (in the proper order, though) as "Explicit Directional Embedding" is the purpose of this formatting.
  • Some of the cases change behaviour when adding space around or within the embedding markers.
  • I'm attaching a test case list. I'd appreciate it if the expected appearance of all of them could be clarified with authoritative reasoning before I touch this entangled part of mintty again to handle the issue.
    xbidi-dirform.txt

@XVilka

XVilka commented Dec 19, 2018

  1. @mrphlip can you please help me collect information about BiDi support across terminal emulators and programs (and help test the actual support, to point out mistakes and misunderstandings)?

  2. @mintty for your question, @egmontkob might help you to answer.

@mintty
Owner

mintty commented Dec 19, 2018

@XVilka, you list konsole and qterminal on your page but neither of them supports the Explicit Directional Embedding markers.

@mintty
Owner

mintty commented Dec 19, 2018

... and not even mlterm does; so maybe I should just drop this stuff.

@egmontkob

In the last couple of months I've been working on a design doc for BiDi in terminal emulators, evaluating existing documentation and implementations I came across. It's almost ready to be published, I'll publish the first version early next year. It'll have its own discussion forum.

It's unclear how BiDi control characters should be handled. I outline one possible approach, but it's still subject to discussion, case studies, and input from BiDi experts. Mlterm does remember these to some extent, though it's unclear to me how. On a side note, Mlterm's BiDi is by far the buggiest one I came across, and for this reason I haven't studied how it remembers BiDi control characters.

Unfortunately, there are far more substantial problems with BiDi in every implementation I came across. These are such core problems that the issue with BiDi controls is secondary to them, and these core issues are the ones my work primarily focuses on. How to handle control characters could be added to a subsequent version of this document in the future.

I'll notify you whenever the first version is published.

Note: Mintty was off my radar. I'm happy to look at it and include it in my evaluation if you point me to step-by-step docs on how to get it to work on Ubuntu 18.10, without having to have any commercial (e.g. Windows) license, and without assuming any knowledge about Windows, Cygwin and such.

> believe the right-to-left override is the only one that should actually make it display as right-to-left.

That's correct. Out of the 6 examples with "DEF", only the RLO one should show up as "FED". (I meant this generally, not within the context of terminal emulators.)

@mintty
Owner

mintty commented Dec 19, 2018

Hi @egmontkob, nice to meet again after years (Thomas Wolff on the linux-utf8 mailing list).
So maybe I've misinterpreted what "embedded" means, then, by applying RLE to produce FED. Where is this actually specified? I tried to follow the Unicode bidi algorithm in the implementation...
Mintty, unfortunately, is a Windows-only program, using both Posix and Windows APIs. It should run under the wine emulator. Installing and running Cygwin does not require specific knowledge about it. When it runs, it's just a Posix-like environment within Windows.

@BrianInglis

The bidi algorithm's focus is on ensuring that alphanumeric LTR and RTL character sequences are rendered appropriately by default, with explicit direction brackets tweaking rendering in special cases, mainly to handle non-alphanumeric characters with weak or neutral directionality. Latin characters have strong LTR direction, so the question is: under what circumstances specified in the bidi algorithm should explicit direction brackets be able to override strong directions to achieve their desired effect?

@egmontkob

egmontkob commented Dec 19, 2018

@mintty linux-utf8... that must have been 12+ years ago, right? Your name rings a bell, but I can't recall what we were discussing there, sorry :)

Unicode TR9 (also known as UAX9) is the BiDi algorithm. If you implement it step by step, then only RLO should end up forcing reverse order inside. RLE (embedding: the old and somewhat messed-up one) and RLI (isolate: the new, fixed one, since Unicode 6.3) just set the default direction inside, in which case strong-directionality characters are still laid out according to their own direction. There are a gazillion great, easy-to-read BiDi introductions on the web that demo the most typical use cases of the control characters.
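
To make the difference concrete, here is a minimal sketch of the final reordering step of UAX #9 (rule L2), which reverses runs of characters according to their resolved embedding levels. It does not resolve levels itself: the levels in the examples are assigned by hand for an LTR paragraph, and the function name `reorder_line` is made up for illustration. Under RLO, "DEF" is forced to behave as right-to-left and stays at level 1. Under RLE or RLI, the embedding level is 1 but the strong left-to-right letters are raised to level 2.

```python
def reorder_line(text, levels):
    """Apply UAX #9 rule L2: from the highest level down to the lowest odd
    level, reverse every contiguous run of characters at that level or higher."""
    chars = list(text)
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return text  # purely left-to-right line, nothing to reverse
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                chars[i:j] = reversed(chars[i:j])
                i = j
            else:
                i += 1
    return "".join(chars)

# Hand-assigned levels (assumption: LTR paragraph, controls already removed).
print(reorder_line("DEF", [1, 1, 1]))                      # inside RLO: FED
print(reorder_line("DEF", [2, 2, 2]))                      # inside RLE/RLI: DEF
print(reorder_line("DEF GHI", [2, 2, 2, 1, 2, 2, 2]))      # inside RLE/RLI: GHI DEF
```

The last case shows what an RTL embedding does to Latin text: the words are laid out right to left, but the letters within each word keep their left-to-right order.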

It's unclear to me what RLO's real purpose is, I don't think I've ever seen it in practice. "DEF" vs. "FED" are nice to talk about in examples, but I doubt the goal is to be able to visually reverse text. I believe if one wishes to see a literal "FED", they should emit the string "FED". I guess BiDi overrides are primarily meant to enforce the otherwise desired direction that makes sense anyway, rather than the opposite thereof.

Even embeddings and isolates are much more rare than I had expected. I checked a few wiki pages in these languages, plus the one-line translated description of all the software installed on my computer, and haven't come across any. I think their main use case is in template strings where placeholders get substituted by values at runtime.

Did you implement the BiDi algo from scratch? I'm really not familiar with Cygwin, but wouldn't it be easier to let's say compile (or port, if necessary) an existing implementation, like FriBidi or ICU?

Unicode.org provides an online interface to its C (post-6.3) and Java (pre-6.3) implementations. Control characters aren't visible in the input field, so probably it's best to copy-paste the entire text from some editor. It's a great way to see how a given string is supposed to show up, and it's probably the authoritative answer when in doubt. I vaguely recall seeing somewhere mentioned that there are "official" unittests too for BiDi on Unicode.org, but I haven't found (haven't looked for) them.

What I typically do, though, is that I view the string in various apps including Firefox, Chromium, LibreOffice Writer, Gedit, pango-view. On Ubuntu 18.10, Chromium has some noticeable problems, the rest are doing fine. I'm not sure which ones use their own implementation (esp. the browsers and LibreOffice – Chromium probably uses its own implementation since it has bugs that aren't present in FriBidi, but I've really no clue about Firefox and LibreOffice) and which ones rely on FriBidi (probably Gedit and pango-view). Sure it's all different on Windows, and due to browsers presumably using different BiDi libraries, which one is correct and which one isn't might differ across OSes.

There's a basic problem though when it comes to terminal emulation. The string is split into characters and these are stored in the terminal's cells. How do you handle invisible, zero-width control characters? Where do you store these BiDi controls? How are they preserved, updated, wiped out on further cursor positioning, subsequent partial overwriting of the contents?

> It should run under the wine emulator.

Maybe on a rainy weekend...

@mrphlip
Author

mrphlip commented Dec 20, 2018

> surprising that nobody else noticed since 2.7.6.

tbf I noticed a while ago, just up until recently I was mistaken as to what the actual problem was.

For context... there appears to be some Twitter front-end of some sort, that puts LRI around Twitter handles in posts. I don't follow anyone that uses it, but I see posts with it RT'd into my feed occasionally. cf this tweet, there's LRI's around both @-handles in there.

So when these tweets show up in my feed (via oysttyer) they'd show up with the twitter handles reversed due to this. And, up until somewhat recently, the LRI/PDI control characters were visible, too.

So I'd always assumed (without checking too hard) that it was Twitter trying to be clever, and it was outputting the twitter handles backwards with RLO, so they'd look normal, but confuse bot scrapers, or some such nonsense. I wouldn't put it past Twitter to do something dumb like that. I just assumed that mintty was ignoring the bidi controls.

It was only recently that I realised that wasn't what was happening... that Twitter was sending the handles in the correct order, and they were left-to-right control characters anyway... it was just that mintty was displaying it backwards.
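
When invisible controls are suspected in a piece of text like this, making them visible is a quick way to check what is actually being sent. A minimal sketch, assuming Python's `unicodedata` module; the helper name `show_bidi_controls` and the handle `@example` are hypothetical:

```python
import unicodedata

# Bidi classes of the explicit directional formatting characters in UAX #9.
CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def show_bidi_controls(text):
    """Replace each explicit bidi control with a visible <CLASS> tag."""
    out = []
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        out.append(f"<{cls}>" if cls in CONTROL_CLASSES else ch)
    return "".join(out)

print(show_bidi_controls("RT \u2066@example\u2069: hello"))
# RT <LRI>@example<PDI>: hello
```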

> It's unclear to me what RLO's real purpose is, I don't think I've ever seen it in practice.

The only use-cases I've seen are either posting troll garbage online, or this one time I wanted to write "ℵ₀" in LTR text and didn't yet know that U+2135 exists for exactly this purpose, so I used an LRO.

@mintty
Owner

mintty commented Dec 20, 2018

@egmontkob, thanks for the elaborate background information and the link.

> Did you implement the BiDi algo from scratch? I'm really not familiar with Cygwin, but wouldn't it be easier to let's say compile (or port, if necessary) an existing implementation, like FriBidi or ICU?

No, it's minibidi.c, in mintty still from putty times (and now still in putty): Author: Ahmad Khalifa
(www.arabeyes.org - under MIT license).
I added handling of the new ISOLATE markers in mintty 2.7.6 last year and thought I had fixed the other directional markers too on that occasion, but apparently I got it all wrong...

> I vaguely recall seeing somewhere mentioned that there are "official" unittests too for BiDi on Unicode.org

Likely BidiCharacterTest.txt from the UCD database, not showing the resulting visual order directly however.
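
As a rough illustration of how the visual order can be derived from that file: each data line in BidiCharacterTest.txt has five semicolon-separated fields (code points, paragraph direction, resolved paragraph level, resolved levels with `x` for characters removed by rule X9, and the left-to-right list of visual indices). The sketch below parses one such line. The sample line is hand-written in that format for this example, not copied from the file, and the function names are made up.

```python
def parse_test_line(line):
    """Split one BidiCharacterTest.txt data line into its five fields."""
    fields = line.split("#", 1)[0].strip().split(";")
    code_points = [int(cp, 16) for cp in fields[0].split()]
    para_direction = int(fields[1])   # 0 = LTR, 1 = RTL, 2 = auto
    para_level = int(fields[2])
    levels = [None if lv == "x" else int(lv) for lv in fields[3].split()]
    order = [int(i) for i in fields[4].split()]
    return code_points, para_direction, para_level, levels, order

def visual_string(line):
    """Build the displayed string from the code points and visual indices."""
    code_points, _, _, _, order = parse_test_line(line)
    return "".join(chr(code_points[i]) for i in order)

# Hand-written sample (assumption): RLO D E F PDF in an LTR paragraph.
sample = "202E 0044 0045 0046 202C;0;0;x 1 1 1 x;3 2 1"
print(visual_string(sample))  # FED
```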

> How do you handle invisible, zero-width control characters? Where do you store these BiDi controls?

Technically like combining characters. Which means they are stored attached to their preceding character, which may raise problems on updating...

> How are they preserved, updated, wiped out on further cursor positioning, subsequent partial overwriting of the contents?

From the above, you can imagine this may not work well in all situations; if you overwrite the preceding character (now technically the base character), the marker will be lost. However, if an application really uses directional markers and then positioning around them, it's asking for trouble...
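
The storage scheme described here can be modelled in a few lines. This is only a toy sketch of the idea (controls attached to the preceding cell, like combining characters), not mintty's actual data structure; the class `Row` and its methods are hypothetical.

```python
BIDI_CONTROLS = set("\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069")

class Row:
    """One terminal line; each cell is [base character, attached controls]."""

    def __init__(self, width):
        self.cells = [[" ", ""] for _ in range(width)]
        self.cursor = 0

    def write(self, text):
        for ch in text:
            if ch in BIDI_CONTROLS:
                # Zero-width control: attach it to the preceding cell.
                # At column 0 there is no preceding cell, so it is dropped.
                if self.cursor > 0:
                    self.cells[self.cursor - 1][1] += ch
            elif self.cursor < len(self.cells):
                # Overwriting a cell replaces its base and discards attachments.
                self.cells[self.cursor] = [ch, ""]
                self.cursor += 1

    def logical_text(self):
        return "".join(base + attached for base, attached in self.cells)

row = Row(8)
row.write("AB\u2067CD\u2069")          # RLI is attached to "B", PDI to "D"
before = row.logical_text()
row.cursor = 1                          # application repositions the cursor
row.write("X")                          # overwriting "B" loses the RLI
after = row.logical_text()
print(repr(before), repr(after))
```

After the overwrite, the closing PDI is still stored but its opening RLI is gone, which is the kind of trouble the comment above warns about.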

@egmontkob

> No, it's minibidi.c, in mintty still from putty times (and now still in putty)

I guess it's the same one used by pterm (putty's standalone terminal emulator without ssh) on Linux. There, the command echo 12345 HEBREW (well, with "HEBREW" being an actual Hebrew word) produces 54321 at the left margin and WERBEH (the properly RTL-ified Hebrew word) at the right, leaving an empty cell at the very right. Does the Windows PuTTY do the same?

Mirroring the number is incorrect, aligning at both edges simultaneously is something the BiDi algorithm doesn't do, and the extra space at the right is not justified either.

So, unfortunately, this one is a pretty poor BiDi implementation.

> Technically like combining characters.

Fair enough, probably the most trivial approach. What happens on updating is not that big of a concern to me, I mean, something has to happen, and I can't see what would make an approach slightly better or worse than another one.

In my opinion, the bigger problem with this approach is that it cannot handle strings beginning with BiDi controls.

> However, if an application really uses directional markers and then positioning around them, it's asking for trouble...

Indeed :)

@mintty
Owner

mintty commented Dec 20, 2018

> echo 12345 HEBREW

I confirm your suspicion, which convinces me that I should find a replacement for this module. Maybe the original Unicode.org demo UBA (if I find the link again, I saw it earlier in the depths of their web dungeon...).
And I'll reconsider the storage too.

@egmontkob

You should take a look at whether you can get FriBidi to work in Cygwin (and whether it's licence-compatible, of course). It's written in plain C, with some of the files being auto-generated at compile time using I-don't-know-what, and it has no dependency other than libc. Version 1.0 supports isolates.

@mintty
Owner

mintty commented Dec 20, 2018

The fribidi library is already available as a cygwin package; however, I'd appreciate something like a drop-in replacement with a simple API like:

int do_bidi(bidi_char * line, int count);
int do_shape(bidi_char * line, bidi_char * to, int count);

and fribidi looks already significantly more complex than that; also I didn't see any straight-forward howto for its usage.

@egmontkob

I can help you with FriBidi, it's not much more complicated than what you're looking for. Definitely simpler than implementing BiDi yourself or patching a broken and out-of-date implementation :D

@khaledhosny

> It's unclear to me what RLO's real purpose is, I don't think I've ever seen it in practice.

For (mostly historical) scripts that are written in both directions; Unicode will choose one direction as the default for the script, and direction overrides can be used to force the other direction as needed.

@khaledhosny

> I'm not sure which ones use their own implementation (esp. the browsers and LibreOffice – Chromium probably uses its own implementation since it has bugs that aren't present in FriBidi, but I've really no clue about Firefox and LibreOffice) and which ones rely on FriBidi (probably Gedit and pango-view)

Firefox used to use its own bidi code (forked from an old version of ICU); now it uses the ICU API. LibreOffice uses ICU as well. GTK uses Pango, which used to have its own implementation as well (forked from an old version of FriBiDi); now it uses the FriBiDi API.

@mintty
Owner

mintty commented Dec 31, 2018

Since mintty runs under Windows anyway, how about using the Windows Uniscribe API for bidi transformation?

@mintty
Owner

mintty commented Jan 17, 2019

Found the culprit of both embedding handling and the "1234 RTL" issue - what a gross mistake!

After going through all the bidi algorithm and trying to match it to minibidi.c, I'll upload another sanitizing patch. However, isolate markers will remain only partially implemented.

@mintty
Owner

mintty commented Jan 19, 2019

@egmontkob,

> In my opinion, the bigger problem with this approach is that it cannot handle strings beginning with BiDi controls.

These are now handled:)
I'd appreciate any testing. Notable implementation limits are now some details in handling "isolate runs", particularly rules X10 and N0. But I don't actually see their impact and have no test cases for them.

@egmontkob

egmontkob commented Jan 19, 2019

> These are now handled:)

Could you please describe the exact behavior (specification) how they are handled? Is the first cell (of a row or a paragraph?) handled specially? Or is there some per-row or per-paragraph storage? Or are all opening BiDi controls stored belonging to the next cell? How do subsequent overrides of certain cells wipe out, update, or preserve these BiDi marks?

> However, if an application really uses directional markers and then positioning around them, it's asking for trouble...

Getting back to this point... The exact details probably don't matter for utilities that just spit out their output, nor for fullscreen apps. But there's something in between: line editing tools, typically shell command lines and the like. Here it would be important to define a consistent behavior across apps, so that e.g. we can count on readline (or at least a future, BiDi-aware version thereof) not to break if the entire prompt is within a BiDi block, or so.

@mintty
Owner

mintty commented Jan 19, 2019

About initial bidi markers, I considered different implementation options. The easiest one turned out to be to prepend a dummy character with index -1, even if that's a bit subtle to handle in C. But it was largely straightforward, except for handling the scrollback buffer.

A bidi marker, as being handled like a combining character, will get cleared if its preceding character is overwritten. Initial markers should be cleared when clearing the line, which does not work yet...
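
The dummy-character idea can be sketched as an extra slot in front of column 0. This is a toy model under stated assumptions, not mintty's C implementation: the class name and the `leading` attribute (standing in for the cell at index -1) are hypothetical, and `clear()` shows the intended behaviour of dropping initial markers together with the line.

```python
BIDI_CONTROLS = set("\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069")

class RowWithLeadingSlot:
    """Terminal line with an extra slot, before column 0, for initial markers."""

    def __init__(self, width):
        self.width = width
        self.clear()

    def clear(self):
        self.leading = ""  # stands in for the dummy character at index -1
        self.cells = [[" ", ""] for _ in range(self.width)]
        self.cursor = 0

    def write(self, text):
        for ch in text:
            if ch in BIDI_CONTROLS:
                if self.cursor == 0:
                    self.leading += ch                      # initial marker
                else:
                    self.cells[self.cursor - 1][1] += ch    # attach to previous cell
            elif self.cursor < self.width:
                self.cells[self.cursor] = [ch, ""]
                self.cursor += 1

    def logical_text(self):
        return self.leading + "".join(base + att for base, att in self.cells)

row = RowWithLeadingSlot(4)
row.write("\u2067AB\u2069")     # line starts with RLI
print(repr(row.logical_text()))  # the leading RLI is preserved
row.clear()
print(repr(row.logical_text()))  # clearing the line drops it again
```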

@mintty
Owner

mintty commented Jan 20, 2019

Note: It seems that some details of the implementation only work in the 32-bit version, which is a bit weird.

mintty added a commit that referenced this issue Jan 20, 2019
@mintty
Owner

mintty commented Jan 20, 2019

Released 2.9.6.

@mintty mintty closed this as completed Jan 20, 2019
@egmontkob

I've published my draft BiDi specification at https://terminal-wg.pages.freedesktop.org/bidi/ . Comments about this spec are welcome over there.

@mintty
Owner

mintty commented Jan 30, 2019

Thanks, @egmontkob. I've submitted three issues to your bidi terminal spec, attempting to clarify what's useful for a terminal and how to possibly involve ECMA-48 controls to achieve that. Comments to https://gitlab.freedesktop.org/terminal-wg/bidi/issues by other stakeholders welcome.
