Bidi isolates are reversed #837
Comments
Assuming the charset is UTF-8 and the font(set) supports the bidi rendering, where does the behaviour differ from that expected per UAX #9 (Unicode Standard Annex #9, Unicode Bidirectional Algorithm)? |
Like I said, I'm not an expert. But I'd think the ones where it's using letters from a left-to-right alphabet, inside a left-to-right embedding, and it's displaying right-to-left are probably different to how the algorithm is supposed to work? If I paste the output into, say, a web browser, then only the "right-to-left override" one actually displays right-to-left, which is what I'd expect.
But no, I can't quote chapter and verse from the standard as to what exactly this is breaking. |
I agree that indeed the effect of L vs R seems to be reversed; surprising that nobody else noticed since 2.7.6.
@XVilka, you list konsole and qterminal on your page but neither of them supports the Explicit Directional Embedding markers. |
... and not even mlterm does; so maybe I should just drop this stuff. |
In the last couple of months I've been working on a design doc for BiDi in terminal emulators, evaluating existing documentation and implementations I came across. It's almost ready to be published, I'll publish the first version early next year. It'll have its own discussion forum. It's unclear how BiDi control characters should be handled. I outline one possible approach, but it's still subject to discussions, case studies, and input from BiDi experts. Mlterm does remember these to some extent, it's unclear to me how – on a side note, Mlterm's BiDi is by far the buggiest one I came across, and for this reason, I haven't studied how it remembers BiDi control characters. Unfortunately, there are way more substantial problems with BiDi in any implementation I came across. These are such core problems that the issue with BiDi controls is secondary to them, and these core issues are the ones my work primarily focuses on. How to handle control characters could be added to a subsequent version of this document in the future. I'll notify you whenever the first version is published. Note: Mintty was off my radar. I'm happy to look at it and include it in my evaluation if you point me to step-by-step docs on how to get it to work on Ubuntu 18.10, without having to have any commercial (e.g. Windows) license, and without assuming any knowledge about Windows, Cygwin and such.
That's correct. Out of the 6 examples with "DEF", only the RLO one should show up as "FED". (I meant this generally, not within the context of terminal emulators.) |
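The six test strings themselves are not reproduced in this thread, so the following is only an assumed equivalent: a small Python sketch that wraps "DEF" in each of the six opening controls with its matching terminator (PDF for embeddings and overrides, PDI for isolates). The names `CONTROLS`, `wrap` and `test_lines` are made up for illustration.

```python
# Opening directional controls and the terminator that closes each of them.
CONTROLS = {
    "LRE": ("\u202a", "\u202c"),  # left-to-right embedding, closed by PDF
    "RLE": ("\u202b", "\u202c"),  # right-to-left embedding, closed by PDF
    "LRO": ("\u202d", "\u202c"),  # left-to-right override, closed by PDF
    "RLO": ("\u202e", "\u202c"),  # right-to-left override, closed by PDF
    "LRI": ("\u2066", "\u2069"),  # left-to-right isolate, closed by PDI
    "RLI": ("\u2067", "\u2069"),  # right-to-left isolate, closed by PDI
}


def wrap(name, text):
    """Return text enclosed in the named control and its terminator."""
    opener, closer = CONTROLS[name]
    return opener + text + closer


def test_lines(text="DEF"):
    """One labelled line per control, with plain LTR text on both sides."""
    return [f"{name}: abc {wrap(name, text)} ghi" for name in CONTROLS]


if __name__ == "__main__":
    for line in test_lines():
        print(line)
```

Printing these in the terminal and pasting the same output into a browser gives a quick side-by-side comparison; per the comment above, only the RLO line should read "FED".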
Hi @egmontkob, nice to meet again after years (Thomas Wolff on the linux-utf8 mailing list). |
The bidi algorithm's focus is on ensuring that alphanumeric ltr and rtl character sequences are rendered appropriately by default, with explicit direction brackets tweaking rendering in special cases, mainly to handle non-alphanumeric characters with weak or neutral directionality. Latin characters have strong ltr direction, so the question is: under what circumstances specified in the bidi algorithm should explicit direction brackets be able to override strong directions to achieve their desired effect? |
@mintty linux-utf8... that must have been 12+ years ago, right? Your name rings a bell, but I can't recall what we were discussing there, sorry :) Unicode TR9 (also known as UAX9) is the BiDi algorithm, if you implement it step by step then only RLO should end up forcing reverse order inside. RLE (embedding: the old and somewhat messed up) and RLI (isolate: the new, fixed, since Unicode 6.3) just set the default direction inside, in which case strong directionality characters are still laid out according to their direction. There are a gazillion of great easy-to-read BiDi introductions on the web that demo the most typical use cases of the control characters. It's unclear to me what RLO's real purpose is, I don't think I've ever seen it in practice. "DEF" vs. "FED" are nice to talk about in examples, but I doubt the goal is to be able to visually reverse. I believe if one wishes to see a literal "FED", they should emit the string "FED". I guess BiDi overrides are primarily meant to enforce the otherwise desired direction that makes sense anyway, rather than opposite thereof. Even embeddings and isolates are much more rare than I had expected. I checked a few wiki pages in these languages, plus the one-line translated description of all the software installed on my computer, and haven't come across any. I think their main use case is in template strings where placeholders get substituted by values at runtime. Did you implement the BiDi algo from scratch? I'm really not familiar with Cygwin, but wouldn't it be easier to let's say compile (or port, if necessary) an existing implementation, like FriBidi or ICU? Unicode.org provides an online interface to its C (post-6.3) and Java (pre-6.3) implementations. Control characters aren't visible in the input field, so probably it's best to copy-paste the entire text from some editor. It's a great way to see how a given string is supposed to show up, and it's probably the authoritative answer when in doubt. 
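To make the difference between override and embedding/isolate concrete, here is a deliberately tiny Python model. It is not the UBA: it assumes an LTR paragraph (level 0) and an input of exactly one opening control, a run of strong left-to-right letters, and an optional terminator. It assigns levels the way the explicit and implicit rules would for that one case and then applies the reversal step (rule L2). `OPENERS`, `visual_order` and `reorder` are hypothetical names.

```python
# opener -> (embedding level pushed in an LTR paragraph, override class or None)
OPENERS = {
    "\u202a": (2, None),  # LRE
    "\u202b": (1, None),  # RLE
    "\u202d": (2, "L"),   # LRO
    "\u202e": (1, "R"),   # RLO
    "\u2066": (2, None),  # LRI
    "\u2067": (1, None),  # RLI
}
CLOSERS = {"\u202c", "\u2069"}  # PDF, PDI


def reorder(text, levels):
    """Rule L2: from the highest level down to the lowest odd level,
    reverse every contiguous run of characters at that level or higher."""
    odd = [lv for lv in levels if lv % 2 == 1]
    if not odd:
        return text  # no odd level present, nothing is reversed
    chars, lv = list(text), list(levels)
    for level in range(max(lv), min(odd) - 1, -1):
        i = 0
        while i < len(chars):
            if lv[i] >= level:
                j = i
                while j < len(chars) and lv[j] >= level:
                    j += 1
                chars[i:j] = chars[i:j][::-1]
                lv[i:j] = lv[i:j][::-1]
                i = j
            else:
                i += 1
    return "".join(chars)


def visual_order(wrapped):
    """Display order for: opener + strong-LTR letters + optional closer."""
    level, override = OPENERS[wrapped[0]]
    body = wrapped[1:]
    if body and body[-1] in CLOSERS:
        body = body[:-1]
    char_class = override or "L"  # an override replaces the letters' own class
    if char_class == "L" and level % 2 == 1:
        level += 1  # an L character on an odd level is raised to the next even level
    return reorder(body, [level] * len(body))


if __name__ == "__main__":
    for opener in OPENERS:
        print(f"U+{ord(opener):04X}", visual_order(opener + "DEF\u202c"))
```

In this model RLE and RLI put the letters on level 2 (reversed zero times here, as no odd level remains), while RLO turns them into R characters on level 1, which are reversed once. That is the only case that comes out as "FED".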
I vaguely recall seeing somewhere mentioned that there are "official" unittests too for BiDi on Unicode.org, but I haven't found (haven't looked for) them. What I typically do, though, is that I view the string in various apps including Firefox, Chromium, LibreOffice Writer, Gedit, pango-view. On Ubuntu 18.10, Chromium has some noticeable problems, the rest are doing fine. I'm not sure which ones use their own implementation (esp. the browsers and LibreOffice – Chromium probably uses its own implementation since it has bugs that aren't present in FriBidi, but I've really no clue about Firefox and LibreOffice) and which ones rely on FriBidi (probably Gedit and pango-view). Sure it's all different on Windows, and due to browsers presumably using different BiDi libraries, which one is correct and which one isn't might differ across OSes. There's a basic problem though when it comes to terminal emulation. The string is split into characters and these are stored in the terminal's cells. How do you handle invisible, zero-width control characters? Where do you store these BiDi controls? How are they preserved, updated, wiped out on further cursor positioning, subsequent partial overwriting of the contents?
Maybe on a rainy weekend... |
tbf I noticed a while ago, just up until recently I was mistaken as to what the actual problem was. For context... there appears to be some Twitter front-end of some sort, that puts LRI around Twitter handles in posts. I don't follow anyone that uses it, but I see posts with it RT'd into my feed occasionally. cf this tweet, there's LRI's around both @-handles in there. So when these tweets show up in my feed (via oysttyer) they'd show up with the twitter handles reversed due to this. And, up until somewhat recently, the LRI/PDI control characters were visible, too. So I'd always assumed (without checking too hard) that it was Twitter trying to be clever, and it was outputting the twitter handles backwards with RLO, so they'd look normal, but confuse bot scrapers, or some such nonsense. I wouldn't put it past Twitter to do something dumb like that. I just assumed that mintty was ignoring the bidi controls. It was only recently that I realised that wasn't what was happening... that Twitter was sending the handles in the correct order, and they were left-to-right control characters anyway... it was just that mintty was displaying it backwards.
The only use-cases I've seen are either posting troll garbage online, or this one time I wanted to write "ℵ₀" in LTR text and didn't yet know that U+2135 exists for exactly this purpose, so I used an LRO. |
@egmontkob, thanks for the elaborate background information and the link.
No, it's minibidi.c, in mintty still from putty times (and now still in putty): Author: Ahmad Khalifa
Likely
Technically like combining characters. Which means they are stored attached to their preceding character, which may raise problems on updating...
From the above, you can imagine this may not work well in all situations; if you overwrite the preceding character (now technically the base character), the marker will be lost. However, if an application really uses directional markers and then positioning around them, it's asking for trouble... |
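A rough Python sketch of that storage model, for discussion only. It is not mintty's code, and `Cell`, `Row` and `logical_text` are invented names. Each cell keeps its base character plus any bidi controls that followed it, the same way combining characters would be attached.

```python
from dataclasses import dataclass, field

BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")


@dataclass
class Cell:
    base: str = " "
    attached: list = field(default_factory=list)  # controls following the base


class Row:
    def __init__(self, width):
        self.cells = [Cell() for _ in range(width)]
        self.cursor = 0

    def write(self, text):
        for ch in text:
            if ch in BIDI_CONTROLS:
                if self.cursor > 0:
                    # zero-width: attach to the preceding cell, do not advance
                    self.cells[self.cursor - 1].attached.append(ch)
                # at column 0 there is no preceding cell; this sketch drops it
            elif self.cursor < len(self.cells):
                # overwriting replaces the whole cell, attachments included
                self.cells[self.cursor] = Cell(ch)
                self.cursor += 1

    def logical_text(self):
        joined = "".join(c.base + "".join(c.attached) for c in self.cells)
        return joined.rstrip(" ")


if __name__ == "__main__":
    row = Row(10)
    row.write("ab\u2067cd\u2069")
    row.cursor = 1          # application moves the cursor back onto "b"
    row.write("X")          # the RLI attached to "b" is lost with it
    print([f"U+{ord(c):04X}" for c in row.logical_text()])
```

After the overwrite the row still holds the PDI but no longer the RLI, which is the kind of trouble described above for applications that mix directional markers with cursor positioning.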
I guess it's the same used by pterm (putty's standalone terminal emulator without ssh) on Linux. There the command Mirroring the number is incorrect, aligning at both edges simultaneously is something the BiDi algorithm doesn't do, and the extra space at the right is not justified either. So, unfortunately, this one is a pretty poor BiDi implementation.
Fair enough, probably the most trivial approach. What happens on updating is not that big of a concern to me, I mean, something has to happen, and I can't see what would make an approach slightly better or worse than another one. In my opinion, the bigger problem with this approach is that it cannot handle strings beginning with BiDi controls.
Indeed :) |
I confirm your suspicion, which convinces me that I should find a replacement for this module. Maybe the original Unicode.org demo UBA (if I find the link again, I saw it earlier in the depths of their web dungeon...). |
You should take a look at whether you can get FriBidi to work in Cygwin (and whether it's licence-compatible, of course). It's written in plain C, some of the files being auto-generated at compile time using I-don't-know-what, and it has no other dependency than libc. Version 1.0 supports isolates. |
The fribidi library is already available as a cygwin package; however, I'd appreciate something like a drop-in replacement with a simple API like:
and fribidi looks already significantly more complex than that; also I didn't see any straight-forward howto for its usage. |
I can help you with FriBidi, it's not much more complicated than what you're looking for. Definitely simpler than implementing BiDi yourself or patching a broken and out-of-date implementation :D |
For (mostly historical) scripts that are written in both directions, Unicode will choose one direction as the default for the script, and direction overrides can be used to force the other direction as needed. |
Firefox used to use its own bidi code (forked from an old version of ICU), now it uses the ICU API. LibreOffice uses ICU as well. GTK uses Pango, which used to have its own implementation as well (forked from an old version of FriBiDi), now it uses the FriBiDi API. |
Since mintty runs under Windows anyway, how about using the Windows Uniscribe API for bidi transformation? |
Found the culprit of both embedding handling and the "1234 RTL" issue - what a gross mistake! After going through all the bidi algorithm and trying to match it to minibidi.c, I'll upload another sanitizing patch. However, isolate markers will remain only partially implemented. |
These are now handled:) |
Could you please describe the exact behavior (specification) how they are handled? Is the first cell (of a row or a paragraph?) handled specially? Or is there some per-row or per-paragraph storage? Or are all opening BiDi controls stored belonging to the next cell? How do subsequent overrides of certain cells wipe out, update, or preserve these BiDi marks?
Getting back to this point... The exact details probably don't matter for utilities that just spit out their output, nor for fullscreen apps. But there's something in between: line editing tools, typically shell command lines and the like. Here it would be important to define a consistent behavior across apps, so that e.g. we can count on readline (or at least a future, BiDi-aware version thereof) not to break if the entire prompt is within a BiDi block, or so. |
About initial bidi markers, I considered different implementation options. The easiest one turned out to be to prepend a dummy character with index -1, even if that's a bit subtle to handle in C. But it was largely straightforward except for handling the scrollback buffer. A bidi marker, being handled like a combining character, will get cleared if its preceding character is overwritten. Initial markers should be cleared when clearing the line, which does not work yet... |
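For readers following along, the "dummy character at index -1" idea could look roughly like this in Python. This is a sketch under assumptions, not the mintty implementation: `Line`, `marks` and `clear` are hypothetical, the mark list is simply shifted by one so that slot 0 stands in for index -1, and `clear` shows the intended behaviour rather than the current one.

```python
BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")


class Line:
    def __init__(self, width):
        self.chars = [" "] * width
        # marks[i + 1] holds the controls attached to cell i;
        # marks[0] plays the role of the dummy cell at index -1.
        self.marks = [[] for _ in range(width + 1)]
        self.cursor = 0

    def write(self, text):
        for ch in text:
            if ch in BIDI_CONTROLS:
                # attach to the cell before the cursor: (cursor - 1) + 1
                self.marks[self.cursor].append(ch)
            elif self.cursor < len(self.chars):
                self.chars[self.cursor] = ch
                self.marks[self.cursor + 1] = []  # overwrite clears its marks
                self.cursor += 1

    def clear(self):
        """Intended behaviour: clearing the line also drops initial markers."""
        self.chars = [" "] * len(self.chars)
        self.marks = [[] for _ in self.marks]
        self.cursor = 0

    def logical_text(self):
        parts = ["".join(self.marks[0])]
        for i, ch in enumerate(self.chars):
            parts.append(ch + "".join(self.marks[i + 1]))
        return "".join(parts).rstrip(" ")


if __name__ == "__main__":
    line = Line(8)
    line.write("\u2067ab\u2069")   # line starts with RLI
    line.cursor = 0
    line.write("X")                # overwriting cell 0 keeps the initial RLI
    print([f"U+{ord(c):04X}" for c in line.logical_text()])
    line.clear()
    print(repr(line.logical_text()))
```

One consequence of this layout is that an initial marker survives overwriting the first visible cell and only disappears when the whole line is cleared. Whether that is the desired semantics is exactly the kind of detail a common specification would need to settle.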
Note: It seems that some details of the implementation only work in the 32-bit version, which is a bit weird. |
Released 2.9.6. |
I've published my draft BiDi specification at https://terminal-wg.pages.freedesktop.org/bidi/ . Comments about this spec are welcome over there. |
Thanks, @egmontkob. I've submitted three issues to your bidi terminal spec, attempting to clarify what's useful for a terminal and how to possibly involve ECMA-48 controls to achieve that. Comments to https://gitlab.freedesktop.org/terminal-wg/bidi/issues by other stakeholders welcome. |
The Unicode BiDi isolate characters appear to be reversed in behaviour:
U+2066 LEFT-TO-RIGHT ISOLATE is making the text right-to-left
U+2067 RIGHT-TO-LEFT ISOLATE is making the text left-to-right
This is running on mintty 2.9.5 (x86_64-pc-cygwin)