Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[unic-bidi] API concerns #273

Open
raphlinus opened this issue Sep 3, 2020 · 6 comments
Open

[unic-bidi] API concerns #273

raphlinus opened this issue Sep 3, 2020 · 6 comments

Comments

@raphlinus
Copy link

raphlinus commented Sep 3, 2020

I'm looking at using unic-bidi for a contemplated text layout project. For this discussion, consider the primary goal to be adding BiDi support to a (hypothetical) in-Rust implementation of text layout conforming to Piet's TextLayout trait. I see a number of concerns.

As a higher level process issue, one way to address those concerns is to fork unic-bidi. But I see value in trying to maintain a single source of truth for BiDi in the Rust ecosystem, so it feels like working with upstream is a good way to do that. That said, #272 is pretty good evidence to me, in addition to reverse dependencies (note that the RustPython reverse dep has actually recently been removed) that nobody is actually using unic-bidi, especially within the context of paragraph level text formatting.

Now for my concerns.

A big one is the &'text lifetime requirement on BidiInfo. That precludes retaining BidiInfo as part of our text layout object. One pattern for fixing this is to change the type parameter from <'text> to AsRef<str>, which would let the text be borrowed or owned, at the caller's request. Another way to fix it is to simply clone the string (it's not as if the implementation is particularly clone-averse). But I'm not convinced we actually need to retain the string. If you look at the methods on BidiInfo, I'm not sure reorder_line() is actually still valid in a modern environment; if my understanding is correct, HarfBuzz always wants its input in logical order. The string is also used in reorder_levels_per_char() to find the codepoint boundaries (I'm not sure this function is useful), and in visual_runs(), again mostly to find codepoint boundaries, but I think the "reset" logic can be run entirely from original_classes without making reference to the text.

Other concerns center around performance. I'm not advocating doing a lot of performance work now, but I also believe that the current API locks in certain decisions that make such performance work difficult in the future. That's mostly around the per-byte (technically, per UTF-8 code unit) arrays to represent things like levels. I think these don't represent the way people typically work with text in Rust, and in any case there are about twice as many UTF-8 code units as UTF-16 for most BiDi work.

So what I'd like better is for visual_runs to return an array of runs, where each run is a range in the text and a level (the latter would be after the reset logic is applied). I think the various "reorder*" functions should go, as I don't see them as being particularly useful, and if they are needed, I think their function can be written quite reasonably in terms of this proposed visual_runs.

Lastly, the stuff around ParagraphInfo seems crufty to me. Why is this passed in? It's possible to infer from the text position. Taking a step back, it's not clear to me that segmenting into paragraphs is a useful function for unic-bidi in the first place. It seems extremely likely that its client will have already done segmentation into paragraphs by the time it gets to BiDi analysis. If the goal of the crate is to cover all of UAX#9, then perhaps paragraph segmentation can be provided through a separate API. Then the scope of BidiInfo could be narrowed to a single paragraph, which I think would be a desirable simplification.

I post this in part to provoke discussion from potential other clients of the crate, and also to test whether unic is a good container for the work.

@sffc

This comment has been minimized.

@dhardy
Copy link

dhardy commented Sep 5, 2020

Agreed, this would be worth discussing. See also: kas-gui/kas-text#20

Two tools are needed here in my opinion:

  1. A tool to resolve the embedding levels according to TR9; this is provided by unciode-bidi and unic-bidi but the implementation is imperfect (e.g. Rules W1–W6 incorrectly applied in a single pass servo/unicode-bidi#8)
  2. https://crates.io/crates/unicode-bidi-mirroring (which works perfectly according to my limited testing)

@raphlinus
Copy link
Author

I'm curious why you need the mirroring. Shouldn't HarfBuzz take care of that for you as part of its RTL shaping? It's entirely possible there's something I'm missing though.

@dhardy
Copy link

dhardy commented Sep 5, 2020

HarfBuzz does take care of it, but I also maintain a simple (Kerning only) alternative, and that uses the mirroring.

@behnam
Copy link
Member

behnam commented Oct 21, 2020

Thanks for the detailed report, @raphlinus; and others for valuable comments.

The state of Bidi Algorithm impl in unic-bidi (and unicode-bidi package, similarly) is indeed far from ideal. This is definitely not because of lack of interest, but because in my exploration on the solution space, I ended up noticing similar challenges around the current API, as explained in this issue.

Sharing some context: I put together the BidiInfo structure together without much API design deliberation, only to solve the specific problem I was dealing with in Servo. (And the main reason there's both unic-bidi and unicode-bidi packages is the fact that I want to change the API of unic-bidi drastically, which Servo may not need for the time being.)

To give you a super-quick update on the my current focus on UNIC (at least for the next few weeks) is to get it back on track for forward-development work, with the goal to get a proper Text API in place, supporting segmentation, bidi and joining control. (From there, would be possible to get to line-break, as well. I haven't plans for myself for that, yet.) One of the changes needed on this road for me, is to split the "explicit bidi run resolution" part of the Bidi Algorithm working under no-std.

And, of course, there's a lot more work to be done for Bidi impl here, in terms of conformance, and performance (specifically, getting to linear or near-linear impl, possibly under no-std conditions).

Now, whether this project is a good place for such work? Well, I have been open about my personal agenda (which is focus on Multilingual Text API), and others have been contributing per their own goals and agenda, sharing what they're comfortable with. I understand well that there's been some confusions because of my lack of resources to get to work areas that others are interested in, and there's not really much I can do about that. My attempt is to push for an initiative where everyone can participate, and every language gets a fair share of support across all the layers of the technology built. I would be more than happy to hear your thoughts on this. :)

@dhardy
Copy link

dhardy commented Oct 21, 2020

From my point of view, unic-bidi should determine embedding levels, and the rest is irrelevant. Why? We have other tools (like xi-unicode for line-breaking and unicode-bidi-mirroring) which handle most of the other jobs which can be done separately already. Re-ordering can only be done after determining line-wrap points which can only be determined after shaping text (my waffle on this subject); it's also not very difficult once the other jobs have been done.

wez added a commit to wez/wezterm that referenced this issue Jan 25, 2022
In order to support RTL/BIDI, wezterm needs a bidi implementation.  I
don't think a well-conforming rust implementation exists today; what I
found were implementations that didn't pass 100% of the conformance
tests.

So I decided to port "bidiref", the reference implementation of the UBA
described in http://unicode.org/reports/tr9/ to Rust.

This implementation focuses on conformance: no special measures have
been taken to optimize it so far, with my focus having been to ensure
that all of the approx 780,000 test cases in the unicode data for
unicode 14 pass.  Having the tests passing 100% allows for making
performance improvements with confidence in the future.

The API isn't completely designed/fully baked.  Until I get to hooking
it up to wezterm's shaper, I'm not 100% sure exactly what I'll need.
There's a good discussion on API in
open-i18n/rust-unic#273 that suggests omitting
"legacy" operations such as reordering. I suspect that wezterm may
actually need that function to support monospace text layout in some
terminal scenarios, but regardless: reordering is part of the
conformance test suite so it remains a part of the API.

That said: the API does model the major operations as separate
phases, so you should be able to pay for just what you use:

* Resolving the embedding levels from a paragraph
* Returning paragraph runs of those levels (and their directions)
* Returning the whitespace-level-reset runs for a line-slice within the
  paragraph
* Returning the reordered indices + levels for a line-slice within the
  paragraph.

refs: #784
refs: kas-gui/kas-text#20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants