Correct Grapheme Width #6012

Open
pascalkuthe opened this issue Feb 16, 2023 · 4 comments
Assignees
Labels
A-core Area: Helix core improvements A-helix-term Area: Helix term improvements C-bug Category: This is a bug

Comments

@pascalkuthe
Member

pascalkuthe commented Feb 16, 2023

There have been multiple issues piling up on the issue tracker where helix behaves weirdly in the presence of certain Unicode graphemes (usually emoji-like characters).

Furthermore, some emoji are rendered too wide, with additional black space. Open the following line in helix for example:

// example of where unicode width is currently wrong: 🤦🏼‍♂️ (taken from https://hsivonen.fi/string-length/)

I have been looking into these issues and the underlying cause/fix is the same but the solution is not trivial. To avoid collecting a bunch of unrelated issues I decided to create an umbrella issue here and record the results of my research.

Since the problems manifest differently (or not at all) across various editors, I assumed that this was simply a case of there being no common standard (I was not 100% wrong, see below) and that we could not do much about it. However, looking into this further, it seems that other TUI applications (like nvim) do handle these characters correctly. While some characters may overlap in terminals that resize characters (kitty), there are no weird visual glitches like with helix.

It seems that terminal emulators, for compatibility reasons, mostly agree on how many terminal columns a grapheme should take (even if kitty renders some graphemes larger, that doesn't affect the actual grid layout).

The problem is that the width supplied by unicode_width does not align with this character grid. Adding one or two small edge cases as suggested in #4932 (comment) doesn't work because there are actually a LOT of edge cases (that all behave differently). The comment by wez linked there is quite old.
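To make the mismatch concrete, here is a minimal sketch of what goes wrong when widths are summed per code point. The width table is a toy assumption covering only the five scalar values of the example emoji above. It is not the output of unicode_width or any other crate.

```rust
/// Toy per-code-point widths for the scalar values in the example emoji.
/// These numbers are illustrative assumptions, not taken from a real crate.
fn toy_char_width(c: char) -> usize {
    match c {
        '\u{200D}' | '\u{FE0F}' => 0,   // ZWJ and VS16: zero width on their own
        '\u{1F926}' | '\u{1F3FC}' => 2, // face palm and skin tone modifier: wide
        _ => 1,                         // everything else (here: U+2642 male sign)
    }
}

/// Sum the widths code point by code point, ignoring grapheme clustering.
fn naive_width(s: &str) -> usize {
    s.chars().map(toy_char_width).sum()
}
```

With the example sequence U+1F926 U+1F3FC U+200D U+2642 U+FE0F, naive_width returns 5, while a terminal that clusters the sequence into a single emoji would typically give it 2 cells. That difference is the kind of extra space described above.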

Nowadays wezterm uses termwiz instead, which uses https://github.com/ridiculousfish/widecharwidth/ to generate a much more accurate column width function (and then performs some special casing and emoji detection on top of that).

Even then depending on which version of Unicode is targeted the correct output may be different, see https://wezfurlong.org/wezterm/config/lua/config/unicode_version.html.

There are a couple ways forward:

  • We should replace unicode_width in helix-core with something more accurate based on https://github.com/ridiculousfish/widecharwidth/, similar to what termwiz does
  • This also needs to be done in helix-tui
    • open question: Does termwiz do any further magic here (I don't think so), or is just using the correct width enough?
  • We should allow configuring the Unicode version like wezterm does. Ideally we could even try to support these OSC escape sequences to set the correct Unicode version
  • We might just switch to termwiz and get all of this for free. However, termwiz is quite heavy (large codebase, depends on multiple hashing algorithms, the pest parser generator and 3 different Unicode segmentation crates). Do we want to do that?
@pascalkuthe pascalkuthe added C-bug Category: This is a bug C-discussion Category: Discussion or questions that doesn't represent real issues A-helix-term Area: Helix term improvements labels Feb 16, 2023
@pascalkuthe pascalkuthe self-assigned this Feb 16, 2023
@pascalkuthe pascalkuthe added A-core Area: Helix core improvements and removed C-discussion Category: Discussion or questions that doesn't represent real issues labels Feb 16, 2023
@kchibisov

The problem is that the width supplied by unicode_width does not align with this character grid. Adding one or two small edge cases as suggested in #4932 (comment) doesn't work because there are actually a LOT of edge cases (that all behave differently). The comment by wez linked there is quite old.

The width of characters is usually defined by the Unicode standard, so the comment wrt emojis is not really good (if you follow the link chain). If you have ever tried using a terminal which does ZWJ combinations (kitty) and put them in e.g. bash, it'll simply blow up.

Changing the width function will simply shift the issue: you'll probably make things look the same in wezterm, but break 3 other terminals using conservative width functions, like wcwidth from glibc or the unicode-width crate.

I think the only real way to solve anything here is to use OSC sequences which help define the width for edge cases, like ZWJ, and at the very least to do some research on who supports what. But I think I have only heard about it and never seen it; the contour author probably told me about it at some point.

A good idea would be to check what contour, kitty, and wezterm do wrt handling of conservative applications, like bash, and whether they unconditionally alter the width (I think at least some of them alter the width at runtime).

To sum up, changing the width function will simply move the issue to some other terminals from the ones you see in the reports.

Also, you linked issues from Windows and kitty. While kitty is known to be "advanced in that area" (it does emoji combining, breaking the total width), I should warn any reader not familiar with Windows about the state of things on that platform.

When it comes to Windows, you have a shim (ConPty) between you (helix) and the terminal. This shim maintains its own grid, does reflow on it (at least it was doing so in the past), and in some old revisions wasn't even passing CJK through the way it should. The cursor movements are also weird (I think I get a report from a Windows user on a monthly basis that they can't move one char up in a plain, fully ASCII environment, and that updating the Windows version solves the issue).

So unless Microsoft adds a passthrough mode to their shim and provides it to every other terminal on Windows, I'd take every issue from the Windows platform with a grain of salt. You can't really solve them; you simply wait for Microsoft to fix their software.

Also, be aware that ConPty is also being bundled by some terminals, because Microsoft doesn't really care about updating the system ConPty version, so you can't be sure what is even used in such issues.

@EpocSquadron
Contributor

EpocSquadron commented Oct 3, 2023

The author of the still-in-private-beta ghostty terminal wrote about this fairly recently. An emerging standard (mode 2027) from the contour author allows us to query for support for proper grapheme width calculation and fall back to wcwidth if not, which should achieve much better results.

@rockorager

rockorager commented Oct 28, 2023

The author of the still-in-private-beta ghostty terminal wrote about this fairly recently. An emerging standard (mode 2027) from contour author allows us to query for support for proper grapheme width calculation and fall back to wcwidth if not, which should achieve much better results.

Note that foot also supports this (PR).

Foot, contour, ghostty, and wezterm are the only four terminals which employ grapheme clustering in this way (at least that I have run across). I think you can be fairly confident that if you get a response that 2027 is set / set-able, then you can use correct Unicode width calculations.
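As a rough sketch of what that query could look like, the snippet below builds the standard DECRQM request for private mode 2027 and parses the DECRPM reply (`CSI ? 2027 ; Ps $ y`). The names `ModeStatus`, `parse_mode_2027_reply`, and `use_grapheme_widths` are hypothetical, and the policy in the last function is only one possible reading of "set / set-able". Reading the reply from the terminal (raw mode, timeouts) is left out.

```rust
/// DECRQM request for private mode 2027: CSI ? 2027 $ p
const QUERY_MODE_2027: &str = "\x1b[?2027$p";

/// The Ps values a DECRPM reply can carry.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ModeStatus {
    NotRecognized,    // Ps = 0
    Set,              // Ps = 1
    Reset,            // Ps = 2 (supported, currently off)
    PermanentlySet,   // Ps = 3
    PermanentlyReset, // Ps = 4
}

/// Parse a reply of the form CSI ? 2027 ; Ps $ y. Returns None for anything else.
fn parse_mode_2027_reply(reply: &str) -> Option<ModeStatus> {
    let ps = reply.strip_prefix("\x1b[?2027;")?.strip_suffix("$y")?;
    match ps {
        "0" => Some(ModeStatus::NotRecognized),
        "1" => Some(ModeStatus::Set),
        "2" => Some(ModeStatus::Reset),
        "3" => Some(ModeStatus::PermanentlySet),
        "4" => Some(ModeStatus::PermanentlyReset),
        _ => None,
    }
}

/// Hypothetical policy: use grapheme-aware widths only if the mode is on or
/// can be turned on; otherwise (or if no reply arrived) fall back to wcwidth.
fn use_grapheme_widths(status: Option<ModeStatus>) -> bool {
    matches!(
        status,
        Some(ModeStatus::Set | ModeStatus::Reset | ModeStatus::PermanentlySet)
    )
}
```

If the status comes back as Reset, the application would presumably still need to enable the mode itself (`CSI ? 2027 h`) before relying on grapheme-aware widths.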

@mitchellh

allows us to query for support for proper grapheme width calculation and fall back to wcwidth if not

Thanks for linking my blog post 😄 Happy to answer any questions about this if you have them. This quoted conclusion you came to is one of my general recommendations for terminal applications: assume libc wcwidth unless the terminal responds to mode 2027 and then use Unicode standard character width.

Note (as I say in the blog post) this is still not a safe assumption. If mode 2027 is not present, terminals do ALL sorts of stuff. The only safe way to do anything without mode 2027 is to query the cursor position after any character but that's pretty terrible.
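For illustration only, the cursor-position approach could be sketched like this: send the standard cursor position request (`CSI 6 n`) before and after printing a grapheme, then compare the reported columns. The helper names are hypothetical, and the terminal I/O itself is omitted.

```rust
/// Cursor position request (DSR 6): CSI 6 n
const REQUEST_CURSOR_POSITION: &str = "\x1b[6n";

/// Parse a cursor position report of the form CSI row ; col R
/// into a 1-based (row, col) pair.
fn parse_cursor_position(reply: &str) -> Option<(u16, u16)> {
    let body = reply.strip_prefix("\x1b[")?.strip_suffix('R')?;
    let (row, col) = body.split_once(';')?;
    Some((row.parse().ok()?, col.parse().ok()?))
}

/// Width the terminal actually used for whatever was printed between the two
/// reports. Returns None if the cursor changed rows (e.g. the line wrapped)
/// or moved backwards, since the column delta is meaningless then.
fn measured_width(before: (u16, u16), after: (u16, u16)) -> Option<u16> {
    if before.0 != after.0 {
        return None;
    }
    after.1.checked_sub(before.1)
}
```

Each measurement costs a round trip to the terminal, which is why doing this after every character is impractical.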

So the only reason I recommend assuming libc wcwidth is because it gives you a sound explanation of why your program behaves the way it does in the face of people reporting issues. And because in most terminals wcwidth is also how they work. But you can't bet on it.

Also note that you have to handle VS15/VS16. I'm not familiar with the Rust ecosystem, but looking at the fish library you linked, it does not seem to handle VS15/16 for you (that's not abnormal). In this case, you need to modify any character width to 1 for VS15 and 2 for VS16. To be totally correct, you should only do this if VS15/16 is valid for the grapheme, which can be checked in the UCD.
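A minimal sketch of that adjustment, assuming the caller has already split the text into grapheme clusters and computed a base width some other way. The function name is hypothetical, and the UCD validity check mentioned above is deliberately not implemented here.

```rust
/// VS15 (U+FE0E) requests text presentation, VS16 (U+FE0F) emoji presentation.
const VS15: char = '\u{FE0E}';
const VS16: char = '\u{FE0F}';

/// Override the base width of one grapheme cluster according to its
/// variation selector: 2 columns for VS16, 1 column for VS15.
/// Simplification: this does not check whether the selector is actually
/// valid for the base character, so it can over-apply the override.
fn apply_variation_selector(grapheme: &str, base_width: usize) -> usize {
    if grapheme.contains(VS16) {
        2
    } else if grapheme.contains(VS15) {
        1
    } else {
        base_width
    }
}
```

For example, U+2642 (male sign) with a base width of 1 becomes 2 when followed by VS16, and an emoji with a base width of 2 becomes 1 when followed by VS15.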
