support emojis with ZWJ and variant selectors #30014

bfredl · 2024-08-08T17:35:17Z

The implementation of grapheme clusters was upgraded to closely follow extended grapheme clusters as defined by UAX #29 in the Unicode standard. Notably, this enables proper display of many more emoji characters than before, including those encoded with multiple emoji codepoints combined with ZWJ (zero width joiner) codepoints and variant selectors.

Fix #7151
Fix #22014

test/old/testdir/test_normal.vim

runtime/doc/mbyte.txt

runtime/doc/options.txt

src/nvim/mbyte.c

test/functional/ui/multibyte_spec.lua

runtime/doc/mbyte.txt

src/nvim/mbyte.c

zeertzjq · 2024-08-27T13:33:31Z

src/nvim/plines.c

@@ -146,7 +146,7 @@ CharSize charsize_regular(CharsizeArg *csarg, char *const cur, colnr_T const vco
  } else if (cur_char < 0) {
    size = kInvalidByteCells;
  } else {
-    size = char2cells(cur_char);
+    size = ptr2cells(cur);


This kind of makes it pointless to pass in cur_char, as utf_ptr2cells() already handles illegal bytes, and the cur_char >= 0x80 check can be replaced with MB_BYTE2LEN(*cur) > 1.

Or the logic in utf_ptr2cells() can be replicated here without the first utf_ptr2char() to avoid decoding first char twice, but then that will also require passing in ci.chr.len, so not sure if that's worth it.
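To illustrate the first suggestion, here is a minimal sketch of deciding "is this a multibyte character?" from the lead byte alone, without decoding. `lead_byte_len()` and `starts_multibyte()` are hypothetical stand-ins written for this example; they are not Neovim's `MB_BYTE2LEN` macro or its lookup table, and the exact byte ranges Neovim accepts may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a lead-byte length lookup. Returns the number of
 * bytes a UTF-8 sequence starting with `b` claims to occupy. Stray
 * continuation bytes and invalid lead bytes map to 1. */
static int lead_byte_len(unsigned char b)
{
  if (b < 0x80) {
    return 1;  /* ASCII */
  }
  if (b >= 0xC2 && b <= 0xDF) {
    return 2;
  }
  if (b >= 0xE0 && b <= 0xEF) {
    return 3;
  }
  if (b >= 0xF0 && b <= 0xF4) {
    return 4;
  }
  return 1;  /* continuation byte or invalid lead byte */
}

/* True when the byte at `p` starts a multibyte sequence. No decoding needed. */
static int starts_multibyte(const char *p)
{
  assert(p != NULL);
  return lead_byte_len((unsigned char)*p) > 1;
}
```

With a check like this, a caller would only need the pointer, which is the point of the comment above: the decoded `cur_char` stops being necessary for the ASCII/non-ASCII decision.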

ye, multiple decoding also happens in other places like the main win_line() loop. we probably want a specialized version of CharInfo which also includes the ptr2cells() width calculated at the same time as the byte length. Although I am thinking of that as a follow-up perf PR while only focusing on correctness (and no larger regressions) in this PR.
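A rough sketch of what such a combined result could look like, decoding once and carrying the byte length and the cell width together. Everything here is an assumption made for illustration: `WideCharInfo`, `toy_width()` and `decode_with_width()` are invented names, the width rule is a toy range check rather than real Unicode width tables, and overlong encodings are not rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: everything learned from decoding one UTF-8 sequence once. */
typedef struct {
  int32_t value;  /* codepoint, or -1 for an invalid sequence */
  int len;        /* bytes consumed, always >= 1 */
  int width;      /* display cells (placeholder 1 for invalid input) */
} WideCharInfo;

/* Toy width rule for illustration only: a single emoji block is double width. */
static int toy_width(int32_t c)
{
  return (c >= 0x1F300 && c <= 0x1FAFF) ? 2 : 1;
}

/* Decode the sequence at `p` (NUL-terminated input) and compute its width in
 * the same pass, so callers never have to decode the same bytes twice. */
static WideCharInfo decode_with_width(const unsigned char *p)
{
  assert(p != NULL);
  WideCharInfo ci = { -1, 1, 1 };
  unsigned char b = p[0];
  int n;
  int32_t c;

  if (b < 0x80) {
    n = 1; c = b;
  } else if ((b & 0xE0) == 0xC0) {
    n = 2; c = b & 0x1F;
  } else if ((b & 0xF0) == 0xE0) {
    n = 3; c = b & 0x0F;
  } else if ((b & 0xF8) == 0xF0) {
    n = 4; c = b & 0x07;
  } else {
    return ci;  /* invalid lead byte */
  }
  for (int i = 1; i < n; i++) {
    if ((p[i] & 0xC0) != 0x80) {
      return ci;  /* truncated sequence; a NUL terminator also stops here */
    }
    c = (c << 6) | (p[i] & 0x3F);
  }
  ci.value = c;
  ci.len = n;
  ci.width = toy_width(c);
  return ci;
}
```

The trade-off is a slightly larger struct passed around in exchange for one decode per character, which matches the "follow-up perf PR" framing above.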

zeertzjq · 2024-08-27T13:33:51Z

src/nvim/plines.c

@@ -352,7 +352,7 @@ static inline CharSize charsize_fast_impl(win_T *const wp, bool use_tabstop, col
    if (cur_char < 0) {
      width = kInvalidByteCells;
    } else {
-      width = char2cells(cur_char);
+      width = ptr2cells(cur);


zeertzjq · 2024-08-27T13:33:59Z

src/nvim/plines.c

 {
  if (cur_char == TAB && use_tabstop) {
    return tabstop_padding(vcol, buf->b_p_ts, buf->b_p_vts_array);
  } else if (cur_char < 0) {
    return kInvalidByteCells;
  } else {
-    return char2cells(cur_char);
+    return ptr2cells(cur);


src/nvim/mbyte.c

Use the grapheme break algorithm from utf8proc to support grapheme clusters from recent unicode versions. Handle variant selector VS16 turning some codepoints into double-width emoji. This means we need to use ptr2cells rather than char2cells when possible.
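Why the pointer matters can be shown with a small sketch: the width of a base codepoint can depend on what follows it, so a function that sees only one codepoint cannot get it right. This is not the utf8proc-based implementation from the commit. `codepoint_cells()` and `sequence_cells()` are hypothetical names, the width rule is a toy, and the sketch widens any single-width base followed by VS16, whereas the commit says only some codepoints are affected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VS16 0xFE0F  /* VARIATION SELECTOR-16: requests emoji presentation */

/* Toy per-codepoint width, analogous to a char2cells-style lookup that sees
 * a single codepoint and nothing else. */
static int codepoint_cells(uint32_t c)
{
  if (c == VS16 || c == 0x200D || c == 0x20E3) {
    return 0;  /* VS16, ZWJ, combining enclosing keycap */
  }
  if (c >= 0x1F300 && c <= 0x1FAFF) {
    return 2;  /* toy range standing in for wide emoji */
  }
  return 1;
}

/* Width of the character starting at seq[0], looking one codepoint ahead,
 * analogous to a ptr2cells-style lookup that can see the following text. */
static int sequence_cells(const uint32_t *seq, size_t n)
{
  assert(seq != NULL || n == 0);
  if (n == 0) {
    return 0;
  }
  int w = codepoint_cells(seq[0]);
  if (n > 1 && seq[1] == VS16 && w == 1) {
    return 2;  /* base + VS16 is shown as a double-width emoji */
  }
  return w;
}
```

For example, U+2764 alone reports one cell from `codepoint_cells()`, while the pair U+2764 U+FE0F reports two from `sequence_cells()`.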

GitMurf · 2024-09-02T05:49:35Z

@bfredl sorry to ping you but I have been searching everywhere to try and figure out a solution to my problem and I believe it is similar to the problem(s) you were aiming to fix with this PR. I posted the repro and details in a post in the Neovim reddit here: https://www.reddit.com/r/neovim/comments/1f6z9da/help_with_1_keycap_digit_1_emoji_sequence_with/

I am happy to give you more details or move this conversation somewhere else if you prefer, but the TLDR is that the emoji 1️⃣ (and the other similar numbers 2-9) are having problems and I believe it is due to the multiple code points. It is composed of U+0031 + U+FE0F + U+20E3, and calling str2list on it returns: { 49, 65039, 8419 }. Thanks in advance!
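For reference, the three code points can be reproduced outside of Neovim by decoding the UTF-8 bytes of the keycap sequence. The helper below is a minimal sketch written for this example (`utf8_to_codepoints()` is a hypothetical name, not a Neovim function); it stops at the first invalid or truncated sequence and does not reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a NUL-terminated UTF-8 string into at most `cap` codepoints.
 * Returns the number of codepoints written to `out`. */
static size_t utf8_to_codepoints(const char *s, uint32_t *out, size_t cap)
{
  assert(s != NULL && out != NULL);
  const unsigned char *p = (const unsigned char *)s;
  size_t count = 0;

  while (*p != 0 && count < cap) {
    uint32_t c;
    int extra;  /* number of continuation bytes expected */

    if (*p < 0x80) {
      c = *p; extra = 0;
    } else if ((*p & 0xE0) == 0xC0) {
      c = *p & 0x1F; extra = 1;
    } else if ((*p & 0xF0) == 0xE0) {
      c = *p & 0x0F; extra = 2;
    } else if ((*p & 0xF8) == 0xF0) {
      c = *p & 0x07; extra = 3;
    } else {
      break;  /* invalid lead byte */
    }
    p++;
    while (extra-- > 0) {
      if ((*p & 0xC0) != 0x80) {
        return count;  /* truncated sequence */
      }
      c = (c << 6) | (uint32_t)(*p++ & 0x3F);
    }
    out[count++] = c;
  }
  return count;
}
```

Fed the bytes `31 EF B8 8F E2 83 A3`, this yields 49, 65039 and 8419, the same values as the str2list output quoted above: the digit, VS16, and the combining enclosing keycap.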

zeertzjq · 2024-09-02T06:12:33Z

That probably can't be fixed due to performance reasons.

GitMurf · 2024-09-02T06:52:38Z

> That probably can't be fixed due to performance reasons.

@zeertzjq thanks for the quick reply! Is there some sort of "fallback" workaround that I could implement in my config? Like an autocmd that would render emojis like these as a broken icon or as an alternative emoji (something I would choose)? My team uses these number emojis a lot in comments so it is not an option for me to just remove / replace them. But I am totally fine if I just render emojis like these as compatible ones ("replace" how it renders on the client side but not alter the actual emoji, as I don't want to create / commit a change). I would just create a mapping list and add to it anytime these pop up. I'm just not sure if there is a reasonable way to do this (presumably an autocmd)? Thanks!!

bfredl · 2024-09-02T09:16:51Z

we could at least mark anything + 0xFE0F as having ambiguous terminal width (utf_ambiguous_width). That would fix rendering of the rest of the line getting out of sync, even if it is not having the correct width (which is hard to encode as setcellwidths() does not handle clusters by design).

GitMurf · 2024-09-02T12:56:09Z

> we could at least mark anything + 0xFE0F as having ambiguous terminal width (utf_ambiguous_width). That would fix rendering of the rest of the line getting out of sync

Thanks @bfredl ! That sounds great to me! My issue is not the display of the emoji itself but the fact it throws the rest of the line (and often surrounding lines) off. A couple questions:

1. Are you able to confirm that my scenario is actually a problem (as opposed to just being an issue on my side)? Like does it make sense that I am having a problem with this 1️⃣ emoji?
2. Can you think of any temporary workaround for the time being that I can apply on my end? I assume the utf_ambiguous_width is something you have to do in neovim core and not something I can do on my end?

Thanks so much for the quick response!

clason · 2024-09-02T12:58:17Z

PSA: Please don't leave tangential comments on a PR (especially a merged one)! If you have a problem, open an issue (yes, that means filling out the template. it's annoying but there for a reason!).

That would also allow a PR to be linked to it, so you wouldn't have missed #30232.

This was referenced Aug 8, 2024

recognize 0xFE0F selector changing BMP chars into full-width emoji:s #10026

Closed

Emoji modifiers are broken #7151

Closed

bfredl force-pushed the neoemoji branch 3 times, most recently from 99ca4a4 to 5dd17d9 Compare August 13, 2024 12:03

bfredl force-pushed the neoemoji branch 2 times, most recently from 15a15a7 to 41bc866 Compare August 22, 2024 08:51

zeertzjq reviewed Aug 22, 2024

test/old/testdir/test_normal.vim (outdated)

bfredl force-pushed the neoemoji branch from 41bc866 to c98b3e8 Compare August 22, 2024 12:38

zeertzjq reviewed Aug 22, 2024

test/old/testdir/test_normal.vim (outdated)

bfredl force-pushed the neoemoji branch from c98b3e8 to 863cb30 Compare August 22, 2024 15:12

zeertzjq added the unicode 💩 and ci:s390x labels and removed the ci:s390x label Aug 22, 2024

bfredl force-pushed the neoemoji branch 2 times, most recently from fc6e6a5 to be34d8f Compare August 23, 2024 11:11

clason reviewed Aug 23, 2024

runtime/doc/mbyte.txt (outdated)

runtime/doc/mbyte.txt (outdated)

runtime/doc/options.txt (outdated)

src/nvim/mbyte.c (outdated)

src/nvim/mbyte.c (outdated)

bfredl force-pushed the neoemoji branch 2 times, most recently from 2829eb5 to 249a7ea Compare August 27, 2024 09:29

bfredl marked this pull request as ready for review August 27, 2024 09:51