Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support emojis with ZWJ and variant selectors #30014

Merged
merged 1 commit into from
Aug 30, 2024
Merged

Conversation

bfredl
Copy link
Member

@bfredl bfredl commented Aug 8, 2024

The implementation of grapheme clusters was upgraded to closely follow extended grapheme clusters as defined by UAX#29 in the unicode standard. Noteworthily, this enables proper display of many more emoji characters than before, including those encoded with multiple
emoji codepoints combined with ZWJ (zero width joiner) codepoints and variant selectors.

Fix #7151
Fix #22014

@zeertzjq zeertzjq added unicode 💩 (multibyte) unicode characters ci:s390x Enable CI for s390x and removed ci:s390x Enable CI for s390x labels Aug 22, 2024
@bfredl bfredl force-pushed the neoemoji branch 2 times, most recently from fc6e6a5 to be34d8f Compare August 23, 2024 11:11
runtime/doc/mbyte.txt Outdated Show resolved Hide resolved
runtime/doc/mbyte.txt Outdated Show resolved Hide resolved
runtime/doc/options.txt Outdated Show resolved Hide resolved
src/nvim/mbyte.c Outdated Show resolved Hide resolved
src/nvim/mbyte.c Outdated Show resolved Hide resolved
@bfredl bfredl force-pushed the neoemoji branch 2 times, most recently from 2829eb5 to 249a7ea Compare August 27, 2024 09:29
@bfredl bfredl marked this pull request as ready for review August 27, 2024 09:51
@bfredl bfredl force-pushed the neoemoji branch 3 times, most recently from 1396f0c to 7bdfbe1 Compare August 27, 2024 10:54
runtime/doc/mbyte.txt Outdated Show resolved Hide resolved
src/nvim/mbyte.c Outdated Show resolved Hide resolved
src/nvim/mbyte.c Outdated Show resolved Hide resolved
@@ -146,7 +146,7 @@ CharSize charsize_regular(CharsizeArg *csarg, char *const cur, colnr_T const vco
} else if (cur_char < 0) {
size = kInvalidByteCells;
} else {
size = char2cells(cur_char);
size = ptr2cells(cur);
Copy link
Member

@zeertzjq zeertzjq Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of makes it pointless to pass in cur_char, as utf_ptr2cells() already handles illegal bytes, and the cur_char >= 0x80 check can be replaced with MB_BYTE2LEN(*cur) > 1.

Or the logic in utf_ptr2cells() can be replicated here without the first utf_ptr2char() to avoid decoding first char twice, but then that will also require passing in ci.chr.len, so not sure if that's worth it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ye, multiple decoding also happens in other places like the main win_line() loop. we probably want a specialized version of CharInfo which also includes the ptr2cells() width calculated at the same time as the byte length. Although I am thinking of that as a follow-up perf PR while only focusing on correctness (and no larger regressions) in this PR.

@@ -352,7 +352,7 @@ static inline CharSize charsize_fast_impl(win_T *const wp, bool use_tabstop, col
if (cur_char < 0) {
width = kInvalidByteCells;
} else {
width = char2cells(cur_char);
width = ptr2cells(cur);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

{
if (cur_char == TAB && use_tabstop) {
return tabstop_padding(vcol, buf->b_p_ts, buf->b_p_vts_array);
} else if (cur_char < 0) {
return kInvalidByteCells;
} else {
return char2cells(cur_char);
return ptr2cells(cur);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

src/nvim/mbyte.c Outdated Show resolved Hide resolved
Use the grapheme break algorithm from utf8proc to support grapheme
clusters from recent unicode versions.

Handle variant selector VS16 turning some codepoints into double-width
emoji. This means we need to use ptr2cells rather than char2cells when
possible.
@bfredl bfredl merged commit 5f95f12 into neovim:master Aug 30, 2024
31 of 32 checks passed
@GitMurf
Copy link

GitMurf commented Sep 2, 2024

@bfredl sorry to ping you but I have been searching everywhere to try and figure out a solution to my problem and I believe it is similar to the problem(s) you were aiming to fix with this PR. I posted the repro and details in a post in the Neovim reddit here: https://www.reddit.com/r/neovim/comments/1f6z9da/help_with_1_keycap_digit_1_emoji_sequence_with/

I am happy to give you more details or move this conversation somewhere else if you prefer, but the TLDR is that the emoji 1️⃣ (and the other similar numbers 2-9) are having problems and I believe it is due to the multiple code points. It is comprised of U+31 + U+FE0F + U+20E3 and doing the str2list returns: { 49, 65039, 8419 }. Thanks in advance!

@zeertzjq
Copy link
Member

zeertzjq commented Sep 2, 2024

That probably can't be fixed due to performance reasons.

@GitMurf
Copy link

GitMurf commented Sep 2, 2024

That probably can't be fixed due to performance reasons.

@zeertzjq thanks for the quick reply! Is there some sort of "fallback" workaround that I could implement in my config? Like an autocmd that would render emojis like these to a broken icon (or an alternative) emoji (something I would choose)? My team uses these number emojis a lot in comments so it is not an option for me to just remove / replace. But I am totally fine if I just render emojis like these as compatible ones ("replace" how it renders on client side but not alter the actual emoji as I don't want to create / commit a change. I would just create a mapping list and add to it anytime these pop up. I'm just not sure if there is a reasonable way to do this (presumably an autocmd)? Thanks!!

@bfredl
Copy link
Member Author

bfredl commented Sep 2, 2024

we could at least mark anything + 0xFE0F as having ambiguous terminal width ( utf_ambiguous_width) . That would fix rendering of the rest of the line getting out of sync, even if it is not having the correct width (which is hard to encode as setcellwidths() does not handle clusters by design).

@GitMurf
Copy link

GitMurf commented Sep 2, 2024

we could at least mark anything + 0xFE0F as having ambiguous terminal width ( utf_ambiguous_width) . That would fix rendering of the rest of the line getting out of sync

Thanks @bfredl ! That sounds great to me! My issue is not the display of the emoji itself but the fact it throws the rest of the line (and often surrounding lines) off. A couple questions:

  1. are you able to confirm that my scenario is actually a problem (as opposed to just being an issue on my side)? Like does it make sense that I am having a problem with this 1️⃣ emoji?
  2. can you think of any temporary workaround for the time being that I can apply on my end? I assume the utf_ambiguous_width is something you have to do in neovim core and not something I can do on my end?

Thanks so much for the quick response!

@clason
Copy link
Member

clason commented Sep 2, 2024

PSA: Please don't leave tangential comments on (especially) on a (merged) PR! If you have a problem, open an issue (yes, that means filling out the template. it's annoying but there for a reason!).

That would also allow a PR to be linked to it, so you wouldn't have missed #30232.

@neovim neovim locked as resolved and limited conversation to collaborators Sep 2, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
ci:s390x Enable CI for s390x unicode 💩 (multibyte) unicode characters
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flag emojis display as squares Emoji modifiers are broken
4 participants