Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not all emoji sequences recommended for general interchange (RGI) cluster #3017

Closed
rsheeter opened this issue Jun 9, 2021 · 13 comments · Fixed by #3087
Closed

Not all emoji sequences recommended for general interchange (RGI) cluster #3017

rsheeter opened this issue Jun 9, 2021 · 13 comments · Fixed by #3087
Labels

Comments

@rsheeter
Copy link
Collaborator

rsheeter commented Jun 9, 2021

In testing emoji rgi clustering it seems polar bears and regional flags don't cluster. https://github.com/rsheeter/hb-emoji-clusters/blob/main/try_shape-stdout.txt has a list. For context, I tested all the sequences in https://unicode.org/Public/emoji/14.0/emoji-test.txt, which enumerates emoji sequences.

IIUC per Mark Davis all emoji should be grapheme clusters. I thought that would mean HB would cluster them but seemingly not. I see discussion on #2265 where a fix to make it so was discussed.

In the event emoji were to span multiple files it would help Chrome itemization if emoji rgi consistently formed clusters. If it's desirable that they don't I'd appreciate if someone could ELI5.

@behdad
Copy link
Member

behdad commented Jun 9, 2021

Regional-indicators are Complicated(TM) as seen in #2265.

Polar bear is the weirdest thing I've seen in Unicode: 🐻 U+1F43B, ZWJ U+200D,❄ U+2744, FE0F

This is our code implementing the grapheme logic:

harfbuzz/src/hb-ot-shape.cc

Lines 466 to 517 in bd5502f

hb_set_unicode_props (hb_buffer_t *buffer)
{
/* Implement enough of Unicode Graphemes here that shaping
* in reverse-direction wouldn't break graphemes. Namely,
* we mark all marks and ZWJ and ZWJ,Extended_Pictographic
* sequences as continuations. The foreach_grapheme()
* macro uses this bit.
*
* https://www.unicode.org/reports/tr29/#Regex_Definitions
*/
unsigned int count = buffer->len;
hb_glyph_info_t *info = buffer->info;
for (unsigned int i = 0; i < count; i++)
{
_hb_glyph_info_set_unicode_props (&info[i], buffer);
/* Marks are already set as continuation by the above line.
* Handle Emoji_Modifier and ZWJ-continuation. */
if (unlikely (_hb_glyph_info_get_general_category (&info[i]) == HB_UNICODE_GENERAL_CATEGORY_MODIFIER_SYMBOL &&
hb_in_range<hb_codepoint_t> (info[i].codepoint, 0x1F3FBu, 0x1F3FFu)))
{
_hb_glyph_info_set_continuation (&info[i]);
}
#ifndef HB_NO_EMOJI_SEQUENCES
else if (unlikely (_hb_glyph_info_is_zwj (&info[i])))
{
_hb_glyph_info_set_continuation (&info[i]);
if (i + 1 < count &&
_hb_unicode_is_emoji_Extended_Pictographic (info[i + 1].codepoint))
{
i++;
_hb_glyph_info_set_unicode_props (&info[i], buffer);
_hb_glyph_info_set_continuation (&info[i]);
}
}
#endif
/* Or part of the Other_Grapheme_Extend that is not marks.
* As of Unicode 11 that is just:
*
* 200C ; Other_Grapheme_Extend # Cf ZERO WIDTH NON-JOINER
* FF9E..FF9F ; Other_Grapheme_Extend # Lm [2] HALFWIDTH KATAKANA VOICED SOUND MARK..HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
* E0020..E007F ; Other_Grapheme_Extend # Cf [96] TAG SPACE..CANCEL TAG
*
* ZWNJ is special, we don't want to merge it as there's no need, and keeping
* it separate results in more granular clusters. Ignore Katakana for now.
* Tags are used for Emoji sub-region flag sequences:
* https://github.com/harfbuzz/harfbuzz/issues/1556
*/
else if (unlikely (hb_in_range<hb_codepoint_t> (info[i].codepoint, 0xE0020u, 0xE007Fu)))
_hb_glyph_info_set_continuation (&info[i]);
}
}

For emoji, we append any ZWJ,Extended_Pictograph sequence to the previous cluster. U+2744 SNOWFLAKE is in that list. So I expect that we handle this sequence correctly. Let me check.

@behdad
Copy link
Member

behdad commented Jun 9, 2021

Maybe we should add an equivalent of your code to the test suite.

@behdad
Copy link
Member

behdad commented Jun 9, 2021

Oops... Bad bug in emoji table generator...

behdad added a commit that referenced this issue Jun 9, 2021
Previously, the last of each range having Extended_Pictograph property
was not processed as so. Ouch!

Test:

$ echo x > null; hb-shape null -u U+1f43b,U+200d,U+2744,U+fe0f

Before:
[gid0=0+1000|gid0=2+1000]

After:
[gid0=0+1000|gid0=0+1000]

Caught by #3017
@rsheeter rsheeter changed the title Not all emoji rgi cluster Not all emoji sequences recommended for general interchange (RGI) cluster Jun 9, 2021
@behdad
Copy link
Member

behdad commented Jun 9, 2021

@rsheeter Can you please rerun your script against master and attach conclusion?

@rsheeter
Copy link
Collaborator Author

It would appear you have rescued the polar bear! rsheeter/hb-emoji-clusters@85bfd1f

@behdad
Copy link
Member

behdad commented Jun 10, 2021

Re the regional-indicator pairs: #2265 (comment)

@rsheeter
Copy link
Collaborator Author

#3018 probably fixes this;.

@behdad
Copy link
Member

behdad commented Jun 10, 2021

#3018 probably fixes this;.

Interesting to see if it actually does.

@rsheeter
Copy link
Collaborator Author

0 / 4702 failures, looks like it does :)

@drott
Copy link
Collaborator

drott commented Jun 10, 2021

Big thanks for doing this list, @rsheeter! And glad we found an actual issue here.

@khaledhosny
Copy link
Collaborator

This should be closed now, right?

@behdad
Copy link
Member

behdad commented Jun 14, 2021

Is fixed. Importing as a test would be nice.

@behdad
Copy link
Member

behdad commented Jun 14, 2021

Here's a slow way to get started. It's slow because we don't have --unicode-files...

$ echo x > null; curl https://www.unicode.org/Public/emoji/13.1/emoji-zwj-sequences.txt | cut -d';' -f1 | cut -d'#' -f1 | grep . | while read line; do ./hb-shape --font-file=null --unicodes="$line" --no-positions | grep -q '=[^0]' && ./hb-shape --font-file=null --unicodes="$line" --no-positions --show-text --show-unicode; done

Maybe just make gen-emoji-table.py emit a test file somewhere in test/shaping/data/in-house/tests. The need for non-empty null font file is also because of #2567

khaledhosny added a commit that referenced this issue Jul 28, 2021
Fixes #3017

Uses AdobeBlank2.ttf from:

  https://github.com/adobe-fonts/adobe-blank-2

instead of a dummy empty font so that everything maps to GID 1 and
control code points are kept instead of being dropped because there is
not space glyph (otherwise we’d need to identify control code points
somehow when generating the expectations).
behdad pushed a commit that referenced this issue Jul 29, 2021
Fixes #3017

Uses AdobeBlank2.ttf from:

  https://github.com/adobe-fonts/adobe-blank-2

instead of a dummy empty font so that everything maps to GID 1 and
control code points are kept instead of being dropped because there is
not space glyph (otherwise we’d need to identify control code points
somehow when generating the expectations).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants