Not all emoji sequences recommended for general interchange (RGI) cluster #3017

rsheeter · 2021-06-09T17:58:12Z

While testing RGI emoji clustering, I found that polar bears and regional flags don't cluster. https://github.com/rsheeter/hb-emoji-clusters/blob/main/try_shape-stdout.txt has a list. For context, I tested all the sequences in https://unicode.org/Public/emoji/14.0/emoji-test.txt, which enumerates emoji sequences.

IIUC per Mark Davis all emoji should be grapheme clusters. I thought that would mean HB would cluster them but seemingly not. I see discussion on #2265 where a fix to make it so was discussed.

In the event emoji were to span multiple files it would help Chrome itemization if emoji rgi consistently formed clusters. If it's desirable that they don't I'd appreciate if someone could ELI5.

behdad · 2021-06-09T20:59:28Z

Regional-indicators are Complicated(TM) as seen in #2265.

Polar bear is the weirdest thing I've seen in Unicode: 🐻 U+1F43B, ZWJ U+200D, ❄ U+2744, VS16 U+FE0F.

This is our code implementing the grapheme logic:

harfbuzz/src/hb-ot-shape.cc

Lines 466 to 517 in bd5502f

    hb_set_unicode_props (hb_buffer_t *buffer)
    {
      /* Implement enough of Unicode Graphemes here that shaping
       * in reverse-direction wouldn't break graphemes.  Namely,
       * we mark all marks and ZWJ and ZWJ,Extended_Pictographic
       * sequences as continuations.  The foreach_grapheme()
       * macro uses this bit.
       *
       * https://www.unicode.org/reports/tr29/#Regex_Definitions
       */
      unsigned int count = buffer->len;
      hb_glyph_info_t *info = buffer->info;
      for (unsigned int i = 0; i < count; i++)
      {
        _hb_glyph_info_set_unicode_props (&info[i], buffer);

        /* Marks are already set as continuation by the above line.
         * Handle Emoji_Modifier and ZWJ-continuation. */
        if (unlikely (_hb_glyph_info_get_general_category (&info[i]) == HB_UNICODE_GENERAL_CATEGORY_MODIFIER_SYMBOL &&
                      hb_in_range<hb_codepoint_t> (info[i].codepoint, 0x1F3FBu, 0x1F3FFu)))
        {
          _hb_glyph_info_set_continuation (&info[i]);
        }
    #ifndef HB_NO_EMOJI_SEQUENCES
        else if (unlikely (_hb_glyph_info_is_zwj (&info[i])))
        {
          _hb_glyph_info_set_continuation (&info[i]);
          if (i + 1 < count &&
              _hb_unicode_is_emoji_Extended_Pictographic (info[i + 1].codepoint))
          {
            i++;
            _hb_glyph_info_set_unicode_props (&info[i], buffer);
            _hb_glyph_info_set_continuation (&info[i]);
          }
        }
    #endif
        /* Or part of the Other_Grapheme_Extend that is not marks.
         * As of Unicode 11 that is just:
         *
         * 200C          ; Other_Grapheme_Extend # Cf       ZERO WIDTH NON-JOINER
         * FF9E..FF9F    ; Other_Grapheme_Extend # Lm   [2] HALFWIDTH KATAKANA VOICED SOUND MARK..HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
         * E0020..E007F  ; Other_Grapheme_Extend # Cf  [96] TAG SPACE..CANCEL TAG
         *
         * ZWNJ is special, we don't want to merge it as there's no need, and keeping
         * it separate results in more granular clusters.  Ignore Katakana for now.
         * Tags are used for Emoji sub-region flag sequences:
         * https://github.com/harfbuzz/harfbuzz/issues/1556
         */
        else if (unlikely (hb_in_range<hb_codepoint_t> (info[i].codepoint, 0xE0020u, 0xE007Fu)))
          _hb_glyph_info_set_continuation (&info[i]);
      }
    }

For emoji, we append any ZWJ,Extended_Pictographic sequence to the previous cluster. U+2744 SNOWFLAKE has the Extended_Pictographic property, so I expect that we handle this sequence correctly. Let me check.
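To make the rule concrete, here is a minimal standalone sketch of the same continuation marking outside HarfBuzz. It is an illustration under stated assumptions, not the library's code: `is_extended_pictographic`, `is_mark`, `mark_continuations` and `check_polar_bear` are hypothetical names, and the two lookup helpers only know the handful of code points needed for the polar bear rather than the real Unicode tables.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a real Extended_Pictographic lookup (assumption: only the two
// code points needed for the polar bear are listed).
static bool is_extended_pictographic(char32_t cp) {
  return cp == 0x1F43B || cp == 0x2744;
}

// Stand-in for the general-category check: only the variation selectors
// U+FE00..U+FE0F are treated as marks here.
static bool is_mark(char32_t cp) { return cp >= 0xFE00 && cp <= 0xFE0F; }

// Returns one flag per code point: true if it continues the previous grapheme.
std::vector<bool> mark_continuations(const std::vector<char32_t>& text) {
  std::vector<bool> cont(text.size(), false);
  for (std::size_t i = 0; i < text.size(); i++) {
    const char32_t cp = text[i];
    if (is_mark(cp) ||
        (cp >= 0x1F3FB && cp <= 0x1F3FF) ||  // Emoji_Modifier (skin tones)
        (cp >= 0xE0020 && cp <= 0xE007F)) {  // tag characters
      cont[i] = true;
    } else if (cp == 0x200D) {               // ZWJ
      cont[i] = true;
      // A ZWJ also pulls in a following Extended_Pictographic code point.
      if (i + 1 < text.size() && is_extended_pictographic(text[i + 1]))
        cont[++i] = true;
    }
  }
  return cont;
}

// Usage: the polar bear is one base code point followed by three continuations.
void check_polar_bear() {
  const std::vector<char32_t> bear = {0x1F43B, 0x200D, 0x2744, 0xFE0F};
  const std::vector<bool> cont = mark_continuations(bear);
  assert(!cont[0] && cont[1] && cont[2] && cont[3]);
}
```

If the lookup did not recognise U+2744, the snowflake would start a new grapheme and the sequence would split into two clusters.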

behdad · 2021-06-09T21:04:49Z

Maybe we should add an equivalent of your code to the test suite.

behdad · 2021-06-09T21:09:01Z

Oops... Bad bug in emoji table generator...

Previously, the last code point of each range with the Extended_Pictographic property was not processed as such. Ouch!

Test:

    $ echo x > null; hb-shape null -u U+1f43b,U+200d,U+2744,U+fe0f

Before: [gid0=0+1000|gid0=2+1000]
After: [gid0=0+1000|gid0=0+1000]

Caught by #3017
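For readers unfamiliar with this class of bug, the sketch below shows how treating an inclusive range end as exclusive drops the last code point. It is a hypothetical illustration only: it is not the code of the emoji table generator, and `Range`, `expand_dropping_last` and `expand_inclusive` are names made up for this example.

```cpp
#include <cassert>
#include <vector>

// A code point range as listed in Unicode data files: both ends inclusive.
struct Range {
  char32_t first;
  char32_t last;
};

// Buggy expansion: treats `last` as exclusive, so the final code point of
// every range is lost (and a single-code-point range yields nothing).
std::vector<char32_t> expand_dropping_last(const Range& r) {
  assert(r.first <= r.last);
  std::vector<char32_t> out;
  for (char32_t cp = r.first; cp < r.last; cp++) out.push_back(cp);
  return out;
}

// Correct expansion: `last` is part of the range.
std::vector<char32_t> expand_inclusive(const Range& r) {
  assert(r.first <= r.last);
  std::vector<char32_t> out;
  for (char32_t cp = r.first; cp <= r.last; cp++) out.push_back(cp);
  return out;
}
```

With an illustrative single-entry range such as U+2744..U+2744, the buggy version returns an empty list, which is the kind of omission that would leave the snowflake unmarked.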

behdad · 2021-06-09T23:53:07Z

@rsheeter Can you please rerun your script against master and attach conclusion?

rsheeter · 2021-06-10T00:23:04Z

It would appear you have rescued the polar bear! rsheeter/hb-emoji-clusters@85bfd1f

behdad · 2021-06-10T00:34:26Z

Re the regional-indicator pairs: #2265 (comment)
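As background, UAX #29 (rules GB12 and GB13) groups regional indicators into consecutive pairs, so a run of four forms two flags rather than one cluster. The sketch below shows one way to express that pairing. It is an assumption-laden illustration, not necessarily what #2265 or #3018 implement, and `is_regional_indicator`, `pair_regional_indicators` and `check_two_flags` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// REGIONAL INDICATOR SYMBOL LETTER A..Z
static bool is_regional_indicator(char32_t cp) {
  return cp >= 0x1F1E6 && cp <= 0x1F1FF;
}

// Marks every second regional indicator of a run as a continuation, so that
// the run is split into pairs.
std::vector<bool> pair_regional_indicators(const std::vector<char32_t>& text) {
  std::vector<bool> cont(text.size(), false);
  for (std::size_t i = 1; i < text.size(); i++) {
    // Join with the previous indicator only if that one starts a pair.
    if (is_regional_indicator(text[i]) &&
        is_regional_indicator(text[i - 1]) && !cont[i - 1])
      cont[i] = true;
  }
  return cont;
}

// Usage: U+1F1FA U+1F1F8 (US) followed by U+1F1E9 U+1F1EA (DE) gives two pairs.
void check_two_flags() {
  const std::vector<char32_t> flags = {0x1F1FA, 0x1F1F8, 0x1F1E9, 0x1F1EA};
  const std::vector<bool> cont = pair_regional_indicators(flags);
  assert(!cont[0] && cont[1] && !cont[2] && cont[3]);
}
```

An odd-length run leaves the last indicator on its own, which matches the pairwise reading of the rules.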

rsheeter · 2021-06-10T01:54:15Z

#3018 probably fixes this.

behdad · 2021-06-10T02:01:55Z

> #3018 probably fixes this.

Interesting to see if it actually does.

rsheeter · 2021-06-10T02:06:02Z

0 / 4702 failures, looks like it does :)

drott · 2021-06-10T07:41:15Z

Big thanks for doing this list, @rsheeter! And glad we found an actual issue here.

khaledhosny · 2021-06-14T14:20:25Z

This should be closed now, right?

behdad · 2021-06-14T15:06:06Z

Is fixed. Importing as a test would be nice.

behdad · 2021-06-14T20:23:50Z

Here's a slow way to get started. It's slow because we don't have --unicode-files...

    $ echo x > null
    $ curl https://www.unicode.org/Public/emoji/13.1/emoji-zwj-sequences.txt \
        | cut -d';' -f1 | cut -d'#' -f1 | grep . \
        | while read line; do
            ./hb-shape --font-file=null --unicodes="$line" --no-positions | grep -q '=[^0]' \
              && ./hb-shape --font-file=null --unicodes="$line" --no-positions --show-text --show-unicode
          done

Maybe just make gen-emoji-table.py emit a test file somewhere in test/shaping/data/in-house/tests. The null font file also has to be non-empty because of #2567.
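The check that the pipeline above performs with `grep -q '=[^0]'` could be written as a small helper when importing this as a test. The following is only a sketch: it assumes hb-shape output of the form `[name=cluster|...]` or `[name=cluster+advance|...]` as seen earlier in this thread, assumes glyph names contain no `=`, and uses the hypothetical names `parse_clusters` and `is_single_cluster`. Unlike the grep, it tests that all cluster values are equal rather than that they are all zero.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Extracts the cluster value that follows each '=' in a line of hb-shape output.
std::vector<unsigned> parse_clusters(const std::string& line) {
  std::vector<unsigned> clusters;
  for (std::size_t i = 0; i < line.size(); i++) {
    if (line[i] != '=') continue;
    std::size_t j = i + 1;
    unsigned value = 0;
    while (j < line.size() &&
           std::isdigit(static_cast<unsigned char>(line[j]))) {
      value = value * 10 + static_cast<unsigned>(line[j] - '0');
      j++;
    }
    assert(j > i + 1);  // assumption: every '=' is followed by a cluster number
    clusters.push_back(value);
    i = j - 1;  // resume scanning after the digits
  }
  return clusters;
}

// True if the line has at least one glyph and all glyphs share one cluster.
bool is_single_cluster(const std::string& line) {
  const std::vector<unsigned> clusters = parse_clusters(line);
  for (unsigned c : clusters)
    if (c != clusters.front()) return false;
  return !clusters.empty();
}
```

Applied to the two outputs quoted earlier, `is_single_cluster` rejects `[gid0=0+1000|gid0=2+1000]` and accepts `[gid0=0+1000|gid0=0+1000]`.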

Fixes #3017. Uses AdobeBlank2.ttf from https://github.com/adobe-fonts/adobe-blank-2 instead of a dummy empty font, so that everything maps to GID 1 and control code points are kept instead of being dropped because there is no space glyph (otherwise we'd need to identify control code points somehow when generating the expectations).

rsheeter changed the title ~~Not all emoji rgi cluster~~ Not all emoji sequences recommended for general interchange (RGI) cluster Jun 9, 2021

khaledhosny added the tests label Jul 28, 2021

khaledhosny mentioned this issue Jul 28, 2021

[test] Add generated tests for emoji clusters #3087

Merged

behdad closed this as completed in #3087 Jul 29, 2021
