Grapheme segmentation with ZWJ sequences #20

natecraddock · 2023-02-08T01:44:37Z

Hello, and thank you for all the hard work on this great library! It has been a pleasure to use.

I do think I have found a bug though. I am using the GraphemeIterator and I noticed that multiple emoji containing Zero Width Joiners, placed one after another, are counted as only one grapheme. For example, 🐻‍❄️🐻‍❄️ would be considered a single grapheme. Here is a minimal reproducing example:

```zig
const std = @import("std");
const ziglyph = @import("ziglyph");

pub fn main() !void {
    var iter = try ziglyph.GraphemeIterator.init("🐻‍❄️🐻‍❄️");
    while (iter.next()) |grapheme| {
        std.debug.print("Found Grapheme: {s} at offset: {}\n", .{ grapheme.bytes, grapheme.offset });
    }
}
```

This outputs:

```
Found Grapheme: 🐻‍❄️🐻‍❄️ at offset: 0
```

Separating the polar bears with a single space leads to three graphemes, displayed as expected:

```
Found Grapheme: 🐻‍❄️ at offset: 0
Found Grapheme:   at offset: 13
Found Grapheme: 🐻‍❄️ at offset: 14
```
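The offsets reported here are byte offsets, and they follow from the UTF-8 length of each code point in the polar bear sequence. As a rough sketch using only the Zig standard library (the helper name `utf8ByteTotal` is made up for illustration and is not part of ziglyph), the 13 bytes can be accounted for like this:

```zig
const std = @import("std");

/// Hypothetical helper: prints each code point in `s` with its UTF-8
/// encoded length and returns the total number of bytes.
fn utf8ByteTotal(s: []const u8) !usize {
    const view = try std.unicode.Utf8View.init(s);
    var it = view.iterator();
    var total: usize = 0;
    while (it.nextCodepoint()) |cp| {
        const len = try std.unicode.utf8CodepointSequenceLength(cp);
        std.debug.print("U+{X:0>4}: {d} byte(s)\n", .{ cp, len });
        total += len;
    }
    return total;
}

pub fn main() !void {
    // Polar bear: Bear Face, ZWJ, Snowflake, Variation Selector 16.
    const polar_bear = "\u{1F43B}\u{200D}\u{2744}\u{FE0F}";
    const total = try utf8ByteTotal(polar_bear);
    std.debug.print("total: {d} bytes\n", .{total}); // total: 13 bytes
}
```

U+1F43B takes four bytes and the other three code points take three bytes each, so the space lands at offset 13 and the second polar bear at offset 14.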

It's possible that I misunderstand grapheme clusters, but I was under the impression that these emoji should be considered distinct "characters"/"symbols". When rendered in the browser for example, each is a distinct symbol, not joined as one.

Also, this specific emoji ends with U+FE0F, but I tested several other ZWJ emoji with the same result.

jecolon · 2023-02-09T19:51:31Z

Hi @natecraddock and thanks for the feedback. Although your reasoning makes sense to me, it seems that Unicode decided that since emoji ZWJ sequences include the ZWJ as an integral part, any such sequence cannot be broken when segmenting text into grapheme clusters. This is Grapheme Break Rule 11 (GB11). Basically it states that any emoji followed by zero or more extending characters followed by a ZWJ and then followed by another emoji is a single cluster and cannot be broken. Emoji ZWJ sequences are one of the wildest parts of Unicode IMO, given that technically there's no limit to how long a sequence can get! Hope this helps.

natecraddock · 2023-02-09T20:21:10Z

Thanks for looking into this @jecolon. I think I understand what you are saying, but I want to double check that I was perfectly clear. I'm not suggesting that ziglyph should break at the ZWJ. I looked at Grapheme Break Rule 11, and I think my example still fits within those requirements.

> Basically it states that any emoji followed by zero or more extending characters followed by a ZWJ and then followed by another emoji is a single cluster and cannot be broken.

Yes, this makes sense. My issue here is that two such emoji clusters are being grouped as one in ziglyph.

The Polar Bear emoji in my example is

U+1F43B (Bear Face)
U+200D (ZWJ)
U+2744 (Snowflake)
U+FE0F (Variation Selector 16)

So this is a \p{Extended_Pictographic}, then a ZWJ, followed by another \p{Extended_Pictographic}.

This should be one single grapheme cluster. Yet when two are placed one after the other in a slice of bytes, ziglyph treats the entire slice of bytes as a single grapheme cluster. So I am unable to iterate over the two emoji individually.

It seems to me like maybe the Variation Selector is throwing off ziglyph?
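To make the expected result concrete, here is a minimal sketch that applies only three of the grapheme break rules (GB9, GB11 and the default GB999) to the eight code points of two adjacent polar bears. It is an illustration, not ziglyph's implementation: the `classify` function is a hypothetical stand-in that knows only the code points in this thread, whereas a real segmenter looks the properties up in the Unicode data tables and implements the remaining rules too.

```zig
const std = @import("std");

const Class = enum { ext_pict, extend, zwj, other };

/// Hypothetical classifier covering only the code points in this thread.
fn classify(cp: u21) Class {
    return switch (cp) {
        0x1F43B, 0x2744 => .ext_pict, // Bear Face, Snowflake
        0xFE0F => .extend, // Variation Selector 16
        0x200D => .zwj,
        else => .other,
    };
}

/// Counts grapheme clusters using only GB9, GB11 and GB999.
fn countClusters(cps: []const u21) usize {
    if (cps.len == 0) return 0;
    var count: usize = 1;
    // True while the code points seen so far end in
    // Extended_Pictographic Extend* (the start of the GB11 pattern).
    var in_pict_run = false;
    // True when the previous code point is a ZWJ directly after such a run.
    var zwj_after_pict = false;
    var i: usize = 0;
    while (i < cps.len) : (i += 1) {
        const class = classify(cps[i]);
        if (i > 0) {
            const no_break = switch (class) {
                .extend, .zwj => true, // GB9: never break before Extend or ZWJ
                .ext_pict => zwj_after_pict, // GB11: only joined via a ZWJ
                .other => false, // GB999: otherwise break
            };
            if (!no_break) count += 1;
        }
        // Update the state using the old value of in_pict_run first.
        zwj_after_pict = (class == .zwj) and in_pict_run;
        in_pict_run = switch (class) {
            .ext_pict => true,
            .extend => in_pict_run,
            else => false,
        };
    }
    return count;
}

pub fn main() void {
    const polar_bear = [_]u21{ 0x1F43B, 0x200D, 0x2744, 0xFE0F };
    const doubled = polar_bear ++ polar_bear;
    std.debug.print("clusters: {d}\n", .{countClusters(&doubled)}); // clusters: 2
}
```

The second Bear Face is preceded by U+FE0F rather than by a ZWJ, so GB11 does not apply at that position and a break is expected there, giving two clusters.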

jecolon · 2023-02-09T23:13:06Z

You're right, and this is a bug. I think I just squashed it and pushed the commit. It includes a test to make sure the segmentation of contiguous emoji sequences is handled correctly. Interesting that the thousands of test cases provided by Unicode don't catch this! Let me know if it works in your use case.

natecraddock · 2023-02-10T00:54:54Z

Thanks for fixing this so fast! I tested and everything works as expected!

> Interesting that the thousands of test cases provided by Unicode don't catch this!

I was thinking the same thing! After looking at the source and test cases you had before, I was hesitant to report a bug.

natecraddock mentioned this issue Feb 8, 2023

Grapheme ZWJ clusters in query natecraddock/zf#19

Closed

natecraddock closed this as completed Feb 10, 2023
