Grapheme segmentation with ZWJ sequences #20

Closed
natecraddock opened this issue Feb 8, 2023 · 4 comments

natecraddock commented Feb 8, 2023

Hello, and thank you for all the hard work on this great library! It has been a pleasure to use.

I do think I have found a bug, though. I am using the GraphemeIterator and I noticed that multiple emoji with Zero Width Joiners in a row are counted as only one grapheme. For example, 🐻‍❄️🐻‍❄️ is treated as a single grapheme. Here is a minimal reproducing example:

const std = @import("std");
const ziglyph = @import("ziglyph");

pub fn main() !void {
    var iter = try ziglyph.GraphemeIterator.init("🐻‍❄️🐻‍❄️");
    while (iter.next()) |grapheme| {
        std.debug.print("Found Grapheme: {s} at offset: {}\n", .{grapheme.bytes, grapheme.offset});
    }
}

outputs

Found Grapheme: 🐻‍❄️🐻‍❄️ at offset: 0

Separating the polar bears with a single space leads to three graphemes displayed as expected

Found Grapheme: 🐻‍❄️ at offset: 0
Found Grapheme:   at offset: 13
Found Grapheme: 🐻‍❄️ at offset: 14

It's possible that I misunderstand grapheme clusters, but I was under the impression that these emoji should be considered distinct "characters"/"symbols". When rendered in the browser for example, each is a distinct symbol, not joined as one.

Also, this specific emoji ends with U+FE0F, but I tested several other ZWJ emoji with the same result.

jecolon commented Feb 9, 2023

Hi @natecraddock and thanks for the feedback. Although your reasoning makes sense to me, it seems that Unicode decided that since emoji ZWJ sequences include the ZWJ as an integral part, any such sequence cannot be broken when segmenting text into grapheme clusters. It's Grapheme Break Rule 11. Basically it states that any emoji followed by zero or more extending characters followed by a ZWJ and then followed by another emoji is a single cluster and cannot be broken. Emoji ZWJ sequences are one of the wildest parts of Unicode IMO, given that technically there's no limit to how long a sequence can get! Hope this helps.
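To make the rule concrete, here is a minimal sketch of a cluster counter that applies only the "never break before Extend or ZWJ" rule and Rule 11, and breaks everywhere else. It is not ziglyph's implementation. The helpers `isExtendedPictographic`, `isExtend`, and `countClusters` are hypothetical, and the two property checks cover only the code points in the polar bear example instead of the real Unicode tables.

```zig
const std = @import("std");

const zwj: u21 = 0x200D;

// Assumption: stand-in for the real Extended_Pictographic table.
// Only knows Bear Face and Snowflake.
fn isExtendedPictographic(cp: u21) bool {
    return cp == 0x1F43B or cp == 0x2744;
}

// Assumption: stand-in for Grapheme_Cluster_Break=Extend.
// Only knows Variation Selector 16.
fn isExtend(cp: u21) bool {
    return cp == 0xFE0F;
}

// Counts clusters using a simplified rule set: no break before
// Extend or ZWJ, no break inside "ExtPict Extend* ZWJ x ExtPict"
// (Rule 11), and a break before everything else.
fn countClusters(cps: []const u21) usize {
    if (cps.len == 0) return 0;
    var count: usize = 1;
    // True while we have seen "ExtPict Extend*" with no ZWJ yet.
    var in_pict_seq = isExtendedPictographic(cps[0]);
    // True immediately after "ExtPict Extend* ZWJ".
    var joined = false;
    for (cps[1..]) |cp| {
        if (cp == zwj) {
            joined = in_pict_seq;
            in_pict_seq = false;
        } else if (isExtend(cp)) {
            joined = false; // in_pict_seq is left unchanged
        } else {
            // Rule 11 is the only case here that suppresses a break.
            if (!(joined and isExtendedPictographic(cp))) count += 1;
            in_pict_seq = isExtendedPictographic(cp);
            joined = false;
        }
    }
    return count;
}

pub fn main() void {
    // Two polar bears back to back.
    const two_bears = [_]u21{
        0x1F43B, 0x200D, 0x2744, 0xFE0F,
        0x1F43B, 0x200D, 0x2744, 0xFE0F,
    };
    std.debug.print("clusters: {d}\n", .{countClusters(&two_bears)});
}
```

Under these simplified rules the second Bear Face starts a new cluster, because it follows U+FE0F and not a ZWJ, so the input yields two clusters.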

natecraddock commented Feb 9, 2023

Thanks for looking into this @jecolon. I think I understand what you are saying, but I want to double check that I was perfectly clear. I'm not suggesting that ziglyph should break at the ZWJ. I looked at Grapheme Break Rule 11, and I think my example still fits within those requirements.

> Basically it states that any emoji followed by zero or more extending characters followed by a ZWJ and then followed by another emoji is a single cluster and cannot be broken.

Yes, this makes sense. My issue here is that two such emoji clusters are being grouped as one in ziglyph.

The Polar Bear emoji in my example is

U+1F43B (Bear Face)
U+200D (ZWJ)
U+2744 (Snowflake)
U+FE0F (Variation Selector 16)

So this is an \p{Extended_Pictographic}, then a ZWJ, followed by another \p{Extended_Pictographic}.

This should be one single grapheme cluster. Yet when two are placed one after the other in a slice of bytes, ziglyph treats the entire slice of bytes as a single grapheme cluster. So I am unable to iterate over the two emoji as separate graphemes.

It seems to me like maybe the Variation Selector is throwing off ziglyph?
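The code point breakdown listed above can be checked with the standard library alone. This is a small sketch, independent of ziglyph, that decodes the polar bear sequence with `std.unicode.Utf8View`. The constant name `polar_bear` is made up for the example, and the string is written with `\u{...}` escapes so the invisible ZWJ and variation selector are explicit.

```zig
const std = @import("std");

// Bear Face, ZWJ, Snowflake, Variation Selector 16.
const polar_bear = "\u{1F43B}\u{200D}\u{2744}\u{FE0F}";

pub fn main() !void {
    // Utf8View.init validates the bytes as UTF-8 before iterating.
    const view = try std.unicode.Utf8View.init(polar_bear);
    var it = view.iterator();
    while (it.nextCodepoint()) |cp| {
        std.debug.print("U+{X}\n", .{cp});
    }
    // 4 + 3 + 3 + 3 bytes in UTF-8.
    std.debug.print("bytes: {d}\n", .{polar_bear.len});
}
```

The 13-byte length matches the offsets in the original report, where the space after the first polar bear sits at offset 13.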

jecolon commented Feb 9, 2023

You're right, and this is a bug. I think I just squashed it and pushed the commit. It includes a test to make sure the segmentation of contiguous emoji sequences is handled correctly. Interesting that the thousands of test cases provided by Unicode don't catch this! Let me know if it works in your use case.

@natecraddock

Thanks for fixing this so fast! I tested and everything works as expected!

> Interesting that the thousands of test cases provided by Unicode don't catch this!

I was thinking the same thing! After looking at the source and test cases you had before, I was hesitant to report a bug.
