Grapheme segmentation with ZWJ sequences #20
Comments
Hi @natecraddock, and thanks for the feedback. Although your reasoning makes sense to me, it seems that Unicode decided that, since Emoji modifier sequences can include a ZWJ as an integral part, any such sequence cannot be broken when segmenting text into grapheme clusters. This is Grapheme Break Rule 11 (GB11). Basically, it states that any emoji followed by zero or more extending characters, followed by a ZWJ, and then followed by another emoji is a single cluster and cannot be broken. Emoji modifier sequences are one of the wildest parts of Unicode IMO, given that technically there's no limit to how long a sequence can get! Hope this helps.
Thanks for looking into this @jecolon. I think I understand what you are saying, but I want to double-check that I was perfectly clear. I'm not suggesting that ziglyph should break at a ZWJ. I looked at Grapheme Break Rule 11, and I think my example still fits within those requirements.
Yes, this makes sense. My issue here is that two such emoji clusters are being grouped as one in ziglyph. The Polar Bear emoji in my example is
So this is an This should be one single grapheme cluster. Yet when two are placed one after the other in a slice of bytes, ziglyph treats the entire slice of bytes as a single grapheme cluster, so I am unable to iterate over the two emoji one at a time. It seems to me like maybe the Variation Selector is throwing off ziglyph?
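To make the expectation concrete, here is a rough Python sketch (not ziglyph code, and not a full UAX #29 implementation). It applies only rules GB9 and GB11 and breaks everywhere else. The property sets `EXTEND` and `EXT_PICT` are tiny hand-picked stand-ins for the real Unicode tables, and the name `clusters` is made up for this illustration.

```python
# Simplified sketch: only GB9 and GB11 are modelled, and the property
# sets below are assumptions covering just the code points in this example.
ZWJ = 0x200D
EXTEND = {0xFE0F}             # VARIATION SELECTOR-16 only
EXT_PICT = {0x1F43B, 0x2744}  # bear face and snowflake only

# Polar bear: bear face, ZWJ, snowflake, variation selector-16.
POLAR_BEAR = "\U0001F43B\u200D\u2744\uFE0F"


def clusters(text):
    """Split text using only GB9 and GB11; break at every other boundary."""
    out = []
    current = ""
    # GB11 state: 0 = nothing relevant seen,
    #             1 = seen ExtPict Extend*,
    #             2 = seen ExtPict Extend* ZWJ
    state = 0
    for ch in text:
        cp = ord(ch)
        if not current:
            join = False                      # start of text
        elif cp == ZWJ or cp in EXTEND:
            join = True                       # GB9: no break before Extend or ZWJ
        elif cp in EXT_PICT and state == 2:
            join = True                       # GB11: no break after ExtPict Extend* ZWJ
        else:
            join = False
        if join:
            current += ch
        else:
            if current:
                out.append(current)
            current = ch
        # Advance the GB11 state for the code point just consumed.
        if cp in EXT_PICT:
            state = 1
        elif cp in EXTEND and state == 1:
            state = 1
        elif cp == ZWJ and state == 1:
            state = 2
        else:
            state = 0
    if current:
        out.append(current)
    return out


print(len(clusters(POLAR_BEAR * 2)))                   # two adjacent polar bears
print(len(clusters(POLAR_BEAR + " " + POLAR_BEAR)))    # separated by a space
```

In this sketch the second bear face is preceded by U+FE0F rather than a ZWJ, so GB11 does not apply at that boundary and the two polar bears come out as two clusters. With a space in between there are three clusters.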
You're right, and this is a bug. I think I just squashed it and pushed the commit. It includes a test to make sure the segmentation of contiguous emoji sequences is handled correctly. Interesting that the thousands of test cases provided by Unicode don't catch this! Let me know if it works in your use case.
Thanks for fixing this so fast! I tested and everything works as expected!
I was thinking the same thing! After looking at the source and test cases you had before, I was hesitant to report a bug.
Hello, and thank you for all the hard work on this great library! It has been a pleasure to use.
I do think I have found a bug though. I am using the GraphemeIterator and I noticed that multiple emoji with Zero Width Joiners in a row are only counted as one grapheme. For example, 🐻❄️🐻❄️ would be considered a single grapheme. Here is a minimal reproducing example
outputs
Separating the polar bears with a single space leads to three graphemes displayed as expected
It's possible that I misunderstand grapheme clusters, but I was under the impression that these emoji should be considered distinct "characters"/"symbols". When rendered in the browser for example, each is a distinct symbol, not joined as one.
Also, this specific emoji ends with U+FE0F, but I tested several other ZWJ emoji with the same result.
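For anyone who wants to inspect the sequence, a short Python sketch using the standard `unicodedata` module lists the code points that make up the polar bear emoji. The helper name `describe` is made up for this illustration, and the string is written with escapes so the invisible ZWJ and variation selector cannot be lost in copy and paste.

```python
import unicodedata

# Polar bear emoji spelled out with escapes.
POLAR_BEAR = "\U0001F43B\u200D\u2744\uFE0F"


def describe(text):
    """Return (code point, Unicode name) pairs for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code, name in describe(POLAR_BEAR):
    print(code, name)
```

The sequence is U+1F43B BEAR FACE, U+200D ZERO WIDTH JOINER, U+2744 SNOWFLAKE, U+FE0F VARIATION SELECTOR-16, so the only ZWJ sits between the bear face and the snowflake.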