When removing emojis with .gsub, I'm getting error on compare with empty string. #4

Closed
andreleoni opened this issue Nov 21, 2018 · 2 comments

andreleoni commented Nov 21, 2018

Hello. I’m trying to use the gem to remove emojis from strings, but I’m getting an error when comparing the result with the expected string.

[33] pry(#<RSpec::Matchers::DSL::Matcher>)> regex_result = '🛤🎯📮📘↕⭕🇬🇶🇼🇸📪🛎👨‍🌾🏺🐚🤯'.gsub(Unicode::Emoji::REGEX_ANY, '')
=> "‍"
[34] pry(#<RSpec::Matchers::DSL::Matcher>)> regex_result == ''
=> false

[36] pry(#<RSpec::Matchers::DSL::Matcher>)> Marshal.dump(regex_result)
=> "\x04\bI\"\b\xE2\x80\x8D\x06:\x06ET"
[37] pry(#<RSpec::Matchers::DSL::Matcher>)> Marshal.dump('')
=> "\x04\bI\"\x00\x06:\x06ET"

What am I doing wrong here? 😅

@andreleoni andreleoni changed the title Removing unicode marshall error When removing emojis with .gsub, I'm getting error on compare with empty string. Nov 21, 2018
Owner

janlelis commented Nov 25, 2018

Hey Andre,

Although REGEX_ANY matches a lot of emoji-related codepoints, it does not match some Unicode codepoints that are used by emoji but also occur outside the emoji world, such as U+200D ZERO WIDTH JOINER (ZWJ). That is exactly what is happening here: there is still a ZWJ in the data:

uniscribe '🛤🎯📮📘↕⭕🇬🇶🇼🇸📪🛎👨‍🌾🏺🐚🤯'.gsub(Unicode::Emoji::REGEX_ANY, '')

200D ├─ ]‍[		├─ ZERO WIDTH JOINER

I've clarified this behavior in the README table.
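The same leftover can also be inspected with plain Ruby, without uniscribe. This is a minimal sketch that assumes the three payload bytes shown in the Marshal.dump output above (E2 80 8D); the variable names are only illustrative:

```ruby
# Payload bytes taken from the Marshal.dump output in the report.
leftover = "\xE2\x80\x8D".dup.force_encoding(Encoding::UTF_8)

# The string looks empty when printed, but it holds one invisible character.
leftover_empty = leftover.empty?
leftover_length = leftover.length

# List each codepoint in the usual U+XXXX notation.
leftover_codepoints = leftover.codepoints.map { |cp| format('U+%04X', cp) }

puts leftover_empty               # false
puts leftover_length              # 1
puts leftover_codepoints.inspect  # ["U+200D"]
```

Calling `codepoints` on any string that looks blank but fails `== ''` is a quick way to spot invisible characters like this one.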

What you want to do is use REGEX, which gives you better (and more robust) results. For example:

uniscribe '🛤🎯📮📘↕⭕🇬🇶🇼🇸📪🛎👨‍🌾🏺🐚🤯'.gsub(Unicode::Emoji::REGEX, '')

Unfortunately, this will let through textual emoji like

2195 ├─ ↕		├─ UP DOWN ARROW

To work around this issue, you can also remove emoji that match REGEX_TEXT, for example like this:

'🛤🎯📮📘↕⭕🇬🇶🇼🇸📪🛎👨‍🌾🏺🐚🤯'.gsub(Regexp.union(Unicode::Emoji::REGEX, Unicode::Emoji::REGEX_TEXT), '') == "" # => true
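To illustrate what Regexp.union does in the line above: it joins its arguments into one alternation, so a single gsub pass removes anything either pattern matches. The sketch below runs without the gem by using two small stand-in character classes. STAND_IN_EMOJI and STAND_IN_TEXT are hypothetical and far narrower than the gem's real REGEX and REGEX_TEXT:

```ruby
# Hypothetical stand-ins, NOT the gem's patterns: a slice of the emoji
# blocks, and two characters that default to text presentation.
STAND_IN_EMOJI = /[\u{1F300}-\u{1FAFF}]/
STAND_IN_TEXT  = /[\u{2195}\u{2B55}]/

# One combined pattern that matches whatever either stand-in matches.
combined = Regexp.union(STAND_IN_EMOJI, STAND_IN_TEXT)

# "a", DIRECT HIT (U+1F3AF), "b", UP DOWN ARROW (U+2195), "c"
sample = "a\u{1F3AF}b\u{2195}c"

only_emoji_removed = sample.gsub(STAND_IN_EMOJI, '')
both_removed       = sample.gsub(combined, '')

puts only_emoji_removed == "ab\u{2195}c"  # true: the arrow is still there
puts both_removed == 'abc'                # true
```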

Please let me know whether this fixes your issue.

Actually, your feedback inspired me to add a REGEX_ALL regex in a future version of this gem, which will also match textual emoji; see #5

@janlelis
Owner

Closing; please reopen if the problem persists.
