Emoji regex doesn't capture the variation selector after textual symbols #1

Closed
rgrove opened this issue May 24, 2017 · 3 comments

rgrove commented May 24, 2017

Thanks for this excellent resource! I've found it incredibly useful. I have run into one problem, though.

The emoji regex doesn't capture the \uFE0F variation selector that's used to indicate that a non-emoji character like \u2764 (heavy black heart) should be treated as an emoji. Here's a simple repro case:

```javascript
const regex = require('emoji-database/regex');

'❤️'.match(regex);
// => [ '❤', index: 0, input: '❤️' ]
```

Here it is again with codepoint escapes instead of literal characters, in case your browser/OS combo actually renders a standalone \u2764 as a red heart (Chrome on OS X doesn't, at least):

```javascript
const regex = require('emoji-database/regex');

'\u2764\uFE0F'.match(regex);
// => [ '\u2764', index: 0, input: '\u2764\uFE0F' ]
```

This issue seems to be present for all characters that don't have the Emoji_Presentation property (meaning that they default to a text representation rather than an emoji representation unless followed by \uFE0F).
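
To see which characters fall into this group, one option is a Unicode property escape. This is only a minimal sketch, assuming a JavaScript engine that supports `\p{…}` with the `u` flag (ES2018 or later); the helper name `defaultsToEmoji` is hypothetical and not part of emoji-database:

```javascript
// Hypothetical helper, not part of emoji-database.
// Returns true when the first code point has the Emoji_Presentation property,
// i.e. it renders as an emoji by default without a trailing \uFE0F.
function defaultsToEmoji(char) {
  return /^\p{Emoji_Presentation}/u.test(char);
}

defaultsToEmoji('\u2764');    // false: heavy black heart defaults to text
defaultsToEmoji('\u{1F604}'); // true: U+1F604 defaults to emoji
```

Characters for which this returns false are the ones that need the trailing \uFE0F to request emoji presentation.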

The fix seems to be to match an optional \uFE0F at the end of the regex:

```javascript
const fixedRegex = new RegExp(require('emoji-database/regex').source + '\uFE0F?');

'❤️'.match(fixedRegex);
// => [ '❤️', index: 0, input: '❤️' ]
```

I could see an argument for not capturing \uFE0F by default since that may not always be desirable, so I wasn't sure if this was intentional or not. If it is intentional, it might be helpful to mention this caveat (and the workaround) in the readme.
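
One caveat with appending to `.source`: if a pattern has a top-level alternation, a suffix binds only to the last alternative. Whether the generated emoji-database pattern has that shape is an assumption here (the workaround above did match the heart). The sketch below uses a small stand-in pattern, `baseRegex`, rather than the real one, and wraps the source in a non-capturing group as a defensive measure:

```javascript
// Stand-in for the generated pattern (assumption: top-level alternation).
const baseRegex = /\u2764|\u263A/;

// Naive append: the optional selector attaches only to the last alternative.
const naive = new RegExp(baseRegex.source + '\\uFE0F?', baseRegex.flags);

// Grouped append: the optional selector applies to every alternative.
const grouped = new RegExp('(?:' + baseRegex.source + ')\\uFE0F?', baseRegex.flags);

const heart = '\u2764\uFE0F';
const naiveLength = heart.match(naive)[0].length;     // 1: selector not captured
const groupedLength = heart.match(grouped)[0].length; // 2: selector captured
```

Passing `baseRegex.flags` through also keeps any flags the original pattern carried, which `.source` alone drops.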

Thanks again!

Owner

milesj commented May 24, 2017

Thanks for pointing this out!

The regex is currently built using regexgen and the literal Unicode character: https://github.com/milesj/emoji-database/blob/master/src/bin/generate-regex.js#L35. Based on the attached image, it looks like the default Unicode character is the text representation (as you pointed out):

[Image: screenshot, 2017-05-24, 10:55:37 AM]

Now, the best way to solve this is quite tricky. I suppose the default regex pattern should match everything, including both text and emoji representations. I could then add additional regex patterns that are text only, emoji only, etc. Will have to dig a bit deeper.
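
As a rough illustration of that idea (not the implementation that shipped), the three variants could look something like the sketch below. It assumes a tiny hypothetical list, `textDefaultSymbols`, in place of the full dataset, and uses \uFE0E (text presentation selector) and \uFE0F (emoji presentation selector):

```javascript
// Hypothetical stand-in list; the real pattern is generated from a full dataset.
const textDefaultSymbols = ['\u2764', '\u263A', '\u2600'];
const base = '(?:' + textDefaultSymbols.join('|') + ')';

// Matches the symbol with either selector or with none.
const anyPresentation = new RegExp(base + '[\\uFE0E\\uFE0F]?');

// Matches only when the emoji presentation selector follows.
const emojiOnly = new RegExp(base + '\\uFE0F');

// Matches only when no emoji selector follows; a text selector is optional.
const textOnly = new RegExp(base + '(?!\\uFE0F)\\uFE0E?');

anyPresentation.test('\u2764\uFE0F'); // true
emojiOnly.test('\u2764');             // false
textOnly.test('\u2764\uFE0F');        // false
```

The split only makes sense for symbols that default to text; characters with Emoji_Presentation would need the opposite defaults.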

Owner

milesj commented Jun 4, 2017

This is fixed in the next version.

Author

rgrove commented Jun 4, 2017

Awesome! Thanks. 😄

milesj closed this as completed Jun 4, 2017