Avoid matching emoji followed by text variation selector (U+FE0E) #61

Open
dbrgn opened this issue May 21, 2019 · 3 comments

@dbrgn

dbrgn commented May 21, 2019

First of all, thanks for this project! It's very useful.

It appears that the regex also matches code points that are followed by a text variation selector (U+FE0E).

The exclamation mark (U+2757) is an emoji with emoji presentation by default. It should be matched both without a variation selector and with an emoji variation selector (U+FE0F).

However, it should not be matched when followed by a text variation selector (U+FE0E).

// Assumes the package's default export; adjust the import to your module setup.
import emojiRegex from 'emoji-regex';

// RegExp.prototype.exec returns RegExpExecArray | null, not string[].
let m: RegExpExecArray | null;

console.info('no variation');
const r1 = emojiRegex();
while ((m = r1.exec('\u2757')) !== null) {
    console.log('match', m, 'lastIndex', r1.lastIndex);
}

console.info('text variation');
const r2 = emojiRegex();
while ((m = r2.exec('\u2757\ufe0e')) !== null) {
    console.log('match', m, 'lastIndex', r2.lastIndex);
}

console.info('emoji variation');
const r3 = emojiRegex();
while ((m = r3.exec('\u2757\ufe0f')) !== null) {
    console.log('match', m, 'lastIndex', r3.lastIndex);
}

This matches the emoji three times, once per input, each time with length 1.

My expectation would be that the version without a variation selector is matched with length 1, that the version with the emoji variation selector is matched with length 2, and that the version with the text variation selector is not matched at all.
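The expectations above can be written down as a small table-driven check. This is only a sketch: `matchLengths`, `expectations`, `report`, and `reportedBehaviour` are hypothetical names introduced here, and `reportedBehaviour` is a hand-written stand-in that mimics the behaviour described in this report, not the library's actual pattern. To check the real library, pass `emojiRegex` as the factory instead.

```typescript
type RegexFactory = () => RegExp;

// Collect the length of every match of a global regex in `input`.
function matchLengths(makeRegex: RegexFactory, input: string): number[] {
    const re = makeRegex();
    if (!re.global) {
        throw new Error('expected a regex with the g flag');
    }
    const lengths: number[] = [];
    let m: RegExpExecArray | null;
    while ((m = re.exec(input)) !== null) {
        lengths.push(m[0].length);
    }
    return lengths;
}

// Expected match lengths as stated in the report: [label, input, lengths].
const expectations: Array<[string, string, number[]]> = [
    ['no variation', '\u2757', [1]],
    ['text variation', '\u2757\uFE0E', []],
    ['emoji variation', '\u2757\uFE0F', [2]],
];

// Compare actual match lengths against the expectations, one line per case.
function report(makeRegex: RegexFactory): string[] {
    return expectations.map(([label, input, expected]) => {
        const actual = matchLengths(makeRegex, input);
        const ok = JSON.stringify(actual) === JSON.stringify(expected);
        return `${label}: ${ok ? 'ok' : 'MISMATCH'} (got ${JSON.stringify(actual)})`;
    });
}

// Hypothetical stand-in that ignores variation selectors entirely.
const reportedBehaviour: RegexFactory = () => /\u2757/g;

console.log(report(reportedBehaviour).join('\n'));
```

With the stand-in, only the first case passes, which mirrors the three length-1 matches described above.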

@ragurney

Any update on this?

@mathiasbynens mathiasbynens changed the title from "Support for text variation selector (FE0E)" to "Avoid matching emoji followed by text variation selector (U+FE0E)" on Oct 20, 2020
@mathiasbynens
Owner

Good catch. We could add a negative lookahead for \uFE0E (that is, (?!\uFE0E)) to avoid matching in the text variation case. I think that's all there is to do here, for the following reasons.

For the emoji variation selector case, I'd expect it to match just the emoji itself in case that's already an RGI_Emoji string, and only to match the emoji + the emoji variation selector as a whole in cases where the emoji is unqualified by itself (i.e. where the variation selector is not redundant). This matches the spec. https://unicode.org/reports/tr51/#def_basic_emoji_set says:

ED-20. basic emoji set — The set of emoji characters and emoji presentation sequences listed in the emoji-sequences.txt file [emoji-data] under the type_field Basic_Emoji.

  • This is the set of emoji intended for general-purpose input.
  • This set excludes all instances of an emoji component, which are not intended for independent, direct input.
  • This set otherwise includes all instances of an emoji character with the property value Emoji_Presentation = Yes and all instances of a valid emoji presentation sequence whose base character has the property value Emoji_Presentation = No.

The sequence U+2757 U+FE0F is a valid presentation sequence per emoji-variation-sequences.txt, but since U+2757 by itself already has Emoji_Presentation = Yes (not No) it’s not included in RGI_Emoji.

TL;DR U+2757 is RGI_Emoji, but U+2757 U+FE0F is not.
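A minimal sketch of what this would mean in practice, using a hand-written toy pattern rather than the generated emoji-regex source. It covers only two characters: U+2757, which is RGI on its own, and U+263A, which is assumed here to have Emoji_Presentation = No and therefore to be RGI only as the sequence U+263A U+FE0F. The names `toyRgiEmoji` and `toyMatchLengths` are hypothetical.

```typescript
// Toy pattern for two characters only; NOT the real emoji-regex pattern.
//   \u263A\uFE0F : base assumed Emoji_Presentation = No, so only the sequence is RGI
//   \u2757       : Emoji_Presentation = Yes, so the bare character is RGI
// The trailing (?!\uFE0E) rejects a match that is followed by the
// text variation selector.
function toyRgiEmoji(): RegExp {
    return /(?:\u263A\uFE0F|\u2757)(?!\uFE0E)/g;
}

// Collect the length of every match in `input`.
function toyMatchLengths(input: string): number[] {
    const re = toyRgiEmoji();
    const lengths: number[] = [];
    let m: RegExpExecArray | null;
    while ((m = re.exec(input)) !== null) {
        lengths.push(m[0].length);
    }
    return lengths;
}

console.log(toyMatchLengths('\u2757'));        // [ 1 ]
console.log(toyMatchLengths('\u2757\uFE0F'));  // [ 1 ]  (redundant U+FE0F is left unmatched)
console.log(toyMatchLengths('\u2757\uFE0E'));  // []
console.log(toyMatchLengths('\u263A\uFE0F'));  // [ 2 ]
console.log(toyMatchLengths('\u263A\uFE0E'));  // []
```

Under this reading, U+2757 U+FE0F yields a length-1 match rather than the length-2 match requested in the original report, because the selector is redundant there.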

References to the relevant data files follow.


https://unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt

2757          ; Emoji                # E0.6   [1] (❗)       exclamation mark

https://unicode.org/Public/13.0.0/ucd/emoji/emoji-variation-sequences.txt

2757 FE0E  ; text style;  # (5.2) HEAVY EXCLAMATION MARK SYMBOL
2757 FE0F  ; emoji style; # (5.2) HEAVY EXCLAMATION MARK SYMBOL

https://unicode.org/Public/emoji/13.1/emoji-sequences.txt

2757          ; Basic_Emoji                  ; red exclamation mark                                           # E0.6   [1] (❗)
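For anyone who wants to cross-check these files programmatically, the quoted lines share a simple shape: hex code points, then semicolon-separated fields, then an optional `#` comment. The sketch below parses just that shape. `DataLine` and `parseDataLine` are hypothetical names, and code point ranges such as `231A..231B`, which also occur in the real files, are deliberately not handled.

```typescript
interface DataLine {
    codePoints: number[]; // e.g. [0x2757, 0xFE0F]
    fields: string[];     // remaining fields, trimmed, empty ones dropped
}

// Parse one line; returns null for blank or comment-only lines.
// Assumption: the first field holds space-separated hex code points (no ranges).
function parseDataLine(line: string): DataLine | null {
    const hashAt = line.indexOf('#');
    const content = (hashAt === -1 ? line : line.slice(0, hashAt)).trim();
    if (content === '') {
        return null;
    }
    const parts = content.split(';').map((s) => s.trim());
    const first = parts[0] ?? '';
    const codePoints = first.split(/\s+/).map((hex) => parseInt(hex, 16));
    const fields = parts.slice(1).filter((f) => f !== '');
    return { codePoints, fields };
}

const textStyle = parseDataLine('2757 FE0E  ; text style;  # (5.2) HEAVY EXCLAMATION MARK SYMBOL');
console.log(textStyle); // code points 0x2757 and 0xFE0E, fields ['text style']

const basic = parseDataLine('2757          ; Basic_Emoji                  ; red exclamation mark   # E0.6   [1]');
console.log(basic);     // code point 0x2757, fields ['Basic_Emoji', 'red exclamation mark']
```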

@mirabilos

Yes, fix this please! This is appalling, I hate seeing my favourite characters (like U+263A) get Emoji presentation by default suddenly because more than half the software I see doesn’t implement variant selection properly. This is a bug with massive impact, and anything suffixed with U+FE0E must not be rendered as Emoji!


4 participants