An Emoji capable tokenizer #3

Closed
damienalexandre opened this issue Mar 4, 2016 · 4 comments

Comments

@damienalexandre
Member

We need a specific tokenizer to handle emoji properly in Elasticsearch. Here is the specification.

Emoji Tokenizer

It should be based on the standard tokenizer, to keep its grammar logic and Unicode Text Segmentation implementation, but add support for symbols and emoji as specified in tr51.

Test cases

| Input | Tokens |
| --- | --- |
| Get some donut's. | Get some donut's |
| Get some 🍩's. | Get some 🍩's |
| "🍩" | 🍩 |

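To make the expected behavior in the test cases concrete, here is a minimal Python sketch. It is not the ICU or standard tokenizer grammar, only a regex approximation: the `EMOJI` code point ranges, the `JOINERS` set, and the `tokenize` helper are all assumptions made for illustration, and the ranges do not cover the full tr51 data.

```python
import re

# Assumption: rough emoji code point ranges, NOT the complete tr51 data.
EMOJI = "\U0001F300-\U0001FAFF\u2600-\u27BF"
# Assumption: characters that glue emoji sequences together
# (zero width joiner and variation selector-16).
JOINERS = "\u200D\uFE0F"

# Treat emoji like letters, so that "🍩's" stays together just as "donut's" does.
WORD_CHAR = rf"[\w{EMOJI}{JOINERS}]"
# One or more word characters, optionally continued across inner apostrophes.
TOKEN = re.compile(rf"{WORD_CHAR}+(?:'{WORD_CHAR}+)*")


def tokenize(text: str) -> list[str]:
    """Return the word tokens of `text`, keeping emoji as word characters."""
    return TOKEN.findall(text)
```

With this sketch, `tokenize("Get some 🍩's.")` yields the tokens `Get`, `some` and `🍩's`, and `tokenize('"🍩"')` yields only `🍩`. A real implementation would rely on the Unicode Text Segmentation rules rather than a hand-written character class.
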
@damienalexandre
Member Author

I'm currently working on an ICU 58-based tokenizer with additional custom rules so that emoji and their combinations are treated as words/tokens.
Elasticsearch's ICU plugin is out of date (ICU 54), and Lucene's ICU is too (ICU 56), so I will have to ship my own ICU with the plugin. My only concern is compatibility...

😋
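
To illustrate what "combinations" means here, the Python sketch below keeps an emoji sequence (a base emoji, an optional skin tone modifier, and parts joined by a zero width joiner) together as one unit. It is only an approximation and not the ICU rule set mentioned above: the code point ranges and the `emoji_sequences` helper are assumptions for illustration, and flags (regional indicator pairs) are not handled.

```python
import re

ZWJ = "\u200D"   # zero width joiner, glues emoji into one glyph
VS16 = "\uFE0F"  # variation selector-16, requests emoji presentation
# Assumption: rough code point ranges, not the full tr51 data files.
BASE = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"
MODIFIER = "[\U0001F3FB-\U0001F3FF]"  # skin tone modifiers

# One element: a base emoji with optional presentation selector and modifier.
ELEMENT = f"{BASE}{VS16}?{MODIFIER}?"
# A sequence: one element, then any number of ZWJ-joined elements.
SEQUENCE = re.compile(f"{ELEMENT}(?:{ZWJ}{ELEMENT})*")


def emoji_sequences(text: str) -> list[str]:
    """Return each emoji sequence found in `text` as a single string."""
    return SEQUENCE.findall(text)
```

For example, the family emoji 👩‍👩‍👧 is five code points (three emoji and two joiners), yet `emoji_sequences` returns it as a single item, which is the behavior a tokenizer needs in order to index it as one token.
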

@gibrown

gibrown commented Apr 26, 2017

@damienalexandre did you manage to build an ICU tokenizer that supports emoji tokenization by any chance?

@damienalexandre
Member Author

Yes! The plugin documentation is here: https://github.com/jolicode/emoji-search/tree/master/esplugin

@gibrown

gibrown commented Apr 27, 2017

Looks really cool! Thanks! Adding to our project list to try it out.
