Plugin release for ElasticSearch 7.8 #26
Comments
Hi! There is no need for the plugin with Elasticsearch version >= 6.4, as the ICU library has been updated. So with your 7.8 you just have to install the "analysis-icu" plugin (because you need to use icu_tokenizer) and use the dictionary as a synonym token filter. Something like this:
|
I suggest this blog post for more information: https://jolicode.com/blog/elasticsearch-icu-now-understands-emoji |
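For readers following along, the setup described above (ICU tokenizer plus the emoji dictionary as a synonym filter) could be sketched roughly as follows. This is only an illustration, written in Python so the settings can be built and inspected as a plain dict. The analyzer name `with_emoji`, the filter name `emoji_synonyms`, and the synonyms path are placeholders and not taken from this thread; only `icu_tokenizer` (from the `analysis-icu` plugin) and the `synonym` filter type come from the comment above.

```python
import json

# Hypothetical path, for illustration only: point it at wherever the
# emoji dictionary file lives in your Elasticsearch config directory.
SYNONYMS_PATH = "analysis/emoji-synonyms.txt"


def build_index_settings(synonyms_path):
    """Build index settings that tokenize with ICU and expand emoji via synonyms."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Placeholder filter name; loads the emoji dictionary as synonyms.
                    "emoji_synonyms": {
                        "type": "synonym",
                        "synonyms_path": synonyms_path,
                    }
                },
                "analyzer": {
                    # Placeholder analyzer name.
                    "with_emoji": {
                        "type": "custom",
                        # icu_tokenizer keeps emoji as tokens.
                        "tokenizer": "icu_tokenizer",
                        "filter": ["emoji_synonyms"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that could be sent when creating the index.
    print(json.dumps(build_index_settings(SYNONYMS_PATH), indent=2))
```

The printed JSON is the kind of body you would send when creating the index; adjust the names and path to your own setup.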
Hmm, then maybe my question is wrong, haha. I got the following error when creating the index:
and assumed from that other thread I would need your plugin to fix this. |
Yes, I suspect you didn't use ICU at all when you got this error? |
That's true. Thanks for getting back to me so quickly, I'm sure I can make it work. 🙂 |
We did make it work, eventually! However, our client reported a strange bug in which the query "🍏☀️" would yield results, but "☀️🍏" would not. Upon inspection, the first emoji is converted into synonyms, but the second one isn't:
GET /stedelijk_nl/_analyze
{
"analyzer": "dutch_with_emoji",
"text": "🍏☀️️️"
}
Response:
{
"tokens" : [
{
"token" : """🍏""",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "appel",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "fruit",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "groen",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "groen",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "☀",
"start_offset" : 2,
"end_offset" : 6,
"type" : "<EMOJI>",
"position" : 1
},
{
"token" : "appel",
"start_offset" : 2,
"end_offset" : 6,
"type" : "SYNONYM",
"position" : 1
}
]
}
When I flip the order of the emoji, the ☀️ is converted to synonyms but the apple is not, which is very odd behavior. For the record:
Other than that there are some stemming and stopwords filters, but I've removed all of these and it doesn't seem to make a difference. |
Thanks for reporting this issue. I have some questions:
VARIATION SELECTOR-16 is used to force the EMOJI version of |
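One way to check for invisible characters such as VARIATION SELECTOR-16 is to list the code points of the query string outside Elasticsearch. The sketch below is an illustration in Python, not something used in this thread. It assumes the sun in the query is U+2600 followed by three U+FE0F characters, which is a guess consistent with the reported offsets (2 to 6) rather than a confirmed fact. Lucene offsets count UTF-16 code units, so the helper `utf16_length` (a name made up here) reproduces that arithmetic.

```python
import unicodedata


def describe(text):
    """Return a (code point, Unicode name) pair for every character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def utf16_length(text):
    """Length in UTF-16 code units, the unit used for token offsets."""
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    apple = "\U0001F34F"
    # Assumed input: a sun followed by three variation selectors.
    sun = "\u2600\uFE0F\uFE0F\uFE0F"

    for code_point, name in describe(apple + sun):
        print(code_point, name)

    # The apple alone accounts for offsets 0 to 2; the sun plus its
    # selectors would account for the remaining code units.
    print(utf16_length(apple), utf16_length(sun))
```

If the second token spans more code units than the visible emoji needs, the extra units are a hint that invisible characters are attached to it.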
I remove it via a char filter, type
I agree, it does! I inserted it into Kibana using the standard MacOS emoji picker. Upon insertion it changed into the more "text-like" sun thing you see in my code snippet. However, the same thing happens with an avocado (which does look like an actual emoji):
GET /stedelijk_nl/_analyze
{
"analyzer": "dutch_with_emoji",
"text": "🍏🥑️️️"
}
Response:
{
"tokens" : [
{
"token" : """🍏""",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "appel",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "fruit",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "groen",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "groen",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : """🥑""",
"start_offset" : 2,
"end_offset" : 7,
"type" : "",
"position" : 1
},
{
"token" : "appel",
"start_offset" : 2,
"end_offset" : 7,
"type" : "SYNONYM",
"position" : 1
}
]
} |
Just made some tests. The invisible char you have in the string (
So we need to strip the emoji variation selectors before the text reaches the synonym token filter. This can be done like this:
"emoji_variation_selector_filter": {
  "type": "pattern_replace",
  "pattern": "\\uFE0E|\\uFE0F",
  "replacement": ""
}
Your search is
When we apply the above filter, we get these tokens:
The synonym filter can then add its tokens! I added this filter to the README, added tests, and am now closing this issue. Feel free to comment if there is anything else! See the changes here: bea5b31. Also, since last time, the emoji files have been fixed for the "completely eliminated by analyzer" issue 😉 |
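To see what the variation selector filter does to a query string, its effect can be mimicked outside Elasticsearch. The sketch below is only an approximation in Python: it uses Python's `re` module rather than the Java regex engine Elasticsearch relies on, although this particular pattern behaves the same in both. The function name `strip_variation_selectors` is made up for illustration.

```python
import re

# Same alternation as the char filter pattern above:
# U+FE0E (VARIATION SELECTOR-15, text style) or
# U+FE0F (VARIATION SELECTOR-16, emoji style).
VARIATION_SELECTORS = re.compile("\uFE0E|\uFE0F")


def strip_variation_selectors(text):
    """Remove every variation selector, leaving the base characters untouched."""
    return VARIATION_SELECTORS.sub("", text)


if __name__ == "__main__":
    # Assumed input: apple, then a sun followed by stray selectors.
    query = "\U0001F34F\u2600\uFE0F\uFE0F\uFE0F"
    cleaned = strip_variation_selectors(query)
    # Show the remaining code points rather than relying on how the
    # terminal renders the emoji.
    print([f"U+{ord(ch):04X}" for ch in cleaned])
```

Once the selectors are gone, each emoji reaches the synonym filter as a bare base character, which is the form a dictionary lookup can match.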
That's great! Thanks so much for maintaining this repo and debugging this issue. |
Oddly enough I still get an error on the line:
It might be relevant to know we do not use the file as-is but add them programmatically through
Do you think that makes a difference? |
Ah, got it: when using tokenizer |
So the synonyms file is correct right now, it imports correctly when using
Request:
Response:
My
Any ideas? |
Hi! Long time no see 👋 I took the time to test with
About your other issue, here is the full string you searched for:
As you can see, there are a lot of VARIATION SELECTOR characters. For that you added the
Have a great day ;-) |
Yes! That works fantastically! I'm going to recreate my mappings and re-index. Thanks again! |
Hi @damienalexandre,
We're having the same issue you describe here, when using the dictionary file from this repository.
We've just upgraded from ElasticSearch 5.3 to 7.8 however, so we can't use your plugin to solve this issue (yet).
Is a 7.8 release on your roadmap by any chance?
Thanks in advance!