Plugin release for ElasticSearch 7.8 #26

harmenjanssen · 2020-07-14T08:54:39Z

We're having the same issue you describe here, when using the dictionary file from this repository.

We've just upgraded from ElasticSearch 5.3 to 7.8 however, so we can't use your plugin to solve this issue (yet).
Is a 7.8 release on your roadmap by any chance?

Thanks in advance!

damienalexandre · 2020-07-14T09:58:32Z

Hi!

There is no need for the plugin with Elasticsearch version >= 6.4 as the ICU library has been updated.

So with your 7.8 you just have to install the "analysis-icu" plugin (because you need to use icu_tokenizer) and use the dictionary as a synonym token filter.

Something like this:

PUT /emoji-capable
{
  "settings": {
    "analysis": {
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt" 
        }
      },
      "analyzer": {
        "english_with_emoji": {
          "tokenizer": "icu_tokenizer",
          "filter": [
            "english_emoji"
          ]
        }
      }
    }
  }
}

damienalexandre · 2020-07-14T09:58:49Z

I suggest this blog post for more information: https://jolicode.com/blog/elasticsearch-icu-now-understands-emoji

harmenjanssen · 2020-07-14T10:26:13Z

Hmm, then maybe my question is wrong, haha.

I got the following error when creating the index:

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"failed to build synonyms"}],"type":"illegal_argument_exception","reason":"failed to build synonyms","caused_by":{"type":"parse_exception","reason":"Invalid synonym rule at line 1","caused_by":{"type":"illegal_argument_exception","reason":"term: \uD83C\uDFFB was completely eliminated by analyzer"}}},"status":400}

and assumed from that other thread I would need your plugin to fix this.
But am I right in concluding the dictionary file can be used when I configure the ICU tokenizer?
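
For reference, the escaped term in that error message is a UTF-16 surrogate pair. The following Python sketch (not part of the original exchange) decodes it to show which character was eliminated; the helper name `decode_surrogate_pair` is made up for illustration.

```python
import unicodedata


def decode_surrogate_pair(high: int, low: int) -> str:
    """Combine a UTF-16 high/low surrogate pair into a single character."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return chr(code_point)


# The term reported in the error message: \uD83C\uDFFB
term = decode_surrogate_pair(0xD83C, 0xDFFB)
print(f"U+{ord(term):04X}", unicodedata.name(term))
# U+1F3FB EMOJI MODIFIER FITZPATRICK TYPE-1-2
```

So the rejected rule on line 1 is for a standalone skin-tone modifier rather than a regular emoji.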

damienalexandre · 2020-07-14T13:04:05Z

Yes, I suspect you didn't use ICU at all when you got this error?

harmenjanssen · 2020-07-14T14:30:54Z

That's true.

Thanks for getting back to me so quickly, I'm sure I can make it work. 🙂

harmenjanssen · 2020-08-05T13:32:45Z

We did make it work, eventually!

However, our client reported a strange bug in which the query "🍏☀️" would yield results, but "☀️🍏" would not.

Upon inspection, the first emoji is converted into synonyms, but the second one isn't:

GET /stedelijk_nl/_analyze
{
  "analyzer": "dutch_with_emoji",
  "text": "🍏☀️️️"
}

Response:

{
  "tokens" : [
    {
      "token" : """🍏""",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "appel",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "fruit",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "groen",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "groen",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "☀",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<EMOJI>",
      "position" : 1
    },
    {
      "token" : "appel",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "SYNONYM",
      "position" : 1
    }
  ]
}

When I flip the order of the emoji, the ☀️ is converted to synonyms but the apple is not. Very odd behavior.
Have you ever seen anything like this?

For the record:

I'm using Elasticsearch 7.8
My analyzer includes the icu tokenizer and your synonyms list from this repo, plus a mapping that replaces the invalid characters mentioned in issue #27 (cldr-emoji-annotation-synonyms-en.txt: Terms "completely eliminated by analyzer") with valid characters.

Other than that there are some stemming and stopwords filters, but I've removed all of these and it doesn't seem to make a difference.

damienalexandre · 2020-08-05T13:54:09Z

Thanks for reporting this issue.

I have some questions:

Did you edit the synonym file to remove ☀? Or do you remove it via a char filter?
The submitted string looks strange (copy-pasted from your _analyze call):

uniscribe "🍏☀<fe0f><fe0f><fe0f>"

  1F34F ├─ 🍏		├─ GREEN APPLE
   ---- ├┬ ☀️️️		├┬ Composition
   2600 │├─ ☀		│├─ BLACK SUN WITH RAYS
   FE0F │├─ VS16	│├─ VARIATION SELECTOR-16
   FE0F │├─ VS16	│├─ VARIATION SELECTOR-16
   FE0F │└─ VS16	│└─ VARIATION SELECTOR-16

VARIATION SELECTOR-16 is used to force the EMOJI version of ☀ but it's only needed once.
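
A similar breakdown can be produced without uniscribe. This is a minimal Python sketch using only the standard library; the function name `describe_code_points` is hypothetical, and the query string is rebuilt with escapes from the code points listed above.

```python
import unicodedata


def describe_code_points(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' line per code point in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


# Green apple, sun, then three VARIATION SELECTOR-16 characters,
# written with escapes so the invisible characters stay visible in source.
query = "\U0001F34F\u2600\uFE0F\uFE0F\uFE0F"

for line in describe_code_points(query):
    print(line)

# Count how many variation selectors the string carries.
vs16_count = query.count("\uFE0F")
print("VS16 count:", vs16_count)  # VS16 count: 3
```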

harmenjanssen · 2020-08-05T18:57:54Z

> Did you edit the synonym file to remove ☀? Or do you remove this via a char filter?

I remove it via a char filter of type mapping, with mappings like this:

'*=>star',
'✓=>checkmark',

> The submitted string looks strange (copy-pasted from your _analyze call):

I agree, it does! I inserted it into Kibana using the standard macOS emoji picker. Upon insertion it changed into the more "text-like" sun you see in my code snippet.

However, the same thing happens with an avocado (which does look like an actual emoji):

GET /stedelijk_nl/_analyze
{
  "analyzer": "dutch_with_emoji",
  "text": "🍏🥑️️️"
}

Response:

{
  "tokens" : [
    {
      "token" : """🍏""",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "appel",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "fruit",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "groen",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "groen",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : """🥑""",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "<EMOJI>",
      "position" : 1
    },
    {
      "token" : "appel",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "SYNONYM",
      "position" : 1
    }
  ]
}

damienalexandre · 2020-08-18T16:55:32Z

Just made some tests.

The invisible character in your string (FE0F, VARIATION SELECTOR-16) is not understood by Elasticsearch's standard analyzer, nor by icu_tokenizer.

So we need to strip the emoji variation selectors from the tokens before passing them to the synonym token filter.

This can be done like this:

"emoji_variation_selector_filter": {
    "type": "pattern_replace",
    "pattern": "\\uFE0E|\\uFE0F",
    "replacement": ""
}

Your search string is 🍏🥑<fe0f><fe0f><fe0f>, which produces two tokens by default:

🍏
🥑<fe0f>

As 🥑<fe0f> is not in the synonym file, you don't get the annotations.

When we apply the above filter, we get these tokens:

🍏
🥑

And then the synonym filter can work to add the tokens!
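
The effect can be reproduced outside Elasticsearch. The Python sketch below only emulates the idea (an exact-match lookup after stripping the selectors); it is not how Lucene implements the filter, and the synonym entries are illustrative rather than copied from the dictionary file.

```python
import re

# Same pattern as the filter above: text (FE0E) and emoji (FE0F) selectors.
VARIATION_SELECTORS = re.compile("\uFE0E|\uFE0F")

# Toy stand-in for the synonym file; entries are illustrative only.
SYNONYMS = {
    "\U0001F34F": ["apple", "fruit", "green"],
    "\U0001F951": ["avocado", "fruit"],
}


def strip_variation_selectors(token: str) -> str:
    """Remove every variation selector from a single token."""
    return VARIATION_SELECTORS.sub("", token)


# Tokens as described above: the avocado still carries a selector.
tokens = ["\U0001F34F", "\U0001F951\uFE0F"]

raw_hits = [t in SYNONYMS for t in tokens]
clean_hits = [strip_variation_selectors(t) in SYNONYMS for t in tokens]

print(raw_hits)    # [True, False]
print(clean_hits)  # [True, True]
```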

I added this filter in the README, added tests and I'm now closing this issue. Feel free to comment if there is anything else!

See changes here: bea5b31

Also since last time the emoji files have been fixed for the "completely eliminated by analyzer" issue 😉

harmenjanssen · 2020-08-19T07:09:32Z

That's great! Thanks so much for maintaining this repo and debugging this issue.
I will implement the filter and download the new dictionary files.

harmenjanssen · 2020-09-10T07:45:56Z

Oddly enough I still get an error on the line:

〽 => 〽, mark, part, part alternation mark

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"failed to build synonyms"}],"type":"illegal_argument_exception","reason":"failed to build synonyms","caused_by":{"type":"parse_exception","reason":"Invalid synonym rule at line 1263","caused_by":{"type":"illegal_argument_exception","reason":"term: 〽 was completely eliminated by analyzer"}}},"status":400}

It might be relevant that we do not use the file as-is but add its rules programmatically through the synonyms option:

'filter' => [
    'english_emoji' => [
        'type' => 'synonym',
        'synonyms' => [],                // Will be filled by reading the synonyms file.
    ],
    // @see https://github.com/jolicode/emoji-search/issues/26
    'emoji_variation_selector_filter' => [
        'type' => 'pattern_replace',
        'pattern' => '\\uFE0E|\\uFE0F',
        'replacement' => '',
    ],
    ...

Do you think that makes a difference?
We're using ES 7.8.0 — your tests are succeeding on 7.8.1 so I'm assuming 7.8.0 should work as well...
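
For readers who want to do the same, filling the synonyms array from the file might look roughly like this Python sketch. It assumes the Solr synonym format (blank lines and lines starting with `#` are ignored); the function name `load_synonym_rules` and the sample rules are hypothetical, and an in-memory file stands in for the real dictionary.

```python
import io
import json
from typing import Iterable


def load_synonym_rules(lines: Iterable[str]) -> list[str]:
    """Collect synonym rules, skipping blank lines and '#' comments."""
    rules = []
    for line in lines:
        rule = line.strip()
        if not rule or rule.startswith("#"):
            continue
        rules.append(rule)
    return rules


# Stand-in for open("cldr-emoji-annotation-synonyms-en.txt", encoding="utf-8");
# the two rules are illustrative, not copied from the real file.
sample_file = io.StringIO(
    "# sample rules\n"
    "\n"
    "\U0001F34F => \U0001F34F, apple, fruit, green\n"
    "\U0001F951 => \U0001F951, avocado, fruit\n"
)

synonym_filter = {
    "type": "synonym",
    "synonyms": load_synonym_rules(sample_file),
}

print(json.dumps(synonym_filter, ensure_ascii=False, indent=2))
```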

harmenjanssen · 2020-09-10T13:31:31Z

Ah, got it: with icu_tokenizer as the tokenizer it fails, but with the standard tokenizer it works.
I was under the impression I had to use icu_tokenizer.

harmenjanssen · 2020-09-10T13:57:47Z

So the synonyms file is correct right now: it imports correctly when using the standard tokenizer.
However, the original problem of subsequent emoji not being translated into synonyms persists. Look:

Request:

GET /stedelijk_staging_en/_analyze
{
  "analyzer": "english_with_emoji",
  "text": "🍏🥑️️️️"
}

Response:

{
  "tokens" : [
    {
      "token" : """🍏""",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "appl",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "fruit",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "green",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : """🥑""",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "<EMOJI>",
      "position" : 1
    }
  ]
}

My english_with_emoji analyzer is set up like this:

'english_with_emoji' => [
    'char_filter' => [
        'html_strip'
    ],
    'tokenizer' => 'standard',
    'filter' => [
        'english_emoji',
        'emoji_variation_selector_filter',
        'lowercase',
        'english_stop',
        'english_stemmer',
    ],
],

Any ideas?

damienalexandre · 2020-10-16T16:26:15Z

Hi! Long time no see 👋

I took the time to test with icu_tokenizer and there were some emoji to remove. I just opened #33 for that, thanks! (It's now fully tested with both the standard and icu tokenizers.)

About your other issue, here is the full string you searched for:

  1F34F ├─ 🍏           ├─ GREEN APPLE
   ---- ├┬ 🥑️️️️               ├┬ Composition
  1F951 │├─ 🥑          │├─ AVOCADO
   FE0F │├─ VS16        │├─ VARIATION SELECTOR-16
   FE0F │├─ VS16        │├─ VARIATION SELECTOR-16
   FE0F │├─ VS16        │├─ VARIATION SELECTOR-16
   FE0F │└─ VS16        │└─ VARIATION SELECTOR-16

As you can see, there are a lot of VARIATION SELECTOR characters.

You added the emoji_variation_selector_filter to handle them, but you put it after the english_emoji token filter; it must come before it.

'english_with_emoji' => [
    'char_filter' => [
        'html_strip'
    ],
    'tokenizer' => 'standard',
    'filter' => [
        'emoji_variation_selector_filter',
        'english_emoji',
        'lowercase',
        'english_stop',
        'english_stemmer',
    ],
],
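
Why the order matters can be shown with a toy model of the filter chain. This Python sketch only mimics the idea of filters running in list order; it is not Elasticsearch code, and the synonym table and function names are made up for illustration.

```python
from typing import Callable

Tokens = list[str]
TokenFilter = Callable[[Tokens], Tokens]

# Toy synonym table; entries are illustrative only.
SYNONYMS = {"\U0001F951": ["avocado", "fruit"]}


def strip_selectors(tokens: Tokens) -> Tokens:
    """Drop U+FE0E / U+FE0F from every token."""
    return [t.replace("\uFE0E", "").replace("\uFE0F", "") for t in tokens]


def expand_synonyms(tokens: Tokens) -> Tokens:
    """Append synonyms after each token that has an exact match."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(SYNONYMS.get(token, []))
    return expanded


def run_chain(tokens: Tokens, chain: list[TokenFilter]) -> Tokens:
    """Apply the filters in list order, like an analyzer's filter array."""
    for token_filter in chain:
        tokens = token_filter(tokens)
    return tokens


avocado_with_vs16 = ["\U0001F951\uFE0F"]

wrong_order = run_chain(avocado_with_vs16, [expand_synonyms, strip_selectors])
right_order = run_chain(avocado_with_vs16, [strip_selectors, expand_synonyms])

print(wrong_order)  # ['🥑']
print(right_order)  # ['🥑', 'avocado', 'fruit']
```

In the wrong order the synonym lookup sees the token while it still carries the selector, so nothing matches; the selector is removed only afterwards, when it is too late.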

Have a great day ;-)

harmenjanssen · 2020-10-20T12:51:49Z

Yes! That works fantastically!
Thanks, that was easier than I expected.

I'm going to recreate my mappings and re-index. Thanks again!

harmenjanssen closed this as completed Jul 14, 2020

vicchi mentioned this issue Jul 19, 2020

cldr-emoji-annotation-synonyms-en.txt: Terms "completely eliminated by analyzer" #27

Closed

damienalexandre reopened this Aug 5, 2020

damienalexandre added a commit that referenced this issue Aug 18, 2020

Ref #26 Add VARIATION SELECTOR token filter in README and tests

bea5b31

damienalexandre closed this as completed Aug 18, 2020

harmenjanssen mentioned this issue Sep 10, 2020

Test commit: change passing of synonyms option #32

Closed

damienalexandre mentioned this issue Oct 16, 2020

Add support for icu_tokenizer by removing some more "emoji" #33

Merged
