Plugin release for ElasticSearch 7.8 #26

Closed
harmenjanssen opened this issue Jul 14, 2020 · 15 comments

Comments

@harmenjanssen
Contributor

Hi @damienalexandre,

We're having the same issue you describe here, when using the dictionary file from this repository.

We've just upgraded from ElasticSearch 5.3 to 7.8 however, so we can't use your plugin to solve this issue (yet).
Is a 7.8 release on your roadmap by any chance?

Thanks in advance!

@damienalexandre
Member

Hi!

There is no need for the plugin with Elasticsearch version >= 6.4 as the ICU library has been updated.

So with your 7.8 you just have to install the "analysis-icu" plugin (because you need to use icu_tokenizer) and use the dictionary as a synonym token filter.

Something like this:

PUT /emoji-capable
{
  "settings": {
    "analysis": {
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt" 
        }
      },
      "analyzer": {
        "english_with_emoji": {
          "tokenizer": "icu_tokenizer",
          "filter": [
            "english_emoji"
          ]
        }
      }
    }
  }
}

@damienalexandre
Member

I suggest this blog post for more information: https://jolicode.com/blog/elasticsearch-icu-now-understands-emoji

@harmenjanssen
Contributor Author

Hmm, then maybe my question is wrong, haha.

I got the following error when creating the index:

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"failed to build synonyms"}],"type":"illegal_argument_exception","reason":"failed to build synonyms","caused_by":{"type":"parse_exception","reason":"Invalid synonym rule at line 1","caused_by":{"type":"illegal_argument_exception","reason":"term: \uD83C\uDFFB was completely eliminated by analyzer"}}},"status":400}

and assumed from that other thread I would need your plugin to fix this.
But am I right in concluding the dictionary file can be used when I configure the ICU tokenizer?

@damienalexandre
Member

Yes, I suspect you didn't use ICU at all when you got this error?

@harmenjanssen
Contributor Author

That's true.

Thanks for getting back to me so quickly, I'm sure I can make it work. 🙂

@harmenjanssen
Contributor Author

We did make it work, eventually!

However, our client reported a strange bug in which the query "🍏☀️" would yield results, but "☀️🍏" would not.

Upon inspection, the first emoji is converted into synonyms, but the second one isn't:

GET /stedelijk_nl/_analyze
{
  "analyzer": "dutch_with_emoji",
  "text": "🍏☀️️️"
}

Response:

{
  "tokens" : [
    {
      "token" : """🍏""",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "appel",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "fruit",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "groen",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "groen",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<EMOJI>",
      "position" : 1
    },
    {
      "token" : "appel",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "SYNONYM",
      "position" : 12
    }
  ]
}

When I flip the order of emoji, the ☀️ will be converted to synonyms, but the apple is not — very odd behavior.
Have you ever seen anything like this?

For the record:

Other than that there are some stemming and stopwords filters, but I've removed all of these and it doesn't seem to make a difference.

@damienalexandre
Member

Thanks for reporting this issue.

I have some questions:

  • Did you edit the synonym file to remove ☀? Or do you remove it via a char filter?
  • the submitted string looks strange (copy pasted from your _analyze call):
uniscribe "🍏☀<fe0f><fe0f><fe0f>"

  1F34F ├─ 🍏		├─ GREEN APPLE
   ---- ├┬ ☀️️️		├┬ Composition
   2600 │├─ ☀		│├─ BLACK SUN WITH RAYS
   FE0F │├─ VS16	│├─ VARIATION SELECTOR-16
   FE0F │├─ VS16	│├─ VARIATION SELECTOR-16
   FE0F │└─ VS16	│└─ VARIATION SELECTOR-16

VARIATION SELECTOR-16 is used to force the emoji version of ☀, but it's only needed once.
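If uniscribe is not at hand, a similar breakdown can be approximated with Python's standard `unicodedata` module. This is only a minimal sketch for inspecting a query string outside Elasticsearch; the helper name `describe` is made up for illustration.

```python
import unicodedata


def describe(text):
    """Return (hex code point, Unicode name) for every character in text."""
    return [
        (f"{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# The string from the _analyze call: green apple, sun, then three VS16.
query = "\U0001F34F\u2600\uFE0F\uFE0F\uFE0F"

for code_point, name in describe(query):
    print(code_point, name)
```

Any repeated `FE0F VARIATION SELECTOR-16` rows in the output point to the same kind of duplicated selector shown above.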

@harmenjanssen
Contributor Author

did you edit the synonym file to remove ☀ ? Or do you remove this via a char filter ?

I remove it via a char filter, type mapping, with mappings like this:

'*=>star',
'✓=>checkmark',

the submitted string looks strange (copy pasted from your _analyze call):

I agree, it does! I inserted it into Kibana using the standard macOS emoji picker. Upon insertion it changed into the more "text-like" sun thing you see in my code snippet.

However, the same thing happens with an avocado (which does look like an actual emoji):

GET /stedelijk_nl/_analyze
{
  "analyzer": "dutch_with_emoji",
  "text": "🍏🥑️️️"
}
Response
{
  "tokens" : [
    {
      "token" : """🍏""",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "appel",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "fruit",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "groen",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "groen",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : """🥑""",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "<EMOJI>",
      "position" : 1
    },
    {
      "token" : "appel",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "SYNONYM",
      "position" : 1
    }
  ]
}

@damienalexandre
Member

Just made some tests.

The invisible character in your string (FE0F, VARIATION SELECTOR-16) is not understood by Elasticsearch's standard analyzer, nor by icu_tokenizer.

So we need to strip the emoji variation selectors before passing the tokens to the synonym token filter.

This can be done like this:

"emoji_variation_selector_filter": {
    "type": "pattern_replace",
    "pattern": "\\uFE0E|\\uFE0F",
    "replacement": ""
}

Your search string is 🍏🥑<fe0f><fe0f><fe0f>, which produces two tokens by default:

  • 🍏
  • 🥑<fe0f>

As 🥑<fe0f> is not in the synonym file, you don't get the annotations.

When we apply the above filter, we get these tokens:

  • 🍏
  • 🥑

And then the synonym filter can work to add the tokens!
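The mismatch can be reproduced outside Elasticsearch with a few lines of Python. This is a rough sketch, not how the synonym filter is implemented: the `SYNONYMS` dict is a hypothetical one-entry stand-in for the synonym file, and the regex simply reuses the pattern from the filter above.

```python
import re

# Same pattern as the pattern_replace filter: VS15 (U+FE0E) or VS16 (U+FE0F).
VARIATION_SELECTORS = re.compile("\uFE0E|\uFE0F")

# Hypothetical one-entry stand-in for the synonym file.
SYNONYMS = {"\U0001F951": ["avocado", "food", "fruit"]}


def strip_variation_selectors(token):
    """Remove every variation selector from a single token."""
    return VARIATION_SELECTORS.sub("", token)


raw_token = "\U0001F951\uFE0F"  # avocado followed by VS16
print(raw_token in SYNONYMS)                             # False
print(strip_variation_selectors(raw_token) in SYNONYMS)  # True
```

The lookup is an exact string match, so a single trailing selector is enough to miss the entry.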

I added this filter in the README, added tests and I'm now closing this issue. Feel free to comment if there is anything else!

See changes here: bea5b31

Also, since last time, the emoji files have been fixed for the "completely eliminated by analyzer" issue 😉

@harmenjanssen
Contributor Author

That's great! Thanks so much for maintaining this repo and debugging this issue.
I will implement the filter and download the new dictionary files.

@harmenjanssen
Contributor Author

Oddly enough I still get an error on the line:

〽 => 〽, mark, part, part alternation mark
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"failed to build synonyms"}],"type":"illegal_argument_exception","reason":"failed to build synonyms","caused_by":{"type":"parse_exception","reason":"Invalid synonym rule at line 1263","caused_by":{"type":"illegal_argument_exception","reason":"term: 〽 was completely eliminated by analyzer"}}},"status":400}

It might be relevant to know that we do not use the file as-is, but add its rules programmatically through the synonyms setting:

'filter' => [
    'english_emoji' => [
        'type' => 'synonym',
        'synonyms' => [],                // Will be filled by reading the synonyms file.
    ],
    // @see https://github.com/jolicode/emoji-search/issues/26
    'emoji_variation_selector_filter' => [
        'type' => 'pattern_replace',
        'pattern' => '\\uFE0E|\\uFE0F',
        'replacement' => '',
    ],
    ...

Do you think that makes a difference?
We're using ES 7.8.0 — your tests are succeeding on 7.8.1 so I'm assuming 7.8.0 should work as well...
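For readers wondering what "filled by reading the synonyms file" might look like, here is a minimal sketch, written in Python rather than the PHP used above. The function name `load_synonym_rules` and the sample lines are hypothetical; the only assumption about the file is that it uses the Solr synonym format, where blank lines and lines starting with `#` carry no rule.

```python
import io


def load_synonym_rules(lines):
    """Collect synonym rules, skipping blank lines and '#' comments."""
    rules = []
    for line in lines:
        rule = line.strip()
        if rule and not rule.startswith("#"):
            rules.append(rule)
    return rules


# Hypothetical excerpt; a real run would open the downloaded synonyms file.
sample_file = io.StringIO(
    "# comment line\n"
    "\n"
    "\U0001F34F => \U0001F34F, apple, fruit, green\n"
)

settings_filter = {
    "english_emoji": {
        "type": "synonym",
        "synonyms": load_synonym_rules(sample_file),
    }
}
print(settings_filter["english_emoji"]["synonyms"])
```

Note that with inline rules, the line number in a "failed to build synonyms" error may not match the line number in the original file if any lines were skipped while loading.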

@harmenjanssen
Contributor Author

Ah, got it: when using tokenizer icu_tokenizer it fails, but with tokenizer standard it works.
I was under the impression I had to use icu_tokenizer.

@harmenjanssen
Contributor Author

So the synonyms file is correct right now, it imports correctly when using icu_tokenizer.
However, the original problem persists: subsequent emoji are still not expanded into synonyms. Look:

Request:

GET /stedelijk_staging_en/_analyze
{
  "analyzer": "english_with_emoji",
  "text": "🍏🥑️️️️"
}
Response:
{
  "tokens" : [
    {
      "token" : """🍏""",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "appl",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "fruit",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "green",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : """🥑""",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "<EMOJI>",
      "position" : 1
    }
  ]
}

My english_with_emoji analyzer is set up like this:

'english_with_emoji' => [
    'char_filter' => [
        'html_strip'
    ],
    'tokenizer' => 'standard',
    'filter' => [
        'english_emoji',
        'emoji_variation_selector_filter',
        'lowercase',
        'english_stop',
        'english_stemmer',
    ],
],

Any ideas?

@damienalexandre
Member

Hi! Long time no see 👋

I took the time to test with icu_tokenizer and there were some emoji to remove. I just opened #33 for that, thanks! (It's now fully tested with both the standard and ICU tokenizers.)

About your other issue, here is the full string you searched for:

  1F34F ├─ 🍏           ├─ GREEN APPLE
   ---- ├┬ 🥑️️️️               ├┬ Composition
  1F951 │├─ 🥑          │├─ AVOCADO
   FE0F │├─ VS16        │├─ VARIATION SELECTOR-16
   FE0F │├─ VS16        │├─ VARIATION SELECTOR-16
   FE0F │├─ VS16        │├─ VARIATION SELECTOR-16
   FE0F │└─ VS16        │└─ VARIATION SELECTOR-16

As you can see, there are a lot of VARIATION SELECTOR characters.

You added the emoji_variation_selector_filter to handle that, but you put it after the english_emoji token filter; it must come before it.

'english_with_emoji' => [
    'char_filter' => [
        'html_strip'
    ],
    'tokenizer' => 'standard',
    'filter' => [
        'emoji_variation_selector_filter',
        'english_emoji',
        'lowercase',
        'english_stop',
        'english_stemmer',
    ],
],
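Why the order matters can be shown with a toy filter chain in Python. This is only an illustrative sketch under simplifying assumptions: the two functions loosely mimic the token filters named above, the `SYNONYMS` dict is a hypothetical stand-in for the synonym file, and real Elasticsearch token streams also carry positions and offsets that are ignored here.

```python
import re

VARIATION_SELECTORS = re.compile("\uFE0E|\uFE0F")
# Hypothetical one-entry stand-in for the synonym file.
SYNONYMS = {"\U0001F951": ["avocado"]}


def strip_selectors(tokens):
    """Mimics emoji_variation_selector_filter on a token list."""
    return [VARIATION_SELECTORS.sub("", token) for token in tokens]


def expand_synonyms(tokens):
    """Mimics the synonym filter: keep each token, append exact matches."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(SYNONYMS.get(token, []))
    return expanded


def run_chain(tokens, filters):
    """Apply token filters in the order they are listed."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


tokens = ["\U0001F951\uFE0F"]
wrong_order = run_chain(tokens, [expand_synonyms, strip_selectors])
right_order = run_chain(tokens, [strip_selectors, expand_synonyms])
print(wrong_order)  # only the avocado emoji, no synonyms added
print(right_order)  # the avocado emoji followed by "avocado"
```

In the wrong order the selector is still attached when the synonym lookup runs, so nothing matches; stripping it afterwards is too late.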

Have a great day ;-)

@harmenjanssen
Contributor Author

Yes! That works fantastically!
Thanks, that was easier than I expected.

I'm going to recreate my mappings and re-index. Thanks again!
