
Phrase match error in chinese #1714

Closed
gemini133 opened this issue Sep 22, 2021 · 19 comments
Labels
bug Something isn't working as expected
tokenizer Related to the tokenizer repo: https://github.com/meilisearch/tokenizer/

Comments

@gemini133

gemini133 commented Sep 22, 2021

Documents, with default settings:

curl -s http://localhost:7700/indexes/products/documents | rq
[
  {
    "id": "123",
    "title": "小化妆包"
  },
  {
    "id": "456",
    "title": "Ipad 包"
  }
]

Search with phrase match "化妆": no results returned.

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"化妆\"" }' | rq

{
  "exhaustiveNbHits": false,
  "hits": [],
  "limit": 20,
  "nbHits": 0,
  "offset": 0,
  "processingTimeMs": 0,
  "query": "\"化妆\""
}

Search with phrase match "小化妆": no results returned.

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"小化妆\"" }' | rq
{
  "exhaustiveNbHits": false,
  "hits": [],
  "limit": 20,
  "nbHits": 0,
  "offset": 0,
  "processingTimeMs": 0,
  "query": "\"小化妆\""
}

Search with phrase match "化妆包": a result is returned.

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"化妆包\"" }' | rq
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "123",
      "title": "小化妆包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 0,
  "query": "\"化妆包\""
}

Search with phrase match "小化妆包": a result is returned.

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"小化妆包\"" }' | rq
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "123",
      "title": "小化妆包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 0,
  "query": "\"小化妆包\""
}

Expected behavior
The queries "化妆", "小化妆", "化妆包", and "小化妆包" should all return the same result:

{
  "id": "123",
  "title": "小化妆包"
}

MeiliSearch version:

meilisearch-http 0.22.0

@gemini133
Author

I did a quick test with an English document, {"title": "adult face mask"}, using the phrase matches "adult", "adult face", and "face mask"; the document is returned as expected.

@YikSanChan

@gemini133 I guess this is because Meilisearch uses Jieba to tokenize the search phrase "小化妆包", and it is understood as "化妆包".

See https://docs.meilisearch.com/reference/under_the_hood/tokenization.html

@gemini133
Author

gemini133 commented Sep 22, 2021

Yep, I think it's related to the tokenizer too.
Why does "化妆" return nothing, then?
I'm not sure how to work around this. Confusing...

@gemini133
Author

Another test, with just "包":

{
    "q": "\"包\""
}

{
    "hits": [
        {
            "id": "456",
            "title": "Ipad 包"
        }
    ],
    "nbHits": 1,
    "exhaustiveNbHits": false,
    "query": "\"包\"",
    "limit": 20,
    "offset": 0,
    "processingTimeMs": 0
}

@ManyTheFish
Member

Hello @gemini133, it seems that you are using phrase search in your queries by putting " around your words.
If you remove the ", do you get the expected behavior?

@curquiza curquiza added the support Issues related to support questions label Sep 27, 2021
@gemini133
Author

gemini133 commented Sep 28, 2021

@ManyTheFish yes, I'm using phrase/exact search because the results of a normal search are somewhat irrelevant to me,
so I tried phrase search instead.

Please check screenshot

[screenshot attached]

@ManyTheFish
Member

@gemini133, I can't speak or read Chinese, so sorry if I don't understand the complete problem.

But, if I'm right, only the second request gets an irrelevant response, because Ipad 包 should never be returned for 化妆包.

Am I right?

@gemini133
Author

Sorry, and you are right: Ipad 包 should never be returned for 化妆包. Is it related to ranking rules? I've tweaked them a couple of times but still didn't get what I wanted, so I then turned to phrase search.

@ManyTheFish
Member

I will investigate.
Thank you @gemini133 for your report. 👍

@curquiza curquiza added bug Something isn't working as expected and removed support Issues related to support questions labels Sep 28, 2021
@ManyTheFish ManyTheFish self-assigned this Sep 28, 2021
@curquiza curquiza added this to Candidates in Bug triage via automation Sep 28, 2021
@curquiza curquiza added this to the v0.24.0 milestone Sep 28, 2021
@curquiza curquiza changed the title phrase match error in chinese Phrase match error in chinese Sep 28, 2021
bors bot added a commit to meilisearch/milli that referenced this issue Sep 28, 2021
372: Fix Meilisearch 1714 r=Kerollmops a=ManyTheFish

The bug comes from the typo tolerance: to know how many typos are accepted, we were counting bytes instead of characters in a word.
On Chinese-script characters, we were allowing 2 typos on 3-character words.
We now count the number of characters instead of bytes when assigning the typo tolerance.

Related to [Meilisearch#1714](meilisearch/meilisearch#1714)

Co-authored-by: many <maxime@meilisearch.com>
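To illustrate the fix described in that commit message: 化妆包 is 3 characters but 9 bytes in UTF-8, so measuring word length in bytes crosses the two-typo threshold. Below is a minimal sketch, assuming Meilisearch's documented default thresholds of 1 typo from 5 characters and 2 typos from 9; `allowed_typos` is a hypothetical helper, not the actual milli code.

```python
def allowed_typos(word: str, count_bytes: bool = False) -> int:
    # Default Meilisearch thresholds: 1 typo allowed from 5 "characters",
    # 2 typos from 9. The bug was measuring length in UTF-8 bytes.
    n = len(word.encode("utf-8")) if count_bytes else len(word)
    if n >= 9:
        return 2
    if n >= 5:
        return 1
    return 0

word = "化妆包"  # 3 characters, but 9 bytes in UTF-8
print(allowed_typos(word, count_bytes=True))   # buggy byte counting: 2
print(allowed_typos(word, count_bytes=False))  # fixed char counting: 0
```

Under byte counting, a 3-character Chinese word tolerated 2 typos, which made almost any nearby word an accepted match.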
bors bot added a commit to meilisearch/charabia that referenced this issue Sep 28, 2021
58: Test Meilisearch issue 1714 r=irevoire a=ManyTheFish

Related to [Meilisearch#1714](meilisearch/meilisearch#1714)

no bug in Tokenizer

Co-authored-by: many <maxime@meilisearch.com>
@ManyTheFish ManyTheFish modified the milestones: v0.24.0, v0.23.0 Sep 28, 2021
@curquiza curquiza moved this from Candidates to Bugs - severity 2 in Bug triage Sep 29, 2021
@curquiza
Member

Closed by #1711 containing milli v0.17.0

Bug triage automation moved this from Bugs - severity 2 to Done Sep 29, 2021
@gemini133
Author

gemini133 commented Oct 2, 2021

Here is feedback on 0.23, built from the master branch.


**Phrase search**

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"小\"" }' | rq
// correct
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "123",
      "title": "小化妆包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 0,
  "query": "\"小\""
}



curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"小化妆\"" }' | rq
// wrong, must return 小化妆包
{
  "exhaustiveNbHits": false,
  "hits": [],
  "limit": 20,
  "nbHits": 0,
  "offset": 0,
  "processingTimeMs": 1,
  "query": "\"小化妆\"" 
}

---- 

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"化妆包\"" }' | rq
// correct
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "123",
      "title": "小化妆包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 0,
  "query": "\"化妆包\""
}

----

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"小化妆包\"" }' | rq
// correct
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "123",
      "title": "小化妆包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 0,
  "query": "\"小化妆包\""
}

----

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"包\"" }' | rq
// wrong, must contain 小化妆包 too
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "456",
      "title": "Ipad 包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 1,
  "query": "\"包\""
}

---

**Normal Search**

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "包" }' | rq
// wrong, must contain 小化妆包 too
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "456",
      "title": "Ipad 包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 0,
  "query": "包"
}

----

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "化妆包" }' | rq
// not sure; if jieba splits it into "化妆" and "包", then it should contain "Ipad 包"
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "123",
      "title": "小化妆包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 1,
  "query": "化妆包"
}

----


curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "化妆" }' | rq
// correct
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "123",
      "title": "小化妆包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 0,
  "query": "化妆"
}

----

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "小" }' | rq
// correct
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "123",
      "title": "小化妆包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 0,
  "query": "小"
}

@gemini133
Author

@curquiza @ManyTheFish

@ManyTheFish
Member

Hello @gemini133!
In your last comment, there are different reasons why Meilisearch doesn't return the expected response:

Tokenization & Phrase search

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"小化妆\"" }' | rq
// wrong, must return 小化妆包
{
  "exhaustiveNbHits": false,
  "hits": [],
  "limit": 20,
  "nbHits": 0,
  "offset": 0,
  "processingTimeMs": 1,
  "query": "\"小化妆\""
}

In Meilisearch we can only store one variation of a tokenized text, and this variation is given by Jieba.
Here the text 小化妆包 is tokenized into two "words", ["小", "化妆包"].
The search query is tokenized into two "words", ["小", "化妆"], and because it's a phrase search, we enforce an exact match on both words:

  • "小" == "小" ---------> ✅
  • "化妆" =/= "化妆包" ---> ❌

This does not match.
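The token-by-token comparison described above can be sketched as follows. The hardcoded token lists are the Jieba segmentations given in this comment, and `phrase_matches` is a hypothetical helper, not Meilisearch's actual implementation:

```python
# Assumed Jieba segmentations, as described above.
doc_tokens = ["小", "化妆包"]   # document title "小化妆包"
query_tokens = ["小", "化妆"]   # phrase query "小化妆"

def phrase_matches(query: list[str], doc: list[str]) -> bool:
    # A phrase search requires an exact, contiguous, token-by-token match.
    n = len(query)
    return any(doc[i:i + n] == query for i in range(len(doc) - n + 1))

print(phrase_matches(query_tokens, doc_tokens))      # False: "化妆" != "化妆包"
print(phrase_matches(["小", "化妆包"], doc_tokens))  # True: exact token match
```

Because the comparison is between whole tokens, "化妆" can never satisfy a phrase match against the stored token "化妆包".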

Tokenization & suffix search

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"包\"" }' | rq
OR curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "包" }' | rq
// wrong, must contain 小化妆包 too
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "456",
      "title": "Ipad 包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 1,
  "query": "\"包\""
}

In Meilisearch we can only store one variation of a tokenized text, and this variation is given by Jieba.
Here the text 小化妆包 is tokenized into two "words", ["小", "化妆包"] (we technically can't store more variations for now).
Moreover, we don't support suffix search, only prefix search. This means that the word 化妆包 is never a match for the word 包.
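A minimal sketch of the prefix-only matching described above. `prefix_match` is a hypothetical helper; Meilisearch's real prefix search has more machinery (it typically applies only to the last query word), but the asymmetry is the same:

```python
def prefix_match(query_word: str, stored_word: str) -> bool:
    # Prefix search: a query word matches stored words that start with it.
    # There is no symmetric suffix search.
    return stored_word.startswith(query_word)

stored = "化妆包"  # the token Jieba produced for the document title
print(prefix_match("化妆", stored))  # True: "化妆" is a prefix of "化妆包"
print(prefix_match("包", stored))    # False: "包" is only a suffix
```

This is why a query for 包 can find Ipad 包 (where 包 is stored as its own token) but not 小化妆包 (where 包 is only the tail of the token 化妆包).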

Tokenization & query Words

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "化妆包" }' | rq
// not sure; if jieba splits it into "化妆" and "包", then it should contain "Ipad 包"
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "123",
      "title": "小化妆包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 1,
  "query": "化妆包"
}

Meilisearch tokenizes the query 化妆包 as the single token ["化妆包"], so the word 包 in Ipad 包 can't be a match for it.
Moreover, even if Meilisearch tokenized 化妆包 as ["化妆", "包"], we still couldn't match Ipad 包, because the document doesn't contain the word "化妆".

I hope my explanations were clear. Thanks a lot @gemini133 for your detailed report! 👍

@gemini133
Author

Thanks for your detailed explanation! It's very helpful.
Honestly, I'm somewhat disappointed compared to my past experience with Elasticsearch.

@ManyTheFish
Member

ManyTheFish commented Oct 5, 2021

I'm sorry for that.
But I have some questions.

Is it really relevant to return Ipad 包 for the query 化妆包?

I used a translation tool, and 化妆包 seems to mean cosmetic bag. I don't personally find that an Ipad bag is a good response for a cosmetic bag. 🤔
Is there another translation of 化妆包 that makes Ipad 包 relevant?

What is the most disappointing thing in the responses given by Meilisearch?

Is there one of your examples that makes Meilisearch irrelevant for your use case?

@gemini133
Author

gemini133 commented Oct 6, 2021

No, no, no. Ipad 包 is not relevant to 化妆包, but when you search for "包", they both should be returned, right?

From the point of view of a normal user, when I search for "包", the search results should contain all products whose titles contain "包": in this case, "Ipad 包" and "化妆包".

@ManyTheFish
Member

ManyTheFish commented Oct 6, 2021

Yes, we could tune Jieba to tokenize better for our users.
Unfortunately, I can't read or write any Chinese script, so if you know a bit about Jieba and how we could tune it to enhance your experience with Meilisearch, I would gladly make the changes! 🙂

@gemini133
Author

@ManyTheFish hey man, really appreciate your help. I'll look into Jieba and see what I can do for all of us.

@gemini133
Author

@ManyTheFish unfortunately, I didn't find a way to tweak Jieba except for load_dict, which loads our own dictionary.

messense/jieba-rs#77
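For reference, a Jieba user dictionary is a plain-text file with one entry per line in the format used by Python jieba: the word, then an optional frequency, then an optional part-of-speech tag. Assuming jieba-rs's load_dict accepts the same format, a dictionary nudging the segmenter toward shorter tokens might look like the sketch below; whether these entries would actually make Jieba split 小化妆包 into ["小", "化妆", "包"] depends on the frequencies involved, so treat this as an untested illustration, not a verified fix.

```text
化妆 100000 n
包 100000 n
```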

@ManyTheFish ManyTheFish added the tokenizer Related to the tokenizer repo: https://github.com/meilisearch/tokenizer/ label Dec 23, 2021
bors bot pushed a commit that referenced this issue Jan 16, 2023