
Phrase match error in chinese #1714

Closed
gemini133 opened this issue Sep 22, 2021 · 19 comments
Labels
bug Something isn't working as expected
tokenizer Related to the tokenizer repo: https://github.com/meilisearch/tokenizer/

Comments

@gemini133

gemini133 commented Sep 22, 2021

Documents, with default settings:

curl -s http://localhost:7700/indexes/products/documents | rq
[
  {
    "id": "123",
    "title": "小化妆包"
  },
  {
    "id": "456",
    "title": "Ipad 包"
  }
]

Search with phrase match "化妆": no results returned.

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"化妆\"" }' | rq

{
  "exhaustiveNbHits": false,
  "hits": [],
  "limit": 20,
  "nbHits": 0,
  "offset": 0,
  "processingTimeMs": 0,
  "query": "\"化妆\""
}

Search with phrase match "小化妆": no results returned.

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"小化妆\"" }' | rq
{
  "exhaustiveNbHits": false,
  "hits": [],
  "limit": 20,
  "nbHits": 0,
  "offset": 0,
  "processingTimeMs": 0,
  "query": "\"小化妆\""
}

Search with phrase match "化妆包": a result is returned.

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"化妆包\"" }' | rq
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "123",
      "title": "小化妆包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 0,
  "query": "\"化妆包\""
}

Search with phrase match "小化妆包": a result is returned.

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"小化妆包\"" }' | rq
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "123",
      "title": "小化妆包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 0,
  "query": "\"小化妆包\""
}

Expected behavior
The queries "化妆", "小化妆", "化妆包", and "小化妆包" should all return the same result:

{
  "id": "123",
  "title": "小化妆包"
}

MeiliSearch version:

meilisearch-http 0.22.0

@gemini133
Author

I did a quick test with an English document, {"title": "adult face mask"}, using the phrase matches "adult", "adult face", and "face mask"; the document is returned as expected.

@YikSanChan

@gemini133 I guess this is because Meilisearch uses Jieba to tokenize the search phrase "小化妆包", and it is understood as "化妆包".

See https://docs.meilisearch.com/reference/under_the_hood/tokenization.html

@gemini133
Author

gemini133 commented Sep 22, 2021

Yep, I think it's related to the tokenizer too.
Why does "化妆" return nothing, then?
I'm not sure how to work around this. Confusing...

@gemini133
Author

Another test, with just "包":

{
    "q": "\"包\""
}

{
    "hits": [
        {
            "id": "456",
            "title": "Ipad 包"
        }
    ],
    "nbHits": 1,
    "exhaustiveNbHits": false,
    "query": "\"包\"",
    "limit": 20,
    "offset": 0,
    "processingTimeMs": 0
}

@ManyTheFish
Member

Hello @gemini133, it seems that you are using phrase search in your queries by putting " around your words.
If you remove the ", do you get the expected behavior?

@curquiza curquiza added the support Issues related to support questions label Sep 27, 2021
@gemini133
Author

gemini133 commented Sep 28, 2021

@ManyTheFish yes, I'm using phrase/exact search because the results of a normal search are somewhat irrelevant to me,
so I tried phrase search instead.

Please check screenshot

[screenshot attached]

@ManyTheFish
Member

@gemini133, I can't speak or read Chinese, so sorry if I don't understand the complete problem.

But, if I'm right, only the second request gets an irrelevant response, because Ipad 包 should never be returned for 化妆包.

Am I right?

@gemini133
Author

Sorry, and you are right: Ipad 包 should never be returned for 化妆包. Is it related to ranking rules? I've tweaked them a couple of times but still didn't get what I wanted, so I then turned to phrase search.

@ManyTheFish
Member

I will investigate.
Thank you @gemini133 for your report. 👍

@curquiza curquiza added bug Something isn't working as expected and removed support Issues related to support questions labels Sep 28, 2021
@ManyTheFish ManyTheFish self-assigned this Sep 28, 2021
@curquiza curquiza added this to Candidates in Bug triage via automation Sep 28, 2021
@curquiza curquiza added this to the v0.24.0 milestone Sep 28, 2021
@curquiza curquiza changed the title phrase match error in chinese Phrase match error in chinese Sep 28, 2021
bors bot added a commit to meilisearch/milli that referenced this issue Sep 28, 2021
372: Fix Meilisearch 1714 r=Kerollmops a=ManyTheFish

The bug comes from the typo tolerance: to know how many typos are accepted, we were counting bytes instead of characters in a word.
On Chinese-script characters, we were allowing 2 typos on 3-character words.
We now count the number of characters instead of bytes when assigning the typo tolerance.

Related to [Meilisearch#1714](meilisearch/meilisearch#1714)

Co-authored-by: many <maxime@meilisearch.com>
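To illustrate the fix described in that commit message: 化妆包 is 3 characters but 9 bytes in UTF-8, so measuring word length in bytes crosses the two-typo threshold. Below is a minimal sketch, assuming Meilisearch's documented default thresholds of 1 typo from 5 characters and 2 typos from 9; `allowed_typos` is a hypothetical helper, not the actual milli code.

```python
def allowed_typos(word: str, count_bytes: bool = False) -> int:
    # Default Meilisearch thresholds: 1 typo allowed from 5 "characters",
    # 2 typos from 9. The bug was measuring length in UTF-8 bytes.
    n = len(word.encode("utf-8")) if count_bytes else len(word)
    if n >= 9:
        return 2
    if n >= 5:
        return 1
    return 0

word = "化妆包"  # 3 characters, but 9 bytes in UTF-8
print(allowed_typos(word, count_bytes=True))   # buggy byte counting: 2
print(allowed_typos(word, count_bytes=False))  # fixed char counting: 0
```

Under byte counting, a 3-character Chinese word tolerated 2 typos, which made almost any nearby word an accepted match.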
bors bot added a commit to meilisearch/charabia that referenced this issue Sep 28, 2021
58: Test Meilisearch issue 1714 r=irevoire a=ManyTheFish

Related to [Meilisearch#1714](meilisearch/meilisearch#1714)

no bug in Tokenizer

Co-authored-by: many <maxime@meilisearch.com>
@ManyTheFish ManyTheFish modified the milestones: v0.24.0, v0.23.0 Sep 28, 2021
@curquiza curquiza moved this from Candidates to Bugs - severity 2 in Bug triage Sep 29, 2021
@curquiza
Member

Closed by #1711 containing milli v0.17.0

Bug triage automation moved this from Bugs - severity 2 to Done Sep 29, 2021
@gemini133
Author

gemini133 commented Oct 2, 2021

Here is feedback on 0.23, built from the master branch.


**Phrase search**

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"小\"" }' | rq
// correct
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "123",
      "title": "小化妆包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 0,
  "query": "\"小\""
}



curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"小化妆\"" }' | rq
// wrong, must return 小化妆包
{
  "exhaustiveNbHits": false,
  "hits": [],
  "limit": 20,
  "nbHits": 0,
  "offset": 0,
  "processingTimeMs": 1,
  "query": "\"小化妆\"" 
}

---- 

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"化妆包\"" }' | rq
// correct
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "123",
      "title": "小化妆包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 0,
  "query": "\"化妆包\""
}

----

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"小化妆包\"" }' | rq
// correct
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "123",
      "title": "小化妆包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 0,
  "query": "\"小化妆包\""
}

----

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"包\"" }' | rq
// wrong, must contain 小化妆包 too
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "456",
      "title": "Ipad 包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 1,
  "query": "\"包\""
}

---

**Normal Search**

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "包" }' | rq
// wrong, must contain 小化妆包 too
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "456",
      "title": "Ipad 包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 0,
  "query": "包"
}

----

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "化妆包" }' | rq
// not sure; if jieba splits it into "化妆" and "包", then it should contain "Ipad 包"
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "123",
      "title": "小化妆包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 1,
  "query": "化妆包"
}

----


curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "化妆" }' | rq
// correct
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "123",
      "title": "小化妆包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 0,
  "query": "化妆"
}

----

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "小" }' | rq
// correct
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "123",
      "title": "小化妆包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 0,
  "query": "小"
}

@gemini133
Author

@curquiza @ManyTheFish

@ManyTheFish
Member

Hello @gemini133!
In your last comment, there are different reasons why Meilisearch doesn't return the expected response:

Tokenization & Phrase search

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"小化妆\"" }' | rq
// wrong, must return 小化妆包
{
  "exhaustiveNbHits": false,
  "hits": [],
  "limit": 20,
  "nbHits": 0,
  "offset": 0,
  "processingTimeMs": 1,
  "query": "\"小化妆\""
}

In Meilisearch we can only store one variation of a tokenized text, and this variation is given by Jieba.
Here the text 小化妆包 is tokenized into two "words", ["小", "化妆包"].
The search query is tokenized into two "words", ["小", "化妆"], and because it's a phrase search, we enforce an exact match on both words:

  • "小" == "小" ---------> ✅
  • "化妆" =/= "化妆包" ---> ❌

This does not match.
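The token-by-token comparison described above can be sketched as follows. The hardcoded token lists are the Jieba segmentations given in this comment, and `phrase_matches` is a hypothetical helper, not Meilisearch's actual implementation:

```python
# Assumed Jieba segmentations, as described above.
doc_tokens = ["小", "化妆包"]   # document title "小化妆包"
query_tokens = ["小", "化妆"]   # phrase query "小化妆"

def phrase_matches(query: list[str], doc: list[str]) -> bool:
    # A phrase search requires an exact, contiguous, token-by-token match.
    n = len(query)
    return any(doc[i:i + n] == query for i in range(len(doc) - n + 1))

print(phrase_matches(query_tokens, doc_tokens))      # False: "化妆" != "化妆包"
print(phrase_matches(["小", "化妆包"], doc_tokens))  # True: exact token match
```

Because the comparison is between whole tokens, "化妆" can never satisfy a phrase match against the stored token "化妆包".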

Tokenization & suffix search

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "\"包\"" }' | rq
OR curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "包" }' | rq
// wrong, must contain 小化妆包 too
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "456",
      "title": "Ipad 包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 1,
  "query": "\"包\""
}

In Meilisearch we can only store one variation of a tokenized text, and this variation is given by Jieba.
Here the text 小化妆包 is tokenized into two "words", ["小", "化妆包"] (we technically can't store more variations for now).
Moreover, we don't support suffix search, only prefix search. This means that the word 化妆包 is never a match for the word 包.
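A minimal sketch of the prefix-only matching described above. `prefix_match` is a hypothetical helper; Meilisearch's real prefix search has more machinery (it typically applies only to the last query word), but the asymmetry is the same:

```python
def prefix_match(query_word: str, stored_word: str) -> bool:
    # Prefix search: a query word matches stored words that start with it.
    # There is no symmetric suffix search.
    return stored_word.startswith(query_word)

stored = "化妆包"  # the token Jieba produced for the document title
print(prefix_match("化妆", stored))  # True: "化妆" is a prefix of "化妆包"
print(prefix_match("包", stored))    # False: "包" is only a suffix
```

This is why a query for 包 can find Ipad 包 (where 包 is stored as its own token) but not 小化妆包 (where 包 is only the tail of the token 化妆包).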

Tokenization & query Words

curl -s -X POST 'http://localhost:7700/indexes/products/search' --data '{ "q": "化妆包" }' | rq
// not sure; if jieba splits it into "化妆" and "包", then it should contain "Ipad 包"
{
  "exhaustiveNbHits": false,
  "hits": [
    {
      "id": "123",
      "title": "小化妆包"
    }
  ],
  "limit": 20,
  "nbHits": 1,
  "offset": 0,
  "processingTimeMs": 1,
  "query": "化妆包"
}

Meilisearch tokenizes the query 化妆包 as the single token ["化妆包"], so the word 包 in Ipad 包 can't be a match for it.
Moreover, even if Meilisearch tokenized 化妆包 as ["化妆", "包"], we still couldn't match Ipad 包, because the document doesn't contain the word "化妆".

I hope my explanations were clear. Thanks a lot @gemini133 for your detailed report! 👍

@gemini133
Author

Thanks for your detailed explanation! It's very helpful.
Honestly, I'm somewhat disappointed compared to my past experience with Elasticsearch.

@ManyTheFish
Member

ManyTheFish commented Oct 5, 2021

I'm sorry for that.
But I have some questions.

Is it really relevant to return Ipad 包 for the query 化妆包?

I used a translation tool, and 化妆包 seems to mean cosmetic bag. I don't personally find that an Ipad bag is a good response for a cosmetic bag. 🤔
Is there another translation of 化妆包 that makes Ipad 包 relevant?

What is the most disappointing thing in the responses given by Meilisearch?

Is there one of your examples that makes Meilisearch irrelevant for your use case?

@gemini133
Author

gemini133 commented Oct 6, 2021

No, no, no. Ipad 包 is not relevant to 化妆包, but when you search for "包", they both should be returned, right?

From the point of view of a normal user, when I search for "包", the search results should contain all products whose titles contain "包": in this case, "Ipad 包" and "化妆包".

@ManyTheFish
Member

ManyTheFish commented Oct 6, 2021

Yes, we could tune Jieba to tokenize better for our users.
Unfortunately, I can't read or write any Chinese script, so if you know a bit about Jieba and how we could tune it to enhance your experience with Meilisearch, I would gladly make the changes! 🙂

@gemini133
Author

@ManyTheFish hey man, really appreciate your help. I'll look into Jieba and see what I can do for all of us.

@gemini133
Author

@ManyTheFish unfortunately, I didn't find a way to tweak Jieba except for load_dict, which loads our own dictionary.

messense/jieba-rs#77
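For reference, a Jieba user dictionary is a plain-text file with one entry per line in the format used by Python jieba: the word, then an optional frequency, then an optional part-of-speech tag. Assuming jieba-rs's load_dict accepts the same format, a dictionary nudging the segmenter toward shorter tokens might look like the sketch below; whether these entries would actually make Jieba split 小化妆包 into ["小", "化妆", "包"] depends on the frequencies involved, so treat this as an untested illustration, not a verified fix.

```text
化妆 100000 n
包 100000 n
```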

@ManyTheFish ManyTheFish added the tokenizer Related to the tokenizer repo: https://github.com/meilisearch/tokenizer/ label Dec 23, 2021
bors bot pushed a commit that referenced this issue Jan 16, 2023