Phrase match error in Chinese #1714
I did a quick test with an English document:
@gemini133 I guess this is because Meilisearch uses Jieba to tokenize the search word "小化妆包", and it is understood as "化妆包". See https://docs.meilisearch.com/reference/under_the_hood/tokenization.html
Yep, I think it's related to the tokenizer too.
Another test with just "包":
Hello @gemini133, it seems that you are using phrase search in your queries, putting the query between double quotes.
@ManyTheFish yes, I'm using phrase/exact search because the results of a normal search are somewhat irrelevant to me. Please check the screenshot.
@gemini133, I can't speak any Chinese, sorry if I don't understand the complete problem. But, if I'm right, only the second request gets an irrelevant response. Am I right?
Sorry, and you are right.
I will investigate.
372: Fix Meilisearch 1714 r=Kerollmops a=ManyTheFish
The bug comes from the typo tolerance: to determine how many typos are accepted, we were counting bytes instead of characters in a word. For Chinese-script words, this meant allowing 2 typos on 3-character words. We now count the number of chars instead of counting bytes when assigning the typo tolerance.
Related to [Meilisearch#1714](meilisearch/meilisearch#1714)
Co-authored-by: many <maxime@meilisearch.com>
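The byte-vs-character distinction behind the fix is easy to demonstrate in Rust. The thresholds below mirror Meilisearch's default typo settings (one typo from 5 characters, two from 9), but the `typos_allowed` function is only an illustrative sketch, not the engine's actual code:

```rust
// Sketch of length-based typo tolerance (hypothetical helper, not
// Meilisearch's real implementation). Counting bytes overcounts for
// multi-byte UTF-8 scripts: each CJK character is 3 bytes.
fn typos_allowed(word: &str) -> u8 {
    // Count Unicode characters, not bytes.
    match word.chars().count() {
        0..=4 => 0, // short words: no typos tolerated
        5..=8 => 1,
        _ => 2,
    }
}

fn main() {
    let word = "化妆包"; // 3 characters, but 9 bytes in UTF-8
    assert_eq!(word.len(), 9);           // byte length
    assert_eq!(word.chars().count(), 3); // character count
    // Byte counting would have treated this as a 9-"letter" word and
    // allowed 2 typos; character counting allows 0.
    assert_eq!(typos_allowed(word), 0);
}
```

With byte counting, almost any 3-character Chinese word tolerated 2 typos, which is why "小化妆" could loosely match unrelated words.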
58: Test Meilisearch issue 1714 r=irevoire a=ManyTheFish
Related to [Meilisearch#1714](meilisearch/meilisearch#1714): no bug in the tokenizer.
Co-authored-by: many <maxime@meilisearch.com>
Closed by #1711, containing milli v0.17.0
Here is the feedback on 0.23, built from the master branch:
Hello @gemini133!

Tokenization & phrase search
In Meilisearch we can only store 1 variation of a tokenized text, and this variation is given by Jieba. This does not match.

Tokenization & suffix search
In Meilisearch we can only store 1 variation of a tokenized text, and this variation is given by Jieba.

Tokenization & query words
Meilisearch tokenizes …

I hope my explanations were clear. Thanks a lot @gemini133 for your detailed report! 👍
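The store-one-segmentation behavior described above can be sketched with a toy forward-maximum-matching segmenter. This is not Jieba's algorithm (Jieba builds a weighted word DAG and applies an HMM), and the tiny dictionary here is an assumption, but it shows why a phrase query can fail to align with the stored tokens:

```rust
use std::collections::HashSet;

// Toy greedy longest-match segmenter (NOT Jieba). At each position,
// take the longest dictionary word; fall back to a single character.
fn segment<'a>(text: &'a str, dict: &HashSet<&str>) -> Vec<&'a str> {
    let chars: Vec<(usize, char)> = text.char_indices().collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let start = chars[i].0;
        let mut end = chars.len();
        loop {
            let stop = if end == chars.len() { text.len() } else { chars[end].0 };
            let cand = &text[start..stop];
            if dict.contains(cand) || end == i + 1 {
                out.push(cand);
                i = end;
                break;
            }
            end -= 1;
        }
    }
    out
}

fn main() {
    let dict: HashSet<&str> = ["化妆包", "化妆", "包", "小"].into();
    // The document title is stored as the tokens ["小", "化妆包"] ...
    assert_eq!(segment("小化妆包", &dict), vec!["小", "化妆包"]);
    // ... but the query "小化妆" segments as ["小", "化妆"], and the
    // token "化妆" never appears among the stored tokens, so the
    // phrase search finds nothing.
    assert_eq!(segment("小化妆", &dict), vec!["小", "化妆"]);
}
```

Since only one segmentation per text is stored, any query whose token boundaries cut across the stored ones cannot match as a phrase.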
Thanks for your detailed explanation! It's very helpful.
I'm sorry for that. Is it really relevant to respond …?
No, no, no. From the point of view of a normal user, when I search for "包", the results should contain all products whose titles contain "包", in this case "ipad 包" and "化妆包".
Yes, we could tune Jieba to tokenize better for our users. |
@ManyTheFish hey man, really appreciate your help. I'll look into jieba and see what I can do for all of us |
@ManyTheFish unfortunately, I didn't find a way to tweak Jieba except for …
Documents indexed with default settings.
search with phrase match "化妆": no results returned.
search with phrase match "小化妆": no results returned.
search with phrase match "化妆包": results returned.
search with phrase match "小化妆包": results returned.
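Each of the steps above is a phrase search: the phrase is wrapped in escaped double quotes inside the `q` field of the body sent to Meilisearch's `POST /indexes/<uid>/search` endpoint (the index uid is not given in the report). A minimal sketch of building such a request body:

```rust
fn main() {
    // Phrase search in Meilisearch: the phrase goes between double
    // quotes inside the `q` parameter of the search request body.
    // The target endpoint would be POST /indexes/<uid>/search; the
    // index uid and host are assumptions, not from the report.
    let phrase = "化妆";
    let body = format!(r#"{{"q": "\"{}\""}}"#, phrase);
    assert_eq!(body, r#"{"q": "\"化妆\""}"#);
    println!("{}", body); // {"q": "\"化妆\""}
}
```

The inner escaped quotes are what turn a normal search into a phrase match; without them, "化妆" would be matched as an ordinary query word.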
Expected behavior
The queries "化妆", "小化妆", "化妆包", and "小化妆包" should all return the same result.
MeiliSearch version:
meilisearch-http 0.22.0