
[v1.1.0-rc.0] Japanese documents cannot be searched properly with Kanji-only documents #3565

Closed
miiton opened this issue Mar 7, 2023 · 6 comments · Fixed by #3569
Labels: bug (Something isn't working as expected), v1.1.0 (PRs/issues solved in v1.1.0, released on 2023-04-03)
Milestone: v1.1.0

miiton commented Mar 7, 2023

Describe the bug

I was testing search by adding Japanese documents, but when I added a Kanji-only document, the previously added Japanese documents were no longer returned by the search.

To Reproduce

summary

  1. add documents [{id: 1, name: "東京スカパラダイスオーケストラ"}] -> search with 東京 : expected
  2. add documents [{id: 2, name: "東京バナナ"}, {id:3, name: "東京 ポテチ"}] -> search with 東京 : expected
  3. add documents [{id: 4, name: "東京特許許可局"}] -> search with 東京 : not expected

launch

```bash
docker run --rm -p 7700:7700 -e MEILI_MASTER_KEY=foobar getmeili/meilisearch:v1.1.0-rc.0
```

processing

```bash
# create an index
curl -H "Authorization: Bearer foobar" -H "Content-Type: application/json" -X POST localhost:7700/indexes -d '{"uid":"hogehoge","primaryKey":"id"}'

# add a Japanese document
curl -H "Authorization: Bearer foobar" -H "Content-Type: application/json" -X POST localhost:7700/indexes/hogehoge/documents -d '{"id":1,"name":"東京スカパラダイスオーケストラ"}'

# search and hit 1 document (correct)
curl -H "Authorization: Bearer foobar" -H "Content-Type: application/json" -X POST localhost:7700/indexes/hogehoge/search -d '{"q":"東京"}'

{"hits":[{"id":1,"name":"東京スカパラダイスオーケストラ"}],"query":"東京","processingTimeMs":0,"limit":20,"offset":0,"estimatedTotalHits":1}

# add more documents
curl -H "Authorization: Bearer foobar" -H "Content-Type: application/json" -X POST localhost:7700/indexes/hogehoge/documents -d '{"id":2,"name":"東京バナナ"}'
curl -H "Authorization: Bearer foobar" -H "Content-Type: application/json" -X POST localhost:7700/indexes/hogehoge/documents -d '{"id":3,"name":"東京 ポテチ"}'

# search and hit 3 documents (correct)
curl -H "Authorization: Bearer foobar" -H "Content-Type: application/json" -X POST localhost:7700/indexes/hogehoge/search -d '{"q":"東京"}'

{"hits":[{"id":1,"name":"東京スカパラダイスオーケストラ"},{"id":2,"name":"東京バナナ"},{"id":3,"name":"東京 ポテチ"}],"query":"東京","processingTimeMs":0,"limit":20,"offset":0,"estimatedTotalHits":3}

# add a Kanji-only document
curl -H "Authorization: Bearer foobar" -H "Content-Type: application/json" -X POST localhost:7700/indexes/hogehoge/documents -d '{"id":4,"name":"東京特許許可局"}'

# search and hit only the last document (not expected)
curl -H "Authorization: Bearer foobar" -H "Content-Type: application/json" -X POST localhost:7700/indexes/hogehoge/search -d '{"q":"東京"}'

{"hits":[{"id":4,"name":"東京特許許可局"}],"query":"東京","processingTimeMs":0,"limit":20,"offset":0,"estimatedTotalHits":1}

# add another document
curl -H "Authorization: Bearer foobar" -H "Content-Type: application/json" -X POST localhost:7700/indexes/hogehoge/documents -d '{"id":5,"name":"東京スカイツリー"}'

# search but hit only 1 document (not expected)
curl -H "Authorization: Bearer foobar" -H "Content-Type: application/json" -X POST localhost:7700/indexes/hogehoge/search -d '{"q":"東京"}'

{"hits":[{"id":4,"name":"東京特許許可局"}],"query":"東京","processingTimeMs":0,"limit":20,"offset":0,"estimatedTotalHits":1}
```

Expected behavior

I expected that all documents could be found by searching for 東京.

Meilisearch version: v1.1.0-rc.0

Additional context

Ref #2403 #3347

ManyTheFish (Member) commented:

Hello @miiton, thank you for trying this release candidate.

Your case is a hard one. I have a fix in mind, but it's a partial one:
the documents that were already searchable will remain searchable; however, I still have an issue making '{"id":4,"name":"東京特許許可局"}' searchable 😬

I will investigate!

curquiza added the bug label Mar 7, 2023
curquiza added this to the v1.1.0 milestone Mar 7, 2023
ManyTheFish (Member) commented Mar 8, 2023

Hello @miiton,
I've tried to enhance the Language detection for Japanese support and released a prototype on Docker. However, this prototype only partially fixes the issue; see the PR below, which explains the limits of this enhancement and how to try it:
#3569

Please try and share this prototype, and give me some feedback!

Thanks!

miiton (Author) commented Mar 8, 2023

You're so fast!

As you wrote in your PR, adding Japanese sentences (i.e. text that includes Hiragana or Katakana) to the desc field gave me the results I was expecting. So I understood that one key point of DB creation is that a DB containing only isolated words will not give the expected results.

[Screenshots: search results before and after the prototype]

ManyTheFish (Member) commented Mar 9, 2023

@miiton,
thank you for your feedback.
I will set the PR status to "ready for review" in order to merge it for v1.1.0-rc.1.
If you see other cases where Meilisearch gets it wrong, don't hesitate to share them.
I think the next enhancement is to contribute to Whatlang, as you suggested last year:
greyblake/whatlang-rs#122

bors bot added a commit that referenced this issue Mar 9, 2023
3568: CI: Fix `publish-aarch64` job that still uses ubuntu-18.04 r=Kerollmops a=curquiza

Fixes #3563 

Main change
- use an `ubuntu-18.04` Docker container instead of GitHub Actions' native `ubuntu-18.04` runner; I had to install Docker in the container.

Small additional changes
- remove useless `fail-fast` and unused/irrelevant matrix inputs (`build`, `linker`, `os`, `use-cross`...)
- Remove useless step in job

Proof of work with this CI triggered on this current branch: https://github.com/meilisearch/meilisearch/actions/runs/4366233882

3569: Enhance Japanese language detection r=dureuill a=ManyTheFish

# Pull Request

This PR is a prototype and can be tested by downloading [the dedicated docker image](https://hub.docker.com/layers/getmeili/meilisearch/prototype-better-language-detection-0/images/sha256-a12847de00e21a71ab797879fd09777dadcb0881f65b5f810e7d1ed434d116ef?context=explore):

```bash
$ docker pull getmeili/meilisearch:prototype-better-language-detection-0
```

## Context
Some Languages are harder to detect than others; this misdetection leads to bad tokenization, making some words or even whole documents completely unsearchable. Japanese is the main Language affected: it can be detected as Chinese, which is tokenized in a completely different way.

A [first iteration was implemented for v1.1.0](#3347), but it is not enough to make Japanese work. This first implementation detected the Language during indexing in order to avoid bad detections during the search.
Unfortunately, some (shorter) documents can be wrongly detected as Chinese, which leads to bad tokenization for those documents and, because Chinese has then been detected during indexing, also makes it possible for Chinese to be detected during the search.

For instance, the Japanese document `{"id": 1, "name": "東京スカパラダイスオーケストラ"}` is detected as Japanese during indexing. During the search, the query `東京` will then be detected as Japanese, because only Japanese documents were detected during indexing, whereas v1.0.2 would have detected it as Chinese.
However, if the dataset contains at least one document with a field made up only of Kanji, like:
_A document with only 1 field containing only Kanjis:_
```json
{
 "id":4,
 "name": "東京特許許可局"
}
```
_A document with 1 field containing only Kanjis and 1 field containing several Japanese characters:_
```json
{
 "id":105,
 "name": "東京特許許可局",
 "desc": "日経平均株価は26日 に約8カ月ぶりに2万4000円の心理的な節目を上回った。株高を支える材料のひとつは、自民党総裁選で3選を決めた安倍晋三首相の経済政策への期待だ。恩恵が見込まれるとされる人材サービスや建設株の一角が買われている。ただ思惑が先行して資金が集まっている面 は否めない。実際に政策効果を取り込む企業はどこか、なお未知数だ。"
}
```

Then, in both cases, the field `name` will be detected as Chinese during indexing, allowing the search to detect Chinese in queries. Therefore, the query `東京` will be detected as Chinese and only the last two documents will be retrieved by Meilisearch.

## Technical Approach

The current PR partially fixes these issues by:
1) Adding a check over potential misdetections and rerunning the extraction of the document, forcing the tokenization over the main Languages detected in it (a rough sketch of this check follows after the list):
 >  1) run a first extraction allowing the tokenizer to detect any Language in any Script
 >  2) generate a distribution of tokens by Script and Language (`script_language`)
 >  3) if, for a Script, the token distribution of one of the Languages is under the threshold, rerun the extraction forbidding the tokenizer to detect the marginal Languages
 >  4) the tokenizer then falls back on the other available Languages to tokenize the text. For example, if Chinese was marginally detected compared to Japanese on the CJ script, the second extraction forces Japanese tokenization for CJ text in the document; text in another script, like Latin, is not impacted by this restriction.

2) Adding a filtering threshold during the search over Languages that have been marginally detected in documents.
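
To make the rerun decision of point 1 concrete, here is a minimal sketch in Rust. It is illustrative only, not the actual Meilisearch/charabia code; the names (`misdetected_languages`, `Script`, `Language`) and the token counts are hypothetical:

```rust
use std::collections::HashMap;

// Hypothetical aliases for illustration only.
type Script = &'static str;
type Language = &'static str;

/// Given how many tokens were detected per (Script, Language) during a first
/// extraction pass, return the Languages considered misdetected for their Script,
/// i.e. those holding less than `threshold` (e.g. 0.10 = 10%) of that Script's
/// tokens. The extraction would then be rerun with these Languages forbidden, so
/// the tokenizer falls back on the remaining Languages of the same Script.
fn misdetected_languages(
    token_counts: &HashMap<(Script, Language), usize>,
    threshold: f64,
) -> Vec<(Script, Language)> {
    // Total number of tokens detected per Script.
    let mut per_script: HashMap<Script, usize> = HashMap::new();
    for (&(script, _), &count) in token_counts {
        *per_script.entry(script).or_insert(0) += count;
    }

    // Keep only the (Script, Language) pairs under the threshold.
    token_counts
        .iter()
        .filter(|&(&(script, _), &count)| (count as f64) < threshold * per_script[script] as f64)
        .map(|(&key, _)| key)
        .collect()
}

fn main() {
    // Toy numbers: on the CJ script, 92 tokens are detected as Japanese (Jpn)
    // and 8 as Chinese (Cmn). 8 / 100 = 8% < 10%, so Chinese would be excluded
    // and the document re-extracted with Japanese tokenization for CJ text.
    let mut counts = HashMap::new();
    counts.insert(("Cj", "Jpn"), 92);
    counts.insert(("Cj", "Cmn"), 8);
    println!("{:?}", misdetected_languages(&counts, 0.10));
}
```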

## Limits
This PR introduces 2 arbitrary thresholds:
1) during indexing, a Language is considered misdetected if the number of tokens detected for this Language is under 10% of the tokens detected for the same Script (Japanese and Chinese are 2 different Languages sharing the "same" script, "CJK").
2) during the search, a Language is considered marginal if fewer than 5% of documents are detected as this Language (a rough sketch of this filter follows below).
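
As a hedged illustration of that search-side filter, here is a minimal Rust sketch: only Languages detected in at least 5% of the indexed documents are kept when detecting the query's Language. The names (`allowed_search_languages`, `Language`) and the numbers are hypothetical, not Meilisearch's actual API:

```rust
use std::collections::HashMap;

// Hypothetical alias for illustration only.
type Language = &'static str;

/// During a search, restrict the query's Language detection to Languages that
/// were detected in at least `threshold` (e.g. 0.05 = 5%) of the indexed
/// documents; marginally detected Languages are filtered out.
fn allowed_search_languages(
    docs_per_language: &HashMap<Language, usize>,
    total_docs: usize,
    threshold: f64,
) -> Vec<Language> {
    docs_per_language
        .iter()
        .filter(|&(_, &count)| count as f64 >= threshold * total_docs as f64)
        .map(|(&lang, _)| lang)
        .collect()
}

fn main() {
    // Toy numbers: out of 100 documents, 96 were detected as Japanese and 4 as
    // Chinese. 4 / 100 = 4% < 5%, so Chinese is filtered out and the query 東京
    // is tokenized as Japanese.
    let mut counts = HashMap::new();
    counts.insert("Jpn", 96);
    counts.insert("Cmn", 4);
    println!("{:?}", allowed_search_languages(&counts, 100, 0.05));
}
```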

This PR only partially fixes these issues:
- ✅ the query `東京` now finds Japanese documents if fewer than 5% of documents are detected as Chinese.
- ✅ the document with id `105`, which contains the Japanese field `desc` but the misdetected field `name`, is now fully detected and tokenized as Japanese and is found with the query `東京`.
- ❌ the document with id `4` no longer breaks the search-side Language detection, but it is still detected as a Chinese document and can't be found during the search.

## Related issue
Fixes #3565

## Possible future enhancements
- Change or contribute to the Library used to detect the Language
  - the related issue on Whatlang: greyblake/whatlang-rs#122

Co-authored-by: curquiza <clementine@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Many the fish <many@meilisearch.com>
curquiza (Member) commented Mar 9, 2023

Fixed by #3569
Will be released in rc1 today

curquiza closed this as completed Mar 9, 2023
miiton (Author) commented Mar 10, 2023

I tried testing a bit more.
With the changes merged into v1.1.0-rc.1, it seems that search works well when a document contains a large amount of text, but not when it contains only a little.

When searching for 東京 after importing data_with_shortdesc.csv, only the Kanji-only entries are hit.
When searching for 東京 after importing data_with_longdesc.csv, I get roughly the desired results (though I don't understand why only テレビ東京 isn't hit).

Import commands:

```bash
curl -X POST -H 'Content-Type: text/csv' localhost:7700/indexes/:index_uid/documents --data-binary @data_with_shortdesc.csv
curl -X POST -H 'Content-Type: text/csv' localhost:7700/indexes/:index_uid/documents --data-binary @data_with_longdesc.csv
```

At any rate, if I want Japanese search to work reliably, I need to put a long dummy Japanese string in each document, as in data_with_longdesc.csv; otherwise it doesn't work well, and the dummy text wastes DB space.

I think it would be easier to solve this problem if we could enhance the test cases in Japanese. Is there anything I can do to help?

data_with_shortdesc.csv
data_with_longdesc.csv
