Enhance language detection #3347

ManyTheFish · 2023-01-16T16:42:36Z

Summary

Some completely unrelated Languages share the same characters. In Meilisearch we detect Languages using whatlang, which works well on large texts but fails on short search queries, leading to bad segmentation and normalization of the query.

This PR stores the Languages detected during indexing in order to restrict the list of Languages that can be detected at search time.

Detail

Create a 19th database mapping each detected (script, Language) pair to the documents in which it was detected
Fill the newly created database during indexing
Build an allow-list from this database and pass it to Charabia
Add a test ensuring that a Japanese request containing only kanji is detected as Japanese and not Chinese
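
The steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this PR: the `Script` and `Language` enums are simplified stand-ins for the Charabia types, the `BTreeMap` stands in for the new database, and `record_detection` and `build_allow_list` are hypothetical helper names.

```rust
use std::collections::{BTreeMap, BTreeSet, HashMap};

// Simplified stand-ins for charabia's `Script` and `Language` (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
enum Script {
    Cj,
    Latin,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
enum Language {
    Cmn,
    Eng,
    Jpn,
}

// In-memory stand-in for the new database:
// (script, language) -> ids of the documents where the pair was detected.
type ScriptLanguageDocids = BTreeMap<(Script, Language), BTreeSet<u32>>;

// Indexing side: remember which pair was detected for a document.
// Nothing is stored when the language could not be detected.
fn record_detection(
    db: &mut ScriptLanguageDocids,
    docid: u32,
    script: Script,
    language: Option<Language>,
) {
    if let Some(language) = language {
        db.entry((script, language)).or_default().insert(docid);
    }
}

// Search side: only languages seen during indexing may be detected in a query.
fn build_allow_list(db: &ScriptLanguageDocids) -> HashMap<Script, Vec<Language>> {
    let mut allow_list: HashMap<Script, Vec<Language>> = HashMap::new();
    for ((script, language), docids) in db {
        if !docids.is_empty() {
            allow_list.entry(*script).or_default().push(*language);
        }
    }
    allow_list
}

fn main() {
    let mut db = ScriptLanguageDocids::new();
    record_detection(&mut db, 0, Script::Latin, Some(Language::Eng));
    record_detection(&mut db, 1, Script::Cj, Some(Language::Jpn));
    record_detection(&mut db, 2, Script::Cj, None);

    let allow_list = build_allow_list(&db);
    // Only Japanese was indexed for the CJ script, so a kanji-only query
    // can no longer be detected as Chinese (`Language::Cmn`).
    assert_eq!(allow_list.get(&Script::Cj), Some(&vec![Language::Jpn]));
    assert!(!allow_list[&Script::Cj].contains(&Language::Cmn));
}
```

With such an allow-list, the tokenizer only has to choose among Languages that actually appear in the index, which is what makes short queries less ambiguous.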

Related issues

Fixes #2403
Fixes #3513

irevoire

Super nice. Could you add one little integration test on meilisearch just to ensure we don't disable the feature unexpectedly please?

This test, converted for meilisearch, should do it:

    fn store_detected_script_and_language_per_document_during_indexing() {
        use charabia::{Language, Script};
        let index = TempIndex::new();
        index
            .add_documents(documents!([
                { "id": 1, "title": "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!" },
                { "id": 2, "title": "人人生而自由﹐在尊嚴和權利上一律平等。他們賦有理性和良心﹐並應以兄弟關係的精神互相對待。" },
                { "id": 3, "title": "הַשּׁוּעָל הַמָּהִיר (״הַחוּם״) לֹא יָכוֹל לִקְפֹּץ 9.94 מֶטְרִים, נָכוֹן? ברר, 1.5°C- בַּחוּץ!" },
                { "id": 4, "title": "関西国際空港限定トートバッグ すもももももももものうち" },
                { "id": 5, "title": "ภาษาไทยง่ายนิดเดียว" },
                { "id": 6, "title": "The quick 在尊嚴和權利上一律平等。" },
            ]))
            .unwrap();

        let rtxn = index.read_txn().unwrap();
        let key_jpn = (Script::Cj, Language::Jpn);
        let key_cmn = (Script::Cj, Language::Cmn);
        let cj_jpn_docs = index.script_language_documents_ids(&rtxn, &key_jpn).unwrap().unwrap();
        let cj_cmn_docs = index.script_language_documents_ids(&rtxn, &key_cmn).unwrap().unwrap();
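        // Expected values are internal document ids (assigned from 0 in insertion order here),
        // not the "id" field: "id": 4 is internal id 3; "id": 2 and "id": 6 are 1 and 5.
        let expected_cj_jpn_docids = [3].iter().collect();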
        assert_eq!(cj_jpn_docs, expected_cj_jpn_docids);
        let expected_cj_cmn_docids = [1, 5].iter().collect();
        assert_eq!(cj_cmn_docs, expected_cj_cmn_docids);
    }

irevoire

Perfect, thanks! 🐚
bors merge

irevoire · 2023-02-20T17:03:20Z

Arf, there is a conflict @ManyTheFish

bors · 2023-02-20T17:14:50Z

Canceled.

github-actions · 2023-02-20T17:40:44Z

Uffizzi Preview Environment Deploying

☁️ https://app.uffizzi.com/github.com/meilisearch/meilisearch/pull/3347

⚙️ Updating now by workflow run 4231346523.

The meilisearch preview environment contains a web terminal from which you can run the
meilisearch command. You should be able to access the instance of meilisearch running in
the preview through the Meilisearch Endpoint link given below.

Web Terminal Endpoint :
Meilisearch Endpoint : /meilisearch

irevoire

Thanks! 📦
bors merge

bors · 2023-02-21T11:52:38Z

Build succeeded:

3568: CI: Fix `publish-aarch64` job that still uses ubuntu-18.04 r=Kerollmops a=curquiza

Fixes #3563

Main change

- add the usage of the `ubuntu-18.04` container instead of the native `ubuntu-18.04` of GitHub actions: I had to install docker in the container.

Small additional changes

- remove useless `fail-fast` and unused/irrelevant matrix inputs (`build`, `linker`, `os`, `use-cross`...)
- Remove useless step in job

Proof of work with this CI triggered on this current branch: https://github.com/meilisearch/meilisearch/actions/runs/4366233882

3569: Enhance Japanese language detection r=dureuill a=ManyTheFish

# Pull Request

This PR is a prototype and can be tested by downloading [the dedicated docker image](https://hub.docker.com/layers/getmeili/meilisearch/prototype-better-language-detection-0/images/sha256-a12847de00e21a71ab797879fd09777dadcb0881f65b5f810e7d1ed434d116ef?context=explore):

```bash
$ docker pull getmeili/meilisearch:prototype-better-language-detection-0
```

## Context

Some Languages are harder to detect than others; this misdetection leads to bad tokenization, making some words or even documents completely unsearchable. Japanese is the main Language affected and can be detected as Chinese, which has a completely different way of tokenization.

A [first iteration has been implemented for v1.1.0](#3347) but is an insufficient enhancement to make Japanese work. This first implementation was detecting the Language during the indexing to avoid bad detections during the search. Unfortunately, some documents (shorter ones) can be wrongly detected as Chinese, running bad tokenization for these documents and making possible the detection of Chinese during the search because it has been detected during the indexing.

For instance, a Japanese document `{"id": 1, "name": "東京スカパラダイスオーケストラ"}` is detected as Japanese during indexing. During the search, the query `東京` will be detected as Japanese because only Japanese documents have been detected during indexing, despite the fact that v1.0.2 would detect it as Chinese. However, if in the dataset there is at least one document containing a field with only Kanjis like:

_A document with only 1 field containing only Kanjis:_

```json
{ "id":4, "name": "東京特許許可局" }
```

_A document with 1 field containing only Kanjis and 1 field containing several Japanese characters:_

```json
{ "id":105, "name": "東京特許許可局", "desc": "日経平均株価は26日に約8カ月ぶりに2万4000円の心理的な節目を上回った。株高を支える材料のひとつは、自民党総裁選で3選を決めた安倍晋三首相の経済政策への期待だ。恩恵が見込まれるとされる人材サービスや建設株の一角が買われている。ただ思惑が先行して資金が集まっている面は否めない。実際に政策効果を取り込む企業はどこか、なお未知数だ。" }
```

Then, in both cases, the field `name` will be detected as Chinese during indexing, allowing the search to detect Chinese in queries. Therefore, the query `東京` will be detected as Chinese and only the two last documents will be retrieved by Meilisearch.

## Technical Approach

The current PR partially fixes these issues by:

1) Adding a check over potential misdetections and rerunning the extraction of the document, forcing the tokenization over the main Languages detected in it.

> 1) run a first extraction allowing the tokenizer to detect any Language in any Script
> 2) generate a distribution of tokens by Script and Languages (`script_language`)
> 3) if for a Script we have a token distribution of one of the Languages that is under the threshold, then we rerun the extraction forbidding the tokenizer to detect the marginal Languages
> 4) the tokenizer will fall back on the other available Languages to tokenize the text. For example, if Chinese was marginally detected compared to Japanese on the CJ script, then the second extraction will force Japanese tokenization for CJ text in the document. However, the text on another script like Latin will not be impacted by this restriction.

2) Adding a filtering threshold during the search over Languages that have been marginally detected in documents

## Limits

This PR introduces 2 arbitrary thresholds:

1) during the indexing, a Language is considered misdetected if the number of detected tokens of this Language is under 10% of the tokens detected in the same Script (Japanese and Chinese are 2 different Languages sharing the "same" script "CJK").
2) during the search, a Language is considered marginal if less than 5% of documents are detected as this Language.

This PR only partially fixes these issues:

- ✅ the query `東京` now finds Japanese documents if less than 5% of documents are detected as Chinese.
- ✅ the document with the id `105` containing the Japanese field `desc` but the misdetected field `name` is now completely detected and tokenized as Japanese and is found with the query `東京`.
- ❌ the document with the id `4` no longer breaks the search Language detection but continues to be detected as a Chinese document and can't be found during the search.

## Related issue

Fixes #3565

## Possible future enhancements

- Change or contribute to the Library used to detect the Language
  - the related issue on Whatlang: greyblake/whatlang-rs#122

Co-authored-by: curquiza <clementine@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Many the fish <many@meilisearch.com>
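
The two thresholds described in the #3569 message above (10% of tokens within a Script at indexing time, 5% of documents at search time) can be illustrated with a minimal sketch. The enums, the constant and function names, and the reading of "5% of documents" as 5% of all documents in the index are assumptions made for this example. They are not taken from the Meilisearch code.

```rust
use std::collections::BTreeMap;

// Simplified stand-ins for charabia's `Script` and `Language` (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Script {
    Cj,
    Latin,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Language {
    Cmn,
    Eng,
    Jpn,
}

type Key = (Script, Language);

// Hypothetical names for the two thresholds quoted in the PR description.
const INDEXING_THRESHOLD: f64 = 0.10;
const SEARCH_THRESHOLD: f64 = 0.05;

// Indexing: a language is treated as misdetected when its tokens are under
// 10% of all tokens detected in the same script for the document.
fn misdetected_languages(token_counts: &BTreeMap<Key, u64>) -> Vec<Key> {
    let mut per_script: BTreeMap<Script, u64> = BTreeMap::new();
    for (&(script, _), &count) in token_counts {
        *per_script.entry(script).or_insert(0) += count;
    }
    let mut marginal = Vec::new();
    for (&(script, language), &count) in token_counts {
        let total = per_script[&script];
        if (count as f64) < INDEXING_THRESHOLD * total as f64 {
            marginal.push((script, language));
        }
    }
    marginal
}

// Search: keep only the languages detected in at least 5% of the documents.
fn languages_allowed_at_search(doc_counts: &BTreeMap<Key, u64>, total_docs: u64) -> Vec<Key> {
    let mut allowed = Vec::new();
    for (&key, &count) in doc_counts {
        if total_docs > 0 && count as f64 >= SEARCH_THRESHOLD * total_docs as f64 {
            allowed.push(key);
        }
    }
    allowed
}

fn main() {
    // One document: 95 CJ tokens detected as Japanese, 5 as Chinese, 40 Latin tokens.
    let mut tokens = BTreeMap::new();
    tokens.insert((Script::Cj, Language::Jpn), 95);
    tokens.insert((Script::Cj, Language::Cmn), 5);
    tokens.insert((Script::Latin, Language::Eng), 40);
    // Chinese is marginal on the CJ script, so a second extraction would forbid it.
    assert_eq!(misdetected_languages(&tokens), vec![(Script::Cj, Language::Cmn)]);

    // Whole index: 98 documents detected as Japanese, 2 as Chinese, 100 in total.
    let mut docs = BTreeMap::new();
    docs.insert((Script::Cj, Language::Jpn), 98);
    docs.insert((Script::Cj, Language::Cmn), 2);
    assert_eq!(languages_allowed_at_search(&docs, 100), vec![(Script::Cj, Language::Jpn)]);
}
```

In this sketch the Latin tokens do not influence the CJ decision, which matches the statement that text in another script is not impacted by the restriction.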

Base automatically changed from import-milli to main January 18, 2023 11:01

f3r10 added 12 commits January 31, 2023 11:28

Create a new database on index and add a specialized codec for it

c45d1e3

Extract and index data

d97fb61

Delete and clear data from the new database

b216ddb

Add tests for checking that detected script and language associated w…

a27f329

…ith document(s) were stored during indexing

Filter from script_language_docids database soft deleted documents

34d04f3

Add test checking if from script_language_docids database were removed

369c057

deleted docids

Format code

fd60a39

Improve script language codec

2d58b28

Skip script,language insertion if language is undetected

d820735

Fix tests

50bc156

Format code

7681be5

Fix code format

2922c5c

curquiza mentioned this pull request Jan 31, 2023

Enhance language detection meilisearch/milli#749

Closed

3 tasks

ManyTheFish force-pushed the enhance-language-detection branch from 0417cd6 to d4dc5f9 Compare January 31, 2023 10:31

ManyTheFish added 5 commits February 1, 2023 15:26

Update Charabia version

f4569b0

Fix codec deserialization

77d32d0

Update test

064158e

Add expectancy test

643d99e

Use Languages list detected during indexing at search time

0bc1a18

ManyTheFish force-pushed the enhance-language-detection branch from d4dc5f9 to 0bc1a18 Compare February 1, 2023 17:57

Update Charabia to 0.7.1

cb8d5f2

ManyTheFish added this to the v1.1.0 milestone Feb 20, 2023

ManyTheFish requested a review from irevoire February 20, 2023 13:56

ManyTheFish linked an issue Feb 20, 2023 that may be closed by this pull request

Integrate Charabia 0.7.1 #3513

Closed

irevoire requested changes Feb 20, 2023

milli/src/heed_codec/script_language_codec.rs (resolved)

milli/src/search/mod.rs (resolved)

ManyTheFish and others added 2 commits February 20, 2023 15:33

Update milli/src/search/mod.rs

119e6d8

Co-authored-by: Tamo <tamo@meilisearch.com>

Add test ensuring that Meilisearch works on kanji only requests

23f4e82

ManyTheFish requested a review from irevoire February 20, 2023 14:43

irevoire previously approved these changes Feb 20, 2023

Merge branch 'main' into enhance-language-detection

8aa808d

ManyTheFish dismissed irevoire’s stale review via 8aa808d February 20, 2023 17:14

ManyTheFish requested a review from irevoire February 20, 2023 18:11

fix clippy

bbecab8

irevoire approved these changes Feb 21, 2023

bors bot merged commit 3940788 into main Feb 21, 2023

bors bot deleted the enhance-language-detection branch February 21, 2023 11:52

ManyTheFish linked an issue Feb 27, 2023 that may be closed by this pull request

Detect Language during indexing to enhance language detection at search time #3357

Closed

2 tasks

ManyTheFish mentioned this pull request Feb 27, 2023

Detect Language during indexing to enhance language detection at search time #3357

Closed

2 tasks

miiton mentioned this pull request Mar 7, 2023

[v1.1.0-rc.0] Japanese documents cannot be searched properly with Kanji-only documents #3565

Closed

ManyTheFish mentioned this pull request Mar 8, 2023

Enhance Japanese language detection #3569

Merged

meili-bot added the v1.1.0 PRs/issues solved in v1.1.0 released on 2023-04-03 label Apr 6, 2023
