Chinese Languages support #503

ManyTheFish · 2022-08-01T16:20:31Z

ManyTheFish
Aug 1, 2022
Collaborator

Chinese Languages support

officially supported

Current behavior and pointed out issues

Segmentation

For Meilisearch, the segmenter's goal is to cut a text into several "words" which will be searchable in Meilisearch during a search query.
To segment Chinese texts, and because Chinese words are not always space-separated, we currently use Jieba instead of the default unicode-segmentation.

Drawbacks

Jieba helps us segment non-space-separated words, but our community reported that the quality of the segmentation was not sufficient:

The segmented words are too long and don't allow Meilisearch to find relevant documents unless the word is completely written
The segmentation result can differ for similar text parts, this breaks prefix search efficiency
- Search Chinese Content, Typo tolerance looks not work meilisearch#2222 (comment)

Normalization

For Meilisearch, the normalizer's goal is to alter the words that can be considered equivalent in order to make them converge to a common representation.
To Normalize Chinese words, we convert traditional characters into simplified ones using character_converter.

Drawbacks

This normalization process allows Meilisearch to find documents containing both traditional and simplified characters,
however, this seems to be insufficient for our community, therefore the typo tolerance struggles to find the real user typo:

Potential enhancements

Segmentation

As written before, Jieba segmentation creates too-long words that decrease the number of relevant documents found by Meilisearch.
However, we need an equivalent tokenizer and we can't just consider that each character is a word because it would make Meilisearch returns a lot of un-relevant documents (meilisearch/meilisearch#2390).
We have to find another tokenizer that cut the provided text into words without creating too-long words.

💭: I think that this approach, completed by the proximity ranking rules, will enhance the relevancy of Meilisearch.

In a below discussion some contributors suggested to:

remove HMM feature of Jieba which has low accuracy results and adds noise in the segmentation
change the default dictionary

Normalization

The current normalization process for Chinese script is meant to unify traditional and simplified Chinese, but, we could change the approach by encoding phonologically or visually the Chinese characters. In Visually and Phonologically Similar Characters in Incorrect Simplified Chinese Words, we can read:

"83% of these errors were related to
phonological similarity, and 48% of them
were related to visual similarity between
the involved characters. Generating the
lists of phonologically and visually similar
characters, our programs were able to
contain more than 90% of the incorrect
characters in the reported errors."
Abstract

Phonological normalization

In Phonology of Mandarin Chinese: Pinyin vs. IPA we can read:

"Chinese language has an ideographic and
logographic writing system, and it has always
been a difficult problem to correctly represent
the phonology of Chinese characters."
Abstract

Because many errors are phonological, and, because Chinese Script is not a phonological writing system, we should normalize tokens into a phonological representation.
The main issue of this is that Chinese dialects don't have always the same pronunciation for the same word, In Visually and Phonologically Similar Characters in Incorrect Simplified Chinese Words we can read:

"Sublanguages used in different
areas in China, e.g., Shanghai, Min, and Canton
share the same written forms with the Mandarin
Chinese, but have quite different though related
pronunciation systems. Hence, people living in
different areas in China may perceive phonologi-
cal similarity in very different ways."
2. Phonologically Similar Characters

In his below comment, hfhchan suggested several phonological normalizations:

Hanyu Pinyin: Pinyin is accurate in representing both for the standardized China mainland dialect of Mandarin and the standardized Taiwan dialect of Mandarin (Native speakers: 920 million)
Jyutping: Jyutping is a Romanization system representing Cantonese phonology (Native speakers: 80 million)
Peh-oe-ji: Peh-oe-ji is a Romanization system representing Taiwanese Hokkien phonology (Native speakers: 13.5 million)
Taiwanese_Hakka_Romanization_System: it is a Romanization system representing Taiwanese Hakka phonology (Native speakers: 2.5 million)

Visual normalization

Visually and Phonologically Similar Characters in Incorrect Simplified Chinese Words treat this subject, their solution is to encode each character in Cangjie that compose the original character linked to the structure of the original Character, this method would allow Meilisearch to retrieve similar characters via the typo criterion.

I found a promising GitHub repository, maintained by @hfhchan, where we can find Ideographic Description Sequence dictionaries to decompose CJK characters.

⚠️ : the dictionaries seems to describe characters recurcivelly (銩 -> ⿰金丟 but 丟 -> ⿱一去 so 銩 -> ⿰金⿱一去). We may have to do some restructurations.

Phonological vs Visual normalization, what should we choose could we have both?

TBD @ManyTheFish with the potential help of the community

Contribute!

In Meilisearch, we don't speak nor understand all the Languages in the world, we could be wrong in our interpretation of how to support a new Language in order to provide a relevant search experience.
However, if you are a native speaker, don't hesitate to contribute to enhancing this experience:

⬆️ by upvoting this discussion to help us in prioritizing language supports
💬 by pointing out errors and oversights in this discussion
🧑‍💻 by opening a pull request on charabia, the tokenizer used by Meilisearch

Thanks for your help!

chenbimo · 2022-08-23T07:29:14Z

chenbimo
Aug 23, 2022

is this will support in next version?

1 reply

curquiza Aug 23, 2022
Maintainer

Hello @chenbimo we have not planned anything since we don't know exactly which change we want to do

noe · 2022-08-23T12:33:03Z

noe
Aug 23, 2022

About Jieba segmenting too-long words, I have looked into the examples from the linked issues. For instance, in this issue, 安装简单 is not segmented. I think this situation is due to the Hidden Markov Model (HMM) used by Jieba to identify words that are not in its dictionary, because 安装简单 does not seem to be in Jieba's dictionary.

I would suggest checking if the HMM is enabled or disabled and, if enabled, disabling it. At least I always disable the HMM in my own use of Jieba.

6 replies

noe Aug 23, 2022

I normally use Jieba (the C++ and Python versions) in my projects, but with custom dictionaries instead of the default ones, to fit my needs, and always with hmm=false.

ManyTheFish Aug 23, 2022
Collaborator Author

I see, how do you choose your dictionary, do you have accuracy benchmarks on segmentation?

noe Aug 23, 2022

In my use cases, I normally have a reference (long) list of words that are important to be segmented properly (e.g. a definition dictionary like CC-CEDICT). I use such a reference list of words to create a Jieba dictionary.

I don't have benchmarks, sorry, I normally test it by eyeballing a few sampled texts.

ms300 Aug 23, 2022

I see, how do you choose your dictionary, do you have accuracy benchmarks on segmentation?

Thanks for opening a proposal for enhancing Chinese language support.
There is a benchmark on popular datasets in https://github.com/lancopku/pkuseg-python/blob/master/readme/comparison.md .

MSRA	Precision	Recall	F-score
jieba	87.01	89.88	88.42
THULAC	95.60	95.91	95.71
pkuseg	96.94	96.81	96.88

Default	MSRA	CTB8	PKU	WEIBO	All Average
jieba	81.45	79.58	81.83	83.56	81.61
THULAC	85.55	87.84	92.29	86.65	88.08
pkuseg	87.29	91.77	92.68	93.43	91.29

Note that it's not a third-party benchmark. pkuseg has a better accuracy but it runs at a really low speed and doesn't have a Rust version.

And https://github.com/thunlp/THULAC#%E4%BB%A3%E8%A1%A8%E5%88%86%E8%AF%8D%E8%BD%AF%E4%BB%B6%E7%9A%84%E6%80%A7%E8%83%BD%E5%AF%B9%E6%AF%94 is a benchmark by THULAC.
THULAC has a better accuracy than jieba and runs at a reasonable speed. It's an alternative choice.

But jieba is the most popular and most ported language Chinese segmentation toolkit. I think there is no problem to use jieba as a default Chinese segmentation.
For now, i think we just need some additional options such as (cut_for_search mode (a cutting mode for searching engine) ，custom dictionary and HMM switch)

It is not easy to achieve these because each segmentation toolkits has different options. Maybe a solution that is compatible with these options is needed.

Thanks again for this amazing project!♥

ManyTheFish Aug 25, 2022
Collaborator Author

Hey @ms300, first, thank you for your detailed answer. 👍
I will consider changing the dictionary soon and I will take the time to choose a better one! Therefore, I agree that removing the HMM feature is a best approach than keeping it.

cut_for_search

Unfortunately, we can't use cut_for_search easily. 😖

The behavior of the method cut_for_search in Jieba is cutting segmented word in ngrams generating several version of the same word. This is an issue for Meilisearch because Jieba doesn't indicates the index of each generated ngram making difficult to retrieve positions, and Meilisearch needs word positions to compute a query properly.

However, Meilisearch is able to boost documents containing close query terms with the proximity ranking rule. If we manage to segment Chinese texts in short words, this may enhance the revelancy. When reading Investigating the Relationship between Word Segmentation
Performance and Retrieval Performance in Chinese IR (I've not finished to read it so far), it seems that choosing a dictionary is not as easy as taking the one with the best F-score.

"In fact, the curves show a clear
uni-modal shape, where for segmentation accura-
cies 44% to 70% the retrieval performance increases
steadily, but then plateaus for segmentation accu-
racies between 70% and 77%, and finally decreases
slightly when the segmentation performance increase
to 85%,90% and 95%."
4.5 Experimental results

hfhchan · 2022-09-08T17:26:32Z

hfhchan
Sep 8, 2022

the International Phonetic Alphabet (IPA) is a representation mostly used by linguists to give a precise overview of the pronunciation of a word.

the Pinyin system is a romanize representation of the Chinese Script, it is highly used by Chinese speakers but gives a way less precise phonological representation than the IPA

Pinyin system is suitable for representing the phonology of (Standard) Mandarin only while IPA can be used to represent basically any language and dialect. Pinyin is not "less precise" -- it reflects only one variation of Chinese, but precisely. It is also the most commonly spoken variant of Chinese. Pinyin is accurate in representing both for the standardized China mainland dialect of Mandarin and the standardized Taiwan dialect of Mandarin.

Different Sinitic languages have substantially different phonology systems, and their dialects also have variations. If your target is country and/or regions as per ISO 3166, you can get away with only implementing phonological representation in Standard Mandarin (both China mainland and Taiwan) and Standard Cantonese. You might also want to implement Taiwanese Hokkien or Taigi, where there are two standardized dialects, as well as Taiwanese Hakka, both of which are official languages of Taiwan in addition to Mandarin.

Unihan DB provides standardized readings for Mandarin (in pinyin) and Cantonese (in Jyutping).

The choice of romanization system is not important as long as it accurately reflects the phonology.

Note: due to the huge number of homophones, you likely want to do this only after segmentation, i.e. 法治 fǎ zhì also matches 法制 fǎ zhì, but you don't want someone searching 治 itself to match 制. (These two phrases are not pronounced identically in Cantonese.)

3 replies

ManyTheFish Sep 13, 2022
Collaborator Author

Hello @hfhchan,
I have several questions:

For Latin languages, we choose to remove accents from characters meaning that we don't consider a missing accent as a typo.
This allows our users to search without having to add an accent on each queried word because accents rarely change drastically the meaning. Can we consider removing accents from the romanized version of words for example consider fǎ zhì and fā zhí as fa zhi?

Do you consider Peh-oe-ji and the Taiwanese Language Phonetic Alphabet (TLPA) as two viable Phonetic romanizations for Taiwanese Hokkien and/or Taiwanese Hakka dialects?

Thank you a lot for your contribution to this discussion! 😄

hfhchan Sep 14, 2022

About removing the accents, searching the unaccented versions should match the accented ones (search query fa zhi search results contain fǎ zhì and fā zhí), but searching using the accented ones should yield the exact one only.

I'm not familiar enough with Taiwanese Hokkien or Hakka to be sure, but I'm under the impression POJ is most active in use for Taiwanese Hokkien while https://en.m.wikipedia.org/wiki/Taiwanese_Hakka_Romanization_System is the active one in use in Hakka textbooks.

ManyTheFish Sep 14, 2022
Collaborator Author

Ok, good to know, it seems similar to Latin accents. 🤔
As a first step, we will keep accents from the romanized version, counting any differences as typos.
But as you said, it would be relevant to return all the variations when searching with an unaccented word.

hfhchan · 2022-09-08T17:36:15Z

hfhchan
Sep 8, 2022

Visually and Phonologically Similar Characters in Incorrect Simplified Chinese Words treat this subject, their solution is to encode each character in Cangjie that compose the original character linked to the structure of the original Character, this method would allow Meilisearch to retrieve similar characters via the typo criterion.

I found a promising GitHub repository, maintained by @hfhchan, where we can find Ideographic Description Sequence dictionaries to decompose CJK characters.

I do not suggest using the IDS list for identifying visually similar characters.

You should use https://github.com/hfhchan/irg/blob/master/kVariants.txt instead, which provides mappings for (1) visually similar and identical meaning (i.e. z-variants), (2) semantic variants, (3) simplified (both standard and non-standard forms), as well as (4) other erroneous forms. Note this list only contains characters which are identical in meaning -- this is somewhat prescriptive, and intentionally ignores all irregular simplifications.

For visually similar characters only, you will want to look at kSpoofingVariant and kZVariant of Unihan too. Note this list excludes similarly looking common characters where someone from primary school grade should be able to distinguish, e.g. 土 vs 士. These are both very very common characters and normalizing them would bring more false positives than help.

For actually useful search for Chinese users, use the mappings in kTraditionalVariant, kSimplifiedVariant, kSemanticVariant and kSpecializedSemanticVariant of Unihan, which converts between characters of identical meaning (in all or certain contexts).

Note in Unihan, 温 and 溫 and etc are considered traditional/simplified pairs but normal native speakers treat them more or less identical. That's why in my kVariants list they are considered z-variants instead.

You should also look into the MSR mappings for Chinese (used by ICANN in domain names). Whatever is blocked means that the characters have been determined to be (somewhat) identical in meaning, or high probability of spoofing (i.e. indistinguishable a first glance). This list should be roughly identical to all the previous k* properties in Unihan.

6 replies

hfhchan Sep 15, 2022

About IDS

Am I completelly going in the wrong way or do you think that it is worse to try?

This is only going to be marginally useful if the mainstream input method is handwriting based.

Firstly handwriting input method is not that commonly used except for older people. In the China mainland region, the predominant input method is pinyin. In Hong Kong and Macau, the predominant input method is Cangjie or Quick. I am not familiar enough with Taiwan preferences.

For pinyin and Quick, since there are so many candidates for a given input sequence, the user is likely to pick the correct character or accidentally tap a homophone (pinyin) or character with same Quick code. You will get the best output based on this most common scenario, and you do not need to implement Levenshtein distance for this.

For cangjie, each input sequence more or less corresponds to only one or two characters at most. You can do the Levenshtein distance on the cangjie code for this.

For handwriting, the input method editor will suggest the most commonly used characters. I do not see too much value in doing Levenshtein distance on the character glyph.

In your graph, the most useful one would only be 亰 to 京, however the benefit is still very marginal. First, these two are actually variants, the former is more common in calligraphy, and a shitty handwriting input method from maybe the early 2000s might just output 亰 instead of 京. Or, the user accidentally pressed the stroke 1 button once too many and by chance happened to choose 亰 instead. No decent input method editor will suggest 亰 in high frequency, and certainly should not mix 哀 into it. 亯 and 畗 are not even common words either so no decent input method editor would suggest it.

By far the most common case of typo will be homophone. There are just too many characters, not just new learners, even native speakers often mix them up if they are pronounced the same. This is particularly prolific for so-called dialectal characters where no standard form has been agreed on or declared by an authority.

About kSpoofingVariant and kZVariant
I find it very interesting, it will be part of the standardization process.
However, I'm not sure to understand why these duplicates exist. 🤔

kZVariants are mostly duplicates that have been encoded into Unicode due to the Source Separation Rule (for compatibility with old character sets, i.e. "source"), or errors in the standardization process.
kSpoofingVariants usually refer to very archaic or uncommon characters which look nearly identical to common characters, but are encoded separately because they have a different semantic meaning.

MSR ?

Maximal Starting Repertoire (MSR)

hfhchan Sep 15, 2022

Levenshtein distance as in the picture is probably more useful for making a handwriting input method (also, you need to heavily reorder the characters by frequency). It is not useful for search, except for searching through digitized texts created by a shitty OCR, which I do not think is a common use case.

hfhchan Sep 15, 2022

About kSemanticVariant kSpecializedSemanticVariant and kTraditionalVariant

Because Japanese share some characters I'm a bit afraid to normalize based on the meaning where Meilisearch struggle to make the difference between Chinese Languages and Japanese when tokenizing a query.

You should do the normalization on the index only, not for display to the user.
The worst that can happen is that you deliver some results that end up not being relevant. But in most of the cases, you will get relevant search results.

Google, Baidu etc all match between traditional/simp variant, semantic variant / specialized semantic variant, etc by default.

hfhchan Sep 15, 2022

You can also tailor it by checking the lang=ja vs lang=zh attribute on the document. For slightly longer texts, it's very easy to distinguish because Japanese will have lots of kana in it.

hfhchan Sep 15, 2022

Note - contrary to what ICU might do, lang=zh does not automatically mean cmn (Mandarin) as used by most sites in the Chinese sphere. When you see a lang=zh, it could actually mean zh-HK, which would normally be yue-HK (Cantonese, Hong Kong).

Even yue-HK can stand for the so called "Formal Written Chinese" used in HK which is now nearly identical in syntax to cmn, possibly with some features of lzh, and various amounts of yue proper. You may want to import a language model to guess the predominant language of the text based on heuristics.

ManyTheFish · 2022-10-05T14:53:48Z

ManyTheFish
Oct 5, 2022
Collaborator Author

Hello all!
I come back to you to make some updates.
For Hacktoberfest we've created several issues to enhance the Chinese support of Meilisearch:

Implement Pinyin normalizer charabia#135: related discussion comment;

store a phonological representation of Chinese words in order to better retrieve the potential typos made by the end user.

Enhance Chinese normalizer by unifying Z, Simplified, and Semantic variants charabia#144: related discussion comment;

Instead of only unifying Simplified variants, we want to unify Z and Semantic variants too.

Disable HMM feature of Jieba charabia#136: related discussion comment;

disable the Hidden Markov Model for the Chinese word segmentation.

All these issues are open to external contributions during the whole month, so don't hesitate to contribute! 🧑‍💻
I'll be glad to help you if you have any questions.

This is another step in enhancing Chinese Language support, depending on future feedback, we will be able to go further.

Thanks for all your feedback! ✍️ 🇨🇳

0 replies

loulazynote · 2022-11-08T12:08:48Z

loulazynote
Nov 8, 2022

Excuse me, Is it possible to support traditional Chinese display for synonym output?

14 replies

houlang Dec 3, 2023

Hello, there is one month left until 2023. Go for it!

ManyTheFish Dec 5, 2023
Collaborator Author

Hello @houlang and @loulazynote, this issue should have been fixed since 1.4.0. The original representation of the synonyms should be the version displayed when using GET /indexes/:name/settings,
isn't it the case on your side?

loulazynote Dec 5, 2023

Hi @ManyTheFish, I used /indexes/{index_uid}/settings/synonyms in my project, it still displays pinyin.
So If I update to 1.4.0 version, is it possible to display traditional Chinese？

houlang Dec 5, 2023

Hi brother @ManyTheFish, I am more concerned about the issue I mentioned earlier. When I search for the Chinese character '新' (xin), I get the result '芯' (xin). Is there a solution to this problem? My inquiry is at the bottom of this discussion.

ManyTheFish Dec 5, 2023
Collaborator Author

Hi @loulazynote, yes, but you'll have to reupload your synonyms because Meilisearch lost the data in previous versions 😬

curquiza · 2023-01-10T08:29:17Z

curquiza
Jan 10, 2023
Maintainer

Hello everyone here

📣 📣 Meilisearch has just released its first RC (Release Candidate) for v1.0.0! This new version of Meilisearch will contain changes for the Chinese language support, so you might want to test it
https://github.com/meilisearch/meilisearch/releases/tag/v1.0.0-rc.0

How do we improve support for Chinese language?

Meilisearch now normalizes Chinese characters into Pinyin
We disabled the HMM (Hidden Markov Model algorithm) feature in Chinese segmentation

Please let us know here, in the thread, how these changes impact your usage 👇 👇

Thanks in advance for your help! 🙏

0 replies

drinkmystery · 2023-04-13T09:18:04Z

drinkmystery
Apr 13, 2023

It is now the era of AI. We use AI-based NLP for Chinese word segmentation to achieve better industry-specific segmentation results. If we use this method to pre-segment Chinese into strings separated by whitespaces, how can we disable Charabia’s Jieba segmentation and only allow Unicode-segmentation to perform segmentation by whitespaces?
@curquiza

1 reply

ManyTheFish Apr 26, 2023
Collaborator Author

Hello @drinkmystery,
If you are able to recompile Meilisearch with Rust, I suggest compiling it with the command:

$ cargo build --release --no-default-features --features analytics,mini-dashboard,charabia/greek,charabia/hebrew

This way, only unicode-segmentation will be used to segment the Chinese Language.
However, the drawback of this will be that normalizers will be deactivated as well which would reduce the relevancy in some way.

sivdead · 2023-07-05T10:10:48Z

sivdead
Jul 5, 2023

I notice that when use jieba.cu,Chinese word 永永远远是龙的传人 will result to 永永远远/是/龙的传人, but when use jieba.cut_for_search, the result would be 永远/远远/永永远远/是/传人/龙的传人,maybe this method is better?

1 reply

amoydavid Jul 14, 2023

that might resolved this meilisearch/meilisearch#2499

houlang · 2023-12-03T01:13:51Z

houlang
Dec 3, 2023

I had the same issue, "新" ==> “芯”

Use version: meilisearch-windows-amd64_1.5.0

I think this Chinese language issue is very important. If it's not solved, Meilisearch cannot be applied in a production environment.

Data List：

[{
        "id": 1763204760840700001,
        "create_date": "2023/9/8 17:20:23",
        "search_text": "新中",
		"search_times": 10
    }, {
        "id": 1763204760840700002,
        "create_date": "2023/9/8 17:20:23",
        "search_text": "浮雕装饰画",
		"search_times": 155
    }, {
        "id": 1763204760840700003,
        "create_date": "2023/9/8 17:20:23",
        "search_text": "新中背景墙",
		"search_times": 20
    }, {
        "id": 1763204760840700004,
        "create_date": "2023/9/8 17:20:23",
        "search_text": "中式",
		"search_times": 50
    }, {
        "id": 1763204760840700005,
        "create_date": "2023/9/8 17:20:23",
        "search_text": "画芯",
		"search_times": 100
    }, {
        "id": 1763204760840700006,
        "create_date": "2023/9/8 17:20:23",
        "search_text": "抽象装饰画",
		"search_times": 550
    }, {
        "id": 1763204760840700007,
        "create_date": "2023/9/8 17:20:23",
        "search_text": "中式背景墙",
		"search_times": 120
    }
]

Search Request:

{
    "q": "\"新\"",
    "matchingStrategy": "all",
    "attributesToSearchOn": [
        "search_text"
    ],
    "attributesToHighlight": [
        "search_text"
    ],
    "attributesToRetrieve": [
        "search_text"
    ],
    "limit": 10,
    "offset": 0,
    "showRankingScore": true
}

Search Result:

{
    "hits": [
        {
            "search_text": "新中",
            "_formatted": {
                "search_text": "<em>新</em>中"
            },
            "_rankingScore": 0.49242424242424243
        },
        {
            "search_text": "新中背景墙",
            "_formatted": {
                "search_text": "<em>新</em>中背景墙"
            },
            "_rankingScore": 0.49242424242424243
        },
        {
            "search_text": "画芯",
            "_formatted": {
                "search_text": "画<em>芯</em>"
            },
            "_rankingScore": 0.4621212121212121
        }
    ],
    "query": "\"新\"",
    "processingTimeMs": 0,
    "limit": 10,
    "offset": 0,
    "estimatedTotalHits": 3
}

Moreover, I found that words added to the stop_words list still seem to be searchable.
cURL:

localhost:7700/indexes/ithome_search_suggestions/settings/stop-words

Response：

[
    "画芯"
]

Test Request:

{
    "q": "\"画芯\"",
    "matchingStrategy": "all",
    "attributesToSearchOn": [
        "search_text"
    ],
    "attributesToHighlight": [
        "search_text"
    ],
    "attributesToRetrieve": [
        "search_text"
    ],
    "limit": 10,
    "offset": 0,
    "showRankingScore": true
}

Response Result: you can see "画芯" still seem to be searchable.

{
    "hits": [
        {
            "search_text": "画芯",
            "_formatted": {
                "search_text": "<em>画</em><em>芯</em>"
            },
            "_rankingScore": 0.5
        }
    ],
    "query": "\"画芯\"",
    "processingTimeMs": 0,
    "limit": 10,
    "offset": 0,
    "estimatedTotalHits": 1
}

3 replies

ManyTheFish Dec 5, 2023
Collaborator Author

Hello @houlang,
Your first issue comes from the normalization into pinyin. Because both characters share the same pinyin version, Meilisearch retrieves both documents; we have to find a good trade-off between being permissive and retrieving a character based on a potential typo or being restrictive and missing potential typos. The algorithm needs to be better shaped for now, and we should enhance it, but I didn't find a perfect alternative. I put below a discussion where we discussed how to improve it:
meilisearch/meilisearch#3655

Your second issue is related to how Meilisearch supports stopwords. A stopword is a unique word for Meilisearch, and as the Highlighter processes it, we can say that 画芯 is split into two words, so Meilisearch never catches a word containing 画芯.
However, you have several solutions:

Force Meilisearch to consider every sub-string containing 画芯 as a unique word by using the dictionary setting localhost:7700/indexes/ithome_search_suggestions/settings/dictionary, and keeping the same stop-words.
Split your stopword in 2, ["画芯"] -> ["画", "芯"]

Don't hesitate to try both solutions or mix them depending on the stopword,
sorry for this inconvenience. I can't speak Chinese, so catching all the subtleties of the language can be challenging.

Don't hesitate to suggest any algorithm, implementation or anything else that could enhance your experience with Meilisearch,
Thank you for your comment,
see you!

houlang Dec 5, 2023

Hi @ManyTheFish
I wish you could speak Chinese.I think the correct search method should be to only match Chinese characters when searching for them, and only match pinyin when searching for pinyin, so that you can get multiple Chinese characters with the same pronunciation.

However, I'm not really of much help to you. I'm just a beginner in Rust, and I only know how to do 'cargo run' and 'cargo build'.

Here are some famous Chinese tokenizer plugins for Elasticsearch, but they are all written in Java. Hopefully, they can be helpful for you.

Comparison of Chinese analyzers in Elasticsearch.
Smart Chinese Analysis：The official provided Chinese tokenizer is not user-friendly。
IK Analyzer(https://github.com/medcl/elasticsearch-analysis-ik)：A free and open-source Java tokenizer, which is currently one of the popular Chinese tokenizers. It is simple, stable, and supports customization of dictionaries. To achieve particularly good results, custom dictionary maintenance is recommended.
Jieba：An open-source Python tokenizer, with a corresponding Java version available on GitHub, featuring self-recognition of new words and support for custom dictionaries.。
Ansj：A Java implementation of Chinese tokenizer based on n-Gram, CRF, and HMM algorithms, which is free and open-source, and supports natural language processing applications.
hanlp(https://github.com/hankcs/HanLP)：Free and open-source, generously contributed by talented Chinese individuals in the field of natural language processing.

Their accuracy of word segmentation from high to low is as follows:
hanlp > ansj > jieba > IK > Smart Chinese Analysis

Especially HanLP, I think it's very intelligent. Here is the official website, which may be helpful for you:
(You can switch to English)
https://hanlp.hankcs.com/

ManyTheFish Dec 5, 2023
Collaborator Author

Ok, thank you for your suggestions. I will dig deeper into the HanLP segmenter when I have some time.
But to be honest, it's a vast work to reimplement a segmenter. 😬
However, you are not the only one pushing to enhance Pinyin normalization:

Which helps us to prioritize the fix.

Kimeiga · 2024-01-05T07:39:04Z

Kimeiga
Jan 5, 2024

I agree with @houlang's example of douyin's search: when you're typing chinese in the ime, it's in pinyin first and it will input the pinyin first to the search box

You can choose to select an IME result from your IME, or if you already found the result that you're looking for in the dropdown, you can just select that.

meanwhile if you use a character, douyin doesn't seem to have any typo tolerance since they probably figure that if you weren't sure what character it was, you would just not select from your IME and instead leave it in pinyin as is and select one of the search options.

When you are in your chinese IME if you press Enter, the popup goes away and you are left with the pinyin, if you press space you get the character selected in the IME popup

This is my guess

Baidu:

My suggestion is that the matching algorithm should prioritize exact match on the chinese characters if chinese is entered, if pinyin is entered, it can fuzzy search over the pinyin representations of the chinese (which it appears the pinyin normalization was developed to do). It seems that the pinyin normalization is good but there needs to be part of the code that prevents a cjk character from matching to that pinyin normalization.

3 replies

Kimeiga Jan 5, 2024

I see that the normalization code is here right?
https://github.com/meilisearch/charabia/blob/fcfa655ca24de5cee74b140b1814c164f69f09c5/charabia/src/normalizer/chinese.rs#L26C63-L26C63

probably that is good

one easy solution to all this might be: in meilisearch, simply turn off the conversion of chinese characters that you input to the search query into pinyin, that way they only exact match but if you have pinyin inputted, it can fuzzy pinyin match. Or if this was an option that you could turn on or off for the executable that would be great!

heurainbow Feb 27, 2024

is this going to support on next version

ManyTheFish Feb 28, 2024
Collaborator Author

Hello @heurainbow,
No, the next version of Meilisearch doesn't provide any improvement for this,
The Meilisearch team is focusing on the performance and stability of the engine for the whole Q1.
The other topic we are working on is the hybrid search (AI + keyword), which enhances the relevancy.

Sorry, if this issue is not fixed, it's not a simple task to mix Chinese characters and pinyin at the same time, the current Meilisearch structure forces us to choose 1 variation per word.

Chinese Languages support #503

ManyTheFish Aug 1, 2022 Collaborator

Chinese Languages support

Current behavior and pointed out issues

Segmentation

Drawbacks

Normalization

Drawbacks

Potential enhancements

Segmentation

Normalization

Phonological normalization

Visual normalization

Phonological vs Visual normalization, what should we choose could we have both?

Contribute!

Replies: 11 comments · 38 replies

curquiza Aug 23, 2022 Maintainer

ManyTheFish Aug 23, 2022 Collaborator Author

ManyTheFish Aug 25, 2022 Collaborator Author

cut_for_search

ManyTheFish Sep 13, 2022 Collaborator Author

ManyTheFish Sep 14, 2022 Collaborator Author

ManyTheFish Oct 5, 2022 Collaborator Author

ManyTheFish Dec 5, 2023 Collaborator Author

ManyTheFish Dec 5, 2023 Collaborator Author

curquiza Jan 10, 2023 Maintainer

ManyTheFish Apr 26, 2023 Collaborator Author

ManyTheFish Dec 5, 2023 Collaborator Author

ManyTheFish Dec 5, 2023 Collaborator Author

ManyTheFish Feb 28, 2024 Collaborator Author

ManyTheFish
Aug 1, 2022
Collaborator

Replies: 11 comments 38 replies

curquiza Aug 23, 2022
Maintainer

ManyTheFish Aug 23, 2022
Collaborator Author

ManyTheFish Aug 25, 2022
Collaborator Author

ManyTheFish Sep 13, 2022
Collaborator Author

ManyTheFish Sep 14, 2022
Collaborator Author

ManyTheFish
Oct 5, 2022
Collaborator Author

ManyTheFish Dec 5, 2023
Collaborator Author

ManyTheFish Dec 5, 2023
Collaborator Author

curquiza
Jan 10, 2023
Maintainer

ManyTheFish Apr 26, 2023
Collaborator Author

ManyTheFish Dec 5, 2023
Collaborator Author

ManyTheFish Dec 5, 2023
Collaborator Author

ManyTheFish Feb 28, 2024
Collaborator Author