Enhance Chinese normalizer by unifying Z, Simplified, and Semantic variants #144
ManyTheFish changed the title from "Enhance Chinese normalizer by unifying Z Simplified and Semantic variants" to "Enhance Chinese normalizer by unifying Z, Simplified, and Semantic variants" on Oct 5, 2022
Does the import here mean something like embedding the kVariants.txt inside dictionaries/txt/cjk/... directly?
@choznerol, at least yes! 😄
choznerol added a commit to choznerol/charabia that referenced this issue on Nov 8, 2022
choznerol added a commit to choznerol/charabia that referenced this issue on Nov 12, 2022
bors bot added a commit that referenced this issue on Nov 21, 2022
162: Normalize Chinese by Z, Simplified, Semantic, Old, and Wrong variants r=ManyTheFish a=choznerol

# Pull Request

## Related issue
Fixes #144

## What does this PR do?
As titled, use [`kVariants.txt`](https://github.com/hfhchan/irg/blob/master/kVariants.md) as a dictionary to enhance Chinese normalization.

## TBD

### 1. Also normalizing `old` and `wrong!` variants
> .. , it is relevant to normalize Chinese characters by unifying `Z` `Simplified` and `Semantic` variants before transliterating them into Pinyin.

There are also `old` and `wrong!` variants in [`kVariants.txt`](https://github.com/hfhchan/irg/blob/master/kVariants.md#format). I didn't see a reason not to handle them as well, so they are also converted.

### 2. Confirm direction of conversion
For `=`, `old`, `sem`, and `wrong!` variants, I think it's obvious we want to convert from Source Ideograph to Destination Ideograph. However, for `simp` I personally think the same but am not 100% sure whether there are other considerations.

The reasons I think traditional variants should be the normalized form include:
1. Traditional variants seem to be the source of truth, just as the destination ideographs in `=`, `old`, `sem`, and `wrong!` entries all represent the source of truth.
2. A lot of simplified variants seem to render unsuccessfully (shown as boxes of Unicode codepoints, like 𧦛). I would worry whether `ToPinyin` could handle these simplified variants correctly if they were chosen as the normalized form.

<img width="470" alt="image" src="https://user-images.githubusercontent.com/12410942/201456109-78ff9817-08aa-4169-8d26-a03faec9f8e9.png">

### 3. Alternatives to copying and embedding the dictionary
> > > Import and Rework the dictionary to be a key-value binding of each variant, ...
> >
> > Does the import here mean something like embedding the [kVariants.txt](https://github.com/hfhchan/irg/blob/master/kVariants.txt) inside dictionaries/txt/cjk/... directly?
>
> `@choznerol`, at least yes! 😄
>
> #144 (comment)

If there is a preferred way to improve vendoring the dictionary (e.g. create a crate for this?), I'd love to look into it, but probably in a separate follow-up PR.

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

Co-authored-by: Lawrence Chou <choznerol@protonmail.com>
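The first two TBD points (which relation classes to apply, and in which direction) can be illustrated with a rough Rust sketch. It is not the code merged in the PR. It assumes each dictionary line has the form `source<TAB>relation<TAB>destination`, assumes that `=` denotes the Z variant (inferred from the PR title), and always maps source to destination. The names `Relation`, `parse_line`, and `build_variant_map` are hypothetical.

```rust
use std::collections::HashMap;

/// Relation labels named in the PR description.
/// Assumption: `=` is the label for Z variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Relation {
    Z,
    Simplified,
    Semantic,
    Old,
    Wrong,
}

fn parse_relation(label: &str) -> Option<Relation> {
    match label {
        "=" => Some(Relation::Z),
        "simp" => Some(Relation::Simplified),
        "sem" => Some(Relation::Semantic),
        "old" => Some(Relation::Old),
        "wrong!" => Some(Relation::Wrong),
        _ => None,
    }
}

/// Parse one line of the assumed form `source<TAB>relation<TAB>destination`.
/// Only the first character of the source and destination fields is kept,
/// so trailing annotations in a field are ignored. Lines that do not fit
/// the assumed shape yield `None`.
fn parse_line(line: &str) -> Option<(char, Relation, char)> {
    let mut fields = line.split('\t');
    let source = fields.next()?.trim().chars().next()?;
    let relation = parse_relation(fields.next()?.trim())?;
    let destination = fields.next()?.trim().chars().next()?;
    Some((source, relation, destination))
}

/// Build a source -> destination map, keeping only the accepted relations.
fn build_variant_map(dictionary: &str, accepted: &[Relation]) -> HashMap<char, char> {
    dictionary
        .lines()
        .filter_map(parse_line)
        .filter(|(_, relation, _)| accepted.contains(relation))
        .map(|(source, _, destination)| (source, destination))
        .collect()
}
```

With a placeholder line such as `c<TAB>old<TAB>C`, accepting only `Z`, `Simplified`, and `Semantic` leaves `c` unmapped, while adding `Relation::Old` maps it to `C`. Under this sketch, widening the normalization to `old` and `wrong!` is a change to the accepted set only; reversing the direction for `simp` would need an extra branch that is not shown here.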
Following the official discussion about Chinese support in Meilisearch, it is relevant to normalize Chinese characters by unifying `Z`, `Simplified`, and `Semantic` variants before transliterating them into Pinyin.

There are several dictionaries listing variations that we can use; I suggest using the kVariants dictionary made by hfhchan (see the related documentation in the same repo).
Technical approach
Import and Rework the dictionary to be a key-value binding of each variant, then, in the Chinese normalizer, convert the provided character before transliterating it into Pinyin.
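As a minimal sketch of the conversion step, assuming the key-value binding has already been built: each character is looked up in the map and replaced if an entry exists, and the result would then be passed on to Pinyin transliteration (not shown). The function names and the two sample pairs are illustrative only; the pairs map simplified to traditional forms, which is one possible choice of normalized form and not something this issue prescribes.

```rust
use std::collections::HashMap;

/// Replace every character that has an entry in `variants` with its
/// normalized form; characters without an entry pass through unchanged.
fn normalize_variants(input: &str, variants: &HashMap<char, char>) -> String {
    input
        .chars()
        .map(|c| variants.get(&c).copied().unwrap_or(c))
        .collect()
}

/// Hand-written sample binding (hypothetical, not loaded from the dictionary).
fn sample_variants() -> HashMap<char, char> {
    HashMap::from([('说', '說'), ('为', '為')])
}
```

Calling `normalize_variants("说为好", &sample_variants())` replaces the first two characters and leaves `好` as it is, since it has no entry in the sample map.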
Files expected to be modified
Misc
related to meilisearch/product#503