Enhance Chinese normalizer by unifying Z, Simplified, and Semantic variants #144

Closed
ManyTheFish opened this issue Oct 5, 2022 · 2 comments · Fixed by #162

Comments

@ManyTheFish
Member

Following the official discussion about Chinese support in Meilisearch, it is relevant to normalize Chinese characters by unifying Z, Simplified, and Semantic variants before transliterating them into Pinyin.

To learn more about each variant, see the dedicated report on unicode.org.

Several dictionaries list these variations; I suggest using the kVariants dictionary made by hfhchan (see the related documentation in the same repo).

Technical approach

Import and Rework the dictionary to be a key-value binding of each variant, then, in the Chinese normalizer, convert the provided character before transliterating it into Pinyin.
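
A minimal sketch of that approach, for illustration only. It assumes the dictionary is a tab-separated text file with one `source<TAB>relation<TAB>destination` entry per line and `#` comment lines; the function names `parse_variants` and `normalize_variants` are hypothetical and are not part of charabia. The sample entries in the usage note are made up and not copied from the real file.

```rust
use std::collections::HashMap;

/// Build a source -> destination map from dictionary text.
/// ASSUMPTION: each entry is `source<TAB>relation<TAB>destination`,
/// and lines starting with `#` are comments.
fn parse_variants(dictionary: &str) -> HashMap<char, char> {
    let mut map = HashMap::new();
    for line in dictionary.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let mut fields = line.split('\t');
        // The relation field is read but ignored here: every entry is
        // converted from source to destination.
        if let (Some(src), Some(_relation), Some(dst)) =
            (fields.next(), fields.next(), fields.next())
        {
            // Only the first character of each field is used as key/value.
            if let (Some(src), Some(dst)) = (src.chars().next(), dst.chars().next()) {
                map.insert(src, dst);
            }
        }
    }
    map
}

/// Replace every character that has a known variant mapping;
/// characters without an entry are kept unchanged.
fn normalize_variants(text: &str, variants: &HashMap<char, char>) -> String {
    text.chars()
        .map(|c| variants.get(&c).copied().unwrap_or(c))
        .collect()
}
```

With a map built from two illustrative entries such as `"説\t=\t說\n国\tsimp\t國\n"`, calling `normalize_variants` on a string rewrites only the characters that have an entry. The result would then be handed to the Pinyin transliteration step.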

Files expected to be modified

Misc

related to meilisearch/product#503

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

@ManyTheFish changed the title from "Enhance Chinese normalizer by unifying Z Simplified and Semantic variants" to "Enhance Chinese normalizer by unifying Z, Simplified, and Semantic variants" on Oct 5, 2022
@choznerol
Contributor

Import and Rework the dictionary to be a key-value binding of each variant, ...

Does "import" here mean something like embedding the kVariants.txt inside dictionaries/txt/cjk/... directly?

@ManyTheFish
Member Author

@choznerol, at least yes! 😄

choznerol added a commit to choznerol/charabia that referenced this issue Nov 8, 2022
choznerol added a commit to choznerol/charabia that referenced this issue Nov 12, 2022
bors bot added a commit that referenced this issue Nov 21, 2022
162: Normalize Chinese by Z, Simplified, Semantic, Old, and Wrong variants r=ManyTheFish a=choznerol

# Pull Request

## Related issue
Fixes #144

## What does this PR do?

As titled, use [`kVariants.txt`](https://github.com/hfhchan/irg/blob/master/kVariants.md) as a dictionary to enhance Chinese normalization.

## TBD
### 1. Also normalizing `old` and `wrong!` variants

> .. , it is relevant to normalize Chinese characters by unifying `Z` `Simplified` and `Semantic` variants before transliterating them into Pinyin.

There are also `old` and `wrong!` variants in [`kVariants.txt`](https://github.com/hfhchan/irg/blob/master/kVariants.md#format). I didn't see a reason not to handle them as well, so they are also converted.

### 2. Confirm direction of conversion

For the `=`, `old`, `sem`, and `wrong!` variants, I think it's obvious that we want to convert from the Source Ideograph to the Destination Ideograph. For `simp`, I personally think the same, but I am not 100% sure whether there are other considerations. The reasons I think traditional variants should be the normalized form are:
1. Traditional variants seem to be the source of truth, just as the Destination Ideograph in `=`, `old`, `sem`, and `wrong!` entries represents the source of truth.
2. A lot of simplified variants fail to render (they show up as boxes containing the Unicode code point, like 𧦛). I worry whether `ToPinyin` could handle these simplified variants correctly if they were chosen as the normalized form. <br/> <img width="470" alt="image" src="https://user-images.githubusercontent.com/12410942/201456109-78ff9817-08aa-4169-8d26-a03faec9f8e9.png">
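
One way to keep the direction question open in code is to make it an explicit parameter. The sketch below is only an illustration under stated assumptions: the `Relation` and `Direction` types and the `mapping` function are hypothetical names, and it assumes that `=` is the label for Z variants. It does not claim which direction this PR finally uses for `simp`.

```rust
/// Relation labels mentioned in this PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Relation {
    Equivalent, // `=` (ASSUMPTION: this is the Z-variant relation)
    Semantic,   // `sem`
    Simplified, // `simp`
    Old,        // `old`
    Wrong,      // `wrong!`
}

impl Relation {
    /// Parse a relation label; unknown labels yield `None`.
    fn from_label(label: &str) -> Option<Self> {
        match label {
            "=" => Some(Self::Equivalent),
            "sem" => Some(Self::Semantic),
            "simp" => Some(Self::Simplified),
            "old" => Some(Self::Old),
            "wrong!" => Some(Self::Wrong),
            _ => None,
        }
    }
}

/// Hypothetical policy switch for `simp` entries only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Direction {
    SourceToDestination,
    DestinationToSource,
}

/// Return the (from, to) pair to store in the normalization map.
/// Every relation except `simp` always goes source -> destination.
fn mapping(
    source: char,
    relation: Relation,
    destination: char,
    simp_direction: Direction,
) -> (char, char) {
    match (relation, simp_direction) {
        (Relation::Simplified, Direction::DestinationToSource) => (destination, source),
        _ => (source, destination),
    }
}
```

Keeping the switch in one place would make it cheap to flip the `simp` direction later if the `ToPinyin` concern turns out to matter.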

### 3. Alternatives to copying and embedding the dictionary
> > > Import and Rework the dictionary to be a key-value binding of each variant, ...
> >
> > Does "import" here mean something like embedding the [kVariants.txt](https://github.com/hfhchan/irg/blob/master/kVariants.txt) inside dictionaries/txt/cjk/... directly?
>
> `@choznerol,` at least yes! 😄
>
> #144 (comment)

If there is a preferred way to improve vendoring the dictionary (e.g. create a crate for this?), I'd love to look into it, but probably in a separate follow-up PR.

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Lawrence Chou <choznerol@protonmail.com>
@bors bors bot closed this as completed in 4a8e204 Nov 21, 2022