Handle Japanese by setting up Lindera tokenizer #2185

Closed
5 tasks done
curquiza opened this issue Feb 21, 2022 · 6 comments
Labels: enhancement · impacts docs · tokenizer · v0.27.0
Milestone: v0.27.0

Comments

curquiza (Member) commented Feb 21, 2022

In progress by the community here:

If there is no answer from the community, the work should be finished so it can be integrated into v0.27.0.

⚠️ This version of our tokenizer will NOT use whatlang::detect_lang(), since we don't know its impact on performance. For this first implementation, we will only use whatlang::detect_script() to detect the script and decide whether or not to use Lindera.
whatlang::detect_script() will not be perfectly accurate, since Japanese documents can be detected as Mandarin script; in that situation, our tokenizer will use the Jieba tokenizer instead of Lindera.
Using whatlang::detect_lang() can be considered in the future, but it will need benchmarks to avoid any loss of performance.

Should also fix #2159
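
For illustration only, here is a minimal sketch of the script-based routing described above. It is not the actual Meilisearch tokenizer code: the whatlang and jieba-rs calls follow those crates' public APIs, but the Lindera call is left as a hypothetical helper because Lindera's API differs between releases, and the function names here are assumptions.

```rust
use jieba_rs::Jieba;
use whatlang::{detect_script, Script};

/// Illustrative routing: pick a segmenter based on the detected script.
fn route_and_segment(text: &str) -> Vec<String> {
    match detect_script(text) {
        // Hiragana or Katakana is unambiguously Japanese: use Lindera.
        Some(Script::Hiragana) | Some(Script::Katakana) => lindera_tokens(text),
        // Japanese documents detected as Mandarin script end up here and are
        // segmented by Jieba -- the limitation described in the comment above.
        Some(Script::Mandarin) => {
            let jieba = Jieba::new();
            jieba.cut(text, false).into_iter().map(|s| s.to_string()).collect()
        }
        // Any other script keeps the default whitespace-based pipeline.
        _ => text.split_whitespace().map(str::to_string).collect(),
    }
}

// Hypothetical stand-in for the real Lindera integration.
fn lindera_tokens(_text: &str) -> Vec<String> {
    unimplemented!("tokenize with Lindera's Tokenizer here")
}
```

Kana can only appear in Japanese text, so only documents detected as Mandarin script are ambiguous; that is why whatlang::detect_lang() would be needed to route those documents correctly.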


Steps

curquiza added the enhancement and tokenizer labels on Feb 21, 2022
curquiza added this to the v0.27.0 milestone on Feb 21, 2022
curquiza (Member, Author) commented

@meilisearch/docs-team, I don't know if you have a page where you list the supported languages.

curquiza added the impacts docs label on Feb 21, 2022
Northn commented Feb 21, 2022

> @meilisearch/docs-team, I don't know if you have a page where you list the supported languages.

https://docs.meilisearch.com/learn/what_is_meilisearch/language.html

ManyTheFish changed the title from "Handle Japanese by setting up Lindra tokenizer" to "Handle Japanese by setting up Lindera tokenizer" on Feb 28, 2022
curquiza (Member, Author) commented

Milli was bumped in #2244, with milli v0.24.0 containing the new version of the tokenizer. We can close this issue. The change will be effective in v0.27.0.

curquiza (Member, Author) commented

We have an issue with the Lindera compilation. I'm re-opening this issue until we find a fix! @Kerollmops is on it.

Kerollmops (Member) commented

We are awaiting the next release of milli (the version after v0.24.0), which contains meilisearch/milli#475; that PR uses the latest version of Lindera, which includes lindera-morphology/lindera#164.

There were two issues:

  1. Lindera's reqwest version was colliding with our own version, and
  2. Lindera was downloading its dictionary from Google Drive, which failed very often and broke our builds; they have switched to SourceForge, which is more stable.

Kerollmops (Member) commented Apr 5, 2022

We bumped milli to v0.24.1 in #2280 to close this issue.
