Handle Japanese by setting up Lindera tokenizer #2185
Labels:
- `enhancement`: new feature or improvement
- `impacts docs`: this issue involves changes in Meilisearch's documentation
- `tokenizer`: related to the tokenizer repo: https://github.com/meilisearch/tokenizer/
- `v0.27.0`: PRs/issues solved in v0.27.0
In progress by the community here:

If there is no answer from the community, the work should be finished so it can be integrated into v0.27.0.
We will not use `whatlang::detect_lang()` for now, since we don't know its impact on performance. For this first implementation, we will only use `whatlang::detect_script()` to detect the script and decide whether or not to use Lindera.

`whatlang::detect_script()` will not be perfectly accurate: a Japanese document can be detected as Mandarin script, and in that case our tokenizer will use the Jieba tokenizer instead of Lindera. Using `whatlang::detect_lang()` can be considered in the future, but it will need benchmarks to avoid any loss of performance.

Should also fix #2159
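A minimal sketch of the script-based dispatch described above. To keep the example self-contained, `detect_script` here is a simplified stand-in for `whatlang::detect_script` that only checks a few Unicode ranges; the real crate's API and variants differ, and the tokenizer names are illustrative, not the actual integration code.

```rust
// Simplified stand-in for whatlang-style script detection (assumption:
// the real whatlang crate is not used here to keep the sketch runnable).
#[derive(Debug, PartialEq)]
enum Script {
    Hiragana,
    Katakana,
    Mandarin,
    Other,
}

// Return the script of the first recognizable character.
fn detect_script(text: &str) -> Script {
    for c in text.chars() {
        match c as u32 {
            0x3040..=0x309F => return Script::Hiragana,
            0x30A0..=0x30FF => return Script::Katakana,
            // CJK Unified Ideographs: shared by Japanese and Chinese,
            // reported as "Mandarin" script, as whatlang does.
            0x4E00..=0x9FFF => return Script::Mandarin,
            _ => {}
        }
    }
    Script::Other
}

// Pick a tokenizer from the detected script only (no language detection).
fn pick_tokenizer(text: &str) -> &'static str {
    match detect_script(text) {
        Script::Hiragana | Script::Katakana => "lindera",
        // Known limitation from the issue: kanji-only Japanese text is
        // detected as Mandarin script and is routed to Jieba.
        Script::Mandarin => "jieba",
        Script::Other => "default",
    }
}

fn main() {
    // Text containing kana is routed to Lindera.
    assert_eq!(pick_tokenizer("すもももももももものうち"), "lindera");
    // Kanji-only Japanese is misrouted to Jieba, illustrating why
    // detect_script() alone is not perfectly accurate.
    assert_eq!(pick_tokenizer("東京"), "jieba");
    println!("ok");
}
```

This illustrates why `detect_lang()` would be more precise (it could tell Japanese from Chinese even for kanji-only text) but needs benchmarking before adoption.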
Steps