Handle Japanese by setting up Lindera tokenizer #2185

Closed
5 tasks done
curquiza opened this issue Feb 21, 2022 · 6 comments
Labels: enhancement · impacts docs · tokenizer · v0.27.0
Milestone: v0.27.0

Comments

curquiza (Member) commented Feb 21, 2022

In progress by the community here:

If there is no answer from the community, the work should be finished so it can be integrated into v0.27.0.

⚠️ This version of our tokenizer will NOT use whatlang::detect_lang(), since we don't know its impact on performance. For this first implementation, we will only use whatlang::detect_script() to detect the script and decide whether or not to use Lindera.
whatlang::detect_script() will not be perfectly accurate, since Japanese documents can be detected as Mandarin script; in that situation, our tokenizer will use the Jieba tokenizer instead of Lindera.
Using whatlang::detect_lang() can be considered in the future, but it will need benchmarks to avoid any loss of performance.

Should also fix #2159
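
For illustration only, here is a minimal sketch of the script-based routing described above. It is not the actual Meilisearch tokenizer code: the whatlang and jieba-rs calls follow those crates' public APIs, but the Lindera call is left as a hypothetical helper because Lindera's API differs between releases, and the function names here are assumptions.

```rust
use jieba_rs::Jieba;
use whatlang::{detect_script, Script};

/// Illustrative routing: pick a segmenter based on the detected script.
fn route_and_segment(text: &str) -> Vec<String> {
    match detect_script(text) {
        // Hiragana or Katakana is unambiguously Japanese: use Lindera.
        Some(Script::Hiragana) | Some(Script::Katakana) => lindera_tokens(text),
        // Japanese documents detected as Mandarin script end up here and are
        // segmented by Jieba -- the limitation described in the comment above.
        Some(Script::Mandarin) => {
            let jieba = Jieba::new();
            jieba.cut(text, false).into_iter().map(|s| s.to_string()).collect()
        }
        // Any other script keeps the default whitespace-based pipeline.
        _ => text.split_whitespace().map(str::to_string).collect(),
    }
}

// Hypothetical stand-in for the real Lindera integration.
fn lindera_tokens(_text: &str) -> Vec<String> {
    unimplemented!("tokenize with Lindera's Tokenizer here")
}
```

Kana can only appear in Japanese text, so only documents detected as Mandarin script are ambiguous; that is why whatlang::detect_lang() would be needed to route those documents correctly.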


Steps

curquiza added the enhancement and tokenizer labels on Feb 21, 2022
curquiza added this to the v0.27.0 milestone on Feb 21, 2022
curquiza (Member, Author) commented

@meilisearch/docs-team, I don't know if you have a page where you list the supported languages.

curquiza added the impacts docs label on Feb 21, 2022
Northn commented Feb 21, 2022

> @meilisearch/docs-team, I don't know if you have a page where you list the supported languages.

https://docs.meilisearch.com/learn/what_is_meilisearch/language.html

ManyTheFish changed the title from "Handle Japanese by setting up Lindra tokenizer" to "Handle Japanese by setting up Lindera tokenizer" on Feb 28, 2022
curquiza (Member, Author) commented

Milli was bumped in #2244, with milli v0.24.0 containing the new version of the tokenizer. We can close this issue. The change will be effective in v0.27.0.

curquiza (Member, Author) commented

We have an issue with the Lindera compilation. I'm re-opening this issue until we find a fix! @Kerollmops is on it.

Kerollmops (Member) commented

We are awaiting the next release of milli (the version after v0.24.0), which contains meilisearch/milli#475; that PR uses the latest version of Lindera, which includes lindera-morphology/lindera#164.

There were two issues:

  1. Lindera's reqwest version was colliding with our own version, and
  2. Lindera was downloading its dictionary from Google Drive, which failed very often and broke our builds; they have switched to SourceForge, which is more stable.

Kerollmops (Member) commented Apr 5, 2022

We bumped milli to v0.24.1 in #2280 to close this issue.
