Introduce the new version of the tokenizer: charabia #2375

Closed
6 tasks done
curquiza opened this issue May 4, 2022 · 3 comments · Fixed by #2468
Comments

@curquiza
Member

curquiza commented May 4, 2022

Why?

We want to make it easier to contribute to our tokenizer so that it can support more and more languages. Our community speaks many languages that we do not, and its members are best placed to choose which tokenizer and normalizer to use for their own native language. So @ManyTheFish worked on a new version of the tokenizer. More details are in this issue: meilisearch/charabia#72

What is fixed?

Changes

  • Introduce the new version of the tokenizer
    • Integrate it into milli
    • Release milli
    • Bump the new milli version into Meilisearch
  • Define precisely which languages we support or not, and how
  • Improve the issue management regarding the tokenizer (with @curquiza)
@curquiza curquiza added enhancement New feature or improvement tokenizer Related to the tokenizer repo: https://github.com/meilisearch/tokenizer/ labels May 4, 2022
@curquiza curquiza added this to the v0.28.0 milestone May 4, 2022
@curquiza curquiza added the impacts docs This issue involves changes in the Meilisearch's documentation label May 19, 2022
@curquiza
Member Author

Improve the issue management regarding the tokenizer

Done with @ManyTheFish
Feature requests (like language support, language detection, or separator customization) should be opened as a discussion in the product repository
Issues involving Meilisearch usage, like bad highlighting with the Chinese language, should stay in this repo
Any other issues really specific to the charabia library should be transferred to the charabia repo

@curquiza
Member Author

curquiza commented May 19, 2022

Update for the @meilisearch/docs-team

Define precisely which languages we support or not, and how

We defined the supported languages as the languages that have a dedicated tokenizer (segmenter + normalizer) in Charabia.
The supported languages in v0.28.0 will still be:

  • Latin languages
  • Chinese language
  • Japanese language

This does not mean Meilisearch does not work for unlisted languages. It means that, for any language other than Chinese and Japanese, the Latin tokenizer is currently used by default: for some languages and situations it works as expected, and for others it can fail.
If Meilisearch does not behave as expected with a language, it would be really awesome if the docs redirected users to the charabia repo, highlighting the CONTRIBUTING.md (the guide that explains how to contribute).
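
To make the fallback rule above concrete, here is a minimal sketch of the idea in plain Rust. It is not charabia's API: `Script`, `detect_script`, `pick_tokenizer`, and `latin_tokenize` are hypothetical names, the script detection is a rough Unicode-range check, and the "Latin tokenizer" is assumed to be a simple split on non-alphanumeric characters followed by lowercasing.

```rust
/// Scripts distinguished by this sketch (hypothetical, not charabia's types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Script {
    Latin,
    Chinese,
    Japanese,
    Other,
}

/// Very rough script detection based on Unicode ranges (an assumption
/// made for illustration only).
fn detect_script(text: &str) -> Script {
    // Hiragana and Katakana blocks.
    let is_kana = |c: char| ('\u{3040}'..='\u{30FF}').contains(&c);
    // CJK Unified Ideographs block.
    let is_han = |c: char| ('\u{4E00}'..='\u{9FFF}').contains(&c);
    // ASCII letters plus accented Latin letters.
    let is_latin = |c: char| {
        c.is_ascii_alphabetic()
            || (c.is_alphabetic() && ('\u{00C0}'..='\u{024F}').contains(&c))
    };

    // Kana is checked first because Japanese text also contains ideographs.
    if text.chars().any(is_kana) {
        Script::Japanese
    } else if text.chars().any(is_han) {
        Script::Chinese
    } else if text.chars().any(is_latin) {
        Script::Latin
    } else {
        Script::Other
    }
}

/// Returns the name of the tokenizer that would handle the script.
fn pick_tokenizer(script: Script) -> &'static str {
    match script {
        Script::Chinese => "chinese",
        Script::Japanese => "japanese",
        // Every script without a dedicated tokenizer falls back to Latin.
        Script::Latin | Script::Other => "latin",
    }
}

/// Assumed Latin pipeline: segment on non-alphanumeric characters,
/// then normalize each word by lowercasing it.
fn latin_tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|word| !word.is_empty())
        .map(|word| word.to_lowercase())
        .collect()
}
```

With this sketch, `pick_tokenizer(detect_script("שלום עולם"))` selects the Latin pipeline even though the text is Hebrew. Splitting on whitespace and punctuation happens to be reasonable for that input, but a language written without spaces between words would come out as a single large token, which is the kind of failure described above.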

A contributor just opened a PR to add Hebrew support. I will open a dedicated issue if we introduce it in v0.28.0 so that you don't miss it!

@curquiza
Member Author

Issue opened about the Hebrew support: #2417

bors bot added a commit to meilisearch/milli that referenced this issue Jun 2, 2022
540: Integrate charabia r=Kerollmops a=ManyTheFish

related to meilisearch/meilisearch#2375
related to meilisearch/meilisearch#2144
related to meilisearch/meilisearch#2417

Co-authored-by: ManyTheFish <many@meilisearch.com>
@ManyTheFish ManyTheFish mentioned this issue Jun 7, 2022
4 tasks
bors bot added a commit that referenced this issue Jun 7, 2022
2468: Update milli 0.29 r=ManyTheFish a=ManyTheFish

- [x] Update milli to 0.29
- [x] Integrate charabia
- [x] Set disabled_words to default when Index::exact_words returns None
- [x] Fix ranking rules integration test

fixes #2375
fixes #2144
fixes #2417

Co-authored-by: ManyTheFish <many@meilisearch.com>
bors bot added a commit that referenced this issue Jun 7, 2022
2468: Update milli 0.29 r=curquiza a=ManyTheFish

- [x] Update milli to 0.29
- [x] Integrate charabia
- [x] Set disabled_words to default when Index::exact_words returns None
- [x] Fix ranking rules integration test

fixes #2375
fixes #2144
fixes #2417

Co-authored-by: ManyTheFish <many@meilisearch.com>
bors bot added a commit that referenced this issue Jun 7, 2022
2468: Update milli 0.29 r=ManyTheFish a=ManyTheFish

- [x] Update milli to 0.29
- [x] Integrate charabia
- [x] Set disabled_words to default when Index::exact_words returns None
- [x] Fix ranking rules integration test

fixes #2375
fixes #2144
fixes #2417
fixes #2407

Co-authored-by: ManyTheFish <many@meilisearch.com>
bors bot added a commit that referenced this issue Jun 7, 2022
2468: Update milli 0.29 r=curquiza a=ManyTheFish

- [x] Update milli to 0.29
- [x] Integrate charabia
- [x] Set disabled_words to default when Index::exact_words returns None
- [x] Fix ranking rules integration test

fixes #2375
fixes #2144
fixes #2417
fixes #2407

Co-authored-by: ManyTheFish <many@meilisearch.com>
bors bot added a commit that referenced this issue Jun 8, 2022
2468: Update milli 0.29 r=irevoire a=ManyTheFish

- [x] Update milli to 0.29
- [x] Integrate charabia
- [x] Set disabled_words to default when Index::exact_words returns None
- [x] Fix ranking rules integration test

fixes #2375
fixes #2144
fixes #2417
fixes #2407

Co-authored-by: ManyTheFish <many@meilisearch.com>
bors bot added a commit that referenced this issue Jun 8, 2022
2468: Update milli 0.29 r=curquiza a=ManyTheFish

- [x] Update milli to 0.29
- [x] Integrate charabia
- [x] Set disabled_words to default when Index::exact_words returns None
- [x] Fix ranking rules integration test

fixes #2375
fixes #2144
fixes #2417
fixes #2407

Co-authored-by: ManyTheFish <many@meilisearch.com>
bors bot added a commit that referenced this issue Jun 8, 2022
2468: Update milli 0.29 r=ManyTheFish a=ManyTheFish

- [x] Update milli to 0.29
- [x] Integrate charabia
- [x] Set disabled_words to default when Index::exact_words returns None
- [x] Fix ranking rules integration test

fixes #2375
fixes #2144
fixes #2417
fixes #2407

Co-authored-by: ManyTheFish <many@meilisearch.com>
@bors bors bot closed this as completed in 6171f17 Jun 8, 2022
@bors bors bot closed this as completed in #2468 Jun 8, 2022
@curquiza curquiza added the v0.28.0 PRs/issues solved in v0.28.0 label Aug 24, 2022