v0.28: Hebrew language support #1688
Comments
As I work on this issue, I will also take a look at the tokenizer page, which is now somewhat out of date. These may be useful: meilisearch/meilisearch#2375 (comment). I will make some small changes to the tokenization page based on the content of these issues, and then request someone from @meilisearch/core-team to review.
Hey @dichotommy, don't hesitate to ask if you need some explanations! 😊
Hello @ManyTheFish, I know that since the tokenizer has been refactored, the tokenization page in the docs is most likely no longer accurate. For example, the image on that page is out of date: the preprocessing step has been removed, the "tokenizer" step has been renamed "segmenter", and we now offer more than just Latin and Chinese. I'm wondering three main things:
Thank you very much for your help 😊 ❤️ 🙏🏻
Hello @dichotommy!
For now, I haven't made a diagram that represents the new behavior of the tokenizer.
You're right, there is no technical explanation of the tokenizer anymore. I'll check with @gmourier whether the specification repository is the most suitable place for a technical explanation of the tokenizer.
Something like the text below should be more accurate:
- When you add documents to a Meilisearch index, the tokenization process is handled by an abstract interface called an analyzer. The analyzer is responsible for determining the primary language of each field based on the scripts (e.g., Latin alphabet, Chinese hanzi, etc.) that are present there. Then, it applies the corresponding pipeline to each field.
+ When you add documents to a Meilisearch index, the tokenization process is handled by an abstract interface called a tokenizer. The tokenizer is responsible for splitting each field by Script (e.g., Latin alphabet, Chinese hanzi, etc.). Then, it applies the corresponding pipeline to each part of each field.
We can break down the tokenization process like so:
- Crawl the document(s) and determine the primary language for each field
+ Crawl the document(s), splitting each field by Script
- Go back over the documents field-by-field, running the corresponding tokenization pipeline, if it exists
+ Go back over the documents part-by-part, running the corresponding tokenization pipeline, if it exists
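
To make the described split-by-Script behavior concrete, here is a minimal sketch using charabia, the tokenizer library referenced in this issue. It assumes charabia's `Tokenize` trait, the `Token::lemma()` helper, and a public `script` field on `Token`; treat it as an illustration under those assumptions rather than a definitive API reference:

```rust
// Minimal sketch; assumes `charabia` is declared as a dependency in Cargo.toml.
use charabia::Tokenize;

fn main() {
    // A single field mixing two scripts: Latin and Hebrew.
    let field = "Hello world! שלום עולם";

    // `tokenize()` splits the field by Script and runs the matching
    // pipeline (segmenter + normalizer) on each part.
    for token in field.tokenize() {
        // Each token carries its normalized lemma and the Script it was
        // detected as (the `script` field name is an assumption here).
        println!("{:?} -> {:?}", token.lemma(), token.script);
    }
}
```

Each Latin-script token would come out normalized (e.g., lowercased), while the Hebrew part goes through its own pipeline, which is exactly the part-by-part processing described above.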
1707: v0.28 r=guimachiavelli a=guimachiavelli

This is a staging PR for all changes related to Meilisearch v0.28. Please avoid making changes directly to this PR; instead, create new child branches based off this one.

Closes #1687, #1688, #1691, #1692, #1693, #1694, #1699, #1700, #1701, #1702, #1703, #1704, #1706, #1722, #1727, #561

Co-authored-by: gui machiavelli <hey@guimachiavelli.com>
Co-authored-by: gui machiavelli <gui@meilisearch.com>
Co-authored-by: Tommy Melvin <tommy@meilisearch.com>
Co-authored-by: Maryam Sulemani <maryam@meilisearch.com>
Co-authored-by: Maryam <90181761+maryamsulemani97@users.noreply.github.com>
Closed by #1707
As the title states, Meilisearch v0.28 will have official support for Hebrew.
We need to update the "Languages" page with this information. More info is required on the tokenizer and on whether we need to call out Hebrew specifically, as we have done for Chinese and Japanese.
References
Issue on core: meilisearch/meilisearch#2417
Internal Meilisearch library for tokenizer management: https://github.com/meilisearch/charabia
SME: @ManyTheFish
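
As a rough illustration of why Hebrew may not need the same special treatment as Chinese and Japanese: charabia uses dedicated segmenters for those languages (jieba and lindera, respectively), whereas Hebrew words are whitespace-separated. The sketch below assumes charabia's `Tokenize` trait and `Token::is_word()`, and the expected token count is an assumption for illustration, not a verified behavior:

```rust
// Hypothetical check; assumes `charabia` is declared as a dependency in Cargo.toml.
use charabia::Tokenize;

fn main() {
    let hebrew = "שלום עולם"; // "Hello world"

    // Hebrew words are whitespace-separated, so the default pipeline
    // may segment them without a dedicated segmenter such as jieba
    // (Chinese) or lindera (Japanese).
    let words: Vec<String> = hebrew
        .tokenize()
        .filter(|token| token.is_word())
        .map(|token| token.lemma().to_string())
        .collect();

    // Two whitespace-separated words should yield two word tokens
    // (the expected count is an assumption for illustration).
    assert_eq!(words.len(), 2);
}
```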