
v0.28: Hebrew language support #1688

Closed
guimachiavelli opened this issue Jun 7, 2022 · 5 comments · Fixed by #1728

@guimachiavelli (Member)

As title states, Meilisearch v0.28 will have official support for Hebrew.

We need to update the "Languages" page with this information. We also need more information on the tokenizer, and on whether we need to call Hebrew out specifically, as we have done for Chinese and Japanese.

References

Issue on core: meilisearch/meilisearch#2417
Internal Meilisearch library for tokenizer management: https://github.com/meilisearch/charabia
SME: @ManyTheFish
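
For a quick sense of what the tokenizer does with Hebrew at the library level, here is a minimal sketch using charabia's `Tokenize` trait (the trait and the `lemma()`/`is_word()` calls follow charabia's README; the Hebrew sample string is purely illustrative):

```rust
use charabia::Tokenize;

fn main() {
    // A short Hebrew phrase ("hello world"). Charabia detects the script
    // and routes the text through the matching segmenter/normalizer pipeline.
    let text = "שלום עולם";

    for token in text.tokenize() {
        // `lemma()` is the normalized form of the token;
        // `is_word()` distinguishes words from separators.
        println!("{:?} (word: {})", token.lemma(), token.is_word());
    }
}
```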

@dichotommy (Contributor)

As I work on this issue, I will also take a look at the tokenizer page, which is now somewhat out of date. These may be useful:

meilisearch/meilisearch#2375 (comment)
meilisearch/charabia#72

I will make some small changes to the tokenization page based on the content of these issues, and then request someone from @meilisearch/core-team to review.

@ManyTheFish (Member)

Hey @dichotommy, don't hesitate to ask if you need some explanations! 😊

@dichotommy (Contributor)

Hello @ManyTheFish, I know that since the tokenizer has been refactored, the tokenization page in the docs is most likely no longer accurate. For example, the image on that page is out of date: the preprocessing step has been removed, the "tokenizer" step has been renamed "segmenter", and we now support more than just Latin and Chinese.

I'm wondering three main things:

  1. Does the core team have a more up-to-date image to represent the analyzer/tokenization process? Of course, no problem if not, we will just remove it 😄
  2. To get a technical understanding of the tokenizer, is there a more recent place to point users than the old specification? Does CONTRIBUTING.md fill that same role?
  3. How would you change the text below, if at all:

When you add documents to a Meilisearch index, the tokenization process is handled by an abstract interface called an analyzer. The analyzer is responsible for determining the primary language of each field based on the scripts (e.g., Latin alphabet, Chinese hanzi, etc.) that are present there. Then, it applies the corresponding pipeline to each field.

We can break down the tokenization process like so:

Crawl the document(s) and determine the primary language for each field
Go back over the documents field-by-field, running the corresponding tokenization pipeline, if it exists

Thank you very much for your help 😊 ❤️ 🙏🏻

@guimachiavelli guimachiavelli mentioned this issue Jun 15, 2022
@guimachiavelli guimachiavelli linked a pull request Jun 21, 2022 that will close this issue
@ManyTheFish (Member)

Hello @dichotommy!

> Does the core team have a more up-to-date image to represent the analyzer/tokenization process? Of course, no problem if not, we will just remove it 😄

For now, I haven't made a diagram that represents the new behavior of the tokenizer.
However, if you think it's better with one, I can craft a new one. 😄

> To get a technical understanding of the tokenizer, is there a more recent place to point users than the old specification? Does CONTRIBUTING.md fill that same role?

You're right! There is no technical explanation of the tokenizer anymore. I'll check with @gmourier whether the specification repository is the most suitable place for one.

> How would you change the text below, if at all:

Something like the following should be more accurate:

```diff
- When you add documents to a Meilisearch index, the tokenization process is handled by an abstract interface called an analyzer. The analyzer is responsible for determining the primary language of each field based on the scripts (e.g., Latin alphabet, Chinese hanzi, etc.) that are present there. Then, it applies the corresponding pipeline to each field.
+ When you add documents to a Meilisearch index, the tokenization process is handled by an abstract interface called a tokenizer. The tokenizer is responsible for splitting each field by script (e.g., Latin alphabet, Chinese hanzi, etc.). Then, it applies the corresponding pipeline to each part of each field.
```

We can break down the tokenization process like so:

```diff
- Crawl the document(s) and determine the primary language for each field
+ Crawl the document(s), splitting each field by script
- Go back over the documents field-by-field, running the corresponding tokenization pipeline, if it exists
+ Go back over the documents part-by-part, running the corresponding tokenization pipeline, if it exists
```
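
To make the split-by-script behavior concrete, here is a minimal sketch that groups charabia tokens by detected script (the `Tokenize` trait follows charabia's README; reading the script off each `Token` as a public `script` field is an assumption about the struct layout, and the mixed-script sample is illustrative):

```rust
use charabia::Tokenize;

fn main() {
    // One field mixing Latin and Hebrew script. The tokenizer splits the
    // field into script-homogeneous parts and runs the pipeline that
    // matches each part's script.
    let field = "Meilisearch תומך בעברית";

    for token in field.tokenize() {
        if token.is_word() {
            // Assumption: `Token` exposes the detected script as a field.
            println!("{:?} -> {:?}", token.lemma(), token.script);
        }
    }
}
```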

bors bot added a commit that referenced this issue Jul 11, 2022
1707: v0.28 r=guimachiavelli a=guimachiavelli

This is a staging PR for all changes related to Meilisearch v0.28.

Please avoid making changes directly to this PR; instead, create new child branches based off this one.

Closes #1687, #1688, #1691, #1692, #1693, #1694, #1699, #1700, #1701, #1702, #1703, #1704, #1706, #1722, #1727, #561

Co-authored-by: gui machiavelli <hey@guimachiavelli.com>
Co-authored-by: gui machiavelli <gui@meilisearch.com>
Co-authored-by: Tommy Melvin <tommy@meilisearch.com>
Co-authored-by: Maryam Sulemani <maryam@meilisearch.com>
Co-authored-by: Maryam <90181761+maryamsulemani97@users.noreply.github.com>
@guimachiavelli (Member, Author)

Closed by #1707
