
v0.28: Hebrew language support #1688

Closed
guimachiavelli opened this issue Jun 7, 2022 · 5 comments · Fixed by #1728

@guimachiavelli (Member)

As title states, Meilisearch v0.28 will have official support for Hebrew.

We need to update the "Languages" page with this information. We also need more information on the tokenizer, and on whether we need to call Hebrew out specifically, as we have done for Chinese and Japanese.

References

Issue on core: meilisearch/meilisearch#2417
Internal Meilisearch library for tokenizer management: https://github.com/meilisearch/charabia
SME: @ManyTheFish
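
For a quick sense of what the tokenizer does with Hebrew at the library level, here is a minimal sketch using charabia's `Tokenize` trait (the trait and the `lemma()`/`is_word()` calls follow charabia's README; the Hebrew sample string is purely illustrative):

```rust
use charabia::Tokenize;

fn main() {
    // A short Hebrew phrase ("hello world"). Charabia detects the script
    // and routes the text through the matching segmenter/normalizer pipeline.
    let text = "שלום עולם";

    for token in text.tokenize() {
        // `lemma()` is the normalized form of the token;
        // `is_word()` distinguishes words from separators.
        println!("{:?} (word: {})", token.lemma(), token.is_word());
    }
}
```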

@dichotommy (Contributor)

As I work on this issue, I will also take a look at the tokenizer page, which is now somewhat out of date. These may be useful:

meilisearch/meilisearch#2375 (comment)
meilisearch/charabia#72

I will make some small changes to the tokenization page based on the content of these issues, and then request someone from @meilisearch/core-team to review.

@ManyTheFish (Member)

Hey @dichotommy, don't hesitate to ask if you need some explanations! 😊

@dichotommy (Contributor)

Hello @ManyTheFish, I know that since the tokenizer has been refactored, the tokenization page in the docs is most likely no longer accurate. For example, the image on that page is out of date: the preprocessing step has been removed, the "tokenizer" step has been renamed "segmenter", and we now support more than just Latin and Chinese.

I'm wondering three main things:

  1. Does the core team have a more up-to-date image to represent the analyzer/tokenization process? Of course, no problem if not, we will just remove it 😄
  2. To get a technical understanding of the tokenizer, is there a more recent place to point users than the old specification? Does CONTRIBUTING.md fill that same role?
  3. How would you change the text below, if at all:

When you add documents to a Meilisearch index, the tokenization process is handled by an abstract interface called an analyzer. The analyzer is responsible for determining the primary language of each field based on the scripts (e.g., Latin alphabet, Chinese hanzi, etc.) that are present there. Then, it applies the corresponding pipeline to each field.

We can break down the tokenization process like so:

Crawl the document(s) and determine the primary language for each field
Go back over the documents field-by-field, running the corresponding tokenization pipeline, if it exists

Thank you very much for your help 😊 ❤️ 🙏🏻

@guimachiavelli guimachiavelli mentioned this issue Jun 15, 2022
@guimachiavelli guimachiavelli linked a pull request Jun 21, 2022 that will close this issue
@ManyTheFish (Member)

Hello @dichotommy!

> Does the core team have a more up-to-date image to represent the analyzer/tokenization process? Of course, no problem if not, we will just remove it 😄

For now, I haven't made a diagram that represents the new behavior of the tokenizer.
However, if you think it's better with one, I can craft a new one. 😄

> To get a technical understanding of the tokenizer, is there a more recent place to point users than the old specification? Does CONTRIBUTING.md fill that same role?

You're right! There is no technical explanation of the tokenizer anymore. I'll check with @gmourier whether the specification repository is the most suitable place for one.

> How would you change the text below, if at all:

Something like the following should be more accurate:

```diff
- When you add documents to a Meilisearch index, the tokenization process is handled by an abstract interface called an analyzer. The analyzer is responsible for determining the primary language of each field based on the scripts (e.g., Latin alphabet, Chinese hanzi, etc.) that are present there. Then, it applies the corresponding pipeline to each field.
+ When you add documents to a Meilisearch index, the tokenization process is handled by an abstract interface called a tokenizer. The tokenizer is responsible for splitting each field by script (e.g., Latin alphabet, Chinese hanzi, etc.). Then, it applies the corresponding pipeline to each part of each field.
```

We can break down the tokenization process like so:

```diff
- Crawl the document(s) and determine the primary language for each field
+ Crawl the document(s), splitting each field by script
- Go back over the documents field-by-field, running the corresponding tokenization pipeline, if it exists
+ Go back over the documents part-by-part, running the corresponding tokenization pipeline, if it exists
```
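
To make the split-by-script behavior concrete, here is a minimal sketch that groups charabia tokens by detected script (the `Tokenize` trait follows charabia's README; reading the script off each `Token` as a public `script` field is an assumption about the struct layout, and the mixed-script sample is illustrative):

```rust
use charabia::Tokenize;

fn main() {
    // One field mixing Latin and Hebrew script. The tokenizer splits the
    // field into script-homogeneous parts and runs the pipeline that
    // matches each part's script.
    let field = "Meilisearch תומך בעברית";

    for token in field.tokenize() {
        if token.is_word() {
            // Assumption: `Token` exposes the detected script as a field.
            println!("{:?} -> {:?}", token.lemma(), token.script);
        }
    }
}
```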

bors bot added a commit that referenced this issue Jul 11, 2022
1707: v0.28 r=guimachiavelli a=guimachiavelli

This is a staging PR for all changes related to Meilisearch v0.28.

Please avoid making changes directly to this PR; instead, create new child branches based off this one.

Closes #1687, #1688, #1691, #1692, #1693, #1694, #1699, #1700, #1701, #1702, #1703, #1704, #1706, #1722, #1727, #561

Co-authored-by: gui machiavelli <hey@guimachiavelli.com>
Co-authored-by: gui machiavelli <gui@meilisearch.com>
Co-authored-by: Tommy Melvin <tommy@meilisearch.com>
Co-authored-by: Maryam Sulemani <maryam@meilisearch.com>
Co-authored-by: Maryam <90181761+maryamsulemani97@users.noreply.github.com>
@guimachiavelli (Member, Author)

Closed by #1707
