Add mapping between bytes in original word and normalized word #59
Merged
Conversation
Towards fixing meilisearch#54
- Add a `char_map` vector in Token.
- It stores the number of bytes in the normalized word for each grapheme cluster (character) in the original word.
- To use it in a highlighter, given the number of bytes to highlight, use the `num_graphemes_from_bytes` fn to get the number of chars to highlight.
- Note, however, that the meilisearch highlighter only gets the raw string (`&str`) instead of the Token struct, so this will require major changes in the highlighter, or an alternate solution here.
- It must also be noted that this adds significant overhead to tokenization: a new vector is stored for *every* token, and `deunicode` has to be run twice on the same token, once for each grapheme cluster and once for the entire word.
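A minimal standalone sketch of the `char_map` idea described above. The per-char normalizer here is a toy stand-in for `deunicode`, and `build_char_map` is a hypothetical helper, not the actual charabia API:

```rust
// Sketch only: mirrors the PR's `char_map` concept with a toy normalizer.

/// For each char of `original`, record how many bytes its normalized form
/// occupies in the normalized string.
fn build_char_map(original: &str, normalize: impl Fn(char) -> String) -> Vec<usize> {
    original.chars().map(|c| normalize(c).len()).collect()
}

fn main() {
    // Toy per-char normalizer: '💼' transliterates to "briefcase" (9 bytes),
    // everything else is left unchanged.
    let normalize = |c: char| {
        if c == '💼' { "briefcase".to_string() } else { c.to_string() }
    };
    let char_map = build_char_map("Go💼od", normalize);
    assert_eq!(char_map, vec![1, 1, 9, 1, 1]); // G, o, 💼, o, d
    println!("{:?}", char_map);
}
```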
Can you add some tests concerning your changes, please? 😊
ManyTheFish
requested changes
Nov 8, 2021
Note: this counts an incomplete character (grapheme) too, which means that if only 2 bytes of a grapheme consisting of 4 bytes are available, that grapheme is still added to the count. Co-authored-by: many <maxime@meilisearch.com>
Some basic tests for the char mapping between original chars and bytes in the normalized string. Tests only the string "Go💼od", which normalizes to "Gobriefcaseod".
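The incomplete-grapheme rule noted above can be sketched as a standalone function. The name `num_graphemes_from_bytes` mirrors the PR, but this is an illustrative reimplementation, not charabia's code; the hard-coded `char_map` is the one for "Go💼od" → "Gobriefcaseod":

```rust
/// Sketch of the PR's lookup: given a byte count into the normalized string,
/// return how many original chars it covers. A char whose normalized bytes
/// are only partially covered still counts (the incomplete-grapheme rule).
fn num_graphemes_from_bytes(char_map: &[usize], num_bytes: usize) -> usize {
    let mut remaining = num_bytes;
    let mut count = 0;
    for &len in char_map {
        if remaining == 0 {
            break;
        }
        count += 1;
        remaining = remaining.saturating_sub(len);
    }
    count
}

fn main() {
    // char_map for "Go💼od" -> "Gobriefcaseod": '💼' normalizes to 9 bytes.
    let char_map = [1, 1, 9, 1, 1];
    // Highlighting 3 normalized bytes ("Gob") covers 'G', 'o', and the
    // partially covered '💼': 3 chars.
    assert_eq!(num_graphemes_from_bytes(&char_map, 3), 3);
    // 2 bytes ("Go") cover exactly 2 chars.
    assert_eq!(num_graphemes_from_bytes(&char_map, 2), 2);
    println!("ok");
}
```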
ManyTheFish
approved these changes
Nov 9, 2021
Thanks for your contribution!
Bors merge
Build succeeded.
Samyak2 added a commit to Samyak2/milli that referenced this pull request on Dec 17, 2021:
The tests still fail due to a bug in meilisearch/charabia#59
Samyak2 added a commit to Samyak2/milli that referenced this pull request on Jan 17, 2022:
The tests still fail due to a bug in meilisearch/charabia#59
Kerollmops pushed a commit to meilisearch/milli that referenced this pull request on Jan 18, 2022:
The tests still fail due to a bug in meilisearch/charabia#59
Pull Request
What does this PR do?
Towards fixing #54
- Add a `char_map` vector in Token.
- To use it in a highlighter, given the number of bytes to highlight, use the `num_graphemes_from_bytes` fn to get the number of chars to highlight.
- `deunicode` has to be run twice on the same token: once for each grapheme cluster and once for the entire word.
PR checklist
Please check if your PR fulfills the following requirements:
This is my first Rust PR 😄, please let me know if something's not right