
Add mapping between bytes in original word and normalized word #59

Merged: 4 commits into meilisearch:main on Nov 9, 2021

Conversation

@Samyak2 (Contributor) commented Nov 4, 2021

Pull Request

What does this PR do?

Towards fixing #54

  • Added a char_map vector to Token.
  • For each grapheme cluster (character) in the original word, it stores the number of bytes that cluster occupies in the normalized word.
  • To use it in a highlighter, given the number of normalized bytes to highlight, call the num_graphemes_from_bytes fn to get the number of characters to highlight (see the sketch after this list).
    • Note, though, that the meilisearch highlighter currently receives only the raw string (&str) rather than the Token struct, so either the highlighter needs major changes or an alternative solution is needed here.
  • Note also that this adds significant overhead to tokenization, since a new vector is stored for every token,
    • and deunicode has to run on the same token twice: once per grapheme cluster and once more over the entire word.
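As a rough illustration of the mapping (not the PR's actual code: the field layout and the normalize helper here are assumptions, while Token, char_map, deunicode, and grapheme clusters come from the description above):

```rust
use deunicode::deunicode;
use unicode_segmentation::UnicodeSegmentation;

pub struct Token<'a> {
    pub word: &'a str,
    /// For each grapheme cluster of the original word, the number of
    /// bytes that cluster occupies in the normalized word.
    pub char_map: Option<Vec<usize>>,
}

/// Builds the normalized word plus its char_map. This shows the cost
/// noted above: deunicode runs once per grapheme cluster and once
/// more over the whole word.
fn normalize(word: &str) -> (String, Vec<usize>) {
    let char_map = word
        .graphemes(true)
        .map(|grapheme| deunicode(grapheme).len())
        .collect();
    (deunicode(word), char_map)
}
```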

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

This is my first Rust PR 😄. Please let me know if something's not right!

@ManyTheFish (Member) commented:

Can you add some tests concerning your changes? Please. 😊

Samyak2 and others added 3 commits November 8, 2021 20:28
Note: this also counts an incomplete character (grapheme): if only 2
bytes of a 4-byte grapheme are available, that grapheme is still added
to the count (see the sketch below).

Co-authored-by: many <maxime@meilisearch.com>
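A sketch of how such a lookup could count partially covered graphemes, under the same assumptions as the Token sketch above (an illustration, not the merged implementation):

```rust
impl Token<'_> {
    /// Number of graphemes of the original word covered by the first
    /// `num_bytes` bytes of the normalized word. An incomplete
    /// grapheme counts too: if only 2 of its 4 normalized bytes fall
    /// within `num_bytes`, it is still included.
    pub fn num_graphemes_from_bytes(&self, num_bytes: usize) -> usize {
        match &self.char_map {
            // Without a mapping, assume one byte per character.
            None => num_bytes,
            Some(char_map) => {
                let mut remaining = num_bytes;
                char_map
                    .iter()
                    .take_while(|&&bytes| {
                        let covered = remaining > 0;
                        remaining = remaining.saturating_sub(bytes);
                        covered
                    })
                    .count()
            }
        }
    }
}
```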
Some basic tests for the char mapping between original characters and
bytes in the normalized string. Tests only the string "Go💼od", which
normalizes to "Gobriefcaseod".
@ManyTheFish (Member) left a comment:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution!
Bors merge

@bors (bot) commented Nov 9, 2021:

Build succeeded.

@bors bors bot merged commit cb55578 into meilisearch:main Nov 9, 2021
Samyak2 added a commit to Samyak2/milli that referenced this pull request Dec 17, 2021
Samyak2 added a commit to Samyak2/milli that referenced this pull request Jan 17, 2022
Kerollmops pushed a commit to meilisearch/milli that referenced this pull request Jan 18, 2022