
Add mapping between bytes in original word and normalized word #59

Merged: 4 commits into meilisearch:main on Nov 9, 2021

Conversation

@Samyak2 (Contributor) commented Nov 4, 2021

Pull Request

What does this PR do?

Towards fixing #54

  • Added a char_map vector to Token.
  • For each grapheme cluster (character) in the original word, it stores the number of bytes that cluster occupies in the normalized word.
  • To use it in a highlighter, given the number of normalized bytes to highlight, call the num_graphemes_from_bytes fn to get the number of characters to highlight (see the sketch after this list).
    • Note, though, that the meilisearch highlighter currently receives only the raw string (&str) rather than the Token struct, so either the highlighter needs major changes or an alternative solution is needed here.
  • Note also that this adds significant overhead to tokenization, since a new vector is stored for every token,
    • and deunicode has to run on the same token twice: once per grapheme cluster and once more over the entire word.
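As a rough illustration of the mapping (not the PR's actual code: the field layout and the normalize helper here are assumptions, while Token, char_map, deunicode, and grapheme clusters come from the description above):

```rust
use deunicode::deunicode;
use unicode_segmentation::UnicodeSegmentation;

pub struct Token<'a> {
    pub word: &'a str,
    /// For each grapheme cluster of the original word, the number of
    /// bytes that cluster occupies in the normalized word.
    pub char_map: Option<Vec<usize>>,
}

/// Builds the normalized word plus its char_map. This shows the cost
/// noted above: deunicode runs once per grapheme cluster and once
/// more over the whole word.
fn normalize(word: &str) -> (String, Vec<usize>) {
    let char_map = word
        .graphemes(true)
        .map(|grapheme| deunicode(grapheme).len())
        .collect();
    (deunicode(word), char_map)
}
```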

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

This is my first Rust PR 😄. Please let me know if something's not right!

@ManyTheFish (Member) commented:

Can you add some tests concerning your changes? Please. 😊

Samyak2 and others added 3 commits November 8, 2021 20:28
Note: this also counts an incomplete character (grapheme): if only 2
bytes of a 4-byte grapheme are available, that grapheme is still added
to the count (see the sketch below).

Co-authored-by: many <maxime@meilisearch.com>
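A sketch of how such a lookup could count partially covered graphemes, under the same assumptions as the Token sketch above (an illustration, not the merged implementation):

```rust
impl Token<'_> {
    /// Number of graphemes of the original word covered by the first
    /// `num_bytes` bytes of the normalized word. An incomplete
    /// grapheme counts too: if only 2 of its 4 normalized bytes fall
    /// within `num_bytes`, it is still included.
    pub fn num_graphemes_from_bytes(&self, num_bytes: usize) -> usize {
        match &self.char_map {
            // Without a mapping, assume one byte per character.
            None => num_bytes,
            Some(char_map) => {
                let mut remaining = num_bytes;
                char_map
                    .iter()
                    .take_while(|&&bytes| {
                        let covered = remaining > 0;
                        remaining = remaining.saturating_sub(bytes);
                        covered
                    })
                    .count()
            }
        }
    }
}
```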
Some basic tests for the char mapping between original characters and
bytes in the normalized string. Tests only the string "Go💼od", which
normalizes to "Gobriefcaseod".
@ManyTheFish (Member) left a comment:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution!
Bors merge

@bors (bot) commented Nov 9, 2021:

Build succeeded.

@bors bors bot merged commit cb55578 into meilisearch:main Nov 9, 2021
Samyak2 added a commit to Samyak2/milli that referenced this pull request Dec 17, 2021
Samyak2 added a commit to Samyak2/milli that referenced this pull request Jan 17, 2022
Kerollmops pushed a commit to meilisearch/milli that referenced this pull request Jan 18, 2022