Prefix highlight should handle unicode #1480
Comments
I checked with the team: this is not easy to fix for v0.21.0 and is not that urgent, so I am removing this issue from the v0.21.0 milestone.
Now that meilisearch/charabia#54 is closed, this should be possible to fix. Can I take a shot at this?
Hey @Samyak2!
Great! Thank you
I have been working on this, but I have run into some issues. I changed the PrimitiveQueryPart struct to store the Token itself instead of just the String (this is needed because the [...]). This leads to issues with the borrow checker, as the reference held inside Token depends on the analyzer and the stop words. Cargo build errors
I'm quite new to Rust, so I couldn't figure out how to fix these issues. Any pointers would be appreciated :)
Hey @Samyak2! I think you are trying to modify an unrelated part of the code.
Hi there,
Hi @ManyTheFish. Sorry, I was too busy with college exams over the last month :( I can continue working on this now and will have a PR in the next 1-2 days (if it hasn't been fixed already).
No worries @Samyak2, no pressure for the contributors here! 🙂 Hope your exams went well! Thanks a lot for your PRs and your involvement!
Thank you!
Reopening, since this is only fixed on milli's side and not on the MeiliSearch side. To close this issue, we need to
Fixed by #2005, which bumps the milli dependency to milli v0.22.0, containing the fix for this issue 🙂 The bug fix will be released in Meilisearch v0.26.0.
I still have this problem with meilisearch v0.27.2 and Cyrillic text in a JSON field.
Hey @yngc0der, thanks for your interest!
Describe the bug
The current Highlighter cannot count precisely how many bytes should be highlighted when a word contains deunicoded characters.
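To make the mismatch concrete, here is a minimal Rust sketch. It assumes a toy normalizer, `toy_normalize`, which is a hypothetical stand-in and not the real tokenizer or charabia API; it only lowercases and maps `ý` to `y`. The helper `naive_prefix` is likewise hypothetical and shows what goes wrong if the byte length of the normalized query is used as an offset into the original word.

```rust
/// Toy stand-in for a deunicoding normalizer (hypothetical, for illustration only):
/// lowercases every character and maps 'ý' / 'Ý' to the ASCII letter 'y'.
fn toy_normalize(s: &str) -> String {
    s.chars()
        .flat_map(|c| match c {
            'ý' | 'Ý' => vec!['y'],
            other => other.to_lowercase().collect::<Vec<char>>(),
        })
        .collect()
}

/// Naive approach: reuse the byte length of the normalized query as a byte
/// offset into the original word. Returns `None` if that offset does not
/// fall on a character boundary of the original word.
fn naive_prefix<'a>(original: &'a str, query: &str) -> Option<&'a str> {
    let normalized_len = toy_normalize(query).len();
    original.get(..normalized_len)
}
```

With this sketch, the normalized query `vyvo` is 4 bytes long, while the matching prefix `Vývo` of the original word is 5 bytes long (`ý` takes 2 bytes in UTF-8), so `naive_prefix("Vývoj", "vývo")` selects `Výv` instead of `Vývo`.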
To Reproduce
Index a document containing Go💼od or Vývoj
Search for Go💼 or vývo
The highlighted result is <em>Go💼od</em> or <em>Vývoj</em>
Expected behavior
<em>Go💼</em>od or <em>Vývo</em>j
Version
meilisearch v0.21
Additional context
related to #1368 and #1173
Fix Proposals
Token as an array of characters
Make the tokenizer return [str] corresponding to each character of the current Token; the highlighter would then be able to return a character count based on the array.
Spanned normalization
Make the normalizer return the bounds, in bytes, of each character in the token via a new function.
Forbid the normalizer from returning more than 1 char per normalized char
Forbid at least:
This last proposal is the easiest to implement but would be the ugliest.
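The spanned normalization proposal could look roughly like the Rust sketch below. Everything in it is an assumption made for illustration: `NormalizedChar`, `toy_normalize_char`, `normalize_with_spans`, `highlighted_bytes` and `highlight` are hypothetical names, not part of the tokenizer or of milli, and the toy normalizer only lowercases, maps `ý` to `y`, and expands `ß` to `ss` as an example of one character normalizing to several.

```rust
/// One original character: its byte bounds in the original token and the
/// text a (toy) normalizer produced for it. Hypothetical type.
#[derive(Debug, PartialEq)]
struct NormalizedChar {
    start: usize, // byte offset in the original token, inclusive
    end: usize,   // byte offset in the original token, exclusive
    normalized: String,
}

/// Toy per-character normalizer (assumption, not a real deunicode table).
fn toy_normalize_char(c: char) -> String {
    match c {
        'ý' | 'Ý' => "y".to_string(),
        'ß' => "ss".to_string(),
        other => other.to_lowercase().collect(),
    }
}

/// Normalizes a token while remembering the original byte span of each character.
fn normalize_with_spans(token: &str) -> Vec<NormalizedChar> {
    token
        .char_indices()
        .map(|(start, c)| NormalizedChar {
            start,
            end: start + c.len_utf8(),
            normalized: toy_normalize_char(c),
        })
        .collect()
}

/// Number of original bytes to highlight when `normalized_query` is a prefix
/// of the normalized token, or `None` when it is not a prefix.
fn highlighted_bytes(token: &str, normalized_query: &str) -> Option<usize> {
    let mut remaining = normalized_query;
    let mut end = 0;
    for nc in normalize_with_spans(token) {
        if remaining.is_empty() {
            break;
        }
        if let Some(rest) = remaining.strip_prefix(nc.normalized.as_str()) {
            remaining = rest;
        } else if nc.normalized.starts_with(remaining) {
            // The query ends inside a multi-character expansion:
            // highlight the whole original character.
            remaining = "";
        } else {
            return None;
        }
        end = nc.end;
    }
    if remaining.is_empty() { Some(end) } else { None }
}

/// Wraps the matched prefix of the original token in <em> tags.
fn highlight(token: &str, normalized_query: &str) -> String {
    match highlighted_bytes(token, normalized_query) {
        Some(n) if n > 0 => format!("<em>{}</em>{}", &token[..n], &token[n..]),
        _ => token.to_string(),
    }
}
```

Because the highlighter walks the original byte spans rather than the length of the normalized text, `highlight("Vývoj", "vyvo")` yields `<em>Vývo</em>j` and `highlight("Go💼od", "go💼")` yields `<em>Go💼</em>od`, matching the expected behavior above. The same walk over characters would also work for the first proposal, counting characters instead of bytes.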