Prefix highlight should handle unicode #1480

ManyTheFish · 2021-07-05T15:14:09Z

Describe the bug
The current Highlighter cannot count precisely how many bytes should be highlighted when a word contains deunicoded characters.

To Reproduce

Add a document containing words like Go💼od or Vývoj
Search word by prefix Go💼 or vývo
result is Go💼od or Vývoj

Expected behavior
Go💼od or Vývoj
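To illustrate the kind of mismatch described above, here is a minimal sketch. It is not the actual Highlighter code: `normalize`, `naive_highlight` and `char_aware_highlight` are hypothetical names, and the normalizer is a toy that only lowercases and maps `ý` to `y`. It shows how reusing the byte length of the normalized query on the original word cuts the highlight short, and how counting characters avoids that as long as one original character maps to exactly one normalized character.

```rust
/// Hypothetical stand-in for the real normalizer: lowercases and maps `ý` to `y`.
fn normalize(word: &str) -> String {
    word.chars()
        .map(|c| match c {
            'ý' | 'Ý' => 'y',
            other => other.to_ascii_lowercase(),
        })
        .collect()
}

/// Naive approach: reuse the byte length of the normalized query on the original word.
fn naive_highlight<'a>(original: &'a str, query: &str) -> &'a str {
    let matched_bytes = normalize(query).len();
    // `get` returns None instead of panicking when the offset is not a char boundary.
    original.get(..matched_bytes).unwrap_or(original)
}

/// Character-aware approach: count matched characters, then convert to a byte offset.
fn char_aware_highlight<'a>(original: &'a str, query: &str) -> &'a str {
    let matched_chars = normalize(query).chars().count();
    let end = original
        .char_indices()
        .nth(matched_chars)
        .map(|(byte_index, _)| byte_index)
        .unwrap_or(original.len());
    &original[..end]
}

fn main() {
    // `ý` takes 2 bytes in UTF-8 but its normalized form `y` takes 1.
    println!("{}", naive_highlight("Vývoj", "vývo")); // Výv
    println!("{}", char_aware_highlight("Vývoj", "vývo")); // Vývo
}
```

The character-count approach is not enough for the emoji case, where a single original character may be normalized into several characters.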

Version
meilisearch v0.21

Additional context
related to #1368 and #1173

Fix Proposals

Token as an array of characters
Make the tokenizer return a [str] with one entry per character of the current Token; the highlighter could then return a character count based on that array.

Spanned Normalization
Make the normalizer return the byte bounds of each character in the token via a new function.
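A rough sketch of what this proposal could look like, under stated assumptions: `Span`, `normalize_with_spans` and `original_bytes_for` are hypothetical names, not part of the tokenizer, and the normalization table (including `💼` becoming `briefcase`) is made up for the example. The idea is that each normalized piece remembers how many bytes its original character occupied, so a match length measured on the normalized text can be converted back to a byte length in the original text.

```rust
/// One normalized piece plus the byte length of the original character
/// it came from. Hypothetical type, for illustration only.
struct Span {
    normalized: String,
    original_len: usize,
}

/// Toy normalizer that keeps track of the original byte length of each character.
fn normalize_with_spans(word: &str) -> Vec<Span> {
    word.chars()
        .map(|c| {
            let normalized = match c {
                'ý' | 'Ý' => "y".to_string(),
                // Assumed expansion: one emoji becomes several ASCII characters.
                '💼' => "briefcase".to_string(),
                other => other.to_lowercase().collect(),
            };
            Span { normalized, original_len: c.len_utf8() }
        })
        .collect()
}

/// Converts a match length in normalized bytes into a length in original bytes.
/// A partially matched character is counted as a whole.
fn original_bytes_for(spans: &[Span], matched_normalized_bytes: usize) -> usize {
    let mut seen = 0;
    let mut original = 0;
    for span in spans {
        if seen >= matched_normalized_bytes {
            break;
        }
        seen += span.normalized.len();
        original += span.original_len;
    }
    original
}

fn main() {
    let word = "Go💼od";
    let spans = normalize_with_spans(word);
    // The query prefix `Go💼` normalizes to `gobriefcase` in this toy table.
    let end = original_bytes_for(&spans, "gobriefcase".len());
    println!("{}", &word[..end]); // Go💼
}
```

Because the returned offset is always a sum of whole-character lengths, slicing the original word with it cannot land in the middle of a UTF-8 sequence.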

Forbid the normalizer from returning more than 1 char per normalized char
forbid at least:

emoji deunicoding
non-Latin scripts deunicoding

this last proposal is the easiest to implement but would be the ugliest


curquiza · 2021-07-26T14:37:56Z

I checked with the team: this is not easy to fix for v0.21.0 and not much of an emergency, so I am removing this issue from the v0.21.0 milestone.
However, it should not crash. If you see a crash, please report it 🙂

Samyak2 · 2021-11-09T15:24:07Z

Now that meilisearch/charabia#54 is closed, this should be possible to fix. Can I take a shot at this?

ManyTheFish · 2021-11-10T13:43:02Z

Hey @Samyak2!
Yes, you can. I would be glad to help if you have any questions. We will soon release a version of the tokenizer that will allow you to work on this. 🚀

Samyak2 · 2021-11-10T13:48:08Z

Great! Thank you

curquiza · 2021-11-10T14:08:08Z

@Samyak2 tokenizer v0.2.6 is out

Samyak2 · 2021-11-20T15:13:55Z

I have been working on this, but I have run into some issues.

I changed the PrimitiveQueryPart struct to store the Token itself instead of just the String (this is needed because Token's num_graphemes_from_bytes will be used to count the number of characters to highlight). Since Token holds a reference internally, I had to add lifetime specifiers to everything that uses PrimitiveQuery.

This leads to issues with the borrow checker as the reference internal to Token depends on the analyzer and the stop words.

Cargo build errors

error[E0597]: `stop_words.0` does not live long enough
   --> milli/src/search/mod.rs:118:29
    |
108 |         let (query_tree, primitive_query) = match self.query.as_ref() {
    |              ---------- borrow later used here
...
118 |                 if let Some(ref stop_words) = stop_words {
    |                             ^^^^^^^^^^^^^^ borrowed value does not live long enough
...
126 |             }
    |             - `stop_words.0` dropped here while still borrowed

error[E0597]: `analyzer` does not live long enough
   --> milli/src/search/mod.rs:122:30
    |
108 |         let (query_tree, primitive_query) = match self.query.as_ref() {
    |              ---------- borrow later used here
...
122 |                 let result = analyzer.analyze(query);
    |                              ^^^^^^^^ borrowed value does not live long enough
...
126 |             }
    |             - `analyzer` dropped here while still borrowed

error[E0597]: `result` does not live long enough
   --> milli/src/search/mod.rs:123:30
    |
108 |         let (query_tree, primitive_query) = match self.query.as_ref() {
    |              ---------- borrow later used here
...
123 |                 let tokens = result.tokens();
    |                              ^^^^^^ borrowed value does not live long enough
...
126 |             }
    |             - `result` dropped here while still borrowed

error[E0597]: `builder` does not live long enough
   --> milli/src/search/mod.rs:124:53
    |
108 |         let (query_tree, primitive_query) = match self.query.as_ref() {
    |              ---------- borrow later used here
...
124 |                 let (query_tree, primitive_query) = builder.build(tokens)?.map_or((None, None), |(qt, pq)| (Some(qt), Some(pq)));
    |                                                     ^^^^^^^ borrowed value does not live long enough
125 |                 (query_tree, primitive_query.clone())
126 |             }
    |             - `builder` dropped here while still borrowed

For more information about this error, try `rustc --explain E0597`.
error: could not compile `milli` due to 4 previous errors
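The four errors above share one pattern: a value that borrows from a local (`stop_words`, `analyzer`, `result`, `builder`) is kept after that local is dropped. The sketch below is not milli code; `Token`, `QueryPart` and `build_parts` are hypothetical names, and it only shows one general way around this kind of error, which is to copy the needed data into an owned value while the borrowed token is still alive.

```rust
/// Hypothetical token that borrows from the text it was produced from.
struct Token<'a> {
    word: &'a str,
}

/// Owned query part: it copies what it needs, so it carries no lifetime.
struct QueryPart {
    word: String,
    char_count: usize,
}

fn build_parts(query: &str) -> Vec<QueryPart> {
    // `lowered` is a local, like `analyzer` and `result` in the errors above.
    let lowered = query.to_lowercase();
    let tokens: Vec<Token<'_>> = lowered
        .split_whitespace()
        .map(|word| Token { word })
        .collect();

    // Returning `tokens` directly would not compile: they borrow from
    // `lowered`, which is dropped at the end of this function.
    // Converting to owned data before the local goes away avoids that.
    tokens
        .iter()
        .map(|token| QueryPart {
            word: token.word.to_string(),
            char_count: token.word.chars().count(),
        })
        .collect()
}

fn main() {
    for part in build_parts("Vývoj Go") {
        println!("{} ({} chars)", part.word, part.char_count);
    }
}
```

The trade-off is an extra allocation per token, in exchange for not having to thread lifetime parameters through every type that stores the result.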

I'm quite new to Rust, so I couldn't figure out how to fix these issues. Any pointers would be appreciated :)

ManyTheFish · 2021-11-22T15:14:44Z

Hey @Samyak2! I think you are modifying an unrelated part of the code.
I invite you to look at the matching_bytes function in matching_words.rs; this function is used to compute highlights and matches in meilisearch.
For now, this function takes a &str, but I know that the "caller" of the function can provide a &Token instead. I think that changing this function would do the job. 🙂

adellinocasas · 2021-12-17T15:25:10Z

Hi there,
has this issue - Prefix highlight should handle unicode - already been fixed?
(using version Scout / Meilisearch "pkgVersion":"0.20.0")
thanks

Samyak2 · 2021-12-17T16:11:29Z

Hi @ManyTheFish. Sorry I was too busy with college exams in the last month :(

I can continue working on this now and will have a PR in the next 1-2 days (if it hasn't been fixed already).

curquiza · 2021-12-20T09:17:41Z

No worries @Samyak2, no pressure for the contributors here! 🙂 Hope your exams went well!

Thanks a lot for your PRs and your involvement!

Samyak2 · 2021-12-20T09:51:25Z

Thank you!

curquiza · 2022-01-17T16:20:15Z

Reopening since it's only fixed on milli's side, not on the MeiliSearch side.

To close this issue, we need to

Release milli
Update the milli dependency in this repo

curquiza · 2022-02-02T15:44:07Z

Fixed by #2005, which bumps the milli dependency to milli v0.22.0, which contains the fix for this issue 🙂

The bug fix will be released in Meilisearch v0.26.0.

yngc0der · 2022-06-08T12:42:58Z

I still have this problem with meilisearch v0.27.2 and Cyrillic text in a JSON field.
Query: "Компания"
Response formatted part: "Компания дарит бесплатные билеты"

ManyTheFish · 2022-06-08T14:05:25Z

Hey @yngc0der,
we completely changed the tokenizer for version 0.28, and it should fix your issue.
This version will be pre-released next week (beta release) and will be stabilized in 4 weeks.
Don't hesitate to create a new issue if your bug persists.

Thanks for your interest!

ManyTheFish added the bug Something isn't working as expected label Jul 5, 2021

This was referenced Jul 5, 2021

Highlight on emojis #1368

Closed

Highlighting is broken with non ascii characters #1173

Closed

qdequele added this to the v0.21.0 milestone Jul 6, 2021

Kerollmops mentioned this issue Jul 6, 2021

 tags present inside arrays meilisearch/mini-dashboard#92

Closed

This was referenced Jul 13, 2021

Misplaced highlights when used together with values cropping #1395

Closed

searching indexed arrays in the meiliesearch browser app not highlight correctly #1516

Closed

curquiza removed this from the v0.21.0 milestone Jul 26, 2021

curquiza mentioned this issue Jul 26, 2021

Get size of char after normalization meilisearch/charabia#54

Closed

curquiza added this to Candidate in Bug triage via automation Aug 10, 2021

curquiza moved this from Candidate to Bugs - severity 3 in Bug triage Aug 10, 2021

ManyTheFish mentioned this issue Aug 25, 2021

Encoding bug with accent characters #1631

Closed

curquiza added this to the v0.26.0 milestone Dec 8, 2021

curquiza assigned ManyTheFish Dec 9, 2021

Samyak2 mentioned this issue Dec 17, 2021

Fix search highlight for non-unicode chars meilisearch/milli#426

Merged

3 tasks

bors bot closed this as completed in meilisearch/milli@4c516c0 Jan 17, 2022

Bug triage automation moved this from Bugs - severity 3 to Done Jan 17, 2022

curquiza reopened this Jan 17, 2022

Bug triage automation moved this from Done to Candidates Jan 17, 2022

curquiza added the milli Related to the milli workspace label Jan 24, 2022

curquiza mentioned this issue Jan 27, 2022

Hightlight with special characters #2121

Closed

curquiza closed this as completed Feb 2, 2022

Bug triage automation moved this from Candidates to Done Feb 2, 2022

curquiza added the v0.26.0 PRs/issues solved in v0.26.0 label Aug 24, 2022
