Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prefix highlight should handle unicode #1480

Closed
ManyTheFish opened this issue Jul 5, 2021 · 15 comments · Fixed by meilisearch/milli#426
Closed

Prefix highlight should handle unicode #1480

ManyTheFish opened this issue Jul 5, 2021 · 15 comments · Fixed by meilisearch/milli#426
Assignees
Labels
bug Something isn't working as expected milli Related to the milli workspace v0.26.0 PRs/issues solved in v0.26.0
Projects
Milestone

Comments

@ManyTheFish
Copy link
Member

ManyTheFish commented Jul 5, 2021

Describe the bug
It is not possible for the actual Highlighter to count precisely how many bytes should be highlighted when a word contains deunicoded characters.

To Reproduce

  1. Add a document containing words like Go💼od or Vývoj
  2. Search word by prefix Go💼 or vývo
  3. result is <em>Go💼od</em> or <em>Vývoj</em>

Expected behavior
<em>Go💼</em>od or <em>Vývo</em>j

Version
meilisearch v0.21

Additional context
related to #1368 and #1173

Fix Proposals

Token as an array of characters
make tokenizer return [str] corresponding to each character of the current Token, the highlighter would be able to return a character count based on the array.

Spaned Normalization
make the normalizer return bounds in bytes of each character in the token via a new function.

Forbid to the normalizer to return more than 1 char per normalized char
forbid at least:

  • emoji deunicoding
  • non-Latin scripts deunicoding

this last proposal is the easiest to implement but would be the ugliest

@curquiza
Copy link
Member

Checked with the team, this is not easy to fix for v0.21.0 and not that an emergency, I remove this issue from the v0.21.0 milestones.
However, it should not crash, if you have a crash behavior, please report it 🙂

@curquiza curquiza removed this from the v0.21.0 milestone Jul 26, 2021
@curquiza curquiza added this to Candidate in Bug triage via automation Aug 10, 2021
@curquiza curquiza moved this from Candidate to Bugs - severity 3 in Bug triage Aug 10, 2021
@Samyak2
Copy link
Contributor

Samyak2 commented Nov 9, 2021

Now that meilisearch/charabia#54 is closed, this should be possible to fix. Can I take a shot at this?

@ManyTheFish
Copy link
Member Author

ManyTheFish commented Nov 10, 2021

Hey @Samyak2!
Yes, you can, I would be glad to help you if you have any questions, we will soon make a release of the tokenizer allowing you to work on this. 🚀

@Samyak2
Copy link
Contributor

Samyak2 commented Nov 10, 2021

Great! Thank you

@curquiza
Copy link
Member

@Samyak2 tokenizer v0.2.6 is out

@Samyak2
Copy link
Contributor

Samyak2 commented Nov 20, 2021

I have been working on this, but I have run into some issues.

I changed the PrimitiveQueryPart struct to store Token itself instead of just the String (This is needed because the num_graphemes_from_bytes of Token will be used to count number of characters to highlight). Since Token has a reference internally, I had to add lifetime specifiers to everything that uses PrimitiveQuery.

This leads to issues with the borrow checker as the reference internal to Token depends on the analyzer and the stop words.

Cargo build errors
error[E0597]: `stop_words.0` does not live long enough
   --> milli/src/search/mod.rs:118:29
    |
108 |         let (query_tree, primitive_query) = match self.query.as_ref() {
    |              ---------- borrow later used here
...
118 |                 if let Some(ref stop_words) = stop_words {
    |                             ^^^^^^^^^^^^^^ borrowed value does not live long enough
...
126 |             }
    |             - `stop_words.0` dropped here while still borrowed

error[E0597]: `analyzer` does not live long enough
   --> milli/src/search/mod.rs:122:30
    |
108 |         let (query_tree, primitive_query) = match self.query.as_ref() {
    |              ---------- borrow later used here
...
122 |                 let result = analyzer.analyze(query);
    |                              ^^^^^^^^ borrowed value does not live long enough
...
126 |             }
    |             - `analyzer` dropped here while still borrowed

error[E0597]: `result` does not live long enough
   --> milli/src/search/mod.rs:123:30
    |
108 |         let (query_tree, primitive_query) = match self.query.as_ref() {
    |              ---------- borrow later used here
...
123 |                 let tokens = result.tokens();
    |                              ^^^^^^ borrowed value does not live long enough
...
126 |             }
    |             - `result` dropped here while still borrowed

error[E0597]: `builder` does not live long enough
   --> milli/src/search/mod.rs:124:53
    |
108 |         let (query_tree, primitive_query) = match self.query.as_ref() {
    |              ---------- borrow later used here
...
124 |                 let (query_tree, primitive_query) = builder.build(tokens)?.map_or((None, None), |(qt, pq)| (Some(qt), Some(pq)));
    |                                                     ^^^^^^^ borrowed value does not live long enough
125 |                 (query_tree, primitive_query.clone())
126 |             }
    |             - `builder` dropped here while still borrowed

For more information about this error, try `rustc --explain E0597`.
error: could not compile `milli` due to 4 previous errors

I'm quite new to Rust, so I couldn't figure out how to fix these issues. Any pointers would be appreciated :)

@ManyTheFish
Copy link
Member Author

Hey @Samyak2! I think you try to modify an unrelated code part.
I invite you to look at matching_bytes function in matching_words.rs, this function is used to compute highlights and matches in meilisearch.
For now, this function takes a &str but I know that the "caller" of the function can provide a &Token instead. I think that changing this function would do the job. 🙂

@curquiza curquiza added this to the v0.26.0 milestone Dec 8, 2021
@adellinocasas
Copy link

Hi there,
has this issue - Prefix highlight should handle unicode - already been fixed?
(using version Scout / Meilisearch "pkgVersion":"0.20.0")
thanks

@Samyak2
Copy link
Contributor

Samyak2 commented Dec 17, 2021

Hi @ManyTheFish. Sorry I was too busy with college exams in the last month :(

I can continue working on this now and will have a PR in the next 1-2 days (if it hasn't been fixed already).

@curquiza
Copy link
Member

No worry @Samyak2, no pressure for the contributors here! 🙂 Hope your exams went well!

Thanks a lot for your PRs and your involvement!

@Samyak2
Copy link
Contributor

Samyak2 commented Dec 20, 2021

Thank you!

Bug triage automation moved this from Bugs - severity 3 to Done Jan 17, 2022
@curquiza
Copy link
Member

curquiza commented Jan 17, 2022

Reopen since it's only fixed milli's side but not on MeiliSearch side

To close this issue, we need to

  • Release milli
  • Update the milli dependency in this repo

@curquiza curquiza reopened this Jan 17, 2022
Bug triage automation moved this from Done to Candidates Jan 17, 2022
@curquiza curquiza added the milli Related to the milli workspace label Jan 24, 2022
@curquiza
Copy link
Member

curquiza commented Feb 2, 2022

Fixed by #2005 that bump the milli dependency to milli v0.22.0 containing the fix of this issue 🙂

The bug fix will be released in Meilisearch v0.26.0.

@curquiza curquiza closed this as completed Feb 2, 2022
Bug triage automation moved this from Candidates to Done Feb 2, 2022
@yngc0der
Copy link

yngc0der commented Jun 8, 2022

Still have this problem with meilisearch v0.27.2 and cyrillic text in json field.
Query: "Компания"
Response formatted part: "Компания дарит бесплатные билеты"

@ManyTheFish
Copy link
Member Author

ManyTheFish commented Jun 8, 2022

Hey @yngc0der,
we completely changed the tokenizer for the version 0.28, and it should fix your issue.
This version will be pre-released next week (beta release) and will be stabilized in 4 weeks.
Don't hesitate to create a new issue if your bug persists.

Thanks for your interest!

@curquiza curquiza added the v0.26.0 PRs/issues solved in v0.26.0 label Aug 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working as expected milli Related to the milli workspace v0.26.0 PRs/issues solved in v0.26.0
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

6 participants