Weird highlight results with non-ascii #2144

Closed
OrkhanAlikhanov opened this issue Feb 5, 2022 · 7 comments · Fixed by #2468
Labels
bug: Something isn't working as expected
tokenizer: Related to the tokenizer repo: https://github.com/meilisearch/tokenizer/
v0.28.0: PRs/issues solved in v0.28.0
Milestone: v0.28.0

Comments

@OrkhanAlikhanov

OrkhanAlikhanov commented Feb 5, 2022

Describe the bug
Meilisearch is giving weird results compared to Algolia and Typesense.

To be honest, I don't know whether the software is built to behave this way or the language I use is unsupported. I wanted to report this in any case. If there is a settings toggle to make matching strict, I'd like to know.

To Reproduce
Steps to reproduce the behavior:

  1. Go to the Meilisearch dashboard
  2. Search for "Müəllimlərin İşə Qəbulu - Təsviri İncəsənət" or "Təsviri İncəsənət"
  3. See the weird results

Expected behavior
Should show the correct results, as Algolia and Typesense do.

Screenshots
Check out the video where I explain the results. Don't forget to enable audio.

meilisearch_issue.mp4

If you prefer, here is the YouTube link: https://www.youtube.com/watch?v=kumI6XbjnUA

Meilisearch version: v0.22.0, v0.25.2

@OrkhanAlikhanov
Author

With ASCII-only text the results seem more correct, but I would still like to get a single result in both cases.

(Make sure to enable audio)

Screen.Recording.2022-02-05.at.20.15.48.mov

@Kerollmops
Member

Kerollmops commented Feb 7, 2022

Hey @OrkhanAlikhanov,

Thank you very much for your issue and detailed explanation!
This bug seems closely related to the default Meilisearch tokenizer. Algolia uses a Unicode segmentation algorithm, and Typesense, IIRC, uses a basic split-on-whitespace system. We use a custom algorithm, and it seems to be the source of the issue here. BTW we do not support Turk, IIRC.

However, we plan to improve the tokenizer we use and we will probably expose a Unicode segmentation system too. @ManyTheFish plans to work on this for future releases. We will also expose parameters to disable the removal of words from the end of the query, which happens when there are not enough documents to fill the 20 results.
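
To illustrate why the choice of tokenizer matters for the query in this issue, here is a minimal Python sketch comparing three splitting strategies. It is only an illustration: `whitespace_tokens`, `unicode_word_tokens`, and `ascii_only_tokens` are hypothetical helpers, `\w+` is a rough stand-in for real Unicode segmentation, and the ASCII-only splitter is a deliberately naive example, not the algorithm Meilisearch, Algolia, or Typesense actually uses.

```python
import re
import unicodedata

# Query taken from the issue report, normalized to NFC so that accented
# letters are single code points regardless of how the text was typed.
QUERY = unicodedata.normalize("NFC", "Müəllimlərin İşə Qəbulu - Təsviri İncəsənət")


def whitespace_tokens(text):
    """Split on runs of whitespace only; punctuation stays as its own token."""
    return text.split()


def unicode_word_tokens(text):
    """Rough stand-in for Unicode-aware word splitting: in Python 3,
    \\w on str patterns matches Unicode letters and digits."""
    return re.findall(r"\w+", text)


def ascii_only_tokens(text):
    """Hypothetical naive splitter that treats every non-ASCII character
    as a separator (an assumption for illustration only)."""
    return re.findall(r"[A-Za-z0-9]+", text)


if __name__ == "__main__":
    for name, fn in [
        ("whitespace", whitespace_tokens),
        ("unicode \\w+", unicode_word_tokens),
        ("ascii-only", ascii_only_tokens),
    ]:
        print(f"{name:12} {fn(QUERY)}")
```

With the ASCII-only splitter, a word such as "Təsviri" falls apart into the fragments "T" and "sviri". Short fragments like these can match and highlight many unrelated documents, which is one plausible way a tokenizer that mishandles non-ASCII letters could produce the kind of results shown in the video.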

@curquiza added the bug and tokenizer labels Feb 7, 2022
@OrkhanAlikhanov
Author

Hey! Thank you for the detailed answer. How can one contribute to this? Is it easy to update the tokenizer function? Any pointers if I want to experiment with a workaround for this issue?

@Kerollmops
Member

Hey! You should look into our tokenizer repository. It is the crate we use in Meilisearch, more specifically in the milli crate.

However, could you first create an issue and wait until next week for an answer? We plan to rewrite the tokenizer, and @ManyTheFish is the one who will do this. I would prefer that he plan the rewrite with you. You can always propose a patch on the tokenizer and patch Meilisearch with your fork first.

@curquiza changed the title from "Weird results" to "Weird highlight results with non-ascii" May 4, 2022
@curquiza added this to the v0.28.0 milestone May 18, 2022
@curquiza
Member

curquiza commented May 18, 2022

I put this issue in the v0.28.0 milestone. The new tokenizer might fix this, but nothing is certain; we need to test once the first RC is out.

@ManyTheFish
Member

Hello @curquiza, after some tests I'm confident that the new tokenizer will fix this issue. 😄

@curquiza
Member

So this issue will be fixed when #2375 is fixed 🚀

bors bot added a commit to meilisearch/milli that referenced this issue Jun 2, 2022
540: Integrate charabia r=Kerollmops a=ManyTheFish

related to meilisearch/meilisearch#2375
related to meilisearch/meilisearch#2144
related to meilisearch/meilisearch#2417

Co-authored-by: ManyTheFish <many@meilisearch.com>
@ManyTheFish ManyTheFish mentioned this issue Jun 7, 2022
4 tasks
bors bot added a commit that referenced this issue Jun 7, 2022
2468: Update milli 0.29 r=ManyTheFish a=ManyTheFish

- [x] Update milli to 0.29
- [x] Integrate charabia
- [x] Set disabled_words to default when Index::exact_words returns None
- [x] Fix ranking rules integration test

fixes #2375
fixes #2144
fixes #2417

Co-authored-by: ManyTheFish <many@meilisearch.com>
bors bot added a commit that referenced this issue Jun 7, 2022
2468: Update milli 0.29 r=ManyTheFish a=ManyTheFish

- [x] Update milli to 0.29
- [x] Integrate charabia
- [x] Set disabled_words to default when Index::exact_words returns None
- [x] Fix ranking rules integration test

fixes #2375
fixes #2144
fixes #2417
fixes #2407

Co-authored-by: ManyTheFish <many@meilisearch.com>
@bors bors bot closed this as completed in #2468 Jun 8, 2022
@bors bors bot closed this as completed in 6171f17 Jun 8, 2022
@curquiza added the v0.28.0 label Aug 24, 2022