English: searching words via their non-diacritic form doesn't return results #230

Closed
toomuchdesign opened this issue Jan 9, 2023 · 4 comments
Labels
bug Something isn't working

Comments

toomuchdesign commented Jan 9, 2023

Describe the bug
The English language doesn't return results when a word containing a diacritic is searched via its non-diacritic form. E.g.:

Currently the sentence My name is Josè, nice to meet you is NOT searchable via jose or Jose (diacritic removed).

Any other variation of the word seems to produce the expected result (josè, josė, josę, josâ, jos)

To Reproduce
https://codesandbox.io/s/lyra-diacritics-english-language-b8tzv6

Expected behavior
Words containing diacritics should be searchable via either their diacritic or non-diacritic form (or a mix of both).

E.g. the sentence My name is Josè, nice to meet you should be searchable via jose or Jose.

Desktop:

  • OS: MacOS 12
  • Browser: Chrome

Additional context
Related to issue #213

The Spanish language handles this example (josè) correctly, but fails when the input contains non-Spanish diacritics (ē, ė, ę). Feel free to play with the CodeSandbox demo.

English tokenizer
Looking at the runtime execution, the most relevant difference seems to be how words are tokenized for English. More specifically, the diacritic form (josè) gets tokenized as jo, while jose stays jose.

The term variable here gets initialized with different values, hence the two different results.

The root cause seems to be the English regex used to generate tokens here. This repro shows how the regex handles diacritic and non-diacritic forms differently: josè becomes jos while jose remains jose.
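To make the split behaviour concrete, here is a minimal sketch. The pattern `asciiSplitter` and the helper `tokenize` are hypothetical stand-ins, assumed for illustration; the actual regex in the repository may differ. The point is only that any splitter that treats everything outside `a-z0-9` as a separator will cut the word at the diacritic.

```typescript
// Hypothetical ASCII-only splitter, assumed for illustration only.
// Any character outside a-z, 0-9, _, ' and - is treated as a separator.
const asciiSplitter = /[^a-z0-9_'-]+/gi;

// Hypothetical tokenizer: lowercase, split, drop empty fragments.
function tokenize(input: string): string[] {
  return input
    .toLowerCase()
    .split(asciiSplitter)
    .filter((token) => token.length > 0);
}

// "\u00e8" is the precomposed character "è".
const withDiacritic = tokenize("My name is Jos\u00e8");
const withoutDiacritic = tokenize("My name is Jose");

console.log(withDiacritic);    // ["my", "name", "is", "jos"]
console.log(withoutDiacritic); // ["my", "name", "is", "jose"]
```

Because the indexed token is `jos` and the query token is `jose`, an exact-term lookup cannot match them under this assumption.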

@toomuchdesign
Author

If we want to extend the tokenizer regexes, here is a good starting point.

@toomuchdesign
Author

A solution might consist of extending the tokenizer regexes to match all or some diacritics, so that the tokenizer splits strings diacritic-insensitively. Would that make sense, @micheleriva? If so, shall we consider doing the same for other languages?
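As a rough sketch of one way to get diacritic-insensitive tokens: rather than widening the regex itself, the input can be folded to its base letters before splitting. This is an alternative to the regex extension proposed above, not necessarily what the eventual fix does. `foldDiacritics`, `tokenizeFolded` and `splitter` are hypothetical names; only `String.prototype.normalize` is a standard API.

```typescript
// Hypothetical helper: decompose characters (NFD) so that "è" becomes
// "e" + a combining grave accent, then strip the combining marks.
function foldDiacritics(input: string): string {
  return input.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Hypothetical ASCII-only splitter, assumed for illustration only.
const splitter = /[^a-z0-9_'-]+/gi;

// Hypothetical tokenizer: fold diacritics first, then lowercase and split.
function tokenizeFolded(input: string): string[] {
  return foldDiacritics(input)
    .toLowerCase()
    .split(splitter)
    .filter((token) => token.length > 0);
}

// "\u00e8" is "è" and "\u0117" is "ė"; both fold to a plain "e".
console.log(tokenizeFolded("My name is Jos\u00e8")); // ["my", "name", "is", "jose"]
console.log(tokenizeFolded("jos\u0117"));            // ["jose"]
console.log(tokenizeFolded("Jose"));                 // ["jose"]
```

Applying the same folding to both the indexed text and the query would make all variants resolve to the same token. Letters that have no canonical decomposition (for example ł or ø) are not covered by this approach and would need an explicit mapping.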

toomuchdesign changed the title from "English: searching words via their non-diacritic form doesn't produce results" to "English: searching words via their non-diacritic form doesn't return results" on Jan 9, 2023
@micheleriva
Member

> Currently the sentence My name is Josè, nice to meet you is NOT searchable via jose or Jose (diacritic removed).
> Any other variation of the word seems to produce the expected result (josè, josė, josę, josâ, jos)

This is a bug: results should ALWAYS be returned, and that shouldn't depend on the language.

micheleriva added the bug label on Jan 16, 2023
@kamilogorek

Fixed in #271
