English: searching words via their non-diacritic form doesn't return results #230

toomuchdesign · 2023-01-09T11:42:28Z

Describe the bug
The English language doesn't return results when a word containing a diacritic is searched via its non-diacritic form. E.g.:

Currently the sentence My name is Josè, nice to meet you is NOT searchable via jose or Jose (diacritic removed).

Any other variation of the word seems to produce the expected result (josè, josė, josę, josâ, jos)

To Reproduce
https://codesandbox.io/s/lyra-diacritics-english-language-b8tzv6

Expected behavior
Words containing diacritics should be searchable via either their diacritical or non-diacritical form (or a mix of both).

E.g. the sentence My name is Josè, nice to meet you should be searchable via jose or Jose.

Desktop (please complete the following information):

OS: macOS 12
Browser: Chrome

Additional context
Related to issue #213

The Spanish language handles this example (josè) correctly, but fails when the input contains non-Spanish diacritics (ē, ė, ę). Feel free to play with the CodeSandbox demo.

English tokenizer
Looking at the runtime execution, the most relevant difference seems to be how words are tokenized (English). More specifically, diacritic forms (josè) get tokenized as jo while jose is tokenized as jose.

The term variable here gets initialized with different values, hence the two different results.

The root cause seems to be the English regex used to generate tokens here. This repro shows how the regex handles diacritics and non-diacritics differently: josè becomes jos while jose remains jose.
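To make the behaviour described above easier to picture, here is a minimal sketch. The regex and the function name are hypothetical and are not copied from Lyra's source; the sketch only assumes a splitter that treats every character outside a small ASCII set as a separator.

```typescript
// Hypothetical splitter (an assumption, NOT Lyra's actual regex):
// anything that is not an ASCII letter, digit, underscore,
// apostrophe or hyphen is treated as a token separator.
const ASCII_ONLY_SPLITTER = /[^a-z0-9_'-]+/gi;

function tokenizeAsciiOnly(input: string): string[] {
  return input
    .toLowerCase()
    .split(ASCII_ONLY_SPLITTER)
    // Splitting can leave empty strings at the edges; drop them.
    .filter((token) => token.length > 0);
}

// "\u00e8" is the precomposed character "è".
console.log(tokenizeAsciiOnly("Jos\u00e8")); // [ 'jos' ]
console.log(tokenizeAsciiOnly("Jose")); // [ 'jose' ]
```

With such a splitter the accented letter acts as a separator, so the indexed token and the token produced by the query jose never match.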


toomuchdesign · 2023-01-09T12:18:03Z

In case we want to extend the tokenizer regexes, here is a good starting point.

toomuchdesign · 2023-01-09T12:42:47Z

A solution might consist of extending the tokenizer regexes to match all/some diacritics so that the tokenizer splits strings diacritic-insensitively. Would that make sense @micheleriva? If so, shall we consider doing the same for other languages?
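One way this idea could look is sketched below. It is not the fix that was eventually merged; the regex, the character ranges and the function name are assumptions chosen for illustration. The splitter keeps common accented Latin letters inside a token, and each token is then folded to its non-diacritic form so that both spellings map to the same term.

```typescript
// Hypothetical extended splitter (an assumption, not Lyra's regex):
// besides the ASCII set, accented Latin letters from the Latin-1
// Supplement and Latin Extended-A blocks stay inside a token.
const EXTENDED_SPLITTER =
  /[^a-z0-9_'\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u017f-]+/gi;

// Combining marks left behind by NFD decomposition (U+0300 to U+036F).
const COMBINING_MARKS = /[\u0300-\u036f]/g;

function tokenizeFoldingDiacritics(input: string): string[] {
  return input
    // Compose first so "e" + combining accent becomes a single letter.
    .normalize("NFC")
    .toLowerCase()
    .split(EXTENDED_SPLITTER)
    .filter((token) => token.length > 0)
    // Decompose each token and strip the marks: "josè" becomes "jose".
    .map((token) => token.normalize("NFD").replace(COMBINING_MARKS, ""));
}

console.log(tokenizeFoldingDiacritics("My name is Jos\u00e8, nice to meet you"));
// [ 'my', 'name', 'is', 'jose', 'nice', 'to', 'meet', 'you' ]
console.log(tokenizeFoldingDiacritics("Jose")); // [ 'jose' ]
```

Applying the same function at index time and at query time is what makes the match diacritic-insensitive. Letters without a canonical decomposition (for example ø or ł) are left unchanged by this approach and would need an explicit mapping.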

micheleriva · 2023-01-16T09:22:55Z

Currently the sentence My name is Josè, nice to meet you is NOT searchable via jose or Jose (diacritic removed).
Any other variation of the word seems to produce the expected result (josè, josė, josę, josâ, jos)

This is a bug: these searches should ALWAYS return results, and it shouldn't depend on the language.

kamilogorek · 2023-05-17T09:04:56Z

Fixed in #271

toomuchdesign changed the title ~~English: searching words via their non-diacritic form doesn't produce results~~ English: searching words via their non-diacritic form doesn't return results Jan 9, 2023

toomuchdesign mentioned this issue Jan 10, 2023

Lyra not ignoring tildes in words (Expected behavior?) #213

Closed

micheleriva added the bug Something isn't working label Jan 16, 2023

Adibla mentioned this issue Feb 1, 2023

feat: update regex tokenizer (en-du-it) #271

Merged

micheleriva closed this as completed Aug 16, 2023
