A solution might consist of extending the tokenizer regexes to match all or some diacritics, so that the tokenizer splits strings diacritic-insensitively. Would that make sense, @micheleriva? If so, shall we consider doing the same for other languages?
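One way to picture the diacritic-insensitive tokenization proposed above is the sketch below. Note that it swaps in a different technique from the one suggested: rather than extending the regex, it folds diacritics away with Unicode NFD normalization before splitting. The names `stripDiacritics` and `tokenize` and the splitter regex are hypothetical and are not Lyra's actual code.

```typescript
// Minimal sketch, not Lyra's implementation: all names here are hypothetical.

// Decompose characters (NFD) and drop combining marks (U+0300 to U+036F),
// so that "è", "ė", "ę" and "â" all fold to their base letter.
function stripDiacritics(input: string): string {
  return input.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Fold diacritics first, then split on anything that is not an ASCII letter or digit.
function tokenize(input: string): string[] {
  return stripDiacritics(input)
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0);
}

// "\u00e8" is "è"; escapes are used to avoid depending on the file encoding.
console.log(tokenize("My name is Jos\u00e8, nice to meet you"));
console.log(tokenize("jose"));
```

If the same folding is applied to both indexed text and search terms, `josè` and `jose` end up as the same token. Letters that have no canonical decomposition (for example `ø` or `ł`) are not folded by this approach and would need an explicit mapping.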
toomuchdesign changed the title from "English: searching words via their non-diacritic form doesn't produce results" to "English: searching words via their non-diacritic form doesn't return results" on Jan 9, 2023
This is a bug: these searches should ALWAYS return results, and that shouldn't depend on the language.
Describe the bug
The English language doesn't return results when a word containing a diacritic is searched via its non-diacritic form. E.g.:
Currently the sentence `My name is Josè, nice to meet you` is NOT searchable via `jose` or `Jose` (diacritic removed).
Any other variation of the word seems to produce the expected result (`josè`, `josė`, `josę`, `josâ`, `jos`).
To Reproduce
https://codesandbox.io/s/lyra-diacritics-english-language-b8tzv6
Expected behavior
Words containing diacritics should be searchable via both their diacritical and non-diacritical forms (or a mix of the two).
E.g. the sentence `My name is Josè, nice to meet you` should be searchable via `jose` and `Jose`.
Additional context
Related to issue #213
The Spanish language handles this example (`josè`) correctly, but fails when the input contains non-Spanish diacritics (`ē`, `ė`, `ę`). Feel free to play with the CodeSandbox demo.
English tokenizer
Looking at the runtime execution, the most relevant difference seems to be how words are tokenized (English). More specifically, diacritic forms (`josè`) get tokenized as `jo`, while `jose` is tokenized as `jose`.
The `term` variable here gets initialized with different values, hence the two different results.
The root cause seems to be the English regex used to generate tokens here. This repro shows how the regex handles diacritics and non-diacritics differently: `josè` becomes `jos` while `jose` remains `jose`.
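The regex behavior described above can be reproduced in isolation with a small sketch. The splitter below is a hypothetical ASCII-only stand-in, assumed for illustration only; it is not the regex Lyra actually uses, and `asciiSplitter` and `splitWords` are made-up names.

```typescript
// Hypothetical ASCII-only splitter, assumed for illustration; not Lyra's actual regex.
const asciiSplitter = /[^a-z0-9]+/;

// Lowercase the input and split on every run of non-ASCII-alphanumeric characters.
function splitWords(input: string): string[] {
  return input
    .toLowerCase()
    .split(asciiSplitter)
    .filter((token) => token.length > 0);
}

// "\u00e8" is the precomposed "è". Because it falls outside a-z,
// it acts as a separator and the word is cut short before it.
console.log(splitWords("Jos\u00e8")); // ["jos"]
console.log(splitWords("Jose")); // ["jose"]
```

With a splitter like this, any character outside the ASCII range is treated as a word boundary, which is consistent with `josè` yielding `jos` in the repro.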