Replace diacritics when doing fuzzy searches #3236

Merged
merged 2 commits into from
Jul 8, 2016

Member

@bhousel bhousel commented Jul 8, 2016

tldr: it means that strings like "fussball" will fuzzy match strings like "fußball"

more details:
In collection.js, search results are returned in the following order:

  1. match name
  2. match terms
  3. match tag value
  4. similar name
  5. similar terms

Diacritical marks are replaced only when calculating the string distance for the "similar" search results, so this is a fallback behind the strict string matches.
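The ordering above can be sketched roughly as follows. This is not the code from collection.js: the preset shape (`name`, `terms`, `tagValue`), the helpers `foldDiacritics` and `editDistance`, and the `maxDistance` threshold are all assumptions made for illustration. The point it tries to show is that folding is applied only in the two "similar" tiers, so strict matches always rank first.

```javascript
// Assumption: a hand-rolled stand-in for the diacritic replacement used by the PR.
const EXTRA_FOLDS = { 'ß': 'ss', 'đ': 'd' };

function foldDiacritics(str) {
  return str
    .normalize('NFD')                    // split letters from combining marks
    .replace(/[\u0300-\u036f]/g, '')     // drop the combining marks
    .replace(/[ßđ]/g, (ch) => EXTRA_FOLDS[ch]);
}

// Plain Levenshtein distance (single-row dynamic programming).
function editDistance(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = row[0];
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j];
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diag + cost);
      diag = above;
    }
  }
  return row[b.length];
}

// Hypothetical preset shape: { name, terms: [...], tagValue }.
// maxDistance is an illustrative threshold, not the project's value.
function search(presets, value, maxDistance = 1) {
  const query = value.toLowerCase();
  const leads = (s) => s.toLowerCase().startsWith(query);
  // Folding happens only here, on both sides of the fuzzy comparison.
  const near = (s) =>
    editDistance(foldDiacritics(query), foldDiacritics(s.toLowerCase())) <= maxDistance;

  const tiers = [
    (p) => leads(p.name),                    // 1. match name
    (p) => p.terms.some((t) => leads(t)),    // 2. match terms
    (p) => leads(p.tagValue),                // 3. match tag value
    (p) => near(p.name),                     // 4. similar name
    (p) => p.terms.some((t) => near(t)),     // 5. similar terms
  ];

  const results = new Set();   // insertion order is the rank; duplicates are skipped
  for (const tier of tiers) {
    presets.filter(tier).forEach((p) => results.add(p));
  }
  return [...results];
}

// Made-up sample data.
const presets = [
  { name: 'Fußball', terms: ['Bolzplatz'], tagValue: 'soccer' },
  { name: 'Fussballgolf', terms: [], tagValue: 'golf' },
  { name: 'Basketball', terms: [], tagValue: 'basketball' },
];

console.log(search(presets, 'fussball').map((p) => p.name));
// → [ 'Fussballgolf', 'Fußball' ]
```

In this sample the strict leading-name match ("Fussballgolf") still ranks ahead of "Fußball", even though the folded strings are identical to the query.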

(closes #3159)

I also

  • Expanded the tests to make sure results are returned in the right order
  • Added a few tests for the diacritical mark replacement on fuzzy matches

cc @1ec5

Collaborator

1ec5 commented Jul 8, 2016

I tried this branch out with a few preset search terms that have been inconvenient in the past in Vietnamese. The change produces a slight improvement over master. I haven’t found any real problems yet.

As a further test, I took this branch and removed the manually diacritic-folded Vietnamese terms using the regular expression (, [-a-z ,]+)(?=") on the presets object in vi.json. (I had to perform the find and replace manually because a handful of terms, like “TV”, “boong ke”, and “ga ra”, are supposed to lack diacritics.) It’s a wash, with significant improvements in cases where the search term mostly matches the diacritic-folded preferred term, but regressions in cases where the search term only partly matches the diacritic-folded preferred term or exactly matches a diacritic-folded secondary term. It may be possible to get better results by fiddling with the orders of terms in individual presets. I’ve pushed my changes to the diacritics-vi-folding-removed branch in my fork, but I understand that I’d have to apply the changes in Transifex for them to stick.
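For illustration, the find-and-replace described above could look roughly like this in JavaScript. Only the regular expression comes from the comment; the function name `stripFoldedTerms` and the sample term lists are made up, and real entries in vi.json may be laid out differently. The second sample shows why a blind replace is unsafe: a term that is legitimately written without diacritics gets stripped too.

```javascript
// The pattern from the comment: a run of comma-separated, ASCII-only
// lowercase terms sitting directly before the closing quote.
const FOLDED_TAIL = /(, [-a-z ,]+)(?=")/g;

// Hypothetical helper name, not part of the project.
function stripFoldedTerms(jsonText) {
  return jsonText.replace(FOLDED_TAIL, '');
}

// Made-up sample lines in the style of a terms entry.
const folded = '"terms": "sân chơi, khu vui chơi, san choi, khu vui choi"';
console.log(stripFoldedTerms(folded));
// → "terms": "sân chơi, khu vui chơi"

// Caveat: "ga ra" is meant to lack diacritics, yet it matches the pattern.
const native = '"terms": "nhà để xe, ga ra"';
console.log(stripFoldedTerms(native));
// → "terms": "nhà để xe"
```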

| preferred | master | this branch | my branch (sans manually folded terms) |
|---|---|---|---|
| ⬆️ sân chơi; ⬇️ công viên; ⬆️ sân cỏ; ⬆️ sân bóng đá | [image: san choi, master] | [image: san choi, diacritics] | [image: san choi, simplified] |
| ⬆️ trường học; ⬆️ nhà trường; ⬇️ sân bay; ⬇️ ga sân bay; ⬆️ tường | [image: truong, master] | [image: truong, diacritics] | [image: truong, simplified] |
| ⬇️ hồ bơi; ⬇️ trung tâm bơi lội; ⬇️ tiệm dụng cụ bơi lội | [image: boi, master] | [image: boi, diacritics] | [image: boi, simplified] |
| ⬇️ trung tâm bơi lội; ⬇️ tiệm dụng cụ bơi lội | [image: boi loi, master] | [image: boi loi, diacritics] | [image: boi loi, simplified] |
| ⬆️ nhà ở; ⬇️ tòa nhà dân cư; ⬇️ đất dân cư; ⬆️ nhà thờ; ⬆️ nha sĩ | [image: nha o, master] | [image: nha o, diacritics] | [image: nha o, simplified] |

Member Author

bhousel commented Jul 8, 2016

@1ec5 Thanks for digging into this!

> It’s a wash, with significant improvements in cases where the search term mostly matches the diacritic-folded preferred term, but regressions in cases where the search term only partly matches the diacritic-folded preferred term or exactly matches a diacritic-folded secondary term. It may be possible to get better results by fiddling with the orders of terms in individual presets.

Yes, this change definitely won't solve all problems, but it should be possible to work around the more common issues by using the "preset terms".

The search() function in collection.js contains the code that prioritizes how matches happen.

Exact matches on the leading part of the name and the leading part of the terms are things you can control by adjusting the strings in Transifex. Exact matching on the leading part of the tag value might cause more problems in other languages. Maybe we should disable that unless the locale is en? In your examples above, I don't think it's affecting the results.

I think the diacritic replacement has a more pronounced effect on 'ß' -> 'ss', because it normalizes the Levenshtein distance between the search strings. For example, before this change "grass" looks 3 characters different from "glaß", and after this change they differ by only 1 character.
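The arithmetic in that example can be checked with a small script. The `levenshtein` function below is a textbook implementation written for this note, not the one the project uses, and `foldEszett` is a hypothetical helper that performs only the single 'ß' -> 'ss' replacement under discussion.

```javascript
// Textbook Levenshtein distance using a full (a.length + 1) x (b.length + 1) table.
function levenshtein(a, b) {
  const d = Array.from({ length: a.length + 1 }, (_, i) =>
    [i, ...new Array(b.length).fill(0)]);
  for (let j = 0; j <= b.length; j++) d[0][j] = j;

  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const substitution = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,                  // deletion
        d[i][j - 1] + 1,                  // insertion
        d[i - 1][j - 1] + substitution    // match or substitution
      );
    }
  }
  return d[a.length][b.length];
}

// Hypothetical helper: only the one replacement discussed here.
const foldEszett = (s) => s.replace(/ß/g, 'ss');

console.log(levenshtein('grass', 'glaß'));              // → 3 (r→l, s→ß, drop one s)
console.log(levenshtein('grass', foldEszett('glaß')));  // → 1 (r→l only)
```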

@bhousel bhousel merged commit c42bd2a into master Jul 8, 2016
@bhousel bhousel deleted the diacritics branch July 8, 2016 22:10
Collaborator

1ec5 commented Jul 9, 2016

> it should be possible to work around the more common issues by using the "preset terms".

To clarify, the Vietnamese localization is already (ab)using the preset terms to include the main preset name, any synonyms, the main name diacritic-folded, and the synonyms diacritic-folded, in that order. I believe that's why this change has little effect.

Removing the diacritic-folded terms results in some results getting a lot better and some getting a lot worse, which in my opinion shows that the language-agnostic diacritic folding may be weighted too high for Vietnamese. (It would ideally count less toward the edit distance than base letter changes, whereas for other languages it should count more or the same.) To the extent that the workaround works, it's because we've specified a lot of synonyms in the Vietnamese presets.
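One way to picture the weighting idea in the parenthetical above is an edit distance in which a diacritic-only difference costs less than a base letter change. Nothing like this is part of the merged change; the function names, the hand-written 'đ' mapping, and the default cost of 0.5 are all assumptions chosen for illustration.

```javascript
// Base letter of a single character: decompose (NFD) and drop combining marks.
// 'đ' has no canonical decomposition, so it is mapped by hand.
function baseLetter(ch) {
  if (ch === 'đ') return 'd';
  return ch.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// 0 for identical characters, diacriticCost when only the marks differ, 1 otherwise.
function substitutionCost(x, y, diacriticCost) {
  if (x === y) return 0;
  return baseLetter(x) === baseLetter(y) ? diacriticCost : 1;
}

// Levenshtein distance with a configurable cost for diacritic-only substitutions.
// diacriticCost = 1 behaves like no folding at all; 0 behaves like full folding.
function weightedDistance(a, b, diacriticCost = 0.5) {
  const s = Array.from(a.normalize('NFC'));   // one entry per precomposed letter
  const t = Array.from(b.normalize('NFC'));
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);

  for (let i = 1; i <= s.length; i++) {
    const curr = [i];
    for (let j = 1; j <= t.length; j++) {
      curr[j] = Math.min(
        prev[j] + 1,                                                   // deletion
        curr[j - 1] + 1,                                               // insertion
        prev[j - 1] + substitutionCost(s[i - 1], t[j - 1], diacriticCost)
      );
    }
    prev = curr;
  }
  return prev[t.length];
}

console.log(weightedDistance('truong', 'trường'));     // → 1 (two half-cost changes)
console.log(weightedDistance('truong', 'tường'));      // → 2 (one deletion plus two half-cost changes)
console.log(weightedDistance('truong', 'trường', 0));  // → 0 (equivalent to full folding)
```

A per-locale `diacriticCost` would be one way to let diacritic changes count less for Vietnamese and the same or more for other languages, though choosing good values would need real search data.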

So I'll probably keep the workaround in the Vietnamese localization (despite the bloat) and hold out for a more sophisticated solution in the future.

Development

Successfully merging this pull request may close these issues.

equal ss to ß in the searchfield
2 participants