Replace diacritics when doing fuzzy searches #3236

bhousel · 2016-07-08T04:04:51Z

tldr: it means that strings like "fussball" will fuzzy match strings like "fußball"

more details:
In collection.js search results are returned in the following order:

1. match name
2. match terms
3. match tag value
4. similar name
5. similar terms

Diacritical marks are only replaced when calculating the "similar" search results string distance, so this is a fallback strategy from strict string matches.
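The ordering above can be sketched roughly as follows. This is a minimal illustration, not the code in collection.js: the preset shape (`name`, `terms`, `tags`), the helper names `foldDiacritics`, `editDistance`, and `searchPresets`, and the distance threshold of 2 are all assumptions made for the example.

```javascript
// Hypothetical sketch of tiered preset search; not iD's actual implementation.

// Fold diacritics: map 'ß' to 'ss', then strip combining marks after NFD.
function foldDiacritics(s) {
    return s
        .replace(/ß/g, 'ss')
        .normalize('NFD')
        .replace(/[\u0300-\u036f]/g, '');
}

// Plain Levenshtein distance using a single rolling row.
function editDistance(a, b) {
    const row = Array.from({ length: b.length + 1 }, (_, j) => j);
    for (let i = 1; i <= a.length; i++) {
        let diag = row[0];
        row[0] = i;
        for (let j = 1; j <= b.length; j++) {
            const above = row[j];
            row[j] = Math.min(
                above + 1,                               // deletion
                row[j - 1] + 1,                          // insertion
                diag + (a[i - 1] === b[j - 1] ? 0 : 1)   // substitution
            );
            diag = above;
        }
    }
    return row[b.length];
}

// Return presets in tier order: strict leading matches first, fuzzy ones last.
function searchPresets(presets, query, maxDistance = 2) {
    const q = query.toLowerCase();
    const leading = (s) => s.toLowerCase().startsWith(q);
    // Diacritics are folded only here, in the fuzzy fallback (assumed threshold).
    const similar = (s) =>
        editDistance(foldDiacritics(s.toLowerCase()), foldDiacritics(q)) <= maxDistance;

    const tiers = [
        (p) => leading(p.name),                        // match name
        (p) => p.terms.some(leading),                  // match terms
        (p) => Object.values(p.tags).some(leading),    // match tag value
        (p) => similar(p.name),                        // similar name
        (p) => p.terms.some(similar),                  // similar terms
    ];

    const seen = new Set();
    const results = [];
    for (const tier of tiers) {
        for (const p of presets) {
            if (!seen.has(p) && tier(p)) {
                seen.add(p);
                results.push(p);
            }
        }
    }
    return results;
}
```

With a preset named "Fußball", a query of "fussball" fails every strict tier in this sketch and is only picked up by the "similar name" tier, which is the fallback behaviour described above.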

(closes #3159)

I also:

- Expanded the tests to make sure results are returned in the right order
- Added a few tests to test the diacritical mark replacement on fuzzy matches

cc @1ec5


1ec5 · 2016-07-08T16:32:52Z

I tried this branch out with a few preset search terms that have been inconvenient in the past in Vietnamese. The change produces a slight improvement over master. I haven’t found any real problems yet.

As a further test, I took this branch and removed the manually diacritic-folded Vietnamese terms using the regular expression (, [-a-z ,]+)(?=") on the presets object in vi.json. (I had to perform the find and replace manually because a handful of terms, like “TV”, “boong ke”, and “ga ra”, are supposed to lack diacritics.) It’s a wash, with significant improvements in cases where the search term mostly matches the diacritic-folded preferred term, but regressions in cases where the search term only partly matches the diacritic-folded preferred term or exactly matches a diacritic-folded secondary term. It may be possible to get better results by fiddling with the orders of terms in individual presets. I’ve pushed my changes to the diacritics-vi-folding-removed branch in my fork, but I understand that I’d have to apply the changes in Transifex for them to stick.
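For illustration, here is roughly what that regular expression does to a single line. The sample strings and the `"terms": "..."` line shape are assumptions made for the example, not copied from vi.json, and `stripFoldedTerms` is a hypothetical helper name.

```javascript
// Hypothetical sketch: strip a trailing run of ASCII-only terms before a closing quote.
const FOLDED_TAIL = /(, [-a-z ,]+)(?=")/g;

function stripFoldedTerms(jsonText) {
    return jsonText.replace(FOLDED_TAIL, '');
}

// Made-up sample: accented terms followed by their manually folded copies.
const sample = '"terms": "sân chơi, công viên, san choi, cong vien"';
console.log(stripFoldedTerms(sample));
// prints: "terms": "sân chơi, công viên"

// Made-up sample showing why a blind replace is unsafe: "ga ra" is meant
// to lack diacritics, but it is removed all the same.
console.log(stripFoldedTerms('"terms": "nhà xe, ga ra"'));
// prints: "terms": "nhà xe"
```

The pattern cannot tell a folded duplicate from a term that is legitimately unaccented, which is why each replacement has to be reviewed by hand.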

preferred	master	this branch	my branch (sans manually folded terms)
⬆️ sân chơi ⬇️ công viên ⬆️ sân cỏ ⬆️ sân bóng đá
⬆️ trường học ⬆️ nhà trường ⬇️ sân bay ⬇️ ga sân bay ⬆️ tường
⬇️ hồ bơi ⬇️ trung tâm bơi lội ⬇️ tiệm dụng cụ bơi lội
⬇️ trung tâm bơi lội ⬇️ tiệm dụng cụ bơi lội
⬆️ nhà ở ⬇️ tòa nhà dân cư ⬇️ đất dân cư ⬆️ nhà thờ ⬆️ nha sĩ

bhousel · 2016-07-08T17:49:12Z

@1ec5 Thanks for digging into this!

> It’s a wash, with significant improvements in cases where the search term mostly matches the diacritic-folded preferred term, but regressions in cases where the search term only partly matches the diacritic-folded preferred term or exactly matches a diacritic-folded secondary term. It may be possible to get better results by fiddling with the orders of terms in individual presets.

Yes, this change definitely won't solve all problems, but it should be possible to work around the more common issues by using the "preset terms".

collection.js search() contains the code which prioritizes how matches happen.

Exact matches on the leading part of the name and the leading part of the terms are things you can control by adjusting the strings in Transifex. Exact matching on the leading tag value might cause more problems in other languages. Maybe we should disable that unless the locale is en? In your examples above, I don't think it's affecting the results.

I think the diacritic replacement has a more pronounced effect on 'ß' -> 'ss', because it normalizes the Levenshtein distance between the search strings. For example, before this change, "grass" is 3 edits away from "glaß"; after this change, they differ by only 1 character.
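The arithmetic can be checked with a small sketch. The `levenshtein` and `foldEszett` helpers below are written for this example and are not the functions used in iD.

```javascript
// Hypothetical sketch: textbook Levenshtein distance with a full matrix.
function levenshtein(a, b) {
    const d = [];
    for (let i = 0; i <= a.length; i++) d.push([i]);
    for (let j = 1; j <= b.length; j++) d[0][j] = j;

    for (let i = 1; i <= a.length; i++) {
        for (let j = 1; j <= b.length; j++) {
            const cost = a[i - 1] === b[j - 1] ? 0 : 1;
            d[i][j] = Math.min(
                d[i - 1][j] + 1,        // deletion
                d[i][j - 1] + 1,        // insertion
                d[i - 1][j - 1] + cost  // substitution
            );
        }
    }
    return d[a.length][b.length];
}

// Only the 'ß' -> 'ss' replacement is modelled here.
function foldEszett(s) {
    return s.replace(/ß/g, 'ss');
}

console.log(levenshtein('grass', 'glaß'));             // 3: r->l, s->ß, delete s
console.log(levenshtein('grass', foldEszett('glaß'))); // 1: r->l
```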

1ec5 · 2016-07-09T16:48:11Z

> it should be possible to work around the more common issues by using the "preset terms".

To clarify, the Vietnamese localization is already (ab)using the preset terms to include the main preset name, any synonyms, the main name diacritic-folded, and the synonyms diacritic-folded, in that order. I believe that's why this change has little effect.

Removing the diacritic-folded terms results in some results getting a lot better and some getting a lot worse, which in my opinion shows that the language-agnostic diacritic folding may be weighted too high for Vietnamese. (It would ideally count less toward the edit distance than base letter changes, whereas for other languages it should count more or the same.) To the extent that the workaround works, it's because we've specified a lot of synonyms in the Vietnamese presets.

So I'll probably keep the workaround in the Vietnamese localization (despite the bloat) and hold out for a more sophisticated solution in the future.
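One way to picture the weighting idea raised above is an edit distance in which a diacritic-only substitution costs less than a base-letter change. The sketch below is purely illustrative and was not part of this pull request; `baseLetter`, `weightedDistance`, and the default cost of 0.5 are assumptions, and inputs are assumed to be lowercase.

```javascript
// Hypothetical sketch: Levenshtein variant with a cheaper diacritic-only substitution.

// Reduce one character to its base letter. 'đ' has no canonical decomposition,
// so it is mapped by hand.
function baseLetter(ch) {
    if (ch === 'đ') return 'd';
    return ch.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

function weightedDistance(a, b, diacriticCost = 0.5) {
    // NFC keeps each accented letter as a single array element.
    const s = Array.from(a.normalize('NFC'));
    const t = Array.from(b.normalize('NFC'));

    const d = [];
    for (let i = 0; i <= s.length; i++) d.push([i]);
    for (let j = 1; j <= t.length; j++) d[0][j] = j;

    for (let i = 1; i <= s.length; i++) {
        for (let j = 1; j <= t.length; j++) {
            let sub = 1;                              // different base letters
            if (s[i - 1] === t[j - 1]) {
                sub = 0;                              // identical characters
            } else if (baseLetter(s[i - 1]) === baseLetter(t[j - 1])) {
                sub = diacriticCost;                  // same letter, different marks
            }
            d[i][j] = Math.min(
                d[i - 1][j] + 1,
                d[i][j - 1] + 1,
                d[i - 1][j - 1] + sub
            );
        }
    }
    return d[s.length][t.length];
}

console.log(weightedDistance('san choi', 'sân chơi')); // 1 (two diacritic-only changes)
console.log(weightedDistance('nha', 'nho'));           // 1 (one base-letter change)
```

A per-locale `diacriticCost` would let a language like Vietnamese weight tone and vowel marks lower, while other locales could keep the cost at 1 or fold diacritics entirely.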

bhousel added 2 commits July 7, 2016 23:54

Replace diacritics when doing fuzzy searches

0b3df36

(closes #3159)

Add tests for diacritic mark replacement

c2629a3

bhousel merged commit c42bd2a into master Jul 8, 2016

bhousel deleted the diacritics branch July 8, 2016 22:10

bhousel mentioned this pull request Jul 13, 2016

Switch to async storage (i.e. indexdb instead of localstorage) #3239

Open

homersimpsons mentioned this pull request Aug 31, 2016

search by address almost unusable in the french countryside osmandapp/OsmAnd#3029

Closed

bhousel mentioned this pull request Apr 22, 2017

tweak autocomplete for search features (for languages with accents...) #3979

Closed

bagage mentioned this pull request May 1, 2017

Preset matching failure on synonyms #4002

Closed
