Apostrophe U+02BC should be considered the same character as U+0027 and U+2019 #2569
Comments
Nominatim removes all three characters, just in different ways. As U+0027 and U+2019 are considered punctuation, they are replaced with a space. U+02BC is considered a letter and is thus simply removed. That's all correct and intentional behaviour: one can be a word boundary, the other shouldn't be one. That said, correct behaviour doesn't get us anywhere when users (and mappers) can't really tell the difference. So the way forward here is to treat U+02BC as punctuation like the others. I've had a quick look at the OSM data: the character isn't in wide use, so the impact is minimal. Note, however, that such a change requires a reimport to really have an effect. I'm sorry to say that the next reimport on nominatim.openstreetmap.org is quite a bit away.
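To make the behaviour described above concrete, here is a minimal Python sketch. It is not Nominatim's code: `normalize_sketch` is a hypothetical helper that only mimics the two rules mentioned in the comment (punctuation becomes a space, a modifier letter is dropped), and it decides by Unicode general category, which is an assumption made for illustration.

```python
import unicodedata


def normalize_sketch(text: str) -> str:
    """Toy normalization: punctuation becomes a space, modifier letters vanish."""
    pieces = []
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("P"):
            pieces.append(" ")   # punctuation acts as a word boundary
        elif category == "Lm":
            continue             # modifier letter: dropped, no word boundary
        else:
            pieces.append(ch.lower())
    # Collapse the whitespace runs created by the replacements.
    return " ".join("".join(pieces).split())


# The same street name written with each of the three apostrophes.
variants = {
    "U+0027": "Слов\u0027янська",
    "U+2019": "Слов\u2019янська",
    "U+02BC": "Слов\u02BCянська",
}
results = {name: normalize_sketch(word) for name, word in variants.items()}

for name, normalized in results.items():
    print(name, repr(normalized))
```

Under these assumptions the first two spellings normalize to two words and the third to a single word, which is why they do not match each other.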
Although technically a letter, the modifier letter apostrophe (U+02BC) is often replaced in writing with a normal apostrophe, which is a punctuation mark. This change makes sure that the modifier letter apostrophe yields the same normalization results and thus is really interchangeable. Only has an effect after the next reimport. Fixes osm-search#2569.
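The effect of this change can be illustrated with a short Python sketch. The names `APOSTROPHES`, `TO_SPACE` and `normalize_apostrophes` are hypothetical and exist only for this example; the actual commit changes Nominatim's own normalization rules, which are not reproduced here.

```python
# Hypothetical set of characters treated as apostrophe-like punctuation.
APOSTROPHES = "\u0027\u2019\u02BC"

# Translation table mapping each of them to a space (a word boundary).
TO_SPACE = str.maketrans({ch: " " for ch in APOSTROPHES})


def normalize_apostrophes(text: str) -> str:
    """Replace all three apostrophes with a space, lowercase, collapse spaces."""
    return " ".join(text.translate(TO_SPACE).lower().split())


# One spelling of the street name per apostrophe variant.
forms = [f"Слов{ch}янська" for ch in APOSTROPHES]
normalized = {normalize_apostrophes(form) for form in forms}

print(normalized)
```

Because all three characters go through the same mapping, the set of normalized forms collapses to a single entry, which is what "interchangeable" means here.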
Thanks, @lonvia! I now have a better understanding of how Nominatim works. I think your commit will solve the problem. In the future, I'll watch whether Ukrainian words with an apostrophe ever cause problems by being split into two words. If the change did not make it into this reimport, when can we expect the next one?
We just had a reimport. I cannot say when the next one will happen. It will be at least a couple of months.
Do you already know when you will reimport, @lonvia? |
@lonvia, has the reimport already taken place? If not, is it known when it is planned? I have one more question. You changed the normalization routine by adding the modifier letter apostrophe U+02BC to the punctuation marks. I understand that OSM features will be indexed according to the new rules only during a new reimport. However, shouldn't the new normalization rules already apply to search queries? If so, a search query containing U+02BC should return results containing the U+0027 and U+2019 apostrophes. However, as of today, nothing has changed: my query with U+02BC still returns only the 4 results that contain U+02BC, and none with U+0027 / U+2019: https://nominatim.openstreetmap.org/ui/search.html?q=Словʼянська+вулиця
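The interaction between index-time and query-time normalization raised in this question can be modelled with a toy Python sketch. Everything here is an assumption for illustration: `old_rules`, `new_rules` and `build_index` are hypothetical helpers, the index is a plain dictionary, and the thread does not say which rules the public server applies to queries.

```python
from collections import defaultdict

APOSTROPHE_PUNCT = "\u0027\u2019"   # treated as punctuation before the change
MODIFIER_APOSTROPHE = "\u02BC"      # treated as a letter before the change


def old_rules(text: str) -> str:
    """Before the change: punctuation -> space, U+02BC removed."""
    for ch in APOSTROPHE_PUNCT:
        text = text.replace(ch, " ")
    text = text.replace(MODIFIER_APOSTROPHE, "")
    return " ".join(text.lower().split())


def new_rules(text: str) -> str:
    """After the change: all three characters -> space."""
    for ch in APOSTROPHE_PUNCT + MODIFIER_APOSTROPHE:
        text = text.replace(ch, " ")
    return " ".join(text.lower().split())


def build_index(names, normalize):
    """Toy index: normalized name -> list of original names."""
    index = defaultdict(list)
    for name in names:
        index[normalize(name)].append(name)
    return index


names = [
    "Слов\u0027янська вулиця",
    "Слов\u2019янська вулиця",
    "Слов\u02BCянська вулиця",
]
query = "Слов\u02BCянська вулиця"

stale_index = build_index(names, old_rules)   # data imported before the change
fresh_index = build_index(names, new_rules)   # data after a reimport

old_query_stale = stale_index.get(old_rules(query), [])
new_query_stale = stale_index.get(new_rules(query), [])
new_query_fresh = fresh_index.get(new_rules(query), [])

print(len(old_query_stale), len(new_query_stale), len(new_query_fresh))
```

In this model, old rules on both sides find only the U+02BC name, new query rules against a stale index find the two punctuation spellings but miss the U+02BC one, and only a reimported index returns all three.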
[ A similar problem with "е" and "ё" has been fixed in #886 ]
There is a letter ʼ (apostrophe) in Ukrainian. As it happens, three different Unicode characters are used to write it: U+0027 (APOSTROPHE), U+2019 (RIGHT SINGLE QUOTATION MARK) and U+02BC (MODIFIER LETTER APOSTROPHE).
U+0027 and U+2019 are still used in most cases, while U+02BC is used much less frequently. However, U+02BC is the more correct character for the Ukrainian apostrophe letter: it belongs to the "Modifier Letter" category of Unicode, while the first two are punctuation marks. Moreover, Cyrillic domain names in Ukrainian domains permit only U+02BC (for the same reason): https://www.iana.org/domains/idn-tables/tables/ua_cyrl_1.2.txt.
So all three characters should be treated as equivalent and considered the same character in Nominatim search. Currently U+0027 and U+2019 are considered the same character, while U+02BC is not.
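The category difference described above can be checked with Python's standard `unicodedata` module. This is only a lookup of the Unicode character properties, not anything specific to Nominatim; the `info` dictionary is a name chosen for this example.

```python
import unicodedata

# Code points of the three characters discussed in this issue.
CODE_POINTS = (0x0027, 0x2019, 0x02BC)

# Map "U+XXXX" -> (general category, official character name).
info = {
    f"U+{cp:04X}": (unicodedata.category(chr(cp)), unicodedata.name(chr(cp)))
    for cp in CODE_POINTS
}

for label, (category, name) in info.items():
    print(label, category, name)
# U+0027 Po APOSTROPHE
# U+2019 Pf RIGHT SINGLE QUOTATION MARK
# U+02BC Lm MODIFIER LETTER APOSTROPHE
```

Categories starting with `P` are punctuation, while `Lm` is "Letter, modifier", which is why a category-based normalization treats U+02BC differently from the other two.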
Note: the problem may not be limited to the Ukrainian language. The apostrophe U+02BC is used in:
What did you search for?
What result did you get?
What result did you expect?
All 3 search patterns should yield the same list of results, including >100 results with U+0027/U+2019 plus 4 results with U+02BC.