Apostrophe U+02BC should be considered the same character as U+0027 and U+2019 #2569

Closed
velikodsky opened this issue Jan 9, 2022 · 5 comments · Fixed by #2571

Comments

@velikodsky

[ A similar problem with "е" and "ё" has been fixed in #886 ]

There is a letter ʼ (apostrophe) in Ukrainian. In practice, three different Unicode characters are used to write it:

  1. U+0027 "apostrophe"
  2. U+2019 "right single quotation mark"
  3. U+02BC "modifier letter apostrophe"

U+0027 and U+2019 are still used in most cases, while U+02BC is used much less frequently. However, U+02BC is the more correct character for the Ukrainian apostrophe letter: it belongs to the "Modifier Letter" category of Unicode, while the first two are punctuation marks. Moreover, for the same reason, Cyrillic domain names in Ukrainian domains permit only U+02BC: https://www.iana.org/domains/idn-tables/tables/ua_cyrl_1.2.txt.
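The category difference described above can be checked with Python's standard `unicodedata` module. This is only a small illustration; the helper name `describe` is made up for this example.

```python
import unicodedata

# The three characters used for the Ukrainian apostrophe.
APOSTROPHES = ["\u0027", "\u2019", "\u02bc"]

def describe(ch):
    """Return (Unicode name, general category) for a single character."""
    return unicodedata.name(ch), unicodedata.category(ch)

for ch in APOSTROPHES:
    name, category = describe(ch)
    # Categories starting with "P" are punctuation, "L" are letters.
    print(f"U+{ord(ch):04X} {name}: {category}")
```

The first two characters report a punctuation category (`Po` and `Pf`), while U+02BC reports `Lm` (modifier letter).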

So all three characters should be treated equally and considered the same character in Nominatim search. Currently U+0027 and U+2019 are considered the same character, while U+02BC is not.

Note: the problem may not be limited to Ukrainian. The apostrophe U+02BC is also used in:

  1. Nenets languages (Cyrillic) and other small languages;
  2. English: U+02BC was the preferred apostrophe character in Unicode prior to version 3.0 (September 1999). It can still occasionally be found on the internet as part of English words.

What did you search for?

  1. With U+0027: https://nominatim.openstreetmap.org/ui/search.html?q=Слов%27янська+вулиця
  2. With U+2019: https://nominatim.openstreetmap.org/ui/search.html?q=Слов’янська+вулиця
  3. With U+02BC: https://nominatim.openstreetmap.org/ui/search.html?q=Словʼянська+вулиця
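For anyone who wants to reproduce the three searches, here is a minimal sketch that builds the same URLs with each apostrophe variant percent-encoded. It only constructs the URLs and does not send any request; the helper name `search_url` is an assumption made for this example.

```python
from urllib.parse import urlencode

# Search page taken from the links above.
BASE = "https://nominatim.openstreetmap.org/ui/search.html"

APOSTROPHES = {
    "U+0027": "\u0027",
    "U+2019": "\u2019",
    "U+02BC": "\u02bc",
}

def search_url(apostrophe):
    """Build the search URL for the street name using the given apostrophe."""
    query = f"Слов{apostrophe}янська вулиця"
    # urlencode percent-encodes the UTF-8 bytes and turns spaces into '+'.
    return f"{BASE}?{urlencode({'q': query})}"

urls = {label: search_url(ch) for label, ch in APOSTROPHES.items()}
for label, url in urls.items():
    print(label, url)
```

The three URLs differ only in the encoded apostrophe, which makes it easy to see that the queries are distinct byte sequences even though they look identical on screen.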

What result did you get?

  1. With U+0027: >100 results
  2. With U+2019: >100 results (the same as with U+0027, but sorted differently)
  3. With U+02BC: 4 results (these results are not included in U+0027/U+2019 results)

What result did you expect?

All 3 search patterns should yield the same list of results, including >100 results with U+0027/U+2019 plus 4 results with U+02BC.

@lonvia
Member

lonvia commented Jan 10, 2022

Nominatim removes all three characters, just in different ways. As U+0027 and U+2019 are considered punctuation, they are replaced with a space. U+02BC is considered a letter and thus simply removed. That's all correct and intentional behaviour: one can be a word boundary, the other shouldn't be one.

That said, correct behaviour doesn't get us anywhere when users (and mappers) can't really understand the difference. So the way forward here is to treat U+02BC as punctuation like the others. I've had a quick look at the OSM data: the character isn't in wide use, so the impact should be minimal.
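To illustrate the two behaviours and the proposed change, here is a toy model in Python. It is not Nominatim's actual normalization code; the function `toy_normalize` and its `treat_as_punctuation` parameter are invented for this sketch and only mimic the rules described above (punctuation becomes a space, U+02BC is dropped unless it is treated as punctuation).

```python
import unicodedata

MODIFIER_APOSTROPHE = "\u02bc"

def toy_normalize(text, treat_as_punctuation=""):
    """Toy normalization, assumed for illustration only.

    - punctuation (Unicode category P*) is replaced by a space
    - U+02BC is removed, unless listed in treat_as_punctuation
    - everything else is lowercased
    """
    out = []
    for ch in text:
        if ch in treat_as_punctuation or unicodedata.category(ch).startswith("P"):
            out.append(" ")   # acts as a word boundary
        elif ch == MODIFIER_APOSTROPHE:
            continue          # dropped, so the word stays in one piece
        else:
            out.append(ch.lower())
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(out).split())

# Behaviour as described: different results for the different apostrophes.
print(toy_normalize("Слов\u0027янська"))   # two tokens
print(toy_normalize("Слов\u02bcянська"))   # one token

# Proposed change: treat U+02BC as punctuation as well.
print(toy_normalize("Слов\u02bcянська", treat_as_punctuation=MODIFIER_APOSTROPHE))
```

In this model, the first and third calls yield the same two-token string, while the second yields a single token, which is why the U+02BC query matched a different set of results.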

Note, however, that such a change requires a reimport to really have an effect. I'm sorry to say that the next reimport on nominatim.openstreetmap.org is quite a bit away.

lonvia added a commit to lonvia/Nominatim that referenced this issue Jan 10, 2022
While technically being a letter, the apostrophe is often replaced
with a normal apostrophe in writing which is a punctuation mark.
This makes sure that the modifier letter apostrophe yields the same
normalization results and thus is really interchangeable.

Only has an effect after the next reimport.

Fixes osm-search#2569.
@velikodsky
Author

Thanks, @lonvia! I now have a better understanding of how Nominatim works. I think your commit will solve the problem.

In the future, I'll check whether Ukrainian words with an apostrophe ever cause problems by being split into two words.

If this change is too late for the current reimport, when can we expect the next one?

@lonvia
Member

lonvia commented Jan 11, 2022

We just had a reimport. I cannot say when the next one will happen. It will be at least a couple of months.

@velikodsky
Author

velikodsky commented Mar 15, 2022

> We just had a reimport. I cannot say when the next one will happen. It will be at least a couple of months.

Do you already know when you will reimport, @lonvia?

@velikodsky
Author

@lonvia, has the reimport already taken place? If not, is it known when it is planned?

I have one more question. You changed the normalization routine by adding the "modifier letter apostrophe" U+02BC to the punctuation marks. I understand that OSM features will be indexed according to the new rules only during a new reimport. However, shouldn't the new normalization rules already apply to search queries? If so, a search query containing U+02BC should return results containing the U+0027 and U+2019 apostrophes. However, as of today, nothing has changed: my query with U+02BC still returns only the 4 results that contain U+02BC, and none with U+0027 or U+2019:

https://nominatim.openstreetmap.org/ui/search.html?q=Словʼянська+вулиця
