Wikidata translations lead to troublesome labels #1547

Open
jdhoek opened this issue May 27, 2023 · 4 comments

Comments

@jdhoek
Contributor

jdhoek commented May 27, 2023

I noticed in openstreetmap/openstreetmap-website#4042 that OpenMapTiles is being considered for inclusion on the main openstreetmap.org website. Great! It desperately needs a new high quality general purpose layer.

One jarring thing I've noticed browsing https://osm.openmaptiles.org/#map=17/53.21006/5.77774&layers=V is that when a name in the user's preferred language is missing from an OpenStreetMap entity, it gets pulled in through the entity's Wikidata link if it has one. My browser is set to prefer English, so browsing the map of my home town, where Frisian and Dutch are used for most names, I've noticed some worrying discrepancies.

For example the railway station which should just be called Leeuwarden is now labelled as Leeuwarden railway station, and indeed, that is what Wikidata lists as its English name. This is wrong of course; the station's name does not include the descriptive 'railway station' suffix, and the OpenStreetMap entity omits this in all tagged languages. The correct behaviour for such local features is to use name in the absence of name:en, which is what anyone writing in English would do if the scripts are the same and no transliteration is needed.
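The fallback described above can be sketched in a few lines of Python. This is only an illustration of the proposed behaviour, not how OpenMapTiles implements label selection: the function names `is_latin` and `english_label` are made up for this example, and the script check is a rough heuristic based on Unicode character names.

```python
import unicodedata
from typing import Dict, Optional


def is_latin(text: str) -> bool:
    """Return True if text has letters and all of them are Latin-script letters."""
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all(
        unicodedata.name(ch, "").startswith("LATIN") for ch in letters
    )


def english_label(tags: Dict[str, str]) -> Optional[str]:
    """Pick an English label using OSM tags only (hypothetical helper)."""
    # An explicit English name always wins.
    if "name:en" in tags:
        return tags["name:en"]
    # Otherwise reuse the local name when no transliteration is needed.
    name = tags.get("name")
    if name is not None and is_latin(name):
        return name
    # Different script and no name:en: leave the decision to the caller.
    return None


station = {"name": "Leeuwarden", "railway": "station"}
print(english_label(station))
```

With these tags the station is labelled with its plain `name`, and a feature whose `name` is written in a non-Latin script yields `None` rather than an untransliterated label.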

This is just one example of a kind of error I am seeing a lot just scanning the map, and it makes me worry about a fundamental choice being made here, one with a potentially huge impact.

One aspect is that OpenStreetMap mappers make a significant effort to correctly label things on the map, especially names. For points-of-interest with international appeal this invariably means a large list of translations. For local entities however, translated names often don't exist, and that's fine. The problem with drawing missing translations from Wikidata is that it is a different project, with different rules (and policies) regarding naming things, different user accounts, and different priorities. So as a mapper who cares about correct names, I am now faced with a dilemma.

I could go and edit Wikidata and remove such non-translations like Leeuwarden railway station or Leeuwarden Northern General Cemetery (the latter is literally made up based on the local Dutch name by someone with no knowledge of Dutch or the local names). This will likely get me in conflict with people who desire these translations for other projects. This doesn't feel right, and I've noticed in discussions about the use of Wikidata in the OpenStreetMap community that I am not alone in this. I.e., linking to Wikidata via the wikidata key is desirable, but pulling in metadata perhaps not so much.

I see that it is nice to be able to pull in translations where these are missing, but I wonder if doing this at the level of the renderer is the right place.

@1ec5

1ec5 commented Jun 5, 2023

> The correct behaviour for such local features is to use name in the absence of name:en, which is what anyone writing in English would do if the scripts are the same and no transliteration is needed.

This is perhaps more workable for POIs than for other things like places and natural features, which may have very different naming conventions across languages.

Moreover, parts of the OSM community insist on only tagging name:* values that are visible on the ground, omitting names that are in common use by languages that aren’t locally common. This inevitably results in data consumers relying on Wikidata as a fallback: ZeLonewolf/openstreetmap-americana#428 (comment).

I think it’s fair to say that OpenMapTiles is a little further on the spectrum of data consumers that care about the end user experience, which would implement these fallbacks, whereas the other featured tile layers on osm.org have traditionally been focused solely on mapper feedback.

> I could go and edit Wikidata and remove such non-translations like Leeuwarden railway station or Leeuwarden Northern General Cemetery (the latter is literally made up based on the local Dutch name by someone with no knowledge of Dutch or the local names). This will likely get me in conflict with people who desire these translations for other projects. This doesn't feel right, and I've noticed in discussions about the use of Wikidata in the OpenStreetMap community that I am not alone in this. I.e., linking to Wikidata via the wikidata key is desirable, but pulling in metadata perhaps not so much.

You would be well-justified in changing the English label to just “Leeuwarden”. The only reason it has the “railway station” suffix is that the label was imported from the English Wikipedia, where article titles have to be unique, and hasn’t been cleaned up yet. But Wikidata does welcome mechanical edits to clean up its labels. (A similar example came up in a forum discussion where someone had proposed requiring data consumers to rely on Wikidata labels in the first instance, which would’ve been a step too far in my opinion.)

@jdhoek
Contributor Author

jdhoek commented Jun 5, 2023

> This is perhaps more workable for POIs than for other things like places and natural features, which may have very different naming conventions across languages.

If they do they'll be tagged on OpenStreetMap correctly. For places, using name as a fallback in case of missing language tags is almost always correct when the scripts match.

> Moreover, parts of the OSM community insist on only tagging name:* values that are visible on the ground, omitting names that are in common use by languages that aren’t locally common.

For the Netherlands I am not aware of these issues. Is this really a big problem internationally? International names for the bigger places are maintained accurately, and for smaller places often don't exist (which is natural). It doesn't say アムステルダム anywhere on the signage for Amsterdam, but name:ja is tagged as expected. It's easily confirmed data after all.

> I think it’s fair to say that OpenMapTiles is a little further on the spectrum of data consumers that care about the end user experience,

I'm not convinced importing missing names with lots of errors is significantly improving the user experience. It will also lead to cases where the origin of the name is unclear in case of errors ("is OpenStreetMap wrong or is the name magically imported from some other source?").

Using names from Wikidata might improve Wikidata, but does it benefit OpenStreetMap and its users?

> You would be well-justified in changing the English label to just “Leeuwarden”.

The problem is that mappers now have to maintain two sets of name tags in order to prevent faulty information from leaking through. That is quite an additional burden! In this case it would also be wrong: there is no real name:en for the local railway station. Barring specific cases, name should suffice for any English, German, or French usage.

It will also inevitably lead to conflicts with other Wikidata consumers, like sister project Wikipedia. These often have different policies for naming things. One specific case which has bugged me for years is how the Dutch Wikipedia refers to the village of Grou near me as Grouw, which is the old Dutch spelling officially dropped in the nineties. Gradually all references to this village in Dutch changed to the Frisian spelling in the past thirty years, which is now the de facto and de jure Dutch spelling too. For OpenStreetMap this is fine: it's the ground-truth after all! (Not to mention factually and empirically correct.)

Dutch Wikipedia refuses to change the name used on their lemma, because they have a policy (set in stone as these are on Wikipedia) which points to a specific language institution as the arbiter for such names. That institute publishes a list of Dutch and Frisian names, and the village is referred to by its archaic name there without noting its obsolescence, so any attempt to change the Wikipedia page gets reverted. The only reason this name is correct on Wikidata is that the editors concerned haven't looked there yet.

Another example closer to you: how would the Dutch refer to San José? With or without the acute accent? Orthographically, the é is part of the Dutch vocabulary, but following the American English spelling which omits diacritics could be acceptable too. OpenStreetMap correctly leaves out name:nl, but Wikidata apparently knows how the Dutch ought to write it (simply because a Wikipedia page exists).

This goes even further for Frisian, the minority language spoken in my Dutch province. It too has the é available, but Wikidata is dead sure Frisians would write San Jose without it. Is that correct, or is the Frisian Wikipedia (edited by a handful of well-meaning amateurs) simply using whatever spelling the English Wikipedia chose? In OpenStreetMap we would omit name:fy in that case (which is really the status quo for almost all American names), and any future change in name would be reflected in Frisian too, but Wikidata doesn't work that way. If I delete the Frisian entry, some bot will replace it because there is a page on the Frisian Wikipedia!

@1ec5

1ec5 commented Jun 5, 2023

> For the Netherlands I am not aware of these issues. Is this really a big problem internationally? International names for the bigger places are maintained accurately, and for smaller places often don't exist (which is natural). It doesn't say アムステルダム anywhere on the signage for Amsterdam, but name:ja is tagged as expected. It's easily confirmed data after all.

If you’re right, that would be wonderful, although there’s still the issue of not being able to determine the language of name, even from a redundant name:*. Should a Japanese speaker looking at a Japanese-language map see “Amsterdam” or “アムステルダム” when looking at Amsterdam, New York, which only has name:en and name:moh tagged? I don’t think the U.S. community would get up in arms about name:ja being tagged on it too, but you bet there are countries where that would become a cause célèbre.

> I'm not convinced importing missing names with lots of errors is significantly improving the user experience. It will also lead to cases where the origin of the name is unclear in case of errors ("is OpenStreetMap wrong or is the name magically imported from some other source?").
>
> Using names from Wikidata might improve Wikidata, but does it benefit OpenStreetMap and its users?

This is exactly my point. It would be profoundly confusing and ill-advised for a purely mapper-oriented tile layer/stylesheet like openstreetmap-carto to pull in Wikidata labels, especially if there’s no indication of the source.[1] But from the perspective of a user of a consumer application that depends on OpenMapTiles, does it really matter where the incorrect name comes from, as long as it can be fixed easily? Even more OSM-oriented clients of OpenMapTiles, such as OSM Americana, have decided that a rising tide lifts all boats and use the name:* properties despite (or even because of) the Wikidata fallback.

Putting an OpenMapTiles-powered, openstreetmap-carto-inspired style in openstreetmap.org, as in openstreetmap/openstreetmap-website#4042, does blur the line between the two audiences. I think it would be reasonable to expect this particular style to not use Wikidata labels, for the sake of mapper feedback. MapTiler could do that by skipping this step when generating tiles specifically for openstreetmap.org. But enforcing that decision on other styles for other audiences would be a lot less reasonable in my opinion.
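As a rough illustration of what skipping that step for one audience could look like, here is a minimal Python sketch. It is not OpenMapTiles code: the function `choose_label`, the `use_wikidata` switch, and the source strings are all hypothetical, and the ordering (tagged name, then Wikidata label, then plain `name`) is an assumption based on the behaviour described in this thread. Returning the source alongside the label keeps the origin of a name traceable.

```python
from typing import Dict, Optional, Tuple


def choose_label(
    osm_tags: Dict[str, str],
    wikidata_labels: Dict[str, str],
    lang: str,
    use_wikidata: bool = True,
) -> Tuple[Optional[str], str]:
    """Return (label, source) for the requested language (hypothetical helper)."""
    # A name tagged in OSM for this language always wins.
    tagged = osm_tags.get(f"name:{lang}")
    if tagged:
        return tagged, f"osm:name:{lang}"
    # Optional fallback to a Wikidata label, off for mapper-oriented styles.
    if use_wikidata and lang in wikidata_labels:
        return wikidata_labels[lang], "wikidata:label"
    # Last resort: the plain local name, if any.
    name = osm_tags.get("name")
    return name, "osm:name" if name else "none"


tags = {"name": "Leeuwarden"}
labels = {"en": "Leeuwarden railway station"}
print(choose_label(tags, labels, "en"))
print(choose_label(tags, labels, "en", use_wikidata=False))
```

With the fallback enabled the station picks up the Wikidata label; with it disabled the label comes from `name` and is marked as such.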

> Another example closer to you: how would the Dutch refer to San José? With or without the acute accent? Orthographically, the é is part of the Dutch vocabulary, but following the American English spelling which omits diacritics could be acceptable too. OpenStreetMap correctly leaves out name:nl, but Wikidata apparently knows how the Dutch ought to write it (simply because a Wikipedia page exists).

English isn’t even consistent about it. 😅 Wikidata does a great job of clarifying the situation, if I may say so myself, although OpenMapTiles is only using labels, not name statements. Preferring name statements over labels would improve the translation quality in some cases.

> If I delete the Frisian entry, some bot will replace it because there is a page on the Frisian Wikipedia!

That’s probably true, but not necessarily because Wikidata prefers Wikipedia article titles. I’m unfamiliar with the Frisian Wikipedia community, but I can say that there’s no love lost between the English Wikipedia and Wikidata over the issue of labels and descriptions. (Wikipedia essentially forked Wikidata in that regard.) The best practice on Wikidata would be to set the Frisian label to something, even if it’s the same as the native name. To affirm there’s no distinct Frisian name, you’d add a “no value” Frisian-language name statement to the item, though that’s pretty rare.
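To make the difference between labels and name statements concrete, here is a hedged Python sketch that reads both from a Wikidata entity in its JSON form. It assumes the usual entity layout (`labels` keyed by language, `claims` keyed by property, each statement carrying a `mainsnak`), and it assumes that P1448 (official name), P2561 (name) and P1705 (native label) are the relevant monolingual-text properties; both assumptions should be checked against Wikidata before relying on them. The function name and the sample entity are made up, and the sketch simply skips "no value" and "unknown value" snaks rather than interpreting them.

```python
from typing import Any, Dict, Optional

# Assumed name properties: official name, name, native label.
NAME_PROPERTIES = ("P1448", "P2561", "P1705")


def wikidata_name(entity: Dict[str, Any], lang: str) -> Optional[str]:
    """Prefer a name statement in `lang`; fall back to the label (hypothetical helper)."""
    for prop in NAME_PROPERTIES:
        for statement in entity.get("claims", {}).get(prop, []):
            if statement.get("rank") == "deprecated":
                continue
            snak = statement.get("mainsnak", {})
            # "novalue" and "somevalue" snaks carry no text, so skip them here.
            if snak.get("snaktype") != "value":
                continue
            value = snak["datavalue"]["value"]
            if value.get("language") == lang:
                return value["text"]
    label = entity.get("labels", {}).get(lang)
    return label["value"] if label else None


# Made-up entity for illustration only, not real Wikidata content.
sample = {
    "labels": {"en": {"language": "en", "value": "Example railway station"}},
    "claims": {
        "P1448": [
            {
                "rank": "normal",
                "mainsnak": {
                    "snaktype": "value",
                    "datavalue": {"value": {"text": "Example", "language": "en"}},
                },
            }
        ]
    },
}
print(wikidata_name(sample, "en"))
```

For the made-up entity the statement wins over the label; for a language with neither a statement nor a label the function returns `None`, leaving any further fallback to the caller.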

Footnotes

  1. Like OpenMapTiles, openstreetmap-carto does pull in Natural Earth data that contradicts OSM data. However, it only does so at low zoom levels where generalization is expected and mappers are less likely to detect these discrepancies.

@1ec5

1ec5 commented Oct 2, 2023

> English isn’t even consistent about it. 😅 Wikidata does a great job of clarifying the situation, if I may say so myself, although OpenMapTiles is only using labels, not name statements. Preferring name statements over labels would improve the translation quality in some cases.

openmaptiles/openmaptiles-tools#437 onthegomap/planetiler#679


2 participants