This repository has been archived by the owner on Mar 15, 2024. It is now read-only.

OSM name of a feature matches to Wikidata name #105

Merged
merged 9 commits into from
Mar 17, 2017

Conversation

@bkowshik bkowshik commented Mar 16, 2017

I downloaded the Wikidata dump and found a total of 25,327,505 features, of which 3,090,713 have a latitude tag, i.e., 12.2% of Wikidata features have a location component. 🎉

There are 589,087 features on OpenStreetMap with a Wikidata tag. For this iteration, the focus is on name modifications to features that have a Wikidata tag. Querying the Wikidata API in real time is therefore a better option than building a local dump similar to landmarks.sqlite, for a few reasons:

  • Data does not get stale as we query live data.
  • The API is simple and clear so the overhead to build is minimal.
  • The number of requests we make every day should be pretty low.

Sample Wikidata query

Holige is a 😋 sweet flatbread from South India with a Wikidata ID: Q19891734

[Image: Holige]

$ curl "https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q19891734&format=json"
{"entities":{"Q19891734":{"pageid":21535239,"ns":0,"title":"Q19891734","lastrevid":390829462,"modified":"2016-10-18T23:45:44Z","type":"item","id":"Q19891734","labels":{"en":{"language":"en","value":"Holige"},"pa":{"language":"pa","value":"\u0a2a\u0a42\u0a30\u0a28 \u0a2a\u0a4b\u0a32\u0a40"}},"descriptions":{"pa":{"language":"pa","value":"\u0a2d\u0a3e\u0a30\u0a24\u0a40 \u0a16\u0a3e\u0a23\u0a3e"},"en":{"language":"en","value":"Indian Food"}},"aliases":{},"claims":{"P279":[{"mainsnak":{"snaktype":"value","property":"P279","datavalue":{"value":{"entity-type":"item","numeric-id":2095,"id":"Q2095"},"type":"wikibase-entityid"},"datatype":"wikibase-item"},"type":"statement","id":"Q19891734$5AD3D27B-6D89-4435-84B4-D25744E4D81C","rank":"normal"}],"P495":[{"mainsnak":{"snaktype":"value","property":"P495","datavalue":{"value":{"entity-type":"item","numeric-id":668,"id":"Q668"},"type":"wikibase-entityid"},"datatype":"wikibase-item"},"type":"statement","id":"Q19891734$FC9CC4A4-9858-44B9-BAC2-1C4FAAF03A70","rank":"normal","references":[{"hash":"7eb64cf9621d34c54fd4bd040ed4b61a88c4a1a0","snaks":{"P143":[{"snaktype":"value","property":"P143","datavalue":{"value":{"entity-type":"item","numeric-id":328,"id":"Q328"},"type":"wikibase-entityid"},"datatype":"wikibase-item"}]},"snaks-order":["P143"]}]}],"P373":[{"mainsnak":{"snaktype":"value","property":"P373","datavalue":{"value":"Obbattu","type":"string"},"datatype":"string"},"type":"statement","id":"Q19891734$3E388244-771E-44AC-91BB-57F72AEDA0D5","rank":"normal","references":[{"hash":"7eb64cf9621d34c54fd4bd040ed4b61a88c4a1a0","snaks":{"P143":[{"snaktype":"value","property":"P143","datavalue":{"value":{"entity-type":"item","numeric-id":328,"id":"Q328"},"type":"wikibase-entityid"},"datatype":"wikibase-item"}]},"snaks-order":["P143"]}]}],"P18":[{"mainsnak":{"snaktype":"value","property":"P18","datavalue":{"value":"Holige1.JPG","type":"string"},"datatype":"commonsMedia"},"type":"statement","id":"Q19891734$A56187C7-F1B4-4FA6-B3A9-4770D2B33BB3","rank":"normal","references":[{"hash":"7eb64cf9621d34c54fd4bd040ed4b61a88c4a1a0","snaks":{"P143":[{"snaktype":"value","property":"P143","datavalue":{"value":{"entity-type":"item","numeric-id":328,"id":"Q328"},"type":"wikibase-entityid"},"datatype":"wikibase-item"}]},"snaks-order":["P143"]}]}]},"sitelinks":{"enwiki":{"site":"enwiki","title":"Puran poli","badges":[]},"pawiki":{"site":"pawiki","title":"\u0a2a\u0a42\u0a30\u0a28 \u0a2a\u0a4b\u0a32\u0a40","badges":[]}}}},"success":1}

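For a name comparison, the only parts of a response like the one above that matter are the English label and any English aliases. A minimal sketch, assuming the response body has already been parsed into an object; `extractNames` is a hypothetical helper for illustration and is not taken from this pull request:

```javascript
// Hypothetical helper: pull the English label and English aliases for one
// entity out of a parsed wbgetentities response.
function extractNames(response, id) {
  const entity = (response.entities || {})[id] || {};
  const label = entity.labels && entity.labels.en ? entity.labels.en.value : null;
  // Aliases are grouped by language; each entry carries a `value` string.
  const aliases = ((entity.aliases || {}).en || []).map(function (alias) {
    return alias.value;
  });
  return { label: label, aliases: aliases };
}

// Trimmed copy of the Q19891734 response (labels and aliases only).
const sample = {
  entities: {
    Q19891734: {
      labels: { en: { language: 'en', value: 'Holige' } },
      aliases: {}
    }
  },
  success: 1
};

console.log(extractNames(sample, 'Q19891734'));
```

The helper returns `null` for the label and an empty alias list when the entity or language is missing, so callers can tell "no data" apart from a real name.
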
Wikidata alias

An object in Wikidata can have one or more aliases. For example, the city of Bengaluru has the name Bengaluru in OpenStreetMap but Bangalore on Wikidata. I have modified the comparator so that a feature is flagged only when its OSM name differs from both the Wikidata name and every Wikidata alias, here Bangalore and Bengaluru. 😃
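
The alias rule can be sketched as a small predicate. This is an illustration only: `nameMatchesWikidata` is a hypothetical function name, and the comparison is assumed to be exact and case-sensitive.

```javascript
// Sketch: true when the OSM name equals the Wikidata name or any alias.
// Exact, case-sensitive string comparison is an assumption of this example.
function nameMatchesWikidata(osmName, wikidataName, wikidataAliasNames) {
  return osmName === wikidataName || wikidataAliasNames.indexOf(osmName) !== -1;
}

// Bengaluru on OSM, Bangalore on Wikidata with Bengaluru as an alias.
console.log(nameMatchesWikidata('Bengaluru', 'Bangalore', ['Bengaluru']));
// A name that is neither the Wikidata name nor an alias would be flagged.
console.log(nameMatchesWikidata('Mysuru', 'Bangalore', ['Bengaluru']));
```
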

Quality of data

It has been an eye-opening experience seeing OpenStreetMap data used with other open data sources. The boost in quality of data and maintainability is a win-win! 🚀

@bkowshik
Contributor Author

Some 👀 and feedback if any from you @planemad and @amishas157 would be really helpful.

@bkowshik
Contributor Author

cc: @batpad @geohacker

},
{
"description": "Test OSM name matches with aliases on Wikidata",
"expectedResult": {},
Contributor

Should this give "result:name_matches_to_wikidata": true?

Contributor Author

@planemad the current design of all compare functions is to return results only if they are interesting, and by interesting we mostly mean harmful changes. When things are good, the compare functions return nothing, i.e., {}

Contributor

This design essentially does not differentiate between no data and a positive result.

If the goal is to have every changeset reviewed by the community, comparators should pass on as much useful knowledge as possible to a human reviewer to make the final decision. This comparator has already done the tedious work of looking up Wikidata and comparing names; withholding this finding will lead to duplicate human effort on the same activity. What do we gain by this?

Contributor Author

comparators should pass on as much useful knowledge as possible to a human reviewer to make the final decision.

Totally agree @planemad, 💯 The easier bit is returning "result:name_matches_to_wikidata": true from the compare function. The harder one is on osmcha's side. Just to keep the scale of things manageable, osmcha stores just the features that the comparators have flagged for being potentially harmful/problematic and not all the features. Yes, osmcha has all the changesets but not all the features.

I am curious to hear more on this, shall we create a separate ticket for the same?


cc: @willemarcel @batpad @geohacker @amishas157
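
To make the design question in this thread concrete, here is a rough sketch of a result that separates "nothing to compare" from a positive or negative match. It is not the merged behaviour, which returns {} when names match; `buildResult` is a hypothetical name used only for this illustration.

```javascript
// Hypothetical result builder: distinguishes "no data" from a match or mismatch.
function buildResult(osmName, wikidataName, wikidataAliasNames) {
  // Without both names there is nothing to compare, so report nothing.
  if (!osmName || !wikidataName) return {};
  const matches = osmName === wikidataName ||
    (wikidataAliasNames || []).indexOf(osmName) !== -1;
  // Report the outcome either way, so a reviewer sees positive results too.
  return { 'result:name_matches_to_wikidata': matches };
}

console.log(buildResult('Bengaluru', 'Bangalore', ['Bengaluru']));
console.log(buildResult('Bengaluru', null, []));
```

With this shape, a consumer such as osmcha would still have to decide which results to store, which is the scaling concern raised above.
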

Contributor

@bkowshik @planemad That makes sense. But we need to find a good way to parse these different kinds of results, and to scale it as well.

Contributor

shall we create a separate ticket for the same?

Yes please. We should have consistent design principles that will serve as a guide to build useful compare functions without being constrained by limitations of osmcha.

Contributor Author

Created a new ticket here: #106

Contributor

@planemad planemad left a comment

This looks like a good start. My biggest concern is that name on OSM is the name in the local language and not necessarily English.

Since the comparator compares only to the English Wikidata label, there is potential for a lot of noise from data in non-English regions.

Can we get an idea of the level of noise this will generate if it goes out live?
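
One possible way to reduce that noise would be to compare the OSM name against the Wikidata labels in every language rather than only English. This is a sketch of that idea, not what the comparator in this pull request does; `allLabelValues` and `matchesAnyLabel` are hypothetical names.

```javascript
// Hypothetical: collect the label value for every language on an entity.
function allLabelValues(entity) {
  const labels = entity.labels || {};
  return Object.keys(labels).map(function (lang) {
    return labels[lang].value;
  });
}

// Hypothetical: true when the OSM name equals a label in any language.
function matchesAnyLabel(osmName, entity) {
  return allLabelValues(entity).indexOf(osmName) !== -1;
}

// Labels taken from the Q19891734 response in the description.
const entity = {
  labels: {
    en: { language: 'en', value: 'Holige' },
    pa: { language: 'pa', value: '\u0a2a\u0a42\u0a30\u0a28 \u0a2a\u0a4b\u0a32\u0a40' }
  }
};

console.log(matchesAnyLabel('\u0a2a\u0a42\u0a30\u0a28 \u0a2a\u0a4b\u0a32\u0a40', entity));
console.log(matchesAnyLabel('Dosa', entity));
```
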

Contributor Author

@bkowshik bkowshik left a comment

This looks like a good start. My biggest concern is that name on OSM is the name in the local language and not necessarily English.

Great observation @planemad

Can we get an idea of the level of noise this will generate if it goes out live?

Once deployed on osmcha, all feature changes flagged by this comparator can be filtered by the reason: Name does not match to Wikidata

[Screenshot, 2017-03-16]

if ((osmName !== wikidataName) && (wikidataAliasNames.indexOf(osmName) === -1)) return callback(null, {
'result:name_matches_to_wikidata': false
});
}
Contributor

@bkowshik Although we are assuming that we won't hit the Wikidata API too hard, just to be 💯, we could catch the errors returned when the Wikidata API rate-limits us and find a way for them to be reported to us. Maybe we could also use `'result:wikidataApiLimitExceeded': true`, the way we do for escalate, and then read it on the vandalism side to send us these errors. [This](https://www.mediawiki.org/wiki/API:Errors_and_warnings) lists the error codes returned by the Wikidata API. We can catch `ratelimited`. If we get such errors from vandalism, we can figure out some other way to avoid hitting the Wikidata API hard while still making sure that this comparator works the way it is expected to.
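
A rough sketch of that check, assuming the parsed response body follows the usual MediaWiki error shape of an `error` object with a `code` field. `rateLimitResult` is a hypothetical helper name, and the result key is the one proposed in this comment, not an existing convention.

```javascript
// Hypothetical helper: turn a rate-limit error from the API into a
// comparator result; returns null for any other response.
function rateLimitResult(body) {
  if (body && body.error && body.error.code === 'ratelimited') {
    return { 'result:wikidataApiLimitExceeded': true };
  }
  return null;
}

// Illustrative bodies; the info text here is made up for the example.
console.log(rateLimitResult({ error: { code: 'ratelimited', info: 'example' } }));
console.log(rateLimitResult({ entities: {}, success: 1 }));
```
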

Contributor Author

Thank you @amishas157. I copied your comments over to a new ticket about best practices for working with external APIs here: #107

Contributor

@amishas157 amishas157 left a comment

@bkowshik, the comparator looks 🎉

@bkowshik
Contributor Author

Published to npm as version: 4.15.0

@bkowshik bkowshik deleted the matches-to-wikidata branch March 17, 2017 08:28