```python
%matplotlib inline
```

# Data Science with OpenStreetMap and Wikidata

### Nikolai Janakiev [@njanakiev](https://twitter.com/njanakiev/)

# Outline

### Part I: _Wikidata and OpenStreetMap_

- Difference between Wikidata and OpenStreetMap
- Ways to connect data between Wikidata and OpenStreetMap

### Part II: _Data Science with Wikidata and OSM_

- Libraries and Tools
- Exhibition of Various Analyses and Results

# OpenStreetMap Elements

![OSM Elements](assets/osm_elements.png)

# Metadata in OpenStreetMap

![OSM Key Amenity](assets/osm_key_amenity.png)

![OSM Salzburg](assets/osm_salzburg.png)

# Wikidata is a Knowledge Graph

![Wikipedia Wikidata Link](assets/wikipedia_wikidata_link.png)

![Wikidata Data Model](assets/wikidata_data_model.png)

![wikidata linked data graph](assets/wikidata_linked_data_graph.png)

# Querying Wikidata with SPARQL

- [https://query.wikidata.org/](https://query.wikidata.org/)

![Wikidata Query](assets/wikidata_query.png)

# All Windmills in Wikidata

```sparql
SELECT ?item ?itemLabel ?image ?location ?country ?countryLabel
WHERE {
  ?item wdt:P31 wd:Q38720.
  OPTIONAL { ?item wdt:P18 ?image. }
  OPTIONAL { ?item wdt:P625 ?location. }
  OPTIONAL { ?item wdt:P17 ?country. }
  SERVICE wikibase:label { 
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". 
  }
}
```
[Query link](https://w.wiki/5cv)

![Wikidata Windmills](assets/wikidata_windmills.png)
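
Outside the query service UI, a query like the one above can also be sent to the public SPARQL endpoint from Python. The following is a minimal sketch using only the standard library. It assumes the endpoint `https://query.wikidata.org/sparql` accepts `query` and `format=json` parameters; the helper names (`build_request`, `bindings_to_rows`, `run_query`) and the User-Agent string are hypothetical and not part of any library. The query is a shortened variant of the windmill query with a fixed label language and a `LIMIT`.

```python
import json
import urllib.parse
import urllib.request

# Assumption: public Wikidata SPARQL endpoint
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

# Shortened variant of the windmill query (Q38720 = windmill)
WINDMILL_QUERY = """
SELECT ?item ?itemLabel ?location WHERE {
  ?item wdt:P31 wd:Q38720.
  OPTIONAL { ?item wdt:P625 ?location. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""


def build_request(query, endpoint=WIKIDATA_ENDPOINT):
    """Build a GET request that asks the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        f"{endpoint}?{params}",
        headers={
            # Hypothetical User-Agent; replace with your own contact details
            "User-Agent": "osm-wikidata-example/0.1",
            "Accept": "application/sparql-results+json",
        },
    )


def bindings_to_rows(result):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in result["results"]["bindings"]
    ]


def run_query(query):
    """Send the query over the network and return the flattened rows."""
    with urllib.request.urlopen(build_request(query), timeout=60) as response:
        return bindings_to_rows(json.load(response))
```

Calling `run_query(WINDMILL_QUERY)` performs the network request. Rows for items without coordinates simply lack the `location` key, because the `OPTIONAL` clause leaves that variable unbound.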

# OpenStreetMap and Wikidata in Numbers

### OpenStreetMap ([Source](https://www.openstreetmap.org/stats/data_stats.html))

- __Started 2004__
- Number of users: __5,630,923__
- Number of uploaded GPS points: __7,459,170,764__
- Number of nodes: __5,424,072,098__
- Number of ways: __601,538,972__
- Number of relations: __7,038,670__

### Wikidata ([Source](https://www.wikidata.org/wiki/Wikidata:Statistics))

- __Started 2012__
- Number of active users: __20,798__
- Number of items: __59,218,423__
- Number of edits: __1,000,545,117__

# Linking OpenStreetMap with Wikidata?

![OSM Wikidata Bridge](assets/osm_wikidata_bridge.jpg)

[File:WdOsm-semanticBridge.jpg](https://wiki.openstreetmap.org/wiki/File:WdOsm-semanticBridge.jpg)

# OpenStreetMap to Wikidata

- `wikidata=*` tag _(stable)_
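
To illustrate the `wikidata=*` direction, here is a rough sketch that builds an Overpass QL query for all elements carrying a `wikidata` tag inside a bounding box and then maps each returned element to its Wikidata ID. The function names are hypothetical, and the sketch assumes the usual Overpass JSON layout with an `elements` list whose entries have `type`, `id` and `tags`.

```python
def overpass_wikidata_query(south, west, north, east, timeout=25):
    """Build an Overpass QL query for elements with a wikidata tag in a bbox."""
    bbox = f"{south},{west},{north},{east}"
    return (
        f"[out:json][timeout:{timeout}];\n"
        "(\n"
        f'  node["wikidata"]({bbox});\n'
        f'  way["wikidata"]({bbox});\n'
        f'  relation["wikidata"]({bbox});\n'
        ");\n"
        "out tags;"
    )


def wikidata_ids(elements):
    """Map (osm_type, osm_id) to the value of the element's wikidata tag."""
    mapping = {}
    for element in elements:
        qid = element.get("tags", {}).get("wikidata")
        if qid:
            mapping[(element["type"], element["id"])] = qid
    return mapping
```

The query string can be pasted into an Overpass frontend or posted to an Overpass API instance, and `wikidata_ids(response["elements"])` then gives the OSM-to-Wikidata lookup table.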

# Wikidata to OpenStreetMap

- [OSM relation ID (P402)](https://www.wikidata.org/wiki/Property:P402), in total __97704__ entities  _(unstable)_ <br><font color="red"><b>Note:</b> Should not be used for <b>Nodes, Ways or Areas</b></font>

- [Permanent ID](https://wiki.openstreetmap.org/wiki/Permanent_ID) proposal

- [OSM tag or key (P1282)](https://www.wikidata.org/wiki/Property:P1282) mapping of OSM key-values to Wikidata entities, in total __1862__ entities (e.g. [lighthouse](https://www.wikidata.org/wiki/Q39715) and [Tag:man_made=lighthouse](https://wiki.openstreetmap.org/wiki/Tag:man_made=lighthouse))
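
The two properties above can be turned into OSM references with a few lines of Python. This is a small sketch with hypothetical helper names. It assumes that P402 values are plain relation IDs and that P1282 values follow the OSM wiki page naming, i.e. `Tag:key=value` or `Key:key`, as in the lighthouse example.

```python
def osm_relation_url(relation_id):
    """Build the openstreetmap.org URL for an OSM relation ID (P402)."""
    return f"https://www.openstreetmap.org/relation/{int(relation_id)}"


def parse_osm_tag_or_key(value):
    """Split a P1282 value into (key, value); value is None for 'Key:' entries."""
    kind, _, rest = value.partition(":")
    if kind == "Key" and rest:
        return rest, None
    if kind == "Tag":
        key, separator, tag_value = rest.partition("=")
        if key and separator:
            return key, tag_value
    raise ValueError(f"unexpected OSM tag or key value: {value!r}")
```

For example, `parse_osm_tag_or_key("Tag:man_made=lighthouse")` yields the key `man_made` and the value `lighthouse`, which can be used directly as a tag filter on the OSM side.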

# Data Science

- Donoho, David. ["50 years of data science."](https://www.tandfonline.com/doi/full/10.1080/10618600.2017.1384734) Journal of Computational and Graphical Statistics 26.4 (2017): 745-766.

# Used Tools and Libraries

- [Jupyter](https://jupyter.org/) - interactive notebook development environment
- [PostGIS](https://postgis.net/) - spatial database extender for [PostgreSQL](http://postgresql.org/)
- [GDAL ogr2ogr](https://gdal.org/programs/ogr2ogr.html) - converting simple features between file formats

# Python Libraries

- [NumPy](https://www.numpy.org/) - numerical and scientific computing
- [Pandas](https://pandas.pydata.org/) - data analysis library
- [Matplotlib](https://matplotlib.org/) - 2D plotting library
- [Shapely](https://shapely.readthedocs.io/en/stable/manual.html) - analysis and manipulation of [GEOS](http://trac.osgeo.org/geos/) features
- [GeoPandas](http://geopandas.org/) - Pandas extension for spatial operations and geometric types
- [PySAL](https://pysal.org/) - spatial analysis library
- [Datashader](http://datashader.org/) - graphics pipeline system for large datasets

# OpenStreetMap Elements with Wikidata Tag

<img alt="wikidata europe osm points" src="assets/wikidata_europe_osm_points.png" style="width: 80%; height: 80%;" />

<img alt="osm europe wikidata" src="assets/osm_europe_wikidata.png" style="width: 90%; height: 90%;" />

<img alt="osm europe wikidata lisa clusters" src="assets/osm_europe_wikidata_lisa_clusters.png" style="width: 90%; height: 90%;" />

# How to load Wikidata Data

- [https://query.wikidata.org/sparql/](https://query.wikidata.org/sparql/) - SPARQL endpoint
- [sparqldataframe](https://github.com/njanakiev/sparqldataframe/) - retrieves a Pandas dataframe from a SPARQL query
- [wdtools](https://github.com/njanakiev/wdtools) - Wikidata utilities and tools

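```python
import sparqldataframe

# Villages (Q532) in Romania (Q218) that have a coordinate location (P625)
query = """
    SELECT ?item ?itemLabel ?location WHERE {
      ?item wdt:P31 wd:Q532.
      ?item wdt:P17 wd:Q218.
      ?item wdt:P625 ?location.
      SERVICE wikibase:label { 
        bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". 
      }
    }"""

# Run the query against Wikidata and load the result into a Pandas dataframe
df = sparqldataframe.wikidata_query(query)
df.head()
```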

| | item | itemLabel | location |
|---|---|---|---|
| 0 | http://www.wikidata.org/entity/Q238698 | Cașolț | Point(24.28055556 45.77916667) |
| 1 | http://www.wikidata.org/entity/Q238916 | Boholț | Point(24.91888889 45.88833333) |
| 2 | http://www.wikidata.org/entity/Q239159 | Bodoș | Point(25.663126 46.076358) |
| 3 | http://www.wikidata.org/entity/Q239209 | Zăuan-Băi | Point(22.644737 47.247748) |
| 4 | http://www.wikidata.org/entity/Q239519 | Vultureni | Point(23.5574 46.9646) |
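
The `location` values come back as WKT strings of the form `Point(longitude latitude)`, so they need to be parsed before plotting or spatial joins. A minimal sketch with a hypothetical helper `parse_wkt_point`, using only the standard library and handling plain decimal coordinates only:

```python
import re

# Matches e.g. "Point(24.28055556 45.77916667)"; WKT order is longitude, latitude
_POINT_RE = re.compile(
    r"^Point\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)$",
    re.IGNORECASE,
)


def parse_wkt_point(wkt):
    """Return (longitude, latitude) as floats from a WKT point string."""
    match = _POINT_RE.match(wkt.strip())
    if match is None:
        raise ValueError(f"not a WKT point: {wkt!r}")
    lon, lat = (float(group) for group in match.groups())
    return lon, lat
```

With the dataframe from the query, something like `df["location"].apply(parse_wkt_point)` would give one `(lon, lat)` tuple per row.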


# Wikidata _Instance of_ (P31) Property

<img alt="wikidata europe points" src="assets/wikidata_europe_points.png" style="width: 80%; height: 80%;" />

<img alt="wikidata europe most common instances" src="assets/wikidata_europe_most_common_instances.png" style="width: 90%; height: 90%;" />

<img alt="wikidata europe companies most common instances" src="assets/wikidata_europe_companies_most_common_instances.png" style="width: 90%; height: 90%;" />

<img alt="wikidata uk companies most common instances" src="assets/wikidata_uk_companies_most_common_instances.png" style="width: 90%; height: 90%;" />

# Analyzing Websites Regionally

<blockquote class="twitter-tweet" data-lang="en-gb"><p lang="en" dir="ltr">I like how everyone is saying that jQuery is dead and at the same time - it powers 70% of the Web</p>&mdash; Tomasz Łakomy (@tlakomy) <a href="https://twitter.com/tlakomy/status/1141327543699726336?ref_src=twsrc%5Etfw">19 June 2019</a></blockquote>

<img alt="websites percentage jquery histogram" src="assets/websites_percentage_jquery_histogram.png" style="width: 90%; height: 90%;" />

<img alt="websites percentage jquery" src="assets/websites_percentage_jquery.png" style="width: 90%; height: 90%;" />

<img alt="websites percentage jquery lisa clusters" src="assets/websites_percentage_jquery_lisa_clusters.png" />

# Classifying Countries and Regions with OSM

- Using counts of various amenities as signatures for regions
- [osm.janakiev.com](https://osm.janakiev.com/)
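
One way to read "amenity counts as signatures" is to normalize the counts per region into a vector and compare regions by cosine similarity. The sketch below is an illustration of that idea, not necessarily the method used in the project; the function names are hypothetical.

```python
import math
from collections import Counter


def amenity_signature(amenities):
    """Turn a list of amenity tag values into relative frequencies."""
    counts = Counter(amenities)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {amenity: count / total for amenity, count in counts.items()}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse signature dicts (0.0 if either is empty)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Regions with a similar mix of amenities get a similarity close to 1, regardless of how many elements each region contains in total. Such vectors could then be fed into any standard classifier or clustering method.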

![OSM Data Science](assets/osm_data_science.png)

# Castle Dossier Map of Switzerland

- Thematic maps made with OpenStreetMap, Wikidata, Wikimedia Commons (images) and Wikipedia: [castle-map.infs.ch](https://castle-map.infs.ch), [Burgen-Dossier_Schweiz - OSM Wiki](https://wiki.openstreetmap.org/wiki/Burgen-Dossier_Schweiz)
- [BLICK sucht die schönsten Schlösser und Burgen der Schweiz](https://www.blick.ch/community/zurueck-ins-mittelalter-blick-sucht-die-schoensten-schloesser-und-burgen-der-schweiz-id15481900.html) (BLICK is looking for the most beautiful palaces and castles in Switzerland)

![castle dossier map switzerland](assets/castle_dossier_map_switzerland.png)

# Conclusion

- __Naming things is hard__, meaningfully categorizing even harder

- Wikidata tends to show variations in definitions between countries but to be consistent within countries (_this hypothesis has not been tested_)

- __Wittgenstein's ruler:__ [_"When you use a ruler to measure the table, you are also using the table to measure the ruler."_](https://en.wikiquote.org/wiki/Nassim_Nicholas_Taleb) Biased data can tell you more about the people behind the data than about the data itself

- __Know Thy Data__: Data provenance and completeness are crucial for data analysis and prediction

# Data Completeness

### OpenStreetMap

- Barrington-Leigh, Christopher, and Adam Millard-Ball. ["The world’s user-generated road map is more than 80% complete."](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0180698) PloS one 12.8 (2017): e0180698.
- [wiki.openstreetmap.org/wiki/Completeness](https://wiki.openstreetmap.org/wiki/Completeness)

### Wikidata

- Luggen, Michael, Djellel Difallah, Cristina Sarasua, Gianluca Demartini, and Philippe Cudré-Mauroux. ["How to estimate completeness of classes in Wikidata."](https://www.societybyte.swiss/2019/07/05/how-to-estimate-completeness-of-classes-in-wikidata/) SocietyByte. 2019.
- Ahmeti, Albin, Simon Razniewski, and Axel Polleres. ["Assessing the completeness of entities in knowledge bases."](https://link.springer.com/chapter/10.1007/978-3-319-70407-4_2) European Semantic Web Conference. Springer, Cham, 2017.
- [COOL-WD: A Completeness Tool for Wikidata](http://ceur-ws.org/Vol-1963/paper466.pdf)

# Data Science with OpenStreetMap and Wikidata

### Nikolai Janakiev [@njanakiev](https://twitter.com/njanakiev/)

- Slides @ [https://janakiev.com/slides/data-science-osm-wikidata](https://janakiev.com/slides/data-science-osm-wikidata)

## Resources

- [Wikidata - OpenStreetMap Wiki](https://wiki.openstreetmap.org/wiki/Wikidata)
- FOSSGIS 2016: [OpenStreetMap und Wikidata](https://www.youtube.com/watch?v=Zcv_7t7RcNM) - Michael Maier
- FOSDEM 2019: [Linking OpenStreetMap and Wikidata: A semi-automated, user-assisted editing tool](https://www.youtube.com/watch?v=UWcZ1WKXHNo) - Edward Betts
- [WDTools](https://github.com/njanakiev/wdtools) - Wikidata Utilities and Tools