# Apache Sedona and OSM

In this tutorial, we use Apache Sedona to read geospatial data provided by OpenStreetMap (OSM).
OSM data is distributed in several formats (e.g. PBF, shapefile). Here we use the `PBF format` ([Protocolbuffer Binary Format](https://wiki.openstreetmap.org/wiki/PBF_Format)).

## 1. Get the sample data

The sample data used in this tutorial can be downloaded from 
https://download.geofabrik.de/europe/france.html . I use the `Ile-de-France` extract (`ile-de-france-latest.osm.pbf`).


In [1]:
from sedona.spark import *
from pathlib import Path
import json

from ipyleaflet import Map, basemaps, basemap_to_tiles, MarkerCluster, Marker, AwesomeIcon
from ipywidgets import Layout
import numpy as np

Skipping SedonaKepler import, verify if keplergl is installed
Skipping SedonaPyDeck import, verify if pydeck is installed


In [2]:
# build a sedona session (sedona = 1.5.1; note that the Sedona jar pinned below is 1.4.1)
config = (
    SedonaContext.builder()
    .appName("Sedona with pyspark")
    .master("local[*]")
    .config("spark.driver.memory", "6g")
    .config(
        "spark.jars.packages",
        "com.acervera.osm4scala:osm4scala-spark3-shaded_2.12:1.0.11,"
        "org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.4.1,"
        "org.datasyslab:geotools-wrapper:1.4.0-28.2",
    )
    .getOrCreate()
)

# create a sedona context
sedona = SedonaContext.create(config)

24/04/12 12:08:16 WARN Utils: Your hostname, pengfei-Virtual-Machine resolves to a loopback address: 127.0.1.1; using 10.50.2.80 instead (on interface eth0)
24/04/12 12:08:16 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/home/pengfei/opt/spark/spark-3.3.0/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/pengfei/.ivy2/cache
The jars for the packages stored in: /home/pengfei/.ivy2/jars
com.acervera.osm4scala#osm4scala-spark3-shaded_2.12 added as a dependency
org.apache.sedona#sedona-spark-shaded-3.0_2.12 added as a dependency
org.datasyslab#geotools-wrapper added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-006cb2f2-0e5e-4daf-aa73-7b1bc4998fe1;1.0
	confs: [default]
	found com.acervera.osm4scala#osm4scala-spark3-shaded_2.12;1.0.11 in central
	found org.apache.sedona#sedona-spark-shaded-3.0_2.12;1.4.1 in central
	found org.datasyslab#geotools-wrapper;1.4.0-28.2 in central
:: resolution report :: resolve 408ms :: artifacts dl 21ms
	:: modules in use:
	com.acervera.osm4scala#osm4scala-spark3-shaded_2.12;1.0.11 from central in [default]
	org.apache.sedona#sedona-spark-shaded-3.0_2.12;1.4.1 from central in [default]
	org.datasyslab#geotools-wrapper;1.4.0-28.2 from central in [default]
	---------------------------------------

24/04/12 12:08:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
                                                                                

## 2. Read the PBF format with Sedona(Spark)

By default, Sedona cannot read PBF files directly. However, there is a `spark polyglot connector` called [osm4scala](https://simplexspatial.github.io/osm4scala/) that adds this capability. Its source code is on [GitHub](https://github.com/simplexspatial/osm4scala).

Setting it up is simple:
- online mode: when creating the Spark session, add **.config('spark.jars.packages', 'com.acervera.osm4scala:osm4scala-spark3-shaded_2.12:1.0.11')**
- offline mode: download the jar file first, then add **.config('spark.jars', jar_path)** when creating the Spark session
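
The two modes differ only in which Spark option carries the connector. Here is a minimal sketch of that choice as a plain Python helper. The function name `osm4scala_option` and the example jar path are hypothetical; the option keys and the Maven coordinate are the ones quoted above.

```python
from typing import Optional, Tuple

# Maven coordinate of the connector, as quoted in the list above
OSM4SCALA_PACKAGE = "com.acervera.osm4scala:osm4scala-spark3-shaded_2.12:1.0.11"


def osm4scala_option(jar_path: Optional[str] = None) -> Tuple[str, str]:
    """Return the (key, value) pair to pass to `.config(...)`.

    Hypothetical helper: without a jar path we assume online mode and let
    Spark resolve the package; with a path we assume the jar was downloaded.
    """
    if jar_path is None:
        return ("spark.jars.packages", OSM4SCALA_PACKAGE)
    return ("spark.jars", jar_path)


online = osm4scala_option()
# placeholder path, replace it with the location of the downloaded jar
offline = osm4scala_option("/path/to/osm4scala-spark3-shaded_2.12-1.0.11.jar")
print(online)
print(offline)
```

The returned pair can be unpacked directly into the session builder, for example `.config(*osm4scala_option())`.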

> You can find more information about the jar file [here](https://simplexspatial.github.io/osm4scala/docs/spark-connector)



In [3]:
homePath = "/home/pengfei/data_set/geo_spatial"
filePath= f"{homePath}/ile-de-france-latest.osm.pbf"

In [4]:
raw_df = sedona.read.format("osm.pbf").load(filePath)

## 3. Explore the OSM dataset

### 3.1 Understand the basic OSM data structure

OpenStreetMap uses a `topological data structure`, with `four core elements` (aka data primitives):

- **Nodes**: are points with a geographic position, stored as coordinates (pairs of a latitude and a longitude) according to `WGS 84`. Outside their usage in ways, they are used to represent map features without a size, such as `points of interest` or mountain peaks.
- **Ways**: are ordered lists of `nodes`, representing a `polyline, or possibly a polygon` if they form a closed loop. They are used both for representing linear features such as streets and rivers, and areas, like forests, parks, parking areas and lakes.
- **Relations**: are `ordered lists of nodes, ways and relations` (together called "members"), where each member can optionally have a "role" (a string). Relations are used for representing the relationship of existing nodes and ways. Examples include turn restrictions on roads, routes that span several existing ways (for instance, a long-distance motorway), and areas with holes.
- **Tags**: are `key-value pairs (both arbitrary strings)`. They are used to store metadata about the map objects (such as their type, their name and their physical properties). `Tags are not freestanding, but are always attached to an object: to a node, a way or a relation`. A recommended ontology of map features (the meaning of tags) is maintained on a wiki. New tagging schemes can be proposed and put to a popular vote on the OpenStreetMap wiki, but there is no requirement to follow this process. There are over 89 million different kinds of tags in use as of June 2017.

![OpenStreetMap_data_primitives_in_iD.png](../images/OpenStreetMap_data_primitives_in_iD.png)
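
To make the relationship between these elements concrete, here is a minimal sketch of the data model as plain Python dataclasses. The class and field names are my own choice for illustration (they are not the osm4scala schema), and the ids and tags are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Tags = Dict[str, str]  # tags are free-form key-value string pairs


@dataclass
class Node:
    id: int
    latitude: float   # WGS 84
    longitude: float  # WGS 84
    tags: Tags = field(default_factory=dict)


@dataclass
class Way:
    id: int
    nodes: List[int]  # ordered node ids
    tags: Tags = field(default_factory=dict)

    def is_closed(self) -> bool:
        """A way whose first and last node coincide forms a closed loop."""
        return len(self.nodes) > 2 and self.nodes[0] == self.nodes[-1]


@dataclass
class Member:
    type: str       # "node", "way" or "relation"
    ref: int        # id of the referenced element
    role: str = ""  # optional role of the member inside the relation


@dataclass
class Relation:
    id: int
    members: List[Member]
    tags: Tags = field(default_factory=dict)


# made-up ids and tags, for illustration only
square = Way(id=1, nodes=[10, 11, 12, 10], tags={"leisure": "park"})
street = Way(id=2, nodes=[10, 11, 12], tags={"highway": "residential"})
print(square.is_closed(), street.is_closed())  # True False
```

Note that tags never appear as a class of their own: they only exist as an attribute of a node, a way or a relation.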

In [5]:
raw_df.show()

+------+----+------------------+------------------+-----+---------+--------------------+--------------------+
|    id|type|          latitude|         longitude|nodes|relations|                tags|                info|
+------+----+------------------+------------------+-----+---------+--------------------+--------------------+
|122626|   0|49.115966300000004|         2.5549119|   []|       []|                  {}|{3, 2020-05-10 11...|
|122627|   0|49.110294100000004|         2.5521725|   []|       []|                  {}|{4, 2009-02-13 19...|
|122631|   0|        49.0834393|2.5511375000000003|   []|       []|                  {}|{15, 2021-06-30 1...|
|122632|   0|        49.0675225|2.5524679000000003|   []|       []|                  {}|{17, 2019-04-10 1...|
|122633|   0|         49.063616|2.5522412000000005|   []|       []|                  {}|{17, 2009-02-13 1...|
|122634|   0|        49.0597465|2.5509097000000005|   []|       []|                  {}|{2, 2009-02-13 19...|
|122635|  

In [6]:
raw_df.printSchema()

root
 |-- id: long (nullable = true)
 |-- type: byte (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- nodes: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- relations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- relationType: byte (nullable = true)
 |    |    |-- role: string (nullable = true)
 |-- tags: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- info: struct (nullable = true)
 |    |-- version: integer (nullable = true)
 |    |-- timestamp: timestamp (nullable = true)
 |    |-- changeset: long (nullable = true)
 |    |-- userId: integer (nullable = true)
 |    |-- userName: string (nullable = true)
 |    |-- visible: boolean (nullable = true)



In [7]:
outPath=f"{homePath}/ile-de-france-geo-parquet"

In [8]:
raw_df.write.mode("overwrite").format("parquet").save(outPath)

                                                                                

Here we test the Parquet output format (GeoParquet is not possible yet, as noted below). The .pbf file is smaller than the Parquet output (293 MB vs 856 MB), which surprised me a little.

> We can't write geoparquet, because there is no geometry column
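
One way to lift this limitation is to derive a geometry column from the coordinates before writing. The sketch below is an assumption and was not run in this notebook. The function name and the `geometry` column name are hypothetical, while `ST_Point` and the `geoparquet` writer format come from Sedona. Coordinates are passed as x = longitude, y = latitude, and the `Decimal` casts mirror the `ST_Point` calls used later in this notebook.

```python
def write_nodes_as_geoparquet(node_df, out_path):
    """Add a point geometry to each node row and save the result as GeoParquet.

    `node_df` is expected to be a Spark DataFrame with `id`, `latitude`,
    `longitude` and `tags` columns, created from a Sedona-enabled session.
    """
    geo_df = node_df.selectExpr(
        "id",
        "tags",
        # x = longitude, y = latitude
        "ST_Point(CAST(longitude AS Decimal(24,20)), "
        "CAST(latitude AS Decimal(24,20))) AS geometry",
    )
    geo_df.write.mode("overwrite").format("geoparquet").save(out_path)
    return geo_df
```

With the variables of this notebook, a call could look like `write_nodes_as_geoparquet(raw_df.where("type = 0"), f"{homePath}/ile-de-france-nodes-geoparquet")`, where the output folder name is again only an example.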

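### 3.2 Filter the nodes and ways

As explained above, OSM has `four core elements`. In our dataset, each row is a node, a way or a relation, identified by its `type` column. Tags do not have rows of their own: they are stored in the `tags` column of the element they belong to.
- Type 0: Node
- Type 1: Way
- Type 2: Relation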

In this tutorial, we only keep nodes and ways.
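
The numeric codes are easy to mix up, so it can help to give them names. Here is a small sketch using an `IntEnum`. The names `OsmType` and `type_filter` are hypothetical helpers, not part of Sedona or osm4scala; the codes themselves are the ones listed above.

```python
from enum import IntEnum


class OsmType(IntEnum):
    """Values of the `type` column, as listed above."""
    NODE = 0
    WAY = 1
    RELATION = 2


def type_filter(*kinds: OsmType) -> str:
    """Build a SQL condition selecting rows of the given element types."""
    codes = ", ".join(str(int(kind)) for kind in kinds)
    return f"type IN ({codes})"


print(type_filter(OsmType.NODE, OsmType.WAY))  # type IN (0, 1)
```

The resulting string can be passed to `where`, for example `raw_df.where(type_filter(OsmType.NODE, OsmType.WAY))`.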

In [7]:
raw_df.select("type").distinct().show()



+----+
|type|
+----+
|   0|
|   1|
|   2|
+----+


                                                                                

In [9]:
# get all nodes 
node_df = raw_df.where("type = 0")

node_df.show(5)


+------+----+------------------+------------------+-----+---------+----+--------------------+
|    id|type|          latitude|         longitude|nodes|relations|tags|                info|
+------+----+------------------+------------------+-----+---------+----+--------------------+
|122626|   0|49.115966300000004|         2.5549119|   []|       []|  {}|{3, 2020-05-10 11...|
|122627|   0|49.110294100000004|         2.5521725|   []|       []|  {}|{4, 2009-02-13 19...|
|122631|   0|        49.0834393|2.5511375000000003|   []|       []|  {}|{15, 2021-06-30 1...|
|122632|   0|        49.0675225|2.5524679000000003|   []|       []|  {}|{17, 2019-04-10 1...|
|122633|   0|         49.063616|2.5522412000000005|   []|       []|  {}|{17, 2009-02-13 1...|
+------+----+------------------+------------------+-----+---------+----+--------------------+
only showing top 5 rows



In [10]:
# keep only the columns we need
node_simple_df = node_df.select("id","latitude", "longitude")
node_simple_df.show(5)

+------+------------------+------------------+
|    id|          latitude|         longitude|
+------+------------------+------------------+
|122626|49.115966300000004|         2.5549119|
|122627|49.110294100000004|         2.5521725|
|122631|        49.0834393|2.5511375000000003|
|122632|        49.0675225|2.5524679000000003|
|122633|         49.063616|2.5522412000000005|
+------+------------------+------------------+
only showing top 5 rows



In [11]:
# get all way rows
way_df = raw_df.where("type = 1")

way_df.show(5)



+------+----+--------+---------+--------------------+---------+--------------------+--------------------+
|    id|type|latitude|longitude|               nodes|relations|                tags|                info|
+------+----+--------+---------+--------------------+---------+--------------------+--------------------+
|  2569|   1|    null|     null|[382017, 37454347...|       []|{lane_markings ->...|{20, 2023-03-29 0...|
|  2570|   1|    null|     null|[7223004840, 361709]|       []|{name -> Avenue d...|{24, 2023-06-29 0...|
|  2573|   1|    null|     null|[6865885341, 3617...|       []|{name -> Place de...|{20, 2024-03-23 2...|
|  2574|   1|    null|     null|[5467645835, 5185...|       []|{name -> Rue Tain...|{27, 2023-10-20 1...|
|286314|   1|    null|     null|[218267571, 49581...|       []|{cycleway:both ->...|{18, 2024-02-17 1...|
+------+----+--------+---------+--------------------+---------+--------------------+--------------------+
only showing top 5 rows



                                                                                

The latitude and longitude columns of ways are empty (null), so we can drop them. Notice also that each way carries a list of node ids in its `nodes` column. Linking these nodes in order reconstructs the way: the first node is its starting point and the last node is its end point. To keep things simple, we join each way with only the first node of its list and use that starting point as the position of the whole way.
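
To illustrate the idea without Spark, here is a minimal pure-Python sketch. It contrasts the shortcut used in this tutorial (first node only) with the full reconstruction of a way from all its nodes. The function names and the sample data are made up for illustration.

```python
from typing import Dict, List, Tuple

Coordinate = Tuple[float, float]  # (latitude, longitude)

# made-up node table: node id -> coordinate
nodes: Dict[int, Coordinate] = {
    1: (48.85, 2.35),
    2: (48.86, 2.36),
    3: (48.87, 2.37),
}


def way_start(node_ids: List[int], nodes: Dict[int, Coordinate]) -> Coordinate:
    """Position of a way as used in this tutorial: its first node."""
    return nodes[node_ids[0]]


def way_polyline(node_ids: List[int], nodes: Dict[int, Coordinate]) -> List[Coordinate]:
    """Full shape of a way: the coordinates of all its nodes, in order.

    Node ids missing from the table (e.g. nodes outside the extract) are skipped.
    """
    return [nodes[node_id] for node_id in node_ids if node_id in nodes]


way_node_ids = [1, 2, 3, 99]  # 99 is deliberately absent from the table
print(way_start(way_node_ids, nodes))     # (48.85, 2.35)
print(way_polyline(way_node_ids, nodes))  # [(48.85, 2.35), (48.86, 2.36), (48.87, 2.37)]
```

Using only the starting point is enough to place a marker on a map, but it loses the shape and extent of the way.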

In [12]:

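# drop the way id as well, so that `id` refers unambiguously to the node id after the join
way_simple_df = way_df.drop("id","latitude", "longitude")
# inner join on the first node of each way: ways whose first node is missing are dropped
way_with_gps_df = way_simple_df.join(
    node_simple_df, way_simple_df.nodes.getItem(0) == node_simple_df.id)

# keep the start coordinates and the tags of each way
way_trans_df = way_with_gps_df.select("latitude", "longitude", "tags")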


In [13]:
way_trans_df.show(5, truncate=False)



+------------------+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|latitude          |longitude         |tags                                                                                                                                                                                                                                                                           |
+------------------+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|48.74317940000001 |2.3225308         |{name -> Autoroute du Sol

                                                                                

In [14]:
node_trans_df = node_df.select("latitude", "longitude", "tags")

In [15]:
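# keep only health-care amenities, from both ways and nodes
# (despite its name, this DataFrame contains no food-related amenity)
hospital_and_food_df = way_trans_df.union(node_trans_df).\
    where("element_at(tags, 'amenity') in ('hospital', 'clinic','doctors')")
# cache the result, since it is reused by several cells below
hospital_and_food_df.cache()
hospital_and_food_df.show()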

                                                                                

+------------------+------------------+--------------------+
|          latitude|         longitude|                tags|
+------------------+------------------+--------------------+
|  48.8228839999944| 2.397936000000003|{website -> https...|
|49.101812299999956| 2.543690700000001|{name -> Maison d...|
|48.903260199999906| 2.305047599999987|{healthcare:speci...|
|48.885186399999995|2.4011061000000105|{name -> Clinique...|
|48.692460400000265|2.2671637000000016|{amenity -> docto...|
| 48.80802140000021|2.1349204000000053|{amenity -> hospi...|
|  48.8539673000001|         2.3481897|{operator:short -...|
| 48.80471840000002| 2.424477000000009|{website -> https...|
| 48.59296099999903|2.2486759999999957|{name -> Clinique...|
| 48.85146449999998|2.3714151999999813|{website -> http:...|
| 48.82344670000005|2.5382796000000027|{source -> cadast...|
|48.570080700000055| 2.432605999999993|{healthcare:speci...|
| 48.56786900000003| 2.431551999999988|{healthcare:speci...|
|48.764957100000224| 2.3

In [16]:
# filter on the amenity tag first, then keep only the coordinates
hospital_df = hospital_and_food_df.\
    where("element_at(tags, 'amenity') == 'hospital'").\
    select("latitude", "longitude")
hospital_df.createOrReplaceTempView("hospital")

clinic_df = hospital_and_food_df.\
    where("element_at(tags, 'amenity') == 'clinic'").\
    select("latitude", "longitude")
clinic_df.createOrReplaceTempView("clinic")

doctors_df = hospital_and_food_df.\
    where("element_at(tags, 'amenity') == 'doctors'").\
    select("latitude", "longitude")
doctors_df.createOrReplaceTempView("doctors")

In [17]:
hospital_number=hospital_df.count()
print(f"Total hospital number in Ile-de-France: {hospital_number}")



Total hospital number in Ile-de-France: 315


                                                                                

In [18]:
hospital_df.show(5,truncate=False)

+------------------+------------------+
|latitude          |longitude         |
+------------------+------------------+
|48.8228839999944  |2.397936000000003 |
|48.885186399999995|2.4011061000000105|
|48.80802140000021 |2.1349204000000053|
|48.8539673000001  |2.3481897         |
|48.59296099999903 |2.2486759999999957|
+------------------+------------------+
only showing top 5 rows



In [19]:
clinic_number=clinic_df.count()
print(f"Total clinic number in Ile-de-France: {clinic_number}")



Total clinic number in Ile-de-France: 219


                                                                                

In [20]:
doctors_number=doctors_df.count()
print(f"Total doctor office number in Ile-de-France: {doctors_number}")



Total doctor office number in Ile-de-France: 1294


                                                                                

In [23]:
icon_hospital = AwesomeIcon(
    name='h-square',
    marker_color='green',
    icon_color='darkgreen'
)

icon_clinic = AwesomeIcon(
    name='hospital-o',
    marker_color='red',
    icon_color='black'
)

icon_doctors = AwesomeIcon(
    name='user-md',
    marker_color='blue',
    icon_color='gray'
)

In [28]:
# build one marker per amenity (at most 250 per category)
hospital_pos = tuple([Marker(location=tuple(row), icon=icon_hospital) for row in hospital_df.limit(250).collect()])
clinic_pos  = tuple([Marker(location=tuple(row), icon=icon_clinic ) for row in clinic_df.limit(250).collect()])
doctors_pos  = tuple([Marker(location=tuple(row), icon=icon_doctors ) for row in doctors_df.limit(250).collect()])

# group nearby markers of the same category into clusters
marker_hospital = MarkerCluster(markers=hospital_pos)
marker_clinic = MarkerCluster(markers=clinic_pos)
marker_doctors = MarkerCluster(markers=doctors_pos)

# center the map on the mean position of the hospital and doctor markers
# (clinic markers are not included in the average)
latitudes =  np.array([x.location[0] for x in hospital_pos]+[x.location[0] for x in doctors_pos])
longitudes = np.array([x.location[1] for x in hospital_pos]+[x.location[1] for x in doctors_pos])
ce = [latitudes.mean(), longitudes.mean()]

m = Map(
    basemap=basemap_to_tiles(basemaps.OpenStreetMap.Mapnik),
    center=ce,
    layout=Layout(width='50%', height='800px'),
    zoom=7
)

m.add_layer(marker_hospital)
m.add_layer(marker_clinic)
m.add_layer(marker_doctors)

display(m)

Map(center=[48.8168256002, 2.336495916399999], controls=(ZoomControl(options=['position', 'zoom_in_text', 'zoo…
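
The remaining cells query two temporary views, `charger` and `fast_food`, whose definitions do not appear in the cells above. Here is a minimal sketch of how such views could be registered. It assumes (the notebook does not confirm this) that they were filtered on the OSM tags `amenity=charging_station` and `amenity=fast_food`. The helper name `register_amenity_view` is hypothetical.

```python
def register_amenity_view(df, amenity, view_name):
    """Keep the rows tagged with the given amenity and expose them as a temp view.

    `df` is expected to be a Spark DataFrame with `latitude`, `longitude`
    and `tags` columns, such as the union of ways and nodes built earlier.
    """
    subset = (
        df.where(f"element_at(tags, 'amenity') = '{amenity}'")
        .select("latitude", "longitude")
    )
    subset.createOrReplaceTempView(view_name)
    return subset
```

With the DataFrames of this notebook, the calls could look like `register_amenity_view(way_trans_df.union(node_trans_df), "charging_station", "charger")` and `register_amenity_view(way_trans_df.union(node_trans_df), "fast_food", "fast_food")`. The last map cell also expects `icon_charger` and `icon_fast_food`, which could be `AwesomeIcon` instances like the ones defined in cell In [23].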

In [14]:
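# epsg:25832 is ETRS89 / UTM zone 32N, which is centered further east than Ile-de-France;
# epsg:25831 (UTM zone 31N) or epsg:2154 (Lambert-93) may give more accurate distances here
epsg_code = "epsg:25832"

# note: ST_Point receives (latitude, longitude) here, matching the axis order that the
# Sedona 1.4.1 jar assumes for epsg:4326 in ST_Transform; check the docs of your Sedona version
charger_geo = sedona.sql(f"""
SELECT 
ST_Transform(ST_Point(CAST(latitude AS Decimal(24,20)), CAST(longitude AS Decimal(24,20))), 'epsg:4326', '{epsg_code}') AS charger_point 
FROM charger""")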
charger_geo.cache()
charger_geo.createOrReplaceTempView("charger_geo")

fast_food_geo = sedona.sql(f"""
SELECT ST_Transform(ST_Point(CAST(latitude AS Decimal(24,20)), CAST(longitude AS Decimal(24,20))), 'epsg:4326', '{epsg_code}') AS fast_food_point from fast_food
""")
fast_food_geo.cache()
fast_food_geo.createOrReplaceTempView("fast_food_geo")

print(f"Charger count:   {charger_geo.count()}")
print(f"Fast food count: {fast_food_geo.count()}")

                                                                                

Charger count:   40




Fast food count: 5913


                                                                                

In [15]:
food_near_charger_df = sedona.sql(f"""
SELECT 
ST_AsGeoJSON(
   ST_Transform(charger_geo.charger_point,     '{epsg_code}', 'epsg:4326')
) charger_point, 
ST_AsGeoJSON(
   ST_Transform(fast_food_geo.fast_food_point, '{epsg_code}', 'epsg:4326')
) fast_food_point, 
ST_Distance(
  charger_geo.charger_point, fast_food_geo.fast_food_point
) distance_meter
FROM charger_geo, fast_food_geo 
WHERE 
ST_Distance(charger_geo.charger_point, fast_food_geo.fast_food_point) <= 100
""").cache()

charger_near_df = food_near_charger_df.select("charger_point").distinct()
charger_near_df.cache()
food_near_df = food_near_charger_df.select("fast_food_point").distinct()
food_near_df.cache()

DataFrame[fast_food_point: string]
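
The 100 m threshold above applies to distances in the projected CRS. As a rough cross-check that does not need Spark, the great-circle distance between two WGS 84 points can be computed with the haversine formula. This sketch assumes a spherical Earth with a mean radius of 6,371,008.8 m, so its results will differ slightly from `ST_Distance` on projected coordinates. The function names and sample coordinates are made up.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius (spherical approximation)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two (latitude, longitude) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def is_within(lat1: float, lon1: float, lat2: float, lon2: float,
              max_m: float = 100.0) -> bool:
    """True when the two points are at most `max_m` meters apart."""
    return haversine_m(lat1, lon1, lat2, lon2) <= max_m


# two made-up points 0.0005 degrees of latitude apart (roughly 56 m)
print(is_within(48.8500, 2.3500, 48.8505, 2.3500))
# two made-up points 0.01 degrees of latitude apart (roughly 1.1 km)
print(is_within(48.85, 2.35, 48.86, 2.35))
```

Checking a handful of the matched pairs this way is a quick way to confirm that the projection and the axis order used in `ST_Transform` produce plausible distances.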

In [16]:
print(f"Fast food: {food_near_df.count()}")
print(f"Charger: {charger_near_df.count()}")

                                                                                

Fast food: 8




Charger: 7


                                                                                

In [19]:
charger_pos = tuple([Marker(location=tuple(json.loads(row["charger_point"])["coordinates"]), icon=icon_charger) for row in charger_near_df.collect()])
burger_pos  = tuple([Marker(location=tuple(json.loads(row["fast_food_point"])["coordinates"]), icon=icon_fast_food) for row in food_near_df.collect()])

marker_charger = MarkerCluster(markers=charger_pos)
marker_burger = MarkerCluster(markers=burger_pos)

latitudes =  np.array([x.location[0] for x in charger_pos]+[x.location[0] for x in burger_pos])
longitudes = np.array([x.location[1] for x in charger_pos]+[x.location[1] for x in burger_pos])

ce = [latitudes.mean(), longitudes.mean()]


m = Map(
    basemap=basemap_to_tiles(basemaps.OpenStreetMap.Mapnik),
    center=ce,
    layout=Layout(width='50%', height='800px'),
    zoom=7
)

m.add_layer(marker_charger)
m.add_layer(marker_burger)

display(m)

                                                                                

Map(center=[48.63091086836675, 2.5216525934245593], controls=(ZoomControl(options=['position', 'zoom_in_text',…