# Apache Sedona and OSM

In this tutorial, we use Apache Sedona to read geospatial data provided by OpenStreetMap (OSM).
OSM distributes its data in several formats (e.g. PBF, shapefile). Here we use the `PBF format` ([Protocolbuffer Binary Format](https://wiki.openstreetmap.org/wiki/PBF_Format)).

## 1. Get the sample data

## 2. Read the PBF format with Sedona (Spark)

By default, Sedona cannot read PBF files directly. However, there is a `spark polyglot connector` called [osm4scala](https://simplexspatial.github.io/osm4scala/) that can. You can also visit its GitHub page [here](https://github.com/simplexspatial/osm4scala).

To make it work, add the connector to the Spark session in one of two ways:
- online mode: in the SparkSession builder, add **.config('spark.jars.packages', 'com.acervera.osm4scala:osm4scala-spark3-shaded_2.12:1.0.11')**
- offline mode: download the jar file first, then add **.config('spark.jars', jar_path)** to the SparkSession builder

> You can find more information about the jar file [here](https://simplexspatial.github.io/osm4scala/docs/spark-connector)
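
The two modes differ only in which Spark property carries the connector. Here is a minimal sketch of that choice. `osm4scala_conf` is a hypothetical helper and the jar path is a placeholder; only the Maven coordinate comes from this tutorial.

```python
from typing import Dict, Optional

# Maven coordinate quoted in this tutorial.
OSM4SCALA_PACKAGE = "com.acervera.osm4scala:osm4scala-spark3-shaded_2.12:1.0.11"


def osm4scala_conf(jar_path: Optional[str] = None) -> Dict[str, str]:
    """Return the Spark property that makes the osm4scala connector available.

    Hypothetical helper: without jar_path (online mode) Spark resolves the
    package from Maven; with jar_path (offline mode) Spark loads a local jar.
    """
    if jar_path is None:
        return {"spark.jars.packages": OSM4SCALA_PACKAGE}
    return {"spark.jars": jar_path}


online = osm4scala_conf()
offline = osm4scala_conf("/path/to/osm4scala-spark3-shaded.jar")  # placeholder path
print(online)
print(offline)
```

Each key/value pair can then be passed to `.config(key, value)` on the session builder.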

## 3. Understand the basic OSM data structure

OpenStreetMap uses a `topological data structure`, with `four core elements` (aka data primitives):

- **Nodes**: are points with a geographic position, stored as coordinates (pairs of a latitude and a longitude) according to `WGS 84`. Outside their usage in ways, they are used to represent map features without a size, such as `points of interest` or mountain peaks.
- **Ways**: are ordered lists of `nodes`, representing a `polyline, or possibly a polygon` if they form a closed loop. They are used both for representing linear features such as streets and rivers, and areas, like forests, parks, parking areas and lakes.
- **Relations**: are `ordered lists of nodes, ways and relations` (together called "members"), where each member can optionally have a "role" (a string). Relations are used for representing the relationship of existing nodes and ways. Examples include turn restrictions on roads, routes that span several existing ways (for instance, a long-distance motorway), and areas with holes.
- **Tags**: are `key-value pairs (both arbitrary strings)`. They are used to store metadata about the map objects (such as their type, their name and their physical properties). `Tags are not freestanding, but are always attached to an object: to a node, a way or a relation`. A recommended ontology of map features (the meaning of tags) is maintained on the OpenStreetMap wiki. New tagging schemes can be proposed through a written proposal and a popular vote on that wiki; however, there is no requirement to follow this process. There were over 89 million different kinds of tags in use as of June 2017.
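
To make the four primitives concrete, here is a small sketch of them as Python dataclasses. The class and field names are illustrative assumptions, not an official OSM or osm4scala API, and the sample ids and tags are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    id: int
    lat: float  # WGS 84 latitude
    lon: float  # WGS 84 longitude
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Way:
    id: int
    node_ids: List[int]  # ordered references to node ids
    tags: Dict[str, str] = field(default_factory=dict)

    def is_closed(self) -> bool:
        # A way that starts and ends on the same node forms a closed loop.
        return len(self.node_ids) > 2 and self.node_ids[0] == self.node_ids[-1]


@dataclass
class Member:
    kind: str  # "node", "way" or "relation"
    ref: int   # id of the referenced element
    role: str = ""


@dataclass
class Relation:
    id: int
    members: List[Member]
    tags: Dict[str, str] = field(default_factory=dict)


# Invented sample elements
peak = Node(1, 45.83, 6.86, {"natural": "peak"})
street = Way(10, [1, 2, 3], {"highway": "residential"})
park = Way(11, [4, 5, 6, 4], {"leisure": "park"})
route = Relation(100, [Member("way", 10, "forward")], {"type": "route"})
print(street.is_closed(), park.is_closed())
```

Tags appear only as a field of the other three classes, which mirrors the rule that tags are never freestanding. Whether a closed way is treated as an area also depends on its tags, which this sketch does not model.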

In [None]:
from sedona.spark import *
from pathlib import Path
import json

from ipyleaflet import Map, basemaps, basemap_to_tiles, MarkerCluster, Marker, AwesomeIcon
from ipywidgets import Layout
import numpy as np

In [2]:
# build a sedona session (sedona = 1.5.1)
# note: the jars requested below are Sedona 1.4.1 and geotools-wrapper 1.4.0-28.2
config = SedonaContext.builder() \
    .appName("Sedona with pyspark") \
    .master("local[*]") \
    .config("spark.driver.memory", "6g") \
    .config('spark.jars.packages',
            'com.acervera.osm4scala:osm4scala-spark3-shaded_2.12:1.0.11,'
            'org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.4.1,'
            'org.datasyslab:geotools-wrapper:1.4.0-28.2') \
    .getOrCreate()

# create a sedona context
sedona = SedonaContext.create(config)

24/04/11 10:05:06 WARN Utils: Your hostname, pengfei-Virtual-Machine resolves to a loopback address: 127.0.1.1; using 10.50.2.80 instead (on interface eth0)
24/04/11 10:05:06 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/home/pengfei/opt/spark/spark-3.3.0/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/pengfei/.ivy2/cache
The jars for the packages stored in: /home/pengfei/.ivy2/jars
com.acervera.osm4scala#osm4scala-spark3-shaded_2.12 added as a dependency
org.apache.sedona#sedona-spark-shaded-3.0_2.12 added as a dependency
org.datasyslab#geotools-wrapper added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-0e67dc31-cc58-4873-b7ed-443abd90fa96;1.0
	confs: [default]
	found com.acervera.osm4scala#osm4scala-spark3-shaded_2.12;1.0.11 in central
	found org.apache.sedona#sedona-spark-shaded-3.0_2.12;1.4.1 in central
	found org.datasyslab#geotools-wrapper;1.4.0-28.2 in central
:: resolution report :: resolve 362ms :: artifacts dl 16ms
	:: modules in use:
	com.acervera.osm4scala#osm4scala-spark3-shaded_2.12;1.0.11 from central in [default]
	org.apache.sedona#sedona-spark-shaded-3.0_2.12;1.4.1 from central in [default]
	org.datasyslab#geotools-wrapper;1.4.0-28.2 from central in [default]
	---------------------------------------

24/04/11 10:05:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
                                                                                

In [3]:
homePath = "/home/pengfei/data_set/geo_spatial"
filePath = f"{homePath}/ile-de-france-latest.osm.pbf"

In [4]:
raw_df = sedona.read.format("osm.pbf").load(filePath)

In [5]:
raw_df.show()

+------+----+------------------+------------------+-----+---------+--------------------+--------------------+
|    id|type|          latitude|         longitude|nodes|relations|                tags|                info|
+------+----+------------------+------------------+-----+---------+--------------------+--------------------+
|122626|   0|49.115966300000004|         2.5549119|   []|       []|                  {}|{3, 2020-05-10 11...|
|122627|   0|49.110294100000004|         2.5521725|   []|       []|                  {}|{4, 2009-02-13 19...|
|122631|   0|        49.0834393|2.5511375000000003|   []|       []|                  {}|{15, 2021-06-30 1...|
|122632|   0|        49.0675225|2.5524679000000003|   []|       []|                  {}|{17, 2019-04-10 1...|
|122633|   0|         49.063616|2.5522412000000005|   []|       []|                  {}|{17, 2009-02-13 1...|
|122634|   0|        49.0597465|2.5509097000000005|   []|       []|                  {}|{2, 2009-02-13 19...|
|122635|  

In [6]:
raw_df.printSchema()

root
 |-- id: long (nullable = true)
 |-- type: byte (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- nodes: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- relations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- relationType: byte (nullable = true)
 |    |    |-- role: string (nullable = true)
 |-- tags: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- info: struct (nullable = true)
 |    |-- version: integer (nullable = true)
 |    |-- timestamp: timestamp (nullable = true)
 |    |-- changeset: long (nullable = true)
 |    |-- userId: integer (nullable = true)
 |    |-- userName: string (nullable = true)
 |    |-- visible: boolean (nullable = true)


In [7]:
outPath = f"{homePath}/ile-de-france-geo-parquet"

In [11]:
raw_df.write.mode("overwrite").format("parquet").save(outPath)

                                                                                

## 4. Separate the nodes and ways



In [7]:
# type 0 = node, type 1 = way
node_df = raw_df.where("type = 0")
way_df = raw_df.where("type = 1")

In [8]:
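# keep only the id and coordinates of each node
node_simple_df = node_df.select("id", "latitude", "longitude")
# drop the ways' own id and coordinate columns, then attach the position
# of the first node of each way
way_simple_df = way_df.drop("id", "latitude", "longitude")
way_with_gps_df = way_simple_df.join(
    node_simple_df, way_simple_df.nodes.getItem(0) == node_simple_df.id)

# align both DataFrames on the same three columns so they can be unioned later
way_simple_df = way_with_gps_df.select("latitude", "longitude", "tags")
node_simple_df = node_df.select("latitude", "longitude", "tags")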
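
The join above gives each way the coordinates of its first node only, so a way is represented by a single point rather than by its full geometry. Here is a plain-Python sketch of the same logic on invented data. `ways_with_first_node` is a hypothetical name.

```python
from typing import Any, Dict, List, Tuple

# Invented sample data mirroring the columns used above
nodes: Dict[int, Tuple[float, float]] = {
    1: (48.85, 2.35),
    2: (48.86, 2.36),
}
ways: List[Dict[str, Any]] = [
    {"nodes": [1, 2], "tags": {"amenity": "fast_food"}},
    {"nodes": [99, 1], "tags": {"highway": "residential"}},  # first node unknown
    {"nodes": [], "tags": {}},                                # no nodes at all
]


def ways_with_first_node(ways, nodes):
    """Inner join on way.nodes[0] == node.id, keeping latitude, longitude, tags."""
    out = []
    for way in ways:
        if not way["nodes"]:
            continue  # nothing to join on
        first = way["nodes"][0]
        if first in nodes:  # inner join: unmatched ways are dropped
            lat, lon = nodes[first]
            out.append({"latitude": lat, "longitude": lon, "tags": way["tags"]})
    return out


result = ways_with_first_node(ways, nodes)
print(result)
```

Because this is an inner join, ways whose first node is missing from the extract silently disappear from the result.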

In [22]:
way_simple_df.show(truncate=False)




+------------------+------------------+----+
|latitude          |longitude         |tags

                                                                                

In [23]:
node_simple_df.show(truncate=False)

+------------------+------------------+-----------------------------------------------------------------------------------------------------------------------+
|latitude          |longitude         |tags                                                                                                                   |
+------------------+------------------+-----------------------------------------------------------------------------------------------------------------------+
|49.115966300000004|2.5549119         |{}                                                                                                                     |
|49.110294100000004|2.5521725         |{}                                                                                                                     |
|49.0834393        |2.5511375000000003|{}                                                                                                                     |
|49.0675225        |2.5524679000000003|{

In [9]:
charge_and_food_df = way_simple_df.union(node_simple_df).\
    where("element_at(tags, 'amenity') in ('charging_station', 'fast_food')")
charge_and_food_df.cache()

DataFrame[latitude: double, longitude: double, tags: map<string,string>]

In [10]:
# Filter on tags first, then keep only the coordinates.
# The text left after stripping ' kW' is compared to 50 through Spark's implicit cast.
charger_df = charge_and_food_df.where(
    "element_at(tags, 'amenity') == 'charging_station' "
    "and instr(element_at(tags, 'socket:type2_combo:output'), ' kW') > 0 "
    "and replace(element_at(tags, 'socket:type2_combo:output'), ' kW', '') > 50"
).select("latitude", "longitude")
charger_df.createOrReplaceTempView("charger")

fast_food_df = charge_and_food_df.where(
    "element_at(tags, 'amenity') == 'fast_food'"
).select("latitude", "longitude")
fast_food_df.createOrReplaceTempView("fast_food")
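
The charger filter above packs three conditions into one SQL string. Here is a plain-Python sketch of the same intent that can be tried on a single tag map. `output_kw` and `is_fast_charger` are hypothetical helpers. They parse the number as a float, which may differ from Spark's implicit cast on unusual values.

```python
from typing import Dict, Optional

OUTPUT_KEY = "socket:type2_combo:output"


def output_kw(tags: Dict[str, str]) -> Optional[float]:
    """Parse a value such as '150 kW' into 150.0; None if absent or malformed."""
    raw = tags.get(OUTPUT_KEY)
    if raw is None or " kW" not in raw:
        return None
    try:
        return float(raw.replace(" kW", ""))
    except ValueError:
        return None


def is_fast_charger(tags: Dict[str, str], min_kw: float = 50.0) -> bool:
    """True for a charging station whose tagged output is strictly above min_kw."""
    kw = output_kw(tags)
    return tags.get("amenity") == "charging_station" and kw is not None and kw > min_kw


# Invented tag maps
fast = {"amenity": "charging_station", OUTPUT_KEY: "150 kW"}
slow = {"amenity": "charging_station", OUTPUT_KEY: "22 kW"}
print(is_fast_charger(fast), is_fast_charger(slow))
```

Values written without the space (for example `150kW`) are rejected, just as the `instr(..., ' kW') > 0` condition rejects them.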

In [11]:
icon_charger = AwesomeIcon(
    name='fa-battery-full',
    marker_color='green',
    icon_color='darkgreen'
)

icon_fast_food = AwesomeIcon(
    name='fa-cutlery',
    marker_color='red',
    icon_color='black'
)

In [12]:
charger_pos = tuple([Marker(location=tuple(row), icon=icon_charger) for row in charger_df.limit(250).collect()])
fast_food_pos  = tuple([Marker(location=tuple(row), icon=icon_fast_food ) for row in fast_food_df.limit(250).collect()])

marker_charger = MarkerCluster(markers=charger_pos)
marker_fast_food = MarkerCluster(markers=fast_food_pos)

latitudes =  np.array([x.location[0] for x in charger_pos]+[x.location[0] for x in fast_food_pos])
longitudes = np.array([x.location[1] for x in charger_pos]+[x.location[1] for x in fast_food_pos])
ce = [latitudes.mean(), longitudes.mean()]

m = Map(
    basemap=basemap_to_tiles(basemaps.OpenStreetMap.Mapnik),
    center=ce,
    layout=Layout(width='50%', height='800px'),
    zoom=7
)

m.add_layer(marker_charger)
m.add_layer(marker_fast_food)

display(m)

                                                                                

Map(center=[48.79681462758624, 2.3822608034482764], controls=(ZoomControl(options=['position', 'zoom_in_text',…

In [14]:
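# EPSG:25832 is ETRS89 / UTM zone 32N, a projected CRS with metre units. Île-de-France lies
# in UTM zone 31N, so EPSG:25831 or EPSG:2154 (Lambert-93) would fit this area more closely.
epsg_code = "epsg:25832"

# ST_Point takes (x, y) and latitude is passed as x here. Sedona releases before 1.5.0 (the jar
# above is 1.4.1) are documented as reading EPSG:4326 in lat/lon order; newer ones expect lon/lat.
charger_geo = sedona.sql(f"""
SELECT
  ST_Transform(
    ST_Point(CAST(latitude AS Decimal(24,20)), CAST(longitude AS Decimal(24,20))),
    'epsg:4326', '{epsg_code}'
  ) AS charger_point
FROM charger""")
charger_geo.cache()
charger_geo.createOrReplaceTempView("charger_geo")

fast_food_geo = sedona.sql(f"""
SELECT
  ST_Transform(
    ST_Point(CAST(latitude AS Decimal(24,20)), CAST(longitude AS Decimal(24,20))),
    'epsg:4326', '{epsg_code}'
  ) AS fast_food_point
FROM fast_food
""")
fast_food_geo.cache()
fast_food_geo.createOrReplaceTempView("fast_food_geo")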

print(f"Charger count:   {charger_geo.count()}")
print(f"Fast food count: {fast_food_geo.count()}")

                                                                                

Charger count:   40




Fast food count: 5913
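
The projection used above, EPSG:25832, is ETRS89 / UTM zone 32N. Here is a small sketch for checking which UTM zone a longitude falls into. `utm_zone` and `etrs89_utm_epsg` are hypothetical helpers. The `25800 + zone` pattern is an assumption, restricted here to a conservative range of European zones. For the map centre printed earlier (longitude about 2.38) the sketch points to zone 31, so EPSG:25831 may be a closer fit for Île-de-France, although the distortion over 100 m is likely small either way.

```python
import math


def utm_zone(longitude: float) -> int:
    """Standard 6-degree UTM zone number (ignores the Norway/Svalbard exceptions)."""
    return int(math.floor((longitude + 180.0) / 6.0)) % 60 + 1


def etrs89_utm_epsg(longitude: float) -> str:
    """EPSG code string of the ETRS89 / UTM zone covering this longitude.

    Assumption: the 258xx series is used only for zones 28 to 37.
    """
    zone = utm_zone(longitude)
    if not 28 <= zone <= 37:
        raise ValueError(f"zone {zone} is outside the range assumed for ETRS89 / UTM")
    return f"epsg:{25800 + zone}"


paris_lon = 2.3822608  # longitude of the map centre printed earlier
print(utm_zone(paris_lon), etrs89_utm_epsg(paris_lon))
```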


                                                                                

In [15]:
# Pair each charger with every fast food place within 100 m.
# Distances are computed in the projected CRS (metres); points are transformed
# back to EPSG:4326 only for display.
food_near_charger_df = sedona.sql(f"""
SELECT
  ST_AsGeoJSON(
    ST_Transform(charger_geo.charger_point, '{epsg_code}', 'epsg:4326')
  ) AS charger_point,
  ST_AsGeoJSON(
    ST_Transform(fast_food_geo.fast_food_point, '{epsg_code}', 'epsg:4326')
  ) AS fast_food_point,
  ST_Distance(
    charger_geo.charger_point, fast_food_geo.fast_food_point
  ) AS distance_meter
FROM charger_geo, fast_food_geo
WHERE
  ST_Distance(charger_geo.charger_point, fast_food_geo.fast_food_point) <= 100
""").cache()

charger_near_df = food_near_charger_df.select("charger_point").distinct()
charger_near_df.cache()
food_near_df = food_near_charger_df.select("fast_food_point").distinct()
food_near_df.cache()

DataFrame[fast_food_point: string]
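
The query above measures distances in a projected CRS. As an independent cross-check, the same "within 100 m" pairing can be sketched in plain Python with the haversine great-circle formula applied directly to latitude/longitude pairs. This is a different technique from the one Sedona uses here. The function names and sample coordinates are invented, and a spherical Earth of radius 6,371 km is assumed.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation

Point = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two WGS 84 positions."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def pairs_within(chargers: List[Point], foods: List[Point],
                 max_m: float = 100.0) -> List[Tuple[Point, Point, float]]:
    """Brute-force cross join keeping the pairs at most max_m metres apart."""
    out = []
    for c in chargers:
        for f in foods:
            d = haversine_m(c[0], c[1], f[0], f[1])
            if d <= max_m:
                out.append((c, f, d))
    return out


# Invented coordinates: one charger, one nearby and one distant fast food place
chargers = [(48.8500, 2.3500)]
foods = [(48.8505, 2.3500), (48.8600, 2.3500)]
near = pairs_within(chargers, foods)
print(near)
```

The brute-force loop is quadratic, which is acceptable for the 40 chargers counted above but would not scale to large inputs.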

In [16]:
print(f"Fast food: {food_near_df.count()}")
print(f"Charger: {charger_near_df.count()}")

                                                                                

Fast food: 8




Charger: 7


                                                                                

In [19]:
charger_pos = tuple([Marker(location=tuple(json.loads(row["charger_point"])["coordinates"]), icon=icon_charger) for row in charger_near_df.collect()])
burger_pos  = tuple([Marker(location=tuple(json.loads(row["fast_food_point"])["coordinates"]), icon=icon_fast_food) for row in food_near_df.collect()])

marker_charger = MarkerCluster(markers=charger_pos)
marker_burger = MarkerCluster(markers=burger_pos)

latitudes =  np.array([x.location[0] for x in charger_pos]+[x.location[0] for x in burger_pos])
longitudes = np.array([x.location[1] for x in charger_pos]+[x.location[1] for x in burger_pos])

ce = [latitudes.mean(), longitudes.mean()]


m = Map(
    basemap=basemap_to_tiles(basemaps.OpenStreetMap.Mapnik),
    center=ce,
    layout=Layout(width='50%', height='800px'),
    zoom=7
)

m.add_layer(marker_charger)
m.add_layer(marker_burger)

display(m)

                                                                                

Map(center=[48.63091086836675, 2.5216525934245593], controls=(ZoomControl(options=['position', 'zoom_in_text',…
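
The last cell feeds the GeoJSON coordinates straight into `Marker`, which expects `(latitude, longitude)`. The printed map centre (about 48.63, 2.52) suggests the first coordinate is the latitude in this run, whereas GeoJSON normally puts longitude first. Here is a small sketch that makes the order explicit. `marker_location` is a hypothetical helper and the sample strings are invented.

```python
import json
from typing import Tuple


def marker_location(geojson_point: str, lat_first: bool = True) -> Tuple[float, float]:
    """Return (latitude, longitude) from a GeoJSON Point string.

    lat_first=True matches the coordinate order seen in this notebook;
    use lat_first=False for standard GeoJSON (longitude, latitude).
    """
    geom = json.loads(geojson_point)
    if geom.get("type") != "Point":
        raise ValueError(f"expected a Point, got {geom.get('type')!r}")
    first, second = geom["coordinates"][:2]
    return (first, second) if lat_first else (second, first)


as_in_notebook = '{"type": "Point", "coordinates": [48.63, 2.52]}'
standard_geojson = '{"type": "Point", "coordinates": [2.52, 48.63]}'
print(marker_location(as_in_notebook))
print(marker_location(standard_geojson, lat_first=False))
```

Keeping the order as an explicit parameter makes it easier to adapt the notebook if the Sedona version, and with it the axis order, changes.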