This tutorial shows how to read in Open Street Map (OSM) data, and then rasterize it using GeoPySpark.
Note: This guide is aimed at users who are already familiar with GeoPySpark.
To start, let's first grab an orc file, which a special file type that is optimized for Hadoop operations. The following command will use curl
to download the file from S3
and move it to the /tmp
directory.
curl -o /tmp/boyertown.orc https://s3.amazonaws.com/geopyspark-test/example-files/boyertown.orc
A side note: Files can be retrieved directly from S3. However, this could not be done in this instance due to permission requirements needed to access the file.
Now that we have our data, we can now read it in and begin to work with.
import geopyspark as gps
from pyspark import SparkContext
conf = gps.geopyspark_conf(appName="osm-rasterize-example", master="local[*]")
pysc = SparkContext(conf=conf)
features = gps.osm_reader.from_orc("/tmp/boyertown.orc")
The above code sets up a SparkContext
and then reads in the boyertown.orc
file as features
, which is an instance of FeaturesCollection
.
When OSM data is read into GeoPySpark, each OSM Element is turned into single or multiple different geometries. With each of these geometries retaining the metadata from the derived OSM Element. These geometry metadata pairs are referred to as a Feature
. These features are grouped together by the type of geometry they contain. When accessing features from a FeaturesCollection
, it is done by geometry.
There are four different types of geometries in the FeaturesCollection
:
Point
Line
Polygon
MultiPolygon
For this example, we're interested in rasterizing the Line
s and Polygon
s from the OSM data, so we will select those Feature
s from the FeaturesCollection
that contain them. The following code will create a Python RDD
of Feature
s that contains all Line
geometries (lines
), and a Python RDD
that contains all Polygon
geometries (polygons
).
lines = features.get_line_features_rdd()
polygons = features.get_polygon_features_rdd()
When we rasterize the Polygon
Feature
s, we'd like for schools to have a different value than all of the other Polygon
s. However, we are unsure if any schools were contained within the original data, and we'd like to see if any are. One method we could use to determine if there are schools is to look at the tags of the Polygon
Feature
s. The following code will show all of the unique tags for all of the Polygon
s in the collection.
features.get_polygon_tags()
Which has the following output:
{'NHD:ComID': '25964412',
'NHD:Elevation': '0.00000000000',
'NHD:FCode': '39004',
'NHD:FDate': '2001/08/16',
'NHD:FTYPE': 'LakePond',
'NHD:Permanent_': '25964412',
'NHD:ReachCode': '02040203004486',
'NHD:Resolution': 'High',
'addr:city': 'Gilbertsville',
'addr:housenumber': '1100',
'addr:postcode': '19525',
'addr:state': 'PA',
'addr:street': 'E Philadelphia Avenue',
'amenity': 'school',
'area': 'yes',
'building': 'yes',
'leisure': 'pitch',
'name': 'Boyertown Area Junior High School-West Center',
'natural': 'water',
'railway': 'platform',
'smoking': 'outside',
'source': 'Yahoo',
'sport': 'baseball',
'tourism': 'museum',
'wikidata': 'Q8069423',
'wikipedia': "en:Zern's Farmer's Market"}
So it appears that there are schools in this dataset, and that we can continue on.
Now that we have our Feature
s, it's time to assign them values. The reason we need to do so is because when a vector becomes a raster, its cells need to have some kind of value. When rasterizing Feature
s, each geometry contained within it will be given a single value, and all cells that intersect that shape will have that value. In addition to value of the actual cells, there's another property that we will want to set for each Feature
, Z-Index
.
The Z-Index
of a Feature
determines what value a cell will be if more than one geometry intersects it. With a higher Z-Index
taking priority over a lower one. This is important as there may be cases where multiple geometries are present at a single cell, but that cell can only contain one value.
For this example, we are going to want all Polygon
s to have a higher Z-Index
than the Line
s. In addition, since we're interested in schools, Polygon
s that are labeled as schools will have a greater Z-Index
than other Polygon
s.
mapped_lines = lines.map(lambda feature: gps.Feature(feautre.geometry, gps.CellValue(value=1, zindex=1)))
def assign_polygon_feature(feature):
tags = feature.properties.tags.values()
if 'school' in tags.values():
return gps.Feature(feature.geometry, gps.CellValue(value=3, zindex=3))
else:
return gps.Feature(feature.geometry, gps.CellValue(value=2, zindex=2))
mapped_polygons = polygons.map(assign_polygon_feature)
We create the mapped_lines
variable that contains an RDD
of Feature
s, where each Feature
has a CellValue
with a value
and zindex
of 1. The assign_polygon_feature
function is then created which will test to see if a Polygon
is a school or not. If it is, then the resulting Feature
will have a CellValue
with a value
and zindex
of 3. Otherwise, those two values will be 2.
Now that the Feature
s have been given CellValue
s, it is now time to rasterize them.
unioned_features = pysc.union((mapped_lines, mapped_polygons))
rasterized_layer = gps.rasterize_features(features=unioned_features, crs=4326, zoom=12)
The rasterize_features
function requires a single RDD
of Feature
s. Therefore, we union together mapped_lines
and mapped_polygons
which gives us unioned_features
. Along with passing in our RDD
, we must also set the crs
and zoom
of the resulting Layer. In this case, the crs
is in LatLng
, so we set it to be 4326
. zoom
varies between use cases, so it was just chosen arbitrarily for this example. The resulting rasterized_layer
is a TiledRasterLayer
that we can now analyze and/or ingest.