Spark job to perform massive Point in Polygon (PiP) operations
Scala Python Shell
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
data
docker
media
src
.gitignore
LICENSE
README.md
application.properties
local.properties
local.sh
pom.xml
yarn-client.sh

README.md

Spark Point In Polygon

Spark job to perform massive Point in Polygon (PiP) operations on a distributed share-nothing cluster.

In this job context, a feature is defined to have a geometry and a set of attributes. The geometry can be a point, a polygon or a multi-polygon, and an attribute is a string encoded value.

The job takes 2 inputs, a path to a set of files containing point features and a path to a set of files containing polygon features. The job iterates through every point feature and identifies which polygon it is inside of and then emits a union of user-defined attributes from the point feature and the polygon feature.

In this current implementation, each point feature is read from a field delimited text file where the x and y coordinates are double-typed fields. And each polygon feature is also read from a field delimited text file where the coordinates are defined in a WKT format.

Here is a point feature sample:

111.509,80.9801,7.07783,R14

And here is polygon feature sample:

0|EC02|ECU-AZU|Azuay|EC|ECU|Ecuador|555881|Province|Provincia|8788.83|3393.37|1|0|MULTIPOLYGON (((-79.373981904060528 -3.3344269140348786, ...

Note that in the above sample, the geometry point fields are separated by a comma, where the feature fields are separated by a pipe |. The delimited characters can be defined at execution time.

Update: Nov 4, 2016

  • JDK 8 is a requirement.
  • Updated the map phase where, rather than sending to the reducer the whole geometry for each grid cell key, now the clipped geometry for a cell is sent to the reduce phase. This minimizes the memory consumption and the network traffic by transferring smaller geometries.
  • Updated Docker content to Spark 1.6.2 and JDK 8 to accommodate the clip function in org.geotools:gt-main:15.2.

Building the Project

The build process uses Maven

mvn package

This will create a Spark jar in the target folder, and any runtime jar dependencies, will be copied to the target/libs folder.

Configuring the job

The job can accept a command line argument that is a file in a properties format. If no argument is found, then it tries to load a file named applications.properties from the current directory.

Key Description Default Value
geometry.precision The geometry factory precision 1000000.0
extent.xmin minimum horizontal bounding box filter -180.0
extent.xmax maximum horizontal bounding box filter 180.0
extent.ymin minimum vertical bounding box filter -90.0
extent.ymax maximum vertical bounding box filter 90.0
points.path path of point features must be defined
points.sep the points field character separator tab
points.x the index of the horizontal point coordinate 0
points.y the index of the vertical point coordinate 1
points.fields comma separated list of field value indexes to emit empty
polygons.path path of polygon features must be defined
polygons.sep the polygons field character separator tab
polygons.wky the index of the WKT field 0
output.path path where to write the emitted fields must be defined
output.sep the output fields character separator tab
reduce.size the grid size of the spatial join (see below for details) 1.0

The file can also include Spark configuration properties

Implementation details

The data can be massive and cannot fit in memory all at once to perform the PiP operations and to create a spatial index of all the polygons for quick lookup. So...you will have to segment the input data and operate on each segment individually. And since these segments share nothing, they can be processed in parallel. In the geospatial domain, segmentation takes the form of subdividing the area of interest into small rectangular areas. Rectangles are "nice", as you can quickly find out if a point is inside of it or outside of it, and polygonal shapes can be subdivided into coarse edge matching rectangles. It is that last feature that we will use to segment the input points and polygons, in such that that we can perform a massive PiP per segment/rectangle.

Take for example the below points and polygons in an area of interest.

Feature Polygon A can be segmented into (0,1) (0,2) (0,3) (1,1) (1,2) (1,3) (2,1) (2,2) (2,3).

Feature Polygon B can be segmented into (1,0) (1,1) (2,0) (2,1) because of its bounding box.

Feature Polygon C can be segmented into (3,1) (3,2).

Feature Point A can be segmented into (1,3).

Feature Point B can be segmented into (1,2).

Feature Point C can be segmented into (1,0).

Feature Point D can be segmented into (3,0).

The features are now encapsulated by their segments and are operated on in parallel.

Segment (1,2) has Polygon A and Point B. And since Point B is inside Polygon A then, the selected attributed of Point B and the selected attributes of Polygon A will be written to the output file.

In the case of Segment (1,3), it contains Polygon A and Point A. However, because Point A is not inside Polygon A, then nothing is written out.

In this implementation, the rectangle is actually square, and the size of the square (the segmentation area) is defined by the reduce.size properties. This value is very data specific and you might have to run the job a multiple times with different values while tracking the execution time to determine the "sweet spot".

Running the job

It is assumed that Spark was downloaded and installed in a folder that is hereon referred to as ${SPARK_HOME}

Unzip data/points.tsv.zip and data/world.tsv.zip into the /tmp folder.

points.tsv is a 1,000,000 randomly generated points and world.tsv contains the world administrative polygons.

Run In Local Mode

${SPARK_HOME}/bin/spark-submit\
 target/spark-pip-0.3.jar\
 local.properties

The output can be viewed using:

more /tmp/output/part-*

Run on a Hadoop Cluster

For testing purposes, a Hadoop pseudo cluster can be created using Docker with HDFS, YARN and Spark. Check out README.md in the docker folder.

In the top level cloned folder, execute:

docker run\
  -it\
  --rm=true\
  --volume=$(pwd):/spark-pip\
  -h boot2docker\
  -p 8088:8088\
  -p 9000:9000\
  -p 50010:50010\
  -p 50070:50070\
  -p 50075:50075\
  mraad/hdfs\
  /etc/bootstrap.sh

Put the supporting data in HDFS:

hdfs dfsadmin -safemode wait
unzip -p /spark-pip/data/world.tsv.zip | hdfs dfs -put - /world/world.tsv
unzip -p /spark-pip/data/points.tsv.zip | hdfs dfs -put - /points/points.tsv

The following reads by default the application properties from the file application.properties in the current folder if no file argument is passed along.

cd /spark-pip
hdfs dfs -rm -r -skipTrash /output
time spark-submit\
 --master yarn-client\
 --driver-memory 512m\
 --num-executors 2\
 --executor-cores 1\
 --executor-memory 2g\
 target/spark-pip-0.3.jar

View Data in HDFS

The output of the previous job resides in HDFS as CSV text files in hdfs://boot2docker:9000/output/part-*. This form of output from a BigData job is what I term GIS Data and needs to be visualized. The HDFS configuration in the docker container has the WebHDFS REST API enabled.

<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>

The PiPToolbox ArcPy toolbox contains Load Points as a tool to open and read the content of CSV files in HDFS and converts each row into a feature in an in-memory feature class.