Introduction & Philosophy
The goal of `geospark` is to make it easier for traditional GIS users to handle geospatial big data.
The original idea borrows from Uber, which proposed an ESRI Hive UDF + Presto solution to process large-scale geospatial data with spatial indexes in production. However, the Uber solution is not open source, and Presto is less popular than Spark.
As of 2018, it was hard to find a tidy, easy-to-learn open source solution for spatial big data. At that time, the candidates were as follows:
- PostGIS
- `pyspark` + `shapely` Python package hacking way
- `sparklyr` + `sf` R package hacking way
- ESRI Hive UDF Java package
- GeoMesa Scala package
- GeoSpark Scala package
- Magellan Scala package
For business reasons, our scenario requires a high-precision, high-throughput geospatial data processing solution at production level.
- Firstly, the PostGIS solution works on a single machine, which makes it hard to process data at the TB or even PB level.
- Secondly, the hacking way, distributing R or Python code via `sparklyr` or `pyspark`, loses performance compared with a native Scala implementation.
- Thirdly, spatial join performance in the ESRI Hive UDF solution is quite slow, because it lacks spatial index support.
- Lastly, the most competitive package is GeoMesa, which has been developed for over 10 years and is well tested in production. However, GeoMesa is heavyweight because it works with HBase, and it runs slowly on range joins and distance joins.
According to the paper *Spatial data management in Apache Spark: the GeoSpark perspective and beyond*, GeoSpark beats every other existing Scala package in usability and functionality.
Item | GeoSpark | Pyspark+shapely/Sparklyr+sf | GeoMesa | PostGIS | ESRI_Hive_UDF |
---|---|---|---|---|---|
Usability | +++++ | + | ++++ | ++++ | +++++ |
Functionality | +++++ | ++++ | +++++ | ++++ | +++++ |
Maintainability | +++++ | ++ | ++ | +++ | +++ |
Performance | +++++ | +++ | +++++ | +++++ | ++ |
Scalability | +++++ | ++++ | +++++ | +++++ | +++++ |
Therefore, in its first stage the `geospark` R package aims to bring local `sf` functions to distributed Spark via the GeoSpark Scala package. That means most people who already know PostGIS or the `sf` R package can get hands-on with `geospark` in a few hours, and even GIS newbies can start playing with `geospark` using the `sf` and `leaflet` cheatsheets in a very short time. The community is also working on a GeoSpark and GeoMesa fusion project, which will be added to the R family in the near future.
Currently, `geospark` supports most of the important `sf` functions in Spark; here is a summary comparison. The `geospark` R package also keeps close to the geospatial and big data communities, being powered by `sparklyr`, `sf`, `dplyr`, and `dbplyr`. A minimal usage sketch follows.
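To make the workflow concrete, here is a minimal sketch of a distributed point-in-polygon join with `geospark` through `sparklyr` and `dplyr`. The `register_gis()` call and GeoSpark SQL functions such as `st_geomfromwkt` and `st_contains` come from the package itself; the example data, column names, and grouping below are hypothetical illustrations, not the package's canonical example.

```r
library(sparklyr)
library(dplyr)
library(dbplyr)   # for sql()
library(geospark)

# Connect to a local Spark session and register the GeoSpark ST_* SQL functions.
sc <- spark_connect(master = "local")
register_gis(sc)

# Hypothetical example data: geometries stored as WKT strings.
polygons <- data.frame(
  area = c("a", "b"),
  geom = c("POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))",
           "POLYGON ((2 2, 2 4, 4 4, 4 2, 2 2))")
)
points <- data.frame(
  city     = c("x", "y"),
  location = c("POINT (1 1)", "POINT (3 3)")
)

polygons_tbl <- copy_to(sc, polygons, overwrite = TRUE)
points_tbl   <- copy_to(sc, points, overwrite = TRUE)

# Parse WKT into geometry columns, then join each point to the polygon
# that contains it; the heavy lifting runs inside Spark, not in R.
res <- inner_join(
  mutate(polygons_tbl, y = st_geomfromwkt(geom)),
  mutate(points_tbl, x = st_geomfromwkt(location)),
  sql_on = sql("st_contains(y, x)")
) %>%
  group_by(area) %>%
  summarise(cnt = n())

res
```

Because the join condition is passed through `sql_on`, the `st_contains` predicate is evaluated by GeoSpark inside Spark SQL rather than by pulling data back into the R session.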