
# Introduction & Philosophy


## Goal

Make it easier for traditional GISers to handle geospatial big data.


## Motivation

The original idea was borrowed from Uber, which proposed an ESRI Hive UDF + Presto solution for large-scale geospatial data processing with spatial indexing in production.

However, the Uber solution is not yet open source, and Presto is less popular than Spark.

## Comparison

As of 2018, it was hard to find a tidy, easy-to-learn open-source solution for spatial big data anywhere on the internet.

At that time, the candidates were as follows:

1. PostGIS
2. pyspark with the shapely Python package (a hacking approach)
3. sparklyr with the sf R package (a hacking approach)
4. the ESRI Hive UDF Java package
5. the GeoMesa Scala package
6. the GeoSpark Scala package
7. the Magellan Scala package

For business reasons, our scenario calls for a high-precision, high-throughput geospatial data processing solution at production level.

- Firstly, the PostGIS solution runs on a single machine and struggles to process data at the TB or even PB scale.

- Secondly, the hacking approaches, which distribute R or Python code through sparklyr or pyspark, lose performance relative to a native Scala implementation.

- Thirdly, spatial join performance is quite slow in the ESRI Hive UDF solution, which lacks spatial index support.

- Lastly, the most competitive package is GeoMesa, which has been developed for over 10 years and is well tested in production. However, GeoMesa is heavyweight because it has to work with HBase, and it runs slowly on range joins and distance joins.

According to the paper *Spatial Data Management in Apache Spark: The GeoSpark Perspective and Beyond*, GeoSpark beats the other existing Scala packages in usability and functionality.

## Summary

| Item            | GeoSpark | Pyspark+shapely / Sparklyr+sf | GeoMesa | PostGIS | ESRI Hive UDF |
| --------------- | -------- | ----------------------------- | ------- | ------- | ------------- |
| Usability       | +++++    | +                             | ++++    | ++++    | +++++         |
| Functionality   | +++++    | ++++                          | +++++   | ++++    | +++++         |
| Maintainability | +++++    | ++                            | ++      | +++     | +++           |
| Performance     | +++++    | +++                           | +++++   | +++++   | ++            |
| Scalability     | +++++    | ++++                          | +++++   | +++++   | +++++         |

## Philosophy

To that end, the geospark R package aims, in its first stage, to bring local sf functions to distributed Spark mode through the GeoSpark Scala package. That means most people who already know PostGIS or the sf R package can get hands-on with geospark in a few hours, and even GIS newbies can start playing with geospark in a very short time with the sf cheatsheet and the leaflet cheatsheet. The community is also working on a GeoSpark and GeoMesa fusion project, which will be added to the R family in the near future.

Currently, geospark supports the most important sf functions in Spark; here is a summary comparison. The geospark R package also stays close to the geospatial and big data communities: it is powered by sparklyr, sf, dplyr, and dbplyr.
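As an illustration, here is a minimal sketch of that workflow, modeled on the package's containment-join example. The connection settings and the tiny WKT tables are made up for demonstration, and the lowercase function translations (`st_geomfromwkt`, `st_contains`) assume a recent geospark release.

```r
library(sparklyr)
library(geospark)
library(dplyr)

# Connect to a local Spark session and register the GeoSpark (ST_*) functions.
sc <- spark_connect(master = "local")
register_gis(sc)

# Hypothetical demo tables: geometries stored as WKT strings.
polygons_df <- data.frame(
  area = c("a", "b"),
  geom = c("POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))",
           "POLYGON ((2 2, 2 4, 4 4, 4 2, 2 2))")
)
points_df <- data.frame(
  city = c("p1", "p2"),
  geom = c("POINT (1 1)", "POINT (3 3)")
)
polygons <- copy_to(sc, polygons_df, "polygons")
points   <- copy_to(sc, points_df, "points")

# sf-style verbs inside dplyr: parse the WKT, then join on spatial containment.
polys <- mutate(polygons, y = st_geomfromwkt(geom))
pts   <- mutate(points,   x = st_geomfromwkt(geom))

inner_join(polys, pts, sql_on = sql("st_contains(y, x)")) %>%
  select(area, city) %>%
  collect()
```

Anyone comfortable with `sf::st_contains()` or PostGIS's `ST_Contains` should find this immediately familiar; the dplyr pipeline is translated to Spark SQL by dbplyr, so the join itself runs distributed.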

## References

Jia Yu, Zongsi Zhang, and Mohamed Sarwat. Spatial data management in Apache Spark: the GeoSpark perspective and beyond. GeoInformatica, 2019.
