# Scalable Geospatial (Vector) Data Science

### Nikolai Janakiev [@njanakiev](https://github.com/njanakiev)

- Slides - [janakiev.com/slides/scalable-geospatial-data-science](https://janakiev.com/slides/scalable-geospatial-data-science)
- Repository - [github.com/njanakiev/scalable-geospatial-data-science](https://github.com/njanakiev/scalable-geospatial-data-science)

# (Geospatial) Big Data

### OpenStreetMap

- __6,140,639,049__ nodes
- __677,822,881__ ways
- __7,940,285__ relations

[www.openstreetmap.org/stats/data_stats.html](https://www.openstreetmap.org/stats/data_stats.html)

### NYC Taxi & Limousine Commission

- __1.5 billion__ rows (__50 GB__) from 2009 to 2018

[azure.microsoft.com/en-us/services/open-datasets/catalog/nyc-taxi-limousine-commission-yellow-taxi-trip-records/](https://azure.microsoft.com/en-us/services/open-datasets/catalog/nyc-taxi-limousine-commission-yellow-taxi-trip-records/)

![PostgreSQL](assets/postgresql.jpeg)

# PostgreSQL and PostGIS

### PostgreSQL Parallelization

- Since __PostgreSQL 9.6__ (2016)
- Enabled by default since __PostgreSQL 10__ (2017)

### PostGIS Parallelization

- With tweaks since __PostgreSQL 11__ and __PostGIS 2.5__ 
- Out of the box since __PostgreSQL 12__ and __PostGIS 3__

# Scaling PostgreSQL

### Columnar Storage (Foreign Data Wrapper)

- cstore_fdw - [citusdata.github.io/cstore_fdw/](https://citusdata.github.io/cstore_fdw/)

### GPU Processing

- PG-Strom - [heterodb.github.io/pg-strom](http://heterodb.github.io/pg-strom/)

### Distributed PostgreSQL

- CitusDB - [github.com/citusdata/citus](https://github.com/citusdata/citus)
- Postgres-XL - [www.postgres-xl.org](https://www.postgres-xl.org/)

![Hadoop](assets/hadoop.png)

# Apache Hadoop

- __2003__ Google - [The Google File System](https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf)

- __2004__ Google - [MapReduce: Simplified Data Processing on Large Clusters](https://static.usenix.org/publications/library/proceedings/osdi04/tech/full_papers/dean/dean.pdf)

- __2006__ Hadoop is split out of Apache Nutch as its own project

- __2007__ Used by Facebook, LinkedIn, Twitter, among others

- __2008__ Becomes a top-level Apache Software Foundation project

# Apache Hadoop

### Hadoop Distributed File System (HDFS)

- __2003__ Google - [The Google File System](https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf)

### Hadoop MapReduce

- __2004__ Google - [MapReduce: Simplified Data Processing on Large Clusters](https://static.usenix.org/publications/library/proceedings/osdi04/tech/full_papers/dean/dean.pdf)
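
The MapReduce model can be illustrated without a cluster. The following is a minimal single-process sketch in plain Python, not Hadoop's API: the function names, the 1-degree grid, and the sample points are assumptions made for illustration. It counts points per grid cell through explicit map, shuffle, and reduce phases.

```python
import math
from collections import defaultdict


def map_phase(points, cell_size=1.0):
    """Emit one (cell, 1) pair per (lon, lat) point."""
    for lon, lat in points:
        cell = (math.floor(lon / cell_size), math.floor(lat / cell_size))
        yield cell, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_points_per_cell(points, cell_size=1.0):
    return reduce_phase(shuffle_phase(map_phase(points, cell_size)))


# Hypothetical sample points as (lon, lat)
points = [(16.37, 48.21), (16.90, 48.05), (13.40, 52.52), (-0.13, 51.51)]
counts = count_points_per_cell(points)
```

In a real cluster the map and reduce functions run on many machines and the shuffle moves data over the network; only the shape of the computation is the same here.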

### Hadoop YARN

# Apache Spark

- Created 2009, open sourced 2010

- __2010__ - [Spark: Cluster Computing with Working Sets](https://www.usenix.org/legacy/event/hotcloud10/tech/full_papers/Zaharia.pdf)

- Distributed computation engine, written in Scala, with bindings for Java, Python, and R

# GeoMesa ([geomesa.org](https://www.geomesa.org/))

- [LocationTech](https://locationtech.org) under the Eclipse Foundation

### Spatial Indexing

- Space Filling Curve (2d/3d) - [Index Overview](https://www.geomesa.org/documentation/user/datastores/index_overview.html)
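
To make the idea of a 2D space-filling curve concrete, here is a minimal sketch of a generic Z-order (Morton) index in plain Python. It is not GeoMesa's implementation: the function names and the 16-bit resolution per dimension are assumptions chosen for illustration.

```python
def normalize(value, lo, hi, bits):
    """Map a value in [lo, hi] to an integer in [0, 2**bits - 1]."""
    max_index = (1 << bits) - 1
    scaled = int((value - lo) / (hi - lo) * (1 << bits))
    return min(max(scaled, 0), max_index)


def interleave(x, y, bits):
    """Interleave the bits of x (even positions) and y (odd positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def z2_index(lon, lat, bits=16):
    """Hypothetical helper: one sortable integer key per (lon, lat) point."""
    x = normalize(lon, -180.0, 180.0, bits)
    y = normalize(lat, -90.0, 90.0, bits)
    return interleave(x, y, bits)
```

Sorting records by such a key tends to keep spatially close points close in key space, which is what lets a sorted key-value store serve a bounding-box query with a small number of range scans.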

### Database

- [Apache Accumulo](https://accumulo.apache.org/)
- [Apache HBase](https://hbase.apache.org/)
- [Apache Cassandra](https://cassandra.apache.org/)

### Integrations

- [Apache Spark](http://spark.apache.org/) RDD/SQL Interface
- GeoServer Integration

# Beyond the Elephants

# Presto ([prestodb.io](https://prestodb.io/)) or Trino ([trino.io](https://trino.io/))

- Distributed SQL Query Engine for Big Data

- [Presto Connectors](https://prestodb.io/docs/current/connector.html) - connectors to most common SQL and NoSQL databases

- [Geospatial Functions](https://prestodb.io/docs/current/functions/geospatial.html) - various common `ST_` functions available; spatial joins use __R-tree__ indexing ([PR #13079](https://github.com/prestodb/presto/pull/13079))

# Dask ([dask.org](https://dask.org/))

- Parallel and distributed computing library for analytics written in Python

# GeoPandas ([geopandas.org](https://geopandas.org/))

- Extends the datatypes used by [Pandas](https://pandas.pydata.org/) to allow spatial operations on geometric types

- [rtree](https://github.com/Toblerity/rtree) - Spatial index for Python GIS

- (Experimental) [geopandas/dask-geopandas](https://github.com/geopandas/dask-geopandas) - New implementation that combines GeoPandas with Dask

# Elasticsearch ([elastic.co](https://www.elastic.co/))

### [Geo Point Indexing](https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-point.html)

- GeohashPrefixTree
- QuadPrefixTree
- BKD-trees (since ES 6.0) - [BKD-backed geo_shapes in Elasticsearch: precision + efficiency + speed](https://www.elastic.co/blog/bkd-backed-geo-shapes-in-elasticsearch-precision-efficiency-speed) (2019)

### [Geo queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-queries.html)

- `geo_bounding_box` query
- `geo_distance` query
- `geo_polygon` query
- `geo_shape` query
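
As a rough illustration of what two of these queries look like, the sketch below builds the request bodies as plain Python dictionaries. The helper names and the `location` field are hypothetical, and the field is assumed to be mapped as `geo_point`. The body would be sent to the search endpoint by whatever client is in use.

```python
import json


def geo_distance_query(field, lat, lon, distance):
    """Build a filter matching documents within `distance` of a point."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        field: {"lat": lat, "lon": lon},
                    }
                }
            }
        }
    }


def geo_bounding_box_query(field, top, left, bottom, right):
    """Build a filter matching documents inside a lat/lon rectangle."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_bounding_box": {
                        field: {
                            "top_left": {"lat": top, "lon": left},
                            "bottom_right": {"lat": bottom, "lon": right},
                        }
                    }
                }
            }
        }
    }


# Hypothetical field name and coordinates
body = geo_distance_query("location", 48.2082, 16.3738, "5km")
print(json.dumps(body, indent=2))
```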

# MongoDB ([mongodb.com](https://www.mongodb.com/))

- Document-oriented NoSQL database which uses JSON-like documents with optional schemas

### Geospatial Index

- [GeoJSON Object](https://docs.mongodb.com/manual/reference/geojson/)
- [2d Index](https://docs.mongodb.com/manual/core/2d/), [2d Index Internals](https://docs.mongodb.com/manual/core/geospatial-indexes/) - uses geohash indexing
- [2d Sphere Index](https://docs.mongodb.com/manual/core/2dsphere/)
- [geoHaystack Indexes](https://docs.mongodb.com/manual/core/geohaystack/) - Special index optimized to return results over small areas (deprecated in MongoDB 4.4, removed in 5.0)

### [Geospatial Queries](https://docs.mongodb.com/manual/geospatial-queries/)

- `$geoIntersects`
- `$geoWithin`
- `$near`
- `$nearSphere`
- `$geoNear`
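
The sketch below shows how `$near` and `$geoWithin` filters can be assembled as plain Python dictionaries with GeoJSON geometries. The helper names and the `location` field are hypothetical. GeoJSON coordinates are ordered longitude first, and `$near` with a GeoJSON point assumes a `2dsphere` index on the field.

```python
def point(lon, lat):
    """GeoJSON point; note the (longitude, latitude) order."""
    return {"type": "Point", "coordinates": [lon, lat]}


def near_filter(field, lon, lat, max_distance_m):
    """Filter for documents near a point, sorted by distance, within a radius in meters."""
    return {
        field: {
            "$near": {
                "$geometry": point(lon, lat),
                "$maxDistance": max_distance_m,
            }
        }
    }


def within_polygon_filter(field, ring):
    """Filter for documents inside a polygon given as a list of (lon, lat) pairs."""
    ring = [list(p) for p in ring]
    if ring[0] != ring[-1]:
        ring.append(list(ring[0]))  # GeoJSON rings must be closed
    return {
        field: {
            "$geoWithin": {
                "$geometry": {"type": "Polygon", "coordinates": [ring]}
            }
        }
    }


# Hypothetical field name and coordinates
nearby = near_filter("location", 16.3738, 48.2082, 500)
square = within_polygon_filter("location", [(16, 48), (17, 48), (17, 49), (16, 49)])
```

With a driver such as PyMongo, a filter like `nearby` would typically be passed to the collection's `find` method.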

# Redis ([redis.io](https://redis.io/))

- In-memory data structure store, used as a database, cache and message broker

### Geospatial Indexing

- [Geohash](https://en.wikipedia.org/wiki/Geohash) Indexing
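
For reference, this is a minimal sketch of the standard base-32 geohash encoding described on the linked Wikipedia page. It is written in plain Python for illustration and is not Redis' internal integer representation; the function name and default precision are assumptions.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=9):
    """Encode a latitude/longitude pair as a base-32 geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    even = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half
        even = not even
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


code = geohash_encode(57.64911, 10.40744, precision=5)  # -> "u4pru"
```

Because each additional character only refines the previous cell, points that share a long prefix are close together, which is why prefix or range lookups on the encoded key work as a spatial index.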

### [Geospatial Commands](https://redislabs.com/redis-best-practices/indexing-patterns/geospatial/)

- [GEOADD](https://redis.io/commands/geoadd)
- [GEOHASH](https://redis.io/commands/geohash)
- [GEODIST](https://redis.io/commands/geodist)
- [GEORADIUS](https://redis.io/commands/georadius)
- [GEORADIUSBYMEMBER](https://redis.io/commands/georadiusbymember)
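
A distance command such as `GEODIST` treats the Earth as a sphere. The following sketch shows the haversine formula that this kind of calculation is based on. It is an illustration in plain Python, not Redis code; the function name is hypothetical, and the radius constant is an assumption about the value Redis uses internally. The sample coordinates are the Palermo and Catania pair that, as far as I know, appears in the Redis command documentation.

```python
import math

# Assumed Earth radius in meters (believed to match the constant in Redis' geo helper)
EARTH_RADIUS_M = 6372797.560856


def haversine_m(lon1, lat1, lon2, lat2, radius=EARTH_RADIUS_M):
    """Great-circle distance in meters between two (lon, lat) points on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * radius * math.asin(math.sqrt(a))


palermo = (13.361389, 38.115556)  # (lon, lat)
catania = (15.087269, 37.502669)  # (lon, lat)
distance = haversine_m(*palermo, *catania)
```

The result for this pair is roughly 166 km. A spherical model can be off by a fraction of a percent compared with an ellipsoidal one.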

# Scalable Geospatial Data Science

### Resources

- Benchmark (2017) - [Benchmarking of Big Data Technologies for Ingesting and Querying Geospatial Datasets](https://www.reply.com/en/topics/big-data-and-analytics/Shared%20Documents/DSTL-Report-Data-Reply-2017.pdf)
- MDPI 2020 - [State-of-the-Art Geospatial Information Processing in NoSQL Databases](https://www.mdpi.com/2220-9964/9/5/331/pdf)
- [Summary of the 1.1 Billion Taxi Rides Benchmarks](https://tech.marksblogg.com/benchmarks.html)