# Scalable Geospatial (Vector) Data Science

### Nikolai Janakiev [@njanakiev](https://github.com/njanakiev)

# (Geospatial) Big Data

### OpenStreetMap

- __6,140,639,049__ nodes
- __677,822,881__ ways
- __7,940,285__ relations

[www.openstreetmap.org/stats/data_stats.html](https://www.openstreetmap.org/stats/data_stats.html)

### NYC Taxi & Limousine Commission

- __1.5 billion__ rows (__50 GB__) from 2009 to 2018

[azure.microsoft.com/en-us/services/open-datasets/catalog/nyc-taxi-limousine-commission-yellow-taxi-trip-records/](https://azure.microsoft.com/en-us/services/open-datasets/catalog/nyc-taxi-limousine-commission-yellow-taxi-trip-records/)

![PostgreSQL](assets/postgresql.jpeg)

# PostgreSQL and PostGIS

### PostgreSQL Parallelization

- Since __PostgreSQL 9.6__ (2016)
- Enabled by default since __PostgreSQL 10__ (2017)

### PostGIS Parallelization

- With tweaks since __PostgreSQL 11__ and __PostGIS 2.5__
- Out of the box since __PostgreSQL 12__ and __PostGIS 3__

# Scaling Postgres

### Columnar Storage (Foreign Data Wrapper)

- cstore_fdw - [citusdata.github.io/cstore_fdw/](https://citusdata.github.io/cstore_fdw/)

### GPU Processing

- PG-Strom - [heterodb.github.io/pg-strom](http://heterodb.github.io/pg-strom/)

### Distributed PostgreSQL

- Citus (formerly CitusDB) - [github.com/citusdata/citus](https://github.com/citusdata/citus)
- Postgres-XL - [www.postgres-xl.org](https://www.postgres-xl.org/)

![Hadoop](assets/hadoop.png)

# Big Data

- __2003__ Google - [The Google File System](https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf)

- __2004__ Google - [MapReduce: Simplified Data Processing on Large Clusters](https://static.usenix.org/publications/library/proceedings/osdi04/tech/full_papers/dean/dean.pdf)

- __2006__ Apache Nutch - Hadoop spun out as a separate subproject

- __2008__ Hadoop becomes a top-level Apache Software Foundation project

## Hadoop Distributed File System (HDFS)

## Hadoop MapReduce

## Hadoop YARN

# Apache Spark

- Created 2009, open sourced 2010

- __2010__ - [Spark: Cluster Computing with Working Sets](https://www.usenix.org/legacy/event/hotcloud10/tech/full_papers/Zaharia.pdf)

- distributed computation engine, written in Scala, with bindings for Java, Python, and R

# GeoMesa

[geomesa.org](https://www.geomesa.org/)

### Spatial Indexing

- 2d/3d Space Filling Curve
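
A space-filling curve maps multi-dimensional coordinates to a single sortable key, so nearby points tend to land near each other in a key-value store. The sketch below shows a generic 2D Z-order (Morton) curve; it is an illustration of the idea, not GeoMesa's implementation, and the function names and the 16-bit grid resolution are assumptions.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the lowest `bits` bits of x and y into one Z-order code.

    Bits of x go to even positions, bits of y to odd positions.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def z_order_index(lon: float, lat: float, bits: int = 16) -> int:
    """Map a lon/lat pair to a Z-order key on a 2^bits x 2^bits grid."""
    max_cell = (1 << bits) - 1
    # Scale each coordinate to an integer cell index in [0, max_cell].
    x = int((lon + 180.0) / 360.0 * max_cell)
    y = int((lat + 90.0) / 180.0 * max_cell)
    return interleave_bits(x, y, bits)


# Keys can be sorted or range-scanned like any other integer key.
keys = sorted(z_order_index(lon, lat) for lon, lat in [(16.37, 48.21), (16.40, 48.20), (-73.98, 40.75)])
```

A 3D variant would interleave a third component (for example a time bucket) in the same way.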

### Database

- [Apache Accumulo](https://accumulo.apache.org/)
- [Apache HBase](https://hbase.apache.org/)
- [Apache Cassandra](https://cassandra.apache.org/)

# LocationTech

[projects.eclipse.org/projects/locationtech](https://projects.eclipse.org/projects/locationtech)

- [GeoWave](https://locationtech.github.io/geowave/)

- [GeoTrellis](https://geotrellis.io/)

- [RasterFrames](https://rasterframes.io/)

- [JTS Topology Suite](https://locationtech.github.io/jts/)

# Beyond the Elephants

# Presto

[prestodb.io](https://prestodb.io/)

- Distributed SQL Query Engine for Big Data

- [Presto Connectors](https://prestodb.io/docs/current/connector.html) - connectors to most common SQL and NoSQL databases

- [Geospatial Functions](https://prestodb.io/docs/current/functions/geospatial.html) - various common `ST_` functions available 

# Dask

[dask.org](https://dask.org/)

- parallel and distributed computing library for analytics written in Python

# GeoPandas

[geopandas.org](https://geopandas.org/)

- extends the data types used by pandas to allow spatial operations on geometric types

# ElasticSearch

### Geo Point Indexing

[Geo-point datatype](https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-point.html)

- BKD-trees
- 2019 - [BKD-backed geo_shapes in Elasticsearch: precision + efficiency + speed](https://www.elastic.co/blog/bkd-backed-geo-shapes-in-elasticsearch-precision-efficiency-speed)

### Geo Queries

[ElasticSearch - Geo queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-queries.html)

- `geo_bounding_box` query
- `geo_distance` query
- `geo_polygon` query
- `geo_shape` query
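
As a rough illustration of the query types above, the following sketch builds request bodies for a `geo_bounding_box` and a `geo_distance` filter as plain Python dictionaries. The field name `location` and the helper function names are assumptions for the example; the bodies would be sent to a search endpoint by whichever client is in use.

```python
import json


def geo_bounding_box_query(field, top, left, bottom, right):
    """Search body that keeps documents whose geo_point lies inside a box."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_bounding_box": {
                        field: {
                            "top_left": {"lat": top, "lon": left},
                            "bottom_right": {"lat": bottom, "lon": right},
                        }
                    }
                }
            }
        }
    }


def geo_distance_query(field, lat, lon, distance):
    """Search body that keeps documents within `distance` of a point."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": distance,  # e.g. "5km"
                        field: {"lat": lat, "lon": lon},
                    }
                }
            }
        }
    }


# Hypothetical field name "location"; the box roughly covers New York City.
bbox_body = geo_bounding_box_query("location", top=40.92, left=-74.26, bottom=40.49, right=-73.70)
dist_body = geo_distance_query("location", lat=40.75, lon=-73.98, distance="5km")
print(json.dumps(bbox_body, indent=2))
```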

# MongoDB

- [mongodb.com](https://www.mongodb.com/)
- Supports [GeoJSON Object](https://docs.mongodb.com/manual/reference/geojson/)

### Geospatial Index

- [2d Index](https://docs.mongodb.com/manual/core/2d/), [2d Index Internals](https://docs.mongodb.com/manual/core/geospatial-indexes/) - Uses GeoHash Indexing
- [2d Sphere Index](https://docs.mongodb.com/manual/core/2dsphere/)
- [geoHaystack Indexes](https://docs.mongodb.com/manual/core/geohaystack/) - Special index that is optimized to return results over small areas

### Geospatial Queries

- `$geoIntersects`
- `$geoWithin`
- `$near`
- `$nearSphere`
- `$geoNear`

[Geospatial Queries](https://docs.mongodb.com/manual/geospatial-queries/)
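
To show how the operators above combine with GeoJSON, here is a minimal sketch that builds `$geoWithin` and `$near` filter documents as Python dictionaries. The field name `location` and the helper names are assumptions; `$near` additionally needs a geospatial index on the queried field.

```python
import json


def bbox_polygon(min_lon, min_lat, max_lon, max_lat):
    """GeoJSON Polygon for a bounding box; the ring is closed (first == last)."""
    ring = [
        [min_lon, min_lat],
        [max_lon, min_lat],
        [max_lon, max_lat],
        [min_lon, max_lat],
        [min_lon, min_lat],
    ]
    return {"type": "Polygon", "coordinates": [ring]}


def within_filter(field, polygon):
    """Filter for documents whose geometry lies entirely inside `polygon`."""
    return {field: {"$geoWithin": {"$geometry": polygon}}}


def near_filter(field, lon, lat, max_distance_m):
    """Filter for documents near a point, nearest first, within a radius in meters."""
    return {
        field: {
            "$near": {
                "$geometry": {"type": "Point", "coordinates": [lon, lat]},
                "$maxDistance": max_distance_m,
            }
        }
    }


# Hypothetical field name "location".
within = within_filter("location", bbox_polygon(-74.26, 40.49, -73.70, 40.92))
near = near_filter("location", lon=-73.98, lat=40.75, max_distance_m=500)
print(json.dumps(within))
```

With a driver such as PyMongo, a filter like this would typically be passed to `collection.find(...)`.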

# Redis

[redis.io](https://redis.io/)

- in-memory data structure store, used as a database, cache and message broker
- Used at Lyft at a scale of 10 million QPS
- [Geohash](https://en.wikipedia.org/wiki/Geohash) Indexing
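
A minimal sketch of the textual Geohash encoding from the linked article: the bounding box is halved repeatedly, alternating between longitude and latitude, and every five bits become one base-32 character. Redis stores an integer form of this idea internally, so the function below illustrates the principle and is not Redis's own encoding; the function name is an assumption.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 9) -> str:
    """Encode a lat/lon pair as a Geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    value = 0        # bits collected for the current character
    bit_count = 0
    use_lon = True   # Geohash starts with a longitude bit

    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid   # keep the upper half
        else:
            value = value << 1
            rng[1] = mid   # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[value])
            value = 0
            bit_count = 0
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744, precision=4))
```

Because each extra character only refines the previous cell, a shorter hash is always a prefix of a longer one for the same point, which is what makes prefix and range scans useful for proximity search.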

### Geospatial Commands

- [GEOADD](https://redis.io/commands/geoadd)
- [GEOHASH](https://redis.io/commands/geohash)
- [GEODIST](https://redis.io/commands/geodist)
- [GEORADIUS](https://redis.io/commands/georadius)
- [GEORADIUSBYMEMBER](https://redis.io/commands/georadiusbymember)

[Redis Best Practices - Geospatial](https://redislabs.com/redis-best-practices/indexing-patterns/geospatial/)
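
Distance commands such as `GEODIST` treat the Earth as a sphere. The sketch below shows the haversine great-circle formula that this kind of calculation is based on; the radius constant is an assumption (a commonly used mean Earth radius), so results may differ slightly from what Redis returns.

```python
import math

# Assumed mean Earth radius in meters; a server may use a different constant.
EARTH_RADIUS_M = 6_371_008.8


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in meters between two lon/lat points."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# One degree of longitude along the equator.
print(round(haversine_m(0.0, 0.0, 1.0, 0.0)))
```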

# Scalable Geospatial Data Science

### Nikolai Janakiev [@njanakiev](https://github.com/njanakiev)

- Slides - [janakiev.com/scalable-geospatial-data-science](https://janakiev.com/scalable-geospatial-data-science)
- Benchmark (2017) - [Benchmarking of Big Data Technologies for Ingesting and Querying Geospatial Datasets](https://www.reply.com/en/topics/big-data-and-analytics/Shared%20Documents/DSTL-Report-Data-Reply-2017.pdf)
- MDPI 2020 - [State-of-the-Art Geospatial Information Processing in NoSQL Databases](https://www.mdpi.com/2220-9964/9/5/331/pdf)
- [Summary of the 1.1 Billion Taxi Rides Benchmarks](https://tech.marksblogg.com/benchmarks.html)