Geospatial Indexing for Magellan #118

halfabrane · 2017-06-27T10:38:40Z

This pull request continues the work started in #95.

Geospatial relations read by Magellan are now indexed automatically.
The schema of such relations includes two new fields: the list of ZOrderCurves that cover the geometry, along with the Relation for each curve. A Relation is one of Contains, Within, Disjoint, or Intersects, and indicates the relationship of that ZOrderCurve to the given geometry.
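
To make the four Relation values concrete, here is a minimal sketch in plain Scala. It is not Magellan's implementation: both the index cell and the geometry are approximated by axis-aligned boxes, and the names `RelationSketch`, `Box`, and `relate` are hypothetical. It assumes, as described above, that the relation is read from the cell's point of view (Within means the cell lies inside the geometry).

```scala
object RelationSketch {
  // Hypothetical stand-ins for the relation values named above.
  sealed trait Relation
  case object Contains extends Relation
  case object Within extends Relation
  case object Intersects extends Relation
  case object Disjoint extends Relation

  // Axis-aligned box used for both the cell and the geometry (an assumption for brevity).
  final case class Box(xmin: Double, ymin: Double, xmax: Double, ymax: Double) {
    // True when `o` lies entirely inside this box (boundaries included).
    def contains(o: Box): Boolean =
      xmin <= o.xmin && ymin <= o.ymin && xmax >= o.xmax && ymax >= o.ymax

    // True when the two boxes share at least one point.
    def overlaps(o: Box): Boolean =
      xmin <= o.xmax && o.xmin <= xmax && ymin <= o.ymax && o.ymin <= ymax
  }

  /** Relation of an index cell to a geometry, seen from the cell's side. */
  def relate(cell: Box, geometry: Box): Relation =
    if (!cell.overlaps(geometry)) Disjoint
    else if (geometry.contains(cell)) Within
    else if (cell.contains(geometry)) Contains
    else Intersects
}
```

With real polygons the containment tests are more involved, but the classification order (disjoint first, then full containment either way, then partial overlap) stays the same.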

The option magellan.index.precision governs the precision at which the ZOrderCurves are resolved (GeoHashes in the standard lat/long case).

DataFrames involving shapes can now also be indexed explicitly by invoking:
df.withColumn("index", $"point" precision 30) etc., where in this case we are covering the point with a six-character geohash (30 bits at 5 bits per character).
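
The 30/5 arithmetic can be checked with a small standalone sketch of the standard geohash scheme: longitude and latitude bits are interleaved (longitude first) and every 5 bits become one base-32 character. This is an illustration only; the object `GeohashSketch` and its methods are hypothetical and Magellan's ZOrderCurve may differ in its details.

```scala
object GeohashSketch {
  // Standard geohash base-32 alphabet (no a, i, l, o).
  private val Base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

  /** Interleaved bits for a point, longitude bit first, `precision` bits in total. */
  def bits(lat: Double, lon: Double, precision: Int): Seq[Boolean] = {
    var latLo = -90.0
    var latHi = 90.0
    var lonLo = -180.0
    var lonHi = 180.0
    (0 until precision).map { i =>
      if (i % 2 == 0) {
        // Even positions halve the longitude interval.
        val mid = (lonLo + lonHi) / 2
        if (lon >= mid) { lonLo = mid; true } else { lonHi = mid; false }
      } else {
        // Odd positions halve the latitude interval.
        val mid = (latLo + latHi) / 2
        if (lat >= mid) { latLo = mid; true } else { latHi = mid; false }
      }
    }
  }

  /** Geohash string for a bit precision that is a multiple of 5. */
  def encode(lat: Double, lon: Double, precision: Int): String = {
    require(precision > 0 && precision % 5 == 0, "precision must be a positive multiple of 5")
    bits(lat, lon, precision)
      .grouped(5)
      .map(group => Base32(group.foldLeft(0)((acc, b) => (acc << 1) | (if (b) 1 else 0))))
      .mkString
  }
}
```

For example, `GeohashSketch.encode(57.64911, 10.40744, 30)` yields the six-character hash `u4pruy`, and a precision of 35 would yield seven characters.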

Similarly, for polygons,
df.withColumn("index", $"polygon" precision 30) gives us all the six-character geohashes that are contained in, contain, or intersect this polygon.

A future PR will use these indices to choose a spatial join automatically.
For now, if you don't mind manually rewriting queries to take advantage of spatial indexing in Magellan, you can do the following:

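// Assuming you have two datasets (points and polygons) that have been indexed as above and that you want to join:

import magellan.index._
import org.apache.spark.sql.functions.explode // needed for explode if it is not already in scope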

val indexedPoints = points.withColumn("index", explode($"index")).select("point", "index.curve", "index.relation")

val indexedPolygons = polygons.withColumn("index", explode($"index")).select("polygon", "index.curve", "index.relation")

// instead of joined = points.join(polygons).where($"point" within $"polygon") you have

val joined = indexedPoints.join(indexedPolygons, indexedPoints("curve") === indexedPolygons("curve")).where((indexedPolygons("relation") === "Within") or ($"point" within $"polygon"))
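
The reason this rewrite helps is that the equi-join on curve only pairs points and polygons that share a cell, and for cells that lie entirely within the polygon the exact point-in-polygon test is skipped. A minimal sketch of that logic on plain Scala collections, without Spark, is shown here. The names `IndexedJoinSketch`, `PointRow`, `PolygonRow`, and `Rect` are hypothetical, and a rectangle stands in for the polygon.

```scala
object IndexedJoinSketch {
  final case class Point(x: Double, y: Double)

  // Axis-aligned rectangle standing in for a polygon (an assumption for brevity).
  final case class Rect(xmin: Double, ymin: Double, xmax: Double, ymax: Double) {
    def contains(p: Point): Boolean =
      p.x >= xmin && p.x <= xmax && p.y >= ymin && p.y <= ymax
  }

  // One row per (shape, covering cell), mirroring the exploded index columns.
  final case class PointRow(id: String, point: Point, curve: String)
  final case class PolygonRow(id: String, polygon: Rect, curve: String, relation: String)

  /** Pairs of (point id, polygon id) where the point lies within the polygon. */
  def join(points: Seq[PointRow], polygons: Seq[PolygonRow]): Seq[(String, String)] = {
    // Equi-join key: only rows sharing a cell are ever compared.
    val polygonsByCurve = polygons.groupBy(_.curve)
    for {
      p <- points
      g <- polygonsByCurve.getOrElse(p.curve, Seq.empty)
      // A cell that is Within the polygon needs no exact test; other cells do.
      if g.relation == "Within" || g.polygon.contains(p.point)
    } yield (p.id, g.id)
  }
}
```

Because `||` short-circuits, the exact containment test in this sketch runs only for cells on the polygon's boundary, which is where the index cannot decide on its own.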

A sample Databricks community notebook that illustrates how to set up the indices and perform the spatial join is here:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/137058993011870/2088756965947706/6891974485343070/latest.html

codecov-io · 2017-06-27T10:44:11Z

Codecov Report

Merging #118 into master will increase coverage by 0.91%.
The diff coverage is 88.07%.

@@            Coverage Diff            @@
##           master    #118      +/-   ##
=========================================
+ Coverage   76.89%   77.8%   +0.91%     
=========================================
  Files          42      47       +5     
  Lines        1402    1487      +85     
  Branches       98     103       +5     
=========================================
+ Hits         1078    1157      +79     
- Misses        324     330       +6

Impacted Files	Coverage Δ
src/main/scala/magellan/GeoJSONRelation.scala	`83.87% <ø> (ø)`	⬆️
src/main/scala/magellan/OsmFileRelation.scala	`92.45% <ø> (ø)`	⬆️
src/main/scala/magellan/ShapefileRelation.scala	`95.23% <ø> (ø)`	⬆️
src/main/scala/magellan/index/ZOrderCurve.scala	`94.28% <ø> (ø)`	⬆️
src/main/scala/magellan/index/Index.scala	`0% <0%> (ø)`
...la/org/apache/spark/sql/types/ZOrderCurveUDT.scala	`100% <100%> (ø)`
src/main/scala/magellan/catalyst/SpatialJoin.scala	`100% <100%> (ø)`
...main/scala/magellan/index/ZOrderCurveIndexer.scala	`100% <100%> (ø)`	⬆️
src/main/scala/magellan/Utils.scala	`100% <100%> (ø)`
src/main/scala/magellan/DefaultSource.scala	`85.71% <100%> (ø)`	⬆️
... and 12 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c0cbd46...e18a966. Read the comment docs.

jtmurphy89 · 2017-06-27T17:13:22Z

Thanks for the update! Quick question: should "df.withColumn("index", $"point" precision 30)" read "df.withColumn("index", $"point" index 30)"?

harsha2010 · 2017-06-27T17:22:52Z

@jtmurphy89 here is a Databricks notebook illustrating how to set up indices:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/137058993011870/2088756965947706/6891974485343070/latest.html

I'll also add it to the documentation.

Ram Sriharsha and others added 2 commits June 27, 2017 12:24

Add indexing metadata

32cdce8

Merge with master

e18a966

harsha2010 merged commit c918aa9 into harsha2010:master Jun 27, 2017

halfabrane mentioned this pull request Jul 24, 2017

Transparent Spatial Join for Within #122

Merged
