Geospatial Indexing for Magellan #118

Merged (2 commits), Jun 27, 2017


@halfabrane halfabrane commented Jun 27, 2017

This Pull Request continues the work started in #95

Geospatial Relations read by Magellan are now indexed automatically.
The schema of such relations includes two new fields: the list of ZOrderCurves that cover the geometry, and, for each curve, a Relation. A Relation is one of Contains, Within, Disjoint, or Intersects, and describes the relationship of that ZOrderCurve to the given geometry.

The option magellan.index.precision governs the precision at which the ZOrderCurves are resolved (these are GeoHashes in the standard case of lat/long coordinates).

DataFrames involving shapes can now also be indexed explicitly by invoking:
df.withColumn("index", $"point" precision 30)
Here the point is covered by a six-character geohash (30 bits at 5 bits per character).

Similarly, for polygons,
df.withColumn("index", $"polygon" precision 30)
gives us all the six-character geohashes that are contained in, contain, or intersect the polygon.
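
To make the 30-bit / six-character arithmetic concrete, here is a minimal, self-contained sketch of standard geohash encoding in plain Scala. It does not use Magellan's ZOrderCurve classes; the object and method names (GeohashSketch, encodeBits, toBase32, geohash) are made up for illustration, and Magellan's own implementation may differ in its details.

```scala
// Illustrative only: a plain-Scala geohash encoder, not Magellan's ZOrderCurve.
object GeohashSketch {
  // Standard geohash base-32 alphabet (digits plus letters, without a, i, l, o).
  private val Base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

  /** Interleave longitude and latitude bits (longitude first) by repeated bisection. */
  def encodeBits(lat: Double, lon: Double, precisionBits: Int): Vector[Boolean] = {
    var latLo = -90.0;  var latHi = 90.0
    var lonLo = -180.0; var lonHi = 180.0
    val bits = Vector.newBuilder[Boolean]
    for (i <- 0 until precisionBits) {
      if (i % 2 == 0) {
        // Even positions refine the longitude interval.
        val mid = (lonLo + lonHi) / 2
        if (lon >= mid) { bits += true; lonLo = mid } else { bits += false; lonHi = mid }
      } else {
        // Odd positions refine the latitude interval.
        val mid = (latLo + latHi) / 2
        if (lat >= mid) { bits += true; latLo = mid } else { bits += false; latHi = mid }
      }
    }
    bits.result()
  }

  /** Pack each group of 5 bits into one base-32 character. */
  def toBase32(bits: Vector[Boolean]): String = {
    require(bits.length % 5 == 0, "precision must be a multiple of 5 bits")
    bits.grouped(5).map { group =>
      val index = group.foldLeft(0)((acc, b) => (acc << 1) | (if (b) 1 else 0))
      Base32(index)
    }.mkString
  }

  def geohash(lat: Double, lon: Double, precisionBits: Int): String =
    toBase32(encodeBits(lat, lon, precisionBits))
}

// A precision of 30 bits yields 30 / 5 = 6 characters.
val hash = GeohashSketch.geohash(57.64911, 10.40744, 30)
println(hash) // prints "u4pruy"
```

Each extra 5 bits of precision appends one character and shrinks the cell, so a higher precision gives a finer index at the cost of more covering cells per polygon.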

A future PR will use these indices to choose a spatial join strategy automatically.
For now, if you don't mind manually rewriting queries to take advantage of spatial indexing in Magellan, you can do the following:

// assuming you have two datasets (points and polygons) that you want to join that have been indexed as above

import magellan.index._

val indexedPoints = points.withColumn("index", explode($"index")).select("point", "index.curve", "index.relation")

val indexedPolygons = polygons.withColumn("index", explode($"index")).select("polygon", "index.curve", "index.relation")

// instead of joined = points.join(polygons).where($"point" within $"polygon") you have

val joined = indexedPoints.join(indexedPolygons, indexedPoints("curve") === indexedPolygons("curve")).where((indexedPolygons("relation") === "Within") or ($"point" within $"polygon"))
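
The rewritten query is a filter-and-refine join: the equality on "curve" narrows the candidates to point/polygon pairs that share a cell, and the exact within test runs only when the cell is not already known to lie Within the polygon. The sketch below mimics that logic on plain Scala collections, without Spark or Magellan. All names in it (IndexedJoinSketch, PointCell, PolygonCell, the cell labels "a", "b", "c") are hypothetical, and the point-in-polygon test is a simple ray-casting routine standing in for Magellan's within predicate.

```scala
// Illustrative only: the filter-and-refine logic on in-memory collections.
object IndexedJoinSketch {
  final case class Point(x: Double, y: Double)
  final case class Polygon(id: String, ring: Vector[Point])

  // One row per (geometry, covering cell), like the exploded "index" column.
  final case class PointCell(point: Point, curve: String)
  // relation describes how the cell relates to the polygon, e.g. "Within" or "Intersects".
  final case class PolygonCell(polygon: Polygon, curve: String, relation: String)

  /** Exact test by ray casting; stands in for the expensive geometric predicate. */
  def within(p: Point, poly: Polygon): Boolean = {
    val ring = poly.ring
    var inside = false
    var j = ring.length - 1
    for (i <- ring.indices) {
      val a = ring(i); val b = ring(j)
      val crosses = (a.y > p.y) != (b.y > p.y) &&
        p.x < (b.x - a.x) * (p.y - a.y) / (b.y - a.y) + a.x
      if (crosses) inside = !inside
      j = i
    }
    inside
  }

  def join(points: Seq[PointCell], polygons: Seq[PolygonCell]): Seq[(Point, String)] = {
    val byCurve = polygons.groupBy(_.curve) // equi-join on the cell
    for {
      pc <- points
      gc <- byCurve.getOrElse(pc.curve, Seq.empty)
      // Skip the exact test when the whole cell is inside the polygon.
      if gc.relation == "Within" || within(pc.point, gc.polygon)
    } yield (pc.point, gc.polygon.id)
  }
}

import IndexedJoinSketch._

val square = Polygon("sq", Vector(Point(0, 0), Point(4, 0), Point(4, 4), Point(0, 4)))
val polygonCells = Seq(
  PolygonCell(square, "a", "Within"),     // cell entirely inside the square
  PolygonCell(square, "b", "Intersects")  // cell straddling the boundary
)
val pointCells = Seq(
  PointCell(Point(1, 1), "a"),     // accepted without the exact test
  PointCell(Point(3.5, 3.5), "b"), // accepted by the exact test
  PointCell(Point(5, 5), "b"),     // rejected by the exact test
  PointCell(Point(9, 9), "c")      // no candidate polygon shares this cell
)
val matches = IndexedJoinSketch.join(pointCells, polygonCells)
```

The benefit comes from the first step: pairs whose cells differ are never compared, and pairs in a Within cell are accepted without any geometry computation.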

A sample Databricks community notebook that illustrates how to set up the indices and perform the spatial join is here:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/137058993011870/2088756965947706/6891974485343070/latest.html

@codecov-io

Codecov Report

Merging #118 into master will increase coverage by 0.91%.
The diff coverage is 88.07%.


@@            Coverage Diff            @@
##           master    #118      +/-   ##
=========================================
+ Coverage   76.89%   77.8%   +0.91%     
=========================================
  Files          42      47       +5     
  Lines        1402    1487      +85     
  Branches       98     103       +5     
=========================================
+ Hits         1078    1157      +79     
- Misses        324     330       +6
Impacted Files Coverage Δ
src/main/scala/magellan/GeoJSONRelation.scala 83.87% <ø> (ø) ⬆️
src/main/scala/magellan/OsmFileRelation.scala 92.45% <ø> (ø) ⬆️
src/main/scala/magellan/ShapefileRelation.scala 95.23% <ø> (ø) ⬆️
src/main/scala/magellan/index/ZOrderCurve.scala 94.28% <ø> (ø) ⬆️
src/main/scala/magellan/index/Index.scala 0% <0%> (ø)
...la/org/apache/spark/sql/types/ZOrderCurveUDT.scala 100% <100%> (ø)
src/main/scala/magellan/catalyst/SpatialJoin.scala 100% <100%> (ø)
...main/scala/magellan/index/ZOrderCurveIndexer.scala 100% <100%> (ø) ⬆️
src/main/scala/magellan/Utils.scala 100% <100%> (ø)
src/main/scala/magellan/DefaultSource.scala 85.71% <100%> (ø) ⬆️
... and 12 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c0cbd46...e18a966. Read the comment docs.

@harsha2010 harsha2010 merged commit c918aa9 into harsha2010:master Jun 27, 2017
@jtmurphy89

Thanks for the update! Quick question: should "df.withColumn("index", $"point" precision 30)" read "df.withColumn("index", $"point" index 30)"?

@harsha2010
Owner

@jtmurphy89 here is a Databricks notebook illustrating how to set up the indices:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/137058993011870/2088756965947706/6891974485343070/latest.html

I'll also add it to the documentation.
