Cassandra Support #1398

Closed
echeipesh opened this issue Mar 17, 2016 · 3 comments
Comments

@echeipesh
Contributor

GeoTrellis's Spark integration currently supports Accumulo as a backend for storing and retrieving raster data across a cluster. Cassandra is another distributed data store that could provide a rich set of features and performance opportunities to GeoTrellis instances running on Spark. It is also popular, and a number of people interested in large-scale geospatial computation are already using it.

A Google Summer of Code 2015 scholar tackled this project and created a prototype GeoTrellis catalog implementation for Cassandra. Several lessons were learned about how the Cassandra architecture affects performance when interfacing with Apache Spark, and the project is at a point where further effort is expected to bring this feature to completion. The base objective is to achieve performant multi-dimensional range queries from Apache Spark, using the available low-level interfaces.

Stretch goals for this project include optimizing the implementation to take into account the cluster data locality information available from Cassandra when distributing the IO load.

Previous work attempted to use https://github.com/datastax/spark-cassandra-connector to implement this feature, but the connector's focus on supporting a query-like interface forced us to perform a Spark union over multiple component range queries, which turned out to be slow.
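To illustrate why a single spatial query turns into several component range queries, here is a minimal sketch. It assumes tiles are indexed with a Z-order (Morton) curve; the issue does not say which index was used, and `z_index` and `bbox_to_ranges` are hypothetical names, not GeoTrellis or connector APIs.

```python
from typing import List, Tuple


def z_index(col: int, row: int, bits: int = 16) -> int:
    """Interleave the bits of col and row into one Z-order index.

    Assumption for illustration: col occupies the even bit positions
    and row the odd ones.
    """
    z = 0
    for i in range(bits):
        z |= ((col >> i) & 1) << (2 * i)
        z |= ((row >> i) & 1) << (2 * i + 1)
    return z


def bbox_to_ranges(col_min: int, col_max: int,
                   row_min: int, row_max: int) -> List[Tuple[int, int]]:
    """Return the contiguous (start, end) index ranges covering a tile bounding box.

    Brute force: index every tile in the box, sort, then merge runs of
    consecutive indices into inclusive ranges.
    """
    indices = sorted(
        z_index(col, row)
        for col in range(col_min, col_max + 1)
        for row in range(row_min, row_max + 1)
    )
    ranges: List[Tuple[int, int]] = []
    for z in indices:
        if ranges and z == ranges[-1][1] + 1:
            # Extend the current run.
            ranges[-1] = (ranges[-1][0], z)
        else:
            # Start a new run.
            ranges.append((z, z))
    return ranges
```

Under this indexing, a box aligned with the curve collapses to one range, while a box that straddles a curve boundary splits into several. Each of those ranges would be a separate query whose results then have to be unioned.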

The key insight for the second stage of the project is that GeoTrellis already supports AWS S3 as a backend, even though S3 is extremely limited in its capabilities, allowing only List/Get/Put operations on a key/value store. The S3 backend answers a query by iterating, on the client, over all of the possible index keys covered by the query region and requesting each one directly. A similar feature must be available in a Cassandra driver; something that supports a "set get" of keys would be ideal.
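The key-enumeration approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `InMemoryStore` is a hypothetical stand-in for a get/put-only backend, keys use a simple row-major index rather than whatever index GeoTrellis actually uses, and `multi_get` emulates a "set get" with individual lookups.

```python
from typing import Dict, Iterable, List, Optional


class InMemoryStore:
    """Hypothetical stand-in for a key/value backend offering only get/put."""

    def __init__(self) -> None:
        self._data: Dict[int, bytes] = {}

    def put(self, key: int, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: int) -> Optional[bytes]:
        return self._data.get(key)


def row_major_key(col: int, row: int, layout_cols: int) -> int:
    """Assumed index for illustration: row-major position in the tile layout."""
    return row * layout_cols + col


def keys_for_region(col_min: int, col_max: int,
                    row_min: int, row_max: int,
                    layout_cols: int) -> List[int]:
    """Enumerate every index key that could fall inside the query region."""
    return [
        row_major_key(col, row, layout_cols)
        for row in range(row_min, row_max + 1)
        for col in range(col_min, col_max + 1)
    ]


def multi_get(store: InMemoryStore, keys: Iterable[int]) -> Dict[int, bytes]:
    """Ask for each candidate key directly and keep only the ones that exist."""
    found: Dict[int, bytes] = {}
    for key in keys:
        value = store.get(key)
        if value is not None:
            found[key] = value
    return found
```

The client never issues a range scan here: it computes the candidate keys itself and skips any that are absent. In a Spark setting the key list could presumably be split across tasks so the lookups run in parallel, but that part is not shown.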

@echeipesh echeipesh added the GSOC Google Summer of Code label Mar 17, 2016
@fosskers
Contributor

This can probably be closed.

@pomadchin
Member

Definitely can be closed, merged #1452 @lossyrob @echeipesh @rossbernet.

@lossyrob
Member

lossyrob commented Oct 6, 2016

Giving this one to Grisha since he finished off the feature :)

@lossyrob lossyrob removed the GSOC Google Summer of Code label Oct 19, 2016

4 participants