Cassandra support #1452

Merged
merged 76 commits into from Jun 10, 2016

@pomadchin
Member

pomadchin commented Apr 19, 2016

No description provided.

.mapPartitions { partition: Iterator[Seq[(Long, Long)]] =>
  val session = instance.session
  val statement = session.prepare(s"SELECT value FROM ${instance.keySpace}.${table} WHERE key = ?")

@pomadchin

pomadchin Apr 19, 2016

Member

It would probably be faster to use range queries here, but I'm not sure (the only way to find out is to try).

pomadchin added some commits Apr 19, 2016

fixed cassandra session management: 1. added close method to Attribute store (unit by default); 2. fixed session close on read and write not breaking async and / or lazy evaluations; 3. a bit difficult session management for client code, though as a good point this management not breaks speed.
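
The commit message above mentions closing sessions without breaking lazy evaluation. As a rough illustration of why that is tricky, here is a minimal sketch using only the Scala standard library. `FakeSession`, `readLazily` and `readStrictly` are hypothetical names invented for this example; they are not part of this PR or of the DataStax driver. `withSessionDo` borrows its name from the stack trace later in this thread, but the body here is an assumption, not the PR's implementation.

```scala
object SessionSketch {
  // Hypothetical stand-in for a driver session; NOT the DataStax API.
  final class FakeSession extends AutoCloseable {
    private var open = true
    def fetch(key: Long): String = {
      if (!open) throw new IllegalStateException("session is closed")
      s"value-$key"
    }
    override def close(): Unit = { open = false }
  }

  // Loan pattern: open a session, lend it to `f`, and always close it afterwards.
  def withSessionDo[T](f: FakeSession => T): T = {
    val session = new FakeSession
    try f(session) finally session.close()
  }

  // Unsafe: Iterator.map is lazy, so `fetch` only runs after the session is closed.
  def readLazily(keys: Seq[Long]): Iterator[String] =
    withSessionDo(session => keys.iterator.map(session.fetch))

  // Safe: force evaluation while the session is still open.
  def readStrictly(keys: Seq[Long]): Vector[String] =
    withSessionDo(session => keys.iterator.map(session.fetch).toVector)
}
```

Consuming the iterator returned by `readLazily` throws, because the session was closed before any element was fetched; `readStrictly` returns the values. The trade-off is that forcing evaluation gives up laziness, which is one reason session lifetime may end up being managed by client code instead.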
@lossyrob

Member

lossyrob commented May 3, 2016

Hey @allixender, would you be able to review this? We're also having issues getting the unit tests to run on Travis, any thoughts? Thanks!

@allixender

Contributor

allixender commented May 4, 2016

Hi @lossyrob , @pomadchin ,
sure thing. Now that 0.10.0 is stable it'd be great to look into it again. I might need some time to get up to speed with the setup and the code again. I have seen you've added a lot of documentation on the website, too. Love it.
Umm, where should I start, which branch should I look at?
Cheers,
Alex

@pomadchin

Member

pomadchin commented May 4, 2016

@allixender First of all, you can look at the diff of this PR. Second point: we have separate subprojects for the s3, accumulo, and cassandra backends; hadoop and file are included in the spark subproject.

What's different from your PR: we don't use the DataStax connector; we use the DataStax Java driver instead. The main differences are in the RDD reader and RDD writer.

So possible questions are:

  • is it really fast? We haven't checked it on a huge amount of data yet (I think we'll do that this week or next). And what do you think overall about this approach to reading / writing: can it be faster?
  • how to test it properly: I tried embedded Cassandra and also starting Cassandra as a service for Travis CI. The first variant is really slow; the second is much faster (on a local machine), but CI runs out of memory.

By the way, the Cassandra space-time tests are much slower than the others, but that's understandable: the other backends (accumulo / s3) have mock clients, while for Cassandra we need a real or embedded Cassandra. That's probably one of the reasons the Cassandra tests are a bit slower overall.

P.S. I tried loading LC8 tiles and moved the chatta demo to work with any backend (including Cassandra). There were no significant differences in tile read speed / tile ingest (though I haven't checked exact timings), and everything works as expected.

@pomadchin

Member

pomadchin commented May 31, 2016

Includes a fix for OpenJDK 7: travis-ci/travis-ci#5227
This fix does not yet work in PRs: travis-ci/travis-ci#5669
/cc: @lossyrob

@pomadchin

Member

pomadchin commented May 31, 2016

Rolled back to an old version and added links to the Travis issues.

@lossyrob

Member

lossyrob commented Jun 3, 2016

@pomadchin when I try to run the tests I get these types of errors:

DeferredAbortedSuite:
[info] Exception encountered when attempting to run a suite with class name: org.scalatest.DeferredAbortedSuite *** ABORTED ***
[info]   com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.1:9042 (com.datastax.driver.core.exceptions.TransportException: [/127.0.0.1] Cannot connect))
[info]   at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:231)
[info]   at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:77)
[info]   at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1414)
[info]   at com.datastax.driver.core.Cluster.init(Cluster.java:162)
[info]   at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:333)
[info]   at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:308)
[info]   at com.datastax.driver.core.Cluster.connect(Cluster.java:250)
[info]   at geotrellis.spark.io.cassandra.CassandraInstance$class.getSession(CassandraInstance.scala:18)
[info]   at geotrellis.spark.io.cassandra.BaseCassandraInstance.getSession(CassandraInstance.scala:47)
[info]   at geotrellis.spark.io.cassandra.CassandraInstance$class.withSessionDo(CassandraInstance.scala:34)

What's with that?

@lossyrob

Member

lossyrob commented Jun 6, 2016

@pomadchin can we have the unit tests fail in a more specific way when there is no environment, one that tells users how to make them pass?

This is my worry: There is a new user that wants to work with GeoTrellis. As I often do, I say, step 1 is to fork the repo, clone your fork, add upstream, and then run the unit tests.

So they run buildall.sh, and it all of a sudden fails with a pseudo-cryptic error message. They freak out and flee to the wilderness.

Is there a way we can capture the scenario of not-having-started-up-cassandra-docker-yet and explain that to users that are simply running all unit tests?

@pomadchin

Member

pomadchin commented Jun 6, 2016

@lossyrob I think so: we can add a special exception type and check Cassandra availability before any tests run.
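
A minimal sketch of that idea, using only the JDK and the Scala standard library. The names `CassandraUnavailableException`, `isReachable` and `requireCassandra` are hypothetical and the error message is illustrative; this is not the check that was merged. The default host and port are taken from the stack trace above (127.0.0.1:9042). It only tests that a TCP connection can be opened, which is a weaker check than a real driver handshake.

```scala
import java.io.IOException
import java.net.{InetSocketAddress, Socket}

object CassandraAvailability {
  // Hypothetical exception type; the name and message are illustrative only.
  final class CassandraUnavailableException(host: String, port: Int)
    extends RuntimeException(
      s"Cannot reach Cassandra at $host:$port. " +
      "Start a local Cassandra instance (for example in Docker) before running these tests.")

  // True if a TCP connection to host:port succeeds within timeoutMs milliseconds.
  def isReachable(host: String, port: Int, timeoutMs: Int = 1000): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), timeoutMs)
      true
    } catch {
      case _: IOException => false // refused, timed out, or unreachable
    } finally socket.close()
  }

  // Fail fast with a readable message instead of a driver stack trace.
  def requireCassandra(host: String = "127.0.0.1", port: Int = 9042): Unit =
    if (!isReachable(host, port)) throw new CassandraUnavailableException(host, port)
}
```

A test suite could call `CassandraAvailability.requireCassandra()` once in its setup hook, so that a missing environment surfaces as a single explanatory error before any test touches the driver.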

@lossyrob

Member

lossyrob commented Jun 6, 2016

@pomadchin also we need the script to be inside the geotrellis project. It should be a one-liner to run the Cassandra instance for the unit tests, and a README.md in the cassandra subproject should explain why and how to run the Cassandra docker container and the unit tests.

@pomadchin

Member

pomadchin commented Jun 6, 2016

@lossyrob what do you mean by "inside geotrellis project"? Is this one-liner not enough? Good point about editing the README.md.

@lossyrob

Member

lossyrob commented Jun 6, 2016

@pomadchin nevermind, I thought you had linked to a gist. Disregard the script request.

@lossyrob

Member

lossyrob commented Jun 6, 2016

What's with the

00:14:50 DCAwareRoundRobinPolicy: Using data-center name 'datacenter1' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor)

a bunch of times in the last build failure logs?

@pomadchin

Member

pomadchin commented Jun 6, 2016

It's related to the load balancing policy; I'll add configuration settings for it, I forgot about that. However, these annoying messages will still be present and there's nothing to be done about it, since they're logged at info level:

DCAwareRoundRobinPolicy: Using provided data-center name 'datacenter1' for DCAwareRoundRobinPolicy

@lossyrob lossyrob merged commit a94a765 into locationtech:master Jun 10, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@pomadchin pomadchin referenced this pull request Sep 30, 2016

Closed

Cassandra Support #1398

@lossyrob lossyrob added this to the 1.0 milestone Oct 18, 2016
