New opt-in partitioning strategy that may help with read optimization… #2855
Conversation
Force-pushed 43f07d7 to 7f2a7cf
Sorry about the PR churn... was fighting with gpg to get the signatures to work correctly.
# Comment out the block above and uncomment this to run the Cassandra tests
# using the new 'read-optimized-partitioner' strategy.
#geotrellis.cassandra {
We should probably remove this and just add in some commenting into the reference.conf to document the possibilities.
I think we need to remove it and add commented-out options to the original Cassandra config:
# partitionStrategy = "read-optimized-partitioner"
# tilesPerPartition = 64
Works for me. Happy to do whatever your project's convention is - just wasn't sure how to handle this.
binRanges.flatten.zipWithIndex
}

def zoomBin(key: BigInteger): java.lang.Integer = {
This bit is duplicative across several classes and should be refactored into a common trait (or something).
More concerned with validating the approach is sound right now... can make it pretty if it seems reasonable.
Basic idea is to produce a consistent binning strategy for partitioning in Cassandra so that we don't have too many or too few tiles in a given partition.
I believe this lets us take more advantage of Cassandra's partition key cache when we're fetching tiles landing in the same Zoom bin.
I'm not sure if these zoom bins map consistently based on the underlying calculations in the `intervals` val... I need some help understanding how that behaves to know if this is "safe" or not.
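The binning idea being discussed can be sketched in plain Scala. The names `binRanges` and `zoomBin` mirror the PR's intent, but this implementation is illustrative only, not the actual code: split each inclusive SFC index range into chunks of at most `tilesPerPartition` indices, number the chunks, and use a key's chunk number as its stable bin.

```scala
object ZoomBinSketch {
  // Split each inclusive (start, end) SFC index range into chunks of at most
  // tilesPerPartition indices, then number the chunks consecutively.
  def binRanges(indexRanges: Seq[(BigInt, BigInt)], tilesPerPartition: Int): Vector[((BigInt, BigInt), Int)] = {
    val step = BigInt(tilesPerPartition)
    val chunks = indexRanges.toVector.flatMap { case (start, end) =>
      (start to end by step).map(s => (s, (s + step - 1).min(end)))
    }
    chunks.zipWithIndex
  }

  // A key's bin is the index of the chunk it falls into; because the chunks are
  // derived deterministically from the ranges, the mapping is stable across runs.
  def zoomBin(bins: Vector[((BigInt, BigInt), Int)], key: BigInt): Int =
    bins.collectFirst { case ((s, e), i) if key >= s && key <= e => i }.getOrElse(0)
}
```

For example, the ranges (0, 9) and (100, 104) with `tilesPerPartition = 4` split into five bins: (0, 3), (4, 7), (8, 9), (100, 103), (104, 104), so no bin ever holds more than four tiles.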
@ALPSMAC look into the tests here: https://github.com/locationtech/sfcurve. I hope it's better than just words here under the issue. I don't think we need `zoomBin` at all.
The SFC gives us consistent enough ranges, so I think we can use just this:
instance.cassandraConfig.partitionStrategy match {
  case conf.ReadOptimizedPartitioner =>
session.execute(
SchemaBuilder.createTable(keyspace, table).ifNotExists()
.addPartitionKey("name", text)
.addPartitionKey("zoom", cint)
.addClusteringColumn("key", varint)
.addColumn("value", blob)
)
case conf.WriteOptimizedPartitioner =>
session.execute(
SchemaBuilder.createTable(keyspace, table).ifNotExists()
.addPartitionKey("key", varint)
.addClusteringColumn("name", text)
.addClusteringColumn("zoom", cint)
.addColumn("value", blob)
)
}
Hey @pomadchin - thanks for taking the time to review what I've got here so far. I appreciate your feedback!
It would be great news if, as you suggest, we could avoid the API change... this PR as it sits feels far more invasive than I'd like it to be. I'll poke around more in the sfcurve
repo as you suggested and see if the light-bulb flickers for me, but on first blush I'm not sure I follow your logic.
I don't want to jump to conclusions here - so please forgive my ignorance if I've grossly missed the mark, but...
Wouldn't creating the `ReadOptimizedPartitioner` schema as you suggested here result in large partition sizes for higher zoom levels irrespective of the clustering column? The partition size, as far as I know, is a function of the uniqueness of the partition keys: the more collisions of a partition key you have, the larger the partition grows. In your example schema the partition size wouldn't have anything to do with the ranges of the SFC, since that key is declared as only a clustering column (which determines order within a partition, but not the partition itself).
For instance - if I have a layer called "cloud_cover" and was indexing it at zoom level 15, wouldn't that mean that all of the tiles for zoom 15 for my cloud_cover layer would be stored in the same partition in Cassandra? That seems like an unrealistically large partition, especially if I'm storing large tiles or double array tiles that don't compress well...
Now - I am not at all a Cassandra expert (so perhaps I am completely out to lunch here), but Cassandra does warn about large partition sizes in its logs when they occur, and StackOverflow seems to agree (not that it's an authoritative source... but it's not a bad place to start): https://stackoverflow.com/questions/46272571/why-is-it-so-bad-to-have-large-partitions-in-cassandra
My premise in this PR was to give the client, if they so choose, an ability to tweak maximum partition sizes in a reasonable way to optimize performance for their given use case (with careful bench-marking). How would indexing as you suggest avoid unrealistically large partition sizes (which would seem to me to break that premise at higher zoom levels)?
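A rough back-of-envelope calculation illustrates the concern. All figures below are assumptions for illustration, not measurements from the PR: with `(name, zoom)` as the sole partition key, every tile of a layer at a given zoom lands in one partition.

```scala
// Worst case: a layer with full world coverage at zoom 15, stored as
// uncompressed 256x256 double-band tiles, all in ONE partition.
val zoom = 15
val tilesAtZoom = BigInt(2).pow(zoom) * BigInt(2).pow(zoom) // 2^30 tiles
val tileBytes = BigInt(256 * 256 * 8)                       // bytes per tile
val partitionBytes = tilesAtZoom * tileBytes                // bytes in that single partition
val guidanceBytes = BigInt(100) * 1024 * 1024               // the oft-cited ~100 MB partition guidance
```

`partitionBytes` comes out on the order of hundreds of terabytes, many orders of magnitude past the guidance, which is exactly the warning Cassandra logs about.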
Ah it makes sense, I was just confused about what it means; your answer clarifies everything.
Sending you here then.
Roger - thank you for the clarification :-)
So I'll proceed with code cleanup then as well as working up some benchmarks to either prove or disprove the utility of this PR.
}

lazy val intervals: Vector[(Interval[BigInt], Int)] = {
  val keyIndex = attributeStore.readKeyIndex[K](id)
This is what forced the change to the `delete` API for me... I'm not sure if we can fetch the `indexRanges` we need without it though... is there a better way that I'm missing?
@@ -41,10 +63,13 @@ case class CassandraConfig(
   localDc: String = "datacenter1",
   usedHostsPerRemoteDc: Int = 0,
   allowRemoteDCsForLocalConsistencyLevel: Boolean = false,
-  threads: CassandraThreadsConfig = CassandraThreadsConfig()
+  threads: CassandraThreadsConfig = CassandraThreadsConfig(),
+  partitionStrategy: CassandraPartitionStrategy = `Write-Optimized-Partitioner`,
The default preserves the behavior from before this change.
NOTE: Changing the `CassandraPartitionStrategy` requires dumping the old keyspaces associated with the layer; the schemas are not compatible.
)

object CassandraConfig extends CamelCaseConfig {
  private implicit lazy val partitionStrategyHint = new EnumCoproductHint[CassandraPartitionStrategy]
Required for the ADT encoding of the enum to parse reasonably in pureconfig.
@@ -39,6 +42,12 @@ trait CassandraTestEnvironment extends TestEnvironment { self: Suite =>
    println("\u001b[0;33mA script for setting up the Cassandra environment necessary to run these tests can be found at scripts/cassandraTestDB.sh - requires a working docker setup\u001b[m")
    cancel
  }
  startTime = System.currentTimeMillis()
We probably shouldn't keep this timing logic in here...
I'd love to contribute an IO benchmark for Cassandra layer access as well... but I have to admit I'm not quite sure where to start on that. I see the `bench` subproject, but most of that seems to be about logic within `geotrellis` itself and doesn't concern itself with the database IO.
Yes, I agree, probably benchmarks can be placed here: https://github.com/locationtech/geotrellis/tree/master/bench
Works for me - I'll see what I can do about adding some Read, Write, and Mixed benchmarks for Cassandra under both schema strategies.
Indeed; I'm wondering how differently a PK change would behave on reads in tests, since we're not performing range queries: https://github.com/locationtech/geotrellis/blob/master/cassandra/src/main/scala/geotrellis/spark/io/cassandra/CassandraRDDReader.scala#L80
Agreed entirely - I don't want to introduce more overhead or complexity if it's not to the benefit of performance.
My thought is that we may benefit from Cassandra's row cache even across multiple queries if those queries land in the same partition. I must admit though that I can't confirm whether or not that's actually a reasonable expectation.
docs/cassandra/cassandra-test.md
@@ -15,6 +15,11 @@ to launch tests. Before running Cassandra tests, be sure, that a local
Cassandra instance using Docker is provided
[here](https://github.com/pomadchin/geotrellis/blob/feature/cassandra-nmr/scripts/cassandraTestDB.sh).

One can also use [ccm](https://github.com/riptano/ccm) for running a Cassandra
I've tested this PR by changing the configuration in `reference.conf` and then running against a test `geotrellis` Cassandra cluster built by `ccm` with vnodes enabled and partition key caching on.
@@ -274,6 +274,10 @@ fast, column-based NoSQL database. It is likely the most performant of
our backends, although this has yet to be confirmed. To work with
GeoTrellis, it requires an external Cassandra process to be running.

The indexing strategy can be optimized for read-heavy or write-heavy loads by adjusting
the partition strategy applied - see the reference.conf in the `geotrellis-cassandra`
subproject on github for more details.
There's probably more to be said about this to document it clearly... Using the `reference.conf` configuration method for db access seems pretty common across the different tile backends, but I wasn't exactly sure where those configuration options get documented. Can you point me to where that should be recorded?
@@ -128,10 +128,35 @@ trait ConfigFormats {
  }
}

implicit val cassandraPartitionStrategyFormat = new JsonFormat[CassandraPartitionStrategy]{
I don't think Spray JSON has a nice way to automatically encode/decode enums encoded as ADTs. There may be other add-on libraries out there that know how to automate this boilerplate as a macro or something... not sure what this project's feeling is on that sort of thing.
@pomadchin FYI - I'm back from the holidays and am looking at code cleanup now. I'll also take a look at how to do a better performance benchmark with results that I can believe a little bit more, to convince ourselves we didn't introduce a feature that makes things worse ;-)
The main clarification is mostly related to the `zoomBin` parameter. It looks like we can avoid using it.
If it's not necessary then it's easy to avoid such a massive and undesired API change.
  .and(eqs("zoom", layerId.zoom))
  .toString
val query = instance.cassandraConfig.partitionStrategy match {
  case conf.`Read-Optimized-Partitioner` =>
`ReadOptimizedPartitioner` and `WriteOptimizedPartitioner` look better to me.
Yes - and in uncommitted work I've moved away from the idea of naming it "Partitioner" to begin with, since that conflates a different idea in Cassandra. We're not setting a partitioner... we're adjusting the strategy behind the schema optimization, so I've changed the name of this around a bit.
I'll remove the dashes though as you suggest :-)
… for cassandra by taking better advantage of the SFC locality guarantees.
…nfo on cassandra partitioner strategies. TODO: we should improve the reference.conf there to document the different strategies etc.
Force-pushed 0ece030 to c40619d
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
I squeezed in a bit more time to work on this today and got some of the duplicate code cleaned up and pulled into this `CassandraIndexing` class.
I'm going to try to dive into some sort of benchmarking thing to see if I can prove to myself whether or not this indexing change really has done anything to improve read speed next.
If nothing else, I like that this centralizes all of the schema-related stuff in one class so you don't have to hunt across the layering API to find it.
Thanks,
Andy
instance.withSessionDo { session =>
-  val statement = session.prepare(query)
+  val statement = indexStrategy.prepareQuery(query)(session)
Made the evaluation of `session` lazy since you might want to get clever with a function that evaluates to a `Session` only when in the proper scope for it to be bound to a particular Executor in Spark.
Maybe I'm overthinking this... but not evaluating my Cassandra sessions lazily has been something that's bitten me before in other Spark code.
//Value Classes - should be unwrapped at compile time:
case class ZoomBinIntervals(intervals: Vector[(Interval[BigInt], Int)]) extends AnyVal

final case class WriteValueStatement(statement: BuiltStatement) extends AnyVal with CassandraStatement
Trying to add in some type safety here... Would have reached for tagged types, but I wasn't sure how the project felt about them (or which library to use for them). I opted for the native-supported "value classes" construct instead... which relies upon the compiler to do the "right thing" a little too much for my taste, but should still avoid some of the (un)boxing if I understand correctly.
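The statement-wrapper idea can be sketched as follows. The names here are illustrative, not the PR's exact API, and the wrappers are shown as plain case classes for portability; in the PR they extend `AnyVal` (a value class with a single parameter), which lets the compiler elide the wrapper allocation at most call sites while the type distinction still holds at compile time.

```scala
// Distinct wrapper types around raw statements: a writer API typed against
// WriteStatement can no longer be handed a ReadStatement by accident.
sealed trait CassandraStatementLike { def cql: String }
final case class ReadStatement(cql: String) extends CassandraStatementLike
final case class WriteStatement(cql: String) extends CassandraStatementLike

// Only accepts write statements; passing a ReadStatement is a compile error.
def executeWrite(s: WriteStatement): String = s"executing: ${s.cql}"
```

The trade-off noted in the comment is real: value classes have restrictions (they must be top-level or members of statically accessible objects, and they re-box in generic or pattern-matching contexts), so the "no allocation" guarantee only holds when the compiler can see through the wrapper.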
  .from(header.keyspace, header.tileTable).allowFiltering()
  .where(eqs("name", id.name))
  .and(eqs("zoom", id.zoom))
val keyIndex = attributeStore.readKeyIndex[K](id)
Still not thrilled with having to go out and fetch the `KeyIndex` here, but I've yet to noodle through another way to fetch the range information I need in order to assign a consistent `zoomBin`.
-session.execute(statement.bind(entry.getVarint("key")))
+//NOTE: use `iterator()` when possible as opposed to `all()` since the latter forces
+// materialization of the entirety of the query results into on-heap memory.
+session.execute(squery).iterator().asScala.map { entry =>
Note the change to `iterator()` here from the `all()` call earlier. Should hopefully reduce some heap churn on large deletes.
This change might be worth making independent of the rest of this PR as it is a simple optimization.
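The distinction has a plain-Scala analogue, shown below. The Java driver's `ResultSet` is assumed to behave analogously: `all()` materializes every row into an on-heap list, while `iterator()` pages through rows lazily as they are consumed.

```scala
// Eager: the entire "result set" is built as a List before anything is processed.
def allRows(n: Int): List[Int] = (1 to n).toList

// Lazy: rows are produced one at a time as the consumer pulls them.
def rowIterator(n: Int): Iterator[Int] = (1 to n).iterator

// Even against a huge result set, only the consumed prefix is ever materialized.
val firstThree = rowIterator(10000000).take(3).toList
```

For a delete that streams keys and issues per-key statements, the lazy form keeps heap usage proportional to the in-flight batch rather than the full result set.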
…nds/evidence parameters on delete API
class AccumuloLayerDeleter(val attributeStore: AttributeStore, connector: Connector) extends LazyLogging with LayerDeleter[LayerId] {

-  def delete(id: LayerId): Unit = {
+  def delete[K: ClassTag](id: LayerId): Unit = {
Was able to reduce the number of required bounds/evidence parameters to just the ClassTag itself... evidently that's all that is required to look up the `KeyIndex`.
I'm curious as to whether the metadata in the attributes table that's already loaded in the `delete` implementation contains the required classname, so that we could materialize the class/type required at runtime rather than have to rely upon an API change.
Even if that information is available though, I may need an assist using it. I'm not sure how I'd go from a reflective call to look up a class by name to a type constructor argument required for the `KeyIndex` lookup.
…mark are not positive, so likely will abandon this effort.
@ALPSMAC have you tried to run some tests on a real (N nodes) Cassandra cluster?
@@ -0,0 +1,223 @@
package geotrellis.spark.io.cassandra.bench
So ultimately I wasn't sure exactly where to add some sort of benchmarking to this PR. I decided on this location, but the location might be moot. I think I am going to recommend declining this PR for now...
Results of benchmarking were not favorable unfortunately! I was really hoping that leveraging the locality guarantees of the SFC would net us better performance out of Cassandra than simply ignoring it like we do now. Unfortunately that does not seem to be the case!
The following output from this test case was performed with `ccm` with vnodes on, with 3 instances running on a 4-core (8 w/ hyperthreading) Xeon processor with 64GB of RAM:
Average write-time for READ optimized schema: 2451.9333333333334ms
Average write-time for WRITE optimized schema: 1119.6666666666667ms
STDDEV write-time for READ optimized schema: 973.7087495185041ms
STDDEV write-time for WRITE optimized schema: 183.95712060755415ms
...
Average read-time for READ optimized schema: 311.19ms
Average read-time for WRITE optimized schema: 135.7ms
STDDEV read-time for READ optimized schema: 170.76438123917995ms
STDDEV read-time for WRITE optimized schema: 23.697468219200122ms
The gap narrows if you run with replicationFactor = 2 as opposed to 1 by default, but the story still isn't good:
Average write-time for READ optimized schema: 1938.0ms
Average write-time for WRITE optimized schema: 1155.2666666666667ms
STDDEV write-time for READ optimized schema: 478.0826985644499ms
STDDEV write-time for WRITE optimized schema: 72.90584491124304ms
...
Average read-time for READ optimized schema: 262.4ms
Average read-time for WRITE optimized schema: 207.8ms
STDDEV read-time for READ optimized schema: 106.12699939223765ms
STDDEV read-time for WRITE optimized schema: 43.671729986342434ms
At this point I'm a little stuck... my intuition says there should be some way to leverage the locality guarantees provided by the SFC within Cassandra, but if there is, I don't think this approach is going to get us there.
So I don't want you guys to lose anything that's in this PR that might be worth keeping around... but my suspicion is that the main thrust of it isn't worth merging. I really appreciate your time and help though while I worked through this implementation and got to a place where I could start testing and getting some numbers to invalidate my changes.
Perhaps if nothing else it's worth recording that this experiment was conducted somewhere in your internal docs so you'll know in the future that the proverbial "juice isn't worth the squeeze". Let me know if there's anything further I can do to help in that regard or if there's something else you'd like to see tried as part of this PR before abandoning it.
Kind Regards,
Andy
@@ -0,0 +1,202 @@
# geotrelis.cassandra.optimization
@pomadchin - as per our gitter discussion, here's a summary doc. on what I did. Suggestions welcome.
That's great! Figuring out how to merge everything
Force-pushed 0feb792 to 5e034b3
Merging it into a separate branch! All docs will be backported to the master branch.
… for cassandra by taking better advantage of the SFC locality guarantees.
cc @pomadchin
Overview
Strawman PR for comments/feedback... code is still quite rough but looking for thoughts on indexing changes and API changes required.
Addresses thoughts in #2831
Checklist

- docs/CHANGELOG.rst updated, if necessary (not sure if necessary?)
- docs guides updated, if necessary

Notes
I'm hoping that:

1. You can help me avoid having to grab the range of the key via the `KeyIndex`... Is there another way to access this information that I'm missing? If so then I think we can avoid the change to the `LayerManager` and `LayerDeleter` APIs. My suspicion is that I've missed something fundamental here about the abstraction, but I'm not quite sure what.
2. You can help me understand the type signature of `KeyIndex.indexRanges`, as I'm not sure I'm making proper usage of it. I'm guessing that, because an SFC may have "breaks" in it where there are large contiguous blocks of its range that are not used, the `Seq[(BigInt, BigInt)]` return type is representing something like this:

|(begin SFC index range)---------|(unused part of range index)*|(more used index)--------|(more unused index)|------(end SFC index range)|

so that really what the `(BigInt, BigInt)` tuples are capturing is the coverage area over the range of the SFC like this:

|------|*****|------|****|------|
....0..............1............2...
Seq[(0th tuple, 1st tuple, 2nd tuple)]

Is this correct?
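Under that reading, the tuples can be treated as inclusive coverage segments over the curve's index space. A minimal sketch, with made-up ranges standing in for whatever `indexRanges` would actually return:

```scala
// Hypothetical coverage segments: used stretches of the SFC index space.
val ranges: Seq[(BigInt, BigInt)] =
  Seq((BigInt(0), BigInt(6)), (BigInt(12), BigInt(18)), (BigInt(23), BigInt(29)))

// A key is addressable iff it falls inside one of the used segments;
// the gaps (7-11 and 19-22 here) are the unused stretches of the range.
def covered(key: BigInt): Boolean =
  ranges.exists { case (s, e) => s <= key && key <= e }
```

If this interpretation is right, the tuples carry exactly the information a binning scheme needs: the total covered index count and where each covered stretch begins and ends.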
With the following (updated) configuration (note the new "read-optimized-partitioner"):
I got these timings from the test spec:
And with this (original) configuration:
I got these timings from the test spec:
All of the differences nominally seem to me to be within the margin of error... if anything I've made things a little slower, which only makes sense... more data = more time to write it :-)
Anyway - I'd be interested to know if you have any thoughts on how to more properly benchmark this.
Addresses #2831