Incremental Layer Updater #2396
Conversation
Force-pushed from 5e31ed8 to 9621407
Schema update logic relies on loading the entire layer into Spark memory, which is not a viable strategy for large layers.
Force-pushed from 9621407 to d224192
Removing duplicate arguments also factored out the update verification logic. Deprecate LayerUpdater.update and preserve the API.
Force-pushed from d224192 to 9a851a4
Helps run the Cassandra tests on Macs
To avoid Java serialization errors in Spark jobs
Mostly nitpick comments. My only concern from a global, long-term stability/complexity perspective is whether the tests are extensive enough; I don't have enough background to know.
val kvs1: Vector[(K,V)] = _kvs1.toVector
val kvs2: Vector[(K,V)] =
  if (mergeFunc != None) {
    val scanner = instance.connector.createScanner(table, new Authorizations())
I'm not much on the details of Accumulo, but is `new Authorizations()` the same as `Authorizations.EMPTY`? If so, I'd use the latter to make it clear/explicit that Authorizations aren't being used. At any rate, it caught my eye.
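A minimal sketch of the suggestion, assuming the Accumulo 1.x `Connector` API used in the diff (`makeScanner` is a hypothetical helper): `Authorizations.EMPTY` is a shared empty instance, so the intent that no authorizations are in play is explicit at the call site.

```scala
import org.apache.accumulo.core.client.Connector
import org.apache.accumulo.core.security.Authorizations

// Equivalent to createScanner(table, new Authorizations()), but the
// constant makes "no authorizations" explicit.
def makeScanner(connector: Connector, table: String) =
  connector.createScanner(table, Authorizations.EMPTY)
```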
.map { case (key, _kvs1) =>
  val kvs1: Vector[(K,V)] = _kvs1.toVector
  val kvs2: Vector[(K,V)] =
    if (mergeFunc != None) {
Minor nitpick, so ignore if you don't care, but below you do pattern matching on `Option[(V,V) => V]` and here use an identity boolean check. Consistency might help readability. I prefer pattern matching or `mergeFunc.nonEmpty` myself.
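For illustration, a standalone sketch of the styles under discussion; `mergeFunc` here is a stand-in for the PR's `Option[(V, V) => V]`, and all three checks are equivalent.

```scala
val mergeFunc: Option[(Int, Int) => Int] = Some(_ + _)

val viaIdentity: Boolean = mergeFunc != None   // identity check, as in the diff
val viaNonEmpty: Boolean = mergeFunc.nonEmpty  // explicit emptiness check
val viaMatch: Boolean = mergeFunc match {      // pattern matching
  case Some(_) => true
  case None    => false
}
```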
update(id, rdd, None)
}

def update[
These `LayerWriter`s seem to have a number of common delegate/convenience methods. Any way they can be abstracted over all/most of the cases?
We could expose the overload with `mergeFunc: Option[(V, V) => V]`, but it's a little iffy because having that as `None` turns the function into overwrite mode, which is drastically more dangerous. I hesitated to do that; do you think it would be helpful, though?
Not sure which way the thumb is going. Is it:
- "No, way too dangerous!"
- "Yes, let's expose the common API"
decomposeKey: K => Long,
keyspace: String,
table: String,
threads: Int = ConfigFactory.load().getThreads("geotrellis.cassandra.threads.rdd.write")
I suggest pulling the default number of threads into a final (constant) member of the object.
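A sketch of that suggestion with a hypothetical object name; `getThreads` is the helper already used in the diff, and its enclosing import is assumed to be in scope.

```scala
import com.typesafe.config.ConfigFactory

object CassandraWriterDefaults {
  // Read once, named once; parameter defaults reference this constant.
  final val DefaultThreads: Int =
    ConfigFactory.load().getThreads("geotrellis.cassandra.threads.rdd.write")
}

// The parameter default then becomes:
//   threads: Int = CassandraWriterDefaults.DefaultThreads
```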
} else None
})

def elaborateRow(row: (String, Vector[(K,V)])): Process[Task, (String, Vector[(K,V)])] = {
Looks like a similar pattern to the Cassandra case. Opportunity for refactoring?
@@ -4,7 +4,8 @@ docker pull cassandra:latest

docker run \
  --rm \
  --net=host \
  -p 9160:9160 \
  -p 9042:9042 \
👍
id: LayerId,
as: AttributeStore,
mergeFunc: Option[(V,V) => V],
indexInterval: Int = 4
Curious to know what's magic about 4... move it to an object constant with a comment?
@@ -65,12 +65,21 @@ trait LayerUpdateSpaceTimeTileSpec
  updater.update(layerId, sample)
}

it("should overwrite a layer") {
  updater.overwrite(layerId, sample)
Does this actually confirm that the update happened?
It only confirms it didn't throw; it's possible that an update was empty, either in bounds or in records, in which case it would be a NOOP.
Overall I agree it would be much more helpful if this returned the number of records updated/written, but that would require changing the interface of `_RDDWriter.write`... or (I'm realizing as I'm typing this) adding a new counting method and wrapping it to preserve the interface.
I would push this change off to 2.0, though; what do you think?
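For what it's worth, a sketch of the wrapping idea with hypothetical signatures: a counting variant reports how many records went out, and the existing `Unit`-returning interface is preserved by delegating to it and discarding the count.

```scala
object RDDWriterSketch {
  // Hypothetical counting variant: callers can assert an update wrote data.
  def writeWithCount[K, V](records: Iterator[(K, V)])(writeOne: ((K, V)) => Unit): Long = {
    var written = 0L
    records.foreach { kv => writeOne(kv); written += 1 }
    written
  }

  // Existing interface, unchanged for current callers.
  def write[K, V](records: Iterator[(K, V)])(writeOne: ((K, V)) => Unit): Unit = {
    writeWithCount(records)(writeOne); ()
  }
}
```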
it("should not update a layer (empty set)") { | ||
intercept[EmptyBoundsError] { | ||
updater.update(layerId, new ContextRDD[SpaceTimeKey, Tile, TileLayerMetadata[SpaceTimeKey]](sc.emptyRDD[(SpaceTimeKey, Tile)], emptyTileLayerMetadata)) | ||
} | ||
} | ||
|
||
it("should silently not overwrite a layer (empty set)") { | ||
updater.overwrite(layerId, new ContextRDD[SpaceTimeKey, Tile, TileLayerMetadata[SpaceTimeKey]](sc.emptyRDD[(SpaceTimeKey, Tile)], emptyTileLayerMetadata)) |
ditto...
Looks nice, though I have some questions.
val LayerAttributes(header, metadata, keyIndex, writerSchema) = try {
  attributeStore.readLayerAttributes[AccumuloLayerHeader, M, K](id)
} catch {
-  case e: AttributeNotFoundError => throw new LayerUpdateError(id).initCause(e)
+  case e: AttributeNotFoundError => throw new LayerReadError(id).initCause(e)
Wouldn't there be confusion that `LayerReadError` was thrown during `LayerUpdater` usage?
good catch.
V: AvroRecordCodec: ClassTag,
M: JsonFormat: GetComponent[?, Bounds[K]]: Mergable
](id: LayerId, rdd: RDD[(K, V)] with Metadata[M]): Unit = {
  val CassandraLayerHeader(_, _, keyspace, table) = attributeStore.readHeader[CassandraLayerHeader](id)
What will happen if there is no layer? (The question is about all `attributeStore.readHeader` calls.)
You'll get an `AttributeReadError` with `readHeader` in the stack trace. I think that's pretty descriptive.
Process eval Task ({
  val (key, kvs1) = row
  val kvs2 =
    if (mergeFunc != None) {
Use `mergeFunc.isDefined` everywhere in this PR where `mergeFunc != None` is used.
(kvs2 ++ kvs1)
  .groupBy({ case (k,v) => k })
  .map({ case (k, kvs) =>
    val vs = kvs.map({ case (k,v) => v }).toSeq
Is `.toSeq` necessary here?
Doesn't appear to be, but also it'll do basically nothing except waste a stack frame.
}

val results = nondeterminism.njoin(maxOpen = threads, maxQueued = threads) {
-  queries map write
+  rows flatMap elaborateRow flatMap rowToBytes map retire
Looks cool :D
I think you meant:
rows flatMap elaborateRow flatMap rowToBytes map buyBoat map retire
rows flatMap boat flatMap upstream flatMap map lifeTodream
@@ -37,71 +38,40 @@ class S3LayerUpdater(
) extends LayerUpdater[LayerId] with LazyLogging {

  def rddWriter: S3RDDWriter = S3RDDWriter
  def _rddWriter(): S3RDDWriter = rddWriter

  class InnerS3LayerWriter(
Maybe make this class private? Or add some overloads to make construction of `S3LayerWriter` with a custom `rddWriter` more convenient?
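A self-contained sketch of the overload idea, with all names hypothetical: the class keeps a private constructor, and `apply` overloads make both the default and a custom `rddWriter` convenient.

```scala
trait RDDWriterLike
object DefaultRDDWriter extends RDDWriterLike

class LayerWriterSketch private (val rddWriter: RDDWriterLike)

object LayerWriterSketch {
  // default writer
  def apply(): LayerWriterSketch = new LayerWriterSketch(DefaultRDDWriter)
  // custom writer, no inner subclass required
  def apply(custom: RDDWriterLike): LayerWriterSketch = new LayerWriterSketch(custom)
}
```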
override def rddWriter() = _rddWriter
}

val as = attributeStore.asInstanceOf[S3AttributeStore]
This cast looks very dirty and makes `LayerUpdater` incompatible with other backends; it's a normal use case for the `AttributeStore` and the catalog itself to live in different backends. What do you think about it? Or maybe make `AttributeStore` types more restrictive in 2.0?
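One possible middle ground, sketched against the 1.x types (the error message is illustrative): pattern match instead of casting, so a mismatched backend fails fast with a descriptive error rather than a `ClassCastException`.

```scala
import geotrellis.spark.io.AttributeStore
import geotrellis.spark.io.s3.S3AttributeStore

def requireS3AttributeStore(attributeStore: AttributeStore): S3AttributeStore =
  attributeStore match {
    case s3: S3AttributeStore => s3
    case other => throw new IllegalArgumentException(
      s"S3LayerUpdater requires an S3AttributeStore, got ${other.getClass.getName}")
  }
```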
Yes, the interface for `S3LayerUpdater` is not great. I'm going to make these patch fields private and call it a day. The updaters should be removed in 2.0, and the way they're constructed right now should never trigger the error case.
case e: AmazonS3Exception if e.getStatusCode == 503 => true
case _ => false
  }
}

-  val results = nondeterminism.njoin(maxOpen = threads, maxQueued = threads) { requests map write }(Strategy.Executor(pool))
+  val results = nondeterminism.njoin(maxOpen = threads, maxQueued = threads) {
+    rows flatMap elaborateRow flatMap rowToRequest map retire
Can't we abstract over it a bit more? Cassandra uses the same thing; or is that hardly possible?
final val DefaultIndexInterval = 4

// From https://github.com/apache/spark/blob/3b049abf102908ca72674139367e3b8d9ffcc283/core/src/main/scala/org/apache/spark/util/SerializableConfiguration.scala
private class SerializableConfiguration(@transient var value: Configuration) extends Serializable {
I wonder if a lack of the `Serializable` trait has ever caused as much grief as in `Configuration`.
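For context, the shape of the standard workaround (after the Spark class linked in the diff, minus Spark's error-handling wrapper): mark the field `@transient` and hand-roll the Java serialization hooks around Hadoop's `Writable` support.

```scala
import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.conf.Configuration

class SerializableConfiguration(@transient var value: Configuration) extends Serializable {
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out)        // Configuration implements Writable
  }
  private def readObject(in: ObjectInputStream): Unit = {
    value = new Configuration(false)
    value.readFields(in)    // rebuild the transient field on deserialization
  }
}
```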
Can't put it into `geotrellis.util` on account of the missing Hadoop dependency, but it has a home at `geotrellis.spark.util`.
yes, that's what I meant; thanks!
> I wonder if a lack of the `Serializable` trait has ever caused as much grief as in `Configuration`.

`CRS` not being `Product` has a similar feel.
@@ -24,6 +24,11 @@ import java.nio.file._

object Filesystem {

  def exists(path : String): Boolean = {
    Files.exists(Paths.get(path))
👍
raster.groupBy({ row => decomposeKey(row._1) }, numPartitions = raster.partitions.length)
  .foreachPartition { partition: Iterator[(Long, Iterable[(K, V)])] =>
    if(partition.nonEmpty) {
      instance.withConnectionDo { connection =>
        val mutator = connection.getBufferedMutator(table)
        val _table = instance.getConnection.getTable(table)
Ah, sorry, I missed it %) We already have a `connection`. Just change one line; it should be:

val _table = connection.getTable(table)

`withConnectionDo` is a function which closes the `connection` after the block of code is done:

def withConnectionDo[T](block: Connection => T): T = {
  val connection = getConnection
  try block(connection) finally connection.close()
}

I had to introduce this function to keep an eye on open connections, as they blew up HBase even in local tests.
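A hypothetical usage sketch of `withConnectionDo` with that fix applied, assuming the `instance` and `table` values from the diff: the connection is closed once the block finishes, whether or not it throws, while the `Table` handle is still the caller's to close.

```scala
instance.withConnectionDo { connection =>
  val _table = connection.getTable(table)
  try {
    // ... apply mutations through _table ...
  } finally _table.close()
}
```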
Force-pushed from 2d328b8 to 8e19fa9
Looks good to me!
Supersedes: #2357 #2369 #2370 #2371 #2378 #2379 #2384
- `LayerWriter` interface
- `LayerUpdater` trait
- `AccumuloLayerWriter.update` does not support HDFS write strategy