
Collections API #1606

Merged: 26 commits, Sep 1, 2016
Conversation

@pomadchin (Member) commented Jul 27, 2016

#1605

Supported backends:

  • File
  • Hadoop
  • S3
  • Accumulo
  • Cassandra
  • HBase
  • we can improve performance by parallelizing reads

@pomadchin (Member Author) commented:
  • geotrellis.slick.PostgisSpec test failed

@pomadchin (Member Author) commented Aug 2, 2016

  • geotrellis.slick.PostgisSpec

@pomadchin changed the title from "[WIP] Collections API" to "Collections API" on Aug 2, 2016
K: AvroRecordCodec: Boundable: JsonFormat: ClassTag,
V: AvroRecordCodec: ClassTag,
M: JsonFormat: GetComponent[?, Bounds[K]]
](id: ID, rasterQuery: LayerQuery[K, M], numPartitions: Int, indexFilterOnly: Boolean): Seq[(K, V)] with Metadata[M]
Contributor commented:

This read should not have a numPartitions argument. All the reads should use threads to distribute the work as much as makes sense, but by definition you only ever get one partition: the collection.
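
For illustration, a minimal sketch of the suggested shape; the context bounds are reused from the quoted signature and the trait name matches CollectionLayerReader from later in the thread, but everything else is assumed rather than taken from the PR:

trait CollectionLayerReader[ID] {
  // numPartitions is gone: the reader may still parallelize internally with a
  // thread pool, but by definition the caller always gets back one collection.
  def read[
    K: AvroRecordCodec: Boundable: JsonFormat: ClassTag,
    V: AvroRecordCodec: ClassTag,
    M: JsonFormat: GetComponent[?, Bounds[K]]
  ](id: ID, rasterQuery: LayerQuery[K, M], indexFilterOnly: Boolean): Seq[(K, V)] with Metadata[M]
}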

@echeipesh (Contributor) commented:
It looks like a lot of code is shared between the collection readers and the RDD readers in S3, File, Hadoop, and Cassandra. HBase and Accumulo are exceptions because they have InputFormats, but even then we could easily replace those with custom readers and possibly gain some control.

Reading a collection seems exactly like reading a partition. Maybe we can abstract all of that code into something that reads just the ranges, leaving the splitting/filtering of those ranges up to whichever reader. I'm mostly concerned because this async code is very fiddly and error-prone; I can see us having to refactor it again when we learn something new.
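
As a rough illustration of that abstraction, here is a sketch in which all names are assumed: the backend supplies only a per-index byte fetch, while splitting ranges and filtering results stay with the calling reader:

object RangeReading {
  // Hypothetical shared core for the File/Hadoop/S3/Cassandra readers: walk the
  // index ranges, fetch and decode each chunk, and apply the caller's filter.
  def readRanges[K, V](
    ranges: Seq[(Long, Long)],               // inclusive index ranges from the key index
    readBytes: Long => Option[Array[Byte]],  // the only backend-specific piece
    decode: Array[Byte] => Vector[(K, V)],   // Avro decoding
    keep: K => Boolean                       // e.g. the indexFilterOnly refinement
  ): Vector[(K, V)] =
    ranges.toVector
      .flatMap { case (start, end) => start to end }
      .flatMap(index => readBytes(index).toList)
      .flatMap(decode)
      .filter { case (key, _) => keep(key) }
}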

}
}

nondeterminism.njoin(maxOpen = threads, maxQueued = threads) { range map read }.runFoldMap(identity).unsafePerformSync
Contributor commented:
Readers need to be closed; I'm not sure what the consequences of leaving them open are, but probably nothing good.

@pomadchin (Member Author) replied:
Aren't they closed on cache expiration?

Contributor replied:
I can see how that would mess with the caching, though, if you had multiple async calls into a collection reader. If that's the case, it almost makes it seem like CollectionsReader needs to be closable, as it will invariably hold some resources.

The main use case for these is to be called from endpoints to work on "small" areas, so async read calls are squarely within the scope of the design.
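
A minimal sketch of what a closable reader could look like; the trait and method shapes here are assumptions, not the PR's API:

// Hypothetical: tie the reader's lifetime to an explicit close() instead of cache
// expiration, so concurrent endpoint calls can share one instance and release it
// deterministically.
trait ClosableCollectionReader[K, V] extends AutoCloseable {
  def read(ranges: Seq[(Long, Long)]): Vector[(K, V)]
  def close(): Unit // free the backend's connections and file handles
}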

Contributor replied:
The same story goes for the HBase and Accumulo collection reads; they're based on some version of range/batch readers.

@pomadchin (Member Author) commented Aug 30, 2016

Includes important HBase and ETL fixes.

@pomadchin changed the title from "Collections API" to "[WIP] Collections API" on Aug 30, 2016
@pomadchin changed the title from "[WIP] Collections API" to "Collections API" on Aug 30, 2016
}

object CollectionLayerReader {
def njoin[K: AvroRecordCodec: Boundable, V: AvroRecordCodec](
@echeipesh (Contributor) commented Aug 31, 2016:

We ask for AvroRecordCodecs but we do not use them. If readFunc is Long => Option[Array[Byte]], we can spin the iterator here and do the decoding. You'll need an extra refinementFilter: K => Boolean to capture the filterIndexOnly param.
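
A sketch of the suggested signature: readFunc and refinementFilter are the names from the comment above, while the other parameters are assumed for illustration:

object CollectionLayerReader {
  // With the raw byte fetch passed in, njoin itself drives the iterator and does
  // the Avro decoding, so the AvroRecordCodec bounds are actually exercised here.
  def njoin[K: AvroRecordCodec: Boundable, V: AvroRecordCodec](
    ranges: Iterator[(Long, Long)],         // assumed: index ranges to read
    readFunc: Long => Option[Array[Byte]],  // backend-specific fetch, per the comment
    refinementFilter: K => Boolean,         // captures the filterIndexOnly param
    threads: Int                            // assumed: bound on concurrent reads
  ): Vector[(K, V)] = ???
}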

@echeipesh (Contributor) commented Aug 31, 2016

This is the list of places where it looks like we should use CollectionLayerReader.njoin:

  • CassandraCollectionReader
  • CassandraRDDReader
  • S3CollectionReader
  • S3RDDReader
  • HadoopCollectionReader
  • FileCollectionReader
  • FileRDDReader

It cannot be used in:

  • AccumuloCollectionReader
  • AccumuloRDDReader
  • HBaseCollectionReader
  • HBaseRDDReader
  • HadoopRDDReader

...because, one way or another, those backends read ranges rather than (K, V) pairs. I think that's fine; we don't need to abstract over that.
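
For illustration, the two read shapes being distinguished could be sketched like this (type names assumed):

object ReadShapes {
  // Backends with a per-index byte fetch fit CollectionLayerReader.njoin directly.
  type PerIndexRead = Long => Option[Array[Byte]]

  // Accumulo, HBase, and the Hadoop MR path consume whole key ranges and hand back
  // already-decoded records, so there is no per-index hook for njoin to wrap.
  type RangeRead[K, V] = ((Long, Long)) => Iterator[(K, V)]
}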

@pomadchin (Member Author) replied:
It cannot be used in HadoopRDDReader, as we use an MR job there.

pomadchin and others added 3 commits August 31, 2016 18:29
…n/geotrellis into feature/collections-api

# Conflicts:
#	accumulo/src/main/scala/geotrellis/spark/io/accumulo/AccumuloCollectionReader.scala
@echeipesh (Contributor) commented:
+1, submitted a meta-PR that tidies up the njoin signature a little.

@echeipesh merged commit 2e87199 into locationtech:master on Sep 1, 2016
@lossyrob added this to the 1.0 milestone on Oct 18, 2016