Collections API #1606

Merged
merged 26 commits into locationtech:master on Sep 1, 2016

Conversation

@pomadchin
Member

pomadchin commented Jul 27, 2016

#1605

Backends support:

  • File
  • Hadoop
  • S3
  • Accumulo
  • Cassandra
  • HBase
  • we can improve performance by parallelizing reads
@pomadchin

Member

pomadchin commented Jul 29, 2016

  • geotrellis.slick.PostgisSpec test failed
@pomadchin

Member

pomadchin commented Aug 2, 2016

  • geotrellis.slick.PostgisSpec

@pomadchin pomadchin changed the title from [WIP] Collections API to Collections API Aug 2, 2016

def read[
  K: AvroRecordCodec: Boundable: JsonFormat: ClassTag,
  V: AvroRecordCodec: ClassTag,
  M: JsonFormat: GetComponent[?, Bounds[K]]
](id: ID, rasterQuery: LayerQuery[K, M], numPartitions: Int, indexFilterOnly: Boolean): Seq[(K, V)] with Metadata[M]

@echeipesh

echeipesh Aug 18, 2016

Contributor

This read should not have a numPartitions argument. All the reads should use threads to distribute the work as much as makes sense, but by definition you only ever get one partition: the collection.
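
A minimal sketch of what that would leave, assuming the same context bounds as the snippet above (hypothetical, not the final signature):

// Hypothetical: the same read minus numPartitions; a collection is by
// definition a single partition, so only a configured thread count
// controls read parallelism.
def read[
  K: AvroRecordCodec: Boundable: JsonFormat: ClassTag,
  V: AvroRecordCodec: ClassTag,
  M: JsonFormat: GetComponent[?, Bounds[K]]
](id: ID, rasterQuery: LayerQuery[K, M], indexFilterOnly: Boolean): Seq[(K, V)] with Metadata[M]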

val scanner = instance.connector.createScanner(table, new Authorizations())
scanner.setRange(range)
scanner.fetchColumnFamily(columnFamily)
scanner.iterator.map { case entry =>

@echeipesh

echeipesh Aug 18, 2016

Contributor

I know the scanner will have a thread pool available to run multiple requests. Does it do the thing that would be very useful here and process multiple ranges asynchronously, or are the ranges essentially sequential, with only each individual range read async?

@pomadchin

pomadchin Aug 18, 2016

Member

Async for each range; I'll fix that issue when updating this PR against master. I thought it was completely async, but forgot that's only true for HBase, where we can set up a multi-range scanner.
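
For reference, a self-contained sketch (not code from this PR) of how Accumulo's BatchScanner, unlike the plain Scanner above, accepts a whole set of ranges and scans them in parallel on its own thread pool:

import java.util.Map.Entry
import org.apache.accumulo.core.client.Connector
import org.apache.accumulo.core.data.{Key, Range, Value}
import org.apache.accumulo.core.security.Authorizations
import scala.collection.JavaConverters._

// BatchScanner fans the ranges out across numThreads query threads.
def scanRanges(connector: Connector, table: String, ranges: Seq[Range], numThreads: Int): Vector[Entry[Key, Value]] = {
  val scanner = connector.createBatchScanner(table, new Authorizations(), numThreads)
  scanner.setRanges(ranges.asJava)
  try scanner.iterator.asScala.toVector // materialize before closing
  finally scanner.close()
}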

@echeipesh

Contributor

echeipesh commented Aug 18, 2016

It looks like a lot of code is shared between the collection readers and RDD readers in S3, File, Hadoop, and Cassandra. HBase and Accumulo are exceptions because they have InputFormats, but even then we could replace those easily with custom readers and possibly gain some control.

Reading a collection seems exactly like reading a partition. Maybe we can abstract all of that code into something that reads just the ranges, leaving the splitting/filtering of those ranges up to whichever reader, as sketched below. I'm mostly concerned because this async code is very fiddly and error-prone; I can see us having to refactor it again when we learn something new.
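
A hedged sketch of that abstraction (names hypothetical): one range-reading core that both reader families call, with splitting and filtering left to the caller:

// Hypothetical shared core: expand index ranges and fetch each record.
// An RDD reader would call this once per partition; a collection reader
// once for the whole query, possibly spread over several threads.
def readRanges[K, V](
  ranges: Seq[(Long, Long)],    // inclusive index ranges to read
  fetch: Long => Option[(K, V)] // backend-specific point lookup
): Vector[(K, V)] =
  ranges.toVector.flatMap { case (start, end) =>
    (start to end).flatMap(fetch)
  }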

}
}
nondeterminism.njoin(maxOpen = threads, maxQueued = threads) { range map read }.runFoldMap(identity).unsafePerformSync
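
The merge pattern in that line, as a self-contained toy (assumes scalaz-stream 0.8 and scalaz's default Strategy; the read function here is a stand-in, not a real backend read):

import scalaz.concurrent.Task
import scalaz.std.vector._
import scalaz.stream.{Process, nondeterminism}

val threads = 4
// Toy stand-in for a backend read: one Task per index.
val read: Long => Process[Task, Vector[Long]] =
  i => Process.eval(Task.delay(Vector(i)))
val range: Process[Task, Long] = Process.emitAll(0L to 99L).toSource

// Run up to `threads` reads concurrently, folding the chunks together.
val result: Vector[Long] =
  nondeterminism.njoin(maxOpen = threads, maxQueued = threads)(range.map(read))
    .runFoldMap(identity)
    .unsafePerformSync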

@echeipesh

echeipesh Aug 18, 2016

Contributor

Readers need to be closed. I'm unsure of the consequences of not closing them, but probably nothing good.

@pomadchin

pomadchin Aug 18, 2016

Member

Aren't they closed on cache expiration?

@echeipesh

echeipesh Aug 18, 2016

Contributor

I can see how that would mess with the caching though, if you had multiple async calls to a collection reader. If that's the case, it almost makes it seem like CollectionLayerReader needs to be closable, as it will invariably represent some resources.

The main use case for these is to be called from endpoints to work on "small" areas, so async read calls are totally within the scope of the design.
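
A hypothetical sketch of that direction (the trait and helper are assumptions, not code from this PR): make the reader closable and lend it out, so async endpoint code releases backend resources deterministically:

// Hypothetical: a collection reader that owns resources and must be closed.
trait ClosableCollectionReader[ID, K, V] extends AutoCloseable {
  def read(id: ID): Seq[(K, V)]
  def close(): Unit // release connections / thread pools held by the reader
}

object ClosableCollectionReader {
  // Loan pattern: close() runs even when a read throws.
  def withReader[ID, K, V, A](reader: ClosableCollectionReader[ID, K, V])
                             (f: ClosableCollectionReader[ID, K, V] => A): A =
    try f(reader) finally reader.close()
}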

@echeipesh

echeipesh Aug 18, 2016

Contributor

The same story goes for HBase and Accumulo collection reads; they're based on some version of range/batch readers.

@pomadchin pomadchin referenced this pull request Aug 26, 2016

Merged

ETL Refactor #5

7 of 7 tasks complete

@pomadchin pomadchin referenced this pull request Aug 29, 2016

Merged

Collections polygonal summary functions #1614

3 of 3 tasks complete
@pomadchin

Member

pomadchin commented Aug 30, 2016

Includes important HBase and ETL fixes.

@pomadchin pomadchin changed the title from Collections API to [WIP] Collections API Aug 30, 2016

@pomadchin pomadchin changed the title from [WIP] Collections API to Collections API Aug 30, 2016

pomadchin added some commits Aug 30, 2016

rollback readers; they were operating normally; problems were caused by already running jobs on a running cluster
}
object CollectionLayerReader {
  def njoin[K: AvroRecordCodec: Boundable, V: AvroRecordCodec](

@echeipesh

echeipesh Aug 31, 2016

Contributor

We ask for AvroRecordCodecs but do not use them. If the readFunc is Long => Option[Array[Byte]], we can spin the iterator here and do the decoding. You'll need an extra refinementFilter: K => Boolean to capture the filterIndexOnly param.
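
One possible reading of that suggestion, as a hypothetical signature sketch (body elided; refinementFilter stands in for filterIndexOnly):

import geotrellis.spark.Boundable
import geotrellis.spark.io.avro.AvroRecordCodec

// Hypothetical shape: fetch raw bytes per index, decode with the codecs
// the context bounds already require, then refine the keys.
def njoin[K: AvroRecordCodec: Boundable, V: AvroRecordCodec](
  ranges: Iterator[(Long, Long)],
  threads: Int
)(readFunc: Long => Option[Array[Byte]],
  refinementFilter: K => Boolean // captures the filterIndexOnly param
): Seq[(K, V)] = ??? // decode Avro records here, filtering by key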

@@ -33,7 +34,7 @@ trait S3RDDReader {
   filterIndexOnly: Boolean,
   writerSchema: Option[Schema] = None,
   numPartitions: Option[Int] = None,
-  threads: Int = ConfigFactory.load().getInt("geotrellis.s3.threads.rdd.read")
+  threads: Int = ConfigFactory.load().getThreads("geotrellis.s3.threads.rdd.read")
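
getThreads is presumably an enrichment over Typesafe Config; a guess at its shape (the "default" convention is an assumption, not the actual implementation):

import com.typesafe.config.Config

object ConfigImplicits {
  // Hypothetical enrichment: "default" means one thread per processor,
  // anything else is parsed as an explicit thread count.
  implicit class ConfigThreadsOps(val config: Config) extends AnyVal {
    def getThreads(path: String): Int =
      config.getString(path) match {
        case "default" => Runtime.getRuntime.availableProcessors
        case n         => n.toInt
      }
  }
}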

@echeipesh

echeipesh Aug 31, 2016

Contributor

S3RDDReader can now use CollectionLayerReader.njoin?

@echeipesh

Contributor

echeipesh commented Aug 31, 2016

This is the list of places where it looks like we should use CollectionLayerReader.njoin:

  • CassandraCollectionReader
  • CassandraRDDReader
  • S3CollectionReader
  • S3RDDReader
  • HadoopCollectionReader
  • FileCollectionReader
  • FileRDDReader

It cannot be used in:

  • AccumuloCollectionReader
  • AccumuloRDDReader
  • HBaseCollectionReader
  • HBaseRDDReader
  • HadoopRDDReader

…because, one way or another, those backends read ranges rather than (K, V) pairs. I think that's fine; we don't need to abstract over that.

@pomadchin

Member

pomadchin commented Aug 31, 2016

It cannot be used in HadoopRDDReader, as we use an MR job there.

pomadchin and others added some commits Aug 31, 2016

Merge branch 'feature/collections-api' of https://github.com/pomadchin/geotrellis into feature/collections-api

# Conflicts:
#	accumulo/src/main/scala/geotrellis/spark/io/accumulo/AccumuloCollectionReader.scala
@echeipesh

Contributor

echeipesh commented Sep 1, 2016

+1, submitted a meta-PR that tidies up the njoin signature a little.

@echeipesh echeipesh merged commit 2e87199 into locationtech:master Sep 1, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

@lossyrob lossyrob added this to the 1.0 milestone Oct 18, 2016
