
Collections API #1606

Merged: 26 commits, Sep 1, 2016
Conversation

@pomadchin (Member) commented Jul 27, 2016

#1605

Supported backends:

  • File
  • Hadoop
  • S3
  • Accumulo
  • Cassandra
  • HBase
  • we can improve performance by parallelizing reads

@pomadchin (Member Author) commented:
  • geotrellis.slick.PostgisSpec test failed

@pomadchin (Member Author) commented Aug 2, 2016

  • geotrellis.slick.PostgisSpec

@pomadchin changed the title from "[WIP] Collections API" to "Collections API" on Aug 2, 2016
K: AvroRecordCodec: Boundable: JsonFormat: ClassTag,
V: AvroRecordCodec: ClassTag,
M: JsonFormat: GetComponent[?, Bounds[K]]
](id: ID, rasterQuery: LayerQuery[K, M], numPartitions: Int, indexFilterOnly: Boolean): Seq[(K, V)] with Metadata[M]
Contributor commented:

This read should not have a numPartitions argument. All the reads should use threads to distribute the work as much as makes sense, but by definition you only ever get one partition: the collection.
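
For illustration, a minimal sketch of the suggested shape; the context bounds are reused from the quoted signature and the trait name matches CollectionLayerReader from later in the thread, but everything else is assumed rather than taken from the PR:

trait CollectionLayerReader[ID] {
  // numPartitions is gone: the reader may still parallelize internally with a
  // thread pool, but by definition the caller always gets back one collection.
  def read[
    K: AvroRecordCodec: Boundable: JsonFormat: ClassTag,
    V: AvroRecordCodec: ClassTag,
    M: JsonFormat: GetComponent[?, Bounds[K]]
  ](id: ID, rasterQuery: LayerQuery[K, M], indexFilterOnly: Boolean): Seq[(K, V)] with Metadata[M]
}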

@echeipesh (Contributor) commented:
It looks like a lot of code is shared between the collection readers and the RDD readers in S3, File, Hadoop, and Cassandra. HBase and Accumulo are exceptions because they have InputFormats, but even then we could easily replace those with custom readers and possibly gain some control.

Reading a collection seems exactly like reading a partition. Maybe we can abstract all of that code into something that reads just the ranges, leaving the splitting/filtering of those ranges up to whichever reader. I'm mostly concerned because this async code is very fiddly and error-prone; I can see us having to refactor it again when we learn something new.
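
As a rough illustration of that abstraction, here is a sketch in which all names are assumed: the backend supplies only a per-index byte fetch, while splitting ranges and filtering results stay with the calling reader:

object RangeReading {
  // Hypothetical shared core for the File/Hadoop/S3/Cassandra readers: walk the
  // index ranges, fetch and decode each chunk, and apply the caller's filter.
  def readRanges[K, V](
    ranges: Seq[(Long, Long)],               // inclusive index ranges from the key index
    readBytes: Long => Option[Array[Byte]],  // the only backend-specific piece
    decode: Array[Byte] => Vector[(K, V)],   // Avro decoding
    keep: K => Boolean                       // e.g. the indexFilterOnly refinement
  ): Vector[(K, V)] =
    ranges.toVector
      .flatMap { case (start, end) => start to end }
      .flatMap(index => readBytes(index).toList)
      .flatMap(decode)
      .filter { case (key, _) => keep(key) }
}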

}
}

nondeterminism.njoin(maxOpen = threads, maxQueued = threads) { range map read }.runFoldMap(identity).unsafePerformSync
Contributor commented:
Readers need to be closed; I'm not sure what the consequences of leaving them open are, but probably nothing good.

@pomadchin (Member Author) replied:
Aren't they closed on cache expiration?

Contributor replied:
I can see how that would mess with the caching, though, if you had multiple async calls into a collection reader. If that's the case, it almost makes it seem like CollectionsReader needs to be closable, as it will invariably hold some resources.

The main use case for these is to be called from endpoints to work on "small" areas, so async read calls are squarely within the scope of the design.
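
A minimal sketch of what a closable reader could look like; the trait and method shapes here are assumptions, not the PR's API:

// Hypothetical: tie the reader's lifetime to an explicit close() instead of cache
// expiration, so concurrent endpoint calls can share one instance and release it
// deterministically.
trait ClosableCollectionReader[K, V] extends AutoCloseable {
  def read(ranges: Seq[(Long, Long)]): Vector[(K, V)]
  def close(): Unit // free the backend's connections and file handles
}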

Contributor replied:
The same story goes for the HBase and Accumulo collection reads; they're based on some version of range/batch readers.

@pomadchin (Member Author) commented Aug 30, 2016

Includes important HBase and ETL fixes.

@pomadchin changed the title from "Collections API" to "[WIP] Collections API" on Aug 30, 2016
@pomadchin changed the title from "[WIP] Collections API" to "Collections API" on Aug 30, 2016
}

object CollectionLayerReader {
def njoin[K: AvroRecordCodec: Boundable, V: AvroRecordCodec](
@echeipesh (Contributor) commented Aug 31, 2016:

We ask for AvroRecordCodecs but we do not use them. If readFunc is Long => Option[Array[Byte]], we can spin the iterator here and do the decoding. You'll need an extra refinementFilter: K => Boolean to capture the filterIndexOnly param.
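
A sketch of the suggested signature: readFunc and refinementFilter are the names from the comment above, while the other parameters are assumed for illustration:

object CollectionLayerReader {
  // With the raw byte fetch passed in, njoin itself drives the iterator and does
  // the Avro decoding, so the AvroRecordCodec bounds are actually exercised here.
  def njoin[K: AvroRecordCodec: Boundable, V: AvroRecordCodec](
    ranges: Iterator[(Long, Long)],         // assumed: index ranges to read
    readFunc: Long => Option[Array[Byte]],  // backend-specific fetch, per the comment
    refinementFilter: K => Boolean,         // captures the filterIndexOnly param
    threads: Int                            // assumed: bound on concurrent reads
  ): Vector[(K, V)] = ???
}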

@echeipesh (Contributor) commented Aug 31, 2016

This is the list of places where it looks like we should use CollectionLayerReader.njoin:

  • CassandraCollectionReader
  • CassandraRDDReader
  • S3CollectionReader
  • S3RDDReader
  • HadoopCollectionReader
  • FileCollectionReader
  • FileRDDReader

It cannot be used in:

  • AccumuloCollectionReader
  • AccumuloRDDReader
  • HBaseCollectionReader
  • HBaseRDDReader
  • HadoopRDDReader

...because, one way or another, those backends read ranges rather than (K, V) pairs. I think that's fine; we don't need to abstract over that.
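
For illustration, the two read shapes being distinguished could be sketched like this (type names assumed):

object ReadShapes {
  // Backends with a per-index byte fetch fit CollectionLayerReader.njoin directly.
  type PerIndexRead = Long => Option[Array[Byte]]

  // Accumulo, HBase, and the Hadoop MR path consume whole key ranges and hand back
  // already-decoded records, so there is no per-index hook for njoin to wrap.
  type RangeRead[K, V] = ((Long, Long)) => Iterator[(K, V)]
}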

@pomadchin (Member Author) replied:
It cannot be used in HadoopRDDReader, as we use an MR job there.

pomadchin and others added 3 commits August 31, 2016 18:29
…n/geotrellis into feature/collections-api

# Conflicts:
#	accumulo/src/main/scala/geotrellis/spark/io/accumulo/AccumuloCollectionReader.scala
@echeipesh (Contributor) commented:
+1, submitted a meta-PR that tidies up the njoin signature a little.

@echeipesh merged commit 2e87199 into locationtech:master on Sep 1, 2016
@lossyrob added this to the 1.0 milestone on Oct 18, 2016