New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Estimate partitions number basing on GeoTiff segments #2296

Merged
merged 15 commits into from Aug 1, 2017

Conversation

Projects
None yet
2 participants
@pomadchin
Member

pomadchin commented Jul 24, 2017

  • Poor performance fix
  • Tests (new & fix old)
  • Better repartition in border cases (size / windows size in stripped / tiles cases)
  • Rethink multiband fromSegments function, see d952665
  • Rethink a singleband fromSegments function

The aim of two last points is to reduce memory usage

@pomadchin pomadchin force-pushed the pomadchin:feature/s3tiff-segments branch from 51a4abe to 15baaef Jul 24, 2017

@pomadchin pomadchin changed the title from [WIP] Estimate partitions number basing on GeoTiff segments to Estimate partitions number basing on GeoTiff segments Jul 25, 2017

@pomadchin pomadchin force-pushed the pomadchin:feature/s3tiff-segments branch from 3e43b8f to 42e05fa Jul 25, 2017

@pomadchin pomadchin requested a review from echeipesh Jul 25, 2017

@pomadchin pomadchin changed the title from Estimate partitions number basing on GeoTiff segments to [WIP] Estimate partitions number basing on GeoTiff segments Jul 25, 2017

@pomadchin pomadchin force-pushed the pomadchin:feature/s3tiff-segments branch from 85a3a7e to 8d991e3 Jul 25, 2017

pomadchin added some commits Jul 19, 2017

Added S3 metadata reader to estimate partitions number basing on a de…
…sired partition size in bytes.

Signed-off-by: Grisha Pomadchin <gr.pomadchin@gmail.com>
Add segments reads
Signed-off-by: Grisha Pomadchin <gr.pomadchin@gmail.com>
Make multiband case better
Signed-off-by: Grisha Pomadchin <gr.pomadchin@gmail.com>
Improve repartition strategy.
Signed-off-by: Grisha Pomadchin <gr.pomadchin@gmail.com>

@pomadchin pomadchin force-pushed the pomadchin:feature/s3tiff-segments branch from 8d991e3 to d952665 Jul 25, 2017

private def _subsetBandsFromSegments(
bandSequence: Seq[Int],
segmentIds: Traversable[Int],
deinterleaveBitSegment: (GeoTiffSegment, Int, Int, Int, Traversable[Int]) => Array[Array[Byte]],

This comment has been minimized.

@echeipesh

echeipesh Jul 25, 2017

Contributor

Thats a rough signature. Is there any way to sugar it up? Its used a couple of times, maybe type alias where we can hang a comment?

@@ -219,6 +219,14 @@ abstract class GeoTiffMultibandTile(
(bytes, bandCount, bytesPerSample, _) => GeoTiffSegment.deinterleave(bytes, bandCount, bytesPerSample)
).toVector
def bandsFromSegments(segmentIds: Traversable[Int]): Vector[Tile] =

This comment has been minimized.

@echeipesh

echeipesh Jul 25, 2017

Contributor

Should be private as not to add to API ?

def readSegments(ids: Traversable[Int], info: GeoTiffReader.GeoTiffInfo, options: Options) = {

This comment has been minimized.

@echeipesh

echeipesh Jul 25, 2017

Contributor

Needs return type

import scala.collection.mutable
case class S3GeoTiffInfoReader(

This comment has been minimized.

@echeipesh

echeipesh Jul 25, 2017

Contributor

By the way this is used it looks like it could be decomposed into a companion object with some methods on it, that would probably be safer than a case class that potentially carries client around.

pomadchin added some commits Jul 26, 2017

Use crop function istead of bySegments, added Iterator and Option var…
…iant to read by segments, code refactor.

Signed-off-by: Grisha Pomadchin <gr.pomadchin@gmail.com>
/** Read segments iterator, each segment into a single R */
def readSegmentsIterator(gbs: Array[GridBounds], info: GeoTiffReader.GeoTiffInfo, options: O): Iterator[R]
/** Reads all segments into one R */
def readSegments(gbs: Array[GridBounds], info: GeoTiffReader.GeoTiffInfo, options: O): Option[R]

This comment has been minimized.

@pomadchin

pomadchin Jul 26, 2017

Member

Two strategies to read segments, which is better for our use cases?

@pomadchin pomadchin changed the title from [WIP] Estimate partitions number basing on GeoTiff segments to Estimate partitions number basing on GeoTiff segments Jul 26, 2017

* each segment can only be in a single partition.
* */
def segmentsByPartitionBytes(partitionBytes: Long = Long.MaxValue, maxTileSize: Option[Int] = None)
(implicit sc: SparkContext): RDD[((String, GeoTiffInfo), Array[GridBounds])] = {

This comment has been minimized.

@pomadchin

pomadchin Jul 26, 2017

Member

We can still operate with segment refs

pomadchin added some commits Jul 28, 2017

Fix tests to use fromSegmentsIterator, looks like a cfor bug / feature
Signed-off-by: Grisha Pomadchin <gr.pomadchin@gmail.com>
Remove useless method, rename fromSegments into fromWindows, make mos…
…t of GeoTiffInfoReader function private

Signed-off-by: Grisha Pomadchin <gr.pomadchin@gmail.com>
* @param chunkSize How many bytes should be read in at a time.
* @param delimiter Delimiter to use for S3 objet listings. See
* @param getS3Client A function to instantiate an S3Client. Must be serializable.
* @param persistLevel A spark persist sotrage level, MEMORY_ONLY by default (similar to RDD.cache())
* @param bySegments Minimize segments reads, read input data by segments.

This comment has been minimized.

@echeipesh

echeipesh Jul 28, 2017

Contributor

Semver doesn't give a lot of room here, can't change the Options object.

Lets assume that persistLevel is MEMORY_ONLY, if there is not enough memory to read the tiff headers, its probably a lost cause anyway. Also re-reading them twice if memory cache is purged is good soft failover.

Lets keep the maxWindowSize=None as default, that will be the flag for using the segment wise read. Setting it to some value will be the flag for reading exactly those window chunks, as before.

This means that we will never read full GeoTiff again unless its smaller than the maxWindowSize value. It seems that that GeoTiffs small enough to be appropriate for RDD record basically don't get written, so this is good default

This comment has been minimized.

@pomadchin

pomadchin Jul 29, 2017

Member

So you think for by segment reads we don't need that window option by default? (at least in 1.2); But that can be a good feature to try to pack segments into windows we want.

This comment has been minimized.

@echeipesh

echeipesh Jul 31, 2017

Contributor

It seems like the actual size of the window is not important there because the partition size will control it. If there are too few windows in partitions, you are packing more. You never splice segments together, so the effect of the window size is just to provide logic to keep the segments spatially related. So I would assume that it can just be hard coded to 1024 for segment reads.

else segments
val result = repartition.flatMap { case ((key, md), segmentIndices) =>
rr.readWindows(segmentIndices, md, options).map { case (k, v) =>

This comment has been minimized.

@echeipesh

echeipesh Jul 28, 2017

Contributor

segmentIndices is stale variable name, I guess this is segmentBounds now ?

import scala.collection.mutable
trait GeoTiffInfoReader extends LazyLogging {

This comment has been minimized.

@echeipesh

echeipesh Jul 28, 2017

Contributor

Why not just make the the private [geotrellis] trait ?

@pomadchin pomadchin force-pushed the pomadchin:feature/s3tiff-segments branch from 59504a4 to 9bedc00 Jul 31, 2017

@echeipesh echeipesh added this to the 1.2 milestone Aug 1, 2017

@echeipesh echeipesh merged commit d506901 into locationtech:master Aug 1, 2017

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
ip-validation
Details

@echeipesh echeipesh referenced this pull request Aug 31, 2017

Closed

Partition GeoTiff reading by GeoTiff segment layout #2173

2 of 2 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment