
Estimate partitions number basing on GeoTiff segments #2296

Merged (15 commits, Aug 1, 2017)

Conversation

@pomadchin (Member) commented Jul 24, 2017

  • Poor performance fix
  • Tests (new ones, and fixes for old ones)
  • Better repartitioning in border cases (size / window size in striped / tiled cases)
  • Rethink the multiband fromSegments function, see d952665
  • Rethink the singleband fromSegments function

The aim of the last two points is to reduce memory usage.
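The partition-estimation idea behind this PR can be sketched as follows. This is a hedged illustration of the approach (greedy packing of per-segment byte sizes into a byte budget), not the PR's actual implementation; all names below are hypothetical.

```scala
// Hypothetical sketch: pack per-segment byte sizes into partitions no larger
// than `partitionBytes`, so the partition count follows from the GeoTiff's
// actual segment layout rather than from a fixed window grid.
object SegmentPacking {
  def packSegments(segmentSizes: Seq[Long], partitionBytes: Long): Seq[Seq[Long]] = {
    val done = scala.collection.mutable.ListBuffer.empty[Seq[Long]]
    var current = scala.collection.mutable.ListBuffer.empty[Long]
    var currentBytes = 0L
    for (size <- segmentSizes) {
      // start a new partition once the byte budget would be exceeded
      if (current.nonEmpty && currentBytes + size > partitionBytes) {
        done += current.toList
        current = scala.collection.mutable.ListBuffer.empty[Long]
        currentBytes = 0L
      }
      current += size
      currentBytes += size
    }
    if (current.nonEmpty) done += current.toList
    done.toList
  }
}
```

A segment larger than the budget still gets its own partition, which matches the constraint discussed later in this thread that each segment can only live in a single partition.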

@pomadchin changed the title [WIP] Estimate partitions number basing on GeoTiff segments → Estimate partitions number basing on GeoTiff segments Jul 25, 2017
@pomadchin requested a review from echeipesh July 25, 2017 13:58
@pomadchin changed the title Estimate partitions number basing on GeoTiff segments → [WIP] Estimate partitions number basing on GeoTiff segments Jul 25, 2017
…sired partition size in bytes.

Signed-off-by: Grisha Pomadchin <gr.pomadchin@gmail.com>
private def _subsetBandsFromSegments(
bandSequence: Seq[Int],
segmentIds: Traversable[Int],
deinterleaveBitSegment: (GeoTiffSegment, Int, Int, Int, Traversable[Int]) => Array[Array[Byte]],
Contributor
That's a rough signature. Is there any way to sugar it up? It's used a couple of times; maybe a type alias where we can hang a comment?
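One way the suggested sugar could look; the alias name and the interpretation of the `Int` arguments are guesses for illustration, not part of the PR:

```scala
import geotrellis.raster.io.geotiff.GeoTiffSegment

object SegmentFunctions {
  /** Deinterleaves a segment into per-band byte arrays:
    * (segment, bandCount, bytesPerSample, <fourth Int>, bands to keep) => band bytes.
    * The alias name and the meaning of the fourth Int are illustrative only;
    * a doc comment like this one is exactly what the alias gives us a place for. */
  type DeinterleaveBitSegment =
    (GeoTiffSegment, Int, Int, Int, Traversable[Int]) => Array[Array[Byte]]
}
```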

@@ -219,6 +219,14 @@ abstract class GeoTiffMultibandTile(
(bytes, bandCount, bytesPerSample, _) => GeoTiffSegment.deinterleave(bytes, bandCount, bytesPerSample)
).toVector

def bandsFromSegments(segmentIds: Traversable[Int]): Vector[Tile] =
Contributor
Should this be private, so as not to add to the public API?


def readSegments(ids: Traversable[Int], info: GeoTiffReader.GeoTiffInfo, options: Options) = {
Contributor
Needs an explicit return type.


import scala.collection.mutable

case class S3GeoTiffInfoReader(
Contributor
By the way this is used, it looks like it could be decomposed into a companion object with some methods on it; that would probably be safer than a case class that potentially carries a client around.
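A hedged sketch of the decomposition being suggested; every name below is illustrative, and the real S3 client type and listing API are stubbed out:

```scala
// Illustrative only: methods on a companion-style object instantiate the
// (non-serializable) client per call instead of storing it in a case-class
// field, so nothing client-shaped is captured in Spark closures.
object S3GeoTiffInfoReaderSketch {
  trait S3ClientStub { def listKeys(bucket: String, prefix: String): Seq[String] }

  def geoTiffKeys(
    bucket: String,
    prefix: String,
    getS3Client: () => S3ClientStub // a serializable factory, not a client instance
  ): Seq[String] = {
    val client = getS3Client() // created where it is used, never shipped
    client.listKeys(bucket, prefix).filter(_.endsWith(".tif"))
  }
}
```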

/** Read segments iterator, each segment into a single R */
def readSegmentsIterator(gbs: Array[GridBounds], info: GeoTiffReader.GeoTiffInfo, options: O): Iterator[R]
/** Reads all segments into one R */
def readSegments(gbs: Array[GridBounds], info: GeoTiffReader.GeoTiffInfo, options: O): Option[R]
Member Author
Two strategies to read segments; which is better for our use cases?

@pomadchin changed the title [WIP] Estimate partitions number basing on GeoTiff segments → Estimate partitions number basing on GeoTiff segments Jul 26, 2017
  * each segment can only be in a single partition.
  */
def segmentsByPartitionBytes(partitionBytes: Long = Long.MaxValue, maxTileSize: Option[Int] = None)
(implicit sc: SparkContext): RDD[((String, GeoTiffInfo), Array[GridBounds])] = {
Member Author
We can still operate with segment refs

…t of GeoTiffInfoReader function private

* @param chunkSize How many bytes should be read in at a time.
* @param delimiter Delimiter to use for S3 object listings. See
* @param getS3Client A function to instantiate an S3Client. Must be serializable.
* @param persistLevel A Spark persist storage level, MEMORY_ONLY by default (similar to RDD.cache())
* @param bySegments Minimize segment reads; read input data by segments.
Contributor
Semver doesn't give a lot of room here; we can't change the Options object.

Let's assume that persistLevel is MEMORY_ONLY: if there is not enough memory to read the tiff headers, it's probably a lost cause anyway. Also, re-reading them twice if the memory cache is purged is a good soft failover.

Let's keep maxWindowSize=None as the default; that will be the flag for using the segment-wise read. Setting it to some value will be the flag for reading exactly those window chunks, as before.

This means that we will never read a full GeoTiff again unless it's smaller than the maxWindowSize value. It seems that GeoTiffs small enough to be appropriate for an RDD record basically don't get written, so this is a good default.

Member Author
So you think that for by-segment reads we don't need that window option by default (at least in 1.2)? But it could be a good feature for trying to pack segments into the windows we want.

Contributor
It seems like the actual size of the window is not important there, because the partition size will control it. If there are too few windows in a partition, you pack more. You never splice segments together, so the effect of the window size is just to keep the segments spatially related. So I would assume it can just be hard-coded to 1024 for segment reads.
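The default being discussed might be sketched like this; names and signatures are illustrative, and the merged code may differ:

```scala
import geotrellis.raster.GridBounds

object WindowSelection {
  /** maxWindowSize = None selects segment-wise reads; Some(w) splits the raster
    * into w x w pixel windows as before. Purely a sketch of the discussed flag. */
  def windowsFor(
    maxWindowSize: Option[Int],
    cols: Int,
    rows: Int,
    segmentBounds: => Seq[GridBounds]
  ): Seq[GridBounds] =
    maxWindowSize match {
      case Some(w) =>
        for {
          col <- 0 until cols by w
          row <- 0 until rows by w
        } yield GridBounds(col, row, math.min(col + w, cols) - 1, math.min(row + w, rows) - 1)
      case None =>
        segmentBounds // read exactly the tiff's native segments
    }
}
```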

else segments

val result = repartition.flatMap { case ((key, md), segmentIndices) =>
rr.readWindows(segmentIndices, md, options).map { case (k, v) =>
@echeipesh (Contributor) commented Jul 28, 2017
segmentIndices is a stale variable name; I guess this is segmentBounds now?


import scala.collection.mutable

trait GeoTiffInfoReader extends LazyLogging {
Contributor
Why not just make this a private[geotrellis] trait?

@echeipesh added this to the 1.2 milestone Aug 1, 2017