Estimate partitions number basing on GeoTiff segments #2296
…sired partition size in bytes. Signed-off-by: Grisha Pomadchin <gr.pomadchin@gmail.com>
Signed-off-by: Grisha Pomadchin <gr.pomadchin@gmail.com>
private def _subsetBandsFromSegments(
  bandSequence: Seq[Int],
  segmentIds: Traversable[Int],
  deinterleaveBitSegment: (GeoTiffSegment, Int, Int, Int, Traversable[Int]) => Array[Array[Byte]],
That's a rough signature. Is there any way to sugar it up? It's used a couple of times; maybe a type alias where we can hang a comment?
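A minimal sketch of what such a type alias could look like. Everything here is illustrative: `Segment` is a hypothetical stand-in for `GeoTiffSegment`, the alias name `Deinterleave` is made up, the meaning assigned to the three `Int` arguments is an assumption, and `Iterable` replaces `Traversable` only to keep the snippet portable across Scala versions.

```scala
object DeinterleaveAlias {
  // Hypothetical stand-in for GeoTiffSegment; only the raw bytes matter here.
  final case class Segment(bytes: Array[Byte])

  /** Splits one segment into per-band byte arrays.
    * Assumed argument order: segment, cols, rows, bandCount, bands to keep.
    */
  type Deinterleave = (Segment, Int, Int, Int, Iterable[Int]) => Array[Array[Byte]]

  // Toy implementation for pixel-interleaved data with one byte per sample.
  val byteDeinterleave: Deinterleave =
    (segment, cols, rows, bandCount, bands) =>
      bands.map { band =>
        Array.tabulate(cols * rows)(i => segment.bytes(i * bandCount + band))
      }.toArray

  // With the alias, the consumer's signature stays short and the comment lives in one place.
  def subsetBands(
    segments: Seq[Segment],
    cols: Int,
    rows: Int,
    bandCount: Int,
    bands: Seq[Int],
    deinterleave: Deinterleave
  ): Seq[Array[Array[Byte]]] =
    segments.map(deinterleave(_, cols, rows, bandCount, bands))
}
```

The alias changes nothing at runtime; it only gives the five-argument function type a name and a single place for documentation.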
@@ -219,6 +219,14 @@ abstract class GeoTiffMultibandTile(
(bytes, bandCount, bytesPerSample, _) => GeoTiffSegment.deinterleave(bytes, bandCount, bytesPerSample)
).toVector

def bandsFromSegments(segmentIds: Traversable[Int]): Vector[Tile] =
Should be private, so as not to add to the API?
def readSegments(ids: Traversable[Int], info: GeoTiffReader.GeoTiffInfo, options: Options) = {
Needs a return type.
import scala.collection.mutable

case class S3GeoTiffInfoReader(
Judging by the way this is used, it looks like it could be decomposed into a companion object with some methods on it. That would probably be safer than a case class that potentially carries a client around.
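One way to read this suggestion, as a rough sketch rather than the PR's actual code: keep the helpers on a stateless object and pass the client factory in per call, so no instance ever holds a live client in a field. `ObjectStore`, `listKeys`, and `geoTiffKeys` are hypothetical names invented for the illustration.

```scala
object InfoReaderSketch {
  // Hypothetical stand-in for an S3 client; a real client may not be serializable.
  trait ObjectStore {
    def listKeys(bucket: String, prefix: String): Seq[String]
  }

  // Stateless helpers: the factory is an argument, so nothing here captures a client.
  object GeoTiffInfoReader {
    private val tiffExtensions = Seq(".tif", ".tiff")

    /** Lists the keys under `prefix` that look like GeoTiffs. */
    def geoTiffKeys(bucket: String, prefix: String, getClient: () => ObjectStore): Seq[String] =
      getClient()
        .listKeys(bucket, prefix)
        .filter(key => tiffExtensions.exists(ext => key.toLowerCase.endsWith(ext)))
  }
}
```

With this shape, only the `getClient` function needs to be serializable when the call happens inside a Spark closure.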
…iant to read by segments, code refactor. Signed-off-by: Grisha Pomadchin <gr.pomadchin@gmail.com>
/** Read segments iterator, each segment into a single R */
def readSegmentsIterator(gbs: Array[GridBounds], info: GeoTiffReader.GeoTiffInfo, options: O): Iterator[R]
/** Reads all segments into one R */
def readSegments(gbs: Array[GridBounds], info: GeoTiffReader.GeoTiffInfo, options: O): Option[R]
Two strategies to read segments, which is better for our use cases?
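To make the trade-off concrete, here is a simplified sketch of the two shapes. It is an assumption, not the PR's implementation: `Bounds` stands in for `GridBounds`, the `info` and `options` parameters are dropped, and `readOne` and `merge` are hypothetical hooks. The iterator variant yields one value per segment lazily; the merged variant folds everything into a single value and returns `None` for empty input.

```scala
object SegmentReadStrategies {
  // Hypothetical stand-in for GridBounds (inclusive bounds).
  final case class Bounds(colMin: Int, rowMin: Int, colMax: Int, rowMax: Int) {
    def size: Int = (colMax - colMin + 1) * (rowMax - rowMin + 1)
  }

  trait SegmentReader[R] {
    /** Reads a single segment (hypothetical hook). */
    def readOne(bounds: Bounds): R

    /** Combines two results into one (hypothetical hook). */
    def merge(left: R, right: R): R

    /** Strategy 1: one R per segment, produced lazily. */
    def readSegmentsIterator(gbs: Array[Bounds]): Iterator[R] =
      gbs.iterator.map(readOne)

    /** Strategy 2: all segments merged into a single R; None when there is nothing to read. */
    def readSegments(gbs: Array[Bounds]): Option[R] =
      readSegmentsIterator(gbs).reduceOption(merge)
  }
}
```

The iterator form keeps at most one segment's result alive at a time if the consumer streams it, while the merged form needs room for the combined result.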
 * each segment can only be in a single partition.
 * */
def segmentsByPartitionBytes(partitionBytes: Long = Long.MaxValue, maxTileSize: Option[Int] = None)
  (implicit sc: SparkContext): RDD[((String, GeoTiffInfo), Array[GridBounds])] = {
We can still operate with segment refs
Signed-off-by: Grisha Pomadchin <gr.pomadchin@gmail.com>
…t of GeoTiffInfoReader function private Signed-off-by: Grisha Pomadchin <gr.pomadchin@gmail.com>
 * @param chunkSize How many bytes should be read in at a time.
 * @param delimiter Delimiter to use for S3 object listings. See
 * @param getS3Client A function to instantiate an S3Client. Must be serializable.
 * @param persistLevel A Spark persist storage level, MEMORY_ONLY by default (similar to RDD.cache())
 * @param bySegments Minimize segment reads, read input data by segments.
Semver doesn't give us a lot of room here: we can't change the Options object.
Let's assume that persistLevel is MEMORY_ONLY; if there is not enough memory to read the TIFF headers, it's probably a lost cause anyway. Also, re-reading them if the memory cache is purged is a good soft failover.
Let's keep maxWindowSize=None as the default; that will be the flag for using the segment-wise read. Setting it to some value will be the flag for reading exactly those window chunks, as before.
This means that we will never read a full GeoTiff again unless it's smaller than the maxWindowSize value. It seems that GeoTiffs small enough to be appropriate for an RDD record basically don't get written, so this is a good default.
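The proposal of using the existing option as the switch could be sketched as follows. This is an illustration under assumptions: `Options` here is a cut-down hypothetical with only `maxWindowSize`, and `ReadStrategy`, `BySegments`, and `ByWindows` are names invented for the example.

```scala
object ReadStrategySketch {
  // Cut-down, hypothetical Options: only the field discussed above.
  final case class Options(maxWindowSize: Option[Int] = None)

  sealed trait ReadStrategy
  case object BySegments extends ReadStrategy
  final case class ByWindows(size: Int) extends ReadStrategy

  /** None selects the segment-wise read; Some(n) keeps the previous window-based read. */
  def strategy(options: Options): ReadStrategy =
    options.maxWindowSize match {
      case None       => BySegments
      case Some(size) => ByWindows(size)
    }
}
```

Because the default stays `None`, the shape of `Options` does not change; only the behaviour behind the default does.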
So you think that for by-segment reads we don't need that window option by default (at least in 1.2)? But it could be a good feature to try to pack segments into the windows we want.
It seems like the actual size of the window is not important there because the partition size will control it. If there are too few windows in a partition, you pack more. You never splice segments together, so the only effect of the window size is to provide logic that keeps the segments spatially related. So I would assume that it can just be hard-coded to 1024 for segment reads.
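The idea that the partition size drives the packing can be sketched with a simple greedy pass. This is not the PR's algorithm, only an illustration under assumptions: segments arrive as hypothetical `(id, sizeInBytes)` pairs already ordered so that neighbours are spatially related, a segment is never split, and a segment larger than the budget gets a partition of its own.

```scala
object SegmentPacking {
  /** Greedily groups segment ids so that each group stays within `partitionBytes`
    * where possible. Input order is preserved; segments are never split.
    */
  def packByBytes(segments: Seq[(Int, Long)], partitionBytes: Long): Vector[Vector[Int]] = {
    val start = (Vector.empty[Vector[Int]], Vector.empty[Int], 0L)
    val (done, current, _) = segments.foldLeft(start) {
      case ((closed, open, used), (id, size)) =>
        // Compare against the remaining budget to avoid overflow near Long.MaxValue.
        if (open.nonEmpty && size > partitionBytes - used) (closed :+ open, Vector(id), size)
        else (closed, open :+ id, used + size)
    }
    if (current.isEmpty) done else done :+ current
  }
}
```

With the default budget of `Long.MaxValue` every segment lands in one group, which matches the default `partitionBytes` in the signature quoted earlier.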
else segments

val result = repartition.flatMap { case ((key, md), segmentIndices) =>
  rr.readWindows(segmentIndices, md, options).map { case (k, v) =>
segmentIndices is a stale variable name; I guess this is segmentBounds now?
import scala.collection.mutable

trait GeoTiffInfoReader extends LazyLogging {
Why not just make this a private[geotrellis] trait?
fromSegments function, see d952665
fromSegments function
The aim of the last two points is to reduce memory usage.